  1. Oct 2025
    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      The current manuscript by Hajra et al deals with the role of the prominent Sirtuins SIRT1 and -3 during infection of macrophages with Salmonella Typhimurium (ST). Apparently, ST infection induces upregulation of host cell SRTs to aid its own metabolism during the intracellular lifestyle and to help reprogramming macrophage polarization. The manuscript has two parts, namely one part that deals with Salmonella infection in cells, where RAW 264.7 murine macrophage-like cells, sharing some features with primary macrophages, were employed. Infected RAW cells displayed a tendency to polarize towards wound-healing M2 and not inflammatory M1 macrophages, which was dependent on SRT. Consequently, the inflammatory response in RAW was more robust in the absence of SRT. Moreover, loss of SRTs leads to impaired bacterial proliferation in these cells, which was attributed to defects in metabolic adaption of the bacteria in the absence of SRT-activity and to the increased M1 inflammatory response.

      Unfortunately, the line of argumentation remains incomplete because corresponding assays in mice showed the opposite result as compared to the experiments using RAW 264.7 cells. i.e. loss of SRTs leads to increased bacterial load in animals (versus impaired proliferation in RAW 264.7 cells). The authors cannot explain this discrepancy.

      Strengths:

      Extensive analysis of Salmonella infection in RAW macrophage-like cells and mice in the context of SRT1/3 function.

      Weaknesses:

      Lack of connection between the cell-based and organismic data, which are not supportive of each other.

      We are highly grateful for your valuable and insightful comments. Thank you for appreciating the merit of our manuscript. We acknowledge the opposing phenotypes between the macrophage models, namely the RAW264.7 cell line (Fig. 2A) and primary peritoneal macrophages infected ex vivo (Fig. 2B), and the in vivo mouse model (Fig. 8). Both RAW264.7 macrophage and peritoneal macrophage infection show attenuated intracellular bacterial proliferation owing to the heightened proinflammatory burst. This is in sharp contrast to our in vivo mouse model of infection, which shows increased organ burden and bacterial dissemination. We attribute the higher bacterial load in the organs, including the spleen (Fig. 8B), to an increased pro-inflammatory cytokine burst and ROS production (Fig. 8F-H, Fig. S9) that trigger bacterial dissemination. The pro-inflammatory arsenals such as IL-6, IL-1β and ROS that limit bacterial proliferation within macrophages (F4/80+ macrophages within the spleen, RAW264.7 macrophages or primary peritoneal macrophages) facilitate bacterial dissemination in blood and to the other organs (Fig. 8I-L, Fig. S3F-G). This is in line with the following previous findings:

      Klebsiella pneumoniae infection triggers an inflammatory response via secretion of IL-6 upon HIF-1α activation that induces bacterial dissemination (Holden VI, Breen P, Houle S, Dozois CM, Bachman MA. Klebsiella pneumoniae Siderophores Induce Inflammation, Bacterial Dissemination, and HIF-1α Stabilization during Pneumonia. mBio. 2016 Sep 13;7(5):e01397-16. doi: 10.1128/mBio.01397-16. PMID: 27624128; PMCID: PMC5021805.).

      Correlation analysis of immune responses to Salmonella infection revealed that increased innate immune “cassette” opposes the adaptive immune arm leading to increased bacterial load in mice (Hotson AN, Gopinath S, Nicolau M, Khasanova A, Finck R, Monack D, et al. Coordinate actions of innate immune responses oppose those of the adaptive immune system during Salmonella infection of mice. Science signaling. 2016;9(410):ra4). 

      In our revised manuscript, we have assessed additional splenic populations including CD45+, Ly6C+, and CD11c+ populations. Our results show that the CD45+ splenic population, like the total splenic population, carries increased bacterial loads within the SIRT1/3 inhibited cohorts. However, CD45+ monocytes and the Ly6C+ splenic population exhibit compromised burden within the SIRT1/3 inhibited cohorts. Moreover, within the CD11c+ population, CD45+ granulocytes or lymphocytes show comparable organ loads to that of the vehicle control or SIRT1 activator-treated mice group (Fig. 8M-S, Fig. S8). Overall, our data suggest heterogeneous bacterial burden in diverse splenic populations.

      Reviewer #2 (Public Review):

      Dipasree Hajra et al demonstrated that Salmonella was able to modulate the expression of Sirtuins (Sirt1 and Sirt3) and regulate the metabolic switch in both host and Salmonella, promoting its pathogenesis. The authors found Salmonella infection induced high levels of Sirt1 and Sirt3 in macrophages, which were skewed toward the M2 phenotype allowing Salmonella to hyper-proliferate. Mechanistically, Sirt1 and Sirt3 regulated the acetylation of HIF-1alpha and PDHA1, therefore mediating Salmonella-induced host metabolic shift in the infected macrophages. Interestingly, Sirt1 and Sirt3-driven host metabolic switch also had an effect on the metabolic profile of Salmonella. Counterintuitively, inhibition of Sirt1/3 led to increased pathogen burdens in an in vivo mouse model. Overall, this is a well-designed study. There are a few comments below that would further strengthen the current study.

      Major comments:

      In the in vivo study (lines 436-446) - the authors noticed increased pathogen burden in the EX-527 or the 3TYP-treated mice cohorts but decreased pathogen burden within the F4/80+ macrophage population. What are the other cell types that have increased pathogen burden in splenocytes from EX-527 or the 3TYP treated? Can this be further explored and explained?

      While the authors indicated that IL-6 cytokine storm and elevated ROS production could result in bacterial dissemination in vivo, one could also argue that Sirt1/3 inhibitors might have an impact on gut function and/or gut microbiota (PMID: 22115311). Did Sirt1/3 inhibitors also lead to increased pathogen burdens in the gut? If so, the potential effect of these in vivo treatments on gut microbiota/colonization resistance should be discussed.

      Minor comment:

      Sirt1 has been shown to be degraded during Salmonella infection (PMID: 28192515), which is different from the current study. An explanation should be provided for this.

      We thank you for your encouraging and gracious comments. We deeply appreciate your time and efforts in providing constructive feedback for the betterment of our work. As per your valuable suggestions, we have assessed additional splenic populations including CD45+, Ly6C+, and CD11c+ populations apart from F4/80+ macrophage populations. Our analysis suggests that the CD45+ splenic population shows increased bacterial loads similar to the total splenic population within the SIRT1/3 inhibited cohorts. However, CD45+ monocytes and the Ly6C+ splenic population exhibit compromised burden within the SIRT1/3 inhibited cohorts. Moreover, CD11c+ population, CD45+ granulocytes or lymphocytes show comparable organ loads to that of the vehicle control or SIRT1 activator-treated mice group (Fig. 8M-S). Overall, our data suggest heterogeneous bacterial burden in diverse splenic populations.

      We immensely appreciate the reviewer for this insightful question about the effect of SIRT1/3 on the gut per se. To answer your question, we observed increased pathogen loads within the mesenteric lymph nodes of the gut in the SIRT1/3 inhibitor-treated mice groups (Fig. 8B). In our revised manuscript, we evaluated gut inflammation via IL-1β estimation in the mice's ileal tissues and observed heightened IL-1β production in the inhibitor-treated mice cohorts in comparison to the vehicle control (Fig. S3G). We have also examined gut epithelial pathology via Haematoxylin-Eosin (H&E) staining of the ileal sections to address the effect of in vivo treatment on gut microbiota and colonization resistance; this is appended here. However, the gut microbiota crosstalk and its effect on colonization resistance are part of another ongoing study and are being examined in detail there. Therefore, this appended H&E image has not been incorporated in the revised manuscript.

      Author response image 1.

      In line with the reference PMID: 28192515, where Sirt1 has been shown to be degraded during Salmonella infection at later time points of infection, our study also shows that both SIRT1 mRNA (Fig. 1A) and protein levels (Fig. S1A) are elevated at 2h and 6h post-infection and are downregulated at 16h in comparison to the 6h time point. However, SIRT3 expression levels remain elevated even at later time points of infection. Therefore, we speculate that there is a shared role between SIRT1 and SIRT3 that facilitates the phenotypes reported in our study.

      Reviewer #3 (Public Review):

      Summary:

      In this paper, Hajra et al have attempted to identify the role of Sirt1 and Sirt3 in regulating metabolic reprogramming and macrophage host defense. They have performed gene knockdown experiments in RAW macrophage cell lines to show that depletion of Sirt1 or Sirt3 enhances the ability of macrophages to eliminate Salmonella Typhimurium. However, in mice, inhibition of Sirt1 resulted in dissemination of the bacteria but the bacterial burden was still reduced in macrophages. They suggest that the effect they have observed is due to increased inflammation and ROS production by macrophages. They also try to establish a weak link with metabolism. They present data to show that the switch in metabolism from glycolysis to fatty acid oxidation is regulated by acetylation of Hif1a, and PDHA1.

      Strengths:

      The strength of the manuscript is that the role of Sirtuins in host-pathogen interactions has not been previously explored in-depth making the study interesting. It is also interesting to see that depletion of either Sirt1 or Sirt3 results in a similar outcome.

      Weaknesses:

      The major weakness of the paper is the low quality of data, making it harder to substantiate the claims. Also, there are too many pathways and mechanisms being investigated. It would have been better if the authors had focussed on either Sirt1 or Sirt3 and elucidated how it reprograms metabolism to eventually modulate host response against Salmonella Typhimurium. Experimental evidence is also lacking to prove the proposed mechanisms. For instance, they show correlative data that the knockdown of Sirt1-mediated shift in metabolism is due to HIF1a acetylation but this needs to be proven with further experiments.

      We appreciate the reviewer’s critical analysis of our work. In the revised manuscript, we aimed to eliminate the low-quality data sets and have tried to substantiate them with better and conclusive ones, as directed in the recommendations for the author section. We agree with the reviewer that the inclusion of both Sirtuins 1 and 3 has resulted in too many pathways and mechanisms and focusing on one SIRT and its mechanism of metabolic reprogramming and immune modulation would have been a less complicated alternative approach. However, as rightly pointed out, our work demonstrated the shared and few overlapping roles of the two sirtuins, SIRT1 and SIRT3, together mediating the immune-metabolic switch upon Salmonella infection. As per the reviewer’s suggestion, we have performed additional experiments with HIF-1α inhibitor treatment in our revised manuscript to substantiate our correlative findings on SIRT1-mediated regulation of host glycolysis (Fig.7G).

      Reviewer #1 (Recommendations For The Authors):

      The authors state "SIRT1 and SIRT3 inhibition resulted in increased pathogen loads in organs and triggered enhanced bacterial dissemination, together leading to increased susceptibility of the mice to S. Typhimurium infection owing to increased ROS and IL-6 production." How can this be reconciled? To the reviewer, this is not a convincing explanation. The reviewer is not a mouse pathologist, so maybe did not understand the argument in full.

      However, in order to clarify whether these phenomena can be brought into context and explained by for instance cell-autonomous (in (RAW) macrophages) versus non-autonomous (in mice) mechanisms, it would be required to bring in context the organismic phenotype with a cellular phenotype, using more physiologic primary macrophages.

      (1) The authors show in Figure 8 that in general SRT inhibition leads to increased infection whereas SRT activation results in decreased infection. This is even true for the spleen (e.g. Figure 8B), which should be full of macrophages upon infection.

      (2) Only Figure 8L implies that endogenous primary, splenic macrophages show a higher infection rate upon pharmacologic SRT activation, which would potentially mirror the RAW results. This is however not supportive of their own explanation: Who would now produce more ROS and IL6 if these macrophages are more supportive of intracellular ST? Is there a difference in the roles of SRTs between different types of macrophages and/or neutrophils? And between macrophages and somatic cells concerning ST infection? The reviewer tends to believe that RAW cells display a defective killing response (such as ROS production) as they are highly transformed cells. Therefore, the authors should use cultured peritoneal macrophages or BMDMs in addition to RAW264.7 cells.

      The literature cited by the authors also implies that the inflammatory response in mice is higher in the absence of SRTs. This is in line with a role for SRTs in (negatively) regulating M1 inflammatory polarization but probably not with increased bacterial burden in mice. If it was, then increased dissemination could be explained by increased tissue damage. However, the flow cytometry experiments from infected organs then do not confirm that, as the infection of individual cells is higher upon SRT inhibition. Thus there seems a broad gap between the role of SRTs in ST infection in RAW264.7 cells versus non-transformed cells.

      I would not discard the RAW results, as I am convinced that they contain valuable data. However, it needs to be clarified what aspect of the host response RAW 264.7 cells represent. Primary macrophages might likely be more aggressive towards the bacteria. Finally, the question arises: what is the role of the metabolic switch in the in vivo setting?

      The reviewer recommends repeating some key experiments by in-vitro-infecting BMDMs or isolated peritoneal macrophages (after some days of culturing) to bridge between the present RAW-derived data and the mouse data. How is the bacterial load with and without SRT inhibitor/activator in primary macrophages, when infected outside of the body? Can ex-vivo infection also affect polarization of e.g. peritoneal macrophages or the metabolic switch? If it is possible to find a conclusive explanation for their data, then this story might really add to our understanding of another aspect of how ST manipulates the host to survive.

      In case the reviewer understands the mouse experiments correctly, all assays on peritoneal cells were performed after in-vivo-infection and/or treatment.

      Together, RAW 264.7 murine macrophage-like cells might not be the right model to understand the phenotypes in full. As far as the reviewer knows, these cells are not capable of killing bacteria as effectively as activated primary macrophages or neutrophils.

      A few of the key findings from RAW264.7 macrophages have been replicated in primary peritoneal macrophages (Fig. 2B, S3E-F, S6B, S7B-D). We want to clarify that the peritoneal macrophage experiments were performed ex vivo: peritoneal macrophages were isolated from mice and then subjected to SIRT1/3 inhibitor treatment and Salmonella infection, rather than being collected after in vivo treatment or infection. In the ex vivo setting, we examined the effect of SIRTs on the metabolic switch during Salmonella infection (Fig. S7B-D), which resembled our RAW264.7 macrophage data. Additionally, in the in vivo setting, we analyzed the transcript-level expression of host metabolic genes and corresponding bacterial metabolic genes in infected mouse liver and spleen tissue under SIRT1/3 inhibitor treatment (Fig. S7E-F, Fig. 6C-D). Our primary peritoneal macrophage data mirror the RAW264.7 macrophage findings, showing attenuated intracellular bacterial proliferation owing to the heightened proinflammatory burst upon SIRT1/3 knockdown or inhibition (Fig. 2A-B). This is opposite to our in vivo mouse model of infection, which shows increased organ burden and bacterial dissemination (Fig. 8A-H). The pro-inflammatory arsenals that limit bacterial proliferation within macrophages (F4/80+ macrophages within the spleen, RAW264.7 macrophages or primary peritoneal macrophages) facilitate bacterial dissemination in blood and to the other organs owing to tissue damage (Fig. 8E-L). This is in line with the following previous findings:

      Klebsiella pneumoniae infection triggers an inflammatory response via secretion of IL-6 upon HIF-1α activation that induces bacterial dissemination (Holden VI, Breen P, Houle S, Dozois CM, Bachman MA. Klebsiella pneumoniae Siderophores Induce Inflammation, Bacterial Dissemination, and HIF-1α Stabilization during Pneumonia. mBio. 2016 Sep 13;7(5):e01397-16. doi: 10.1128/mBio.01397-16. PMID: 27624128; PMCID: PMC5021805.).

      Correlation analysis of immune responses to Salmonella infection revealed that increased innate immune “cassette” opposes the adaptive immune arm leading to increased bacterial load in mice (Hotson AN, Gopinath S, Nicolau M, Khasanova A, Finck R, Monack D, et al. Coordinate actions of innate immune responses oppose those of the adaptive immune system during Salmonella infection of mice. Science Signaling. 2016;9(410):ra4). 

      As per the reviewer’s suggestions, we have analyzed other populations apart from F4/80+ macrophages and have observed that the CD45+ splenic population, like the total splenic population, carries increased bacterial loads within the SIRT1/3 inhibited cohorts. However, CD45+ monocytes and the Ly6C+ splenic population exhibit compromised burden within the SIRT1/3 inhibited cohorts. Moreover, the CD11c+ population, CD45+ granulocytes, or lymphocytes show comparable organ loads to that of the vehicle control or SIRT1 activator-treated mice group (Fig. 8M-S, Fig. S8). Overall, our data suggest heterogeneous bacterial burden in diverse splenic populations.

      Reviewer #3 (Recommendations For The Authors):

      Abstract

      The authors state that perturbing Sirt1 and Sirt3 results in a shift in Salmonella's metabolism. On the contrary, the data reflects the metabolism in the host cell and not the bacteria. This statement is wrong. They only show increased expression of some of the glycolytic genes in Salmonella, which is not sufficient to make the claim that the switch to fatty acid oxidation in macrophages is due to utilisation of glucose by the bacteria.

      We value the reviewer’s response and have accordingly reframed our sentence in the abstract (Line 24-25).

      Fig 1: Expression of Sirt1 - The data needs to be supported with a western blot for Sirt1 and Sirt3 but the Western blots shown in the supplementary figure are of very poor quality and do not support the authors' claim.

      We have repeated the western blot and have supplemented the previous blot with an alternate blot in Fig. S1A as per your precious input.

      Why haven't the authors shown any representative blots for Sirt1 and Sirt3 upon infection with Salmonella mutants? They need to italicize the genes when they describe mRNA expression.

      Previously we had only performed transcript-level expression of Sirt1 and Sirt3 upon infection with Salmonella mutants and therefore representative blot image was absent. The gene names have been duly italicized while describing mRNA expression (Line 126-154). We regret the inconvenience caused. We have performed the western blotting to assess the protein expression profile upon infection with Salmonella mutants as per the reviewer’s suggestion and the representative blot image has been duly appended in the revised manuscript (Fig. S1B).

      What is the rationale for examining Sirt1 and Sirt3 mRNA in M1 and M2 macrophages? Salmonella infection on its own will polarise the macrophages towards M1. How long were these macrophages infected? The time points are missing.

      The rationale behind examining Sirt1 and Sirt3 mRNA in M1- and M2-polarized macrophages was to ascertain whether M1-polarized macrophages indeed exhibit decreased expression of Sirt1 or Sirt3 and whether polarization of macrophages toward the M2 state leads to upregulation of Sirt1 and Sirt3 upon Salmonella infection. After confirming these findings through this preliminary experiment, we then asked whether Salmonella infection on its own polarises the macrophages toward an immunosuppressive M2 state at a later time course of infection, as infection drives the induction of SIRT expression, and whether this is mediated by Sirt1 and Sirt3 (Fig. 3). We apologize for not mentioning the 16h time point in the figure; the missing time point has been duly documented in the revised manuscript (Line 155).

      Fig S2 knockdown of Sirt1 and Sirt3 are not convincing.

      We are extremely sorry for the inconclusive knockdown blot. An alternative blot has been provided in the revised manuscript (Fig. S2C-D).

      Fig 2A and 2B the time point post infection has not been mentioned. Although it is stated that 2h and 16h post-infection samples were analysed. Only one time point has been shown.

      We are sorry for the confusion. We wanted to clarify that Fig.2A and Fig. 2B show the fold proliferation where fold proliferation was calculated as CFU at 16hr divided by CFU at 2hr as mentioned in the materials and methods section under the heading of Intracellular proliferation or gentamicin protection assay.

      Fold Proliferation= [CFU at 16h]/[CFU at 2h]

      The cytokines data are intriguing in that the increase in IL-6 relative to control is seen only at 2h and 20h but not at 6h. Il-6 at 20h in untransfected cells is comparable to uninfected cells. Did the authors investigate cell death? Salmonella induces various forms of cell death which could account for the decreased cytokine production at later time points.

      We have investigated cell death upon Salmonella infection via MTT assay. At later time points of infection, we indeed observed a decrease of around 16 percent in cell survival compared to the initial time point of 2h. The results have been appended here and they support the reviewer’s reasoning for the decreased cytokine production at later time points.

      Author response image 2.

      Additional cytokines such as IL-1b would be helpful. Also, not sure how uninfected macrophages produce nearly 200pg of IL-10.

      As per the reviewer’s critical suggestion, we have assessed IL-1b cytokine production at 16h post-infection in RAW264.7 macrophages and peritoneal macrophages, and in mice serum samples on the 5th day post-infection (Fig. S3C, S3E-F). Our results indicate increased production of IL-1b in the infected SIRT1/3 knockdown RAW264.7 macrophages, SIRT1/3 inhibitor-treated peritoneal macrophages and in mice serum samples under SIRT1/3 inhibitor treatment in comparison to the vehicle control. Additionally, we have quantified IL-1b in mice ileal tissues under SIRT1/3 inhibitor treatment (Fig. S3G) and have obtained heightened intestinal IL-1b production in the inhibitor-treated cohorts. We thank the reviewer for raising the concern about 200pg of IL-10 in the uninfected macrophages. We have repeated the experiment and have provided an alternative representative graph for the experiment wherein the IL-10 levels in the uninfected cohorts range between 20-40pg/ml (Fig. S3B).

      It is surprising that the authors have found increased Sirt1 binding to NFkB, however there is no change in acetylated NFkB upon infection (Fig 4B). Acetylated p65 is equally high in uninfected Scrambled siRNA, UI shSirt1, STM Scr, and STM shSirt1. Furthermore, increased binding of Sirt1 with NFkb would mean decreased acetylation hence decreased inflammation. However, Salmonella induces profound inflammation.

      We thank the reviewer for the insightful and critical questioning. We acknowledge that, due to oversaturation, there was no apparent change in the acetylated p65 among the different sample sets. Therefore, in the revised manuscript we have provided an image at lower exposure where the changes in the acetylation of the p65 subunit are apparent. Salmonella, like other pathogens, induces acute inflammatory responses upon challenge. This heightened acute inflammation at the initial phases of infection subsides at a later phase of infection. Here, we have assessed the Sirt1 interaction with NFκB at 16hr post-infection, where increased binding of Sirt1 to NFκB facilitates the resolution of the Salmonella-induced acute inflammation. This is in line with previous reports that suggest SIRT1 suppresses acute inflammation through the promotion of p65 deacetylation and inhibition of NFκB activity (Yang H, Zhang W, Pan H, et al. SIRT1 activators suppress inflammatory responses through promotion of p65 deacetylation and inhibition of NF-κB activity. PLoS One. 2012;7(9):e46364. doi:10.1371/journal.pone.0046364; Liu TF, Yoza BK, El Gazzar M, Vachharajani VT, McCall CE. NAD+-dependent SIRT1 deacetylase participates in epigenetic reprogramming during endotoxin tolerance. J Biol Chem. 2011;286(11):9856–64; Liu TF, Vachharajani V, Millet P, Bharadwaj MS, Molina AJ, McCall CE. Sequential actions of SIRT1-RELB-SIRT3 coordinate nuclear-mitochondrial communication during immunometabolic adaptation to acute inflammation and sepsis. J Biol Chem. 2015;290(1):396–408).

      Please explain how the acetylated p65 was analysed.

      Total endogenous p65 subunit was immunoprecipitated using Anti-NFκB p65 antibody and the immunoprecipitated fraction was probed with Anti-Acetylated Lysine antibody to assess acetylated p65.

      An increase in ROS production is seen in a relatively small percentage of cells- not more than 4% of cells. How does this contribute to such a significant difference in intracellular bacterial burden? Also, it is not clear how the authors calculated the fold change in proliferation. It is better to show the actual bacterial burden logarithmically.

      We strongly agree with the reviewer’s concerns, and we have reanalyzed the flow cytometric data set. The revised data have been presented in Fig. S5 which shows a considerable increase in DCFDA positive population. For instance, the infected scrambled control shows around 2.44% of ROS-producing cells, however knockdown of SIRT1 and SIRT3 increases the ROS-producing cells to 27.34% and 28.64% respectively.
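      To put the reanalyzed percentages on a common scale, the short sketch below expresses each knockdown condition as a fold increase over the infected scrambled control. It uses only the three DCFDA-positive percentages quoted above; the function name is hypothetical, and the fold values are simple arithmetic on those numbers rather than a statistic reported in the manuscript.

```python
def fold_increase(treated_percent: float, control_percent: float) -> float:
    """Ratio of a treated percentage to the control percentage."""
    if control_percent <= 0:
        raise ValueError("control percentage must be positive")
    return treated_percent / control_percent

# DCFDA-positive percentages quoted in the response above.
scrambled_infected = 2.44
sh_sirt1_infected = 27.34
sh_sirt3_infected = 28.64

sirt1_fold = fold_increase(sh_sirt1_infected, scrambled_infected)
sirt3_fold = fold_increase(sh_sirt3_infected, scrambled_infected)

print(f"shSIRT1: {sirt1_fold:.1f}-fold, shSIRT3: {sirt3_fold:.1f}-fold")
```

      On these numbers, both knockdowns correspond to roughly an 11- to 12-fold larger ROS-producing population than the scrambled control.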

      Fold proliferation was calculated as CFU at 16hr divided by CFU at 2hr as mentioned in the materials and methods section under the heading of Intracellular proliferation or gentamicin protection assay. Fold proliferation has been calculated as opposed to absolute CFU values to nullify the differential phagocytosis of bacteria to the macrophages among the samples.

      Fold Proliferation= [CFU at 16h]/[CFU at 2h]
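      As an illustration of why the ratio is used, the sketch below computes fold proliferation for two samples that differ in initial uptake. The CFU counts are hypothetical and chosen only for illustration (they are not data from the study), and the function name is an assumption.

```python
def fold_proliferation(cfu_2h: float, cfu_16h: float) -> float:
    """Fold proliferation = CFU at 16 h divided by CFU at 2 h."""
    if cfu_2h <= 0:
        raise ValueError("CFU at 2 h must be positive")
    return cfu_16h / cfu_2h

# Hypothetical CFU counts, for illustration only.
# The second sample takes up twice as many bacteria at 2 h, so its
# absolute CFU at 16 h looks similar even though replication is lower.
scrambled = fold_proliferation(cfu_2h=1.0e4, cfu_16h=8.0e4)
knockdown = fold_proliferation(cfu_2h=2.0e4, cfu_16h=6.0e4)

print(f"scrambled: {scrambled:.1f}-fold, knockdown: {knockdown:.1f}-fold")
```

      Dividing by the 2h count removes differences in initial phagocytic uptake, which is the stated reason for reporting fold proliferation rather than absolute CFU.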

      An increase in metabolic genes is not sufficient to show that the macrophages are metabolically reprogrammed.

      We thank the reviewer for the valuable comment. We agree that an increase in metabolic gene profile is not sufficient to claim metabolic reprogramming. Therefore, in addition to the metabolic gene profile, we have estimated lactate production (end-product of glycolysis) as an indicator of glycolysis (Fig. 5 C-E) and have performed the fatty acid β oxidation activity (Fig. 5G-H) to support our claims.

      Figure 5F the band intensities do not visually match the bands shown for PFK. For instance, shSIRT1 STM (1.00) and shSIRT3 STM (0.81).

      We are extremely sorry for the erroneous band intensity for shSIRT3. Upon reanalysis of the band intensities, we have corrected the band intensity for shSIRT3 to 2.28 (Fig.5F).

      It is surprising that HADHA is not expressed in uninfected samples.

      We are extremely apologetic for the inappropriate representative blot. We feel that the discrepancy might have arisen due to the usage of old antibodies. We have provided an alternate blot for the HADHA gene where fresh antibody staining solution was used for probing which shows expression even in the uninfected samples (Fig.5F).

      Figure 6A - What is the significance of PFA fixed samples (PI) compared to SI samples? This has not been discussed.

      PFA-fixed samples are paraformaldehyde-treated bacterial samples that harbor the immune signals, or Pathogen-Associated Molecular Patterns (PAMPs). The rationale for using PI in addition to SI samples was to show whether the phenomenon is driven by live, metabolically active pathogens or is mediated by PAMPs.

      I understand that the hypothesis is that during the later phase of infection, there is an increase in fatty acid oxidation which correlates with a decrease in inflammation. However, at 6h there is no increase in genes regulating fatty acid oxidation. Why did the authors choose 6h when the previous experiments have been done at 16h?

      We indeed agree with the reviewer’s understanding of our hypothesis that there is an increase in fatty acid oxidation along the progression of infection which correlates with a decrease in inflammation. The Salmonella intracellular replication has been reported to commence at 6h post-internalization when SPI-2 effector expression is fully established (Helaine S, Thompson JA, Watson KG, Liu M, Boyle C, Holden DW. Dynamics of intracellular bacterial replication at the single cell level. Proc Natl Acad Sci U S A. 2010;107(8):3746-3751. doi:10.1073/pnas.1000041107). Therefore, we have assessed the 6h timepoint post-infection in addition to the initial and later timepoints of 2h and 16h respectively. Additionally, the nanostring gene profiling data of both host and bacterial genes indicate the onset of both metabolic (Fig. 5A, 6A) and immune genes (Fig. 3A) modulation at 6h post-infection. We have validated these results via qPCR studies and have observed an upregulation in the transcript level of fatty acid oxidation genes as depicted in Fig. S7A in RAW264.7 macrophages.

      Line 355 it is mentioned that Sirt1 and Sirt3 abrogate metabolic shift by reducing glycolytic flux. This is incorrect as experiments such as carbon chase assays have not been performed to investigate glycolytic flux.

      As per the reviewer’s valuable suggestion, we have removed the word ‘flux’ from the above-mentioned statement (Line 351, Line 353).

      Lines 392-393: "We immunoprecipitated PDHA1 and checked for its interaction with SIRT3 or SIRT1 under knockdown condition of SIRT3 or upon SIRT3 inhibitor treatment (Fig.7 G-H)"

      What is the rationale for checking PDHA1 interaction with Sirt under Sirt knockdown conditions?

      We are thankful to the reviewer for the critical comments. The rationale for checking PDHA1 interaction with Sirt was to ascertain that indeed Sirt interacted with PDHA1 under S. Typhimurium infection and abrogation of either protein expression (knockdown) or their enzymatic activity (inhibitor treatment) diminished the interaction.

      Moreover, the blots are very confusing and do not represent the authors' claims.

      (1) In the input blot I do not see Sirt3 depletion in shSirt3 knockdown sample.

      The knockdown has been quantified in the input blot as per your suggestion. A knockdown of 40% has been obtained in the uninfected dataset whereas a knockdown of 47.1% has been obtained in the infected data set at 16h post-infection (Fig.7H).
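      The knockdown percentages quoted above follow from simple densitometry arithmetic. A minimal sketch is given below, assuming that each SIRT3 band is first normalized to its loading control and that knockdown is then expressed relative to the scrambled-control lane. The function name and the intensity values are hypothetical and are not taken from the blots in Fig. 7H.

```python
def percent_knockdown(target_ctrl, loading_ctrl, target_kd, loading_kd):
    """Percent reduction of a loading-normalized band intensity.

    All four arguments are raw densitometry readings (arbitrary units).
    Assumption: knockdown = 1 - (normalized KD / normalized control).
    """
    if min(target_ctrl, loading_ctrl, loading_kd) <= 0 or target_kd < 0:
        raise ValueError("band intensities must be positive")
    norm_ctrl = target_ctrl / loading_ctrl  # control lane, normalized
    norm_kd = target_kd / loading_kd        # knockdown lane, normalized
    return (1.0 - norm_kd / norm_ctrl) * 100.0


# Hypothetical readings: equal loading, target band reduced from 1000 to 600.
example = percent_knockdown(1000.0, 500.0, 600.0, 500.0)
print(f"knockdown: {example:.1f}%")
```

      Normalizing to the loading control before taking the ratio keeps unequal loading between lanes from being misread as knockdown.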

      (2) Why does Sirt1 interact with PDHA1 similar to Sirt3. Do both the proteins bind to PDHA1 at the same time/ competitively? If so do they both deacetylate?

      In literature, Sirt3 has been shown to interact with PDHA1 and deacetylate PDHA1. However, the interaction of Sirt1 with PDHA1 has not been reported previously and therefore we are unable to comment on the exact dynamics of the interaction. Future studies need to be performed to explore these phenomena in depth. However, SIRT1 agonist SRT1720 has been shown to impact PDH phosphorylation and its activity (Han Y, Sun W, Ren D, Zhang J, He Z, Fedorova J, Sun X, Han F, Li J. SIRT1 agonism modulates cardiac NLRP3 inflammasome through pyruvate dehydrogenase during ischemia and reperfusion. Redox Biol. 2020 Jul;34:101538).

      (3) Figure 7I in the IP: IgG samples Sirt3 seem to bind to IgG non-specifically, which questions the specificity of Sirt3 binding to PDHA1.

      We thank the reviewer for pointing out this concern. The immunoprecipitation experiment has been repeated and the new blot is included in the revised manuscript; we observe no non-specific binding of the Sirt3 antibody to IgG.

      (4) In Figure 7I all the bands Ac PDHA1, PDHA1, and Sirt3 look similar with double bands, which has not been seen in other blots. How is this possible?

      This cannot explain the increase in beta-oxidation observed.

      We thank the reviewer for raising this concern. We have repeated the experiment and provided the alternative blot as per the reviewer’s suggestion.

      The rationale for performing this experiment was to show that SIRT plays an important role in the activation of downstream TCA cycle pathways via PDHA1 deacetylation during Salmonella infection. The deacetylation of PDHA1 has been previously reported to cause transcriptional activation of the downstream TCA cycle and oxidative phosphorylation (Zhang Y, Wen P, Luo J, et al., Cell Death Dis.,2021). Additionally, PDHA1 hyperacetylation has been reported to cause lactate overproduction (An, S., Yao, Y., Hu, H. et al. PDHA1 hyperacetylation-mediated lactate overproduction promotes sepsis-induced acute kidney injury via Fis1 lactylation. Cell Death Dis 14, 457 (2023)). In our study, increased lactate production and PDHA1 hyperacetylation have been observed during SIRT3 inhibition conditions upon Salmonella infection.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1:

      (1) One issue that needs to be considered is the nomenclature of the enhancer. The authors have presented data to show this enhancer controls the expression of Ctnnb1 in the stomach, intestine, and colon tissues. However, the name proposed by the authors, ieCtnnb1 (intestinal enhancer of Ctnnb1), doesn't represent its functions. It might be more appropriate to call it a different name, such as gieCtnnb1 (gastrointestinal enhancer of Ctnnb1).

      We thank the reviewer for the insightful suggestion and agree that whole-mount reporter assays indicated that ieCtnnb1 and ieCTNNB1 indeed display activity in the stomach. However, in the current study, we focused on the cellular distribution and function of the enhancer in intestinal epithelia. After careful consideration, we reasoned that the current designation, ieCtnnb1, more appropriately represents its expression pattern and functions based on the evidence provided. We hope the reviewer can understand our reasoning.

      (2) The writing of this manuscript can be improved in a few places. 

      a) The definitions or full names for the abbreviations of some terms, e.g., Ctnnb1, ieCtnnb1, in both abstract and main text, are needed when they first appear. Specifically, Line 108 should be moved to Lines 26 and 95. Lines 125-126 are redundant. ieCtnnb1 in Line 130 needs to be defined.

      We appreciate the suggestion. In the revision, we have included the definition of Ctnnb1 and the full name of ieCtnnb1 when they first appear in the abstract and the main text. Lines 125-126 were deleted in the revision.

      b) Line 192-194, the description of the result needs to be rewritten to reflect the higher expression of LacZ transcript in eGFP+ cells.

      We would like to emphasize that the key point of this part is that the enhancer activity of ieCtnnb1 is present in both Lgr5-eGFP+ and Lgr5-eGFP- cells. This was validated by single-cell sequencing, which revealed the presence of LacZ transcripts in the Paneth cells. Moreover, we could not confidently conclude that eGFP+ cells have higher expression levels of LacZ, as these measurements were obtained from separate, semi-quantitative RT-qPCR experiments.

      c)  More details are needed for how the data using human tumor samples were generated and how they were analyzed. 

      We thank the reviewer for the suggestion. In the revision, we have provided additional details regarding the data and subsequent analyses of human CRC samples as follows: “We previously conducted paired analyses of chromatin immunoprecipitation sequencing (ChIP-seq) for H3K27ac and H3K4me3, alongside RNA-seq on 68 CRC samples and their adjacent normal (native) tissue (Li et al., 2021). In the current study, we performed analyses for the enrichment of H3K27ac and H3K4me3 at ieCTNNB1 and CTNNB1 promoter regions, as well as the expression levels of CTNNB1, followed by combined analyses (Figure 5A, Figure 5 - figure supplement 1).”

      d) The genomic structures from multiple species are presented at the bottom of Figure 1a. However, the description and explanation are lacking in both the main text and the figure legend.

      We apologize for the lack of clarity. We have added a description to the legend of Figure 1A: “The sequence conservation of the indicated species is shown at the bottom as vertical lines”. We also added an explanation in lines 162-163 of the main text: “Notably, unlike neCtnnb1, the primary sequence of ieCtnnb1 is not conserved among vertebrates (Figure 1A, bottom)”.

      Reviewer #2:

      (1) One of the main issues emerging during reading concerns the interpretation of the consequence of deleting the ieCtnnb1 enhancer. The authors write on line 235 that the deletion of ieCtnnb1 "undermined" Wnt signaling in the intestinal epithelium. This feels too strong, as the status of the pathway is only mildly affected, testified by the observation that mice with homozygous deletion on ieCtnnb1 are alive and well. The enhancer likely "only" drives higher Ctnnb1 expression, and it does not affect Wnt signaling by other mechanisms. The reduction of Wnt target gene expression upon its deletion is easily interpreted as the consequence of reduced β-catenin. Also the title, in my opinion, allows this ambiguity to stick in readers' minds. In other words, the authors present no evidence that the ieCtnnb1 enhancer controls Wnt signaling dosage via any mechanism other than its upregulation of Ctnnb1 expression in the intestinal epithelium. Reduced Ctnnb1, in turn, could explain the observed reduction of Wnt signaling output and the interesting downstream physiological consequences. Unless the authors think otherwise, I suggest they clarify this throughout the text, including necessary modifications to the title.

      We greatly appreciate the reviewer’s important comments and suggestion. We agree that the direct effect of ieCtnnb1 on canonical Wnt signaling is to regulate the transcription of Ctnnb1 in the intestinal epithelia. Therefore, knockout of ieCtnnb1 leads to compromised expression of Ctnnb1 and, consequently, reduced Wnt signaling. The term “undermined” is indeed too strong and has been revised to “compromised” in the revision (line 237). Similar revisions have been made throughout the manuscript. In particular, the title was changed to “A Ctnnb1 enhancer transcriptionally regulates Wnt signaling dosage to balance homeostasis and tumorigenesis of intestinal epithelia”. However, as we state in the following point, decreased levels of β-catenin upon ieCtnnb1 loss could lead to indirect effects, including reduced expression of Bambi, which might cause a more pronounced decrease in nuclear β-catenin.

      (2) It is unclear how the reduction of Ctnnb1 mRNA caused by deletion of ieCtnnb1 in mice could lead to a preferential decrease of nuclear more than membranous β-catenin (Fig. 1K and L). This might reflect a general cell autonomous reduction in Wnt signaling activation; yet, it is not clear how this could occur. Do the authors have any explanations for this?

      This is a very important question. We observed that in ieCtnnb1 knockout epithelia, the expression of Bambi (BMP and activin membrane-bound inhibitor) was significantly downregulated. Since BAMBI has been reported to stabilize β-catenin and facilitate its nuclear translocation, it is likely that the reduced level of BAMBI resulting from the loss of ieCtnnb1 further decreased nuclear β-catenin. In the revision, the expression change of Bambi has been added in Figure 1M. Moreover, the related content was extensively discussed with proper citations: “We noticed that after knocking out ieCtnnb1, the level of β-catenin in the nuclei of small intestinal crypt cells of Ctnnb1Δi.enh mice decreased more significantly compared to that in the cytoplasm (49.5% vs. 29.8%). Although the loss of ieCtnnb1 should not directly lead to reduced nuclear translocation of β-catenin, RNA-seq results showed that the loss of ieCtnnb1 causes a reduction in the expression of Bambi (BMP and activin membrane-bound inhibitor), a target gene in the canonical Wnt signaling pathway (Figure 1M). BAMBI promotes the binding of Frizzled to Dishevelled, thereby stabilizing β-catenin and facilitating its nuclear translocation (Lin et al., 2008; Liu et al., 2014; Mai et al., 2014; Zhang et al., 2015). Thus, it is likely that the decreased level of BAMBI resulting from the loss of ieCtnnb1 further reduced nuclear β-catenin”.

      (3) In Figure 1 K-L the authors show β-catenin protein level. Why not show its mRNA?

      The mRNA levels of Ctnnb1 in small and large intestinal crypts were shown in Figure 1I and 1J, demonstrating reduced expression of Ctnnb1 upon ieCtnnb1 knockout. We hope the reviewer understands that it is unnecessary to measure the nuclear and cytosolic levels of Ctnnb1 transcripts, as the total mRNA level generally reflects the protein level. 

      (4) Concerning the GSEA of Figure 1 that includes the Wnt pathway components: a) it would be interesting to see which components and to what extent is their expression affected; b) why should the expression of Wnt components that are not Wnt target genes be affected in the first place? It is odd to see this described uncritically and used to support the idea of downregulated Wnt signaling.

      We appreciate the suggestion and apologize for any lack of clarity. The affected components of the Wnt signaling pathway and the extent of their changes are summarized in Figure 1 – figure supplement 3. Additionally, we have provided explanations for their downregulation. For instance, the reduced expression of Wnt3 and Wnt2b ligands in ieCtnnb1-KO crypts may be attributed to the decreased numbers of Paneth cells.  

      (5) In lines 251-252 the authors refer to "certain technical issues" in the isolation of cell type from the intestinal epithelium. Why this part should be obscure in the characterization of a tissue for which there are several established protocols of isolation and analysis is not clear. I would rather describe what these issues have been and how they might have affected the data presented.

      We thank the reviewer for pointing this out. The single-cell preparation and sequencing of small intestinal crypt epithelial cells were carried out largely according to reported protocols, with slight modifications. The enrichment of live crypt epithelial cells (EpCAM+DAPI-) by flow cytometry and the cell filtering after single-cell sequencing were appropriate (Figure 2 – figure supplement 1A-1C). We would like to emphasize a few points: 1) Unlike other protocols, we did not exclude immune cells, erythrocytes, or endothelial cells using negative sorting antibodies. 2) When defining cell populations, we focused exclusively on epithelial cell types and did not consider other cell types, such as immune cells. As a result, the so-called “undefined” cells include a mixture of non-epithelial cells. Indeed, markers for erythrocytes (AY036118/Erf1, PMID:12894589) and immune cells (Gm42418 and Lars2, PMID:30940803, PMID: 35659337) were the top three enriched genes in the “undefined” cluster (Figure 2 – figure supplement 1D). 3) Nonetheless, the overall findings remain robust, as key observations such as the loss of Paneth cells and reduced cell proliferation were validated through histological studies. This information has been incorporated into the revised manuscript with related references cited (lines 254-259).

      (6) It is interesting that human SNPs exist that seem to fall within the ieCTNNB1 enhancer and affect the gastrointestinal expression of CTNNB1. Could the author report or investigate whether this SNP is present in human populations that have been considered in large-scale studies for colorectal cancer susceptibility? It seems to me a rather obvious next step of extreme importance to be ignored.

      (7) From Figure 5A a reader could conclude that colorectal tumor cells have a higher expression of CTNNB1 mRNA than in normal epithelium. This is the first time I have seen this observation which somewhat undermines our general understanding of Wnt-induced carcinogenesis exclusively initiated by APC mutations whereby it is β-catenin's protein level, not expression of its mRNA, of crucial importance. I find this to be potentially the most interesting observation of the current study, which could be linked to the activity of the enhancer discovered, and I suggest the authors elaborate more on this and perhaps consider it for future experimental follow-ups.

      We appreciate the comments and suggestions.  We therefore added related content in the revision (lines 470-475): “Importantly, ieCTNNB1 displayed higher enhancer activity in most CRC samples collected in the study. Moreover, the SNP rs15981379 (C>T) within ieCTNNB1 is associated with the expression of CTNNB1 in the GI tract. Future population studies could investigate how the enhancer activity of ieCTNNB1 and this particular SNP are associated with CRC susceptibility and prognosis”.

      (8) I am surprised that the authors, who seem to have dedicated lots of resources to this study, are satisfied by analyzing their ChIP experiments with qPCR rather than sequencing (Figure 6). ChIP-seq would produce a more reliable profile of the HNF4a and CREB1 binding sites on these loci and in other control regions, lending credibility to the whole experiment and binding site identification. Sequencing would also take care of the two following conceptual problems in primer design. 

      First: while the strategy to divide enhancer and promoter in 6 regions to improve the resolution of their finding is commendable, I wonder how the difference in signal reflects primers' efficiency rather than HNF4/CREB1 exact positioning. The possibility of distinguishing between regions 2 and 3, for example, in a ChIP-qPCR experiment, also depends on the average DNA fragment length after sonication, a parameter that is not specified here. 

      Second: what are the primers designed to detect the ieCtnnb1 enhancer amplifying in the yellow-columns samples of Figure 6G? In this sample, the enhancer is deleted, and no amplification should be possible, yet it seems that a value is obtained and set to 1 as a reference value.

      This is indeed a crucial point, and we fully agree with the reviewer that “ChIP-seq would produce a more reliable profile of the HNF4a and CREB1 binding sites on these loci and in other control regions”. However, we believe that our current ChIP-qPCR experiments have adequately addressed the potential concerns raised by the reviewer. (1) We have ensured that the DNA fragment length after sonication falls within the range of 200 bp to 500 bp, with an average length of approximately 300 bp (Author response image 1A). We have stated this point in the revised methods section (line 633). (2) We have randomly inspected 14 out of 26 primer sets used in Figure 6 and its supplemental figure (Author response image 1B-E), confirming that all inspected primer sets demonstrate comparable amplification efficiency (ranging from 90% to 110%). This information has also been included in the revised methods section (line 650). (3) Figures 6G and 6H show reduced enrichment of HNF4α (6G) and p-S133-CREB1 (6H) at the Ctnnb1 promoter in ieCtnnb1 knockout ApcMin/+ tumor tissues. The ChIP-qPCR primers used were positioned at the Ctnnb1 promoter, not at ieCtnnb1, with IgG control enrichment serving as the reference values on the Y-axes.

      Author response image 1.

      (A) Agarose gel electrophoresis of sonicated DNA. (B-E) Tests of amplification efficiency for primer sets used in ChIP-qPCR.
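      For readers unfamiliar with how a primer efficiency in the 90% to 110% range is usually derived, the sketch below uses the standard-curve relation E = 10^(-1/slope) - 1, where the slope comes from a linear fit of Ct against log10 of template input. This is an illustrative assumption about the calculation, not a description of the authors' exact procedure, and the dilution series and function names are hypothetical.

```python
import math


def amplification_efficiency(log10_input, ct_values):
    """Percent qPCR efficiency from a standard curve (Ct vs log10 input).

    Fits Ct = slope * log10(input) + intercept by least squares, then
    applies E = (10 ** (-1 / slope) - 1) * 100.
    """
    n = len(log10_input)
    if n != len(ct_values) or n < 2:
        raise ValueError("need at least two paired points")
    mean_x = sum(log10_input) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in log10_input)
    if sxx == 0:
        raise ValueError("dilution points must differ")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(log10_input, ct_values))
    slope = sxy / sxx
    if slope == 0:
        raise ValueError("Ct does not change with input; slope is zero")
    return (10 ** (-1.0 / slope) - 1.0) * 100.0


def within_accepted_range(efficiency, low=90.0, high=110.0):
    """True if the efficiency lies in the range quoted in the response."""
    return low <= efficiency <= high


# Hypothetical 10-fold dilution series with perfect doubling per cycle.
dilutions = [0.0, -1.0, -2.0, -3.0]
cts = [20.0 - x / math.log10(2) for x in dilutions]
eff = amplification_efficiency(dilutions, cts)
print(f"efficiency: {eff:.1f}%, acceptable: {within_accepted_range(eff)}")
```

      A slope near -3.32 corresponds to perfect doubling per cycle; shallower or steeper slopes move the estimate above or below 100%.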

      (9) The ChIP-qPCR showing preferential binding of pS133-CREB1 in small intestinal crypts and CHT15 cells (line 393) should be shown. 

      The ChIP-qPCR results demonstrating preferential binding of p-S133-CREB1 over CREB1 have been added in revised Figure 6C, 6D and Figure 6 – Supplement 1C.

      (10) It is not entirely clear what the blue tracks represent at the bottom of Figures 6C-D and Figure 6 - Figure Supplement 1C-D. The ChIP-seq profiles of both CREB1 and HNF4a shown in Figures 6A and Figure 6 - Figure Supplement 1A do not seem to match. Taking HNF4a, for example from Figure 6 - Figure Supplement 1A it seems to bind on the Ctnnb1 promoter, while in Figure 6 - Figure Supplement 1D the peaks are within the first intron. I realize this might all be a problem with a different scale across figure panels, but I suggest producing a cleared figure.

      We apologize for the confusion. We have revised Figure 6C-6D, Figure 6 - figure supplement 1C-D, and the corresponding legends to enhance clarity. (1) The top panels of Figures 6C and 6D respectively highlight shaded regions of ieCTNNB1 (pink) and the CTNNB1 promoter (grey) in Figure 6A, emphasizing the enrichment of p-S133-CREB1. (2) The top panels of Figure 6 – figure supplement 1C and 1D respectively highlight shaded regions of ieCtnnb1 (pink) and the Ctnnb1 promoter (grey) in Figure 6 – figure supplement 1A, emphasizing the enrichment of HNF4α. (3) Because Figures 6C-6D and Figure 6 - figure supplement 1C-1D respectively correspond to human and mouse genomes, the positions of peaks and scales differ.

      (11) In the intro the authors refer to "TCF-4". I suggest they use the more recent unambiguous nomenclature for this family of transcription factors and call it TCF7L2.

      TCF-4 has been changed to TCF7L2 in the revision (line 81).

      (12) In lines 121-122, the authors write "Although numerous putative enhancers...only a fraction of them were functionally annotated". To what study/studies are the authors referring? Please provide references.

      References were added in the revision (line 124).

      (13) In some parts the authors use strong words that should in my opinion be attenuated. Examples are: (i) at line 224, "maintains" would be better substituted with "contribute", as in the absence of ieCtnnb1, Ctnnb1 is still abundantly expressed; (ii) at line 266 "compromised" when the proliferative capacity of CFCs and TACs seems to be only mildly reduced; (iii) at line 286 "disrupts", the genes are simply downregulated.

      We thank the reviewer for these helpful suggestions. 1) On lines 224-225, the sentence was revised to: “These data suggest that ieCtnnb1 plays a specific role in regulating the transcription of Ctnnb1 in intestinal epithelia”. 2) On line 271, “compromised” was replaced with “mildly reduced”. 3) In ieCtnnb1 knockout epithelial cells of the small intestine, genes related to secretory functions were decreased, while genes related to absorptive functions were increased. Therefore, the term 'disrupts' is more appropriate than 'downregulates'.

      Reviewer #3:

      Line 81, c-Myc should be human MYC (italics) to agree with the other human gene names in this sentence. 

      c-Myc has been changed to MYC in the revision (line 82).

      Line 215, wildtype should be wild-type. 

      “wildtype” has been changed to “wild-type” in the revision (line 215).

      Line 224, Elimination of the enhancer did not abolish expression of Ctnnb1; therefore, it would be better to say that it "helps to maintain Ctnnb1 transcription" 

      The sentence was changed to “These data suggest that ieCtnnb1 plays a specific role in regulating the transcription of Ctnnb1 in intestinal epithelia” in the revision (lines 224-225).

      Line 228, perhaps "to activate transcription" is meant. 

      “active” has been changed to “activate” in the revision (line 228).

      Line 235, consider "reduced" instead of "undermined". 

      “undermined” has been replaced with “compromised” in the revision (line 237).

      Line 262, "em" dashes should be at both ends of this insertion.

      Line 298, "dysfunctional" would be better.

      Line 356, "samples were". 

      Line 481, 12-hr (add hyphen). 

      All of the above points have been addressed according to the reviewer’s suggestions.

      Line 712, Is "poly-N" meant? 

      “Poly-N” indicates undetected bases during sequencing. This explanation was added in the revision (lines 759-760).

      Figure 1K, the GAPDH signal is not visible and that panel is unnecessary as there is an H3 control.   

      Figure 1K and 1L respectively show levels of nuclear and cytoplasmic β-catenin. GAPDH and H3 were used as internal references for the cytoplasmic and nuclear fractions, respectively, confirming both robust fractionation and equal loading.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #3 (Public Review):

      The iron manipulation experiments are in the whole animal and it is likely that this affects general feeding behaviour, which is known to affect NB exit from quiescence and proliferative capacity. The loss of ferritin in the gut and iron chelators enhancing the NB phenotype are used as evidence that glia provide iron to NB to support their number and proliferation. Since the loss of NB is a phenotype that could result from many possible underlying causes (including low nutrition), this specific conclusion is one of many possibilities.

      We examined the feeding behavior of the flies using a Brilliant Blue dye assay (Sigma, 861146)[1]. The amount of dye in the fly body was similar between the control group and the BPS group, suggesting that BPS had little effect on feeding behavior (Figure 3—figure supplement 1A).

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      There was a gap between the Pros nuclear localization and downstream targets of ferritin, particularly NADH dehydrogenase and biosynthesis. Could overexpression of Ndi1 restore Pros localization in NBs?

      A ferritin defect lowers the iron level, which leads to cell cycle arrest of NBs through ATP shortage, and cell cycle arrest of NBs probably results in NB differentiation[2, 3]. We have added this experiment in Figure 5—figure supplement 2. The result shows that overexpression of Ndi1 significantly restored Pros localization in NBs.

      The abstract requires revision to cover the major findings of the manuscript, particularly the second half.

      We revised the abstract to add more major findings of the manuscript in the second half as follows:

      “Abstract

      Stem cell niche is critical for regulating the behavior of stem cells. Drosophila neural stem cells (Neuroblasts, NBs) are encased by glial niche cells closely, but it still remains unclear whether glial niche cells can regulate the self-renewal and differentiation of NBs. Here we show that ferritin produced by glia, cooperates with Zip13 to transport iron into NBs for the energy production, which is essential to the self-renewal and proliferation of NBs. The knockdown of glial ferritin encoding genes causes energy shortage in NBs via downregulating aconitase activity and NAD+ level, which leads to the low proliferation and premature differentiation of NBs mediated by Prospero entering nuclei. More importantly, ferritin is a potential target for tumor suppression. In addition, the level of glial ferritin production is affected by the status of NBs, establishing a bicellular iron homeostasis. In this study, we demonstrate that glial cells are indispensable to maintain the self-renewal of NBs, unveiling a novel role of the NB glial niche during brain development.”

      In Figure 2B Mira appeared to be nuclear in NBs, which is inconsistent with its normal localization. Was it Dpn by mistake?

      We confirmed that the staining in Figure 2B is Mira. We also provide a magnified image in Figure 2B’, showing that Mira mainly localizes to the cortex or the cytoplasm, as previously reported.

      Figure 2C, Fer1HCH-GFP/mCherry localization was non-uniform in the NBs revealing 1-2 regions devoid of protein localization potentially corresponding to the nucleus and Mira crescent enrichment. It is important to co-label the nucleus in these cells and discuss the intracellular localization pattern of Ferritin.

      We have revised the image in Figure 2C to include the nuclear marker DAPI. Fer1HCH-GFP/Fer2LCH-mCherry did not co-localize with DAPI, indicating that Drosophila ferritin is predominantly distributed in the cytosol[4, 5]. Regarding the reviewer’s concern, the GFP/mCherry signal in NBs originated from ferritin overexpressed in glia, which probably accounts for the non-uniform signal.

      In Figure 3-figure supplement 3F, glial cells in Fer1HCH RNAi appeared to be smaller in size. This should be quantified. Given the significance of ferritin in cortex glial cells, examining the morphology of cortex glial cells is essential.

      In Figure 3—figure supplement 3F, we did not label single glial cells, so it was difficult to determine whether their size had changed. However, the chamber formed by the cellular processes of glial cells does appear smaller in Fer1HCH RNAi. The glial chamber undergoes remodeling during neurogenesis, responding to NB signals to enclose the NB and its progeny[6]. Thus, the size of the glial chamber is regulated by NB lineage size. In our study, the ferritin defect leads to low proliferation and therefore a smaller lineage for each NB, which likely makes the chamber smaller.

      Since the authors showed that the reduced NB number was not due to apoptosis, a time-course experiment for glial ferritin KD is recommended to identify the earliest stage when the phenotype in NB number /proliferation manifests during larval brain development.

      We observed brains at different larval stages upon glial ferritin KD. At the second-instar larval stage, NB proliferation decreased significantly while NB number declined only slightly (Figure 5—figure supplement 1E and F), suggesting that the brain defect caused by glial ferritin KD manifests at the second-instar larval stage.

      Transcriptome analysis on ferritin glial KD identified genes in mitochondrial functions, while the in vivo EM data suggested no defects in mitochondria morphology. A short discussion on the inconsistency is required.

      In the in vivo EM analysis of mitochondrial morphology, we focused on whether cristae were visible in the mitochondria, which is used to determine whether ferroptosis occurs[7]. Other aspects of mitochondrial morphology may have changed, but we did not examine them. To describe this result more accurately, we replaced “However, our observation revealed no discernible defects in the mitochondria of NBs after glial ferritin knockdown” with “However, our result showed that the mitochondrial double membrane and cristae were clearly visible whether in the control group or glial ferritin knockdown group, which suggested that ferroptosis was not the main cause of NB loss upon glial ferritin knockdown” in lines 207-209.

      The statement “we found no obvious defects of brain at the first-instar larval stage (0-4 hours after larval hatching) when knocking down glial ferritin (Figure 5-figure supplement 1C).” lacks quantification of NB number and proliferation, making it challenging to conclude.

      We have provided the quantification of NB number and proliferation rate at the first-instar larval stage in Figure 5—figure supplement 1C and D. The data show no significant change in NB number or proliferation rate upon ferritin knockdown, suggesting that no brain defect manifests at the first-instar larval stage.

      A wild-type control is necessary for Figure 6A-C as a reference for normal brain sizes.

      We have added Insc>mCherry RNAi as a reference in Figure 6A-D, which shows that the brain of the tumor model is larger than the normal brain. Moreover, we moved the brat RNAi data from Figure 6A-D to Figure 6—figure supplement 1A-D for a better layout.

      In Figures 6B, D, “Tumor size” should be corrected to “Larval brain volume”.

      Here, we assessed the severity of the tumor by measuring the brain area in ImageJ rather than by using 3D data on brain volume. We therefore think “Larval brain size” is more appropriate than “Larval brain volume”, and we have corrected “Tumor size” to “Larval brain size” in Figure 6B and D and in Figure 6—figure supplement 1B and D.

      Considering that asymmetric division defects in NBs may lead to premature differentiation, it is advisable to explore the potential involvement of ferritin in asymmetric division.

      aPKC is a classic marker for identifying asymmetric division defects in NBs. We performed aPKC staining and found that, judged by the position of the daughter cell, it formed a crescent at the apical cortex in both the control and the glial ferritin knockdown (Figure 5—figure supplement 3A). This result indicates that there was no obvious asymmetric division defect after glial ferritin knockdown.

      In the statement "Secondly, we examined the apoptosis in glial cells via Caspase-3 or TUNEL staining, and found the apoptotic signal remained unchanged after glial ferritin knockdown (Figure 3-figure supplement 3A-D).", replace "the apoptosis in glial cells" with "the apoptosis in larval brain cells".

      We have replaced "the apoptosis in glial cells" with "the apoptosis in larval brain cells" in line 216.

      Include a discussion on the involvement of ferritin in mammalian brain development and address the limitations associated with considering ferritin as a potential target for tumor suppression.

      We have added a discussion of ferritin in mammalian brain development in lines 428-430 and of the limitations of ferritin as a target for tumor suppression in lines 441-444.

      Indicate Insc-GAL4 as BDSC#8751, even if obtained from another source. Additionally, provide information on the extensively used DsRed fly stock used in this study within the methods section.

      We have provided the stock information for Insc-GAL4 and DsRed in lines 673-674.

      Reviewer #2 (Recommendations For The Authors):

      Major points:

      The number of NBs differs a lot between experiments. For example, in Fig 1B and 1K controls present less than 100 NBs whereas in Figure 1 Supplementary 2B it can be seen that controls have more than 150. Then, depending on which control you compare the number of NBs in flies silencing Fer1HCH or Fer2LCH, the results might change. The authors should explain this.

      Figure 1 Supplementary 2B (Figure 1 Supplementary 3B in the revised version) shows the NB number in the VNC region, whereas Fig 1B and 1K show the NB number in the CB region. We first described the general phenotype by showing the NB number in the CB and the VNC separately (Fig 1 and Fig 1-Supplementary 1 and 3 in the revised version), and the NB number is consistent within each region. After that, we focused on the NB number in the CB for convenience.

      This reviewer encourages the authors to use better Gal4 lines to describe the expression patterns of ferritins and Zip13 in the developing brain. On the one hand, the authors do not state which lines they are using (including supplementary table). On the other hand, new Trojan GAL4 (or at least InSite GAL4) lines are a much better tool than classic enhancer trap lines. The authors should perform this experiment.

      All stock sources and numbers are documented in Table 2. The ferritin GAL4 and Zip13 GAL4 lines in this study are InSite GAL4 lines. In addition, we used another Fer2LCH enhancer-trap GAL4 (DGRC104255) to verify our result and provided the result in Figure 2—figure supplement 1. Our data showed that DsRed driven by Fer2LCH-GAL4 co-localized with the glial nuclear protein Repo rather than the NB nuclear protein Dpn, which was consistent with the result for Fer1HCH/Fer2LCH GAL4. We will also try to obtain the Trojan GAL4 lines (Fer1HCH/Fer2LCH GAL4 and Zip13 GAL4) and validate this result in the future.

      The authors exclude very rapidly the possibility of ferroptosis based only on some mitochondrial morphological features without analysing the other hallmarks of this iron-driven cell death. The authors should at least measure Lipid Peroxidation levels in their experimental scenario either by a kit to quantify by-products of lipid peroxidation such as Malondialdehyde (MDA) or using an anti 4-HNE antibody.

      We combined multiple experiments to exclude the possibility of ferroptosis. Firstly, ferroptosis can be blocked by iron chelators. We fed flies an iron chelator upon glial ferritin knockdown, but NB number and proliferation were not restored, which suggested that ferroptosis was probably not the cause of the NB loss induced by glial ferritin knockdown (Figure 3B and C). Secondly, Zip13 transports iron into the secretory pathway and further out of the cells in the Drosophila gut[8]. Our data showed that knocking down the iron transporter Zip13 in glia resulted in a decline in NB number and proliferation, which was consistent with the phenotype upon glial ferritin knockdown (Figure 3E-G). More importantly, simultaneous knockdown of Zip13 and ferritin aggravated the phenotype in NB number and proliferation (Figure 3E-G). These results suggested that the phenotype was induced by iron deficiency in NBs, which argues against iron overload or ferroptosis as the main cause of NB loss upon glial ferritin knockdown. Finally, we examined mitochondrial morphology, focusing on the double membrane and the cristae, whose disruption is a critical hallmark of ferroptosis, but found no significant damage (Figure 3-figure supplement 2E and F).

      In addition, we have added the 4-HNE measurement in Figure 3—figure supplement 2G and H. The 4-HNE level did not change significantly, suggesting that lipid peroxidation was unchanged, which further argues against ferroptosis as the cause of NB loss upon glial ferritin knockdown.

      All of the above results together indicate that ferroptosis is not the cause of NB loss after ferritin knockdown.

      A major flaw of the manuscript is related to the chapter Glial ferritin defects result in impaired Fe-S cluster activity and ATP production and the results displayed in Figure 4. The authors talk about the importance of FeS clusters for energy production in the mitochondria. Surprisingly, the authors do not analyse the genes involved in this process such as but they present the interaction with the cytosolic FeS machinery that has a role in some extramitochondrial proteins but no role in the synthesis of FeS clusters incorporated in the enzymes of the TCA cycle and the respiratory chain. The authors should repeat the experiments incorporating the genes Nfs1 (CG12264), ISCU (CG9836), ISD11 (CG3717), and fh (CG8971) or remove (or at least rewrite) this entire section.

      Thank you for this constructive advice. We have revised Figure 4B and C accordingly. We repeated the experiment while blocking mitochondrial Fe-S cluster biosynthesis by knocking down Nfs1 (CG12264), ISCU (CG9836), ISD11 (CG3717), and fh (CG8971), respectively. Nfs1 knockdown in NBs led to low proliferation, which was consistent with CIA knockdown. However, we did not observe an obvious brain defect upon knockdown of ISCU (CG9836), ISD11 (CG3717), or fh (CG8971) in NBs. Our interpretation of these results is that Nfs1 is probably a necessary core component of Fe-S cluster assembly, while the others are dispensable[9].

      The presence and aim of the mouse model is unclear to this reviewer. On the one hand, it is not used to corroborate the fly findings regarding iron needs from neuroblasts. On the other hand, and without further explanation, authors migrate from a fly tumor model based on modifying all neuroblasts to a mammalian model based exclusively on a glioma. The authors should clarify those issues.

      Although the iron transporters probably differ between Drosophila and mammals, the function of iron as an essential nutrient for cell growth and proliferation is conserved from Drosophila to mammals. The fly data suggested that iron is critical for brain tumor growth, and we therefore tested this in a mammalian model. Glioma is the most common form of central nervous system neoplasm and originates from neuroglial stem or progenitor cells[10]. Therefore, we tested the effect of the iron chelator DFP on glioma in mice and found that DFP suppressed glioma growth and prolonged the survival of tumor-bearing mice.

      Minor points

      Although referred to adult flies, the authors did not include either in the introduction or in the discussion existing literature about expression of ferritins in glia or alterations of iron metabolism in fly glia cells (PMID: 21440626 and 25841783, respectively) or usage of the iron chelator DFP in Drosophila (PMID: 23542074). The authors should check these manuscripts and consider the possibility of incorporating them into their manuscript.

      Thank you for the reminder. We have incorporated all recommended papers into our manuscript in lines 65-67 and 168.

      The number of experiments in each figure is missing.

      All experiments were repeated at least three times. We have stated this in the Quantifications and Statistical Analysis section of Materials and methods.

      If graphs are expressed as mean +/- sem, it is difficult to understand the significance stated by the authors in Figure 2E.

      We apologize for this mistake and have revised this in Quantifications and Statistical Analysis. All statistical results were presented as means ± SD.

      When authors measure aconitase activity, are they measuring all (cytosolic and mitochondrial) or only one of them? This is important to better understand the experiments done by the authors to describe any mitochondrial contribution (see above in major points).

      In this experiment, we measured the total aconitase activity. We also tried to measure mitochondrial aconitase activity, but the attempt failed, possibly because of the low biomass of the tissue samples.

      In this line, why do controls in aconitase and atp lack an error bar? Are the statistical tests applied the correct ones? It is not the same to have paired or unpaired observations.

      This is due to normalization. We repeated each of these experiments at least three times, in different weeks, because the whole process (collection of brains, protein determination, and ATP or aconitase determination) was time-consuming and labor-intensive. In addition, the efficiency of the aconitase and ATP kits changed over time, so we could not keep the experimental conditions identical across batches. Therefore, we normalized within each batch to obtain a more accurate result: the control value was divided by itself, giving 1, and the values of the other groups were divided by the control value. This normalization was repeated for each of the three batches, so there is no error bar for the control group. We think it is appropriate to apply ANOVA with a Bonferroni test to the three groups.
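
      The per-batch normalization described above can be sketched as follows. This is only an illustration under stated assumptions: the readings, the column layout (control first, then two knockdown groups) and the function name `normalize_to_control` are hypothetical and are not taken from the study.

```python
import numpy as np

# Hypothetical raw kit readings (arbitrary units) from three independent batches.
# Columns: control, knockdown A, knockdown B. Values are illustrative only.
raw = np.array([
    [2.0, 1.0, 1.2],
    [4.0, 2.4, 2.0],
    [1.0, 0.6, 0.7],
])


def normalize_to_control(batches, control_col=0):
    """Divide every reading in a batch by that batch's own control reading."""
    batches = np.asarray(batches, dtype=float)
    return batches / batches[:, [control_col]]


normalized = normalize_to_control(raw)

# Per-group summary across batches (sample SD, ddof=1).
means = normalized.mean(axis=0)
sds = normalized.std(axis=0, ddof=1)

print("normalized:\n", normalized)
print("means:", means)
print("SDs:", sds)
```

      Because every control reading is divided by itself, the control column is exactly 1 in every batch and its SD is zero, which is why no error bar appears for the control group after this kind of normalization.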

      In some cases, further rescue experiments would be appreciated. For example, expression of Ndi1 restores control NAD+ levels or number of NBs, it would be interesting to know if this is accompanied by restoring mitochondrial integrity and its ability to produce ATP.

      We have determined ATP production after overexpressing Ndi1 and provided this result in Figure 4—figure supplement 1B. The data showed that expression of Ndi1 could restore ATP production upon glial Fer2LCH knockdown, which was consistent with our conclusion.

      Lines 293-299 on page 7 are difficult to understand.

      According to our results above, the decrease in NB number and proliferation upon glial ferritin knockdown (KD) was caused by energy deficiency. As shown in the schematic diagram (Author response image 1), “T” represents the total energy used for NB maintenance and proliferation, “N” the energy for maintaining NB number, and “P” the energy for NB proliferation, so that “T” equals “N” plus “P”. When ferritin was knocked down in glia, “T”, “N” and “P” all declined in “Ferritin KD” compared to “wildtype (WT)”. Knockdown of pros can prevent the differentiation of NBs, but it cannot supply energy to NBs, which probably explains the rescue of NB number but not of proliferation. Specifically, NB number increased significantly in “Ferritin KD Pros KD” compared to “Ferritin KD”, so more energy was consumed for NB maintenance in “Ferritin KD Pros KD”. As shown in the schematic diagram, “T” did not change between “Ferritin KD Pros KD” and “Ferritin KD”, whereas “N” increased in “Ferritin KD Pros KD” compared to “Ferritin KD”. Thus, “P” decreased, meaning that less energy remained for proliferation, which led to the failure to rescue NB proliferation. Indeed, the level of proliferation in “Ferritin KD Pros KD” appeared even lower than in “Ferritin KD”.

      Author response image 1.

      The schematic diagram of relationship between energy and NB function in different groups. “T” represents total energy for NB maintenance and proliferation. “N” represents the energy for NB maintenance. “P” represents the energy for NB proliferation. T=N+P 
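
      The energy-budget argument can also be written compactly. The subscripts below are shorthand introduced here for the “Ferritin KD” and “Ferritin KD Pros KD” groups; they do not appear in the manuscript, and the relations simply restate the assumptions of the schematic.

```latex
\[
T = N + P, \qquad
T_{\mathrm{KD+pros}} = T_{\mathrm{KD}}, \qquad
N_{\mathrm{KD+pros}} > N_{\mathrm{KD}}
\]
\[
\Longrightarrow\quad
P_{\mathrm{KD+pros}}
  = T_{\mathrm{KD}} - N_{\mathrm{KD+pros}}
  < T_{\mathrm{KD}} - N_{\mathrm{KD}}
  = P_{\mathrm{KD}}.
\]
```

      In words: if the total budget is unchanged while the maintenance share grows, the share left for proliferation must shrink.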

      Line 601 should indicate that Tables 2 and 3 are part of the supplementary material.

      We have revised this in line 678.

      Figure 4-supplement 1. Only validation of 2 genes from a RNAseq seems too little.

      We dissected hundreds of brains to sort NBs because of the low biomass of the fly brain. This is difficult and labor-intensive work. Most NBs were used for RNA-seq, so only a small amount of sample was left for validation, which was not enough to test more genes.

      Figure 6E, the authors indicate that 10 mg/ml DFP injection could significantly prolong the survival time. Which increase in % is produced by DFP?

      We have provided the bar graph in Author response image 2. The increase is about 16.67% by DFP injection.

      Author response image 2.

      The bar graph of survival time of mice treated with DFP. (The unpaired two-sided Student’s t test was employed to assess statistical significance. Statistical results were presented as means ± SD. n=7,6; *: p<0.05)

      Reviewer #3 (Recommendations For The Authors):

      As I read the initial results that built the story (glia make ferritin>release it> NBs take them up>use it for TCA and ETC) I kept thinking about what it meant for NBs to be 'lost'. This led me to consider alternate possibilities that the results might point to, other than the ones the authors were suggesting. It was only in Figure 5 that the authors ruled out some of those possibilities. I would suggest that they first illustrate how NBs are lost upon glial ferritin loss of function before they delve into the mechanism. This would also be a place to similarly address that glial numbers and general morphology are unchanged upon ferritin loss.

      This recommendation provides a valuable guideline to build this story especially for researchers who are interested in neural stem cell studies. Actually, we tried this logic to present our study but found that there are several gaps in the middle of the manuscript, such as the relationship between glial ferritin and Pros localization in NB, so that the whole story cannot be fluently presented. Therefore, we decided to present this study in the current way.

      More details of the screen would be useful to know. How many lines did they screen, what was the assay? This is not mentioned anywhere in the text.

      We have added this in the Screen section of Materials and methods. We screened about 200 lines targeting components of classical signaling pathways, genes highly expressed in glial cells, or genes encoding secretory proteins. UAS-RNAi lines were crossed with repo-Gal4, and third-instar F1 larvae were dissected. We isolated the brains from the F1 larvae, performed immunostaining for Dpn and PH3, and finally imaged the brains under a confocal microscope.

      Many graphs seem to be repeated in the main figures and the supplementary data. This is unnecessary, or at least should be mentioned.

      We appreciate your kind reminder. However, we carefully went through all the figures and did not find the repeated graphs, though some of them look similar.

      The authors mention that they tested which glial subtypes ferritin is needed in, but don't show the data. Could they please show the data? Same with the other iron transport/storage/regulation. Also, in both this and later sections, the authors could mention which Gal4 was used to label what cell types. The assumption is that the reader will know this information.

      We have added the results of ferritin knockdown in glial subpopulations in Figure 1—figure supplement 2. However, given the large number of iron-related genes, we did not acquire images for them; the results are recorded in Table 3.

      For all their images showing colocalisation, magnified, single-colour images shown in grayscale will be useful. For example, without the magnification, it is not possible to see the NB expression of the protein trap line in Figure 2B. A magnified crop of a few NBs (not a single one like in 2C) would be more useful.

      We have provided Figure 2A’, B’, D’ and Figure 3D’ as suggested.

      There are a lot of very specific assays used to detect ROS, NAD, aconitase activity, among others. It would be nice to have a brief but clear description of how they work in the main text. I found myself having to refer to other sources to understand them. (I believe SoNAR should be attributed to Zhao et al 206 and not Bonnay et al 2020.)

      We have added a brief description of the ROS, aconitase activity, and NAD assays in lines 198-199, 229-231, and 269 as suggested.

      I did not understand the normalisation done with respect to SoNAR. Is this standard practice? Is the assumption that 'overall protein levels will be higher in slowly proliferating NBs' reasonable? This is why they state the need to normalise.

      The SoNar normalization is not standard practice. However, we think that our normalization of SoNar is reasonable. According to our results, the expression levels of Dpn and Mira seemed higher upon glial ferritin knockdown, so we speculated that some proteins accumulated in slowly proliferating NBs. We therefore used Insc-GAL4 to drive DsRed as a readout of Insc expression and found that DsRed rose after glial ferritin knockdown, suggesting that Insc expression was indeed increased. We therefore had to normalize SoNar driven by Insc-GAL4 to DsRed driven by Insc-GAL4, which eliminates the effect of increased Insc expression upon glial ferritin knockdown.

      FAC is mentioned as a chelator? But the authors seem to use it oppositely. Is there an error?

      FAC is a type of iron salt, which is used to supply iron. We have also indicated that in line 156 according to your advice. 

      The lack of any cell death in the L3 brain surprised me. There should be plenty of hemilineages that die, as do many NBs, particularly in the abdominal segments. Is the stain working? Related to this, P35 is not the best method for rescuing cell death. H99 might be a better way to go.

      We were also surprised by this result and repeated the experiment several times with both negative and positive controls. Moreover, we used TUNEL to validate this result and obtained the same outcome. We will try to use H99 to rescue NB loss in the future, because it first needs to be integrated and recombined with our current genetic tools.

      It would be nice to see the aconitase activity signal as opposed to just the quantification.

      This method can only determine the absorbance for indicating aconitase activity, so our result is just the quantification.

      Glia are born after NBs are specified. In fact, they arise from NBs (and glioblasts). So, it's unlikely that the knockdown of ferritin in glia can at all affect initial NB specification.

      We completely agree with this statement.

      The section on tumor suppression seems out of place. The fly data on which the authors base this as an angle to chase is weak. Dividing cells will be impaired if they have inadequate energy production. As a therapeutic, this will affect every cell in the body. I'm not sure that cancer therapeutics is pursuing such broadly acting lines of therapies anymore.

      Our data suggested that iron/ferritin is more critical for highly proliferative cells. Tumor cells have a high expression of TfR (Transferrin Receptor)[11], which can bind both transferrin and ferritin[12], and ferritin specifically targets tumor cells[11]. Thus, we think iron/ferritin is essential for tumor cells. If an appropriate dose of an iron/ferritin inhibitor can be found that suppresses tumor growth while maintaining normal cell growth, iron/ferritin might be an effective target for tumor treatment.

      The feedback from NB to glial ferritin is also weak data. The increased cell numbers (of unknown identity) could well be contributing to the increase in ferritin. I would omit the last two sections from the MS.

      In brat RNAi and numb RNAi, increased cells are NB-like cells, which cannot undergo further differentiation and are not expected to produce ferritin. More importantly, we used Repo (glia marker) as the reference and quantified the ratio of ferritin level to Repo level, which can exclude the possibility that increased glial cells lead to the increase in ferritin.

      References

      (1) Tanimura T, Isono K, Takamura T, et al. Genetic Dimorphism in the Taste Sensitivity to Trehalose in Drosophila-Melanogaster. J Comp Physiol, 1982,147(4):433-7

      (2) Myster DL, Duronio RJ. Cell cycle: To differentiate or not to differentiate? Current Biology, 2000,10(8):R302-R4

      (3) Dalton S. Linking the Cell Cycle to Cell Fate Decisions. Trends in Cell Biology, 2015,25(10):592-600

      (4) Nichol H, Law JH, Winzerling JJ. Iron metabolism in insects. Annu Rev Entomol, 2002,47:535-59

      (5) Pham DQ, Winzerling JJ. Insect ferritins: Typical or atypical? Biochim Biophys Acta, 2010,1800(8):824-33

      (6) Speder P, Brand AH. Systemic and local cues drive neural stem cell niche remodelling during neurogenesis in Drosophila. Elife, 2018,7

      (7) Mumbauer S, Pascual J, Kolotuev I, et al. Ferritin heavy chain protects the developing wing from reactive oxygen species and ferroptosis. PLoS Genet, 2019,15(9):e1008396

      (8) Xiao G, Wan Z, Fan Q, et al. The metal transporter ZIP13 supplies iron into the secretory pathway in Drosophila melanogaster. Elife, 2014,3:e03191

      (9) Marelja Z, Leimkühler S, Missirlis F. Iron Sulfur and Molybdenum Cofactor Enzymes Regulate the Drosophila Life Cycle by Controlling Cell Metabolism. Front Physiol, 2018,9

      (10) Morgan LL. The epidemiology of glioma in adults: a "state of the science" review. Neuro-Oncology, 2015,17(4):623-4

      (11) Fan K, Cao C, Pan Y, et al. Magnetoferritin nanoparticles for targeting and visualizing tumour tissues. Nat Nanotechnol, 2012,7(7):459-64

      (12) Li L, Fang CJ, Ryan JC, et al. Binding and uptake of H-ferritin are mediated by human transferrin receptor-1. Proc Natl Acad Sci U S A, 2010,107(8):3505-10

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      The propagation of electrical signals within neuronal circuits is tightly regulated by the physical and molecular properties of neurons. Since neurons vary in size across species, the question arises whether propagation speed also varies to compensate for it. The present article compares numerous speed-related properties in human and rat neurons. They found that the larger size of human neurons seems to be compensated by a faster propagation within dendrites but not the axons of these neurons. The faster dendritic signal propagation was found to arise from wider dendritic diameters and greater conductance load in human neurons. In addition, the article provides a careful characterization of human dendrites and axons, as the field has only recently begun to characterize post-operative human cells. There are only a few studies reporting dendritic properties and these are not all consistent, hence there is the added value of reporting these findings, particularly given that the characterization is condensed in a compartmental model.

      Strengths:

      The study was performed with great care using standard techniques in slice electrophysiology (pharmacological manipulation with somatic patch-clamp) as well as some challenging ones (axonal and dendritic patch-clamp). Modeling was used to parse out the role of different features in regulating dendritic propagation speed. The finding that propagation speed varies across species is novel as previous studies did not find a large change in membrane time constant or axonal diameters (a significant parameter affecting speed). A number of possible, yet less likely factors were carefully tested (Ih, membrane capacitance). The main features outlined here are well-known to regulate speed in neuronal processes. The modeling was also carefully done to verify that the magnitude of the effects is consistent with the difference in biophysical properties. Hence, the findings appear very solid to me.

      Weaknesses:

      The role of diameter in regulating propagation speed is well-known in the axon literature.

      We thank the reviewer for this comment. This is indeed true. The paper does not claim that this is new; we simply referred to Waxman’s book to acknowledge this established effect. Our main emphasis is on the impact of dendritic (rather than axonal) diameter: we highlight that EPSP speed is faster near the input synapse and converges to a steady-state value further away from the soma, and we use this to explore the impact of differences in dendritic diameter between rat and human on EPSP latency and velocity. We have now made this point clearer in the revised text.
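
      For orientation, the textbook passive-cable relations behind the diameter dependence can be written as below. The symbols are standard notation assumed here rather than taken from the manuscript: d is the process diameter, R_m the specific membrane resistance, R_i the axial resistivity, and C_m the specific membrane capacitance.

```latex
\[
\lambda = \sqrt{\frac{R_m}{R_i}\,\frac{d}{4}}, \qquad
\tau_m = R_m C_m, \qquad
\lambda \propto \sqrt{d}.
\]
```

      Under these relations alone, a diameter that is 10% larger lengthens the space constant by only about 5% (since \(\sqrt{1.1} \approx 1.049\)), so diameter by itself would be expected to produce a modest speed-up in a uniform passive cable.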

      Reviewer #2 (Public Review):

      Summary:

      In this paper, Oláh and colleagues introduce new research data on the cellular and biophysical elements involved in transmission within the pyramidal circuits of the human neocortex. They gathered a comprehensive set of patch-clamp recordings from human and rat pyramidal neurons to compare how the temporal aspect of neuronal processing is maintained in the larger human neocortex. A broad range of experimental, theoretical, and computational methods are used, including two-photon guided dual whole-cell recordings, electron microscopy, and computational simulations of reconstructed neurons.

      Recordings from synaptically connected pyramidal neurons revealed longer intercellular path lengths within the human neocortex. Further, by using dual whole-cell recordings from soma-dendrite and soma-axon locations, they found that short latencies from soma to soma can be partly attributed to an increased propagation speed for synaptic potentials, but not for the propagation of action potentials along the axon.

      Next, in a series of extensive computational modeling studies focusing on the synaptic potentials, the authors observe that the short-latency within large human pyramidal neural circuits may have a passive origin. For a wide array of local synaptic input sites, the authors show that the conductance load of the dendrites, electrically coupled to a large diameter apical dendrite, affects the cable properties. The result is a relatively faster propagation of EPSPs in the human neuron.

      The manuscript is well-written and the physiological experiments and biophysical arguments are very well explained. I appreciated the in-depth theoretical steps for the simulations. That passive cable properties of the dendrites are causing a higher velocity in human dendrites is interesting but there is a disconnect between the experimental findings and the model simulations. Based on the present data the contribution of active membrane properties cannot be dismissed and deserves further experiments.

      See our response below

      Strengths:

      The authors present state-of-the-art 2P-guided dual whole-cell recordings in human neurons. In combination with detailed reconstructions, these approaches represent the next steps in unravelling the information processing in human circuits.

      The computational modeling based on cable theory and experimentally constrained simulations provides an excellent integrated view of the passive membrane properties.

      Weaknesses:

      There are smaller and larger issues with the statistical analyses of the experimental data which muddles the interim conclusions.

      That the cable properties alone are the main explanation for speeding the electrical signaling in human pyramidal neurons appears inconsistent with the experimental data.

      This is an excellent point. We indeed performed the analysis on passive cases only, highlighting (and now also ranking) the impact of the various morpho-electrical properties of the neurons on the differences in signal latency between human and rat. We did explore (not shown) the effect of active channels in the dendrites (including the h-current); as expected, the results strongly depend on channel density and the spatial distribution of the channels over the dendritic tree. As we do not know these parameters for the modelled cells, we decided to remain focused on the impact of passive/morphological parameters. We also note that the experimental results (pages 4-5 in the manuscript) show a minor contribution of the h-current, emphasizing that the passive properties play the main role in the differences between human and rat.

      Some of the electrophysiological experiments require further control experiments to make robust conclusions.

      Reviewer #3 (Public Review):

      Summary:

      This study indicates that connections across human cortical pyramidal cells have identical latencies despite a larger mean dendritic and axonal length between somas in the human cortex. A precise demonstration combining detailed electrophysiology and modeling indicates that this property is due to faster propagation of signals in proximal human dendrites. This faster propagation is itself due to a slightly thicker dendrite, a larger capacitive load, and stronger hyperpolarizing currents. Hence, the biophysical properties of human pyramidal cells are adapted such that they do not compromise information transfer speed.

      Strengths:

      The manuscript is clear and very detailed. The authors have experimentally verified a large number of aspects that could affect propagation speed and have pinpointed the most important one. This paper provides an excellent comparison of biophysical properties between rat and human pyramidal cells. Thanks to this approach a comprehensive description of the mechanisms underlying the acceleration of propagation in human dendrite is provided.

      Weaknesses:

      Several aspects having an impact on propagation speed are highlighted (dendritic diameter, ionic channels, capacitive load) and there is no clear ranking of their impact on signal propagation speed. It seems that the capacitive load plays a major role, much more than dendritic diameter for which only a 10% increase is observed across species. Both aspects actually indicate that there is an increase in passive signal propagation speed with bigger cells at least close to the soma. This suggests that bigger cells are mechanically more rapid. An intuitive reason why capacitive load increases speed would also help the reader follow the demonstration.

      We thank the referee for both these excellent points. In response to them:

      (i) We have now performed a new comprehensive statistical analysis that ranks the effects of the different morphological/cable factors on EPSP propagation. This analysis appears in Suppl. Tables 5 and 6, Fig. S16, and also in the main text as follows:

      To rank the impact of the various factors affecting EPSP propagation latency in human and rat neurons, we conducted a comprehensive statistical analysis using two complementary approaches: the generalized linear model (GLM) (Kiebel & Holmes, 2007) and SHAP (SHapley Additive exPlanations) (Lundberg & Lee, 2017) based on fitting a Gradient Tree Boosting model (Friedman, 2002). We began by fitting a GLM without interaction terms among the factors affecting EPSP latency (Suppl. Table 5). This enabled us to quantify the primary individual factors affecting EPSP propagation. Our analysis revealed the following ranking order: 1) the physical distance of synapses from the soma had the strongest effect; 2) species differences; 3) conductance load, as demonstrated by our “hybrid cells” manipulation; 4) radii of the apical dendrite, affecting the cables’ space constant, λ; and 5) the specific cable parameters, whose effect, as revealed when using per-cell fitted parameters versus uniform cable parameters, was minimal. We next performed a GLM analysis with interaction terms, showing that, as expected, there are significant interactions between the factors affecting EPSP latency (Suppl. Table 6). To further validate the above ranking while incorporating the interactions between the various factors affecting EPSP latency, we performed a SHAP analysis. Notably, even with interactions included, the ranking of the factors affecting signal propagation is aligned with the results from the analysis based on the GLM without interaction terms (see Fig. S16).

      (ii) As for the intuitive explanation requested by the referee, we added the following paragraph to the Discussion:
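
      To make the idea of ranking factors concrete, here is a minimal sketch on synthetic data. It is not the analysis pipeline used in the manuscript: it fits an ordinary least-squares model (a Gaussian GLM with identity link, no interaction terms) with NumPy only, and it replaces the gradient-boosted model with the closed-form attribution for a linear model with independent features, coefficient times the centred feature value. All variable names, effect sizes and data below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic, hypothetical predictors (not data from the study).
distance = rng.uniform(50, 400, n)            # synapse distance from soma (um)
species = rng.integers(0, 2, n).astype(float)  # 0 = rat, 1 = human
radius = rng.normal(1.0, 0.1, n)               # apical dendrite radius (um)

# Synthetic latency (ms) generated with known, made-up effect sizes plus noise.
latency = 0.004 * distance - 0.2 * species - 0.5 * radius + rng.normal(0, 0.05, n)

X = np.column_stack([distance, species, radius])
names = ["distance", "species", "radius"]


def rank_factors(X, y, names):
    """Fit a no-interaction linear model on z-scored predictors and rank them
    by mean absolute attribution |coef * z| (linear-model SHAP, assuming
    independent features)."""
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    design = np.column_stack([np.ones(len(y)), Xz])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    coefs = beta[1:]
    importance = np.abs(coefs[None, :] * Xz).mean(axis=0)
    order = np.argsort(importance)[::-1]
    return [(names[i], float(coefs[i]), float(importance[i])) for i in order]


ranking = rank_factors(X, latency, names)
for name, coef, imp in ranking:
    print(f"{name:>8}: standardized coef = {coef:+.3f}, mean |attribution| = {imp:.3f}")
```

      Standardizing the predictors first puts the coefficients on a common scale, so their magnitudes can be compared directly; with interaction terms or a nonlinear model, a tree-based SHAP analysis as described in the text would be the more appropriate tool.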

      The intuitive reason for this enhancement is that the large conductance load (the “leaky end” boundary conditions) more effectively “steals” the synaptic (axial) current (like water pouring faster into a large pool). The more mathematical intuition is that the large soma (sink) adds fast time constants to the system (see also related explanation in Fig. 4 in Eyal et al., 2014).

      We thank the editors for considering and revising our manuscript for publication in eLife. We appreciate the positive appreciation of the work and the critical points raised by the reviewers. We have responded in detail to all the excellent comments from all reviewers. We believe that these revisions have significantly improved the quality of our study.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      There are two points that could improve the reading experience of this nice manuscript. These should be easily addressed with minor re-phrasing.

      Credit to conduction velocity literature. Less widely known in the dendrite literature, in the axon literature, the relationship between propagation speed and process diameter is well established. I thought the two articles cited (Jack Noble Tsien and Agmon-Snir & Segev) were not as direct in the treatment of this relationship. The work of Stephen Waxman, for instance, made clear how axon diameter tightly controls propagation speed (see for instance the Scholarpedia entry by Swadlow and Waxman). In my opinion, this is a widely known piece of work, that is part of some introductory books to neuroscience. While the article does not claim they found this relationship, parts of the presentation are better understood if we ignore this well-known fact. I am referring to the abstract, intro, and the beginning of results where 'larger' is presented as synonymous with 'slower'. For instance 'to compensate for the increase neurons' size' (abstract) or 'the increase in size of dendrites and axons might come with a cost of longer signal propagation times' only makes sense if 'size' refers to spatial extent, not diameter.

      We thank the reviewer for this valid point; leaving out axon diameter references was not intentional. We have now added the suggested reference to our manuscript. In the size comparisons, we had only pointed out the obvious size differences between the cell body and the dendritic processes. We have reworded the sentences containing size comparisons.

      In Abstract (lines 1-6):

      Human-specific cognitive abilities depend on information processing in the cerebral cortex, where neurons are significantly larger, their processes are longer and sparser compared to rodents. We found that, in synaptically-connected layer 2/3 pyramidal cells (L2/3 PCs), soma-to-soma signal propagation delay is similar in humans and rodents. Thus, to compensate for the increase in neurons’ longer processes, membrane potential changes must propagate faster in human axons and/or dendrites.

      In section “Effect of dendritic thickness” in Results we have modified it as follows:

      The relationship between conduction velocity and axon diameter is well known for small myelinated and unmyelinated axons (Waxman and Bennett, 1972). Anatomical features of dendrites also have a major influence on signal propagation properties 5,19, thus …

      Waxman, S. G. and Bennett, M. V. L. Relative conduction velocity of small myelinated and nonmyelinated fibres in the central nervous system. Nature New Biol., 238, 217-219, 1972.

      Two or four dendritic factors? The study identifies two major dendritic factors influencing the propagation speed (diameter and load), however the end of the results highlights four factors. I did not understand how factor 2 was different than factor 1. Neither did I understand how factor 4 was different from the other factors. There seemed to be a little redundancy here that could be streamlined.

      We thank the reviewer for pointing this out. We have now changed the respective text and added the ranking statistics (see above) to assess the effect of the different parameters on signal propagation in dendrites.

      Microcircuits? The study found that the changes in speed arise from the dendrites rather than the axons, as such it seems it would be more precise to replace 'microcircuits' with 'dendrites'.

      We are thankful for this suggestion. We have changed the title to “Accelerated signal propagation speed in human neocortical dendrites”.

      Typos

      P3 line 24 'find significant difference the propagation'.

      P6 line 35 'how morphological differences' it would be useful to specify which morphological difference here.

      Corrected.

      Reviewer #2 (Recommendations For The Authors):

      (1) The statistical analyses should be changed. T-testing populations and comparing visual differences of differences ("human minus rats") is a common but egregious error in the field of neurosciences (see doi:10.1038/nn.2886). The conclusion that HCN channels "... do not by themselves explained the differences between the two species" (lines 174-176) is not compelling. The design of the experiments presented in Figure 3 is paired recordings and the addition of a blocker (ZD7288 or TTX cocktail). These are classic 2 x 2 factorial designs (species x drug). The authors will need to perform a repeated-measured analysis of variance (RM-ANOVA) and provide information on the interaction significance. Please revise the figures and improve statistical reporting. Post-hoc comparisons of the velocity populations are required to support the idea of whether h-channels are explaining the observed differences.

      Thank you for drawing our attention to this error. The statistical analysis of the pharmacological experiments was re-performed as suggested. After the 2-way ANOVA with repeated measures and Bonferroni post-hoc correction, we indeed find significant differences only in the control group, namely that the propagation speed of bAPs in human dendrites was significantly higher. The proposed statistical analysis demonstrates that the administration of ZD has no statistically significant effect on the propagation speed in human or rat dendrites. The treatment with the TTX cocktail resulted in a significant difference in signal propagation in humans but not in rodents. However, the trend is discernible, and the P value of 0.0588 is close to the widely accepted 0.05 threshold. After the TTX cocktail treatment, the speed of signal propagation did not differ significantly between the two species. However, on average, the human dendrites remained faster. These alterations in P-values do not affect our primary conclusions. The MS text has been modified accordingly.
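      To make the reviewer's "difference of differences" point concrete: when species is a between-cell factor and drug is a within-cell factor, the species × drug interaction in a 2 × 2 design can be tested as a two-sample t-test on the per-cell differences (drug minus control), and the interaction F statistic equals t squared. The sketch below computes only that t statistic and its degrees of freedom; the function name and the numbers are hypothetical and are not the authors' data, and obtaining a P value would additionally require the t distribution, which the Python standard library does not provide.

```python
from statistics import mean, variance


def interaction_t(diff_group_a, diff_group_b):
    """Pooled-variance t statistic comparing per-cell drug effects of two groups.

    Each input holds one value per cell: speed under drug minus speed in control.
    Returns (t, degrees_of_freedom); the interaction F equals t ** 2.
    """
    na, nb = len(diff_group_a), len(diff_group_b)
    if na < 2 or nb < 2:
        raise ValueError("each group needs at least two cells")
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(diff_group_a)
                  + (nb - 1) * variance(diff_group_b)) / df
    standard_error = (pooled_var * (1.0 / na + 1.0 / nb)) ** 0.5
    t = (mean(diff_group_a) - mean(diff_group_b)) / standard_error
    return t, df


# Hypothetical per-cell changes in propagation speed (m/s) after drug application.
human_diff = [-0.02, -0.05, -0.03, -0.04]
rat_diff = [-0.01, -0.02, 0.00, -0.01]
t_stat, dof = interaction_t(human_diff, rat_diff)
```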

      (2) Although ZD7288, in my opinion, influences the bAP (see point #1) the authors subsequently leave the h-current unblocked in the experiments in Figures 3D, E. Here, they use sodium, potassium, and calcium currents as well as synaptic conductances. I am puzzled why (in line 188) they claim the dendrites are "passive" although the data show h-currents are contributing to the shape of the bAP in human neurons. In line 196 they conclude voltage-gated conductances have a "minor" contribution and passive properties a main role. Please revise conclusions or provide better experimental support.

      Thank you for this point. We meant to refer to the state in which no action potential can be generated, although the word 'passive' might be misleading in this context; we have rephrased these sentences in the MS accordingly.

      (3) A major concern is the injection of an AP in voltage-clamp mode. Although this is the right choice and I'm in support of the experiment, it is technically challenging to space clamp the soma and fully recapitulate the speed and amplitude of a 100 mV depolarization. The voltage drop in peak amplitude as well as the increased delay between the baseline AP (current clamp) and AP in blocker conditions (voltage clamp) could be fully explained by switching between current- and voltage-clamp modes. In additional control experiments, the authors should add a second voltage follower electrode (CC) at the soma showing whether the authors can preserve the original AP (from CC) in VC/blocker condition. It may well be they need to adjust the injection protocol.

      Our experiments were designed to replicate the work of Stuart et al. (1994), in which they compared the attenuation of active and passive backpropagating signals. When they blocked Na+ channels with TTX they injected simulated action potentials in voltage-clamp mode. They concluded that TTX-sensitive Na+ channels cause somatic action potential entry into the dendritic compartment. They found a comparable attenuation of the backward propagating action potential in the dendrites in control conditions (~70 %). 

      We performed control recordings based on the reviewer’s suggestion (Author response image 1).

      Author response image 1.

      Injection of the previously recorded AP (blue) in VC mode produced a highly similar somatic AP in CC mode (orange). The slight temporal delay between the two signals is caused by the different positions of the pipettes on the cell body. The right panel shows the two peak-aligned APs plotted against each other, close to the blue ‘equality’ line. We concluded that the original AP is well preserved in the VC/blocker condition.

      (5) From the paragraph entitled "Modeling EPSP propagation in dendrites" and onwards the authors make countless conclusions based on theory and modelling results but without any statistical support. Multiple neurons are used thus it is rather straightforward to provide numerical support for the assertions. For example, but this is not an exhaustive list, how should we interpret that latency ranges are different (line 240, line 253) etc.? Or were the estimated Cm values of human and rat neurons (0.6 versus 1.1) significantly different? And if so, how does this align with the Cm estimates in the nucleated patch experiments?

      We thank the referee for this comment and have now added a set of statistical analyses. The results now appear throughout the theoretical part of the revised article, in particular with respect to Figs. 6&7, where we show that our various manipulations (e.g., hybrid vs. original cells) as well as the cable parameters (Cm, Rm) are significantly different between human and rat, whereas the membrane time constant is not. As for Cm in human, our limited sample size shows a significant difference between human and rat. Yet, the range of values for Cm that we found in our modeling study does fall within the experimental range reported in the present study.

      Minor

      Line 44. The "simulated EPSP" example in Figure 2C is not a command waveform for an EPSC. Line 526 in the methods states that also ramp currents were used. Please revise to clarify the main text.

      Thank you for bringing this discrepancy to our attention. In the experiments, we used ramp injections. We have made this clear in the main text as follows: ”... we tested orthodromic or forward propagating signal propagation velocity by injecting short-duration current ramps to simulate EPSP (sEPSP) signals in the dendrites and recorded the resultant subthreshold voltage response in the soma”

      Line 522. The authors state the recordings were all carried out "in current clamp mode" but detailed VC method information is lacking. Did they use series resistance compensation?

      We did not use series resistance compensation.

      Line 479 From which region(s) were human "neocortical slices" sampled? Please add this information.

      We have added regions of origin to the Methods section: frontal (n = 21), temporal (n = 20), parietal (n = 20), and occipital (n = 1).  

      Please show higher temporal resolution example traces, for example in Figure 3. Differences are at the micrometer scale, but APs are shown at the millisecond scale. Hard to judge the quality of the data. Showing the command potentials (inset Figure 3D, E) is misleading (see major point #3).

      In response to the reviewer's request, we have redrawn the example traces in Figure 3.

      Please check the labeling of figures. There is information missing. For example, in Figure 5 A to C I am missing information and the units of the axes.

      In the black plots on the right side of panels B and C, the y-axis shows the thickness measurements for the given dendrite stacked on top of each other and the x-axis shows the measurement values, the units for the x-axis are µm as mentioned in the figure legend.

      Line 981 "scalebars" should read "scale bars".

      Line 986 "bootstraped" should read "bootstrapped".

      Done.

      Are the dendritic diameters increased for all basal and apical higher-order branches? It is unclear how the model simulations were built on diameters of primary and higher-order branches.

      In our modelling study we took the actual diameter of the reconstructed PCs in both proximal and higher order branches. We did compare per-distance differences in diameter – but it is automatically incorporated into the computation of the basal load (“equivalent cables” in Figs 6&8).

      The velocity calculation for axonal propagation (yielding a ~0.9 m/s conduction velocity, Figure 2B) is incorrect. Using the peak of the action potentials between soma and axon misses the fact that action potentials start earlier and spatially distally from the soma in the axon. Please revise the calculation to include the temporal delay and actual distance travelled by the forward propagating action potential.

      Thank you for this question. We are aware that the AP is generated at the AIS and that it is located between the two recording electrodes and we have to take into account that the signal propagates from the AIS to the soma and this may shorten the delay in the system. To the best of our knowledge, there is no experimental evidence of the location of the AP generation site on the AIS in layer 2-3 pyramidal cells in the human neocortex, so we assumed that it is located 35 microns from the soma, and that the propagation speed from the AIS to the two directions is the same. Consequently, we have corrected our propagation velocity values as follows:

      “For the axon bleb recordings we assumed that the axon initial segment (AIS) of the cells is 35 µm from the axon hillock, and the APs propagate forward (to the bleb) and backward (to the soma) at the same speed. For the correction of the AIS we used the following formula: (2)

      where vcorr is the corrected propagation speed for AIS position, l is the axonal distance between the soma and the axon bleb, t is the latency between the two measuring points, ais is the assumed position of the AIS alongside the axon (35 µm).”
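      The formula itself did not survive in this copy of the response, so the following is only a sketch of what the stated assumptions imply, not necessarily the authors' Eq. 2. If the AP starts at a distance ais from the soma and travels at the same speed v in both directions, it reaches the soma after ais/v and the bleb after (l − ais)/v. The measured latency is the difference, t = (l − 2·ais)/v, which gives v = (l − 2·ais)/t. The function name and the example numbers below are hypothetical.

```python
def corrected_velocity(l_um, t_ms, ais_um=35.0):
    """Propagation speed in m/s, corrected for an assumed AP initiation site.

    l_um   -- axonal distance between soma and bleb (micrometres)
    t_ms   -- latency between somatic and bleb AP peaks (milliseconds)
    ais_um -- assumed distance of the initiation site from the soma (micrometres)
    """
    if t_ms <= 0:
        raise ValueError("latency must be positive")
    if l_um <= 2.0 * ais_um:
        # Otherwise the bleb peak would not lag the somatic peak.
        raise ValueError("bleb must lie more than 2 * ais from the soma")
    # The AP covers (l - ais) to the bleb and ais back to the soma, so the
    # latency corresponds to a path difference of l - 2 * ais.
    # micrometres per millisecond equals mm/s; multiply by 1e-3 for m/s.
    return (l_um - 2.0 * ais_um) / t_ms * 1e-3


# Hypothetical recording: bleb 170 µm from the soma, 0.1 ms latency.
uncorrected = 170.0 / 0.1 * 1e-3
corrected = corrected_velocity(170.0, 0.1)
```

      Under these assumptions the corrected speed is always lower than the uncorrected l/t estimate, because the soma-to-bleb distance overstates the path difference that accounts for the measured latency.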

      What explains the strongly attenuated axonal action potential at the bleb? Is this representative?

      The strongly attenuated axonal action potential at the bleb can be explained by a few key factors:

      (1) Membrane Integrity: Bleb formation often indicates some level of membrane damage or alteration. This can disrupt the normal ionic gradients across the membrane, leading to a failure in generating or propagating action potentials effectively.

      (2) Current Leakage: Bleb formation may create additional pathways for ion leakage, which can dissipate the electrical current that would normally propagate the action potential. This leakage reduces the overall amplitude of the action potential.

      Line 275 "To our delight", please rephrase.

      Corrected.

      Reviewer #3 (Recommendations For The Authors):

      - In Figure 1, the number of cells used to assess intersomatic distance is quite low. A larger number of neuron pairs should be analyzed to be more representative. Or at least an explanation of why such a low sampling can be conclusive.

      We appreciate the reviewer’s concerns on sample sizes of the first set of experiments, where the anatomical pathways were measured through the synapses of coupled cells with electrophysiological recordings. We acknowledge that this is a limitation of our study. However, in this series of experiments, we simply wanted to experimentally confirm already known results which consisted of two parts: first, that in humans the dendrites and axons of neurons are longer, and second, that they have the same time delay in terms of synaptic latency. 

      The reported similarity in synaptic latencies is consistent with the results of a recent study by Campagnola et al. (2022) showing that EPSP latencies of local connections between layer 2/3 pyramidal cells are in the same range in humans and mice (human median latency = 1.73 ms vs. mouse median latency = 1.49 ms). We came to the same conclusion in our previous work where we compared pyramidal basket cell synaptically coupled pairs in human and rat pairs (Molnár et al. 2016). 

      On the other hand, we report interspecific differences in cable pathways from soma to soma, again consistent with the literature suggesting that the length of pyramidal neural processes is longer in humans than in rodents (see Supplementary Figure 1 and e.g. Berg et al. 2021).

      From a practical point of view the collection of experimental data in this hard won experiment is particularly difficult. The electrophysiological recording of a connected pair with an appropriate pre- and postsynaptic series resistance, where human tissue samples are limited, is the first step here. To obtain information about the path of the signals between pre- and postsynaptic cells, an anatomical reconstruction is required. This requires a) a high-quality recovery of postsynaptic dendrites and presynaptic axons, b) successful tracing of all potential contact points between presynaptic axons and postsynaptic dendrites back to the pre- and postsynaptic soma. The difficulty of the latter point in particular arises from the fact that parts of the presynaptic axonal arbor are myelinated and the success of biocytin-based tracing depends on the length of the myelinated axon branches. The success/failure of complete axonal tracing only becomes apparent at the end of these efforts.

      - The author should provide an intuitive explanation of why capacitive load accelerates propagation in the dendrite.

      See answer above  

      - The author should more clearly rank the contribution of each difference between rat and human neurons. The 10% increase in dendritic diameter which affects velocity only via a square root seems a very weak contribution. This should be clarified.

      We have now added a set of statistical methods to perform such a ranking in the theoretical part of this study, as described above (and in a new paragraph, attached above) in the revised article.

      References

      Eyal, G., Mansvelder, H. D., de Kock, C. P. J., & Segev, I. (2014). Dendrites impact the encoding capabilities of the axon. Journal of Neuroscience, 34(24), 8063–8071. https://doi.org/10.1523/JNEUROSCI.5431-13.2014

      Friedman, J. H. (2002). Stochastic gradient boosting. In Computational Statistics & Data Analysis (Vol. 38). www.elsevier.com/locate/csda

      Kiebel, S. J., & Holmes, A. P. (2007). The General Linear Model. In K. Friston, J. Ashburner, S. Kiebel, T. Nichols, & P. William (Eds.), Statistical Parametric Mapping (pp. 101–125). Academic Press.

      Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. Proceedings of the 31st International Conference on Neural Information Processing Systems, 4768–4777.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1:

      Comment#1: Ren et al developed a novel computational method to investigate cell evolutionary trajectory for scRNA-seq samples. This method, MGPfact, estimates pseudotime and potential branches in the evolutionary path by explicitly modeling the bifurcations in a Gaussian process. They benchmarked this method using synthetic as well as real-world samples and showed superior performance for some of the tasks in cell trajectory analysis. They further demonstrated the utilities of MGPfact using single-cell RNA-seq samples derived from microglia or T cells and showed that it can accurately identify the differentiation timepoint and uncover biologically relevant gene signatures. Overall I think this is a useful new tool that could deliver novel insights for the large body of scRNA-seq data generated in the public domain. The manuscript is written in a logical way and most parts of the method are well described.

      Thank you for reviewing our manuscript and for your positive feedback on MGPfact. We are pleased that you find it useful for identifying differentiation timepoints and uncovering gene signatures. We will continue to refine MGPfact and explore its applications across diverse datasets. Your insights are invaluable, and we appreciate your support.

      Comment#2: Some parts of the methods are not clear. It should be outlined in detail how pseudo time T is updated in Methods. It is currently unclear either in the description or Algorithm 1.

      We thank the reviewer for this comment. We have added a description of how pseudotime T is obtained between lines 138 and 147 in the article. In brief, the pseudotime of MGPfact is inferred through Gaussian process regression on the downsampled single-cell transcriptomic data. Specifically, T is treated as a continuous variable representing the progression of cells through the differentiation process. We describe the relationship between pseudotime and expression data using the formula:

      Where f(T) is a Gaussian Process (GP) with covariance matrix S, and Ɛ represents the error term. The Gaussian process is defined as:

      Where is the variance set to 1e-6.

      During inference, we update the pseudotime by maximizing the posterior likelihood. Specifically, the posterior distribution of pseudotime T can be represented as:

      Where is the likelihood function of the observed data Y*, and is the prior distribution of the Gaussian process. This posterior distribution integrates the observed data with model priors, enabling inference of pseudotime and trajectory simultaneously. Due to the high autocorrelation of  in the posterior distribution, we use Adaptive Metropolis within Gibbs (AMWG) sampling (Roberts and Rosenthal, 2009; Tierney, 1994). Other parameters are estimated using the more efficient SLICE sampling technique (Neal, 2003).
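      The displayed formulas and several inline symbols are missing from this copy of the response. Purely as a reading aid, one form consistent with the surrounding wording is sketched below. The notation is an assumption and is not taken from the manuscript: in particular, the text does not say which term the variance of 1e-6 belongs to, and it is attached to the error term here only for illustration.

```latex
% Assumed notation; the original formulas are not recoverable from this text.
\begin{align}
  Y^{*} &= f(T) + \varepsilon, \\
  f(T) &\sim \mathcal{GP}(0, S), \qquad
  \varepsilon \sim \mathcal{N}(0, \sigma^{2} I), \quad \sigma^{2} = 10^{-6}, \\
  P(T \mid Y^{*}) &\propto P(Y^{*} \mid T)\, P(T).
\end{align}
```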

      Comment#3: There should be a brief description in the main text of how synthetic data were generated, under what hypothesis, and specifically how bifurcation is embedded in the simulation.

      Thank you for the reviewers' comments. We have added descriptions regarding the synthetic dataset in the methods section. The revised content is from line 487 to 493:

      “The synthetic datasets were generated using four simulators: dyngen (Saelens et al., 2019), dyntoy (Saelens et al., 2019), PROSSTT (Papadopoulos et al., 2019), and Splatter (Zappia et al., 2017), each modeling different trajectory topologies such as linear, branching, and cyclic. Splatter simulates branching events by setting expression states and transition probabilities, dyntoy generates random expression gradients to reflect dynamic changes, and dyngen focuses on complex branching structures within gene regulatory networks.”

      Comment#4: Please explain what the abbreviations mean at their first occurrence.

      We appreciate the reviewers' feedback. We have thoroughly reviewed the entire manuscript and made sure that all abbreviations have had their full forms provided upon their first occurrence.

      Comment#5: In the benchmark analysis (Figures 2/3), it would be helpful to include a few trajectory plots of the real-world data to visualize the results and to evaluate the accuracy.

      We appreciate the reviewer's feedback. To more clearly demonstrate the performance of MGPfact, we selected three representative cases from the dataset for visual comparison. These cases represent different types of trajectory structures: linear, bifurcation, and multifurcation. The revised content is between line 220 and 226.

      As shown in Supplementary Fig. 5, it is evident that MGPfact excels in capturing main developmental paths and identifying key bifurcation points. In the linear trajectory structure, MGPfact accurately predicted the linear structure without bifurcation events, showing high consistency with the ground truth (overall = 0.871). In the bifurcation trajectory structure, MGPfact accurately captured the main bifurcation event (overall = 0.636). In the multifurcation trajectory structure, although MGPfact predicted only one bifurcation point, its overall structure remains close to the ground truth, as evidenced by its high overall score (overall = 0.566). Overall, MGPfact demonstrates adaptability and accuracy in reconstructing various types of trajectory structures.

      Comment#6: It is not clear how this method selects important genes/features at bifurcation. This should be elaborated on in the main text.

      We thank the reviewer for this comment. To enhance understanding, we have added detailed descriptions of gene selection in the main text and appendix, specifically from lines 150 to 161. In brief, MGPfact employs a Gaussian process mixture model to infer cell fate trajectories and identify independent branching events. We calculate load matrices using formulas 1 and 14 to assess each gene's contribution to the trajectories. Genes with an absolute weight greater than 0.05 are considered predominant in specific branching processes. Subsequently, SCENIC (Aibar et al., 2017; Bravo González-Blas et al., 2023) analysis was conducted to further infer the underlying regulons and annotate the biological processes of these genes.
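      The thresholding step described above can be sketched as follows. This only illustrates the "absolute weight greater than 0.05" rule: the function name, the branch labels, the gene names, and the weights are hypothetical, and the computation of the load matrices themselves (formulas 1 and 14 of the manuscript) is not reproduced here.

```python
def select_genes(loadings, threshold=0.05):
    """Return, per branching process, the genes whose |weight| exceeds threshold.

    loadings maps a branch label to a {gene: weight} dictionary.
    """
    selected = {}
    for branch, weights in loadings.items():
        # Strict inequality: a weight of exactly +/- threshold is not selected.
        selected[branch] = sorted(
            gene for gene, weight in weights.items() if abs(weight) > threshold
        )
    return selected


# Hypothetical loadings for two independent bifurcation processes.
example_loadings = {
    "bifurcation_1": {"geneA": 0.12, "geneB": -0.08, "geneC": 0.01},
    "bifurcation_2": {"geneA": 0.02, "geneB": 0.05, "geneD": -0.30},
}
predominant = select_genes(example_loadings)
```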

      Comment#7: It is not clear how survival analysis was performed in Figure 5. Specifically, were critical confounders, such as age, clinical stage, and tumor purity controlled?

      To evaluate the predictive and prognostic impacts of the selected genes, we utilized the Cox multivariate regression model, where the effects of relevant covariates, including age, clinical stage, and tumor purity, were adjusted. We then conducted the Kaplan-Meier survival analysis again to ensure the reliability of the results. The revisions mainly include the following sections:

      (1) We modified the description of adjusting for confounding factors in the survival analysis, from line 637 to 640:

      “To adjust for possible confounding effects, the relevant clinical features including age, sex and tumor stage were used as covariates. The Cox regression model was implemented using R-4.2 package “survival”. And we generated Kaplan-Meier survival curves based on different classifiers to illustrate differences in survival time and report the statistical significance based on Log-rank test.”

      (2) We updated the images in the main text regarding the survival analysis, including Fig. 5a-b, Fig. 6c, and Supplementary Fig. 8e.
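      For readers unfamiliar with the Kaplan-Meier curves mentioned above, the product-limit estimator can be sketched in a few lines. The authors state that they used the R package “survival”; the plain-Python version below is only an illustration with hypothetical data, and it covers neither the Cox regression nor the log-rank test.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times  -- follow-up time of each patient
    events -- 1 if the event was observed at that time, 0 if censored
    Returns a list of (time, survival_probability) at each event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            # Patients censored at time t still count as at risk at t.
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed
    return curve


# Hypothetical follow-up times (months) and event indicators.
km_curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```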

      Comment#8: I recommend that the authors perform some sort of 'robustness' analysis for the consensus tree built from the bifurcation Gaussian process. For example, subsample 80% of the cells to see if the bifurcations are similar between each bootstrap.

      We appreciate the reviewers' feedback. We performed a robustness analysis of the consensus tree using 100 training datasets. This involved sampling the original data at different proportions, and then calculating the topological similarity between the consensus trajectory predictions of MGPfact and those without sampling, using the Hamming-Ipsen-Mikhailov (HIM) metric. A higher score indicates greater robustness. The relevant figure is in Supplementary Fig. 4, and the description is in the main text from line 177 to 182.

      The results indicate that the consensus trajectory predictions based on various sampling proportions of the original data maintain a high topological similarity with the unsampled results (HIM<sub>mean</sub>=0.686). This demonstrates MGPfact’s robustness and generalizability under different data conditions, hence the capability of capturing bifurcative processes in the cells’ trajectory.

      Reviewer #2:

      Comment#1: The authors present MGPfact<sup>XMBD</sup>, a novel model-based manifold-learning framework designed to address the challenges of interpreting complex cellular state spaces from single-cell RNA sequences. To overcome current limitations, MGPfact<sup>XMBD</sup> factorizes complex development trajectories into independent bifurcation processes of gene sets, enabling trajectory inference based on relevant features. As a result, it is expected that the method provides a deeper understanding of the biological processes underlying cellular trajectories and their potential determinants. MGPfact<sup>XMBD</sup> was tested across 239 datasets, and the method demonstrated similar to slightly superior performance in key quality-control metrics to state-of-the-art methods. When applied to case studies, MGPfact<sup>XMBD</sup> successfully identified critical pathways and cell types in microglia development, validating experimentally identified regulons and markers. Additionally, it uncovered evolutionary trajectories of tumor-associated CD8+ T cells, revealing new subtypes with gene expression signatures that predict responses to immune checkpoint inhibitors in independent cohorts. Overall, MGPfact<sup>XMBD</sup> represents a relevant tool in manifold learning for scRNA-seq data, enabling feature selection for specific biological processes and enhancing our understanding of the biological determinants of cell fate.

      Thank you for your thoughtful review of our manuscript. We are thrilled to hear that you find MGPfact<sup>XMBD</sup> beneficial for exploring cellular evolutionary paths in scRNA-seq data. Your insights are invaluable, and we look forward to incorporating them to further enrich our study. Thank you once again for your support and constructive feedback.

      Comment#2: How the methods compare with existing Deep Learning based approaches such as TIGON is a question mark. If a comparison would be possible, it should be conducted; if not, it should be clarified why.

      We appreciate the reviewer's comments. We have added a comparison with the scTour (Li, 2023) and TIGON (Sha, 2024) methods.

      It is important to note that the encapsulation and comparison of MGPfact are based on traditional differentiation trajectory construction. Saelens et al. established a systematic evaluation framework that categorizes differentiation trajectory structures into topological subtypes such as linear, bifurcation, multifurcation, graph, and tree, focusing on identifying branching structures in the cell differentiation process (Saelens et al., 2019). The scTour and TIGON methods mentioned by the reviewer are primarily used for estimating RNA velocity, focusing on continuous temporal evolution rather than explicit branching structures, and do not explicitly model branches. Therefore, we considered the predictions of these two methods as linear trajectories and compared them with MGPfact. While scTour explicitly estimates pseudotime, TIGON uses the concept of "growth," which is analogous to pseudotime, so we made the necessary adaptations.

      Author response image 1 shows that within this framework, compared to scTour (overall<sub>mean</sub>=0.448) and TIGON (overall<sub>mean</sub>=0.263), MGPfact still maintains a relatively high standard (overall<sub>mean</sub>=0.534). This indicates that MGPfact has a significant advantage in accurately capturing branching structures in cell differentiation, especially in applications where explicit modeling of branches is required.

      Author response image 1.

      Comparison of MGPfact with scTour and TIGON in trajectory inference performance across 239 test datasets. a. Overall scores; b. F1<sub>branches</sub>; c. HIM; d. cor<sub>dist</sub>; e. wcor<sub>features</sub>. All results are color-coded based on the trajectory types, with the black line representing the mean value. The “Overall” assessment is calculated as the geometric mean of all four metrics.
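      The “Overall” score described in the legend is a geometric mean of the four metrics, which can be sketched as below. The function name and the metric values are hypothetical and do not correspond to any dataset in the comparison.

```python
import math


def overall_score(metrics):
    """Geometric mean of the metric values in a {name: value} dictionary."""
    values = list(metrics.values())
    if not values:
        raise ValueError("at least one metric is required")
    if any(v < 0 for v in values):
        raise ValueError("metrics must be non-negative")
    return math.prod(values) ** (1.0 / len(values))


# Hypothetical scores for one method on one dataset.
example_metrics = {
    "F1_branches": 0.5,
    "HIM": 0.8,
    "cor_dist": 0.6,
    "wcor_features": 0.4,
}
example_overall = overall_score(example_metrics)
```

      Unlike an arithmetic mean, the geometric mean is pulled strongly toward the weakest metric, and a single metric of zero gives an overall score of zero.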

      Comment#3: Missing Methods:

      - The paper lacks a discussion of Deep Learning approaches for bifurcation analysis. e.g. scTour, Tigon.

      - I am missing comments on methods such as CellRank, and alternative approaches to delineate a trajectory.

      We thank the reviewer for these comments.

      (1) As mentioned in our response to Comment#2, the scTour and TIGON methods are primarily used for estimating RNA velocity, focusing on continuous temporal evolution rather than explicit branching structures, and they do not explicitly model branches. We therefore treated the predictions of these two methods as linear trajectories and compared them with MGPfact. The relevant description and discussion are provided in that response.

      (2) We have added a description of RNA velocity estimation methods (scTour, TIGON, CellRank) in the introduction section. The revised content is from line 66 to 71:

      “Moreover, recent studies based on RNA velocity has provided insights into cell state transitions. These methods measure RNA synthesis and degradation rates based on the abundance of spliced and unspliced mRNA, such as CellRank (Lange et al., 2022). Nevertheless, current RNA velocity analyses are still unable to resolve cell-fates with complex branching trajectory. Deep learning methods such as scTour (Li, 2023) and TIGON (Sha, 2024) circumvent some of these limitations, offering continuous state assumptions or requiring prior cell sampling information.”

      Comment#4: Impact of MURP:

      The rationale for using MURP is well-founded, especially for trajectory definition. However, its impact on the final results needs evaluation.

      How does the algorithm compare with a random subselection of cells or the entire cell set?

      Thank you for the comments. We fully agree that MURP is crucial in trajectory prediction. As a downsampling method, MURP is specifically designed to address noise issues in single-cell data by dividing the data into several subsets, thereby maximizing noise reduction while preserving the main structure of biological variation (Ren et al., 2022). In MGPfact, MURP typically reduces the data to fewer than 100 downsampled points, preserving the core biological structure while lowering computational complexity. To assess MURP's impact, we conducted experiments by randomly selecting 20, 40, 60, 80, and 100 cells for trajectory inference. These results were mapped back to the original data using the KNN graph structure for final predictions, which were then compared with the MURP downsampling results. Supplementary results can be found in Supplementary Fig. 3, with additional descriptions in the main text from line 170 to 176.

      The results indicate that trajectory inference using randomly sampled cells has significantly lower prediction accuracy compared to that using MURP. This is particularly evident in branch assignment (F1<sub>branches</sub>) and correlation cor<sub>dist</sub>, where the average levels decrease by 20.5%-64.9%. In contrast, trajectory predictions using MURP for downsampling show an overall score improvement of 5.31%-185%, further highlighting MURP's role in enhancing trajectory inference within MGPfact.
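
      The random-subsampling control described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the MGPfact implementation: the function name `map_back_by_knn`, the Euclidean distance in a low-dimensional embedding, the averaging over the k nearest sampled cells, and the simulated data are all hypothetical. The pseudotime of the sampled cells stands in for the output of trajectory inference on the subsample.

```python
import numpy as np

def map_back_by_knn(embedding, sampled_idx, sampled_pseudotime, k=3):
    """Give every cell the mean pseudotime of its k nearest sampled cells.

    embedding          : (n_cells, n_dims) array, e.g. PCA coordinates (assumed)
    sampled_idx        : indices of the randomly selected cells
    sampled_pseudotime : pseudotime inferred for the selected cells
    """
    anchors = embedding[sampled_idx]
    # Euclidean distance from every cell to every sampled cell: (n_cells, n_sampled)
    dists = np.linalg.norm(embedding[:, None, :] - anchors[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return sampled_pseudotime[nearest].mean(axis=1)

# Simulated data (hypothetical): cells ordered along one latent time axis.
rng = np.random.default_rng(0)
n_cells = 500
true_time = rng.uniform(0.0, 1.0, n_cells)
embedding = np.column_stack([true_time, np.sin(3.0 * true_time)])
embedding += rng.normal(0.0, 0.01, embedding.shape)

# Randomly select 40 cells, one of the sample sizes mentioned in the response.
sampled_idx = rng.choice(n_cells, size=40, replace=False)
sampled_pseudotime = true_time[sampled_idx]  # stand-in for inferred pseudotime

mapped = map_back_by_knn(embedding, sampled_idx, sampled_pseudotime, k=3)
agreement = np.corrcoef(mapped, true_time)[0, 1]
```

      Repeating the call with sample sizes of 20, 40, 60, 80, and 100 gives one prediction per setting, each of which can then be scored against the reference ordering.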

      Comment#5: What is the impact of the number of components selected?

      Thank you for the comments. In essence, MGPfact consists of two main steps: 1) trajectory inference; 2) calculation of factorized scores and identification of high-weight genes. After step 1, MGPfact estimates parameters such as pseudotime T and bifurcation points B.  In step 2, we introduce a rotation matrix to obtain factor scores W<sub>l</sub>  for each trajectory l by rotating Y*.

      For all trajectories,

      where e<sub>l</sub> is the error term for the l-th trajectory. The number of features in Y* must match the dimensions of the rotation matrix R to ensure that the factorized score matrix W contains factor scores for all trajectories, achieving effective feature representation and interpretation in the model.

      Additionally, to further illustrate the impact of the number of principal components (PCs) on model performance in step 1, we conducted additional experiments. We used 3 PCs as the default and adjusted the number to evaluate changes from this baseline. As shown in Author response image 2, setting the number of PCs to 1 significantly decreases the overall performance score (overall<sub>mean</sub>=0.363), as well as the wcor<sub>features</sub> and cor<sub>dist</sub> metrics. In contrast, increasing the number of PCs does not significantly affect the metrics. It should be noted that the number of components used should be determined by the intrinsic biological characteristics of the cell fate determination process. Our experiment, based on a limited number of datasets, may not represent more complex scenarios in other cell types.

      Author response image 2.

      Robustness testing of the number of MURP PCA components on 100 training datasets. With the number of principal components (PCs) set to 3 by default, we tested the impact of different numbers of components (1-10) on the prediction results. In all box plots, the asterisk represents the mean value, while the whiskers extend to the farthest data points within 1.5 times the interquartile range. Significance is denoted as follows: not annotated indicates non-significant; * P < 0.05; ** P < 0.01; *** P < 0.001; two-sided paired Student’s t-tests.

      Comment#6: Please comment on the selection of the kernel functions (rbf and polynomial) and explain why other options were discarded.

      Thank you for the comments. We have added a description regarding the selection of radial basis functions and polynomial kernels in lines 126-130. As the reviewers mentioned, the choice of kernel functions is crucial in the MGPfact analysis pipeline for constructing the covariance matrix of the Gaussian process. We selected the radial basis function (RBF) kernel and the polynomial kernel to balance capturing data complexity and computational efficiency. The RBF kernel is chosen for its ability to effectively model smooth functions and capture local variations in the data, making it well-suited to the continuous and smooth characteristics of biological processes; its hyperparameters offer modeling flexibility. The polynomial kernel is used to capture more complex nonlinear relationships between input features, with its hyperparameters also allowing further customization of the model. In contrast, other complex kernels, such as Matérn or spectral kernels, were omitted due to their interpretability challenges and the risk of overfitting with limited data. However, as suggested by the reviewers, we will consider and test the impact of other kernel functions on the covariance matrix of the Gaussian process and their role in trajectory inference in our subsequent phases of algorithm design.
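
      To make the two kernel choices concrete, the sketch below builds a covariance matrix over pseudotime from an RBF kernel and a polynomial kernel. It is illustrative only: the parameter names (`lengthscale`, `variance`, `degree`, `offset`) and the choice to sum the two kernels are assumptions, since the response does not state how MGPfact parameterizes or combines them.

```python
import numpy as np

def rbf_kernel(t, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel: smooth functions with local variation."""
    diff = t[:, None] - t[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)

def polynomial_kernel(t, degree=2, offset=1.0, variance=1.0):
    """Polynomial kernel: global nonlinear trends of a fixed degree."""
    return variance * (np.outer(t, t) + offset) ** degree

# Pseudotime values for five hypothetical downsampled points.
t = np.linspace(0.0, 1.0, 5)

# A sum of valid kernels is itself a valid kernel (combination assumed here).
S = rbf_kernel(t, lengthscale=0.3) + polynomial_kernel(t, degree=2)

# A covariance matrix must be symmetric and positive semi-definite.
min_eigenvalue = np.linalg.eigvalsh(S).min()
```

      The lengthscale controls how quickly correlation decays with pseudotime distance, while the degree and offset of the polynomial kernel control how strongly global trends are expressed.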

      Comment#7: What is the impact of the Pseudotime method used initially? This section should be expanded with clear details on the techniques and parameters used in each analysis.

      We are sorry for the confusion. We have added a description of how pseudotime T is obtained between lines 138 and 147 of the main text. The specific hyperparameters involved in the model and their prior settings are detailed in the supplementary information.

      In brief, the pseudotime and related topological parameters of the bifurcative trajectories in MGPfact are inferred by Gaussian process regression from downsampled single-cell transcriptomic data (MURP). Specifically, T is treated as a continuous variable representing the progression of cells through the differentiation process. We describe the relationship between pseudotime and expression data as:

      where f(T) is a Gaussian Process (GP) with covariance matrix S, and ε represents the error term. The Gaussian process is defined as:

      where the variance is set to 1e-6. During inference, we update the pseudotime by maximizing the posterior likelihood. Specifically, the posterior distribution of pseudotime is obtained by combining the observed data Y* with the prior distribution of the Gaussian process model.
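
      For orientation, a standard Gaussian process regression model consistent with the definitions above can be written as follows. The zero-mean prior, the Gaussian form of the error term, and the symbol σ² for the variance fixed at 1e-6 are assumptions made for illustration; they are not notation confirmed by this response.

```latex
Y^{*} = f(T) + \varepsilon, \qquad
f(T) \sim \mathcal{GP}(0,\, S), \qquad
\varepsilon \sim \mathcal{N}\!\left(0,\, \sigma^{2} I\right), \quad \sigma^{2} = 10^{-6}
```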

      We use the Markov Chain Monte Carlo method for parameter estimation, particularly employing the adaptive Metropolis-within-Gibbs (AMWG) sampling to handle the high autocorrelation of pseudotime.
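
      As a rough illustration of the sampling scheme, the sketch below implements adaptive Metropolis-within-Gibbs in the style of Roberts and Rosenthal (2009): each coordinate has its own proposal scale, which is nudged after every batch of 50 iterations toward an acceptance rate of about 0.44. The target here is a toy correlated Gaussian, not the MGPfact posterior, and the function name `amwg` and its arguments are hypothetical.

```python
import numpy as np

def amwg(log_post, x0, n_iter=5000, batch=50, target=0.44, seed=0):
    """Adaptive Metropolis-within-Gibbs with per-coordinate proposal scales."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    d = x.size
    log_sd = np.zeros(d)      # log proposal standard deviation per coordinate
    accepted = np.zeros(d)    # acceptances within the current batch
    samples = np.empty((n_iter, d))
    current = log_post(x)
    for it in range(n_iter):
        for j in range(d):    # update one coordinate at a time
            proposal = x.copy()
            proposal[j] += np.exp(log_sd[j]) * rng.normal()
            candidate = log_post(proposal)
            if np.log(rng.uniform()) < candidate - current:
                x, current = proposal, candidate
                accepted[j] += 1
        samples[it] = x
        if (it + 1) % batch == 0:
            n_batch = (it + 1) // batch
            delta = min(0.01, n_batch ** -0.5)   # diminishing adaptation step
            rate = accepted / batch
            log_sd += np.where(rate > target, delta, -delta)
            accepted[:] = 0
    return samples, np.exp(log_sd)

# Toy target (assumed): zero-mean bivariate Gaussian with correlation 0.5.
precision = np.linalg.inv(np.array([[1.0, 0.5], [0.5, 1.0]]))

def log_post(x):
    return -0.5 * x @ precision @ x

samples, proposal_sd = amwg(log_post, x0=[3.0, -3.0])
posterior_mean = samples[1000:].mean(axis=0)   # discard burn-in
```

      Updating coordinates one at a time lets each proposal scale adapt separately, which is the usual motivation for this scheme when some parameters mix much more slowly than others.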

      Comment#8: Enhancing Readability: For clarity, provide intuitive descriptions of each evaluation function used in simulated and real data. The novel methodology performs well for some metrics but less so for others. A clear understanding of these measurements is essential.

      To address the concern of readability, we have added descriptions of the five evaluation metrics in the methodology section (Benchmarking MGPfact to state-of-the-art methods) in lines 494 to 515. Additionally, we have included a summary and discussion of these metrics in the conclusion section in lines 214 to 240 to help readers better understand the significance and impact of these measurements.

      (1) In brief, the Hamming-Ipsen-Mikhailov (HIM) distance measures the similarity between topological structures, combining the normalized Hamming distance and the Ipsen-Mikhailov distance, which focus on edge length differences and degree distribution similarity, respectively. The F1<sub>branches</sub> is used to assess the accuracy of a model's branch assignment via Jaccard similarity between branch pairs. In trajectory inference, cor<sub>dist</sub> quantifies the similarity of inter-cell distances between predicted and true trajectories, evaluating the accuracy of cell ordering. The wcor<sub>features</sub> assesses the similarity of key features through weighted Pearson correlation, capturing biological variation. The Overall score is calculated as the geometric mean of these metrics, providing an assessment of overall performance.

      (2) For MGPfact and the other seven methods included in the comparison, each has its own focus. MGPfact specializes in factorizing complex cell trajectories using Gaussian process mixture models, making it particularly capable of identifying bifurcation events. Therefore, it excels in the accuracy of branch partitioning and similarity of trajectory topology. Among other methods, scShaper (Smolander et al., 2022) and TSCAN (Ji and Ji, 2016) are more suited for generating linear trajectories and excel in linear datasets, accurately predicting pseudotime. The Monocle series, as typical representatives of tree methods, effectively capture complex topologies and are suitable for analyzing cell data with diversified differentiation paths.
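
      The Overall score described in point (1) can be sketched as a geometric mean of the four component metrics. The function name `overall_score` is hypothetical, and the sketch assumes that every metric, including HIM, has already been expressed as a score in [0, 1] where higher is better.

```python
import numpy as np

def overall_score(him, f1_branches, cor_dist, wcor_features):
    """Geometric mean of four metric scores, each assumed to lie in [0, 1]."""
    scores = np.array([him, f1_branches, cor_dist, wcor_features], dtype=float)
    if np.any(scores < 0.0) or np.any(scores > 1.0):
        raise ValueError("all metric scores must lie in [0, 1]")
    return float(np.prod(scores) ** (1.0 / scores.size))

balanced = overall_score(0.5, 0.5, 0.5, 0.5)
one_failure = overall_score(0.9, 0.0, 0.9, 0.9)
```

      Because the geometric mean multiplies the components, a method that fails completely on one metric receives an Overall score of zero regardless of the others, which is why a method can rank well on individual metrics yet poorly overall.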

      Comment#9: Microglia Analysis: In Figures 3A-C, the genes mentioned in the text for each bifurcation do not always match those shown in the panels. Please confirm this.

      Thank you for pointing this out. We have carefully reviewed the article and corrected the error where the genes shown in the figures did not correspond to the descriptions in the article. The specific corrections have been made between lines 257 and 264:

      “The first bifurcation determines the differentiated cell fates of PAM and HM, which involves a set of notable marker genes of both cell types, such as Apoe, Selplg (HM), and Gpnmb (PAM). The second bifurcation determines the proliferative status, which is crucial for the development and function of PAM and HM (Guzmán, n.d.; Li et al., 2019). The genes affected by the second bifurcation are associated with cell cycle and proliferation, such as Mki67, Tubb5, Top2a. The third bifurcation influences the development and maturity of microglia, of which the highly weighted genes, such as Tmem119, P2ry12, and Sepp1 are all previously annotated markers for establishment of the fates of microglia (Anderson et al., 2022; Li et al., 2019) (Supplementary Table 4).”

      Comment#10: Regulons:

      - The conclusions rely heavily on regulons. The Methods section describes using SCENIC, GENIE3, RCisTarget, and AUCell, but their relation to bifurcation analysis is unclear.

      - Do you perform trajectory analysis on all MURP-derived cells or within each identified trajectory based on bifurcation? This point needs clarification to make the outcomes comprehensible. The legend of Figure 4 provides some ideas, but further clarity is required.

      Thank you for the comments.

      (1) To clarify, we used tools such as SCENIC to annotate the highly weighted genes (HWG) resulting from the bifurcation analysis for transcription factor regulatory activity and possible impacts on biological processes. We have added descriptions to the analysis of our microglial data. The revised content is between lines 265 and 266:

      “Moreover, we retrieved highly active regulons from the HWG by MGPfact, of which the significance is quantified by the overall weights of the member genes.”

      (2) We apologize for any confusion caused by our description. To clarify, we performed an overall trajectory analysis on all MURP results, rather than analyzing within each identified trajectory. Specifically, we first used MURP to downsample all preprocessed cells, where each MURP subset represents a group of cells. We then conducted trajectory inference on all MURP subsets and identified bifurcation points. This process generated multiple independent differentiation trajectories, encompassing all MURP subsets. To clearly convey this point, we have added descriptions in the legend of Figure 4. The revised content is between lines 276 and 283:

      “Fig. 4. MGPfact reconstructed the developmental trajectory of microglia, recovering known determinants of microglia fate. a-c. The inferred independent bifurcation processes with respect to the unique cell types (color-coded) of microglia development, where phase 0 corresponds to the state before bifurcation; and phases 1 and 2 correspond to the states post-bifurcation. Each colored dot represents a metacell of unique cell type defined by MURP. The most highly weighted regulons in each trajectory were labeled by the corresponding transcription factors (left panels). The HWG of each bifurcation process include a set of highly weighted genes (HWG), of which the expression levels differ significantly among phases 1, 2, and 3 (right panels).”

      Comment#11: CD8+ T Cells: The comparison is made against Monocle2, the method used in the publication, but it would be beneficial to compare it with more recent methods. Otherwise, the added value of MGPfact is unclear.

      Per your request, we have expanded our comparative analysis to include not only Monocle2 but also more recent methods such as Monocle3 (Cao et al., 2019) and scFates Tree (Faure et al., 2023). We used adjusted R-squared values to evaluate each method's ability to explain trajectory variation. The results have been added to Table 2 and Supplementary Table 6. The revised content is between lines 318 and 326:

      We assessed the goodness-of-fit (adjusted R-square) of the consensus trajectory derived by MGPfact and three methods (Monocle 2, Monocle 3 and scFates Tree) for the CD8+ T cell subtypes described in the original studies (Guo et al., 2018; Zhang et al., 2018). The data showed that MGPfact significantly improved the explanatory power for most CD8+ T cell subtypes over Monocle 2, which was used in the original studies (P < 0.05, see Table 2 and Supplementary Table 6), except for the CD8-GZMK cells in the CRC dataset. Additionally, MGPfact demonstrated better explanatory power in specific cell types when compared to Monocle 3 and scFates Tree. For instance, in the NSCLC dataset, MGPfact exhibited higher explanatory power for CD8-LEF1 cells (Table 2, R-squared = 0.935), while Monocle 3 and scFates Tree perform better in other cell types.
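
      For readers less familiar with the statistic, adjusted R-squared penalizes ordinary R-squared for the number of predictors. The sketch below shows the standard formula; the function name `adjusted_r2`, the toy values, and the single-predictor setting are assumptions, since the response does not specify which regression was fitted for each cell subtype.

```python
import numpy as np

def adjusted_r2(y, y_hat, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = y.size
    if n - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors + 1")
    ss_res = np.sum((y - y_hat) ** 2)        # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)     # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)

# Toy example (hypothetical values): observed versus fitted values.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
fitted = [1.1, 1.9, 3.2, 3.8, 5.0]
score = adjusted_r2(observed, fitted, n_predictors=1)
```

      With these toy values the ordinary R-squared is 0.99 and the adjustment lowers it slightly, reflecting the small sample size relative to the number of predictors.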

      Comment#12: Consensus Trajectory: A panel explaining how the consensus trajectory is generated would be helpful. Include both visual and textual explanations tailored to the journal's audience.

      Thank you for the comments. Regarding how the consensus trajectory is constructed, we have illustrated and described this in Figure 1 and the supplementary methods. Taking the reviewers' suggestions into account, we have added more details about the generation process of the consensus trajectory in the methods section to enhance the completeness of the manuscript. The revised content is from line 599 to 606:

      “Following MGPfact decomposition, we obtained multiple independent bifurcative trajectories, each corresponds to a binary tree within the temporal domain. These trajectories were then merged to construct a coherent diffusion tree, representing the consensus trajectory of cells’ fate. The merging process involves initially sorting all trajectories by their bifurcation time. The first (earliest) bifurcative trajectory is chosen as the initial framework, and subsequent trajectories are integrated to the initial framework iteratively by adding the corresponding branches at the bifurcation timepoints. As a result, the trajectories are ultimately merged into a comprehensive binary tree, serving as the consensus trajectory.”

      Comment#13: Discussion:

      - Check for typos, e.g., line 382 "pseudtime.".

      - Avoid considering HVG as the entire feature space.

      - The first three paragraphs are too similar to the Introduction. Consider shortening them to succinctly state the scenario and the implications of your contribution.

      Thank you for pointing out the typos.

      (1) We conducted a comprehensive review of the document to ensure there are no typographical errors.

      (2) We restructured the first three paragraphs of the discussion section to clarify the limitations in the use of current manifold-learning methods and removed any absolute language regarding treating HVGs as the entire feature space. The revised content is from line 419 to 430:

      “Single-cell RNA sequencing (scRNA-seq) provides a direct, quantitative snapshot of a population of cells in certain biological conditions, thereby revealing the actual cell states and functions. Although existing clustering and embedding algorithms can effectively reveal discrete biological states of cells, these methods become less efficient when depicting continuous evolving of cells over the temporal domain. The introduction of manifold learning offers a new dimension for discovery of relevant biological knowledge in cell fate determination, allowing for a better representation of continuous changes in cells, especially in time-dependent processes such as development, differentiation, and clonal evolution. However, current manifold learning methods face major limitations, such as the need for prior information on pseudotime and cell clustering, and lack of explainability, which restricts their applicability. Additionally, many existing trajectory inference methods do not support gene selection, making it difficult to annotate the results to known biological entities, thereby hindering the interpretation of results and subsequent functional studies.”

      Comment#14: Minor Comments:

      (1) Review the paragraph regarding the "current manifold-learning methods are faced with two major challenges." The message needs clarification.

      (2) Increase the quality of the figures.

      (3) Update the numbering of equations from #(.x) to (x).

      We thank the reviewer for these detailed suggestions.

      (1) We have thoroughly revised the discussion section, addressing overly absolute statements. The revised content is from line 426 to 428:

      “However, current manifold learning methods face major limitations, such as the need for prior information on pseudotime and cell clustering, and lack of explainability, which restricts their applicability.”

      (2) We conducted a comprehensive review of the figures in the article to more clearly present our results.

      (3) We have meticulously reviewed the equations in the article to ensure there are no display issues with the indices.

      References

      Aibar S, González-Blas CB, Moerman T, Huynh-Thu VA, Imrichova H, Hulselmans G, Rambow F, Marine J-C, Geurts P, Aerts J, van den Oord J, Atak ZK, Wouters J, Aerts S. 2017. SCENIC: single-cell regulatory network inference and clustering. Nat Methods 14:1083–1086. doi:10.1038/nmeth.4463

      Anderson SR, Roberts JM, Ghena N, Irvin EA, Schwakopf J, Cooperstein IB, Bosco A, Vetter ML. 2022. Neuronal apoptosis drives remodeling states of microglia and shifts in survival pathway dependence. eLife 11:e76564.

      Bravo González-Blas C, De Winter S, Hulselmans G, Hecker N, Matetovici I, Christiaens V, Poovathingal S, Wouters J, Aibar S, Aerts S. 2023. SCENIC+: single-cell multiomic inference of enhancers and gene regulatory networks. Nat Methods. doi:10.1038/s41592-023-01938-4

      Cao J, Spielmann M, Qiu X, Huang X, Ibrahim DM, Hill AJ, Zhang F, Mundlos S, Christiansen L, Steemers FJ, Trapnell C, Shendure J. 2019. The single-cell transcriptional landscape of mammalian organogenesis. Nature 566:496–502. doi:10.1038/s41586-019-0969-x

      Faure L, Soldatov R, Kharchenko PV, Adameyko I. 2023. scFates: a scalable python package for advanced pseudotime and bifurcation analysis from single-cell data. Bioinformatics 39:btac746. doi:10.1093/bioinformatics/btac746

      Guo X, Zhang Y, Zheng L, Zheng C, Song J, Zhang Q, Kang B, Liu Z, Jin L, Xing R, Gao R, Zhang L, Dong M, Hu X, Ren X, Kirchhoff D, Roider HG, Yan T, Zhang Z. 2018. Global characterization of T cells in non-small-cell lung cancer by single-cell sequencing. Nat Med 24:978–985. doi:10.1038/s41591-018-0045-3

      Guzmán AU. n.d. Single-cell RNA sequencing of spinal cord microglia in a mouse model of neuropathic pain.

      Ji Z, Ji H. 2016. TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res 44:e117–e117. doi:10.1093/nar/gkw430

      Lange M, Bergen V, Klein M, Setty M, Reuter B, Bakhti M, Lickert H, Ansari M, Schniering J, Schiller HB, Pe’er D, Theis FJ. 2022. CellRank for directed single-cell fate mapping. Nat Methods 19:159–170. doi:10.1038/s41592-021-01346-6

      Li Q. 2023. scTour: a deep learning architecture for robust inference and accurate prediction of cellular dynamics. Genome Biology.

      Li Q, Cheng Z, Zhou L, Darmanis S, Neff NF, Okamoto J, Gulati G, Bennett ML, Sun LO, Clarke LE, Marschallinger J, Yu G, Quake SR, Wyss-Coray T, Barres BA. 2019. Developmental Heterogeneity of Microglia and Brain Myeloid Cells Revealed by Deep Single-Cell RNA Sequencing. Neuron 101:207-223.e10. doi:10.1016/j.neuron.2018.12.006

      Neal RM. 2003. Slice sampling. The Annals of Statistics 31:705–767.

      Papadopoulos N, Gonzalo PR, Söding J. 2019. PROSSTT: probabilistic simulation of single-cell RNA-seq data for complex differentiation processes. Bioinformatics 35:3517–3519. doi:10.1093/bioinformatics/btz078

      Ren J, Zhang Q, Zhou Y, Hu Y, Lyu X, Fang H, Yang J, Yu R, Shi X, Li Q. 2022. A downsampling method enables robust clustering and integration of single-cell transcriptome data. Journal of Biomedical Informatics 130:104093. doi:10.1016/j.jbi.2022.104093

      Roberts GO, Rosenthal JS. 2009. Examples of adaptive MCMC. Journal of Computational and Graphical Statistics 18:349–367.

      Saelens W, Cannoodt R, Todorov H, Saeys Y. 2019. A comparison of single-cell trajectory inference methods. Nat Biotechnol 37:547–554. doi:10.1038/s41587-019-0071-9

      Sha Y. 2024. Reconstructing growth and dynamic trajectories from single-cell transcriptomics data 6.

      Smolander J, Junttila S, Venäläinen MS, Elo LL. 2022. scShaper: an ensemble method for fast and accurate linear trajectory inference from single-cell RNA-seq data. Bioinformatics 38:1328–1335. doi:10.1093/bioinformatics/btab831

      Tierney L. 1994. Markov chains for exploring posterior distributions. The Annals of Statistics 1701–1728.

      Zappia L, Phipson B, Oshlack A. 2017. Splatter: simulation of single-cell RNA sequencing data. Genome Biol 18:174. doi:10.1186/s13059-017-1305-0

      Zhang L, Yu X, Zheng L, Zhang Y, Li Y, Fang Q, Gao R, Kang B, Zhang Q, Huang JY, Konno H, Guo X, Ye Y, Gao S, Wang S, Hu X, Ren X, Shen Z, Ouyang W, Zhang Z. 2018. Lineage tracking reveals dynamic relationships of T cells in colorectal cancer. Nature 564:268–272. doi:10.1038/s41586-018-0694-x

    1. Author Response

      The following is the authors’ response to the original reviews.

      We are grateful to the reviewers for their constructive comments. The following are our point-by-point responses.

      Reviewer #1 (Recommendations For The Authors):

      Point 1- Abstract: advanced morning peak « opposite » to pdf/pdfr mutants. To my knowledge, the alteration of PDF/PDFR suppresses the morning peak. I am not sure that an advance of the peak is « opposite » to its inhibition?

      Mutants with disruptions in CNMa or CNMaR display advanced morning activity, indicating an enhanced state. Mutants with disruptions in Pdf or Pdfr exhibit no morning anticipation, suggesting a promoting role of these genes in morning anticipation. Therefore, our revised version is: “Specific elimination of each from clock neurons revealed that loss of the neuropeptide CNMa in two posterior dorsal clock neurons (DN1ps) or its receptor (CNMaR) caused advanced morning activity, indicating a suppressive role of CNMa-CNMaR on morning anticipation, opposite to the promoting role of PDF-PDFR on morning anticipation.” (Line 43-51)

      Point 2- Fig 1K-L: the authors should show the sleep phenotype of the homozygous nAChRbeta2 mutant (if not lethal) for a direct comparison with the FRT/FLP genotype and thus evaluate the efficiency of the system.

      We have incorporated sleep profiles of the nAChRbeta2 mutant and w1118 into Fig. 1K-L. nAChRbeta2 mutants (red) exhibited a sleep amount comparable to that of pan-neural nAChRbeta2 knockout flies (dark red), as shown below.

      Author response image 1.

      Point 3- Dh31-EGFP-FRT expression patterns look different in figS1 A (or fig1 H) and J. why that?

      We re-examined the original data. Both samples (with R57C10-GAL4; Fig. S1A, right, and Fig. S1J, left) are Dh31EGFP.FRT samples, displayed below, and they show consistent primary expression subsets. Any observed disparities in region "e" could be attributed to variations during dissection.

      Author response image 2.

      Point 4- The knockdown experiments with the elav-switch (RU486) system (fig S2) do not seem to be as efficient as the HS-FLP system (fig 1H-J). The conclusions on the efficiency should be toned down.

      We have revised accordingly: "Near Complete Disruption of Target Genes by GFPi and Flp-out Based cCCTomics" (Line 130): "Knocking out at the adult stage using either hsFLP driven Flp-out (Golic and Lindquist, 1989) (Fig. 1H-1J) or neural (elav-Switch) driven shRNAGFP (Nicholson et al., 2008; Osterwalder et al., 2001) (Fig. S2A-S2I), also resulted in the elimination of most, though not all, GFP signals." (Line 145-149)

      Point 5- Fig 2H-J: the LD behavioral phenotype of pdfr pan-neuronal cripsr does not seem to correspond to what is described in the literature for the pdfr mutant (han), see hyun et al 2005 (no morning anticipation and advanced evening peak). I understand that the activity index is lower than controls but fig2H shows a large anticipatory activity that seems really unusual, and no advanced evening peak is observed. I think that the authors should show the CRISPR flies and pdfr mutants together, to better compare the phenotypes.

      Thank you for pointing out that, in flies with pan-neuronal knockout of Pdfr by unmodified Cas9 (Fig. 2H-2I of the previous version), morning anticipation still exists (Fig. 2H of the previous manuscript) despite a significant decrease in the morning anticipation index (Fig. 2I of the previous manuscript), and that the advanced evening activity is not as pronounced as that observed in han5304 (Fig. 3C in Hyun et al., 2005).

      First, we have separated the activity plots of Fig. 2H of the previous manuscript, as shown below. In Pdfr knockout flies, activity tends to decrease from ZT18 to ZT21 and to increase from ZT21 to ZT24. The lowest activity before dawn occurs at about ZT21, and the activity at ZT18 is comparable to the activity at ZT24. This differs markedly from the two control groups, whose activity tends to increase from ZT18 to ZT24, with an activity peak at ZT24.

      From ZT6 to ZT12, activity increased much faster in Pdfr knockout flies and reached a plateau at about ZT11. In the two control groups, activity increased more slowly from ZT6 to ZT12, with no plateau but a peak at ZT12.

      Author response image 3.

      Second, we have incorporated the phenotype of the Pdfr mutants we previously generated (Pdfr-attpKO; Deng et al., 2019) alongside that of the Pdfr pan-neuronal knockout by Cas9.HC. This mutant lacks all seven transmembrane regions of Pdfr (a). The phenotypes of Pdfr-attpKO flies and Pdfr pan-neuronal knockout flies are very similar. In this experimental repeat, we observed a much more obvious advanced evening activity peak in both pan-neuronal knockout flies and Pdfr-attpKO flies.

      To further analyze the phenotypes of Pdfr pan-neuronal knockout flies by Cas9.HC, we referred to the literature. The activity pattern from ZT18 to ZT24 (activity tends to decrease from ZT18 to ZT21 and to increase from ZT21 to ZT24, with the lowest activity before dawn occurring at about ZT21, and activity at ZT18 comparable to activity at ZT24) has also been reported for Pdfr knockout flies, for example in Fig. 3C and 3H in Hyun et al., 2005, Fig. 2B in Lear et al., 2009, Fig. 3B in Zhang et al., 2010, Fig. 5A in Guo et al., 2014, and Fig. 5B in Goda et al., 2019. Additionally, an advanced evening activity peak less pronounced than that of han5304 (Fig. 3C in Hyun et al., 2005) has also been reported in Fig. 2B in Lear et al., 2009, Fig. 3B in Zhang et al., 2010, and Fig. 5B in Goda et al., 2019. We consider this difference more likely to be caused by environmental conditions or recording strategies (DAM system vs. video tracing).

      Therefore, we revised the text to: “Pan-neuronal knockout of Pdfr resulted in a tendency towards advanced evening activity and weaker morning anticipation compared to control flies (Fig. 2H-2I), which is similar to Pdfr-attpKO flies. These phenotypes were not as pronounced as those reported previously, when han5304 mutants exhibited a more obvious advanced evening peak and no morning anticipation (Hyun et al., 2005)”.

      Author response image 4.

      Point 6-The authors should provide more information about the DD behavior (power is low, but how about the period of rhythmic flies, which is shortened in pdf (renn et al) and pdfr (hyun et al) mutants).

      We have incorporated period data into Fig. 2I. Indeed, conditional knockout of Pdfr by Cas9.HC driven by R57C10-GAL4 shortens the period length, as shown below (previous data) and in Fig. 2I of the revised version.

      In the revised Fig. 2I, we tested 45 Pdfr-attpKO flies under DD conditions (3 out of 48 flies died during video tracing in DD), and only one fly was rhythmic. In contrast, 9 out of 48 Pdfr pan-neuronal knockout flies were rhythmic.

      Author response image 5.

      Point 7- P15 and fig6. The authors indicate that type II CNMa neurons do not show advanced morning activity as type I do, but Figs 6 I and K seem to show some advance although less important than type I. I am not sure that this supports the claim that type I is the main subset for the control of morning activity. This should be toned down.

      We have re-organized Fig. 6 and revised the summary of these results as: “However, Type II neurons-specific CNMa knockout (CNMa ∩ GMR91F02) showed weaker advanced morning activity without advanced morning peak (Fig. 6N), while Type I neurons-specific CNMa knockout did (Fig. 6J), indicating a possibility that these two type I CNMa neurons constitute the main functional subset regulating the morning anticipation activity of fruit fly”. (Line 400-405)

      Point 8- Figs 6M and N: is power determined from DD data? if yes, how about the period and arrhythmicity? Please also provide the LD activity profiles for the mutants and rescued pdfr genotypes.

      Yes, the power was determined from the DD data. In the new version of the manuscript, we have included the activity plots for the LD phase in supplementary Fig S13, as well as shown below (A, B), and the period and arrhythmicity data for the DD phase in Fig. 6S and Table S7. We have also refined the related description as follows: “Moreover, knocking out Pdfr by GMR51H05, GMR79A11 and CNMa GAL4, which cover type I CNMa neurons, decreased morning anticipation of flies (Fig. 6T, Fig. S13B). However, the decrease in morning anticipation observed in the Pdfr knockout by CNMa-GAL4 was not as pronounced as with the other two drivers. Because the presumptive main subset of functional CNMa is also PDFR-positive, there is a possibility that CNMa secretion is regulated by PDF/PDFR signal”. (Line 413-419)

      Author response image 6.

      Point 9- Fig 7: does CNMaR affect DD behavior? This should be tested.

      We analyzed CNMaR-/- activity under dark-dark (DD) conditions over a span of six days. The results revealed higher power in CNMaR mutants than in control flies (Power: 93.5±41.9 (CNMaR-/-, n=48) vs 47.3±31.6 (w1118, n=47); Period: 23.7±0.3 h (CNMaR-/-, n=46) vs 23.7±0.3 h (w1118, n=47); arrhythmic rate 2/48 (CNMaR-/-) vs 0/47 (w1118)). Because mutating CNMa had no obvious effect on DD behavior, any effect of CNMaR on DD behavior cannot be attributed to the CNMa signal, so we did not further repeat or analyze the DD behavior of the CNMaR mutant. We believe this raises another question beyond the scope of our current discussion.

      Reviewer #2 (Recommendations For The Authors):

      Point 1-One major concern is the apparent discrepancies in clock network gene expression using the Flp-Out and split-LexA approaches compared to what is known about the expression of several transmitter and peptide-related genes. For example, it is well established that the 5th-sLNv expresses CHAT (along with a single LNd), yet there appears to be no choline acetyltransferase (ChAT) signal in the 5th-sLNv as assayed by the Split-LexA approach (Fig. 4). This approach also suggests that DH31 is expressed in the s-LNvs, which, as one of the most intensely studied clock neuron are known to express PDF and sNPF, but not DH31. The results also suggest that the sLNvs express ChAT, which they do not. Remarkably PDF is not included in the expression analysis, this peptide is well known to be expressed in only two subgroups of clock neurons, and would therefore be an excellent test case for the expression analysis in Fig. 4. PDF should therefore be added to analysis shown in Fig. 4. Another discrepancy is PdfR, which split LexA suggests is expressed in the Large LNvs but not the small LNvs, the opposite of what has been shown using both reporter expression and physiology. The authors do acknowledge that discrepancies exist between their data and previous work on expression within the clock network (lines 237 and 238). However, the extent of these discrepancies is not made clear and calls into question the accuracy of Flp-Out and Split LexA approaches.

      The concerns mentioned above are:

      (1) sLNvs express PDF and sNPF but not Dh31;

      (2) ChAT presents in 5th-sLNv and one LNd but not in other sLNvs;

      (3) PDFR presents in sLNvs but not l-LNvs.

      (4) PDF is not included in the analysis.

      To verify the accuracy of these intersection analyses, all related to PDF-positive neurons (except 5th-sLNv and LNds), we stained PDF and examined the co-localization between PDF-positive LNvs and the respective drivers ChAT-KI-LexA, Pdfr-KI-LexA, Dh31-KI-LexA, and Pdf-KI-LexA.

      First, Dh31-KI-LexA labeled four s-LNvs, as shown below (also in Fig. S9A). Therefore, the results of the intersection analysis of Dh31-KI-LexA with Clk856-GAL4 are correct. The difference from previous literature is attributed to Dh31-KI-LexA labeling different neurons than the previous driver or antibody.

      Second, no s-LNv was labeled by ChAT-KI-LexA, as shown below. We rechecked our intersection data and found that, of the 10 ChAT-KI-LexA∩Clk856-GAL4 brains we analyzed, only two showed positive s-LNvs. To enhance the accuracy of intersection analysis results, we marked all positive signal records when positive subsets were found in less than 1/3 of the total analyzed brains (Table S4).

      Third, one l-LNv and at least two s-LNvs were labeled by Pdfr-KI-LexA, as shown below (also in Fig. S9B). Fourth, Pdf-KI-LexA labels all PDF-positive neurons, but the intersection analysis by Pdf-KI-LexA and Clk856-GAL4 only showed scattered signals, as shown below (D, also in Fig. S9C). In these cases, some expected positive signals were not observed in our dissection. A possible reason is the inefficiency of LexAop-FRT-myr::GFP driven by LexA. Therefore, our intersection results may miss some positive signals.

      Author response image 7.

      Finally, we revised the text to (Line 286-317):

      To assess the accuracy of expression profiles using CCT drivers, we compared our dissection results with previous reports. Initially, we confirmed the expression of CCHa1 in two DN1s (Fujiwara et al., 2018), sNPF in four s-LNvs and two LNds (Johard et al., 2009), and Trissin in two LNds (Ma et al., 2021), aligning with previous findings. Additionally, we identified the expression of nAChRα1, nAChRα2, nAChRβ2, GABA-B-R2, CCHa1-R, and Dh31-R in all or subsets of LNvs, consistent with suggestions from studies using ligands or agonists in LNvs (Duhart et al., 2020; Fujiwara et al., 2018; Lelito and Shafer, 2012; Shafer et al., 2008) (Table S4).

      Regarding previously reported Nplp1 in two DN1as (Shafer et al., 2006), we found approximately five DN1s positive for Nplp-KI-LexA, indicating a broader expression than previously reported. A similar pattern emerged in our analysis of Dh31-KI-LexA, where four DN1s, four s-LNvs, and two LNds were identified, contrasting with the two DN1s found in immunocytochemical analysis (Goda et al., 2016). Colocalization analysis of Dh31-KI-LexA and anti-PDF revealed labeling of all PDF-positive s-LNvs but not l-LNvs (Fig S9A), suggesting that the differences may arise from the broader labeling of 3' end knock-in LexA drivers or the amplitude effect of the binary expression system. The low protein levels might go undetected in immunocytochemical analysis. This aligns with transcriptome analysis findings showing Nplp1 positive in DN1as, a cluster of CNMa-positive DN1ps, and a cluster of DN3s (Ma et al., 2021), which is more consistent with our dissection.

      Despite the well-known expression of PDF in LNvs and PDFR in s-LNvs (Renn et al., 1999; Shafer et al., 2008), we did not observe stable positive signals for both in Flp-out intersection experiments, although both Pdf-KI-LexA and Pdfr-KI-LexA label LNvs as expected (Fig S9B-S9C). We also noted fewer positive neurons in certain clock neuron subsets compared to previous reports, such as NPF in three LNds and some LNvs (Erion et al., 2016; He et al., 2013; Hermann et al., 2012; Johard et al., 2009; Lee et al., 2006) and ChAT in four LNds and the 5th s-LNv (Johard et al., 2009; Duhart et al., 2020) (Table S4). We attribute this limitation to the inefficiency of LexAop-FRT-myr::GFP driven by LexA, acknowledging that our intersection results may miss some positive signals.

      Point 2-Related to this, the authors rather inaccurately suggest that the field's understanding of PdfR expression within the clock neuron network is "inconsistent" and "variable" (lines 368-377). This is not accurate. It is true that the first attempts to map PdfR expression with antisera and GAL4s were inaccurate. However, subsequent work by several groups has produced strong convergent evidence that with the exception of the l-LNvs after several days post-eclosion, PdfR is expressed in the Cryptochrome expressing a subset of the clock neuron network. This section of the study should be revised.

      We thank the reviewer for pointing this out. As we have already addressed and revised the related part in the RESULTS section (Line 308-317), we have now removed this part from the DISCUSSION section of the revised version.

      Point 3-One minor issue that would avoid unnecessary confusion by readers familiar with the circadian literature is the way that activity profiles are plotted in the study. The authors have centered their averaged activity profiles on the 12h of darkness. This is the opposite of the practice of the field, and it leads to some initial confusion in the examination of the morning and evening peak data. The authors may wish to avoid this by centering their activity plots on the 12h light phase, which would put the morning peak on the left and the evening peak on the right. This is the way the field is accustomed to examining locomotor activity profiles.

      The centering of averaged activity profiles on the 12 h of darkness is done to highlight the phenotype of advanced morning activity. To prevent any confusion among readers, we have included a sentence in the figure legend explaining how our activity profiles differ from those in the previous literature: "Activity profiles were centered of the 12 h darkness in all figures with evening activity on the left and morning activity on the right, which is different from general circadian literatures. (Fig. 2H legend)" (Line 957-959)

      Point 4-The authors conclude that the loss of PDF and CNMa have opposite effects on the morning peak of locomotor activity (line 392). But they also acknowledge, briefly, that things are not that simple: loss of CNMa causes a phase advance, but loss of PDF causes a loss or reduction in the anticipatory peak. It is still significant to find a peptide transmitter with the clock neuron network that regulates morning activity, but the authors should revise their conclusion regarding the opposing actions of PDF and CNMa, which is not well supported by the data.

      We have revised the relevant parts.

      ABSTRACT: “Specific elimination of each from clock neurons revealed that loss of the neuropeptide CNMa in two posterior dorsal clock neurons (DN1ps) or its receptor (CNMaR) caused advanced morning activity, indicating a suppressive role of CNMa-CNMaR on morning anticipation, opposite to the promoting role of PDF-PDFR on morning anticipation.” (Line 43-48)

      DISCUSSION: “Furthermore, given that the morning anticipation vanishing phenotype of Pdf or Pdfr mutant indicates a promoting role of PDF-PDFR signal, while the enhanced morning anticipation phenotype of CNMa mutant suggests an inhibiting role of CNMa signal, we consider the two signals to be antagonistic.” (Line 492-495)

      Point 5-The authors should acknowledge, cite, and incorporate the substantive discussion of CNMa peptide and the DN1p neuronal class in Reinhard et al. 2022 (Front Physiol. 13: 886432).

      We have revised the text accordingly and cited this paper: “Type I with two neurons whose branches projecting to the anterior region, as in CNMa∩GMR51H05, CNMa∩Pdfr, and CNMa∩GMR79A11 (Fig. 6E, 5G, 6H), and type II with four neurons branching on the posterior side with few projections to the anterior region, as in CNMa∩GMR91F02 (Fig. 6F). These two types of DN1ps’ subsets were also reported and profound discussed previously (Lamaze et al., 2018; Reinhard et al., 2022)”. (Line 393-397)

      Reviewer #3 (Recommendations For The Authors):

      Point 1-Throughout the manuscript figure legends (axis, genotypes, etc) are too small to be appreciated. Fig. 1. Panel A. The labels are very difficult to read.

      We have attempted to enlarge the font as much as possible in the revised version.

      Point 2-Fig. 1. H-J Why is efficiency not mentioned in all the examples?

      The results of Fig. 1H-1J are now discussed in the revised manuscript (Line 145-147). We did not calculate the exact efficiency because the GFP intensity is not stable enough: it might change during dissection or mounting, or with the laser intensity, in our experimental process. Therefore, in all results related to GFP signal (Fig. 1B-1J, Fig. S1, Fig. S2, Fig. 2B-2D), we relied on qualitative judgment rather than quantitative judgment, unless the GFP signal was easily quantifiable (such as in cases with limited cells or no GFP signal in the experimental group).

      Point 3-Fig. 1. Panel L, left (light phase): the statistical comparisons are not clearly indicated (the same happens in Figs 3Q and 3R).

      We have now re-arranged Fig. 1L and Fig. 3Q-3R to make the statistical comparisons clear in the new version.

      Point 4-Line 792. Could induced be introduced?

      Yes, we have now corrected this typo.

      Point 5-Fig. S1. Check labels for consistency. GMR57C10 Gal4 driver is most likely R57C10.

      We have now revised the labels (Fig. S1).

      Point 6-Fig. S2. If the experiments were repeated and several brains were observed, the authors should include the efficiency and the number of flies as reported in Fig. S1.

      We have now added the number of flies in Fig. S2 as reported in Fig. S1. As mentioned in our response to Point 2, due to the instability of the GFP signal, we are unable to provide a quantitative efficiency in this context.

      Point 7-Fig S4. The fig legend describes panels I-J which are not shown in the current version of the manuscript.

      We have now deleted them.

      Point 8-Fig 2I. Surprising values for morning anticipation indexes even for controls (0.5 would indicate “no anticipation”; in controls, the expected values would be >>0.5, as most of the activity is concentrated right before the transition). Could the authors explain this unexpected result?

      We have revised the description of the calculation in the methods section (Line 612). We calculate the ratio of activity in the last three hours to activity in the total six hours and then subtract 0.5 from the result. Therefore, the index should be ≤0.5. An index equal to 0 indicates no morning anticipation.
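      The calculation described above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function name, the assumption that the input is equally spaced activity counts covering the six hours before the light transition, and the choice to return None when there is no activity at all are hypothetical.

```python
from typing import Optional, Sequence


def anticipation_index(activity_6h: Sequence[float]) -> Optional[float]:
    """Ratio of last-3-h activity to total 6-h activity, minus 0.5.

    `activity_6h` is assumed (hypothetically) to hold equally spaced
    activity counts covering the six hours before the transition, so
    the second half of the sequence corresponds to the last three hours.
    """
    n = len(activity_6h)
    if n == 0 or n % 2 != 0:
        raise ValueError("expected a non-empty sequence with an even number of bins")

    total = sum(activity_6h)
    if total == 0:
        # No activity in the window: the ratio is undefined.
        return None

    last_3h = sum(activity_6h[n // 2:])
    # The ratio lies in [0, 1], so the index lies in [-0.5, 0.5];
    # 0 means activity is evenly split, i.e. no anticipation.
    return last_3h / total - 0.5


# Evenly distributed activity gives an index of 0 (no anticipation).
print(anticipation_index([1, 1, 1, 1, 1, 1]))
```

      Because the index is computed per fly from a single ratio, it can be applied at the individual level without fitting a model.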

      Point 9-Fig 2K/L. The authors mention that not all genes are effectively knocked out with their strategy. Could this be accounted for the specific KD strategy, its duration, or the promotor strength? It is surprising no explanation is provided in the text (page 9 line 179).

      In our pursuit of a broadly effective method for gene editing, Fig. 2H-2L and Fig. 2D revealed that previous attempts fell short of this objective. The observed inefficiency may be attributed to the strength of the promoter, resulting in inadequate expression. Alternatively, the insufficient duration of the manipulation may also contribute to the lack of success. However, in sleep and rhythm research, the age at which flies are tested is typically fixed, limiting the potential to enhance efficiency by extending the manipulation time. Moreover, increasing the expression level may pose challenges related to cytotoxicity, as reported in previous studies (Port et al., 2014). We refrain from offering specific explanations, as we lack a definitive plan and cannot provide additional robust evidence to support the above speculations. Consequently, in our ongoing efforts, we aim to enhance the efficiency of the tool system while operating within the current constraints.

      Point 10-Page 9, line 179. Can the authors include a brief description of the reason for the different modifications? Only one was referenced.

      We have revised the related part in the manuscript (Line 223-231):

      Cas9.M9: We fused a chromatin-modulating peptide (Ding et al., 2019), HMGN1 (High mobility group nucleosome binding domain 1), at the N-terminus of Cas9 and HMGB1 (High mobility group protein B1) at its C-terminus with a GGSGP linker, termed Cas9.M9.

      Cas9.M6: We also obtained a modified Cas9.M6 with HMGN1 at the N-terminus and an undefined peptide (UDP) at the C-terminus. (NOTE: UDP was obtained by accident)

      Cas9.M0: We replaced the STARD linker between Cas9 and NLS in Cas9.HC with the GGSGP linker (Zhao et al., 2016), termed Cas9.M0.

      Point 11-The authors tested the impact of KO nAChR2 across the different versions of conditional disruption (Fig 1K-L, Fig 2L, Fig 3R). It is surprising they observe a difference in daytime sleep upon knocking down with Cas9.HC (2L) but not with Cas9.M9 (3R) and the reverse is seen for night-time sleep. Could the authors provide an explanation? Efficiency is not the issue at stake, is it?

      In Fig. 2K, the day sleep of flies (R57C10-GAL4/UAS-sgRNAnAChRbeta2; UAS-Cas9/+) was significantly decreased compared to flies (R57C10-GAL4/UAS-sgRNAnAChRbeta2; +/+), but not when compared to flies (R57C10-GAL4/+; UAS-Cas9/+). Our criterion for asserting a difference is that the experimental group must show a significant distinction from both control groups. Therefore, we concluded that there was no significant difference between the experimental group and the control groups in Fig. 2K.
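      The decision rule stated above can be written out as a small sketch. It is illustrative only: the function and parameter names and the 0.05 threshold are assumptions, and the p-values would come from whichever statistical test is applied to the sleep data.

```python
def differs_from_both_controls(
    p_vs_sgrna_control: float,
    p_vs_cas9_control: float,
    alpha: float = 0.05,  # assumed significance threshold, not taken from the source
) -> bool:
    """Return True only if the experimental group differs from BOTH controls.

    p_vs_sgrna_control: p-value against the GAL4 + UAS-sgRNA control (no Cas9).
    p_vs_cas9_control:  p-value against the GAL4 + UAS-Cas9 control (no sgRNA).
    """
    return p_vs_sgrna_control < alpha and p_vs_cas9_control < alpha


# Significant against one control only: not counted as a difference.
print(differs_from_both_controls(0.01, 0.20))
```

      Under this rule, a group that is significant against only one of the two controls is treated as showing no difference.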

      Point 12-Fig. 4. Which of the two strategies described in A-B was employed to assemble the expression profile of CCT genes in clock neurons shown in C? This information should be part of the fig legend.

      We have now revised the legend as follows: “(A-B) Schematic of intersection strategies used in Clk856 labelled clock neurons dissection, Flp-out strategy (A) and split-LexA strategy (B). The exact strategy used for each gene is annotated in Table S5.”

      Point 13-Similarly, how many brains were analyzed to give rise to the table shown in C?

      We have now revised the legend of Table S4 to address this concern. As indicated in: “The largest N# for each gene in Table S4 is the brain number analyzed for each gene”.

      Point 14-Finally, the sentence “The figure is...” requires revision.

      We have now revised it: “The exact cell number for each subset is annotated in Table S4”.

      Point 15-Legend to Table S3. The authors have done an incredible job testing many gRNAs for each gene potentially relevant for communication. However, there is very little information to make the most out of it; for instance, the legend does not inform why many of the targeted genes do not appear to have been tested any further. It would be useful to the reader to discern whether despite being the 3 most efficient gRNAs, they were still not effective in targeting the gene of interest, or whether they showed off-targets, or it was simply a matter of testing the educated guesses. This information would be invaluable for the reader.

      First, we designed and generated transgenic UAS-sgRNA fly lines for all these sgRNAs. We randomly selected 14 receptor genes, known for their difficulty in editing based on our experience, to assess the efficiency of our strategy, as depicted in Fig. 3M-3P, Fig. S5, and Fig. S6. We believe these results are representative and indicative of the efficiency of sgRNAs designed using our process and applied with the modified Cas9.

      Secondly, we acknowledge your valid concern. While we selected sgRNAs with no predicted off-target effects through various prediction models (outlined in the Methods under C-cCCTomics sgRNA design), we did not conduct whole-genome sequencing. Consequently, we can only assert that the off-target possibility is relatively low. To address potential misleading effects arising from off-target concerns, it is essential to validate these results through mutants, RNAi, or alternative UAS-sgRNAs targeting the same gene.

      Point 16-Table S4. Some of the data presented derives from observations made in 1-2 brains for a specific cluster; isn´t it too little to base a decision on whether a certain gene is (or not) expressed? It is surprising since the same CCT line was observed/analysed in more brains for other clusters. Can the authors explain the rationale?

      N# represents the number of GFP-positive brains, and we have revised the legend of Table S4 accordingly. The largest N# denotes the total number of brains analyzed for a specific CCT line. It is possible that, due to variations in our dissection or mounting process, some clusters were only observed in 1-2 brains out of the total brains analyzed. To enhance the accuracy of intersection analysis results, we marked all positive signal records when positive subsets were found in less than 1/3 of the total analyzed brains (Table S4).
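      The marking rule can be illustrated with a short sketch. The function name, the dictionary layout, and the example counts are hypothetical; the only rules taken from the text are that the largest N# for a line is the total number of brains analyzed and that subsets positive in fewer than 1/3 of those brains are marked.

```python
from typing import Dict


def flag_sparse_subsets(positive_brains: Dict[str, int]) -> Dict[str, bool]:
    """Mark clock-neuron subsets seen in fewer than 1/3 of analyzed brains.

    `positive_brains` maps a subset name to the number of brains in which
    that subset was GFP positive (the N# value) for one driver line.
    """
    if not positive_brains:
        return {}

    # Per the Table S4 legend, the largest N# is the total number of brains.
    total_brains = max(positive_brains.values())

    # Integer comparison (3 * count < total) avoids floating-point division.
    return {
        subset: (0 < count) and (3 * count < total_brains)
        for subset, count in positive_brains.items()
    }


# Hypothetical counts for one line: only the subset seen in 2 of 10 brains is marked.
print(flag_sparse_subsets({"s-LNv": 2, "LNd": 10, "DN1": 4}))
```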

      Point 17-The paragraph describing this data in the results section needs revising (lines 233-243).

      We have now revised this. (Line 286-317)

      Point 18-While it is customary for authors to attempt to improve the description of the activity patterns by introducing new parameters (i.e. MAPI and EAPI, lines 253-258) it would be interesting to understand the difference between the proposed method and the one already in use (which compares the same parameter, i.e., the slope (defined as “the slope of the best-fitting linear regression line over a period of 6 h prior to the transition”, i.e., Lamaze et al. 2020 and many others)). Is there a need to introduce yet another one?

      This approach is necessary. The slope defined by Lamaze et al. utilizes data from only 2 time points, which may not accurately capture the pattern within a period before light on or off. Linear regression is not well-suited for a single fly due to the high variability in activity at each time point, making it challenging to fit the model at the individual level. The parameters we have introduced (MAPI and EAPI) in this paper are concise and can be applied at the individual level, effectively reflecting the morning or evening anticipation characteristics of each fly.

      As an alternative, the activity plot of a certain fly line could be represented by an average of all flies' activity in one experiment. This would make linear regression easier to fit. However, several independent experiments are required for statistical robustness, necessitating the inclusion of hundreds of flies for each strain in a single analysis.

      Point 19-In general, the legends of supplementary figures are a bit too brief. S7 and S8: it is not clear which of the two intersectional strategies were used (it would benefit whoever is interested in replicating the experiments). Legend to Fig S8 should read “similar to Fig S7”.

      We have now revised the legend and included “The exact strategy used for each gene is annotated in Table S5” in the legend.

      Point 20-The legend in Table S6 should clearly state the genotypes examined. What does the marking in bold refer to?

      We have now revised the annotation of Table S6. Bold marking refers to results that fall more than one SD away from the control group.

      Point 21-Line 314. The sentence needs revision.

      We have revised these sentences.

      Point 22-Line 391 (and also in the results section). The authors attempt to describe the CNMa phenotype as the opposite of pdf/pdfr mutant phenotypes. However, no morning anticipation/advanced morning anticipation are not necessarily opposite phenotypes.

      We have revised the related description.

      ABSTRACT: “Specific elimination of each from clock neurons revealed that loss of the neuropeptide CNMa in two posterior dorsal clock neurons (DN1ps) or its receptor (CNMaR) caused advanced morning activity, indicating a suppressive role of CNMa-CNMaR on morning anticipation, opposite to the promoting role of PDF-PDFR on morning anticipation.” (Line 43-48)

      DISCUSSION: “Furthermore, given that the morning anticipation vanishing phenotype of Pdf or Pdfr mutant indicates a promoting role of PDF-PDFR signal, while the enhanced morning anticipation phenotype of CNMa mutant suggests an inhibiting role of CNMa signal, we consider the two signals to be antagonistic.” (Line 492-495)

      Reference

      Deng, B., Li, Q., Liu, X., Cao, Y., Li, B., Qian, Y., Xu, R., Mao, R., Zhou, E., Zhang, W., et al. (2019). Chemoconnectomics: mapping chemical transmission in Drosophila. Neuron 101, 876-893.e874.

      Ding, X., Seebeck, T., Feng, Y., Jiang, Y., Davis, G.D., and Chen, F. (2019). Improving CRISPR-Cas9 genome editing efficiency by fusion with chromatin-modulating peptides. CRISPR J 2, 51-63.

      Duhart, J.M., Herrero, A., de la Cruz, G., Ispizua, J.I., Pírez, N., and Ceriani, M.F. (2020). Circadian Structural Plasticity Drives Remodeling of E Cell Output. Curr Biol 30, 5040-5048.e5045.

      Erion, R., King, A.N., Wu, G., Hogenesch, J.B., and Sehgal, A. (2016). Neural clocks and Neuropeptide F/Y regulate circadian gene expression in a peripheral metabolic tissue. eLife 5, e13552.

      Fujiwara, Y., Hermann-Luibl, C., Katsura, M., Sekiguchi, M., Ida, T., Helfrich-Förster, C., and Yoshii, T. (2018). The CCHamide1 neuropeptide expressed in the anterior dorsal neuron 1 conveys a circadian signal to the ventral lateral neurons in Drosophila melanogaster. Front Physiol 9, 1276.

      Goda, T., Tang, X., Umezaki, Y., Chu, M.L., Kunst, M., Nitabach, M.N., and Hamada, F.N. (2016). Drosophila DH31 neuropeptide and PDF receptor regulate night-onset temperature preference. J Neurosci 36, 11739-11754.

      Goda, T., Umezaki, Y., Alwattari, F., Seo, H.W., and Hamada, F.N. (2019). Neuropeptides PDF and DH31 hierarchically regulate free-running rhythmicity in Drosophila circadian locomotor activity. Sci Rep 9, 838.

      Guo, F., Cerullo, I., Chen, X., and Rosbash, M. (2014). PDF neuron firing phase-shifts key circadian activity neurons in Drosophila. eLife 3.

      He, C., Cong, X., Zhang, R., Wu, D., An, C., and Zhao, Z. (2013). Regulation of circadian locomotor rhythm by neuropeptide Y-like system in Drosophila melanogaster. Insect Mol Biol 22, 376-388.

      Hermann, C., Yoshii, T., Dusik, V., and Helfrich-Förster, C. (2012). Neuropeptide F immunoreactive clock neurons modify evening locomotor activity and free-running period in Drosophila melanogaster. J Comp Neurol 520, 970-987.

      Hyun, S., Lee, Y., Hong, S.T., Bang, S., Paik, D., Kang, J., Shin, J., Lee, J., Jeon, K., Hwang, S., et al. (2005). Drosophila GPCR Han is a receptor for the circadian clock neuropeptide PDF. Neuron 48, 267-278.

      Johard, H.A., Yoshii, T., Dircksen, H., Cusumano, P., Rouyer, F., Helfrich-Förster, C., and Nässel, D.R. (2009). Peptidergic clock neurons in Drosophila: ion transport peptide and short neuropeptide F in subsets of dorsal and ventral lateral neurons. J Comp Neurol 516, 59-73.

      Lamaze, A., Krätschmer, P., Chen, K.F., Lowe, S., and Jepson, J.E.C. (2018). A Wake-Promoting Circadian Output Circuit in Drosophila. Curr Biol 28, 3098-3105.e3093.

      Lear, B.C., Zhang, L., and Allada, R. (2009). The neuropeptide PDF acts directly on evening pacemaker neurons to regulate multiple features of circadian behavior. PLoS Biol 7, e1000154.

      Lee, G., Bahn, J.H., and Park, J.H. (2006). Sex- and clock-controlled expression of the neuropeptide F gene in Drosophila. Proc Natl Acad Sci USA 103, 12580-12585.

      Lelito, K.R., and Shafer, O.T. (2012). Reciprocal cholinergic and GABAergic modulation of the small ventrolateral pacemaker neurons of Drosophila's circadian clock neuron network. J Neurophysiol 107, 2096-2108.

      Ma, D., Przybylski, D., Abruzzi, K.C., Schlichting, M., Li, Q., Long, X., and Rosbash, M. (2021). A transcriptomic taxonomy of Drosophila circadian neurons around the clock. eLife 10.

      Port, F., Chen, H.M., Lee, T., and Bullock, S.L. (2014). Optimized CRISPR/Cas tools for efficient germline and somatic genome engineering in Drosophila. Proc Natl Acad Sci USA 111, E2967-2976.

      Reinhard, N., Schubert, F.K., Bertolini, E., Hagedorn, N., Manoli, G., Sekiguchi, M., Yoshii, T., Rieger, D., and Helfrich-Förster, C. (2022). The Neuronal Circuit of the Dorsal Circadian Clock Neurons in Drosophila melanogaster. Front Physiol 13, 886432.

      Renn, S.C., Park, J.H., Rosbash, M., Hall, J.C., and Taghert, P.H. (1999). A pdf neuropeptide gene mutation and ablation of PDF neurons each cause severe abnormalities of behavioral circadian rhythms in Drosophila. Cell 99, 791-802.

      Shafer, O.T., Helfrich-Förster, C., Renn, S.C., and Taghert, P.H. (2006). Reevaluation of Drosophila melanogaster's neuronal circadian pacemakers reveals new neuronal classes. J Comp Neurol 498, 180-193.

      Shafer, O.T., Kim, D.J., Dunbar-Yaffe, R., Nikolaev, V.O., Lohse, M.J., and Taghert, P.H. (2008). Widespread receptivity to neuropeptide PDF throughout the neuronal circadian clock network of Drosophila revealed by real-time cyclic AMP imaging. Neuron 58, 223-237.

      Zhang, L., Chung, B.Y., Lear, B.C., Kilman, V.L., Liu, Y., Mahesh, G., Meissner, R.A., Hardin, P.E., and Allada, R. (2010). DN1(p) circadian neurons coordinate acute light and PDF inputs to produce robust daily behavior in Drosophila. Curr Biol 20, 591-599.

      Zhao, P., Zhang, Z., Lv, X., Zhao, X., Suehiro, Y., Jiang, Y., Wang, X., Mitani, S., Gong, H., and Xue, D. (2016). One-step homozygosity in precise gene editing by an improved CRISPR/Cas9 system. Cell Res 26, 633-636.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      This paper describes the development and initial validation of an approach-avoidance task and its relationship to anxiety. The task is a two-armed bandit where one choice is 'safer' - has no probability of punishment, delivered as an aversive sound, but also lower probability of reward - and the other choice involves a reward-punishment conflict. The authors fit a computational model of reinforcement learning to this task and found that self-reported state anxiety during the task was related to a greater likelihood of choosing the safe stimulus when the other (conflict) stimulus had a higher likelihood of punishment. Computationally, this was represented by a smaller value for the ratio of reward to punishment sensitivity in people with higher task-induced anxiety. They replicated this finding, but not another finding that this behavior was related to a measure of psychopathology (experiential avoidance), in a second sample. They also tested test-retest reliability in a sub-sample tested twice, one week apart and found that some aspects of task behavior had acceptable levels of reliability. The introduction makes a strong appeal to back-translation and computational validity, but many aspects of the rationale for this task need to be strengthened or better explained. The task design is clever and most methods are solid - it is encouraging to see attempts to validate tasks as they are developed. There are a few methodological questions and interpretation issues, but they do not affect the overall findings. The lack of replicated effects with psychopathology may mean that this task is better suited to assess state anxiety, or to serve as a foundation for additional task development.

      We thank the reviewer for their kind comments and constructive feedback. We agree that the approach taken in this paper appears better suited to state anxiety, and further work is needed to assess/improve its clinical relevance.

      Reviewer #1 (Recommendations For The Authors):

      1) For the introduction, the authors communicate well the appeal of tasks with translational potential, and setting up this translation through computational validity is a strong approach. However, I had some concerns about how the task was motivated in the introduction:

      a) The authors state that current approach-avoidance tasks used in humans do not resemble those used in the non-human literature, but do not provide details on what exactly is missing from these tasks that makes translation difficult.

      Our intention for the section that the reviewer refers to was to briefly convey that historically, approach-avoidance conflict would have been measured either using questionnaires or joystick-based tasks which have no direct non-human counterpart. However, we note that the phrasing was perhaps unfair to recent tasks that were explicitly designed to be translatable across species. Therefore, we have amended the text to the following:

      In humans, on the other hand, approach-avoidance conflict has historically been measured using questionnaires such as the Behavioural Inhibition/Activation Scale (Carver & White, 1994), or cognitive tasks that rely on motor biases, for example by using joysticks to approach/move towards positive stimuli and avoid/move away from negative stimuli, which have no direct non-human counterparts (Guitart-Masip et al., 2012; Kirlic et al., 2017; Mkrtchian et al., 2017; Phaf et al., 2014).

      b) Although back-translation to 'match' human paradigms to non-human animal paradigms is useful for research, this isn't the end goal of task development. What really matters is how well these tasks, whether in humans or not, capture psychopathology-relevant behavior. Many animal paradigms were developed and brought into extensive use because they showed sensitivity to pharmacological compounds (e.g., benzodiazepines). The introduction accepts the validity of these paradigms at face value, and doesn't address whether developing human tests of psychopathology based on sensitivity to existing medication classes is the best way to generate new insights about psychopathology.

      We agree that whilst paradigms with translational and computational validity have merits of their own for neuroscientific theory, clinical validity (i.e. how well the paradigm reflects a phenomenon relevant to psychopathology) is key in the context of clinical applications. While our findings of associations between task performance and self-reported (state) anxiety suggest that our approach is a step in the right direction, the lack of associations with clinical measures was disappointing. Although future work is needed to more directly test the sensitivity of the current approach to psychopathology, this may mean that it, and its non-human counterparts, do not measure behaviours relevant to pathological anxiety. Since our primary focus in this paper was on translational and computational validity, we have opted to discuss the reviewer’s suggestion in the ‘Discussion’ section, as follows:

      Further, it is worth noting that many animal paradigms were developed and widely adopted due to their sensitivity to anxiolytic medication (Cryan & Holmes, 2005). Given the lack of associations with clinical measures in our results, it is possible that current translational models of anxiety may not fully capture behaviours that are directly relevant to pathological anxiety. To develop translational paradigms of clinical utility, future research should place a stronger emphasis on assessing their clinical validity in humans.

      c) The authors may want to bring in the literature on the description-experience gap (e.g., PMID: 19836292) when discussing existing decision tasks and their computational dissimilarity to non-human operant conditioning tasks.

      We thank the reviewer for this useful addition to the introduction. We have now added the following to the ‘Introduction’ section:

      Moreover, evidence from economic decision-making suggests that explicit offers of probabilistic outcomes can impact decision-making differently compared to when probabilistic contingencies need to be learned from experience (referred to as the ‘description-experience gap’; Hertwig & Erev, 2009); this finding raises potential concerns regarding the use of offer-based tasks in humans as approximations of non-human tasks that do not involve explicit offers.

      d) How does one evaluate how computationally similar human vs. non-human tasks are? What are the criteria for making this judgement? Specific to the current tasks, many animal learning tasks are not learning tasks in the same sense that human learning tasks are, in terms of the number of trials used and if the animals are choosing from a learned set of contingencies versus learning the contingencies during the testing.

      The computational similarity of human and non-human strategies in a given translational task can be tested empirically. This can be done by fitting models to the data and assessing whether similar models explain choices, even if parameter distributions might vary across species due to, for example, physiological differences. Indeed, non-human animals require much more training to perform even uni-dimensional reinforcement learning, but once they are trained, it should be possible to model their responses. In fact, it should even be possible to take training data into account in some cases. For example, the training phase of the Vogel/Geller-Seifter preclinical tests requires an animal to learn to emit a certain action (e.g. lever press) simply to obtain some reward. In the next phase, an aversive outcome is introduced as an additional outcome, but one could model both the training and test phase together – the winning model in our studies would be a suitable candidate to model behaviour here. As we also discuss predictive validity in the ‘Discussion’ section, we opted to add the following text there too:

      … computational validity would also need to be assessed directly in non-human animals by fitting models to their behavioural data. This should be possible even in the face of different procedures across species such as number of trials or outcomes used (shock or aversive sound). We are encouraged by our finding that the winning computational model in our study relies on a relatively simple classical reinforcement learning strategy. There exist many studies showing that non-human animals rely on similar strategies during reward and punishment learning (Mobbs et al., 2020; Schultz, 2013); albeit to our knowledge this has never been modelled in non-human animals where rewards and punishment can occur simultaneously.
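      To illustrate what a "relatively simple classical reinforcement learning strategy" with separate reward and punishment terms can look like, here is a minimal sketch. It is not the fitted model from the manuscript: the function names, the delta-rule update, and the logistic choice rule are all assumptions chosen for illustration, guided only by the description of a model with reward and punishment learning rates and reward and punishment sensitivities.

```python
import math


def update(value, outcome, learning_rate):
    """Delta-rule update (assumed form): move the estimate toward the outcome."""
    return value + learning_rate * (outcome - value)


def p_choose_conflict(q_reward_conflict, q_punish_conflict, q_reward_safe,
                      reward_sens, punish_sens):
    """Probability of choosing the conflict option under a logistic choice rule.

    Hypothetical parameterisation: learned reward estimates are scaled by a
    reward sensitivity, learned punishment estimates by a punishment
    sensitivity. The safe option carries no punishment term.
    """
    net_conflict = reward_sens * q_reward_conflict - punish_sens * q_punish_conflict
    net_safe = reward_sens * q_reward_safe
    return 1.0 / (1.0 + math.exp(-(net_conflict - net_safe)))


# One illustrative trial: the conflict option yields a reward and a punishment.
q_reward = update(0.5, outcome=1, learning_rate=0.2)   # reward learning rate
q_punish = update(0.4, outcome=1, learning_rate=0.1)   # punishment learning rate
print(p_choose_conflict(q_reward, q_punish, q_reward_safe=0.5,
                        reward_sens=3.0, punish_sens=3.0))
```

      Under this assumed form, lowering the ratio of reward to punishment sensitivity lowers the probability of choosing the conflict option. The same two functions could in principle be applied to both the training and test phases of an animal task, since the punishment term simply stays at zero until an aversive outcome is introduced.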

      2) What do the authors make of the non-linear relationship between probability of punishment and probability of choosing the conflict stimulus (Fig 2d), especially in the high task-induced anxiety participants? Did this effect show up in the replication sample as well?

      Figures 2c-e were created by binning the continuous predictors of outcome probabilities into discrete bins of equal interval. Since punishment probability varied according to Gaussian random walks, it was also distributed with more of its mass in the central region (~ 0.4), and so values at the extreme bins were estimated on fewer data and with greater variance. The non-linear relationships are likely thus an artefact of our task design and plotting procedure. The pattern was also evident in the replication sample, see Author response image 1:

      Author response image 1.

      However, since these effects were estimated as linear effects in the logistic regression models, and to avoid overfitting/interpretations of noise arising from our task design, we now plot logistic curves fitted to the raw data instead.
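      The binning argument above can be illustrated with a short simulation. This is a sketch under assumed settings, not the task code: the starting value of 0.4, the step standard deviation, the number of trials, and the clipping to [0, 1] are all hypothetical choices. It pools several Gaussian random walks, sorts the values into equal-interval bins, and counts how many observations land in each bin.

```python
import random


def simulate_random_walk(n_trials, start=0.4, step_sd=0.01, seed=0):
    """Gaussian random walk clipped to [0, 1] (all settings are assumptions)."""
    rng = random.Random(seed)
    p = start
    walk = []
    for _ in range(n_trials):
        p = min(1.0, max(0.0, p + rng.gauss(0.0, step_sd)))
        walk.append(p)
    return walk


def bin_counts(values, n_bins=5):
    """Count values falling into n_bins equal-interval bins over [0, 1]."""
    counts = [0] * n_bins
    for v in values:
        idx = min(int(v * n_bins), n_bins - 1)  # a value of 1.0 goes in the top bin
        counts[idx] += 1
    return counts


# Pool 100 hypothetical participants with 200 trials each.
pooled = [p for s in range(100) for p in simulate_random_walk(200, seed=s)]
print(bin_counts(pooled))
```

      With settings like these, most observations fall in the bins around the starting value and few reach the extreme bins, so any per-bin choice proportion at the extremes rests on far fewer trials and is correspondingly noisier.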

      3) How correlated were learning rate and sensitivity parameters? The EM algorithm used here can sometimes result in high correlations among these sets of parameters.

      As the reviewer suspects, the parameters were strongly correlated, especially across the punishment-specific parameters. The Pearson’s r estimates for the untransformed parameter values were as follows:

      Reward parameters: discovery sample r = -0.39; replication sample r = -0.78

      Punishment parameters: discovery sample r = -0.91; replication sample r = -0.85

      We have included the correlation matrices of the estimated parameters as Supplementary Figure 2 in the ‘Computational modelling’ section of the Supplement.

      We have now also re-fitted the winning model using variational Bayesian inference (VBI) via Stan, and found that the cross-parameter correlations were much lower than when the data were fitted using EM. We also ran a sensitivity analysis assessing whether using VBI changed the main findings of our studies. This showed that the correlation between task-induced anxiety and the reward-punishment sensitivity index was robust to fitting method, as was the mediating effect of reward-punishment sensitivity index on anxiety’s effect on choice. This indicates that overall our key findings are robust to different methods of parameter-fitting.

      We now direct readers to these analyses from the new ‘Sensitivity analyses’ section in the manuscript, as follows:

      As our procedure for estimating model parameters (the expectation-maximisation algorithm, see ‘Methods’) produced high inter-parameter correlations in our data (Supplementary Figure 2), we also re-estimated the parameters using Stan’s variational Bayesian inference algorithm (Stan Development Team, 2023) – this resulted in lower inter-parameter correlations, but our primary computational finding, that the effect of anxiety on choice is mediated by relative sensitivity to reward/punishment was consistent across algorithms (see Supplement section 9.8 for details).

      We have included the relevant analyses comparing EM and VBI in the Supplement, as follows:

      [9.8 Sensitivity analysis: estimating parameters via expectation maximisation and variational Bayesian inference algorithms]

      Given that the expectation maximisation (EM) algorithm produced high inter-parameter correlations, we ran a sensitivity analysis by assessing the robustness of our computational findings to an alternative method of parameter estimation – (mean-field) variational Bayesian inference (VBI) via Stan (Stan Development Team, 2023). Since, unlike EM, the results of VBI are very sensitive to initial values, we fitted the data 10 times with different initial values.

      Inter-parameter correlations

      The VBI produced lower inter-parameter correlations than the EM algorithm (Supplementary Figure 8).

      Sensitivity analysis

      Since multicollinearity in the VBI-estimated parameters was lower than for EM, indicating less trade-off in the estimation, we re-tested our computational findings from the manuscript as part of a sensitivity analysis. We first assessed whether we observed the same correlations between task-induced anxiety and punishment learning, and reward-punishment sensitivity index (Supplementary Figure 9a). Punishment learning rate was not significantly associated with task-induced anxiety in any of the 10 VBI iterations in the discovery sample, although it was in 9/10 in the replication sample. On the other hand, the reward-punishment sensitivity index was significantly associated with task-induced anxiety in 9/10 VBI iterations in the discovery sample and all iterations in the replication sample. This suggests that the correlation of anxiety and sensitivity index is robust to these two fitting approaches.

      We also re-estimated the mediation models, where in the EM-estimated parameters, we found that the reward-punishment sensitivity index mediated the relationship between task-induced anxiety and task choice proportions (Supplementary Figure 9b). Again, we found that the reward-punishment sensitivity index was a significant mediator in 9/10 VBI iterations in the discovery sample and all iterations in the replication sample. Punishment learning rate was also a significant mediator in 9/10 iterations in the replication sample, although it was not in the discovery sample for all iterations, and this was not observed for the EM-estimated parameters.

      Overall, we found that our key results, that anxiety is associated with greater sensitivity to punishment over reward, and this mediates the relationship between anxiety and approach-avoidance behaviour, were robust across both fitting methods.

      As an aside, we were unable to run the model fitting using Markov chain Monte Carlo sampling approaches due to the computational power and time required for a sample of this size (Pike & Robinson, 2022, JAMA Psychiatry).

      4) What is the split-half reliability of the task parameters?

      We thank the reviewer for this query. We have now included a brief section on the (good-to-excellent) split-half reliability of the task in the manuscript:

      We assessed the split-half reliability of the task by correlating the overall proportion of conflict option choices and model parameters from the winning model across the first and second half of trials. For overall choice proportion, reliability was simply calculated via Pearson’s correlations. For the model parameters, we calculated model-derived estimates of Pearson’s r values from the parameter covariance matrix when first- and second-half parameters were estimated within a single model, following a previous approach recently shown to accurately estimate parameter reliability (Waltmann et al., 2022). We interpreted indices of reliability based on conventional values of < 0.40 as poor, 0.4 - 0.6 as fair, 0.6 - 0.75 as good, and > 0.75 as excellent reliability (Fleiss, 1986). Overall choice proportion showed good reliability (discovery sample r = 0.63; replication sample r = 0.63; Supplementary Figure 5). The model parameters showed good-to-excellent reliability (model-derived r values ranging from 0.61 to 0.85 [0.76 to 0.92 after Spearman-Brown correction]; Supplementary Figure 5).
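      For readers unfamiliar with the Spearman-Brown correction mentioned above, the following sketch shows the standard prophecy formula for doubling test length, r_full = 2r / (1 + r), applied to the two endpoints of the reported range. The helper that maps a value onto the reliability labels is hypothetical, and how it treats values lying exactly on a boundary is an assumption, since the quoted ranges overlap at their edges.

```python
def spearman_brown(r_half):
    """Spearman-Brown prophecy formula for a test of doubled length."""
    return 2 * r_half / (1 + r_half)


def reliability_label(r):
    """Map a reliability estimate onto the conventional labels quoted above.

    Boundary handling (e.g. whether exactly 0.60 counts as fair or good) is
    an assumption made for this sketch.
    """
    if r < 0.40:
        return "poor"
    if r < 0.60:
        return "fair"
    if r <= 0.75:
        return "good"
    return "excellent"


for r in (0.61, 0.85):
    corrected = spearman_brown(r)
    print(r, round(corrected, 2), reliability_label(r), reliability_label(corrected))
```

      Rounded to two decimals, the corrected values for 0.61 and 0.85 are 0.76 and 0.92, matching the bracketed range in the text.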

      5) The authors do a good job of avoiding causal language when setting up the cross-sectional mediation analysis, but depart from this in the discussion (line 335). Without longitudinal data, they cannot claim that "mediation analyses revealed a mechanism of how anxiety induces avoidance".

      Thank you for spotting this, we have now amended the text to:

      … mediation analyses suggested a potential mechanism of how anxiety may induce avoidance.

      Reviewer #2 (Public Review):

      Summary:

      The authors develop a computational approach-avoidance-conflict (AAC) task, designed to overcome limitations of existing offer based AAC tasks. The task incorporated likelihoods of receiving rewards/ punishments that would be learned by the participants to ensure computational validity and estimated model parameters related to reward/punishment and task induced anxiety. Two independent samples of online participants were tested. In both samples participants who experienced greater task induced anxiety avoided choices associated with greater probability of punishment. Computational modelling revealed that this effect was explained by greater individual sensitivities to punishment relative to rewards.

      Strengths:

      Large internet-based samples, with discovery sample (n = 369), pre-registered replication sample (n = 629) and test-retest sub group (n = 57). Extensive compliance measures (e.g. audio checks) seek to improve adherence.

      There is a great need for RL tasks that model threatening outcomes rather than simply loss of reward. The main model parameters show strong effects and the additional indices with task based anxiety are a useful extension. Associations were broadly replicated across samples. Fair to excellent reliability of model parameters is encouraging and badly needed for behavioral tasks of threat sensitivity.

      We thank the reviewer for their comments and constructive feedback.

      The task seems to have lower approach bias than some other AAC tasks in the literature. Although this was inferred by looking at Fig 2 (it doesn't seem to drop below 46%) and Fig 3d seems to show quite a strong approach bias when using a reward/punishment sensitivity index. It would be good to confirm some overall stats on % of trials approached/avoided overall.

      The range of choice proportions is indeed an interesting statistic that we have now included in the manuscript:

      Across individuals, there was considerable variability in overall choice proportions (discovery sample: mean = 0.52, SD = 0.14, min/max = [0.03, 0.96]; replication sample: mean = 0.52, SD = 0.14, min/max = [0.01, 0.99]).

      Weaknesses:

      The negative reliability of punishment learning rate is concerning as this is an important outcome.

      We agree that this is a concerning finding. As reviewer 3 notes, this may have been due to participants having control over the volume used to play the aversive sounds in the task (see below for our response to this point). Future work with better controlled experimental settings will be needed to determine the reliability of this parameter more accurately.

      This may also have been due to the asymmetric nature of the task, as only one option could produce the punishment. This means that there were fewer trials on which to estimate learning about the occurrence of a punishment. Future work using continuous outcomes, as the reviewer suggests below, whilst keeping the asymmetric relationship between the options, could help in this regard.

      We have included the following comment on this issue in the manuscript:

      Alternatively, as participants self-determined the loudness of the punishments, differences in volume settings across sessions may have impacted the reliability of this parameter (and indeed punishment sensitivity). Further, the asymmetric nature of the task may have impacted our ability to estimate the punishment learning rate, as there were fewer occurrences of the punishment compared to the reward.

      The Kendall's tau values underlying task induced anxiety and safety reference/ various indices are very weak (all < 0.1), as are the mediation effects (all beta < 0.01). This should be highlighted as a limitation, although the interaction with P(punishment|conflict) does explain some of this.

      We now include references to the effect sizes to emphasise this limitation. We also note, as the reviewer suggests, that this may be due to crudeness of overall choice proportion as a measure of approach/avoidance, as it is contaminated with variables such as P(punishment|conflict).

      One potentially important limitation of our findings is the small effect size observed in the correlation between task-induced anxiety and avoidance (Kendall's tau values < 0.1, mediation betas < 0.01). This may be attributed to the simplicity of using overall choice proportion as a measure of approach/avoidance, as the effect of anxiety on choice was also influenced by punishment probability.

      The inclusion of only one level of reward (and punishment) limits the ecological validity of the sensitivity indices.

      We agree that using multi-level outcomes will be an important question for future work and now explicitly note this in the manuscript, as below:

      Using multi-level or continuous outcomes would also improve the ecological validity of the present approach and interpretation of the sensitivity parameters.

      Appraisal and impact:

      Overall this is a very strong paper, describing a novel task that could help move the field of RL forward to take account of threat processing more fully. The large sample size with discovery, replication and test-retest gives confidence in the findings. The task has good ecological validity and associations with task-based anxiety and clinical self-report demonstrate clinical relevance. The authors could give further context but test-retest of the punishment learning parameter is the only real concern. Overall this task provides an exciting new probe of reward/threat that could be used in mechanistic disease models.

      We thank the reviewer again for helping us to improve our analyses and manuscript.

      Reviewer #2 (Recommendations For The Authors):

      Additional context:

      In the introduction "cognitive tasks that bear little semblance to those used in the non-human literature" seems a little unfair. One study that is already cited (Ironside et al, 2020) used a task that was adapted from non-human primates for use in humans. It has almost identical visual stimuli (different levels of simultaneous reward and aversive outcome/punishment) and response selection processes (joystick) between species and some overlapping brain regions were activated across species for conflict and aversiveness. The later point that non-human animals must be trained on the association between action and outcome is well taken from the point of view of computational validity but perhaps not sufficient to justify the previous statement.

      Our intention for this section was to briefly convey that historically, approach-avoidance conflict would have been measured either using questionnaires or joystick-based tasks which have no direct non-human counterpart. However, we agree that this phrasing is unfair to recent studies such as those by Ironside and colleagues. Therefore, we have amended the text to the following:

      In humans, on the other hand, approach-avoidance conflict has historically been measured using questionnaires such as the Behavioural Inhibition/Activation Scale (Carver & White, 1994), or cognitive tasks that rely on motor biases to approach/move towards positive stimuli and avoid/move away from negative stimuli which have no direct non-human counterparts (Guitart-Masip et al., 2012; Kirlic et al., 2017; Mkrtchian et al., 2017; Phaf et al., 2014).

      It would be good to speculate on why task induced anxiety made participants slower to update their estimates of punishment probability.

      Although a meta-analysis of reinforcement learning studies using reward and punishment outcomes suggests a positive association between punishment learning rate and anxiety symptoms (and depressed mood), we paradoxically found the opposite effect. However, previous work has suggested that distinct forms of anxiety associate differently with punishment learning rate (Wise & Dolan, 2020, Nat. Commun.), where somatic anxiety was negatively correlated with punishment learning rate whereas cognitive anxiety showed the opposite effect. We have now added the following to the manuscript, and noted that future work is needed to understand the potentially complex relationship between anxiety and learning from punishments:

      Notably, although a recent computational meta-analysis of reinforcement learning studies showed that symptoms of anxiety and depression are associated with elevated punishment learning rates (Pike & Robinson, 2022), we did not observe this pattern in our data. Indeed, we even found the contrary effect in relation to task-induced anxiety, specifically that anxiety was associated with lower rates of learning from punishment. However, other work has suggested that the direction of this effect can depend on the form of anxiety, where cognitive anxiety may be associated with elevated learning rates, but somatic anxiety may show the opposite pattern (Wise & Dolan, 2020) and this may explain the discrepancy in findings. Additionally, parameter values are highly dependent on task design (Eckstein et al., 2022), and study designs to date may be more optimised in detecting differences in learning rate (Pike & Robinson, 2022) – future work is needed to better understand the potentially complex association between anxiety and punishment learning rate. Lastly, as punishment learning rate was severely unreliable in the test-retest analyses, and the associations between punishment learning rate and state anxiety were not robust to an alternative method of parameter estimation (variational Bayesian inference), the negative correlation observed in our study should be treated with caution.

      Were those with more task-based anxiety more inflexible in general?

      The lack of associations across reward learning rate and task-induced anxiety suggest that this was not a general inflexibility effect. To test the reviewer’s hypothesis more directly, we conducted a sensitivity analysis by examining the model with a general learning rate – this did not support a general inflexibility effect. Please see the new section in the Supplement below:

      [9.10 Sensitivity analysis: anxiety and inflexibility]

      As anxious participants were slower to update their estimates of punishment probability, we determined whether this was due to greater general inflexibility by examining the model including two sensitivity parameters, but one general learning rate (i.e. not split by outcome). The correlation between this general learning rate and task-induced anxiety was not significant in either sample (discovery: tau = -0.02, p = 0.504; replication: tau = -0.01, p = 0.625), suggesting that the effect is specific to punishment.

      Was the 16% versus 20% of the two samples with clinically relevant anxiety symptoms significantly different? What about other demographics in the two samples?

      The proportions were not significantly different (χ2 = 2.33, p = 0.127). The discovery sample included more females and was older on average compared to the replication sample – information which we now report in the manuscript:

      The discovery sample consisted of a significantly greater proportion of female participants than the replication sample (59% vs 52%, χ2 = 4.64, p = 0.031). The average age was significantly different across samples (discovery sample mean = 37.7, SD = 10.3, replication sample mean = 34.3, SD = 10.4; t785.5 = 5.06, p < 0.001). The differences in self-reported psychiatric symptoms across samples did not reach significance (p > 0.086).

      It would be interesting to know how many participants failed the audio attention checks.

      We have now included information about what proportion of participants failed each of the task exclusion criteria in the manuscript:

      Firstly, we excluded participants who missed a response to more than one auditory attention check (see above; 8% in both discovery and replication samples) – as these occurred infrequently and the stimuli used for the checks were played at relatively low volume, we allowed for incorrect responses so long as a response was made. Secondly, we excluded those who responded with the same response key on 20 or more consecutive trials (> 10% of all trials; 4/6% in discovery and replication samples, respectively). Lastly, we excluded those who did not respond on 20 or more trials (1/2% in discovery and replication samples, respectively). Overall, we excluded 51 out of 423 (12%) in the discovery sample, and 98 out of 725 (14%) in the replication sample.

      There doesn't appear to be a model with only learning from punishment (i.e. no reward learning) included in the model comparison. It would be interesting to see how it compared.

      We have fitted the suggested model and found that it is the least parsimonious of the models. Since participants were monetarily incentivised based on the rewards only, this was to be expected. We have now added this ‘punishment learning only’ model and its variant including a lapse term into the model comparison. The two lowest bars on the y-axis in Author response image 2 represent these models.

      Author response image 2.

      Were sex effects examined, as these have been commonly found in AAC tasks? How about other covariates such as age?

      We have now tested the effects of sex and age on behaviour and on parameter values. There were indeed some significant effects, albeit with some inconsistencies across the two samples, which for completeness we have included in the manuscript, as follows:

      While sex was significantly associated with choice in the discovery sample (β = 0.16 ± 0.07, p = 0.028) with males being more likely to choose the conflict option, this pattern was not evident in the replication sample (β = 0.08 ± 0.06, p = 0.173), and age was not associated with choice in either sample (p > 0.2).

      Comparing parameters across sexes via Welch’s t-tests revealed significant differences in reward sensitivity (t289 = -2.87, p = 0.004, d = 0.34; lower in females) and consequently reward-punishment sensitivity index (t336 = -2.03, p = 0.043, d = 0.22; lower in females i.e. more avoidance-driven). In the replication sample, we observed the same effect on reward-punishment sensitivity index (t626 = -2.79, p = 0.005, d = 0.22; lower in females). However, the sex difference in reward sensitivity did not replicate (p = 0.441), although we did observe a significant sex difference in punishment sensitivity in the replication sample (t626 = 2.26, p = 0.024, d = 0.18).

      Minor: Still a few placeholders (Supplementary Table X/ Table X) in the methods

      We thank the reviewer for spotting these errors. We have now corrected these references.

      Reviewer #3 (Public Review):

      This study investigated cognitive mechanisms underlying approach-avoidance behavior using a novel reinforcement learning task and computational modelling. Participants could select a risky "conflict" option (latent, fluctuating probabilities of monetary reward and/or unpleasant sound [punishment]) or a safe option (separate, generally lower probability of reward). Overall, participant choices were skewed towards more rewarded options, but were also repelled by increasing probability of punishment. Individual patterns of behavior were well-captured by a reinforcement learning model that included parameters for reward and punishment sensitivity, and learning rates for reward and punishment. This is a nice replication of existing findings suggesting reward and punishment have opposing effects on behavior through dissociated sensitivity to reward versus punishment.

      Interestingly, avoidance of the conflict option was predicted by self-reported task-induced anxiety. This effect of anxiety was mediated by the difference in modelled sensitivity to reward versus punishment (relative sensitivity). Importantly, when a subset of participants were retested over 1 week later, most behavioral tendencies and model parameters were recapitulated, suggesting the task may capture stable traits relevant to approach-avoidance decision-making.

      We thank the reviewer for their useful analysis of our study. Indeed, it was reassuring to see that performance indices were reliable across time.

      However, interpretation of these findings is severely undermined by the fact that the aversiveness of the auditory punisher was largely determined by participants, with the far-reaching impacts of this not being accounted for in any of the analyses. The manipulation check to confirm participants did not mute their sound is highly commendable, but the thresholding of punisher volume to "loud but comfortable" at the outset of the task leaves substantial scope for variability in the punisher delivered to participants. Indeed, participants' ratings of the unpleasantness of the punishment were moderate and highly variable (M = 31.7 out of 50, SD = 12.8 [distribution unreported]). Despite having this rating, it is not incorporated into analyses. It is possible that the key finding of relationships between task-induced anxiety, reward-punishment sensitivity and avoidance are driven by differences in the punisher experienced; a louder punisher is more unpleasant, driving greater task-induced anxiety, model-derived punishment sensitivity, and avoidance (and vice versa). This issue can also explain the counterintuitive findings from re-tested participants; lower/negatively correlated task-induced anxiety and punishment-related cognitive parameters may have been due to participants adjusting their sound settings to make the task less aversive (retest punisher rating not reported). It can therefore be argued that the task may not actually capture meaningful cognitive/motivational traits and their effects on decision-making, but instead spurious differences in punisher intensity.

      We thank the reviewer for raising this important potential limitation of our study. We agree that how participants self-adjusted their sound volume may have important consequences for our interpretations of the data. Unfortunately, despite the scalability of online data collection, this highlights one of its major weaknesses in the lack of controllability over experimental parameters. The previous paper from which we obtained our aversive sounds (Seow & Hauser, 2021, Behav Res, doi.org/10.3758/s13428-021-01643-0) contains useful analyses with regards to this discussion. When comparing the unpleasantness of the sounds played at 50% vs 100% volume, the authors indeed found that the lower volumes led to lower unpleasantness ratings. However, the magnitude of this effect did not appear to be substantial (Fig. 4 from the paper), and even at 50% volume, the scream sounds we used were rated in the top quartile for unpleasantness, on average. This implies that the sounds have sufficient inherent unpleasantness, even when played at half intensity. We find this reassuring, in the sense that any self-imposed volume effects may not be large. Of note, our instructions to participants to adjust the volume to a ‘loud but comfortable’ level was based on the same phrasing used in this study.

      To the reviewer’s point on how this might affect the reliability of the task, we have included the following in the ‘Discussion’ section:

      Alternatively, as participants self-determined the loudness of the punishments, differences in volume settings across sessions may have impacted the reliability of this parameter (and indeed other measures).

      Please see below for analyses accounting for punishment unpleasantness ratings.

      This undercuts the proposed significance of this task as a translational tool for understanding anxiety and avoidance. More information about ratings of punisher unpleasantness and its relationship to task behavior, anxiety and cognitive parameters would be valuable for interpreting findings. It would also be of interest whether the same results were observed if the aversiveness of the punisher was titrated prior to the task.

      As suggested, we have now included sensitivity analyses using the unpleasantness ratings that show their effect is minimal on our primary inference. We report relevant results below in the ‘Recommendations For The Authors’ section. At the same time, we think it is important to acknowledge that unpleasantness is a combination of both the inherent unpleasantness of the sound and the volume it is presented at, where only the latter is controlled by the participant. Therefore, these analyses are not a perfect indicator of the effect of participant control. For convenience, we reproduce the key findings from this sensitivity analysis here:

      Approach-avoidance hierarchical logistic regression model

      We assessed whether approach and avoidance responses, and their relationships with state anxiety, were impacted by punishment unpleasantness, by including unpleasantness ratings as a covariate into the hierarchical logistic regression model. Whilst unpleasantness was a significant predictor of choice (positively predicting safe option choices), all significant predictors and interaction effects from the model without unpleasantness survived (Supplementary Figure 11). Critically, this suggests that punishment unpleasantness does not account for all of the variance in the relationship between anxiety and avoidance.

      Mediation model

      When unpleasantness ratings were included in the mediation models, the mediating effect of the reward-punishment sensitivity index did not survive (discovery sample: standardised β = 0.003 ± 0.003, p = 0.416; replication sample: standardised β = 0.004 ± 0.003, p = 0.100; Supplementary Figure 12). Pooling the samples resulted in an effect that narrowly missed the significance threshold (standardised β = 0.004 ± 0.002, p = 0.068).

      More generally, whether or not to titrate the punishments (and indeed the rewards) is an interesting experimental decision, which we think should be guided by the research question. In our case, we were interested in individual differences in reward/punishment learning and sensitivity and their relation to anxiety, so variation in how the aversiveness of the sounds affected approach-avoidance decisions was an important aspect of our design. In studies where the aim is to understand more general processes of how humans act under approach-avoidance conflict, it may be better to tightly control the salience of reinforcers.

      Ultimately, the best test of the causal role of anxiety on avoidance, and against the hypothesis that our results were driven by spurious volume control effects, would be to run within-subjects anxiety interventions, where these volume effects are naturally accounted for. This will be an important direction for future studies using similar measures. We have added a paragraph in the ‘Discussion’ section on this point:

      Relatedly, participants had some control over the intensity at which the punishments were presented, which may have driven our findings relating to anxiety and putative mechanisms of anxiety-related avoidance. Sensitivity analyses showed that our finding that anxiety is positively associated with avoidance in the task was robust to individual differences in self-reported punishment unpleasantness, whilst the mediation effects were not. Future work imposing better control over the stimuli presented, and/or using within-subjects designs will be needed to validate the role of reward/punishment sensitivities in anxiety-related avoidance.

      Although the procedure and findings reported here remain valuable to the field, claims of novelty including its translational potential are perhaps overstated. This study complements and sits within a much broader literature that investigates roles for aversion and cognitive traits in approach-avoidance decisions. This includes numerous studies that apply reinforcement learning models to behavior in two-choice tasks with latent probabilities of reward and punishment (e.g., see doi: 10.1001/jamapsychiatry.2022.0051), as well as other translationally-relevant paradigms (e.g., doi: 10.3389/fpsyg.2014.00203, 10.7554/eLife.69594, etc).

      We agree with the reviewer that our approach builds on previous work in reinforcement learning, approach-avoidance conflict and translational measures of anxiety. Whilst there are by now many studies using two-choice learning tasks with latent reward and punishment probabilities, our main aim, which we refer to as ‘novel’, was to bring these fields together in such a way as to model anxiety-related behaviour.

      We note that we do not make strong statements about whether these effects speak to traits per se, and as Reviewer 1 notes, the evidence from our study suggests that the present measure may be better suited to assessing state anxiety. While computational model parameters can be, and certainly often are, interpreted as constituting stable individual traits, a simpler interpretation of our findings may be that state anxiety is associated with a momentary preference for punishment avoidance over reward pursuit. This can still be informative for the study of anxiety, especially given the notion of a continuous relationship between adaptive/state anxiety and maladaptive/persistent anxiety.

      Having said that, we agree with the underlying premise of the reviewer’s point that how the measure relates to trait-level avoidance/inhibition measures will be an interesting question for future work. We appreciate the importance of using tasks such as ours and those highlighted by the reviewer as trait-level measures, especially in computational psychiatry. We have now included a discussion on the potential roles of cognitive/motivational traits, in line with the reviewer’s recommendation – briefly, we have included the suggested references by the reviewer, discussed the measure’s potential relevance to cognitive/motivational traits, and direct interested readers to the broader literature. Please see below for details.

      Reviewer #3 (Recommendations For The Authors):

      As stated in the public review, punisher unpleasantness and its relationship to key findings (including for retest) should be reported and discussed.

      We signpost readers to our new analyses, incorporating unpleasantness ratings into the statistical models, from the main manuscript as follows:

      Since participants self-determined the volume of the punishments in the task, and therefore (at least in part) their aversiveness, we conducted sensitivity analyses by accounting for self-reported unpleasantness ratings of the punishment (see the Supplement). Our finding that anxiety impacts approach-avoidance behaviour was robust to this sensitivity analysis (p < 0.001), however the mediating effect of the reward-punishment sensitivity index was not (p > 0.1; see Supplement section 9.9 for details).

      We reproduce the relevant section from the Supplement below. Overall, we found that the effect of anxiety on choices (via its interaction with punishment probability) remained significant after accounting for unpleasantness, however the mediating effect of reward-punishment sensitivity was no longer significant when unpleasantness ratings were included in the model. As noted above, unpleasantness ratings are not a perfect measure of self-imposed sound volume, and indeed punishment sensitivity is essentially a computationally-derived measure of unpleasantness, which makes it difficult to interpret the mediation model which contains both of these measures. However, since we found that anxiety affected choice over and above any effects of self-imposed sound volume (using unpleasantness ratings as a proxy measure), we argue that the task still holds value as a model of anxiety-related avoidance.

      [Supplement Section 9.9: Sensitivity analyses of punishment unpleasantness]

      Distribution of unpleasantness

      The punishments were rated as unpleasant by the participants, on average (discovery sample: mean rating = 31.1 [scored between 0 and 50], SD = 13.1; replication sample: mean rating = 32.1, SD = 12.7; Supplementary Figure 10).

      Approach-avoidance hierarchical logistic regression model

      We assessed whether approach and avoidance responses, and their relationships with state anxiety, were impacted by punishment unpleasantness, by including unpleasantness ratings as a covariate into the hierarchical logistic regression model. Whilst unpleasantness was a significant predictor of choice (positively predicting safe option choices), all significant predictors and interaction effects from the model without unpleasantness ratings survived (Supplementary Figure 11). Critically, this suggests that punishment unpleasantness does not account for all of the variance in the relationship between anxiety and avoidance.

      Mediation model

      When unpleasantness ratings were included in the mediation models, the mediating effect of the reward-punishment sensitivity index did not survive (discovery sample: standardised β = 0.003 ± 0.003, p = 0.416; replication sample: standardised β = 0.004 ± 0.003, p = 0.100; Supplementary Figure 12). Pooling the samples resulted in an effect that narrowly missed the significance threshold (standardised β = 0.004 ± 0.002, p = 0.068).

      Test-retest reliability of unpleasantness

      The test-retest reliability of unpleasantness ratings was excellent (ICC(3,1) = 0.75), although participants gave significantly lower ratings in the second session (t56 = 2.7, p = 0.008, d = 0.37; mean difference of 3.12, SD = 8.63).
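
      The ICC(3,1) statistic quoted above is a consistency-type reliability index, which is why it can be high even though ratings dropped in the second session. The following is only a minimal sketch of the standard two-way ANOVA formula, assuming ratings are arranged with one row per participant and one column per session. The function name `icc_3_1` and the toy numbers are illustrative assumptions and are not taken from the study or its analysis code.

```python
def icc_3_1(ratings):
    """ICC(3,1): two-way mixed effects, consistency, single measurement.

    `ratings` is a list of rows, one row per participant and one value
    per session (hypothetical layout chosen for this sketch).
    """
    n = len(ratings)           # participants
    k = len(ratings[0])        # sessions
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for participants (rows), sessions (columns), and total
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)


# Toy data: every participant rates the sound 3 points lower at retest.
shifted = [[10, 7], [20, 17], [30, 27]]
# Toy data: the rank order of participants is only partly preserved.
noisy = [[1, 2], [2, 1], [3, 3]]

print(icc_3_1(shifted))
print(icc_3_1(noisy))
```

      In the first toy data set the uniform drop between sessions is absorbed by the session term, so the consistency ICC is unaffected by it. An absolute-agreement ICC would penalise that shift.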

      Reliability of other measures with/out unpleasantness

      To assess the effect of accounting for unpleasantness ratings on reliability estimates of task performance, we extracted variance components from linear mixed models, following a standard approach (Nakagawa et al., 2017) – note that this was not the method used to estimate reliability values in the main analyses, but we used this specific approach to compare the reliability values with and without the covariate of unpleasantness ratings. The results indicated that unpleasantness ratings did not have a material effect on reliability (Supplementary Figure 14).

      We discuss the findings of these sensitivity analyses in the ‘Discussion’ section, as follows:

      Relatedly, participants had some control over the intensity at which the punishments were presented, which may have driven our findings relating to anxiety and putative mechanisms of anxiety-related avoidance. Sensitivity analyses showed that our finding that anxiety is positively associated with avoidance in the task was robust to individual differences in self-reported punishment unpleasantness, whilst the mediation effects were not. Future work imposing better control over the stimuli presented, and/or using within-subjects designs will be needed to validate the role of reward/punishment sensitivities in anxiety-related avoidance.

      Introduction and discussion should spend more time relating the task and current findings to existing procedures and findings examining individual differences in avoidance and cognitive/motivational correlates.

      We thank the reviewer for the opportunity to expand on the literature. Whilst there are numerous behavioural paradigms in both the human and non-human literature that involve learning about rewards and punishments, our starting point for the introduction was the state-of-the-art in translational approach-avoidance conflict models of anxiety. Therefore, for the sake of brevity and logical flow of our introduction, we have opted to bring in the discussion on other procedures primarily in the ‘Discussion’ section of the manuscript.

      We have now included the reviewer’s suggested citations from their ‘Public Review’ as follows:

      Since we developed our task with the primary focus on translational validity, its design diverges from other reinforcement learning tasks that involve reward and punishment outcomes (Pike & Robinson, 2022). One important difference is that we used distinct reinforcers as our reward and punishment outcomes, compared to many studies which use monetary outcomes for both (e.g. earning and losing £1 constitute the reward and punishment, respectively; Aylward et al., 2019; Jean-Richard-Dit-Bressel et al., 2021; Pizzagalli et al., 2005; Sharp et al., 2022). Other tasks have been used that induce a conflict between value and motor biases, relying on prepotent biases to approach/move towards rewards and withdraw from punishments, which makes it difficult to approach punishments and withdraw from rewards (Guitart-Masip et al., 2012; Mkrtchian et al., 2017). However, since translational operant conflict tasks typically induce a conflict between different types of outcome (e.g. food and shocks/sugar and quinine pellets; Oberrauch et al., 2019; van den Bos et al., 2014), we felt it was important to implement this feature. One study used monetary rewards and shock-based punishments, but also included four options for participants to choose from on each trial, with rewards and punishments associated with all four options (Seymour et al., 2012). This effectively requires participants to maintain eight probability estimates (i.e. reward and punishment at each of the four options) to solve the task, which may be too difficult for non-human animals to learn efficiently.

      We have also included a discussion on the measure’s potential relevance to cognitive/motivational traits as follows:

      Finally, whilst there is a broad literature on the roles of behavioural inhibition and avoidance tendency traits on decision-making and behaviour (Carver & White, 1994; Corr, 2004; Gray, 1982), we did not replicate the correlation of experiential avoidance and avoidance responses or the reward-punishment sensitivity index. Since there were also no significant correlations across task performance indices and clinical symptom measures, our findings suggest that the measure may be more sensitive to behaviours relating to state anxiety, rather than more stable traits. Nevertheless, how performance in the present task relates to other traits such as behavioural approach/inhibition tendencies (Carver & White, 1994), as has been found in previous studies on reward/punishment learning (Sharp et al., 2022; Wise & Dolan, 2020) and approach-avoidance conflict (Aupperle et al., 2011), will be an important question for future work.

      We also now direct readers to a recent, comprehensive review on applying computational methods to approach-avoidance behaviours in the ‘Introduction’ section:

      A fundamental premise of this approach is that the brain acts as an information-processing organ that performs computations responsible for observable behaviours, including approach and avoidance (for a recent review on the application of computational methods to approach-avoidance conflict, see Letkiewicz et al., 2023).

      I am curious why participants were excluded if they made the same response on 20+ consecutive trials. How does this represent a cut-off between valid versus invalid behavioral profiles?

      We apologise for the lack of clarity on this point in our original submission – this exclusion criterion was specifically if participants used the same response key (e.g. the left arrow button) on 20 or more consecutive trials, indicating inattention. Since the left-right positions of the stimuli were randomised across trials, this did not exclude participants who frequently chose the same option. However, as we show in the Supplement, this, along with the other exclusion criteria, did not affect our main findings.

      We have now clarified this as follows:

      … we excluded those who responded with the same response key on 20 or more consecutive trials (> 10% of all trials; 4%/6% in discovery and replication samples, respectively) – note that as the options randomly switched sides on the screen across trials, this did not exclude participants who frequently and consecutively chose a certain option.
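
      The distinction between repeating a response key and repeating an option can be illustrated with a short sketch. This is only an illustration under stated assumptions: the function names, the key labels and the default threshold argument are hypothetical, and this is not the authors' exclusion code.

```python
from itertools import groupby


def longest_same_key_run(keys):
    """Length of the longest run of identical consecutive response keys."""
    return max((sum(1 for _ in run) for _, run in groupby(keys)), default=0)


def should_exclude(keys, threshold=20):
    """Flag a participant who pressed one key on `threshold` or more consecutive trials."""
    return longest_same_key_run(keys) >= threshold


# A participant who always picks the same option, but whose key presses
# vary because the options switch sides across trials (toy sequence).
same_option_keys = ["left", "right", "right", "left"] * 10
# A participant who presses the left key on 20 consecutive trials.
same_key_keys = ["right"] * 5 + ["left"] * 20 + ["right"] * 5

print(should_exclude(same_option_keys))
print(should_exclude(same_key_keys))
```

      Because the check runs on key presses rather than on chosen options, a strong preference for one option is not flagged as long as the side randomisation breaks up the key sequence.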

    1. Author response:

      The following is the authors’ response to the current reviews.

      Public Reviews:

      Reviewer #2 (Public review):

      Summary:

      This work by Grogan and colleagues aimed to translate animal studies showing that acetylcholine plays a role in motivation by modulating the effects of dopamine on motivation. They tested this hypothesis with a placebo-controlled pharmacological study administering a muscarinic antagonist (trihexyphenidyl; THP) to a sample of 20 adult men performing an incentivized saccade task while undergoing electroencephalography (EEG). They found that reward increased vigor and reduced reaction times (RTs) and, importantly, these reward effects were attenuated by trihexyphenidyl. High incentives increased preparatory EEG activity (contingent negative variation), and though THP also increased preparatory activity, it also reduced this reward effect on RTs.

      Strengths:

      The researchers address a timely and potentially clinically relevant question with a within-subject pharmacological intervention and a strong task design. The results highlight the importance of the interplay between dopamine and other neurotransmitter systems in reward sensitivity and even though no Parkinson's patients were included in this study, the results could have consequences for patients with motivational deficits and apathy if validated in the future.

      Weaknesses:

      The main weakness of the study is the small sample size (N=20) that unfortunately is limited to men only. Generalizability and replicability of the conclusions remain to be assessed in future research with a larger and more diverse sample size and potentially a clinically relevant population. The EEG results do not shape a concrete mechanism of action of the drug on reward sensitivity.

      We thank the reviewer for their time and their assessment of this manuscript, and we appreciate their helpful comments on the previous version.

      We agree that the sample size being smaller than planned due to the pandemic restrictions is a weakness for this study, and hope that future studies into cholinergic effects on motivation in humans will use larger sample sizes. They should also ensure women are not excluded from sample populations, which will become even more important if the research progresses to clinical populations.

      Reviewer #3 (Public review):

      Summary:

      Grogan et al examine a role for muscarinic receptor activation in action vigor in a saccadic system. This work is motivated by a strong literature linking dopamine to vigor, and some animal studies suggesting that ACH might modulate these effects, and is important because patient populations with symptoms related to reduced vigor are prescribed muscarinic antagonists. The authors use a motivated saccade task with distractors to measure the speed and vigor of actions in humans under placebo or muscarinic antagonism. They show that muscarinic antagonism blunts the motivational effects of reward on both saccade velocity and RT, and also modulates the distractibility of participants, in particular by increasing the repulsion of saccades away from distractors. They show that preparatory EEG signals reflect both motivation and drug condition, and make a case that these EEG signals mediate the effects of the drug on behavior.

      Strengths:

      This manuscript addresses an interesting and timely question and does so using an impressive within subject pharmacological design and a task well designed to measure constructs of interest. The authors show clear causal evidence that ACH affects different metrics of saccade generation related to effort expenditure and their modulation by incentive manipulations. The authors link these behavioral effects to motor preparatory signatures, indexed with EEG, that relate to behavioral measures of interest and in at least one case statistically mediate the behavioral effects of ACH antagonism.

      Weaknesses:

      A primary weakness of this paper is the sample size - since only 20 participants completed the study. The authors address the sample size in several places and I completely understand the reason for the reduced sample size (study halt due to covid). Nonetheless, it is worth stating explicitly that this sample size is relatively small for the effect sizes typically observed in such studies highlighting the need for future confirmatory studies.

      We thank the reviewer for their time and their assessment of this manuscript, and we appreciate their helpful comments on the previous version.

      We agree that the small sample size is a weakness of the study, and hope that future work into cholinergic modulation of motivation can involve larger samples to replicate and extend this work.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      Thank you for addressing my comments and clarifying the analysis sections. Women can be included in such studies by performing a pregnancy test before each test session, but I understand how this could have added to the pandemic limitations. Best of luck with your future work!

      Thank you for your time in reviewing this paper, and your helpful comments.

      Reviewer #3 (Recommendations for the authors):

      The authors have done a great job at addressing my concerns and I think that the manuscript is now very solid. That said, I have one minor concern.

      Thank you for your time in reviewing this paper, and your helpful comments.

      For descriptions of mass univariate analyses and cluster correction, I am still a bit confused on exactly what terms were in the regression. In one place, the authors state:

      On each iteration we shuffled the voltages across trials within each condition and person, and regressed it against the behavioural variable, with the model 'variable ~1 + voltage + incentive*distractorPresent*THP + (1 | participant)'.

      I take this to mean that the regression model includes a voltage regressor and a three-way interaction term, along with participant level intercept terms.

      However, elsewhere, the authors state:

      "We regressed each electrode and time-point against the three behavioural variables separately, while controlling for effects of incentive, distractor, THP, the interactions of those factors, and a random effect of participant."

      I take this to mean that the regression model included regressors for incentive, distractorPresent, THP, along with their 2 and 3 way interactions. I think that this seems like the more reasonable model - but I just want to 1) verify that this is what the authors did and 2) encourage them to articulate this more clearly and consistently throughout.

      We apologise for the lack of clarity about the whole-brain regression analyses.

      We used Wilkinson notation for this formula, where ‘A*B’ denotes ‘A + B + A:B’, so all main effects and lower-order interactions terms were included in the regression, as your second interpretation says. The model written out in full would be:

      'variable ~1 + voltage + incentive + distractorPresent + THP + incentive:distractorPresent + incentive:THP + distractorPresent:THP + incentive:distractorPresent:THP + (1 | participant)'
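
      The expansion rule stated above (‘A*B’ denotes ‘A + B + A:B’) can be checked mechanically. The sketch below is purely illustrative: `expand_crossing` is a hypothetical helper written for this example, it is not part of any statistics package, and it only enumerates the fixed-effect terms implied by a full crossing, writing pure interaction terms with ':'.

```python
from itertools import combinations


def expand_crossing(factors):
    """Expand a Wilkinson crossing A*B*... into main effects and ':' interactions."""
    terms = []
    for order in range(1, len(factors) + 1):
        for combo in combinations(factors, order):
            terms.append(":".join(combo))
    return terms


factors = ["incentive", "distractorPresent", "THP"]
fixed_terms = ["1", "voltage"] + expand_crossing(factors)
formula = "variable ~ " + " + ".join(fixed_terms) + " + (1 | participant)"
print(formula)
```

      For three crossed factors this yields three main effects, three two-way interactions and one three-way interaction, in addition to the intercept, the voltage regressor and the random intercept per participant.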

      We will clarify this in the Version of Record.


      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The authors used a motivated saccade task with distractors to measure response vigor and reaction time (RT) in healthy human males under placebo or muscarinic antagonism. They also simultaneously recorded neural activity using EEG with event-related potential (ERP) focused analyses. This study provides evidence that the muscarinic antagonist Trihexyphenidyl (THP) modulates the motivational effects of reward on both saccade velocity and RT, and also increases the distractibility of participants. The study also examined the correlational relationships between reaction time and vigor and manipulations (THP, incentives) with components of the EEG-derived ERPs. While an interesting correlation structure emerged from the analyses relating the ERP biomarkers to behavior, it is unclear how these potentially epiphenomenal biomarkers relate to relevant underlying neurophysiology.

      Strengths:

      This study is a logical translational extension from preclinical findings of cholinergic modulation of motivation and vigor and the CNV biomarker to a normative human population, utilizing a placebo-controlled, double-blind approach.

      While framed in the context of Parkinson's disease where cholinergic medications can be used, the authors do a good job in the discussion describing the limitations in generalizing their findings obtained in a normative and non-age-matched cohort to an aged PD patient population.

      The exploratory analyses suggest alternative brain targets and/or ERP components that relate to the behavior and manipulations tested. These will need to be further validated in an adequately powered study. Once validated, the most relevant biomarkers could be assessed in a more clinically relevant population.

      Weaknesses:

      The relatively weak correlations between the main experimental outcomes provide unclear insight into the neural mechanisms by which the manipulations lead to behavioral manifestations outside the context of the ERP. It would have been interesting to evaluate how other quantifications of the EEG signal through time-frequency analyses relate to the behavioral outcomes and manipulations.

      The ERP correlations to relevant behavioral outcomes were not consistent across manipulations demonstrating they are not reliable biomarkers to behavior but do suggest that multiple underlying mechanisms can give rise to the same changes in the ERP-based biomarkers and lead to different behavioral outcomes.

      We thank the reviewer for their review and their comments.

      We agree that these ERPs may not be reliable biomarkers yet, given the many-to-one mapping we observed where incentives and THP antagonism both affected the CNV in different ways, and hope that future studies will help clarify the use and limitations of the CNV as a potential biomarker of invigoration.

      Our original hypothesis was specifically about the CNV as an index of preparatory behaviour, but we plan to look at potential changes to frequency characteristics in future work. We have included this in the discussion of future investigations. (page 16, line 428):

      “Future investigations of other aspects of the EEG signals may illuminate us. Such studies could also investigate other potential signals that may be more sensitive to invigoration and/or muscarinic antagonism, including frequency-band power and phase-coherence, or measures of variability in brain signals such as entropy, which may give greater insight into processes affected by these factors.”

      Reviewer #2 (Public Review):

      Summary:

      This work by Grogan and colleagues aimed to translate animal studies showing that acetylcholine plays a role in motivation by modulating the effects of dopamine on motivation. They tested this hypothesis with a placebo-controlled pharmacological study administering a muscarinic antagonist (trihexyphenidyl; THP) to a sample of 20 adult men performing an incentivized saccade task while undergoing electroencephalography (EEG). They found that reward increased vigor and reduced reaction times (RTs) and, importantly, these reward effects were attenuated by trihexyphenidyl. High incentives increased preparatory EEG activity (contingent negative variation), and though THP also increased preparatory activity, it also reduced this reward effect on RTs.

      Strengths:

      The researchers address a timely and potentially clinically relevant question with a within-subject pharmacological intervention and a strong task design. The results highlight the importance of the interplay between dopamine and other neurotransmitter systems in reward sensitivity and even though no Parkinson's patients were included in this study, the results could have consequences for patients with motivational deficits and apathy if validated in the future.

      Weaknesses:

      The main weakness of the study is the small sample size (N=20) that unfortunately is limited to men only. The generalizability and replicability of the conclusions remain to be assessed in future research with a larger and more diverse sample size and potentially a clinically relevant population. The EEG results do not shape a concrete mechanism of action of the drug on reward sensitivity.

      We thank the reviewer for their review, and their comments.

      We agree that our study was underpowered, not reaching our target of 27 participants due to pandemic restrictions halting our recruitment, and hope that future studies into muscarinic antagonism in motivation will have larger sample sizes, and include male and female participants across a range of ages, to assess generalisability.

      We only included men to prevent the chance of administering the drug to someone pregnant. Trihexyphenidyl is categorized by the FDA as a Pregnancy Category Class C drug, and the ‘Summary of Product Characteristics’ states: “There is inadequate information regarding the use of trihexyphenidyl in pregnancy. Animal studies are insufficient with regard to effects on pregnancy, embryonal/foetal development, parturition and postnatal development. The potential risk for humans is unknown. Trihexyphenidyl should not be used during pregnancy unless clearly necessary.”

      While the drug can be prescribed where benefits may outweigh this risk, as there were no benefits to participants in this study, we only recruited men to keep the risk at zero.

      We have updated the Methods/Drugs section to explain this (page 17, line 494):

      “The risks of Trihexyphenidyl in pregnancy are unknown, but the Summary of Product Characteristics states that it “should not be used during pregnancy unless clearly necessary”. As this was a basic research study with no immediate clinical applications, there was no justification for any risk of administering the drug during pregnancy, so we only recruited male participants to keep this risk at zero.”

      And we refer to this in the Methods/Participants section (page 18, line 501):

      “We recruited 27 male participants (see Drugs section above),…”

      We agree that future work is needed to replicate this in different samples, and that this work cannot tell us the mechanism by which the drug is dampening invigoration, but we think that showing these effects do occur and can be linked to anticipatory/preparatory activity rather than overall reward sensitivity is a useful finding.

      Reviewer #3 (Public Review):

      Summary:

      Grogan et al examine a role for muscarinic receptor activation in action vigor in a saccadic system. This work is motivated by a strong literature linking dopamine to vigor, and some animal studies suggesting that ACH might modulate these effects, and is important because patient populations with symptoms related to reduced vigor are prescribed muscarinic antagonists. The authors use a motivated saccade task with distractors to measure the speed and vigor of actions in humans under placebo or muscarinic antagonism. They show that muscarinic antagonism blunts the motivational effects of reward on both saccade velocity and RT, and also modulates the distractibility of participants, in particular by increasing the repulsion of saccades away from distractors. They show that preparatory EEG signals reflect both motivation and drug condition, and make a case that these EEG signals mediate the effects of the drug on behavior.

      Strengths:

      This manuscript addresses an interesting and timely question and does so using an impressive within-subject pharmacological design and a task well-designed to measure constructs of interest. The authors show clear causal evidence that ACH affects different metrics of saccade generation related to effort expenditure and their modulation by incentive manipulations. The authors link these behavioral effects to motor preparatory signatures, indexed with EEG, that relate to behavioral measures of interest and in at least one case statistically mediate the behavioral effects of ACH antagonism.

      Weaknesses:

      In full disclosure, I have previously reviewed this manuscript in another journal and the authors have done a considerable amount of work to address my previous concerns. However, I have a few remaining concerns that affect my interpretation of the current manuscript.

      Some of the EEG signals (figures 4A&C) have profiles that look like they could have ocular, rather than central nervous, origins. Given that this is an eye movement task, it would be useful if the authors could provide some evidence that these signals are truly related to brain activity and not driven by ocular muscles, either in response to explicit motor effects (ie. Blinks) or in preparation for an upcoming saccade.

      We thank the reviewer for re-reviewing the manuscript and for raising this issue.

      All the EEG analyses (both ERP and whole-brain) are analysing the preparation period between the ready-cue and target appearance when no eye-movements are required. We reject trials with blinks or saccades over 1 degree in size, as detected by the Eyelink software according to the sensitive velocity and acceleration criteria specified in the manuscript (Methods/Eye-tracking, page 19, line 550). This means that there should be no overt eye movements in the data. However, microsaccades and ocular drift are still possible within this period, which indeed could drive some effects. To measure this, we counted the number of microsaccades (<1 degree in size) in the preparation period between incentive cue and the target onset, for each trial. Further, we measured the mean absolute speed of the eye during the preparation period (excluding the periods during microsaccades) for each trial.
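      The two trial-wise covariates described above can be sketched as follows. This is only an illustration under stated assumptions, not the analysis pipeline used in the study: the function name, the (start, end, amplitude) event format, and the 1000 Hz sampling rate are hypothetical, and in the study the saccade events came from the Eyelink parser rather than from a hand-written list.

```python
import numpy as np

def eye_covariates(x_deg, y_deg, saccades, sample_rate_hz=1000.0, max_amp_deg=1.0):
    """Return (n_microsaccades, mean_abs_drift_speed) for one preparation period.

    x_deg, y_deg : gaze position in degrees, one value per sample.
    saccades     : list of (start_idx, end_idx, amplitude_deg) events with an
                   exclusive end index (hypothetical format, assumed here).
    """
    x = np.asarray(x_deg, dtype=float)
    y = np.asarray(y_deg, dtype=float)

    # Count detected events smaller than the amplitude threshold.
    micro = [(s, e) for s, e, amp in saccades if amp < max_amp_deg]

    # Sample-to-sample eye speed in degrees per second.
    speed = np.hypot(np.diff(x), np.diff(y)) * sample_rate_hz

    # Drop every step that touches a sample inside a microsaccade.
    keep = np.ones(speed.size, dtype=bool)
    for s, e in micro:
        keep[max(s - 1, 0):e] = False

    drift = float(speed[keep].mean()) if keep.any() else float("nan")
    return len(micro), drift
```

      Each trial then contributes one microsaccade count and one drift speed, which can be entered as covariates in the trial-wise regression.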

      We have run a control analysis to check whether including ocular drift speed or number of microsaccades as a covariate in the whole-brain regression analysis changes the association between EEG and the behavioural metrics at frontal or other electrodes. Below we show these ‘variable ~ EEG’ beta-coefficients when controlling for each eye-movement covariate, in the same format as Figure 4. We did not run the permutation testing on this due to time/computational costs (it takes >1 week per variable), so p-values were not calculated, only the beta-coefficients. The beta-coefficients are almost unchanged, both in time-course and topography, when controlling for either covariate. The frontal associations to velocity and distractor pull remain, suggesting they are not due to these eye movements.

      We have added this figure as a supplemental figure.

      For additional clarity in this response, we also plot the differences between these covariate-controlled beta-coefficients, and the true beta-coefficients from figure 4 (please note the y-axis scales are -0.02:0.02, not -0.15:0.15 as in Figure 4 and Figure 4-figure supplement 2). This shows that the changes to the associations between EEG and velocity/distractor-pull were not frontally-distributed, demonstrating eye-movements were not driving these effects. Relatedly, the RT effect’s change was frontally-distributed, despite Figure 4 showing the true relationship was central in focus, again indicating that effect was also not related to these eye movements.

      Author response image 1.

      Difference in beta-coefficients when eye-movement covariates are included. This is the difference from the beta-coefficients shown in Figure 4, please note the smaller y-axis limits.

      The same pattern was seen if we controlled for the change in eye-position from the baseline period (measured by the eye-tracker) at each specific time-point, i.e., controlling for the distance the eye had moved from baseline at the time the EEG voltage is measured. The topographies and time-course plots were almost identical to the above ones:

      Author response image 2.

      Controlling for change in eye-position at each time-point does not change the regression results. Left column shows the beta-coefficients between the variable and EEG voltage, and the right column shows the difference from the main results in Figure 4 (note the smaller y-axis limits for the right-hand column).

      Therefore, we believe the brain-behaviour regressions are independent of eye-movements. We have included the first figure presented here as an additional supplemental figure, and added the following to the text (page 10, line 265):

      “An additional control analysis found that these results were not driven by microsaccades or ocular drift during the preparation period, as including these as trial-wise covariates did not substantially change the beta-coefficients (Figure 4 – Figure Supplement 2).”

      For other EEG signals, in particular, the ones reported in Figure 3, it would be nice to see what the spatial profiles actually look like - does the scalp topography match that expected for the signal of interest?

      Yes, the CNV is a central negative potential peaking around Cz, while the P3a is slightly anterior to this (peaking between Cz and FCz). We have added the topographies to the main figure (see point below).

      This is the topography of the mean CNV (1200:1500ms from the preparation cue onset), which is maximal over Cz, as expected.

      The P3a’s topography (200:280ms after preparation cue) is maximal slightly anterior to Cz, between Cz and FCz.

      A primary weakness of this paper is the sample size - since only 20 participants completed the study. The authors address the sample size in several places and I completely understand the reason for the reduced sample size (study halt due to COVID). That said, they only report the sample size in one place in the methods rather than through degrees of freedom in their statistical tests conducted throughout the results. In part because of this, I am not totally clear on whether the sample size for each analysis is the same - or whether participants were removed for specific analyses (ie. due to poor EEG recordings, for example).  

      We apologise for the lack of clarity here. All 20 participants were included in all analyses, although the number of trials included differed between behavioural and EEG analyses. We only excluded trials with EEG artefacts from the EEG analyses, not from the purely behavioural analyses such as Figures 1&2, although trials with blinks/saccades were removed from behavioural analyses too. Removing the EEG artefactual trials from the behavioural analyses did not change the findings, despite the lower power. The degrees of freedom in the figure supplement tables are the total number of trials (less 8 fixed-effect terms) included in the single-trial / trial-wise regression analyses we used.

      We have clarified this in the Methods/Analysis (page 20, line 602):

      “Behavioural and EEG analysis included all 20 participants, although trials with EEG artefacts were included in the behavioural analyses (18585 trials in total) and not the EEG analyses (16627 trials in total), to increase power in the former. Removing these trials did not change the findings of the behavioural analyses.”

      And we state the number of participants and trials in the start of the behavioural results (page 3, line 97):

      “We used single-trial mixed-effects linear regression (20 participants, 18585 trials in total) to assess the effects of Incentive, Distractors, and THP, along with all the interactions of these (and a random-intercept per participant), on residual velocity and saccadic RT.”

      and EEG results section (page 7, line 193):

      “We used single-trial linear mixed-effects regression to see the effects of Incentive and THP on each ERP (20 participants, 16627 trials; Distractor was included too, along with all interactions, and a random intercept by participant).”

      Beyond this point, but still related to the sample size, in some cases I worry that results are driven by a single subject. In particular, the interaction effect observed in Figure 1e seems like it would be highly sensitive to the single subject who shows a reverse incentive effect in the drug condition.

      Repeating that analysis after removing the participant with the large increase in saccadic RT with incentives did not remove the incentive*THP interaction effect, although it did weaken slightly, from (β = 0.0218, p = .0002) to (β = 0.0197, p = .0082). This is likely because, while that participant did have slower RTs for higher incentives on THP, they were also slower for higher incentives under placebo (and similarly for distractor present/absent), making them less of an outlier in terms of effects than in raw RT terms. Author response image 3 below shows the mean figure without that participant, and Author response image 4 shows that participant separately.

      Author response image 3.

      Author response image 4.

      There are not sufficient details on the cluster-based permutation testing to understand what the authors did or whether it is reasonable. What channels were included? What metric was computed per cluster? How was null distribution generated?

      We apologise for not giving sufficient details of this, and have updated the Methods/Analysis section to include these details, along with a brief description in the Results section.

      To clarify here, we adapted the DMGroppe Mass Univariate Testing toolbox to also run cluster-based permutation regressions to examine the relationship between the behavioural variables and the voltages at all EEG electrodes at each time point. On each iteration we shuffled the voltages across trials within each condition and person, and regressed it against the behavioural variable, with the model ‘variable ~1 + voltage + incentive*distractorPresent*THP + (1 | participant)’. The Voltage term measured the association between voltage and the behavioural variable, after controlling for effects of incentive*distractor*THP on behaviour – i.e. does adding the voltage at this time/channel explain additional variance in the variable not captured in our main behavioural analyses. By shuffling the voltages, we removed the relationship to the behavioural variable, to build the null distribution of t-statistics across electrodes and time-samples. We used the ‘cluster mass’ method (Bullmore et al., 1999; Groppe et al., 2011; Maris & Oostenveld, 2007) to build the null distribution of cluster mass (across times/channels per iteration), and calculated the p-value as the proportion of this distribution further from zero than the absolute true t-statistics (two-tailed test).
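      To make the logic of the cluster-mass permutation concrete, here is a heavily simplified sketch. It is not the adapted toolbox code: it assumes a single channel, a simple regression t-statistic in place of the mixed-effects model, shuffling across all trials rather than within condition and person, a fixed threshold of |t| > 2, and a max-cluster-mass null distribution. All function names are hypothetical.

```python
import numpy as np

def slope_t(volt, behav):
    """t-statistic for behaviour ~ voltage at each time point (trials x times)."""
    n = volt.shape[0]
    vz = (volt - volt.mean(axis=0)) / volt.std(axis=0, ddof=1)
    bz = (behav - behav.mean()) / behav.std(ddof=1)
    r = (vz * bz[:, None]).sum(axis=0) / (n - 1)
    return r * np.sqrt((n - 2) / (1.0 - r ** 2))

def cluster_masses(t, thresh):
    """Sum t over each run of adjacent, same-sign, supra-threshold time points."""
    masses, current, sign = [], 0.0, 0
    for value in t:
        s = int(np.sign(value)) if abs(value) > thresh else 0
        if s != 0 and s == sign:
            current += value
        else:
            if sign != 0:
                masses.append(current)
            current, sign = (value, s) if s != 0 else (0.0, 0)
    if sign != 0:
        masses.append(current)
    return masses

def cluster_perm_test(volt, behav, thresh=2.0, n_perm=500, seed=0):
    """Observed cluster masses and their two-tailed permutation p-values."""
    rng = np.random.default_rng(seed)
    observed = cluster_masses(slope_t(volt, behav), thresh)
    null = np.empty(n_perm)
    for i in range(n_perm):
        # Shuffling trials breaks the voltage-behaviour link.
        shuffled = volt[rng.permutation(volt.shape[0])]
        masses = cluster_masses(slope_t(shuffled, behav), thresh)
        null[i] = max((abs(m) for m in masses), default=0.0)
    p_values = [(np.sum(null >= abs(m)) + 1) / (n_perm + 1) for m in observed]
    return observed, p_values
```

      Comparing each observed cluster against the largest cluster found in every permutation is what controls the family-wise error rate across time points in this sketch.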

      We have given greater detail for this in the Methods/Analysis section (page 20, line 614):

      “We adapted this toolbox to also run cluster-based permutation regressions to examine the relationship between the behavioural variables and the voltages at all EEG electrodes at each time point. On each iteration we shuffled the voltages across trials within each condition and person, and regressed it against the behavioural variable, with the model ‘~1 + voltage + incentive*distractorPresent*THP + (1 | participant)’. The Voltage term measured the association between voltage and the behavioural variable, after controlling for effects of incentive*distractor*THP on behaviour. By shuffling the voltages, we removed the relationship to the behavioural variable, to build the null distribution of t-statistics across electrodes and time-samples. We used the ‘cluster mass’ method (Bullmore et al., 1999; Groppe et al., 2011; Maris & Oostenveld, 2007) to build the null distribution, and calculated the p-value as the proportion of this distribution further from zero than the true t-statistics (two-tailed test). Given the relatively small sample size here, these whole-brain analyses should not be taken as definitive.”

      And we have added a brief explanation to the Results section also (page 9, line 246):

      “We regressed each electrode and time-point against the three behavioural variables separately, while controlling for effects of incentive, distractor, THP, the interactions of those factors, and a random effect of participant. This analysis therefore asks whether trial-to-trial neural variability predicts behavioural variability. To assess significance, we used cluster-based permutation tests (DMGroppe Mass Univariate toolbox; Groppe, Urbach, & Kutas, 2011), shuffling the trials within each condition and person, and repeating it 2500 times, to build a null distribution of ‘cluster mass’ from the t-statistics (Bullmore et al., 1999; Maris & Oostenveld, 2007) which was used to calculate two-tailed p-values with a family-wise error rate (FWER) of .05 (see Methods/Analysis for details).”

      The authors report that "muscarinic antagonism strengthened the P3a" - but I was unable to see this in the data plots. Perhaps it is because the variability related to individual differences obscures the conditional differences in the plots. In this case, event-related difference signals could be helpful to clarify the results.

      We thank the reviewer for spotting this wording error. This should refer to the incentive effect weakening the P3a, as no other significant effects were found on the P3a, as stated correctly in the previous paragraph. We have corrected this in the manuscript (page 9, line 232):

      “This suggests that while incentives strengthened the incentive-cue response and the CNV and weakened the P3a, muscarinic antagonism strengthened the CNV,”

      The reviewer’s suggestion for difference plots is very valuable, and we have added these to Figure 3, as well as increasing the y-axis scale for figure 3c to show the incentives weakening the P3a more clearly, and adding the topographies suggested in an earlier comment. The difference waves for Incentive and THP effects show that both are decreasing voltage, albeit with slightly different onset times – Incentive starts earlier, thus weakening the positive P3a, while both strengthen the negative CNV. The Incentive effects within THP and Placebo separately illustrate the THP*Incentive interaction.

      We have amended the Results text and figure (page 7, line 200):

      “The subsequent CNV was strengthened (i.e. more negative; Figure 3d) by incentive (β = -.0928, p < .0001) and THP (β = -0.0502, p < .0001), with an interaction whereby THP decreased the incentive effect (β = 0.0172, p = .0213). Figure 3h shows the effects of Incentive and THP on the CNV separately, using difference waves, and Figure 3i shows the incentive effect grows more slowly in the THP condition than the Placebo condition.”

      For mediation analyses, it would be useful in the results section to have a much more detailed description of the regression results, rather than just reporting things in a binary did/did not mediate sort of way. Furthermore, the methods should also describe how mediation was tested statistically (ie. What is the null distribution that the difference in coefficients with/without moderator is tested against?).

      We have added a more detailed explanation of how we investigated mediation and mediated moderation, and now report the mediation effects for all tests run and the permutation-test p-values.

      We had been using the Baron & Kenny (1986) method, based on 4 tests outlined in the updated text below, which gives a single measure of change in absolute beta-coefficients when all the tests have been met, but without any indication of significance; any reduction found after meeting the other 3 tests indicates a partial mediation under this method. We now use permutation testing to generate a p-value for the likelihood of finding an equal or larger reduction in the absolute beta-coefficients if the CNV were not truly related to RT. This found that the CNV’s mediation of the Incentive effect on RT was highly significant, while the Mediated Moderation of CNV on THP*Incentive was weakly significant.

      During this re-analysis, we noticed that we had different trial-numbers in the different regression models, as EEG-artefactual trials were not excluded from the behavioural-only model (‘RT ~ 1 + Incentive’). However, this causes issues with the permutation testing as we are shuffling the ERPs and need the same trials included in all the mixed-effects models. Therefore, we have redone these mediation analyses, including only the trials with valid ERP measures (i.e. no artefactual trials) in all models. This has changed the beta-coefficients we report, but not the findings or conclusions of the mediation analyses. We have updated the figure to have these new statistics.

      We have updated the text to explain the methodology in the Results section (page 12, line 284):

      “We have found that neural preparatory activity can predict residual velocity and RT, and is also affected by incentives and THP. Finally, we ask whether the neural activity can explain the effects of incentives and THP, through mediation analyses. We used the Baron & Kenny (1986) method to assess mediation (see Methods/Analysis for full details). This tests whether the significant Incentive effect on behaviour could be partially reduced (i.e., explained) by including the CNV as a mediator in a mixed-effects single-trial regression. We measured mediation as the reduction in (absolute) beta-coefficient for the incentive effect on behaviour when the CNV was included as a mediator (i.e., RT ~ 1 + Incentive + CNV + Incentive*CNV + (1 | participant)). This is a directional hypothesis of a reduced effect, and to assess significance we ran a permutation-test, shuffling the CNV within participants, and measuring the change in absolute beta-coefficient for the Incentive effect on behaviour. This generates a distribution of mediation effects where there is no relationship between CNV and RT on a trial (i.e., a null distribution). We ran 2500 permutations, and calculated the proportion with an equal or more negative change in absolute beta-coefficient, equivalent to a one-tailed test. We ran this mediation analysis separately for the two behavioural variables of RT and residual velocity, but not for distractor pull as it was not affected by incentive, so failed the assumptions of mediation analyses (Baron & Kenny, 1986; Muller et al., 2005). We took the mean CNV amplitude from 1200:1500ms as our Mediator.

      Residual velocity passed all the assumption tests for Mediation analysis, but no significant mediation was found. That is, Incentive predicted velocity (β=0.1304, t(1,16476)=17.3280, p<.0001); Incentive predicted CNV (β=-0.9122, t(1,16476)=-12.1800, p<.0001); and CNV predicted velocity when included alongside Incentive (β=0.0015, t(1,16475)=1.9753, p=.0483). However, including CNV did not reduce the Incentive effect on velocity, and in fact strengthened it (β=0.1318, t(1,16475)=17.4380, p<.0001; change in absolute coefficient: Δβ=+0.0014). Since there was no mediation (reduction), we did not run permutation tests on this.

      However, RT did show a significant mediation of the Incentive effect by CNV: Incentive predicted RT (β=-0.0868, t(1,16476)=-14.9330, p<.0001); Incentive predicted CNV (β=-0.9122, t(1,16476)=-12.1800, p<.0001); and CNV predicted RT when included alongside Incentive (β=0.0127, t(1,16475)=21.3160, p<.0001). The CNV mediated the effect of Incentive on RT, reducing the absolute beta-coefficient (β=-0.0752, t(1,16475)=-13.0570, p<.0001; change in absolute coefficient: Δβ= -0.0116). We assessed the significance of this change via permutation testing, shuffling the CNV across trials (within participants) and calculating the change in absolute beta-coefficient for the Incentive effect on RT when the permuted CNV was included as a mediator. We repeated this 2500 times to build a null distribution of Δβ, and calculated the proportion with equal or stronger reductions for a one-tailed p-value, which was highly significant (p<.0001). This suggests that the Incentive effect on RT is partially mediated by the CNV’s amplitude during the preparation period, and this is not the case for residual velocity.

      We also investigated whether the CNV could explain the cholinergic reduction in motivation (THP*Incentive interaction) on RT – i.e., whether the CNV mediates the THP moderation. We measured Mediated Moderation as suggested by Muller et al. (2005; see Methods/Analysis for full explanation): Incentive*THP was associated with RT (β=0.0222, t(1,16474)=3.8272, p=.0001); and Incentive*THP was associated with CNV (β=0.1619, t(1,16474)=2.1671, p=.0302); and CNV*THP was associated with RT (β=0.0014, t(1,16472)=2.4061, p=.0161). Mediated Moderation was measured by the change in absolute Incentive*THP effect when THP*CNV was included in the mixed-effects model (β=0.0214, t(1,16472)=3.7298, p=.0002; change in beta-coefficient: Δβ= -0.0008), and permutation-testing (permuting the CNV as above) found a significant effect (p=.0132). This indicates cholinergic blockade changes how incentives affect preparatory negativity, and how this negativity reflects RT, which can explain some of the reduced invigoration of RT. However, this was not observed for saccade velocity.”
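      As a quick arithmetic check, the three Δβ values quoted above follow directly from the reported coefficients, taking the change as the absolute coefficient with the mediator term minus the absolute coefficient without it. The helper name below is hypothetical and the numbers are copied from the text.

```python
def change_in_abs_beta(beta_without, beta_with):
    """Change in |beta| when the mediator term is added (negative = reduction)."""
    return round(abs(beta_with) - abs(beta_without), 4)

# Coefficients as reported in the quoted Results text.
velocity = change_in_abs_beta(0.1304, 0.1318)             # Incentive effect on velocity
rt = change_in_abs_beta(-0.0868, -0.0752)                 # Incentive effect on RT
mediated_moderation = change_in_abs_beta(0.0222, 0.0214)  # Incentive*THP effect on RT

print(velocity, rt, mediated_moderation)
```

      The positive change for velocity corresponds to the strengthened (not mediated) effect, while the two negative changes are the reductions that were then tested by permutation.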

      And we have updated the Methods/Analysis section with a more detailed explanation too (page 21, line 627):

      “For the mediation analysis, we followed the 4-step process (Baron & Kenny, 1986; Muller et al., 2005), which requires 4 tests be met for the outcome (behavioural variable, e.g. RT), mediator (ERP, e.g., CNV) and the treatment (Incentive):

      (1) Outcome is significantly associated with the Treatment (RT ~ 1 + Incentive + (1 | participant))

      (2) Mediator is significantly associated with the Treatment (ERP ~ 1 + Incentive + (1 | participant))

      (3) Mediator is significantly associated with the Outcome (RT ~ 1 + Incentive + ERP + (1 | participant))

      (4) And the inclusion of the Mediator reduces the association between the Treatment and Outcome (Incentive effect from model #3)

      The mediation was measured by the reduction in the absolute standardised beta coefficient between incentive and behaviour when the ERP mediator was included (model #3 vs model #1 above). We used permutation-testing to quantify the likelihood of finding these mediations under the null hypothesis, achieved by shuffling the ERP across trials (within each participant) to remove any link between the ERP and behaviour. We repeated this 2500 times to build a null distribution of the change in absolute beta-coefficients for the RT ~ Incentive effect when this permuted mediator was included (model #3 vs model #1). We calculated a one-tailed p-value by finding the proportion of the null distribution that was equal or smaller than the true values (as Mediation is a one-tailed prediction).

      Mediated moderation (Muller et al., 2005) was used to see whether the effect of THP (the Moderator) on behaviour is mediated by the ERP, with the following tests (after the previous Mediation tests were already satisfied):

      (5) THP moderates the Incentive effect, via a significant Treatment*Moderator interaction on the Outcome (RT ~ 1 + Incentive + THP + Incentive*THP + (1 | participant))

      (6) THP moderates the Incentive effect on the Mediator, via a Treatment*Moderator interaction on the Mediator (ERP ~ 1 + Incentive + THP + Incentive*THP + (1 | participant))

      (7) THP’s moderation of the Incentive effect is mediated by the ERP, via a reduction in the association of Treatment*Moderator on the Outcome when the Mediator*Moderator interaction is included (RT ~ 1 + Incentive + THP + Incentive*THP + ERP + ERP*THP + (1 | participant))

      Mediated moderation is measured as the reduction in absolute beta-coefficients for ‘RT ~ Incentive*THP’ between model #5 and #7, which captures how much of this interaction could be explained by including the Mediator*Moderator interaction (ERP*THP in model #7). We tested the significance of this with permutation testing as above, permuting the ERP across trials (within participants) 2500 times, and building a null distribution of the change in the absolute beta-coefficients for RT ~ Incentive*THP between models #7 and #5. We calculated a one-tailed p-value from the proportion of these that were equal or smaller than the true change.”
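      The permutation test for mediation described above (model #3 vs model #1) can be sketched as below. This is a minimal illustration and not the authors' code: it assumes ordinary least squares without the random intercept per participant, and the function and variable names are hypothetical.

```python
import numpy as np

def incentive_beta(y, incentive, mediator=None):
    """OLS coefficient for incentive, optionally with a mediator in the model."""
    cols = [np.ones_like(y), incentive]
    if mediator is not None:
        cols.append(mediator)
    coefs, *_ = np.linalg.lstsq(np.column_stack(cols), y, rcond=None)
    return coefs[1]

def mediation_perm_test(y, incentive, mediator, groups, n_perm=500, seed=0):
    """One-tailed permutation p-value for the reduction in |beta_incentive|."""
    rng = np.random.default_rng(seed)
    base = abs(incentive_beta(y, incentive))                    # model #1
    observed = abs(incentive_beta(y, incentive, mediator)) - base  # model #3 vs #1

    null = np.empty(n_perm)
    for i in range(n_perm):
        shuffled = mediator.copy()
        for g in np.unique(groups):
            # Shuffle the mediator across trials within each participant.
            idx = np.flatnonzero(groups == g)
            shuffled[idx] = mediator[rng.permutation(idx)]
        null[i] = abs(incentive_beta(y, incentive, shuffled)) - base

    # Proportion of null changes that are equal to or more negative than observed.
    p = (np.sum(null <= observed) + 1) / (n_perm + 1)
    return observed, p
```

      The same shuffling scheme would apply to mediated moderation, with the Incentive*THP coefficient compared between models #5 and #7 instead.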

      Recommendations for the authors:

      Reviewer #2 (Recommendations For The Authors):

      (1) The analysis section could benefit from greater detail. For example, how exactly did they assess that the effects of the drug on peak velocity and RT were driven by non-distracting trials? Ideally, for every outcome, the analysis approach used should be detailed and justified.

      We apologise for the confusion from this. To clarify, we found a 2-way interaction (incentive*THP) on both residual velocity and saccadic RT and this pattern was stronger in distractor-absent trials for residual velocity, and stronger in distractor-present trials for saccadic RT, as can be seen in Figure 1d&e. However, as there was no significant 3-way interaction (incentive*THP*distractor) for either metric, and the 2-way interaction effects were in the same direction in distractor present/absent trials for both metrics, we think these effects were relatively unaffected by distractor presence.

      We have updated the Results section to make this clearer: (page 3, line 94):

      We measured vigour as the residual peak velocity of saccades within each drug session (see Figure 1c & Methods/Eye-tracking), which is each trial’s deviation of velocity from the main sequence. This removes any overall effects of the drug on saccade velocity, while still allowing incentives and distractors to have different effects within each drug condition. We used single-trial mixed-effects linear regression (20 participants, 18585 trials in total) to assess the effects of Incentive, Distractors, and THP, along with all the interactions of these (and a random-intercept per participant), on residual velocity and saccadic RT. As predicted, residual peak velocity was increased by incentives (Figure 1d; β = 0.1266, p < .0001), while distractors slightly slowed residual velocity (β = -0.0158, p = .0294; see Figure 1 – Figure supplement 1 for full behavioural statistics). THP decreased the effect of incentives on velocity (incentive * THP: β = -0.0216, p = .0030), indicating that muscarinic blockade diminished motivation by incentives. Figure 1d shows that this effect was similar in distractor absent/present trials, although slightly stronger when the distractor was absent; the 3-way (distractor*incentive*THP) interaction was not significant (p > .05), suggesting that the distractor-present trials had the same effect but weaker (Figure 1d).

      Saccadic RT (time to initiation of saccade) was slower when participants were given THP (β = 0.0244, p < .0001), faster with incentives (Figure 1e; β = -0.0767, p < .0001), and slowed by distractors (β = 0.0358, p < .0001). Again, THP reduced the effects of incentives (incentive*THP: β = 0.0218, p = .0002). Figure 1e shows that this effect was similar in distractor absent/present trials, although slightly stronger when the distractor was present; as the 3-way (distractor*incentive*THP) interaction was not significant and the direction of effects was the same in the two, it suggests the effect was similar in both conditions. Additionally, the THP*Incentive interactions were correlated between saccadic RT and residual velocity at the participant level (Figure 1 – Figure supplement 2).
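      The residual peak velocity measure described above can be illustrated with a short sketch. It assumes the main sequence is approximated by a straight-line fit of peak velocity on saccade amplitude within each drug session; the exact fit used in the manuscript is given in its Methods/Eye-tracking section and may differ, and the function name is hypothetical.

```python
import numpy as np

def residual_peak_velocity(amplitude, peak_velocity, session):
    """Residual of peak velocity after a per-session linear fit on amplitude."""
    amplitude = np.asarray(amplitude, dtype=float)
    peak_velocity = np.asarray(peak_velocity, dtype=float)
    session = np.asarray(session)

    residual = np.empty_like(peak_velocity)
    for s in np.unique(session):
        rows = session == s
        # Fit the (assumed linear) main sequence for this session only.
        slope, intercept = np.polyfit(amplitude[rows], peak_velocity[rows], deg=1)
        residual[rows] = peak_velocity[rows] - (slope * amplitude[rows] + intercept)
    return residual
```

      Because each session is fitted separately, any overall shift in velocity between drug conditions is absorbed by that session's fit, leaving only trial-to-trial deviations.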

      We have given more details of the analyses performed in the Methods section and the results, as requested by you and the other reviewers (page 20, line 602):

      Behavioural and EEG analysis included all 20 participants, although trials with EEG artefacts were included in the behavioural analyses (18585 trials in total) and not the EEG analyses (16627 trials in total), to increase power in the former. Removing these trials did not change the findings of the behavioural analyses.

      We used single-trial linear mixed-effects models to analyse our data, including participant as a random effect of intercept, with the formula ‘~1 + incentive*distractor*THP + (1 | participant)’. We z-scored all factors to give standardised beta coefficients.
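      For readers less familiar with this formula notation, the sketch below expands the fixed-effects part ‘1 + incentive*distractor*THP’ into its columns after z-scoring each factor. It is illustrative only: the random intercept per participant is omitted, the interaction columns are taken as products of the z-scored factors (an assumption), and the function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def zscore(x):
    """Standardise a factor to zero mean and unit standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

def fixed_effects_design(incentive, distractor, thp):
    """Column names and design matrix for '1 + incentive*distractor*THP'."""
    factors = {"incentive": zscore(incentive),
               "distractor": zscore(distractor),
               "THP": zscore(thp)}
    names = ["intercept"]
    columns = [np.ones(len(factors["incentive"]))]
    # Main effects, then 2-way interactions, then the 3-way interaction.
    for order in (1, 2, 3):
        for combo in combinations(factors, order):
            names.append(":".join(combo))
            columns.append(np.prod([factors[f] for f in combo], axis=0))
    return names, np.column_stack(columns)
```

      The expansion gives one intercept, three main effects, three 2-way interactions and one 3-way interaction, i.e. the 8 fixed-effect terms referred to in the response on degrees of freedom.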

      For the difference-wave cluster-based permutation tests (Figure 3 – Figure supplement 4), we used the DMGroppe Mass Univariate toolbox (Groppe et al., 2011), with 2500 permutations, to control the family-wise error rate at 0.05. This was used for looking at difference waves to test the effects of incentive, THP, and the incentive*THP interaction (using difference of difference-waves), across all EEG electrodes.

      We adapted this toolbox to also run cluster-based permutation regressions to examine the relationship between the behavioural variables and the voltages at all EEG electrodes at each time point. On each iteration we shuffled the voltages across trials within each condition and person, and regressed it against the behavioural variable, with the model ‘~1 + voltage + incentive*distractorPresent*THP + (1 | participant)’. The Voltage term measured the association between voltage and the behavioural variable, after controlling for effects of incentive*distractor*THP on behaviour. By shuffling the voltages, we removed the relationship to the behavioural variable, to build the null distribution of t-statistics across electrodes and time-samples. We used the ‘cluster mass’ method (Bullmore et al., 1999; Groppe et al., 2011; Maris & Oostenveld, 2007) to build the null distribution, and calculated the p-value as the proportion of this distribution further from zero than the true t-statistics (two-tailed test). Given the relatively small sample size here, these whole-brain analyses should not be taken as definitive.

      For the mediation analysis, we followed the 4-step process (Baron & Kenny, 1986; Muller et al., 2005), which requires 4 tests be met for the outcome (behavioural variable, e.g. RT), mediator (ERP, e.g., CNV) and the treatment (Incentive):

      (1) Outcome is significantly associated with the Treatment (RT ~ 1 + Incentive + (1 | participant))

      (2) Mediator is significantly associated with the Treatment (ERP ~ 1 + Incentive + (1 | participant))

      (3) Mediator is significantly associated with the Outcome (RT ~ 1 + Incentive + ERP + (1 | participant))

      (4) And the inclusion of the Mediator reduces the association between the Treatment and Outcome (Incentive effect from model #3)

      The mediation was measured by the reduction in the absolute standardised beta coefficient between incentive and behaviour when the ERP mediator was included (model #3 vs model #1 above). We used permutation-testing to quantify the likelihood of finding these mediations under the null hypothesis, achieved by shuffling the ERP across trials (within each participant) to remove any link between the ERP and behaviour. We repeated this 2500 times to build a null distribution of the change in absolute beta-coefficients for the RT ~ Incentive effect when this permuted mediator was included (model #3 vs model #1). We calculated a one-tailed p-value by finding the proportion of the null distribution that was equal or more negative than the true value (as Mediation is a one-tailed prediction). For this mediation analysis, we only included trials with valid ERP measures, even for the models without the ERP included (e.g., model #1), to keep the trial-numbers and degrees of freedom the same.
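
      The permutation test for mediation can be sketched as below. This is a simplified illustration under stated assumptions, not the analysis code: the random intercept is approximated by centring every variable within participant, the models are fitted by ordinary least squares using the closed-form standardised coefficient for two predictors, and all function names are hypothetical.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def centre_within(values, groups):
    """Subtract each participant's mean (stand-in for a random intercept)."""
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)
    means = {g: mean(vs) for g, vs in by_group.items()}
    return [v - means[g] for v, g in zip(values, groups)]


def beta_change(rt, incentive, erp, participant):
    """|beta_incentive| with the mediator (model 3) minus without (model 1)."""
    y = centre_within(rt, participant)
    x = centre_within(incentive, participant)
    m = centre_within(erp, participant)
    r_xy, r_my, r_xm = pearson(x, y), pearson(m, y), pearson(x, m)
    beta_without = r_xy                                   # RT ~ Incentive
    beta_with = (r_xy - r_xm * r_my) / (1 - r_xm ** 2)    # RT ~ Incentive + ERP
    return abs(beta_with) - abs(beta_without)


def shuffle_within(values, groups, rng):
    """Permute values across trials separately for each participant."""
    idx_by_group = {}
    for i, g in enumerate(groups):
        idx_by_group.setdefault(g, []).append(i)
    out = list(values)
    for idx in idx_by_group.values():
        perm = idx[:]
        rng.shuffle(perm)
        for src, dst in zip(idx, perm):
            out[dst] = values[src]
    return out


def mediation_p(rt, incentive, erp, participant, n_perm=2500, seed=0):
    """One-tailed p: share of null changes equal to or more negative than observed."""
    rng = random.Random(seed)
    observed = beta_change(rt, incentive, erp, participant)
    null = [
        beta_change(rt, incentive, shuffle_within(erp, participant, rng), participant)
        for _ in range(n_perm)
    ]
    return observed, sum(d <= observed for d in null) / n_perm
```

      A negative observed change means the Incentive coefficient shrank once the ERP was included; the p-value says how often shuffled ERPs produce a reduction at least that large.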

      Mediated moderation (Muller et al., 2005) was used to see whether the effect of THP (the Moderator) on behaviour is mediated by the ERP, with the following tests (after the previous Mediation tests were already satisfied):

      (5) THP moderates the Incentive effect, via a significant Treatment*Moderator interaction on the Outcome (RT ~ 1 + Incentive + THP + Incentive*THP + (1 | participant))

      (6) THP moderates the Incentive effect on the Mediator, via a Treatment*Moderator interaction on the Mediator (ERP ~ 1 + Incentive + THP + Incentive*THP + (1 | participant))

      (7) THP’s moderation of the Incentive effect is mediated by the ERP, via a reduction in the association of Treatment*Moderator on the Outcome when the Mediator*Moderator interaction is included (RT ~ 1 + Incentive + THP + Incentive*THP + ERP + ERP*THP + (1 | participant))

      Mediated moderation is measured as the reduction in absolute beta-coefficients for ‘RT ~ Incentive*THP’ between model #5 and #7, which captures how much of this interaction could be explained by including the Mediator*Moderator interaction (ERP*THP in model #7). We tested the significance of this with permutation testing as above, permuting the ERP across trials (within participants) 2500 times, and building a null distribution of the change in the absolute beta-coefficients for RT ~ Incentive*THP between models #7 and #5. We calculated a one-tailed p-value from the proportion of these that were equal or more negative than the true change.

      (2) Please explain why only men were included in this study. We are all hoping that men-only research is a practice of the past.

      We only included men to prevent any chance of administering the drug to someone pregnant. Trihexyphenidyl is categorized by the FDA as a Pregnancy Category Class C drug, and the ‘Summary of Product Characteristics’ states: “There is inadequate information regarding the use of trihexyphenidyl in pregnancy. Animal studies are insufficient with regard to effects on pregnancy, embryonal/foetal development, parturition and postnatal development. The potential risk for humans is unknown. Trihexyphenidyl should not be used during pregnancy unless clearly necessary.”

      While the drug can be prescribed where benefits may outweigh this risk, as there were no benefits to participants in this study, we only recruited men to keep the risk at zero.

      We have updated the Methods/Drugs section to explain this (page 17, line 494):

      “The risks of Trihexyphenidyl in pregnancy are unknown, but the Summary Product of Characteristics states that it “should not be used during pregnancy unless clearly necessary”. As this was a basic research study with no immediate clinical applications, there was no justification for any risk of administering the drug during pregnancy, so we only recruited male participants to keep this risk at zero.”

      And we have referenced this in the Methods/Participants section (page 18, line 501):

      “Our sample size calculations suggested 27 participants would detect a 0.5 effect size with .05 sensitivity and .8 power. We recruited 27 male participants (see Drugs section above)”

      (3) Please explain acronyms (eg EEG) when first used.

      Thank you for pointing this out. We have explained EEG at first use in the abstract and the main text, along with FWER, M1r, and ERP, which had also been missed at first use.

      Reviewer #3 (Recommendations For The Authors):

      The authors say: "Therefore, acetylcholine antagonism reduced the invigoration of saccades by incentives, and increased the pull of salient distractors. We next asked whether these effects were coupled with changes in preparatory neural activity." But I found this statement to be misleading since the primary effects of the drug seem to have been to decrease the frequency of distractor-repulsed saccades... so "decreased push" would probably be a better analogy than "increased pull".

      Thank you for noticing this, we agree, and have changed this to (page 5, line 165):

      “Therefore, acetylcholine antagonism reduced the invigoration of saccades by incentives, and decreased the repulsion of salient distractors. We next asked whether these effects were coupled with changes in preparatory neural activity.”

      I don't see anything in EEG preprocessing about channel rejection and interpolation. Were these steps performed? There are very few results related to the full set of electrodes.

      We did not reject or interpolate any channels, as visual inspection found no obvious outliers in terms of noisiness, and no channels had standard deviations (across time/trials) higher than our standard cutoff (of 80). The artefact rejection was applied across all EEG channels, so any trials with absolute voltages over 200μV in any channel were removed from the analysis. On average 104/120 trials were included (having passed this check, along with eye-movement artefact checks) per condition per person, and we have added the range of these, along with totals across conditions, to the Analysis section, together with a statement about channel rejection/interpolation (page 20, line 588):

      “Epochs were from -200:1500ms around the preparation cue onset, and were baselined to the 100ms before the preparation cue appeared. Visual inspection found no channels with outlying variance, so no channel rejection or interpolation was performed. We rejected trials from the EEG analyses where participants blinked or made saccades (according to EyeLink criteria above) during the epoch, or where EEG voltage in any channel was outside -200:200μV (muscle activity). On average 104/120 trials per condition per person were included (SD = 21, range = 21-120), and 831/960 trials in total per person (SD=160, range=313-954). A repeated-measures ANOVA found there were no significant differences in number of trials excluded for any condition (p > .2).”
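
      The voltage-based checks described here can be sketched as follows. This is an illustrative sketch only: the data layout (a trial as a list of channels, each a list of samples in μV), the function names, and the use of the population standard deviation are assumptions, and the eye-movement checks are omitted.

```python
from statistics import pstdev


def trial_is_clean(trial, limit_uv=200.0):
    """True if no sample in any channel falls outside -limit_uv..limit_uv."""
    return all(abs(v) <= limit_uv for channel in trial for v in channel)


def keep_clean_trials(trials, limit_uv=200.0):
    """Drop every trial in which any channel exceeds the voltage limit."""
    return [t for t in trials if trial_is_clean(t, limit_uv)]


def noisy_channels(trials, sd_cutoff=80.0):
    """Indices of channels whose SD, pooled across time and trials,
    is above the cutoff."""
    n_channels = len(trials[0])
    flagged = []
    for ch in range(n_channels):
        samples = [v for trial in trials for v in trial[ch]]
        if pstdev(samples) > sd_cutoff:
            flagged.append(ch)
    return flagged
```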

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Review #1:

      Summary:

      Jin et al. investigated how the bacterial DNA damage (SOS) response and its regulator protein RecA affect the development of drug resistance under short-term exposure to beta-lactam antibiotics. Canonically, the SOS response is triggered by DNA damage, which results in the induction of error-prone DNA repair mechanisms. These error-prone repair pathways can increase mutagenesis in the cell, leading to the evolution of drug resistance. Thus, inhibiting the SOS regulator RecA has been proposed as a means to delay the rise of resistance. 

      In this paper, the authors deleted the RecA protein from E. coli and exposed this ∆recA strain to selective levels of the beta-lactam antibiotic, ampicillin. After an 8-hour treatment, they washed the antibiotic away and allowed the surviving cells to recover in regular media. They then measured the minimum inhibitory concentration (MIC) of ampicillin against these treated strains. They note that after just 8-hour treatment with ampicillin, the ∆recA had developed higher MICs towards ampicillin, while by contrast, wild-type cells exhibited unchanged MICs. This MIC increase was also observed in subsequent generations of bacteria, suggesting that the phenotype is driven by a genetic change.

      The authors then used whole genome sequencing (WGS) to identify mutations that accounted for the resistance phenotype. Within resistant populations, they discovered key mutations in the promoter region of the beta-lactamase gene, ampC; in the penicillin-binding protein PBP3 which is the target of ampicillin; and in the AcrB subunit of the AcrAB-TolC efflux machinery. Importantly, mutations in the efflux machinery can impact the resistance towards other antibiotics, not just beta-lactams. To test this, they repeated the MIC experiments with other classes of antibiotics, including kanamycin, chloramphenicol, and rifampicin. Interestingly, they observed that the ∆recA strains pre-treated with ampicillin showed higher MICs towards all other antibiotics tested. This suggests that the mutations conferring resistance to ampicillin are also increasing resistance to other antibiotics.

      The authors then performed an impressive series of genetic, microscopy, and transcriptomic experiments to show that this increase in resistance is not driven by the SOS response, but by independent DNA repair and stress response pathways. Specifically, they show that deletion of the recA reduces the bacterium's ability to process reactive oxygen species (ROS) and repair its DNA. These factors drive the accumulation of mutations that can confer resistance to different classes of antibiotics. The conclusions are reasonably well-supported by the data, but some aspects of the data and the model need to be clarified and extended.

      We sincerely appreciate your overall summary of the manuscript and your positive evaluation of our work.

      Strengths:

      A major strength of the paper is the detailed bacterial genetics and transcriptomics that the authors performed to elucidate the molecular pathways responsible for this increased resistance. They systemically deleted or inactivated genes involved in the SOS response in E. coli. They then subjected these mutants to the same MIC assays as described previously. Surprisingly, none of the other SOS gene deletions resulted in an increase in drug resistance, suggesting that the SOS response is not involved in this phenotype. This led the authors to focus on the localization of DNA PolI, which also participates in DNA damage repair. Using microscopy, they discovered that in the RecA deletion background, PolI co-localizes with the bacterial chromosome at much lower rates than wild-type. This led the authors to conclude that deletion of RecA hinders PolI and DNA repair. Although the authors do not provide a mechanism, this observation is nonetheless valuable for the field and can stimulate further investigations in the future.

      In order to understand how RecA deletion affects cellular physiology, the authors performed RNA-seq on ampicillin-treated strains. Crucially, they discovered that in the RecA deletion strain, genes associated with antioxidative activity (cysJ, cysI, cysH, sodA, sufD) and Base Excision Repair (mutH, mutY, mutM), which repairs oxidized forms of guanine, were all downregulated. The authors conclude that down-regulation of these genes might result in elevated levels of reactive oxygen species in the cells, which in turn, might drive the rise of resistance. Experimentally, they further demonstrated that treating the ∆recA strain with an antioxidant GSH prevents the rise of MICs. These observations will be useful for more detailed mechanistic follow-ups in the future.

      We are grateful to you for your positive assessment of the strengths of our manuscript and your recognition of its potential future applications.

      Weaknesses:

      Throughout the paper, the authors use language suggesting that ampicillin treatment of the ∆recA strain induces higher levels of mutagenesis inside the cells, leading to the rapid rise of resistance mutations. However, as the authors note, the mutants enriched by ampicillin selection can play a role in efflux and can thus change a bacterium's sensitivity to a wide range of antibiotics, in what is known as cross-resistance. The current data is not clear on whether the elevated "mutagenesis" is driven by ampicillin selection or by a bona fide increase in mutation rate.

      We greatly appreciate your raising this issue, as it is an important premise that must be clearly stated throughout the manuscript. To verify that the observed increase in mutation rate is a bona fide increase and not due to experimental error, we used a non-selective antibiotic, rifampicin, to evaluate the mutation frequency after drug induction, as it is a gold-standard method documented in other studies [Heterogeneity in efflux pump expression predisposes antibiotic-resistant cells to mutation, Science, 362, 6415, 686-690, 2018.]. In the absence of ampicillin treatment, the natural mutation rates detected using rifampicin were consistent between the wild-type and the ΔrecA strain. However, after ampicillin treatment, the mutation rate detected using rifampicin was significantly elevated only in the ΔrecA strain (Fig. 1G). We also employed other antibiotics, such as ciprofloxacin and chloramphenicol, in our experiments to treat the cells (data not shown). However, we observed that beta-lactam antibiotics specifically induced the emergence of resistance or altered the MIC in our bacterial populations. If resistance had pre-existed before antibiotic exposure, rather than arising from a bona fide increase in mutation rate, we would expect other antibiotics to exhibit a similar selective effect, particularly given the potential for cross-resistance to multiple antibiotics.
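
      For readers unfamiliar with the rifampicin assay, a mutation frequency of this kind is conventionally the ratio of rifampicin-resistant colony-forming units to total viable colony-forming units. The sketch below illustrates that arithmetic only; the function names and all plate counts are hypothetical and are not taken from the study.

```python
def cfu_per_ml(colonies, dilution_factor, plated_volume_ml):
    """Colony-forming units per mL of the undiluted culture."""
    return colonies * dilution_factor / plated_volume_ml


def mutation_frequency(resistant_cfu_per_ml, total_cfu_per_ml):
    """Fraction of viable cells that form colonies on the rifampicin plate."""
    if total_cfu_per_ml <= 0:
        raise ValueError("total viable count must be positive")
    return resistant_cfu_per_ml / total_cfu_per_ml


# Hypothetical counts: 30 colonies on a rifampicin plate from 0.1 mL of
# undiluted culture, 150 colonies on a plain plate from 0.1 mL of a
# million-fold dilution.
resistant = cfu_per_ml(30, 1, 0.1)
viable = cfu_per_ml(150, 10**6, 0.1)
frequency = mutation_frequency(resistant, viable)
```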

      Furthermore, on a technical level, the authors employed WGS to identify resistance mutations in the ampicillin-treated wild-type and ∆recA strains. However, the WGS methodology described in the paper is inconsistent. Notably, wild-type WGS samples were picked from non-selective plates, while ΔrecA WGS isolates were picked from selective plates with 50 μg/mL ampicillin. Such an approach biases the frequency and identity of the mutations seen in the WGS and cannot be used to support the idea that ampicillin treatment induces higher levels of mutagenesis.

      We appreciate your concern regarding potential inconsistencies in the WGS methodology. However, we would like to clarify that the primary aim of the WGS experiment was to identify the types of mutations present in the wild-type and ΔrecA strains after ampicillin treatment, rather than to quantify or compare mutation frequencies. This purpose was explicitly stated in the manuscript.

      Furthermore, the choice of selective and non-selective conditions was made to ensure the successful isolation of mutants in both strains. Specifically, if selective conditions (50 μg/mL ampicillin) were applied to the wild-type strain, it would have been nearly impossible to recover colonies for WGS analysis, as wild-type cells are highly susceptible to ampicillin at this concentration (Top, Author response image 1). Conversely, under non-selective conditions, ΔrecA mutants carrying resistance mutations may not have been effectively isolated, which would have limited our ability to identify resistance mutations in these strains (Bottom, Author response image 1). Thus, the use of different selection pressures was essential for achieving the objective of mutation identification in this study.

      Author response image 1.

      After 8 hours of antibiotic treatment, the wild type or the ΔrecA cells were plated on agar plates either without ampicillin or with 50 μg/mL ampicillin and incubated for 24-48 hours. Top: Under selective conditions, no wild type colonies were recovered, indicating high susceptibility to the antibiotic, preventing further analysis. Bottom: In non-selective conditions, both ΔrecA resistant mutants and non-resistant cells grew, making it difficult to distinguish and isolate the mutants carrying resistance mutations.

      Finally, it is important to establish what the basal mutation rates of both the WT and ∆recA strains are. Currently, only the ampicillin-treated populations were reported. It is possible that the ∆recA strain has inherently higher mutagenesis than WT, with a larger subpopulation of resistant clones. Thus, ampicillin treatment might not in fact induce higher mutagenesis in ∆recA.

      Thanks for this suggestion. The basal mutation frequencies of the wild-type and the ∆recA strains have been measured using rifampicin (Fig. 1G), and there is no significant difference between them.

      Reviewer #2:

      Summary:

      This study aims to demonstrate that E. coli can acquire rapid antibiotic resistance mutations in the absence of a DNA damage response. To investigate this, the authors employed a sophisticated experimental framework based on a modified Adaptive Laboratory Evolution (ALE) workflow. This workflow involves numerous steps culminating in the measurement of antibiotic resistance. The study presents evidence that a recA strain develops ampicillin resistance mutations more quickly than the wild-type, as shown by measuring the Minimum Inhibitory Concentration (MIC) and mutation frequency. Whole-genome sequencing of 15 recA-colonies resistant to ampicillin revealed predominantly inactivation of genes involved in the multi-drug efflux pump system, whereas, in the wild-type, mutations appear to enhance the activity of the chromosomal ampC cryptic promoter. By analyzing mutants involved in the SOS response, including a lexA3 mutant incapable of inducing the SOS response, the authors conclude that the rapid evolution of antibiotic resistance occurs in an SOS-independent manner when recA is absent.

      Furthermore, RNA sequencing (RNA-seq) of the four experimental conditions suggests that genes related to antioxidative responses drive the swift evolution of antibiotic resistance in the recA-strain.

      We greatly appreciate your overall summary of the manuscript and your positive evaluation of our work.

      Weaknesses:

      However, a potential limitation of this study is the experimental design used to determine the 'rapid' evolution of antibiotic resistance. It may introduce a significant bottleneck in selecting ampicillin-resistant mutants early on. A recA mutant could be more susceptible to ampicillin than the wild-type, and only resistant mutants might survive after 8 hours, potentially leading to their enrichment in subsequent steps. To address this concern, it would be critical to perform a survival analysis at various time points (0h, 2h, 4h, 6h, and 8h) during ampicillin treatment for both recA and wild-type strains, ensuring there is no difference in viability.

      We appreciate your suggestion. We measured the survival fraction at 0, 2, 4, 6, and 8 hours after ampicillin treatment. The results show no significant difference in antibiotic sensitivity between the wild-type and ΔrecA strains (Fig. S2). We therefore added a description in the main text: “Meanwhile, after 8 hours of treatment with 50 μg/mL ampicillin, the survival rates of both wild type and ΔrecA strain were consistent (Fig. S2)”.

      The observation that promoter mutations are absent in ΔrecA strains could be explained by previous research indicating that amplification of the AmpC genes is a mechanism for E. coli resistance to ampicillin, which does not occur in a recA-deficient background (PMID# 19474201).

      We are very grateful to you for providing this reference. We did examine the amplification of the ampC gene in both wild-type and recA-deficient strains, but we found no significant changes in its copy number after ampicillin treatment (Author response image 2). Therefore, the results and discussion regarding gene copy number were not included in this manuscript.

      Author response image 2.

      Copy number variations of genes in the chromosome before and after exposure to ampicillin at 50 µg/mL for 8 hours in the wild type and ΔrecA strain.

      The section describing Figure 3 is poorly articulated, and the conclusions drawn are apparent. The inability of a recA strain to induce the SOS response is well-documented (lines 210 and 278). The data suggest that merely blocking SOS induction is insufficient to cause 'rapid' evolution in their experimental conditions. To investigate whether SOS response can be induced independently of lexA cleavage by recA, alternative experiments, such as those using a sulA-GFP fusion, might be more informative.

      Thanks for your suggestion. We agree that detecting the expression level of SulA can provide valuable information to reveal the impact of the SOS system on rapid drug resistance. In addition to fluorescence visualization and quantification of SulA expression, regulating the transcription level of the sulA gene can achieve the same objective. Therefore, in our transcriptome sequencing analysis, we focused on evaluating the transcription level of sulA (Fig. 4E).

      In Figure 4E, the lack of increased SulA gene expression in the wild-type strain treated with ampicillin is unexpected, given that SulA is an SOS-regulated gene. The fact that polA (Pol I) is going down should be taken into account in the interpretation of Figures 2D and 2E.

      Thank you for your observation regarding the lack of increased SulA gene expression in the wild-type strain treated with ampicillin in Figure 4E. We agree that SulA is typically an SOS-regulated gene, and its expression is expected to increase in response to DNA damage induced by antibiotics like ampicillin. However, in our experimental conditions, the observed lack of increased SulA expression could be due to different factors. One possibility is that the concentration of ampicillin used, or the duration of treatment, was not sufficient to induce a strong SOS response in the wild-type strain under the specific conditions tested. Additionally, differences in experimental setups such as timing, sampling, or cellular stress responses could account for the lack of a pronounced upregulation of SulA.

      You also noted that the decrease in polA (Pol I) should be taken into account in the interpretation of Figures 3D and 3E, and we agree.

      The connection between compromised DNA repair, the accumulation of Reactive Oxygen Species (ROS) based on RNA-seq data, and accelerated evolution is merely speculative at this point and not experimentally established.

      We greatly appreciate your comments. First, the correlation between DNA mutations and the accumulation of reactive oxygen species (ROS) has been experimentally confirmed. As shown in Fig. 4I, after the addition of the antioxidant GSH, DNA resistance mutations were not detected in the ΔrecA strain treated with ampicillin for 8 hours, compared to those without the addition of GSH, proving that the rapid accumulation of ROS induces the enhancement of DNA resistance mutations. Second, the enhancement of DNA resistance mutations in relation to bacterial resistance has been widely validated and is generally accepted. Finally, we appreciate your suggestion to strengthen the evidence supporting ROS enhancement. To address this, we have added an experiment to measure ROS levels. Through flow cytometry, we found that ROS levels significantly increased in both the wild-type and ΔrecA strains after 8 hours of ampicillin treatment. However, ROS levels in the ΔrecA strain showed a significant further increase compared to the wild-type strain (Fig. 4G). Additionally, with the addition of 50 mM glutathione, no significant change in ROS levels was observed in either the wild-type or ΔrecA strain before and after ampicillin treatment (Fig. 4H). This result further confirms our finding in Fig. 4I, where adding GSH inhibited the development of antibiotic resistance.

      Reviewer #3:

      Summary:

      In the present work, Zhang et al investigate the involvement of the bacterial DNA damage repair SOS response in the evolution of beta-lactam drug resistance in Escherichia coli. Using a combination of microbiological, bacterial genetics, laboratory evolution, next-generation, and live-cell imaging approaches, the authors propose that short-term drug resistance evolution can take place in RecA-deficient cells in an SOS response-independent manner. They propose the evolvability of drug resistance is alternatively driven by the oxidative stress imposed by the accumulation of reactive oxygen species and inhibition of DNA repair. Overall, this is a nice study that addresses a growing and fundamental global health challenge (antimicrobial resistance). However, although the authors perform several multi-disciplinary experiments, there are several caveats to the authors' proposal that ultimately do not fully support their interpretation that the observed antimicrobial resistance evolution phenotype is due to compromised DNA repair.

      We greatly appreciate your overall summary of the manuscript and positive evaluation of our work.

      Strengths:

      The authors introduce new concepts to antimicrobial resistance evolution mechanisms. They show short-term exposure to beta-lactams can induce durably fixed antimicrobial resistance mutations. They propose this is due to compromised DNA repair and oxidative stress. This is primarily supported by their observations that resistance evolution phenotypes only exist for recA deletion mutants and not other genes in the SOS response.

      Thanks for your positive comments.

      Weaknesses:

      The authors do not show any direct evidence (1) that these phenotypes exist in strains harboring deletions in other DNA repair genes outside of the SOS response, (2) that DNA damage is increased, (3) that reactive oxygen species accumulate, (4) that accelerated resistance evolution can be reversed by anything other than recA complementation. The authors do not directly test alternative hypotheses. The conclusions drawn are therefore premature.

      We sincerely thank you for your insightful comments. First, in this study, our primary focus is on the role of recA deficiency in bacterial antibiotic resistance evolution. Therefore, we conducted an in-depth investigation on E. coli strains lacking RecA and found that its absence promotes resistance evolution through mechanisms involving increased ROS accumulation and downregulation of DNA repair pathways. While we acknowledge the importance of other DNA repair genes outside of the SOS response, exploring them is beyond the scope of this paper. However, in a separate unpublished study, we have identified the involvement of another DNA recombination protein, whose role in resistance evolution is not yet fully elucidated, in promoting resistance development. This finding is part of another independent investigation.

      Regarding DNA damage and repair, our paper emphasizes that resistance-related mutations in DNA are central to the development of antibiotic resistance. These mutations are a manifestation of DNA damage. To demonstrate this, we measured mutation frequency and performed whole-genome sequencing, both of which confirmed an increase in DNA mutations.

      We appreciate the reviewer's suggestion to provide additional evidence for ROS accumulation, and we have now supplemented our manuscript with relevant experiments. Through flow cytometry, we found that ROS levels significantly increased in both the wild type and ΔrecA strains after 8 hours of ampicillin treatment. However, ROS levels in the ΔrecA strain showed a significant further increase compared to the wild-type strain (Fig. 4G). Additionally, with the addition of 50 mM glutathione, no significant change in ROS levels was observed in either the wild-type or ΔrecA strain before and after ampicillin treatment (Fig. 4H). This result further confirms our finding in Fig. 4I, where adding GSH inhibited the development of antibiotic resistance.

      Finally, in response to your question about reversing accelerated resistance evolution, we would like to highlight that, in addition to recA complementation, we successfully suppressed rapid resistance evolution by supplementing with an antioxidant, GSH (Fig. 4I). This further supports our hypothesis that increased ROS levels play a key role in driving accelerated resistance evolution in the absence of RecA.

      Recommendations for the authors:

      Reviewer #1:

      The authors' model asserts that deletion of recA impairs DNA repair in E. coli, leading to an accumulation of ROS in the cell, and ultimately driving the rapid rise of resistance mutations. However, the experimental evidence does not adequately address whether the resistance mutations are true, de novo mutations that arose due to beta-lactam treatment, or mutations that confer cross-resistance enriched by ampicillin selection.

      a. Major: In Figure 1F & G, the authors show that the ∆recA strain, following ampicillin treatment, has higher resistance and mutation frequency towards rifampicin than WT. However, it is not clear whether the elevated resistance and mutagenesis are driven by mutations enriched by the ampicillin treatment (e.g. mutations in acrB, as seen in Figure 2) or by "new" mutations in the rpoB gene. As the authors note, the mutants enriched by ampicillin selection can play a role in efflux and can thus change a bacterium's sensitivity to a wide range of antibiotics, including rifampicin, in what is known as cross-resistance. Therefore, the mutation frequency calculation, which relies on quantifying rifampicin-resistant clones, might be confounded by bacteria with mutations that confer cross-resistance. A better approach to calculate mutation frequency would be to employ an assay that does not require antibiotic selection, such as a lac-reversion assay. This would mitigate the confounding effects of cross-resistance of drug-resistant mutations.

      We appreciate your thoughtful comments regarding the potential for cross-resistance to confound the mutation frequency calculation based on rifampicin-resistant clones. Indeed, as noted, ampicillin selection can enrich for mutants with enhanced efflux activity, which may confer cross-resistance to a range of antibiotics, including rifampicin.

      However, we believe that the current approach of calculating mutation frequency using rifampicin-resistant mutants is still valid in our specific context. Rifampicin targets the RNA polymerase β subunit, and resistance typically arises from specific mutations in the rpoB gene. These mutations are well-characterized and distinct from those typically associated with efflux-related cross-resistance. Thus, the likelihood of cross-resistance affecting our mutation frequency calculation is minimized in this scenario.

      Additionally, while the lac-reversion assay could be an alternative, it focuses on specific metabolic pathway mutations (such as those affecting lacZ) and would not necessarily capture the same types of mutations relevant to rifampicin resistance or antibiotic-induced mutagenesis. Given our experimental objective of understanding how ampicillin induces mutations that confer antibiotic resistance, the current approach of using rifampicin selection provides a direct and relevant measurement of mutation frequency under antibiotic stress.

      b. Major: It is important to establish what the basal mutation frequencies/rates of both the WT and ∆recA strains are. Currently, only the ampicillin-treated populations were reported. It is possible that the ∆recA strain has an inherently higher mutagenesis than WT. Thus, ampicillin treatment might not in fact induce higher mutagenesis in ∆recA.

      Thanks for your suggestion. The basal mutation frequencies of the wild-type and ∆recA strains were measured using rifampicin (Fig. 1G), and there is no significant difference between them.

      c. Major: In the text, the authors write, "To verify whether drug resistance associated DNA mutations have led to the rapid development of antibiotic resistance in recA mutant strain, we randomly selected 15 colonies on non-selected LB agar plates from the wild type surviving isolates, and antibiotic screening plates containing 50 μg/mL ampicillin from the ΔrecA resistant isolates, respectively." Why were the WT clones picked from non-selective plates and the recA mutant from selective ones for WGS? It appears that such a procedure would bias the recA mutant clones to show more mutations (caused by selection on the ampicillin plate). The authors need to address this discrepancy.

      We appreciate your concern regarding potential inconsistencies in the WGS methodology. However, we would like to clarify that the primary aim of the WGS experiment was to identify the types of mutations present in the wild-type and ΔrecA strains after ampicillin treatment, rather than to quantify or compare mutation frequencies. This purpose was explicitly stated in the manuscript.

      Furthermore, the choice of selective and non-selective conditions was made to ensure the successful isolation of mutants in both strains. Specifically, if selective conditions (50 μg/mL ampicillin) were applied to the wild type strain, it would have been nearly impossible to recover colonies for WGS analysis, as wild-type cells are highly susceptible to ampicillin at this concentration (Top, Author response image 1). Conversely, under non-selective conditions, ΔrecA mutants carrying resistance mutations may not have been effectively isolated, which would have limited our ability to identify resistance mutations in these strains (Bottom, Author response image 1). Thus, the use of different selection pressures was essential for achieving the objective of mutation identification in this study.

      d. Major: In some instances, the authors do not use accurate language to describe their data. In Figure 2A, the authors randomly selected 15 ∆recA clones from a selective plate with 50 µg/mL of ampicillin. These clones were then subjected to WGS, which subsequently identified resistant mutations. Based on the described methods, these mutations are a result of selection: in other words, resistant mutations were preexisting in the bacterial population, and the addition of ampicillin selection killed off the sensitive cells, enabling the proliferation of the resistant clones. However, in the Figure 2 legend and associated text, the authors suggest that these mutations were "induced" by beta-lactam exposure, which is misleading. The data does not support that.

      We appreciate your detailed feedback on the language used to describe our data. We understand the concern regarding the use of the term "induced" in relation to beta-lactam exposure. To clarify, we employed not only beta-lactam antibiotics but also other antibiotics, such as ciprofloxacin and chloramphenicol, in our experiments (data not shown). However, we observed that beta-lactam antibiotics specifically induced the emergence of resistance or altered the MIC in our bacterial populations. If resistance had pre-existed before antibiotic exposure, we would expect other antibiotics to exhibit a similar selective effect, particularly given the potential for cross-resistance to multiple antibiotics.

      Furthermore, we used two different ∆recA strains, and the results were consistent between the strains (Fig. S3). Given that spontaneous mutations can occur with significant variability in populations, if resistance mutations pre-existed before antibiotic exposure, the selective outcomes should have varied between the two strains.

      Most importantly, we found that the addition of the antioxidant compound GSH prevented the evolution of antibiotic resistance following ampicillin treatment in the ΔrecA strain. If we assume that resistant bacteria preexist in the ∆recA strain, then the addition of GSH should not affect the evolution of resistance. Therefore, we believe that the resistance mutations we detected were not simply the result of selection from preexisting mutations but were indeed induced by beta-lactam exposure.

      e. Major: For Figure 4J, using WGS the authors show that the addition of GSH to WT and ∆recA cells inhibited the rise of resistance mutations; no resistance mutations were reported. However, in the "Whole genome sequencing" section under "Materials and Methods", they state that "Resistant clones were isolated by selection using LB agar plates with the supplementation of ampicillin at 50 μg/mL". These clones were then genome-extracted and sequenced. Given the methodology, it is surprising that the WGS did not reveal any resistance mutations in the GSH-treated cells. How were these cells able to grow on 50 μg/mL ampicillin plates for isolation in the first place? The authors need to address this.

      We sincerely apologize for the confusion caused by the incorrect expression in the "Materials and Methods" section. Indeed, when bacteria were treated with the combination of antibiotics and GSH, resistance was significantly suppressed, and no resistant clones could be isolated from selective plates (i.e., LB agar supplemented with 50 μg/mL ampicillin).

      To address this, we instead plated the bacteria treated with antibiotics and GSH onto non-selective plates (without ampicillin) and randomly selected 15 colonies for WGS. None of them showed resistance mutations. We will revise the text in the "Materials and Methods" section to accurately reflect this procedure and provide clarity.

      f. Minor: for Figure 1G, it is misleading to have both "mutation frequency" and "mutant rate" in the y-axis; the two are defined and calculated differently. Based on the Materials and Methods, "mutation frequency" would be the appropriate term. Also, for the ∆recA strain, it is a bit unusual to see mutation frequencies that are tightly clustered. Usually, mutation frequencies follow the Luria-Delbruck distribution. Can the authors explain why the ∆recA data looks so different compared to, say, the WT mutation frequencies?

      Thank you for your insightful feedback. We agree that having both "mutation frequency" and "mutant rate" on the y-axis is misleading, as these terms are defined and calculated differently. To avoid confusion, we will revise Figure 1G to use only "mutation frequency" as the correct term, in line with the methods described in the Materials and Methods section.

      Regarding the ∆recA strain's mutation frequencies, we acknowledge that the data appear more tightly clustered compared to the expected Luria-Delbruck distribution seen in the wild-type strain. The y-axis of Figure 1G is logarithmic, which causes the data to appear more clustered.

      We have also added the basal mutation frequencies of the wild-type and ∆recA strains before exposure to ampicillin. These were measured using rifampicin (Fig. 1G), and there is no significant difference between them.

      g. Minor: It needs to be made clear in the Main Text what the selective antibiotic agar plate used was, rifampicin or ampicillin. I am assuming it was rifampicin, as ampicillin plates would yield resistance frequencies close to 100%, given the prior treatment of the culture with ampicillin.

      Thanks for your comments. Depending on the objective, we used different selective plates. For example, when testing the mutation frequency of antibiotic resistance, we used a selective plate containing rifampicin in order to utilize a non-inducing antibiotic, which is the standard method for calculating resistance mutation frequency. In the WGS experiment, to obtain mutations specific to ampicillin resistance, we selected a selective plate containing ampicillin.
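
      The mutation frequency readout discussed in this exchange can be sketched as follows. This is a minimal illustration that assumes the common definition: rifampicin-resistant CFU divided by total viable CFU counted on a non-selective plate. The function names and all colony counts are hypothetical and are not taken from the study.

```python
import math


def cfu_per_ml(colonies: int, dilution_factor: float, plated_volume_ml: float) -> float:
    """Convert a plate count into CFU/mL (hypothetical helper)."""
    if plated_volume_ml <= 0:
        raise ValueError("plated volume must be positive")
    return colonies * dilution_factor / plated_volume_ml


def mutation_frequency(resistant_cfu_per_ml: float, total_cfu_per_ml: float) -> float:
    """Fraction of viable cells that grow on the selective (rifampicin) plate."""
    if total_cfu_per_ml <= 0:
        raise ValueError("total viable count must be positive")
    return resistant_cfu_per_ml / total_cfu_per_ml


# Hypothetical counts: 12 colonies on rifampicin (undiluted, 0.1 mL plated)
# and 150 colonies on non-selective LB (1e6-fold dilution, 0.1 mL plated).
resistant = cfu_per_ml(12, 1, 0.1)
total = cfu_per_ml(150, 1e6, 0.1)
freq = mutation_frequency(resistant, total)
print(f"mutation frequency = {freq:.2e}, log10 = {math.log10(freq):.2f}")
```

      Because the numerator counts every colony that grows on rifampicin, this value is a frequency of resistant cells and not a per-generation mutation rate, which is the distinction raised about the y-axis label of Figure 1G.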

      Reviewer #2:

      The Y-axis label (log10 mutant rate) in Figure 1G is misleading or incorrect.

      Thanks for your comments and we apologize for this misleading information. The Figure 1G has been revised accordingly.

      In line 393 of the discussion, the authors claim that excessive ROS accumulation drives the evolution of ampicillin resistance, which has not been conclusively demonstrated. Additional experiments are needed to support this statement.

      We greatly appreciate your comments. First, the correlation between DNA mutations and the accumulation of reactive oxygen species (ROS) has been experimentally confirmed. As shown in Fig. 4I, after the addition of the antioxidant GSH, DNA resistance mutations were not detected in the ΔrecA strain treated with ampicillin for 8 hours, compared to those without the addition of GSH, proving that the rapid accumulation of ROS induces the enhancement of DNA resistance mutations. Second, the enhancement of DNA resistance mutations in relation to bacterial resistance has been widely validated and is generally accepted. Finally, we appreciate your suggestion to strengthen the evidence supporting ROS enhancement. To address this, we have added an experiment to measure ROS levels. Through flow cytometry, we found that ROS levels significantly increased in both the wild-type and ΔrecA strains after 8 hours of ampicillin treatment. However, ROS levels in the ΔrecA strain showed a significant further increase compared to the wild-type strain (Fig. 4G). Additionally, with the addition of 50 mM glutathione, no significant change in ROS levels was observed in either the wild-type or ΔrecA strain before and after ampicillin treatment (Fig. 4H). This result further confirms our finding in Fig. 4I, where adding GSH inhibited the development of antibiotic resistance.

      The abstract is overly complex and difficult to read (e.g. "Contrary to previous findings, it is shown that this accelerated resistance development process is dependent on the hindrance of DNA repair, which is completely orthogonal to the SOS response").

      Thank you for the valuable feedback regarding the complexity of the abstract. We agree that certain sections could be simplified for clarity. In response, we have revised the abstract to make it more concise and easier to understand. For example, the sentence “Contrary to previous findings, it is shown that this accelerated resistance development process is dependent on the hindrance of DNA repair, which is completely orthogonal to the SOS response” has been rewritten as: "Unlike earlier studies, we found that the rapid development of resistance relies on the hindrance of DNA repair, a mechanism that operates independently of the SOS response."

      Reviewer #3:

      As indicated above, direct evidence is needed to show (1) that these phenotypes exist in strains harboring deletions in other DNA repair genes outside of the SOS response, (2) that DNA damage is increased, (3) that reactive oxygen species accumulate, (4) that accelerated resistance evolution can be reversed by anything other than recA complementation. There are also other resistance evolution mechanisms untested here, including transcription-coupled repair (TCR) mechanisms involving Mfd. These need to be shown in order to draw the conclusions proposed.

      We sincerely thank you for your insightful comments. First, in this study, our primary focus is on the role of recA deficiency in bacterial antibiotic resistance evolution. Therefore, we conducted an in-depth investigation on E. coli strains lacking RecA and found that its absence promotes resistance evolution through mechanisms involving increased ROS accumulation and downregulation of DNA repair pathways. While we acknowledge the importance of other DNA repair genes outside of the SOS response and other resistance evolution mechanisms including the TCR mechanism, exploring them is beyond the scope of this paper. However, in a separate unpublished study, we have identified the involvement of another DNA recombination protein, whose role in resistance evolution is not yet fully elucidated, in promoting resistance development. This finding is part of another independent investigation.

      Regarding DNA damage and repair, our paper emphasizes that resistance-related mutations in DNA are central to the development of antibiotic resistance. These mutations are a manifestation of DNA damage. To demonstrate this, we measured mutation frequency and performed whole-genome sequencing, both of which confirmed an increase in DNA mutations.

      We appreciate the reviewer's suggestion to provide additional evidence for ROS accumulation, and we have now supplemented our manuscript with relevant experiments. Through flow cytometry, we found that ROS levels significantly increased in both the wild type and ΔrecA strains after 8 hours of ampicillin treatment. However, ROS levels in the ΔrecA strain showed a significant further increase compared to the wild-type strain (Fig. 4G). Additionally, with the addition of 50 mM glutathione, no significant change in ROS levels was observed in either the wild-type or ΔrecA strain before and after ampicillin treatment (Fig. 4H). This result further confirms our finding in Fig. 4I, where adding GSH inhibited the development of antibiotic resistance.

      Finally, in response to your question about reversing accelerated resistance evolution, we would like to highlight that, in addition to recA complementation, we successfully suppressed rapid resistance evolution by supplementing with an antioxidant, GSH (Fig. 4I). This further supports our hypothesis that increased ROS levels play a key role in driving accelerated resistance evolution in the absence of RecA.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Joint Public Review: 

      The molecular mechanisms that mediate the regulated exocytosis of neuropeptides and neurotrophins from neurons via large dense-core vesicles (LDCVs) are still incompletely understood. Motivated by their earlier discovery that the Rab3-RIM1 pathway is essential for neuronal LDCV exocytosis, the authors now examined the role of the Rab3 effector Rabphilin-3A in neuronal LDCV secretion. Based on multiple live and confocal imaging approaches, the authors provide evidence for a synaptic enrichment of Rabphilin-3A and for independent trafficking of Rabphilin-3A and LDCVs. Using an elegant NPY-pHluorin imaging approach, they show that genetic deletion of Rabphilin-3A causes an increase in electrically triggered LDCV fusion events and increased neurite length. Finally, knock-out-replacement studies, involving Rabphilin-3A mutants deficient in either Rab3- or SNAP25-binding, indicate that the synaptic enrichment of Rabphilin-3A depends on its Rab3 binding ability, while its ability to bind to SNAP25 is required for its effects on LDCV secretion and neurite development. The authors conclude that Rabphilin-3A negatively regulates LDCV exocytosis and propose that this mechanism also affects neurite growth, e.g. by limiting neurotrophin secretion. These are important findings that advance our mechanistic understanding of neuronal large dense-core vesicle (LDCV) secretion. 

      The major strengths of the present paper are: 

      (i) The use of a powerful Rabphilin-3A KO mouse model. 

      (ii) Stringent lentiviral expression and rescue approaches as a strong genetic foundation of the study. 

      (iii) An elegant FRAP imaging approach. 

      (iv) A cutting-edge NPY-pHluorin-based imaging approach to detect LDCV fusion events. 

      We thank the reviewers for their positive evaluation of our manuscript.

      Weaknesses that somewhat limit the convincingness of the evidence provided and the corresponding conclusions include the following: 

      (i) The limited resolution of the various imaging approaches introduces ambiguity to several parameters (e.g. LDCV counts, definition of synaptic localization, Rabphilin-3A-LDCV colocalization, subcellular and subsynaptic localization of expressed proteins, AZ proximity of Rabphilin-3A and LDCVs) and thereby limits the reliability of corresponding conclusions. Super-resolution approaches may be required here. 

      We thank the reviewer for their constructive suggestion. We fully agree that super-resolution imaging would produce a more precise localization of RPH3A and co-localization with DCVs. We have now repeated our (co)-localization experiments with STED microscopy. We find that RPH3A co-localizes with the pre-synaptic marker Synapsin1 and, to a lesser extent, with the post-synaptic marker Homer and the DCV marker chromogranin B (new Figure 1). This indicates that RPH3A is highly enriched in synapses, mostly the pre-synapse, and that RPH3A partly co-localizes with DCVs.

      (ii) The description of the experimental approaches lacks detail in several places, thus complicating a stringent assessment. 

      We apologize for the lack of detail in explaining the experimental approaches. We have included a more detailed description in the revised manuscript. 

      (iii) Further analyses of the LDCV secretion data (e.g. latency, release time course) would be important in order to help pinpoint the secretory step affected by Rabphilin-3A. 

      We agree. To address this comment, we have now included the duration of the fusion events (new Figure S2D-F). The start times of the fusion events are shown in the cumulative plots (now Figure 3F and I). The kinetics are normal in the RPH3A KO neurons.

      (iv) It remains unclear why a process that affects a general synaptic SNARE fusion protein - SNAP25 - would specifically affect LDCV but not synaptic vesicle fusion. 

      We agree that we have not addressed this issue systematically enough in the original manuscript. We have now added a short discussion on this topic in the Discussion of the revised manuscript (p 15, line 380-386). In brief, we do not claim full selectivity for the DCV pathway. Some effects of RPH3A deficiency on the synaptic vesicle cycle have been observed. Furthermore, because DCVs typically do not mix in the synaptic vesicle cluster and fuse outside the active zone (and outside the synapse), DCVs might be more accessible to RPH3A regulation.

      (v) The mechanistic links between Rabphilin-3A function, LDCV density in neurites, neurite outgrowth, and the proposed underlying mechanisms involving trophic factor release remain unclear. 

      We agree that we have not addressed all these links systematically enough in the original manuscript, although we feel that we have at least postulated the best possible working model to link RPH3A function to DCV exocytosis/neurotrophic factor release and neurite outgrowth (p 15-16, line 396-400). Of course, a single study cannot support all these links with sufficient experimental evidence. We have now added a short text on what we can conclude exactly based on our experiments and how we see the links between RPH3A function, DCV exocytosis/neurotrophic factor release, neurite outgrowth and DCV density in neurites (p 13-14, line 317-325).

      Reviewer #1 (Public Review): 

      Summary:

      The manuscript by Hoogstraaten et al. investigates the effect of constitutive Rabphilin 3A (RPH3A) ko on the exocytosis of dense core vesicles (DCV) in cultured mouse hippocampal neurons. Using mCherry- or pHluorin-tagged NPY expression and EGFP- or mCherry-tagged RPH3A, the authors first analyse the colocalization of DCVs and RPH3A. Using FRAP, the authors next analyse the mobility of DCVs and RAB3A in neurites. The authors go on to determine the number of exocytotic events of DCVs in response to high-frequency electrical stimulation and find that RPH3A ko increases the number of exocytotic events by a factor 2-3, but not the fraction of released DCVs in a given cell (8x 50Hz stim). In contrast, the release fraction is also increased in RPH3A KOs when doubling the stimulation number (16x 50Hz). They further observe that RPH3A ko increases dendrite and axon length and the overall number of ChgrB-positive DCVs. However, the overall number of DCVs and dendritic length in ko cells directly correlate, indicating that the number of vesicles per dendritic length remains unaffected in the RPH3A KOs. Lentiviral co-expression of tetanus toxin (TeNT) showed a non-significant trend to reduce axon and dendrite length in RPH3A KOs. Finally, the authors use co-expression of RAB3A and SNAP25 constructs to show that RAB3A but not SNAP25 interaction is required to allow the exocytosis-enhancing effect in RPH3A KOs.

      While the authors' methodology is sound and the microscopy experiments are performed well and analyzed appropriately, their results in larger parts do not sufficiently support their conclusions. Moreover, the experiments are not always described in sufficient detail (e.g. FRAP; DCV counts vs. neurite length) to fully understand their claims.

      Overall, I thus feel that the manuscript does not provide a sufficient advance in knowledge. 

      Strengths: 

      - The authors' methodology is sound, and the microscopy results are performed well and analyzed appropriately. 

      - Figure 2: The exocytosis imaging is elegant and potentially very insightful. The effect in the RPH3A KOs is convincing. 

      - Figure 4: the logic of this experiment is elegant. It shows that the increased number of DCV fusion events in RPH3A KOs is related to the interaction of RPH3A with RAB3A but not with SNAP25. 

      We thank the reviewer for their positive evaluation of our manuscript.

      Weaknesses: 

      - The results in larger parts do not sufficiently support the conclusions. 

      - The experiments are not always described in sufficient detail (e.g. FRAP; DCV counts vs. neurite length) to fully understand their claims. 

      - Not of sufficient advance in knowledge for this journal 

      - The significance of differences in control experiments (WT vs. KO) varies between experiments shown in different figures.

      - Axons and dendrites were not analyzed separately in Figures 1 and 2. 

      - The colocalization study in Figure 1 would require super-resolution microscopy. 

      To address the reviewers’ comments, we have provided a more detailed explanation of our analysis (p 19-20, line 521-542). In addition, we have repeated our colocalization experiments using STED microscopy, see Joint Public Review item (i).  

      Reviewer #2 (Public Review): 

      Summary: 

      Hoogstraaten et al. investigated the involvement of rabphilin-3A (RPH3A) in DCV fusion in neurons during calcium-triggered exocytosis at the synapse and during neurite elongation. They suggest that RPH3A acts as an inhibitory factor for LDCV fusion and this is mediated partially via its interaction with SNAP25 and not Rab3A/Rab27. It is a very elegant study although several questions remain to be clarified.

      Strengths: 

      The authors use state-of-the-art techniques like tracking NPY-pHluorin exocytosis and FRAP experiments to quantify these processes, providing novel insight into LDCV exocytosis and the involvement of RPH3A.

      We thank the reviewer for their positive evaluation of our manuscript.

      Weaknesses: 

      At the current state of the manuscript, further supportive experiments are necessary to fully support the authors' conclusions. 

      We thank the reviewer for their comments and suggestions. We have performed additional experiments to support our conclusions; see Joint Public Review items (i)-(iv).

      Reviewer #3 (Public Review): 

      Summary: 

      The molecular mechanism of regulated exocytosis has been extensively studied in the context of synaptic transmission. However, in addition to neurotransmitters, neurons also secrete neuropeptides and neurotrophins, which are stored in dense core vesicles (DCVs). These factors play a crucial role in cell survival, growth, and shaping the excitability of neurons. The mechanism of release for DCVs is similar, but not identical, to that used for SV exocytosis. This results in slow kinetics and low release probabilities for DCV compared to SV exocytosis. There is a limited understanding of the molecular mechanisms that underlie these differences. By investigating the role of rabphilin-3A (RPH3A), Hoogstraaten et al. uncovered for the first time a protein that inhibits DCV exocytosis in neurons.

      Strengths: 

      In the current work, Hoogstraaten et al. investigate the function of rabphilin-3A (RPH3A) in DCV exocytosis. This RAB3 effector protein has been shown to possess a Ca2+ binding site and an independent SNAP25 binding site. Using colocalization analysis of confocal imaging the authors show that in hippocampal neurons RPH3A is enriched at pre- and post-synaptic sites and associates specifically with immobile DCVs. Using site-specific RPH3A mutants they found that the synaptic location was due to its RAB3 interaction site. They further could show that RPH3A inhibits DCV exocytosis due to its interaction with SNAP25. They came to that conclusion by comparing NPY-pHluorin release in WT and RPH3A KO cells and by performing rescue experiments with RPH3A mutants. Finally, the authors showed that by inhibiting stimulated DCV release, RPH3A controlled the axon and dendrite length possibly through the reduced release of neurotrophins. Thereby, they pinpoint how the proper regulation of DCV exocytosis affects neuron physiology.

      We thank the reviewer for their positive evaluation of our manuscript.

      Weaknesses: 

      Data context 

      One of the findings is that RPH3A accumulates at synapses and is mainly associated with immobile DCVs. However, Farina et al. (2015) showed that 66% of all DCVs are secreted at synapses and that these DCVs are immobile prior to secretion. To provide additional context to the data, it would be valuable to determine if RPH3A KO specifically enhances secretion at synapses. Additionally, the authors propose that RPH3A decreases DCV exocytosis by sequestering SNAP25 availability. At first glance, this hypothesis appears suitable. However, due to RPH3A synaptic localization, it should also limit SV exocytosis, which it does not. In this context, the only explanation for RPH3A's specific inhibition of DCV exocytosis is that RPH3A is located at a synapse site remote from the active zone, thus protecting the pool of SNAP25 involved in SV exocytosis from binding to RPH3A. This hypothesis could be tested using super-resolution microscopy.

      We thank the reviewer for their suggestion. We have now performed super-resolution microscopy, see Joint Public Review item (i). However, these new data do not necessarily explain the stronger effect of RPH3A deficiency on DCV exocytosis, relative to SV exocytosis. We have added a short discussion on this topic to the revised manuscript, see Joint Public Review item (iv).

      Technical weakness 

      One technical weakness of this work consists in the proper counting of labeled DCVs. This is significant since most findings in this manuscript rely on this analysis. Since the data was acquired with epi-fluorescence or confocal microscopy, it doesn't provide the resolution to visualize individual DCVs when they are clumped. The authors use a proxy to count the number of DCVs by measuring the total fluorescence of individual large spots and dividing it by the fluorescence intensity of discrete spots assuming that these correspond to individual DCVs. This is an appropriate method but it heavily depends on the assumption that all DCVs are loaded with the same amount of NPY-pHluorin or chromogranin B (ChgB). Due to the importance of this analysis for this manuscript, I suggest that the authors show that the number of DCVs per µm2 is indeed affected by RPH3A KO using super-resolution techniques such as dSTORM, STED, SIM, or SRRF. 

      The reviewer is correct that this is a crucial issue that we had not addressed optimally. We devoted a large part of an earlier manuscript to this issue but did not refer to that work clearly enough. We have now clarified this (p 7, line 187-190). In brief, we previously quantified the ratio of the fluorescence intensity of ChgB and NPY-pHluorin in confocal microscopy to the number of dSTORM puncta in sparse areas of WT mouse hippocampal neurons (Persoon et al., 2018). This quantification yielded a unitary fluorescence intensity per vesicle that was very stable across different neurons. Although there might be some underestimation of the total number of DCVs when using confocal microscopy, the study of Persoon et al. (2018) has demonstrated that these parameters correlate well and that the estimations are accurate. Considering that the rF/F0 is similar in RPH3A WT and KO neurons (now Figure S2I), meaning that the intensity of NPY-pHluorin of one fusion event is comparable, we can presume that this correlation also applies for the RPH3A KO neurons.
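
      The counting proxy discussed here (dividing the fluorescence of each punctum by a unitary, single-vesicle intensity) can be sketched as below. This is an illustration only: the use of the median as the unitary estimate, the rounding rule, the function names, and all intensity values are assumptions and do not reproduce the published analysis pipeline.

```python
from statistics import median
from typing import Sequence


def unitary_intensity(discrete_puncta: Sequence[float]) -> float:
    """Estimate single-vesicle fluorescence from well-separated puncta.

    Assumption: each discrete punctum is one DCV, and the median is a
    robust summary of their intensities.
    """
    if not discrete_puncta:
        raise ValueError("need at least one discrete punctum")
    return median(discrete_puncta)


def estimate_dcv_count(puncta: Sequence[float], unit: float) -> int:
    """Convert punctum intensities into an estimated number of DCVs.

    Each punctum contributes intensity / unit vesicles, rounded to the
    nearest integer, with a minimum of one vesicle per detected punctum.
    """
    if unit <= 0:
        raise ValueError("unitary intensity must be positive")
    return sum(max(1, round(intensity / unit)) for intensity in puncta)


# Hypothetical intensities in arbitrary units.
unit = unitary_intensity([98.0, 100.0, 102.0])
count = estimate_dcv_count([100.0, 310.0, 95.0, 520.0], unit)
print(f"unitary intensity = {unit}, estimated DCVs = {count}")
```

      The estimate is only as good as the assumption that vesicles carry similar amounts of label, which is the dependency the reviewer points out.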

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors): 

      Major points: 

      (1) The authors perform an extensive analysis regarding the colocalization of RPH3A and DCVs (Figure 1 upper part). This analysis is hampered by the fact that the resolution of the recorded data (> 1 µm) is too limited relative to vesicle size to allow strong claims here. In my view, super-resolution microscopy would be required for the co-localization studies shown in Figure 1.

      We fully agree and have now performed super-resolution microscopy, see Joint Public Review item (i).

      (2) The FRAP experiments (Figure 1 lower part) cannot be sufficiently understood from what is presented. The methods say that both laser channels were activated during bleaching but NPY-pHluorin is not bleached in Fig.1E. Explanation of the bleaching is not very circumspect. In 1D, it is rather EGFP-RPH3A that is entering the bleached area than the NPY vesicles. These experiments require a more careful explanation of methodology, observed results, and their interpretation. Overall, the observed effects in the original kymograph traces require a better explanation. 

      We acknowledge that NPY-pHluorin in Figure 1E (now Figure 2C) is not completely bleached. NPY-pHluorin appeared to be more difficult to bleach than NPY-mCherry. However, it is important to clarify that we bleached the neurites merely to remove the stationary puncta and facilitate our analysis of DCV/RPH3A dynamics. This bleaching step does not affect the interpretation of our results. We apologize that this was not clearly stated in the text and have made the necessary adjustments in the legend, Results, and Methods sections (p 6-7, line 162-163; p 5, line 140-142 and p 19, line 508-513). Additionally, we apologize for the accidental switch of the kymographs for NPY-mCherry and EGFP-RPH3A in Figure 1D (now Figure 2B, C). We greatly appreciate the reviewer identifying this error.

      (3) Figure 1: The authors need to mention whether axons, dendrites, or both were analyzed throughout the different panels and how they were identified. Is it possible that axons were wrapping around dendrites in their cultures (compare e.g. Shimojo et al., 2015)? Given the limited spatial resolution and because of this wrapping, interpretation of results could be affected. 

      We completely agree with the reviewer’s assessment and conclusion. We are unable to distinguish axons from dendrites using this experimental design. We have made sure to specify in the text that our observation that RPH3A does not co-travel with DCVs is true for both dendrites and axons (p 5, line 150).

      (4) Figure 2: The exocytosis imaging is elegant and potentially very insightful. The effect in the RPH3A KOs is convincing. However, the authors determine the efficacy of exocytosis from NPY-pHluorin unquenching of DCVs only. This is only one of several possible parameters to read out the efficiency of exocytosis. Kinetics like e.g. delay between stimulation and start of exocytosis events or release time course of NPY after DCV fusion were not determined. Such analysis could give a better insight into what process before or after the fusion of DCVs is affected by RPH3A ko. 

      We fully agree with the reviewer. We have now included the duration of the fusion events (new Figure S2D-F). The start times of the fusion events are shown in the cumulative plots (now Figure 3F and I). The kinetics are normal in the RPH3A KO neurons.

      Moreover, it needs to be mentioned whether 2C and D are from WT or ko cultures. It would be best to show representative examples from both genotypes. 

      We have now adjusted this in the new figure (now Figure 3C, D).

      The number of fusion events is much increased but the release fraction is not significantly changed. While this is consistent with results in Figure 4C it is at variance with 4F. This raises questions about the reliability of the effects in RPH3A KOs. 

      The release fraction indicates the number of fusion events normalized to the total DCV pool. In Figure 4D, we observed a slightly bigger pool size, which explains the lack of significance when analyzing the released fraction. In Figure 4G, however, DCV pool sizes are similar between KO and WT, leading to a statistically significant effect on release fraction in KO neurons. Furthermore, Figures 4B and E distinctly show a substantial increase in fusion events in RPH3A KO neurons. This variability in pool size observed could potentially be attributed to variation in culture or inherent biological variability.

      Given the increased number of ChgrB-positive DCVs in RPH3A KOs (shown in Figure 2) and that only the cumulative number of exocytosis events were analysed, how can the authors exclude that the RPH3A ko only affects vesicle number but not release, if the % change in released vesicles is not different to WT? Kinetics of release don't seem to be affected. Importantly, what was the density of NPY-pHluorin vesicles in WT vs. ko? 

      In Figure 2 (now Figure 5) we show that RPH3A KO neurons are larger and contain more endogenous ChgB+ puncta than WT neurons. This increased number of ChgB+ puncta scales with neuron size, as puncta density is not increased. A previous study (Persoon et al., 2018) has demonstrated a strong correlation between DCV number and neuron size. Our data show that RPH3A deficiency increases DCV exocytosis, but the released fraction of vesicles depends on the total number of DCVs, which we determined during live recording by dequenching NPY-pHluorin using NH4+. Considering that this relies on overexpression of a heterologous DCV-fusion reporter, and not on endogenous staining of DCVs as in the case of ChgB+ puncta, some variability is not unexpected.

      Also in these experiments, the question arises of whether the authors analyse axons, dendrites, or both throughout the different panels and how they were identified. 

      In our experimental design we record all fusion events per cell, including both axons and dendrites but excluding the cell soma. We have clarified this in the method section, (p 19, line 508 and p 19, line 521-522).

      (5) Figure 3: in D the authors show that ChgrB-pos. DCV density is slightly increased in KOs. How does this relate to the density of NPY-pHluorin DCVS in Figure 2? 

      We do not observe a difference in NPY-pHluorin density (see Author response image 1). However, it is important to note that we relied on tracing neurites in live recording images to determine the neuronal size. In contrast, the ChgB density was based on dendritic length measured by post-hoc MAP2 staining. In addition, ChgB+ puncta represent an endogenous DCV staining, whereas NPY-pHluorin quantification is based on overexpression of a heterologous DCV-fusion reporter. These two factors likely contribute some variability.

      Author response image 1.

      The authors show a non-significant trend of TeNT coexpression to reduce axon and dendrite lengths in RPH3A KOs. While this trend is visible, I think one cannot draw conclusions from that when not reaching significance. The argument of the authors that the increased axon and dendrite lengths are created by growth factor peptide release from DCV during culture time is interesting. However, the fact that TeNT expression shows a trend toward reducing this effect on axons/dendrites is not sufficient to prove the release of such growth factors. 

      We agree. We have toned down this speculation in the revised manuscript, (p 15-16, line 395-400).

      Lastly, the authors don't provide insight into the mechanisms, of how RPH3A ko increases the number of DCVs per µm dendritic length in the neurons. In my view, there are too many loose ends in this story of how RPH3A ko first increases spontaneous release of DCVs and then enhances neurite growth and DCV density. Did the authors e.g. measure the spontaneous release of DCVs in their cultures? 

      We measured spontaneous release of DCVs during the 30s baseline recording prior to stimulation. We observed no difference in spontaneous release between WT and KO neurons (now Figure S2H). However, baseline recording lasted only 30 seconds. It is possible that this was too short to detect subtle effects.

      Other points: 

      (1) Figure 4: the logic of this experiment is elegant. It shows that the increased number of DCV fusion events in RPH3A KOs is related to the interaction of RPH3A with RAB3A but not with SNAP25. As mentioned above, it is irritating that the reduction of fusion events in KOs and on the release fraction is sometimes reaching significance, but sometimes it does not. Likewise, the absence of significant effects on DCV numbers is not consistent with the results shown in Figures 3C and D. 

      DCV numbers in Figure 3 (now Figure 5) are determined by staining for endogenous ChgB, whereas in Figure 4D and G DCV numbers are determined by overexpressing NPY-pHluorin and counting the dequenched puncta following a NH4+ puff.

      (2) Figure 1B: truncation of the y-axis needs to be clearly indicated. 

      We have replaced this figure with new Figure 1 and have indicated truncations of the y-axis when needed (new Figure 1E). 

      (3) Page 10: "Given that neuropeptides are key modulators of adult neurogenesis (Mu et al., 2010), and that RPH3A depletion leads to increased DCV exocytosis, it is coherent that we observed longer neurites in RPH3A KO neurons." I cannot follow the argument of the authors here: what has neurogenesis to do with neurite length? 

      We apologize for the confusion. We have clarified this in the revised text, (p 16, line 398-400).

      Minor point: 

      There are some typos in the manuscript. e.g., page 8: "... may partially dependent on regulated secretion...); page 6: "...to dequence all...". 

      Thank you for noticing, we have corrected the typos.

      Reviewer #2 (Recommendations For The Authors): 

      (1) Supplementary Figure S1A, in my opinion, should be in Figure 1A as it illustrates all the constructs used in this study and helps the reader to follow it up. 

      We thank the reviewer for their suggestion. However, we feel that with the adjustments we have made in Figure 1, the illustrations of the constructs fit better in Figure S1, since new Figure 1 shows the localization of endogenous RPH3A and not that of the constructs.  

      (2) One of the conclusions of the manuscript is the synaptic localization of the different RPH3A mutants. The threshold for defining synaptic localization is not clear from either the images or the analysis: for example, the Manders coefficient for VGLUT1-Syn1, which is used as a positive control, ranges from 0.65-0.95 and that of RPH3A and Syn1 ranges from 0.5-0.95. These values should be compared to all mutants and the conclusions should be based on such comparison. 

      We agree. We have now repeated our initial co-localization experiment with all the RPH3A mutants (now Figure S1D-F).  

      (3) Strengthening this figure with STED/SIM/dSTORM microscopy can verify and add a new understanding of the subtle changes of RPH3A localization. 

      We fully agree and have now added super-resolution microscopy data, see Joint Public Review item (i).

      (4) As RAB3A/RAB27A (ΔRAB3A/RAB27A) loses the punctate distribution, please clarify how it can function at the synapse and not act as a KO. Is it sorted to the synapse, and if so, how? 

      We used lentiviral delivery to introduce our constructs, resulting in the overexpression of ΔRAB3A/RAB27A mutant RPH3A. This overexpression likely compensates for the loss of the punctate distribution of RPH3A, thereby maintaining its limiting effect on DCV exocytosis. It is plausible that under physiological conditions, the mislocalization of RPH3A would lead to increased exocytosis, similar to what we observed in the KO. 

      (5) Is RPH3A expressed in both excitatory and inhibitory neurons? 

      We agree this is an important question. Single cell RNA-seq already suggests the protein is expressed in both, but we nevertheless decided to test expression of RPH3A protein in excitatory and inhibitory neurons, using immunocytochemistry with VGAT and VGLUT as markers in hippocampal and striatal WT neurons. We found that RPH3A is expressed in both VGLUT+ hippocampal neurons and VGAT+ striatal neurons (new Figure S1A, B).  

      (6) The differential use of ChgB and NPY as markers for DCVs should be clarified and compared as these are used at different stages of the manuscript. 

      We have previously addressed the comparison between ChgB and NPY-pHluorin (Persoon et al., 2018). We made sure to indicate this more clearly throughout the manuscript to clarify the use of the two markers. 

      (7) FRAP experiments- A graph describing NPY recovery should be added as a reference to 2H and discussed. 

      We agree. We have made the necessary adjustments (new Figure 2G).

      (8) Figure 2E shows some degree of "facilitation" between the 2 8x50 pulses RPH3A KO neurons. Can the author comment on that? What was the reason for using this dual stimulation protocol? 

      There is indeed some facilitation between the two 8 x 50 pulses in KO neurons and to a lesser extent also in the WT neurons, which we have observed before in WT neurons (Baginska et al., 2023). Baginska et al. (2023) showed recently that different stimulation protocols can influence certain fusion dynamics, like the ratio of persistent and transient events and event duration. We used two different stimulation protocols to thoroughly investigate the effect of RPH3A on exocytosis, and to assess the robustness of our findings regarding the number of fusion events. Fusion kinetics were similar in WT and KO neurons for both stimulation protocols (new Figure 2D-F).

      (9) Figure 3 quantifies dendrites length and then moves to quantify both axon and dendrites for the Tetanus toxin experiment. What are the effects of KO on axon length? In the main figures, it is not mentioned but in S3 it seems not to be affected. How does it reconcile with the main conclusion on neurite length? 

      Figure 3H (now Figure 6C) shows the effect of the KO on axon length: the axon length is increased in RPH3A KO neurons compared to WT, similar to dendrite length. Re-expressing RPH3A in KO neurons rescues axonal length to WT levels. In Figure S3, we observe a similar trend as in main Figure 3 (new Figure 6), yet this effect did not reach significance. Based on this, we concluded that neurite length is increased upon RPH3A depletion.

      (10) For lay readers, please explain the total pool and how you measured it. However, see the next comment. 

      We agree. We have now defined this better in the revised manuscript, (p 19, line 524-527 and p 20, line 535-539).

      (11) It is a bit hard to understand if the total number of DCV was increased in the KO and if the pool size was increased and in which figure it is quantified. Some sentences like: "A trend towards a larger intracellular DCV pool in KO compared to WT neurons was observed" do not fit with "No difference in DCV pool size was observed between WT and KO neurons (Figure S2D)" or with "During stronger stimulation (16 bursts of 50 APs at 50 Hz), the total fusion and released fraction of DCVs were increased in KO neurons compared to WT". They are not directly supported, or not related to specific figures. Please indicate if the total DCVs pool, as measured by NH4, was increased and based on that, the fraction of the releasable DCVs following the long stimulation. From Figure 2H, the conclusion is an increase in fusion events. In general, NH4 is not quantified clearly- is it quantified in Figure S2C? And if it is a trend, how can it become significant in Figure 3? 

      We agree there has been some inconsistency in the way we describe the data on the total number of DCVs. We have addressed this in the revised text to ensure better clarity. The total DCV pool measured by NPY-pHluorin was not significantly increased in KO neurons. We do see a trend towards a bigger DCV pool in the 2x8 50 Hz stimulation paradigm (now Figure S2C), and therefore the released fraction of vesicles is not increased in Figure 1G (now Figure 3G). The number of DCVs in Figure 3 (now Figure 5) is based on endogenous ChgB staining and not on overexpression, unlike the DCV pool measured by NPY-pHluorin. In Figure 3 (now Figure 5) we show that RPH3A KO neurons have slightly more ChgB+ puncta compared to WT.

      (12) In Figure 3, the quantification is not clear, discrete puncta are not visible but rather a smear of chromogranin staining. How was it quantified? An independent method to count DCV number, size, and distribution like EM is necessary to support and add further understanding. 

      We acknowledge that discrete ChgB puncta are not completely visible in Figure 3 (now Figure 5). Besides the inherent limitation in resolution with confocal imaging, we believe that this is due to ChgB accumulation in the KO neurons, as shown in now Figure 5D. Nonetheless, to address this concern of the reviewer, we have selected other images that represent our dataset (now Figure 5A). Furthermore, the number of ChgB+ DCVs was calculated using SynD software (Schmitz et al., 2011; van de Bospoort et al., 2012) (see previous reply). EM would offer valuable independent confirmation on the total DCV number, size and distribution. However, with the current method we already know that vesicle numbers are at least similar. Does that justify the (major) investment in a quantitative EM study? Moreover, this issue does not affect the central message of the current study.

      (13) Can the author discuss if the source of DCVs that are released at the synapse is similar or different from the source of DCVs fused while neurites elongate? 

      With our current experimental design, we are unable to draw conclusions regarding this aspect. We are not sure how experiments to identify this source (probably the Golgi?) would be crucial to sustain the central message of our study.

      (14) An interesting and related question: what are the expression levels of RPH3A during development and neuronal growth during the nervous system development? 

      While we have not specifically examined the expression levels of RPH3A over development, public databases show that RPH3A expression increases over time in mice, consistent with other synaptic proteins (Blake et al., 2021; Baldarelli et al., 2021; Krupke et al., 2017). We have now added this to the revised manuscript (p 2, line 55-56).

      (15) The conclusion from Figure 4 about the contribution of SNAP25 interaction to RPH3A inhibitory effect is not convincing. The data are scattered and in many neurons, high levels of fusion events were detected. Further or independent experiments are needed to support this conclusion. For example, is the interaction with SNAP25 important for its inhibitory activity in other DCV-releasing systems like adrenal medulla chromaffin cells? 

      We agree that further studies in other DCV-releasing systems like chromaffin cells would provide valuable insight into the role of SNAP25 interaction in RPH3A’s inhibitory effect on exocytosis. However, we believe that starting new series of experiments in another model system is outside of the scope of our current study.

      (16) Furthermore, the number of DCVs in the KO is similar in this experiment, raising some more questions about the quantification of the number of vesicles, that differ, in different sections of the manuscript (points # 10,11). 

      The total DCV pool in the fusion experiments is measured by overexpressing NPY-pHluorin; this cannot be directly compared to the number of endogenous ChgB+ DCVs in Figure 3 (now Figure 5); see also item (11).

      (17) The statement - "RPH3A is the only negative regulator of DCV" is not completely accurate as other DCV inhibitors like tomosyn were described before. 

      We agree. By this statement, we intend to convey that RPH3A is the only negative regulator of DCVs without substantial impact on synaptic vesicle exocytosis, unlike Tomosyns. We have clarified this in the revised text, (p 15, line 366-367).

      (18) The support for the effect of KO on the "clustering of DCVs" is not convincing. 

      The intensity of endogenous ChgB puncta was decreased in RPH3A KO neurons (now Figure 5E). However, the peak intensity induced by single NPY-pHluorin labeled DCV fusion events (quanta) was unchanged (now Figure S2I). This indicates that the decrease in ChgB puncta intensity must be due to a reduced number of DCVs (quanta) in this specific location. We have interpreted that as ‘clustering’, or maybe ‘accumulation’. However, we only put forward this possibility. We are now more careful in our speculations within the text, (p 11 line 271-277).

      (19) Final sentence: "where RPH3A binds available SNAP25, consequently restricting the assembly of SNARE complexes" should be either demonstrated or rephrased as no effect of trans or general SNARE complex formation is shown. 

      We agree. We have made the necessary adjustments in the text, (p 15, line 387-389).   

      (20) A scheme summarizing RPH3A's interaction with synaptic proteins and its effects on DCVs release, maybe even versus its effects on SVs release, should be considered as a figure or graphic abstract. 

      We have included a working model in Figure 7.  

      (21) Figure 4 logically should come after Figure 2 to summarize the fusion-related chapter before moving to neurite elongation. 

      We have placed Figure 4 after Figure 2 (now Figure 3).

      Reviewer #3 (Recommendations For The Authors): 

      One important finding of this study is that RPH3A downregulates neuron size, possibly by inhibiting DCV release. Additionally, the authors demonstrated that the number of DCVs is directly proportional to the number of DCVs per µm2, and that RPH3A KO reduces DCV clustering. This conclusion was drawn by comparing ChgB with NPY-pHluorin loading of the DCVs. However, this comparison is not valid as ChgB is expressed at an endogenous level and NPY-pHluorin is over-expressed. In the KO situation where DCV exocytosis is enhanced, the available endogenous ChgB may be depleted faster than the overexpressed NPY-pHluorin. Hoogstraaten et al. should either perform a study in which ChgB is overexpressed to test whether the difference in DCV remains or at least provides an alternative interpretation of their data. 

      We thank the reviewer for this comment. The reviewer challenges one or two conclusions in our original manuscript (it is not entirely clear to what exactly “This conclusion” refers): (a) “the number of DCVs is directly proportional to the number of DCVs per µm2”, and (b) “that RPH3A KO reduces DCV clustering”. The reviewer probably means (a) that the number of DCVs per neuron is directly proportional to the size of the neuron, and states that this conclusion (or these conclusions) is “not valid as ChgB is expressed at an endogenous level and NPY-pHluorin is over-expressed” because “endogenous ChgB may be depleted faster than the overexpressed NPY-pHluorin”. We have three arguments to conclude that faster depletion of ChgB cannot affect these two conclusions: (1) DCVs bud off from the Golgi with newly synthesized (fresh) ChgB. Whether or not a larger fraction of DCVs is released does not influence this initial ChgB loading into DCVs (together with over-expressed NPY-pHluorin); (2) in hippocampal neurons merely 1-6% of the total DCV pool undergoes exocytosis (the current study and also extensively demonstrated in Persoon et al., 2018). RPH3A KO neurons release a few percent more of the total DCV pool. Hence, “depletion of ChgB” is only marginally different between experimental groups; and (3) the proposed experiment overexpressing ChgB will not help scrutinize our current conclusions, as ChgB overexpression is known to affect DCV biogenesis and the total DCV pool, most likely much more than the few percent more release caused by RPH3A deficiency.

      Hoogstraaten et al. conducted a thorough analysis of the impact of RPH3A KO and its rescue using various mutants on dendrite and axon length (see Supplementary Figure 3). However, they did not test the effect of the ΔSNAP25 mutant. The authors demonstrated that this mutant is the least efficient in rescuing DCV exocytosis (Figure 4E). Hence the neurons expressing this mutant should have a similar size to the KO neurons. This finding would strongly support the argument that DCV exocytosis regulates neuron size. Otherwise, it would suggest that RPH3A may have a function in regulating exocytosis at the growth cones that is independent of SNAP25. Since the authors most probably have the data that allows them to measure the neuron size (acquired for Supplementary Figure 2), I suggest that they perform the required analysis. 

      We agree this is important and performed new experiments to determine the dendrite length of RPH3A WT, KO and KO neurons expressing the ΔSNAP25 mutant. We observed that the dendrite length of RPH3A KO neurons expressing the ΔSNAP25 mutant is indeed similar to that of KO neurons (new Figure S3C). Although the difference is not significant, we observe a clear trend towards bigger neurons compared to WT. This strengthens our conclusion that increased DCV exocytosis contributes to the observed increase in neuronal size.

      The authors displayed the result of DCV exocytosis in two ways. One is by showing the number of exocytosis events the other is to display the proportion of DCVs that were secreted. They do the latter by dividing the secreted DCV by the total number of DCVs. These are visualized at the end of the experiment through NH4+ application. While this method works well for synaptic secretion as the marker of SV is localized to the SV membrane and remains at the synapse upon SV exocytosis, it cannot be applied in the same manner when it is the DCV content that is labeled as it is released upon secretion. Hence, the total pool of vesicles should be the number of DCV counted upon NH4+ application in addition to those that are secreted. This way of analyzing the total pool of DCV might also explain the difference in this pool size between KO neurons stimulated two times with 8 stimuli instead of one time with 16 stimuli (Sup Fig 2 C and D). This is an important point as it affects the conclusions drawn from Figure 2. 

      We thank the reviewer for this comment. We agree, and we have made the necessary adjustments throughout the manuscript.
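
      The two ways of normalizing discussed in this exchange can be illustrated with a minimal sketch. It assumes that the NH4+ count captures only vesicles that have not released their cargo, so the corrected total pool is the NH4+ count plus the number of fusion events. The function name `released_fraction` and the counts below are hypothetical and are not taken from the study.

```python
def released_fraction(fusion_events, nh4_puncta, include_released=True):
    """Fraction of the DCV pool that fused during stimulation.

    fusion_events:    fusion events counted during stimulation
    nh4_puncta:       puncta counted after NH4+ dequenching (remaining vesicles)
    include_released: if True, add the released vesicles back to the pool,
                      as the reviewer suggests; if False, normalize to the
                      NH4+ count alone
    """
    if fusion_events < 0 or nh4_puncta < 0:
        raise ValueError("counts must be non-negative")
    pool = nh4_puncta + fusion_events if include_released else nh4_puncta
    if pool == 0:
        raise ValueError("total pool is zero")
    return fusion_events / pool


# Hypothetical counts, chosen only for illustration (not data from the study).
uncorrected = released_fraction(30, 1000, include_released=False)
corrected = released_fraction(30, 1000)
print(uncorrected, corrected)
```

      With release fractions of a few percent, the correction changes the denominator only slightly, but it always lowers the computed fraction.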

      The kymogram of DCV exocytic events displayed in Figure 2D shows a majority of persistent (>20s long) events. This is strange as NPY-pHluorin corresponds to the released cargo. Previous work using the same labeling and stimulation technique showed that content release occurs in less than 10s (Baginska et al. 2023). The authors should comment on that difference. 

      In Baginska et al. (2023), the authors distinguished between persistent and transient events. The transient events are shorter than 10s for the 2x8 and 16x stimulation paradigms, whereas persistent events can last for more than 10s. In our study we did not make this distinction. However, in response to this reviewer, we have now quantified the fusion duration per cell. These new data show that the mean duration is similar between genotypes for both stimulation paradigms. We have added these new data (new Figure S2D-F).

      In Figures 1D and E, some puncta in the kymogram appeared to persist after bleaching. This raises questions about the effectiveness of the bleaching procedure for the FRAP experiment. 

      The reviewer is correct that NPY-pHluorin in Figure 1E (now Figure 2C) is not fully bleached. NPY-pHluorin was more resistant to bleaching than NPY-mCherry. However, we merely bleached the neurites to facilitate our analysis by reducing fluorescence of the stationary puncta without causing phototoxicity. Some remaining fluorescence after bleaching does not affect our conclusions in any way.

      In the discussion, the paragraph titled "RPH3A does not travel with DCVs in hippocampal neurons" is quite confusing and would benefit from a streamlined explanation. 

      We thank the reviewer for this comment. We made the necessary adjustments to make this paragraph clearer (p 14, line 339-351).

      First paragraph of page 8 "TeNT expression in KO neurons restored neurite length to WT levels. When compared to KO neurons without TeNT, neurite length was not significantly decreased but displayed a trend towards WT levels (Figure 3G, H)." These two sentences are confusing as they seem contradictory. 

      We agree that this conclusion was too strong. However, we do not see a contradiction. The significant effect between KO and control neurons on both axon and dendrite length is lost upon TeNT expression (which forms the basis for our conclusions cited by the reviewer, now Figure 6B, C). While the difference between KO neurons +/- TeNT did not reach statistical significance, the (strong) trend is clearly in the same direction. We have refined our original conclusion in the revised manuscript (p 12, line 304-306).

      The data availability statement is missing. 

      We have added the data availability statement, (p 21, line 571-572).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Common comments

      (1) Significance of zero mutation rate

      Reviewers asked why we included mutation rate even though setting mutation rate to zero doesn’t change results. We think that including non-zero mutation rate makes our results more generalisable, and thus is a strength rather than weakness. To better motivate this choice, we have added a sentence to the beginning of Results:

      (2) Writing the mu=0 case first

      Reviewers suggested that we should first focus on the mu=0 case, and then generalize the result. The suggestion is certainly a good one. However, given the large amount of work involved in a reorganization, we have decided to keep our current narrative. That said, we now include only the mu=0 equations in the main text, and have moved the case of nonzero mutation rate to Supplementary Information.

      (3) Making equations more accessible

      We have taken three steps to make equations more readable.

      ● Equations in the main text correspond to the case of zero-mutation rate.

      ● The original section on equation derivation is now in a box in the main text, so that readers can skip it while interested readers can still get the gist of where the equations came from.

      ● We have provided a much more detailed interpretation of the equation (see page 10).

      (4) Validity of the Gaussian approximation

      Reviewers raised concerns about the validity of the Gaussian approximation for the F frequency 𝑓(𝜏). The fact that our calculations closely match simulations suggests that this approximation is reasonable. Still, we added a discussion about the validity of this approximation in Box 1.

      We also added a figure to the SI covering various initial S and F sizes. This figure shows that when either initial S or initial F is small, the distribution of 𝑓(𝜏) is not normal. However, if initial S and F are both on the order of hundreds, then the distribution of 𝑓(𝜏) is approximately Gaussian.
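
      A rough way to see this effect numerically is sketched below. It is not the authors' model: it assumes independent pure-birth (Yule) growth of S and F with no death and no mutation, and all names and parameter values (`yule_size`, `r_s`, `r_f`, `t`) are hypothetical. The sample skewness of the final F frequency is used as a crude indicator of departure from a Gaussian shape.

```python
import math
import random


def yule_size(n0, rate, t, rng):
    """Size at time t of a pure-birth (Yule) population started from n0 cells.

    Each founder's lineage size is geometric on {1, 2, ...} with success
    probability exp(-rate * t); the lineages are summed.
    """
    p = math.exp(-rate * t)
    log_q = math.log(1.0 - p)
    total = 0
    for _ in range(n0):
        u = 1.0 - rng.random()  # u lies in (0, 1], which avoids log(0)
        total += int(math.log(u) / log_q) + 1
    return total


def f_frequency_samples(s0, f0, r_s=0.5, r_f=0.6, t=4.0, n_samples=2000, seed=0):
    """Sample the F frequency after maturation for given initial sizes."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        s = yule_size(s0, r_s, t, rng)
        f = yule_size(f0, r_f, t, rng)
        samples.append(f / (s + f))
    return samples


def skewness(xs):
    """Sample skewness; close to zero for a Gaussian-shaped distribution."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


# Few initial S cells versus hundreds of both types (illustrative values only).
print("small initial S:", skewness(f_frequency_samples(2, 200)))
print("hundreds of each:", skewness(f_frequency_samples(300, 300)))
```

      Under these assumptions the frequency distribution should be clearly skewed when one type starts with only a few cells, and close to symmetric when both start in the hundreds. Skewness alone does not establish normality, so this is only a quick check.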

      Public Reviews:

      Summary:

      The authors demonstrate with a simple stochastic model that the initial composition of the community is important in achieving a target frequency during the artificial selection of a community.

      Strengths:

      To my knowledge, the intra-collective selection during artificial selection has not been seriously theoretically considered. However, in many cases, the species dynamics during the incubation of each selection cycle are important and relevant to the outcome of the artificial selection experiment. Stochasticity from birth and death (demographic stochasticity) plays a big role in these species' abundance dynamics. This work uses a simple framework to tackle this idea meticulously.

      This work may or may not be hysteresis (path dependency). If this is true, maybe it would be nice to have a discussion paragraph talking about how this may be the case. Then, this work would even attract the interest of people studying dynamic systems.

      We have added this clarification in the main text:

      “Note that here, selection outcome is path-dependent in the sense of being sensitive to initial conditions. This phenomenon is distinct from hysteresis where path-dependence results from whether a tuning parameter is increased or decreased.”

      Weaknesses:

      (1) Connecting structure and function

      In the typical artificial selection literature, most studies select the community based on collective function. Here in this paper, the authors are selecting a target composition. Although there is a schematic cartoon illustrating the relationship between collective function (y-axis) and the community composition in the main Figure 1, there is no explicit explanation or justification of what may be the origin of this relationship. I think giving the readers a naïve idea about how this structure-function relationship arises in the introduction section would help. This is because the conclusion of this paper is that the intra-collective selection makes it hard to artificially select a community that has an intermediate frequency of f (or s). If there is really evidence or theoretical derivation from this framework that indeed the highest function comes from the intermediate frequency of f, then the impact of this paper would increase because the conclusions of this stochastic model could allude to the reasons for the prevalent failures of artificial selection in literature.

      We have added this to introduction: “This is a common quest: whenever a collective function depends on both populations, collective function is maximised, by definition, at an intermediate frequency (e.g. too little of either population will hamper function [23]).”

      (2) Explain intra-collective and inter-collective selection better for readers.

      The abstract, the introduction, and the results section use the terms intra-collective and inter-collective selection without much explanation. For the wide readership of eLife, a clear definition in the beginning would help the audience grasp the importance of this paper, because these concepts are at the core of this work.

      This is a great point. We have added to the Abstract:

      “Such collective selection is dictated by two opposing forces: during collective maturation, intra-collective selection acts like a waterfall, relentlessly driving the S-frequency to lower values, while during collective reproduction, inter-collective selection resembles a rafter striving to reach the target frequency. Due to this model structure, maintaining a target frequency requires the continued action of inter-collective selection.”

      and to the Introduction:

      “A selection cycle consists of three stages (Fig. 1). During collective maturation, intra-collective selection favors fast-growing individuals within a collective. At the end of maturation, inter-collective selection acts on collectives and favors those achieving the target composition. Finally during collective reproduction, offspring collectives sample stochastically from the parents, a process dominated by genetic drift.”
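      The three stages quoted above can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions that are not taken from the manuscript: maturation is treated as deterministic (the F/S ratio is multiplied by a hypothetical fold change `W`), the mutation rate is zero, and the function names and parameter values are invented for illustration. Only reproduction is stochastic here, with each Newborn drawing `n0` cells from the chosen parent.

```python
import random


def mature(f, W):
    """Maturation (assumed deterministic): the F/S ratio is multiplied by W > 1."""
    return f * W / (1.0 - f + f * W)


def binomial(n, p, rng):
    """Number of F cells among n cells sampled from a parent with F frequency p."""
    return sum(1 for _ in range(n) if rng.random() < p)


def selection_cycle(newborn_fs, target, W, n0, rng):
    """One cycle: maturation, inter-collective selection, reproduction."""
    # Intra-collective selection: F frequency rises in every collective.
    adults = [mature(f, W) for f in newborn_fs]
    # Inter-collective selection: keep the single Adult closest to the target.
    parent = min(adults, key=lambda f: abs(f - target))
    # Reproduction: g Newborns, each sampling n0 cells from the parent.
    g = len(newborn_fs)
    newborns = [binomial(n0, parent, rng) / n0 for _ in range(g)]
    return parent, newborns


# Hypothetical parameter values, chosen only to make the demo run quickly.
rng = random.Random(0)
newborns = [0.1] * 10
for cycle in range(5):
    parent, newborns = selection_cycle(newborns, target=0.1, W=1.5, n0=100, rng=rng)
    print(cycle, round(parent, 3))
```

      Because maturation is deterministic in this sketch, all variation among collectives comes from sampling at reproduction; the model discussed in the response also includes stochastic births and mutation, which are omitted here.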

      (3) Achievable target frequency strongly depending on the degree of demographic stochasticity.

      I would expect that the experimentalists would find these results interesting and would want to consider these results during their artificial selection experiments. The main Figure 4 indicates that the Newborn size N0 is a very important factor to consider during the artificial selection experiment. This would be equivalent to how much bottleneck is imposed on the artificial selection process in every iteration step (i.e., the ratio of serial dilution experiment). However, with a low population size, all target frequencies can be achieved, and therefore in these regimes, the initial frequency now does not matter much. It would be great for the authors to provide what the N0 parameter actually means during the artificial selection experiments. Maybe relative to some other parameter in the model. I know this could be very hard. But without this, the main result of this paper (initial frequency matters) cannot be taken advantage of by the experimentalists.

      We have added to the SI an analytical approximation for N̆<sub>0</sub>, the Newborn size below which all target frequencies can be achieved.

      Also, we have added lines indicating N̆<sub>0</sub> in Fig. 4a.

      (4) Consideration of environmental stochasticity.

      The success (gold area of Figure 2d) in this framework mainly depends on the size of the demographic stochasticity (birth-only model) during the intra-collective selection. However, during experiments, a lot of environmental stochasticity appears to be occurring during artificial selection. This may be out of the scope of this study. But it would definitely be exciting to see how much environmental stochasticity relative to the demographic stochasticity (variation in the Gaussian distribution of F and S) matters in succeeding in achieving the target composition from artificial selection.

      You are correct that our work considers only demographic stochasticity.

      Indeed, considering other types of stochasticity will be an exciting future research direction. We added in the main text:

      “Overall our model considers mutational stochasticity, as well as demographic stochasticity in terms of stochastic birth and stochastic sampling of a parent collective by offspring collectives. Other types of stochasticity, such as environmental stochasticity and measurement noise, are not considered and require future research.”

      (5) Assumption about mutation rates

      If setting the mutation rates to zero does not change the result of the simulations and the conclusion, what is the purpose of having the mutation rates \mu? Also, is the unidirectional (S -> F -> FF) mutation realistic? I didn't quite understand how the mutations could fit into the story of this paper.

      This is a great point. We have added this to the beginning of Results to better motivate our study:

      “We will start with a complete model where S mutates to F at a nonzero mutation rate µ. We made this choice because it is more challenging to attain or maintain the target frequency when the abundance of fast-growing F is further increased via mutations. This scenario is encountered in biotechnology: an engineered pathway will slow down growth, and breaking the pathway (and thus faster growth) is much easier than the other way around. When the mutation rate is set to zero, the same model can be used to capture collectives of two species with different growth rates.”

      See answer on common question 1.

      (6) Minor points

      In Figure 3b, it is not clear to me how the frequency difference for the Intra-collective and the Inter-collective selection is computed.

      We added a description to the caption of Fig. 3b.

      In Figure 5b, the gold region (success) near the FF is not visible. Maybe increase the size of the figure or have an inset for zoom-in. Why is the region not as big as the bottom gold region?

      We increased the resolution of Fig 5b so that the gold region near FF is more visible.

      We have added Fig 5c and the following explanation to the main text:

      “From numerical simulations, we identified two accessible regions: a small region near FF and a band region spanning from S to F (gold in Fig. 5b i). Intuitively, the rate at which FF grows faster than S+F is greater than the rate at which F grows faster than S (see section VIII in Supplementary Information). Thus, the problem can initially be reduced to a two-population problem (i.e. FF versus F+S; Fig. 5c left), and then expanded to a three-population problem (Fig. 5c right).”

      Recommendations For The Authors

      Since the conclusion of the model greatly depends on the noise (variation) of F and S in the Gaussian distribution, it would be nice to have a plot where the y-axis is the variation in terms of frequency and the x-axis is the s_0 or f_0 (frequency). In the plot, I would love to see how the variation in the frequency depends on the initial frequency of S and F. Maybe this is just trivial.

      In the SI, we added Fig6a, as per your request. Previous Fig6 became Fig6b.

      Reviewer #2 (Public review):

      The authors provide an analytical framework to model the artificial selection of the composition of communities composed of strains growing at different rates. Their approach takes into account the competition between the targeted selection at the level of the meta-community and the selection that automatically favors fast-growing cells within each replicate community. Their main finding is a tipping point or path-dependence effect, whereby compositions dominated by slow-growing types can only be reached by community-level selection if the community does not start and never crosses into a range of compositions dominated by fast growers during the dynamics.

      These results seem to us both technically correct and interesting. We commend the authors on their efforts to make their work reproducible even when it comes to calculations via extensive appendices, though perhaps a table of contents and a short description of these appendices at the start of SI would help navigate them.

      Thank you for the suggestion. We have added a paragraph at the beginning of SI.

      The main limitation in the current form of the article is that it could clarify how its assumptions and findings differ from and improve upon the rest of the literature:

      -  Many studies discuss the interplay between community-level evolution and species- or strain-level evolution. But "evolution" can be a mix of various forces, including selection, drift/randomness, and mutation/innovation.

      - This work's specificity is that it focuses strictly on constant community-level selection versus constant strain-level selection, all other forces being negligible (neither stochasticity nor innovation/mutation matter at either level, as we try to clarify now).

      Note that intra-collective selection is not strictly “constant” in the sense that selection favoring F is the strongest at intermediate F frequency (Fig 3). However, we think that you mean that intra- and inter-collective selection are present in every cycle, and this is correct for our case, and for community selection in general.

      -  Regarding constant community-level selection, it is only briefly noted that "once a target frequency is achieved, inter-collective selection is always required to maintain that frequency due to the fitness difference between the two types" [pg. 3 §2]. In other words, action from the selector is required indefinitely to maintain the community in the desired state. This assumption is found in a fraction of the literature, but is still worth clarifying from the start as it can inform the practical applicability of the results.

      This is a good point. We have added to the Abstract:

      “Such collective selection is dictated by two opposing forces: during collective maturation, intra-collective selection acts like a waterfall, relentlessly driving the S-frequency to lower values, while during collective reproduction, inter-collective selection resembles a rafter striving to reach the target frequency. Due to this model structure, maintaining a target frequency requires the continued action of inter-collective selection.”

      - More importantly, strain-level evolution also boils down here to pure selection with a constant target, which is less usual in the relevant literature. Here, (1) drift from limited population sizes is very small, with no meaningful counterbalancing of selection, (2) pure exponential regime with constant fitness, no interactions, no density- or frequency-dependence, (3) there is no innovation in the sense that available types are unchanging through time (no evolution of traits such as growth rate or interactions) and (4) all the results presented seem unchanged when mutation rate mu = 0 (as noted in Appendix III), meaning that the conclusions are not "about" mutation in any meaningful way.

      With regard to point (1), Figure 4a (reproduced below) shows how Newborn size affects the region of achievable targets. Indeed at large Newborn size (e.g. 5000 and above), no target frequency is achievable (since drift is too small to generate sufficient inter-community variation and consequently all communities are dominated by fast-growing F). However at Newborn size of for example 1000, there are two regions of accessible target frequencies. At smaller Newborn size, all target frequencies become achievable due to drift becoming sufficiently strong.
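      The dependence on Newborn size can be illustrated with the binomial sampling step alone. The sketch below is an illustration under stated assumptions, not the calculation behind Fig. 4a: it considers only the sampling of `n0` cells from a parent with F frequency `f`, ignores stochastic births and mutation, and uses a hypothetical function name and an arbitrary example frequency.

```python
import math


def newborn_freq_sd(f, n0):
    """Standard deviation of Newborn F frequency under binomial sampling of n0 cells."""
    return math.sqrt(f * (1.0 - f) / n0)


# Hypothetical parent frequency; Newborn sizes span the range discussed above.
for n0 in (100, 1000, 5000):
    print(n0, newborn_freq_sd(0.1, n0))
```

      The spread shrinks as 1/sqrt(n0), so large Newborn sizes leave little inter-collective variation for selection to act on, in line with the trend described above.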

      With regard to points (2) and (3), we have added to Introduction

      “To enable the derivation of an analytical expression, we have made the following simplifications.

      First, growth is always exponential, without complications such as resource limitation, ecological interactions between the two populations, or density-dependent growth. Thus, the exponential growth equation can be used. Second, we consider only two populations (genotypes or species): the fast-growing F population with size F and the slow-growing S population with size S. We do not consider a spectrum of mutants or species, since with more than two populations, an analytical solution becomes very difficult.”
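      For orientation, the exponential-growth assumption can be written out for the mutation-free, deterministic case. The notation below is assumed for illustration rather than taken from the manuscript: r is the growth rate of S, r + \omega that of F, \tau the maturation time, and f_0 the F frequency of a Newborn.

```latex
% Deterministic sketch: no mutation, no demographic noise (assumed notation).
\begin{align}
F(\tau) &= F(0)\, e^{(r+\omega)\tau}, \qquad S(\tau) = S(0)\, e^{r\tau}, \\
\frac{F(\tau)}{S(\tau)} &= e^{\omega\tau}\, \frac{F(0)}{S(0)}, \\
f(\tau) &= \frac{F(\tau)}{F(\tau)+S(\tau)}
         = \frac{f_0\, e^{\omega\tau}}{1 - f_0 + f_0\, e^{\omega\tau}} .
\end{align}
```

      In this sketch the growth-rate difference and the maturation time enter only through the product \omega\tau, and the F frequency increases during maturation whenever 0 < f_0 < 1 and \omega > 0.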

      With regard to point (4), we view this as a strength rather than weakness. We have added the following to the beginning of Results and Discussions:

      “We will start with a complete model where S mutates to F at a nonzero mutation rate µ. We made this choice because it is more challenging to attain or maintain the target frequency when the abundance of fast-growing F is further increased via mutations.”

      “When the mutation rate is set to zero, the same model can be used to capture collectives of two species with different growth rates.”

      See Point 1 of Common comments.

      - Furthermore, the choice of mutation mechanism is peculiar, as it happens only from slow to fast grower: more commonly, one assumes random non-directional mutations, rather than purely directional ones from less fit to fitter (which is more of a "Lamarckian" idea). Given that mutation does not seem to matter here, this choice might create unnecessary opposition from some readers or could be considered as just one possibility among others.

      We have added the following justification:

      “This scenario is encountered in biotechnology: an engineered pathway will slow down growth, and breaking the pathway (and thus faster growth) is much easier than the other way around.”

      It would be helpful to have all these points stated clearly so that it becomes easy to see where this article stands in an abundant literature and contributes to our understanding of multi-level evolution, and why it may have different conclusions or focus than others tackling very similar questions.

      Finally, a microbial context is given to the study, but the assumptions and results are in no way truly tied to that context, so it should be clear that this is just for flavor.

      We have deleted “microbial” from the title, and revised our abstract:

      Recommendations For The Authors

      (1) More details concerning our main remark above:

      - The paragraph discussing refs [24, 33] is not very clear in how they most importantly differ from this study. Our impression is that the resource aspect is not very important for instance, and the main difference is that these other works assume that strains can change in their traits.

      We are fairly sure that resource depletion is important in Rainey group’s study, as the attractor only evolved after both strains grew fast enough to deplete resources by the end of maturation. Indeed, evolution occurred in interaction coefficients which dictate the competition between strains for resources.

      Regardless, you raised an excellent point. As discussed earlier, we have added the following:

      “To enable the derivation of an analytical expression, we have made the following simplifications.

      First, growth is always exponential, without complications such as resource limitation, ecological interactions between the two populations, or density-dependent growth. Thus, the exponential growth equation can be used. Second, we consider only two populations (genotypes or species): the fast-growing F population with size F and the slow-growing S population with size S. We do not consider a spectrum of mutants or species, since with more than two populations, an analytical solution becomes very difficult.”

      - We would advise the main text to focus on mu = 0, and only say in discussion that results can be generalized.

      Your suggestion is certainly good. However, given the large amount of work involved in a reorganisation, we have decided to adhere to our current narrative. As discussed earlier, we have nonetheless added this at the beginning of Results to help orient readers:

      “We will start with a complete model where S mutates to F at a nonzero mutation rate µ. We made this choice because it is more challenging to attain or maintain the target frequency when the abundance of fast-growing F is further increased via mutations.”

      “When the mutation rate is set to zero, the same model can be used to capture collectives of two species with different growth rates.”

      (2) We think the material on pg. 5 "Intra-collective evolution is the fastest at intermediate F frequencies, creating the "waterfall" phenomenon", although interesting, could be presented in a different way. The mathematical details on how to find the probability distribution of the maximum of independent random variables (including Equation 1) will probably be skipped by most of the readers (for experienced theoreticians, it is standard content; for experimentalists, it is not the most relevant), as such I would recommend displacing them to SM and report only the important results.

      This is an excellent suggestion. We have put a sketch of our calculations in a box in the main text to help orient interested readers. As before, details are in SI.

      Similarly, Equations 2, 3, and 4 are hard to read given the large amount of parameters and the low amount of simplification. Although exploring the effect of the different parameters through Figures 3 and 4 is useful, I think the role of the equations should be reconsidered:

      i. Is it possible to rewrite them in terms of effective variables in a more concise way?

      See Point 3 of Common comments.

      ii. Is it possible to present extreme/particular cases in which they are easier to interpret?

      We have focused on the case where the mutation rate is zero. This makes the mathematical expressions much simpler (see above).

      (3) Is it possible to explain more in detail why the distribution of f_k+1 conditional to f_k^* is well approximated by a Gaussian? Also, have you explored to what extent the results would change if this were not true (in light of the few universal classes for the maximum of independent variables)?

      Despite the appeal to the CLT and the histograms in the Appendix suggesting that the distribution looks a bit like a Gaussian at a certain scale, fluctuations on that scale are not necessarily what is relevant for the results - a rapid (and maybe wrong) attempt at a characteristic function calculation suggests that in your case, one does not obtain convergence to Gaussians unless we renormalize by S(t=0) and F(t=0), so it seems there is a justification missing in the text as is for the validity of this approximation (or that it is simply assumed).

      See point 4 of Common comments.

      Reviewer #3 (Public Reviews):

      The authors address the process of community evolution under collective-level selection for a prescribed community composition. They mostly consider communities composed of two types that reproduce at different rates, and that can mutate one into the other. Due to such differences in 'fitness' and to the absence of density dependence, within-collective selection is expected to always favour the fastest grower, but the collective-level selection can oppose this tendency, to a certain extent at least. By approximating the stochastic within-generation dynamics and solving it analytically, the authors show that not only high frequencies of fast growers can be reproducibly achieved, aligned with their fitness advantage. Small target frequencies can also be maintained, provided that the initial proportion of fast growers is sufficiently small. In this regime, similar to the 'stochastic corrector' model, variation upon which selection acts is maintained by a combination of demographic stochasticity and of sampling at reproduction. These two regions of achievable target compositions are separated by a gap, encompassing intermediate frequencies that are only achievable when the bottleneck size is small enough or the number of communities is (disproportionately) larger.

      A similar conclusion, that stochastic fluctuations can maintain the system over evolutionary time far from the prevalence of the faster-growing type, is then confirmed by analyzing a three-species community, suggesting that the qualitative conclusions of this study are generalizable to more complex communities.

      I expect that these results will be of broad interest to the community of researchers who strive to improve community-level selection, but are often limited to numerical explorations, with prohibitive costs for a full characterization of the parameter space of such embedded populations. The realization that not all target collective functions can be as easily achieved and that they should be adapted to the initial conditions and the selection protocol is also a sobering message for designing concrete applications.

      A major strength of this work is that the qualitative behaviour of the system is captured by an analytically solvable approximation so that the extent of the 'forbidden region' can be directly and generically related to the parameters of the selection protocol.

      Thanks so much for these positive comments.

      I however found the description of the results too succinct and I think that more could be done to unpack the mathematical results in a way that is understandable to a broader audience. Moreover, the phenomenon the authors characterize is of purely ecological nature. Here, mutations of the growth rate are, in my understanding, neither necessary (non-trivial equilibria can be maintained also when \mu =0) nor sufficient (community-level selection is necessary to keep the system far from the absorbing state) for the phenomenon described. Calling this dynamics community evolution reflects a widespread ambiguity, and is not ascribable just to this work. I find that here the authors have the opportunity to make their message clearer by focusing on the case where the 'mutation' rate \mu vanishes (Equations 39 & 40 of the SI) - which is more easily interpretable, at least in some limits - while they may leave the more general equations 3 & 4 in the SI.

      See points 1-4 of Common comments.

      Combined with an analysis of the deterministic equations, that capture the possibility of maintaining high frequencies of fast growers, the authors could elucidate the dynamics that are induced by the presence of a second level of selection, and speculate on what would be the result of real open-ended evolution (not encompassed by the simple 'switch mutations' generally considered in evolutionary game theory), for instance discussing the invasibility (or not) of mutant types with slightly different growth rates.

      Indeed, evolution is not restricted to two types. However, our main goal here is to derive an analytical expression, and it was difficult for even two types. For three-type collectives, we had to resort to simulations. Investigating the case where fitness effects of mutations are continuously distributed is beyond the scope of this study.

      The single most important model hypothesis that I would have liked to be discussed further is that the two types do not interact. Species interactions are not only essential to achieve inheritance of composition in the course of evolution but are generally expected to play a key role even on ecological time scales. I hope the authors plan to look at this in future work.

      In our system, the S and F do interact in a competitive fashion: even though S and F are not competing for nutrients (which are always in excess), they are competing for space. This is because a fixed number of cells are transferred to the next cycle. Thus, the presence of F will for example reduce the chance of S being propagated. We have added this clarification to our main text:

      “Note that even though S and F do not compete for nutrients, they compete for space: because the total number of cells transferred to the next cycle is fixed, an overabundance of one population will reduce the likelihood of the other being propagated.”

      Recommendations For The Authors

      I felt the authors could put some additional effort into making their theoretical results meaningful for a population of readers who, though not as highly mathematically educated as they are, can nonetheless appreciate the implications of simple relations or scaling. Below, you find some suggestions:

      (1) In order to make it clear that there is a 'natural' high-frequency equilibrium that can be reached even in the absence of selection, the authors could examine first the dynamics of the deterministic system in the absence of mutations, and use its equilibria to elucidate the combined role of the 'fitness' difference \omega and of the generation duration \tau in setting its value. The fact that these parameters always occur in combination (when there are no mutations) is a general and notable feature of the stochastic model as well. Moreover, this model would justify why you only focus on decreasing the frequency in the new generation.

      Note that the ‘natural’ high-frequency equilibrium in the absence of collective selection is when fast grower F becomes fixed in the population. Following your suggestion, we have introduced two parameters R<sub>τ</sub> and W<sub>τ</sub> to reflect the coupling between ‘fitness’ and ‘generation duration’:

      (2) Since the phenomenon described in the paper is essentially ecological in nature (as the author states, it does not change significantly if the 'mutation rate' \mu is set to zero), I would put in the main text Equations 39 & 40 of the SI in order to improve intelligibility.

      See Point 2 at the beginning of this letter.

      These equations can be discussed in some detail, especially in the limit of small f^*_k, where I think it is worth discussing the different dependence of the mean and the variance of the frequency distribution on the system's parameters.

      This is a great suggestion. We have added the following:

      “In the limit of small f^*_k, Equation (3) becomes […] while Equation (4) becomes […]. Thus, both Newborn size (N<sub>0</sub>) and fold-change in F/S during maturation (W<sub>τ</sub>) are important determinants of selection progress.”

      (3) I would have appreciated an explanation in words of what are the main conceptual steps involved in attaining Equation 2, the underlying hypotheses (notably on community size and distributions), and the expected limits of validity of the approximation.

      See points 3 and 4 at the beginning of this letter.

      (4) I think that some care needs to be put into explaining where extreme value statistics is used, and why is the median of the conditional distribution the most appropriate statistics to look at for characterizing the evolutionary trajectory (which seems to me mostly reliant on extreme values).

      Great point! We added an explanation of using the median value in Box 1, and also added Figure 7 to the SI to explain it.

      Showing in a figure the different distributions you are considering (for instance, plotting the conditional distribution for one generation in the trajectories displayed in Figure 2) would be useful to understand what information \bar f provides on a sequence of collective generations, where in principle there may be memory effects.

      Thanks for this suggestion. We have added to Fig. 2d a panel illustrating the shape and position of the F-frequency distributions at each step of the first two selection cycles.

      (5) Similarly, I do not understand why selecting the 5% best communities should push the system's evolution towards the high-frequency solution, instead of just slowing down the improvement (unless you are considering the average composition of the top best communities - which should be justified). I think that such sensitivity to the selection intensity should be appropriately referenced and discussed in the main text, as it is a parameter that experimenters are naturally led to manipulate.

      In the main text, we have added this explanation:

      “In contrast with findings from an earlier study [23], choosing top 1 is more effective than the less stringent “choosing top 5%”. In the earlier study, variation in the collective trait is partly due to nonheritable factors such as random fluctuations in Newborn biomass. In that context, a less stringent selection criterion proved more effective, as it helped retain collectives with favorable genotypes that might have exhibited suboptimal collective traits due to unfavorable nonheritable factors. However, since this study excludes nonheritable variations in collective traits, selecting the top 1 collective is more effective than selecting the top 5% (see Fig. 11 in Supplementary Information).”

      (6) Equation 1 could be explained in simpler terms as the product between the probability that one collective reaches the transmitted value times the probability that all others do worse than that. The current formulation is unclear, perhaps just a matter of English formulation.

      We have revised our description to state:

      “Equation (1) can be described as the product of two terms related to probability: (i) describes the probability density that any one of the g Adult collectives achieves f given f^*_k, and (ii) describes the probability that all other g – 1 collectives achieve frequencies above f and are thus not selected.”
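      The verbal description above matches the standard density of the smallest of g independent draws. The sketch below is a hedged illustration of that structure, not a reproduction of Equation (1), which is not shown in this response. It assumes that Adult frequencies are independent Gaussian variables with a hypothetical mean `mu` and standard deviation `sigma`, and that the selected collective is the one with the lowest F frequency (the case where the target lies below all Adults).

```python
import math


def normal_pdf(x, mu, sigma):
    """Gaussian probability density."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mu, sigma):
    """Gaussian cumulative distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def min_density(f, g, mu, sigma):
    """Density that the lowest of g i.i.d. Gaussian Adult frequencies equals f."""
    # (i) any one of the g collectives achieves f
    one_achieves_f = g * normal_pdf(f, mu, sigma)
    # (ii) the other g - 1 collectives all end up above f
    others_above = (1.0 - normal_cdf(f, mu, sigma)) ** (g - 1)
    return one_achieves_f * others_above


# Hypothetical values: 10 Adults whose frequencies scatter around 0.30.
for f in (0.27, 0.28, 0.29, 0.30):
    print(f, min_density(f, g=10, mu=0.30, sigma=0.01))
```

      With g = 1 the expression reduces to the Gaussian density itself, and for larger g the mass shifts toward lower frequencies, which is what allows selection to counteract the rise of F during maturation.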

      (7) I think that the discussion of the dependence of the boundaries of the 'waterfall' region with the difference in growth rate \omega is important and missing, especially if one wants to consider open-ended evolution of the growth rate - which can occur at steps of different magnitude.

      We added a new chapter and figure in supplementary information on the threshold values when \omega varies. As expected, smaller \omega enlarges the success area.

      We have also added a new figure panel to show how maturation time affects selection efficacy.

      (8) Notations are a bit confusing and could be improved. First of all, in most equations in the main text and SI, what is initially introduced as \omega appears as s. This is confusing because the letter s is also used for the frequency of the slow type.

      The letter S is used to denote an attribute of cells (S cells), the type of cells (Equations 1-3 of the SI) and the number of these cells in the population, sometimes with different meanings in the same sentence. This is confusing, and I suggest referring to slow cells or fast cells instead (or at least to S-cells and F-cells), and keeping S and F as variables for the number of cells of the two types.

      All typos related to the notation have been fixed. We use S and F as types, and S and F (italic) as population numbers.

      (9) On page 3, when introducing the sampling of newborns as ruled by a binomial distribution, the information that you are just transmitting one collective is needed, while it is conveyed later.

      We have added this emphasis:

      “At the end of a cycle, a single Adult with the highest function (with F frequency f closest to the target frequency) is chosen to reproduce g Newborn collectives each with N<sub>0</sub> cells (‘Selection’ and ‘Reproduction’ in Fig. 1).”

      (10) I found that the abstract talks too early about the 'waterfall' phenomenon. As this is a concept introduced here, I suggest the authors first explain what it is, then use the term. It is a useful metaphor, but it should not obscure the more formal achievements of the paper.

      We feel that the “waterfall” analogy offers a gentle helping hand to orient those who have not thought much about the phenomenon. We view the abstract as an opportunity to attract readership, and thus the more accessible the better.

      (11) In the SI there are numerous typos and English language issues. I suggest the authors read carefully through it, and add line numbers to the next version so that more detailed feedback is possible.

      Thank you for going through SI. We have gone through the SI, and fixed problems.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews: 

      Reviewer #1 (Public Review): 

      Summary: 

      The authors aimed to investigate the contribution of antigenic drift in the HA and NA genes of seasonal influenza A(H3N2) virus to their epidemic dynamics. Analyzing 22 influenza seasons before the COVID-19 pandemic, the study explored various antigenic and genetic markers, comparing them against indicators characterizing the epidemiology of annual outbreaks. The central findings highlight the significant influence of genetic distance on A(H3N2) virus epidemiology and emphasize the role of A(H1N1) virus incidence in shaping A(H3N2) epidemics, suggesting subtype interference as a key factor. 

      Major Strengths: 

      The paper is well-organized, written with clarity, and presents a comprehensive analysis. The study design, incorporating a span of 22 seasons, provides a robust foundation for understanding influenza dynamics. The inclusion of diverse antigenic and genetic markers enhances the depth of the investigation, and the exploration of subtype interference adds valuable insights. 

      Major Weaknesses: 

      While the analysis is thorough, some aspects require deeper interpretation, particularly in the discussion of certain results. Clarity and depth could be improved in the presentation of findings. Furthermore, the evolving dynamics of H3N2 predominance post-2009 need better elucidation.  

      Reviewer #2 (Public Review): 

      Summary: This paper aims to achieve a better understanding of how the antigenic or genetic compositions of the dominant influenza A viruses in circulation at a given time are related to key features of seasonal influenza epidemics in the US. To this end, the authors analyze an extensive dataset with a range of statistical, data science and machine learning methods. They find that the key drivers of influenza A epidemiological dynamics are interference between influenza A subtypes and genetic divergence, relative to the previous one or two seasons, in a broader range of antigenically related sites than previously thought. 

      Strengths: A thorough investigation of a large and complex dataset. 

      Weaknesses: The dataset covers a 21 year period which is substantial by epidemiological standards, but quite small from a statistical or machine learning perspective. In particular, it was not possible to follow the usual process and test predictive performance of the random forest model with an independent dataset. 

      Reviewer #3 (Public Review): 

      Summary: 

      This paper explores the relationships among evolutionary and epidemiological quantities in influenza, using a wide range of datasets and features, and using both correlations and random forests to examine, primarily, what are the drivers of influenza epidemics. It's a strong paper representing a thorough and fascinating exploration of potential drivers, and it makes a trove of relevant data readily available to the community. 

      Strengths: 

      This paper makes links between epidemiological and evolutionary data for influenza. Placing each in the context of the other is crucial for understanding influenza dynamics and evolution and this paper does a thorough job of this, with many analyses and nuances. The results on the extent to which evolutionary factors relate to epidemic burden, and on interference among influenza types, are particularly interesting. The github repository associated with the paper is clear, comprehensive, and well-documented. 

      Weaknesses: 

      The format of the results section can be hard to follow, and we suggest improving readability by restructuring and simplifying in some areas. There are a range of choices made about data preparation and scaling; the authors could explore sensitivity of the results to some of these. 

      Response to public reviews

      We appreciate the positive comments from the reviewers and have implemented or responded to all of the reviewers’ recommendations.

      In response to Reviewer 1, we expand on the potential drivers and biological implications of the findings pointed out in their specific recommendations. For example, we now explicitly mention that antigenically distinct 3c.2a and 3c.3a viruses began to co-circulate in 2012 and underwent further diversification during subsequent seasons in our study. We note that, after the 2009 A(H1N1) pandemic, the mean fraction of influenza positive cases typed as A(H3N2) in A(H3N2) dominant seasons is lower compared to A(H3N2) dominant seasons prior to 2009. We propose that the weakening of A(H3N2) predominance may be linked to the diversification of A(H3N2) viruses during the 2010s, wherein multiple antigenically distinct clades with similar fitness circulated in each season, as opposed to a single variant with high fitness.

      In response to Reviewer 2, we agree that it would be ideal and best practice to measure model performance with an independent test set, but our dataset includes only ~20 seasons. Predictions of independent test sets of 2-3 seasons had unstable performance, which indicates we do not have sufficient power to measure model performance with a test set this small. In the revised manuscript, we provide more justification and clarification of our methodology. Instead of testing model performance on an independent test set, we use leave-one-season-out cross-validation to train models and measure model performance, wherein each “assessment” set contains one season of data (predicted by the model), and the corresponding “analysis” set (“fold”) contains the remaining seasons. This approach is roughly analogous to splitting data into training and test sets, but all seasons are used at some point in the training of the model (Kuhn & Johnson, 2019).

      In response to Reviewer 3, we follow the reviewer’s advice to put the Methods section before the Results section. Concerning Reviewer 3’s question about the sensitivity of our results to data preparation and rescaling, we provide more justification and clarification of our methodology in the revised manuscript. In our study, we adjust influenza type/subtype incidences for differences in reporting between the pre- and post-2009 pandemic periods and across HHS regions. We adjust for differences in reporting between the pre- and post-2009 periods because the US CDC and WHO increased laboratory testing capacity in response to the 2009 A(H1N1) pandemic, which led to substantial, long-lasting improvements to influenza surveillance that are still in place today. Figure 1 - figure supplement 2 shows systematic increases in influenza test volume in all HHS regions after the 2009 pandemic. Given the substantial increase in test volume after 2009, we opted to keep the time trend adjustment for the pre- and post-2009 pandemic periods and evaluate whether adjusting for regional reporting differences affects our results. When estimating univariate correlations between various A(H3N2) epidemic metrics and evolutionary indicators, we found qualitatively equivalent results when adjusting for both pre- and post-2009 pandemic reporting and regional reporting versus only adjusting for the pre- and post-2009 pandemic reporting.

      Reviewer #1 (Recommendations For The Authors): 

      Specific comments: 

      (1) Line 155-156. Request for a reference for: "Given that protective immunity wanes after 1-4 years" 

      We now include two references (He et al. 2015 and Wraith et al. 2022), which were cited at the beginning of the introduction when referring to the duration of protective immunity for antigenically homologous viruses. (Lines 640-642 in revised manuscript)

      (2) Line 162-163: Request a further explanation of the negative correlation between seasonal diversity of HA and NA LBI values and NA epitope distance. Clarify biological implications to aid reader understanding. 

      In the revised manuscript we expand on the biological implications of A(H3N2) virus populations characterized by high antigenic novelty and low LBI diversity.

      Lines 649-653:

      “The seasonal diversity of HA and NA LBI values was negatively correlated with NA epitope distance (Figure 2 – figure supplements 5 – 6), with high antigenic novelty coinciding with low genealogical diversity. This association suggests that selective sweeps tend to follow the emergence of drifted variants with high fitness, resulting in seasons dominated by a single A(H3N2) variant rather than multiple cocirculating clades.”

      (3) Figure S3 legend t-2 may be marked as t-1. 

      Thank you for catching this. We have fixed this typo. Note: Figure S3 is now Figure 2 – figure supplement 5.

      (4) Lines 201-214. The key takeaways from the analysis of subtype dominance are ultimately not clear. It also misses the underlying dynamics that H3N2 predominance following an evolutionary change has waned since 2009.

      In the revised manuscript we elaborate on key takeaways concerning the relationship between antigenic drift and A(H3N2) dominance. We also add a caveat noting that A(H3N2) predominance is weaker during the post-2009 period, which may be linked to the diversification of A(H3N2) lineages after 2012. We do not know of a reference that links the diversification of A(H3N2) viruses in the 2010s to a particular evolutionary change. Therefore, we do not attribute the diversification of A(H3N2) viruses to a specific evolutionary change in A(H3N2) variants circulating at the time (A/Perth/16/2009-like strains (PE09)). Instead, we allude to the potential role of A(H3N2) diversification in creating multiple co-circulating lineages that may have less of a fitness advantage.

      Lines 681-703:

      “We explored whether evolutionary changes in A(H3N2) may predispose this subtype to dominate influenza virus circulation in a given season. A(H3N2) subtype dominance – the proportion of influenza positive samples typed as A(H3N2) – increased with H3 epitope distance (t – 2) (R2 = 0.32, P = 0.05) and N2 epitope distance (t – 1) (R2 = 0.34, P = 0.03) (regression results: Figure 4; Spearman correlations: Figure 3 – figure supplement 1). Figure 4 illustrates this relationship at the regional level across two seasons in which A(H3N2) was nationally dominant, but where antigenic change differed. In 2003-2004, we observed widespread dominance of A(H3N2) viruses after the emergence of the novel antigenic cluster, FU02 (A/Fujian/411/2002-like strains). In contrast, there was substantial regional heterogeneity in subtype circulation during 2007-2008, a season in which A(H3N2) viruses were antigenically similar to those circulating in the previous season. Patterns in type/subtype circulation across all influenza seasons in our study period are shown in Figure 4 – figure supplement 1. As observed for the 2003-2004 season, widespread A(H3N2) dominance tended to coincide with major antigenic transitions (e.g., A/Sydney/5/1997 (SY97) seasons, 1997-1998 to 1999-2000; A/California/7/2004 (CA04) season, 2004-2005), though this was not universally the case (e.g., A/Perth/16/2009 (PE09) season, 2010-2011).

      After the 2009 A(H1N1) pandemic, A(H3N2) dominant seasons still occurred more frequently than A(H1N1) dominant seasons, but the mean fraction of influenza positive cases typed as A(H3N2) in A(H3N2) dominant seasons was lower compared to A(H3N2) dominant seasons prior to 2009. Antigenically distinct 3c.2a and 3c.3a viruses began to co-circulate in 2012 and underwent further diversification during subsequent seasons in our study (https://nextstrain.org/seasonal-flu/h3n2/ha/12y@2024-05-13) (Dhanasekaran et al., 2022; Huddleston et al., 2020; Yan et al., 2019). The decline in A(H3N2) predominance during the post-2009 period may be linked to the genetic and antigenic diversification of A(H3N2) viruses, wherein multiple lineages with similar fitness co-circulated in each season.”

      (5) Line 253-255: It would be beneficial to provide a more detailed interpretation of the statement that "pre-2009 seasonal A(H1N1) viruses may limit the circulation of A(H3N2) viruses to a greater extent than A(H1N1)pdm09 viruses." Elaborate on the cause-and-effect relationship within this statement.

      In the revised manuscript we suggest that seasonal A(H1N1) viruses may interfere with the circulation of A(H3N2) viruses to a greater extent than A(H1N1)pdm09 viruses, because seasonal A(H1N1) viruses and A(H3N2) are more closely related, and thus may elicit stronger cross-reactive T cell responses.

      Lines 738-745:

      “The internal gene segments NS, M, NP, PA, and PB2 of A(H3N2) viruses and pre-2009 seasonal A(H1N1) viruses share a common ancestor (Webster et al., 1992) whereas A(H1N1)pdm09 viruses have a combination of gene segments derived from swine and avian reservoirs that were not reported prior to the 2009 pandemic (Garten et al., 2009; Smith et al., 2009). Non-glycoprotein genes are highly conserved between influenza A viruses and elicit cross-reactive antibody and T cell responses (Grebe et al., 2008; Sridhar, 2016). Because pre-2009 seasonal A(H1N1) viruses and A(H3N2) are more closely related, we hypothesized that seasonal A(H1N1) viruses could potentially limit the circulation of A(H3N2) viruses to a greater extent than A(H1N1)pdm09 viruses, due to greater T cell-mediated cross-protective immunity.”

      (6) In the results section, many statements report statistical results of correlation analyses. Consider providing further interpretations of these results, such as the implications of nonsignificant correlations and how they support or contradict the hypothesis or previous studies. For example, the statement on line 248 regarding the lack of significant correlation between influenza B epidemic size and A(H3N2) epidemic metrics would benefit from additional discussion on what this non-significant correlation signifies and how it relates to the hypothesis or previous research. 

      In the Discussion section, we suggest that the lack of an association between influenza B circulation and A(H3N2) epidemic metrics is due to few T and B cell epitopes shared between influenza A and B viruses (Terajima et al., 2013).

      Lines 1005-1007 in revised manuscript (Lines 513-515 in original manuscript): 

      “Overall, we did not find any indication that influenza B incidence affects A(H3N2) epidemic burden or timing, which is not unexpected, given that few T and B cell epitopes are shared between the two virus types (Terajima et al., 2013).”

      Minor comments: 

      (1) Line 116-122: Include a summary statistical description of all collected data sets, detailing the number of HA and NA sequence data and their sources. Briefly describe subsampled data sets, specifying preferences (e.g., the number of HA or NA sequence data collected from each region). 

      In our revised manuscript we now include supplementary tables that summarize the number of A/H3 and A/N2 sequences in each subsampled dataset, aggregated by world region, for all seasons combined (Figure 2 - table supplements 1 - 2). We also include supplementary figures showing the number of sequences collected in each month and each season in North America versus the other nine world regions combined (Figure 2 - figure supplements 1 - 2). Subsampled datasets are plotted individually in the figures below but individual time series are difficult to discern due to minor differences in sequence counts across the datasets.

      (2) Figure 7A: Due to space limitations, consider rounding numbers on the x-axis to whole numbers for clarity. 

      Thank you for this suggestion. In the revised manuscript we round numbers in the axes of Figure 7A (Figure 9A in the revised manuscript) so that the axes are less crowded.

      (3) Figure 4C & Figure 4D: Note that Region 10 (purple) data were unavailable for seasons before 2009 (lines 1483-1484). Label each region on the map with its respective region number (1 to 10) and indicate this in the legend for easy identification. 

      In our original submission, the legend for Figure 4 included “Data for Region 10 (purple) were not available for seasons prior to 2009” at the end of the caption. We have moved this sentence, as well as other descriptions that apply to both C and D, so that they follow the sentence “C-D. Regional patterns of influenza type and subtype incidence during two seasons when A(H3N2) was nationally dominant.”

      In our revised manuscript, Figure 4, and Figure 4 - figure supplement 1 (Figure S10 in original submission) include labels for each HHS region.

      We did not receive specific recommendations from Reviewer #2. However, our responses to Reviewer #3 address the weaknesses of the study mentioned by Reviewer #2.

      Reviewer #3 (Recommendations For The Authors): 

      This paper explores the relationships among evolutionary and epidemiological quantities in influenza, using a wide range of datasets and features, and using both correlations and random forests to examine, primarily, what are the drivers of influenza epidemics. 

      This is a workhorse of a paper, in the volumes of data that are analyzed and the extensive analysis that is done. The data that are provided are a treasure trove resource for influenza modelers and for anyone interested in seeing influenza surveillance data in the context of evolution, and evolutionary information in the context of epidemiology. 

      L53 - end of sentence "and antigenic drift": not sure this fits, explain? I thought this sentence was in contrast to antigenic drift.

      Thank you for catching this. We did not intend to include “and antigenic drift” at the end of this sentence and have removed it (Line 59).

      Para around L115: would using primarily US data be a limitation, because it's global immunity that shapes success of strains? Or, how much does each country's immunity and vaccination and so on actually shape what strains succeed there, compared to global/international factors? 

      The HA and NA phylogenetic trees in our study are enriched with US sequences because our study focuses on epidemiological dynamics in the US, and we wanted to prioritize A(H3N2) viruses that the US human population encountered in each season. We agree with the reviewer that the world population may be the right scale to understand how immunity, acquired by vaccination or natural infection, may shape the emergence and success of new lineages that will go on to circulate globally. However, our study assesses the overall impact of antigenic drift on regional A(H3N2) epidemic dynamics in the US. In other words, our driving question is whether we can predict the population-level impact of an A(H3N2) variant in the US, conditional on this particular lineage having established in the US and circulating at relatively high levels. We do not assess the global or population-level factors that may influence which A(H3N2) virus lineages are successful in a given location or season.

      We have added a clarifying sentence to the end of the Introduction to narrow the scope of the paper for the reader. 

      Line 114-116: “Rather than characterize in situ evolution of A(H3N2) lineages circulating in the U.S., we study the epidemiological impacts of antigenic drift once A(H3N2) variants have arrived on U.S. soil and managed to establish and circulate at relatively high levels.”

      In the Results section, I found the format hard to follow, because of the extensive methodological details, numbers with CIs and long sentences. Sentences sometimes included the question, definitions of variables, and lists. For example at line 215 we have: "Next, we tested for associations between A(H3N2) evolution and epidemic timing, including onset week, defined as the winter changepoint in incidence [16], and peak week, defined as the first week of maximum incidence; spatiotemporal synchrony, measured as the variation (standard deviation, s.d.) in regional onset and peak timing; and epidemic speed, including seasonal duration and the number of weeks from onset to peak (Table 2, Figure S11)". I would suggest putting the methods section first, using shorter sentences, separating lists from the question being asked, and stating what was found without also putting in all the extra detail. Putting the methods section before the results might reduce the sense that you have to explain what you did and how in the results section too.

      Thank you for suggesting how to improve the readability of the Results section. In the revised manuscript, we follow the reviewer’s advice to put the Methods section before the Results section. Although eLife formatting requirements specify the order: Introduction, Results, Discussion, and Methods, the journal allows for the Methods section to follow the Introduction when it makes sense to do so. We agree with the reviewer that putting the Methods section before the Results section makes our results easier to follow because we no longer need to introduce methodological details at the beginning of each set of results.

      L285 in the RF you remove variables without significant correlations with the target variables, but isn't one of the aims of RF to uncover relationships where a correlation might not be evident, and in part to reveal combinations of features that give the targeted outcome? Also with the RF, I am a bit concerned that you could not use the leave-one-out approach because it was "unstable" - presumably that means that you obtain quite different results if you leave out a season. How robust are these results, and what are the most sensitive aspects? Are the same variables typically high in importance if you leave out a season, for example? What does the scatterplot of observed vs predicted epidemic size (as in Fig 7) look like if each prediction is for the one that was left out (i.e. from a model trained on all the rest)? In my experience, where the RF is "unstable", that can look pretty terrible even if the model trained on all the data looks great (as does Figure 7). In any case I think it's worth discussing sensitivity.

      (1) In response to the reviewer’s first question, we explain our rationale for not including all candidate predictors in random forest and penalized regression models. 

      Models trained with different combinations of predictors can have similar performance, and these combinations of predictors can include variables that do not necessarily have strong univariate associations with the target variable. The performance of random forest and LASSO regression models is not sensitive to redundant or irrelevant predictors (see Figure 10.2 in Kuhn & Johnson, 2019). However, if our goal is variable selection rather than strictly model performance, it is considered best practice to remove collinear, redundant, and/or irrelevant variables prior to training models (see section 11.3 in Kuhn & Johnson, 2019). In both random forest and LASSO regression models, if there are highly collinear variables that are useful for predicting the target variable, the predictor chosen by the model becomes a random selection. In random forest models, these highly collinear variables will be used in all splits across the forest of decision trees, and this redundancy dilutes variable importance scores. Thus, failing to minimize multicollinearity prior to model training could result in some variables having low rankings and the appearance of being unimportant, because their importance scores are overshadowed by those of the highly correlated variables. Our rationale for preprocessing predictor data follows the philosophy of Kuhn & Johnson, 2019, who recommend including the minimum possible set of variables that does not compromise model performance. Even if a particular model is insensitive to extra predictors, Kuhn and Johnson explain that “removing predictors can reduce the cost of acquiring data or improve the throughput of the software used to make predictions.”

      In the revised manuscript, we include more details about our steps for preprocessing predictor data. We also follow the reviewer’s suggestion to include all evolutionary predictors in variable selection analyses, regardless of whether they have strong univariate correlations with target outcomes, because the performance of random forest and LASSO regression models is not affected by redundant predictors. 

      Including additional predictors in our variable selection analyses does not change our conclusions. As reported in our original manuscript, predictors with strong univariate correlations with various epidemic metrics were the highest ranked features in both random forest and LASSO regression models.

      Lines 523-563:

      “Preprocessing of predictor data: The starting set of candidate predictors included all viral fitness metrics: genetic and antigenic distances between current and previously circulating strains and the standard deviation and Shannon diversity of H3 and N2 LBI values in the current season. To account for potential type or subtype interference, we included A(H1N1) or A(H1N1)pdm09 epidemic size and B epidemic size in the current and prior season and the dominant IAV subtype in the prior season (Lee et al., 2018). We included A(H3N2) epidemic size in the prior season as a proxy for prior natural immunity to A(H3N2). To account for vaccine-induced immunity, we considered four categories of predictors and included estimates for the current and prior seasons: national vaccination coverage among adults (18-49 years coverage × ≥ 65 years coverage), adjusted A(H3N2) vaccine effectiveness (VE), a combined metric of vaccination coverage and A(H3N2) VE (18-49 years coverage × ≥ 65 years coverage × VE), and H3 and N2 epitope distances between naturally circulating A(H3N2) viruses and the U.S. A(H3N2) vaccine strain in each season. We could not include a predictor for vaccination coverage in children or consider clade-specific VE estimates, because these data were not available for most seasons in our study.

      Random forest and LASSO regression models are not sensitive to redundant (highly collinear) features (Kuhn & Johnson, 2019), but we chose to downsize the original set of candidate predictors to minimize the impact of multicollinearity on variable importance scores. For both types of models, if there are highly collinear variables that are useful for predicting the target variable, the predictor chosen by the model becomes a random selection (Kuhn & Johnson, 2019). In random forest models, these highly collinear variables will be used in all splits across the forest of decision trees, and this redundancy dilutes variable importance scores (Kuhn & Johnson, 2019). We first confirmed that none of the candidate predictors had zero variance or near-zero variance. Because seasonal lags of each viral fitness metric are highly collinear, we included only one lag of each evolutionary predictor, with a preference for the lag that had the strongest univariate correlations with various epidemic metrics. We checked for multicollinearity among the remaining predictors by examining Spearman’s rank correlation coefficients between all pairs of predictors. If a particular pair of predictors was highly correlated (Spearman’s 𝜌 > 0.8), we retained only one predictor from that pair, with a preference for the predictor that had the strongest univariate correlations with various epidemic metrics. Lastly, we performed QR decomposition of the matrix of remaining predictors to determine if the matrix is full rank and identify sets of columns involved in linear dependencies. This step did not eliminate any additional predictors, given that we had already removed pairs of highly collinear variables based on Spearman correlation coefficients. 

      After these preprocessing steps, our final set of model predictors included 21 variables, including 8 viral evolutionary indicators: H3 epitope distance (t – 2), HI log2 titer distance (t – 2), H3 RBS distance (t – 2), H3 non-epitope distance (t – 2), N2 epitope distance (t – 1), N2 non-epitope distance (t – 1), and H3 and N2 LBI diversity (s.d.) in the current season; 6 proxies for type/subtype interference and prior immunity: A(H1N1) and B epidemic sizes in the current and prior season, A(H3N2) epidemic size in the prior season, and the dominant IAV subtype in the prior season; and 7 proxies for vaccine-induced immunity: A(H3N2) VE in the current and prior season, H3 and N2 epitope distances between circulating strains and the vaccine strain in each season, the combined metric of adult vaccination coverage × VE in the current and prior season, and adult vaccination coverage in the prior season.”
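      The pairwise collinearity filter described in the quoted passage can be illustrated with a short sketch. This is not the authors' code: the function names, the toy predictor names and values, the use of a single target variable (the manuscript refers to several epidemic metrics), and the use of the absolute value of Spearman's ρ are all assumptions made for illustration. Like the workflow it mirrors, it assumes no predictor has zero variance.

```python
from itertools import combinations


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def drop_collinear(predictors, target, threshold=0.8):
    """For each pair with |rho| > threshold, keep the predictor that is
    more strongly correlated with the target and drop the other."""
    strength = {name: abs(spearman(col, target)) for name, col in predictors.items()}
    dropped = set()
    for a, b in combinations(predictors, 2):
        if a in dropped or b in dropped:
            continue
        if abs(spearman(predictors[a], predictors[b])) > threshold:
            dropped.add(a if strength[a] < strength[b] else b)
    return [name for name in predictors if name not in dropped]


# Made-up values for three hypothetical predictors and one outcome.
toy = {
    "h3_epitope_t2": [1, 2, 3, 4, 5, 6],
    "h3_rbs_t2": [1, 2, 3, 5, 4, 6],  # nearly the same ordering as h3_epitope_t2
    "h1n1_size": [6, 1, 5, 2, 4, 3],
}
epidemic_size = [1, 2, 3, 4, 5, 6]
kept = drop_collinear(toy, epidemic_size)
print(kept)
```

      In this toy example the two H3 predictors are highly rank-correlated, so only the one with the stronger association with the outcome is retained, while the weakly correlated third predictor is kept.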

      (2) Next, we clarify our model training methodology to address the reviewer’s second point about using a leave-one-out cross-validation approach.

      We believe the reviewer is mistaken; we use a leave-one-season-out validation approach which lends some robustness to the predictions. In our original submission, we stated “We created each forest by generating 3,000 regression trees from 10 repeats of a leave-one-season-out (jackknife) cross-validated sample of the data. Due to the small size of our dataset, evaluating the predictive accuracy of random forest models on a quasi-independent test set produced unstable estimates.” (Lines 813-816 in the original manuscript)

      To clarify, we use leave-one-season-out cross-validation to train models and measure model performance, wherein each “assessment” set contains one season of data (predicted by the model), and the corresponding “analysis” set (“fold”) contains the remaining seasons. This approach is roughly analogous to splitting data into training and test sets, but all seasons are used at some point in the training of the model (see Section 3.4 in Kuhn & Johnson, 2019). To reduce noise, we generated 10 bootstrap resamples of each fold and averaged the RMSE and R2 values of model predictions from resamples. 
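      As an illustration of the splitting scheme only, the sketch below builds leave-one-season-out folds in which all regional observations from the held-out season form the assessment set. It is not the authors' pipeline: the record layout, the numbers, and the placeholder model that predicts the analysis-set mean are assumptions, and the bootstrap resampling of each fold is omitted.

```python
def leave_one_season_out(records):
    """Yield (held_out_season, analysis_set, assessment_set) per season.

    `records` is a list of dicts with a "season" key. All regions from the
    held-out season are placed in the assessment set together.
    """
    seasons = sorted({r["season"] for r in records})
    for held_out in seasons:
        analysis = [r for r in records if r["season"] != held_out]
        assessment = [r for r in records if r["season"] == held_out]
        yield held_out, analysis, assessment


def mean_model(analysis):
    """Placeholder model: always predicts the mean outcome of the analysis set."""
    mean = sum(r["y"] for r in analysis) / len(analysis)
    return lambda record: mean


def rmse(pairs):
    """Root mean squared error over (observed, predicted) pairs."""
    return (sum((obs - pred) ** 2 for obs, pred in pairs) / len(pairs)) ** 0.5


# Made-up outcome values for two regions in three seasons.
records = [
    {"season": "2003-2004", "region": 1, "y": 4.0},
    {"season": "2003-2004", "region": 2, "y": 6.0},
    {"season": "2007-2008", "region": 1, "y": 2.0},
    {"season": "2007-2008", "region": 2, "y": 2.0},
    {"season": "2010-2011", "region": 1, "y": 3.0},
    {"season": "2010-2011", "region": 2, "y": 1.0},
]

fold_rmse = {}
for season, analysis, assessment in leave_one_season_out(records):
    predict = mean_model(analysis)  # fit on the remaining seasons only
    fold_rmse[season] = rmse([(r["y"], predict(r)) for r in assessment])
print(fold_rmse)
```

      Every season is held out exactly once, so each observation contributes to model fitting in all folds except its own.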

      Although it would be ideal and best practice to measure model performance with an independent test set, our dataset includes only ~20 seasons. We found that predictions of independent test sets of 2-3 seasons had unstable performance, which indicates we do not have sufficient power to measure model performance with a test set this small. Further, we suspect that large antigenic jumps in a small subset of seasons further contribute to variation in prediction accuracy across randomly selected test sets. Our rationale for using cross-validation instead of an independent test set is best described in Section 4.3 of Kuhn and Johnson’s book “Applied Predictive Modeling” (Kuhn & Johnson, 2013):

      “When the number of samples is not large, a strong case can be made that a test set should be avoided because every sample may be needed for model building. Additionally, the size of the test set may not have sufficient power or precision to make reasonable judgements. Several researchers (Molinaro 2005; Martin and Hirschberg 1996; Hawkins et al. 2003) show that validation using a single test set can be a poor choice. Hawkins et al. (2003) concisely summarize this point: “holdout samples of tolerable size [...] do not match the cross-validation itself for reliability in assessing model fit and are hard to motivate.” Resampling methods, such as cross-validation, can be used to produce appropriate estimates of model performance using the training set. These are discussed in length in Sect. 4.4. Although resampling techniques can be misapplied, such as the example shown in Ambroise and McLachlan (2002), they often produce performance estimates superior to a single test set because they evaluate many alternate versions of the data.”

      In our revised manuscript, we provide additional clarification of our methods (Lines 574-590):

      “We created each forest by generating 3,000 regression trees. To determine the best performing model for each epidemic metric, we used leave-one-season-out (jackknife) cross-validation to train models and measure model performance, wherein each “assessment” set is one season of data predicted by the model, and the corresponding “analysis” set contains the remaining seasons. This approach is roughly analogous to splitting data into training and test sets, but all seasons are used at some point in the training of each model (Kuhn & Johnson, 2019). Due to the small size of our dataset (~20 seasons), evaluating the predictive accuracy of random forest models on a quasi-independent test set of 2-3 seasons produced unstable estimates. Instead of testing model performance on an independent test set, we generated 10 bootstrap resamples (“repeats”) of each analysis set (“fold”) and averaged the predictions of models trained on resamples (Kuhn & Johnson, 2013, 2019). For each epidemic metric, we report the mean root mean squared error (RMSE) and R2 of predictions from the best tuned model. We used permutation importance (N = 50 permutations) to estimate the relative importance of each predictor in determining target outcomes. Permutation importance is the decrease in prediction accuracy when a single feature (predictor) is randomly permuted, with larger values indicating more important variables. Because many features were collinear, we used conditional permutation importance to compute feature importance scores, rather than the standard marginal procedure (Altmann et al., 2010; Debeer & Strobl, 2020; Strobl et al., 2008; Strobl et al., 2007).”
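      The definition of permutation importance given in the quoted passage can be made concrete with a minimal sketch. Note that it implements the standard marginal procedure, not the conditional variant the authors report using, and that the model, feature names, and data are hypothetical stand-ins rather than anything from the manuscript.

```python
import random


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return (sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)) ** 0.5


def permutation_importance(predict, rows, y, n_permutations=50, seed=0):
    """Mean increase in RMSE when a single feature column is shuffled.

    Marginal permutation importance: each feature is permuted on its own,
    ignoring its correlation with the other features.
    """
    rng = random.Random(seed)
    baseline = rmse(y, [predict(r) for r in rows])
    importance = {}
    for feature in rows[0]:
        increases = []
        for _ in range(n_permutations):
            shuffled = [r[feature] for r in rows]
            rng.shuffle(shuffled)
            permuted_rows = [dict(r, **{feature: v}) for r, v in zip(rows, shuffled)]
            increases.append(rmse(y, [predict(r) for r in permuted_rows]) - baseline)
        importance[feature] = sum(increases) / n_permutations
    return importance


# Hypothetical fitted model: the outcome depends only on `epitope_distance`.
def toy_predict(row):
    return 2.0 * row["epitope_distance"]


rows = [{"epitope_distance": float(i), "noise": float((7 * i) % 5)} for i in range(10)]
y = [toy_predict(r) for r in rows]
scores = permutation_importance(toy_predict, rows, y)
print(scores)
```

      Shuffling the feature the model ignores leaves the error unchanged, while shuffling the feature it relies on increases the error, which is what a larger importance score reflects.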

      (3) In response to the reviewer’s question about the sensitivity of results when one season is left out, we clarify that the variable importance scores in Figure 8 and model predictions in Figure 9 were generated by models tuned using leave-one-season-out cross-validation. 

      As explained above, in our leave-one-season-out cross-validation approach, each “assessment” set contains one season of data predicted by the model, and the corresponding “analysis” set (“fold”) contains the remaining seasons. We generated predictions of epidemic metrics and variable importance rankings by averaging the model output of 10 bootstrap resamples of each cross-validation fold. 
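
      The fold structure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: `fit_mean` is a hypothetical stand-in for a random forest, the toy seasons and outcomes are invented, and the held-out errors are pooled across folds before taking the RMSE.

```python
import math
import random

def leave_one_season_out(seasons):
    """Yield (held_out_season, analysis_indices, assessment_indices) per fold."""
    for held_out in sorted(set(seasons)):
        analysis = [i for i, s in enumerate(seasons) if s != held_out]
        assessment = [i for i, s in enumerate(seasons) if s == held_out]
        yield held_out, analysis, assessment

def fit_mean(values):
    """Hypothetical stand-in for a random forest: predicts the training mean."""
    return sum(values) / len(values)

def loso_rmse(seasons, y, n_resamples=10, seed=1):
    """RMSE of held-out predictions, averaging models fit to bootstrap resamples."""
    rng = random.Random(seed)
    squared_errors = []
    for _, analysis, assessment in leave_one_season_out(seasons):
        # Fit one model per bootstrap resample of the analysis set, then average.
        fold_predictions = []
        for _ in range(n_resamples):
            resample = [y[rng.choice(analysis)] for _ in analysis]
            fold_predictions.append(fit_mean(resample))
        prediction = sum(fold_predictions) / n_resamples
        squared_errors.extend((y[i] - prediction) ** 2 for i in assessment)
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Invented toy data: two regional observations per season.
seasons_toy = ["1997-1998", "1997-1998", "1998-1999", "1998-1999", "1999-2000", "1999-2000"]
y_toy = [1.0, 1.2, 3.0, 3.1, 2.0, 2.2]
folds = list(leave_one_season_out(seasons_toy))
error = loso_rmse(seasons_toy, y_toy)
```

      Each season appears in exactly one assessment set and in the analysis set of every other fold, which is why all seasons contribute to training at some point.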

      In Lines 791-806, we describe which epidemic metrics have the highest prediction accuracy and report that random forest models tend to underpredict most epidemic metrics in seasons with high antigenic novelty:

      “We measured correlations between observed values and model-predicted values at the HHS region level. Among the various epidemic metrics, random forest models produced the most accurate predictions of A(H3N2) subtype dominance (Spearman’s 𝜌 = 0.95, regional range = 0.85 – 0.97), peak incidence (𝜌 = 0.91, regional range = 0.72 – 0.95), and epidemic size (𝜌 = 0.9, regional range = 0.74 – 0.95), while predictions of effective 𝑅𝑡 and epidemic intensity were less accurate (𝜌 = 0.81, regional range = 0.65 – 0.91; 𝜌 = 0.78, regional range = 0.63 – 0.92, respectively) (Figure 9). Random forest models tended to underpredict most epidemic targets in seasons with substantial H3 antigenic transitions, in particular the SY97 cluster seasons (1998-1999, 1999-2000) and the FU02 cluster season (2003-2004) (Figure 9). 

      For epidemic size and peak incidence, seasonal predictive error – the root-mean-square error (RMSE) across all regional predictions in a season – increased with H3 epitope distance (epidemic size, Spearman’s 𝜌 = 0.51, P = 0.02; peak incidence, 𝜌 = 0.63, P = 0.004) and N2 epitope distance (epidemic size, 𝜌 = 0.48, P = 0.04; peak incidence, 𝜌 = 0.48, P = 0.03) (Figure 9 – figure supplements 1 – 2). For models of epidemic intensity, seasonal RMSE increased with N2 epitope distance (𝜌 = 0.64, P = 0.004) but not H3 epitope distance (𝜌 = 0.06, P = 0.8) (Figure 9 – figure supplements 1 – 2). Seasonal RMSE of effective 𝑅𝑡 and subtype dominance predictions did not correlate with H3 or N2 epitope distance (Figure 9 – figure supplements 1 – 2).”
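
      The two quantities in this passage, seasonal RMSE across regional predictions and its Spearman correlation with antigenic distance, can be sketched as follows. All numbers are invented for illustration, the dictionary names are hypothetical, and the rank correlation is written out by hand so that the example has no dependencies.

```python
import math

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def seasonal_rmse(observed, predicted):
    """RMSE across all regional predictions within one season."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))

# Invented regional values for three hypothetical seasons.
observed = {"s1": [10.0, 12.0], "s2": [20.0, 18.0], "s3": [30.0, 33.0]}
predicted = {"s1": [10.5, 11.5], "s2": [18.0, 16.0], "s3": [25.0, 27.0]}
epitope_distance = {"s1": 0.1, "s2": 0.4, "s3": 0.9}

season_keys = sorted(observed)
errors = [seasonal_rmse(observed[s], predicted[s]) for s in season_keys]
rho = spearman([epitope_distance[s] for s in season_keys], errors)
```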

      I think the competition (interference) results are really interesting, perhaps among the most interesting aspects of this work. 

      Thank you! We agree that our finding that subtype interference has a greater impact than viral evolution on A(H3N2) epidemics is one of the more interesting results in the study.

      Have you seen the paper by Barrat-Charlaix et al? They found that LBI was not good predicting frequency dynamics (see https://pubmed.ncbi.nlm.nih.gov/33749787/); instead, LBI was high for sequences like the consensus sequence, which was near to future strains. LBI also was not positively correlated with epidemic impact in Figure S7.

      The local branching index (LBI) measures the rate of recent phylogenetic branching and approximates relative fitness among viral clades, with high LBI values representing greater fitness (Neher et al. 2014).

      Two of this study’s co-authors (John Huddleston and Trevor Bedford) are also co-authors of Barrat-Charlaix et al. 2021. Barrat-Charlaix et al. 2021 assessed the performance of LBI in predicting the frequency dynamics and fixation of individual amino acid substitutions in A(H3N2) viruses. Our study is not focused on predicting the future success of A(H3N2) clades or the frequency dynamics or probability of fixation of individual substitutions. Instead, we use the standard deviation and Shannon diversity of LBI values in each season as a proxy for genealogical (clade-level) diversity. We find that, at a seasonal level, low diversity of H3 or N2 LBI values in the current season correlates with greater epidemic intensity, higher transmission rates, and shorter seasonal duration.

      In the Discussion we provide an explanation for these correlation results (Lines 848-857): 

      “The local branching index (LBI) is traditionally used to predict the success of individual clades, with high LBI values indicating high viral fitness (Huddleston et al., 2020; Neher et al., 2014). In our epidemiological analysis, low diversity of H3 or N2 LBI in the current season correlated with greater epidemic intensity, higher transmission rates, and shorter seasonal duration. These associations suggest that low LBI diversity is indicative of a rapid selective sweep by one successful clade, while high LBI diversity is indicative of multiple co-circulating clades with variable seeding and establishment times over the course of an epidemic. A caveat is that LBI estimation is more sensitive to sequence sub-sampling schemes than strain-level measures. If an epidemic is short and intense (e.g., 1-2 months), a phylogenetic tree with our sub-sampling scheme (50 sequences per month) may not incorporate enough sequences to capture the true diversity of LBI values in that season.”

      Figure 1 - LBI goes up over time. Is that partly to do with sampling? Overall how do higher sampling volumes in later years impact this analysis? (though you choose a fixed number of sequences so I guess you downsample to cope with that). I note that LBI is likely to be sensitive to sequencing density. 

      Thank you for pointing this out. We realized that increasing LBI Shannon diversity over the course of the study period was indeed an artefact of increasing sequence volume over time. Our sequence subsampling scheme involves selecting a random sample of up to 50 viruses per month, with up to 25 viruses selected from North America (if available) and the remaining sequences evenly divided across nine other global regions. In early seasons of the study (late 1990s/early 2000s), sampling was often too sparse to meet the 25 viruses/month threshold for North America or for the other global regions combined (H3: Figure 2 - figure supplement 1; N2: Figure 2 - figure supplement 2). Ecological diversity metrics are sensitive to sample size, which explains why LBI Shannon diversity appeared to steadily increase over time in our original submission. In our revised manuscript, we correct for uneven sample sizes across seasons before estimating Shannon diversity and clarify our methodology. 

      Lines 443-482: 

      “Clade growth: The local branching index (LBI) measures the relative fitness of co-circulating clades, with high LBI values indicating recent rapid phylogenetic branching (Huddleston et al., 2020; Neher et al., 2014). To calculate LBI for each H3 and N2 sequence, we applied the LBI heuristic algorithm as originally described by Neher et al., 2014 to H3 and N2 phylogenetic trees, respectively. We set the neighborhood parameter 𝜏 to 0.4 and only considered viruses sampled between the current season 𝑡 and the previous season 𝑡 – 1 as contributing to recent clade growth in the current season 𝑡.  

      Variation in the phylogenetic branching rates of co-circulating A(H3N2) clades may affect the magnitude, intensity, onset, or duration of seasonal epidemics. For example, we expected that seasons dominated by a single variant with high fitness might have different epidemiological dynamics than seasons with multiple co-circulating clades with varying seeding and establishment times. We measured the diversity of clade growth rates of viruses circulating in each season by measuring the standard deviation (s.d.) and Shannon diversity of LBI values in each season. Given that LBI measures relative fitness among co-circulating clades, we did not compare overall clade growth rates (e.g., mean LBI) across seasons.

      Each season’s distribution of LBI values is right-skewed and does not follow a normal distribution. We therefore bootstrapped the LBI values of each season in each replicate dataset 1000 times (1000 samples with replacement) and estimated the seasonal standard deviation of LBI from resamples, rather than directly from observed LBI values. We also tested the seasonal standard deviation of LBI from log transformed LBI values, which produced qualitatively equivalent results to bootstrapped LBI values in downstream analyses.

      As an alternative measure of seasonal LBI diversity, we binned raw H3 and N2 LBI values into categories based on their integer values (e.g., an LBI value of 0.5 is assigned to the (0,1] bin) and estimated the exponential of the Shannon entropy (Shannon diversity) of LBI categories (Hill, 1973; Shannon, 1948). The Shannon diversity of LBI considers both the richness and relative abundance of viral clades with different growth rates in each season and is calculated as follows:

      $$ {}^{1}D = \exp\left( -\sum_{i=1}^{R} p_i \ln p_i \right) $$

      where $^{q}D$ is the effective number of categories or Hill numbers of order $q$ (here, clades with different growth rates), with $q$ defining the sensitivity of the true diversity to rare versus abundant categories (Hill, 1973). exp is the exponential function, $p_i$ is the proportion of LBI values belonging to the $i$th category, and $R$ is richness (the total number of categories). Shannon diversity $^{1}D$ ($q = 1$) estimates the effective number of categories in an assemblage using the geometric mean of their proportional abundances $p_i$ (Hill, 1973).

      Because ecological diversity metrics are sensitive to sampling effort, we rarefied H3 and N2 sequence datasets prior to estimating Shannon diversity so that seasons had the same sample size. For each season in each replicate dataset, we constructed rarefaction and extrapolation curves of LBI Shannon diversity and extracted the Shannon diversity estimate of the sample size that was twice the size of the reference sample size (the smallest number of sequences obtained in any season during the study) (iNEXT R package) (Chao et al., 2014). Chao et al. found that their diversity estimators work well for rarefaction and short-range extrapolation when the extrapolated sample size is up to twice the reference sample size. For H3, we estimated seasonal diversity using replicate datasets subsampled to 360 sequences/season; for N2, datasets were subsampled to 230 sequences/season.”
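
      The two seasonal measures of LBI diversity described in the quoted methods can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' R code: binning uses the ceiling so that 0.5 falls in the (0,1] bin, the bootstrap estimate is summarized as the mean of the resampled standard deviations (the summary statistic is an assumption), and the rarefaction step performed with iNEXT is omitted.

```python
import math
import random
import statistics
from collections import Counter

def shannon_diversity(lbi_values):
    """Exponential of the Shannon entropy of LBI values binned into (k-1, k] intervals."""
    # math.ceil maps 0.5 to bin 1, i.e. the (0, 1] interval; a value of exactly 0 gets bin 0.
    bins = Counter(math.ceil(v) for v in lbi_values)
    n = len(lbi_values)
    entropy = -sum((c / n) * math.log(c / n) for c in bins.values())
    return math.exp(entropy)

def bootstrap_sd(lbi_values, n_boot=1000, seed=1):
    """Mean of the standard deviations of n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    sds = []
    for _ in range(n_boot):
        resample = rng.choices(lbi_values, k=len(lbi_values))
        sds.append(statistics.stdev(resample))
    return sum(sds) / n_boot

# Invented LBI values for two hypothetical seasons.
single_clade_season = [0.5, 0.7, 0.9, 0.6]        # all values fall in the (0, 1] bin
multi_clade_season = [0.5, 1.5, 2.5, 3.5]         # four equally abundant bins
low = shannon_diversity(single_clade_season)
high = shannon_diversity(multi_clade_season)
```

      A season whose values all fall in one bin has an effective number of categories of 1, while four equally abundant bins give an effective number of 4, which matches the interpretation of Shannon diversity as an effective count.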

      Estimating the Shannon diversity of LBI from datasets with even sampling across seasons removes the previous secular trend of increasing LBI diversity over time (Figure 2 in revised manuscript).

      Figure 3 - I wondered what about the co-dominant times? 

      In Figure 3, orange points correspond to seasons in which A(H3N2) and A(H1N1) were codominant. We are not sure of the reviewer’s specific question concerning codominant seasons, but if it concerns whether antigenic drift is linked to epidemic magnitude among codominant seasons alone, we cannot perform separate regression analyses for these seasons because there are only two codominant seasons during the 22 season study period.

      Figure 4 - Related to drift and epidemic size, dominance, etc. -- when is drift measured, and (if it's measured in season t), would larger populations create more drift, simply by having access to more opportunity (via a larger viral population size)? This is a bit 'devil's advocate' but what if some epidemiological/behavioural process causes a larger and/or later peak, and those gave rise to higher drift?

      Seasonal drift is measured as the genetic or antigenic distance between viruses circulating during season t and viruses circulating in the prior season (𝑡 – 1) or two seasons ago (𝑡 – 2).

      Concerning the question about whether larger human populations lead to greater rates of antigenic drift, phylogeographic studies have repeatedly found that East-South-Southeast Asia are the source populations for A(H3N2) viruses (Bedford et al., 2015; Lemey et al., 2014), in part because these regions have tropical or subtropical climates and larger human populations, which enable year-round circulation and higher background infection rates. Larger viral populations (via larger host population sizes) and uninterrupted transmission may increase the efficiency of selection and the probability of strain survival and global spread (Wen et al., 2016). After A(H3N2) variants emerge in East-South-Southeast Asia and spread to other parts of the world, A(H3N2) viruses circulate via overlapping epidemics rather than local persistence (Bedford et al., 2015; Rambaut et al., 2008). Each season, A(H3N2) outbreaks in the US (and other temperate regions) are seeded by case importations from outside the US, genetic diversity peaks during the winter, and a strong genetic bottleneck typically occurs at the end of the season (Rambaut et al., 2008).

      Due to their faster rates of antigenic evolution, A(H3N2) viruses undergo more rapid clade turnover and dissemination than A(H1N1) and B viruses, despite similar global migration networks across A(H3N2), A(H1N1), and B viruses (Bedford et al., 2015). Bedford et al. speculate that there is typically little geographic differentiation in A(H3N2) viruses circulating in each season because A(H3N2) viruses tend to infect adults, and adults are more mobile than children. Compared to A(H3N2) viruses, A(H1N1) and B viruses tend to have greater genealogical diversity, geographic differentiation, and longer local persistence times (Bedford et al., 2015; Rambaut et al., 2008). Thus, some A(H1N1) and B epidemics are reseeded by viruses that have persisted locally since prior epidemics (Bedford et al., 2015).

      Theoretical models have shown that epidemiological processes can influence rates of antigenic evolution (Recker et al., 2007; Wen et al., 2016; Zinder et al., 2013), though the impact of flu epidemiology on viral evolution is likely constrained by the virus’s intrinsic mutation rate. 

      In conclusion, larger host population sizes and flu epidemiology can indeed influence rates of antigenic evolution. However, given that our study is US-centric and focuses on A(H3N2) viruses, these factors are likely not at play in our study, due to intrinsic biological characteristics of A(H3N2) viruses and the geographic location of our study.

      We have added a clarifying sentence to the end of the Introduction to narrow the scope of the paper for the reader.

      Line 114-116: “Rather than characterize in situ evolution of A(H3N2) lineages circulating in the U.S., we study the epidemiological impacts of antigenic drift once A(H3N2) variants have arrived on U.S. soil and managed to establish and circulate at relatively high levels.”

      Methods -- 

      L 620 about rescaling and pre- vs post-pandemic times : tell us more - how has reporting changed? could any of this not be because of reporting but because of NPIs or otherwise? Overall there is a lot of rescaling going on. How sensitive are the results to it? 

      it would be unreasonable to ask for a sensitivity analysis for all the results for all the choices around data preparation, but some idea where there is a reason to think there might be a dependence on one of these choices would be great.

      In response to the 2009 A(H1N1) pandemic, the US CDC and WHO increased laboratory testing capacity and strengthened epidemiological networks, leading to substantial, long-lasting improvements to influenza surveillance that are still in place today (https://www.cdc.gov/flu/weekly/overview.htm). At the beginning of the COVID-19 pandemic, influenza surveillance networks were quickly adapted to detect and understand the spread of SARS-CoV-2. The 2009 pandemic occurred over a time span of less than one year, and strict non-pharmaceutical interventions (NPIs), such as lockdowns and mask mandates, were not implemented. Thus, we attribute increases in test volume during the post-2009 period to improved virologic surveillance and laboratory testing capacity rather than changes in care-seeking behavior. In the revised manuscript, we include a figure (Figure 1 - figure supplement 2) that shows systematic increases in test volume in all HHS regions after the 2009 pandemic.

      Given the substantial increase in influenza test volume after 2009, we opted to keep the time trend adjustment for the pre- and post-2009 pandemic periods and evaluate whether adjusting for regional reporting differences affects our results. When estimating univariate correlations between various A(H3N2) epidemic metrics and evolutionary indicators, we found qualitatively equivalent results for Spearman correlations and regression models when adjusting for both the pre-/post-2009 pandemic time periods and regional reporting versus adjusting only for the pre-/post-2009 pandemic time periods. Below, we share adjusted versions of Figure 3 (regression results) and Figure 3 - figure supplement 1 (Spearman correlations). Each figure adjusts only for differences in pre- and post-2009 pandemic reporting.

      Author response image 1.

      Adjustment for pre- and post-2009 pandemic only

      Author response image 2.

      Adjustment for pre- and post-2009 pandemic only

      L635 - Why discretize the continuous LBI distribution and then use Shannon entropy when you could just use the variance and/or higher moments? (or quantiles)? Similarly, why not use the duration of the peak, rather than Shannon entropy? (though there, because presumably data are already binned weekly, and using duration would involve defining start and stop times, it's more natural than with LBI)

      We realize that we failed to mention in the methods that we calculated the standard deviation of LBI in each season, in addition to the exponential of the Shannon entropy (Shannon diversity) of LBI. Both the Shannon diversity of LBI values and the standard deviation of LBI values were negatively correlated with effective Rt and epidemic intensity and positively correlated with seasonal duration. The two measures were similarly correlated with effective Rt and epidemic intensity (Figure 3 - figure supplements 2 - 3), while the Shannon diversity of LBI had slightly stronger correlations with seasonal duration than s.d. LBI (Figure 5). Thus, both measures of LBI diversity appear to capture potentially biologically important heterogeneities in clade growth rates.

      Separately, we use the inverse Shannon entropy of the incidence distribution to measure the spread of an A(H3N2) epidemic during the season, following the methods of Dalziel et al. 2018. The peak of an epidemic is a single time point at which the maximum incidence occurs. We have not encountered “the duration of the peak” before in epidemiology terminology, and, to our knowledge, there is not a robust way to measure the “duration of a peak,” unless one were to measure the time span between multiple points of maximum incidence or designate an arbitrary threshold for peak incidence that is not strictly the maximum incidence. Given that Shannon entropy is based on the normalized incidence distribution over the course of the entire influenza season (week 40 to week 20), it does not require designating an arbitrary threshold to describe epidemic intensity.

      L642 - again why normalize epidemic intensities, and how sensitive are the results to this? I would imagine given that the RF results were unstable under leave-one-out analysis that some of those results could be quite sensitive to choices of normalization and scaling.

      Epidemic intensity, defined as the inverse Shannon entropy of the incidence distribution, measures the spread of influenza cases across the weeks in a season. Following Dalziel et al. 2018, we estimated epidemic intensity from normalized incidence distributions rather than raw incidences so that epidemic intensity is invariant under differences in reporting rates and/or attack rates across regions and seasons. If we were to use raw incidences instead, HHS regions or seasons could have the appearance of greater or lower epidemic intensity (i.e., incidence concentrated within a few weeks or spread out over several weeks), due to differences in attack rates or test volume, rather than fundamental differences in the shapes of their epidemic curves. In other words, epidemic intensity is intended to measure the shape and spread of an epidemic, regardless of the actual volume of cases in a given region or season.

      In the methods section, we provide further clarification for why epidemic intensities are based on normalized incidence distributions rather than raw incidences.

      Lines 206-209: “Epidemic intensity is intended to measure the shape and spread of an epidemic, regardless of the actual volume of cases in a given region or season. Following the methodology of Dalziel et al. 2018, epidemic intensity values were normalized to fall between 0 and 1 so that epidemic intensity is invariant to differences in reporting rates and/or attack rates across regions and seasons.”  
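
      The invariance argument can be checked with a short Python sketch. It assumes that "inverse Shannon entropy" means the reciprocal of the entropy of the normalized weekly incidence distribution, and it leaves out the subsequent rescaling of intensity values to the interval from 0 to 1. The incidence series are invented.

```python
import math

def epidemic_intensity(weekly_incidence):
    """Reciprocal of the Shannon entropy of the normalized weekly incidence distribution."""
    total = sum(weekly_incidence)
    # Weeks with zero incidence contribute nothing to the entropy.
    proportions = [w / total for w in weekly_incidence if w > 0]
    entropy = -sum(p * math.log(p) for p in proportions)
    # Note: entropy is zero (and this division fails) if all cases occur in one week.
    return 1.0 / entropy

sharp = [0, 1, 2, 40, 50, 5, 1, 1]         # hypothetical: cases concentrated in few weeks
flat = [12, 13, 12, 13, 12, 13, 12, 13]    # hypothetical: cases spread evenly
sharp_tenfold = [10 * w for w in sharp]    # same curve shape, ten times the reporting volume
```

      Because the incidence is normalized to proportions first, `sharp` and `sharp_tenfold` give the same intensity, whereas `sharp` scores higher than `flat`. That is the sense in which the measure describes the shape of the epidemic curve rather than the volume of cases.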

      L643 - more information about what goes into Epidemia (variables, priors) such that it's replicable/understandable without the code would be good. 

      We now include additional information concerning the epidemic models used to estimate Rt, including all model equations, variables, and priors (Lines 210-276 in Methods).

      L667 did you do breakpoint detection? Why linear models? Was log(incidence) used? 

      In our original submission, we estimated epidemic onsets using piecewise regression models (Lines 666-674 in original manuscript), which model non-linear relationships with breakpoints by iteratively fitting linear models (Muggeo, 2003). Piecewise regression falls under the umbrella of parametric methods for breakpoint detection.

      We did not include results from linear models fit to log(incidence) or GLMs with Gaussian error distributions and log links, for two reasons. First, models fit to log-transformed data require non-zero values as inputs. Although breakpoint detection does not necessarily require weeks of zero incidence leading up to the start of an outbreak, limiting the time period for breakpoint detection to weeks with non-zero incidence (so that we could use log-transformed incidence) substantially pushed back our previous, more biologically plausible estimates of epidemic onset weeks. Second, as an alternative to limiting the dataset to weeks with non-zero incidence, we tried adding a small positive number to weekly incidences so that we could fit models to log-transformed incidence for the whole time period spanning epidemic week 40 (the start of the influenza season) to the first week of maximum incidence. Fitting models to log-transformed incidences produced unrealistic breakpoint locations, potentially because log transformations 1) linearize data, and 2) stabilize variance by reducing the impact of extreme values. Due to the short time span used for breakpoint detection, log transforming incidence diminishes abrupt changes in incidence at the beginning of outbreaks, making it difficult for models to estimate biologically plausible breakpoint locations. Log transformations of incidence may be more useful when analyzing time series spanning multiple seasons, rather than short time spans with sharp changes in incidence (i.e., the exponential growth phase of a single flu outbreak).

      As an alternative to piecewise regression, our revised manuscript also estimates epidemic onsets using a Bayesian ensemble algorithm that accounts for the time series nature of incidence data and allows for complex, non-linear trajectories interspersed with change points (BEAST - a Bayesian estimator of Abrupt change, Seasonal change, and Trend; Zhao et al., 2019). Although a few regional onset times differed across the two methods, our conclusions did not change concerning correlations between viral fitness and epidemic onset timing.

      We have rewritten the methods section for estimating epidemic onsets to clarify our methodology and to include the BEAST method (Lines 292-308):

      “We estimated the regional onsets of A(H3N2) virus epidemics by detecting breakpoints in A(H3N2) incidence curves at the beginning of each season. The timing of the breakpoint in incidence represents epidemic establishment (i.e., sustained transmission) rather than the timing of influenza introduction or arrival (Charu et al., 2017). We used two methods to estimate epidemic onsets: 1) piecewise regression, which models non-linear relationships with break points by iteratively fitting linear models to each segment (segmented R package) (Muggeo, 2008; Muggeo, 2003), and 2) a Bayesian ensemble algorithm (BEAST – a Bayesian estimator of Abrupt change, Seasonal change, and Trend) that explicitly accounts for the time series nature of incidence data and allows for complex, non-linear trajectories interspersed with change points (Rbeast R package) (Zhao et al., 2019). For each region in each season, we limited the time period of breakpoint detection to epidemic week 40 to the first week of maximum incidence and did not estimate epidemic onsets for regions with insufficient signal, which we defined as fewer than three weeks of consecutive incidence and/or greater than 30% of weeks with missing data. We successfully estimated A(H3N2) onset timing for most seasons, except for three A(H1N1) dominant seasons: 2000-2001 (0 regions), 2002-2003 (3 regions), and 2009-2010 (0 regions). Estimates of epidemic onset weeks were similar when using piecewise regression versus the BEAST method, and downstream analyses of correlations between viral fitness indicators and onset timing produced equivalent results. We therefore report results from onsets estimated via piecewise regression.”
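
      The general idea of locating an onset as a breakpoint between two linear segments can be illustrated with a simplified Python sketch. It uses a brute-force search over candidate breakpoints with two independently fitted straight lines, which is a deliberately simpler technique than the iterative algorithm in the segmented R package or the BEAST ensemble method, and the segments are not constrained to join. The function names and the incidence series are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit; returns (intercept, slope, sum of squared errors)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    return intercept, slope, sse

def find_breakpoint(weeks, incidence, min_points=3):
    """Return the first week of the second segment in the best two-line split."""
    best_week, best_sse = None, float("inf")
    for k in range(min_points, len(weeks) - min_points + 1):
        _, _, left = fit_line(weeks[:k], incidence[:k])
        _, _, right = fit_line(weeks[k:], incidence[k:])
        if left + right < best_sse:
            best_week, best_sse = weeks[k], left + right
    return best_week

# Invented series: no incidence for seven weeks, then steady growth from week 47.
weeks = list(range(40, 53))
incidence = [0, 0, 0, 0, 0, 0, 0, 8, 18, 28, 38, 48, 58]
onset_week = find_breakpoint(weeks, incidence)
```

      Restricting the search window to the weeks before peak incidence, as the authors describe, keeps the decline phase of the epidemic from competing with the onset as the dominant change in slope.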

      L773 national indicators -- presumably this is because you don't have regional-level information, but it might be worth saying that earlier so it doesn't read like there are other indicators now, called national indicators, that we should have heard of 

      In the revised manuscript, we move a paragraph that was at the beginning of the Results to the beginning of the Methods.

      Lines 123-132: 

      “Our study focuses on the impact of A(H3N2) virus evolution on seasonal epidemics from seasons 1997-1998 to 2018-2019 in the U.S.; whenever possible, we make use of regionally disaggregated indicators and analyses. We start by identifying multiple indicators of influenza evolution each season based on changes in HA and NA. Next, we compile influenza virus subtype-specific incidence time series for U.S. Department of Health and Human Services (HHS) regions and estimate multiple indicators characterizing influenza A(H3N2) epidemic dynamics each season, including epidemic burden, severity, type/subtype dominance, timing, and the age distribution of cases. We then assess univariate relationships between national indicators of evolution and regional epidemic characteristics. Lastly, we use multivariable regression models and random forest models to measure the relative importance of viral evolution, heterosubtypic interference, and prior immunity in predicting regional A(H3N2) epidemic dynamics.”

      In Lines 484-487 in the Methods, we now mention that measures of seasonal antigenic and genetic distance are at the national level. 

      “For each replicate dataset, we estimated national-level genetic and antigenic distances between influenza viruses circulating in consecutive seasons by calculating the mean distance between viruses circulating in the current season 𝑡 and viruses circulating during the prior season (𝑡 – 1 year; one season lag) or two prior seasons ago (𝑡 – 2 years; two season lag).”
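
      The lagged seasonal distance can be sketched in Python as the mean of all pairwise distances between viruses from two seasons. The sketch assumes a plain Hamming distance on aligned sequences as a placeholder; the manuscript's genetic and antigenic distance measures are not reproduced here, and the sequences and function names are invented.

```python
def hamming(a, b):
    """Number of positions at which two aligned, equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))

def mean_seasonal_distance(current, previous, distance=hamming):
    """Mean distance over every (current-season, earlier-season) virus pair."""
    pairs = [(c, p) for c in current for p in previous]
    return sum(distance(c, p) for c, p in pairs) / len(pairs)

def lagged_distances(sequences_by_season, lag=1):
    """Map each season to its mean distance from the season `lag` steps earlier."""
    seasons = sorted(sequences_by_season)
    return {
        seasons[i]: mean_seasonal_distance(
            sequences_by_season[seasons[i]], sequences_by_season[seasons[i - lag]]
        )
        for i in range(lag, len(seasons))
    }

# Invented toy alignment with three seasons.
toy_sequences = {
    "1997-1998": ["AAAA", "AAAT"],
    "1998-1999": ["AATT"],
    "1999-2000": ["TTTT"],
}
one_season_lag = lagged_distances(toy_sequences, lag=1)
two_season_lag = lagged_distances(toy_sequences, lag=2)
```

      Passing a different `distance` function would let the same averaging be applied to any pairwise measure; the first `lag` seasons have no earlier comparison season and so receive no value.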

      L782 Why Beta regression and what is "the resampled dataset" ? 

      Beta regression is appropriate for models of subtype dominance, epidemic intensity, and age-specific proportions of ILI cases because these data are continuous and restricted to the interval (0, 1) (Ferrari & Cribari-Neto, 2004). “The resampled dataset” refers to the “1000 bootstrap replicates of the original dataset (1000 samples with replacement)” mentioned in Lines 777-778 of the original manuscript. 

      In the revised manuscript, we include more background information about Beta regression models, and explicitly mention that regression models were fit to 1000 bootstrap replicates of the original dataset.

      Lines 503-507: 

      “For subtype dominance, epidemic intensity, and age-specific proportions of ILI cases, we fit Beta regression models with logit links. Beta regression models are appropriate when the variable of interest is continuous and restricted to the interval (0, 1) (Ferrari & Cribari-Neto, 2004). For each epidemic metric, we fit the best-performing regression model to 1000 bootstrap replicates of the original dataset.”
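
      The structure of a Beta regression with a logit link can be made concrete with a short Python sketch of its log-likelihood, using the mean and precision parameterization of Ferrari & Cribari-Neto (2004), in which the response follows Beta(μφ, (1 − μ)φ) and logit(μ) is a linear function of the predictors. This is a sketch of the model form only, not of the fitting software used in the manuscript, and the function names and toy inputs are hypothetical.

```python
import math

def inv_logit(eta):
    """Inverse of the logit link: maps a linear predictor to a mean in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def beta_log_density(y, mu, phi):
    """Log density of Beta(mu * phi, (1 - mu) * phi) at y, for 0 < y < 1."""
    a, b = mu * phi, (1.0 - mu) * phi
    return (math.lgamma(phi) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y))

def beta_regression_loglik(y, X, coefficients, phi):
    """Log-likelihood of a Beta regression with a logit link for the mean."""
    total = 0.0
    for yi, row in zip(y, X):
        eta = sum(c * v for c, v in zip(coefficients, row))
        total += beta_log_density(yi, inv_logit(eta), phi)
    return total

# Invented inputs: an intercept-only model for two proportions.
y_toy = [0.2, 0.8]
X_toy = [[1.0], [1.0]]
loglik = beta_regression_loglik(y_toy, X_toy, coefficients=[0.0], phi=2.0)
```

      The log terms are undefined at exactly 0 or 1, which is why this model form suits outcomes that lie strictly inside the unit interval, such as subtype dominance or age-specific proportions of cases.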

      The github is clear, comprehensive and well-documented, at least at a brief glance. 

      Thank you! At the time of resubmission, our GitHub repository is updated to incorporate feedback from the reviewers.

      References

      Altmann, A., Tolosi, L., Sander, O., & Lengauer, T. (2010). Permutation importance: a corrected feature importance measure. Bioinformatics, 26(10), 1340-1347.

      https://doi.org/10.1093/bioinformatics/btq134  

      Barrat-Charlaix, P., Huddleston, J., Bedford, T., & Neher, R. A. (2021). Limited Predictability of Amino Acid Substitutions in Seasonal Influenza Viruses. Mol Biol Evol, 38(7), 2767-2777.

      https://doi.org/10.1093/molbev/msab065  

      Bedford, T., Riley, S., Barr, I. G., Broor, S., Chadha, M., Cox, N. J., Daniels, R. S., Gunasekaran, C. P.,

      Hurt, A. C., Kelso, A., Klimov, A., Lewis, N. S., Li, X., McCauley, J. W., Odagiri, T., Potdar, V., Rambaut, A., Shu, Y., Skepner, E., . . . Russell, C. A. (2015). Global circulation patterns of seasonal influenza viruses vary with antigenic drift. Nature, 523(7559), 217-220.

      https://doi.org/10.1038/nature14460  

      Chao, A., Gotelli, N. J., Hsieh, T. C., Sander, E. L., Ma, K. H., Colwell, R. K., & Ellison, A. M. (2014). Rarefaction and extrapolation with Hill numbers: a framework for sampling and estimation in species diversity studies. Ecological Monographs, 84(1), 45-67. https://doi.org/10.1890/13-0133.1  Charu, V., Zeger, S., Gog, J., Bjornstad, O. N., Kissler, S., Simonsen, L., Grenfell, B. T., & Viboud, C. (2017). Human mobility and the spatial transmission of influenza in the United States. PLoS

      Comput Biol, 13(2), e1005382. https://doi.org/10.1371/journal.pcbi.1005382  

      Dalziel, B. D., Kissler, S., Gog, J. R., Viboud, C., Bjornstad, O. N., Metcalf, C. J. E., & Grenfell, B. T.

      (2018). Urbanization and humidity shape the intensity of influenza epidemics in U.S. cities.

      Science, 362(6410), 75-79. https://doi.org/10.1126/science.aat6030  

      Debeer, D., & Strobl, C. (2020). Conditional permutation importance revisited. BMC Bioinformatics, 21(1), 307. https://doi.org/10.1186/s12859-020-03622-2  

      Dhanasekaran, V., Sullivan, S., Edwards, K. M., Xie, R., Khvorov, A., Valkenburg, S. A., Cowling, B. J., & Barr, I. G. (2022). Human seasonal influenza under COVID-19 and the potential consequences of influenza lineage elimination. Nat Commun, 13(1), 1721. https://doi.org/10.1038/s41467-02229402-5  

      Ferrari, S., & Cribari-Neto, F. (2004). Beta Regression for Modelling Rates and Proportions. Journal of Applied Statistics, 31(7), 799-815. https://doi.org/10.1080/0266476042000214501  

      Garten, R. J., Davis, C. T., Russell, C. A., Shu, B., Lindstrom, S., Balish, A., Sessions, W. M., Xu, X., Skepner, E., Deyde, V., Okomo-Adhiambo, M., Gubareva, L., Barnes, J., Smith, C. B., Emery, S. L., Hillman, M. J., Rivailler, P., Smagala, J., de Graaf, M., . . . Cox, N. J. (2009). Antigenic and genetic characteristics of swine-origin 2009 A(H1N1) influenza viruses circulating in humans.

      Science, 325(5937), 197-201. https://doi.org/10.1126/science.1176225  

      Grebe, K. M., Yewdell, J. W., & Bennink, J. R. (2008). Heterosubtypic immunity to influenza A virus: where do we stand? Microbes Infect, 10(9), 1024-1029. https://doi.org/10.1016/j.micinf.2008.07.002

      Hill, M. O. (1973). Diversity and Evenness: A Unifying Notation and Its Consequences. Ecology, 54(2), 427-432. https://doi.org/10.2307/1934352

      Huddleston, J., Barnes, J. R., Rowe, T., Xu, X., Kondor, R., Wentworth, D. E., Whittaker, L., Ermetal, B., Daniels, R. S., McCauley, J. W., Fujisaki, S., Nakamura, K., Kishida, N., Watanabe, S., Hasegawa, H., Barr, I., Subbarao, K., Barrat-Charlaix, P., Neher, R. A., & Bedford, T. (2020). Integrating genotypes and phenotypes improves long-term forecasts of seasonal influenza A/H3N2 evolution. Elife, 9, e60067. https://doi.org/10.7554/eLife.60067

      Kuhn, M., & Johnson, K. (2013). Applied predictive modeling (Vol. 26). Springer.

      Kuhn, M., & Johnson, K. (2019). Feature engineering and selection: A practical approach for predictive models. Chapman and Hall/CRC. 

      Lee, E. C., Arab, A., Goldlust, S. M., Viboud, C., Grenfell, B. T., & Bansal, S. (2018). Deploying digital health data to optimize influenza surveillance at national and local scales. PLoS Comput Biol, 14(3), e1006020. https://doi.org/10.1371/journal.pcbi.1006020

      Lemey, P., Rambaut, A., Bedford, T., Faria, N., Bielejec, F., Baele, G., Russell, C. A., Smith, D. J., Pybus, O. G., Brockmann, D., & Suchard, M. A. (2014). Unifying viral genetics and human transportation data to predict the global transmission dynamics of human influenza H3N2. PLoS Pathog, 10(2), e1003932. https://doi.org/10.1371/journal.ppat.1003932

      Muggeo, V. (2008). Segmented: An R Package to Fit Regression Models With Broken-Line Relationships. R News, 8, 20-25. 

      Muggeo, V. M. (2003). Estimating regression models with unknown break-points. Stat Med, 22(19), 3055-3071. https://doi.org/10.1002/sim.1545

      Neher, R. A., Russell, C. A., & Shraiman, B. I. (2014). Predicting evolution from the shape of genealogical trees. Elife, 3, e03568. https://doi.org/10.7554/eLife.03568  

      Rambaut, A., Pybus, O. G., Nelson, M. I., Viboud, C., Taubenberger, J. K., & Holmes, E. C. (2008). The genomic and epidemiological dynamics of human influenza A virus. Nature, 453(7195), 615-619. https://doi.org/10.1038/nature06945

      Recker, M., Pybus, O. G., Nee, S., & Gupta, S. (2007). The generation of influenza outbreaks by a network of host immune responses against a limited set of antigenic types. Proceedings of the National Academy of Sciences, 104(18), 7711-7716. https://doi.org/10.1073/pnas.0702154104

      Shannon, C. E. (1948). A mathematical theory of communication. The Bell system technical journal, 27(3), 379-423. 

      Smith, G. J., Vijaykrishna, D., Bahl, J., Lycett, S. J., Worobey, M., Pybus, O. G., Ma, S. K., Cheung, C. L., Raghwani, J., Bhatt, S., Peiris, J. S., Guan, Y., & Rambaut, A. (2009). Origins and evolutionary genomics of the 2009 swine-origin H1N1 influenza A epidemic. Nature, 459(7250), 1122-1125. https://doi.org/10.1038/nature08182  

      Sridhar, S. (2016). Heterosubtypic T-Cell Immunity to Influenza in Humans: Challenges for Universal T-Cell Influenza Vaccines. Front Immunol, 7, 195. https://doi.org/10.3389/fimmu.2016.00195

      Strobl, C., Boulesteix, A. L., Kneib, T., Augustin, T., & Zeileis, A. (2008). Conditional variable importance for random forests. BMC Bioinformatics, 9, 307. https://doi.org/10.1186/1471-2105-9-307  

      Strobl, C., Boulesteix, A. L., Zeileis, A., & Hothorn, T. (2007). Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinformatics, 8, 25. https://doi.org/10.1186/1471-2105-8-25

      Terajima, M., Babon, J. A., Co, M. D., & Ennis, F. A. (2013). Cross-reactive human B cell and T cell epitopes between influenza A and B viruses. Virol J, 10, 244. https://doi.org/10.1186/1743-422X-10-244

      Webster, R. G., Bean, W. J., Gorman, O. T., Chambers, T. M., & Kawaoka, Y. (1992). Evolution and ecology of influenza A viruses. Microbiological Reviews, 56(1), 152-179. https://doi.org/10.1128/mr.56.1.152-179.1992

      Wen, F., Bedford, T., & Cobey, S. (2016). Explaining the geographical origins of seasonal influenza A (H3N2). Proc Biol Sci, 283(1838). https://doi.org/10.1098/rspb.2016.1312

      Yan, L., Neher, R. A., & Shraiman, B. I. (2019). Phylodynamic theory of persistence, extinction and speciation of rapidly adapting pathogens. Elife, 8. https://doi.org/10.7554/eLife.44205  

      Zhao, K., Wulder, M. A., Hu, T., Bright, R., Wu, Q., Qin, H., Li, Y., Toman, E., Mallick, B., Zhang, X., & Brown, M. (2019). Detecting change-point, trend, and seasonality in satellite time series data to track abrupt changes and nonlinear dynamics: A Bayesian ensemble algorithm. Remote Sensing of Environment, 232, 111181. https://doi.org/10.1016/j.rse.2019.04.034

      Zinder, D., Bedford, T., Gupta, S., & Pascual, M. (2013). The Roles of Competition and Mutation in Shaping Antigenic and Genetic Diversity in Influenza. PLOS Pathogens, 9(1). https://doi.org/10.1371/journal.ppat.1003104

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This valuable study builds on previous work by the authors by presenting a potentially key method for correcting optical aberrations in GRIN lens-based microendoscopes used for imaging deep brain regions. By combining simulations and experiments, the authors show that the obtained field of view is significantly increased with corrected versus uncorrected microendoscopes. The evidence supporting the claims of the authors is solid, although some aspects of the manuscript should be clarified and missing information provided. Because the approach described in this paper does not require any microscope or software modifications, it can be readily adopted by neuroscientists who wish to image neuronal activity deep in the brain.

      We thank the Referees for their interest in the paper and for the constructive feedback. We have taken the time necessary to address all of their comments, acquiring new data and performing additional analyses. With the inclusion of these new results, we modified four main figures (Figures 1, 6, 7, and 8), added three new Supplementary Figures (Supplementary Figures 1, 2, and 3), and significantly edited the text. Based on the additional work suggested by the Referees, we believe that we have improved our manuscript, provided missing information, and clarified the aspects of the manuscript to which the Referees drew our attention.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Referee’s comment: Sattin, Nardin, and colleagues designed and evaluated corrective microlenses that increase the useable field of view of two long (>6mm) thin (500 um diameter) GRIN lenses used in deep-tissue two-photon imaging. This paper closely follows the thread of earlier work from the same group (e.g. Antonini et al, 2020; eLife), filling out the quiver of available extended-field-of-view 2P endoscopes with these longer lenses. The lenses are made by a molding process that appears practical and easy to adopt with conventional two-photon microscopes.

      Simulations are used to motivate the benefits of extended field of view, demonstrating that more cells can be recorded, with less mixing of signals in extracted traces, when recorded with higher optical resolution. In vivo tests were performed in the piriform cortex, which is difficult to access, especially in chronic preparations.

      The design, characterization, and simulations are clear and thorough, but not exhaustive (see below), and do not break new ground in optical design or biological application. However, the approach shows much promise, including for applications not mentioned in the present text such as miniaturized GRIN-based microscopes. Readers will largely be interested in this work for practical reasons: to apply the authors' corrected endoscopes.

      Strengths:

      The text is clearly written, the ex vivo analysis is thorough and well-supported, and the figures are clear. The authors achieved their aims, as evidenced by the images presented, and were able to make measurements from large numbers of cells simultaneously in vivo in a difficult preparation.

      Weaknesses:

      Referee’s comment: (1) The novelty of the present work over previous efforts from the same group is not well explained. What needed to be done differently to correct these longer GRIN lenses?

      We thank the Referee for the positive evaluation of our work. The optical properties of GRIN lenses depend on the geometrical and optical features of the specific GRIN lens type considered, i.e. its diameter, length, numerical aperture, pitch, and radial modulation of the refractive index. Our approach is based on the addition of a corrective optical element at the back end of the GRIN lens to compensate for aberrations that light encounters as it travels through the GRIN lens. The corrective optical element must, therefore, be specifically tailored to the specific GRIN lens type we aim to correct the aberrations of. The novelty of the present article lies in the successful execution of the ray-trace simulations and two-photon lithography fabrication of corrective optical elements necessary to achieve aberration correction in the two novel and long GRIN lens types, i.e. NEM-050-25-15-860-S-1.5p and NEM-050-23-15-860-S-2.0p (GRIN length, 6.4 mm and 8.8 mm, respectively). Our previous work (Antonini et al. eLife 2020) demonstrated aberration correction with GRIN lenses shorter than 4.1 mm. The design and fabrication of a single corrective optical element suitable to enlarge the field-of-view (FOV) in these longer GRIN lenses is not obvious, especially because longer GRIN lenses are affected by stronger aberrations. To better clarify this point, we revised the Introduction at page 5 (lines 3-10 from bottom) as follows:

      “Recently, a novel method based on 3D microprinting of polymer optics was developed to correct for GRIN aberrations by placing specifically designed aspherical corrective lenses at the back end of the GRIN lens 7. This approach is attractive because it is built-in on the GRIN lens and corrected microendoscopes are ready-to-use, requiring no change in the optical set-up. However, previous work demonstrated the feasibility of this method only for GRIN lenses of length < 4.1 mm 7, which are too short to reach the most ventral regions of the mouse brain. The applicability of this technology to longer GRIN lenses, which are affected by stronger optical aberrations 19, remained to be proven.”

      (2) Some strong motivations for the method are not presented. For example, the introduction (page 3) focuses on identifying neurons with different coding properties, but this can be done with electrophysiology (albeit with different strengths and weaknesses). Compared to electrophysiology, optical methods more clearly excel at genetic targeting, subcellular measurements, and molecular specificity; these could be mentioned.

      Thank you for the comment. We added a paragraph in the Introduction (page 3, lines 2-8), as suggested by the Reviewer:

      “High resolution 2P fluorescence imaging of the awake brain is a fundamental tool to investigate the relationship between the structure and the function of brain circuits 1. Compared to electrophysiological techniques, functional imaging in combination with genetically encoded indicators allows monitoring the activity of genetically targeted cell types, access to subcellular compartments, and tracking the dynamics of many biochemical signals in the brain (2). However, a critical limitation of multiphoton microscopy lies in its limited (< 1 mm) penetration depth in scattering biological media 3”.

      Another example, in comparing microfabricated lenses to other approaches, an unmentioned advantage is miniaturization and potential application to mini-2P microscopes, which use GRIN lenses.

      We added the concept suggested by the Reviewer in the Discussion (page 21, lines 4-7 from bottom). The text now reads:

      “Another advantage of long corrected microendoscopes described here over adaptive optics approaches is the possibility to couple corrected microendoscopes with portable 2P microscopes 42-44, allowing high resolution functional imaging of deep brain circuits on an enlarged FOV during naturalistic behavior in freely moving mice”.

      (3) Some potentially useful information is lacking, leaving critical questions for potential adopters:

      How sensitive is the assembly to decenter between the corrective optic and the GRIN lens?

      Following the Referee’s comment, we conducted new optical simulations to evaluate the decrease in optical performance of the corrected endoscopes as a function of the radial shift of the corrective lens from the optical axis of the GRIN rod (decentering, new Supplementary Figure 3), using light rays passing either off- or on-axis. For off-axis rays, we found that the Strehl ratio remained above 0.8 (Maréchal criterion) for positive translations in the range 6-11.5 microns and 16-50 microns for the 6.4 mm- and the 8.8 mm-long corrected microendoscope, respectively, while the Strehl ratio decreased below 0.8 for negative translations of amplitude ~ 5 microns. Please note that for the most marginal rays, a negative translation produces a mismatch between the corrective microlens and the GRIN lens such that the light rays no longer pass through the corrective lens. In contrast, rays passing near the optical axis were still focused by the corrected probe with Strehl ratio above 0.8 in a range of radial shifts of -40 – 40 microns for both microendoscope types. Altogether, these new simulations suggest that a decentering of < 5 microns between the corrective microlens and the GRIN lens does not substantially affect the optical properties of the corrected endoscopes. These new results are now displayed in Supplementary Figure 3 and described on page 7 (lines 3-5 from bottom).
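
      As a rough aid to reading the 0.8 threshold quoted above, the Maréchal criterion can be related to the RMS wavefront error through the widely used approximation S ≈ exp(-(2πσ/λ)²). The sketch below is illustrative only: the function names are hypothetical, and the Strehl ratios reported in this response come from optical simulations of the full probe, not from this closed-form estimate.

```python
import math


def strehl_from_rms(rms_error_nm: float, wavelength_nm: float) -> float:
    """Approximate Strehl ratio from RMS wavefront error (extended Marechal formula)."""
    phase_rms = 2.0 * math.pi * rms_error_nm / wavelength_nm  # RMS phase error in radians
    return math.exp(-phase_rms ** 2)


def max_rms_for_strehl(strehl: float, wavelength_nm: float) -> float:
    """Invert the approximation: largest RMS wavefront error giving at least `strehl`."""
    return wavelength_nm * math.sqrt(-math.log(strehl)) / (2.0 * math.pi)


# Assumption: 920 nm, the design wavelength stated in this response.
WAVELENGTH_NM = 920.0
rms_limit_nm = max_rms_for_strehl(0.8, WAVELENGTH_NM)
print(f"RMS wavefront error at Strehl = 0.8: {rms_limit_nm:.1f} nm")
```

      Under this approximation, a Strehl ratio of 0.8 corresponds to an RMS wavefront error of roughly λ/13 to λ/14.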

      What is the yield of fabrication and of assembly?

      The fabrication yield using molding was ~ 90% (N > 30 molded lenses). The main limitation of this procedure was the formation of air bubbles between the mold negative and the glass coverslip. Molded lenses were visually inspected with a stereomicroscope and discarded if air bubbles had formed.

      The assembly yield, i.e. correct positioning of the GRIN lens with respect to the coverslip, was 100 % (N = 27 endoscopes).

      We added this information in the Methods at page 29 (lines 1-12), as follows:

      “After UV curing, the microlens was visually inspected at the stereomicroscope. In case of formation of air bubbles, the microlens was discarded (yield of the molding procedure: ~ 90 %, N > 30 molded lenses). The coverslip with the attached corrective lens was sealed to a customized metal or plastic support ring of appropriate diameter (Fig. 2C). The support ring, the coverslip and the aspherical lens formed the upper part of the corrected microendoscope, to be subsequently coupled to the proper GRIN rod (Table 2) using a custom-built opto-mechanical stage and NOA63 (Fig. 2C) 7. The GRIN rod was positioned perpendicularly to the glass coverslip, on the other side of the coverslip compared to the corrective lens, and aligned to the aspherical lens perimeter (Fig. 2C) under the guidance of a wide field microscope equipped with a camera. The yield of the assembly procedure for the probes used in this work was 100 % (N = 27 endoscopes). For further details on the assembly of corrected microendoscope see(7)”. 

      Supplementary Figure 1: Is this really a good agreement between the design and measured profile? Does the figure error (~10 um in some cases on average) noticeably degrade the image?

      As the Reviewer correctly noticed, the discrepancy between the simulated profile and the experimentally measured profile can be up to 5-10 microns at specific radial positions. This discrepancy could be due to issues with: (i) the fabrication of the microlens; (ii) the experimental measurement of the lens profile with the stylus profilometer. To discriminate between these two possibilities, we asked what the expected optical properties of the corrected endoscope would be if the corrective lens had the experimentally measured (not the simulated) profile. To this aim, we performed new optical simulations of the point spread function (PSF) of the corrected probe using, as corrective microlens profile, the average experimentally measured profile of a fabricated corrective lens. For both microendoscope types, we first fitted the mean experimentally measured profile of the fabricated lens with the aspherical function reported in equation (1) of the main text:

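      $$ z(r) = \frac{c\,r^{2}}{1 + \sqrt{1 - (1 + k)\,c^{2} r^{2}}} + \sum_{i=1}^{n} \alpha_i\, r^{2i} + z_0 $$

      where:

      - $r$ is the radial distance from the optical axis;

      - $c$ is equal to $1/R$, where $R$ is the radius of curvature;

      - $k$ is the conic constant;

      - $\alpha_1$ to $\alpha_n$ are asphericity coefficients;

      - $z_0$ is the height of the microlens profile on-axis.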

      The fitting values of the parameters of equation (1) for the two lenses are reported for the Referee’s inspection here below (variables describing distances are expressed in mm):

      Author response table 1.

      Fitting values for the parameters of Equation (1) describing the profile of corrective microlens replicas measured with the stylus profilometer. Distances are expressed in mm.

      We then assumed that the profiles of the corrective microlenses were equal to the mean experimentally measured profiles and used the aspherical fitting functions in the optical simulations to compute the performance of corrected microendoscopes. For both microendoscope types, we found that the Strehl ratio was lower than 0.35, well below the theoretical diffraction-limited threshold of 0.8 (Maréchal criterion), at moderate distances from the optical axis (68 μm-94 μm and 67 μm-92 μm on the focal plane in the object space, after the front end of the GRIN lens, for the 6.4 mm- and the 8.8 mm-long corrected microendoscope, respectively, Author response image 1A, C), and the PSF was strongly distorted (Author response image 1B, D).

      Author response image 1.

      Simulated optical performance of corrected probes with profiles of corrective microlenses equal to the mean experimentally measured profiles of fabricated corrective lenses. A) The Strehl ratio for the 6.4 mm-long corrected microendoscope with measured microlens profile (black dots) is computed on-axis (distance from the center of the FOV d = 0 µm) and at two radial distances off-axis (d = 68 μm and 94 μm on the focal plane in the object space) and compared to the Strehl ratio of the uncorrected (red line) and corrected (blue line) microendoscopes. B) Lateral (x,y) and axial (x,z) fluorescence intensity (F) profiles of simulated PSFs on-axis (left) and off-axis (right, at the indicated distance d computed on the focal plane in the object space) for the 6.4 mm-long corrected microendoscope with measured microlens profile. C) Same as in (A) for the 8.8 mm-long corrected microendoscope (off-axis d = 67 μm and 92 μm on the focal plane in the object space). D) Same as in (B) for the 8.8 mm-long corrected microendoscope.

      These simulated findings are in contrast with the experimentally measured optical properties of our corrected endoscopes (Figure 3). In other words, these new simulated results show that the experimentally measured profiles of the corrective lenses are incompatible with the experimental measurements of the optical properties of the corrected endoscopes. Therefore, our experimental recording of the lens profile shown in Supplementary Figure 1 of the first submission (now Supplementary Figure 4) should be used only as a coarse measure of the lens shape and cannot be used to precisely compare simulated lens profiles with measured lens profiles.
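
      For readers who wish to compare a designed and a measured profile themselves, the sketch below evaluates an aspherical sag profile from its parameters. It assumes the standard even-asphere form (a conic term plus even powers of the radial distance, offset by the on-axis height); the function name, the parameter names, and the numerical values are hypothetical and are not the fitted values from Author response table 1.

```python
import math
from typing import Sequence


def aspheric_sag(r: float, c: float, k: float,
                 alphas: Sequence[float], z0: float = 0.0) -> float:
    """Even-asphere profile height at radial distance r (same length units as 1/c).

    c      : curvature, 1/R
    k      : conic constant
    alphas : asphericity coefficients multiplying r**2, r**4, r**6, ...
    z0     : on-axis height of the profile (assumed additive offset)
    """
    radicand = 1.0 - (1.0 + k) * c ** 2 * r ** 2
    if radicand < 0.0:
        raise ValueError("r lies outside the domain of the conic term")
    conic = c * r ** 2 / (1.0 + math.sqrt(radicand))
    polynomial = sum(a * r ** (2 * (i + 1)) for i, a in enumerate(alphas))
    return z0 + conic + polynomial


# Sanity check on a pure sphere (k = 0, no aspheric terms, R = 1):
# the sag at r = 0.6 should equal 1 - sqrt(1 - 0.36) = 0.2.
sphere_sag = aspheric_sag(0.6, c=1.0, k=0.0, alphas=[])
print(round(sphere_sag, 6))
```

      Evaluating the function on a grid of radial positions for both the designed and the fitted parameters gives the point-by-point deviation between the two profiles.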

      How do individual radial profiles compare to the presented means?

      We provide below a modified version of Supplementary Figure 4 (Supplementary Figure 1 in the first submission), where individual profiles measured with the stylus profilometer and the mean profile are displayed for both microendoscope types (Author response image 2). In the manuscript (Supplementary Figure 4), we suggest continuing to show mean profiles ± standard errors of the mean, as we did in the original submission.

      Author response image 2.

      Characterization of polymeric corrective lens replicas. A) Stylus profilometer measurements were performed along the radius of the corrective polymer microlens replica for the 6.4 mm-long corrected microendoscope. Individual measured profiles (grey solid lines) obtained from n = 3 profile measurements on m = 3 different corrective lens replicas, plus the mean profile (black solid line) are displayed. B) Same as (A) for the 8.8 mm-long microendoscope.

      What is the practical effect of the strong field curvature? Are the edges of the field, which come very close to the lens surface, a practical limitation?

      A first practical effect of the field curvature is that structures at different z coordinates are sampled. The observed field curvature of corrected endoscopes may therefore impact imaging in brain regions characterized by strong axially organized anatomy (e.g., the pyramidal layer of the hippocampus), but would not significantly affect imaging in regions with homogeneous cell density within the axial extension of the field curvature (< 170 µm, see more details below). A second consequence of the field curvature, as the Referee correctly points out, is that cells at the border of the FOV are closer to the front end of the GRIN lens. In measurements of subresolved fluorescent layers (Figure 3A-D), we observed that the field curvature extends in the axial direction to ~ 110 μm and ~ 170 μm for the 6.4 mm- and the 8.8 mm-long microendoscopes, respectively. Considering that the nominal working distances on the object side of the 6.4 mm- and the 8.8 mm-long microendoscopes were, respectively, 210 μm and 178 μm (Table 3), structures positioned at the very edge of the FOV were ~ 100 μm and ~ 8 μm away from the GRIN front end for the 6.4 mm-long and for the 8.8 mm-long probe, respectively. Previous studies have shown that brain tissue within 50-100 μm from the GRIN front end may show signs of tissue reaction to the implant (Curreli et al. PLOS Biology 2022, Attardo et al. Nature 2015). Therefore, structures at the very edge of the FOV of the 8.8 mm-long endoscopes, but not those at the edge of the 6.4 mm-long endoscopes, may be within the volume showing tissue reaction. We added a paragraph in the text to discuss these points (page 18, lines 10-14).
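
      The edge-clearance estimate above is a simple subtraction of the axial extent of the field curvature from the nominal working distance. A minimal sketch of that arithmetic follows; the variable names are hypothetical, and the 100 μm cut-off is an assumption that takes the upper bound of the 50-100 μm range quoted above as the limit of the tissue-reaction zone.

```python
# Nominal working distance and axial extent of the field curvature (both in um),
# as quoted in the response for the two corrected microendoscopes.
PROBES_UM = {
    "6.4 mm": {"working_distance": 210.0, "field_curvature": 110.0},
    "8.8 mm": {"working_distance": 178.0, "field_curvature": 170.0},
}

# Assumption: tissue reaction is considered possible strictly below 100 um.
REACTION_ZONE_UM = 100.0


def edge_clearance_um(working_distance: float, field_curvature: float) -> float:
    """Distance between the GRIN front end and structures imaged at the FOV edge."""
    return working_distance - field_curvature


clearance = {name: edge_clearance_um(**values) for name, values in PROBES_UM.items()}
in_reaction_zone = {name: d < REACTION_ZONE_UM for name, d in clearance.items()}

for name, distance in clearance.items():
    print(f"{name}: edge of FOV is {distance:.0f} um from the GRIN front end; "
          f"within assumed reaction zone: {in_reaction_zone[name]}")
```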

      The lenses appear to be corrected for monochromatic light; high-performance microscopes are generally achromatic. Is the bandwidth of two-photon excitation sufficient to warrant optimization over multiple wavelengths?

      Thanks for this comment. All optical simulations described in the first submission were performed at a fixed wavelength (λ = 920 nm). Following the Referee’s request, we explored the effect of changing wavelength on the Strehl ratio using new optical simulations. We found that the Strehl ratio remains > 0.8 at least within ± 10 nm from λ = 920 nm (new Supplementary Figure 1A-D, left panels), which covers the limited bandwidth of our femtosecond laser. Moreover, these simulations demonstrate that, on a much wider wavelength range (800 - 1040 nm), high Strehl ratio is obtained, but at different z planes (new Supplementary Figure 1A-D, right panels). This means that the corrective lens is working as expected also for wavelengths which are different from 920 nm, with different wavelengths having the most enlarged FOV located at different working distances. These new results are now described on page 7 (lines 8-10).

      GRIN lenses are often used to access a 3D volume by scanning in z (including in this study). How does the corrective lens affect imaging performance over the 3D field of view?

      The optical simulations we did to design the corrective lenses were performed maximizing aberration correction only in the focal plane of the endoscope. Following the Referee’s comment, we explored the effect of aberration correction outside the focal plane using new optical simulations. In corrected endoscopes, we found that for off-axis rays (radial distance from the optical axis > 40 μm) the Strehl ratio was > 0.8 (Maréchal criterion) in a larger volume compared to uncorrected endoscopes (new Supplementary Figure 2), demonstrating that the aberration correction method developed in this study does extend beyond the focal plane for short distances. For example, at a radial distance of ~ 90 μm from the optical axis, the axial range in which the Strehl ratio was > 0.8 in corrected endoscopes was 28 μm and 19 μm for the 6.4 mm- and the 8.8 mm-long microendoscope, respectively. These new results are now described on page 7 (lines 10-19).

      (4) The in vivo images (Figure 7D) have a less impressive resolution and field than the ex vivo images (Figure 4B), and the reason for this is not clear. Given the difference in performance, how does this compare to an uncorrected endoscope in the same preparation? Is the reduced performance related to uncorrected motion, field curvature, working distance, etc?

      In comparing images in Figure 4B with images shown in Figure 7D, the following points should be considered:

      (1) Figure 4B is a maximum fluorescence intensity projection of multiple axial planes of a z-stack acquired through a thin brain slice (slice thickness: 50 µm) using 8 frame averages for each plane. In contrast, images in Figure 7D are median projection of a t-series acquired on a single plane in the awake mouse at 30 Hz resonant scanning imaging (8 min, 14,400 frames).

      (2) Images of the fixed brain slice in Figure 4B were acquired at 1024 pixels x 1024 pixels resolution, nominal pixel size 0.45 µm/pixel, and with objective NA = 0.50, whereas in vivo images in Figure 7D were acquired at 512 pixels x 512 pixels resolution, nominal pixel size 0.72 - 0.84 µm/pixel, and with objective NA = 0.45.

      (3) In the in vivo preparation (Figure 7D), excitation and emission light travel through > 180 µm of scattering and absorbing brain tissue, reducing spatial resolution and the SNR of the collected fluorescence signal.

      (4) By shifting the sample in the x, y plane, in Figure 4B we could choose a FOV containing homogeneously stained cells. Shifting in x, y and selecting across multiple FOVs was not possible in vivo, as the GRIN lens was cemented on the animal skull.

      (5) Images in Figure 7D were motion corrected, but we cannot exclude that part of the decrease in resolution observed in Figure 7D when compared to images in Figure 4B is due to incomplete correction of motion artifacts.

      For all the reasons listed above, we believe that lower resolution and contrast are to be expected in images recorded in vivo (Figure 7D) compared to images acquired in fixed tissue (Figure 4B).
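
      The acquisition parameters listed in points (1) and (2) can be cross-checked with simple arithmetic: the frame count follows from duration and frame rate, and the nominal side of the imaged area follows from pixel count and pixel size. The sketch below uses hypothetical function names and assumes square pixels and a constant frame rate.

```python
def frame_count(duration_min: float, frame_rate_hz: float) -> int:
    """Number of frames in a t-series acquired at a constant frame rate."""
    return round(duration_min * 60.0 * frame_rate_hz)


def image_side_um(n_pixels: int, pixel_size_um: float) -> float:
    """Nominal side length of a square image with square pixels."""
    return n_pixels * pixel_size_um


in_vivo_frames = frame_count(8, 30)           # Figure 7D: 8 min at 30 Hz
fixed_side = image_side_um(1024, 0.45)        # Figure 4B
in_vivo_side_min = image_side_um(512, 0.72)   # Figure 7D, smallest pixel size
in_vivo_side_max = image_side_um(512, 0.84)   # Figure 7D, largest pixel size

print(in_vivo_frames)
print(f"fixed slice: {fixed_side:.1f} um; "
      f"in vivo: {in_vivo_side_min:.1f}-{in_vivo_side_max:.1f} um")
```

      The in vivo images therefore sample a comparable nominal area with roughly half as many pixels per side, consistent with the coarser pixel size noted in point (2).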

      Regarding the question of how images from uncorrected and corrected endoscopes compare in vivo, we think that this comparison is better performed in fixed tissue (Figure 4) or in simulated calcium data (Figures 5-6), rather than in in vivo recordings (Figure 7). In fact, in the brain of living mice, motion artifacts, changes in fluorophore expression level, and variation in the optical properties of the brain (e.g., the presence of a blood vessel over the FOV) may make the comparison of images acquired with uncorrected and corrected microendoscopes difficult, requiring a large number of animals to cancel out the contributions of these factors. Comparing optical properties in fixed tissue is, in contrast, free of these confounding factors. Moreover, the major advantage of quantifying how the optical properties of uncorrected and corrected endoscopes affect the ability to extract information about neuronal activity in simulated calcium data is that, under simulated conditions, we can count on a known ground truth as reference (e.g., how many neurons are in the FOV, where they are, and what their electrical activity is). This is clearly not possible in the in vivo recordings.

      Regarding Figure 7, there is no analysis of the biological significance of the calcium signals or even a description of where olfactory stimuli were presented.

      We appreciate the Reviewer pointing out the lack of detailed analysis regarding the biological significance of the calcium signals and the presentation of olfactory stimuli in Figure 7. Our initial focus was on demonstrating the effectiveness of the optimized GRIN lenses for imaging deep brain areas like the piriform cortex, with an emphasis on the improved signal-to-noise ratio (SNR) these lenses provide. However, we agree that including more context about the experimental conditions would enhance the manuscript. To address this point, we added a new panel (Figure 7F) showing calcium transients aligned with the onset of olfactory stimulus presentations, which are now indicated by shaded light blue areas. Additionally, we have specified the timing of each stimulus presented in Figure 7E. This revision allows readers to better understand the relationship between the calcium signals and the olfactory stimuli.

      The timescale of jGCaMP8f signals in Figure 7E is uncharacteristically slow for this indicator (compared to Zhang et al 2023 (Nature)), though perhaps this is related to the physiology of these cells or the stimuli.

      Regarding the timescale of the calcium signals observed in Figure 7E, we apologize for the confusion caused by a mislabeling we inserted in the original manuscript. The experiments presented in Figure 7 were conducted using jGCaMP7f, not jGCaMP8f as previously stated (both indicators were used in this study but in separate experiments). We have corrected this error in the Results section (caption of Figure 7D, E). It is important to note that jGCaMP7f has a longer half-decay time compared to jGCaMP8f, which could in part account for the slower decay kinetics observed in our data. Furthermore, the prolonged calcium signals can be attributed to the physiological properties of neurons in the piriform cortex. Upon olfactory stimulation, these neurons often fire multiple action potentials, resulting in extended calcium transients that can last several seconds. This sustained activity has been documented in previous studies, such as Roland et al. (eLife 2017, Figure 1C therein) in anesthetized animals and Wang et al. (Neuron 2020, Figure 1E therein) in awake animals, which report similar durations for calcium signals.

      (5) The claim of unprecedented spatial resolution across the FOV (page 18) is hard to evaluate and is not supported by references to quantitative comparisons. The promises of the method for future studies (pages 18-19) could also be better supported by analysis or experiment, but these are minor and to me, do not detract from the appeal of the work.

      GRIN lens-based imaging of piriform cortex in the awake mouse had already been done in Wang et al., Neuron 2020. The GRIN lens used in that work was NEM-050-50-00920-S-1.5p (GRINTECH, length: 6.4 mm; diameter: 0.5 mm), similar to the one that we used to design the 6.4 mm-long corrected microendoscope. Here we used a microendoscope specifically designed to correct off-axis aberrations and enlarge the FOV, in order to maximize the number of neurons recorded with the highest possible spatial resolution, while keeping tissue invasiveness to a minimum. Following the Referee’s comments, we revised the sentence at page 19 (lines 6-8 from bottom) as follows:

      “We used long corrected microendoscopes to measure population dynamics in the olfactory cortex of awake head-restrained mice with unprecedented combination of high spatial resolution across the FOV and minimal invasiveness(17)”.

      (6) The text is lengthy and the material is repeated, especially between the introduction and conclusion. Consolidating introductory material to the introduction would avoid diluting interesting points in the discussion.

      We thank the Reviewer for this comment. As suggested, we edited the Introduction and shortened the Discussion.

      Reviewer #2 (Public review):

      In this manuscript, the authors present an approach to correct GRIN lens aberrations, which primarily cause a decrease in signal-to-noise ratio (SNR), particularly in the lateral regions of the field-of-view (FOV), thereby limiting the usable FOV. The authors propose to mitigate these aberrations by designing and fabricating aspherical corrective lenses using ray trace simulations and two-photon lithography, respectively; the corrective lenses are then mounted on the back aperture of the GRIN lens.

      This approach was previously demonstrated by the same lab for GRIN lenses shorter than 4.1 mm (Antonini et al., eLife, 2020). In the current work, the authors extend their method to a new class of GRIN lenses with lengths exceeding 6 mm, enabling access to deeper brain regions such as the most ventral regions of the mouse brain. Specifically, they designed and characterized corrective lenses for GRIN lenses measuring 6.4 mm and 8.8 mm in length. Finally, they applied these corrected long micro-endoscopes to perform high-precision calcium signal recordings in the olfactory cortex.

      Compared with alternative approaches using adaptive optics, the main strength of this method is that it does not require hardware or software modifications, nor does it limit the system's temporal resolution. The manuscript is well-written, the data are clearly presented, and the experiments convincingly demonstrate the advantages of the corrective lenses.

      The implementation of these long corrected micro-endoscopes, demonstrated here for deep imaging in the mouse olfactory bulb, will also enable deep imaging in larger mammals such as rats or marmosets.

      We thank the Referee for the positive comments on our study. We address the points indicated by the Referee in the “Recommendation to the authors” section below.

      Reviewer #3 (Public review):

      Summary:

      This work presents the development, characterization, and use of new thin microendoscopes (500µm diameter) whose accessible field of view has been extended by the addition of a corrective optical element glued to the entrance face. Two micro endoscopes of different lengths (6.4mm and 8.8mm) have been developed, allowing imaging of neuronal activity in brain regions >4mm deep. An alternative solution to increase the field of view could be to add an adaptive optics loop to the microscope to correct the aberrations of the GRIN lens. The solution presented in this paper does not require any modification of the optical microscope and can therefore be easily accessible to any neuroscience laboratory performing optical imaging of neuronal activity.

      Strengths:

      (1) The paper is generally clear and well-written. The scientific approach is well structured and numerous experiments and simulations are presented to evaluate the performance of corrected microendoscopes. In particular, we can highlight several consistent and convincing pieces of evidence for the improved performance of corrected micro endoscopes:

      a) PSFs measured with corrected micro endoscopes 75µm from the centre of the FOV show a significant reduction in optical aberrations compared to PSFs measured with uncorrected micro endoscopes.

      b) Morphological imaging of fixed brain slices shows that optical resolution is maintained over a larger field of view with corrected micro endoscopes compared to uncorrected ones, allowing neuronal processes to be revealed even close to the edge of the FOV.

      c) Using synthetic calcium data, the authors showed that the signals obtained with the corrected microendoscopes have a significantly stronger correlation with the ground truth signals than those obtained with uncorrected microendoscopes.

      (2) There is a strong need for high-quality micro endoscopes to image deep brain regions in vivo. The solution proposed by the authors is simple, efficient, and potentially easy to disseminate within the neuroscience community.

      Weaknesses:

      (1) Many points need to be clarified/discussed. Here are a few examples:

      a) It is written in the methods: “The uncorrected microendoscopes were assembled either using different optical elements compared to the corrected ones or were obtained from the corrected probes after the mechanical removal of the corrective lens.”

      This is not very clear: the uncorrected microendoscopes are not simply the unmodified GRIN lenses?

      We apologize for not being clear enough on this point. Uncorrected microendoscopes are not simply unmodified GRIN lenses; rather, they are GRIN lenses attached to a round glass coverslip (thickness: 100 μm). The glass coverslip was included in ray-trace optical simulations of the uncorrected system, and this is the reason why commercial GRIN lenses and corresponding uncorrected microendoscopes have different working distances, as reported in Tables 2-3. To make the text clearer, we added the following sentence at page 27 (last 4 lines):

      “To evaluate the impact of corrective microlenses on the optical performance of GRIN-based microendoscopes, we also simulated uncorrected microendoscopes composed of the same optical elements of corrected probes (glass coverslip and GRIN rod), but in the absence of the corrective microlens”.

      b) In the results of the simulation of neuronal activity (Figure 5A, for example), the neurons in the center of the FOV have a very large diameter (of about 30µm). This should be discussed.

      Thanks for this comment. In synthetic calcium imaging t-series, cell radii were randomly sampled from a Gaussian distribution with mean = 10 µm and standard deviation (SD) = 3 µm. Both values were estimated from the literature (ref. no. 28: Suzuki & Bekkers, Journal of Neuroscience, 2011) as described in the Methods (page 35). In the image shown in Figure 5A, neurons near the center of the FOV have a radius of ~ 20 µm, corresponding to the right tail of the distribution (mean + 3SD = 19 µm). It is also important to note that, for corrected microendoscopes, neurons in the central portion of the FOV appear larger than cells located near the edges of the FOV, because the magnification depends on the distance from the optical axis (see Figure 3E, F) and near the center the magnification is > 1 for both microendoscope types.
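
      The sampling step described above can be sketched as follows. This is a minimal illustration, not the code used in the study: only the mean (10 µm) and SD (3 µm) come from the response, while the function name, the seed, and the number of cells are assumptions. The sketch also does not truncate the distribution, since the response does not state how non-physical draws were handled.

```python
import random

MEAN_RADIUS_UM = 10.0  # mean cell radius stated in the response
SD_RADIUS_UM = 3.0     # standard deviation stated in the response


def sample_cell_radii(n_cells, seed=0):
    """Draw cell radii (µm) from a Gaussian; hypothetical helper, no truncation."""
    rng = random.Random(seed)
    return [rng.gauss(MEAN_RADIUS_UM, SD_RADIUS_UM) for _ in range(n_cells)]


radii = sample_cell_radii(10_000)           # number of cells is an assumption
upper_tail = MEAN_RADIUS_UM + 3 * SD_RADIUS_UM  # mean + 3SD = 19 µm
fraction_above = sum(r > upper_tail for r in radii) / len(radii)
sample_mean = sum(radii) / len(radii)
```

      Under these assumptions, only a small fraction of draws exceeds mean + 3SD, which is consistent with large cells being rare but present in a densely populated simulated FOV.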

      Also, why is the optical resolution so low on these images?

      Images shown in Figure 5 are median fluorescence intensity projections of 5 minute-long simulated t-series. Simulated calcium data were generated with pixel size 0.8 μm/pixel and frame rate 30 Hz, similarly to in vivo recordings. In the simulations, pixels not belonging to any cell soma were assigned a value of background fluorescence randomly sampled from a normal distribution with mean and standard deviation estimated from experimental data, as described in the Methods section (page 37). To simulate activity, the mean spiking rate of neurons was set to 0.3 Hz, thus in a large fraction of frames neurons do not show calcium transients. Therefore, the median fluorescence intensity value of somata will be close to their baseline fluorescence value (F0). Since in simulations F0 values (~ 45-80 a.u.) were not much higher than the background fluorescence level (~ 45 a.u.), this may generate the appearance of a low-contrast image in Figure 5A. Finally, we suspect that PDF rendering also contributed to degrading the quality of those images. We will now submit high-resolution images alongside the PDF file.
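
      The argument that a median projection stays close to F0 when activity is sparse can be checked with a small sketch. The frame rate (30 Hz), duration (5 minutes), and mean spiking rate (0.3 Hz) are taken from the response; the baseline value, transient amplitude, decay constant, and the simple exponential transient model are assumptions chosen only for illustration.

```python
import math
import random
import statistics

FRAME_RATE_HZ = 30.0   # from the response
DURATION_S = 300.0     # 5 minute-long t-series, from the response
SPIKE_RATE_HZ = 0.3    # mean spiking rate, from the response
F0 = 60.0              # hypothetical baseline within the stated 45-80 a.u. range
AMPLITUDE = 60.0       # hypothetical transient amplitude (a.u.)
TAU_S = 0.3            # hypothetical exponential decay constant (s)


def simulate_trace(seed=0):
    """Simulate one soma: baseline plus exponentially decaying transients."""
    rng = random.Random(seed)
    n_frames = int(FRAME_RATE_HZ * DURATION_S)
    p_spike = SPIKE_RATE_HZ / FRAME_RATE_HZ          # spike probability per frame
    decay = math.exp(-1.0 / (FRAME_RATE_HZ * TAU_S))  # per-frame decay factor
    trace, transient = [], 0.0
    for _ in range(n_frames):
        transient *= decay
        if rng.random() < p_spike:
            transient += AMPLITUDE
        trace.append(F0 + transient)
    return trace


trace = simulate_trace()
median_f = statistics.median(trace)  # what a median projection would display
mean_f = statistics.mean(trace)      # pulled upward by the sparse transients
```

      With these toy parameters the median sits within a fraction of an a.u. of F0, whereas the mean is visibly higher, which is why a median projection shows somata at roughly their baseline brightness.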

      c) It seems that we can't see the same neurons on the left and right panels of Figure 5D. This should be discussed.

      The Referee is correct. When we intersected the simulated 3D volume of ground truth neurons with the focal surface of microendoscopes, the center of the FOV for the 8.8 mm-long corrected microendoscope was located at a larger depth than the FOV of the 8.8 mm-long uncorrected microendoscope. This effect was due to the larger field curvature of corrected 8.8 mm-long endoscopes compared to 8.8 mm-long uncorrected endoscopes. This is the reason why different neurons were displayed for uncorrected and corrected endoscopes in Figure 5D. We added this explanation in the text at page 37 (lines 1-4). The text reads:

      “Due to the stronger field curvature of the 8.8 mm-long corrected microendoscope (Figure 1C) compared to 8.8 mm-long uncorrected microendoscopes, the center of the corrected imaging focal surface resulted at a larger depth in the simulated volume compared to the center of the uncorrected focal surface(s). Therefore, different simulated neurons were sampled in the two cases”.

      d) It is not very clear to me why in Figure 6A, F the fraction of adjacent cell pairs that are more correlated than expected increases as a function of the threshold on peak SNR. The authors showed in Supplementary Figure 3B that the mean purity index increases as a function of the threshold on peak SNR for all micro endoscopes. Therefore, I would have expected the correlation between adjacent cells to decrease as a function of the threshold on peak SNR. Similarly, the mean purity index for the corrected short microendoscope is close to 1 for high thresholds on peak SNR: therefore, I would have expected the fraction of adjacent cell pairs that are more correlated than expected to be close to 0 under these conditions. It would be interesting to clarify these points.

      Thanks for raising this point. We defined the fraction of adjacent cell pairs more correlated than expected as the number of adjacent cell pairs more correlated than expected divided by the number of adjacent cell pairs. The reason why this fraction rises as a function of the SNR threshold is shown in Supplementary Figure 2 in the first submission (now Supplementary Figure 5). There, we separately plotted the number of adjacent cell pairs more correlated than expected (numerator) and the number of adjacent cell pairs (denominator) as a function of the SNR threshold. For both microendoscope types, we observed that the denominator decreased more rapidly with peak SNR threshold than the numerator. Therefore, the fraction of adjacent cell pairs more correlated than expected increases with the peak SNR threshold.

      To understand why the denominator decreases with SNR threshold, it should be considered that, due to the deterioration of spatial resolution and attenuation of fluorescent signal collection as a function of the radial distance from the optical axis (see for example fluorescent film profiles in Figure 3A, C), increasing the threshold on the peak SNR of extracted calcium traces implies limiting cell detection to those cells located within smaller distance from the center of the FOV. This information is shown in Figure 5C, F.

      In the manuscript text, this point is discussed at page 12 (lines 1-3 from bottom) and page 13 (lines 1-4):

      “The fraction of pairs of adjacent cells (out of the total number of adjacent pairs) whose activity correlated significantly more than expected increased as a function of the SNR threshold for corrected and uncorrected microendoscopes of both lengths (Fig. 6A, F). This effect was due to a larger decrease of the total number of pairs of adjacent cells as a function of the SNR threshold compared to the decrease in the number of pairs of adjacent cells whose activity was more correlated than expected (Supplementary Figure 5)”.
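
      The numerator/denominator argument can be made concrete with a short sketch. The counts below are purely hypothetical numbers chosen to illustrate the arithmetic; they are not data from the study, and the function name is an assumption.

```python
def fraction_more_correlated(n_correlated_pairs, n_adjacent_pairs):
    """Fraction of adjacent pairs that are more correlated than expected."""
    if n_adjacent_pairs == 0:
        return float("nan")  # undefined when no adjacent pairs survive the threshold
    return n_correlated_pairs / n_adjacent_pairs


# Hypothetical counts at increasing peak-SNR thresholds (illustration only).
snr_thresholds = [10, 15, 20]
n_adjacent = [200, 100, 40]    # denominator: falls quickly with the threshold
n_correlated = [20, 15, 10]    # numerator: falls more slowly

fractions = [
    fraction_more_correlated(num, den)
    for num, den in zip(n_correlated, n_adjacent)
]
```

      In this toy case both counts decrease with the threshold, yet the fraction rises from 0.10 to 0.25 because the denominator shrinks faster than the numerator.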

      e) Figures 6C, H: I think it would be fairer to compare the uncorrected and corrected endomicroscopes using the same effective FOV.

      To address the Reviewer’s concern, we repeated the linear regression of purity index as a function of the radial distance using the same range of radial distances for the uncorrected and corrected case of both microendoscope types. Below, we provide an updated version of Figure 6C, H for the referee’s perusal. Please note that the maximum value displayed on the x-axis of both graphs now corresponds to the minimum value between the two maximum radial distance values obtained in the uncorrected and corrected case (maximum radial distance displayed: 151.6 µm and 142.1 μm for the 6.4 mm- and the 8.8 mm-long GRIN rod, respectively). Using the same effective FOV, we found that the purity index drops significantly more rapidly with the radial distance for uncorrected microendoscopes compared to the corrected ones, similarly to what was observed in the original version of Figure 6. The values of the linear regression parameters and statistical significance of the difference between the slopes in the uncorrected and corrected cases are stated in the Author response image 3 caption below for both microendoscope types. In the manuscript, we would suggest to keep showing data corresponding to all detected cells, as we did in the original submission.

      Author response image 3.

      Linear regression of purity index as a function of the radial distance. A) Purity index of extracted traces with peak SNR > 10 was estimated using a GLM of ground truth source contributions and plotted as a function of the radial distance of cell identities from the center of the FOV for n = 13 simulated experiments with the 6.4 mm-long uncorrected (red) and corrected (blue) microendoscope. Black lines represent the linear regression of data ± 95% confidence intervals (shaded colored areas). Maximum value of radial distance displayed: 151.6 μm. Slopes ± standard error (s.e.): uncorrected, (-0.0015 ± 0.0002) µm⁻¹; corrected, (-0.0006 ± 0.0001) μm⁻¹. Uncorrected, n = 991; corrected, n = 1156. Statistical comparison of slopes, p < 10⁻¹⁰, permutation test. B) Same as (A) for n = 15 simulated experiments with the 8.8 mm-long uncorrected and corrected microendoscope. Maximum value of radial distance displayed: 142.1 μm. Slopes ± s.e.: uncorrected, (-0.0014 ± 0.0003) μm⁻¹; corrected, (-0.0010 ± 0.0002) µm⁻¹. Uncorrected, n = 718; corrected, n = 1328. Statistical comparison of slopes, p = 0.0082, permutation test.
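
      For readers unfamiliar with comparing regression slopes by permutation, one common scheme is sketched below. The caption does not describe the exact procedure used, so this is an assumption: group labels are shuffled across the pooled data and the absolute slope difference is recomputed each time. The toy data, noise level, and all function names are hypothetical; only the order of magnitude of the two slopes is borrowed from the caption. Note that shuffling labels tests the null hypothesis that the two groups are exchangeable, which includes equal intercepts as well as equal slopes.

```python
import random


def ols_slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def slope_of(points):
    """Slope for a list of (radial_distance, purity_index) pairs."""
    return ols_slope([p[0] for p in points], [p[1] for p in points])


def slope_permutation_test(group_a, group_b, n_perm=1000, seed=0):
    """Two-sided permutation p-value for the difference between two slopes."""
    rng = random.Random(seed)
    observed = abs(slope_of(group_a) - slope_of(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(slope_of(pooled[:n_a]) - slope_of(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0


def make_group(slope, n, rng):
    """Toy data: linear trend plus Gaussian noise (hypothetical values)."""
    return [(x, 1.0 + slope * x + rng.gauss(0.0, 0.02))
            for x in (rng.uniform(0.0, 150.0) for _ in range(n))]


demo_rng = random.Random(1)
uncorrected = make_group(-0.0015, 200, demo_rng)
corrected = make_group(-0.0006, 200, demo_rng)
p_value = slope_permutation_test(uncorrected, corrected)
```

      With a finite number of permutations the smallest attainable p-value is 1 / (n_perm + 1), so very small values such as those reported in the caption would require either many more permutations or a different way of estimating the tail.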

      f) Figure 7E: Many calcium transients have a strange shape, with a very fast decay following a plateau or a slower decay. Is this the result of motion artefacts or analysis artefacts?

      Thank you for raising this point about the unusual shapes of the calcium transients in Figure 7E. The observed rapid decay following a plateau or a slower decay is indeed a result of how the data were presented in the original submission. Our experimental protocol consisted of 22 s-long trials with an inter-trial interval of 10 s (see Methods section, page 44). In the original figure, data from multiple trials were concatenated, which led to artefactual time courses and apparent discontinuities in the calcium signals. To resolve this issue, we revised Figure 7E to accurately represent individual concatenated trials. We also added a new panel (please see new Figure 7F) showing examples of single cell calcium responses in individual trials without concatenation, with annotations indicating the timing and identity of presented olfactory stimuli.

      Also, the duration of many calcium transients seems to be long (several seconds) for GCaMP8f. These points should be discussed.

      Regarding the timescale of the calcium signals observed in Figure 7E, we apologize for the confusion caused by a mislabeling we inserted in the manuscript. The experiments presented in Figure 7 were conducted using jGCaMP7f, not jGCaMP8f as previously stated (both indicators were used in this study, but in separate experiments). We have corrected this error in the Results section (caption of Figure 7D, E). It is important to note that jGCaMP7f has a longer half-decay time compared to jGCaMP8f, which could in part account for the slower decay kinetics observed in our data. Furthermore, the prolonged calcium signals can be attributed to the physiological properties of neurons in the piriform cortex. Upon olfactory stimulation, these neurons often fire multiple action potentials, resulting in extended calcium transients that can last several seconds. This sustained activity has been documented in previous studies, such as Roland et al. (eLife 2017, Figure 1C therein) in anesthetized animals and Wang et al. (Neuron 2020, Figure 1E therein) in awake animals, which report similar durations for calcium signals. We cite these references in the text. We believe that these revisions and clarifications address the Reviewer's concern and enhance the overall clarity of our manuscript.

      g) The authors do not mention the influence of the neuropil on their data. Did they subtract the neuropil's contribution to the signals from the somata? It is known from the literature that the presence of the neuropil creates artificial correlations between neurons, which decrease with the distance between the neurons (Grødem, S., Nymoen, I., Vatne, G.H. et al. An updated suite of viral vectors for in vivo calcium imaging using intracerebral and retro-orbital injections in male mice. Nat Commun 14, 608 (2023). https://doi.org/10.1038/s41467-023-363243; Keemink SW, Lowe SC, Pakan JMP, Dylda E, van Rossum MCW, Rochefort NL. FISSA: A neuropil decontamination toolbox for calcium imaging signals. Sci Rep. 2018 Feb 22;8(1):3493. doi: 10.1038/s41598-018-21640-2. PMID: 29472547; PMCID: PMC5823956)

      This point should be addressed.

      We apologize for not being clear enough in our previous version of the manuscript. The neuropil was subtracted from calcium traces both in simulated and experimental data. Please note that instead of using the term “neuropil”, we used the word “background”. We decided to use the more general term “background” because it also applies to the case of synthetic calcium t-series, where neurons were modeled as spheres devoid of processes. The background subtraction is described in the Methods on page 39:

      “F(t) was computed frame-by-frame as the difference between the average signal of pixels in each ROI and the background signal. The background was calculated as the average signal of pixels that: i) did not belong to any bounding box; ii) had intensity values higher than the mean noise value measured in pixels located at the corners of the rectangular image, which do not belong to the circular FOV of the microendoscope; iii) had intensity values lower than the maximum value of pixels within the boxes”.
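
      The three selection criteria quoted above can be sketched in a few lines. This is an illustrative reading of the Methods text, not the authors' implementation: the function names, the half-open box convention, the size of the corner patches, and the choice to evaluate the background separately on each frame are all assumptions.

```python
def background_level(frame, boxes, corner=2):
    """Mean of pixels that pass criteria i)-iii) in a single frame.

    frame:  2D list of intensities (rows x columns)
    boxes:  bounding boxes as (row0, row1, col0, col1), half-open (assumed)
    corner: side of the square corner patches used to estimate noise (assumed)
    """
    n_rows, n_cols = len(frame), len(frame[0])
    in_box = {(r, c)
              for r0, r1, c0, c1 in boxes
              for r in range(r0, r1)
              for c in range(c0, c1)}
    edge_rows = list(range(corner)) + list(range(n_rows - corner, n_rows))
    edge_cols = list(range(corner)) + list(range(n_cols - corner, n_cols))
    corner_px = [frame[r][c] for r in edge_rows for c in edge_cols]
    noise_mean = sum(corner_px) / len(corner_px)     # criterion ii) threshold
    box_max = max(frame[r][c] for r, c in in_box)    # criterion iii) threshold
    candidates = [frame[r][c]
                  for r in range(n_rows) for c in range(n_cols)
                  if (r, c) not in in_box                  # i) outside all boxes
                  and noise_mean < frame[r][c] < box_max]  # ii) and iii)
    return sum(candidates) / len(candidates)


def roi_signal(frame, roi_pixels, boxes, corner=2):
    """F(t) for one frame: mean ROI intensity minus the background level."""
    roi_mean = sum(frame[r][c] for r, c in roi_pixels) / len(roi_pixels)
    return roi_mean - background_level(frame, boxes, corner)


# Toy 8 x 8 frame: dark corners (outside the circular FOV), uniform background,
# and one bright 2 x 2 cell inside its bounding box. All values are made up.
frame = [[5.0] * 8 for _ in range(8)]
for r in (0, 1, 6, 7):
    for c in (0, 1, 6, 7):
        frame[r][c] = 1.0
for r in (3, 4):
    for c in (3, 4):
        frame[r][c] = 20.0
boxes = [(3, 5, 3, 5)]
roi = [(3, 3), (3, 4), (4, 3), (4, 4)]
f_t = roi_signal(frame, roi, boxes)
```

      In the toy frame the corner pixels are excluded by criterion ii), the cell pixels by criterion i), and the remaining pixels define the background that is subtracted from the ROI mean.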

      h) Also, what are the expected correlations between neurons in the pyriform cortex? Are there measurements in the literature with which the authors could compare their data?

      We appreciate the reviewer's interest in the correlations between neurons in the piriform cortex. The overall low correlations between piriform neurons we observed (Figure 8) are consistent with a published study describing ‘near-zero noise correlations during odor inhalation’ in the anterior piriform cortex of rats, based on extracellular recordings (Miura et al., Neuron 2013). However, to the best of our knowledge, measurements directly comparable to ours have not been described in the literature. Recent analyses of the correlations between piriform neurons were restricted to odor exposure windows, with the goal to quantify odor-specific activation patterns (e.g. Roland et al., eLife 2017; Bolding et al., eLife 2017, Pashkovski et al., Nature 2020; Wang et al., Neuron 2020). Here, we used correlation analyses to characterize the technical advancement of the optimized GRIN lens-based endoscopes. We showed that correlations of pairs of adjacent neurons were independent from radial distance (Figure 8B), highlighting homogeneous spatial resolution in the field of view.

      (2) The way the data is presented doesn't always make it easy to compare the performance of corrected and uncorrected lenses. Here are two examples:

      a) In Figures 4 to 6, it would be easier to compare the FOVs of corrected and uncorrected lenses if the scale bars (at the centre of the FOV) were identical. In this way, the neurons at the centre of the FOV would appear the same size in the two images, and the distances between the neurons at the centre of the FOV would appear similar. Here, the scale bar is significantly larger for the corrected lenses, which may give the illusion of a larger effective FOV.

      We appreciate the Referee’s comment. Below, we explain why we believe that the way we currently present imaging data in the manuscript is preferable:

      (1) current figures show images of the acquired FOV as they are recorded from the microscope (raw data), without rescaling. In this way, we exactly show what potential users will obtain when using a corrected microendoscope.

      (2) In the current version of the figures, the fact that the pixel size is not homogeneous across the FOV, nor equal between uncorrected and corrected microendoscopes, is initially shown in Figure 3E, F and then explicitly stated throughout the manuscript when images acquired with a corrected microendoscope are shown.

      (3) Rescaling images acquired with the corrected endoscopes gives the impression that the acquisition parameters were different between acquisitions with the corrected and uncorrected microendoscopes, which was not the case.

      Importantly, the larger FOV of the corrected microendoscope, which is one of the important technological achievements presented in this study, can be appreciated in the images regardless of the presentation format.

      b) In Figures 3A-D it would be more informative to plot the distances in microns rather than pixels. This would also allow a better comparison of the micro endoscopes (as the pixel sizes seem to be different for the corrected and uncorrected micro endoscopes).

      The Referee is correct that the pixel size is different between the corrected and uncorrected probes. This is because of the different magnification factor introduced by the corrective microlens, as described in Figure 3E, F. The rationale for showing images in Figure 3A-D in pixels rather than microns is the following:

      (1) Optical simulations in Figure 1 suggest that a corrective optical element is effective in compensating for some of the optical aberrations in GRIN microendoscopes.

      (2) After fabricating the corrective optical element (Figure 2), in Figure 3A-D we conduct a preliminary analysis of the effect of the corrective optical element on the optical properties of the GRIN lens. We observed that the microfabricated optical element corrected for some aberrations (e.g., astigmatism), but also that the microfabricated optical element was characterized by significant field curvature. This can be appreciated showing distances in pixels.

      (3) The observed field curvature and the aspherical profile of the corrected lens prompted us to characterize the magnification factor of the corrected endoscopes as a function of the radial distance. We found that the magnification factor changed as a function of the radial distance (Figure 3E-F) and that pixel size was different between uncorrected and corrected endoscopes. We also observed that, in corrected endoscopes, pixel size was a function of the radial distance (Figure 3E-F).

      (4) Once all of the above was established and quantified, we assigned precise pixel size to images of uncorrected and corrected endoscopes and we show all following images of the study (Figure 3G on) using a micron (rather than pixel) scale.

      (3) There seems to be a discrepancy between the performance of the long lenses (8.8 mm) in the different experiments, which should be discussed in the article. For example, the results in Figure 4 show a considerable enlargement of the FOV, whereas the results in Figure 6 show a very moderate enlargement of the distance at which the Pearson's correlation with the first ground truth emitter starts to drop.

      Thanks for raising this point and helping us clarify data presentation. Images in Figure 4B are average z-projections of z-stacks acquired through a mouse fixed brain slice and they were taken with the purpose of showing all the neurons that could be visualized from the same sample using an uncorrected and a corrected microendoscope. In Figure 4B, all illuminated neurons are visible regardless of whether they were imaged with high axial resolution (e.g., < 10 µm as defined in Figure 3J) or poor axial resolution. In contrast, in Figure 6J we evaluated the correlation between the calcium trace extracted from a given ROI and the real activity trace of the first simulated ground truth emitter for that specific ROI. The moderate increase in the correlation for the corrected microendoscope compared to the uncorrected microendoscope (Figure 6J) is consistent with the moderate improvement in the axial resolution of the corrected probe compared to the uncorrected probe at intermediate radial distances (60-100 µm from the optical axis, see Figure 3J). We added a paragraph in the Results section (page 14, lines 8-18) to summarize the points described above.

      a) There is also a significant discrepancy between measured and simulated optical performance, which is not discussed. Optical simulations (Figure 1) show that the useful FOV (defined as the radius for which the size of the PSF along the optical axis remains below 10µm) should be at least 90µm for the corrected microendoscopes of both lengths. However, for the long microendoscopes, Figure 3J shows that the axial resolution at 90µm is 17µm. It would be interesting to discuss the origin of this discrepancy: does it depend on the microendoscope used?

      As the Reviewer correctly pointed out, the size of simulated PSFs at a given radial distance (e.g., 90 µm) tends to be generally smaller than that of the experimentally measured PSFs. This might be due to multiple reasons:

      (1) simulated PSFs are excitation PSFs, i.e. they describe the intensity spatial distribution of focused excitation light. On the contrary, measured PSFs result from the excitation and emission process, thus they are also affected by aberrations of light emitted by fluorescent beads and collected by the microscope.

      (2) in the optical simulations, the Zemax file of the GRIN lenses contained first-order aberrations. High-order aberrations were therefore not included in simulated PSFs.

      (3) intrinsic variability of experimental measurements (e.g., intrinsic variability of the fabrication process, alignment of the microendoscope to the optical axis of the microscope, the distance between the GRIN back end and the objective…) are not considered in the simulations.

      We added a paragraph in the Discussion section (page 17, lines 9-18) summarizing the abovementioned points.

      Are there inaccuracies in the construction of the aspheric corrective lens or in the assembly with the GRIN lens? If there is variability between different lenses, how are the lenses selected for imaging experiments?

      The fabrication yield, i.e. the yield of generating the corrective lenses, using molding was ~ 90% (N > 30 molded lenses). The main limitation of this procedure was the formation of air bubbles between the mold negative and the glass coverslip. Molded lenses were visually inspected with the stereoscope and, in case of air bubble formation, they were discarded.

      The assembly yield, i.e. the yield of correct positioning of the GRIN lens with respect to the coverslip, was 100 % (N = 27 endoscopes).

      We added this information in the Methods at page 29 (lines 1-12), as follows:

      “After UV curing, the microlens was visually inspected at the stereomicroscope. In case of formation of air bubbles, the microlens was discarded (yield of the molding procedure: ~ 90 %, N > 30 molded lenses). The coverslip with the attached corrective lens was sealed to a customized metal or plastic support ring of appropriate diameter (Fig. 2C). The support ring, the coverslip and the aspherical lens formed the upper part of the corrected microendoscope, to be subsequently coupled to the proper GRIN rod (Table 2) using a custom-built opto-mechanical stage and NOA63 (Fig. 2C) 7. The GRIN rod was positioned perpendicularly to the glass coverslip, on the other side of the coverslip compared to the corrective lens, and aligned to the aspherical lens perimeter (Fig. 2C) under the guidance of a wide field microscope equipped with a camera. The yield of the assembly procedure for the probes used in this work was 100 % (N = 27 endoscopes). For further details on the assembly of corrected microendoscope see(7)”.

      Reviewer #1 (Recommendations for the authors):

      (1) Page 4, what is meant by "ad-hoc" in describing software control?

      With “ad-hoc” we meant “specifically designed”. We revised the text to make this clear.

      (2) It was hard to tell how the PSF was modeled for the simulations (especially on page 34, describing the two spherical shells of the astigmatic PSF and ellipsoids modeled along them). Images or especially videos that show the modeling would make this easier to follow.

      Simulated calcium t-series were generated following previous work by our group (Antonini et al., eLife 2020), as stated in the Methods on page 37 (line 5). In Figure 4A of Antonini et al. eLife 2020, we provided a schematic to visually describe the procedure of simulated data generation. In the present paper, we decided not to include a similar drawing and cite the eLife 2020 article to avoid redundancy.

      (3) Some math symbols are missing from the methods in my version of the text (page 36/37).

      We apologize for the inconvenience. This issue arose in the PDF conversion of our Word document and we did not spot it at the time of submission. We will now make sure the PDF version of our manuscript correctly reports symbols and equations.

      (4) The Z extent of stacks (i.e. number of steps) used to generate images in Figure 4 is missing.

      We thank the Reviewer for the comment and we now revised the caption of Figure 4 and the Methods section as follows:

      “Figure 4. Aberration correction in long GRIN lens-based microendoscopes enables high-resolution imaging of biological structures over enlarged FOVs. A) jGCaMP7f-stained neurons in a fixed mouse brain slice were imaged using 2PLSM (λexc = 920 nm) through an uncorrected (left) and a corrected (right) microendoscope based on the 6.4 mm-long GRIN rod. Images are maximum fluorescence intensity (F) projections of a z-stack acquired with a 5 μm step size. Number of steps: 32 and 29 for uncorrected and corrected microendoscope, respectively. Scale bars: 50 μm. Left: the scale applies to the entire FOV. Right, the scale bar refers only to the center of the FOV; off-axis scale bar at any radial distance (x and y axes) is locally determined multiplying the length of the drawn scale bar on-axis by the corresponding normalized magnification factor shown in the horizontal color-coded bar placed below the image (see also Fig. 3, Supplementary Table 3, and Materials and Methods for more details). B) Same results for the microendoscope based on the 8.8 mm-long GRIN rod. Number of steps: 23 and 31 for uncorrected and corrected microendoscope, respectively”.

      We also modified the text in the Methods (page 35, lines 1-2):

      “(1024 pixels x 1024 pixels resolution; nominal pixel size: 0.45 µm/pixel; axial step: 5 µm; number of axial steps: 23-32; frame averaging = 8)”.

      (5) Overall, the text is wordy and a bit repetitive and could be cut down significantly in length without loss of clarity. This is true throughout, but especially when comparing the introduction and discussion.

      We edited the text (Discussion and Introduction), as suggested by the Reviewer.

      (6) Although I don't think it's necessary, I would advise including comparison data with an uncorrected endoscope in the same in vivo preparation.

      We thank the Referee for the suggestion. Below, we list the reasons why we decided not to perform the comparison between the uncorrected and corrected endoscopes in the in vivo preparation:

      (1) We believe that the comparison between uncorrected and corrected endoscopes is better performed in fixed tissue (Figure 4) or in simulated calcium data (Figure 5-6), rather than in vivo recordings (Figure 7). In fact, in the brain of living mice motion artifacts, changes in fluorophore expression level, variation in the optical properties of the brain (e.g., the presence of a blood vessel over the FOV) may make the comparison of images acquired with uncorrected and corrected microendoscopes difficult, requiring a large number of animals to cancel out the contributions of all these factors. Comparing optical properties in fixed tissue is, in contrast, devoid of these confounding factors.

      (2) A major advantage of quantifying how the optical properties of uncorrected and corrected endoscopes affect the ability to extract information about neuronal activity in simulated calcium data is that, under simulated conditions, we can rely on a known ground truth as reference (e.g., how many neurons are in the FOV, where they are, and what their electrical activity is). This is clearly not possible under in vivo conditions.

      (3) The proposed experiment requires imaging the awake mouse with a corrected microendoscope, then anesthetizing the animal to carefully remove the corrective microlens using forceps, and finally repeating the optical recordings in the awake mouse with the uncorrected microendoscope. Although this is feasible (we performed the proposed experiment in Antonini et al. eLife 2020 using a 4.1 mm-long microendoscope), the success rate of these experiments is low. The low yield is due to the fact that the mechanical force applied on top of the microendoscope to remove the corrective microlens may move the GRIN lens inside the brain, in both vertical and horizontal directions. This can randomly result in a change of the focal plane, death or damage of the cells, tissue inflammation, and bleeding. From our own experience, the number of animals needed for this experiment is expected to be high.

      Reviewer #2 (Recommendations for the authors):

      Below, I provide a few minor corrections and suggestions for the authors to consider before final submission.

      (1) Page 5: when referring to Table 1 maybe add "Table 1 and Methods".

      Following the Reviewer’s comment, we revised the text at page 6 (lines 4-5 from bottom) as follows:

      “(see Supplementary Table 1 and Materials and Methods for details on simulation parameters)”.

      (2) Page 8: "We set a threshold of 10 µm on the axial resolution to define the radius of the effective FOV (corresponding to the black triangles in Fig. 3I, J) in uncorrected and corrected microendoscopes. We observed an enlargement of the effective FOV area of 4.7 times and 2.3 times for the 6.4 mm-long micro endoscope and the 8.8 mm-long micro endoscope, respectively (Table 1). These findings were in agreement with the results of the ray-trace simulations (Figure 1) and the measurement of the subresolved fluorescence layers (Figure 3AD)." I could not find the information given in this paragraph, specifically:

      a) Upon examining the black triangles in Figure 3I and J, the enlargement of the effective FOV does not appear to be 4.7 and 2.3 times.

      In Figure 3I, J, black triangles mark the intersections between the curves fitting the data and the threshold of 10 µm on the axial resolution. The values on the x-axis corresponding to the intersections (Table 1, “Effective FOV radius”) represent the estimated radius of the effective FOV of the probes, i.e. the radius within which the microendoscope has spatial resolution below the threshold of 10 μm. The ratios of the effective FOV radii are 2.17 and 1.53 for the 6.4 mm- and the 8.8 mm-long microendoscope, respectively, which correspond to 4.7 and 2.3 times larger FOV (Table 1). To make this point clearer, we modified the indicated sentence as follows (page 10, lines 3-11 from bottom):

      “We set a threshold of 10 µm on the axial resolution to define the radius of the effective FOV (corresponding to the black triangles in Fig. 3I, J) in uncorrected and corrected microendoscopes. We observed a relative increase of the effective FOV radius of 2.17 and 1.53 for the 6.4 mm- and the 8.8 mm-long microendoscope, respectively (Table 1). This corresponded to an enlargement of the effective FOV area of 4.7 times and 2.3 times for the 6.4 mm-long microendoscope and the 8.8 mm-long microendoscope, respectively (Table 1). These findings were in agreement with the results of the ray-trace simulations (Figure 1) and the measurement of the subresolved fluorescence layers (Figure 3A-D)."
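      The step from radius ratios to area enlargements can be checked with a short calculation. The sketch below assumes a circular effective FOV, so that area scales with the square of the radius; the function name `area_enlargement` is ours and is used only for illustration.

```python
# Effective-FOV radius ratios (corrected / uncorrected) quoted in the response.
radius_ratio = {"6.4 mm": 2.17, "8.8 mm": 1.53}


def area_enlargement(r_ratio: float) -> float:
    """Area ratio of two circular FOVs whose radii differ by r_ratio."""
    return r_ratio ** 2


for probe, ratio in radius_ratio.items():
    area = area_enlargement(ratio)
    print(f"{probe}-long probe: radius x{ratio:.2f} -> area x{area:.1f}")
```

      Squaring 2.17 and 1.53 gives about 4.7 and 2.3, which matches the area enlargements reported in Table 1.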

      b) I do not understand how the enlargements in Figure 3I and J align with the ray trace simulations in Figure 1, indicating an enlargement of 5.4 and 5.6.

      In Figure 1C, E of the first submission we showed the Strehl ratio of focal spots focused after the microendoscope, in the object plane, as a function of radial distance from the optical axis of focal spots focused in the focal plane at the back end of the GRIN rod (“Objective focal plane” in Figure 1A, B), before the light has traveled along the GRIN lens. After reading the Referee’s comment, we realized this choice does not facilitate the comparison between Figure 1 and Figure 3I, J. We therefore decided to modify Figure 1C, E by showing the Strehl ratio of focal spots focused after the microendoscope as a function of their radial distance from the optical axis in the object plane (where the Strehl ratio is computed), after the light has traveled through the GRIN lens (radial distances are still computed on a plane, not along the curved focal surface represented by the “imaging plane” in Figure 1 A, B). Computing radial distances in the object space, we found that the relative increase in the radius of the FOV due to the correction of aberrations was 3.50 and 3.35 for the 6.4 mm- and the 8.8 mm-long microendoscope, respectively. We also revised the manuscript text accordingly (page 7, lines 6-8):

      “The simulated increase in the radius of the diffraction-limited FOV was 3.50 times and 3.35 times for the 6.4 mm-long and 8.8 mm-long probe, respectively (Fig. 1C, E)”. We believe this change should facilitate the comparison of the data presented in Figure 1 and Figure 3.

      Moreover, in comparing results in Figure 1 and Figure 3, it is important to keep in mind that:

      (1) the definitions of the effective FOV radius were different in simulations (Figure 1) and real measurements (Figure 3). In simulations, we considered a theoretical criterion (Maréchal criterion) and set the lower threshold for a diffraction-limited FOV to a Strehl ratio value of 0.8. In real measures, the effective FOV radius obtained from fluorescent bead measurements was defined based on the empirical criterion of setting the upper threshold for the axial resolution to 10 µm.

      (2) the Zemax file of the GRIN lenses contained low-order aberrations and not high-order aberrations.

      (3) the small variability in some of the experimental parameters (e.g., the distance between the GRIN back end and the focusing objective) was not reflected in the simulations.

      Given the reasons listed above, it is expected that the predictions of the simulations do not perfectly match the experimental measurements and tend to indicate larger improvements from aberration correction than those measured experimentally.
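      For readers less familiar with the Strehl-ratio threshold used in the simulations, the sketch below converts a Strehl ratio of 0.8 into an RMS wavefront error. It assumes the extended Maréchal approximation, S ≈ exp(-(2πσ)²) with σ in waves. This is a textbook relation and not necessarily how the ray-tracing software computes the Strehl ratio, and the function names are ours.

```python
import math


def strehl_from_rms(sigma_waves: float) -> float:
    """Extended Marechal approximation: S ~ exp(-(2*pi*sigma)^2), sigma in waves RMS."""
    return math.exp(-(2.0 * math.pi * sigma_waves) ** 2)


def rms_for_strehl(strehl: float) -> float:
    """Invert the approximation above to get the RMS wavefront error in waves."""
    return math.sqrt(-math.log(strehl)) / (2.0 * math.pi)


wavelength_nm = 920.0  # excitation wavelength quoted in the response
sigma = rms_for_strehl(0.8)
print(f"RMS wavefront error at S = 0.8: {sigma:.3f} waves "
      f"({sigma * wavelength_nm:.0f} nm at {wavelength_nm:.0f} nm)")
```

      Under this approximation the 0.8 threshold corresponds to an RMS error of several hundredths of a wave. The 10 µm axial-resolution threshold used for the bead measurements is a different kind of criterion, so the two thresholds need not give the same FOV radius.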

      c) Finally, how can the enlargement in Figure 3I be compared to the measurements of the sub-resolved fluorescence layers in Figures 3A-D? Could the authors please clarify these points?

      When comparing measurements of subresolved fluorescent films and beads it is important to keep in mind that the two measures have different purposes and spatial resolution. We used subresolved fluorescent films to visualize the shape and extent of the focal surface of microendoscopes in a continuous way along the radial dimension (in contrast to bead measurements that are quantized in space). This approach comes at the cost of spatial resolution, as we are using fluorescent layers, which are subresolved in the axial but not in the radial dimension. Therefore, fluorescent film profiles are not used in our study to extract relevant quantitative information about effective FOV enlargement or spatial resolution of corrected microendoscopes. In contrast, to quantitatively characterize axial and lateral resolutions we used measurements of 100 nm-diameter fluorescent beads (therefore subresolved in the x, y, and z dimensions) located at different radial distances from the center of the FOV, using a much smaller nominal pixel size compared to the fluorescent films (beads, lateral resolution: 0.049 µm/pixel, axial resolution: 0.5 µm/pixel; films, lateral resolution: 1.73 µm/pixel, axial resolution: 2 µm/pixel).

      (3) On page 15, the statement "significantly enlarge the FOV" should be more specific by providing the actual values for the increase. It would also be good to mention that this is not a xy lateral increase; rather, as one moves further from the center, more of the imaged cells belong to axially different planes.

      The values of the experimentally determined FOV enlargements (4.7 times and 2.3 times for 6.4 mm- and 8.8 mm-long microendoscope, respectively) are provided in Table 1 and are now referenced on page 10. Following the Referee’s request, we added the following sentence in the discussion (page 18, lines 10-14) to underline that the extended FOV samples on different axial positions because of the field curvature effect:

      “It must be considered, however, that the extended FOV achieved by our aberration correction method was characterized by a curved focal plane. Therefore, cells located in different radial positions within the image were located at different axial positions and cells at the border of the FOV were closer to the front end of the microendoscope”.

      (4) On page 36, most of the formulas appear to be corrupted. This may have occurred during the conversion to the merged PDF. Please verify this and check for similar problems in other equations throughout the text as well.

      We apologize for the inconvenience. This issue arose in the PDF conversion of our Word document and we did not spot it upon submission. We will now make sure the PDF version of our manuscript correctly reports symbols and equations.

      (5) In the discussion, the authors could potentially add comments on how the verified performance of the corrective lenses depends on the wavelength and mention the range within which the wavelength can be changed without the need to redesign a new corrective lens.

      Following this comment and those of other Reviewers, we explored the effect of changing wavelength on the Strehl ratio using new Zemax simulations. We found that the Strehl ratio remains > 0.8 within at least ±10 nm of λ = 920 nm (new Supplementary Figure 1A-D, left panels), which covers the limited bandwidth of our femtosecond laser. Moreover, these simulations demonstrate that, on a much wider wavelength range (800 - 1040 nm), a high Strehl ratio is obtained but at different z planes (new Supplementary Figure 1A-D, right panels). These new results are now described on page 7 (lines 8-10).

      (6) Also, they could discuss if and how the corrective lens could be integrated into fiberscopes for freely moving experiments.

      Following the Referee’s suggestion, we added a short text in the Discussion (page 21, lines 4-7 from bottom). It reads:

      “Another advantage of long corrected microendoscopes described here over adaptive optics approaches is the possibility to couple corrected microendoscopes with portable 2P microscopes(42-44), allowing high resolution functional imaging of deep brain circuits on an enlarged FOV during naturalistic behavior in freely moving mice”.

      (7) Finally, since the main advantage of this approach is its simplicity, the authors should also comment on or outline the steps to follow for potential users who are interested in using the corrective lenses in their systems.

      Thanks for this comment. The Materials and Methods section of this study and that of Antonini et al. eLife 2020 describe in detail the experimental steps that users need to follow to reproduce the corrective lenses and apply them to their own experimental configuration.

      Reviewer #3 (Recommendations for the authors):

      (1) Suggestions for improved or additional experiments, data, or analyses, and Recommendations for improving the writing and presentation:

      See Public Review.

      Please see our point-by-point response above.

      (2) Minor corrections on text and figures: a) Figure 6A: is the fraction of cells expressed in %?

      Author response: yes, that is correct. Thank you for spotting it. We added the “%” symbol to the y label.

      b) Figure 8A, left: The second line is blue and not red dashed. In addition, it could be interesting to also show a line corresponding to the 0 value.

      Thank you for the suggestions. We modified Figure 8 according to the Referee’s comments.

      c) Some parts of equation (1) and some variables in the Material and Methods section are missing

      We apologize for the inconvenience. This issue arose in the PDF conversion of our Word document and we did not spot it upon submission. We will now make sure the PDF version of our manuscript correctly reports symbols and equations.

      d) In the methods, the authors mention a calibration ruler with ticks spaced every 10 µm along two orthogonal directions and refer to the following product: 4-dot calibration slide, Cat. No. 1101002300142, Motic, Hong Kong. However, this product does not seem to correspond to a calibration ruler.

      We double-checked. The catalog number 1101002300142 is correct and product details can be found at the following link:

      https://moticmicroscopes.com/products/calibration-slide-4-dots-1101002300142?srsltid=AfmBOorGYx9PcXtAlIMmSs_tEpxS4nX21qIcV8Kfn4qGwizQK3LYOQn3

    1. Author Response

      The following is the authors’ response to the original reviews.

      We are grateful to the reviewers for their appreciation of our study and thoughtful comments. In response to the main concern raised by all reviewers regarding the potential influences of external noise factors on intuitive inference, such as external disturbances or imperfect observations, we have conducted three new experiments suggested by the reviewers. These experiments were designed to: (1) assess the influence of external forces on humans’ judgments by implementing a wall to block wind disturbances from one direction, (2) examine human accuracy in predicting the landing position of a falling ball when its trajectory is obscured, and (3) evaluate the effect of object geometry on human judgment of stability. The findings from these experiments consistently support our proposal of a stochastic world model on gravity embedded in the human mind. In addition, we have addressed the remaining comments from the reviewers point by point.

      Reviewer #1 (Recommendations For The Authors):

      As mentioned in the public review, I did not find it entirely convincing that the study shows evidence for a Gaussian understanding of gravity. There are two studies that would bolster this claim: 1. Replicate experiment 1, but also ask people to infer whether there was a hidden force. If people are truly representing gravity as proposed in the paper, you should get no force inferences. However, if the reason the Gaussian gravity model works is that people infer unseen forces, this should come out clearly in this study.

      Author response image 1.

      Wall experiment to test the impact of external forces on the measurement of stochastic gravity. (a) Experimental setting. We replicated the original setup with the addition of a wall implemented on one side. Left: the overall experimental scene; Right, the scene shown to participants. (b) Human behaviors. Three participants conducted this experiment, and their responses consistently showed normal distributions without any skewness, suggesting that their judgments were not affected by the presence of the wall. These results support our claim that humans’ judgments on stability were not affected by potential concerns regarding external forces.

      R1: We thank the reviewer for this suggestion. To directly test whether participants’ judgments were influenced by their implicit assumptions about external forces, we duplicated the original experimental setup with the addition of a wall implemented on one side (Supplementary Figure 4A). Before the start of the experiment, we explicitly informed the participants that the wall was designed to block wind, ensuring that any potential wind forces from the direction of the wall would not influence the collapse. If participants’ judgments were affected by external noise, we would expect to observe a skewed angle distribution. Contrary to this prediction, our results showed a normal distribution across all three participants tested (1 female; ages: 24-30), similar to the experiment without the wall (Supplementary Figure 4B). Therefore, the stochastic nature of intuitive inference on objects’ stability is embedded in the mind, not shaped by external forces or explicit instructions.

      This new experiment has been added to the revised manuscript:

      Line 166-168: “…, and remained unchanged with the addition of a wall on one side to block potential external disturbances from wind (Supplementary Figure 4).”
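      The prediction tested in the wall experiment (a skewed angle distribution if judgments depend on inferred external forces, a symmetric one otherwise) can be illustrated with a simple skewness statistic. The sketch below uses synthetic data only. The two generated samples, their spread of 20 degrees, and the helper `sample_skewness` are hypothetical and do not reproduce the participants' responses or the authors' analysis.

```python
import random


def sample_skewness(values):
    """Moment-based skewness: third central moment over variance**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(0)
# Hypothetical symmetric responses (no directional bias), in degrees.
symmetric = [rng.gauss(0.0, 20.0) for _ in range(5000)]
# Hypothetical one-sided responses, as if one direction were excluded.
one_sided = [abs(rng.gauss(0.0, 20.0)) for _ in range(5000)]

print(f"symmetric sample skewness: {sample_skewness(symmetric):+.2f}")
print(f"one-sided sample skewness: {sample_skewness(one_sided):+.2f}")
```

      A skewness close to zero is what a symmetric (for example, Gaussian) distribution produces, whereas a one-sided distribution yields a clearly positive value.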

      (2) Similarly, you can imagine a simple study where you drop an object behind a floating occluder and you check where people produce an anticipatory fixation (i.e., where do they think the object will come out?). If people have a stochastic representation of gravity, this should be reflected in their fixations. But my guess is that everyone will look straight down.

      Author response image 2.

      Trajectory experiment to test the stochastic nature of gravity represented in the mind. (a) Experiment design. In this experiment, participants were required to use a mouse to determine the landing point of a parabolic trajectory (marked by the green dot), obscured by a grey rectangle. Note that the parabolic trajectory was determined only by gravity, and no external disturbances were introduced. The parameters used in this experiment are detailed in the upper right corner. (b) Predictive errors from three participants. The predictive errors from all three participants conform to Gaussian distributions with non-negligible variances. These results suggest the notion of an inherent stochastic property of gravity represented in the mind.

      R2: We thank the reviewer for suggesting this thought experiment. However, when predicting the landing point of a falling object, participants may rely more on learned knowledge that an unimpeded object continues to fall in a straight line, rather than drawing on their intuitive physics. To avoid this potential confounding factor, we designed a similar experiment where participants were asked to predict the landing point of a parabolic trajectory, obscured by an occluder (Author response image 2A). In each trial, participants used a mouse (clicking the left button) to predict the landing point of each parabolic trajectory, and there were 100 trials in total. This design not only limits the impact of direct visual cues but also actively engages the mental simulation of intuitive physics. All three participants (1 female; ages: 24-30) were unable to accurately predict the landing points of the trajectories, and the predictive errors conformed to Gaussian distributions with different variances (Author response image 2B). Therefore, this new experiment confirms the stochastic nature of intuitive physics.
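      To make the link between a stochastic gravity direction and landing-point errors concrete, the sketch below simulates an observer whose internal gravity vector is tilted from vertical by a Gaussian-distributed angle on each trial. All parameters (launch height, velocity, a 5-degree standard deviation, 2000 trials) are illustrative assumptions and are not the values used in the experiment, and the functions `landing_x` and `simulate_errors` are ours.

```python
import math
import random


def landing_x(theta, x0=0.0, height=10.0, vx=5.0, vy=0.0, g=9.81):
    """Horizontal landing position when gravity is tilted by theta (radians)."""
    ax = g * math.sin(theta)  # sideways component of the tilted gravity vector
    ay = g * math.cos(theta)  # downward component
    # Positive root of height + vy*t - 0.5*ay*t**2 = 0.
    t = (vy + math.sqrt(vy * vy + 2.0 * ay * height)) / ay
    return x0 + vx * t + 0.5 * ax * t * t


def simulate_errors(sigma_deg=5.0, n_trials=2000, seed=0):
    """Landing errors relative to the trajectory under vertical gravity."""
    rng = random.Random(seed)
    true_x = landing_x(0.0)
    sigma = math.radians(sigma_deg)
    return [landing_x(rng.gauss(0.0, sigma)) - true_x for _ in range(n_trials)]


errors = simulate_errors()
mean = sum(errors) / len(errors)
sd = math.sqrt(sum((e - mean) ** 2 for e in errors) / len(errors))
print(f"mean error = {mean:.3f}, SD = {sd:.3f} (arbitrary length units)")
```

      In this toy model the errors are centered near zero with a non-negligible spread, which is the qualitative pattern described for the participants. It does not rule out other sources of variability, such as motor or perceptual noise.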

      (3) I believe the correct alternative model should be the one that has uncertainty over unseen forces, which better captures current proposals in the field, and controls for the amount of uncertainty in the models.

      R3: We thank the reviewers for the above-mentioned suggestions, and the findings from these two new experiments reinforce our proposal regarding the inherent stochastic characteristic of how the mind represents gravity.

      (4) I was not convinced that the RL framework was set up correctly to tackle the questions it claims to tackle. What this shows is that you can evolve a world model with Gaussian gravity in a setup that has no external perturbations. That does not imply that that is how humans evolved their intuitive physics, particularly when creatures have evolved in a world full of external perturbations. Showing that when (1) there are hidden perturbations, and (2) these perturbations are learnable, but (3) the model nonetheless just learns stochastic gravity, would be a more convincing result.

      R4: We completely agree with the reviewer that the RL framework serves primarily as a theoretical model to explain the stochastic nature of the world model on gravity, rather than as a demonstration of the developmental origins of intuitive physics abilities. The genesis of such abilities is multifaceted and unlikely to be fully replicated through a simple simulation like RL. Therefore, the purpose of incorporating the RL framework in our study is to demonstrate that external perturbations are not necessary for the development of a stochastic representation of gravity. In fact, introducing additional external noise into the RL framework likely heightens the uncertainty in learning gravity’s direction, potentially amplifying, rather than diminishing, the stochastic nature of mental gravity.

      In revision, we have clarified the role of the RL framework:

      Line 265-277: “While the cognitive impenetrability and the self-consistency observed in this study, without resorting to an external perturbation, favor the stochastic model over the deterministic one, the origin of this stochastic feature of the world model is unclear.

      Here we used a reinforcement learning (RL) framework to unveil this origin, because our intelligence emerges and evolves under the constraints of the physical world. Therefore, the stochastic feature may emerge as a biological agent interacts with the environment, where the mismatches between external feedback from the environment and internal expectations from the world model are in turn used to fine-tune the world model (Friston et al., 2021; MacKay, 1956; Matsuo et al., 2022). Note that a key aspect of the framework is determining whether the stochastic nature of the world model on gravity emerges through this interaction, even in the absence of external noise.”

      (5) Some comments on the writing:

      The word 'normality' is used to refer to people's judgments about whether a tower collapsed looked 'normal'. I was a bit confused by this because normality can also mean 'Gaussian' and the experiments are also sampling from Gaussian distributions. There were several points where it took me a second to figure out which sense of 'normality' the paper was using. I would recommend using a different term.

      R5: We are sorry for the confusion. In revision, the term “normality” has been replaced with “confidence level about normal trajectory”.

      (6) One small comment is that Newton's laws are not a faithful replica of the "physical laws of the world" they are a useful simplification that only works at certain timescales. I believe some people propose Newtonian physics as a model of intuitive physics in part because it is a rapid and useful approximation of complex physical systems, and not because it is an untested assumption of perfect correspondence.

      R6: We are sorry for the inaccurate expression. We have revised our statement in the manuscript. Line 15-16: “We found that the world model on gravity was not a faithful replica of the physical laws, but instead encoded gravity’s vertical direction as a Gaussian distribution.”

      (7) Line 49-50: Based on Fig 1d, lower bound of possible configurations for 10 blocks is ~17 in log-space, which is about 2.5e7. But the line here says it's 3.72e19, which is much larger. Sorry if I am missing something.

      R7: We thank the reviewer for pointing out this error. We re-calculated the number of possible configurations using formula (3) in the appendix, and the number of configurations with 10 blocks is:

      Thus,

      This estimated number is much larger than that in our previous calculation, which has been corrected in the revised text.

      Line 827-829: “d) The lower bound of configurations’ possible number and the number of blocks in a stack followed an exponential relationship with a base of 10. The procedure can create at least 1.14×10^50 configurations for stacks consisting of 10 blocks.”

      Line 49-50: “… but the universal cardinality of possible configurations is at least 1.14×10^50 (Supplementary Figure 1), …”

      Line 1017-1018: “… the number of configurations can be estimated with formula (9), which is 1.14×10^50.”

      (8) Lines 77-78: "A widely adopted but not rigorously tested assumption is that the world model in the brain is a faithful replica of the physical laws of the world." This risks sounding like you are asserting that colleagues in the field do not rigorously test their models. I think you meant to say that they did not 'directly test', rather than 'rigorously test'. If you meant rigorous, you might want to say more to justify why you think past work was not rigorous.

      R8: We apologize for the inappropriate wording. The sentence has been revised, and we explain the motivation more comprehensively in the revised text:

      Line 76-92: “A prevailing theory suggests that the world model in the brain accurately mirrors the physical laws of the world (Allen et al., 2020; Battaglia et al., 2013; Zhou et al., 2022). For example, the direction of gravity encoded in the world model, a critical factor in stability inference, is assumed to be straight downward, aligning with its manifestation in the physical world. To explain the phenomenon that tall and thin objects are subjectively perceived as more unstable compared to short and fat ones (Supplementary Figure 2), external noise, such as imperfect perception and assumed external forces, is introduced to influence the output of the model. However, when the brain actively transforms sensory data into cognitive understanding, these data can become distorted (Kriegeskorte and Douglas, 2019; Naselaris et al., 2011), thereby introducing uncertainty into the representation of gravity’s direction. In this scenario, the world model inherently incorporates uncertainty, eliminating the need for additional external noise to explain the inconsistency between subjective perceptions of stability and the actual stability of objects. Note that this distinction of these two theories is nontrivial: the former model implies a deterministic representation of the external world, while the latter suggests a stochastic approach.”

      (9) Lines 79-84 States that past models encode gravity downward. It then says that alternatively there is consensus that the brain uses data from sensory organs and adds meaning to them. I think there might be a grammatical error here because I did not follow why saying there is 'consensus' on something is a theoretical alternative. I also had trouble following why those two statements are in opposition. Is any work on physics engines claiming the brain does not take data from sensory organs and add meaning to them?

      R9: We are sorry for the confusion. Here we intend to contrast the deterministic model (i.e., the uncertainty comes from outside the model) with the stochastic model (i.e., the uncertainty is inherently built into the model). In revision, we have clarified the intention. For details, please see R8.

      (10) Lines 85-88: Following on the sentence above, you then conclude that the representation of the world may therefore not be the same as reality. I did not understand why this followed. It seems you are saying that, because the brain takes data from sensory organs, therefore its representations may differ from reality.

      R10: Again, we are sorry about the confusion. Please see the revised text in R8.

      (11) Lines 190-191: I had trouble understanding this sentence. I believe you are missing an adjective to clarify that participants were more inclined to judge taller stacks as more likely to collapse.

      R11: We are sorry for the confusion. What we intended to state here is that participants’ judgment was biased, showing a tendency to predict a collapse for stacks regardless of their actual stability. We have revised this confusing sentence in the revision. Line 202–204: “However, the participants showed an obvious bias towards predicting a collapse for stacks regardless of their actual stability, as the dots in Fig 2b are more concentrated on the lower side of the diagonal line.”

      (12) Line 201: I don't think it's accurate to say that MGS "perfectly captured participants' judgments" unless the results are actually perfect.

      R12: We agree, and in revision we have toned down the statement. Line 213–214: “…, the MGS, in contrast to the NGS, more precisely reflected participants’ judgments of stability …”

      Reviewer #2 (Recommendations For The Authors):

      I think this is an impressive set of experiments and modeling work. The paper is nicely written and I appreciate the poetic license the authors took at places in the manuscript. I only have clarification points and suggest a simple experiment that could lend further support to their conclusions. 1. In my opinion, the impact of this work is twofold. First, the suggestion that gravity is represented as a distribution of the world and not a result of (inferred) external perturbations. Second, that the distribution is advantageous as it balances speed and accuracy, and lessens computational processing demands (i.e., number of simulations). The second point here is contingent on the first point, which is really only supported by the RL model and potentially the inverted scene condition. I am somewhat surprised that the RL model does not converge on a width much smaller than ~20 degrees after 100,000 simulations. From my understanding, it was provided feedback with collapses based on natural gravity (deterministically downward). Why is learning so slow and the width so large? Could it be the density of the simulated world model distribution? If the model distribution of Qs was too dense, then Q-learning would take forever. If the model distribution was too sparse, then its final estimate would hit a floor of precision. Could the authors provide more details on the distribution of the Qs for the RL model?

      Author response image 3.

      RL learning curves as a function of θ angle with different sampling densities and learning rates. Learning rates were adjusted to low (a), intermediate (b) and high (c) settings, while sampling densities were chosen at four levels: 5x5, 11x11, 31x31, and 61x61 shown from the left to the right. Two key observations emerged from the simulations as the reviewer predicted. First, higher learning rates resulted in a more rapid decline in learning curves but introduced larger variances. Second, increased sampling density necessitated more iterations for convergence. Note that in all simulations, we limited the iterations to 1,000 times (as opposed to 100,000 times reported in the manuscript) to demonstrate the trend without excessive computational demands.

      R1: To illustrate the distribution of the Q-values for the RL model, we re-ran the RL model with various learning rates and sampling densities (Author response image 3). These results support the reviewer’s prediction that higher learning rates resulted in a more rapid decline in learning curves but introduced larger variances, and increased sampling density requires more iterations for convergence.

      This simulation also elucidates the slower learning observed in the experiment described in the text, where the force sphere was divided into 61x61 angle pairs, and the learning rate was set to 0.15. This set of parameters ensured convergence within a reasonably brief timeframe while maintaining high-resolution force assessments.

      Besides, the width of the Gaussian distribution is mainly determined by the complexity of stacks. As shown in Figure 3c and Supplementary Figure 9, stacks with fewer blocks (i.e., less complex) caused a larger width, whereas those with more blocks resulted in a narrower spread. In the study, we used a collection of stacks varying from 2 to 15 blocks to simulate the range of stacks humans typically encounter in daily life.
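
      The trade-off between sampling density and learning speed described above can be illustrated with a toy sketch. This is not the authors' RL model: the function name `run_toy_learning`, the reward shape (highest for the straight-down cell at the grid centre), and the uniform random sampling of cells are all assumptions made for illustration. Only the grid sizes (5x5 and 61x61), the learning rate (0.15), and the 1,000-iteration budget come from the response.

```python
import random


def run_toy_learning(grid_size, learning_rate, iterations, seed=0):
    """Toy tabular value learning over a grid_size x grid_size grid of
    candidate gravity directions (a hypothetical stand-in for angle pairs)."""
    rng = random.Random(seed)
    centre = (grid_size - 1) / 2.0
    q = [[0.0] * grid_size for _ in range(grid_size)]
    visited = set()
    for _ in range(iterations):
        # Assumption: each iteration evaluates one uniformly sampled cell.
        i = rng.randrange(grid_size)
        j = rng.randrange(grid_size)
        # Assumed reward: 1 at the centre cell, decaying linearly with distance.
        dist = ((i - centre) ** 2 + (j - centre) ** 2) ** 0.5 / max(centre, 1.0)
        reward = max(0.0, 1.0 - dist)
        # Incremental update: move the stored value toward the observed reward.
        q[i][j] += learning_rate * (reward - q[i][j])
        visited.add((i, j))
    coverage = len(visited) / (grid_size * grid_size)
    return q, coverage


# Same iteration budget, two sampling densities.
_, coverage_sparse = run_toy_learning(5, 0.15, 1000)
_, coverage_dense = run_toy_learning(61, 0.15, 1000)
print(coverage_sparse, coverage_dense)
```

      With the same budget, the 61x61 grid leaves most cells unvisited, so each value estimate receives far fewer updates. This is the sense in which a denser grid needs more iterations to converge.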

      In revision, we have incorporated these insights suggested by the reviewer to clarify the performance of the RL framework:

      Line 634-639: “The angle density and learning rate are two factors that affect the learning speed. A larger angle density prolongs the time to reach convergence but enables a more detailed force space; a higher learning rate accelerates convergence but incurs larger variance during training. To balance speed and convergence, we utilized 100,000 configurations for the training.”

      Line 618-619: “…, separately divided them into 61 sampling angles across the spherical force space (i.e., the angle density).”

      (2) Along similar lines, the authors discuss the results of the inverted scene condition as reflecting cognitive impenetrability. However, do they also interpret it as support for an intrinsically noisy distribution of gravity? I would be more convinced if they created a different scene that could have the possibility of affecting the direction of an (inferred) external perturbation - a previously held explanation of the noisy world model. For example, a relatively simple experiment would be to have a wall on one side of the scene such that an external perturbation would be unlikely to be inferred from that direction. In the external perturbation account, phi would then be affected resulting in a skewed distribution of angle pairs. However, in the authors' stochastic world model phi would remain unaffected resulting in the same uniform distribution of phi the authors observed. In my opinion, this would provide more compelling evidence for the stochastic world model.

      Author response image 4.

      Wall experiment to test the impact of external forces on the measurement of stochastic gravity. (a) Experimental setting. We replicated the original setup with the addition of a wall implemented on one side. Left: the overall experimental scene; Right, the scene shown to participants. (b) Human behaviors. Three participants conducted this experiment, and their responses consistently showed normal distributions without any skewness, suggesting that their judgments were not affected by the presence of the wall. These results support our claim that humans’ judgments on stability were not affected by potential concerns regarding external forces.

      R2: We thank the reviewer for this suggestion. Following the reviewer’s concern, we designed the experiment with the addition of a wall implemented on one side (Supplementary Figure 4A). Before the start of the experiment, we explicitly informed the participants that the wall was designed to block wind, ensuring that no potential wind forces from the direction of the wall could influence the collapse trajectory of configurations. Participants needed to judge whether the trajectory was normal. If participants’ judgments were influenced by external noise, we would expect to observe a skewed angle distribution. However, our results still showed a normal distribution across all participants tested, consistent with the experiment without the wall (Supplementary Figure 4B). This experiment suggests that the stochastic nature of intuitive inference on objects’ stability is embedded in the mind, rather than shaped by external forces or explicit instructions.

      We revised the original manuscript and added this new experiment:

      Line 166-168: “…, and remained unchanged with the addition of a wall on one side to block potential external disturbances from wind (Supplementary Figure 4).”
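
      The logic of the wall experiment rests on checking whether the response distribution is skewed. A minimal sketch of such a check is given below, using the moment-based sample skewness. The function name `sample_skewness` and the example values are hypothetical. The response does not state which statistic the authors used, so this only illustrates the idea.

```python
def sample_skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5.

    Near 0 for a symmetric sample; positive when the right tail is longer."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical angle responses (degrees): symmetric vs. one-sided.
symmetric = [-20, -10, 0, 10, 20]
one_sided = [0, 0, 0, 10, 50]
print(sample_skewness(symmetric), sample_skewness(one_sided))
```

      Under the external-perturbation account, blocking one side with a wall should push this statistic away from zero; under the stochastic world model it should stay near zero.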

      (3) I didn't completely follow the authors' explanation for the taller objects illusion. On lines 229-232, the authors state that deviations from gravity's veridical direction are likely to accumulate with the height of the objects. Is this because, in the stochastic world model account, each block gets its own gravity vector that is sampled from the distribution? The authors should clarify this more explicitly. If this is indeed the authors' claim, then it would seem that it could be manipulated by varying the dimensions of the blocks (or whatever constitutes an object).

      R3: We are sorry for the confusion caused by the use of the term ‘accumulate’. In the study, there is only one gravity vector sampled from the distribution for the entire structure, rather than each block having a unique gravity vector. The height illusion is attributed to the fact that the center of gravity in taller objects is more susceptible to influence when gravity deviates slightly from a strictly downward direction. This is especially true for objects consisting of multiple blocks stacked atop one another. In revision, we have removed the confusing term ‘accumulate’ for clarification.

      Line 242-244: “…, because the center of gravity in taller objects is more susceptible to influence when gravity deviates slightly from a strictly downward direction during humans’ internal simulations.”
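
      The geometric intuition behind this statement can be sketched for the simplest case. The example assumes a single uniform rectangular block, a gravity tilt confined to one plane, no sliding, and a Gaussian tilt with an illustrative standard deviation of 20 degrees. None of these choices are taken from the manuscript, and the function names are hypothetical. Under these assumptions the block tips once the gravity vector through its centre of mass passes the base edge, i.e. when tan(tilt) exceeds base_width / height.

```python
import math


def critical_tilt_deg(base_width, height):
    """Gravity tilt (degrees from vertical) beyond which a uniform rectangular
    block tips about a base edge. Friction and sliding are ignored."""
    return math.degrees(math.atan2(base_width, height))


def tipping_probability(base_width, height, sigma_deg):
    """P(|tilt| > critical tilt) for tilt ~ Normal(0, sigma_deg) in one plane.

    For a zero-mean normal variable, P(|X| > c) = erfc(c / (sigma * sqrt(2)))."""
    z = critical_tilt_deg(base_width, height) / (sigma_deg * math.sqrt(2.0))
    return math.erfc(z)


# Same base, increasing height; sigma_deg = 20 is an illustrative value only.
for height in (1.0, 2.0, 4.0):
    print(height, critical_tilt_deg(1.0, height), tipping_probability(1.0, height, 20.0))
```

      With one gravity vector sampled for the whole structure, a taller block on the same base has a smaller critical tilt and therefore a larger share of sampled directions that predict a fall.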

      (4) The authors refer to the RL simulations as agent-environment interactions, but in reality, the RL model does not interact with the blocks. Would experience-dependent or observation be more apropos?

      R4: We completely agree. Indeed, the RL model did not manipulate stacks; rather, it updated its knowledge of natural gravity based on the discrepancies between the RL model’s predictions and observed outcomes. In revision, we have removed the confusing term ‘agent-environment interactions’ and clarified its intended meaning.

      Line 19-22: “Furthermore, a computational model with reinforcement learning revealed that the stochastic characteristic likely originated from experience-dependent comparisons between predictions formed by internal simulations and the realities observed in the external world, …”

      Reviewer #3 (Public Review):

      (1) In spite of the fact that the Mental Gravity Simulation (MGS) seems to predict the data of the two experiments, it is an untenable hypothesis. I give the main reason for this conclusion by illustrating a simple thought experiment. Suppose you ask subjects to determine whether a single block (like those used in the simulations) is about to fall. We can think of blocks of varying heights. No matter how tall a block is, if it is standing on a horizontal surface it will not fall until some external perturbation disturbs its equilibrium. I am confident that most human observers would predict this outcome as well. However, the MGS simulation would not produce this outcome. Instead, it would predict a non-zero probability of the block to tip over. A gravitational field that is not perpendicular to the base has the equivalent effect of a horizontal force applied on the block at the height corresponding to the vertical position of the center of gravity. Depending on the friction determined by the contact between the base of the block and the surface where it stands there is a critical height where any horizontal force being applied would cause the block to fall while pivoting about one of the edges at the base (the one opposite to where the force has been applied). This critical height depends on both the size of the base and the friction coefficient. For short objects this critical height is larger than the height of the object, so that object would not fall. But for taller blocks, this is not the case. Indeed, the taller the block the smaller the deviation from a vertical gravitational field is needed for a fall to be expected. The discrepancy between this prediction and the most likely outcome of the simple experiment I have just outlined makes the MGS model implausible. Note also that a gravitational field that is not perpendicular to the ground surface is equivalent to the force field experienced by the block while standing on an inclined plane. For small friction values, the block is expected to slide down the incline, therefore another prediction of this MGS model is that when we observe an object on a surface exerting negligible friction (think of a puck on ice) we should expect that object to spontaneously move. But of course, we don't, as we do not expect tall objects that are standing to suddenly fall if left unperturbed. In summary, a stochastic world model cannot explain these simple observations.

      Author response image 5.

      Differentiating Subjectivity from Objectivity. In both Experiment 1 (a) and Experiment 2 (b), participants were instructed to determine which shape appeared most stable. Objectively, in the absence of external forces, all shapes possess equal stability. Yet, participants typically perceived the shape on the left as the most stable because of its larger base area. The discrepancy between objective realities and subjective feelings, as we propose, is attributed to the human mind representing gravity’s direction as a Gaussian distribution, rather than as a singular value pointing directly downward.

      R1: We agree with the reviewer that objects will remain stable until disturbed by external forces. However, in many cases, there is a clear discrepancy between objective realities and subjective feelings. For example, electromagnetic waves associated with purple and red colors are the farthest in the electromagnetic space, yet purple and red are the closest colors in the color space. Similarly, as shown in Supplementary Figure 4, in reality all shapes possess equal stability in the absence of external forces. Yet, humans typically perceive the shape on the left as more stable because of its larger base area. In this study, we tried to explore the mechanism underlying this discrepancy by proposing that the human mind represents gravity’s direction as a Gaussian distribution, rather than as a singular value pointing directly downward.

      In revision, we have clarified the rationale of this study:

      Line 76-98: “A prevailing theory suggests that the world model in the brain accurately mirrors the physical laws of the world (Allen et al., 2020; Battaglia et al., 2013; Zhou et al., 2022). For example, the direction of gravity encoded in the world model, a critical factor in stability inference, is assumed to be straight downward, aligning with its manifestation in the physical world. To explain the phenomenon that tall and thin objects are subjectively perceived as more unstable compared to short and fat ones (Supplementary Figure 2), external noise, such as imperfect perception and assumed external forces, is introduced to influence the output of the model. However, when the brain actively transforms sensory data into cognitive understanding, these data can become distorted (Kriegeskorte and Douglas, 2019; Naselaris et al., 2011), hereby introducing uncertainty into the representation of gravity’s direction. In this scenario, the world model inherently incorporates uncertainty, eliminating the need for additional external noise to explain the inconsistency between subjective perceptions of stability and the actual stability of objects. Note that this distinction of these two theories is nontrivial: the former model implies a deterministic representation of the external world, while the latter suggests a stochastic approach. Here, we investigated these two alternative hypotheses regarding the construction of the world model in the brain by examining how gravity’s direction is represented in the world model when participants judged object stability.”

      (2) The question remains as to how we can interpret the empirical data from the two experiments and their agreement with the predictions of the stochastic world model if we assume that the brain has internalized a vertical gravitational field. First, we need to look more closely at the questions posed to the subjects in the two experiments. In the first experiment, subjects are asked about how "normal" a fall of a block construction looks. Subjects seem to accept 50% of the time a fall is normal when the gravitational field is about 20 deg away from the vertical direction. The authors conclude that according to the brain, such an unusual gravitational field is possible. However, there are alternative explanations for these findings that do not require a perceptual error in the estimation of the direction of gravity. There are several aspects of the scene that may be misjudged by the observer. First, the 3D interpretation of the scene and the 3D motion of the objects can be inaccurate. Indeed, the simulation of a normal fall uploaded by the authors seems to show objects falling in a much weaker gravitational field than the one on Earth since the blocks seem to fall in "slow motion". This is probably because the perceived height of the structure is much smaller than the simulated height. In general, there are even more severe biases affecting the perception of 3D structures that depend on many factors, for instance, the viewpoint.

      R2: We thank the reviewer for highlighting several potential confounding factors in our study. We address each of these concerns point-by-point:

      (a) Misinterpretation of the 3D scene and motion. In Response Figure 4 shown above, there is no 3D structure, yet participants’ judgment on stability still deviated from objective realities. In addition, the introduction of 3D motion was to aid in understanding the stacks’ 3D structure. Previous studies without 3D motion have reported similar findings (Allen et al., 2020). Therefore, regardless of whether objects are presented in 2D or 3D, or in static or in motion formats, humans’ judgment on object stability appears consistent.

      (b) Errors in perceived height. While there might be discrepancies between perceived and simulated heights, such errors are systematic across all conditions. Therefore, they may affect the width of the Gaussian distribution but do not fundamentally alter its existence.

      (c) The viewpoint. In one experiment, we inverted gravity’s direction to point upward, diverging from common daily experience. Despite this change in viewpoint, the Gaussian distribution was still observed. That is, the viewpoint does not appear to be a key factor in influencing how gravity’s direction is represented as a Gaussian distribution in our mental world.

      In summary, both our and previous studies (Allen et al., 2020; Battaglia et al., 2013) agree that humans’ subjective assessments of objects’ stability deviate from actual stability due to noise in mental simulation. Unlike previous studies, we suggest that this noise is intrinsic, rather than stemming from external forces or imperfect observations.

      (3) Second, the distribution of weight among the objects and the friction coefficients acting between the surfaces are also unknown parameters. In other words, there are several parameters that depend on the viewing conditions and material composition of the blocks that are unknown and need to be estimated. The authors assume that these parameters are derived accurately and only that assumption allows them to attribute the observed biases to an error in the estimate of the gravitational field. Of course, if the direction of gravity is the only parameter allowed to vary freely then it is no surprise that it explains the results. Instead, a simulation with a tilted angle of gravity may give rise to a display that is interpreted as rendering a vertical gravitational field while other parameters are misperceived. Moreover, there is an additional factor that is intentionally dismissed by the authors that is a possible cause of the fall of a stack of cubes: an external force. Stacks that are initially standing should not fall all of a sudden unless some unwanted force is applied to the construction. For instance, a sudden gust of wind would create a force field on a stack that is equivalent to that produced by a tilted gravitational field. Such an explanation would easily apply to the findings of the second experiment. In that experiment subjects are explicitly asked if a stack of blocks looks "stable". This is an ambiguous question because the stability of a structure is always judged by imagining what would happen to the structure if an external perturbation is applied. The right question should be: "do you think this structure would fall if unperturbed". However, if stability is judged in the face of possible external perturbations then a tall structure would certainly be judged as less stable than a short structure occupying the same ground area. This is what the authors find. What they consider as a bias (tall structures are perceived as less stable than short structures) is instead a wrong interpretation of the mental process that determines stability. If subjects are asked the question "Is it going to fall?" then tall stacks of sound structure would be judged as stable as short stacks, just more precarious.

      R3: Indeed, the external forces suggested by the reviewer certainly influence judgments of objects’ stability. The critical question, however, is whether humans’ judgments on objects’ stability accurately mirror the actual stability of objects in the absence of external forces. To address this question, we designed two new experiments.

      Experiment 1: we duplicated the original experimental setup with the addition of a wall implemented on one side (Supplementary Figure 4A). We explicitly informed the participants that the wall could block wind, ensuring that no potential wind from the direction of the wall could influence the configuration. If participants’ judgments were affected by external noise, we would expect to observe a skewed angle distribution. Contrary to this prediction, our results showed a normal distribution across all three participants (Age: 25-30, two females), which is similar to the experiment without the wall (Supplementary Figure 4B).

      Author response image 6.

      Wall experiment to test the impact of external forces on the measurement of stochastic gravity. (a) Experimental setting. We replicated the original setup with the addition of a wall implemented on one side. Left: the overall experimental scene; Right, the scene shown to participants. (b) Human behaviors. Three participants conducted this experiment, and their responses consistently showed normal distributions without any skewness, suggesting that their judgments were not affected by the presence of the wall. These results support our claim that humans’ judgments on stability were not affected by potential concerns regarding external forces.

      Experiment 2: The second experiment adopted another paradigm to test the hypothesis of stochastic mental simulation. When humans infer the landing point of a parabolic trajectory that is obscured by an occluder (Author response image 2A), the stochastic mental simulation predicts that their behavior follows a Gaussian distribution. However, if humans’ judgments were influenced by external noise, the landing points would not be expected to be Gaussian. The experiment consisted of 100 trials in total, and in each trial participants used a mouse to predict the landing point of each trajectory by clicking the left button. We found that all three participants (1 female; ages: 24-30) were unable to accurately predict the landing points of the trajectories, and the predictive errors conformed to Gaussian distributions with different variances (Author response image 2B). Therefore, this new experiment confirms the stochastic nature of intuitive physics.

      Author response image 7.

      Trajectory experiment to test the stochastic nature of gravity represented in the mind. (a) Experiment design. In this experiment, participants were required to use a mouse to determine the landing point of a parabolic trajectory (marked by the green dot), obscured by a grey rectangle. Note that the parabolic trajectory was determined only by gravity, and no external disturbances were introduced. The parameters used in this experiment are detailed in the upper right corner. (b) Predictive errors from three participants. The predictive errors from all three participants conform to Gaussian distributions with non-negligible variances. These results support the notion of an inherent stochastic property of gravity represented in the mind.
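
      The prediction tested in this experiment can be sketched as follows. The launch velocity, the tilt standard deviation, and the function names are assumptions chosen for illustration, not the parameters shown in the figure. The sketch launches a point from ground level, tilts gravity by a Gaussian-distributed angle within the plane of motion, and records how far the landing point departs from the landing point under straight-down gravity.

```python
import math
import random


def landing_x(vx, vy, tilt_deg, g=9.81):
    """Horizontal landing position of a point launched from (0, 0) with
    velocity (vx, vy), vy > 0, when gravity is tilted by tilt_deg from
    straight down within the plane of motion (|tilt_deg| < 90)."""
    theta = math.radians(tilt_deg)
    ax = g * math.sin(theta)   # sideways component of tilted gravity
    ay = -g * math.cos(theta)  # downward component of tilted gravity
    t = 2.0 * vy / -ay         # time at which the point returns to y = 0
    return vx * t + 0.5 * ax * t * t


def sample_landing_errors(vx, vy, sigma_deg, n, seed=0):
    """Landing errors relative to untilted gravity, with tilt ~ Normal(0, sigma_deg)."""
    rng = random.Random(seed)
    true_x = landing_x(vx, vy, 0.0)
    return [landing_x(vx, vy, rng.gauss(0.0, sigma_deg)) - true_x for _ in range(n)]


errors = sample_landing_errors(vx=3.0, vy=4.0, sigma_deg=5.0, n=100)
print(min(errors), max(errors))
```

      For small tilts the landing error is close to linear in the tilt angle, so a Gaussian spread of gravity directions yields an approximately Gaussian spread of landing errors in this toy setting.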

      (4) The RL model used as a proof of concept for how the brain may build a stochastic prior for the direction of gravity is based on very strong and unverified assumptions. The first assumption is that the brain already knows about the force of gravity, but it lacks knowledge of the direction of this force of gravity. The second assumption is that before learning the brain knows the effect of a gravitational field on a stack of blocks. How can the brain simulate the effect of a non-vertical gravitational field on a structure if it has never observed such an event?

      R4: We agree with the reviewer that the RL framework serves primarily as a theoretical model to explain the stochastic nature of the world model on gravity, rather than as a demonstration of the developmental origins of intuitive physics abilities. The genesis of such abilities is multifaceted and unlikely to be fully replicated through a simple simulation like RL. Therefore, the purpose of incorporating the RL framework in our study is to demonstrate that external perturbations are not necessary for the development of a stochastic representation of gravity.

      In revision, we have clarified the role of the RL framework:

      Line 265-277: “While the cognitive impenetrability and the self-consistency observed in this study, without resorting to an external perturbation, favor the stochastic model over the deterministic one, the origin of this stochastic feature of the world model is unclear.

      Here we used a reinforcement learning (RL) framework to unveil this origin, because our intelligence emerges and evolves under the constraints of the physical world. Therefore, the stochastic feature may emerge as a biological agent interacts with the environment, where the mismatches between external feedback from the environment and internal expectations from the world model are in turn used to fine-tune the world model (Friston et al., 2021; MacKay, 1956; Matsuo et al., 2022). Note that a key aspect of the framework is determining whether the stochastic nature of the world model on gravity emerges through this interaction, even in the absence of external noise.”

      (5) The third assumption is that from the visual input, the brain is able to figure out the exact 3D coordinates of the blocks. This has been proven to be untrue in a large number of studies. Given these assumptions and the fact that the only parameters the RL model modifies through learning specify the direction of gravity, I am not surprised that the model produces the desired results.

      Author response image 8.

      Perceptual uncertainty in 3D stack structures. (a) Experimental design. A pair of stacks with similar placements of blocks was presented sequentially to participants, who were instructed to judge whether the stacks were identical and to rate their confidence in this judgment. Each stack was presented on the screen for 2 seconds. (b) Behavioral performance. Three participants (2 males, age range: 24-30) were recruited to the experiment. The confidence in determining whether a pair of stacks remained unchanged rapidly decreased when each block had a very small displacement, suggesting humans could keenly perceive trivial changes in configurations. The x-axis denotes the difference in block placement between stacks, with the maximum value (0.4) corresponding to the length of a block’s short side. The y-axis denotes humans’ confidence in reporting no change. The red curve illustrates the average confidence level across 4 runs, while the yellow curve is the confidence level of each run.

      R5: Indeed, uncertainty is inevitable when perceiving the external world, because our perception is not a faithful replica of external reality. A more critical question pertains to the accuracy of our perception in representing the 3D coordinates of a stack’s blocks. To address this question, we designed a straightforward experiment (Author response image 5a), where participants were instructed to determine whether a pair of stacks were identical. The position of each block was randomly changed horizontally. We found that all participants were able to accurately identify even minor positional variations in the 3D structure of the stacks (Author response image 5b). This level of perceptual precision is adequate for locating the difference between predictions from mental simulations and actual observations of the external world.

      (6) Finally, the argument that the MGS is more efficient than the NGS model is based on an incorrect analysis of the results of the simulation. It is true that 80% accuracy is reached faster by the MGS model than the 95% accuracy level is reached by the NGS model. But the question is: how fast does the NGS model reach 80% accuracy (before reaching the plateau)?

      R6: Yes. The NGS model achieved 80% accuracy as rapidly as the MGS model. However, the NGS model required a significantly longer period to reach the plateau crucial for decision-making. In revision, this information is now included.

      Line 348-350: “…, while the initial growth rates of both models were comparable, the MGS reached the plateau crucial for decision-making sooner than the NGS.”

      We greatly appreciate the thorough and insightful review provided by all three reviewers, which has considerably improved our manuscript, especially in terms of clarity in the presentation of the approach and further validation of the robustness implications of our results.

      References:

      Allen KR, Smith KA, Tenenbaum JB. 2020. Rapid trial-and-error learning with simulation supports flexible tool use and physical reasoning. Proceedings of the National Academy of Sciences 117:29302–29310.

      Battaglia PW, Hamrick JB, Tenenbaum JB. 2013. Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences 110:18327–18332.

      Friston K, Moran RJ, Nagai Y, Taniguchi T, Gomi H, Tenenbaum J. 2021. World model learning and inference. Neural Networks 144:573–590.

      Kriegeskorte N, Douglas PK. 2019. Interpreting encoding and decoding models. Current opinion in neurobiology 55:167–179.

      MacKay DM. 1956. The epistemological problem for automata. Automata Studies (AM-34), Volume 34. Princeton University Press. pp. 235–252.

      Matsuo Y, LeCun Y, Sahani M, Precup D, Silver D, Sugiyama M, Uchibe E, Morimoto J. 2022. Deep learning, reinforcement learning, and world models. Neural Networks.

      Naselaris T, Kay KN, Nishimoto S, Gallant JL. 2011. Encoding and decoding in fMRI. Neuroimage 56:400–410.

      Zhou L, Smith K, Tenenbaum J, Gerstenberg T. 2022. Mental Jenga: A counterfactual simulation model of physical support.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment:

      This study presents an important finding on the implicit and automatic emotion perception from biological motion (BM). The evidence supporting the claims of the authors is solid, although inclusion of a larger number of samples and more evidence for the discrepancy between Intact and local emotional BMs would have strengthened the study. The work will be of broad interest to perceptual and cognitive neuroscience.

      We express our sincere gratitude for the positive and constructive evaluation of our manuscript. We have now included more participants and conducted a replication experiment to strengthen our results.

      Reviewer #1 (Public Review):

      Summary:

      Tian et al. investigated the effects of emotional signals in biological motion on pupil responses. In this study, subjects were presented with point-light biological motion stimuli with happy, neutral, and sad emotions. Their pupil responses were recorded with an eye tracker. Throughout the study, emotion type (i.e., happy/sad/neutral) and BM stimulus type (intact/inverted/non-BM/local) were systematically manipulated. For intact BM stimuli, happy BM induced a larger pupil diameter than neutral BM, and neutral BM also induced a larger pupil diameter than sad BM. Importantly, the diameter difference between happy and sad BM correlated with the autistic trait of individuals. These effects disappeared for the inverted BM and non-BM stimuli. Interestingly, both happy and sad emotions show superiority in pupil diameter.

      Strengths:

      (1) The experimental conditions and results are very easy to understand.

      (2) The writing and data presentation are clear.

      (3) The methods are sound. I have no problems with the experimental design and results.

      Weaknesses:

      (1) My main concern is the interpretation of the intact and local condition results. The processing advantage of happy emotion is not surprising given a number of existing studies. However, the only difference here seems to be the smaller (or larger) pupil diameter for sad compared to neutral in the intact (or local, respectively) condition. The current form only reports this effect but lacks in-depth discussions and explanations as to why this is the case.

      Thanks for pointing this out, our apology for not making this point clear. It has long been documented that pupil size reflects the degree of cognitive effort and attention input (Joshi & Gold, 2019; van der Wel & van Steenbergen, 2018), and indexes the noradrenalin activity in emotion processing structures like amygdala (Dal Monte et al., 2015; Harrison et al., 2006; Liddell et al., 2005). Accordingly, we proposed that the smaller pupil response observed under the sad condition as compared to the neutral condition is because the sad biological motion (BM) could be less efficient in attracting visual attention and evoking emotional arousal. In line with this, it has been found that infants looked more at the neutral point-light walker when displayed in pair with the sad walker (Ogren et al., 2019), suggesting that the sad BM is less effective in capturing visual attention than the neutral BM. Besides, neural studies have revealed that, compared with other emotions (anger, happiness, disgust, and fear), the processing of sad emotion failed to evoke heightened activities in any emotionally relevant brain regions including the amygdala, the extrastriate body area (EBA) and the fusiform body area (FBA) (Peelen et al., 2007)(Peelen et al., 2007). The current study echoed with these previous findings by demonstrating a disadvantage for intact sad BM in evoking pupil responses. Notably, different from the intact sad BM, the local sad BM would instead induce stronger pupil responses than the neutral local BM. This distinctive pupil modulation effect observed in intact and local sad BM could be explained as a multi-level emotion processing model of BM. Specifically, even though both the intact and local BM conveyed important life information (Chang & Troje, 2008, 2009; Simion et al., 2008), the latter is deprived of the global form feature. 
Hence, the processing of emotions in local BM may occur at a more basic and preliminary level, responding to generally affectively salient emotion information (happy and sad) without detailed analysis. In fact, a similar dissociation in emotion processing has been observed in another important type of emotional signal with analogous function (i.e., facial expression). For example, happy and fearful faces elicited differential amygdala activations when perceived consciously. However, they elicited comparable amygdala activations when suppressed (Williams et al., 2004). Moreover, it has been proposed that there exist two parallel routes for facial expression processing: a quick but coarse subcortical route that detects affectively salient information without detailed analysis, and a fine-grained but slow cortical route that discriminates the exact emotion type. Similarly, the dissociated emotion processing in local and intact BM may function in the same manner, with the former serving as a primary emotion detection mechanism and the latter serving as a detailed emotion discrimination mechanism. Still, future studies adopting more diverse experimental paradigms and neuroimaging techniques are needed to further investigate this issue. We have added these points and more thoroughly discussed the potential mechanism in the revised text (see lines 329-339, 405-415, 418-420).

      References:

      Chang, D. H. F., & Troje, N. F. (2008). Perception of animacy and direction from local biological motion signals. Journal of Vision, 8(5), 3. https://doi.org/10.1167/8.5.3

      Chang, D. H. F., & Troje, N. F. (2009). Characterizing global and local mechanisms in biological motion perception. Journal of Vision, 9(5), 8–8. https://doi.org/10.1167/9.5.8

      Dal Monte, O., Costa, V. D., Noble, P. L., Murray, E. A., & Averbeck, B. B. (2015). Amygdala lesions in rhesus macaques decrease attention to threat. Nature Communications, 6(1). https://doi.org/10.1038/ncomms10161

      Harrison, N. A., Singer, T., Rotshtein, P., Dolan, R. J., & Critchley, H. D. (2006). Pupillary contagion: central mechanisms engaged in sadness processing. Social Cognitive and Affective Neuroscience, 1(1), 5–17. https://doi.org/10.1093/scan/nsl006

      Joshi, S., & Gold, J. I. (2019). Pupil size as a window on neural substrates of cognition. Trends in Cognitive Sciences, 24(6), 466–480. https://doi.org/10.31234/osf.io/dvsme

      Liddell, B. J., Brown, K. J., Kemp, A. H., Barton, M. J., Das, P., Peduto, A., Gordon, E., & Williams, L. M. (2005). A direct brainstem–amygdala–cortical ‘alarm’ system for subliminal signals of fear. NeuroImage, 24(1), 235–243.

      Ogren, M., Kaplan, B., Peng, Y., Johnson, K. L., & Johnson, S. P. (2019). Motion or emotion: infants discriminate emotional biological motion based on low-level visual information. Infant Behavior and Development, 57, 101324. https://doi.org/10.1016/j.infbeh.2019.04.006

      Peelen, M. V., Atkinson, A. P., Andersson, F., & Vuilleumier, P. (2007). Emotional modulation of body-selective visual areas. Social Cognitive and Affective Neuroscience, 2(4), 274–283. https://doi.org/10.1093/scan/nsm023

      Simion, F., Regolin, L., & Bulf, H. (2008). A predisposition for biological motion in the newborn baby. Proceedings of the National Academy of Sciences, 105(2), 809–813. https://doi.org/10.1073/pnas.0707021105

      van der Wel, P., & van Steenbergen, H. (2018). Pupil dilation as an index of effort in cognitive control tasks: a review. Psychonomic Bulletin & Review, 25(6), 2005–2015. https://doi.org/10.3758/s13423-018-1432-y

      Williams, M. A., Morris, A. P., McGlone, F., Abbott, D. F., & Mattingley, J. B. (2004). Amygdala responses to fearful and happy facial expressions under conditions of binocular suppression. Journal of Neuroscience, 24(12), 2898-2904.

      (2) I also found no systematic discussion and theoretical contributions regarding the correlation with the autistic traits. If the main point of this paper is to highlight an implicit and objective behavioral marker of the autistic trait, more interpretation and discussion of the links between the results and existing findings in ASD are needed.

      We thank the reviewer for this insightful suggestion. The perception of biological motion (BM) has long been considered an important hallmark of social cognition. Abundant studies reported that individuals with social cognitive deficits (e.g., ASD) were impaired in BM perception (Blake et al., 2003; Freitag et al., 2008; Klin et al., 2009; Nackaerts et al., 2012). More recently, it has been pointed out that the extraction of more complex social information (e.g., emotions, intentions) from BM, as compared to basic BM recognitions, could be more effective in detecting ASDs (Federici et al., 2020; Koldewyn et al., 2009; Parron et al., 2008; Todorova et al., 2019). Specifically, a meta-analysis found that the effect size expanded nearly twice when the task required emotion recognition as compared to simple perception/detection (Todorova et al., 2019). However, for the high-functioning ASD individuals, it has been reported that they showed comparable performance with the control group in explicitly labelling BM emotions, while their responses were rather delayed (Mazzoni et al., 2021). This suggested that ASD individuals could adopt compensatory strategies to complete the explicit BM labelling task, while their automatic behavioural responses remained impaired. This highlights the importance of using more objective measures that do not rely on active reports to investigate the intrinsic perception of emotions from BM and its relationship with ASD-related social deficits. The current study thus introduced the pupil size measurement to this field, and we combined it with the passive viewing task to investigate the more automatic aspect of BM emotion processing. More importantly, in addition to diagnostic ASDs, the non-clinical general population also manifested autistic tendencies that followed normal distribution and demonstrated substantial heritability (Hoekstra et al., 2007). 
Here, we focused on the autistic tendencies in the general population, and our results showed that pupil modulations by BM emotions were indicative of individual autistic traits. Specifically, passively viewing the happy BMs evoked larger pupil responses than the sad BMs, while such emotional modulation diminished with the increase of autistic tendencies. More detailed test-retest examination further illustrated such a correlation was driven by the general diminishment in pupil modulation effects by emotional BM (happy or sad) for individuals with high autistic tendencies. This finding demonstrated that the automatic emotion processing of BM stimuli was impaired in individuals with high autistic tendencies, lending support to previous studies (Hubert et al., 2006; Nackaerts et al., 2012; Parron et al., 2008). This indicated the utility of emotional BM stimuli and pupil measurement in identifying ASD-related tendencies in both clinical and non-clinical populations. We have added these points to the revised text (see lines 347-375).

      References:

      Blake, R., Turner, L. M., Smoski, M. J., Pozdol, S. L., & Stone, W. L. (2003). Visual recognition of biological motion is impaired in children with autism. Psychological Science, 14(2), 151–157. https://doi.org/10.1111/1467-9280.01434

      Federici, A., Parma, V., Vicovaro, M., Radassao, L., Casartelli, L., & Ronconi, L. (2020). Anomalous perception of biological motion in autism: a conceptual review and meta-analysis. Scientific Reports, 10(1). https://doi.org/10.1038/s41598-020-61252-3

      Freitag, C. M., Konrad, C., Häberlen, M., Kleser, C., von Gontard, A., Reith, W., Troje, N. F., & Krick, C. (2008). Perception of biological motion in autism spectrum disorders. Neuropsychologia, 46(5), 1480–1494. https://doi.org/10.1016/j.neuropsychologia.2007.12.025

      Hoekstra, R. A., Bartels, M., Verweij, C. J. H., & Boomsma, D. I. (2007). Heritability of autistic traits in the general population. Archives of Pediatrics & Adolescent Medicine, 161(4), 372. https://doi.org/10.1001/archpedi.161.4.372

      Hubert, B., Wicker, B., Moore, D. G., Monfardini, E., Duverger, H., Fonséca, D. D., & Deruelle, C. (2006). Brief report: recognition of emotional and non-emotional biological motion in individuals with autistic spectrum disorders. Journal of Autism and Developmental Disorders, 37(7), 1386–1392. https://doi.org/10.1007/s10803-006-0275-y

      Klin, A., Lin, D. J., Gorrindo, P., Ramsay, G., & Jones, W. (2009). Two-year-olds with autism orient to non-social contingencies rather than biological motion. Nature, 459(7244), 257–261. https://doi.org/10.1038/nature07868

      Koldewyn, K., Whitney, D., & Rivera, S. M. (2009). The psychophysics of visual motion and global form processing in autism. Brain, 133(2), 599–610. https://doi.org/10.1093/brain/awp272

      Mazzoni, N., Ricciardelli, P., Actis-Grosso, R., & Venuti, P. (2021). Difficulties in recognising dynamic but not static emotional body movements in autism spectrum disorder. Journal of Autism and Developmental Disorders, 52(3), 1092–1105. https://doi.org/10.1007/s10803-021-05015-7

      Nackaerts, E., Wagemans, J., Helsen, W., Swinnen, S. P., Wenderoth, N., & Alaerts, K. (2012). Recognizing biological motion and emotions from point-light displays in autism spectrum disorders. PLoS ONE, 7(9), e44473. https://doi.org/10.1371/journal.pone.0044473

      Parron, C., Da Fonseca, D., Santos, A., Moore, D. G., Monfardini, E., & Deruelle, C. (2008). Recognition of biological motion in children with autistic spectrum disorders. Autism, 12(3), 261–274. https://doi.org/10.1177/1362361307089520

      Todorova, G. K., Hatton, R. E. M., & Pollick, F. E. (2019). Biological motion perception in autism spectrum disorder: a meta-analysis. Molecular Autism, 10(1). https://doi.org/10.1186/s13229-019-0299-8

      Reviewer #2 (Public Review):

      Summary:

      Through a series of four experiments, Yuan, Wang and Jiang examined pupil size responses to emotion signals in point-light motion stimuli. Experiment 1 examined upright happy, sad and neutral point-light biological motion (BM) walkers. The happy BM induced a significantly larger pupil response than the neutral, whereas the sad BM evoked a significantly smaller pupil size than the neutral BM. Experiment 2 examined inverted BM walkers. Experiment 3 examined BM stimuli with acceleration removed. No significant effects of emotion were found in either Experiment 2 or Experiment 3. Experiment 4 examined scrambled BM stimuli, in which local motion features were preserved while the global configuration was disrupted. Interestingly, the scrambled happy and sad BM led to significantly greater pupil size than the scrambled neutral BM at a relatively early time, while no significant difference between the scrambled happy and sad BM was found. Thus, the authors argue that these results suggest multi-level processing of emotions in life motion signals.

      Strengths:

      The experiments were carefully designed and well-executed, with point-light stimuli that eliminate many potential confounding effects of low-level visual features such as luminance, contrast, and spatial frequency.

      Weaknesses:

      Correlation results with limited sample size should be interpreted with extra caution.

      Thanks for pointing this out. To strengthen the correlation results, we have conducted a replication experiment (Exp.1b) and added a test-retest examination to further assess the reliability of our measurements. Specifically, a new group of 24 participants (16 females, 8 males) were recruited to perform the identical experiment procedure as in Experiment 1. Then, after at least seven days, they were asked to return to the lab for a retest. The results successfully replicated the previously reported main effect of emotional condition in both the first test (F(2, 46) = 12.0, p < .001, ηp2 = 0.34, Author response image 1A) and the second test (F(2, 46) = 14.8, p < .001, ηp2 = 0.39, Author response image 1B). The happy BM induced a significantly larger pupil response than the neutral BM (First Test: t(23) = 2.60, p = .022, Cohen’s d = 0.53, 95% CI for the mean difference = [0.02, 0.14], Holm-corrected, p = .048 after Bonferroni correction, Author response image 1A; Second Test: t(23) = 3.36, p = .005, Cohen’s d = 0.68, 95% CI for the mean difference = [0.06, 0.24], Holm-corrected, p = .008 after Bonferroni correction, Author response image 1B). On the contrary, the sad BM induced a significantly smaller pupil response than the neutral BM (First Test: t(23) = -2.77, p = .022, Cohen’s d = 0.57, 95% CI for the mean difference = [-0.19, -0.03], Holm-corrected, p = .033 after Bonferroni correction; Second Test: t(23) = -3.19, p = .005, Cohen’s d = 0.65, 95% CI for the mean difference = [-0.24, -0.05], Holm-corrected, p = .012 after Bonferroni correction, Author response image 1B). 
Besides, the happy BM induced significantly larger pupil response than the sad BM (first test: t(23) = 4.23, p < .001, Cohen’s d = 0.86, 95% CI for the mean difference = [0.10, 0.28], Holm-corrected, p < .001 after Bonferroni correction, Author response image 1A; second test: t(23) = 4.26, p < .001, Cohen’s d = 0.87, 95% CI for the mean difference = [0.15, 0.44], Holm-corrected, p < .001 after Bonferroni correction, Author response image 1B). The results of the cluster-based permutation analysis were also similar (see Supplementary Material for more details).
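
      The relation between the Holm-corrected and Bonferroni-corrected p-values reported above can be illustrated with a minimal sketch. It assumes a family of three pairwise comparisons. The uncorrected values for happy vs. neutral and sad vs. neutral are back-calculated here by dividing the reported first-test Bonferroni values by three, and the happy vs. sad value is a placeholder, since only p < .001 is reported. None of these uncorrected values appear in the response itself.

```python
def bonferroni(pvals):
    """Multiply every p-value by the number of comparisons, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def holm(pvals):
    """Holm step-down adjustment: the k-th smallest p-value is multiplied by
    (m - k + 1), and adjusted values are forced to be non-decreasing."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * pvals[idx]))
        adjusted[idx] = running_max
    return adjusted


# Assumed uncorrected p-values (first test): happy vs. neutral, sad vs. neutral,
# happy vs. sad. The first two are inferred, the third is a placeholder.
raw = [0.016, 0.011, 0.0003]

print([round(p, 3) for p in holm(raw)])
print([round(p, 3) for p in bonferroni(raw)])
```

      Under these assumptions both happy vs. neutral and sad vs. neutral receive the same Holm-adjusted value, because the step-down procedure never lets an adjusted p-value fall below the one ranked before it.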

      Author response image 1.

      Normalized mean pupil responses in the replication experiment (Experiment 1b) of Experiment 1a and its retest, using the neutral condition as baseline, plotted against happy and sad conditions. (A) In the first test, the group average pupil response to happy intact BM is significantly larger than that to sad and neutral BM, while the pupil response induced by sad BM is significantly smaller than that evoked by neutral BM, replicating the results of Experiment 1a. (B) Moreover, such results were similarly found in the second test.

      Notably, we successfully replicated the negative correlation between the happy over sad dilation effect and individual autistic traits in the first test (r(23) = -0.46, p = .023, 95% CI for the mean difference = [-0.73, -0.07], Author response image 2A). No other significant correlations were found (see Author response image 2B-C). Moreover, in the second test, such a correlation was similarly found and was even stronger (r(23) = -0.61, p = .002, 95% CI for the mean difference = [-0.81, -0.27], Author response image 2D). We’ve also performed a test-retest reliability analysis on the happy over sad pupil dilation effect and the AQ score. The results showed robust correlations. See Author response table 1 for more details.

      Author response table 1.

      Reliability of pupil size and AQ indices.

      Importantly, in the second test, we’ve also observed a significant negative correlation between AQ and the happy minus neutral pupil dilation effect (r(23) = -0.44, p = .032, 95% CI for the mean difference = [-0.72, -0.04], Author response image 2E), and a significant positive correlation between the sad minus neutral pupil size and AQ (r(23) = 0.50, p = .014, 95% CI for the mean difference = [0.12, 0.75], Author response image 2F). This indicated that the overall correlation between happy over sad dilation effect and AQ was driven both by the diminished happy dilation effect as well as the sad constriction effect. Overall, our replication experiment consistently found a significant negative correlation between AQ and happy over sad dilation effect both in the test and the retest. Moreover, it revealed that such an effect was contributed by both a negative correlation between AQ and happy-neutral pupil response and a positive correlation between AQ and sad-neutral pupil response, demonstrating a general impairment in BM emotion perception (happy or sad) for individuals with high autistic tendencies. This also indicated the utility of adopting a test-retest pupil examination to more precisely detect individual autistic tendencies. We have added these points in the revised text (see lines 135-173, lines 178-180).
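
      As a check on the intervals reported with the correlations above, a 95% confidence interval for a Pearson correlation can be sketched with the Fisher z-transformation. This is an assumption about how the intervals were obtained, since the response does not state the method. The function name `pearson_ci` is hypothetical, and n = 24 is taken from the stated sample size of Experiment 1b.

```python
import math


def pearson_ci(r, n, z_crit=1.959964):
    """Approximate two-sided 95% CI for a Pearson r via the Fisher z-transformation.

    r      : sample correlation coefficient
    n      : number of participants
    z_crit : standard normal quantile for the desired coverage (default 95%)
    """
    z = math.atanh(r)               # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)     # approximate standard error of z
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Happy-over-sad dilation effect vs. AQ, first test and retest (values from the text).
for r in (-0.46, -0.61):
    lo, hi = pearson_ci(r, 24)
    print(f"r = {r:+.2f}: 95% CI = [{lo:.2f}, {hi:.2f}]")
```

      With n = 24 this yields roughly [-0.73, -0.07] and [-0.81, -0.27], which agrees with the reported values and suggests that the intervals labelled "for the mean difference" are intervals on r itself.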

      Author response image 2.

      Correlation results for pupil modulation effects and AQ scores in the replication experiment (Experiment 1b) of Experiment 1a and its retest. (A) We replicated the negative correlation between the happy over sad pupil dilation effect and AQ in the first test. (B-C) No other significant correlations were found. (D) In the second test, the negative correlation between the happy over sad pupil dilation effect and AQ was similarly observed and even stronger. (E-F) Moreover, the happy vs. neutral pupil dilation effect and the sad vs. neutral pupil constriction effect respectively correlate with AQ in the second test.

      It would be helpful to add discussions as a context to compare the current results with pupil size reactions to emotion signals in picture stimuli.

      Thanks for this thoughtful comment. The modulation of emotional information on pupil responses has been mostly investigated using picture stimuli. Bradley et al. (2008) first demonstrated that humans showed larger pupil responses towards emotional images as compared to neutral images, while no difference was observed between the positive and negative images. This was regarded as the result of increased sympathetic activity induced by emotional arousal that is independent of the emotional valence. Similar results have been replicated with different presentation durations, repetition settings, and tasks (Bradley & Lang, 2015; Snowden et al., 2016). However, the emotional stimuli adopted in these studies were mostly complicated scene images that conveyed rather general emotional information. When it comes to the specific emotion cues (e.g., fear, anger, happy, sad) delivered by our conspecifics through biologically salient signals (e.g., faces, gestures, voices), the results became mixed. Some studies demonstrated that fearful, disgusted, and angry static faces induced larger pupil sizes than the neutral face, while sad and happy faces failed to induce such pupil dilatory effects (Burley et al., 2017). In contrast, other studies observed larger pupil responses for happy faces as compared to sad and fearful faces (Aktar et al., 2018; Burley & Daughters, 2020; Jessen et al., 2016). These conflicting results could be due to the low-level confounds of emotional faces (e.g., eye size) (Carsten et al., 2019; Harrison et al., 2006). Similar to faces, BM also conveys salient cues concerning the emotional states of our interactive partners. However, BM stimuli are highly simplified and deprived of various irrelevant visual confounders (e.g., body shape). Here, we reported that the happy BM induced a stronger pupil response than the neutral and sad BM, lending support to the happy dilation effect observed with faces (Burley & Daughters, 2020; Prunty et al., 2021). 
Moreover, it helps ameliorate the concern regarding the low-level confounding factors by identifying similar pupil modulations in another type of social signal with distinctive perceptual features. We have added these points to the revised text (see lines 301-321).

      References:

      Aktar, E., Mandell, D. J., de Vente, W., Majdandžić, M., Oort, F. J., van Renswoude, D. R., Raijmakers, M. E. J., & Bögels, S. M. (2018). Parental negative emotions are related to behavioral and pupillary correlates of infants’ attention to facial expressions of emotion. Infant Behavior and Development, 53, 101–111. https://doi.org/10.1016/j.infbeh.2018.07.004

      Bradley, M. M., & Lang, P. J. (2015). Memory, emotion, and pupil diameter: repetition of natural scenes. Psychophysiology, 52(9), 1186–1193. https://doi.org/10.1111/psyp.12442

      Bradley, M. M., Miccoli, L., Escrig, M. A., & Lang, P. J. (2008). The pupil as a measure of emotional arousal and autonomic activation. Psychophysiology, 45(4), 602–607. https://doi.org/10.1111/j.1469-8986.2008.00654.x

      Burley, D. T., & Daughters, K. (2020). The effect of oxytocin on pupil response to naturalistic dynamic facial expressions. Hormones and Behavior, 125, 104837. https://doi.org/10.1016/j.yhbeh.2020.104837

      Burley, D. T., Gray, N. S., & Snowden, R. J. (2017). As far as the eye can see: relationship between psychopathic traits and pupil response to affective stimuli. PLOS ONE, 12(1), e0167436. https://doi.org/10.1371/journal.pone.0167436

      Carsten, T., Desmet, C., Krebs, R. M., & Brass, M. (2019). Pupillary contagion is independent of the emotional expression of the face. Emotion, 19(8), 1343–1352. https://doi.org/10.1037/emo0000503

      Harrison, N. A., Singer, T., Rotshtein, P., Dolan, R. J., & Critchley, H. D. (2006). Pupillary contagion: central mechanisms engaged in sadness processing. Social Cognitive and Affective Neuroscience, 1(1), 5–17. https://doi.org/10.1093/scan/nsl006

      Jessen, S., Altvater-Mackensen, N., & Grossmann, T. (2016). Pupillary responses reveal infants’ discrimination of facial emotions independent of conscious perception. Cognition, 150, 163–169. https://doi.org/10.1016/j.cognition.2016.02.010

      Prunty, J. E., Keemink, J. R., & Kelly, D. J. (2021). Infants show pupil dilatory responses to happy and angry facial expressions. Developmental Science, 25(2). https://doi.org/10.1111/desc.13182

      Snowden, R. J., O’Farrell, K. R., Burley, D., Erichsen, J. T., Newton, N. V., & Gray, N. S. (2016). The pupil’s response to affective pictures: role of image duration, habituation, and viewing mode. Psychophysiology, 53(8), 1217–1223. https://doi.org/10.1111/psyp.12668

      Overall, I think this is a well-written paper with solid experimental results that support the claim of the authors, i.e., the human visual system may process emotional information in biological motion at multiple levels. Given the key role of emotion processing in normal social cognition, the results will be of interest not only to basic scientists who study visual perception, but also to clinical researchers who work with patients of social cognitive disorders. In addition, this paper suggests that examining pupil size responses could be a very useful methodological tool to study brain mechanisms underlying emotion processing.

      Reviewer #3 (Public Review):

      Summary:

      The overarching goal of the authors was to understand whether emotional information conveyed through point-light biological motion can trigger automatic physiological responses, as reflected in pupil size.

      Strengths:

      This manuscript has several noticeable strengths: it addresses an intriguing research question that fills a gap in the existing literature, presents the current literature clearly and accurately, and conducts a series of experiments and control experiments with adequate sample sizes. Yet, it also entails several noticeable limitations, especially in the study design and statistical analyses.

      Weaknesses:

      (1) Study design:

      (1.1) Dependent variable:

      Emotional attention is known to modulate both microsaccades and pupil size. Given the existing pupillometry data that the authors have collected, it would be both possible and valuable to determine whether the rate of microsaccades is also influenced by emotional biological motion.

      We thank the reviewer for this advice. Microsaccades functioned as a mechanism to maintain visibility by continuously shifting the retinal image to overcome visual adaptation (Martinez-Conde et al., 2006). Moreover, it was found to be sensitive to attention processes (Baumeler et al., 2020; Engbert & Kliegl, 2003b; Meyberg et al., 2017), and could reflect the activity of superior colliculus (SC) and other related brain areas (Martinez-Conde et al., 2009, 2013). Previous studies have found that, compared with neutral and pleasant images, unpleasant images significantly inhibit early microsaccade rates (Kashihara, 2020; Kashihara et al., 2013). This is regarded as the result of retaining previous crucial information at the sacrifice of updating new visual input. We agree with the reviewer that it would be valuable to investigate whether emotional information conveyed by BM could modulate microsaccades. However, it should be noted that our data collection and experimental design are not optimized for this purpose. This is because we have only recorded the left eye’s data, while abundant methodological studies have doubted the reliability of using only one eye’s data to analyze microsaccades (Fang et al., 2018; Hauperich et al., 2020; Nyström et al., 2017) and suggested that the microsaccades should be defined by spontaneous binocular eye movement (Engbert & Kliegl, 2003a, 2003b). Besides, according to Kashihara et al. (2013), participants showed differential microsaccade rates after the stimuli disappeared so as to maintain the previously observed different emotional information. However, in the current study, we discarded the data after the stimuli disappeared, making it impossible to analyze the microsaccade data after the stimuli disappeared. Despite these disadvantages, we have attempted to analyze the microsaccade rate during the stimuli presentation using only the left eye’s data. Specifically, we applied the algorithm developed by Otero-Millan et al. 
(2014) (minimum duration = 6 ms, maximum amplitude = 1.5 degrees, maximum velocity = 150 degrees/sec) to the left eye’s data from 100 ms before to 4000 ms after stimulus onset. Subsequently, we calculated the microsaccade rates using a moving window of 100 ms (stepped in 1 ms) (Engbert & Kliegl, 2003b; Kashihara et al., 2013). The microsaccade rate displayed a typical curve, with suppression shortly after stimulus appearance (inhibition phase), followed by an increased rate of microsaccade occurrence (rebound phase). The cluster-based permutation analysis was then applied to explore the modulation of BM emotions on microsaccade rates. However, no significant differences among the emotional conditions (happy, sad, neutral) were found in any of the four experiments.
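
      The moving-window rate computation described above can be sketched as follows. This is a minimal illustration that assumes microsaccade onsets have already been detected and are expressed in ms relative to stimulus onset. The function name `microsaccade_rate`, the centred half-open window, and the averaging across trials are assumptions rather than details given in the response, and the detection step itself (Otero-Millan et al., 2014) is not reproduced.

```python
from bisect import bisect_left


def microsaccade_rate(onsets_per_trial, t_start=-100, t_end=4000,
                      window_ms=100, step_ms=1):
    """Trial-averaged microsaccade rate (Hz) in a sliding window.

    onsets_per_trial : list of lists of microsaccade onset times (ms)
    Returns (centres, rates): window centres in ms and the rate at each centre.
    """
    half = window_ms / 2
    n_trials = len(onsets_per_trial)
    sorted_trials = [sorted(trial) for trial in onsets_per_trial]
    centres, rates = [], []
    centre = t_start + half
    while centre <= t_end - half:
        count = 0
        for onsets in sorted_trials:
            # Number of onsets falling in [centre - half, centre + half).
            count += bisect_left(onsets, centre + half) - bisect_left(onsets, centre - half)
        centres.append(centre)
        # Events per trial divided by the window length in seconds gives Hz.
        rates.append(count / n_trials / (window_ms / 1000.0))
        centre += step_ms
    return centres, rates


# Toy input: two trials, one with a single microsaccade at 500 ms, one with none.
centres, rates = microsaccade_rate([[500.0], []])
```

      The resulting per-condition rate curves would then be compared with the cluster-based permutation test mentioned in the text.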

      Author response image 3.

      Time-series change in the microsaccade rates to happy, sad, and neutral BM in Experiments 1-4. Solid lines represent microsaccade rates under each emotional condition as a function of time (happy: red; sad: blue; neutral: gray); shaded areas represent the SEM between participants. No significant differences were found after cluster-based permutation correction for the four experiments.

      It is important to note that the microsaccade rate analysis was conducted on only the left eye’s data and that the experiment design is not optimized for this analysis. Thus, extra caution should be exercised in interpreting the results. Still, we find it innovative and important to combine the microsaccade index with pupil size to holistically investigate the processing of emotional information in BM, and future studies should adopt more suitable recording techniques and experiment designs to further probe this issue. We have discussed this issue in the revised text (see lines 339-344).

      References:

      Baumeler, D., Schönhammer, J. G., & Born, S. (2020). Microsaccade dynamics in the attentional repulsion effect. Vision Research, 170, 46–52. https://doi.org/10.1016/j.visres.2020.03.009

      Engbert, R., & Kliegl, R. (2003a). Binocular coordination in microsaccades. In The Mind’s Eye (pp. 103–117). Elsevier. https://doi.org/10.1016/b978-044451020-4/50007-4

      Engbert, R., & Kliegl, R. (2003b). Microsaccades uncover the orientation of covert attention. Vision Research, 43(9), 1035–1045. https://doi.org/10.1016/s0042-6989(03)00084-1

      Fang, Y., Gill, C., Poletti, M., & Rucci, M. (2018). Monocular microsaccades: do they really occur? Journal of Vision, 18(3), 18. https://doi.org/10.1167/18.3.18

      Hauperich, A.-K., Young, L. K., & Smithson, H. E. (2020). What makes a microsaccade? a review of 70 years research prompts a new detection method. Journal of Eye Movement Research, 12(6). https://doi.org/10.16910/jemr.12.6.13

      Kashihara, K. (2020). Microsaccadic modulation evoked by emotional events. Journal of Physiological Anthropology, 39(1). https://doi.org/10.1186/s40101-020-00238-6

      Kashihara, K., Okanoya, K., & Kawai, N. (2013). Emotional attention modulates microsaccadic rate and direction. Psychological Research, 78(2), 166–179. https://doi.org/10.1007/s00426-013-0490-z

      Martinez-Conde, S., Macknik, S. L., Troncoso, X. G., & Dyar, T. A. (2006). Microsaccades counteract visual fading during fixation. Neuron, 49(2), 297–305. https://doi.org/10.1016/j.neuron.2005.11.033

      Martinez-Conde, S., Macknik, S. L., Troncoso, X. G., & Hubel, D. H. (2009). Microsaccades: a neurophysiological analysis. Trends in Neurosciences, 32(9), 463–475. https://doi.org/10.1016/j.tins.2009.05.006

      Martinez-Conde, S., Otero-Millan, J., & Macknik, S. L. (2013). The impact of microsaccades on vision: towards a unified theory of saccadic function. Nature Reviews Neuroscience, 14(2), 83–96. https://doi.org/10.1038/nrn3405

      Meyberg, S., Sinn, P., Engbert, R., & Sommer, W. (2017). Revising the link between microsaccades and the spatial cueing of voluntary attention. Vision Research, 133, 47–60. https://doi.org/10.1016/j.visres.2017.01.001

      Nyström, M., Andersson, R., Niehorster, D. C., & Hooge, I. (2017). Searching for monocular microsaccades – a red herring of modern eye trackers? Vision Research, 140, 44–54. https://doi.org/10.1016/j.visres.2017.07.012

      Otero-Millan, J., Castro, J. L. A., Macknik, S. L., & Martinez-Conde, S. (2014). Unsupervised clustering method to detect microsaccades. Journal of Vision, 14(2), 18–18. https://doi.org/10.1167/14.2.18

      (1.2) Stimuli:

      It appears that the speed of the emotional biological motion stimuli mimics the natural pace of the emotional walker. What is the average velocity of the biological motion stimuli for each condition?

      Thanks for pointing out this issue. The neutral and emotional (sad or happy) BM stimuli are equal in walking speed (one step per second, 1 Hz). We have also computed their physical velocity by calculating the Euclidean distance in pixel space of each key point between adjacent frames (Poyo Solanas et al., 2020). The velocity was 5.76 pixels/frame for the happy BM, 4.14 pixels/frame for the neutral BM, and 3.21 pixels/frame for the sad BM. This difference in velocity profile is considered an important signature for conveying emotional information, as the happy walker is characterized by a larger step pace and longer arm swing, whereas the sad walker exhibits a slouching gait with short slow strides and smaller arm movement (Barliya et al., 2012; Chouchourelou et al., 2006; Halovic & Kroos, 2018; Roether et al., 2009). More importantly, our current results cannot be explained by the differences in velocity. This is because the inverted emotional BM, with identical velocity characteristics, failed to induce any modulation of pupil responses. Furthermore, the local sad and happy BM differed the most in velocity, yet they induced similar modulations of pupil size. We have added these points in the revised text (see lines 254-257, 484-491).
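
      The velocity measure described above can be sketched as the Euclidean displacement of each key point between adjacent frames. The data layout, the function name `mean_keypoint_velocity`, and the averaging over all key points and frame pairs are assumptions for illustration; the response does not specify how the per-point distances were aggregated.

```python
import math


def mean_keypoint_velocity(frames):
    """Mean key-point displacement in pixels per frame.

    frames : list of frames, each a list of (x, y) pixel coordinates,
             with key points in the same order in every frame.
    """
    total = 0.0
    n = 0
    for prev, curr in zip(frames, frames[1:]):
        for (x0, y0), (x1, y1) in zip(prev, curr):
            # Euclidean distance travelled by this key point between two frames.
            total += math.hypot(x1 - x0, y1 - y0)
            n += 1
    return total / n if n else 0.0


# Toy walker with two key points over three frames (hypothetical coordinates).
toy_frames = [
    [(0, 0), (0, 0)],
    [(3, 4), (0, 0)],
    [(3, 4), (6, 8)],
]
print(mean_keypoint_velocity(toy_frames))  # 3.75
```

      Applied to each stimulus separately, a measure of this kind would yield one pixels/frame value per emotion condition, comparable in form to the figures quoted above.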

      References:

      Barliya, A., Omlor, L., Giese, M. A., Berthoz, A., & Flash, T. (2012). Expression of emotion in the kinematics of locomotion. Experimental Brain Research, 225(2), 159–176. https://doi.org/10.1007/s00221-012-3357-4

      Chouchourelou, A., Matsuka, T., Harber, K., & Shiffrar, M. (2006). The visual analysis of emotional actions. Social Neuroscience, 1(1), 63–74. https://doi.org/10.1080/17470910600630599

      Halovic, S., & Kroos, C. (2018). Not all is noticed: kinematic cues of emotion-specific gait. Human Movement Science, 57, 478–488. https://doi.org/10.1016/j.humov.2017.11.008

      Poyo Solanas, M., Vaessen, M. J., & de Gelder, B. (2020). The role of computational and subjective features in emotional body expressions. Scientific Reports, 10(1). https://doi.org/10.1038/s41598-020-63125-1

      Roether, C. L., Omlor, L., Christensen, A., & Giese, M. A. (2009). Critical features for the perception of emotion from gait. Journal of Vision, 9(6), 15–15. https://doi.org/10.1167/9.6.15

      When the authors used inverted biological motion stimuli, they didn't observe any modulation in pupil size. Could there be a difference in microsaccades when comparing inverted emotional biological motion stimuli?

      Thanks for this consideration. Both microsaccades and pupil size can provide valuable insights into the neural dynamics underlying attention and cognitive control (Baumeler et al., 2020; Engbert & Kliegl, 2003; Meyberg et al., 2017). Notably, previous studies have shown that microsaccades and pupil size can be highly correlated in reflecting various cognitive processes, such as multisensory integration, inhibitory control, and cognitive load (Krejtz et al., 2018; Wang et al., 2017; Wang & Munoz, 2021). Moreover, the generation of both microsaccades and pupil responses involves shared neural circuits, including the midbrain superior colliculus (SC) and the noradrenergic system (Hafed et al., 2009; Hafed & Krauzlis, 2012; Wang et al., 2012). However, pupil size can be more sensitive than microsaccade rate in contexts such as affective priming (Krejtz et al., 2020) and decision formation (Strauch et al., 2018). In addition, numerous previous studies have shown that inversion significantly disrupts the perception of emotions from BM (Atkinson et al., 2007; Dittrich et al., 1996; Spencer et al., 2016; Yuan et al., 2022, 2023). Overall, microsaccade rates are unlikely to differ significantly across inverted emotional BM stimuli. We have also analyzed the microsaccade rate in the inverted BM condition and found no significant differences (see also Point 1.1, Author response image 3). Still, future studies should combine the microsaccade index and pupil size to provide a thorough understanding of BM emotion processing. We have discussed this issue in the revised text (see lines 339-344).

      References:

      Atkinson, A. P., Tunstall, M. L., & Dittrich, W. H. (2007). Evidence for distinct contributions of form and motion information to the recognition of emotions from body gestures. Cognition, 104(1), 59–72. https://doi.org/10.1016/j.cognition.2006.05.005

      Baumeler, D., Schönhammer, J. G., & Born, S. (2020). Microsaccade dynamics in the attentional repulsion effect. Vision Research, 170, 46–52. https://doi.org/10.1016/j.visres.2020.03.009

      Dittrich, W., Troscianko, T., Lea, S., & Morgan, D. (1996). Perception of emotion from dynamic point-light displays represented in dance. Perception, 25(6), 727–738. https://doi.org/10.1068/p250727

      Engbert, R., & Kliegl, R. (2003). Microsaccades uncover the orientation of covert attention. Vision Research, 43(9), 1035–1045. https://doi.org/10.1016/s0042-6989(03)00084-1

      Hafed, Z. M., Goffart, L., & Krauzlis, R. J. (2009). A neural mechanism for microsaccade generation in the primate superior colliculus. Science, 323(5916), 940–943. https://doi.org/10.1126/science.1166112

      Hafed, Z. M., & Krauzlis, R. J. (2012). Similarity of superior colliculus involvement in microsaccade and saccade generation. Journal of Neurophysiology, 107(7), 1904–1916.

      Krejtz, K., Duchowski, A. T., Niedzielska, A., Biele, C., & Krejtz, I. (2018). Eye tracking cognitive load using pupil diameter and microsaccades with fixed gaze. PLOS ONE, 13(9), e0203629. https://doi.org/10.1371/journal.pone.0203629

      Krejtz, K., Żurawska, J., Duchowski, A., & Wichary, S. (2020). Pupillary and microsaccadic responses to cognitive effort and emotional arousal during complex decision making. Journal of Eye Movement Research, 13(5). https://doi.org/10.16910/jemr.13.5.2

      Meyberg, S., Sinn, P., Engbert, R., & Sommer, W. (2017). Revising the link between microsaccades and the spatial cueing of voluntary attention. Vision Research, 133, 47–60. https://doi.org/10.1016/j.visres.2017.01.001

      Spencer, J. M. Y., Sekuler, A. B., Bennett, P. J., Giese, M. A., & Pilz, K. S. (2016). Effects of aging on identifying emotions conveyed by point-light walkers. Psychology and Aging, 31(1), 126–138. https://doi.org/10.1037/a0040009

      Strauch, C., Greiter, L., & Huckauf, A. (2018). Pupil dilation but not microsaccade rate robustly reveals decision formation. Scientific Reports, 8(1). https://doi.org/10.1038/s41598-018-31551-x

      Wang, C.-A., Blohm, G., Huang, J., Boehnke, S. E., & Munoz, D. P. (2017). Multisensory integration in orienting behavior: pupil size, microsaccades, and saccades. Biological Psychology, 129, 36–44. https://doi.org/10.1016/j.biopsycho.2017.07.024

      Wang, C.-A., Boehnke, S. E., White, B. J., & Munoz, D. P. (2012). Microstimulation of the monkey superior colliculus induces pupil dilation without evoking saccades. Journal of Neuroscience, 32(11), 3629–3636. https://doi.org/10.1523/jneurosci.5512-11.2012

      Wang, C.-A., & Munoz, D. P. (2021). Differentiating global luminance, arousal and cognitive signals on pupil size and microsaccades. European Journal of Neuroscience, 54(10), 7560–7574. https://doi.org/10.1111/ejn.15508

      Yuan, T., Ji, H., Wang, L., & Jiang, Y. (2022). Happy is stronger than sad: emotional information modulates social attention. Emotion. https://doi.org/10.1037/emo0001145

      Yuan, T., Wang, L., & Jiang, Y. (2023). Cross-channel adaptation reveals shared emotion representation from face and biological motion. Emotion. In press.

      (2) Statistical analyses

      (2.1) Multiple comparisons:

      There are many posthoc comparisons throughout the manuscript. The authors should consider correction for multiple comparisons. Take Experiment 1 for example, it is important to note that the happy over neutral BM effect and the sad over neutral BM effect are no longer significant after Bonferroni correction, which is worth noting.

      Thanks for this suggestion. In our original analysis, we applied the Holm post-hoc correction for multiple comparisons. The Holm correction is a step-down method that is more powerful, and therefore less conservative, than the Bonferroni correction. We have now also conducted the stricter Bonferroni post-hoc correction. In Experiment 1, the happy over neutral and happy over sad BM effects are still significant after the Bonferroni post-hoc correction (happy vs. neutral: p = .036; happy vs. sad: p = .009), and the sad over neutral comparison remains marginally significant (p = .071). Importantly, the test-retest replication experiment also yielded significant results for the comparisons between happy and neutral (First Test: p = .022, Holm-corrected, p = .048, Bonferroni-corrected; Second Test: p = .005, Holm-corrected, p = .008, Bonferroni-corrected), sad and neutral (First Test: p = .022, Holm-corrected, p = .033, Bonferroni-corrected; Second Test: p = .005, Holm-corrected, p = .012, Bonferroni-corrected, Author response image 1B), and happy and sad BM (First Test: p < .001, Holm-corrected, p < .001, Bonferroni-corrected; Second Test: p < .001, Holm-corrected, p < .001, Bonferroni-corrected). These results support the replicability and consistency of the reported significant contrasts. See also Point 2.3.

      In Experiment 4, the significance levels of all comparisons remained the same after Bonferroni post-hoc correction (happy vs. neutral: p = .011; sad vs. neutral: p = .007; happy vs. sad: p = 1.000). We have now added these results in the main text (See lines 119, 122, 124, 143, 145, 148, 150, 153, 155, 248, 251, 254).
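
      To make the difference between the two corrections concrete, the sketch below implements the standard Bonferroni and Holm adjustments for a family of p-values. It is a generic illustration of the two procedures, not the software used for the analyses above; the function names are chosen for the example, and any raw p-values passed in are supplied by the user rather than taken from the manuscript.

```python
def bonferroni_adjust(p_values):
    """Bonferroni: multiply every p-value by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm_adjust(p_values):
    """Holm step-down: the k-th smallest p-value is multiplied by (m - k + 1),
    and adjusted values are forced to be non-decreasing in that order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):  # rank 0 is the smallest p-value
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted
```

      Because the Holm multiplier shrinks from m to 1 as p-values increase, a Holm-adjusted p-value is never larger than the corresponding Bonferroni-adjusted one, which is why Holm is the more powerful of the two.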

      (2.2) The authors present the correlation between happy over sad dilation effect and the autistic traits in Experiment 1, but do not report such correlations in Experiments 2-4. Did the authors collect the Autistic Quotient measure in Experiments 2-4? It would be informative if the authors could demonstrate the reproducibility (or lack thereof) of this happy-sad index in Experiments 2-4.

      We apologize for not making it clear. We have collected the AQ scores in Experiments 2-4. However, it should be pointed out that the happy over sad pupil dilation effect was only observed in Experiment 1. Moreover, we’ve again identified such happy over sad pupil dilation effect in the replication experiment (Experiment 1b) as well as its correlation with AQ. Instead, no significant correlations between AQ and the happy-sad pupil index were found in Experiments 2-4, see Author response image 4 for more details. We have reported these correlations in the main text (see lines 157-173, 190-194, 212-216, 257-262).

      Author response image 4.

      Correlations between the happy over sad pupil dilation effect and AQ scores. (A) The happy over sad pupil dilation effect correlated negatively with individual autistic scores. (B-C) Such correlation was similarly observed in the test and retest of the replication experiment. (D-F) No such correlations were found for the inverted, nonbiological, and local BM stimuli.

      (2.3) The observed correlation between happy over sad dilation effect and the autistic traits in Experiment 1 seems rather weak. It could be attributed to the poor reliability of the Autistic Quotient measure or the author-constructed happy-sad index. Did the authors examine the test-retest reliability of their tasks or the Autistic Quotient measure?

      Thanks for this suggestion. We have now conducted a test-retest replication study to further confirm the observed significant correlations. Specifically, we recruited a new group of 24 participants (16 females, 8 males) to perform the identical procedure as in Experiment 1, and they were asked to return to the lab for a retest after at least seven days. We’ve replicated the significant main effect of emotional conditions in both the first test (F(2, 46) = 12.0, p < .001, ηp2 = 0.34) and the second test (F(2, 46) = 14.8, p < .001, ηp2 = 0.39). Besides, we also replicated the happy minus neutral pupil dilation effect (First Test: t(23) = 2.60, p = .022, Cohen’s d = 0.53, 95% CI for the mean difference = [0.02, 0.14], Holm-corrected, p = .048 after Bonferroni correction; Second Test: t(23) = 3.36, p = .005, Cohen’s d = 0.68, 95% CI for the mean difference = [0.06, 0.24], Holm-corrected, p = .008 after Bonferroni correction), and the sad minus neutral pupil constriction effect (First Test: t(23) = -2.77, p = .022, Cohen’s d = 0.57, 95% CI for the mean difference = [-0.19, -0.03], Holm-corrected, p = .033 after Bonferroni correction; Second Test: t(23) = -3.19, p = .005, Cohen’s d = 0.65, 95% CI for the mean difference = [-0.24, -0.05], Holm-corrected, p = .012 after Bonferroni correction). Additionally, the happy BM still induced a significantly larger pupil response than the sad BM (first test: t(23) = 4.23, p < .001, Cohen’s d = 0.86, 95% CI for the mean difference = [0.10, 0.28], Holm-corrected, p < .001 after Bonferroni correction; second test: t(23) = 4.26, p < .001, Cohen’s d = 0.87, 95% CI for the mean difference = [0.15, 0.44], Holm-corrected, p < .001 after Bonferroni correction).
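
      As a quick consistency check on the effect sizes above, Cohen's d for a paired comparison can be recovered from the t statistic and the number of participants. The sketch assumes that d was computed as the mean difference divided by the standard deviation of the differences (so that d = |t| / sqrt(n)) and reported as a magnitude; the response does not state which variant was used, and the function name is hypothetical.

```python
import math


def cohens_d_paired(t_value, n_pairs):
    """Magnitude of Cohen's d for a paired t-test, assuming d = |t| / sqrt(n)."""
    if n_pairs < 2:
        raise ValueError("need at least two paired observations")
    return abs(t_value) / math.sqrt(n_pairs)
```

      With n = 24, the t values reported above give effect sizes that agree with the reported Cohen's d values to within rounding of the t statistics.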

      Notably, we’ve successfully replicated the negative correlation between the happy over sad dilation effect and individual autistic traits (r(23) = -0.46, p = .023, 95% CI = [-0.73, -0.07]). This correlation was again found, and was even stronger, in the retest (r(23) = -0.61, p = .002, 95% CI = [-0.81, -0.27]). A test-retest reliability analysis was conducted on the happy over sad pupil dilation effect and the AQ score. The results showed robust correlations (r(happy-sad pupil size) = 0.56; r(AQ) = 0.90) and strong test-retest reliabilities (α(happy-sad pupil size) = 0.60; α(AQ) = 0.82). We have added these results to the main text (see lines 135-173). See also Response to Reviewer #2 Response 1 for more details.
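
      The confidence intervals for the correlations above are consistent with the standard Fisher z-transformation interval for a Pearson correlation with n = 24. The sketch below shows that computation; whether the authors' software used exactly this method is an assumption, and the function name is chosen for the example.

```python
import math
from statistics import NormalDist


def pearson_ci(r, n, level=0.95):
    """Confidence interval for a Pearson r via the Fisher z-transformation."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z = math.atanh(r)                       # Fisher z of the sample correlation
    se = 1.0 / math.sqrt(n - 3)             # standard error of z
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)
```

      For r = -0.46 and r = -0.61 with n = 24, this approach gives intervals that match the reported bounds to two decimal places.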

      (2.4) Relatedly, the happy over sad dilation effect is essentially a subtraction index. Without separately presenting the pupil size correlation with happy and sad BM in supplemental figures, it becomes challenging to understand what's primarily driving the observed correlation.

      Thanks for pointing this out. We have now presented the separate correlations between AQ and the pupil response towards happy and sad BM in Experiment 1 (see Author response image 5A), and the test-retest replication experiment of Experiment 1 (see Author response image 5B-C). No significant correlations were found. This is potentially because the raw pupil response is a mixed result of BM perception and emotion perception, while the variations in pupil sizes across emotional conditions could more faithfully reflect individual sensitivities to emotions in BM (Burley et al., 2017; Pomè et al., 2020; Turi et al., 2018).  

      Author response image 5.

      No significant correlations between AQ and pupil response towards happy and sad intact BM were found in Experiment 1a and the test-retest replication experiment (Experiment 1b).

      To probe what's primarily driving the observed correlation between happy-sad pupil size and AQ, we instead used the neutral condition as the baseline and separately correlated AQ with the happy-neutral and the sad-neutral pupil modulation effects. No significant correlation was found in Experiment 1a (Author response image 6A-B) or in the first test of the replication experiment (Experiment 1b) (Author response image 6C-D). Importantly, in the second test of the replication experiment, we found a significant negative correlation between AQ and the happy-neutral pupil size (r(23) = -0.44, p = .032, 95% CI = [-0.72, -0.04], Author response image 6E), and a significant positive correlation between AQ and the sad-neutral pupil size (r(23) = 0.50, p = .014, 95% CI = [0.12, 0.75], Author response image 6F). This suggests that the overall correlation between AQ and the happy over sad dilation effect was driven by diminished pupil modulations towards both the happy and sad BM for high AQ individuals, demonstrating a general deficiency in BM emotion perception (happy or sad) among individuals with high autistic tendencies. It further reveals the potential of adopting a test-retest pupil examination to more precisely detect individual autistic tendencies. We have reported these results in the main text (see lines 166-173).

      Author response image 6.

      Correlation results for pupil modulations and AQ scores. (A-B) In Experiment 1a, no significant correlation was observed between AQ and the happy pupil modulation effect, or between AQ and the sad pupil modulation effect. (C-D) Similarly, no significant correlations were found in the first test of the replication experiment (Experiment 1b). (E-F) Importantly, in the second test of Experiment 1b, the happy vs. neutral pupil dilation effect was negatively correlated with AQ, and the sad vs. neutral pupil constriction effect was positively correlated with AQ.

      References:

      Burley, D. T., Gray, N. S., & Snowden, R. J. (2017). As Far as the Eye Can See: Relationship between Psychopathic Traits and Pupil Response to Affective Stimuli. PLOS ONE, 12(1), e0167436. https://doi.org/10.1371/journal.pone.0167436

      Pomè, A., Binda, P., Cicchini, G. M., & Burr, D. C. (2020). Pupillometry correlates of visual priming, and their dependency on autistic traits. Journal of Vision, 20(3), 3–3.

      Turi, M., Burr, D. C., & Binda, P. (2018). Pupillometry reveals perceptual differences that are tightly linked to autistic traits in typical adults. eLife, 7. https://doi.org/10.7554/elife.32399

      (2.5) For the sake of transparency, it is important to report all findings, not just the positive results, throughout the paper.

      Thanks for this suggestion. We have now reported all the correlation results between AQ and pupil modulation effects (happy-sad, happy-neutral, sad-neutral) in the main text (see lines 130-131, 157-162, 166-170, 190-194, 212-216, 257-262). Given that no significant correlations were observed between AQ and the raw pupil responses across the four experiments, we reported those correlations in the supplementary material. We have stated this point in the main text (see lines 132-134).

      (3) Structure

      (3.1) The Results section immediately proceeds to the one-way repeated measures ANOVA. This section could be more reader-friendly by including a brief overview of the task procedures and variables, e.g., shifting Fig. 3 to this section.

      Thanks for this advice. We have now added a brief overview of the task procedures and variables and we have also shifted the figure position (see lines 101-103).

      Reviewer #1 (Recommendations For The Authors):

      (1) I suggest that the authors first explain the task (i.e., Fig. 3) at the beginning of the results. It also seems more appropriate to show the time course figures (Fig. 2) before the bar plots (Fig. 1). If I understand correctly, the bar plots reflect the averaged data from the time course plots. Also, please clearly state the time window used to average the data. The results of the correlation analysis can be displayed in the last step.

      Thanks for this suggestion. We have now added a concise explanation of the task at the beginning of the results (see lines 101-103). We have also adjusted the figure positions and adjusted the order of our results according to the reviewer’s suggestion. The time window we used to average the data was from the onset of the stimuli until the end of the stimuli presentation. We have now clearly stated these issues in the revised text (see lines 111-112).

      (2) According to the above, I think a more reasonable arrangement should be Fig. 3, 2, and 1.

      Thanks for this suggestion. We have adjusted the figure positions accordingly.

      (3) Please include each subject's data points in the bar plots in Fig. 1.

      We have now presented each subject’s individual data point in the bar plot.

      (4) Lines 158-160 and 199-202 report interaction effects of the two-way ANOVA. This is good, but the direction of interaction effect should also be reported.

      We thank the reviewer for this suggestion. We have now reported the direction of the interaction effect. The significant interaction across Experiment 1 and Experiment 2 was mainly due to the reduction of emotional modulation in inverted BM. The significant interaction across Experiment 1 and Experiment 3 was similarly caused by the lack of emotional modulation in nonbiological stimuli. The significant interaction across Experiment 1 and Experiment 4 could be primarily attributed to the disappearance of the pupil modulation effect between happy and sad local BM. We have specified these points in the revised text, see lines 198-199, 219-220, 267-269.

      Reviewer #3 (Recommendations For The Authors):

      (1) Number of experiments:

      As stated in the Methods section, this study seems to consist of five experiments (120/24=5) according to the description below. However, the current manuscript only reports findings from four of these experiments. Can the authors clarify on this matter?

      "A total of 120 participants (44 males, 76 females) ranging from 18 to 29 years old (M ± SD = 23.1 ± 2.5) were recruited, with 24 in each experiment."

      We apologize for not making it clear. This referred to a pure behavior explicit emotion classification experiment (N=24) that served as a prior test to confirm that the local BM stimuli conveyed recognizable emotional information. We have now more carefully stated this issue in the revised text, see lines 456-458.

      (2) Emotion processing mechanism of BM

      "Mechanism" is a very strong word, suggesting a causal relationship. In the setting of a passive viewing task that lacks any behavioral report, it is possible that the observed changes in pupil size could be epiphenomenal, rather than serving as the underlying mechanism.

      Thanks for this suggestion. We have now either changed “mechanism” into “phenomenon” or deleted it. We have also carefully discussed the potential for future studies to incorporate various behavioral, physiological, and neural indexes to yield more robust causal evidence and unveil the potential mechanism underlying the observed multi-level BM emotion processing phenomenon.

      (3) Data sharing

      The authors could improve their efforts in promoting data transparency to ensure a comprehensive view of the results. This implies sharing deidentified raw data instead of summary data in an Excel spreadsheet.

      Thanks for this suggestion. We have now uploaded the deidentified raw data (https://doi.org/10.57760/sciencedb.psych.00125).

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Recommendations For The Authors):

      1. Experiments regarding the inducible expression of MukBEF: The authors should provide western blots or rt-qPCR for MukBEF expression at 40 min and 2H.

      We provide now a western blot of MukB in non-induced and induced conditions as Figure 1-figure supplement 1D.

      2. Experiments with RiTer and LiTer constructs:

      a. Authors compare the mukB deletion against wild type (Fig. 2C). It would be additionally informative if these comparisons are made for matP deletion and wild type as well. This will strengthen the conclusion that long-range interactions in ter do increase in the absence of matP.

      We agree that the matP mutant may help the reader to compare the effect of the translocation in different backgrounds and have added it to the figure. This strengthens the conclusion that long-range interactions in ter do increase in the absence of matP in a rearranged chromosome, as observed in the WT configuration (Lioy et al., 2018).

      b. Additionally, in Fig. 2C, it appears that there is some decrease in long-range interactions in the absence of mukB in ter1 (Riter). Is this a significant change?

      The change observed is not significant. The results shown in Fig. 2C have been obtained using a 3C approach, which generated slightly more variability than Hi-C. Furthermore, we measured the range of contacts for the segment corresponding to Ter1 in RiTer (matS12-matS28), in different genetic contexts and different configurations. The results show that this level of variation is not significant (see graph below reporting two independent experiments).

      Author response image 1.

      Range of interactions measured on the interval matS12-matS18 in different genetic contexts and different configurations (MG1655 WT(1 and 2), ∆mukB, RiTer, RiTer ∆mukB).

      3. Experiments with various matS organizations: These experiments are interesting and an important part of the paper. However, it is rather hard to visualize the chromosome conformations in the strains after transposition. To aid the reader (particularly with panel E), authors can provide schematics of the chromosome conformations and anticipated/observed chromosomal interactions. Circular interaction plots would be useful here.

      We thank the reviewer for this interesting remark; we have tried in the past to represent these interactions using a circular representation (see for example the web site of Ivan Junier; https://treetimc.github.io/circhic/index.html). However, this representation is not trivial to apprehend for nonspecialists, especially in strains with a rearranged chromosome configuration. Nonetheless, we have added graphical circular representations of the chromosome configurations to help the reader.

      4. ChIP experiments:

      a. This section of the manuscript needs to be further strengthened. It is not clear whether the ChIP signal observed is significant (for example at T10 or T20 min, the peak value does not appear to go above 1.1 fold). Can the authors be sure that this small increase is not simply a consequence of increase in copy number of the loci around the origin, as replication has initiated?

      The basal value of the ChIP on the non-replicated sequences (between 0-3.5 Mb for 10 minutes and 0-3 Mb for 20 minutes) is 0.8 and 0.7, respectively, whereas the mean value of the replicated sequence is 1.6 and 1.45. The enrichment observed for these two time points is therefore about 2-fold, not 1.1-fold, and it is 4-fold at T40min. These values were obtained by dividing the number of normalized reads in the ChIP (the number of reads at each position divided by the total number of reads) by the normalized reads of the input. The increase in copy number is therefore accounted for in the calculation. Furthermore, we added a supplementary figure (Figure Sup9) in which we performed a ChIP without tags on synchronized cells; in this case, we did not observe any enrichment triggered by replication.
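
      The normalization described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the function name and the per-position read-count lists are hypothetical, and positions with no input reads are returned as NaN as an arbitrary choice for the example.

```python
def chip_enrichment(chip_reads, input_reads):
    """Per-position ChIP/input ratio after normalizing each track by its total reads."""
    if len(chip_reads) != len(input_reads):
        raise ValueError("ChIP and input tracks must cover the same positions")
    chip_total = sum(chip_reads)
    input_total = sum(input_reads)
    if chip_total == 0 or input_total == 0:
        raise ValueError("each track must contain at least one read")
    enrichment = []
    for chip, inp in zip(chip_reads, input_reads):
        norm_chip = chip / chip_total      # fraction of ChIP reads at this position
        norm_input = inp / input_total     # fraction of input reads at this position
        # Positions without input coverage have no defined ratio.
        enrichment.append(norm_chip / norm_input if norm_input > 0 else float("nan"))
    return enrichment
```

      Because replicated regions are over-represented in the input as well as in the ChIP, a locus whose ChIP signal rises only in proportion to its copy number keeps a ratio near 1 under this normalization.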

      b. Authors make a conclusion that MukB loads behind the replication fork. However, the time resolution of the presented experiments is not sufficient to be certain of this. Authors would need to perform more time-resolved experiments for the same.

      Reviewer 1 is correct; we attempted to discriminate whether the observed enrichment is (i) associated with the replication fork, since we observed a decrease in the center of the enrichment at oriC as the maximum enrichment moves away with the replication fork after 20 and 40 minutes, or (ii) associated with the newly replicated sequence. To investigate this, we attempted to induce a single round of replication by shifting the cells back to 40°C after 10 minutes at 30°C. Unfortunately, replication initiation is not immediately halted by shifting the cells to 40°C, and we were unable to induce a single round of replication. To clarify our conclusions, we modified our manuscript to read:

      “Altogether, these findings indicate that MukBEF is loaded into regions newly replicated either at the replication fork or even further behind it, except in the Ter region from which it would be excluded.”

      c. Authors conclude that in the LiTer7 strain, MukB signal is absent from Ter2. However, when compared with the ChIP profiles by eye across panels in A and B, this does not seem to be significant. In the same results sections, authors state that there is a 3-fold increase in MukB signal in other regions. The corresponding graph does not show the same.

      Rather than relying solely on the enrichment levels, which can be challenging to compare across different strains due to slight variations in replication levels, we believe there is a clear disruption in this profile that corresponds to the Ter2 sequence. Furthermore, this discontinuity in enrichment relative to the replication profile is also observable in the WT configuration. At T40min, MukB ChIPseq signals halt at the Ter boundary, even though Ter is actively undergoing replication, as evidenced by observations in the input data.

      Regarding the fold increase of MukB, Reviewer 1 is correct; we overestimated this enrichment in the text and have now corrected it.

      d. Authors should provide western blot of MukB-Flag.

      We have added Supplementary Figure 1 D, which contains a Western blot of MukB-Flag.

      5. The bioinformatic analysis of matS site distribution is interesting, but this is not followed upon. The figure (Fig 5) is better suited in the supplement and used only as a discussion point.

      We acknowledge the reviewer's point, but we used this section to attempt to extend our findings to other bacteria and emphasize the observation that even though a few matS sites are necessary to inhibit MukBEF, the Ter domains are large and centered on dif even in other bacteria.

      6. The discussion section is lacking many references and key papers have not been cited (paragraph 1 of discussion for example has no references).

      The possibility that SMC-ScpAB and MukBEF can act independent of replication has been suggested previously, but is not cited or discussed. Similarly, there is some evidence for SMC-ScpAB association with newly replicated DNA (PMID 21923769).

      We have added references to the suggested paragraph and highlighted the fact that MukBEF's activity independent of replication was already known. However, we believe that the situation is less clear for SMC-ScpAB in B. subtilis or C. crescentus. Similarly, we found no clear evidence that SMC-ScpAB is associated with newly replicated DNA in the referenced studies.

      To clarify and enrich the discussion section, we have added a paragraph that provides perspective on the loading mechanisms of SMC-ScpAB and MukBEF.

      7. There are minor typographical errors that should be corrected. Some are highlighted here:

      a. Abstract: L5: "preferentially 'on' instead of 'in'"

      b. Introduction: Para 1 L8: "features that determine"

      c. Introduction: Para 2 L1: please check the phrasing of this line

      d. Results section 2: L1: Ter "MD" needs to be explained

      e. Page 8: Para 2: L6: "shows that 'a'"

      g. Page 13: Para 2: "MukBEF activity...". This sentence needs to be fixed.

      i. Figure 4: "input" instead of "imput"

      We thank Reviewer 1 for pointing out all these grammatical or spelling mistakes. We have corrected them all.

      f. Page 12: Para 2: "Xer" instead of "XDS"?

      We added a reference to clarify the term.

      h. Methods: ChIP analysis: Authors state "MatP peaks", however, reported data is for MukB

      This description pertains to the matP peak detection shown in Supplementary Figure 3. We have incorporated this clarification into the text.

      j. Supplementary figure legends need to be provided (currently main figure legends appear to be pasted twice)

      Supplementary figure legends are provided at the end of the manuscript, and we have edited the manuscript to remove one copy of the figure legends.

      k. Authors should ensure sequencing data are deposited in an appropriate online repository and an accession number is provided.

      We waited for the appropriate timing in the editing process to upload our data, which we have now done. Additionally, we have added a data availability section to the manuscript, including sequence references on the NCBI.

      Reviewer #2 (Recommendations For The Authors):

      The authors largely avoid speculation on what might be the physiological relevance of the exclusion of MukBEF (and Smc-ScpAB) from the replication termination region (and the coordination with DNA replication). At this stage it would be helpful to present possible scenarios even if not yet supported by data. The authors should for example consider the following scenario: loop extrusion of a dif site in a chromosome dimer followed by dimer resolution by dif recombination leads to two chromosomes that are linked together by MukBEF (equivalent to cohesin holding sister chromatids together in eukaryotes but without a separase). This configuration (while rare) will hamper chromosome segregation. Is MatP particularly important under conditions of elevated levels of chromosome dimers? Could this even be experimentally tested? Other scenarios might also be entertained.

      Even though we prefer to avoid speculations, we agree that we may attempt to propose some hypotheses to the reader. To do so, we have added a few sentences at the end of our discussion. “We may speculate, based on in vitro observations (Kumar et al., 2022), that MukBEF could interfere with TopIV activity and delay potential chromosome decatenation. Another possibility is that chromosome dimers resolved at the dif site may become trapped in loops formed by MukBEF, thus delaying segregation. But none of these possible scenarios are supported by data yet, and a major challenge for the future is to determine whether and how MukBEF may interfere with one or both of these processes.”

      The manuscript text is well written. However, the labeling of strains in figures and text is sometimes inconsistent which can be confusing (LiTer Liter liter; e.g Riter Fig 2C). For consistency, always denote the number of matS sites in LiTer strains and also in the RiTer strain. The scheme denoting LiTer and RiTer strains should indicate the orientation of DNA segments so it is clear that the engineering does not involve inversion (correct?). Similarly: Use uniform labelling for time points: see T40mn vs 40mn vs T2H vs 2H

      We have reviewed the manuscript to standardize our labeling. Additionally, we have included a schema in Figure 2, indicating the matS numbers at the Ter border to emphasize that the transposition events do not involve inversion.

      matS sites do not have identical sequences and bind different levels of MatP (suppl fig 3). Does this possibly affect the interpretation of some of the findings (when altering few or only a single matS site)? Maybe a comment on this possibility can be added.

      We agree with the referee; we do not want to conclude too strongly about the impact of matS density, so we have added this sentence at the end of the section titled 'matS Determinants to Prevent MukBEF Activity':

      “Altogether, assuming that differences in the matS sequences do not modify MatP's ability to bind to the chromosome and affect its capacity to inhibit MukBEF, these results suggested that the density of matS sites in a small chromosomal region has a greater impact than dispersion of the same number of matS sites over a larger segment”

      Figure 5: show selected examples of matS site distribution in addition to the averaged distribution (as in supplemental figure)?

      Figure 5 shows the median of the matS distribution based on the matS positions of 16 species as displayed in the supplementary figure. We believe that this figure is interesting as it represents the overall matS distribution across the Enterobacterales, Pasteurellales, and Vibrionales.

      How do authors define 'background levels' (page 9) in their ChIP-Seq experiments? Please add a definition or reword.

      We agree that the term 'background level' here could be confusing, so we have modified it to 'basal level' to refer to the non-replicating sequence. The background level can be observed in Supplementary Figure 9 in the ChIP without tags, and, on average, the background level is 1 throughout the entire chromosome in these control experiments.

      This reviewer would naively expect the normalized ChIP-Seq signals to revolve around a ratio of 1 (Fig. 4)? They do in one panel (Figure 4B) but not in the others (Figure 4A). Please provide an explanation.

      We thank the referee for this pertinent observation. An error was made during the smoothing of the data in Figure 4A, which resulted in an underestimation of the input values. This mistake does not alter the profile of the ChIP (it's a division by a constant) or our conclusions. We provide a revised version of the figure.

      Inconsistent axis labelling: e.g Figure 4

      Enterobacterals should be Enterobacterales (?)

      KB should be kb

      MB should be Mb

      Imput should be Input

      FlaG should be Flag

      We have made the suggested modifications to the text.

      'These results unveiled that fluorescent MukBEF foci previously observed associated with the Ori region were probably not bound to DNA' Isn't the alternative scenario that MukBEF bound to distant DNA segments colocalize an equally likely scenario? Please rephrase.

      Since we lack evidence regarding what triggers the formation of a unique MukB focus associated with the origin and what this focus could represent, we have removed this sentence.

      Reviewer #3 (Recommendations For The Authors):

      The text is well-written and easy to follow, but I would suggest several improvements to make things clearer:

      1. Many plots are missing labels or legends. (I) All contact plots such as Fig. 1C should have a color legend. It is not clear how large the signal is and whether the plots are on the same scale. (II) Ratiometric contact plots such as in Fig. 1D should indicate what values are shown. Is this a log ratio?

      As indicated in the materials and methods section, the ratio presented in this manuscript was calculated for each point on the map by dividing the number of contacts in one condition by the number of contacts in the other condition. The Log2 of the ratio was then plotted using a Gaussian filter.

      2. Genotypes and strain names are often inconsistent. Sometimes ΔmukB, ΔmatP, ΔmatS is used, other times it is just mukB, matP, matS; there are various permutations of LiTer, Liter, liter etc.

      These inconsistencies have been corrected.

      3. The time notation is unconventional. I recommend using 0 min, 40 min, 120 min etc. instead of T0, T40mn, T2H.

      As requested, we have standardized and used conventional annotations.

      4. A supplemental strain table listing detailed genotypes would be helpful.

      A strain table has been added, along with a second table recapitulating the positions of matS in the different strains.

      5. Fig. 1A: Move the IPTG labels to the top? It took me a while to spot them.

      We have moved the labels to the top of the figure and increased the font size to make them more visible.

      6. Fig 1C: Have these plots been contrast adjusted? If so, this should be indicated. The background looks very white and the transitions from diagonal to background look quite sharp.

      No, these matrices haven't been contrast-adjusted. They were created in MATLAB, then exported as TIFF files and directly incorporated into the figure. Nevertheless, we noticed that the color code of the matrix in Figure 3 was different and subsequently adjusted it to achieve uniformity across all matrices.

      7. Fig 1C: What is the region around 3 Mb and 4 Mb? It looks like the contacts there are somewhat MukBEF-independent.

      The referee is right. In the presence of the plasmid pPSV38 (carrying the MukBEF operon or not), we repeatedly observed an increase of long range contacts around 3 Mb. The origin of these contacts is unknown.

      8. Fig 1D: Have the log ratios been clipped at -1 and 1 or was some smoothing filter applied? I would expect the division of small and noisy numbers in the background region to produce many extreme values. This does not appear to be the case.

      The referee is right, dividing two matrices generates a ratio with extreme values. To avoid this, the Log2 of the ratio is plotted with a Gaussian filter, as described before (Lioy et al., 2018).
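
      The reasoning in this reply can be illustrated with a minimal Python sketch. It is not the authors' MATLAB code: the function names, the pseudocount, the kernel radius and sigma, and the edge handling are all assumptions made for illustration. It computes an element-wise log2 ratio of two contact matrices and then applies a separable Gaussian filter, which damps the extreme values produced by dividing small, noisy counts.

```python
import math


def log2_ratio(a, b, pseudocount=1.0):
    """Element-wise log2((a + p) / (b + p)) for two equally sized matrices.

    The pseudocount is an assumption of this sketch (it avoids division by
    zero); the source does not say how empty bins were handled.
    """
    return [[math.log2((x + pseudocount) / (y + pseudocount))
             for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def gaussian_kernel(sigma, radius):
    """1D Gaussian weights over [-radius, radius], normalized to sum to 1."""
    weights = [math.exp(-(d * d) / (2.0 * sigma * sigma))
               for d in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_rows(matrix, kernel):
    """Convolve every row with the kernel, renormalizing at the edges."""
    radius = len(kernel) // 2
    n = len(matrix[0])
    out = []
    for row in matrix:
        new_row = []
        for j in range(n):
            acc = 0.0
            norm = 0.0
            for k, w in enumerate(kernel):
                jj = j + k - radius
                if 0 <= jj < n:  # ignore positions outside the matrix
                    acc += w * row[jj]
                    norm += w
            new_row.append(acc / norm)
        out.append(new_row)
    return out


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def gaussian_filter_2d(matrix, sigma=1.0, radius=2):
    """Separable 2D Gaussian filter: smooth along rows, then along columns."""
    kernel = gaussian_kernel(sigma, radius)
    rows_done = smooth_rows(matrix, kernel)
    return transpose(smooth_rows(transpose(rows_done), kernel))


if __name__ == "__main__":
    # Toy 4x4 maps: identical except for one noisy bin in condition A.
    cond_a = [[10.0] * 4 for _ in range(4)]
    cond_b = [[10.0] * 4 for _ in range(4)]
    cond_a[1][2] = 80.0
    raw = log2_ratio(cond_a, cond_b)
    smoothed = gaussian_filter_2d(raw, sigma=1.0, radius=2)
    print("raw extreme:", raw[1][2])
    print("smoothed extreme:", smoothed[1][2])
```

      In this toy case the isolated outlier is spread over its neighbours, so the smoothed map shows no single extreme pixel, which is consistent with the reviewer's observation that the published ratio maps lack extreme values.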

      9. Fig 1E: I recommend including a wild-type reference trace as a point of reference.

      We have added the WT profile to the figure.

      10. Fig 2: I feel the side-by-side cartoon from Supplemental Fig. 2A could be included in the main figure to make things easier to grasp.

      We added a schematic representation of the chromosome configuration on top of the matrices to aid understanding.

      11. Fig. 2C: One could put both plots on the same y-axis scale to make them comparable.

      We have modified the axes as required.

      12. Fig. 3C: The LiTer4 ratio plot has two blue bands in the 3-4.5 Mb region. I was wondering what they might be. These long-range contacts seem to be transposition-dependent and suppressed by MatP, is that correct?

      The referee is right. This indicates that in the absence of MatP, one part of the Ter was able to interact with a distal region of the chromosome, albeit with a low frequency. The origin is not yet known.

      13. Fig. 3E: It is hard to understand what is a strain label and what is the analyzed region of interest. The plot heading and figure legend say Ter2 (but then, there are different Ter2 variants), some labels say Ter, others say Ter2, sometimes it doesn't say anything, some labels say ΔmatS or ΔmatP, others say matS or matP, and so on.

      We have unified our notation and added more description to the legend to clarify this figure:

      “Ter” corresponds to the range of contacts over the entire Ter region, in the WT strain (WT Ter) or in the ΔmatP strain (ΔmatP Ter). The column WT matSX-Y corresponds to the range of contacts between the designated matS sites in the WT configuration. This portion of the Ter can be compared with the same Ter segment in the transposed strain (Ter2). Additionally, the matS20-28 segment corresponds to Ter2 in LiTer9, just as matS22-28 corresponds to Ter2 in LiTer7, and matS25-28 to Ter2 in LiTer4. The range of contacts of this segment was also measured in a ΔmatP or ΔmatS background.”

      14. Fig. 4 and p.9: "Normalized ChIP-seq experiments were performed by normalizing the quantity of immuno-precipitated fragments to the input of MukB-Flag and then divide by the normalized ChIP signals at t0 to measure the enrichment trigger by replication."

      This statement and the ChIP plots in Fig. 4A are somewhat puzzling. If the data were divided by the ChIP signal at t0, as stated in the text, then I would expect the first plot (t0) to be a flat line at value 1. This is not the case. I assume that normalized ChIP is shown without the division by t0, as stated in the figure legend.

      The referee is right. This sentence has been corrected, and as described in the Methods section, Figure 4 shows the ChIP normalized by the input.

      If that's true and the numbers were obtained by dividing read-count adjusted immunoprecipitate by read-count adjusted input, then I would expect an average value of 1. This is also not the case. Why are the numbers so low? I think this needs some more details on how the data was prepared.

      The referee is right; we thank him for this remark. Our data are processed using the following method: the value of each read is divided by the total number of reads. A sliding window of 50 kb is applied to these normalized values to smooth the data. Then, the resulting signal from the ChIP is divided by the resulting signal from the input. This is what is shown in Figure 4. Unfortunately, for some of our results, the sliding window was not correctly applied to the input data. This did not alter the ChIP profile but did affect the absolute values. We have resolved this issue and corrected the figure.
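
      The processing steps listed in this reply can be sketched in Python as follows. This is an illustration only, not the authors' pipeline: the binning of reads, the number of bins per 50 kb window, the circular wrap-around at the chromosome ends, and all function names are assumptions. The sketch divides per-bin read counts by the total read count, smooths with a sliding-window mean, and divides the ChIP signal by the input signal.

```python
def normalize_by_depth(counts):
    """Divide each per-bin read count by the total number of reads."""
    total = sum(counts)
    return [c / total for c in counts]


def sliding_mean(values, window_bins):
    """Sliding-window mean with circular wrap-around.

    Wrapping reflects the circular chromosome but is an assumption of this
    sketch. The window is centred, so its effective width is always odd.
    """
    n = len(values)
    half = window_bins // 2
    width = 2 * half + 1
    out = []
    for i in range(n):
        acc = 0.0
        for k in range(-half, half + 1):
            acc += values[(i + k) % n]
        out.append(acc / width)
    return out


def chip_over_input(chip_counts, input_counts, window_bins=5):
    """Depth-normalize and smooth both tracks identically, then divide.

    With hypothetical 10 kb bins, window_bins=5 would correspond to 50 kb.
    Raises ZeroDivisionError if the smoothed input is zero in some bin.
    """
    chip = sliding_mean(normalize_by_depth(chip_counts), window_bins)
    inp = sliding_mean(normalize_by_depth(input_counts), window_bins)
    return [c / i for c, i in zip(chip, inp)]


if __name__ == "__main__":
    chip = [120, 130, 110, 90, 80, 70, 60, 100]
    inp = [100] * 8
    ratio = chip_over_input(chip, inp, window_bins=3)
    print("mean ratio:", sum(ratio) / len(ratio))

    # Hypothetical constant mis-scaling of the input track: every value of
    # the ratio changes by the same factor, so the profile shape is unchanged.
    mis_scaled = [r / 50.0 for r in ratio]
    print("relative profile unchanged:",
          all(abs(m / mis_scaled[0] - r / ratio[0]) < 1e-12
              for m, r in zip(mis_scaled, ratio)))
```

      When both tracks go through the same steps and the input is flat, the ratio averages 1, which is the value the reviewer expected. A constant error in the input track shifts the absolute values without changing the shape of the profile, in line with the authors' statement.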

      Another potential issue is that it's not clear what the background signal is and whether it is evenly distributed. The effect size is rather small. Negative controls (untagged MukB for each timepoint) would help to estimate the background distribution, and calibrator DNA could be used to estimate the signal-to-background ratio. There is the danger that the apparent enrichment of replicated DNA is due to increased "stickiness" rather than increased MukBEF binding. If any controls are available, I would strongly suggest to show them.

      To address this remark, a ChIP experiment with a non-tagged strain under comparable synchronization conditions has been performed. The results are presented as Supplementary Figure 9; they reveal that the enrichment shown in Figure 4 is not attributed to nonspecific antibody binding or 'stickiness'.

      15. Fig. 4A, B: The y-axes on the right are unlabeled and the figure legends mention immunoblot analysis, which is not shown.

      We labeled the y-axes as 'anti-Flag ChIP/input' and made corrections to the figure legend.

      16. Fig. 4B: This figure shows a dip in enrichment at the Ter2 region of LiTer7, which supports the authors' case. Having a side-by-side comparison with WT at 60 min would be good, as this time point is not shown in Fig. 4A.

      Cell synchronization can be somewhat challenging, and we have observed that the timing of replication restart can vary depending on the genetic background of the cells. This delay is evident in the case of LiTer7. To address this, we compared LiTer7 after 60 minutes to the wild type strain (WT) after 40 minutes of replication. Even though the duration of replication is 20 minutes longer in LiTer7, the replication profiles of these two strains under these two different conditions (40 minutes and 60 minutes) are comparable and provide a better representation of similar replication progression.

      17. Fig. 4C: Highlighting the position of the replication origin would help to interpret the data.

      We have highlighted the oriC position with a red dashed line.

      18. Fig. 4C: One could include a range-of-contact plot that compares the three conditions (similar to Fig. 1E).

      We have added this quantification to Supplemental Figure 8.

      19. Supplemental Fig. 2A: In the LiTer15 cartoon, the flanking attachment sites do not line up. Is this correct? I would also recommend indicating the direction of the Ter1 and Ter2 regions before and after recombination.

      In this configuration, attB and attR, as well as attL and attB', should be aligned but the remaining attR attL may not. We have corrected this misalignment. To clarify the question of sequence orientation, we have included in the figure legend that all transposed sequences maintain their original orientation.

      20. Supplemental Fig. 3: One could show where the deleted matS sites are.

      We added red asterisks to the ChIP representation to highlight the positions of the missing matS.

      21. Supplemental Fig. 3B: The plot legend is inconsistent with panel A (What is "WT2")?

      We have corrected it.

      22. Supplemental Fig. 3C: The E-value notation is unusual. Is this 8.9 x 10^-61?

      The value is 8.9 x 10^-61; we modified the annotation.

      23. Abstract: "While different features for the activity of the bacterial canonical SMC complex, SmcScpAB, have been described in different bacteria, not much is known about the way chromosomes in enterobacteria interact with their SMC complex, MukBEF."

      Could this be more specific? What features are addressed in this manuscript that have been described for Smc-ScpAB but not MukBEF? Alternatively, one could summarize what MukBEF does to capture the interest of readers unfamiliar with the topic.

      We modified these first sentences.

      24. p.5 "was cloned onto a medium-copy number plasmid under control of a lacI promoter" Is "lacI promoter" correct? My understanding is that the promoter of the lacI gene is constitutive, whereas the promoter of the downstream lac operon is regulated by LacI. I would recommend providing an annotated plasmid sequence in supplemental material to make things clearer.

      We modified it and replaced “lacI promoter” with the correct annotation, pLac.

      25. p. 5 heading "MukBEF activity does not initiate at a single locus" and p. 6 "Altogether, the results indicate that the increase in contact does not originate from a specific position on the chromosome but rather appears from numerous sites". Although this conclusion is supported by the follow-up experiments, I felt it is perhaps a bit too strong at this point in the text. Perhaps MukBEF loads slowly at a single site, but then moves away quickly? Would that not also lead to a flat increase in the contact plots? One could consider softening these statements (at least in the section header), and then be more confident later on.

      We used 'indicate' and 'suggesting' at the end of this results section, and we feel that we have not overreached in our conclusions at this point. While it's true that we can consider other hypotheses, we believe that, at this stage, our suggestion that MukBEF is loaded over the entire chromosome is the simplest and most likely explanation.

      26. p.7: "[these results] also reveal that MukBEF does not translocate from the Ori region to the terminus of the chromosome as observed with Smc-ScpAB in different bacteria."

      This isn't strictly true for single molecules, is it? Some molecules might translocate from Ori to Ter. Perhaps clarify that this is about the bulk flux of MukBEF?

      At this point, our conclusion that MukBEF does not travel from the ori to Ter is global and refers to the results described in this section. However, the referee is correct in pointing out that we cannot exclude the possibility that in a WT configuration (without a Ter in the middle of the right replicore), a specific MukBEF complex can be loaded near Ori and travel all along the chromosome until the Ter. To clarify our statement, we have revised it to 'reveal that MukBEF does not globally translocate from the Ori region to the terminus of the chromosome.' This change is intended to highlight the fact that we are drawing a general conclusion about the behavior of MukBEF and to facilitate its comparison with Smc-ScpAB in B. subtilis.

      27. p. 10: The section title "Long-range contacts correlate with MukBEF binding" and the concluding sentence "Altogether, these results indicate that MukBEF promotes long-range DNA contacts independently of the replication process even though it binds preferentially in newly replicated regions" seem to contradict each other. I would rephrase the title as "MukBEF promotes long-range contacts in the absence of replication" or similar.

      We agree with this suggestion and have used the proposed title.

      28. p. 13: I recommend reserving the name "condensin" for the eukaryotic condensin complex and using "MukBEF" throughout.

      We used MukBEF throughout.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary: 

      Beyond what is stated in the title of this paper, not much needs to be summarized. eIF2A in HeLa cells promotes translation initiation of neither the main ORFs nor short uORFs under any of the conditions tested. 

      Strengths: 

      Very comprehensive, in fact, given the huge amount of purely negative data, an admirably comprehensive and well-executed analysis of the factor of interest. 

      Weaknesses: 

      The study is limited to the HeLa cell line, focusing primarily on KO of eIF2A and neglecting the opposite scenario, higher eIF2A expression which could potentially result in an increase in non-canonical initiation events. 

      We thank the reviewer for the positive evaluation. As suggested by the reviewer in the detailed recommendations, we will clarify in the title, abstract and text that our conclusions are limited to HeLa cells. Furthermore, as suggested we will test the effect of eIF2A overexpression on the luciferase reporter constructs, and will upload a revised manuscript.

      Reviewer #2 (Public review):

      Summary 

      Roiuk et al describe a work in which they have investigated the role of eIF2A in translation initiation in mammals without much success. Thus, the manuscript focuses on negative results. Further, the results, while original, are generally not novel, but confirmatory, since related claims have been made before independently in different systems with Haikwad et al study recently published in eLife being the most relevant. 

      Despite this, we find this work highly important. This is because of a massive wealth of unreliable information and speculations regarding the role of eIF2A in translation, arising from a series of artifacts that began at the moment of eIF2A discovery. This, in combination with its unfortunate naming (eIF2A is often mixed up with the alpha subunit of eIF2, eIF2S1), has generated widespread confusion among researchers who are not experts in eukaryotic translation initiation. Given this, it is not only justifiable but critical to make independent efforts to clear up this confusion and I very much appreciate the authors' efforts in this regard.

      Strengths 

      The experimental investigation described in this manuscript is thorough, appropriate and convincing. 

      Weaknesses 

      However, we are not entirely satisfied with the presentation of this work which we think should be improved. 

      We thank the reviewer for the positive evaluation. We will revise the manuscript according to the reviewer's suggestions made in the detailed recommendations.

      Reviewer #3 (Public review):

      Summary: 

      This is a valuable study providing solid evidence that the putative non-canonical initiation factor eIF2A has little or no role in the translation of any expressed mRNAs in cultured human (primarily HeLa) cells. Previous studies have implicated eIF2A in GTP-independent recruitment of initiator tRNA to the small (40S) ribosomal subunit, a function analogous to canonical initiation factor eIF2, and in supporting initiation on mRNAs that do not require scanning to select the AUG codon or that contain near-cognate start codons, especially upstream ORFs with non-AUG start codons, and may use the cognate elongator tRNA for initiation. Moreover, the detected functions for eIF2A were limited to, or enhanced by, stress conditions where canonical eIF2 is phosphorylated and inactivated, suggesting that eIF2A provides a back-up function for eIF2 in such stress conditions. CRISPR gene editing was used to construct two different knockout cell lines that were compared to the parental cell line in a large battery of assays for bulk or gene-specific translation in both unstressed conditions and when cells were treated with inhibitors that induce eIF2 phosphorylation. None of these assays identified any effects of eIF2A KO on translation in unstressed or stressed cells, indicating little or no role for eIF2A as a back-up to eIF2 and in translation initiation at near-cognate start codons, in these cultured cells. 

      The study is very thorough and generally well executed, examining bulk translation by puromycin labeling and polysome analysis and translational efficiencies of all expressed mRNAs by ribosome profiling, with extensive utilization of reporters equipped with the 5'UTRs of many different native transcripts to follow up on the limited number of genes whose transcripts showed significant differences in translational efficiencies (TEs) in the profiling experiments. They also looked for differences in translation of uORFs in the profiling data and examined reporters of uORF-containing mRNAs known to be translationally regulated by their uORFs in response to stress, going so far as to monitor peptide production from a uORF itself. The high precision and reproducibility of the replicate measurements instil strong confidence that the myriad of negative results they obtained reflects the lack of eIF2A function in these cells rather than data that would be too noisy to detect small effects of the eIF2A mutations. They also tested and found no evidence for a recent claim that eIF2A localizes to the cytoplasm in stress and exerts a global inhibition of translation. Given the numerous papers that have been published reporting functions of eIF2A in specific and general translational control, this study is important in providing abundant, high-quality data to the contrary, at least in these cultured cells.

      Strengths: 

      The paper employed two CRISPR knock-out cell lines and subjected them to a combination of high-quality ribosome profiling experiments, interrogating both main coding sequences and uORFs throughout the translatome, which was complemented by extensive reporter analysis, and cell imaging in cells both unstressed and subjected to conditions of eIF2 phosphorylation, all in an effort to test previous conclusions about eIF2A functioning as an alternative to eIF2. 

      Weaknesses: 

      There is some question about whether their induction of eIF2 phosphorylation using tunicamycin was extensive enough to state forcefully that eIF2A has little or no role in the translatome when eIF2 function is strongly impaired. Also, similar conclusions regarding the minimal role of eIF2A were reached previously for a different human cell line from a study that also enlisted ribosome profiling under conditions of extensive eIF2 phosphorylation; although that study lacked the extensive use of reporters to confirm or refute the identification by ribosome profiling of a small group of mRNAs regulated by eIF2A during stress. 

      We thank the reviewer for the positive evaluation. We will revise the manuscript according to the reviewer's detailed recommendations. Regarding the two points mentioned here:

      (1) The reason eIF2alpha phosphorylation does not increase appreciably is that, unfortunately, the antibody is very poor. That the Integrated Stress Response (ISR) is induced by our treatment can be seen, for instance, from the strong increase in ATF4 protein levels (in the very same samples where eIF2alpha phosphorylation does not increase much, in Suppl. Fig. 5E). We will strengthen the conclusion that the ISR is indeed activated with additional experiments/data as suggested by the reviewer.

      (2) We agree that our results are in line with results from the previous study mentioned by the reviewer, so we will revise the manuscript to mention this other study more extensively in the discussion.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) I suggest to state (already in the abstract, but perhaps also even in the title, definitely in the rest of the paper) that this analysis is limited to the HeLa cell line. 

      As suggested, we have now specified in both the title and the abstract that the work is done in HeLa cells.

      (2) In my view, it is a pity that the authors - given the tools are available - did not check the impact of high eIF2A levels on expression of individual mRNAs under normal and stress conditions. I am not suggesting to repeat ribo-seq in this setup, it would be too much to ask for, but re-examining some of the many reporters the authors generated with eIF2A overexpressed may point to some function, e.g. increased number of non-canonical initiation events (non-AUG-initiated)? If anything, the use of HeLa and the primary focus on eIF2A KO neglecting the prospective impact of eIF2A overexpression should be mentioned as two main limitations of this study. 

      We thank the reviewer for the good suggestion to test our synthetic reporters with eIF2A overexpression. New Suppl. Fig. 4G now shows that overexpression of eIF2A does not affect translation of synthetic reporters carrying an ATG start codon in different initiation contexts, or carrying near-cognate start codons, in agreement with a lack of effect on translation which we previously observed with loss of eIF2A.

      (3) Ribo-seq with eIF2A. Did the authors focus on ORFs that are known, or whose isoforms are known, to be non-AUG initiated? Would the loss of eIF2A decrease FPs in their CDSes under at least some conditions?

      We have now assessed the read distribution on the eIF4G2 transcript in both the control and tunicamycin conditions (Author response image 1). In our hands, eIF4G2 is one of the best examples of non-AUG initiation in human cells, since the main coding sequence starts with GTG and the CDS is well translated. Nonetheless, we do not observe any significant changes in read distribution (panels A-B) or overall translation efficiency of eIF4G2 upon eIF2A loss (panels C-D).

      Author response image 1.

      (A-B) Average read occupancy on the eIF4G2 (ENST0000339995) transcript in DMSO treated (panel A, n=3) or tunicamycin treated samples (panel B, n=2) derived from either control (black) or eIF2A-KO (red) HeLa cells. Read counts were normalized to sequencing depth and averaged between either 3 (DMSO-treated) or 2 (tunicamycin-treated) replicates. Graphs were then smoothed with a sliding window of 3 nt. (C-D) The total number of reads mapping to the eIF4G2 CDS, normalized to library sequencing depth per replicate, was quantified. No significant difference between control and eIF2A-KO cells was observed in either DMSO treated (panel C) or tunicamycin treated (panel D) cells. Significance by unpaired, two-sided t-test. ns = not significant.
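
      The processing described in this legend (depth normalization, averaging across replicates, then a 3 nt sliding window) can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the reads-per-million scaling, the shortened window at the transcript ends, and all function names are hypothetical.

```python
def depth_normalize(counts, library_size, scale=1e6):
    """Scale per-nucleotide read counts to reads per `scale` mapped reads.

    Reads-per-million is an assumption; the legend only states that counts
    were normalized to sequencing depth.
    """
    return [c * scale / library_size for c in counts]


def average_replicates(profiles):
    """Position-wise mean across replicate profiles of equal length."""
    n = len(profiles)
    return [sum(values) / n for values in zip(*profiles)]


def smooth(values, window=3):
    """Centred sliding-window mean; the window is shortened at the ends."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def occupancy_profile(replicate_counts, library_sizes, window=3):
    """Depth-normalize each replicate, average them, then smooth."""
    normalized = [depth_normalize(counts, size)
                  for counts, size in zip(replicate_counts, library_sizes)]
    return smooth(average_replicates(normalized), window)


if __name__ == "__main__":
    # Two toy replicates over a 5 nt stretch; the second library is twice as deep.
    replicates = [[0, 3, 6, 3, 0], [0, 6, 12, 6, 0]]
    library_sizes = [1_000_000, 2_000_000]
    print(occupancy_profile(replicates, library_sizes, window=3))
```

      Normalizing before averaging keeps a deeper library from dominating the mean profile, which is why the two toy replicates above contribute equally despite their different depths.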

      Thank you for giving me the opportunity to review this article.

      Reviewer #2 (Recommendations for the authors):

      While some of our suggestions below may be considered subtle, in our opinion they are important and it would be good if the authors consider them for their revision, we also have a couple of technical suggestions. 

      (1) Abstract. 

      The authors failed to identify the role of eIF2A in translation initiation and have provided compelling evidence that eIF2A is not involved in recognition of non-AUG codons as start codons nor in recruitment of initiator tRNA during stress conditions which are two activities most commonly misattributed to eIF2A. However, they have not exhausted all possible potential functions of eIF2A, see below, it is also possible that eIF2A may have a role not yet suggested by anyone and it may function in translation initiation in special circumstances that have not been tested yet. The authors indeed discuss such possibility in the Discussion section. Given that there is genetic evidence (that is unaffected by biochemical impurities) linking eIF2A to other initiation factors (5B and 4E), we are not yet convinced that eIF2A does not have any role in translation initiation and therefore we find the last sentence of the abstract premature. We suggest to soften this statement into something like this: whether eIF2A has any role in translation remains unknown, it may even have a role in a different aspect of RNA Biology. 

      We agree with the reviewer. We changed the last sentence of the abstract to read as follows:

      “It is possible that eIF2A plays a role in translation regulation in specific conditions that we have not tested here, or that it plays a role in a different aspect of RNA biology.”

      (2) Recently eIF2A has been implicated in ribosomal frameshifting, see Wei et al 2023 DOI: 10.1016/j.celrep.2023.112987 

      Could authors look into PEG10 mRNA ribosome profile to see if there are detectable statistically significant changes in footprint density downstream of frameshift site between WT and eIF2A KOs? It is likely that the coverage will be insufficient to give a definitive answer, but it is worth checking, it would be a pity to miss it.

      We thank the reviewer for this suggestion. We have now looked at the distribution of ribosome footprints on the PEG10 transcript variant that is expressed in HeLa cells (ENST00000482108) and indeed observe coverage downstream of the annotated stop codon, consistent with a frameshifting event that results in an extended protein isoform being translated. Visual assessment of the read distribution between the main ORF and the "ORF extension" does not show a substantial difference between control and eIF2A knock-out cells (Author response image 2A-B). Additionally, we quantified the ratio of reads mapping to the PEG10 ORF upstream of the slippery site versus those mapping downstream, extending into the predicted longer protein. Nonetheless, we could not detect significant changes between control and eIF2A-KO cells in either tested condition (Author response image 2C-D).

      Author response image 2.

      (A-B) Average read occupancy on the PEG10 (ENST00000482108) transcript in DMSO treated (panel A, n=3) or tunicamycin treated samples (panel B, n=2) derived from either control (black) or eIF2A-KO (red) HeLa cells is shown. Read counts were normalized to sequencing depth and averaged between either 3 (DMSO-treated) or 2 (tunicamycin-treated) replicates. Graphs were then smoothed with a sliding window of 3 nt. (C-D) The ratio of reads mapping to the ORF upstream of the slippery site to reads mapping to the predicted extended protein downstream of the slippery site is shown. Read counts were normalized to the sequencing depth. Neither DMSO treated samples (panel C) nor tunicamycin treated samples (panel D) had a significant difference between control and eIF2A-KO cells. Significance by unpaired, two-sided t-test. ns = not significant.
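
      The quantification in panels C-D can be illustrated with a short Python sketch. It is not the authors' analysis: the coordinate arguments, the function names, and the choice of Welch's unequal-variance form of the unpaired t-test are assumptions (the legend does not say which variant was used). Because the ratio is taken within a single library, any depth normalization factor cancels in this sketch. Only the t statistic and degrees of freedom are computed; turning them into a p-value needs a t-distribution function that the Python standard library does not provide.

```python
import math
from statistics import mean, variance


def upstream_downstream_ratio(counts, orf_start, slip_site, extension_end):
    """Reads upstream of the slippery site divided by reads downstream of it.

    `counts` holds per-nucleotide read counts along the transcript. Upstream
    covers [orf_start, slip_site); downstream covers [slip_site, extension_end).
    All coordinates are hypothetical 0-based indices.
    """
    upstream = sum(counts[orf_start:slip_site])
    downstream = sum(counts[slip_site:extension_end])
    if downstream == 0:
        raise ValueError("no reads downstream of the slippery site")
    return upstream / downstream


def welch_t(sample_a, sample_b):
    """Welch's unpaired t statistic and its approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = variance(sample_a) / n_a  # squared standard error of each mean
    var_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(var_a + var_b)
    dof = (var_a + var_b) ** 2 / (
        var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1)
    )
    return t_stat, dof


if __name__ == "__main__":
    # Toy per-replicate ratios for control and knock-out cells.
    control = [6.1, 5.8, 6.4]
    knockout = [6.0, 6.3, 5.9]
    print(welch_t(control, knockout))
```

      With only two or three replicates per condition, as in the legend, such a test has limited power, which matches the reviewer's caution that coverage may be insufficient for a definitive answer.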

      (3) Introduction 

      Given the volume of unreliable claims regarding eIF2A in the literature and the overall confusion, it is very difficult (and may even be impossible) to write a clear, coherent introduction to the topic. Nonetheless, there are a few points that need to be taken into account.

      The authors state that eIF2A is capable of recruiting initiator tRNA, citing Zoll et al 2002. This activity was later shown to be a biochemical artefact (which was most likely reproduced by Kim et al 2018): the eIF2A fraction was contaminated with eIF2D, which does bind tRNAs in a GTP-independent manner. eIF2A purified from RRL separates from initiator tRNA binding activity; see Dmitriev et al 2010 DOI: 10.1074/jbc.M110.119693. This point is also relevant to the second paragraph of the Discussion: it should be acknowledged that it has been shown previously that eIF2A does not bind the initiator tRNA.

      We appreciate the advice provided by the reviewer. We have modified both the introduction and the 2nd paragraph of the discussion to reflect that the tRNA-binding activity is due to contaminating eIF2D rather than eIF2A.

      In many cases the authors describe certain claims as facts even though they refute them themselves. For example 

      "Such eIF2A-driven non-AUG initiation events were shown to play a crucial role in different aspects of cell physiology and disease progression: cellular adaptation during the integrated stress response (Chen et al., 2019; Starck et al., 2016)"  While non-AUG initiation events do play crucial roles in different aspects of cell physiology (reviewed in Andreev et al 2023 doi: 10.1186/s13059-022-02674-2) eIF2A has nothing to do with it as the authors show themselves. Therefore different language should be used, e.g.. "eIF2A has been suggested (or proposed or reported) to be responsible for non-AUG initiation events that were shown to play ..." 

      The word "shown" is used in many other instances for the claims that the authors refute. "Shown" is only appropriate for strong evidence that leaves little doubt. 

      We agree with the reviewer and made the suggested changes in the text.

      (4) Supplementary Fig. 1. 

      Panel C is used to argue that eIF2A has a higher concentration in the cytoplasm than in the nucleus; perhaps it is worth explaining how this conclusion was drawn. If levels in the cytoplasm are comparable to GAPDH and Tubulin but less than c-Myc in the nucleus, does it really mean that there is less eIF2A in the nucleus than in the cytoplasm? This is not obvious to us. Also, presumably WCL stands for Whole Cell Lysate; it would be nice to introduce this abbreviation somewhere.

      To compare levels of eIF2A in the nuclear and cytosolic fractions, we lysed the two fractions in equal volumes of buffer (i.e. the cytosolic fraction was extracted in 200 µl of hypotonic buffer, and the nuclear fraction was extracted in 200 µl of cell extraction buffer). This ensures that per microliter of lysate we have the same number of "cytosols" or nuclei. Hence, equal intensity bands in the cytosolic and nuclear fractions would mean that half of the protein is in the nucleus and half is in the cytosol. We originally described this in the Methods section, but now also mention it in the Results and in the figure legend.

      We replaced WCL with "whole cell" in the figure. 
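
      The equal-volume reasoning above amounts to a simple proportion, sketched here for illustration. It assumes that equal lysate volumes are loaded per lane and that band intensity scales linearly with protein amount; the function name and the example intensities are hypothetical.

```python
def nuclear_fraction(cytosolic_band, nuclear_band):
    """Fraction of a protein in the nucleus, from band intensities.

    Valid only if both fractions were extracted in equal volumes and
    equal volumes were loaded, so each lane represents the same number
    of cell equivalents.
    """
    total = cytosolic_band + nuclear_band
    if total <= 0:
        raise ValueError("band intensities must sum to a positive value")
    return nuclear_band / total


# Equal bands: half of the protein is nuclear.
equal_bands = nuclear_fraction(cytosolic_band=1.0, nuclear_band=1.0)
# A three-fold stronger cytosolic band: one quarter is nuclear.
mostly_cytosolic = nuclear_fraction(cytosolic_band=3.0, nuclear_band=1.0)
```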

      (5) The differential translation analysis is described very briefly: "To obtain values of translation efficiency, log2 fold changes, and adjusted p values the DESeq2 software package was used". Was TE calculated based on ribosome footprint to RNA-seq ratios? How exactly was DESeq2 used here? TE measured in this way spuriously correlates with RNA-seq values (see Larsson et al 2010 DOI: 10.1073/pnas.1006821107); perhaps it would be worth assessing differential translation with anota2seq (Oertlin et al 2019 doi: 10.1093/nar/gkz223)? Anota2seq avoids calculating the ratios and enables comprehensive analysis of differential translation, including detection of buffered translation, which might be the case here, while avoiding artefacts that may arise from varying RNA levels.

      We have now specified in more detail in the Methods section how we analyzed the data. Indeed, DESeq2 was used on translation efficiency values, which we calculated as the ratio of ribosome footprints to RNA-seq.

      As suggested, we have now also performed the analysis using anota2seq (Suppl. Fig. 3C) and this analysis identified zero transcripts that are translationally regulated, in agreement with our analysis.
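
      For readers unfamiliar with the quantity under discussion, the ratio-based translation efficiency can be sketched as below. This shows only the arithmetic of the ratio; it does not reproduce the statistical modelling done by DESeq2 or anota2seq. The pseudocount, the function names, and the counts are assumptions for the example.

```python
import math


def translation_efficiency(footprint_count, rna_count, pseudocount=1.0):
    """TE as the ratio of ribosome footprints to RNA-seq signal.

    The pseudocount (an assumption here) avoids division by zero.
    """
    return (footprint_count + pseudocount) / (rna_count + pseudocount)


def log2_te_fold_change(fp_ko, rna_ko, fp_ctrl, rna_ctrl, pseudocount=1.0):
    """log2 of TE in knock-out cells over TE in control cells."""
    te_ko = translation_efficiency(fp_ko, rna_ko, pseudocount)
    te_ctrl = translation_efficiency(fp_ctrl, rna_ctrl, pseudocount)
    return math.log2(te_ko / te_ctrl)


# Hypothetical counts. Footprints double while mRNA is unchanged:
translational_change = log2_te_fold_change(399, 99, 199, 99)
# Footprints double because mRNA doubles, so TE is unchanged:
transcriptional_change = log2_te_fold_change(399, 199, 199, 99)
```

      The second case shows why a ratio is used at all: a change in footprints that merely tracks mRNA abundance gives no change in TE. The reviewer's caveat still applies, since noise in the RNA-seq denominator feeds directly into the ratio.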

      (6) Section "eIF2a-inactivating stresses do not redirect tRNA delivery function to eIF2A." 

      The description of the ISR mechanism is a bit inaccurate. Strictly speaking, eIF2alpha phosphorylation does not inactivate eIF2alpha. It results in formation of a very stable eIF2*GDP*eIF2B complex, thus severely depleting eIF2B, which serves as a GEF for eIF2. This in turn reduces the ternary complex (eIF2*GTP*tRNAi) concentration since there is no free eIF2B to exchange GDP for GTP. Without getting into much detail, we think it would be more accurate to say that eIF2alpha phosphorylation leads to ternary complex depletion instead of saying that stress inactivates eIF2alpha.

      We agree with the reviewer - we were trying to use simple, compact wording. We have now reworded the section title to "No detectable role for eIF2A in translation when eIF2 is inhibited" and rephrased the subsequent text to be correct.

      Also the subtitle uses eIF2a with small a that stands for alpha which potentially could lead to substantial confusion since in this case the difference between eIF2alpha and eIF2A is only in capitalisation of the last letter, many text-mining engines such as modern LLMs may not be able to pick the differences. Perhaps it would be better to refer to eIF2alpha by the HGNC approved name of its gene - eIF2S1 to avoid further confusions. For clarity it may be stated at the beginning that eIF2S1 is commonly known as eIF2alpha. 

      We thank the reviewer for this point. We have removed all instances of eIF2a (with lowercase a) from the manuscript to avoid this source of confusion. In the first instance of eIF2α we also added the official HGNC gene name. However, we prefer to use eIF2α instead of eIF2S1 because people outside the translation field tend to know the subunit as eIF2α, and we think it is important that people outside the translation field also read this manuscript, since some of the questionable papers on eIF2A come from labs working at the interface between translation and other fields.

      Minor 

      Introduction 

      (7) "uses the CAT anticodon" change CAT to CAU 

      We corrected CAT to CAU

      (8) "In the canonical initiation pathway", change "canonical" to "most common"; "canonical" is a somewhat judgemental term that originates in theology. The same applies to the numerous occurrences of "canonical AUG"; simply using "AUG" would be simpler and more accurate, as you will avoid giving the impression that there are "non-canonical AUGs".

      Done.

      (9) "eIF2A was initially considered to be a functional analogue of prokaryotic IF2 (Merrick and Anderson, 1975), however later this role was reassigned to the above-mentioned heterotrimeric factor eIF2 (α, β, γ) (Levin et al., 1973)." - there is a chronological contradiction within this sentence: the initial consideration is attributed to 1975, while its later reassignment is attributed to 1973.

      We are grateful to the reviewer for spotting this mistake. There was a citation problem; we fixed it and now cite the correct paper for the initial discovery of eIF2A (Shafritz et al 1970; PMID 5472357).

      (10) "On the other hand, studies on the role of eIF2A on viral IRES translation have arrived at conflicting results." Remove "On the other hand" since conflicting results have been mentioned above. In fact, the entire sentence is somewhat redundant given the prior "For example, eIF2A has been studied in the context of internal ribosome entry sites (IRES), where it was found to act both as a suppressor and an activator of IRES-mediated initiation."

      We have rewritten the paragraph to make it more coherent.

      (11) Fig. 1C-D uses the abbreviation CHX for cycloheximide; this needs to be mentioned in the legend or elsewhere in the text. Otherwise CHX may not be clear to a reader uninitiated in ribosome profiling.

      We now mention in the figure legend that CHX stands for cycloheximide and indicate that it was used as a negative control to block translation. 

      (12) Page 7, section "Ribosome profiling reveals a few eIF2A-dependent transcripts"

      In this section you describe ribosome profiling experiments and identify a few transcripts whose translation seems to change based on the ribosome profiling data. Then you attempt to verify them using gene expression reporters and reasonably suggest that these are false positives. In essence, this section argues that there are no eIF2A-dependent transcripts, so the title of this subsection is misleading. It makes sense to rename it so that it better reflects the content of the section.

      We agree and have renamed the section to "Ribosome profiling identifies no eIF2A-dependent transcripts".

      (13) Page 8, top. Rephrase "To do this, we performed ribosome profiling on control and eIF2A-KO cells, which sequences the mRNA footprints protected by ribosomes."

      Fixed.

      (14) Page 10, bottom. "Several studies have reported that eIF2A can delivery alternative initiator tRNAs to uORFs with near-cognate start codons". Change "delivery" to "deliver".

      Thanks for spotting it. We corrected it to “deliver”.

      (15) Page 13 "This suggests that, as in non-stressed conditions, eIF2A has a minimal effect on global translation also when eIF2a activity is low." - rephrase to avoid impression that eIF2alpha activity is low in normal conditions, also please see comment #6 above. 

      We fixed this sentence to read: “This suggests that, as in non-stressed conditions, eIF2A has a minimal effect on global translation also when the integrated stress response is active.”

      Reviewer #3 (Recommendations for the authors):

      - The experimental data in Fig. S5E do not support the claim of increased eIF2 phosphorylation on TM treatment; although, comparing Fig. S5A with Fig. 1B supports a marked reduction in bulk translation and the reporter data in Fig. 4A show the expected induction of the uORF-containing reporters by TM. Because these are the conditions employed for ribosome profiling in stress conditions shown in Fig. 4B, it would be reassuring to document TM-induced translational efficiencies of ATF4 and the other known mRNAs resistant to eIF2 phosphorylation in the ribosome profiling data, including gene browser images of the replicate experiments. If the induction of TEs by TM for such mRNAs was not robust, it would be valuable to repeat the analysis using arsenite (SA) treatment, which produces a greater inhibition of bulk translation. 

      Unfortunately, the eIF2alpha antibody is not very good and also detects the non-phosphorylated protein, causing high background and poor apparent induction in response to tunicamycin. That the ISR was activated is visible from the induction of ATF4, which was assessed by western blot in Suppl. Fig. 5E. To ensure that our ribosome profiling libraries also recorded the activation of the ISR, we built single-gene plots for ATF4 in both control and eIF2A-KO HeLa cells. As shown in Author response image 3A-B, tunicamycin treatment led to the induction of ATF4 in both cell lines. This can also be seen in the 4-fold induction of ATF4 translation efficiency in response to tunicamycin in both WT and eIF2A-KO cells (Author response image 3C). Additionally, we checked that another marker induced by tunicamycin, HSPA5, is also translationally upregulated in both cell lines, as is the downstream target of ATF4, PPP1R15B (Author response image 3C).

      Author response image 3.

      (A-B) Average read occupancy on the ATF4 (ENST00000674920) transcript in DMSO-treated (n=3) or tunicamycin-treated samples (n=2) derived from either control (panel A) or eIF2A-KO (panel B) HeLa cells is shown. Read counts were normalized to sequencing depth and averaged between either 3 (DMSO-treated) or 2 (tunicamycin-treated) replicates. Graphs were then smoothed with a sliding window of 3 nt. (C) Scatter plot of log2(fold change) of translation efficiency (TM/DMSO) for control cells on the x-axis versus eIF2A-KO cells on the y-axis. The induction of ATF4 and of its downstream target PPP1R15B is shown, as is the upregulation of HSPA5 translation, the other hallmark of ER stress induced by tunicamycin treatment.

      - It should be pointed out in the text that in both published studies being cited here of cells lacking eIF2A, that by Gaikwad et al. on a yeast eIF2A deletion mutant, and that by Ichihara et al. on human HEK293 CRISPR KO cells, the analyses included stress conditions in which eIF2 phosphorylation is induced (amino acid starvation or SA treatment, respectively), as was conducted here.  

      Good point - we added this information into the introduction: 

      "Furthermore, loss of eIF2A in several systems did not recapitulate these effects on non-AUG initiation in either non-stressed or stress conditions (caused either by amino acid depletion or sodium arsenate treatment) (Gaikwad et al., 2024; Ichihara et al., 2021)."

      - The Ichihara et al. (2021) study just mentioned reached some of the same conclusions for HEK cells obtained here by conducting ribosome profiling in untreated and SA-treated cells, finding only 1 mRNA (untreated) or four mRNAs (SA-treated cells) that showed significantly reduced TEs in the eIF2A knockout vs. parental cells. It seems appropriate for the authors to expand their treatment of this prior work by summarizing its findings in some detail and also noting how their study goes beyond this previous one. 

      We have added a paragraph to the discussion pointing out that our data agree fully with Ichihara et al. (2021), and that Ichihara et al. (2021) also found only very few mRNAs that change in TE upon loss of eIF2A in either non-stressed or stressed conditions.

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This useful study could potentially represent a step forward towards personalized medicine by combining cell-based data and a prior-knowledge network to derive Boolean-based predictive logic models to uncover altered protein/signaling networks within cancer cells. However, the level of evidence supporting the conclusions is inadequate, and further validation of the reported approach is required. If properly validated, these findings could be of interest to medical biologists working in the field of cancer and would inform drug development and treatment choices in the field of oncology.

      We thank the editor and the reviewer for their constructive comments, which helped us to improve our story. We have now performed new analyses and experiments to further support our proposed approach.

      Public Reviews:

      Reviewer #1 (Public Review):

      (1) The authors deploy a combination of their own previously developed computational methods and databases (SIGNOR and CellNOptR) to model the FLT3 signaling landscape in AML and identify synergistic drug combinations that may overcome the resistance to FLT3 inhibitors of AML cells harboring ITD mutations in the tyrosine kinase domain (TKD) of FLT3. I did not closely evaluate the details of these computational models since they are outside of my area of expertise and have been previously published. The manuscript has significant issues with data interpretation and clarity, as detailed below, which, in my view, call into question the main conclusions of the paper.

      The authors train the model by including perturbation data where TKI-resistant and TKI-sensitive cells are treated with various inhibitors and the activity (i.e. phosphorylation levels) of the key downstream nodes are evaluated. Specifically, in the Results section (p. 6) they state "TKIs sensitive and resistant cells were subjected to 16 experimental conditions, including TNFa and IGF1 stimulation, the presence or absence of the FLT3 inhibitor, midostaurin, and in combination with six small-molecule inhibitors targeting crucial kinases in our PKN (p38, JNK, PI3K, mTOR, MEK1/2 and GSK3)". I would appreciate more details on which specific inhibitors and concentrations were used for this experiment. More importantly, I was very puzzled by the fact that this training dataset appears to contain, among other conditions, the combination of midostaurin with JNK inhibition, i.e. the very combination of drugs that the authors later present as being predicted by their model to have a synergistic effect. Unless my interpretation of this is incorrect, it appears to be a "self-fulfilling prophecy", i.e. an inappropriate use of the same data in training and verification/test datasets.

      We thank the reviewer for this comment. We have now extensively revised Figure 2B and edited the text to clarify and better describe the experimental conditions of our multiparametric analysis. As the reviewer stated, we used different combinations of drugs, including midostaurin and a JNK inhibitor, to generate two cell-specific predictive models recapitulating the main signal transduction events downstream of FLT3 in resistant (FLT3ITD-TKD) and sensitive (FLT3ITD-JMD) cells. These experiments were performed by treating cells at very early time points to obtain a picture of the signaling response of FLT3-ITD positive cells. Indeed, we measured the phosphorylation level of signaling proteins because at these early time points (90 minutes) we do not expect a modulation of crucial downstream phenotypes, including apoptosis or proliferation. To infer perturbations impacting the apoptosis or proliferation phenotypes, we applied a two-step computational strategy:

      (1) We extracted key regulators of ‘apoptosis’ and ‘proliferation’ hallmarks from the SIGNOR database.

      (2) We applied our recently developed ProxPath algorithm to retrieve significant paths linking nodes of our two optimized models to ‘proliferation’ and ‘apoptosis’ phenotypes.

      This allowed us to evaluate in silico the “proliferation” and “apoptosis” rate upon inactivation of each node of the network. With the proposed approach, we identified JNK as a potential drug target to hit in combination with FLT3 to restore sensitivity (i.e. in silico inducing apoptosis and reducing proliferation) of FLT3 ITD-TKD cells. We want to stress once more that although the first piece of information (the effect of JNK and FLT3 inhibition on sentinel readouts) was provided in the training dataset, the second piece of information (the effect of this treatment on the entire model and, as a consequence, on the cellular phenotype) was purely the result of our computational models. As such, we hope that the reviewer will agree that this could not represent a “self-fulfilling prophecy”.
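
      The idea of scoring a phenotype after inactivating individual nodes of a Boolean model can be illustrated with a toy example. This is not the authors' model, nor CellNOptR or ProxPath: the four-node wiring below is invented purely for illustration, and only the node names FLT3 and JNK are taken from the text.

```python
def simulate(rules, inputs, knocked_out=(), max_steps=50):
    """Synchronously update a Boolean network until it reaches a fixed point.

    rules: dict mapping node name -> function(state) -> bool
    inputs: dict of nodes clamped to a fixed value
    knocked_out: nodes forced to False (in silico inhibition)
    """
    state = {node: False for node in rules}
    state.update(inputs)
    for node in knocked_out:
        state[node] = False
    for _ in range(max_steps):
        new_state = {}
        for node, rule in rules.items():
            if node in knocked_out:
                new_state[node] = False      # inhibited node stays off
            elif node in inputs:
                new_state[node] = inputs[node]  # clamped input
            else:
                new_state[node] = rule(state)
        if new_state == state:
            return state
        state = new_state
    raise RuntimeError("no fixed point reached")


# Hypothetical wiring: either kinase alone sustains a survival signal,
# and apoptosis is on only when that signal is off.
TOY_RULES = {
    "FLT3": lambda s: s["FLT3"],
    "JNK": lambda s: s["JNK"],
    "survival": lambda s: s["FLT3"] or s["JNK"],
    "apoptosis": lambda s: not s["survival"],
}
ACTIVE_INPUTS = {"FLT3": True, "JNK": True}

untreated = simulate(TOY_RULES, ACTIVE_INPUTS)
flt3_only = simulate(TOY_RULES, ACTIVE_INPUTS, knocked_out=("FLT3",))
combined = simulate(TOY_RULES, ACTIVE_INPUTS, knocked_out=("FLT3", "JNK"))
```

      In this invented wiring, apoptosis switches on only when both nodes are inactivated. The point of the sketch is the mechanics (clamp a node off, iterate to a steady state, read the phenotype node), not the biology.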

      That said, we understand that this aspect was not clearly defined in the manuscript. For this reason, we have now 1) extensively revised the Figure 2B; 2) edited the text (pg. 6) to clarify the purpose and the results of our approach; and 3) described in further detail (pg. 16-18) the experimental conditions of our multiparametric analysis.

      (2) My most significant criticism is that the proof-of-principle experiment evaluating the combination effects of midostaurin and SP600125 in the FLT3-ITD-TKD cell line model does not appear to show any synergism, in my view. The authors' interpretation of the data is that the addition of SP600125 to midostaurin rescues midostaurin resistance and results in increased apoptosis and decreased viability of the midostaurin-resistant cells. Indeed, they write on p.9: "Strikingly, the combined treatment of JNK inhibitor (SP600125) and midostaurin (PKC412) significantly increased the percentage of FLT3ITD-TKD cells in apoptosis (Fig. 4D). Consistently, in these experimental conditions, we observed a significant reduction of proliferating FLT3ITD-TKD cells versus cells treated with midostaurin alone (Fig. 4E)." However, looking at Figs 4D and 4E, it appears that the effects of the midostaurin/SP600125 combination are virtually identical to SP600125 alone, and midostaurin provides no additional benefit. No p-values are provided to compare midostaurin+SP600125 to SP600125 alone but there seems to be no appreciable difference between the two by eye. In addition, the evaluation of synergism (versus additive effects) requires the use of specialized mathematical models (see for example Duarte and Vale, 2022). That said, I do not appreciate even an additive effect of midostaurin combined with SP600125 in the data presented.

      We agree with the reviewer that the JNK inhibitor and midostaurin have neither a synergistic nor an additive effect, and we have now revised the text accordingly. Whether FLT3ITD-TKD AML cells benefit from midostaurin treatment is actively debated in the scientific community. In a recently published retrospective study by K. Döhner and colleagues (Rücker et al., 2022), the authors investigated the prognostic and predictive impact of the FLT3-ITD insertion site (IS) in 452 patients randomized within the RATIFY trial, which evaluated midostaurin in addition to intensive chemotherapy. Their study clearly showed that “Midostaurin exerted a significant benefit only for JMDsole” patients. In agreement with this result, we have demonstrated that midostaurin treatment had no effect on apoptosis of blasts derived from FLT3ITD-TKD patients (Massacci et al., 2023). On the other hand, we and others observed that midostaurin triggers apoptosis in FLT3ITD-TKD cells to a lesser extent than in FLT3ITD-JMD cells (Arreba-Tutusaus et al., 2016). The data presented here (Fig. 4) and our previously published papers (Massacci et al., 2023; Pugliese et al., 2023) indicate that hitting cell cycle regulators (WEE1, CDK7, JNK) induces a significant apoptotic response in TKI-resistant FLT3ITD-TKD cells. Prompted by the reviewer's comment, we have now revised the text and discussion (pg. 9; 14), highlighting the crucial role of JNK in apoptosis induction.
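
      To make the distinction between synergistic and additive effects concrete, one commonly used reference model is Bliss independence, under which two independently acting drugs with fractional effects E_A and E_B are expected to give E_A + E_B - E_A·E_B in combination. The sketch below is illustrative only: Bliss is one of several reference models (and not necessarily the one the reviewer has in mind), and the effect values are hypothetical, not read from Fig. 4.

```python
def bliss_expected(effect_a, effect_b):
    """Expected combined fractional effect under Bliss independence."""
    for effect in (effect_a, effect_b):
        if not 0.0 <= effect <= 1.0:
            raise ValueError("effects must be fractions between 0 and 1")
    return effect_a + effect_b - effect_a * effect_b


def bliss_excess(observed_combo, effect_a, effect_b):
    """Observed minus expected effect.

    Positive values suggest synergy, values near zero suggest independent
    action, and negative values suggest antagonism.
    """
    return observed_combo - bliss_expected(effect_a, effect_b)


# Hypothetical fractions of apoptotic cells (not data from the paper):
# drug A alone 0.25, drug B alone 0.5, combination 0.5.
expected = bliss_expected(0.25, 0.5)
excess = bliss_excess(0.5, 0.25, 0.5)
```

      In this made-up case the combination does no better than drug B alone and falls short of the Bliss expectation, which is the pattern the reviewer describes for the midostaurin/SP600125 combination. A real analysis would also need replicate measurements and a statistical test on the excess.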

      (3) In my view, there are significant issues with clarity and detail throughout the manuscript. For example, additional details and improved clarity are needed, in my view, with respect to the design and readouts of the signaling perturbation experiments (Methods, p. 15 and Fig 2B legend). For example, the Fig 2B legend states: "Schematic representation of the experimental design: FLT3 ITD-JMD and FLT3 ITD-JMD cells were cultured in starvation medium (w/o FBS) overnight and treated with selected kinase inhibitors for 90 minutes and IGF1 and TNFa for 10 minutes. Control cells are starved and treated with PKC412 for 90 minutes, while "untreated" cells are treated with IGF1 100ng/ml and TNFa 10ng/ml with PKC412 for 90 minutes.", which does not make sense to me. The "untreated" cells appear to be treated with more agents than the control cells. The logic behind cytokine stimulation is not adequately explained and it is not entirely clear to me whether the cytokines were used alone or in combination. Fig 2B is quite confusing overall, and it is not clear to me what the horizontal axis (i.e. columns of "experimental conditions", as opposed to "treatments") represents. The Method section states "Key cell signaling players were analyzed through the X-Map Luminex technology: we measured the analytes included in the MILLIPLEX assays" but the identities of the evaluated proteins are not given in the Methods. At the same time, the Results section states "TKIs sensitive and resistant cells were subjected to 16 experimental conditions" but these conditions do not appear to be listed (except in Supplementary data; and Fig 2B lists 9 conditions, not 16). In my subjective view, the manuscript would benefit from a clearer explanation and depiction of the experimental details and inhibitors used in the main text of the paper, as opposed to various Supplemental files/Figures. 
      The lack of clarity on what exactly were the experimental conditions makes the interpretation of Fig 2 very challenging. In the same vein, in the PCA analysis (Fig 2C) there seems to be no reference to the cytokine stimulation status while the authors claim that PC2 stratifies cells according to IGF1 vs TNFalpha. There are numerous other examples of incomplete or confusing legends and descriptions which, in my view, need to be addressed to make the paper more accessible.

      We thank the reviewer for his/her comment. We have now extensively revised the text of the manuscript (pg. 6), Fig. 2B (now Fig 2C) and the methods (pg. 16-18) to improve the clarity of our manuscript and make the take-home messages more accessible. We believe that the revised text and Figure 2 better explain our strategy and clarify the experimental setup. We also added details on the choice of experimental conditions and propose a better graphic representation of the analysis.

      (4) I am not sure that I see significant value in the patient-specific logic models because they are not supported by empirical evidence. Treating primary cells from AML patients with relevant drug combinations would be a feasible and convincing way to validate the computational models and evaluate their potential benefit in the clinical setting.

      We thank the reviewer for this comment. We have now performed additional experiments in a small cohort of FLT3-ITD positive patient-derived primary blasts. Specifically, we treated blasts from 2 FLT3ITD-TKD patients and 3 FLT3ITD-JMD+TKD patients with PKC412 (100 nM) and/or SP600125 (10 μM; JNK inhibitor). After 24h of treatment we measured the apoptotic rate. As shown below and in the new Fig. 4F (see pg. 10, main text), midostaurin triggers higher levels of apoptosis in FLT3ITD-JMD+TKD blasts than in FLT3ITD-TKD blasts. Importantly, treatment with the JNK inhibitor SP600125 alone triggers apoptosis in FLT3ITD-TKD blasts, validating the crucial role of JNK in FLT3ITD-TKD cell survival and TKI resistance. The combined treatment of midostaurin and SP600125 increases the percentage of apoptotic cells compared to midostaurin treatment alone, but to a lesser extent than single-agent SP600125 treatment. This result is in agreement with the current debate in the scientific community on the actual beneficial effect of midostaurin treatment in FLT3ITD-TKD AML patients.

      Author response image 1.

      Primary samples from AML patients with the FLT3ITD-TKD mutation (n=2, yellow bars) or the FLT3ITD-JMD/TKD mutation (n=3, blue bars) were exposed to Midostaurin (100nM, PKC412), and JNK inhibitor (10µM, SP600125) for 48 hours, or combinations thereof. The specific cell death of gated AML blasts was calculated to account for treatment-unrelated spontaneous cell death. The bars on the graph represent the mean values with standard errors.

      Reviewer #2 (Public Review):

      Summary:

      This manuscript by Latini et al describes a methodology to develop Boolean-based predictive logic models that can be applied to uncover altered protein/signalling networks in cancer cells and discover potential new therapeutic targets. As a proof-of-concept, they have implemented their strategy on a hematopoietic cell line engineered to express one of two types of FLT3 internal tandem mutations (FLT3-ITD) found in patients, FLT3-ITD-TKD (which are less sensitive to tyrosine kinase inhibitors/TKIs) and FLT3-ITD-JMD (which are more sensitive to TKIs).

      Strengths:

      This useful work could potentially represent a step forward towards personalised targeted therapy, by describing a methodology using Boolean-based predictive logic models to uncover altered protein/signalling networks within cancer cells. However, the weaknesses highlighted below severely limit the extent of any conclusions that can be drawn from the results.

      Weaknesses:

      While the highly theoretical approach proposed by the authors is interesting, the potential relevance of their overall conclusions is severely undermined by a lack of validation of their predicted results in real-world data. Their predictive logic models are built upon a set of poorly explained initial conditions, drawn from data generated in vitro from an engineered cell line, and no attempt was made to validate the predictions in independent settings. This is compounded by a lack of sufficient experimental detail or clear explanations at different steps. These concerns considerably temper one's enthusiasm about the conclusions that could be drawn from the manuscript.

      We thank the reviewer for the thorough review and kind comments about our manuscript. We hope the changes and new data we provide further strengthen it in his or her eyes.

      Some specific concerns include:

      (1) It remains unclear how robust the logic models are, or conversely, how affected they might be by specific initial conditions or priors that are chosen. The authors fail to explain the rationale underlying their input conditions at various points. For example: - at the start of the manuscript, they assert that they begin with a pre-PKN that contains "76 nodes and 193 edges", though this is then ostensibly refined with additional new edges (as outlined in Fig 2A). However, neither the reason these edges were added nor model performance comparisons against the basal model are presented, precluding an evaluation of whether this model is better.

      We understand the reviewer’s concern. We have now complemented the manuscript with an extended version of the proposed modelling strategy, offering a detailed description of the pipeline and the rationale behind each choice (Supplementary material, pg. 14-19). Furthermore, we also linked the manuscript to a GitHub repository where users can follow and reproduce each step of the pipeline (https://github.com/SaccoPerfettoLab/FLT3ITD_driven_AML_Boolean_models).

      • At a later step (relevant to Fig S4 and Fig 3), they develop separate PKNs, for each of the mutation models, that contain "206 [or] 208 nodes" and "756 [or] 782 edges", without explaining how these seemingly arbitrary initial conditions were arrived at. Their relation to the original parameters in the previous model is also not investigated, raising concerns about model over-fitting and calling into question the general applicability of their proposed approach. The authors need to provide a clearer explanation of the logic underlying some of these initial parameter selections, and also investigate the biological/functional overlap between these sets of genes (nodes).

      We thank the reviewer for raising this question. Very briefly, the proposed optimization strategy falls within a branch of modelling in which the predictive model is, indeed, driven by the data (Blinov and Moraru, 2012). From a certain point of view, the aim of the optimization is to fit the experimental data as well as possible. To achieve this, we followed standard practices (Dorier et al., 2016; Traynard et al., 2017). To address the issue of “calling into question the general applicability of their proposed approach”, we have compared the activity status of nodes in the models with ‘real data’ extracted from cell lines and patients’ samples to confirm the robustness and scalability of the strategy (please see below, response to point 3 pg. 9).

      Finally, as mentioned in the previous point, we have now provided a detailed supplementary material, where we have described all the aspects mentioned by the reviewer: step-by-step changes in the PKN, the choice of the parameters and other details can be traced over the novel text and are also available in the GitHub repository (https://github.com/SaccoPerfettoLab/FLT3-ITD_driven_AML_Boolean_models).

      (2) There is concern about the underlying experimental data underpinning the models that were generated, further compounded by the lack of a clear explanation of the logic. For example, data concerning the status of signalling changes as a result of perturbation appears to be generated from multiplex LUMINEX assays using phosphorylation-specific antibodies against just 14 "sentinel" proteins. However, very little detail is provided about the rationale underlying how these 14 were chosen to be "sentinels" (and why not just 13, or 15, or any other number, for that effect?). How reliable are the antibodies used to query the phosphorylation status? What are the signal thresholds and linear ranges for these assays, and how would these impact the performance/reliability of the logic models that are generated from them?

      We thank the reviewer for this comment as it gives us the opportunity to clarify and better explain the criteria behind the experimental data generation.

      Overall, we revised the main text at page 6 and Figure 2B to improve the clarity of our experimental design. Specifically, the sentinels were chosen because they were considered indirect or direct downstream effectors of the perturbations and were conceived to serve as both a benchmarking system of the study and a readout of the global perturbation of the system. To clarify this aspect, we have added a small network (compressed PKN) in Figure 2B to show that the proteins (green nodes) we chose to measure in the LUMINEX multiplex assay are “sentinels” of the activity of almost all the pathways included in the prior knowledge network. Moreover, we expanded the methods section “Multiparametric experiment of signaling perturbation” (pg. 16-18), where we added details about the antibodies used in the assay paired with the target phosphosites and their functional role (Table 3). We also better specified the filtering process based on the number of beads detected for each antibody used (pg. 18). Regarding the reliability of the measurements, the quality of the perturbation data greatly impacts the logic models’ performance. xMAP technology has already been used by the scientific community to generate highly reproducible and reliable multiparametric datasets for model training (Terfve et al., 2012). Additionally, we checked that for each sentinel we could measure a fully active state, a fully inactive state and intermediate states. The modulation of individual analytes is displayed in Figure S3.

      Author response image 2.

      Partial Figure of the normalization of analyte activity through Hill curves. Experimental data were normalized and scaled from 0 to 1 using analyte-specific Hill functions. Raw data are reported as triangles, normalized data as squares. The partial Figure shows three plots of the FLT3 ITD-JMD data (complete Figure in Supplementary material, Fig. S3).
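
      As an illustration only, the normalization described in this caption can be sketched as follows. The sketch assumes the standard Hill form x^n / (k^n + x^n); the function name, the parameters `k` and `n`, and the raw intensities are hypothetical and are not the values fitted by the authors.

```python
def hill_normalize(raw, k, n):
    """Map a non-negative raw signal onto [0, 1) with a Hill function.

    k is the half-saturation point (raw == k maps to 0.5) and n is the
    Hill coefficient controlling steepness. Both are hypothetical here.
    """
    if raw < 0:
        raise ValueError("raw signal must be non-negative")
    return raw ** n / (k ** n + raw ** n)


# Hypothetical raw intensities for one analyte across conditions.
raw_values = [50.0, 400.0, 1000.0, 5000.0]
normalized = [hill_normalize(v, k=1000.0, n=2.0) for v in raw_values]
```

      With analyte-specific parameters, a transformation of this kind would place weak signals near 0 and saturating signals near 1, which is what makes analytes with different dynamic ranges comparable.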

      (3) In addition, there are publicly available quantitative proteomics datasets from FLT3-mutant cell lines and primary samples treated with TKIs. At the very least, these should have been used by the authors to independently validate their models, selection of initial parameters, and signal performance of their antibody-based assays, to name a few unvalidated, yet critical, parameters. There is an overwhelming reliance on theoretical predictions without taking advantage of real-world validation of their findings. For example, the authors identified a set of primary AML samples with relevant mutations (Fig 5) that could potentially have provided a valuable experimental validation platform for their predictions of effective drug combination. Yet, they have performed Boolean simulations of the predicted effects, a perplexing instance of adding theoretical predictions on top of a theoretical prediction!

      Additionally, there are datasets of drug sensitivity on primary AML samples where mutational data is also known (for example, from the BEAT-AML consortia), that could be queried for independent validation of the authors' models.

      We thank the reviewer for this comment, which helped us to significantly strengthen our story. Prompted by his/her comment, we have now queried three different datasets for independent validation of our logic models. Specifically, we have taken advantage of quantitative phosphoproteomics datasets of FLT3-ITD cell lines treated with TKIs (Massacci et al., 2023), phosphoproteomic data of FLT3-ITD positive patient-derived primary blasts (Kramer et al., 2022) and drug sensitivity data on primary FLT3-ITD positive AML samples (BEAT-AML consortia).

      • Comparison with phosphoproteomic data of FLT3-ITD cell lines treated with TKIs (Massacci et al., 2023)

      Here, we compared the steady state of our model upon FLT3 inhibition with the phosphoproteomic data describing the modulation of 16,319 phosphosites in FLT3-ITD BaF3 cells (FLT3ITD-TKD and FLT3ITD-JMD) upon TKI treatment (i.e. quizartinib, a highly selective FLT3 inhibitor). As shown in the table below and new Figure S5A, the activation status of the nodes in the two generated models is highly comparable with the level of regulatory phosphorylations reported in the reference dataset. Briefly, to determine the agreement between each model and the independent dataset, we focused on the phosphorylation level of specific residues that (i) regulate the functional activity of sentinel proteins (denoted in the ‘Mode of regulation’ column) and (ii) were measured in this work to train the model. We then cross-referenced the sentinel protein status in the FLT3 inhibition simulation (as denoted in the 'Model simulation of FLT3 inhibition' column) with the functional impact of phosphorylation measured in the Massacci et al. dataset (as denoted in the 'Functional impact in quizartinib dataset' column). Points of congruence were summarized in the 'Consensus' column. As an example, if the phosphorylation level of an activating residue decreases (e.g., Y185 of Mapk1), we can conclude that the protein is inhibited (‘Down-reg’) and this is coherent with the model simulation in which Mapk1 is ‘Inactive’.
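
      The cross-referencing logic described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the string labels for site role and phosphorylation change are assumptions, while the 'Up-reg'/'Down-reg' and 'Active'/'Inactive' labels follow the column values quoted in the text.

```python
def functional_impact(site_role, phospho_change):
    """Translate a phosphosite change into the implied protein activity.

    site_role: "activating" or "inhibitory" (mode of regulation).
    phospho_change: "up" or "down" (measured phosphorylation change).
    """
    if site_role not in ("activating", "inhibitory"):
        raise ValueError(f"unknown site role: {site_role}")
    if phospho_change not in ("up", "down"):
        raise ValueError(f"unknown phosphorylation change: {phospho_change}")
    # More phosphorylation on an activating site, or less on an
    # inhibitory site, both imply higher protein activity.
    same_direction = (site_role == "activating") == (phospho_change == "up")
    return "Up-reg" if same_direction else "Down-reg"


def consensus(model_state, site_role, phospho_change):
    """Return True when the simulated state agrees with the dataset."""
    expected = {"Active": "Up-reg", "Inactive": "Down-reg"}[model_state]
    return functional_impact(site_role, phospho_change) == expected


# Example from the text: an activating residue of Mapk1 (Y185) is less
# phosphorylated upon treatment and the model predicts Mapk1 'Inactive'.
mapk1_agrees = consensus("Inactive", "activating", "down")
```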

      Author response image 3.

      • Comparison with phosphoproteomic data of FLT3-ITD patient-derived primary blasts (Kramer et al., 2022)

      Using the same criteria, we extended our validation efforts by comparing the activity status of the proteins in the “untreated” simulation (i.e. reproducing the tumorigenic state where FLT3, IGF1R and TNFR are set to be active) with their phosphorylation levels in the dataset by Kramer et al. (Kramer et al., 2022). Briefly, this dataset gathers phosphoproteomic data from a cohort of 44 AML patients and we restricted the analysis to 11 FLT3-ITD-positive patients. Importantly, all patients carry the ITD mutation in the juxtamembrane domain (JMD), thus allowing comparison exclusively with the FLT3 ITD-JMD-specific Boolean model.

      The results are shown in the heatmap below. Each cell in the heatmap reports the phosphorylation level of sentinel proteins’ residues in the indicated patient (red and blue indicate up- or down-regulated phosphoresidues, respectively). Patients were clustered according to Pearson correlation. We observed a good level of agreement between the patients’ phosphoproteomics data and our model (reported in the column “Tumor simulation steady state”) for a subset of patients highlighted within the black rectangle. However, for the remaining patients, the level of agreement is poor. The main reason is that our work focuses on FLT3-ITD signaling, and a systematic translation of the Boolean modeling approach to the entire cohort of AML patients would require including the impact of other driver mutations in the network. This is a current and future line of investigation of our group. We have revised the discussion, taking this result into consideration.
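
      For readers unfamiliar with correlation-based clustering, the pairwise measure it relies on can be sketched as below. This is only a sketch under stated assumptions: the patient labels and phosphorylation values are hypothetical, and the distance 1 - r is one common choice that the response does not confirm was used.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (>= 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


def correlation_distances(profiles):
    """Pairwise distance (1 - r) between named profiles."""
    names = list(profiles)
    return {
        (a, b): 1.0 - pearson(profiles[a], profiles[b])
        for a in names
        for b in names
    }


# Hypothetical phosphorylation levels of four residues in three patients.
profiles = {
    "patient_A": [1.2, -0.5, 0.8, -1.0],
    "patient_B": [1.0, -0.4, 0.9, -1.1],
    "patient_C": [-0.9, 0.6, -0.7, 1.2],
}
distances = correlation_distances(profiles)
```

      A hierarchical clustering routine would then merge the patients with the smallest distances first, so that patients with similar phosphorylation profiles end up adjacent in the heatmap.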

      Author response image 4.

      • Comparison with drug sensitivity data on primary FLT3-ITD positive AML samples (BEAT-AML consortia)

      Here we took advantage of the Beat AML programme, which covers a cohort of 672 tumour specimens collected from 562 patients. The BEAT AML consortium provides whole-exome sequencing, RNA sequencing and analyses of ex vivo drug sensitivity of this large cohort of patient-derived primary blasts. We focused on drug sensitivity screening on 134 patients carrying the typical FLT3-ITD mutation in the JMD region. Unfortunately, the ITD insertion in the TKD region is less characterized and additional in-depth sequencing studies are required to identify FLT3ITD-TKD positive blasts in this cohort. Next, we focused on those compounds hitting nodes present in the FLT3ITD-JMD Boolean model. Specifically, we selected drugs inhibiting FLT3, PI3K, mTOR, JNK and p38 and we calculated the average IC50 of FLT3ITD-JMD patient-derived primary blasts for each drug. These results are reported as a bar graph in the new Fig. S5B and below (upper panel) and were compared with the apoptotic and proliferation rates measured in the in silico simulation of the FLT3ITD-JMD Boolean model. Drug sensitivity screening on primary FLT3ITD-JMD blasts revealed that inhibition of FLT3, PI3K and mTOR induces cell death at low drug concentrations, in contrast with JNK and p38 inhibitors, which show higher IC50 values. These observations are consistent with our simulation results of the FLT3ITD-JMD model. As expected, in silico inhibition of FLT3 greatly impacts apoptosis and proliferation. Additionally, in silico suppression of mTOR and, to a lesser extent, of PI3K and p38 affects apoptosis and proliferation. Of note, JNK inhibition does not seem to affect the viability of FLT3ITD-JMD cells either in silico or in vitro.
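
      The per-drug averaging step mentioned above amounts to a simple group-by, sketched here for clarity. The record layout, drug labels and IC50 values are hypothetical placeholders and do not come from the BEAT AML data.

```python
from collections import defaultdict
from statistics import mean


def mean_ic50_by_drug(records):
    """Average IC50 per drug from (patient_id, drug, ic50) records."""
    grouped = defaultdict(list)
    for _patient, drug, ic50 in records:
        grouped[drug].append(ic50)
    return {drug: mean(values) for drug, values in grouped.items()}


# Hypothetical measurements for two patients and two drug classes.
records = [
    ("P1", "FLT3_inhibitor", 0.2),
    ("P2", "FLT3_inhibitor", 0.4),
    ("P1", "JNK_inhibitor", 8.0),
    ("P2", "JNK_inhibitor", 10.0),
]
averages = mean_ic50_by_drug(records)
```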

      Author response image 5.

      Altogether these publicly available datasets independently validate our models, strengthening the reliability and robustness of our approach.

      We have now revised the main text (pg. 8; 9) and added a new Figure (Fig. S5) in the supplementary material; we collected the results of the analysis in Table S6.

      (4) There are additional examples of insufficient experimental detail that preclude a fuller appreciation of the relevance of the work. For example, it is alluded that RNA-sequencing was performed on a subset of patients, but the entire methodological section detailing the RNA-seq amounts to just 3 lines! It is unclear which samples were selected for sequencing nor where the data has been deposited (or might be available for the community - there are resources for restricted/controlled access to deidentified genomics/transcriptomics data).

      We apologize for the lack of description regarding the RNA sequencing of patient samples. We have now added details of this approach in the methods section (pg. 24) and clearly explained in the text how we selected the patients for the analysis. Additionally, the data have now been deposited in the GEO database (accession number: GSE247483).

      The sentences we have rephrased are below:

      “We analyzed the mutational and expression profiles of 262 genes (Table S7), relevant to hematological malignancies in a cohort of 14 FLT3-ITD positive de novo AML patients (Fig. 5A, panel a). Since follow-up clinical data were available for 10 out of 14 patients (Fig. 5B, Table S9), we focused on this subset of patients. Briefly, the classification of these 10 patients according to their ITD localization (see Methods) was as follows: 8 patients with FLT3ITD-JMD, 4 with FLT3ITD-JMD+TKD, and 2 with FLT3ITD-TKD (Fig. 5A, panel b). The specific insertion sites of the ITD in the patient cohort are shown in Table S8.”

      Similarly, in the "combinatory treatment inference" methods, it states "...we computed the steady state of each cell line best model....." and "Then we inferred the activity of "apoptosis" and "proliferation" phenotypes", without explaining the details of how these were done. The outcomes of these methods are directly relevant to Fig 4, but with such sparse methodological detail, it is difficult to independently assess the validity of the presented data.

      Overall, the theoretical nature of the work is hampered by real-world validation, and insufficient methodological details limit a fuller appreciation of the overall relevance of this work.

      We thank the reviewer for the insightful feedback regarding the methodology in our paper. Regarding ‘real-world validation’, we have replied extensively to this issue in point 3 (pg. 9-14 of this document). Regarding the ‘insufficient methodological details’, we have made substantial improvements to enhance clarity and reproducibility, which encompass: (i) revisions in the main text and in the Materials and Methods section; (ii) a detailed explanation of each step and decision taken, which can be accessed either as an extended Materials and Methods section (Supplementary material, pg. 14-19) or through our GitHub repository (https://github.com/SaccoPerfettoLab/FLT3-ITD_driven_AML_Boolean_models). We sincerely hope this addition addresses the concerns and facilitates a more thorough and independent assessment of our work.
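
      To make the notions of "steady state" and in silico inhibition concrete, a minimal sketch is given below. It assumes a synchronous update scheme and models inhibition by clamping a node to False; the five-node wiring is a toy network invented for illustration and is not the trained FLT3-ITD model, whose actual rules and simulation procedure are those in the authors' supplementary methods and repository.

```python
def steady_state(rules, initial, inhibited=(), max_steps=100):
    """Synchronously update a Boolean network until it stops changing.

    rules: dict mapping each node to a function of the current state.
    initial: dict mapping each node to its starting Boolean value.
    inhibited: nodes clamped to False (simulated drug inhibition).
    Raises RuntimeError if no fixed point is reached (e.g. a cycle).
    """
    state = dict(initial)
    for node in inhibited:
        state[node] = False
    for _ in range(max_steps):
        new_state = {
            node: False if node in inhibited else bool(rule(state))
            for node, rule in rules.items()
        }
        if new_state == state:
            return state
        state = new_state
    raise RuntimeError("no steady state reached within max_steps")


# Toy wiring, hypothetical: two inputs, one intermediate, two phenotypes.
TOY_RULES = {
    "FLT3": lambda s: s["FLT3"],   # input node keeps its value
    "JNK": lambda s: s["JNK"],     # input node keeps its value
    "ERK": lambda s: s["FLT3"],
    "Proliferation": lambda s: s["ERK"] or s["JNK"],
    "Apoptosis": lambda s: not s["ERK"] and not s["JNK"],
}
start = {
    "FLT3": True, "JNK": True, "ERK": False,
    "Proliferation": False, "Apoptosis": False,
}

untreated = steady_state(TOY_RULES, start)
combo = steady_state(TOY_RULES, start, inhibited=("FLT3", "JNK"))
```

      In this toy example the phenotype nodes are simply read from the steady state, so comparing `untreated` with `combo` mimics how a treatment effect on "apoptosis" and "proliferation" could be inferred from a simulation.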

      Reviewer #3 (Public Review):

      Summary:

      The paper "Unveiling the signaling network of FLT3-ITD AML improves drug sensitivity prediction" reports the combination of prior knowledge signaling networks, multiparametric cell-based data on the activation status of 14 crucial proteins emblematic of the cell state downstream of FLT3 obtained under a variety of perturbation conditions and Boolean logic modeling, to gain mechanistic insight into drug resistance in acute myeloid leukemia patients carrying the internal tandem duplication in the FLT3 receptor tyrosine kinase and predict drug combinations that may reverse pharmacoresistant phenotypes. Interestingly, the utility of the approach was validated in vitro, and also using mutational and expression data from 14 patients with FLT3-ITD positive acute myeloid leukemia to generate patient-specific Boolean models.

      Strengths:

      The model predictions were positively validated in vitro: it was predicted that the combined inhibition of JNK and FLT3, may reverse resistance to tyrosine kinase inhibitors, which was confirmed in an appropriate FLT3 cell model by comparing the effects on apoptosis and proliferation of a JNK inhibitor and midostaurin vs. midostaurin alone.

      Whereas the study does have some complexity, readability is enhanced by the inclusion of a section that summarizes the study design, plus a summary Figure. Availability of data as supplementary material is also a high point.

      We thank the reviewer for his/her constructive comments about our manuscript. We believe that our story has been significantly strengthened by the changes and new data we provided.

      Weaknesses:

      (1) Some aspects of the methodology are not properly described (for instance, no methodological description has been provided regarding the clustering procedure that led to Figs. 2C and 2D).

      We apologize for the lack of a proper description of the methodology. We have extensively revised the methods section and worked to improve its clarity. We have now added a description of the clustering procedures used for the new Fig. S2D and Fig. S2E in the methods section (pg. 19).

      It is not clear in the manuscript whether the patients gave their consent to the use of their data in this study, or the approval from an ethical committee. These are very important points that should be made explicit in the main text of the paper.

      We thank the reviewer for this comment. We have now added the following sentence (pg. 24): “Peripheral blood (PB) samples from 14 AML patients were obtained upon patient’s informed consent.”

      The authors claim that some of the predictions of their models were later confirmed in the follow-up of some of the 14 patients, but it is not crystal clear whether the models helped the physicians to make any decisions on tailored therapeutic interventions, or if this has been just a retrospective exercise and the predictions of the models coincide with (some of) the clinical observations in a rather limited group of patients. Since the paper presents this as additional validation of the models' ability to guide personalized treatment decisions, it would be very important to clarify this point and expand the presentation of the results (comparison of observations vs. model predictions).

      As described in the introduction section, this study was inspired by an urgent clinical problem in AML research: patients carrying the ITD in the TKD domain of the FLT3 receptor display poor prognosis and do not respond to current therapy: Midostaurin (which on the other hand is effective in patients with the ITD in the JMD domain).

      To fill this gap, we gathered a team of 18 participants, of whom 7 have a clinical background and expertise in the diagnosis, treatment and management of AML patients and 5 are experts in Boolean modeling. The scope of the project is the development of a computational approach to identify possible alternative solutions for FLT3ITD-TKD AML patients, generating future lines of investigation. Drug combinations are currently under investigation as a potential means of avoiding drug resistance and achieving more effective and durable treatment responses. However, it is impractical to test for potential synergistic properties among all available drugs using empirical experiments alone. With our approach, we developed models that recreated in silico the main differences in the signaling of sensitive and resistant cells to support the prioritization of novel therapies. Prompted by the reviewer’s suggestions, we have now extended the validation of our models through comparison with publicly available cell line and patient-derived datasets. We have also confirmed our results by performing in vitro experiments in patient-derived primary blasts treated with midostaurin and/or JNK inhibitor. Importantly, we have already demonstrated that hitting cell cycle regulators in FLT3ITD-TKD cells can be an effective approach to kill resistant leukemia cells (Massacci et al., 2023; Pugliese et al., 2023). We are aware that changing clinical practice and patient therapies requires a proper clinical study, which goes far beyond the scope of this manuscript.

      However, we hope that our results can soon be translated from bench to bedside. Importantly, we believe that our study can open lines of investigation aimed at applying our approach to identify promising therapeutic strategies in other clinical settings.

      Recommendations for the authors

      The reviewers have highlighted significant issues regarding the inadequate level of evidence to support some of the conclusions, plus lack of an exhaustive methodological description that may jeopardize reproducibility.

      We hope that the editor and the reviewers will appreciate the extensive revision we made and the new data and analyses we provided to strengthen our story.

      Reviewer #1 (Recommendations For The Authors):

      (1) In Fig 2D the hierarchical tree is off-set in relation to the treatment symbols and names in the middle of the Figure. In addition, I do not see FLT3i combination with JNKi in the JMD cells (perhaps, a coloring error?).

      We thank the reviewer for this observation. We have now revised the hierarchical tree, which is now in Figure S2D: we have aligned the tree with the symbols and names and corrected the colouring error for the sample FLT3i+JNKi in JMD cells.

      (2) Midostaurin and PKC412 refer to the same drug and are used interchangeably in the manuscript. Using one name consistently would improve readability.

      We have now improved the readability of the text and the Figures by choosing “Midostaurin” when we refer to the FLT3 inhibitor.

      (3) It is not clear to me why the FLT3-ITD-JMD cells are not presented in Fig. 4B. Perhaps their values are 0? In that case, the readability would be improved by including a thin blue line representing zero values. Additionally, on p.8 the authors state "Interestingly, in the FLT3ITD-TKD model, the combined inhibition of JNK and FLT3, exclusively, in silico restores the TKI sensitivity, as revealed by the evaluation of the apoptosis and proliferation levels (Fig. 4B-C)." but Fig. 4C shows no differential effects of JNK inhibition in sensitive versus resistant cells.

      To address the reviewer's point, we have added a thin blue line representing the zero values of the FLT3ITD-JMD cells in the results of the simulations in Figure 4B. Regarding Figure 4C, the reviewer is right in saying that there is no difference in terms of proliferation between sensitive and resistant cells upon JNKi and FLT3i co-inhibition. However, we can see lower proliferation levels in both cell lines as compared to the “untreated” condition. Indeed, the simulation suggests that by combining JNK and FLT3 inhibition we revert the resistant phenotype, lowering the proliferation rate of the resistant cells to the TKI-sensitive levels.

      Reviewer #2 (Recommendations For The Authors):

      I have addressed a number of concerns in the public review. Much better effort needs to be made to provide sufficient methodological detail (to permit independent validation by a sufficiently capable and motivated party) and explain the rationale of important parameter selections. Furthermore, I urge the authors to take advantage of the plethora of publicly available real-world data to validate their predicted outcomes.

      We are grateful to the reviewer for the careful revisions. All the aspects raised have been discussed in the specific sections of the public review. In summary, we have provided more methodological details by revising the text and the methods section, by adding a new step-by-step description of the modelling strategy, the parameters and the criteria adopted in each phase (supplementary methods), and by referring to the entire code developed. Prompted by the reviewer’s suggestions, we have performed a novel and extensive comparison of our model with three different publicly available datasets. This analysis significantly strengthens our story, and a new supplementary Figure (Fig. S5) summarizes our findings (pg. 9-14 of this document).

      Reviewer #3 (Recommendations For The Authors):

      (1) At first sight, the distribution of the data points in the PCA space does not really seem to speak of nice clustering. Have the authors computed any clustering validation metric to assess if their clustering strategy is adequate and how informative the results are? Further analysis of this point of the article is precluded by the absence of a clear methodological description.

      Here we have used PCA to obtain a global view of our complex multiparametric data. We have now reworked the PCA plot to improve its readability. As shown in the new Figure 2D, the PCA showed that the activity level of sentinel proteins stratifies cells according to FLT3 activation status (component 1: presence vs absence of FLT3i) and cytokine stimulation (component 2: IGF1 vs TNFα). We have now added new experimental details on this part in the methods section (pg. 19) and we deposited the code used for the clustering strategy in the GitHub repository (https://github.com/SaccoPerfettoLab/FLT3ITD_driven_AML_Boolean_models).

      (2) Whereas scientists and medical professionals who work in the field of oncology may be familiar with some of the abbreviations used here, it would be good for improved readability by a more general audience to make sure that all the abbreviations (e.g., TKI) are properly defined the first time that they appear in the text.

      We thank the reviewer for this observation. To improve the readability of the text, we properly defined all the abbreviations at their first appearance, and we added the “Abbreviation” paragraph at page 15 of the manuscript to summarize them all.

      (3) How were the concentrations of the combined treatments chosen in the cell assays used as validation?

      We thank the reviewer for giving us the chance to clarify this point. We expanded the Methods with additional information about the treatments used in the validations. We detailed the SP600125 IC50 evaluation and usage in our cell lines (pg. 22): IC50 values are approximately 1.5 µM in FLT3-ITD mutant cell lines; the SP600125 treatment affects cell viability, reaching a plateau phase of cell death at about 2 µM. We used the minimal dose of SP600125 (10 µM) to properly inhibit JNK (Kim et al., 2010; Moon et al., 2009).

      We also specified (pg. 22) that the concentration of Midostaurin was chosen based on previously published work (Massacci et al., 2022): FLT3 ITD-TKD cells treated with 100 nM Midostaurin show a lower apoptotic rate and higher cell viability compared to FLT3 ITD-JMD cells.

      The concentrations of SB203580 and U0126 were chosen based on previous data available in the lab and on set-up experiments (pg. 22).

      (4) The authors say that "we were able to derive patient-specific signaling features and enable the identification of potential tailored treatments restoring TKI resistance" and that "our predictions were confirmed by follow-up clinical data for some patients". However, the results section on this part of the manuscript is rather scarce (the main text should be much more descriptive about the results summarized in Fig. 5, which are not self-explanatory).

      We thank the reviewer for this observation. We have now expanded the text to provide a more comprehensive description of the results about personalized Boolean model generation and usage and the content presented in Fig. 5 (pg.10-12).

      (5) I do not really agree with the final conclusion about this paper being "the proof of concept that our personalized informatics approach described here is clinically valid and will enable us to propose novel patient-centered targeted drug solutions". First, the clinical data used here belongs to a rather low number of patients. Second, as mentioned before, it is not clear if the models have been used to make any prospective decision or if this conclusion is drawn from an in vitro assay plus a retrospective analysis on a limited number of patients. Moreover, a description of the results and the discussion of the part of the manuscript dealing with patient-specific models is rather scarce, and it is difficult to see how the authors support their conclusions. Also, the statement "In principle, the generalization of our strategy will enable to obtain a systemic perspective of signaling rewiring in different cancer types, driving novel personalized approaches" may be a bit overoptimistic if one considers that so far, the approach has only been applied to a single type of drug-resistant cancer.

      We thank the reviewer for this comment. We agree with the referees that the clinical data we used belongs to a rather low number of patients. However, during the revision we have extensively worked to support the clinical relevance of our models and our discoveries. Specifically, we have compared our Boolean logic models with two different publicly available datasets on phosphoproteomics and drug sensitivity of FLT3ITD-JMD and FLT3ITD-TKD cell lines and blasts (Fig. S5 and answer to reviewer 2, point 3). Importantly, these datasets independently validated our models, highlighting that our approach has a translational value. Additionally, we have performed novel experiments by measuring the apoptotic rate of patient-derived primary blasts upon pharmacological suppression of JNK (Fig. 4H, pg. 10 of main text). Our data highlight that our approach has the potential to suggest novel effective treatments.

      That said, we have now revised the discussion to avoid overstatements.

      References

      Arreba-Tutusaus, P., Mack, T.S., Bullinger, L., Schnöder, T.M., Polanetzki, A., Weinert, S., Ballaschk, A., Wang, Z., Deshpande, A.J., Armstrong, S.A., Döhner, K., Fischer, T., Heidel, F.H., 2016. Impact of FLT3-ITD location on sensitivity to TKI-therapy in vitro and in vivo. Leukemia 30, 1220–1225. https://doi.org/10.1038/leu.2015.292

      Blinov, M.L., Moraru, I.I., 2012. Logic modeling and the ridiculome under the rug. BMC Biol 10, 92. https://doi.org/10.1186/1741-7007-10-92

      Dorier, J., Crespo, I., Niknejad, A., Liechti, R., Ebeling, M., Xenarios, I., 2016. Boolean regulatory network reconstruction using literature based knowledge with a genetic algorithm optimization method. BMC Bioinformatics 17, 410. https://doi.org/10.1186/s12859-016-1287-z

      Kramer, M.H., Zhang, Q., Sprung, R., Day, R.B., Erdmann-Gilmore, P., Li, Y., Xu, Z., Helton, N.M., George, D.R., Mi, Y., Westervelt, P., Payton, J.E., Ramakrishnan, S.M., Miller, C.A., Link, D.C., DiPersio, J.F., Walter, M.J., Townsend, R.R., Ley, T.J., 2022. Proteomic and phosphoproteomic landscapes of acute myeloid leukemia. Blood 140, 1533–1548. https://doi.org/10.1182/blood.2022016033

      Massacci, G., Venafra, V., Latini, S., Bica, V., Pugliese, G.M., Graziosi, S., Klingelhuber, F., Krahmer, N., Fischer, T., Mougiakakos, D., Boettcher, M., Perfetto, L., Sacco, F., 2023. A key role of the WEE1-CDK1 axis in mediating TKI-therapy resistance in FLT3-ITD positive acute myeloid leukemia patients. Leukemia 37, 288–297. https://doi.org/10.1038/s41375-022-01785-w

      Pugliese, G.M., Venafra, V., Bica, V., Massacci, G., Latini, S., Graziosi, S., Fischer, T., Mougiakakos, D., Boettcher, M., Perfetto, L., Sacco, F., 2023. Impact of FLT3-ITD location on cytarabine sensitivity in AML: a network-based approach. Leukemia 37, 1151–1155. https://doi.org/10.1038/s41375-023-01881-5

      Rücker, F.G., Du, L., Luck, T.J., Benner, A., Krzykalla, J., Gathmann, I., Voso, M.T., Amadori, S., Prior, T.W., Brandwein, J.M., Appelbaum, F.R., Medeiros, B.C., Tallman, M.S., Savoie, L., Sierra, J., Pallaud, C., Sanz, M.A., Jansen, J.H., Niederwieser, D., Fischer, T., Ehninger, G., Heuser, M., Ganser, A., Bullinger, L., Larson, R.A., Bloomfield, C.D., Stone, R.M., Döhner, H., Thiede, C., Döhner, K., 2022. Molecular landscape and prognostic impact of FLT3-ITD insertion site in acute myeloid leukemia: RATIFY study results. Leukemia 36, 90–99. https://doi.org/10.1038/s41375-021-01323-0

      Terfve, C., Cokelaer, T., Henriques, D., MacNamara, A., Goncalves, E., Morris, M.K., van Iersel, M., Lauffenburger, D.A., Saez-Rodriguez, J., 2012. CellNOptR: a flexible toolkit to train protein signaling networks to data using multiple logic formalisms. BMC Syst Biol 6, 133. https://doi.org/10.1186/1752-0509-6-133

      Traynard, P., Tobalina, L., Eduati, F., Calzone, L., Saez-Rodriguez, J., 2017. Logic Modeling in Quantitative Systems Pharmacology. CPT Pharmacometrics Syst. Pharmacol. 6, 499–511. https://doi.org/10.1002/psp4.12225

    1. Author Response

      The following is the authors’ response to the original reviews.

      We thank you for the time you took to review our work and for your feedback!

      The major changes to the manuscript are:

      1) Prompted by multiple reviewers, we have replaced the statistical analysis in Figure 1L with a bootstrap analysis, added an ANOVA (in Table S1), and have also added the same analysis with mice as the statistical unit as Figure S4J to the manuscript.

      2) In response to reviewer 1, comment 3, we have replaced the response latency maps previously shown in Figures 3B, 3C, 3E and 3F with response amplitude maps.

      3) In response to reviewer 2, comment 1, we have added a variant of the response traces shown in Figures 3B, 3C, 3E and 3F with mice as the statistical unit as Figures S2C and S2D.

      4) In response to reviewer 2, public review, we have added data from additional experiments as Figures S6F-S6H, that control for the effect of a saline injection.
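
      As a generic illustration of the kind of bootstrap analysis mentioned in point 1, the sketch below computes a percentile confidence interval for a mean with one value per statistical unit (for example, one value per mouse). The function name, the number of resamples, the seed and the per-mouse values are all hypothetical; the response does not specify the exact procedure used for Figure 1L.

```python
import random
from statistics import mean


def bootstrap_mean_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean.

    values: one summary value per statistical unit (e.g. per mouse).
    Units are resampled with replacement n_boot times.
    """
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    boot_means = sorted(
        mean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = int((1 - alpha / 2) * n_boot) - 1
    return boot_means[lo_idx], boot_means[hi_idx]


# Hypothetical per-mouse response amplitudes.
per_mouse = [0.8, 1.1, 0.9, 1.3, 1.0]
ci_low, ci_high = bootstrap_mean_ci(per_mouse)
```

      Resampling whole animals rather than individual trials or pixels keeps the interval honest about the number of independent observations, which is the motivation for using mice as the statistical unit.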

      A detailed point-by-point response to all reviewer concerns is provided in the following.  

      Reviewer #1 (Public Review):

      The authors present a study of visuo-motor coupling primarily using wide-field calcium imaging to measure activity across the dorsal visual cortex. They used different mouse lines or systemically injected viral vectors to allow imaging of calcium activity from specific cell-types with a particular focus on a mouse-line that expresses GCaMP in layer 5 IT (intratelencephalic) neurons. They examined the question of how the neural response to predictable visual input, as a consequence of self-motion, differed from responses to unpredictable input. They identify layer 5 IT cells as having a different response pattern to other cell-types/layers in that they show differences in their response to closed-loop (i.e. predictable) vs open-loop (i.e. unpredictable) stimulation whereas other cell-types showed similar activity patterns between these two conditions. They analyze the latencies of responses to visuomotor prediction errors obtained by briefly pausing the display while the mouse is running, causing a negative prediction error, or by presenting an unpredicted visual input causing a positive prediction error. They suggest that neural responses related to these prediction errors originate in V1, however, I would caution against overinterpretation of this finding as judging the latency of slow calcium responses in wide-field signals is very challenging and this result was not statistically compared between areas. Surprisingly, they find that presentation of a visual grating actually decreases the responses of L5 IT cells in V1. They interpret their results within a predictive coding framework that the last author has previously proposed. The response pattern of the L5 IT cells leads them to propose that these cells may act as 'internal representation' neurons that carry a representation of the brain's model of its environment. Though this is rather speculative. They subsequently examine the responses of these cells to anti-psychotic drugs (e.g. clozapine) with the reasoning that a leading theory of schizophrenia is a disturbance of the brain's internal model and/or a failure to correctly predict the sensory consequences of self-movement. They find that anti-psychotic drugs strongly enhance responses of L5 IT cells to locomotion while having little effect on other cell-types. Finally, they suggest that anti-psychotics reduce long-range correlations between (predominantly) L5 cells and reduce the propagation of prediction errors to higher visual areas and suggest this may be a mechanism by which these drugs reduce hallucinations/psychosis.

      This is a large study containing a screening of many mouse-lines/expression profiles using wide-field calcium imaging. Wide-field imaging has its caveats, including a broad point-spread function of the signal and susceptibility to hemodynamic artifacts, which can make interpretation of results difficult. The authors acknowledge these problems and directly address the hemodynamic occlusion problem. It was reassuring to see supplementary 2-photon imaging of soma to complement this data-set, even though this is rather briefly described in the paper. Overall the paper's strengths are its identification of a very different response profile in the L5 IT cells compared to other layers/cell-types which suggests an important role for these cells in handling integration of self-motion generated sensory predictions with sensory input. The interpretation of the responses to anti-psychotic drugs is more speculative but the result appears robust and provides an interesting basis for further studies of this effect with more specific recording techniques and possibly behavioral measures.

      We thank the reviewer for the feedback and the help with improving the manuscript. We agree, the findings presented in this study are merely a starting point. The two questions we are currently pursuing in follow up work are:

      1) Do the findings generalize to all known antipsychotic drugs?

      2) What is the mechanism by which these drugs induce a decorrelation of activity, specifically in layer 5 neurons?

      But we suspect these questions will take at least a few more years of research to answer.

      Reviewer #2 (Public Review):

      Summary:

      This work investigates the effects of various antipsychotic drugs on cortical responses during visuomotor integration. Using wide-field calcium imaging in a virtual reality setup, the researchers compare neuronal responses to self-generated movement during locomotion-congruent (closed loop) or locomotion-incongruent (open loop) visual stimulation. Moreover, they probe responses to unexpected visual events (halt of visual flow, sudden-onset drifting grating). The researchers find that, in contrast to a variety of excitatory and inhibitory cell types, genetically defined layer 5 excitatory neurons distinguish between the closed and the open loop condition and exhibit activity patterns in visual cortex in response to unexpected events, consistent with unsigned prediction error coding. Motivated by the idea that prediction error coding is aberrant in psychosis, the authors then inject the antipsychotic drug clozapine, and observe that this intervention specifically affects closed loop responses of layer 5 excitatory neurons, blunting the distinction between the open and closed loop conditions. Clozapine also leads to a decrease in long-range correlations between L5 activity in different brain regions, and similar effects are observed for two other antipsychotics, aripiprazole and haloperidol, but not for the stimulant amphetamine. The authors suggest that altered prediction error coding in layer 5 excitatory neurons due to reduced long-range correlations in L5 neurons might be a major effect of antipsychotic drugs and speculate that this might serve as a new biomarker for drug development.

      Strengths:

      • Relevant and interesting research question:

      The distinction between expected and unexpected stimuli is blunted in psychosis but the neural mechanisms remain unclear. Therefore, it is critical to understand whether and how antipsychotic drugs used to treat psychosis affect cortical responses to expected and unexpected stimuli. This study provides important insights into this question by identifying a specific cortical cell type and long-range interactions as potential targets. The authors identify layer 5 excitatory neurons as a site where functional effects of antipsychotic drugs manifest. This is particularly interesting as these deep layer neurons have been proposed to play a crucial role in computing the integration of predictions, which is thought to be disrupted in psychosis. This work therefore has the potential to guide future investigations on psychosis and predictive coding towards these layer 5 neurons, and ultimately improve our understanding of the neural basis of psychotic symptoms.

      • Broad investigation of different cell types and cortical regions:

      One of the major strengths of this study is its quasi-systematic approach towards cell types and cortical regions. By analysing a wide range of genetically defined excitatory and inhibitory cell types, the authors were able to identify layer 5 excitatory neurons as exhibiting the strongest responses to unexpected vs. expected stimuli and being the most affected by antipsychotic drugs. Hence, this quasi-systematic approach provides valuable insights into the functional effects of antipsychotic drugs on the brain, and can guide future investigations towards the mechanisms by which these medications affect cortical neurons.

      • Bridging theory with experiments

      Another strength of this study is its theoretical framework, which is grounded in the predictive coding theory. The authors use this theory as a guiding principle to motivate their experimental approach connecting visual responses in different layers with psychosis and antipsychotic drugs. This integration of theory and experimentation is a powerful approach to tie together the various findings the authors present and to contribute to the development of a coherent model of how the brain processes visual information both in health and in disease.

      Weaknesses:

      • Unclear relevance for psychosis research

      From the study, it remains unclear whether the findings might indeed be able to normalise altered predictive coding in psychosis. Psychosis is characterised by a blunted distinction between predicted and unpredicted stimuli. The results of this study indicate that antipsychotic drugs further blunt the distinction between predicted and unpredicted stimuli, which would suggest that antipsychotic drugs would deteriorate rather than ameliorate the predictive coding deficit found in psychosis. However, these findings were based on observations in wild-type mice at baseline. Given that antipsychotics are thought to have little effect in health but potent antipsychotic effects in psychosis, it seems possible that the presented results might be different in a condition modelling a psychotic state, for example after a dopamine-agonistic or an NMDA-antagonistic challenge. Therefore, future work in models of psychotic states is needed to further investigate the translational relevance of these findings.

      • Incomplete testing of predictive coding interpretation

      While the investigation of neuronal responses to different visual flow stimuli is interesting, it remains open whether these responses indeed reflect internal representations in the framework of predictive coding. While the responses are consistent with internal representation as defined by the researchers, i.e., unsigned prediction error signals, an alternative interpretation might be that responses simply reflect sensory bottom-up signals that are more related to some low-level stimulus characteristics than to prediction errors. Moreover, this interpretational uncertainty is compounded by the fact that the used experimental paradigms were not suited to test whether behaviour is impacted as a function of the visual stimulation, which makes it difficult to assess what the internal representation of the animal actually was. For these reasons, the observed effects might reflect simple bottom-up sensory processing alterations and not necessarily have any functional consequences. While this potential alternative explanation does not detract from the value of the study, future work would be needed to explain the effect of antipsychotic drugs on responses to visual flow. For example, experimental designs that systematically vary the predictive strength of coupled events or that include a behavioural readout might be more suited to draw conclusions about whether antipsychotic drugs indeed alter internal representations.

      • Methodological constraints of experimental design

      While the study findings provide valuable insights into the potential effects of antipsychotic drugs, it is important to acknowledge that there may be some methodological constraints that could impact the interpretation of the results. More specifically, the experimental design does not include a negative control condition or different doses. These conditions would help to ensure that the observed effects are not due to unspecific effects related to injection-induced stress or time, and not confined to a narrow dose range that might or might not reflect therapeutic doses used in humans. Hence, future work is needed to confirm that the observed effects indeed represent specific drug effects that are relevant to antipsychotic action.

      Conclusion:

      Overall, the results support the idea that antipsychotic drugs affect neural responses to predicted and unpredicted stimuli in deep layers of cortex. Although some future work is required to establish whether this observation can indeed be explained by a drug-specific effect on predictive coding, the study provides important insights into the neural underpinnings of visual processing and antipsychotic drugs, which is expected to guide future investigations on the predictive coding hypothesis of psychosis. This will be of broad interest to neuroscientists working on predictive coding in health and in disease.

      We thank the reviewer for the feedback and the help with improving the manuscript.

      Regarding the concern of a lack of a negative control, we have repeated the correlation measurement experiments in a cohort of Tlx3-Cre x Ai148 mice that received injections of saline. This analysis is now shown in Figure S6F-S6H. Saline injections did not change correlations in L5 IT neurons. Combined with the absence of changes in the L5 IT correlation structure following amphetamine injections (Figures 7G – 7I), this suggests that unspecific effects related to stress of injection, or simply time, cannot explain the observed decorrelation effect of the antipsychotic drugs.

      And we fully agree, a lot more work is needed to confirm that the observed effects are specific and relevant to antipsychotic action.

      Reviewer #3 (Public Review):

      The study examines how different cell types in various regions of the mouse dorsal cortex respond to visuomotor integration and how antipsychotic drugs impact these responses. Specifically, in contrast to most cell types, the authors found that activity in Layer 5 intratelencephalic neurons (Tlx3+) and Layer 6 neurons (Ntsr1+) differentiated between open loop and closed loop visuomotor conditions. Focussing on Layer 5 neurons, they found that the activity of these neurons also differentiated between negative and positive prediction errors during visuomotor integration. The authors further demonstrated that the antipsychotic drugs reduced the correlation of Layer 5 neuronal activity across regions of the cortex, and impaired the propagation of visuomotor mismatch responses (specifically, negative prediction errors) across Layer 5 neurons of the cortex, suggesting a decoupling of long-range cortical interactions.

      The data when taken as a whole demonstrate that visuomotor integration in deeper cortical layers is different than in superficial layers and is more susceptible to disruption by antipsychotics. Whilst it is already known that deep layers integrate information differently from superficial layers, this study provides more specific insight into these differences. Moreover, this study provides a first step into understanding the potential mechanism by which antipsychotics may exert their effect.

      Whilst the paper has several strengths, the robustness of its conclusions is limited by its questionable statistical analyses. A summary of the paper's strengths and weaknesses follows.

      Strengths:

      The authors perform an extensive investigation of how different cortical cell types (including Layer 2/3, 4, 5, and 6 excitatory neurons, as well as PV, VIP, and SST inhibitory interneurons) in different cortical areas (including primary and secondary visual areas as well as motor and premotor areas), respond to visuomotor integration. This investigation provides strong support to the idea that deep layer neurons are indeed unique in their computational properties. This large data set will be of considerable interest to neuroscientists interested in cortical processing.

      The authors also provide several lines of evidence that visuomotor information is differentially integrated in deep vs. superficial layers. They show that this is true across experimental paradigms of visuomotor processing (open loop, closed loop, mismatch, drifting grating conditions) and experimental manipulations, with the demonstration that Layer 5 visuomotor integration is more sensitive to disruption by the antipsychotic drug clozapine, compared with cortex as a whole.

      The study further uses multiple drugs (clozapine, aripiprazole and haloperidol) to bolster its conclusion that antipsychotic drugs disrupt correlated cortical activity in Layer 5 neurons, and further demonstrates that this disruption is specific to antipsychotics, as the psychostimulant amphetamine shows no such effect.

      In widefield calcium imaging experiments, the authors effectively control for the impact of hemodynamic occlusions in their results, and try to minimize this impact using a crystal skull preparation, which performs better than traditional glass windows. Moreover, they examine key findings in widefield calcium imaging experiments with two-photon imaging.

      Weaknesses:

      A critical weakness of the paper is its statistical analysis. The study does not use mice as its independent unit for statistical comparisons but rather relies on other definitions, without appropriate justification, which results in an inflation of sample sizes. For example, in Figure 1, independent samples are defined as locomotion onsets, leading to sample sizes of approx. 400-2000 despite only using 6 mice for the experiment. This is only justified if the data from locomotion onsets within a mouse is actually statistically independent, which the authors do not test for, and which seems unlikely. With such inflated sample sizes, it becomes more likely to find spurious differences between groups as significant. It also remains unclear how many locomotion onsets come from each mouse; the results could be dominated by a small subset of mice with the most locomotion onsets. The more disciplined approach to statistical analysis of the dataset is to average the data associated with locomotion onsets within a mouse, and then use the mouse as an independent unit for statistical comparison. A second example, for instance, is in Figure 2L, where the independent statistical unit is defined as cortical regions instead of mice, with the left and right hemispheres counting as independent samples; again this is not justified. Is the activity of cortical regions within a mouse and across cortical hemispheres really statistically independent? The problem is apparent throughout the manuscript and for each data set collected. An additional statistical issue is that it is unclear if the authors are correcting for the use of multiple statistical tests (as in for example Figure 1L and Figure 2B,D). In general, the use of statistics by the authors is not justified in the text.

      Finally, it is important to note that whilst the study demonstrates that antipsychotics may selectively impact visuomotor integration in L5 neurons, it does not show that this effect is necessary or sufficient for the action of antipsychotics; though this is likely beyond the scope of the study it is something for readers to keep in mind.

      We thank the reviewer for the feedback and the help with improving the manuscript.

      Regarding the concerns of statistical analysis, this may partially be a misunderstanding. We apologize for the lack of clarity. For example, the data in Figures 1F-1K is indeed shown as averaged over locomotion onsets, but there is no statistical analysis performed in these panels. The unit for the statistical analysis shown in Figure 1L is brain area (not locomotion onset). A central tenet of the analysis shown in Figures 1L and 2 is that the effect of differential activation during closed and open loop locomotion onsets is not specific to visual areas of cortex. In visual areas of cortex, one would expect to find a difference. In essence, the surprising finding here is the lack of a difference in other cell types but L5 IT neurons. Thus, in the analyses of those figure panels we are testing whether the effect is present on average across all cortical areas. Hence, we chose the statistical unit of Figure 1L to be cortical areas, not mice. We have added the same analysis with mice as a statistical unit as Figure S4J.

      Reviewer #1 (Recommendations For The Authors):

      I have a few concerns and questions that I would like to see addressed:

      1) Figure 1L - the statistics are a little unusual here as the errors are across visual areas rather than across mice or hemispheres. This isn't ideal as ideally, we want to generalize the results across animals, not areas, and the results seem to be driven mostly by V1/RSC. I would like to see comparisons using mice as the statistical unit either in an ANOVA with areas as factors or post-hoc comparisons per area.

      Based on the assumption that visual cortex should respond to visual stimuli, we would have expected to find a difference between closed and open loop locomotion onset responses in all cell types in visual areas of cortex (a closed loop locomotion onset being the combination of locomotion and visual flow onset, while an open loop locomotion onset lacks the visual flow component). Thus, the first surprise was that in most cell types we found very little difference between these two locomotion onset types. Conversely, in Tlx3-positive L5 IT neurons the difference was apparent well outside of the visual areas of cortex (even though the difference was indeed strongest in V1/RSC). To quantify the extent to which closed and open loop locomotion onsets result in different activity patterns across dorsal cortex we performed the analyses shown in Figures 1L and 2. To make the point that the effect was observable on average across cortical areas, we used cortical area as a unit in Figure 1L. We have added the analysis shown in Figure 1L with mice as the statistical unit as Figure S4J and have added the ANOVA information to Table S1, as suggested.

      2) The reduction of activity of L5 IT cells in V1 after the presentation of gratings is curious. The authors suggest it might have been due to one population of cells tuned for the orientation of the presented grating suppressing the remaining cells leading to an aggregate negative response. However, they also observed this negative response in the 2p signal for individual somata. Presumably in the 2p data they could check their hypothesis - is there a group of cells that were tuned for the grating? Is it possible that for some reason the L5 IT cells in the 2p were not being activated by the grating because of their RF locations? How large were the gratings - I didn't see this in the methods section?

      We can certainly identify neurons that selectively increase activity to one particular grating. See Author response image 1, for vertical and horizontal gratings. The gratings were presented full-field on a toroidal screen that surrounded the mouse (240 degrees horizontal and 100 degrees vertical coverage of the visual field). This covered a large fraction of the field of view of the mouse. While we did not map receptive fields of individual neurons in this study, it is unlikely that the receptive fields of the neurons recorded were outside the stimulated area. We have made this clearer in the manuscript.

      Author response image 1.

      The population L5 IT neuron response to full-field drifting grating stimuli was a decrease of activity, yet there were increasing responses in a subset of neurons. (A) Heatmap of responses of all L5 IT neuron somata recorded with two-photon imaging in 7 Tlx3-Cre x Ai148 mice to drifting gratings of vertical orientation, sorted by their response. Data were sorted on odd trials and plotted on even trials to avoid regression to the mean artifacts. Dashed black box marks the top 10% responsive neurons. The data are a subset of the data shown in Figure S3D. (B) As in A, but for responses to drifting gratings of horizontal orientation. (C) Responses of top 10% vertical grating responsive neurons (dashed black box in A) to vertical (orange) or horizontal gratings (green). Neurons were selected on odd trials, and the average response of even trials is shown. (D) As in A, but sorted to the response of horizontal drifting gratings. (E) As in D, but for the horizontal grating stimulus. (F) As in C, but for the top 10% horizontal grating responsive neurons.
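
      The odd/even split described in the caption can be sketched as follows. This is a minimal illustration in Python, not the analysis code used for the figure; the function name `split_half_top_fraction` and the data layout (one list of per-trial responses per neuron) are assumptions made for the example.

```python
from statistics import mean

def split_half_top_fraction(responses, fraction=0.1):
    """Select the most responsive neurons on odd trials and report their
    mean response on even trials.

    responses: list of per-neuron lists, one scalar response per trial
    (hypothetical layout; each neuron needs at least two trials).
    Trials are counted from 1, so list index 0 is an "odd" trial.
    """
    odd_means = [mean(trials[0::2]) for trials in responses]
    even_means = [mean(trials[1::2]) for trials in responses]
    n_top = max(1, int(round(fraction * len(responses))))
    # Rank neurons using the odd-trial half only (selection half).
    order = sorted(range(len(responses)),
                   key=lambda i: odd_means[i], reverse=True)
    top = order[:n_top]
    # Evaluate the selected neurons on the held-out even trials.
    held_out = mean([even_means[i] for i in top])
    return top, held_out
```

      Because selection and reporting use disjoint trials, noise that happened to push a neuron into the top group does not also inflate its reported response.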

      3) I would caution against over-interpretation of latencies from wide-field GCaMP activity (Figure 3). A weaker response in a smaller population of neurons that has the same latency as a strong response in a large population of neurons will appear to have different latencies when convolved with the GCaMP kernel. Also there doesn't appear to be any statistical support for different latencies in different cortical areas. Either this should be correctly treated (ideally with linear mixed effects models to account for the increased correlation within animals) or the latency conclusions should be removed from the manuscript (my recommendation).

      We suspect that by “latency conclusions” the reviewer means “latency analysis”. The only time we mention latency differences is to state that: “In C57BL/6 mice that expressed GCaMP brain wide, both visuomotor mismatch and grating stimuli resulted in increases of activity that were strongest and appeared first in visual regions of dorsal cortex (Figures 3A-3C).”

      Nevertheless, we agree with the reviewer that response latency and response amplitude are not independent in our measurements and have replaced the latency plots in Figures 3B, 3C, 3E and 3F with average response maps.
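
      The dependence between apparent latency and response amplitude can be illustrated with a toy simulation. The sketch below is an example under stated assumptions, not the analysis used in the manuscript: it convolves a step of activity with a hypothetical exponential indicator kernel (the time constant is arbitrary) and defines latency as the first crossing of a fixed threshold.

```python
import math

def convolve_with_kernel(signal, tau_samples):
    """Causal convolution with a normalized exponential decay kernel
    (a stand-in for indicator dynamics; tau is a hypothetical value)."""
    kernel = [math.exp(-k / tau_samples) for k in range(5 * tau_samples)]
    norm = sum(kernel)
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k in range(min(t + 1, len(kernel))):
            acc += signal[t - k] * kernel[k]
        out.append(acc / norm)
    return out

def threshold_latency(trace, onset, threshold):
    """Samples from the true onset to the first threshold crossing, or None."""
    for t in range(onset, len(trace)):
        if trace[t] >= threshold:
            return t - onset
    return None

# Two responses with the same true onset but different amplitudes.
onset, n = 50, 300
strong = [1.0 if t >= onset else 0.0 for t in range(n)]
weak = [0.3 if t >= onset else 0.0 for t in range(n)]

lat_strong = threshold_latency(convolve_with_kernel(strong, 20), onset, 0.1)
lat_weak = threshold_latency(convolve_with_kernel(weak, 20), onset, 0.1)
```

      In this toy setting the weaker response crosses the threshold later than the stronger one even though both begin at the same sample, which is the confound between latency and amplitude discussed above.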

      4) Given that the data is baseline corrected, is it possible that the effects of the anti-psychotic drugs on L5 IT cells were due to a change in the baseline activity of this population?

      While we do find a small increase in average activity as a result of antipsychotic drug injections (Author response image 2), these effects are much smaller than those on locomotion onset responses.

      Author response image 2.

      On average, activity was increased in dorsal cortex after administration of antipsychotic drugs. Average calcium activity over the entire recording session before (naïve) and after (antipsy.) the administration of antipsychotic drugs. Colored lines indicate paired data for individual mice (Blue: 5 mice that had received clozapine, green: 3 mice that had received aripiprazole, red: 3 mice that had received haloperidol).

      To illustrate that the clozapine induced change in locomotion related activity cannot be explained by baseline activity differences, we have replotted the responses shown in Figures 4D and 4E, S3B, S5F without baseline subtraction (Author response image 3).

      Author response image 3.

      Antipsychotic drug injection only modestly shifts the baseline before locomotion onsets. (A) Average response expressed as F/F0 (wherein F0 was defined as the median of a recording session) during closed (solid line, 1101 onsets) and open loop (dashed line, 348 onsets) locomotion onsets in 5 Tlx3-Cre x Ai148 mice that expressed GCaMP6 in L5 IT neurons. Shading indicates SEM over onsets. Dashed horizontal line marks a value of F/F0 of 1.005 for comparison with panel B. Underlying data were the same as in Figures 4D and 4E. (B) As in A, but after a single intraperitoneal injection of the drug clozapine and for 707 closed and 350 open loop locomotion onsets. (C) Average response expressed as F/F0 (wherein F0 was defined as the median of a recording session) of L5 soma in V1, recorded with two-photon imaging in 7 Tlx3-Cre x Ai148 mice that expressed GCaMP6 in L5 IT neurons, during either closed (solid) or open loop (dashed) locomotion onsets. Shading indicates SEM over 8434 neurons. Dashed horizontal line marks a value of F/F0 of 1.045 for comparison with panel D. Underlying data were the same as in Figure S3B. (D) As in C, but for the 3 Tlx3 x Ai148 mice that had received a single intraperitoneal injection of clozapine. Underlying data were from Figure S5F.
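
      The difference between plotting F/F0 directly and plotting baseline-subtracted responses can be sketched as follows. This is a minimal illustration, not the processing pipeline used for the figures; the function names and the window parameters `pre` and `post` are hypothetical.

```python
from statistics import mean, median

def f_over_f0(raw_trace):
    """Normalize raw fluorescence by the session median (F0)."""
    f0 = median(raw_trace)
    return [f / f0 for f in raw_trace]

def onset_response(trace, onset, pre, post, subtract_baseline):
    """Cut a window around one onset; optionally subtract the mean of the
    pre-onset samples (window lengths are hypothetical parameters)."""
    window = trace[onset - pre: onset + post]
    if subtract_baseline:
        baseline = mean(window[:pre])
        return [x - baseline for x in window]
    return window
```

      Without baseline subtraction, any shift in the pre-onset level remains visible in the plotted trace, so a drug-induced change in baseline can be compared directly against the size of the onset response.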

      5) Figure 5/Figure S6 - Do the results really reflect an effect of distance or are they driven by areas from different hemispheres? Does the result hold if they factor out the effect of hemisphere or calculate the results within hemisphere?

      The effect appears qualitatively unchanged when we exclude interhemispheric connections from the analysis (Author response image 4).

      Author response image 4.

      As in Figures 6D-6F, but with the exclusion of interhemispheric connections. The decorrelation effect appears qualitatively unchanged.

      Reviewer #2 (Recommendations For The Authors):

      In addition to my public review, I only have one statistics-related and a few minor editing suggestions for the abstract. I hope that these might help the authors to improve their manuscript.

      1) It seems that the researchers are combining observations across different subjects, as seen in Figure 1F-L as well as in all of the other figures. While this has been a common practice in their field, it is now widely recognized that this approach can result in biased statistical inferences since it violates the assumptions of most statistical tests (see this recent discussion: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7906290/). As such, it may be beneficial for the authors to consider utilizing statistical tests that are designed to accurately deal with hierarchical data sets, like linear mixed models or hierarchical bootstrap, to confirm their key results. Additionally or alternatively, presenting data grouped by subject would help demonstrate the consistency of their findings across subjects.

      Please note, in Figures 1F-1K, there are no statistical tests – but the data are indeed averaged over locomotion onsets across all mice. We could use hierarchical sampling to calculate a bootstrap estimate of the mean response curves and show those instead, but that is also not standard practice in the field. We suspect this is also not what the reviewer is suggesting. In Figure 1L, the unit is indeed brain areas (see also our response to comment 1 of reviewer 1), but it is not areas x mice (i.e., the analysis is not hierarchical).

      We have now added a supplementary panel (Figure S4J) that shows the data of Figure 1L with mouse as the statistical unit (note, this is also not hierarchical). We have replaced the statistical test with a bootstrap analysis, as the reviewer suggests. This information can be found in Table S1. In Figures 2B and 2D, we have replaced the statistical test with hierarchical bootstrap, and updated the corresponding information in Table S1.

      For Figure 3, in which we show mismatch and grating onset responses averaged using onsets as the base unit, we have added supplementary panels (Figure S2) that show the same analysis using mice as the statistical unit. This did not change any of the conclusions. Note, there was no statistical testing in Figure 3.

      For the decorrelation effect of the different antipsychotic drugs that we show in Figures 6 and 7 the statistical unit is mice x region pairs (that is, while the structure is hierarchical, all mice contribute the same number of pairs). Our data are underpowered to use hierarchical bootstrap for testing the drug effects individually. However, if we combine all antipsychotic drug data (clozapine, aripiprazole, and haloperidol) we reach the same conclusions with hierarchical bootstrap as with the statistical tests (t-test and rank-sum test) used in the paper (Author response image 5).

      Author response image 5.

      Hierarchical bootstrap of the combined distribution of correlation values shown in Figures 6F, 7C and 7F did not change the conclusion that administration of antipsychotic drugs reduces L5 IT neuron correlations. Statistical comparisons using hierarchical bootstrap: Short-range vs no change, p < 0.001; long-range vs no change, p < 0.001; short-range vs long-range, p < 0.05.
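
      For readers unfamiliar with the procedure, a hierarchical bootstrap with two resampling levels can be sketched as follows. This is a generic illustration under stated assumptions, not the code used for the analysis above: the data layout (a dict from mouse ID to a list of values, for example the change in correlation per region pair), the function names, and the one-sided summary are all hypothetical choices.

```python
import random
from statistics import mean

def hierarchical_bootstrap_means(data_by_mouse, n_boot=2000, seed=0):
    """Bootstrap distribution of the grand mean with two resampling levels:
    mice first, then observations within each resampled mouse.

    data_by_mouse: dict mapping a mouse ID to a list of values
    (hypothetical layout).
    """
    rng = random.Random(seed)
    mice = list(data_by_mouse)
    boot_means = []
    for _ in range(n_boot):
        sampled_values = []
        # Level 1: resample mice with replacement.
        for mouse in rng.choices(mice, k=len(mice)):
            values = data_by_mouse[mouse]
            # Level 2: resample observations within that mouse.
            sampled_values.extend(rng.choices(values, k=len(values)))
        boot_means.append(mean(sampled_values))
    return boot_means

def fraction_at_or_above(boot_means, reference=0.0):
    """One-sided bootstrap estimate: share of resampled means >= reference."""
    return sum(m >= reference for m in boot_means) / len(boot_means)
```

      Resampling mice at the first level means that the spread of the bootstrap distribution reflects between-animal variability, rather than treating every region pair as an independent sample.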

      2) Given the impressive amount of data, I found it sometimes a little difficult to follow the manuscript. The authors might want to consider including a high-level overview of their results and rationales at the end of the introduction, and start each Results subsection with a sentence referring back to that high-level overview ("To test whether X, we did Y and present it in this section.")

      We have attempted to improve the writing along these lines.

      3) Some suggestions that might further improve the clarity of writing.

      Abstract: Does the brain really distinguish between different "activity patterns", or would externally generated and self-generated "stimuli" be a slightly more accurate term to describe the observed alterations in schizophrenia?

      We would argue that (outside of sensory organs) the brain only has access to activity patterns, not stimuli directly. We would prefer to keep the phrasing with activity patterns here.

      Line 12: It might be easier to follow if the authors explicitly related that sentence back to the previous sentence "their ability to identify self-generated activity patterns" -> "their ability to distinguish between externally and self/internally generated ..."

      Absolutely correct – we have improved the writing here.

      Line 14: It remains unclear how visuomotor integration relates to the problem of distinguishing between self- and externally generated stimuli.

      We have attempted to expand on this in the abstract.

      Line 26: it remains unclear how the results support the activation of "internal representations" as this term has not been defined previously

      We have removed “internal representation” from the abstract.

      Results, line 80ff: I was confused by the description of all the different investigated cell types, as the first figure panels then only talk about brain wide and L5. Maybe the authors might find that shortening this with a reference to the methods might improve the flow.

      We have moved the list of cell types and mouse lines to the methods, as suggested.  

      Reviewer #3 (Recommendations For The Authors):

      The authors should strongly consider reassessing their statistics as outlined in the Public Review.

      Specifically:

      1) They should justify their definition of independent statistical unit; if this is not the mouse, they should justify why another definition (i.e. locomotion onset) is used, and show that their defined statistical unit achieves the requirements of being statistically independent (i.e. variance of the unit within a mouse is statistically indistinguishable from variance found between mice; more formally they could calculate the intraclass correlation (ICC)).

      We assume the reviewer is referring mainly to Figure 1 and therein to panel 1L.

      Since we did not perform statistical tests on the calcium traces, we are not sure why we would need to justify the choice of the unit we were showing. Moreover, Figure S2 shows the data of the V1 ROI averaged over mice to address this concern. As also mentioned to reviewer 2, we have amended Figure S2 to include the mouse-averaged traces of the V1 ROI data shown in main Figure 3.

      3) They should justify the statistical tests they use and whether they corrected for multiple comparisons; why for example was an ANOVA not used for Figure 1L and Figure 2B,D?

      We did not rely on ANOVA statistics for Figure 1L because we were mainly interested in carving out that Tlx3- (and Ntsr1-) positive mice inhabit a unique space when comparing the similarity of activity during closed and open loop locomotion onsets. We appreciate the reviewer taking a slightly different point of view on the data and now additionally report the ANOVA test result in Table S1. We have also opted to replace the statistical test in Figure 1L with bootstrapping. Lastly, we added Figure S4J which now shows the data in Figure 1L but with mice as the statistical unit.

      With similar logic, in Figure 2, we were not interested in comparing how the correlation of activity in cortical regions with locomotion behavior evolves over regions within a visuomotor feedback condition (closed loop, open loop or dark) but rather how a given region compares across feedback conditions.

      Still, we have opted to replace the statistical test in Figures 2B and 2D with hierarchical bootstrap, as also suggested by reviewer #2, comment 1. This did not change the significance indicator bars. We have accordingly updated Table S1 in which we report the full statistics.

    1. Author response:

      We were delighted by the reviewers' general comments. We thank the reviewers for their thoughtful reviews, constructive criticism, and analysis suggestions. We have carefully addressed each of their points during the revision of the manuscript.

      Unfortunately, after the paper was submitted to eLife, the first author, who ran all the analyses, left academia. We have now realized that we do not currently have sufficient resources to perform all the additional analyses requested by the reviewers.

      The following is the authors’ response to the original reviews:

      Public Reviews:

      Reviewer #1 (Public Review):

      This study uses MEG to test for a neural signature of the trial history effect known as 'serial dependence.' This is a behavioral phenomenon whereby stimuli are judged to be more similar than they really are, in feature space, to stimuli that were relevant in the recent past (i.e., the preceding trials). This attractive bias is prevalent across stimulus classes and modalities, but a neural source has been elusive. This topic has generated great interest in recent years, and I believe this study makes a unique contribution to the field. The paper is overall clear and compelling, and makes effective use of data visualizations to illustrate the findings. Below, I list several points where I believe further detail would be important to interpreting the results. I also make suggestions for additional analyses that I believe would enrich understanding but are inessential to the main conclusions.

      (1) In the introduction, I think the study motivation could be strengthened, to clarify the importance of identifying a neural signature here. It is clear that previous studies have focused mainly on behavior, and that the handful of neuroscience investigations have found only indirect signatures. But what would the type of signature being sought here tell us? How would it advance understanding of the underlying processes, the function of serial dependence, or the theoretical debates around the phenomenon?

      Thank you for pointing this out. Our MEG study was designed to address two questions: 1) we asked whether we could observe a direct neural signature of serial dependence, and 2) if so, whether this signature occurs at the encoding or post-encoding stage of stimulus processing in working memory. This second question directly concerns the current theoretical debate on serial dependence.

      Previous studies have found only indirect signatures of serial dependence such as reactivations of information from the previous trial or signatures of a repulsive bias, which were in contrast to the attractive bias in behavior. Thus, it remained unclear whether an attractive neural bias can be observed as a direct reflection of the behavioral bias. Moreover, previous studies observed the neuronal repulsion during early visual processes, leading to the proposal that neural signals become attracted only during later, post-encoding processes. However, these later processing stages were not directly accessible in previous studies. To address these two questions, we combined MEG recordings with an experimental paradigm with two items and a retro-cue. This design allowed us to record neural signals during separable encoding and post-encoding task phases and thus to pinpoint the task phase at which a direct neural signature of serial dependence occurred that mirrored the behavioral effect.

      We have slightly modified the Introduction to strengthen the study motivation.

      (1a) As one specific point of clarification, on p. 5, lines 91-92, a previous study (St. John-Saaltink et al.) is described as part of the current study motivation, stating that "as the current and previous orientations were either identical or orthogonal to each other, it remained unclear whether this neural bias reflected an attraction or repulsion in relation to the past." I think this statement could be more explicit as to why/how these previous findings are ambiguous. The St. John-Saaltink study stands as one of very few that may be considered to show evidence of an early attractive effect in neural activity, so it would help to clarify what sort of advance the current study represents beyond that.

      Thank you for this comment. In the study by St. John-Saaltink et al. (2016), two gratings oriented at 45° and 135° were always presented to either the left or right side of a central fixation point in a trial (90° orientation difference). As only the left/right position of the 45° and 135° gratings varied across trials, the target stimulus in the current trial was either the same or differed by exactly 90° from the previous trial. In consequence, this study could not distinguish whether the observed bias was attractive or repulsive, which concerned both the behavioral effect and the V1 signal. Furthermore, the bias in the V1 signal was partially explained by the orientation that was presented at the same position in the previous trial, which could reflect a reactivation of the previous orientation rather than an actual altered orientation.

      We have changed the Introduction accordingly.

      References:

      St. John-Saaltink E, Kok P, Lau HC, de Lange FP (2016) Serial Dependence in Perceptual Decisions Is Reflected in Activity Patterns in Primary Visual Cortex. Journal of Neuroscience 36: 6186–6192.

      (1b) The study motivation might also consider the findings of Ranieri et al (2022, J. Neurosci), Fornaciai, Togoli, & Bueti (2023, J. Neurosci), and Lou & Collins (2023, J. Neurosci), who all test various neural signatures of serial dependence.

      Thank you. As all listed findings showed neural signatures revealing a reactivation of the previous stimulus or a response during the current trial, we have added them to the paragraph in the Introduction referring to this class of evidence for the neural basis for serial dependence.

      (2) Regarding the methods and results, it would help if the initial description of the reconstruction approach, in the main text, gave more context about what data is going into reconstruction (e.g., which sensors), a more conceptual overview of what the 'reconstruction' entails, and what the fidelity metric indexes. To me, all of that is important to interpreting the figures and results. For instance, when I first read, it was unclear to me what it meant to "reconstruct the direction of S1 during the S2 epoch" (p. 10, line 199)? As in, I couldn't tell how the data/model knows which item it is reconstructing, as opposed to just reporting whatever directional information is present in the signal.

      (2a) Relatedly, what does "reconstruction strength" reflect in Figure 2a? Is this different than the fidelity metric? Does fidelity reflect the strength of the particular relevant direction, or does it just mean that there is a high level of any direction information in the signal? Please explain in the main text what reconstruction strength and fidelity are.

      Thank you for pointing this out. We applied the inverted encoding model method to MEG data from all active sensors (271) within defined time-windows of 100 ms length. MEG data was recorded in two sessions on different days. Specifically, we constructed an encoding model with 18 motion direction-selective channels. Each channel was designed to show peak sensitivity to a specific motion direction, with gradually decreasing sensitivity to less similar directions. In a training step, the encoding model was fitted to the MEG data of one session to obtain a weight matrix that indicates how well the sensor activity can be explained by the modeled direction. In the testing step, the weight matrix was inverted and applied to the MEG data of the other session, resulting in a response profile of ‘reconstruction strengths’, i.e., how strongly each motion direction was present in a trial. When a specific motion direction was present in the MEG signal, the reconstruction strengths peaked at that specific direction and decreased with increasing direction difference. If no information was present, reconstruction strengths were comparable across all modeled directions, i.e., the response profile was flat. To integrate response profiles across trials, single trial profiles were aligned to a common center direction (i.e., 180°) and then averaged.
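
      The training, inversion and re-centering steps described above can be sketched as follows. This is a simplified illustration on simulated data, not our analysis pipeline. Several details are assumptions made only for the example: the tuning function (a half-wave-rectified cosine raised to the 17th power), the restriction of stimulus directions to the 18 channel centers so that re-centering is a whole-channel shift, the noise level, and all function names.

```python
import numpy as np

N_CHANNELS = 18
N_SENSORS = 271
CENTERS = np.arange(N_CHANNELS) * 360.0 / N_CHANNELS  # 0, 20, ..., 340 deg

def channel_responses(directions_deg, power=17):
    """Idealized channel responses (trials x channels).

    Assumed tuning: half-wave-rectified cosine raised to `power`,
    peaking at each channel's preferred direction.
    """
    dirs = np.asarray(directions_deg, dtype=float)
    diff = np.deg2rad(dirs[:, None] - CENTERS[None, :])
    return np.maximum(np.cos(diff), 0.0) ** power

def train_iem(meg_train, dirs_train):
    """Least-squares weights (channels x sensors) from one session."""
    C = channel_responses(dirs_train)
    W, *_ = np.linalg.lstsq(C, meg_train, rcond=None)
    return W

def invert_iem(W, meg_test):
    """Reconstruction strengths (trials x channels) for the other session."""
    C_hat_T, *_ = np.linalg.lstsq(W.T, meg_test.T, rcond=None)
    return C_hat_T.T

def recenter(profiles, dirs_deg, center_deg=180.0):
    """Shift each trial's profile so the labeled direction sits at 180 deg."""
    step = 360.0 / N_CHANNELS
    shifts = np.rint((center_deg - np.asarray(dirs_deg)) / step).astype(int)
    return np.stack([np.roll(p, s) for p, s in zip(profiles, shifts)])

def simulate_session(rng, W_true, n_trials=360, noise=1.0):
    """Hypothetical session: sensor data = channel responses @ weights + noise."""
    dirs = rng.choice(CENTERS, size=n_trials)
    signal = channel_responses(dirs) @ W_true
    return signal + noise * rng.standard_normal(signal.shape), dirs

rng = np.random.default_rng(0)
W_true = rng.standard_normal((N_CHANNELS, N_SENSORS))
meg_a, dirs_a = simulate_session(rng, W_true)  # training session
meg_b, dirs_b = simulate_session(rng, W_true)  # independent test session

W = train_iem(meg_a, dirs_a)
mean_profile = recenter(invert_iem(W, meg_b), dirs_b).mean(axis=0)
peak_channel = int(np.argmax(mean_profile))  # index 9 corresponds to 180 deg
```

      Which stimulus the model "reconstructs" is determined entirely by the direction labels passed to `train_iem` and `recenter`: passing the S1 labels together with S2-epoch data corresponds to reconstructing S1 during the S2 epoch.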

      To quantify the accuracy of each IEM reconstruction, i.e., how well the response profile represents a specific motion direction relative to all other directions, we computed the ‘reconstruction fidelity’. Fidelity was obtained by projecting the polar vector of the reconstruction at every direction angle (in steps of 1°) onto the common center (180°) and averaging across all direction angles (Rademaker et al., 2019; Sprague, Ester & Serences, 2016). As such, ‘reconstruction fidelity’ is a summary metric with fidelity greater than zero indicating an accurate reconstruction.
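
      As a worked illustration of this metric, the sketch below computes fidelity as the mean of the reconstruction strength at each angle multiplied by the cosine of that angle's distance from the common center. The function name and the example profiles are hypothetical and serve only to show the behavior of the measure.

```python
import numpy as np

def reconstruction_fidelity(profile, angles_deg, center_deg=180.0):
    """Mean projection of the reconstruction onto the common center.

    Each angle contributes strength * cos(angle - center); a profile
    peaked at the center gives a positive value, a flat profile gives ~0.
    """
    theta = np.deg2rad(np.asarray(angles_deg, dtype=float) - center_deg)
    return float(np.mean(np.asarray(profile, dtype=float) * np.cos(theta)))

angles = np.arange(360)  # 1-degree steps

# Hypothetical profiles: peaked at 180 deg, flat, and peaked at 0 deg.
dist_to_center = (angles - 180 + 180) % 360 - 180
peaked = np.exp(-0.5 * (dist_to_center / 25.0) ** 2)
flat = np.ones(360)
opposite = np.roll(peaked, 180)

fid_peaked = reconstruction_fidelity(peaked, angles)
fid_flat = reconstruction_fidelity(flat, angles)
fid_opposite = reconstruction_fidelity(opposite, angles)
```

      A profile peaked at the opposite direction yields a negative value, which is why fidelity greater than zero is read as evidence for information about the labeled direction.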

      How does the model know which direction to reconstruct? Our modelling procedure was informed about the stimulus in question during both the training and the testing step. Specifically, we informed our model during the training step about, e.g., the current S2. Then, we fit the model to training data from the S2 epoch and applied it to testing data from the S2 epoch. Crucially, during the testing step the motion direction in question, i.e., current S2, becomes relevant again. For example, when S2 was 120°, the reconstructions were shifted by 60° in order to align with the common center, i.e., 180°. In addition, we also tested whether we could reconstruct the motion direction of S1 during the S2 epoch. Here, we again used the MEG data from the S2 epoch, but now trained on S1, i.e., the model was informed about the S1 direction. Accordingly, the recentering step during testing was done with regard to the S1 direction. Similarly, we also reconstructed the motion direction of the previous target (i.e., the previous S1 or S2), e.g., during the S2 epoch.

      Together, the multi-variate pattern of MEG activity across all sensors during the S2 epoch could contain information about the currently presented direction of S2, the direction of the preceding S1 and the direction of the target stimulus from the previous trial (i.e., either previous S1 or previous S2) at the same time. An important exception from this regime was the cross-reconstruction analysis (Appendix 1—figure 2). Here we trained the encoding model on the currently relevant item (S1 during the S1 epoch, S2 during the S2 epoch and the cued item during the retro-cue epoch) of one MEG session and reconstructed the previous target on the other MEG session.

      Finally, to examine shifts of the neural representation, single-trial reconstructions were assigned to two groups, those with a previous target that was oriented clockwise (CW) in relation to the currently relevant item and those with a previous target that was oriented counter-clockwise (CCW). The CCW reconstructions were flipped along the direction space, hence, a negative deviation of the maximum of the reconstruction from 180° indicated an attraction toward the previous target, whereas a positive deviation indicated a repulsion. Those reconstructions were then first averaged within each possible motion direction and then across them to account for different presentation numbers of the directions, resulting in one reconstruction per participant, epoch and time point. To examine systematic shifts, we then tested if the maximum of the reconstruction was systematically different from the common center (180°). For display purposes, we subtracted the reconstructed maximum from 180° to compute the direction shifts. A positive shift thus reflected attraction and a negative shift reflected repulsion.
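
      The flipping and sign convention described above can be illustrated with the following sketch. It is a simplified example on noiseless, hypothetical profiles. It assumes 1° resolution, assumes that after flipping the previous target lies on the side below 180° (as implied by the sign convention above), and omits the intermediate averaging within each motion direction. The function names are illustrative only.

```python
import numpy as np

ANGLES = np.arange(360)
CENTER = 180

def bump(peak_deg, width=20.0):
    """Hypothetical single-trial reconstruction peaked at `peak_deg`."""
    d = (ANGLES - peak_deg + 180) % 360 - 180
    return np.exp(-0.5 * (d / width) ** 2)

def flip_ccw(profiles, prev_is_ccw):
    """Mirror CCW trials around 180 deg so the previous target is on one side."""
    mirrored = profiles[:, (360 - ANGLES) % 360]
    return np.where(np.asarray(prev_is_ccw)[:, None], mirrored, profiles)

def neural_shift(profiles, prev_is_ccw):
    """Shift = 180 deg minus the peak of the averaged, flipped reconstruction.

    Positive values indicate attraction toward the previous target,
    negative values indicate repulsion.
    """
    mean_profile = flip_ccw(profiles, prev_is_ccw).mean(axis=0)
    return float(CENTER - ANGLES[np.argmax(mean_profile)])

# Simulated trials pulled 5 deg toward the previous target:
# CW trials peak at 175 deg, CCW trials peak at 185 deg.
prev_is_ccw = np.array([False, True] * 50)
profiles = np.stack([bump(185 if ccw else 175) for ccw in prev_is_ccw])

shift = neural_shift(profiles, prev_is_ccw)
```

      Without the flip, the opposite displacements in the CW and CCW groups would cancel when averaged, and the peak would sit at 180° regardless of any bias.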

      We have updated the Results accordingly.

      References:

      Rademaker RL, Chunharas C, Serences JT (2019) Coexisting representations of sensory and mnemonic information in human visual cortex. Nature Neuroscience. 22: 1336-1344.

      Sprague TC, Ester EF, Serences JT (2016) Restoring Latent Visual Working Memory Representations in Human Cortex. Neuron. 91: 694-707

      (3) Then in the Methods, it would help to provide further detail still about the IEM training/testing procedure. For instance, it's not entirely clear to me whether all the analyses use the same model (i.e., all trained on stimulus encoding) or whether each epoch and timepoint is trained on the corresponding epoch and timepoint from the other session. This speaks to whether the reconstructions reflect a shared stimulus code across different conditions vs. that stimulus information about various previous and current trial items can be extracted if the model is tailored accordingly.

      As reported above, our modeling procedure was informed about the same stimulus during both the training and the testing step, except for the cross-reconstruction analysis.

      Regarding the training and testing data, the model was always trained on data from one session and tested on data from the other session, so that each MEG session served once as the training data set and once as the test data set; hence, training and test data were independent. Importantly, training and testing were always performed in an epoch- and time point-specific way: For example, the model that was trained on the first 100-ms time bin from the S1 epoch of the first MEG session was tested on the first 100-ms time bin from the S1 epoch of the second MEG session.

      Specifically, when you say "aim of the reconstruction" (p. 31, line 699), does that simply mean the reconstruction was centered in that direction (that the same data would go into reconstructing S1 or S2 in a given epoch, and what would differentiate between them is whether the reconstruction was centered to the S1 or S2 direction value)?

      As reported above, during testing the reconstruction was centered at the currently relevant direction. The encoding model was trained with the direction labels of S1, S2 or the target item, corresponding to the currently relevant direction, i.e., S1 in S1 epochs, S2 in S2 epochs and target item (S1 or S2) in the retro-cue epoch. The only exception was the reconstruction of S1 during the S2 epoch. Here the encoding model was trained on the S1 direction, but with data from the S2 epoch and then applied to the S2 epoch data and recentered to the S1 direction. So here, S1 and S2 were indeed trained and tested separately for the same epoch.

      (4) I think training and testing were done separately for each epoch and timepoint, but this could have important implications for interpreting the results. Namely if the models are trained and tested on different time points, and reference directions, then some will be inherently noisier than others (e.g., delay period more so than encoding), and potentially more (or differently) susceptible to bias. For instance, the S1 and S2 epochs show no attractive bias, but they may also be based on more high-fidelity training sets (i.e., encoding), and therefore less susceptible to the bias that is evident in the retrocue epoch.

      Thanks for pointing this out. Training and testing were performed in an epoch- and time point-specific way. Thus, potential differences in the signal-to-noise ratio between different task phases could cause quality differences between the corresponding reconstructed MEG signals. However, we did not observe such differences. Instead, we found comparable time courses of the reconstruction fidelities and the averaged reconstruction strengths between epochs (Figure 2b and 2c, respectively). For example, Figure 2b shows that reconstruction fidelity for motion direction stimuli built up slowly during the stimulus presentation, reaching its maximum only after stimulus offset. This observation may contrast with stimulus materials that have faster build-ups, such as the orientation of a Gabor.

      We agree with the reviewer that, regardless of the comparable but not perfectly equal reconstruction fidelities, there are good arguments to assume that the neural representation of the stimulus during its encoding is typically less noisy than during its post-encoding processing and that this difference could be one of the reasons why serial dependence emerged in our study only during the retro-cue epoch. However, the argument could also be reversed: a biased representation, which represents a small and hard-to-detect neural effect, might be easier to observe for less noisy data. So, the fact that we found a significant bias only during the potentially “noisier” retro-cue epoch makes the effect even more noteworthy.

      We mentioned the limitation related to our stimulus material already at the end of the Discussion. We have now added a new paragraph to the Discussion to address the two opposing lines of reasoning.  

      (4) I believe the work would benefit from a further effort to reconcile these results with previous findings (i.e., those that showed repulsion, like Sheehan & Serences), potentially through additional analyses. The discussion attributes the difference in findings to the "combination of a retro-cue paradigm with the high temporal resolution of MEG," but it's unclear how that explains why various others observed repulsion (thought to happen quite early) that is not seen at any stage here. In my view, the temporal (as well as spatial) resolution of MEG could be further exploited here to better capture the early vs. late stages of processing. For instance, by separately examining earlier vs. later time points (instead of averaging across all of them), or by identifying and analyzing data in the sensors that might capture early vs. late stages of processing. Indeed, the S1 and S2 reconstructions show subtle repulsion, which might be magnified at earlier time points but then shift (toward attraction) at later time points, thereby counteracting any effect. Likewise, the S1 reconstruction becomes biased during the S2 epoch, consistent with previous observations that the SD effects grow across a WM delay. Maybe both S1 and S2 would show an attractive bias emerging during the later (delay) portion of their corresponding epoch? As is, the data nicely show that an attractive bias can be detected in the retrocue period activity, but they could still yield further specificity about when and where that bias emerges.

      We are grateful for this suggestion. Before going into detail, we would like to explain our motivation for choosing the present analysis approach that included averaging time points within an epoch of interest.

      Our aim was to detect a neuronal signature of serial dependence, which manifests as an attractive shift of about 3.5° within the 360° direction space. To detect such a small effect in the neural data, given the limited resolution of the reconstruction method and the noisy MEG signals, we needed to maximize the signal-to-noise ratio. A common way to achieve this is to average data points. In our study, we asked subjects to perform 1022 trials and down-sampled the MEG data from the recorded sampling rate of 1200 Hz to 10 Hz (one data point per 100 ms) for the estimation of reconstruction fidelity. We then calculated the final neural shift estimates by averaging the time points that showed a robust reconstruction fidelity and thus represented interpretable data points.

      Our procedure to maximize the signal-to-noise ratio was successful as we were able to reliably reconstruct the presented and remembered motion direction in all epochs (Figure 1a and 1b in the manuscript). However, the reconstruction did not work equally well for all time points within each epoch. In particular, there were time points with a non-significant reconstruction fidelity. In consequence, for the much smaller neural shift effect we did not expect to observe reliable time-resolved results, i.e., when considering each time point separately. Instead, we used the reconstruction results to define the time window in order to calculate the neural shift, i.e., we averaged across all time points with a significant reconstruction fidelity.

      Author response image 1 depicts the neural shift separately for each time point during the retro-cue epoch. Importantly, the gray parts of the time courses indicate time points where the reconstruction of the presented or cued stimulus was not significant. This means that the reconstructed maxima at those time points were very variable/unreliable and therefore the neural shifts were hardly interpretable.

      Author response image 1.

      Time courses of the reconstruction shift reveal a tendency for an attractive bias during the retro-cue phase. Time courses of the neural shift separately for each time point during the S1 (left panel), S2 (middle panel) and retro-cue epochs (right panel). Gray lines indicate time points with non-significant reconstruction fidelities and therefore very variable and non-interpretable neural reconstruction shifts. The colored parts of the lines correspond to the time periods of significant reconstruction fidelities with interpretable reconstruction shifts. Error bars indicate the middle 95% of the resampling distribution. Time points with less than 5% (equaling p < .05) of the resampling distribution below 0° are indicated by a colored circle. N = 10.

      First, the time courses in Author response image 1 show that the neural bias varied considerably between subjects at given time points, as revealed by the resampling distributions. In this resampling procedure, we drew 10 participants with replacement in each of 10,000 iterations and calculated the reconstruction shift based on the mean reconstruction of the resampled participants. The observed variability stresses the necessity to average the values across all time points that showed a significant reconstruction fidelity to increase the signal-to-noise ratio.
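
      A minimal sketch of this participant-level resampling for a single time point is given below. The participant profiles are simulated and hypothetical, and the function names are illustrative. Drawing the participant counts from a multinomial distribution is an implementation choice made here that is equivalent to drawing N participants with replacement.

```python
import numpy as np

ANGLES = np.arange(360)

def bump(peak_deg, width=20.0):
    """Hypothetical participant-level reconstruction peaked at `peak_deg`."""
    d = (ANGLES - peak_deg + 180) % 360 - 180
    return np.exp(-0.5 * (d / width) ** 2)

def resample_shifts(profiles, n_iter=10_000, center=180, rng=None):
    """Shift of the mean reconstruction for each resampled group.

    profiles: participants x 360 array of reconstructions centered at 180 deg.
    Each iteration draws as many participants as there are rows, with
    replacement, expressed as per-participant counts.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = profiles.shape[0]
    counts = rng.multinomial(n, np.full(n, 1.0 / n), size=n_iter)
    mean_profiles = counts @ profiles / n  # iterations x 360
    return center - ANGLES[np.argmax(mean_profiles, axis=1)]

rng = np.random.default_rng(0)
# Ten simulated participants with attractive shifts between 1 and 8 deg.
peaks = 180 - rng.uniform(1, 8, size=10)
profiles = np.stack([bump(p) for p in peaks])

shifts = resample_shifts(profiles, rng=rng)
low, high = np.percentile(shifts, [2.5, 97.5])  # middle 95% (error bars)
frac_below_zero = float(np.mean(shifts < 0))    # < 0.05 marks a time point
```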

      Second, despite this high variability/low signal-to-noise ratio, Author response image 1 (right panel) shows that our choice for this procedure was sensible as it revealed a clear tendency of an attractive shift at almost all time points from 300 to 1500 ms after retro-cue onset, with only a few individual time points showing a significant effect (uncorrected for multiple comparisons). It is worth mentioning that this time course did not overlap with the time course of previous target cross-reconstruction (Appendix 1—figure 2, right panel), as there was no significant target cross-reconstruction during the retro-cue epoch with an almost flat profile around zero. Also, there was no overlap with previous target decoding in the retro-cue epoch (Figure 5 in the manuscript). Here, the previous target was reactivated significantly only at early time points of 200 and 300 ms post cue onset (i.e., at time points with a non-significant reconstruction fidelity and therefore no interpretable neural shift), while the nominally highest values of the attractive neural shift were visible at later time points that also showed a significant reconstruction fidelity (Figure 2b in the manuscript).

      Third, Author response image 1 (left and middle panel) shows the time courses of the neural shift during the S1 and S2 epochs. While no neural shift could be observed for S1, during the S2 epoch the time-resolved analysis indicated an initial attractive shift followed by a (non-significant) tendency for a repulsive shift. After averaging neural shifts across time points with a significant reconstruction fidelity, there was no significant effect with an overall tendency for repulsion, as reported in the paper. The attractive part of the neural shift during the S2 epoch was nominally strongest at very early time points (at 100-300 ms after S2 onset) and overlapped perfectly with the reactivation of the previous target as shown by the cross-reconstruction analysis (Appendix 1—figure 2, middle panel). This overlap suggests that the neural attractive shift did not reflect an actual bias of the early S2 representation, but rather a consequence of the concurrent reactivation of the previous target in the same neural code as the current representation. Finally, this neural attractive shift during S2 presentation did not correlate with the behavioral error (single trial-wise correlation: no significant time points during S2 epoch) or the behavioral bias (subject-wise correlation). In contrast, for the retro-cue epoch, we observed a significant correlation between the neural attractive shift and behavior.

      Together, the time-resolved results show a clear tendency for an attractive neural bias during the retro-cue phase, thus supporting our interpretation that the attractive shift during the retro-cue phase reflects a direct neuronal signature of serial dependence. However, these additional analyses also demonstrated a large variability between participants and across time points, warranting a cautious interpretation. We conclude that our initial approach of averaging across time points was an appropriate way of reducing the high level of noise in the data and revealed the reported significant and robust attractive neural shift in the retro-cue phase.

      (5) A few other potentially interesting (but inessential) considerations: A benchmark property of serial dependence is its feature-specificity, in that the attractive bias occurs only between current and previous stimuli that are within a certain range of similarity to each other in feature space. I would be very curious to see if the neural reconstructions manifest this principle - for instance, if one were to plot the trialwise reconstruction deviation from 0, across the full space of current-previous trial distances, as in the behavioral data. Likewise, something that is not captured by the DoG fitting approach, but which this dataset may be in a position to inform, is the commonly observed (but little understood) repulsive effect that appears when current and previous stimuli are quite distinct from each other. As in, Figure 1b shows an attractive bias for direction differences around 30 degrees, but a repulsive one for differences around 170 degrees - is there a corresponding neural signature for this component of the behavior?

      We appreciate the reviewer's idea to split the data. However, given that our results strongly relied on the inclusion of all data points, i.e., including all distances in motion direction between the current S1, S2 or target and the previous target and requiring data averaging, we are concerned that our study was vastly underpowered to be able to inform whether the attractive bias occurs only within a certain range of inter-stimulus similarity. To address this important question, future studies would require neural measurements with much higher signal-to-noise-ratio than the present MEG recordings with two sessions per participant and 1022 trials in total.

      Reviewer #2 (Public Review):

      Summary:

      The study aims to probe the neural correlates of visual serial dependence - the phenomenon that estimates of a visual feature (here motion direction) are attracted towards the recent history of encoded and reported stimuli. The authors utilize an established retro-cue working memory task together with magnetoencephalography, which allows probing neural representations of motion direction during encoding and retrieval (retro-cue) periods of each trial. The main finding is that neural representations of motion direction are not systematically biased during the encoding of motion stimuli, but are attracted towards the motion direction of the previous trial's target during the retrieval (retro-cue) period, just prior to the behavioral response. By demonstrating a neural signature of attractive biases in working memory representations, which align with attractive behavioral biases, this study highlights the importance of post-encoding memory processes in visual serial dependence.

      Strengths:

      The main strength of the study is its elegant use of a retro-cue working memory task together with high temporal resolution MEG, which makes it possible to probe neural representations related to stimulus encoding and working memory. The behavioral task elicits robust behavioral serial dependence and replicates previous behavioral findings by the same research group. The careful neural decoding analysis benefits from a large number of trials per participant, considering the slow-paced nature of the working memory paradigm. This is crucial in a paradigm with considerable trial-by-trial behavioral variability (serial dependence biases are typically small, relative to the overall variability in response errors). While the current study is broadly consistent with previous studies showing that attractive biases in neural responses are absent during stimulus encoding (previous studies reported repulsive biases), to my knowledge it is the first study showing attractive biases in current stimulus representations during working memory. The study also connects to previous literature showing reactivations of previous stimulus representations, although the link between reactivations and biases remains somewhat vague in the current manuscript. Together, the study reveals an interesting avenue for future studies investigating the neural basis of visual serial dependence.

      Weaknesses:

      (1) The main weakness of the current manuscript is that the authors could have done more analyses to address the concern that their neural decoding results are driven by signals related to eye movements. The authors show that participants' gaze position systematically depended on the current stimuli's motion directions, which together with previous studies on eye movement-related confounds in neural decoding justifies such a concern. The authors seek to rule out this confound by showing that the consistency of stimulus-dependent gaze position does not correlate with (a) the neural reconstruction fidelity and (b) the repulsive shift in reconstructed motion direction. However, both of these controls do not directly address the concern. If I understand correctly the metric quantifying the consistency of stimulus-dependent gaze position (Figure S3a) only considers gaze angle and not gaze amplitude. Furthermore, it does not consider gaze position as a function of continuous motion direction, but instead treats motion directions as categorical variables. Therefore, assuming an eye movement confound, it is unclear whether the gaze consistency metric should strongly correlate with neural reconstruction fidelity, or whether there are other features of eye movements (e.g., amplitude differences across participants, and tuning of gaze in the continuous space of motion directions) which would impact the relationship with neural decoding. Moreover, it is unclear whether the consistency metric, which does not consider history dependencies in eye movements, should correlate with attractive history biases in neural decoding. 
      It would be more straightforward if the authors would attempt to (a) directly decode stimulus motion direction from x-y gaze coordinates and relate this decoding performance to neural reconstruction fidelity, and (b) investigate whether gaze coordinates themselves are history-dependent and are attracted to the average gaze position associated with the previous trials' target stimulus. If the authors could show that (b) is not the case, I would be much more convinced that their main finding is not driven by eye movement confounds.

      The reviewer is correct that our eye-movement analysis approach considered gaze angle (direction) and not gaze amplitude. We considered gaze direction to be the more important feature to control for when investigating the neural basis of serial dependence that manifests, given the stimulus material used in our study, as a shift/deviation of angle/direction of a representation towards the previous target motion direction. To relate gaze direction and MEG data directly, we matched the temporal resolution of the eye-tracking data to that of the MEG data. Specifically, our analysis procedure of gaze direction provided a measure of the extent to which the variance of the gaze directions was reduced compared with random gaze direction patterns, in relation to the specific stimulus direction within each 100 ms time bin. Importantly, this procedure was able to reveal not only systematic gaze directions that were in accordance with the stimulus direction or the opposite direction, but also picked up all stimulus-related gaze directions, even if the relation differed across participants or time.

      Our analysis approach was highly sensitive to detect stimulus-related gaze directions during all task phases (Appendix 1—figure 3). As expected, we found systematic gaze directions when S1 and S2 were presented on the screen, and they were reduced thereafter, indicating a clear relationship between stimulus presentation and eye movement. Systematic gaze directions were also present in the retro-cue phase where no motion direction was presented. Here they showed a clearly different temporal dynamic as compared to the S1 and S2 phases. They appeared at later time points and with a higher variability between participants, indicating that they coincided with retrieving the target motion direction from working memory.

      To relate gaze directions with MEG results, we calculated Spearman rank correlations. We found that there was no systematic relationship at any time point between the stimulus related reconstruction fidelity and the amount of stimulus-related gaze direction. Even more, the correlation varied strongly from time point to time point revealing its random nature. In addition to the lack of significant correlations, we observed clearly distinct temporal profiles for gaze direction (Appendix 1—figure 3a and Appendix 1—figure 3b) and the reconstruction fidelities (Figure 2b in the manuscript, Appendix 1—figure 3c), in particular in the critical retro-cue phase.
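
      The correlation step described above can be sketched as follows. This is a minimal illustration and not the authors' code: it assumes that each participant contributes one gaze-consistency value and one reconstruction-fidelity value per time bin, and the function and variable names are hypothetical.

```python
# Sketch: across-participant Spearman rank correlation in each time bin.
# Names and data layout are assumptions made for illustration only.

def ranks(values):
    """Return 1-based ranks; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def spearman_per_bin(gaze, fidelity):
    """gaze[p][t] and fidelity[p][t]: value for participant p in time bin t."""
    n_bins = len(gaze[0])
    return [
        spearman([row[t] for row in gaze], [row[t] for row in fidelity])
        for t in range(n_bins)
    ]
```

      With only 10 participants, each per-bin coefficient rests on 10 pairs of values, which is why the time course of the correlation is informative in addition to the individual tests.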

      We favored this analysis approach over one that directly decoded stimulus motion direction from x-y gaze coordinates, as we considered it hardly feasible to compute an inverted encoding model with only two eye-tracker channels as an input (in comparison to 271 MEG sensors), and to our knowledge, this has not been done before. Other decoding methods have previously been applied to x-y gaze coordinates. However, in contrast to the inverted encoding model, they did not provide a measure of the representation shift which would be crucial for our investigation of serial dependence.

      We appreciate the suggestion to conduct additional analyses on eye tracking data (including different temporal and spatial resolution and different features) and their relation to MEG data. However, the first author, who ran all the analyses, has in the meantime left academia. Unfortunately, we currently do not have sufficient resources to perform additional analyses.

      While the presented eye movement control analysis makes us confident that our MEG finding was not crucially driven by stimulus-related gaze directions, we agree with the reviewer that we cannot completely exclude that other eye movement-related features could have contributed to our MEG findings. However, we would like to stress that whatever the main source of the observed MEG effect was (shift of the neuronal stimulus representation, (other) features of gaze movement, or shift of the neuronal stimulus representation that leads to systematic gaze movement), our study still provided clear evidence that serial dependence emerged at a later post-encoding stage of object processing in working memory. This central finding of our study is hard to observe with behavioral measures alone and is not affected by the possible effects of eye movements.

      We have slightly modified our conclusion in the Results and Appendix 1. Please see also our response to comment 1 from reviewer 3.

      (2) I am not convinced by the across-participant correlation between attractive biases in neural representations and attractive behavioral biases in estimation reports. One would expect a correlation with the behavioral bias amplitude, which is not borne out. Instead, there is a correlation with behavioral bias width, but no explanation of how bias width should relate to the bias in neural representations. The authors could be more explicit in their arguments about how these metrics would be functionally related, and why there is no correlation with behavioral bias amplitude.

      We are grateful for this suggestion. We correlated the individual neuronal shift with the two individual parameter fits of the behavior shift, i.e., amplitude (a) and tuning width (w). We found a significant correlation between the individual neural bias and the w parameter (r = .70, p = .0246) but not with the a parameter (r = -.35, p = .3258) during the retro-cue period (Appendix 1—figure 1). This indicates that a broader tuning width of the individual bias (as reflected by a smaller w parameter) was associated with a stronger individual neural attraction.

      It is important to note that for the calculation of the neural shift, all trials entered the analysis to increase the signal-to-noise ratio, i.e., it included many trials where current and previous targets were separated by, e.g., 100° or more. These trials were unlikely to produce serial dependence. Subjects with a more broadly tuned serial dependence had more interitem differences that showed a behavioral attraction and therefore more trials affected by serial dependence that entered the calculation of the neural shift. In contrast, individual differences in the amplitude (a) parameter were most likely too small, and higher individual amplitude did not involve more trials as compared to smaller amplitude to affect the neural bias in a way to be observed in a significant correlation.

      We have added this explanation to Appendix 1.  
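
      The relation between the w parameter and tuning breadth discussed above can be illustrated with a short sketch. It assumes the first-derivative-of-Gaussian (DoG) parameterization that is common in the serial dependence literature, y = x · a · w · c · exp(-(w·x)²) with c = √2 / exp(-0.5); this section does not state the exact form the authors fitted, so the formula and the function names are assumptions.

```python
import math

# Assumed DoG parameterization (common in serial dependence studies):
#   bias(x) = x * a * w * c * exp(-(w * x)^2),  c = sqrt(2) / exp(-0.5)
# With this scaling, the peak of the curve has height a.
C = math.sqrt(2) / math.exp(-0.5)


def dog_bias(x, a, w):
    """Predicted bias (deg) for a previous-minus-current difference x (deg)."""
    return x * a * w * C * math.exp(-((w * x) ** 2))


def peak_location(w):
    """Difference (deg) at which the attractive bias is largest."""
    return 1.0 / (math.sqrt(2) * w)


# Hypothetical parameter values: a smaller w moves the peak outwards,
# so the attraction extends over a wider range of direction differences.
narrow_peak = peak_location(0.04)
broad_peak = peak_location(0.02)
```

      Under this parameterization a smaller w yields a broader curve, which is consistent with the statement above that broader tuning is reflected by a smaller w parameter.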

      (3) The sample size (n = 10) is definitely at the lower end of sample sizes in this field. The authors collected two sessions per participant, which partly alleviates the concern. However, given that serial dependencies can be very variable across participants, I believe that future studies should aim for larger sample sizes.

      We want to express our appreciation for raising this issue. We apologize that we did not explicitly explain and justify the choice of sample size in our paper, particularly as we had in fact performed a formal a-priori power analysis.

      At the time of the sample size calculation, there were no comparable EEG or MEG studies to inform our power calculation. Thus, we based our calculation merely on the behavioral effect reported in the literature and, in particular, observed in a behavioral study from our lab that included four different experiments with overall more than 100 participants with 1632 trials each (see Fischer et al., 2020), in which the behavioral serial dependence effect (target vs. nontarget) was very robust. Based on the contrast between target and non-target with an effect size of 1.359 in Experiment 1, a power analysis with 80% desired power led to a small, estimated sample size of 6 subjects.

      However, we expected that the detection of the neural signature of this effect would require more participants. Therefore, we based our power calculation on a much smaller behavioral effect, i.e. the modulation of serial dependence by the context-feature congruency that we observed in our previous study (Fischer et al., 2020). In particular, we focused on Experiment 1 of the previous study that used color as the feature for retro-cueing, as we planned to use exactly the same paradigm for the MEG study. In contrast to the serial dependence effect, its modulation by color resulted in a more conservative power estimate: Based on an effect size of 0.856 in that experiment, a sample size of n = 10 should yield a power of 80% with two MEG sessions per subject.

      At the time when we conducted our study, two other studies were published that investigated serial dependence on the neural level. Both studies included a smaller number of data points than our study: Sheehan & Serences (2022) recorded about 840 trials in each of 6 participants, resulting in fewer data points both on the participant and on the trial level. Hajonides et al. (2023) measured 20 participants with 400 trials each, again resulting in fewer datapoints than our study (10 participants with 1022 trials each). Taken together, our a-priori sample size estimation resulted in comparable if not higher power as compared to other similar studies, making us feel confident that the estimated sample was sufficient to yield reliable results.

      We have now included this description and the results of this power analysis in the Materials and Methods section.

      Despite this, we fully agree with the reviewer that our study would profit from higher power. With the knowledge of the results from this study, future projects should attempt to substantially increase the signal-to-noise ratio, in particular by increasing the number of trials, in order to observe, e.g., robust time-resolved effects (see our responses to reviewer 1).

      References:

      Fischer C, Czoschke S, Peters B, Rahm B, Kaiser J, Bledowski C (2020) Context information supports serial dependence of multiple visual objects across memory episodes. Nature Communications 11: 1932.

      Sheehan TC, Serences JT (2022) Attractive serial dependence overcomes repulsive neuronal adaptation. PLOS Biology 20: e3001711.

      Hajonides JE, Van Ede F, Stokes MG, Nobre AC, Myers NE (2023) Multiple and Dissociable Effects of Sensory History on Working-Memory Performance. Journal of Neuroscience 43: 2730–2740.

      (4) It would have been great to see an analysis in source space. As the authors mention in their introduction, different brain areas, such as PPC, mPFC, and dlPFC have been implicated in serial biases. This begs the question of which brain areas contribute to the serial dependencies observed in the current study. For instance, it would be interesting to see whether attractive shifts in current representations and pre-stimulus reactivations of previous stimuli are evident in the same or different brain areas.

      We appreciate this suggestion. As mentioned above, we currently do not have sufficient resources to perform a MEG source analysis.

      Reviewer #3 (Public Review):

      Summary:

      This study identifies the neural source of serial dependence in visual working memory, i.e., the phenomenon that recall from visual working memory is biased towards recently remembered but currently irrelevant stimuli. Whether this bias has a perceptual or postperceptual origin has been debated for years - the distinction is important because of its implications for the neural mechanism and ecological purpose of serial dependence. However, this is the first study to provide solid evidence based on human neuroimaging that identifies a post-perceptual memory maintenance stage as the source of the bias. The authors used multivariate pattern analysis of magnetoencephalography (MEG) data while observers remembered the direction of two moving dot stimuli. After one of the two stimuli was cued for recall, decoding of the cued motion direction re-emerged, but with a bias towards the motion direction cued on the previous trial. By contrast, decoding of the stimuli during the perceptual stage was not biased.

      Strengths:

      The strengths of the paper are its design, which uses a retrospective cue to clearly distinguish the perceptual/encoding stage from the post-perceptual/maintenance stage, and the rigour of the careful and well-powered analysis. The study benefits from high within-participant power through the use of sensitive MEG recordings (compared to the more common EEG), and the decoding and neural bias analysis are done with care and sophistication, with appropriate controls to rule out confounds.

      Weaknesses:

      A minor weakness of the study is the remaining (but slight) possibility of an eye movement confound. A control analysis shows that participants make systematic eye movements that are aligned with the remembered motion direction during both the encoding and maintenance phases of the task. The authors go some way to show that this eye gaze bias seems unrelated to the decoding of MEG data, but in my opinion do not rule it out conclusively. They merely show that the strengths of the gaze bias and the strength of MEG-based decoding/neural bias are uncorrelated across the 10 participants. Therefore, this argument seems to rest on a null result from an underpowered analysis.

      Both our MEG and our eye-movement analyses were sensitive enough to robustly pick up stimulus-related effects, both for presented and remembered motion directions. When relating both signals to each other by correlating MEG reconstruction strength with gaze direction, we found a null effect, as pointed out by the reviewer. Importantly, there was also a null effect when the shift of the reconstruction (representing our main finding) was correlated with gaze direction. Furthermore, an examination of the individual time courses of gaze direction and individual MEG reconstruction strength revealed that the lack of a relationship between MEG and gaze data did not rest on a singular observation but was present across all time points. Even more, the temporal profile of the correlation varied strongly from time point to time point, revealing its random nature and indicating that there was no hint of a pattern that just failed to reach significance. Taking these observations together, our MEG findings were unlikely to be explained by eye position.

      Nevertheless, we agree with the reviewer that there is a general problem of interpreting a null effect with a limited number of observations (and an analysis approach that focused on one out of many possible features of the gaze movement). Thus, we admit that there is a (slight) possibility that eye movements contributed to the observed MEG effects. This possibility, however, did not affect our novel finding that serial dependence occurred during the post-encoding stage of object processing in working memory.

      Please see also our response to point 1 from reviewer 2.

      Impact:

      This important study contributes to the debate on serial dependence with solid evidence that biased neural representations emerge only at a relatively late post-perceptual stage, in contrast to previous behavioural studies. This finding is of broad relevance to the study of working memory, perception, and decision-making by providing key experimental evidence favouring one class of computational models of how stimulus history affects the processing of the current environment.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Minor concerns:

      The significance statement opens "Our perception is biased towards sensory input from the recent past." This is a semantic point, but it seems a somewhat odd statement, given there is so much debate about whether serial dependence is perceptual vs. decisional, and that the current work indeed claims that it emerges at a late, post-encoding stage.

      Thank you for this point. We agree. “Visual cognition is biased towards sensory input from the recent past.” would be a more appropriate statement. According to the Journal's guidelines, however, the paragraph with the Significance Statement will not be included in the final manuscript.

      It would be preferable for data and code to be available at review so that reviewers might verify some procedural points for clarity.

      Code and preprocessed data used for the presented analyses are now available on OSF via http://osf.io/yjc93/. Due to storage limitations, only the preprocessed MEG data for the main IEM analyses focusing on the current direction are uploaded. For access to additional data, please contact the authors.

      For instance, I could use some clarification on the trial sequence. The methods first say the direction was selected randomly, but then later say each direction occurred equally often, and there were restrictions on the relationships between current and previous trial items. So it seems it couldn't have truly been random direction selection - was the order selected randomly from a predetermined set of possibilities?

      For the S1/S2 stimuli in a trial, the dots moved fully coherently in a direction randomly drawn from a pool of directions between 5° and 355° spaced 10° from one another, therefore avoiding cardinal directions. Across trials, there was a predetermined set of possible differences in motion direction between the current and the previous target. This set included 18 motion direction differences, ranging from -170° to 180°, in steps of 10°. Trial sequences were balanced in a way that each of these differences occurred equally often during a MEG session.

      I could also use some additional assurance the sample size (participants or data points) is sufficient for the analysis approach deployed here.

      We performed a formal a-priori power analysis to justify our choice of sample size. Please see our response to reviewer 2, point 3, where we explained the procedure of the a-priori power analysis in detail. We have now included this description and the results of this power analysis in the Materials and Methods.

      Did you consider a decoding approach, instead of reconstruction, to test what information predominates the signal, in an unbiased way?

      Thank you for this argument. We believe our analysis approach based on the inverted encoding model to be unbiased, since we first reconstructed whether the MEG signal contained information about the presented and remembered motion direction. Only in the next step did we test whether this reconstructed signal showed an offset and, if so, whether this offset was biased towards or away from the previous target. A decoding approach aims to answer classification questions and is not suitable to reveal the actual shifts of the neural information. In our study, we could decode, e.g., the current direction or the previous target, but this would not answer the question of whether and at which stage of object processing the current representation was biased towards the past. Moreover, in a decoding approach to reveal which information predominates in the signal, we would have to classify different options (e.g. current information vs previous), thereby biasing the possible set of results more than in our chosen analysis.

      I think the claim of a "direct" neural signature may come off as an overstatement when the spatial and temporal aspects of the attractive bias are still so coarsely specified here.

      Thank you for pointing this out. We agree that the term “direct neural signature” can be seen as an overstatement when it is interpreted to indicate a narrowly defined activity of a brain region (ideally via “direct” invasive recordings) that reflects serial dependence. Our definition of the term “direct” referred to the observation of an attractive shift in a neural representation of the current target motion direction item towards the previous target. This was in contrast to previous “indirect” evidence for the neural basis of serial dependence based on either repulsive shifts of neural representations that were opposite to the attractive bias in behavior or on a reactivation of previous information in the current trial without presenting evidence for the actual neural shift. With this definition in mind, we consider the title of our study a valid description of our findings.

      Reviewer #2 (Recommendations For The Authors):

      I was wondering why the authors chose a bootstrap test for their neural bias analysis instead of a permutation test, similar to the one they used for their behavioral analysis. As far as I know, bootstrap tests do not provide guaranteed type-1 error rate control. The procedure for the permutation test would be quite straightforward here, randomly permuting the sign of each participant's neural shift and recording the group-average shift in a permutation distribution. This test seems more adequate and more consistent with the behavioral analysis.

      Thank you for this comment. We adapted a resampling approach (bootstrapping) that was similar to that by Ester et al. (2020) who also investigated categorical biases and also applied a reconstruction method (Inverted Encoding Model) to assess significance of a bias of the reconstructed orientation against zero in a certain direction. The bootstrapping method relied on a) detecting an offset against zero and b) evaluating the robustness of the observed effect across participants. In contrast, a permutation approach, as suggested by the reviewer, assesses whether an empirical neural shift is more extreme than the permutation distribution. The permutation approach seems more suited to assess the magnitude of the shift which in our study was not a priority. Therefore, we reasoned that the bootstrapping for our inference statistics was better suited to assess the direction of the neural shift and its robustness across participants.

      We have added this additional information to the Materials and Methods:

      References:

      Ester EF, Sprague TC, Serences JT (2020) Categorical biases in human occipitoparietal cortex. Journal of Neuroscience 40:917–931.
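
      The two resampling schemes contrasted in the exchange above can be sketched side by side. This is an illustration under stated assumptions, not the authors' implementation: it takes one neural shift value per participant, enumerates all sign flips exactly as the reviewer describes, and uses a simple participant-level bootstrap as a stand-in for the procedure adapted from Ester et al. (2020), whose details are not given in this section. All function names are hypothetical.

```python
import itertools
import random


def sign_flip_p_value(shifts):
    """Exact one-sided sign-flip permutation test for a positive group mean.

    Every combination of sign flips is enumerated (2**n combinations), so
    with n = 10 participants the null distribution has 1024 entries.
    """
    n = len(shifts)
    observed = sum(shifts) / n
    at_least_as_large = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=n):
        permuted_mean = sum(s * v for s, v in zip(signs, shifts)) / n
        # Small tolerance so floating-point ties count as "at least as large".
        if permuted_mean >= observed - 1e-12:
            at_least_as_large += 1
        total += 1
    return at_least_as_large / total


def bootstrap_p_value(shifts, n_boot=10000, seed=0):
    """Proportion of bootstrap group means that are <= 0.

    Participants are resampled with replacement; the seed is fixed only to
    make this sketch reproducible.
    """
    rng = random.Random(seed)
    n = len(shifts)
    at_or_below_zero = 0
    for _ in range(n_boot):
        sample = [shifts[rng.randrange(n)] for _ in range(n)]
        if sum(sample) / n <= 0:
            at_or_below_zero += 1
    return at_or_below_zero / n_boot
```

      The sign-flip test builds its null distribution under the assumption that shifts are symmetric around zero, whereas the bootstrap asks how consistently the resampled group mean keeps its sign.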

      The manuscript could be improved by more clearly spelling how the training and testing data were labelled, particularly for the reactivation analyses. If I understood correctly, in the first reactivation analysis the authors train and test on current trial data, but label both training and testing data according to the previous trial's motion direction. In the second analysis, they label the training data according to the current motion direction, but label the testing data according to the previous motion direction. Is that correct?

      Yes, this is correct. Please see also our response to reviewer 1, point 2 and 3, for a detailed description.

      I was surprised to see that the shift in the reconstructed direction is about three times larger than the behavioral attraction bias. Would one not expect these to be comparable in magnitude? It would be helpful to address and discuss this in the discussion section.

      Thank you for pointing this out. We agree with the reviewer that as both measures provided an identical metric (angle degree), one would expect that their magnitudes should be directly comparable. However, we speculate that these magnitudes inform only about the direction of the bias and their significant difference from zero, thus they operate on different scales and are not directly comparable. For example, Hallenbeck et al. (2022) showed that fMRI-based reconstructed orientation bias and behavioral bias correlated on both individual and group level, despite strong magnitude differences. This is in line with our observation and supports the speculation that the magnitudes of neural and behavioral biases operate on different scales and, thus, are not directly comparable.

      We have updated the Discussion accordingly.

      References:

      Hallenbeck GE, Sprague TC, Rahmati M, Sreenivasan KK, Curtis CE (2022) Working memory representations in visual cortex mediate distraction effects. Nature Communications 12: 471.

      Reviewer #3 (Recommendations For The Authors):

      (1) It may be worth showing that the gaze bias towards the current/cued stimulus is not biased towards the previous target. One option might be to run the same analysis pipeline used for the MEG decoding but on the eye-tracking data. Another could be to remove all participants with significant gaze bias, but given the small sample size, this might not be feasible.

      We appreciate this suggestion. However, as mentioned above, we currently do not have sufficient resources to conduct additional analyses on the eye tracking data.

      (2) Minor typo: Figure 3c - bias should be 11.7º, not -11.7º.

      Corrected. Thank you!

      Note on data/code availability: The authors state that preprocessed data and analysis code will be made available on publication, but are not available yet.

      Code and preprocessed data used for the present analyses are now available on OSF via http://osf.io/yjc93/. Due to storage limitations, only the preprocessed MEG data for the main IEM analyses focusing on the current direction are uploaded. For access to additional data, please contact the authors.

    1. Author response:

      The following is the authors’ response to the original reviews.

      In addition to our responses to reviewer suggestions below, a minor bug in the calculation of CAIS was brought to our attention by a reader of our preprint. We have corrected this bug and rerun analyses, whose results became slightly stronger as noise was removed. While we were doing that, someone pointed out to us that our equations were almost the same as Kullback-Leibler divergence, which explains why our metric performed so well. We have made the numerically trivial (see before vs. after figure below) mathematical change to use Kullback-Leibler divergence instead, and now have a better story, with a solid basis in information theory, as to why CAIS works.

      Author response image 1.
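
      For readers unfamiliar with the quantity mentioned above, the Kullback-Leibler divergence between an observed and an expected codon usage distribution can be sketched as follows. This is the generic definition only; how CAIS weights or aggregates it across amino acids is not specified in this section, and the function name is hypothetical.

```python
import math


def kl_divergence(observed, expected):
    """Kullback-Leibler divergence D(observed || expected) in bits.

    Both arguments map outcomes (e.g. codons) to probabilities. Terms with
    observed probability 0 contribute nothing; expected probabilities must
    be positive wherever the observed probability is positive.
    """
    total = 0.0
    for outcome, p in observed.items():
        if p > 0:
            total += p * math.log2(p / expected[outcome])
    return total


# Toy example with two synonymous codons (made-up frequencies).
observed_usage = {"AAA": 0.3, "AAG": 0.7}
expected_usage = {"AAA": 0.4, "AAG": 0.6}
divergence = kl_divergence(observed_usage, expected_usage)
```

      The divergence is zero when observed usage equals the expectation and grows as usage departs from it, which is the sense in which it measures departure from a mutation-driven baseline.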

      Unfortunately, we discovered a second bug that caused our PIC correction code to fail to perform the needed correction for phylogenetic confounding. The previously reported correlation between CAIS (or ENC) and body mass no longer survives PIC-correction. We have therefore removed this analysis from the manuscript. Our story now stands more on the theoretical basis of CAIS and ENC, and less on post facto validation, than it previously did. We now also present CAIS and ENC on a more equal footing. ENC results are slightly stronger, while CAIS has the complementary advantage of correcting for amino acid frequencies.

      The work involved in these changes, as well as some of the responses to reviews below, justifies changing the second author into a co-first author, and adding an additional coauthor (Hanon McShea) who discovered the second bug.

      Reviewer #1 (Public Review): 

      In this manuscript, the authors propose a new codon adaptation metric, Codon Adaptation Index of Species (CAIS), which they present as an easily obtainable proxy for effective population size. To permit between-species comparisons, they control for both amino acid frequencies and genomic GC content, which distinguishes their approach from existing ones. Having confirmed that CAIS negatively correlates with vertebrate body mass, as would be expected if small-bodied species with larger effective populations experience more efficient selection on codon usage, they then examine the relationship between CAIS and intrinsic structural disorder in proteins. 

      The idea of a robust species-level measure of codon adaptation is interesting. If CAIS is indeed a reliable proxy for the effectiveness of selection, it could be useful to analyze species without reliable life history- or mutation rate data (which will apply to many of the genomes becoming available in the near future). 

      A key question is whether CAIS, in fact, measures adaptation at the codon level. Unfortunately, CAIS is only validated indirectly by confirming a negative correlation with body mass. As a result, the observations about structural disorder are difficult to evaluate. 

      As discussed in the preamble above, we have replaced the body mass validation with a stronger theoretical basis in information theory.

      A potential problem is that differences in GC between species are not independent of life history. Effective population size can drive compositional differences due to the effects of GC-biased gene conversion (gBGC). As noted by Galtier et al. (2018), genomic GC correlates negatively with body mass in mammals and birds. It would therefore be important to examine how gBGC might affect CAIS, and to what extent it could explain the relationship between CAIS and body mass. 

      Suppose that gBGC drives an increase in GC that is most pronounced at 3rd codon positions in high-recombination regions in small-bodied species. In this case, could observed codon usage depart more strongly from expectations calculated from overall genomic GC in small vertebrates compared to large ones? The authors also report that correcting for local intergenic GC was unsuccessful, based on the lack of a significant negative relationship with body mass (Figure 3D). In principle, this could also be consistent with local GC providing a relatively more appropriate baseline in regions with high recombination rates. Considering these scenarios would clarify what exactly CAIS is capturing. 

      Figure 3 (previously Supplementary Figures S5A and S5B) shows that CAIS is negligibly correlated with %GC (not robust to multiple comparisons correction), and ENC not at all. We believe this is evidence against the possibility brought up by the reviewer, i.e. that Ne might affect gBGC (and hence global %GC). This relationship, if present, could act as a confounding effect, but it is not present within our species dataset. 

      Note that we expect our genomic-GC-based codon usage expectations to reflect unchecked gBGC in an average genomic region, independently of whether that species has high or low Ne. Our working model is that non-selective forces, including gBGC as well as conventional mutation biases, vary among species, and that they, rather than selection, determine each species’ genome-wide %GC. By correcting for genome-wide %GC, CAIS and ENC correct for both mutation bias and gBGC, in order to isolate the effects of selection.

      This argument, based on an average genomic region, is vulnerable to gene-rich genomic regions having differentially higher recombination rates and hence GC-biased gene conversion. However, we do not see the expected positive correlation between |local GC - global GC| and CAIS (see new Figure 5), again suggesting that gene conversion strength is not a confounding factor acting on CAIS.
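      To make the idea of a genomic-GC-based expectation concrete, here is a minimal sketch. It assumes, as a simplification, that each nucleotide in a codon is drawn independently with probability %GC/2 for G or C and (1 - %GC)/2 for A or T, and that expectations are normalized among synonymous codons. The function name `expected_codon_freqs` is hypothetical and this is not the code from the manuscript's repository.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def expected_codon_freqs(gc: float) -> dict[str, float]:
    """Expected frequency of each sense codon among its synonyms, given GC content.

    Assumption: nucleotides are independent, with P(G) = P(C) = gc / 2 and
    P(A) = P(T) = (1 - gc) / 2. Stop codons are excluded.
    """
    if not 0.0 < gc < 1.0:
        raise ValueError("gc must be strictly between 0 and 1")
    nuc_prob = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}

    # Unnormalized probability of each sense codon.
    raw = {
        codon: nuc_prob[codon[0]] * nuc_prob[codon[1]] * nuc_prob[codon[2]]
        for codon, aa in CODON_TABLE.items()
        if aa != "*"
    }

    # Total probability mass per amino acid, used to normalize among synonyms.
    totals: dict[str, float] = {}
    for codon, p in raw.items():
        aa = CODON_TABLE[codon]
        totals[aa] = totals.get(aa, 0.0) + p

    return {codon: p / totals[CODON_TABLE[codon]] for codon, p in raw.items()}


if __name__ == "__main__":
    freqs = expected_codon_freqs(0.41)
    # For a two-codon amino acid such as lysine (AAA/AAG), the expected
    # share of the G-ending codon equals the GC content itself.
    print(round(freqs["AAG"], 3), round(freqs["AAA"], 3))
```

      Under these assumptions, the baseline for a two-codon amino acid whose synonyms differ only by A versus G at the third position reduces to the genomic GC fraction, which is why a shift in genome-wide %GC shifts the expectation for every such amino acid at once.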

      Given claims about "exquisitely adapted species", the case for using CAIS as a measure of codon adaptation would also be stronger if a relationship with gene expression could be demonstrated. RSCU is expected to be higher in highly expressed genes. Is there any evidence that the equivalent GC-controlled measure behaves similarly? 

      Correlations with gene expression are outside the scope of the current work, which is focused on producing and exploiting a single value of codon adaptation per species. It is indeed possible that our general approach of using Kullback-Leibler divergence to correct for genomic %GC could be useful in future work investigating differences among genes.  
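      As a generic illustration of the Kullback-Leibler divergence mentioned above, the sketch below compares an observed distribution over synonymous codons with an expected one. It is a textbook definition, not the exact CAIS or ENC formula from the manuscript; the function name `kl_divergence` and the example numbers are illustrative assumptions.

```python
import math


def kl_divergence(observed, expected):
    """Kullback-Leibler divergence D(observed || expected), in bits.

    Both arguments are sequences of probabilities over the same categories
    (for example, the synonymous codons of one amino acid), each summing to 1.
    Categories with zero observed probability contribute nothing.
    """
    if len(observed) != len(expected):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for o, e in zip(observed, expected):
        if o == 0:
            continue
        if e == 0:
            raise ValueError("expected probability is zero where observed is not")
        total += o * math.log2(o / e)
    return total


if __name__ == "__main__":
    # Hypothetical two-codon amino acid: equal observed usage against a
    # GC-based expectation that favors the second codon.
    print(kl_divergence([0.5, 0.5], [0.25, 0.75]))
```

      The divergence is zero when observed usage matches the GC-based expectation and grows as usage departs from it, in either direction.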

      The manuscript is overall easy to follow, though some additional context may be helpful for the general reader. A more detailed discussion of how this work compares to the approach taken by Galtier et al. (2018), which accounted for GC content and gBGC when examining codon preferences, would be appropriate, for example. In addition, it would have been useful to mention past work that has attempted to explicitly quantify selection on codon usage. 

      One key difference between our work and that of Galtier et al. 2018 is that our approach does not rely on identifying specific codon preferences as a function of species. Our approach might therefore be robust to scenarios where different genes have different codon preferences (see Gingold et al. 2014 https://doi.org/10.1016/j.cell.2014.08.011). At a high level, our results are in broad agreement with those of Galtier et al., 2018, who found that gBGC affected all animal species, regardless of Ne, and who, like us, found that the degree of selection on codon usage depended on Ne.

      Reviewer #2 (Public Review): 

      ## Summary 

      The goal of the authors in this study is to develop a more reliable approach for quantifying codon usage such that it is more comparable across species. Specifically, the authors wish to estimate the degree of adaptive codon usage, which is potentially a general proxy for the strength of selection at the molecular level. To this end, the authors created the Codon Adaptation Index for Species (CAIS) that controls for differences in amino acid usage and GC% across species. Using their new metric, the authors find a previously unobserved negative correlation between the overall adaptiveness of codon usage and body size across 118 vertebrates. As body size is negatively correlated with effective population size and thus the general strength of natural selection, the negative correlation between CAIS and body size is expected. The authors argue this was previously unobserved due to failures of other popular metrics such as Codon Adaptation Index (CAI) and the Effective Number of Codons (ENC) to adequately control for differences in amino acid usage and GC content across species. Most surprisingly, the authors also find a positive relationship between CAIS and the overall "disorderedness" of a species protein domains. As some of these results are unexpected, which is acknowledged by the authors, I think it would be particularly beneficial to work with some simulated datasets. I think CAIS has the potential to be a valuable tool for those interested in comparing codon adaptation across species in certain situations. However, I have certain theoretical concerns about CAIS as a direct proxy for the efficiency of selection $sN_e$ when the mutation bias changes across species.  

      ## Strengths 

      (1) I appreciate that the authors recognize the potential issues of comparing CAI when amino acid usage varies and correct for this in CAIS. I think this is sometimes an under-appreciated point in the codon usage literature, as CAI is a relative measure of codon usage bias (i.e. only considers synonyms). However, the strength of natural selection on codon usage can potentially vary across amino acids, such that comparing mean CAI between protein regions with different amino acid biases may result in spurious signals of statistical significance (see Cope et al. Biochimica et Biophysica Acta - Biomembranes 2018 for a clear example of this). 

      We now cite Cope et al. as an example of how amino acid composition can act as a confounding factor.

      (2) The authors present numerous analyses using both ENC and mean CAI as a comparison to CAIS, helping give a sense of how CAIS corrects for some of the issues with these other metrics. I also enjoyed that they examined the previously unobserved relationship between codon usage bias and body size, which has bugged me ever since I saw Kessler and Dean 2014. The result comparing protein disorder to CAIS was particularly interesting and unexpected. 

      Unfortunately, our previous PIC correction code was buggy, and in fact the relationship with body size does not survive PIC correction (although it is strong prior to PIC correction). We have therefore removed it from the paper. However, the more novel result on protein disorder remains strong.

      (3) The CAIS metric presented here is generally applicable to any species that has an annotated genome with protein-coding sequences. 

      ## Weaknesses 

      (1) The main weakness of this work is that it lacks simulated data to confirm that it works as expected. This would be particularly useful for assessing the relationship between CAIS and the overall effect of protein structure disorder, which the authors acknowledge is an unexpected result. I think simulations could also allow the authors to assess how their metric performs in situations where mutation bias and natural selection act in the same direction vs. opposite directions. Additionally, although I appreciate their comparisons to ENC and mean CAI, a comparison to other popular codon metrics for calculating the overall adaptiveness of a genome (e.g. dos Reis et al.'s $S$ statistic, which is a function of tRNA Adaptation Index (tAI) and ENC) may be more appropriate. Even if results are similar to $S$, CAIS has a noted advantage that it doesn't require identifying tRNA gene copy numbers or abundances, which I think are generally less readily available than genomic GC% and protein-coding sequences. 

      The main limitation of dos Reis’s test, in our view, is that, like the better versions of CAI, it requires comparable orthologs across species. See also the discussion below regarding the benefits of a proteome-wide approach. We now also note the advantage of not needing tRNA gene copy numbers and abundances. 

      Simulated datasets would be great, but we think it a nice addition rather than must-have, in particular because we are skeptical about whether our understanding of all relevant processes is good enough such that simulations would add much to our more heuristic argument along the lines of Figure 2. E.g. the complications of Gingold et al. 2014 cited above are pertinent, but incorporating them would make simulations quite involved. Instead, we now have a stronger theoretical justification for CAIS grounded in information theory. We have significantly expanded discussion of Figure 2 to give a clearer idea of the conceptual underpinnings of CAIS and ENC.

      The authors mention the selection-mutation-drift equilibrium model, which underlies the basic ideas of this work (e.g. higher $N_e$ results in stronger selection on codon usage), but a more in-depth framing of CAIS in terms of this model is not given. I think this could be valuable, particularly in addressing the question "are we really estimating what we think we're estimating?" 

      Let's take a closer look at the formulation for RSCUS. From here on out, subscripts will only be used to denote the codon and it will be assumed that we are only considering the case of r = genome for some species s.

      I think what the authors are attempting to do is "divide out" the effects of mutation bias (as given by $E_i$), such that only the effects of natural selection remain, i.e. deviations from the expected frequency based on mutation bias alone represent adaptive codon usage. Consider Gilchrist et al. MBE 2015, which says that the expected frequency of codon i at selection-mutation-drift equilibrium in gene g for an amino acid with Na synonymous codons is

      where $\Delta M$ is the mutation bias, $\Delta\eta$ is the strength of selection scaled by the strength of drift, and $\phi_g$ is the gene expression level of gene $g$. In this case, $\Delta M$ and $\Delta\eta$ reflect the strength and direction of mutation bias and natural selection relative to a reference codon, for which $\Delta M, \Delta\eta = 0$. Assuming the selection-mutation-drift equilibrium model is generally adequate to model the true codon usage patterns in a genome (as I do and I think the authors do, too), the $E_{i,g}$ could be considered the expected observed frequency of codon $i$ in gene $g$

      $E[O_{i,g}]$.

      Let’s re-write the $E_i$ in the form of Gilchrist et al., such that it is a function of mutation bias $\Delta M$. For simplicity we will consider just the two codon case and assume the amino acid sequence is fixed. Assuming GC% is at equilibrium, the terms $g_r$ and $1 - g_r$ can be written as

      where µx→y is the mutation rate from nucleotides x to y. As described in Gilchrist et al. MBE 2015 and Shah and Gilchrist PNAS 2011, the mutation bias .This can be expressed in terms of the equilibrium GC content by recognizing that

      As we are assuming the amino acid sequence is fixed, the probability of observing a synonymous codon i at an amino acid becomes just a Bernoulli process. 

      If we do this, then 

      Recall that in the Gilchrist et al. framework, the reference codon has $\Delta M_{NNG,NNG} = 0 \implies e^{-\Delta M_{NNG,NNG}} = 1$. Thus, we have recovered the Gilchrist et al. model from the formulation of $E_i$ under the assumption that natural selection has no impact on codon usage and codon NNG is the pre-defined reference codon. To see this, plug in 0 for $\Delta\eta$ in equation (1). 

      We can then calculate the expected RSCUS using equation (1) (using notation $E[O_i]$) and equation (6) for the two codon case. For simplicity, assume we are only considering a gene of average expression (defined as ). Assume in this case that NNG is the reference codon ($\Delta M_{NNG}, \Delta\eta_{NNG} = 0$).

      This shows that the expected value of RSCUS for a two-codon amino acid is expected to increase as the strength of selection $\Delta\eta$ increases, which is desired. Note that $\Delta\eta$ in Gilchrist et al. is formulated in terms of selection *against* a codon relative to the reference, such that a negative value represents that a codon is favored relative to the reference. If $\Delta\eta = 0$ (i.e. selection does not favor either codon), then $E[RSCUS] = 1$. Also note that the expected RSCUS does not remain independent of the mutation bias. This means that even if $sN_e$ (i.e. the strength of natural selection) does not change between species, changes to the strength and direction of mutation bias across species could impact RSCUS. Assuming my math is right, I think one needs to be cautious when interpreting CAIS as representative of the differences in the efficiency of selection across species except under very particular circumstances. One such case could be when it is known that mutation bias varies little across the species of interest. Looking at the species used in this manuscript, most of them have a GC content ranging around 0.41, so I suspect their results are okay. 
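      The two-codon argument can be sketched as follows. This is a reconstruction under stated assumptions rather than the original typesetting: equation (1) is assumed to take the standard form used by Gilchrist et al., RSCUS is assumed to be the ratio of observed to expected codon frequency, $\bar{\phi}$ is an assumed symbol for average expression, $g$ denotes equilibrium GC content, and NNA is used as the name of the non-reference codon.

```latex
% Assumed form of equation (1): codon i of an amino acid with N_a synonyms, in gene g
E[O_{i,g}] = \frac{e^{-\Delta M_i - \Delta\eta_i \phi_g}}
                  {\sum_{j=1}^{N_a} e^{-\Delta M_j - \Delta\eta_j \phi_g}}

% Two codons, NNG as reference (\Delta M_{NNG} = \Delta\eta_{NNG} = 0), average expression
E[O_{NNA}] = \frac{e^{-\Delta M_{NNA} - \Delta\eta_{NNA} \bar{\phi}}}
                  {1 + e^{-\Delta M_{NNA} - \Delta\eta_{NNA} \bar{\phi}}}

% Mutation-only expectation (\Delta\eta = 0), matching the GC-based baseline
% when e^{-\Delta M_{NNA}} = (1 - g) / g
E_{NNA} = \frac{e^{-\Delta M_{NNA}}}{1 + e^{-\Delta M_{NNA}}} = 1 - g

% Ratio of observed to expected frequency
E[\mathrm{RSCUS}_{NNA}] = \frac{E[O_{NNA}]}{E_{NNA}}
  = \frac{e^{-\Delta\eta_{NNA} \bar{\phi}} \left(1 + e^{-\Delta M_{NNA}}\right)}
         {1 + e^{-\Delta M_{NNA} - \Delta\eta_{NNA} \bar{\phi}}}
```

      Setting $\Delta\eta_{NNA} = 0$ gives a ratio of 1, and for any $\Delta\eta_{NNA} \neq 0$ the ratio still contains $\Delta M_{NNA}$, which is the dependence on mutation bias described above.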

      Although I have not done so, I am sure this could be extended to the 4 and 6 codon amino acids. 

      We thank Reviewer 2 for explicitly laying out the math that was implicit in our Figures 1 and 2. While we keep our more heuristic presentation, our revised manuscript now more clearly acknowledges that the per-site codon adaptation bias depicted in Figure 1 has limited sensitivity to s*Ne. The reason we believe our approach worked despite this is that we think the phenomenon is driven by what is shown in Figure 2. That is, where Ne makes a difference is by determining the proteome-wide fraction of codons subject to significant codon adaptation, rather than by determining the strength of codon adaptation at any particular site or gene. We have made multiple changes to the text to make this point clearer.

      Another minor weakness of this work is that although the method is generally applicable to any species with an annotated genome and the code is publicly available, the code itself contains hard-coded values for GC% and amino acid frequencies across the 118 vertebrates. The lack of a more flexible tool may make it difficult for less computationally-experienced researchers to take advantage of this method. 

      Genome-wide %GC values are hard-coded because they were taken from the previous study of James et al. (2023) https://doi.org/10.1093/molbev/msad073. As summarized in the manuscript, genome-wide %GC was a byproduct of a scan of all six reading frames across genic and intergenic sequences available from NCBI with access dates between May and July 2019. The more complicated code used to calculate the intergenic %GC and the code used to calculate amino acid frequencies are located at https://github.com/MaselLab/CodonAdaptation-Index-of-Species. Luckily, someone else just wrote a simpler end-to-end pipeline for us, on the basis of our preprint. We now note this in the Acknowledgements, and link to it: https://github.com/gavinmdouglas/handy_pop_gen/blob/main/CAIS.py.

    1. Author response:

      The following is the authors’ response to the original reviews.

      We thank the reviewers for their careful and overall positive evaluation of our work and the constructive feedback! To address the main concerns, we have:

      – Clarified a major misunderstanding of our instructions: Participants were only informed that they would receive different stimuli of medium intensity and were thus not aware that the stimulation temperature remained constant

      – Implemented a new analysis to evaluate how participants rated their expectation and pain levels in the control condition

      – Added a paragraph in the discussion in which we argue that our paradigm is comparable to previous studies

      Below, we provide responses to each of the reviewers’ comments on our manuscript.

      Reviewer #1 (Public Review):

      Summary:  

      In this important paper, the authors investigate the temporal dynamics of expectation of pain using a combined fMRI-EEG approach. More specifically, by modifying the expectations of higher or lower pain on a trial-to-trial basis, they report that expectations largely share the same set of activations before the administration of the painful stimulus, and that the coding of the valence of the stimulus is observed only after the nociceptive input has been presented. fMRI-informed EEG analysis suggested that the temporal sequence of information processing involved the Dorsolateral prefrontal cortex (DLPFC), the anterior insula, and the anterior cingulate cortex. The strength of evidence is convincing, and the methods are solid, but a few alternative interpretations about the findings related to the control group, as well as a more in-depth discussion on the correlations between the BOLD and EEG signals would strengthen the manuscript. 

      Thank you for your positive evaluation! In the revised version of the manuscript, we elaborated on the control condition and the BOLD-EEG correlations in more detail.

      Strengths:  

      In line with open science principles, the article presents the data and the results in a complete and transparent fashion. 

      From a theoretical standpoint, the authors make a step forward in our understanding of how expectations modulate pain by introducing a combination of spatial and temporal investigation. It is becoming increasingly clear that our appraisal of the world is dynamic, guided by previous experiences, and mapped on a combination of what we expect and what we get. New research methods, questions, and analyses are needed to capture these evolving processes.  

      Thank you very much for these positive comments!

      Weaknesses:  

      The control condition is not so straightforward. Across the manuscript it is defined as "no expectation", and in the legend of Figure 1 it is mentioned that the third state would be "no prediction". However, it is difficult to conceive that participants would not have any expectations or predictions. Indeed, in the description of the task it is mentioned that participants were instructed that they would receive stimuli during "intermediate sensitive states". The results of the pain scores and expectations might support the idea that the control condition is situated in between the placebo and nocebo conditions. However, since this control condition was not part of the initial conditioning, and participants had no reference to previous stimuli, one might expect that some ratings might have simply "regressed to the mean" for a lack of previous experience. 

      General considerations and reflections:  

      Inducing expectations in the desired direction is not a straightforward task, and results might depend on the exact experimental conditions and the comparison group. In this sense, the authors' choice of having 3 groups of positive, negative, and "neutral" expectations is to be praised. On the other hand, also control groups form their expectations, and this can constitute a confounder in every experiment using expectation manipulation, if not appropriately investigated. 

      Thank you for raising these important concerns! Firstly, as it seems that we did not explain the experimental procedure in a clear fashion, there appeared to be a general misunderstanding regarding our instructions. We want to emphasize that we did not tell participants that the stimulus intensity would always be the same, but that pain stimuli would be different temperatures of medium intensity. Furthermore, our instruction did not necessarily imply that our algorithm detected a state of medium sensitivity, but that the algorithm would not make any prediction, e.g., due to highly fluctuating states of pain sensitivity, or no clear-cut state of high or low pain sensitivity. We changed this in the Methods (ll. 556-560, 601-606, 612-614) and Results (ll. 181-192) sections of the manuscript to clarify these important features of our procedure.

      Then, we absolutely agree that participants explicitly and implicitly form expectations regarding all conditions over time, including the control condition. We carefully considered your feedback and rephrased the control condition, no longer framing it as eliciting “no expectations” but as “neutral expectations” in the revised version of the manuscript. This follows the more common phrasing in the literature and acknowledges that participants indeed build up expectations in the control condition. However, we do still think that we can meaningfully compare the placebo and nocebo condition to the control condition to investigate the neuronal underpinnings of expectation effects. Independently of whether participants build up an expectation of “medium” intensities in the control condition, which caused them to perceive stimuli in line with this expectation, or if they simply perceived the stimuli as they were (of medium intensity) with limited effects of expectations, the crucial difference to the placebo and nocebo conditions is that there was no alteration of perception due to previous experiences or verbal information and no shift of perception from the actual stimulus intensity towards any direction in the control condition. This allowed us to compare the neural basis of a modulation of pain perception in either direction to a condition in which this modulation did not take place. 

      Author response image 1.

      Variability within conditions over time. Relative variability index for expectation (left) and pain ratings (right) per condition and measurement block. 

      Lastly, we want to highlight that our finding of the control condition being rated in between the placebo and nocebo conditions is in line with many previous studies that included similar control conditions and advanced our understanding of pain-related expectations (Bingel et al., 2011; Colloca et al., 2010; Shih et al., 2019). We thank the reviewer for the very interesting idea to evaluate the development of ratings in the control condition in more detail. We added a new analysis to the manuscript in which we compared how much intra-subject variance there was in the ratings of each of the three conditions and how much this variance changed over time. To this end, we computed the relative variability index (Mestdagh et al., 2018), a measure that quantifies intra-subject variation over multiple ratings, and compared it between the three conditions and the three measurement blocks. We observed differences in variance between conditions for both expectation (F(2,96) = 8.14, p < .001) and pain ratings (F(2,96) = 3.41, p = .037). For both measures, post-hoc tests revealed that there was significantly more variance in the placebo than in the control condition (both p_holm < .05), but no difference between control and nocebo. The substantial and comparable variation in pain and expectation ratings in all three conditions (or at least between control and nocebo) shows that participants did not always expect and perceive the same intensity within conditions. Variance in expectation ratings decreased from the first block to the other two blocks (F(1.35,64.64) = 5.69, p = .012; both p_holm < .05), which was not the case for pain ratings. Most importantly, there was no interaction effect of block and condition for either expectation (F(2.65,127.06) = 0.40, p = .728) or pain ratings (F(4,192) = 0.48, p = .748), which implies that expectations were similarly dynamically updated in all conditions over the course of the experiment. This speaks against a “regression to the mean” in the control condition and shows that control ratings fluctuated from trial to trial. We included this analysis and a more in-depth discussion of the choice of conditions in the Results (ll. 219-232) and Discussion (ll. 452-486) sections of the revised manuscript.
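      The idea behind a relative variability measure can be sketched as follows. This is a simplified version, not the exact estimator of Mestdagh et al. (2018): it divides the population standard deviation by the largest standard deviation that a bounded scale allows for the observed mean, using the large-sample bound (mean - lower) * (upper - mean) on the variance. The function name and the 0-100 rating scale in the example are assumptions for illustration.

```python
import math
import statistics


def relative_variability(ratings, lower, upper):
    """Standard deviation relative to its maximum given the mean and scale bounds.

    Simplified sketch: the maximum population variance of values bounded by
    [lower, upper] with a given mean is (mean - lower) * (upper - mean).
    Returns NaN when the mean sits on a bound, where no variability is possible.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < lower or r > upper for r in ratings):
        raise ValueError("ratings must lie within the scale bounds")
    mean = statistics.fmean(ratings)
    max_variance = (mean - lower) * (upper - mean)
    if max_variance == 0:
        return float("nan")
    return math.sqrt(statistics.pvariance(ratings) / max_variance)


if __name__ == "__main__":
    # Hypothetical ratings on a 0-100 scale for one participant and condition.
    print(relative_variability([40, 60], lower=0, upper=100))
```

      Scaling by the attainable maximum matters because ratings near the ends of a bounded scale cannot vary as much as ratings near the middle, so raw variances are not directly comparable across conditions with different mean ratings.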

      In addition, although fMRI is still (probably) the best available tool we have to understand the spatial representation of cortical processing, limitations about not only the temporal but even the spatial resolution should be acknowledged. Given the anatomical and physiological complexity of the cortical connections, as we know from the animal world, it is still well possible that subcircuits are activated also for positive and negative expectations, but cannot be observed due to the limitation of our techniques. Indeed, on an empirical/evolutionary basis it would remain unclear why we should have a system that waits for the valence of a stimulus to show differential responses. 

      We agree that the spatial resolution of fMRI is limited and that our signal is often not able to dissociate different subcircuits. Whether on this basis differential processes occurred cannot be observed in fMRI but is indeed possible. We now include this reasoning in our Discussion (ll. 373-377):

      “Importantly, the spatial resolution of fMRI is limited when it comes to discriminating whether the same pattern of activity is due to identical activation or to activation in different sub-circuits within the same area. Nonetheless, the overlap of areas is an indicator for similar processes involved in a more general preparation process.”

      Also, moving in a dimension of network and graph theory, one would not expect single areas to be responsible for distinct processes, but rather that they would integrate information in a shared way, potentially with different feedback and feedforward communications. As such, it becomes more difficult to assume the insula is a center for coding potential pain, perhaps more of a node in a system that signals potential dangers for the integrity of the body. 

      We appreciate the feedback on our interpretation of our results and agree that the overall network activity most likely determines how a large part of expectations and pain are coded. We therefore adjusted the Discussion, embedding the results in an interpretation considering networks (ll. 427-430, 432-435, 438-442). 

      The authors analyze the EEG signal between 0.5 to 128 Hz, finding significant results in the correlation between single-trial BOLD and EEG activity in the higher gamma range (see Figure 6 panel C). It would be interesting to understand the rationale for including such high frequencies in the signal, and the interpretation of the significant correlation in the high gamma range. 

      On a technical level, we adapted our EEG processing pipeline from Hipp et al. (2011), who similarly investigated signals up to 128 Hz. Of note, the spectral smoothing was adjusted to match 3/4 octave, meaning that the frequency resolution at 128 Hz is rather broad and does not only contain oscillations at exactly 128 Hz. Gamma oscillations in general have repeatedly been reported in relation to pain and feedforward signals reflecting noxious information (e.g. Ploner et al., 2017; Strube et al., 2021). Strube et al. (2021) reported the highest effects of pain stimulus intensity and prediction error processing at high gamma frequencies (100 and 98 Hz, respectively). These findings could also serve as a basis to interpret our results in this frequency range: if anticipatory activation in the ACC is linked to high gamma oscillations, which appear to play an important role in feedforward signaling of pain intensity and prediction errors, this could indicate that later processing of intensity in this area is already pre-modulated before the stimulus actually occurs. Of note, although not significant, on a descriptive level the cluster appears to extend further into pain processing. We added additional explanation regarding the interpretation of the correlation in the Discussion (ll. 414-425):

      “The link between anticipatory activity in the ACC and EEG oscillatory activity was observed in the high gamma band, which is consistent with findings that demonstrate a connection between increased fMRI BOLD signals and a relative shift from lower to higher frequencies (Kilner et al., 2005). Gamma oscillations have been repeatedly reported in the context of pain and expectations and have been interpreted as reflecting feedforward signals of noxious information (e.g. Ploner et al., 2017; Strube et al., 2021). In combination with our findings, this might imply that high frequency oscillations may not only signal higher actual or perceived pain intensity during pain processing (Nickel et al., 2022; Ploner et al., 2017; Strube et al., 2021; Tu et al., 2016), but might also be instrumental in the transfer of directed expectations from anticipation into pain processing.”
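      To illustrate how broad a 3/4-octave band is at the top of the analysed range, the sketch below computes band edges for a given centre frequency. It assumes the band is symmetric around the centre on a logarithmic frequency axis, which may differ in detail from the smoothing actually used in the pipeline; the function name is hypothetical.

```python
def octave_band_edges(center_hz, bandwidth_octaves=0.75):
    """Lower and upper edge of a band centred on center_hz.

    Assumption: the band extends half the bandwidth (in octaves) below and
    above the centre, so upper / lower == 2 ** bandwidth_octaves.
    """
    half = bandwidth_octaves / 2
    return center_hz * 2 ** (-half), center_hz * 2 ** half


if __name__ == "__main__":
    low, high = octave_band_edges(128.0)
    print(f"{low:.1f} Hz to {high:.1f} Hz")
```

      Under this assumption, a band centred on 128 Hz spans roughly 99 to 166 Hz, so an effect reported at the 128 Hz bin overlaps the high gamma frequencies of about 100 Hz discussed above.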

      Reviewer #2 (Public Review):  

      I think this is a very promising paper. The combination of EEG and fMRI is unique and original. However, I also have some suggestions that I think could help improve the manuscript. 

      This manuscript reports the findings of an EEG-fMRI study (n = 50) on the effects of expectations on pain. The combination of EEG with fMRI is extremely original and well-suited to study the transition from expectation to perception. However, I think that the current treatment of the data, as well as the way that the manuscript is currently written, does not fully capitalize on the potential of this unique dataset. Several findings are presented but there is currently no clear message coming out of this manuscript. 

      First, one positive point is that the experimental manipulation clearly worked. However, it should be noted that the instructions used are not typical of studies on placebo/nocebo. Participants were not told that the stimulations would be of higher/lower intensity. Rather, they were told that objective intensities were held constant, but that EEG recordings could be used to predict whether they would perceive the stimulus as more or less intense. I think that this is an interesting way to manipulate expectations, but there could have been more justification in the introduction for why the authors have chosen this unusual procedure. 

      Most importantly, we want to emphasize again that participants were not aware that the stimulation temperature was always the same but were informed that they would receive different stimuli of medium intensity. We now clarify this in the revised Results (ll. 190-192) and Methods (ll. 612-614) sections.

      While we agree that our procedure was not typical, we think that the manipulation is nonetheless comparable to previous studies on pain-related expectations. To our knowledge, previous studies manipulate either expectations regarding a treatment that changes pain perception (treatment expectancy) or expectations regarding stimulus intensities (stimulus expectancy) (see Atlas & Wager, 2014). In our study, participants received a cue that induced expectations with regard to a “treatment”, although in this case the “treatment” came from changes in their own brain activity. This is comparable to studies using TENS devices that supposedly change peripheral pain transmission (Skvortsova et al., 2020). Thus, although not typical, our paradigm could be classified as targeting treatment expectancies and allowed us to examine effects on a trial-by-trial level within subjects. We added a paragraph regarding the comparability of our paradigm with previous studies in the Discussion of the revised manuscript (ll. 452-464).

      Also, the introduction mentions that little is known about potential cerebral differences between expectations of high vs. low pain expectations. I think the fear conditioning literature could be cited here. Activations in ACC, SMA, Ins, parahippocampal gyrus, PAG, etc. are often associated with upcoming threat, whereas activations vmPFC/default mode network are associated with safety. 

      Thank you for your suggestion to add literature on fear conditioning. We agree that there is some overlap between fear conditioning and expectation effects in humans, but we also believe there are fundamental differences regarding their underlying processes and paradigms. For example, expectation effects are not driven by classical learning algorithms but act to a large extent as self-fulfilling prophecies (see e.g. Jepma et al., 2018). However, we now acknowledge the similarities between the two, e.g., in the recruitment of the insula and the vmPFC, in our Introduction (ll. 132-136).

      The fact that the authors didn't observe a clearer distinction between high and low expectations here could be related to their specific instructions that imply that the stimulus is the same and that it is the subjective perception that is expected to change. In any case, this is a relatively minor issue that is easy to address. 

      We apologize again for the lack of clarity in our instructions: Participants were unaware that they would receive the exact same stimulus. The clear effects of the different conditions on expectation and pain ratings also challenge the notion that participants always expected the same level of stimulation and/or perception. Additionally, if participants were indeed expecting a consistent level of intensity in all conditions, one would also assume to see the same anticipatory activation in the control condition as in the placebo and nocebo conditions, which is not the case. Thus, we respectfully disagree that the common effects might be explained by our instructions but would argue that they indeed reflect common (anticipatory) processes of positive and negative expectations.

      Towards the end of the introduction, the authors present the aims of the study in mainly exploratory terms: 

      (1) What are the differences between anticipation and perception? 

      (2) What regions display a difference between high and low expectations (high > low or low < high) vs. an effect of expectation regardless of the direction (high and low different than neutral)? 

      I think these are good questions, but the authors should provide more justification, or framework, for these questions. More specifically, what will they be able to conclude based on their observations? 

      For instance (note that this is just an example to illustrate my point. I encourage the authors to come up with their own framework/predictions) : 

      (1) Possibility #1: A certain region encodes expectations in a directed fashion (high > low) and that same region also responds to perception in the same direction (high > low). This region would therefore modulate pain by assimilating perception towards expectations. 

      (2) Possibility # 2: different regions are involved in expectation and perception. Perhaps this could mean that certain regions influence pain processing through descending facilitation for instance...  

      Thank you for pointing out that our hypotheses were not crafted carefully enough. We have tried to give better explanations for the possible interpretations of our hypotheses. Additionally, we interpreted our results against the background of a broader framework for placebo and nocebo effects (predictive coding) to derive possible functions of the described brain areas. We embedded this in our Introduction (ll. 74-86, 158-175) and Discussion (ll. 384-388), interpreting the anticipatory activity and the activity during pain processing in the context of expectation formation as described in Büchel et al. (2014).

      Interpretation derived from our framework (ll. 384-388):

      e.g.: “Following the framework of predictive coding, our results would suggest that the DPMS is the network responsible for integrating ascending signals with descending signals in the pain domain and that this process is similar for positive and negative valences during anticipation of pain but differentiates during pain processing.”

      Regarding analyses, I think that examining the transition from expectations to perception is a strong angle of the manuscript given the EEG-fMRI nature of the study. However, I feel that more could have been done here. One problem is that the sequence of analyses starts by identifying an fMRI signal of interest and then attempts to find its EEG correlates. The problem is that the low temporal resolution of fMRI makes it difficult to differentiate expectation from perception, which doesn't make this analysis a good starting point in my opinion. Why not start by identifying an EEG signal that differentiates perception vs expectation, and then look for its fMRI correlates?

      We appreciate your feedback on the transition from expectations to perception and agree that additional questions could be answered with our data set. However, based on the literature we had specific hypotheses regarding specific brain areas. We therefore decided to start from the fMRI data, with its superior spatial resolution, and used EEG to focus on the temporal dynamics within the areas important for anticipatory processes. We share the view that many different approaches to analyzing our data are possible. However, identifying relevant areas based on EEG characteristics carries even more uncertainty due to the spatial filtering of the EEG signal. For the research question of this study, a more accurate evaluation of the involved areas and the related representation was more important. We therefore decided to retain the procedure already present in the manuscript.

      Finally, I found the hypotheses on "valenced" vs. "absolute" effects a little bit more difficult to follow. This is because "neutral" is not really neutral: it falls in between low and high. If I follow correctly, participants know that the temperature is always the same. Therefore, if they are told that the machine cannot predict whether their perception is going to be low or high, then it must be because it is likely to be in between. Ratings of expectation and pain ratings confirm that. The neutral condition is not "devoid" of expectations as the authors suggest.

      Therefore, it would make sense to look at regions with the following pattern low > neutral > high, or vice-versa, low < neutral < high. Low & high being different than neutral is more difficult to interpret. I don't think that you can say that it reflects "absolute" expectations because neutral is also the expectation of a medium temperature. Perhaps it reflects "certainty/uncertainty" or something like that, but it is not clear that it reflects "expectations". 

      Thank you for your valuable feedback! We considered your concerns about the interpretation of our results and completely agree that the control condition cannot be interpreted as void of expectations (ll. 119-123). We therefore evaluated the control condition in more detail in a separate analysis (ll. 219-232) and integrated a new assessment of the conditions into the Discussion (ll. 465-486). We changed the phrasing of our control condition to “neutral expectations”, as we agree that the control condition is not void of expectations and this phrasing is more in line with other studies (e.g. Colloca et al., 2010; Freeman et al., 2015; Schmid et al., 2015). We would argue that the neutral expectations can still be meaningfully compared to positive and negative expectations because only the latter shift expectations and perception in one direction. Thus, we changed our wording throughout the manuscript to acknowledge that we indeed did not test for general effects of expectations vs. no expectations, but for effects of directed expectations. Please also see our reasoning regarding the control condition in response to Reviewer 1, in which we addressed the interpretation of the control condition. We therefore still believe that the contrasts that we calculated between conditions are valid. The proposed new contrast largely overlaps with our differential contrast low>high and vice versa already reported in the manuscript (for additional results also see Supplements).
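      The overlap between the reviewer's ordered pattern (low < neutral < high) and the differential contrast can be illustrated with contrast weights. The following is a minimal sketch, not the authors' SPM specification: the weight vectors, the helper `apply_contrast`, and the condition means are all hypothetical. With three equally spaced levels, the standard linear-trend weights are (-1, 0, 1), which coincide with a high-minus-low differential contrast that leaves the neutral condition unweighted.

```python
def apply_contrast(weights, condition_means):
    """Return the weighted sum of condition means for one contrast vector."""
    return sum(w * m for w, m in zip(weights, condition_means))

# Conditions are ordered (low, neutral, high). All weights are illustrative
# assumptions, not the contrasts specified in the manuscript.
LINEAR_TREND = (-1, 0, 1)         # low < neutral < high
HIGH_MINUS_LOW = (-1, 0, 1)       # differential contrast, neutral unweighted
DIRECTED_VS_NEUTRAL = (1, -2, 1)  # low and high both differ from neutral

# Hypothetical mean responses of two regions (arbitrary units).
monotonic_region = (1.0, 2.0, 3.0)
u_shaped_region = (3.0, 1.0, 3.0)

for name, means in (("monotonic", monotonic_region), ("u-shaped", u_shaped_region)):
    print(name,
          apply_contrast(LINEAR_TREND, means),
          apply_contrast(DIRECTED_VS_NEUTRAL, means))
```

      In this toy example the monotonic region is picked up only by the linear (differential) weights, and the U-shaped region only by the directed-versus-neutral weights.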

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Figure 6, panel C. The figure mentions Anterior Cingulate Cortex R, whereas the legend mentions left ACC. Please check. 

      Thanks for catching this, we changed the figure legend accordingly.

      Reviewer #2 (Recommendations For The Authors):  

      - I don't think that activity during the rating of expectations is easily interpretable. I think I would recommend not reporting it. 

      The majority of participants completed the expectation rating relatively quickly (M = 2.17 s, SD = 0.35 s), which resulted in the overlap between the DLPFC EEG cluster and the expectation rating encompassing only a limited portion of the cluster (~ 1 s). We agree that this activity still is more difficult to interpret, yet we have decided to report it for reasons of completeness.

      - The effects on SIIPS are interesting. I think that it is fine to present them as a "validation" of what was observed with pain ratings, but it also seems to give a direction to the analyses that the authors don't end up following. For instance, why not try other "signatures" like the NPS or signatures of pain anticipation? Also, why not try to look at EEG correlates of SIIPS? I don't think that the authors "need" to do any of that, but I just wanted to let them know that SIIPS results may stir that kind of curiosity in the readers.  

      While this would be indeed very interesting, these additional analyses are not directly related to our current research question. We fear that too many analyses could be confusing for the readers. Nonetheless, we are grateful for your suggestion and will implement additional brain signatures in future studies. 

      - The shock was calibrated to be 60%. Why not have high (70%) and low (30%) conditions at equal distances from neutral, like 80% and 40% for instance? The current design makes it hard to distinguish high from control. Perhaps the "common" effects of high + low are driven by a deactivation for low (30%)?  

      We appreciate your feedback! We adjusted the temperature during the test phase to counteract the habituation that typically occurs with heat stimuli. We believe that this was a good measure, as participants rated the control condition at roughly VAS 50 (M = 51.40), which was our target and is equidistant from the VAS 70 and VAS 30 used during conditioning, when no habituation should have taken place yet. We further tested whether participants rated placebo and nocebo trials at equal distances from the control condition and found no bias toward either condition. To do this, we computed the individual placebo effect (control minus placebo) and nocebo effect (nocebo minus control) for each participant during the test phase and statistically compared their magnitudes. There was no significant difference between placebo and nocebo effects for either expectation ratings (placebo effect M = 14.25 vs. nocebo effect M = 17.22, t(49) = 1.92, p = .061) or pain ratings (placebo effect M = 6.52 vs. nocebo effect M = 5.40, t(49) = -1.11, p = .274). This suggests that our expectation manipulation resulted in comparable shifts in expectation and pain ratings away from the control condition for both the placebo and nocebo conditions and thus argues against any bias from the conditioning temperatures. Please also note that the analysis of the common effects was masked for differences between the high and low conditions; the effects therefore cannot be driven by one condition alone.
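      The comparison described above can be sketched as follows. This is a minimal illustration, assuming a paired t-test on the per-participant effects (consistent with the reported 49 degrees of freedom for 50 participants, but not confirmed by the response); the function names and the four-participant ratings are hypothetical.

```python
import math
import statistics

def individual_effects(control, placebo, nocebo):
    """Per-participant placebo (control - placebo) and nocebo (nocebo - control) effects."""
    placebo_effect = [c - p for c, p in zip(control, placebo)]
    nocebo_effect = [n - c for n, c in zip(nocebo, control)]
    return placebo_effect, nocebo_effect

def paired_t(a, b):
    """Paired t statistic for a - b and its degrees of freedom."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1

# Hypothetical mean ratings (VAS 0-100) for four participants.
control = [50, 52, 48, 51]
placebo = [40, 45, 44, 43]
nocebo = [60, 58, 55, 62]

placebo_effect, nocebo_effect = individual_effects(control, placebo, nocebo)
t_value, df = paired_t(nocebo_effect, placebo_effect)
print(f"t({df}) = {t_value:.2f}")
```

      A t value near zero indicates that the two effects are of similar magnitude, i.e., that ratings shifted about equally far from the control condition in both directions.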

      - If I understand correctly, all fMRI contrasts were thresholded with FWE. This is fine, but very strict. The authors could have opted for FDR. Maybe I missed something here....  

      While it is true that FDR is the more liberal approach, it is not valid for spatially correlated fMRI data and is no longer available in SPM for the correction of multiple comparisons. The newly implemented topological peak-based FDR correction is comparable in sensitivity to the FWE correction (see Chumbley et al. BELEG). We opted for the slightly more conservative approach in our preregistration (_p_FWE < .05); a change of the correction is therefore not possible.
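      To illustrate why FDR control is generally more liberal than FWE control, the sketch below compares a Bonferroni correction with the Benjamini-Hochberg step-up procedure on a handful of made-up p-values. Both are generic textbook procedures used here as stand-ins: the authors' analysis relies on SPM's FWE correction, and the topological FDR mentioned above operates on peaks rather than on independent tests.

```python
def bonferroni_reject(pvals, alpha=0.05):
    """Family-wise error control: test each p-value against alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def benjamini_hochberg_reject(pvals, alpha=0.05):
    """False discovery rate control via the Benjamini-Hochberg step-up rule."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / m * alpha:
            k_max = rank
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        reject[idx] = rank <= k_max
    return reject

# Made-up p-values for five tests.
pvals = [0.001, 0.012, 0.025, 0.039, 0.2]
n_fwe = sum(bonferroni_reject(pvals))
n_fdr = sum(benjamini_hochberg_reject(pvals))
print(n_fwe, n_fdr)
```

      On these toy values the Bonferroni threshold retains one test, whereas the step-up rule retains four.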

      Altogether, I think that this is a great study. The combination of EEG and fMRI is truly unique and affords many opportunities to examine the transition from expectations to perception. The experimental manipulation of expectations seems to have worked well, and there seem to be very promising results. However, I think that more could have been done. At least, I would recommend trying to give more of a theoretical framework to help interpret the results.  

      We are very grateful for your positive feedback. We took your suggestion seriously and tried to implement a more general framework from the literature (see Büchel et al., 2014) to provide a better explanation for our results.

      References

      Atlas, L. Y., & Wager, T. D. (2014). A meta-analysis of brain mechanisms of placebo analgesia: Consistent findings and unanswered questions. Handbook of Experimental Pharmacology, 225, 37–69. https://doi.org/10.1007/978-3-662-44519-8_3

      Bingel, U., Wanigasekera, V., Wiech, K., Ni Mhuircheartaigh, R., Lee, M. C., Ploner, M., & Tracey, I. (2011). The effect of treatment expectation on drug efficacy: Imaging the analgesic benefit of the opioid remifentanil. Science Translational Medicine, 3(70), 70ra14. https://doi.org/10.1126/scitranslmed.3001244

      Büchel, C., Geuter, S., Sprenger, C., & Eippert, F. (2014). Placebo analgesia: A predictive coding perspective. Neuron, 81(6), 1223–1239. https://doi.org/10.1016/j.neuron.2014.02.042

      Colloca, L., Petrovic, P., Wager, T. D., Ingvar, M., & Benedetti, F. (2010). How the number of learning trials affects placebo and nocebo responses. Pain, 151(2), 430–439. https://doi.org/10.1016/j.pain.2010.08.007

      Freeman, S., Yu, R., Egorova, N., Chen, X., Kirsch, I., Claggett, B., Kaptchuk, T. J., Gollub, R. L., & Kong, J. (2015). Distinct neural representations of placebo and nocebo effects. NeuroImage, 112, 197–207. https://doi.org/10.1016/j.neuroimage.2015.03.015

      Hipp, J. F., Engel, A. K., & Siegel, M. (2011). Oscillatory synchronization in large-scale cortical networks predicts perception. Neuron, 69(2), 387–396. https://doi.org/10.1016/j.neuron.2010.12.027

      Jepma, M., Koban, L., van Doorn, J., Jones, M., & Wager, T. D. (2018). Behavioural and neural evidence for self-reinforcing expectancy effects on pain. Nature Human Behaviour, 2(11), 838–855. https://doi.org/10.1038/s41562-018-0455-8

      Kilner, J. M., Mattout, J., Henson, R., & Friston, K. J. (2005). Hemodynamic correlates of EEG: A heuristic. NeuroImage, 28(1), 280–286. https://doi.org/10.1016/j.neuroimage.2005.06.008

      Nickel, M. M., Tiemann, L., Hohn, V. D., May, E. S., Gil Ávila, C., Eippert, F., & Ploner, M. (2022). Temporal-spectral signaling of sensory information and expectations in the cerebral processing of pain. Proceedings of the National Academy of Sciences of the United States of America, 119(1). https://doi.org/10.1073/pnas.2116616119

      Ploner, M., Sorg, C., & Gross, J. (2017). Brain Rhythms of Pain. Trends in Cognitive Sciences, 21(2), 100–110. https://doi.org/10.1016/j.tics.2016.12.001

      Schmid, J., Bingel, U., Ritter, C., Benson, S., Schedlowski, M., Gramsch, C., Forsting, M., & Elsenbruch, S. (2015). Neural underpinnings of nocebo hyperalgesia in visceral pain: A fMRI study in healthy volunteers. NeuroImage, 120, 114–122. https://doi.org/10.1016/j.neuroimage.2015.06.060

      Shih, Y.‑W., Tsai, H.‑Y., Lin, F.‑S., Lin, Y.‑H., Chiang, C.‑Y., Lu, Z.‑L., & Tseng, M.‑T. (2019). Effects of Positive and Negative Expectations on Human Pain Perception Engage Separate But Interrelated and Dependently Regulated Cerebral Mechanisms. Journal of Neuroscience, 39(7), 1261–1274. https://doi.org/10.1523/JNEUROSCI.2154-18.2018

      Skvortsova, A., Veldhuijzen, D. S., van Middendorp, H., Colloca, L., & Evers, A. W. M. (2020). Effects of Oxytocin on Placebo and Nocebo Effects in a Pain Conditioning Paradigm: A Randomized Controlled Trial. The Journal of Pain, 21(3-4), 430–439. https://doi.org/10.1016/j.jpain.2019.08.010

      Strube, A., Rose, M., Fazeli, S., & Büchel, C. (2021). The temporal and spectral characteristics of expectations and prediction errors in pain and thermoception. eLife, 10. https://doi.org/10.7554/eLife.62809

      Tu, Y., Zhang, Z., Tan, A., Peng, W., Hung, Y. S., Moayedi, M., Iannetti, G. D., & Hu, L. (2016). Alpha and gamma oscillation amplitudes synergistically predict the perception of forthcoming nociceptive stimuli. Human Brain Mapping, 37(2), 501–514. https://doi.org/10.1002/hbm.23048

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer 1:

      (1) The notion of a “root” causal gene - which the authors define based on a graph theoretic notion of topologically sorting graphs - requires a graph that is directed and acyclic. It is the latter that constitutes an important weakness here - it simply is a large simplification of human biology to draw out a DAG including hundreds of genes and a phenotype Y and to claim that the true graph contains no cycles.

      We agree that real causal graphs in biology often contain cycles. We now include additional experimental results with cyclic directed graphs in the Supplementary Materials. RCSP outperformed the other algorithms even in this setting, but we caution the reader that the theoretical interpretation of the RCS score may not coincide with a root causal effect when cycles exist:

      “We also evaluated the algorithms on directed graphs with cycles. We generated a linear SEM over p + 1 = 1000 variables in . We sampled the coefficient matrix β from a Bernoulli(1/(p − 1)) distribution but did not restrict the non-zero coefficients to the upper triangular portion of the matrix. We then proceeded to permute the variable ordering and weight each entry as in the Methods for the DAG. We repeated this procedure 30 times and report the results in Supplementary Figure 3.

      RCSP again outperformed all other algorithms even in the cyclic case. The results suggest that conditioning on the surrogate ancestors also estimates the RCS well even in the cyclic case. However, we caution that an error term E<sub>i</sub> can affect the ancestors of when cycles exist. As a result, the RCS may not isolate the causal effect of the error term and thus not truly coincide with the notion of a root causal effect in cyclic causal graphs.”
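      The difference between the acyclic and cyclic simulation settings can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' simulation code: it samples only the binary support of the coefficient matrix, assumes a zero diagonal (no self-loops), and omits the permutation and weighting steps. The names `sample_support` and `has_cycle` are hypothetical.

```python
import random

def sample_support(p, upper_triangular, rng):
    """Sample a binary support matrix; support[i][j] == 1 means an edge i -> j."""
    prob = 1.0 / (p - 1)
    support = [[0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            if i == j:
                continue  # assumption: no self-loops
            if upper_triangular and j < i:
                continue  # upper-triangular support guarantees a DAG
            if rng.random() < prob:
                support[i][j] = 1
    return support

def has_cycle(support):
    """Detect a directed cycle with Kahn's topological sorting algorithm."""
    p = len(support)
    indegree = [sum(support[i][j] for i in range(p)) for j in range(p)]
    ready = [j for j in range(p) if indegree[j] == 0]
    n_sorted = 0
    while ready:
        i = ready.pop()
        n_sorted += 1
        for j in range(p):
            if support[i][j]:
                indegree[j] -= 1
                if indegree[j] == 0:
                    ready.append(j)
    # Vertices that can never be sorted lie on or downstream of a cycle.
    return n_sorted < p

rng = random.Random(0)
dag_support = sample_support(30, upper_triangular=True, rng=rng)
free_support = sample_support(30, upper_triangular=False, rng=rng)
print(has_cycle(dag_support), has_cycle(free_support))
```

      Dropping the upper-triangular restriction permits cycles but does not guarantee one in every draw, so the second printed value depends on the random seed.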

      (2) I also encourage the authors to consider more carefully when graph structure learned from Perturb-seq can be ported over to bulk RNA-seq. Presumably this structure is not exactly correct - to what extent is the RCSP algorithm sensitive to false edges in this graph? This leap - from cell line to primary human cells - is also not modeled in the simulation. Although challenging - it would be ideal for the RCSP to model or reflect the challenges in correctly identifying the regulatory structure.

      We now include additional experimental results, where we gradually increased the incongruence between the DAG modeling the Perturb-seq and the DAG modeling the bulk RNA-seq using a mixture of graphs. The performance of RCSP degraded gradually, rather than abruptly, with increasing incongruence. We therefore conclude that RCSP is robust to differences between the causal graphs representing Perturb-seq and bulk RNA-seq:

      “We next assessed the performance of RCSP when the DAG underlying the Perturb-seq data differs from the DAG underlying the bulk RNA-seq data. We considered a mixture of two random DAGs in bulk RNA-seq, where one of the DAGs coincided with the Perturb-seq DAG and the second, alternate DAG did not. We instantiated and simulated samples from each DAG as per the previous subsection. We generated 0%, 25%, 50%, 75%, and 100% of the bulk RNA-seq samples from the alternate DAG, and the rest from the Perturb-seq DAG. We ideally would like to see the performance of RCSP degrade gracefully, as opposed to abruptly, as the percent of samples derived from the alternate DAG increases.

      We summarize results in Supplementary Figure 4. As expected, RCSP performed the best when we drew all samples from the same underlying DAG for Perturb-seq and bulk RNA-seq. However, the performance of RCSP also degraded slowly as the percent of samples increased from the alternate DAG. We conclude that RCSP can accommodate some differences between the underlying DAGs in Perturb-seq and bulk RNA-seq with only a mild degradation in performance.”
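      The mixture design described above can be sketched as an allocation of bulk RNA-seq samples to the two graphs. This is a minimal illustration, assuming exact (rounded) proportions rather than random mixture membership, which the response does not specify; the function name and labels are hypothetical, and the sample size of 200 follows the simulation settings reported elsewhere in the response.

```python
import random

def mixture_labels(n_samples, frac_alternate, rng):
    """Label each bulk sample by the DAG it would be simulated from."""
    n_alternate = round(n_samples * frac_alternate)
    labels = ["alternate"] * n_alternate + ["perturb_seq"] * (n_samples - n_alternate)
    rng.shuffle(labels)  # sample order carries no information about the source DAG
    return labels

rng = random.Random(0)
counts = {}
for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
    labels = mixture_labels(200, frac, rng)
    counts[frac] = labels.count("alternate")
print(counts)
```

      Each labelled sample would then be drawn from the SEM of its assigned DAG, so that the incongruence with the Perturb-seq graph grows with the alternate fraction.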

      (3) It should also be noted that in most Perturb-seq experiments, the entire genome is not perturbed, and frequently important TFs (that presumably are very far “upstream” and thus candidate “root” causal genes) are not expressed highly enough to be detected with scRNA-seq. In that context - perhaps slightly modifying the language regarding RCSP’s capabilities might be helpful for the manuscript - perhaps it would be better to describe it as an algorithm for causal discovery among a set of genes that were perturbed and measured, rather than a truly complete search for causal factors. Perhaps more broadly it would also benefit the manuscript to devote slightly more text to describing the kinds of scenarios where RCSP (and similar ideas) would be most appropriately applied - perhaps a well-powered, phenotype annotated Perturb-seq dataset performed in a disease relevant primary cell.

      We now clarify that Perturb-seq can only identify root causal genes among the perturbed set of genes in the Discussion:

      “Modern genome-wide Perturb-seq datasets also adequately perturb and measure only a few thousand, rather than all, gene expression levels. RCSP can only identify root causal genes within this perturbed and measured subset.”

      We now also describe the scenario where RCSP can identify root causal genes well in the Introduction:

      “Experiments demonstrate marked improvements in performance, when investigators have access to a large bulk RNA-seq dataset and a genome-wide Perturb-seq dataset from a cell line of a disease-relevant tissue.”

      Reviewer 2:

      (1) The process from health-to-disease is not linear most of the time with many checks along the way that aim to prevent the disease phenotype. This leads to a non-deterministic nature of the path from health-to-disease. In other words, with the same root gene perturbations, and depending on other factors outside of gene expression, someone may develop a phenotype in a year, another in 10 years and someone else never. Claiming that this information is included in the error terms might not be sufficient to address this issue. The authors should discuss this limitation.

      The proposed approach accommodates the above non-deterministic nature. The error terms model factors that are outside of gene expression. We model the relation from gene expression to Y as probabilistic rather than deterministic because , where E<sub>Y</sub> introduces stochasticity. Thus, two individuals with the same instantiations of the root causes may develop disease differently. We now clarify this in Methods:

      “The error terms model root causes that are outside of gene expression, such as genetic variation or environmental factors. Moreover, the relation from gene expression to Y is stochastic because , where E<sub>Y</sub> introduces the stochasticity. Two individuals may therefore have the exact same error term values over but different instantiations of Y.”

      (2) The paper assumes that the network connectivity will remain the same after perturbation. This is not always true due to backup mechanisms in the cells. For example, suppose that a cell wants to create product P and it can do it through two alternative paths: Path #1: A → B → P, Path #2: A → C → P. Now suppose that path #1 is more efficient, so when B can be produced, path #2 is inactive. Once the perturbation blocks element B from being produced, the graph connectivity changes by activation of path #2. I did not see the authors taking this into consideration, which seems to be a major limitation in using Perturb-seq results to infer connectivities.

      We agree that backup mechanisms can exist and therefore now include additional experimental results, where we gradually increased the incongruence between the DAG modeling the Perturb-seq and the DAG modeling the bulk RNA-seq using a mixture of graphs. The performance of RCSP degraded gradually, rather than abruptly, with increasing incongruence. We therefore conclude that RCSP is robust to differences between the causal graphs representing Perturb-seq and bulk RNA-seq:

      “We next assessed the performance of RCSP when the DAG underlying the Perturb-seq data differs from the DAG underlying the bulk RNA-seq data. We considered a mixture of two random DAGs in bulk RNA-seq, where one of the DAGs coincided with the Perturb-seq DAG and the second, alternate DAG did not. We generated 0%, 25%, 50%, 75%, and 100% of the bulk RNA-seq samples from the alternate DAG, and the rest from the Perturb-seq DAG. We ideally would like to see the performance of RCSP degrade gracefully, as opposed to abruptly, as the percent of samples derived from the alternate DAG increases.

      We summarize results in Supplementary Figure 4. As expected, RCSP performed the best when we drew all samples from the same underlying DAG for Perturb-seq and bulk RNA-seq. However, the performance of RCSP also degraded slowly as the percent of samples increased from the alternate DAG. We conclude that RCSP can accommodate some differences between the underlying DAGs in Perturb-seq and bulk RNA-seq with only a mild degradation in performance.”

      (3) There is substantial system heterogeneity that may cause the same phenotype. This goes beyond the authors claim that although the initial gene causes of a disease may differ from person to person, at some point they will all converge to changes in the same set of “root genes.” This is not true for many diseases, which are defined based on symptoms and lab tests at the patient level. You may have two completely different molecular pathologies that lead to the development of the same symptoms and test results. Breast cancer with its subtypes is a prime example of that. In theory, this issue could be addressed if there is infinite sample size. However, this assumption is largely violated in all existing biological datasets.

      The proposed method accommodates the above heterogeneity. We do not assume that the root causes affect the same set of root causal genes. Instead the root causes and root causal genes may vary from person to person. We write in the Introduction:

      “The problem is further complicated by the existence of complex disease, where a patient may have multiple root causal genes that differ from other patients even within the same diagnostic category... We thus also seek to identify patient-specific root causal genes in order to classify patients into meaningful biological subgroups each hopefully dictated by only a small group of genes.”

      The root causal genes may further affect different downstream genes at the patient-specific level. However root causal genes tend to have many downstream effects so that virtually every gene expression level becomes correlated with Y. We now clarify this by describing the omnigenic root causal model in the Introduction as follows:

      “Finally, application of the algorithm to two complex diseases with disparate pathogeneses recovers an omnigenic root causal model, where a small set of root causal genes drive pathogenesis but impact many downstream genes within each patient. As a result, nearly all gene expression levels are correlated with the diagnosis at the population level.”

      (4) Were the values of the synthetic variables Z-scored?

      Yes, all variables were z-scored. We now clarify this in Methods:

      “We also standardized all variables before running the regressions to prevent gaming of the marginal variances in causal discovery (Reisach et al., 2021; Ng et al., 2024).”
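      The standardization step can be sketched as column-wise z-scoring. This is a minimal illustration, not the authors' code: the function name is hypothetical, the toy matrix is made up, and the use of the population standard deviation (dividing by n) is an assumption.

```python
import statistics

def standardize_columns(rows):
    """Z-score each column (variable) of a samples-by-variables table."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    # Assumption: population standard deviation; a constant column would divide by zero.
    sds = [statistics.pstdev(col) for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

# Toy data: three samples, two variables on very different scales.
raw = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
z = standardize_columns(raw)
for col in zip(*z):
    print(round(statistics.mean(col), 6), round(statistics.pstdev(col), 6))
```

      After standardization every variable has unit variance, so an algorithm cannot exploit differences in marginal variance to infer the causal ordering.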

      (5) The algorithm seems to require both RNA-seq and Perturb-seq data (Algorithm 1, page 14). Can it function with RNA-seq data only? What will be different in this case?

      The algorithm cannot function with observational bulk RNA-seq data only. We included Perturb-seq because causal discovery with observational RNA-seq data alone tends to be inaccurate and unstable, as highlighted by the results of CausalCell. We further emphasize that we do not rely on d-separation faithfulness in Methods, which is typically required for causal discovery from observational data alone:

      “We can also claim the backward direction under d-separation faithfulness. We however avoid making this additional assumption because real biological data may not arise from distributions obeying d-separation faithfulness in practice.”

      (6) Synthetic data generation: how many different graphs (SEMs) did they start from? (30?) How many samples per graph? Did they test different sample sizes?

      We now clarify that we generate 30 random SEMs, each associated with a DAG. We used 200 samples for the bulk RNA-seq to mimic a relatively large but common sample size. We also drew 200 samples for each perturbation or control in the Perturb-seq data. We did not consider multiple sample sizes due to the time required to complete each run. Instead, we focused on a typical scenario where investigators would apply RCSP. We now write the following in the Methods:

      “We drew 200 samples for the bulk RNA-seq data to mimic a large but common dataset size. We introduced knockdown perturbations in Perturb-seq by subtracting an offset of two in the softplus function: . We finally drew 200 samples for the control and each perturbation condition to generate the Perturb-seq data. We repeated the above procedure 30 times.” We also include the following in Results:

      “We obtained 200 cell samples from each perturbation, and another 200 controls without perturbations. We therefore generated a total of 2501 × 200 = 500,200 single cell samples for each Perturb-seq dataset. We simulated 200 bulk RNA-seq samples.”

      (7) The presentation of comparative results (Supplementary Figures 4 and 7) is not clear. No details are given on how these results were generated. (what does it mean “The first column denotes the standard deviation of the outputs for each algorithm?”) Why all other methods have higher SD differences than RCSP? Is it a matter of scaling? Shouldn’t they have at least some values near zero since the authors “added the minimum value so that all histograms begin at zero?”

      Each of these supplementary figures contains a 6 by 3 table of figures. By the first column, we mean column one (with rows 1 through 6) of each figure. The D-RCS and D-SD scores represent standard deviations of the RCS and SD scores from zero of each gene, respectively. We can similarly compute the standard deviation of the outputs of the algorithms. We now clarify this in the Supplementary Materials:

      “The figure contains 6 rows and 3 columns. Similar to the D-RCS, we can compute the standard deviation of the output of each algorithm from zero for each gene. The first column in Supplementary Figure 7 denotes the histograms of these standard deviations across the genes.”

      Many histograms do not appear to start at zero because the bars are too small to be visible. We now clarify this in the Supplementary Materials as well:

      “Note that the bars at zero are not visible for many algorithms, since only a few genes attained standard deviations near the minimum.”

      (8) Why RCSP results are more like a negative binomial distribution and every other is kind of normal?

      All other methods have higher standard deviations than RCSP because they fail to compute an accurate measure of the root causal effect. Recall that, just like a machine has a few root causal problems, only a few root causal genes have large root causal effects under the omnigenic root causal model. The results of RCSP look more like a negative binomial distribution because most RCS scores are concentrated around zero and only a few RCS scores are large – consistent with the omnigenic root causal model. The other algorithms fail to properly control for the upstream genes and thus attain large standard deviations for nearly all genes. We now clarify these points in the Supplementary Materials as follows:

      “If an algorithm accurately identifies root causal genes, then it should only identify a few genes with large conditional root causal effects under the omnigenic root causal model. The RCSP algorithm had a histogram with large probability mass centered around zero with a long tail to the right. The standard deviations of the outputs of the other algorithms attained large values for nearly all genes. Incorporating feature selection and causal discovery with CausalCell introduced more outliers in the histogram of ANM. We conclude that only RCSP detected an omnigenic root causal model.”

      (9) What is the significance of genes changing expression “from left to right” in a UMAP plot? (e.g., Fig. 3h and 3g)

      The first UMAP dimension captured the variability of the RCS scores for most root causal genes. As a result, we could focus our analysis on the black cluster in Figure 3 (g) with large RCS scores in the subsequent pathway enrichment analysis summarized in Figure 3 (j). If two dimensions were involved, then we would need to analyze at least two clusters (e.g., black and pink), but this was not the case. We now clarify this in Results:

      “The RCS scores of most of the top genes exhibited a clear gradation increasing only from the left to the right hand side of the UMAP embedding; we plot an example in Figure 3 (h). We found three exceptions to this rule among the top 30 genes (example in Figure 3 (i) and see Supplementary Materials). RCSP thus detected genes with large RCS scores primarily in the black cluster of Figure 3 (g). Pathway enrichment analysis within this cluster alone yielded supra-significant results on the same pathway detected in the global analysis...”

      (10) The authors somewhat overstate the novelty of their algorithm. Representation of GRNs as causal graphs dates back to 2000 with the work of Nir Friedman in yeast. Other methods were developed more recently that look at regulatory network changes at the single sample level, of which the authors do not seem to be aware (e.g., Ellington et al, NeurIPS 2023 workshop GenBio and Bushur et al, 2019, Bioinformatics are two such examples). The methods they mention are for single cell data and they are not designed to connect single sample-level changes to a person’s phenotype. The RCS method needs to be put in the right background context in order to bring up what is really novel about it.

      We agree that many methods already exist for uncovering associational, predictive (Markov, neighborhood) and causal gene regulatory networks. We now cite the above papers. However, the novelty in our manuscript is not causal graph discovery, but rather estimation of root causal effects, detection of root causal genes, and the proposal of the omnigenic root causal model. We now clarify this in the Introduction:

      “Many algorithms focus on discovering associational or predictive relations, sometimes visually represented as gene regulatory networks (Costa et al., 2017; Ellington et al., 2023). Other methods even identify causal relations (Friedman et al., 2000; Wang et al., 2023; Wen et al., 2000; Buschur et al., 2000), but none pinpoint the first gene expression levels that ultimately generate the vast majority of pathogenesis. Simply learning a causal graph does not resolve the issue because causal graphs do not summarize the effects of unobserved root causes, such as unmeasured environmental changes or variants, that are needed to identify all root causal genes. We therefore define the Root Causal Strength (RCS) score...”

      Reviewer 3:

      (1) Several assumptions of the method are problematic. The most concerning is that the observational expression changes are all causally upstream of disease. There is work using Mendelian randomization (MR) showing that the opposite is more likely to be true: most differential expression in disease cohorts is a consequence rather than a cause of disease (Porcu et al., 2021). Indeed, the oxidative stress of AMD has known cellular responses including the upregulation of p53. The authors need to think carefully about how this impacts their framework. Can the theory say anything in this light? Simulations could also be designed to address robustness.

      Strictly speaking, we believe that differential expression in disease most likely has a cyclic causal structure: gene expression causes a diagnosis or symptom severity, and a diagnosis or symptom severity leads to treatments and other behavioral changes that perturb gene expression. For example, revTWMR in Porcu et al. (2021) uses trans-variants that are less likely to directly cause gene expression and instead directly cause a phenotype. However, TWMR as proposed in Porcu et al. (2019) instead uses cis-eQTLs and finds many putative causal relations from gene expression to phenotype. Thus, both causal directions likely hold.

      RCSP uses disease-relevant tissue believed to harbor gene expression levels that cause disease. However, RCSP theoretically cannot handle the scenario where Y is a non-sink vertex and is a parent of a gene expression level because modern Perturb-seq datasets usually do not perturb or measure Y. We therefore empirically investigated the degree of error by running experiments, where we set Y to a non-sink vertex, so that it can cause gene expression. We find that the performance of RCSP degrades considerably for gene expression levels that contain Y as a parent. Thus RCSP is sensitive to violations of the sink target assumption:

      “We finally considered the scenario where Y is a non-sink (or non-terminal) vertex. If Y is a parent of a gene expression level, then we cannot properly condition on the parents because modern Perturb-seq datasets usually do not intervene on Y or measure Y. We therefore empirically investigated the degradation in performance resulting from a non-sink target Y, in particular for gene expression levels where Y is a parent. We again simulated 200 samples from bulk RNA-seq and each condition of Perturb-seq with a DAG over 1000 vertices, an expected neighborhood size of 2 and a non-sink target Y. We then removed the outgoing edges from Y and resampled the DAG with a sink target. We compare the results of RCSP for both DAGs in gene expression levels where Y is a parent. We plot the results in Supplementary Figure 5. As expected, we observe a degradation in performance when Y is not terminal, where the mean RMSE increased from 0.045 to 0.342. We conclude that RCSP is sensitive to violations of the sink target assumption.”

      (2) A closely related issue is the DAG assumption of no cycles. This assumption is brought to bear because it is required for much classical causal machinery, but is unrealistic in biology where feedback is pervasive. How robust is RCSP to (mild) violations of this assumption? Simulations would be a straightforward way to address this.

      We agree that real causal graphs in biology often contain cycles. We now include additional experimental results with cyclic directed graphs in the Supplementary Materials. RCSP outperformed the other algorithms even in this setting, but we caution the reader that the theoretical interpretation of the RCS score may not coincide with a root causal effect when cycles exist:

      “We also evaluated the algorithms on directed graphs with cycles. We generated a linear SEM over p + 1 = 1000 variables in . We sampled the coefficient matrix β from a Bernoulli (1/(p − 1)) distribution but did not restrict the non-zero coefficients to the upper triangular portion of the matrix. We then proceeded to permute the variable ordering and weight each entry as in the Methods for the DAG. We repeated this procedure 30 times and report the results in Supplementary Figure 3.

      RCSP again outperformed all other algorithms even in the cyclic case. The results suggest that conditioning on the surrogate ancestors also estimates the RCS well even in the cyclic case. However, we caution that an error term E<sub>i</sub> can affect the ancestors of , when cycles exist. As a result, the RCS may not isolate the causal effect of the error term and thus not truly coincide with the notion of a root causal effect in cyclic causal graphs.”
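
      The difference between the DAG and cyclic simulations can be sketched as follows. This is only an illustration of sampling the support of the coefficient matrix from a Bernoulli distribution with and without the upper-triangular restriction; the function names, the small graph size, the exclusion of self-loops, and the cycle check are assumptions made for the example, and the permutation and weighting steps described in the Methods are omitted.

```python
import random


def sample_support(n_vars, prob, upper_only, rng):
    """Return an n_vars x n_vars 0/1 matrix; entry [i][j] = 1 means an edge i -> j."""
    support = [[0] * n_vars for _ in range(n_vars)]
    for i in range(n_vars):
        for j in range(n_vars):
            if i == j:
                continue  # assumption: no self-loops
            if upper_only and i > j:
                continue  # DAG case: edges only from lower to higher index
            if rng.random() < prob:
                support[i][j] = 1
    return support


def has_cycle(support):
    """Detect a directed cycle with an iterative depth-first search."""
    n = len(support)
    state = [0] * n  # 0 = unvisited, 1 = on the current path, 2 = finished
    for start in range(n):
        if state[start]:
            continue
        stack = [(start, 0)]
        state[start] = 1
        while stack:
            node, nxt = stack.pop()
            advanced = False
            for j in range(nxt, n):
                if support[node][j]:
                    if state[j] == 1:
                        return True  # back edge to a vertex on the current path
                    if state[j] == 0:
                        stack.append((node, j + 1))  # resume this vertex later
                        stack.append((j, 0))
                        state[j] = 1
                        advanced = True
                        break
            if not advanced:
                state[node] = 2
    return False


n_vars = 50                    # small stand-in for p + 1 = 1000
prob = 1.0 / (n_vars - 2)      # Bernoulli(1/(p - 1)) with p = n_vars - 1
dag = sample_support(n_vars, prob, upper_only=True, rng=random.Random(0))
cyclic = sample_support(n_vars, prob, upper_only=False, rng=random.Random(0))
```

      The upper-triangular support is acyclic by construction, whereas the unrestricted support may or may not contain a cycle in any particular draw; `has_cycle(cyclic)` reports which case occurred.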

      (3) The authors spend considerable effort arguing that technical sampling noise in X can effectively be ignored (at least in bulk). While the mathematical arguments here are reasonable, they miss the bigger picture point that the measured gene expression X can only ever be a noisy/biased proxy for the expression changes that caused disease: 1) Those events happened before the disease manifested, possibly early in development for some conditions like neurodevelopmental disorders. 2) bulk RNA-seq gives only an average across cell-types, whereas specific cell-types are likely “causal.” 3) only a small sample, at a single time point, is typically available. Expression in other parts of the tissue and at different times will be variable.

      We agree that many other sources of error exist. The causal model of RNA-expression in Methods corresponds to a single snapshot in time for each sample. We now clarify this in the Methods as follows:

      “We represent a snapshot of a biological causal process using an SEM over obeying Equation (3).”

      We thus only detect the root causal genes in a single snapshot in time for each sample in bulk RNA-seq. If we cannot detect the root causal effect in a gene due to the signal washing out over time as in (1), or if the root causal effects in different cell types cancel each other out to exactly zero in bulk as in (2), then we cannot detect those root causal genes even with an infinite sample size.

      (4) While there are connections to the omnigenic model, the latter is somewhat misrepresented. The authors refer to the “core genes” of the omnigenic model as being at the end (longitudinal) of pathogenesis. The omnigenic model makes no statements about temporal ordering: in causal inference terminology the core genes are simply the direct causes of disease.

      We now clarify that we use the word pathogenesis to mean the causal cascade from root causes to the diagnosis. In this case, the direct causes of the diagnosis correspond to the end of pathogenesis, while the root causes correspond to the beginning. For example, if , with Y a diagnosis, then X<sub>1</sub> is a root causal gene while X<sub>2</sub> is a core (direct causal) gene. We now clarify this in the Introduction:

      “Root causes of disease correspond to the most upstream causes of a diagnosis with strong causal effects on the diagnosis. Pathogenesis refers to the causal cascade from root causes to the diagnosis. Genetic and non-genetic factors may act as root causes and affect gene expression as an intermediate step during pathogenesis. We introduce root causal gene expression levels – or root causal genes for short – that correspond to the initial changes to gene expression induced by genetic and non-genetic root causes that have large causal effects on a downstream diagnosis (Figure 1 (a)). Root causal genes differ from core genes that directly cause the diagnosis and thus lie at the end, rather than at the beginning, of pathogenesis (Boyle et al., 2017).”

      (5) A key observation underlying the omnigenic model is that genetic heritability is spread throughout the genome (and somewhat concentrated near genes expressed in disease relevant cell types). This implies that (almost) all expressed genes, or their associated (e)SNPs, are “root causes”.

      We now clarify that genetic heritability can be spread throughout the genome in the omnigenic root causal model as well in the Discussion:

      “Further, each causal genetic variant tends to have only a small effect on disease risk in complex disease because the variant can directly cause Y or directly cause any causal gene including those with small root causal effects on Y ; thus, all error terms that cause Y can model genetic effects on Y. However, the root causal model further elaborates that genetic and non-genetic factors often combine to produce a few root causal genes with large root causal effects, where non-genetic factors typically account for the majority of the large effects in complex disease. Many variants may therefore cause many genes in diseases with only a few root causal genes.”

      We finally add Figure 5 into the Discussion as a concrete example illustrating the omnigenic root causal model.

      (6) The claim that root causal genes would be good therapeutic targets feels unfounded. If these are highly variable across individuals then the choice of treatment becomes challenging. By contrast the causal effects may converge on core genes before impacting disease, so that intervening on the core genes might be preferable. The jury is still out on these questions, so the claim should at least be made hypothetical.

      We clarify that we do not claim that root causal genes are better treatment targets than core genes in terms of magnitudes of causal effects on the phenotype. For example, in the common cold with a virus as the root cause, giving a patient an antiviral will eliminate fever and congestion, but so will giving a decongestant and an antipyretic. We only claim that treating root causal genes can eliminate disease near its pathogenic onset, just like giving an antiviral can eliminate the viral load and stop pathogenesis. We write the following in the Introduction:

      “Treating root causal genes can modify disease pathogenesis in its entirety, whereas targeting other causes may only provide symptomatic relief... Identifying root causal genes is therefore critical for developing treatments that eliminate disease near its pathogenic onset.”

      We also further clarify in the Discussion that root causal genes account for deleterious causal effects not captured by the diagnosis Y:

      “We finally emphasize that the root causal model accounts for all deleterious effects of the root causal genes, whereas the core gene model only captures the deleterious effects captured by the diagnosis Y. For example, the disease of diabetes causes retinopathy, but retinopathy is not a part of the diagnostic criteria of diabetes. As a result, the gene expression levels that cause retinopathy but not the diagnosis of diabetes are not core genes, even though they are affected by the root causal genes.”

      We do agree that root causal genes may differ substantially between patients, although it is unclear if the heterogeneity is too great to develop treatments.

      (7) The closest thing to a gold standard I believe we have for “root causal genes” is integration of molecular QTLs and GWAS, specifically coloc/MR. Here the “E” of RCSP are explicitly represented as SNPs. I don’t know if there is good data for AMD but there certainly is for MS. The authors should assess the overlap with their results. Another orthogonal avenue would be to check whether the root causal genes change early in disease progression.

      Colocalization and Mendelian randomization unfortunately cannot identify root causal effects because they all attempt, either heuristically (colocalization) or rigorously (MR), to identify variants that cause each gene expression level rather than variants that directly cause each gene expression level and thus make up the error terms. We therefore need new methods that can identify direct causal variants in order to assess overlap.

      We checked whether root causal genes change early in disease progression using knowledge of pathogenesis. In particular, oxidative stress induces pathogenesis in AMD, and RCSP identified root causal genes involved in oxidative stress in AMD:

      “The pathogenesis of AMD involves the loss of RPE cells. The RPE absorbs light in the back of the retina, but the combination of light and oxygen induces oxidative stress, and then a cascade of events such as immune cell activation, cellular senescence, drusen accumulation, neovascularization and ultimately fibrosis (Barouch et al., 2007). We therefore expect the root causal genes of AMD to include genes involved in oxidative stress during early pathogenesis. The gene MIPEP with the highest D-RCS score in Figure 3 (d) indeed promotes the maturation of oxidative phosphorylation-related proteins (Shi et al., 2011). The second gene SLC7A5 is a solute carrier that activates mTORC1 whose hyperactivation increases oxidative stress via lipid peroxidation (Nachef et al., 2021; Go et al., 2020). The gene HEATR1 is involved in ribosome biogenesis that is downregulated by oxidative stress (Turi et al., 2018). The top genes discovered by RCSP thus identify pathways known to be involved in oxidative stress.”

      Similarly, T cell infiltration across the blood brain barrier initiates pathogenesis in MS, and RCSP identified root causal genes involved in this infiltration:

      “Genes with the highest D-RCS scores included MNT, CERCAM and HERPUD2 (Figure 4 (d)). MNT is a MYC antagonist that modulates the proliferative and pro-survival signals of T cells after engagement of the T cell receptor (Gnanaprakasam et al., 2017). Similarly, CERCAM is an adhesion molecule expressed at high levels in microvessels of the brain that increases leukocyte transmigration across the blood brain barrier (Starzyk et al., 2000). HERPUD2 is involved in the endoplasmic-reticulum associated degradation of unfolded proteins (Kokame et al., 2000). Genes with the highest D-RCS scores thus serve key roles in known pathogenic pathways of MS.”

      (8) The available Perturb-seq datasets have limitations beyond the control of the authors. 1) The set of genes that are perturbed. The authors address this by simply sub-setting their analysis to the intersection of genes represented in the perturbation and observational data. However, this may mean that a true ancestor of X is not modeled/perturbed, limiting the formal claims that can be made. Additionally, some proportion of genes that are nominally perturbed show little to no actual perturbation effect (for example, due to poor guide RNA choice) which will also lead to missing ancestors.

      We now clarify that Perturb-seq can only identify root causal genes among the adequately perturbed set of genes in the Discussion:

      “Modern genome-wide Perturb-seq datasets also only adequately perturb and measure a few thousand, rather than all, gene expression levels. RCSP can only identify root causal genes within this perturbed and measured subset.”

      (9) The authors provide no mechanism for statistical inference/significance for their results at either the individual or aggregated level. While I am a proponent of using effect sizes more than p-values, there is still value in understanding how much signal is present relative to a reasonable null.

      We now explain that RCSP does not perform statistical inference in Methods because it is not clear how to define the appropriate cut-off for the RCS score under the null distribution:

      “We focus on statistical estimation rather than statistical inference because Φ<sub>i</sub> > 0 when E<sub>i</sub> causes Y under mild conditions, so we reject the null hypothesis that Φ<sub>i</sub> = 0 for many genes if many gene expression levels cause Y. However, just like a machine typically breaks down due to only one or a few root causal problems, we hypothesize that only a few genes have large RCS scores Φ<sub>i</sub> ≫ 0 even in complex disease.”

      (10) I agree with the authors that age coming out as a “root cause” is potentially encouraging. However, it is also quite different in nature to expression, including being “measured” exactly. Will RCSP be biased towards variables that have lower measurement error?

      We tested the above hypothesis by plotting sequencing depth against the D-RCS scores of each gene. We observed a small negative correlation between sequencing depth and D-RCS scores, indicating the D-RCS scores are slightly biased upwards with low sequencing depth. However, genes with the largest D-RCS scores exhibited a wide variety of sequencing depths in both MS and AMD, suggesting that sequencing depth has minimal effect on the largest D-RCS scores. We now explain these results for AMD in the Supplementary Materials:

      “Theorem 1 states that RCS scores may exhibit bias with insufficient sequencing depth. The genes with large D-RCS scores may therefore simply have low sequencing depths. To test this hypothesis, we plotted sequencing depth against D-RCS scores. Consistent with Theorem 1, we observed a small negative correlation between D-RCS and sequencing depth (ρ = −0.16, p = 2.04E-13), and D-RCS scores exhibited greater variability at the lowest sequencing depths (Supplementary Figure 8). However, genes with the largest D-RCS scores had mean sequencing depths interspersed between 20 and 3000. We conclude that genes with the largest D-RCS scores had a variety of sequencing depths ranging from low to high.”

      We also report the results for MS:

      “We plot sequencing depth against the D-RCS scores of each gene similar to the AMD dataset. We again observed a small negative correlation (ρ = −0.136, p < 2.2E-16), indicating that genes with low sequencing depths had slightly higher D-RCS scores on average (Supplementary Figure 12). However, genes with the largest D-RCS scores again had a variety of sequencing depths. We conclude that sequencing depth has minimal correlation with the largest D-RCS scores.”
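
      The check described above amounts to correlating per-gene sequencing depth with per-gene D-RCS scores. The sketch below uses a Pearson coefficient on made-up numbers; the response does not state which correlation coefficient was used, so the choice of Pearson, the function name, and the toy values are all assumptions for illustration only.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-gene values, not data from the manuscript.
mean_depth = [20.0, 100.0, 500.0, 1000.0, 3000.0]
d_rcs = [0.9, 0.7, 0.8, 0.4, 0.3]
rho = pearson(mean_depth, d_rcs)
```

      A negative `rho` on real data would indicate that genes with lower sequencing depth tend to have higher D-RCS scores, which is the direction of bias the reviewer asked about.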

      (11) Finally, it’s a stretch to call K562 cells “lymphoblasts.” They are more myeloid than lymphoid.

      We now clarify that K562 cells are undifferentiated blast cells that can be induced to differentiate into lymphoblasts in Results:

      “We next ran RCSP on 137 samples collected from CD4+ T cells of multiple sclerosis (MS; GSE137143) as well as Perturb-seq data of 1,989,578 undifferentiated blast cells that can be induced to differentiate into lymphoblasts, or the precursors of T cells and other lymphocytes.”

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This valuable study advances our understanding of the brain nuclei involved in rapid-eye movement (REM) sleep regulation. Using a combination of imaging, electrophysiology, and optogenetic tools, the study provides convincing evidence that inhibitory neurons in the preoptic area of the hypothalamus influence REM sleep. This work will be of interest to neurobiologists working on sleep and/or brain circuitry.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      This paper identifies GABA cells in the preoptic hypothalamus which are involved in REM sleep rebound (the increase in REM sleep) after selective REM sleep deprivation. By calcium photometry, these cells are most active during REM, and show more calcium signals during REM deprivation, suggesting they respond to "REM pressure". Inhibiting these cells optogenetically diminishes REM sleep. The optogenetic and photometry work is carried out to a high standard, the paper is well-written, and the findings are interesting.

      We thank the reviewer for the detailed feedback and thoughtful comments on how to improve our manuscript. To address the reviewer’s concerns, we revised our discussion and added new data. Below, we address the concerns point by point.

      Points that could be addressed or discussed:

      (1) The circuit mechanism for REM rebound is not defined. How do the authors see REM rebound as working from the POAGAD2 cells? Although the POAGAD2 does project to the TMN, the actual REM rebound could be mediated by a projection of these cells elsewhere. This could be discussed.

      We demonstrate that POA GAD2→TMN cells become more frequently activated as the pressure for REMs builds up, whereas inhibiting these neurons during high REMs pressure leads to a suppression of the REMs rebound. It is not known how POA GAD2→TMN cells encode increased REMs pressure and subsequently influence the REMs rebound. REMs deprivation was shown to change the intrinsic excitability of hippocampal neurons and impact synaptic plasticity (McDermott et al., 2003; Mallick and Singh, 2011; Zhou et al., 2020). We speculate that increased REMs pressure leads to an increase in the excitability of POA→TMN neurons, reflected in the increased number of calcium peaks. The increased excitability of POA GAD2→TMN neurons in turn likely leads to stronger inhibition of downstream REM-off neurons. Consequently, as soon as REMs deprivation stops, there is an increased chance for entering REMs. The time course for how long it takes till the POA excitability resettles to its baseline consequently sets a permissive time window for increased amounts of REMs to recover its lost amount. For future studies, it would be interesting to map how quickly the excitability of POA neurons increases or decays as a function of the lost or recovered amount of REMs and unravel the cellular mechanisms underlying the elevated activity of POA GAD2→TMN neurons during high REMs pressure, e.g., whether changes in the expression of ion channels contribute to increased excitability of these neurons (Donlea et al., 2014). As we mentioned in the Discussion, the POA also projects to other REMs regulatory brain regions such as the vlPAG and LH. Therefore, it remains to be tested whether POA GAD2→TMN neurons also innervate these brain regions to potentially regulate REMs homeostasis. We explicitly state this now in the revised Discussion.

      (2) The "POAGAD2 to TMN" name for these cells is somewhat confusing. The authors chose this name because they approach the POAGAD2 cells via retrograde AAV labelling (rAAV injected into the TMN). However, the name also seems to imply that neurons (perhaps histamine neurons) in the TMN are involved in the REM rebound, but there is no evidence in the paper that this is the case. Although it is nice to see from the photometry studies that the histamine cells are selectively more active (as expected) in NREM sleep (Fig. S2), I could not logically see how this was a relevant finding to REM rebound or the subject of the paper. There are many other types of cells in the TMN area, not just histamine cells, so are the authors suggesting that these non-histamine cells in the TMN could be involved?

      We acknowledge that other types of neurons in the TMN may also be involved in the REMs rebound, and therefore inhibition of histamine neurons by POA GAD2 →TMN neurons may not be the sole source of the observed effect. To stress that other neurons within the TMN and/or brain regions may also contribute to the REMs rebound, we have revised the Results section.

      We performed complementary optogenetic inhibition experiments of TMN HIS neurons to investigate if suppression of these neurons is sufficient to promote REMs. We found that SwiChR++ mediated inhibition of TMN HIS neurons increased the amount of REMs compared with recordings without laser stimulation in the same mice and eYFP mice with laser stimulation. Thus, while TMN HIS neurons may not be the only downstream target of GABAergic POA neurons, these data suggest that they contribute to REMs regulation. We have incorporated these results in Fig. S4.

      We further investigated whether the activity of TMN HIS neurons changes between two REMs episodes. Assuming that REMs pressure inhibits the activity of REM-off histamine neurons, their firing rates should be highest right after REMs ends when REMs pressure is lowest, progressively decay throughout the inter-REM interval, and reach their lowest activity right before the onset of REMs (Park et al., 2021), similar to the activity profile observed for vlPAG REM-off neurons (Weber et al., 2018). We indeed found that TMN HIS neurons display a gradual decrease in their activity throughout the inter-REM interval and thus potentially reflect the build up of REM pressure (Fig. S2F).

      (3) It is a puzzle why most of the neurons in the POA seem to have their highest activity in REM, as also found by Miracca et al 2022, yet presumably some of these cells are going to be involved in NREM sleep as well. Could the same POAGAD2-TMN cells identified by the authors also be involved in inducing NREM sleep-inhibiting histamine neurons (Chung et al). And some of these POA cells will also be involved in NREM sleep homeostasis (e.g. Ma et al Curr Biol)? Is NREM sleep rebound necessary before getting REM sleep rebound? Indeed, can these two things (NREM and REM sleep rebound) be separated?

      Previous studies have demonstrated that POA GABAergic neurons, including those projecting to the TMN, are involved in NREMs homeostasis (Sherin et al., 1998; Gong et al., 2004; Ma et al., 2019). Therefore, we predict that POA neurons that are involved in NREMs homeostasis are a subset of POA GAD2→TMN neurons in our manuscript.

      Using optrode recordings in the POA, we recently reported that 12.4% of neurons sampled have higher activity during NREMs compared with REMs; in contrast, 43.8% of neurons sampled have the highest activity during REMs compared with NREMs (Antila et al., 2022), indicating that the proportion of NREM max neurons is smaller compared with REM max neurons. These proportions of neurons are in agreement with previous results (Takahashi et al., 2009). Considering fiber photometry monitors the average activity of a population of neurons as opposed to individual neurons, it is possible that we recorded neural activity across heterogeneous populations and therefore our findings may disguise the neural activity of the low proportion of NREMs neurons. We previously reported the spiking activity of POA GAD2→TMN neurons at the single cell level (Chung et al., 2017). We have noted in the manuscript that while the activity of POA GAD2→TMN neurons is highest during REMs, the neural activity increases at NREMs → REMs transitions, indicating these neurons also are active during NREMs.

      Using our REMs restriction protocol, we selectively restricted REMs leading to the subsequent rebound of REMs without affecting NREMs and consequently we did not find an increase in the amount of NREMs during the rebound or an increase in slow-wave activity, a key characteristic of sleep rebound that gradually dissipates during recovery sleep (Blake and Gerard, 1937; Williams et al., 1964; Rosa and Bonnet, 1985; Dijk et al., 1990; Neckelmann and Ursin, 1993; Ferrara et al., 1999). However, during total sleep deprivation when subjects are deprived of both NREMs and REMs, isolating NREMs and REMs rebound may not be attainable.

      (4) Is it possible to narrow down the POA area where the GAD2 cells are located more precisely?

      POA can be subdivided into anatomically distinct regions such as medial preoptic area, median preoptic area, ventrolateral preoptic area, and lateral preoptic area (MPO, MPN, VLPO, and LPO respectively). To quantify where the virus expressing GAD2 cells and optic fibers are located within the POA, we overlaid the POA coronal reference images (with red boundaries denoting these anatomically distinct regions) over the virus heat maps and optic fiber tracts from datasets used in Figure 1A. We found that virus expression and optic fiber tracts were located in the ventrolateral POA, lateral POA, and the lateral part of medial POA, and included this description in the text.

      Author response image 1.

      Location of virus expression (A) and optic fiber placement (B) within subregions of POA.

      (5) It would be ideal to further characterize these particular GAD2 cells by RT-PCR or RNA seq. Which other markers do they express?

      Single-cell RNA-sequencing of POA neurons has revealed an enormous level of molecular diversity, consisting of nearly 70 subpopulations based on gene expression, of which 43 can be clustered into inhibitory neurons (Moffitt et al., 2018). One of the most studied subpopulations of POA sleep-active neurons contains the inhibitory neuropeptide galanin (Sherin et al., 1998; Gaus et al., 2002; Chung et al., 2017; Kroeger et al., 2018; Ma et al., 2019; Miracca et al., 2022). Galanin neurons have been demonstrated to innervate the TMN (Sherin et al., 1998), yet within the galanin neurons 7 distinct clusters exist based on unique gene expression (Moffitt et al., 2018). In addition to galanin, we have previously performed single-cell RNA-seq on POA GAD2→TMN neurons and identified additional neuropeptides such as cholecystokinin (CCK), corticotropin-releasing hormone (CRH), prodynorphin (PDYN), and tachykinin 1 (TAC1) as subpopulations of GABAergic POA sleep-active neurons (Chung et al., 2017; Smith et al., 2023). Like galanin, these neuropeptides can also be divided into multiple subtypes (Chen et al., 2017; Moffitt et al., 2018). Thus, while these molecular markers for POA neurons are immensely diverse, we agree that characterizing the molecular identity of POA GAD2→TMN neurons and investigating the functional relevance of these neuropeptides in the context of REMs homeostasis would enrich our understanding of a neural circuit involved in REMs homeostasis and can stand as a separate extension of this manuscript.

      Reviewer #2 (Public Review):

      Maurer et al investigated the contribution of GAD2+ neurons in the preoptic area (POA), projecting to the tuberomammillary nucleus (TMN), to REM sleep regulation. They applied an elegant design to monitor and manipulate the activity of this specific group of neurons: a GAD2-Cre mouse, injected with retrograde AAV constructs in the TMN, thereby presumably only targeting GAD2+ cells projecting to the TMN. Using this set-up in combination with technically challenging techniques including EEG with photometry and REM sleep deprivation, the authors found that this cell-type studied becomes active shortly (≈40sec) prior to entering REM sleep and remains active during REM sleep. Moreover, optogenetic inhibition of GAD2+ cells inhibits REM sleep by a third and also impairs the rebound in REM sleep in the following hour. Despite a few reservations or details that would benefit from further clarification (outlined below), the data makes a convincing case for the role of GAD2+ neurons in the POA projecting to the TMN in REM sleep regulation.

      We thank the reviewer for the thorough assessment of our study and supportive comments. We have addressed your concerns in the revised manuscript, and our point by point response is provided below.

      The authors found that optogenetic inhibition of GAD2+ cells suppressed REM sleep in the hour following the inhibition (e.g. Fig2 and Fig4). If the authors have the data available, it would be important to include the subsequent hours in the rebound time (e.g. from ZT8.5 to ZT24) to test whether REM sleep rebound remains impaired, or recovers, albeit with a delay.

      We thank the reviewer for this comment and agree that it would be interesting to know how REMs changes for a longer period of time throughout the rebound phase. For Fig. 2, we did not record the subsequent hours. For Fig 4, we recorded the subsequent rebound between ZT7.5 and 10.5. When we compare the REMs amount during this 4 hr interval, the SwiChR mice have less REMs compared with eYFP mice with marginal significance (unpaired t-test, p=0.0641). We also plotted the cumulative REMs amount during restriction and rebound phases, and found that the cumulative amount of REMs was still lower in SwiChR mice than eYFP mice at ZT 10.5 (Author response image 2). Therefore, it will be interesting to record for a longer period of time to test when the SwiChR mice compensate for all the REMs that was lost during the restriction period.

      Author response image 2.

      Cumulative amount of REMs during REMs deprivation and rebound combined with optogenetic stimulation in eYFP and SwiChR groups. This data is shown as bar graphs in Figure 4.

      REM sleep is under tight circadian control (e.g. Wurts et al., 2000 in rats; Dijk, Czeisler 1995 in humans). To contextualize the results, it would be important to mention that it is not clear if the role of the manipulated neurons in REM sleep regulation hold at other circadian times of the day.

      Author response image 3.

      Inhibiting POA GAD2→TMN neurons at ZT5-8 reduces REMs. (A) Schematic of optogenetic inhibition experiments. (B) Percentage of time spent in REMs, NREMs and wakefulness with laser in SwiChR++ and eYFP mice. Unpaired t-tests, p = 0.0013, 0.0469 for REMs and wake amount. (C) Duration of REMs, NREMs, and wake episodes. Unpaired t-tests, p = 0.0113 for NREMs duration. (D) Frequency of REMs, NREMs, and wake episodes. Unpaired t-tests, p = 0.0063, 0.0382 for REMs and NREMs frequency.

      REMs propensity is largest towards the end of the light phase (Czeisler et al., 1980; Dijk and Czeisler, 1995; Wurts and Edgar, 2000). As a control, we therefore performed the optogenetic inhibition experiments of POA GAD2→TMN neurons during ZT5-8 (Author response image 3). Similar to our results in Figure 2, we found that SwiChR-mediated inhibition of POA GAD2→TMN neurons attenuated REMs compared with eYFP laser sessions. These findings suggest our results are consistent at other circadian times of the day.

      The effect size of the REM sleep deprivation using the vibrating motor method is unclear. In FigS4-D, the experimental mice reduce their REM sleep to 3% whereas the control mice spend 6% in REM sleep. In Fig4, mice are either subjected to REM sleep deprivation with the vibrating motor (controls), or REM sleep deprivations + optogenetics (experimental mice).

      The control mice (vibrating motor) in Fig4 spend 6% of their time in REM sleep, which is double the amount of REM sleep compared to the mice receiving the same treatment in FigS4-D. Can the authors clarify the origin of this difference in the text?

      The effect size for REM sleep deprivation is now added in the text.

      It is important to note that these figures are analyzing two different intervals of the REMs restriction. In Fig. S4D, we analyzed the total amount of REMs over the entire 6 hr restriction interval (ZT1.5-7.5). In Fig. 4, we analyzed the amount of REMs only during the last 3 hr of restriction (ZT4.5-7.5) as optogenetic inhibition was performed only during the last 3 hrs when the REMs pressure is high. In Fig. S4D, we looked at the amount of REMs during ZT1.5-4.5 and 4.5-7.5 and found that the amount of REMs during ZT4.5-7.5 (4.46 ± 0.25 %; mean ± s.e.m.) is indeed higher than ZT 1.5-4.5 (1.66 ± 0.62 %), and is comparable to the amount of REMs during ZT4.5-7.5 in eYFP mice (5.95 ± 0.52 %) in Fig. 4. We now clearly state in the manuscript at which time points we analyzed the amount, duration and frequency of REMs.
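
      The consistency between the two sub-interval values and the roughly 3% whole-interval value mentioned by the reviewer can be checked with a short calculation. The sketch below is only an illustration: it uses the two mean values quoted above and assumes that both percentages refer to the same animals and are expressed relative to the length of each 3 hr interval, so that a duration-weighted mean is appropriate. The function and variable names are hypothetical.

```python
# Mean REMs percentages quoted in the response for the two halves of the
# 6 hr restriction interval in Fig. S4D (s.e.m. values are ignored here).
rem_percent_by_interval = {
    (1.5, 4.5): 1.66,  # ZT1.5-4.5
    (4.5, 7.5): 4.46,  # ZT4.5-7.5
}


def duration_weighted_mean(percent_by_interval):
    """Combine per-interval percentages, weighting each by its duration in hours."""
    total_hours = sum(end - start for start, end in percent_by_interval)
    weighted_sum = sum(
        (end - start) * pct for (start, end), pct in percent_by_interval.items()
    )
    return weighted_sum / total_hours


overall = duration_weighted_mean(rem_percent_by_interval)
print(f"REMs over ZT1.5-7.5: {overall:.2f} %")
```

      Under these assumptions the combined value is close to 3%, which matches the whole-interval figure the reviewer cites for Fig. S4D, while the ZT4.5-7.5 value alone is closer to the eYFP value in Fig. 4.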

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) A few further citations suggested: Discussion "The TMN contains histamine producing neurons and antagonizing histamine neurons causes sleepiness..." It would be appropriate to cite Uygun DS et al 2016 J Neurosci (PMID: 27807161) here. Using the same HDC-Cre mice as used by Maurer et al., Uygun et al found that selectively increasing GABAergic inhibition onto histamine neurons produced NREM sleep.

      We apologize for omitting this important paper. In the revised manuscript, we added this citation.

      (2) Materials and Methods.

      Although the JAX numbers are given for the mouse lines based on researchers generously donating to JAX for others to use, please cite the papers corresponding to the GAD2-ires-Cre and HDC-ires-Cre mouse lines deposited at JAX.

      GAD2-ires-Cre was described in Taniguchi H et al., 2011, Neuron (PMID: 21943598).

      The construction of the HDC-ires-CRE line is described in Zecharia AY et al J Neurosci et al 2012 (PMID: 22993424).

      We have now added these important citations in the revised manuscript.

      (3) Similarly, for the viruses, please provide the citations for the AAV constructs that were donated to Addgene.

      We have now added these citations in the revised manuscript.

      Reviewer #2 (Recommendations For The Authors):

      The authors rely heavily on their conclusions by using an optogenetic tool that inhibits the activity of GAD2+ neurons, however, it is not shown that these neurons are indeed inhibited as expected. An alternative approach to tackle this could be the application of a different technique to achieve the same output (e.g. chemogenetics). However, both experiments (confirmation of inhibition, or using a different technique) would require a significant amount of work, and given the numerous studies out there showing that these optogenetic tools tend to work, may not be necessary. Hence the authors could also cite a similar study that used a likewise construct and where it was indeed shown that this technique works (i.e. similar retrograde optogenetic construct with Cre dependent expression combined with electrophysiological recordings).

      This laser stimulation protocol was designed based on previous reports of sustained inhibition using the same inhibitory opsin and our prior results that recapitulate similar findings as inhibitory chemogenetic techniques (Iyer et al., 2016; Kim et al., 2016; Wiegert et al., 2017; Stucynski et al., 2022). We have now added this description in the Result section.

      Fig1A - Right: the virus expression graphs are great and give a helpful insight into the variability. The image on the left (GCAMP+ cells) is less clear, the GCAMP+ cells don't differentiate well from the background. Perhaps the whole brain image with inset in POA can show the GCAMP expression more convincingly.

      We have added a histology picture showing the whole brain image with inset in the POA in the updated Fig. 1A.

      Statistics: The table is very helpful. Based on the degrees of freedom, it seems that in some instances the stats are run on the recordings rather than on the individual mice (e.g. Fig1). It could be considered to use a mixed model where subjects are taken into account as a factor.

      Author response image 4.

      ΔF/F activity of POA GAD2→TMN neurons during NREMs. The duration of NREMs episodes was normalized in time, ranging from 0 to 100%. Shading, ± s.e.m. Pairwise t-tests with Holm-Bonferroni correction, p = 5.34e-4 between 80 and 100. Gray bar, intervals where ΔF/F activity was significantly different from baseline (0 to 20%, the first time bin). n = 10 mice. In Fig. 1E, we ran stats based on the recordings. In this data set, we ran stats based on the individual mice, and found that the activity also gradually increased throughout NREMs episodes.
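
      For readers less familiar with the Holm-Bonferroni step-down procedure named in this legend, a minimal sketch in plain Python follows. It is not the authors' analysis code: the function name is hypothetical and the raw p-values are invented for illustration (they stand in for comparisons of four later time bins against the first bin).

```python
def holm_bonferroni(p_values):
    """Return Holm-adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted p-value never falls below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values, one per comparison against the baseline bin.
raw = [0.04, 0.001, 0.03, 0.20]
adjusted = holm_bonferroni(raw)
significant = [p < 0.05 for p in adjusted]
```

      With these made-up inputs only the second comparison remains below 0.05 after correction, which shows how the procedure controls the family-wise error rate across the time bins.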

      There is an effect of laser in Fig2 on REM sleep amount, as well as an interaction effect with virus injection (from the table). Therefore, it would be helpful for the reader to also show REM sleep data from the control group (laser stimulation but no active optogenetics construct) in Fig 2.

      To properly control for laser and virus effects, we performed the same laser stimulation experiments in eYFP control mice (expressing only eYFP without the optogenetic construct, SwiChR++) and the data are provided in Fig. 2C.

      Fig3B: At the start of the rebound of REM sleep, there is a massive amount of wakefulness, also reflected in the change of spectral composition. Could you comment on the text about what is happening here?

      We quantified the amount of wakefulness during the first hour of REMs rebound and found that indeed there is no significant difference in wakefulness between REM restriction and baseline control conditions (Fig. S4H). Therefore, while the representative image in Fig. 3B shows increased wakefulness at the beginning of REMs rebound, we do not think the overall amount of wakefulness is increased.

      Fig 4, supplementary data: it would be helpful for the reader to have mentioned in the text the effect size of the REM sleep restriction protocol (e.g. mean and standard deviation).

      Thank you for this suggestion. We have now added the effect size for the REM sleep restriction experiments in the main text.

      REM sleep restriction and photometry experiment: could be improved by adding within the main body of text that, in order to conduct the photometry experiment in the last hours of REM sleep deprivation, the manual REM sleep deprivation had to be applied, because the vibrating motor technique disturbed the photometry recordings.

      Thank you for this suggestion. We have added the description in the main text.

      Suggestion to build further on the already existing data (not for this paper): you have a powerful dataset to test whether REM sleep pressure builds up during wakefulness or NREM sleep, by correlating when your optogenetic treatment occurs (NREM or wakefulness), with the subsequent rebound in REM sleep (see also Endo et al., 1998; Benington and Heller, 1994; Franken 2001).

      We thank the reviewer for this excellent suggestion. We plan to carry out this experiment in the future.

      References

      Antila, H., Kwak, I., Choi, A., Pisciotti, A., Covarrubias, I., Baik, J., et al. (2022). A noradrenergic-hypothalamic neural substrate for stress-induced sleep disturbances. Proc. Natl. Acad. Sci. 119, e2123528119. doi: 10.1073/pnas.2123528119.

      Blake, H., and Gerard, R. W. (1937). Brain potentials during sleep. Am. J. Physiol.-Leg. Content 119, 692–703. doi: 10.1152/ajplegacy.1937.119.4.692.

      Chen, R., Wu, X., Jiang, L., and Zhang, Y. (2017). Single-Cell RNA-Seq Reveals Hypothalamic Cell Diversity. Cell Rep. 18, 3227–3241. doi: 10.1016/j.celrep.2017.03.004.

      Chung, S., Weber, F., Zhong, P., Tan, C. L., Nguyen, T., Beier, K. T., et al. (2017). Identification of Preoptic Sleep Neurons Using Retrograde Labeling and Gene Profiling. Nature 545, 477–481. doi: 10.1038/nature22350.

      Czeisler, C. A., Zimmerman, J. C., Ronda, J. M., Moore-Ede, M. C., and Weitzman, E. D. (1980). Timing of REM sleep is coupled to the circadian rhythm of body temperature in man. Sleep 2, 329–346.

      Dijk, D. J., Brunner, D. P., Beersma, D. G., and Borbély, A. A. (1990). Electroencephalogram power density and slow wave sleep as a function of prior waking and circadian phase. Sleep 13, 430–440. doi: 10.1093/sleep/13.5.430.

      Dijk, D. J., and Czeisler, C. A. (1995). Contribution of the circadian pacemaker and the sleep homeostat to sleep propensity, sleep structure, electroencephalographic slow waves, and sleep spindle activity in humans. J. Neurosci. Off. J. Soc. Neurosci. 15, 3526–3538. doi: 10.1523/JNEUROSCI.15-05-03526.1995.

      Donlea, J. M., Pimentel, D., and Miesenböck, G. (2014). Neuronal machinery of sleep homeostasis in Drosophila. Neuron 81, 860–872. doi: 10.1016/j.neuron.2013.12.013.

      Ferrara, M., De Gennaro, L., Casagrande, M., and Bertini, M. (1999). Auditory arousal thresholds after selective slow-wave sleep deprivation. Clin. Neurophysiol. Off. J. Int. Fed. Clin. Neurophysiol. 110, 2148–2152. doi: 10.1016/s1388-2457(99)00171-6.

      Gaus, S. E., Strecker, R. E., Tate, B. A., Parker, R. A., and Saper, C. B. (2002). Ventrolateral preoptic nucleus contains sleep-active, galaninergic neurons in multiple mammalian species. Neuroscience 115, 285–294. doi: 10.1016/S0306-4522(02)00308-1.

      Gong, H., McGinty, D., Guzman-Marin, R., Chew, K.-T., Stewart, D., and Szymusiak, R. (2004). Activation of c-fos in GABAergic neurones in the preoptic area during sleep and in response to sleep deprivation. J. Physiol. 556, 935–946. doi: 10.1113/jphysiol.2003.056622.

      Iyer, S. M., Vesuna, S., Ramakrishnan, C., Huynh, K., Young, S., Berndt, A., et al. (2016). Optogenetic and chemogenetic strategies for sustained inhibition of pain. Sci. Rep. 6, 30570. doi: 10.1038/srep30570.

      Kim, H., Ährlund-Richter, S., Wang, X., Deisseroth, K., and Carlén, M. (2016). Prefrontal Parvalbumin Neurons in Control of Attention. Cell 164, 208–218. doi: 10.1016/j.cell.2015.11.038.

      Kroeger, D., Absi, G., Gagliardi, C., Bandaru, S. S., Madara, J. C., Ferrari, L. L., et al. (2018). Galanin neurons in the ventrolateral preoptic area promote sleep and heat loss in mice. Nat. Commun. 9, 4129. doi: 10.1038/s41467-018-06590-7.

      Ma, Y., Miracca, G., Yu, X., Harding, E. C., Miao, A., Yustos, R., et al. (2019). Galanin Neurons Unite Sleep Homeostasis and α2-Adrenergic Sedation. Curr. Biol. CB 29, 3315-3322.e3. doi: 10.1016/j.cub.2019.07.087.

      Mallick, B. N., and Singh, A. (2011). REM sleep loss increases brain excitability: role of noradrenaline and its mechanism of action. Sleep Med. Rev. 15, 165–178. doi: 10.1016/j.smrv.2010.11.001.

      McDermott, C. M., LaHoste, G. J., Chen, C., Musto, A., Bazan, N. G., and Magee, J. C. (2003). Sleep deprivation causes behavioral, synaptic, and membrane excitability alterations in hippocampal neurons. J. Neurosci. Off. J. Soc. Neurosci. 23, 9687–9695. doi: 10.1523/JNEUROSCI.23-29-09687.2003.

      Miracca, G., Anuncibay-Soto, B., Tossell, K., Yustos, R., Vyssotski, A. L., Franks, N. P., et al. (2022). NMDA Receptors in the Lateral Preoptic Hypothalamus Are Essential for Sustaining NREM and REM Sleep. J. Neurosci. 42, 5389–5409. doi: 10.1523/JNEUROSCI.0350-21.2022.

      Moffitt, J. R., Bambah-Mukku, D., Eichhorn, S. W., Vaughn, E., Shekhar, K., Perez, J. D., et al. (2018). Molecular, spatial, and functional single-cell profiling of the hypothalamic preoptic region. Science 362. doi: 10.1126/science.aau5324.

      Neckelmann, D., and Ursin, R. (1993). Sleep stages and EEG power spectrum in relation to acoustical stimulus arousal threshold in the rat. Sleep 16, 467–477.

      Park, S.-H., Baik, J., Hong, J., Antila, H., Kurland, B., Chung, S., et al. (2021). A probabilistic model for the ultradian timing of REM sleep in mice. PLOS Comput. Biol. 17, e1009316. doi: 10.1371/journal.pcbi.1009316.

      Rosa, R. R., and Bonnet, M. H. (1985). Sleep stages, auditory arousal threshold, and body temperature as predictors of behavior upon awakening. Int. J. Neurosci. 27, 73–83. doi: 10.3109/00207458509149136.

      Sherin, J. E., Elmquist, J. K., Torrealba, F., and Saper, C. B. (1998). Innervation of histaminergic tuberomammillary neurons by GABAergic and galaninergic neurons in the ventrolateral preoptic nucleus of the rat. J. Neurosci. Off. J. Soc. Neurosci. 18, 4705–4721.

      Smith, J., Honig-Frand, A., Antila, H., Choi, A., Kim, H., Beier, K. T., et al. (2023). Regulation of stress-induced sleep fragmentation by preoptic glutamatergic neurons. Curr. Biol. CB , S0960-9822(23)01585–3. doi: 10.1016/j.cub.2023.11.035.

      Stucynski, J. A., Schott, A. L., Baik, J., Chung, S., and Weber, F. (2022). Regulation of REM sleep by inhibitory neurons in the dorsomedial medulla. Curr. Biol. CB 32, 37-50.e6. doi: 10.1016/j.cub.2021.10.030.

      Takahashi, K., Lin, J.-S., and Sakai, K. (2009). Characterization and mapping of sleep-waking specific neurons in the basal forebrain and preoptic hypothalamus in mice. Neuroscience 161, 269–292. doi: 10.1016/j.neuroscience.2009.02.075.

      Weber, F., Hoang Do, J. P., Chung, S., Beier, K. T., Bikov, M., Saffari Doost, M., et al. (2018). Regulation of REM and Non-REM sleep by periaqueductal GABAergic neurons. Nat. Commun. 9, 1–13. doi: 10.1038/s41467-017-02765-w.

      Wiegert, J. S., Mahn, M., Prigge, M., Printz, Y., and Yizhar, O. (2017). Silencing Neurons: Tools, Applications, and Experimental Constraints. Neuron 95, 504–529. doi: 10.1016/j.neuron.2017.06.050.

      Williams, H. L., Hammack, J. T., Daly, R. L., Dement, W. C., and Lubin, A. (1964). RESPONSES TO AUDITORY STIMULATION, SLEEP LOSS AND THE EEG STAGES OF SLEEP. Electroencephalogr. Clin. Neurophysiol. 16, 269–279. doi: 10.1016/0013-4694(64)90109-9.

      Wurts, S. W., and Edgar, D. M. (2000). Circadian and homeostatic control of rapid eye movement (REM) sleep: promotion of REM tendency by the suprachiasmatic nucleus. J. Neurosci. Off. J. Soc. Neurosci. 20, 4300–4310. doi: 10.1523/JNEUROSCI.20-11-04300.2000.

      Zhou, Y., Lai, C. S. W., Bai, Y., Li, W., Zhao, R., Yang, G., et al. (2020). REM sleep promotes experience-dependent dendritic spine elimination in the mouse cortex. Nat. Commun. 11, 4819. doi: 10.1038/s41467-020-18592-5.

    1. Author Response

      The following is the authors’ response to the original reviews.

      We thank the three reviewers and the reviewing editor for their positive evaluation of our manuscript. We particularly appreciate that they unanimously consider our work as “important contributions to the understanding of how the CAF-1 complex works”, “The large amounts of data provided in the paper support the authors' conclusion very well” and “The paper effectively addresses its primary objective and is strong”. We also thank them for a careful reading and useful comments to improve the manuscript. We have built on these comments to provide an improved version of the manuscript, and address them point by point below.

      Reviewer #1 (Public Review):

      Summary:

      This paper makes important contributions to the structural analysis of the DNA replication-linked nucleosome assembly machine termed Chromatin Assembly Factor-1 (CAF-1). The authors focus on the interplay of domains that bind DNA, histones, and replication clamp protein PCNA.

      Strengths:

      The authors analyze soluble complexes containing full-length versions of all three fission yeast CAF-1 subunits, an important accomplishment given that many previous structural and biophysical studies have focused on truncated complexes. New data here supports previous experiments indicating that the KER domain is a long alpha helix that binds DNA. Via NMR, the authors discover structural changes at the histone binding site, defined here with high resolution. Most strikingly, the experiments here show that for the S. pombe CAF-1 complex, the WHD domain at the C-terminus of the large subunit lacks DNA binding activity observed in the human and budding yeast homologs, indicating a surprising divergence in the evolution of this complex. Together, these are important contributions to the understanding of how the CAF-1 complex works.

      Weaknesses:

      1. There are some aspects of the experimentation that are incompletely described: In the SEC data (Fig. S1C) it appears that Pcf1 in the absence of other proteins forms three major peaks. Two are labeled as "1a" (eluting at ~8 mL) and "1b" (~10-11 mL). It appears that Pcf1 alone or in complex with either or both of the other two subunits forms two different high molecular weight complexes (e.g. 4a/4b, 5a/5b, 6a/6b). There is also a third peak in the analysis of Pcf1 alone, which isn't named here, eluting at ~14 mL, overlapping the peaks labeled 2a, 4c, and 5c. The text describing these different macromolecular complexes seems incomplete (p. 3, lines 32-33): "When isolated, both Pcf2 and Pcf3 are monomeric while Pcf1 forms large soluble oligomers". Which of the three Pcf1-alone peaks are oligomers, and how do we know? What is the third peak? The gel analysis across these chromatograms should be shown.

      We thank the reviewer for his/her careful reading of the manuscript. Indeed, we plotted two curves in Figure S1C in a color that does not match the legend, leading to confusion. Curve 1, Pcf1 alone, depicted in red, should appear in pink as indicated in the legend and in the SDS-PAGE analysis below. Curve 1 exhibits two peaks, labeled as 1a and 1b. With an elution volume of 8.5mL close to the dead volume of the column, peak 1a corresponds to soluble oligomers, while peak 1b (10.4mL) likely corresponds to monomeric Pcf1. Curve 5 (Pcf1 + Pcf2 mixture) was in pink instead of purple as indicated in the legend. This curve consists of three distinct peaks (5a, 5b, and 5c). The SDS-PAGE analysis revealed the presence of oligomers of Pcf1-Pcf2 (5a, 8.3mL), the Pcf1-Pcf2 complex (5b, 9.8mL), and Pcf2 alone (5c, 13.6 mL).

      The color has now been corrected in the revised manuscript.

      More importantly, was a particular SEC peak of the three-subunit CAF-1 complex (i.e. 4a or 4b) characterized in the further experimentation, or were the data obtained from the input material prior to the separation of the different peaks? If the latter, how might this have affected the results? Do the forms inter-convert spontaneously?

      We conducted all structural analyses and DNA/PCNA interaction experiments (Figures 1-4, S1-S4) with freshly SEC-purified samples corresponding to the 4b peak (9.7 mL). Aliquots were flash-frozen with 50% glycerol for in vitro histone assembly assays (Figure 5).

      2. Given the strong structural prediction about the roles of residues L359 and F380 (Fig. 2f), these should be mutated to determine effects on histone binding.

      We are pleased that our structural predictions are considered strong. We agree that investigating the role of the L359 and F380 residues will be critical to further refine the binding interface between histone H3-H4 and CAF-1. An in vitro and in vivo analysis of such mutated forms, alongside the current Pcf1-ED mutant characterized in this article and additional potential mutated forms, has the potential to provide a better understanding of the dynamics of histone deposition by CAF-1. However, these additional approaches would require a substantial further step toward deciphering this enigmatic dynamic.

      3. Could it be that the apparent lack of histone deposition by the delta-WHD mutant complex occurs because this mutant complex is unstable when added to the Xenopus extract?

      We cannot formally exclude this possibility, and this could potentially apply to all mutated forms tested. However, in the absence of available antibodies against the fission yeast CAF-1 complex, we cannot test this hypothesis for technical reasons. Nevertheless, we feel reassured by the fact that the in vitro assays of nucleosome assembly are overall consistent with the in vivo assays. Indeed, all mutated forms tested that abolished or weakened nucleosome assembly also exhibited synthetic lethality/growth defect in the absence of a functional HIRA pathway, including the delta WHD mutated form. This genetic synergy, which reflects defective histone deposition by CAF-1, is not specific to the fission yeast S. pombe and was previously reported in S. cerevisiae (Kaufman et al. MCB 1998; Krawitz et al. MCB 2002). This further supports the evolutionary conservation of this genetic assay as a readout for defective histone deposition by CAF-1.

      Reviewer #1 (Recommendations For The Authors):

      • p. 4: "An experimental molecular weight of 179 kDa was calculated using Small Angle X-ray Scattering (SAXS), consistent with a 1:1:1 stoichiometry (Figure S1e). These data are in agreement with a globular complex with a significant flexibility (Figure S1f)." There needs to be more description of the precision of the molecular weight measurement, and what aspects of these data indicate the flexibility.

      The molecular weight was estimated using the correlation volume (Vc) defined by Rambo and Tainer (Nature 2013, 496, 477-481). The estimated error with this method is around 10%. We added this information together with supporting arguments for the existence of flexibility: “An experimental molecular weight of 179 kDa was calculated using Small Angle X-ray Scattering (SAXS). Assuming an accuracy of around 10% with this method (Rambo and Tainer 2013), this value is consistent with a 1:1:1 stoichiometry for the CAF-1 complex (calculated MW 167 kDa) (Figure S1e). In addition, the position of the maximum for the dimensionless Kratky plot was slightly shifted to higher values in the y and x axis compared to the position of the expected maximum of the curve for a fully globular protein (Figure S1f).

      This shows that the complex was globular with a significant flexibility.”
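
      The consistency argument in the quoted passage is a simple tolerance check, sketched below. The two molecular weights and the approximate 10% accuracy are taken from the response; the function and variable names are hypothetical, and expressing the deviation relative to the calculated mass is an assumption of this illustration.

```python
def relative_deviation(measured, expected):
    """Absolute difference between the two values, relative to the expected one."""
    return abs(measured - expected) / expected


measured_mw_kda = 179.0    # SAXS estimate from the correlation volume
calculated_mw_kda = 167.0  # calculated mass for a 1:1:1 complex
assumed_accuracy = 0.10    # approximate accuracy quoted for the method

deviation = relative_deviation(measured_mw_kda, calculated_mw_kda)
consistent = deviation <= assumed_accuracy
print(f"deviation = {deviation:.1%}, within assumed accuracy: {consistent}")
```

      The measured value differs from the calculated one by roughly 7%, which lies inside the assumed 10% accuracy and is therefore compatible with a 1:1:1 stoichiometry.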

      • p. 6, lines 21-22: "In contrast, a large part of signals (338-396) did not vanish anymore upon addition of a histone complex preformed with two other histone chaperones known to compete with CAF-1 for histone binding..." Given the contrast made later with the 338-351 region which is insensitive to Asf1/Mcm2, it would be clearer for the reader to describe the Asf1/Mcm2-competed regions as residues 325-338 plus 352-396. Note that the numerical scale of residues doesn't line up perfectly with the data points in Figure 2d, and this should be fixed as well.

      We thank the reviewer for spotting this typographical error; we intended to write “In contrast, a large part of signals (348-396) did not vanish anymore…”. We modified the paragraph as suggested by the reviewer because we agree it is clearer for the reader: “In contrast, only a shorter fragment (338-347) vanished upon addition of Asf1-H3-H4-Mcm2(69-138), a histone complex preformed with two other histone chaperones, Asf1 and Mcm2, known to compete with CAF-1 for histone binding (Sauer et al. 2017) and whose histone binding modes are well established (Figure 2e) (Huang et al. 2015, Richet et al. 2015). This finding underscores a direct competition between residues (325-338) and (349-396) within the ED domain and Asf1/Mcm2 for histone binding.”

      The slight shift in the numerical scale Figure 2d was also corrected.

      • p. 8. Lines 22-24: "EMSAs with a double-stranded 40bp DNA fragment confirmed the homogeneity of the bound complex. When increasing the SpCAF-1 concentration, additional mobility shifts suggest, a cooperative DNA binding (Figure 3a)." I agree that the migration of the population is further retarded upon the addition of more protein. However, doesn't this negate the first sentence? That is, if multiple CAF-1 complexes can bind each dsDNA molecule, can these complexes be described as homogeneous?

      We fully agree with the reviewer's comment and have removed the notion of homogeneity from the first sentence. “EMSAs with a double-stranded 40bp DNA fragment showed the formation of a bound complex.”

      • Figure S2b Legend: "1H-15N HSQC spectra of Pcf1_ED (425-496)." The residue numbers should read 325-396.

      The typo has been corrected.

      • Is the title for Figure 5 correct?: "Figure 5: Rescue using Y340 and W348 in the ED domain, the intact KER DNA binding domain and the C-terminal WHD of Pcf1 in SpCAF-1 mediated nucleosome assembly." I don't see that any point mutation rescue experiments are done here.

      The title of Figure 5 has been changed to “Efficient nucleosome assembly by SpCAF-1 in vitro requires interactions with H3-H4, DNA and PCNA, and the C-terminal WHD domain”.

      • Figure S6C. I assume the top strain lacks the Pcf2-GFP but this should be stated explicitly.

      The following sentence is now added to the figure legend: “The top strain corresponds to a strain expressing wild-type and untagged Pcf2 as a negative control of GFP fluorescence”. Figure S6C has been modified accordingly to state “Pcf2 (untagged)” explicitly.

      • Regarding point #3 in the public review, a simple initial test of this idea would be to determine if similar amounts of wt and mutant complexes can be immunoprecipitated at the endpoint of the assembly reactions.

      In the absence of available antibodies against the fission yeast CAF-1 complex, we cannot test this hypothesis for technical reasons. However, the in vitro assays of nucleosome assembly are overall consistent with the in vivo assays. Indeed, all mutated forms tested that abolished or weakened nucleosome assembly also exhibited synthetic lethality/growth defect in the absence of a functional HIRA pathway, including the delta WHD mutated form. This genetic synergy, reflecting defective histone deposition by CAF-1, is not specific to the fission yeast S. pombe, as it was previously reported in S. cerevisiae (Kaufman et al. MCB 1998; Krawitz et al. MCB 2002), further supporting the evolutionary conservation of this genetic assay as a readout for defective histone deposition by CAF-1.

      • Foundational findings that should be cited: The role of PCNA in CAF-1 activity was first recognized by pioneering studies in the Stillman laboratory (PMID: 10052459, 11089978). The earliest recombinant studies of CAF-1 showed that the large subunit is the binding platform for the other two, showed that the KER and ED domains were required for histone deposition activity, and roughly mapped the p60-binding site on the large subunit (PMID: 7600578). Another early study roughly mapped the binding site for the third subunit and showed that biological effects of impairing the PCNA binding synergized with defects in the HIR pathway (PMID: 11756556), a genetic synergy first demonstrated in budding yeast (PMID: 9671489).

      We thank the reviewer for providing these important references, which are now cited in the manuscript. PMID: 10052459 and 11089978 are cited on page 2, lines 18 and 19; PMID: 7600578 on page 19, line 5; and PMID: 11756556 and 9671489 on page 18, line 2.

      Reviewer #2 (Public Review):

      Summary:

      The authors describe the structure-functional relationship of domains in S. pombe CAF-1, which promotes DNA replication-coupled deposition of histone H3-H4 dimer. The authors nicely showed that the ED domain with an intrinsically disordered structure binds to histone H3-H4, that the KER domain binds to DNA, and that, in addition to a PIP box, the KER domain also contributes to the PCNA binding. The ED and KER domains as well as the WHD domain are essential for nucleosome assembly in vitro. The ED, KER domains, and the PIP box are important for the maintenance of heterochromatin.

      Strengths:

      The combination of structural analysis using NMR and Alphafold2 modeling with biophysical and biochemical analysis provided strong evidence on the role of the different domain structures of the large subunit of SpCAF-1, spPCF-1 in the binding to histone H3-H4, DNA as well as PCNA. The conclusion was further supported by genetic analysis of the various pcf1 mutants. The large amounts of data provided in the paper support the authors' conclusion very well.

      Reviewer #2 (Recommendations For The Authors):

      The paper by Ochesenbein describes the structural and functional analysis of S. pombe CAF-1 complex critical for DNA replication-coupled histone H3/H4 deposition. By using structural, biophysical, and biochemical analyses combined with genetic methods, the authors nicely showed that a large subunit of SpCAF1, SpPCF-1, consists of 5 structured domains with four connecting IDR domains. The ED domain with IDR nature binds to histone H3-H4 dimer with the conformational change of the other domain(s). SpCAF-1 binds to dsDNA by using the KER domain, but not the WHD domain. The experiments have been done with great care and a large amount of the data are highly reliable. Moreover, the results are clearly presented and convincingly written. The conclusion in the paper is very solid and will be useful for researchers who work in the field of chromosome biology.

      Major points:

      1. DNA binding of the KER mutant shown in Figures S3h and S3i, which was measured by the EMSA, looks similar to that of wild-type control in Figure S3f, which is different from the data in Figures 3b and 3e measured by the MST. The authors need a more precise description of the EMSA result of the KER mutant shown in Figures 3 and S3. The quantification of the EMSA result would resolve the point (should be provided).

      As proposed by the reviewer, we quantified all EMSAs presented in Figure 3 and Figure S3. We quantified the signal of the free DNA band to calculate the percentage of bound DNA in each condition. All EMSA experiments were conducted in duplicate, allowing us to calculate an average value and standard deviation for each interaction. Representative curves and fitted values are reported below in the figure provided for the reviewer (panel a: data for the Pcf1_KER domain with two fitting models; panel b: the entire CAF-1 complexes and mutants; panel c: the isolated Pcf1_KER domains; panel d: all fitted values). Importantly, as illustrated in panel a, the complete model for a single interaction (complete KD model, dashed curve) does not adequately fit the data. In contrast, a function incorporating cooperativity (Hill model) better accounts for the measured data (solid curve). Consistently, we also used the Hill model to fit the binding curves measured with the MST technique. As now also specified in the text, the Hill model allows us to determine an EC50 value (the protein concentration at which half of the free DNA band intensity has disappeared) and a Hill coefficient (representing cooperativity during the interaction) for each curve.
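
      As an illustration of the quantification described above (a minimal sketch, not the authors' analysis code), the percentage of bound DNA and the Hill model can be written as follows. The function names are hypothetical, and normalizing to a protein-free lane is an assumption; the EC50 and Hill coefficient used in the example are illustrative values in the range quoted in the response.

```python
def percent_bound(free_signal, free_signal_no_protein):
    """Percentage of DNA bound, inferred from the loss of free-DNA band signal.

    Assumption: the lane without protein defines 100% free DNA.
    """
    if free_signal_no_protein <= 0:
        raise ValueError("reference free-DNA signal must be positive")
    return 100.0 * (1.0 - free_signal / free_signal_no_protein)


def hill_fraction_bound(protein_conc, ec50, hill_coeff):
    """Hill model: fraction of DNA bound at a given protein concentration.

    ec50 is the protein concentration giving half-maximal binding and
    hill_coeff describes cooperativity (1 means non-cooperative).
    """
    if protein_conc < 0 or ec50 <= 0:
        raise ValueError("concentrations must be non-negative, ec50 positive")
    p = protein_conc ** hill_coeff
    return p / (ec50 ** hill_coeff + p)


# Illustrative comparison at half the EC50 (1.7 uM for an EC50 of 3.4 uM):
# a cooperative curve (h = 3) versus a non-cooperative one (h = 1).
for h in (1.0, 3.0):
    print(h, round(hill_fraction_bound(1.7, ec50=3.4, hill_coeff=h), 3))
```

      At half the EC50, the cooperative curve (h = 3) is at about 11% bound whereas the non-cooperative curve (h = 1) is at about 33%, which is why the Hill coefficient changes the shape of the curve even when the EC50 is unchanged.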

      We measure an EC50 of 3.4 ± 0.4 μM for SpCAF-1 WT, which is higher than the value measured by MST (0.7 ± 0.1 μM). Higher values were also calculated for all mutants and isolated Pcf1_KER domains compared to MST. These discrepancies could arise from the fact that the DNA concentrations used in the two techniques were very different (20 nM for MST experiments and 1 μM for EMSA). Unlike the complete KD model, which includes the DNA concentration (considered here as the “receptor”) in the calculation, the Hill model is fitted independently of this value. This model assumes that the “receptor” concentration is low compared to the KD. Here, the EC50 values calculated by EMSA are of the same order of magnitude as the DNA concentration (low micromolar), so the quantification obtained by EMSA is challenging to interpret. In contrast, the values fitted from the MST measurements are more reliable, since the assumption of a low “receptor” concentration holds in that case.
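
      The effect of the DNA (“receptor”) concentration can be illustrated with the standard quadratic solution for simple 1:1 binding. This is a deliberately simplified sketch: the binding reported here is cooperative, so the numbers below only show the direction of the effect, and the KD of 0.7 μM is a hypothetical value chosen for illustration.

```python
import math


def fraction_dna_bound_1to1(protein_total, kd, dna_total):
    """Fraction of DNA bound for 1:1 binding, accounting for protein depletion.

    All concentrations must share the same unit (for example micromolar).
    """
    s = protein_total + dna_total + kd
    complex_conc = (s - math.sqrt(s * s - 4.0 * protein_total * dna_total)) / 2.0
    return complex_conc / dna_total


def apparent_ec50_1to1(kd, dna_total):
    """Total protein concentration at which half of the DNA is bound.

    At half-saturation the free protein equals kd and dna_total / 2 is
    sequestered in the complex, hence kd + dna_total / 2.
    """
    return kd + dna_total / 2.0


kd = 0.7  # hypothetical KD in uM, for illustration only
for dna in (0.02, 1.0):  # 20 nM (MST) and 1 uM (EMSA) DNA
    print(dna, apparent_ec50_1to1(kd, dna))
```

      Under this 1:1 assumption, the apparent EC50 stays close to the KD at 20 nM DNA but is shifted upward by half the DNA concentration at 1 μM DNA, which is the regime in which a Hill fit no longer reports an affinity.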

      Therefore, although the EC50 and Hill coefficient measurements from EMSA are reproducible, the EC50 values may be misleading as estimates of apparent affinity. Nevertheless, this quantitative analysis of the EMSAs, requested by the reviewer, has highlighted an interesting characteristic of the KER mutant that is consistent across both methods: even though the EMSAs pointed out by the reviewer (Figures S3h and S3i compared to the wild-type control in Figure 3d and Figure S3f) show similar EC50 values, the binding cooperativity is different. Binding of the KER mutants is no longer cooperative (Hill coefficient ~1), and this is observed for all KER curves (isolated Pcf1_KER domain and the entire SpCAF-1 complex) with both methods, EMSA and MST. We thus decided to emphasize this characteristic of the KER mutant in the text (page 9, lines 30-32): “Importantly, this mutant also shows a lower binding cooperativity for DNA binding, as estimated by the Hill coefficient value close to 1, compared to values around 3 for the WT and other mutants.”

      Since the EMSA quantifications did not show a loss of “affinity” (as measured by the EC50 value) for the KER* mutants compared to the WT, contrary to the MST measurements, and because the DNA concentration was close to the measured EC50, we consider that the EC50 values calculated by EMSA do not represent a KD value. If we added this quantification, we would need to discuss this point in detail. Thus, for the sake of clarity, we prefer to present the EMSA measurements in the manuscript as illustrations and qualitative validations of the interaction, without including the quantification.

      Author response image 1.

      Quantitative analysis of interaction with DNA by EMSA. a: quantification of the amount of bound DNA for the Pcf1_KER domain (blue points with error bars). The fit with a KD model is shown as a dashed line, and the fit with a Hill model with a solid line. b: Examples of quantifications and fits (Hill model) for reconstituted SpCAF-1 WT and mutants. c: Examples of quantifications and fits (Hill model) for Pcf1_KER domains WT and mutant. d: EC50 values and Hill coefficients obtained for all EMSA experiments presented in Figure 3 and S3.

      1. As with the cooperative DNA binding of CAF-1, it is very important to show the stoichiometry of CAF-1 to the DNA or the site size. Given a long alpha-helix of the KER domain with biased charges, it is also interesting to show a model of how the dsDNA binds to the long helix with a cooperative binding property (this is not essential but would be helpful if the authors discuss it).

      We agree that having a molecular model for the binding of the KER helix to DNA would be especially interesting, but at this point, considering the accuracy of the tools currently at our disposal for predicting DNA-protein interactions, such a model would remain highly speculative.

      1. Figure 5 shows nucleosome assembly by SpCAF-1. SpCAF-1-PIP* mutant produced a product with faster mobility than the control at 2 h incubation. How much amounts of SpCAF-1 was added in the reaction seems to be critical. At least a few different concentrations of proteins should be tested.

      The slightly faster migration with SpCAF-1-PIP* is not systematically reproduced: in several experiments, the band corresponding to supercoiled DNA migrated slightly above or below the one obtained upon complementation with SpCAF-1-WT (see Author response image 2 below). This indicates that, after 2 hours of incubation, the supercoiling achieved with the SpCAF-1-PIP* mutant is comparable to that achieved with SpCAF-1-WT. To further document whether the WT and the PIP* mutant differ, we compared their nucleosome assembly efficiency by testing their ability to produce supercoiled DNA over a shorter time, after 45 minutes of incubation. Under these conditions, we reproducibly detected supercoiled forms at earlier times with SpCAF-1-WT than with SpCAF-1-PIP* (see Figure 5 and Author response image 2). These observations indicate that mutation of the PIP motif of Pcf1 affects the rate of supercoiling in a manner distinct from the other mutations, which dramatically impair the capacity of SpCAF-1 to promote supercoiling.

      Author response image 2.

      Minor points:

      1. Page 8, line 26 or Table 1 legend: Please explain what "EC50" is.

      The definition of EC50, together with a reference paper for the Hill model, has been added in the text (page 8, lines 23-26): “The curves were fitted with a Hill model (Tso et al. 2018) with an EC50 value of 0.7 ± 0.1 µM (effective concentration at which a 50% signal is observed) and a cooperativity (Hill coefficient, h) of 2.7 ± 0.2, in line with a cooperative DNA binding of SpCAF-1.” It has also been added in the Table 1 legend and in the Methods section (page 26).

      1. Page 13, lines 9, 11: "Xenopus" should be italicized.

      This has been corrected.

      1. Page 14, second half: In S. pombe, the pcf1 deletion mutant is not lethal. It is helpful to mention the phenotype of the deletion mutant a bit more when the authors described the genetic analysis of various pcf1 mutants.

      This point has been added on page 15, line 1.

      1. Figure 1d and Figure S2a: Captions and labels on the X and Y axes are overlapped or misplaced.

      This has been corrected.

      1. Figure 5: Please add a schematic figure of the assay to explain how one can check the nucleosome assembly by looking at the form I, supercoiled DNAs.

      A new panel has been added to Figure 5. This scheme depicts the supercoiling assay where supercoiled DNA (form I) is used as an indication of efficient nucleosome assembly. The figure legend has also been modified accordingly.

      Reviewer #3 (Public Review):

      Summary:

      The study conducted by Ouasti et al. is an elegant investigation of fission yeast CAF-1, employing a diverse array of technologies to dissect its functions and their interdependence. These functions play a critical role in specifying interactions vital for DNA replication, heterochromatin maintenance, and DNA damage repair, and their dynamics involve multiple interactions. The authors have extensively utilized various in vitro and in vivo tools to validate their model and emphasize the dynamic nature of this complex.

      Strengths:

      Their work is supported by robust experimental data from multiple techniques, including NMR and SAXS, which validate their molecular model. They conducted in vitro interactions using EMSA and isothermal microcalorimetry, in vitro histone deposition using Xenopus high-speed egg extract, and systematically generated and tested various genetic mutants for functionality in in vivo assays. They successfully delineated domain-specific functions using in vitro assays and could validate their roles to large extent using genetic mutants. One significant revelation from this study is the unfolded nature of the acidic domain, observed to fold when binding to histones. Additionally, the authors also elucidated the role of the long KER helix in mediating DNA binding and enhancing the association of CAF-1 with PCNA. The paper effectively addresses its primary objective and is strong.

      Weaknesses:

      A few relatively minor unresolved aspects persist, which, if clarified or experimentally addressed by the authors, could further bolster the study.

      1. The precise function of the WHD domain remains elusive. Its deletion does not result in DNA damage accumulation or defects in heterochromatin maintenance. This raises questions about the biological significance of this domain and whether it is dispensable. While in vitro assays revealed defects in chromatin assembly using this mutant (Figure 5), confirming these phenotypes through in vivo assays would provide additional assurance that the lack of function is not simply due to the in vitro system lacking PTMs or other regulatory factors.

      Our work demonstrates that the WHD domain is important for CAF-1 function during DNA replication. Indeed, the deletion of this domain leads to synthetic lethality when combined with mutation of the HIRA complex, as observed for a null pcf1 mutant, indicating a severe loss of function in the absence of the WHD domain. We propose that these genetic interactions, previously reported in S. cerevisiae (Kaufman et al. MCB 1998; Krawitz et al. MCB 2002), are indicative of defective histone deposition by CAF-1. Moreover, our work establishes that this domain is dispensable for preventing DNA damage accumulation and for maintaining silencing at centromeric heterochromatin, indicating that the WHD domain specifies CAF-1 functions. Our work further demonstrates that, in contrast to the S. cerevisiae and human WHD domains, the S. pombe counterpart exhibits no DNA binding activity. We thus agree that the WHD domain may contribute to nucleosome assembly in vivo via PTMs or interactions with regulatory factors that may be lacking in in vitro systems. However, addressing these aspects deserves further investigation beyond the scope of this article.

      1. The observation of increased Pcf2-gfp foci in pcf1-ED cells, particularly in mono-nucleated (G2phase) and bi-nucleated cells with septum marks (S-phase), might suggest the presence of replication stress. This could imply incomplete replication in specific regions, leading to the persistence of Caf1-ED-PCNA factories throughout the cell cycle. To further confirm this, detecting accumulated single-stranded DNA (ssDNA) regions outside of S-phase using RPA as an ssDNA marker could be informative.

      We cannot formally exclude that cells expressing the Pcf1-ED mutated form exhibit incomplete replication in specific regions, an aspect that would require careful investigation. However, the microscopy analysis (Fig. 6c and S6c) of this mutant showed no alteration in cell morphology; in particular, there were no elongated cells compared to wild type, a hallmark of checkpoint activation caused by ssDNA (Enoch et al. Genes & Dev 1992). Therefore, investigating the consequences of the interplay between the binding of CAF-1 to PCNA and histones on the dynamics of DNA replication is of particular interest but beyond the scope of the current manuscript.

      1. Moreover, considering the authors' strong assertion of histone binding defects in ED through in vitro assays (Figure 2d and S2a), these claims could be further substantiated, especially considering that some degree of histone deposition might still persist in vivo in the ED mutant (Figure 7d, viable though growth defective double ED*+hip1D mutants). For example, the approach, akin to the one employed in Fig. 6a (FLAG-IPs of various Pcf1-FLAG-tagged mutants), could also enable a comparison of the association of different mutants with histones and PCNA, providing a more thorough validation of their findings.

      The current manuscript provides data establishing how Pcf1 mutated forms interact with PCNA (Fig. 6a, 6b). Regarding the interactions with histone H3-H4, the approach based on immunoprecipitation using various Pcf1-FLAG-tagged mutants has been unsuccessful in our hands. Indeed, we were unable to obtain robust and reproducible interactions between Pcf1 or its various mutated forms and H3-H4. This is likely because co-IP approaches do not probe for direct interactions. Indirect interactions between Pcf1 and H3-H4 are potentially bridged by additional factors, including the two other subunits of CAF-1, Pcf2 and Pcf3, or Asf1. Therefore, we are not in a position to address the direct interactions between Pcf1 and histone H3-H4 in vivo.

      1. It would be valuable for the authors to speculate on the necessity of having disordered regions in CAF1. Specifically, exploring the overall distribution of these domains within disordered/unfolded structures could provide insightful perspectives. Additionally, it's intriguing to note that the significant disparities observed among mutants (ED, PIP, and KER*) in in vitro assays seem to become more generic in vivo, except for the indispensability of the WHD-domain. Could these disordered regions potentially play a crucial role in the phase separation of replication factories? Considering these questions could offer valuable insights into the underlying mechanisms at play.

      We agree that the potential mechanistic role of partial disorder in CAF-1 is particularly interesting. Disordered regions of human CAF-1 have been reported to form nuclear bodies with liquid-liquid phase separation properties to maintain HIV latency (Ma et al. EMBO J. 2021). As suggested, this raises the question of how disordered domains of Pcf1 could promote phase separation for replication factories, if such a phenomenon happens in vivo. Moreover, numerous factors of the replisome also harbor disordered regions (Bedina, A. et al, 2013. Intrinsically Disordered Proteins in Replication Process. InTech. doi: 10.5772/51673), adding complexity to disentangling such questions experimentally. We have added these elements at the end of the discussion in the revised manuscript (page 20, lines 23-29): “Such plasticity and cross-talk provided by structurally disordered domains might be key for the multivalent CAF-1 functions. Human CAF-1 has been reported to form nuclear bodies with liquid-liquid phase separation properties to maintain HIV latency (Ma et al. 2021). This raises the question of a potential role of the disordered domains of Pcf1, together with other replisome factors harbouring such disordered regions (Bedina 2013), in promoting phase separation of replication factories, if such a phenomenon happens in vivo. Further studies will be needed to tackle these questions.”

    1. Author response:

      The following is the authors’ response to the original reviews.

      We thank you for the time you took to review our work and for your feedback! The main changes to the manuscript are: 

      (1) We have added additional analysis of running onsets in closed and open loop conditions for audiomotor (Figure 2H) and visuomotor (Figure 3H) coupling.  

      (2) We have also added analysis of running speed and pupil dilation upon mismatch presentation (Figures S2A and S2B, S4A and S4B, and S5A and S5B).

      (3) We have expanded on the discussion of the nature of differences between audiomotor and visuomotor mismatches.

      Reviewer #1:

      The manuscript presents a short report investigating mismatch responses in the auditory cortex, following previous studies focused on the visual cortex. By correlating the mouse locomotion speed with acoustic feedback levels, the authors demonstrate excitatory responses in a subset of neurons to halts in expected acoustic feedback. They show a lack of responses to mismatch in the visual modality. A subset of neurons show enhanced mismatch responses when both auditory and visual modalities are coupled to the animal's locomotion. 

      While the study is well-designed and addresses a timely question, several concerns exist regarding the quantification of animal behavior, potential alternative explanations for recorded signals, correlation between excitatory responses and animal velocity, discrepancies in reported values, and clarity regarding the identity of certain neurons. 

      Strengths: 

      (1) Well-designed study addressing a timely question in the field. 

      (2) Successful transition from previous work focused on the visual cortex to the auditory cortex, demonstrating generic principles in mismatch responses. 

      (3) The correlation between mouse locomotion speed and acoustic feedback levels provides evidence for a prediction signal in the auditory cortex. 

      (4) Coupling of visual and auditory feedback shows putative multimodal integration in the auditory cortex. 

      Weaknesses: 

      (1) Lack of quantification of animal behavior upon mismatches, potentially leading to alternative interpretations of recorded signals. 

      (2) Unclear correlation between excitatory responses and animal velocity during halts, particularly in closed-loop versus playback conditions. 

      (3) Discrepancies in reported values in a few figure panels raise questions about data consistency and interpretation. 

      (4) Ambiguity regarding the identity of the [AM+VM] MM neurons. 

      The manuscript is a short report following up on a series of papers focusing on mismatch responses between sensory inputs and predicted signals. While previous studies focused on the visual modality, here the authors moved to the auditory modality. By pairing mouse locomotion speed to the sound level of the acoustic feedback, they show that a subpopulation of neurons displays excitatory responses to halts in the (expected) acoustic feedback. These responses were lower in the open-loop state, when the feedback was uncorrelated to the animal locomotion. 

      Overall it is a well-designed study, with a timely and well-posed question. I have several concerns regarding the nature of the MM responses and their interpretations. 

      - One lacks quantification of the animal behavior upon mismatches. Behavioral responses may trigger responses in the mouse auditory cortex, and this would be an alternative explanation to the recorded signals. 

      What is the animal speed following closed-loop halts (we only have these data for the playback condition)? 

      We have quantified the running speed of the mouse following audiomotor and visuomotor mismatches. We found no evidence of a change in running speed. We have added this to Figures S2A and S4A, respectively.

      Is there any pupillometry to quantify possible changes in internal states upon halts (both closed-loop and playback)?

      The term 'internal state' may be somewhat ambiguous in this context. We assume the reviewer is asking whether we have any evidence for possible neuromodulatory changes. We know that there are noradrenergic responses in visual cortex to visuomotor mismatches (Jordan and Keller, 2023), but no cholinergic responses (Yogesh and Keller, 2023). Pupillometry, however, is likely not always sensitive enough to pick up these responses. With very strong neuromodulatory responses (e.g. to air puffs, or other startling stimuli), pupil dilation is of course detected, but this effect is likely at best threshold linear. Looking at changes in pupil size following audiomotor and visuomotor mismatch responses, we found no evidence of a change. We have added this to Figures S2B and S4B, respectively. Note, we suspect this is also strongly experience-dependent. The first audio- or visuomotor mismatch the mouse encounters is likely a more salient stimulus (to the rest of the brain, not necessarily to auditory or visual cortex), than the following ones.  

      These quantifications must be provided for the auditory mismatches but also for the VM or [AM+VM] mismatches.  

      During the presentation of multimodal mismatches [AM + VM], mice did not exhibit significant changes in running speed or pupil diameter. These data have now been added to Figures S5A and S5B.

      - AM MM neurons supposedly receive a (excitatory) locomotion-driven prediction signal. Therefore the magnitude of the excitation should depend on the actual animal velocity. Does the halt-evoked response in a closed loop correlate with the animal speed during the halt? Is the correlation less in the playback condition? 

      This is indeed what one would expect. We fear, however, that we don’t have sufficient data to address this question properly. Moreover, there is an important experimental caveat that makes the interpretation of the results difficult. In addition to the sound we experimentally couple to the locomotion speed of the mouse, the mouse self-generates sound by running (the treadmill rotating, changes to the airflow of the air-supported treadmill, footsteps, etc.). These sources of sound all also correlate in intensity with running speed. Thus, it is not entirely clear how our increase in sound amplitude with increasing running speed relates to the increase in self-generated sounds on the treadmill. This is one of the key reasons we usually do this type of experiment in the visual system where experimental control of visual flow feedback (in a given retinotopic location) is straightforward. 

      Having said that, if we look at how mismatch responses change as a function of locomotion speed across the entire population of neurons, there appears to be no systematic change with running speed (and the effects are highly dependent on the speed bins we choose). However, just looking at the most audiomotor mismatch responsive neurons, we find a trend for increased responses with increasing running speed (Author response image 1). We analyzed the top 5% of cells that showed the strongest response to mismatch (MM) and divided the MM trials into three groups based on running speed: slow (10-20 cm/s), middle (20-30 cm/s), and fast (>30 cm/s). Given that we have on average 14 mismatch events in total per neuron, we don’t have sufficient data to analyze this properly.

      Author response image 1.

      The average response of strongest AM MM responders to AM mismatches as a function of running speed (data are from 51 cells, 11 fields of view, 6 mice). 
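
      The speed-binned analysis described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names and the input format (pairs of running speed and response) are hypothetical, and the handling of bin edges (lower edge inclusive, upper edge exclusive, trials below 10 cm/s discarded) is an assumption, since the response does not specify it.

```python
from statistics import mean

# Speed bins in cm/s as given in the response; edge handling is an assumption.
SPEED_BINS = {
    "slow": (10.0, 20.0),
    "middle": (20.0, 30.0),
    "fast": (30.0, float("inf")),
}


def speed_bin(speed_cm_s):
    """Return the name of the bin containing this speed, or None if outside."""
    for name, (low, high) in SPEED_BINS.items():
        if low <= speed_cm_s < high:
            return name
    return None


def mean_response_by_speed(trials):
    """Average mismatch response per speed bin.

    trials: iterable of (running_speed_cm_s, response) pairs, one per
    mismatch event. Bins without any trial map to None.
    """
    grouped = {name: [] for name in SPEED_BINS}
    for speed, response in trials:
        name = speed_bin(speed)
        if name is not None:
            grouped[name].append(response)
    return {name: (mean(vals) if vals else None) for name, vals in grouped.items()}
```

      With roughly 14 mismatch events per neuron split over three bins, each bin holds only a handful of trials, which is consistent with the caution expressed above about how strongly the result depends on the chosen bins.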

      Values in Figure 2H are way higher than what can be observed in Figures 2C, and D. Could you explain the mismatch in values? Same for 3H and 4F. 

      In Figure 2H (now Figure S2F), we display responses from 4755 individual neurons. Since most recorded neurons did not exhibit significant responses to mismatch presentations, their responses cluster around zero and pull the final average shown in panel D toward zero. To clarify how individual neurons contribute to the overall population activity, we have added a histogram showing the distribution of neurons responding to audiomotor mismatch and sound playback halts. We hope this addition clarifies how individual neuron responses affect the final population activity.
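
      The arithmetic behind this point can be shown with a small synthetic example. The numbers below are invented purely for illustration and are not taken from the data: a minority of strongly responsive neurons yields a much smaller population average once the non-responsive majority is included.

```python
def population_mean(responses):
    """Average response across all neurons, responsive or not."""
    if not responses:
        raise ValueError("need at least one neuron")
    return sum(responses) / len(responses)


# Synthetic population: 50 responsive neurons with a response of 10
# (arbitrary units) and 950 non-responsive neurons at 0.
responders = [10.0] * 50
non_responders = [0.0] * 950

print(population_mean(responders))                   # responders alone
print(population_mean(responders + non_responders))  # whole population
```

      In this toy case the individual responders sit twenty times higher than the population average, the same kind of gap that separates a per-neuron scatter plot from a population trace.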

      Furthermore, neurons exhibiting suppression upon closed-loop halts (Figure 2C) show changes in deltaF/F of the same order of magnitude as the AM MM neurons (with excitatory responses). I cannot picture where these neurons are found in the scatter plot of Figure 2H. 

      This is caused by a ceiling effect. While we could adjust the scale of the heat map to capture neurons with very high responses (e.g. [-50 50], Author response image 2), doing so would obscure the response dynamics of most neurons. Note that the number of neurons on the y-axis far exceeds the resolution of this figure and thus there are also aliasing issues that mask the strong responses. 

      Author response image 2.

      Responses of all L2/3 ACx neurons to audiomotor mismatches. Same as Figure 2C with different color scale [-50 50] which does not capture most of the neural activity.  

      - Are [AM+VM] MM neurons AM neurons? 

      Many of the [AM + VM] and [AM] neurons overlap, but they are not exactly the same population. This is partially visible in Figure 4F. There is a subset of neurons (13.7%; red dots, Figure 4F) that selectively responded to the concurrent [AM+VM] mismatch, while a different subset of neurons (11.2%; yellow dots, Figure 4F) selectively responded to the mismatches in isolation. The [VM] response contributes only a little to the sum of the two responses [AM] + [VM].

      Please do not use orange in Figure 4F, it is perceptually too similar to red. 

      We have now changed it to yellow. 

      Reviewer #2 (Public Review): 

      In this study, Solyga and Keller use multimodal closed-loop paradigms in conjunction with multiphoton imaging of cortical responses to assess whether and how sensorimotor prediction errors in one modality influence the computation of prediction errors in another modality. Their work addresses an important open question pertaining to the relevance of non-hierarchical (lateral cortico-cortical) interactions in predictive processing within the neocortex. 

      Specifically, they monitor GCaMP6f responses of layer 2/3 neurons in the auditory cortex of head-fixed mice engaged in VR paradigms where running is coupled to auditory, visual, or audio-visual sensory feedback. The authors find strong auditory and motor responses in the auditory cortex, as well as weak responses to visual stimuli. Further, in agreement with previous work, they find that the auditory cortex responds to audiomotor mismatches in a manner similar to that observed in visual cortex for visuomotor mismatches. Most importantly, while visuomotor mismatches by themselves do not trigger significant responses in the auditory cortex, simultaneous coupling of audio-visual inputs to movement non-linearly enhances mismatch responses in the auditory cortex. 

      Their results thus suggest that prediction errors within a given sensory modality are non-trivially influenced by prediction errors from another modality. These findings are novel, interesting, and important, especially in the context of understanding the role of lateral cortico-cortical interactions and in outlining predictive processing as a general theory of cortical function. 

      In its current form, the manuscript lacks sufficient description of methodological details pertaining to the closed-loop training and the overall experimental design. In several scenarios, while the results per se are convincing and interesting, their exact interpretation is challenging given the uncertainty about the actual experimental protocols (more on this below). Second, the authors are laser-focused on sensorimotor errors (mismatch responses) and focus almost exclusively on what happens when stimuli deviate from the animal's expectations. 

      While the authors consistently report strong running-onset responses (during open-loop) in the auditory cortex in both auditory and visual versions of the task, they do not discuss their interpretation in the different task settings (see below), nor do they analyze how these responses change during closed-loop i.e. when predictions align with sensory evidence. 

      However, I believe all my concerns can be easily addressed by additional analyses and incorporation of methodological details in the text. 

      Major concerns: 

      (1) Insufficient analysis of audiomotor mismatches in the auditory cortex: 

      Lack of analysis of the dependence of audiomotor mismatches on the running speed: it would be helpful if the authors could clarify whether the observed audiomotor mismatch responses are just binary or scale with the degree of mismatch (i.e. running speed). Along the same lines, how should one interpret the lack of dependence of the playback halt responses on the running speed? Shouldn't we expect that during playback, the responses of mismatch neurons scale with the running speed? 

      Regarding the scaling of AM mismatch responses with running speed, please see our response to reviewer 1 above to the same question. 

      Regarding the playback halt response and dependence on running speed, we would not expect there to be a dependence. The playback halt response (by design) measures the strength of the sensory response to a cessation of a stimulus (think OFF response). These typically are less strong in cortex than the corresponding ON responses but need to be controlled for (else a mismatch response might just be an OFF response – the prediction error is quantified as the difference between AM mismatch response and playback halt response). Given that sound onset responses only have a small dependence on running state, we would similarly expect sound offset (playback halt) responses to exhibit only minimal dependence on running state. 
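      The subtraction described above can be sketched in a few lines. This is a minimal illustration and not the analysis code used for the paper: the function names, the toy dF/F values, and the choice of response window are all assumptions made for the example.

```python
from statistics import mean

def mean_response(trials, window):
    """Average dF/F inside a post-onset window, then across trials.

    trials: list of per-trial dF/F traces aligned to event onset (hypothetical layout).
    window: (start, stop) sample indices of the response window (assumed, not from the paper).
    """
    start, stop = window
    return mean(mean(trial[start:stop]) for trial in trials)

def prediction_error(mismatch_trials, halt_trials, window):
    """Mismatch response minus playback halt response.

    The playback halt term controls for a purely sensory OFF response.
    """
    return mean_response(mismatch_trials, window) - mean_response(halt_trials, window)

# Toy traces in % dF/F; index 0 is the pre-onset sample.
mismatch = [[0.0, 3.0, 3.0], [0.0, 2.0, 4.0]]
halt = [[0.0, 1.0, 1.0], [0.0, 1.0, 1.0]]
pe = prediction_error(mismatch, halt, (1, 3))
```

      If the mismatch response were only an OFF response, the two averages would be similar and the difference would sit near zero.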

      Slow temporal dynamics of audiomotor mismatches: despite the transient nature of the mismatches (1s), auditory mismatch responses last for several seconds. They appear significantly slower than previous reports for analogous visuomotor mismatches in V1 (by the same group, using the same methods) and even in comparison to the multimodal mismatches within this study (Figure 4C). What might explain this sustained activity? Is it due to a sustained change in the animal's running in response to the auditory mismatch? 

      This is correct: neither AM nor [AM+VM] mismatch responses return to baseline in the 3 seconds following onset. VM mismatch responses in visual cortex also do not return to baseline in that time window (see e.g. Figure 1E in (Attinger et al., 2017), or Figure 1F in (Zmarz and Keller, 2016)). What the origin or computational significance of this sustained calcium response is, we do not know. In intracellular signals, we do not see this sustained response (Jordan and Keller, 2020). Also peculiar is indeed the fact that in the case of AM mismatch the sustained response is similar in strength to the initial response. But also here, why this would be the case, we do not know. It is conceivable that the initial and the sustained calcium responses have different origins. If the sustained response amplitude is all or nothing, the fact that the AM mismatch response is the smallest of the three could explain why sustained and initial responses are closer than for [AM+VM] or VM (in visual cortex) mismatch responses. All sustained responses appear to be roughly 1% dF/F. There are no apparent changes in running speed or pupil dilation that would correlate with the sustained activity (new panel A in Figure S2). 

      (2) Insufficient analysis and discussion of running onset responses during audiomotor sessions: The authors report strong running-onset responses during open-loop in identified mismatch neurons. They also highlight that these responses are in agreement with their model of subtractive prediction error, which relies on subtracting the bottom-up sensory evidence from top-down motor-related predictions. I agree, and, thus, assume that running-onset responses during the open loop in identified 'mismatch' neurons reflect the motor-related predictions of sensory input that the animal has learned to expect. If this is true, one would expect that such running-onset responses should dampen during closed-loop, when sensory evidence matches expectations and therefore cancels out this prediction. It would be nice if the authors test this explicitly by analyzing the running-related activity of the same neurons during closed-loop sessions. 

      Thank you for the suggestion. We now show running onset responses in both closed and open loop conditions for audiomotor and visuomotor coupling (new Figures 2H and 3H). In closed loop, we observe only a transient running onset response. In the open loop condition, running onset responses are sustained. For the visuomotor coupling, running onset responses are sustained in both closed and open loop conditions. This would be consistent with a slightly delayed cancellation of sound and motor related inputs in the audiomotor closed loop condition but not otherwise. 

      (3) Ambiguity in the interpretation of responses in visuomotor sessions. 

      Unlike for auditory stimuli, the authors show that there are no obvious responses to visuomotor mismatches or playback halts in the auditory cortex. However, the interpretation of these results is somewhat complicated by the uncertainty related to the training history of these mice. Were these mice exclusively trained on the visuomotor version of the task or also on the auditory version? I could not find this info in the Methods. From the legend for Figure 4D, it appears that the same mice were trained on all versions of the task. Is this the case? If yes, what was the training sequence? Were the mice first trained on the auditory and then the visual version? 

      The training history of the animals is important to outline the nature of the predictions and mismatch responses that one should expect to observe in the auditory cortex during visuomotor sessions.

      Depending on whether the mice in Figure 3 were trained on visual only or both visual and auditory tasks, the open-loop running onset responses may have different interpretations. 

      a) If the mice were trained only on the visual task, how should one interpret the strong running onset responses in the auditory cortex? Are these sensorimotor predictions (presumably of visual stimuli) that are conveyed to the auditory cortex? If so, what may be their role? 

      b) If the mice were also trained on the auditory version, then a potential explanation of the running-onset responses is that they are audiomotor predictions lingering from the previously learned sensorimotor coupling. In this case, one should expect that in the visual version of the task, these audiomotor predictions (within the auditory cortex) would not get canceled out even during the closed-loop periods. In other words, mismatch neurons should constantly be in an error state (more active) in the closed-loop visuomotor task. Is this the case? 

      If so, how should one then interpret the lack of a 'visuomotor mismatch' aligned to the visual halts, over and above this background of continuous errors? 

      As such, the manuscript would benefit from clearly stating in the main text the experimental conditions such as training history, and from discussing the relevant possible interpretations of the responses. 

      Mice were not trained on either audiomotor or visuomotor coupling and were reared normally. Prior to the recording day, the mice were habituated to running on the air-supported treadmill without any coupling for up to 5 days. On the first recording day, the mice experienced all three types of sessions (audiomotor, visuomotor, or combined coupling) in a random order for the first time. We have clarified this in the methods. 

      Regarding the question of how one should interpret the strong running onset responses in the auditory cortex, this is complicated by the fact that – unless mice are raised visually or auditorily deprived – they always have life-long experience with visuomotor or audiomotor coupling. The visuomotor coupling they experience in VR is geometrically matched to what they would experience by moving in the real world. For the audiomotor coupling the exact relationship is less clear, but there is a diverse set of sound sources that scale in loudness with increasing running speed. Hence running onset responses reflect either such learned associations (as the reviewer also speculates), or spurious input. Rearing mice without coupling between movement and visual feedback does not abolish movement related responses in visual cortex (Attinger et al., 2017). To the contrary, it enhances them considerably. We suspect this reflects visual cortex being recruited for other functions in the absence of visual input. But given the data we have, we cannot distinguish the different possible sources of running related responses. It is very likely that any “training” related effect we could achieve in a few hours pales in comparison to the life-long experience the mouse has in the world. 

      Regarding the lack of a 'visuomotor mismatch' aligned to the visual halts, we are not sure we understand. Our interpretation is that there are no (or only a very small - we speculate that any nonzero VM mismatch response is just inherited from visual cortex) VM mismatch responses in auditory cortex above chance. Our data are consistent with the interpretation that there is no opposition of bottom up visual and top down motor related input in auditory cortex, hence no VM mismatch responses (independent of how strong the top-down motor related input is). This is of course not surprising – this is more of a sanity check and becomes relevant in the context of interpreting AM+VM responses. 

      (4) Ambiguity in the interpretation of responses in multimodal versus unimodal sessions. 

      The authors show that multimodal (auditory + visual) mismatches trigger stronger responses than unimodal mismatches presented in isolation (auditory only or visual only). Further, they find that even though visual mismatches by themselves do not evoke a significant response, co-presentation of visual and auditory stimuli non-linearly augments the mismatch responses suggesting the presence of non-hierarchical interactions between various predictive processing streams. 

      In my opinion, this is an important result, but its interpretation is nuanced given insufficient details about the experimental design. It appears that responses to unimodal mismatches are obtained from sessions in which only one stimulus is presented (unimodal closed-loop sessions). Is this actually the case? An alternative and perhaps cleaner experimental design would be to create unimodal mismatches within a multimodal closed-loop session while keeping the other stimulus still coupled to the movement. 

      This is correct, unimodal mismatches were acquired in unimodal coupling. Testing unimodal mismatch responses in multimodally coupled VR is an interesting idea that we initially even pursued. However, halting visual flow in a condition of coupling of both visual flow and sound amplitude to running speed has an additional complication. Introducing an audiomotor mismatch in this coupling inherently also creates an audiovisual (AV) mismatch, and the same applies to visuomotor mismatches, which cause a concurrent visuoaudio (VA) mismatch (Author response image 3). This assumes that there are cross modal predictions from visual cortex to auditory cortex as there are from auditory cortex to visual cortex (Garner and Keller, 2022). There are interesting differences between the different types of mismatches, but with all the necessary passive controls this quickly exceeded the amount of data we could reasonably acquire for this paper. This remains an interesting question for future research. 

      Author response image 3.

      Rationale of unimodal mismatches introduced within multimodal paradigm. 

      Given the current experiment design (if my assumption is correct), it is unclear if the multimodal potentiation of mismatch responses is a consequence of nonlinear interactions between prediction/error signals exchanged across visual and auditory modalities. Alternatively, could this result from providing visual stimuli (coupled or uncoupled to movement) on top of the auditory stimuli? If it is the latter, would the observed results still be evidence of non-hierarchical interactions between various predictive processing streams? 

      Mice are not in complete darkness during the AM mismatch experiments (the VR is off, but there is low ambient light in the experimental rooms, primarily from computer screens), so we can rule out the possibility that the difference comes from having “no” visual input during AM mismatch responses. Addressing the question of whether it is this particular stimulus that causes the increase would require an experiment in which we couple sound amplitude but keep visual flow open loop. We did not do this, but also think this is highly unlikely. However, as described above, we did do an experiment in which we coupled both sound amplitude and visual flow to running, and then either halted visual flow, or sound amplitude, or both. Comparing the [AM+VM] and [AM+AV] mismatch responses, we find that [AM+VM] responses are larger than [AM+AV] responses, as one would expect from an interaction between [AM] and [VM] responses (Author response image 4). Finally, the conclusion that there are non-hierarchical interactions of prediction error computations holds either way – if any visual stimulus (either visuomotor mismatch, or visual flow responses) influences audiomotor mismatch responses, this is evidence of non-hierarchical interactions. 

      Author response image 4.

      Average population response of all L2/3 neurons to concurrent [AM + VM] or [AM+AV] mismatch. Gray shading indicates the duration of the stimulus.

      Along the same lines, it would be interesting to analyze how the coupling of visual as well as auditory stimuli to movement influences responses in the auditory cortex in closed-loop in comparison to auditory-only sessions. Also, do running onset responses change in open-loop in multimodal vs. unimodal playback sessions? 

      We agree, and this is why we started out doing the experiments described above. We stopped with this, however, because it quickly became a combinatorial nightmare. We will leave addressing the question of how different types of coupling influence responses in auditory cortex to brave future neuroscientists. 

      Regarding the question of running onset responses, in both the multimodal and auditory only paradigms, running onset responses are transient; bottom-up sensory evidence is quickly subtracted from top-down motor-related prediction (Author response image 5). While there appears to be a small difference in the dynamics of running onset responses between these two paradigms, it was not significant. Note, we also have much less data than we would like here for this type of analysis. 

      Author response image 5.

      Running onset responses recorded in unimodal and multimodal closed loop sessions (1903 neurons, 16 fields of view, 8 mice)

      We also compared running onsets in open loop sessions and did not find any significant differences between unimodal and multimodal sessions (Author response image 6). We found only six sessions in which animals performed at least two running onsets in each session type, therefore, we do not have enough data to include it in the manuscript. 

      Author response image 6.

      Running onset responses recorded within unimodal and multimodal open loop sessions (659 cells, 6 fields of view, 5 mice).

      Minor concerns and comments:

      (1) Rapid learning of audiomotor mismatches: It is interesting that auditory mismatches are present even on day 1 and do not appear to get stronger with learning (same on day 2). The authors comment that this could be because the coupling is learned rapidly (line 110). How does this compare to the rate at which visuomotor coupling is learned? Is this rapid learning also observable in the animal's behavior i.e. is there a change in running speed in response to the mismatch? 

      In the visual system this is a bit more complicated. If you look at visuomotor mismatch responses in a normally reared mouse, responses are present from the first mismatch (as far as we can tell given the inherently small dataset with just one response per mouse). However, this is of course confounded by the fact that a normally reared mouse has visuomotor coupling throughout life from eye-opening. Raising mice in complete darkness, we have shown that approximately 20 min of coupling are sufficient to establish visuomotor mismatch responses (Attinger et al., 2017). 

      Regarding the behavioral changes that correlate with learning, we are not sure what the reviewer would expect. We cannot detect a change in mismatch responses and hence would also not expect to see a change in behavior.

      (2) The authors should clarify whether the sound and running onset responses of the auditory mismatch neurons in Figure 2E were acquired during open-loop. This is most likely the case, but explicitly stating it would be helpful. 

      Both responses were measured in isolation (i.e. VR off, just sound and just running onset), not in an open-loop session. We have clarified in the figure legend that these are the same data as in Figure 1H and N. 

      (3) In lines 87-88, the authors state 'Visual responses also appeared overall similar but with a small increase in strength during running ...'. This statement would benefit from clarification. From Figure S1 it appears that when the animal is sitting there are no visual responses in the auditory cortex. But when the animal is moving, small positive responses are present. Are these actually 'visual' responses - perhaps a visual prediction sent from the visual cortex to the auditory cortex that is gated by movement? If so, are they modulated by features of visual stimuli eg. contrast, intensity? Or, do these responses simply reflect motor-related activity (running)? Would they be present to the same extent in the same neurons even in the dark? 

      This was wrong indeed - we have rephrased the statement as suggested. Regarding the source of visual responses, we use the term “visual response” operationally here agnostic to what pathway might be driving it (i.e. it could be a prediction triggered by visual input). 

      We did not test if recorded visual responses are modulated by contrast or intensity. However, testing whether they are would not help us distinguish whether the responses are ‘visual’ or ‘visual predictions’. Finally, regarding the question about whether they are motor-related responses, this might be a misunderstanding. These are responses to visual stimuli while the mouse is already running (i.e. there is no running onset), hence we cannot test whether these responses are present in the dark (this would be the equivalent of looking at random triggers in the dark while the mouse is running).  

      (4) The authors comment in the text (lines 106-107) about cessation of sound amplitude during audiomotor mismatches as being analogous to halting of visual flow in visuomotor mismatches. However, sound amplitude versus visual flow are quite different in nature. In the visuomotor paradigm, the amount of visual stimulation (photons per unit time) does not necessarily change systematically with running speed. Whereas, in the audiomotor paradigm, the SNR of the stimulus itself changes with running speed which may impact the accuracy of predictions. On a broader note, under natural settings, while the visual flow is coupled to movement, sound amplitude may vary more idiosyncratically with movement. 

      This is a question of coding space. The coding space of visual cortex of the mouse is probably visual flow (or change in image) not number of photons. This already starts in the retina. The demonstration of this is quite impressive. A completely static image on the retina will fade to zero response (even though the number of photons remains constant). This is also why most visual physiologists use dynamic stimuli – e.g. drifting gratings, not static gratings – to map visual responses in visual cortex. If responses were linear in number of photons, this would make less of a difference. The correspondence we make is between visual flow (which we assume is the main coding space of mouse V1 – this is not established fact, but probably implicitly the general consensus of the field) and sound amplitude. Responses in auditory cortex are probably more linear in sound amplitude than visual cortex responses are linear in number of photons, but whether that is the correct coding space is still unclear, and as far as we can tell there is no clear consensus in the field. We did consider coupling running speed to frequency, which may work as well, but given the possible equivalence (as argued above) and the fact that we could see similar responses with sound amplitude coupling we did not explore frequency coupling. 

      If visual speed is the coding space of V1, SNR should behave equivalently in both cases. 

      Perhaps such differences might explain why unlike in the case of visual cortex experiments, running speed does not affect the strength of playback responses in the auditory cortex. 

      Possible, but the more straightforward framing of this point is that sensory responses are enhanced by running in visual cortex while they are not in auditory cortex. A playback halt response (by design) is just a sensory response. Why running does not generally increase sensory responses in auditory cortex (L2/3 neurons), but does so in visual cortex, would be the more general version of the same question.

      We fear we have no intelligent answer to this question.  

      Reviewer #3 (Public Review): 

      This study explores sensory prediction errors in the sensory cortex. It focuses on the question of how these signals are shaped by non-hierarchical interactions, specifically multimodal signals arising from same-level cortical areas. The authors used 2-photon imaging of mouse auditory cortex in head-fixed mice that were presented with sounds and/or visual stimuli while moving on a ball. First, responses to pure tones, visual stimuli, and movement onset were characterized. Then, the authors made the running speed of the mouse predictive of sound intensity and/or visual flow. Mismatches were created through the interruption of sound and/or visual flow for 1 second while the animal moved, disrupting the expected sensory signal given the speed of movement. As a control, the same sensory stimuli triggered by the animal's movement were presented to the animal decoupled from its movement. The authors suggest that auditory responses to the unpredicted silence reflect mismatch responses. That these mismatch responses were enhanced when the visual flow was congruently interrupted, indicates the cross-modal influence of prediction error signals. 

      This study's strengths are the relevance of the question and the design of the experiment. The authors are experts in the techniques used. The analysis explores neither the full power of the experimental design nor the population activity recorded with 2-photon, leaving open the question of to what extent what the authors call mismatch responses are not sensory responses to sound interruption. The auditory system is sensitive to transitions and indeed responses to the interruption of the sound are similar in quality, if not quantity, in the predictive and the control situation. The pattern they observe is different from the visuomotor mismatch responses the authors found in V1 (Keller et al., 2012), where the interruption of visual flow did not activate neuronal activity in the decoupled condition. 

      Just to add brief context to this. The reviewer is correct here, the (Keller et al., 2012) paper reports finding no responses to playback halt. However, this was likely a consequence of indicator sensitivity (these experiments were done with what now seems like a pre-historic version of GCaMP). Experiments performed with more modern indicators do find playback halt responses in visual cortex (see e.g. (Zmarz and Keller, 2016)). 

      The auditory system is sensitive to transitions, also those to silence. See the work of the Linden or the Barkat labs on on-off responses, and also that of the Mesgarani lab (Khalighinejad et al., 2019) on responses to transitions 'to clean' (Figure 1c) in the human auditory cortex. Since the responses described in the current work are modulated by movement and the relationship between movement and sound is more consistent during the coupled sessions, this could explain the difference in response size between coupled and uncoupled sessions. There is also the question of learning. Prediction signals develop over a period of several days and are frequency-specific (Schneider et al., 2018). From a different angle, in Keller et al. 2012, mismatch responses decrease over time as one might expect from repetition. 

      Also for brief context, this might be a misconception. We don’t find a decrease of mismatch responses in the (Keller et al., 2012) paper – we assume what the reviewer is referring to is the fact that mismatch responses decrease in open-loop conditions (they normally do not in closed-loop conditions). This is the behavior one would expect if the mouse learns that movement no longer predicts visual feedback. 

      It would help to see the responses to varying sound intensity as a function of previous intensity, and to plot the interruption response as a function of both transition and movement in both conditions. 

      Given the large populations of neurons recorded and the diversity of the responses, from clearly negative to clearly positive, it would be interesting to understand better whether the diversity reflects the diversity of sounds used or a diversity of cell types, or both. 

      Comments and questions: 

      Does movement generate a sound and does this change with the speed of movement? It would be useful to have this in the methods. 

      There are three ways to interpret the question – below the answers to all three:

      (1) Running speed is experimentally coupled to sound amplitude of a tone played through a loudspeaker. Tone amplitude is scaled with running speed of the mouse in a closed loop fashion. We assume this is not what the reviewer meant, as this is described in the methods (and the results section). 

      (2) Movements of the mouse naturally generate sounds (footsteps, legs moving against fur, etc.). Most of these sounds trivially scale with the frequency of leg movements – we assume this is also not what the reviewer meant. 

      (3) Finally, there are experimental sounds related to the rotation speed of the air supported treadmill that increase with running speed of the mouse. We have added this to the methods as suggested. 

      Figures 1a and 2a. The mouse is very hard to see. Focus on mouse, objective, and sensory stimuli? The figures are generally very clear though. 

      We have enlarged the mouse as suggested. 

      1A-K was the animal running while these responses were measured? 

      We did not restrict this analysis to running or sitting and pooled responses over both conditions.  We have made this more explicit in the results section.  

      Data in Figure 1: Since the modulation of sensory responses by movement is relevant for the mismatch responses, I would move this analysis from S1 to Figure 1 and analyze the responses more finely in terms of running speed relative to sound and gratings. I would include here a more thorough analysis of the responses to 8kHz at varying intensities, for example in the decoupled sessions. Does the response adapt? Does it follow the intensity? 

      We agree that these are interesting questions, but they do not directly pertain to our conclusions here. The key point Figure S1 addresses is whether auditory responses are generally enhanced by running (as they are e.g. in visual cortex) – the answer, on average, is no. We have tried emphasizing this more, but it changes the flow of the paper away from our main message, hence we have left the panels in the supplements. 

      Regarding the 8kHz modulation, there is a general increase of the suppression of activity with increasing sound amplitude (Author response image 7 and Author response image 8). But due to the continuously varying amplitude of the stimulus, we do not have sufficient data (or do not know how to with the data we have) to address questions of adaptation. We assume there is some form of adaptation. However, either way, we don’t see how this would change our conclusions. 

      Author response image 7.

      Neural activity as a function of sound level in an AM open loop session. 

      Author response image 8.

      The average sound evoked population response of all ACx layer 2/3 neurons to 60 dB or 75 dB 8 kHz pure tones. Stimulus duration was 1 s (gray shading).

      2C-D why not talk of motor modulation? Paralleling what happens in response to auditory and visual stimuli? 

      This is correct, a mismatch response (we use mismatch here to operationally describe the stimulus – not the interpretation) can be described either as a prediction error (this is the interpretation) or a stimulus specific motor modulation. Note, the key here is “stimulus specific”. It is stimulus specific as there is an approximately 3x change between mismatch and playback halt (the same sensory stimulus with and without locomotion), but basically no change for sound onsets (Figure S1). Having said that, one explanation (prediction error) has predictive power (and hence is testable – see e.g. (Vasilevskaya et al., 2023) for an extensive discussion on exactly this argument for mismatch responses in visual cortex), while the other does not (a “stimulus specific” motor modulation has no predictive value or computational theory behind it and is simply a description). Thus, we choose to interpret it as a prediction error. Note, this finding does not stand in isolation and many of the testable predictions of the predictive processing interpretation have turned out to be correct (see e.g. (Keller and Mrsic-Flogel, 2018) for a review). 

      Note, we try to only use the interpretation of “prediction error” when motivating why we do the experiments, and in the discussion, but not directly in the description of the results (e.g. in Figure 2).  

      How does the mismatch affect the behavior of the mouse? Does it stop running? This could also influence the size of the response. 

      We quantified animal behavior during audiomotor mismatches and did not find any significant acceleration or slowing down upon mismatch events. Thus, neural responses recorded during AM mismatches are unlikely to be explained by changes in animal behavior. These data have been added in Figure S2A and Figure S4A.

      Figure 3. What about neurons that were positively modulated by both grating and movement? How do these neurons respond to the mismatch? 

      Neurons positively modulated by both grating and movement were slightly more responsive to MM than the rest of the population, though this difference was not significant (Author response image 9). This is also visible in Figure 3G – the high VM mismatch responsive neurons are randomly distributed in regard to correlation with running speed and visual flow speed. 

      Author response image 9.

      Responses to visuomotor mismatches of neurons positively modulated by grating and movement and of the remainder of the population.

      Line 176. The authors say 'Thus, in the case of a [AM + VM] mismatch both the halted visual flow and the halted sound amplitude are predicted by running speed' but the mismatch (halted flow and amplitude) is not predicted by the speed, correct? Please rephrase. 

      Thank you for pointing this out – this was indeed phrased incorrectly. We have corrected this. 

      How was the sound and/or visual flow interruption triggered? Did the animal have to run at a minimum speed in order for it to happen?

      Sound and visual flow interruptions were triggered randomly, independent of the animal's running speed. However, for the analysis, only MM presentations during which animals were running at a speed of at least 0.3 cm/s were included. The 0.3 cm/s was simply the (arbitrary) threshold we used to determine if the mouse was running. In a completely stationary mouse a mismatch event will not have any effect (sound amplitude/visual flow speed are already at 0). This is described in the methods section.
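
      The inclusion rule described above (keep only mismatch presentations during which the animal ran at 0.3 cm/s or faster) can be sketched as a simple filter. This is a minimal illustration, not the authors' analysis code: the function name, the per-sample speed trace, and the choice to evaluate speed at the event sample itself are all assumptions made for the example.

```python
# Hypothetical sketch of the running-speed inclusion rule.
# Only the 0.3 cm/s threshold comes from the response; everything else is assumed.
RUN_THRESHOLD_CM_S = 0.3


def select_running_events(event_samples, speed_cm_s, threshold=RUN_THRESHOLD_CM_S):
    """Return the mismatch event indices at which running speed meets the threshold.

    event_samples: sample indices at which mismatch events were triggered
    speed_cm_s:    running speed per sample, in cm/s (assumed layout)
    """
    return [i for i in event_samples if speed_cm_s[i] >= threshold]


if __name__ == "__main__":
    speed = [0.0, 0.1, 0.5, 2.0, 0.3, 0.0]
    events = [0, 2, 4, 5]
    # Events at samples 0 and 5 occur while the animal is stationary and are dropped.
    print(select_running_events(events, speed))
```

      Because events are triggered independently of running speed, the filter is applied only at analysis time, which matches the description in the response.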

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This is a useful study examining the determinants and mechanisms of LRMP inhibition of cAMP regulation of HCN4 channel gating. The evidence provided to support the main conclusions is unfortunately incomplete, with discrepancies in the work that reduce the strength of mechanistic insights.

      Thank you for the reviews of our manuscript. We have made a number of changes to clarify our hypotheses in the manuscript and addressed all of the potential discrepancies by revising some of our interpretation. In addition, we have provided additional experimental evidence to support our conclusions. Please see below for a detailed response to each reviewer comment.

      Public Reviews

      Reviewer #1 (Public Review):

      Summary:

      The authors use truncations, fragments, and HCN2/4 chimeras to narrow down the interaction and regulatory domains for LRMP inhibition of cAMP-dependent shifts in the voltage dependence of activation of HCN4 channels. They identify the N-terminal domain of HCN4 as a binding domain for LRMP, and highlight two residues in the C-linker as critical for the regulatory effect. Notably, whereas HCN2 is normally insensitive to LRMP, putting the N-terminus and 5 additional C-linker and S5 residues from HCN4 into HCN2 confers LRMP regulation in HCN2.

      Strengths:

      The work is excellent, the paper well written, and the data convincingly support the conclusions which shed new light on the interaction and mechanism for LRMP regulation of HCN4, as well as identifying critical differences that explain why LRMP does not regulate other isoforms such as HCN2.

      Thank you.

      Reviewer #2 (Public Review):

      Summary:

      HCN-4 isoform is found primarily in the sino-atrial node where it contributes to the pacemaking activity. LRMP is an accessory subunit that prevents cAMP-dependent potentiation of HCN4 isoform but does not have any effect on HCN2 regulation. In this study, the authors combine electrophysiology, FRET with standard molecular genetics to determine the molecular mechanism of LRMP action on HCN4 activity. Their study shows that parts of N- and C-termini along with specific residues in C-linker and S5 of HCN4 are crucial for mediating LRMP action on these channels. Furthermore, they show that the initial 224 residues of LRMP are sufficient to account for most of the activity. In my view, the highlight of this study is Fig. 7 which recapitulates LRMP modulation on HCN2-HCN4 chimera. Overall, this study is an excellent example of using time-tested methods to probe the molecular mechanisms of regulation of channel function by an accessory subunit.

      Weaknesses:

      (1) Figure 5A- I am a bit confused with this figure and perhaps it needs better labeling. When it states Citrine, does it mean just free Citrine, and "LRMP 1-230" means LRMP fused to Citrine which is an "LF" construct? Why not simply call it "LF"? If there is no Citrine fused to "LRMP 1-230", this figure would not make sense to me.

      We have clarified the labelling of this figure and specifically defined all abbreviations used for HCN4 and LRMP fragments in the results section on page 14.

      (2) Related to the above point- Why is there very little FRET between NF and LRMP 1-230? The FRET distance range is 2-8 nm which is quite large. To observe baseline FRET for this construct more explanation is required. Even if one assumes that about 100 amino acids are completely disordered (not extended) polymers, I think you would still expect significant FRET.

      FRET is extremely sensitive to distance (transfer efficiency falls off with the 6th power of distance). The difference in contour length (the maximum length of a peptide if fully extended) between our ~260 aa fragment and our ~130 aa fragments is on the order of 450 Å (45 nm). So, even if the fragments are not extended, it is not hard to imagine that the larger fragments show a weaker FRET signal. In fact, we do see a slightly larger FRET signal than in the control (not significant), which is consistent with the idea that the larger fragments simply do not produce a large FRET signal.

      Moreover, this hybridization assay is sensitive to a number of other factors including the affinity between the two fragments, the expression of each fragment, and the orientation of the fluorophores. Any of these factors could also result in reduced FRET.

      We have added a section on the limitations of the FRET 2-hybrid assay in the discussion section on page 20. Our goal with the FRET assay was to provide complementary evidence that shows some of the regions that are important for direct association, and we have edited the text to make sure we are not over-interpreting our results.
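
      The sixth-power distance dependence invoked in the response to point (2) can be illustrated with the standard Förster relation E = 1 / (1 + (r/R0)^6). The sketch below is illustrative only: the Förster radius of 5.0 nm and the 0.35 nm per-residue contour length are assumed round values, not numbers taken from the manuscript, and the function names are hypothetical.

```python
# Illustrative sketch of FRET distance dependence (assumed parameter values).
R0_NM = 5.0             # assumed Förster radius for the fluorophore pair, in nm
NM_PER_RESIDUE = 0.35   # assumed contour length per residue of an extended peptide, in nm


def fret_efficiency(r_nm, r0_nm=R0_NM):
    """Förster relation: E = 1 / (1 + (r / R0)^6)."""
    if r_nm < 0 or r0_nm <= 0:
        raise ValueError("distance must be non-negative and R0 must be positive")
    return 1.0 / (1.0 + (r_nm / r0_nm) ** 6)


def contour_length_nm(n_residues, nm_per_residue=NM_PER_RESIDUE):
    """Maximum end-to-end length of a fully extended peptide."""
    return n_residues * nm_per_residue


if __name__ == "__main__":
    for r in (2.5, 5.0, 7.5, 10.0):
        print(f"r = {r:4.1f} nm  ->  E = {fret_efficiency(r):.3f}")
    extra = contour_length_nm(260) - contour_length_nm(130)
    print(f"extra contour length for 130 additional residues: {extra:.1f} nm")
```

      With these assumed values, efficiency drops from 0.5 at r = R0 to under 2% at twice that distance, and 130 additional residues add roughly 45 nm of contour length, in line with the estimate given in the response.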

      (3) Unless I missed this, have all the Cerulean and Citrine constructs been tested for functional activity?

      All Citrine-tagged LRMP constructs (or close derivatives) were tested functionally by coexpression with HCN (see Table 1 and pages 10-11). Cerulean-tagged HCN4 fragments are of course intrinsically non-functional as they do not include the ion-conducting pore.

      Reviewer #3 (Public Review):

      Summary:

      Using patch clamp electrophysiology and Förster resonance energy transfer (FRET), Peters and co-workers showed that the disordered N-termini of both LRMP and HCN4 are necessary for LRMP to interact with HCN4 and inhibit the cAMP-dependent potentiation of channel opening. Strikingly, they identified two HCN4-specific residues, P545 and T547 in the C-linker of HCN4, that are close in proximity to the cAMP transduction centre (elbow C-linker, S4/S5-linker, HCND) and account for the LRMP effect.

      Strengths:

      Based on these data, the authors propose a mechanism in which LRMP specifically binds to HCN4 via its isotype-specific N-terminal sequence and thus prevents the cAMP transduction mechanism by acting at the interface between the elbow C-linker, the S4-S5 linker, and the HCND.

      Weaknesses:

      Although the work is interesting, there are some discrepancies between data that need to be addressed.

      (1) I suggest inserting in Table 1 and in the text, the Δ shift values (+cAMP; + LRMP; +cAMP/LRMP). This will help readers.

      Thank you, Δ shift values have been added to Tables 1 and 2 as suggested.

      (2) Figure 1 is not clear, the distribution of values is anomalously high. For instance, in 1B the distribution of values of V1/2 in the presence of cAMP goes from -85 to -115. I agree that in the absence of cAMP, HCN4 in HEK293 cells shows some variability in V1/2 values, that nonetheless cannot be so wide (here the variability spans sometimes even 30 mV) and usually disappears with cAMP (here not).

      With a large N, this is an expected distribution. In 5 previous reports of HCN4 with cAMP in HEK 293 cells from 4 different groups (Fenske et al., 2020; Liao et al., 2012; Peters et al., 2020; Saponaro et al., 2021; Schweizer et al., 2010), the average expected range of the data is 26.6 mV and 39.9 mV for 95% (mean ± 2SD) and 99% (mean ± 3SD) of the data, respectively. As the reviewer mentions, the expected range from these papers is slightly larger in the absence of cAMP. The average SDs of HCN4 (with/without cAMP) in these papers are 9.9 mV (Schweizer et al., 2010), 4.4 mV (Saponaro et al., 2021), 7.6 mV (Fenske et al., 2020), 10.0 mV (Liao et al., 2012), and 5.9 mV (Peters et al., 2020). Our SD in this paper is roughly in the middle at 7.6 mV. This is likely because we used an inclusive approach to data so as not to bias our results (see the statistics section of the revised manuscript on page 9). We have removed 2 data points that meet the statistical classification of outliers; no measures of statistical significance were altered by this.
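
      The link between a standard deviation and the spread one should expect in a large sample can be sketched as follows. This is a minimal illustration that assumes normally distributed midpoints; it uses the 7.6 mV SD quoted in the response as an example input and is not intended to reproduce the literature averages cited there. The function names are hypothetical.

```python
# Minimal sketch: expected spread of V1/2 values for a given SD,
# assuming a normal distribution (an assumption made for illustration).
from statistics import NormalDist


def expected_span_mv(sd_mv, k):
    """Width of the interval mean ± k·SD, in mV."""
    return 2.0 * k * sd_mv


def normal_coverage(k):
    """Fraction of a normal distribution lying within mean ± k·SD."""
    unit = NormalDist()
    return unit.cdf(k) - unit.cdf(-k)


if __name__ == "__main__":
    sd = 7.6  # mV, the SD reported for this dataset in the response
    for k in (2, 3):
        print(f"mean ± {k} SD spans {expected_span_mv(sd, k):.1f} mV "
              f"and covers {normal_coverage(k):.1%} of values")
```

      Under this assumption, a spread of about 30 mV across many recordings follows directly from an SD of 7.6 mV and does not by itself indicate a problem with the recordings.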

      This problem is spread throughout the manuscript, and the measured mean effects are indeed always at the limit of statistical significance. Why so? Is this a problem with the analysis, or with the recordings?

      The exact P-values are NOT typically at the limit of statistical significance; about two-thirds would meet the stringent P < 0.0001 cut-off. We have clarified in the statistics section (page 10) that any comparison meeting our significance threshold (P < 0.05) or a stricter criterion is treated equally in the figure labelling. Exact P-values are provided in Tables 1-3.

      There are several other problems with Figure 1 and in all figures of the manuscript: the Y scale is very narrow while the mean values are marked with large square boxes. Moreover, the exemplary activation curve of Figure 1A is not representative of the mean values reported in Figure 1B, and the values of 1B are different from those reported in Table 1.

      Y-axis values for mean plots were picked such that all data points are included and are consistent across all figures. They have been expanded slightly (-75 to -145 mV for all HCN4 channels and -65 to -135 mV for all HCN2 channels). The size of the mean value marker has been reduced slightly. Exact midpoints for all data are also found in Tables 1-3.

      The GV curves in Figure 1B (previously Fig. 1A) are averages with the ±SEM error bars smaller than the symbols in many cases owing to relatively high n’s for these datasets. These curves match the midpoints in panel 1C (previously 1B). E.g., the midpoint of the average curve for HCN4 control in panel A is -117.9 mV, essentially the same as the -117.8 mV average for the individual fits in panel B.

      An error in the text concerning the ordering of the tables, carried over from a previous manuscript version, has now been fixed, so these values should now be aligned.

      On this ground, it is difficult to judge the conclusions and it would also greatly help if exemplary current traces would be also shown.

      Exemplary current traces have been added to all figures in the revised manuscript.

      (3) "....HCN4-P545A/T547F was insensitive to LRMP (Figs. 6B and 6C; Table 1), indicating that the unique HCN4 C-linker is necessary for regulation by LRMP. Thus, LRMP appears to regulate HCN4 by altering the interactions between the C-linker, S4-S5 linker, and N-terminus at the cAMP transduction centre."

      Although this is an interesting theory, there are no data supporting it. Indeed, P545 and T547 at the tip of the C-linker elbow (fig 6A) are crucial for LRMP effect, but these two residues are not involved in the cAMP transduction centre (interface between HCND, S4-S5 linker, and C-linker elbow), at least for the data accumulated till now in the literature. Indeed, the hypothesis that LRMP somehow inhibits the cAMP transduction mechanism of HCN4 given the fact that the two necessary residues P545 and T547 are close to the cAMP transduction centre, remains to be proven.

      Moreover, I suggest analysing the putative role of P545 and T547 in light of the available HCN4 structures. In particular, T547 (elbow) points towards the underlying shoulder of the adjacent subunit and, therefore, is in a key position for the cAMP transduction mechanism. The presence of bulky hydrophobic residues (very different nature compared to T) in the equivalent position of HCN1 and HCN2 also favours this hypothesis. In this light, it will be also interesting to see whether a single T547F mutation is sufficient to prevent the LRMP effect.

      We agree that testing this hypothesis would be very interesting. However, it is challenging. Any mutation we make that is involved in cAMP transduction makes measuring the LRMP effect on cAMP shifts difficult or impossible.

      Our simple idea, now clarified in the discussion, is that if you look at the regions involved in cAMP transduction (HCND, C-linker, S4-S5), there are very few residues that differ between HCN4 and HCN2. When we mutate the 5 non-conserved residues in the S5 segment and the C-linker, along with the NT, we are able to render HCN2 sensitive to LRMP. Therefore, something about the small sequence differences in this region confers isoform specificity to LRMP. We speculate that this happens because of small structural differences that result from those 5 mutations. If you compare the solved structures of HCN1 and HCN4 (there is no HCN2 structure available), you can see small differences in the distances between key interacting residues in the transduction centre. Also, there is a kink at the bottom of the S4 helix in HCN4 but not in HCN1. This points a putatively important residue for cAMP dependence in a different direction in HCN4. We hypothesize in the discussion that this may be how LRMP is isoform specific.

      Moreover, previous work has shown that the HCN4 C-linker is uniquely sensitive to di-cyclic nucleotides and magnesium ions. We are hypothesizing that it is the subtle change in structure that makes this region more prone to regulation in HCN4.

      Reviewing Editor (recommendations for the Authors):

      (1) Exemplar recordings need to be shown and some explanation for the wide variability in the V-half of activation.

      Exemplar currents are now shown for each channel. See the response to Reviewer 3’s public comment 2.

      (2) The rationale for cut sites in LRMP for the investigation of which parts of the protein are important for blocking the effect of cAMP is not logically presented in light of the modular schematics of domains in the protein (N-term, CCD, post-CCD, etc).

      There is limited structural data on LRMP and the HCN4 N-terminus. The cut sites in this paper were determined empirically. We made fragments that were small enough to work for our FRET hybridization approach and that expressed well in our HEK cell system. The residue numbering of the LRMP modules is based on updated structural predictions using Alphafold, which was released after our fragments were designed. This has been clarified in the methods section on pages 5-6 and the Figure 2 legend of the revised manuscript.

      (3) Role of the HCN4 C-terminus. Truncation of the unstructured HCN4 C-terminus distal to the CNBD (Fig. 4 A, B) partially reverses the impact of LRMP (i.e. there is now a significant increase in cAMP effect compared to full-length HCN4). The manuscript is written in a manner that minimizes the potential role of the C-terminus and it is, therefore, eliminated from consideration in subsequent experiments (e.g. FRET) and the discussion. The model is incomplete without considering the impact of the C-terminus.

      We thank the reviewer for this comment as it was a result that we too readily dismissed. We have added discussion around this point and revised our model to suggest that not only can we not eliminate a role for the distal C-terminus, our data is consistent with it having a modest role. Our HCN4-2 chimera and HCN4-S719x data both suggest the possibility that the distal C-terminus might be having some effect on LRMP regulation. We have clarified this in the results (pages 12-13) and discussion (page 19).

      (4) For FRET experiments, it is not clear why LF should show an interaction with N2 (residues 125-160) but not NF (residues 1-160). N2 is contained within NF, and given that Citrine and Cerulean are present on the C-terminus of LF and N2/NF, respectively, residues 1-124 in NF should not impact the detection of FRET because of greater separation between the fluorophores as suggested by the authors.

      This is a fair point, but FRET is somewhat more complicated. We do not know the structure of these fragments and it’s hard to speculate where the fluorophores are oriented in this type of assay. Moreover, this hybridization assay is sensitive to affinity and expression as well. There are a number of reasons why the larger 1-260 fragment might show reduced FRET compared to 125-260. As mentioned in our response to reviewer 2’s public comment 2, we have added a limitations section that outlines the various caveats of FRET that could explain this.

      (5) For FRET experiments, the choice of using pieces of the channel that do not correlate with the truncations studied in functional electrophysiological experiments limits the holistic interpretation of the data. Also, no explanation or discussion is provided for why LRMP fragments that are capable of binding to the HCN4 N-terminus as determined by FRET (e.g. residues 1-108 and 110-230, respectively) do not have a functional impact on the channel.

      As mentioned in the response to comment 2, the exact fragment design is a function of which fragments expressed well in HEK cells. Importantly, because FRET experiments do not provide atomic resolution, for the reasons listed in the revised limitations section on pages 20-21, small differences in the cut sites do not change the interpretation of these results. For example, the N-terminal 1-125 construct is analogous to experiments with the Δ1-130 HCN4 channel.

      We suspect that residues in both fragments are required and that the interaction involves multiple parts. This is stated in the results “Thus, the first 227 residues of LRMP are sufficient to regulate HCN4, with residues in both halves of the LRMP N-terminus necessary for the regulation” (page 11). We have also added discussion on this on page 21.

      (6) A striking result was that mutating two residues in the C-linker of HCN4 to amino acids found in HCN channels not affected by LRMP (P545A, T547F), completely eliminated the impact of LRMP on preventing cAMP regulation of channel activation. However, a chimeric channel, (HCN4-2) in which the C-linker, the CNBD, and the C-terminus of HCN4 were replaced by that of HCN2 was found to be partially responsive to LRMP. These two results appear inconsistent and not reconciled in the model proposed by the authors for how LRMP may be working.

      As stated in our answer to your question #3, we have revised our interpretation of these data. If the more distal C-terminus plays some role in the orientation of the C-linker and the transduction centre as a whole, these data can still be viewed as consistent with our model. We have added some discussion of this idea in our discussion section.

      (7) Replacing the HCN2 N-terminus with that from HCN4, along with mutations in the S5 (MCS/VVG) and C-linker (AF/PT) recapitulated LRMP regulation on the HCN2 background. The functional importance of the S5 mutations is not clear as no other experiments are shown to indicate whether they are necessary for the observed effect.

      We have added our experiments on a midpoint HCN2 clone that includes the S5 mutants and the C-linker mutants in the absence of the HCN4 N-terminus (i.e., HCN2 MCSAF/VVGPT) (Fig. 7). We have also discussed our rationale for the S5 mutations, as we believe they may be responsible for the different orientations of the S4-S5 linker in HCN1 and HCN4 structures that are known to impact cAMP regulation.

      Reviewer #1 (Recommendations For The Authors):

      A) Comments:

      (1) Figure 1: Please show some representative current traces.

      Exemplar currents are now shown for each channel in the manuscript.

      (2) Figure 1: There appears to be a huge number of recordings for HCN4 +/- cAMP as compared to those with LRMP 1-479Cit. How was the number of recordings needed for sufficient statistical power decided? This is particularly important because the observed slowing of deactivation by cAMP in Fig. 1C seems like it may be fairly subtle. Perhaps a swarm plot would make the shift more apparent? Also, LRMP 1-479Cit distributions in Fig. 1B-C look like they are more uniform than normal, so please double-check the appropriateness of the statistical test employed.

      We have revised the methods section (page 7) to discuss this. Briefly, we performed regular control experiments throughout this project to ensure that a normal cAMP response was occurring. Our minimum target for sufficient power was 8-10 recordings. We have expanded the statistics section (page 9) to discuss tests of normality and the use of a log scale for deactivation time constants, which is why the shifts in Fig. 1D (revised) are less apparent.

      (3) It would be helpful if the authors could better introduce their logic for the M338V/C341V/S345G mutations in the HCN4-2 VVGPT mutant.

      See response to the reviewing editor’s comment 7.

      B) Minor Comments:

      (1) pg. 9: "We found that LRMP 1-479Cit inhibited HCN4 to an even greater degree than the full-length LRMP, likely because expression of this tagged construct was improved compared to the untagged full-length LRMP, which was detected by co-transfection with GFP." Co-transfection with GFP seems like an extremely poor and a risky measure for LRMP expression.

      We agree that the exact efficiency of co-transfection is contentious although some papers and manufacturer protocols indicate high co-transfection efficiency (Xie et al., 2011). In this paper we used both co-transfection and tagged proteins with similar results.

      (2) pg 9: "LRMP 1-227 construct contains the N-terminus of LRMP with a cut-site near the N-terminus of the predicted coiled-coil sequence". In Figure 2 the graphic shows the coiled-coil domain starting at 191. What was the logic for splitting at 227 which appears to be the middle of the coiled-coil?

      See response to the reviewing editor’s comment 2.

      (3) Figure 5C: Please align the various schematics for HCN4 as was done for LRMP. It makes it much easier to decipher what is what.

      Fig. 5 has been revised as suggested.

      (4) pg 12: I assume that the HCN2 fragment chosen aligns with the HCN4 N2 fragment which shows binding, but this logic should be stated if that is the case. If not, then how was the HCN2 fragment chosen?

      This is correct. This has been explicitly stated in the revised manuscript (page 14).

      (5) Figure 7: Add legend indicating black/gray = HCN4 and blue = HCN2.

      This has been stated in the revised figure legend.

      (6) pg 17: Conservation of P545 and T547 across mammalian species is not shown or cited.

      This sentence is not included in the revised manuscript; however, for the reviewer's interest, we have provided an alignment of this region across species here.

      Author response image 1.

      Reviewer #2 (Recommendations For The Authors):

      (1) It is not clear whether in the absence of cAMP, LRMP also modestly shifts the voltage-dependent activity of the channels. Please clarify.

      We have clarified that LRMP does not shift the voltage-dependence in the absence of cAMP (page 10). In the absence of cAMP, LRMP does not significantly shift the voltage-dependence of activation in any of the channels we have tested in this paper (or in our prior 2020 paper).

      (2) Resolution of Fig. 8b is low.

      We ultimately decided that the cartoon did not provide any important information for understanding our model and it was removed.

      (3) Please add a supplementary figure showing the amino acid sequence of LRMP to show where the demarcations are made for each fragment as well as where the truncations were made as noted in Fig 3 and Fig 4.

      A new supplementary figure showing the LRMP sequence has been added and cited in the methods section (page 5). Truncation sites have been added to the schematic in Fig. 2A.

      (4) In the cartoon schematic illustration for Fig. 3 and Fig.4, the legend should include that the thick bold lines in the C-Terminal domain represent the CNBD, while the thick bold lines in the N-Terminal domain represent the HCN domain. This was mentioned in Liao 2012, as you referenced when you defined the construct S719X, but it would be nice for the reader to know that the thick bold lines you have drawn in your cartoon indicate that it also highlights the CNBD or the HCN domain.

      This has been added to figure legends for the relevant figures in the revised manuscript.

      (5) On page 12, missing a space between "residues" and "1" in the parenthesis "...LRMP L1 (residues1-108)...".

      Fixed. Thank you.

      (6) Which isoform of LRMP was used? What is the NCBI accession number? Is it the same one from Peters 2020 ("MC228229")?

      This information has been added to the methods (page 5). It is the same as Peters 2020.

      Reviewer #3 (Recommendations For The Authors):

      (1) "Truncation of residues 1-62 led to a partial LRMP effect where cAMP caused a significant depolarizing shift in the presence of LRMP, but the activation in the presence of LRMP and cAMP was hyperpolarized compared to cAMP alone (Fig. 3B, C and 3E; Table 1). In the HCN4Δ1-130 construct, cAMP caused a significant depolarizing shift in the presence of LRMP; however, the midpoint of activation in the presence of LRMP and cAMP showed a non-significant trend towards hyperpolarization compared to cAMP alone (Fig. 3C and 3E; Table 1)".

      This means that sequence 62-185 is necessary and sufficient for the LRMP effect. I suggest a competition assay with this peptide (synthetic, or co-expressed with HCN4 full-length and LRMP to see whether the peptide inhibits the LRMP effect).

      We respectfully disagree with the reviewer’s interpretation. Our results strongly suggest that other regions such as residues 25-65 (Fig. 3C) and C-terminal residues (Fig. 6) are also necessary. The use of a peptide could be an interesting future experiment; however, it would be very difficult to control relative expression of a co-expressed peptide. We think that our results in Fig. 7E-F, where this fragment is added to HCN2, are a better controlled way of validating the importance of this region.

      (2) "Truncation of the distal C-terminus (of HCN4) did not prevent LRMP regulation. In the presence of both LRMP and cAMP the activation of HCN4-S719X was still significantly hyperpolarized compared to the presence of cAMP alone (Figs. 4A and 4B; Table 1). And the cAMP-induced shift in HCN4-S719X in the presence of LRMP (~7mV) was less than half the shift in the absence of LRMP (~18 mV)."

      On the basis of the partial effects reported for the truncations of the N-terminus of HCN4 1-62 and 1-130 (Fig 3B and C), I do not think it is possible to conclude that "truncation of the distal C-terminus (of HCN4) did not prevent LRMP regulation". Indeed, cAMP-induced shift in HCN4 Δ1-62 and Δ1-130 in the presence of LRMP were 10.9 and 10.5 mV, respectively, way more than the ~7mV measured for the HCN4-S719X mutant.

      As you rightly stated at the end of the paragraph: "Together, these results show significant LRMP regulation of HCN4 even when the distal C-terminus is truncated, consistent with a minimal role for the C-terminus in the regulatory pathway". I would better discuss this minimal role of the C-terminus. It is true that deletion of the first 185 aa of the HCN4 N-terminus abolishes the LRMP effect, but it is also true that removal of the very C-terminus of HCN4 does affect LRMP. This unstructured C-terminal region of HCN4 contains isotype-specific sequences. Maybe they also play a role in recognizing LRMP. Thus, I would suggest further investigation via truncations, even internal deletions of HCN4-specific sequences.

      Please see the response to the reviewing editor’s comment 3.

      (3) Figure 5: The N-terminus of LRMP FRETs with the N-terminus of HCN4.

      Why didn't you test the same truncations used in Fig. 3? Indeed, based on Fig 3, sequences 1-25 can be removed. I would have considered peptides 26-62 and 63-130 and 131-185 and a fourth (26-185). This set of peptides will help you connect binding with the functional effects of the truncations tested in Fig 3.

      Please see the response to the reviewing editor’s comment 2 and 5.

      Why didn't you test the C-terminus (from 719 till the end) of HCN4? This can help with understanding why truncation of the HCN4 C-terminus does affect LRMP, though partially (Fig. 4A).

      Please see the response to the reviewing editor’s comment 3.

      (4) "We found that a previously described HCN4-2 chimera containing the HCN4 N-terminus and transmembrane domains (residues 1-518) with the HCN2 C-terminus (442-863) (Liao et al., 2012) was partially regulated by LRMP (Fig. 7A and 7B)".

      I do not understand this partial LRMP effect on the HCN4-2 chimera. In Fig. 6 you have shown that the "HCN4-P545A/T547F was insensitive to LRMP (Figs. 6B and 6C; Table 1), indicating that the unique HCN4 C-linker is necessary for regulation by LRMP". How can this be reconciled with the HCN4-2 chimera? HCN4-2, "containing" P545A/T547F mutations, should not perceive LRMP.

      Please see the response to the reviewing editor’s comment 6.

      (5) "we next made a targeted chimera of HCN2 that contains the distal HCN4 N-terminus (residues 1-212) and the HCN2 transmembrane and C-terminal domains with 5 point mutants in non-conserved residues of the S5 segment and C-linker elbow (M338V/C341V/S345G/A467P/F469T)......Importantly, the HCN4-2 VVGPT channel is insensitive to cAMP in the presence of LRMP (Fig. 7C and 7D), indicating that the HCN4 N-terminus and cAMP-transduction centre residues are sufficient to confer LRMP regulation to HCN2".

      Why did you insert also the 3 mutations of S5? Are these mutations somehow involved in the cAMP transduction mechanism?

      You have already shown that in HCN4 only P545 and T547 (C-linker) are necessary for LRMP effect. I suggest to try, at least, the chimera of HCN2 with only A467P/F469T. They should work without the 3 mutations in S5.

      Please see the response to the reviewing editor’s comment 7.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment 

      fMRI was used to address an important aspect of human cognition - the capacity for structured representations and symbolic processing - in a cross-species comparison with non-human primates (macaques); the experimental design probed implicit symbolic processing through reversal of learned stimulus pairs. The authors present solid evidence in humans that helps elucidate the role of brain networks in symbolic processing, however the evidence from macaques was incomplete (e.g., sample size constraints, potential and hard-to-quantify differences in attention allocation, motivation, and lived experience between species).

      Thank you very much for your assessment. We would like to address the potential issues that you raise point-by-point below.

      We agree that for macaque monkey physiology, sample size is always a constraint, for both financial and ethical reasons. We addressed this concern by combining the results from two different labs, which allowed us to test 4 animals in total, twice as many as is common practice in the field of primate physiology. (We discuss this now on lines 473-478.)

      Interspecies differences in motivation, attention allocation, task strategies etc. could also be limiting factors. Note that we did address the potential lack of attention allocation directly in Experiment 2 using implicit reward association, which was successful as evidenced by the activation of attentional control areas in the prefrontal cortex. We cannot guarantee that the strategies that the two species deploy are identical, but we tentatively suggest that this might be a less important factor in the present study than in other interspecies comparisons that use explicit behavioral reports. In the current study, we directly measured surprise responses in the brain in the absence of any explicit instructions in either species, which allowed us to measure the spontaneous reversal of learned associations, a very basic element of symbolic representation. Our reasoning is that such spontaneous responses should be less dependent on attention allocation and task strategies. (We discuss this now in more detail on lines 478-485.)

      Finally, lived experience could be a major factor. Indeed, obvious differences include a lifetime of open-field experiences and education in our human adult subjects, which was not available to the monkey subjects, and includes a strong bias towards explicit learning of symbolic systems (e.g. words, letters, digits, etc). However, we have previously shown, using EEG, that 5-month-old human infants spontaneously generalize learning to the reversed pairs after a short learning period in the lab (Kabdebon et al, PNAS, 2019). This indicates that even with very limited experience, humans spontaneously reverse learned associations. (We discuss this now in more detail on lines 478-485.) It could be very interesting to investigate whether spontaneous reversal could be present in infant macaque monkeys, as there might be a critical period for this effect. Although neurophysiology in awake infant monkeys is highly challenging, it would be very relevant for future work. (We discuss this in more detail on lines 493-498.)

      Public Reviews: 

      Reviewer #1 (Public Review): 

      Kerkoerle and colleagues present a very interesting comparative fMRI study in humans and monkeys, assessing neural responses to surprise reactions at the reversal of a previously learned association. The implicit nature of this task, assessing how this information is represented without requiring explicit decision-making, is an elegant design. The paper reports that both humans and monkeys show neural responses across a range of areas when presented with incongruous stimulus pairs. Monkeys also show a surprise response when the stimuli are presented in a reversed direction. However, humans show no such surprise response based on this reversal, suggesting that they encode the relationship reversibly and bidirectionally, unlike the monkeys. This has been suggested as a hallmark of symbolic representation, that might be absent in nonhuman animals. 

      I find this experiment and the results quite compelling, and the data do support the hypothesis that humans are somewhat unique in their tendency to form reversible, symbolic associations. I think that an important strength of the results is that the critical finding is the presence of an interaction between congruity and canonicity in macaques, which does not appear in humans. These results go a long way to allay concerns I have about the comparison of many human participants to a very small number of macaques. 

      We thank the reviewer for the positive assessment. We also very much appreciate the point about the interaction effect in macaque monkeys – indeed, we do not report just a negative finding. 

      I understand the impossibility of testing 30+ macaques in an fMRI experiment. However, I think it is important to note that differences necessarily arise in the analysis of such datasets. The authors report that they use '...identical training, stimuli, and whole-brain fMRI measures'. However, the monkeys (in experiment 1) actually required 10 times more training. 

      We agree that this description was imprecise. We have changed it to “identical training stimuli” (line 151); indeed, the movies used for training were strictly identical. Furthermore, please note that we do report the fMRI results after the same training duration. In experiment 1, after 3 days of training, the monkeys did not show any significant results, even in the canonical direction. However, in experiment 2, with increased attention and motivation, a significant effect was observed on the first day of scanning after training, as was found in human subjects (see Figure 4 and Table 3).

      More importantly, while the fMRI measures are the same, group analysis over 30+ individuals is inherently different from comparing only 2 macaques (including smoothing and averaging away individual differences that might be more present in the monkeys, due to the much smaller sample size). 

      Thank you for understanding that a limited sample size is intrinsic to macaque monkey physiology. We also agree that data analysis in humans and monkeys is necessarily different. As suggested by the reviewer, we added an analysis to address this; see the corresponding reply to the ‘Recommendations for the authors’ section below.

      Despite this, the results do appear to show that macaques show the predicted interaction effect (even despite the sample size), while humans do not. I think this is quite convincing, although had the results turned out differently (for example an effect in humans that was absent in macaques), I think this difference in sample size would be considerably more concerning. 

      Thank you for noting this. Indeed, the interaction effect is crucial, and the task design was explicitly made to test this precise prediction, described in our manuscript as the “reversibility hypothesis”. The congruity effect in the learned direction served as a control for learning, while the corresponding congruity effect in the reversed direction tested for spontaneous reversal. The reversibility hypothesis stipulates that in humans there should not be a difference between the learned and the reversed direction, while there should be one for monkeys. We already wrote about this in the Results section of the original manuscript and now also describe it more explicitly in the Introduction and at the beginning of the Results section.

      I would also note that while I agree with the authors' conclusions, it is notable to me that the congruity effect observed in humans (red vs blue lines in Fig. 2B) appears to be far more pronounced than any effect observed in the macaques (Fig. 3C-3). Again, this does not challenge the core finding of this paper but does suggest methodological or possibly motivational/attentional differences between the humans and the monkeys (or, for example, that the monkeys had learned the associations less strongly and clearly than the humans). 

      As also explained in response to the eLife assessment above, we expanded the “limitations” section of the discussion, with a deeper description of the possible methodological differences between the two species (see lines 478-485).

      With the same worry in mind, we did increase the attention and motivation of monkeys in experiment 2, and indeed obtained a greater activation to the canonical pairs and their violation, notably in the prefrontal cortex, but crucially still without reversibility.

      In the end, we believe that the striking interspecies difference in size and extent of the violation effect, even for purely canonical stimuli, is an important part of our findings and points to a more efficient species-specific learning system, that our experiment tentatively relates to a symbolic competence.

      This is a strong paper with elegant methods and makes a worthwhile contribution to our understanding of the neural systems supporting symbolic representations in humans, as opposed to other animals. 

      We again thank the reviewer for the positive review.

      Reviewer #2 (Public Review): 

      In their article titled "Brain mechanisms of reversible symbolic reference: a potential singularity of the human brain", van Kerkoerle et al address the timely question of whether non-human primates (rhesus macaques) possess the ability for reverse symbolic inference as observed in humans. Through an fMRI experiment in both humans and monkeys, they analyzed the bold signal in both species while observing audio-visual and visual-visual stimuli pairs that had been previously learned in a particular direction. Remarkably, the findings pertaining to humans revealed that a broad brain network exhibited increased activity in response to surprises occurring in both the learned and reverse directions. Conversely, in monkeys, the study uncovered that the brain activity within sensory areas only responded to the learned direction but failed to exhibit any discernible response to the reverse direction. These compelling results indicate that the capacity for reversible symbolic inference may be unique to humans. 

      In general, the manuscript is skillfully crafted and highly accessible to readers. The experimental design exhibits originality, and the analyses are tailored to effectively address the central question at hand.

      Although the first experiment raised a number of methodological inquiries, the subsequent second experiment thoroughly addresses these concerns and effectively replicates the initial findings, thereby significantly strengthening the overall study. Overall, this article is already of high quality and brings new insight into human cognition. 

      We sincerely thank the reviewer for the positive comments. 

      I identified three weaknesses in the manuscript: 

      - One major issue in the study is the absence of significant results in monkeys. Indeed, authors draw conclusions regarding the lack of significant difference in activity related to surprise in the multidemand network (MDN) in the reverse congruent versus reverse incongruent conditions. Although the results are convincing (especially with the significant interaction between congruency and canonicity), the article could be improved by including additional analyses in a priori ROI for the MDN in monkeys (as well as in humans, for comparison). 

      First, we disagree with the statement about “absence of significant results in monkeys”. We do report a significant interaction which, as noted by the referee, is a crucial positive finding.

      Second, we performed the suggested analysis for experiment 2, using the bilateral ROIs of the putative monkey MDN from previous literature (Mitchell, et al. 2016), which are based on the human study by Fedorenko et al. (PNAS, 2013). 

      Author response table 1.

      Congruity effect for monkeys in Experiment 2 within the ROIs of the MDN (n=3). Significance was assessed with one-sided one-sample t-tests.

      As can be seen, none of the regions within the monkey MDN showed an FDR-corrected significant difference or interaction. Although the absence of a canonical congruity effect makes it difficult to draw strong conclusions, it did approach significance at an uncorrected level in the lateral frontal posterior region, similar to the large prefrontal effect we report in Figures 4 and 5. Furthermore, for the reversed congruity effect there was never even a trend at the uncorrected level, and the crucial interaction of canonicity and congruity again approached significance in the lateral prefrontal cortex.
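
      To illustrate the distinction drawn above between uncorrected and FDR-corrected significance, here is a minimal sketch of the Benjamini-Hochberg procedure. The response does not state which FDR method was used, so the choice of Benjamini-Hochberg is an assumption, and the p-values below are hypothetical placeholders, not the values from Author response table 1.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where a test survives
    Benjamini-Hochberg FDR control at level q."""
    m = len(p_values)
    # Indices of the tests, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k such that p_(k) <= (k / m) * q.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_rank = rank
    survives = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            survives[idx] = True
    return survives


# Hypothetical uncorrected p-values, one per ROI (illustration only).
hypothetical_p = [0.03, 0.20, 0.45, 0.08, 0.60, 0.12, 0.75]

uncorrected = [p < 0.05 for p in hypothetical_p]
corrected = benjamini_hochberg(hypothetical_p, q=0.05)
```

      With these placeholder values, one ROI passes the uncorrected threshold but none survives correction, which is the pattern of "approaching significance at an uncorrected level" described in the text.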

      We also performed an ANOVA in the human participants of the VV experiment on the average betas across the 7 different fronto-parietal ROIs as used by Mitchell et al to define their equivalent to the monkey brain (Fig 1a, right in Mitchell et al. 2016) with congruity, canonicity and hemisphere (except for the anterior cingulate, which is a bilateral ROI) as within-subject factors. We confirmed the results presented in the manuscript (Figure 4C), with notably no significant interaction between congruity and canonicity in any of these ROIs (all F-values (except insula) <1). A significant main effect of congruity was observed in the posterior middle frontal gyrus (MFG) and inferior precentral sulcus at the FDR-corrected level. Analyses restricted to the canonical trials found a congruity effect in these two regions plus the anterior insula and anterior cingulate/presupplementary motor area, whereas no ROIs were significant at an FDR-corrected level for reversed trials. There was a trend in the middle MFG and inferior precentral region for reversed trials. Crucially, there was not even a trend for the interaction between congruity and canonicity at the uncorrected level. The difference in effect size between the canonical and reversed directions can therefore be explained by the larger statistical power due to the larger number of congruent trials (70%, versus 10% for the other trial conditions), not by a significant difference between the canonical and the reversed direction.
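
      The power argument above can be made concrete with a back-of-the-envelope sketch. It assumes, as a simplification that is not part of the reported analysis, that each condition estimate behaves like a mean over independent trials with equal noise, so that the standard error of a contrast scales with sqrt(1/n_a + 1/n_b). Only the 70%/10% trial proportions come from the text; the total of 100 trials is a hypothetical convenience.

```python
import math


def contrast_se(n_a, n_b, sigma=1.0):
    """Standard error of the difference between two condition means,
    assuming independent trials with equal noise sigma (a simplification)."""
    return sigma * math.sqrt(1.0 / n_a + 1.0 / n_b)


# Hypothetical total of 100 trials split as stated in the text:
# 70% canonical congruent, 10% for each of the three other conditions.
n_canonical_congruent = 70
n_canonical_incongruent = 10
n_reversed_congruent = 10
n_reversed_incongruent = 10

se_canonical = contrast_se(n_canonical_congruent, n_canonical_incongruent)
se_reversed = contrast_se(n_reversed_congruent, n_reversed_incongruent)

# Ratio below 1 means the canonical contrast is estimated more precisely.
ratio = se_canonical / se_reversed
```

      Under these assumptions the canonical congruity contrast has a standard error roughly a quarter smaller than the reversed one, so an identical underlying effect would more easily reach significance in the canonical direction.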

      Author response table 2.

      Congruity effect for humans in Experiment 2 within the ROIs of the MDN (n=23).

      These results support our contention that the type of learning of the stimulus pairs was very different in the two species. We thank the reviewer for suggesting these relevant additional analyses.

      - While the authors acknowledge in the discussion that the number of monkeys included in the study is considerably lower compared to humans, it would be informative to know the variability of the results among human participants. 

      We agree that this is an interesting question, although it is also very open-ended. For instance, we could report each subject’s individual whole-brain results, but this would take too much space (and the interested reader will be able to do so from the data that we make available as part of this publication). As a step in this direction, we provide below a figure showing the individual congruity effects, separately for each experiment and for each ROI of Table 5, and for each of the 52 participants for whom an fMRI localizer was available:

      Author response image 1.

      Difference in mean betas between congruent and incongruent conditions in a priori linguistic and mathematical ROIs (see definition and analyses in Table 5) in both experiments (experiment 1 = AV, left panel; experiment 2 = VV, right panel). Dots correspond to participants (red: canonical trials; green: reversed trials). The boxplot notch is located at the median and the lower and upper box hinges at the 25th and 75th centiles. Whiskers extend to 1.5 inter-quartile ranges on either side of the hinges. ROIs are ranked by the median of the Incongruent-Congruent difference across canonical and reversed order, within a given experiment. For purposes of comparison between the two experiments, we have underlined with colors the top-five common ROIs between the two experiments. N.s.: non-significant congruity effect (p>0.05)

      Several regions show a rather consistent difference across subjects (see, for instance, the posterior STS in experiment 1, left panel). Overall, only 3 of the 52 participants did not show any beta greater than 2 for canonical or reversed trials in any ROI. The consistency is quite striking, given the limited number of test trials (in total only 16 incongruent trials per direction per participant), and the fact that these ROIs were selected for their responses to spoken or written sentences, as part of a subsidiary task quite different from the main task.

      - Some details are missing in the methods.  

      Thank you for these comments, we reply to them point-by-point below.

      Reviewer #3 (Public Review): 

      This study investigates the hypothesis that humans (but not non-human primates) spontaneously learn reversible temporal associations (i.e., learning a B-A association after only being exposed to A-B sequences), which the authors consider to be a foundational property of symbolic cognition. To do so, they expose humans and macaques to 2-item sequences (in a visual-auditory experiment, pairs of images and spoken nonwords, and in a visual-visual experiment, pairs of images and abstract geometric shapes) in a fixed temporal order, then measure the brain response during a test phase to congruent vs. incongruent pairs (relative to the trained associations) in canonical vs. reversed order (relative to the presentation order used in training). The advantage of neuroimaging for this question is that it removes the need for a behavioral test, which non-human primates can fail for reasons unrelated to the cognitive construct being investigated. In humans, the researchers find statistically indistinguishable incongruity effects in both directions (supporting a spontaneous reversible association), whereas in monkeys they only find incongruity effects in the canonical direction (supporting an association but a lack of spontaneous reversal). Although the precise pattern of activation varies by experiment type (visual-auditory vs. visual-visual) in both species, the authors point out that some of the regions involved are also those that are most anatomically different between humans and other primates. The authors interpret their finding to support the hypothesis that reversible associations, and by extension symbolic cognition, is uniquely human. 

      This study is a valuable complement to prior behavioral work on this question. However, I have some concerns about methods and framing. 

      We thank the reviewer for the careful summary of the manuscript, and the positive comments.

      Methods - Design issues: 

      The authors originally planned to use the same training/testing protocol for both species but the monkeys did not learn anything, so they dramatically increased the amount of training and evaluation. By my calculation from the methods section, humans were trained on 96 trials and tested on 176, whereas the monkeys got an additional 3,840 training trials and 1,408 testing trials. The authors are explicit that they continued training the monkeys until they got a congruity effect. On the one hand, it is commendable that they are honest about this in their write-up, given that this detail could easily be framed as deliberate after the fact. On the other hand, it is still a form of p-hacking, given that it's critical for their result that the monkeys learn the canonical association (otherwise, the critical comparison to the non-canonical association is meaningless). 

      Thank you for this comment. 

      Indeed, for experiment 1, the amount of training and testing was not equal for the humans and monkeys, as also mentioned by reviewer 2. We now describe in more detail how many training and imaging days we used for each experiment and each species, as well as the number of blocks per day and the number of trials per block (see lines 572-577). We also added information on the amount of training received to all of the Table legends.

      We are sorry for giving the impression that we trained until the monkeys learned this. This was not the case. Based on previous literature, we actually anticipated that the short training would not be sufficient, and therefore planned additional training in advance. Specifically, Meyer & Olson (2011) had observed pair learning in the inferior temporal cortex of macaque monkeys after 816 exposures per pair. This is similar to the additional training we gave, about 80 blocks with 12 trials per pair per block. This is now explained in more detail (lines 577-580).
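
      As a quick check of the comparison above, the following sketch works out the approximate number of exposures per pair implied by the stated figures. The block count is given in the text only as "about 80", so the result is approximate; the variable names are illustrative.

```python
# Figures quoted in the response above.
blocks = 80                      # "about 80 blocks" (approximate)
trials_per_pair_per_block = 12   # trials per pair in each block
meyer_olson_exposures = 816      # exposures per pair in Meyer & Olson (2011)

# Approximate exposures per pair in the additional monkey training.
exposures_per_pair = blocks * trials_per_pair_per_block

# How the additional training compares with the earlier study.
ratio = exposures_per_pair / meyer_olson_exposures

print(f"{exposures_per_pair} exposures per pair, "
      f"{ratio:.2f}x the Meyer & Olson figure")
```

      That is roughly 960 exposures per pair, the same order of magnitude as the 816 exposures in the earlier study.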

      Furthermore, we strongly disagree with the pejorative term p-hacking. The aim of the experiment was not to show a congruency effect in the canonical direction in monkeys, but to track and compare their behavior in the same paradigm as that of humans for the reverse direction. It would have been unwise to stop after human-identical training and only show that humans learn better, which is a given. Instead, we looked at brain activations at both times, at the end of human-identical training and when the monkeys had learned the pairs in the canonical direction. 

      Finally, in experiment 2, monkeys were tested after the same 3 days of training as humans. We wrote: “Using this design, we obtained significant canonical congruity effects in monkeys on the first imaging day after the initial training (24 trials per pair), indicating that the animals had learned the associations” (lines 252-253).

      (2) Between-species comparisons are challenging. In addition to having differences in their DNA, human participants have spent many years living in a very different culture than that of NHPs, including years of formal education. As a result, attributing the observed differences to biology is challenging. One approach that has been adopted in some past studies is to examine either young children or adults from cultures that don't have formal educational structures. This is not the approach the authors take. This major confound needs to minimally be explicitly acknowledged up front. 

      Thank you for raising this important point. We already had a section on “limitations” in the manuscript, which we have now extended (lines 478-485). Indeed, this study follows a previous EEG study in 5-month-old infants, in which we already showed that after learning associations between labels and categories, infants spontaneously generalize learning to the reversed pairs after a short learning period in the lab (Kabdebon et al, PNAS, 2019). We also cited preliminary results from the same paradigm as in the current study, but using EEG in 4-month-old infants (Ekramnia and Dehaene-Lambertz, 2019), where we replicated the results obtained by Kabdebon et al. 2019 showing that preverbal infants spontaneously generalize learning to the reversed pairs.

      Functional MRI in awake infants remains a challenge at this age (but see our own work, Dehaene-Lambertz et al, Science, 2002), especially because the experimental design means only a few trials in the conditions of interest (10%) and thus a long experimental duration that exceeds infants’ quietness and attentional capacities in the noisy MRI environment. (We discuss this on lines 493-496.)

      (3) Humans have big advantages in processing and discriminating spoken stimuli and associating them with visual stimuli (after all, this is what words are in spoken human languages). Experiment 2 ameliorates these concerns to some degree, but still, it is difficult to attribute the failure of NHPs to show reversible associations in Experiment 1 to cognitive differences rather than the relative importance of sound string to meaning associations in the human vs. NHP experiences. 

      As the reviewer wrote, we deliberately performed Experiment 2 with visual shapes to control for various factors that might have explained the monkeys' failure in Experiment 1. 

      (4) More minor: The localizer task (math sentences vs. other sentences) makes sense for math but seems to make less sense for language: why would a language region respond more to sentences that don't describe math vs. ones that do? 

      The referee is correct: our use of the word “reciprocally” was improper (although see Amalric and Dehaene, 2016 for significant differences in both directions when non-mathematical sentences concern specific knowledge). We changed the formulation to clarify this as follows: “In these ROIs, we recovered the subject-specific coordinates of each participant’s 10% best voxels in the following comparisons: sentences vs rest for the 6 language ROIs; reading vs listening for the VWFA; and numerical vs non-numerical sentences for the 8 mathematical ROIs.” (lines 678-680).

      Methods - Analysis issues: 

      (5) The analyses appear to "double dip" by using the same data to define the clusters and to statistically test the average cluster activation (Kriegeskorte et al., 2009). The resulting effect sizes are therefore likely inflated, and the p-values are anticonservative. 

      It is not clear to us which result the reviewer is referring to. In Tables 1-4, we report the values that we found significant in the whole brain analysis, we do not report additional statistical tests for this data. For Table 5, the subject-specific voxels were identified through a separate localizer experiment, which was designed to pinpoint the precise activation areas for each subject in the domains of oral and written language-processing and math. Subsequently, we compared the activation at these voxel locations across different conditions of the main experiment. Thus, the two datasets were distinct, and there was no double dipping. In both interpretations of the comment, we therefore disagree with the reviewer.
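
      The logic of the Table 5 analysis described above (voxels selected on an independent localizer, effect then measured on the main experiment) can be sketched as follows. This is only an illustration under stated assumptions: the function names and all numbers are hypothetical, and the actual pipeline operates on subject-specific fMRI maps rather than short lists.

```python
def top_fraction_indices(localizer_values, fraction=0.10):
    """Indices of the best voxels in an ROI, ranked by the localizer contrast."""
    n_keep = max(1, round(len(localizer_values) * fraction))
    ranked = sorted(range(len(localizer_values)),
                    key=lambda i: localizer_values[i], reverse=True)
    return sorted(ranked[:n_keep])


def mean_at(values, indices):
    """Average of the values at the selected voxel indices."""
    return sum(values[i] for i in indices) / len(indices)


# Hypothetical localizer statistics for 20 voxels of one ROI in one subject.
localizer_t = [0.1, 2.5, 0.3, 1.1, 3.9, 0.2, 0.8, 1.7, 0.4, 0.6,
               0.9, 1.3, 0.5, 4.2, 0.7, 1.0, 0.0, 2.1, 1.5, 0.3]

# Hypothetical betas from the main experiment (a separate dataset).
beta_incongruent = [0.2, 0.5, 0.1, 0.3, 1.4, 0.0, 0.2, 0.6, 0.1, 0.3,
                    0.2, 0.4, 0.1, 1.0, 0.2, 0.3, 0.0, 0.7, 0.5, 0.1]
beta_congruent = [0.1, 0.4, 0.1, 0.2, 0.8, 0.1, 0.2, 0.5, 0.0, 0.2,
                  0.2, 0.3, 0.1, 0.6, 0.1, 0.3, 0.1, 0.5, 0.4, 0.1]

# Step 1: select voxels using the localizer data only.
selected = top_fraction_indices(localizer_t, fraction=0.10)

# Step 2: measure the congruity effect at those voxels in the main data.
effect = mean_at(beta_incongruent, selected) - mean_at(beta_congruent, selected)
```

      Because the selection step never sees the main-experiment betas, the effect estimate is not biased by the selection, which is the point made in the reply.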

      Framing: 

      (6) The framing ("Brain mechanisms of reversible symbolic reference: A potential singularity of the human brain") is bigger than the finding (monkeys don't spontaneously reverse a temporal association but humans do). The title and discussion are full of buzzy terms ("brain mechanisms", "symbolic", and "singularity") that are only connected to the experiments by a debatable chain of assumptions. 

      First, this study shows relatively little about brain "mechanisms" of reversible symbolic associations, which implies insights into how these associations are learned, recognized, and represented. But we're only given standard fMRI analyses that are quite inconsistent across similar experimental paradigms, with purely suggestive connections between these spatial patterns and prior work on comparative brain anatomy. 

      We agree with the referee that the term “mechanism” is ambiguous and, for systems neuroscientists, may suggest more than we are able to do here with functional MRI. We changed the title to “Brain areas for reversible symbolic reference, a potential singularity of the human brain”. This title better describes our specific contribution: mapping out the areas involved in reversibility in humans, and showing that they do not seem to respond similarly in macaque monkeys.

      Second, it's not clear what the relationship is between symbolic cognition and a propensity to spontaneously reverse a temporal association. Certainly, if there are inter-species differences in learning preferences this is important to know about, but why is this construed as a difference in the presence or absence of symbols? Because the associations aren't used in any downstream computation, there is not even any way for participants to know which is the sign and which is the signified: these are merely labels imposed by the researchers on a sequential task. 

      As explained in the introduction, the reversibility test addressed a very minimal core property of symbolic reference. There cannot be a symbol if its attachment doesn’t operate in both directions. Thus, this property is necessary, but we agree that it is not sufficient. Indeed, more tests are needed to establish whether and how the learned symbols are used in further downstream compositional tasks (as discussed in our recent TICS paper, Dehaene et al. 2022). We added a sentence in the introduction to acknowledge this fact:

      “Such reversibility is a core and necessary property of symbols, although we readily acknowledge that it is not sufficient, since genuine symbols present additional referential and compositional properties that will not be tested in the present work.” (lines 89-92).

      Third, the word "singularity" is both problematically ambiguous and not well supported by the results. "Singularity" is a highly loaded word that the authors are simply using to mean "that which is uniquely human". Rather than picking a term with diverse technical meanings across fields and then trying to restrict the definition, it would be better to use a different term. Furthermore, even under the stated definition, this study performed a single pairwise comparison between humans and one other species (macaques), so it is a stretch to then conclude (or insinuate) that the "singularity" has been found (see also pt. 2 above). 

      We have published an extensive review including a description of our use of the term “singularity” (Dehaene et al., TICS 2022). Here is a short excerpt: “Humans are different even in domains such as drawing and geometry that do not involve communicative language. We refer to this observation using the term “human cognitive singularity”, the word singularity being used here in its standard meaning (the condition of being singular) as well as its mathematical sense (a point of sudden change). Hominization was certainly a singularity in biological evolution, so much so that it opened up a new geological age (the Anthropocene). Even if evolution works by small continuous change (and sometimes it doesn’t [4]), it led to a drastic cognitive change in humans.”

      We find the referee’s use of the pejorative term “insinuate” quite inappropriate. From the title on, we are quite nuanced and refer only to a “potential singularity”. Furthermore, as noted above, we explicitly mention in the discussion the limitations of our study, and in particular the fact that only a single non-human species was tested (see lines 486-493). We are working hard to get chimpanzee data, but this is remarkably difficult for us, and we hope that our paper will incite other groups to collect more evidence on this point.

      (7) Related to pt. 6, there is circularity in the framing whereby the authors say they are setting out to find out what is uniquely human, hypothesizing that the uniquely human thing is symbols, and then selecting a defining trait of symbols (spontaneous reversible association) *because* it seems to be uniquely human (see e.g., "Several studies previously found behavioral evidence for a uniquely human ability to spontaneously reverse a learned association (Imai et al., 2021; Kojima, 1984; Lipkens et al., 1988; Medam et al., 2016; Sidman et al., 1982), and such reversibility was therefore proposed as a defining feature of symbol representation reference (Deacon, 1998; Kabdebon and Dehaene-Lambertz, 2019; Nieder, 2009).", line 335). They can't have it both ways. Either "symbol" is an independently motivated construct whose presence can be independently tested in humans and other species, or it is by fiat synonymous with the "singularity". This circularity can be broken by a more modest framing that focuses on the core research question (e.g., "What is uniquely human? One possibility is spontaneous reversal of temporal associations.") and then connects (speculatively) to the bigger conceptual landscape in the discussion ("Spontaneous reversal of temporal associations may be a core ability underlying the acquisition of mental symbols").

      We fail to understand the putative circularity that the referee sees in our introduction. We urge him/her to re-read it, and hope that, with the changes that we introduced, it does boil down to his/her summary, i.e. “What is uniquely human? One possibility is spontaneous reversal of temporal associations."

      Reviewer #1 (Recommendations For The Authors): 

      In general, the manuscript was very clear, easy to read, and compelling. I would recommend the authors carefully check the text for consistency and minor typos. For example: 

      The sample size for the monkeys kept changing throughout the paper. E.g., Experiment 1: n = 2 (line 149); n = 3 (line 205).  

      Thank you for catching this error; we corrected it. The number of animals was indeed 2 for experiment 1, and 3 for experiment 2. (Animals JD and YS participated in experiment 1, and JD, JC and DN in experiment 2, so only JD participated in both experiments.)

      Similarly, the number of stimulus pairs is reported inconsistently (4 on line 149, 5 pairs later in the paper). 

      We’re sorry that this was unclear. We used 5 sets of 4 audio-visual pairs each. We now clarify this on line 157 and on lines 514-516.

      At least one case of p>0.0001, rather than p < 0.0001 (I assume). 

      Thank you once again; we have now corrected this.

      Reviewer #2 (Recommendations For The Authors): 

      One major issue in the study is the absence of significant results in monkeys. Indeed, the authors draw conclusions regarding the lack of significant difference in activity related to surprise in the multidemand network (MDN) in the reverse congruent versus reverse incongruent conditions. Although the results are convincing (especially with the significant interaction between congruency and canonicity), the article could be improved by including additional analyses in a priori ROI for the MDN in monkeys (as well as in humans, for comparison). In other words: what are the statistics for the MDN regarding congruity, canonicity, and interaction in both species? Since the authors have already performed this type of analysis for language and Math ROIs (table 5), it should be relatively easy for them to extend it to the MDN. Demonstrating that results in monkeys are far from significant could further convince the reader. 

      Furthermore, while the authors acknowledge in the discussion that the number of monkeys included in the study is considerably lower compared to humans, it would be informative to know the variability of the results among human participants. Specifically, it would be valuable to describe the proportion of human participants in which the effects of congruency, canonicity, and their interaction are significant. Additionally, stating the variability of the F-values for each effect would provide reassurance to the reader regarding the distinctiveness of humans in comparison to monkeys. Low variability in the results would serve to mitigate concerns that the observed disparity is merely a consequence of testing a unique subset of monkeys, which may differ from the general population. Indeed, this would be a greater support to the notion that the dissimilarity stems from a genuine distinction between the two species. 

      We responded to both of these points above.

      In terms of methods, details are missing: 

      - How many trials of each condition are there exactly? (10% of 44 trials is 4.4) : 

      We wrote: “In both humans and monkeys, each block started with 4 trials in the learned direction (congruent canonical trials), one trial for each of the 4 pairs (2 O-L and 2 L-O pairs). The rest of the block consisted of 40 trials in which 70% of trials were identical to the training; 10% were incongruent pairs but the direction (O-L or L-O) was correct (incongruent canonical trials), thus testing whether the association was learned; 10% were congruent pairs but the direction within the pairs was reversed relative to the learned pairs (congruent reversed trials) and 10% were incongruent pairs in reverse (incongruent reversed trials).”(See lines 596-600.)

      Thus, each block comprised 4 initial trials, 28 canonical congruent trials, 4 canonical incongruent, 4 reverse congruent and 4 reverse incongruent trials, i.e. 4 + 28 + 3 × 4 = 44 trials.
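
      As a quick arithmetic check, the following minimal Python sketch tallies the per-condition trial counts implied by the percentages quoted above. The variable and function names are illustrative assumptions and are not taken from the experimental code.

```python
# Block composition implied by the quoted methods text.
# All names here are hypothetical; only the numbers come from the text.
N_INITIAL = 4      # one canonical congruent trial per pair at block start
N_REMAINING = 40   # trials in the rest of the block

# Percentages of the remaining 40 trials assigned to each condition.
PERCENTAGES = {
    "canonical_congruent": 70,
    "canonical_incongruent": 10,
    "reversed_congruent": 10,
    "reversed_incongruent": 10,
}


def block_composition():
    """Return the number of trials per condition in one block."""
    counts = {name: N_REMAINING * pct // 100 for name, pct in PERCENTAGES.items()}
    counts["initial_canonical_congruent"] = N_INITIAL
    return counts


counts = block_composition()
total_trials = sum(counts.values())
```

      Each 10% condition therefore corresponds to a whole number of trials (4), because the percentages apply to the 40 trials that follow the 4 initial ones.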

      - How long is one trial? 

      As written in the method section: “In each trial, the first stimulus (label or object) was presented during 700ms, followed by an inter-stimulus-interval of 100ms then the second stimulus during 700ms. The pairs were separated by a variable inter-trial-interval of 3-5 seconds” i.e. 700 + 100 + 700 = 1500 ms per trial, plus 3 to 4.75 seconds of blank between the trials (see lines 531-533).

      - How are the stimulus presentations jittered? 

      See : “The pairs were separated by a variable inter-trial-interval randomly chosen among eight different durations between 3 and 4.75 seconds (step=250 ms). The series of 8 intervals was randomized again each time it was completed.”(lines 533-535).
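
      The jitter scheme quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the seed argument and the use of Python's `random` module are ours, not a description of the actual presentation software.

```python
import random

# Eight inter-trial intervals from 3000 ms to 4750 ms in 250 ms steps.
ITI_MS = list(range(3000, 5000, 250))

# Fixed part of a trial: 700 ms stimulus + 100 ms gap + 700 ms stimulus.
TRIAL_MS = 700 + 100 + 700


def iti_sequence(n_trials, seed=None):
    """Return n_trials intervals; the set of 8 is reshuffled each time it is used up."""
    rng = random.Random(seed)  # hypothetical seeding, for reproducibility only
    sequence = []
    while len(sequence) < n_trials:
        cycle = ITI_MS[:]      # fresh copy of the 8 durations
        rng.shuffle(cycle)     # new random order for this cycle
        sequence.extend(cycle)
    return sequence[:n_trials]


itis = iti_sequence(44, seed=0)
```

      With this scheme every run of eight consecutive intervals that starts at a cycle boundary contains each duration exactly once, so the mean interval stays close to 3875 ms regardless of the random order.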

      - What is the statistical power achieved for humans? And for monkeys? 

      We know of no standard way to define power for fMRI experiments. Power will depend on so many parameters, including the fMRI signal-to-noise ratio, the attention of the subject, the areas being considered, the type of analysis (whole-brain versus ROIs), etc.

      - Videos are mentioned in the methods, is it the image and sound? It is not clear. 

      We’re sorry that it was unclear. Videos were only used for the training of the human subjects. We have now corrected this in the method section (lines 552-554).

      Reviewer #3 (Recommendations For The Authors): 

      The main recommendations are to adjust the framing (making it less bold and more connected to the empirical evidence) and to ensure independence in the statistical analyses of the fMRI data. 

      See our replies to the reviewer’s comments on “Framing” above. In particular, we changed the title of the paper from “Brain mechanisms of reversible symbolic reference” to “Brain areas for reversible symbolic reference”.

      References cited in this response

      Dehaene, S., Al Roumi, F., Lakretz, Y., Planton, S., & Sablé-Meyer, M. (2022). Symbols and mental programs : A hypothesis about human singularity. Trends in Cognitive Sciences, 26(9), 751‑766. https://doi.org/10.1016/j.tics.2022.06.010.

      Dehaene-Lambertz, G., Dehaene, S., & Hertz-Pannier, L. (2002). Functional neuroimaging of speech perception in infants. Science, 298(5600), 2013-2015. https://doi.org/10.1126/science.1077066

      Ekramnia M, Dehaene-Lambertz G. 2019. Investigating bidirectionality of associations in young infants as an approach to the symbolic system. Presented at the CogSci. p. 3449.

      Fedorenko E, Duncan J, Kanwisher N (2013) Broad domain generality in focal regions of frontal and parietal cortex. Proc Natl Acad Sci U S A 110:16616-16621.

      Kabdebon, C., & Dehaene-Lambertz, G. (2019). Symbolic labeling in 5-month-old human infants. Proceedings of the National Academy of Sciences, 116(12), 5805-5810. https://doi.org/10.1073/pnas.1809144116

      Mitchell, D. J., Bell, A. H., Buckley, M. J., Mitchell, A. S., Sallet, J., & Duncan, J. (2016). A Putative Multiple-Demand System in the Macaque Brain. Journal of Neuroscience, 36(33), 8574‑8585. https://doi.org/10.1523/JNEUROSCI.0810-16.2016

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      The biogenesis of outer membrane proteins (OMPs) into the outer membranes of Gram-negative bacteria is still not fully understood, particularly substrate recognition and insertion by the beta-barrel assembly machinery (BAM). In this study, the authors report that in addition to recognition by the last strand of an OMP, sometimes referred to as the beta-signal, an additional signal upstream of the last strand is also important for OMP biogenesis.

      Strengths:

      1. Overall the manuscript is well organized and written, and addresses an important question in the field. The idea that BAM recognizes multiple signals on OMPs has been presented previously, however, it was not fully tested.

      2. The authors here re-address this idea and propose that it is a more general mechanism used by BAM for OMP biogenesis.

      3. The notion that additional signals assist in biogenesis is an important concept that indeed needs to be fully tested in OMP biogenesis.

      4. A significant study was performed with extensive experiments reported in an attempt to address this important question in the field.

      5. The identification of important crosslinks and regions of substrates and Bam proteins that interact during biogenesis is an important contribution that gives clues to the path substrates take en route to the membrane.

      Weaknesses:

      Major critiques (in no particular order):

      1. The title indicates 'simultaneous recognition', however no experiments were presented that test the order of interactions during OMP biogenesis.

      We have replaced the word “Simultaneous” with “Dual” so as not to reflect on the timing of the recognition events for the distinct C-terminal signal and -5 signal.

      2. Aspects of the study focus on the peptides that appear to inhibit OmpC assembly, but should also include an analysis of the peptides that do not, to determine whether the motif(s) are still present or not.

      We thank the reviewer for this comment. Our study focuses on the peptides which exhibited an inhibitory effect in order to elucidate further interactions between the BAM complex and substrate proteins, especially in the early stage of the assembly process. In the case of peptide 9, which contains all of our proposed elements but did not have an inhibitory effect, there is an arginine residue at the polar position next to the hydrophobic residue in position 0 (0 Φ). As seen in Fig. S5, S6, and S7, there are no positively charged amino acids in the polar residue positions in the -5 or last strands. This might be the reason why peptide 9, as well as peptide 24 (the β-signal derived from the mitochondrial OMP Tom40, which contains a lysine at the polar position), did not display an inhibitory effect. Incorporating the reviewer's suggestions might reveal features that should be excluded from the elements, but this is not the focus of this paper and was not discussed to avoid complicating the paper.

      3. The β-signal is known to form a β-strand, therefore it is unclear why the authors did not choose to chop OmpC up according to its strands, rather than by a fixed peptide size. What was the rationale for how the peptide lengths were chosen since many of them partially overlap known strands, and only partially (2 residues) overlap each other? It may not be too surprising that most of the inhibitory peptides consist of full strands (#4, 10, 21, 23).

      A simple scan of known β-strands would have been an alternative approach, however this comes with the bias of limiting the experiments to predicted substrate (strand) sequences, and it presupposes that the secondary structure element would be formed by this tightly truncated peptide.

      Instead, we allowed for the possibility that OMPs meet the BAM complex in an unfolded or partially folded state, and that the secondary structure (β-strand) might only form via β-augmentation after the substrate is placed in the context of the lateral gate. We therefore used peptides that mapped right across the entirety of OmpC, with a two amino acid overlap.

      To clarify this important point regarding the unbiased nature of our screen, we have revised the text:

      (Lines 147-151) "We used peptides that mapped the entirety of OmpC, with a two amino acid overlap. This we considered preferable to peptides that were restricted by structural features, such as β-strands, in consideration that β-strand formation may or may not have occurred in early-stage interactions at the BAM complex."

      4. It would be good to have an idea of the propensity of the chosen peptides to form β-strands and participate in β-augmentation. We know from previous studies with darobactin and other peptides that they can inhibit OMP assembly by competing with substrates.

      We appreciate the reviewer's suggestion. However, we have not conducted biophysical characterizations of the peptides to calculate the propensity of each peptide to form β-strands and participate in β-augmentation. The sort of detailed biophysical analysis done for darobactin (by the Maier and Hiller groups: "The antibiotic darobactin mimics a β-strand to inhibit outer membrane insertase", Nature 593:125-129) was a Nature publication based on this single peptide. A further biophysical analysis of all of the peptides presented here goes well beyond the scope of our study.

      5. The recognition motifs that the authors present span up to 9 residues which would suggest a relatively large binding surface, however, the structures of these regions are not large enough to accommodate these large peptides.

      The β-signal motif (ζxGxx[Ω/Φ]x[Ω/Φ]) is an 8-residue consensus, some of the inhibitory peptides include additional residues before and after the defined motif of 8 residues, and the lateral gate of BamA has been shown to interact with a 7-residue span (e.g. Doyle et al., 2022). Cross-linking presented in our study showed BamD residues R49 and G65 cross-linked to the positions 0 and 6 of the internal signal in OmpC (Fig. 6D).

      We appreciate this point of clarification and have modified the text to acknowledge that in the final registering of the peptide with its binding protein, some parts of the peptide might sit beyond the bounds of the BamD receptor’s binding pocket and the BamA lateral gate:

      (Lines 458-471) "The β-signal motif (ζxGxx[Ω/Φ]x[Ω/Φ]) is an eight-residue consensus, and the internal signal motif is composed of a nine-residue consensus. Recent structures have shown the lateral gate of BamA interacts with a 7-residue span of substrate OMPs. Interestingly, inhibitory compounds, such as darobactin, mimic only three residues of the C-terminal side of the β-signal motif. Cross-linking presented here in our study showed that BamD residues R49 and G65 cross-linked to the positions 0 and 6 of the internal signal in OmpC (Fig. 6D). Both signals are larger than the assembly machinery's signal binding pocket, implying that the signal might sit beyond the bounds of the signal binding pocket in BamD and the lateral gate in BamA. These findings are consistent with similar observations in other signal sequence recognition events, such as the mitochondrial targeting presequence signal that is longer than the receptor groove formed by Tom20, the subunit of the translocator of outer membrane (TOM) complex (Yamamoto et al., 2011). The presequence has been shown to bind to Tom20 in several different conformations within the receptor groove (Nyirenda et al., 2013)."

      Moreover, the distance between amino acids of BamD which cross-linked to the internal signal, R49 and Y62, is approximately 25 Å (pdbID used 7TT3). The distance of the maximum amino acid length of the internal signal of OmpC, from F280 to Y288, is approximately 22 Å (pdbID used 2J1N). This would allow for the signal to fit within the confines of the TPR motif of BamD.

      Author response image 1.

      6. The authors highlight that the sequence motifs are common among the inhibiting peptides, but do not test if this is a necessary motif to mediate the interactions. It would have been good to see if a library of non-OMP related peptides that match this motif could also inhibit or not.

      With respect, this additional work would not address any biological question relevant to the function of BamD. To randomize sequences and then classify those that do or don’t fit the motif would help in refining the parameters of the β-signal motif, but that was not our intent.

      We have identified the peptides from within the total sequence of an OMP, shown which peptides inhibit in an assembly assay, and then observed that the inhibitory peptides conform to a previously published (β-signal) motif.

      7. In the studies that disrupt the motifs by mutagenesis, an effect was observed and attributed to disruption of the interaction of the 'internal signal'. However, the literature is filled with point mutations in OMPs that disrupt biogenesis, particularly those within the membrane region. F280, Y286, V359, and Y365 are all residues that are in the membrane region that point into the membrane. Therefore, more work is needed to confirm that these mutations are in parts of a recognition motif rather than on the residues that are disrupting stability/assembly into the membrane.

      As the reviewer pointed out, the side chains of the amino acids constituting the signal elements we determined all face the lipid side, and of these Y286 and Y365 were important for folding as well as for recognition. However, F280A and V359A had no effect on folding, but only on assembly through the BAM complex. The fact that position 0 functions as a signal has been demonstrated by peptidomimetics (Fig. 1) and point mutant analysis (Fig. 2). We appreciate this clarification and have modified the text to acknowledge that all of the signal elements face the lipid side, which ultimately contributes to their stability in the membrane, and that before this the BAM complex actively recognizes them and determines their orientation:

      (Lines 519-526) "After OMP assembly, all elements of the internal signal are positioned such that they face into the lipid-phase of the membrane. This observation may be a coincidence, or may be utilized by the BAM complex to register and orientate the lipid-facing amino acids in the assembling OMP away from the formative lumen of the OMP. Amino acids at position 6, such as Y286 in OmpC, are not only a component of the internal signal for binding by the BAM complex, but also act in a structural capacity to register the aromatic girdle for optimal stability of the OMP in the membrane."

      8. The title of Figure 3 indicates that disrupting the internal signal motif disrupts OMP assembly, however, the point mutations did not seem to have any effect. Only when both 280 and 286 were mutated was an effect observed. And even then, the trimer appeared to form just fine, albeit at reduced levels, indicating assembly is just fine, rather the rate of biogenesis is being affected.

      We appreciate this point and have revised the title of Figure 3 to be:

      (Lines 1070-1071) "Modifications in the putative internal signal slow the rate of OMP assembly in vivo."

      9. In Figure 4, the authors attempt to quantify their blots. However, this seems to be a difficult task given the lack of quality of the blots and the spread of the intended signals, particularly of the 'int' bands. However, the more disturbing trend is the obvious reduction in signal from the post-urea treatment, even for the WT samples. The authors are using urea washes to indicate removal of only stalled substrates. However a reduction of signal is also observed for the WT. The authors should quantify this blot as well, but it is clear visually that both WT and the mutant have obvious reductions in the observable signals. Further, this data seems to conflict with Fig 3D where no noticeable difference in OmpC assembly was observed between WT and Y286A, why is this the case?

      We have addressed this point by adding a statistical analysis on Fig. 4A. As the reviewer points out, BN-PAGE band quantification is a difficult task given the broad spread of the bands on these gels. Statistical analysis showed that the increase in intermediates (int) was statistically significant for Y286A at all times until 80 min, when the intermediate form signals decrease.

      (Lines 1093-1096) "Statistical significance was indicated by the following: N.S. (not significant), p<0.05; *, p<0.005; **. Exact p values of intermediate formed by Wt vs Y286A at each timepoint were as follows; 20 minutes: p = 0.03077, 40 minutes: p = 0.02402, 60 minutes: p = 0.00181, 80 minutes: p = 0.0545."
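
      To make the reported thresholds concrete, the short Python sketch below maps the four quoted p values onto significance labels. It assumes the common convention of one asterisk for p < 0.05 and two asterisks for p < 0.005; the function name and data layout are illustrative only.

```python
def significance_label(p):
    """Map a p value to a label using the two thresholds quoted in the legend.

    The asterisk symbols are an assumed convention; only the cut-offs
    (0.05 and 0.005) are taken from the text.
    """
    if p < 0.005:
        return "**"
    if p < 0.05:
        return "*"
    return "N.S."


# Exact p values quoted in the legend (Wt vs Y286A intermediate), keyed by minutes.
P_VALUES = {20: 0.03077, 40: 0.02402, 60: 0.00181, 80: 0.0545}

labels = {minutes: significance_label(p) for minutes, p in P_VALUES.items()}
```

      Under these thresholds the 80 minute timepoint (p = 0.0545) falls just above the 0.05 cut-off, in line with the statement that the difference holds until the intermediate signal decreases at 80 min.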

      Further, regarding the Int. band, we have corrected the statement as follows.

      (Lines 253-254) "Consistent with this, the assembly intermediate which was prominently observed at the OmpC(Y286A) can be extracted from the membranes with urea;"

      OMP assembly in vivo has additional periplasmic chaperones and factors present in order to support the assembly process. Therefore, it is likely that some proteins were assembled properly in vivo compared to their in vitro counterparts. Such a decrease has been observed not only in E. coli but also in mitochondrial OMP import (Yamano et al., 2010).

      10. The pull-down assays with BamA and BamD should include a no protein control at the least to confirm there is no non-specific binding to the resin. Also, no detergent was mentioned as part of the pull downs that contained BamA or OmpC, nor was it detailed if OmpC was urea solubilized.

      We have performed pull down experiments with a no-protein (Ni-NTA only) control as noted (Author response image 2). The results showed that the amount of OmpC carrying through on beads only was significantly lower than the amount of OmpC bound in the presence of BamD or BamA. The added OmpC was not treated with urea, but was synthesized by in vitro translation; the in vitro translated OmpC is the standard substrate in the EMM assembly assay (Supp Fig. S1) where it is recognized by the BAM complex. Thus, we used it for pull-down as well and, to make this clearer, we have revised as follows:

      Author response image 2.

      Pull down assay of radio-labelled OmpC with indicated protein or Ni-NTA alone (Ni-NTA). T; total, FT; flow-through, W; wash, E; elute.

      (Lines 252-265) "Three subunits of the BAM complex have been previously shown to interact with the substrates: BamA, BamB, and BamD (Hagan et al., 2013; Harrison, 1996; Ieva et al., 2011). In vitro pull-down assay showed that while BamA and BamD can independently bind to the in vitro translated OmpC polypeptide (Fig. S9A), BamB did not (Fig. S9B)."

      11.

      • The neutron reflectometry experiments are not convincing primarily due to the lack controls to confirm a consistent uniform bilayer is being formed and even if so, uniform orientations of the BamA molecules across the surface.

      • Further, no controls were performed with BamD alone, or with OmpC alone, and it is hard to understand how the method can discriminate between an actual BamA/BamD complex versus BamA and BamD individually being located at the membrane surface without forming an actual complex.

      • Previous studies have reported difficulty in preparing a complex with BamA and BamD from purified components.

      • Additionally, little signal differences were observed for the addition of OmpC. However, an elongated unfolded polypeptide that is nearly 400 residues long would be expected to produce a large distinct signal given that only the C-terminal portion is supposedly anchored to BAM, while the rest would be extended out above the surface.

      • The depiction in Figure 5D is quite misleading when viewing the full structures on the same scales with one another.

      We have addressed these five points individually as follows.

      i. The uniform orientation of BamA on the surface is guaranteed by the fixation through a His-tag engineered into extracellular loop 6 of BamA and has been validated in previous studies as cited in the text. Moreover, to test this, we constructed an alternative theoretical model in which BamA is not well oriented in the system, as shown below. However, we found that the solid lines (after fitting) did not align well with the experimental data. We therefore assumed that BamA is well oriented in the membrane bilayer.

      Author response image 3.

      Experimental (symbols) and fitted (curves) NR profiles of BamA not oriented well in the POPC bilayer in D2O (black), GMW (blue) and H2O (red) buffer.

      ii. There would be no means by which to do a control with OmpC alone or BamD alone, as neither protein binds to the lipid layer chip. OmpC is diluted from urea and then the unbound OmpC is washed from the chip before NR measurements. BamD does not have an acyl group to anchor it to the lipid layer; without BamA to anchor to, it too is washed from the chip before NR measurements. We have constructed another theoretical model in which BamA and BamD are both embedded in the membrane bilayer, and the fits are shown below. The fits did not align well with the experimental data, which argues against BamA and BamD being individually located at the membrane surface without forming an actual complex.

      Author response image 4.

      Experimental (symbols) and fitted (curves) NR profiles of BamA+D embedding together in the POPC bilayer in D2O (black), GMW (blue) and H2O (red) buffer.

      iii. The previous studies that reported difficulty in preparing a complex with BamA and BamD from purified components were assays done in aqueous solution including detergent solubilized BamA, or with BamA POTRA domains only. Our assay is superior in that it reports the binding of BamD to a purified BamA that has been reconstituted in a lipid bilayer.

      iv. The relatively small signal differences observed for the addition of OmpC are expected, since OmpC is an elongated, unfolded polypeptide of nearly 400 residues which, in the context of this assay, can occupy a huge variation in the positions at which it will sit with only the C-terminal portion anchored to BAM, and the rest moving randomly about and extended from the surface.

      v. We appreciate the point raised and have now added a note in the Figure legend that these are depictions of the results and not a scale drawing of the structures.

      12. In the crosslinking studies, the authors show 17 crosslinking sites (43% of all tested) on BamD crosslinked with OmpC. Given that the authors are presenting specific interactions between the two proteins, this is worrisome as the crosslinks were found across the entire surface of BamD. How do the authors explain this? Are all these specific or non-specific?

      The crosslinking experiment using purified BamD was an effective assay for comprehensive analysis of the interaction sites between BamD and the substrate. However, as the reviewer pointed out, cross-linking was observed even at sites that, in the context of the BAM complex, interact with BamC as a protein-protein interaction and would not be available for substrate protein-protein interactions. To complement this analysis and to address this issue, we also performed the experiment in Fig. 6C.

      In Fig. 6C, the interaction of BamD with the substrate is examined in vivo, and the results demonstrate that if BPA is introduced into the site we designated as the substrate recognition site, it is cross-linked to the substrate. On the other hand, position 114 was found to crosslink with the substrate in vitro, but not in vivo. It should be noted that position 114 has also been confirmed to form cross-link products with BamC; we therefore believe that BamD-substrate interactions in the native state have been investigated. To explain the above, we have added the following description to the Results section.

      (Lines 319-321) "Structurally, these amino acids locate to both the lumen side of the funnel-like structure (e.g. 49 or 62) and the outside of the funnel-like structure, such as the BamC binding site (e.g. 114) (fig. S12C)."

      (Lines 350-357) "Positions 49, 53, 65, and 196 of BamD face the interior of the funnel-like structure of the periplasmic domain of the BAM complex, while position 114 is located outside of the funnel-like structure (Bakelar et al., 2016; Gu et al., 2016; Iadanza et al., 2016). We note that while position 114 was cross-linked with OmpC in vitro using purified BamD, this was not seen with in vivo cross-linking. Instead, in the context of the BAM complex, position 114 of BamD binds to the BamC subunit and would not be available for substrate binding in vivo (Bakelar et al., 2016; Gu et al., 2016; Iadanza et al., 2016)."

      13. The study in Figure 6 focuses on defined regions within the OmpC sequence, but a broader range is necessary to demonstrate specificity to these regions vs binding to other regions of the sequence as well. If the authors wish to demonstrate a specific interaction to this motif, they need to show no binding to other regions.

      The region of affinity for the BAM complex was determined by peptidomimetic analysis, and the signal region was further identified by mutational analysis of OmpC. Subsequently, the subunit that recognizes the signal region was identified as BamD. In other words, in the process leading up to Fig. 6, we were able to analyze in detail that other regions were not the target of the study. We have revised the text to make clear that we focus on the signal region including the internal signal, and have not also analyzed other parts of the signal region:

      (Lines 329-332) "As our peptidomimetic screen identified conserved features in the internal signal, and cross-linking highlighted the N-terminal and C-terminal TPR motifs of BamD as regions of interaction with OmpC, we focused on amino acids specifically within the β-signals of OmpC and regions of BamD which interact with β-signal."

      14. The levels of the crosslinks are barely detectable via western blot analysis. If the interactions between the two surfaces are required, why are the levels for most of the blots so low?

      These are western blots of cross-linked products – the efficiency of cross-linking is far less than 100% of the interacting protein species present in a binding assay and this explains why the levels for the blots are ‘so low’. We have added a sentence to the revised manuscript to make this clear for readers who are not molecular biologists:

      (Lines 345-348) "These western blots reveal cross-linked products representing the interacting protein species. Photo cross-linking of unnatural amino acid is not a 100% efficient process, so the level of cross-linked products is only a small proportion of the molecules interacting in the assays."

      15.

      • Figure 7 indicates that two regions of BamD promote OMP orientation and assembly, however, none of the experiments appears to measure OMP orientation?

      • Also, one common observation from panel F was that not only was the trimer reduced, but also the monomer. But even then, still a percentage of the trimer is formed, not a complete loss.

      (i) We appreciate this point and have revised the title of Figure 7 to be:

      (Lines 1137-1138) "Key residues in two structurally distinct regions of BamD promote β-strand formation and OMP assembly."

      (ii) In our description of Fig. 7F (Lines 356-360) we do not distinguish between the amount of monomer and trimer forms, since both are reflective of the overall assembly rate i.e. assembly efficiency. Rather, we state that:

      "The EMM assembly assay showed that the internal signal binding site was as important as the β-signal binding site to the overall assembly rates observed for OmpC (Fig. 7F), OmpF (fig. S15D), and LamB (fig. S15E). These results suggest that recognition of both the C-terminal β-signal and the internal signal by BamD is important for efficient protein assembly."

      16.

      • The experiment in Fig 7B would be more conclusive if it was repeated with both the Y62A and R197A mutants and a double mutant. These controls would also help resolve any effect from crowding that may also promote the crosslinks.

      • Further, the mutation of R197 is an odd choice given that this residue has been studied previously and was found to mediate a salt bridge with BamA. How was this resolved by the authors in choosing this site since it was not one of the original crosslinking sites?

      As stated in the text, the purpose of the experiment in Figure 7B is to measure the impact of pre-forming a β-strand in the substrate (OmpC) before providing it to the receptor (BamD). We thank the reviewer for the comment on the R197 position of BamD. The C-terminal domain of BamD has been suggested to mediate the BamA-BamD interface, specifically BamD R197 amino acid creates a salt-bridge with BamA E373 (Ricci et al., 2012). It had been postulated that the formation of this salt-bridge is not strictly structural, with R197 highlighted as a key amino acid in BamD activity and this salt-bridge acts as a “check-point” in BAM complex activity (Ricci et al., 2012, Storek et al., 2023). Our results agree with this, showing that the C-terminus of BamD acts in substrate recognition and alignment of the β-signal (Fig. 6, Fig S12). We show that amino acids in the vicinity of R197 (N196, G200, D204) cross-linked well to substrate and mutations to the β-signal prevent this interaction (Fig S12B, D). For mutational analysis of BamD, we looked then at the conservation of the C-terminus of BamD and determined R197 was the most highly conserved amino acid (Fig 6C). In order to account for this, we have adjusted the manuscript:

      (Lines 376-377) "R197 has previously been isolated as a suppressor mutation of a BamA temperature sensitive strain (Ricci et al., 2012)."

      (Lines 495-496) "This adds an additional role of the C-terminus of BamD beyond a complex stability role (Ricci et al., 2012; Storek et al., 2023)."

      17. As demonstrated by the authors in Fig 8, the mutations in BamD lead to reduction in OMP levels for more than just OmpC and issues with the membrane are clearly observable with Y62A, although not with R197A in the presence of VCN. The authors should also test with rifampicin which is smaller and would monitor even more subtle issues with the membrane. Oddly, no growth was observed for the Vec control in the lower concentration of VCN, but was near WT levels for 3 times VCN, how is this explained?

      While it would be interesting to correlate the extent of differences to the molecular size of different antibiotics such as rifampicin, such correlations are not the intended aim of our study. Vancomycin (VCN) is a standard measure of outer membrane integrity in our field, hence its use in our tests for membrane integrity.

      We apologize to the reviewer, as Figure 8D-G may have been misleading. Figure 8D and E use bamD shut-down cells expressing plasmid-borne BamD mutants, whereas Figure 8F and G use the same strain as in Figure 3. We have adjusted the figure as well as the figure legend: (Lines 1165-1169) "D, E, E. coli bamD depletion cells expressing mutations at residues Y62A and R197A, in the β-signal recognition regions of BamD, were grown in the presence of VCN. F, G, E. coli cells expressing mutations to the OmpC internal signal, as shown in Fig 3, grown in the presence of VCN. Mutations to two key residues of the internal signal were sensitive to the presence of VCN."

      18. While Fig 8I indeed shows diminished levels for FY as stated, little difference was observed for the trimer for the other mutants compared to WT, although differences were observed for the dimer. Interestingly, the VY mutant has nearly WT levels of dimer. What do the authors postulate is going on here with the dimer to trimer transition? How do the levels of monomer compare, which is not shown?

      The BN-PAGE gel system cannot resolve protein species that migrate below ~50 kDa, and the monomeric species of the OMPs are below this size. We therefore cannot comment on effects on the monomer because it is not visualized. The non-cropped gel image is shown here. Recently, Hussain et al. showed that, in an in vitro proteo-liposome system, OmpC assembly progresses through a “short-lived dimeric” form before the final process of trimerization (Hussain et al., 2021). However, their findings suggest that LPS plays the final role in stimulating the dimer-to-trimer transition, a step well past the recognition step of the β-signals. Mutations to the internal signal of OmpC result in the formation of an intermediate, the substrate stalled on the BAM complex. This stalling presumably hinders the BAM complex, resulting in reduced trimer and loss of the dimeric OmpF signal in the EMM of cells expressing the OmpC double mutant, FY. We have noted this in the revised text:

      Author response image 5.

      Non-cropped gel of Fig. 8I. The asterisk indicates a band observed in the sample loading wells at the top of the gel.

      (Lines 417-418) "The dimeric form of endogenous OmpF was prominently observed in both the OmpC(WT) as well as the OmpC(VY) double mutant cells."

      19. In the discussion, the authors indicate they have '...defined an internal signal for OMP assembly', however, their study is limited and only investigates a specific region of OmpC. More is needed to definitively say this for even OmpC, and even more so to indicate this is a general feature for all OMPs.

      We acknowledge the reviewer's comment on this point and have expanded the statement to make sure that the conclusion is justified with the specific evidence that is shown in the paper and the supplementary data. We now state:

      (Lines 444-447) "This internal signal corresponds to the -5 strand in OmpC and is recognized by BamD. Sequence analysis shows that similar sequence signatures are present in other OMPs (Figs. S5, S6 and S7). These sequences were investigated in two further OMPs: OmpF and LamB (Fig. 2C and D)."

      Note, we did not state that this is a general feature for all OMPs. That would not be a reasonable proposition.

      20.

      • In the proposed model in Fig 9, it is hard to conceive how 5 strands will form along BamD given the limited surface area and tight space beneath BAM.

      • More concerning is that the two proposal interaction sites on BamD, Y62 and R197, are on opposite sides of the BamD structure, not along the same interface, which makes this model even more unlikely.

      • As evidence against this model, in Figure 9E, the two indicates sites of BamD are not even in close proximity of the modeled substrate strands.

      We can address the reviewer’s three concerns here:

      i. The first point is that the region (formed by BamD engaged with POTRA domains 1-2 and 5 of BamA) is not sufficient to accommodate five β-strands. Structural analysis reveals that the interaction between the N-terminal side of BamD and POTRA1-2 changes conformation substantially upon substrate binding, and that this surface is greatly extended. This surface does have enough space to accommodate five β-strands, as now documented in Fig. 9D and 9E using the latest structures (7TT5 and 7TT2) as illustrations. The text now reads:

      (Lines 506-515) "Spatially, this indicates the BamD can serve to organize two distinct parts of the nascent OMP substrate at the periplasmic face of the BAM complex, either prior to or in concert with, engagement to the lateral gate of BamA. Assessing this structurally showed the N-terminal region of BamD (interacting with the POTRA1-2 region of BamA) and the C-terminal region of BamD (interacting with POTRA5 proximal to the lateral gate of BamA) (Bakelar et al., 2016; Gu et al., 2016; Tomasek et al., 2020) has the N-terminal region of BamD changing conformation depending on the folding states of the last four β-strands of the substrate OMP, EspP (Doyle et al., 2022). The overall effect of this being a change in the dimensions of this cavity change, a change which is dependent on the folded state of the substrate engaged in it (Fig 9 B-E)."

      ii. The second point raised regards the orientation of the substrate recognition residues of BamD. Both Y62 and R197 are located on the lumen side of the funnel in the EspP-BAM transport intermediate structure (PDBID; 7TTC). Y62 lies relatively close to the edge of BamD, but given that POTRA1-2 undergoes a conformational change and opens this region, as described above, both residues are positioned where they could bind substrates. This is explained in the following text in the Results section of the revised manuscript:

      (Lines 377-379) "Each residue was located on the lumen side of the funnel-like structure in the EspP-BAM assembly intermediate structure (PDBID; 7TTC) (Doyle et al., 2022)."

      Reviewer #2 (Public Review):

      Previously, using bioinformatics study, authors have identified potential sequence motifs that are common to a large subset of beta-barrel outer membrane proteins in gram negative bacteria. Interestingly, in that study, some of those motifs are located in the internal strands of barrels (not near the termini), in addition to the well-known "beta-signal" motif in the C-terminal region.

      Here, the authors carried out rigorous biochemical, biophysical, and genetic studies to prove that the newly identified internal motifs are critical to the assembly of outer membrane proteins and the interaction with the BAM complex. The author's approaches are rigorous and comprehensive, whose results reasonably well support the conclusions. While overall enthusiastic, I have some scientific concerns with the rationale of the neutron refractory study, and the distinction between "the intrinsic impairment of the barrel" vs "the impairment of interaction with BAM" that the internal signal may play a role in. I hope that the authors will be able to address this.

      Strengths:

      1. It is impressive that the authors took multi-faceted approaches using the assays on reconstituted, cell-based, and population-level (growth) systems.

      2. Assessing the role of the internal motifs in the assembly of model OMPs in the absence and presence of BAM machinery was a nice approach for a precise definition of the role.

      Weaknesses:

      1. The result section employing the neutron refractory (NR) needs to be clarified and strengthened in the main text (from line 226). In the current form, the NR result seems not so convincing.

      What is the rationale of the approach using NR?

      We have now modified the text to make clear that:

      (Lines 276-280) "The rationale to these experiments is that NR provides: (i) information on the distance of specified subunits of a protein complex away from the atomically flat gold surface to which the complex is attached, and (ii) allows the addition of samples between measurements, so that multi-step changes can be made to, for example, detect changes in domain conformation in response to the addition of a substrate."

      What is the molecular event (readout) that the method detects?

      We have now modified the text to make clear that:

      (Lines 270-274) "While the biochemical assay demonstrated that the OmpC(Y286A) mutant forms a stalled intermediate with the BAM complex, in a state in which membrane insertion was not completed, biochemical assays such as this cannot elucidate where on BamA-BamD this OmpC(Y286A) substrate is stalled."

      What are "R"-y axis and "Q"-x axis and their physical meanings (Fig. 5b)?

      The neutron reflectivity, R, refers to the ratio of the incoming and exiting neutron beams, and it is measured as a function of the momentum transfer Q, defined as Q = 4π sin θ/λ, where θ is the angle of incidence and λ is the neutron wavelength. R(Q) is approximately given by R(Q) = (16π²/Q²)|ρ(Q)|², where ρ(Q) is the one-dimensional Fourier transform of ρ(z), the scattering length density (SLD) distribution normal to the surface. SLD is the sum of the coherent neutron scattering lengths of all atoms in the sample layer divided by the volume of the layer. Therefore, the intensity of the reflected beams is highly dependent on the thickness, densities and interface roughness of the samples. This is explained in the following text in the Methods section of the revised manuscript:

      (Lines 669-678) "Neutron reflectivity, denoted as R, is the ratio of the incoming to the exiting neutron beams. It’s calculated based on the Momentum transfer Q, which is defined by the formula Q = 4π sin θ/λ, where θ represents the angle of incidence and λ stands for the neutron wavelength. The approximate value of R(Q) can be expressed as R(Q) = (16π²/Q²)|ρ(Q)|², where ρ(Q) is the one-dimensional Fourier transform of ρ(z), which is the scattering length density (SLD) distribution perpendicular to the surface. SLD is calculated by dividing the sum of the coherent neutron scattering lengths of all atoms in a sample layer by the volume of that layer. Consequently, factors such as thickness, volume fraction, and interface roughness of the samples significantly influence the intensity of the reflected beams."
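
      The relations quoted above can be checked numerically. The following is a minimal sketch, not the authors' fitting procedure: it evaluates Q = 4π sin θ/λ and the approximate reflectivity (16π²/Q²)|ρ(Q)|² for a single uniform slab, and compares a numerical Fourier transform against the closed form for that slab. The slab thickness, SLD value, angle and wavelength are hypothetical illustration values, and the function names are assumptions introduced here. The approximation is kinematic, so it is only meaningful well above the critical edge and does not replace the seven-layer model fit described in the response.

```python
import cmath
import math


def momentum_transfer(theta_rad, wavelength):
    """Q = 4*pi*sin(theta)/lambda, in inverse units of the wavelength."""
    return 4.0 * math.pi * math.sin(theta_rad) / wavelength


def sld_fourier_transform(q, profile, dz):
    """Midpoint-rule estimate of rho(Q) = integral of rho(z) * exp(i*Q*z) dz."""
    total = 0j
    for i, rho in enumerate(profile):
        z = (i + 0.5) * dz  # midpoint of the i-th sampling interval
        total += rho * cmath.exp(1j * q * z) * dz
    return total


def approx_reflectivity(q, profile, dz):
    """R(Q) ~ (16*pi^2 / Q^2) * |rho(Q)|^2 (kinematic approximation)."""
    rho_q = sld_fourier_transform(q, profile, dz)
    return 16.0 * math.pi ** 2 / q ** 2 * abs(rho_q) ** 2


# Hypothetical single slab: 50 A thick, SLD 2e-6 A^-2, sampled every 0.05 A.
thickness, rho0, dz = 50.0, 2.0e-6, 0.05
profile = [rho0] * int(round(thickness / dz))

# Hypothetical geometry: 1 degree incidence, 5 A neutrons.
q = momentum_transfer(math.radians(1.0), 5.0)

r_numeric = approx_reflectivity(q, profile, dz)

# Closed form for a uniform slab: |rho(Q)|^2 = rho0^2 * 4*sin^2(Q*d/2) / Q^2.
r_analytic = (16.0 * math.pi ** 2 * rho0 ** 2
              * 4.0 * math.sin(q * thickness / 2.0) ** 2 / q ** 4)

print(q, r_numeric, r_analytic)
```

      The slab case is used only because its Fourier transform has a closed form, which makes it easy to see how layer thickness enters the reflected intensity through the sin²(Qd/2) term.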

      How are the "layers" defined from the plot (Fig. 5b)?

      The “layers” in the plot (Fig. 5b) represent different regions of the sample being studied. In this study, we used a seven-layer model to fit the experimental data (chromium - gold - NTA - His8 - β-barrel - P3-5 - P1-2). This is explained in the following text in the figure legend of the revised manuscript: (Lines 1115-1116) "The experimental data was fitted using a seven-layer model: chromium - gold - NTA - His8 - β-barrel - P3-5 - P1-2."

      What are the meanings of "thickness" and "roughness" (Fig. 5c)?

      We used neutron reflectometry to determine the relative positions of BAM subunits in a membrane environment. The binding of certain subunits induced conformational changes in other parts of the complex. When a substrate membrane protein is added, the periplasmic POTRA domain of BamA extends further away from the membrane surface. This could result in an increase in thickness as observed in neutron reflectometry measurements.

      As for roughness, it is related to the interface properties of the sample. In neutron reflectometry, the intensity of the reflected beams is highly dependent on the thickness, densities, and interface roughness of the samples. An increase in roughness could suggest changes in these properties, possibly due to protein-membrane interactions or structural changes within the membrane.

      (Lines 1116-1120) "Table summarizing the thickness, roughness and volume fraction data of each layer from the NR analysis. The thickness refers to the depth of layered structures being studied as measured in Å. The roughness refers to the irregularities in the surface of the layered structures being studied as measured in Å."

      What does "SLD" stand for?

      We apologize for not defining the abbreviation SLD at its first occurrence. We have now defined it in the revised manuscript (Line 298).

      2. In the result section, "The internal signal is necessary for insertion step of assembly into OM" This section presents an important result that the internal beta-signal is critical to the intrinsic propensity of barrel formation, distinct from the recognition by BAM complex. However, this point is not elaborated in this section. For example, what is the role of these critical residues in the barrel structure formation? That is, are they involved in any special tertiary contacts in the structure or in membrane anchoring of the nascent polypeptide chains?

      We appreciate the reviewer's comment on this point. Both position 0 and position 6 appear to be important amino acids for recognition by the BAM complex, since mutations introduced at these positions in peptide 18 prevent competitive inhibition activity.

      In terms of the tertiary structure of OmpC, position 6 is an amino acid that contributes to the aromatic girdle, and since Y286A and Y365A affected OMP folding as measured in folding experiments, it is perhaps their position in the aromatic girdle that contributes to the efficiency of β-barrel folding in addition to its function as a recognition signal. We have added a sentence in the revised manuscript:

      (Lines 233-236) "Position 6 is an amino acid that contributes to the aromatic girdle. Since Y286A and Y365A affected OMP folding as measured in folding experiments, their positioning into the aromatic girdle may contribute to the efficiency of β-barrel folding, in addition to contributing to the internal signal."

      The mutations made at position 0 had no effect on folding, so this residue may function solely in the signal. Given the register of each β-strand in the final barrel, the position 0 residues have side-chains that face out into the lipid environment. From examination of the OmpC crystal structure, the residue at position 0 makes no special tertiary contacts with other, neighbouring residues.  

      Reviewer #1 (Recommendations For The Authors):

      Minor critiques (in no particular order):

      1. Peptide 18 was identified based on its strong inhibition for EspP assembly but another peptide, peptide 23, also shows inhibition and has no particular consensus.

      We would like to correct this point. Peptide 23 has a strong consensus with the canonical β-signal. We had explained the sequence consensus of the β-signal in the Results section of the text. In the third paragraph, we have added a sentence indicating the relationship between peptide 18 and peptide 23.

      (Lines 152-168) "Six peptides (4, 10, 17, 18, 21, and 23) were found to inhibit EspP assembly (Fig. 1A). Of these, peptide 23 corresponds to the canonical β-signal of OMPs: it is the final β-strand of OmpC and it contains the consensus motif of the β-signal (ζxGxx[Ω/Φ]x[Ω/Φ]). The inhibition seen with peptide 23 indicated that our peptidomimetics screening system using EspP can detect signals recognized by the BAM complex. In addition to inhibiting EspP assembly, five of the most potent peptides (4, 17, 18, 21, and 23) inhibited additional model OMPs; the porins OmpC and OmpF, the peptidoglycan-binding OmpA, and the maltoporin LamB (fig. S3). Comparing the sequences of these inhibitory peptides suggested the presence of a sub-motif from within the β-signal, namely [Ω/Φ]x[Ω/Φ] (Fig. 1B). The sequence codes refer to conserved residues such that: ζ, is any polar residue; G is a glycine residue; Ω is any aromatic residue; Φ is any hydrophobic residue and x is any residue (Hagan et al., 2015; Kutik et al., 2008). The non-inhibitory peptide 9 contained some elements of the β-signal but did not show inhibition of EspP assembly (Fig. 1A).

      Peptide 18 also showed a strong sequence similarity to the consensus motif of the β-signal (Fig. 1B) and, like peptide 23, had a strong inhibitory action on EspP assembly (Fig. 1A). Variant peptides based on the peptide 18 sequence were constructed and tested in the EMM assembly assay (Fig. 1C)."
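
      The consensus notation quoted above, ζxGxx[Ω/Φ]x[Ω/Φ], together with the sub-motif [Ω/Φ]x[Ω/Φ], maps naturally onto a regular expression. The sketch below is illustrative only and is not the screening method used in the study: the residue sets chosen for ζ (polar), Ω (aromatic) and Φ (hydrophobic) are assumptions made here, since the quoted text does not list them, and the test sequence is a synthetic toy string rather than any real OMP sequence.

```python
import re

# Assumed residue classes (one-letter codes); the response does not enumerate them.
POLAR = "STNQDEKRH"       # zeta: any polar residue (assumption)
AROMATIC = "FWY"          # Omega: any aromatic residue (assumption)
HYDROPHOBIC = "AVLIMFWY"  # Phi: any hydrophobic residue (assumption)
OMEGA_OR_PHI = "".join(sorted(set(AROMATIC + HYDROPHOBIC)))

# zeta x G x x [Omega/Phi] x [Omega/Phi]; "." stands for x (any residue).
BETA_SIGNAL = re.compile(rf"[{POLAR}].G..[{OMEGA_OR_PHI}].[{OMEGA_OR_PHI}]")
# Sub-motif [Omega/Phi] x [Omega/Phi].
SUB_MOTIF = re.compile(rf"[{OMEGA_OR_PHI}].[{OMEGA_OR_PHI}]")


def find_motif(pattern, sequence):
    """Return (start, matched_text) for every match, including overlapping ones."""
    lookahead = re.compile(f"(?=({pattern.pattern}))")
    return [(m.start(), m.group(1)) for m in lookahead.finditer(sequence)]


# Synthetic toy sequence (hypothetical, not taken from OmpC or any other protein).
toy = "KKSAGAAFAFKK"
print(find_motif(BETA_SIGNAL, toy))
print(find_motif(SUB_MOTIF, toy))
```

      A lookahead is used so that overlapping occurrences of the short sub-motif are all reported; changing the assumed residue sets changes which segments are flagged.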

      1. It is unclear why the authors immediately focused on BamD rather than BamB, given that both were mentioned to mediate interaction with substrate. Was BamB also tested?

      We thank the reviewer for this comment. Following the reviewer's suggestion, we have now performed a pull-down experiment on BamB and added it to Fig. S9. We also modified the text of the results as follows.

      (Lines 262-265) "Three subunits of the BAM complex have been previously shown to interact with the substrates: BamA, BamB, and BamD (Hagan et al., 2013; Harrison, 1996; Ieva et al., 2011). In vitro pull-down assay showed that while BamA and BamD can independently bind to the in vitro translated OmpC polypeptide (Fig. S9A), BamB did not (Fig. S9B)."

      1. For the in vitro folding assays of the OmpC substrates, labeled and unlabeled, no mention of adding SurA or any other chaperone which is known to be important for mediating OMP biogenesis in vitro.

      We appreciate the reviewer’s concerns on this point, however chaperones such as SurA are non-essential factors in the OMP assembly reaction mediated by the BAM complex: the surA gene is not essential and the assembly of OMPs can be measured in the absence of exogenously added SurA. It remains possible that addition of SurA to some of these assays could be useful in detailing aspects of chaperone function in the context of the BAM complex, but that was not the intent of this study.

      1. For the supplementary document, it would be much easier for the reader to have the legends groups with the figures.

      Following the reviewer's suggestion, we have placed the legends of Supplemental Figures together with each Figure.

      1. Some of the figures and their captions are not grouped properly and are separated which makes it hard to interpret the figures efficiently.

      We thank the reviewer for this comment, we have revised the manuscript and figures to properly group the figures and captions together on a single page.

      1. The authors begin their 'Discussion' with a question (line 454), however, they don't appear to answer or even attempt to address it; suggest removing rhetorical questions.

      As per the reviewers’ suggestion, we removed this question.

      1. Line 464, 'unbiased' should be removed. This would imply that if not stated, experiments are 'negatively' biased.

      We removed this word and revised the sentence as follows:

      (Lines 431-433) "In our experimental approach to assess for inhibitory peptides, specific segments of the major porin substrate OmpC were shown to interact with the BAM complex as peptidomimetic inhibitors."

      1. Lines 466-467; '...go well beyond expected outcomes.' What does this statement mean?

      Our peptidomimetics led to unexpected results in elucidating the additional essential signal elements. The manuscript was revised as follows:

      (Lines 433-435) "Results for this experimental approach went beyond expected outcomes by identifying the essential elements of the signal Φxxxxxx[Ω/Φ]x[Ω/Φ] in β-strands other than the C-terminal strand."

      1. Line 478; '...rich information that must be oversimplified...'?

      We thank the reviewer for pointing this out. For more clarity, the manuscript was revised as follows:

      (Lines 450-453) "The abundance of information which arises from modeling approaches and from the multitude of candidate OMPs, is generally oversimplified when written as a primary structure description typical of the β-signal for bacterial OMPs (i.e. ζxGxx[Ω/Φ]x[Ω/Φ]) (Kutik et al., 2008)."

      1. There are typos in the supplementary figures.

      We have revised and corrected the Supplemental Figure legends.  

      Reviewer #2 (Recommendations For The Authors):

      1. In Supplementary Information, I recommend adding the figure legends directly to the corresponding figures. Currently, it is very inconvenient to go back and forth between legends and figures.

      Following the reviewer's suggestion, we have placed the legends of Supplemental Figures together with each Figure.

      1. Line 94 (p.3): "later"

      Lateral?

      Yes. We have corrected this.

      1. Line 113 (p.3): The result section, "Peptidomimetics derived from E. coli OmpC inhibit OMP assembly" Rationale of the peptide inhibition assay is not clear. How can the peptide sequence that effectively inhibit the assembly interpreted as the b-assembly signal? By competitive binding to BAM or by something else? What is the authors' hypothesis in doing this assay?

      In the revision, we have added the following sentences to explain the aim and design of the peptidomimetics:

      (Lines 140-145) "The addition of peptides with BAM complex affinity, such as the OMP β-signal, are capable of exerting an inhibitory effect by competing for binding of substrate OMPs to the BAM complex (Hagan et al., 2015). Thus, the addition of peptides derived from the entirety of OMPs to the EMM assembly assay, which can evaluate assembly efficiency with high accuracy, expects to identify novel regions that have affinity for the BAM complex."

      1. Line 113- (p.3) and Fig. S1: The result section, "Peptidomimetics derived from E. coli OmpC inhibit OMP assembly"

      Some explanation seems to be needed why b-barrel domain of EspP appears even without ProK?

      We thank the reviewer for pointing this out. We added the following sentences to explain:

      (Lines 128-137) "EspP, a model OMP substrate, belongs to autotransporter family of proteins. Autotransporters have two domains; (1) a β-barrel domain, assembled into the outer membrane via the BAM complex, and (2) a passenger domain, which traverses the outer membrane via the lumen of the β-barrel domain itself and is subsequently cleaved by the correctly assembled β-barrel domain (Celik et al., 2012). When EspP is correctly assembled into outer membrane, a visible decrease in the molecular mass of the protein is observed due to the self-proteolysis. Once the barrel domain is assembled into the membrane it becomes protease-resistant, with residual unassembled and passenger domains degraded (Leyton et al., 2014; Roman-Hernandez et al., 2014)."

      1. Line 186 (p.6): "Y285"

      Y285A?

      We have corrected the error; it should read Y285A.

      1. Lines 245- (p. 7)/ Lines 330- (p. 10)

      It needs to be clarified that the results described in these paragraphs were obtained from the assays with EMM.

      We appreciate the reviewer’s concerns on these points. For the first half, the following text was added at the beginning of the applicable paragraph to indicate that all of Fig. 4 is the result of the EMM assembly assay.

      (Line 241) "We further analyzed the role of internal β-signal by the EMM assembly assay." For the second half, we used purified BamD rather than EMM. We have stated this clearly with the following sentence:

      (Lines 316-318) "We purified 40 different BPA variants of BamD, and then irradiated UV after incubating with 35S-labelled OmpC."

    1. Author response:

      The following is the authors’ response to the original reviews.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) Lines 40-42: The sentence "The coupling of structural connectome (SC) and functional connectome (FC) varies greatly across different cortical regions reflecting anatomical and functional hierarchies as well as individual differences in cognitive function, and is regulated by genes" is a misstatement. Regional variations of structure-function coupling do not really reflect differences in cognitive function among individuals, but inter-subject variations do.

      Thank you for your comment. We have made revisions to the sentence to correct its misstatement. Please see lines 40-43: “The coupling of structural connectome (SC) and functional connectome (FC) varies greatly across different cortical regions reflecting anatomical and functional hierarchies[1, 6-9] and is regulated by genes[6, 8], as well as its individual differences relates to cognitive function[8, 9].”

      (2) In Figure 1, the graph showing the relation between intensity and cortical depth needs explanation.

      Thank you for your comment. We have added the necessary explanation; please see lines 133-134: “The MPC was used to map similarity networks of intracortical microstructure (voxel intensity sampled in different cortical depth) for each cortical node.”

      (3) Line 167: Change "increased" to "increase".

      We have corrected it, please see lines 173-174: “…networks significantly increased with age and exhibited greater increase.”

      (4) Line 195: Remove "were".

      We have corrected it, please see line 204: “…default mode networks significantly contributed to the prediction…”

      (5) Lines 233-240, Reproducibility analyses: Comparisons of parcellation templates were not made with respect to gene weights. Is there any particular reason?

      Thank you for your comment. We have quantified the gene weights based on HCPMMP using the same procedures. We identified a correlation (r = 0.25, p < 0.001) between the gene weights in HCPMMP and BNA. Given that this is a relatively weak correlation, we need to clarify the following points.

      Based on HCPMMP, we produced an averaged gene expression profile for 10,027 genes covering 176 left cortical regions[1]. Excluding the 4 cortical regions that had an insufficient number of assigned samples may contribute to the relatively weak correlation of gene weights between templates. Moreover, the effect of different template resolutions on the results of human connectome-transcriptome association is still unclear.

      In brain connectome analysis, the choice of parcellation template can indeed influence subsequent findings to some extent. A methodological study[2] reported correlations of about 0.4~0.6 for white matter connectivity and 0.2~0.4 for white matter nodal properties between two templates (refer to Figures 4 and 5 in [2]). Therefore, the age-related coupling changes, a downstream analysis calculated from multimodal connectomes and then correlated with gene expression profiles, may be influenced by the choice of template.

      We have further supplemented the gene weight results obtained from HCPMMP to explicitly clarify the dependency on parcellation templates.

      Please see lines 251-252: “The gene weights of HCPMMP was consistent with that of BNA (r = 0.25, p < 0.001).”

      Author response image 1.

      The consistency of gene weights between HCPMMP and BNA.

      Please see lines 601-604: “Finally, we produced an averaged gene expression profile for 10,027 genes covering 176 left cortical regions based on HCPMMP and obtained the gene weights by PLS analysis. We performed Pearson's correlation analyses to assess the consistency of gene weights between HCPMMP and BNA.”
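
      The consistency check described in the quoted passage amounts to a Pearson correlation between two vectors of gene weights, one per parcellation, matched gene by gene. A minimal sketch follows, assuming the weights have already been obtained (for example from a PLS analysis) and aligned to the same gene order. The function name and the toy weight values are hypothetical and are not data from the study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for a constant input")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical gene weights for the same five genes under two parcellations.
weights_bna = [0.8, -0.3, 0.1, 0.5, -0.6]
weights_hcpmmp = [0.5, -0.1, 0.4, 0.2, -0.7]

r = pearson_r(weights_bna, weights_hcpmmp)
print(f"r = {r:.2f}")
```

      In practice the two vectors would each hold one weight per gene (10,027 in the quoted text), and a p-value would be computed alongside r with a statistics library.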

      Reviewer #2 (Recommendations For The Authors):

      Your paper is interesting to read and I found your efforts to evaluate the robustness of the results of different parcellation strategies and tractography methods very valuable. The work is globally easy to navigate and well written with informative good-quality figures, although I think some additional clarifications will be useful to improve readability. My suggestions and questions are detailed below (I aimed to group them by topic which did not always succeed so apologies if the comments are difficult to navigate, but I hope they will be useful for reflection and to incorporate in your work).

      * L34: 'developmental disorder'

      ** As far as I understand, the subjects in HCP-D are mostly healthy (L87). Thus, while your study provides interesting insights into typical brain development, I wonder if references to 'disorder' might be premature. In the future, it would be interesting to extend your approach to the atypical populations. In any case, it would be extremely helpful and appreciated if you included a figure visualising the distribution of behavioural scores within your population and in relationship to age at scan for your subjects (and to include a more detailed description of the assessment in the methods section) given that large part of your paper focuses on their prediction using coupling inputs (especially given a large drop of predictive performance after age correction). Such figures would allow the reader to better understand the cognitive variability within your data, but also potential age relationships, and generally give a better overview of your cohort.

      We agree with your comment that references to 'disorder' are premature. We have made revisions in the abstract and conclusion.

      Please see lines 33-34: “This study offers insight into the maturational principles of SC-FC coupling in typical development.”

      Please see lines 395-396: “Further investigations are needed to fully explore the clinical implications of SC-FC coupling for a range of developmental disorders.”

      In addition, we have included a more detailed description of the cognitive scores in the Methods section and provided a figure showing the distributions of the cognitive scores and their relationship to age. Please see lines 407-413: “Cognitive scores. We included 11 cognitive scores which were assessed with the National Institutes of Health (NIH) Toolbox Cognition Battery (https://www.healthmeasures.net/exploremeasurement-systems/nih-toolbox), including episodic memory, executive function/cognitive flexibility, executive function/inhibition, language/reading decoding, processing speed, language/vocabulary comprehension, working memory, fluid intelligence composite score, crystal intelligence composite score, early child intelligence composite score and total intelligence composite score. Distributions of these cognitive scores and their relationship with age are illustrated in Figure S12.”

      Author response image 2.

      Cognitive scores and age distributions of scans.

      * SC-FC coupling

      ** L162: 'Regarding functional subnetworks, SC-FC coupling increased disproportionately with age (Figure 3C)'.

      *** As far as I understand, in Figure 3C, the points are the correlation with age for a given ROI within the subnetwork. Is this correct? If yes, I am not sure how this shows a disproportionate increase in coupling. It seems that there is great variability of SC-FC correlation with age across regions within subnetworks, more so than the differences between networks. This would suggest that the coupling with age is regionally dependent rather than network-dependent? Maybe you could clarify?

      Yes, the points in Figure 3C are the correlations with age for each ROI within a subnetwork. We have revised the description, please see lines 168-174: “Age correlation coefficients distributed within functional subnetworks were shown in Figure 3C. Regarding mean SC-FC coupling within functional subnetworks, the somatomotor (𝛽𝑎𝑔𝑒=2.39E-03, F=4.73, p=3.10E-06, r=0.25, p=1.67E-07, Figure 3E), dorsal attention (𝛽𝑎𝑔𝑒=1.40E-03, F=4.63, p=4.86E-06, r=0.24, p=2.91E-07, Figure 3F), frontoparietal (𝛽𝑎𝑔𝑒=2.11E-03, F=6.46, p=2.80E-10, r=0.33, p=1.64E-12, Figure 3I) and default mode (𝛽𝑎𝑔𝑒=9.71E-04, F=2.90, p=3.94E-03, r=0.15, p=1.19E-03, Figure 3J) networks significantly increased with age and exhibited greater increase.” In addition, we agree with your comment that the coupling with age is more likely region-dependent than network-dependent. We have added the description, please see lines 329-332: “We also found the SC-FC coupling with age across regions within subnetworks has more variability than the differences between networks, suggesting that the coupling with age is more likely region-dependent than network-dependent.” This is why our subsequent analysis focused on regional coupling.

      *** Additionally, we see from Figure 3C that regions within networks have very different changes with age. Given this variability (especially in the subnetworks where you show both positive and negative correlations with age for specific ROIs (i.e. all of them)), does it make sense then to show mean coupling over regions within the subnetworks which erases the differences in coupling with age relationships across regions (Figures 3D-J)?

      Because SC-FC coupling is commonly examined and interpreted at the level of functional subnetworks, we consider it necessary to show the age correlation of mean coupling at the subnetwork scale, although this removes variability at the regional scale. Taken together, the results at both scales indicate that coupling mainly increases with age in this age group.

      *** Also, I think it would be interesting to show correlation coefficients across all regions, not only the significant ones (3B). Is there a spatially related tendency of increases/decreases (rather than a 'network' relationship)? Would it be interesting to show a similar figure to Figure S7 instead of only the significant regions?

      Following your comment, we have added a map of the correlation coefficients across all regions to Figure 3B. We made the same addition to the other figures (Figures S3-S6).

      Author response image 3.

      Age-related changes in SC-FC coupling. (A) Increases in whole-brain coupling with age. (B) Correlation of age with SC-FC coupling across all regions and significant regions (p<0.05, FDR corrected). (C) Comparisons of age-related changes in SC-FC coupling among functional networks. The boxes show the median and interquartile range (IQR; 25–75%), and the whiskers depict 1.5× IQR from the first or third quartile. (D-J) Correlation of age with SC-FC coupling across the VIS, SM, DA, VA, LIM, FP and DM. VIS, visual network; SM, somatomotor network; DA, dorsal attention network; VA, ventral attention network; LIM, limbic network; FP, frontoparietal network; DM, default mode network.

      *** For the quantification of MPC.

      **** L421: you reconstructed 14 cortical surfaces from the wm to pial surface. If we take the max thickness of the cortex to be 4.5mm (Fischl & Dale, 2000), the sampling is above the resolution of your anatomical images (0.8mm). Could you expand on what the interest is in sampling such a higher number of surfaces given that the resolution is not enough to provide additional information?

      The surface reconstruction was based on state-of-the-art equivolumetric surface construction techniques[3], which provide a simplified recapitulation of cellular changes across the putative laminar structure of the cortex. By referencing a 100-μm resolution Merker-stained 3D histological reconstruction of an entire post mortem human brain (BigBrain: https://bigbrain.loris.ca/main.php), a methodological study[4] systematically evaluated MPC stability with four to 30 intracortical surfaces when the resolution of the anatomical image was 0.7 mm, and selected 14 surfaces as the most stable solution. Importantly, the in vivo approach has been shown to serve as a lower-resolution yet biologically meaningful extension of the histological work[4].

      **** L424: did you aggregate intensities over regions using mean/median or other statistics?

      It might be useful to specify.

      Thank you for your careful comment. We have revised the description in lines 446-447: “We averaged the intensity profiles of vertices over 210 cortical regions according to the BNA”.

      **** L426: personal curiosity, why did you decide to remove the negative correlation of the intensity profiles from the MPC? Although this is a common practice in functional analyses (where the interpretation of negatives is debated), within the context of cortical correlations, the negative values might be interesting and informative on the level of microstructural relationships across regions (if you want to remove negative signs it might be worth taking their absolute values instead).

      We agree with your comment that the interpretation of negative correlations in MPC is debated. Considering that MPC is a nascent approach to network modeling, we adopted the more conservative strategy of removing negative correlations, following the study [4] that proposed the approach. As you note, the negative correlations might be informative. We will continue to explore what they reveal about microstructural relationships across regions.
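
      The two options discussed in this exchange, zeroing negative correlations versus taking their absolute values, can be contrasted in a minimal Python sketch. The random intensity profiles, the region and surface counts, and the use of a plain Pearson correlation are all assumptions made for illustration; the MPC procedure of study [4] may include further steps that are not reproduced here.

```python
import numpy as np

# Hypothetical data: 6 regions, each with a 14-point intensity profile
rng = np.random.default_rng(2)
n_regions, n_surfaces = 6, 14
profiles = rng.normal(size=(n_regions, n_surfaces))

# Region-by-region correlation of intensity profiles (simplified MPC)
corr = np.corrcoef(profiles)
np.fill_diagonal(corr, 0.0)  # ignore self-correlations

# Conservative option: discard negative correlations
mpc_thresholded = np.where(corr > 0, corr, 0.0)

# Alternative raised by the reviewer: keep magnitude, drop the sign
mpc_absolute = np.abs(corr)
```

      The thresholded matrix keeps only positive edges, whereas the absolute-value matrix retains every edge but can no longer distinguish similar from opposing profiles.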

      **** L465: could you please expand on the notion of self-connections, it is not completely evident what this refers to.

      We have revised the description in lines 493-494: “𝑁𝑐 is the number of connection (𝑁𝑐 = 245 for BNA)”.

      **** Paragraph starting on L467: did you evaluate the multicollinearities between communication models? It is possibly rather high (especially for the same models with similar parameters (listed on L440-444)). Such dependence between variables might affect the estimates of feature importance (given the predictive models only care to minimize error, highly correlated features can be selected as a strong predictor while the impact of other features with similarly strong relationships with the target is minimized thus impacting the identification of reliable 'predictors').

      We agree with your comment. The covariance structure (multicollinearity) among the communication models is likely to lead to unreliable predictor weights. In our study, we applied Haufe's inversion transform[5], which resolves this issue by computing the covariance between the predicted FC and each communication model in the training set. For more details on Haufe's inversion transform, please see [5]. We further clarified this in the manuscript, please see lines 497-499: “And covariance structure among the predictors may lead to unreliable predictor weights. Thus, we applied Haufe's inversion transform[38] to address these issues and identify reliable communication mechanisms.”
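
      As an illustration of the idea only, the following Python sketch computes the covariance between each predictor and the predicted target on simulated data. The function name `haufe_transform`, the simulated predictors and the ordinary least-squares model are assumptions for this example and do not reproduce our actual pipeline.

```python
import numpy as np

def haufe_transform(X_train, y_pred_train):
    """Covariance between each predictor column and the predicted target."""
    Xc = X_train - X_train.mean(axis=0)
    yc = y_pred_train - y_pred_train.mean()
    return Xc.T @ yc / (X_train.shape[0] - 1)

# Simulated training data with two nearly collinear predictors
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)  # almost a copy of x1
x3 = rng.normal(size=n)              # unrelated to the target
X = np.column_stack([x1, x2, x3])
y = x1 + 0.1 * rng.normal(size=n)

# Ordinary least squares with an intercept
design = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
y_pred = design @ coef

weights = coef[1:]                    # raw weights, unstable under collinearity
patterns = haufe_transform(X, y_pred) # transformed weights
```

      In this toy case the raw weights may split arbitrarily between `x1` and `x2`, whereas the transformed values are similar for both because both covary with the prediction.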

      **** L474: I am not completely familiar with spin tests but to my understanding, this is a spatial permutation test. I am not sure how this applies to the evaluation of the robustness of feature weight estimates per region (if this was performed per region), it would be useful to provide a bit more detail to make it clearer.

      Following your comment, we have added further detail, please see lines 503-507: “Next, we generated 1,000 FC permutations through a spin test[86] for each nodal prediction in each subject and obtained random distributions of model weights. These weights were averaged over the group and were investigated the enrichment of the highest weights per region to assess whether the number of highest weights across communication models was significantly larger than that in a random discovery.”

      **** L477: 'significant communication models were used to represent WMC...', but in L103 you mention you select 3 models: communicability, mean first passage, and flow graphs. Do you want to say that only 3 models were 'significant' and these were exactly the same across all regions (and data splits/ parcellation strategies/ tractography methods)? In the methods, you describe a lot of analysis and testing but it is not completely clear how you come to the selection of the final 3, it would be beneficial to clarify. Also, the final 3 were selected on the whole dataset first and then the pipeline of SC-FC coupling/age assessment/behaviour predictions was run for every (WD, S1, S2) for both parcellations schemes and tractography methods or did you end up with different sets each time? It would be good to make the pipeline and design choices, including the validation bit clearer (a figure detailing all the steps which extend Figure 1 would be very useful to understand the design/choices and how they relate to different runs of the validation).

      Thank you for your comment. In all reproducibility analyses, we used the same three models, which were selected in the main pipeline (probabilistic tractography and BNA parcellation). Following your comment, we produced a figure that illustrates the model selection pipeline as an extension of Figure 1. For the description, please see lines 106-108: “We used these three models to represent the extracortical connectivity properties in subsequent discovery and reproducibility analyses (Figure S1).”

      Author response image 4.

      Pipeline of model selection and reproducibility analyses.

      **** Might the imbalance of features between structural connectivity and MPC affect the revealed SC-FC relationships (3 vs 1)? Why did you decide on this ratio rather than for example best WM structural descriptor + MPC?

      We understand your concern. The WMC communication models represent diverse geometric, topological, or dynamic factors. To describe the properties of WMC as fully as possible, we selected, from the 27 models and after controlling for covariance structure, the three communication models that significantly predict FC. Compared to MPC, this does present a potential feature imbalance problem. However, this still supports the conclusion that coupling models that incorporate microarchitectural properties yield more accurate predictions of FC from SC[6, 7]. The relevant experiments are shown in Figure S2 below. Using only the best WM structural descriptor may lose some communication properties of WMC.

      **** L515: were intracranial volume and in-scanner head motion related to behavioural measures? These variables likely impact the inputs, do you expect them to influence the outcome assessments? Or is there a mistake on L518 and you actually corrected the input features rather than the behaviour measures?

      The in-scanner head motion and intracranial volume are related to some age-adjusted behavioural measures, as shown in the following table. The process of regression of covariates from cognitive measures was based on these two cognitive prediction studies [8, 9]. Please see lines 549-554: “Prior to applying the nested fivefold cross-validation framework to each behaviour measure, we regressed out covariates including sex, intracranial volume, and in-scanner head motion from the behaviour measure[59, 69]. Specifically, we estimated the regression coefficients of the covariates using the training set and applied them to the testing set. This regression procedure was repeated for each fold.”

      Author response table 1.
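
      The fold-wise covariate regression described in this response can be sketched in Python as follows. The simulated covariates and scores, the helper names `fit_covariate_betas` and `remove_covariates`, and the single (non-nested) five-fold loop are assumptions for illustration only.

```python
import numpy as np

def fit_covariate_betas(covars_train, y_train):
    """Estimate intercept and covariate coefficients on the training set."""
    design = np.column_stack([np.ones(len(y_train)), covars_train])
    betas, *_ = np.linalg.lstsq(design, y_train, rcond=None)
    return betas

def remove_covariates(covars, y, betas):
    """Subtract the covariate fit using previously estimated coefficients."""
    design = np.column_stack([np.ones(len(y)), covars])
    return y - design @ betas

# Simulated data: 3 covariates (e.g. sex, intracranial volume, head motion)
rng = np.random.default_rng(1)
n = 200
covars = rng.normal(size=(n, 3))
score = covars @ np.array([0.5, -0.3, 0.2]) + rng.normal(size=n)

folds = np.array_split(rng.permutation(n), 5)
residual = np.empty(n)
for k, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
    # Coefficients come from the training set only ...
    betas = fit_covariate_betas(covars[train_idx], score[train_idx])
    # ... and are then applied unchanged to the held-out set
    residual[test_idx] = remove_covariates(covars[test_idx], score[test_idx], betas)
```

      Estimating the coefficients on the training set alone keeps information from the held-out participants out of the covariate model.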

      ** Additionally, in the paper, you propose that the incorporation of cortical microstructural (myelin-related) descriptors with white-matter connectivity to explain FC provides for 'a more comprehensive perspective for characterizing the development of SC-FC coupling' (L60). This combination of cortical and white-matter structure is indeed interesting, however the benefits of incorporating different descriptors could be studied further. For example, comparing results of using only the white matter connectivity (assessed through selected communication models) ~ FC vs (white matter + MPC) ~ FC vs MPC ~ FC. Which descriptors better explain FC? Are the 'coupling trends' similar (or the same)? If yes, what is the additional benefit of using the more complex combination? This would also add strength to your statement at L317: 'These discrepancies likely arise from differences in coupling methods, highlighting the complementarity of our methods with existing findings'. Yes, discrepancies might be explained by the use of different SC inputs. However, it is difficult to see how discrepancies highlight complementarity - does MPC (and combination with wm) provide additional information to using wm structural alone?

      Following your comment, we have added analyses based on models that use only the myelin-related predictor or only WM connectivity to predict FC, and compared the results among the different models. Please see lines 519-521: “In addition, we have constructed the models using only MPC or SCs to predict FC, respectively. Spearman’s correlation was used to assess the consistency between spatial patterns based on different models.”

      Please see lines 128-130: “In addition, the coupling pattern based on other models (using only MPC or only SCs to predict FC) and the comparison between the models were shown in Figure S2A-C.” Please see lines 178-179: “The age-related patterns of SC-FC coupling based other coupling models were shown in Figure S2D-F.”

      Although we found spatial consistencies in the coupling patterns between different models, incorporating MPC with SC connectivity improves the prediction of FC relative to models based on only MPC or SC. For age-related changes in coupling, the differences between the models were further amplified. We agree with you that the complementarity cannot be explicitly quantified and we have revised the description, please see line 329: “These discrepancies likely arise from differences in coupling methods.”

      Author response image 5.

      Comparison results between different models. Spatial pattern of mean SC-FC coupling based on MPC ~ FC (A), SCs ~ FC (B), and MPC + SCs ~ FC (C). Correlation of age with SC-FC coupling across cortex based on MPC ~ FC (D), SCs ~ FC (E), and MPC + SCs ~ FC (F).

      ** For the interpretation of results: L31 'SC-FC coupling is positively associated with genes in oligodendrocyte-related pathways and negatively associated with astrocyte-related gene'; L124: positive myelin content with SC-FC coupling...and similarly on L81, L219, L299, L342, and L490:

      *** You use a T1/T2 ratio which is (in large part) a measure of myelin to estimate the coupling between SC and FC. Evaluation with SC-FC coupling with myelin described in Figure 2E is possibly biased by the choice of this feature. Similarly, it is possible that reported positive associations with oligodendrocyte-related pathways and SC-FC coupling in your work could in part result from a bias introduced by the 'myelin descriptor' (conversely, picking up the oligodendrocyte-related genes is a nice corroboration for the T1/T2 ratio being a myelin descriptor, so that's nice). However, it is possible that if you used a different descriptor of the cortical microstructure, you might find different expression patterns associated with the SC-FC coupling (for example using neurite density index might pick up neuronal-related genes?). As mentioned in my previous suggestions, I think it would be of interest to first use only the white matter structural connectivity feature to assess coupling to FC and assess the gene expression in the cortical regions to see if the same genes are related, and subsequently incorporate MPC to dissociate potential bias of using a myelin measure from genetic findings.

      Thank you for your insightful comments. In this paper, however, the core method of measuring coupling is to predict functional connections using multimodal structural connections, which may yield more information than a single modality. We agree with your comment that separating SCs and MPC and examining the genes associated with each could lead to interesting discoveries. We will continue to explore this in the future.

      ** Generally, I find it difficult to understand the interpretation of SC-FC coupling measures and would be interested to hear your thinking about this. As you mention on L290-294, how well SC predicts FC depends on which input features are used for the coupling assessment (more complex communication models, incorporating additional microstructural information etc 'yield more accurate predictions of FC' L291) - thus, calculated coupling can be interpreted as a measure of how well a particular set of input features explain FC (different sets will explain FC more or less well) ~ coupling is related to a measure of 'missing' information on the SC-FC relationship which is not contained within the particular set of structural descriptors - with this approach, the goal might be to determine the set that best, i.e. completely, explains FC to understand the link between structure and function. When you use the coupling measures for comparisons with age, cognition prediction etc, the 'status' of the SC-FC changes, it is no longer the amount of FC explained by the given SC descriptor set, but it's considered a descriptor in itself (rather than an effect of feature selection / SC-FC information overlap) - how do you interpret/argue for this shift of use?

      Thank you for your comment. In this paper, we obtain reasonable SC-FC coupling by determining the optimal set of structural features to explain function. The coupling essentially measures the direct correspondence between structure and function. Studying the relationship of coupling with age and cognition therefore amounts to studying how this direct correspondence between structure and function relates to age and cognition.

      ** In a similar vein to the above comment, I am interested to hear what you think: on L305 you mention that 'perfect SC-FC coupling may be unlikely'. Would this reasoning suggest that functional activity takes place through other means than (and is therefore somehow independent of) biological (structural) substrates? For now, I think one can only say that we have imperfect descriptors of the structure so there is always information missing to explain function, this however does not mean the SC and FC are not perfectly coupled (only that we look at insufficient structural descriptors - limitations of what imaging can assess, what we measure etc). This is in line with L305 where you mention that 'Moreover, our results suggested that regional preferential contributions across different SCs lead to variations in the underlying communication process'. This suggests that locally different areas might use different communication models which are not reflected in the measures of SC-FC coupling that was employed, not that the 'coupling' is lower or higher (or coupling is not perfect). This is also a change in approach to L293: 'This configuration effectively releases the association cortex from strong structural constraints' - the 'release' might only be in light of the particular structural descriptors you use - is it conceivable that a different communication model would be more appropriate (and show high coupling) in these areas.

      Thank you for your insightful comments. We have changed the description, please see lines 315-317: “SC-FC coupling is dynamic and changes throughout the lifespan[7], particularly during adolescence[6,9], suggesting that perfect SC-FC coupling may require sufficient structural descriptors.”

      *Cognitive predictions:

      ** From a practical stand-point, do you think SC-FC coupling is a better (more accurate) indicator of cognitive outcomes (for example for future prediction studies) than each modality alone (which is practically easier to obtain and process)? It would be useful to check the behavioural outcome predictions for each modality separately (as suggested above for coupling estimates). In case SC-FC coupling does not outperform each modality separately, what is the benefit of using their coupling? Similarly, it would be useful to compare to using only cortical myelin for the prediction (which you showed to increase in importance for the coupling). In the case of myelin->coupling-> intelligence, if you are able to predict outcomes with the same performance from myelin without the need for coupling measures, what is the benefit of coupling?

      From the point of view of predictive performance, we do not believe that SC-FC coupling is a better indicator than a single modality (voxel, network or other indicator). Our aim was to assess whether SC-FC coupling is related to individual differences in cognitive performance rather than to demonstrate its predictive power over other measures. As you suggest, separating the modalities and comparing their power to predict cognition is a very interesting perspective. We will continue to explore this issue in future studies.

      ** The statement on L187 'suggesting that increased SC-FC coupling during development is associated with higher intelligence' might not be completely appropriate before age corrections (especially given the large drop in performance that suggests confounding effects of age).

      According to your comment, we have removed the statement.

      ** L188: it might be useful to report the range of R across the outer cross-validation folds as from Figure 4A it is not completely clear that the predictive performance is above the random (0) threshold. (For the sake of clarity, on L180 it might be useful for the reader if you directly report that other outcomes were not above the random threshold).

      According to your comment, we have added the range of R and revised the description, please see lines 195-198: “Furthermore, even after controlling for age, SC-FC coupling remained a significant predictor of general intelligence better than at chance (Pearson’s r=0.11±0.04, p=0.01, FDR corrected, Figure 4A). For fluid intelligence and crystal intelligence, the predictive performances of SC-FC coupling were not better than at chance (Figure 4A).”

      In a similar vein, in the text, you report Pearson's R for the predictive results but Figure 4A shows predictive accuracy - accuracy is a different (categorical) metric. It would be good to homogenise to clarify predictive results.

      We have made the corresponding changes in Figure 4.

      Author response image 6.

      Encoding individual differences in intelligence using regional SC-FC coupling. (A) Predictive accuracy of fluid, crystallized, and general intelligence composite scores. (B) Regional distribution of predictive weight. (C) Predictive contribution of functional networks. The boxes show the median and interquartile range (IQR; 25–75%), and the whiskers depict the 1.5× IQR from the first or third quartile.

      *Methods and QC:

      -Parcellations

      ** It would be useful to mention briefly how the BNA was applied to the data and if any quality checks were performed for the resulting parcellations, especially for the youngest subjects which might be most dissimilar to the population used to derive the atlas (healthy adults HCP subjects) ~ question of parcellation quality.

      We have added the description, please see lines 434-436: “The BNA[31] was projected on native space according to the official scripts (http://www.brainnetome.org/resource/) and the native BNA was checked by visual inspection.” 

      ** Additionally, the appropriateness of structurally defined regions for the functional analysis is also a topic of important debate. It might be useful to mention the above as limitations (which apply to most studies with similar focus).

      We have added your comment to the methodological issues, please see lines 378-379: “Third, the appropriateness of structurally defined regions for the functional analysis is also a topic of important debate.”

      - Tractography

      ** L432: it might be useful to name the method you used (probtrackx).

      We have added this name to the description, please see lines 455-456: “probabilistic tractography (probtrackx)[78, 79] was implemented in the FDT toolbox …”

      ** L434: 'dividing the total fibres number in source region' - dividing by what?

      We have revised the description, please see line 458: “dividing by the total fibres number in source region.”
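
      The normalization can be sketched in a few lines of Python. The streamline counts are invented, the helper name `connection_density` is hypothetical, and taking the row sum as the "total fibres number in source region" is an assumption of this example; the total could instead be the number of streamlines seeded in the source region.

```python
import numpy as np

def connection_density(streamline_counts):
    """Divide each source-to-target count by the total count of its source region."""
    counts = np.asarray(streamline_counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Rows with no streamlines are left at zero to avoid division by zero
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

# Rows are source regions, columns are target regions
counts = np.array([[0, 30, 10],
                   [20, 0, 20],
                   [0, 0, 0]])
density = connection_density(counts)
```

      Each non-empty row of `density` sums to one, so the resulting matrix is generally asymmetric even when the underlying anatomy is not.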

      ** L436: 'connections in subcortical areas were removed' - why did you trace connections to subcortical areas in the first place if you then removed them (to match with cortical MPC areas I suspect)? Or do you mean there were spurious streamlines through subcortical regions that you filtered?

      There are two reasons. First, we needed to match the cortical regions of the MPC. Second, as we stated in the methodological issues, accurately resolving the connections of small structures within subcortical regions using whole-brain diffusion imaging and tractography techniques remains challenging[10, 11].

      ** Following on the above, did you use any exclusion masks during the tracing? In general, more information about quality checks for the tractography would be useful. For example, L437: did you do any quality evaluations based on the removed spurious streamlines? For example, were there any trends between spurious streamlines and the age of the subject? Distance between regions/size of the regions?

      We did not use any exclusion masks. We performed visual inspection for the tractography quality and did not assess the relationship between spurious streamlines and age or distance between regions/size of the regions.

      ** L439: 'weighted probabilistic network' - this was weighted by the filtered connectivity densities or something else?

      The probabilistic network is weighted by the filtered connectivity densities.

      ** I appreciate the short description of the communication models in Text S1, it is very useful.

      Thank you for your comment.

      ** In addition to limitations mentioned in L368 - during reconstruction, have you noticed problems resolving short inter-hemispheric connections?

      We had not considered this issue and have now added it to the limitations, please see lines 383-384: “In addition, the reconstruction of short connections between hemispheres is a notable challenge.”

      - Functional analysis:

      ** There is a difference in acquisition times between participants below and above 8 years (21 vs 26 min), does the different length of acquisition affect the quality of the processed data?

      We applied relatively strict quality control to ensure the quality of the processed data.

      ** L446 'regressed out nuisance variables' - it would be informative to describe in more detail what you used to perform this.

      We have provided more detail about the regression of nuisance variables, please see lines 476-477: “The nuisance variables were removed from time series based on general linear model.”

      ** L450-452: it would be useful to add the number of excluded participants to get an intuition for the overall quality of the functional data. Have you checked if the quality is associated with the age of the participant (which might be related to motion etc). Adding a distribution of remaining frames across participants (vs age) would be useful to see in the supplementary methods to better understand the data you are using.

      We have added information on the exclusion of subjects during data processing, as well as the distributions of motion and remaining frames and their correlations with age. Please see lines 481-485: “Quality control. The exclusion of participants in the whole multimodal data processing pipeline was depicted in Figure S13. In the context of fMRI data, we computed Pearson’s correlation between motion and age, as well as between the number of remaining frames and age, for the included participants aged 5 to 22 years and 8 to 22 years, respectively. These correlations were presented in Figure S14.”

      Author response image 7.

      Exclusion of participants in the whole multimodal data processing pipeline.  

      Author response image 8.

      Figure S14. Correlations between motion and age and number of remaining frames and age.

      ** L454: 'Pearson's correlation's... ' In contrast to MPC you did not remove negative correlations in the functional matrices. Why this choice?

      Whether negative correlations in functional signals should be removed has long been controversial. Referring to previous studies of SC-FC coupling[12-14], we found that retaining negative correlation connections is widely practiced. To retain more information, we chose this strategy. Considering that MPC is a nascent approach to network modeling, we adopted the more conservative strategy of removing negative correlations, following the study [4] that proposed the approach.

      - Gene expression:

      ** L635, you focus on the left cortex, is this common? Do you expect the gene expression to be fully symmetric (given reported functional hemispheric asymmetries)? It might be good to expand on the reasoning.

      An important consideration regarding sample assignment arises from the fact that only two out of six brains were sampled from both hemispheres, whereas four brains have samples collected only in the left hemisphere. This sparse sampling should be carefully considered when combining data across donors[1]. We have supplemented the description, please see lines 569-571: “Restricting analyses to the left hemisphere will minimize variability across regions (and hemispheres) in terms of the number of samples available[40].”

      ** Paragraph of L537: you use evolution of coupling with age (correlation) and compare to gene expression with adults (cohort of Allen Human Brain Atlas - no temporal evolution to the gene expressions) and on L369 you mention that 'relative spatial patterns of gene expressions remain stable after birth'. Of course this is not a place to question previous studies, but would you really expect the gene expression associated with the temporary processes to remain stable throughout the development? For example, myelination would follow different spatiotemporal gradient across brain regions, is it reasonable to expect that the expression patterns remain the same? How do you then interpret a changing measure of coupling (correlation with age) with a gene expression assessed statically?

      We agree with your comment that the spatial expression patterns are expected to vary across developmental periods. We have revised the previous description, please see lines 383-386: “Fifth, it is important to acknowledge that changes in gene expression levels during development may introduce bias in the results.”

      - Reproducibility analyses:

      ** Paragraph L576: are we to understand that you performed the entire pipeline 3 times (WD, S1, S2) for both parcellations schemes and tractography methods (~12 times) including the selection of communication models and you always got the same best three communication models and gene expression etc? Or did you make some design choices (i.e. selection of communication models) only on a specific set-up and transfer to other settings?

      The choice of communication model is established at the beginning, which we have clarified in the article, please see lines 106-108: “We used these three models to represent the extracortical connectivity properties in subsequent discovery and reproducibility analyses (Figure S1).” For reproducibility analyses (parcellation, tractography, and split-half validation), we fixed other settings and only assessed the impact of a single factor.

      ** Paragraph of L241: I really appreciate you evaluated the robustness of your results to different tractography strategies. It is reassuring to see the similarity in results for the two approaches. Did you notice any age-related effects on tractography quality for the two methods given the wide age range (did you check?)

      In our study, tractography quality was checked by visual inspection. Using quantitative tools to assess tractography quality in future studies could answer this question objectively.

      ** Additionally, I wonder how much of that overlap is driven by the changes in MPC which is the same between the two methods... especially given its high weight in the SC-FC coupling you reported earlier in the paper. It might be informative to directly compare the connectivity matrices derived from the two tracto methods directly. Generally, as mentioned in the previous comments, I think it would be interesting to assess coupling using different input settings (with WM structural and MPC separate and then combined).

      As noted in our response to your previous comment, we have examined the coupling patterns, coupling differences, coupling-age correlations, and spatial correlations between the patterns based on different models, as shown in Figure S2. Please see that response for details.

      ** L251 - I also wonder if the random splitting is best adapted to validation in your case given you study relationships with age. Would it make more sense to make stratified splits to ensure a 'similar age coverage' across splits?

      In our study, we adopted a random splitting process, repeated 1,000 times, to minimize bias due to data partitioning. The stratification you mention is a reasonable method, and keeping the age distribution even across splits would likely yield higher validation similarity than our approach. However, the similarity obtained with our method is sufficient to support the generalizability of our findings.
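
      For illustration only (this is not the authors' code): a minimal sketch contrasting a purely random split-half with an age-stratified one. The cohort, the participant ids, the bin count and all function names are hypothetical, and ages are simulated uniformly between 5 and 22 years.

```python
import random

def random_split_half(ids, rng):
    """Shuffle participant ids and cut the list in half."""
    shuffled = list(ids)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

def age_stratified_split_half(ages, rng, n_bins=4):
    """Split participants so both halves cover a similar age range.

    ages: dict mapping participant id -> age in years.
    Participants are sorted by age, grouped into n_bins consecutive bins,
    and each bin is shuffled and divided between the two halves.
    """
    ordered = sorted(ages, key=ages.get)
    bin_size = -(-len(ordered) // n_bins)  # ceiling division
    half_a, half_b = [], []
    for start in range(0, len(ordered), bin_size):
        members = ordered[start:start + bin_size]
        rng.shuffle(members)
        mid = len(members) // 2
        # For odd-sized bins, give the extra member to the smaller half.
        if len(half_a) > len(half_b):
            n_for_a = mid
        else:
            n_for_a = len(members) - mid
        half_a.extend(members[:n_for_a])
        half_b.extend(members[n_for_a:])
    return half_a, half_b

def mean_age(ids, ages):
    return sum(ages[i] for i in ids) / len(ids)

rng = random.Random(0)
# Hypothetical cohort: 200 participants aged 5 to 22 years.
ages = {f"sub-{i:03d}": rng.uniform(5, 22) for i in range(200)}

gaps_random, gaps_stratified = [], []
for _ in range(1000):
    a, b = random_split_half(ages, rng)
    gaps_random.append(abs(mean_age(a, ages) - mean_age(b, ages)))
    a, b = age_stratified_split_half(ages, rng)
    gaps_stratified.append(abs(mean_age(a, ages) - mean_age(b, ages)))

print("mean |age gap|, random:    ", sum(gaps_random) / 1000)
print("mean |age gap|, stratified:", sum(gaps_stratified) / 1000)
```

      In this toy setting the stratified halves differ less in mean age than the random halves, which is the property the reviewer's suggestion is meant to guarantee.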

      Minor comments

      L42: 'is regulated by genes'

      ** Coupling (if having a functional role and being regulated at all) is possibly resulting from a complex interplay of different factors in addition to genes, for example, learning/environment, it might be more cautious to use 'regulated in part by genes' or similar.

      We have corrected it, please see line 42.

      L43 (and also L377): 'development of SC-FC coupling'

      ** I know this is very nitpicky and depends on your opinion about the nature of SC-FC coupling, but 'development of SC-FC coupling' gives an impression of something maturing that has a role 'in itself' (for example development of eye from neuroepithelium to mature organ etc.). For now, I am not sure it is fully certain that SC-FC coupling is more than a byproduct of the comparison between SC and FC, using 'changes in SC-FC coupling with development' might be more apt.

      We have corrected it, please see lines 43-44.

      L261 'SC-FC coupling was stronger ... [] ... and followed fundamental properties of cortical organization.' vs L168 'No significant correlations were found between developmental changes in SC-FC coupling and the fundamental properties of cortical organization'.

      **Which one is it? I think in the first you refer to mean coupling over all infants and in the second about correlation with age. How do you interpret the difference?

      Between the ages of 5 and 22 years, we found that the mean SC-FC coupling pattern is already similar to that of adults and consistent with the fundamental properties of cortical organization. However, the developmental changes in SC-FC coupling are heterogeneous and sequential, and their magnitude does not follow the mean coupling pattern.

      L277: 'temporal and spatial complexity'

      ** Additionally, communication models have different assumptions about the flow within the structural network and will have different biological plausibility (they will be more or less 'realistic').

      Here, temporal and spatial complexity are meant from a computational point of view.

      L283: 'We excluded a centralized model (shortest paths), which was not biologically plausible'

      ** But in Text S1 and Table S1 you specify the shortest paths models. Does this mean you computed them but did not incorporate them in the final coupling computations even if they were predictive?

      ** Generally, I find the selection of the final 3 communication models confusing. It would be very useful if you could clarify this further, for example in the methods section.

      We used all twenty-seven communication models (including shortest paths) to predict FC at the node level for each participant. We then identified the three communication models that significantly predicted FC. The shortest-path model was excluded because it did not meet the significance criterion. We have added further methodological details to this section; please see lines 503-507.

      L332 'As we observed increasing coupling in these [frontoparietal network and default mode network] networks, this may have contributed to the improvements in general intelligence, highlighting the flexible and integrated role of these networks' vs L293 'SC-FC coupling in association areas, which have lower structural connectivity, was lower than that in sensory areas. This configuration effectively releases the association cortex from strong structural constraints imposed by early activity cascades, promoting higher cognitive functions that transcend simple sensori-motor exchanges'

      ** I am not sure I follow the reasoning. Could you expand on why it would be the decoupling promoting the cognitive function in one case (association areas generally), but on the reverse the increased coupling in frontoparietal promoting the cognition in the other (specifically frontoparietal)?

      To clarify: for general intelligence, increased coupling in the frontoparietal network could allow more effective information integration and enable efficient collaboration between different cognitive processes.

      * Formatting errors etc.

      L52: maybe rephrase?

      We have rephrased it; please see lines 51-53: “The T1- to T2-weighted (T1w/T2w) ratio of MRI has been proposed as a means of quantifying microstructure profile covariance (MPC), which reflects a simplified recapitulation in cellular changes across intracortical laminar structure[6, 12-15].”

      L68: specialization1,[20].

      We have corrected it.

      L167: 'networks significantly increased with age and exhibited greater increased' - needs rephrasing.

      We have corrected it.

      L194: 'networks were significantly predicted the general intelligence' - needs rephrasing.

      We have corrected it, please see lines 204-205: “we found that the weights of frontoparietal and default mode networks significantly contributed to the prediction of the general intelligence.”

      L447: 'and temporal bandpass filtering' - there is a verb missing.

      We have corrected it, please see line 471: “executed temporal bandpass filtering.”

      L448: 'greater than 0.15' - unit missing.

      We have corrected it, please see line 472: “greater than 0.15 mm”.

      L452: 'After censoring, regression of nuisance variables, and temporal bandpass filtering,' - no need to repeat the steps as you mentioned them 3 sentences earlier.

      We have removed it.

      L458-459: sorry I find this description slightly confusing. What do you mean by 'modal'? Connectional -> connectivity profile. The whole thing could be simplified, if I understand correctly your vector of independent variables is a set of wm and microstructural 'connectivity' of the given node... if this is not the case, please make it clearer.

      We have corrected it, please see line 488: “where 𝒔𝑖 is the 𝑖th SC profiles, 𝑛 is the number of SC profiles”.

      L479: 'values and system-specific of 480 coupling'.

      We have corrected it.

      L500: 'regular' - regularisation.

      We have changed it to “regularization”.

      L567: Do you mean that in contrast to probabilistic with FSL you use deterministic methods within Camino? For L570, you introduce communication models through 'such as': did you fit all models like before? If not, it might be clearer to just list the ones you estimated rather than introduce through 'such as'.

      We have changed the description to avoid ambiguity, please see lines 608-609: “We then calculated the communication properties of the WMC including communicability, mean first passage times of random walkers, and flow graphs (timescales=1).”
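
      For illustration only (this is not the authors' pipeline): a minimal NumPy sketch of two of the communication measures named above, communicability and mean first passage time, on a toy network. It assumes a symmetric, connected weight matrix with no isolated nodes; the function names and the toy weights are hypothetical, and flow graphs are not covered.

```python
import numpy as np

def communicability(weights):
    """Weighted communicability: matrix exponential of the
    strength-normalised adjacency matrix D^(-1/2) W D^(-1/2)."""
    w = np.asarray(weights, dtype=float)
    strength = w.sum(axis=1)
    scale = 1.0 / np.sqrt(strength)
    normalised = w * np.outer(scale, scale)
    # The normalised matrix is symmetric, so its exponential can be
    # computed from an eigendecomposition.
    eigvals, eigvecs = np.linalg.eigh(normalised)
    return (eigvecs * np.exp(eigvals)) @ eigvecs.T

def mean_first_passage_time(weights):
    """Expected number of steps for a random walker to first reach
    node j from node i (rows: source, columns: target)."""
    w = np.asarray(weights, dtype=float)
    strength = w.sum(axis=1)
    transition = w / strength[:, None]
    stationary = strength / strength.sum()  # valid for undirected graphs
    n = w.shape[0]
    limiting = np.tile(stationary, (n, 1))
    fundamental = np.linalg.inv(np.eye(n) - transition + limiting)
    return (np.diag(fundamental)[None, :] - fundamental) / stationary[None, :]

# Toy symmetric, connected network with three nodes (hypothetical weights).
toy = np.array([[0.0, 1.0, 1.0],
                [1.0, 0.0, 1.0],
                [1.0, 1.0, 0.0]])
comm = communicability(toy)
mfpt = mean_first_passage_time(toy)
print(np.round(comm, 3))
print(np.round(mfpt, 3))
```

      The mean first passage time uses the standard fundamental-matrix formula for ergodic Markov chains; dedicated toolboxes may differ in normalisation conventions.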

      Citation [12], it is unusual to include competing interests in the citation, moreover, Dr. Bullmore mentioned is not in the authors' list - this is most likely an error with citation import, it would be good to double-check.

      We have corrected it.

      L590: Python scripts used to perform PLS regression can 591 be found at https://scikitlearn.org/. The link leads to general documentation for sklearn.

      We have corrected it; please see lines 627-630: “Python scripts used to perform PLS regression can be found at https://scikit-learn.org/stable/modules/generated/sklearn.cross_decomposition.PLSRegression.html#sklearn.cross_decomposition.PLSRegression.”

      P26 and 27 - there are two related sections: Data and code availability and Code availability - it might be worth merging into one section if possible.

      We have corrected it, please see lines 623-633.

      References

      (1) Arnatkeviciute A, Fulcher BD, Fornito A. A practical guide to linking brain-wide gene expression and neuroimaging data. Neuroimage. 2019;189:353-67. Epub 2019/01/17. doi: 10.1016/j.neuroimage.2019.01.011. PubMed PMID: 30648605.

      (2) Zhong S, He Y, Gong G. Convergence and divergence across construction methods for human brain white matter networks: an assessment based on individual differences. Hum Brain Mapp. 2015;36(5):1995-2013. Epub 2015/02/03. doi: 10.1002/hbm.22751. PubMed PMID: 25641208; PubMed Central PMCID: PMCPMC6869604.

      (3) Waehnert MD, Dinse J, Weiss M, Streicher MN, Waehnert P, Geyer S, et al. Anatomically motivated modeling of cortical laminae. Neuroimage. 2014;93 Pt 2:210-20. Epub 2013/04/23. doi: 10.1016/j.neuroimage.2013.03.078. PubMed PMID: 23603284.

      (4) Paquola C, Vos De Wael R, Wagstyl K, Bethlehem RAI, Hong SJ, Seidlitz J, et al. Microstructural and functional gradients are increasingly dissociated in transmodal cortices. PLoS Biol. 2019;17(5):e3000284. Epub 2019/05/21. doi: 10.1371/journal.pbio.3000284. PubMed PMID: 31107870.

      (5) Haufe S, Meinecke F, Gorgen K, Dahne S, Haynes JD, Blankertz B, et al. On the interpretation of weight vectors of linear models in multivariate neuroimaging. Neuroimage. 2014;87:96-110. Epub 2013/11/19. doi: 10.1016/j.neuroimage.2013.10.067. PubMed PMID: 24239590.

      (6) Demirtas M, Burt JB, Helmer M, Ji JL, Adkinson BD, Glasser MF, et al. Hierarchical Heterogeneity across Human Cortex Shapes Large-Scale Neural Dynamics. Neuron. 2019;101(6):1181-94 e13. Epub 2019/02/13. doi: 10.1016/j.neuron.2019.01.017. PubMed PMID: 30744986; PubMed Central PMCID: PMCPMC6447428.

      (7) Deco G, Kringelbach ML, Arnatkeviciute A, Oldham S, Sabaroedin K, Rogasch NC, et al. Dynamical consequences of regional heterogeneity in the brain's transcriptional landscape. Sci Adv. 2021;7(29). Epub 2021/07/16. doi: 10.1126/sciadv.abf4752. PubMed PMID: 34261652; PubMed Central PMCID: PMCPMC8279501.

      (8) Chen J, Tam A, Kebets V, Orban C, Ooi LQR, Asplund CL, et al. Shared and unique brain network features predict cognitive, personality, and mental health scores in the ABCD study. Nat Commun. 2022;13(1):2217. Epub 2022/04/27. doi: 10.1038/s41467-022-29766-8. PubMed PMID: 35468875; PubMed Central PMCID: PMCPMC9038754.

      (9) Li J, Bzdok D, Chen J, Tam A, Ooi LQR, Holmes AJ, et al. Cross-ethnicity/race generalization failure of behavioral prediction from resting-state functional connectivity. Sci Adv. 2022;8(11):eabj1812. Epub 2022/03/17. doi: 10.1126/sciadv.abj1812. PubMed PMID: 35294251; PubMed Central PMCID: PMCPMC8926333.

      (10) Thomas C, Ye FQ, Irfanoglu MO, Modi P, Saleem KS, Leopold DA, et al. Anatomical accuracy of brain connections derived from diffusion MRI tractography is inherently limited. Proc Natl Acad Sci U S A. 2014;111(46):16574-9. Epub 2014/11/05. doi: 10.1073/pnas.1405672111. PubMed PMID: 25368179; PubMed Central PMCID: PMCPMC4246325.

      (11) Reveley C, Seth AK, Pierpaoli C, Silva AC, Yu D, Saunders RC, et al. Superficial white matter fiber systems impede detection of long-range cortical connections in diffusion MR tractography. Proc Natl Acad Sci U S A. 2015;112(21):E2820-8. Epub 2015/05/13. doi: 10.1073/pnas.1418198112. PubMed PMID: 25964365; PubMed Central PMCID: PMCPMC4450402.

      (12) Gu Z, Jamison KW, Sabuncu MR, Kuceyeski A. Heritability and interindividual variability of regional structure-function coupling. Nat Commun. 2021;12(1):4894. Epub 2021/08/14. doi: 10.1038/s41467-021-25184-4. PubMed PMID: 34385454; PubMed Central PMCID: PMCPMC8361191.

      (13) Liu ZQ, Vazquez-Rodriguez B, Spreng RN, Bernhardt BC, Betzel RF, Misic B. Time-resolved structure-function coupling in brain networks. Commun Biol. 2022;5(1):532. Epub 2022/06/03. doi: 10.1038/s42003-022-03466-x. PubMed PMID: 35654886; PubMed Central PMCID: PMCPMC9163085.

      (14) Zamani Esfahlani F, Faskowitz J, Slack J, Misic B, Betzel RF. Local structure-function relationships in human brain networks across the lifespan. Nat Commun. 2022;13(1):2053. Epub 2022/04/21. doi: 10.1038/s41467-022-29770-y. PubMed PMID: 35440659; PubMed Central PMCID: PMCPMC9018911.

    1. Author response:

      The following is the authors’ response to the original reviews.

      We sincerely appreciate the editors for overseeing an efficient review process and for upholding the high standards of the journal. We have made extensive revisions to the manuscript after carefully reviewing the reviewers’ comments. We have addressed all the comments in our response and have incorporated the changes suggested by the reviewers to the best of our abilities. Notably, we have made the following major changes to the manuscript:

      (1) We have increased the patient cohort size from 10 to 23 for evaluating the levels of YEATS2 and H3K27cr.

      (2) To further strengthen the clinical relevance of our study, we have checked the expression of major genes involved in the YEATS2-mediated histone crotonylation axis (YEATS2, GCDH, ECHS1, Twist1 along with H3K27cr levels) in head and neck cancer tissues using immunohistochemistry.

      (3) We have performed extensive experiments to look into the role of p300 in assisting YEATS2 in regulating promoter histone crotonylation.

      The changes made to the manuscript figures have been highlighted in our response. We have also updated the Results section in accordance with the updated figures. Tables 1-4 and Supplementary files 1-3 have been moved to one single Excel workbook named ‘Supplementary Tables 1-8’. Additional revisions have been made to improve the overall quality of the manuscript and enhance data visualization. These additional changes are highlighted in the tracked changes version of the manuscript.

      Our response to the Public Reviews and ‘Recommendations to the Authors’ can be found below.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This manuscript investigates a mechanism between the histone reader protein YEATS2 and the metabolic enzyme GCDH, particularly in regulating epithelial-to-mesenchymal transition (EMT) in head and neck cancer (HNC).

      Strengths:

      Great detailing of the mechanistic aspect of the above axis is the primary strength of the manuscript.

      Weaknesses:

      Several critical points require clarification, including the rationale behind EMT marker selection, the inclusion of metastasis data, the role of key metabolic enzymes like ECHS1, and the molecular mechanisms governing p300 and YEATS2 interactions.

      We would like to sincerely thank the reviewer for the detailed, in-depth, and positive response. We have implemented constructive revisions to the manuscript to address the reviewer’s concerns effectively.

      Major Comments:

      (1) The title, "Interplay of YEATS2 and GCDH mediates histone crotonylation and drives EMT in head and neck cancer," appears somewhat misleading, as it implies that YEATS2 directly drives histone crotonylation. However, YEATS2 functions as a reader of histone crotonylation rather than a writer or mediator of this modification. It cannot itself mediate the addition of crotonyl groups onto histones. Instead, the enzyme GCDH is the one responsible for generating crotonyl-CoA, which enables histone crotonylation. Therefore, while YEATS2 plays a role in recognizing crotonylation marks and may regulate gene expression through this mechanism, it does not directly catalyse or promote the crotonylation process.

      We thank the reviewer for their insightful comment regarding the precision of our title. We agree that the initial wording 'mediates' could imply a direct enzymatic role for YEATS2 in histone crotonylation, which is indeed not the case. As the reviewer correctly points out, YEATS2 functions as a 'reader' of histone crotonylation marks.

      However, our research demonstrates that YEATS2 plays a crucial indirect regulatory role in the establishment of these crotonylation marks. Specifically, our data indicate that YEATS2 facilitates the recruitment of the histone crotonyltransferase p300 to specific gene promoters, such as that of SPARC. This recruitment mechanism directly impacts the localized deposition of crotonyl marks on nearby histone residues. Therefore, while YEATS2 does not directly catalyze the addition of crotonyl groups, its presence and interaction with p300 are essential for the regulation and establishment of histone crotonylation at these critical sites.

      To accurately reflect this nuanced, yet significant, regulatory mechanism, we have revised the title. We are replacing 'mediates' with 'regulates' to precisely convey that YEATS2 influences the histone crotonylation process, albeit indirectly, through its role in recruiting the enzymatic machinery. The updated title will now read: 'Interplay of YEATS2 and GCDH regulates histone crotonylation and drives EMT in head and neck cancer.' We believe this change maintains the core message of our findings while enhancing the scientific accuracy of the title.

      (2) The study suggests a link between YEATS2 and metastasis due to its role in EMT, but the lack of clinical or pre-clinical evidence of metastasis is concerning. Only primary tumor (PT) data is shown, but if the hypothesis is that YEATS2 promotes metastasis via EMT, then evidence from metastatic samples or in vivo models should be included to solidify this claim.

      We thank the reviewer for their valuable suggestion regarding the need for clinical or pre-clinical evidence of metastasis. We fully agree that direct evidence linking YEATS2 to metastasis would significantly strengthen our claims, especially given its demonstrated role in EMT.

      Our primary objective in this study was to meticulously dissect the molecular mechanisms by which YEATS2 regulates histone crotonylation and drives EMT in head and neck cancer. We have provided comprehensive upstream and downstream molecular insights into this process, culminating in a clear demonstration of YEATS2's functional importance in promoting EMT through multiple in vitro phenotypic assays (e.g., Matrigel invasion, wound healing, 3D invasion assays). As the reviewer notes, EMT is a widely recognized prerequisite for cancer metastasis[1]. Therefore, establishing YEATS2 as a driver of EMT directly implicates its potential role in metastatic progression.

      To further address the reviewer's concern and bridge the gap between EMT and metastasis, we have performed additional analyses that will be incorporated into the revised manuscript:

      Clinical Correlation with Tumor Grade: We analyzed publicly available head and neck cancer patient datasets. Our analysis revealed a significant positive correlation between YEATS2 expression and increasing tumor grade. Specifically, we observed significantly higher YEATS2 expression in Grade 2-4 tumors compared to Grade 1 tumors. Given that higher tumor grades are frequently associated with increased metastatic potential and poorer prognosis in HNC[2], this finding provides compelling clinical correlative evidence linking elevated YEATS2 expression to more aggressive disease.

      Gene Set Enrichment Analysis (GSEA) for Metastasis Pathways: To further explore the biological processes associated with YEATS2 in a clinical context, we performed GSEA on TCGA HNC patient samples stratified by high versus low YEATS2 expression. This analysis robustly demonstrated a positive enrichment of metastasis-related gene sets in the high YEATS2 expression group, compared to the low YEATS2 group. This strengthens the mechanistic link by showing that pathways associated with metastasis are co-ordinately upregulated when YEATS2 is highly expressed.

      These new clinical data provide strong correlative evidence supporting a direct association of YEATS2 with metastasis, building upon our detailed mechanistic dissection of its role in EMT.

      (3) There seems to be some discrepancy in the invasion data with BICR10 control cells (Figure 2C). BICR10 control cells with mock plasmids, specifically shControl and pEGFP-C3 show an unclear distinction between invasion capacities. Normally, we would expect the control cells to invade somewhat similarly, in terms of area covered, within the same time interval (24 hours here). But we clearly see more control cells invading when the invasion is done with KD and fewer control cells invading when the invasion is done with OE. Are these just plasmid-specific significant effects on normal cell invasion? This needs to be addressed.

      We thank the reviewer for their careful examination of Figure 2C and their insightful observation regarding the appearance of the control cells in relation to the knockdown (Figure 2B) and overexpression (Figure 2C) experiments. We understand how, at first glance, the control invasion levels across these panels might seem disparate.

      We wish to clarify that Figure 2B (YEATS2 knockdown) and Figure 2C (YEATS2 overexpression) represent two entirely independent experiments, conducted with distinct experimental conditions and methodologies, as detailed in our Methods section.

      Specifically:

      Figure 2B (Knockdown): Utilizes lentivirus-mediated transduction for stable shRNA delivery (shControl as control).

      Figure 2C (Overexpression): Utilizes transfection with plasmid DNA (pEGFP-C3 as control) via a standard transfection reagent.

      These fundamental differences in genetic manipulation methods (transduction vs. transfection), along with potential batch-to-batch variations in reagents or cell passage number at the time of each independent experiment, can indeed lead to variations in absolute basal invasion rates of control cells[3].

      Therefore, the invasion capacity of BICR10 control cells in Figure 2B (shControl) should only be compared to the YEATS2 knockdown conditions within that same panel. Similarly, the invasion capacity of control cells in Figure 2C (pEGFP-C3) should only be compared to the YEATS2 overexpression conditions within that specific panel. The crucial finding in each panel lies in the relative change in invasion caused by YEATS2 manipulation (knockdown or overexpression) compared to its respective, concurrently run control.

      We have ensured that all statistical analyses (as indicated in the figure legends and methods) were performed by comparing the experimental groups directly to their matched internal controls within each independent experiment. The significant increase in invasion upon YEATS2 overexpression and the significant decrease upon YEATS2 knockdown, relative to their respective controls, are robust and reproducible findings.

      (4) In Figure 3G, the Western blot shows an unclear band for YEATS2 in shSP1 cells with YEATS2 overexpression condition. The authors need to clearly identify which band corresponds to YEATS2 in this case.

      We thank the reviewer for pointing out the ambiguity in the YEATS2 Western blot for the shSP1 + pEGFP-C3-YEATS2 condition in Figure 3G. We apologize for this lack of clarity. The two bands seen in the shSP1+pEGFP-C3-YEATS2 condition correspond to the endogenous YEATS2 band (lower band) and YEATS2-GFP band (upper band, corresponding to overexpressed YEATS2-GFP fusion protein, which has a higher molecular weight). To avoid confusion, the endogenous band is now highlighted (marked by *) in the lane representing the shSP1+pEGFP-C3-YEATS2 condition. We have also updated the figure legend accordingly.

      (5) In ChIP assays with SP1, YEATS2 and p300 which promoter regions were selected for the respective genes? Please provide data for all the different promoter regions that must have been analysed, highlighting the region where enrichment/depletion was observed. Including data from negative control regions would improve the validity of the results.

      Throughout our study, we performed ChIP-qPCR assays to check SP1 binding on the YEATS2 and GCDH promoters, and to check YEATS2 and p300 binding on the SPARC promoter. Using transcription factor binding prediction tools and luciferase assays, we selected multiple sites on the YEATS2 and GCDH promoters to check for SP1 binding. The results corresponding to the site that showed significant enrichment were provided in the manuscript. The region of the SPARC promoter used in the YEATS2 and p300 ChIP assays was selected on the basis of the YEATS2 enrichment found in the YEATS2 ChIP-seq data. The ChIP-qPCR data for all the promoter regions investigated (including negative controls) can be found below (Authors’ response image 1).

      Authors’ response image 1.

      (A) SP1 ChIP-qPCR results indicating SP1 occupancy on different regions of YEATS2 promoter. YEATS2 promoter region showing SP1 binding sites (indicated by red boxes) is shown above. SP1 showed significant enrichment at F1R1 region. The results corresponding to F1R1 region were included in Figure 3D. (B) SP1 ChIP-qPCR results indicating SP1 occupancy on different regions of GCDH promoter. GCDH promoter region showing SP1 binding sites (indicated by red boxes) is shown above. SP1 showed significant enrichment at F2R2 region. The results corresponding to F2R2 region were included in Figure 7E. (C) YEATS2 ChIP-qPCR results in shControl vs. shYEATS2 BICR10 cells indicating YEATS2 occupancy on different regions of SPARC promoter. SPARC promoter region showing YEATS2 ChIP-seq and H3K27cr ChIP-seq signals is shown above. YEATS2 showed significant enrichment at F1R1 region. The results corresponding to F1R1 region were included in Figure 5C. (D) p300 ChIP-qPCR results in shControl vs. shYEATS2 BICR10 cells indicating p300 occupancy on different regions of SPARC promoter. p300 showed significant enrichment at F1R1 region. The results corresponding to F1R1 region were included in Figure 5F.

      (6) The authors establish a link between H3K27Cr marks and GCDH expression, and this is an already well-known pathway. A critical missing piece is the level of ECHS1 in patient samples. This will clearly delineate if the balance shifted towards crotonylation.

      We greatly appreciate the reviewer's insightful comment regarding the importance of assessing ECHS1 levels in patient samples to clearly delineate the metabolic balance shifting towards crotonylation. We fully agree that this is a critical piece of evidence.

      To directly address this point and substantiate our claim regarding the altered metabolic balance in HNC, we had previously analyzed the expression of both GCDH and ECHS1 in TCGA HNC RNA-seq data (as presented in Figure 4—figure supplement 1A and B). This analysis revealed a consistent increase in GCDH expression and a concomitant decrease in ECHS1 expression in tumor samples compared to normal tissues. Based on these findings, we hypothesized that this altered expression profile would indeed lead to an accumulation of crotonyl-CoA and, consequently, an overall increase in histone crotonylation in HNC.

      To further validate and extend these findings at the protein level, we have now performed immunohistochemistry (IHC) analysis for both ECHS1 and GCDH in a cohort of HNC normal vs. tumor tissues. Our IHC results strikingly corroborate the RNA-seq data: GCDH consistently showed increased protein expression in tumor samples, whereas ECHS1 exhibited significantly reduced protein expression in tumors compared to their adjacent normal counterpart tissues (Figure 4E and Authors’ response figure 5).

      These new data, combined with the existing TCGA HNC RNA-seq analysis, strongly support our proposed mechanism, in which altered GCDH and ECHS1 expression contributes to increased histone crotonylation in head and neck cancer.

      (7) The p300 ChIP data on the SPARC promoter is confusing. The authors report reduced p300 occupancy in YEATS2-silenced cells, on SPARC promoter. However, this is paradoxical, as p300 is a writer, a histone acetyltransferase (HAT). The absence of a reader (YEATS2) shouldn't affect the writer (p300) unless a complex relationship between p300 and YEATS2 is present. The role of p300 should be further clarified in this case. Additionally, transcriptional regulation of SPARC expression in YEATS2 silenced cells could be analysed via downstream events, like Pol-II recruitment. Assays such as Pol-II ChIP-qPCR could help explain this.

      We greatly appreciate the reviewer's insightful observation regarding the apparently paradoxical reduction of p300 occupancy on the SPARC promoter upon YEATS2 silencing (Figure 5F), and their call for further clarification of p300's role and the potential complex relationship with YEATS2. We agree that this point required further mechanistic investigation.

      As we have shown through RNA-seq and ChIP-seq analyses, YEATS2 broadly influences histone crotonylation levels at gene promoters, thereby impacting gene expression. While p300 is indeed a known histone acetyltransferase (HAT) with promiscuous acyltransferase activity, including crotonyltransferase activity[4], the precise mechanism by which its occupancy is affected by a 'reader' protein like YEATS2 was unclear. Our initial data suggested a dependency of p300 recruitment on YEATS2.

      To directly address the reviewer's concern and thoroughly delineate the molecular mechanism of cooperativity between YEATS2 and p300 in regulating histone crotonylation, we have now performed a series of targeted experiments, which have been incorporated into the revised manuscript:

      (a) Validation of p300's role in SPARC expression: We performed p300 knockdown in BICR10 cells, followed by immunoblotting to assess SPARC protein levels. As expected, a significant decrease in SPARC protein levels was observed upon p300 knockdown (Figure 5G). This confirms p300's direct involvement in SPARC gene expression.

      (b) Direct interaction between YEATS2 and p300: To investigate a potential physical association, we performed co-immunoprecipitation assays to check for an interaction between endogenous YEATS2 and p300. Our results clearly demonstrate the presence of YEATS2 in the p300-immunoprecipitate sample, indicating that YEATS2 and p300 physically interact and likely function together as a complex to drive the expression of target genes like SPARC (Figure 5H). This direct interaction provides the mechanistic basis for how YEATS2 influences p300 occupancy.

      (c) Impact on transcriptional activity (Pol II recruitment): As suggested, we performed RNA Polymerase II (Pol II) ChIP-qPCR on the SPARC promoter in YEATS2 knockdown cells. We observed a significant decrease in Pol II occupancy on the SPARC promoter after YEATS2 knockdown in BICR10 cells (Figure 6C). This confirms that YEATS2 silencing leads to reduced transcriptional initiation/elongation at this promoter.

      (d) p300's direct role in H3K27cr on SPARC promoter: To confirm p300's specific role in crotonylation at this locus, we performed H3K27cr ChIP-qPCR after p300 knockdown. As anticipated, a significant decrease in H3K27cr enrichment was observed on the SPARC promoter upon p300 knockdown (Figure 6J), directly demonstrating p300's crotonyltransferase activity at this site.

      (e) Rescue of p300 occupancy and H3K27cr by YEATS2 overexpression in SP1-deficient cells: To further establish the YEATS2-p300 axis, we performed SP1 knockdown (which reduces YEATS2 expression) followed by ectopic YEATS2 overexpression, and then assessed p300 occupancy and H3K27cr levels on the SPARC promoter. While SP1 knockdown led to a decrease in both p300 and H3K27cr enrichment, we observed a significant rescue of both p300 occupancy and H3K27cr enrichment upon YEATS2 overexpression in the shSP1 cells (Figure 6E and F). This provides strong evidence that YEATS2 acts downstream of SP1 to regulate p300 recruitment and H3K27cr levels.

      Collectively, these comprehensive new results clearly establish that YEATS2 directly interacts with and assists in the recruitment of p300 to the SPARC promoter. This recruitment is crucial for p300's localized crotonyltransferase activity, leading to increased H3K27cr marks and subsequent activation of SPARC transcription. This clarifies the previously observed 'paradox' and defines a novel cooperative mechanism between a histone reader (YEATS2) and a writer (p300) in regulating histone crotonylation and gene expression.

      (8) The role of GCDH in producing crotonyl-CoA is already well-established in the literature. The authors' hypothesis that GCDH is essential for crotonyl-CoA production has been proven, and it's unclear why this is presented as a novel finding. It has been shown that YEATS2 KD leads to reduced H3K27cr, however, it remains unclear how the reader is affecting crotonylation levels. Are GCDH levels also reduced in the YEATS2 KD condition? Are YEATS2 levels regulating GCDH expression? One possible mechanism is YEATS2 occupancy on GCDH promoter and therefore reduced GCDH levels upon YEATS2 KD. This aspect is crucial to the study's proposed mechanism but is not addressed thoroughly.

      We appreciate the reviewer's valuable comment questioning the novelty of GCDH's role in crotonyl-CoA production and seeking further clarification on how YEATS2 influences crotonylation levels beyond its reader function.

      We agree that GCDH's general role in producing crotonyl-CoA is well-established[5,6]. Our study, however, aims to delineate a novel epigenetic-metabolic crosstalk in head and neck cancer, specifically investigating how the interplay between the histone crotonylation reader YEATS2 and the metabolic enzyme GCDH contributes to increased histone crotonylation and drives EMT in this context.

      Our initial investigations using GSEA on publicly available TCGA RNA-seq data revealed that HNC patients with high YEATS2 expression also exhibit elevated expression of genes involved in the lysine degradation pathway, prominently including GCDH. Recognizing the known roles of YEATS2 in preferentially binding H3K27cr[7] and GCDH in producing crotonyl-CoA, we hypothesized that the elevated H3K27cr levels observed in HNC are a consequence of the combined action of both YEATS2 and GCDH. We have provided evidence that increased nuclear GCDH correlates with higher H3K27cr abundance, likely due to an increased nuclear pool of crotonyl-CoA, and that YEATS2 contributes through its preferential maintenance of crotonylation marks by recruiting p300 (as detailed in Figure 5F-H and Figure 6J-L of the manuscript and elaborated in our response to point 7). Thus, our work highlights that both YEATS2 and GCDH are crucial for the regulation of histone crotonylation-mediated gene expression in HNC.

      To directly address the reviewer's query regarding YEATS2's influence on GCDH levels and nuclear histone crotonylation:

      • YEATS2 does not transcriptionally regulate GCDH: We did not find any evidence of YEATS2 directly regulating the expression levels of GCDH at the transcriptional level in HNC cells.

      • Novel finding: YEATS2 regulates GCDH nuclear localization: Crucially, we discovered that YEATS2 downregulation significantly reduces the nuclear pool of GCDH in head and neck cancer cells (Figure 7G). This is a novel mechanism suggesting that YEATS2 influences histone crotonylation not only by affecting promoter H3K27cr levels via p300 recruitment, but also by regulating the availability of the crotonyl-CoA producing enzyme, GCDH, within the nucleus.

      • Common upstream regulation by SP1: Interestingly, we found that both YEATS2 and GCDH expression are commonly regulated by the transcription factor SP1 in HNC. Our data demonstrate that SP1 binds to the promoters of both genes, and its downregulation leads to a decrease in their respective expressions (Figure 3 and Figure 7). This provides an important upstream regulatory link between these two key players.

      • Functional validation of GCDH in EMT: We further assessed the functional importance of GCDH in maintaining the EMT phenotype in HNC cells. Matrigel invasion assays after GCDH knockdown and overexpression in BICR10 cells revealed that the invasiveness of HNC cells was significantly reduced upon GCDH knockdown and significantly increased upon GCDH overexpression (results provided in revised manuscript Figure 7F and Figure 7—figure supplement 1F).

      These findings collectively demonstrate a multifaceted role for YEATS2 in regulating histone crotonylation by both direct recruitment of the writer p300 and by influencing the nuclear availability of the crotonyl-CoA producing enzyme GCDH. We acknowledge that the precise molecular mechanism governing YEATS2's effect on GCDH nuclear localization remains an exciting open question for future investigation, but our current data establishes a novel regulatory axis.

      (9) The authors should provide IHC analysis of YEATS2, SPARC alongside H3K27cr and GCDH staining in normal vs. tumor tissues from HNC patients.

      We thank the reviewer for their suggestion. We have performed IHC analysis for YEATS2, H3K27cr and GCDH in normal and tumor samples obtained from HNC patients.

      Reviewer #2 (Public review):

      Summary:

      The manuscript emphasises the increased invasive potential of histone reader YEATS2 in an SP1-dependent manner. They report that YEATS2 maintains high H3K27cr levels at the promoter of EMT-promoting gene SPARC. These findings assigned a novel functional implication of histone acylation, crotonylation.

      We thank the reviewer for the constructive comments. We are committed to making beneficial changes to the manuscript in order to alleviate the reviewer’s concerns.

      Concerns:

      (1) The patient cohort is very small with just 10 patients. To establish a significant result the cohort size should be increased.

      We thank the reviewer for this suggestion. We have increased the number of patient samples to assess the levels of YEATS2 (n=23 samples) and the results have been included in Figure 1G and Figure 1—figure supplement 1F.

      (2) Figure 4D compares H3K27Cr levels in tumor and normal tissue samples. Figure 1G shows overexpression of YEATS2 in a tumor as compared to normal samples. The loading control is missing in both. Loading control is essential to eliminate any disparity in protein concentration that is loaded.

      To address the reviewer’s concern, we have repeated the experiment and used H3 as a loading control as nuclear protein lysates from patient samples were used to check YEATS2 and H3K27cr levels.

      (3) Figure 4D only mentions 5 patient samples checked for the increased levels of crotonylation and hence forms the basis of their hypothesis (increased crotonylation in a tumor as compared to normal). The sample size should be more and patient details should be mentioned.

      As part of the revision, we have now checked the H3K27cr levels in a total of 23 patient samples and the results have been included in Figure 4D and Figure 4—figure supplement 1D. Patient details are provided in Supplementary Table 6.

      (4) YEATS2 maintains H3K27Cr levels at the SPARC promoter. The p300 is reported to be hyper-activated (hyperautoacetylated) in oral cancer. Probably, the activated p300 causes hyper-crotonylation, and other protein factors cause the functional translation of this modification. The authors need to clarify this with a suitable experiment.

      We thank the reviewer for this insightful comment regarding the functional relationship between YEATS2 and p300 in the context of H3K27cr, especially considering reports of p300 hyper-activation in oral cancer. We agree that a precise clarification of p300's role and its cooperativity with YEATS2 is crucial to fully understand the functional translation of this modification.

      As we have shown through global RNA-seq and ChIP-seq analyses, YEATS2 broadly affects gene expression by regulating histone crotonylation levels at gene promoters. We also recognize that the histone writer p300 is a promiscuous acyltransferase, known to add various non-acetyl marks, including crotonylation[4]. Our initial data, showing decreased p300 occupancy on the SPARC promoter upon YEATS2 downregulation (Figure 5F), suggested a strong dependency of p300 on YEATS2 for its recruitment. To fully delineate the molecular mechanism of this cooperativity and clarify how YEATS2 influences p300-mediated histone crotonylation and its functional outcomes, we have performed the following series of experiments, which have been integrated into the revised manuscript:

      (a) Validation of p300's role in SPARC expression: We performed p300 knockdown in BICR10 cells, followed by immunoblotting to assess SPARC protein levels. As expected, a significant decrease in SPARC protein levels was observed upon p300 knockdown (Figure 5G). This confirms p300's direct involvement in SPARC gene expression.

      (b) Direct interaction between YEATS2 and p300: To investigate a potential physical association, we performed co-immunoprecipitation assays to check for an interaction between endogenous YEATS2 and p300. Our results clearly demonstrate the presence of YEATS2 in the p300-immunoprecipitate sample, indicating that YEATS2 and p300 physically interact and likely function together as a complex to drive the expression of target genes like SPARC (Figure 5H). This direct interaction provides the mechanistic basis for how YEATS2 influences p300 occupancy.

      (c) Impact on transcriptional activity (Pol II recruitment): As suggested, we performed RNA Polymerase II (Pol II) ChIP-qPCR on the SPARC promoter in YEATS2 knockdown cells. We observed a significant decrease in Pol II occupancy on the SPARC promoter after YEATS2 knockdown in BICR10 cells (Figure 6C). This confirms that YEATS2 silencing leads to reduced transcriptional initiation/elongation at this promoter.

      (d) p300's direct role in H3K27cr on SPARC promoter: To confirm p300's specific role in crotonylation at this locus, we performed H3K27cr ChIP-qPCR after p300 knockdown. As anticipated, a significant decrease in H3K27cr enrichment was observed on the SPARC promoter upon p300 knockdown (Figure 6J), directly demonstrating p300's crotonyltransferase activity at this site.

      (e) Rescue of p300 occupancy and H3K27cr by YEATS2 overexpression in SP1-deficient cells: To further establish the YEATS2-p300 axis, we performed SP1 knockdown (which reduces YEATS2 expression) followed by ectopic YEATS2 overexpression, and then assessed p300 occupancy and H3K27cr levels on the SPARC promoter. While SP1 knockdown led to a decrease in both p300 and H3K27cr enrichment, we observed a significant rescue of both p300 occupancy and H3K27cr enrichment upon YEATS2 overexpression in the shSP1 cells (Figure 6K and L). This provides strong evidence that YEATS2 acts downstream of SP1 to regulate p300 recruitment and H3K27cr levels.

      Collectively, these comprehensive new results clearly establish that YEATS2 directly interacts with and assists in the recruitment of p300 to the SPARC promoter. This recruitment is crucial for p300's localized crotonyltransferase activity, leading to increased H3K27cr marks and subsequent activation of SPARC transcription. This clarifies the previously observed 'paradox' and defines a novel cooperative mechanism between a histone reader (YEATS2) and a writer (p300) in regulating histone crotonylation and gene expression.

      (5) I do not entirely agree with using GAPDH as a control in the western blot experiment since GAPDH has been reported to be overexpressed in oral cancer.

      We would like to clarify that GAPDH was not used as a loading control for protein expression comparisons between normal and tumor samples. GAPDH was used as a loading control only in experiments using head and neck cancer cell lines where shRNA-mediated knockdown or overexpression was employed. These manipulations specifically target the genes of interest and are not expected to alter GAPDH expression, making it a suitable loading control in these instances.

      (6) The expression of EMT markers has been checked in shControl and shYEATS2 transfected cell lines (Figure 2A). However, their expression should first be checked directly in the patients' normal vs. tumor samples.

      We thank the reviewer for the suggestion. We have now checked the expression of EMT marker Twist1 alongside YEATS2 expression in normal vs. tumor tissue samples using IHC (Figure 4E).

      (7) In Figure 3G, knockdown of SP1 led to the reduced expression of YEATS2 controlled gene Twist1. Ectopic expression of YEATS2 was able to rescue Twist1 partially. In order to establish that SP1 directly regulates YEATS2, SP1 should also be re-introduced upon the knockdown background along with YEATS2 for complete rescue of Twist1 expression.

      To address the reviewer’s concern regarding the partial rescue of Twist1 in SP1 depleted-YEATS2 overexpressed cells, we performed the experiment as suggested by the reviewer. We overexpressed both SP1 and YEATS2 in SP1-depleted cells and found that Twist1 depletion was almost completely rescued.

      Authors’ response image 2.

      Immunoblot depicting the decreased Twist1 levels on SP1 knockdown and its subsequent rescue of expression upon YEATS2 and SP1 overexpression in BICR10 (endogenous YEATS2 band indicated by *).

      (8) In Figure 7G, the expression of EMT genes should also be checked upon rescue of SPARC expression.

      We thank the reviewer for the suggestion. We have examined the expression of EMT marker Twist1 on YEATS2/GCDH rescue. On overexpressing both YEATS2 and GCDH in shSP1 cells, we found that the depleted expression of Twist1 was rescued.

      Authors’ response image 3.

      Immunoblot depicting the decreased Twist1 levels on SP1 knockdown and its subsequent rescue of expression upon dual overexpression of YEATS2 and GCDH in BICR10 (* indicates GFP-tagged YEATS2 probed using GFP antibody).

      Reviewer #1 (Recommendations for the authors):

      While the study offers insights into the specific role of this axis in regulating epithelial-to-mesenchymal transition (EMT) in HNC, its broader mechanistic novelty is limited by prior discoveries in other cancer types (https://doi.org/10.1038/s41586-023-06061-0). The manuscript would benefit from the inclusion of metastasis data, the role of key metabolic enzymes like ECHS1, the molecular mechanisms governing p300 and YEATS2 interactions, additional IHC data, negative control data in ChIP, and an explanation of discrepancies in certain figures.

      We thank the reviewer for their constructive suggestions. We have made extensive revisions to our manuscript to substantiate our findings. We have looked into the expression of ECHS1/GCDH in HNC tumor tissues using IHC, performed extensive experiments to validate the role of p300 in YEATS2-mediated histone crotonylation, and provided additional data supporting our findings wherever required. The revised figures have been provided in the updated version of the manuscript and also in the Authors’ response.

      Minor Comments:

      (1) The study begins with a few EMT markers, such as Vimentin, Twist, and N-Cadherin to validate the role of YEATS2 in promoting EMT. Including a broader panel of EMT markers would strengthen the conclusions about the effects of YEATS2 on EMT and invasion. Additionally, the rationale for selecting these EMT markers is not fully elaborated. Why were other well-known EMT players not included in the analysis?

      On performing RNA-seq with shControl and shYEATS2 samples, we discovered that TWIST1 expression decreased on YEATS2 downregulation. Twist1 was therefore investigated as a potential target of YEATS2 in HNC cells. N-Cadherin was chosen because it is known to be directly upregulated by Twist1[8]. Further, Vimentin was chosen as it is a well-known marker of the mesenchymal phenotype and is frequently used to indicate EMT in cancer cells[9].

      Authors’ response image 4.

      IGV plot showing the decrease in Twist1 expression in shControl vs. shYEATS2 RNA-seq data.

      Other than the EMT-markers used in our study, the following markers were amongst those that showed significant change in gene expression on YEATS2 downregulation.

      Authors’ response table 1.

      List of EMT-related genes that showed significant change in expression on YEATS2 knockdown in RNA-seq analysis.

      As depicted in the table above, the majority of the genes that showed downregulation on YEATS2 knockdown were mesenchymal markers, while epithelial-specific genes such as E-cadherin and Claudin-1 showed upregulation. These data signify the essential role of YEATS2 in driving EMT in head and neck cancer.

      (2) The authors use Ponceau staining, but the rationale behind this choice is unclear. Ponceau is typically used for transfer validation. For the same patient, western blot loading controls like Actin/GAPDH should be shown. Also, at various places throughout the manuscript, Ponceau staining has been used. These should also be replaced with Actin/GAPDH blots.

      Ponceau S staining is frequently used as an alternative to housekeeping proteins like GAPDH as a control for protein loading[10]. However, to address this issue, we have repeated the western blot and used H3 as a loading control, as nuclear protein lysates from patient samples were used to check YEATS2 and H3K27cr levels.

      For experiments (in Figures 5E, 6F, 6I, and 7H) where we assessed SPARC levels in conditioned media obtained from BICR10 cells (secretory fraction), Ponceau S staining was deliberately used as the loading control. In such extracellular protein analyses, traditional intracellular housekeeping proteins (like Actin or GAPDH) are not applicable. Ponceau S has also been used as a control for showing SPARC expression in the secretory fraction of mammalian cell lines in previous studies[11].

      (3) The manuscript briefly mentions that p300 was identified as the only protein with increased expression in tumours compared to normal tissue in the TCGA dataset. What other writers were checked for? Did the authors check for their levels in HNC patients?

      We thank the reviewer for this observation. As stated by previous studies[12,13], p300 and GCN5 are the histone writers that can act as crotonyltransferases at the H3K27 position. Although the crotonyltransferase activity of GCN5 has been demonstrated in yeast, it has not been confirmed in humans, whereas the histone crotonyltransferase activity of p300 has been validated in human cells using in vitro HCT assays[4,14]. Therefore, we chose to focus on p300 for further validation of its role in YEATS2-mediated regulation of histone crotonylation. We did not check the levels of p300 in HNC patient tissues. However, p300 showed higher expression in tumor as compared to normal in publicly available HNC TCGA RNA-seq data (Figure 5—figure supplement 1G).

      We acknowledge that the original statement in the manuscript, 'For this we looked at expression of the known writers of H3K27Cr mark in TCGA dataset, and discovered that p300 was the only protein that had increased expression in tumor vs. normal HNC dataset…', was indeed slightly misleading. Our intention was to convey that p300 is considered the major and most validated histone crotonyltransferase capable of influencing crotonylation at the H3K27 position in humans, and that its expression was notably increased in the HNC TCGA tumor dataset. We have now reframed this sentence in the revised manuscript to accurately reflect our findings and focus, as follows:

      'For this, we checked the expression of p300, a known writer of H3K27cr mark in humans, in the TCGA dataset. We found that p300 had increased expression in tumor vs. normal HNC dataset…'

      This revised wording more accurately reflects our specific focus on p300's established role and its observed upregulation in HNC.

      (4) Figure 6E, blot should be replaced. The results aren't clearly visible.

      We thank the reviewer for this observation. We have repeated the western blot and the Figure 6E (Figure 6F in the revised version of manuscript) has now been replaced with a cleaner blot.

      (5) Reference 9 and 19 are the same. Please rectify.

      We apologize for this inadvertent error. We have rectified this error in the updated version of the manuscript.

      References

      (1) Brabletz, T.; Kalluri, R.; Nieto, M. A.; Weinberg, R. A. EMT in Cancer. Nat Rev Cancer 2018, 18(2), 128–134. https://doi.org/10.1038/nrc.2017.118.

      (2) Pisani, P.; Airoldi, M.; Allais, A.; Aluffi Valletti, P.; Battista, M.; Benazzo, M.; Briatore, R.; Cacciola, S.; Cocuzza, S.; Colombo, A.; Conti, B.; Costanzo, A.; Della Vecchia, L.; Denaro, N.; Fantozzi, C.; Galizia, D.; Garzaro, M.; Genta, I.; Iasi, G. A.; Krengli, M.; Landolfo, V.; Lanza, G. V.; Magnano, M.; Mancuso, M.; Maroldi, R.; Masini, L.; Merlano, M. C.; Piemonte, M.; Pisani, S.; Prina-Mello, A.; Prioglio, L.; Rugiu, M. G.; Scasso, F.; Serra, A.; Valente, G.; Zannetti, M.; Zigliani, A. Metastatic Disease in Head & Neck Oncology. Acta Otorhinolaryngol Ital 2020, 40 (SUPPL. 1), S1–S86. https://doi.org/10.14639/0392-100X-suppl.1-40-2020.

      (3) Lin, J.; Zhang, P.; Liu, W.; Liu, G.; Zhang, J.; Yan, M.; Duan, Y.; Yang, N. A Positive Feedback Loop between ZEB2 and ACSL4 Regulates Lipid Metabolism to Promote Breast Cancer Metastasis. Elife 2023, 12, RP87510. https://doi.org/10.7554/eLife.87510.

      (4) Liu, X.; Wei, W.; Liu, Y.; Yang, X.; Wu, J.; Zhang, Y.; Zhang, Q.; Shi, T.; Du, J. X.; Zhao, Y.; Lei, M.; Zhou, J.-Q.; Li, J.; Wong, J. MOF as an Evolutionarily Conserved Histone Crotonyltransferase and Transcriptional Activation by Histone Acetyltransferase-Deficient and Crotonyltransferase-Competent CBP/P300. Cell Discov 2017, 3 (1), 17016. https://doi.org/10.1038/celldisc.2017.16.

      (5) Jiang, G.; Li, C.; Lu, M.; Lu, K.; Li, H. Protein Lysine Crotonylation: Past, Present, Perspective. Cell Death Dis 2021, 12 (7), 703. https://doi.org/10.1038/s41419-021-03987-z.

      (6) Yuan, H.; Wu, X.; Wu, Q.; Chatoff, A.; Megill, E.; Gao, J.; Huang, T.; Duan, T.; Yang, K.; Jin, C.; Yuan, F.; Wang, S.; Zhao, L.; Zinn, P. O.; Abdullah, K. G.; Zhao, Y.; Snyder, N. W.; Rich, J. N. Lysine Catabolism Reprograms Tumour Immunity through Histone Crotonylation. Nature 2023, 617 (7962), 818–826. https://doi.org/10.1038/s41586-023-06061-0.

      (7) Zhao, D.; Guan, H.; Zhao, S.; Mi, W.; Wen, H.; Li, Y.; Zhao, Y.; Allis, C. D.; Shi, X.; Li, H. YEATS2 Is a Selective Histone Crotonylation Reader. Cell Res 2016, 26 (5), 629–632. https://doi.org/10.1038/cr.2016.49.

      (8) Alexander, N. R.; Tran, N. L.; Rekapally, H.; Summers, C. E.; Glackin, C.; Heimark, R. L. N-Cadherin Gene Expression in Prostate Carcinoma Is Modulated by Integrin-Dependent Nuclear Translocation of Twist1. Cancer Res 2006, 66 (7), 3365–3369. https://doi.org/10.1158/0008-5472.CAN-05-3401.

      (9) Satelli, A.; Li, S. Vimentin in Cancer and Its Potential as a Molecular Target for Cancer Therapy. Cellular and Molecular Life Sciences 2011, 68 (18), 3033–3046. https://doi.org/10.1007/s00018-011-0735-1.

      (10) Romero-Calvo, I.; Ocón, B.; Martínez-Moya, P.; Suárez, M. D.; Zarzuelo, A.; Martínez-Augustin, O.; de Medina, F. S. Reversible Ponceau Staining as a Loading Control Alternative to Actin in Western Blots. Anal Biochem 2010, 401 (2), 318–320. https://doi.org/10.1016/j.ab.2010.02.036.

      (11) Ling, H.; Li, Y.; Peng, C.; Yang, S.; Seto, E. HDAC10 Inhibition Represses Melanoma Cell Growth and BRAF Inhibitor Resistance via Upregulating SPARC Expression. NAR Cancer 2024, 6 (2), zcae018. https://doi.org/10.1093/narcan/zcae018.

      (12) Gao, D.; Li, C.; Liu, S.-Y.; Xu, T.-T.; Lin, X.-T.; Tan, Y.-P.; Gao, F.-M.; Yi, L.-T.; Zhang, J. V.; Ma, J.-Y.; Meng, T.-G.; Yeung, W. S. B.; Liu, K.; Ou, X.-H.; Su, R.-B.; Sun, Q.-Y. P300 Regulates Histone Crotonylation and Preimplantation Embryo Development. Nat Commun 2024, 15 (1), 6418. https://doi.org/10.1038/s41467-024-50731-0.

      (13) Li, K.; Wang, Z. Histone Crotonylation-Centric Gene Regulation. Epigenetics Chromatin 2021, 14 (1), 10. https://doi.org/10.1186/s13072-021-00385-9.

      (14) Sabari, B. R.; Tang, Z.; Huang, H.; Yong-Gonzalez, V.; Molina, H.; Kong, H. E.; Dai, L.; Shimada, M.; Cross, J. R.; Zhao, Y.; Roeder, R. G.; Allis, C. D. Intracellular Crotonyl-CoA Stimulates Transcription through P300-Catalyzed Histone Crotonylation. Mol Cell 2015, 58 (2), 203–215. https://doi.org/10.1016/j.molcel.2015.02.029.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This study presents a valuable insight into a computational mechanism of pain perception. The evidence supporting the authors’ claims is solid, although the inclusion of 1) more diverse candidate computational models, 2) more systematic analysis of the temporal regularity effects on the model fit, and 3) tests on clinical samples would have strengthened the study. The work will be of interest to pain researchers working on computational models and cognitive mechanisms of pain in a Bayesian framework.

      Thank you very much again for considering the manuscript and judging it as a valuable contribution to understanding mechanisms of pain perception. We recognise the above-mentioned points of improvement and elaborate on them in the initial response to the reviewers.

      Response to the reviewers

      Reviewer 1:

      Reviewer Comment 1.1 — Selection of candidate computational models: While the paper juxtaposes the simple model-free RL model against a Kalman Filter model in the context of pain perception, the rationale behind this choice remains ambiguous. It prompts the question: could other RL-based models, such as model-based RL or hierarchical RL, offer additional insights? A more detailed explanation of their computational model selection would provide greater clarity and depth to the study.

      Initial reply: Thank you for this point. Our models were selected a-priori, following the modelling strategy from Jepma et al. (2018) and hence considered the same set of core models for clear extension of the analysis to our non-cue paradigm. The key question for us was whether expectations were used to weight the behavioural estimates, so our main interest was to compare expectation vs non-expectation weighted models.

      Model-based and hierarchical RL are very broad terms that can be used to refer to many different models, and we are not clear about which specific models the reviewer is referring to. Our Bayesian models are generative models, i.e. they learn the generative statistics of the environment (which is characterised by inherent stochasticity and volatility) and hence operate model-based analyses of the stimulus dynamics. In our case, this happened hierarchically and it was combined with a simple RL rule.
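
      To make the distinction between the model families concrete, the following is a minimal illustrative sketch and not the fitted models from the manuscript. It assumes a one-dimensional random-walk latent intensity; the function names (rl_update, kf_update, expectation_weighted_percept) and all parameter names are hypothetical. It shows a fixed-learning-rate delta rule, a Kalman filter step whose gain adapts to volatility and stochasticity, and an expectation-weighted percept that mixes the prior expectation with the incoming stimulus.

```python
def rl_update(expectation, observation, alpha):
    """Delta-rule (model-free RL) update with a fixed learning rate alpha."""
    return expectation + alpha * (observation - expectation)


def kf_update(mean, var, observation, process_var, obs_var):
    """One Kalman filter step for a 1D random-walk latent state.

    process_var plays the role of volatility (drift of the latent intensity);
    obs_var plays the role of stochasticity (noise around the latent intensity).
    """
    prior_var = var + process_var             # uncertainty grows with volatility
    gain = prior_var / (prior_var + obs_var)  # gain shrinks with stochasticity
    new_mean = mean + gain * (observation - mean)
    new_var = (1.0 - gain) * prior_var
    return new_mean, new_var, gain


def expectation_weighted_percept(expectation, observation, weight):
    """Percept as a weighted mix of prior expectation and sensory input.

    weight = 0 recovers a non-expectation-weighted model (percept = input).
    """
    return weight * expectation + (1.0 - weight) * observation
```

      In this sketch the only structural difference between the two learners is the learning rate: the delta rule keeps alpha constant, whereas the Kalman gain is recomputed on every trial from the current uncertainty.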

      Revised reply: We clarified our modelling choices in the “Modelling strategy” subsection of the results section.

      Reviewer Comment 1.2 — Effects of varying levels of volatility and stochasticity: The study commendably integrates varying levels of volatility and stochasticity into its experimental design. However, the depth of analysis concerning the effects of these variables on model fit appears shallow. A looming concern is whether the superior performance of the expectation-weighted Kalman Filter model might be a natural outcome of the experimental design. While the non-significant difference between eKF and eRL for the high stochasticity condition somewhat alleviates this concern, it raises another query: Would a more granular analysis of volatility and stochasticity effects reveal fine-grained model fit patterns?

      Initial reply: We are sorry that the reviewer finds shallow “the depth of analysis concerning the effects of these variables on model fit”. We are not sure which analysis the reviewer has in mind when suggesting a “more granular analysis of volatility and stochasticity effects” to “reveal fine-grained model fit patterns”. Therefore, we find it difficult to improve our manuscript in this regard. We are happy to add analyses to our paper but we would be grateful for some specific pointers. We have already provided:

      •    Analysis of model-naive performance across different levels of stochasticity and volatility (section 2.3, figure 3, supplementary information section 1.1 and tables S1-2)

      •    Model fitting for each stochasticity/volatility condition (section 2.4.1, figure 4, supplementary table S5)

      •    Group-level and individual-level differences of each model parameter across stochasticity/volatility conditions (supplementary information section 7, figures S4-S5).

      •    Effect of confidence on scaling factor for each stochasticity/volatility condition (figure 5)

      Reviewer Comment 1.3 — Rating instruction: According to Fig. 1A, participants were prompted to rate their responses to the question, “How much pain DID you just feel?” and to specify their confidence level regarding their pain. It is difficult for me to understand the meaning of confidence in this context, given that they were asked to report their *subjective* feelings. It might have been better to query participants about perceived stimulus intensity levels. This perspective is seemingly echoed in lines 100-101, “the primary aim of the experiment was to determine whether the expectations participants hold about the sequence inform their perceptual beliefs about the intensity of the stimuli.”

      Initial reply: Thank you for raising this question, which allows us to clarify our paradigm. On half of the trials, participants were asked to report the perceived intensity of the previous stimulus; on the remaining trials, participants were requested to predict the intensity of the next stimulus. Therefore, we did query “participants about perceived stimulus intensity levels”, as described at lines 49-55, 296-303, and depicted in figure 1.

      The confidence refers to the level of confidence that participants have regarding their rating - how sure they are. This is done in addition to their perceived stimulus intensity and it has been used in a large body of previous studies in any sensory modality.

      Reviewer Comment 1.4 — Relevance to clinical pain: While the authors underscore the relevance of their findings to chronic pain, they did not include data pertaining to clinical pain. Notably, their initial preprint seemed to encompass data from a clinical sample (https://www.medrxiv.org/content/10.1101/2023.03.23.23287656v1), which, for reasons unexplained, has been omitted in the current version. Clarification on this discrepancy would be instrumental in discerning the true relevance of the study’s findings to clinical pain scenarios.

      Initial reply: The preprint that the Reviewer is referring to was an older version of the manuscript in which we combined two different experiments, which were initially conceived as separate studies: the one that we submitted to eLife (done in the lab, with noxious stimuli in healthy participants) and an online study with a different statistical learning paradigm (without noxious stimuli, in chronic back pain participants). Unfortunately, the paradigms were different and not directly comparable. Indeed, following submission to a different journal, the manuscript was criticised for this reason. We therefore split the paper in two, and submitted the first study to eLife. We are now planning to perform the same lab-based experiment with noxious stimuli on chronic back pain participants. Progress on this front has been slowed down by the fact that I (Flavia Mancini) am on maternity leave, but it remains a top priority once I am back at work.

      Reviewer Comment 1.5 — Paper organization: The paper’s organization appears a little bit weird, possibly due to the removal of significant content from their initial preprint. Sections 2.1-2.2 and 2.4 seem more suitable for the Methods section, while 2.3 and 2.4.1 are the only parts that present results. In addition, enhancing clarity through graphical diagrams, especially for the experimental design and computational models, would be quite beneficial. A reference point could be Fig. 1 and Fig. 5 from Jepma et al. (2018), which similarly explored RL and KF models.

      Initial reply: Thank you for these suggestions. We will consider restructuring the paper in the revised version.

      Revised reply: We restructured the introduction, results and parts of the methods. We followed the reviewer’s suggestion regarding enhancing clarity through graphical diagrams. We have visualised the experimental design in Figure 1D. Furthermore, we have visualised the two main computational models (eRL and eKF) in Figure 2, following Jepma et al. (2018). As a result, we have updated the notation in Section 4.4 to be clearer and consistent with the graphical representation (renaming the variable that refers to the observed thermal input from Ot to Nt).

      Reviewer Comment 1.6 — In lines 99-100, the statement ”following the work by [23]” would be more helpful if it included a concise summary of the main concepts from the referenced work.

      - It would be helpful to have descriptions of the conditions that Figure 1C is elaborating on.

      - In line 364, the ”N_t” in the sentence ”The observation on trial t, N_t”, should be O_t.

      Initial reply: Thank you for spotting these and for providing the suggestions. We will include the correction in the revised version.

      Revised reply: We have added the following regarding the lines 99-100:

      ”We build on the work by [23], who show that pain perception is strongly influenced by expectations as defined by a cue that predicts high or low pain. In contrast to the cue-paradigm from [23], the primary aim of our experiment was to determine whether the expectations participants hold about the sequence itself inform their perceptual beliefs about the intensity of the stimuli.”

      See comment in the previous reply, regarding the notation change from Ot to Nt.

      Reviewer 2:

      Reviewer Comment 2.1 — This is a highly interesting and novel finding with potential implications for the understanding and treatment of chronic pain where pain regulation is deficient. The paradigm is clear, the analysis is state-of-the-art, the results are convincing, and the interpretation is adequate.

      Initial reply: Thank you very much for these positive comments.

      Reviewer 3:

      Summary:

      I am pleased to have had the opportunity to review this manuscript, which investigated the role of statistical learning in the modulation of pain perception. In short, the study showed that statistical aspects of temperature sequences, with respect to specific manipulations of stochasticity (i.e., randomness of a sequence) and volatility (i.e., speed at which a sequence unfolded) influenced pain perception. Computational modelling of perceptual variables (i.e., multi-dimensional ratings of perceived or predicted stimuli) indicated that models of perception weighted by expectations were the best explanation for the data. My comments below are not intended to undermine or question the quality of this research. Rather, they are offered with the intention of enhancing what is already a significant contribution to the pain neuroscience field. Below, I highlight the strengths and weaknesses of the manuscript and offer suggestions for incorporating additional methodological details.

      Strengths:

      The manuscript is articulate, coherent, and skilfully written, making it accessible and engaging.

      - The innovative stimulation paradigm enables the exploration of expectancy effects on perception without depending on external cues, lending a unique angle to the research.

      - By including participants’ ratings of both perceptual aspects and their confidence in what they perceived or predicted, the study provides an additional layer of information to the understanding of perceptual decision-making. This information was thoughtfully incorporated into the modelling, enabling the investigation of how confidence influences learning.

      - The computational modelling techniques utilised here are methodologically robust. I commend the authors for their attention to model and parameter recovery, a facet often neglected in previous computational neuroscience studies.

      - The well-chosen citations not only reflect a clear grasp of the current research landscape but also contribute thoughtfully to ongoing discussions within the field of pain neuroscience.

      Initial reply: We are really grateful for the reviewer’s insightful comments and for the useful guidance regarding our methodology. We are also thankful to the reviewer for highlighting the strengths of our manuscript. Below we respond to each weakness mentioned in the reviewer’s report.

      Reviewer Comment 3.1 — In Figure 1, panel C, the authors illustrate the stimulation intensity, perceived intensity, and prediction intensity on the same scale, facilitating a more direct comparison. It appears that the stimulation intensity has been mathematically transformed to fit a scale from 0 to 100, aligning it with the intensity ratings corresponding to either past or future stimuli. Given that the pain threshold is specifically marked at 50 on this scale, one could logically infer that all ratings falling below this value should be deemed non-painful. However, I find myself uncertain about this interpretation, especially in relation to the term ”arbitrary units” used in the figure. I would greatly appreciate clarification on how to accurately interpret these units, as well as an explanation of the relationship between these values and the definition of pain threshold in this experiment.

      Initial reply: Indeed, as detailed in the Methods section 4.3, the stimulation intensity was originally transformed from the 1-13 scale to 0-100 scale to match the scales in the participant response screens.

      Following the method used to establish the pain threshold, we set the stimulus intensity of 7 as the threshold on the original 1-13 scale. However, during the rating part of the experiment, several of the participants never or very rarely selected a value above 50 (their individually defined pain threshold), despite having indicated, during the pain threshold procedure, the point at which a stimulus becomes painful. As a result, both the re-scaled intensity values and the perception ratings, which share the same 0-100 scale of arbitrary units, never go above the pain threshold. Please see all participant ratings and inputs in the Figure below. We see that it would be more illustrative to re-plot Figure 1 with a different exemplary participant, whose ratings go above the pain threshold, perhaps with the input intensity on the 1-13 scale shown on an additional right-hand-side y-axis. We will add this in the revised version and highlight the fact above.
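
      The rescaling described above can be sketched as follows. This is a minimal illustration that assumes a simple linear (min-max) mapping, which is consistent with intensity 7 on the 1-13 scale landing on the threshold value of 50; the function name and defaults are hypothetical and are not taken from the manuscript.

```python
def rescale_intensity(level, old_min=1.0, old_max=13.0, new_min=0.0, new_max=100.0):
    """Linearly map a stimulus level from [old_min, old_max] to [new_min, new_max].

    Assumption: the transformation is a plain min-max rescaling.
    """
    if not old_min <= level <= old_max:
        raise ValueError("level lies outside the original scale")
    fraction = (level - old_min) / (old_max - old_min)
    return new_min + fraction * (new_max - new_min)


# Intensity 7 is the pain threshold on the original 1-13 scale.
PAIN_THRESHOLD_LEVEL = 7
print(rescale_intensity(PAIN_THRESHOLD_LEVEL))  # 50.0
```

      Under this assumption, the lowest and highest stimulus levels map to 0 and 100, and any level below 7 maps to a value below the threshold of 50.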

      Importantly, while values below 50 are deemed non-painful by participants, the thermal stimulation still activates C-fibres involved in nociception, and we would argue that the modelling framework and analysis still applies in this case.

      Revised reply: We re-plotted Figure 1E-F with a different exemplary participant, whose ratings go above the pain threshold. We also included all participants’ pain perception and prediction ratings, noxious input sequences and confidence ratings in the supplement in Figures S1-S3.

      Reviewer Comment 3.2 — The method of generating fluctuations in stimulation temperatures, along with the handling of perceptual uncertainty in modelling, requires further elucidation. The current models appear to presume that participants perceive each stimulus accurately, introducing noise only at the response stage. This assumption may fail to capture the inherent uncertainty in the perception of each stimulus intensity, especially when differences in consecutive temperatures are as minimal as 1°C.

      Initial reply: We agree with the reviewer that multiple sources of uncertainty are involved in rating the intensity of thermal stimuli, including perceptual uncertainty. To account for inaccurate perception, one would have to consider the different sources that contribute to it, of which there may be many. Our approach considers one, which is captured in the expectation-weighted models and most clearly exemplified in the expectation-weighted Kalman filter model (eKF). The model treats participants’ perception of the input as an imperfect indicator of the true level of pain. In this case, perception is corrupted by the expectation participants hold about the upcoming stimuli. The extent of this effect is partly governed by a subjective level of noise ϵ, which may also subsume other sources of uncertainty beyond the expectation effect. Moreover, the response noise ξ could also subsume any other unexplained sources of noise.

      Author response image 1.

      Stimulus intensity transformation

      Revised reply: We clarified our modelling choices in the ”2.2 Modelling strategy” subsection.
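
      To illustrate the general idea of perception being pulled towards an expectation, the sketch below shows one update step of a generic, textbook scalar Kalman filter. It is not the authors’ eKF specification: the random-walk state assumption, the function name and all parameter names are hypothetical, and `obs_noise_var` only plays a role loosely analogous to the subjective noise ϵ mentioned above.

```python
def kalman_step(mean, var, observation, obs_noise_var, process_var):
    """One update of a scalar Kalman filter with a random-walk state.

    mean, var      : current expectation about the stimulus and its uncertainty
    observation    : noisy sensory input on this trial
    obs_noise_var  : assumed variance of the sensory input (hypothetical name)
    process_var    : assumed trial-to-trial drift variance (hypothetical name)
    """
    # Prediction: the expectation carries over, its uncertainty grows.
    prior_mean = mean
    prior_var = var + process_var
    # Gain: weight given to the input relative to the expectation.
    gain = prior_var / (prior_var + obs_noise_var)
    # The updated estimate is a weighted average of expectation and input.
    posterior_mean = prior_mean + gain * (observation - prior_mean)
    posterior_var = (1.0 - gain) * prior_var
    return posterior_mean, posterior_var, gain


# With equal uncertainty in expectation and input, the estimate lands halfway.
print(kalman_step(50.0, 100.0, 70.0, 100.0, 0.0))  # (60.0, 50.0, 0.5)
```

      In this generic form, a noisier input lowers the gain, so the estimate stays closer to the prior expectation.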

      Reviewer Comment 3.3 — A key conclusion drawn is that eKF is a better model than eRL. However, a closer examination of the results reveals that the two models behave very similarly, and it is not clear that they can be readily distinguished based on model recovery and model comparison results.

      Initial reply: While the eKF appears to rank higher than the eRL in terms of LOOIC and sigma effects, we do not wish to make sweeping statements regarding the significance of differences between eRL and eKF, but merely point to the trend in the data. We shall make this clearer in the revised version of the manuscript. However, the most important result is that the models involving expectation weighting arguably capture the data better.

      Revised reply: We elaborated on the significance statements in the ”Modelling Results” subsection:

      • We considered at least a 2 sigma effect as indication of a significant difference. In each condition, the expectation weighted models (eKF and eRL) provided better fit than models without this element (KF and RL; approx. 2-4 sigma difference, as reported in Figure 5A-D). This suggests that regardless of the levels of volatility and stochasticity, participants still weigh perception of the stimuli with their expectation.

      and in the first paragraph of the Discussion:

      • When varying different levels of inherent uncertainty in the sequences of stimuli (stochasticity and volatility), the expectation and confidence weighted models fitted the data better than models weighted for confidence but not for expectations (Figure 5A-D). The expectation-weighted bayesian (KF) model offered a better fit than the expectation-weighted, model-free RL model, although in conditions of high stochasticity this difference was short of significance. Overall, this suggests that participants’ expectations play a significant role in the perception of sequences of noxious stimuli.

      We are aware of the limitations and the lack of clear guidance on using sigma effects to establish significance (as per the reviewer’s suggestion: https://discourse.mc-stan.org/t/loo-comparison-in-reference-to-standard-error/4009). Here we decided to use the above-mentioned threshold of 2 sigma as an indication of significance, but note the potential limitations of the inferences, especially when distinguishing between the eRL and eKF models.
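
      The 2-sigma criterion discussed above can be sketched as a ratio between the difference in expected log predictive density (ELPD) and the standard error of that difference. The function names and the example numbers are hypothetical; only the threshold of 2 comes from the reply.

```python
def sigma_effect(elpd_diff, se_diff):
    """Size of an ELPD difference in units of its standard error."""
    if se_diff <= 0:
        raise ValueError("standard error must be positive")
    return abs(elpd_diff) / se_diff


def exceeds_threshold(elpd_diff, se_diff, threshold=2.0):
    """True when the difference reaches at least `threshold` standard errors."""
    return sigma_effect(elpd_diff, se_diff) >= threshold


# Hypothetical values: a difference of -12 with a standard error of 4.
print(sigma_effect(-12.0, 4.0))       # 3.0
print(exceeds_threshold(-12.0, 4.0))  # True
```

      Because there is no agreed-upon cut-off, the threshold is left as a parameter rather than fixed in the code.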

      Reviewer Comment 3.4 — Regarding model recovery, the distinction between the eKF and eRL models seems blurred. When the simulation is based on the eKF, there is no ability to distinguish whether either eKF or eRL is better. When the simulation is based on the eRL, the eRL appears to be the best model, but the difference with eKF is small. This raises a few more questions. What is the range of the parameters used for the simulations?

      Initial reply: We agree that the distinction between eKF and eRL in the model recovery is not that clean-cut, which may in turn point to the similarity between the two models. To simulate the data for the model and parameter recovery analysis, we used the group means and variances estimated on the participant data to sample individual parameter values.

      Reviewer Comment 3.5 — Is it possible that either eRL or eKF are best when different parameters are simulated? Additionally, increasing the number of simulations to at least 100 could provide more convincing model recovery results.

      Initial reply: This is possible, but it would require further investigation and a comparison of fits for different bins/ranges of parameters to see whether one model has a consistent advantage over the other in each bin. We will consider adding this analysis, and will provide an additional 50 simulations to paint a more convincing picture.

      Revised reply: We increased the number of simulations per model pair to ≈ 100 (after rejecting fits based on diagnostics criteria - E-BFMI and divergent transitions) and updated the confusion matrix (Table S4). Although the confusion between eRL and eKF remains, the model recovery shows good distinction between expectation weighted vs non-expectation weighted (and Random) models, which supports our main conclusion in the paper.
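
      A model recovery confusion matrix of the kind referred to above can be tabulated from pairs of (simulated model, best-fitting model). The sketch below is a generic illustration with made-up pairs; the function name and the example data are hypothetical and do not reproduce Table S4.

```python
from collections import Counter


def recovery_confusion(pairs, models):
    """Row-normalised confusion matrix for model recovery.

    pairs  : iterable of (simulated_model, best_fitting_model) tuples
    models : ordered list of model names
    Returns a nested dict: matrix[simulated][fitted] = proportion of simulations.
    """
    counts = Counter(pairs)
    matrix = {}
    for simulated in models:
        total = sum(counts[(simulated, fitted)] for fitted in models)
        matrix[simulated] = {
            fitted: (counts[(simulated, fitted)] / total if total else 0.0)
            for fitted in models
        }
    return matrix


# Made-up example: eKF simulations are split between eKF and eRL.
example_pairs = [
    ("eKF", "eKF"), ("eKF", "eRL"),
    ("eRL", "eRL"), ("eRL", "eRL"),
    ("KF", "KF"),
]
print(recovery_confusion(example_pairs, ["eKF", "eRL", "KF"]))
```

      Off-diagonal mass between two models, as between eKF and eRL in this toy example, indicates that they are hard to tell apart.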

      Reviewer Comment 3.6 — Regarding model comparison, the authors reported that ”the expectation-weighted KF model offered a better fit than the eRL, although in conditions of high stochasticity, this difference was short of significance against the eRL model.” This interpretation is based on a significance test that hinges on the ratio between the ELPD and the surrounding standard error (SE). Unfortunately, there’s no agreed-upon threshold of SEs that determines significance, but a general guideline is to consider ”several SEs,” with a higher number typically viewed as more robust. However, the text lacks clarity regarding the specific number of SEs applied in this test. At a cursory glance, it appears that the authors may have employed 2 SEs in their interpretation, while only depicting 1 SE in Figure 4.

      Initial reply: Indeed, we considered a 2 sigma effect as a threshold. However, we recognise that there is no agreed-upon threshold, and in the revision we shall make this clearer, along with our interpretation of the trend in the data.

      Revised reply: We clarify this further, as per our revised response to Comment 3.3 above. We have also added the following statement in section 4.5.1 (Methods, Model comparison): ”There’s no agreed-upon threshold of SEs that determines significance, but the higher the sigma difference, the more robust is the effect.”

      Reviewer Comment 3.7 — With respect to parameter recovery, a few additional details could be included for completeness. Specifically, while the range of the learning rate is understandably confined between 0 and 1, the range of other simulated parameters, particularly those without clear boundaries, remains ambiguous. Including scatter plots with the simulated parameters on the x-axis and the recovered parameters on the y-axis would effectively convey this missing information.

      Furthermore, it would be beneficial for the authors to clarify whether the same priors were used for both the modelling results presented in the main paper and the parameter recovery presented in the supplementary material.

      Initial reply: Thanks for this comment and for the suggestions. To simulate the data for the model and parameter recovery analysis, we used the group means and variances estimated on the participant data to sample individual parameter values. The priors on the group and individual-level parameters in the recovery analysis were the same as in the fitting procedure. We will include the requested scatter plots in the next iteration of the manuscript.

      Revised reply: We included parameter recovery scatter plots for each model and parameter in the Supplement Figures S7-S11.

      Reviewer Comment 3.8 — While the reliance on R-hat values for convergence in model fitting is standard, a more comprehensive assessment could include estimates of the effective sample size (bulk ESS and/or tail ESS) and the Estimated Bayesian Fraction of Missing Information (EBFMI), to show efficient sampling across the distribution. Consideration of divergences, if any, would further enhance the reliability of the results.

      Initial reply: Thank you very much for this suggestion, we will aim to include these measures in the revised version.

      Revised reply: We have considered the suggested diagnostics and include bulk and tail ESS values for each condition, model and parameter in the Supplement Tables S6-S9. We also report the number of chains with low E-BFMI (0), the number of divergent transitions (0) and the E-BFMI values per chain in Table S10.
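
      For readers unfamiliar with E-BFMI, the sketch below computes it per chain from the sampler’s energy trace, using the standard estimator (sum of squared successive energy differences divided by the sum of squared deviations from the mean energy). The function names are hypothetical, and the warning level of 0.3 is an assumption based on common practice rather than a value stated in the reply.

```python
def e_bfmi(energy):
    """Estimated Bayesian fraction of missing information for one chain."""
    n = len(energy)
    if n < 2:
        raise ValueError("need at least two energy values")
    mean_energy = sum(energy) / n
    numerator = sum((energy[i] - energy[i - 1]) ** 2 for i in range(1, n))
    denominator = sum((e - mean_energy) ** 2 for e in energy)
    if denominator == 0:
        raise ValueError("energy trace is constant")
    return numerator / denominator


def low_bfmi_chains(energies_per_chain, threshold=0.3):
    """Indices of chains whose E-BFMI falls below `threshold` (assumed cut-off)."""
    return [i for i, energy in enumerate(energies_per_chain)
            if e_bfmi(energy) < threshold]


# Toy traces: a rapidly mixing chain and a slowly drifting one.
chains = [[1.0, 2.0, 1.0, 2.0], [float(i) for i in range(10)]]
print(low_bfmi_chains(chains))  # [1]
```

      A slowly drifting energy trace yields small successive differences relative to its overall spread, which is what a low E-BFMI flags.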

      Reviewer Comment 3.9 — The authors write: ”Going beyond conditioning paradigms based in cuing of pain outcomes, our findings offer a more accurate description of endogenous pain regulation.” Unfortunately, this statement isn’t substantiated by the results. The authors did not engage in a direct comparison between conditioning and sequence-based paradigms. Moreover, even if such a comparison had been made, it remains unclear what would constitute the gold standard for quantifying ”endogenous pain regulation.”

      Initial reply: This is a valid point. Indeed, we do not compare paradigms in our study, and we will remove this statement in the revised version.

      Revised reply: We have removed this statement from the revised version.

      Reviewer Comment 3.10 — In relation to the comment on model comparison in my public review, I believe the following link may provide further insight and clarify the basis for my observation. It discusses the use of standard error in model comparison and may be useful for the authors in addressing this particular point: https://discourse.mc-stan.org/t/loo-comparison-in-reference-to-standard-error/4009

      Initial reply: Thank you for this suggestion, we will consider the forum discussion in our manuscript.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This solid study investigates the transdifferentiation of chicken embryonic fibroblasts into muscle and fat cells in 3D to create whole-cut meat mimics. The study is important and provides a method to control muscle, fat, and collagen content within the 3D meat mimics and thus provides a new avenue for customized cultured meat production. Limitations of this study include the use of transgene for transdifferentiation and thus the creation of GMO food.

      We are grateful for the substantial effort that editors and reviewers put into assessing our manuscript and providing insightful feedback. We have tried to address, as much as possible, all comments and criticisms. We believe that we have now a significantly improved manuscript. Below, there is a point-by-point response.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The authors presented here a novel 3D fibroblast culture and transdifferentiation approach for potential meat production with GelMA hydrogel.

      Strengths:

      (1) Reduced serum concentration for 3D chicken fibroblast culture and transdifferentiation is optimized.

      (2) Efficient myogenic transdifferentiation and lipogenesis as well as controlled fat deposition are achieved in the 3D GelMA.

      Weaknesses:

      (1) While the authors stated the rationale of using fibroblasts instead of myogenic/adipogenic stem cells for meat production, the authors did not comment on the drawbacks/disadvantages of genetic engineering (e.g., forced expression of MyoD) in meat production.

      We thank the reviewer for raising this important issue. We have now described this drawback in the Discussion.

      As a proof-of-concept study, we sought to explore the potential of transdifferentiation by using integrating transgene tools to overexpress a transdifferentiation factor and achieve maximum muscle production. However, it is important to acknowledge that genetically modified meat products derived from genetically engineered cultured cells are unlikely to gain consumer acceptance or be viable on the market. We are currently testing non-integrating delivery methods, such as modRNAs and chemical cocktails, to induce myogenic transdifferentiation in fibroblasts. We believe these non-integrating methods would be compatible with meat production and consumer acceptance.

      Please see lines 439-445.

      “As a proof-of-concept, we utilized the transgene method to achieve maximum myogenic induction and the final products still retain the foreign transgene fragment in the cells’ genome. It is therefore posing a risk of genetic modified food which is not suitable for mass production. In the next step, other non-transgenic means such as non-integrating vectors, chemical reprogramming, modified RNAs, and recombinant transgene removal techniques will be explored to develop transgene-free end products.”

      (2) While the authors cited one paper to state the properties and applications of GelMA hydrogel in tissue engineering and food processing, concerns/examples of the food safety with GelMA hydrogel are not discussed thoroughly.

      Thank you for pointing out this issue. We now discuss the drawbacks of using GelMA hydrogel in meat production in the main text.

      GelMA-based hydrogels have shown great potential owing to their biocompatibility and mechanical tunability. GelMA is widely used in 3D cell culture and tissue engineering for regenerative medicine, but is less common in food processing and agricultural applications. Its photo-crosslinking properties, biocompatibility and degradability allow it to be shaped into complex tissue structures by 3D printing or modelling. Many researchers have also used GelMA hydrogel as a scaffold for cultured meat production (Jeong et al., 2022; Li et al., 2021; Park et al., 2023). Later research will carefully consider GelMA hydrogel as well as other types of scaffold biomaterials for cost-effective and food-safety-compliant cultured meat production (Bomkamp et al., 2022).

      Bomkamp, C., Skaalure, S. C., Fernando, G. F., Ben‐Arye, T., Swartz, E. W., & Specht, E. A. (2022). Scaffolding biomaterials for 3D cultivated meat: prospects and challenges. Advanced Science (Weinh), 9(3), 2102908.

      Jeong, D., Seo, J. W., Lee, H. G., Jung, W. K., Park, Y. H., & Bae, H. (2022). Efficient Myogenic/Adipogenic Transdifferentiation of Bovine Fibroblasts in a 3D Bioprinting System for Steak-Type Cultured Meat Production. Advanced Science (Weinh), 9(31), e2202877.

      Li, Y., Liu, W., Li, S., Zhang, M., Yang, F., & Wang, S. (2021). Porcine skeletal muscle tissue fabrication for cultured meat production using three-dimensional bioprinting technology. Journal of Future Foods, 1(1), 88-97.

      Park, S., Hong, Y., Park, S., Kim, W., Gwon, Y., Jang, K.-J., & Kim, J. (2023). Designing Highly Aligned Cultured Meat with Nanopatterns-Assisted Bio-Printed Fat Scaffolds. Journal of Biosystems Engineering, 48(4), 503-511.

      We discussed the drawbacks of GelMA hydrogel. Please see lines 445-457.

      “Another food safety concern in this study is the use of GelMA hydrogel for culture meat production. Due to its excellent biocompatibility and mechanical flexibility, GelMA-based hydrogel has demonstrated significant potential in scalable 3D cell culture for creating artificial tissue ranging in sizes from millimeters to centimeters. It is widely used in 3D cell culture and tissue engineering for regenerative medicine, but less common in food processing and agricultural applications. Due to its special photo-crosslinking properties, biocompatibility and degradability, it allows this material to be shaped into complex tissue structures by 3D printing or modelling. Many researchers have also used GelMA hydrogel as a scaffold for culture meat production (Jeong et al., 2022; Li et al., 2021; Park et al., 2023). Later research will carefully consider hydrogel as well as other types of scaffold biomaterials for cost-effective and food-safety compliant culture meat production (Bomkamp et al., 2022). ”

      (3) In Fig. 4C, there seems no significant difference in the Vimentin expression between Fibroblast_MyoD and Myofibroblast. The conclusion of "greatly reduced in the myogenic transdifferentiated cells" is overstated.

      Thanks for pointing out this mistake.

      We revised the wording accordingly. Vimentin expression was reduced in Fibroblast_MyoD cells compared with the original fibroblasts.

      Please see lines 231-233.

      “The fibroblast intermediate filament Vimentin (Tarbit et al., 2019) was abundantly expressed in the fibroblasts but reduced in the myogenic transdifferentiated cells (Figure 4C)”

      (4) The presented cell culture platform is only applied to chicken fibroblasts and should be tested in other species such as pigs and fish.

      Thank you for the suggestion.

      In this pilot cultured meat study, we utilized chicken embryonic fibroblasts. These specific cells were chosen for their near-immortal nature and robustness in culture, as well as the inducible myogenic capacity. In our previous experiments (Ren et al, Cell Reports, 2022, 40:111206), we have tested the myogenic transdifferentiation potential of fibroblasts from mice, pigs, and chickens, and observed varying efficiencies of myogenesis. It is important to note that fibroblast cells derived from different species, or even different tissues within the same species, would exhibit significant variations in their capacities for myogenic and adipogenic transdifferentiation.

      In this proof-of-concept study, we used only one source of fibroblasts to test cultured meat production and confirmed that myogenic/adipogenic transdifferentiation can be manipulated as a feasible means to precisely control muscle, fat and collagen content. We would expect fibroblasts of different origins to display different transdifferentiation efficiencies and thus produce various muscle/fat ratios in meat mimics. That is beyond the scope of the current study.

      Furthermore, we are also testing myogenic/adipogenic transdifferentiation of fibroblasts from pigs through non-genomic integration approaches. We believe only the non-transgene tools are viable solutions for culture meat production in the future. We added the species information in the discussion part.

      See lines 515-517.

      “This approach can be readily extrapolated to other species such as pigs and presents promising avenues for the large-scale production of customized and versatile meat products that may cater to varying consumer preferences.”

      Reviewer #2 (Public Review):

      The manuscript by Ma et al. tries to develop a protocol for cell-based meat production using chicken fibroblasts as three-dimensional (3D) muscle tissues with fat accumulation. The authors used genetically modified fibroblasts which can be forced to differentiate into muscle cells and formulated 3D tissues with these cells and a biphasic material (hydrogel). The degrees of muscle differentiation and lipid deposition in culture were determined by immunohistochemical, biochemical, and molecular biological evaluations. Notably, the protocol successfully achieved the process of myogenic and lipogenic stimulation in the 3D tissues.

      Overall, the study is reasonably designed and performed including adequate analysis. The manuscript is clearly written with well-supported figures. While it presents valuable results in the field of cultivated meat science and skeletal muscle biology, some critical concerns were identified. First, it is unclear whether some technical approaches were really the best choice for cell-based meat production. Next, more careful evaluations and justifications would be required to properly explain biological events in the results. These points include additional evaluations and considerations with regard to myocyte alignment and lipid accumulation in the differentiated 3D tissues. The present data are very suggestive in general, but further clarifications and arguments would properly support the findings and conclusions.

      Thanks for the reviewer’s comments. We have performed additional experiments and analysis to address the critical questions. We also revised the text extensively to clarify or discuss some of the concerns, such as the cell alignment and cellular distribution of intramuscular fat issues. We expect the revised data and text could adequately support the conclusions of the manuscript.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) In Figure 1, the authors used 1% chicken serum. Have the authors tested other lower concentrations? It will be interesting to see the lowest chicken serum concentrations in fibroblast culture and transdifferentiation;

      Thank you for your suggestion.

      Yes, we have tested lower serum concentrations, such as 1% FBS and 0.5% chicken serum. However, the cells were not healthy at these low serum levels, as shown by abnormal cell morphology and almost no cell growth. Please see the revised Supplementary Figure S1D, to which we added the 1% FBS and 0.5% chicken serum data. Hence, 1% chicken serum is optimal in our hands. We will also test specialized serum-free media in future experiments.

      (2) In Figure 2, the authors should quantify the fold expansion of fibroblasts cultured in 3D gel after 1, 3, 5, and 9 days since this data is important for future meat manufacturing. In addition, long-term expansion (e.g., 1 month) in 3D gel should also be shown;

      Thanks for the question. We have quantified cell growth in 3D by measuring the PKH26-stained cells. After the cells were implanted into the gel, they propagated exponentially from day 1 to day 9. The cell proliferation data provide a good reference for future meat manufacturing (Figure 2D). We attempted long-term expansion in 3D but were unable to measure cell proliferation, because the 3D gel always collapsed after 12-15 days in culture. The reason is unknown: either the cells grow too densely and compromise the gel structure, or the gel matrix itself is not strong enough to last long term. We believe the cells will grow well in the long term if given enough 3D attachment surface, since they grow indefinitely in 2D. We will test different 3D matrices in the future.

      Please see the revised Figure 2D for the quantification of cells.
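      The fold-expansion quantification discussed above can be illustrated with a minimal sketch. It assumes that a stained-cell count is available for each time point and that growth is exponential between the chosen days; the function names and the counts below are hypothetical placeholders, not data or code from the study.

```python
import math


def fold_expansion(counts, baseline_day):
    """Return the cell number on each day relative to the baseline day."""
    base = counts[baseline_day]
    return {day: n / base for day, n in sorted(counts.items())}


def doubling_time(counts, start_day, end_day):
    """Estimate the doubling time in days, assuming exponential growth."""
    doublings = math.log2(counts[end_day] / counts[start_day])
    return (end_day - start_day) / doublings


# Placeholder counts (day -> stained-cell count); NOT data from the study.
example_counts = {1: 100, 3: 200, 5: 400, 9: 1600}

print(fold_expansion(example_counts, baseline_day=1))
print(doubling_time(example_counts, start_day=1, end_day=9))
```

      Reporting fold expansion relative to day 1 keeps the numbers comparable across gels seeded at different densities, which is the quantity the reviewer asked for.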

      (3) In Figure 3, please also show MyoD staining as it'll be interesting to see the expression of exogenous and endogenous MyoD expression after dox treatment. In Figure G, the hydrogel meat seems very small, please show/discuss the maximum size of hydrogel meat that may be achieved using this approach;

      Thank you for requesting this information. We performed immunostaining with anti-MyoD and anti-Flag antibodies to show the expression of total MyoD (exogenous and endogenous) and of exogenous MyoD alone after dox treatment. MyoD and 3xFlag were fused in-frame in the transgene plasmid; thus, anti-Flag staining indicates exogenous MyoD expression, whereas anti-MyoD staining indicates the combined expression of exogenous and endogenous MyoD.

      As shown in Figure S4, almost 100% of cells were positive for MyoD staining, 60% of which expressed Flag. These data are consistent with our previous results (Ren et al., 2022, Cell Reports).

      Author response image 1.

      As for the size of the hydrogel-based cultured meat, we discussed the possibility of scalable production of hydrogel-based whole-cut meat mimics. Please see lines 446-449. “Due to its excellent biocompatibility and mechanical flexibility, GelMA-based hydrogel has demonstrated significant potential in scalable 3D cell culture for creating artificial tissue ranging in sizes from millimeters to centimeters.”

      (4) In Figure 5 and Supplementary Figure 6, please quantify the Oil-red O+ fat cells in the 2D and 3D lipogenic induction. Also in Fig. 6B, quantify the oil-red+MHC+ cells;

      Thank you for this advice. We have quantified the Oil Red O-stained images in the Results section “Stimulate the fat deposition in chicken fibroblasts in 3D” using the ImageJ analysis software, and the quantification of the Oil Red O area was added to the corresponding graphs (Figure 5C, Figure S6C and S6F).

      However, due to the unique structure of the 3D matrix, many MHC+ and Oil Red O+ double-positive cells overlap with each other across different Z-stack layers in 3D. This overlap makes it challenging to accurately position and quantify the double-positive cells as the different layers interfere with each other.

      (5) In Figure 7, please show immunostaining images of collagen and other major ECMs;

      Thank you for this question. We tried to stain collagen networks with Picrosirius Red but failed. Instead, we used laminin immunostaining to confirm that the ECM content in the 3D matrix increases steadily during cell culture.

      Please see Figure 7C. Lines 346-348.

      “the laminin protein content was accumulated and increased steadily during 3D culturation (Figure 7C)”

      (6) In Figure 8, please show hierarchical clustering analysis of whole transcriptomes of 3D_fibroblasts, 3D_MyoD, 3D+FI, and 3D_MyoD+FI. A Venn Diagram showing the overlap and distinct gene expression among these groups is also appreciated.

      Thank you for the suggestion.

      We added a hierarchical clustering analysis of the whole transcriptomes of 3D_fibroblasts, 3D_MyoD, 3D+FI, and 3D_MyoD+FI, using Euclidean distance with the ward.D clustering method. The groups formed two large clusters: 3D+FI clustered separately, whereas 3D_fibroblasts, 3D_MyoD, and 3D_MyoD+FI were more similar to one another. Please see Figure 8B.
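      As an illustration of the clustering idea described above, the following is a minimal pure-Python sketch of agglomerative clustering under Ward's minimum-variance criterion on Euclidean distances. It is not the authors' pipeline: the `ward.D` option they name is documented in R's `hclust` as distinct from `ward.D2`, which is the option that implements Ward's criterion, so results may differ from theirs. The expression profiles are invented placeholders, chosen only so that the toy output mirrors the two-cluster pattern reported in the response, and all function names are hypothetical.

```python
from itertools import combinations


def centroid(rows):
    """Mean expression vector of a list of profiles."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def ward_cost(a, b):
    """Increase in within-cluster variance if clusters a and b were merged."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward_cluster(profiles, n_clusters):
    """Merge samples greedily by lowest Ward cost until n_clusters remain."""
    clusters = [[name] for name in profiles]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: ward_cost(
                [profiles[s] for s in clusters[p[0]]],
                [profiles[s] for s in clusters[p[1]]],
            ),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]


# Placeholder expression profiles (three made-up genes); NOT data from the study.
example_profiles = {
    "3D_fibroblasts": [1.0, 1.0, 1.0],
    "3D_MyoD": [1.2, 0.9, 1.1],
    "3D_MyoD+FI": [1.5, 1.4, 1.2],
    "3D+FI": [5.0, 4.8, 5.2],
}

print(ward_cluster(example_profiles, n_clusters=2))
```

      This brute-force version re-evaluates every pair at each step, which is fine for a handful of samples; a real analysis would use an established clustering library.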

      As the reviewer suggested, we also compared the transcriptomes of 3D_MyoD, 3D+FI, and 3D_MyoD+FI with the original 3D_fibroblasts to identify differentially expressed genes (DEGs), and then analyzed the overlapping and distinct DEGs. As shown in the Venn diagram in Figure 8D, the majority of DEGs from 3D_MyoD+FI (3D_MyoD+FI versus 3D_fibroblasts) overlap with those of 3D_MyoD and 3D+FI, indicating that 3D_MyoD+FI is compatible with both myogenic and adipogenic functions.

      Please see the revised Figure 8.
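      The DEG overlap analysis behind a Venn diagram of this kind reduces to set operations. The sketch below assumes each comparison against 3D_fibroblasts has already produced a set of DEG identifiers; the gene names and the helper `overlap_summary` are hypothetical placeholders, not results from the study.

```python
def overlap_summary(deg_sets, focus):
    """Summarize how the DEGs of one group overlap with the other groups."""
    focus_set = deg_sets[focus]
    others = {name: genes for name, genes in deg_sets.items() if name != focus}
    shared_any = focus_set & set().union(*others.values())
    return {
        "shared_with": {
            name: sorted(focus_set & genes) for name, genes in others.items()
        },
        "unique": sorted(focus_set - shared_any),
        "shared_fraction": len(shared_any) / len(focus_set),
    }


# Placeholder DEG sets (each versus 3D_fibroblasts); NOT data from the study.
example_degs = {
    "3D_MyoD": {"gene_A", "gene_B", "gene_C"},
    "3D+FI": {"gene_C", "gene_D", "gene_E"},
    "3D_MyoD+FI": {"gene_B", "gene_C", "gene_D", "gene_F"},
}

summary = overlap_summary(example_degs, focus="3D_MyoD+FI")
print(summary)
```

      A `shared_fraction` close to 1 would correspond to the statement that most DEGs of the combined condition overlap with the single-treatment conditions.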

      Reviewer #2 (Recommendations For The Authors):

      In this study, the authors demonstrated a new approach for cultivated meat production using chicken fibroblasts. Specifically, the cells were cultured as 3D and induced muscle differentiation and lipid deposition. The manuscript contains a good set of data, which would be valuable to researchers in the fields of both cell-based meat and skeletal muscle biology. From the aspect of cultivated meat science, the rationale behind the idea is understandable, but it remains unclear whether the proposed approach was really the best choice to achieve their final goal. On the other hand, when we read this manuscript as a paper in skeletal muscle biology, the overall approach was not innovative enough and several uncertain issues remain. The authors should add more sufficient justifications, arguments, and discussions.

      (1) When considering their goal to produce edible meat products, the current approach has some concerns. First, there are issues with the approach used for the induction of myogenesis by MyoD transgene. This makes the end products GMO foods, which are not easily acceptable to a wide range of consumers. Next, the hydrogel was used for 3D tissue formation, but it is unclear whether this matrix type is edible, safe, and bio-comparable for cell-based meat production. The authors already discussed these points by excusing that the current work remains proof-of-concept. However, more careful considerations and justifications would be required.

      Thank you for the suggestion.

      We acknowledge that the current transgene-based myogenic induction method is not suitable for mass production of cultured meat because of GMO food concerns. We used the MyoD transgene for myogenic transdifferentiation in the first place because of its ease of genetic manipulation and maximum efficiency. We are currently testing tools that do not integrate into the genome, such as chemical cocktails and modified RNAs, for myogenic transdifferentiation.

      When it comes to the applications of hydrogel in the food industry, certain types of hybrid hydrogels, such as those made from pectin or sodium polyacrylate, are not only edible but also safe for consumption. While GelMA hydrogel is typically utilized in tissue engineering and subsequent implantation in patients for therapeutic regenerative medicine purposes, it has not been commonly employed in food processing. In this study, we cultivated cells within GelMA hydrogel due to its durability and ease of use in cell culture. Moving forward, we plan to investigate alternative types of matrices to develop cultured meat suitable for food applications.

      We have now described the GMO and hydrogel drawbacks in the discussion part. Please see lines 439-457.

      “As a proof-of-concept, we utilized the transgene method to achieve maximum myogenic induction and the final products still retain the foreign transgene fragment in the cells’ genome. It is therefore posing a risk of genetic modified food which is not suitable for mass production. In the next step, other non-transgenic means such as non-integrating vectors, chemical reprogramming, modified RNAs, and recombinant transgene removal techniques will be explored to develop transgene-free end products. Another food safety concern in this study is the use of GelMA hydrogel for culture meat production. Due to its excellent biocompatibility and mechanical flexibility, GelMA-based hydrogel has demonstrated significant potential in scalable 3D cell culture for creating artificial tissue ranging in sizes from millimeters to centimeters. It is widely used in 3D cell culture and tissue engineering for regenerative medicine, but less common in food processing and agricultural applications. Due to its special photo-crosslinking properties, biocompatibility and degradability, it allows this material to be shaped into complex tissue structures by 3D printing or modelling. Many researchers have also used GelMA hydrogel as a scaffold for culture meat production (Jeong et al., 2022; Li et al., 2021; Park et al., 2023). Later research will carefully consider hydrogel as well as other types of scaffold biomaterials for cost-effective and food-safety compliant culture meat production (Bomkamp et al., 2022). ”

      (2) From the view of skeletal muscle biology, the approaches (MyoD overexpression, hydrogel-based 3D tissue formation, and lipogenic induction) have already been tested.

      Thank you for the insightful comments from the perspective of skeletal muscle cell biology. We fully agree that the current approaches, including MyoD overexpression, 3D cell culture, and lipogenic induction, are routine experiments in muscle cell biology. However, we want to highlight that using these classical and robust muscle cell approaches, combined with the unique advantages of fibroblast cells (easily accessible, immortalized, cost-effective, ...), provides a novel and practical avenue for cultured meat production. We address these issues in the Discussion section of the revised manuscript.

      Please see lines 511-515.

      “In conclusion, we have effectively utilized immortalized chicken fibroblasts in conjunction with classical myogenic/adipogenic transdifferentiation approaches within 3D hydrogel to establish a cultured meat model. This model allows for the precise regulation of the synthesis of key components found in conventional meat, including muscle, fat, and ECM.”

      (3) The common emphasis in this manuscript is to use the advantages of 3D culture for tissue differentiation. As the authors described, skeletal muscle is a highly aligned tissue. In this study, some results successfully demonstrated advantages in terms of myocyte alignment, maturation, and lipid deposition. However, the current results cannot address whether the entire 3D tissues maintained these advantageous characteristics or not. Because the method for 3D formation does not have any additional modifications to make the cells aligned, like micropatterning, scaffolding, or bioprinting.

      Thank you for the suggestion.

      We agree with the reviewer that skeletal muscle tissues are composed of well-organized, directional bundles of fibers, and that cell alignment would greatly affect meat tenderness and sensory properties. Alignment of the cells in the cultured meat matrix is therefore a desirable attribute. However, achieving this alignment would require sophisticated biomaterial engineering, mainly scaffold manipulation, which is beyond the scope of this study. The hydrogel used in this study formed pores of different sizes in random directions, so we would expect the embedded cells to be entirely non-directional. Nevertheless, we found localized cell alignment in some parts of the gel matrix, which confirms cell-cell interactions; please see Figure 3D. We describe this feature in the Results section. In the future, we will test whether applying physical or electrical stimulation to the matrix can better align the muscle cells throughout the whole matrix.

      Please see lines 186-190.

      “The separate XY axis views of the orthogonal projections at different depths (Figure 3D) and a multi-angle video (Supplementary Video 2) also showed the several myotubes were aligned together. Nevertheless, many myotubes were oriented in different directions, preventing the entire matrix from aligning in one direction.”

      (4) In the skeletal muscle, fat accumulation mainly occurs in adipocytes between myocytes. This means that "intra-" muscular fat deposition is identified. However, lipid deposition within myocytes also occurred in this preparation (Supplementary Figure 7C). This situation is not "intra-" muscular accumulation, which sounds different from what is going on in normal skeletal muscle tissues. Please explain what happened and what biological situations accounted for this. Also, the authors should clarify better how lipogenesis was induced in the 3D tissues, such as cell types (transdifferentiated myocytes, remained/un-transdifferentiated fibroblasts, or both).

      Thank you for the very insightful question. We have revised the corresponding text to further explain the intramuscular fat distribution in different cell types in culture meat.

      We fully agree with the reviewer that intramuscular fat accumulation may occur mainly in intramuscular adipocytes. However, under some pathological and physiological conditions in humans and animals, lipid droplets are also abundantly observed inside myofibers (intramyocellular lipids within the myofiber cytoplasm). For instance, high intramyocellular lipid content has been found in insulin-resistant patients and, paradoxically, in endurance-trained athletes (doi.org/10.1016/j.tem.2012.05.009), as well as in some farm animals under intensive selective breeding (doi:10.2174/1876142910901010059). In the current study, Oil Red O staining of lipid droplets identified lipid deposition in both the transdifferentiated myocytes and the remaining un-transdifferentiated fibroblasts in the cultured meat. This lipid distribution pattern is comparable to the intramuscular fat storage pattern observed in some humans and animals, in which fat accumulates both in myofibers (intramyocellular lipids) and in intramuscular adipocytes (extramyocellular lipids), which reside within the muscle tissue bundle but between myofibers. We reason that the current adipogenic induction treatment caused lipogenesis in both the MyoD-transdifferentiated cells and the un-transdifferentiated fibroblasts. It is difficult to compare the absolute amount of lipids between these two cell types by Oil Red O staining, and it is almost impossible to separate them from the 3D meat mimics. Thus, we can only confirm that lipid deposition occurs in both transdifferentiated myocytes and un-transdifferentiated fibroblasts, without knowing which is the dominant contributor to the intramuscular fat content of the cultured meat.

      Please see lines 486-492.

      “In this study, the deposition of fat in the myotubes/myofibers facilitated the storage of significant lipid quantities in transdifferentiated muscle cells, known as intramyocellular lipids. Additionally, we observed Oil Red O staining in the remaining un-transdifferentiated fibroblasts, resembling cells of intramuscular adipocytes (extramyocellular lipids) found within muscle tissue. Hence, current adipogenic induction treatment caused lipogenesis in both the MyoD-transdifferentiated cells and un-transdifferentiated fibroblasts.”

    1. Author Response

      The following is the authors’ response to the current reviews.

      Overall Response

      We thank the reviewers for reviewing our manuscript, recognizing the significance of our study, and offering valuable suggestions. Based on the reviewers’ comments and the updated eLife assessment, we would like to choose the current version of our manuscript as the Version of Record.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Given knowledge of the amino acid sequence and of some version of the 3D structure of two monomers that are expected to form a complex, the authors investigate whether it is possible to accurately predict which residues will be in contact in the 3D structure of the expected complex. To this effect, they train a deep learning model which takes as inputs the geometric structures of the individual monomers, per-residue features (PSSMs) extracted from MSAs for each monomer, and rich representations of the amino acid sequences computed with the pre-trained protein language models ESM-1b, MSA Transformer, and ESM-IF. Predicting inter-protein contacts in complexes is an important problem. Multimer variants of AlphaFold, such as AlphaFold-Multimer, are the current state of the art for full protein complex structure prediction, and if the three-dimensional structure of a complex can be accurately predicted then the inter-protein contacts can also be accurately determined. By contrast, the method presented here seeks state-of-the-art performance among models that have been trained end-to-end for inter-protein contact prediction.

      Strengths:

      The paper is carefully written and the method is very well detailed. The model works both for homodimers and heterodimers. The ablation studies convincingly demonstrate that the chosen model architecture is appropriate for the task. Various comparisons suggest that PLMGraph-Inter performs substantially better, given the same input, than DeepHomo, GLINTER, CDPred, DeepHomo2, and DRN-1D2D_Inter.

      The authors control for some degree of redundancy between their training and test sets, both using sequence and structural similarity criteria. This is more careful than can be said of most works in the field of PPI prediction.

      As a byproduct of the analysis, a potentially useful heuristic criterion for acceptable contact prediction quality is found by the authors: namely, to have at least 50% precision in the prediction of the top 50 contacts.

      We thank the reviewer for recognizing the strengths of our work!
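      The heuristic criterion mentioned above, at least 50% precision among the top 50 predicted contacts, can be sketched as follows. This is a minimal illustration that assumes predictions are given as a mapping from residue pairs to scores and native contacts as a set of residue pairs; the function names and the toy values are hypothetical and do not come from PLMGraph-Inter.

```python
def top_k_precision(scores, native_contacts, k=50):
    """Fraction of the k highest-scoring residue pairs that are native contacts."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    if not ranked:
        return 0.0
    hits = sum(1 for pair in ranked if pair in native_contacts)
    return hits / len(ranked)


def is_acceptable(scores, native_contacts, k=50, threshold=0.5):
    """Apply the heuristic: top-k precision of at least the threshold."""
    return top_k_precision(scores, native_contacts, k) >= threshold


# Toy example with made-up scores; pairs are (residue in chain A, residue in chain B).
example_scores = {(0, 0): 0.9, (0, 1): 0.8, (1, 0): 0.7, (1, 1): 0.1}
example_native = {(0, 0), (1, 0)}

print(top_k_precision(example_scores, example_native, k=2))
print(is_acceptable(example_scores, example_native, k=2))
```

      When fewer than k pairs are scored, this sketch divides by the number of available pairs; dividing by k instead is an equally defensible convention and should be stated when reporting results.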

      Weaknesses:

      The authors check for performance drops when the test set is restricted to pairs of interacting proteins such that the chain pair is not similar as a pair (in sequence or structure) to a pair present in the training set. A more challenging test would be to restrict the test set to pairs of interacting proteins such that none of the chains are separately similar to monomers present in the training set. In the case of structural similarity (TM-scores), this would amount to replacing the two "min"s with "max"s in Eq. (4). In the case of sequence similarity, one would simply require that no monomer in the test set is in any MMSeqs2 cluster observed in the training set. This may be an important check to make, because a protein may interact with several partners, and/or may use the same sites for several distinct interactions, contributing to residual data leakage in the test set.

      We thank the reviewer for the suggestion! For protein-protein interaction prediction (“0D prediction”) or protein-protein interfacial residue prediction (“1D prediction”), we agree that it is necessary to ensure that none of the chains in the test set is separately similar to monomers in the training set because, as the reviewer pointed out, a protein may interact with several partners and may even use the same sites for these interactions. However, the task of this study is predicting inter-protein residue-residue contacts (“2D prediction”): even if a protein uses the same site to interact with different partners, the inter-protein contact maps will differ as long as the interacting partners are different. Therefore, we do not think this restriction on the test set is necessary for our task.
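      The difference between the two redundancy criteria debated above can be made concrete with a short sketch. It assumes that every chain has already been assigned a sequence-cluster label (for example by a clustering tool such as MMseqs2) and that a pair counts as redundant at the pair level when both of its clusters match a training pair; the function names, chain identifiers, and cluster labels are hypothetical, and the exact criterion used in the manuscript may differ.

```python
def cluster_pair(pair, cluster_of):
    """Order-independent pair of cluster labels for a chain pair."""
    a, b = pair
    return tuple(sorted((cluster_of[a], cluster_of[b])))


def filter_pair_level(test_pairs, train_pairs, cluster_of):
    """Keep test pairs whose cluster pair does not occur in the training set."""
    seen = {cluster_pair(p, cluster_of) for p in train_pairs}
    return [p for p in test_pairs if cluster_pair(p, cluster_of) not in seen]


def filter_monomer_level(test_pairs, train_pairs, cluster_of):
    """Keep test pairs in which neither chain shares a cluster with any training chain."""
    seen = {cluster_of[chain] for p in train_pairs for chain in p}
    return [p for p in test_pairs if all(cluster_of[c] not in seen for c in p)]


# Placeholder chains and cluster labels; NOT taken from the actual datasets.
example_clusters = {
    "A1": "c1", "B1": "c2",   # training pair
    "A2": "c1", "C2": "c3",   # test pair sharing one monomer cluster with training
    "D3": "c4", "E3": "c5",   # test pair with no shared clusters
}
example_train = [("A1", "B1")]
example_test = [("A2", "C2"), ("D3", "E3")]

print(filter_pair_level(example_test, example_train, example_clusters))
print(filter_monomer_level(example_test, example_train, example_clusters))
```

      In this toy case the pair-level filter keeps both test pairs, while the stricter monomer-level filter suggested by the reviewer keeps only the pair with no shared clusters.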

      The training set of AFM with v2 weights has a global cutoff of 30 April 2018, while that of PLMGraph-Inter has a cutoff of March 7 2022. So there may be structures in the test set for PLMGraph-Inter that are not in the training set of AFM with v2 weights (released between May 2018 and March 2022). The "Benchmark 2" dataset from the AFM paper may have a few additional structures not in the training or test set for PLMGraph-Inter. I realize there may be only few structures that are in neither training set, but still think that showing the comparison between PLMGraph-Inter and AFM there would be important, even if no statistically significant conclusions can be drawn.

      We thank the reviewer for the suggestion! A date cutoff alone is not enough to remove the redundancy, since similar structures can be deposited in the PDB on different dates. Because AFM does not release the PDB codes of its training set, it is difficult for us to remove the redundancy completely. Therefore, we think no rigorous conclusion can be drawn by including these comparisons in the manuscript. Besides, the main point of this study is to demonstrate that integrating multiple protein language models using protein geometric graphs can dramatically improve model performance for inter-protein contact prediction, which can provide important insights for the future development of more powerful protein complex structure prediction methods beyond AFM, rather than to provide a tool that can beat AFM at this moment. We think including too much material in the comparison with AFM may distract readers. Therefore, we chose not to include these comparisons in the manuscript.

      Finally, the inclusion of AFM confidence scores is very good. A user would likely trust AFM predictions when the confidence score is high, but look for alternative predictions when it is low. The authors' analysis (Figure 6, panels c and d) seems to suggest that, in the case of heterodimers, when AFM has low confidence, PLMGraph-Inter improves precision by (only) about 3% on average. By comparison, the reported gains in the "DockQ-failed" and "precision-failed" bins are based on knowledge of the ground truth final structure, and thus are not actionable in a real use-case.

      We agree with the reviewer that more studies are needed to provide a model that can complement or even beat AFM. The main point of this study is to demonstrate that integrating multiple protein language models using protein geometric graphs can dramatically improve model performance for inter-protein contact prediction, which can provide important insights for the future development of more powerful protein complex structure prediction methods beyond AFM.

      Reviewer #2 (Public Review):

      This work introduces PLMGraph-Inter, a new deep learning approach for predicting inter-protein contacts, which is crucial for understanding protein-protein interactions. Despite advancements in this field, especially driven by AlphaFold, prediction accuracy and efficiency in terms of computational cost still remain areas for improvement. PLMGraph-Inter utilizes invariant geometric graphs to integrate the features from multiple protein language models into the structural information of each subunit. When compared against other inter-protein contact prediction methods, PLMGraph-Inter shows better performance, which indicates that utilizing both sequence embeddings and structural embeddings is important to achieve high-accuracy predictions with relatively smaller computational costs for the model training.

      We thank the reviewer for recognizing the strengths of our work!

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      • I recommend renaming the section "Further potential redundancies removal between the training and the test" to "Further potential redundancies removal between the training and the test sets"

      Changed.

      • In lines 768-769, the sentence seems to end prematurely in "to use more stringent threshold in the redundancy removal"

      Corrected.

      • In Eq. (4), line 789, there are many instances of dashes that look like minus signs, creating some confusion.

      Corrected.

      • I think I may have mixed up figure references in my first review. When I said (Recommendations to the authors): "p. 22, line 2: from the figure, I would have guessed "greater than or equal to 0.7", not 0.8", I think I was referring to what is now lines 423-424, referring to what is now Figure 5c. The point stands there, I think.

      Corrected.

      • A couple of new grammatical mishaps have been introduced in the revision. These could be rectified.

      We carefully rechecked our revisions, and corrected the grammatical issues we found.

      Reviewer #2 (Recommendations For The Authors):

      Most of my concerns were resolved through the revision. I have only one suggestion for the main figure.

      The current scatter plots in Figure 2 are hard to understand as too many different methods are abstracted into a single plot with multiple colors. I would suggest comparing their performances using box plot or violin plot for the figure 2.

      We thank the reviewer for the suggestion! In the revision, we tried a violin plot, but it does not look good because too many different methods are included in the plot. Besides, we chose the scatter plot because it provides much more detail. We also provide the individual head-to-head scatter plots as supplementary figures, which we think can help readers capture the information in the figures.


      The following is the authors’ response to the original reviews.

      Overall Response

      We would like to thank the reviewers for reviewing our manuscript, recognizing the significance of our study, and offering valuable suggestions. We have carefully revised the manuscript to address all the concerns and suggestions raised by the reviewers.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Given knowledge of the amino acid sequence and of some version of the 3D structure of two monomers that are expected to form a complex, the authors investigate whether it is possible to accurately predict which residues will be in contact in the 3D structure of the expected complex. To this effect, they train a deep learning model that takes as inputs the geometric structures of the individual monomers, per-residue features (PSSMs) extracted from MSAs for each monomer, and rich representations of the amino acid sequences computed with the pre-trained protein language models ESM-1b, MSA Transformer, and ESM-IF. Predicting inter-protein contacts in complexes is an important problem. Multimer variants of AlphaFold, such as AlphaFold-Multimer, are the current state of the art for full protein complex structure prediction, and if the three-dimensional structure of a complex can be accurately predicted then the inter-protein contacts can also be accurately determined. By contrast, the method presented here seeks state-of-the-art performance among models that have been trained end-to-end for inter-protein contact prediction.

      Strengths:

      The paper is carefully written and the method is very well detailed. The model works both for homodimers and heterodimers. The ablation studies convincingly demonstrate that the chosen model architecture is appropriate for the task. Various comparisons suggest that PLMGraph-Inter performs substantially better, given the same input, than DeepHomo, GLINTER, CDPred, DeepHomo2, and DRN-1D2D_Inter. As a byproduct of the analysis, a potentially useful heuristic criterion for acceptable contact prediction quality is found by the authors: namely, to have at least 50% precision in the prediction of the top 50 contacts.

      We thank the reviewer for recognizing the strengths of our work!

      Weaknesses:

      My biggest issue with this work is the evaluations made using bound monomer structures as inputs, coming from the very complexes to be predicted. Conformational changes in protein-protein association are the key element of the binding mechanism and are challenging to predict. While the GLINTER paper (Xie & Xu, 2022) is guilty of the same sin, the authors of CDPred (Guo et al., 2022) correctly only report test results obtained using predicted unbound tertiary structures as inputs to their model. Test results using experimental monomer structures in bound states can hide important limitations in the model, and thus say very little about the realistic use cases in which only the unbound structures (experimental or predicted) are available. I therefore strongly suggest reducing the importance given to the results obtained using bound structures and emphasizing instead those obtained using predicted monomer structures as inputs.

      We thank the reviewer for the suggestion! In the revision, to emphasize the performance of PLMGraph-Inter using the predicted monomer structures, we moved the evaluation results based on the predicted monomer structures from the supplementary material to the main text (see the new Table 1 and Figure 2 in the revised manuscript) and reorganized the two subsections “Evaluation of PLMGraph-Inter on HomoPDB and HeteroPDB test sets” and “Impact of the monomeric structure quality on contact prediction” in the main text.

      In particular, the most relevant comparison with AlphaFold-Multimer (AFM) is given in Figure S2, not Figure 6. Unfortunately, it substantially shrinks the proportion of structures for which AFM fails while PLMGraph-Inter performs decently. Still, it would be interesting to investigate why this occurs. One possibility would be that the predicted monomer structures are of bad quality there, and PLMGraph-Inter may be able to rely on a signal from its language model features instead. Finally, AFM multimer confidence values ("iptm + ptm") should be provided, especially in the cases in which AFM struggles.

      We thank the reviewer for the suggestion! It is worth noting that AFM automatically searches for monomer templates during prediction. When we checked our AFM runs, we found that for 99% of the targets in our study (including all the targets in the four datasets: HomoPDB, HeteroPDB, DHTest and DB5.5) at least 20 templates were identified (AFM employed the top 20 templates in the prediction), and 87.8% of the targets employed the native templates (lines 455-462 on page 25, in the subsection “Comparison of PLMGraph-Inter with AlphaFold-Multimer”). Therefore, we think Figure 6, not Figure S5 (the original Figure S2), shows the fairer comparison. It is also worth noting that the targets used in this study would have a large overlap with the training set of AlphaFold-Multimer, since AFM used all protein complex structures deposited in the PDB before 2018-04-30 for model training, which would further lead to an overestimation of the performance of AFM (lines 450-455 on pages 24-25, in the subsection “Comparison of PLMGraph-Inter with AlphaFold-Multimer”).

      To mimic the performance of AlphaFold2 in real practice and produce predicted monomeric structures of more diverse quality, we used only the MSA searched from the UniRef100 protein sequence database as the input to AlphaFold2 and disabled the use of templates (lines 203-210 on page 12, in the subsection “Evaluation of PLMGraph-Inter on HomoPDB and HeteroPDB test sets”). Since some of the predicted monomer structures are of poor quality, it is reasonable that the performance of PLMGraph-Inter drops when the predicted monomeric structures are used in the prediction. We provided a detailed analysis of the impact of the monomeric structure quality on the prediction performance in the subsection “Impact of the monomeric structure quality on contact prediction” in the main text.

      We provided an analysis of the AFM multimer confidence values (“iptm + ptm”) in the revision (Figure 6, Figure S5, and lines 495-501 on page 27, in the subsection “Comparison of PLMGraph-Inter with AlphaFold-Multimer”).

      Besides, in cases where any experimental structures - bound or unbound - are available and given to PLMGraph-Inter as inputs, they should also be provided to AlphaFold-Multimer (AFM) as templates. Withholding these from AFM only makes the comparison artificially unfair. Hence, a new test should be run using AFM templates, and a new version of Figure 6 should be produced. Additionally, AFM's mean precision, at least for top-50 contact prediction, should be reported so it can be compared with PLMGraph-Inter's.

      We thank the reviewer for the suggestion, and we are sorry for the confusion! In the AFM runs to predict protein complex structures, we used the default setting of AFM, which automatically searches for monomer templates during prediction. When we checked our AFM runs, we found that 99% of the targets in our study (including all the targets in the four datasets: HomoPDB, HeteroPDB, DHTest and DB5.5) employed at least 20 templates in their predictions (AFM only used the top 20 templates), and 87.8% of the targets employed the native template. We further clarified this in the revision (lines 455-462 on page 25, in the subsection “Comparison of PLMGraph-Inter with AlphaFold-Multimer”). We also included the mean precisions of AFM (top-50 contact prediction) in the revision (Table S5 and lines 483-484 on page 26, in the subsection “Comparison of PLMGraph-Inter with AlphaFold-Multimer”).

      It's a shame that many of the structures used in the comparison with AFM are actually in the AFM v2 training set. If there are any outside the AFM v2 training set and, ideally, not sequence- or structure-homologous to anything in the AFM v2 training set, they should be discussed and reported on separately. In addition, why not test on structures from the "Benchmark 2" or "Recent-PDB-Multimers" datasets used in the AFM paper?

      We thank the reviewer for the suggestion! The biggest challenge in objectively evaluating AFM is that, as far as we know, AFM has not released the PDB IDs of its training set or of the “Recent-PDB-Multimers” dataset. “Benchmark 2” only includes 17 heterodimer proteins, and the number would be further decreased after removing targets redundant to our training set. We think it is difficult to draw conclusions from such a small number of targets.

      It is also worth noting that the AFM v2 weights have now been outdated for a while, and better v3 weights now exist, with a training cutoff of 2021-09-30.

      Author response image 1.

      The head-to-head comparison of qualities of complex predicted by AlphaFold-Multimer (2.2.0) and AlphaFold-Multimer (2.3.2) for each target PPI.

      We thank the reviewer for reminding us of the new version of AFM. The only difference between AFM V3 and V2 is the cutoff date of the training set. During the revision, we also tested the new version of AFM on the HomoPDB and HeteroPDB datasets, but we found that the performance difference between the two versions of AFM is very small (see the figure above, not shown in the main text). One reason might be that some targets in HomoPDB and HeteroPDB are redundant with the training sets of both versions of AFM. Since our test sets would have more overlap with the training set of AFM V3, we kept using the AFM V2 weights in this study.

      Another weakness in the evaluation framework: because PLMGraph-Inter uses structural inputs, it is not sufficient to make its test set non-redundant in sequence to its training set. It must also be non-redundant in structure. The Benchmark 2 dataset mentioned above is an example of a test set constructed by removing structures with homologous templates in the AF2 training set. Something similar should be done here.

      We thank the reviewer for the suggestion! In the revision, we explored the performance of PLMGraph-Inter when using different thresholds of fold similarity scores of interacting monomers to further remove potential redundancies between the training and test sets (i.e., redundancy in structure) (line 353-386 in page 19-21 in the subsection “Ablation study”; line 762-797 in page 41-43 in the subsection “Further potential redundancies removal between the training and the test”). We found that for heteromeric PPIs (targets in HeteroPDB), the further removal of potential redundancy in structure has little impact on the model performance (~3%, when TM-score 0.5 is used as the threshold). However, for homomeric PPIs (targets in HomoPDB), the further removal of potential redundancy in structure significantly reduces the model performance (~18%, when TM-score 0.5 is used as the threshold) (see Table 2). One possible reason for this phenomenon is that the binding mode of a homomeric PPI is largely determined by the fold of its monomer, so the model does not generalize well to targets whose folds were never seen during training.

      Whether a deep learning model can generalize well to targets with novel folds is a very interesting and important question. We thank the reviewer for pointing this out! However, to the best of our knowledge, this question has rarely been addressed by previous studies, including AFM. For example, the Benchmark 2 dataset was prepared by ClusPro TBM (bioRxiv 2021.09.07.459290; Proteins 2020, 88:1082-1090), which identifies templates with a sequence-based approach (HHsearch) rather than a structure-based one. Therefore, we don’t think this dataset is non-redundant in structure.

      Finally, the performance of DRN-1D2D for top-50 precision reported in Table 1 suggests to me that, in an ablation study, language model features alone would yield better performance than geometric features alone. So, I am puzzled why model "a" in the ablation is a "geometry-only" model and not a "LM-only" one.

      Using the protein geometric graph to integrate multiple protein language models is the main idea of PLMGraph-Inter. Compared with our previous work (DRN-1D2D_Inter), we consider the building of the geometric graph to be one major contribution of this work. To emphasize the efficacy of this geometric graph, we chose to use the “geometry-only” model as the base model.

      Reviewer #1 (Recommendations For The Authors):

      Some sections of the paper use technical terminology which limits accessibility to a broad audience. An obvious example is in the section "Results > Overview of PLMGraph-Inter > The residual network module": the average eLife reader is not a machine learning expert and might not be familiar with a "convolution with kernel size of 1 * 1". In general, the "Overview of PLMGraph-Inter" is a bit heavy with technical details, and I suggest moving many of these to Methods. This overview section can still be there but it should be shorter and written using less technical language.

      We thank the reviewer for the suggestion! We moved some technical details to the Methods section in the revision (line 184-185 in page 11; line 729-735 in page 39).

      List of typos and minor issues (page number according to merged PDF):

      • p. 3. line -3: remove "to"

      Corrected (line 36, page 3)

      • p. 5, line 7: "GINTER" should be "GLINTER"

      Corrected (line 64, page 5)

      • p. 6, line -4: "Given structures" -> "Given the structures"

      Corrected (line 95, page 6)

      • p. 6, line -2: "with which encoded"... ?

      We rephrased this sentence in revision. (line 97, page 6)

      • p. 9, line 1: "principal" -> "principle"

      Corrected (line 142, page 9)

      • p. 13, line 1: "has" -> "but have"

      Corrected (line 231, page 13)

      • p. 14, lines 6-7: "As can be seen from the figure that the predicted" -> "As can be seen from the figure, the predicted"

      We rephrased this paragraph, and the sentence was deleted in the revision (line 257-259 in page 15).

      • p. 18, line 1: the "five models" are presumably models a-e? If so, say "of models a-e"

      Corrected (line 310, page 17)

      • p. 22, line 2: from the figure, I would have guessed "greater than or equal to 0.7", not 0.8

      Based on Figure 3C, we think 0.8 is a more appropriate cutoff, since the precision drops significantly when the DTM-score is within 0.7~0.8.

      • p. 23, lines 2-3: "worth to making" -> "worth making"

      Corrected (line 443, page 24)

      • p. 24, line -5: "predict" -> "predicted"

      Corrected (line 484, page 26)

      • p 28, line -5: Please clarify what you mean by "We doubt": are you saying that you don't think these rearrangements exist in nature? If not, then reword.

      Corrected (line 566, page 30)

      • Figure 2, panel c, "DCPred" in the legend should be "CDPred"

      Corrected

      • Figures 3 and 5: Please improve the y-axis title in panel C. "Percent" of what?

      We changed the “Percent” to “% of targets” in the revision.

      We thank the reviewer for carefully reading our manuscript!

      Reviewer #2 (Public Review):

      This work introduces PLMGraph-Inter, a new deep-learning approach for predicting inter-protein contacts, which is crucial for understanding protein-protein interactions. Despite advancements in this field, especially driven by AlphaFold, prediction accuracy and efficiency (in terms of computational cost) still remain areas for improvement. PLMGraph-Inter utilizes invariant geometric graphs to integrate the features from multiple protein language models into the structural information of each subunit. When compared against other inter-protein contact prediction methods, PLMGraph-Inter shows better performance, which indicates that utilizing both sequence embeddings and structural embeddings is important to achieve high-accuracy predictions with relatively smaller computational costs for the model training.

      The conclusions of this paper are mostly well supported by data, but test examples should be revisited with a more strict sequence identity cutoff to avoid any potential information leakage from the training data. The main figures should be improved to make them easier to understand.

      We thank the reviewer for recognizing the significance of our work! We have carefully revised the manuscript to address the reviewer’s concerns.

      (1) The sequence identity cutoff to remove redundancies between training and test set was set to 40%, which is a bit high to remove test examples having homology to training examples. For example, CDPred uses a sequence identity cutoff of 30% to strictly remove redundancies between training and test set examples. To make their results more solid, the authors should have curated test examples with lower sequence identity cutoffs, or have provided the performance changes against sequence identities to the closest training examples.

      We thank the reviewer for the valuable suggestion! The “40% sequence identity” threshold is widely used to remove redundancy when evaluating deep-learning-based protein-protein interaction and protein complex structure prediction methods, so we also chose this threshold in our study (bioRxiv 2021.10.04.463034, Cell Syst. 2021 Oct 20;12(10):969-982.e6). In the revision, we explored whether PLMGraph-Inter can keep its performance when more stringent thresholds (30%, 20%, 10%) are applied (line 353-386 in page 20-21 in the subsection of “Ablation study” and line 762-780 in page 40 in the subsection of “Further potential redundancies removal between the training and the test”). The results show that even when using “10% sequence identity” as the threshold, the mean precision of the predicted contacts decreases by only ~3% (Table 2).

      (2) Figures with head-to-head comparison scatter plots are hard to understand as scatter plots because too many different methods are abstracted into a single plot with multiple colors. It would be better to provide individual head-to-head scatter plots as supplementary figures, not in the main figure.

      We thank the reviewer for the suggestion! We will include the individual head-to-head scatter plots as supplementary figures in the revision (Figure S1 and Figure S2 in the supplementary).

      (3) The authors claim that PLMGraph-Inter is complementary to AlphaFold-multimer as it shows better precision for the cases where AlphaFold-multimer fails. To strengthen the point, the qualities of predicted complex structures via protein-protein docking with predicted contacts as restraints should have been compared to those of AlphaFold-multimer structures.

      We thank the reviewer for the suggestion! We included this comparison in the revision (Figure S7).

      (4) It would be interesting to further analyze whether there is a difference in prediction performance depending on the depth of multiple sequence alignment or the type of complex (antigen-antibody, enzyme-substrates, single species PPI, multiple species PPI, etc).

      We thank the reviewer for the suggestion! We analyzed the relationship between the prediction performance and the depth of MSA in the revision (Figure S4 and line 253-264 in page 15 in the subsection of “Evaluation of PLMGraph-Inter on HomoPDB and HeteroPDB test sets” and line 798-806 in page 42 in the subsection of “Calculating the normalized number of the effective sequences of paired MSA”).

      Reviewer #2 (Recommendations For The Authors):

      I have the following suggestions in addition to the public review.

      (1) Overall, the manuscript is well-written; however, I recommend a careful review for minor grammar corrections to polish the final text.

      We carefully checked the manuscript and corrected all the grammar issues and typos we found in the revision.

      (2) It would be better to indicate that single sequence embeddings, MSA embeddings, and structure embeddings are ESM-1b, ESM-MSA & PSSM, and ESM-IF when they are first mentioned in the manuscript e.g. single sequence embeddings from ESM-1b, MSA embeddings from ESM-MSA and PSSM, and structural embeddings from ESM-IF.

      We revised the manuscript according to the reviewer’s suggestion (line 86-88 in page 6; line 99-101 in page 7).

      (3) I don't think "outer concatenation" is commonly used. Please specify whether it's outer sum, outer product, or horizontal & vertical tiling followed by concatenation.

      It is horizontal & vertical tiling followed by concatenation. We clarified this in the revision (line 129-130 in page 8).
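
      The operation described above can be sketched in a few lines. This is a minimal illustration on plain Python lists, not the authors' implementation: the function name and the list-of-vectors input format are assumptions, and an array library would normally do the same thing with tiling or broadcasting followed by concatenation along the feature axis.

```python
def outer_concatenate(feats_a, feats_b):
    """Build an L_a x L_b map whose (i, j) entry joins feats_a[i] and feats_b[j].

    feats_a: list of L_a per-residue feature vectors (lists of numbers).
    feats_b: list of L_b per-residue feature vectors (lists of numbers).
    """
    # Repeating feats_a[i] across every column j is the horizontal tiling;
    # repeating feats_b[j] down every row i is the vertical tiling.
    # Joining the two vectors at each (i, j) is the final concatenation.
    return [[list(vec_a) + list(vec_b) for vec_b in feats_b] for vec_a in feats_a]
```

      For two proteins with L_a and L_b residues and per-residue feature sizes d_a and d_b, the result has shape L_a x L_b x (d_a + d_b), which is the form a 2D pairwise network expects.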

      (4) 10th sentence on the page where the Results section starts, please briefly mention what are the other 2D pairwise features.

      We clarified this in the revision (line 131-132 in page 8).

      (5) In the result section, it states edges are defined based on Ca distances, but in the method section, it says edges are determined based on heavy atom distances. Please correct one of them.

      It should be Ca distances. We are sorry for the carelessness, and we corrected this in the revision (line 646 in page 35).

      (6) For the sentence, "Where ESM-1b and ESM-MSA-1b are pretrained PLMs learned from large datasets of sequences and MSAs respectively without label supervision,", I'd suggest replacing "without label supervision" with "with masked language modeling tasks" for clarity.

      We revised the manuscript according to the reviewer’s suggestion (line 150-151 in page 9).

      (7) It would be better to briefly explain what the dimensional hybrid residual block is when it is first mentioned.

      We explained the dimensional hybrid residual block where it is first mentioned in the revision (line 107 in page 7).

      (8) Please include error bars for the bar plots and standard deviations for the tables.

      We thank the reviewer for the suggestion! Our understanding is that error bars and standard deviations are very informative for data that follow Gaussian-like distributions, but our data (precisions of the predicted contacts) are clearly not of this type. Most previous studies in protein contact prediction and inter-protein contact prediction also did not include these in their plots or tables. In our case, including these elements would require a dramatic change to the style of our figures and tables, and we would prefer not to change them too much in the revision.

      (9) Please indicate whether the chain break is considered to generate attention map features from ESM-MSA-1b. If it's considered, please specify how.

      The paired sequences were directly concatenated without using any letter to connect them, which means we did not consider chain break in generating the attention maps from ESM-MSA-1b.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment:

      This manuscript is a valuable study of the responses of GPi neurons to DBS stimulation in human PD and dystonia patients and it finds evidence for altered short-term and long-term plasticity in response to DBS between the two patient populations. This data set is of interest to both basic and clinical researchers working in the field of DBS and movement disorders. While there was enthusiasm for the potential significance of these findings, support for their conclusions was incomplete. Their data may be indicative of more interesting and complex interpretations than currently considered in the article.

      The authors would like to express their gratitude to the Editorial Team and Reviewers for their invaluable feedback which helped to improve the manuscript.

      Reviewer #1:

      Summary:

      Sumarac et al investigate differences in globus pallidus internus (GPi) spike activity and short- and long-term plasticity of direct pathway projections in patients with Parkinson's disease (PD) and dystonia. Their main claims are that GPi neurons exhibit distinct characteristics in these two disorders, with PD associated with specific power-frequency oscillations and dystonia showing lower firing rates, increased burstiness, and less regular activity. Additionally, long-term plasticity and synaptic depression appear to differ between the two conditions. The authors suggest that these findings support the concept of hyperfunctional GPi output in PD and hypofunctional output in dystonia, possibly driven by variations in the plasticity of striato-pallidal synapses. Overall enthusiasm is relatively high, but I think the discussion omits discussing findings that don't align well with standard models. 

      Strengths: 

      These types of studies are valuable as the data arise from patients who have dystonia or PD. This could provide unique insights into disease pathophysiology that might not be recapitulated in animal systems work. 

      Thank you for the positive feedback.

      Weaknesses: 

      - The rate model and indirect/direct pathway ideas lack explanatory power; too much of the hypothesis generation and discussion in this manuscript is set in the context of these old ideas. Their data in my view emphasize this somewhat emphatically. Most patients with the 'hypokinetic' movement disorder PD have dystonia as a part of their motor features. Dystonia is a form of excessive muscle activation that on the one hand is 'hyperkinetic' but on the other usually decreases the speed of motor tasks, even in patients with primary dystonia. Similarly, PD patients display a bewildering variety of hyperkinetic manifestations as well (rest tremor, dystonia, dyskinesia). If these are truly independent classifications, i.e. hyper- versus hypo-kinetic, the authors must acknowledge that there is considerable overlap in the spike activity across groups - numerous dystonia patients display higher discharge rates than the majority of the PD sample. Based on the firing rate alone, it would not be possible to distinguish these groups. 

      Thank you for your insightful comments regarding the discussion of the rate model and the distinction between hyperkinetic and hypokinetic movement disorders. We acknowledge that the rate model, primarily derived from a limited number of animal subjects [1], may not fully encapsulate the complexities of Parkinson's disease (PD) and dystonia. Our study aimed to validate animal model findings in humans by correlating single-neuron features with disease symptom severity. However, we concur with the Reviewer’s comment regarding the overlapping motor features in hypokinetic and hyperkinetic disorders. We can speculate that the overlap in neuronal properties may be reflected in the overlap of, for example, hyperkinetic features being also present in PD, as suggested by the Reviewer. Per the Reviewer’s request, we have now acknowledged this notion in the manuscript. Interestingly, hypokinetic symptoms have been reported to occur in dystonia in response to GPi-stimulation and have been associated with beta activity in the LFP [2], which reinforces the notion that neural activity may be more related to specific symptoms rather than diseases as a whole. Supplementing our analyses, in addition to total UPDRSIII scores, we have now provided correlations with only hypokinetic (i.e. bradykinesia) subscores of the UPDRSIII to focus on more direct assessment of hypokinetic features in PD versus hyperkinetic features in dystonia. We have updated our methods and results accordingly.

      [1] M. R. DeLong, “Primate models of movement disorders of basal ganglia origin.,” Trends Neurosci, vol. 13, no. 7, pp. 281–285, Jul. 1990, doi: 10.1016/0166-2236(90)90110-v.

      [2] R. Lofredi et al., “Pallidal Beta Activity Is Linked to Stimulation-Induced Slowness in Dystonia,” Movement Disorders, vol. 38, no. 5, pp. 894–899, 2023, doi: 10.1002/mds.29347.

      Amendments to the manuscript:

      “Indeed, variability in spike firing rates in PD may be reflected in the considerable overlap in spiking activity between PD and dystonia (Fig. 1A), with many dystonia patients exhibiting higher discharge rates compared to PD patients.”

      “Given that UPDRSIII includes both hypokinetic and hyperkinetic symptoms of PD, we further sought to disaggregate the score by only considering items 23-26 in UPDRSIII, which assess hypokinetic symptoms of PD.”

      “… with a marginally stronger correlation for PD hypokinetic symptoms only (items 23-26 of UPDRSIII, Spearman's rho=0.32, p=.0330; Supplementary Fig. 3)”

      Supplementary Fig. 3: We provided correlations with hypokinetic (i.e., bradykinesia) subscore of the UPDRSIII. There is very little difference between correlation results of UPDRSIII total (Fig. 1) and the hypokinetic-only subscore (Supplementary Fig. 3).

      “though our results do not change substantially when only hypokinetic PD features are considered (Supplementary Fig. 3).”

      - If beta power is pathognomonic of parkinsonism, the authors found no differences in beta-related spike discharges across the groups. One would have predicted greater beta power in PD than in primary dystonia. This should be discussed explicitly and an interpretation should be provided. 

      We agree with the reviewer that considering the previous LFP literature, one might have expected a difference in single-neuron oscillation power between PD and dystonia. However, while prior studies [3], [4] have reported significant differences in oscillatory power between the two diseases, researchers examined local field potential (LFP) activity only. Other work [5] in non-human primates investigated single-neuron oscillations and reported no differences between PD and dystonia at the single-neuron level, in line with our findings. However, despite the lack of difference in overall power presented here, we provide evidence that the strength of the beta-frequency single-neuron oscillations nevertheless correlates with symptom severity in PD but not dystonia; whereas the strength of the theta-frequency single-neuron oscillations correlates with symptom severity in dystonia but not PD.

      [3] P. Silberstein et al., “Patterning of globus pallidus local field potentials differs between Parkinson’s disease and dystonia.,” Brain, vol. 126, no. Pt 12, pp. 2597–2608, Dec. 2003, doi: 10.1093/brain/awg267.

      [4] D. D. Wang et al., “Pallidal Deep-Brain Stimulation Disrupts Pallidal Beta Oscillations and Coherence with Primary Motor Cortex in Parkinson’s Disease,” J Neurosci, vol. 38, no. 19, pp. 4556–4568, May 2018, doi: 10.1523/JNEUROSCI.0431-18.2018.

      [5] P. A. Starr et al., “Spontaneous pallidal neuronal activity in human dystonia: comparison with Parkinson’s disease and normal macaque.,” J Neurophysiol, vol. 93, no. 6, pp. 3165–3176, Jun. 2005, doi: 10.1152/jn.00971.2004.

      Amendments to the manuscript:

      “Although previous research has reported differences in the LFP power between PD and dystonia [27,28], a study in non-human primates found no such differences in single-neuron oscillatory strength [8], as reflected in our findings. However, despite a lack of difference in overall power across disorders, we were able to derive disease/frequency-specific relationships with respect to clinical scores (Fig. 1C; oscillatory features).”

      - The study lacks a healthy control group, making it challenging to differentiate disease-specific findings from normal variations in GPi activity and plasticity. Although this is acknowledged in the discussion, this complicates the interpretation of the results. The sample sizes for PD and dystonia patients are relatively small, and the study combines various forms of dystonia, potentially masking subtype-specific differences. A larger and more homogenous sample could enhance the study's reliability.

      Indeed, intraoperative microelectrode recordings cannot be obtained in healthy individuals. We agree with the Reviewer that this limits the interpretation of the data. However, directly comparing clinical correlations with single neuron readouts between two distinct clinical entities may, to some degree, compensate for the lack of healthy control data. This contrast, while not providing a healthy control, is still able to point to disease-specific differences. This approach has previously been used for comparisons at the LFP level [6]. While the sample size is indeed small, it is comparable to or even larger than that of similar studies that have investigated the relation of symptom severity to single neuron readouts [7]. The Reviewer is right that we do not differentiate between generalized and cervical dystonia. We chose to do so because our subgroup analysis provided in the Supplementary Material did not suggest specific differences, though there is insufficient data from specific dystonia subtypes to make formal statistical comparisons. Indeed, future studies should investigate specific subtypes further.

      [6] R. Lofredi et al., “Pallidal beta bursts in Parkinson’s disease and dystonia,” Movement Disorders, vol. 34, no. 3, pp. 420–424, 2019, doi: 10.1002/mds.27524.

      [7] A. Gulberti et al., “Subthalamic and nigral neurons are differentially modulated during parkinsonian gait,” Brain, p. awad006, Feb. 2023, doi: 10.1093/brain/awad006.

      Amendments to the manuscript:

      “While we did not observe differences across dystonia subtypes (Supplementary Fig. 1), future studies in larger patient cohorts are warranted. Finally, as many findings in Fig. 1 do not survive corrections for multiple comparisons, we suggest interpretation of results with caution. Despite this, many of our findings related to neuronal correlates are generally in line with previous literature, especially related to oscillatory correlates of PD and dystonia.”

      - While they mention that data are available on request, sharing data openly would increase transparency and allow for independent validation of the results. It is unclear how sharing deidentified data would compromise patient privacy or present ethical issues of any kind, as claimed by the authors. 

      Much of the data in question were collected under an old Research Ethics Board (REB) protocol which did not address data sharing. However, we have consulted with our REB and gained retroactive permission to post de-identified data which are now available in the Supplementary Material.

      Amendments to the manuscript:

      “The data that support the findings of this study are available in a public repository (see: https://osf.io/nqzd2/)”

      - They appropriately acknowledge several limitations, such as the inability to use pharmacological interventions and the need for further research in the chronic setting. 

      Thank you for the comment.

      - The manuscript highlights differences in GPi activity and plasticity between PD and dystonia but could provide more context on the clinical implications of these findings, particularly regarding what the implications would be for novel paradigms for deep brain stimulation.

      Thank you for the comment. Our finding that striato-pallidal plasticity decays more slowly in dystonia compared to PD may relate to the slower time course of symptom relief associated with GPi-DBS in dystonia, as presently outlined in the discussion. On the other hand, symptoms are also suppressed for longer after the cessation of stimulation in dystonia compared to PD, which may reflect long-term plastic changes [8], [9]. In the context of clinical DBS, plasticity modulation may be facilitated by intermittent stimulation algorithms that achieve the necessary plastic network change by applying stimulation for a defined time and are then switched off to reduce energy consumption and perhaps to mitigate side effects. DBS devices with chronic sensing may enable monitoring of evoked potential amplitudes for future adaptive stimulation applications; currently available devices are limited by low sampling rates, but future devices may overcome this technical limitation.

      [8] D. Ruge et al., “Deep brain stimulation effects in dystonia: time course of electrophysiological changes in early treatment.,” Mov Disord, vol. 26, no. 10, pp. 1913–1921, Aug. 2011, doi: 10.1002/mds.23731.

      [9] D. Ruge et al., “Shaping reversibility? Long-term deep brain stimulation in dystonia: the relationship between effects on electrophysiology and clinical symptoms.,” Brain, vol. 134, no. Pt 7, pp. 2106–2115, Jul. 2011, doi: 10.1093/brain/awr122.

      Amendments to the manuscript:

      “While further work is certainly required to better understand disease-related differences in plasticity, our findings may nevertheless motivate the development of periodic intermittent (ON/OFF) DBS strategies which periodically modulate synaptic plasticity for therapeutic benefits which outlast stimulation delivery, as have recently been employed in preclinical work [52,53].”

      - While statistical tests are mentioned, the manuscript could benefit from a more detailed presentation of statistical methods, including correction for multiple comparisons and effect sizes. Did the authors consider different recording sites within each patient as independent observations? I think this is not appropriate if that was the case. 

      Thank you for your constructive feedback. In response to the concerns regarding the statistical methods, we have expanded our analysis to provide a more comprehensive statistical overview. Specifically, we implemented the Bonferroni correction for multiple comparisons across each of the seven tests conducted for the differences in single-neuron features between PD and dystonia. The adjustment revealed that only the burst index and coefficient of variation retain statistical significance after post hoc correction, while the firing rate does not. Results of the Bonferroni corrections are now presented in Supplementary Table 3. Reflecting on the initial comment about firing rates between the two disorders, our updated findings underscore the limitation of using firing rates alone to differentiate between PD and dystonia, and instead, our analysis now points to burstiness and firing irregularity as more reliable discriminators. Regarding the clinical correlations, we refined our statistical analysis by employing nonparametric Monte Carlo permutation tests with 5000 permutations, as used in recent work [10], [11]. This method is chosen for its independence from assumptions regarding data distribution. Specifically, we computed and tested the Spearman rho for significance using the permutation test. Then, to address multiple comparisons, we controlled the false discovery rate (FDR) using the Benjamini-Hochberg procedure. Results of these comparisons are now presented in Supplementary Table 4. Lastly, to address the concern regarding recording site independence within patients, we updated our plasticity analysis methodology. In our study, 6 out of 18 patients had multiple recording sites. Thus, to account for this, we employed linear mixed models (LMM) with patient ID as a random factor to appropriately account for the non-independence of these observations.

      [10] R. Lofredi et al., “Dopamine-dependent scaling of subthalamic gamma bursts with movement velocity in patients with Parkinson’s disease,” Elife, vol. 7, p. e31895, Feb. 2018, doi: 10.7554/eLife.31895.

      [11] R. Lofredi et al., “Subthalamic beta bursts correlate with dopamine-dependent motor symptoms in 106 Parkinson’s patients,” npj Parkinsons Dis., vol. 9, no. 1, Art. no. 1, Jan. 2023, doi: 10.1038/s41531-022-00443-3.

      Amendments to the manuscript:

      “For comparing differences in single-neuron features between PD and dystonia, significant results were followed up with post hoc multiple comparisons with a Bonferroni correction. For clinical correlations, non-parametric Monte Carlo permutation tests were used, avoiding assumptions about data distribution. The tested values were randomly shuffled 5,000 times to form a probability distribution, with the p-value reflecting the original sample rank. All tests underwent adjustment for multiple comparisons, controlling the false discovery rate (FDR) at an α-level of 0.05.”
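
      The corrections described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the analysis code used for the manuscript: the function names, the two-sided test, and the (hits + 1) / (n_perm + 1) p-value estimate are assumptions made for the example, and inputs are assumed to be non-constant.

```python
import random


def rank(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_permutation_test(x, y, n_perm=5000, seed=0):
    """Spearman rho with a two-sided Monte Carlo permutation p-value."""
    rng = random.Random(seed)
    rx, ry = rank(x), rank(y)
    observed = pearson(rx, ry)
    shuffled = list(ry)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the pairing between x and y
        if abs(pearson(rx, shuffled)) >= abs(observed) - 1e-12:
            hits += 1
    # Assumption: add-one estimate so the p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


def bonferroni(pvalues):
    """Bonferroni-adjusted p-values (each multiplied by the test count)."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for pos in range(m - 1, -1, -1):  # walk from largest to smallest p
        i = order[pos]
        running_min = min(running_min, pvalues[i] * m / (pos + 1))
        adjusted[i] = running_min
    return adjusted
```

      A result would then be reported as significant when its adjusted p-value falls below the chosen α-level (0.05 in the text above).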

      “analyzed using a linear mixed model (LMM) with patient ID as a random factor, normalized fEP amplitudes as the response variable, and epoch as a fixed effect”

      “using a LMM with patient ID as a random factor”
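
      For readers less familiar with mixed models, one common way to write a model of this kind is given below. This is an illustrative form only: it assumes a random intercept per patient, which is our reading of "patient ID as a random factor", and the symbols are introduced here rather than taken from the manuscript.

```latex
y_{ij} = \beta_0 + \beta_1\,\mathrm{epoch}_{ij} + u_i + \varepsilon_{ij},
\qquad u_i \sim \mathcal{N}(0, \sigma_u^2),
\qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

      Here $y_{ij}$ is the normalized fEP amplitude for observation $j$ in patient $i$, $\mathrm{epoch}_{ij}$ indicates pre- versus post-HFS, $\beta_1$ is the fixed effect of epoch, and $u_i$ absorbs the shared variance among recording sites from the same patient.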

      “However, none of the clinical correlations survived Benjamini-Hochberg FDR-correction for multiple comparisons (Supplementary Table 4).”

      “In PD, fEP amplitudes were significantly greater after compared to before HFS (LMM; p = .0075, effect size = 5.42 ± 1.79; Fig. 2C), while in dystonia, the increase approached but did not reach statistical significance (LMM; p = .0708, effect size = 2.82 ± 1.45; Fig. 2C).”

      All statistics were updated in the results section and the figures.

      “Finally, as many findings in Fig. 1 do not survive corrections for multiple comparisons, we suggest interpretation of results with caution. Despite this, many of our findings related to neuronal correlates are generally in line with previous literature, especially related to oscillatory correlates of PD and dystonia.”

      - The manuscript could elaborate on the potential mechanisms underlying the observed differences in GPi activity and plasticity and their relevance to the pathophysiology of PD and dystonia. 

      Thank you for your feedback. We have enhanced the manuscript by integrating additional discussions on previous studies related to plasticity in dystonia and PD (e.g., [12], [13]), which highlight excessive plasticity in dystonia. Although these may appear contradictory to our findings of increased plasticity in PD compared to dystonia, we propose (also justified by previous literature) that chronic dopaminergic medication use may lead to synaptic over-sensitization, which has been hypothesized as a biological mechanism underlying levodopa-induced dyskinesias (a hyperkinetic feature) in PD [14].

      [12] Y. Tamura et al., “Disordered plasticity in the primary somatosensory cortex in focal hand dystonia.,” Brain, vol. 132, no. Pt 3, pp. 749–755, Mar. 2009, doi: 10.1093/brain/awn348.

      [13] D. A. Peterson, T. J. Sejnowski, and H. Poizner, “Convergent evidence for abnormal striatal synaptic plasticity in dystonia.,” Neurobiol Dis, vol. 37, no. 3, pp. 558–573, Mar. 2010, doi: 10.1016/j.nbd.2009.12.003.

      [14] P. Calabresi, B. Picconi, A. Tozzi, V. Ghiglieri, and M. Di Filippo, “Direct and indirect pathways of basal ganglia: a critical reappraisal.,” Nat Neurosci, vol. 17, no. 8, pp. 1022–1030, Aug. 2014, doi: 10.1038/nn.3743.

      Amendments to the manuscript:

      “Converging evidence from past animal and human studies suggests that dystonia is associated with impaired synaptic function and abnormal synaptic plasticity [35–37]. Compared to healthy controls, it has been shown that transcranial magnetic stimulation induced motor evoked potentials (MEPs) are hyperexcitable in dystonia [38,39], and somatosensory and motor cortical plasticity is greater [40]. Likewise, enhanced long-term potentiation at cortico-striatal synapses has been shown in rodent models of dystonia [41,42]. While our finding that long term potentiation effects are greater in PD compared to dystonia (Fig. 2D) is difficult to corroborate with this literature, one potential explanation can be that all of our PD patients are long-term users of levodopa. We have previously shown that the intake of this antiparkinsonian dopaminergic medication leads to potent increases in the magnitude of direct pathway plasticity [15]. Although patients are 12hr withdrawn from antiparkinsonian medications for surgery, it could be that striato-pallidal synapses are nevertheless chronically over-sensitized from prolonged use of dopaminergic medication; which is a well-known hypothesis related to the manifestation of levodopa-induced dyskinesias (a hyperkinetic feature) in PD [43]. Indeed, a lack of depotentiation of striato-pallidal projections has previously been observed in patients with levodopa-induced dyskinesias [44]. As such, excessive plasticity of these projections may corroborate hyperkinetic features of dystonia and levodopa-induced dyskinesias in PD.”

      Reviewer #2: 

      Summary: 

      The authors investigated how neuronal activity and metrics of plasticity using local electrical stimulation in the GPi were different between Parkinson's disease and dystonia patients. 

      Strengths: 

      The introduction highlights the importance of the work and the fundamental background needed to understand the rest of the paper. It also clearly lays out the novelty (i.e., that the dynamics of plastic effects in GPi between dystonia and PD have not been directly compared). 

      The methods are clearly described and the results are well organized in the figures. 

      The results are strong with measurements from a large population of patients for each disease group and with distinct findings for each group. 

      Thank you for the kind appraisal.

      Weaknesses: 

      The discussion was hard to follow in several places, making it difficult to fully appreciate how well the authors' claims and conclusions are justified by their data, mostly in relation to the plasticity results. It may help to summarize the relevant findings for each section first and then further expand on the interpretation, comparison with prior work, and broader significance. Currently, it is hard to follow each section without knowing which results are being discussed until the very end of the section. With the current wording in the "Neuronal correlates.." section, it is not always clear which results are from the current manuscript, and where the authors are referring to past work.

      Thank you for this feedback. The main findings are now summarized in a paragraph at the beginning of the Discussion section, before being discussed in comparison to other studies in the literature in subsequent sub-sections. Moreover, throughout the Discussion, findings from our study are now always reflected by a reference to the relevant figure to more easily differentiate current findings from previous literature. Additionally, Discussion sub-sections have been expanded to consider additional literature in response to various comments throughout the Review process (including the subsequent Review comment).

      Amendments to the manuscript:

      Paper findings are referenced to figures which depict the results at hand; discussion sub-sections expanded; and the following text has been added at the start of the Discussion:

      “In particular, we found that GPi neurons exhibited lower firing rates, but greater burstiness and variability in dystonia compared to PD (Fig. 1A). While no differences were found in the power of spiketrain oscillations across disorders (Fig. 1B), we found that PD symptom severity positively correlated with the power of low-beta frequency spiketrain oscillations, whereas dystonia symptom severity positively correlated with the power of theta frequency spiketrain oscillations (Fig. 1C). Dystonia symptom severity moreover correlated negatively with firing rate, and positively with neuronal variability. These results are discussed in greater detail with respect to previous literature in the subsequent Discussion section entitled “Neuronal correlates of PD and dystonia.” In response to electrical stimulation (protocol depicted in Fig. 2A), we found significant increases in the amplitudes of positive-going stimulation-evoked field potential amplitudes (considered to reflect striato-pallidal synaptic strength; as exemplified in Fig. 2B) before versus after HFS in both PD and dystonia (Fig. 2C); with recording sites in PD exhibiting significantly greater increases (Fig. 2D). While changes to evoked potential amplitude before versus after stimulation can be considered to be reflective of long-term plasticity [15,18], the dynamics of evoked potentials during HFS (as depicted in Fig. 2E) can be considered as reflective of short-term synaptic plasticity [18,21]. To this end, our findings are suggestive of faster latency synaptic depression in PD compared to dystonia (Fig. 2F/G). Plasticity findings are discussed in greater detail in the Discussion section entitled “Direct pathway plasticity.”

      Also, I felt that more discussion could be used to highlight the significance of the current results by comparing and/or contrasting them to prior relevant work and mechanisms. The novelty or impact is not very clear as written. Could this be further substantiated in the Discussion? 

      Thank you for the feedback. The discussion has been expanded to include additional literature that is relevant to the findings reported in the manuscript. For example, with regards to the neuronal correlates sub-section, we now highlight the important findings [15] that show changes to the discharge rates and oscillatory tendencies of GPi neurons in non-human primates in response to staged MPTP applications to progressively titrate motor severity; these results substantiate our lack of correlation with firing rates in PD, and presence of a clinical correlation with beta oscillations. We additionally now emphasize human studies that found LFP power differences between PD and dystonia [3], [4]; but simultaneously highlight studies that did not find such differences in spike-train oscillations (in non-human primates) [5], which is reflective of our own findings. With regards to our plasticity sub-section, we have added new content related to previous literature on plasticity in dystonia and PD (also addressed in response to a query from Reviewer #1). For example, we bring to light a variety of previous studies [12], [13] emphasizing excessive plasticity in dystonia. However, while such studies may seem to contradict our findings of greater plasticity in PD compared to dystonia, we additionally provide hypotheses (justified by previous literature) that prolonged use of dopaminergic medication may result in synaptic over-sensitization, thus giving rise to levodopa-induced dyskinesias (a hyperkinetic feature) in PD [14].

      [3] P. Silberstein et al., “Patterning of globus pallidus local field potentials differs between Parkinson’s disease and dystonia.,” Brain, vol. 126, no. Pt 12, pp. 2597–2608, Dec. 2003, doi: 10.1093/brain/awg267.

      [4] D. D. Wang et al., “Pallidal Deep-Brain Stimulation Disrupts Pallidal Beta Oscillations and Coherence with Primary Motor Cortex in Parkinson’s Disease,” J Neurosci, vol. 38, no. 19, pp. 4556–4568, May 2018, doi: 10.1523/JNEUROSCI.0431-18.2018.

      [5] P. A. Starr et al., “Spontaneous pallidal neuronal activity in human dystonia: comparison with Parkinson’s disease and normal macaque.,” J Neurophysiol, vol. 93, no. 6, pp. 3165–3176, Jun. 2005, doi: 10.1152/jn.00971.2004.

      [12] Y. Tamura et al., “Disordered plasticity in the primary somatosensory cortex in focal hand dystonia.,” Brain, vol. 132, no. Pt 3, pp. 749–755, Mar. 2009, doi: 10.1093/brain/awn348.

      [13] D. A. Peterson, T. J. Sejnowski, and H. Poizner, “Convergent evidence for abnormal striatal synaptic plasticity in dystonia.,” Neurobiol Dis, vol. 37, no. 3, pp. 558–573, Mar. 2010, doi: 10.1016/j.nbd.2009.12.003.

      [14] P. Calabresi, B. Picconi, A. Tozzi, V. Ghiglieri, and M. Di Filippo, “Direct and indirect pathways of basal ganglia: a critical reappraisal.,” Nat Neurosci, vol. 17, no. 8, pp. 1022–1030, Aug. 2014, doi: 10.1038/nn.3743.

      [15] A. Muralidharan et al., “Physiological changes in the pallidum in a progressive model of Parkinson’s disease: Are oscillations enough?,” Exp Neurol, vol. 279, pp. 187–196, May 2016, doi: 10.1016/j.expneurol.2016.03.002.

      Amendments to the manuscript:

      “Despite the lack of correlations with firing rate in PD, our findings seem to align with those of Muralidharan and colleagues [25], who showed that GPi neuronal firing rates may not directly correlate with motor severity but exhibit variability across the disease severity continuum in parkinsonian non-human primates (initially increasing, then decreasing, then increasing again at mild, moderate, and severe disease manifestations, respectively). Thus, while GPi discharge rates may change in PD, such changes may not be reflected by linear relationships with motor sign development and progression. Indeed, variability in spike firing rates in PD may be reflected in the considerable overlap in spiking activity between PD and dystonia (Fig. 1A), with many dystonia patients exhibiting higher discharge rates compared to PD patients. While differences in discharge rates were nevertheless observed between PD and dystonia, it may be that the combination of rate and pattern (reflected in the BI and CV) changes best differentiates the two disorders.”

      “Converging evidence from past animal and human studies suggests that dystonia is associated with impaired synaptic function and abnormal synaptic plasticity [35–37]. Compared to healthy controls, it has been shown that transcranial magnetic stimulation induced motor evoked potentials (MEPs) are hyperexcitable in dystonia [38,39], and somatosensory and motor cortical plasticity is greater [40]. Likewise, enhanced long-term potentiation (LTP) at cortico-striatal synapses has been shown in rodent models of dystonia [41,42]. While our finding that LTP effects are greater in PD compared to dystonia (Fig. 2D) is difficult to corroborate with this literature, one potential explanation can be that all of our PD patients are long-term users of levodopa. We have previously shown that the intake of this antiparkinsonian dopaminergic medication leads to potent increases in the amount of plasticity elicited in GPi [15]. Although patients are 12hr withdrawn from antiparkinsonian medications for surgery, it could be that striato-pallidal synapses are nevertheless chronically over-sensitized from prolonged use of dopaminergic medication; which is a well-known hypothesis related to the manifestation of levodopa-induced dyskinesias (a hyperkinetic feature) in PD [43]. Indeed, a lack of depotentiation of striato-pallidal projections has previously been observed in patients with levodopa-induced dyskinesias [44]. As such, excessive plasticity of these projections may corroborate hyperkinetic features of dystonia and levodopa-induced dyskinesias in PD.”

      Some specific comments and questions about the Discussion: 

      Lines 209-211 - This sentence was hard to understand, could it be clarified? 

      Lines 211-213 - What do phasic and tonic components mean exactly? Could this be specifically defined? Are there specific timescales (as referred to in Intro)?

      Lines 215-217 - It's not clear what was delayed in dystonia, and how the authors are trying to contrast this with the faster time course in PD. I think some of this is explained in the introduction, but could also be re-summarized here as relevant to the results discussed. 

      Lines 223-224 - I'm not sure I follow the implication that network reorganization leads to delayed functional benefits. Could this be further elaborated? 

      Reply & Amendments to the manuscript: Thank you for your feedback. We've made the following concise revisions to address the comments:

      We've clarified lines 209-211 to explain that variations in electrical stimulation effects on pathways in PD and dystonia may reveal the operational mechanisms of DBS, despite a common target:

      “The variation in the modulation of these projections / pathways to electrical stimulation may also indicate the mechanism by which DBS operates across PD and dystonia, despite a common stimulation target.”

      In response to the second comment on lines 211-213 about phasic and tonic components, we now specify that phasic refers to dynamic muscle contractions, and tonic to continuous muscle contractions, providing clear definitions relevant to our context:

      “Clinical studies in dystonia have shown that DBS leads to a more rapid improvement in the transient, dynamic muscle contractions (phasic components) of the disorder when compared to the sustained, continuous muscle contractions (tonic or fixed components) [33]”

      For lines 215-217, we've refined our discussion to clearly contrast the delayed response in dystonia with the faster onset in PD:

      “This contrasts with PD, where the maximal clinical response to DBS occurs within a much faster time course [13,36].”

      On lines 223-224, we've expanded the explanation of how network reorganization may lead to delayed functional benefits, highlighting adjustments in neural connectivity and synaptic efficacy in response to stimulation:

      “which involves adjustments in neural connectivity or synaptic efficacy in response to the stimulation [14,35].”

      Could the absence of a relationship between FR and disease in PD be discussed? 

      Thank you for raising this point. Despite observing higher firing rates in PD compared to dystonia, it is unexpected that these rates do not correlate with symptom severity according to the rate model of PD [1]. However, despite the lack of correlations with firing rates, our findings align with similar animal work of Muralidharan et al. [15], which reported that neuronal firing rates within the GPi of rhesus monkeys did not increase linearly with respect to varying intensities of parkinsonian motor severity. We did however show that low beta oscillatory strength within the GPi may play a significant role in the manifestation of motor symptoms in PD; which is also in line with findings of Muralidharan and colleagues. As per the Reviewer’s request, we have included this content into our discussion.

      [1] M. R. DeLong, “Primate models of movement disorders of basal ganglia origin.,” Trends Neurosci, vol. 13, no. 7, pp. 281–285, Jul. 1990, doi: 10.1016/0166-2236(90)90110-v.

      [15] A. Muralidharan et al., “Physiological changes in the pallidum in a progressive model of Parkinson’s disease: Are oscillations enough?,” Exp Neurol, vol. 279, pp. 187–196, May 2016, doi: 10.1016/j.expneurol.2016.03.002.

      Amendments to the manuscript:

      “Despite the lack of correlations with firing rate in PD, our findings seem to align with those of Muralidharan and colleagues [25], who showed that GPi neuronal firing rates may not directly correlate with motor severity but exhibit variability across the disease severity continuum in parkinsonian non-human primates (initially increasing, then decreasing, then increasing again at mild, moderate, and severe disease manifestations, respectively). Thus, while GPi discharge rates may change in PD, such changes may not be reflected by linear relationships with motor sign development and progression.”

      “Indeed, Muralidharan and colleagues [25] also showed linear group-level relationships between low-beta frequency spiketrain oscillations and disease severity in parkinsonian non-human primates, despite the lack of linear relationships with spike discharge rates (as discussed above).”

      It wasn't very clear how the direct pathway can be attributed to plasticity changes if the GPi makes up both the direct and indirect pathways. Could this be further clarified? 

      The reviewer brings up an important, nuanced point. Recent work from our lab [16] shows that inhibitory evoked fields in STN (which receives inhibitory inputs from GPe; no other inhibitory sources) are persistent with very minimal depression during HFS. On the other hand, inhibitory fields in the SNr (which receives the majority of its inhibitory inputs from striatum; though some come by way of GPe as well per anatomical literature) depress quickly. We have previously also shown these rapidly depressing fields in GPi [17], [18], which also receives the majority of its inhibitory inputs via striatum, though some also from GPe. As such, the disaggregation of striatum-mediated versus GPe-mediated inhibitory fields is achieved based on: lack of rapidly depressing inhibitory evoked field potentials in STN (which receives inhibitory inputs via GPe and not striatum), but a common presence of rapidly depressing evoked field potentials in SNr and GPi (which both receive most of their inhibitory inputs from striatum); differences in the morphology of purportedly GPe- (fast latency) versus striatum-mediated (slow latency) evoked field potentials [16]; and the presence of slow latency caudato-nigral evoked field potentials in slices [19] that are reversed by GABA antagonist application [20]. These points are indeed outlined in the first paragraph of the Discussion sub-section “Direct pathway plasticity.” However, we have now additionally added a point to the Limitations that inhibitory inputs to the GPi also come by way of GPe, though in a lesser abundance.

      [16] L. A. Steiner et al., “Persistent synaptic inhibition of the subthalamic nucleus by high frequency stimulation,” Brain Stimul, vol. 15, no. 5, pp. 1223–1232, 2022, doi: 10.1016/j.brs.2022.08.020.

      [17] L. D. Liu, I. A. Prescott, J. O. Dostrovsky, M. Hodaie, A. M. Lozano, and W. D. Hutchison, “Frequency-dependent effects of electrical stimulation in the globus pallidus of dystonia patients.,” J Neurophysiol, vol. 108, no. 1, pp. 5–17, Jul. 2012, doi: 10.1152/jn.00527.2011.

      [18] L. Milosevic et al., “Modulation of inhibitory plasticity in basal ganglia output nuclei of patients with Parkinson’s disease,” Neurobiology of Disease, vol. 124, pp. 46–56, Apr. 2019, doi: 10.1016/j.nbd.2018.10.020.

      [19] M. Yoshida and W. Precht, “Monosynaptic inhibition of neurons of the substantia nigra by caudato-nigral fibers,” Brain Res, vol. 32, no. 1, pp. 225–228, Sep. 1971, doi: 10.1016/0006-8993(71)90170-3.

      [20] W. Precht and M. Yoshida, “Blockage of caudate-evoked inhibition of neurons in the substantia nigra by picrotoxin,” Brain Res, vol. 32, no. 1, pp. 229–233, Sep. 1971, doi: 10.1016/0006-8993(71)90171-5.

      Amendments to the manuscript:

      “Indeed, GPi receives the greatest abundance of inhibitory inputs from striatum (direct pathway), but it also receives inhibitory inputs by way of GPe (indirect pathway). Although we can functionally disaggregate these pathway-specific responses based on differences in morphology and dynamics of GPe-mediated versus striatum-mediated inhibitory fEPs [21]; the possibility of compounded effects cannot be completely ruled out.”

      The mechanism of short- and long-term plasticity as applied in the protocols used in this work are outlined in reference to previous citations [15, 16, 18]. Because this is a central aspect of the current work and interpreting the results, it was difficult to appreciate how these protocols provide distinct metrics of short and long-term plasticity in GPi without some explanation of how it applies to the current work and the specific mechanisms. It would also help to be able to better link how the results fit with the broader conclusions. 

      Short-term plasticity is measured as the dynamic change to the fEP during ongoing HFS. For long-term plasticity analyses, the fEP amplitudes during LFS were compared pre- versus post-HFS. To make this analysis more intuitive we have added a protocol illustration to Fig 2. We have moreover greatly expanded the discussion to include more literature related to disease-specific differences in plasticity, and implications of modulating plasticity using DBS.
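
      To make the two metrics concrete, the sketch below computes a percentage change in fEP amplitude (long-term) and the half-life of an exponential decay fitted to the first few fEP amplitudes during HFS (short-term). It is a hedged illustration rather than the analysis pipeline itself: the function names are hypothetical, the fit is done by least squares on log-amplitudes (the actual fitting procedure is not stated here), amplitudes are assumed to be positive, and the half-life is expressed in units of stimulus pulses.

```python
import math


def percent_change(pre, post):
    """Long-term metric: % change in mean fEP amplitude, post- vs pre-HFS."""
    mean_pre = sum(pre) / len(pre)
    mean_post = sum(post) / len(post)
    return 100.0 * (mean_post - mean_pre) / mean_pre


def exponential_half_life(amplitudes):
    """Short-term metric: fit A * exp(-k * n) over pulse index n = 0, 1, ...

    Assumption: the fit is a straight-line least-squares fit to
    log(amplitude), which requires strictly positive amplitudes.
    Returns (A, k, half_life), with half_life in pulses.
    """
    n = list(range(len(amplitudes)))
    logs = [math.log(a) for a in amplitudes]
    mean_n = sum(n) / len(n)
    mean_log = sum(logs) / len(logs)
    slope = (sum((a - mean_n) * (b - mean_log) for a, b in zip(n, logs))
             / sum((a - mean_n) ** 2 for a in n))
    k = -slope
    amplitude_0 = math.exp(mean_log - slope * mean_n)
    return amplitude_0, k, math.log(2) / k


# Synthetic example: amplitudes over the first 5 pulses of a train.
first_five = [100.0 * 0.5 ** i for i in range(5)]
a0, decay_rate, half_life = exponential_half_life(first_five)
```

      A shorter half-life corresponds to faster attenuation of the fEP during HFS, which is the sense in which short-term depression is described as faster in PD than in dystonia.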

      Amendments to the manuscript:

      Added new panel to Fig 2

      Author response image 1.

      “Converging evidence from past animal and human studies suggests that dystonia is associated with impaired synaptic function and abnormal synaptic plasticity [35–37]. Compared to healthy controls, it has been shown that transcranial magnetic stimulation induced motor evoked potentials (MEPs) are hyperexcitable in dystonia [38,39], and somatosensory and motor cortical plasticity is greater [40]. Likewise, enhanced long-term potentiation at cortico-striatal synapses has been shown in rodent models of dystonia [41,42]. While our finding that long term potentiation effects are greater in PD compared to dystonia (Fig. 2D) is difficult to corroborate with this literature, one potential explanation can be that all of our PD patients are long-term users of levodopa. We have previously shown that the intake of this antiparkinsonian dopaminergic medication leads to potent increases in the amount of plasticity elicited in GPi [15]. Although patients are 12hr withdrawn from antiparkinsonian medications for surgery, it could be that striato-pallidal synapses are nevertheless chronically over-sensitized from prolonged use of dopaminergic medication; which is a well-known hypothesis related to the manifestation of levodopa-induced dyskinesias (a hyperkinetic feature) in PD [43]. Indeed, a lack of depotentiation of striato-pallidal projections has previously been observed in patients with levodopa-induced dyskinesias [44]. As such, excessive plasticity of these projections may corroborate hyperkinetic features of dystonia and levodopa-induced dyskinesias in PD.”

      In the Conclusion, it was difficult to understand the sentence about microcircuit interaction (line 232) and how it selectively modulates the efficacy of target synapses. Some further explanation here would be helpful. Also, it was not clear how these investigations (line 237) provide cellular-level support for closed-loop targeting. Could the reference to closed-loop targeting also be further explained? 

      We agree with the reviewer that the current wording may be confusing. We have changed the wording to be clearer. We have additionally added content related to closed-loop DBS based on chronic monitoring of evoked potential responses.

      Amendments to the manuscript:

      “Furthermore, chronic monitoring of evoked fields may allow for tracking of subcortical neuronal projections as indexed by inhibitory fields reported in this study.”

      future applications of DBS may also benefit from closed loop tuning of basal-ganglia-thalamo-cortical circuit dynamics and plasticity through chronic monitoring of evoked potential responses [56].

      How is the burst index calculated (Methods)? 

      Thank you for pointing out that the burst index definition was missing from the paper. It has now been added to the manuscript.

      Amendments to the manuscript:

      “The burst index was computed by taking the ratio of the means from a two-component Gaussian mixture model applied to the log interspike interval distribution, a modification of the previous mode-over-mean ISI method [20]”
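
      As a rough illustration of this definition, the sketch below fits a two-component Gaussian mixture to log interspike intervals with a basic expectation-maximization loop and takes the ratio of the two component means. Several details are assumptions made only for the example and are not stated in the text above: ISIs are taken in milliseconds, the natural logarithm is used, the ratio is the larger mean over the smaller mean, and the function names and the EM initialization are hypothetical.

```python
import math
import random


def fit_two_gaussians(values, n_iter=200):
    """Fit a two-component 1D Gaussian mixture by EM; return (means, sds, weights)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    # Initialize the two means from the lower and upper halves of the data.
    means = [sum(ordered[:half]) / half,
             sum(ordered[half:]) / (len(ordered) - half)]
    overall_mean = sum(values) / len(values)
    overall_sd = (sum((v - overall_mean) ** 2 for v in values) / len(values)) ** 0.5
    sds = [overall_sd, overall_sd]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each value.
        resp = []
        for v in values:
            dens = [w * math.exp(-0.5 * ((v - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
                    for w, m, s in zip(weights, means, sds)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: update means, standard deviations, and weights.
        for c in range(2):
            nc = max(sum(r[c] for r in resp), 1e-12)
            means[c] = sum(r[c] * v for r, v in zip(resp, values)) / nc
            var = sum(r[c] * (v - means[c]) ** 2 for r, v in zip(resp, values)) / nc
            sds[c] = max(var ** 0.5, 1e-6)
            weights[c] = nc / len(values)
    return means, sds, weights


def burst_index(isis_ms):
    """Ratio of the two mixture means of log(ISI); direction is an assumption."""
    log_isis = [math.log(t) for t in isis_ms]
    means, _, _ = fit_two_gaussians(log_isis)
    short_mean, long_mean = sorted(means)
    return long_mean / short_mean


# Synthetic spike train: short within-burst ISIs mixed with long between-burst ISIs.
rng = random.Random(1)
synthetic_isis = ([rng.lognormvariate(1.5, 0.3) for _ in range(300)]
                  + [rng.lognormvariate(4.0, 0.3) for _ in range(300)])
bi = burst_index(synthetic_isis)
```

      On this synthetic data the two fitted means should sit near the log-ISI values used to generate it; a more strongly bimodal ISI distribution yields a larger ratio under these assumptions.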

      Figures and figure captions are missing some details:

      Fig. 1 - What does shading represent? 

      The shading in Fig. 1 illustrates results that were significant before adjustment for multiple comparisons.

      Amendments to the manuscript:

      “Depicted scatterplots are results that were significant before correction for multiple comparisons”

      Fig. 2 - Can the stimulation artifact be labeled so as not to be confused with the physiological signal? Is A representing the average of all patients or just one example? Are there confidence intervals for this data as it's not clear if the curves are significantly different or not (may not be important to show if just one example)? Same for D. What is being plotted in E? Is this the exponential fitted on data? Can this be stated in the figure citation directly so readers don't have to find it in the text, where it may not be directly obvious which figure the analyses are being applied towards? 

      Thank you for your comments regarding Fig. 2. We have made the following revisions to address the concerns:

      To clarify the presence of stimulation artifacts and differentiate them from the physiological signal, we have updated Panel B and E in the updated Fig. 2 which highlight the stimulation artifacts accordingly.

      Regarding the comment about Panel A (now B in the updated figure), it represents one single example per disease, rather than an average of all patients.

      In response to the comment about what is plotted in Panel E, we have revised the figure caption to explicitly state that it includes the exponential fit on the data.

      Amendments to the manuscript:

      Figure 2 panel B and E now highlight stimulation artifacts.

      Author response image 2.

      Author response image 3.

      The figure captions could use more details, that can be taken from the text, so that readers can understand figures without searching for relevant details across the paper. 

      Thank you for your feedback. We have revised the figure captions accordingly to provide more details.

      Amendments to the manuscript:

      “Fig 1 – GPi spiketrain feature analyses and clinical correlates of PD and dystonia. With respect to (A) rate-based spiketrain features, firing rate was greater in PD while burst index (BI) and coefficient of variation (CV) were greater in dystonia; whereas no differences were found for (B) oscillatory spiketrain features for theta, alpha, low beta, high beta frequencies. MWU statistical results depicted are not corrected for multiple comparisons; after correction using the Bonferroni method, only CV and BI results remain significant (please see Supplementary Table 3). (C) In PD, the power of low beta spiketrain oscillations positively correlated (Spearman correlation) with symptom severity; in dystonia, neuronal firing rate negatively correlated with symptom severity, whereas CV and the power of theta spiketrain oscillations positively correlated with symptom severity. Depicted scatterplots are results that were significant before correction for multiple comparisons; however, none of the results persist after Benjamini-Hochberg correction for false discovery rate (please see Supplementary Table 4).”

      “Fig 2 – Long-term and short-term effects of HFS on striato-pallidal plasticity in PD and dystonia. (A) Schematic of the plasticity protocol to assess long-term plasticity via fEP amplitude comparisons pre- versus post-HFS and short-term plasticity via fEP dynamics during HFS. (B) Highlights example fEP traces for measuring long-term plasticity pre- versus post-HFS, with (C) displaying group-level fEP amplitudes pre- versus post-HFS across diseases. (D) Illustrates the amount of plasticity (i.e., percentage change in fEP amplitudes pre- versus post-HFS) in both PD and dystonia, with PD showing higher levels of plasticity. (E) Provides an example of fEP traces during HFS for assessing short-term plasticity, with (F) depicting group-level decay rates of fEP amplitudes using an exponential fit on the fEP amplitudes over the first 5 stimulus pulses across diseases. (G) Shows the half-life of the fitted exponential (i.e., rate of attenuation of fEP amplitudes) between PD and dystonia, with PD demonstrating faster fEP attenuation.”

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1:

      To gain further insight into the dynamics of microglial aging in the hippocampus, the authors used a bioinformatics method known as "pseudotime" or "trajectory inference" to understand how cells may progress through different functional states, as defined by cellular transcriptome (15,16). These bioinformatics approaches can reveal key patterns in scRNAseq / snRNAseq datasets and, in the present study, the authors conclude that a "stress response" module characterized by expression of TGFb1 represents a key "checkpoint" in microglial aging in midlife, after which the cells can move along distinct transcriptional trajectories as aging progresses. This is an intriguing possibility. However, pseudotime analyses need to be validated via additional bioinformatics as well as follow-up experiments. Indeed, Heumos et al, in their Nature Genetics "Expert Guidelines" Review, emphasize that "inferred trajectories might not necessarily have biological meaning." They recommend that "when the expected topology is unknown, trajectories and downstream hypotheses should be confirmed by multiple trajectory inference methods using different underlying assumptions."(15) Numerous algorithms are available for trajectory inference (e.g. Monocle, PAGA, Slingshot, RaceID/StemID, among many others) and their performance and suitability depends on the individual dataset and nature of the trajectories that are to be inferred. It is recommended to use dynGuidelines(16) for the selection of optimal pseudotime analysis methods. In the present manuscript, the authors do not provide any justification for their use of Monocle 3 over other trajectory inference approaches, nor do they employ a secondary trajectory inference method to confirm observations made with Monocle 3. Finally, follow-up validation experiments that the authors carry out have their own limitations and caveats (see below). 
      Hence, while the microglial aging trajectories identified by this study are intriguing, they remain hypothetical trajectories that need to be proven with additional follow-up experiments.

      We thank the reviewer for their suggestion. Following the dynGuidelines recommended by the reviewer, we selected an additional trajectory inference tool, SCORPIUS, based on the structure of our data. This tool provided additional support that microglia progress from a homeostatic state (Cx3cr1, Mef2c) to the induction of stress genes (Hspa1, Atf3) at an intermediate point during aging progression. Furthermore, we observe a concordant increase in ribosomal protein genes at a point in pseudotime immediately prior to activation of inflammation-related genes (Il1b, Cst7). These additional analyses support the main findings of our original pseudotime analysis and have been added to the manuscript as Figure S3C,D. Additionally, in the statistical test that uncovers differentially expressed genes along the pseudotime trajectory in this analysis, we find that Tgfb1 is one of the genes that is differentially expressed, with peak expression at an intermediate timepoint along the pseudotime trajectory. Furthermore, we have done some preliminary trajectory analysis with Slingshot (Street et al, BMC Genomics, PMID: 29914354) that found a similar trajectory with analogous gene expression patterns and dynamic expression of Tgfb1.
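      As an illustration of how agreement between two trajectory inference methods could be quantified, the sketch below computes the absolute Spearman rank correlation between two pseudotime orderings of the same cells. This is not the analysis performed in the manuscript: the function names and the toy pseudotime values are hypothetical, and the absolute value is taken because the direction of an inferred pseudotime axis is arbitrary.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def pseudotime_agreement(pt_a, pt_b):
    """Absolute Spearman correlation between two pseudotime vectors."""
    return abs(spearman(pt_a, pt_b))


# Hypothetical pseudotime values for six cells from two methods.
# The second method runs in the opposite direction and swaps two cells.
method_a = [0.0, 0.1, 0.25, 0.4, 0.7, 1.0]
method_b = [9.0, 8.5, 7.0, 7.2, 3.0, 1.0]
agreement = pseudotime_agreement(method_a, method_b)
```

      A value close to 1 would indicate that both methods order the cells similarly; it does not by itself establish that the shared ordering is biologically meaningful.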

      To follow up on the idea that TGFb1 signaling in microglia plays a key role in determining microglial aging trajectories, the authors use RNAscope to show that TGFb1 levels in microglia peak in middle age. They also treat primary LPS-activated microglia with TGFb1 and show that this restores expression of microglial homeostatic gene expression and dampens expression of stress response and, potentially, inflammatory genes. Finally, they utilize transgenic approaches to delete TGFb1 from microglia around 8-10mo of age and scRNAseq to show that homeostatic signatures are lost and inflammatory signatures are gained. Hence, findings in this study support the idea that TGFb1 can strongly regulate microglial phenotype. Loss of TGFb1 signaling to microglia in adulthood has already been shown to cause decreased microglial morphological complexity and upregulation of genes typically associated with microglial responses to CNS insults(17-19). TGFb1 signaling to microglia has also been implicated in microglial responses to disease and manipulations to increase this signaling can improve disease progression in some cases(19). In this light, the findings in the present study are largely confirmatory of previous findings in the literature. They also fall short of unequivocally demonstrating that TGFb1 signaling acts as a "checkpoint" for determining subsequent microglial aging trajectory. To show this clearly, one would need to perturb TGFb1 signaling around 12mo of age and carry out sequencing (bulkRNAseq or scRNAseq) of microglia at 18mo and 24mo. Such experiments could directly demonstrate whether the whole microglial population has been diverted to the TGFb1-low aging trajectory (that progresses through a translational burst state to an inflammation state as proposed). 
      Future development of tools to tag TGFb1 high or low microglia could also enable fate tracing type experiments to directly show whether the TGFb1 state in middle age predicts cell state at later phases of aging.

      We apologize for the use of the term “checkpoint” when referring to the role of Tgfb1 in microglial aging. Instead, our model posits that Tgfb1 expression increases in response to the early insults of the aging process in an attempt to return microglia to homeostasis. Therefore, this would predict that increasing TGFB1 levels after an insult would decrease activation and age-related progression of microglia, which we demonstrate in vitro (Figure 3). Alternatively, the loss of TGFB1 should prevent microglia from returning to a homeostatic state after an age-related stressor, and thus increase the number of microglia in activated states. We observe this increase in activated microglia in our middle-aged microglia-specific Tgfb1 knockout mouse model. Furthermore, the haploinsufficiency of Tgfb1 at this age indicates that TGFB1 signaling in microglia is sensitive to relative levels of Tgfb1. The transient increase in Tgfb1 expression further suggests that the threshold for TGFB1 signaling is dynamic. Finally, RNA-Seq analysis of both in vitro TGFB1-supplemented microglia and in vivo Tgfb1-depleted microglia highlights that TGFB1 alters the aging microglia transcriptome. Combined, these results provide evidence that Tgfb1 modulates advancement of microglia through an aging continuum.

      The present study would also like to draw links between features of microglial aging in the hippocampus and a decline in hippocampal-dependent cognition during aging. To this end, they carry out behavioral testing in 8-10mo old mice that have undergone microglial-specific TGFb1 deletion and find deficits in novel object recognition and contextual fear conditioning. While this provides compelling evidence that TGFb1 signaling in microglia can impact hippocampus-dependent cognition in midlife, it does not demonstrate that this signaling accelerates or modulates cognitive decline (see below). Age-associated cognitive decline refers to cognitive deficits that emerge as a result of the normative brain aging process (20-21). For a cognitive deficit to be considered age-associated cognitive decline, it must be shown that the cognitive operation under study was intact at some point earlier in the adult lifespan. This requires longitudinal study designs that determine whether a manipulation impacts the relationship between brain status and cognition as animals age (22-24). Alternatively, cross-sectional studies with adequate sample sizes can be used to sample the variability in cognitive outcomes at different points of the adult lifespan (22-24) and show that this is altered by a particular manipulation. For this specific study, one would ideally demonstrate that hippocampal-based learning/memory was intact at some point in the lifespan of mice with microglial TGFb1 KO but that this manipulation accelerated or exacerbated the emergence of deficits in hippocampal-dependent learning/memory during aging. In the absence of these types of data, the authors should tone down their claims that they have identified a cellular and molecular mechanism that contributes to cognitive decline.

      We agree with the reviewer that to adequately demonstrate an age-dependent effect of microglia-derived TGFB1 on cognition it is necessary to perturb microglial TGFB1 at young and mature ages and assess the age-dependent effect on cognition. To address this, we have now performed a complementary behavioral study utilizing the Tmem119-CreER mouse model to drive the microglia-specific excision of Tgfb1 in two separate cohorts of mice – one young (2-3 months) and one mature (7-8 months) – followed by cognitive testing. Using the novel object recognition test, we find that young mice of all genotypes (WT, Tgfb1 Het and Tgfb1 cKO) retain the ability to recognize the novel object (as determined by having a significant preference in exploring the novel object). In contrast, only the WT mature mice demonstrate a preference for the novel object, while the Tgfb1 Het and Tgfb1 cKO mice show no preference for the novel object. These behavioral data demonstrate an age-dependent necessity for microglia-specific TGFB1 in maintaining proper hippocampal-dependent memory and are now included in the manuscript as revised Figure 4I-J. We have also included additional behavioral tests (Y-Maze and open field) that did not show any difference between the genotypes as Figure S6D-G. Unfortunately, we were unable to perform the fear conditioning testing, as our apparatus broke during this time. Together, these results reveal that there is an age-dependent necessity for microglia-derived TGFB1 for hippocampal-dependent cognitive function.

      A final point of clarification for the reader pertains to the mining of previously generated data sets within this study. The language in the results section, methods, and figure legends causes confusion about which experiments were actually carried out in this study versus previous studies. Some of the language makes it sound as though parabiosis experiments and experiments using mouse models of Alzheimer's Disease were carried out in this study. However, parabiosis and AD mouse model experiments were executed in previous studies (25,26), and in the present study, RNAseq datasets were accessed for targeted data mining. It is fantastic to see further mining of datasets that already exist in the field. However, descriptions in the results and methods sections need to make it crystal clear that this is what was done.

      The reviewer makes an excellent point. While we referenced the public dataset in the original manuscript, the citation style of superscripted numbers diminishes our ability to adequately reference the datasets. Therefore, we have added the names of the first authors (Palovics for the parabiosis dataset and Sala Frigerio for the Alzheimer’s Disease dataset) to all the instances in the results and figure legends when we refer to these datasets.

      Additional recommendations:

      Major comments.

      (1) There is some ambiguity surrounding how to interpret the microglial TGFb1 knockout that seems incompatible with viewing this molecule as a "checkpoint" in microglial aging. TGFb1 is believed to be primarily produced by microglia. Secreted TGFb1 is then detected by microglial TGFbR2. Are the microglia that have high levels of TGFb1 in middle age signaling to themselves (autocrine signaling)? Or contributing to a local milieu that impacts multiple neighbor microglia (paracrine signaling)? The authors could presumably look in their own dataset to evaluate microglial capacity to detect TGFb1 via its receptors.

      We thank the reviewer for this insightful suggestion. We have undertaken analysis of our dataset to assess whether Tgfb1 acts through autocrine or paracrine signaling. To do so, we reanalyzed our microglia aging scRNA-Seq dataset leveraging the variation in microglia Tgfb1 expression to probe the relative activity of TGFB1. Specifically, we partitioned microglia into quartiles based on their Tgfb1 expression, and subsequently investigated the expression of TGFB signaling effectors and targets. High expression of downstream TGFB signaling pathway components in microglia with high Tgfb1 expression would point to autocrine mechanisms while, alternatively, high expression of downstream TGFB signaling pathway components in microglia with low Tgfb1 expression would point to paracrine mechanisms. We observed highest expression of TGFB signaling pathway components and targets in microglia with the highest expression of Tgfb1. These data suggest that Tgfb1 acts through an autocrine mechanism. These results have been added to our manuscript as Figure S4E-G. Additionally, while our manuscript was under review, a paper by Bedolla et al (Nature Communications 2024; PMID: 38906887) was published that investigated the role of Tgfb1 in adult microglia. This paper utilized orthogonal techniques – sparse microglia-specific Tgfb1 knockout and IHC - to also suggest that microglia utilize autocrine Tgfb1 signaling. Together, these complementary data provide strong evidence that Tgfb1 acts through an autocrine mechanism in adult microglia.
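      The quartile-based reasoning above can be sketched in a few lines. This is a minimal illustration on synthetic data, not the authors' pipeline: the function names, the rank-based quartile assignment, and the toy "pathway score" are all assumptions made for the example. Real scRNA-Seq data contain many zero counts, so ties would need more careful handling than the simple rank split used here.

```python
import random
from statistics import mean


def assign_quartiles(values):
    """Label each value 1-4 by rank (1 = lowest quarter, 4 = highest)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, idx in enumerate(order):
        labels[idx] = 1 + (4 * rank) // len(values)
    return labels


def mean_score_by_quartile(ligand_expr, pathway_score):
    """Mean downstream pathway score within each ligand-expression quartile."""
    labels = assign_quartiles(ligand_expr)
    groups = {q: [] for q in (1, 2, 3, 4)}
    for q, score in zip(labels, pathway_score):
        groups[q].append(score)
    return {q: mean(scores) for q, scores in groups.items()}


# Synthetic toy data (hypothetical): 400 cells whose pathway score rises
# with the cell's own ligand expression, mimicking an autocrine pattern.
rng = random.Random(0)
tgfb1 = [rng.random() for _ in range(400)]
score = [0.8 * x + rng.gauss(0, 0.05) for x in tgfb1]

result = mean_score_by_quartile(tgfb1, score)
```

      Under the logic described in the response, a pathway score that is highest in the top quartile would be read as consistent with autocrine signaling, whereas a score that is highest in the bottom quartile would point toward paracrine signaling.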

      (2) Conclusions of the study rest on the assumption that microglial inflammatory responses are a central driver of cognitive decline. They assume that manipulations that increase microglial progression into an inflammatory state will negatively impact cognitive function. Although there are certainly a lot of data in the field that inflammatory factors can impact synaptic function, additional experiments would be required to unequivocally demonstrate that a "TGFb1 dependent" progression of microglia to an inflammatory state underlies any observed changes in cognition. For example, in the context of microglial TGFb1 deletion, can NSAIDs or blockers of soluble TNFa (e.g. XENP345), or blockers of SPP1, etc. rescue behavior? Can microglial depletion in this context rescue behavior? Assuming behavior was carried out in the same microglial TGFb1 KO mice that were used for microglial scRNAseq, they could also carry out linear regression-type analyses to link microglial inflammatory status to the behavioral performance of individual mice. In the absence of additional evidence of this sort, the authors should tone down claims about mechanistic relationships between microglial state and cognitive performance.

      We thank the reviewer for realizing that the link between cognition and inflammation in our paper is speculative. Therefore, we have taken the reviewer’s advice and toned down the claims linking inflammation to cognition in our manuscript. Instead, we connect the disruption in cognition to what is observed in our data, a loss of microglia homeostasis and a shift in the microglia aging trajectories.

      Additional Recommendations:

      Minor comments:

      (1) Ideally at some point in the results or discussion, the authors should acknowledge that the hippocampus has highly distinct sub-regions and that microglia show different functions and properties across these sub-regions (e.g. microglia in hilus and subgranular zone vs microglia in stratum radiatum, vs microglia immediately adjacent to or embedded within stratum pyramidale). Do expression levels of TGFb1 and microglial aging trajectories vary across sub-regions? To what extent can this account for heterogeneity of aging trajectories observed in microglial aging within the hippocampus?

      We are interested in how microglia heterogeneity during aging is influenced by the specific functions, and thus microenvironments within the hippocampus. Therefore, we have expanded our IHC analysis of microglia to determine how the microenvironment influences microglia phenotypes by looking at several different regions of the hippocampus. We have included this regional analysis as Figure S2 in the manuscript. This analysis has revealed region-specific effects on microglia activation during aging.

      (2) For immunohistochemistry data, it is not particularly convincing to see one example of one cell from each condition. Generally, an accepted approach in the field is to present lower magnification images accompanied by zoom panels for several cells from each field of view. This reassures the reader that specific cells haven't simply been "cherry-picked" to support a particular conclusion.

      To allay the concerns of the reviewer that cells haven’t been “cherry-picked”, we have provided low magnification images for the aging CD68 and NFκB stains in Supplemental Figure S2.

      (3) In immunohistochemistry data, have measures been taken to ensure that observed signals are not simply autofluorescence that becomes prominent in tissues with aging? (i.e. use of trueblack or photoquenching of tissue prior to staining) See PMID 37923732

      We agree that autofluorescence, at least partially due to the accumulation of lipofuscin, becomes prominent in certain regions and cells of the hippocampus during aging. This most prominently occurs in the microglia of the hilus. This autofluorescence has a particular subcellular distribution, as it is localized to lyso-endosomal bodies. The microglia activation marker CD68 is also localized to lysosomes. A previous publication by Burns et al (eLife; PMID: 32579115) identified autofluorescent microglia (AF+) with unique molecular profiles that accumulate with age. They posited that these AF+ microglia resembled other microglia subsets that have pronounced storage compartments, such as the pro-inflammatory lipid droplet-containing microglia that accumulate with age reported by Marschallinger et al (Nature; PMID: 31959936). As such, autofluorescence present in microglia potentially represents distinctive and functional states of microglia. Our CD68 immunostaining accumulates with age, which could overlap with autofluorescent storage bodies. Thus, we performed a complementary CD68 immunostaining in an independent cohort of young (3 months) and aged (24 months) mice with the autofluorescence quencher TrueBlack, and found that the staining pattern and accumulation of CD68 microglia with age persisted as previously observed after use of this quencher (see Author response image 1). Images are IBA1 (cyan) and CD68 (yellow) with the molecular layer (ML), granule cell (GC), and hilus illustrated and corresponding quantification provided (Two-way ANOVA with Sidak’s multiple comparisons test; ***P<0.001; ****P<0.0001).

      We would like to note that the subcellular localization of the other immunostainings included in the manuscript was distinct from CD68, and not likely to be associated with the autofluorescent storage bodies. Additionally, our RNAScope staining for Tgfb1 did not show an accumulation with age, but rather a transient increase at 12 months of age, which indicates that the interpretation of the RNAScope stain for Tgfb1 was not unduly influenced by autofluorescence.

      Author response image 1.

      (4) Ideally, more care is needed with the language used to describe microglial state during aging. The terms "dystrophic," "dysfunctional," and "inflammatory" all carry their own implications and assumptions. Many changes exhibited by microglia during aging can initially be adaptive or protective, particularly during middle age. Without additional experiments to show that specific microglial attributes during aging are actively detrimental to the tissue and additional experiments to show that microglia have ceased to be capable of engaging in many of their normal actions to support tissue homeostasis, the authors should exercise caution in using terms like dysfunctional.

      We appreciate the reviewer’s suggestion. To allay the concerns of the reviewer about the multiple implications of terms such as “dysfunctional” and “inflammatory”, we have tried to replace them throughout the text with more specific terms.

      Reviewer #2:

      That said, given what we recently learned about microglia isolation for RNA-seq analysis, there is a danger that some of the observations are a result of not age, but cell stress from sample preparation (enzymatic digestion 10min at 37C; e.g. PMID: 35260865). Changes in cell state distribution along aging were made based on scRNA-seq and were not corroborated by any other method, such as imaging of cluster-specific marker expression in microglia at different ages. This analysis would allow confirming the scRNA-seq data and would also give us an idea of where the subsets are present within the hippocampus, and whether there is any interesting distribution of cell states (e.g. some are present closer to stem cells?). Since TGFb is thought to be crucial to microglia biology, it would be valuable to include more analysis of the mice with microglia-specific Tgfb deletion e.g. what was the efficiency of recombination in microglia? Did their numbers change after induction of Tgfb deletion in Cx3cr1-creERT2::Tgfb-flox mice?

      We thank the reviewer for their comment regarding potential ex vivo transcriptional alterations with the approaches used in our study. We performed our aging microglia scRNA-Seq characterization prior to the release of Marsh et al (Nature Neuroscience; PMID: 35260865), which revealed the potential transcriptional artefacts induced by isolation. That being said, we took great care to minimize the amount of time samples were subjected to enzymatic digestion (15 minutes) and kept cells at 4C during the remainder of the isolation. Furthermore, we performed all isolations simultaneously, so that transcriptional changes induced by the isolation would be present across all ages and should not be observed during our analysis unless indicative of a true age-related change. Additionally, we have corroborated changes in cell state distribution across ages using several markers (Tgfb1 and KLF2 for the intermediate stress state, S6 for the translation state, and NFKB and CD68 for activation states). In the revised manuscript, we have added additional hippocampal subregion analysis of several IHC immunostains to provide spatial insights into the microglia aging process (Figure S2). This analysis reveals unique spatial dynamics of microglia aging. For example, as the reviewer foresaw, we found that the granule cell layer (the location of adult hippocampal neurogenesis) had a more pronounced age-associated progression of microglial activation than several other regions. A subset of regions had minimal levels of activation during aging, such as the molecular layer and the stratum radiatum of the CA1 (inner CA1 in the manuscript) – regions enriched in synaptic terminals. Furthermore, this analysis highlights the susceptibility of microglia aging to microenvironmental influences.

      Regarding the temporally controlled microglia-specific genetic KO mouse model used in our original submission, the Cx3cr1-CreER allele selected (B6.129P2(Cg)-Cx3cr1tm2.1(cre/ERT2)Litt/WganJ) has been reported to have very high recombination efficiency (~94% in Parkhurst et al (Cell; PMID: 24360280)), and we used a tamoxifen induction protocol very similar to Faust et al. (Cell Reports; PMID: 37635351) that achieved ~98% recombination (they injected 100mg/kg for 5 days, while we injected 90mg/kg for 5 days). We analyzed our scRNA-Seq data for the expression of Tgfb1 and found that the knockout mice had a 67% reduction in cells expressing higher levels of Tgfb1 (see panel A in Author response image 2). This is likely a large underestimate of the recombination efficiency, as exon 3 is floxed and residual nonfunctional transcripts could be present, given nonsense-mediated decay is not realized in a number of knockout lines (Lindner et al, Methods, PMID: 33838271). We likely achieved a much higher excision efficiency. We would like to highlight that our data indicating increased microglia activation after tamoxifen treatment (Figure S5A) and the involvement of autonomous signaling (Figure S4E-G) are consistent with recently published work by Bedolla et al (Nature Communications; PMID: 38906887). Additionally, as part of the revision process, we have now corroborated our behavioral data using an independent temporally controlled microglia-specific KO mouse model - Tmem119-CreER::Tgfb1 knockout mice (Figure 4I-K). We performed qPCR on sorted microglia to determine RNA levels in wildtype and knockout mice. Relative levels of Tgfb1 and exon 3 of Tgfb1 (the floxed exon) on technical replicates of 3 pooled samples indicated overall loss of Tgfb1 expression, as well as undetectable levels of exon 3 as normalized to Actb (see panel B in Author response image 2).
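      For readers unfamiliar with how qPCR levels are normalized to a reference gene such as Actb, one common approach is the comparative Ct (2^-ΔΔCt) calculation sketched below. The response does not state which quantification method was used, so this is an assumption for illustration only; the function name and the Ct values are hypothetical, and the formula assumes roughly 100% amplification efficiency for both primer pairs.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Fold change of a target gene in a sample relative to a control,
    normalized to a reference gene, using the 2^-ddCt calculation."""
    # Normalize the target gene to the reference gene within each group.
    delta_sample = ct_target_sample - ct_ref_sample
    delta_control = ct_target_control - ct_ref_control
    # Each extra cycle corresponds to roughly a two-fold lower template amount.
    return 2 ** -(delta_sample - delta_control)


# Hypothetical Ct values: the target amplifies two cycles later in the
# knockout than in wildtype, while the reference gene is unchanged.
fold_change = relative_expression(
    ct_target_sample=26.0, ct_ref_sample=18.0,    # knockout
    ct_target_control=24.0, ct_ref_control=18.0,  # wildtype
)
```

      With these toy numbers the target is expressed at one quarter of the wildtype level. A transcript that fails to amplify at all has no Ct value and is usually reported as undetectable rather than assigned a fold change.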

      Author response image 2.

      With respect to the effects of aging and Tgfb1 on microglia density, we find a slight region-specific increase in microglia density with age (see Author response image 3). The density of Iba1 cells across hippocampal regions was analyzed at 3 and 24 months of age (see panel A in Author response image 3) and along an aging continuum at 3, 6, 12, 18, and 24 months (see panel B in Author response image 3). These data are also included in the revised manuscript (Figure S2D-F).

      Author response image 3.

      Deletion of Tgfb1 also had region-specific effects on microglia. While there was no difference in microglia density between wildtype and heterozygous mice, there was a significant increase in microglia density in the hilus and molecular layers in knockout mice (see Author response image 4); these data are included in the revised manuscript (Figure S5A). These data indicate that there are subtle region-specific increases in microglia density with age, as well as following the deletion of Tgfb1 from microglia of mature mice.

      Author response image 4.

      Additional Recommendations:

      (1) The problem of possible digestion artifacts in scRNA-seq should be at least addressed in the discussion as a caveat in data interpretation. Staining for unique cluster markers in undigested tissue would solve the problem. It can be done with microscopy or using flow cytometry, but for this microglia, isolation should be done with no enzymes or with Actinomycin (PMID: 35260865).

      The ex vivo activation signature uncovered by Marsh et al. (Nature Neuroscience; PMID: 35260865) arises from the digestion methods used to isolate microglia. We took the utmost care in processing our microglia identically within experiments, which should minimize the amount of uneven ex vivo activation of microglia. This is borne out by the structure of our single-cell sequencing data. Unlike Marsh et al., who observed a unique cluster after addition of their inhibitors, we do not see any clusters unique to a single condition, suggesting that any influence of ex vivo activation was evenly distributed.

      Importantly, as suggested by the reviewer, we have complemented our scRNA-Seq analysis by corroborating several markers for various stages of microglia aging progression using RNAScope and IHC in intact tissue. Specifically, the transient age-dependent increase in Tgfb1 high microglia was confirmed using RNAScope (Figure 3B), the age-related increase in ribosomal high microglia was confirmed using S6 immunostaining (Figure 3I), and the increase of various markers of age-associated activation (C1q, CD68 and NFkB) was confirmed using immunostaining (Figure 1F and Figure S2D-I). Additionally, we have also performed immunostainings for KLF2 and confirmed peak microglia expression at 18 months of age with lower levels at 24 months of age (Figure 2H).

      (2) The figures of GO and violin plots are not easy to follow sometimes... what are the data points in the violin plots, maybe worth showing them as points? For the GO, e.g. in 3D, 3J, including a short description of the figure could help, e.g. in Figure 1. it was clear.

      We chose not to include the datapoints in the violin plots for aesthetic purposes. Each violin plot would have had hundreds of points that would have made the plots very busy and hidden the structure of the distribution. In Author response image 5 we show the violin plot in Figure 2M with (panel A) and without (panel B) individual points. In a small format, the points overlap and become jumbled together. Therefore, we chose to present the violin plots without points for clarity on the data structure. As for the gene ontology plots in Figure 3, we have updated the descriptions in both the text and figure legends to provide clarification on what they represent.

      Author response image 5.

      (3) I'm very curious to see the mechanism of action of "aged" microglia in the TGFb-depletion model. Is it creating hostile conditions for stem cells, or we have increased synapse loss? Something else?

      We thank the reviewer for their insightful questions. We would like to note that during the revision process of our manuscript, a complementary study was published reporting that the loss of microglia-derived Tgfb1 leads to an aberrant increase in the density of dendritic spines in the CA1 region of the hippocampus (Bedolla et al, Nature Communications, PMID: 38906887). The data from Bedolla et al show neurons in the CA1 sparsely labeled with an mGreenLantern-expressing virus in mice that had Tgfb1 deleted from microglia using the Cx3cr1-CreERT driver (Figure 7U,V). Additionally, McNamara et al (Nature; PMID: 36517604) demonstrated that microglia-derived Tgfb1 signaling regulates myelin integrity during development, and several studies have revealed links between Tgfb1 signaling and altered neurogenesis (e.g., He et al, Nature, PMID: 24859199 and Dias et al, Neuron, PMID: 25467979). Together, this growing body of work indicates that microglia-derived TGFB1 regulates myelination, neurogenesis and synaptic plasticity, which have all been shown to play a role in cognition.

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      The authors have developed a compelling coarse-grained simulation approach for nucleosome-nucleosome interactions within a chromatin array. The data presented are solid and provide new insights that allow for predictions of how chromatin interactions might occur in vivo, but some of the claims should be tempered. The tools will be valuable for the chromosome biology field.

      Response: We want to thank the editors and all the reviewers for their insightful comments. We have made substantial changes to the manuscript to improve its clarity and temper necessary claims, as detailed in the responses, and we performed additional analyses to address the reviewers’ concerns. We believe that we have successfully addressed all the comments, and the quality of our paper has improved significantly.

      In the following, we provide point-to-point responses to all the reviewer comments. 

      RESPONSE TO REFEREE 1:

      Comment 0: This study develops and applies a coarse-grained model for nucleosomes with explicit ions. The authors perform several measurements to explore the utility of a coarse-grained simulation method to model nucleosomes and nucleosome arrays with explicit ions and implicit water. ’Explicit ions’ means that the charged ions are modeled as particles in simulation, allowing the distributions and dynamics of ions to be measured. Since nucleosomes are highly charged and modulated by charge modifications, this innovation is particularly relevant for chromatin simulation.

      Response: We thank the reviewer for the excellent summary of the work.

      Comment 1: Strengths: This simulation method produces accurate predictions when compared to experiments for the binding affinity of histones to DNA, counterion interactions, nucleosome DNA unwinding, nucleosome binding free energies, and sedimentation coefficients of arrays. The variety of measured quantities makes both this work and the impact of this coarse-grained methodology compelling. The comparison between the contributions of sodium and magnesium ions to nucleosome array compaction, presented in Figure 3, was exciting and a novel result that this simulation methodology can assess.

      Response: We appreciate the reviewer’s strong assessment of the paper’s significance, novelty, and broad interest, and we thank him/her for the detailed suggestions and comments.

      Comment 2: Weaknesses: The presentation of experimental data as representing in vivo systems is a simplification that may misrepresent the results of the simulation work. In vivo, in this context, typically means experimental data from whole cells. What one could expect for in vivo experimental data is measurements on nucleosomes from cell lysates where various and numerous chemical modifications are present. On the contrary, some of the experimental data used as a comparison are from in vitro studies. In vitro in this context means nucleosomes were formed ’in a test tube’ or under controlled conditions that do not represent the complexity of an in vivo system. The simulations performed here are more directly compared to in vitro conditions. This distinction likely impacts to what extent these simulation results are biologically relevant. In vivo and in vitro differences could be clarified throughout and discussed.

      Response: As detailed in Response to Comment 3, we have made numerous modifications in the Introduction, Results, and Discussion Section to emphasize the differences between reconstituted and native nucleosomes. The newly added texts also delve into the utilization of the interaction strength measured for reconstituted nucleosomes as a reference point for conceptualizing the interactions among native nucleosomes.

      Comment 3: In the introduction (pg. 3), the authors discuss the uncertainty of nucleosome-to-nucleosome interaction strengths in vivo. For example, the authors discuss works such as Funke et al. However, Funke et al. used reconstituted nucleosomes from recombinant histones with one controlled modification (H4 acetylation). Therefore, this study that the authors discuss is measuring nucleosome’s in vitro affinity, and there could be significant differences in vivo due to various posttranslational modifications. Please revise the introduction, results section "Close contacts drive nucleosome binding free energy," and discussion to reflect and clarify the difference between in vitro and in vivo measurements. Please also discuss how biological variability could impact your findings in vivo. The works of Alexey Onufriev’s lab on the sensitivity of nucleosomes to charge changes (10.1016/j.bpj.2010.06.046, 10.1186/s13072-018-0181-5), such as some PTMs, are one potential starting place to consider how modifications alter nucleosome stability in vivo.

      Response: We thank the reviewer for the insightful comments and agree that native nucleosomes can differ from reconstituted nucleosomes due to the presence of histone modifications.

      We have revised the introduction to emphasize the differences between in vitro and in vivo nucleosomes. The new text now reads

      "The relevance of physicochemical interactions between nucleosomes to chromatin organization in vivo has been constantly debated, partly due to the uncertainty in their strength [cite]. Examining the interactions between native nucleosomes poses challenges due to the intricate chemical modifications that histone proteins undergo within the nucleus and the variations in their underlying DNA sequences [cite]. Many in vitro experiments have opted for reconstituted nucleosomes that lack histone modifications and feature well-positioned 601-sequence DNA to simplify the chemical complexity. These experiments aim to establish a fundamental reference point for understanding the strength of interactions within native nucleosomes. Nevertheless, even with reconstituted nucleosomes, a consensus regarding the significance of their interactions remains elusive. For example, using force-measuring magnetic tweezers, Kruithof et al. estimated the inter-nucleosome binding energy to be ∼14 kBT [cite]. On the other hand, Funke et al. introduced a DNA origami-based force spectrometer to directly probe the interaction between a pair of nucleosomes [cite], circumventing any potential complications from interpretations of single molecule traces of nucleosome arrays. Their measurement reported a much weaker binding free energy of approximately 2 kBT. This large discrepancy in the reported reference values complicates a further assessment of the interactions between native nucleosomes and their contribution to chromatin organization in vivo."

      We modified the first paragraph of the results section to read

      "Encouraged by the explicit ion model’s accuracy in reproducing experimental measurements of single nucleosomes and nucleosome arrays, we moved to directly quantify the strength of inter-nucleosome interactions. We once again focus on reconstituted nucleosomes for a direct comparison with in vitro experiments. These experiments have yielded a wide range of values, ranging from 2 to 14 kBT [cite]. Accurate quantification will offer a reference value for conceptualizing the significance of physicochemical interactions among native nucleosomes in chromatin organization in vivo."

      New text was added to the Discussion Section to emphasize the implications of simulation results for interactions among native nucleosomes.

      "One significant finding from our study is the predicted strong inter-nucleosome interactions under the physiological salt environment, reaching approximately 9 kBT. We showed that the much lower value reported in a previous DNA origami experiment is due to the restricted nucleosomal orientation inherent to the device design. Unrestricted nucleosomes allow more close contacts to stabilize binding. A significant nucleosome binding free energy also agrees with the high forces found in single-molecule pulling experiments that are needed for chromatin unfolding [cite]. We also demonstrate that this strong inter-nucleosomal interaction is largely preserved at longer nucleosome repeat lengths (NRL) in the presence of linker histone proteins. While posttranslational modifications of histone proteins may influence inter-nucleosomal interactions, their effects are limited, as indicated by Ding et al. [cite], and are unlikely to completely abolish the significant interactions reported here. Therefore, we anticipate that, in addition to molecular motors, chromatin regulators, and other molecules inside the nucleus, intrinsic inter-nucleosome interactions are important players in chromatin organization in vivo."

      The suggested references (10.1016/j.bpj.2010.06.046, 10.1186/s13072-018-0181-5) are now included as citations # 44 and 45.

      Comment 4: Due to the implicit water model, do you know if ions can penetrate the nucleosome more? For example, does the lack of explicit water potentially cause sodium to cluster in the DNA grooves more than is biologically relevant, as shown in Figure 1?

      Response: We thank the reviewer for the insightful comments. The parameters of the explicit-ion model were deduced from all-atom simulations and fine-tuned to replicate crucial aspects of the local ion arrangements around DNA (1). The model’s efficacy was demonstrated in reproducing the radial distribution function of Na+ and Mg2+ ion distributions in the proximity of DNA (see Author response image 1). Consequently, the number of ions near DNA in the coarse-grained models aligns with that observed in all-atom simulations, and we do not anticipate any significant, unphysical clustering. It is worth noting that previous atomistic simulations have also reported the presence of a substantial quantity of Na+ ions in close proximity to nucleosomal DNA (refer to Author response image 2).

      Author response image 1.

      Comparison between the radial distribution functions of Na+ (left) and Mg2+ (right) ions around the DNA phosphate groups computed from all-atom (black) and coarse-grained (red) simulations. Figure adapted from Figure 4 of Ref. 1. The coarse-grained explicit ion model used in producing the red curves is identical to the one presented in the current manuscript. (© 2011, AIP Publishing. This figure is reproduced with permission from Figure 4 in Freeman GS, Hinckley DM, de Pablo JJ (2011) A coarse-grain three-site-per-nucleotide model for DNA with explicit ions. The Journal of Chemical Physics 135:165104. It is not covered by the CC-BY 4.0 license and further reproduction of this figure would need permission from the copyright holder.)

      Author response image 2.

      Three-dimensional distribution of sodium ions around the nucleosome determined from all-atom explicit solvent simulations. Darker blue colors indicate higher sodium density, and a high density of sodium ions around the DNA is clearly visible. The crystallographically identified acidic patch has been highlighted as spheres on the surface of the histone core and a high level of sodium condensation is observed around these residues. Figure adapted from Ref. 2. (© 2009, American Chemical Society. This figure is reproduced with permission from Figure 7 in Materese CK, Savelyev A, Papoian GA (2009) Counterion Atmosphere and Hydration Patterns near a Nucleosome Core Particle. J. Am. Chem. Soc. 131:15005–15013. It is not covered by the CC-BY 4.0 license and further reproduction of this figure would need permission from the copyright holder.)

      Comment 5: Histone side chain to DNA interactions, such as histone arginines to DNA, are essential for nucleosome stability. Therefore, can the authors provide validation or references supporting your model of the nucleosome with one bead per amino acid? I would like to see if the nucleosomes are stable in an extended simulation or if similar dynamic motions to all-atom simulations are observed.

      Response: The nucleosome model, which employs one bead per amino acid and lacks explicit ions, has undergone extensive calibration and has found application in numerous prior studies. For instance, the de Pablo group utilized a similar model to showcase its ability to accurately replicate the experimentally measured nucleosome unwinding free energy penalty (3), sequence-dependent nucleosome sliding (4), and the interaction between two nucleosomes (5). Similarly, the Takada group employed a comparable model to investigate acetylation-modulated tri-nucleosome structures (6), chromatin structures influenced by chromatin factors (7), and nucleosome sliding (8). Our group also employed this model to study the structural rearrangement of a tetranucleosome (9) and the folding of larger chromatin systems (10). In cases where data were available, simulations frequently achieved quantitative reproduction of experimental results.

      We added the following text to the manuscript to emphasize previous studies that validate the model accuracy.

      "We observe that residue-level coarse-grained models have been extensively utilized in prior studies to examine the free energy penalty associated with nucleosomal DNA unwinding [cite], sequence-dependent nucleosome sliding [cite], binding free energy between two nucleosomes [cite], chromatin folding [cite], the impact of histone modifications on tri-nucleosome structures [cite], and protein-chromatin interactions [cite]. The frequent quantitative agreement between simulation and experimental results supports the utility of such models in chromatin studies. Our introduction of explicit ions, as detailed below, further extends the applicability of these models to explore the dependence of chromatin conformations on salt concentrations."

      We agree that arginines are important for nucleosome stability. Since we assign positive charges to these residues, their contribution to DNA binding can be effectively captured. The model’s ability in reproducing nucleosome stability is supported by the good agreement between the simulated free energy penalty associated with nucleosomal DNA unwinding and experimental value estimated from single molecule experiments (Figure 1).

      To further evaluate nucleosome stability in our simulations, we conducted a 200-ns-long simulation of a nucleosome featuring the 601-sequence under physiological salt conditions– 100 mM NaCl and 0.5 mM MgCl2, consistent with the conditions in Figure 1 of the main text. We found that the nucleosome maintains its overall structure during this simulation. The nucleosome’s radius of gyration (Rg) remained proximate to the value corresponding to the PDB structure (3.95 nm) throughout the entire simulation period (see Author response image 3).

      Author response image 3.

      Time trace of the radius of gyration (Rg) of a nucleosome with the 601-sequence along an unbiased, equilibrium trajectory. It is evident the Rg fluctuates around the value found in the PDB structure (3.95 nm), supporting the stability of the nucleosome in our simulation.

      Occasional fluctuations in Rg corresponded to momentary, partial unwrapping of the nucleosomal DNA, a phenomenon observed in single-molecule experiments. However, we advise caution due to the coarse-grained nature of our simulations, which prevents a direct mapping of simulation timescale to real time. Importantly, the rate of DNA unwrapping in our simulations is notably overestimated.
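
      The stability check described above relies on the radius of gyration of the coarse-grained beads. As a minimal sketch of how such a time trace could be computed, assuming the trajectory is available as a NumPy array of bead coordinates (the function names and the optional mass weighting are illustrative assumptions, not the authors' analysis code):

```python
import numpy as np

def radius_of_gyration(coords, masses=None):
    """Mass-weighted radius of gyration of one frame.

    coords: (n_beads, 3) array of positions (e.g. in nm).
    masses: optional (n_beads,) array; equal masses are assumed if omitted.
    """
    coords = np.asarray(coords, dtype=float)
    if masses is None:
        masses = np.ones(len(coords))
    masses = np.asarray(masses, dtype=float)
    # Center of mass of the frame
    center = np.average(coords, axis=0, weights=masses)
    # Squared distance of every bead from the center of mass
    sq_dist = np.sum((coords - center) ** 2, axis=1)
    return float(np.sqrt(np.average(sq_dist, weights=masses)))

def rg_trace(trajectory, masses=None):
    """Rg for every frame of a (n_frames, n_beads, 3) trajectory."""
    return np.array([radius_of_gyration(frame, masses) for frame in trajectory])
```

      Comparing the resulting trace with the Rg of the reference PDB structure, computed with the same function and the same bead selection, gives the kind of plot shown in Author response image 3.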

      It’s plausible that coarse-grained models, lacking side chains, might underestimate the barrier for DNA sliding along the nucleosome. Specifically, our model, without differentiation between interactions among various amino acids and nucleotides, accurately reproduces the average nucleosomal DNA binding affinity but may not capture the energetic variations among binding interfaces. Since sliding’s contribution to chromatin organization is minimal due to the use of strongly positioning 601 sequences, we imposed rigidity on the two nucleotides situated at the dyad axis to prevent nucleosomal DNA sliding. In future studies, enhancing the calibration of protein-DNA interactions to achieve improved sequence specificity would be an intriguing avenue. To underscore this limitation of the model, we have included the following text in the discussion section of the main text.

      "Several aspects of the coarse-grained model presented here can be further improved. For instance, the introduction of specific protein-DNA interactions could help address the differences in non-bonded interactions between amino acids and nucleotides beyond electrostatics [cite]. Such a modification would enhance the model’s accuracy in predicting interactions between chromatin and chromatin-proteins. Additionally, the single-bead-per-amino-acid representation used in this study encounters challenges when attempting to capture the influence of histone modifications, which are known to be prevalent in native nucleosomes. Multiscale simulation approaches may be necessary [cite]. One could first assess the impact of these modifications on the conformation of disordered histone tails using atomistic simulations. By incorporating these conformational changes into the coarse-grained model, systematic investigations of histone modifications on nucleosome interactions and chromatin organization can be conducted. Such a strategy may eventually enable the direct quantification of interactions among native nucleosomes and even the prediction of chromatin organization in vivo."

      Comment 6: The solvent salt conditions vary in the experimental reference data for internucleosomal interaction energies. The authors note, for example, that the in vitro data from Funke et al. differs the most from other measurements, but the solvent conditions are 35 mM NaCl and 11 mM MgCl2. Since this simulation method allows for this investigation, could the authors speak to or investigate if solvent conditions are responsible for the variability in experimental reference data? The authors conclude on pg. 8-9 and Figure 4 that orientational restraints in the DNA origami methodology are responsible for differences in interaction energy. Can the authors rule out ion concentration contributions?

      Response: We thank the reviewer for the insightful comment. We would like to clarify that the black curve presented in Figure 4B of the main text was computed using the salt concentration specified by Funke et al. (35 mM NaCl and 11 mM MgCl2). Furthermore, there were no restraints placed on nucleosome orientations during these calculations. Consequently, the results in Figure 4B can be directly compared with the black curve in Figure 5C. The data in Figure 5C were calculated under physiological salt conditions (150 mM NaCl and 2 mM MgCl2), which are the standard solvent salt conditions used in most studies. It is worth noting that the free energy of nucleosome binding is significantly higher at the salt concentration employed by Funke et al. (14 kBT) than the value at the physiological salt condition (9 kBT). Therefore, comparing the results in Figure 4B and 5C eliminates ion concentration as a potential cause of the almost negligible binding free energy reported by Funke et al.
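
      For readers who prefer molar units, the binding free energies quoted above in units of kBT can be converted with a short calculation. This is only an illustrative sketch: the temperature of 300 K is an assumption (the response does not state the simulation temperature), and the function name is hypothetical.

```python
# Gas constant in kcal/(mol K): 8.314462618 J/(mol K) divided by 4184 J/kcal
R_KCAL_PER_MOL_K = 8.314462618 / 4184.0

def kbt_to_kcal_per_mol(n_kbt, temperature_k=300.0):
    """Convert an energy given as a multiple of kBT into kcal/mol.

    temperature_k defaults to 300 K, which is an assumed value.
    """
    return n_kbt * R_KCAL_PER_MOL_K * temperature_k

# The two binding free energies discussed in the response
for label, value in [("physiological salt", 9.0), ("Funke et al. salt", 14.0)]:
    print(f"{label}: {value:.0f} kBT = {kbt_to_kcal_per_mol(value):.2f} kcal/mol")
```

      Under this assumption, 9 kBT and 14 kBT correspond to roughly 5.4 and 8.3 kcal/mol, respectively.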

      Comment 7: In the discussion on pg. 12 residual-level should be residue-level.

      Response: We apologize for the oversight and have corrected the grammatical error in our manuscript.

      RESPONSE TO REFEREE 2:

      Comment 0: In this manuscript, the authors introduced an explicit ion model using the coarse-grained modelling approach to model the interactions between nucleosomes and evaluate their effects on chromatin organization. The strength of this method lies in the explicit representation of counterions, especially divalent ions, which are notoriously difficult to model. To achieve their aims and validate the accuracy of the model, the authors conducted coarse-grained molecular dynamics simulations and compared predicted values to the experimental values of the binding energies of protein-DNA complexes and the free energy profile of nucleosomal DNA unwinding and inter-nucleosome binding. Additionally, the authors employed umbrella sampling simulations to further validate their model, reproducing experimentally measured sedimentation coefficients of chromatin under varying salt concentrations of monovalent and divalent ions.

      Response: We thank the reviewer for the excellent summary of the work.

      Comment 1: The significance of this study lies in the authors’ coarse-grained model which can efficiently capture the conformational sampling of molecules while maintaining a low computational cost. The model reproduces the scale and, in some cases, the shape of the experimental free energy profile for specific molecule interactions, particularly inter-nucleosome interactions. Additionally, the authors’ method resolves certain experimental discrepancies related to determining the strength of inter-nucleosomal interactions. Furthermore, the results from this study support the crucial role of intrinsic physicochemical interactions in governing chromatin organization within the nucleus.

      Response: We appreciate the reviewer’s strong assessment of the paper’s significance, novelty, and broad interest, and we thank him/her for the detailed suggestions and comments.

      Comment 2: The method is simple but can be useful, given the authors can provide more details on their ion parameterization. The paper says that parameters in their ”potentials were tuned to reproduce the radial distribution functions and the potential of mean force between ion pairs determined from all-atom simulations.” However, no details on their all-atom simulations were provided; at some point, the authors refer to Reference 67 which uses all-atom simulations but does not employ the divalent ions. Also, no explanation is given for their modelling of protein-DNA complexes.

      Response: We appreciate the reviewer’s suggestion on clarifying the parameterization of the explicit-ion model. The parameterization was not carried out in reference 67 nor by us, but by the de Pablo group in citation 53. Specifically, ion potentials were parameterized to fit the potential of mean force between both monovalent and divalent ion pairs, calculated either from all-atom simulations or from the literature. The authors carried out extensive validations of the model parameters by comparing the radial distribution functions of ions computed using the coarse-grained model with those from all-atom simulations. Good agreement between coarse-grained and all-atom results ensures the parameters’ accuracy in reproducing the local structures of ion interactions.

      To avoid confusion, we have revised the text from:

      "Parameters in these potentials were tuned to reproduce the radial distribution functions and the potential of mean force between ion pairs determined from all-atom simulations."

      to

      "Parameters in these potentials were tuned by Freeman et al. [cite] to reproduce the radial distribution functions and the potential of mean force between ion pairs determined from all-atom simulations."

      We modified the Supporting Information at several places to clarify the setup and interpretation of protein-DNA complex simulations.

      For example, we clarified the force fields used in these simulations with the following text

      "All simulations were carried out using the software LAMMPS [cite] with the force fields defined in the previous two sections."

      We added details on the preparation of these simulations as follows

      "We carried out a series of umbrella-sampling simulations to compute the binding free energies of a set of nine protein-DNA complexes with experimentally documented binding dissociation constants [cite]. Initial configurations of these simulations were prepared using the crystal structures with the corresponding PDB IDs listed in Fig. S1."

      We further revised the caption of Figure S1 (included as Author response image 4) to facilitate the interpretation of simulation results.

      Author response image 4.

      The explicit-ion model predicts the binding affinities of protein-DNA complexes well, related to Fig. 1 of the main text. Experimental and simulated binding free energies are compared for nine protein-DNA complexes [cite], with a Pearson Correlation coefficient of 0.6. The PDB ID for each complex is indicated in red, and the diagonal line is drawn in blue. The significant correlation between simulated and experimental values supports the accuracy of the model. To further enhance the agreement between the two, it will be necessary to implement specific non-bonded interactions that can resolve differences among amino acids and nucleotides beyond simple electrostatics. Such modifications will be interesting avenues for future research. See text Section: Binding free energy of protein-DNA complexes for simulation details.
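
      The agreement in Figure S1 is summarized by a Pearson correlation coefficient. A minimal sketch of that calculation is given below, assuming the simulated and experimental binding free energies are available as two equal-length arrays; the function name is hypothetical and no values from the manuscript are reproduced here.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape or x.ndim != 1:
        raise ValueError("x and y must be 1-D arrays of equal length")
    # Deviations from the respective means
    xc = x - x.mean()
    yc = y - y.mean()
    # Covariance divided by the product of the standard deviations
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))
```

      Passing the nine experimental and nine simulated free energies, in the same order of PDB IDs, would return the coefficient quoted in the caption.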

      Comment 3: Overall, the paper is well-written, concise and easy to follow but some statements are rather blunt. For example, the linker histone contribution (Figure 5D) is not clear and could be potentially removed. The result on inter-nucleosomal interactions and comparison to experimental values from Ref#44 is the most compelling. It would be nice to see if the detailed shape of the profile for restrained inter-nucleosomal interactions in Figure 4B corresponds to the experimental profile. Including the dependence of free energy on a vertex angle would also be beneficial.

      Response: We thank the reviewer for the comments and agree that the discussion on linker histone results was brief. However, we believe the results are important and demonstrate our model’s advantage over mesoscopic approaches in capturing the impact of chromatin regulators on chromatin organization.

      Therefore, instead of removing the result, we expanded the text to better highlight its significance, to help its comprehension, and to emphasize its biological implications. The image in Figure 5D was also redesigned to better visualize the cross contacts between nucleosomes mediated by histone H1. The added texts are quoted as below, and the new Figure 5 is included.

      Author response image 5.

      Revised main text Figure 5, with Figure 5D modified for improved visual clarity.

      "Importantly, we found that the weakened interactions upon extending linker DNA can be more than compensated for by the presence of histone H1 proteins. This is demonstrated in Fig. 5C and Fig. S8, where the free energy cost for tearing apart two nucleosomes with 167 bp DNA in the presence of linker histones (blue) is significantly higher than the curve for bare nucleosomes (red). Notably, at larger inter-nucleosome distances, the values even exceed those for 147 bp nucleosomes (black). A closer examination of the simulation configurations suggests that the disordered C-terminal tail of linker histones can extend and bind the DNA from the second nucleosome, thereby stabilizing the internucleosomal contacts (as shown in Fig. 5D). Our results are consistent with prior studies that underscore the importance of linker histones in chromatin compaction [cite], particularly in eukaryotic cells with longer linker DNA [cite]."

      We further compared the simulated free energy profile, as a function of the center of mass distance between nucleosomes, with the experimental profile, as depicted in Author response image 6. The agreement between the simulated and experimental results is evident. The nuanced features observed between 60 and 80 Å in the simulated profile stem from DNA unwinding to accommodate the incoming nucleosome, creating a small energy barrier. It’s worth noting that such unwinding is unlikely to occur in the experimental setup due to the hybridization method used to anchor nucleosomes onto the DNA origami. Moreover, our simulation did not encompass configurations below 60 Å, resulting in a lack of data in that region within the simulated profile.

      We projected the free energy profile onto the vertex angle of the DNA origami device, utilizing the angle between two nucleosome faces as a proxy. Once more, the simulated profile demonstrates reasonable agreement with the experimental data (Author response image 6). Author response image 6 has been incorporated as Figure S4 in the Supporting Information.

      Author response image 6.

      Explicit ion modeling reproduces the experimental free energy profiles of nucleosome binding. (A) Comparison between the simulated (black) and experimental (red) free energy profile as a function of the inter-nucleosome distance. Error bars were computed as the standard deviation of three independent estimates. The barrier observed between 60 Å and 80 Å arises from the unwinding of nucleosomal DNA when the two nucleosomes are in close proximity, as highlighted in the orange circle. (B) Comparison between the simulated (black) and experimental (red) free energy profile as a function of the vertex angle. Error bars were computed as the standard deviation of three independent estimates. (C) Illustration of the vertex angle Φ used in panel (B).
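
      The error bars in this caption are the standard deviation of three independent estimates of the profile. A minimal sketch of that step follows, assuming each estimate is already a free energy profile on a common grid of bins. Shifting each profile so that its minimum is zero and using the population standard deviation (`ddof=0`) are assumptions, since the caption does not specify either choice.

```python
import numpy as np

def profile_mean_and_error(estimates, ddof=0):
    """Average several free energy profiles and estimate per-bin error bars.

    estimates: (n_estimates, n_bins) array, one profile per row, all on the
               same grid of the collective variable.
    ddof: delta degrees of freedom for the standard deviation (assumed 0).
    """
    est = np.asarray(estimates, dtype=float)
    # Free energies are defined up to an additive constant, so align the
    # profiles by setting the minimum of each one to zero (an assumption).
    est = est - est.min(axis=1, keepdims=True)
    return est.mean(axis=0), est.std(axis=0, ddof=ddof)
```

      The returned mean and standard deviation can then be plotted as a curve with error bars against the bin centers.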

      Comment 4: Another limitation of this study is that the authors’ model sacrifices certain atomic details and thermodynamic properties of the modelled systems. The potential parameters of the counter ions were derived solely by reproducing the radial distribution functions (RDFs) and potential of mean force (PMF) based on all-atom simulations (see Methods), without considering other biophysical and thermodynamic properties from experiments. Lastly, the authors did not provide any examples or tutorials for other researchers to utilize their model, thus limiting its application.

      Response: We agree that residue-level coarse-grained modeling indeed sacrifices certain atomistic details. This sacrifice can be potentially limiting when studying the impact of chemical modifications, especially on histone and DNA methylations. We added a new paragraph in the Discussion Section to point out such limitations and the relevant text is quoted below.

      "Several aspects of the coarse-grained model presented here can be further improved. For instance, the introduction of specific protein-DNA interactions could help address the differences in non-bonded interactions between amino acids and nucleotides beyond electrostatics [cite]. Such a modification would enhance the model’s accuracy in predicting interactions between chromatin and chromatin-proteins. Additionally, the single-bead-per-amino-acid representation used in this study encounters challenges when attempting to capture the influence of histone modifications, which are known to be prevalent in native nucleosomes. Multiscale simulation approaches may be necessary [cite]. One could first assess the impact of these modifications on the conformation of disordered histone tails using atomistic simulations. By incorporating these conformational changes into the coarse-grained model, systematic investigations of histone modifications on nucleosome interactions and chromatin organization can be conducted. Such a strategy may eventually enable the direct quantification of interactions among native nucleosomes and even the prediction of chromatin organization in vivo."

      Nevertheless, it’s important to note that while the model sacrifices accuracy, it compensates with superior efficiency. Atomistic simulations face significant challenges in conducting extensive free energy calculations required for a quantitative evaluation of ion impacts on chromatin structures.

      The explicit ion model, introduced by the de Pablo group, follows a standard approach adopted by other research groups, such as the parameterization of ion models using the potential of mean force from atomistic simulations (11; 12). According to multiscale coarse-graining theory, reproducing the potential of mean force (PMF) enables the coarse-grained model to achieve thermodynamic consistency with the atomistic model, ensuring identical statistical properties derived from them. However, it’s crucial to recognize that an inherent limitation of such approaches is their dependence on the accuracy of atomistic force fields in reproducing thermodynamic properties from experiments, as any inaccuracies in the atomistic force fields will similarly affect the resulting coarse-grained (CG) model.

      We have provided the implementation of the CG model and detailed instructions on setting up and performing simulations in a GitHub repository. Examples include simulation setup for a protein-DNA complex and for a nucleosome with the 601-sequence.

      References

      [1] Freeman GS, Hinckley DM, de Pablo JJ (2011) A coarse-grain three-site-per-nucleotide model for DNA with explicit ions. The Journal of Chemical Physics 135:165104.

      [2] Materese CK, Savelyev A, Papoian GA (2009) Counterion Atmosphere and Hydration Patterns near a Nucleosome Core Particle. J. Am. Chem. Soc. 131:15005–15013.

      [3] Lequieu J, Cordoba A, Schwartz DC, de Pablo JJ (2016) Tension-Dependent Free Energies of Nucleosome Unwrapping. ACS Cent. Sci. 2:660–666.

      [4] Lequieu J, Schwartz DC, De Pablo JJ (2017) In silico evidence for sequence-dependent nucleosome sliding. Proc. Natl. Acad. Sci. U.S.A. 114.

      [5] Moller J, Lequieu J, de Pablo JJ (2019) The Free Energy Landscape of Internucleosome Interactions and Its Relation to Chromatin Fiber Structure. ACS Cent. Sci. 5:341–348.

      [6] Chang L, Takada S (2016) Histone acetylation dependent energy landscapes in trinucleosome revealed by residue-resolved molecular simulations. Sci Rep 6:34441.

      [7] Watanabe S, Mishima Y, Shimizu M, Suetake I, Takada S (2018) Interactions of HP1 Bound to H3K9me3 Dinucleosome by Molecular Simulations and Biochemical Assays. Biophysical Journal 114:2336–2351.

      [8] Brandani GB, Niina T, Tan C, Takada S (2018) DNA sliding in nucleosomes via twist defect propagation revealed by molecular simulations. Nucleic Acids Research 46:2788–2801.

      [9] Ding X, Lin X, Zhang B (2021) Stability and folding pathways of tetra-nucleosome from six-dimensional free energy surface. Nat Commun 12:1091.

      [10] Liu S, Lin X, Zhang B (2022) Chromatin fiber breaks into clutches under tension and crowding. Nucleic Acids Research 50:9738–9747.

      [11] Savelyev A, Papoian GA (2010) Chemically accurate coarse graining of double-stranded DNA. Proc. Natl. Acad. Sci. U.S.A. 107:20340–20345.

      [12] Noid WG (2013) Perspective: Coarse-grained models for biomolecular systems. The Journal of Chemical Physics 139:090901.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review): 

      This manuscript examines the individual and dual effects of CHIP and LOY in MI employing a cohort of ~460 individuals. CHIP is assessed by NGS and LOY is assessed by PCR. The threshold for CHIP is set at 2% (an arbitrary cutoff that is often used) and LOY at 9% (according to the Discussion text - this reviewer may have missed the section that describes why this threshold was employed). The investigation assessed whether LOY could modulate inflammation, atherosclerotic burden, or MI risk associated with CHIP. Neither CHIP nor LOY independently affected hsCRP, atherosclerotic burden, or MI incidence, nor did LOY presence diminish these outcomes in CHIP+ male subjects.

      This study represents the first dual analysis of CHIP and LOY on CVD outcomes. The results are largely negative, contradictory to other studies (many with much larger sample sizes). I would attribute the limitation of sample size as a major contributor to the negative data. While the negative data are suspect, the "positive" finding that LOY abolishes the prognostic significance of CHIP on MI is of interest (and consistent with what is understood from mechanistic studies).

      Overall, I enjoyed reading the paper, and it is of interest to the research community.

      However, I disagree with some of the authors' interpretations of the data.

      Generally, many conclusions on CHIP interpretation are based on the comparison of findings from very large datasets that have been evaluated by shallow NGS DNA sequencing. These studies lack sensitivity and accuracy, but this is counterbalanced by their very large sample sizes. Thus, they draw conclusions from the sickest individuals (ICD codes) with the largest clones (explaining the 10% VAF threshold). Here, the study has a well-phenotyped cohort, but as far as this reviewer can tell, the DNA sequencing is "shallow" NGS. Typically, to assess smaller datasets, investigators employ an error-correction method (DNA barcodes, duplex sequencing, etc.) for the sensitivity and accuracy of calling variants. Thus, the current study appears to suffer from this limitation (small sample sizes combined with NGS).

      We thank the reviewer for his/her positive and open comment. We acknowledge that we did not use an error-corrected sequencing method in our study. However, we do not fully agree with the statement that our NGS sequencing technique is “shallow”.

      Considering our entire sequencing panel, we achieve a sequencing depth of ≥100X and ≥300X for 100% [99%;100%] and 99% [99%;100%] of the targeted regions, respectively. This corresponds to a median depth of 2111X [1578;2574] for all regions sequenced. When considering “CHIP genes”, the median depth is 2694X [1875;3785] for patients from the CHAth study and 3455X [2266;4885] for patients from the 3C study. More specifically, for the DNMT3A and TET2 genes, the median sequencing depths are 2531X [1818;3313] and 3710X [2444;4901] for patients from the CHAth and 3C studies, respectively. These values are far higher than the 300X recommended by the French National Institute of Cancer for capture-based NGS. Combining this high sequencing depth with our bioinformatic pipeline, which uses 3 different variant callers, manual curation of all variants by trained hematobiologists, and a bioinformatic tool to estimate the background noise, allows us to detect somatic mutations with a VAF of 1% with high accuracy. Of note, our accuracy in detecting mutations in leukemia-associated genes is tested twice a year as part of the quality control program organized by the French Group of Molecular Biologists in Hematology (GBMHM). We added the information about the depth of sequencing in the Supplementary Methods section.

      While the "negative" data from this study are inconclusive, the positive data (i.e. CHIP being prognostic for MI in the absence but not presence of MI) is of interest. Thus, the investigators may want to consider a shorter report that largely focuses on this finding.

      We thank the reviewer for his/her interest in this result. We also agree that it would be interesting to focus specifically on demonstrating the impact of mLOY in countering the cardiovascular risk associated with CHIP. We performed additional analyses to demonstrate that this effect was independent of age and cardiovascular risk factors and included this information in the Results section.

      However, we believe that it is also of interest to show negative results that, although probably due to the limited sample size, suggest that the cardiovascular risk associated with CHIP is not as strong and clinically pertinent as initially suggested. Of note, if CHIP really increased the risk of myocardial infarction in a significant manner, it would be detected more frequently in subjects who suffered from a MI than in those who did not, which was not observed in our cohort. Moreover, we were able to determine that if CHIP increases the risk of MI, it does so to a much lesser extent (HR = 1.03 for CHIP) than other established cardiovascular risk factors such as hypercholesterolemia or tobacco use (HR = 1.47 and HR = 1.86, respectively, in our cohort), which questions the pertinence of considering CHIP in the management of patients with atherothrombosis. These data have been added in the Results and Discussion sections.

      We also believe that our study has the merit of directly assessing the impact of CHIP on atheroma burden, which has been done in only a limited number of studies in the context of coronary artery disease. This would not be possible by analyzing only the male subjects in our cohort because it would further decrease the statistical power of our analyses.

      Reviewer #2 (Public Review):

      Summary: 

      The preprint by Fawaz et al. presents the findings of a study that aimed to assess the relationship between somatic mutations associated with clonal hematopoiesis (CHIP) and the prevalence of myocardial infarction (MI). The authors conducted targeted DNA sequencing analyses on samples from 149 MI patients and 297 non-MI controls from a separate cohort. Additionally, they investigated the impact of the loss of the Y chromosome (LOY), another somatic mutation frequently observed in clonally expanded blood cells. The results of the study primarily demonstrate no significant associations, as neither CHIP nor LOY were found to be correlated with an increased prevalence of MI. Of note, the null findings regarding CHIP are in conflict with several larger studies in the literature.

      Strengths:

      Overall, this is a useful research work on an emerging risk factor for cardiovascular disease (CVD). The use of a targeted sequencing approach is a strength, as it offers higher sensitivity than the whole exome sequencing approaches used in many previous studies.

      Weaknesses:

      Reporting null findings is definitely relevant in an emerging field such as the role of somatic mutations in cardiovascular disease. Nevertheless, the study suffers from severe limitations, which casts doubts on the authors' conclusions, as detailed below:

      (1) The small sample size of the study population is a critical limitation, particularly when reporting null findings that conflict (partly) with positive findings in much larger studies, totaling hundreds of thousands of individuals (e.g. Zekavat et al, Nature CVR 2023, Vlasschaert et al, Circulation 2023; Zhao et al, JAMA Cardio 2024). The authors claim that they have 90% power to detect an effect size of CHIP on MI comparable to that in a previous report (Jaiswal et al, NEJM 2017). However, the methodology used to estimate statistical power is not described.

      We thank the reviewer for his/her pertinent and constructive comments. We totally agree that our study presents a substantially smaller sample size as compared to the studies of Zekavat et al, Vlasschaert et al or Zhao et al.

      The CHAth study was designed as a prospective study (which is not frequent in CHIP reports) to demonstrate that, if CHIP increases the risk of MI, it would be detected more frequently in patients who suffered from a MI than in those who did not. To achieve this, we defined eligibility criteria to obtain a rather high prevalence of CHIP and optimize the statistical power of a study based on a limited number of patients. We thus enrolled patients who suffered from a first MI after the age of 75 years. These patients were to be compared with subjects from the Three-City study who were 65 years or older at inclusion and did not present any cardiovascular event before inclusion.

      To determine the number of patients necessary to achieve our objective, we considered a CHIP prevalence of 20% in the general population after the age of 75 years, as estimated when we set up our study (Genovese et al, NEJM 2014, Jaiswal et al, NEJM 2014, Jaiswal et al, NEJM 2017). At that time, the relative risk of MI associated with CHIP was shown to be 1.7, leading to an expected prevalence of CHIP of 37% in subjects who presented a MI. Based on these hypotheses, the recruitment of 112 patients in the CHAth study would have been sufficient to detect a significantly higher prevalence of CHIP in MI(+) patients compared to MI(-) subjects with a power of 0.90 at a type I error rate of 5%. These calculations were performed by the Research Methodology Support Unit of the University Hospital of Bordeaux. These data were added in the Supplementary Methods section to present the design and objectives of the CHAth study more clearly.
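
      The sample-size reasoning above can be sketched with a normal-approximation power calculation for comparing two proportions. This is only an illustration, not the calculation performed by the Research Methodology Support Unit: the function name, the two-sided unpooled z-test and the 2:1 control-to-case ratio (224 controls) are assumptions, and only the 20% and 37% prevalences, the 112 patients and the 5% type I error rate come from the text.

```python
from statistics import NormalDist


def power_two_proportions(p_cases, p_controls, n_cases, n_controls, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two proportions.

    Uses the unpooled standard error under the alternative hypothesis,
    which is a common simplification (assumption, not the authors' method).
    """
    std_normal = NormalDist()
    se = (p_cases * (1 - p_cases) / n_cases
          + p_controls * (1 - p_controls) / n_controls) ** 0.5
    z_effect = abs(p_cases - p_controls) / se
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Probability of rejecting in either tail when the true difference holds.
    return std_normal.cdf(z_effect - z_crit) + std_normal.cdf(-z_effect - z_crit)


# Values from the text: 37% expected CHIP prevalence in MI(+), 20% in MI(-),
# 112 MI(+) patients. The 224 controls (2:1 ratio) are a hypothetical choice.
power = power_two_proportions(0.37, 0.20, n_cases=112, n_controls=224)
print(f"approximate power: {power:.2f}")
```

      Under these assumptions the approximate power comes out close to 0.9; a different control-group size or a different test would change the figure.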

      Finally, we recruited 149 patients in the CHAth study and compared them to 297 control subjects. Although we recruited more patients than initially needed, we observed a similar prevalence of CHIP in our 2 cohorts, suggesting that the cardiovascular risk associated with CHIP is lower than the 1.7-fold increased risk claimed in most publications related to CHIP in the cardiovascular field. We should note that our study was not designed to demonstrate the impact of CHIP on the occurrence of MI during follow-up, which could explain our negative results due to a limited number of patients, as stated by the reviewers. This statement has been added in the Supplementary Methods section. However, performing such an analysis allowed us to confirm that the risk of MI associated with CHIP was lower than 1.7 and lower than the one associated with hypercholesterolemia or smoking.

      We would also like to note that the eligibility criteria for both the CHAth and the Three-City studies may have led to a selection bias, possibly contributing to the contradiction between our results and other studies. As stated before, in the CHAth study, only patients who experienced a first MI after the age of 75 were enrolled. In the Three-City study, all subjects were 65 years or older at inclusion. In contrast, most of the cohorts showing an association between CHIP and cardiovascular events were composed of younger subjects:

      -          Bioimage: median age 70 years (55-80 years)

      -          MDC: median age 60 years

      -          ATVB: subjects with a MI before 45 years

      -          PROMIS: subjects between 30 and 80 years

      -          UK Biobank: between 40 and 70 years at inclusion, median age of 58 years in the study of Vlasschaert et al.

      -          Zhao et al: median age of 53.83 years (45.35-62.39 years).

      This last information was added in the Discussion section (lines 452-454).

      Furthermore, the work by Jaiswal et al (NEJM 2017) showed a hazard ratio of approx. 2.0, but more recent work in much larger populations suggests that the overall effect of CHIP on atherosclerotic CVD is smaller, most likely due to the heterogeneity of effects of different mutated genes (e.g. Zekavat et al, Nature CVR 2023, Vlasschaert et al, Circulation 2023; Zhao et al, JAMA Cardio 2024).

      We thank the reviewer for insisting on the fact that the initial HR of 2.0 observed by Jaiswal et al was shown to be smaller in more recent studies. This corresponds to what we wrote in the introduction (lines 103-109) and discussion (lines 365-370, 465-471).

      In addition, several analyses in the current manuscript are conducted separately in MI(+) (n= 149) and MI(-) (N=297) individuals, further limiting statistical power. Power is still lower in the investigation of the effects of LOY and its interaction with CHIP, as only men are included in these analyses. Overall, I believe the study is severely underpowered, which calls into question the validity of the reported null findings.

      We agree with the reviewer that the statistical power of our study is lower than that of other studies, in particular those based on several hundred thousand patients. Whenever possible, we analyzed our data by combining MI(+) and MI(-) subjects. However, for some aspects such as atherosclerosis, we did not have the same parameters available for these 2 groups and had to analyze them separately, leading to a more limited statistical power. We also have to acknowledge that our study was not designed to demonstrate an effect of CHIP on incident MI (as stated before), limiting our statistical power to demonstrate an effect of CHIP +/- mLOY on the incident risk of coronary artery disease.

      However, when designing our prospective study (CHAth study), we aimed to address the limitations of a small cohort and obtain rapid, significant results regarding the impact of CHIP. We hypothesized that if CHIP really increases the risk of myocardial infarction (MI), it would be detected more frequently in patients who have experienced a MI than in those who have not. This study design would demonstrate the importance of CHIP in MI pathophysiology without requiring thousands of patients. However, we did not observe such an association, which questions the relevance of detecting CHIP for the management of patients in the field of cardiology. This was confirmed by the fact that, in our cohort, the cardiovascular risk associated with CHIP appears to be low (HR = 1.03 [0.657;1.625] after adjustment for sex, age and cardiovascular risk factors) compared to hypercholesterolemia (HR = 1.474 [0.758;2.866]) or smoking (HR = 1.865 [0.943;3.690]). These data have been added in the Results and Discussion sections.
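
      To help read the hazard ratios quoted above, the following minimal sketch back-calculates an approximate standard error, z statistic and two-sided p-value from each reported interval. It assumes the bracketed values are 95% Wald confidence intervals that are symmetric on the log scale, which the text does not state; the function name `wald_from_ci` is hypothetical, and the resulting p-values are approximations rather than values reported by the authors.

```python
from math import log
from statistics import NormalDist


def wald_from_ci(hr, lower, upper, level=0.95):
    """Recover (SE of log HR, z, two-sided p) from a hazard ratio and its CI.

    Assumes a Wald interval on the log scale: log(HR) +/- z_crit * SE.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(0.5 + level / 2)
    se = (log(upper) - log(lower)) / (2 * z_crit)
    z = log(hr) / se
    p = 2 * (1 - std_normal.cdf(abs(z)))
    return se, z, p


# Hazard ratios and intervals as quoted in the response.
reported = {
    "CHIP": (1.03, 0.657, 1.625),
    "hypercholesterolemia": (1.474, 0.758, 2.866),
    "smoking": (1.865, 0.943, 3.690),
}

results = {name: wald_from_ci(*values) for name, values in reported.items()}
for name, (se, z, p) in results.items():
    print(f"{name}: SE(log HR)={se:.3f}, z={z:.2f}, approx. p={p:.3f}")
```

      All three intervals include 1, so under these assumptions none of the estimates reaches conventional significance in this cohort; the comparison therefore rests on the size of the point estimates.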

      In addition, we would like to mention that despite the limited number of subjects studied, our results are not exclusively negative. When studying only male subjects, we were able to show that CHIP accelerates the occurrence of MI, particularly in the absence of mLOY (Figure 2D). This effect was independent of age and cardiovascular risk factors (diabetes, cholesterol and high blood pressure). We added this last information in the Results section of the manuscript, although we acknowledge that it has to be confirmed in future work.

      (2) Related to the above, it is widely accepted that the effects of CHIP on CVD are highly heterogeneous, as some mutated genes appear to have a strong impact on atherosclerosis, whereas the effect of others is negligible (e.g. Zekavat et al, Nature CVR 2023, Vlasschaert et al, Circulation 2023, among others). TET2 mutations are frequently considered a "positive control", given the multiple lines of evidence suggesting that these mutations confer a higher risk of atherosclerotic disease.

      However, no association with MI or related variables was found for TET2 mutations in the current work. Reporting the statistical power specifically for assessing the effect of TET2 mutations would enhance the interpretation of these results.

      We thank the reviewer for this pertinent remark. It has indeed been shown that depending on the somatic mutation, the impact of CHIP on inflammation, atherosclerosis and cardiovascular risk is different. The studies cited by the reviewer suggest that DNMT3A mutations have a low impact on atherosclerosis/atherothrombosis while other “non-DNMT3A” mutations, including TET2 mutations, have a greater impact. In particular, Zekavat et al suggested that TP53, PPM1D, ASXL1 and spliceosome mutations have a similar impact on atherosclerosis/atherothrombosis to TET2.

      To answer the reviewer: in our cohort, we did not find a clear association between the detection of TET2 mutations with a VAF≥2% and:

      -          A history of MI at inclusion (p=0.5339)

      -          Inflammation (p=0.440)

      -          Atherosclerosis burden:

                 -   In the CHAth study:

                     -  p=0.031 for stenosis≥50%

                     -  p=0.442 for multitruncular lesions

                     -  p=0.241 for atheroma volume

                 -   In the 3C study:

                     -  p=0.792 for the presence of atheroma

                     -  p=0.3966 for the number of plaques

                     -  p=0.876 for intima-media thickness

      -          Incidence of MI (p=0.5993)

      Similarly, we did not find any association between the detection of TET2 mutations with a VAF≥1% and:

      -          A history of MI at inclusion (p=0.5339)

      -          Inflammation (p=0.802)

      -          Atherosclerosis burden:

                 -   In the CHAth study:

                     -  p=0.104 for stenosis≥50%

                     -  p=0.617 for multitruncular lesions

                     -  p=0.391 for atheroma volume

                 -   In the 3C study:

                     -  p=0.3291 for the presence of atheroma

                     -  p=0.2060 for the number of plaques

                     -  p=0.2300 for intima-media thickness

      -          Incidence of MI (p=0.195)

      However, analyzing the specific effect of TET2 mutations reduces the cohort of CHIP(+) subjects to 61 individuals. In these conditions, considering a prevalence of “TET2-CHIP” of 13.5% (in our cohort) and a hazard ratio of 1.3 (Vlasschaert et al), the statistical power to show an increased risk of MI is only 16%.

      (3) One of the most essential features of CHIP is the tight correlation with age. In this study, the effect of age on CHIP (Supplementary Tables S5, S6) seems substantially milder than in previous studies. Given the relatively weak association with age here, it is not surprising that no association with MI or atherosclerotic disease was found, considering that this association would have a much smaller effect size.

      We thank the reviewer for highlighting this point. Although the difference in median age between subjects with and without CHIP is not very large in our cohort, we did observe a significant association of CHIP with age:

      -          The differences in age were statistically significant in both the CHAth and 3C studies (Supplementary Tables S5 and S6)

      -          We observed a significant association between age and CHIP prevalence (p<0.001 for the total cohort, p=0.0197 for the CHAth study, and p=0.0394 for the 3C cohort after adjustment for sex). This association was already shown in Figure 1. We added the significant association between age and CHIP prevalence in the Results section (line 279).

      As stated before, we would remind the reviewer that we enrolled only subjects aged ≥75 years and ≥65 years in the CHAth and 3C studies, respectively. This led to a median age in our cohort that was substantially higher than in other cohorts (in particular the UK Biobank and the different cohorts studied by Jaiswal et al). This could have contributed to an apparently milder effect of age on CHIP, even though this association was still observed.

      In addition, there are previous reports of sex-related differences in the prevalence of CHIP, is there an association between CHIP and age after adjusting for sex? 

      The reviewer correctly pointed out that sex has been associated with various aspects of CHIP. While Zekavat et al reported that CHIP carriers were more frequently males, Kar et al (Nature Genetics 2022) and Kamphuis et al (Hemasphere 2023) did not observe a difference in the prevalence of CHIP between males and females, but rather a difference in the mutational spectrum. Males more frequently presented SRSF2, ASXL1, SF3B1, U2AF1, JAK2, TP53 and PPM1D mutations, while females more frequently had DNMT3A, CBL and GNB1 mutations.

      In our study, the association between CHIP prevalence and age was indeed significant even after adjustment for sex (p<0.001 for the total cohort, p=0.0197 for the CHAth study and p=0.0394 for the 3C study).

      (4) The mutated genes included in the definition of "CHIP" here are markedly different than those in most previous studies, particularly when considering specifically the studies that demonstrated an association between CHIP and atherosclerotic CVD. For instance, the definition of CHIP in this manuscript includes genes such as ANKRD26, CALR, CCND2, and DDX41... that are not prototypical CHIP genes. This is unlikely to have a major impact on the main results, as the vast majority of mutations detected are indeed in bona fide CHIP genes, but it should be at least acknowledged.

      We agree with the reviewer that our gene panel includes genes that are not considered prototypical CHIP genes. This acknowledgment has been added in the Supplementary Methods section. To perform this study, we did not design a specific targeted sequencing panel. We used the one that is used for the diagnosis of myeloid malignancies at the University Hospital of Bordeaux. ANKRD26 and DDX41 are genes that, when mutated, predispose to the development of hematological malignancies. CALR mutations are frequently detected in myeloproliferative neoplasms, while CCND2 mutations can be detected in acute myeloid leukemia, among other diseases. As usually done in our routine practice, we analyzed all the genes in the panel. However, as stated by the reviewer, most of the mutations we detected involved bona fide CHIP genes.

      Furthermore, the strategy used here for the CHIP variant calling and curation seems substantially different than that used in previous studies, which precludes a direct comparison. This is important because such differences in the definition of CHIP and the curation of variants are the basis of most conflicting findings in the literature regarding the effects of this condition. Ideally, the authors should conduct sensitivity analyses restricted to prototypical CHIP genes, using the criteria that have been previously established in the field (e.g. Vlasschaert et al, Blood 2023).

      We agree with the reviewer that our strategy for CHIP variant calling and curation was substantially different from what has been used in other studies. We decided to apply the criteria we used in previous studies for the analysis of somatic mutations in myeloid malignancies. Because CHIP is defined by the detection of “somatic mutations in leukemia driver genes”, this approach appeared to follow the definition of CHIP.

      We also acknowledge that this discrepancy with the criteria defined by Vlasschaert et al could contribute to our findings differing from those of other studies. We thus checked whether the variants detected were in accordance with the criteria defined by Vlasschaert et al. Pooling the 2 cohorts, we detected 439 variants, 381 of which were in accordance with the criteria established by Vlasschaert et al, representing a concordance rate of 86.8%. Moreover, the variants “wrongly” retained according to these criteria changed the conclusion regarding the detection of CHIP in only 15 patients (because the other such variants either were associated with a mutation in a bona fide CHIP gene or had a VAF below 2%). Thus, our approach to CHIP variant calling and curation had only a limited impact on our results. This has been added in the discussion (lines 455-459).

      However, we would like to discuss the criteria defined by Vlasschaert et al, which are probably too restrictive. For some genes, such as ZRSR2, in addition to frameshift and nonsense mutations that are expected to be associated with a loss of function, only some single nucleotide variations were retained (probably those detected by this group). In our patient 20785, we detected a c.524A>G, p.(Tyr175Cys) mutation that was not reported in the list published by Vlasschaert et al. However, this variant has a VAF suggestive of a somatic origin (3%), affects the Zn finger domain of the protein and is observed in a male subject. Thus, it meets several criteria for being considered as associated with a loss of function. Similarly, the CBL variant c.1139T>C, p.(Leu380Pro) observed in our patient 21536, although not affecting residues 381-421 of the protein (the criterion defined by Vlasschaert et al), has been reported in 29 cases of hematological malignancies. It is thus likely to have a significant impact on the behavior of hematopoietic cells. Moreover, in the same patient, a TET2 c.4534G>A, p.(Ala1512Thr) variant was detected. Although not directly affecting the CD1 domain, it has been reported in a case of AML with a VAF suggestive of a somatic origin (Papaemmanuil et al, NEJM 2016). The SH2B3 gene is not considered by Vlasschaert et al as a bona fide CHIP gene, contrary to other genes involved in cell signaling such as JAK2, GNAS, GNB1 and CBL. However, inactivating mutations in SH2B3 can be detected in myeloid malignancies and were recently shown to drive the phenotype in some patients with a MPN (Zhang et al, American Journal of Hematology 2024). We could thus expect that this also happens in our patients 22591 and 21998, who harbor mutations of SH2B3 (a SNV in the PH domain and a frameshift mutation, respectively).

      Regarding the BCOR, STAG2, SMC3 and RAD21 genes, although frameshift mutations are the most prevalent, there are several reports of SNVs in the context of hematological malignancies (COSMIC, Blood (2021) 138 (24): 2455–2468, Blood Cancer Journal (2023) 13:18; https://doi.org/10.1038/s41408-023-00790-1).

      We can also add that although Vlasschaert et al did not consider CSF3R and CALR as CHIP genes, Kessler et al did. Because CHIP is an emerging field, the concepts that define it are expected to evolve, as demonstrated by the recent study from Jyoti Nangalia’s group (Bernstein et al, Nature Genetics 2024), which showed that 17 additional genes (including SH2B3) should be considered as drivers of clonal hematopoiesis.

      (5) An important limitation of the current study is the cross-sectional design of most of the analyses. For instance, it is not surprising that no association is found between CHIP and prevalent atherosclerosis burden by ultrasound imaging, considering that many individuals may have developed atherosclerosis years or decades before the expansion of the mutant clones, limiting the possible effect of CHIP on atherosclerosis burden. Similarly, the analysis of the relationship between CHIP and a history of MI may be confounded by the potential effects of MI on the expansion of mutant clones. In this context, it is noteworthy that the only positive results here are found in the analysis of the relationship between CHIP at baseline and incident MI development over follow-up. Increasing the sample size for these longitudinal analyses would provide deeper insights into the relationship between CHIP and MI. 

      We agree with the reviewer that increasing the sample size for longitudinal analyses would provide deeper insights into the relationship between CHIP and MI. Unfortunately, for the moment, we do not have access to additional samples of the 3C study and are not able to perform these additional analyses.

      (6) The description of some analyses lacks detail, but it seems that statistical analyses were exclusively adjusted for age or age and sex. The lack of adjustment for conventional cardiovascular risk factors in statistical analyses may confound results, particularly given the marked differences in several variables observed between groups.

      The reviewer is right in saying that we adjusted our analyses for age and/or sex. This was done because, as stated before, our results did not show many significant differences. However, we reanalyzed our data, further adjusting the tests for conventional cardiovascular risk factors, and observed similar results. These data have been added in the Results section (lines 286-287, 303, 319, 331-332, 341).

      (7) The variant allele fraction (VAF) threshold for identifying clinically relevant clonal hematopoiesis is still a subject of debate. The authors state that subjects without any detectable mutation or with mutations with a VAF below 2% were considered non-CHIP carriers. While this approach is frequent in the field, it likely misses many impactful mutations with lower VAFs. Such false negatives could contribute to the null findings reported here. Ideally, the authors should determine the lower detection limit of their sequencing approach (either computationally or through serial dilution experiments) and identify the threshold of VAF that can be detected reliably with their sequencing assay. The association between CHIP and MI should then be evaluated considering all mutations above this VAF threshold, in addition to sensitivity analyses with other thresholds frequent in the literature, such as 1% VAF, 2% VAF, and 10% VAF.

      We agree with the reviewer that the VAF threshold for identifying clinically relevant CH is still debated. As stated in the manuscript and by the reviewer, we used the conventional threshold of 2%. Considering that different studies have shown that the increase in cardiovascular risk is greater for CHIP with a high VAF (Jaiswal et al, NEJM 2017, Kessler et al, Nature 2022, Vlasschaert et al, Circulation 2023), it is not certain that considering variants with a very low VAF (below 2%) would help us find an impact of CHIP on inflammation, atherosclerosis or atherothrombotic risk.

      However, as mentioned by the reviewer, variants with a low VAF could have a clinical impact, as recently reported by Zhao et al. In France, the use of biological analyses for medical purposes requires demonstrating that all their aspects are mastered, including their performance. In that context, we determined that our NGS strategy allowed us to reliably detect mutations with a VAF down to 1% (data not shown). As stated in the discussion, we also analyzed our results considering variants with a VAF of 1% and found similar results (lines 394-395). The sensitivity analyses were already mentioned in the manuscript, as we also searched for an effect of CHIP with a high VAF (≥5%) and found no effect either. We did not have a sufficient number of subjects carrying variants with a VAF≥10% to perform an analysis with this threshold.

      (8) The authors should justify the use of 3D vascular ultrasound imaging exclusively in the supra-aortic trunk. I am not familiar with this technique, but it seems to be most typically used to evaluate atherosclerosis burden in superficial vascular beds such as carotids or femorals. I am concerned about the potential impact of tissue depth on the accurate quantification of atherosclerosis burden in the current study (e.g. https://doi.org/10.1016/j.atherosclerosis.2016.03.002). It is unclear whether the carotids or femorals were imaged in the study population. 

      We apologize for the lack of precision in the Methods section. As stated by the reviewer, we evaluated the atherosclerosis burden in superficial vascular beds. We measured atheroma volume at the site of the common carotid (as described by B Lopez-Melgar, in Atherosclerosis, 2016). We did not analyze femoral arteries in this study. The sentence is now corrected in the Methods (lines 176-179).

      (9) The specific criteria used to define LOY need to be justified. LOY is stated to be defined based on a "A cut off of 9% of cells with mLOY defined the detection of a mLOY based on the study of 30 men of less than 40 years who had a normal karyotype as assessed by conventional cytogenetic study." As acknowledged by the authors, this definition of LOY is substantially different than that used in recent studies employing the same technique to detect LOY (Mas-Peiro et al, EHJ 2023). In addition, it seems essential to provide more detailed information on the ddPCR assay used to determine LOY, including the operating range and, more importantly, the lower limit of detection (%LOY) of the assay. A dilution series of a control DNA with no LOY would be helpful in this context. 

      We apologize if the definition of the threshold for detecting mLOY was unclear. To test the performance of our ddPCR technique, we first determined the background noise by testing DNA obtained from total leukocytes in 30 men aged ≤40 years who presented a normal karyotype as assessed by conventional cytogenetic techniques. In this control population, assumed not to carry mLOY, we detected a proportion of cells with mLOY of 2.34 ± 1.98% (see Author response image 1, panel A). We thus considered a value above 9% as being different from background noise (mean + 3 times the standard deviation).
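
      The arithmetic behind this rule can be sketched as follows. The snippet is illustrative only: the variable and function names are hypothetical, the control mean and standard deviation are taken from the text, and treating the cut-off as "strictly above 9%" is an assumption based on the wording above.

```python
# Background noise measured in the 30 control men (values from the text, in %).
background_mean = 2.34
background_sd = 1.98

# "Mean + 3 standard deviations" rule for the upper bound of background noise.
noise_ceiling = background_mean + 3 * background_sd

# Cut-off stated in the response (in % of cells with mLOY).
reported_cutoff = 9.0


def has_mloy(percent_cells_with_loy, cutoff=reported_cutoff):
    """Return True when the measured proportion exceeds the cut-off."""
    return percent_cells_with_loy > cutoff


print(f"mean + 3 SD = {noise_ceiling:.2f}%")
print(f"10% sample called mLOY: {has_mloy(10.0)}")
```

      The mean + 3 SD value (about 8.3%) lies just below the 9% cut-off, so the stated threshold is slightly more conservative than the computed noise ceiling.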

      We then compared the proportion of cells with mLOY measured by ddPCR and by conventional karyotyping and observed a rather good correlation between the 2 techniques (R² = 0.6430, p=0.0053, see Author response image 1, panel B). Finally, we tested the reliability of our ddPCR assay in detecting different levels of mLOY using a dilution series of control DNA (from an equivalent of 2% of cells with mLOY to 98% of cells with mLOY). We observed a very good correlation between the theoretical and measured proportions of cells with mLOY (R² = 0.9989, p<0.001, see Author response image 1, panel C). Of note, the proportions of mLOY measured for values ≤10% were concordant with theoretical values. However, considering the background noise determined with control DNA, we were unable to confirm that this “signal” was different from the background noise. Therefore, we set a threshold of 9% to define the detection of mLOY by ddPCR. It is also noteworthy that the 10% cell population with mLOY was consistently detected by the ddPCR technique. This has been added in the Methods section (lines 228-235).

      Author response image 1.

      (10) Our understanding of the relationship between CHIP and CVD is evolving fast, and the manuscript should be considered in the context of recent literature in the field. For instance, the recent work by Zhao et al (JAMA Cardio 2024, doi:10.1001/jamacardio.2023.5095) should be considered, as it used a similar targeted DNA sequencing approach as the one used here, but found a clear association between CHIP and coronary heart disease (in a population of 6181 individuals). 

      We thank the reviewer for this pertinent reference. We did not include it in the first version of our manuscript because it had not yet been published when we submitted our work. We have now included this reference in the discussion (lines 451, 455, 464). We also included the recent study of Heimlich et al (Circ Gen Pre Med 2024, lines 464-468), who studied the association of CHIP with atherosclerosis burden.

      (11) The use of subjective terms like "comprehensive" or "thorough" in the title of the manuscript does not align with the objective nature of scientific reporting. 

      We removed the terms “comprehensive” and “thorough” from the title and the text.

      Recommendations for the authors:

      Reviewing Editor:

      The Editors believe that in light of the small study the word Comprehensive has to be removed (including from the title and abstract).

      We agree and removed the term comprehensive from the title and the text.

      Reviewer #1 (Recommendations For The Authors):

      Other comments:

      It has long been recognized that hsCRP does not adequately address the inflammation associated with CHIP. For example, see Bick et al Nature 2020; 586:763. Through an assessment of a large dataset, the regulation of multiple inflammatory mediators was associated with CHIP but not with CRP. 

      We agree that hsCRP is probably not the most sensitive marker of the inflammatory state associated with CHIP; it is, however, the one most commonly used in medical practice. Moreover, as indicated in the discussion (lines 418-420), we did not observe any association between CHIP and the plasma levels of different cytokines (IL1ß, IL6, IL18 and TNFα) in patients enrolled in the CHAth study.

      Many of the citations lack journal names, volumes, page numbers, etc. 

      We apologize for this and corrected the citations.

      Please provide more details on the methodology (i.e. is CHIP assessed only through NGS with no error correction?). Specify the rationale for why the 9% LOY threshold was employed. Provide this information in the Methods section.

      We added more details on the methodology, as requested, in the results section (lines 212-214 and 228-235).

      Supplementary Table S3 lacks headings. What are the designations for columns 6-8? 

      We apologize for this and corrected the Table. Columns 6-8 correspond to the VAF, coverage of the variants and depth of sequencing, as for Table S4.

    1. Author response:

      The following is the authors’ response to the current reviews. 

      eLife assessment:

      This useful modeling study explores how the biophysical properties of interneuron subtypes in the basolateral amygdala enable them to produce nested oscillations whose interactions facilitate functions such as spike-timing-dependent plasticity. The strength of evidence is currently viewed as incomplete because of insufficient grounding in prior experimental results and insufficient consideration of alternative explanations. This work will be of interest to investigators studying circuit mechanisms of fear conditioning as well as rhythms in the basolateral amygdala.

      We disagree with the overall assessment of our paper. The current reviews published below focus on two kinds of perceived inadequacies. Reviewer 1 (R1) was concerned that the fear conditioning paradigm used in the model is not compatible with some of the experiments we are modeling. The reviewer helpfully suggested in the Recommendations for the Authors some papers, which R1 believed exposed this incompatibility. In our reading, those data are indeed compatible with our hypotheses, as we will explain in our reply. Furthermore, the point raised by R1 is an issue for the entire field. We will suggest a solution to that issue based on published data.

      Reviewer 2 (R2) said that there is no evidence that the BLA is capable of producing, by itself, the rhythms that have been observed during fear conditioning in BLA and, furthermore, that the paper we cited to support such evidence, in fact, refutes our argument. We believe that the reasoning used by reviewer 2 is wrong and that the framework of R2 for what counts as evidence is inadequate. We spell out our arguments below in the reply to the reviewers.

      Finally, we believe this work is of interest far beyond investigators studying fear conditioning. The work shows how rhythms can create the timing necessary for spike-timing-dependent plasticity using multiple time scales that come from multiple different kinds of interneurons found both in BLA and, more broadly, in cortex. Thus, the work is relevant for all kinds of associative learning, not just fear conditioning. Furthermore, it is one of the first papers to show how rhythms can be central in mechanisms of higher-order cognition.

      Reviewer #1

      We thank Reviewer 1 for their kind remarks about our first set of responses and their understanding of the importance of the work. There was only one remaining point to be addressed:

      Deficient in this study is the construction of the afferent drive to the network, which does elicit activities that are consistent with those observed to similar stimuli. It still remains to be demonstrated that their mechanism promotes plasticity for training protocols that emulate the kinds of activities observed in the BLA during fear conditioning.

      It is true that some fear conditioning protocols involve non-overlapping US and CS, raising the question of how plasticity happens or whether behavioral effects may happen without plasticity. This is an issue for the entire field (Sun et al., F1000Research, 2020). Several papers (Quirk, Repa and LeDoux, 1995; Herry et al, 2007; Bordi and Ledoux 1992) show that the pips in auditory fear conditioning increase the activity of some BLA neurons: after an initial transient, the overall spike rate is still higher than baseline activity. The question remains as to whether the spiking is sustained long enough and at a high enough rate for STDP to take place when US is presented sometime after the stop of the CS.

      Experimental recordings cannot speak to the rate of spiking of BLA neurons during US due to recording interference from the shock. However, evidence suggests that ECS activity should increase during the US due to the release of acetylcholine (ACh) from neurons in the basal forebrain (BF) (Rajebhosale et al., 2024). Pyramidal cells of the BLA robustly express M1 muscarinic ACh receptors (Muller et al., 2013; McDonald and Mott, 2021) and M1 receptors target spines receiving glutamatergic input (McDonald et al., 2019). Thus, ACh from BF should elicit a long-lasting depolarization in pyramidal cells. Indeed, the pairing of ACh with even low levels of spiking of BLA neurons results in a membrane depolarization that can last 7 – 10 s (Unal et al., 2015). This implies that the release of ACh can affect the consequences of the CS in successive trials. This should include higher spiking rates and more sustained activity in the ECS neurons after the first presentation of US, thus ensuring a concomitant activation of ECS and fear (F) neurons necessary for STDP to take place. Hence, we suggest that the problem raised by R1 may be solved by considering the role of ACh release by BF. To the best of our knowledge, there is nothing in the literature that contradicts this potential solution. The model we have may be considered a “minimal” model that puts in by hand the higher frequency due to the cholinergic drive without explicitly modeling it. As R1 says, it is important for us to give the motivation for that higher frequency; in the next revision, we will be explicit about how the needed adequate firing rate can come about without an overlap of CS and US in any given trial.

      Reviewer #2

      The authors of this study have investigated how oscillations may promote fear learning using a network model. They distinguished three types of rhythmic activities and implemented an STDP rule to the network aiming to understand the mechanisms underlying fear learning in the BLA.

      After the revision, the fundamental question, namely, whether the BLA networks can or cannot intrinsically generate any theta rhythms, is still unanswered. The author added this sentence to the revised version: "A recent experimental paper, (Antonoudiou et al., 2022), suggests that the BLA can intrinsically generate theta oscillations (3-12 Hz) detectable by LFP recordings under certain conditions, such as reduced inhibitory tone." In the cited paper, the authors studied gamma oscillations, and when they applied 10 uM Gabazine to the BLA slices observed rhythmic oscillations at theta frequencies. 10 uM Gabazine does not reduce the GABA-A receptor-mediated inhibition but eliminates it, resulting in rhythmic populations burst driven solely by excitatory cells. Thus, the results by Antonoudiou et al., 2022 contrast with, and do not support, the present study, which claims that rhythmic oscillations in the BLA depend on the function of interneurons. Thus, there is still no convincing evidence that BLA circuits can intrinsically generate theta oscillations in intact brain or acute slices. If one extrapolates from the hippocampal studies, then this is not surprising, as the hippocampal theta depends on extra-hippocampal inputs, including, but not limited to the entorhinal afferents and medial septal projections (see Buzsaki, 2002). Similarly, respiratory related 4 Hz oscillations are also driven by extrinsic inputs. Therefore, at present, it is unclear which kind of physiologically relevant theta rhythm in the BLA networks has been modelled.

      Reviewer 2 (R2) says “the fundamental question, namely, whether the BLA networks can or cannot intrinsically generate any theta rhythms, is still unanswered.” In our revision, we cited (Antonoudiou et al., 2022), who showed that BLA can intrinsically generate theta oscillations (3-12 Hz) detectable by LFP recordings. R2 pointed out that this paper produces such theta under conditions in which the inhibition is totally removed. R2 then states that the resulting rhythmic populations burst at theta “are driven solely by excitatory cells. Thus, the results by (Antonoudiou et al., 2022) contrast with, and do not support, the present study, which claims that rhythmic oscillations in the BLA depend on the function of interneurons. Thus, there is still no convincing evidence that BLA circuits can intrinsically generate theta oscillations in intact brain or acute slices.”

      This reasoning of R2 is faulty. With all GABAergic currents omitted, the LFP is composed of excitatory currents and intrinsic currents. Our model of the LFP includes all synaptic and membrane currents. In our model, the high theta comes from the spiking activity of the SOM cells, which increase their activity if the inhibition from VIP cells is removed. We are including a new simulation, which models the activity of the slice in the presence of kainate (as done in Antonoudiou et al., 2022), providing additional excitation to the network. If the BLA starts at high excitation, our model produces an ongoing gamma in the VIP cells that suppresses SOM cells and allows a PING gamma to form between PV and F cells; with Gabazine (modeled as the removal of all the GABAergic synapses), this PING is no longer possible and so the gamma rhythm disappears. As expected, the simulation shows that the model produces theta with Gabazine; the model also shows that a PING rhythm is produced without Gabazine, and that this rhythm goes away with Gabazine because PING requires feedback inhibition (see Author response image 1). Thus, the theta increase with Gabazine in the (Antonoudiou et al., 2022) paper can be reproduced in our model, so that paper does support the model.

      Author response image 1.

      Spectral properties of the BLA network without (black) versus with Gabazine (magenta). Power spectra of the LFP proxy, which is the linear sum of AMPA, GABA (only present in the absence of Gabazine), D-, NaP-, and H-currents. Both power spectra are represented as mean and standard deviation across 10 network realizations. Bottom: inset between 35 and 50 Hz.

      Nevertheless, we agree that this paper alone is not sufficient evidence that the BLA can produce a low theta. We have recently learned of a new paper (Bratsch-Prince et al., 2024) that is directly related to the issue of whether the BLA by itself can produce low theta, and in what circumstances. In this study, intrinsic BLA theta is produced in slices with ACh stimulation (without needing external glutamate input) which, in vivo, would be produced by the basal forebrain (Rajebhosale et al., eLife, 2024) in response to salient stimuli. The low-theta depends on muscarinic activation of CCK interneurons, a group of interneurons that overlaps with the VIP neurons in our model (Krabbe 2017; Mascagni and McDonald, 2003).

      We suspect that the low theta produced in (Bratsch-Prince et al., 2024) is the same as the low theta in our model. We do not explicitly include ACh modulation of BLA in our paper, but in current work with experimentalists, we aim to show that ACh is essential to the theta by activating the BLA VIP cells. In our re-revised version, we will discuss Bratsch-Prince et al., 2024 and its connection to our hypothesis that the theta oscillations can be produced within the BLA.

      Note that we have already included a paragraph stating explicitly that our hypothesis in no way contradicts the idea that inputs to the BLA may include theta oscillations. Indeed, the following paragraphs in the revised paper describe the complexity of trying to understand the origin of brain rhythms in vivo. R2 did not appear to take this complexity, and the possible involvement of neuromodulation, into account in their current position that the theta rhythms cannot be produced intrinsically in the BLA.

      From revised paper: “Where the rhythms originate, and by what mechanisms. A recent experimental paper, (Antonoudiou et al. 2022), suggests that the BLA can intrinsically generate theta oscillations (3-12 Hz) detectable by LFP recordings under certain conditions, such as reduced inhibitory tone. They draw this conclusion in mice by removing the hippocampus, which can volume conduct to BLA, and noticing that other nearby brain structures did not display any oscillatory activity. Our model also supports the idea that intrinsic mechanisms in the BLA can support the generation of the low theta, high theta, and gamma rhythms.

      Although the BLA can produce these rhythms, this does not rule out that other brain structures also produce the same rhythms through different mechanisms, and these can be transmitted to the BLA. Specifically, it is known that the olfactory bulb produces and transmits the respiratory-related low theta (4 Hz) oscillations to the dorsomedial prefrontal cortex, where it organizes neural activity (Bagur et al., 2021). Thus, the respiratory-related low theta may be captured by BLA LFP because of volume conduction or through BLA extensive communications with the prefrontal cortex. Furthermore, high theta oscillations are known to be produced by the hippocampus during various brain functions and behavioral states, including during spatial exploration (Vanderwolf, 1969) and memory formation/retrieval (Raghavachari et al., 2001), which are both involved in fear conditioning. Similarly to the low theta rhythm, the hippocampal high theta can manifest in the BLA. It remains to understand how these other rhythms may interact with the ones described in our paper.”

      We believe our current paper is important to show how detailed biophysical modeling can unearth the functional implications of physiological details (such as the biophysical bases of rhythms), which are often (indeed, usually) ignored in models, and why rhythms may be essential to some cognitive processes (including STDP). Indeed, for evaluating our paper it is necessary to go back to the purpose of a model, especially one such as ours, which is “hypothesis/data driven”. The hypotheses of the model serve to illuminate the functional roles of the physiological details, giving meaning to the data. Of course, the hypotheses must be plausible, and we think that the discussion above easily clears that bar. Hypotheses should also be checked experimentally, and a model that explains the implications of a hypothesis, such as ours, provides motivation for doing the hard work of experimental testing. We think that R1 understands this and has been very helpful.

      —————

      The following is the authors’ response to the original reviews.

      eLife assessment

      This useful modeling study explores how the biophysical properties of interneuron subtypes in the basolateral amygdala enable them to produce nested oscillations whose interactions facilitate functions such as spike-timing-dependent plasticity. The strength of evidence is currently viewed as incomplete because the relevance to plasticity induced by fear conditioning is viewed as insufficiently grounded in existing training protocols and prior experimental results, and alternative explanations are not sufficiently considered. This work will be of interest to investigators studying circuit mechanisms of fear conditioning as well as rhythms in the basolateral amygdala. 

      Most of our comments below are intended to rebut the sentence: “The strength of evidence is currently viewed as incomplete because the relevance to plasticity induced by fear conditioning is viewed as insufficiently grounded in existing training protocols and prior experimental results, and alternative explanations are not sufficiently considered”. 

      We believe this work will be interesting to investigators interested in dynamics associated with plasticity, which goes beyond fear learning. It will also be of interest because of its emphasis on the interactions of multiple kinds of interneurons that produce dynamics used in plasticity, in the cortex (which has similar interneurons) as well as BLA. We note that the model has sufficiently detailed physiology to make many predictions that can be tested experimentally. Details are below in the answer to reviewers.

      Reviewer #1 (Public Comments):  

      (1) … the weakness is that their attempt to align with the experimental literature (specifically Krabbe et al. 2019) is performed inconsistently. Some connections between cell types were excluded without adequate justification (e.g. SOM+ to PV+). 

      In order to constrain our model, we focused on what is reported in (Krabbe et al., 2019) in terms of functional connectivity instead of structural connectivity. Thus, we included only those connections for which there was strong functional connectivity. For example, the SOM to PV connection is shown to be small (Krabbe et al., 2019, Supp. Fig. 4, panel t). We also omitted PV to SOM, PV to VIP, SOM to VIP, VIP to excitatory projection neurons; all of these are shown in (Krabbe et al. 2019, Fig. 3 (panel l), and Supp. Fig. 4 (panels m,t)) to have weak functional connectivity, at least in the context of fear conditioning. 

      We reply with more details below to the Recommendations for the Authors, including new text.

      (2) The construction of the afferent drive to the network does not reflect the stimulus presentations that are given in fear conditioning tasks. For instance, the authors only used a single training trial, the conditioning stimulus was tonic instead of pulsed, the unconditioned stimulus duration was artificially extended in time, and its delivery overlapped with the neutral stimulus, instead of following its offset. These deviations undercut the applicability of their findings.  

      Regarding the use of a single long presentation of US rather than multiple presentations (i.e., multiple trials): in early versions of this paper, we did indeed use multiple presentations. We were told by experimental colleagues that the learning could be achieved in a single trial. We note that, if there are multiple presentations in our modeling, nothing changes; once the association between CS and US is learned, the conductance of the synapse is stable. Also, our model does not need a long period of US if there are multiple presentations.  

      We agree that, in order to implement the fear conditioning paradigm in our in-silico network, we made several assumptions about the nature of the CS and US inputs affecting the neurons in the BLA and the duration of these inputs. A Poisson spike train to the BLA is a signal that contains no structure that could influence the timing of the BLA output; hence, we used this as our CS input signal. We also note that the CS input can be of many forms in general fear conditioning (e.g., tone, light, odor), and we wished to de-emphasize the specific nature of the CS. The reference mentioned in the Recommendations for authors, (Quirk, Armony, and LeDoux 1997), uses pulses 2 seconds long. At the end of fear conditioning, the response to those pulses is brief. However, in the early stages of conditioning, the response goes on for as long as the figure shows. The authors do show the number of cells responding decreases from early to late training, which perhaps reflects increasing specificity over training. This feature is not currently in our model, but we look forward to thinking about how it might be incorporated. Regarding the CS pulsed protocol used in (Krabbe et al., 2019), it has been shown that intense inputs (6 kHz and 12 kHz inputs) can lead to metabotropic effects that last much longer than the actual input (200 ms duration) (Whittington et al., Nature, 1995). Thus, the effective input to the BLA may indeed be more like Poisson.
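
      As an illustration of what an unstructured Poisson input looks like, the following minimal sketch draws a homogeneous Poisson spike train from independent exponential inter-spike intervals. The function name, rate, and duration are hypothetical and are not taken from the manuscript; the model's actual input parameters are not restated here.

```python
import random


def poisson_spike_train(rate_hz: float, duration_s: float, seed: int = 0) -> list[float]:
    """Homogeneous Poisson spike times (in seconds) on [0, duration_s)."""
    rng = random.Random(seed)
    spikes = []
    t = rng.expovariate(rate_hz)  # first inter-spike interval
    while t < duration_s:
        spikes.append(t)
        # Intervals are independent and exponentially distributed,
        # so the train carries no temporal structure beyond its mean rate.
        t += rng.expovariate(rate_hz)
    return spikes


# Hypothetical rate and duration, chosen only for illustration.
cs_input = poisson_spike_train(rate_hz=50.0, duration_s=2.0)
print(len(cs_input), "spikes; mean rate", len(cs_input) / 2.0, "Hz")
```

      Because each interval is drawn independently, any rhythmic timing in the network output has to arise from the network itself rather than from the input.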

      Our model requires the effect of the CS and US inputs on the BLA neuron activity to overlap in time in order to instantiate fear learning. Although the literature includes paradigms with both overlapping (delay conditioning, where US coterminates with CS (Lindquist et al., 2004), or immediately follows CS (e.g., Krabbe et al., 2019)) and non-overlapping (trace conditioning) CS/US inputs, we hypothesized that concomitant activity of CS- and US-encoding neurons should be crucial in both cases. This may be mediated by the memory effect, as suggested in the Discussion of our paper, or by metabotropic effects as suggested above, or by the contribution from other brain regions. We will emphasize in our revision that the overlap in time, however instantiated, is a hypothesis of our model. It is hard to see how plasticity can occur without some memory trace of US. This is a consequence of our larger hypothesis that fear learning uses spike-timing-dependent plasticity; such a hypothesis about plasticity is common in the modeling literature.
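
      To make the role of spike timing concrete, here is a minimal sketch of a generic pair-based STDP kernel in which depression outweighs potentiation overall. The amplitudes, time constants, 25 ms cycle, and spike times are illustrative assumptions, not the rule or parameters used in the manuscript; the sketch only shows why concurrent, finely timed pre- and postsynaptic spiking is needed for net potentiation under such a rule.

```python
import math

# Illustrative parameters (assumptions): depression dominates because
# A_MINUS * TAU_MINUS_MS > A_PLUS * TAU_PLUS_MS.
A_PLUS, TAU_PLUS_MS = 1.0, 10.0     # potentiation amplitude and time constant
A_MINUS, TAU_MINUS_MS = 0.6, 30.0   # depression amplitude and time constant


def stdp_weight_change(dt_ms: float) -> float:
    """Weight change for one spike pair, with dt_ms = t_post - t_pre."""
    if dt_ms > 0:   # pre before post: potentiation
        return A_PLUS * math.exp(-dt_ms / TAU_PLUS_MS)
    if dt_ms < 0:   # post before pre: depression
        return -A_MINUS * math.exp(dt_ms / TAU_MINUS_MS)
    return 0.0


def net_change(pre_spikes_ms, post_spikes_ms) -> float:
    """Sum the kernel over all pre/post spike pairs (all-to-all pairing)."""
    return sum(stdp_weight_change(post - pre)
               for pre in pre_spikes_ms for post in post_spikes_ms)


# Hypothetical spike times: the presynaptic cell fires once per 25 ms cycle.
pre = [0.0, 25.0, 50.0, 75.0]
tight_post = [t + 3.0 for t in pre]    # post follows pre by 3 ms each cycle
loose_post = [t + 12.5 for t in pre]   # post lags by half a cycle

print("tight timing:", net_change(pre, tight_post))
print("loose timing:", net_change(pre, loose_post))
```

      With these illustrative values, the tightly timed pairing yields a net positive weight change, whereas the half-cycle lag yields a net negative one; without any overlap of pre- and postsynaptic activity there are no nearby spike pairs to potentiate at all.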

      We reply with more details below to the Recommendations for the Authors, including new text.

      Reviewer #1 (Recommendations For The Authors): 

      Major points: 

      (1) This paper draws extensively from Krabbe et al. 2019, but it does not do so consistently. The paper would be strengthened if it tried to better match the circuit properties and activations.

      Specifically: 

      a. Krabbe found that PV interneurons were comparably activated by the US (see Supp Fig 1). Your model does not include that. The basis for the Krabbe 2019 claim that PV US responses are weaker is that they have a slightly larger proportion of cells inhibited by the US, but this is not especially compelling. In addition, their Fig 2 showed that VIP and SOM cells receive afferents from the same set of upstream regions. 

      b. The model excluded PV-SOM connections, but this does not agree with Krabbe et al. 2019, Table 2. PV cells % connectivity and IPSC amplitudes were comparable to those from VIP interneurons. 

      c. ECS to PV synapses are not included. This seems unlikely given the dense connectivity between PV interneurons and principal neurons in cortical circuits and the BLA (Woodruff and Sah 2007 give 38% connection probability in BLA). 

      We thank the Reviewer for raising these points, which allow us to clarify how we constrained our model and to do more simulations. Specifically: 

      a. (Wolff et al., Nature, 2014), cited by (Krabbe et al. 2018), reported that PV and SOM interneurons are on average inhibited by the US during the fear conditioning. However, we agree that (Krabbe et al., 2019) added to this by specifying that PV interneurons respond to both CS+ and US, although the fraction of US-inhibited PV interneurons is larger. As noted by the Reviewer, in the model we initially considered the PV interneurons responding only to CS+ (identified as “CS” in our manuscript). For the current revision, we ran new simulations in which the PV interneuron receives the US input, instead of CS+. It turned out that this did not affect the results, as shown in the figure below: all the network realizations learn the association between CS and fear. In the model, the PING rhythm between PV and F is the crucial component for establishing fine timing between ECS and F, which is necessary for learning. Having PV responding to the same input as F, i.e., US, facilitates their entrainment in PING and, thus, successful learning. 

      As for afferents of VIP and SOM from upstream regions, (Krabbe et al., 2019) reports that “[…] BLA SOM interneurons receive a different array of afferent innervation compared to that of VIP and PV interneurons, which might contribute to the differential activity patterns observed during fear learning.” Thus, in the model, we are agnostic about inputs to SOM interneurons; we modeled them to fire spontaneously at high theta.

      To address these points in the manuscript, we added some new text in what follows:

      (1) New Section “An alternative network configuration characterized by US input to PV, instead of CS, also learns the association between CS and fear” in the Supplementary information:

      “We constrained the BLA network in Fig. 2 with CS input to the PV interneuron, as reported in (Krabbe et al., 2018). However, (Krabbe et al., 2019) notes that a class of PV interneurons may be responding to US rather than CS. Fig. S3 presents the results obtained with this variation in the model (see Fig. 3 A,B for comparison) and shows that all the network realizations learn the association between CS and fear. In the model, the PING rhythm between PV and F is the crucial component for establishing fine timing between ECS and F, which is necessary for learning. Having PV responding to the same input as F, i.e., US, facilitates their entrainment in PING and, thus, successful fear learning.

      We model the VIP interneuron as affected by US; in addition, (Krabbe et al. 2019) reports that a substantial proportion of them is mildly activated by CS. Replacing the US by CS does not change the input to VIP cells, which is modeled by the same constant applied current. Thus, the VIP CS-induced activity is a bursting activity at low theta, similar to the one elicited by US in Fig. 2.”

      (2) Section “With the depression-dominated plasticity rule, all interneuron types are needed to provide potentiation during fear learning” in Results: “Finally, since (Krabbe et al., 2019) reported that a fraction of PV interneurons are affected by US, we have also run the simulations for the single-neuron network with the PV interneuron affected by US instead of CS. In this case as well, all the network realizations are learners (see Fig. S3).”

      (3) Section “Conditioned and unconditioned stimuli” in Materials and Methods: “To make Fig. S3, we also considered a variation of the model with PV interneurons affected by US, instead of CS, as reported in (Krabbe et al. 2019).”

      b. Re the SOM to PV connection: As reported in the reply to the public reviews, we considered the prominent functional connections reported in (Krabbe et al., 2019), instead of structural connections. That is, we included only those connections for which there was strong functional connectivity. For example, the SOM to PV connection is shown to be small (Supp. Fig. 4, panel t, in (Krabbe et al., 2019)). We also omitted PV to SOM, PV to VIP, SOM to VIP, and VIP to excitatory projection neurons; all of these are shown in (Krabbe et al. 2019, Fig. 3 (panel l), and Supp. Fig. 4 (panels m,t)) to have weak functional connectivity, at least in the context of fear conditioning.

      In order to clarify this point, in Section “Network connectivity and synaptic currents” in Materials and Methods, we now say:

      “We modeled the network connectivity as presented in Fig. 2B, derived from the prominent functional, instead of structural, connections reported in (Krabbe et al., 2019).”

      c. Re the ECS to PV synapses: We thank the Reviewer for the reference provided; as the Reviewer says, the ECS to PV synapses are not included. Upon adding this connection in our network, we found that, unlike the connection suggested in part a above, introducing these synapses would, in fact, change the outcome. Thus, the omission of this connection must be considered an implied hypothesis. Including those synapses with a significant strength would alter the PING rhythm created by the interactions between F and PV, which is crucial for ECS and F fine timing. Thanks very much for showing us that this needs to be said. Our hypothesis does not contradict the dense connections mentioned by the Reviewer; such dense connectivity does not mean that all pyramidal cells connect to all interneurons. This hypothesis may be taken as a prediction of the model.

      The absence of this connection is now discussed at the end of a new Section of the Discussion entitled “Assumptions and predictions of the model”, which reads as follows:

      “Finally, the model assumes the absence of significantly strong connections from the excitatory projection cells ECS to PV interneurons, unlike the ones from F to PV. Including those synapses would alter the PING rhythm created by the interactions between F and PV, which is crucial for ECS and F fine timing. We note that in (Woodruff and Sah, 2007) only 38% of the pyramidal cells are connected to PV cells. The functional identity of the connected pyramidal cells is unknown. Our model suggests that successful fear conditioning requires F to PV connections and that ECS to PV must be weak or absent.”

      (2) Krabbe et al. 2019 and Davis et al. 2017 were referenced for the construction of the conditioned and unconditioned stimulus pairing protocol. The Davis citation is not applicable here because that study was a contextual, not cued, fear conditioning paradigm. Regarding Krabbe, the pairing protocol was radically different from what the authors used. Their conditioned stimulus was a train of tone pips presented at 0.9 Hz, which lasted 30 s, after which the unconditioned stimulus was presented after tone offset. The authors should determine how their network behaves when this protocol is used. Also, note that basolateral amygdala responses to tone stimuli are primarily brief onset responses (e.g. Quirk, Armony, and LeDoux 1997), and not the tonic activation used in the model.  

      We replied to this point in our responses to the Reviewer’s Public Comments as follows:

      “We agree that, in order to implement the fear conditioning paradigm in our in-silico network, we made several assumptions about the nature of the CS and US inputs affecting the neurons in the BLA and the duration of these inputs. A Poisson spike train to the BLA is a signal that contains no structure that could influence the timing of the BLA output; hence, we used this as our CS input signal. We also note that the CS input can be of many forms in general fear conditioning (e.g., tone, light, odor), and we wished to de-emphasize the specific nature of the CS. The reference mentioned in the Recommendations for authors, (Quirk, Armony, and LeDoux 1997), uses pulses 2 seconds long. At the end of fear conditioning, the response to those pulses is brief. However, in the early stages of conditioning, the response goes on for as long as the figure shows. The authors do show the number of cells responding decreases from early to late training, which perhaps reflects increasing specificity over training. This feature is not currently in our model, but we look forward to thinking about how it might be incorporated. Regarding the CS pulsed protocol used in (Krabbe et al., 2019), it has been shown that intense inputs (6 kHz and 12 kHz inputs) can lead to metabotropic effects that last much longer than the actual input (200 ms duration) (Whittington et al., Nature, 1995). Thus, the effective input to the BLA may indeed be more like Poisson.”

      Current answer to the Reviewer:

      There are several distinct issues raised by the Reviewer in the more detailed critique. We respectfully disagree that the model is not applicable to context-dependent fear learning where the context acts as a CS, though we should have been more explicit. Specifically, our CS input can describe both the cue and the context. We included the following text in the Results section “Interneuron rhythms provide the fine timing needed for depression-dominated STDP to make the association between CS and fear”:

      “In our simulations, the CS input describes either the context or the cue in contextual and cued fear conditioning, respectively. For the context, the input may come from the hippocampus or other non-sensory regions, but this does not affect its role as input in the model.”

      The second major issue is whether the specific training protocols used in the cited papers need to be exactly reproduced in the signals received by the elements of our model; we note that there are many transformations that can occur between the sensory input and the signals received by the BLA. In the case of auditory fear conditioning, a series of pips, rather than individual pips, are considered the CS (e.g., (Stujenske et al., 2014; Krabbe et al. 2019)). Our understanding is that a single pip does not elicit a fear response; a series of pips is required for fear learning. This indicates that it is not the neural code of a single pip that matters, but rather the signal entering the amygdala that incorporates any history-dependent signaling that could lead to spiking throughout the sequence of pips.  Also, as mentioned above, intense inputs at frequencies about 6kHz and 12kHz can lead to metabotropic effects that last much longer than each brief pip (~200 ms), thus possibly producing continuous activity in neurons encoding the input. Thus, we believe that our use of the Poisson spike train is reasonable. 

      However, we are aware that the activity of neurons encoding CS can be modulated by the pips: neurons encoding auditory CS display a higher firing rate when each pip is presented and a Poisson-like spike train between pips (Herry et al., Journal of Neuroscience, 2007). Here we confirm that potentiation is present even in the presence of the fast transient response elicited by the pips. We said in the original manuscript that there is learning for a Poisson spike train CS input at ~50 Hz; this describes the neuronal activity in between pips. For the revision, we asked whether learning is preserved when CS is characterized by higher frequencies, which would describe the CS during and right after each pip. We show in the new Fig. S4 that potentiation is ensured for a range of CS frequencies. The figure shows the learning speed as a function of CS and US frequencies. For all the CS frequencies considered, i) there is learning, ii) learning speed increases with CS frequency. Thus, potentiation is present even when pips elicit a faster transient response.

      To better specify this in the manuscript, we added the following sentences in the Results section “With the depression-dominated plasticity rule, all interneuron types are needed to provide potentiation during fear learning”:

      “We note that the CS and US inputs modeled as independent Poisson spike trains represent stimuli with no structure. Although we have not explicitly modeled pulsating pips, as common in auditory fear conditioning (e.g., (Stujenske 2014; Krabbe 2019)), we show in Fig. S4 that potentiation can be achieved over a relatively wide range of gamma frequencies. This indicates that overall potentiation is ensured if the gamma frequency transiently increases after the pip.”

      We added the section “The full network potentiates for a range of CS frequencies” and Fig. S4 to the Supplementary Information.

      We included in Materials and Methods “Conditioned and unconditioned stimuli” the following sentences:

      “Finally, for Fig.S4, we considered a range of frequencies for the CS stimulus. To generate the three Poisson spike trains with average frequencies from 48 to 64 Hz in Fig. S4, we set 𝜆 = 800, 1000, 1200.”
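
      For readers who want to experiment with inputs of this kind, a homogeneous Poisson spike train can be sketched by drawing exponential inter-spike intervals. This is a generic illustration, not the generator used in the manuscript: the function names, the seed, and the choice to parameterize directly by rate in Hz are assumptions, and the mapping between the 𝜆 values quoted above and the resulting average frequency is not reproduced here.

```python
import random


def poisson_spike_train(rate_hz, duration_s, seed=None):
    """Return increasing spike times (s) of a homogeneous Poisson process."""
    rng = random.Random(seed)
    t = 0.0
    spikes = []
    while True:
        # Inter-spike intervals of a Poisson process are exponential
        # with mean 1 / rate.
        t += rng.expovariate(rate_hz)
        if t >= duration_s:
            return spikes
        spikes.append(t)


def mean_rate(spikes, duration_s):
    """Empirical firing rate (Hz) of a spike train."""
    return len(spikes) / duration_s


# Hypothetical CS rates spanning the 48 to 64 Hz range mentioned in the text.
DURATION_S = 200.0
trains = {rate: poisson_spike_train(rate, DURATION_S, seed=1)
          for rate in (48.0, 56.0, 64.0)}
rates = {rate: mean_rate(spikes, DURATION_S) for rate, spikes in trains.items()}
```

      Over a long enough window the empirical rate in `rates` should sit close to the requested rate, while individual spike times carry no temporal structure, which is the property the response relies on.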

      Finally, to address the comment about the need for CS and US overlapping in time to instantiate fear association, we added the following text in the Results section “Assumptions and predictions of the model”:

      “Finally, our model requires the effect of the CS and US inputs on the BLA neuron activity to overlap in time in order to instantiate fear learning. Although paradigms with both overlapping (delay conditioning, where US co-terminates with CS (e.g., (Lindquist et al., 2004)), or immediately follows CS (e.g., Krabbe et al., 2019)) and non-overlapping (trace conditioning) CS/US inputs exist, we hypothesized that concomitant activity in CS- and US-encoding neurons should be crucial in both cases. This may be mediated by the memory effect due to metabotropic effects (Whittington et al., Nature, 1995) as suggested above, or by the contribution from other brain regions (see section “Involvement of other brain structures” in the Discussion). The fact that plasticity occurs with a US memory trace is a consequence of our larger hypothesis that fear learning uses spike-timing-dependent plasticity; such a hypothesis about plasticity is common in the modeling literature.”

      (3) As best as I could tell, only a single training trial was used in this study. Fair enough, especially given that fear learning can occur with a single trial. However, most studies of amygdala fear conditioning have multiple trials (~5 or more). How does the model perform when multiple trials are given?  

      The association between CS and fear acquired after one trial, i.e., through a potentiated ECS to F connection, is preserved in the presence of multiple trials.  Indeed, the association would be weakened or erased (through depression of the ECS to F connection) only if ECS and F did not display good fine timing, i.e., F does not fire right after ECS most of the time. However, the implemented circuit supports the role of interneurons in providing the correct fine timing, thus preventing the association acquired from being erased.  

      In the second paragraph of the Results section “With the depression-dominated plasticity rule, all interneuron types are needed to provide potentiation during fear learning”, we made the above point by adding the following text:

      “We note that once the association between CS and fear is acquired, subsequent presentations of CS and US do not weaken or erase it: the interneurons ensure the correct timing and pauses in ECS and F activity, which are conducive for potentiation.”
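
      The timing argument above can be illustrated with a toy pairwise STDP kernel. This is only a sketch under stated assumptions: the exponential form, the amplitudes, and the time constants below are hypothetical and are not the parameters of the manuscript's plasticity rule. They are chosen solely so that the depression lobe has the larger area, which is what "depression-dominated" is taken to mean here.

```python
import math

# Hypothetical parameters, chosen only so that depression dominates:
# A_MINUS * TAU_MINUS > A_PLUS * TAU_PLUS.
A_PLUS, TAU_PLUS = 1.0, 10.0     # potentiation amplitude, time constant (ms)
A_MINUS, TAU_MINUS = 1.0, 30.0   # depression amplitude, time constant (ms)


def stdp(dt_ms):
    """Weight change for one pre/post spike pair; dt_ms = t_post - t_pre."""
    if dt_ms > 0:    # post (F) fires after pre (ECS): potentiation
        return A_PLUS * math.exp(-dt_ms / TAU_PLUS)
    if dt_ms < 0:    # post fires before pre: depression
        return -A_MINUS * math.exp(dt_ms / TAU_MINUS)
    return 0.0


def net_change(lags_ms):
    """Total weight change summed over a collection of spike-pair lags."""
    return sum(stdp(dt) for dt in lags_ms)


# F reliably fires 5 ms after ECS on every pairing (good fine timing).
well_timed = net_change([5.0] * 20)
# Lags spread evenly over +/- 50 ms (no fine timing).
untimed = net_change(range(-50, 51))
```

      With these toy values, consistently short positive lags give a net increase in the weight, whereas evenly spread lags give a net decrease. This mirrors the statement that the association is preserved across repeated trials only as long as the interneurons keep F firing right after ECS.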

      (4) The LFP calculations are problematic. First, it is unclear how they were done. Did the authors just take the transmembrane currents they included and sum them, or were they scaled by distance from the 'electrode' and extracellular conductivity (as one would derive from the Laplace equation)? Presumably, the spatial arrangement of model neurons was neglected so distance was not a factor. 

      Second, if this is the case, then the argument for excluding GABAergic conductances seems flawed. If the spatial arrangement of neurons is relevant to whether to include or exclude GABAergic conductances, then wouldn't a simulation without any spatial structure not be subject to the concern of laminar vs. nuclear arrangement? 

      Moreover, to the best I can tell, the literature the authors use to justify the exclusion of GABAergic currents does not make the case for a lack of GABAergic contribution in non-laminar structures. Instead, those studies only argue that in a non-laminar structure, AMPA currents are detectable, not that GABA cannot be detected. Thus, the authors should either include the GABAergic currents when calculating their simulated LFP, or provide a substantially better argument or citation for their exclusion. 

      We thank the Reviewer for pointing this out; this comment helped us rethink how to model the LFP. The origin of the LFP signal in BLA has not been fully determined, but factors thought to be important include differences in the spatial extension of the arborization in excitatory and inhibitory neurons, in the number of synaptic boutons, and spatial distributions of somata and synapses (Lindén et al 2011; Łęski 2013; Mazzoni et al. 2015). In the first version of the manuscript, we excluded the GABAergic currents because it is typically assumed that they add very little to the extracellular field as the inhibitory reversal potential is close to the resting membrane potential. For the revision, we re-ran the simulations during pre and post fear conditioning and we modeled the LFP as the sum of the AMPA, GABA and NaP-/H-/D- currents. With this new version of the LFP, we added a new Fig. 6 showing that there is a significant increase in the low theta power, but not in the high theta power, with fear learning (Fig. 6 C, D, E). This increase in the low theta power was mainly due to the AMPA currents created by the newly established connection from ECS to F, which allowed F to be active after fear conditioning in response to CS. 

      However, as the Reviewer mentioned, our network has no spatial extent: neurons are modeled as point cells. Thus, our current model does not include the features necessary to model some central aspects of the LFP. Despite that, our model does clearly demonstrate how rhythmic activity in the spike timing of neurons within the network changes due to fear learning (Fig. 6B). The spiking outputs of the network are key components of the inputs to the LFP, and thus we expect the rhythms in the spiking to be reflected in more complex descriptions of the LFP. But we also discovered that different LFP proxies provide different changes in rhythmic activity comparing pre- and post-fear learning; although we have no principled way to choose an LFP proxy, we believe that the rhythmic firing is the essential finding of the model.

      We have added the following to the manuscript:

      (1) In the new version of Fig. 6, we present the power spectra of the network spiking activity (panel B), along with the power spectra of the LFP proxy that includes the GABA, AMPA, and NaP-/H-/D- currents (panels C, D, E). 

      (2) We modified the conclusion of the Results section entitled “Increased low-theta frequency is a biomarker of fear learning” by saying:

      “In this section, we explore how plasticity in the fear circuit affects the network dynamics, comparing after fear conditioning to before. We first show that fear conditioning leads to an increase in low theta frequency power of the network spiking activity compared to the pre-conditioned level (Fig. 6 A,B); there is no change in the high theta power. We also show that the LFP, modeled as the linear sum of all the AMPA, GABA, NaP-, D-, and H- currents in the network, similarly reveals a low theta power increase and no significant variation in the high theta power (Fig. 6 C,D,E). These results reproduce the experimental findings in (Davis et al., 2017), and Fig. 6 F,G show that the low theta increase is due to added excitation provided by the new learned pathway. The additional unresponsive ECS and F cells in the network were included to ensure we had not biased the LFP towards excitation. Nevertheless, although both the AMPA and GABA currents contribute to the power increase in the low theta frequency range (Fig. 6F), the AMPA currents show a dramatic power increase relative to the baseline (the average power ratio of AMPA and GABA post- vs pre-conditioning across 20 network realizations is 3 × 10³ and 4.6, respectively). This points to the AMPA currents as the major contributor to the low theta power increase. Specifically, the newly potentiated AMPA synapse from ECS to F ensures F is active after fear conditioning, thus generating strong currents in the PV cells to which it has strong connections (Fig. 6G). Finally, the increase in power is in the low theta range because ECS and F are allowed to spike only during the active phase of the low theta spiking VIP neurons. We have also explored another proxy for the LFP (see Supplementary Information and Fig. S6).”
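
      The post- vs pre-conditioning power ratios quoted above are band-limited comparisons of two spectra. A minimal sketch of that kind of comparison follows, using a plain DFT on toy signals. The band edges (3 to 6 Hz for low theta, 6 to 12 Hz for high theta), the sampling rate, and the sinusoidal test signals are assumptions made for illustration; this is not the spectral pipeline of the manuscript.

```python
import cmath
import math

LOW_THETA = (3.0, 6.0)    # Hz, assumed band edges
HIGH_THETA = (6.0, 12.0)  # Hz, assumed band edges


def band_power(signal, fs, band):
    """Sum periodogram power over DFT bins with band[0] <= f < band[1]."""
    f_lo, f_hi = band
    n = len(signal)
    mean = sum(signal) / n
    power = 0.0
    for k in range(1, n // 2):
        freq = k * fs / n
        if f_lo <= freq < f_hi:
            coeff = sum((x - mean) * cmath.exp(-2j * math.pi * k * i / n)
                        for i, x in enumerate(signal))
            power += abs(coeff) ** 2 / n
    return power


def toy_signal(amp_4hz, amp_8hz, fs=1000.0, duration_s=2.0):
    """Sum of a 4 Hz and an 8 Hz sinusoid (stand-ins for low/high theta)."""
    n = int(fs * duration_s)
    return [amp_4hz * math.sin(2 * math.pi * 4 * i / fs)
            + amp_8hz * math.sin(2 * math.pi * 8 * i / fs)
            for i in range(n)]


FS = 1000.0
pre = toy_signal(0.5, 1.0)    # weak 4 Hz component before "conditioning"
post = toy_signal(2.0, 1.0)   # 4 Hz amplitude quadrupled, 8 Hz unchanged

low_ratio = band_power(post, FS, LOW_THETA) / band_power(pre, FS, LOW_THETA)
high_ratio = band_power(post, FS, HIGH_THETA) / band_power(pre, FS, HIGH_THETA)
```

      Because power scales with amplitude squared, quadrupling the 4 Hz amplitude raises the low-theta ratio by a factor of 16 while the high-theta ratio stays at 1, which is the qualitative pattern described for Fig. 6.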

      In the Supplementary Information, we included a figure and some text in the new section entitled “A higher low theta power increase emerges in LFP approximated with the sum of the absolute values of the currents compared to their linear sum”:

      “Given that our BLA network comprises a few neurons described as single-compartment cells with no spatial extension and location, the LFP cannot be computed directly from our model’s read-outs. In the main text, we choose as an LFP proxy the linear sum of the AMPA, GABA, and NaP-/H-/D-currents. We note that if the LFP is modeled as the sum of the absolute value of the currents, as suggested by (Mazzoni et al. 2008; Mazzoni et al. 2015), an even higher low theta power increase arises after fear conditioning compared to the linear sum. Differences in the power spectra also arise if other LFP proxies (e.g., only AMPA currents, only GABA currents) are considered. A principled description of an LFP proxy would require modeling the three-dimensional BLA anatomy, including that of the interneurons VIP and SOM; this is outside the scope of the current paper. (See (Feng et al. 2019) for a related project in the BLA.)”
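
      The difference between the two proxies can be seen in a small sketch. The traces below are toy signals, and the sign convention (AMPA as negative, GABA as positive) and the shared 4 Hz modulation are assumptions made for illustration; none of this is model output.

```python
import math


def lfp_linear_sum(currents):
    """Proxy 1: pointwise linear sum of all current traces."""
    return [sum(sample) for sample in zip(*currents)]


def lfp_abs_sum(currents):
    """Proxy 2: pointwise sum of the absolute values of the currents."""
    return [sum(abs(v) for v in sample) for sample in zip(*currents)]


def peak_to_peak(trace):
    return max(trace) - min(trace)


FS = 1000  # samples per second (assumed)
t = [i / FS for i in range(FS)]
# A shared 4 Hz modulation drives both toy currents.
drive = [1.0 + math.sin(2 * math.pi * 4 * x) for x in t]
ampa = [-1.0 * d for d in drive]  # assumed sign: inward, negative
gaba = [0.5 * d for d in drive]   # assumed sign: outward, positive

linear = lfp_linear_sum([ampa, gaba])
absolute = lfp_abs_sum([ampa, gaba])
ratio = peak_to_peak(absolute) / peak_to_peak(linear)
```

      In this toy case the opposite-signed currents partly cancel in the linear sum but add in the absolute-value sum, so the rhythmic modulation is three times larger in the second proxy. This is one way in which the choice of proxy can change the size of a power increase without changing the underlying rhythm.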

      (3) We updated the Materials and Methods section “Local field potentials and spectral analysis” to explain how we compute the LFP in the revised manuscript: 

      “We considered as an LFP proxy the linear sum of all the AMPA, GABA, NaP, D, and H currents in the network. The D-current is in the VIP interneurons, and NaP-current and H-current are in SOM interneurons.”

      Although it is beyond the scope of the current work, an exploration of the most accurate proxy of the LFP in the amygdala is warranted. Such a study could be accomplished by adopting a similar approach as in (Mazzoni et al., 2015), where several LFP proxies based on a point-neuron leaky integrate-and-fire network were compared with a “ground-truth” LFP obtained in an analogous realistic three-dimensional network model.

      To explicitly mention this issue in the paper, we added a paragraph in the “Limitations and caveats” section in the Discussion, which reads as follows:

      “LFPs recorded in the experiments are thought to be mainly created by transmembrane currents in neurons located around the electrode and depend on several factors, including the morphology of the arborization of contributing neurons and the location of AMPA and GABA boutons (Katzner et al. 2009; Lindén et al 2011; Łęski 2013; Mazzoni et al. 2015). Since our model has no spatial extension, we used an LFP proxy; this proxy was shown to reflect the rhythmic output of the network, which we believe to be the essential result (for more details see Results “Increased low-theta frequency is a biomarker of fear learning”, and Supplementary Information “A higher low theta power increase emerges in LFP approximated with the sum of the absolute values of the currents compared to their linear sum”).”

      (4) We have removed the section “Plasticity between fear neuron and VIP slows down overall potentiation” in Results and sections “Plasticity between the fear neuron (F) and VIP slows down overall potentiation” and “Plastic F to VIP connections further increase low-theta frequency power after fear conditioning” in the Supplementary Information. This material is extraneous since we are using a new proxy for LFP.

      Minor points: 

      (1) In Figure 3C, the y-axis tick label for 0.037 is written as "0.37."

      We thank the reviewer for finding this typo; we fixed it.

      (2) Figure 5B is unclear. It seems to suggest that the added ECS and F neurons did not respond to either the CS or UCS. Is this true? If so, why include them in the model? How would their inclusion change the model behavior? 

      It is correct that the added ECS and F neurons did not respond to the CS or US (UCS); they are constructed to be firing at 11 Hz in the absence of any connections from other cells. These cells were included to be part of our computation of the LFP. Specifically, adding in those cells would make the LFP take inhibition into account more, and we wanted to make sure that we were not biasing our computation away from the effects of inhibition. As shown in the paper (Fig. 6B), even with inhibition onto these non-responsive cells, the LFP has the properties claimed in the paper concerning the changes in the low theta and high theta power, because the LFP is dominated by new excitation rather than the inhibition. 

      First, in the Results section “Network with multiple heterogeneous neurons can establish the association between CS and fear”, we commented on the added ECS and F neurons that do not respond to either CS or US by saying the following:

      “The ECS cells not receiving CS are inhibited by ongoing PV activity during the disinhibition window (Fig. 5B); they are constructed to be firing at 11 Hz in the absence of any connections from other cells. The lack of activity in those cells during fear conditioning implies that there is no plasticity from those ECS cells to the active F. Those cells are included for the calculation of the LFP (see below in “Increased low-theta frequency is a biomarker of fear learning”.)”

      Furthermore, we added the following sentence in the Results section “Increased low-theta frequency is a biomarker of fear learning”: 

      “The additional unresponsive ECS and F cells in the network were included to ensure we had not biased the LFP towards excitation.”

      (3) Applied currents are given as current densities, but these are difficult to compare with current levels observed from whole-cell patch clamp recordings. Can the currents be given as absolute levels, in pA/nA? 

      In principle, it is possible to connect current densities with absolute levels, as requested. However, we note that the number of cells in models is orders of magnitude smaller than the number being modeled. It is common in modeling to adjust physiological parameters to achieve the qualitative properties that are important to the model, rather than trying to exactly match particular recordings.

      We added to the Methods a description of why we chose units per unit area, rather than absolute units:

      “All the currents are expressed in units per area, rather than absolute units, to avoid making assumptions about the size of the neuron surface.”
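
      For a reader who nonetheless wants a rough absolute value, the conversion is the current density multiplied by an assumed membrane area. The sketch below assumes densities in µA/cm² and a spherical cell; both the unit and the geometry are assumptions introduced here, not values taken from the manuscript, which is exactly why the result depends on the cell size one picks.

```python
import math


def density_to_pA(density_uA_per_cm2, diameter_um):
    """Absolute current (pA) for a spherical cell of the given diameter."""
    radius_cm = 0.5 * diameter_um * 1e-4        # 1 um = 1e-4 cm
    area_cm2 = 4.0 * math.pi * radius_cm ** 2   # surface area of a sphere
    return density_uA_per_cm2 * area_cm2 * 1e6  # 1 uA = 1e6 pA


# Hypothetical example: 1 uA/cm^2 applied to cells of two different sizes.
small_cell_pA = density_to_pA(1.0, 20.0)
large_cell_pA = density_to_pA(1.0, 40.0)
```

      For a 20 µm sphere, 1 µA/cm² corresponds to about 12.6 pA, and doubling the diameter quadruples the absolute current. The same density therefore maps to very different patch-clamp-scale values depending on the assumed surface area.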

      (4) Regarding: "We note that the presence of SOM cells is crucial for plasticity in our model since they help to produce the necessary pauses in the excitatory projection cell activity. However, the high theta rhythm they produce is not crucial to the plasticity: in our model, high theta or higher frequency rhythms in SOM cells are all conducive to associative fear learning. This opens the possibility that the high theta rhythm in the BLA mostly originates in the prefrontal cortex and/or the hippocampus (Stujenske et al., 2014, 2022)." The chain of reasoning in the above statement is unclear. The second sentence seems to be saying contradictory things. 

      We agree that the sentence was confusing; thank you for pointing it out. We have revised the paragraph to make our point clearer. The central points are: 1) having the SOM cells in the BLA is critical to the plasticity in the model, and 2) these cells may or may not be the source of the high theta observed in the BLA during fear learning.

      We deleted from the discussion the text reported by the Reviewer, and we added the following one to make this point clearer:

      “We note that the presence of SOM cells is crucial for plasticity in our model since they help to produce the necessary pauses in the excitatory projection cell activity. The BLA SOM cells do not necessarily have to be the only source of the high theta observed in the BLA during fear learning; the high theta detected in the LFP of the BLA also originates from the prefrontal cortex and/or the hippocampus (Stujenske et al., 2014, 2022).”

      (5) Regarding: "This suggests low theta power change is not just an epiphenomenon but rather a biomarker of successful fear conditioning." Not sure this is the right framing for the above statement. The power of the theta signal in the LFP reflects the strengthening of connections, but it itself does not have an impact on network activity. Moreover, whether something is epiphenomenal is not relevant to the question of whether it can serve as a successful biomarker. A biomarker just needs to be indicative, not causal. 

      We intended to say why the low theta power change is a biomarker in the sense of the Reviewer. That is: experiments have shown that, with learning, the low theta power increases. The modeling shows in addition that, when learning does not take place, the low theta power does not increase. That means that the low theta power increases if and only if there is learning, i.e., the change in low theta power is a biomarker. To make our meaning clearer, we have changed the quoted sentences to read: 

      “This suggests that the low theta power change is a biomarker of successful fear conditioning: it occurs when there is learning and does not occur when there is no learning.”

      Reviewer #2 (Public Comments): 

      We thank the Reviewer for raising these interesting points. Below are our public replies and the changes we made to the manuscript to address the Reviewer’s objections.

      (1) Gamma oscillations are generated locally; thus, it is appropriate to model in any cortical structure. However, the generation of theta rhythms is based on the interplay of many brain areas therefore local circuits may not be sufficient to model these oscillations.

      Moreover, to generate the classical theta, a laminar structure arrangement is needed (where neurons form layers like in the hippocampus and cortex) (Buzsaki, 2002), which is clearly not present in the BLA. To date, I am not aware of any study which has demonstrated that theta is generated in the BLA. All studies that recorded theta in the BLA performed the recordings referenced to a ground electrode far away from the BLA, an approach that can easily pick up volume conducted theta rhythm generated e.g., in the hippocampus or other layered cortical structure. To clarify whether theta rhythm can be generated locally, one should have conducted recordings referenced to a local channel (see Lalla et al., 2017 eNeuro). In summary, at present, there is no evidence that theta can be generated locally within the BLA. Although there can be BLA neurons whose firing shows theta rhythmicity, e.g., driven by hippocampal afferents at theta rhythm, this does not mean that theta rhythm per se can be generated within the BLA, as the structure of the BLA does not support generation of rhythmic current dipoles. This questions the rationale of using theta as a proxy for BLA network function, which does not necessarily reflect the population activity of local principal neurons, in contrast to that seen in the hippocampus.

      In both modeling and experiments, a laminar structure does not seem to be needed to produce a theta rhythm. A recent experimental paper, (Antonoudiou et al. 2022), suggests that the BLA can intrinsically generate theta oscillations (3-12 Hz) detectable by LFP recordings under certain conditions, such as reduced inhibitory tone. The authors draw this conclusion from ex vivo slices from mice. The currents that generate these rhythms are in the BLA, since the hippocampus was removed to eliminate hippocampal volume conduction and other nearby brain structures did not display any oscillatory activity. Also, in the modeling literature, there are multiple examples of the production of theta rhythms in small networks not involving layers; these papers explain the mechanisms producing theta from non-laminated structures (Dudman et al., 2009, Kispersky et al., 2010, Chartove et al. 2020). We are not aware of any model description of the mechanisms of theta that does require layers.

      We added the following text in the introduction of the manuscript to make this point clearer:  “A recent rodent experimental study (Antonoudiou et al. 2022) suggests that BLA can intrinsically generate theta oscillations (3-12 Hz).”

      (2) The authors distinguished low and high theta. This may be misleading, as the low theta they refer to is basically a respiratory-driven rhythm typically present during an attentive state (Karalis and Sirota, 2022; Bagur et al., 2021, etc.). Thus, it would be more appropriate to use breathing-driven oscillations instead of low theta. Again, this rhythm is not generated by the BLA circuits, but by volume conducted into this region. Yet, the firing of BLA neurons can still be entrained by this oscillation. I think it is important to emphasize the difference.

      Many rhythms of the nervous system can be generated in multiple parts of the brain by multiple mechanisms. We do not dispute that low theta appears in the context of respiration; however, this does not mean that other rhythms with the same frequencies are driven by respiration. Indeed, in the response to question 1 above, we showed that theta can appear in the BLA without inputs from other regions. In our paper, the low theta is generated in the BLA by VIP neurons. Using intrinsic currents known to exist in VIP neurons (Porter et al., 1998), modeling has shown that such neurons can intrinsically produce a low theta rhythm. This is also shown in the current paper. This example is part of a substantial literature showing that there are multiple mechanisms for any given frequency band. 

      To elaborate more on this in the manuscript, we added the following new section in the discussion:

      “Where the rhythms originate, and by what mechanisms. A recent experimental paper, (Antonoudiou et al. 2022), suggests that the BLA can intrinsically generate theta oscillations (3-12 Hz) detectable by LFP recordings under certain conditions, such as reduced inhibitory tone. They draw this conclusion in mice by removing the hippocampus, which can volume conduct to BLA, and noticing that other nearby brain structures did not display any oscillatory activity. Our model also supports the idea that intrinsic mechanisms in the BLA can support the generation of the low theta, high theta, and gamma rhythms. 

      Although the BLA can produce these rhythms, this does not rule out that other brain structures also produce the same rhythms through different mechanisms, and these can be transmitted to the BLA. Specifically, it is known that the olfactory bulb produces and transmits the respiratory-related low theta (4 Hz) oscillations to the dorsomedial prefrontal cortex, where it organizes neural activity (Bagur et al., 2021). Thus, the respiratory-related low theta may be captured by BLA LFP because of volume conduction or through the BLA's extensive communication with the prefrontal cortex. Furthermore, high theta oscillations are known to be produced by the hippocampus during various brain functions and behavioral states, including during spatial exploration (Vanderwolf, 1969) and memory formation/retrieval (Raghavachari et al., 2001), which are both involved in fear conditioning. Similarly to the low theta rhythm, the hippocampal high theta can manifest in the BLA. It remains to be understood how these other rhythms may interact with the ones described in our paper.”

      We also note that the presence of D-currents in the BLA VIP interneurons should be confirmed experimentally, and that the ability of VIP interneurons to generate the BLA low theta rhythm constitutes a prediction of our computational model. These points are specified in the first paragraph in the Discussion entitled “Assumptions and predictions of the model”:

      “The interneuron descriptions in the model were constrained by the electrophysiological properties reported in response to hyperpolarizing currents (Sosulina et al., 2010). Specifically, we modeled the three subtypes of VIP, SOM, and PV interneurons displaying bursting behavior, regular spiking with early spike-frequency adaptation, and regular spiking without spike-frequency adaptation, respectively. Focusing on VIP interneurons, we were able to model the bursting behavior by including the D-type potassium current. This current is thought to exist in the VIP interneurons in the cortex (Porter et al., 1998), but whether this current is also found in the VIP interneurons of the BLA is still unknown. Similarly, we endowed the SOM interneurons with NaP- and H-currents, as in the OLM cells in the hippocampus. Due to these currents, the VIP and SOM cells are able to show low- and high-theta oscillations, respectively. The presence of these currents and the neurons’ ability to exhibit oscillations in the theta range during fear conditioning and at baseline in BLA, which are assumptions of our model, should be tested experimentally.”

      (3) The authors implemented three interneuron types in their model, ignoring a large fraction of GABAergic cells present in the BLA (Vereczki et al., 2021). Recently, the microcircuit organization of the BLA has been more thoroughly uncovered, including connectivity details for PV+ interneurons, firing features of neurochemically identified interneurons (instead of mRNA expression-based identification, Sosulina et al., 2010), synaptic properties between distinct interneuron types as well as principal cells and interneurons using paired recordings. These recent findings would be vital to incorporate into the model instead of using results obtained in the hippocampus and neocortex. I am not sure that a realistic model can be achieved by excluding many interneuron types.

      The interneurons and connectivity that we used were inspired by the functional connectivity reported in (Krabbe et al., 2019) (see above answer to Reviewer #1). As reported in (Vereczki et al., 2021), there are multiple categories and subcategories of interneurons; that paper does not report on which ones are essential for fear conditioning. We did use all the highly represented categories of the interneurons, except NPY-containing neurogliaform cells.

      The Reviewer says “I am not sure that a realistic model can be achieved by excluding many interneuron types”. We agree with the Reviewer that omitting other interneuron subtypes and a description of more specific connectivity (soma-, dendrite-, and axon-targeting connections) may limit the ability of our model to describe all the details in the BLA. However, this work represents a first effort towards a biophysically detailed description of the BLA rhythms and their function. As in any modeling approach, assumptions about what to describe and test are determined by the scientific question; details postulated to be less relevant are omitted to obtain clarity. The interneuron subtypes we modeled, especially VIP+ and PV+, have been reported to have a crucial role in fear conditioning (Krabbe et al., 2019). Other interneurons, e.g. cholecystokinin and SOM+, have been suggested as essential in fear extinction. Thus, in the follow-up of this work to explain fear extinction, we will introduce other cell types and connectivity. In the current work, we have achieved our goals of explaining the origin of the experimentally found rhythms and their roles in the production of plasticity underlying fear learning. Of course, a more detailed model may reveal flaws in this explanation, but this is science that has not yet been done.

      We elaborate more on this in a new section in the Discussion entitled “Assumptions and predictions of the model”. The paragraph related to this point reads as follows:

      “Our model, which is a first effort towards a biophysically detailed description of the BLA rhythms and their functions, does not include the neuron morphology, many other cell types, conductances, and connections that are known to exist in the BLA; models such as ours are often called “minimal models” and constitute the majority of biologically detailed models. Such minimal models are used to maximize the insight that can be gained by omitting details whose influence on the answers to the questions addressed in the model are believed not to be qualitatively important. We note that the absence of these omitted features constitutes hypotheses of the model: we hypothesize that the absence of these features does not materially affect the conclusions of the model about the questions we are investigating. Of course, such hypotheses can be refuted by further work showing the importance of some omitted features for these questions and may be critical for other questions. Our results hold when there is some degree of heterogeneity of cells of the same type, showing that homogeneity is not a necessary condition.”

      (4) The authors set the reversal potential of GABA-A receptor-mediated currents to -80 mV. What was the rationale for choosing this value? The reversal potential of IPSCs has been found to be -54 mV in fast-spiking (i.e., parvalbumin) interneurons and around -72 mV in principal cells (Martina et al., 2001, Veres et al., 2017).

      A GABA-A reversal potential around -80 mV is common in the modeling literature (Jensen et al., 2005; Traub et al., 2005; Kumar et al., 2011; Chartove et al., 2020). Other computational works of the amygdala, e.g. (Kim et al., 2016), consider GABA-A reversal potential at -75 mV based on the cortex (Durstewitz et al., 2000). The papers cited by the reviewer have a GABA-A reversal potential of -72 mV for synapses onto pyramidal cells; this is sufficiently close to our model that it is not likely to make a difference. For synapses onto PV+ cells, the papers cited by the reviewer suggest that the GABA-A reversal potential is -54 mV; such a reversal potential would lead these synapses to be excitatory instead of inhibitory. However, it is known (Krabbe et al., 2019; Supp. Fig. 4b) that such synapses are in fact inhibitory. Thus, we wonder if the measurements of Martina and Veres were made in a condition very different from that of Krabbe. For all these reasons, we consider a GABA-A reversal potential around -80 mV in amygdala to be a reasonable assumption.

      In section “Network connectivity and synaptic currents” in “Materials and Methods” we provided references to motivate our choice of considering a GABA-A reversal potential around -80 mV:

      “The GABAa current reversal potential (E_i) is set to −80 mV, as common in the modeling literature (Jensen et al., 2005; Traub et al., 2005; Kumar et al., 2011; Chartove et al., 2020).”
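
      The reversal-potential argument above can be illustrated with a minimal sketch, which is not taken from the manuscript. It assumes an ohmic synapse, I = g·s·(V − E_rev), and a hypothetical postsynaptic potential of −65 mV; the function names, the unit conductance, and that potential are illustrative assumptions. It only reports whether the current is hyperpolarizing or depolarizing at that potential. Whether a depolarizing GABA-A current is functionally excitatory also depends on spike threshold and shunting, which the sketch does not model.

```python
def gabaa_current(v_mV, e_rev_mV, g_nS=1.0, s=1.0):
    """Ohmic synaptic current I = g * s * (V - E_rev), in pA (nS * mV).

    g_nS and s (gating variable) are illustrative values, not model parameters.
    """
    return g_nS * s * (v_mV - e_rev_mV)


def effect(v_mV, e_rev_mV):
    """Direction in which the synapse pushes the membrane potential.

    With C dV/dt = -I_syn + ..., a positive I_syn pulls V down towards E_rev.
    """
    i_syn = gabaa_current(v_mV, e_rev_mV)
    if i_syn > 0:
        return "hyperpolarizing"
    if i_syn < 0:
        return "depolarizing"
    return "none"


# Hypothetical postsynaptic membrane potential (an assumption for illustration).
V_HYPOTHETICAL_MV = -65.0

# Reversal potentials discussed in the text: model value and the two reported values.
for e_rev in (-80.0, -72.0, -54.0):
    print(e_rev, gabaa_current(V_HYPOTHETICAL_MV, e_rev), effect(V_HYPOTHETICAL_MV, e_rev))
```

      At the assumed potential, the −80 mV and −72 mV reversal potentials both give a hyperpolarizing current, whereas −54 mV gives a depolarizing one. This is the distinction the response draws between synapses onto pyramidal cells and synapses onto PV+ cells.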

      (5) Proposing neuropeptide VIP as a key factor for learning is interesting. Though, it is not clear why this peptide is more important in fear learning in comparison to SST and CCK, which are also abundant in the BLA and can effectively regulate the circuit operation in cortical areas.

      Other peptides seem to be important in overall modulation of fear, but VIP is especially important in the first part of fear learning, the subject of our paper. Re SST: we hypothesize that SST interneurons are critical in fear extinction and in preventing fear generalization, but not in initial fear learning. The peptide of the CCK neurons, which overlap with VIP cells, has been proposed to promote the switch between fear and safety states after fear extinction (Krabbe et al., 2018). Thus, these other peptides are likely more important for other aspects of fear learning.

      In the Discussion, we have added:

      “We hypothesize that SST peptide is critical in fear extinction and in preventing fear generalization, but not in initial fear learning. Also, the CCK peptide has been proposed to promote the switch between fear and safety states after fear extinction (Krabbe et al., 2018).”

      Reviewer #2 (Recommendations For The Authors): 

      We note that Reviewer #2’s Recommendations For The Authors have the same content as the Public Comments. Thus, the changes to the manuscript we implemented above address also the private critiques listed below.

      (1) As the breathing-driven rhythm is a global phenomenon accompanying fear state, one might restrict the analysis to this oscillation. The rationale behind this restriction is that the 'high' theta in the BLA has an unknown origin (since it can originate from the ventral hippocampus, piriform cortex etc.).

      In response to point 4 made by Reviewer 1 (Recommendations for the Authors) (p. 13), referring to high theta in the BLA, we previously wrote: 1) having the SOM cells in the BLA is critical to the plasticity in the model, and 2) these cells may or may not be the source of the high theta observed in the BLA during fear learning.

      In the Public Critiques, Reviewer 2 relates the respiratory rhythm to the low theta. We answered this point in point 2 of the Reviewer’s Public Comments (at p. 15).

      (2) I would include more interneurons in the network model incorporating recent findings. 

      This point was answered in our response to point 3 of the Reviewer’s Public Comments.

      (3) The reversal potential for GABA-A receptor-mediated currents would be good to set to measured values. In addition, I would use AMPA conductance values that have been measured in the BLA. 

      We addressed this objection in our response to point 4 of the Reviewer’s Public Comments.

      Reviewer #3 (Public comments):

      Weaknesses: 

      (1) The main weakness of the approach is the lack of experimental data from the BLA to constrain the biophysical models. This forces the authors to use models based on other brain regions and leaves open the question of whether the model really faithfully represents the basolateral amygdala circuitry. 

      (2) Furthermore, the authors chose to use model neurons without a representation of the morphology. However, given that PV+ and SOM+ cells are known to preferentially target different parts of pyramidal cells and given that the model relies on a strong inhibition from SOM to silence pyramidal cells, the question arises whether SOM inhibition at the apical dendrite in a model representing pyramidal cell morphology would still be sufficient to provide enough inhibition to silence pyramidal firing.

      (3) Lastly, the fear learning relies on the presentation of the unconditioned stimulus over a long period of time (40 seconds). The authors justify this long-lasting input as reflecting not only the stimulus itself but as a memory of the US that is present over this extended time period. However, the experimental evidence for this presented in the paper is only very weak.

      We are repeating here the answers we gave in response to the public comments, adding further relevant points.

      (1) Our neurons were constrained by electrophysiological properties in response to hyperpolarizing currents in the BLA (Sosulina et al., 2010). We can reproduce these electrophysiological properties by using specific membrane currents known to be present in similar neurons in other brain regions (D-current in VIP interneurons in the cortex, and NaP- and H-currents in OLM/SOM cells in the hippocampus). Also, though a much more detailed description of BLA interneurons was given in (Vereczki et al., 2021), it is not clear that this level of detail is relevant to the questions that we were asking, especially since the experiments described were not done in the context of fear learning.

      (2) It is true that we did not include the morphology, which undoubtedly makes a difference to some aspects of the circuit dynamics. Furthermore, it is correct that the model relies on a strong inhibition from SOM and PV to silence the excitatory projection neurons. We agree that the placement of the SOM inhibition on the pyramidal neurons can make a difference to some aspects of the circuit behavior. We are assuming that the inhibition from the SOM cells can suppress pyramidal cell firing, which can be seen as a hypothesis of our model. It is well known that VIP cells disinhibit pyramidal cells through inhibition of SOM and PV cells (Krabbe et al. 2019); hence, this hypothesis is generally accepted. This choice of parameters comes from using simplified models: it is standard in modeling to adjust parameters to compensate for simplifications.

      Re points 1) and 2), in a new paragraph (“Assumptions and predictions of the model”) in the Discussion reported in response to Reviewer #2 (public comments)’s point 3, we stated that modeling requires the omission of many details to bring out the significance of other details.

      (3) 40 seconds is the temporal interval we decided to use to present the results. In the Results, we also showed that there is learning over a shorter interval of time (15 seconds) where CS and US/memory of US should both be present. Thus, our model requires 15 seconds over a single or multiple trials for associative learning to be established. We included references to additional experimental papers to support our reasoning in the last paragraph of section “Assumptions and predictions of the model” in the Discussion, also reported in response to Reviewer #1 point 2 (Recommendations for the Authors). We said there that some form of memory or overlap in the activity of the excitatory projection neurons is necessary for spike-timing-dependent plasticity.

      The authors achieved the aim of constructing a biophysically detailed model of the BLA not only capable of fear learning but also showing spectral signatures seen in vivo. The presented results support the conclusions with the exception of a potential alternative circuit mechanism demonstrating fear learning based on a classical Hebbian (i.e. non-depression-dominated) plasticity rule, which would not require the intricate interplay between the inhibitory interneurons. This alternative circuit is mentioned but a more detailed comparison between it and the proposed circuitry is warranted.

      Our model accounts for the multiple rhythms observed in the context of fear learning, as well as the known involvement of multiple kinds of interneurons. We did not say explicitly enough why our complicated model may be functionally important in ways that cannot be fulfilled with a simpler model with the non-depression-dominated Hebbian rule. To explain this, we have added the following in the manuscript discussion:

      “Although fear learning can occur without the depression-dominated rule, we hypothesize that it is necessary for other aspects of fear learning and regulation. That is, in pathological cases, there can be overgeneralization of learning. We hypothesize that the modulation created by the involvement of these interneurons is normally used to prevent such overgeneralization. However, this is beyond the scope of the present paper.”

      We have also written an extra paragraph about generalization in the Discussion “Synaptic plasticity in our model”:

      “With the classical Hebbian plasticity rule, we show that learning can occur without the involvement of the VIP and SOM cells. Although fear learning can occur without the depression-dominated rule, we hypothesize that the latter is necessary for other aspects of fear learning and regulation. Generalization of learning can be pathological, and we hypothesize that the modulation created by the involvement of VIP and SOM interneurons is normally used to prevent such overgeneralization. However, in some circumstances, it may be desirable to account for many possible threats, and then a classical Hebbian plasticity rule could be useful. We note that the involvement or not of the VIP-SOM circuit has been implicated when there are multiple strategies for solving a task (Piet et al., 2024). In our situation, the nature of the task (including reward structure) may determine whether the learning rule is depression-dominated and therefore whether the VIP-SOM circuit plays an important role.”

      Reviewer #3 (Recommendations For The Authors): 

      We thank the Reviewer for all the recommendations. We replied to each of them below.

      In general, there are some inconsistencies in the naming (e.g. sometimes you write PV sometimes PV+,...), please use consistent abbreviations throughout the manuscript. You also introduce some of the abbreviations multiple times. 

      We modified the manuscript to remove all the inconsistencies in the naming. 

      Introduction: 

      - In the last section you speak about one recent study but actually cite two articles. 

      We removed the reference to (Perrenoud and Cardin, 2023), which is a commentary on the Veit et al. article.

      Results: 

      - 'Brain rhythms are thought to be encoded and propagated largely by interneurons' What do you mean by encoded here? 

      We agree with the Reviewer that the verb “to encode” is not accurate. We modified the sentence as follows:

      “Brain rhythms are thought to be generated and propagated largely by interneurons”.

      - The section 'Interneurons interact to modulate fear neuron output' could be clearer. Start with describing the elements of the circuit, then the rhythms in the baseline. 

      We reorganized the section as follows:

      “Interneurons interact to modulate fear neuron output. Our BLA network consists of interneurons, detailed in the previous section, and excitatory projection neurons (Fig. 2A). Both the fear-encoding neuron (F), an excitatory projection neuron, and the VIP interneuron are activated by the noxious stimulus US (Krabbe et al., 2019). As shown in Fig. 2A (top, right), VIP disinhibits F by inhibiting both SOM and PV, as suggested in (Krabbe et al., 2019). We do not include connections from PV to SOM and VIP, nor connections from SOM to PV and VIP, since those connections have been shown to be significantly weaker than the ones included (Krabbe et al., 2019). The simplest network we consider is made of one neuron for each cell type. We introduce a larger network with some heterogeneity in the last two sections of the Results.

      Fig. 2A (bottom) shows a typical dynamic of the network before and after the US input onset, with US modeled as a Poisson spike train at ~50 Hz; the network produces all the rhythms originating from the interneurons alone or through their interactions with the excitatory projection neurons (shown in Fig. 1). Specifically, since VIP is active at low theta during both rest and upon the injection of US, it then modulates F at low theta cycles via SOM and PV. In the baseline condition, the VIP interneuron has short gamma bursts nested in low theta rhythm. With US onset, VIP increases its burst duration and the frequency of low theta rhythm. These longer bursts make the SOM cell silent for long periods of each low theta cycle, providing F with windows of disinhibition and contributing to the abrupt increase in activity right after the US onset. Finally, in Fig. 2A, PV lacks any external input and fires only when excited by F. Thanks to their reciprocal interactions, PV forms a PING rhythm with F, as depicted in Fig.1C.”
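
      The US input described above, a Poisson spike train at roughly 50 Hz, can be sketched as follows. This is a minimal illustration, not the authors' implementation: it draws exponential inter-spike intervals to produce a homogeneous Poisson process, and the function name, seed, and use of the 40-second presentation window are assumptions made for the example.

```python
import random


def poisson_spike_train(rate_hz, duration_s, seed=0):
    """Return sorted spike times (s) of a homogeneous Poisson process.

    Inter-spike intervals are drawn from an exponential distribution with
    mean 1 / rate_hz, which is the defining property of a Poisson process.
    """
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    t = 0.0
    spike_times = []
    while True:
        t += rng.expovariate(rate_hz)
        if t >= duration_s:
            return spike_times
        spike_times.append(t)


# Hypothetical usage: a ~50 Hz input lasting 40 s.
us_spikes = poisson_spike_train(rate_hz=50.0, duration_s=40.0)
print(len(us_spikes) / 40.0)  # empirical rate in Hz, close to 50 on average
```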

      - Figure 3C: The lower dashed line has the tick label '0.37' which should read '0.037'. 

      We fixed it.

      - The section describing the network with multiple neurons could be clearer, especially, it is not really clear how these different ECS and F neurons receive their input. 

      We answered the same objection in the reply to Reviewer #1 in point 2 under “minor issues.”

      Discussion: 

      - The paragraph 'It has also been suggested that ventral tegmental area has a role in fear expression (Lesas et al.,2023). Furthermore, it has been reported that the prelimbic cortex (PL) modulates the BLA SOM cells during fear retrieval, and the latter cells are crucial to discriminate non-threatening cues when desynchronized by the PL inputs (Stujenske et al., 2022).' is merely stating facts but I don't see how they relate to the presented work. 

      We thank the Reviewer for pointing out that this was confusing. What we meant to emphasize was that later stages of fear conditioning and extinction appear to require more than the BLA. We specifically mention the discrimination of non-threatening cues at the end of the paragraph, which now reads as follows:

      “Other brain structures may be involved in later stages of fear responsiveness, such as fear extinction and prevention of generalization. It has been reported that the prelimbic cortex (PL) modulates the BLA SOM cells during fear retrieval, and the latter cells are crucial to discriminate non-threatening cues when desynchronized by the PL inputs (Stujenske et al., 2022). Brain structures such as the prefrontal cortex and hippocampus have been documented to play a crucial role also in fear extinction, the paradigm following fear conditioning aimed at decrementing the conditioned fearful response through repeated presentations of the CS alone. As reported by several studies, fear extinction suppresses the fear memory through the acquisition of a distinct memory, instead of through the erasure of the fear memory itself (Harris et al., 2000; Bouton, 2002; Trouche et al., 2013; Thompson et al., 2018). Davis et al., 2017 found a high theta rhythm following fear extinction that was associated with the suppression of threat in rodents. Our model can be extended to include structures in the prefrontal cortex and the hippocampus to further investigate the role of rhythms in the context of discrimination of non-threatening cues and extinction. We hypothesize that a different population of PV interneurons plays a crucial role in mediating competition between fearful memories, associated with a low theta rhythm, and safety memories, associated with a high theta rhythm; supporting experimental evidence is in (Lucas et al., 2016; Davis et al., 2017; Chen et al., 2022).”

      - The comparison to other BLA models is quite short and seems a bit superficial. A more in-depth comparison seems warranted.

      We thank the reviewer for suggesting that a more in-depth comparison between our and other models in the literature would improve the manuscript. We rewrote entirely the first paragraph of that section. The new content reads as follows:

      “Comparison with other models. Many computational models that study fear conditioning have been proposed in recent years; the list includes biophysically detailed models (e.g., (Li 2009; Kim et al., 2013a)), firing rate models (e.g., Krasne 2011; Ball 2012; Vlachos 2011), and connectionist models (e.g., Moustafa 2013; Armony 1997; Edeline 1992) (for a review see (Nair et al., 2016)). Both firing rate models and connectionist models use an abstract description of the interacting neurons or regions. The omission of biophysical details prevents such models from addressing questions concerning the roles of dynamics and biophysical details in fear conditioning, which is the aim of our model. There are also biophysically detailed models (Li 2009; Kim 2013; Kim 2016; Feng 2019), which differ from ours in both the physiology included in the model and the description of how plastic changes take place. One main difference in the physiology is that we differentiated among types of interneurons, since the fine timing produced by the latter was key to our use of rhythms to produce spike-timing-dependent plasticity. The origin of the gamma rhythm (but not the other rhythms) was investigated in Feng et al 2019, but none of these papers connected the rhythms to plasticity.

      The most interesting difference between our work and that in (Li 2009; Kim 2013; Kim 2016) is the modeling of plasticity. We use spike-timing-dependent plasticity (STDP) rules. The models in (Li 2009; Kim 2013; Kim 2016) were more mechanistic about how the plasticity takes place, starting with the known involvement of calcium with plasticity. Using a hypothesis about back propagation of spikes, the set of papers together come up with a theory that is consistent with STDP and other instantiations of plasticity (Shouval 2002a; Shouval 2002b). For the purposes of our paper, this level of detail, though very interesting, was not necessary for our conclusions. By contrast, in order for the rhythms and the interneurons to have the dynamic roles they play in the model, we needed to restrict our STDP rule to ones that are depression-dominated. Our reading of (Shouval 2002) suggests to us that such subrules are possible outcomes of the general theory. Thus, there is no contradiction between the models, just a difference in focus; our focus was on the importance of the much-documented rhythms (Seidenbecher et al., 2003; Courtin et al., 2014b; Stujenske et al., 2014; Davis et al., 2017) in providing the correct spike timing. We showed in the Supplementary Information (“Classical Hebbian plasticity rule, unlike the depression-dominated one, shows potentiation even with no strict pre and postsynaptic spike timing”) that if the STDP rule is not depression-dominated, the rhythms may not be necessary. We hypothesize that the necessity of strict timing enforced by the depression-dominated rule may foster the most appropriate association with fear at the expense of less relevant associations.”
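
      The contrast between a classical and a depression-dominated STDP rule can be sketched with a generic pair-based exponential kernel. This is an illustration under stated assumptions and not the rule or the parameters used in the manuscript: here "depression-dominated" is taken to mean that the area of the depression lobe exceeds that of the potentiation lobe, and all amplitudes and time constants are hypothetical.

```python
import math


def stdp_dw(dt_ms, a_plus, a_minus, tau_plus_ms, tau_minus_ms):
    """Weight change for one spike pair, with dt_ms = t_post - t_pre."""
    if dt_ms > 0:  # pre before post: potentiation
        return a_plus * math.exp(-dt_ms / tau_plus_ms)
    if dt_ms < 0:  # post before pre: depression
        return -a_minus * math.exp(dt_ms / tau_minus_ms)
    return 0.0


def kernel_integral(a_plus, a_minus, tau_plus_ms, tau_minus_ms):
    """Net area under the kernel; negative means depression dominates."""
    return a_plus * tau_plus_ms - a_minus * tau_minus_ms


def mean_dw_uncorrelated(params, window_ms=100):
    """Average weight change when spike lags are spread evenly over a window,
    a stand-in for pre/post spiking without strict relative timing."""
    lags = [k for k in range(-window_ms, window_ms + 1) if k != 0]
    return sum(stdp_dw(float(k), **params) for k in lags) / len(lags)


# Hypothetical parameter sets, chosen only to illustrate the two regimes.
CLASSICAL = dict(a_plus=1.0, a_minus=0.5, tau_plus_ms=20.0, tau_minus_ms=20.0)
DEPRESSION_DOMINATED = dict(a_plus=1.0, a_minus=1.0, tau_plus_ms=10.0, tau_minus_ms=30.0)

for name, p in (("classical", CLASSICAL), ("depression-dominated", DEPRESSION_DOMINATED)):
    # Columns: kernel area, mean change for untimed spiking, change for a tight +5 ms pair.
    print(name, kernel_integral(**p), mean_dw_uncorrelated(p), stdp_dw(5.0, **p))
```

      Under these assumed parameters, untimed spiking yields net potentiation for the classical kernel and net depression for the depression-dominated one, while a tight pre-before-post pair potentiates in both cases. This matches the qualitative point that strict timing matters only when depression dominates.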

      - The paragraph 'This could happen among some cells responding to weaker sensory inputs that do not lead to pre-post timing with fear neurons. This timing could be modified by the "tri-conditional rule", as suggested in (Grewe et al., 2017).' is not very clear. What exactly is 'this' in the first sentence referring to? If you mention the 'tri-conditional rule' here, please briefly explain it and how it would solve the issue at hand here.

      We apologize that the sentence reported was not sufficiently clear. “This” refers to “depression”. We meant that, in our model, depression during fear conditioning happens every time there is no pre-post timing between neurons encoding the neutral stimuli and fear cells; poor pre-post timing can characterize the activity of neurons responding to weaker sensory inputs and does not lead to associative learning. We modified that paragraph as follows:

      “The study in (Grewe et al., 2017) suggests that associative learning resulting from fear conditioning induces both potentiation and depression among coactive excitatory neurons; coactivity was determined by calcium signaling and thus did not allow measurements of fine timing between spikes. In our model, we show how potentiation between coactive cells occurs when strict pre-post spike timing and appropriate pauses in the spiking activity arise. Depression happens when one or both of these components are not present. Thus, in our model, depression represents the absence of successful fear association and does not take part in the reshaping of the ensemble encoding the association, as instead suggested in (Grewe et al., 2017). A possible follow-up of our work involves investigating how fear ensembles form and modify through fear conditioning and later stages. This follow-up work may involve using a tri-conditional rule, as suggested in (Grewe et al. 2017), in which the potential role of neuromodulators is taken into account in addition to the pre- and postsynaptic neuron activity; this may lead to both potentiation and depression in establishing an associative memory.”

      - In the limitations and caveats section you mention that the small size of the network implies that they represent a synchronous population. What are the potential implications for the proposed rhythm-dependent mechanism? What are your expectations for larger networks? 

      We apologize if we were not adequately clear. We are guessing that the Reviewer thought we meant the entire population was synchronous, which it is not. We meant that, when we use a single cell to represent a subpopulation of cells of that type, that subpopulation is effectively synchronous. For larger networks in which each subtype is represented by many cells, there can be heterogeneity within each subtype. We have shown in the paper that the basic results still hold under some heterogeneity; however, they may fail if the heterogeneity is too large.

      We mentioned this in a new section named “Assumptions and predictions of the model”, reported in our response to point 3 made by Reviewer #2.

      - The discussion is also missing a section on predictions/new experiments that can be derived from the model. How can the model be confirmed, what experiments/results would break the model? 

      To answer this question, we added a new section in the Discussion entitled “Assumptions and predictions of the model”. The first paragraph of this section is in the reply to Reviewer #2 point 2; the second paragraph is in the reply to Reviewer #2 point 3; the last paragraph is in the reply to Reviewer #1 point c; the rest of the section reads as follows:

      “Our study suggests that all the interneurons are necessary for associative learning provided that the STDP rule is depression-dominated. This prediction could be tested experimentally by selectively silencing each interneuron subtype in the BLA: if the associative learning is hampered by silencing any of the interneuron subtypes, this validates our study. Finally, the model prediction could be tested indirectly by acquiring more information about the plasticity rule involved in the BLA during associative learning. We found that all the interneurons are necessary to establish fear learning only in the case of a depression-dominated rule. This rule ensures that fine timing and pauses are always required for potentiation: interneurons provide both fine timing and pauses to pyramidal cells, making them crucial components of the fear circuit. 

      The modeling of the interneurons assumes the involvement of various intrinsic currents; the inclusion of those currents can be considered hypotheses of the model. Our model predicts that blockade of D-current in VIP interneurons (or silencing VIP interneurons) will both diminish low theta and prevent fear learning. Finally, the model assumes the absence of significantly strong connections from the excitatory projection cells ECS to PV interneurons, unlike the ones from F to PV. Including those synapses would alter the PING rhythm created by the interactions between F and PV, which is crucial for fine timing between ECS and F needed for LTP.”

    1. Author response:

      The following is the authors’ response to the original reviews.

      Summary of reviewers’ comments and our revisions: 

      We thank the reviewers for their thoughtful feedback. This feedback has motivated multiple revisions and additions that, in our view, have greatly improved the manuscript. This is especially true with regard to a major goal of this study: clearly defining existing scientific perspectives and delineating their decoding implications. In addition to building on this conceptual goal, we have expanded existing analyses and have added a new analysis of generalization using a newly collected dataset. We expect the manuscript will be of very broad interest, both to those interested in BCI development and to those interested in fundamental properties of neural population activity and its relationship with behavior.

      Importantly, all reviewers were convinced that MINT provided excellent performance, when benchmarked against existing methods, across a broad range of standard tasks:

      “their method shows impressive performance compared to more traditional decoding approaches” (R1) 

      “The paper was thorough in considering multiple datasets across a variety of behaviors, as well as existing decoding methods, to benchmark the MINT approach. This provided a valuable comparison to validate the method.” (R2) 

      “The fact that performance on stereotyped tasks is high is interesting and informative…” (R3)

      This is important. It is challenging to design a decoder that performs consistently across multiple domains and across multiple situations (including both decoding and neural state estimation). MINT does so. MINT consistently outperformed existing lightweight ‘interpretable’ decoders, despite being a lightweight interpretable decoder itself. MINT was very competitive with expressive machine-learning methods, yet has advantages in flexibility and simplicity that more ‘brute force’ methods do not. We made a great many comparisons, and MINT was consistently a strong performer. Of the many comparisons we made, there was only one where MINT was at a modest disadvantage, and it was for a dataset where all methods performed poorly. No other method we tested was as consistent. For example, although the GRU and the feedforward network were often competitive with MINT (and better than MINT in the one case mentioned above), there were multiple other situations where they performed less well and a few situations where they performed poorly. Moreover, no other existing decoder naturally estimates the neural state while also readily decoding, without retraining, a broad range of behavioral variables.

      R1 and R2 were very positive about the broader impacts of the study. They stressed its impact both on decoder design, and on how our field thinks, scientifically, about the population response in motor areas: 

      “This paper presents an innovative decoding approach for brain-computer interfaces” (R1)

      “presents a substantial shift in methodology, potentially revolutionizing the way BCIs interpret and predict neural behaviour” (R1)

      “the paper's strengths, particularly its emphasis on a trajectory-centric approach and the simplicity of MINT, provide a compelling contribution to the field” (R1)

      “The authors made strong arguments, supported by evidence and literature, for potentially high-dimensional neural states and thus the need for approaches that do not rely on an assumption of low dimensionality” (R2)

      “This work is motivated by brain-computer interfaces applications, which it will surely impact in terms of neural decoder design.” (R2)

      “this work is also broadly impactful for neuroscientific analysis... Thus, MINT will likely impact neuroscience research generally.” (R2)

      We agree with these assessments, and have made multiple revisions to further play into these strengths. As one example, the addition of Figure 1b (and 6b) makes this the first study, to our knowledge, to fully and concretely illustrate this emerging scientific perspective and its decoding implications. This is important, because multiple observations convince us that the field is likely to move away from the traditional perspective in Figure 1a, and towards that in Figure 1b. We also agree with the handful of weaknesses R1 and R2 noted. The manuscript has been revised accordingly. The major weakness noted by R1 was the need to be explicit regarding when we suspect MINT would (and wouldn’t) work well in other brain areas. In non-motor areas, the structure of the data may be poorly matched with MINT’s assumptions. We agree that this is likely to be true, and thus agree with the importance of clarifying this topic for the reader. The revision now does so. R1 also wished to know whether existing methods might benefit from including trial-averaged data during training, something we now explore and document (see detailed responses below). R2 noted two weaknesses: 1) The need to better support (with expanded analysis) the statement that neural and behavioral trajectories are non-isometric, and 2) The need to more rigorously define the ‘mesh’. We agree entirely with both suggestions, and the revision has been strengthened by following them (see detailed responses below).

      R3 also saw strengths to the work, stating that:

      “This paper is well-structured and its main idea is clear.” 

      “The fact that performance on stereotyped tasks is high is interesting and informative, showing that these stereotyped tasks create stereotyped neural trajectories.” 

      “The task-specific comparisons include various measures and a variety of common decoding approaches, which is a strength.”

      However, R3 also expressed two sizable concerns. The first is that MINT might have onerous memory requirements. The manuscript now clarifies that MINT has modest memory requirements. These do not scale unfavorably as the reviewer was concerned they might. The second concern is that MINT is: 

      “essentially a table-lookup rather than a model.”

      Although we don’t agree, the concern makes sense and may be shared by many readers, especially those who take a particular scientific perspective. Pondering this concern thus gave us the opportunity to modify the manuscript in ways that support its broader impact. Our revisions had two goals: 1) clarify the ways in which MINT is far more flexible than a lookup-table, and 2) better describe the dominant scientific perspectives and their decoding implications.

      The heart of R3’s concern is the opinion that MINT is an effective but unprincipled hack suitable for situations where movements are reasonably stereotyped. Of course, many tasks involve stereotyped movements (e.g. handwriting characters), so MINT would still be useful. Nevertheless, if MINT is not principled, other decode methods would often be preferable because they could (unlike MINT in R3’s opinion) gain flexibility by leveraging an accurate model. Most of R3’s comments flow from this fundamental concern: 

      “This is again due to MINT being a lookup table with a library of stereotyped trajectories rather than a model.”

      “MINT models task-dependent neural trajectories, so the trained decoder is very task-dependent and cannot generalize to other tasks.”

      “Unlike MINT, these works can achieve generalization because they model the neural subspace and its association to movement.”

      “given that MINT tabulates task-specific trajectories, it will not generalize to tasks that are not seen in the training data even when these tasks cover the exact same space (e.g., the same 2D computer screen and associated neural space).”

      “For proper training, the training data should explore the whole movement space and the associated neural space, but this does not mean all kinds of tasks performed in that space must be included in the training set (something MINT likely needs while modeling-based approaches do not).”

      The manuscript has been revised to clarify that MINT is considerably more flexible than a lookup table, even though a lookup table is used as a first step. Yet, on its own, this does not fully address R3’s concern. The quotes above highlight that R3 is making a standard assumption in our field: that there exists a “movement space and associated neural space”. Under this perspective, one should, as R3 argues, fully explore the movement space. This would perforce fully explore the associated neural subspace. One can then “model the neural subspace and its association to movement”. MINT does not use a model of this type, and thus (from R3’s perspective) does not appear to use a model at all. A major goal of our study is to question this traditional perspective. We have thus added a new figure to highlight the contrast between the traditional (Figure 1a) and new (Figure 1b) scientific perspectives, and to clarify their decoding implications.

      While we favor the new perspective (Figure 1b), we concede that R3 may not share our view. This is fine. Part of the reason we believe this study is timely, and will be broadly read, is that it raises a topic of emerging interest where there is definitely room for debate. If we are misguided – i.e. if Figure 1a is the correct perspective – then many of R3’s concerns would be on target: MINT could still be useful, but traditional methods that make the traditional assumptions in Figure 1a would often be preferable. However, if the emerging perspective in Figure 1b is more accurate, then MINT’s assumptions would be better aligned with the data than those of traditional methods, making it a more (not less) principled choice.

      Our study provides new evidence in support of Figure 1b, while also synthesizing existing evidence from other recent studies. In addition to Figure 2, the new analysis of generalization further supports Figure 1b. Also supporting Figure 1b is the analysis in which MINT’s decoding advantage, over a traditional decoder, disappears when simulated data approximate the traditional perspective in Figure 1a.

      That said, we agree that the present study cannot fully resolve whether Figure 1a or 1b is more accurate. Doing so will take multiple studies with different approaches (indeed we are currently preparing other manuscripts on this topic). Yet we still have an informed scientific opinion, derived from past, present and yet-to-be-published observations. Our opinion is that Figure 1b is the more accurate perspective. This possibility makes it reasonable to explore the potential virtues of a decoding method whose assumptions are well-aligned with that perspective. MINT is such a method. As expected under Figure 1b, MINT outperforms traditional interpretable decoders in every single case we studied. 

      As noted above, we have added a new generalization-focused analysis (Figure 6) based on a newly collected dataset. We did so because R3’s comments highlight a deep point: which scientific perspective one takes has strong implications regarding decoder generalization. These implications are now illustrated in the new Figure 6a and 6b. Under Figure 6a, it is possible, as R3 suggests, to explore “the whole movement space and associated neural space” during training. However, under Figure 6b, expectations are very different. Generalization will be ‘easy’ when new trajectories are near the training-set trajectories. In this case, MINT should generalize well as should other methods. In contrast, generalization will be ‘hard’ when new neural trajectories have novel shapes and occupy previously unseen regions / dimensions. In this case, all current methods, including MINT, are likely to fail. R3 points out that traditional decoders have sometimes generalized well to new tasks (e.g. from center-out to ‘pinball’) when cursor movements occur in the same physical workspace. These findings could be taken to support Figure 6a, but are equally consistent with ‘easy’ generalization in Figure 6b. To explore this topic, the new analysis in Figure 6c-g considers conditions that are intended to span the range from easy to hard. Results are consistent with the predictions of Figure 6b. 

      We believe the manuscript has been significantly improved by these additions. The revisions help the manuscript achieve its twin goals: 1) introduce a novel class of decoder that performs very well despite being very simple, and 2) describe properties of motor-cortex activity that will matter for decoders of all varieties.

      Reviewer #1: 

      Summary: 

      This paper presents an innovative decoding approach for brain-computer interfaces (BCIs), introducing a new method named MINT. The authors develop a trajectory-centric approach to decode behaviors across several different datasets, including eight empirical datasets from the Neural Latents Benchmark. Overall, the paper is well written and their method shows impressive performance compared to more traditional decoding approaches that use a simpler approach. While there are some concerns (see below), the paper's strengths, particularly its emphasis on a trajectory-centric approach and the simplicity of MINT, provide a compelling contribution to the field. 

      We thank the reviewer for these comments. We share their enthusiasm for the trajectory-centric approach, and we are in complete agreement that this perspective has both scientific and decoding implications. The revision expands upon these strengths.

      Strengths: 

      The adoption of a trajectory-centric approach that utilizes statistical constraints presents a substantial shift in methodology, potentially revolutionizing the way BCIs interpret and predict neural behaviour. This is one of the strongest aspects of the paper. 

      Again, thank you. We also expect the trajectory-centric perspective to have a broad impact, given its relevance to both decoding and to thinking about manifolds.

      The thorough evaluation of the method across various datasets serves as an assurance that the superior performance of MINT is not a result of overfitting. The comparative simplicity of the method in contrast to many neural network approaches is refreshing and should facilitate broader applicability. 

      Thank you. We were similarly pleased to see such a simple method perform so well. We also agree that, while neural-network approaches will always be important, it is desirable to also possess simple ‘interpretable’ alternatives.

      Weaknesses:  

      Comment 1) Scope: Despite the impressive performance of MINT across multiple datasets, it seems predominantly applicable to M1/S1 data. Only one of the eight empirical datasets comes from an area outside the motor/somatosensory cortex. It would be beneficial if the authors could expand further on how the method might perform with other brain regions that do not exhibit low tangling or do not have a clear trial structure (e.g. decoding of position or head direction from hippocampus) 

      We agree entirely. Population activity in many brain areas (especially outside the motor system) presumably will often not have the properties upon which MINT’s assumptions are built. This doesn’t necessarily mean that MINT would perform badly. Using simulated data, we have found that MINT can perform surprisingly well even when some of its assumptions are violated. Yet at the same time, when MINT’s assumptions don’t apply, one would likely prefer to use other methods. This is, after all, one of the broader themes of the present study: it is beneficial to match decoding assumptions to empirical properties. We have thus added a section on this topic early in the Discussion: 

      “In contrast, MINT and the Kalman filter performed comparably on simulated data that better approximated the assumptions in Figure 1a. Thus, MINT is not a ‘better’ algorithm – simply better aligned with the empirical properties of motor cortex data. This highlights an important caveat. Although MINT performs well when decoding from motor areas, its assumptions may be a poor match in other areas (e.g. the hippocampus). MINT performed well on two non-motor-cortex datasets – Area2_Bump (S1) and DMFC_RSG (dorsomedial frontal cortex) – yet there will presumably be other brain areas and/or contexts where one would prefer a different method that makes assumptions appropriate for that area.”

      Comment 2) When comparing methods, the neural trajectories of MINT are based on averaged trials, while the comparison methods are trained on single trials. An additional analysis might help in disentangling the effect of the trial averaging. For this, the authors could average the input across trials for all decoders, establishing a baseline for averaged trials. Note that inference should still be done on single trials. Performance can then be visualized across different values of N, which denotes the number of averaged trials used for training. 

      We explored this question and found that the non-MINT decoders are harmed, not helped, by the inclusion of trial-averaged responses in the training set. This is presumably because the statistics of trial-averaged responses don’t resemble what will be observed during decoding. This statistical mismatch, between training and decoding, hurts most methods. It doesn’t hurt MINT, because MINT doesn’t ‘train’ in the normal way. It simply needs to know rates, and trial-averaging is a natural way to obtain them. To describe the new analysis, we have added the following to the text.

      “We also investigated the possibility that MINT gained its performance advantage simply by having access to trial-averaged neural trajectories during training, while all other methods were trained on single-trial data. This difference arises from the fundamental requirements of the decoder architectures: MINT needs to estimate typical trajectories while other methods don’t. Yet it might still be the case that other methods would benefit from including trial-averaged data in the training set, in addition to single-trial data. Alternatively, this might harm performance by creating a mismatch, between training and decoding, in the statistics of decoder inputs. We found that the latter was indeed the case: all non-MINT methods performed better when trained purely on single-trial data.”

      Reviewer #2:

      Summary: 

      The goal of this paper is to present a new method, termed MINT, for decoding behavioral states from neural spiking data. MINT is a statistical method which, in addition to outputting a decoded behavioral state, also provides soft information regarding the likelihood of that behavioral state based on the neural data. The innovation in this approach is neural states are assumed to come from sparsely distributed neural trajectories with low tangling, meaning that neural trajectories (time sequences of neural states) are sparse in the high-dimensional space of neural spiking activity and that two dissimilar neural trajectories tend to correspond to dissimilar behavioral trajectories. The authors support these assumptions through analysis of previously collected data, and then validate the performance of their method by comparing it to a suite of alternative approaches. The authors attribute the typically improved decoding performance by MINT to its assumptions being more faithfully aligned to the properties of neural spiking data relative to assumptions made by the alternatives. 

      We thank the reviewer for this accurate summary, and for highlighting the subtle but important fact that MINT provides information regarding likelihoods. The revision includes a new analysis (Figure 6e) illustrating one potential way to leverage knowledge of likelihoods.

      Strengths:  

      The paper did an excellent job critically evaluating common assumptions made by neural analytical methods, such as neural state being low-dimensional relative to the number of recorded neurons. The authors made strong arguments, supported by evidence and literature, for potentially high-dimensional neural states and thus the need for approaches that do not rely on an assumption of low dimensionality. 

      Thank you. We also hope that the shift in perspective is the most important contribution of the study. This shift matters both scientifically and for decoder design. The revision expands on this strength. The scientific alternatives are now more clearly and concretely illustrated (especially see Figure 1a,b and Figure 6a,b). We also further explore their decoding implications with new data (Figure 6c-g).

      The paper was thorough in considering multiple datasets across a variety of behaviors, as well as existing decoding methods, to benchmark the MINT approach. This provided a valuable comparison to validate the method. The authors also provided nice intuition regarding why MINT may offer performance improvement in some cases and in which instances MINT may not perform as well. 

      Thank you. We were pleased to be able to provide comparisons across so many datasets (we are grateful to the Neural Latents Benchmark for making this possible).

      In addition to providing a philosophical discussion as to the advantages of MINT and benchmarking against alternatives, the authors also provided a detailed description of practical considerations. This included training time, amount of training data, robustness to data loss or changes in the data, and interpretability. These considerations not only provided objective evaluation of practical aspects but also provided insights to the flexibility and robustness of the method as they relate back to the underlying assumptions and construction of the approach. 

      Thank you. We are glad that these sections were appreciated. MINT’s simplicity and interpretability are indeed helpful in multiple ways, and afford opportunities for interesting future extensions. One potential benefit of interpretability is now explored in the newly added Figure 6e. 

      Impact: 

      This work is motivated by brain-computer interfaces applications, which it will surely impact in terms of neural decoder design. However, this work is also broadly impactful for neuroscientific analysis to relate neural spiking activity to observable behavioral features. Thus, MINT will likely impact neuroscience research generally. The methods are made publicly available, and the datasets used are all in public repositories, which facilitates adoption and validation of this method within the greater scientific community. 

      Again, thank you. We have similar hopes for this study.

      Weaknesses (1 & 2 are related, and we have switched their order in addressing them): 

      Comment 2) With regards to the idea of neural and behavioral trajectories having different geometries, this is dependent on what behavioral variables are selected. In the example for Fig 2a, the behavior is reach position. The geometry of the behavioral trajectory of interest would look different if instead the behavior of interest was reach velocity. The paper would be strengthened by acknowledgement that geometries of trajectories are shaped by extrinsic choices rather than (or as much as they are) intrinsic properties of the data. 

      We agree. Indeed, we almost added a section to the original manuscript on this exact topic. We have now done so:

      “A potential concern regarding the analyses in Figure 2c,d is that they require explicit choices of behavioral variables: muscle population activity in Figure 2c and angular phase and velocity in Figure 2d. Perhaps these choices were misguided. Might neural and behavioral geometries become similar if one chooses ‘the right’ set of behavioral variables? This concern relates to the venerable search for movement parameters that are reliably encoded by motor cortex activity [69, 92–95]. If one chooses the wrong set of parameters (e.g. chooses muscle activity when one should have chosen joint angles) then of course neural and behavioral geometries will appear non-isometric. There are two reasons why this ‘wrong parameter choice’ explanation is unlikely to account for the results in Figure 2c,d. First, consider the implications of the left-hand side of Figure 2d. A small kinematic distance implies that angular position and velocity are nearly identical for the two moments being compared. Yet the corresponding pair of neural states can be quite distant. Under the concern above, this distance would be due to other encoded behavioral variables – perhaps joint angle and joint velocity – differing between those two moments. However, there are not enough degrees of freedom in this task to make this plausible. The shoulder remains at a fixed position (because the head is fixed) and the wrist has limited mobility due to the pedal design [60]. Thus, shoulder and elbow angles are almost completely determined by cycle phase. More generally, ‘external variables’ (positions, angles, and their derivatives) are unlikely to differ more than slightly when phase and angular velocity are matched. Muscle activity could be different because many muscles act on each joint, creating redundancy. However, as illustrated in Figure 2c, the key effect is just as clear when analyzing muscle activity. Thus, the above concern seems unlikely even if it can’t be ruled out entirely. 
A broader reason to doubt the ‘wrong parameter choice’ proposition is that it provides a vague explanation for a phenomenon that already has a straightforward explanation. A lack of isometry between the neural population response and behavior is expected when neural-trajectory tangling is low and output-null factors are plentiful [55, 60]. For example, in networks that generate muscle activity, neural and muscle-activity trajectories are far from isometric [52, 58, 60]. Given this straightforward explanation, and given repeated failures over decades to find the ‘correct’ parameters (muscle activity, movement direction, etc.) that create neural-behavior isometry, it seems reasonable to conclude that no such isometry exists.”

      Comment 1) The authors posit that neural and behavioral trajectories are non-isometric. To support this point, they look at distances between neural states and distances between the corresponding behavioral states, in order to demonstrate that there are differences in these distances in each respective space. This supports the idea that neural states and behavioral states are non-isometric but does not directly address their point. In order to say the trajectories are non-isometric, it would be better to look at pairs of distances between corresponding trajectories in each space. 

      We like this idea and have added such an analysis. To be clear, we like the original analysis too: isometry predicts that neural and behavioral distances (for corresponding pairs of points) should be strongly correlated, and that small behavioral distances should not be associated with large neural distances. These predictions are not true, providing a strong argument against isometry. The analysis suggested by the reviewer makes the same larger point, and also reveals some additional facts (e.g. it reveals that muscle-geometry is more related to neural-geometry than is kinematic-geometry). The new analysis is described in the following section:

      “We further explored the topic of isometry by considering pairs of distances. To do so, we chose two random neural states and computed their distance, yielding dneural1. We repeated this process, yielding dneural2. We then computed the corresponding pair of distances in muscle space (dmuscle1 and dmuscle2) and kinematic space (dkin1 and dkin2). We considered cases where dneural1 was meaningfully larger than (or smaller than) dneural2, and asked whether the behavioral variables had the same relationship; e.g. was dmuscle1 also larger than dmuscle2? For kinematics, this relationship was weak: across 100,000 comparisons, the sign of dkin1 − dkin2 agreed with dneural1 − dneural2 only 67.3% of the time (with 50% being chance). The relationship was much stronger for muscles: the sign of dmuscle1 − dmuscle2 agreed with dneural1 − dneural2 79.2% of the time, which is far more than expected by chance yet also far from what is expected given isometry (e.g. the sign agrees 99.7% of the time for the truly isometric control data in Figure 2e). Indeed there were multiple moments during this task when dneural1 was much larger than dneural2, yet dmuscle1 was smaller than dmuscle2. These observations are consistent with the proposal that neural trajectories resemble muscle trajectories in some dimensions, but with additional output-null dimensions that break the isometry [60].”
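      The pairwise comparison described in this passage can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the analysis code used for the manuscript: the function name `sign_agreement`, the Euclidean distance, and the `min_rel_diff` threshold (one possible reading of "meaningfully larger") are all hypothetical choices made for the example.

```python
import numpy as np

def sign_agreement(neural, behavior, n_pairs=100_000, min_rel_diff=0.0, seed=0):
    """Fraction of comparisons where d1 - d2 has the same sign in both spaces.

    neural:   (n_states, n_neural_dims) array, one row per moment
    behavior: (n_states, n_behavior_dims) array, rows matched to `neural`
    min_rel_diff: hypothetical threshold; a comparison is kept only when the
        two neural distances differ by more than this fraction of the larger.
    """
    rng = np.random.default_rng(seed)
    n = neural.shape[0]
    # Four random state indices per comparison: (i, j) gives d1, (k, l) gives d2.
    i, j, k, l = rng.integers(0, n, size=(4, n_pairs))
    d_neural1 = np.linalg.norm(neural[i] - neural[j], axis=1)
    d_neural2 = np.linalg.norm(neural[k] - neural[l], axis=1)
    d_behav1 = np.linalg.norm(behavior[i] - behavior[j], axis=1)
    d_behav2 = np.linalg.norm(behavior[k] - behavior[l], axis=1)
    # Keep only comparisons where the neural distances clearly differ.
    larger = np.maximum(d_neural1, d_neural2)
    keep = np.abs(d_neural1 - d_neural2) > min_rel_diff * larger
    agree = np.sign(d_neural1 - d_neural2) == np.sign(d_behav1 - d_behav2)
    return agree[keep].mean()
```

      Under this sketch, a rigid rotation of the neural states (a truly isometric control) should give agreement close to 100%, whereas unrelated behavioral data should give agreement close to the 50% chance level.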

      Comment 3) The approach is built up on the idea of creating a "mesh" structure of possible states. In the body of the paper the definition of the mesh was not entirely clear and I could not find in the methods a more rigorous explicit definition. Since the mesh is integral to the approach, the paper would be improved with more description of this component. 

      This is a fair criticism. Although MINT’s actual operations were well-documented, how those operations mapped onto the term ‘mesh’ was, we agree, a bit vague. The definition of the mesh is a bit subtle because it only emerges during decoding rather than being precomputed. This is part of what gives MINT much more flexibility than a lookup table. We have added the following to the manuscript.

      “We use the term ‘mesh’ to describe the scaffolding created by the training-set trajectories and the interpolated states that arise at runtime. The term mesh is apt because, if MINT’s assumptions are correct, interpolation will almost always be local. If so, the set of decodable states will resemble a mesh, created by line segments connecting nearby training-set trajectories. However, this mesh-like structure is not enforced by MINT’s operations.

      Interpolation could, in principle, create state-distributions that depart from the assumption of a sparse manifold. For example, interpolation could fill in the center of the green tube in Figure 1b, resulting in a solid manifold rather than a mesh around its outer surface. However, this would occur only if spiking observations argued for it. As will be documented below, we find that essentially all interpolation is local.”

      We have also added Figure 4d. This new analysis documents the fact that decoded states are near training-set trajectories, which is why the term ‘mesh’ is appropriate.
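      To make the idea of runtime interpolation concrete, the following sketch chooses a point on the line segment between two library states by maximizing a Poisson log-likelihood of the observed spike counts. It is an illustration only, not MINT's actual procedure: the function names, the grid search over the mixing weight, and the small floor on rates are assumptions introduced for the example.

```python
import numpy as np

BIN_S = 0.020  # 20 ms bins (assumed for this example)

def poisson_score(rates_hz, counts):
    """Poisson log-likelihood of `counts`, up to a term that does not depend on the rates."""
    lam = np.maximum(rates_hz, 1e-3) * BIN_S  # assumption: floor avoids log(0)
    return np.sum(counts * np.log(lam) - lam)

def interpolate(rates_a, rates_b, counts, n_grid=101):
    """Grid-search the weight alpha on the segment (1 - alpha) * a + alpha * b."""
    alphas = np.linspace(0.0, 1.0, n_grid)
    scores = [poisson_score((1 - a) * rates_a + a * rates_b, counts) for a in alphas]
    best = int(np.argmax(scores))
    return alphas[best], scores[best]
```

      Because candidate states are restricted to segments between library states, the set of decodable states in this sketch forms line segments connecting trajectories, which is the sense in which the term ‘mesh’ is used above.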

      Reviewer #3:

      Summary:  

      This manuscript develops a new method termed MINT for decoding of behavior. The method is essentially a table-lookup rather than a model. Within a given stereotyped task, MINT tabulates averaged firing rate trajectories of neurons (neural states) and corresponding averaged behavioral trajectories as stereotypes to construct a library. For a test trial with a realized neural trajectory, it then finds the closest neural trajectory to it in the table and declares the associated behavior trajectory in the table as the decoded behavior. The method can also interpolate between these tabulated trajectories. The authors mention that the method is based on three key assumptions: (1) Neural states may not be embedded in a low-dimensional subspace, but rather in a high-dimensional space. (2) Neural trajectories are sparsely distributed under different behavioral conditions. (3) These neural states traverse trajectories in a stereotyped order.  

      The authors conducted multiple analyses to validate MINT, demonstrating its decoding of behavioral trajectories in simulations and datasets (Figures 3, 4). The main behavior decoding comparison is shown in Figure 4. In stereotyped tasks, decoding performance is comparable (M_Cycle, MC_Maze) or better (Area2_Bump) than other linear/nonlinear algorithms (Figure 4). However, MINT underperforms for the MC_RTT task, which is less stereotyped (Figure 4).  

      This paper is well-structured and its main idea is clear. The fact that performance on stereotyped tasks is high is interesting and informative, showing that these stereotyped tasks create stereotyped neural trajectories. The task-specific comparisons include various measures and a variety of common decoding approaches, which is a strength. However, I have several major concerns. I believe several of the conclusions in the paper, which are also emphasized in the abstract, are not accurate or supported, especially about generalization, computational scalability, and utility for BCIs. MINT is essentially a table-lookup algorithm based on stereotyped task-dependent trajectories and involves the tabulation of extensive data to build a vast library without modeling. These aspects will limit MINT's utility for real-world BCIs and tasks. These properties will also limit MINT's generalizability from task to task, which is important for BCIs and thus is commonly demonstrated in BCI experiments with other decoders without any retraining. Furthermore, MINT's computational and memory requirements can be prohibitive it seems. Finally, as MINT is based on tabulating data without learning models of data, I am unclear how it will be useful in basic investigations of neural computations. I expand on these concerns below.  

      We thank the reviewer for pointing out weaknesses in our framing and presentation. The comments above made us realize that we needed to 1) better document the ways in which MINT is far more flexible than a lookup-table, and 2) better explain the competing scientific perspectives at play. R3’s comments also motivated us to add an additional analysis of generalization. In our view the manuscript is greatly improved by these additions. Specifically, these additions directly support the broader impact that we hope the study will have.

      For simplicity and readability, we first group and summarize R3’s main concerns in order to better address them. (These main concerns are all raised above, in addition to recurring in the specific comments below. Responses to each individual specific comment are provided after these summaries.)

      (1) R3 raises concerns about ‘computational scalability.’ The concern is that “MINT's computational and memory requirements can be prohibitive.” This point was expanded upon in a specific comment, reproduced below:

      I also find the statement in the abstract and paper that "computations are simple, scalable" to be inaccurate. The authors state that MINT's computational cost is O(NC) only, but it seems this is achieved at a high memory cost as well as computational cost in training. The process is described in section "Lookup table of log-likelihoods" on line [978-990]. The idea is to precompute the log-likelihoods for any combination of all neurons with discretization x all delay/history segments x all conditions and to build a large lookup table for decoding. Basically, the computational cost of precomputing this table is O(V^{Nτ} x TC) and the table requires a memory of O(V^{Nτ}), where V is the number of discretization points for the neural firing rates, N is the number of neurons, τ is the history length, T is the trial length, and C is the number of conditions. This is a very large burden, especially the V^{Nτ} term. This cost is currently not mentioned in the manuscript and should be clarified in the main text. Accordingly, computation claims should be modified including in the abstract.

      The revised manuscript clarifies that our statement (that computations are simple and scalable) is absolutely accurate. There is no need to compute, or store, a massive lookup table. There are three tables: two of modest size and one that is tiny. This is now better explained:

      “Thus, the log-likelihood of the observed spike counts, for a particular current neural state, is simply the sum of many individual log-likelihoods (one per neuron and time-bin). Each individual log-likelihood depends on only two numbers: the firing rate at that moment and the spike count in that bin. To simplify online computation, one can precompute the log-likelihood, under a Poisson model, for every plausible combination of rate and spike-count. For example, a lookup table of size 2001 × 21 is sufficient when considering rates that span 0-200 spikes/s in increments of 0.1 spikes/s, and considering 20 ms bins that contain at most 20 spikes (only one lookup table is ever needed, so long as its firing-rate range exceeds that of the most-active neuron at the most active moment in Ω). Now suppose we are observing a population of 200 neurons, with a 200 ms history divided into ten 20 ms bins. For each library state, the log-likelihood of the observed spike-counts is simply the sum of 200 × 10 = 2000 individual log-likelihoods, each retrieved from the lookup table. In practice, computation is even simpler because many terms can be reused from the last time bin using a recursive solution (Methods). This procedure is lightweight and amenable to real-time applications.”
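
The arithmetic in the quoted passage can be illustrated with a minimal sketch, assuming a Poisson model and the table dimensions given above. The function names, the rate floor (`RATE_FLOOR`), and the two candidate states are hypothetical choices made for this illustration. The sketch does not reproduce the recursive solution described in the Methods.

```python
import math

BIN_S = 0.020      # 20 ms bins, as in the quoted example
RATE_STEP = 0.1    # firing-rate resolution (spikes/s)
MAX_RATE = 200.0   # table spans 0-200 spikes/s
MAX_COUNT = 20     # at most 20 spikes per bin
RATE_FLOOR = 0.01  # hypothetical floor (spikes/s) so that log(0) is never taken


def build_poisson_table():
    """Return a 2001 x 21 table of Poisson log-probabilities, indexed [rate][count]."""
    n_rates = int(round(MAX_RATE / RATE_STEP)) + 1
    table = []
    for i in range(n_rates):
        lam = max(i * RATE_STEP, RATE_FLOOR) * BIN_S  # expected spikes in one bin
        table.append([s * math.log(lam) - lam - math.lgamma(s + 1)
                      for s in range(MAX_COUNT + 1)])
    return table


def state_log_likelihood(table, rates, counts):
    """Sum one table lookup per neuron and time bin.

    rates[n][b]  : firing rate (spikes/s) of neuron n in bin b for one library state
    counts[n][b] : observed spike count of neuron n in bin b
    """
    total = 0.0
    for neuron_rates, neuron_counts in zip(rates, counts):
        for rate, count in zip(neuron_rates, neuron_counts):
            row = min(int(round(rate / RATE_STEP)), len(table) - 1)
            total += table[row][min(count, MAX_COUNT)]
    return total


table = build_poisson_table()
counts = [[1] * 10 for _ in range(200)]      # 200 neurons x ten 20 ms bins
rates_a = [[50.0] * 10 for _ in range(200)]  # hypothetical candidate state A
rates_b = [[5.0] * 10 for _ in range(200)]   # hypothetical candidate state B
ll_a = state_log_likelihood(table, rates_a, counts)
ll_b = state_log_likelihood(table, rates_b, counts)
```

Each call performs 200 × 10 = 2000 lookups and additions. Comparing `ll_a` with `ll_b` indicates which candidate state better explains the observed counts.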

      In summary, the first table simply needs to contain the firing rate of each neuron, for each condition, and each time in that condition. This table consumes relatively little memory. Assuming 100 one-second-long conditions (rates sampled every 20 ms) and 200 neurons, the table would contain 100 x 50 x 200 = 1,000,000 numbers. These numbers are typically stored as 16-bit integers (because rates are quantized), which amounts to about 2 MB. This is modest, given that most computers have (at least) tens of GB of RAM. A second table would contain the values for each behavioral variable, for each condition, and each time in that condition. This table might contain behavioral variables at a finer resolution (e.g. every millisecond) to enable decoding to update in between 20 ms bins (1 ms granularity is not needed for most BCI applications, but is the resolution used in this study). The number of behavioral variables of interest for a particular BCI application is likely to be small, often 1-2, but let’s assume for this example it is 10 (e.g. x-, y-, and z-position, velocity, and acceleration of a limb, plus one other variable). This table would thus contain 100 x 1000 x 10 = 1,000,000 floating point numbers, i.e. an 8 MB table. The third table is used to store the probability of s spikes being observed given a particular quantized firing rate (e.g. it may contain probabilities associated with firing rates ranging from 0 – 200 spikes/s in 0.1 spikes/s increments). This table is not necessary, but saves some computation time by precomputing numbers that will be used repeatedly. This is a very small table (typically ~2000 x 20, i.e. 320 KB). It does not need to be repeated for different neurons or conditions, because Poisson probabilities depend on only rate and count.
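
The memory estimates above can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: 2 bytes per quantized rate, 8 bytes per floating-point value (which matches the 8 MB and 320 KB figures), and decimal units (1 MB = 10^6 bytes). The variable names are illustrative.

```python
# Dimensions taken from the worked example in the text
n_conditions = 100
n_neurons = 200
rate_samples = 50        # one-second conditions sampled every 20 ms
behavior_samples = 1000  # behavior sampled every 1 ms
n_behavior_vars = 10

BYTES_INT16 = 2    # quantized rates stored as 16-bit integers
BYTES_FLOAT64 = 8  # assumption: double precision, consistent with the 8 MB figure

# Table 1: firing rate of each neuron, for each condition and time
rate_table_bytes = n_conditions * rate_samples * n_neurons * BYTES_INT16

# Table 2: each behavioral variable, for each condition and time
behavior_table_bytes = n_conditions * behavior_samples * n_behavior_vars * BYTES_FLOAT64

# Table 3: Poisson probabilities for ~2000 quantized rates x ~20 spike counts
poisson_table_bytes = 2000 * 20 * BYTES_FLOAT64

for name, size in [("rate table", rate_table_bytes),
                   ("behavior table", behavior_table_bytes),
                   ("Poisson table", poisson_table_bytes)]:
    print(f"{name}: {size / 1e6:.2f} MB")
```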

      (2) R3 raises a concern that MINT “is essentially a table-lookup rather than a model.” R3 states that MINT

      “is essentially a table-lookup algorithm based on stereotyped task-dependent trajectories and involves the tabulation of extensive data to build a vast library without modeling.”

      and that,

      “as MINT is based on tabulating data without learning models of data, I am unclear how it will be useful in basic investigations of neural computations.”

      This concern is central to most subsequent concerns. The manuscript has been heavily revised to address it. The revisions clarify that MINT is much more flexible than a lookup table, even though MINT uses a lookup table as its first step. Because R3’s concern is intertwined with one’s scientific assumptions, we have also added the new Figure 1 to explicitly illustrate the two key scientific perspectives and their decoding implications. 

      Under the perspective in Figure 1a, R3 would be correct in saying that there exist traditional interpretable decoders (e.g. a Kalman filter) whose assumptions better model the data. Under this perspective, MINT might still be an excellent choice in many cases, but other methods would be expected to gain the advantage when situations demand more flexibility. This is R3’s central concern, and essentially all other concerns flow from it. It makes sense that R3 has this concern, because their comments repeatedly stress a foundational assumption of the perspective in Figure 1a: the assumption of a fixed low-dimensional neural subspace where activity has a reliable relationship to behavior that can be modeled and leveraged during decoding. The phrases below accord with that view:

      “Unlike MINT, these works can achieve generalization because they model the neural subspace and its association to movement.”

      “it will not generalize… even when these tasks cover the exact same space (e.g., the same 2D computer screen and associated neural space).”

      “For proper training, the training data should explore the whole movement space and the associated neural space”

      “I also believe the authors should clarify the logic behind developing MINT better. From a scientific standpoint, we seek to gain insights into neural computations by making various assumptions and building models that parsimoniously describe the vast amount of neural data rather than simply tabulating the data. For instance, low-dimensional assumptions have led to the development of numerous dimensionality reduction algorithms and these models have led to important interpretations about the underlying dynamics”

      Thus, R3 prefers a model that 1) assumes a low-dimensional subspace that is fixed across tasks and 2) assumes a consistent ‘association’ between neural activity and kinematics. Because R3 believes this is the correct model of the data, they believe that decoders should leverage it. Traditional interpretable methods do, and MINT doesn’t, which is why they find MINT to be unprincipled. This is a reasonable view, but it is not our view. We have heavily revised the manuscript to clarify that a major goal of our study is to explore the implications of a different, less-traditional scientific perspective.

      The new Figure 1a illustrates the traditional perspective. Under this perspective, one would agree with R3’s claim that other methods have the opportunity to model the data better. For example, suppose there exists a consistent neural subspace – conserved across tasks – where three neural dimensions encode 3D hand position and three additional neural dimensions encode 3D hand velocity. A traditional method such as a Kalman filter would be a very appropriate choice to model these aspects of the data.

      Figure 1b illustrates the alternative scientific perspective. This perspective arises from recent, present, and to-be-published observations. MINT’s assumptions are well-aligned with this perspective. In contrast, the assumptions of traditional methods (e.g. the Kalman filter) are not well-aligned with the properties of the data under this perspective. This does not mean traditional methods are not useful. Yet under Figure 1b, it is traditional methods, such as the Kalman filter, that lack an accurate model of the data. Of course, the reviewer may disagree with our scientific perspective. We would certainly concede that there is room for debate. However, we find the evidence for Figure 1b to be sufficiently strong that it is worth exploring the utility of methods that align with this scientific perspective. MINT is such a method. As we document, it performs very well.

      Thus, in our view, MINT is quite principled because its assumptions are well aligned with the data. It is true that the features of the data that MINT models are a bit different from those that are traditionally modeled. For example, R3 is quite correct that MINT does not attempt to use a biomimetic model of the true transformation from neural activity, to muscle activity, and thence to kinematics. We see this as a strength, and the manuscript has been revised accordingly (see paragraph beginning with “We leveraged this simulated data to compare MINT with a biomimetic decoder”).

      (3) R3 raises concerns that MINT cannot generalize. This was a major concern of R3 and is intimately related to concern #2 above. The concern is that, if MINT is “essentially a lookup table” that simply selects pre-defined trajectories, then MINT will not be able to generalize. R3 is quite correct that MINT generalizes rather differently than existing methods. Whether this is good or bad depends on one’s scientific perspective. Under Figure 1a, MINT’s generalization would indeed be limiting because other methods could achieve greater flexibility. Under Figure 1b, all methods will have serious limits regarding generalization. Thus, MINT’s method for generalizing may approximate the best one can presently do. To address this concern, we have made three major changes, numbered i-iii below:

      i) Large sections of the manuscript have been restructured to underscore the ways in which MINT can generalize. A major goal was to counter the impression, stated by R3 above, that: 

      “for a test trial with a realized neural trajectory, [MINT] then finds the closest neural trajectory to it in the table and declares the associated behavior trajectory in the table as the decoded behavior”.

      This description is a reasonable way to initially understand how MINT works, and we concede that we may have over-used this intuition. Unfortunately, it can leave the misimpression that MINT decodes by selecting whole trajectories, each corresponding to ‘a behavior’. This can happen, but it needn’t and typically doesn’t. As an example, consider the cycling task. Suppose that the library consists of stereotyped trajectories, each four cycles long, at five fixed speeds from 0.5-2.5 Hz. If the spiking observations argued for it, MINT could decode something close to one of these five stereotyped trajectories. Yet it needn’t. Decoded trajectories will typically resemble library trajectories locally, but may be very different globally. For example, a decoded trajectory could be thirty cycles long (or two, or five hundred) perhaps speeding up and slowing down multiple times across those cycles.

      Thus, the library of trajectories shouldn’t be thought of as specifying a limited set of whole movements that can be ‘selected from’. Rather, trajectories define a scaffolding that outlines where the neural state is likely to live and how it is likely to be changing over time. When we introduce the idea of library trajectories, we are now careful to stress that they don’t function as a set from which one trajectory is ‘declared’ to be the right one:

      “We thus designed MINT to approximate that manifold using the trajectories themselves, rather than their covariance matrix or corresponding subspace. Unlike a covariance matrix, neural trajectories indicate not only which states are likely, but also which state-derivatives are likely. If a neural state is near previously observed states, it should be moving in a similar direction. MINT leverages this directionality.

      Training-set trajectories can take various forms, depending on what is convenient to collect. Most simply, training data might include one trajectory per condition, with each condition corresponding to a discrete movement. Alternatively, one might instead employ one long trajectory spanning many movements. Another option is to employ many sub-trajectories, each briefer than a whole movement. The goal is simply for training-set trajectories to act as a scaffolding, outlining the manifold that might be occupied during decoding and the directions in which decoded trajectories are likely to be traveling.”

      Later in that same section we stress that decoded trajectories can move along the ‘mesh’ in non-stereotyped ways:

      “Although the mesh is formed of stereotyped trajectories, decoded trajectories can move along the mesh in non-stereotyped ways as long as they generally obey the flow-field implied by the training data. This flexibility supports many types of generalization, including generalization that is compositional in nature. Other types of generalization – e.g. from the green trajectories to the orange trajectories in Figure 1b – are unavailable when using MINT and are expected to be challenging for any method (as will be documented in a later section).”

      The section “Training and decoding using MINT” has been revised to clarify the ways in which interpolation is flexible, allowing decoded movements to be globally very different from any library trajectory.

      “To decode stereotyped trajectories, one could simply obtain the maximum-likelihood neural state from the library, then render a behavioral decode based on the behavioral state with the same values of c and k. This would be appropriate for applications in which conditions are categorical, such as typing or handwriting. Yet in most cases we wish for the trajectory library to serve not as an exhaustive set of possible states, but as a scaffolding for the mesh of possible states. MINT’s operations are thus designed to estimate any neural trajectory – and any corresponding behavioral trajectory – that moves along the mesh in a manner generally consistent with the trajectories in Ω.”

      “…interpolation allows considerable flexibility. Not only is one not ‘stuck’ on a trajectory from Φ, one is also not stuck on trajectories created by weighted averaging of trajectories in Φ. For example, if cycling speed increases, the decoded neural state could move steadily up a scaffolding like that illustrated in Figure 1b (green). In such cases, the decoded trajectory might be very different in duration from any of the library trajectories. Thus, one should not think of the library as a set of possible trajectories that are selected from, but rather as providing a mesh-like scaffolding that defines where future neural states are likely to live and the likely direction of their local motion. The decoded trajectory may differ considerably from any trajectory within Ω.”

      This flexibility is indeed used during movement. One empirical example is described in detail:

      “During movement… angular phase was decoded with effectively no net drift over time. This is noteworthy because angular velocity on test trials never perfectly matched any of the trajectories in Φ. Thus, if decoding were restricted to a library trajectory, one would expect growing phase discrepancies. Yet decoded trajectories only need to locally (and approximately) follow the flow-field defined by the library trajectories. Based on incoming spiking observations, decoded trajectories speed up or slow down (within limits).

      This decoding flexibility presumably relates to the fact that the decoded neural state is allowed to differ from the nearest state in Ω. To explore… [the text goes on to describe the new analysis in Figure 4d, which shows that the decoded state is typically not on any trajectory, though it is typically close to a trajectory].”

      Thus, MINT’s operations allow considerable flexibility, including generalization that is compositional in nature. Yet R3 is still correct that there are other forms of generalization that are unavailable to MINT. This is now stressed at multiple points in the revision. However, under the perspective in Figure 1b, these forms of generalization are unavailable to any current method. Hence we made a second major change in response to this concern…

      ii) We explicitly illustrate how the structure of the data determines when generalization is or isn’t possible. The new Figure 1a,b introduces the two perspectives, and the new Figure 6a,b lays out their implications for generalization. Under the perspective in Figure 6a, the reviewer is quite right: other methods can generalize in ways that MINT cannot. Under the perspective in Figure 6b, expectations are very different. Those expectations make testable predictions. Hence the third major change…

      iii) We have added an analysis of generalization, using a newly collected dataset. This dataset was collected using Neuropixels Probes during our Pac-Man force-tracking task. This dataset was chosen because it is unusually well-suited to distinguishing the predictions in Figure 6a versus Figure 6b. Finding a dataset that can do so is not simple. Consider R3’s point that training data should “explore the whole movement space and the associated neural space”. The physical simplicity of the Pac-Man task makes it unusually easy to confirm that the behavioral workspace has been fully explored. Importantly, under Figure 6b, this does not mean that the neural workspace has been fully explored, which is exactly what we wish to test when testing generalization. We do so, and compare MINT with a Wiener filter. A Wiener filter is an ideal comparison because it is simple, performs very well on this task, and should be able to generalize well under Figure 1a. Additionally, the Wiener filter (unlike the Kalman filter) doesn’t leverage the assumption that neural activity reflects the derivative of force. This matters because we find that neural activity does not reflect dforce/dt in this task. The Wiener filter is thus the most natural choice of the interpretable methods whose assumptions match Figure 1a.

      The new analysis is described in Figure 6c-g and accompanying text. Results are consistent with the predictions of Figure 6b. We are pleased to have been motivated to add this analysis for two reasons. First, it provides an additional way of evaluating the predictions of the two competing scientific perspectives that are at the heart of our study. Second, this analysis illustrates an underappreciated way in which generalization is likely to be challenging for any decode method. It can be tempting to think that the main challenge regarding generalization is to fully explore the relevant behavioral space. This makes sense if a behavioral space has “an associated neural space”. However, we are increasingly of the opinion that it doesn’t. Different tasks often involve different neural subspaces, even when behavioral subspaces overlap. We have even seen situations where motor output is identical but neural subspaces are quite different. These facts are relevant to any decoder, something highlighted in the revised Introduction:

      “MINT’s performance confirms that there are gains to be made by building decoders whose assumptions match a different, possibly more accurate view of population activity. At the same time, our results suggest fundamental limits on decoder generalization. Under the assumptions in Figure 1b, it will sometimes be difficult or impossible for decoders to generalize to not-yet-seen tasks. We found that this was true regardless of whether one uses MINT or a more traditional method. This finding has implications regarding when and how generalization should be attempted.”

      We have also added an analysis (Figure 6e) illustrating how MINT’s ability to compute likelihoods can be useful in detecting situations that may strain generalization (for any method). MINT is unusual in being able to compute and use likelihoods in this way.

      Detailed responses to R3: we reproduce each of R3’s specific concerns below, but concentrate our responses on issues not already covered above.

      Main comments: 

      Comment 1. MINT does not generalize to different tasks, which is a main limitation for BCI utility compared with prior BCI decoders that have shown this generalizability as I review below. Specifically, given that MINT tabulates task-specific trajectories, it will not generalize to tasks that are not seen in the training data even when these tasks cover the exact same space (e.g., the same 2D computer screen and associated neural space). 

      First, the authors provide a section on generalization, which is inaccurate because it mixes up two fundamentally different concepts: 1) collecting informative training data and 2) generalizing from task to task. The former is critical for any algorithm, but it does not imply the latter. For example, removing one direction of cycling from the training set as the authors do here is an example of generating poor training data because the two behavioral (and neural) directions are non-overlapping and/or orthogonal while being in the same space. As such, it is fully expected that all methods will fail. For proper training, the training data should explore the whole movement space and the associated neural space, but this does not mean all kinds of tasks performed in that space must be included in the training set (something MINT likely needs while modeling-based approaches do not). Many BCI studies have indeed shown this generalization ability using a model. For example, in Weiss et al. 2019, center-out reaching tasks are used for training and then the same trained decoder is used for typing on a keyboard or drawing on the 2D screen. In Gilja et al. 2012, training is on a center-out task but the same trained decoder generalizes to a completely different pinball task (hit four consecutive targets) and tasks requiring the avoidance of obstacles and curved movements. There are many more BCI studies, such as Jarosiewicz et al. 2015 that also show generalization to complex real-world tasks not included in the training set. Unlike MINT, these works can achieve generalization because they model the neural subspace and its association to movement. On the contrary, MINT models task-dependent neural trajectories, so the trained decoder is very task-dependent and cannot generalize to other tasks. So, unlike these prior BCI methods, MINT will likely actually need to include every task in its library, which is not practical.

      I suggest the authors remove claims of generalization and modify their arguments throughout the text and abstract. The generalization section needs to be substantially edited to clarify the above points. Please also provide the BCI citations and discuss the above limitation of MINT for BCIs. 

      As discussed above, R3’s concerns are accurate under the view in Figure 1a (and the corresponding Figure 6a). Under this view, a method such as that in Gilja et al. or Jarosiewicz et al. can find the correct subspace, model the correct neuron-behavior correlations, and generalize to any task that uses “the same 2D computer screen and associated neural space”, just as the reviewer argues. Under Figure 1b things are quite different.

      This topic – and the changes we have made to address it – is covered at length above. Here we simply want to highlight an empirical finding: sometimes two tasks use the same neural subspace and sometimes they don’t. We have seen both in recent data, and it can be very non-obvious which will occur based just on behavior. It does not simply relate to whether one is using the same physical workspace. We have even seen situations where the patterns of muscle activity in two tasks are nearly identical, but the neural subspaces are fairly different. When a new task uses a new subspace, neither of the methods noted above (Gilja et al. or Jarosiewicz et al.) will generalize (nor will MINT). Generalizing to a new subspace is basically impossible without some yet-to-be-invented approach. On the other hand, there are many other pairs of tasks (center-out-reaching versus some other 2D cursor control) where subspaces are likely to be similar, especially if the frequency content of the behavior is similar (in our recent experience this is often critical). When subspaces are shared, most methods will generalize, and that is presumably why generalization worked well in the studies noted above.

      Although MINT can also generalize in such circumstances, R3 is correct that, under the perspective in Figure 1a, MINT will be more limited than other methods. This is now carefully illustrated in Figure 6a. In this traditional perspective, MINT will fail to generalize in cases where new trajectories are near previously observed states, yet move in very different ways from library trajectories. The reason we don’t view this as a shortcoming is that we expect it to occur rarely (else tangling would be high). We thus anticipate the scenario in Figure 6b.

      This is worth stressing because R3 states that our discussion of generalization “is inaccurate because it mixes up two fundamentally different concepts: 1) collecting informative training data and 2) generalizing from task to task.” We have heavily revised this section and improved it. However, it was never inaccurate. Under Figure 6b, these two concepts absolutely are mixed up. If different tasks use different neural subspaces, then this requires collecting different “informative training data” for each. One cannot simply count on having explored the physical workspace.

      Comment 2. MINT is shown to achieve competitive/high performance in highly stereotyped datasets with structured trials, but worse performance on MC_RTT, which is not based on repeated trials and is less stereotyped. This shows that MINT is valuable for decoding in repetitive stereotyped use-cases. However, it also highlights a limitation of MINT for BCIs, which is that MINT may not work well for real-world and/or less-constrained setups such as typing, moving a robotic arm in 3D space, etc. This is again due to MINT being a lookup table with a library of stereotyped trajectories rather than a model. Indeed, the authors acknowledge that the lower performance on MC_RTT (Figure 4) may be caused by the lack of repeated trials of the same type. However, real-world BCI decoding scenarios will also not have such stereotyped trial structure and will be less/un-constrained, in which MINT underperforms. Thus, the claim in the abstract or lines 480-481 that MINT is an "excellent" candidate for clinical BCI applications is not accurate and needs to be qualified. The authors should revise their statements accordingly and discuss this issue. They should also make the use-case of MINT on BCI decoding clearer and more convincing.

      We discussed, above, multiple changes and additions to the revision that were made to address these concerns. Here we briefly expand on the comment that MINT achieves “worse performance on MC_RTT, which is not based on repeated trials and is less stereotyped”. All decoders performed poorly on this task. MINT still outperformed the two traditional methods, but this was the only dataset where MINT did not also perform better (overall) than the expressive GRU and feedforward network. There are probably multiple reasons why. We agree with R3 that one likely reason is that this dataset is straining generalization, and MINT may have felt this strain more than the two machine-learning-based methods. Another potential reason is the structure of the training data, which made it more challenging to obtain library trajectories in the first place. Importantly, these observations do not support the view in Figure 1a. MINT still outperformed the Kalman and Wiener filters (whose assumptions align with Fig. 1a). To make these points we have added the following:

      “Decoding was acceptable, but noticeably worse, for the MC_RTT dataset… As will be discussed below, every decode method achieved its worst estimates of velocity for the MC_RTT dataset. In addition to the impact of slower reaches, MINT was likely impacted by training data that made it challenging to accurately estimate library trajectories. Due to the lack of repeated trials, MINT used AutoLFADS to estimate the neural state during training. In principle this should work well. In practice AutoLFADS may have been limited by having only 10 minutes of training data. Because the random-target task involved more variable reaches, it may also have stressed the ability of all methods to generalize, perhaps for the reasons illustrated in Figure 1b.

      The only dataset where MINT did not perform the best overall was the MC_RTT dataset, where it was outperformed by the feedforward network and GRU. As noted above, this may relate to the need for MINT to learn neural trajectories from training data that lacked repeated trials of the same movement (a design choice one might wish to avoid). Alternatively, the less-structured MC_RTT dataset may strain the capacity to generalize; all methods experienced a drop in velocity-decoding R² for this dataset compared to the others. MINT generalizes somewhat differently than other methods, and may have been at a modest disadvantage for this dataset. A strong version of this possibility is that perhaps the perspective in Figure 1a is correct, in which case MINT might struggle because it cannot use forms of generalization that are available to other methods (e.g. generalization based on neuron-velocity correlations). This strong version seems unlikely; MINT continued to significantly outperform the Wiener and Kalman filters, which make assumptions aligned with Figure 1a.”

      Comment 3. Related to 2, it may also be that MINT achieves competitive performance in offline and trial-based stereotyped decoding by overfitting to the trial structure in a given task, and thus may not generalize well to online performance due to overfitting. For example, a recent work showed that offline decoding performance may be overfitted to the task structure and may not represent online performance (Deo et al. 2023). Please discuss. 

      We agree that a limitation of our study is that we do not test online performance. There are sensible reasons for this decision:

      “By necessity and desire, all comparisons were made offline, enabling benchmarked performance across a variety of tasks and decoded variables, where each decoder had access to the exact same data and recording conditions.”

      We recently reported excellent online performance in the cycling task with a different algorithm (Schroeder et al. 2022). In the course of that study, we consistently found that improvements in our offline decoding translated to improvements in our online decoding. We thus believe that MINT (which improves on the offline performance of our older algorithm) is a good candidate to work very well online. Yet we agree this still remains to be seen. We have added the following to the Discussion:

      “With that goal in mind, there exist three important practical considerations. First, some decode algorithms experience a performance drop when used online. One presumed reason is that, when decoding is imperfect, the participant alters their strategy which in turn alters the neural responses upon which decoding is based. Because MINT produces particularly accurate decoding, this effect may be minimized, but this cannot be known in advance. If a performance drop does indeed occur, one could adapt the known solution of retraining using data collected during online decoding [13]. Another presumed reason (for a gap between offline and online decoding) is that offline decoders can overfit the temporal structure in training data [107]. This concern is somewhat mitigated by MINT’s use of a short spike-count history, but MINT may nevertheless benefit from data augmentation strategies such as including time-dilated versions of learned trajectories in the libraries.”

      Comment 4. Related to 2, since MINT requires firing rates to generate the library and simple averaging does not work for this purpose in the MC_RTT dataset (that does not have repeated trials), the authors needed to use AutoLFADS to infer the underlying firing rates. The fact that MINT requires the usage of another model to be constructed first and that this model can be computationally complex, will also be a limiting factor and should be clarified. 

      This concern relates to the computational complexity of computing firing-rate trajectories during training. Usually, rates are estimated via trial-averaging, which makes MINT very fast to train. This was quite noticeable during the Neural Latents Benchmark competition. As one example, for the “MC_Scaling 5 ms Phase”, MINT took 28 seconds to train while GPFA took 30 minutes, the transformer baseline (NDT) took 3.5 hours, and the switching nonlinear dynamical system took 4.5 hours.

      However, the reviewer is quite correct that MINT’s efficiency depends on the method used to construct the library of trajectories. As we note, “MINT is a method for leveraging a trajectory library, not a method for constructing it”. One can use trial-averaging, which is very fast. One can also use fancier, slower methods to compute the trajectories. We don’t view this as a negative – it simply provides options. Usually one would choose trial-averaging, but one does not have to. In the case of MC_RTT, one has a choice between LFADS and grouping into pseudo-conditions and averaging (which is fast). LFADS produces higher performance at the cost of being slower. The operator can choose which they prefer. This is discussed in the following section:

      “For MINT, ‘training’ simply means computation of standard quantities (e.g. firing rates) rather than parameter optimization. MINT is thus typically very fast to train (Table 1), on the order of seconds using generic hardware (no GPUs). This speed reflects the simple operations involved in constructing the library of neural-state trajectories: filtering of spikes and averaging across trials. At the same time we stress that MINT is a method for leveraging a trajectory library, not a method for constructing it. One may sometimes wish to use alternatives to trial-averaging, either of necessity or because they improve trajectory estimates. For example, for the MC_RTT task we used AutoLFADS to infer the library. Training was consequently much slower (hours rather than seconds) because of the time taken to estimate rates. Training time could be reduced back to seconds using a different approach – grouping into pseudo-conditions and averaging – but performance was reduced. Thus, training will typically be very fast, but one may choose time-consuming methods when appropriate.”

      Comment 5. I also find the statement in the abstract and paper that "computations are simple, scalable" to be inaccurate. The authors state that MINT's computational cost is O(NC) only, but it seems this is achieved at a high memory cost as well as computational cost in training. The process is described in section "Lookup table of log-likelihoods" on line [978-990]. The idea is to precompute the log-likelihoods for any combination of all neurons with discretization x all delay/history segments x all conditions and to build a large lookup table for decoding. Basically, the computational cost of precomputing this table is O(V^{Nτ} x TC) and the table requires a memory of O(V^{Nτ}), where V is the number of discretization points for the neural firing rates, N is the number of neurons, τ is the history length, T is the trial length, and C is the number of conditions. This is a very large burden, especially the V^{Nτ} term. This cost is currently not mentioned in the manuscript and should be clarified in the main text. Accordingly, computation claims should be modified including in the abstract. 

      As discussed above, the manuscript has been revised to clarify that our statement was accurate.

      Comment 6. In addition to the above technical concerns, I also believe the authors should clarify the logic behind developing MINT better. From a scientific standpoint, we seek to gain insights into neural computations by making various assumptions and building models that parsimoniously describe the vast amount of neural data rather than simply tabulating the data. For instance, low-dimensional assumptions have led to the development of numerous dimensionality reduction algorithms and these models have led to important interpretations about the underlying dynamics (e.g., fixed points/limit cycles). While it is of course valid and even insightful to propose different assumptions from existing models as the authors do here, they do not actually translate these assumptions into a new model. Without a model and by just tabulating the data, I don't believe we can provide interpretation or advance the understanding of the fundamentals behind neural computations. As such, I am not clear as to how this library building approach can advance neuroscience or how these assumptions are useful. I think the authors should clarify and discuss this point. 

      As requested, a major goal of the revision has been to clarify the scientific motivations underlying MINT’s design. In addition to many textual changes, we have added figures (Figures 1a,b and 6a,b) to outline the two competing scientific perspectives that presently exist. This topic is also addressed by extensions of existing analyses and by new analyses (e.g. Figure 6c-g). 

      In our view these additions have dramatically improved the manuscript. This is especially true because we think R3’s concerns, expressed above, are reasonable. If the perspective in Figure 1a is correct, then R3 is right and MINT is essentially a hack that fails to model the data. MINT would still be effective in many circumstances (as we show), but it would be unprincipled. This would create limitations, just as the reviewer argues. On the other hand, if the perspective in Figure 1b is correct, then MINT is quite principled relative to traditional approaches. Traditional approaches make assumptions (a fixed subspace, consistent neuron-kinematic correlations) that are not correct under Figure 1b.

      We don’t expect R3 to agree with our scientific perspective at this time (though we hope to eventually convince them). To us, the key is that we agree with R3 that the manuscript needs to lay out the different perspectives and their implications, so that readers have a good sense of the possibilities they should be considering. The revised manuscript is greatly improved in this regard.

      Comment 7. Related to 6, there seems to be a logical inconsistency between the operations of MINT and one of its three assumptions, namely, sparsity. The authors state that neural states are sparsely distributed in some neural dimensions (Figure 1a, bottom). If this is the case, then why does MINT extend its decoding scope by interpolating known neural states (and behavior) in the training library? This interpolation suggests that the neural states are dense on the manifold rather than sparse, thus being contradictory to the assumption made. If interpolation-based dense meshes/manifolds underlie the data, then why not model the neural states through the subspace or manifold representations? I think the authors should address this logical inconsistency in MINT, especially since this sparsity assumption also questions the low-dimensional subspace/manifold assumption that is commonly made. 

      We agree this is an important issue, and have added an analysis on this topic (Figure 4d). The key question is simple and empirical: during decoding, does interpolation cause MINT to violate the assumption of sparsity? R3 is quite right that in principle it could. If spiking observations argue for it, MINT’s interpolation could create a dense manifold during decoding rather than a sparse one. The short answer is that empirically this does not happen, in agreement with expectations under Figure 1b. Rather than interpolating between distant states and filling in large ‘voids’, interpolation is consistently local. This is a feature of the data, not of the decoder (MINT doesn’t insist upon sparsity, even though it is designed to work best in situations where the manifold is sparse).

      In addition to adding Figure 4d, we added the following (in an earlier section):

      “The term mesh is apt because, if MINT’s assumptions are correct, interpolation will almost always be local. If so, the set of decodable states will resemble a mesh, created by line segments connecting nearby training-set trajectories. However, this mesh-like structure is not enforced by MINT’s operations. Interpolation could, in principle, create state-distributions that depart from the assumption of a sparse manifold. For example, interpolation could fill in the center of the green tube in Figure 1b, resulting in a solid manifold rather than a mesh around its outer surface. However, this would occur only if spiking observations argued for it. As will be documented below, we find that essentially all interpolation is local.”

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors): 

      I appreciate the detailed methods section, however, more specifics should be integrated into the main text. For example on Line 238, it should additionally be stated how many minutes were used for training and metrics like the MAE which is used later should be reported here.

      Thank you for this suggestion. We now report the duration of training data in the main text:

      “Decoding R^2 was .968 over ~7.1 minutes of test trials based on ~4.4 minutes of training data.”

      We have also added similar specifics throughout the manuscript, e.g. in the Fig. 5 legend:

      “Results are based on the following numbers of training / test trials: MC\_Cycle (174 train, 99 test), MC\_Maze (1721 train, 574 test), Area2\_Bump (272 train, 92 test), MC\_RTT (810 train, 268 test).”

      Similar additions were made to the legends for Fig. 6 and 8. Regarding the request to add MAE for the multitask network, we did not do so for the simple reason that the decoded variable (muscle activity) has arbitrary units. The raw MAE is thus not meaningful. We could of course have normalized, but at this point the MAE is largely redundant with the correlation. In contrast, the MAE is useful when comparing across the MC_Maze, Area2_Bump, and MC_RTT datasets, because they all involve the same scale (cm/s).
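
      The point about arbitrary units can be illustrated with a minimal sketch. The data and function names below are hypothetical and are not taken from the manuscript; the sketch only shows that rescaling a decoded variable rescales the MAE by the same factor while leaving the correlation unchanged.

```python
import math

def mae(truth, decoded):
    """Mean absolute error, expressed in the units of the decoded variable."""
    return sum(abs(t - d) for t, d in zip(truth, decoded)) / len(truth)

def pearson_r(x, y):
    """Pearson correlation, which is unaffected by rescaling either input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical "true" and decoded signals (illustrative values only).
truth = [0.0, 1.0, 2.0, 3.0, 4.0]
decoded = [0.2, 0.9, 2.3, 2.8, 4.1]

# Re-express the same signals in different (arbitrary) units.
scale = 1000.0
truth_rescaled = [scale * v for v in truth]
decoded_rescaled = [scale * v for v in decoded]

mae_original = mae(truth, decoded)
mae_rescaled = mae(truth_rescaled, decoded_rescaled)
r_original = pearson_r(truth, decoded)
r_rescaled = pearson_r(truth_rescaled, decoded_rescaled)
```

      Here `mae_rescaled` is `scale` times `mae_original`, whereas `r_rescaled` matches `r_original`, which is why a raw MAE is informative only when datasets share a physical scale such as cm/s.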

      Regarding the MC_RTT task, AutoLFADS was used to obtain robust spike rates, as reported in the methods. However, the rationale for splitting the neural trajectories after AutoLFADS is unclear. If the trajectories were split based on random recording gaps, this might lead to suboptimal performance? It might be advantageous to split them based on a common behavioural state? 

      When learning neural trajectories via AutoLFADS, spiking data is broken into short (but overlapping) segments, rates are estimated for each segment via AutoLFADS, and these rates are then stitched together across segments into long neural trajectories. If there had been no recording gaps, these rates could have been stitched into a single neural trajectory for this dataset. However, the presence of recording gaps left us no choice but to stitch together these rates into more than one trajectory. Fortunately, recording gaps were rare: for the decoding analysis of MC_RTT there were only two recording gaps and therefore three neural trajectories, each ~2.7 minutes in duration.

      We agree that in general it is desirable to learn neural trajectories that begin and end at behaviorally-relevant moments (e.g. in between movements). However, having these trajectories potentially end mid-movement is not an issue in and of itself. During decoding, MINT is never stuck on a trajectory. Thus, if MINT were decoding states near the end of a trajectory that was cut short due to a training gap, it would simply begin decoding states from other trajectories or elsewhere along the same trajectory in subsequent moments. We could have further trimmed the three neural trajectories to begin and end at behaviorally-relevant moments, but chose not to as this would have only removed a handful of potentially useful states from the library.

      We now describe this in the Methods:

      “Although one might prefer trajectory boundaries to begin and end at behaviorally relevant moments (e.g. a stationary state), rather than at recording gaps, the exact boundary points are unlikely to be consequential for trajectories of this length that span multiple movements. If MINT estimates a state near the end of a long trajectory, its estimate will simply jump to another likely state on a different trajectory (or earlier along the same trajectory) in subsequent moments. Clipping the end of each trajectory to an earlier behaviorally-relevant moment would only remove potentially useful states from the libraries.”

      Are the training and execution times in Table 1 based on pure Matlab functions or Mex files? If it's Mex files as suggested by the code, it would be good to mention this in the Table caption.

      They are based on a combination of MATLAB and MEX files. This is now clarified in the table caption:

      “Timing measurements taken on a Macbook Pro (on CPU) with 32GB RAM and a 2.3 GHz 8-Core Intel Core i9 processor. Training and execution code used for measurements was written in MATLAB (with the core recursion implemented as a MEX file).”

      As the method most closely resembles a Bayesian decoder it would be good to compare performance against a Naive Bayes decoder. 

      We agree and have now done so. The following has been added to the text:

      “A natural question is thus whether a simpler Bayesian decoder would have yielded similar results. We explored this possibility by testing a Naïve Bayes regression decoder [85] using the MC_Maze dataset. This decoder performed poorly, especially when decoding velocity (R^2 = .688 and .093 for hand position and velocity, respectively), indicating that the specific modeling assumptions that differentiate MINT from a naive Bayesian decoder are important drivers of MINT’s performance.”

      Line 199 Typo: The assumption of stereotypy trajectory also enables neural states (and decoded behaviors) to be updated in between time bins. 

      Fixed

      Table 3: It's unclear why the Gaussian binning varies significantly across different datasets. Could the authors explain why this is the case and what its implications might be? 

      We have added the following description in the “Filtering, extracting, and warping data on each trial” subsection of the Methods to discuss how 𝜎 may vary due to the number of trials available for training and how noisy the neural data for those trials is:

      “First, spiking activity for each neuron on each trial was temporally filtered with a Gaussian to yield single-trial rates. Table 3 reports the Gaussian standard deviations σ (in milliseconds) used for each dataset. Larger values of σ utilize broader windows of spiking activity when estimating rates and therefore reduce variability in those rate estimates. However, large σ values also yield neural trajectories with less fine-grained temporal structure. Thus, the optimal σ for a dataset depends on how variable the rate estimates otherwise are.”
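
      The trade-off described above can be sketched as follows. This is a minimal illustration assuming 1 ms bins, a kernel truncated at four standard deviations, and zero-padding at the trial edges; none of these choices are taken from the manuscript, and the function names are hypothetical.

```python
import math

def gaussian_kernel(sigma_ms, bin_ms=1.0, n_sigmas=4.0):
    """Unit-sum Gaussian kernel, truncated at n_sigmas standard deviations."""
    half = int(math.ceil(n_sigmas * sigma_ms / bin_ms))
    weights = [math.exp(-0.5 * ((k * bin_ms) / sigma_ms) ** 2)
               for k in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth_spikes(spike_counts, sigma_ms, bin_ms=1.0):
    """Convolve binned spike counts with a Gaussian; returns rates in spikes/s.

    Bins outside the trial are treated as empty, so rates near the edges
    are biased low (an assumption of this sketch).
    """
    kernel = gaussian_kernel(sigma_ms, bin_ms)
    half = len(kernel) // 2
    n = len(spike_counts)
    rates = []
    for t in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t + k - half
            if 0 <= idx < n:
                acc += w * spike_counts[idx]
        rates.append(acc * 1000.0 / bin_ms)
    return rates

# A single spike in the middle of a 201 ms trial.
spikes = [0] * 201
spikes[100] = 1

narrow = smooth_spikes(spikes, sigma_ms=10.0)  # sharper, more variable estimate
broad = smooth_spikes(spikes, sigma_ms=30.0)   # smoother, less temporal detail
```

      With the larger σ, the single spike is spread over a wider window, so the peak rate is lower and the temporal structure is blurred. This is the same trade-off that makes the best σ dataset-dependent.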

      An implementation of the method in an open-source programming language could further enhance the widespread use of the tool. 

      We agree this would be useful, but have not yet implemented the method in any other programming language. Implementation in Python remains a future goal.

      Reviewer #2 (Recommendations For The Authors): 

      - Figures 4 and 5 should show the error bars on the horizontal axis rather than portraying them vertically. 

      [Note that these are now Figures 5 and 6]

      The figure legend of Figure 5 now clarifies that the vertical ticks are simply to aid visibility when symbols have very similar means and thus overlap visually. We don’t include error bars (for this analysis) because they are very small and would mostly be smaller than the symbol sizes. Instead, to indicate certainty regarding MINT’s performance measurements, the revised text now gives error ranges for the correlations and MAE values in the context of Figure 4c. These error ranges were computed as the standard deviation of the sampling distribution (computed via resampling of trials) and are thus equivalent to SEMs. The error ranges are all very small; e.g. for the MC_Maze dataset the MAE for x-velocity is 4.5 +/- 0.1 cm/s (error bars on the correlations are smaller still).

      Thus, for a given dataset, we can be quite certain of how well MINT performs (within ~2% in the above case). This is reassuring, but we also don’t want to overemphasize this accuracy. The main sources of variability one should be concerned about are: 1) different methods can perform differentially well for different brain areas and tasks, 2) methods can decode some behavioral variables better than others, and 3) performance depends on factors like neuron-count and the number of training trials, in ways that can differ across decode methods. For this reason, the study examines multiple datasets, across tasks and brain areas, and measures performance for a range of decoded variables. We also examine the impact of training-set-size (Figure 8a) and population size (solid traces in Fig. 8b, see R2’s next comment below). 

      There is one other source of variance one might be concerned about, but it is specific to the neural-network approaches: different weight initializations might result in different performance. For this reason, each neural-network approach was trained ten times, with the average performance computed. The variability around this average was very small, and this is now stated in the Methods.

      “For the neural networks, the training/testing procedure was repeated 10 times with different random seeds. For most behavioral variables, there was very little variability in performance across repetitions. However, there were a few outliers for which variability was larger. Reported performance for each behavioral group is the average performance across the 10 repetitions to ensure results were not sensitive to any specific random initialization of each network.”

      - For Figure 6, it is unclear whether the neuron-dropping process was repeated multiple times. If not, it should be since the results will be sensitive to which particular subsets of neurons were "dropped". In this case, the results presented in Figure 6 should include error bars to describe the variability in the model performance for each decoder considered. 

      A good point. The results in Figure 8 (previously Figure 6) were computed by averaging over the removal of different random subsets of neurons (50 subsets per neuron count), just as the reviewer requests. The figure has been modified to include the standard deviation of performance across these 50 subsets. The legend clarifies how this was done.

      Reviewer #3 (Recommendations For The Authors): 

      Other comments: 

      (1) [Line 185-188] The authors argue that in a 100-dimensional space with 10 possible discretized values, 10^100 potential neural states need to be computed. But I am not clear on this. This argument seems to hold only in the absence of a model (as in MINT). For a model, e.g., Kalman filter or AutoLFADS, information is encoded in the latent state. For example, a simple Kalman filter for a linear model can be used for efficient inference. This 10^100 computation isn't a general problem but seems MINT-specific, please clarify. 

      We agree this section was potentially confusing. It has been rewritten. We were simply attempting to illustrate why maximum likelihood computations are challenging without constraints. MINT simplifies this problem by adding constraints, which is why it can readily provide data likelihoods (and can do so using a Poisson model). The rewritten section is below:

      “Even with 1000 samples for each of the neural trajectories in Figure 3, there are only 4000 possible neural states for which log-likelihoods must be computed (in practice it is fewer still, see Methods). This is far fewer than if one were to naively consider all possible neural states in a typical rate- or factor-based subspace. It thus becomes tractable to compute log-likelihoods using a Poisson observation model. A Poisson observation model is usually considered desirable, yet can pose tractability challenges for methods that utilize a continuous model of neural states. For example, when using a Kalman filter, one is often restricted to assuming a Gaussian observation model to maintain computational tractability.”
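
      The tractability argument can be sketched in a few lines. The example below is an illustration only: it scores a single time bin of spike counts against every state in a small, hypothetical library, whereas MINT itself uses a spike-count history and a recursive implementation that are not reproduced here. All names and numbers are assumptions.

```python
import math

def poisson_log_likelihood(counts, rates_hz, bin_s):
    """Log-likelihood of spike counts under independent Poisson neurons."""
    ll = 0.0
    for n, rate in zip(counts, rates_hz):
        lam = max(rate * bin_s, 1e-9)  # expected count; floor avoids log(0)
        ll += n * math.log(lam) - lam - math.lgamma(n + 1)
    return ll

def most_likely_state(counts, library, bin_s):
    """Exhaustively score every library state; returns (ll, condition, index)."""
    best = None
    for c, trajectory in enumerate(library):
        for k, rates_hz in enumerate(trajectory):
            ll = poisson_log_likelihood(counts, rates_hz, bin_s)
            if best is None or ll > best[0]:
                best = (ll, c, k)
    return best

# Hypothetical library: 2 conditions x 3 states, each state = rates (Hz) of 3 neurons.
library = [
    [[5.0, 5.0, 5.0], [20.0, 5.0, 5.0], [40.0, 5.0, 5.0]],
    [[5.0, 5.0, 40.0], [5.0, 20.0, 40.0], [5.0, 40.0, 40.0]],
]

observed_counts = [4, 0, 1]  # spikes per neuron in one 100 ms bin
n_states = sum(len(trajectory) for trajectory in library)
best_ll, best_condition, best_index = most_likely_state(observed_counts, library, bin_s=0.1)
```

      The number of likelihood evaluations equals the number of library states (6 here, 4000 in the example quoted above), rather than growing with the volume of a continuous state space.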

      (2) [Figure 6b] Why do the authors set the dropped neurons to zero in the "zeroed" results of the robustness analysis? Why not disregard the dropped neurons during the decoding process? 

      We agree the terminology we had used in this section was confusing. We have altered the figure and rewritten the text. The following, now at the beginning of that section, addresses the reviewer’s query: 

      “It is desirable for a decoder to be robust to the unexpected loss of the ability to detect spikes from some neurons. Such loss might occur while decoding, without being immediately detected. Additionally, one desires robustness to a known loss of neurons / recording channels. For example, there may have been channels that were active one morning but are no longer active that afternoon. At least in principle, MINT makes it very easy to handle this second situation: there is no need to retrain the decoder, one simply ignores the lost neurons when computing likelihoods. This is in contrast to nearly all other methods, which require retraining because the loss of one neuron alters the optimal parameters associated with every other neuron.”
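
      The distinction between undetected and known neuron loss can be sketched as follows. This is a simplified, single-bin illustration with a hypothetical two-state library; it assumes that an undetected lost neuron simply reports zero spikes, and it is not the authors' implementation.

```python
import math

def log_likelihood(counts, expected_counts, keep=None):
    """Poisson log-likelihood up to a constant that is shared across states.

    `keep` lists the neuron indices to include. Omitting a neuron removes
    its term from the sum; no other parameter needs to change.
    """
    indices = range(len(counts)) if keep is None else keep
    return sum(counts[i] * math.log(expected_counts[i]) - expected_counts[i]
               for i in indices)

def decode(counts, states, keep=None):
    """Index of the library state with the highest log-likelihood."""
    return max(range(len(states)),
               key=lambda s: log_likelihood(counts, states[s], keep))

# Hypothetical library of two states (expected spike counts for 3 neurons).
states = [
    [0.2, 2.0, 2.0],  # state 0: neuron 0 nearly silent
    [6.0, 2.0, 3.0],  # state 1: neuron 0 very active
]

# The true state is state 1, but neuron 0 has been lost and reports no spikes.
observed = [0, 2, 3]

undetected = decode(observed, states)                   # loss not known
known_loss = decode(observed, states, keep=[1, 2])      # lost neuron ignored
```

      In this toy case the silent channel pulls the estimate toward the wrong state when the loss is undetected, while dropping the lost neuron's term recovers the correct state without retraining anything.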

      The figure has been relabeled accordingly; instead of the label ‘zeroed’, we use the label ‘undetected neuron loss’.

      (3) Authors should provide statistical significance on their results, which they already did for Fig. S3a,b,c but missing on some other figures/places. 

      We have added error bars in some key places, including in the text when quantifying MINT’s performance in the context of Figure 4. Importantly, error bars are only as meaningful as the source of error they assess, and there are reasons to be careful given this. The standard method for putting error bars on performance is to resample trials, which is indeed what we now report. These error bars are very small. For example, when decoding horizontal velocity for the MC_Maze dataset, the correlation between MINT’s decode and the true velocity had a mean and SD of the sampling distribution of 0.963 +/- 0.001. This means that, for a given dataset and target variable, we have enough trials/data that we can be quite certain of how well MINT performs. However, we want to be careful not to overstate this certainty. What one really wants to know is how well MINT performs across a variety of datasets, brain areas, target variables, neuron counts, etc. It is for this reason that we make multiple such comparisons, which provides a more valuable view of performance variability.
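
      A resampling-of-trials estimate of this kind can be sketched as below. The sketch uses synthetic data and assumes trials are resampled with replacement and pooled before computing the correlation; these details, the trial counts, and all names are assumptions rather than a description of the analysis actually performed.

```python
import math
import random

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pooled_correlation(trials):
    """Correlation between true and decoded values, pooled across trials."""
    truth = [v for t, _ in trials for v in t]
    decoded = [v for _, d in trials for v in d]
    return pearson_r(truth, decoded)

def bootstrap_over_trials(trials, statistic, n_resamples=200, seed=0):
    """Mean and SD of a statistic across trial resamples (with replacement)."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_resamples):
        resampled = [rng.choice(trials) for _ in trials]
        values.append(statistic(resampled))
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return mean, math.sqrt(variance)

def make_synthetic_trials(n_trials=40, n_samples=20, noise_sd=0.1, seed=1):
    """Synthetic (truth, decoded) pairs: a sinusoid plus Gaussian decode error."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        amplitude = rng.uniform(0.5, 1.5)
        truth = [amplitude * math.sin(2 * math.pi * t / n_samples)
                 for t in range(n_samples)]
        decoded = [v + rng.gauss(0.0, noise_sd) for v in truth]
        trials.append((truth, decoded))
    return trials

trials = make_synthetic_trials()
mean_r, sd_r = bootstrap_over_trials(trials, pooled_correlation)
print(f"r = {mean_r:.3f} +/- {sd_r:.3f}")
```

      The SD returned here quantifies only the uncertainty due to which trials were sampled. It says nothing about variability across datasets, brain areas, or decoded variables, which is the caveat raised above.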

      For Figure 7, error bars are unavailable. Because this was a benchmark, there was exactly one test-set that was never seen before. This is thus not something that could be resampled many times (that would have revealed the test data and thus invalidated the benchmark, not to mention that some of these methods take days to train). We could, in principle, have added resampling to Figure 5. In our view it would not be helpful and could be misleading for the reasons noted above. If we computed standard errors using different train/test partitions, they would be very tight (mostly smaller than the symbol sizes), which would give the impression that one can be quite certain of a given R^2 value. Yet variability in the train/test partition is not the variability one is concerned about in practice. In practice, one is concerned about whether one would get a similar R^2 for a different dataset, or brain area, or task, or choice of decoded variable. Our analysis thus concentrated on showing results across a broad range of situations. In our view this is a far more relevant way of illustrating the degree of meaningful variability (which is quite large) than resampling, which produces reassuringly small but (mostly) irrelevant standard errors.

      Error bars are supplied in Figure 8b. These error bars give a sense of variability across re-samplings of the neural population. While this is not typically the source of variability one is most concerned about, for this analysis it becomes appropriate to show resampling-based standard errors because a natural concern is that results may depend on which neurons were dropped. So here it is both straightforward, and desirable, to compute standard errors. (The fact that MINT and the Wiener filter can be retrained many times swiftly was also key – this isn’t true of the more expressive methods). Figure S1 also uses resampling-based confidence intervals for similar reasons.

      (4) [Line 431-437] Authors state that MINT outperforms other methods with the PSTH R^2 metric (trial-averaged smoothed spikes for each condition). However, I think this measure may not provide a fair comparison and is confounded because MINT's library is built using PSTH (i.e., averaged firing rate) but other methods do not use the PSTH. The author should clarify this. 

      The PSTH R^2 metric was not created by us; it was part of the Neural Latents Benchmark. They chose it because it ensures that a method cannot ‘cheat’ (on the Bits/Spike measure) by reproducing fine features of spiking while estimating rates badly. We agree with the reviewer’s point: MINT’s design does give it a potential advantage in this particular performance metric. This isn’t a confound though, just a feature. Importantly, MINT will score well on this metric only if MINT’s neural state estimate is accurate (including accuracy in time). Without accurate estimation of the neural state at each time, it wouldn’t matter that the library trajectory is based on PSTHs. This is now explicitly stated:

      “This is in some ways unsurprising: MINT estimates neural states that tend to resemble (at least locally) trajectories ‘built’ from training-set-derived rates, which presumably resemble test-set rates. Yet strong performance is not a trivial consequence of MINT’s design. MINT does not ‘select’ whole library trajectories; PSTH R^2 will be high only if condition (c), index (k), and the interpolation parameter (α) are accurately estimated for most moments.”

    1. Author response:

      The following is the authors’ response to the original reviews.

      We thank the reviewers for their careful and positive assessment of our manuscript. Our findings are perhaps best summarized in the model below, which shows that KDM5 inhibition/loss mediates a viral mimicry and DNA damage response through the generation of R-loops in genomic repeats. This mechanism differs from the better-studied double-stranded RNA-induced “viral mimicry” response. Our studies also suggest that KDM5 inhibition may have a larger therapeutic window than STING agonists, since KDM5 inhibition seemingly does not induce “viral mimicry” in normal breast epithelial cells.

      Author response image 1.

      Model of viral mimicry activation. De-repression of repetitive elements may trigger dsRNA formation, which activates the RIG-I/MDA5 pathway, as well as PKR. Alternatively, de-repression of these elements may induce transcription-replication conflicts (TRCs), resulting in R-loop formation. R-loops can lead to DNA damage, and/or activate the cGAS/STING pathway. Both the MAVS pathway and the cGAS/STING pathway converge to activate type I interferon (IFN) responses, resulting in decreased cell fitness and/or increased immunogenicity.

      We do agree with the assessment that the study would be strengthened by in vivo studies. However, there are 4 different isoforms of KDM5 (3 in females), and existing KDM5-specific inhibitors do not have adequate PK/PD properties for in vivo studies. We would also like to note that most mouse studies have not been proven to accurately predict immunotherapy responses in patients. Future studies in ex vivo tumor models would strengthen the clinical relevance of these studies. In the interim, we have added some normal macrophage studies in Figure S5 and an example of studies in normal T-cells below. Such studies will also be important to ensure that future KDM5 inhibitors do not have adverse effects on the immune system. Here, we observe that KDM5 inhibition has a neutral or slightly negative effect on T-cell viability (Author response image 2a). However, KDM5 inhibition also results in increased CD107a expression in T-cells, indicative of a more cytotoxic phenotype (Author response image 2b). These studies suggest that KDM5 inhibitors do not have significant adverse effects on T-cells or macrophages (Figure S5) in the normal immune environment.

      Author response image 2.

      KDM5 inhibition does not have significant adverse effects on T-cells. a) Fold change in proliferation of T-cells from 2 different human donors (left and right panels on graph) activated with 0.25 µg/ml CD3 and treated with the indicated concentrations of C48 or a positive control (CBLB) compared to vehicle controls. b) FACS plots and histograms of CD107a surface expression (x-axis) versus forward scatter (FSC, y-axis) of T-cells from 2 different human donors activated with 0.25 µg/ml or 0.5 µg/ml CD3 and treated with the indicated concentrations of C48.

      Specific comments and answers to Reviewer #1:

      We have added some additional analysis of data from other breast cancer cell lines to strengthen our points (Figure S2f, Figure S3e, Figure S4g-h, k). We have also uploaded all the data to GEO with the following accession numbers:

      GSE296387: H3K4me3 CUT-and-Tag data

      GSE296584: S9.6 CUT-and-Tag data

      GSE296974: RNA-sequencing data

      Responses to Reviewer #1 (Recommendations for the authors):

      (1) We have not conducted genomic studies comparing KDM5 expression to retroelement activation status in the tumor data sets but recognize that this is important for future studies. Again, there are several KDM5 isoforms and looking at repeat expression in these larger data sets is complex. We have added some data correlating KDM5 expression with ISG signatures in Figure S3j-l as well as in the graph below (Author response image 3). The correlation with ISG and AP signatures is modest, but strongest for KDM5B and C in breast cancer data sets, consistent with our disruption data for these 2 isoforms. As mentioned above, we do agree that future studies of KDM5s along with a broader analysis of other epigenetic modifying enzymes over repeats in various cancer types will shed light on the role of histone modifying enzymes in suppressing “viral mimicry” in tumors.

      Author response image 3.

      Correlation between gene expression and IFN gene set GSVA scores in breast cancer cell lines. a) Pearson correlation score between gene expression and IFN signature (ISG) gene set variation analysis (GSVA) scores in breast cancer cell lines as reported in DepMap. Higher ranks indicate an inverse correlation between expression of the individual gene and the expression of the ISG gene set. Correlation ranks for KDM5A, B and C are highlighted. b) as in a), but comparing gene expression to antigen presentation (AP) GSVA scores.

      (2) We apologize for the mislabeling in Figure 2B; this has been corrected in the revised version.

      (3) We agree that blocking the cGAS/STING pathway only partially rescues the ISRE-GFP and HLA-A, B, C phenotype in HCC1428 cells. We have added data (Figure S2f) showing that this rescue is stronger in MCF7 cells. It is possible that the MDA5/MAVS pathway may also contribute to activation of the Type I interferon response. However, we have data that MAVS plays a minor (if any) role in this context, as MAVS KO minimally decreases C48-induced ISRE-GFP activity and HLA-A, B, C surface expression in HCC1428 cells (added Figure S2g).

      Furthermore, there is no significant increase in dsRNA observed (using J2 antibody as a readout in immunofluorescence experiments) with C48 treatment as compared to 5’-azacytidine treatment or ADAR K/O (data not included). However, we have not performed MAVS/PKR K/O experiments to completely rule out the involvement of the dsRNA sensing pathways.

      (4) These experiments were performed in the Operetta imaging system, rather than by confocal imaging, and therefore we do not have such images. Quantification of RNaseH1-GFP in the whole cell is reported in the figure, as the RNaseH1-GFP signal is increased in both the nucleus and the cytoplasm with C48 treatment. This is not unexpected, as our data suggest that R-loop formation occurs in repetitive regions of the genome that are de-repressed by KDM5 inhibition in the nucleus, and the RNA/DNA hybrids generated from R-loops may activate the cGAS/STING pathway in the cytoplasm.

      (5) Disruption of XPF and XPG is relatively toxic in itself. Complete knockouts in breast cancer cells were not viable, so we partially knocked down XPF using siRNA instead. We agree that these kinds of rescue studies need to be expanded upon in future studies, but they served as further proof of the conclusions presented here.

      (6) We have provided all the data in GEO, and alternative representations can be made from them.

      (7) Unfortunately, CUT-and-Tag experiments were not performed in cells treated with siXPF and therefore we cannot provide this data. However, XPF has been previously shown to be responsible for excising R-loops from the genome, rendering them detectable by cGAS/STING in the cytoplasm (Crossley et al, 2022, referenced in the current MS). Therefore, while we demonstrate that XPF knockdown attenuates type I IFN pathway activation upon KDM5 inhibition, it may not necessarily reduce R-loop formation in retroelements; it may just prevent their excision and downstream cGAS/STING activation. We do agree that CUT-and-Tag experiments in cells treated with siXPF versus siControl will have to be performed in the future to test this hypothesis.

      Responses to Reviewer #2 (Recommendations for the authors):

      (1) We have modified the text as well as the figure legend to state that this is a simplistic representation of the pathway in normal cells. As stated in the introduction, these pathways can be modified in tumors. The data presented suggest that the dsRNA pathway can be activated in all breast cancer cell lines tested, whereas more variation is observed in the activation of the STING pathway.  

      (2) The ADAR guides target ADAR p110 and p150 but not ADAR2. This has been clarified in the text.

      (3) The guides have been renamed in the figure as the reviewer suggests.  

      (4) It has been shown by others that KDM5 can occupy the STING promoter (https://pubmed.ncbi.nlm.nih.gov/30080846/), which supports the reviewer’s suggestion that STING upregulation in HMECs may be due to increased H3K4me3 at the STING gene. However, we argue that STING upregulation is not sufficient to activate “viral mimicry” because normal cells lack the “tumor-specific R-loops” that arise from the increase in transcription-replication conflicts (TRCs) in tumor cells. It is interesting to note that the S9.6 signal in subtelomeric regions is increased in HMECs, similar to what is observed in tumor cells. However, the S9.6 signal over other repeats is not (Author response image 4), suggesting that C48-induced increases over non-telomeric repeats are tumor specific. This suggests that the tumor-specific increases in R-loop formation, which lead to “viral mimicry” activation, are not driven by those formed in subtelomeric regions. Future studies will have to expand on these findings.

      Author response image 4.

      Percent of S9.6 reads that align to repetitive genome in HMEC cells. (a) % of total aligned S9.6 reads that map to subtelomeric region in HMEC cells treated with DMSO or 2.5 μM C48. (b) % of total aligned S9.6 reads that map to repetitive elements in general in HMEC cells treated as in a).

      (5) Clarity on R-loop quantification has been added to the figure legend as well as to the Materials and Methods section. Mean fluorescence intensity in the whole cell (including both nuclear and cytoplasmic signals) was quantified and normalized to the number of DAPI-stained nuclei per well. As mentioned above, all quantification was performed in the Operetta imaging system.

      (6) We have added some data showing that increases in H3K4me3 are observed in and around ISGs upon KDM5 inhibition (Figure S4f). However, without time course experiments it is difficult to assess whether these are direct effects of the KDM5 inhibitor or indirect effects from activation of Type I IFN (similar to what has previously been reported with 5’-azacytidine induction of “viral mimicry”, https://pubmed.ncbi.nlm.nih.gov/26317465/).

      (7) We have previously included data showing that S9.6 reads in repeats that do not display C48-mediated increases in H3K4me3 also do not increase with C48 treatment (this is now Figure S4o). In addition, we have added some data showing that repeats with increased H3K4me3 and repeats with increased transcription upon C48 treatment also have increased S9.6 reads. Repeats that display increases in both H3K4me3 and mRNA expression have even greater increases in S9.6 signal than repeats that have increases in only one (Figure S4m-n). Taken together, these data suggest that KDM5 inhibition increases H3K4me3 in repeats, thereby allowing for their transcription, which can increase the probability of transcription-replication conflicts (TRCs) and R-loop formation at such loci.

      (8) As mentioned earlier in this response, while we observe increased S9.6 reads in subtelomeric regions of HCC1428 cells upon KDM5 inhibition, we also observe this in normal HMEC cells. Since KDM5 inhibition does not induce viral mimicry in HMEC cells, this suggests that R-loops formed in subtelomeric regions do not dictate the response observed with C48 treatment in breast cancer cells.

      We hope that these answers to the reviewers’ comments, as well as the additional data provided, strengthen our findings.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Summary of Revisions

      We sincerely thank the editors and reviewers for their thorough assessment and constructive feedback, which has greatly improved our manuscript. We have carefully addressed all concerns as summarized below:

      In response to the requests made by Reviewer #1:

      • Clarified task design and acknowledged its limitations regarding endpoint accuracy control.

      • Included analysis comparing the effects of cerebellar block on within-trial versus inter-trial movements.

      • Clearly defined target groupings, replacing the term “single-joint” with “movements with low coupling torques” and “multi-joint” with “movements with high coupling torques”; these definitions are now supported by supplementary material describing the net torque data as a function of target.

      • Added detailed descriptions of trial success criteria, based on timing and positional constraints.

      • Expanded figures illustrating the effect of the cerebellar block on movement decomposition and variability in joint space and across different target directions.

      In response to the requests made by Reviewer #2:

      • Included an explicit discussion highlighting why the acute reduction in muscle torque during cerebellar block is likely due to agonist weakness rather than cocontraction, emphasizing the rationale behind our torque-centric analysis.

      • Clearly defined trial success criteria and included the timing and accuracy constraints used in our study.

      • Clarified our rationale for grouping targets based on shoulder flexion/extension, clearly justified by interaction torque analysis.

      • Revised the caption and legend of Figure 3d for clarity and included partial correlation results to account for the variability across monkeys for the analysis of reduction in hand velocity vs. coupling torque in control. 

      In response to the requests made by Reviewer #3:

      • Included electrophysiological validation of the accuracy of targeting the superior cerebellar peduncle from one of the monkeys used in the experiment.

      • Provided new analyses comparing movement decomposition and variability between slower and faster movements within the cerebellar block condition.

      • Revised manuscript text to clarify terminology and clearly explained the rationale behind target groupings and torque analyses.

      • Expanded discussion sections to better explain the relationships between timing deficits, movement decomposition, trajectory variability, and faulty motor commands.

      • Clarified methodological choices regarding our analysis timeframe and acknowledged limitations related to the distinction between feedforward and feedback control.

      Reviewer #1 (Public review): 

      Summary:

      In a previous work, Prut and colleagues had shown that during reaching, high-frequency stimulation of the cerebellar outputs resulted in reduced reach velocity. Moreover, they showed that the stimulation produced reaches that deviated from a straight line, with the shoulder and elbow movements becoming less coordinated. In this report, they extend their previous work by the addition of modeling results that investigate the relationship between the kinematic changes and torques produced at the joints. The results show that the slowing is not due to reductions in interaction torques alone, as the reductions in velocity occur even for movements that are single joints. More interestingly, the experiment revealed evidence for the decomposition of the reaching movement, as well as an increase in the variance of the trajectory.

      Strengths:

      This is a rare experiment in a non-human primate that assessed the importance of cerebellar input to the motor cortex during reaching.

      We thank the reviewer for their positive feedback on our study. We particularly appreciate their recognition of the novelty and importance of our experimental approach in non-human primates, as well as their insightful summary of our key findings.

      Weaknesses:

      My major concerns are described below.

      If I understand the task design correctly, the monkeys did not need to stop their hand at the target. I think this design may be suboptimal for investigating the role of the cerebellum in control of reaching because a number of earlier works have found that the cerebellum's contributions are particularly significant as the movement ends, i.e., stopping at the target. For example, in mice, interposed nucleus neurons tend to be most active near the end of the reach that requires extension, and their activation produces flexion forces during the reach (Becker and Person 2019). Indeed, the inactivation of interposed neurons that project to the thalamus results in overshooting of reaching movements (Low et al. 2018). Recent work has also found that many Purkinje cells show a burst-pause pattern as the reach nears its endpoint, and stimulation of the mossy fibers tends to disrupt endpoint control (Calame et al. 2023). Thus, the fact that the current paper has no data regarding endpoint control of the reach is puzzling to me.

      We appreciate the reviewer’s point that cerebellar contributions can be particularly critical near the endpoint of a reach. In our task design, monkeys were indeed required to hold at the target briefly (100 ms for Monkeys S and P, and 150 ms for Monkeys C and M) before receiving the reward. However, given the size of the targets and the velocity of movements, the monkeys often did not have to stop their movements fully to obtain the reward. Importantly, we relaxed the task’s requirements (by increasing the target size and reducing the temporal constraints) to enable the monkeys to perform a sufficient number of successful trials under both the control and the cerebellar block conditions. This was necessary because strict criteria for these parameters yielded a very low success rate in the cerebellar block condition. Nevertheless, as we now appreciate, this task design is suboptimal for studying endpoint accuracy, which is an important aspect of cerebellar control. In the Methods section of our revised manuscript, we have clarified this aspect of the task design and acknowledged that it is suboptimal for examining the role of the cerebellum in endpoint control (lines 475-485). The task design of our future studies will address this point explicitly.

      Because stimulation continued after the cursor had crossed the target, it is interesting to ask whether this disruption had any effects on the movements that were task-irrelevant. The reason for asking this is because we have found that whereas during task-relevant eye or tongue movements the Purkinje cells are strongly modulated, the modulations are much more muted when similar movements are performed but are task-irrelevant (Pi et al., PNAS 2024; Hage et al. Biorxiv 2024). Thus, it is interesting to ask whether the effects of stimulation were global and affected all movements, or were the effects primarily concerned with the task-relevant movements.

      This is an insightful suggestion. The behavioral task in the present study was designed with a focus on task-relevant, reward-associated reaching movements. Nevertheless, we also have data on the inter-trial movements (e.g., return-to-center reaches) under continued cerebellar stimulation, which were not directly associated with reward. In response to the reviewer’s comment, we compared the effects of cerebellar block on peak hand velocities between these two types of movements. We found that reductions in peak hand velocity during inter-trial movements were significantly smaller than those observed during the target-directed reaches. We have updated the Results section of our manuscript (lines 125-137) and expanded our supplementary document (Supplementary Figure S1) to include this analysis.

      If the schematic in Figure 1 is accurate, it is difficult for me to see how any of the reaching movements can be termed single joint. In the paper, T1 is labeled as a single joint, and T2-T4 are labeled as dual-joint. The authors should provide data to justify this.

      The reviewer is correct. Movements to all targets involved both shoulder and elbow joints, but the degree to which each joint participated varied in a target-specific manner. In our original manuscript, we used the term “single-joint” to refer to movements in which one joint was nearly stationary, resulting in minimal coupling torque at the adjacent joint. Specifically, for Targets 1 and 5, the net torque (and thus acceleration) at the elbow was negligible, causing the shoulder to experience low coupling torques (as illustrated in Figure 3c of our revised manuscript). Following this comment, and to avoid confusion, we have now explained this explicitly in the revised manuscript (lines 178-187). This is supported by Supplementary Figure S2, which shows the net torques at the shoulder and elbow for movements to each target. We have also replaced the terms ‘single-joint movements’ and ‘multi-joint movements’ with ‘movements with low coupling torques’ and ‘movements with high coupling torques’, respectively, in our revised manuscript (lines 178-180, 204-207, 225-227, 230-232, 305-307, and 362-365).

      Because at least part of this work was previously analyzed and published, information should be provided regarding which data are new.

      While some of the same animals and stimulation protocol were presented in prior work, the inverse-dynamics modeling, the analyses exploring progressive velocity changes across trials under a cerebellar block, and the relationship of motor noise to movement velocity are newly reported in this manuscript. We have included a clear statement in the Methods section specifying which components of the dataset and analyses are entirely new (lines 582-589).

      Reviewer #1 (Recommendations for the authors):

      (1) Before the results are presented, it is useful to present the experimental paradigm in more detail. For example, after the center-out movement was completed, was the monkey required to hold at the target location? How did the next trial begin (re-centering movement)? Next, specify the stimulation protocol, noting that each session was divided into 3-4 blocks of stimulation and not stimulation, with each block 50-80 trials.

      We have updated the results section of our revised manuscript (lines 91-104) to present the experimental paradigm in more detail according to the reviewer’s advice.

      (2) Figure 1. Hand velocity does not show how the reach was completed. Did the subjects stop at the target or simply shoot through it and turn around without stopping? Why are the traces cut off?

      Monkeys were indeed required to hold at the target briefly (100-150 ms) before receiving the reward. However, given the size of the targets and the velocity of movements, the monkeys often did not have to stop their movements fully to obtain the reward. The hand velocity profile shown in Figure 1b and the torque profiles shown in Figures 2a and 2b correspond to the period from movement onset to the entry of the control cursor into the peripheral target, which marked the end of the movement for the trial. Since the monkeys did not have to stop their movements fully for the trial to end, the traces appear cut off at the beginning of the deceleration/stopping phase of the movement. We have updated the captions of Figures 1b, 2a, and 2b to include this information (lines 869-872 and 882-884).

      (3) Maybe state that the data regarding reaction times are not presented because of the task design in which the go signal was predictable.

      In monkeys M and C, the timing of the go signal was fixed and therefore predictable. Furthermore, they were also allowed a grace period of 200 ms before the go signal to facilitate predictive timing which often resulted in negative reaction times. However, in Monkeys S and P, the go signal was variable in timing and the monkeys were not allowed to initiate the movements before the go signal. In our previous studies (Nashef et al., 2019; Israely et al. 2025), we reported increased reaction times under cerebellar block. However, since the present study focuses specifically on execution-related motor deficits, we did not analyze reaction time data. 

      (4) Please provide the data and analysis regarding the entire reach, including the period after the cursor crosses the target and returns to the center position.

      We compared the peak hand velocity of the target-directed movements to the inter-trial return-to-center movements. Cerebellar block produced significantly smaller reductions in peak hand velocity during inter-trial movements compared to within-trial reaches. The results section of our revised manuscript (lines 125-137) and the supplementary material (Supplementary Figure S1) have been updated accordingly. While the behavioral task in the present study was designed with a focus on task-relevant, reward-associated reaching movements, it will be interesting to examine in detail the effect of cerebellar block on spontaneous movements in a future study.

      (5) Figure 5. To illustrate the decomposition of multijoint movements into a sequence of single joint movements, I suggest plotting movements in joint space (in addition to Cartesian space as you have done now). The results in Figure 5 are most interesting and thus should be expanded. Please provide this data using the format in Figure 1C, that is, as a function of direction.

      Following the reviewer’s suggestion, we have plotted sample trajectories in joint-velocity (Supplementary Figures 3a and b) and position space (Supplementary Figures 4a and b) to highlight the decomposition of multi-joint movements and increased inter-trial trajectory variability respectively during the cerebellar block. Additionally, we also analyzed movement decomposition and trajectory variability as a function of target direction (Supplementary Figures 3c and 4c respectively). The corresponding text in the Results section has been updated accordingly (lines 256-261, 267-271, 277-278 and 280-288).

      Reviewer #2 (Public review):

      This manuscript asks an interesting and important question: what part of 'cerebellar' motor dysfunction is an acute control problem vs a compensatory strategy to the acute control issue? The authors use a cerebellar 'blockade' protocol, consisting of high-frequency stimuli applied to the cerebellar peduncle which is thought to interfere with outflow signals. This protocol was applied in monkeys performing center-out reaching movements and has been published from this laboratory in several preceding studies. I found the take-home message broadly convincing and clarifying - that cerebellar block reduces muscle activation acutely particularly in movements that involve multiple joints and therefore invoke interaction torques, and that movements progressively slow down to in effect 'compensate' for these acute tone deficits. The manuscript was generally well written, and the data was clear, convincing, and novel. My comments below highlight suggestions to improve clarity and sharpen some arguments.

      We thank the reviewer for their thoughtful and constructive feedback. We are grateful for their recognition of the significance of our findings regarding acute and compensatory motor responses following a cerebellar block.

      Primary comments:

      (1) Torque vs. tone: Is it known whether this type of cerebellar blockade is reducing muscle tone or inducing any type of acute co-contraction that could influence limb velocity through mechanisms different than 'atonia'? If so, the authors should discuss this information in the discussion section starting around line 336, and clarify that this motivates (if it does) the focus on 'torques' rather than muscle activation. Relatedly, besides the fact that there are joints involved, is there a reason there is so much emphasis on torque per se? If the muscle is deprived of sufficient drive, it would seem that it would be more straightforward to conceptualize the deficit as one of insufficient timed drive to a set of muscles than joint force. Some text better contextualizing the choices made here would be sufficient to address this concern. I found statements like those in the introduction "hand velocity was low initially, reflecting a primary muscle torque deficit" to be lacking in substance. Either that statement is self-evident or the alternative was not made clear. Finally, emphasize that it is a loss of self-generated torque at the shoulder that accounts for the velocity deficits. At times the phrasing makes it seem that there is a loss of some kind of passive torque.

      We appreciate the reviewer's emphasis on distinguishing between reduced muscle tone and altered co-contraction patterns as potential explanations for decreased limb velocity. Our focus on torques per se arises from previous studies suggesting that a core deficit in cerebellar ataxia is impaired prediction of passive coupling torques (Bastian et al., 1996). In our study, we demonstrate that motor deficits in cerebellar ataxia result in fact from both the inability to compensate for passive coupling torques and an acute insufficiency in the ability to generate active muscle torques.

      The muscle torque, representing the sum of all muscle forces acting at a joint, can indeed be reduced by either of two mechanisms: (i) co-contraction of agonist and antagonist muscles, and/or (ii) insufficient agonist muscle activity (i.e., agonist weakness). In cerebellar ataxia, co-contraction has been proposed as a simplifying strategy to stabilize stationary joints during decomposed multi-joint movements (Bastian et al., 1996). In our experiments, this strategy would likely emerge gradually following cerebellar block, similar to the adaptive slowing of movements aimed at reducing inter-joint interactions. However, we found that, irrespective of the magnitude of the coupling torques involved, movement velocity was also reduced immediately following cerebellar block, a pattern less consistent with gradually emerging compensatory strategies. We therefore argue that this acute onset of movement slowing was mainly driven by agonist weakness. Our argument is further supported by previous studies that identified reduced agonist muscle activity as a cause of the slowing of voluntary movements in individuals with cerebellar lesions (Hallett et al., 1991; Wild et al., 1996). Additionally, early studies have also reported muscle weakness (asthenia) and hypotonia acutely following cerebellar injury in humans (Haines et al., 2007) and experimental lesions in animals (Luciani, 1893; Bremer et al., 1935; Fulton & Dow, 1937; Granit et al., 1955).

      We have modified the discussion section of our revised manuscript (lines 366-376) to explain/clarify this. Additionally, we have also underscored that the observed velocity deficits primarily reflect a reduction of self-generated torque at the shoulder (whether acute or adaptive), rather than any reduction in passive torque (lines 350-352).
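
      As a schematic aid to the distinction drawn above between self-generated and passive torques, the torque balance commonly used in inverse-dynamics analyses of planar arm movements can be written as follows. This is a sketch under assumed notation: the symbols are introduced here for illustration only, gravitational terms are omitted on the assumption that the movements are confined to a horizontal plane, and the exact model used in the manuscript may differ.

```latex
\tau^{\mathrm{net}}_{j} \;=\; \tau^{\mathrm{mus}}_{j} \;+\; \tau^{\mathrm{coup}}_{j},
\qquad j \in \{\mathrm{shoulder},\ \mathrm{elbow}\}
```

      Here \tau^{net}_j denotes the net torque that produces the acceleration of joint j, \tau^{mus}_j the active torque generated by the muscles acting at that joint, and \tau^{coup}_j the passive coupling torque arising from motion of the adjacent joint. Written this way, a reduced net torque at the shoulder can reflect a reduced muscle torque even when the coupling torque is small.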

      (2) Please clarify some of the experimental metrics: Ln 94 RESULTS. The success rate is used as a primary behavioral readout, but what constitutes success is not clearly defined in the methods. In addition to providing a clear definition in the methods section, it would also be helpful for the authors to provide a brief list of criteria used to determine a 'successful' movement in the results section before the behavioral consequences of stimulation are described. In particular, the time and positional error requirements should be clear.

      Successful trials were defined as trials in which monkeys did not leave the center position before the “Go” signal and entered the peripheral target within a permitted movement time. We have updated the results (lines 91-104) and methods (lines 475-485) sections of our revised manuscript to include (i) the timing criteria of each phase of the trials and (ii) the size of the peripheral targets indicating the tolerance for endpoint accuracy.

      (3) Based on the polar plot in Figure 1c, it seemed odd to consider Targets 1-4 outward and 5-8 inward movements, when 1 and 5 are side-to-side. Is there a rationale for this grouping or might results be cleaner by cleanly segregating outward (targets 2-4) and inward (targets 6-8) movements? Indeed, by Figure 3 where interaction torques are measured, this grouping would seem to align with the hypothesis much more cleanly since it is with T2, T3, and T4 where clear coupling torque deficits are seen with cerebellar block.

      We acknowledge the reviewer's observation regarding the classification of targets 1 and 5 as side-to-side movements rather than strictly "outward" or "inward." In the initial section of our results, we grouped the targets based on shoulder joint movements: "outward" targets involved shoulder flexion, while "inward" targets involved shoulder extension. This classification highlighted the more pronounced effect of cerebellar block on movements requiring shoulder flexion compared to those requiring shoulder extension. For subsequent analyses, we focused on the effects of cerebellar block on movements to "outward" targets, which included directions involving low (target 1) or high (targets 2–4) coupling torques. To clarify this aspect, we have revised our manuscript to explain our definition of "outward" (targets 1–4) and "inward" (targets 5–8) target groupings based on shoulder  flexion and extension movements respectively (lines 117-120).

      (4) I did not follow Figure 3d. Both the figure axis labels and the description in the main text were difficult to follow. Furthermore, the color code per animal made me question whether the linear regression across the entire dataset was valid, or would be better performed within animal, and the regressions summarized across animals. The authors should look again at this section and figure.

      We have revised the legend of Figure 3d to include a detailed explanation of how the value along each axis is computed (lines 908-920 of the revised manuscript). Please note that the color coding of the data points indicates the target number (T1-T4) and not the monkey (as denoted in the figure legend). Also, data were pooled across monkeys only after confirming that the data from each animal showed a similar trend. Specifically, the correlation coefficients were positive in all four monkeys and statistically significant in three of them. Following the reviewer’s feedback, we have now performed a partial correlation analysis (which controls for the variability across monkeys) and found a significant correlation (r = 0.32, p < 0.001) between the reduction in peak hand velocity during cerebellar block and the net coupling torque impulse. We have updated the manuscript to include the result of the partial correlation analysis (lines 173-176).
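
      One common way to compute a partial correlation that controls for a categorical factor such as animal identity is to remove each animal's mean from both variables and then correlate the residuals. The sketch below illustrates that idea only; the function names and the toy numbers are hypothetical, and the response does not state which exact procedure or software was used.

```python
from collections import defaultdict
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def center_within_groups(values, groups):
    """Subtract each group's mean from its members (residuals after regressing on group identity)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


def partial_corr_by_group(x, y, groups):
    """Correlation between x and y after removing between-group (e.g., between-animal) offsets."""
    return pearson(center_within_groups(x, groups), center_within_groups(y, groups))


# Hypothetical toy data: two animals with the same within-animal trend but different offsets.
coupling_torque = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
velocity_drop = [10.0, 11.0, 12.0, 0.0, 1.0, 2.0]
animal = ["A", "A", "A", "B", "B", "B"]

pooled_r = pearson(coupling_torque, velocity_drop)
partial_r = partial_corr_by_group(coupling_torque, velocity_drop, animal)
```

      In this toy example the pooled correlation is dominated by the offset between the two animals, whereas the partial correlation reflects the shared within-animal trend. A significance test on the partial correlation would also need to reduce the degrees of freedom by the number of group means removed.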

      (5) Line 206+ The rationale for examining movement decomposition with a cerebellar block is presented as testing the role of the cerebellum in timing. Yet it is not spelled out what movement decomposition and trajectory variability have to do with motor timing per se.

      The reviewer is right and the relations between timing, decomposition and variability need to be explicitly explained. In the results  section of our revised manuscript, we have explained how decomposed movements and trajectory variability may reflect impaired temporal coordination across multiple joints—a critical cerebellar function (lines 235-244).

      Reviewer #2 (Recommendations for the authors):

      (1) Rephrase the findings, starting Line 232. Here the authors state, "Next, we asked whether movement decomposition was mainly due to lower hand velocities. We therefore selected a subset of control trials that matched the cerebellar block trials in their peak velocity. However, even though movement decomposition in these control trials was higher compared to all control trials, it was still significantly lower than velocity matched cerebellar block trials." I suggest inverting the final sentence to: "Movement decomposition in control trials was significantly lower than velocity-matched cerebellar block trials, even though these control trials themselves had somewhat higher decomposition indices than all control trials together." A similar issue pops up with trajectory variability below that simply requires some editing to be less clunky.

      Following the reviewer’s suggestion, we have revised the sentences related to movement decomposition and trajectory variability. These sentences now read as follows:

      (lines 267-271 in the revised manuscript): “Movement decomposition in control trials was significantly lower than velocity-matched cerebellar block trials (p < 0.001; Figure 5c), even though these control trials themselves had 11.0% (CI [5.2, 17.0], p = 0.03) higher decomposition than the mean value calculated across all control trials.” 

      (lines 280-288 in the revised manuscript): “When we compared the subset of velocity-matched control and cerebellar block trials, we found that cerebellar block trials exhibited 34.6% (CI [26.2, 43.2], p < 0.001) higher trajectory variability (Figure 5e). Normally, slower movements are also less variable due to the speed-accuracy tradeoff (Plamondon and Alimi 1997). Indeed, the trajectory variability in this subset of slower control trials was 5.5% (CI [0.9, 9.9], p = 0.02) lower than that of all control trials. In other words, despite slower movements, cerebellar block led to increased trajectory variability.”
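
      The response does not specify how the velocity-matched subset of control trials was selected. One plausible approach, shown here purely as an illustration, is greedy nearest-neighbour matching without replacement within a tolerance; the function name, the tolerance parameter, and the example values are all assumptions.

```python
def match_by_peak_velocity(control_peaks, block_peaks, tol):
    """Pair each cerebellar-block trial with the closest unused control trial.

    Returns a list of (control_index, block_index) pairs. A block trial is left
    unmatched if no remaining control trial lies within `tol` of its peak velocity.
    """
    available = list(range(len(control_peaks)))
    pairs = []
    for j, v in enumerate(block_peaks):
        best = None
        for i in available:
            d = abs(control_peaks[i] - v)
            if d <= tol and (best is None or d < abs(control_peaks[best] - v)):
                best = i
        if best is not None:
            available.remove(best)  # each control trial is used at most once
            pairs.append((best, j))
    return pairs


# Hypothetical peak hand velocities (arbitrary units).
control = [10.0, 20.0, 30.0, 21.0]
block = [19.5, 20.5, 50.0]
pairs = match_by_peak_velocity(control, block, tol=2.0)
matched_control = [control[i] for i, _ in pairs]
```

      Decomposition or variability measures could then be compared between the matched control trials and their paired cerebellar block trials. Because the matching is greedy, the result can depend on the order of the block trials; an optimal assignment method would avoid that at the cost of extra complexity.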

      (2) Typo: Ln 73 sequences, not sequence.

      The typo was corrected (line 75 of the revised manuscript).

      Reviewer #3 (Public review):

      Summary:

      In their manuscript, "Disentangling acute motor deficits and adaptive responses evoked by the loss of cerebellar output," Sinha and colleagues aim to identify distinct causes of motor impairments seen when perturbing cerebellar circuits. This goal is an important one, given the diversity of movement-related phenotypes in patients with cerebellar lesions or injuries, which are especially difficult to dissect given the chronic nature of the circuit damage. To address this goal, the authors use high-frequency stimulation (HFS) of the superior cerebellar peduncle in monkeys performing reaching movements. HFS provides an attractive approach for transiently disrupting cerebellar function previously published by this group. First, they found a reduction in hand velocities during reaching, which was more pronounced for outward versus inward movements. By modeling inverse dynamics, they find evidence that shoulder muscle torques are especially affected. Next, the authors examine the temporal evolution of movement phenotypes over successive blocks of HFS trials. Using this analysis, they find that in addition to the acute, specific effects on muscle torques in early HFS trials, there was an additional progressive reduction in velocity during later trials, which they interpret as an adaptive response to the inability to effectively compensate for interaction torques during cerebellar block. Finally, the authors examine movement decomposition and trajectory, finding that even when low-velocity reaches are matched to controls, HFS produces abnormally decomposed movements and higher than expected variability in trajectory.

      Strengths:

      Overall, this work provides important insight into how perturbation of cerebellar circuits can elicit diverse effects on movement across multiple timescales.

      The HFS approach provides temporal resolution and enables analysis that would be hard to perform in the context of chronic lesions or slow pharmacological interventions. Thus, this study describes an important advance over prior methods of circuit disruption, and their approach can be used as a framework for future studies that delve deeper into how additional aspects of sensorimotor control are disrupted (e.g., response to limb perturbations).

      In addition, the authors use well-designed behavioral approaches and analysis methods to distinguish immediate from longer-term adaptive effects of HFS on behavior. Moreover, inverse dynamics modeling provides important insight into how movements with different kinematics and muscle dynamics might be differentially disrupted by cerebellar perturbation.

      We thank the reviewer for their detailed assessment and thoughtful comments and greatly appreciate their positive feedback.  

      Weaknesses:

      The argument that there are acute and adaptive effects to perturbing cerebellar circuits is compelling, but there seems to be a lost opportunity to leverage the fast and reversible nature of the perturbations to further test this idea and strengthen the interpretation. Specifically, the authors could have bolstered this argument by looking at the effects of terminating HFS - one might hypothesize that the acute impacts on muscle torques would quickly return to baseline in the absence of HFS, whereas the longer-term adaptive component would persist in the form of aftereffects during the 'washout' period. As is, the reversible nature of the perturbation seems underutilized in testing the authors' ideas.

      We agree that our approach could more explicitly exploit the rapid reversibility of high-frequency stimulation (HFS) by examining post-stimulation ‘washout’ periods. However, for the present dataset, we ended the session after the set of cerebellar block trials without using an explicit washout period. We plan to study the effect of the cerebellar block on immediate post-block washout trials in the future.    

      The analysis showing that there is a gradual reduction in velocity during what the authors call an adaptive phase is convincing. That said, the argument is made that this is due to difficulty in compensating for interaction torques. Even if the inward targets (i.e., targets 6–8) do not show a deficit during the acute phase, these targets still have significant interaction torques (Figure 3c). Given the interpretation of the data as presented, it is not clear why disruption of movement during the adaptive phase would not be seen for these targets as well since they also have large interaction torques. Moreover, it is difficult to delve into this issue in more detail, as the analyses in Figures 4 and 5 omit the inward targets.

      The reviewer is right that movements to Targets 6–8 (inward) were seemingly unaffected despite also involving significant interaction torques. Specifically, we noted that while outward targets (2–4) tend to involve higher coupling torque impulses on average, this alone does not fully explain the differential impact of cerebellar block, as illustrated by discrepancies at the individual target level (e.g., target 7 vs. target 1). We propose two possible explanations: (1) a bias toward shoulder flexion in the effect of cerebellar block, consistent with earlier studies showing ipsilateral flexor activation or tone changes following stimulation or lesioning of the deep cerebellar nuclei; and (2) posture-related facilitation of inward (shoulder extension) movements from the central starting position. This point is addressed in the Discussion section (lines 404-433 in the revised manuscript).

      The text in the Introduction and in the prior work developing the HFS approach overstates the selectivity of the perturbations. First, there is an emphasis on signals transmitted to the neocortex. As the authors state several times in the Discussion, there are many subcortical targets of the cerebellar nuclei as well, and thus it is difficult to disentangle target-specific behavioral effects using this approach. Second, the superior cerebellar peduncle contains both cerebellar outputs and inputs (e.g., spinocerebellar). Therefore, the selectivity in perturbing cerebellar output feels overstated. Readers would benefit from a more agnostic claim that HFS affects cerebellar communication with the rest of the nervous system, which would not affect the major findings of the study.

      The reviewer is right that the superior cerebellar peduncle carries both descending and ascending fibers, and that cerebellar nuclei project to subcortical as well as cortical targets. Therefore, we cannot rule out the possibility that the effect of HFS may be mediated in part through pathways other than the cerebello-thalamo-cortical pathway (as mentioned in the Discussion section). However, it is also important to note that in primates the cerebello-thalamo-cortical (CTC) pathway has greatly expanded (at the expense of the cerebello-rubro-spinal tract) in mediating cerebellar control of voluntary movements (Horne and Butler, 1995). The cerebello-subcortical pathways diminished in importance over the course of evolution (Nathan and Smith, 1982; Padel et al., 1981; ten Donkelaar, 1988). Previously, we found that the ascending spinocerebellar axons which enter the cerebellum through the superior cerebellar peduncle (SCP) are weakly task-related and that the descending system is quite small (Cohen et al., 2017). We have clarified these points and acknowledged that HFS disrupts cerebellar communication broadly, rather than solely the cerebello-thalamo-cortical pathway, in the Methods section of our revised manuscript (lines 531-544).

      The text implies that increased movement decomposition and variability must be due to noise. However, this assumption is not tested. It is possible that the impairments observed are caused by disrupted commands, independent of whether these command signals are noisy. In other words, commands could be low noise but still faulty.

      We recognize the reviewer’s concern about linking movement decomposition and trial-to-trial trajectory variability with motor noise. We interpret these motor abnormalities as a form of motor noise in the sense that they are generated by faulty motor commands. We base this interpretation on previous research showing that the cerebellum aids in the state estimation of the limb and the subsequent generation of accurate feedforward commands. Therefore, disruption of the cerebellar output may lead to faulty motor commands, resulting in the observed asynchronous joint activations (i.e., movement decomposition) and unpredictable trajectories (i.e., increased trial-to-trial variability). Both observed deficits resemble increased motor noise. This point is presented in our Discussion section (lines 436-458 of the revised manuscript).

      Throughout the text, the use of the term 'feedforward control' seems unnecessary. To dig into the feedforward component of the deficit, the authors could quantify the trajectory errors only at the earliest time points (e.g., in Figure 5d), but even with this analysis, it is difficult to disentangle feedforward- and feedback-mediated effects when deficits are seen throughout the reach. While outside the scope of this study, it would be interesting to explore how feedback responses to limb perturbation are affected in control versus HFS conditions. However, as is, these questions are not explored, and the claim of impaired feedforward control feels overstated.

      We agree that to strictly focus on feedforward control, we could have examined the measured variables in the first 50-100 ms of the movement, which has been shown to be unaffected by feedback responses (Pruszynski et al. 2008, Todorov and Jordan 2002, Pruszynski and Scott 2012, Crevecoeur et al. 2013). However, in our task, the amplitude of movements made by the monkeys was small, and therefore the response measures in the first 50-100 ms were too small for a robust estimation. Also, fixing a time window would have led to an unfair comparison between control and cerebellar block trials, in which velocity was significantly reduced and therefore movement time was longer. Therefore, we used the peak velocity, torque impulse at the peak velocity, and maximum deviation of the hand trajectory as response measures. We have acknowledged this point in the Methods section of our revised manuscript (lines 590-600). We have also refrained from using the term feedforward control throughout the text of our revised manuscript, as suggested by the reviewer.

      The terminology 'single-joint' movement is a bit confusing. At a minimum, it would be nice to show kinematics during different target reaches to demonstrate that certain targets are indeed single joint movements. More of an issue, however, is that it seems like these are not actually 'single-joint' movements. For example, Figure 2c shows that target 1 exhibits high elbow and shoulder torques, but in the text, T1 is described as a 'single-joint' reach (e.g. lines 155-156). The point that I think the authors are making is that these targets have low interaction torques. If that is the case, the terminology should be changed or clarified to avoid confusion.

      Indeed, as reviewer #1 also noted, movements to targets 1 and 5 are not purely single-joint but rather have relatively low coupling torques. Movements to all targets involved both shoulder and elbow joints, but the degree to which each joint participated varied in a target-specific manner. In our original manuscript, we used the term “single-joint” to refer to movements in which one joint was largely stationary, resulting in minimal coupling torque at the adjacent joint. Specifically, for Targets 1 and 5, the net torque, and thus acceleration, at the elbow was negligible, causing the shoulder to experience low coupling torques (as illustrated in Figure 3c of our revised manuscript). Following this comment and to avoid confusion, we have now explained this explicitly in the revised manuscript (lines 178-187). This is supported by Supplementary Figure S2 demonstrating the net torques at the shoulder and elbow for movements to each target. We have also replaced the terms ‘single-joint movements’ and ‘multi-joint movements’ with ‘movements with low coupling torques’ and ‘movements with high coupling torques’ respectively in our revised manuscript (lines 178-180, 204-207, 225-227, 230-232, 305-307, and 362-365).

      The labels in Figure 3d are confusing and could use more explanation in the figure legend. In Figure 3d, it is stated that data from all monkeys is pooled. However, if there is a systematic bias between animals, this could generate spurious correlations. Were correlations also calculated for each animal separately to confirm the same trend between velocity and coupling torques holds for each animal?

      We have revised the legend of Figure 3d to include a detailed explanation of how the values along each axis are computed (lines 908-920 of the revised manuscript). Please note that the pooling of data across monkeys was done after confirming that data from each animal expressed a similar trend. Specifically, the correlation coefficients were positive in all four monkeys and statistically significant in 3 of the 4. Moreover, following the reviewers’ feedback, we also did a partial correlation analysis (which controls for the variability across monkeys) and found a significant correlation (r = 0.32, p < 0.001) between reduction in peak hand velocities during cerebellar block and the net coupling torque impulse. We have updated the manuscript to include the result of the partial correlation analysis (lines 173-176).
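      The response does not state how the partial correlation was computed. As a minimal sketch under that caveat, one common way to control for a categorical factor such as animal identity is to subtract each animal's mean from both variables and then correlate the residuals. The function name and the toy data in the usage note are hypothetical and are not taken from the manuscript.

```python
import numpy as np


def partial_corr_by_group(x, y, groups):
    """Pearson correlation between x and y after removing group means.

    Assumption: the nuisance factor (e.g., monkey identity) is categorical,
    so controlling for it amounts to demeaning x and y within each group.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)

    x_res = x.copy()
    y_res = y.copy()
    for g in np.unique(groups):
        mask = groups == g
        # Remove the between-group offset from both variables.
        x_res[mask] -= x[mask].mean()
        y_res[mask] -= y[mask].mean()

    # Correlate what is left: the within-group co-variation.
    return float(np.corrcoef(x_res, y_res)[0, 1])
```

      With toy values such as `x = [0, 1, 2, 10, 11, 12]`, `y = [10, 11, 12, 0, 1, 2]` and groups `A, A, A, B, B, B`, the pooled correlation is negative even though the relationship within each group is positive, which is the kind of between-animal bias the reviewer raised. A p-value would require a separate test with degrees of freedom reduced by the number of groups.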

      In Table S1, it would be nice to see target-specific success rates. The data would suggest that targets with the highest interaction torques will have the largest reduction in success rates, especially during later HFS trials. Is this the case?

      The breakdown of the percentage increase in failure rate due to cerebellar block as a function of target direction is shown in Author response image 1, included in this response.

      Author response image 1.

      Effect of cerebellar block on failure rate. The change in failure rate for the cerebellar block trials was computed relative to the control trials per session per target. The depicted values are the mean ± 95% confidence intervals across all sessions pooled from all four monkeys. The individual means of each monkey are overlaid. Statistical significance is denoted as follows: NS, p ≥ 0.05; *, p < 0.05; **, p < 0.01; ***, p < 0.001. [T1-8: Targets 1-8].

      The increase in failure rate due to cerebellar block was not affected by the target direction (linear mixed model analysis, target × trial-type interaction effect: p = 0.44). However, it should be noted that success/failure depends on several factors beyond impaired limb dynamics during movement execution. In a previous study (Nashef et al. 2019) we identified several causes of failure, such as (i) not entering the central target in time, (ii) premature exit from the central target before the ‘go’ signal, (iii) reaction time longer than the time permitted to reach the peripheral target after the ‘go’ signal, or (iv) not holding at the peripheral target for the required time at the end of the movement.

      Reviewer #3 (Recommendations for the authors):

      (1) It would be helpful to provide some supplemental information on electrophysiological validation of the targeting in each monkey. Was any variability in targeting observed (e.g., some targeting was more effective at eliciting cortical responses)? If so, does targeting variability relate to any of the variability in behavioral effects of HFS across monkeys?

      Although we currently do not have an exact measure of the proportion of fibers blocked by HFS, our targeting approach consistently elicited robust cortical responses across monkeys. Specifically, we implanted the stimulating electrode at the location that produced the maximum peak-to-peak evoked responses in the primary motor cortex. Author response image 2 in this response demonstrates that even a slight deviation (~0.5 mm) from this optimal site reduced these responses substantially.

      Author response image 2.

      Evoked responses in the primary motor cortex as a function of the location of the stimulation site. [LEFT] Coronal T2-weighted MRI showing the planned trajectory to target the superior cerebellar peduncle (location marked by the tip of the arrowhead) through a round chamber suitably positioned over the skull. [RIGHT] Evoked multi-unit (300-7500 Hz) responses from one of the recording electrodes in the primary motor cortex are used to guide the stimulating electrode to the correct implant site. As the stimulating electrode was lowered deeper, maximum peak-to-peak evoked responses were obtained at a depth of 32.5 mm relative to the cortical surface. This was chosen as the implant site. Elevating or lowering the electrode by ~0.5 mm from this depth reduced the peak-to-peak response amplitude. 

      (2) The emphasis in the Introduction that HFS provides direct insight into deficits seen in patients with cerebellar disease or injury is a bit overstated. Patients have very diverse etiologies, only a modest number of which might be faithfully mimicked by SCP HFS. I would suggest some text acknowledging that this is only a limited model for cerebellar disease or injury.

      We agree with the reviewer that the high-frequency stimulation of the superior cerebellar peduncle provides a limited model that does not fully replicate the diverse pathologies seen in cerebellar disease or injury. In fact, in the introduction section (lines 53-59 of our revised manuscript) we have mentioned that the discrepancy in the conclusions of various clinical studies may reflect the heterogeneity of the individuals with cerebellar lesions who often have differences in lesion etiology and associated damage beyond the cerebellum itself. While this may preclude the generalization of our findings to the wider clinical population per se, our approach offers a precise and controlled method to investigate the immediate and adaptive changes in motor behavior following the disruption of cerebellar signals.

      (3) Do animals with HFS show less decomposition and trajectory variability in their slower movements when compared to their faster movements? Comparisons are only made with velocity-matched control blocks, but the comparison of slower vs. faster reaches during HFS blocks would also be informative.

      To address this point, we classified movements during cerebellar block as either slow or fast based on the median peak hand velocity of the cerebellar block trials per target per session. We then computed the decomposition index and trajectory variability for the fast and slow movements during cerebellar block relative to control in the same way as in Figure 5 of our manuscript (i.e., the percentage change relative to control). Our analysis revealed significantly lower movement decomposition (p < 0.001) and reduced trajectory variability (p < 0.001) for slower movements compared to faster ones within the cerebellar block condition (Author response image 3).

      Author response image 3.

      Effect of slow and fast movements during cerebellar block on movement decomposition and trajectory variability. [LEFT] Change in decomposition index (i.e., the proportion of the movement time during which the movement was decomposed) for slow and fast cerebellar block trials relative to all control trials. The change in median decomposition was computed per session per target and then averaged across all eight targets to arrive at one value per session. The depicted values are the mean ± 95% confidence intervals across all sessions pooled from all four monkeys. The individual means of each monkey are overlaid. [RIGHT] Change in inter-trial trajectory variability for slow and fast cerebellar block trials relative to all control trials. The trajectory variability was measured as the standard deviation of the maximum perpendicular distance of the trajectories from the Y-axis after transforming them as in Figure 5d of the main text. The change in trajectory variability for the fast and slow cerebellar block trials was then computed per session per target and averaged across all eight targets to arrive at one value per session. The depicted values are the mean ± 95% confidence intervals across all sessions pooled from all four monkeys. The individual means of each monkey are overlaid. Statistical significance is denoted as follows: NS, p ≥ 0.05; *, p < 0.05; **, p < 0.01; ***, p < 0.001. [Cbl: Cerebellar block].
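      The slow/fast classification and the percentage change relative to control described in this response can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, trials exactly at the median are assigned to the slow group (the response does not specify a tie rule), and the median is used as the summary statistic, as the caption states for the decomposition index.

```python
import numpy as np


def split_by_median_speed(peak_velocity):
    """Label trials of one target in one session as slow or fast.

    Assumption: trials at or below the median peak velocity count as slow.
    Returns two boolean masks (slow, fast) over the input trials.
    """
    v = np.asarray(peak_velocity, dtype=float)
    slow = v <= np.median(v)
    return slow, ~slow


def percent_change(block_values, control_values):
    """Percentage change of the block median relative to the control median."""
    control = np.median(control_values)
    return 100.0 * (np.median(block_values) - control) / control
```

      In use, the masks would be applied to the per-trial decomposition index of the cerebellar block trials for each target and session, `percent_change` would be evaluated against all control trials for that target and session, and the resulting values would be averaged across the eight targets to give one value per session.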

      (4) Line 220- 'velocity' should be 'speed' or 'absolute velocity'?

      The term ‘velocity’ was changed to ‘speed’ in the revised manuscript (line 255).

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This study presents a valuable finding on the mechanism to promote distant metastasis in breast cancer. The evidence supporting the claims of the authors is convincing. The work will be of interest to medical biologists working on breast cancer.

      Public Reviews:

      Reviewer #1 (Public Review):

      Strengths

      The paper has shown the expression of RGS10 is related to the molecular subtype, distant metastasis, and survival status of breast cancer. The study utilizes bioinformatic analyses, human tissue samples, and in vitro and in vivo experiments which strengthen the data. RGS10 was validated to inhibit EMT through a novel mechanism dependent on LCN2 and miR-539-5p, thereby reducing cancer cell proliferation, colony formation, invasion, and migration. The study elaborated the function of RGS10 in influencing the prognosis and biological behavior which could be considered as a potential drug target in breast cancer.

      Weakness

      The mechanism by which the miR-539-5p/RGS10/LCN2 axis may be related to the prognosis of cancer patients still needs to be elucidated. In addition, the sample size used is relatively limited. Especially, if further exploration of the related pathways and mechanisms of LCN2 can be carried out by using organoid models, as well as the potential of RGS10 as a biomarker for further clinical translation to verify its therapeutic target effect, which will make the data more convincing.

      Answer: Thank you for your comments and suggestions. In future research, we will utilize large clinical cohorts and organoid models to further explore relevant research mechanisms.

      Reviewer #2 (Public Review):

      Liu et al., by focusing on the regulator of G protein signaling 10 (RGS10), reported that RGS10 expression was significantly lower in patients with breast cancer, compared with normal adjacent tissue. Genetic inhibition of RGS10 caused epithelial-mesenchymal transition and enhanced cell proliferation, migration, and invasion. These results suggest an inhibitory role of RGS10 in tumor metastasis. Furthermore, bioinformatic analyses determined signaling cascades for RGS10-mediated breast cancer distant metastasis. More importantly, both in vitro and in vivo studies evidenced that alteration of RGS10 expression by modulating its upstream regulator miR-539-5p affects breast cancer metastasis. Altogether, these findings provide insight into the pathogenesis of breast tumors and hence identify potential therapeutic targets in breast cancer.

      The conclusions of this study are mostly well supported by data. However, there is a weakness in the study that needs to be clarified.

      In Figure 2A, although some references supported that SKBR3 and MCF-7 possess poorly aggressive and less invasive abilities, examining only RGS10 expression in those cells, it could not be concluded that 'RGS10 acts as a tumor suppressor in breast cancer'. It would be better to introduce a horizontal comparison of the invasive ability of these 3 types of cells using an invasion assay.

      Answer: Thank you for your comments and suggestions. MDA-MB-231, SKBR3, and MCF-7 originate from triple-negative breast cancer (high invasiveness), Her-2 receptor-overexpressing breast cancer (relatively weak invasiveness), and luminal-type breast cancer (relatively weak invasiveness), respectively. Previous studies have demonstrated the invasive ability of these 3 types of cells (PMID: 34390568).

      Reviewer #3 (Public Review):

      Distant metastasis is the major cause of death in patients with breast cancer. In this manuscript, Liu et al. show that RGS10 deficiency elicits distant metastasis via epithelial-mesenchymal transition in breast cancer. As a prognostic indicator of breast cancer, RGS10 regulates the progress of breast cancer and affects tumor phenotypes such as epithelial-mesenchymal transformation, invasion, and migration. The conclusions of this paper are mostly well supported by data, but some analyses need to be clarified.

      (1) Because diverse biomarkers have been identified for EMT, it is recommended to declare the advantages of using RGS10 as an EMT marker.

      Answer: Thank you for your comments. Dysregulation of RGS protein expression has been observed in various types of cancer (PMID: 26293348). Previous studies have shown that RGS10 knockdown can lead to resistance of ovarian cancer cells to the chemotherapy agents paclitaxel, cisplatin, and vincristine. In colorectal tumors, the transcription of RGS10 is regulated by DNA methylation and histone deacetylation. As a key regulatory factor in the G protein signaling pathway, RGS10 is involved in processes relevant to tumor development, including survival, polarization, adhesion, chemotaxis, and differentiation. These findings suggest that RGS10 might be a marker for EMT in breast cancer.

      (2) The authors utilized databases to study the upstream regulatory mechanisms of RGS10. It is recommended to clarify why the authors focused on miRNAs rather than other epigenetic modifications.

      Answer: Thank you for your comments. miRNAs are short non-coding RNA molecules that bind to the 3' untranslated region (3' UTR) of a target mRNA to cause mRNA degradation or translation inhibition, thus regulating gene expression in cells. These small molecules play a crucial role in regulating the expression of cancer-related genes and can act as tumor promoters or tumor suppressors. To further elucidate the molecular mechanism by which RGS10 influences the malignant biological behavior of breast cancer cells, we verified through bioinformatics prediction and in vitro experiments that miR-539-5p might be an upstream regulator of RGS10.

      (3) The role of miR-539-5p in breast cancer has been described in previous studies. Hence, it is recommended to provide detailed elaboration on how miR-539-5p regulates the expression of RGS10.

      Answer: Thank you for your comments. To verify the effect of miR-539-5p on the expression of RGS10, we transfected miR-539-5p mimic, miR-539-5p mimic NC, miR-539-5p inhibitor, and miR-539-5p inhibitor NC into SKBR3 cells and MDA-MB-231 cells, respectively, and verified the expression of RGS10 through RT-qPCR and Western blot experiments. The results showed that, compared with SKBR3 cells transfected with miR-539-5p mimic NC or wild-type SKBR3 cells, RGS10 mRNA and protein levels were significantly reduced in SKBR3 cells transfected with the miR-539-5p mimic. Conversely, after MDA-MB-231 cells were transfected with miR-539-5p inhibitor to inhibit the expression of miR-539-5p, RGS10 mRNA and protein levels in MDA-MB-231 cells were significantly increased (Fig. 3.4A-C, Fig. 3.5A-C). This indicates that miR-539-5p can target and regulate RGS10.

      (4) To enhance the clarity and interpretability of the Western blot results, it would be advisable to mark the specific kilodalton (kDa) values of the proteins.

      Answer: Thank you for your comments and suggestions. We have now marked the specific kilodalton (kDa) values of the proteins in the Western blots.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      The function of RGS10 in breast cancer was identified in the paper. However, some major issues in this paper need to be specified:

      (1) From reading the introduction section and its references, RGS proteins participate in multiple essential cellular processes and may be tumor initiators or suppressors (Li et al., 2023). This article focuses on the significance of RGS10 in breast cancer, it is recommended to show how the function of RGS10 exhibits therapeutic significance in other types of cancer.

      Answer: Thanks for your comments and suggestions on our findings. Dysregulation of RGS protein expression has been observed in various types of cancer, especially ovarian cancer (PMID: 26293348). RGS10 expression in ovarian cancer cells has been found to be lower than in normal ovarian cells (PMID: 21044322). In addition, knocking down RGS10 can enhance the viability of ovarian cancer cells and promote chemoresistance by activating the Rheb GTP/mTOR signaling pathway (PMID: 26319900). One study suggests that RGS10 regulates inflammatory signaling in SKOV-3 ovarian cancer cells, which show high expression of TNF and COX-2 after RGS10 knockdown. In colorectal tumors, RGS10 transcription is regulated by DNA methylation and histone deacetylation (PMID: 35810565). RGS10 expression is also associated with poor prognosis in laryngeal cancer, hepatocellular carcinoma, and pediatric acute myeloid leukemia (PMID: 32776811, PMID: 26516143, PMID: 30538250).

      (2) The authors characterize RGS10 protein expression in the breast cancer cell lines MDA-MB-231, MCF7, and SKBR3 in vitro Figure 2A. However, more information would strengthen the data - e.g. information on the expression of RGS10 protein and the survival in public databases, as well as the correlation between RGS10 and Her-2 expression.

      Answer: Thanks for your comments. We have checked the correlation between RGS10 expression and the survival rate of Her-2 positive breast cancer patients in a public database. Although the difference was not statistically significant, patients with high RGS10 expression tended to have a more favorable prognosis than patients with low RGS10 expression after the 100th month.

      Author response image 1.

      (3) Regarding the current situation of clinical trials in the RGS family, the potential to develop RGS 10 for clinic translation is a driving factor for EMT.

      Answer: Thank you for your comments. The RGS (regulator of G protein signaling) gene family provides an important "braking" function for the G protein-coupled receptor (GPCR) family of cell receptors. GPCRs control hundreds of important functions in cells throughout the body and are the largest class of drug targets, with over one-third of FDA-approved drugs treating diseases by binding to GPCRs and altering their activity. When GPCRs are activated by hormones or neurotransmitters, they initiate signaling cascades within host cells through signal-carrying proteins called G proteins. The function of RGS proteins is to inactivate the G protein, thereby shutting down this signaling cascade, which limits G protein signal transduction and allows cells to reset and receive new incoming signals. Without RGS proteins, the signals triggered by GPCRs would inappropriately remain on, and signal transduction would become dysfunctional (PMID: 33007266). Developing RGS10, as a driving factor of EMT, is therefore meaningful for clinical translation.

      (4) In Figure 3A, the paper showed that differential gene expression revealed 70 genes were significantly upregulated in RGS10-depleted SKBR3 cells. The authors didn't show any data on the expression of other EMT-related proteins in pathway analysis.

      Answer: Thank you for your comments. The enrichment analysis of RNA sequencing in RGS10-depleted SKBR3 cells points to factors that are highly correlated with EMT, such as TAGLN, TNFSF10, NDUFA4L2, CCN5, PHGDH, ST3GAL5, ANG, and LCN2.

      (5) In Figure 3B, the paper focuses on LCN2 in pathway analysis; however, the authors did not elaborate on the significance of LCN2-related pathways in EMT.

      Answer: Thank you for your comments. Some studies have demonstrated the significance of LCN2-related pathways in EMT. It was confirmed that LCN2 upregulation triggered by PTEN insufficiency induces EMT to promote migration and invasion in MCF7 cells (PMID: 27466505). The activation of STAT3 contributes to an increase in LCN2 expression, which activates ERK pathway-dependent EMT, thus promoting lung metastasis of MDA-MB-231 breast cancer cells (PMID: 33473115). The silencing of LCN2 reduced the migration and invasion ability of SUM149 cells and the proportion of tumor stem cells, suggesting that LCN2 may mediate the invasion and metastasis of cancer cells by regulating the stemness of breast cancer cells. The biological effects of the LCN2 small-molecule inhibitors ZINC00640089 and ZINC00784494 on IBC cells have been confirmed. The siRNA-mediated silencing of LCN2 in IBC cells significantly reduces cell proliferation, viability, migration, and invasion (PMID: 34445288).

      (6) Minor: the author did not conduct a semi-quantitative analysis of the immunohistochemical results of RGS10.

      Answer: Thank you for your suggestion. We prefer to present a qualitative analysis of RGS10 immunohistochemistry; a semi-quantitative analysis is not required for the paper.

      Reviewer #2 (Recommendations For The Authors):

      The role of RGS10 was well-characterized in this study. However, some minor points need to be modified.

      (1) Page 15 line 296, description of cell proliferation was missing, please modify.

      Answer: Thank you for your comments. We have corrected the description of cell proliferation on Page 15 highlighted in red.

      (2) In Figure 2C, the title of the Y-axis was missing.

      Answer: Thank you for your comments. We have corrected the description of the Y-axis title in Figure 2C.

      (3) Describe the transfection reagent that was used in this study, and incorporate it into the methods section.

      Answer: Thank you for your comments. We have added the description of the transfection reagent to the methods section.

      (4) The manuscript needs proofreading.

      Answer: Thank you for your comments. We have proofread the manuscript.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Recommendations for the authors:

      We sincerely value the insightful and constructive feedback (italicized) provided by the reviewers, which has been instrumental in identifying areas of our manuscript that required further clarification or amendment. In response to these valuable comments, we have significantly revised the manuscript to enhance clarity and accuracy. Specifically, we have corrected an oversight related to the robot’s velocity and secondary antibody ratios, and addressed previously missing values in Figs. 3E and 4E. Importantly, these corrections did not alter the outcomes of our results. Additionally, we have enriched our manuscript with new data analyses, as reflected in Figures 1B, 1F, 2H-J, 4D, 4F-H, S1A, S1C-E, S3H, S5, and Table 1, ensuring a more comprehensive presentation of our findings. Below are our responses detailing each comment and explaining the modifications integrated into the revised manuscript.

      Reviewer 1:

      (1) To address the question of whether PAG photostimulation biases the cells that respond to the robot, a counterbalanced experiment, in which the BLA activity is initially recorded during the foraging vs. robot test and the PAG stimulation happens at the end of the session, should have been performed.

      In our study, we investigated fear behavior and BLA cell responses to intrinsic dPAG photostimulation (320 pulses) in naïve animals, followed by their reactions to an extrinsic predatory robot. We recognize the reviewer's concern regarding the potential influence of initial dPAG photostimulation on BLA neuron responses to the robot. We address this issue in our discussion (pg. 13) as follows: “However, it is crucial to consider the recent discovery that optogenetic stimulation of CA3 neurons (3000 pulses) leads to gain-of-function changes in CA3-CA3 recurrent (monosynaptic) excitatory synapses (Oishi et al., 2019). Although there is no direct connection between dPAG neurons and the BLA (Vianna and Brandao 2003, McNally, Johansen, and Blair 2011, Cameron et al. 1995), and no studies have yet demonstrated gain-of-function changes in polysynaptic pathways to our knowledge, the potential for our dPAG photostimulation (320 pulses) to induce similar changes in amygdalar neurons, thereby enhancing their sensitivity to predatory threats, cannot be dismissed.”

      (2) In Figure 3, it is unclear which criteria (e.g. response latency, minimum Z score, spike fidelity) were used to identify the BLA neurons that were indirectly activated by PAG stimulation. A graphic containing at least the distribution of the response latencies for each BLA neuron after PAG laser activation is needed.

      We have specified the criteria for determining the responsiveness of BLA neurons to dPAG stimulation on page 22. This involves analyzing the first 500-ms post-stimulation across five 0.1-s bins. Units were classified as ‘stim cells’ if they showed z-scores greater than 3 (z > 3) in any of the bins during the initial 500-ms period post-stimulation. Neurons activated by both pellet procurement and dPAG stimulation were not included in the 'stim cell' category. Additionally, we have included a graphic in the revised manuscript (Fig. S3C) that presents the distribution of response latencies of BLA neurons to dPAG stimulation.
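      The classification rule stated above can be illustrated with a minimal sketch. It assumes the firing of each unit has already been z-scored in 0.1-s bins aligned to stimulation onset; the function name and the `pellet_responsive` flag are hypothetical and are not taken from the authors' analysis code.

```python
import numpy as np

N_POST_BINS = 5      # five 0.1-s bins = first 500 ms post-stimulation (from the response)
Z_THRESHOLD = 3.0    # z > 3 criterion (from the response)

def is_stim_cell(post_z, pellet_responsive=False):
    """Return True if the unit meets the 'stim cell' criterion.

    post_z: z-scored firing in 0.1-s bins, starting at stimulation onset
            (assumed to be computed beforehand).
    pellet_responsive: hypothetical flag; units also activated by pellet
            procurement are excluded from the category, as stated in the text.
    """
    window = np.asarray(post_z, dtype=float)[:N_POST_BINS]
    if pellet_responsive:
        return False
    # Any single bin above threshold within the 500-ms window is sufficient.
    return bool(np.any(window > Z_THRESHOLD))
```

      Under this sketch, a unit with a z-score above 3 only after the fifth bin would not be counted, because the criterion is restricted to the initial 500 ms.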

      (3) To strengthen the claim that it is a BLA-PVT-PAG circuit that carries information about predatory threat, a new experiment using CTB and cFos could be used to demonstrate that PAG neurons that project to PVT are recruited during the robot exposure.

      Our study primarily aimed to explore the transmission of threat signals between the dPAG and BLA. We acknowledge that our evidence for the PVT’s intermediary role, derived from CTB injections in the BLA and subsequent CTB+cFos co-labeling analysis in the PVT (Fig. 4G and 4H), is limited. Accordingly, we have moderated the emphasis on the PVT’s involvement in both the abstract and introduction. We now present the PVT’s role as a promising direction for future research in the discussion section of our revised manuscript.

      (4) In Fig 2, the authors' interpretation is that photostimulation of PAG neurons elicits fleeing responses in the rats. However, there is a vast literature demonstrating that the PAG is also involved in nociception. Although this is recognized by the authors in the first part of the introduction and briefly described in the discussion, the authors should more explicitly explain that PAG stimulation produces analgesia and thus is unlikely to underlie the escaping responses observed. This may not be intuitive for a broader audience.

      We appreciate the reviewer's insightful suggestion to elaborate on the PAG involvement in nociception and analgesia, as supported by the literature. While our initial manuscript acknowledged these functions, we have now expanded our discussion to address the PAG’s multifaceted roles (pg. 12): “As mentioned in the introduction, the dPAG is recognized as part of the ascending nociceptive pathway to the BLA (De Oca et al. 1998, Gross and Canteras 2012, Herry and Johansen 2014, Kim, Rison, and Fanselow 1993, Ressler and Maren 2019, Walker and Davis 1997). The dPAG is also implicated in non-opioid analgesia (e.g., Bagley and Ingram 2020, Cannon et al. 1982, Fields 2000). However, it is essential to emphasize that, despite its roles in pain modulation, the primary behavior observed in dPAG-stimulated, naive rats foraging for food in an open arena was goal-directed escape to the safe nest, underscoring the dPAG’s critical function in survival behaviors.” Note that this aligns with human studies on PAG stimulation (e.g., Carrive and Morgan 2012, Magierek et al. 2003), particularly those by Amano et al. (Amano et al. 1982), which reported patients feeling an urge to escape, similar to being chased, upon PAG stimulation.

      (5) To truly demonstrate the functional links between the PAG and BLA, more experiments are needed. For example, one could record from BLA neurons during the robot surge while performing optogenetic inhibition of the PAG neurons. There is also no evidence that activity in the indirect pathway that connects the PAG to the BLA is indispensable for the expression of defensive responses towards the robot (e.g., causality tests using chemogenetic or optogenetic inactivation).

      We agree that incorporating optogenetic inhibition of PAG neurons while simultaneously recording from BLA neurons during a robot surge would strengthen the evidence for the functional connectivity between the PAG and BLA. Such an experiment would necessitate the transfection and photoinhibition of a wide array of dPAG neurons responsive to predatory threats. This procedure is technically more viable in transgenic mouse models, given their suitability for genetic manipulation. In light of this, and in response to the suggestions in the Joint Public Review, we have revised the abstract, introduction, and discussion to offer a more cautious interpretation of our findings. This revision reflects a careful consideration of both the evidence and the limitations inherent in our study (pg. 13): “While our findings demonstrate that opto-stimulation of the dPAG is sufficient to trigger both fleeing behavior and increased BLA activity, we have not established that the dPAG is necessary for the BLA’s response to predatory threats. To establish causality, it is essential to conduct experiments such as optogenetic inhibition to determine whether the dPAG is indispensable for activating BLA neurons and initiating escape behavior in the face of threats. The complexity of targeting the dPAG, which includes its dorsomedial, dorsolateral, lateral, and ventrolateral subdivisions (e.g., Bandler, Carrive, and Zhang 1991, Bandler and Keay 1996, Carrive 1993), suggests the need for future studies using transgenic mouse models. Should inactivation of the dPAG negate the BLA's response to predatory threats, it would underscore the dPAG's central role in this defensive mechanism. Conversely, if BLA responses remain unaffected by dPAG inactivation, this could indicate the existence of multiple pathways for antipredatory defense mechanisms.”

      (6) The manuscript lacks information about the number of rats and trials that were used across the experiments (e.g. Fig 2G-J). On some occasions, the authors start the experiments with a specific number of animals and then reduce the N by half without providing a rationale (e.g. Fig. 3). Equally confusing is the experimental timeline. For example: a) Were the pre-robot, robot, and post-robot sessions always performed within the same day? b) It was described that microdrivable arrays were used, but did the same rats experience the robot test more than one time? c) How many bins were used for normalization during the Z-score calculation and when were the data binned at 100 ms versus 1 s? d) How many trials were used for each analysis? For example, to identify robot cells, did the authors establish a minimum number of trials per animal to calculate the peristimulus time histograms? Having a significant number of trials is critical to make sure that the observed neuronal responses are replicable across the trials. e) How was the neuronal activity related to "pellet retrieval" aligned during robot sessions? Was the activity aligned with the moment in which the rat touches the pellet or when the animal returns to the nest with the pellet? f) How did the authors control for trials in which the rat consumed the pellets in the same location vs. those in which they returned to the nest to eat them? All these points are extremely important for future replicability.

      We apologize for any confusion caused by the initial lack of detail in our experimental procedures. The revised manuscript has been updated with comprehensive methodological details:  

      (i) The study involved thirteen rats (ChR2, n = 9; EYFP, n = 4), subjected to dPAG stimulation using fixed light parameters (473 nm, 20 Hz, 10-ms pulse width, 2 s duration) during Long and Short pellet distance trials (refer to Fig. 2E-G). The stimulation intensity was adjusted to each animal's response (fleeing behavior), ranging from 1-3 mW. Additional testing occurred over multiple days, with incremental adjustments to stimulation parameters (intensity, frequency, duration) after confirming normal baseline foraging behavior (Fig. 2H-J, at x = 0). These details are now clearly depicted in the manuscript.

      (ii) The primary objective was to investigate BLA neuron responses to dPAG opto-stimulation. Six rats were initially tested, with three later assessed for their reactions to dPAG stimulation in the presence of an actual predator, to gauge behavioral effects.

      (iii) Regarding the experimental timeline:

      a) Pre-robot, robot, and post-robot sessions were conducted successively on the same day.

      b) Sessions with the robot predator were repeated until habituation occurred or when unit recordings were deemed invalid due to microdrive limitations or the absence of unit detection. Throughout these sessions, the success rate for pellet retrieval remained consistently low. Specifically, the mean success rate for the dPAG recordings was 2.803 ± 1.311%. For the BLA recordings, animals did not succeed in retrieving pellets during any of the robot trials. To provide a more detailed account of the methodology, the manuscript has been updated to include the number of recording days and the units recorded in the "Behavioral Procedures" section.

      c) As described in Materials and Methods, unit recording data were binned at 0.1-s intervals and normalized against a 5-s pre-event baseline (50 bins). For statistical analyses in Figure 1F’s rightmost column, 1-s bins were used to simplify post-hoc analysis corrections.

      d) Each recording session consisted of 5-15 trials. Trials were excluded if rats attempted to procure the pellet within 10 s post-dPAG stimulation or robot activation, ensuring accurate characterization of unit responsiveness. Consequently, the number of trials varied among subjects.

      e) Pellet retrieval was indicated by the animal entering a designated zone 19 cm from the pellet, driven by hunger.

      f) Animals were trained to retrieve pellets and return to their nest for consumption prior to robot testing sessions, as elaborated in the “Baseline foraging” section.
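      The binning and normalization described in item c) can be sketched as follows. This is only an illustration under stated assumptions: the 0.1-s bin width and the 5-s (50-bin) pre-event baseline come from the response, whereas the 5-s post-event window, the use of the sample standard deviation, and the function name are assumptions made for the example.

```python
import numpy as np

def zscore_peth(spike_times, event_times, bin_s=0.1, baseline_s=5.0, post_s=5.0):
    """Trial-averaged peri-event firing rate, z-scored against the pre-event baseline.

    bin_s and baseline_s follow the response (0.1-s bins, 5-s baseline = 50 bins);
    post_s is an assumed post-event window length.
    """
    spike_times = np.asarray(spike_times, dtype=float)
    n_base = int(round(baseline_s / bin_s))
    n_post = int(round(post_s / bin_s))
    edges = np.linspace(-baseline_s, post_s, n_base + n_post + 1)

    counts = np.zeros(n_base + n_post)
    for t0 in event_times:
        # Align spikes to each event and accumulate counts per bin.
        counts += np.histogram(spike_times - t0, bins=edges)[0]

    rate = counts / (len(event_times) * bin_s)   # spikes/s in each bin
    baseline = rate[:n_base]
    sd = baseline.std(ddof=1)                    # sample SD (assumption)
    if sd == 0:
        raise ValueError("flat baseline: z-score is undefined")
    return (rate - baseline.mean()) / sd
```

      The returned array has one z-score per 0.1-s bin, with the first 50 entries covering the baseline; averaging groups of ten consecutive bins would give the 1-s resolution mentioned for the statistical analyses.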

      (7) In the abstract, the authors mention that predictive cues are ambiguous during naturalistic predatory threats, but it is not clear what they mean by ambiguous. In addition, in the introduction section, the authors describe that the present study will investigate how the dPAG and BLA communicate threat signals. However, the authors should clarify right in the beginning that these two regions are not monosynaptically connected with each other and cite the proper references.

      The abstract’s original sentence, “…where predictive cues are ambiguous and do not afford reiterative trial-and-error learning…” has been refined to “…characterized by less explicit cues and the absence of reiterative trial-and-error learning events …” This adjustment more accurately reflects that cues in natural settings often lack the clear and consistent quality of those in controlled experimental settings, which is necessary for the straightforward process of trial-and-error learning.

      Regarding the dPAG and BLA connectivity, the revised introduction (pg. 5) now states: “Considering the lack of direct monosynaptic projections between dPAG and BLA neurons (Vianna and Brandao 2003, McNally, Johansen, and Blair 2011, Cameron et al. 1995), we utilized anterograde and retrograde tracers in the dPAG and BLA, respectively. This was complemented by c-Fos expression analysis following exposure to predatory threats. Our anatomical findings suggest that the paraventricular nucleus of the thalamus (PVT) may be part of a network that conveys predatory threat information from the dPAG to the BLA.”

      (8) In the introduction section, the authors should clarify that the US information is conveyed from the PAG to BLA via the lateral thalamus (posterior intralaminar nucleus, medial geniculate nucleus) or dorsal midline thalamus (paraventricular nucleus of the thalamus). The statement regarding how "the PAG functions as part of the ascending pain transmission pathway, providing footshock US information to the BLA" is misleading because the PAG does not send monosynaptic projections directly to the BLA.

      The revised text (pg. 3) now reads: “…suggest that the dPAG is part of the ascending US pain transmission pathway to the BLA, the presumed site for CS-US association formation (De Oca et al. 1998, Gross and Canteras 2012, Herry and Johansen 2014, Kim, Rison, and Fanselow 1993, Ressler and Maren 2019, Walker and Davis 1997). This pathway is thought to be mediated through the lateral and dorsal-midline thalamus regions, including the posterior intralaminar nucleus and paraventricular nucleus of the thalamus (Krout and Loewy, 2000; McNally, Johansen, and Blair, 2011; Yeh, Ozawa, and Johansen, 2021; but see Brunzell and Kim, 2001).”

      (9) The authors' assumption that threat information flows from the PAG to the BLA, rather than BLA to PAG, based on electrical stimulation and lesion experiments performed in previous studies is problematic for at least three reasons: a) Electrical stimulation can activate fibers of passage as well as presynaptic neurons antidromically. b) The lesion approach may not have targeted 100% of the neurons in PAG, which extends anatomically along the antero-posterior axis of the midbrain for several millimeters in rats. This observation also disagrees with more recent studies using optogenetics and imaging tools demonstrating that the PAG is the downstream target of the BLA-CeA pathway. c) The authors cited prior reports describing the role of the amygdala-PAG pathway in dampening the US response and providing a negative signal to the PAG. However, a series of previous studies demonstrating that the PAG serves as the downstream target of the central nucleus of the amygdala for the expression of defensive response are completely ignored by the authors. Here are just some examples: Massi et al, 2023, PMID: 36652513; Tovote et al 2016, PMID: 27279213; Penzo et al, 2014 PMID: 24523533.

      We recognize the complexities in interpreting findings from electrical stimulation and lesion studies. Our prior work (Kim et al. 2013) supports the conclusion that predatory threat information directionally flows from the dPAG to the BLA, as evidenced by distinct behavioral outcomes from experimental manipulations of dPAG and BLA. Specifically, dPAG stimulation-induced fleeing behavior was blocked by BLA lesions (as well as muscimol inactivation), whereas BLA stimulation-induced fleeing was unaffected by dPAG or combined dPAG+vPAG lesions (refer to Fig. 5A), suggesting a flow from dPAG to BLA. Our manuscript further clarifies that dPAG optostimulation results confirmed that escape behavior in foraging rats, induced by dPAG electrical stimulation (Kim et al. 2013), was activated by intrinsic dPAG neurons rather than by fibers of passage or current spread to other brain regions.

      Furthermore, the PAG’s anatomical and functional diversity, with distinct segments along its longitudinal axis associated with different defensive behaviors, reinforces our conclusions. The dPAG is implicated in flight responses, while the vPAG is associated with freezing behavior (e.g., Bandler and Shipley 1994, Kim, Rison, and Fanselow 1993, Lefler, Campagner, and Branco 2020, Morgan, Whitney, and Gold 1998). The studies referenced in the critique primarily focus on the BLA-CeA-vPAG circuit's role in freezing during Pavlovian fear conditioning, contrasting with our emphasis on the dPAG-PVT-BLA circuit and its mediation of escape behavior in response to naturalistic predatory threats.

      We also note that different invasive procedures can yield varying behavioral outcomes. For example, both acute (e.g., optogenetic and muscimol inactivation) and chronic (e.g., surgical ablation) manipulations within the same brain circuit have shown diverse effects across species (Otchy et al. 2015). Moreover, optogenetics comes with its own set of conceptual and technical challenges (Adamantidis et al. 2015), including the difficulty of targeting, quantifying and photo-inhibiting 100% of PAG neurons. Despite the limitations of each technique, our collective evidence from lesions, inactivation, electrical stimulation (Kim et al. 2013), optostimulation, and single-unit recordings (the present study) supports the premise that the dPAG acts upstream of the BLA in processing predatory threat information.

      (10) In the discussion, the authors suggest that the PVT may be the interface between the PAG and the BLA for the expression of antipredatory defensive behavior during their foraging vs. robot test, but previous studies looking at the role of PVT in antipredator defensive behavior and/or approach-avoidance conflict tasks are not cited and discussed in the manuscript (Engelke et al, 2021, PMID: 33947849; Choi et al 2019, PMID: 30979815; Choi and McNally 2017, PMID: 28193686).

      We thank the reviewer for pointing out these pivotal studies, which we have carefully reviewed and integrated into the revised manuscript (pg. 14): “These results, in conjunction with previous research on the roles of the dPAG, PVT, and BLA in producing flight behaviors in naïve rats (Choi and Kim 2010, Daviu et al. 2020, Deng, Xiao, and Wang 2016, Kim et al. 2013, Kim et al. 2018, Kong et al. 2021, Ma et al. 2021, Reis et al. 2021), the anterior PVT’s involvement in cat odor-induced avoidance behavior (Engelke et al. 2021), and the PVT’s regulation of behaviors motivated by both appetitive and aversive stimuli (Choi and McNally 2017, Choi et al. 2019), suggest the involvement of the dPAG→PVT→BLA pathways in antipredatory defensive mechanisms, particularly as rats leave the safety of the nest to forage in an open arena (Figure 4I) (Reis et al. 2023).”

      (11) The authors use the expression "looming robot predator" in many cases throughout the manuscript. However, it is unclear whether the defensive responses observed in the rats are elicited by the looming stimulus produced by the movement of the robot towards the rats. The authors describe that rats do not respond to a stationary robot, but would the sound produced by the movement of the robot elicit defensive responses? Would non-approaching lateral or dorsoventral movements (not associated with looming) be sufficient to induce defensive behavior in the rats? There is a vast literature in the field about defensive behaviors induced by looming stimuli. The authors should empirically demonstrate that the escaping responses induced by the robot are mediated by looming or refrain from using the looming terminology to avoid confusion.

      Our use of "looming robot predator" is based on empirical evidence from a prior parametric study, which identified the forward, or 'looming,' motion of the Robogator as the key stimulus eliciting a flight response in rats (Kim, Choi, and Lee 2016). This reaction significantly decreased when the robot moved backward from the same starting position, producing a similar sound, and was absent when the robot remained stationary. This suggests that neither sound alone nor the mere presence of a novel object provokes goal-directed escape behavior (Kong et al. 2021). This aligns with studies indicating that simulated looming stimuli, like an expanding disk, induce flight or freezing responses in mice (De Franceschi et al. 2016, Yilmaz and Meister 2013).

      It should be noted that the 2013 study by Yilmaz & Meister (Yilmaz and Meister 2013) on the looming disk paradigm showed that not all mice responded to the stimuli (e.g., Figs. 2A and 3A), with those that did exhibiting rapid habituation by the second exposure. This contrasts with our predatory robot paradigm (Choi and Kim 2010), where all rats consistently fled from the looming robotic predator across multiple trials, underscoring the critical role of looming motion in simulating predator attacks that trigger flight behavior in rats.

      Thus, the term "looming" accurately captures the nature of the robot's movement and its effect on eliciting defensive responses in rats. Nonetheless, should the editors agree with the reviewer's suggestion to minimize potential confusion, we are willing to substitute "looming" with "approaching," although we consider the terms to be synonymous in the context of our study.

      (12) If the authors are citing the Rescorla-Wagner model, they should include at least one additional sentence to explain it, as many people in the field are not familiar with this model.

      In response to the request for clarification on the Rescorla-Wagner model, we have added an explanatory sentence (pg. 4): “Fundamentally, the negative feedback circuit between the amygdala and the dPAG serves as a biological implementation of the Rescorla–Wagner (1972) model, a foundational theory of associative learning that emphasizes the importance of prediction errors in reinforcement (i.e., US), as applied to FC (Fanselow 1998).”
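      For readers unfamiliar with the model, its core update rule can be sketched in a few lines. In its standard single-cue form, the change in associative strength on each CS-US pairing is ΔV = αβ(λ − V), where λ is the maximum strength the US supports and (λ − V) is the prediction error. The parameter values and the function name below are illustrative assumptions and do not come from the manuscript.

```python
def rescorla_wagner(n_trials, alpha_beta=0.3, lam=1.0, v0=0.0):
    """Associative strength V after each CS-US pairing (single-cue case).

    alpha_beta: combined learning-rate term (CS salience x US rate); illustrative value.
    lam:        asymptote of learning supported by the US; illustrative value.
    """
    v = v0
    history = []
    for _ in range(n_trials):
        prediction_error = lam - v        # how surprising the US still is
        v += alpha_beta * prediction_error
        history.append(v)
    return history
```

      As V approaches λ, the prediction error shrinks toward zero and learning slows, which is the property that the proposed amygdala-dPAG negative feedback circuit is thought to implement.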

      (13) The authors need to include the normality test used to determine whether a parametric or non-parametric statistical analysis was the most appropriate test for each experiment.

      We have included the outcomes of the normality tests, detailed in Table S1.

      (14) In Fig. 1F, the authors show a representative PAG neuron with peristimulus-time histogram and rasters reaching frequencies higher than 100 Hz and sustained firing rates of >50 Hz following robot activation. The authors should include a firing rate analysis (e.g., average firing rate and maximum firing rate before and after robot activation) of the 22 robot-responsive PAG neurons recorded during the session to clarify whether this high firing rate, which is atypical in other brain regions, is commonly observed in the PAG. Showing the isolated waveforms of some representative neurons would help to clarify whether the activity is being recorded from a single-isolated unit instead of multiple units within the same channel.

      In response to the critique, we have expanded our analysis to include both average and maximum firing rates before and after robot activation for the 22 robot-responsive PAG neurons. This detailed firing rate analysis, illustrating their distribution, has been incorporated into the revised manuscript (refer to Figure S1C and S1D). Furthermore, to alleviate concerns regarding the identification of single-unit activity versus potential multi-unit recordings, we have included peri-event raster plots and waveforms for two additional representative neurons in Figure 1F.

      (15) In Figure 2, the authors should indicate when the recordings are performed on anesthetized vs. freely-moving awake animals.

      In the original manuscript, we specified that the optrode recordings depicted in Figure 2B were conducted on anesthetized rats. To enhance clarity and directly address the critique, we have now clearly indicated this condition in Figure 2A as well.

      (16) The optogenetic stimulation parameters used in Fig 2H indicate that 0.5 mW was sufficient to induce behavioral changes. This is surprising because most optogenetic experiments in the field use much higher intensities (> 5mW). If much lower intensities are sufficient to drive PAG-mediated behaviors, this may be a very important observation that should be conveyed to the field. I recommend the authors clarify if they in fact used 0.5 mW and then discuss that the laser intensity used in the experiments was 10X lower than that required for other brain regions.

      In our study, we indeed observed that 0.5 mW of dPAG stimulation increased the latency to procure the pellet without completely preventing the action. Notably, at 1 mW, more than half of the animals (n = 5/9 rats; Fig. 2H) and at 3 mW, all rats (9/9) failed to procure the pellet and fled from the foraging area to the nest (Fig. 2G). These results indicate that even lower intensities were sufficient to elicit behavioral changes through dPAG stimulation in a large foraging arena, highlighting the dPAG's sensitivity to optogenetic manipulation. This finding is consistent with our earlier research on dPAG electrical stimulation, which required significantly lower intensities to provoke defensive behaviors compared to the BLA. Specifically, the stimulation intensity needed for aversive behavior in the dPAG was substantially lower (dPAG: 65.0 ± 6.85 µA) than for the BLA (BLA: 275.0 ± 24.44 µA) (Kim et al. 2013). Furthermore, Deng et al. (Deng, Xiao, and Wang 2016) showed that 1 mW of blue light could elicit a 60% freezing response, with 2 mW triggering flight behavior within a latency of 0.6 seconds.

      (17) In Fig 2 G-J, how many animals are being used per group and how was the sequence of the experiments performed? This is very important for replicability.

      A total of three rats were utilized for the robot testing experiments depicted in Fig. 2 G-J. The experimental sequence for these animals consisted of successive pre-stimulation, stimulation, post-stimulation, and robot sessions. We have updated the manuscript to include this information.

      (18) For the photostimulation of PAG neurons in Figs. 2 and 3, the authors need to clarify if the same parameters of laser stimulation used during the anesthetized recordings were also used during the behavioral tests. Also, the wavelength corresponding to the blue laser should be 473 nm instead of 437 nm.

      We thank the reviewer for identifying the error. We confirm that the opto-stimulation parameters (473 nm, 10-ms pulse width, 2 s duration) were consistently applied across both anesthetized recordings and behavioral tests. This consistency has been explicitly stated in the revised manuscript to ensure clarity regarding our experimental approach.
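      As a quick check on what these settings imply, the pulse count and duty cycle of a single stimulation train can be worked out directly. The sketch below combines the 10-ms pulse width and 2-s duration quoted here with the 20 Hz frequency given earlier in this response; the function name is hypothetical, and the calculation assumes evenly spaced pulses.

```python
def train_summary(freq_hz=20.0, pulse_width_s=0.010, train_s=2.0):
    """Pulse count, duty cycle, and total light-on time for one stimulation train.

    Defaults follow the parameters quoted in the response (20 Hz, 10-ms pulses, 2-s train).
    """
    pulses = round(freq_hz * train_s)        # number of pulses in the train
    duty_cycle = freq_hz * pulse_width_s     # fraction of time the light is on
    light_on_s = pulses * pulse_width_s      # cumulative illumination per train
    return pulses, duty_cycle, light_on_s
```

      With the default values, one train comprises 40 pulses at a 20% duty cycle, that is, about 0.4 s of illumination within each 2-s train.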

      (19) In Fig. 3I, how were the representative trials selected? Instead of picking up the most representative trials, the authors should demonstrate the response of the cell during the entire session.

      In response to the critique, we clarify that the color-coded PETH shown in Fig. 3I represents averaged BLA activity across a comprehensive set of trials. This includes 8 pre-stimulation, 10 stimulation, and 8 post-stimulation trials for the robot-activated sessions, with a similar distribution for non-stimulated sessions. This approach was chosen to provide a representative overview of the cell's response throughout the entire session. To address the request for more detailed data, we have added traditional PETHs to the revised manuscript (see Fig. S3H), which depict the cell's response across all trials.

      (20) Fig 4 D should demonstrate a colabeling between the anterograde PAG fibers in the PVT and the retrogradely labeled neurons from BLA instead of PAG fibers only.

      We wish to clarify that Fig. 4D is intended to show the distribution of dPAG terminals within the midline thalamic nuclei, as noted in prior research (Krout and Loewy 2000). Although dPAG terminals are distributed throughout the midline thalamus, our observations have specifically highlighted a notable increase in c-Fos expression within the paraventricular nucleus of the thalamus (PVT) in rats subjected to the robotic predator stimulus, in contrast to those in the foraging-only control condition (Fig. 4E). Addressing the reviewer's point, we direct attention to Fig. 4G, which includes images labeled "Robot-experienced" and "Merge." This figure demonstrates a subset of PVT neurons that were retrogradely labeled with CTB injected into the BLA, anterogradely labeled with AAV injected into the dPAG, and activated (as indicated by c-Fos expression) in response to the robotic predator. This provides specific colabeling evidence between anterograde PAG fibers in the PVT and retrogradely labeled neurons from the BLA, directly addressing the critique.

      (21) The resolution of the c-Fos images is very low, which makes them hard to appreciate.

      We have updated Figs. 4F and 4G with high-resolution versions to ensure the details are more clearly visible. Furthermore, should there be a need for even greater clarity, we are prepared to supply the images as TIFF files, which are known for preserving high image quality.

      Reviewer 2:

      (1) The text is clearly written, and I appreciated the inclusion of interesting citations, such as the one about paintings by cavemen. The authors also do a good job of discussing the underlying theoretical framework and the figures are easy to understand. Although the topic is very interesting, the amount of novel work is somewhat low. Figure 1 shows that dPAG cells are activated by the predator, and this has been shown by many prior reports. Similarly, Figure 2 shows that dPAG activation creates defensive responses, and this too has been shown by many prior reports.

      We appreciate the reviewer’s positive remarks. We acknowledge the rich body of research documenting dPAG neuronal activation by various predator cues such as odors (e.g., fox urine) (Lu et al. 2023), and scenarios involving anesthetized or spontaneously moving rat/cat predators, either physically partitioned or harness-restrained (Bindi et al. 2022, Deng, Xiao, and Wang 2016, Esteban Masferrer et al. 2020). Nevertheless, our study distinguishes itself by examining dPAG neuronal responses to a robotic predator, uniquely designed to replicate consistent looming motions across multiple trials and subjects within an environment that simulates natural foraging conditions, inclusive of a safe nest (cf. Choi and Kim 2010). This approach allowed us not only to reveal the immediate activation of dPAG neurons in response to a rapidly approaching predator but also to explore the consequent fleeing behavior towards safety, thereby providing new insights into the dPAG's role in mediating goal-directed defensive responses in a more ecologically relevant setting. Furthermore, our investigation extends beyond these findings to assess the impact of dPAG activation on BLA neuronal responses and their functional connectivity during predator-prey interactions, offering a fresh perspective on the neural circuits that support survival behaviors in animals when confronted with naturalistic threats.

      (2) The results in Figure 3 are novel and interesting, but the characterization of BLA activity is incomplete. For example, what are the percentages of BLA cells that are inhibited or activated by all major behaviors observed? These behaviors include approach to pellet, escape from robot, freezing, stretch-attend postures, etc. These same analyses should also be added to dPAG activity in Figure 1. How does BLA single cell encoding of these behaviors relate to their responsivity to dPAG stimulation? And, finally, it is unclear what is the significance of BLA correlated synchronous firing. Is the animal more or less likely to be performing certain behaviors when correlated BLA firing occurs?

      Our analysis, as presented in Figs. 3I, 3K, and S3D-F, selectively focused on BLA cell responses during distinct behaviors such as approaching a pellet and escaping from the robot. These behaviors were selected because their precise temporal markers allow for accurate correlation with BLA cell activity, building on the findings of our previous research (Kim et al. 2018, Kong et al. 2021).

      The robot's motion, programmed to advance a fixed distance before retreating to its starting position, is designed to repeatedly elicit foraging, thus facilitating analysis of neural changes during conflict situations involving food approach and predator avoidance. However, this also leads to the rapid diminution of freezing and stretch-attend postures inside the nest as animals quickly adapt to the robot's movement pattern, rendering a time-stamped analysis of these behaviors unfeasible under our experimental conditions. While the inclusion of these behaviors in our analysis would be insightful, especially in extended interaction scenarios where the robot advances to the nest opening and remains before returning in a less predictable manner, such conditions would likely reduce foraging behavior due to increased fear, deviating from our study's primary objective of elucidating the interactions between the dorsal periaqueductal gray (dPAG) and the basolateral amygdala (BLA) functions.

      Regarding the significance of BLA correlated synchronous firing, our findings, particularly in Figures 3M-O and S4, demonstrate significant synchronous activity among BLA neuronal pairs during encounters with the robot, as opposed to pre-stim, stim, and post-stim sessions. This synchrony is notably prominent among neurons responsive to dPAG stimulation, indicating that BLA neurons involved in processing dPAG signals may play a crucial role in enhancing BLA network coherence to effectively manage predatory threat information (pg. 13).
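
      Synchrony between a pair of simultaneously recorded units is commonly assessed with a spike-train cross-correlogram. The sketch below illustrates that general technique only; it is not the procedure used in the manuscript, and the function name `cross_correlogram`, the ±100 ms lag range, and the 10 ms bin size are assumptions made for the example.

```python
import numpy as np

def cross_correlogram(spikes_a, spikes_b, max_lag=0.1, bin_size=0.01):
    """Histogram of spike-time differences (b - a) within +/- max_lag seconds.

    max_lag and bin_size are hypothetical defaults, not values from the study.
    """
    a = np.asarray(spikes_a, dtype=float)
    b = np.sort(np.asarray(spikes_b, dtype=float))
    n_bins = int(round(2 * max_lag / bin_size))
    edges = -max_lag + bin_size * np.arange(n_bins + 1)

    lags = []
    for t in a:
        # Restrict to spikes of unit B that fall inside the lag window of this spike.
        lo = np.searchsorted(b, t - max_lag, side="left")
        hi = np.searchsorted(b, t + max_lag, side="right")
        lags.append(b[lo:hi] - t)
    lags = np.concatenate(lags) if lags else np.array([])

    counts, _ = np.histogram(lags, bins=edges)
    centers = edges[:-1] + bin_size / 2
    return centers, counts
```

      A peak near zero lag that exceeds what is expected from shuffled or jittered spike trains is the usual indication of correlated firing; the significance test itself is not shown here.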

      (3) In Figure 4, the authors identify the PVT as a potential region that can mediate dPAG to BLA communication via anatomical tracing. However, functional assays are missing. For example, if the PVT is inhibited chemogenetically, does this result in a smaller number of BLA cells that are activated by dPAG stimulation? Does activation of the dPAG-PVT or the PVT-BLA projections cause defensive behaviors? Functionally showing that the dPAG-PVT-BLA circuit controls defensive actions would be a major advance in the field and would greatly enhance the significance of this paper. It would also provide an anatomical substrate to support the view that the BLA is downstream of the dPAG, which was first demonstrated by the authors in their elegant 2013 PNAS paper.

      We appreciate the reviewer’s constructive critique and valuable suggestions on the necessity for functional validation of the dPAG-PVT-BLA circuit's involvement in mediating defensive behaviors. In light of these comments, we have carefully considered and included a discussion on the importance of these proposed experiments as a direction for future research in our manuscript revision (also see response to Reviewer 1’s critique #5).

      Our initial work in 2013 (Kim et al. 2013) laid the groundwork for identifying BLA neurons responsive to dPAG stimulation and suggested the PVT as a potential relay in this neural circuit. Recognizing the limitations of our current study, which does not include direct functional assays, we have adjusted our manuscript to convey the speculative aspect of the dPAG-PVT-BLA circuit’s role more accurately. Moreover, we have enriched our discussion by citing relevant studies that lend support to our proposed circuit mechanism. These references serve to place our findings within the broader context of existing research and highlight the imperative for subsequent studies to empirically confirm the functional significance of the dPAG-PVT-BLA pathway in driving defensive behaviors.

      Reviewer 3:

      (1) The Introduction refers to a negative feedback amygdala-dPAG from a study of the Johansen group, but in this case, the authors were referring to the ventrolateral and not the dorsal PAG.

      We thank the reviewer for pointing out the need to distinguish between the dPAG and vPAG regions in our introduction. While Johansen et al. (2010) investigated the roles of PAG (including both dPAG and vPAG regions; see their Supplementary Figs. 4, 5, and 10), the differentiation between their specific contributions to the amygdala's negative feedback mechanism was not explicitly detailed in their initial publication. This distinction was further elaborated upon in later work by the same group (Yeh, Ozawa, and Johansen 2021), which specifically illuminated the dPAG's role in conditioned fear memory formation and its neural pathways to the PVT that influence fear learning. To reflect this nuanced understanding, we have revised our introduction (pg. 3): “In parallel, Johansen et al. (2010) found that pharmacological inhibition of the PAG, encompassing both dPAG and vPAG regions, diminishes the behavioral and neural responses in the amygdala elicited by periorbital shock US, thereby impairing the acquisition of auditory FC.”

      (2) In the experiments recording dPAG in response to the predator threat, the authors mentioned cells activated by the predator threat, referred to as "robot cells." Were these cells inhibited in response to threat?

      In the Results and Materials and Methods sections, we report that 23.4% (22 out of 94) of dPAG neurons, termed “robot cells,” showed a significant increase in firing rates (z > 3) within a latency of less than 500 ms during exposure to the looming robot threat, but not during the pre- and post-robot sessions. These cells are highlighted in Figures 1E-G. In contrast, we identified only a single unit exhibiting a decrease in activity (z-score < -3) in response to the robot threat. Given the overwhelming prevalence of cells with excitatory responses to the threat, our discussions and analyses have primarily centered on these excited cells. Nevertheless, to ensure a full depiction of our observations, we have included data on the inhibited unit in the revised manuscript, specifically in Figure S1E.
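
      The z-score criterion stated above (z > 3 for excitation, z < -3 for inhibition, within 500 ms of robot onset) can be illustrated with a short sketch. It is a minimal example under stated assumptions rather than the authors' code: the function name `classify_unit`, the use of binned spike counts, and the choice of baseline bins are hypothetical; only the ±3 thresholds and the 22-of-94 count come from the text.

```python
import numpy as np

def classify_unit(baseline_counts, response_counts, z_thresh=3.0):
    """Label a unit from binned spike counts.

    baseline_counts : counts in pre-event bins (assumed baseline period)
    response_counts : counts in bins covering the first 500 ms after robot onset
    z_thresh        : 3.0, matching the criterion quoted in the response
    """
    baseline = np.asarray(baseline_counts, dtype=float)
    response = np.asarray(response_counts, dtype=float)
    mu, sd = baseline.mean(), baseline.std(ddof=1)
    if sd == 0:
        return "unclassified"  # z-score undefined without baseline variance
    z = (response - mu) / sd
    if np.any(z > z_thresh):
        return "excited"
    if np.any(z < -z_thresh):
        return "inhibited"
    return "nonresponsive"

# Proportion of robot cells reported in the response: 22 of 94 units.
robot_cell_percent = round(100 * 22 / 94, 1)
```

      With this convention, a unit is tested for excitation first, so a unit whose response bins cross both thresholds would be labeled "excited"; that ordering is a design choice of the sketch.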

      (3) The authors claim that tetrodes were implanted in the dorsal PAG; however, the electrodes' tips shown in the figures are positioned more ventrally in the lateral PAG (see Figures 1B, S5A).

      The PAG is anatomically organized into dorsomedial (dmPAG), dorsolateral (dlPAG), lateral (lPAG), and ventrolateral (vlPAG) columns along the rostro-caudal axis of the aqueduct. The designation "dorsal PAG" (dPAG) traditionally encompasses the dmPAG, dlPAG, and lPAG regions, a classification supported by extensive tract-tracing, neurochemical, and immunohistochemical evidence (e.g., Bandler, Carrive, and Zhang 1991, Bandler and Keay 1996, Carrive 1993). As Bandler and Shipley (1994) summarized, “These findings suggest that what has been traditionally called the 'dorsal PAG' (a collective term for regions dorsal and lateral to the aqueduct), consists of three anatomically distinct longitudinal columns: dorsomedial and lateral columns…and a dorsolateral column…” Similarly, Schenberg et al. (2005) clarified in their review that, “According to this parcellation...the defensive behaviors (freezing, flight or fight) and aversion-related responses (switch-off behavior) were ascribed to the DMPAG, DLPAG, and LPAG (usually named the ‘dorsal’ PAG).” In our study, electrode placements were strictly within these specified dPAG regions. The electrode tip locations depicted in Figures 1B and S5A correspond with the -6.04 mm template (left panel below) from Paxinos & Watson’s atlas (Paxinos and Watson 1998), situated anteriorly to the emergence of the vlPAG (right panel below). To enhance clarification in our manuscript, we provide a detailed definition of the dPAG that includes the dmPAG, dlPAG, and lPAG, and support our electrode placement rationale with references to established literature (pg. 5).

      Author response image 1.

      (4) It would be nice to include a series of observations applying inhibitory tools (i.e., optogenetic photo inhibition) in the dPAG and BLA and see how they affect the behavioral responses in the 'approach food-avoid predator' paradigm. Moreover, it would be interesting to explore how inhibiting the dPAG to PVT pathway influences the flee response during the robot surge.

      We appreciate the suggestion to explore the effects of optogenetic inhibition in the dPAG and BLA on behavioral responses within the 'approach food-avoid predator' paradigm, as well as the potential impact of inhibiting the dPAG to PVT pathway on flee responses during robot surge incidents. As mentioned in our response to Reviewer 1’s critique #5, the application of optogenetic inhibition necessitates transfecting, quantifying, and photoinhibiting a comprehensive set of dPAG neurons activated by predatory threats. This approach is more viable in future studies that can leverage transgenic mouse models for their genetic tractability. Following the Joint Public Review’s recommendations, we have revised our manuscript to ensure a more measured interpretation of our data, carefully balancing the evidence from tracer studies against the limitations of our current methodology.

      Furthermore, referencing Reviewer 1’s critique #9, it is important to consider that various invasive techniques can yield different behavioral outcomes. For instance, research by Olveczky and colleagues (Otchy et al. 2015) demonstrated that acute manipulations (i.e., optogenetic and muscimol inactivation) and chronic surgical ablation of the same brain circuit can produce distinct effects in rats and finches. Despite these methodological constraints, our collective results from lesion, inactivation, electrical stimulation (Kim et al. 2013), optostimulation, and single-unit recording (present) studies cohesively suggest that the dPAG functions upstream of the BLA in processing predatory threat signals.

      (5) The authors should also examine whether 'synaptic' appositions exist between the anterogradely labeled terminals from the dPAG and the double labeled CTB and cFOS neurons in the PVT.

      We appreciate the suggestion to investigate the presence of synaptic appositions, which could potentially offer valuable insights into the synaptic connections and functional interactions within this neural circuit. However, due to the specialized nature of electron microscopy required for these examinations and the extensive resources it entails, this line of inquiry falls beyond the scope of our current study. We hope to address this aspect in future studies, where we can dedicate the necessary resources and expertise to conducting these intricate analyses.

      (6) It is odd to see the projection fields shown in Fig. 4D, where the projection to the PVT looks much sparser compared to other targets in the thalamus and hypothalamus. If the projection to the PVT has such an important function, why does it seem so weak? This should be discussed. Also, because the projection to the PVT seems sparse, the authors should consider alternative paths like the one involving the cuneiform nucleus. The cuneiform nucleus is an important region responding to looming shadows with strong bidirectional links to the dorsolateral periaqueductal gray, providing strong projections to the rostral PVT.

      The perceived scarcity of the dPAG-PVT pathway might not reflect its functional significance accurately. The PVT's small size could make its projections appear less dense in broad anatomical studies. To address this, we have updated Figure 4D with a high-resolution image that offers a detailed view of the PVT region. This enhancement (refer to the updated Fig. 4, bottom) more accurately depicts the projection density within the PVT. It is also critical to consider that the functional impact of neural pathways is not solely dependent on the quantity of projecting neurons. For instance, work by Deisseroth and colleagues (Rajasethupathy et al. 2015) has shown that even relatively sparse monosynaptic projections from the anterior cingulate cortex to the hippocampus can exert significant effects on neural circuit dynamics. Additionally, we have expanded our discussion to consider the potential roles of other circuits, such as the cuneiform nucleus, in driving the behavioral responses observed in our study (pg. 15): “Given the recent significance attributed to the superior colliculus in detecting innate visual threats (Lischinsky and Lin 2019, Wei et al. 2015, Zhou et al. 2019) and the cuneiform nucleus in the directed flight behavior of mice (Bindi et al. 2023, Tsang et al. 2023), further exploration into the communication between these structures and the dPAG-BLA circuitry is warranted.”

      (7) Finally, in the Discussion, it would be nice to comment on how the BLA mediates flee responses. Which pathways are likely involved?

      This excellent suggestion has been incorporated in the discussion (pg. 15): “Future studies will also need to delineate the downstream pathways emanating from the BLA that orchestrate goal-directed flight responses to external predatory threats as well as internal stimulations from the dPAG/BLA circuit. Potential key structures include the dorsal/posterior striatum, which has been associated with avoidance behaviors in response to airpuff in head-fixed mice (Menegas et al. 2018) and flight reactions triggered by auditory looming cues (Li et al. 2021). Additionally, the ventromedial hypothalamus (VMH) has been implicated in flight behaviors in mice, evidenced by responses to the presence of a rat predator (Silva et al. 2013) and upon optogenetic activation of VMH Steroidogenic factor 1 (Kunwar et al. 2015) or the VMH-anterior hypothalamic nucleus pathway (Wang, Chen, and Lin 2015). Investigating the indispensable role of these structures in flight behavior could involve lesion or inactivation studies. Such interventions are anticipated to inhibit flight behaviors elicited by amygdala stimulation and predatory threats, confirming their critical involvement. Conversely, activating these structures in subjects with an inactivated or lesioned amygdala, which would typically inhibit fear responses to external threats (Choi and Kim 2010), is expected to induce fleeing behavior, further elucidating their functional significance.”

      Adamantidis, A., S. Arber, J. S. Bains, E. Bamberg, A. Bonci, G. Buzsaki, J. A. Cardin, R. M. Costa, Y. Dan, Y. Goda, A. M. Graybiel, M. Hausser, P. Hegemann, J. R. Huguenard, T. R. Insel, P. H. Janak, D. Johnston, S. A. Josselyn, C. Koch, A. C. Kreitzer, C. Luscher, R. C. Malenka, G. Miesenbock, G. Nagel, B. Roska, M. J. Schnitzer, K. V. Shenoy, I. Soltesz, S. M. Sternson, R. W. Tsien, R. Y. Tsien, G. G. Turrigiano, K. M. Tye, and R. I. Wilson. 2015. "Optogenetics: 10 years after ChR2 in neurons--views from the community."  Nat Neurosci 18 (9):1202-12. doi: 10.1038/nn.4106.

      Amano, K., T. Tanikawa, H. Kawamura, H. Iseki, M. Notani, H. Kawabatake, T. Shiwaku, T. Suda, H. Demura, and K. Kitamura. 1982. "Endorphins and pain relief. Further observations on electrical stimulation of the lateral part of the periaqueductal gray matter during rostral mesencephalic reticulotomy for pain relief."  Appl Neurophysiol 45 (1-2):123-35.

      Bagley, E. E., and S. L. Ingram. 2020. "Endogenous opioid peptides in the descending pain modulatory circuit."  Neuropharmacology 173:108131. doi: 10.1016/j.neuropharm.2020.108131.

      Bandler, R., P. Carrive, and S. P. Zhang. 1991. "Integration of somatic and autonomic reactions within the midbrain periaqueductal grey: viscerotopic, somatotopic and functional organization."  Prog Brain Res 87:269-305. doi: 10.1016/s0079-6123(08)63056-3.

      Bandler, R., and K. A. Keay. 1996. "Columnar organization in the midbrain periaqueductal gray and the integration of emotional expression."  Prog Brain Res 107:285-300. doi: 10.1016/s0079-6123(08)61871-3.

      Bandler, R., and M. T. Shipley. 1994. "Columnar organization in the midbrain periaqueductal gray: modules for emotional expression?"  Trends Neurosci 17 (9):379-89. doi: 10.1016/0166-2236(94)90047-7.

      Bindi, R. P., C. C. Guimaraes, A. R. de Oliveira, F. F. Melleu, M. A. X. de Lima, M. V. C. Baldo, S. C. Motta, and N. S. Canteras. 2023. "Anatomical and functional study of the cuneiform nucleus: A critical site to organize innate defensive behaviors."  Ann N Y Acad Sci 1521 (1):79-95. doi: 10.1111/nyas.14954.

      Bindi, R. P., R. G. O. Maia, F. Pibiri, M. V. C. Baldo, S. L. Poulter, C. Lever, and N. S. Canteras. 2022. "Neural correlates of distinct levels of predatory threat in dorsal periaqueductal grey neurons."  Eur J Neurosci 55 (6):1504-1518. doi: 10.1111/ejn.15633.

      Cameron, A. A., I. A. Khan, K. N. Westlund, and W. D. Willis. 1995. "The efferent projections of the periaqueductal gray in the rat: a Phaseolus vulgaris-leucoagglutinin study. II. Descending projections."  J Comp Neurol 351 (4):585-601. doi: 10.1002/cne.903510408.

      Cannon, J. T., G. J. Prieto, A. Lee, and J. C. Liebeskind. 1982. "Evidence for opioid and non-opioid forms of stimulation-produced analgesia in the rat."  Brain Res 243 (2):315-21. doi: 10.1016/0006-8993(82)90255-4.

      Carrive, P., and M. M. Morgan. 2012. "Periaqueductal Gray." In The Human Nervous System, edited by J. K. Mai and G. Paxinos, 367-400. London: Academic Press.

      Carrive, P. 1993. "The periaqueductal gray and defensive behavior: functional representation and neuronal organization."  Behav Brain Res 58 (1-2):27-47. doi: 10.1016/0166-4328(93)90088-8.

      Choi, E. A., P. Jean-Richard-Dit-Bressel, C. W. G. Clifford, and G. P. McNally. 2019. "Paraventricular Thalamus Controls Behavior during Motivational Conflict."  J Neurosci 39 (25):4945-4958. doi: 10.1523/JNEUROSCI.2480-18.2019.

      Choi, E. A., and G. P. McNally. 2017. "Paraventricular Thalamus Balances Danger and Reward."  J Neurosci 37 (11):3018-3029. doi: 10.1523/JNEUROSCI.3320-16.2017.

      Choi, J. S., and J. J. Kim. 2010. "Amygdala regulates risk of predation in rats foraging in a dynamic fear environment."  Proc Natl Acad Sci U S A 107 (50):21773-7. doi: 10.1073/pnas.1010079108.

      De Franceschi, G., T. Vivattanasarn, A. B. Saleem, and S. G. Solomon. 2016. "Vision Guides Selection of Freeze or Flight Defense Strategies in Mice."  Curr Biol 26 (16):2150-4. doi: 10.1016/j.cub.2016.06.006.

      De Oca, B. M., J. P. DeCola, S. Maren, and M. S. Fanselow. 1998. "Distinct regions of the periaqueductal gray are involved in the acquisition and expression of defensive responses."  J Neurosci 18 (9):3426-32. doi: 10.1523/JNEUROSCI.18-09-03426.1998.

      Deng, H., X. Xiao, and Z. Wang. 2016. "Periaqueductal Gray Neuronal Activities Underlie Different Aspects of Defensive Behaviors."  J Neurosci 36 (29):7580-8. doi: 10.1523/JNEUROSCI.4425-15.2016.

      Engelke, D. S., X. O. Zhang, J. J. O'Malley, J. A. Fernandez-Leon, S. Li, G. J. Kirouac, M. Beierlein, and F. H. Do-Monte. 2021. "A hypothalamic-thalamostriatal circuit that controls approach-avoidance conflict in rats."  Nat Commun 12 (1):2517. doi: 10.1038/s41467-021-22730-y.

      Esteban Masferrer, M., B. A. Silva, K. Nomoto, S. Q. Lima, and C. T. Gross. 2020. "Differential Encoding of Predator Fear in the Ventromedial Hypothalamus and Periaqueductal Grey."  J Neurosci 40 (48):9283-9292. doi: 10.1523/JNEUROSCI.0761-18.2020.

      Fanselow, M. S. 1998. "Pavlovian conditioning, negative feedback, and blocking: mechanisms that regulate association formation."  Neuron 20 (4):625-7. doi: 10.1016/s0896-6273(00)81002-8.

      Fields, H. L. 2000. "Pain modulation: expectation, opioid analgesia and virtual pain."  Prog Brain Res 122:245-53. doi: 10.1016/s0079-6123(08)62143-3.

      Gross, C. T., and N. S. Canteras. 2012. "The many paths to fear."  Nat Rev Neurosci 13 (9):651-8. doi: 10.1038/nrn3301.

      Herry, C., and J. P. Johansen. 2014. "Encoding of fear learning and memory in distributed neuronal circuits."  Nat Neurosci 17 (12):1644-54. doi: 10.1038/nn.3869.

      Kim, E. J., O. Horovitz, B. A. Pellman, L. M. Tan, Q. Li, G. Richter-Levin, and J. J. Kim. 2013. "Dorsal periaqueductal gray-amygdala pathway conveys both innate and learned fear responses in rats."  Proc Natl Acad Sci U S A 110 (36):14795-800. doi: 10.1073/pnas.1310845110.

      Kim, E. J., M. S. Kong, S. G. Park, S. J. Y. Mizumori, J. Cho, and J. J. Kim. 2018. "Dynamic coding of predatory information between the prelimbic cortex and lateral amygdala in foraging rats."  Sci Adv 4 (4):eaar7328. doi: 10.1126/sciadv.aar7328.

      Kim, J. J., J. S. Choi, and H. J. Lee. 2016. "Foraging in the face of fear: Novel strategies for evaluating amygdala functions in rats." In Living without an amygdala, edited by D. G. Amaral and R. Adolphs, 129-148. The Guilford Press.

      Kim, J. J., R. A. Rison, and M. S. Fanselow. 1993. "Effects of amygdala, hippocampus, and periaqueductal gray lesions on short- and long-term contextual fear."  Behav Neurosci 107 (6):1093-8. doi: 10.1037//0735-7044.107.6.1093.

      Kong, M. S., E. J. Kim, S. Park, L. S. Zweifel, Y. Huh, J. Cho, and J. J. Kim. 2021. "'Fearful-place' coding in the amygdala-hippocampal network."  Elife 10. doi: 10.7554/eLife.72040.

      Krout, K. E., and A. D. Loewy. 2000. "Periaqueductal gray matter projections to midline and intralaminar thalamic nuclei of the rat."  J Comp Neurol 424 (1):111-41. doi: 10.1002/1096-9861(20000814)424:1<111::aid-cne9>3.0.co;2-3.

      Kunwar, P. S., M. Zelikowsky, R. Remedios, H. Cai, M. Yilmaz, M. Meister, and D. J. Anderson. 2015. "Ventromedial hypothalamic neurons control a defensive emotion state."  Elife 4. doi: 10.7554/eLife.06633.

      Lefler, Y., D. Campagner, and T. Branco. 2020. "The role of the periaqueductal gray in escape behavior."  Curr Opin Neurobiol 60:115-121. doi: 10.1016/j.conb.2019.11.014.

      Li, Z., J. X. Wei, G. W. Zhang, J. J. Huang, B. Zingg, X. Wang, H. W. Tao, and L. I. Zhang. 2021. "Corticostriatal control of defense behavior in mice induced by auditory looming cues."  Nat Commun 12 (1):1040. doi: 10.1038/s41467-021-21248-7.

      Lischinsky, J. E., and D. Lin. 2019. "Looming Danger: Unraveling the Circuitry for Predator Threats."  Trends Neurosci 42 (12):841-842. doi: 10.1016/j.tins.2019.10.004.

      Lu, B., P. Fan, M. Li, Y. Wang, W. Liang, G. Yang, F. Mo, Z. Xu, J. Shan, Y. Song, J. Liu, Y. Wu, and X. Cai. 2023. "Detection of neuronal defensive discharge information transmission and characteristics in periaqueductal gray double-subregions using PtNP/PEDOT:PSS modified microelectrode arrays."  Microsyst Nanoeng 9:70. doi: 10.1038/s41378-023-00546-8.

      Magierek, V., P. L. Ramos, N. G. da Silveira-Filho, R. L. Nogueira, and J. Landeira-Fernandez. 2003. "Context fear conditioning inhibits panic-like behavior elicited by electrical stimulation of dorsal periaqueductal gray."  Neuroreport 14 (12):1641-4. doi: 10.1097/00001756-200308260-00020.

      McNally, G. P., J. P. Johansen, and H. T. Blair. 2011. "Placing prediction into the fear circuit."  Trends Neurosci 34 (6):283-92. doi: 10.1016/j.tins.2011.03.005.

      Menegas, W., K. Akiti, R. Amo, N. Uchida, and M. Watabe-Uchida. 2018. "Dopamine neurons projecting to the posterior striatum reinforce avoidance of threatening stimuli."  Nat Neurosci 21 (10):1421-1430. doi: 10.1038/s41593-018-0222-1.

      Morgan, M. M., P. K. Whitney, and M. S. Gold. 1998. "Immobility and flight associated with antinociception produced by activation of the ventral and lateral/dorsal regions of the rat periaqueductal gray."  Brain Res 804 (1):159-66. doi: 10.1016/s0006-8993(98)00669-6.

      Otchy, T. M., S. B. Wolff, J. Y. Rhee, C. Pehlevan, R. Kawai, A. Kempf, S. M. Gobes, and B. P. Olveczky. 2015. "Acute off-target effects of neural circuit manipulations."  Nature 528 (7582):358-63. doi: 10.1038/nature16442.

      Paxinos, G., and C. Watson. 1998. The Rat Brain in Stereotaxic Coordinates. San Diego: Academic Press.

      Rajasethupathy, P., S. Sankaran, J. H. Marshel, C. K. Kim, E. Ferenczi, S. Y. Lee, A. Berndt, C. Ramakrishnan, A. Jaffe, M. Lo, C. Liston, and K. Deisseroth. 2015. "Projections from neocortex mediate top-down control of memory retrieval."  Nature 526 (7575):653-9. doi: 10.1038/nature15389.

      Ressler, R. L., and S. Maren. 2019. "Synaptic encoding of fear memories in the amygdala."  Curr Opin Neurobiol 54:54-59. doi: 10.1016/j.conb.2018.08.012.

      Schenberg, L. C., R. M. Povoa, A. L. Costa, A. V. Caldellas, S. Tufik, and A. S. Bittencourt. 2005. "Functional specializations within the tectum defense systems of the rat."  Neurosci Biobehav Rev 29 (8):1279-98. doi: 10.1016/j.neubiorev.2005.05.006.

      Silva, B. A., C. Mattucci, P. Krzywkowski, E. Murana, A. Illarionova, V. Grinevich, N. S. Canteras, D. Ragozzino, and C. T. Gross. 2013. "Independent hypothalamic circuits for social and predator fear."  Nat Neurosci 16 (12):1731-3. doi: 10.1038/nn.3573.

      Tsang, E., C. Orlandini, R. Sureka, A. H. Crevenna, E. Perlas, I. Prankerd, M. E. Masferrer, and C. T. Gross. 2023. "Induction of flight via midbrain projections to the cuneiform nucleus."  PLoS One 18 (2):e0281464. doi: 10.1371/journal.pone.0281464.

      Vianna, D. M., and M. L. Brandao. 2003. "Anatomical connections of the periaqueductal gray: specific neural substrates for different kinds of fear."  Braz J Med Biol Res 36 (5):557-66. doi: 10.1590/s0100-879x2003000500002.

      Walker, D. L., and M. Davis. 1997. "Involvement of the dorsal periaqueductal gray in the loss of fear-potentiated startle accompanying high footshock training."  Behav Neurosci 111 (4):692-702. doi: 10.1037//0735-7044.111.4.692.

      Wang, L., I. Z. Chen, and D. Lin. 2015. "Collateral pathways from the ventromedial hypothalamus mediate defensive behaviors."  Neuron 85 (6):1344-58. doi: 10.1016/j.neuron.2014.12.025.

      Wei, P., N. Liu, Z. Zhang, X. Liu, Y. Tang, X. He, B. Wu, Z. Zhou, Y. Liu, J. Li, Y. Zhang, X. Zhou, L. Xu, L. Chen, G. Bi, X. Hu, F. Xu, and L. Wang. 2015. "Processing of visually evoked innate fear by a non-canonical thalamic pathway."  Nat Commun 6:6756. doi: 10.1038/ncomms7756.

      Yeh, L. F., T. Ozawa, and J. P. Johansen. 2021. "Functional organization of the midbrain periaqueductal gray for regulating aversive memory formation."  Mol Brain 14 (1):136. doi: 10.1186/s13041-021-00844-0.

      Yilmaz, M., and M. Meister. 2013. "Rapid innate defensive responses of mice to looming visual stimuli."  Curr Biol 23 (20):2011-5. doi: 10.1016/j.cub.2013.08.015.

      Zhou, Z., X. Liu, S. Chen, Z. Zhang, Y. Liu, Q. Montardy, Y. Tang, P. Wei, N. Liu, L. Li, R. Song, J. Lai, X. He, C. Chen, G. Bi, G. Feng, F. Xu, and L. Wang. 2019. "A VTA GABAergic Neural Circuit Mediates Visually Evoked Innate Defensive Responses."  Neuron 103 (3):473-488 e6. doi: 10.1016/j.neuron.2019.05.027.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      SARS-CoV-2 infection induces syncytia formation, which promotes viral transmission. In this paper, the authors aimed to understand how host-derived inflammatory cytokines IL-1α/β combat SARS-CoV-2 infection.

      Strengths:

      First, they used a cell-cell fusion assay developed previously to identify IL-1α/β as the cytokines that inhibit syncytia formation. They co-cultured cells expressing the spike protein and cells expressing ACE2 and found that IL-1β treatment decreased syncytia formation and S2' cleavage.

      Second, they investigated the IL-1 signaling pathway in detail, using knockouts or pharmacological perturbation to understand the signaling proteins responsible for blocking cell fusion. They found that IL-1 prevents cell-cell fusion through MyD88/IRAK/TRAF6 but not TAK1/IKK/NF-κB, as only knocking out MyD88/IRAK/TRAF6 eliminates the inhibitory effect on cell-cell fusion in response to IL-1β. This revealed that the inhibition of cell fusion did not require a transcriptional response and was mediated by IL-1R proximal signaling effectors.

      Third, the authors identified RhoA/ROCK activation by IL-1 as the basis for this inhibition of cell fusion. By visualizing a RhoA biosensor and actin, they found a redistribution of RhoA to the cell periphery and cell-cell junctions after IL-1 stimulation. This triggered the formation of actin bundles at cell-cell junctions, preventing fusion and syncytia formation. The authors confirmed this molecular mechanism by using constitutively active RhoA and an inhibitor of ROCK.

      Diverse cell types and in vivo models were used, and consistent results were shown across them. These results were convincing and well presented.

      Weaknesses:

      As the authors point out in the discussion, whether IL-1-mediated RhoA activation is specific to viral infection or regulates other RhoA-regulated processes is unclear. We would also require high-magnification images of the subcellular organization of the cytoskeleton to appreciate the effect of IL-1 stimulation.

      Thanks for the suggestions. We tested the role of IL-1β in other RhoA-regulated processes, and found that IL-1β-mediated RhoA activation also reduced cell migration in a cell scratch assay (see Author response image 1). We also provided high-magnification images in the revised Figures 4 and 5, as well as their respective figure supplements.

      Author response image 1.

      (A) Cell scratch assay images of HEK293T cells treated with PBS or IL-1β. (B) Quantification of cell migration in (A).

      Reviewer #2 (Public Review):

      Summary:

      In this study, Zheng et al investigated the role of inflammatory cytokines in protecting cells against SARS-CoV-2 infection. They demonstrate that soluble factors in the supernatants of TLR-stimulated THP1 cells reduce fusion events between HEK293 cells expressing SARS-CoV-2 S protein and the ACE2 receptor. Using qRT-PCR and ELISA, they demonstrate that IL-1 cytokines are (not surprisingly) upregulated by TLR treatment in THP1 cells. Further, they convincingly demonstrate that recombinant IL-1 cytokines are sufficient to reduce cell-to-cell fusion mediated by the S protein. Using chemical inhibitors and CRISPR knock-out of key IL-1 receptor signaling components in HEK293 cells, they demonstrate that components of the myddosome (MYD88, IRAK1/4, and TRAF6) are required for fusion inhibition, but that downstream canonical signaling (i.e., TAK1 and NFKB activation) is not required. Instead, they provide evidence that IL-1-dependent non-canonical activation of RhoA/Rock is important for this phenotype. Importantly, the authors demonstrate that expression of a constitutively active RhoA alone is sufficient to inhibit fusion and that chemical inhibition of Rock could reverse this inhibition. The authors followed up these in vitro experiments by examining the effects of IL-1 on SARS-COV-2 infection in vivo and they demonstrate that recombinant IL-1 can reduce viral burden and lung pathogenesis in a mouse model of infection. However, the contribution of the RhoA/Rock pathway and inhibition of fusion to IL-1-mediated control of SARS-CoV-2 infection in vivo remains unclear.

      Strengths:

      (1) The bioluminescence cell-cell fusion assay provides a robust quantitative method to examine cytokine effects on viral glycoprotein-mediated fusion.

      (2) The study identifies a new mechanism by which IL-1 cytokines can limit virus infection.

      (3) The authors tested IL-1 mediated inhibition of fusion induced by many different coronavirus S proteins and several SARS-CoV-2 strains.

      Weaknesses:

      (1) The qualitative assay demonstrating S2 cleavage and IL-1 mediated inhibition of this phenotype is extremely variable across the data figures. Sometimes it appears like S2 cleavage (S2') is reduced, while in other figures immunoblots show that total S2 protein is decreased. Based on the proposed model the expectation would be that S2 abundance would be rescued when cleavage is inhibited.

      In our present manuscript, IL-1-mediated changes in the full-length spike protein showed some variation between the authentic SARS-CoV-2 infection model and the HEK293T-S + HEK293T-ACE2 co-culture model, whereas in both models IL-1 inhibited S2' cleavage, accompanied by a reduction of the S2 subunit.

      In the authentic SARS-CoV-2 infection model, we observed that IL-1 inhibited S2' cleavage, accompanied by a reduction in both the S2 subunit and the full-length spike protein. This is likely because the S2 subunit and full-length spike protein in this model come not only from infected cells but also from intracellular viral particles. IL-1 inhibited SARS-CoV-2-induced cell-cell fusion and reduced the viral load in host cells; therefore, the abundance of both the S2 subunit and the full-length spike protein was reduced.

      In the HEK293T-based co-culture model, IL-1 inhibited S2' cleavage, accompanied by a reduction in the S2 subunit, while the full-length spike protein was more or less rescued. Based on our previous study, the R685A and ΔRRAR spike mutants cannot generate the S2 subunit but still generate the S2' fragment and induce cell-cell fusion, and the S2' fragment produced from these mutants was only slightly reduced compared with wild-type spike protein, suggesting that the S2' fragment is mainly derived directly from the full-length spike and only to a minimal extent from the S2 subunit (Fig. 4B and 4G, PMID: 34930824). Thus, inhibition of S2' cleavage by IL-1 mainly rescued the full-length spike protein.

      (2) The text referencing Figure 1H suggests that TLR-stimulated THP-1 cell supernatants "significantly" reduce syncytia, but image quantification and statistics are not provided to support this statement.

      Thanks for pointing out this issue. We have provided fluorescence image quantification and statistics in the revised version of our manuscript (Figure 1D, Figure 1-figure supplement 1A, Figure 1H-1I, Figure 2H-2I, Figure 1-figure supplement 1D-1E, Figure 1-figure supplement 1H-1I, Figure 2-figure supplement 1C-1D, Figure 2-figure supplement 2B-2E, Figure 2-figure supplement 2G-2H, Figure 2-figure supplement 6A-6B, Figure 2-figure supplement 7F-7G).

      (3) The authors conclude that because IL-1 accumulates in TLR2-stimulated THP1 monocyte supernatants, this cytokine accounts for the ability of these supernatants to inhibit cell-cell fusion. However, they do not directly test whether IL-1 is required for the phenotype. Inhibition of the IL-1 receptor in supernatant-treated cells would help support their conclusion.

      Thanks for the suggestion. Accordingly, we performed this experiment and found that IL-1RA treatment reduced the inhibitory effect of PGN-stimulated THP-1 cell culture supernatant on cell-cell fusion, suggesting that IL-1 is required for the inhibition. This result has been added to our revised manuscript (Figure 2J and Figure 2-figure supplement 4C).

      (4) Immunoblot analysis of IL-1 treated HEK293 cells suggests that this cytokine does not reduce the abundance of ACE2 or total S protein in cells. However, it is possible that IL-1 signaling reduces the abundance of these proteins on the cell surface, which would result in a similar inhibition of cell-cell fusion. The authors should confirm that IL-1 treatment of their cells does not change Ace2 or S protein on the cell surface.

      Thanks for the suggestion. Accordingly, we applied Wheat Germ Agglutinin (WGA) to stain the cell surface of HEK293T cells and observed that IL-1β treatment did not change ACE2 or Spike protein on the cell surface. This result has been added to our revised manuscript (Figure 5-figure supplement 3A-D).

      (5) In Figure 5A, expression of constitutively active RhoA appears to have profound effects on how ACE2 runs by SDS-PAGE, suggesting that RhoA may have additional effects on ACE2 biology that might account for the decreased cell-cell fusion. This phenotype should be addressed in the text and explored in more detail.

      Thanks for pointing this out. We also noticed that the occurrence of cell-cell fusion reduced the amount of ACE2, whereas inhibition of cell-cell fusion restored ACE2 abundance. Taking the original Figure 5A (revised Figure 4-figure supplement 2B) as an example, the increased ACE2 protein should be attributed to the decreased cell-cell fusion upon RhoA-CA transfection, as Spike binding to ACE2 leads to clathrin- and AP2-dependent endocytosis, resulting in ACE2 degradation in the lysosome (PMID: 36287912).

      In addition, we have examined the potential effect of RhoA-CA on ACE2, and found that RhoA-CA did not affect ACE2 expression, nor Spike binding to ACE2 (revised Figure 5-figure supplement 2E); it did not affect ACE2 distribution on cell surface either (revised Figure 5-figure supplement 2F and G).

      (6) The experiments linking IL-1 mediated restriction of SARS-COV-2 fusion to the control of virus infection in vivo are incomplete. The reported data demonstrate that recombinant IL-1 can restrict virus replication in vivo, but they fall short of confirming that the in vitro mechanism described (reduced fusion) contributes to the control of SARS-CoV2 replication in vivo. A critical piece of data that is missing is the demonstration that the ROCK inhibitor phenocopies IL-1RA treatment of SARS-COV-2 infected mice (viral infection and pathology).

      Thanks for this suggestion. Accordingly, we applied the ROCK inhibitor in vivo to confirm its role in SARS-CoV-2-infected mice and found a phenotype similar to that of the IL-1RA treatment experiment. That is, Y-27632 treatment prevented the formation of IL-1β-induced actin bundles at cell-cell junctions and thus promoted syncytia formation and further viral transmission in vivo (revised Figure 7).

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      I suggest providing single-channel images in a supplementary figure for the live-cell images in Figures 4 and 5. Higher magnification images would also help distinguish the subcellular details of the cytoskeleton organization.

      Thanks for the suggestion. We have provided the single channel images and higher magnification images in the revised Figures 4 and 5, as well as their respective figure supplements.

      In Figure 4, the authors showed that IL-1 activates RhoA and induces the accumulation of activated RhoA at the cell-cell junctions. They also showed that IL-1 promotes the formation of actin bundles at cell-cell junctions. However, the authors have not shown any connection between RhoA and actin yet, but in lines 263-264, they claim that actin bundle formation is induced by RhoA. Evidence for this part was shown in later results, but at this moment, it is lacking. The same applies to lines 282-284; I think this conclusion that IL-1-induced actin bundle formation is through the RhoA-ROCK pathway should come after showing how RhoA affects actin bundle formation at cell-cell junctions. To this end, I suggest moving Supplementary Figures S12B and S12D to the main figure, as they provide strong evidence of the IL-1-RhoA-ROCK-actin pathway.

      We appreciate these valuable comments. As suggested, we have moved the respective supplementary figures to the main figures to support our findings in the revised manuscript (Figure 4E and Figure 4-figure supplement 2B; Figure 5C and Figure 5-figure supplement 2A). The text has also been adjusted accordingly.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      HMGCS1, 3-hydroxy-3-methylglutaryl-CoA synthase1 is predicted to be involved in Acetyl-CoA metabolic process and mevalonate-cholesterol pathway. To induce diet-induced diabetes, they fed wild-type littermates either a standard chow (Control) or a high fat-high sucrose (HFHG) diet, where the diet composition consisted of 60% fat, 20% protein, and 20% carbohydrate (H10060, Hfkbio, China). The dietary regimen was maintained for 14 weeks. Throughout this period, body weight and fasting blood glucose (FBG) levels were measured on a weekly basis. Although the authors induced diabetes with a diet also rich in fat, the cholesterol concentration or metabolism was not investigated. After the treatment, were the animals with endothelial dysfunction? How was the blood pressure of the animals?

      Thank you for your comments and kind suggestions. We examined the impact of the HFHG diet on serum total cholesterol (T-CHO) levels in mice over the 14-week period. Our findings indicated that the HFHG diet significantly elevated T-CHO levels in the serum of mice (Supplementary Figure 5E). Additionally, the HFHG diet was associated with an increase in blood pressure (Figure 5F), and it exacerbated the progression of endothelial dysfunction in mice (Figure 5H-L).

      Strengths:

      To explore the potential role of circHMGCS1 in regulating endothelial cell function, the authors cloned exons 2-7 of HMGCS1 into lentiviral vectors for ectopic overexpression of circHMGCS1 (Figure S2). The authors could use this experiment as a concept proof and investigate the glucose concentration in the cell culture medium. Is the pLV-circ HMGCS1 transduction in HUVEC increasing the glucose release? (Line 163)

      In the manuscript, we utilized a DMEM culture medium containing 4500 mg/L glucose. Given that the HUVEC cell culture is glucose-dependent for its metabolic processes, it was challenging to precisely evaluate the relationship between pLV-circHMGCS1 transduction and the glucose concentration in the medium.

      Weaknesses:

      (1) Pg 20. The cells were transfected with miR-4521 mimics, miR-inhibitor, or miR-NC and incubated for 24 hours. Subsequently, the cells were treated with PAHG for another 24 hours. Were the cells transfected with Lipofectamine? The protocol or the Lipofectamine kit used should be described. The Lipofectamine protocol suggests using an incubation time of 72 hours. Why did the authors incubate for only 24 hours? If the authors did the mimic and inhibitor curves, these should be added to the supplementary figures. Please describe the miRNA mimic and antagomir concentration used in cell culture.

      For detailed transfection methods for the miRNA mimic and its inhibitor, please refer to “Transfection of miRNA mimic or inhibitor” (Line 587) in the revised Experimental Section. We employed the Hieff Trans® siRNA/miRNA in vitro transfection reagent (Yeasen, China, 40806ES03), with a transfection duration of 48 h. The miR-4521 content in HUVECs post-transfection was quantified using qRT-PCR. Transfection of the miR-4521 mimic for 48 h notably enhanced its expression in HUVECs (Supplementary Figure 3B), whereas transfection of the miR-4521 inhibitor for the same duration significantly suppressed its expression (Supplementary Figure 3C). The concentration used for both miRNA mimic and inhibitor transfection was 50 nM. In the revised manuscript, we have corrected the transfection time and clarified that we did not utilize miRNA antagomirs in our experiments.

      (2) Pg 20, line 507. What was the miR-4521 agomiR used to treat the animals?

      A miRNA agomir is a valuable experimental tool for elucidating miRNA function, used to simulate the overexpression of a specific miRNA. It is a chemically modified RNA molecule identical in sequence to the target miRNA, engineered for enhanced stability and transfection efficacy. Using a miRNA agomir enables overexpression of the target miRNA, facilitating the investigation of miRNA functions and mechanisms in vivo. In our study, we employed the miRNA mimic for cellular studies and the miRNA agomir for in vivo applications to achieve high expression of the miRNA (Fu et al, 2019).

      (3) Figure 1B. The results show the RT-qPCR for only 5 circRNAs; however, the results show that 48 circRNAs were upregulated and 18 were downregulated (Figure S1D). Why were the other circRNAs not confirmed? The upregulated circRNAs with high expression are not necessarily those with the best differential expression when comparing control vs. PAHG groups. Furthermore, Figure 1A and S1D also show downregulated circRNAs with high expression. Why were these circRNAs not confirmed?

      Our study aims to identify potential biomarkers for endothelial dysfunction in type 2 diabetes. To this end, we focused on circRNAs that exhibited significant upregulation following PAHG treatment. In our sequencing data, the p-values for these top upregulated circRNAs were notably below the threshold of 0.001, prompting their selection for further validation. We employed qRT-PCR to ascertain the consistency of their expression levels with the RNA-sequencing findings. Among these, circHMGCS1 was identified as a promising candidate with regulatory potential in endothelial dysfunction. The circRNAs that were significantly downregulated will be the subject of our ongoing research.

      (4) Figure 1B shows the relative circRNAs expression. Were host genes expressed in the same direction?

      circRNAs are generated from specific exons or introns of their host genes, either individually or in combination, and the main function of a circRNA depends on its non-coding RNA characteristics. The expression levels of circRNAs are not necessarily correlated with those of their host genes, and similarly, the functions of circRNAs do not inherently relate to the functions of the host genes (Kristensen et al, 2019; Liu & Chen, 2022). Consequently, the data presented in Figure 1B were primarily aimed at validating the accuracy of circRNA-seq. Although we did not conduct host gene expression analysis for the identified circRNAs, our subsequent results indicated that the overexpression of circHMGCS1 did not influence the expression levels of HMGCS1 (Figure 2A).

      (5) Line 128. The circRNA RT-qPCR methodology was not described. The methodology should be described in detail in the Methods section.

      The circRNA RT-qPCR method differs from that used for other genes only in that random primers must be used for reverse transcription. Unlike linear RNAs, which possess a 3' polyA tail that allows the use of oligo(dT) primers, circRNAs require random primers to initiate reverse transcription. Beyond this distinction, the procedure is the same as the standard qRT-PCR process. We have revised the "Isolation of RNA and miRNA for quantitative Real Time-PCR (qRT-PCR) analysis" method in the revised version (Line 695).

      (6) Line 699. The relative gene expression was calculated using the 2^-ΔΔCt method. This is not correct, the expression for miRNA and gene expression are represented in percentage of control.

      We initially employed the 2^-ΔΔCt method to ascertain the relative gene expression levels. Subsequently, we scaled all values by a factor of 100 to amplify the visual representation of the observed variations, thereby enhancing the visualization of the data.
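
      The calculation described above can be illustrated with a minimal sketch. It assumes the standard 2^-ΔΔCt formulation (target Ct normalized to a reference gene, then to the control group) followed by multiplication by 100 to express the result as a percentage of control. The function names and Ct values are hypothetical and are not taken from the manuscript.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Fold change of a target gene by the 2^-ΔΔCt method."""
    # ΔCt: target Ct normalized to the reference (housekeeping) gene
    delta_ct_sample = ct_target_sample - ct_ref_sample
    delta_ct_control = ct_target_control - ct_ref_control
    # ΔΔCt: treated sample relative to the control group
    delta_delta_ct = delta_ct_sample - delta_ct_control
    return 2 ** (-delta_delta_ct)


def percent_of_control(fold_change):
    """Scale a fold change so that the control group equals 100%."""
    return 100.0 * fold_change


# Hypothetical Ct values, for illustration only (not data from the study)
fold = relative_expression(ct_target_sample=24.0, ct_ref_sample=18.0,
                           ct_target_control=26.0, ct_ref_control=18.0)
percent = percent_of_control(fold)
print(fold, percent)
```

      Under these assumptions, scaling by 100 changes only the units of the axis and leaves the relative differences between groups unchanged.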

      (7) Line 630. Detection of ROS for tissue and cells. The methodology for tissue was described, but not for cells.

      We have added the detailed description of the cellular ROS detection methods in the revised manuscript as follows:

      For ROS detection in cells, the treated cells were washed once with PBS, incubated with 20 μM DHE at 37°C for 30 min protected from light, washed three times with PBS, and then placed in colorless DMEM medium for observation by fluorescence microscopy (Line 640-643).

      (8) Line 796. RNA Fluorescent In Situ Hybridization (RNA-FISH). Figure 1F shows that the RNA-Fluorescence in situ hybridization (RNA-FISH) confirmed the robust expression of cytoplasmic circHMGCS1 in HUVECs (Figure 1F). However, in the methods, lines 804 and 805 described the probes targeting circMAP3K5 and miR-4521 were applied to the sections. Hybridization was performed in a humid chamber at 37C overnight. Is it correct?

      We have made a correction in the revised manuscript. The corrected description is "the probes targeting circHMGCS1 and miR-4521 were applied to the sections" (Line 816).

      (9) Line 14. Fig 1-H. The authors discuss qRT-PCR demonstrated that circHMGCS1 displayed a stable half-life exceeding 24 h, whereas the linear transcript HMGCS1 mRNA had a half-life less than 8 h (Figure 1H). Several of the antibodies may contain trace amounts of RNases that could degrade target RNA and could result in loss of RNA hybridization signal or gene expression. Thus, all of the solutions should contain RNase inhibitors. The HMGCS1 mRNA expression could be degraded over the incubation time (0-24hs) leading to incorrect results. Moreover, in the methods is not mentioned if the RNAse inhibitor was used. Please, could the authors discuss and provide information?

      This experiment was performed in cell culture as described in our Experimental Methods (Line 753): we added actinomycin D directly to the cell culture well plates, and the cells remained in a healthy state during this treatment. We did not directly extract mRNA from cells for this experiment. Additionally, all solutions used throughout the experiment were prepared using RNase-free water, ensuring the integrity of the mRNA.
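
      As an illustration of how a half-life can be read from an actinomycin D time course, the following minimal sketch fits a first-order decay to the fraction of transcript remaining at each time point. It assumes simple exponential decay after transcription is blocked. The function name and the synthetic time courses are hypothetical and are not the measurements reported in Figure 1H.

```python
import math


def half_life_hours(times_h, fraction_remaining):
    """Estimate a half-life by least-squares fit of ln(fraction) versus time."""
    n = len(times_h)
    logs = [math.log(f) for f in fraction_remaining]
    mean_t = sum(times_h) / n
    mean_y = sum(logs) / n
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times_h, logs))
    sxx = sum((t - mean_t) ** 2 for t in times_h)
    slope = sxy / sxx  # equals -k for first-order decay
    if slope >= 0:
        return math.inf  # no measurable decay over the time course
    return math.log(2) / (-slope)


# Synthetic time courses, for illustration only (not data from the study)
times = [0, 4, 8, 12, 24]
fast_decay = [0.5 ** (t / 6) for t in times]   # built with a 6 h half-life
slow_decay = [0.5 ** (t / 48) for t in times]  # built with a 48 h half-life

print(half_life_hours(times, fast_decay))
print(half_life_hours(times, slow_decay))
```

      With real qRT-PCR data, the fraction remaining would be the expression at each time point relative to the 0 h sample, and a transcript that barely decays over 24 h yields an estimate well beyond the sampled window.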

      (10) Further experiments demonstrated that the overexpression of circHMGCS1 stimulated the expression of adhesion molecules (VCAM1, ICAM1, and ET-1) (Figures 2B and 2C), suggesting that circHMGCS1 is involved in VED. How were these genes expressed in the RNA-seq?

      In the manuscript, we focused exclusively on circRNA and miRNA sequencing and did not perform mRNA sequencing. Consequently, we employed qRT-PCR and Western blot to assess the expression changes of ET-1, ICAM1, and VCAM1 at the gene and protein levels. The findings revealed that the overexpression of circHMGCS1 significantly upregulated the expression of adhesion molecules (VCAM1, ICAM1, and ET-1).

      (11) Line 256. By contrast, the combined treatment of circHMGCS1 and miR-4521 agomir did not significantly affect the body weight and blood glucose levels. OGTT and ITT experiments demonstrated that miR-4521 agomir considerably enhanced glucose tolerance and insulin resistance in diabetic mice (Figures 5C, 5D, and Figures S5B and S5C). Why did the miR-4521 agomir treatment considerably enhance glucose tolerance and insulin resistance in diabetic mice, but not the blood glucose levels?

      Our results showed that miR-4521 agomir could effectively suppress the increase of body weight and blood glucose in mice (Figure 5A-B).

      (12) In the experiments related to pull-down, the authors performed Biotin-coupled miR-4521 or its mutant probe, which was employed for circHMGCS1 pull-down. This result only confirms the Luciferase experiments shown in Figure 4A. The experiment that the authors need to perform is pull-down using a biotin-labeled antisense oligo (ASO) targeting the circHMGCS1 backsplice junction sequence followed by pulldown with streptavidin-conjugated magnetic beads to capture the associated miRNAs and RNA binding proteins (RBPs). Also, the ASO pulldown assay can be coupled to miRNA RT-qPCR and western blotting analysis to confirm the association of miRNAs and RBPs predicted to interact with the target circRNA.

      This point is correct. As suggested, we utilized a biotin-labeled circHMGCS1 probe for pull-down experiments. Because circRNA-miRNA interactions are mainly mediated by the RNA-induced silencing complex, which includes Argonaute 2 (AGO2), we examined the levels of miR-4521 and AGO2 in the captured material. Our results demonstrated that circHMGCS1 significantly captured miR-4521 in the cells, with a concomitant acquisition of AGO2. These findings have been integrated into the revised manuscript (Supplementary Figures 4D and 4E).

      (13) In Figure 5, the authors showed that the results suggest that miR-4521 can inhibit the occurrence of diabetes, whereas circHMGCS1 specifically dampens the function of miR-4521, weakening its protective effect against diabetes. In this context, what are the endogenous target genes for the miR-4521 that could be regulating diabetes?

      In this study, we focused on the role of miR-4521 in endothelial function. Our animal experiments involving ARG1 knockdown revealed that the reduction of ARG1 expression resulted in the inability of miR-4521 to modulate the progression of type 2 diabetes. Consequently, ARG1 is likely an endogenous target gene of miR-4521, potentially implicated in the regulation of diabetes.

      (14) In the western blot of Figure 5, the β-actin band appears to be different from the genes analyzed. Was the same membrane used for the four proteins? The Ponceau S membrane should be provided.

      As described in our experimental methodology (Western blot analysis), we used PVDF membranes for our Western blot experiments. β-actin, a highly expressed housekeeping protein, yields distinct bands with minimal background noise. Because of its high abundance, β-actin can spread laterally from the sample wells during electrophoresis, so that its bands do not align with the lanes shown for the target proteins. The other three proteins show clearly separated lanes because their expression is not as high as that of β-actin. We have replaced the β-actin blot with one that has a similar background in the revised manuscript (Figure 5L).

      (15) Why did the authors use AAV9, since the AAV9 has a tropism for the liver, heart, skeletal muscle, and not to endothelial vessels?

      AAV9 has garnered significant interest as a gene delivery vector due to its extensive tissue penetration, minimal immunogenicity, and stable gene expression profile. Its application in cardiovascular disease research and therapy has been widely reported (Barbon et al, 2023; Yao et al, 2018; Zincarelli et al, 2008). We delivered AAV9 by tail vein injection in mice and, as shown in Figure 5J and Figure 7Q, observed GFP signals carried by AAV9 in the thoracic aorta. These findings suggest that AAV9 is capable of infecting endothelial cells effectively.

      Reviewer #2 (Public Review):

      Summary:

      The authors observed an aggravated vascular endothelial dysfunction upon overexpressing circHMGCS1 and inhibiting miR-4521. This study discovered that circHMGCS1 promotes arginase 1 expression by sponging miR-4521, which accelerated the impairment of vascular endothelial function.

      Strengths:

      The study is systematic and establishes the regulatory role of the circHMGCS1-miR-4521 axis in diabetes-induced cardiovascular diseases.

      Weaknesses:

      (1) The authors selected the miR-4521 as the target based on their reduced expression upon circHMGCS1 overexpression. Since the miRNA level is downregulated, the downstream target gene is expected to be upregulated even in the absence of circRNA. The changes in miRNA expression opposite to the levels of target circRNA could be through Target RNA-Directed MicroRNA Degradation. In addition, miRNA can also be stabilized by circRNAs. Hence, selecting miRNA targets based on opposite expression patterns and concluding miRNA sponging by circRNA needs further evidence of direct interactions.

      Thank you for your positive comments and kind suggestions.

      As suggested by Public Reviewer #1 (12), we employed a biotin-tagged circHMGCS1 probe to capture miR-4521 and AGO2 in HUVECs (Supplementary Figures 4D and 4E), and dual luciferase assays confirmed that miR-4521 can bind to circHMGCS1 directly. Furthermore, RNA pull-down and RIP assays demonstrated the direct binding capability of circHMGCS1 for miR-4521. Collectively, these findings underscore the direct interaction between circHMGCS1 and miR-4521.

      (2) The majority of the experiments were performed with an overexpression vector which can generate a lot of linear RNAs along with circRNAs. The linear RNAs produced by the overexpression vectors can have a similar effect to the circRNA due to sequence identity.

      In our manuscript, the employed vectors incorporate reverse repeat sequences that facilitate efficient circularization of circRNAs. This design ensures robust circular shearing upon the insertion of circRNA sequences into the polyclonal sites, thereby enhancing the overexpression of circRNAs (Supplementary Figure 2). Moreover, we used lentivirus as the vector for circRNA overexpression rather than direct plasmid transfection. As demonstrated in Figure 2A, upon overexpression of circHMGCS1, we observed a significant upregulation in circHMGCS1 levels compared to the pLV-circNC and Control groups. Notably, the expression levels of the linear HMGCS1 mRNA did not exhibit significant alterations.

      (3) There is a lack of data of circHMGCS1 silencing and its effect on target miRNA & mRNAs.

      According to your suggestion, we employed shRNA to knock down circHMGCS1 in HUVECs, and qRT-PCR was used to assess the expression levels of miR-4521 and ARG1. The knockdown significantly inhibited the expression of circHMGCS1 in HUVECs without obviously affecting the levels of HMGCS1 mRNA. We then selected circHMGCS1 shRNA1 for further investigation. We observed that the knockdown of circHMGCS1 resulted in an upregulation of miR-4521 and a downregulation of ARG1 expression.

      Author response image 1.

      The impact of circHMGCS1 knockdown on ARG1 and miR-4521 expression levels in HUVECs. The cells were transfected with either circHMGCS1 shRNA1 or circHMGCS1 shRNA2, and the expression levels of circHMGCS1 and HMGCS1 (A), miR-4521 (B) and ARG1 (C and D) in HUVECs were detected by qRT-PCR and Western blot. n=3 in each group. *p < 0.05, **p < 0.01. Significance was determined by one-way ANOVA followed by the Bonferroni multiple comparison post hoc test; error bars indicate SD.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      I suggest improving the discussion based on the literature.

      (1) Line 131. .... (hsa_circ_0008621, 899 nt in length, identified as circHMGCS1 in subsequent studies because of its host gene being HMGCS1). Please, provide the reference.

      We appreciate the valuable comments. We have added the reference at Line 133 (Liang et al, 2021).

      (2) The authors conclude that both in vitro and in vivo data suggest that the miR-4521 or circHMGCS1 fails to regulate the effect of diabetes-induced VED in the absence of ARG1. Therefore, ARG1 may serve as a promising VED biomarker, and circHMGCS1 and miR-4521 play a key role in regulating diabetes-induced VED by ARG1. In this context, they should re-evaluate whether this is the best title. "Circular RNA HMGCS1 sponges miR-4521 to aggravate type 2 diabetes-induced vascular endothelial dysfunction"

      This manuscript begins with circRNA as the focal point of study (Figure 1 and Figure 2). It then examines the miRNAs associated with the circRNA and elucidates their interactions (Figure 3, Figure 4 and Figure 5). Subsequently, the manuscript identifies the target genes of the miRNA and validates the regulatory effects of circRNA and miR-4521 on ARG1 (Figure 6). The study culminates with the application of the ceRNA theory to confirm the significance of ARG1 in the functional interplay between circHMGCS1 and miR-4521 (Figure 7). The findings throughout the manuscript are dedicated to uncovering the pivotal roles of circHMGCS1 and miR-4521 in modulating vascular endothelial function. Notably, the interaction between circHMGCS1 and miR-4521 is a novel finding of our research. Therefore, we aim to emphasize the critical function of circHMGCS1 and miR-4521 in the regulation of vascular endothelial dysfunction in type 2 diabetes within the manuscript.

      Reviewer #2 (Recommendations For The Authors):

      I have a few suggestions for improving the study further.

      (1) Although the experiments suggest the role of circHMGCS1, miR-4521 in vascular endothelial function, the direct regulation or interaction of circHMGCS1-miR-4521-ARG1 is unclear. A rescue experiment that checks the effect of circHMGCS1 silencing with/without inhibition of miR-4521 on ARG1 expression must be performed to prove the circHMGCS1- miR-4521 regulatory axis.

      Thank you very much for your constructive comments.

      According to your suggestion, we utilized shRNA to effectively knock down circHMGCS1 in HUVEC. Subsequent expression analysis via qRT-PCR was conducted to assess the levels of miR-4521 and ARG1. The knockdown of circHMGCS1 significantly reduced the expression of circHMGCS1 in HUVEC without influencing the expression of the host gene HMGCS1. Concurrently, the knockdown of circHMGCS1 resulted in an upregulation of miR-4521 (Supplementary Figure 4B) and a downregulation of ARG1 (Figure 6P and 6Q). In our manuscript, the upregulation in ARG1 expression caused by circHMGCS1 overexpression was reduced by miR-4521, and the downregulation in ARG1 expression caused by miR-4521 overexpression was also reversed by circHMGCS1. When miR-4521 was knocked down, the expression of ARG1 increased, and circHMGCS1 abrogated its regulatory effect on the expression of ARG1. Collectively, these findings indicate that the interplay between circHMGCS1 and miR-4521 significantly influences ARG1 expression.

      Author response image 2.

      The impact of circHMGCS1 knockdown on ARG1 and miR-4521 expression levels in HUVEC. The cells were transfected with either circHMGCS1 shRNA1 or circHMGCS1 shRNA2, and the expression levels of circHMGCS1 and HMGCS1 (A), miR-4521 (B) and ARG1 (C and D) in HUVECs were detected by qRT-PCR and Western blot. n=3 in each group. *p < 0.05, **p < 0.01. Significant differences were determined by one-way ANOVA followed by Bonferroni multiple comparison post hoc test; error bars indicate SD.

      (2) It is unclear how the authors arrived at the circHMGCS1-miR-4521 pair. The pull down of circHMGCS1 followed by qPCR enrichment analysis of all target miRNAs must be performed to select the target miRNA.

      In this manuscript, we identified the expression of miRNA under PAHG treatment through miRNA sequencing, and then further screened out 4 miRNAs with potential binding sites to circHMGCS1 utilizing the miRanda database. Subsequently, we employed qRT-PCR and Western blot analysis to confirm the regulatory influence of miR-4521 on endothelial function (Figure 3). Following this, RIP, RNA pull down, dual luciferase and RNA-FISH experiments were conducted to map the interaction between circHMGCS1 and miR-4521 (Figure 4). The direct interaction between circHMGCS1 and miR-4521 was further substantiated through overexpression and knockdown studies (Figures 5-7). While the reviewer's method may offer a more direct validation, our methodology initially involved a database-driven screening of candidate miRNAs with the potential to target and bind circHMGCS1, followed by experimental validation of these interactions. Both methodologies are capable of establishing the interaction sites between circHMGCS1 and miR-4521.

      (3) Since the back splicing is not that efficient, the linear RNA from the overexpression construct may produce many linear RNAs with miRNA binding sites. The effect seen in the case of overexpression experiments needs to consider the level of linear and circular HMGCS1 produced by the vector.

      In this manuscript, the vector's multiple cloning site is flanked by inverted repeat sequences that facilitate efficient circRNA looping. This design enables the inserted sequence to form a stable loop and undergo circularization upon transcription, leading to the overexpression of circRNA (Supplementary Figure 2). For the validation of circular RNA, we employed divergent primers that straddle the circRNA splicing junction. These primers are specific for circRNA amplification and do not amplify the corresponding linear RNA, as demonstrated in Figure 2A. Upon overexpression of circHMGCS1, we observed a significant increase in circHMGCS1 levels compared to the empty vector and Control groups, while there was no significant change in the expression level of HMGCS1 mRNA.
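
      The logic of divergent primers described above can be illustrated with a minimal sketch. It assumes a toy model of PCR (exact-match priming, one product per primer pair) and an invented 40-nt sequence; the sequence, primer positions and the `amplicon` helper are hypothetical and are not taken from the manuscript or from circHMGCS1.

```python
def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def amplicon(template: str, fwd: str, rev: str, circular: bool):
    """Return the predicted PCR product (sense strand), or None.

    Toy model: exact-match priming only, one product per primer pair.
    """
    start = template.find(fwd)            # where the forward primer anneals
    rev_site = revcomp(rev)               # reverse primer site on the sense strand
    rev_start = template.find(rev_site)
    if start == -1 or rev_start == -1:
        return None
    rev_end = rev_start + len(rev_site)
    if start + len(fwd) <= rev_start:     # primers face each other (convergent)
        return template[start:rev_end]
    if circular:                          # primers face outward: read across the junction
        return template[start:] + template[:rev_end]
    return None                           # outward-facing primers on a linear template


# Hypothetical 40-nt "circularised exon"; NOT the circHMGCS1 sequence.
exon = "ATGGCCTTAGGCATCGTACCGGATTACGCTAGCTTGACCA"
junction = exon[-4:] + exon[:4]           # back-splice junction: 3' end joined to 5' end

div_fwd = exon[26:34]                     # near the 3' end, extends toward the junction
div_rev = revcomp(exon[6:14])             # near the 5' end, extends back toward the junction

circ_product = amplicon(exon, div_fwd, div_rev, circular=True)
linear_product = amplicon("GGG" + exon + "TTT", div_fwd, div_rev, circular=False)

print("circular template:", circ_product)
print("junction in product:", circ_product is not None and junction in circ_product)
print("linear template:", linear_product)
```

      In this simplified model the divergent pair yields a junction-spanning product only from the circular template, whereas a convergent pair placed at the same sites would amplify both templates. Real primer design also has to account for melting temperature, mismatches and off-target priming, none of which is modelled here.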

      (4) As miR-4521 has multiple miRNA binding sites on circHMGCS1, it is not very clear which sites were mutated in circHMGCS1-MUT.

      We have made corrections to Supplementary Figure 4C. Utilizing the miRanda algorithm, we identified 10 potential binding sites for miR-4521 on circHMGCS1. Subsequently, we selected the site with the highest binding affinity for mutational analysis (miR-4521 binding positions 3-15, circHMGCS1 binding positions 260-281, binding rate 91.67%, binding ability -17.3 kcal/mol). We employed a dual-luciferase assay to confirm the direct interaction between circHMGCS1 and miR-4521.

      (5) Since the ceRNA network works efficiently in an equimolar concentration of the regulatory molecules, providing the copy number of circHMGCS1, miR-4521, and target mRNAs would be helpful.

      We employed qRT-PCR to ascertain the absolute quantification of mRNA copy numbers, following established methodologies (Nolan et al, 2006; Wagatsuma et al, 2005; Zhang et al, 2009). Our qRT-PCR data reveal that the circHMGCS1 mRNA copy number is 2343±529. In comparison, the ARG1 mRNA copy number stands at 88±27, while the miR-4521 copy number is significantly higher, recorded at 36277±9407.
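
      Because the reviewer's question concerns stoichiometry, the reported mean copy numbers can be turned into ratios directly. The short sketch below uses only the three means quoted above; the dictionary and the `ratio` helper are illustrative names, and the reported standard deviations are ignored, so the ratios should be read as rough estimates.

```python
# Mean copy numbers quoted in the response (standard deviations not propagated).
copies = {"circHMGCS1": 2343, "miR-4521": 36277, "ARG1": 88}


def ratio(numerator: str, denominator: str) -> float:
    """Ratio of mean copy numbers between two species."""
    return copies[numerator] / copies[denominator]


for a, b in [("miR-4521", "circHMGCS1"),
             ("circHMGCS1", "ARG1"),
             ("miR-4521", "ARG1")]:
    print(f"{a} : {b} = {ratio(a, b):.1f} : 1")
```

      On these means, miR-4521 is roughly 15-fold more abundant than circHMGCS1, and circHMGCS1 roughly 27-fold more abundant than ARG1 mRNA.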

      Author response image 3.

      The distribution of copy numbers for circHMGCS1, miR-4521 and ARG1 in HUVECs.

      (6) The yellow highlighted "cyclization-mediated sequence-F & R" does not seem to be complementary sequences. The method section may include the details of the vectors and cloning strategies for the overexpression constructs.

      The figure below illustrates the schematic representation of the complementary structure between the upstream and downstream sequences that facilitate circRNA circularization. This strategic pairing is designed to enhance the circularization efficiency of circRNA while concurrently suppressing mRNA synthesis (Liang & Wilusz, 2014). Details of this design have been integrated into the experimental method (Line 539). The specific additions are as follows:

      The circHMGCS1 sequence [NM_001098272: 43292575-43297268], the splice site AG/GT and ALU elements were inserted into the pCDH-circRNA-GFP vector (upstream ALU: AAAGTGCTGAGATTACAGGCGTGAGCCACCACCCCCGGCCCACTTTTTGTAAAGGTACGTACTAATGACTTTTTTTTTATACTTCAG, downstream ALU: GTAAGAAGCAAGGAAAAGAATTAGGCTCGGCACGGTAGCTCACACCTGTAATCCCAGCA). The restriction enzyme sites selected were EcoRI and NotI.

      Author response image 4.

      (7) Since circHMGCS1 is a multi-exonic circRNA that can undergo alternative splicing and divergent primers only validate the backsplice junction, the full-length sequence of mature circHMGCS1 needs to be checked by circRNA-RCA PCR followed by Sanger sequencing.

      In compliance with your guidance, we have enriched the revised manuscript with additional data. Specifically, we have included the full-length nucleic acid electrophoresis diagram of circHMGCS1 in Supplementary Figure 1F, the Sanger sequencing results in Supplementary Figure 1G, and a comparative analysis of the circHMGCS1 sequences obtained from Sanger sequencing with those referenced in the circBase database, presented in Supplementary Figure 1H.

      Reference:

      Barbon, E., C. Kawecki, S. Marmier, A. Sakkal, F. Collaud, S. Charles, G. Ronzitti, C. Casari, O.D. Christophe, C.V. Denis, P.J. Lenting, and F. Mingozzi. 2023. Development of a dual hybrid AAV vector for endothelial-targeted expression of von Willebrand factor. Gene Ther. 30: 245-254.

      Fu, Y., J. Chen, and Z. Huang. 2019. Recent progress in microRNA-based delivery systems for the treatment of human disease. ExRNA. 1: 24.

      Kristensen, L.S., M.S. Andersen, L.V.W. Stagsted, K.K. Ebbesen, T.B. Hansen, and J. Kjems. 2019. The biogenesis, biology and characterization of circular RNAs. Nat Rev Genet. 20: 675-691.

      Liang, D., and J.E. Wilusz. 2014. Short intronic repeat sequences facilitate circular RNA production. Genes Dev. 28: 2233-2247.

      Liang, J., X. Li, J. Xu, G.M. Cai, J.X. Cao, and B. Zhang. 2021. hsa_circ_0072389, hsa_circ_0072386, hsa_circ_0008621, hsa_circ_0072387, and hsa_circ_0072391 aggravate glioma via miR-338-5p/IKBIP. Aging (Albany NY). 13: 25213-25240.

      Liu, C.X., and L.L. Chen. 2022. Circular RNAs: Characterization, cellular roles, and applications. Cell. 185: 2016-2034.

      Nolan, T., R.E. Hands, and S.A. Bustin. 2006. Quantification of mRNA using real-time RT-PCR. Nat Protoc. 1: 1559-1582.

      Wagatsuma, A., H. Sadamoto, T. Kitahashi, K. Lukowiak, A. Urano, and E. Ito. 2005. Determination of the exact copy numbers of particular mRNAs in a single cell by quantitative real-time RT-PCR. J Exp Biol. 208: 2389-2398.

      Yao, C., T. Veleva, L. Scott, Jr., S. Cao, L. Li, G. Chen, P. Jeyabal, X. Pan, K.M. Alsina, I.D. Abu-Taha, S. Ghezelbash, C.L. Reynolds, Y.H. Shen, S.A. Lemaire, W. Schmitz, F.U. Müller, A. El-Armouche, N. Tony Eissa, C. Beeton, S. Nattel, X.H.T. Wehrens, D. Dobrev, and N. Li. 2018. Enhanced Cardiomyocyte NLRP3 Inflammasome Signaling Promotes Atrial Fibrillation. Circulation. 138: 2227-2242.

      Zhang, X.X., T. Zhang, M. Zhang, H.H. Fang, and S.P. Cheng. 2009. Characterization and quantification of class 1 integrons and associated gene cassettes in sewage treatment plants. Appl Microbiol Biotechnol. 82: 1169-1177.

      Zincarelli, C., S. Soltys, G. Rengo, and J.E. Rabinowitz. 2008. Analysis of AAV serotypes 1-9 mediated gene expression and tropism in mice after systemic injection. Mol Ther. 16: 1073-1080.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The researchers demonstrated that when cytokine priming is combined with exposure to pathogens or pathogen-associated molecular patterns, human alveolar macrophages and monocyte-derived macrophages undergo metabolic adaptations, becoming more glycolytic while reducing oxidative phosphorylation. This metabolic plasticity is greater in monocyte-derived macrophages than in alveolar macrophages.

      Strengths:

      This study presents evidence of metabolic reprogramming in human macrophages, which significantly contributes to our existing understanding of this field primarily derived from murine models.

      Weaknesses:

      The study has limited conceptual novelty.

      We acknowledge that the study has limited conceptual novelty; however, the current manuscript provides the field with evidence of the changes in the phenotype and functions of human macrophages in response to IFN-γ or IL-4, which is currently lacking in the literature. Moreover, our data show for the first time that human airway macrophages change their function in response to IFN-γ.

      Reviewer #2 (Public Review):

      Summary:

      The authors aimed to functionally characterize primary human airway macrophages and monocyte-derived macrophages, correlating their glycolytic shift in metabolism. They conducted this macrophage characterization in response to type II interferon and IL-4 priming signals, followed by different stimuli of irradiated Mycobacterium tuberculosis and LPS.

      Strengths:

      (1) The study employs a thorough measurement of metabolic shift in metabolism by assessing extracellular acidification rate (ECAR) and oxygen consumption rate (OCR) of differentially polarized primary human macrophages using the Seahorse XFe24 Analyzer.

      (2) The effect of differential metabolic shift on the expression of different surface markers for macrophage activation is evaluated through immunofluorescence flow cytometry and cytokine measurement via ELISA.

      (3) The authors have achieved their aim of preliminarily characterizing the glycolysis-dependent cytokine profile and activation marker expression of IFN-g and IL-4 primed primary human macrophages.

      (4) The results of the study support its conclusion of glycolysis-dependent phenotypical differences in cytokine secretion and activation marker expression of AMs and MDMs.

      Weaknesses:

      (1) The data are presented in duplicates for cross-analyses.

      (2) The data presented support a distinct functional profile of airway macrophages (AMs) compared to monocyte (blood)-derived macrophages (MDMs) in response to the same priming signals. However, the study does not attempt to explore the underlying mechanism for this difference.

      (3) The study is descriptive in nature, and the results validate IFN-g-mediated glycolytic reprogramming in primary human macrophages without providing mechanistic insights.

      (1) We acknowledge the data is presented in duplicate for cross-analyses. This duplication allowed us to examine both (A) the effect of IFN-γ or IL-4 on primary human airway and monocyte derived macrophages in the presence or absence of distinct stimulations and (B) to directly compare the fold change in function occurring in the AM with the changes in the MDM.

      (2 & 3) We acknowledge that our study is descriptive; however, by inhibiting glycolysis using 2DG we have demonstrated that increased flux through glycolysis is mechanistically required to mediate enhanced cytokine responses in both primary human AM and MDM primed with IFN-γ. We also acknowledge that we have not determined the differential molecular mechanisms downstream of IFN-γ in the AM versus the MDM. IFN-γ promotes both pro- and anti-inflammatory cytokines in AM and this was reduced by inhibiting glycolysis with 2DG. This identifies glycolysis as a key mechanistic pathway which can be therapeutically targeted in AM to modulate inflammation. Mechanistic studies on human AM are limited due to the low number of AM retrieved from BAL samples. Nevertheless, the differences between AM and MDM identified in the current study indicate that future mechanistic studies are warranted to identify why IFN-γ promotes IL-10 in AM and not MDM, and why TNF is differentially regulated by glycolysis in the two macrophage subpopulations, for example.

      Reviewer #3 (Public Review):

      Summary:

      In this manuscript, the authors explore the contribution of metabolism to the response of two subpopulations of macrophages to bacterial pathogens commonly encountered in the human lung, as well as the influence of priming signals typically produced at a site of inflammation. The two subpopulations are resident airway macrophages (AM) isolated via bronchoalveolar lavage and monocyte-derived macrophages (MDM) isolated from human blood and differentiated using human serum. The two cell types were primed using IFNγ and IL-4, which are produced at sites of inflammation as part of initiation and resolution of inflammation respectively, followed by stimulation with either irradiated Mycobacterium tuberculosis (Mtb) or LPS to simulate interaction with a bacterial pathogen. The authors use human cells for this work, which makes use of widely reported and thoroughly described priming signals, as well as model antigens. This makes the observations on the functional response of these two subpopulations relevant to human health and disease. To examine the relationship between metabolism and functional response, the authors measure rates of oxidative phosphorylation and glycolysis under baseline conditions, primed using IFNγ or IL-4, and primed and stimulated with Mtb or LPS.

      Strengths:

      • The data indicate that both populations of macrophages increase metabolic rates when primed, but MDMs decrease their rates of oxidative phosphorylation after IL-4 priming and bacterial exposure while AMs do not.

      • It is demonstrated that glycolysis rates are directly linked to the expression of surface molecules involved in T-cell stimulation and while secretion of TNFα in AM is dependent on glycolysis, in MDM this is not the case. IL-1β is regulated by glycolysis only after IFN-γ priming in both MDM and AM populations. It is also demonstrated that Mtb and LPS stimulation produces responses that are not metabolically consistent across the two macrophage populations. The Mtb-induced response in MDMs differed from the LPS response, in that it relies on glycolysis, while this relationship is reversed in AMs. The difference in metabolic contributions to functional outcomes between these two macrophage populations is significant, despite acknowledgement of the reductive nature of the system by the authors.

      • The observations that AM and MDM rely on glycolysis for the production of cytokines during a response to bacterial pathogens in the lung, but that only MDM shift to Warburg Metabolism, though this shift is blocked following exposure to IL-4, are supported by the data and a significant contribution to the study of the innate immune response.

      Weaknesses:

      • It is unclear whether changes in glycolysis and oxidative phosphorylation in primed cells are due to priming or subsequent treatments. ECAR and OCR analyses were therefore difficult to interpret.

      All data sets have been presented and analysed relative to the unprimed, unstimulated control to show both the effect of priming and the effect of subsequent stimulation. A second analysis was subsequently conducted where each data set was normalised to its own baseline in terms of percentage change. Therefore, each of the unprimed, IFN-γ primed and IL-4 primed cells was set to 100% in order to assess the effect of stimulation independent of the baseline priming effect. For clarity we have removed the following line:

      “Percentage change for ECAR and OCR was calculated from the respective baseline of each data set to visualise the differential ability of IFN-γ, IL-4 primed or unprimed AM to respond to stimulation (Figure S1C,D).”

      We have amended the text in the manuscript (lines 164-173) to “Since IFN-γ priming increased cellular energetics in the AM at baseline, we calculated percent change in ECAR and OCR from the baseline rate of each group in order to assess if IFN-γ or IL-4 primed AM have altered capacity to change their metabolism in response to stimulation (Figure 1C,D). This was carried out to equalise all the primed data sets at baseline before stimulation (Figure S1C, S1D).  These data indicate that whilst the peak of glycolysis is elevated in IFN-γ primed AM (Figure 1A), all AM have a similar capacity to increase glycolysis upon stimulation when baseline differences in metabolism were adjusted for the effects of cytokine priming (Figure 1C). IFN-γ increased the percent change in OCR of AM in response to both bacterial stimuli compared to the unstimulated IFN-γ primed control (Figure 1D). These data indicate that priming AM alters the metabolic baselines of human tissue resident macrophages and not their ability to respond to bacterial stimuli.”

      • The data may not support a claim that AM has greater "functional plasticity" without a direct comparison of antigen presentation. Moreover, MDM secrete more IL-1β than AM. The claim that AM "have increased ability to produce all cytokines assayed in response to Mtb stimulation" does not appear to be supported by the data.

      Our data suggest that the MDM are more phenotypically plastic (in terms of their ability to alter expression of cell surface markers in response to cytokine cues), whereas AM have a greater ability to alter cytokine production, our measure of functional plasticity. We have now defined the use of the terms ‘functional plasticity’ and ‘phenotypic plasticity’ in the context of our paper in lines 60-63. To consider different culture and plating requirements of MDM versus AM, cytokine production was analysed relative to the average of the unprimed Mtb or LPS control of the respective MDM or AM. This allowed us to draw more accurate comparisons between the two macrophage populations by examining their relative ability to increase their cytokine production (expressed as fold change) rather than defining this functional plasticity only in terms of concentrations of cytokine produced in culture.

      We have therefore added the following sentence into the conclusion of the manuscript. “Cumulatively, the data presented herein suggest that the MDM may be more phenotypically plastic than the AM, while the AM have enhanced functional plasticity in their ability to modulate cytokine production after exposure to Th1 and Th2 cytokines.”

      We have edited the discussion (lines 421-423) to clarify the following "have increased ability to produce all cytokines assayed in response to Mtb stimulation" and changed it to “stimulated with Mtb have significantly more production of IL-1β, TNF and IL-10 compared with unprimed controls. This is in contrast with IFN-γ primed MDM which only upregulate TNF compared to their unprimed controls.”   

      • The claim that AM are better for "innate training" via IFNγ may not be consistent with increased IL-1β and a later claim that MDM have increased production and are "associated with optimal training."

      We have removed the word “better” and now simply state that AM are a tractable target to induce innate training in the human lung.

      • Statistical analyses may not appropriately support some of the conclusions.

      We have consulted with a statistician. Please see response to reviewer 3 recommendations for authors point 1 below.  

      • AM populations would benefit from further definition-presumably this is a heterogenous, mixed population.

      The AM used in the current study are routinely >97% CD68+CD14+ (Author response image 1). However, we acknowledge that tissue resident macrophages represent a spectrum of phenotypes. Given limitations in cell numbers from primary human AM derived from BALF, we have not attempted to define the function of discrete subpopulations of AM.

      • The term "functional plasticity" could also be more stringently defined for the purposes of this study.

      We are terming functional plasticity to be the macrophages’ ability to alter their production of cytokines in response to external cues like IFN-γ and IL-4 whereas phenotypic plasticity is measured based on ability to alter the cell surface expression of activation markers.  We have now defined this in the manuscript (lines 60-63).

      Author response image 1.

      Expression of macrophage markers on AM. 

      Conclusion:

      Overall, the authors succeed in their goals of investigating how inflammatory and anti-inflammatory cytokine priming contributes to the metabolic reprogramming of AM and MDM populations. Their conclusions regarding the relationship between cytokine secretion and inflammatory molecule expression in response to bacterial stimuli are supported by the data. The involvement of metabolism in innate immune cell function is relevant when devising treatment strategies that target the innate immune response during infection. The data presented in this paper further our understanding of that relationship and advance the field of innate immune cell biology.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1)  Authors are suggested to provide rationale for their choice of cytokines as IFN-gamma and IL-4. This will be useful for the readers.

      We have updated the following sentence (line 44-46) in the manuscript to add more rationale for the choice of IFN-γ and IL-4.  “There is a paucity of data on the role of metabolism in response to Th1 or Th2 microenvironments induced by cytokines-such as IFN-γ or IL-4 respectively, in human macrophages, especially in tissue resident macrophages, such as AM.”

      (2)  Authors have shown the final outcome of metabolic reprogramming in terms of expression of HLADR and CD-40, and cytokine release. What pathways/receptors are activated or associated with IL-4 and IFN-gamma priming as a first line of response?

      The relationship between IFN-γ or IL-4 induced expression of CD40 is established in haematological cell lines and fibroblasts as well as APC, with roles for the JAK/STAT pathways and upregulation of IRFs defined (1-3). Similarly, the relationship between exogenous IFN-γ and upregulation of HLA-DR expression on human monocytes or endothelial cells is established (4, 5). Whilst our work does not outline the signalling pathways downstream of Th1 or Th2 cytokine priming, we have shown for the first time that glycolysis mechanistically underpins the shift in phenotype and function observed in human macrophages upon priming with IFN-γ or IL-4.

      (3)  What are the intracellular signals leading to glycolytic shift?

      One of the most likely mechanisms that underpin the shift to glycolytic metabolism is the stabilisation of HIF-1α mediated by activation of mTOR (see response below and rebuttal figure 2).

      (4)  Additional evidence is required to show Warburg effect such as stabilization and activation of HIF1alpha.

      We acknowledge that we have not shown the activation and stabilisation of HIF-1α, however, we have provided functional evidence of increased glycolysis with concomitant decreased oxidative phosphorylation indicative of Warburg metabolism.

      In order to address this gap in evidence we have reworded the manuscript to describe this functional change to “Warburg-like metabolism” throughout the manuscript. In addition, we have undertaken Western Blotting to provide evidence of mTOR activation when cells are primed with IFN-γ (Author response image 2).

      Author response image 2.

      IFN-γ activates mTOR in primary human monocytes. Monocytes were isolated from healthy donor PBMC using magnetic separation. Monocytes were left untreated (-), stimulated with rapamycin as a negative control (Rap; 50 nM), IFN-γ (10 ng/ml) or IFN-γ and rapamycin simultaneously (IFN-γ + Rap) for 15 minutes. Phosphorylation of S6 was used as a readout of mTOR activation and measured by western blot using β-actin as a control. The blot (A) and densitometry results (B), expressed as the relative expression of pS6:β-actin, are shown. Graphs show data of n=1 of unprimed (black dot) vs IFN-γ primed (red) with and without rapamycin. ImageLab (Bio-Rad) software was used to perform densitometric analysis.

      (5)  What is the importance of showing percentage change vs fold change in figure 1 (1C vs 1A)?

      All data sets have been presented and analysed relative to the unprimed, unstimulated control to show the effect of first priming and then subsequent stimulation (Figure 1A). A second analysis was subsequently conducted where each data set was normalised to its own baseline in terms of percentage change (Figure 1C). Therefore, each of the unprimed, IFN-γ primed or IL-4 primed cells was set to 100% to assess the effect of stimulation independent of the pre-existing effect of priming on the baseline metabolism. For clarity we have removed the following line:

      “Percentage change for ECAR and OCR was calculated from the respective baseline of each data set to visualise the differential ability of IFN-γ, IL-4 primed or unprimed AM to respond to stimulation (Figure S1C,D).”

      We have amended the text (lines 164-173) in the manuscript to “Since IFN-γ priming increased cellular energetics in the AM at baseline, we calculated percent change in ECAR and OCR from the baseline rate of each group in order to assess if IFN-γ or IL-4 primed AM have altered capacity to change their metabolism in response to stimulation (Figure 1C,D). This was carried out to equalise all the primed data sets at baseline before stimulation (Figure S1C, S1D).  These data indicate that whilst the peak of glycolysis is elevated in IFN-γ primed AM (Figure S1A), all AM have a similar capacity to increase glycolysis upon stimulation when baseline differences in metabolism were adjusted for the effects of cytokine priming (Figure 1C). IFN-γ increased the percent change in OCR of AM in response to both bacterial stimuli compared to the unstimulated IFN-γ primed control (Figure 1D). These data indicate that priming AM alters the metabolic baselines of human tissue resident macrophages and not their ability to respond to bacterial stimuli.”
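
      The difference between the two normalisations described in this response can be made concrete with a small sketch. The ECAR values below are invented purely for illustration and are not data from the study; the function names are likewise hypothetical. The first function expresses every reading relative to the unprimed, unstimulated control, and the second sets each group's own baseline to 100%.

```python
# Hypothetical ECAR readings (arbitrary units); NOT data from the study.
ecar = {
    "unprimed": {"baseline": 20.0, "stimulated": 30.0},
    "IFN-g":    {"baseline": 40.0, "stimulated": 60.0},
    "IL-4":     {"baseline": 18.0, "stimulated": 27.0},
}

# Shared reference for the first analysis: unprimed, unstimulated cells.
reference = ecar["unprimed"]["baseline"]


def fold_change_vs_reference(group: str, state: str) -> float:
    """Reading relative to the unprimed, unstimulated control (priming + stimulation)."""
    return ecar[group][state] / reference


def percent_of_own_baseline(group: str, state: str) -> float:
    """Reading as a percentage of the same group's baseline (stimulation only)."""
    return 100.0 * ecar[group][state] / ecar[group]["baseline"]


for group in ecar:
    print(group,
          f"fold vs reference = {fold_change_vs_reference(group, 'stimulated'):.2f}",
          f"% of own baseline = {percent_of_own_baseline(group, 'stimulated'):.0f}")
```

      With these made-up numbers the IFN-γ primed group reaches the highest fold change against the shared reference, yet all three groups sit at 150% of their own baseline, which is the kind of pattern the authors describe: an elevated peak with a similar capacity to respond once the priming effect on the baseline is removed.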

      (6)  Why IL-4 primed cells have lower glycolysis than unprimed control cells even in absence of pathogen in Figure 1A?

      IL-4 primed AM do not have statistically significant changes in glycolysis compared with unprimed control cells in the absence of stimulation.  

      Reviewer #2 (Recommendations For The Authors):

      The manuscript entitled "Human airway macrophages are metabolically reprogrammed by IFN-γ resulting in glycolysis dependent functional plasticity" by Cox et al., characterizes glycolytic-linked cytokine secretion and surface receptor expression of primary human airway macrophages (AM) and monocyte-derived macrophages (MDM). The authors primed the primary macrophages with type II interferon (IFN-γ) or interleukin-4 (IL-4) into Th1 and Th2 polarized states. This was followed by measurement of the shift in macrophage metabolism to glycolysis (ECAR measurement) and/or oxidative phosphorylation (OCR measurement) in response to lipopolysaccharide and irradiated Mycobacterium tuberculosis. The authors then utilize 2-DG (an inhibitor of glycolysis) to show the reliance of glycolytic shift in metabolism to drive the expression of different macrophage activation markers in MDMs and cytokine secretion in AMs.

      Significance:

      The study provides important validation of IFN-γ-mediated glycolytic shift and its correlated functionalities in primary human macrophage populations.

      Highlights: The study characterizes glycolytic-linked cytokine secretion and expression of macrophage activation markers in primary human resident (lung) and monocyte (blood)-derived macrophages. The study also shows data in support of IFN-γ alone in mediating glycolytic reprogramming of human primary macrophages.

      Limitations:

      The study lacks novelty and does not provide any new or different information in relation to IFN-γ-mediated glycolytic shift in the metabolism of human macrophages.

      Major comments:

      (1) The authors have relied on irradiated Mycobacterium tuberculosis (Mtb) and LPS stimulation to measure different correlates of macrophage functions. Additionally, the authors have discussed their results with irradiated Mtb with that of infection with live Mtb. There are also recent reports that show Mtb infection limiting glycolytic reprogramming in murine and human macrophages (PMID: 31914380) in contrast to their observation with irradiated Mtb. The authors should also include live Mtb infection or other replicative live bacterium for the induction of surface activation markers and cytokine release in their setup.

      We thank the reviewer for this suggestion; however, this is beyond the scope of the current study which was to assess AM and MDM in the context of immune stimulation in a reductive manner using TLR4 ligand LPS and a more complete whole bacteria stimulation. The selected bacterial ligands were employed in the study to allow us to model an optimal macrophage host response. This minimises the confounding variable of live bacteria which can perturb cellular metabolism and immune responses, which we have highlighted in the discussion. Since both LPS and irradiated Mtb induced similar metabolic and phenotypic profiles, it is likely that the effects of priming are maintained with diverse stimuli.  

      (2) The authors should add a quantitative measure (like extracellular lactate secretion or ECAR level) for the extent of glycolytic inhibition by the use of 5 mM 2-DG in their setup.

      We would like to draw the attention of the reviewer to the data represented in supplementary figure 2B, demonstrating that 2DG lowers ECAR at 5mM at both 1 and 24 h post stimulation with iH37Rv by an average of approximately 40%. In addition, we have acknowledged that inhibition with 5 mM 2DG does not fully inhibit glycolysis as outlined in the study limitations (lines 477-480).  

      (3) Percent change and fold change have been used to show the same or similar result in Fig. 1 and 2. Whereas, supplementary Fig. 1 shows absolute ECAR/OCR values in addition to fold change. The authors can plot either fold change or percent change in different measurements to avoid confusion. For example, do ECAR changes upon LPS stimulation in Fig. 1A and 1C come from the same dataset? One of the data points in percent change shows a decrease in percent ECAR change under no cytokine control, whereas all the data points in fold change show an increase.

      We have addressed this comment above in response to reviewer 1 point 5 (recommendations for the authors).

      We thank the reviewer for highlighting this single error in the data points for percent change. We have fixed this data point which was a result of a calculation error. All data throughout the manuscript has now been rechecked.   
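
      The reviewer's consistency check can be illustrated with a minimal sketch. It assumes both measures are computed from the same baseline and post-stimulation rates; the function names and ECAR readings below are hypothetical and are not taken from the manuscript. Under that assumption, a fold change above 1 cannot coexist with a negative percent change.

```python
def fold_change(baseline, stimulated):
    """Ratio of the post-stimulation rate to the baseline rate."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return stimulated / baseline


def percent_change(baseline, stimulated):
    """Change from baseline expressed as a percentage of baseline."""
    return (fold_change(baseline, stimulated) - 1.0) * 100.0


# Hypothetical ECAR readings for three donors: (baseline, post-stimulation).
readings = [(20.0, 30.0), (25.0, 50.0), (40.0, 44.0)]

for baseline, stimulated in readings:
    fc = fold_change(baseline, stimulated)
    pc = percent_change(baseline, stimulated)
    # A fold change above 1 always corresponds to a positive percent change.
    assert (fc > 1.0) == (pc > 0.0)
    print(f"fold change = {fc:.2f}, percent change = {pc:+.1f}%")
```

      Because percent change is simply (fold change - 1) x 100, a data point that falls in one plot and rises in the other points to a calculation error rather than a biological effect.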

      Minor comments:

      (1) The manuscript for review should be line-marked for referencing and commenting during review.

      We have now included line-marking on the manuscript.  

      (2) The authors can depict marker legends differently for all figures. In all figures, circles to squares or triangles represent treatment/stimulation with iH37Rv or LPS. The authors can depict this as circles to squares/triangles in contrast to different legends.

      We have changed the legend to include a more detailed description of data represented inserting additional information regarding the colours and symbols represented in the figures.  

      (3) Describe bars in supplementary figure 1A - 1H in its legend?

      We thank the reviewer for highlighting this oversight; we have amended the legend to state “error bars represent standard deviation”.

      (4) Discuss the significant increase in CD86 expression in IFN-γ and IL-4 primed unstimulated AMs in Fig. 3E.

      We have updated the results section to state that IFN-γ increased the expression of CD86 when isolated in the absence of bacterial stimulations in Fig. 3E (lines 271-272). There is no significant increase in CD86 by IL-4 primed unstimulated AM. IL-4 primed human AM only upregulated CD86 when treated with 2DG or in the presence of stimulation.  

      (5) Contrary to Fig. 2, the data points of unstimulated cells in Fig. 4 vary for different treatment conditions (no cytokine, IFN-γ, and IL-4) for each cytokine measurement. What is the difference between unstimulated cells in Fig. 4 (for each cytokine) from that of Fig. 2 (for each receptor MFI)?

      Unstimulated cells change their surface activation markers and phenotype in response to IFN-γ and IL-4 in Fig. 2. For Fig. 4, IFN-γ and IL-4 are not sufficient to induce cytokine secretion in the absence of stimulation with bacterial ligands.  

      (6) The methodology for seeding and treatment of cells is reemphasized for almost all results. Defining macrophage priming and stimulation of macrophages in the method section and once at the start of results should be fine.

      Plating happens differently for Seahorse compared to the flow cytometric phenotyping and ELISA for cytokine production. For clarity we have stated and reemphasized the seeding and treatment of cells throughout the results section.  

      (7) Clarify "IL-4 reduced glycolysis in response to LPS stimulation" in relation to the results depicted in Fig. 1A and 1C. Similarly, clarify "IL-4 resulting in reduced IL-1β and IL-10 production" in relation to Fig. 4E.

      For clarity we have added the following lines (157-160, 164-170) to the manuscript:  

      “IL-4 primed iH37Rv stimulated AM increased ECAR to a similar extent as unprimed controls (Figure 1A; left). Conversely, IL-4 primed AM stimulated with LPS did not increase their ECAR to the same extent as controls (Figure 1A; right), suggesting that IL-4 reduces the ability of AM to increase ECAR in response to LPS stimulation.”

      “Since IFN-γ priming increased cellular energetics in the AM at baseline, we calculated percent change in ECAR and OCR from the baseline rate of each group in order to assess if IFN-γ or IL-4 primed AM have altered capacity to change their metabolism in response to stimulation (Figure 1C,D). This was carried out to equalise all the primed data sets at baseline before stimulation (Figure S1C, S1D). These data indicate that whilst the peak of glycolysis is elevated in IFN-γ primed AM (Figure S1A), all AM have a similar capacity to increase glycolysis upon stimulation when baseline differences in metabolism were adjusted for the effects of cytokine priming (Figure 1C).”

      For clarity we have amended the sentence the reviewer has highlighted (lines 214-215): “IL-4 primed AM had reduced fold change in glycolysis upon stimulation with LPS compared with controls”.

      Since IFN-γ priming induced large effect sizes, we statistically analysed the IL-4 primed and unprimed data sets in the absence of the IFN-γ primed data sets to determine how IL-4 influenced macrophage function. The only data where this resulted in any statistical significance was in response to cytokine production. We have now clarified this in the methods and relevant figure legends by stating, “Statistically significant differences were determined using two-way ANOVA with a Tukey post-test (AD); *P≤0.05, **P≤0.01, ***P≤0.001, ****P≤0.0001 or #P≤0.05, ##P≤0.01 (where IFN-γ primed data sets were excluded for post-test analysis to analyse statistical differences between no cytokine and IL-4 treated data sets).”

      To further clarify this, we have amended the text of the manuscript (lines 307-310) to reflect this. “All stimulated AM secreted IL-10 regardless of priming (Figure 4E). IFN-γ significantly enhanced iH37Rv induced IL-10 in AM compared to unprimed or IL-4 primed comparators (Figure 4E). IL-4 priming of human AM significantly reduced IL-10 production in response to iH37Rv compared with unprimed AM (Figure 4E). LPS strongly induced IL-10 production in unprimed MDM, which was significantly attenuated by either IFN-γ or IL-4 priming (Figure 4F).”  

      (8) Clarify whether data points in unstimulated, iH37Rv stimulated, and LPS-stimulated control cells in Fig. 3A - 3F are from independent experiments from those in Fig. 2A - 2F? The distribution of data points of control (no 2-DG treatment) in Fig. 3 is highly similar to the corresponding data points in Fig. 2. Similarly, provide clarification for similarity in Fig. 5A - 5F and Fig. 4A - 4F.

      The data illustrated in figures 2 and 3 are from one very large dataset, as are the data in figures 4 and 5. This large experiment was designed to test the effect of priming macrophages with IFN-γ or IL-4 (in the presence or absence of stimulation), and also to determine if the differential responses elicited due to priming were dependent on glycolysis (by inhibiting with 2DG). For clarity and transparency, the same stimulated dataset is repeated in both figures. Given the size and complexity of the experiment, we chose to present the data this way to aid the reader.

      (9) Clarify the statement "where data was reanalyzed in the absence of IFN-γ" in the section pertaining to Statistical analysis. The authors should clearly mention the nature of biological and technical replicates for each experiment in its figure legend. The authors should also confirm multiple comparison correction in all 2-way ANOVA tests done in each figure legend.

      We have amended the text (lines 133-136) to clarify this point “P-values of ≤0.05 were considered statistically significant and denoted with an asterisk. Alternatively, P-values of ≤0.05 were denoted with a hashtag where data was analysed in the absence of IFN-γ primed data sets, to analyse statistical differences between no cytokine and IL-4 treated data sets.”  

      Figures represent biological replicates (which are the average of technical replicates, presented as a single data point). This is indicated by the following sentence in each figure legend: “Each linked data point represents the average of technical duplicates for one individual biological donor”.  
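
      As a minimal sketch of the replicate handling described above, the snippet below collapses technical duplicates into one value per donor and condition. The donor labels, condition names and readings are hypothetical, and the use of the arithmetic mean follows the legend wording rather than any code from the study.

```python
from statistics import mean

# Hypothetical cytokine readings: donor -> condition -> technical duplicates.
raw = {
    "donor_1": {"unstimulated": [10.0, 12.0], "LPS": [200.0, 220.0]},
    "donor_2": {"unstimulated": [8.0, 8.0], "LPS": [150.0, 170.0]},
}


def collapse_technical_replicates(readings):
    """Return one value per donor and condition: the mean of its technical replicates."""
    return {
        donor: {condition: mean(values) for condition, values in conditions.items()}
        for donor, conditions in readings.items()
    }


per_donor = collapse_technical_replicates(raw)
# Each donor now contributes a single point per condition, so n equals the number of donors.
for donor, conditions in per_donor.items():
    print(donor, conditions)
```

      Averaging first means that the statistics treat donors, not wells, as the independent units.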

      Each legend has been amended to include the multiple comparison post-test applied.

      (10) Discuss the differences and similarities of IFN-γ driven metabolic reprogramming of primary murine macrophages with the results of this study relative to cytokine secretion and activation marker expression.

      We have added additional discussion and detail comparing human and murine macrophages in lines 381-382, 403, 407 and 412-415 of the manuscript.

      (11) The repetitive data plots of similar results can be significantly reduced to improve the interpretation of the results.

      The benefit of plotting the data in this way is a clearer understanding and representation of the data. The repetitive data plots make it possible to first delineate the effect of priming and priming plus stimulation and then, separately, to further examine the differences in AM versus MDM. The repetition of the primed data points then allows the reader to determine the effect of inhibiting glycolysis with 2DG on unprimed and primed macrophages (with and without stimulation).

      Reviewer #3 (Recommendations For The Authors):

      The methods used and data reported in this manuscript contribute to our understanding of the role of metabolism in programming of macrophages during priming. Suggestions for improving the presentation and interpretation of results include:

      • Consult with a statistician regarding analyses of the multiple conditions used during these assays. The use of repeated statistical analyses with different comparison groups in the same figure/data set seems atypical and should either be amended or fully justified in the text. Also, use of two-way vs. one-way ANOVA should be evaluated and clarified.

      We have now consulted a statistician. We have amended the text (lines 133-136) to clarify this point “P-values of ≤0.05 were considered statistically significant and denoted with an asterisk. Alternatively, P-values of ≤0.05 were denoted with a hashtag where data was analysed in the absence of IFN-γ primed data sets, to analyse statistical differences between no cytokine and IL-4 treated groups.”  

      There are two variables in the data sets, cytokine priming and stimulation status; we therefore opted for a two-way ANOVA rather than a one-way ANOVA. There are three stimulation groups: unstimulated, Mtb-stimulated and LPS-stimulated. Cytokine priming also has three groups: no cytokine, IFN-γ, or IL-4. There are two variables (priming and stimulation), each with 3 groups i.e., six treatment conditions in total; therefore, two-way ANOVA with multiple comparisons tests helps pinpoint exactly which groups (e.g., the 6 different levels of the 'stimulation' and 'cytokine' treatments) are significantly different from each other. This was important for understanding the specific effects of our treatments. The reader can therefore also deduce how these six treatment conditions compare to each other.

      In contrast, performing multiple single comparisons independently of the rest of the dataset (e.g. t tests) increases the risk of false positives (type 1 error). ANOVA with multiple-comparison post-tests adjusts for this, helping to reduce the likelihood of a type 1 error. These statistics are more stringent, and it is therefore harder to obtain P values <0.05. Hence, if we compared all six treatment groups without adjustment, we would increase the chance of finding false positives due to the sheer number of comparisons, leading to biased and incorrect conclusions.
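
      The inflation of type 1 error can be made concrete with a short calculation. This is a sketch under a simplifying assumption that the unadjusted tests are independent, which pairwise comparisons on shared groups are not, so the figure is indicative only. The Bonferroni threshold is shown as a simple stand-in for an adjustment; it is not the Tukey post-test used in the study.

```python
from math import comb


def n_pairwise(groups):
    """Number of distinct pairwise comparisons among `groups` groups."""
    return comb(groups, 2)


def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n_tests independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_alpha(alpha, n_tests):
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests


groups = 6      # the six treatment groups mentioned in the response
alpha = 0.05
m = n_pairwise(groups)
print(f"{m} pairwise comparisons")
print(f"uncorrected family-wise error rate: {familywise_error_rate(alpha, m):.3f}")
print(f"Bonferroni per-test threshold: {bonferroni_alpha(alpha, m):.4f}")
```

      With six groups there are 15 pairwise comparisons, and under the independence assumption the chance of at least one false positive at an unadjusted threshold of 0.05 exceeds 50%.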

      In our case, multiple comparisons tests were essential after the two-way ANOVA because they helped to objectively identify specific treatment group differences and control the overall error rate when we were extracting our conclusions, thereby reducing any risk of biases in our conclusions.

      A one-way ANOVA is used to test the effect of a single variable with more than two groups contained in the dataset. For example, in our case if you only want to test how different 'stimulation' groups affect ECAR or OCR, only in unprimed macrophages, a one-way ANOVA would be used.

      The current study used two-way ANOVA to test the effects of two variables (priming and stimulation, or in some cases priming and inhibition) each containing 3 groups, and see if there is any interaction between the two factors. For example, in our case this allowed us to examine how the 'stimulation' and the 'cytokine' priming affect ECAR/OCR levels and to determine if the effect of 'stimulation' depends on the 'cytokine' priming.
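
      The idea of an interaction between priming and stimulation can be sketched with a hand-rolled, balanced two-way ANOVA. Everything below is illustrative: the values are synthetic, the function name is hypothetical, and the sketch ignores the donor linkage in the real data and stops at F statistics (no P values, no Tukey post-test). It is not the analysis pipeline used by the authors.

```python
from statistics import mean


def two_way_anova(cells):
    """Balanced two-way ANOVA on {(level_a, level_b): [replicates]}.

    Returns sums of squares and F statistics for both main effects
    and their interaction. Every cell must hold the same number of values.
    """
    levels_a = sorted({a for a, _ in cells})
    levels_b = sorted({b for _, b in cells})
    n = len(next(iter(cells.values())))
    if any(len(v) != n for v in cells.values()) or n < 2:
        raise ValueError("need a balanced design with at least 2 replicates per cell")
    if len(cells) != len(levels_a) * len(levels_b):
        raise ValueError("every combination of levels must be present")

    grand = mean(y for values in cells.values() for y in values)
    mean_a = {a: mean(y for b in levels_b for y in cells[(a, b)]) for a in levels_a}
    mean_b = {b: mean(y for a in levels_a for y in cells[(a, b)]) for b in levels_b}
    mean_cell = {key: mean(values) for key, values in cells.items()}

    ss_a = len(levels_b) * n * sum((m - grand) ** 2 for m in mean_a.values())
    ss_b = len(levels_a) * n * sum((m - grand) ** 2 for m in mean_b.values())
    ss_cells = n * sum((m - grand) ** 2 for m in mean_cell.values())
    ss_ab = ss_cells - ss_a - ss_b  # variation not explained by either main effect
    ss_err = sum((y - mean_cell[k]) ** 2 for k, values in cells.items() for y in values)

    df_a, df_b = len(levels_a) - 1, len(levels_b) - 1
    df_ab, df_err = df_a * df_b, len(cells) * (n - 1)
    ms_err = ss_err / df_err
    return {
        "ss": {"a": ss_a, "b": ss_b, "ab": ss_ab, "error": ss_err},
        "F": {"a": ss_a / df_a / ms_err, "b": ss_b / df_b / ms_err,
              "ab": ss_ab / df_ab / ms_err},
    }


# Synthetic effects (arbitrary units) for the two factors named in the response.
priming = {"none": 0.0, "IFN-g": 10.0, "IL-4": -2.0}
stimulation = {"unstim": 0.0, "Mtb": 5.0, "LPS": 8.0}
noise = [-1.0, 0.0, 1.0]

# Purely additive data: the effect of stimulation is the same for every priming level.
additive = {(p, s): [20.0 + pe + se + e for e in noise]
            for p, pe in priming.items() for s, se in stimulation.items()}
# Same data, except that IL-4 priming blunts the LPS response (an interaction).
interacting = dict(additive)
interacting[("IL-4", "LPS")] = [y - 6.0 for y in additive[("IL-4", "LPS")]]

print(two_way_anova(additive)["F"])
print(two_way_anova(interacting)["F"])
```

      In the additive data the interaction sum of squares is essentially zero; once the LPS response depends on IL-4 priming, the interaction term becomes large. This is the question a one-way ANOVA cannot ask.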

      • More justification could be given for the dose of IFNγ used for priming. Inflammatory priming is typically performed with a "low-dose" treatment (e.g., ~1 ng/ml), whereas the authors use 10 ng/ml, which would be considered a high dose. It would be useful to repeat select experiments with a more standard low-dose treatment of IFNg to demonstrate that this is also sufficient to induce the observed metabolic changes.

      Previous work has identified little difference in the response of AM and peripheral monocytes to low versus high doses of IFN-γ (6). We have inserted the following into the study limitations (lines 479-481).  

      “Furthermore, only one dose of IFN-γ was utilised due to limitations in AM yield, however, recently both low and high doses of IFN-γ have been shown to have similar effects on AM in vitro (6).”

      • Check for accuracy of the Fig.4 legend. Also check that 4G and 4B math is consistent.

      The legend for Figure 4 has been amended to replace the incorrect A,B with G,H. The math has been double checked for accuracy and is correct. Only 3 out of 10 MDM donors produced IL-1β in the absence of IFN-γ in Figure 4B; therefore, the average used to calculate the data represented in Figure 4G was brought down markedly by donors who produced little or no IL-1β.

      • Functional plasticity is a vague term and difficult to interpret in this context. It is stated that AM have greater functional plasticity, but MDMs appear to have greater capacity to secrete IL-1β and respond more robustly to IL-4 in terms of T cell stimulation. On that note, the claims regarding antigen presentation would be more impactful if a direct comparison of antigen presentation capacity was made between AM and MDM.

      Our data suggests that AM have a greater ability to alter cytokine production, such as IL-1β. To account for the different culture and plating requirements of MDM versus AM, cytokine concentrations were normalised and expressed as fold change. This gives a more controlled and accurate comparison of the ability of IFN-γ or IL-4 to modulate cytokine production in AM compared with MDM.

      The terms ‘functional plasticity’ and ‘phenotypic plasticity’ have now been defined in the manuscript in lines 60-63.

      We have therefore added the following sentence into the conclusion of the manuscript (lines 490-493). “Cumulatively, the data presented herein suggests that the MDM may be more phenotypically plastic than the AM, while the AM have enhanced functional plasticity in their ability to produce cytokine after exposure to Th1 and Th2 cytokines.”

      However, we acknowledge that the MDM may be regarded as more plastic because of their ability to respond robustly to IL-4, whereas the phenotypic and functional changes in the AM in response to IL-4 are more limited. Whilst the focus of our work was to determine if AM are a tractable target to promote immunity in the lungs through upregulation of pro-inflammatory effector function, their ability to downregulate inflammation in response to IL-4 is less profound than that of MDM.

      We acknowledge the shortcomings of our work which did not allow us to directly measure antigen processing in the AM, due to limitations in the cellular yield from BALF. We have edited the text (lines 251-252 and 286) to clarify this for the reader.  

      • Inconsistent normalization complicates interpretation of metabolic data. For example, it is unclear whether changes in glycolysis and oxidative phosphorylation in primed cells are due to priming or subsequent treatments. Check harmony of methods for analysis of "metabolic assays" with Fig.1 data, axis, and legend.

      We have addressed this comment, which is similar to points made by the other reviewers and amended the manuscript to increase clarity. These changes are outlined in the response to reviewer 1, point 5 (recommendations for the author). In addition, we have amended the metabolic assay method (lines 111-112) to state that “Post stimulation the ECAR and OCR were continually sampled at 20-minute intervals for times indicated.”

      • A direct comparison of cytokine production after priming and stimulation with Mtb or LPS is limited by inconsistent axes. The data may not support a claim that AM has greater "functional plasticity" without a direct comparison of antigen presentation. Moreover, MDM secrete more IL-1β than AM. The claim that AM "have increased ability to produce all cytokines assayed in response to Mtb stimulation" does not appear to be supported by the data.

      We have amended the text to clarify this issue (lines 313-315). “These data suggest that the AM have greater functional plasticity in terms of their ability to upregulate cytokine production in response to IFN-γ, compared with the MDM. IFN-γ primed AM have enhanced IL-10 and TNF production in response to Mtb and LPS, respectively.”  

      We have amended the manuscript and have replaced “IFN-γ primed AM have increased ability to produce all cytokines assayed in response to Mtb stimulation” with the following (lines 421-423): “IFN-γ primed AM stimulated with Mtb have significantly more production of IL-1β, TNF and IL-10 compared with unprimed controls. This is in contrast with IFN-γ primed MDM which only upregulate TNF compared to their unprimed controls.”

      • AM populations could be defined experimentally.

      Airway macrophages were adherence purified from bronchoalveolar lavage fluid and defined as CD68+CD14+ as per rebuttal figure 1. The purpose of this study was to examine if human peripherally derived or lung resident macrophages were plastic in response to the classical polarising cytokines IFN-γ and IL-4. We have identified that the AM and MDM do indeed have different functional and metabolic responses to these cytokines. However, determining functional differences within the AM subpopulations is beyond the scope of the current study and hampered by low cell numbers in human BALF.

      References

      (1) Conzelmann M, Wagner AH, Hildebrandt A, Rodionova E, Hess M, Zota A, Giese T, Falk CS, Ho AD, Dreger P, Hecker M, Luft T. IFN-γ activated JAK1 shifts CD40-induced cytokine profiles in human antigen-presenting cells toward high IL-12p70 and low IL-10 production. Biochemical pharmacology 2010; 80: 2074-2086.

      (2) Fries KM, Sempowski GD, Gaspari AA, Blieden T, Looney RJ, Phipps RP. CD40 Expression by human fibroblasts. Clinical Immunology and Immunopathology 1995; 77: 42-51.

      (3) Gu W, Chen J, Yang L, Zhao KN. TNF-α promotes IFN-γ-induced CD40 expression and antigen process in Myb-transformed hematological cells. TheScientificWorldJournal 2012; 2012: 621969.

      (4) Hershman MJ, Appel SH, Wellhausen SR, Sonnenfeld G, Polk HC, Jr. Interferon-gamma treatment increases HLA-DR expression on monocytes in severely injured patients. Clinical and experimental immunology 1989; 77: 67-70.

      (5) Maenaka A, Kenta I, Ota A, Miwa Y, Ohashi W, Horimi K, Matsuoka Y, Ohnishi M, Uchida K, Kobayashi T. Interferon-γ-induced HLA Class II expression on endothelial cells is decreased by inhibition of mTOR and HMG-CoA reductase. FEBS open bio 2020; 10: 927-936.

      (6) Thiel BA, Lundberg KC, Schlatzer D, Jarvela J, Li Q, Shaw R, Reba SM, Fletcher S, Beckloff SE, Chance MR, Boom WH, Silver RF, Bebek G. Human alveolar macrophages display marked hyporesponsiveness to IFN-γ in both proteomic and gene expression analysis. PLoS One 2024; 19: e0295312.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      (1) The authors' primary research question revolves around the inquiry of "how far in advance semantic information might become available from parafoveal preview." In contrast to prior studies, the current research seeks to achieve a breakthrough in terms of timing by employing innovative technology. They mention in the manuscript that "most of these studies have been limited to measuring parafoveal preview from fixations to an immediately adjacent word... We tackle these core issues using a new technique that combines the use of frequency tagging and the measurement of magnetoencephalography (MEG)-based signals." However, the argumentation for how this new technology constitutes a breakthrough is not sufficiently substantiated. Specifically, there are two aspects that require further clarification. Firstly, the authors should clarify the importance of investigating the timing of semantic integration in their research question. They need to justify why previous studies focusing on the preview effect during fixations to an immediately adjacent word cannot address their specific inquiry about "how far in advance semantic information might become available from parafoveal preview," which requires examining parafoveal processing (POF). Secondly, in terms of the research methodology, the authors should provide a more comprehensive explanation of the advantages offered by MEG technology in the observation of the timing of semantic integration compared to the techniques employed in prior research. Indeed, the authors have overlooked some rather significant studies in this area. For instance, the research conducted by Antúnez, Milligan, Hernández-Cabrera, Barber, & Schotter in 2022 addresses the same research question mentioned in the current study and employs a similar experimental design. Importantly, they utilize a natural reading paradigm with synchronized ERP and eye-tracking recordings. 
      Collectively, these studies, along with the series of prior research studies employing ERP techniques and RSVP paradigms discussed by the authors in their manuscript, provide ample evidence that semantic information becomes available and integrated from words before fixation occurs. Therefore, the authors should provide a more comprehensive citation of relevant research and delve deeper into explaining the potential contributions of their chosen technology to this field.

      We express our gratitude to the reviewer for providing insightful comments. Firstly, we clarify the advantages of the RIFT technique. The revised paragraph is on Page 4 with tracked changes and is copied as follows:

      “…… The RIFT technique provides a notable advantage by generating a signal — the tagging response signal — specifically yoked to just the tagged word. This ensures a clear separation in processing the tagged word from the ongoing processing of other words, addressing a challenge faced by eye tracking and ERP/FRP approaches. Moreover, RIFT enables us to monitor the entire dynamics of attentional engagement with the tagged word, which may begin a few words before the tagged word is fixated.”

      We have also rephrased our research questions in the introduction section on Page 5 with tracked changes:

      “This paradigm allows us to address three questions. First, we aimed to measure when in the course of reading people begin to direct attention to parafoveal words. Second, we sought to ascertain when semantic information obtained through parafoveal preview is integrated into the sentence context. Modulations of pre-target RIFT responses by the contextual congruity of target words would serve as evidence that parafoveal semantic information has not only been extracted and integrated into the sentence context but that it is affecting how readers allocate attention across the text. Third, we explored whether these parafoveal semantic attention effects have any relationship to reading speed.”

      Secondly, we would like to elucidate the significance of investigating the timing of semantic integration and why this complements existing findings of parafoveal processing (POF) during reading. Our manuscript has been revised accordingly, with specific modifications highlighted on Page 2. The revised passage reads as follows:

      “…… eye tracking-based evidence for the extraction of parafoveal semantic information …… was eventually extended into English …… For example, Schotter and Jia (2016) showed preview benefits on early gaze measures for plausible compared to implausible words, even for plausible words that were unrelated to the target. These results demonstrate that semantic information can indeed be extracted from parafoveal words. However, due to the limitations of the boundary paradigm, which only assesses effects after target words have been fixated, it is challenging to precisely determine when and how parafoveal semantic processing takes place. Furthermore, it is generally hard to distinguish between the effects of cross-saccade integration (e.g., mismatch between the preview and the word fixated) and the effects of how differing words fit into the context itself (Veldre and Andrews, 2016a, 2016b).”

      Thirdly, we now better highlight the contributions of the Antúnez et al. (2022) paper, which provides important evidence for parafoveal semantic processing during natural reading. The relevant modifications are highlighted on Page 3. The revised passage is as follows: “Although many of these effects have been measured in the context of unnatural reading paradigms (e.g., the “RSVP flanker paradigm”), similar effects obtain during natural reading. Using the stimuli and procedures from Schotter and Jia (2016), Antúnez et al. (2022) showed that N400 responses, measured relative to the fixation before the target words (i.e., before the boundary change while the manipulated words were in parafoveal preview), were sensitive to the contextual plausibility of these previewed words. These studies suggest that semantic information is available from words before they are fixated, even if that information does not always have an impact on eye fixation patterns.”

      References:

      Schotter ER, Jia A. 2016. Semantic and plausibility preview benefit effects in English: Evidence from eye movements. J Exp Psychol Learn Mem Cogn 42:1839–1866. doi:10.1037/xlm0000281

      Veldre A, Andrews S. 2016a. Is Semantic Preview Benefit Due to Relatedness or Plausibility? J Exp Psychol Hum Percept Perform 42:939–952. doi:10.1037/xhp0000200

      Veldre A, Andrews S. 2016b. Semantic preview benefit in English: Individual differences in the extraction and use of parafoveal semantic information. J Exp Psychol Learn Mem Cogn 42:837–854. doi:10.1037/xlm0000212

      Antúnez M, Milligan S, Andrés Hernández-Cabrera J, Barber HA, Schotter ER. 2022. Semantic parafoveal processing in natural reading: Insight from fixation-related potentials & eye movements. Psychophysiology 59:e13986. doi:10.1111/PSYP.13986

      (2) Further, the authors emphasize semantic integration in their observed results but overlook the intricate relationship between access, priming, and integration. This assertion appears overly confident. Despite using low-constraint sentences and low-predicted targets (lines 439-441), differences between congruent and incongruent conditions may be influenced by word-level factors. For instance, in the first coherent sentence, such as "Last night, my lazy brother came to the party one minute before it was over" (line 1049), replacing the keyword "brother" with an incongruent word could create an incoherent sentence, possibly due to semantic violation, relation mismatch with "lazy," or prediction error related to animate objects. A similar consideration applies to the second example sentence, "Lily says this blue jacket will be a big fashion trend this fall" (line 1050), where the effect might result from a discrepancy between "blue" and an incongruent word. However, the authors do not provide incongruent sentences to substantiate their claims. I recommend that the authors discuss alternative explanations and potentially control for confounding factors before asserting that their results unequivocally reflect semantic integration. My intention is not to dispute the semantic integration interpretation but to stress the necessity for stronger evidence to support this assertion.

      We agree with the reviewer that stimulus control is very critical for this kind of work and apologize for the lack of clarity in the original manuscript.

      (1) We fully agree that word-level factors can be an important confound, which is why we carefully controlled word-level factors in the experimental design. As detailed in the Appendix of the original manuscript, each pair of target words has been strategically embedded into two sentences, allowing for the creation of both congruent and incongruent sentence pairs through the interchange of these words. We now have explicitly specified this design in all sentences, as reflected in the edited manuscript on Page 38. For example, considering the exemplar pair of “brother/jacket”,

      “Last night, my lazy brother/jacket came to the party one minute before it was over.

      Lily says this blue jacket/brother will be a big fashion trend this fall.”

      In this design, the pair of target words is presented in both congruent and incongruent sentences. Participant A reads “lazy brother” and “blue jacket”, while Participant B reads “lazy jacket” and “blue brother”. This approach ensures that the same target words appear in both congruent and incongruent conditions across participants, serving as an effective control for word-level factors.
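
      The counterbalancing described above can be sketched as follows. The two sentence frames and the “brother/jacket” pair come from the response; the function name, the `{target}` placeholder and the rule that alternates list assignment across items are assumptions made for illustration, not details confirmed by the manuscript.

```python
def build_lists(items):
    """Build two counterbalanced stimulus lists from (frame_1, frame_2, word_1, word_2).

    word_1 is congruent with frame_1 and word_2 with frame_2. List A receives
    the congruent versions and list B the swapped (incongruent) versions for
    even-numbered items; the assignment flips for odd-numbered items
    (an assumed rule, so that each list mixes both conditions).
    """
    list_a, list_b = [], []
    for index, (frame_1, frame_2, word_1, word_2) in enumerate(items):
        congruent = [(frame_1.format(target=word_1), "congruent"),
                     (frame_2.format(target=word_2), "congruent")]
        incongruent = [(frame_1.format(target=word_2), "incongruent"),
                       (frame_2.format(target=word_1), "incongruent")]
        first, second = (congruent, incongruent) if index % 2 == 0 else (incongruent, congruent)
        list_a.extend(first)
        list_b.extend(second)
    return list_a, list_b


items = [
    ("Last night, my lazy {target} came to the party one minute before it was over.",
     "Lily says this blue {target} will be a big fashion trend this fall.",
     "brother", "jacket"),
]

list_a, list_b = build_lists(items)
for sentence, condition in list_a + list_b:
    print(f"{condition:12s} {sentence}")
```

      Across the two lists, every target word appears once in a congruent and once in an incongruent frame, so word-level properties such as length or frequency are matched between conditions by construction.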

      (2) We acknowledge that the consideration of word-level information is crucial when making claims about contextual integration in the current study. However, we don’t think there are many cases in the stimulus set where a single feature like animacy is enough to create the mismatch. Instead, the stimuli were written so that it is not possible to strongly predict any word or even a specific semantic feature, so that appreciating the mismatch requires the comprehender to integrate the word into the context (and especially to integrate the word with the immediately preceding one). However, this more local modifier/noun plausibility may behave differently from a more global contextual plausibility, which is a limitation of the stimulus set and has been discussed in the revised manuscript, as indicated by the tracked changes on Page 16, as copied below:

      “Two noteworthy limitations exist in the current study. Firstly, the construction of pretarget–target word pairs consistently follows an adjective–noun phrase structure, potentially leading to semantic violations arising from immediate local incongruence rather than a broader incongruence derived from the entire sentential context. While the context preceding target words was deliberately minimized to ensure a pure effect of bottom-up parafoveal processing rather than the confounding impact of top-down prediction, it is essential to recognize that information from both local and global contexts can exert distinct effects on word processing during natural reading (Wong et al., 2022). Future investigations should incorporate more information-rich contexts to explore the extent to which the parafoveal semantic integration effect observed in this study can be generalized.”

      References:

      Wong R, Veldre A, Andrews S. 2022. Are There Independent Effects of Constraint and Predictability on Eye Movements During Reading? J Exp Psychol Learn Mem Cogn. doi:10.1037/XLM0001206

      Reviewer #2 (Public Review):

      This MEG study used co-registered eye-tracking and Rapid Invisible Frequency Tagging (RIFT) to track the effects of semantic parafoveal preview during natural sentence reading. Unpredictable target words could either be congruent or incongruent with sentence context. This modulated the RIFT response already while participants were fixating on the preceding word. This indicates that the semantic congruency of the upcoming word modulates visual attention demands already in parafoveal preview.

      The quest for semantic parafoveal preview in natural reading has attracted a lot of attention in recent years, especially with the development of co-registered EEG and MEG. Evidence from dynamic neuroimaging methods using innovative paradigms as in this study is important for this debate.

      We express our gratitude to the reviewer for recognizing the significance of our research question in the domain of natural reading.

      Major points:

      (1) The authors frame their study in terms of "congruency with sentence context". However, it is the congruency between adjective-noun pairs that determines congruency (e.g. "blue brother" vs "blue jacket", and examples p. 16 and appendix). This is confirmed by Suppl Figure 1, which shows a significantly larger likelihood of refixations to the pre-target word for incongruent sentences, probably because the pre-target word is most diagnostic for the congruency of the target word. The authors discuss some possibilities as to why there is variability in parafoveal preview effects in the literature. It is more likely to see effects for this simple and local congruency, rather than congruency that requires an integration and comprehension of the full sentence. I'm not sure whether the authors really needed to present their stimuli in a full-sentence context to obtain these effects. This should be explicitly discussed and also mentioned in the introduction (or even the abstract).

      We have addressed this limitation of the study explicitly in the revised manuscript. The modifications can be found in the tracked changes on Page 16 and are copied below:

      “Two noteworthy limitations exist in the current study. Firstly, the construction of pretarget–target word pairs consistently follows an adjective–noun phrase structure, potentially leading to semantic violations arising from immediate local incongruence rather than a broader incongruence derived from the entire sentential context. While the context preceding target words was deliberately minimized to ensure a pure effect of bottom-up parafoveal processing rather than the confounding impact of top-down prediction, it is essential to recognize that information from both local and global contexts can exert distinct effects on word processing during natural reading (Wong et al., 2022). Future investigations should incorporate more information-rich contexts to explore the extent to which the parafoveal semantic integration effect observed in this study can be generalized.”

      References:

      Wong R, Veldre A, Andrews S. 2022. Are There Independent Effects of Constraint and Predictability on Eye Movements During Reading? J Exp Psychol Learn Mem Cogn. doi:10.1037/XLM0001206

      (2) The authors used MEG and provided a source estimate for the tagging response (Figure 2), which unsurprisingly is in the visual cortex. The most important results are presented at the sensor level. This does not add information about the brain sources of the congruency effect, as the RIFT response probably reflects top-down effects on visual attention etc. Was it necessary to use MEG? Would EEG have produced the same results? In terms of sensitivity, EEG is better than MEG as it is more sensitive to radial and deeper sources. This should be mentioned in the discussion and/or methods section.

      Source estimation was exclusively provided for the tagging response rather than the congruency effect because we posit that this conditional contrast would emanate from the same brain regions exhibiting the tagging responses in general. As depicted in the following figure, source localization for the congruency effect was identified in the left association cortex (Brodmann area 18), the same area as the source localization for the tagging response (the negative cluster observed here is due to the incongruent minus congruent contrast). While we agree with the Reviewer that the RIFT result might indicate a top-down effect on visual attention, it is important to note that, due to the low-pass filter property of synapses, observing a tagging response at a high frequency beyond the visual cortex is challenging.

      Author response image 1.

      We discussed the necessity of using MEG in the revised manuscript, with tracked changes on Page 20; the passage is copied below:

      “While the current study was conducted using MEG, these procedures might also work with EEG. If so, this would make our approach accessible to more laboratories as EEG is less expensive. However, there are currently no studies directly comparing the RIFT response in EEG versus MEG. Therefore, it would be of great interest to investigate if the current findings can be replicated using EEG.”

      (3) The earliest semantic preview effects occurred around 100 ms after fixating the pre-target word (discussed around l. 323). This means that at this stage the brain must have processed the pre-target and the target word and integrated their meanings (at some level). Even in the single-word literature, semantic effects at 100 ms are provocatively early. Even studies that tried to determine the earliest semantic effects arrived at around 200 ms (e.g. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3382728/, https://psycnet.apa.org/record/2013-17451-002). The present results need to be discussed in a bit more detail in the context of the visual word recognition literature.

      We have incorporated this valuable suggestion into the discussion section to enhance the clarity of our key result regarding the timing of parafoveal semantic integration. The revised manuscript with tracked changes can be found on Page 14, and the relevant passage is provided below:

      “Our results also provide information about the time course of semantic integration …… by as early as within 100 ms after fixating on the pre-target word. The timing of this parafoveal semantic effect appears remarkably early, considering that typical semantic access for a single word occurs no earlier than around 200 ms, as demonstrated in the visual word recognition literature (Carreiras et al., 2014). For instance, in a Go/NoGo paradigm, the earliest distinguishable brain activity related to category-related semantic information of a word occurs at 160 ms (Amsel et al., 2013; Hauk et al., 2012). Therefore, the RIFT results presented here suggest that natural reading involves parallel processing that spans multiple words. The level of (covert) attention allocated to the target word, as indexed by the significant difference in RIFT responses compared to the baseline interval, was observed even three words in advance (see Figure 2C). This initial increase in RIFT coincided with the target entering the perceptual span (McConkie and Rayner, 1975; Rayner, 1975; Underwood and McConkie, 1985), likely aligning with the initial extraction of lower-level perceptual information about the target. The emerging sensitivity of the RIFT signal to target plausibility, detected around 100 ms after the fixation on the pre-target word, suggests that readers at that time had accumulated sufficient semantic information about the target words and integrated that information with the evolving sentence context. Therefore, it is plausible that the initial semantic processing of the target word commenced even before the pre-target fixation and was distributed across multiple words. This parallel processing of multiple words facilitates rapid and fluent reading.”

      References:

      Carreiras M, Armstrong BC, Perea M, Frost R. 2014. The what, when, where, and how of visual word recognition. Trends Cogn Sci 18:90–98. doi:10.1016/j.tics.2013.11.005

      Amsel BD, Urbach TP, Kutas M. 2013. Alive and grasping: Stable and rapid semantic access to an object category but not object graspability. Neuroimage 77:1–13. doi:10.1016/J.NEUROIMAGE.2013.03.058

      Hauk O, Coutout C, Holden A, Chen Y. 2012. The time-course of single-word reading: Evidence from fast behavioral and brain responses. Neuroimage 60:1462. doi:10.1016/J.NEUROIMAGE.2012.01.061

      McConkie GW, Rayner K. 1975. The span of the effective stimulus during a fixation in reading. Percept Psychophys 17:578–586. doi:10.3758/BF03203972

      Rayner K. 1975. The perceptual span and peripheral cues in reading. Cogn Psychol 7:65–81.

      Underwood NR, McConkie GW. 1985. Perceptual Span for Letter Distinctions during Reading. Read Res Q 20:153. doi:10.2307/747752

      (4) As in previous EEG/MEG studies, the authors found a neural but no behavioural preview effect. As before, this raises the question of whether the observed effect is really "critical" for sentence comprehension. The authors provide a correlation analysis with reading speed, but this does not allow causal conclusions: Some people may simply read slowly and therefore pay more attention and get a larger preview response. Some readers may hurry and therefore not pay attention and not get a preview response. In order to address this, one would have to control for reading speed and show an effect of RIFT response on comprehension performance (or vice versa, with a task that is not close to ceiling performance). The last sentence of the discussion is currently not justified by the results.

      We acknowledge that the correlation analysis between the RIFT effect and reading speed on the group level lacks causality, making it less ideal for addressing this question. We have incorporated this acknowledgment as one of the limitations of the current study in the revised manuscript on Page 16, as indicated by the tracked changes, and the relevant passage is provided below:

      “Two noteworthy limitations exist in the current study. …… Secondly, the correlation analysis between the pre-target RIFT effect and individual reading speed (Figure 5) does not establish a causal relationship between parafoveal semantic integration and reading performance. Given that the comprehension questions in the current study were designed primarily to maintain readers’ attention and the behavioural performance reached a ceiling level, employing more intricate comprehension questions in future studies would be ideal to accurately measure reading comprehension and reveal the impact of semantic parafoveal processing on it.”

      We reformulated the last sentence:

      “These results support the idea that words are processed in parallel and suggest that early and deep parafoveal processing may be important for fluent reading.”

      (5) L. 577f.: ICA components were selected by visual inspection. I would strongly recommend including EOG in future recordings when the control of eye movements is critical.

      We appreciate the reviewer for providing this valuable suggestion. We acknowledge that EOG recordings were not included in the current study due to restrictions on MEG data collection from the University of Birmingham during the COVID-19 pandemic. In our future studies, we will follow the reviewer's suggestion to incorporate EOG recordings in data collection. This addition will facilitate optimal eye movement-related artifact rejection through ICA, as recommended by Dimigen in his methodological paper:

      Dimigen, O. (2020). Optimizing the ICA-based removal of ocular EEG artifacts from free viewing experiments. NeuroImage, 207, 116117.

      (6) The authors mention "saccade planning" a few times. I would suggest looking at the SWIFT model of eye movement control, which is less mechanistic than the dominant EZ-Reader model (https://psycnet.apa.org/record/2005-13637-003). It may be useful for the framing of the study and interpretation of the results (e.g. second paragraph of discussion).

      In the revised manuscript, we have provided a more comprehensive explanation of eye movements/saccade planning, aligning it with the SWIFT model. Please refer to Page 15 with tracked changes, and the updated passage is provided below:

      “The results of the present study are aligned with the SWIFT model of eye movement control in natural reading (Engbert et al., 2005), wherein the activation field linked to a given word is hypothesized to be both temporally and spatially distributed. Indeed, we found that the initial increase in covert attention to the target word occurred as early as three words before, as measured by RIFT responses (Figure 2C). These covert processes enable the detection of semantic incongruity (Figure 3B and Figure 3C). However, it may occur at the non-labile stage of saccade programming, preventing its manifestation in fixation measures of the currently fixated pre-target word (Figure 1B). Therefore, the RIFT technique’s capacity to yoke patterns to a specific word offers a unique opportunity to track the activation field of word processing during natural reading.”

      References:

      Engbert R, Nuthmann A, Richter EM, Kliegl R. 2005. SWIFT: A dynamical model of saccade generation during reading. Psychol Rev 112:777–813. doi:10.1037/0033-295X.112.4.777

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      While the manuscript is well-written and presents a structured analysis of the data, it requires further clarification and substantiation regarding the originality of the research questions, the advantages of the proposed methodology, and the interpretation of the results related to semantic integration. Additional references and a more thorough discussion of related research are needed to strengthen the manuscript's contribution to the field.

      We appreciate the reviewer's kind words about this manuscript and the insightful comments and suggestions provided. In the revised manuscript, we have now placed additional emphasis on the importance of investigating semantic integration within the realm of parafoveal processing in natural reading. We have clarified the advantages of employing MEG and RIFT and expanded upon our results in the context of Antúnez et al.'s 2022 paper, as suggested by the reviewer.

      Reviewer #2 (Recommendations For The Authors):

      (1) L. 59: The "N400" has been linked to much more than "semantic access". I think it is widely accepted that "access" happens (or at least begins) earlier, and that the N400 reflects high-level integration processes etc.

      Earlier debates about whether the N400 is more linked to access or integration have been resolved in favour of an access account, but with a growing appreciation of the blurred boundaries between constructs like access, priming, and integration, as Reviewer 1 also pointed out in comment #2.

      (2) L. 177: I wasn't sure about the selection of sensors. Were the same sensors used for all participants (whether they had a tagging response or not)?

      We appreciate the reviewer for highlighting the confusion regarding the sensor selection procedure in the study. In response, we have added further clarifications about this procedure in the Method section of the revised manuscript. The relevant changes can be found on Page 25 with tracked changes, and the modified passage is reproduced below:

      "Please note that the tagging response sensors may vary in number across participants (7.9 ± 4.5 sensors per participant, M ± SD). Additionally, they may have a different but overlapping spatial layout, primarily over the visual cortex. For the topography of all tagging response sensors, please refer to Figure 2A."

      (3) Ll. 247ff.: I don't understand the idea of a "spill-over effect". The future cannot spill into the past. Or does this refer to possible artefacts or technical problems?

      In the revised manuscript, we have rephrased this passage with tracked changes on Page 11, and the updated version is provided below:

      “We conducted a similar analysis of the coherence measured when participants fixated the target word and found no significant modulations related to the contextual congruity of that target word. …… Thus, the parafoveal semantic integration effect identified during the pre-target intervals cannot be attributed to signal contamination from fixations on the target word induced by the temporal smoothing of filters.”

      (4) I struggled to follow the "internal attention" explanation for the paradoxical RIFT effect (p. 11/12).

      We appreciate the reviewer for pointing out the confusion, and we have rephrased the passage in the revised manuscript with tracked changes on Page 13. The revised version is provided below:

      "Previous work has demonstrated that tagging responses decrease as attention shifts from an external task (e.g., counting visual targets) to an internal task (e.g., counting heartbeats) (Kritzman et al., 2022). Similarly, in a reading scenario, visually perceiving the flickering word constitutes an external task, while the internal task involves the semantic integration of previewed information into the context. If more attentional resources are internally directed when faced with the challenge of integrating a contextually incongruent word, fewer attentional resources would remain for processing the flickering word. This may be the kind of shift reflected in the reduction in RIFT responses."

      References:

      Kritzman L, Eidelman-Rothman M, Keil A, Freche D, Sheppes G, Levit-Binnun N. 2022. Steady-state visual evoked potentials differentiate between internally and externally directed attention. Neuroimage 254:119133.

      (5) L. 572: Why was detrending necessary on top of a 0.5 Hz high-pass filter? Was detrending applied to the continuous raw data, or to epochs? Was it just the linear trend or other polynomial terms?

      We agree with the Reviewer that, given the prior application of a 0.5 Hz high-pass filter to the data, the detrending does not alter the data. Nonetheless, we included this procedure in the manuscript for the sake of completeness. In the revised manuscript, we have provided additional clarification on this point, as indicated by the tracked changes on Page 23. The modified passage is presented below:

      "Subsequently, detrending was applied individually to each channel of the continuous raw data to factor out the linear trend."
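
      As a minimal sketch of what removing a linear trend from each channel involves, the following fits a least-squares line to every channel and subtracts it. The function names and the toy data are illustrative assumptions and not the authors' pipeline, which would normally rely on the analysis toolbox's own detrending routine.

```python
def detrend_linear(samples):
    """Subtract the least-squares straight line from one channel (needs >= 2 samples)."""
    n = len(samples)
    t_mean = (n - 1) / 2.0                      # mean of sample indices 0..n-1
    y_mean = sum(samples) / n
    stt = sum((t - t_mean) ** 2 for t in range(n))
    sty = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(samples))
    slope = sty / stt
    # Remove both the offset and the slope of the fitted line.
    return [y - (y_mean + slope * (t - t_mean)) for t, y in enumerate(samples)]


def detrend_channels(data):
    """Apply the linear detrend independently to each channel (list of lists)."""
    return [detrend_linear(channel) for channel in data]


# Hypothetical two-channel recording: a pure ramp and a ramp plus a constant offset.
raw = [
    [2.0 * t + 1.0 for t in range(10)],
    [-0.5 * t + 3.0 for t in range(10)],
]
clean = detrend_channels(raw)
```

      A slow drift that survives a 0.5 Hz high-pass filter is usually small, which is consistent with the statement above that this step leaves the data essentially unchanged.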

      (6) Source analysis, p. 25f.: How was the beamformer regularized?

      This information was already included in the original manuscript on Page 26. The original text is provided below for reference:

      “No regularisation was performed to the CSD matrices (lambda = 0).”

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This important study reports a novel measurement for the chemotactic response to potassium by Escherichia coli. The authors convincingly demonstrate that these bacteria exhibit an attractant response to potassium and connect this to changes in intracellular pH level. However, some experimental results are incomplete, with additional controls/alternate measurements required to support the conclusions. The work will be of interest to those studying bacterial signalling and response to environmental cues.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      This paper shows that E. coli exhibits a chemotactic response to potassium by measuring both the motor response (using a bead assay) and the intracellular signaling response (CheY phosphorylation level via FRET) to step changes in potassium concentration. They find that an increase in potassium concentration induces a considerable attractant response, with an amplitude larger than that of aspartate, and that cells can quickly adapt (but possibly imperfectly). The authors propose that the mechanism for potassium response is through modifying intracellular pH; they find both that potassium modifies pH and that other pH modifiers induce similar attractant responses. It is also shown, using Tar- and Tsr-only mutants, that these two chemoreceptors respond to potassium differently. Tsr has a standard attractant response, while Tar has a biphasic response (repellent-like then attractant-like). Finally, the authors use computer simulations to study the swimming response of cells to a periodic potassium signal secreted from a biofilm and find a phase delay that depends on the period of oscillation.

      Strengths:

      The finding that E. coli can sense and adapt to potassium signals and the connection to intracellular pH is quite interesting and this work should stimulate future experimental and theoretical studies regarding the microscopic mechanisms governing this response. The evidence (from both the bead assay and FRET) that potassium induces an attractant response is convincing, as is the proposed mechanism involving modification of intracellular pH.

      Weaknesses:

      The authors show that changes in pH impact fluorescent protein brightness and modify the FRET signal; this measurement explains the apparent imprecise adaptation they measured. However, this effect reduces confidence in the quantitative accuracy of the FRET measurements. For example, part of the potassium response curve (Fig. 4B) can be attributed to chemotactic response and part comes from the pH modifying the FRET signal. Measuring the full potassium response curve of the no-receptor mutants as a control would help quantify the true magnitude of the chemotactic response and the adaptation precision to potassium.

      Response: We thank the reviewer for the suggestion. We have now measured the full potassium response curve for the no-receptor mutant (HCB1414-pVS88), as shown in Fig. S4. We characterized the pH effects on CFP and YFP channels at different concentrations of KCl, and the relationship between the ratio of the signal post- to pre-KCl addition and the KCl concentration was established for both channels, as shown in Fig. S4C. The pH-corrected signal after KCl addition for strains with receptors was obtained by dividing the original signal after KCl addition by this ratio at the specific KCl concentration. This was done for both CFP and YFP channels. The pH-corrected responses for the Tar-only and Tsr-only strains are represented by red dots in Fig. 5BC. The recalculated response curve and adaptation curve for the wild-type strain are shown in Fig. S5. The same correction was applied to Fig. 3 as well. We also re-performed the simulations using the corrected dose-response curve and replotted Fig. 6, though the simulation results did not change much.

      We have now added a subsection “Revised FRET responses by correcting the pH effects on the brightness of eCFP and eYFP” at line 296 in “Results” to describe this.
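
      The correction described in this response can be sketched as follows, under stated assumptions. The no-receptor post/pre ratios are represented here as a simple lookup table keyed by KCl concentration, and all numbers are made up for illustration; the actual relationship is the one reported in Fig. S4C, and the function name `ph_correct` is hypothetical.

```python
def ph_correct(signal_post, kcl_mM, ratio_by_kcl):
    """Divide a post-KCl channel signal by the no-receptor post/pre ratio
    measured at the same KCl concentration."""
    ratio = ratio_by_kcl[kcl_mM]
    return [s / ratio for s in signal_post]


# Hypothetical post/pre ratios from a no-receptor strain, one table per channel.
cfp_ratio = {10.0: 0.97, 30.0: 0.92}
yfp_ratio = {10.0: 0.95, 30.0: 0.88}

# Hypothetical raw photon counts recorded after adding 30 mM KCl.
cfp_post = [920.0, 925.0, 930.0]
yfp_post = [880.0, 884.0, 890.0]

# Correct each channel separately, then form the YFP/CFP ratio.
cfp_corr = ph_correct(cfp_post, 30.0, cfp_ratio)
yfp_corr = ph_correct(yfp_post, 30.0, yfp_ratio)
fret_ratio = [y / c for y, c in zip(yfp_corr, cfp_corr)]
```

      Correcting the two channels separately matters because the two fluorescent proteins differ in pH sensitivity, so a single correction factor applied to the ratio would not capture the channel-specific change.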

      The measured response may also be impacted by adaptation. For other strong attractant stimuli, the response typically shows a low plateau before it recovers (adapts). However, in the case of potassium, the FRET signal does not have an obvious plateau following the stimulus. Do the authors have an explanation for that? One possibility is that the cells may have already partially adapted when the response reaches its minimum, which could indicate a different response and/or adaptation dynamics from that of a regular chemo-attractant. In any case, directly measuring the response to potassium in mutants without adaptation enzymes (CheR, CheB) and with the receptors in different methylation levels would shed more light on the problem.

      Response: We appreciate the reviewer’s insightful questions. To observe the low plateau before adaptation, a saturating amount of attractant should be added in a stepwise manner. According to the dose-response curve we measured for potassium, a saturating amount of potassium would be close to 100 mM. In fact, there is a small segment of the low plateau in the step response to 30 mM KCl (Fig. 4C or Fig. S5A). To observe more of this low plateau, we could have used a higher concentration of KCl. However, a stimulation higher than 30 mM KCl will induce substantial physiological changes in the cell, resulting in a significant decrease in fluorescence for both channels (Fig. S7). Therefore, the range of KCl concentration that can be reliably applied in FRET measurements is limited.

      The half-time of adaptation at 30 mM KCl was measured to be approximately 80 s, demonstrating a faster adaptation than 0.1 mM MeAsp, which induced a similar magnitude of response. Nevertheless, this is still significantly slower than the time required for medium exchange in the flow chamber, which takes less than 10 s to replace 99% of the medium. Thus, the effect on the measured response magnitude due to adaptation should be small (less than 10%).
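
      The bound quoted above can be checked with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that recovery follows a single exponential with the measured half-time; the real adaptation kinetics need not be exponential, and the function name is hypothetical.

```python
def fraction_recovered(t_s, half_time_s):
    """Fraction of the response already adapted after t_s seconds,
    assuming exponential recovery with the given half-time."""
    return 1.0 - 0.5 ** (t_s / half_time_s)


# About 80 s adaptation half-time versus up to 10 s for medium exchange.
loss = fraction_recovered(10.0, 80.0)
print(f"{loss:.3f}")  # prints 0.083
```

      Under this assumption, roughly 8% of the response would have adapted by the time the medium is fully exchanged, which is in line with the stated bound of less than 10%.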

      We thank the reviewer for the suggestion of measuring the response to potassium in mutants without adaptation enzymes (CheR, CheB) and with the receptors in different methylation levels. However, these mutants are typically less sensitive than the wild-type, exhibiting higher values of K0.5 (Sourjik & Berg, PNAS 99:123, 2002), and thus require an even higher KCl concentration to see the low plateau. Consistent with this, we attempted to measure the response to potassium in a cheRcheB mutant (HCB1382-pVS88). As shown in Fig. R1 below, there is no response to up to 30 mM KCl, suggesting that the sensitive region of the mutant is beyond 30 mM KCl.

      The relevant text was added at line 413-424.

      Author response image 1.

      The response of the cheRcheB mutant (HCB1382-pVS88) to different concentrations of KCl. The blue solid line denotes the original signal, while the red dots represent the pH-corrected signal. The vertical purple (green) dashed lines indicate the moment of adding (removing) 0.01 mM, 0.1 mM, 0.3 mM, 1 mM, 3 mM, 10 mM and 30 mM KCl, in chronological order.

      There seems to be an inconsistency between the FRET and bead assay measurements, the CW bias shows over-adaptation, while the FRET measurement does not.

      Response: We thank the reviewer for pointing this out. We have now demonstrated that the imprecise adaptation shown in the FRET assay primarily resulted from the pH-induced intensity change of the fluorescent proteins. As shown in Fig. S5A&C, the FRET signal also shows over-adaptation, similar to the bead assay, when we recalculated the response by correcting the CFP and YFP channels.

      We have now clarified this at line 315.

      The small hill coefficient of the potassium response curve and the biphasic response of the Tar-only strain, while both very interesting, require further explanation since these are quite different than responses to more conventional chemoattractants.

      Response: We thank the reviewer for pointing this out. We have now recalculated the pH-corrected results for the dose-response curve (Fig. S5) and the biphasic response of the Tar-only strain (Fig. 5C). The new Hill coefficient is 0.88 ± 0.14 (mean ± SD), which is close to that of the response to MeAsp (1.2) (ref. 46). We suspect that this Hill coefficient of slightly less than 1 results from the different responses of the Tar and Tsr receptors to potassium.

      The Tar-only strain exhibits a repellent response to stepwise addition of low concentrations of potassium less than 10 mM, and a biphasic response above (Fig. 5C). This biphasic response might result from additional pH-effects on the activity of intracellular enzymes such as CheRB and CheA, which may have a different timescale and response from the Tar receptor. We have now added the penultimate paragraph in “Discussion” to talk about the response of the Tar-only strain.
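
      For readers unfamiliar with how a Hill coefficient is extracted from a dose-response curve, here is a minimal sketch using the classic Hill plot (a straight-line fit of the logit of the normalized response against log concentration). The data are synthetic, generated from a Hill function with illustrative parameters, and the fitting method is an assumption; the response does not state which fitting procedure the authors used.

```python
import math


def hill(c, k_half, n):
    """Normalized Hill response (between 0 and 1) at concentration c."""
    return c ** n / (c ** n + k_half ** n)


def fit_hill(concs, responses):
    """Least-squares line through the Hill plot:
    ln(r / (1 - r)) = n * ln(c) - n * ln(K).
    Responses must be normalized to lie strictly between 0 and 1."""
    xs = [math.log(c) for c in concs]
    ys = [math.log(r / (1.0 - r)) for r in responses]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    n = sxy / sxx                       # slope is the Hill coefficient
    k_half = math.exp(mx - my / n)      # from intercept = -n * ln(K)
    return n, k_half


# Synthetic, noise-free example (concentrations in mM; parameters are illustrative).
concs = [0.1, 0.3, 1.0, 3.0, 10.0, 30.0]
responses = [hill(c, 3.0, 0.88) for c in concs]
n_fit, k_fit = fit_hill(concs, responses)
```

      A coefficient near or below 1 indicates a shallow, non-cooperative curve; a sum of two receptor populations with different half-maximal concentrations would also flatten the apparent curve, which fits the explanation offered above.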

      Reviewer #2 (Public Review):

      Summary:

      Zhang et al investigated the biophysical mechanism of potassium-mediated chemotactic behavior in E. coli. Previously, it was reported by Humphries et al that the potassium waves from oscillating B. subtilis biofilm attract P. aeruginosa through chemotactic behavior of motile P. aeruginosa cells. It was proposed that K+ waves alter PMF of P. aeruginosa. However, the mechanism of this behaviour remained elusive. In this study, Zhang et al demonstrated that motile E. coli cells accumulate in regions of high potassium levels. They found that this behavior likely results from the chemotaxis signalling pathway, mediated by an elevation of intracellular pH. Overall, a solid body of evidence is provided to support the claims. However, the impacts of pH on the fluorescent proteins need to be better evaluated. In its current form, the evidence is insufficient to say that the fluorescence intensity ratio results from FRET. It may well be an artefact of pH change. Nevertheless, this is an important piece of work. The text is well written, with a good balance of background information to help the reader follow the questions investigated in this research work.

      In my view, the effect of pH on the FRET between CheY-eYFP and CheZ-eCFP is not fully examined. The authors demonstrated in Fig. S3 that CFP intensity itself changes with KCl, likely due to pH. They showed that CFP itself is affected by pH. This result raises a question of whether the FRET data in Figs. 3-5 could result from the intensity changes of FPs, but not FRET. The measured dynamics may have nothing to do with the interaction between CheY and CheZ. It should be noted that CFP and YFP have different sensitivities to pH. So, the measurement is likely confounded by the change in intracellular pH. Without further experiments to evaluate the effect of pH on CFP and YFP, the data using this FRET pair is inconclusive.

      Response: We thank the reviewer for pointing this out. We have now measured the full potassium response curve for the no-receptor mutant (HCB1414-pVS88), as shown in Fig. S4. We characterized the pH effects on CFP and YFP channels at different concentrations of KCl, and the relationship between the ratio of the signal post- to pre-KCl addition and the KCl concentration was established for both channels, as shown in Fig. S4C. The pH-corrected signal after KCl addition for strains with receptors was obtained by dividing the original signal after KCl addition by this ratio at the specific KCl concentration. This was done for both CFP and YFP channels. The pH-corrected responses for the Tar-only and Tsr-only strains are represented by red dots in Fig. 5BC. The recalculated response curve and adaptation curve for the wild-type strain are shown in Fig. S5. The same correction was applied to Fig. 3 as well. We also re-performed the simulations using the corrected dose-response curve and replotted Fig. 6, though the simulation results did not change much.

      We have now added a subsection “Revised FRET responses by correcting the pH effects on the brightness of eCFP and eYFP” at line 296 in “Results” to describe this.

      The data in Figure 1 is convincing. It would be helpful to include example videos. There is also ambiguity in the method section for this experiment. It states that 100 mM KCl was flowed into the source channel. However, it is not clear if 100 mM KCl was prepared in water or in the potassium-depleted motility buffer. If KCl was prepared with water, there would be a gradient of other chemicals in the buffer, which would confound the data.

      Response: We apologize for the ambiguity. The KCl solution used in this work was prepared in the potassium-depleted motility buffer. We have now clarified this at both lines 116 and 497. We have also provided an example video, Movie S1, with the relevant text added at line 123.

      The authors show FRET data with both KCl and K2SO4 and conclude that the chemotactic response mainly results from potassium ions. However, this was only measured by FRET. It would be more convincing if the motility assay in Fig. 1 were also performed with K2SO4.

      Response: We thank the reviewer for the suggestion. The aim of comparing the responses to KCl and K2SO4 was to determine the role of chloride ions in the response and to prove that the chemotactic response of E. coli to KCl comes primarily from its response to potassium ions. It is more sensitive to compare the responses to KCl and K2SO4 by using the FRET assay. In contrast, the microfluidic motility assay is less sensitive in revealing the difference in the chemotactic responses, making it difficult to determine the potential role of chloride ions.

      Methods:

      • Please clarify the promoters used for the constitutive expression of FliCsticky and LacI.

      Response: The promoters used for the constitutive expression of LacIq and FliCsticky were the Iq promoter and the native promoter of fliC, respectively (ref. 57).

      Now these have been clarified at line 471.

      • Fluorescence filters and imaging conditions (exposure time, light intensity) are missing.

      Response: Thank you for the suggestion. We have now added more descriptions at lines 535-546: The FRET setup was based on a Nikon Ti-E microscope equipped with a 40× 0.60 NA objective. The illumination light was provided by a 130-W mercury lamp, attenuated by a factor of 1024 with neutral density filters, and passed through an excitation bandpass filter (FF02-438/24-25, Semrock) and a dichroic mirror (FF458-Di02-25x36, Semrock). The epifluorescent emission was split into cyan and yellow channels by a second dichroic mirror (FF509-FDi01-25x36, Semrock). The signals in the two channels were then filtered by two emission bandpass filters (FF01-483/32-25 and FF01-542/32-25, Semrock) and collected by two photon-counting photomultipliers (H7421-40, Hamamatsu, Hamamatsu City, Japan), respectively. Signals from the two photomultipliers were recorded at a sampling rate of 1 Hz using a data-acquisition card installed in a computer (USB-1901(G)-1020, ADlink, New Taipei, Taiwan).

      • Please clarify if the temperature was controlled in motility assays.

      Response: All measurements in our work were performed at 23 ℃. It was clarified at line 496.

      • L513. It is not clear how theta was selected. Was theta set to be between 0 and pi? If not, P(theta) can be negative?

      Response: The θ was set to be between 0 and π. This has now been added at line 581.

      • Typo in L442 (and) and L519 (Koff)

      Response: Thank you. Corrected.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) From the motor measurements the authors find that the CW bias over-adapts to a level larger than prestimulus, but this is not seen in the FRET measurements. What causes this inconsistency? Fig. 2D seems to rule out any change in CheY binding to the motor.

      Response: We thank the reviewer for pointing this out. We have now demonstrated that the imprecise adaptation shown in the FRET assay primarily resulted from the pH-induced intensity change of the fluorescent proteins. As shown in Fig. S5A&C, the FRET signal also shows over-adaptation, similar to the bead assay, when we recalculated the response by correcting the CFP and YFP channels.

      We have now clarified this at line 315.

      (2) It would be useful to compare the response amplitude for potassium (Fig. 3C) to a large concentration of both MeAsp and serine. This is a fairer comparison since your work shows potassium acts on both Tar and Tsr. Alternatively, testing a much larger concentration (~10^6 micromolar) at which MeAsp also binds to Tsr would also be useful.

      Response: We thank the reviewer for pointing this out. We have now recalculated the response to potassium by correcting the pH-induced effects on fluorescence intensity of CFP and YFP. The response to 30 mM KCl was 1.06 ± 0.10 times as large as that to 100 μM MeAsp. The aim of the comparison between the responses to potassium and MeAsp was to provide an idea of the magnitude of the chemotactic response to potassium. The stimulus of 100 μM MeAsp is already a saturating amount of attractant and induces zero-kinase activity, thus using a higher stimulus (adding serine or a larger concentration of MeAsp) is probably not needed. Moreover, a larger concentration (~10^6 micromolar) of MeAsp would also induce an osmotactic response.

      (3) The fitted Hill coefficient (~0.5) to the FRET response curve is quite small and the authors suggest this indicates negative cooperativity. Do they have a proposed mechanism for negative cooperativity? Have similar coefficients been measured for other responses?

      Response: We thank the reviewer for pointing this out. We have now recalculated the pH-corrected results for the dose-response curve (Fig. S5). The new Hill coefficient is 0.88 ± 0.14 (mean ± SD), which is close to that of the response to MeAsp (1.2) (ref. 46). We suspect that this Hill coefficient of slightly less than 1 results from the differing responses of Tar and Tsr receptors to potassium.

      (3a) The authors state a few times that the response to potassium is "very sensitive", but the low Hill coefficient indicates that the response is not very sensitive (at least compared to aspartate and serine responses).

      Response: We apologize for the confusion. We described the response to potassium as “very sensitive” due to the small value of K0.5. This has now been clarified at line 236.
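
      The two senses of "sensitive" at issue here can be separated with a Hill function: K0.5 sets where along the concentration axis the response occurs, while the Hill coefficient sets how steeply it rises. The sketch below is illustrative only. The function names are hypothetical, and the 10%-to-90% fold range is used as one convenient measure of steepness; it is not a quantity reported in the manuscript.

```python
def hill_response(conc, k_half, n_hill):
    """Fractional response of a Hill function with midpoint k_half."""
    return conc ** n_hill / (k_half ** n_hill + conc ** n_hill)


def fold_range_10_to_90(n_hill):
    """Fold change in concentration needed to go from 10% to 90% response.

    Solving r = c^n / (K^n + c^n) for c gives c = K * (r / (1 - r))^(1/n),
    so c_90 / c_10 = (9 / (1/9))^(1/n) = 81^(1/n), independent of K.
    """
    return 81.0 ** (1.0 / n_hill)


# Hill coefficients mentioned in the responses: 0.88 (pH-corrected potassium
# response) and 1.2 (MeAsp, ref. 46).
range_potassium = fold_range_10_to_90(0.88)
range_measp = fold_range_10_to_90(1.2)
```

      Under these assumptions, a coefficient of 0.88 corresponds to a 10%-to-90% range of about 147-fold in concentration, versus about 39-fold for a coefficient of 1.2. A small K0.5 and a shallow slope are therefore compatible: the response can begin at low concentrations yet still extend over a wide range.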

      (3b) Since the measurements are performed in wild-type cells the response amplitude following the addition of potassium may be biased if the cell has already partially adapted. This seems to be the case since the FRET time series does not plateau after the addition of the stimulus. The accuracy of the response curve and Hill coefficient would be more convincing if the experiment was repeated with a cheR cheB deficient mutant.

      Response: We thank the reviewer for raising these questions. To observe the low plateau before adaptation, a saturating amount of attractant should be added in a stepwise manner. According to the dose-response curve we measured for potassium, a saturating amount of potassium would be close to 100 mM. In fact, there is a small segment of the low plateau in the step response to 30 mM KCl (Fig. 4C or Fig. S5A). To observe more of this low plateau, we could have used a higher concentration of KCl. However, a stimulation higher than 30 mM KCl will induce substantial physiological changes in the cell, resulting in a significant decrease in fluorescence for both channels (Fig. S7). Therefore, the range of KCl concentration that can be reliably applied in FRET measurements is limited.

      The half-time of adaptation at 30 mM KCl was measured to be approximately 80 s, demonstrating a faster adaptation than 0.1 mM MeAsp, which induced a similar magnitude of response. Nevertheless, this is still significantly slower than the time required for medium exchange in the flow chamber, which takes less than 10 s to replace 99% of the medium. Thus, the effect on the measured response magnitude due to adaptation should be small (less than 10%).
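
      The "less than 10%" estimate can be checked with a short calculation. The sketch below assumes that recovery during adaptation is approximately exponential with the stated half-time, which is a simplification introduced here rather than a claim made in the response; the function name is hypothetical.

```python
def fraction_recovered(elapsed_s, half_time_s):
    """Fraction of the response lost to adaptation after elapsed_s seconds,
    assuming exponential recovery with the given half-time (an assumption)."""
    return 1.0 - 0.5 ** (elapsed_s / half_time_s)


# Values quoted in the response: ~80 s adaptation half-time at 30 mM KCl,
# and under 10 s to exchange 99% of the medium in the flow chamber.
loss = fraction_recovered(10.0, 80.0)
```

      With these numbers the loss is about 8%, consistent with the bound stated above. Because the medium exchange takes less than 10 s, this is an upper estimate under the exponential assumption.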

      We thank the reviewer for the suggestion of measuring the response to potassium in mutants without adaptation enzymes (CheR, CheB) and with the receptors in different methylation levels. However, these mutants are typically less sensitive than the wild-type, exhibiting higher values of K0.5 (ref. 46), and thus require an even higher KCl concentration to see the low plateau. Consistent with this, we attempted to measure the response to potassium in a cheRcheB mutant (HCB1382-pVS88). As shown in Fig. R1, there is no response to up to 30 mM KCl, suggesting that the sensitive region of the mutant is beyond 30 mM KCl.

      The relevant text was added at lines 413-424.

      (4) The authors show that the measured imprecise adaptation can be (at least partially) attributed to pH impacting the FRET signal by changing eCFP and eYFP brightness.

      (4a) Comparing Fig. 5C and D, the chemosensing and pH response time scales look similar. Therefore, does the pH effect bias the measured response amplitude (just as it biases the adapted FRET level)?

      Response: We agree with the reviewer that the pH effect on CFP and YFP biases the measured response amplitude. We have now performed the measurement of dose-response curve to potassium for the no-receptor mutant (HCB1414-pVS88), as shown in Fig. S4. The pH effects on CFP and YFP were corrected. The dose-response curve and adaptation curve were recalculated and plotted in Fig. S5.

      (4b) It would help to measure a full response curve (at many concentrations) for the no-receptor strain as a control. This would help distinguish, as a function of concentration, how much response can be attributed to pH impacting the FRET signal versus the true chemotactic response.

      Response: We thank the reviewer for the suggestion. We have now performed the measurements for the no-receptor strain. The impact of pH on CFP and YFP has been corrected. The pH-corrected results, previously in Fig.3-5, are now presented in Fig. 3, Fig. S5 and Fig. 5, respectively.

      (5) The biphasic response of Tar is strange and warrants further discussion. Do the authors have any proposed mechanisms that lead to this behavior? For the 10mM and 30mM KCl measurements there is a repellent response followed by an attractant response for both adding and removing the stimuli, why is this?

      Response: We thank the reviewer for pointing this out. The Tar-only strain exhibits a repellent response to stepwise addition of low concentrations of potassium less than 10 mM, and a biphasic response above (Fig. 5C). This biphasic response might result from additional pH-effects on the activity of intracellular enzymes such as CheRB and CheA, which may have a different timescale and response from the Tar receptor. We have now added the penultimate paragraph in “Discussion” to talk about the response of the Tar-only strain.

      (5a) The fact that Tar and Tsr are both attractant (after the initial repellent response in Tar) appears to be inconsistent with previous work on pH response (Ref 52, Yang and Sourjik Molecular Microbiology (2012) 86(6), 1482-1489). This study also didn't see any biphasic response.

      Response: We thank the reviewer for pointing this out. The Tar-only strain shows a repellent response to stepwise addition of low concentrations of potassium, specifically less than 10 mM. This is consistent with previous observations of the response of Tar to changes in intracellular pH (refs. 44,45) and also with the work of Yang and Sourjik (new ref. 53), although the work in ref. 53 dealt with the response to external pH change, and bacteria were known to maintain a relatively stable intracellular pH when external pH changes (Chen & Berg, Biophysical Journal (2000) 78:2280-2284). Interestingly, the Tar-only strain exhibits a biphasic response to high potassium concentrations of 10 mM and above. This biphasic response might result from additional pH-effects on the activity of intracellular enzymes such as CheRB and CheA (ref. 56), which may have a different timescale and response from the Tar receptor. We have now added the penultimate paragraph in “Discussion” to talk about the response of the Tar-only strain.

      (5b) The response of Tar to the removal of sodium benzoate (Fig. S2) seems to be triphasic, is there any explanation for this?

      Response: We thank the reviewer for pointing this out. We have now acknowledged in the legend of Fig. S2 that this response is interesting and warrants further exploration: “The response to the removal of sodium benzoate seems to be a superposition of an attractant and a repellent response, the reason for which deserves to be further explored.”

      (6) Fitting the MWC model leads to N=0.35<1. It is fine to use this as a phenomenological parameter, but can the authors comment on what might be causing such a small effective cluster size for potassium response?

      Response: We thank the reviewer for pointing this out. We have now recalculated the pH-corrected results for the dose-response curve (Fig. S5). The new Hill coefficient is 0.88 ± 0.14 (mean ± SD), which is close to that of the response to MeAsp (1.2) (ref. 46). We have now refit the MWC model to the pH-corrected dose-response curve, obtaining N of 0.85. We think the small N is due partly to the fact that we are fitting the curve with four parameters: N, Kon, Koff, and fm, while only three features of the sigmoidal dose-response curve are relevant (the vertical scale, the midpoint concentration, and the slope of the sigmoid). Future experiments may determine these parameters more accurately, but they should not significantly affect the simulation results as long as the wild-type dose-response curve is accurate.
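
      To make the four fitted parameters concrete, the sketch below writes out a commonly used form of the MWC model for receptor-kinase activity. The manuscript's Eq. 3 is not reproduced in this response, so the exact parameterization is an assumption: here Kon and Koff are taken to be the ligand dissociation constants of the active and inactive states, and fm is the methylation-dependent free-energy offset. The function name and the numerical values are hypothetical.

```python
import math


def mwc_activity(ligand, n_cluster, k_on, k_off, f_m):
    """Kinase activity in a standard MWC form (assumed, not the authors' Eq. 3).

    free energy F = N * [ f_m + ln( (1 + L/K_off) / (1 + L/K_on) ) ]
    activity    a = 1 / (1 + exp(F))
    For an attractant, K_off < K_on, so activity falls as ligand increases.
    """
    free_energy = n_cluster * (
        f_m + math.log((1.0 + ligand / k_off) / (1.0 + ligand / k_on))
    )
    return 1.0 / (1.0 + math.exp(free_energy))


# N = 0.85 is the refit value quoted above; the other values are placeholders.
baseline = mwc_activity(0.0, 0.85, k_on=10.0, k_off=1.0, f_m=0.0)
stimulated = mwc_activity(5.0, 0.85, k_on=10.0, k_off=1.0, f_m=0.0)
```

      In this form the parameters enter the free energy jointly rather than one per curve feature, which is compatible with the point above that four parameters are being constrained by only three features of the curve.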

      (7) The results of the modeling are closely related to Zhu et. al. Phys. Rev. Lett. 108, 128101. Is the lag time for large T related to the adaptation time?

      Response: We thank the reviewer for pointing this out. We used a similar framework of modeling as Zhu et. al. The potassium response was also analogous to the chemotactic response to MeAsp. Thus, the results are closely related to Zhu et al. We have now cited Zhu et al. (Ref. 52) and noted this at line 366.

      The lag time for large T is related to the adaptation time. We have now simulated the chemotaxis to potassium for large T with different adaptation times by varying the methylation rate kR. The results are shown in Fig. S8. The simulated lag time decreases with the methylation rate kR, but levels off at high values of kR. This has now been added at line 603.

      Minor issues:

      • Fig. 1C: should the axis label be y?

      Response: Yes, thank you. Now corrected.

      • Line 519: Koff given twice, the second should be Kon.

      Response: Thank you. Corrected.

      • When fitting the MWC model (Eq. 3 and Fig. 6B) did you fix a particular value for m?

      Response: m was treated as a fitting parameter, grouped in the parameter fm.

      Reviewer #2 (Recommendations For The Authors):

      Minor points: - I suggest explaining the acronyms when they first appear in the text (eg CMC, CW, CCW).

      Response: Thank you. Now they have been added.

      • L144. L242. "decrease" is ambiguous since membrane potential is negative. I understand the authors meant less negative (which is an increase). I suggest to avoid this expression.

      Response: Thank you for the suggestion. Now they have been replaced by “The absolute value of the transmembrane electrical potential will decrease”.

      • For Fig 1b - it says the shaded area is SEM in the text, but SD in the legend. Please clarify.

      Response: Thank you. The annotation in the legend has now been revised as SEM.

      • Fig 1C label of x axis should be "y" instead of "x" to be consistent with Fig 1A.

      Response: Thank you. It has now been revised.

      • In Figure 2, the number of independent experiments as well as the number of samples should be included.

      Response: Thank you. The response in Fig. 2C is the average of 83 motors from 5 samples for wild-type strain (JY26-pKAF131). The response in Fig. 2D is the average of 22 motors from 4 samples for the chemotaxis-defective strain (HCB901-pBES38). They have now been added to the legend.

      • Regarding the attractant or repelling action of potassium and sucrose, it would be important to have a movie showing the cells' behaviours.

      Response: We thank the reviewer for the suggestion. We have now provided Movie S1 to show the cells’ behavior to potassium. As shown in Fig. 3B, the chemotactic response to 60 mM sucrose is very small compared to the response to 30 mM KCl. This implies that a noticeable response to sucrose necessitates higher concentrations of stimulation. However, Rosko et al. [Rosko, J., Martinez, V. A., Poon, W. C. K. & Pilizota, T. Proc. Natl Acad. Sci. USA 114, E7969-E7976 (2017).] have shown that high concentrations of sucrose lead to a significant reduction in the speed of the flagellar motor. Thus, in a motility assay for sucrose, the osmolarity-induced motility effect may overwhelm the minor repellent-like response.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public Review):

      In this study, the authors aim to understand why decision formation during behavioural tasks is distributed across multiple brain areas. They hypothesize that multiple areas are used in order to implement an information bottleneck (IB). Using neural activity recorded from monkey DLPFC and PMd performing a 2-AFC task, they show that DLPFC represents various task variables (decision, color, target configuration), while downstream PMd primarily represents decision information. Since decision information is the only information needed to make a decision, the authors point out that PMd has a minimal sufficient representation (as expected from an IB). They then train 3-area RNNs on the same task and show that activity in the first and third areas resemble the neural representations of DLPFC and PMd, respectively. In order to propose a mechanism, they analyse the RNN and find that area 3 ends up with primarily decision information because feedforward connections between areas primarily propagate decision information.

      The paper addresses a deep, normative question, namely why task information is distributed across several areas.

      Overall, it reads well and the analysis is well done and mostly correct (see below for some comments). My major problem with the paper is that I do not see that it actually provides an answer to the question posed (why is information distributed across areas?). I find that the core problem is that the information bottleneck method, which is evoked throughout the paper, is simply a generic compression method.

      Being a generic compressor, the IB does not make any statements about how a particular compression should be distributed across brain areas - see major points (1) and (2).

      If I ignore the reference to the information bottleneck and the question of why pieces of information are distributed, I still see a more mechanistic study that proposes a neural mechanism of how decisions are formed, in the tradition of RNN-modelling of neural activity as in Mante et al 2013. Seen through this more limited sense, the present study succeeds at pointing out a good model-data match, and I could support a publication along those lines. I point out some suggestions for improvement below.

      We thank the reviewer for their comments, feedback and suggestions. We are glad to hear you support the good model-data match for this manuscript.  With your helpful comments, we have clarified the connections to the information bottleneck principle and also contrasted it against the information maximization principle (the InfoMax principle), an alternative hypothesis. We elaborate on these issues in response to your points below, particularly major points (1) and (2). We also address all your other comments below.

      Major points

      (1) It seems to me that the authors' use of the IB is based on the reasoning that deep neural networks form decisions by passing task information through a series of transformations/layers/areas and that these deep nets have been shown to implement an IB. Furthermore, these transformations are also loosely motivated by the data processing inequality.

      On Major Point 1 and these following subpoints, we first want to make a high-level statement before delving into a detailed response to your points as it relates to the information bottleneck (IB). We hope this high-level statement will provide helpful context for the rest of our point-by-point responses. 

      We want to be clear that we draw on the information bottleneck (IB) principle as a general principle to explain why cortical representations differ by brain area. The IB principle, as applied to cortex, is only stating that a minimal sufficient representation to perform the task is formed in cortex, not how it is formed. The alternative hypothesis to the IB is that brain areas do not form minimal sufficient representations. For example, the InfoMax principle states that each brain area stores information about all inputs (even if they’re not necessary to perform the task). InfoMax isn’t unreasonable: it’s possible that storing as much information about the inputs, even in downstream areas, can support flexible computation and InfoMax also supports redundancy in cortical areas. Indeed, many studies claim that action choice related signals are in many cortical areas, which may reflect evidence of an InfoMax principle in action for areas upstream of PMd.

      While we observe an IB in deep neural networks and cortex in our perceptual decision-making task, we stress that its emergence across multiple areas is an empirical result. At the same time, multiple areas producing an IB makes intuitive sense: due to the data processing inequality, successive transformations typically decrease the information in a representation (especially when, e.g., in neural networks, every activation passes through the ReLU function, which is not bijective). Multiple areas are therefore a sufficient and even ‘natural’ way to implement an IB, but multiple areas are not necessary for an IB. That an IB emerges through multi-area computation in deep neural networks and cortex is an empirical observation, and, in contrast to InfoMax, we believe it is an important result of this paper.

      Nevertheless, your incisive comments have helped us update the manuscript so that, when we discuss the IB, it is clear that the alternative hypothesis is non-minimal representations, a prominent example of which is the InfoMax principle. We have now significantly revised our introduction to avoid this confusion. We hope this provides helpful context for our point-by-point replies, below.

      However, assuming as a given that deep neural networks implement an IB does not mean that an IB can only be implemented through a deep neural network. In fact, IBs could be performed with a single transformation just as well. More formally, a task associates stimuli (X) with required responses (Y), and the IB principle states that X should be mapped to a representation Z, such that I(X;Z) is minimal and I(Y;Z) is maximal. Importantly, the form of the map Z=f(X) is not constrained by the IB. In other words, the IB does not impose that there needs to be a series of transformations. I therefore do not see how the IB by itself makes any statement about the distribution of information across various brain areas.

      We agree with you that an IB can be implemented in a single transformation. We wish to be clear that we do not intend to argue necessity: that multiple areas are the only way to form minimal sufficient representations. Rather, multiple areas are sufficient to induce minimal sufficient representations, and moreover, they are a natural and reasonably simple way to do so. By ‘natural,’ we mean that minimal sufficient representations empirically arise in systems with multiple areas (more than 2), including deep neural networks and the cortex at least for our task and simulations. For example, we did not see minimal sufficient representations in 1- or 2-area RNNs, but we did see them emerge in RNNs with 3 areas or more. One potential reason for this result is that sequential transformations through multiple areas can never increase information about the input; it can only maintain or reduce information due to the data processing inequality.
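
      For reference, the two statements this argument relies on can be written compactly. The notation Z_1, Z_2 for successive areas and the trade-off parameter β are introduced here for illustration and do not appear in the response; the second line is the standard Lagrangian form of the IB objective.

```latex
% Data processing inequality: for a Markov chain X -> Z_1 -> Z_2,
% a downstream representation cannot gain information about the input.
X \rightarrow Z_1 \rightarrow Z_2 \;\Longrightarrow\; I(X; Z_2) \le I(X; Z_1)

% Information bottleneck objective over stochastic encoders p(z|x):
% compress X while retaining information about the required response Y.
\min_{p(z \mid x)} \; I(X; Z) - \beta \, I(Z; Y), \qquad \beta > 0
```

      The first line bounds what successive areas can retain but does not say that information must be lost at any given stage, and the second places no constraint on how many transformations produce Z. Both are consistent with treating the multi-area implementation as an empirical finding rather than a consequence of the IB itself.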

      Our finding that multiple areas facilitate IBs in the brain is therefore an empirical result: like in deep neural networks, we observe the brain has minimal sufficient representations that emerge in output areas (PMd), even as an area upstream (DLPFC) is not minimal. While the IB makes a statement that this minimal sufficient representation emerges, to your point, the fact that it emerges over multiple areas is not a part of the IB – as you have pointed out, the IB doesn’t state where or how the information is discarded, only that it is discarded. Our RNN modeling later proposes one potential mechanism for how it is discarded. We updated the manuscript introduction to make these points:

      “An empirical observation from Machine Learning is that deep neural networks tend to form minimal sufficient representations in the last layers. Although multi-layer computation is not necessary for an IB, they provide a sufficient and even “natural” way to form an IB. A representation z = f(x) cannot contain more information than the input x itself due to the data processing inequality[19]. Thus, adding additional layers typically results in representations that contain less information about the input.”

      And later in the introduction:

      “Consistent with these predictions of the IB principle, we found that DLPFC has information about the color, target configuration, and direction. In contrast, PMd had a minimal sufficient representation of the direction choice. Our recordings therefore identified a cortical IB. However, we emphasize the IB does not tell us where or how the minimal sufficient representation is formed. Instead, only our empirical results implicate DLPFC-PMd in an IB computation. Further, to propose a mechanism for how this IB is formed, we trained a multi-area RNN to perform this task. We found that the RNN faithfully reproduced DLPFC and PMd activity, enabling us to propose a mechanism for how cortex uses multiple areas to compute a minimal sufficient representation.”

      In the context of our work, we want to be clear the IB makes these predictions:

      Prediction 1: There exists a downstream area of cortex that has a minimal and sufficient representation to perform a task (i.e., I(X;Z) is minimal while preserving task information so that I(Z;Y) is approximately equal to I(X;Y)). We identify PMd as an area with a minimal sufficient representation in our perceptual-decision-making task.

      Prediction 2 (corollary if Prediction 1 is true): There exists an upstream brain area that contains more input information than the minimal sufficient area. We identify DLPFC as an upstream area relative to PMd, which indeed has more input information than downstream PMd in our perceptual decision-making task. 

      Note: as you raise in other points, it could have been possible that the IB is implemented early on, e.g., in either the parietal cortex (dorsal stream) or inferotemporal cortex (ventral stream), so that DLPFC and PMd both contained minimal sufficient representations. The fact that it doesn’t is entirely an empirical result from our data. If DLPFC had minimal sufficient representations for the perceptual decision making task, we would have needed to record in other regions to identify brain areas that are consistent with Prediction 2. But, empirically, we found that DLPFC has more input information relative to PMd, and therefore the DLPFC-PMd connection is implicated in the IB process.
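
      The two predictions can be illustrated with a toy calculation. The sketch below is not the authors' analysis: it assumes a simplified task in which a binary color and a binary target configuration are equally likely and independent, and the required direction is their exclusive-or. The "upstream" representation keeps the full input and the "downstream" representation keeps only the direction. All names are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def mutual_information(pairs):
    """Mutual information in bits between two discrete variables,
    given a list of equally likely (a, b) outcomes."""
    n = len(pairs)
    joint = Counter(pairs)
    p_a = Counter(a for a, _ in pairs)
    p_b = Counter(b for _, b in pairs)
    return sum(
        (count / n) * math.log2(count * n / (p_a[a] * p_b[b]))
        for (a, b), count in joint.items()
    )


# Toy task (an assumption): X = (color, configuration), Y = direction.
inputs = list(product([0, 1], [0, 1]))
direction = [color ^ config for color, config in inputs]

upstream = inputs        # retains all input information
downstream = direction   # retains only the direction

info_task = mutual_information(list(zip(inputs, direction)))          # I(X;Y)
info_up_input = mutual_information(list(zip(inputs, upstream)))       # I(X;Z_up)
info_down_input = mutual_information(list(zip(inputs, downstream)))   # I(X;Z_down)
info_down_task = mutual_information(list(zip(downstream, direction))) # I(Z_down;Y)
```

      In this toy setting the downstream representation carries 1 bit about the input and the same 1 bit about the direction, so I(Z;Y) equals I(X;Y) (sufficient) with less input information than the upstream representation, which carries 2 bits. This mirrors the relationship between Predictions 1 and 2 without making any claim about the recorded data.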

      What is the alternative hypothesis to the IB? We want to emphasize: it isn’t single-area computation. It’s that the cortex does not form minimal sufficient representations. For example, an alternative hypothesis (“InfoMax”) would be for all engaged brain areas to form representations that retain all input information. One reason this could be beneficial is because each brain area could support a variety of downstream tasks. In this scenario, PMd would not be minimal, invalidating Prediction 1. However, this is not supported by our empirical observations of the representations in PMd, which has a minimal sufficient representation of the task. We updated our introduction to make this clear:

      “But cortex may not necessarily implement an IB. The alternative hypothesis to IB is that the cortex does not form minimal sufficient representations. One manifestation of this alternative hypothesis is the “InfoMax” principle, where downstream representations are not minimal but rather contain maximal input information[22]. This means information about task inputs not required to perform the task are present in downstream output areas. Two potential benefits of an InfoMax principle are (1) to increase redundancy in cortical areas and thereby provide fault tolerance, and (2) for each area to support a wide variety of tasks and thereby improve the ability of brain areas to guide many different behaviors. In contrast to InfoMax, the IB principle makes two testable predictions about cortical representations. Prediction 1: there exists a downstream area of cortex that has a minimal and sufficient representation to perform a task (i.e., I(X; Z) is minimal while preserving task information so that I(Z; Y) ≈ I(X; Y)). Prediction 2 (corollary if Prediction 1 is true): there exists an upstream area of cortex that has more task information than the minimal sufficient area.”

      Your review helped us realize we should have been clearer in explaining that these are the key predictions of the IB principle tested in our paper. We also realized we should be much clearer that these predictions aren’t trivial or expected, and there is an alternative hypothesis. We have re-written the introduction of our paper to highlight that the key prediction of the IB is minimal sufficient representations for the task, in contrast to the alternative hypothesis of InfoMax.

      A related problem is that the authors really only evoke the IB to explain the representation in PMd: Fig 2 shows that PMd is almost only showing decision information, and thus one can call this a minimal sufficient representation of the decision (although ignoring substantial condition independent activity).

      However, there is no IB prediction about what the representation of DLPFC should look like.

      Consequently, there is no IB prediction about how information should be distributed across DLPFC and PMd.

      We agree: the IB doesn’t tell us how information is distributed, only that there is a transformation that eventually makes PMd minimal. The fact that we find input information in DLPFC reflects that this computation occurs across areas, and is an empirical characterization of this IB in that DLPFC has direction, color and context information while PMd has primarily direction information. To be clear: only our empirical recordings verified that the DLPFC-PMd circuit is involved in the IB. As described above, if not, we would have recorded even further upstream to identify an inter-areal connection implicated in the IB.

      We updated the text to clearly state that the IB predicts that an upstream area’s activity should contain more information about the task inputs. We now explicitly describe this in the introduction, copy and pasted again here for convenience.

      “In contrast to InfoMax, the IB principle makes two testable predictions about cortical representations. Prediction 1: there exists a downstream area of cortex that has a minimal and sufficient representation to perform a task (i.e., I(X; Z) is minimal while preserving task information so that I(Z; Y) ≈ I(X; Y)). Prediction 2 (corollary if Prediction 1 is true): there exists an upstream area of cortex that has more task information than the minimal sufficient area.

      Consistent with the predictions of the IB principle, we found that DLPFC has information about the color, target configuration, and direction. In contrast, PMd had a minimal sufficient representation of the direction choice. Our recordings therefore identified a cortical IB. However, we emphasize the IB does not tell us where or how the minimal sufficient representation is formed. Instead, only our empirical results implicate DLPFC-PMd in an IB computation. Further, to propose a mechanism for how this IB is formed, we trained a multi-area RNN to perform this task.”

      The only way we knew DLPFC was not minimal was through our experiments. Please also note that the IB principle does not describe how information could be lost between areas or layers, whereas our RNN simulations show that this may occur through preferential propagation of task-relevant information with respect to the inter-area connections.  

      (2) Now the authors could change their argument and state that what is really needed is an IB with the additional assumption that transformations go through a feedforward network. However, even in this case, I am not sure I understand the need for distributing information in this task. In fact, in both the data and the network model, there is a nice linear readout of the decision information in DLPFC (data) or area 1 (network model). Accordingly, the decision readout could occur at this stage already, and there is absolutely no need to tag on another area (PMd, area 2+3).

      Similarly, I noticed that the authors consider 2,3, and 4-area models, but they do not consider a 1-area model. It is not clear why the 1-area model is not considered. Given that e.g. Mante et al, 2013, manage to fit a 1-area model to a task of similar complexity, I would a priori assume that a 1-area RNN would do just as well in solving this task.

      While decision information could indeed be read out in Area 1 in our multi-area model, we were interested in understanding how the network converged to a PMd-like representation (minimal sufficient) for solving this task. Empirically, we only observed a match between our model representations and animal cortical representations during this task when considering multiple areas. Given that we empirically observed that our downstream area had a minimal sufficient representation, our multi-area model allowed us to study how this minimal sufficient representation emerged (through preferential propagation of task-relevant information).

      We also analyzed single-area networks in our initial manuscript, though we could have highlighted these analyses more clearly so that they were not overlooked. We are clearer in this revision that we did consider a 1-area network (results in our Fig 5). While a single-area RNN can indeed solve this task, the single-area model had all task information present in the representation, and did not match the representations in DLPFC or PMd. It would therefore not allow us to understand how the network converged to a PMd-like representation (minimal sufficient) for solving this task. We updated the schematic in Fig 5 to include the single-area network; its absence from the schematic may have caused the confusion.

      We have added an additional paragraph commenting on this in the discussion. We also added an additional supplementary figure with the PCs of the single area RNN (Fig S15). We highlight that single area RNNs do not resemble PMd activity because they contain strong color and context information. 

      In the discussion:

      “We also found it was possible to solve this task with single area RNNs, although they did not resemble PMd (Figure S15) since it did not form a minimal sufficient representation. Rather, for our RNN simulations, we found that the following components were sufficient to induce minimal sufficient representations: (1) RNNs with at least 3 areas, following Dale’s law (independent of the ratio of feedforward to feedback connections).”

      I think there are two more general problems with the author's approach. First, transformations or hierarchical representations are usually evoked to get information into the right format in a pure feedforward network. An RNN can be seen as an infinitely deep feedforward network, so even a single RNN has, at least in theory, and in contrast to feedforward layers, the power to do arbitrarily complex transformations. Second, the information coming into the network here (color + target) is a classical xor-task. While this task cannot be solved by a perceptron (=single neuron), it also is not that complex either, at least compared to, e.g., the task of distinguishing cats from dogs based on an incoming image in pixel format.

      An RNN can be viewed as an infinitely deep feedforward network in time. However, we wish to clarify two things. First, our task runs for a fixed amount of time, and therefore this RNN in practice is not infinitely deep in time. Second, if it were to perform an IB operation in time, we would expect to see color discriminability decrease as a function of time. Indeed, we considered this as a mechanism (recurrent attenuation, Figure 4a), but as we show in Supplementary Figure S9, we do not observe a decrease in discriminability through time. An IB in time would be equivalent to a dynamical mechanism that removes color through successive transformations in time, which our analyses reject (Fig 4). We therefore rule out that an IB is implemented through time via an RNN’s recurrent computation (viewed as feedforward in time). Rather, as we show, the IB comes primarily through inter-areal connections between RNN areas. We clarified in the Results that our recurrent-dynamics hypothesis is equivalent to the feedforward-in-time filtering hypothesis, so that rejecting one rejects the other:

      “We first tested the hypothesis that the RNN IB is implemented primarily by recurrent dynamics (left side of Fig. 4a). These recurrent dynamics can be equivalently interpreted as the RNN implementing a feedforward neural network in time.”  

      The reviewer is correct that the task is a classical XOR task and not as complex as e.g., computer vision classification. That said, our related work has looked at IBs for computer vision tasks and found them in deep feedforward networks (Kleinman et al., ICLR 2021). Even though the task is relatively straightforward, we believe it is appropriate for our conclusions because it does not have a trivial minimal sufficient representation: a minimal sufficient representation for XOR must contain only target, but not color or target configuration information. This can only be solved via a nonlinear computation. In this manner, we favor this task because it is relatively simple, and the minimal sufficient representations are interpretable, while at the same time not being so trivially simple (the minimal sufficient representations require nonlinearity to compute).  
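
      The nonlinearity requirement can be illustrated with a toy calculation. The sketch below assumes a +/-1 coding of the dominant color and the target configuration; this coding is purely illustrative and is not the 4D input given to the RNN. Under that assumption, the best linear readout of reach direction from the raw inputs has all-zero weights, whereas a single product feature recovers the direction exactly.

```python
import numpy as np

# Assumed +/-1 coding (illustrative only, not the RNN's actual inputs):
# color  = dominant checkerboard color,
# config = target configuration (which side holds which colored target),
# direction = correct reach direction.
color = np.array([-1.0, -1.0, 1.0, 1.0])
config = np.array([-1.0, 1.0, -1.0, 1.0])
direction = color * config  # XOR expressed as a product in +/-1 coding

# Best linear readout (with a bias term) from the raw inputs, by least squares.
X = np.column_stack([color, config, np.ones(4)])
weights, *_ = np.linalg.lstsq(X, direction, rcond=None)
linear_prediction = X @ weights  # carries no direction information

# A single nonlinear (multiplicative) feature recovers the direction exactly.
nonlinear_prediction = np.sign(color * config)
```

      In this toy setting the direction is uncorrelated with each input taken alone, which is why no linear combination of color and target configuration can report it.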

      Finally, we want to note that this decision-making task is a logical and straightforward way to add complexity to classical animal decision-making tasks, where stimulus evidence and the behavioral report are frequently correlated. In tasks such as these, it may be challenging to untangle stimulus and behavioral variables, making it impossible to determine if an area like premotor cortex represents only behavior rather than stimulus. However, our task decorrelates both the stimulus and the behaviors. 

      (3) I am convinced of the author's argument that the RNN reproduces key features of the neural data. However, there are some points where the analysis should be improved.

      (a) It seems that dPCA was applied without regularization. Since dPCA can overfit the data, proper regularization is important, so that one can judge, e.g., whether the components of Fig.2g,h are significant, or whether the differences between DLPFC and PMd are significant.

      We note that the dPCA codebase optimizes the regularization hyperparameter through cross-validation and requires single-trial firing rates for all neurons, i.e., data matrices of the form (n_Neurons x Color x Choice x Time x n_Trials), which are unavailable for our data. We recognize that you are fundamentally asking whether the differences are significant. We therefore believe it is possible to address this through a statistical test, described further below.

      In order to test whether the differences of variance explained by task variables between DLPFC and PMd are significant, we performed a shuffle test. For this test, we randomly sampled 500 units from the DLPFC dataset and 500 units from the PMd dataset. We then used dPCA to measure the variance explained by target configuration, color choice, and reach direction (e.g., Var<sup>True</sup><sub>DLPFC,Color</sub>, Var<sup>True</sup><sub>PMd,Color</sub>).

      To test whether this difference in variance was significant, we performed the following shuffle test. We combined the PMd and DLPFC datasets into a pool of 1000 units and then randomly selected 500 units from this pool to create a surrogate PMd dataset and used the remaining 500 units as a surrogate DLPFC dataset. We then again performed dPCA on these surrogate datasets and estimated the variance for the various task variables (e.g., Var<sub>ShuffledDLPFC,Color</sub>, Var<sub>ShuffledPMd,Color</sub>).

      We repeated this process 100 times and estimated a sampling distribution for the true difference in variance between DLPFC and PMd for various task variables (e.g., Var<sup>True</sup><sub>DLPFC,Color</sub> - Var<sup>True</sup><sub>PMd,Color</sub>). At the same time, we estimated the distribution of the variance difference between the surrogate PMd and DLPFC datasets for various task variables (e.g., Var<sub>ShuffledDLPFC,Color</sub> - Var<sub>ShuffledPMd,Color</sub>).

      We defined a p-value as the number of shuffles in which the difference in variance was higher than the median of the true difference and divided it by 100. Note, for resampling and shuffle tests with n shuffles/bootstraps, the lowest theoretical p-value is given as 2/n, even in the case that no shuffle was higher than the median of the true distribution. Thus, the differences were statistically significant (p < 0.02) for color and target configuration but not for direction (p=0.72). These results are reported in Figure S6 and show both the true sampling distribution and the shuffled sampling distributions.
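
      The procedure above can be sketched as follows. This is a minimal illustration and not the analysis code behind Figure S6: the function name `shuffle_test` and the `statistic` argument are hypothetical, and `statistic` stands in for the dPCA variance estimate for one task variable. The floor of 2/n on the p-value follows the statement above.

```python
import numpy as np

def shuffle_test(units_a, units_b, statistic, n_units=500, n_shuffles=100, seed=0):
    """Shuffle test for a difference in a statistic between two pools of units.

    `statistic` is a hypothetical stand-in for the dPCA variance explained by
    one task variable; it maps an array of units to a single number.
    """
    rng = np.random.default_rng(seed)
    true_diffs = np.empty(n_shuffles)
    shuffled_diffs = np.empty(n_shuffles)
    for i in range(n_shuffles):
        # Subsample n_units from each area and compute the true difference.
        a = units_a[rng.choice(len(units_a), n_units, replace=False)]
        b = units_b[rng.choice(len(units_b), n_units, replace=False)]
        true_diffs[i] = statistic(a) - statistic(b)
        # Pool the units, shuffle the area labels, and split into two surrogates.
        pool = rng.permutation(np.concatenate([a, b]))
        shuffled_diffs[i] = statistic(pool[:n_units]) - statistic(pool[n_units:])
    # Count shuffles whose difference exceeds the median true difference.
    count = np.sum(shuffled_diffs > np.median(true_diffs))
    # Apply the 2/n floor described in the text.
    p_value = max(count / n_shuffles, 2 / n_shuffles)
    return p_value, true_diffs, shuffled_diffs
```

      For example, with a hypothetical per-unit measure stored in two one-dimensional arrays `dlpfc` and `pmd`, `shuffle_test(dlpfc, pmd, np.mean)` returns the p-value together with both sampling distributions, which can then be plotted against each other.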

      (b) I would have assumed that the analyses performed on the neural data were identical to the ones performed on the RNN data. However, it looked to me like that was not the case. For instance, dPCA of the neural data is done by restretching randomly timed trials to a median trial. It seemed that this restretching was not performed on the RNN. Maybe that is just an oversight, but it should be clarified. Moreover, the decoding analyses used SVC for the neural data, but a neural-net-based approach for the RNN data. Why the differences?

      Thanks for bringing up these points. We want to clarify that we did include SVM decoding for the multi-area network in the appendix (Fig. S4), and the conclusions are the same. Moreover, in previous work, we also found that training with a linear decoder led to analogous conclusions (Fig. 11 of Kleinman et al, NeurIPS 2021). As we had a larger number of trials for the RNN than for the monkey, we wanted to allow a more expressive decoder for the RNN, though this choice does not affect our conclusions. We clarified the text to reflect that we did use an SVM decoder.

      “We also found analogous conclusions when using an SVM decoder (Fig. S4).”

      dPCA analysis requires trials of equal length. For the RNN, such trials are straightforward to generate because we can set the delay lengths to be equal during inference (although the RNN was trained on trials of various lengths and can perform them). Animals must have varying delay periods, or else they will learn the timing of the task and anticipate epoch changes. Because animal trial lengths therefore differed, their trials had to be restretched. We clarified this in the Methods.

      “For analyses of the RNN, we fixed the timing of trials, obviating the need to restretch trial lengths. Note that while at inference, we generated RNN trials with equal length, the RNN was trained with varying delay periods.”
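
      As an illustration of what restretching involves, the sketch below linearly interpolates each variable-length trial onto the median trial length. This is an assumption made for illustration only: the function names are hypothetical, and the procedure in the Methods may differ (for example, by restretching each task epoch separately).

```python
import numpy as np

def restretch(trial, target_len):
    """Linearly interpolate a 1D firing-rate trace onto `target_len` samples."""
    old_time = np.linspace(0.0, 1.0, len(trial))
    new_time = np.linspace(0.0, 1.0, target_len)
    return np.interp(new_time, old_time, trial)

def restretch_to_median(trials):
    """Restretch variable-length trials to the median trial length."""
    target_len = int(np.median([len(t) for t in trials]))
    stretched = [restretch(np.asarray(t, dtype=float), target_len) for t in trials]
    return np.stack(stretched)  # shape: (n_trials, target_len)
```

      The result is a rectangular array of equal-length trials, which is the form that trial averaging and dPCA require.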

      (4) The RNN seems to fit the data quite nicely, so that is interesting. At the same time, the fit seems somewhat serendipitous, or at least, I did not get a good sense of what was needed to make the RNN fit the data. The authors did go to great lengths to fit various network models and turn several knobs on the fit. However, at least to me, there are a few (obvious) knobs that were not tested.

      First, as already mentioned above, why not try to fit a single-area model? I would expect that a single area model could also learn the task - after all, that is what Mante et al did in their 2013 paper and the author's task does not seem any more complex than the task by Mante and colleagues.

      Thank you for bringing up this point. As mentioned in response to your prior point, we did analyze a single-area RNN (Fig. 5d). We updated the schematic to clarify that we analyzed a single area network. Moreover, we also added a supplementary figure to qualitatively visualize the PCs of the single area network (Fig. S15). While a single area network can solve the task, it does not allow us to study how representations change across areas, nor did it empirically resemble our neural recordings. Single-area networks contain significant color, context, and direction information. They therefore do not form minimal representations and do not resemble PMd activity.

      Second, I noticed that the networks fitted are always feedforward-dominated. What happens when feedforward and feedback connections are on an equal footing? Do we still find that only the decision information propagates to the next area? Quite generally, when it comes to attenuating information that is fed into the network (e.g. color), then that is much easier done through feedforward connections (where it can be done in a single pass, through proper alignment or misalignment of the feedforward synapses) than through recurrent connections (where you need to actively cancel the incoming information). So it seems to me that the reason the attenuation occurs in the inter-area connections could simply be because the odds are a priori stacked against recurrent connections. In the real brain, of course, there is no clear evidence that feedforward connections dominate over feedback connections anatomically.

      We want to clarify that we did pick feedforward and feedback connections based on the following macaque atlas, reference 27 in our manuscript: 

      Markov, N. T., Ercsey-Ravasz, M. M., Ribeiro Gomes, A. R., Lamy, C., Magrou, L., Vezoli, J., Misery, P., Falchier, A., Quilodran, R., Gariel, M. A., Sallet, J., Gamanut, R., Huissoud, C., Clavagnier, S., Giroud, P., Sappey-Marinier, D., Barone, P., Dehay, C., Toroczkai, Z., … Kennedy, H. (2014). A weighted and directed interareal connectivity matrix for macaque cerebral cortex. Cerebral Cortex, 24(1), 17–36.

      We therefore believe there is evidence for more feedforward than feedback connections. Nevertheless, as stated in response to your next point below, we ran a simulation where feedback and feedforward connectivity were matched.

      More generally, it would be useful to clarify what exactly is sufficient:

      (a) the information distribution occurs in any RNN, i.e., also in one-area RNNs

      (b) the information distribution occurs when there are several, sparsely connected areas

      (c) the information distribution occurs when there are feedforward-dominated connections between areas

      We now clarify more precisely what is sufficient:

      - We trained single-area RNNs and found that these RNNs contained color information; additionally, two-area RNNs also contained color information in the last area (Fig 5d).

      - We indeed found that the minimal sufficient representations emerged when we had several areas, with Dale’s law constraint on the connectivity. When we had even sparser connections, without Dale’s law, there was significantly more color information, even at 1% feedforward connections; Fig 5a.

      - When we matched the percentage of feedforward and feedback connections with Dale’s law constraint on the connectivity (10% feedforward and 10% feedback), we also observed minimal sufficient representations (Fig S9). 

      Together, we found that minimal sufficient representations emerged when we had several areas (3 or greater), with Dale’s law constraint on the connectivity, independent of the ratio of feedforward/feedback connections. We thank the reviewer for raising this point about the space of constraints leading to minimal sufficient representations in the late area. We clarified this in the Discussion.

      “We also found it was possible to solve this task with single area RNNs, although they did not resemble PMd (Figure S15) since it did not form a minimal sufficient representation. Rather, for our RNN simulations, we found that the following components were sufficient to induce minimal sufficient representations: RNNs with at least 3 areas, following Dale’s law (independent of the ratio of feedforward to feedback connections).”

      Thank you for your helpful and constructive comments!

      Reviewer #2 (Public Review):

      Kleinman and colleagues conducted an analysis of two datasets, one recorded from DLPFC in one monkey and the other from PMD in two monkeys. They also performed similar analyses on trained RNNs with various architectures.

      The study revealed four main findings. (1) All task variables (color coherence, target configuration, and choice direction) were found to be encoded in DLPFC. (2) PMD, an area downstream of PFC, only encoded choice direction. (3) These empirical findings align with the celebrated 'information bottleneck principle,' which suggests that FF networks progressively filter out task-irrelevant information. (4) Moreover, similar results were observed in RNNs with three modules.

      We thank the reviewer for their comments, feedback and suggestions, which we address below.

      While the analyses supporting results 1 and 2 were convincing and robust, I have some concerns and recommendations regarding findings 3 and 4, which I will elaborate on below. It is important to note that findings 2 and 4 had already been reported in a previous publication by the same authors (ref. 43).

      Note the NeurIPS paper only had PMd data and did not contain any DLPFC data. That manuscript made predictions about representations and dynamics upstream of PMd, and subsequent experiments reported in this manuscript validated these predictions. Importantly, this manuscript observes an information bottleneck between DLPFC and PMd.

      Major recommendation/comments:

      The interpretation of the empirical findings regarding the communication subspace in relation to the information bottleneck theory is very interesting and novel. However, it may be a stretch to apply this interpretation directly to PFC-PMd, as was done with early vs. late areas of a FF neural network.

      In the RNN simulations, the main finding indicates that a network with three or more modules lacks information about the stimulus in the third or subsequent modules. The authors draw a direct analogy between monkey PFC and PMd and Modules 1 and 3 of the RNNs, respectively. However, considering the model's architecture, it seems more appropriate to map Area 1 to regions upstream of PFC, such as the visual cortex, since Area 1 receives visual stimuli. Moreover, both PFC and PMd are deep within the brain hierarchy, suggesting a more natural mapping to later areas. This contradicts the CCA analysis in Figure 3e. It is recommended to either remap the areas or provide further support for the current mapping choice.

      We updated the Introduction to better clarify the predictions of the information bottleneck (IB) principle. In particular, the IB principle predicts that later areas should have minimal sufficient representations of task information, whereas upstream areas should have more information. In PMd, we observed a minimal sufficient representation of task information during the decision-making task. In DLPFC, we observed more task information, particularly more information about the target colors and the target configuration.

      In terms of the exact map between areas, we do not believe or intend to claim the DLPFC is the first area implicated in the sensorimotor transformation during our perceptual decision-making task. Rather, DLPFC best matches Area 1 of our model. It is important to note that we abstracted our task so that the first area of our model received checkerboard coherence and target configuration as input (and hence did not need to transform task visual inputs). Indeed, in Figure 1d we hypothesize that the early visual areas should contain additional information, which we do not model directly in this work. Future work could model RNNs to take in an image or video input of the task stimulus. In this case, it would be interesting to assess if earlier areas resemble visual cortical areas. We updated the results, where we first present the RNN, to state the inputs explicitly and be clear the inputs are not images or videos of the checkerboard task.

      “The RNN input was 4D representing the target configuration and checkerboard signed coherence, while the RNN output was 2D, representing decision variables for a left and right reach (see Methods).”

      Another reason that we mapped Area 1 to DLPFC is that anatomical, physiological and lesion studies suggest that DLPFC receives inputs from both the dorsal and ventral streams (Romanski et al., 2007; Hoshi et al., 2006; Wilson et al., 1993). The dorsal stream originates from the occipital lobe and passes through the posterior parietal cortex to DLPFC, carrying visuospatial information about the object. The ventral stream originates from the occipital lobe and passes through the inferior temporal cortex and ventrolateral prefrontal cortex to DLPFC, encoding the identity of the object, including color and texture. In our RNN simulation, Area 1 receives processed inputs of the task: target configuration and the evidence for each color in the checkerboard. Target configuration contains information about the spatial location of the targets, which represents the inputs from the dorsal stream, while evidence for each color is, by analogy, the input from the ventral stream. Purely visual areas would not fit this dual input from both the dorsal and ventral streams. A potential alternative candidate would be the parietal cortex, which is largely part of the dorsal stream and is thought to have modest color inputs (although there is some shape and color selectivity in areas such as LIP, e.g., work from Sereno et al.). On balance, given the strong inputs from both the dorsal and ventral streams, we believe Area 1 maps better onto DLPFC than onto earlier visual areas.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) Line 35/36: Please specify the type of nuisance that the representation is robust to. I guess this refers to small changes in the inputs, not to changes in the representation itself.

      Indeed it refers to input variability unrelated to the task. We clarified the text.

      (2) For reference, it would be nice to have a tick for the event "Targets on" in Fig.2c.

      In this plot, the PSTHs are aligned to the checkerboard onset. Because there is a variable time between target and checkerboard onset, there is a trial-by-trial difference of when the target was turned on, so there is no single place on the x-axis where we could place a “Targets on” tick. In response to this point, we generated a plot with both targets on and check on alignment, with a break in the middle, shown in Supplementary Figure S5. 

      (3) It would strengthen the comparison between neural data and RNN if the DPCA components of the RNN areas were shown, as they are shown in Fig.2g,h for the neural data.

      We include the PSTHs plotted onto the dPCA components here for Area 1 of the exemplar network. Dashed lines indicate a left reach, while solid lines indicate a right reach, and the color corresponds to the color of the selected target. As expected, we find that the dPCA components capture the separation between components. We emphasize that the trajectory paths along the decoder axes are not particularly meaningful to interpret, except to demonstrate whether variables can be decoded or not (as in Fig 2g,h, comparing DLPFC and PMd). The decoder axes of dPCA are not constrained in any way, in contrast to the readout (encoder) axis (see Methods). This is why our manuscript focuses on analyzing the readout axes. However, if the reviewer strongly prefers these plots to be put in the manuscript, we will add them.   

      Author response image 1.

      (4) The session-by-session decode analysis presented in Fig.2i suggests that DLPFC has mostly direction information while in Area 1 target information is on top, as suggested by Fig.3g. An additional decoding analysis on trial averaged neural data, i.e. a figure for neural data analogous to Fig.3g,h, would allow for a more straightforward and direct comparison between RNN and neural data. 

      We first clarify that we did not decode trial-averaged neural data for either recorded neural data or RNNs. In Fig 3g, h (for the RNN) all decoding was performed on single trial data and then averaged. We have revised the main manuscript to make this clear. The mean accuracies we reported for DLPFC and PMd in the text are therefore computed in the same way as the mean accuracies presented in Fig 3g, h. We believe this likely addresses your concern: i.e., the mean decode accuracies presented for both neural data and the RNN were computed the same way.

      If the above paragraph did not address your concern, we also wish to be clear that we presented the neural data as histograms, rather than a mean with standard error, because we found that accuracies were highly variable depending on electrode insertion location. For example, some insertions in DLPFC achieved only chance-level decoding performance for color and target configuration. For this reason, we prefer to keep the histogram, as it shows more information than the mean alone, which we report in the main text. However, if the reviewer strongly prefers a bar plot of these means, we will add it.

      (5) Line 129 mentions an analysis of single trials. But in Fig.2i,j sessions are analyzed. Please clarify.

      For each session, we decode from single trials and then average these decoding accuracies, leading to a per-session average decoding accuracy. Note that for each session, we record from different neurons. In the text, we also report the average over the sessions. We clarified this in the text and Methods.

      (6) Fig.4c,f show how color and direction axes align with the potent subspaces. We assume that the target axis was omitted here because it highly aligns with the color axis, yet we note that this was not pointed out explicitly.

      You are correct, and we revised the text to point this out explicitly.

      “We quantified how the color and direction axis were aligned with these potent and null spaces of the intra-areal recurrent dynamics matrix of Area 1 ($W^1_{rec}$). We did not include the target configuration axis for simplicity, since it highly aligns with the color axis for this network.”

      (7) The caption of Fig.4c reads: "Projections onto the potent space of the intra-areal dynamics for each area." Yet, they only show area 1 in Fig.4c, and the rest in a supplement figure. Please refer properly.

      Thank you for pointing this out. We updated the text to reference the supplementary figure.

      (8) Line 300: "We found the direction axis was more aligned with the potent space and the color axis was more aligned with the null space." They rather show that the color axis is as aligned to the potent space as a random vector, but nothing about the alignments with the null space. Contrarily, on line 379 they write "...with the important difference that color information isn't preferentially projected to a nullspace...". Please clarify.

      Thank you for pointing this out. We clarified the text to read: “We found the direction axis was more aligned with the potent space”. The text then states that the color axis is no more aligned with the potent space than a random vector: “In contrast, the color axis was aligned to a random vector.”
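
      One way to make this comparison concrete is sketched below. It assumes that the potent space is taken as the span of the top-k right singular vectors of the recurrent matrix; this definition, the choice of k, and the function names are assumptions made for illustration, and the Methods give the definition actually used.

```python
import numpy as np

def potent_alignment(W, axis, k):
    """Norm of the projection of a unit-normalized axis onto the assumed
    potent space: the top-k right singular vectors of W."""
    _, _, Vt = np.linalg.svd(W)
    potent_basis = Vt[:k]  # rows are orthonormal
    unit_axis = axis / np.linalg.norm(axis)
    return float(np.linalg.norm(potent_basis @ unit_axis))

def random_baseline(W, k, n_draws=1000, seed=0):
    """Mean alignment of random Gaussian directions with the same space."""
    rng = np.random.default_rng(seed)
    draws = rng.standard_normal((n_draws, W.shape[1]))
    return float(np.mean([potent_alignment(W, v, k) for v in draws]))
```

      Under these assumptions, an axis whose alignment is well above `random_baseline` is preferentially aligned with the potent space, while an axis whose alignment is close to the baseline is aligned no better than chance.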

      (9) Line 313: 'unconstrained' networks are mentioned. What constraints are implied there, Dale's law? Please define and clarify.

      Indeed, the constraint refers to Dale’s law constraints. We clarified the text: “Further, we found that W<sub>21</sub> in unconstrained 3 area networks (i.e., without Dale's law constraints) had significantly reduced…”

      (10) Line 355 mentions a 'feedforward bottleneck'. What does this exactly mean? No E-I feedforward connections, or...? Please define and clarify.

      This refers to sparser connections between areas than within an area, as well as a smaller fraction of E-I connections. We clarified the text to read:

      “Together, these results suggest that a connection bottleneck in the form of neurophysiological architecture constraints (i.e., sparser connections between areas than within an area, as well as a smaller fraction of E-I connections) was the key design choice leading to RNNs with minimal color representations and consistent with the information bottleneck principle.”

      (11) Fig.5c is supposedly without feedforward connections, but it looks like the plot depicts these connections (i.e. identical to Fig.5b).

      In Figure 5, we are varying the E to I connectivity in panel B, and the E-E connectivity in panel C. We vary the feedback connections in Supp Fig. S12. We updated the caption accordingly. 

      (12) For reference, it would be nice to have the parameters of the exemplar network indicated in the panels of Fig.5.

      We updated the caption to reference the parameter configuration in Table 1 of the Appendix.

      (13) Line 659: incomplete sentence

      Thank you for pointing this out. We removed this incomplete sentence.

      (14) In the methods section "Decoding and Mutual information for RNNs" a linear neural net decoder as well as a nonlinear neural net decoder are described, yet it was unclear which one was used in the end.

      We used the nonlinear network, and clarified the text accordingly. We obtained consistent conclusions using a linear network, but did not include these results in the text. (These are reported in Fig. 11 of Kleinman et al, 2021). Moreover, we also obtain consistent results by using an SVM decoder in Fig. S4 for our exemplar parameter configuration.

      (15) In the discussion, the paragraph starting from line 410 introduces a new set of results along with the benefits of minimal representations. This should go to the results section.

      We prefer to leave this as a discussion, since the task was potentially too simplistic to generate a clear conclusion on this matter. We believe this remains a discussion point for further investigation.

      (16) Fig S5: hard to parse. Show some arrows for trajectories (a) (d) is pretty mysterious: where do I see the slow dynamics?

      Slow points are denoted by crosses, which form an approximate line attractor. We clarified this in the caption.

      Reviewer #2 (Recommendations For The Authors):

      Minor recommendations (not ordered by importance)

      (1) Be more explicit that the recordings come from different monkeys and are not simultaneously recorded. For instance, say 'recordings from PFC or PMD'. Say early on that PMD recordings come from two monkeys and that PFC recordings come from 1 of those monkeys. Furthermore, I would highlight which datasets are novel and which are not. For instance, I believe the PFC dataset is a previously unpublished dataset and should be highlighted as such.

      We added: “The PMd data was previously described in a study by Chandrasekaran and colleagues” to the main text which clarifies that the PMd data was previously recorded and has been analyzed in other studies.

      (2) I personally feel that talking about 'optimal', as is done in the abstract, is a bit of a stretch for this simple task.

      In using the terminology “optimal,” we are following the convention of IB literature that optimal representations are sufficient and minimal. The term “optimal” therefore is task-specific; every task will have its own optimal representation. We clarify in the text that this definition comes from Machine Learning and Information Theory, stating:

      “The IB principle defines an optimal representation as a representation that is minimal and sufficient for a task or set of tasks.”

      In this way, we take an information-theoretic view for describing multi-area representations. This view was satisfactory for explaining and reconciling the multi-area recordings and simulations for this task, and we think it is helpful to provide a normative perspective for explaining the differences in cortical representations by brain area. Even though the task is simple, it still allows us to study how sensory/perceptual information is represented, as well as how choice-related information is represented.

      (3) It is mentioned (and even highlighted) in the abstract that we don't know why the brain distributes computations. I agree with that statement, but I don't think this manuscript answers that question. Relatedly, the introduction mentions robustness as one reason why the brain would distribute computations, but then raises the question of whether there is 'also a computational benefit for distributing computations across multiple areas'. Isn't the latter (robustness) a clear 'computational benefit'?

      We decided to keep the word “why” in the abstract, because this is a generally true statement (it is unclear why the brain distributes computation) that we wish to convey succinctly, pointing to the importance of studying this relatively grand question (which could only be fully answered by many studies over decades). We consider this the setting of our work. However, to avoid confusion that we are trying to give a full answer to this question, we are now more precise in the first paragraph of our introduction as to the particular questions we ask that will take a step towards this question. In particular, the first paragraph now asks these questions, which we answer in our study.

      “For example, is all stimuli and decision-related information present in all brain areas, or do the cortical representations differ depending on their processing stage? If the representations differ, are there general principles that can explain why the cortical representations differ by brain area?”

      We also removed the language on robustness, as we agree it was confusing. Thank you for these suggestions. 

      (4) Figure 2e and Fig. 3d, left, do not look very similar. I suggest zooming in or rotating Figure 2 to highlight the similarities. Consider generating a baseline CCA correlation using some sort of data shuffle to highlight the differences.

      The main point of the trajectories is to demonstrate that both Area 1 and DLPFC represent both color and direction. We now clarify this in the manuscript. However, we do not intend for these two plots to be a rigorous comparison of similarity. Rather, we quantify similarity using CCA and our decoding analysis. We also better emphasize the relative values of the CCA, rather than the absolute values.

      (5) Line 152: 'For this analysis, we restricted it to sessions with significant decode accuracy with a session considered to have a significant decodability for a variable if the true accuracy was above the 99th percentile of the shuffled accuracy for a session.' Why? Sounds fishy, especially if one is building a case on 'non-decodability'. I would either not do it or better justify it.

      The reason to choose only sessions with significant decoding accuracy is that we consider those sessions to be the sessions containing information of task variables. In response to this comment, we also now generate a plot with all recording sessions in Supplementary Figure S7. We modified the manuscript accordingly.

      “For this analysis, we restricted it to sessions with significant decode accuracy with a session considered to have a significant decodability for a variable if the true accuracy was above the 99th percentile of the shuffled accuracy for a session. This is because these sessions contain information about task variables. However, we also present the same analyses using all sessions in Fig. S7.”
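      The selection criterion quoted above can be illustrated with a short permutation test. This is only a sketch under stated assumptions, not the analysis code used in the study: the function names, the number of shuffles, and the choice to shuffle labels against fixed decoder predictions (rather than retraining the decoder on every shuffle) are hypothetical simplifications.

```python
import random


def percentile(values, q):
    """Linearly interpolated percentile of `values`, with q in [0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def accuracy(predictions, labels):
    """Fraction of trials on which the decoder output matches the label."""
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)


def is_significant(predictions, labels, n_shuffles=1000, q=99.0, seed=0):
    """Compare the true accuracy with the q-th percentile of a shuffle null.

    Assumption: the null is built by permuting the labels while keeping the
    predictions fixed. Returns (significant, true_accuracy, threshold).
    """
    rng = random.Random(seed)
    true_acc = accuracy(predictions, labels)
    shuffled = list(labels)
    null = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # in-place permutation of the labels
        null.append(accuracy(predictions, shuffled))
    threshold = percentile(null, q)
    return true_acc > threshold, true_acc, threshold


if __name__ == "__main__":
    labels = [0, 1] * 100          # hypothetical balanced session
    constant_guess = [0] * 200     # decoder that carries no information
    print(is_significant(labels, labels)[0])          # informative decoder
    print(is_significant(constant_guess, labels)[0])  # uninformative decoder
```

      Under this criterion, a session whose decoder output carries no label information can never exceed its own shuffle distribution, so it is excluded from the restricted analysis but still appears in an all-session analysis such as the one in Fig. S7.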

      (6) Line 232: 'The RNN therefore models many aspects of our physiological data and is therefore'. Many seems a stretch?

      We changed “many” to “key.”

      (7) The illustration in Fig. 4B is very hard to understand, I recommend removing it.

      We are unsure what this refers to, as Figure 4B represents data of axis overlaps and is not an illustration. 

      (8) At some point the authors use IB instead of information bottleneck (eg line 288), I would not do it.

      We now clearly write that IB is an abbreviation of Information Bottleneck the first time it is introduced.  

      (9) Fig. 5 caption is insufficient to understand it. Text in the main document does not help. I would move most part of this figure, or at least F, to supplementary. Instead, I would move the results in S11 and S10 to the main document.

      We clarified the caption to summarize the key points. It now reads: 

      “Overall, neurophysiological architecture constraints in the form of multiple areas, sparser connections between areas than within an area, as well as a smaller fraction of E-I connections lead to a minimal color representation in the last area.”

      (10) Line 355: 'Together, these results suggest that a connection bottleneck in the form of neurophysiological architecture constraints was the key design choice leading to RNNs with minimal color representations and consistent with the information bottleneck principle.' The authors show convincingly that increased sparsity leads to the removal of irrelevant information. There is an alternative model of the communication subspace hypothesis that uses low-rank matrices, instead of sparse, to implement said bottlenecks (https://www.biorxiv.org/content/10.1101/2022.07.21.500962v2)

      We thank the reviewer for pointing us to this very nice paper. Indeed, a low-rank connectivity matrix is another mechanism to limit the amount of information that is passed to subsequent areas. In fact, the low-rank matrix forms a hard version of our observations, as we found that task-relevant information was preferentially propagated along the top singular mode of the inter-areal connectivity matrix. In our paper, we observed that this tendency emerges naturally through training with neurophysiological architecture constraints. In the referenced paper, the multi-area network was hand-engineered, whereas our network is trained. We added this reference to our discussion.
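      The relationship between a soft bottleneck (a dominant top singular mode) and a hard bottleneck (a rank-one matrix) can be made concrete with a toy example. The sketch below is illustrative only: the 2x2 matrix, the function names, and the use of power iteration are assumptions chosen for brevity, and none of it is taken from the trained RNNs.

```python
import math


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def top_singular_mode(W, n_iter=200):
    """Power iteration on W^T W. Returns (sigma, u, v) with W v = sigma u.

    Assumes the all-ones start vector is not orthogonal to the top mode.
    """
    Wt = transpose(W)
    v = [1.0] * len(W[0])
    for _ in range(n_iter):
        v = matvec(Wt, matvec(W, v))
        n = norm(v)
        v = [x / n for x in v]
    Wv = matvec(W, v)
    sigma = norm(Wv)
    u = [x / sigma for x in Wv]
    return sigma, u, v


def rank_one(W):
    """Keep only the top singular mode: the 'hard' version of the bottleneck."""
    sigma, u, v = top_singular_mode(W)
    return [[sigma * ui * vj for vj in v] for ui in u]


if __name__ == "__main__":
    # Hypothetical inter-areal weights with singular values 4 and 2.
    W = [[3.0, 1.0], [1.0, 3.0]]
    along_top = [1.0, 1.0]     # input aligned with the top singular mode
    off_axis = [1.0, -1.0]     # input aligned with the second mode
    print(matvec(W, along_top), matvec(W, off_axis))
    print(matvec(rank_one(W), along_top), matvec(rank_one(W), off_axis))
```

      In the full matrix the off-axis input is passed on with a smaller gain than the input along the top mode, whereas the rank-one matrix removes it entirely, which is the sense in which a low-rank constraint is a hard version of a preferentially used top singular mode.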

      Thank you for your helpful and constructive comments.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this work by Wang et al., the authors use single-molecule super-resolution microscopy together with biochemical assays to quantify the organization of Nipah virus fusion protein F (NiV-F) on cell and viral membranes. They find that these proteins form nanoscale clusters that favor membrane fusion activation, and that the physical parameters of these clusters are unaffected by protein expression level and endosomal cleavage. Furthermore, they find that the cluster organization is affected by mutations in the trimer interface on the NiV-F ectodomain and the putative oligomerization motif on the transmembrane domain, and that the clusters are stabilized by interactions among NiV-F, the AP2-complex, and the clathrin coat assembly. This work improves our understanding of the NiV fusion machinery, which may have implications also for our understanding of the function of other viruses.

      Strengths:

      The conclusions of this paper are well-supported by the presented data. This study sheds light on the activation mechanisms underlying the NiV fusion machinery.

      Weaknesses:

      The authors provide limited details of the convolutional neural network they developed in this work. Even though custom-codes are made available, a description of the network and specifications of how it was used in this work would aid the readers in assessing its performance and applicability. The same holds for the custom-written OPTICS algorithm. Furthermore, limited details are provided for the imaging setup, oxygen scavenging buffer, and analysis for the single-molecule data, which limits reproducibility in other laboratories. The claim of 10 nm resolution is not backed up by data and seems low given the imaging conditions and fluorophores used. Fourier Ring Correlation analysis would have validated this claim. If the authors refer to localization precision rather than resolution, then this should be specified and appropriate data provided to support this claim.

      We thank reviewer 1 for these suggestions. We described key steps in the imaging setup, single-molecule data reconstruction, the OPTICS algorithm for cluster identification, and the 1D CNN for classification of the OPTICS data in the Materials and Methods section. We also provided a recipe for the imaging buffer. We refer to 10 nm localization precision rather than resolution. The localization precision achieved by our SMLM system is shown in Author response image 1.

      Author response image 1.

      The localization precision of the custom-built SMLM. The distributions of localization error in the x (dX), y (dY), and z (dZ) directions, in nanometers, are shown for blinks generated from Alexa Fluor 647 labeling NiV-F expressed on the plasma membrane of PK13 cells. The lateral precision is <10 nm and the axial precision is <20 nm.

      Reviewer #2 (Public Review): 

      Summary:

      In this manuscript, Wang and co-workers employ single-molecule localization microscopy (SMLM) to detect NiV fusion protein (NiV-F) on the surface of cells. They corroborate that these glycoproteins form microclusters (previously seen and characterized together with the NiV-G and Nipah Matrix protein by Liu and co-workers (2018), also with super-resolution light microscopy). As also seen by Liu and co-workers, the authors show that neither the level of expression of NiV-F nor endosomal cleavage alters the identity of these microclusters. Moreover, mutations in the transmembrane domain or the hexamer-of-trimer interface seem to have a mild effect on the size of the clusters that the authors quantified.

      Importantly, it has also been shown that these particles tend to cluster in Nipah VLPs.

      We thank reviewer #2 for the comments and suggestions. This paper builds on Liu et al. 1 to further characterize the nanoclusters formed by NiV-F and their role in membrane fusion activation. While Liu et al. studied the NiV glycoprotein distribution at the NiV assembly sites to inform mechanisms in NiV assembly and release, Wang et al. analyzed the nano-organization and distribution of NiV-F in the prefusion conformation, providing insights into the membrane fusion activation mechanisms.

      Strengths:

      The authors have tried to perform SMLM in single VLPs and have shown partially the importance of NiV-F clustering.

      Weaknesses:

      The labelling strategy for the NiV-F is not sufficiently explained. The use of a FLAG tag in the extracellular domain should be validated and compared with the unlabelled WT NiV-F when expressed in functional pseudoviruses (for example HIV-1 based particles decorated with NiV-F). This experiment should also be carried out for both infection and fusion (including BlaM-Vpr as a readout for fusion). I would also suggest running a time-of-addition BlaM experiment to understand how this particular labelling strategy affects single virion fusion as compared to the WT.

      We thank reviewer #2 for this suggestion. We have made various efforts to validate the expression and function of FLAG-tagged NiV-F. NiV-F-FLAG shows cell surface expression levels comparable to those of untagged NiV-F and induces similar levels of cell-cell fusion in 293T cells 1. NiV-F-FLAG also showed levels of virus entry similar to those of untagged NiV-F when both were pseudotyped on a recombinant Vesicular Stomatitis Virus (VSV) with the VSV glycoprotein replaced by a Renilla luciferase reporter gene (VSV-ΔG-rLuc; Fig. S1D). We also performed a virus entry kinetics assay using NiV VLPs expressing NiV-M-β-lactamase (NiV-M-Bla), NiV-G-HA, and NiV-F-FLAG, NiV-F-AU1 or untagged NiV-F. The intracellular AU1 tag is located at the C-terminus of NiV-F (GenBank accession no. AY816748.1). However, we detected different levels of NiV-M-Bla in equal volumes of VLPs, suggesting that the tags in NiV-F affect the budding of the VLPs (Author response image 2A). Therefore, we performed the fusion kinetics assay using VLPs expressing the same levels of NiV-M-Bla. Among them, NiV-F-FLAG on VLPs shows the most efficient fusion between VLP and HEK293T cell membranes (Author response image 2B), significantly more efficient than that of untagged NiV-F and NiV-F-AU1. However, we cannot attribute the enhanced fusion activity to the FLAG tag, because the readout of this assay relies on both the levels of β-lactamase (introduced by NiV-M-Bla in VLPs) and the NiV-F constructs. The tags in NiV-F could affect both the budding of VLPs and the stoichiometry of F and M in individual VLPs. We did not use the HIV-based pseudovirus system because the incorporation of NiV-F into HIV pseudoviruses requires a C-terminal deletion 2,3.

      In summary, the FLAG tag does not affect cell-cell fusion 1 or virus entry when pseudotyped on the recombinant VSV-ΔG-rLuc viruses (Fig. S1D). Given that we do not observe any difference in clustering between HA- and FLAG-tagged NiV-F constructs on the PK13 cell surface (Fig. S1A-C), we conclude that the FLAG tag has minimal effect on both the fusion activity and the nanoscale distribution of NiV-F.

      Author response image 2.

      Viral entry is not affected by labeling of NiV-F. A) Western blot analysis of NiV-M-Bla in NiV-VLPs generated by HEK293T cells expressing NiV-M-Bla, NiV-G-HA and NiV-F-FLAG, untagged NiV-F, or NiV-F-AU1. Equal volumes of VLPs were separated by denaturing 10% SDS–PAGE and probed against β-lactamase (SANTA CRUZ, sc-66062). B) NiV-VLPs bearing NiV-M-Bla, NiV-G-HA, and NiV-F-FLAG, untagged NiV-F or NiV-F-AU1 were bound to target HEK293T cells loaded with CCF2-AM dye at 4°C. The Blue/Green (B/G) ratio was measured at 37°C for 4 hrs at 3-min intervals. Results were normalized to the maximal B/G ratio of NiV-F-FLAG NiV-VLPs. Results from one representative experiment out of three independent experiments are shown.

      It would also be very important to compare the FLAG labelling approach with recent advances in the field (for instance incorporating noncanonical amino acids (ncAAs) into NiV-F by amber stop-codon suppression, followed by click chemistry).

      We are very thankful for this comment from reviewer #2. Labeling noncanonical amino acids (ncAAs) with bioorthogonal click chemistry is indeed a more precise labeling strategy compared to the traditional epitope labeling approach used in this paper. We will explore the applications of ncAA labeling in single-molecule localization imaging and virus-host interactions in future projects.

      In this paper, the FLAG tag inserted in NiV-F protein seems to have minimal effect on the NiV-F-induced virus entry and cell-cell fusion 1 (Fig. S1). Although the FLAG tag labeling approach may increase the detectable size of NiV-F nanoclusters due to the use of the antibody complex, it should not affect our conclusions drawn from the relative comparisons between wt and mutant NiV-F or control and drug-treated cells. 

      The correlation between the existence of microclusters of a particular size and their functionality is missing. Only cell-cell fusion assays are shown in supplementary figures and clearly, single virus entry and fusion cannot be compared with the biophysics of cell-cell fusion. Not only the environment is completely different, membrane curvature and the number of NiV-F drastically varies also. Therefore, specific fusion assays (either single virus tracking and/or time-of-addition BlaM kinetics with functional pseudoviruses) are needed to substantiate this claim.  

      We thank Reviewer 2 for the suggestion. To support the link between F clustering and virus-cell membrane fusion, we conducted pseudotyped virus entry and VLP fusion kinetics assays, as shown in revised Figure S4. The viral entry results (Fig. S4 E and F) corroborate those of the cell-cell fusion assay (Fig. S4A and B) and previously published data 4. The assay confirmed that the real-time fusion kinetics were affected by mutations at the hexameric interface, with the hypo-fusogenic mutants L53D and V108D exhibiting reduced entry efficiency and the hyper-fusogenic mutant Q393L showing increased efficiency (Fig. S4G and H). The results are described in detail in the revised manuscript.

      Additionally, we performed a pseudotyped virus entry assay on the LI4A (Fig. S6F and G) and YA (Fig. S7F and G) mutants to verify the function of these mutants on viruses in the revised Supplemental Figures. Neither LI4A nor YA was incorporated into the VSV/NiV pseudotyped viruses, as shown by the Western blot analyses of the pseudovirions (Fig. S6F and S7F), and thus neither induced virus entry, consistent with the cell-cell fusion results (Fig. S6C, D and Fig. S7C, D). We did not perform the entry kinetics assay for these two mutants as they do not incorporate into VLPs or pseudovirions.

      The authors also claim they could not characterize the number of NiV-F particles per cluster. Another technique such as number and brightness (Digman et al., 2008) could support current SMLM data and identify the number of single molecules per cluster. Also, this technology does not require complex microscopy apparatus. I suggest they perform either confocal fluorescence fluctuation spectroscopy or TIRF-based nandb to validate the clusters and identify how many molecules are present in these clusters.

      We thank reviewer 2 for this suggestion. Determining the true copy number of NiV-F in individual clusters could verify whether the F clusters on the plasma membrane are hexamer-of-trimer assemblies. Regardless, it does not affect our conclusion that the organization of NiV-F into nanoclusters affects the membrane fusion triggering ability. Confocal fluorescence fluctuation spectroscopy (FFS) and TIRF-based analyses are accessible tools for quantifying fluorophore copy numbers and/or stoichiometry based on fluorescence fluctuation or photobleaching. However, these methods are unable to quantify the number of proteins in individual clusters because they analyze fluorophores either in the entire cell (as in wide-field epifluorescence microscopy coupled with FFS and TIRF-coupled photobleaching) 5–7 or within a large excitation volume (confocal laser scanning microscopy-coupled FFS) 8. Both of these volumes are significantly larger than a single NiV-F cluster, which has an average diameter of 24-26 nm (Fig. 1F).

      The current SMLM setup is useful for characterizing protein distribution and organization. However, quantifying the true protein copy number within a nanocluster is challenging because of the stochasticity of fluorophore blinking and the unknown labeling stoichiometry 9–11. To address the challenge of fluorophore blinking, quantitative DNA-PAINT (qDNA-PAINT) may be used because the on-off frequency of the fluorophores is tied to the well-defined kinetic constants of DNA binding and the influx rate of the imager strands, rather than the stochasticity of fluorophore blinking. Thus, the frequency of blinks can be translated into protein counts 12. To address the challenge of unknown labeling stoichiometry, DNA origami can be used as a calibration standard 11. DNA origami supports handles at regular spacings of several to tens of nanometers, and the handles can be conjugated with a defined number of proteins of interest. The copy number of the protein of interest in the experimental group can be determined by comparing the SMLM localization distribution of the sample to that of the DNA origami calibration standard. Given the requirement for a more sophisticated SMLM setup and a high-precision calibration tool, we will explore the quantification of NiV-F copy numbers in nanoclusters in a future project.
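      The counting principle behind qDNA-PAINT can be summarized in a few lines. The sketch below is a hedged illustration rather than a description of any planned experiment. It assumes the standard qPAINT relation in which the mean dark time of a cluster is inversely proportional to the number of binding sites and to the imager influx rate. The function names and all numerical values are hypothetical.

```python
def influx_rate_from_kinetics(k_on_per_molar_s, imager_conc_molar):
    """Binding events per second at a single site: k_on * c_imager."""
    return k_on_per_molar_s * imager_conc_molar


def influx_rate_from_standard(mean_dark_time_s, known_sites):
    """Calibrate the influx rate on a standard with a known number of sites
    (for example a DNA origami structure), using
    mean_dark_time = 1 / (known_sites * influx_rate)."""
    if mean_dark_time_s <= 0 or known_sites <= 0:
        raise ValueError("dark time and site count must be positive")
    return 1.0 / (mean_dark_time_s * known_sites)


def estimate_copy_number(mean_dark_time_s, influx_rate_per_s):
    """Number of binding sites in a cluster from its mean dark time."""
    if mean_dark_time_s <= 0 or influx_rate_per_s <= 0:
        raise ValueError("dark time and influx rate must be positive")
    return 1.0 / (influx_rate_per_s * mean_dark_time_s)


if __name__ == "__main__":
    # Hypothetical values: k_on = 1e6 per molar per second, 10 nM imager.
    rate = influx_rate_from_kinetics(1e6, 10e-9)
    # A cluster with a mean dark time of 10 s under these assumed conditions.
    print(estimate_copy_number(10.0, rate))
```

      The estimate counts binding sites, not proteins, so converting it to a protein copy number still requires the labeling stoichiometry, which is the role of the calibration standard described above.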

      Also, it is not clear how many cells the authors employ for their statistics (at least 30-50 cells should be employed, rather than considering the number of blinking events). I hope the authors are not considering only a single cell to run their stats... The differences between the mutants and the NiV-F are minor even if their statistical analyses give a difference (they should average the number and size of the clusters per cell for a total of 30-50 cells with experiments performed at least in three different cells following the same protocol). Overall, it seems that the authors have only evaluated a very low number of cells.

      We disagree with this comment from Reviewer #2. The sample size for cluster analysis in SMLM images was chosen by considering the target of the study (cells and VLPs) and the data acquisition and analysis standards in the SMLM imaging field. We also noted the sample size (# of ROI and cells) in the figure legend. 

      Below, we compare the sample sizes in our study to those in similar studies that used comparable imaging and cluster analysis methods from 2015 to 2024. The classical clustering analysis methods are categorized into global clustering (e.g. nearest neighbor analysis, Ripley’s K function, and pair correlation function) and complete clustering, such as density-based analysis (e.g. DBSCAN, Superstructure, FOCAL, ToMATo) and tessellation-based analysis (e.g. Delaunay triangulation, Voronoi tessellation). The global clustering analysis method provides spatial statistics for global protein clustering or organization (e.g. clustering extent), while the complete clustering approach extracts information at the single-cluster level, such as the morphology and localization density of individual clusters. We used the density-based analyses, DBSCAN and OPTICS, for cluster analysis on cell plasma membranes and VLP membranes.
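      To make the idea of density-based, single-cluster analysis concrete, a minimal DBSCAN can be written in a few dozen lines. This is a teaching sketch and not the custom code used in the study: the function names, the toy coordinates, and the parameter values (`eps`, `min_pts`) are assumptions chosen for illustration.

```python
import math


def region_query(points, i, eps):
    """Indices of all points within distance eps of point i (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster index (0, 1, ...) or -1 for noise."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = noise  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is a core point: keep growing
                queue.extend(j_neighbors)
    return labels


if __name__ == "__main__":
    # Hypothetical localizations (arbitrary units): two dense groups and
    # one isolated localization between them.
    group_a = [(0, 0), (1, 0), (0, 1), (1, 1)]
    group_b = [(100, 100), (101, 100), (100, 101), (101, 101)]
    points = group_a + [(50, 50)] + group_b
    print(dbscan(points, eps=2.0, min_pts=3))
```

      Because every localization receives its own label, per-cluster quantities such as size or localization density can then be computed cluster by cluster, which is what distinguishes this family of methods from global statistics such as Ripley's K function.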

      Author response table 1.

      The comparison of imaging methods, analysis methods, and sample size in the current study to other studies conducted from 2015 to 2024.

      They should also compare the level of expression (with the number of molecules per cell provided by number and brightness) with the total number of clusters. 

      We thank reviewer 2 for this suggestion. We compared the level of expression with the total number of clusters for F-WT in Figure 1I in the main text.  

      The same applies to the VLP assay. I assume the authors have only taken VLPs expressing both NiV-M and NiV-F (and NiV-G). But even if this is not clearly stated I would urge the authors to show how many viruses were compared per condition (normally I would expect 300 particles per condition coming from three independent experiments). As a negative control to evaluate the cluster effect I would mix the different conditions. Clearly you have clusters with all conditions and the differences in clustering depending on each condition are minimal. Therefore you need to increase the n for all experiments.

      We thank reviewer 2 for this comment. We acquired and analyzed more images of NiV VLPs bearing F-WT, Q393L, L53D, and V108D. Results are shown in the revised Figure 4 and the number of VLPs (>300) used for analysis is specified in the figure legend. An increased number of VLP images does not affect the classification result in Figure 4C. 

      As for the suggestion on “evaluating the cluster effect at different mixed conditions”, we assume that reviewer 2 would like to see how the presence of different viral structural proteins (F, M, and G) on VLPs could affect F clustering. We showed by direct visualization that the organization of NiV envelope proteins on the VLP membrane is similar in the presence or absence of NiV-M 27, suggesting that the effect of NiV-M on F-WT clustering on VLPs is minimal. We also show comparable incorporation of NiV-F among the NiV-F hexamer-of-trimer mutants (Fig. 4A). Therefore, we did not test F clustering at different F, M, and G combinations in this paper. However, this could be an interesting question to pursue in a paper focusing on NiV VLP production.

      Reviewer #3 (Public Review):

      Summary:

      The manuscript by Wang and colleagues describes single molecule localization microscopy to quantify the distribution and organization of Nipah virus F expressed on cells and on virus-like particles. Notably the crystal structure of F indicated hexameric assemblies of F trimers. The authors propose that F clustering favors membrane fusion.

      Strengths:

      The manuscript provides solid data on imaging of F clustering with the main findings of:

      -  F clusters are independent of expression levels

      -  Proteolytic cleavage does not affect F clustering

      -  Mutations that have been reported to affect the hexamer interface reduce clustering on cells and its distribution on VLPs

      -  F nanoclusters are stabilized by AP

      Weaknesses:

      The relationship between F clustering and fusion is per se interesting, but looking at F clusters on the plasma membrane does not exclude that F clustering occurs for budding. Many viral glycoproteins cluster at the plasma membrane to generate micro domains for budding. 

      This does not exclude that these clusters include hexamer assemblies or clustering requires hexamer assemblies. 

      We thank reviewer #3 for this question. We did not focus on the role of NiV-F clusters in budding in the current manuscript, although this is an interesting topic to pursue. In this manuscript, we observed that NiV VLP budding is decreased for some cluster-disrupting mutants, such as F-YA and F-LI4A. However, F-V108D showed increased budding compared to F-WT (Fig. 4A). We also observed that VLPs and VSV/NiV pseudoviruses expressing L53D have little NiV-G (Fig. 4A, Fig. S4F and S4H), although the incorporation level of L53D is comparable to that of wt F in both VLPs and pseudovirions (Fig. 4A and Fig. S4F). L53D is a hypofusogenic mutant with decreased clustering ability. Therefore, our current data do not show a clear link between F clustering and NiV VLP budding or glycoprotein incorporation.

      We reported that both NiV-F and -M form clusters at the plasma membrane, although NiV-F clusters are not enriched at the NiV-M positive membrane domains 1. This result indicates that NiV-M is the major driving force for assembly and budding, while NiV-F is passively incorporated into the assembly sites. The central role of NiV-M in budding is also supported by a recent study showing that NiV-M induces membrane curvature by binding to PI(4,5)P2 in the inner leaflet of the plasma membrane 28. However, the expression of NiV-F alone induces the production of vesicles bearing NiV-F 29, and NiV-F recruits vesicular trafficking and actin cytoskeleton factors to VLPs either alone or in combination with NiV-G and -M, indicating a potential autonomous role in budding 30. Additionally, several electron microscopy studies show that the paramyxovirus F forms a 2D lattice interspersed above the M lattice, suggesting the participation of F in virus assembly and budding. Overall, the evidence above suggests that NiV-F may play a role in budding, but our data cannot correlate NiV-F clustering with budding.

      Assuming that the clusters are important for entry, hexameric clusters are not unique to Nipah virus F. Similar hexameric clusters have been described for the HEF on influenza virus C particles (Halldorsson et al 2021) and env organization on Foamy virus particles (Effantin et al 2016), both with specific interactions between trimers. What is the organization of F on Nipah virus particles? If F requires to be hexameric for entry, this should be easily imaged by EM on infectious or inactivated virus particles. 

      We thank reviewer #3 for this suggestion. The hexamer-of-trimer NiV-F is observed on the VLP surface by electron tomography 4. The NiV-F hexamer-of-trimers are arranged into a soccer ball-like structure, with one trimer being part of multiple hexamer-of-trimers. The implication of NiV-F clusters in virus entry and the potential mechanism for NiV-F higher-order structure formation are discussed in the revised manuscript.

      AP stabilization of the F clusters is curious if the clusters are solely required for entry? Virus entry does not recruit the clathrin machinery. Is it possible that F clusters are endocytosed in the absence of budding? 

      We thank reviewer #3 for this question. The evidence from the current study does not exclude a role of NiV-F clustering in virus budding. NiV-F is known to be endocytosed in virus-producing cells for cleavage by Cathepsin B or L in endocytic compartments in a pH-dependent manner 31–33 in the absence of budding. However, given that both cleaved and uncleaved NiV-F have an endocytosis signal sequence in the cytoplasmic tail and are able to interact with AP-2 for endosome assembly, and that cleaved and uncleaved F may have similar clustering patterns (Fig. 2), we do not think NiV-F clustering is specifically regulated for the cleavage of NiV-F. A plausible hypothesis is that NiV-F clusters are stabilized by multiple intrinsic factors (e.g. trimer interface) and host factors (e.g. AP-2) on the cell membrane for cell-cell fusion and virus budding. We linked the clustering to the fusion ability of NiV-F in this study, but NiV-F clustering may also be important in facilitating virus budding. Once in the viruses, the higher-order assembly of the clusters (e.g. lattice) may form due to protein enrichment, and the cell factors may not be the major maintenance force.

      Clusters are required for budding. 

      Other points:

      Fig. 3: Some of the V108D and L53D clusters look similar in size to wt clusters. It seems that the interaction is important but not absolutely essential. Would a double mutant abrogate clustering completely?

      We thank Reviewer #3 for the suggestion. We generated a double mutant of NiV-F with L53D and V108D (NiV-F-LV) and assessed its expression and processing. Although the mutant retained processing capability, it exhibited minimal surface expression, making it unfeasible to analyze its nano-organization on the cell or viral membrane.

      Author response image 4.

      The expression and fusion activity of FLAG-tagged NiV-F and NiV-F L53D-V108D (LV). (A) Representative western blot analysis of NiV-F-WT and LV in the cell lysate of 293T cells. 293T cells were transfected with NiV-F-WT or the LV mutant. The empty vector was used as a negative control. The cell lysates were analyzed by SDS-PAGE followed by western blotting at 28 hrs post-transfection. F0 and F2 were probed by the M2 monoclonal mouse anti-FLAG antibody. GAPDH was probed by monoclonal mouse anti-GAPDH. (B) Representative images of 293T cell-cell fusion induced by NiV-G and NiV-F-WT or NiV-F-LV. 293T cells were co-transfected with plasmids coding for NiV-G and empty vector (NC) or NiV-F constructs. Cells were fixed at 18 hrs post-transfection. Arrows point to syncytia. Scale bar: 10 μm. (C) Relative cell-cell fusion levels in 293T cells in (B). Five fields per experiment were counted from three independent experiments. Data are presented as mean ± SEM. (D) The cell surface expression levels of NiV-F-WT and NiV-F-LV in 293T cells measured by flow cytometry. Mean fluorescence intensity (MFI) values were calculated by FlowJo and normalized to that of F-WT. Data are presented as mean ± SEM of three independent experiments. Statistical significance was determined by the unpaired t-test with Welch’s correction (*P<0.05, **P<0.01, ***P<0.001, ****P<0.0001). Values were compared to that of the NiV-F-WT.

      Fig. 4: The distribution of F on VLPs should be confirmed by cryoEM analyses. This would also confirm the symmetry of the clusters. The manuscript by Chernomordik et al. JBC 2004 showed that influenza HA outside the direct contact zone affects fusion, which could be further elaborated in the context of F clusters and the fusion mechanism.

      We thank reviewer 3 for this suggestion. The distribution of F on VLPs was resolved by electron tomography, which showed that the NiV-F hexamer-of-trimers are arranged into a soccer ball-like structure 4. The role of influenza HA outside of the contact zone in fusion activation is an interesting phenomenon. It may bear on the transmission of energy within and among clusters. We will pursue this topic in a future project.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      •  Please define all used abbreviations throughout the manuscript and in the SI.

      We defined the abbreviations at their first usage. 

      •  The sentence starting with "Additionally, ..." on line 155 appears to be incomplete.

      We corrected this sentence.  

      •  The statement starting with "As reported, ..." on line 181 should be supported by a reference.

      We added a reference. 

      •  In Fig. 4C, it is unclear what the x and y axes represent.  

      Fig. 4C is a t-SNE plot for visualizing high-dimensional data in a low-dimensional space. It maintains the local data structure but does not represent exact quantitative relationships. In other words, points that are close together in Fig. 4C are also close in the high-dimensional space, meaning the OPTICS plots, which reflect the clustering patterns, are similar for two points that are positioned near each other in Fig. 4C. Therefore, the x and y axes do not represent the original, quantitative data, and thus the axis titles are meaningless.  

      •  The reference on line 306 appears to be unformatted.

      We reformatted the reference.  

      Reviewer #2 (Recommendations For The Authors):

      The authors need to include the overall statistics for each experiment (at least 30 to 50 cells with three independent experiments are needed). 

      We highlighted the sample size (number of ROI and number of cells) used for analysis in the figure legend. The determination of the sample size is justified in Table 1 in the response letter. 

      The authors need to generate a functional pseudovirus system (for example HIVpp/NiV F) to run both infectivity and fusion experiments (including Apr-BlaM assay). 

      We tested viral entry using a VSV/NiV pseudovirus system and the viral entry kinetics using VLPs expressing NiV-M-β-lactamase. The results are presented in Fig. S1, S4, S6, and S7.  

      Reviewer #3 (Recommendations For The Authors):

      Even low resolution EM data on VLPs or viruses would strengthen the conclusions.

      We thank this reviewer for the suggestion. We cited the NiV VLP images acquired by electron tomography (4), but we currently have limited resources to perform cryoEM on NiV VLPs.

      References.

      (1) Liu, Q., Chen, L., Aguilar, H. C. & Chou, K. C. A stochastic assembly model for Nipah virus revealed by super-resolution microscopy. Nature Communications 9, 3050 (2018).

      (2) Khetawat, D. & Broder, C. C. A Functional Henipavirus Envelope Glycoprotein Pseudotyped Lentivirus Assay System. Virology Journal 7, 312 (2010).

      (3) Palomares, K. et al. Nipah Virus Envelope-Pseudotyped Lentiviruses Efficiently Target ephrinB2-Positive Stem Cell Populations In Vitro and Bypass the Liver Sink When Administered In Vivo. J Virol 87, 2094–2108 (2013).

      (4) Xu, K. et al. Crystal Structure of the Pre-fusion Nipah Virus Fusion Glycoprotein Reveals a Novel Hexamer-of-Trimers Assembly. PLoS Pathog 11, e1005322 (2015).

      (5) Bakker, E. & Swain, P. S. Estimating numbers of intracellular molecules through analysing fluctuations in photobleaching. Sci Rep 9, 15238 (2019).

      (6) Nayak, C. R. & Rutenberg, A. D. Quantification of Fluorophore Copy Number from Intrinsic Fluctuations during Fluorescence Photobleaching. Biophys J 101, 2284–2293 (2011).

      (7) Salavessa, L. & Sauvonnet, N. Stoichiometry of Receptors at the Plasma Membrane During Their Endocytosis Using Total Internal Reflection Fluorescent (TIRF) Microscopy Live Imaging and Single-Molecule Tracking. in Exocytosis and Endocytosis: Methods and Protocols (eds. Niedergang, F., Vitale, N. & Gasman, S.) 3–17 (Springer US, New York, NY, 2021). doi:10.1007/978-1-0716-1044-2_1.

      (8) Slenders, E. et al. Confocal-based fluorescence fluctuation spectroscopy with a SPAD array detector. Light Sci Appl 10, 31 (2021).

      (9) Annibale, P., Vanni, S., Scarselli, M., Rothlisberger, U. & Radenovic, A. Identification of clustering artifacts in photoactivated localization microscopy. Nat Methods 8, 527–528 (2011).

      (10) Baumgart, F. et al. Varying label density allows artifact-free analysis of membrane-protein nanoclusters. Nat Methods 13, 661–664 (2016).

      (11) Zanacchi, F. C. et al. A DNA origami platform for quantifying protein copy number in super-resolution. Nat Methods 14, 789–792 (2017).

      (12) Jungmann, R. et al. Multiplexed 3D cellular super-resolution imaging with DNA-PAINT and Exchange-PAINT. Nature Methods 11, 313–318 (2014).

      (13) Rubin-Delanchy, P. et al. Bayesian cluster identification in single-molecule localization microscopy data. Nat Methods 12, 1072–1076 (2015).

      (14) Griffié, J. et al. 3D Bayesian cluster analysis of super-resolution data reveals LAT recruitment to the T cell synapse. Sci Rep 7, 4077 (2017).

      (15) Griffié, J. et al. Dynamic Bayesian Cluster Analysis of Live-Cell Single Molecule Localization Microscopy Datasets. Small Methods (2018). https://onlinelibrary.wiley.com/doi/full/10.1002/smtd.201800008.

      (16) Caetano, F. A. et al. MIiSR: Molecular Interactions in Super-Resolution Imaging Enables the Analysis of Protein Interactions, Dynamics and Formation of Multi-protein Structures. PLOS Computational Biology 11, e1004634 (2015).

      (17) Malkusch, S. & Heilemann, M. Extracting quantitative information from single-molecule super-resolution imaging data with LAMA – LocAlization Microscopy Analyzer. Sci Rep 6, 34486 (2016).

      (18) Zhang, Y., Lara-Tejero, M., Bewersdorf, J. & Galán, J. E. Visualization and characterization of individual type III protein secretion machines in live bacteria. Proceedings of the National Academy of Sciences 114, 6098–6103 (2017).

      (19) Tobin, S. J. et al. Single molecule localization microscopy coupled with touch preparation for the quantification of trastuzumab-bound HER2. Sci Rep 8, 15154 (2018).

      (20) Levet, F. et al. SR-Tesseler: a method to segment and quantify localization-based super-resolution microscopy data. Nature Methods 12, 1065–1071 (2015).

      (21) Peters, R., Griffié, J., Burn, G. L., Williamson, D. J. & Owen, D. M. Quantitative fibre analysis of single-molecule localization microscopy data. Sci Rep 8, 10418 (2018).

      (22) Levet, F. et al. A tessellation-based colocalization analysis approach for single-molecule localization microscopy. Nat Commun 10, (2019).

      (23) Banerjee, C. et al. ULK1 forms distinct oligomeric states and nanoscopic structures during autophagy initiation. Science Advances 9, eadh4094 (2023).

      (24) Pageon, S. V. et al. Functional role of T-cell receptor nanoclusters in signal initiation and antigen discrimination. Proceedings of the National Academy of Sciences 113, E5454–E5463 (2016).

      (25) Cresens, C. et al. Flat clathrin lattices are linked to metastatic potential in colorectal cancer. iScience 26, 107327 (2023).

      (26) Seeling, M. et al. Immunoglobulin G-dependent inhibition of inflammatory bone remodeling requires pattern recognition receptor Dectin-1. Immunity 56, 1046-1063.e7 (2023).

      (27) Liu, Q. T. et al. The nanoscale organization of Nipah virus matrix protein revealed by super-resolution microscopy. Biophysical Journal 121, 2290–2296 (2022).

      (28) Norris, M. J. et al. Measles and Nipah virus assembly: Specific lipid binding drives matrix polymerization. Science Advances 8, eabn1440 (2022).

      (29) Patch, J. R. et al. The YPLGVG sequence of the Nipah virus matrix protein is required for budding. Virol. J. 5, 137 (2008).

      (30) Johnston, G. P. et al. Nipah Virus-Like Particle Egress Is Modulated by Cytoskeletal and Vesicular Trafficking Pathways: a Validated Particle Proteomics Analysis. mSystems 4, e00194-19 (2019).

      (31) Diederich, S. et al. Activation of the Nipah Virus Fusion Protein in MDCK Cells Is Mediated by Cathepsin B within the Endosome-Recycling Compartment. J Virol 86, 3736–3745 (2012).

      (32) Diederich, S., Thiel, L. & Maisner, A. Role of endocytosis and cathepsin-mediated activation in Nipah virus entry. Virology 375, 391–400 (2008).

      (33) Pager, C. T., Craft, W. W., Patch, J. & Dutch, R. E. A mature and fusogenic form of the Nipah virus fusion protein requires proteolytic processing by cathepsin L. Virology 346, 251–257 (2006).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The authors tested whether learning to suppress (ignore) salient distractors (e.g., a lone colored nontarget item) via statistical regularities (e.g., the distractor is more likely to appear in one location than any other) was proactive (prior to paying attention to the distractor) or reactive (only after first attending the distractor) in nature. To test between proactive and reactive suppression the authors relied on a recently developed and novel technique designed to "ping" the brain's hidden priority map using EEG inverted encoding models. Essentially, a neutral stimulus is presented to stimulate the brain, resulting in activity on a priority map which can be decoded and used to argue when this stimulation occurred (prior to or after attending to a distracting item). The authors found evidence that despite learning to suppress the high probability distractor location, the suppression was reactive, not proactive in nature.

      Overall, the manuscript is well-written, tests a timely question, and provides novel insight into a long-standing debate concerning distractor suppression.

      Strengths (in no particular order):

      (1) The manuscript is well-written, clear, and concise (especially given the complexities of the method and analyses).

      (2) The presentation of the logic and results is mostly clear and relatively easy to digest.

      (3) This question concerning whether location-based distractor suppression is proactive or reactive in nature is a timely question.

      (4) The use of the novel "pinging" technique is interesting and provides new insight into this particularly thorny debate over the mechanisms of distractor suppression.

      Weaknesses (in no particular order):

      (1) The authors tend to make overly bold claims without either A) mentioning the opposing claim(s) or B) citing the opposing theoretical positions. Further, the authors have neglected relevant findings regarding this specific debate between proactive and reactive suppression.

      (2) The authors should be more careful in setting up the debate by clearly defining the terms, especially proactive and reactive suppression which have recently been defined and were more ambiguously defined here.

      (3) There were some methodological choices that should be further justified, such as the choice of stimuli (e.g., sizes, colors, etc.).

      (4) The figures are often difficult to process. For example, the time courses are so far zoomed out (i.e., 0, 500, 100 ms with no other tick marks) that it makes it difficult to assess the timing of many of the patterns of data. Also, there is a lot of baseline period noise which complicates the interpretations of the data of interest.

      (5) Sometimes the authors fail to connect to the extant literature (e.g., by connecting to the ERP components, such as the N2pc and PD components, used to argue for or against proactive suppression) or when they do, overreach with claims (e.g., arguing suppression is reactive or feature-blind more generally).

      We thank the reviewer for their insightful feedback and have made several adjustments to address the concerns raised. To provide a balanced discussion, we tempered our claims about suppression mechanisms and incorporated additional references to opposing theoretical positions, including the signal suppression hypothesis, while clarifying the definitions of proactive and reactive suppression based on recent terminology (Liesefeld et al., 2024). We justified methodological choices, such as the slight size differences between stimuli to achieve perceptual equivalence and the randomization of target and distractor colors to mitigate potential luminance biases. We have revised our figures to improve their clarity. Lastly, while our counterbalanced design precluded reliable ERP assessments (e.g., N2pc, PD), we discussed their potential relevance for future research and ensured consistency with the broader literature on suppression mechanisms.

      Reviewer #2 (Public Review):

      Summary:

      The authors investigate the mechanisms supporting learning to suppress distractors at predictable locations, focusing on proactive suppression mechanisms manifesting before the onset of a distractor. They used EEG and inverted encoding models (IEM). The experimental paradigm alternates between a visual search task and a spatial memory task, followed by a placeholder screen acting as a 'ping' stimulus, i.e., a stimulus to reveal how learned distractor suppression affects hidden priority maps. Behaviorally, their results align with the effects of statistical learning on distractor suppression. Contrary to the proactive suppression hypothesis, which predicts reduced memory-specific tuning of neural representations at the expected distractor location, their IEM results indicate increased tuning at the high-probability distractor location following the placeholder and prior to the onset of the search display.

      Strengths:

      Overall, the manuscript is well-written and clear, and the research question is relevant and timely, given the ongoing debate on the roles of proactive and reactive components in distractor processing. The use of a secondary task and EEG/IEM to provide a direct assessment of hidden priority maps in anticipation of a distractor is, in principle, a clever approach. The study also provides behavioral results supporting prior literature on distractor suppression at high-probability locations.

      Weaknesses:

      (1) At a conceptual level, I understand the debate and opposing views, but I wonder whether it might be more comprehensive to present also the possibility that both proactive and reactive stages contribute to distractor suppression. For instance, anticipatory mechanisms (proactive) may involve expectations and signals that anticipate the expected distractor features, whereas reactive mechanisms contribute to the suppression and disengagement of attention.

      This is an excellent point. Indeed, while many studies, including our own, have tried to dissociate proactive from reactive mechanisms, as if it must be one or the other, the overall picture is arguably more nuanced. We have added a paragraph to the discussion on page 19 to address this. In addition (for more details, see our responses to your comments 3 and 5), we have added a paragraph in which we provide an alternative explanation of the current data in light of the dual-task nature of our experiment.

      (2) The authors focus on hidden priority maps in pre-distractor time windows, arguing that the results challenge a simple proactive view of distractor suppression. However, they do not provide evidence that reactive mechanisms are at play or related to the pinging effects found in the present paradigm. Is there a relationship between the tuning strength of CTF at the high-probability distractor location and the actual ability to suppress the distractor (e.g., behavioral performance)? Is there a relationship between CTF tuning and post-distractor ERP measures of distractor processing? While these may not be the original research questions, they emerge naturally and I believe should be discussed or noted as limitations.

      Thank you for raising these important points. While CTF slopes have been shown to provide spatially and temporally resolved tracking of covert spatial attention and memory representations at the group level, to the best of our knowledge, no study to date has found a reliable correlation between CTFs and behavior. Moreover, the predictive value of the learned suppression effect, while also highly reliable at the group level, has proven to be limited when it comes to individual-level performance (Ivanov et al., 2024; Hedge et al., 2018). Nevertheless, based on your suggestion, we explored whether there was a correlation between the averaged gradient slope within the time window where the placeholder revived the memory representation and the average distance slope in reaction times for the learned suppression effect. This correlation was not significant (r = 0.236, p = 0.267), which, considering our sample size and the reasons mentioned earlier, is not particularly surprising. Given that our sample size was chosen to measure group-level effects, we decided not to include this individual-differences analysis in the manuscript.

      Regarding the potential link between the CTF tuning profile and post-distractor ERP measures like N2pc and Pd, our experimental design presented a specific challenge. To reliably assess lateralized ERP components like N2pc or Pd, the high-probability location must be restricted to static lateralized positions (e.g., on the horizontal midline). Our counterbalanced design (see also our response to comment 9 by reviewer 1), which was crucial to avoid bias in spatial encoding models, precluded such a targeted ERP analysis.

      (3) How do the authors ensure that the increased tuning (which appears more as a half-split or hemifield effect rather than gradual fine-grained tuning, as shown in Figure 5) is not a byproduct of the dual-task paradigm used, rather than a general characteristic of learned attentional suppression? For example, the additional memory task and the repeated experience with the high-probability distractor at the specific location might have led to longer-lasting and more finely-tuned traces for memory items at that location compared to others.

      Thank you for raising these important points. Indeed, a unique aspect of our study that sets it apart from other studies is that the effects of learned suppression were not measured directly via an index of distractor processing, but rather inferred indirectly via tuning towards a location in memory. The critical assumption here, which we now make explicit on page 18, is that various sources of attentional control jointly determine the priority landscape, and that this priority landscape can be read out by neutral ping displays. An alternative, however, as suggested by the reviewer, is that memory representations may have been sharper when the remembered location was at the high-probability distractor location. We believe this is unlikely for two reasons. First, at the behavioral level there was no evidence that memory performance differed for positions overlapping high- and low-probability distractor locations (also see our response to reviewer 3 minor comment 4). Second, there was no indication that the memory representation already differed during encoding or maintenance (this is now explicitly indicated in the revised manuscript on page 14), which would have been expected if the spatial distractor imbalance modulated the spatial memory representations.

      Nevertheless, as discussed in more detail in response to comment 5, there is an alternative explanation for the observed gradient modulation that may be specific to the dual nature of our experiment.

      (4) It is unclear how IEM was performed on total vs. evoked power, compared to typical approaches of running it on single trials or pseudo-trials.

      Thank you for pointing out that our methods were not clear. We did not run our analysis on single trials because we were interested in separately examining the spatial selectivity of both evoked alpha power (phase-locked activity aligned with stimulus onset) and total alpha power (all activity regardless of signal phase). It is only possible to calculate evoked and total power when averaging across trials. Thus, when we partitioned the data into sets for the IEM analysis, we averaged trials for each condition/stimulus location to obtain a measurement of evoked and total power for each condition in each set. This is the same approach used in previous work (e.g., Foster et al., 2016; van Moorselaar et al., 2018).
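For readers less familiar with the distinction, the following is a minimal illustrative sketch, not the authors' analysis code. It assumes each trial has already been band-pass filtered and Hilbert-transformed, so that one electrode at one time point is represented by a single complex number; the function names and the toy data are hypothetical. Evoked power is the power of the trial-averaged signal, whereas total power is the average of the single-trial powers, which is why both quantities are only defined over a set of trials.

```python
import cmath
import math
import random


def evoked_power(trials):
    """Power of the trial-averaged complex signal (keeps phase-locked activity only)."""
    mean_signal = sum(trials) / len(trials)
    return abs(mean_signal) ** 2


def total_power(trials):
    """Average of single-trial power (all activity, regardless of phase)."""
    return sum(abs(z) ** 2 for z in trials) / len(trials)


rng = random.Random(0)

# Hypothetical data: 200 trials with unit amplitude at one electrode/time point.
# In the first set every trial has the same phase; in the second the phase is random.
phase_locked = [cmath.rect(1.0, 0.3) for _ in range(200)]
random_phase = [cmath.rect(1.0, rng.uniform(-math.pi, math.pi)) for _ in range(200)]

for label, trials in (("phase-locked", phase_locked), ("random phase", random_phase)):
    print(label, round(evoked_power(trials), 3), round(total_power(trials), 3))
```

In this toy case total power is the same for both sets, while evoked power largely cancels when the phase varies from trial to trial.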

      We reviewed our Methods section and can see why this was unclear. In places, we had incorrectly described the dimensions of training and test data as electrodes x trials. To address this, we’ve rewritten the “Time frequency analysis” and “Inverted encoding model” sections and added a new “Training and test data” section. We hope that these sections are easier to follow.

      (5) Following on point 1. What is the rationale for relating decreased (but not increased) tuning of CTF to proactive suppression? Could it be that proactive suppression requires anticipatory tuning towards the expected feature to implement suppression? In other terms, better 'tuning' does not necessarily imply a higher signal amplitude and could be observable even under signal suppression. The authors should comment on this and clarify.

      We appreciate your highlighting of these highly relevant alternative explanations. In response, we have revised a paragraph in the General Discussion on page 18 to explicitly outline our rationale for associating decreased tuning with proactive suppression. However, in doing so, we now also consider the alternative perspective that proactive suppression might actually require enhanced tuning towards the expected feature to implement suppression effectively.

      It's important to note that both of these interpretations – decreased tuning as a sign of suppression and increased tuning as a preparatory mechanism for suppression – diverge significantly from the commonly held model (including our own initial assumptions) wherein weights at the to-be-suppressed location are simply downregulated.

      Minor:

      (1) In the Word file I reviewed, there are minor formatting issues, such as missing spaces, which should be double-checked.

      Thank you! We have now reviewed the text thoroughly and tried our best to avoid formatting issues.

      (2) Would the authors predict that proactive mechanisms are not involved in other forms of attention learning involving distractor suppression, such as habituation?

      Habituation is a form of non-associative learning where the response to a repetitive stimulus decreases over time. As such, we would not characterize these changes as “proactive”, as they only occur following (repeated) exposure to the stimulus.

      (3) A clear description in the Methods section of how individual CTFs for each location were derived would help in understanding the procedure.

      Thank you. We have now added several sentences on page 27 to clarify how individual CTFs in Figure 3 and distance CTFs in Figure 5 are calculated.

      “The derived channel responses (8 channels × 8 location bins) were then used for the following analyses: (a) calculating individual Channel Tuning Functions (CTFs) based on each of the eight physical location bins (e.g., Figure 3C and 3D); (b) grouping responses according to the distance between each physical location and the high-probability distractor location to calculate distance CTFs (e.g., Figure 5); and (c) averaging across location bins to represent the general strength of spatial selectivity in tracking the memory cue, irrespective of its specific location (e.g., Figure 3A and 3B).”
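The steps quoted above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the authors' pipeline: it assumes the channel responses are stored as an 8 × 8 nested list indexed as [location bin][channel], that a CTF is obtained by circularly shifting each row so the channel tuned to the stimulus location sits at offset 0, and that selectivity is summarized by a linear slope over the folded CTF. All function names are hypothetical.

```python
def center_on_location(responses):
    """Shift each location bin's channel responses so the tuned channel is at offset 0.

    responses[loc][ch] is the response of channel `ch` when the stimulus was in bin `loc`.
    """
    n = len(responses)
    return [[responses[loc][(loc + off) % n] for off in range(n)] for loc in range(n)]


def average_ctf(centered):
    """Average the centered responses across location bins to get one CTF."""
    n_loc = len(centered)
    n_off = len(centered[0])
    return [sum(row[off] for row in centered) / n_loc for off in range(n_off)]


def ctf_slope(ctf):
    """Fold the CTF by circular distance and fit a least-squares line.

    x runs from the farthest channel (0) to the tuned channel (n // 2),
    so a positive slope indicates spatial selectivity.
    """
    n = len(ctf)
    half = n // 2
    folded = [(ctf[d] + ctf[(n - d) % n]) / 2 for d in range(half + 1)]
    xs = list(range(half + 1))
    ys = folded[::-1]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Toy data (an assumption for illustration): responses fall off linearly
# with circular distance between the channel and the stimulus bin.
n_bins = 8
toy = [
    [1 - min(abs(loc - ch), n_bins - abs(loc - ch)) / 4 for ch in range(n_bins)]
    for loc in range(n_bins)
]
slope = ctf_slope(average_ctf(center_on_location(toy)))
```

A distance CTF of the kind described in (b) would follow the same pattern, with rows grouped by their distance from the high-probability distractor location before averaging.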

      (4) Why specifically 1024 resampling iterations?

      Thank you for your question. The statistical analysis was conducted using the permutation_cluster_1samp_test function within the MNE package in Python. We have clarified this on page 25. The choice of 1024 permutations reflects the default setting of the function, which is generally considered sufficient for robust non-parametric statistical testing. This number provides a balance between computational efficiency and the precision of p-value estimation in the context of our analyses.
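To make the role of the permutation count concrete, here is a deliberately simplified sketch of a one-sample sign-flip permutation test on a single value per participant. It is not the cluster-based procedure implemented in MNE and does not call MNE; the function name, the seed, and the p-value formula (counting the observed statistic as one permutation) are assumptions chosen for illustration.

```python
import random


def sign_flip_pvalue(values, n_permutations=1024, seed=0):
    """Two-sided one-sample permutation test of the mean against zero.

    On each permutation, every participant's value has its sign flipped
    with probability 0.5, which is valid under a symmetric null distribution.
    """
    rng = random.Random(seed)
    observed = abs(sum(values) / len(values))
    exceed = 0
    for _ in range(n_permutations):
        flipped = [v if rng.random() < 0.5 else -v for v in values]
        if abs(sum(flipped) / len(flipped)) >= observed:
            exceed += 1
    # Adding 1 to numerator and denominator counts the observed data as a permutation.
    return (exceed + 1) / (n_permutations + 1)


# Hypothetical per-participant effects that are consistently positive.
effects = [1.0 + 0.01 * i for i in range(20)]
p_value = sign_flip_pvalue(effects)
```

With this formula the smallest attainable p-value is 1 / (n_permutations + 1), so 1024 permutations resolve p-values down to roughly 0.001.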

      Reviewer #3 (Public Review):

      Summary:

      In this experiment, the authors use a probe method along with time-frequency analyses to ascertain the attentional priority map prior to a visual search display in which one location is more likely to contain a salient distractor.  The main finding is that neural responses to the probe indicate that the high probability location is attended, rather than suppressed, prior to the search display onset.  The authors conclude that suppression of distractors at high-probability locations is a result of reactive, rather than proactive, suppression.

      Strengths:

      This was a creative approach to a difficult and important question about attention.  The use of this "pinging" method to assess the attentional priority map has a lot of potential value for a number of questions related to attention and visual search. Here as well, the authors have used it to address a question about distractor suppression that has been the subject of competing theories for many years in the field. The paper is well-written, and the authors have done a good job placing their data in the larger context of recent findings in the field.

      Weaknesses:

      The link between the memory task and the search task could be explored in greater detail. For example, how might attentional priority maps change because of the need to hold a location in working memory? This might limit the generalizability of these findings. There could be more analysis of behavioral data to address this question. In addition, the authors could explore the role that intertrial repetition plays in the attentional priority map as these factors necessarily differ between conditions in the current design. Finally, the explanation of the CTF analyses in the results could be written more clearly for readers who are less familiar with this specific approach (which has not been used in this field much previously).

      We appreciate the reviewer's valuable feedback and have made significant revisions to address the concerns raised. To clarify the connection between the memory and search tasks, we conducted additional analyses to explore the effects of spatial distance between the memory cue location and the high-probability distractor location on behavioral performance. We also investigated the potential influence of intertrial repetition effects on the observed results by removing trials with location repetitions. To enhance clarity, we revised the explanation of the CTF analyses in the Results section and improved figure annotations to ensure accessibility for readers unfamiliar with this approach. Collectively, these updates clarify how the pattern of CTF slopes reflects the interplay between memory and search tasks while addressing key methodological and interpretative considerations.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Suggestions/Critiques (in no particular order)

      (1) The authors discuss the tripartite model (bottom-up, top-down, and selection history) but neglect recent and important discussions of why this trichotomy might be unnecessarily complicated (e.g., Anderson, 2024: Trichotomy revisited: A monolithic theory of attentional control). Simply put, one of the 3 pillars (i.e., selection history) likely does not fall into a unitary construct or "box"; instead, it likely contains many subcomponents (e.g., reward associations, stimulus-response habit learning, statistical learning, etc.). Since the focus of the current study is learned distractor suppression based on the statistical regularities of the distractor, the authors should comment on which aspects of selection history are relevant, perhaps by using this monolithic framework.

      We appreciate the reviewer's insightful suggestion regarding theoretical frameworks of attentional control. While Anderson (2024) proposes a monolithic theory that challenges the traditional tripartite model, our study deliberately maintains a pragmatic approach. The main purpose of our experiment is to empirically investigate the mechanisms of learned distractor suppression, rather than to adjudicate between competing theoretical models.

      We agree that selection history is not a unitary construct but comprises multiple subcomponents, including reward associations, stimulus-response habit learning, and statistical learning. In this context, our study specifically focuses on statistical learning as a key mechanism of distractor suppression. By explicitly acknowledging the multifaceted nature of selection history and referencing Anderson's monolithic perspective, we invite readers to consider the theoretical implications while maintaining our research's primary focus on empirical investigation. To this end, we have modified the manuscript to read (see page 3):

      "The present study investigates the mechanisms underlying statistical learning, specifically learned distractor suppression, which represents one critical subcomponent of selection history. While theoretical models like the tripartite framework and the recent monolithic theory (Anderson, 2024) offer complementary perspectives on attentional control, our investigation focuses on empirically characterizing the statistical learning mechanisms underlying learned distractor suppression."

      (2) The authors discuss previous demonstrations of location-based and feature-based learned distractor suppression. The authors admit that there have been a large number of studies but seem to mainly cite those that were conducted by the authors themselves (with the exception being Vatterott & Vecera, 2012). For example, there are other studies investigating location-based suppression (Feldmann-Wüstefeld et al., 2021; Sauter et al., 2021), feature-based suppression (Gaspelin & Luck, 2018a; Stilwell et al., 2022; Stilwell & Gaspelin, 2021; Vatterott et al., 2018), or both (Stilwell et al., 2019). The authors do not cite Gaspelin and colleagues at all in the manuscript, despite claiming that singleton-based suppression is not proactive.

      We appreciate your pointing out the need for a more comprehensive citation of the literature on learned distractor suppression, particularly with respect to location-based and feature-based suppression. In response to your comment, we have now expanded the reference list on page 4 to include relevant studies that further support our discussion of both location-based and feature-based suppression mechanisms.

      (3) The authors use the terms "proactive" and "reactive" suppression without taking into consideration the recent terminology paper, which one of the current authors, Theeuwes, helped to write (Liesefeld et al., 2024, see Figure 8). The terms proactive and reactive suppression need to be defined relative to a time point. The authors need to be careful in defining proactive suppression as prior to the first shift of attention, but after the stimuli appear and reactive suppression as after the first shift of attention and after the stimuli appear. Thus, the critical time point is the first shift of attention. Does suppression occur before or after the first shift of attention? The authors could alleviate this by using the term "stimulus-triggered suppression" to refer to "suppression that occurs after the distractor appears and before it captures attention" (Liesefeld et al., 2024).

      Thank you for pointing out that this was insufficiently clear in the previous version. In the revised version we specifically refer to the recent terminology paper on page 5 to make clear that suppression could theoretically occur at three distinct moments in time, and that the present paper was designed to dissociate between suppression before or after the first shift of attention.

      (4) Could the authors justify why the circle stimulus (2° in diameter) was smaller than the diamonds (2.3° x 2.3°)? Are the stimuli equated for the area? Or, for width and height? Doesn't this create a size singleton target on half of all trials (whenever the target is a circle) in addition to the lone circle being a shape singleton? Along these lines, could the authors justify why the colors were used and not equiluminant? This version of red is much brighter than this version of green if assessed by a spectrophotometer. Thus, there are sensory imbalances between the colors. Further, the grey used as the ping is likely not equiluminant to both colors. Thus, the grey "ping" is likely dimmer for red items but brighter for green items. Is this a fair "ping"?

      Thank you for raising these important points. We chose, as is customary in this experimental paradigm (e.g., Huang et al., 2023; Duncan et al., 2023), to make the diamond slightly larger (2.3° x 2.3°) than the circle (2° in diameter) to ensure a better visual match in overall size appearance. If the circle and diamond stimuli were equated strictly in terms of size (both at 2°), the diamond would appear visually smaller due to the differences in geometric shape. By adjusting the dimensions slightly, we aimed to minimize any unintentional differences in perceptual salience.

      As for the colors used in the experiment, the reviewer is right that there might be sensory imbalances between the red and green stimuli, with red appearing brighter than green based on measurements such as spectrophotometry. To ensure that any effects couldn’t be explained by sensory imbalance in the displays, we randomized target and distractor colors across trials, meaning that roughly half the trials had a red distractor and half had a green distractor. This randomization should have mitigated any systematic biases caused by color differences.

      We appreciate your feedback and have clarified these points in the Methods section of the revised manuscript on page 22:

      "Please note that although the colors were not equiluminant, the target and distractor colors were randomized across trials such that roughly half the trials had a red distractor, and half had a green distractor. This randomization process should help mitigate any systematic biases this may cause."

      (5) For the eye movement artifact rejection, the authors use a relatively liberal rejection routine (i.e., allowing for eye movements up to 1.2° visual angle and a threshold of 15 μV). Given that every 3.2 μV deviation in HEOG corresponds to ~ ± 0.1° of visual angle (Lins, et al., 1993), the current oculomotor rejection allows for eye movements between 0.5° and 1.2° visual angle to remain which might allow for microsaccades (e.g., Poletti, 2023) to contaminate the EEG signal (e.g., Woodman & Luck, 2003).

      The reviewer correctly points out that our eye rejection procedure, which is the same as in our previous work (e.g., Duncan et al., 2023), still allows for small but systematic biases in eye position towards the remembered location and potentially towards or away from the high probability distractor location. While we cannot definitively exclude this possibility, we believe it is unlikely for the following reasons. First, although there is a link between microsaccades and covert attention, it has been demonstrated that subtle biases in eye position cannot explain the link between alpha activity and the content of spatial WM (Foster et al., 2016, 2017). Specifically, Foster et al. (2017) found no evidence for a gaze-position-related CTF, while an analysis of the same data yielded clear target-related CTFs. Similarly, within the present data set there was no evidence that the observed revival induced by the ping display could be attributed to systematic changes in gaze position, as a multivariate cross-session decoding analysis with x,y positions from the tracker did not yield reliable above-chance decoding of the location in memory.
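
For reference, the conversion cited in the reviewer's comment (roughly 3.2 μV of HEOG deflection per 0.1° of visual angle; Lins et al., 1993) can be turned into a quick check of what a 15 μV threshold implies. This is only a sketch under that linear approximation; the function and constant names are hypothetical, and the true relationship depends on electrode placement and the individual participant.

```python
# Linear approximation cited by the reviewer: ~3.2 microvolts per 0.1 degree.
UV_PER_TENTH_DEGREE = 3.2


def heog_uv_to_degrees(microvolts):
    """Convert an HEOG deflection (in microvolts) to approximate visual angle (in degrees)."""
    return microvolts / UV_PER_TENTH_DEGREE * 0.1


# A 15 microvolt rejection threshold corresponds to just under half a degree.
print(round(heog_uv_to_degrees(15.0), 2))
```

Under this approximation the voltage threshold tolerates gaze shifts of a little under 0.5°, which is the lower end of the range the reviewer describes.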

      Author response image 1.

      (6) The authors claim that "If the statistically learned suppression was spatial-based and feature-blind, one would also expect impaired target processing at the high-probability location." (p. 7, lines 194-195). Why is it important that suppression is feature-blind here? Further, is this a fair test of whether suppression is feature-blind? What about inter-trial priming of the previous trial? If the previous trial's singleton color repeated RTs might be faster than if it switched. In other words, the more catastrophic the interference (the target shape, target color, distractor shape, distractor color) change between trials, the more RTs might slow (compared with consistencies between trials, such that the target and distractor shapes repeat and the target and distractor colors repeat). Lastly, given the variability across both the shape and color dimensions, the claim that this type of suppression is feature-blind might be an artifact of the design promoting location-based instead of feature-based suppression.

      Thank you for raising this point. In the past we have used the finding that learned suppression was not specific to distractors, but also generalized to targets, to argue in favor of proactive (or stimulus-triggered) suppression. However, we agree that given the current experimental parameters it may be an oversimplification to conclude that the effect was feature-blind based on the impaired target processing observed here. As this argument is also not relevant to our main findings, we have removed this interpretation and simply report that the effect was observed for both distractors and targets. Nevertheless, we would like to point out that while inter-trial priming could influence reaction times, the features of both target and distractors (shape and color) were randomly assigned on each trial. This should mitigate consistent feature-repetition effects. Additionally, previous research has demonstrated that suppression effects persist even when immediate feature repetitions are controlled for or statistically accounted for (e.g., Wang & Theeuwes 2018 JEP:HPP; Huang et al., 2021 PB&R).

      (7) The authors should temper claims such as "suppression occurs only following attentional enhancement, indicating a reactive suppression mechanism rather than proactive suppression." (p. 15, lines 353-353). Perhaps this claim may be true in the current context, but this claim is too generalized and not supported, at least yet. Further, "Within the realm of learned distractor suppression, an ongoing debate centers around the question of whether, and precisely when, visual distractors can be proactively suppressed. As noted, the idea that learned spatial distractor suppression is applied proactively is largely based on the finding that the behavioral benefit observed when distractors appear with a higher probability at a given location is accompanied by a probe detection cost (measured via dot offset detection) at the high probability distractor location (Huang et al., 2022, 2023; Huang, Vilotijević, et al., 2021)." (p. 15, lines 355-361). Again, the authors should either cite more of the opposing side of the debate (e.g., the signal suppression hypothesis, Gaspelin & Luck, 2019 or Luck et al., 2021) and the many lines of converging evidence of proactive suppression) or temper the claims.

      Thank you for your constructive feedback regarding our statements on suppression mechanisms. We acknowledge that our original claim was intended to reflect our specific findings within the context of this study and was not meant to generalize across all research in the field. To prevent any misunderstanding, we have tempered our claims to avoid overgeneralization by clarifying that our findings suggest a tendency toward reactive suppression within the specific experimental conditions we investigated (see page 17).

      Furthermore, learned distractor suppression is multifaceted, encompassing both feature-based suppression (as proposed by the signal suppression hypothesis) and spatial-based suppression (as examined in the current study). Work on the signal suppression hypothesis provides evidence for proactive suppression of specific feature values (Gaspelin et al., 2019; Gaspelin & Luck, 2018b; Stilwell et al., 2019). We have incorporated references to these studies to offer a more comprehensive perspective on the ongoing debate at a broader level (see page 17).

      (8) "These studies however, mainly failed to find evidence in support of active preparatory inhibition (van Moorselaar et al., 2020, 2021; van Moorselaar & Slagter, 2019), with only one study observing increased preparatory alpha contralateral to the high probability distractor location (Wang et al., 2019)." (p. 15, lines 367-370). This is an odd phrasing to say "many studies" have shown one pattern (citing 3 studies) and "only" one showing the opposite, especially given these were all from the current authors' labs.

      Agreed. We have rewritten this text on page 17.

      “These studies however, failed to find evidence in support of active preparatory inhibition as indexed via increased alpha power contralateral to the high probability distractor location  (van Moorselaar et al., 2020, 2021; van Moorselaar & Slagter, 2019; but see Wang et al., 2019).”

      (9) Could the authors comment on why total power was significantly above baseline immediately (without clearer timing marks, ~10-50 ms) after the onset of the cue (Figure 3)? Is this an artifact of smearing? Further, it appears that there is significant activity (as strong as the evoked power of interest) in the baseline period of the evoked power when the memory item is presented on the vertical midline in the upper visual field (this is also true, albeit weaker, for the memory cue item presented on the horizontal midline to the right). This concern again appears in Figure 4 where the Alpha CTF slope was significantly below or above the baseline prior to the onset of the memory cue. Evoked Alpha was already significantly higher than baseline in the baseline period. In Figure 5, evoked power is already higher and different for the hpl than the lpls even at the memory cue (and before the memory cue onsets). There are often periods of differential overlap during the baseline period, or significant activity in the baseline period or at the onset of the critical, time-locked stimulus array. The authors should explain why this might be (e.g., smearing).

      Thank you for pointing this out. As suggested by the reviewer, this ‘unexpected’ pre-stimulus decoding is indeed the result of temporal smearing induced by our 5th order Butterworth filter. The immediate onset of reliable tuning (sometimes even before stimulus onset) is then also a typical aspect of studies that track tuning profiles across time in the lower frequency bands such as alpha (van Moorselaar & Slagter 2019; van Moorselaar et al., 2020; Foster et al., 2016).

      Indeed, visual inspection also suggests that evoked activity tracked items at the top of the screen, an effect that is unlikely to result from temporal smearing as it is temporally interrupted around display onset. However, it is important to note that CTFs by location are based on far fewer trials, making them inherently noisier. The by-location plots primarily serve to show that the observed pattern is generally consistent across locations. In any case, given that the high probability distractor location was counterbalanced across participants it did not systematically influence our results.

      (10) Given that EEG was measured, perhaps the authors could show data to connect with the extant literature. For example, by showing the ERP N2pc and PD components. A strong prediction here is that there should be an N2pc component followed by a PD component if there is the first selection of the singleton before it is suppressed.

      Thank you for your great suggestion regarding the analysis of ERP components such as N2pc and Pd. To reliably assess lateralized ERP components like N2pc or Pd, the high probability location must be restricted to static lateralized positions (e.g., on the horizontal midline, as in Wang et al., 2019). In contrast, our study was designed to utilize an inverted encoding model to investigate the mechanisms underlying spatial suppression. To avoid biasing the training of the spatial model toward specific spatial locations (see also the previous comment), we counterbalanced the high-probability location across participants, ensuring an equal distribution of high-probability locations within the sample. Given this counterbalanced design, it was not feasible to reliably assess these components within the scope of the current study. Nevertheless, we agree with the reviewer that it would be of theoretical interest to examine the Pd and N2pc evoked by the search display, particularly in this scenario where suppression has been triggered prior to search onset.

      (11) Figure 2 (behavioral results) is difficult to see (especially the light grey and white bars). A simple fix might be to outline all the bars in black.

      Thank you! We have incorporated your suggestion by outlining all the bars on page 10.

      Reviewer #3 (Recommendations For The Authors):

      (1) I'm wondering about the link between the memory task and the search task. I think the interpretation of the data should include more discussion of the fact that much of the search literature doesn't involve simultaneously holding an unrelated location in memory. How might that change the results?

      For example - what happens behaviorally on the subset of trials in which the location to be held in memory is near the high probability distractor location?  All the behavioral data is more or less compartmentalized, but I think some behavioral analysis of this and related questions might be quite useful.  I know there are comparisons of behavior in single vs. dual-task cases (for the memory task at least), but I think the analyses could go deeper.

      Thank you for your great suggestion. To investigate the potential interactions between the spatial memory task and the visual search task, we conducted additional analyses on the behavioral data. First, we examined whether memory recall was influenced by the spatial distance (dist0 to dist4) between the memory cue location and the high-probability distractor location. As shown in the figure below, memory recall is not systematically biased either toward or away from the high-probability distractor location (p = .562, ηp<sup>2</sup> = .011).

      We also assessed how the memory task might affect search performance. Specifically, we plotted reaction times as a function of the spatial overlap between the memory cue location and any of the search items, separating trials by distractor-present (match-target, match-distractor, match-neutral) and distractor-absent (match-target, match-neutral) conditions. Although visually the result pattern seems to suggest that search performance was facilitated when the memory cue spatially overlapped with the target and interfered with when it overlapped with the distractor, this pattern did not reach statistical significance (distractor-present: p = .249, ηp<sup>2</sup> = .002; distractor-absent: p = .335, ηp<sup>2</sup> = .002). We have now included these analyses in our supplemental material.

      Beyond additional data analyses, there are also theoretical questions to be asked.  For example, one could argue that in order to maintain a location near or at the high probability distractor location in working memory, the priority map would have to shift substantially. This doesn't necessarily mean that proactive suppression always occurs in search when there is a high probability location. Instead, one could argue that when you need to maintain a high probability location in memory but also know that this location might contain a distractor, the representation necessarily looks quite different than if there were no memory tasks.  Maybe there are reasons against this kind of interpretation but more discussion could be devoted to it in the manuscript. I guess another way to think of this question is - how much is the ping showing us about attentional priority for search vs. attentional priority for memory, or is it simply a combination of those things, and if so, how might that change if we could ping the attentional priority map without a simultaneous memory task?

      Thank you for this valuable suggestion. The aim of our study was to explore how the CTFs elicited by the memory cue were influenced by the search task. We employed a simultaneous memory task because directly measuring CTFs in relation to the search task was not feasible, as the HPL typically does not vary within individual participants. Consequently, CTFs locked to placeholder onsets could reflect arbitrary differences between (subgroups of) participants rather than true differences in the HPL. To address this, we combined the search task with a VWM task, leveraging the fact that location-specific CTFs can reliably be elicited by a memory cue and that the location of this cue relative to the HPL can be systematically varied within participants (Foster et al., 2016, 2017; van Moorselaar et al., 2018). This approach allowed us to examine the CTFs elicited by the memory cue and how these were modulated by their distance from the HPL.

      While it is theoretically possible that the observed changes resulted solely from alterations in how the memory cue was maintained in memory, this explanation seems unlikely: memory performance (recall) did not vary as a function of the cue's distance from the HPL, suggesting that the distance-related changes in the CTFs reflect both tasks. Moreover, distractor learning typically occurs without awareness (Gao & Theeuwes 2022; Wang & Theeuwes 2018). It is difficult to understand how such unconscious processes could lead to anticipations in the memory task and subsequently modulate only the representation of the consciously remembered memory cue. We therefore believe that, had we pinged the attentional priority map without a simultaneous memory task, the results would have been similar to those obtained in the present experiment, indicating stronger tuning at the HPL. Yet, this work still needs to be done.

      To address this comment, we have added a paragraph on p. 18:

      “However, two alternative explanations warrant consideration. First, one could argue that observed modulations in the revived CTFs do not provide insight into the mechanisms underlying distractor suppression but instead reflect changes in the memory representation itself, potentially triggered by the anticipation of the HPL in the search task. According to this view, the changes in the revived CTFs would be unrelated to how search performance (in particular distractor suppression) was achieved. While this is theoretically possible, we believe it to be unlikely. Memory performance (recall) did not vary as a function of the cue's distance from the HPL, whereas the revived CTFs did, indicating that these changes likely reflect contributions from both tasks. Additionally, distractor learning typically occurs without conscious awareness (Gao & Theeuwes 2022; Wang & Theeuwes 2018). It is difficult to conceive how such unconscious processes could produce anticipatory effects in the memory task and selectively modulate the representation of the consciously remembered memory cue. Second, the apparent lack of suppression and the presence of a pronounced tuning at the high-probability distractor location could actually reflect a proactive mechanism that manifests in a way that seems reactive due to the dual-task nature of our experiment.”

      (2) When the distractor appears at a particular location with a high probability it necessarily means that intertrial effects differ between high and low probability distractor locations.  Consecutive trials with a distractor at the same location are far more frequent in the high probability condition.  You may not have enough power to look at this, and I know this group has analyzed this behaviorally in the past, but I do wonder how much that influences the EEG data reported here.  Are CTFs also sensitive to distractors/targets from the most recent trial?  And does that contribute to the overall patterns observed here?

      Thank you for your thoughtful comment. Indeed, statistical distractor learning studies naturally involve a higher proportion of intertrial effects for high-probability distractors compared to low-probability ones. Previous research, including the present study, has demonstrated that while distractor location improves performance (shown by faster response times, t(23) = 6.32, p < .001, d = 0.33, and increased accuracy, t(23) = 4.21, p < .001, d = 0.86), intertrial effects alone cannot fully account for the learned suppression effects induced by spatial distractor imbalances. This analysis is now reflected in the revised manuscript on page 9.

      However, as noted by the reviewer, this leaves uncertain to what extent the neural indices of statistical learning, in this case the modulation of channel tuning functions, capture the effects of interest beyond the contributions of intertrial priming. To address this issue, one possible approach is to rerun the CTF analysis after excluding trials with location repetitions. Since the distractor location is unknown to participants at the time the CTF is revived by the placeholder, we removed trials where the memory cue location repeated the distractor location from the preceding trial, rather than trials with distractor location repetitions between consecutive trials. Our analyses indicate that after trial removal (~ 9% of overall trials), the spatial gradient pattern in the CTF slopes remains similar. However, the cluster-based permutation analysis fails to reveal any significant findings, and a one-sample t-test on the slopes averaged within the 100 ms time window of interest yields a p-value of 0.106. While this could suggest that the current pattern is influenced by distractor-cue repetition, it is more likely that the trial removal resulted in an underpowered analysis. To investigate this, we randomly removed an equivalent number of trials (9%), which similarly resulted in non-significant findings, although the overall result pattern remained comparable (p = 0.066 for the one-sample t-test on the slopes averaged within the 100 ms time window of interest).
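
The logic of the matched random-removal control described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis pipeline used for the manuscript: the data are synthetic, the helper names are hypothetical, and a simple per-participant mean stands in for the CTF slope.

```python
import math
import random
import statistics


def drop_random_trials(trials, fraction, rng):
    """Return the trials that remain after removing a random `fraction` of them."""
    n_drop = round(len(trials) * fraction)
    keep = sorted(rng.sample(range(len(trials)), len(trials) - n_drop))
    return [trials[i] for i in keep]


def one_sample_t(values, popmean=0.0):
    """One-sample t statistic of `values` against `popmean`."""
    n = len(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return (statistics.mean(values) - popmean) / standard_error


rng = random.Random(0)
# Synthetic stand-in data: 24 participants x 1000 trials of a per-trial score.
participants = [[rng.gauss(0.1, 1.0) for _ in range(1000)] for _ in range(24)]
# Placeholder summary per participant (a mean here; the CTF slope in the real analysis).
summaries = [statistics.mean(drop_random_trials(p, 0.09, rng)) for p in participants]
print(len(drop_random_trials(participants[0], 0.09, rng)), one_sample_t(summaries))
```

Comparing the statistic after targeted removal with the statistic after random removal of the same number of trials separates the effect of losing repetition trials from the effect of simply having fewer trials.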

      Author response image 2.

      Also, in our previous pinging study we observed that, despite the trial imbalance, decoding was approximately equal between high probability trailing (i.e., location intertrial priming) and non-trailing trials, suggesting that the ping is able to retrieve the priority landscape that builds up across longer timescales.

      (3) Maybe there is too much noise in the data for this, but one could look at individual differences in the magnitude of the high probability distractor suppression and the magnitude of the alpha CTF slope.  If there were a correlation here it would bolster the argument about the relationship between priority to the distractor location and subsequent behavior reduction of interference from that distractor.  

      Thank you for this valuable suggestion. We investigated whether there was a correlation between the average gradient slope during the time window in which the placeholder revived the memory representation and the average distance slope in reaction times for the learned suppression effect. This correlation was not significant (r = .236, p = 0.267), which is perhaps expected given the potential noise levels, as noted by the reviewer. Furthermore, while the learned suppression effect is robust at the group level, its predictive value for individual-level performance has been shown to be limited (Ivanov et al., 2024; Hedge et al., 2018). Consequently, we chose not to include this analysis in the manuscript (see also our response to comment 2 by reviewer 2).
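
As a rough consistency check on the reported correlation, a Pearson r can be converted to a t statistic with n - 2 degrees of freedom. The sketch below assumes n = 24 participants, which is inferred from the t(23) statistics reported elsewhere in this response rather than stated for this analysis; the function name is hypothetical.

```python
import math


def pearson_r_to_t(r, n):
    """t statistic (df = n - 2) for testing a Pearson correlation against zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


# r = .236 with an assumed n = 24 gives a t value on 22 degrees of freedom.
print(round(pearson_r_to_t(0.236, 24), 3))
```

A t value of this size on 22 degrees of freedom is far from conventional significance thresholds, in line with the non-significant result reported above.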

      (4) The results sections are a bit dense in places, especially starting at the bottom of page 11.  For readers who are familiar with the general questions being asked but less so with the particular time-frequency analyses and CTF approaches being used (like myself), I think a bit more time could be spent setting up these analyses within the results section to make extra clear what's going on.

      Thank you for your feedback regarding the clarity of our Results section. We have revised this section to make it more understandable and easier to follow, especially for readers who may be less familiar with the specific time-frequency analyses and modeling approaches used in our study. Specifically, we have provided additional interpretations alongside the reported results from page 10 to page 13 to aid comprehension and ensure that the methodology and findings are accessible to a broader audience. Additionally, we have revised the figure notes to further enhance clarity and understanding.

      Other comments:

      Abstract: "a neutral placeholder display was presented to probe how hidden priority map is reconfigured..."  i think the word "the" is missing before "priority map"

      Thank you. We have added the word “the” before “hidden priority map”.

      p. 4, Müller's group also has a number of papers that demonstrate how learned distractor regularities impact search (From the ~2008-2012 range, probably others as well), it might be worth citing a few here.

      Thank you for your suggestion. In the revised manuscript, we have added citations to several key papers from Müller’s group, as well as from other research groups, on page 4.

      p.5 - Chang et al. (2023) seems highly relevant to the current study (and consistent with its results) - depending on word limits, it might make sense to expand the description of this in the introduction to make clear how the present study builds upon it

      Thank you! We have expanded the discussion of Chang et al. (2023) on page 5 to provide more detailed elaboration of their study and its relevance to our work.

      p. 7 - maybe not for the current study, but I do wonder whether the distortion of spatial memory by the presence of the search task occurs only when there is a relevant regularity in the search task. In other words, if the additional singleton task had completely unpredictable target and distractor locations, would there be memory distortions?  Possibly for the current dataset, the authors could explore whether the behavioral distortion is systematically towards or away from the high probability distractor location.

      Thank you for your insightful suggestion. Following your recommendation, we conducted an additional analysis to examine memory recall as a function of the distance between the memory cue location and the high-probability distractor location. Figure S1A illustrates the results, depicting memory recall deviation across various distances (dist0 to dist4) from the high-probability distractor location.

      Our statistical analysis indicates that memory recall is not systematically biased either towards or away from the high-probability distractor location (p = .562, η<sub>p</sub><sup>2</sup> = .011). This finding suggests that spatial memory recall remains relatively stable and is not heavily influenced by the presence of regularities in the distractor locations.

      p. 7 - in addition to stats it would be helpful to report descriptive statistics for the high probability vs. other distractor location comparisons

      Thank you! We have added descriptive statistics on page 8 and page 9.

      p. 19, "64%" repeated unnecessarily - also, shouldn't it be 65% if it's 5% at each of the other seven locations?

      Thank you. This is now corrected in the revised manuscript.

      p. 20 "This process continued until participants demonstrated a thorough understanding of the assigned tasks" Were there objective criteria to measure this?

      Thank you for pointing out this issue. To clarify, objective criteria were indeed used to assess participants’ readiness to proceed. Specifically:

      For the training phase practice trials, participants were required to achieve an average memory recall deviation of less than 13°.

      For the test phase practice trials, participants needed to demonstrate a minimum of 65% accuracy in the search task. In addition, participants were asked to verbally confirm their understanding of the task goals with the experimenter before proceeding.

      We have revised the manuscript to clearly indicate these criteria on p. 23.

      p. 21 "P-values were Greenhouse-Geiser corrected in case where the..." I think "case" should be "cases"

      Thank you. We have corrected this in the revised manuscript.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer 1

      We thank the reviewer for their thoughtful comments. We have addressed them below, and we believe that the revisions have significantly strengthened the clarity of the manuscript.

      Main Comments:

      In Fig. 2C-D, I am not sure I understand why ≈ 100 mutations fix with β = 0. In the absence of epistasis, and since the coefficients h<sub>i</sub> are sampled from a symmetric distribution centered at zero, it is to be expected that roughly half of the mutations will have positive fitness effects and thus will eventually fix in the population. With L = 250, I would have expected to see the number of fixed mutations approach ≈ 125 for β = 0. Perhaps I am missing something?

      • In our simulations, we initialize all populations from a state where there are only 100 available beneficial mutations (i.e., the initial rank is always 100). Without epistasis, these initial beneficial mutations are the only beneficial mutations that will be present throughout the entire trajectory. Hence, for β = 0, only 100 beneficial mutations can fix. Previously, this information could be found in the “Materials and methods” section of the SI. To make this aspect of our simulation more clear in the revision, we have added a discussion of the initial rank to the “Landscape structure” subsection of the model definition section. In addition, we have merged “Materials and methods” with “Further simulation details” in the SI into one section, and have listed the values for the simulation parameters in the model definition section.

      Along these lines, the authors show that increasing β leads to a higher number of fixed mutations. I am not sure I understand their explanation for this. In line 209 they write that as β increases, “mutations are needed to cease adaptation”. The way I see it, in the absence of epistasis the fitness peak should correspond to a genotype with ≈ L/2 mutations (the genotype carrying all mutations with h<sub>i</sub> > 0). Increasing the magnitude of microscopic epistasis (i.e., increasing β), and assuming that there is no bias towards positive epistasis (which there shouldn’t be based on the model formulation, i.e., section "Disorder statistics" on page 4), can change the “location” of the fitness peak, such that it now corresponds to a different genotype. Statistically speaking, however, there are more genotypes with L/2 mutations than with any other number of mutations, so I would have expected that, on average, the number of mutations fixed in the population would still have been ≈ L/2 (naturally with somewhat large variation across replicates, as seems to be the case).

      • With epistasis, the situation becomes more complex. The structure of our model imposes significant sign epistasis in general (i.e. mutations can be beneficial on one background genotype and deleterious on another). This means that in the presence of epistasis, more than 100 mutations can be required to reach a local optimum even when the initial rank was 100. Intuitively, this occurs because mutations that were deleterious on the ancestral background genotype can become beneficial on future genotypes. We find that this occurs consistently throughout adaptation, leading to the accumulation of more mutations with increasing epistasis.

      • Please note that we use the value L = 1000 in our simulations. We have also made the fact that we use L = 1000 more clear by moving the description of the simulation parameters to the main text.

      I do see how, in the clonal interference regime, there can be multiple genotypes in the population at a given time (each with a different mutational load), thus making the number of fixed mutations larger than L/2 when aggregating over all genotypes in the population. But this observation makes less intuitive sense to me in the SSWM regime. In lines 207-208, the authors state that “as beta increases, a greater number of new available beneficial mutations are generated per each typical fixation event”. While this is true, it is also the case that a greater number of mutations that would have been beneficial in the absence of epistasis are now deleterious due to negative epistasis (if I am understanding what the authors mean correctly).

      • The reviewer is correct to note that in the strong clonal interference regime, there will be more accumulated mutations across the entire population than in any single strain. However, we report the number of mutations that have fixed, i.e., become present in the entire population.

      • We find that the typical decrease in rank (per fixation event) of the population decreases with increasing epistasis — i.e., the number of available beneficial mutations that are “consumed” when a mutation fixes is typically lower in systems with stronger epistasis.

      Similarly, I am not sure I understand how one goes from equation (6) to equation (7). In particular, it would seem to me that the term 4αiαj Ji j in equation (6) should be equally likely to be positive or negative (again assuming no bias towards positive Ji j). I thus do not see why ηi j in equation (7) is sampled from a normal distribution with mean µβ instead of just mean zero.

      • The reviewer is correct that, for a uniformly random initial state, αi , αj , and Ji j will be uncorrelated so that the distribution of 4αiαj Ji j can be computed exactly (and has mean zero). However, we initialize from a state with rank 100, so that we need to compute the distribution of the random variable E[αiαj Ji j|αiαj Ji j > 0, R = 100]. This is mathematically very challenging, because there are nontrivial correlations between spins even at initialization. For these reasons, we found the uniformly random approximation insufficient. This is described in the paragraph following Equation (7) in the resubmission.

      Minor Comments:

      The authors use a model including terms up to second-order epistasis. To be clear, I think this choice is entirely justified: as they mention in their manuscript, this structure allows to approximate any fitness model defined on a Boolean hypercube. As I understand it, the reason for not incorporating higher-order terms (as in e.g. Reddy and Desai, eLife 2021) has to do with computational efficiency, i.e., accommodating higher-order terms in equation (10) may lead to a substantial increase in computation time. Is this the case?

      • The reviewer is correct that the incorporation of higher-order terms leads to significantly more expensive computation. It is an interesting direction for future inquiry to see whether our adaptive fast fitness computation method can be extended to higher-order interactions.

      Reviewer 2

      We would like to thank the reviewer for their careful reading and their useful comments connecting our work to spin glass physics. We believe the resulting additions to the paper have made our contributions stronger, and that they reveal some novel connections between the substitution trajectory and correlation functions in spin glasses. A summary of our investigation is provided below, and we have added two paragraphs to the discussion section under the heading “Connections to spin glass physics”.

      Main Comments:

      In spin glasses, slowdown of dynamics could have contributions from stretched exponential relaxation of spin correlations as well as aging, each of which are associated with their own exponents. In the present model, these processes could be quantified by computing two-point correlations associated with genomic overlap, as a function of lag time as well as waiting time (generation number). The population dynamics of competing strains makes the analysis more complicated. But it should be possible to define these correlations by separately averaging over lineages starting from a single parent genome, and over distinct parent genomes. It would be interesting to see how exponents associated with these correlations relate to the exponent c associated with asymptotic fitness growth.

      • To investigate this point, we first considered the two-point correlation function 〈αi (tw)αi (tw+ ∆t)〉 for waiting time tw and lag time ∆t. Because all spins are statistically identical, it is natural to average this over the spin index i, leading to the quantity

      χ2(tw, ∆t) = (1/L) Σi 〈αi (tw)αi (tw+ ∆t)〉.

      Viewed as a function of ∆t for any fixed tw, it is clear that χ2(tw, 0) = 1. If m mutations with respect to α(tw) have fixed at time tw + ∆t, a similar calculation shows that χ2(tw, ∆t) = 1 − 2m/L. Surprisingly, this simple derivation reveals that the two-spin correlation function commonly studied in spin glass physics is an affine transformation of the substitution trajectory commonly studied in population genetics. Moreover, it shows that the effect of tw is to change the definition of the ancestral strain, so that we may set tw = 0 without loss of generality and study the correlation function χ2(t) = 1 − 2m(t)/L, where m(t) is the mean substitution trajectory of the population. Much of our analysis proceeds by analyzing the effect of epistasis on the accumulation of mutations. This relation provides a novel connection between this analysis and the analysis of correlation functions in the spin glass literature.
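
      The relation between the site-averaged overlap and the number of fixed mutations can be checked numerically. The sketch below assumes spins αi = ±1, counts m as the number of sites that differ from the ancestor, and uses a hypothetical value of m; the function and variable names are illustrative only.

```python
import numpy as np

def overlap(a, b):
    # Site-averaged two-point correlation between two genotypes of +/-1 spins.
    return float(np.mean(a * b))

rng = np.random.default_rng(0)
L = 1000   # genome length, matching the value quoted in the response
m = 137    # hypothetical number of fixed mutations

ancestor = rng.choice([-1, 1], size=L)
derived = ancestor.copy()
flipped_sites = rng.choice(L, size=m, replace=False)
derived[flipped_sites] *= -1   # each fixed mutation flips one spin

chi2 = overlap(ancestor, derived)
print(chi2, 1 - 2 * m / L)
```

      Each flipped site changes its contribution to the average from +1 to −1, so the overlap falls by 2/L per fixed mutation, independent of which sites are flipped.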

      • It is well known that in the SSWM limit without epistasis, the substitution trajectory follows a power law similar to the fitness trajectory with relaxation exponent 1.0 [1]. Informed by this identity, we performed simulations in the SSWM limit and fit power laws to the correlation function χ2 as a function of time. We have verified that χ2(t) obeys a power-law relaxation with exponent roughly 1.0 for β = 0; moreover, as anticipated by the reviewer, the corresponding exponent decreases with increasing β . Nevertheless, we find that these relaxation exponents are distinct from those found for the fitness trajectory, despite following the same qualitative trend. This point is particularly interesting, as it highlights that the dynamics of fixation induce a distinct functional form at the level of the correlation functions when compared to, for example, the Glauber dynamics in statistical physics.

      The strength of dynamic correlations in spin glasses can be characterized by the four-point susceptibility, which contains information about correlated spin flips. These correlations are maximized over characteristic timescales. In the context of evolution, such analysis may provide insights on the correlated accumulation of mutations on different sets of loci over different timescales. It would be interesting to see how these correlations change as a function of the mutation rate as well as the strength of epistasis.

      • To study this point, we considered the four-point correlation function

      Because spins are statistically identical, we found numerically that the genotype average is roughly equivalent to the angular average over trajectories. Interchanging the order of the summation and the angular averaging, we then find that

      so that the information contained in the four-point correlation function is the same as the information contained in the two-point correlation function.

      Fig. 2E and Fig. 5 together suggest an intriguing possibility when interpreted in the spin glass context. It is clear that in the absence of epistasis, clonal interference accelerates fitness growth. Fig. 2E additionally suggests that this scenario will continue to hold even in the presence of weak, but finite epistasis, but disappears for sufficiently strong epistasis. I wonder if the two regimes are separated by a phase transition at some non-trivial strength of epistasis. Indeed, the qualitative behavior appears to change from that of a random field Ising spin glass for small β , to that of a zero field Sherrington-Kirkpatrick spin glass for sufficiently large β . While the foregoing comments are somewhat speculative, perhaps a discussion along these lines, and what it means in the context of evolution could be a useful addition to the discussion section of the paper.

      • We thank the reviewer for this interesting suggestion, and we have added a discussion of this point to the text in the future directions section, lines 483–489.

      Minor Comments:

      1. In the abstract (line 17-18), I recommend use of the phrase "a simulated evolving population" to avoid a possible misinterpretation of the work as experimental as opposed to numerical.

      • We have added the word “simulated”.

      2. In line 70, the word "the" before "statistical physics" is redundant.

      • We have removed “the”.

      3. To make the message in lines 294-295 visually clear, I recommend keeping the Y-axis scale bars constant across Fig. 4A and Fig. 4B.

      • We appreciate the suggestion. However, we found that when putting the two figures on the same scale, because the agreement is only qualitative and not quantitative (as emphasized in the text), it becomes difficult to view the trend in both systems. For this reason, we have chosen to keep the figure as-is.

      4. Fig. 6 caption states: "Without epistasis, the rank decreases with increasing µ". It should be "rank increases".

      • We have fixed this.

      5. In the last sentence in the caption to Fig. 8, the labels "(A, β =0)" and "(B, β =0.25)" need to be swapped.

      • We have fixed this.

      Editor Comments

      We thank the editor for drawing our attention to these three interesting references, in particular the second, which appears most relevant to our work. We have added a discussion of reference 2 in the future directions section (lines 471–482), commenting on how to determine the contribution of within-path clonal interference to the fitness dynamics in our model. We have also added a reference to article 3 in the model description, commenting on the importance of sign epistasis and the prevalence of sign epistasis in our model with β > 0.

      References:

      1. Good BH, Desai MM. The impact of macroscopic epistasis on long-term evolutionary dynamics. Genetics. 2015.
    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      The enteroviruses comprise a medically important genus in the large and diverse picornavirus family, and are known to be released without lysis from infected cells in large vesicles containing numerous RNA genome-containing capsids - a feature allowing for en bloc transmission of multiple viral genomes to newly infected cells that engulf these vesicles. SIRT-1 is an NAD-dependent protein deacetylase that has numerous and wide ranging effects on cellular physiology and homeostasis, and it is known to be engaged in cellular responses to stress and autophagy.

      Jassey et al. show that RNAi depletion of SIRT-1 impairs the release of enterovirus D-68 (EVD68) in EVs recovered from the supernatant fluids of infected cells using a commercial exosome isolation kit. The many functions attributed to SIRT-1 in the literature reflect its capacity to deacetylate various cell proteins engaged in transcription, DNA repair, and regulation of metabolism, apoptosis and autophagy. However, Jassey et al. make the surprising claim that the proviral role of SIRT-1 in promoting enterovirus release is not dependent on its deacetylase activity. Fig. S1C is crucial to this suggestion, as it is said to show that reconstituting expression with a catalytically-inactive mutant can rescue virus release from SIRT-1 depleted cells. However, no information is provided concerning the levels of endogenous and ectopically expressed SIRT-1 proteins in this experiment, making it very difficult to interpret the results. Is the mutant SIRT-1 protein expressed at a higher level than the non-mutant protein? Is there a 'sponging' effect with these transfections that lessens the siRNA efficiency and reduces knockdown of the endogenous protein? Fig. S1B and Fig. 4C convincingly show that EX527, a small molecule inhibitor of the deacetylase activity of SIRT-1, inhibits extracellular release of the virus. This suggests that the deacetylase activity of SIRT-1 is in fact required for the proviral effect of SIRT-1. This is a fundamentally important question that will require more investigation.

      We have included western blot data (Fig. S1D), which shows comparable levels of expression between the wild-type and mutant SIRT-1 constructs as well as the endogenous SIRT-1. While both constructs partially rescued EV-D68 titers in SIRT-1 knockdown cells, only the wild-type construct rescued SERCA2A protein levels, indicating that SIRT-1 deacetylase activity is required for SERCA2A expression but not for EV-D68 infection.

      Fig. 6 shows how SIRT-1 knockdown impacts the release of enterovirus D68 in EVs recovered from cell culture supernatant using a commercial 'Total Exosome Isolation Kit'. The authors should describe the principle this kit exploits to isolate 'exosomes' (affinity isolation?) and specify which antibodies it involves (anti-phosphatidylserine, anti-CD63, others?). This could impact the outcome of these experiments, and moreover is important to include in the long-term scientific record. The authors are appropriately cautious in describing the vesicles they presume to be isolated by the kit as simply 'extracellular vesicles', since there are multiple types of EVs with very different mechanisms of biogenesis, of which 'exosomes' are but one specific type. It would have been more elegant had the authors shown that SIRT-1 is required for EVD68 release in detergent-sensitive vesicles with low buoyant density in isopycnic gradients, and to characterize the size and number of viral capsids in these vesicles by electron microscopy.

      We have added a description of the Total Exosome Isolation Kit principle to the materials and methods. In brief, the reagent ties up water molecules and forces less soluble components, such as vesicles, out of the culture media, which can then be pelleted by centrifugation. The purity and size distribution of exosomes isolated with this kit are comparable to those obtained by ultracentrifugation.

      Fig. 6 shows that SIRT-1 depletion upregulates CD63 expression, but has no apparent impact on the release of CD63-positive 'EVs' from uninfected cells. EV-D68 infection also upregulates CD63 expression in SIRT-1 replete cells, and in this case, increases the release of CD63-positive EVs. The combination of infection and SIRT-1 depletion massively upregulates CD63 expression, but appears to eliminate the enhanced release of CD63-positive EVs resulting from infection alone. These are interesting results, from which the authors infer CD63 is associated with EVs containing EV-D68. But, do we know this? Can a CD63 pulldown immunoprecipitate EV-D68 capsid proteins or viral RNA? CD63 is strongly associated with exosomes released from cells through the multi-vesicular body pathway, which are distinct from the LC3-positive EVs released by secretory autophagy that have previously been associated with enteroviruses. The authors suggest that 'knockdown of SIRT-1 may prevent the exocytosis of CD63-positive EVs', but this is a very broad claim (and not really demonstrated by Fig. 6): it requires a clearer definition of what the authors mean by 'exocytosis' and a much more detailed analysis of the size and buoyant density of EVs released in a SIRT-1-dependent process.

      We have toned down this suggestion. It sets up our logic for what is now Figure 7, but we agree that it does not prove the specific nature of these vesicles.

      The authors suggest that almost all EV-D68 released from infected cells is released without cell lysis in EVs. However, they generally show data from only a single time point following infection (5 or 6 hrs post-infection). It would have been interesting to see a more complete temporal analysis, and to know whether a high proportion of virus continues to be released in EVs, or if it is swamped out ultimately by lytic release of nonenveloped virus.

      In these cells, very little virus is released at earlier timepoints, and after 6 hpi it is difficult to analyze virus release because of cell detachment and lysis. In a future publication we will use less susceptible cells to analyze a time course of release.

      Fig. 1D indicates that a small fraction of SIRT-1 leaks from the nucleus in EV-D68 infected cells. The authors suggest this is due to targeted nuclear export, rather than simply leaky nuclear pores which are well known to exist in enterovirus-infected cells. The authors present similar fluorescent microscopy data showing inhibition of TFEB export in leptomycin-B treated cells in Fig. S2A in support of their claim that this is specific SIRT-1 export, but these data are far from convincing - there is equivalent residual TFEB and SIRT-1 in the cytoplasm of the treated cells. Quantitative immunoblots of cytoplasmic and nuclear cell fractions might prove more compelling.

      We have changed the text to remove the word “block” and instead suggest that there is inhibition, given the difference we observe with and without leptomycin-B.

      Finally, the authors should be more specific in describing the viruses they have studied (EV-D68 and PV). It would be preferable to describe these as 'enteroviruses' (including in the title of the manuscript), rather than more broadly as 'picornaviruses'. There is no certainty that the requirement for SIRT-1 in non-lytic release of virus extends to hepatoviruses or other picornaviral genera, for which mechanisms of nonlytic release may be quite different.

      We have made this change and thank the reviewer for pointing this out.

      Reviewer #2 (Public Review):

      The authors aimed to connect SIRT-1 to EV-D68 virus release through mediating ER stress. They are successful in robustly connecting these pathways experimentally and show a new role for SIRT-1 in EV-D68 infection. These results extend to additional viruses, suggesting role(s) for SIRT-1 in diverse virus infection.

      The authors note that EV-D68 does not significantly impact SIRT-1 protein levels (Fig 1E and F), though this has been described for other picornaviruses (Xander et al., J Immunol 2019; Han et al., J Cell Sci 2016; Kanda et al Biochem Biophys Res Commun 2015). This may be of interest to note in the manuscript.

      We have cited the above papers in the manuscript and thank the reviewer for these suggestions.

      The data regarding CVB3 (Fig S4) are especially interesting because they show no discernable impact on infection. The manuscript should describe this further and perhaps speculate on potential reasons. Could it be due to inefficient knockdown?

      We have shown that neither genetic nor pharmacological inhibition of SIRT-1 significantly alters CVB3 titers. We do not think this is due to inefficient knockdown since the CVB3 and PV experiments were done concurrently. We are currently investigating why CVB3 responds differently from EV-D68 and PV.

      SIRT-1 (and other sirtuins) have been linked to an innate interferon response. Are any of the phenotypes observed here due to IFN responses? The use of H1HeLa cells would suggest this is not the case.

      We think this is unlikely because H1HeLa cells are not IFN-competent and the knockdown of SIRT-1 did not significantly alter viral RNA replication.

      Reviewer #1 (Recommendations For The Authors):

      In Fig. 1, it would be informative to show an immunoblot of the protein in knockdown vs control cells (this is shown in different experiments in Fig. 2A and 3C, with variable degrees of knockdown efficiency, but ideally should be shown here also).

      The knockdown efficiency of SIRT-1 is now shown in Fig. S1D. We thank the reviewer for this suggestion.

      Why is the extracellular virus titer in the control cells in Fig. 1C so much lower (over 1.5 logs) than in Fig. 1B? Has the plasmid transfection induced an innate immune response, and could this be confounding the experiment?

      We think this is due to stress induced by transfection and not an innate immune response, since H1HeLa cells are not interferon-competent.

      SIRT-1 is recognized to have a regulatory role in autophagy, but the authors' claim that it is "essential for stress induced and basal autophagy" would be strengthened by including in Fig. 2B control images of starved and CCCP-treated cells.

      LC3 lipidation and p62 degradation are the hallmarks of autophagy initiation and flux, which are shown in Fig. 2A. The goal of Fig. 2B was to verify the impact of SIRT-1 knockdown in restricting basal autophagic degradation. We will examine the effect of starvation and CCCP treatment in future studies. We thank the reviewer for understanding.

      The BiP immunoblot shown in Fig. 4B does not support the claim that 'TG [thapsigargin] treatment induced BiP protein levels' whereas 'EV-D68 infection reduced BiP levels...suggesting that EV-D68 blocks ER stress.' The apparent differences in BiP expression are minimal and of questionable biological significance.

      We have consistently observed a reduction in BiP levels during EV-D68 infection in both hSABCi-NS1.1 cells (Fig. 4B) and H1HeLa cells (see Author response image 1), which is consistent with an ER stress blockade during EV-D68 infection.

      Author response image 1.

      Minor comments:

      1) The variable and wide-ranging scale of the y-axis in Figs. 1A-C and S1 is distracting, exaggerates small differences, and makes it difficult to assess the magnitude of differences in virus titers. The scale should be standardized and held constant in graphs showing results from similar types of experiments.

      Our graphs are plotted based on the viral titers from experiments, mostly done on different days. We are confident that the variabilities in the y-axis do not affect the statistical analyses.

      2) The number and type (technical or biological?) of experimental replicates should be indicated in the figure legends. Ideally, each replicate should be individually plotted in graphs.

      All experiments are repeated at least three times unless otherwise indicated. We have added this information to the figure legends.

      3) Fig. S5C - how many replicates were done, and is there a statistically significant difference in viral RNA abundance at the last time point?

      The experiment was done three times, twice with a low MOI (0.1) and once with a high MOI (30). There is no statistical difference at the last time point as shown in the graphs in Author response image 2.

      Author response image 2.

      Reviewer #2 (Recommendations For The Authors):

      Figure 1D would benefit from staining for viral replication compartments (J2, for instance) to correlate the amount of viral dsRNA with nuclear egress of SIRT-1. Similar data would benefit Figure 5A. The data in Figure S5 suggests that most, but not all cells, are infected, so having this control seems important for their IFA experiments.

      SIRT-1 dsRNA staining for EV-D68 infection is shown in Fig. S5A and all cells appear to be infected. The IFA data (Author response image 3) show dsRNA staining of CVB3-infected cells.

      Author response image 3.

      Are EVs not released as efficiently with SIRT-1 knockdown? The authors show that knockdown reduces CD63 levels in purified EVs, but this could be explained if exosomes are not generated as robustly with SIRT-1 knockdown.

      We don’t want to use the word “exosomes” since their definition is very specific, and only use it once in our manuscript, to describe known membrane associations of CD63. We do not think SIRT-1 knockdown affects the intracellular generation of EVs, since depleting SIRT-1 leads to the buildup of CD63 positive signals in the whole cell lysates compared to the scramble control (Fig. 7B and C). Instead, our data suggest that SIRT-1 regulates the release of EVs during EV-D68 infection.

      Labels of graphs for "Infection" versus treatment ("TG" or "EX527") are unclear. All samples are presumably infected, so perhaps the authors meant to label these diagrams as untreated.

      We have made the changes in the labels and thank the reviewer for helping make these graphs more clear.

      The induction of ER stress with TG and repression of stress with EV-D68 infection is clear from BiP western blots. Are BiP levels reduced in SIRT-1 knockdown cells? Their data with TG treatment and knockdown suggests this may be possible.

      We have not examined the impact of SIRT-1 knockdown on BiP protein levels. However, since SIRT-1 knockdown increases ER stress, as evidenced by a reduction in SERCA2A levels (Fig. 3C and E), we would expect an increase in BiP levels in SIRT-1 depleted cells.

      Would the authors expect TG to reduce EVs with EV-D68 as well? Presumably, combination of TG with SIRT-1 would reduce EVs similar to the results shown in Figure 6C. They mention in the discussion that TG and SIRT-1 "share common cellular targets" so it would be interesting to determine if TG acts similar to SIRT-1 knockdown with regard to EVs.

      We think TG will similarly reduce EVs in EV-D68-infected cells, and we are currently testing this hypothesis.

      Because of the inclusion of the SARS-CoV-2 data and mention in the abstract, it may be appropriate to include that data (Fig S7) in the main figures. The authors mention SIRT-1 as important to MERS-CoV infection in the introduction, but SIRT-1 has been implicated in RNA virus infection, including picornaviruses (noted above). The expansion of this section to provide additional context would benefit the introduction and discussion.

      We have moved the former Fig. S7 to the main manuscript as Fig. 6.

    1. Author response:

      The following is the authors’ response to the current reviews.

      eLife assessment

      This study presents an important finding on the influence of visual uncertainty and Bayesian cue combination on implicit motor adaptation in young healthy participants, thereby linking perception and action during implicit adaptation. The evidence supporting the claims of the authors is convincing. The normative approach of the proposed PEA model, which combines ideas from separate lines of research, including vision research and motor learning, opens avenues for future developments. This work will be of interest to researchers in sensory cue integration and motor learning.

      Thank you for the updated assessment. We are also grateful for the insightful and constructive comments from the reviewers, which have helped us improve the manuscript again. We made necessary changes following their comments (trimmed tests, new analysis results, etc) and responded to the comments in a point-by-point fashion below. We hope to publish these responses alongside the public review. Thank you again for fostering the fruitful discussion here.

      Public Reviews:

      Reviewer #1 (Public Review):

      I appreciate the normative approach of the PEA model and am eager to examine this model in the future. However, two minor issues remain:

      (1) Clarification on the PReMo Model:

      The authors state, "The PReMo model proposes that this drift comprises two phases: initial proprioceptive recalibration and subsequent visual recalibration." This description could misinterpret the intent of PReMo. According to PReMo, the time course of the reported hand position is merely a read-out of the *perceived hand position* (x_hat in your paper). Early in adaptation, the perceived hand position is biased by the visual cursor (x_hat in the direction of the cursor); towards the end, due to implicit adaptation, x_hat reduces to zero. This is the same as PEA. I recommend that the authors clarify PReMo's intent to avoid confusion.

      Note, however, the observed overshoot of 1 degree in the reported hand position. In the PReMo paper, we hypothesized that this effect is due to the recalibration of the perceived visual target location (inspired by studies showing that vision is also recalibrated by proprioception, but in the opposite direction). If the goal of implicit adaptation is to align the perceived hand position (x_hat) with the perceived target position (t_hat), then there would be an overshoot of x_hat over the actual target position.

      PEA posits a different account for the overshoot. It currently suggests that the reported hand position combines x_hat (which takes x_p as input) with x_p itself. What is the reasoning underlying the *double occurrence* of x_p?

      There seem to be three alternatives that seem more plausible (and could lead to the same overshooting): 1) increasing x_p's contribution (assuming visual uncertainty increases when the visual cursor is absent during the hand report phase), 2) decreasing sigma_p (assuming that participants pay more attention to the hand during the report phase), 3) it could be that the perceived target position undergoes recalibration in the opposite direction to proprioceptive recalibration. All these options, at least to me, seem equally plausible and testable in the future.

      For clarification of the PReMo model’s take on Fig4A, we now write:

      “The PReMo model proposes that the initial negative drift reflects a misperceived hand location, which gradually reduces to zero, and the late positive drift reflects the influence of visual calibration of the target (Tsay, Kim, Saxena, et al., 2022).”

      However, we would like to point out that the PEA model does not predict a zero (perceived hand location) even at the late phase of adaptation: it remains negative, though not as large as during initial adaptation (see Figure 4A, red line). Furthermore, we have not seen any plausible way to use a visually biased target to explain the overshoot of the judged hand location (see below when we address the three alternative hypotheses the reviewer raised).

      We don’t think the “double” use of xp is a problem, simply because there are TWO tasks under investigation when the proprioceptive changes are measured along with adaptation. The first is the reaching adaptation task itself: moving under the influence of the clamped cursor. This task is accompanied by a covert estimation of hand location after the movement (x_hat). Given the robustness of implicit adaptation, this estimation appears mandatory and automatic. The second task is the hand localization task, during which the subject is explicitly asked to judge where the hand is. Here, the perceived hand is based on the two available cues: one is the actual hand location xp, and the other is the influence from the just-finished reaching movement (i.e., x_hat). For Bayesian modeling from a normative perspective, sensory integration is based on the available cues to fulfill the task. For the second task of reporting the hand location, the two cues are xp and x_hat (with a possible effect of the visual target, which is unbiased since it is defined as 0 in model simulation; thus, its presence does not induce any shift effect). xp is used sequentially in this sense. Thus, its dual use is well justified.
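
      The sequential use of xp described above can be illustrated with standard reliability-weighted integration of two Gaussian cues. This is only a sketch under stated assumptions, not the PEA model itself: the cue values, the standard deviations, and the choice to carry the first-stage posterior uncertainty into the second stage are all hypothetical, as are the function and variable names.

```python
def combine_cues(mu_a, sigma_a, mu_b, sigma_b):
    """Combine two independent Gaussian cues by inverse-variance weighting."""
    var_a, var_b = sigma_a ** 2, sigma_b ** 2
    w_a = var_b / (var_a + var_b)          # the more reliable cue gets more weight
    mu = w_a * mu_a + (1.0 - w_a) * mu_b
    sigma = (var_a * var_b / (var_a + var_b)) ** 0.5
    return mu, sigma

# Hypothetical values, in degrees.
x_p, sigma_p = 10.0, 8.0     # proprioceptive cue: actual hand location
x_v, sigma_v = -10.0, 8.0    # visual cue during the reach

# Task 1 (reaching): covert estimate of the hand after the movement.
x_hat, sigma_hat = combine_cues(x_p, sigma_p, x_v, sigma_v)

# Task 2 (hand report): x_hat from the reach is combined with x_p again.
x_report, sigma_report = combine_cues(x_hat, sigma_hat, x_p, sigma_p)

print("x_hat:", x_hat, "reported hand:", x_report)
```

      With these illustrative numbers the reported hand location lies between x_hat and xp, which shows how using xp in the report phase can shift the report away from the covert estimate formed during the reach.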

      Our hypothesis is that the reported hand position results from a combination of x_hat from the previous movement and the current hand position xp. However, specifically for the overshoot of the judged hand location in the late part of the adaptation (Fig4A), the reviewer raised three alternative explanations by assuming that the PReMo model is correct. Under the PReMo model, the estimated hand location is only determined by x_hat, and xp is not used in the hand location report phase. In addition, x_hat (with xp used once) and a visual recalibration of the target can explain away the gradual shift from negative to positive (overshoot).

      We don’t think any of them can parsimoniously explain our findings here, and we go through these three hypotheses one by one:

      (1) increasing xp's contribution (assuming visual uncertainty increases when the visual cursor is absent during the hand report phase)

      (2) decreasing σp (assuming that participants pay more attention to the hand during the report phase)

      The first two alternative explanations basically assume that xp has a larger contribution (weighting in Bayesian terms) in the hand location report phase than in the adaptation movement phase, whether due to an increase in visual uncertainty (alternative explanation 1) or a reduction in proprioceptive uncertainty (alternative explanation 2). Thus, we assume that the reviewer suggests that a larger weight for xp can explain why the perceived hand location changes gradually from negative to positive. However, per the PReMo model, a larger weight for xp will only affect the estimated hand location, which is already assumed to change from negative to zero. More weight on xp in the hand report phase (compared to the adaptation movement phase) would not explain the change of the reported hand location from negative to positive. This is because, no matter how much weight xp has, the PReMo model assumes a saturation for the influence of xp on the estimated hand location. Thus, the estimate would not exceed zero in late adaptation. The PReMo model would then rely on the so-called visual shift of the target to explain the overshoot. This leads us to the third alternative the reviewer raised:

      (3) it could be that the perceived target position undergoes recalibration in the opposite direction to proprioceptive recalibration.

      The PReMo model originally assumed that the perceived target location was biased in order to explain away the positive overshoot of the reported hand location. We assume that the reviewer suggests that the perceived target position, which is shifted in the positive direction, also “biases” the perceived hand position. We also assume that the reviewer suggests that the perceived hand location after a clamp trial is zero, and that the shifted perceived target position somehow “biases” the reported hand location after a clamp trial. Unfortunately, we did not see any mathematical formulation of this biasing effect in the original paper (Tsay, Kim, Haith, et al., 2022). We are not able to come up with any formulation of this hypothesized biasing effect based on Bayesian cue integration principles. Target and hand are two separate perceived items; how one relates to the other needs justification from a normative perspective when discussing Bayesian models. Note that this is not a problem for our PEA model, in which both cues used are about hand localization: one is the covert hand-location estimate from the reaching movement, and the other is xp.

      We believe that mathematically formulating the biasing effect (Figure 4A) is non-trivial since the reported hand location changes continuously from negative to positive. Thus, quantitative model predictions, like the ones our PEA model presents here, are needed.

      To rigorously test the possible effect of visual recalibration of the target, there are two things to do: 1) use a psychometric method to measure the biased perception of the target, and 2) redo the Tsay et al. (2020) experiment without the target. For 2), compared to the case with the target, the PEA model would predict a larger overshoot, while the PReMo model would predict a smaller overshoot or even zero overshoot. This can be left for future studies.

      (2) Effect of Visual Uncertainty on Error Size:

      I appreciate the authors' response about methodological differences between the cursor cloud used in previous studies and the Gaussian blob used in the current study. However, it is still not clear to me how the authors reconcile previous studies showing that visual uncertainty reduced implicit adaptation for small but not large errors (Tsay et al, 2021; Makino, et al 2023) with the current findings, where visual uncertainty reduced implicit adaptation for large but not small errors.

      Could the authors connect the dots here: I could see that the cursor cloud increases potential overlap with the visual target when the visual error is small, resulting in intrinsic reward-like mechanisms (Kim et al, 2019), which could potentially explain attenuated implicit adaptation for small visual errors. However, why would implicit adaptation in response to large visual errors remain unaffected by the cursor cloud? Note that we did verify that sigma_v is increased in (Tsay et al. 2021), so it is unlikely due to the cloud simply failing as a manipulation of visual uncertainty.

      In addition, we also reasoned that testing individuals with low vision could offer a different test of visual uncertainty (Tsay et al, 2023). The advantage here is that both control and patients with low vision are provided with the same visual input-a single cursor. Our findings suggest that uncertainty due to low vision also shows reduced implicit adaptation in response to small but not large errors, contrary to the findings in the current paper. Missing in the manuscript is a discussion related to why the authors' current findings contradict those of previous results.

      To connect the dots for the two previous studies (Tsay et al., 2021, 2023) (note that Makino et al., 2023 is not part of this discussion, since it investigated the weights of multiple cursors as opposed to the visual uncertainty associated with a cursor cloud):

      First, we want to re-emphasize that using the cursor cloud to manipulate visual uncertainty brings some confounds, making it not ideal for studying visuomotor adaptation. For example, in the error-clamp paradigm, the error is defined as an angular deviation. The cursor cloud consists of multiple cursors spanning a range of angles, which affects both the sensory uncertainty (the intended outcome) and the sensory estimate of the angle (the error estimate, the undesired outcome). In Bayesian terms, the cursor cloud aims to modulate the sigma of a distribution (σv in our model), but it additionally affects the mean of the distribution (µ). This unnecessary confound is neatly avoided by using cursor blurring, since the blurred cursor is still a single cursor with its center (µ) unchanged. Furthermore, as correctly pointed out in the original paper by Tsay et al., 2020, the cursor cloud often overlaps with the visual target; this "target hit" would affect adaptation, possibly via a reward learning mechanism (Kim et al., 2019). This is a second confound that accompanies the cursor cloud. Yes, the cursor cloud was verified as associated with high visual uncertainty (Tsay et al., 2021); however, this verification was done with a psychophysics method on a clean background, not in the context of a hand reaching to a target, which is what is needed here. Thus, despite the cursor cloud having a sizeable visual uncertainty, our criticisms of it still hold when it is used in error-clamp adaptation.

      Second, bearing these confounds of the cursor cloud in mind, we postulate one important factor that has not been considered in any model thus far and that might underlie the lack of difference between the single-cursor clamp and the cloud-cursor clamp when the clamp size is large: the cursor cloud might be harder to ignore than a single cursor. For Bayesian sensory integration, the naive model is to consider the relative reliability of cues only. Yes, the cloud is more uncertain than a single cursor in terms of indicating the movement direction. However, given its large spread, it is probably harder to ignore during error-clamp movements. Note that ignoring the clamped cursor is the task instruction, but the large scatter of the cursor cloud is more salient and thus plausibly harder to ignore. This might increase the weighting of the visual cue despite its higher visual uncertainty. This extra confound is arguably minimized by using the blurred cursor as in our Exp4, since the blurred cursor did not increase the visual angle much (Figure 5D; blurred vs single cursor: 3.4 mm vs 2.5 mm in radius, 3.90° vs 2.87° in spread). In contrast, the visual angle of the dot cloud is at least an order of magnitude larger (cursor cloud vs. single cursor: at least 25° vs. 2.15° in spread, given a 10° standard deviation of random sampling).
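
      The first confound discussed above (a cloud shifts the mean of the visual estimate, not only its spread) can be illustrated with a short simulation. This is a sketch under stated assumptions rather than a reanalysis of any dataset: dot directions are drawn from a normal distribution with the 10° standard deviation mentioned above, while the number of dots per cloud (`n_dots=25`), the 30° clamp, and the use of the cloud centroid as the observer's point estimate are hypothetical choices.

```python
import numpy as np


def cloud_centroids(clamp_deg, n_trials, n_dots=25, sd_deg=10.0, seed=0):
    """Centroid direction (deg) of a randomly sampled dot cloud on each trial."""
    rng = np.random.default_rng(seed)
    dots = rng.normal(loc=clamp_deg, scale=sd_deg, size=(n_trials, n_dots))
    return dots.mean(axis=1)


clamp = 30.0
centroids = cloud_centroids(clamp, n_trials=10_000)
# A blurred cursor keeps its center on the nominal clamp angle on every trial.
blurred_centers = np.full(10_000, clamp)

print("cloud centroid SD across trials (deg):", centroids.std(ddof=1))
print("blurred-cursor center SD across trials (deg):", blurred_centers.std(ddof=1))
```

      Under these assumptions the centroid jitters from trial to trial with a standard deviation of roughly sd_deg / sqrt(n_dots), so the effective error size varies across trials for the cloud but not for the blurred cursor.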

      Third, for the low-vision study (Tsay et al., 2023), the patients indeed show reduced implicit adaptation for a 3° clamp (consistent with our PEA model) but intact adaptation for a 30° clamp (not consistent). Though this pattern appears similar to what happens in people with normal vision whose visual uncertainty is upregulated by a cursor cloud (Tsay et al., 2021), we are not completely convinced that the same underlying mechanism governs these two datasets. Low-vision patients indeed have higher visual uncertainty about color, brightness, and object location, but their visual uncertainty about visual motion is still unknown. Given the differences in impairment among people with low vision (e.g., peripheral or central vision affected) and the different roles of peripheral and central vision in movement planning and control (Sivak & Mackenzie, 1992), the overall effect of visual uncertainty in people with low vision is unclear. The direction of cursor movement, which is what matters for visuomotor rotation here, is likely related to visual motion perception. Unfortunately, the original study did not measure this uncertainty in low-vision patients. We believe our Exp1 offers a valid method for this purpose in future studies. More importantly, we should not expect low-vision patients to integrate visual cues in the same way as people with normal vision, given their long-term adaptation to their vision difficulties. Thus, we are conservative about interpreting the seemingly similar findings across the two studies (Tsay et al., 2021, 2023) as revealing the same mechanism.

      A side note: these two previous studies proposed a so-called mis-localization hypothesis, i.e., the cursor cloud was mislocated for small clamp size (given its overlapping with the target) but not for large clamp size. They suggested that the lack of uncertainty effect at small clamp sizes is due to mislocalization, while the lack of uncertainty effect at large clamp sizes is because implicit adaptation is not sensitive to uncertainty at large angles. Thus, these two studies admit that cursor cloud not only upregulates uncertainty but also generates an unwanted effect of so-called “mis-localization” (overlapping with the target). Interestingly, their hypothesis about less sensitivity to visual uncertainty for large clamps is not supported by a model or theory but merely a re-wording of the experiment results.

      In sum, our current study cannot offer an easy answer to "connect the dots" between the two aforementioned studies because of methodological issues and the special nature of the population. However, for resolving conflicting findings, our study suggests solutions that include a psychometric test to quantify visual uncertainty for cursor motion (Exp1), a better uncertainty-manipulation method that avoids a couple of confounds (Exp4, blurred cursor), and a falsifiable model. Future work can resolve the differences between studies based on the new insights from the current study.

      Reviewer #2 (Public Review):

      Summary:

      The authors present the Perceptual Error Adaptation (PEA) model, a computational approach offering a unified explanation for behavioral results that are inconsistent with standard state-space models. Beginning with the conventional state-space framework, the paper introduces two innovative concepts. Firstly, errors are calculated based on the perceived hand position, determined through Bayesian integration of visual, proprioceptive, and predictive cues. Secondly, the model accounts for the eccentricity of vision, proposing that the uncertainty of cursor position increases with distance from the fixation point. This elegantly simple model, with minimal free parameters, effectively explains the observed plateau in motor adaptation under the implicit motor adaptation paradigm using the error-clamp method. Furthermore, the authors experimentally manipulate visual cursor uncertainty, a method established in visuomotor studies, to provide causal evidence. Their results show that the adaptation rate correlates with perturbation sizes and visual noise, uniquely explained by the PEA model and not by previous models. Therefore, the study convincingly demonstrates that implicit motor adaptation is a process of Bayesian cue integration

      Strengths:

      In the past decade, numerous perplexing results in visuomotor rotation tasks have questioned their underlying mechanisms. Prior models have individually addressed aspects like aiming strategies, motor adaptation plateaus, and sensory recalibration effects. However, a unified model encapsulating these phenomena with a simple computational principle was lacking. This paper addresses this gap with a robust Bayesian integration-based model. Its strength lies in two fundamental assumptions: motor adaptation's influence by visual eccentricity, a well-established vision science concept, and sensory estimation through Bayesian integration. By merging these well-founded principles, the authors elucidate previously incongruent and diverse results with an error-based update model. The incorporation of cursor feedback noise manipulation provides causal evidence for their model. The use of eye-tracking in their experimental design, and the analysis of adaptation studies based on estimated eccentricity, are particularly elegant. This paper makes a significant contribution to visuomotor learning research.

      The authors discussed in the revised version that the proposed model can capture the general implicit motor learning process in addition to the visuomotor rotation task. In the discussion, they emphasize two main principles: the automatic tracking of effector position and the combination of movement cues using Bayesian integration. These principles are suggested as key to understanding and modeling various motor adaptations and skill learning. The proposed model could potentially become a basis for creating new computational models for skill acquisition, especially where current models fall short.

      Weaknesses:

      The proposed model is described as elegant. In this paper, the authors test the model within a limited example condition, demonstrating its relevance to the sensorimotor adaptation mechanisms of the human brain. However, the scope of the model's applicability remains unclear. It has shown the capacity to explain prior data, thereby surpassing previous models that rely on elementary mathematics. To solidify its credibility in the field, the authors must gather more supporting evidence.

      Indeed, our model here is based on one particular experimental paradigm, i.e., error-clamp adaptation. We used it simply because 1) this paradigm is a rare example in which implicit motor learning can be isolated cleanly, and 2) there are a few conflicting findings in the literature for us to explain with a unified model.

      As for our model’s broad impact, we believe that as long as people need to locate their effectors during motor learning, the general principle laid out here will be applicable. In other words, repetitive movements with a Bayesian combination of movement-related cues can underlie the implicit process of various forms of motor learning. To showcase its broad impact, in upcoming studies we will extend this model to other motor learning paradigms, starting with motor adaptation paradigms that involve both explicit and implicit processes.

      Reviewer #3 (Public Review):

      (2.1) Summary

      In this paper, the authors model motor adaptation as a Bayesian process that combines visual uncertainty about the error feedback, uncertainty about proprioceptive sense of hand position, and uncertainty of predicted (=planned) hand movement with a learning and retention rate as used in state space models. The model is built with results from several experiments presented in the paper and is compared with the PReMo model (Tsay, Kim et al., 2022) as well as a cue combination model (Wei & Körding, 2009). The model and experiments demonstrate the role of visual uncertainty about error feedback in implicit adaptation.

      In the introduction, the authors notice that implicit adaptation (as measured in error-clamp based paradigms) does not saturate at larger perturbations, but decreases again (e.g. Morehead et al., 2017 shows no adaptation at 135° and 175° perturbations). They hypothesized that visual uncertainty about cursor position increases with larger perturbations since the cursor is further from the fixated target. This could decrease importance assigned to visual feedback which could explain lower asymptotes.

      The authors characterize visual uncertainty for 3 rotation sizes in a first experiment, and while this experiment could be improved, it is probably sufficient for the current purposes. Then the authors present a second experiment where adaptation to 7 clamped errors are tested in different groups of participants. The models' visual uncertainty is set using a linear fit to the results from experiment 1, and the remaining 4 parameters are then fit to this second data set. The 4 parameters are 1) proprioceptive uncertainty, 2) uncertainty about the predicted hand position, 3) a learning rate and 4) a retention rate. The authors' Perceptual Error Adaptation model ("PEA") predicts asymptotic levels of implicit adaptation much better than both the PReMo model (Tsay, Kim et al., 2022), which predicts saturated asymptotes, or a causal inference model (Wei & Körding, 2007) which predicts no adaptation for larger rotations. In a third experiment, the authors test their model's predictions about proprioceptive recalibration, but unfortunately compare their data with an unsuitable other data set (Tsay et al. 2020, instead of Tsay et al. 2021). Finally, the authors conduct a fourth experiment where they put their model to the test. They measure implicit adaptation with increased visual uncertainty, by adding blur to the cursor, and the results are again better in line with their model (predicting overall lower adaptation), than with the PReMo model (predicting equal saturation but at larger perturbations) or a causal inference model (predicting equal peak adaptation, but shifted to larger rotations). In particular the model fits for experiment 2 and the results from experiment 4 show that the core idea of the model has merit: increased visual uncertainty about errors dampens implicit adaptation.
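
      The qualitative predictions summarized in this paragraph (asymptotes that rise and then fall with clamp size, and lower adaptation under added visual blur) can be reproduced with a small closed-form sketch. It is an illustration under assumptions, not the authors' fitted model: the update rule x_p <- A*x_p - B*(w_p*x_p + w_v*clamp) and its fixed point are a simplified reading of the description above; σu and σp are the fitted values quoted later in this response; the slope and intercept of σv, the retention rate A, the learning rate B, the `blur_gain` factor, and the listed clamp sizes are all hypothetical.

```python
import numpy as np

SIGMA_U = 5.048   # uncertainty of predicted hand position (deg), quoted in this response
SIGMA_P = 11.119  # proprioceptive uncertainty (deg), quoted in this response


def cue_weights(clamp_deg, slope=0.3, intercept=1.8, blur_gain=1.0):
    """Normalized precision weights for (prediction, proprioception, vision)."""
    # Hypothetical: visual uncertainty grows linearly with eccentricity,
    # and cursor blur scales it by a constant factor.
    sigma_v = blur_gain * (slope * abs(clamp_deg) + intercept)
    precision = np.array([SIGMA_U, SIGMA_P, sigma_v]) ** -2.0
    return precision / precision.sum()


def asymptote(clamp_deg, retention=0.97, learning=0.2, **kwargs):
    """Fixed point of x_p <- A*x_p - B*(w_p*x_p + w_v*clamp), in degrees."""
    _, w_p, w_v = cue_weights(clamp_deg, **kwargs)
    return -learning * w_v * clamp_deg / (1.0 - retention + learning * w_p)


for clamp in (2, 4, 8, 16, 32, 64, 95):  # illustrative clamp sizes (deg)
    sharp = asymptote(clamp)
    blurred = asymptote(clamp, blur_gain=2.0)
    print(f"clamp {clamp:>3} deg: sharp {sharp:7.2f}, blurred {blurred:7.2f}")
```

      The negative sign reflects adaptation in the direction opposite to the clamp. With these placeholder values the magnitude peaks at intermediate clamp sizes and is reduced at every clamp size when the blur factor is raised.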

      (2.2) Strengths

      In this study the authors propose a Perceptual Error Adaptation model ("PEA") and the work combines various ideas from the field of cue combination, Bayesian methods and new data sets, collected in four experiments using various techniques that test very different components of the model. The central component of visual uncertainty is assessed in a first experiment. The model uses 4 other parameters to explain implicit adaptation. These parameters are: 1) a learning and 2) a retention rate, as used in popular state space models and the uncertainty (variance) of 3) predicted and 4) proprioceptive hand position. In particular, the authors observe that asymptotes for implicit learning do not saturate, as claimed before, but decrease again when rotations are very large and that this may have to do with visual uncertainty (e.g. Tsay et al., 2021, J Neurophysiol 125, 12-22). The final experiment confirms predictions of the fitted model about what happens when visual uncertainty is increased (overall decrease of adaptation). By incorporating visual uncertainty depending on retinal eccentricity, the predictions of the PEA model for very large perturbations are notably different from, and better than, the predictions of the two other models it is compared to. That is, the paper provides strong support for the idea that visual uncertainty of errors matters for implicit adaptation.

      (2.3) Weaknesses

      Although the authors don't say this, the "concave" function that shows that adaptation does not saturate for larger rotations has been shown before, including in papers cited in this manuscript.

      For a proper citation of the “concave” adaptation function: we assume the reviewer is referring to the study by Morehead et al., 2017, which tested large clamp sizes up to 135° and 175°. Unsurprisingly, the 135° and 175° conditions lead to nearly zero adaptation, possibly due to the trivial fact that people cannot even see the moving cursor. We have cited this seminal study from the very beginning. All other error-clamp studies with a block design emphasized an invariant or saturated implicit adaptation with large rotations (e.g., Kim et al., 2019).

      The first experiment, measuring visual uncertainty for several rotation sizes in error-clamped paradigms has several shortcomings, but these might not be so large as to invalidate the model or the findings in the rest of the manuscript. There are two main issues we highlight here. First, the data is not presented in units that allow comparison with vision science literature. Second, the 1 second delay between movement endpoint and disappearance of the cursor, and the presentation of the reference marker, may have led to substantial degradation of the visual memory of the cursor endpoint. That is, the experiment could be overestimating the visual uncertainty during implicit adaptation.

      For the issues related to visual uncertainty measurement in Exp1:

      First, our visual uncertainty is about cursor motion direction in the display plane, and the measurement in Exp1 has never been done before. Thus, we do not think our data are comparable to any findings in vision science about foveal/peripheral comparisons. We cited the work of Klein and others (Klein & Levi, 1987; Levi et al., 1987) in vision science since their studies showed that deviation from fixation is associated with an increase in visual uncertainty. Their studies thus inspired us to conduct Exp1 to probe how the visual uncertainty of interest (specifically for visual motion direction) changes with increasing deviation from fixation. Any model and its parameters should be specifically tailored to the task or context it tries to emulate. In our case, motion direction in a center-out reaching setting is the modeled context, and all the relevant model parameters should be specified in movement angles. This is particularly important since we need to estimate parameters from one experiment to predict behaviors in another experiment.

      Second, based on previous vision studies, the 1 s delay of the reference cursor has minimal impact on the estimate of visual uncertainty. Our Exp1 used a visual paradigm similar to that of White et al. (1992), who showed that delay does not lead to an increase in visual uncertainty over a broad range of values (from 0.2 s to >1 s; see their Figures 5-6).

      These two problems have been addressed in the revised manuscript, with proper citations listed.

      The paper's third experiment relies to a large degree on reproducing patterns found in one particular paper, where the reported hand positions - as a measure of proprioceptive sense of hand position - are given and plotted relative to an ever present visual target, rather than relative to the actual hand position. That is, 1) since participants actively move to a visual target, the reported hand positions do not reflect proprioception, but mostly the remembered position of the target participants were trying to move to, and 2) if the reports are converted to a difference between the real and reported hand position (rather than the difference between the target and the report), those would be on the order of ~20° which is roughly two times larger than any previously reported proprioceptive recalibration, and an order of magnitude larger than what the authors themselves find (1-2°) and what their model predicts. Experiment 3 is perhaps not crucial to the paper, but it nicely provides support for the idea that proprioceptive recalibration can occur with error-clamped feedback.

      Reviewer 3 thinks Tsay 2020 dataset is not appropriate for our theorization, but we respectfully disagree. For the three points raised here, we would like to elaborate:

      (1) As we addressed in the previous response, the reported hand location in Figure 4A (Tsay et al., 2020) is not from a test of proprioceptive recalibration as conventionally defined. In the revision, we explicitly state that this dataset is not about proprioceptive recalibration and also delete texts that might mislead people to think so (see Results section). Instead, proprioceptive recalibration is measured by passive movement, as in our Exp3 (Figure 4E). For error-clamp adaptation here, "the remembered position of the target" is the target. Clearly, the participants did not report the target position, which is ever-present. Instead, their reported hand location shows an interestingly continuous change with ongoing adaptation.

      (2) Since the Tsay 2020 dataset is not a so-called proprioceptive recalibration, we need not take the difference between the reported location and the actual hand location. Indeed, the difference would be ~20 degrees, but comparing it to the previously reported proprioceptive recalibration is like comparing apples to oranges. In fact, throughout the paper, we refer to the results in Fig. 4A as “reported hand location”, not proprioceptive recalibration. The target direction is defined as zero degrees; thus, its presence will not bias the reported hand location in the Bayesian cue combination (as this visual cue has a mean value of 0). Using the target as the reference also simplifies our modeling.

      (3) Exp3 is crucial for our study since it shows our model and its simple Bayesian cue combination principle are applicable not only to implicit adaptation but also to proprioceptive measures during adaptation. Furthermore, it reproduced the so-called proprioceptive recalibration and explained it away with the same Bayesian cue combination as the adaptation. We noticed that this field has accumulated an array of findings on proprioceptive changes induced by visuomotor adaptation. However, currently, there is a lack of a computational model to quantitatively explain them. Our study at least made an initial endeavor to model these changes.

      Perhaps the largest caveat to the study is that it assumes that people do not look at the only error feedback available to them (and can explicitly suppress learning from it). This was probably true in the experiments used in the manuscript, but unlikely to be the case in most of the cited literature. Ignoring errors and suppressing adaptation would also be a disastrous strategy to use in the real world, such that our brains may not be very good at this. So the question remains to what degree - if any - the ideas behind the model generalize to experiments without fixation control, and more importantly, to real life situations.

      The largest caveat raised by the reviewer appears to be directed at the error-clamp paradigm in general, not only at our particular study. In essence, this paradigm indeed requires participants to ignore the clamped error; thus, the adaptive response it induces can be attributed to implicit adaptation. The original paper that proposed this paradigm (Morehead et al., 2017) has been cited 220 times (according to Google Scholar at the time of this writing, 06/2024), indicating that the field has viewed this paradigm favorably.

      Furthermore, we agree that this kind of instruction and feedback (invariant clamp) differs from daily life experience, but this does not prevent us from gaining theoretical insights by studying human behavior under this kind of "artificial" task setting. Consider saccadic adaptation (Deubel, 1987; Kojima et al., 2004): the target jumps while the eye moves towards it, and this somewhat artificial manipulation again makes people adapt implicitly, even though the adaptation itself would be a "disastrous" strategy in real-life situations. However, scientists have gained an enormous understanding of motor adaptation from this adaptation, which seems counterproductive in real life. Also, consider perceptual learning of task-irrelevant stimuli (Seitz & Watanabe, 2005, 2009): when participants are required to learn to discriminate one type of visual stimulus, the background shows another type of stimulus, which people gradually learn even though they do not even notice its presence. This "implicit" learning can be detrimental in real life, too, but the paradigm itself has advanced our understanding of the inner workings of the cognitive system.

      Recommendations for the authors:

      Reviewer #2 (Recommendations For The Authors):

      L101: There is a typo: (Tsay et al., 2020), 2020) should be corrected to (Tsay et al., 2020).

      Thanks for pointing it out, we corrected this typo.

      L224-228: It would be beneficial to evaluate the validity of the estimated sigma_u and sigma_p based on previous reports.

      We can roughly estimate σu by evaluating the variability of reaching angles during the baseline phase when no perturbation is applied. The standard deviation of the reaching angle in Exp 2 is 5.128° ± 0.190°, which is close to the σu estimated by the model (5.048°). We also used a separate perceptual experiment to test the proprioceptive uncertainty (n = 13, see Figure S6); σp from this experiment is 9.737° ± 5.598°, also close to the σp extracted by the model (11.119°). We added these new analysis results to the final version of the paper.
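
      The baseline-variability check described here can be sketched as follows. This is an assumed procedure, not the authors' analysis script: it takes the within-participant standard deviation of baseline reach angles and summarizes it as a group mean with a standard error, assuming the reported ± value is an across-participant SEM (the response does not say which statistic it is). The function name and the synthetic data are hypothetical.

```python
import numpy as np


def baseline_motor_sd(angles_by_participant):
    """Group mean and SEM of each participant's baseline reach-angle SD (deg)."""
    per_subject = np.array([np.std(a, ddof=1) for a in angles_by_participant])
    sem = per_subject.std(ddof=1) / np.sqrt(len(per_subject))
    return per_subject.mean(), sem


# Synthetic stand-in data: 20 hypothetical participants, 60 baseline reaches each.
rng = np.random.default_rng(1)
fake_baseline = [rng.normal(0.0, 5.0, size=60) for _ in range(20)]

mean_sd, sem_sd = baseline_motor_sd(fake_baseline)
print(f"baseline reach-angle SD: {mean_sd:.3f} +/- {sem_sd:.3f} deg")
```

      The resulting group mean is the quantity that would be compared against the model-derived σu.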

      L289-298: I found it difficult to understand the update equations of the proprioceptive calibration based on the PEA model. Providing references to the equations or better explanations would be helpful.

      We expanded the process of proprioceptive calibration in Supplementary Text 1 with step-by-step equations and more explanations. 

      Reviewer #3 (Recommendations For The Authors):

      Suggestions (or clarification of previous suggestions) for revisions

      The authors persist in using the Tsay et al 2020 paper despite its many drawbacks, which the authors attempt to address in their reply. But the main drawback is that the results in the 2020 paper are NOT relative to the unseen hand but to the visual target the participants were supposed to move their hand to. If the results were converted to be relative to the unseen hand, the localization biases would be over 20 deg in magnitude.

      The PEA simulations are plotted relative to the unseen hand which makes sense. If the authors want to persist using the Tsay 2020 dataset despite any issues, they at least need to make sure that the simulations are mimicking the same change. That is, the data from Tsay 2020 needs to be converted to the same variable used in the current paper.

      If the main objection to using the Tsay 2021 dataset is that the design would lead to forgetting: we found that while active localization (or any intervening active movement, such as a no-cursor reach) does lead to some interference or forgetting (a small reduction in the overall magnitude of adaptation), this is not the case for passive localization; see Ruttle et al, 2021 (data on OSF). This was also just a suggestion; there may of course be other, more suitable data sets.

      As stated above, changing the reference system is not necessary, nor does it affect our results. The Tsay et al. 2020 dataset is unique since it shows the gradual change of the reported hand location along with error-clamp adaptation. The forgetting (or reduction in proprioceptive bias), even if it exists, would not affect the fitting quality of our model for the Tsay 2020 dataset: if we assume that forgetting is invariant over the adaptation process, the forgetting would only reduce the proprioceptive bias uniformly across trials. This can be accounted for by a smaller weight on the covert hand-location estimate from the preceding movement. The critical fact is that the model can explain the gradual drift of the proprioceptive judgment of the hand location.

      By the way, Ruttle et al.'s 2021 dataset is not for error-clamp adaptation, and thus we will leave it to test our model extension in the future (after incorporating an explicit process in the model).

      References

      Deubel, H. (1987). Adaptivity of gain and direction in oblique saccades. Eye Movements from Physiology to Cognition. https://www.sciencedirect.com/science/article/pii/B9780444701138500308

      Kim, H. E., Parvin, D. E., & Ivry, R. B. (2019). The influence of task outcome on implicit motor learning. eLife, 8. https://doi.org/10.7554/eLife.39882

      Klein, S. A., & Levi, D. M. (1987). Position sense of the peripheral retina. JOSA A, 4(8), 1543–1553.

      Kojima, Y., Iwamoto, Y., & Yoshida, K. (2004). Memory of learning facilitates saccadic adaptation in the monkey. The Journal of Neuroscience: The Official Journal of the Society for Neuroscience, 24(34), 7531–7539.

      Levi, D. M., Klein, S. A., & Yap, Y. L. (1987). Positional uncertainty in peripheral and amblyopic vision. Vision Research, 27(4), 581–597.

      Morehead, J. R., Taylor, J. A., Parvin, D. E., & Ivry, R. B. (2017). Characteristics of implicit sensorimotor adaptation revealed by task-irrelevant clamped feedback. Journal of Cognitive Neuroscience, 29(6), 1061–1074.

      Seitz, & Watanabe. (2005). A unified model for perceptual learning. Trends in Cognitive Sciences, 9(7), 329–334.

      Seitz, & Watanabe. (2009). The phenomenon of task-irrelevant perceptual learning. Vision Research, 49(21), 2604–2610.

      Sivak, B., & Mackenzie, C. L. (1992). Chapter 10 The Contributions of Peripheral Vision and Central Vision to Prehension. In L. Proteau & D. Elliott (Eds.), Advances in Psychology (Vol. 85, pp. 233–259). North-Holland.

      Tsay, J. S., Avraham, G., Kim, H. E., Parvin, D. E., Wang, Z., & Ivry, R. B. (2021). The effect of visual uncertainty on implicit motor adaptation. Journal of Neurophysiology, 125(1), 12–22.

      Tsay, J. S., Kim, H. E., Saxena, A., Parvin, D. E., Verstynen, T., & Ivry, R. B. (2022). Dissociable use-dependent processes for volitional goal-directed reaching. Proceedings. Biological Sciences / The Royal Society, 289(1973), 20220415.

      Tsay, J. S., Kim, H., Haith, A. M., & Ivry, R. B. (2022). Understanding implicit sensorimotor adaptation as a process of proprioceptive re-alignment. eLife, 11, e76639.

      Tsay, J. S., Parvin, D. E., & Ivry, R. B. (2020). Continuous reports of sensed hand position during sensorimotor adaptation. Journal of Neurophysiology, 124(4), 1122–1130.

      Tsay, J. S., Tan, S., Chu, M. A., Ivry, R. B., & Cooper, E. A. (2023). Low Vision Impairs Implicit Sensorimotor Adaptation in Response to Small Errors, But Not Large Errors. Journal of Cognitive Neuroscience, 35(4), 736–748.

      White, J. M., Levi, D. M., & Aitsebaomo, A. P. (1992). Spatial localization without visual references. Vision Research, 32(3), 513–526.

      The following is the authors’ response to the original reviews.

      eLife assessment

      This study presents a valuable finding on the influence of visual uncertainty and Bayesian cue combination on implicit motor adaptation in young healthy participants. The evidence supporting the claims of the authors is solid, although a better discussion of the link between the model variables and the outcomes of related behavioral experiments would strengthen the conclusions. The work will be of interest to researchers in sensory cue integration and motor learning.

      Public Reviews:

      Reviewer #1 (Public Review):

      This valuable study demonstrates a novel mechanism by which implicit motor adaptation saturates for large visual errors in a principled normative Bayesian manner. Additionally, the study revealed two notable empirical findings: visual uncertainty increases for larger visual errors in the periphery, and proprioceptive shifts/implicit motor adaptation are non-monotonic, rather than ramp-like. This study is highly relevant for researchers in sensory cue integration and motor learning. However, I find some areas where statistical quantification is incomplete, and the contextualization of previous studies to be puzzling.

      Thank you for your feedback and the positive highlights of our study. We appreciate your insights and will address the concerns in our revisions.

      Issue #1: Contextualization of past studies.

      While I agree that previous studies have focused on how sensory errors drive motor adaptation (e.g., Burge et al., 2008; Wei and Kording, 2009), I don't think the PReMo model was contextualized properly. Indeed, while PReMo should have adopted clearer language - given that proprioception (sensory) and kinaesthesia (perception) have been used interchangeably, something we now make clear in our new study (Tsay, Chandy, et al. 2023) - PReMo's central contribution is that a perceptual error drives implicit adaptation (see Abstract): the mismatch between the felt (perceived) and desired hand position. The current paper overlooks this contribution. I encourage the authors to contextualize PReMo's contribution more clearly throughout. Not mentioned in the current study, for example, PReMo accounts for the continuous changes in perceived hand position in Figure 4 (Figure 7 in the PReMo study).

      There is no doubt that the current study provides important additional constraints on what determines perceived hand position: Firstly, it offers a normative Bayesian perspective in determining perceived hand position. PReMo suggests that perceived hand position is determined by integrating motor predictions with proprioception, then adding a proprioceptive shift; PEA formulates this as the optimal integration of these three inputs. Secondly, PReMo assumed visual uncertainty to remain constant for different visual errors; PEA suggests that visual uncertainty ought to increase (but see Issue #2).

      Thank you for the comments and suggestions. We have now incorporated the citation for (Tsay et al., 2024) to acknowledge their clarification on the terms of perceptual error. We also agree that our model differs in two fundamental ways. One is to drop the concept of proprioceptive shift and its contribution to the perceived hand location; instead, we resort to a “one-shot” integration of three types of cues with Bayesian rules. This is a more elegant and probably more ecological way of processing hand location per Occam's Razor. The second essential change is to incorporate the dependency of visual uncertainty on perturbation size into the model, as opposed to resorting to a ramp function of proprioceptive changes relative to perturbation size. The ramp function is not well grounded in perception studies. We acknowledge that PReMo was the first to recognize the importance of perceptual error, but we highlight the model differences in our Discussion.
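
      The “one-shot” integration of three cues described above can be illustrated with a minimal sketch, assuming independent Gaussian cues combined by inverse-variance (reliability) weighting. The function name `combine_cues` and all numerical values are hypothetical and chosen only for illustration; they are not taken from the manuscript or its fitted parameters.

```python
def combine_cues(means, sigmas):
    """Reliability-weighted average of independent Gaussian cues.

    Each cue is weighted by its precision (1 / sigma^2); the combined
    estimate has a precision equal to the sum of the cue precisions.
    """
    precisions = [1.0 / (s * s) for s in sigmas]
    total = sum(precisions)
    mean = sum(p * m for p, m in zip(precisions, means)) / total
    sigma = (1.0 / total) ** 0.5
    return mean, sigma


# Hypothetical values in degrees, relative to the target direction (0).
x_v, sigma_v = 30.0, 12.0  # visual cue: direction of the clamped cursor
x_p, sigma_p = 0.0, 10.0   # proprioceptive cue: actual hand direction
x_u, sigma_u = 0.0, 5.0    # predictive cue: planned movement direction

# Perceived hand direction and its uncertainty after combining all three cues.
x_hand_hat, sigma_hat = combine_cues(
    [x_v, x_p, x_u], [sigma_v, sigma_p, sigma_u]
)
print(x_hand_hat, sigma_hat)
```

      Under these assumptions, the perceived hand direction is pulled toward the cursor by an amount that shrinks as the visual cue becomes less reliable, without any separate proprioceptive-shift term.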

      We also think the PReMo model has the potential to explain Fig 4A. But the Tsay et al., 2022 paper assumes that “a generic shift in visual space” explains the gradual proprioceptive changes from negative to positive (see page 17 in Tsay et al., 2022). We do not think that invoking this visual mechanism is necessary to explain Fig 4A; instead, the proprioceptive change is a natural result of hand deviations during implicit adaptation. As the hand moves away from the target (in the positive direction) during adaptation, the estimated hand location goes along with it. We believe this is the correct way of explaining the Fig 4A results. As we played around with the PReMo model, we found it hard to use visual shift to explain this part of the data without additional assumptions (at least not with the ones published in Tsay et al., 2022). Furthermore, our PEA model also parsimoniously explains the proprioceptive shift observed in a completely different setting, i.e., the proprioceptive changes measured by the passive method as a function of perturbation size in Exp 3.

      We expanded the discussion about the comparison between the two models, especially about their different views for explaining Fig4A.

      Issue #2: Failed replication of previous results on the effect of visual uncertainty.

      (2a) A key finding of this paper is that visual uncertainty linearly increases in the periphery; a constraint crucial for explaining the non-monotonicity in implicit adaptation. One notable methodological deviation from previous studies is the requirement to fixate on the target: Notably, in the current experiments, participants were asked to fixate on the target, a constraint not imposed in previous studies. In a free-viewing environment, visual uncertainty may not attenuate as fast, and hence, implicit adaptation does not attenuate as quickly as that revealed in the current design with larger visual errors. Seems like this current fixation design, while important, needs to be properly contextualized considering how it may not represent most implicit adaptation experiments.

      First, we don’t think there is any previous study that examined visual uncertainty as a function of perturbation size. Thus, we do not have a replication problem here. Secondly, our data indicate that even without being asked to fixate on the target, people still predominantly fixate on the target during error-clamp adaptation (when they are “free” viewing). For our Exp 1, the fixation on the straight line between the starting position and the target is 86%-95% (as shown in Figure S1 now, also see below). We also collected eye-tracking data in Exp 4, which is a typical error-clamp experiment. More than 95% of fixations fall within +/- 50 pixels around the center of the screen, even slightly higher than in Exp 1. This is understandable: the typical error-clamp adaptation requires people to ignore the cursor and move the hand towards the target. To minimize the interference of the concurrently moving cursor, people depend on the fixation on the target, the sole task-relevant visual marker in the workspace, to achieve the task goal.

      In sum, we did not require participants to fixate on the target in order to produce the linear dependency of visual uncertainty; we required them to do so to mimic the eye-movement pattern in typical error-clamp learning, which was revealed in our pilot experiment. The visual uncertainty effect is sound, and our study is the first to clearly demonstrate it.

      Author response image 1.

      On a side note (but an important one), the high percentage of fixation on the aiming target is also true for conventional visuomotor rotation, which involves strategic re-aiming (shown in Bromberg et al., 2019; de Brouwer et al., 2018; we also have an upcoming paper showing this). This is one reason that our new theory would also be applicable to other types of motor adaptation.

      (2b) Moreover, the current results - visual uncertainty attenuates implicit adaptation in response to large, but not small, visual errors - deviate from several past studies that have shown that visual uncertainty attenuates implicit adaptation to small, but not large, visual errors (Tsay, Avraham, et al. 2021; Makino, Hayashi, and Nozaki, n.d.; Shyr and Joshi 2023). What do the authors attribute this empirical difference to? Would this free-viewing environment also result in the opposite pattern in the effect of visual uncertainty on implicit adaptation for small and large visual errors?

      We don’t think all the mentioned previous studies manipulated the visual uncertainty in a parametric way, and none of them provided quantitative measures of visual uncertainty. As we detailed in our Exp4 and in our Discussion, we don’t think the Tsay et al., 2021 paper’s manipulation of visual uncertainty is appropriate (see below for 2d). The Makino et al., 2023 study used multiple clamped cursors to perturb people, and its effect is not easily accounted for since additional processes might be invoked given this kind of complex visual feedback. More importantly, we do not think this is a direct way of modulating visual uncertainty, nor did they provide any evidence for it.

      (2c) In the current study, the measure of visual uncertainty might be inflated by brief presentation times of comparison and referent visual stimuli (only 150 ms; our previous study allowed for a 500 ms viewing time to make sure participants see the comparison stimuli). Relatedly, there are some individuals whose visual uncertainty is greater than 20 degrees standard deviation. This seems very large, and less likely in a free-viewing environment.

      For our 2AFC, the reference stimulus is the actual clamped cursor, which lasts for 800 ms. The comparison stimulus is a 150-ms dot representation appearing near the reference. For measuring perception of visual motion, this duration is sufficient as previous studies used similar durations (Egly & Homa, 1984; Owsley et al., 1995). We think the 20-degree standard deviation is reasonable given that people fixate on the target, with only peripheral vision to process the fast-moving cursor. The steep linear increase in visual uncertainty about visual motion is well documented. The last author of this paper has shown that the uncertainty of visual motion speed (though not about angles) follows the same steep trend (Wei et al., 2010). It is noteworthy that, without using our measured visual uncertainty in Exp1, if we fit the adaptation data in Exp2 to “estimate” the visual uncertainty, the two are in fact well aligned with each other (see Figure S7 and Supplementary Text 2). This strongly supports that our estimation is valid and accurate. We think this high visual uncertainty is an important message to the field. Thus, we now highlight its magnitude in our Discussion.

      (2d) One important confound between clear and uncertain (blurred) visual conditions is the number of cursors on the screen. The number of cursors may have an attenuating effect on implicit adaptation simply due to task-irrelevant attentional demands (Parvin et al. 2022), rather than that of visual uncertainty. Could the authors provide a figure showing these blurred stimuli (gaussian clouds) in the context of the experimental paradigm? Note that we addressed this confound in the past by comparing participants with and without low vision, where only one visual cursor is provided for both groups (Tsay, Tan, et al. 2023).

      Thank you for raising this important point about types of visual stimuli for manipulating uncertainty. We used Gaussian blur of a single cursor (similar to Burge et al., 2008) instead of a cloud of dots. We now added a figure inset to show how this blur looks.

      Using a cursor cloud (Makino et al., 2023; Tsay et al., 2021) to modulate visual uncertainty has inherent drawbacks that make it unsuitable for visuomotor adaptation. For the error clamp paradigm, the error is defined as angular deviation. The cursor cloud consists of multiple cursors spanning over a range of angles, which affects both the sensory uncertainty (the intended outcome) and the sensory estimate of angles (the error estimate, the undesired outcome). In Bayesian terms, the cursor cloud aims to modulate the sigma of a distribution (sigma_v in our model), but it additionally affects the mean of the distribution (mu). This unnecessary confound is avoided by using cursor blurring, which is still a cursor with its center (mu) unchanged from a single cursor. Furthermore, as correctly pointed out in the original paper by Tsay et al., 2021, the cursor cloud often overlaps with the visual target; this “target hit” would affect adaptation, possibly via a reward learning mechanism (see Kim et al., 2019). This is a second confound that accompanies the cursor cloud.
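
      The first confound can be illustrated with a small simulation, offered only as a sketch. It assumes that each dot of a cloud is drawn independently from a Gaussian centered on the clamp angle, which may differ from how the stimuli were generated in the cited studies. The function names, the number of dots, and the spread are hypothetical.

```python
import random
import statistics


def cloud_centers(clamp_angle, spread, n_dots, n_trials, seed=0):
    """Centroid angle of a dot cloud on each trial.

    Assumes every dot angle is an independent Gaussian sample around the
    clamp angle, so the centroid itself varies from trial to trial.
    """
    rng = random.Random(seed)
    centers = []
    for _ in range(n_trials):
        dots = [rng.gauss(clamp_angle, spread) for _ in range(n_dots)]
        centers.append(statistics.mean(dots))
    return centers


def blur_centers(clamp_angle, n_trials):
    """Center of a single blurred cursor: fixed at the clamp angle."""
    return [clamp_angle] * n_trials


# Hypothetical stimulus settings (degrees).
cloud = cloud_centers(clamp_angle=30.0, spread=10.0, n_dots=25, n_trials=2000)
blur = blur_centers(clamp_angle=30.0, n_trials=2000)

# Trial-to-trial variability of the stimulus center (the "mu" of the cue).
print(statistics.pstdev(cloud), statistics.pstdev(blur))
```

      Under this assumption, the centroid of the cloud jitters across trials while the center of the blurred cursor stays at the clamp angle, so only the blur manipulation changes uncertainty without also changing the error signal.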

      Issue #3: More methodological details are needed.

      (3a) It's unclear why, in Figure 4, PEA predicts an overshoot in terms of perceived hand position from the target. In PReMo, we specified a visual shift in the perceived target position, shifted towards the adapted hand position, which may result in overshooting of the perceived hand position with this target position. This visual shift phenomenon has been discovered in previous studies (e.g., (Simani, McGuire, and Sabes 2007)).

      Visual shift, as it is called in Simani et al., 2007, is irrelevant for our task here. The data we are modeling are motor adaptation (hand position changes) and so-called proprioceptive changes (hand localization changes), both of which are measured and referenced in extrinsic coordinates, not referenced to a visual target. For instance, the proprioceptive changes are either relative to the actual hand location (Exp 3) or relative to the goal (Fig 4A). We also don’t think visual shift is necessary in explaining the perceptual judgment of an unseen hand (the target shown during the judgment indeed has an effect of reducing the biasing effect of PE, see below for responses to reviewer 3).

      In the PEA model, the reported hand angle is the result of integrating cues from the actual hand position and the estimated hand position (x_hand_hat) from previous movements. This integration process leads to the combined reported hand position potentially overshooting or undershooting, depending on the degree of adaptation. It is the changed proprioceptive cue (because the actively moved hand slowly adapted to the error clamp) that leads to the overshoot of the perceived hand position.

      In Results, we now explain these value changes with parentheses. Model details about the mechanisms of cue combination and model predictions can be found in Supplementary Text 1. We believe these detailed explanations can make this apparent.

      (3b) The extent of implicit adaptation in Experiment 2, especially with smaller errors, is unclear. The implicit adaptation function seems to be still increasing, at least by visual inspection. Can the authors comment on this trend, and relatedly, show individual data points that help the reader appreciate the variability inherent to these data?

      Indeed, the adaptation for small errors appears not completely saturated with our designated number of trials. However, this will not affect our model analysis. Our model fitting for PEA and other competing models is done on the time-series of adaptation, not on the saturated adaptation extent (see Fig 3A). Thus, although some conditions might not produce the full range of adaptation, the data are sufficient to constrain the models. We now mention this concern in Results; we also emphasize that the model not only explains the adaptation magnitude (operationally defined as adaptation extent measured at the same time, i.e., the end of the adaptation phase) but also the full learning process.

      In response, we have included individual data points in the revised Figure 3B-D to provide a clear illustration of the extent of implicit adaptation, particularly for small perturbations.

      (3c) The same participants were asked to return for multiple days/experiments. Given that the authors acknowledge potential session effects, with attenuation upon re-exposure to the same rotation (Avraham et al. 2021), how does re-exposure affect the current results? Could the authors provide clarity, perhaps a table, to show shared participants between experiments and provide evidence showing how session order may not be impacting results?

      Thank you for raising the issue of session and re-exposure effects. First, we don’t think Exp1 has an effect on Exp4. Exp1 is a perceptual task and Exp4 is a motor adaptation task. Furthermore, Exp1 used random visual stimuli on both sides, thus it did not lead to any adaptation effect on its own. Second, Exp4 indeed had three sessions performed on three days, but the session effect does not change our main conclusion about the visual uncertainty. A 3-way repeated-measures ANOVA (3 day x 3 perturbation x 2 visual uncertainty) revealed a significant main effect of day (F(2,36) = 17.693, p<0.001), indicating changes in performance across sessions (see Figure below). Importantly, the effects of perturbation and visual uncertainty (including their interaction) remain the same, and the day factor did not interact with them. The main effect of day shows that the overall adaptation effect is reduced across days. Post-hoc pairwise comparisons showed that single-trial learning (STL) performance on Day 1 was significantly higher than on Day 2 (p = 0.004) and Day 3 (p < 0.001), with no significant difference between Day 2 and Day 3 (p = 0.106). Other ANOVA details: significant main effects for perturbation (F(1,36) = 8.872, p<0.001) and visual uncertainty (F(1,18) = 49.164, p<0.001), as well as a significant interaction between perturbation size and visual uncertainty (F(2,36) = 5.160, p = 0.013). There were no significant interactions involving the day factor with any other factors (all p > 0.182). Thus, the overall adaptation decreases over the days, but the day does not affect the interaction of visual uncertainty and perturbation that concerns us. The fact that their interaction was preserved over different sessions strengthens our conclusion about how visual uncertainty systematically affects implicit adaptation.

      Author response image 2.

      (3d) The number of trials per experiment should be detailed more clearly in the Methods section (e.g., Exp 4). Moreover, could the authors please provide relevant code on how they implemented their computational models? This would aid in future implementation of these models in future work. I, for one, am enthusiastic to build on PEA.

      We have clarified the number of trials conducted in each experiment, with detailed information now readily available in the Methods section of the main text. In addition, we have made the code for data analysis and modeling publicly accessible. These resources can be found in the updated "Data Availability" section of our paper.

      (3f) In addition to predicting a correlation between proprioceptive shift and implicit adaptation on a group level, both PReMo and PEA (but not causal inference) predict a correlation between individual differences in proprioceptive shift and proprioceptive uncertainty with the extent of implicit adaptation (Tsay, Kim, et al. 2021). Interestingly, shift and uncertainty are independent (see Figures 4F and 6C in Tsay et al, 2021). Does PEA also predict independence between shift and uncertainty? It seems like PEA does predict a correlation.

      Thank you for addressing this insightful question. Our PEA model indeed predicts a positive correlation (although not linear) between the proprioceptive uncertainty and the amplitude of the estimated hand position (x_hand_hat). This prediction is consistent with the simulations conducted, using the same parameters that were applied to generate the results depicted in Figure 4B of our manuscript (there is a sign flip as x_hand_hat is negative).

      Author response image 3.

      Regarding the absence of a correlation observed in Tsay et al., 2021, we offer several potential explanations for this discrepancy. First, the variability observed in passive hand localization during motor adaptation (as in Tsay et al., 2021) does not directly equal proprioceptive uncertainty, which typically requires psychophysical testing to accurately assess. Second, our study showed that the proprioceptive bias attenuates during the repetitive measurements; in our Exp3, it decreased within a block of three trials. We noticed that the Tsay et al., 2021 study used 36 measurements in a row without interleaving adaptation trials. Thus, the “averaged” proprioceptive bias in Tsay’s study might not reflect the actual bias during adaptation. We also noticed that that study showed large individual differences in both proprioceptive bias and proprioceptive variability (not uncertainty); thus, getting a positive result, if it were really there, would require a large number of participants, probably larger than their n=30ish sample size. These putative explanations are not included in the revision, which already has a long discussion and has no space for discussing a null result.

      Reviewer #2 (Public Review):

      Summary:

      The authors present the Perceptual Error Adaptation (PEA) model, a computational approach offering a unified explanation for behavioral results that are inconsistent with standard state-space models. Beginning with the conventional state-space framework, the paper introduces two innovative concepts. Firstly, errors are calculated based on the perceived hand position, determined through Bayesian integration of visual, proprioceptive, and predictive cues. Secondly, the model accounts for the eccentricity of vision, proposing that the uncertainty of cursor position increases with distance from the fixation point. This elegantly simple model, with minimal free parameters, effectively explains the observed plateau in motor adaptation under the implicit motor adaptation paradigm using the error-clamp method. Furthermore, the authors experimentally manipulate visual cursor uncertainty, a method established in visuomotor studies, to provide causal evidence. Their results show that the adaptation rate correlates with perturbation sizes and visual noise, uniquely explained by the PEA model and not by previous models. Therefore, the study convincingly demonstrates that implicit motor adaptation is a process of Bayesian cue integration.

      Strengths:

      In the past decade, numerous perplexing results in visuomotor rotation tasks have questioned their underlying mechanisms. Prior models have individually addressed aspects like aiming strategies, motor adaptation plateaus, and sensory recalibration effects. However, a unified model encapsulating these phenomena with a simple computational principle was lacking. This paper addresses this gap with a robust Bayesian integration-based model. Its strength lies in two fundamental assumptions: motor adaptation is influenced by visual eccentricity, a well-established vision science concept, and sensory estimation proceeds through Bayesian integration. By merging these well-founded principles, the authors elucidate previously incongruent and diverse results with an error-based update model. The incorporation of cursor feedback noise manipulation provides causal evidence for their model. The use of eye-tracking in their experimental design, and the analysis of adaptation studies based on estimated eccentricity, are particularly elegant. This paper makes a significant contribution to visuomotor learning research.

      Weaknesses:

      The paper provides a comprehensive account of visuomotor rotation paradigms, addressing incongruent behavioral results with a solid Bayesian integration model. However, its focus is narrowly confined to visuomotor rotation, leaving its applicability to broader motor learning paradigms, such as force field adaptation, saccadic adaptation, and de novo learning paradigms, uncertain. The paper's impact on the broader fields of neuroscience and cognitive science may be limited due to this specificity. While the paper excellently demonstrates that specific behavioral results in visuomotor rotation can be explained by Bayesian integration, a general computational principle, its contributions to other motor learning paradigms remain to be explored. The paper would benefit from a discussion on the model's generality and its limitations, particularly in relation to the undercompensating effects in other motor learning paradigms.

      Thank you for your thoughtful review and recognition of the contributions our work makes towards understanding implicit motor adaptation through the Perceptual Error Adaptation (PEA) model. We appreciate your suggestion to broaden the discussion about the model's applicability beyond the visuomotor rotation paradigm, a point we acknowledge was not sufficiently explored in our initial discussion.

      Our model is not limited to error-clamp adaptation, where the participants were explicitly told to ignore the rotated cursor. The error-clamp paradigm is one rare example in which implicit motor learning can be isolated in a nearly ideal way. Our findings thus imply two key aspects of implicit adaptation: 1) localizing one’s effector is implicitly processed and continuously used to update the motor plan; 2) Bayesian cue combination is at the core of integrating movement feedback and motor-related cues (motor prediction cue in our model) when forming procedural knowledge for action control.

      We propose that the same two principles should be applied to various kinds of motor adaptation and motor skill learning, which together constitute motor learning in general. Most of our knowledge about motor adaptation is from visuomotor rotation, prism adaptation, force field adaptation, and saccadic adaptation. The first three types all involve localizing one’s effector under the influence of perturbed sensory feedback, and they also have implicit learning. We believe they can be modeled by variants of our model, or at least that the two principles we laid out above should be considered when thinking about their computational nature. For skill learning, especially for de novo learning, the area still lacks a fundamental computational model that accounts for the skill acquisition process on the level of relevant movement cues. Our model suggests a promising route, i.e., repetitive movements with a Bayesian cue combination of movement-related cues might underlie the implicit process of motor skills.

      We added more discussion on the possible broad implications of our model in the revision.

      Reviewer #3 (Public Review):

      Summary

      In this paper, the authors model motor adaptation as a Bayesian process that combines visual uncertainty about the error feedback, uncertainty about proprioceptive sense of hand position, and uncertainty of predicted (=planned) hand movement with a learning and retention rate as used in state space models. The model is built with results from several experiments presented in the paper and is compared with the PReMo model (Tsay, Kim, et al., 2022) as well as a cue combination model (Wei & Körding, 2009). The model and experiments demonstrate the role of visual uncertainty about error feedback in implicit adaptation.

      In the introduction, the authors notice that implicit adaptation (as measured in error-clamp-based paradigms) does not saturate at larger perturbations, but decreases again (e.g. Morehead et al., 2017 shows no adaptation at 135° and 175° perturbations). They hypothesized that visual uncertainty about cursor position increases with larger perturbations since the cursor is further from the fixated target. This could decrease the importance assigned to visual feedback which could explain lower asymptotes.

      The authors characterize visual uncertainty for 3 rotation sizes in the first experiment, and while this experiment could be improved, it is probably sufficient for the current purposes. Then the authors present a second experiment where adaptation to 7 clamped errors is tested in different groups of participants. The models' visual uncertainty is set using a linear fit to the results from experiment 1, and the remaining 4 parameters are then fit to this second data set. The 4 parameters are 1) proprioceptive uncertainty, 2) uncertainty about the predicted hand position, 3) a learning rate, and 4) a retention rate. The authors' Perceptual Error Adaptation model ("PEA") predicts asymptotic levels of implicit adaptation much better than both the PReMo model (Tsay, Kim et al., 2022), which predicts saturated asymptotes, or a causal inference model (Wei & Körding, 2007) which predicts no adaptation for larger rotations. In a third experiment, the authors test their model's predictions about proprioceptive recalibration, but unfortunately, compare their data with an unsuitable other data set. Finally, the authors conduct a fourth experiment where they put their model to the test. They measure implicit adaptation with increased visual uncertainty, by adding blur to the cursor, and the results are again better in line with their model (predicting overall lower adaptation) than with the PReMo model (predicting equal saturation but at larger perturbations) or a causal inference model (predicting equal peak adaptation, but shifted to larger rotations). In particular, the model fits experiment 2 and the results from experiment 4 show that the core idea of the model has merit: increased visual uncertainty about errors dampens implicit adaptation.
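
      The structure summarized above (a visual uncertainty that grows linearly with clamp size, plus proprioceptive uncertainty, prediction uncertainty, a learning rate, and a retention rate) can be sketched as a short simulation. This is only an illustration under stated assumptions: the trial-by-trial update rule, the sign convention, the function name `simulate_pea`, and every parameter value below are hypothetical and are not the fitted values or the published implementation.

```python
def simulate_pea(clamp_angle, n_trials=300, slope=0.3, intercept=1.5,
                 sigma_p=10.0, sigma_u=5.0, retention=0.97,
                 learning_rate=0.2):
    """Simulate hand direction across error-clamp trials (degrees).

    Assumed ingredients (all hypothetical):
    - visual uncertainty is linear in clamp size: slope * |angle| + intercept
    - perceived hand direction is the precision-weighted mean of the visual
      cue (cursor at clamp_angle), the proprioceptive cue (actual hand), and
      the predictive cue (target direction, 0)
    - state-space update: next hand = retention * hand - rate * perceived
    """
    sigma_v = slope * abs(clamp_angle) + intercept
    prec_v = sigma_v ** -2
    prec_p = sigma_p ** -2
    prec_u = sigma_u ** -2
    total = prec_v + prec_p + prec_u

    hand = 0.0
    trajectory = []
    for _ in range(n_trials):
        # Perceived hand direction relative to the target (the perceptual error).
        perceived = (prec_v * clamp_angle + prec_p * hand + prec_u * 0.0) / total
        # Retain part of the previous state and correct against the perceived error.
        hand = retention * hand - learning_rate * perceived
        trajectory.append(hand)
    return trajectory


# Late adaptation for a small, a medium, and a large clamp (hypothetical sizes).
late = {angle: simulate_pea(angle)[-1] for angle in (4.0, 16.0, 95.0)}
print(late)
```

      With these assumed values, the late adaptation is largest for the intermediate clamp and smaller for both the small and the large clamp, which is the qualitative non-monotonic pattern the reviews discuss; other parameter choices would shift where the peak falls.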

      Strengths

      In this study, the authors propose a Perceptual Error Adaptation model ("PEA") and the work combines various ideas from the field of cue combination, Bayesian methods, and new data sets, collected in four experiments using various techniques that test very different components of the model. The central component of visual uncertainty is assessed in the first experiment. The model uses 4 other parameters to explain implicit adaptation. These parameters are 1) learning and 2) retention rate, as used in popular state space models, and the uncertainty (variance) of 3) predicted and 4) proprioceptive hand position. In particular, the authors observe that asymptotes for implicit learning do not saturate, as claimed before, but decrease again when rotations are very large and that this may have to do with visual uncertainty (e.g. Tsay et al., 2021, J Neurophysiol 125, 12-22). The final experiment confirms predictions of the fitted model about what happens when visual uncertainty is increased (overall decrease of adaptation). By incorporating visual uncertainty depending on retinal eccentricity, the predictions of the PEA model for very large perturbations are notably different from and better than, the predictions of the two other models it is compared to. That is, the paper provides strong support for the idea that visual uncertainty of errors matters for implicit adaptation.

      Weaknesses

      Although the authors don't say this, the "concave" function that shows that adaptation does not saturate for larger rotations has been shown before, including in papers cited in this manuscript.

      The first experiment, measuring visual uncertainty for several rotation sizes in error-clamped paradigms has several shortcomings, but these might not be so large as to invalidate the model or the findings in the rest of the manuscript. There are two main issues we highlight here. First, the data is not presented in units that allow comparison with vision science literature. Second, the 1 second delay between the movement endpoint and the disappearance of the cursor, and the presentation of the reference marker, may have led to substantial degradation of the visual memory of the cursor endpoint. That is, the experiment could be overestimating the visual uncertainty during implicit adaptation.

      The paper's third experiment relies to a large degree on reproducing patterns found in one particular paper, where the reported hand positions - as a measure of proprioceptive sense of hand position - are given and plotted relative to an ever-present visual target, rather than relative to the actual hand position. That is, 1) since participants actively move to a visual target, the reported hand positions do not reflect proprioception, but mostly the remembered position of the target participants were trying to move to, and 2) if the reports are converted to a difference between the real and reported hand position (rather than the difference between the target and the report), those would be on the order of ~20°, which is roughly two times larger than any previously reported proprioceptive recalibration, and an order of magnitude larger than what the authors themselves find (1-2°) and what their model predicts. Experiment 3 is perhaps not crucial to the paper, but it nicely provides support for the idea that proprioceptive recalibration can occur with error-clamped feedback.

      Perhaps the largest caveat to the study is that it assumes that people do not look at the only error feedback available to them (and can explicitly suppress learning from it). This was probably true in the experiments used in the manuscript, but unlikely to be the case in most of the cited literature. Ignoring errors and suppressing adaptation would also be a disastrous strategy to use in the real world, such that our brains may not be very good at this. So the question remains to what degree - if any - the ideas behind the model generalize to experiments without fixation control, and more importantly, to real-life situations.

      Specific comments:

      A small part of the manuscript relies on replicating or modeling the proprioceptive recalibration in a study we think does NOT measure proprioceptive recalibration (Tsay, Parvin & Ivry, JNP, 2020). In this study, participants reached for a visual target with a clamped cursor, and at the end of the reach were asked to indicate where they thought their hand was. The responses fell very close to the visual target both before and after the perturbation was introduced. This means that the difference between the actual hand position, and the reported/felt hand position gets very large as soon as the perturbation is introduced. That is, proprioceptive recalibration would necessarily have roughly the same magnitude as the adaptation displayed by participants. That would be several times larger than those found in studies where proprioceptive recalibration is measured without a visual anchor. The data is plotted in a way that makes it seem like the proprioceptive recalibration is very small, as they plot the responses relative to the visual target, and not the discrepancy between the actual and reported hand position. It seems to us that this study mostly measures short-term visual memory (of the target location). What is astounding about this study is that the responses change over time to begin with, even if only by a tiny amount. Perhaps this indicates some malleability of the visual system, but it is hard to say for sure.

      Regardless, the results of that study do not form a solid basis for the current work and they should be removed. We would recommend making use of the dataset from the same authors, who improved their methods for measuring proprioception shifts just a year later (Tsay, Kim, Parvin, Stover, and Ivry, JNP, 2021). Although here the proprioceptive shifts during error-clamp adaptation (Exp 2) were tiny, and not quite significant (p<0.08), the reports are relative to the actual location of the passively placed unseen hand, measured in trials separate from those with reach adaptation and therefore there is no visual target to anchor their estimates to.

      Experiment 1 measures visual uncertainty with increased rotation size. The authors cite relevant work on this topic (Levi & Klein etc) which has found a linear increase in uncertainty of the position of more and more eccentrically displayed stimuli.

      First, this is a question where the reported stimuli and effects could greatly benefit from comparisons with the literature in vision science, and the results might even inform it. In order for that to happen, the units for the reported stimuli and effects should (also) be degrees of visual angle (dva).

      As far as we know, all previous work has investigated static stimuli, whereas with moving stimuli, position information from several parts of the visual field is likely integrated over time into a final estimate of position at the end of the trajectory (a Kalman filter type process perhaps). As far as we know, there are no studies in vision science on the uncertainty of the endpoint of moving stimuli. So we think that the experiment is necessary for this study, but there are some areas where it could be improved.

      Then, the linear fit is done in the space of the rotation size, but not in the space of eccentricity relative to fixation, and these do not necessarily map onto each other linearly. If we assume that the eye-tracker and the screen were at the closest distance at which the manufacturer reports it to work accurately (45 cm), we would get the largest distances between the endpoints and fixation in dva. Based on that assumed distance between the participant and monitor, we converted the rotation angles to distances between fixation and the cursor endpoint in degrees of visual angle: 0.88, 3.5, and 13.25 dva (ignoring screen curvature, or the absence of it). The ratio between retinal distance to the endpoint and perturbation angle is roughly 0.221, 0.221, and 0.207 if the minimum distance is indeed used - which is probably fine in this case. But still, it would be better to do the fit in the relevant perceptual coordinate system.
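
      The conversion described above can be sketched as follows. The sketch assumes the 45 cm viewing distance mentioned above, fixation at the target, a flat screen, and a reach radius of 10 cm. The radius is not stated in this text; it is a hypothetical value chosen because it roughly reproduces the figures quoted above.

```python
import math

VIEWING_DISTANCE_CM = 45.0  # assumed minimum eye-tracker distance (from the text)
REACH_RADIUS_CM = 10.0      # hypothetical: not stated in the text


def rotation_to_dva(rotation_deg,
                    radius_cm=REACH_RADIUS_CM,
                    distance_cm=VIEWING_DISTANCE_CM):
    """Visual angle between the fixated target and the clamped cursor endpoint."""
    # Target and cursor endpoint lie on a circle around the start position,
    # so their on-screen separation is the chord subtending the rotation angle.
    chord_cm = 2.0 * radius_cm * math.sin(math.radians(rotation_deg) / 2.0)
    # Flat-screen approximation with one end of the chord at fixation.
    return math.degrees(math.atan(chord_cm / distance_cm))


for rotation in (4, 16, 64):
    dva = rotation_to_dva(rotation)
    print(f"{rotation:>2} deg rotation -> {dva:.2f} dva (ratio {dva / rotation:.3f})")
```

      Under these assumptions the ratio of eccentricity to rotation angle shrinks slightly at the largest rotation, which is why a fit that is linear in rotation angle need not be exactly linear in eccentricity.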

      The first distance (4 deg rotation; 0.88 dva offset between fixation and stimulus) is so close to fixation (even at the assumed shortest distance between eye and screen) that it can be considered foveal and falls within the range of noise of eye-trackers plus that of the eye when fixating. There should be no uncertainty on or that close to the fovea. The variability in the data is likely just measurement noise. This also means that a linear fit will almost always go through this point, somewhat skewing the results toward linearity. The advantage is that the estimate of the intercept (measurement noise) is going to be very good. Unfortunately, there are only 2 other points measured, which (if used without the closest point) will always support a linear fit. Therefore, the experiment does not seem suitable to test linearity, only to characterize it, which might be sufficient for the current purposes. We'd understand if testing linearity with many more rotations requires too much effort. But then it should be made much clearer that the experiment assumes linearity and only serves to characterize the assumed linearity.

      Final comment after the consultation session:

      There were a lot of discussions about the actual interpretation of the behavioral data from this paper with regards to past papers (Tsay et al. 2020 or 2021), and how it matches the different variables of the model. The data from Tsay 2020 combined both proprioceptive information (Xp) and prediction about hand position (Xu) because it involves active movements. On the other hand, Tsay et al. 2021 is based on passive movements and could provide a better measure of Xp alone. We would encourage you to clarify how each of the variables used in the model is mapped onto the outcomes of the cited behavioral experiments.

      The reviewers discussed this point extensively during the consultation process. The results reported in the Tsay 2020 study reflect both proprioception and prediction. However, having a visual target contributes more than just prediction: it is likely an anchor in the workspace that draws the response to it, such that the report is dominated by short-term visual memory of the target (which is not part of the model). In the current Exp 3, by contrast, as in most other work investigating proprioception, this is calculated relative to the actual direction.

      The solution is fairly simple. In Experiment 3 in the current study, Xp is measured relative to the hand without any visual anchors drawing responses, and this is also consistent with the reference used in the Tsay et al 2021 study and in many studies from the lab of D. Henriques (none of which have any visual reach target when measuring proprioceptive estimates). So we suggest using a different data set that also measures Xp without any other influences, such as the data from Tsay et al 2021 instead.

      These issues with the data are not superficial and cannot be solved within the model. Data with correctly measured biases (relative to the hand) that are not dominated by irrelevant visual attractors would actually be informative about the validity of the PEA model. Dr. Tsay has so much other data that we recommend using a more to-the-point data set that could actually validate the PEA model.

      As the comments are repetitive in some places, we summarize them into three questions and address them one by one below:

      (1) Methodological concerns about visual uncertainty estimation in Experiment 1: a) the visual uncertainty is measured in movement angles (degrees), while the unit in vision science is visual angle (dva). This mismatch of units hinders direct comparison between the measured visual uncertainty and that reported in the literature; b) a 1-second delay between the movement endpoint and the reference marker presentation causes an overestimate of visual uncertainty due to potential degradation of visual memory; c) the linear function of visual uncertainty is a result of having only three perturbation sizes.

      a) As noted by the reviewer, our visual uncertainty is about cursor motion direction in the display plane, which has never been measured before. We do not think our data is comparable to any findings in vision science about foveal/peripheral comparison. We cited the work of Klein and others (Klein & Levi, 1987; Levi et al., 1987) in vision science since their studies showed that deviation from fixation is associated with an increase in visual uncertainty. Their studies thus inspired our Exp1 to probe how the visual uncertainty of interest (specifically for visual motion direction) changes with an increasing deviation from fixation. We believe that any model and its parameters should be specifically tailored to the task or context it tries to emulate. In our case, motion direction in a center-out reaching setting is the modeled context, and all the relevant model parameters should be specified in movement angles.

      b) The 1s delay of the reference cursor appears to have minimal impact on the estimate of visual uncertainty, based on previous vision studies. Our Exp1 used a visual paradigm similar to that of White et al., 1992, which shows that delay does not lead to an increase in visual uncertainty over a broad range of values (from 0.2s to >1s, see their Figures 5-6). We will add more methodological justification in our revision.

      c) We agree that if more angles are tested we can be more confident about the linearity of visual uncertainty. However, the linear function is a good approximation of visual uncertainty (as shown in Figure 2C). More importantly, our model performance does not hinge on a strict linear function. Say, if it is a power function with an increasing slope, our model will still predict the major findings presented in the paper, as correctly pointed out by the reviewer. It is the increasing trend of visual uncertainty, which is completely overlooked by previous studies, that leads to various seemingly puzzling findings in implicit adaptation. Lastly, without assuming a linear function, we fitted the large dataset of motor adaptation from Exp2 to numerically estimate the visual uncertainty. This estimated visual uncertainty has a strong linear relationship with perturbation size (R = 0.991, p<0.001). In fact, the model-fitted visual uncertainty is very close to the values we obtained in Exp1. We have now included this analysis in the revision. See details in Supplementary text 2 and Figure S7.

      (2) Experiment 3: the reviewer argues that the Tsay et al., 2020 data does not accurately measure proprioceptive recalibration, and thus is not suitable for showing our model’s capacity to explain proprioceptive changes during adaptation.

      Response: We agree that the data from Tsay et al., 2020 is not from passive localization, which is widely accepted as the method for measuring proprioceptive recalibration, a recalibration effect in the sensory domain. Active localization, as used in Tsay et al., 2020, is hypothesized to be closely related to people’s forward prediction (where people want to go, as the reviewer put it in the comments). However, we want to emphasize that we never equated Tsay’s findings with proprioceptive recalibration: throughout the paper we call them “reported hand location”. We reserved “proprioceptive recalibration” for our own Exp3, which used a passive localization method. Thus, we have not misused this term. Secondly, as far as we know, localization bias or changes, whether measured by passive or active methods, have not been formally modeled quantitatively. We believe our model can explain both, at least in the error-clamp adaptation setting here. Exp3 is for passive localization: the proprioceptive bias is caused by the biasing effect of the just-perceived hand location (X_hand_hat) from the adaptation trial. The Tsay et al., 2020 data is for active localization, whose bias shows a characteristic change from negative to positive. This can be explained by the just-perceived hand location (X_hand_hat again) and a gradually adapting hand (X_p). We think this is a significant advance in the realm of proprioceptive changes in adaptation. Of course, our idea can be further tested in other task conditions, e.g., conventional visuomotor rotation or even gain adaptation, which should be left for future studies.

      On the technical side, the Tsay et al., 2020 data set is not ideal: when reporting hand location, the participants view the reporting wheel as well as the original target. As correctly pointed out by the reviewer, the presence of the target might provide an anchoring cue for perceptual judgment, which acts as an attractor for localization. If that were the case, our cue combination would predict that this extra attractor effect would lead to a smaller proprioceptive effect than that currently reported in their paper. The initial negative bias will be closer to the target (zero), and the later positive bias will be closer to the target too. However, the main trend will remain, i.e. the reported hand location would still show the characteristic negative-to-positive change. The attractor effect of the target can be readily modeled by giving less weight to the just-perceived hand location (X_hand_hat). Thus, we would like to keep the Tsay et al., 2020 data in our paper but add some explanation of the limitations of this dataset as well as how the model would fare with these limitations.

      That being said, our model can explain both passive and active localization during implicit adaptation elicited by error clamp. The dataset from the Tsay et al., 2021 paper is not a good substitute for their 2020 paper in terms of modeling, since that study interleaved some blocks of passive localization trials with adaptation trials. This kind of block design would lead to forgetting of both adaptation (Xp in our model) and the perceived hand (X_hand_hat in our model); the latter is not yet considered in our model. As our Exp3, which also used passive localization, shows, the influence of the perceived hand on proprioceptive bias is short-lived, lasting up to three trials without adaptation trials. Of course, it would be of great interest to design future studies to examine how the proprioceptive bias changes over time, and how its temporal changes relate to the perceptual error. Our model provides a testbed to move forward in this direction.

      (3) The reviewer raises concerns about the study's assumption that participants ignore error feedback, questioning the model's applicability to broader contexts and real-world scenarios where ignoring errors might not be viable or common.

      Reviewer 2 raised the same question above. We moved our responses here. “We appreciate your suggestion to broaden the discussion about the model's applicability beyond the visuomotor rotation paradigm, a point we acknowledge was not sufficiently explored in our initial discussion.

      Our model is not limited to error-clamp adaptation, where the participants were explicitly told to ignore the rotated cursor. The error-clamp paradigm is a rare example in which implicit motor learning can be isolated in a nearly ideal way. Our findings thus imply two key aspects of implicit adaptation: 1) localizing one’s effector is implicitly processed and continuously used to update the motor plan; 2) Bayesian cue combination is at the core of integrating movement feedback and motor-related cues (the motor prediction cue in our model) when forming procedural knowledge for action control.

      We will propose that the same two principles should be applied to various kinds of motor adaptation and motor skill learning, which constitutes motor learning in general. Most of our knowledge about motor adaptation is from visuomotor rotation, prism adaptation, force field adaptation, and saccadic adaptation. The first three types all involve localizing one’s effector under the influence of perturbed sensory feedback, and they also have implicit learning. We believe they can be modeled by variants of our model, or at least should consider using the two principles we laid out above to think of their computational nature. For skill learning, especially for de novo learning, the area still lacks a fundamental computational model that accounts for skill acquisition process on the level of relevant movement cues. Our model suggests a promising route, i.e., repetitive movements with a Bayesian cue combination of movement-related cues might underlie the implicit process of motor skills.”

      We also add one more important implication of our model: as stated above, our model also explains that the proprioceptive changes, revealed by active or passive localization methods, are brought by (mis)perceived hand localization via Bayesian cue combination. This new insight, though only tested here using the error-clamp paradigm, can be further utilized in other domains, e.g., conventional visuomotor rotation or force field adaptation. We hope this serves as an initial endeavor in developing some computational models for proprioception studies. Please see the extended discussion on this matter in the revision.
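
      To make the core mechanism concrete, a minimal sketch follows. It assumes inverse-variance (Bayesian) weighting of the three cues, a motor prediction cue at the target direction (0), a trial-by-trial update with a retention rate A and a learning rate B, and a visual uncertainty that grows linearly with clamp size. All function names, clamp sizes, and numerical values are hypothetical choices for illustration, not the fitted parameters or the exact equations of the paper.

```python
def perceived_hand(x_u, x_p, x_v, sigma_u, sigma_p, sigma_v):
    """Inverse-variance weighted estimate of hand direction from three cues."""
    precisions = (1 / sigma_u**2, 1 / sigma_p**2, 1 / sigma_v**2)
    total = sum(precisions)
    w_u, w_p, w_v = (p / total for p in precisions)
    return w_u * x_u + w_p * x_p + w_v * x_v


def simulate_clamp(theta, n_trials=500, A=0.97, B=0.2,
                   sigma_u=5.0, sigma_p=10.0,
                   sigma_v_intercept=2.0, sigma_v_slope=0.3):
    """Hand direction after n_trials of error-clamp feedback at angle theta."""
    # Assumed: visual uncertainty increases linearly with clamp size.
    sigma_v = sigma_v_intercept + sigma_v_slope * abs(theta)
    x_p = 0.0  # hand direction relative to the target (degrees)
    for _ in range(n_trials):
        x_hat = perceived_hand(0.0, x_p, theta, sigma_u, sigma_p, sigma_v)
        # The perceived hand error drives adaptation in the opposite direction.
        x_p = A * x_p - B * x_hat
    return x_p


# Illustrative clamp sizes (degrees); adaptation is reported as a magnitude.
adaptation = {theta: -simulate_clamp(theta) for theta in (2, 8, 16, 32, 64, 95)}
for theta, extent in adaptation.items():
    print(f"clamp {theta:>2} deg -> adaptation {extent:.2f} deg")
```

      With these hypothetical values the simulated adaptation first rises and then falls as the clamp grows, because the weight on the visual cue shrinks faster than the clamp size increases. Fixing sigma_v at a constant instead makes the simulated adaptation grow monotonically with clamp size.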

      Recommendations for the authors:

      Revisions:

      All three reviewers were positive about the work and have provided a set of concrete and well-aligned suggestions, which the authors should address in a revised version of the article. These are listed below.

      A few points of particular note:

      (1) There are a lot of discussions about the actual interpretation of behavioral data from this paper or past papers (Tsay et al. 2020 or 2021) and how it matches the different variables of the model.

      (2) There are some discussions on the results of the first experiment, both in terms of how it is reported (providing degrees of visual angle) and how it is different than previous results (importance of the point of fixation). We suggest also discussing a few papers on eye movements during motor adaptation from the last years (work of Anouk de Brouwer and Opher Donchin). Could the authors also discuss why they found opposite results to that of previous visual uncertainty studies (i.e., visual uncertainty attenuates learning with large, but not small, visual errors); rather than the other way around as in Burge et al and Tsay et al 2021 and Makino Nozaki 2023 (where visual uncertainty attenuates small, but not large, visual errors).

      (3) It is recommended by several reviewers to discuss the applicability of the model to other areas/perturbations.

      (4) Several reviewers and I believe that the impact of the paper would be much higher if the code to reproduce all the simulations of the model is made available to the readers. In addition, while I am very positive about the fact that the authors shared the data of their experiments, metadata seems to be missing while they are highly important because these data are otherwise useless.

      Thank you for the concise summary of the reviewers’ comments. We have addressed their concerns point by point.

      Reviewer #2 (Recommendations For The Authors):

      L142: The linear increase in visual uncertainty should be substantiated by previous research in vision science. Please cite relevant papers and discuss why the linear model is considered reasonable.

      We cited relevant studies in vision science. Their focus is more on how eccentricity inflates visual uncertainty, similar to our finding that deviations from the fixation direction inflate visual uncertainty about motion direction.

      We also want to add that our model performance does not hinge on a strict linear function of visual uncertainty. Say, if it is a power function with an increasing slope, our model will still predict the major findings presented in the paper. It is the increasing trend of visual uncertainty, which is completely overlooked by previous studies, that leads to various seemingly puzzling findings in implicit adaptation. Furthermore, without assuming a linear function, we fitted the large dataset of motor adaptation from Exp2 to numerically estimate the visual uncertainty. This estimated visual uncertainty has a strong linear relationship with perturbation size (R = 0.991, p<0.001). In fact, the model-fitted visual uncertainty is very close to the values we obtained in Exp1. We have now included this new analysis in the revision. See details in Supplementary text 2 and Figure S7.

      L300: I found it challenging to understand the basis for this conclusion. Additional explanatory support is required.

      We unpacked this concluding sentence as follows:

      “The observed proprioceptive bias is formally modeled as a result of the biasing effect of the perceived hand estimate x_hand_hat. In our mini-block of passive localization, the participants neither actively moved nor received any cursor perturbations for three trials in a row. Thus, the fact that the measured proprioceptive bias is reduced to nearly zero at the third trial suggests that the effect of perceived hand estimate x_hand_hat decays rather rapidly.”

      L331: For the general reader, a visual representation of what the blurring mask looks like would be beneficial.

      Thanks for the nice suggestion. We added pictures of a clear and a blurred cursor in Figure 5D.

      L390: This speculation is intriguing. It would be helpful if the authors explained why they consider causal inference to operate at an explicit process level, as the reasoning is not clear here, although the idea seems plausible.

      Indeed, our tentative conclusion here is based only on the model comparison results. It is still possible that causal inference also works for implicit adaptation in addition to explicit adaptation. We make a more modest conclusion in the revision:

      “The causal inference model is also based on the Bayesian principle, so why does it fail to account for implicit adaptation? We postulate that the failure of the causal inference model is due to its neglect of visual uncertainty as a function of perturbation size, as we revealed in Experiment 1. In fact, previous studies advocating the Bayesian principle in motor adaptation have largely focused on experimentally manipulating sensory cue uncertainty to observe its effects on adaptation (Burge et al., 2008; He et al., 2016; Körding & Wolpert, 2004; Wei & Körding, 2010), similar to our Experiment 4. Our findings suggest that causal inference of perturbation alone, without incorporating visual uncertainty, cannot fully account for the diverse findings in implicit adaptation. The increase in visual uncertainty with perturbation size is substantial: our Experiment 1 yielded an approximately seven-fold increase from a 4° perturbation to a 64° perturbation. We have attributed this to the fact that people fixate in the desired movement direction during movements. Interestingly, even in the conventional visuomotor rotation paradigm, where people are required to “control” the perturbed cursor, their fixation is also on the desired direction, not on the cursor itself (de Brouwer, Albaghdadi, et al., 2018; de Brouwer, Gallivan, et al., 2018). Thus, we postulate a similar hike in visual uncertainty in other “free-viewing” perturbation paradigms. Future studies are warranted to extend our PEA model to account for implicit adaptation in other perturbation paradigms.”

      L789: The method of estimating Sigma_hand in the brain was unclear. Since Bayesian computation relies on the magnitude of noise, the cognitive system must have estimates of this noise. While vision and proprioception noise might be directly inferred from signals, the noise of the hand could be deduced from the integration of these observations or an internal model estimate. This process of estimating noise magnitude is theorized in recursive Bayesian integration models (or Kalman filtering), where the size estimate of the state noise (sigma_hand) is updated concurrently with the state estimate (x_hand hat). The equation in L789 and the subsequent explanation appear to assume a static model of noise estimation. However, in practice, the noise parameters, including Sigma_hand, are likely dynamic and updated with each new observation. A more detailed explanation of how Sigma_hand is estimated, and of its role in the cognitive process, is needed.

      This is a great comment. In fact, if a Kalman filter is used, the learning rate and the state noise should all be dynamically updated on each trial, under the influence of the observed (x_v). In fact, most adaptation models assume a constant learning rate, including our model here. But a dynamic learning rate (B in our model) is something worth trying. However, in our error-clamp setting, x_v is a constant, thus this observation variable cannot dynamically update the Kalman filter; that’s why we opt to use a “static” Bayesian model to explain our datasets. Thus, Sigma_hand can be estimated by using Bayesian principles as a function of the three cues available, i.e., the proprioceptive cue, the visual cue, and the motor prediction cue. We added a detailed derivation of sigma_hand in the revision in Supplementary text 1.
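
      For orientation, the standard result for combining independent Gaussian cues is sketched below. It assumes the three cues (motor prediction x_u, proprioception x_p, vision x_v) are independent with standard deviations sigma_u, sigma_p, and sigma_v; this notation is an assumption made here for illustration, and the derivation in Supplementary text 1 may differ in detail.

```latex
\hat{x}_{hand} = W_u x_u + W_p x_p + W_v x_v,
\qquad
W_i = \frac{1/\sigma_i^2}{1/\sigma_u^2 + 1/\sigma_p^2 + 1/\sigma_v^2},
\quad i \in \{u, p, v\}

\frac{1}{\sigma_{hand}^2} = \frac{1}{\sigma_u^2} + \frac{1}{\sigma_p^2} + \frac{1}{\sigma_v^2}
```

      Under these assumptions sigma_hand is never larger than the smallest single-cue uncertainty, and it depends on perturbation size only through sigma_v.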

      Reviewer #3 (Recommendations For The Authors):

      We observed values in Fig 2C for the 64-degree perturbation that seem to be outliers, i.e., greater than 50 degrees. It is unclear how a psychometric curve could have a "slope" or JND of over 60, especially considering that the tested range was only 60. Since the data plotted in panel C is a collapse of the signed data in panel B, it is perplexing how such large data points were derived, particularly when the signed uncertainty values do not appear to exceed 30.

      Related to the previous point, we would also recommend connecting individual data points: if the uncertainty increases (linearly or otherwise), then people with low uncertainty at the middle distance should also have low uncertainty at the high distance, and people with high uncertainty at one point, should also have that at other distances. Or perhaps the best way to go about this is to use the uncertainty at the two smaller perturbations to predict uncertainty at the largest perturbation for each participant individually?

      Thank you for your suggestion to examine the consistency of individual levels of visual uncertainty across perturbation sizes. First, a sigma_v of 60 degrees is entirely possible, as it falls naturally out of the experimental data. It shows that some individuals indeed have large visual uncertainty. Given these potential outliers (which should not be readily removed as we don’t have any reason to do so), we estimated the linear function of sigma_v with a robust method, i.e., a GLM with a gamma distribution, which accommodates right-skewed distributions and can thus capture positive outliers. Furthermore, we added in our revision a verification test of our estimates of sigma_v: we used Exp2’s adaptation data to estimate sigma_v without assuming its linear dependency. As shown, the model-fitted sigma_v closely matched the values estimated from Exp1 (see Supplementary text 2 and Figure S7).

      We re-plotted sigma_v with connected data points, and the data clearly indicate that individuals exhibit consistent levels of visual uncertainty across different perturbation sizes, i.e. those with relatively lower uncertainty at middle distances (in fact, angles) tend to exhibit relatively lower uncertainty at higher distances too, and similarly, those with higher uncertainty at one distance maintain that level of uncertainty at other distances. This is confirmed by a Spearman correlation analysis assessing the consistency of uncertainties across perturbation sizes among individuals. Again, we observed significant correlations between perturbation angles, indicating good individual consistency (4 and 16 degrees, rho = 0.759, p<0.001; 16 and 64 degrees, rho = 0.527, p = 0.026).
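
      The consistency check described above can be sketched as follows. The sketch computes Spearman's rho as the Pearson correlation of average ranks, using only the standard library. The per-participant values are made-up placeholders, not data from the study, and the function names are hypothetical.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder sigma_v values (degrees) for five hypothetical participants.
sigma_v_16 = [3.1, 4.0, 5.2, 6.8, 9.5]
sigma_v_64 = [12.0, 10.5, 18.3, 25.1, 40.2]
print(f"rho = {spearman_rho(sigma_v_16, sigma_v_64):.3f}")
```

      A rank-based measure suits this check because it is insensitive to the few very large sigma_v values at the largest perturbation.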

      Author response image 4.

      The illustration in Fig 2A does not seem to show a stimulus that is actually used in the experiment (looks like about -30° perturbation). It would be good to show all possible endpoints with all other visual elements to scale - including the start-points of the PEST procedure.

      Thanks for the suggestion. We updated Fig 2A to show a stimulus of +16 degree, as well as added an additional panel to show all the possible endpoints.

      Finally (related to the previous point), in lines 589-591 it says the target is a blue cross. Then in lines 614-616, it says participants are to fixate the blue cross or the start position. The start position was supposed to have disappeared, so perhaps the blue plus moved to the start position (which could be the case, when looking at the bottom panel in Fig 2A, although in the illustration the plus did not move fully to the start position, just toward it to some degree). Perhaps the descriptions need to be clarified, or it should be explained why people had to make an eye movement before giving their judgments. And if people could have made either 1) no eye movement, but stayed at fixation, 2) moved to the blue plus as shown in the last panel in Fig 2A, or 3) fixated on the home position, we'd be curious to know if this affected participants' judgments.

      Thanks for pointing that out. The blue cross serves as the target in the movement task, then disappears together with the cursor after 800 ms of frozen time. The blue cross then reappears in the discrimination task at the center of the screen, i.e. the start location. Subjects were asked to fixate the blue cross during the visual discrimination task. Note that this return of fixation to the home position is exactly what we see in typical error-clamp adaptation: once the movement is over, people guide their hand back to the home position. We performed a pilot study to record the typical fixation pattern during error-clamp adaptation, and Exp1 was intentionally designed to mimic its fixation sequence. We have now updated the description of Figure 2A, emphasizing the stimulus sequence.

      In Figure 4A, the label "bias" is confusing as that is used for recalibrated proprioceptive sense of hand position as well as other kinds of biases elsewhere in the paper. What seems to be meant is the integrated hand position (x-hat_hand?) where all three signals are apparently combined. The label should be changed and/or it should be clarified in the caption.

      Thanks for pointing that out, it should be x_hand_hat, and we have corrected this in the revised version of Figure 4.

      In the introduction, it is claimed that larger perturbations have not been tested with "implicit adaptation" paradigms, but in the same sentence, a paper is cited (Moorehead et al., 2017) that tests a rotation on the same order of magnitude as the largest one tested here (95°), as well as much larger rotations (135° and 175°). With error-clamps. Interestingly, there is no adaptation in those conditions, which seems more in line with the sensory cue integration model. Can the PEA model explain these results as well? If so, this should be included in the paper, and if not, it should be discussed as a limitation.

      First, we double checked our manuscript and found that we never claimed that larger perturbations had not been tested.

      We agree that it is always good to have as many conditions as possible. However, the 135 and 175 degree conditions would lead to minimal adaptation, which would not help much in terms of model testing. We postulate that this lack of adaptation is simply due to the fact that people cannot see the moving cursor, or to some other unknown reason. Our simple model is not designed to cover those kinds of extreme cases.

      Specify the size of the arc used for the proprioceptive tests in Exp 3 and describe the starting location of the indicator (controlled by the left hand). Ideally, the starting location should have varied across trials to avoid systematic bias.

      Thank you for the comments. The size of the arc used during these tests, as detailed in the methods section of our paper, features a ring with a 10 cm radius centered at the start position. This setup is visually represented as a red arc in Figure 7B.

      After completing each proprioceptive test trial, participants were instructed to position the indicator at approximately -180° on the arc and then relax their left arm. Although the starting location for the subsequent trial therefore remained near -180°, it was not identical for every trial, which introduced slight variability.

      Please confirm that the proprioceptive biases plotted in Fig 4E are relative to the baseline.

      Thank you for bringing this to our attention. Yes, the proprioceptive biases illustrated in Figure 4E are indeed calculated relative to the baseline measurements. We have added this to the Methods section.

      Data availability: the data are available online, but there are some ways this can be improved. First, it would be better to use an open data format, instead of the closed, proprietary format currently used. Second, there is no explanation for what's in the data, other than the labels. (What are the units? What preprocessing was done?) Third, no code is made available, which would be useful for a computational model. Although rewriting the analyses in a non-proprietary language (to increase accessibility) is not a reasonable request at this point in the project, I'd encourage it for future projects. But perhaps Python, R, or Julia code that implements the model could be made available as a notebook of sorts so that other labs could look at (build on) the model starting with correct code - increasing the potential impact of this work.

      Great suggestions. We are fully supportive of open data and open science. We have now:

      (1) Updated our data and code repository to include the experimental data in an open data format (.csv) for broader accessibility.

      (2) Added detailed descriptions of the data to clarify their contents.

      (3) Made the original MATLAB (.m) code for data analysis, model fitting and simulation available online.

      (4) Provided the code in Jupyter Notebook (.ipynb) format as well.

      These updates can be found in the revised “Data Availability” section of our manuscript.

      References

      Bromberg, Z., Donchin, O., & Haar, S. (2019). Eye Movements during Visuomotor Adaptation Represent Only Part of the Explicit Learning. eNeuro, 6(6). https://doi.org/10.1523/ENEURO.0308-19.2019

      Burge, J., Ernst, M. O., & Banks, M. S. (2008). The statistical determinants of adaptation rate in human reaching. Journal of Vision, 8(4), 1–19.

      de Brouwer, A. J., Gallivan, J. P., & Flanagan, J. R. (2018). Visuomotor feedback gains are modulated by gaze position. Journal of Neurophysiology, 120(5), 2522–2531.

      Egly, R., & Homa, D. (1984). Sensitization of the visual field. Journal of Experimental Psychology. Human Perception and Performance, 10(6), 778–793.

      Kim, H. E., Parvin, D. E., & Ivry, R. B. (2019). The influence of task outcome on implicit motor learning. eLife, 8. https://doi.org/10.7554/eLife.39882

      Klein, S. A., & Levi, D. M. (1987). Position sense of the peripheral retina. JOSA A, 4(8), 1543–1553.

      Levi, D. M., Klein, S. A., & Yap, Y. L. (1987). Positional uncertainty in peripheral and amblyopic vision. Vision Research, 27(4), 581–597.

      Makino, Y., Hayashi, T., & Nozaki, D. (2023). Divisively normalized neuronal processing of uncertain visual feedback for visuomotor learning. Communications Biology, 6(1), 1286.

      Owsley, C., Ball, K., & Keeton, D. M. (1995). Relationship between visual sensitivity and target localization in older adults. Vision Research, 35(4), 579–587.

      Simani, M. C., McGuire, L. M. M., & Sabes, P. N. (2007). Visual-shift adaptation is composed of separable sensory and task-dependent effects. Journal of Neurophysiology, 98(5), 2827–2841.

      Tsay, J. S., Avraham, G., Kim, H. E., Parvin, D. E., Wang, Z., & Ivry, R. B. (2021). The effect of visual uncertainty on implicit motor adaptation. Journal of Neurophysiology, 125(1), 12–22.

      Tsay, J. S., Chandy, A. M., Chua, R., Miall, R. C., Cole, J., Farnè, A., Ivry, R. B., & Sarlegna, F. R. (2024). Minimal impact of proprioceptive loss on implicit sensorimotor adaptation and perceived movement outcome. bioRxiv : The Preprint Server for Biology. https://doi.org/10.1101/2023.01.19.524726

      Tsay, J. S., Kim, H., Haith, A. M., & Ivry, R. B. (2022). Understanding implicit sensorimotor adaptation as a process of proprioceptive re-alignment. eLife, 11, e76639.

      Wei, K., Stevenson, I. H., & Körding, K. P. (2010). The uncertainty associated with visual flow fields and their influence on postural sway: Weber’s law suffices to explain the nonlinearity of vection. Journal of Vision, 10(14), 4.

      White, J. M., Levi, D. M., & Aitsebaomo, A. P. (1992). Spatial localization without visual references. Vision Research, 32(3), 513–526.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      The authors identify new mechanisms that link a PIK3R1 mutant to cellular signaling and division in Activated PI3 Kinase Delta Syndrome 1 and 2 (APDS1/2). The conclusion that this mutant serves as a dominant negative form of the protein, impacting PI3K complex assembly and IRS/AKT signaling, is important, and the evidence from constitutive and inducible systems in cultured cells is convincing. Nevertheless, there are several limitations relating to differences between cell lines and expression systems, as well as more global characterization of the protein interaction landscape, which would further enhance the work.

      We are pleased by this fair assessment, while noting that this work relates to APDS2 (PIK3R1-related) rather than APDS1 (PIK3CD-related). We believe our findings are clear, but the point that studies including more global proteomics/phosphoproteomics in cells expressing mutants at endogenous levels would add further insight is well made. We hope that this report may motivate such studies by laboratories with wider access to primary cells from patients and knock-in mice.

      Public Reviews

      Reviewer #1 (Public Review):

      Summary:

      This study provides convincing data showing that expression of the PIK3R1(delta Exon11) dominant negative mutation in Activated PI3K Delta Syndrome 1/2 (APDS1/2) patient-derived cells reduces AKT activation and p110δ protein levels. Using a 3T3-L1 model cell system, the authors show that overexpressed p85α (delta Exon 11) displays reduced association with the p110α catalytic subunit but strongly interacts with Irs1/2. Overexpression of PIK3R1 dominant negative mutants inhibits AKT phosphorylation and reduces cellular differentiation of preadipocytes. The strength of this article is the clear results derived from Western blot analysis of cell signaling markers (e.g. pAKT1), and co-immunoprecipitation of PI3K holoenzyme complexes and associated regulatory factors (e.g. Irs1/2). The experimental design, interpretation, and quantification broadly support the authors' conclusions.

      Strengths:

      The authors analyze a variety of PIK3R1 mutants (i.e. delta Exon11, E489K, R649W, and Y657X), which reveals a range of phenotypes that support the proposed model for dominant negative activity. The use of clonal cell lines with doxycycline-induced expression of the PIK3R1 mutants (ΔExon 11, R649W, and Y657X) provides convincing experimental data concerning the relationship between p85α mutant expression and AKT phosphorylation in vivo. The authors convincingly show that p85α (delta Exon11, R649W, or Y657X) is unable to associate with p110α but instead more strongly associates with Irs1/2 compared to wild type p85α. This helps explain why the authors were unable to purify the recombinant p110α/p85α (delta Exon 11) heterodimeric complex from insect cells.

      Weaknesses:

      Future experimentation will be needed to reconcile the cell type specific differences (e.g. APDS2 patient-derived cells vs. the 3T3-L1 cell model system) in PIK3R1 mutant behavior reported by the authors.

      This is a fair comment. It has been established for many years that relative protein levels even of wild type PIK3CA and PIK3R1 gene products influence sensitivity of PI3K to growth factor stimulation. Such issues of stoichiometry become exponentially more complicated when the numerous potential interactions among the full repertoire of Class 1 PI3K regulatory subunits (3 splice variants of PIK3R1, and also PIK3R2 and PIK3R3) and corresponding catalytic subunits (PIK3CA, PIK3CB, PIK3CD) are considered, and when different activities and stabilities of PIK3R1 mutants are added to the mix. It thus seems obvious to us that different levels of expression of different mutants in different cellular contexts will have different signalling consequences. We establish a paradigm in this paper using an overexpression system, and we strongly agree that this merits further investigation in a wider variety of primary cells (or cells with knock in at the endogenous locus), where available.

      An unbiased proteomic study that broadly evaluates the cell signaling landscape could provide a more holistic understanding of the APDS2 and SHORT mutants compared to a candidate-based approach.

      We agree. This would be highly informative, but we think it would best be carried out in both “metabolic” and “immune” cells with endogenous levels of expression of SHORT or APDS2 PIK3R1 mutants. These are not all currently available to us, and this will require follow-up studies.

      Additional biochemical analysis of p110α/p85α delta Exon 11 complex is needed to explain why this mutant regulatory subunit does not strongly associate with the p110 catalytic subunit.

      We agree. We present this observation in our overexpression system, which is clear and reproducible, even though somewhat surprising. The failure to bind p110α is likely not absolute, as sufficient p110α-p85αΔEx11 was synthesised in vitro in a prior study to permit structural and biochemical studies, although a series of technical workarounds were required to generate enough heterodimeric PI3K to study in vitro given the manifest instability of the complex, particularly when concentrated (PMID 28167755). We already note in discussion that p85α can homodimerize and bind PTEN, likely among other partners, and it may be that the APDS2 deletion strongly favours binding to proteins that effectively compete with p110α. However, this requires further study of the wider interactome of the mutant PIK3R1, which, as noted above, is beyond the scope of the current study.

      It remains unclear why p85α delta Exon 11 expression reduces p110δ protein levels in APDS2 patient-derived dermal fibroblasts.

      We caution that we only had the opportunity to study dermal fibroblasts cultured from a single APDS2 patient, as noted in the paper, and so replication of this finding in future will be of interest. Nevertheless, the observation is robust and reproducible in these cells, and we agree that this apparently selective effect on p110δ is not fully explained. Having said that, it has been observed previously that heterodimers of the ΔEx11 p85α variant with either p110α or p110δ are unstable, and when the unstable complexes were eventually synthesised, p110α and p110δ were demonstrated to show differences in engagement with the mutant p85, with greater disruption of inhibitory interactions observed for p110δ (PMID 28167755). It is thus not a great stretch to imagine that as well as disinhibiting p110δ more, the ΔEx11 p85α variant also destabilises the p85α-p110δ complex more, potentially explaining its near disappearance in cells with low baseline p110δ expression. Following on from the preceding question and response, however, there is an alternative explanation, based on the 3T3-L1 overexpression studies in this paper, wherein we were unable to demonstrate binding of p110α by ΔEx11 p85α. If, in any given cellular context, the mutant p85 could bind p110δ but not p110α, then the destabilising effect would be observed only for p110δ. So in summary, we believe the selective effect on p110δ is explained by differences in binding kinetics and heterodimer stability for different ΔEx11 p85α-containing complexes. The net effect of these differences may vary among cell types depending on relative levels of subunit expression.

      This study would benefit from a more comprehensive biochemical analysis of the described p110α/p85α, p110β/p85α, and p110δ/p85α mutant protein complexes. The study is currently limited to the use of a single endpoint assay to measure PI3K lipid kinase activity in the presence of a single regulatory input (i.e. RTK-derived pY peptide). A broader biochemical analysis of the mutant PI3K complexes across the canonical signaling landscape will be important for establishing how competition between wild-type and mutant regulatory subunits is regulated in different cell signaling pathways.

      We agree that a wider analysis of upstream inputs and downstream network would be of interest, though as noted above the ultimate functional consequences of mutants will be an amalgam of any differential signalling effects of complexes that are stable enough to function, and differential effects of mutant p85α on the kinetics of distinct heterodimer assembly and stability. In this paper we seek to suggest a paradigm worthy of further, deeper assessment. We note that the search space here is large indeed (A. different cell types with differing profiles of PI3K subunit expression, B. multiple upstream stimuli and C. multiple downstream outputs, with timecourse of responses an additional important factor to consider). These studies are realistically beyond the scope of the current work, but we hope that further studies, as suggested by the reviewer, follow.

      Reviewer #2 (Public Review)

      Summary:

      Patsy R. Tomlinson et al. investigated the impact of different p85alpha variants associated with SHORT syndrome or APDS2 on insulin-mediated signaling in dermal fibroblasts and preadipocytes. They find no evidence of hyperactive PI3K signalling monitored by pAKT in APDS2 patient-derived dermal fibroblast cells. In these cells p110alpha protein levels were comparable to levels in control cells; however, the p110delta protein levels were strongly reduced. Remarkably, the truncated APDS2-causal p85alpha variant was less abundant in these cells than p85alpha wildtype. Afterwards, they studied the impact of ectopically expressed p85alpha variants on insulin-mediated PI3K signaling in 3T3-L1 preadipocytes. Interestingly, they found that the truncated APDS2-causal p85alpha variant impaired insulin-induced signaling. Using immunoprecipitation of p110alpha, they did not find the truncated APDS2-causal p85alpha variant in p110alpha precipitates. Furthermore, by immunoprecipitating IRS1 and IRS2, they observed that the truncated APDS2-causal p85alpha variant was very abundant in IRS1 and IRS2 precipitates, even in the absence of insulin stimulation. These important findings add, in an interesting way, a possible mechanistic explanation for the growing number of APDS2 patients described with features of SHORT syndrome.

      Strengths:

      Based on state-of-the-art functional investigation, the authors propose a loss-of-function activity of the APDS2-causing p85alpha variant in preadipocytes, providing a possible mechanistic explanation for the growing number of APDS2 patients described with features of SHORT syndrome.

      Weaknesses:

      Related to Figure 1: PIK3R1 expression not only by Western blotting but also by quantifying the RNA transcripts, e.g. mutant and wildtype transcripts, was not performed. RNA expression analysis would further strengthen the suggested impaired stabilization/binding.

      It is not completely clear to us how further PIK3R1 mRNA analysis would enhance the points we seek to make. Perhaps the reviewer’s point is that changes in protein expression could be explained by reduced transcription rather than having anything to do with altered protein turnover? As shown in Figure 1 supplemental figure 1, sequencing cDNA from each of the primary cell lines studied indicates that both mutant and WT alleles are expressed at or close to 50% of the total mRNA for PIK3CA or PIK3R1 as relevant. While this is not strictly quantitative, allied to prior evidence that these are dominant alleles which require to be expressed to exert their effect, with no evidence for altered mRNA expression of these variants in prior studies, we don’t believe any further quantification of mRNA expression would add value.

      Related to Figure 2

      As mentioned by the authors in the manuscript the expression of p110delta but also p110beta in 3T3-L1 preadipocytes ectopically expressing p85alpha variants has not been analyzed.

      We agree that such determination would have been a useful addition to the study, but regretfully it was not undertaken in these modified 3T3-L1 cells at the time of study. However, independent bulk RNAseq studies of the founder 3T3-L1 cells from which the stably transduced cells were generated, undertaken as part of an unrelated study, revealed the following relative levels of endogenous expression of PI3K subunit mRNA:

      Author response table 1.

      We have not determined endogenous protein expression, and so have left the text of the discussion unchanged, simply noting that we have not formally assessed protein expression of p110δ/p110β. However, these transcriptomic findings suggest that p110δ protein is likely either undetectable, or else present at extremely low levels compared to endogenous p110α. p110β also appears to be expressed at a much lower level than p110α. In our studies overexpressing mutant PIK3R1 and assessing insulin action, we believe we are largely or perhaps entirely assaying the effect of the mutants on p110α, in keeping with the fact that genetic and pharmacological studies have firmly established that it is p110α that is responsible for mediating the metabolic actions of insulin in adipose tissue and preadipocytes including 3T3-L1 (e.g. PMID 16647110). Indeed, to quote from this study, in 3T3-L1 “… inhibitors of p110β (TGX-115 and TGX-286) and p110δ (IC87114 and PIK-23) had no effect on the insulin-stimulated phosphorylation of any protein in the PI3-K pathway.”

      We have added the following sentence to the discussion:

      “The current study has limitations. We have studied primary cells from only a single APDS2 patient, and in the 3T3-L1 cell model, we did not determine whether p110δ protein could be detected. If not, this could explain the lack of detectable AKT phosphorylation with induction of Pik3r1 ΔEx11. Indeed, previous pharmacological studies in 3T3-L1 adipocytes have shown that selective inhibition of p110δ or p110β does not alter insulin-induced phosphorylation of any protein studied in the PI3-K pathway, attesting to the dominance of p110α in insulin action in this cell model (Knight et al, 2006).”

      Furthermore, a direct comparison of the truncated APDS2-causal p85alpha variant with SHORT syndrome-causal p85alpha variants in regard to pAKT level, and p85alpha expression level has not been performed.

      These investigations would further strengthen the data.

      The cell lines conditionally expressing SHORT syndrome variants have been reported already, as cited (PMID: 27766312). Remarkably, the degree of inhibition of insulin-stimulated signalling is actually less pronounced for the SHORT syndrome variants than for the overexpressed APDS2 variant, as seen in the excerpt from the prior paper below. In this prior paper the maximum insulin concentration used, 100nM, was the concentration used in the current study. While overexpression of the APDS2 p85α variant ablated the response to insulin entirely, it is still seen in the prior study, albeit at a clearly reduced level.

      Related to Figure 3

      The E489K and Y657X p85alpha variants should be also tested in combination with p110delta in the PI3K activity in vitro assay. This would help to further decipher the overall impact, especially of the E489K variant.

      We agree that this would make our data more complete, but for logistical reasons (primarily available personnel) we were compelled to constrain the number of p85-p110 combinations we studied. We elected to prioritise the PIK3R1 R649W variant as by far the most common causal SHORT syndrome variant, and as the variant showing the “cleanest” functional perturbation, namely severely impaired or absent ability to dock to phosphotyrosines in cognate proteins. The paradox that we sought to explain in this paper, namely the phenotypic combination of gain-of-function APDS2 with loss-of-function SHORT syndrome features, holds only for APDS2 PIK3R1 variants, and so while it is interesting to document that the canonical SHORT syndrome variant also inhibits PI3Kβ and PI3Kδ activation in vitro, this was not the main purpose of our study.

      Reviewer #1 (Recommendations For The Authors):

      Points of clarification and suggestions for improving the manuscript:

      (1) Explain whether there are any PIK3R1-independent genetic alterations in the APDS2 and PROS-derived cell lines. For example, are there differences in the karyotype of mutant cell lines compared to wild-type cells?

      Karyotypic abnormalities are not an established feature of either PROS or APDS2, and the patients from whom cells were derived were documented to be of normal karyotype. Karyotypic abnormalities acquired during cell culture would not be unprecedented, but confirming normal karyotypes in primary cell lines, where there is no specific reason to suppose any alteration, exceeds normal expectations for primary cell studies, and so this has not been undertaken.

      (2) When introducing the APDS2-associated PIK3R1 mutation (lines 126-128), the authors describe both the exon 11 skipping and in-frame deletions. I recommend rewording this sentence to say exon 11 skipping results in an in-frame deletion of PIK3R1. The current wording makes it seem like APDS2-derived cells contain two genetic perturbations: (1) exon 11 skipping and (2) in-frame deletion. Include a diagram in Figure 1 to help explain the location of the mutations being studied in relationship to the PIK3R1 gene sequence and domains (i.e. nSH2, iSH2, cSH2). The description of the exon 11 skipping and in-frame deletions (lines 126-128) would benefit from having a complementary figure that diagrams the location of these mutations in the PIK3R1 gene.

      On review we agree that clarity of description could be enhanced. We have now edited these lines as follows:

      “We began by assessing dermal fibroblasts cultured from a previously described woman with APDS2 due to the common causal PIK3R1 mutation. This affects a splice donor site and causes skipping of exon 11, leading to an in-frame deletion of 42 amino acids (434-475 inclusive) in the inter-SH2 domain, which is shared by all PIK3R1 isoforms (Patient A.1 in (Lucas et al., 2014b))(Figure 1 figure supplement 1).”

      We have moreover introduced a further figure element including a schematic of all PIK3R1 mutations reported in the current study (new Figure 1 figure supplement 1)

      (3) For Figure 2, I recommend including a cartoon that illustrates the experimental design showing the induced expression of PIK3R1 mutants, R649W and Y657X, in the background of the wild-type endogenous gene expression.

      Such a figure element has now been generated and included as Figure 2 figure supplement 1, duly called out in the results section where appropriate.

      (4) For the data plotted in Figure 1B-1C, please clarify whether the experiments represent a single patient or all 3-4 patients shown in Figure 1A.

      Each datapoint shown represents one of the patients in the immunoblots, with all patients included. Each point in turn is the mean from 3 independent experiments. We have added the following to the Figure legend:

      “(B)-(E) quantification of immunoblot bands from 3 independent experiments shown for phosphoAKT-S473, phosphoAKT-T308, p110δ and p110α respectively. Each point represents data from one of the patient cell lines in the immunoblots. Paired datapoints +/- insulin are shown in (B) and (C), and dotted lines mark means.”

      (5) I recommend rewording the following sentence: "Given this evidence that APDS2-associated PIK3R1 delta Exon 11 potently inhibits PI3Kα when overexpressed in 3T3-L1 preadipocytes," to say "... potently inhibits PI3Kα signaling when overexpressed in 3T3-L1 preadipocytes." The data shown in Figures 1 and 2 do not support a direct biochemical inhibition of PI3Kα lipid kinase activity by p85α (delta Exon 11).

      This edit has been made.

      (6) Provide more discussion concerning the percentage of humans with APDS2 or SHORT syndrome that contain the mutations discussed in this paper. How strong is the genotype-phenotype link for these diseases? Are these diseases inherited or acquired through environmental stresses?

      Both APDS2 and SHORT syndrome are very well established, highly penetrant and stereotyped monogenic diseases. APDS2 is defined by the presence of activating PIK3R1 mutations such as the one studied here (by far the commonest causal mutation). SHORT syndrome has some superficial clinical resemblance to other human genetic syndromes featuring short stature, but when careful attention is paid to characteristic features it is nearly universally attributable to loss-of-function PIK3R1 mutations, with the single exception of one case in which a putatively pathogenic PKCE mutation was described (PMID: 28934384). Although both syndromes are monogenic, it is often not accurate to refer to them as inherited, as, particularly in SHORT syndrome, de novo mutations (i.e. not found in either parent) are common. Environmental modifiers of phenotypes have not been described. We have now added to the introduction the comment that both conditions are highly penetrant and monogenic.

      (7) The data presented in Figure 5 would benefit from additional discussion and citations that describe the molecular basis of the interaction between PI3K and Irs1/2. What studies have previously established this is a direct protein-protein interactions? Are there PI3K mutants that don't interact with Irs1/2 that can be included as a negative control? Alternatively, the authors can simply reference other papers to support the mechanism of interaction.

      There is a voluminous literature dating back to the early 1990s documenting the mode of interaction of PI3K with Irs1/2. Relevant papers have now been cited as requested:

      p85-Irs1 binding: PMID 1332046 (White lab, PNAS 1992)

      p85-Irs2 binding: PMID 7675087 (White lab, Nature 1995)

      “This may be important, as p85α mediates recruitment of PI3K to activated tyrosine kinase receptors and their tyrosine phosphorylated substrates, including the insulin-receptor substrate proteins Irs1 (PMID 1332046) and Irs2 (PMID 7675087).”

      Regarding PI3K mutants that don't interact with Irs1/2, the SHORT syndrome mutant R649W which we include in this study is perhaps the best example of this, so it is both disease-causing and functions as such a negative control.

      (8) To see the effect of the dominant negative delta Exon 11, the truncated p85α needs to be super stoichiometric to the full-length p85α (Figure 2 - Supplemental Figure 2). This is distinct from the results in Figure 1 showing that the APDS2-derived dermal fibroblasts express 5-10x lower levels of p85α delta Exon 11 compared to full-length p85α (Figure 1A), but still strongly inhibit pAKT S473 and T308 (Figure 1B-1C). The manuscript would benefit from more discussion concerning the cell type specific differences in phenotypes. Alternatively, do the APDS2-derived dermal fibroblasts have other genetic perturbations that are not accounted for that potentially modulate cell signaling differently compared to 3T3-L1 preadipocytes?

      The reviewer is astute to point out this apparent contrast. First of all, we have no reason to suppose there is any specific, PI3K-modifying genetic perturbation present in the primary dermal fibroblasts studied, although of course the genetic background of these cells is very distinct to that of 3T3-L1 mouse embryo fibroblasts. Related to such background differences, however, substantial variability is usually apparent in insulin-responsiveness even of healthy control dermal fibroblasts. This means that caution should be exercised in extrapolating from studies of the primary cells of a single individual. To illustrate this, we point the reviewer to our 2016 study in which we extensively studied the dermal fibroblasts of a proband with SHORT syndrome due to PIK3R1 Y657X:

      From this study we conclude that A. WT controls show quite substantial variation in insulin-stimulated AKT phosphorylation and B. even the SHORT syndrome p85a Y657X variant, expressed at higher levels than WT p85a in dermal fibroblasts, does not produce an obvious decrease in insulin-stimulated AKT phosphorylation, despite extensive evidence from other human cell studies and knock-in mice that it does indeed impair insulin action in metabolic tissues. For both these reasons we are not convinced that the lower insulin-induced AKT phosphorylation we described in Figure 1 should be overinterpreted until reproduced in other studies with primary cells from further APDS2 patients. This is why we did not comment more extensively on this. We now add the following qualifier in results:

      “Despite this, no increase in basal or insulin-stimulated AKT phosphorylation was seen in APDS2 cells compared to cells from wild-type volunteers or from people with PROS and activating PIK3CA mutations H1047L or H1047R (Fig 1A-C, Fig 1 figure supplement 3A,B). Although insulin-induced AKT phosphorylation was lower in fibroblasts from the one APDS2 patient studied compared to controls, we have previously reported extensive variability in insulin-responsiveness of primary dermal fibroblasts from WT controls. Moreover even primary cells from a patient expressing high levels of the SHORT syndrome-associated p85a Y657X did not show attenuated insulin action, so we do not believe the reduced insulin action in APDS2 cells in the current study should be overinterpreted until reproduced in further APDS2 cells.”

      Nevertheless we remind the reviewer that the main purpose of our primary cell experiment was to determine if there were any INCREASE in basal PI3K activity, or any difference in p110a or p110d protein levels, and we regard our findings in these regards to be clear.

      (9) A. The manuscript would benefit from additional explanation concerning why the E489K, R649W, and Y657X mutants are equivalent substitutes for the characterization of p110α/p85α delta Exon 11. Perhaps a more explicit description of these mutations in relationship to the location of the p85α delta Exon 11 mutation would help. I recommend including a diagram in Figure 3 showing the position of the delta Exon 11, E489K, R649W, and Y657X mutations in the PIK3R1 coding sequence. B. Also, please clarify whether all three holoenzyme complexes were biochemically unstable (i.e. p110α/p85α, p110β/p85α, p110δ/p85α) when p85α delta Exon 11 was expressed in insect cells.

      A. Whether or not E489K, R649W and Y657X are “equivalent” to the APDS2 mutant is not really a meaningful issue here. These mutants are being studied because they cause SHORT syndrome without immunodeficiency, while the APDS2 mutant causes APDS2 often with features of SHORT syndrome. That is, it is naturally occurring mutations and the associated genotype-phenotype correlation that we seek to understand. Of the 3 SHORT syndrome causal mutations chosen, R649W is by far the commonest, effectively preventing phosphotyrosine binding, Y657X has the interesting attribute that it can be discriminated from full length p85 on immunoblots due to its truncation, and is moreover a variant that we have studied in cells and mice before, while the rarer E489K is an interesting SHORT syndrome variant as it is positioned more proximally in the p85a protein than most SHORT syndrome causal variants. All variants studied are now illustrated in the new Figure 1 figure supplement 1. B. Regarding stability of PI3K heterodimers containing the APDS2 p85a variant, we tried extensively to purify p110a and p110d complexes without success despite several approaches to optimise production. We did not try to synthesise the p110b-containing complex.

      (10) I recommend presenting the results in Figure 4 before Figure 3 because it provides a good rationale for why it's difficult to purify the p110α/p85α delta Exon 11 holoenzyme from insect cells.

      This would be true if p110d were studied in Figure 4, but it is not. Figure 4 looks instead at effects on p110a of heterologous overexpression of mutant p85 and is a natural lead-in to the ensuing Figures 5 and 6, so we do not agree it would add value or enhance flow to swap Figures 3 and 4.

      (11) The authors show that overexpression of the p85α delta Exon 11 did not result in p110α/p85α delta Exon 11 complex formation based on co-immunoprecipitation. Do the authors get the same result when they co-immunoprecipitate p110α/p85α and p110δ/p85α in the APDS2-derived dermal fibroblasts used in Figure 1A?

      This is an interesting question but not an experiment we have done. It is not unfeasible, but generating enough cells to undertake IP experiments of this nature in dermal fibroblasts is a significant undertaking, and with finite resources available and only one primary cell line to study we elected not to pursue this.

      Details in Methods section:

      (1) Include catalog numbers and vendors for reagents (e.g. lipids, PhosSTOP, G-Dynabeads, etc.). There is not enough information provided to reproduce this work.

      We have now added all vendors and catalogue numbers where relevant.

      (2) Concerning the stated lipid composition (5/10/15/45/20/5 %) in the liposome preparation protocol. Please clarify whether these numbers represent molar percentages or mg/mL percentages.

      We have now added that this is expressed as “(wt/vol)”.

      (3) What is the amino acid sequence of the PDGFR (pY2) peptide used for the PI3K activity assay?

      This assay has been published and references with detailed methods are cited. For clarity, however we now say:

      “PI(3,4,5)P3 production was measured by modified PI3-Kinase activity fluorescence polarisation assay (Echelon Biosciences, Salt Lake City, UT, USA). 10μL reactions in 384-well black microtitre plates used 1mM liposomes containing 50μM PI(4,5)P2, optimised concentrations of purified PI3K proteins, 100μM ATP, 2mM MgCl2, with or without 1μM tyrosine bisphosphorylated 33-mer peptide derived from mouse PDGFRβ residues 735-767, including phosphotyrosine at positions 740 and 751 (“pY2”; 735-ESDGGYMDMSKDESIDYVPMLDMKGDIKYADIE-767;  Cambridge peptides).”
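
      As a quick consistency check on the peptide numbering quoted above, the following minimal Python sketch maps each residue of the quoted 33-mer onto protein numbering, assuming (as stated in the text) that the first residue corresponds to position 735. The function and variable names are illustrative only and do not come from the manuscript or the assay kit.

```python
# Peptide sequence exactly as quoted in the response.
PEPTIDE = "ESDGGYMDMSKDESIDYVPMLDMKGDIKYADIE"
# Assumption taken from the quoted text: the first residue is position 735.
FIRST_RESIDUE = 735


def residue_positions(sequence, first_residue, amino_acid):
    """Return protein-numbering positions of one amino acid within the peptide."""
    return [first_residue + i for i, aa in enumerate(sequence) if aa == amino_acid]


tyrosines = residue_positions(PEPTIDE, FIRST_RESIDUE, "Y")
last_residue = FIRST_RESIDUE + len(PEPTIDE) - 1
print(len(PEPTIDE), last_residue, tyrosines)
```

      Under that numbering assumption, the peptide spans 33 residues ending at 767, and positions 740 and 751 both fall on tyrosines, which agrees with the description of the "pY2" peptide; the sequence also contains a third tyrosine that the text does not describe as phosphorylated.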

      (4) Include a Supplemental file containing a comprehensive description of the plasmids and coding sequencing used in this study.

      Such a supplemental file has been created and is included as Table 2.

      Minor points of clarification, citations, and typos:

      (1) Clarify why Activated PI3K Delta Syndrome 1 (APDS1) is thus named APDS2. See lines 71-72 of the introduction. Also see line 89: "...is common in APDS2, but not in APDS1." Briefly describe the difference between APDS1 and APDS2?

      This is described in the introduction, but we apologise if our wording was not sufficiently clear. We have tried now to remove any ambiguity:

      “Some PIK3R1 mutations reduce basal inhibition of catalytic subunits, usually due to disruption of the inhibitory inter-SH2 domain, and are found in cancers (Philp et al, 2001) and vascular malformations with overgrowth (Cottrell et al, 2021). In both diseases, hyperactivated PI3Ka, composed of heterodimers of PIK3R1 products and p110a, drives pathological growth. Distinct inter-SH2 domain PIK3R1 mutations, mostly causing skipping of exon 11 and deletion of residues 434-475, hyperactivate PI3Kd in immune cells, causing highly penetrant monogenic immunodeficiency (Deau et al, 2014; Lucas et al, 2014b). This phenocopies the immunodeficiency caused by genetic activation of p110d itself, which is named Activated PI3K Delta Syndrome 1 (APDS1) (Angulo et al, 2013; Lucas et al, 2014a). The PIK3R1-related syndrome, discovered shortly afterwards, is thus named APDS2.”

      (2) Figure legend 1. Clarify reference to "Figure EV2".

      (3) Figure legend 2. Clarify reference to "Figure EV3".

      (4) Figure legend 3. Clarify reference to "Figure EV5".

      Thank you for pointing out this oversight, arising from failure to update nomenclature fully between versions. “EV” figures actually are the figure supplements in the submission. All labels have now been updated.

      (5) For Figure 1 - supplemental figure 1C, indicate experimental conditions on the blot (e.g. -/+ insulin).

      This is now added

      (6) Figure 4B, y-axis. Clarify how data was quantified. Perhaps reword "(% WT without DOX)" for clarity.

      We have left the Y axis label as it is, but have added the following to the figure legend:

      “(B) Quantification of immunoblot bands from immunoprecipitates from 3 independent experiments, expressed as a percentage relative to the intensity of the band in WT cells without doxycycline exposure.”
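
      The normalisation described in this legend can be sketched as follows. This is a minimal illustration, assuming each experiment yields one densitometry value per condition and that normalisation is done within each experiment before averaging; the condition labels and all numbers are hypothetical and are not data from the study.

```python
from statistics import mean


def percent_of_reference(intensities, reference_key):
    """Express each band intensity as a percentage of the reference condition."""
    reference = intensities[reference_key]
    if reference <= 0:
        raise ValueError("reference band intensity must be positive")
    return {cond: 100.0 * value / reference for cond, value in intensities.items()}


# Hypothetical band intensities (arbitrary units) from three independent experiments.
experiments = [
    {"WT -DOX": 200.0, "WT +DOX": 220.0, "R649W +DOX": 180.0},
    {"WT -DOX": 150.0, "WT +DOX": 150.0, "R649W +DOX": 120.0},
    {"WT -DOX": 100.0, "WT +DOX": 120.0, "R649W +DOX": 110.0},
]

# Normalise within each experiment, then average the percentages across experiments.
normalised = [percent_of_reference(e, "WT -DOX") for e in experiments]
summary = {cond: mean(n[cond] for n in normalised) for cond in experiments[0]}
print(summary)
```

      Normalising within each experiment before averaging removes blot-to-blot differences in exposure, which is one common reason for reporting values as a percentage of an internal reference band.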

      (7) In the results section (lines 117-124), please explicitly state whether the described mutations are homo- or heterozygous.

      All mutations are heterozygous, as now explicitly stated

      (8) I recommend spelling out the SHORT and APDS2 acronyms in the abstract to make this study more accessible.

      We respectfully disagree that such spelling out in the abstract would improve accessibility. Both acronyms are clunky and wordy and are more likely to obscure meaning by squeezing out other words in the abstract. APDS is already spelled out in the introduction, and we now add the following for SHORT syndrome:

      “More surprisingly, phenotypic overlap is reported between APDS2 and SHORT syndrome. SHORT syndrome, named for the characteristic developmental features (Short stature, Hyperextensibility, Hernia, Ocular depression, Rieger anomaly, and Teething delay) is caused by loss of PI3Ka function due to disruption of the phosphotyrosine-binding C-terminal SH2 domain (Chudasama et al, 2013; Dyment et al, 2013; Thauvin-Robinet et al, 2013).”

      (9) I recommend explaining in more detail or rewording the following jargon/terms to make the writing more accessible to a broad audience: "reduced linear growth" (line 83) and "larger series" (line 86). I assume "reduced linear growth" is height.

      Edited as follows:

      “It  features short stature, insulin resistance, and dysmorphic features (Avila et al, 2016). In recent years, both individual case reports (Bravo Garcia-Morato et al, 2017; Petrovski et al, 2016; Ramirez et al, 2020; Szczawinska-Poplonyk et al, 2022) and larger case series (Elkaim et al, 2016; Jamee et al, 2020; Maccari et al, 2023; Nguyen et al, 2023; Olbrich et al, 2016; Petrovski et al., 2016) have established that many people with APDS2 have overt features of SHORT syndrome, while, more generally, linear growth impairment is common in APDS2, but not in APDS1.”

      (10) For clarity, reword lines 214-215 to read, "No increase in p110α levels was seen on conditional overexpression of wild-type or R649W p85α."

      Change made, thank you

      (11) Figure 6A - Western blot label says, "657X" instead of "Y657X."

      Now corrected

      (12) Lines 214-215: For clarity, reword the sentence to say, "No increase in p110α was seen on conditional overexpression...".

      This repeats point 10 above.

      (13) Clarify what interactions are being competed for in the following statement: "... delta Ex11 may exert its inhibitory action by competing with PI3K holoenzyme" (lines 237-238). Are you referring to the interaction between p110α and p85α or the interaction between p110α/p85α and another protein?

      We have endeavoured to clarify by editing as follows:

      “As APDS2 p85a DEx11 does not appear to displace wild-type p85a from p110a despite strong overexpression, it is likely that there are high levels of truncated p85a unbound to p110a in the cell. This may be important, as p85a mediates recruitment of PI3K to activated tyrosine kinase receptors and their tyrosine phosphorylated substrates, including the insulin-receptor substrate proteins Irs1 and Irs2. Excess free regulatory subunits compete with heterodimeric PI3K holoenzyme for binding to these phosphotyrosines (Ueki et al., 2002), raising the possibility that excess free, truncated APDS2 p85a DEx11 may exert its inhibitory action similarly by outcompeting PI3K holoenzyme for phosphotyrosine binding.”

      (14) Provide more information about the following statement and how it relates to the mutations in this study: "Homozygous truncating PIK3R1 mutations abolishing p85α expression while preserving p55α and p50α produce agammaglobulinaemia" (lines 271-272). The manuscript would benefit from a more explicit description of the nature of these mutations.

      This wording seems to us to be explicit, however we agree that a schematic of PIK3R1 genotype-phenotype correlation, as requested elsewhere, would help readers. Such a schematic is now included as Figure 1 figure supplement 1.

      (15) Typo on line 299: "unclike".

      Corrected.

      (16) The data presented in this study support a model in which p85α (DExon 11) expression functions as a dominant negative. Please clarify why in the discussion section you explain that p85α (DExon 11) activates PI3K. For example, "...skipping of exon 11, were shown in 2014 to activate PI3K..." (lines 290-291), "...activate PI3Kδ on one hand..." (line 309); "...APDS2 mutations in PIK3R1 has mixed consequences, producing greater hyperactivation of p110δ than p110α" (lines 354-355).

      We do not entirely understand the reviewer’s question and thus request here. p85α (DExon 11) activates PI3Kd in immune cells and in vitro, and this is accepted, based on numerous reports, to be the mechanism underlying immunodeficiency. We do not challenge this, and cite evidence for any such claims in our report. The dominant negative activity we describe here towards PI3Ka activation is based not on inhibition of mutant-containing heterodimer, but rather on destabilisation of and/or competition with heterodimeric WT holoenzyme. This is the basis of the model we present; that is, a finely balanced competition between enzymic activation and mutant holoenzyme destabilisation and competition of mutant free p85a with WT holoenzyme, whose net effect likely differs among cells and tissues, most likely based on the repertoire and proportions of PI3K subunit expression. If the reviewer has specific suggestions for us that will make this point clearer still we should be happy to consider them.

      (17) Provide references for the statements in lines 349-353 of the discussion.

      This brief closing paragraph is a succinct recap and summary of the key points made throughout the manuscript and thoroughly referenced therein. We prefer to keep this section clean to maximise clarity, but are happy to copy references from the various other places in the manuscript to back up these assertions if this is preferred by the editorial team. Current text:

      “In summary, it is already established that: A. genetic activation of PIK3CD causes immunodeficiency without disordered growth, while B. inhibition of PIK3R1 recruitment to RTKs and their substrates impairs growth and insulin action, without immunodeficiency, despite all catalytic subunits being affected and C. loss of p85 alone causes immunodeficiency.”

      Reviewer #2 (Recommendations For The Authors):

      In the abstract, line 42, I would rather speak of SHORT syndrome-like features.

      Some patients do indeed meet the criteria for SHORT syndrome, but there is a spectrum. We have thus added this qualification and removed “short stature” to maintain the word count, as this is itself a SHORT syndrome-like feature.

      Line 74 It would be helpful for the reader to give the amino-acid exchange and affected position of this single case.

      We agree. Now added.

      Furthermore, an illustration indicating the location of the different PIK3R1 variants on the p85 alpha level would be helpful for the reader.

      As noted above such a figure element is now included as Figure 1 figure supplement 1 and duly called out in the text

      The sentence in lines 298-300 makes no sense to me. Do you mean, unlike APDS1 murine models?

      We agree, on review, that this paragraph is convoluted and makes a simple observation complex. We have rewritten now in what we hope is a more accessible style:

      “Thus, study of distinct PIK3R1-related syndromes shows that established loss-of-function PIK3R1 mutations produce phenotypes attributable selectively to impaired PI3Ka hypofunction, while activating mutations produce phenotypes attributable to selectively increased PI3Kd signalling. Indeed, not only do such activating mutations not produce phenotypes attributable to PI3Ka activation, but they surprisingly have features characteristic of impaired PI3Ka function.”

      Line 321 I propose including the notion of different cells: “The balance between expression and signalling in different cells may be a fine one ...”

      This change has been made

      Line 352 C. loss replace with complete loss.

      “C.” actually denotes the last in a list after “A.” and “B.”. We have now used bold to emphasise this, but we imagine house style may dictate how we approach this.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1

      The study provides a complete comparative interactome analysis of α-arrestins in both humans and Drosophila. The authors have presented interactomes of six human and twelve Drosophila α-arrestins using affinity purification/mass spectrometry (AP/MS). The constructed interactomes helped to find α-arrestin binding partners through common protein motifs. The authors have used bioinformatic tools and experimental data in human cells to identify the roles of TXNIP and ARRDC5: TXNIP-HDAC2 interaction and ARRDC5-V-type ATPase interaction. The study reveals the PPI network for α-arrestins and examines the functions of α-arrestins in both humans and Drosophila.

      Comments

      I would like to congratulate the authors and the corresponding authors of this manuscript for bringing together such an elaborate study on α-arrestin and conducting a comparative study in Drosophila and humans.

      Introduction:

      The introduction provides a rationale behind why the comparison between humans and Drosophila is carried out.

      • Even though this is a research manuscript, including existing literature on similar comparisons of α-arrestins from other articles would invite a wider readership.

      Results:

      The results cover all the necessary points concluded from the experiments and computational analysis.

      1) The authors could point out the similarity of the α-arrestins in humans and Drosophila. When comparing α-arrestins between the two species, the percentage homology between Drosophila and human α-arrestins needs to be calculated.

      Thank you for your insightful feedback. As suggested by the reviewer, we determined the percentage homology of α-arrestin protein sequences from human and Drosophila using Clustal Omega. This homology is now illustrated as a heatmap in revised Figure S5. Please note that only values with a percentage homology of 40% or higher are labeled.
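
      To illustrate the kind of calculation behind such a heatmap, here is a minimal Python sketch that computes pairwise percent identity from already-aligned sequences and labels only values at or above a 40% threshold. It is not the authors' pipeline: the sequences are made-up toy fragments, the names are hypothetical, and the denominator used here (columns where neither sequence has a gap) is only one of several conventions, so values may differ from those reported by a given alignment tool.

```python
from itertools import combinations


def percent_identity(aligned_a, aligned_b):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip columns containing a gap in either sequence
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy, made-up aligned fragments; NOT real alpha-arrestin sequences.
alignment = {
    "seqA": "MKV-LPPSYAA",
    "seqB": "MKVQLPPAYAA",
    "seqC": "MRT-LGPSY--",
}
THRESHOLD = 40.0  # label only values at or above this percentage

for (name_a, seq_a), (name_b, seq_b) in combinations(alignment.items(), 2):
    pid = percent_identity(seq_a, seq_b)
    label = f"{pid:.1f}" if pid >= THRESHOLD else ""
    print(name_a, name_b, label)
```

      Leaving sub-threshold cells unlabeled, as in the last loop, keeps a dense heatmap readable while the colour scale still conveys the lower values.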

      • Citing the direct connecting genes from the network in the text will invite citations and a wider readership.

      Figures:

      The images are elaborate and well-made.

      2) The authors could use a directly connected gene-gene network that shows the individual interactions. This can be used by other readers working on the same topic and ensure reproducibility and citations.

      We appreciate your valuable comment. Based on the reviewer’s suggestion, we have developed a new website in which one can navigate the gene-gene networks of α-arrestins. These directly connected gene-gene networks are housed in the Network Data Exchange (NDEx) project. Additionally, we have included gene ontology and protein class details for α-arrestins’ interactors in this set of networks, offering a more comprehensive view of α-arrestins’ interactomes.

      On page 24 lines 15-18, we have revised the manuscript to introduce the newly developed website, as follows.

      “Lastly, to assist the research community, we have made comprehensive α-arrestin interactome maps on our website (big.hanyang.ac.kr/alphaArrestin_PPIN). Researchers can search and download their interactomes of interest as well as access information on potential cellular functions and protein class associated with these interactomes.”  

      3-1) The co-expression interactions represented in the figures should reveal interactions between the α-arrestins and other genes. Which genes within each sub-network does each α-arrestin interact with? The arrows only point at the sub-networks, so the figures do not reveal the individual interactions. Kindly show these interactions in the figure with the proper nodes.

      3-2) Figure 2: the networks shown for both human and Drosophila are well represented. The green lines from α-arrestin indicate the strength of the interaction. Several smaller expression networks are seen, but "α-arrestin" in both organisms seems highly disconnected from all the genes. Connected genes have edges, not arrows. Showing α-arrestin connected to these gene-gene networks would help in identifying which genes connect with which gene through α-arrestin. This can be used by other readers working on the same topic and ensure reproducibility and citations.

      Thank you for your valuable comment. In response to the reviewer’s recommendation, we’ve added a supplementary figure, Figure S4, which illustrates direct interactions between α-arrestins and protein components of clustered complexes (or sub-networks), in addition to the associations shown between α-arrestins and the clustered complexes in Figure 2. We believe that this newly incorporated information regarding direct protein interactions will invite citations and a wider readership, as the reviewer pointed out.

      On page 12 line 27 to page 13 line 5, we have revised the manuscript to cite the direct interactions between ARRDC3 and proteins involved in ubiquitination-dependent proteolysis, as follows.

      “While the association of ARRDC3 with these ubiquitination-dependent proteolysis complexes is statistically insignificant, ARRDC3 does interact with individual components of these complexes such as NEDD4, NEDD4L, WWP1, and ITCH (Figure S4A). This suggests their functional relevance in this context, as previously reported in both the literature and databases (Nabhan et al., 2010; Shea et al., 2012; Szklarczyk et al., 2015; Warde-Farley et al., 2010) (Puca & Brou, 2014; Xiao et al., 2018).”

      Direct interaction between α-arrestins and protein components of clustered complexes are illustrated in the newly added figure, Figure S4.

      4-1) Figure 4. The Protein blot image was blurred. Kindly provide a higher-resolution image.

      4-2) Figure 5. B. - The authors can provide images with higher resolution blot images. The bands were not visible.

      We appreciate your valuable comment. Unfortunately, the protein blot image was scanned from the original film, and the images we provided in the figure represent the highest resolution that we have obtained to date. Raw, uncropped images are shown in Author response images 1 and 2.

      Author response image 1.

      Raw image of Figure 4B

      Author response image 2.

      Raw image of Figure 5B

      5) Figure: 5. A. - I see non-specific amplifications in the gel images. Are these blotting images? or the gel images that were changed to "Grayscale"? Non-specific amplification may imply that the experiment was not repeated and standardized. Was it gel images or blot images?

      We appreciate your insightful comment. The images in Figure 5A represent western blot bands from a co-immunoprecipitation assay analyzing the interaction between TXNIP and HDAC2 proteins. Since immunoblotting of immunoprecipitates can usually detect some non-specific bands from the heavy (~50 kDa) and light (~25 kDa) chains of the target antibody or from multiple co-immunoprecipitated proteins, we assume that the faint non-specific bands in Figure 5A might be the heavy chain of the TXNIP or HDAC2 antibody or an unidentified non-specific band. Because the target bands showed strong intensity and a very clear pattern compared to the non-specific bands in the co-immunoprecipitation assay, we believe that these data are sufficient to support the interaction of TXNIP with HDAC2. Finally, in the revised Figure 5A, we’ve modified the labeling for the different experimental conditions, namely siCon and siTXNIP treatments, and added the expected sizes of the proteins (kDa), as shown below.

      6) Figure 5. A. RT-PCR analysis: What was your expected size of the amplifications? The ladder indicated is in kDa. Is that right?

      We appreciate your insightful questions. As mentioned above, Figure 5A shows the blotting images of co-immunoprecipitation analysis, and the ladder indicates the molecular weight (kDa) of protein markers. For clearer interpretation, the expected size of target proteins has been added in Figure 5A in the revised manuscript.

      7) How were the band intensities determined?

      Thank you for your question. For quantification of immunoblot results, the densities of target protein bands were analyzed with ImageJ, as described in the Materials and Methods.

      Discussion:

      The authors have discussed the conclusions they draw from their study, but could say more about the ARRDCs and why they were selected over the other arrestins. The authors have provided future work directions associated with their work.

      8) Why were only ARRDCs presented amongst all the arrestins in the main part of the manuscript?

      We’re grateful for your valuable feedback. We focused on α-arrestins because they were discovered relatively recently compared to the more established visual/β-arrestin proteins in the same arrestin family, and the biological functions of many α-arrestins remain largely unexplored, with notable exceptions in the budding yeast model and a few α-arrestins in mammals and invertebrate species. Most importantly, a comparative study highlighting the shared or unique features of α-arrestins is yet to be undertaken. To gain a more comprehensive understanding of these unexplored α-arrestins across multiple species, we’ve centered our research on the ARRDCs within the arrestin protein family.

      On page 21 lines 8-17, we’ve edited the manuscript to emphasize the importance of a comparative study on α-arrestins, as detailed below.

      “According to a phylogenetic analysis of arrestin family proteins, α-arrestins were shown to be ubiquitously conserved from yeast to human (Alvarez, 2008). However, compared to the more established visual/β-arrestin proteins, α-arrestins have been discovered more recently, and their molecular mechanisms and functions remain mostly unexplored except in the budding yeast model (Zbieralski & Wawrzycka, 2022). Based on the high-confidence interactomes of α-arrestins from human and Drosophila, we identified conserved and specific functions of these α-arrestins. Furthermore, we uncovered newly discovered molecular functions of the human-specific α-arrestins TXNIP and ARRDC5. We anticipate that the discoveries made here will enhance current understanding of α-arrestins.”

      9) The discussion could be elaborated more by utilizing the data.

      We appreciate your insightful feedback. Based on the reviewer’s suggestion, we’ve enhanced the discussion in the manuscript to provide a clearer interpretation of our results. First, we’ve added description of conserved protein complexes significantly associated with α-arrestins, stated on page 22 lines 5-12 and lines 23-26.

      Page 22 lines 5-12: “The integrative map of protein complexes also highlighted both conserved and unique relationships between α-arrestins and diverse functional protein complexes. For instance, protein complexes involved in ubiquitination-dependent proteolysis, proteasome, RNA splicing, and intracellular transport (motor proteins) were prevalently linked with α-arrestins in both human and Drosophila. To more precisely identify conserved PPIs associated with α-arrestins, we undertook ortholog predictions within the α-arrestins’ interactomes. This revealed 58 orthologous interaction groups that were observed to be conserved between human and Drosophila (Figure 3).”

      Page 22 lines 23-26: “Additionally, interaction between α-arrestins and entities like motor proteins, small GTPase, ATP binding proteins, and endosomal trafficking components were identified to be conserved. Further validation of these interactions could unveil molecular mechanisms consistently associated with these cellular functions.”
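
      The ortholog-based comparison described in the passage from page 22 lines 5-12 can be pictured with a small sketch. The Python below is an assumption-laden illustration, not the authors' method: it supposes that each protein has already been assigned to an ortholog group by some prediction tool, and it calls an interaction "conserved" when the same pair of ortholog groups is linked in both species. All protein names, group identifiers, and interactions are hypothetical placeholders.

```python
def to_group_pairs(interactions, ortholog_group):
    """Map (bait, prey) pairs to (bait group, prey group), dropping unmapped proteins."""
    pairs = set()
    for bait, prey in interactions:
        if bait in ortholog_group and prey in ortholog_group:
            pairs.add((ortholog_group[bait], ortholog_group[prey]))
    return pairs


# Hypothetical ortholog-group assignments, for illustration only.
human_groups = {"ARRDC_h1": "OG_arrestin", "LIGASE_h1": "OG_ligase", "MOTOR_h1": "OG_motor"}
fly_groups = {"arr_f1": "OG_arrestin", "ligase_f1": "OG_ligase", "kinase_f1": "OG_kinase"}

# Hypothetical bait-prey interactions detected in each species.
human_ppis = [("ARRDC_h1", "LIGASE_h1"), ("ARRDC_h1", "MOTOR_h1")]
fly_ppis = [("arr_f1", "ligase_f1"), ("arr_f1", "kinase_f1")]

# An interaction group is kept only if it is observed in both species.
conserved = to_group_pairs(human_ppis, human_groups) & to_group_pairs(fly_ppis, fly_groups)
print(sorted(conserved))
```

      Working at the level of ortholog groups rather than individual proteins lets one-to-many orthology collapse into a single conserved interaction group, which is one plausible reading of how a count of orthologous interaction groups could be obtained.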

      Secondly, we’ve added description of role of ARRDC5 in osteoclast maturation, as stated on page 23 lines 22-24.

      “Conversely, depletion of ARRDC5 reduces osteoclast maturation, underscoring the pivotal role of ARRDC5 in osteoclast development and function (Figure S9A and B).”

      Lastly, we examined the association between α-arrestins’ interactomes and human diseases, incorporating our findings into the discussion. The newly introduced figure based on the result is Figure S10.

      On page 24 lines 10-14, we’ve added discussion on Figure S10 as follows.

      “We further explored association between α-arrestins’ interactomes and disease pathways (Figure S10). Notably, the interactomes of α-arrestins in human showed clear links to specific diseases. For instance, ARRDC5 is closely associated with disease resulting from viral infection and cardiovascular conditions. ARRDC2, ARRDC4, and TXNIP share common association with certain neurodegenerative diseases, while ARRDC1 is implicated in cancer.”

      Supplementary figures:

      The authors have a rigorous amount of work added together for the success of this manuscript.

      10) The reference section needs editing before publication. Maybe the arrangement was disturbed during compiling.

      Thank you for your valuable comment. Based on the reviewer’s suggestion, we have rearranged the reference section to enhance its clarity. Below are excerpts from the updated reference section in the manuscript.

      “Adenuga, D., & Rahman, I. (2010). Protein kinase CK2-mediated phosphorylation of HDAC2 regulates co-repressor formation, deacetylase activity and acetylation of HDAC2 by cigarette smoke and aldehydes. Arch Biochem Biophys, 498(1), 62-73. doi:10.1016/j.abb.2010.04.002

      Adenuga, D., Yao, H., March, T. H., Seagrave, J., & Rahman, I. (2009). Histone Deacetylase 2 Is Phosphorylated, Ubiquitinated, and Degraded by Cigarette Smoke. American Journal of Respiratory Cell and Molecular Biology, 40(4), 464-473. doi:10.1165/rcmb.2008-0255OC

      Akalin, A., Franke, V., Vlahovicek, K., Mason, C. E., & Schubeler, D. (2015). Genomation: a toolkit to summarize, annotate and visualize genomic intervals. Bioinformatics, 31(7), 1127-1129. doi:10.1093/bioinformatics/btu775

      Alvarez, C. E. (2008). On the origins of arrestin and rhodopsin. BMC Evol Biol, 8, 222. doi:10.1186/1471-2148-8-222”

      11) Many important references were missing.

      We appreciate and agree with the reviewer’s comment. In response to the reviewer’s recommendation, we’ve thoroughly reviewed the manuscript and below are sections of the manuscript where around 20 new references have been added.

      On page 8 lines 12-14:

      “Utilizing the known affinities between short linear motifs in α-arrestins and protein domains in interactomes (El-Gebali et al., 2019; UniProt Consortium, 2018)”

      On page 8 lines 19-22:

      “One of the most well-known short-linear motifs in α-arrestin is PPxY, which is reported to bind with high affinity to the WW domain found in various proteins, including ubiquitin ligases (Ingham, Gish, & Pawson, 2004; Macias et al., 1996; Sudol, Chen, Bougeret, Einbond, & Bork, 1995)”

      On page 9 lines 3-6:

      “Next, we conducted enrichment analyses of Pfam protein domains (El-Gebali et al., 2019; Huang da, Sherman, & Lempicki, 2009b) among the interactome of each α-arrestin to investigate known and novel protein domains commonly or specifically associated (Figure S3A; Table S5).”

      On page 9 lines 7-10:

      “HECT and C2 domains are well known to be embedded in the E3 ubiquitin ligases such as NEDD4, HECW2, and ITCH along with WW domains (Ingham et al., 2004; Melino et al., 2008; Rotin & Kumar, 2009; Scheffner, Nuber, & Huibregtse, 1995; Weber, Polo, & Maspero, 2019)”

      On page 10 lines 12-16:

      “In fact, the known binding partners, NEDD4, WWP2, WWP1, and ITCH in human and CG42797, Su(dx), Nedd4, Yki, Smurf, and HERC2 in Drosophila, that were detected in our data are related to ubiquitin ligases and protein degradation (C. Chen & Matesic, 2007; Ingham et al., 2004; Y. Kwon et al., 2013; Marin, 2010; Melino et al., 2008; Rotin & Kumar, 2009) (Figure 1E; Figure S2F).”

      On page 13 lines 20-21:

      “Given that α-arrestins are widely conserved in metazoans (Alvarez, 2008; DeWire, Ahn, Lefkowitz, & Shenoy, 2007),”

      On page 14 lines 12-17:

      “The most prominent functional modules shared across both species were the ubiquitin-dependent proteolysis, endosomal trafficking, and small GTPase binding modules, which are in agreement with the well-described functions of α-arrestins in membrane receptor degradation through ubiquitination and vesicle trafficking (Dores et al., 2015; S. O. Han et al., 2013; Y. Kwon et al., 2013; Nabhan et al., 2012; Puca & Brou, 2014; Puca et al., 2013; Shea et al., 2012; Xiao et al., 2018; Zbieralski & Wawrzycka, 2022) (Figure 3).”  

      Reviewer #2

      In this manuscript, the authors present a novel interactome focused on human and fly alpha-arrestin family proteins and demonstrate its application in understanding the functions of these proteins. Initially, the authors employed AP/MS analysis, a popular method for mapping protein-protein interactions (PPIs) by isolating protein complexes. Through rigorous statistical and manual quality control procedures, they established two robust interactomes, consisting of 6 baits and 307 prey proteins for humans, and 12 baits and 467 prey proteins for flies. To gain insights into the gene function, the authors investigated the interactors of alpha-arrestin proteins through various functional analyses, such as gene set enrichment. Furthermore, by comparing the interactors between humans and flies, the authors described both conserved and species-specific functions of the alpha-arrestin proteins. To validate their findings, the authors performed several experimental validations for TXNIP and ARRDC5 using ATAC-seq, siRNA knockdown, and tissue staining assays. The experimental results strongly support the predicted functions of the alpha-arrestin proteins and underscore their importance.

      I would like to suggest the following analyses to further enhance the study:

      1) It would be valuable if the authors could present a side-by-side comparison of the interactomes of alpha-arrestin proteins, both before and after this study. This visual summary network would demonstrate the extent to which this work expanded the existing interactome, emphasizing the overall contribution of this study to the investigation of the alpha-arrestin protein family.

      We greatly appreciate your insightful feedback. In response to the reviewer’s suggestion, we’ve depicted a network of known PPIs associated with α-arrestins (Figure S2C and D). Furthermore, by comparing our high-confidence PPIs to these known sets, we found that the overlaps are statistically significant and the high-confidence PPIs of α-arrestins broaden the existing interactome (Figure S2E).

      From page 7 line 26 to page 8 line 8, we’ve detailed this side-by-side comparison of the existing interactome and the newly discovered high-confidence PPIs of α-arrestins, as outlined below.

      “As a result, we successfully identified many known interaction partners of α-arrestins such as NEDD4, WWP2, WWP1, ITCH and TSG101, previously documented in both the literature and PPI databases (Figure S2C-F) (Colland et al., 2004; Dotimas et al., 2016; Draheim et al., 2010; Mellacheruvu et al., 2013; Nabhan et al., 2012; Nishinaka et al., 2004; Puca & Brou, 2014; Szklarczyk et al., 2015; Warde-Farley et al., 2010; Wu et al., 2013). Additionally, we greatly expanded the repertoire of PPIs associated with α-arrestins in human and Drosophila, resulting in 390 PPIs between six α-arrestins and 307 prey proteins in human, and 740 PPIs between twelve α-arrestins and 467 prey proteins in Drosophila (Figure S2E). These are subsequently referred to as ‘high-confidence PPIs’ (Table S3).”
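
      As an illustration of how the significance of an overlap between a known PPI set and a newly detected set can be assessed, the sketch below computes a one-sided hypergeometric tail probability. This is an assumption made for illustration only: the statistical test behind Figure S2E is not stated in this response, and the function name and every count in the usage line are hypothetical placeholders rather than values from the study.

```python
from math import comb


def overlap_p_value(overlap: int, universe: int, known: int, detected: int) -> float:
    """One-sided hypergeometric tail probability, P(X >= overlap).

    universe: total number of candidate bait-prey pairs considered
    known:    pairs in the reference (previously reported) set
    detected: pairs called high-confidence in the new data
    overlap:  pairs present in both sets
    """
    upper = min(known, detected)
    total = comb(universe, detected)
    # Sum the probability mass for every overlap size at least as large as observed.
    favorable = sum(
        comb(known, i) * comb(universe - known, detected - i)
        for i in range(overlap, upper + 1)
    )
    return favorable / total


# Hypothetical counts, chosen only to show the call; not taken from the study.
p = overlap_p_value(overlap=10, universe=20000, known=50, detected=390)
print(f"P(overlap >= 10) = {p:.3e}")
```

      The choice of the universe (which bait-prey pairs count as possible) strongly affects the resulting p-value, so it would need to be defined explicitly in any real analysis.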

      2) While the authors conducted several analyses exploring protein function, there is a need to further explore the implications of the interactome in human diseases. For instance, it would be beneficial to investigate the association of the newly identified interactome members with specific human diseases. Including such investigations would strengthen the link between the interactome and human disease contexts.

      Thank you for your valuable comment. As suggested by the reviewer, we examined the association between α-arrestins’ interactomes and human diseases, incorporating our findings into the discussion. The newly introduced figure based on the result is Figure S10.

      On page 24 lines 10-14, we’ve added discussion on Figure S10 as follows.

      “We further explored association between α-arrestins’ interactomes and disease pathways (Figure S10). Notably, the interactomes of α-arrestins in human showed clear links to specific diseases. For instance, ARRDC5 is closely associated with disease resulting from viral infection and cardiovascular conditions. ARRDC2, ARRDC4, and TXNIP share common association with certain neurodegenerative diseases, while ARRDC1 is implicated in cancer.”

      Reviewer #3:

      Lee, Kyungtae and colleagues have discovered and mapped out alpha-arrestin interactomes in both human and Drosophila through affinity purification/mass spectrometry and the SAINTexpress method. They identified high-confidence interactomes, consisting of 390 protein-protein interactions (PPIs) between six human alpha-arrestins and 307 prey proteins, as well as 740 PPIs between twelve Drosophila alpha-arrestins and 467 prey proteins. To define and characterize these identified alpha-arrestin interactomes, the team employed a variety of widely recognized bioinformatics tools. These included protein domain enrichment analysis, PANTHER for protein class enrichment, DAVID for subcellular localization analysis, COMPLEAT for the identification of functional complexes, and DIOPT to identify evolutionarily conserved interactomes. Through these analyses, they confirmed the roles and associated functions of known alpha-arrestin interactors, such as ubiquitin ligase and protease. Furthermore, they found unexpected biological functions in the newly discovered interactomes, including RNA splicing and helicase, GTPase-activating proteins, and ATP synthase. The authors carried out further study into the role of human TXNIP in transcription and epigenetic regulation, as well as the role of ARRDC5 in osteoclast differentiation. This study holds important value as the newly identified alpha-arrestin interactomes are likely to aid functional studies of this group of proteins. Despite the overall support from data for the paper's conclusions, certain elements related to data quantification, interpretation, and presentation demand more detailed explanation and clarification.

      1) In Figure 1B, it is shown that human alpha-arrestins were N-GFP tagged (N-terminal) and Drosophila alpha-arrestins were C-GFP (C-terminal). However, the rationale of why the authors used different tags for human and fly proteins was not explained in the main text and methods.

      We appreciate your valuable comment. Both N- and C-terminally tagged α-arrestins have been used previously. Given that our study aims to increase the repertoire of α-arrestin interacting proteins, where GFP is added might not be a concern. We note that GFP is a relatively bulky tag, and tagging a protein with GFP can potentially abolish the interaction with some of the binding proteins. Follow-up studies utilizing different approaches for detecting protein-protein interactions, such as BioID and yeast two-hybrid, will allow us to build more comprehensive α-arrestin interactomes.

      2) In Figure 2A, there seems to be an error for labeling the GAL4p/GAL80p complex that includes NOTCH2, NOTCH1 and TSC2.

      Thank you for your comment. We double-checked the COMPLEAT (protein COMPLex Enrichment Analysis Tool) database for the name of the protein complex consisting of NOTCH1, NOTCH2, and TSC2. The database indeed labeled this complex as the “GAL4p/GAL80p complex”. However, given the potential for mis-annotation (since we could not ascertain the relevance of these proteins to the “GAL4p/GAL80p complex”), we chose to exclude this protein complex from the network. The updated protein complex network is illustrated in the revised Figure 2A.

      3) In Figure 5, given that knockdown of TXNIP did not affect the levels and nuclear localization of HDAC2, the authors suggest that TXNIP might modulate HDAC2 activity. However, the ChIP assay suggests a different model: the TXNIP-HDAC2 interaction might inhibit the chromatin occupancy of HDAC2, reducing histone deacetylation and increasing global chromatin accessibility. The authors need to propose a model consistent with all of these data.

      We greatly appreciate your detailed feedback. Our data indicate a global decrease in chromatin accessibility (Figure 4C-G) and a diminished interaction between TXNIP and HDAC2 upon depletion of TXNIP (Figure 5A). Additionally, we observed increased occupancy of HDAC2 and subsequent histone deacetylation at TXNIP-target promoter regions (Figure 5C) without any change in the HDAC2 expression level (Figure 5A) in TXNIP-knockdown cells. From these observations, we infer that the TXNIP-HDAC2 interaction might suppress the function of HDAC2, a major gene silencer that shapes condensed or accessible chromatin through its deacetylase activity. Although we checked whether TXNIP could induce cytosolic retention of HDAC2 to inhibit its nuclear function, TXNIP knockdown did not alter the subcellular localization of HDAC2 (Figure 5B).

      To elucidate the mechanism by which TXNIP inhibits the function of HDAC2, we further investigated the effect of TXNIP on the level of HDAC2 phosphorylation, which is known to be crucial for its deacetylase activity and for the formation of the transcriptional repressive complex. However, as shown in Figure S8C and D, the knockdown of TXNIP affected neither the interaction between HDAC2 and other components of the NuRD complex (co-IP assay) nor the HDAC2 phosphorylation status (immunoblotting). The results suggest that TXNIP may inhibit the function of HDAC2 independently of these factors.

      Following the reviewer’s suggestion, we carefully provided a proposed model describing the possible role of TXNIP in transcriptional regulation through interaction with HDAC2 and co-repressor complex in Figure S8E.

      Description of these newly added figures can be found in the revised manuscript from page 18 line 7 to 27, as outlined below.

      “HDAC2 typically operates within the mammalian nucleus as part of co-repressor complexes as it lacks the ability to bind to DNA directly (Hassig, Fleischer, Billin, Schreiber, & Ayer, 1997). The nucleosome remodeling and deacetylation (NuRD) complex is one of the well-recognized co-repressor complexes that contains HDAC2 (Kelly & Cowley, 2013; Seto & Yoshida, 2014) and we sought to determine if depletion of TXNIP affects the interaction between HDAC2 and other components in this NuRD complex. While HDAC2 interacted with MBD3 and MTA1 under normal conditions, the interaction between HDAC2 and MBD3 or MTA1 was not affected upon TXNIP depletion (Figure S8C). Next, given that HDAC2 phosphorylation is known to influence its enzymatic activity and stability (Adenuga & Rahman, 2010; Adenuga, Yao, March, Seagrave, & Rahman, 2009; Bahl & Seto, 2021; Tsai & Seto, 2002), we tested if TXNIP depletion alters the phosphorylation status of HDAC2. The result indicated, however, that the phosphorylation status of HDAC2 does not change upon TXNIP depletion (Figure S8D). In summary, our findings suggest a model where TXNIP plays a role in transcriptional regulation independent of these factors (Figure S8E). When TXNIP is present, it directly interacts with HDAC2, a key component of the transcriptional co-repressor complex. This interaction suppresses the recruitment of HDAC2 to target genomic regions, leading to the histone acetylation of target loci possibly through an active complex including histone acetyltransferase (HAT). As a result, transcriptional activation of the target gene occurs. In contrast, when TXNIP expression is diminished, the interaction between TXNIP and HDAC2 weakens. This restores the histone deacetylating activity of HDAC2 in the co-repressor complex, leading to subsequent repression of target gene transcription.”

      4) The authors showed that ectopic expression of ARRDC5 increased osteoclast differentiation and function. Does loss of ARRDC5 lead to defects in osteoclast function and fate determination?

      We appreciate your valuable comment. We have confirmed the endogenous expression of ARRDC5 in osteoclasts and conducted a loss-of-function study using shARRDC5. As determined by qPCR, endogenous ARRDC5 expression was very low in osteoclasts. Even during RANKL-induced osteoclast differentiation, the CT value for ARRDC5 (29-31) was high compared with the CT values (17-24) for the marker genes Cathepsin K, TRAP, and NFATc1. Despite this low endogenous expression, we generated ARRDC5 knockdown cells by infecting BMMs with lentivirus expressing shRNA against ARRDC5 and subsequently differentiated the cells into mature osteoclasts. After five days of differentiation, we observed a significant decrease in the total number of TRAP-positive multinucleated cells (No. of TRAP+ MNCs) in shARRDC5 cells compared to that in the control cells. This result indicates that the loss of ARRDC5 leads to defects in osteoclast differentiation. The result of this loss-of-function study using shARRDC5 is depicted in Figure S9A and B.

      In the revised manuscript, the following sentence explaining Figure S9A and B was added on page 19, lines 15-17.

      “Depletion of ARRDC5 using short hairpin RNA (shRNA) impaired osteoclast differentiation, further affirming its crucial role in this differentiation process (Figure S9A and B).”
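
      To put the CT values quoted in the response above in perspective, the sketch below converts a CT gap into an approximate fold difference in transcript abundance. It is a rough illustration under stated assumptions, not the authors' analysis: it assumes perfect doubling per PCR cycle (efficiency of 2.0), the same efficiency and detection threshold for every primer pair, and no reference-gene normalization. The function name is hypothetical.

```python
def fold_difference(ct_target: float, ct_reference: float, efficiency: float = 2.0) -> float:
    """Approximate how many times more abundant the reference transcript is
    than the target transcript, given their Ct values.

    Assumes both amplicons amplify with the same per-cycle efficiency
    (2.0 means perfect doubling), which is an idealization.
    """
    return efficiency ** (ct_target - ct_reference)


# Narrowest gap between the quoted ranges: ARRDC5 at 29 vs. a marker gene at 24.
smallest_gap = fold_difference(ct_target=29, ct_reference=24)
# Widest gap between the quoted ranges: ARRDC5 at 31 vs. a marker gene at 17.
largest_gap = fold_difference(ct_target=31, ct_reference=17)
print(smallest_gap, largest_gap)
```

      Under these idealized assumptions, a gap of 5 to 14 cycles corresponds to the marker genes being roughly 32-fold to 16,384-fold more abundant than ARRDC5, which is consistent with describing endogenous ARRDC5 expression as very low.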

      5) From Figure 6D, the authors argued that ARRDC5 overexpression resulted in more V-ATPase signals; however, there is no quantification. Quantification of the confocal images would strengthen the conclusion. Also, western blots for V-ATPase proteins would provide an alternative way to determine the effects of ARRDC5.

      We appreciate your insightful feedback. As suggested by the reviewer, we quantified V-type ATPase signals using confocal images, which were shown in Figure 6D. The ImageJ program was employed for integrated density measurements, and the integrated density of GFP-GFP overexpressing osteoclasts was set to 1 for relative comparison. The result in the revised Figure 6D revealed a significant increase in V-type ATPase signals in GFP-ARRDC5 overexpressing osteoclasts compared to that in GFP-GFP overexpressing osteoclasts, as outlined below.

      We also agree with the reviewer’s comment that western blotting for V-ATPase proteins is an alternative way to determine the effects of ARRDC5 in osteoclast differentiation. Using qPCR and western blot analysis, we confirmed that V-type ATPase expression did not differ between GFP-GFP and GFP-ARRDC5 overexpressing osteoclasts. The corresponding western blot result is shown in the revised Figure S9C.

      In addition, the corresponding qPCR that measures the expression level of V-type ATPase between GFP-GFP and GFP-ARRDC5 overexpressing osteoclasts is shown in Author response image 3.

      Author response image 3.

      Moreover, based on the references, the V-type ATPase is localized at the plasma membrane during osteoclast differentiation (Toyomura et al., 2003). Although mRNA and protein expression levels were similar in both cells, localization of V-ATPase in plasma membrane was significantly increased in GFP-ARRDC5 overexpressing osteoclasts compared to that in GFP-GFP osteoclasts, as shown in the revised Figure 6D above.

      6) The results from Figure 6D did not support the authors' argument that ARRDC5 might control the membrane localization of the V-ATPase, as bafilomycin is the V-ATPase inhibitor. ARRDC5 knockdown experiments will help to determine whether ARRDC5 can control the membrane localization of the V-ATPase in osteoclast.

      Thank you for your insightful comment. V-type ATPase has been reported to play an important role in the differentiation and function of osteoclasts (Feng et al., 2009; Qin et al., 2012). Given that various subunits of the V-type ATPase interact with ARRDC5 (Figure 6A), we speculated that ARRDC5 might be involved in the function of this complex and play a role in osteoclast differentiation and function. As noted above, GFP-ARRDC5 overexpressing osteoclasts showed a similar expression level of V-type ATPase to GFP-GFP cells but exhibited increased V-type ATPase signals at the cell membrane compared to those in GFP-GFP cells (Figure 6D). Additionally, co-localization of ARRDC5 and V-type ATPase was observed in the osteoclast membrane (Figure 6D), as predicted by the human ARRDC5-centric PPI network. On the other hand, bafilomycin A1, a V-type ATPase inhibitor, not only blocked localization of V-type ATPase to the plasma membrane in GFP-ARRDC5 overexpressing osteoclasts, but also reduced ARRDC5 signals (Figure 6D). These results indicate that ARRDC5 plays a role in osteoclast differentiation and function by interacting with V-type ATPase and promoting the localization of V-type ATPase to the plasma membrane in osteoclasts.

      V-type ATPase present in the osteoclast membrane is important for cell fusion, maturation, and function during osteoclast differentiation (Feng et al., 2009; Qin et al., 2012). GFP-ARRDC5 overexpressing osteoclasts showed a significant increase in V-type ATPase signals at the cell membrane compared to GFP-GFP cells (Figure 6D), as well as significantly increased cell fusion (No. of TRAP+ MNCs in Figure 6B) and resorption activity (resorption pit formation in Figure 6C). Conversely, ARRDC5 knockdown in osteoclasts (shARRDC5 cells) produced a significant decrease in the No. of TRAP+ MNCs compared to that in the control cells, indicating that the loss of ARRDC5 leads to defects in cell fusion during osteoclast differentiation (Figure S9A and B). As described above, the endogenous expression of ARRDC5 was very low in osteoclasts and may be restricted to a specific time point during differentiation. Therefore, to better understand the interaction of ARRDC5 with V-type ATPase in osteoclasts, ARRDC5 overexpression is more suitable than its knockdown.

      Part of the manuscript on page 19 line 21 to page 20 line 6 was edited to support our statement, as outlined below.

      “The V-type ATPase is localized at the osteoclast plasma membrane (Toyomura et al., 2003) and its localization is important for cell fusion, maturation, and function during osteoclast differentiation (Feng et al., 2009; Qin et al., 2012). Furthermore, its localization is disrupted by bafilomycin A1, which is shown to attenuate the transport of the V-type ATPase to the membrane (Matsumoto & Nakanishi-Matsui, 2019). We analyzed changes in the expression level and localization of V-type ATPase, especially V-type ATPase V1 domain subunit (ATP6V1), in GFP-GFP and GFP-ARRDC5 overexpressing osteoclasts. The level of V-type ATPase expression did not change in osteoclasts regardless of ARRDC5 expression levels (Figure S9C). GFP signals were detected at the cell membrane when GFP-ARRDC5 was overexpressed, indicating that ARRDC5 might also localize to the osteoclast plasma membrane (Figure 6D; Figure S9D). In addition, we detected more V-type ATPase signals at the cell membrane in the GFP-ARRDC5 overexpressing osteoclasts, and ARRDC5 and V-type ATPase were co-localized at the osteoclast membrane (Figure 6D; Figure S9D).”

      7) The tables (excel files) do not have proper names for each table S numbers. Please correct the name of excel files for readers.

      We appreciate your valuable comments. In response to the reviewer’s suggestion, we’ve renamed the Excel files with more appropriate titles for easier readability. The renamed tables (Excel files) are listed below.

      Table S1. List of α-arrestins from human and Drosophila

      Table S2. Evaluation sets of α-arrestins PPIs

      Table S3. Summary tables of SAINTexpress results

      Table S4. Protein domains and short linear motifs in the α-arrestin interactomes

      Table S5. Enriched Pfam domains in the α-arrestin interactomes

      Table S6. Subcellular localizations of α-arrestin interactomes

      Table S7. Summary of protein complexes and cellular components associated with α-arrestin

      Table S8. Orthologous relationship of α-arrestin interactomes between human and Drosophila

      Table S9. Summary of ATAC- and RNA-seq read counts before and after processing

      Table S10. Differential accessibility of ACRs and gene expression

      Table S11. Summary of ATAC-seq peaks located in promoters and gene expression level

      Table S12. List of primer sequences used in this study

      8) http://big.hanyang.ac.kr/alphaArrestin_Fly link does not work. Please fix the link.

      We appreciate your comment. In response to the reviewer’s comment, we have made comprehensive α-arrestin interactome maps on our new website (big.hanyang.ac.kr/alphaArrestin_PPIN) and confirmed that users can be re-directed to networks housed in NDEx.

      Author response image 4.

      Screen shot of the first page of the newly developed website.

      Website address: big.hanyang.ac.kr/alphaArrestin_PPIN

      Author response image 5.

      Screen shot of the gene-gene network involving α-arrestin in human.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Response to reviewer’s comments

      Reviewer #1 (Public Review):

      In this study, the structural characteristics of plant AlaDC and SerDC were analyzed to understand the mechanism of functional differentiation, deepen the understanding of substrate specificity and catalytic activity evolution, and explore effective ways to improve the initial efficiency of theanine synthesis.

      On the basis of previous solid work, the authors successfully obtained the X-ray crystal structures of CsAlaDC and AtSerDC, which are key proteins related to the synthesis of ethylamine, a precursor of theanine, and found a unique zinc finger structure in these two crystal structures that is not found in other Group II PLP-dependent amino acid decarboxylases. Through a series of experiments, it is pointed out that this characteristic zinc finger motif may be the key to the folding of CsAlaDC and AtSerDC proteins, and this discovery is novel and prospective in the study of theanine synthesis.

      In addition, the authors identified Phe106 of CsAlaDC and Tyr111 of AtSerDC as key sites of substrate specificity by comparing substrate binding regions and identified amino acids that inhibit catalytic activity through mutation screening based on protein structure. It was found that the catalytic activity of the CsAlaDC L110F/P114A mutant was 2.3 times higher than that of CsAlaDC. At the same time, the key substrate recognition motifs of CsAlaDC and AtSerDC were used to carry out an evolutionary analysis of protein sequences from embryophytes that are highly homologous to CsAlaDC, and 13 potential alanine decarboxylases were found, which laid a solid foundation for subsequent studies related to theanine synthesis.

      In general, this study has a solid foundation, the whole research idea is clear, the experimental design is reasonable, and the experimental results provide strong evidence for the authors' point of view. Through a large number of experiments, the key links in the theanine synthesis pathway are deeply studied, an effective way to improve the initial efficiency of theanine synthesis is found, and the molecular mechanism of this approach is expounded. The whole study has good novelty and prospectivity, and sheds light on a new direction for the efficient industrial synthesis of theanine.

      Response: Thank you very much for taking time to review this manuscript. We appreciate all your insightful comments and constructive suggestions.

      Reviewer #1 (Recommendations For The Authors):

      (1) If some test methods are not original, references or method basis should be indicated.

      Response: Thank you very much for your careful reading of the manuscript and valuable suggestions. We have added references for the enzymatic activity experiments performed to measure the synthesis of theanine in the revised manuscript.

      (2) The conclusion is a little lengthy, and the summary of the whole study is not well condensed.

      Response: Thank you very much for your valuable suggestions. We have refined the conclusion in the revised manuscript, and it is as follows:

      In conclusion, our structural and functional analyses have significantly advanced understanding of the substrate-specific activities of alanine and serine decarboxylases, typified by CsAlaDC and AtSerDC. Critical amino acid residues responsible for substrate selection were identified—Tyr111 in AtSerDC and Phe106 in CsAlaDC—highlighting pivotal roles in enzyme specificity. The engineered CsAlaDC mutant (L110F/P114A) not only displayed enhanced catalytic efficiency but also substantially improved L-theanine yield in a synthetic biosynthesis setup with PsGS or GMAS. Our research expanded the repertoire of potential alanine decarboxylases through the discovery of 13 homologous enzyme candidates across embryophytic species and uncovered a special motif present in serine protease-like proteins within Fabales, suggesting a potential divergence in substrate specificity and catalytic functions. These insights lay the groundwork for the development of industrial biocatalytic processes, promising to elevate the production of L-theanine and supporting innovation within the tea industry.

      Reviewer #2 (Public Review)

      Summary:

      The manuscript focuses on the comparison of two PLP-dependent enzyme classes that perform amino acyl decarboxylations. The goal of the work is to understand the substrate specificity and factors that influence the catalytic rate in an enzyme linked to theanine production in tea plants.

      Strengths:

      The work includes x-ray crystal structures of modest resolution of the enzymes of interest. These structures provide the basis for the design of mutagenesis experiments to test hypotheses about substrate specificity and the factors that control catalytic rate. These ideas are tested via mutagenesis and activity assays, in some cases both in vitro and in plants.

      Weaknesses:

      The manuscript could be more clear in explaining the contents of the x-ray structures and how the complexes studied relate to the reactant and product complexes. The structure and mechanism section would also be strengthened by including a diagram of the reaction mechanism and including context about reactivity. As it stands, much of the structural results section consists of lists of amino acids interacting with certain ligands without any explanation of why these interactions are important or the role they play in catalysis. The experiments testing the function of a novel Zn(II)-binding domain also have serious flaws. I don't think anything can be said at this point about the function of the Zn(II) due to a lack of key controls and problems with experimental design.

      Response: Thank you very much for your thoughtful comments and feedback on our manuscript. We are pleased to hear that the work's strengths, such as the X-ray crystal structures and the mutagenesis experiments tied to the catalytic rate and substrate specificity, align with the goals of our research.

      We recognize the areas identified for improvement and appreciate the suggestions provided. We have emphasized how we use the structural information obtained to infer the roles of key amino acid residues in the reaction. Additionally, we have added a diagram of the reaction mechanism in the Supplementary figure to provide clearer context on reactivity and improve the overall understanding of the catalytic process. Regarding the structural results section, we have included a discussion that contextualizes the list of amino acids and their interactions with the ligands by explaining their significance and roles in catalysis. We acknowledge the weaknesses you've pointed out in the experiments concerning the novel Zn(II)-binding domain, but we would like to clarify that the focus of our study was not primarily on the zinc structure. While we agree that there may be limitations in the experimental design and controls for the zinc binding domain, we believe that these flaws do not significantly impact the overall findings of the study. The experiment served as a preliminary exploration of the potential functionality of the domain, and further studies are required to fully understand its role and mechanism.

      Reviewer #2 (Recommendations For The Authors):

      (1) In addition to the points raised in the public review, it would be ideal to provide some context for the enzymatic characterization. Why are the differences in kinetic parameters for AlaDC and SerDC significant?

      Response: Thank you for your comments and suggestions. The Km values for CsAlaDC and SerDCs are comparable, suggesting similar substrate affinities. However, CsAlaDC exhibits a significantly lower Vmax compared to AtSerDC and CsSerDC. This discrepancy implies that CsAlaDC and SerDCs may differ in the rates at which they convert substrate to product when saturated with substrate. SerDCs may have a faster turnover rate, meaning they convert substrate to product and release the enzyme more quickly, resulting in a higher Vmax. Differences in the stability or correct folding of the enzymes under assay conditions can also affect their Vmax. If SerDCs are more stable, they might maintain their catalytic activity better at higher substrate concentrations, contributing to a higher Vmax. We have added these to the part of “Enzymatic properties of CsAlaDC, AtSerDC, and CsSerDC” in our revised manuscript.
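
      The relationship between comparable Km values and different Vmax values described above can be illustrated with the standard Michaelis-Menten rate law, v = Vmax·[S] / (Km + [S]). The sketch below is illustrative only: the function name is hypothetical, and the Km and Vmax numbers are arbitrary placeholders in arbitrary units, not measured parameters of CsAlaDC, AtSerDC, or CsSerDC.

```python
def michaelis_menten_rate(substrate: float, vmax: float, km: float) -> float:
    """Initial reaction rate under the Michaelis-Menten model."""
    return vmax * substrate / (km + substrate)


# Placeholder parameters (arbitrary units), NOT values from the manuscript:
# two enzymes sharing one Km but differing in Vmax.
km_shared = 5.0
vmax_low, vmax_high = 4.0, 10.0

for s in (1.0, 5.0, 50.0, 500.0):
    v_low = michaelis_menten_rate(s, vmax_low, km_shared)
    v_high = michaelis_menten_rate(s, vmax_high, km_shared)
    # With equal Km, the ratio of rates equals the ratio of Vmax at every [S].
    print(f"[S]={s:6.1f}  v_low={v_low:.3f}  v_high={v_high:.3f}  ratio={v_high / v_low:.2f}")
```

      With a shared Km, both enzymes reach half of their own Vmax at the same substrate concentration, so the difference between them shows up as a constant ratio of rates rather than as a shift in the saturation curve.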

      (2) Why is Phe106/Tyr111 pair critical for substrate specificity? Does the amino acid contact the side chain? It might be helpful to a reader to formulate a hypothesis for this interaction.

      Response: Thank you for the question and comments. We conducted a comparison between the active sites of CsAlaDC and AtSerDC and observed a distinct difference in only two amino acids: F106 in CsAlaDC and Y111 in AtSerDC. The remaining amino acids were found to be identical. Expanding on previous research concerning Group II PLP-dependent amino acid decarboxylases, it was postulated and subsequently confirmed that these specific amino acids play a crucial role in substrate recognition. However, since we lack the structure of the enzyme-substrate complex, we are unable to elucidate the precise interactions occurring between the substrate and the amino acids at this particular site based solely on structural information.

      (3) Line 55 - Define EA again.

      Response: Thank you very much for your careful reading of the manuscript and valuable suggestions. We have redefined “EA” as the abbreviation for ethylamine in the revised manuscript.

      (4) Line 58 - The meaning of "determined by the quality formation of tea" is not clear.

      Response: Thank you very much for your careful reading of the manuscript and valuable suggestions. We have modified it in the revised manuscript.

      (5) Line 65 - Missing words between "despite they".

      Response: Thank you very much for your careful reading of the manuscript. We have corrected it in the revised manuscript.

      (6) Line 67 - Need a reference for the statement about lower activity?

      Response: Thank you for the question and comments. We have provided the following reference to support this statement in the revised manuscript.

      Reference: Bai, P. et al. (2021) Biochemical characterization of specific Alanine Decarboxylase (ADC) and its ancestral enzyme Serine Decarboxylase (SDC) in tea plants (Camellia sinensis). BMC Biotechnol. 21,17.

      (7) Line 100-101 - The meaning of "its closer relationship was Dicots plants." is not clear.

      Response: We have revised the sentence in the revised manuscript, as follows: “Phylogenetic analysis indicated that CsAlaDC is homologous with SerDCs in Dicots plants.”

      (8) Line 139 - Missing a word between "as well as" and "of".

      Response: Thank you very much for your careful reading of the manuscript and valuable suggestions. We have corrected it in the revised manuscript.

      (9) Line 142 - The usage of comprised here is not correct. It would be more correct to say "The overall architecture of CsAlaDC and AtSerDC is homodimeric with the two subunits...".

      Response: Thank you very much for your careful reading of the manuscript and valuable suggestions. We have corrected it in the revised manuscript.

      (10) Line 148-149 - I didn't understand the statement about the "N-terminal structures" Are these structures obtained from protein samples that have a truncated N-terminus?

      Response: Group II PLP-dependent amino acid decarboxylases comprise three distinct structural domains: the N-terminal domain, the large domain, and the C-terminal domain. Each of these domains possesses unique structural features. Similarly, CsAlaDC and AtSerDC can be divided into three structural domains based on their specific characteristics. To obtain more stable proteins for further experiments, we truncated both proteins. The removed segment is part of the N-terminal domain and was deleted from the N-terminus of the protein.

      (11) Line 153 - Say "is composed of" instead of "composes of".

      Response: Thank you very much for your careful reading of the manuscript and valuable suggestions. We have corrected it in the revised manuscript.

      (12) Line 156 - I didn't understand the statement about the cofactor binding process. What is the cofactor observed? And how can we say anything about the binding process from a single static structure of the enzyme? It might be better to say that the cofactor binding site is located at the subunit junction - but the identity of the cofactor still needs to be defined first.

      Response: Thank you for your comments and suggestions. The cofactor mentioned here is PLP. We intended to describe the bound state of PLP at the active site, not the binding process. The description has been revised in the revised manuscript.

      (13) Lines 157-158 - I didn't understand the conclusion about the roles of each monomer. In the images in Figure 3 - both monomers appear to bind PLP but the substrate is not present - so it's not clear how conclusions can be drawn about differential substrate binding in the two subunits.

      Response: Thank you very much for your careful reading and valuable suggestions. The main idea we want to convey is that this protein possesses two active sites and that, at each active site, the two monomers carry out distinct functions. We agree that our previous conclusion was inaccurate because no substrate is present in the structure, and we have made the necessary amendments in the revised manuscript.

      (14) Line 161 - I would say loop instead of ring.

      Response: Thank you very much for your careful reading of the manuscript and valuable suggestions. We have corrected it in the revised manuscript.

      (15) Line 165 - Please provide some references for this statement. It would also be ideal to state the proximity of the Zn-binding motif to the active site or otherwise provide some information about the role of the motif based on its location.

      Response: Thank you for your comments and suggestions. We have provided the following references to support this statement in the revised manuscript.

      Author response image 1.

      (A) Structure of histidine decarboxylase. (B) Structure of glutamate decarboxylase.

      Reference:

      30 Komori, H. et al. (2012) Structural study reveals that Ser-354 determines substrate specificity on human Histidine Decarboxylase. J Biol Chem. 287, 29175-83.

      31 Huang, J. et al. (2018) Lactobacillus brevis CGMCC 1306 glutamate decarboxylase: Crystal structure and functional analysis. Biochem Biophys Res Commun. 503, 1703-1709.

      In CsAlaDC, the zinc is positioned at a distance of 29.6 Å from the active center, whereas in AtSerDC, the zinc is situated 29 Å away from the active center. Hence, we hypothesize that this structure does not impact the enzyme's catalytic activity but might be correlated with its stability.

      (16) Lines 166-178 - This paragraph appears to be a list of all of the interactions between the protein, PLP, and the EA product. It would be ideal to provide some text to explain why these interactions are important and what we can learn from them.

      Response: Thank you very much for your careful reading of the manuscript and valuable suggestions. We have conducted additional analysis of the functional roles of the amino acid residues involved in the interaction between the active site and PLP. This analysis addresses how these residues aid PLP binding, determine its orientation, and contribute to the catalytic mechanism. These details are included in the revised manuscript.

      (17) Line 192 - Bond not bound.

      Response: Thank you very much for your careful reading of the manuscript and valuable suggestions. We have made corrections in the revised manuscript.

      (18) Lines 201-207 - It would be ideal to verify that the inclusion of 5 mM DTT affects Zn binding. It's not clear to me that this reagent would necessarily disrupt Zn binding. Under certain circumstances, it could instead promote Zn association. For example, if the Cys ligands are oxidized initially but then become reduced? I don't think the current experiment really provides any insight into the role of the Zn.

      Response: Thank you for your valuable insights regarding the role of DTT and its potential effects on Zn binding in our experiments. The main function of DTT is to protect or restore the reduced state of proteins and other biological molecules, in particular by reducing disulfide bonds and keeping thiol (-SH) groups reduced, thereby maintaining protein structure and function. We therefore do not know why DTT inhibits enzyme activity and cannot provide a reasonable explanation for this phenomenon. As a result, we have removed the section discussing the inhibition of enzyme activity by DTT from our revised manuscript.

      Reviewer #3 (Public Review):

      In the manuscript titled "Structure and Evolution of Alanine/Serine Decarboxylases and the Engineering of Theanine Production," Wang et al. solved and compared the crystal structures of Alanine Decarboxylase (AlaDC) from Camellia sinensis and Serine Decarboxylase (SerDC) from Arabidopsis thaliana. Based on this structural information, the authors conducted both in vitro and in vivo functional studies to compare enzyme activities using site-directed mutagenesis and subsequent evolutionary analyses. This research has the potential to enhance our understanding of amino acid decarboxylase evolution and the biosynthetic pathway of the plant-specialized metabolite theanine, as well as to further its potential applications in the tea industry.

      Response: Thank you very much for taking the time to review this manuscript. We appreciate all your insightful comments.

      Reviewer #3 (Recommendations For The Authors):

      Page 6, Figure 2, Page 23 (Methods)

      "The supernatants were purified with a Ni-Agarose resin column followed by size-exclusion chromatography."

      What kind of SEC column did the authors use? Can the authors provide the SEC elution profile comparison results and size standard curve?

      Response: We used a Superdex 200 (Hiload 16/600) column for size-exclusion chromatography. The comparison of the SEC elution profiles of AtSerDC and CsAlaDC, along with the standard curve of the SEC column, is presented below.

      Author response image 2.

      (A) Comparison of elution profiles of CsAlaDC and AtSerDC. (B) Elution profile of Blue Dextran 2000. (C) Elution profile of mixed protein standards (Aldolase, 158000 Da, 71.765 ml; Conalbumin, 75000 Da, 79.391 ml; Ovalbumin, 44000 Da, 83.767 ml; Carbonic anhydrase, 29000 Da, 90.019 ml; Ribonuclease A, 13700 Da, 98.145 ml). (D) Size standard curves of Superdex 200 (Hiload 16/600) column.
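
      As a rough illustration of how a size standard curve like the one in panel D can be built and used, the sketch below fits log10(molecular weight) against elution volume for the five standards listed in the caption. It is a minimal sketch under stated assumptions: the authors' actual fitting procedure is not described, the elution volume is used directly rather than a partition coefficient (Kav) because the void and column volumes are not given, and the query elution volume is hypothetical.

```python
import math

# Calibration standards from the caption: (name, molecular weight in Da, elution volume in mL).
STANDARDS = [
    ("Aldolase", 158000, 71.765),
    ("Conalbumin", 75000, 79.391),
    ("Ovalbumin", 44000, 83.767),
    ("Carbonic anhydrase", 29000, 90.019),
    ("Ribonuclease A", 13700, 98.145),
]


def fit_calibration(standards):
    """Ordinary least-squares fit of log10(MW) = intercept + slope * elution_volume."""
    volumes = [volume for _, _, volume in standards]
    log_mws = [math.log10(mw) for _, mw, _ in standards]
    n = len(volumes)
    mean_v = sum(volumes) / n
    mean_y = sum(log_mws) / n
    sxx = sum((v - mean_v) ** 2 for v in volumes)
    sxy = sum((v - mean_v) * (y - mean_y) for v, y in zip(volumes, log_mws))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_v
    return slope, intercept


def estimate_mw(elution_volume_ml, slope, intercept):
    """Apparent molecular weight (Da) predicted by the fitted line."""
    return 10 ** (intercept + slope * elution_volume_ml)


SLOPE, INTERCEPT = fit_calibration(STANDARDS)

# Hypothetical elution volume, used only to demonstrate the lookup.
QUERY_VOLUME_ML = 75.0
print(f"slope = {SLOPE:.4f} per mL, intercept = {INTERCEPT:.3f}")
print(f"apparent MW at {QUERY_VOLUME_ML} mL: {estimate_mw(QUERY_VOLUME_ML, SLOPE, INTERCEPT):.0f} Da")
```

      An apparent molecular weight obtained this way depends on molecular shape as well as mass, so it is best read as an estimate of oligomeric state rather than an exact mass.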

      Page 6 & Page 24 (Methods)

      "The 100 μL reaction mixture, containing 20 mM substrate (Ala or Ser), 100 mM potassium phosphate, 0.1 mM PLP, and 0.025 mM purified enzyme, was prepared and incubated at standard conditions (45 ℃ and pH 8.0 for CsAlaDC, 40 ℃ and pH 8.0 for AtSerDC for 30 min)."

      (1) The enzymatic activities of CsAlaDC and AtSerDC were measured at two different temperatures (45 and 40 ℃), but their activities were directly compared. Is there a reason for experimenting at different temperatures?

      Response: We determined that the optimal reaction temperature for AtSerDC is 40°C and for CsAlaDC is 45°C through our verification process. Consequently, all subsequent experiments were performed at these specific temperatures.

      Author response image 3.

      (A) Relative activity of CsAlaDC at different temperatures. (B) Relative activity of AtSerDC at different temperatures.

      (2) Enzyme activities were measured at temperatures above 40℃, which is not a physiologically relevant temperature and may affect the stability or activity of the proteins. At the very least, the authors should provide temperature-dependent protein stability data (e.g., CD spectra analysis) or, if possible, temperature-dependent enzyme activities, to show that their experimental conditions are suitable for studying the activities of these enzymes.

      Response: Thank you very much for your careful reading. We have already validated that the experimental temperature we used did not significantly affect the stability of the protein before experimenting. The results are shown in the figure below:

      Author response image 4.

      The two proteins were individually placed in water baths at 25°C, 37°C, 45°C, 60°C, and 80°C for 15 minutes. Enzymatic reactions were then carried out in the standard reaction system, with untreated enzyme serving as the control. The results suggest that the temperatures at which we performed our experiments do not have a significant impact on the stability of the enzymes.

      (3) The authors used 20 mM of substrate. What are the physiological concentrations of alanine and serine typically found in plants?

      Response: The content of alanine in tea plant roots ranges from 0.28 to 4.18 mg/g DW (Yu et al., 2021; Cheng et al., 2017), which corresponds to a physiological concentration of 3.14 mM to 46.92 mM in tea plant roots. The content of serine in plants ranges from 0.014 to 17.6 mg/g DW (Kumar et al., 2017), which corresponds to a physiological concentration of 0.13 mM to 167.48 mM in plants. The substrate concentration of 20 mM used in this study therefore falls within the range of actual alanine and serine concentrations in plants.

      Yu, Y. et al. (2021) Glutamine synthetases play a vital role in high accumulation of theanine in tender shoots of albino tea germplasm "Huabai 1". J. Agric. Food Chem. 69 (46),13904-13915.

      Cheng, S. et al. (2017) Studies on the biochemical formation pathway of the amino acid L-theanine in tea (Camellia sinensis) and other plants. J. Agric. Food Chem. 65 (33), 7210-7216.

      Kumar, V. et al. (2017) Differential distribution of amino acids in plants. Amino Acids. 49(5), 821-869.
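
      The millimolar values quoted in the response above can be reproduced by dividing the content (mg/g DW) by the molar mass of the amino acid. The sketch below shows that arithmetic. It assumes, as the quoted numbers appear to, that 1 g of dry weight is treated as equivalent to 1 mL of volume; that equivalence is an inference from the numbers, not something stated in the response, and the function name is hypothetical.

```python
# Molar masses in g/mol.
MOLAR_MASS = {"alanine": 89.09, "serine": 105.09}


def content_to_mM(content_mg_per_g_dw: float, molar_mass_g_per_mol: float,
                  ml_per_g_dw: float = 1.0) -> float:
    """Convert an amino acid content (mg per g dry weight) to a concentration in mM.

    ml_per_g_dw is the volume assigned to 1 g of dry weight. The default of 1.0
    is an ASSUMPTION that reproduces the values quoted in the response.
    """
    mmol_per_g_dw = content_mg_per_g_dw / molar_mass_g_per_mol  # mg / (g/mol) = mmol
    mmol_per_ml = mmol_per_g_dw / ml_per_g_dw
    return mmol_per_ml * 1000.0  # mmol/mL -> mmol/L (mM)


for name, low, high in (("alanine", 0.28, 4.18), ("serine", 0.014, 17.6)):
    mass = MOLAR_MASS[name]
    print(f"{name}: {content_to_mM(low, mass):.2f} mM to {content_to_mM(high, mass):.2f} mM")
```

      If concentrations were expressed relative to tissue water instead, the ml_per_g_dw argument would need a measured value, and the resulting concentrations would change accordingly.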

      Pages 6-7 & Table 1

      (1) Use the correct notation for Km and Vmax. Also, the authors show kinetic parameters and use multiple units (e.g., mmol/L or mM for Km).

      Response: Thank you very much for your careful reading of the manuscript and valuable suggestions. We have corrected this in the revised manuscript.

      (2) When comparing the catalytic efficiency of enzymes, kcat/Km (or Vmax/Km) is generally used. The authors present a comparison of catalytic activity from results to conclusion. A clarification of what results are being compared is needed.

      Response: Thank you for your comments and suggestions. The catalytic activity is assessed by comparing reaction rates.
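
      For context on the reviewer's point, the sketch below shows how kcat and the catalytic efficiency kcat/Km are derived from Vmax, Km, and the enzyme concentration. The enzyme concentration of 0.025 mM is taken from the assay description; the Vmax and Km values are hypothetical placeholders, and it is an assumption that Vmax was determined at that enzyme concentration.

```python
def kcat_per_second(vmax_mM_per_min: float, enzyme_mM: float) -> float:
    """Turnover number kcat = Vmax / [E]total, converted from per minute to per second."""
    if enzyme_mM <= 0:
        raise ValueError("enzyme concentration must be positive")
    return vmax_mM_per_min / enzyme_mM / 60.0


def catalytic_efficiency(kcat_s: float, km_mM: float) -> float:
    """Catalytic efficiency kcat/Km in M^-1 s^-1 (Km converted from mM to M)."""
    if km_mM <= 0:
        raise ValueError("Km must be positive")
    return kcat_s / (km_mM / 1000.0)


ENZYME_MM = 0.025  # enzyme concentration given in the assay description

# Hypothetical kinetic parameters (NOT values from the manuscript).
HYPOTHETICAL_ENZYMES = {
    "enzyme A": {"vmax_mM_per_min": 1.5, "km_mM": 10.0},
    "enzyme B": {"vmax_mM_per_min": 0.3, "km_mM": 10.0},
}

for name, params in HYPOTHETICAL_ENZYMES.items():
    kcat = kcat_per_second(params["vmax_mM_per_min"], ENZYME_MM)
    efficiency = catalytic_efficiency(kcat, params["km_mM"])
    print(f"{name}: kcat = {kcat:.2f} s^-1, kcat/Km = {efficiency:.1f} M^-1 s^-1")
```

      A reaction rate measured at a single substrate concentration reflects both Vmax and Km, whereas kcat/Km describes efficiency at low substrate concentration, so the two comparisons need not rank enzymes identically.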

      Page 7 & Figure 3

      In Figure 3A, the authors describe the overall structure, but a simple explanation or labeling within the figure should be added.

      Response: Thank you very much for your suggestions. We have modified Figure 3A as follows:

      Author response image 5.

      Crystal structures of CsAlaDC and AtSerDC. (A) Dimer structure of CsAlaDC. The N-terminal domain, large domain, and C-terminal domain of chain A are shown in light pink, khaki, and sky blue, respectively. Chain B is shown in spring green. The PLP molecule is shown as a sphere model. The zinc finger structure at the C-terminus of CsAlaDC is indicated by the red box. The gray spheres represent zinc ions, and the red dotted lines depict the coordination bonds formed by the zinc ions with cysteine and histidine.

      Figures 3F & 4A

      In these figures, the two structures are overlaid and compared, but the colors are too similar for the differences to be seen. The authors should use a different color scheme.

      Response: Thank you very much for your suggestions. We have modified Figures 3F and 4A as follows:

      Author response image 6.

      (Figure 3F) - The monomers of CsAlaDC and AtSerDC are superimposed. CsAlaDC is depicted in spring green, while AtSerDC is shown in plum. The conserved amino acid catalytic ring is indicated by the red box. (Figure 4A) - Superposition of substrate binding pocket amino acid residues in CsAlaDC and AtSerDC. The amino acid residues of CsAlaDC are shown in spring green, the amino acid residues of AtSerDC are shown in plum, with the substrate specificity-related amino acid residue highlighted in a red ellipse.

      Pages 7 & 8

      Figures 3 and 4 do not include illustrations of what the authors describe in the text. The reader will not be able to understand the descriptions until they download and view the structures themselves. The authors should create additional figures to make it easier for readers to understand the structures.

      Response: Thank you very much for your suggestions. We have included Supplementary Figure 1 in the revised manuscript, which presents more detailed structural depictions of the two proteins.

      Pages 9 & 10

      "This result suggested this Tyr is required for the catalytic activity of CsAlaDC and AtSerDC."

      The authors' results are interesting, but it is recommended to perform the experiments in a specific order. First, experiments should determine whether mutagenesis affects the protein's stability (e.g., CD, as discussed earlier), and second, whether mutagenesis affects ligand binding (e.g., ITC, SPR, etc.), before describing how site-directed mutagenesis alters enzyme activity. In particular, the authors' hypothesis would be much more convincing if they could show that the ligand binding affinity is similar between WT and mutants.

      Response: Thank you for your insightful feedback on our manuscript, which we greatly appreciate. Your suggestion to methodically sequence the experiments provides a clear pathway to bolster the strength and conclusiveness of our results.

      We agree that it is crucial to first assess the stability of the mutant proteins, as changes therein could inadvertently affect catalytic activity. To this end, we have employed circular dichroism (CD) to study the potential structural alterations in the proteins induced by mutations. The experimental results are shown in the following figure:

      Author response image 7.

      (A) Circular Dichroism Spectra of CsAlaDC (WT). (B) Circular Dichroism Spectra of CsAlaDC (Y336F). (C) Circular Dichroism Spectra of AtSerDC (WT). (D) Circular Dichroism Spectra of AtSerDC (Y341F).

      The experimental results indicate that the secondary structure of the mutant proteins remains unchanged, which means the mutations do not alter the protein's stability.

      The ligand PLP forms a Schiff base structure with the ε-amino group of a lysine residue in the protein, with maximum absorbance around 420-430 nm. Since we have already added PLP during the protein purification process, as long as the absorbance of mutant proteins and wild-type proteins is the same at 420-430 nm at equivalent concentrations, it indicates that the mutant proteins do not affect the binding of the ligand PLP. Therefore, we scanned the UV-visible absorption spectra of both the wild-type and mutant proteins, and the results are as presented in the following figure:

      Author response image 8.

      (A) UV-Visible Absorption Spectra of CsAlaDC (WT) compared to CsAlaDC (Y336F). (B) UV-Visible Absorption Spectra of AtSerDC (WT) compared to AtSerDC (Y341F).

      The mutant protein and the wild-type protein exhibit similar absorbance at 420-430 nm, indicating that the mutation does not affect the binding of PLP to the protein.

      The above experiments have confirmed that the mutations do not significantly affect the stability of the protein or the affinity for the ligand, so we can more confidently attribute changes in enzyme activity to the specific role of the tyrosine residue in question. We believe this comprehensive approach will substantiate our hypothesis and illustrate the necessity of this Tyr residue for the catalytic activity of CsAlaDC and AtSerDC enzymes.

      Figure 3

      In the 3D structure figure provided by the authors, the proposed reaction mechanism of the enzyme and the involved amino acids are not included. Can the authors add a supplementary figure with a schematic drawing that includes more information, such as distances?

      Response: Thank you for your valuable feedback on our manuscript. We completely agree that a schematic drawing with additional details, including distances, would enhance the clarity and understanding of the enzymatic mechanism. In response to your suggestion, we have added Supplementary Figure 2 to the revised manuscript, which illustrates the proposed reaction pathway and highlights the key amino acids involved.

      Page 10

      "The results showed that 5 mM L-DTT reduced the relative activity of CsAlaDC and AtSerDC to 22.0% and 35.2%, respectively"

      The authors primarily use relative activity to compare WT and mutants. Can the authors specify the exact experiments, units, and experimental conditions? Is it Vmax or catalytic efficiency? If so, under what specific experimental conditions?

      Response: Thank you for your attention and review of our research paper. We appreciate your suggestions and feedback. The experimental protocol used to evaluate the influence of DTT on protein catalytic efficiency was as follows:

      The 100 μL reaction mixture, containing 20 mM substrate (Ala or Ser), 100 mM potassium phosphate, 0.1 mM PLP, 5 mM L-DTT, and 0.025 mM purified enzyme, was prepared and incubated at standard conditions (45 °C and pH 8.0 for CsAlaDC for 5 min, 40 °C and pH 8.0 for AtSerDC for 2 min). A reaction system without DTT served as the control. The reaction was then stopped with 20 μL of 10% trichloroacetic acid. The product was derivatized with 6-aminoquinolyl-N-hydroxy-succinimidyl carbamate (AQC) and analyzed by UPLC. All enzymatic assays were performed in triplicate.

      However, due to the unknown mechanism of DTT inhibition on protein activity, we have removed this part of the content in the revised manuscript.

      Pages 10-12

      The identification of 'Phe106 in CsAlaDC' and 'Tyr111 in AtSerDC,' along with the subsequent mutagenesis and enzymatic activity assays, is intriguing. However, the current manuscript lacks an explanation and discussion of the underlying reasons for these results. As previously mentioned, it would be helpful to gain insights and analysis from WT-ligand and mutant-ligand binding studies (e.g., ITC, SPR, etc.). Furthermore, the authors' analysis would be more convincing with accompanying structural analysis, such as steric hindrance analysis.

      Response: Thank you for your insightful comments and constructive feedback on our manuscript. We appreciate the interest you have expressed in the identification of 'Phe106 in CsAlaDC' and 'Tyr111 in AtSerDC' and their functional implications based on mutagenesis and enzymatic assays.

      In order to investigate the binding status of the mutant protein and the ligand PLP, we scanned the UV-visible absorption spectra of both the wild-type and mutant proteins, and the results are as presented in the following figure:

      Author response image 9.

      (A) UV-Visible Absorption Spectra of CsAlaDC (WT) compared to CsAlaDC (F106Y). (B) UV-Visible Absorption Spectra of AtSerDC (WT) compared to AtSerDC (Y111F).

      The mutant protein and the wild-type protein exhibit similar absorbance at 420-430 nm, indicating that the mutation does not affect the binding of PLP to the protein. Therefore, we can conclude that the change in activity of the mutant protein is caused by the substitution of the amino acid at that site, i.e., the amino acid at that site affects substrate specificity. Comparing the structures of the two proteins, we can see that the Tyr at position 111 of AtSerDC is a hydrophilic amino acid, which increases the hydrophilicity of the active site, and thus the substrate is the hydrophilic amino acid Ser. In contrast, the amino acid at the corresponding site in CsAlaDC is Phe, which, lacking a hydroxyl group compared to Tyr, increases the hydrophobicity of the active site, favoring the hydrophobic amino acid Ala as the substrate. We have added a discussion of the potential reasons for this result to the Discussion section of the revised manuscript.

      Page 5 & Figure 1B

      "As expected, CsSerDC was most closed to AtSerDC, which implies that they shared similar functions. However, CsAlaDC is relatively distant from CsSerDC."

      In Figure 1B, CsSerDC and AtSerDC are in different clades, and this figure does not show that the two enzymes are closest. To provide another quantitative comparison, please provide a matrix table showing amino acid sequence similarities as a supplemental table.

      Response: Many thanks for your constructive suggestion. We have added a matrix table showing amino acid sequence similarities to the supplemental materials. The similarity between the amino acid sequences of CsSerDC and AtSerDC is 86.21%, which is higher than that between CsAlaDC and CsSerDC (84.92%). These data support the description of Figure 1B. We have added a description of the sequence similarity analysis to the revised manuscript. The statement "As expected, CsSerDC was most closed to AtSerDC, which implies that they shared similar functions." was not accurate enough, so we have revised it to "As expected, CsSerDC was closer to AtSerDC, which implies that they shared similar functions." in the revised manuscript.
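
      A pairwise matrix of the kind requested by the reviewer can be computed from a multiple sequence alignment. The sketch below computes percent identity over aligned, gap-free columns. It is illustrative only: the sequences are short hypothetical placeholders rather than the real enzymes, and the response does not state how its similarity values were calculated (a similarity score based on a substitution matrix would give different numbers from plain identity).

```python
from itertools import combinations


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over alignment columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for residue_a, residue_b in zip(aligned_a, aligned_b):
        if residue_a == "-" or residue_b == "-":
            continue  # skip columns that contain a gap
        compared += 1
        if residue_a == residue_b:
            matches += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * matches / compared


# Hypothetical toy alignment (NOT the real CsAlaDC, CsSerDC, or AtSerDC sequences).
TOY_ALIGNMENT = {
    "seq1": "MKT-LLGF",
    "seq2": "MKTALLGY",
    "seq3": "MRTALVGY",
}

for name_a, name_b in combinations(TOY_ALIGNMENT, 2):
    identity = percent_identity(TOY_ALIGNMENT[name_a], TOY_ALIGNMENT[name_b])
    print(f"{name_a} vs {name_b}: {identity:.2f}% identity")
```

      Whichever measure is used, reporting it alongside the matrix makes values such as 86.21% and 84.92% directly comparable across studies.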

      Page 5 & Figure 1C

      Figure 1C, which shows a multiple sequence alignment with the amino acid sequences of the 6 SerDCs and CsAlaDC, clearly shows the differences between the sequences of AlaDC and other SerDCs. However, the authors' hypothesis would be more convincing if they showed that this difference is also conserved in AlaDCs from other plants. Can the authors show a new multiple-sequence alignment by adding more amino acid sequences of other AlaDCs?

      Response: Thank you for your comments and suggestions. We would like to identify additional alanine decarboxylases. At present, however, CsAlaDC is the only experimentally confirmed alanine decarboxylase, and no experimentally verified alanine decarboxylases have been found in other plant species.

      Figure 5A

      Figure 5A is missing the error bar.

      Response: Figure 5A shows a preliminary screen of these mutants, which was performed without replicates. Only the L110F and P114A mutants, which exhibited significantly improved activity, subsequently underwent further experimental verification to confirm their enhanced functionality.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      This work from Cui, Pan, Fan, et al explores memory impairment in chronic pain mouse models, a topic of great interest in the neurobiology field. In particular, the work starts from a very interesting observation, that WT mice can be divided into susceptible and unsusceptible to memory impairment upon modelling chronic pain with CCI. This observation represents the basis of the work where the authors identify the sphingosine receptor S1PR1 as down-regulated in the dentate gyrus of susceptible animals and demonstrate through an elegant range of experiments involving AAV-mediated knockdown or overexpression of S1PR1 that this receptor is involved in the memory impairment observed with chronic pain. Importantly for translational purposes, they also show that activation of S1PR1 through a pharmacological paradigm is able to rescue the memory impairment phenotype.

      The authors also link reduced dendritic branching and a reduced number of mature excitatory synapses in the DG to the memory phenotype.

      They then proceed to explore possible mechanisms downstream of S1PR1 that could explain this reduction in dendritic spines. They identify integrin α2 as an interactor of S1PR1 and show a reduction in several proteins involved in actin dynamics, which are crucial for dendritic spine formation and plasticity.

      They thus hypothesize that the interaction between S1PR1 and Integrin α2 is fundamental for the activation of Rac1 and Cdc42 and consequently for the polymerisation of actin; a reduction in this pathway upon chronic pain would thus lead to impaired actin polymerisation, synapse formation, and thus impaired memory.

      The work is of great interest and the experiments are of very good quality with results of great importance. I have however some concerns. The main concern I have relates to the last part of the work, namely Figures 8 and 9, which I feel are not at the same level as the results presented in the previous 7 Figures, which are instead outstanding.

      In particular:

      - In Figure 8, given the reduction in all the proteins tested, the authors need to check some additional proteins as controls. One good candidate could be RhoA, considering the authors say it is activated by S1PR2 and not by S1PR1;

      Thanks for your suggestion. We tested the expression level of RhoA in mice 7 days and 21 days post CCI as negative controls (Supplemental Figure 9).

      - In addition to the previous point, could the authors also show that the number of neurons is not grossly different between susceptible and unsusceptible mice? This could be done by simply staining for NeuN or performing a western blot for a neuronal-specific protein (e.g. Map2 or beta3-tubulin);

      As suggested, we performed immunofluorescence using NeuN antibody to detect the number of neurons in susceptible and unsusceptible mice. The number is not significantly different between the two populations (Supplementary Figure 7).

      - In Figure 8, the authors should also evaluate the levels of activated RAC1 and activated Cdc42, which are much more important than just basal levels of the proteins to infer an effect on actin dynamics. This is possible through kits that use specific adaptors to pulldown GTP-Rac1 and GTP-Cdc42;

      Thanks for your constructive suggestion. Both an elevated level and hyperactivation of Rac1 protein are associated with actin dynamics and dendritic development [1]. We agree that the levels of activated Rac1 would be a better basis for inferring its effect on actin dynamics. The purpose of the experiment in Figure 8, however, is to show that the levels of proteins related to actin organization change with the expression level of S1PR1, and thus to conclude that actin organization was disrupted, not to emphasize specifically that S1PR1 activated these proteins. We apologize for the confusion, but we think the current data are sufficient to support this conclusion.

      Thanks again for your advice. Your understanding is greatly appreciated.

      - In Figure 9C, the experiment is performed in an immortalised cell line. I feel this needs to be performed at least in primary hippocampal neurons;

      Thanks for your suggestion. As suggested, we performed the experiment in primary hippocampal neurons. Knockdown of S1pr1 in primary hippocampal neurons reduced the number of branches and the amount of filamentous actin. Please refer to the updated Figure 9C.

      - In Figure 9D, the authors use a Yeast two-hybrid system to demonstrate the interaction between S1PR1 and Integrin α2. However, as the yeast two-hybrid system is based on the proximity of the GAL4 activating domain and the GAL4 binding domain, which are used to activate the transcription of reporter genes, the system is not often used when probing the interaction between transmembrane proteins. Could the authors use other transmembrane proteins as negative controls?;

      Thanks for your question. We apologize for the unclear description in the Methods. The traditional yeast two-hybrid system can only detect protein interactions that occur in the nucleus and cannot detect interactions between membrane proteins. Here, we used the split-ubiquitin membrane-based yeast two-hybrid system. Briefly, ubiquitin, a 76-residue protein that targets proteins for proteasomal degradation, is split into two halves, Cub at the C-terminus and NubG at the N-terminus, which are fused to the bait protein (“Bait”) and the prey protein (“Prey”), respectively. Cub is additionally fused to a transcription factor. If Bait and Prey bind, Cub and NubG are brought together and reconstitute a ubiquitin moiety, which is recognized by ubiquitin-specific proteases; the fused transcription factor is then cleaved off and enters the nucleus to activate expression of the reporter gene. We then determine whether Bait and Prey interact from the growth of the yeast.

      Thanks again for pointing this out. We have reworded the method in the Materials and Methods (Lines 678-696).

      - In Figure 9E, the immunoblot is very unconvincing. The bands in the inputs are very weak for both ITGA2 and S1PR1, the authors do not show the enrichment of S1PR1 upon its immunoprecipitation and the band for ITGA2 in the IP fraction has a weird appearance. Were these experiments performed on DG lysates only? If so, I suggest the authors repeat the experiment using the whole brain (or at least the whole hippocampus) so as to have more starting material. Alternatively, if this doesn't work, or in addition, they could also perform the immunoprecipitation in heterologous cells overexpressing the two proteins;

      Thanks for the question and suggestion. We used lysates from both dentate gyri of a single mouse as the starting material. We have updated the result, which now shows clearer bands (Figure 9E).

      - About the point above, even if the results were convincing, the authors can't say that they demonstrate an interaction in vivo. In co-IP experiments, the interaction is much more likely to occur in the lysate during the incubation period rather than being conserved from the in vivo state. These co-IPs demonstrate the ability of proteins to interact, not necessarily that they do it in vivo. If the authors wanted to demonstrate this, they could perform a Proximity ligation assay in primary hippocampal neurons, using antibodies against S1PR1 and ITGA2.

      Thanks for your concern. Co-immunoprecipitation (Co-IP) is the gold standard for identifying protein-protein interactions [2], and it is one of the most efficient techniques for studying these interactions in vivo [3]. We repeated the experiment and followed the procedure exactly to avoid artifactual interactions caused by over-incubation. Over-incubation, particularly at room temperature, may result in non-specific binding and therefore high background; we thus performed the Co-IPs at 4°C to preserve protein interactions. We agree that the proximity ligation assay is better suited to studies of endogenously expressed proteins in primary cells [4]. Because we optimized the procedure to avoid non-specific binding and, in particular, because the Co-IP used proteins from DG lysates, which can validate the specificity of the interaction in native tissue, we prefer to keep the Co-IP result in Figure 9E.

      Thanks again for your suggestion. We appreciate your understanding on this matter.

      - In Figure 9H, could the authors increase the N to see if shItga2 causes further KD in the CCI?

      As suggested, we repeated the experiment and increased the N to 6. As shown in the following picture, shItga2 did not cause further KD in the CCI.

      Author response image 1.

      - To conclusively demonstrate that S1PR1 and ITGA2 participate in the same pathway, they could show that knocking down the two proteins at the same time does not have additive effects on behavioral tests compared to the knockdown of each one of them in isolation.

      Thanks for your suggestion. As suggested, we knocked down the two proteins at the same time and did not observe additive effects on behavioral tests compared with knockdown of either protein alone. Please refer to Figure 9L-O.

      Other major concerns:

      - Supplementary Figure 5: the image showing colocalisation between S1PR1 and CamKII is not very convincing. Is the S1PR1 antibody validated on Knockout or knockdown in immunostaining?;

      S1PR1 is a membrane receptor and the S1P1 antibody (PA1-1040, Invitrogen) shows membranous staining with diffuse dot-like signals (Please refer to the image “A” provided by ThermoFisher Scientific). Here, we utilized the antibody to detect the expression of S1PR1 in DG granule cells. We can see the diffuse dot-like signals aggregated in each single granule cell. CaMKII shows intense staining around the border of the granule cell soma (Image “B”) [5]. According to the images shown in Supplementary Figure 5B, we concluded that S1PR1 is expressed in CaMKII+ cells.

      Besides, as suggested, we validated the S1PR1 antibody on knockdown in immunostaining (Image “C” and “D”). The expression of S1PR1 is significantly decreased compared with the control.

      Author response image 2.

      - It would be interesting to check S1PR2 levels as a control in CCI-chronic animals;

      As suggested, we quantified the S1PR2 levels in Sham and CCI animals, and there is no significant difference between groups (Supplementary Figure 9).

      - Figure 1: I am a bit concerned about the Ns in these experiments. In the chronic pain experiments, the N for Sham is around 8 whereas is around 20 for CCI animals. Although I understand higher numbers are necessary to see the susceptible and unsusceptible populations, I feel that then the same number of Sham animals should be used;

      Thanks for your concern. In the preliminary experiment, we noticed that the ratio of susceptible and unsusceptible populations is around 1:1. After the behavioral tests, we need to further take samples to investigate molecular and cellular changes of each group. Thus, we set sham around 8 and CCI around 20 to ensure that after characterization into susceptible and unsusceptible groups, each group has relatively equal numbers for further investigations.

      - Figures 1E and 1G have much higher Ns than the other panels. Why is that? If they have performed this high number of animals why not show them in all panels?;

      Thanks for your concern. For Figure 1B, C, D and F, we showed the data from a single batch of experiments, while for Figure 1E and 1G, we used data pooled from all batches. By showing data from a single batch, we aim to demonstrate that the ratio of susceptible to unsusceptible animals is relatively stable and does not depend on a large sample size.

      - In the experiments where viral injection is performed, the authors should show a zoomed-out image of the brain to show the precision of the injection and how spread the expression of the different viruses was;

      As suggested, we showed the zoomed-out image in Supplementary Figure 6. The viruses are mainly expressed in the hippocampal DG.

      - The authors should check if there is brain inflammation in CCI chronic animals. This would be interesting to explain if this could be the trigger for the effects seen in neurons. In particular, the authors should check astrocytes and microglia. This is of interest also because the pathways altered in Figure 8A are related to viral infection.

      - If the previous point shows increased brain inflammation, it would be interesting for the authors to check whether a prolonged anti-inflammatory treatment in CCI animals administered before the insurgence of memory impairment could stop it from happening;

      - In addition, the authors should speculate on what could be the signal that can induce these molecular changes starting from the site of injury;

      - Also, as the animals are all WT, the authors should speculate on what could render some animals prone to have memory impairments and others resistant.

      Thanks for the above four suggestions. We have observed inflammation, including T cell infiltration and microglia activation, in the hippocampal DG of CCI chronic animals, and we have used an S1PR1 modulator, which counteracts lymphocyte-mediated inflammation, to prevent the onset of memory impairment. We also examined changes in the numbers of peripheral T-lymphocyte subsets and in serum cytokine levels. Furthermore, we found a neuron-microglia dialogue in the DG that may promote resilience to memory impairment in CCI animals. Since these results are unpublished, we apologize that we cannot provide more detailed information publicly at this stage. We will publish these data as soon as possible. Thanks for your understanding.

      Reviewer #2 (Public Review):

      Summary:

      The study investigates the molecular mechanisms underlying chronic pain-related memory impairment by focusing on S1P/S1PR1 signaling in the dentate gyrus (DG) of the hippocampus. Through behavioural tests (Y-maze and Morris water maze) and RNA-seq analysis, the researchers segregated chronic pain mice into memory impairment-susceptible and -unsusceptible subpopulations. They discovered that S1P/S1PR1 signaling is crucial for determining susceptibility to memory impairment, with decreased S1PR1 expression linked to structural plasticity changes and memory deficits.

      Knockdown of S1PR1 in the DG induced a susceptible phenotype, while overexpression or pharmacological activation of S1PR1 promoted resistance to memory impairment and restored normal synaptic structure. The study identifies actin cytoskeleton-related pathways, including ITGA2 and its downstream Rac1/Cdc42 signaling, as key mediators of S1PR1's effects, offering new insights and potential therapeutic targets for chronic pain-related cognitive dysfunction.

      This manuscript consists of a comprehensive investigation and significant findings. The study provides novel insights into the molecular mechanisms of chronic pain-related memory impairment, highlighting the critical role of S1P/S1PR1 signaling in the hippocampal dentate gyrus. The clear identification of S1P/S1PR1 as a potential therapeutic target offers promising avenues for future research and treatment strategies. The manuscript is well-structured, methodologically sound, and presents valuable contributions to the field.

      Strengths:

      (1) The manuscript is well-structured and written in clear, concise language. The flow of information is logical and easy to follow.

      (2) The segregation of mice into memory impairment-susceptible and -unsusceptible subpopulations is innovative and well-justified. The statistical analyses are robust and appropriate for the data.

      (3) The detailed examination of S1PR1 expression and its impact on synaptic plasticity and actin cytoskeleton reorganization is impressive. The findings are significant and contribute to the understanding of chronic pain-related memory impairment.

      Weaknesses:

      (1) Results: While the results are comprehensive, some sections are data-heavy and could be more reader-friendly with summarized key points before diving into detailed data.

      Thanks for the suggestion. We now open each part/paragraph with a statement summarising what the following experiments investigate, to make the Results more reader-friendly. These sentences are marked in blue in the main text.

      (2) Discussion: There is a need for a more balanced discussion regarding the limitations of the study. For example, addressing potential biases in the animal model or limitations in the generalizability of the findings to humans would strengthen the discussion. Also, providing specific suggestions for follow-up studies would be beneficial.

      As suggested, we discussed more on the limitations of this study and outlined some directions for future research (Line 481-498).

      (3) Conclusion: The conclusion, while concise, could better highlight the study's broader impact on the field and potential clinical implications.

      Thanks. We reworded the conclusion to better highlight the impacts of this study (Line 501-505).

      Reviewer #3 (Public Review):

      Summary of the Authors' Objectives:

      The authors aimed to delineate the role of S1P/S1PR1 signaling in the dentate gyrus in the context of memory impairment associated with chronic pain. They sought to understand the molecular mechanisms contributing to the variability in memory impairment susceptibility and to identify potential therapeutic targets.

      Major Strengths and Weaknesses of the Study:

      The study is methodologically robust, employing a combination of RNA-seq analysis, viral-mediated gene manipulation, and pharmacological interventions to investigate the S1P/S1PR1 pathway. The use of both knockdown and overexpression approaches to modulate S1PR1 levels provides compelling evidence for its role in memory impairment. The research also benefits from a comprehensive assessment of behavioral changes associated with chronic pain.

      However, the study has some weaknesses. The categorization of mice into 'susceptible' and 'unsusceptible' groups based on memory performance requires further validation. Additionally, the reliance on a single animal model may limit the generalizability of the findings. The study could also benefit from a more detailed exploration of the impact of different types of pain on memory impairment.

      Assessment of the Authors' Achievements:

      The authors successfully identified S1P/S1PR1 signaling as a key factor in chronic pain-related memory impairment and demonstrated its potential as a therapeutic target. The findings are supported by rigorous experimental evidence, including biochemical, histological, and behavioral data. However, the study's impact could be enhanced by further exploration of the molecular pathways downstream of S1PR1 and by assessing the long-term effects of S1PR1 manipulation.

      Impact on the Field and Utility to the Community:

      This study is likely to have a significant impact on pain research by providing a novel perspective on the mechanisms underlying memory impairment in chronic pain conditions. The identification of the S1P/S1PR1 pathway as a potential therapeutic target could guide the development of new treatments.

      Additional Context for Readers:

      The study's approach to categorizing susceptibility to memory impairment could inspire new methods for stratifying patient populations in clinical settings.

      Recommendations:

      (1) A more detailed explanation of the k-means clustering algorithm and its application in categorizing mice should be provided.

      As suggested, we explained the k-means clustering algorithm in detail (Line 697-711).
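      For readers unfamiliar with the method, the following is a minimal sketch of how k-means with two clusters could separate animals into two groups. The feature choice (a Y-maze index and a Morris water maze quadrant percentage), the scores, and the initialization are hypothetical illustrations and are not taken from the manuscript.

```python
# Minimal k-means sketch (k = 2) on hypothetical behavioral scores.
# Feature choice, values, and initialization are illustrative assumptions,
# not the authors' actual data or pipeline.
import math


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:  # converged: centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical (Y-maze index, MWM target-quadrant %) scores for six CCI mice.
scores = [(0.62, 41.0), (0.58, 38.5), (0.65, 43.0),
          (0.35, 22.0), (0.31, 20.5), (0.38, 24.0)]
# Deterministic start: seed with one high-scoring and one low-scoring animal.
labels, centroids = kmeans(scores, centroids=[scores[0], scores[4]])
groups = ["unsusceptible" if lab == 0 else "susceptible" for lab in labels]
```

      In this sketch the two features are left on their raw scales, so the larger-valued feature dominates the distance; in practice, features would usually be standardized before clustering.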

      (2) The discussion on the potential influence of different pain types or sensitivities on memory impairment should be expanded.

      Thanks for your suggestion. We discussed this point in the limitations of this study (Line 484-491).

      (3) The protocol for behavioral testing should be clarified and the potential for learning or stress effects should be addressed.

      Thanks for your suggestion. We clarified the order of the battery of behavioral tests in this study (Line 537-542). We start with the least stressful test (Y-maze) and leave the most stressful (Morris water maze) for last [6]. In addition, we conducted behavioral assays to show that a one-day rest is enough to reduce carryover effects from the prior test (Y-maze). We examined stress-related behaviors one day after the Y-maze (23 d post CCI) using the open field test (OFT) and elevated plus maze (EPM). As shown in Author response image 3, these tests gave no indication that the mice were stressed. Thus, the order in which the tests were performed is appropriate for this study.

      Author response image 3.

      (4) Conduct additional behavioral assays for other molecular targets implicated in the study.

      We agree that the effects of other molecular targets on susceptibility to memory impairment would be interesting to examine. Our study was designed to focus specifically on ITGA2, and we would like to keep that focus intact, but we have included your point as a consideration for future study (Lines 496-498). Thank you for the suggestion.

      (5) The effective drug thresholds and potential non-specific effects of pharmacological interventions should be discussed in more detail.

      As suggested, we emphasized this point of drug SEW2871 in Line 242-245.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Minor concerns:

      - In Figure 6E the lines of the different groups are not visible. Showing the errors as error bars for each point would probably be better;

      We apologize for the mistake of using mean±SD here instead of mean±SEM. After changing to mean±SEM, the lines of Figure 6E, Figure 7E and 7L become much clearer. It looks a little bit messy to show the error bars since there are numerous points, so we prefer to keep the line style.
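      The relationship between the two error measures, SEM = SD / sqrt(n), explains why the bands narrow after the change. A minimal sketch with hypothetical values (not data from the manuscript):

```python
# SD vs. SEM for one time point; the values are hypothetical.
import math
import statistics

values = [12.0, 15.0, 11.0, 14.0, 13.0, 13.0]  # e.g. six animals at one time point
n = len(values)
mean = statistics.mean(values)
sd = statistics.stdev(values)   # sample standard deviation (n - 1 denominator)
sem = sd / math.sqrt(n)         # standard error of the mean
```

      Because SEM shrinks with the square root of the group size, switching from SD to SEM narrows the plotted error range without changing the underlying data.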

      - Do the authors have any speculation on why the % time in the quadrant is not further affected in the KD Itga2 in CCI animals (Figure 9K)?;

      In CCI animals, S1PR1 expression is decreased. ITGA2 may participate in the same pathway as S1PR1; thus, knocking down ITGA2 in CCI animals does not further affect behavior. This is supported by the simultaneous knockdown of the two proteins, which produced no additive effects on behavioral tests compared with knockdown of either protein alone (Figure 9L-O).

      - In the methods, it's unclear if in the multiple infusion, the animals were anaesthetised or kept awake;

      We have clarified this point in the Methods: mice were deeply anesthetized with 1% pentobarbital sodium (40 mg/kg, i.p.) (Line 649-650).

      - As the DG is quite small, could the authors clarify if, when performing western blots, they used the two DGs from one animal for each sample or if they pulled together the DGs of several animals?;

      We used the two DGs from one animal for each sample. The amount of protein extracted from each sample is enough for 20-30 Western blot assays. We have now added this to the Methods for clarity (Line 612).

      - Is it possible to check the correlation between performance in the YM and MWM with S1PR1 levels?;

      We would also be interested in this point. However, our current data cannot address it, because it is difficult to manipulate S1PR1 levels with KD and overexpression viruses.

      - EM images have a poor resolution in the figures, could the authors show higher-resolution images?;

      We have inserted 300 DPI images for high resolution output.

      - In line 268 there is a mention of an "ShLamb1"?

      We apologize for the mistake; it has been corrected.

      Reviewer #3 (Recommendations For The Authors):

      This study explored the role of S1P/S1PR1 signaling within the dentate gyrus (DG) in chronic pain-related memory impairment using a murine model. The authors identified decreased expression of S1PR1 in the DG of mice susceptible to memory deficits. They demonstrated that S1PR1 knockdown increased susceptibility to memory deficits, whereas its overexpression or pharmacological activation mitigated these effects. Further biochemical and immunofluorescence analyses indicated that disruptions in S1P/S1PR1 signaling were related to disruptions in actin cytoskeleton dynamics, influenced by molecular pathways involving ITGA2, Rac1/Cdc42 signaling, and the Arp2/3 complex. These findings offer intriguing insights and suggest a potential therapeutic target for treating memory impairment in chronic pain.

      Major Concerns:

      The following five major concerns are the same as the five recommendations from Reviewer 3 on Page 9-10. Please refer to the answers above.

      (1) The division of subjects into 'susceptible' and 'unsusceptible' categories requires further clarification regarding the methodologies and rationale employed, particularly concerning the use of the k-means clustering algorithm in data analysis. This explanation will strengthen the scientific grounding of the categorization process.

      (2) The categorization of 'susceptible' and 'unsusceptible' groups might also benefit from a more detailed analysis or discussion concerning the influence of different pain sensitivities or types of pain assessments. Although the study mentions that memory impairment stands independent of pain thresholds, a more nuanced exploration could provide deeper insights.

      (3) The article could benefit from more clarity on the protocol of behavioral testing, especially regarding the potential effects of repeated testing on performance outcomes due to learning or stress.

      (4) While the connection between S1P/S1PR1 signaling and the molecular pathways highlighted (ITGA2, Rac1/Cdc42, Arp2/3) is intriguing, only ITGA2 underwent further behavioral validation in vivo. Conducting additional behavioral assays for one or more of the molecular targets could substantially strengthen these findings.

      (5) Discussions regarding effective drug thresholds and the potential for non-specific effects are essential to fully evaluate the implications of pharmacological interventions utilized in the study.

      Minor Concerns:

      (1) Clarification of evidence of the specific infusion sites in pharmacological experiments would enhance the transparency and replicability of these methods.

      For infusion of the S1PR1 agonist, a guide cannula (internal diameter 0.34 mm, RWD) was unilaterally implanted into the DG of the hippocampus (-1.3 A/P, -1.95 M/L, and -2.02 D/V), as evidenced by Figure 5B.

      (2) It would be beneficial if the manuscript provided details regarding the efficiency and reach of viral transfection within the neuronal population. This information would help in assessing the impact of genetic manipulations.

      S1PR1 immunostaining showed that the efficiency is quite high and the reach of viral transfection is sufficient.

      Author response image 4.

      (3) The manuscript should make explicit the normalization techniques used in quantitative assessments such as Western blotting, including the housekeeping genes or proteins used for this purpose.

      Here, we used housekeeping protein normalization for normalizing Western blot data. GAPDH was used as the internal control. First, the stained blot is imaged, a rectangle is drawn around the target protein in each lane, and the signal intensity inside the rectangle is measured by using ImageJ. The signal intensity obtained can then be normalized by being divided by the signal intensity of the loading internal control (GAPDH) detected on the same blot. The average of the ratios from the control group is calculated, and all individual ratios are divided by this average to obtain a new set of values, which represent the normalized values (Line 619-625).
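      The two-step normalization described above can be sketched as follows. The function name, the band intensities, and the choice of control lanes are hypothetical and serve only to illustrate the arithmetic.

```python
# Two-step Western blot normalization sketch; all numbers are hypothetical.
def normalize_blot(target, gapdh, control_idx):
    """Return target/GAPDH ratios rescaled so the control-group mean is 1."""
    # Step 1: divide each target band by the GAPDH band from the same lane.
    ratios = [t / g for t, g in zip(target, gapdh)]
    # Step 2: divide every ratio by the mean ratio of the control lanes.
    control_mean = sum(ratios[i] for i in control_idx) / len(control_idx)
    return [r / control_mean for r in ratios]


# Lanes 0-1: control group; lanes 2-3: experimental group (assumed layout).
target = [1000.0, 1200.0, 500.0, 600.0]   # ImageJ intensities, target protein
gapdh = [2000.0, 2000.0, 2000.0, 2000.0]  # ImageJ intensities, GAPDH
normalized = normalize_blot(target, gapdh, control_idx=[0, 1])
```

      With this scaling the control group averages exactly 1, so each experimental value reads directly as a fold change relative to control.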

      (4) Details confirming that the control groups in behavioral assessments were subjected to handling and experimental conditions comparable to those of the chronic pain groups, barring nerve injury, are crucial for maintaining the integrity of the comparative analysis.

      We agree that the control and experimental groups should be identical in all respects except one: nerve injury. We have added this point to the Methods (Line 520-522).

      Minor Recommendations:

      The following four minor recommendations are the same as the four minor concerns from Reviewer 3 on Page 12-13. Please refer to the answers above.

      (1) Clarify the specifics of infusion site verification in pharmacological experiments.

      (2) Provide details on the efficiency and neuronal reach of viral transfections.

      (3) Explicitly describe the normalization techniques used in quantitative assessments.

      (4) Ensure that control groups in behavioral assessments undergo comparable handling to maintain analysis integrity.

      References

      (1) Gualdoni, S., et al., Normal levels of Rac1 are important for dendritic but not axonal development in hippocampal neurons. Biology of the Cell, 2007. 99(8): p. 455-464.

      (2) Alam, M.S., Proximity Ligation Assay (PLA). Curr Protoc Immunol, 2018. 123(1): p. e58.

      (3) Song, P., S. Zhang, and J. Li, Co-immunoprecipitation Assays to Detect In Vivo Association of Phytochromes with Their Interacting Partners. Methods Mol Biol, 2021. 2297: p. 75-82.

      (4) Krieger, C.C., et al., Proximity ligation assay to study TSH receptor homodimerization and crosstalk with IGF-1 receptors in human thyroid cells. Frontiers in Endocrinology, 2022. 13.

      (5) Arruda-Carvalho, M., et al., Conditional Deletion of α-CaMKII Impairs Integration of Adult-Generated Granule Cells into Dentate Gyrus Circuits and Hippocampus-Dependent Learning. The Journal of Neuroscience, 2014. 34(36): p. 11919-11928.

      (6) Wolf, A., et al., A Comprehensive Behavioral Test Battery to Assess Learning and Memory in 129S6/Tg2576 Mice. PLoS One, 2016. 11(1): p. e0147733.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Responses to Editors:

      We appreciate the editors’ concern regarding the difficulty of disentangling the contributions of tightly-coupled brain regions to the speech-gesture integration process, particularly given the close temporal and spatial proximity of the stimulation windows and the potential for prolonged disruption. We agree that stimulation techniques such as transcranial magnetic stimulation (TMS) can evoke or modulate neuronal activity both locally within the target region and in remote connected areas of the network, and that this complex interaction makes it more challenging to draw clear conclusions about the causal relationship between stimulation and cognitive function. However, we believe that cause-and-effect relationships in cognitive neuroscience studies using non-invasive brain stimulation (NIBS) can still be robustly established if key assumptions are explicitly tested and confounding factors are rigorously controlled (Bergmann & Hartwigsen, 2021, J Cogn Neurosci).

      In our experiment, we addressed these concerns by including a sham TMS condition, an irrelevant control task, and multiple control time points. The results showed that TMS selectively disrupted the IFG-pMTG interaction during specific time windows of the task related to gesture-speech semantic congruency, but not in the sham TMS condition or the control task (gender congruency effect) (Zhao et al., 2021, JN). This selective disruption provides strong evidence for a causal link between IFG-pMTG connectivity and gesture-speech integration in the targeted time window.

      Regarding the potential for transient artifacts from TMS, we acknowledge that previous research has demonstrated that single-pulse TMS induces brief artifacts (0–10 ms) due to direct depolarization of cortical neurons, which momentarily disrupts electrical activity in the stimulated area (Romero et al., 2019, NC). However, in the case of paired-pulse TMS (ppTMS), the interaction between the first and second pulses is more complex. The first pulse increases membrane conductance in the target neurons via shunting inhibition mediated by GABAergic interneurons. This effectively lowers neuronal membrane resistance, “leaking” excitatory current and diminishing the depolarization induced by the second pulse, leading to a reduction in excitability during the paired-pulse interval. This mechanism suppresses the excitatory response to the second pulse, which is reflected in a reduced motor evoked potential (MEP) (Paulus & Rothwell, 2016, J Physiol).

      Furthermore, ppTMS has been widely used in previous studies to infer causal temporal relationships and explore the neural contributions of both structurally and functionally connected brain regions, across timescales as brief as 3–60 ms. We have reviewed several studies that employed paired-pulse TMS to investigate neural dynamics in regions such as the tongue and lip areas of the primary motor cortex (M1), as well as high-level semantic regions like the pMTG, PFC, and ATL (Table 1). These studies consistently demonstrate the methodological rigor and precision of double-pulse TMS in elucidating the temporal dynamics between different brain regions within short temporal windows.

      Given these precedents and the evidence provided, we respectfully assert the validity of the methods employed in our study. We therefore kindly request the editors to reconsider the assessment that “the methods are insufficient for studying tightly-coupled brain regions over short timescales.” We hope that the editors’ concerns about the complexities of TMS-induced effects have been adequately addressed, and that our study’s design and results provide a clear and convincing causal argument for the role of IFG-pMTG in gesture-speech integration.

      Author response table 1.

      Double-pulse TMS studies on brain regions over 3–60 ms time intervals

      Reference

      Teige, C., Mollo, G., Millman, R., Savill, N., Smallwood, J., Cornelissen, P. L., & Jefferies, E. (2018). Dynamic semantic cognition: Characterising coherent and controlled conceptual retrieval through time using magnetoencephalography and chronometric transcranial magnetic stimulation. Cortex, 103, 329-349.

      Amemiya, T., Beck, B., Walsh, V., Gomi, H., & Haggard, P. (2017). Visual area V5/hMT+ contributes to perception of tactile motion direction: a TMS study. Scientific reports, 7(1), 40937.

      Muessgens, D., Thirugnanasambandam, N., Shitara, H., Popa, T., & Hallett, M. (2016). Dissociable roles of preSMA in motor sequence chunking and hand switching—a TMS study. Journal of Neurophysiology, 116(6), 2637-2646.

      Vernet, M., Brem, A. K., Farzan, F., & Pascual-Leone, A. (2015). Synchronous and opposite roles of the parietal and prefrontal cortices in bistable perception: a double-coil TMS–EEG study. Cortex, 64, 78-88.

      Pitcher, D. (2014). Facial expression recognition takes longer in the posterior superior temporal sulcus than in the occipital face area. Journal of Neuroscience, 34(27), 9173-9177.

      Bardi, L., Kanai, R., Mapelli, D., & Walsh, V. (2012). TMS of the FEF interferes with spatial conflict. Journal of cognitive neuroscience, 24(6), 1305-1313.

      D’Ausilio, A., Bufalari, I., Salmas, P., & Fadiga, L. (2012). The role of the motor system in discriminating normal and degraded speech sounds. Cortex, 48(7), 882-887.

      Pitcher, D., Duchaine, B., Walsh, V., & Kanwisher, N. (2010). TMS evidence for feedforward and feedback mechanisms of face and body perception. Journal of Vision, 10(7), 671-671.

      Gagnon, G., Blanchet, S., Grondin, S., & Schneider, C. (2010). Paired-pulse transcranial magnetic stimulation over the dorsolateral prefrontal cortex interferes with episodic encoding and retrieval for both verbal and non-verbal materials. Brain Research, 1344, 148-158.

      Kalla, R., Muggleton, N. G., Juan, C. H., Cowey, A., & Walsh, V. (2008). The timing of the involvement of the frontal eye fields and posterior parietal cortex in visual search. Neuroreport, 19(10), 1067-1071.

      Pitcher, D., Garrido, L., Walsh, V., & Duchaine, B. C. (2008). Transcranial magnetic stimulation disrupts the perception and embodiment of facial expressions. Journal of Neuroscience, 28(36), 8929-8933.

      Til Ole Bergmann, Gesa Hartwigsen; Inferring Causality from Noninvasive Brain Stimulation in Cognitive Neuroscience. J Cogn Neurosci 2021; 33 (2): 195–225. https://doi.org/10.1162/jocn_a_01591

      Romero, M.C., Davare, M., Armendariz, M. et al. Neural effects of transcranial magnetic stimulation at the single-cell level. Nat Commun 10, 2642 (2019). https://doi.org/10.1038/s41467-019-10638-7

      Paulus W, Rothwell JC. Membrane resistance and shunting inhibition: where biophysics meets state-dependent human neurophysiology. J Physiol. 2016 May 15;594(10):2719-28. doi: 10.1113/JP271452. PMID: 26940751; PMCID: PMC4865581.

      Staat, C., Gattinger, N., & Gleich, B. (2022). PLUSPULS: A transcranial magnetic stimulator with extended pulse protocols. HardwareX, 13. https://doi.org/10.1016/j.ohx.2022.e00380

      Zhao, W., Li, Y., and Du, Y. (2021). TMS reveals dynamic interaction between inferior frontal gyrus and posterior middle temporal gyrus in gesture-speech semantic integration. The Journal of Neuroscience, 10356-10364. https://doi.org/10.1523/jneurosci.1355-21.2021.

      Reviewer #1 (Public review):

      Summary:

      The authors quantified information in gesture and speech, and investigated the neural processing of speech and gestures in pMTG and LIFG, depending on their informational content, in 8 different time-windows, and using three different methods (EEG, HD-tDCS and TMS). They found that there is a time-sensitive and staged progression of neural engagement that is correlated with the informational content of the signal (speech/gesture).

      Strengths:

      A strength of the paper is that the authors attempted to combine three different methods to investigate speech-gesture processing.

      We sincerely thank the reviewer for recognizing our efforts in conducting three experiments to explore the neural activity linked to the amount of information processed during multisensory gesture-speech integration. In Experiment 1, we observed that the extent of inhibition in the pMTG and LIFG was closely linked to the overlapping gesture-speech responses, as quantified by mutual information. Building on the established roles of the pMTG and LIFG in our previous study (Zhao et al., 2021, JN), we then expanded our investigation to determine whether the dynamic neural engagement between the pMTG and LIFG during gesture-speech processing was also associated with the quality of the information. This hypothesis was further validated through high-temporal resolution EEG, where we examined ERP components related to varying information contents. Notably, we observed a close time alignment between the ERP components and the time windows of the TMS effects, which were associated with the same informational matrices in gesture-speech processing.

      Weaknesses:

      (1) One major issue is that there is a tight anatomical coupling between pMTG and LIFG. Stimulating one area could therefore also result in stimulation of the other area (see Silvanto and Pascual-Leone, 2008). I therefore think it is very difficult to tease apart the contribution of these areas to the speech-gesture integration process, especially considering that the authors stimulate these regions in time windows that are very close to each other in both time and space (and the disruption might last longer over time).

      Response 1: We greatly appreciate the reviewer’s careful consideration. We trust that the explanation provided above has clarified this issue (see Response to Editors for detail).

      (2) Related to this point, it is unclear to me why the HD-TDCS/TMS is delivered in set time windows for each region. How did the authors determine this, and how do the results for TMS compare to their previous work from 2018 and 2023 (which describes a similar dataset+design)? How can they ensure they are only targeting their intended region since they are so anatomically close to each other?

      Response 2: The current study builds on a series of investigations that systematically examined the temporal and spatial dynamics of gesture-speech integration. In our earlier work (Zhao et al., 2018, J. Neurosci), we demonstrated that interrupting neural activity in the IFG or pMTG using TMS selectively disrupted the semantic congruency effect (reaction time costs due to semantic incongruence), without affecting the gender congruency effect (reaction time costs due to gender incongruence). These findings identified the IFG and pMTG as critical hubs for gesture-speech integration. This informed the brain regions selected for subsequent studies.

      In Zhao et al. (2021, J. Neurosci), we employed a double-pulse TMS protocol, delivering stimulation within one of eight 40-ms time windows, to further examine the temporal involvement of the IFG and pMTG. The results revealed time-window-selective disruptions of the semantic congruency effect, confirming the dynamic and temporally staged roles of these regions during gesture-speech integration.

      In Zhao et al. (2023, Frontiers in Psychology), we investigated the semantic predictive role of gestures relative to speech by comparing two experimental conditions: (1) gestures preceding speech by a fixed interval of 200 ms, and (2) gestures preceding speech at its semantic identification point. We observed time-window-selective disruptions of the semantic congruency effect in the IFG and pMTG only in the second condition, leading to the conclusion that gestures exert a semantic priming effect on co-occurring speech. These findings underscored the semantic advantage of gesture in facilitating speech integration, further refining our understanding of the temporal and functional interplay between these modalities.

      The design of the current study—including the choice of brain regions and time windows—was directly informed by these prior findings. Experiment 1 (HD-tDCS) targeted the entire gesture-speech integration process in the IFG and pMTG to assess whether neural activity in these regions, previously identified as integration hubs, is modulated by changes in informativeness from both modalities (i.e., entropy) and their interactions (mutual information, MI). The results revealed a gradual inhibition of neural activity in both areas as MI increased, evidenced by a negative correlation between MI and the tDCS inhibition effect in both regions. Building on this, Experiments 2 and 3 employed double-pulse TMS and ERPs to further assess whether the engaged neural activity was both time-sensitive and staged. These experiments also evaluated the contributions of various sources of information, revealing correlations between information-theoretic metrics and time-locked brain activity, providing insights into the ‘gradual’ nature of gesture-speech integration.

      We acknowledge that the rationale for the design of the current study was not fully articulated in the original manuscript. In the revised version, we provided a more comprehensive and coherent explanation of the logic behind the three experiments, as well as the alignment with our previous findings in Lines 75-102:

      ‘To investigate the neural mechanisms underlying gesture-speech integration, we conducted three experiments to assess how neural activity correlates with distributed multisensory integration, quantified using information-theoretic measures of MI. Additionally, we examined the contributions of unisensory signals in this process, quantified through unisensory entropy. Experiment 1 employed high-definition transcranial direct current stimulation (HD-tDCS) to administer Anodal, Cathodal and Sham stimulation to either the IFG or the pMTG. HD-tDCS induces membrane depolarization with anodal stimulation and membrane hyperpolarization with cathodal stimulation[26], thereby increasing or decreasing cortical excitability in the targeted brain area, respectively. This experiment aimed to determine whether the overall facilitation (Anodal-tDCS minus Sham-tDCS) and/or inhibition (Cathodal-tDCS minus Sham-tDCS) of these integration hubs is modulated by the degree of gesture-speech integration, as measured by MI.

      Given the differential involvement of the IFG and pMTG in gesture-speech integration, shaped by top-down gesture predictions and bottom-up speech processing [23], Experiment 2 was designed to further assess whether the activity of these regions was associated with relevant informational matrices. Specifically, we applied inhibitory chronometric double-pulse transcranial magnetic stimulation (TMS) to specific temporal windows associated with integration processes in these regions[23], assessing whether the inhibitory effects of TMS were correlated with unisensory entropy or the multisensory convergence index (MI).

      Experiment 3 complemented these investigations by focusing on the temporal dynamics of neural responses during semantic processing, leveraging high-temporal event-related potentials (ERPs). This experiment investigated how distinct information contributors modulated specific ERP components associated with semantic processing. These components included the early sensory effects as P1 and N1–P2[27,28], the N400 semantic conflict effect[14,28,29], and the late positive component (LPC) reconstruction effect[30,31]. By integrating these ERP findings with results from Experiments 1 and 2, Experiment 3 aimed to provide a more comprehensive understanding of how gesture-speech integration is modulated by neural dynamics.’

      Although the IFG and pMTG are anatomically close, the consistent differentiation of their respective roles, as evidenced by our experiment across various time windows (TWs) and supported by previous research (see Response to editors for details), reinforces the validity of the stimulation effect observed in our study.

      References

      Zhao, W.Y., Riggs, K., Schindler, I., and Holle, H. (2018). Transcranial magnetic stimulation over left inferior frontal and posterior temporal cortex disrupts gesture-speech integration. Journal of Neuroscience 38, 1891-1900. 10.1523/Jneurosci.1748-17.2017.

      Zhao, W., Li, Y., and Du, Y. (2021). TMS reveals dynamic interaction between inferior frontal gyrus and posterior middle temporal gyrus in gesture-speech semantic integration. The Journal of Neuroscience, 10356-10364. https://doi.org/10.1523/jneurosci.1355-21.2021.

      Zhao, W. (2023). TMS reveals a two-stage priming circuit of gesture-speech integration. Front Psychol 14, 1156087. 10.3389/fpsyg.2023.1156087.

      Bikson, M., Inoue, M., Akiyama, H., Deans, J.K., Fox, J.E., Miyakawa, H., and Jefferys, J.G.R. (2004). Effects of uniform extracellular DC electric fields on excitability in rat hippocampal slices. J Physiol-London 557, 175-190. 10.1113/jphysiol.2003.055772.

      Federmeier, K.D., Mai, H., and Kutas, M. (2005). Both sides get the point: hemispheric sensitivities to sentential constraint. Memory & Cognition 33, 871-886. 10.3758/bf03193082.

      Kelly, S.D., Kravitz, C., and Hopkins, M. (2004). Neural correlates of bimodal speech and gesture comprehension. Brain and Language 89, 253-260. 10.1016/s0093-934x(03)00335-3.

      Wu, Y.C., and Coulson, S. (2005). Meaningful gestures: Electrophysiological indices of iconic gesture comprehension. Psychophysiology 42, 654-667. 10.1111/j.1469-8986.2005.00356.x.

      Fritz, I., Kita, S., Littlemore, J., and Krott, A. (2021). Multimodal language processing: How preceding discourse constrains gesture interpretation and affects gesture integration when gestures do not synchronise with semantic affiliates. J Mem Lang 117, 104191. 10.1016/j.jml.2020.104191.

      Gunter, T.C., and Weinbrenner, J.E.D. (2017). When to take a gesture seriously: On how we use and prioritize communicative cues. J Cognitive Neurosci 29, 1355-1367. 10.1162/jocn_a_01125.

      Ozyurek, A., Willems, R.M., Kita, S., and Hagoort, P. (2007). On-line integration of semantic information from speech and gesture: Insights from event-related brain potentials. J Cognitive Neurosci 19, 605-616. 10.1162/jocn.2007.19.4.605.

      (3) As the EEG signal is often not normally distributed, I was wondering whether the authors checked the assumptions for their Pearson correlations. The authors could perhaps better choose to model the different variables to see whether MI/entropy could predict the neural responses. How did they correct the many correlational analyses that they have performed?

      Response 3: We greatly appreciate the reviewer’s thoughtful comments.

      (1) Regarding the normality of the EEG signals and the use of Pearson correlation: in Figure 5 of the manuscript, we have already included normal distribution curves to illustrate the relationships between average ERP amplitudes across each ROI or elicited cluster and the three information models.

      Additionally, we performed the Shapiro-Wilk test, a widely accepted method for assessing bivariate normality, on both the MI/entropy and averaged ERP data. The p-values for all three combinations were greater than 0.05, indicating that the sample data from all bivariate combinations were normally distributed (Author response table 2).

      Author response table 2.

      Shapiro-Wilk results of bivariable normality test

      To further consolidate the relationship between entropy/MI and various ERP components, we also conducted a Spearman rank correlation analysis (Author response table 3-5). While the correlation between speech entropy and ERP amplitude in the P1 component yielded a p-value of 0.061, all other results were consistent with those obtained from the Pearson correlation analysis across the three experiments. Therefore, our conclusion that progressive neural responses reflected the degree of information remains robust. Although the Spearman rank and Pearson correlation analyses yielded similar results, we opted to report the Pearson correlation coefficients throughout the manuscript to maintain consistency.

      Author response table 3.

      Comparison of Pearson and Spearman results in Experiment 1

      Author response table 4.

      Comparison of Pearson and Spearman results in Experiment 2

      Author response table 5.

      Comparison of Pearson and Spearman results in Experiment 3

      (2) Regarding the reviewer’s comment ‘choose to model the different variables to see whether MI/entropy could predict the neural responses’, we employed Representational Similarity Analysis (RSA) (Popal et al., 2019) with MI and entropy as continuous variables. This analysis aimed to build a model to predict neural responses based on these feature metrics.

      To capture dynamic temporal features indicative of different stages of multisensory integration, we segmented the EEG data into overlapping time windows (40 ms in duration with a 10 ms step size). The 40 ms window was chosen based on the TMS protocol used in Experiment 2, which also employed a 40 ms time window. The 10 ms step size (equivalent to 5 time points) was used to detect subtle shifts in neural responses that might not be captured by larger time windows, allowing for a more granular analysis of the temporal dynamics of neural activity.
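The segmentation described above can be sketched in a few lines. The 500 Hz sampling rate follows from the statement that 10 ms equals 5 time points; the epoch length of 500 samples (1000 ms) is an assumption made here because it yields 97 windows of 20 samples each, which matches the matrix dimensions reported in the next paragraph. The function name `sliding_windows` is hypothetical.

```python
def sliding_windows(n_samples, win_len, step):
    """Return (start, stop) sample indices of every full window in the epoch."""
    return [(start, start + win_len)
            for start in range(0, n_samples - win_len + 1, step)]


SFREQ_HZ = 500            # implied by "10 ms = 5 time points"
WIN_MS, STEP_MS = 40, 10  # window length and step size from the text
N_SAMPLES = 500           # ASSUMPTION: 1000 ms epoch, not stated in the text

win_len = WIN_MS * SFREQ_HZ // 1000   # 20 samples per window
step = STEP_MS * SFREQ_HZ // 1000     # 5 samples between window onsets

windows = sliding_windows(N_SAMPLES, win_len, step)
print(len(windows), windows[0], windows[-1])
```

Because consecutive windows share 15 of their 20 samples, neighbouring windows are strongly dependent, which is one reason a permutation-based significance test is a natural companion to this design.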

      Following segmentation, the EEG data were reshaped into a four-dimensional matrix (42 channels × 20 time points × 97 time windows × 20 features). To construct a neural similarity matrix, we averaged the EEG data across time points within each channel and each time window. The resulting matrix was then processed using the pdist function to compute pairwise distances between adjacent data points. This allowed us to calculate correlations between the neural matrix and three feature similarity matrices, which were constructed in a similar manner. These three matrices corresponded to (1) gesture entropy, (2) speech entropy, and (3) mutual information (MI). This approach enabled us to quantify how well the neural responses corresponded to the semantic dimensions of gesture and speech stimuli at each time window.

      To determine the significance of the correlations between neural activity and feature matrices, we conducted a permutation test with 1000 permutations. In this procedure, we randomized the data or feature matrices and recalculated the correlations repeatedly, generating a null distribution against which the observed correlation values were compared. A correlation was considered statistically significant if the observed value exceeded the null distribution threshold (p < 0.05). This permutation approach helps mitigate the risk of spurious correlations, ensuring that the relationships between the neural data and feature matrices are both robust and meaningful.

      Finally, significant correlations were subjected to clustering analysis, which grouped similar neural response patterns across time windows and channels. This clustering allowed us to identify temporal and spatial patterns in the neural data that consistently aligned with the semantic features of gesture and speech stimuli, thus revealing the dynamic integration of these multisensory modalities across time. Results are as follows:

      (1) Two significant clusters were identified for gesture entropy (Author response image 1 left). The first cluster was observed between 60-110 ms (channels F1 and F3), with correlation coefficients (r) ranging from 0.207 to 0.236 (p < 0.001). The second cluster was found between 210-280 ms (channel O1), with r-values ranging from 0.244 to 0.313 (p < 0.001).

      (2) For speech entropy (Author response image 1 middle), significant clusters were detected in both early and late time windows. In the early time windows, the largest significant cluster was found between 10-170 ms (channels F2, F4, F6, FC2, FC4, FC6, C4, C6, CP4, and CP6), with r-values ranging from 0.151 to 0.340 (p = 0.013), corresponding to the P1 component (0-100 ms). In the late time windows, the largest significant cluster was observed between 560-920 ms (across the whole brain, all channels), with r-values ranging from 0.152 to 0.619 (p = 0.013).

      (3) For mutual information (MI) (Author response image 1 right), a significant cluster was found between 270-380 ms (channels FC1, FC2, FC3, FC5, C1, C2, C3, C5, CP1, CP2, CP3, CP5, FCz, Cz, and CPz), with r-values ranging from 0.198 to 0.372 (p = 0.001).

      Author response image 1.

      Results of RSA analysis.

      These additional findings suggest that even using a different modeling approach, neural responses, as indexed by feature metrics of entropy and mutual information, are temporally aligned with distinct ERP components and ERP clusters, as reported in the current manuscript. This alignment serves to further consolidate the results, reinforcing the conclusion we draw. Considering the length of the manuscript, we did not include these results in the current manuscript.

      (3) In terms of the correction of multiple comparisons, in Experiment 1, two separate participant groups were recruited for HD-tDCS applied over either the IFG or pMTG. FDR correction was performed separately for each group, resulting in six comparisons for each brain region (three information matrices × two tDCS effects: anodal-sham or cathodal-sham). In Experiment 2, six comparisons (three information matrices × two sites: IFG or pMTG) were submitted for FDR correction. In Experiment 3, FDR correction was applied to the seven regions of interest (ROIs) within each component, resulting in five comparisons.
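As an illustration of how an FDR correction over six comparisons operates, the sketch below implements the Benjamini-Hochberg step-up procedure. The response does not name the FDR method, so the choice of Benjamini-Hochberg is an assumption, as are the function name `fdr_bh` and the p-values, which are invented for the example.

```python
def fdr_bh(p_values, alpha=0.05):
    """Benjamini-Hochberg procedure; returns True where H0 is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) whose p-value is <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            reject[idx] = True
    return reject


# Hypothetical p-values for six comparisons, e.g. three information
# measures x two stimulation effects. Not the study's results.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.60]
decisions = fdr_bh(p_values)
print(decisions)
```

Note that the three p-values just below 0.05 do not survive here, which shows why the number of comparisons entering each correction family matters for interpretation.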

      Reference:

      Wilk, M.B. (2015). The Shapiro Wilk And Related Tests For Normality.

      Popal, H., Wang, Y., & Olson, I. R. (2019). A guide to representational similarity analysis for social neuroscience. Social cognitive and affective neuroscience, 14(11), 1243-1253.

      (4) The authors use ROIs for their different analyses, but it is unclear why and on the basis of what these regions are defined. Why not consider all channels without making them part of an ROI, by using a method like the one described in my previous comment?

      Response 4: For the EEG data, we conducted both a traditional ROI analysis and a cluster-based permutation approach. The ROIs were defined based on well-established work (Habets et al., 2011), allowing for hypothesis-driven testing of specific regions. In addition, we employed a cluster-based permutation method, which is data-driven and helps enhance robustness while addressing multiple comparisons. This method serves as a complement to the hypothesis-driven ROI analysis, offering an exploratory, unbiased perspective. Notably, the results from both approaches were consistent, reinforcing the reliability of our findings.

      To make the methods more accessible to a broader audience, we clarified the relationship between these approaches in the revised manuscript in Lines 267-270: ‘To consolidate the data, we conducted both a traditional region-of-interest (ROI) analysis, with ROIs defined based on a well-established work[40], and a cluster-based permutation approach, which utilizes data-driven permutations to enhance robustness and address multiple comparisons.’

      Additionally, we conducted an RSA analysis without defining specific ROIs, considering all channels in the analysis. This approach yielded consistent results, further validating the robustness of our findings across different analysis methods. See Response 3 for detail.

      Reference:

      Habets, B., Kita, S., Shao, Z.S., Ozyurek, A., and Hagoort, P. (2011). The Role of Synchrony and Ambiguity in Speech-Gesture Integration during Comprehension. J Cognitive Neurosci 23, 1845-1854. 10.1162/jocn.2010.21462

      (5) The authors describe that they have divided their EEG data into a "lower half" and a "higher half" (lines 234-236), based on entropy scores. It is unclear why this is necessary, and I would suggest just using the entropy scores as a continuous measure.

      Response 5: To identify ERP components or spatiotemporal clusters that demonstrated significant semantic differences, we split each model into higher and lower halves based on entropy scores. This division allowed us to capture distinct levels of information processing and explore how different levels of entropy or mutual information (MI) related to neural activity. Specifically, the goal was to highlight the gradual activation process of these components and clusters as they correlate with changes in information content. Remarkably, consistent results were observed between the ERP components and clusters, providing robust evidence that semantic information conveyed through gestures and speech significantly influenced the amplitude of these components or clusters. Moreover, the semantic information was shown to be highly sensitive, varying in tandem with these amplitude changes.

      Reviewer #2 (Public review):

      Comment:

      Summary:

      The study is an innovative and fundamental study that clarified important aspects of brain processes for integration of information from speech and iconic gesture (i.e., gesture that depicts action, movement, and shape), based on tDCS, TMS, and EEG experiments. They evaluated their speech and gesture stimuli in information-theoretic ways and calculated how informative speech is (i.e., entropy), how informative gesture is, and how much shared information speech and gesture encode. The tDCS and TMS studies found that the left IFG and pMTG, the two areas that were activated in fMRI studies on speech-gesture integration in the previous literature, are causally implicated in speech-gesture integration. The size of tDCS and TMS effects are correlated with the entropy of the stimuli or mutual information, which indicates that the effects stem from the modulation of information decoding/integration processes. The EEG study showed that various ERP (event-related potential, e.g., N1-P2, N400, LPC) effects that have been observed in speech-gesture integration experiments in the previous literature, are modulated by the entropy of speech/gesture and mutual information. This makes it clear that these effects are related to information decoding processes. The authors propose a model of how the speech-gesture integration process unfolds in time, and how IFG and pMTG interact with each other in that process.

      Strengths:

      The key strength of this study is that the authors used information theoretic measures of their stimuli (i.e., entropy and mutual information between speech and gesture) in all of their analyses. This made it clear that the neuro-modulation (tDCS, TMS) affected information decoding/integration and ERP effects reflect information decoding/integration. This study used tDCS and TMS methods to demonstrate that left IFG and pMTG are causally involved in speech-gesture integration. The size of tDCS and TMS effects are correlated with information-theoretic measures of the stimuli, which indicate that the effects indeed stem from disruption/facilitation of the information decoding/integration process (rather than generic excitation/inhibition). The authors' results also showed a correlation between information-theoretic measures of stimuli with various ERP effects. This indicates that these ERP effects reflect the information decoding/integration process.

      We sincerely thank the reviewer for recognizing our efforts and the innovation of employing information-theoretic measures to elucidate the brain processes underlying the multisensory integration of gesture and speech.

      Weaknesses:

      The "mutual information" cannot fully capture the interplay of the meaning of speech and gesture. The mutual information is calculated based on what information can be decoded from speech alone and what information can be decoded from gesture alone. However, when speech and gesture are combined, a novel meaning can emerge, which cannot be decoded from a single modality alone. For example, a person produces a gesture of writing something with a pen, while saying "He paid". The speech-gesture combination can be interpreted as "paying by signing a cheque". It is highly unlikely that this meaning is decoded when people hear speech only or see gestures only. The current study cannot address how such speech-gesture integration occurs in the brain, and what ERP effects may reflect such a process. Future studies can classify different types of speech-gesture integration and investigate neural processes that underlie each type. Another important topic for future studies is to investigate how the neural processes of speech-gesture integration change when the relative timing between the speech stimulus and the gesture stimulus changes.

      We greatly appreciate Reviewer 2’s thoughtful concern regarding whether "mutual information" adequately captures the interplay between the meanings of speech and gesture. We would like to clarify that the materials used in the present study involved gestures that were performed without actual objects, paired with verbs that precisely describe the corresponding actions. For example, a hammering gesture was paired with the verb “hammer”, and a cutting gesture was paired with the verb “cut”. In this design, all gestures conveyed redundant information relative to the co-occurring speech, creating significant overlap between the information derived from speech alone and that from gesture alone.

      We understand the reviewer’s concern about cases where gestures and speech might provide complementary, rather than redundant, information. To address this, we have developed an alternative metric for quantifying information gains contributed by supplementary multisensory cues, which will be explored in a subsequent study. However, for the present study, we believe that the observed overlap in information serves as a key indicator of multisensory convergence, a central focus of our investigation.

      Regarding the reviewer’s concern about how neural processes of speech-gesture integration may change with varying relative timing between speech and gesture stimuli, we would like to highlight findings from our previous study (Zhao, 2023, Frontiers in Psychology). In that study, we explored the semantic predictive role of gestures relative to speech under two timing conditions: (1) gestures preceding speech by a fixed interval of 200 ms, and (2) gestures preceding speech at its semantic identification point. Interestingly, only in the second condition did we observe time-window-selective disruptions of the semantic congruency effect in the IFG and pMTG. This led us to conclude that gestures play a semantic priming role for co-occurring speech. Building on this, we designed the present study with gestures deliberately preceding speech at its semantic identification point to reflect this semantic priming relationship. Additionally, ongoing research in our lab is exploring gesture and speech interactions in natural conversational settings to investigate whether the neural processes identified here remain consistent across varying contexts.

      To address potential concerns and ensure clarity regarding the limitations of the MI measurement, we have included a discussion of this in the revised manuscript in Lines 543-547: ‘Furthermore, MI quantifies overlap in gesture-speech integration, primarily when gestures convey redundant meaning. Consequently, the conclusions drawn in this study are constrained to contexts in which gestures serve to reinforce the meaning of the speech. Future research should aim to explore the neural responses in cases where gestures convey supplementary, rather than redundant, semantic information.’ This is followed by a clarification of the timing relationship between gesture and speech: ‘Note that the sequential cortical involvement and ERP components discussed above are derived from a deliberate alignment of speech onset with gesture DP, creating an artificial priming effect with gesture semantically preceding speech. Caution is advised when generalizing these findings to the spontaneous gesture-speech relationships, although gestures naturally precede speech[34].’ (Lines 539-543).

      Reviewer #3 (Public review):

      In this useful study, Zhao et al. try to extend the evidence for their previously described two-step model of speech-gesture integration in the posterior Middle Temporal Gyrus (pMTG) and Inferior Frontal Gyrus (IFG). They repeat some of their previous experimental paradigms, but this time quantifying Information-Theoretical (IT) metrics of the stimuli in a Stroop-like paradigm purported to engage speech-gesture integration. They then correlate these metrics with the disruption of what they claim to be an integration effect observable in reaction times during the tasks following brain stimulation, as well as documenting the ERP components in response to the variability in these metrics.

      The integration of multiple methods, like tDCS, TMS, and ERPs to provide converging evidence renders the results solid. However, their interpretation of the results should be taken with care, as some critical confounds, like difficulty, were not accounted for, and the conceptual link between the IT metrics and what the authors claim they index is tenuous and in need of more evidence. In some cases, the difficulty making this link seems to arise from conceptual equivocation (e.g., their claims regarding 'graded' evidence), whilst in some others it might arise from the usage of unclear wording in the writing of the manuscript (e.g. the sentence 'quantitatively functional mental states defined by a specific parser unified by statistical regularities'). Having said that, the authors' aim is valuable, and addressing these issues would render the work a very useful approach to improve our understanding of integration during semantic processing, being of interest to scientists working in cognitive neuroscience and neuroimaging.

      The main hurdle to achieving the aims set by the authors is the presence of the confound of difficulty in their IT metrics. Their measure of entropy, for example, being derived from the distribution of responses of the participants to the stimuli, will tend to be high for words or gestures with multiple competing candidate representations (this is what would presumptively give rise to the diversity of responses in high-entropy items). There is ample evidence implicating IFG and pMTG as key regions of the semantic control network, which is critical during difficult semantic processing when, for example, semantic processing must resolve competition between multiple candidate representations, or when there are increased selection pressures (Jackson et al., 2021). Thus, the authors' interpretation of Mutual Information (MI) as an index of integration is inextricably contaminated with difficulty arising from multiple candidate representations. This casts doubt on the claims of the role of pMTG and IFG as regions carrying out gesture-speech integration as the observed pattern of results could also be interpreted in terms of brain stimulation interrupting the semantic control network's ability to select the best candidate for a given context or respond to more demanding semantic processing.

      Response 1: We sincerely thank the reviewer for pointing out the confound of difficulty. The primary aim of this study is to investigate whether the degree of activity in the established integration hubs, IFG and pMTG, is influenced by the information provided by gesture-speech modalities and/or their interactions. While we provided evidence for the differential involvement of the IFG and pMTG by delineating their dynamic engagement across distinct time windows of gesture-speech integration and associating these patterns with unisensory information and their interaction, we acknowledge that the mechanisms underlying these dynamics remain open to interpretation. Specifically, whether the observed effects stem from difficulties in semantic control processes, as suggested by the reviewer, or from resolving information uncertainty, as quantified by entropy, falls outside the scope of the current study. Importantly, we view these two interpretations as complementary rather than mutually exclusive, as both may be contributing factors. Nonetheless, we agree that addressing this question is a compelling avenue for future research.

      In the revised manuscript, we have included an additional analysis to assess whether the confounding effects of lexical or semantic control difficulty—specifically, the number of available responses—affect the neural outcomes. To address this, we performed partial correlation analyses, controlling for the number of responses.

      We would like to clarify an important distinction between the measure of entropy derived from the distribution of responses and the concept of response diversity. Entropy, in our analysis, is computed based on the probability distribution of each response, as captured by the information entropy formula. In contrast, response diversity refers to the simple count of different responses provided. Mutual Information (MI), by its nature, is also an entropy measure, quantifying the overlap in responses. For reference, although we observed a high correlation between the three information matrices and the number of responses (gesture entropy & gesture response number: r = 0.976, p < 0.001; speech entropy & speech response number: r = 0.961, p < 0.001; MI & total response number: r = 0.818, p < 0.001), it is crucial to emphasize that these metrics capture different aspects of the semantic information represented. In the revised manuscript, we have provided a table detailing both entropy and response numbers for each stimulus, to allow for greater transparency and clarity.
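
      The distinction drawn above between entropy and response number can be illustrated with a minimal sketch. The two hypothetical items below use invented answers, not data from the study. Each receives 60 answers spread over four distinct responses, so both have a response number of four, yet their entropies differ because entropy depends on how the answers are distributed. The function names and the base-2 logarithm are assumptions made for illustration only.

```python
import math
from collections import Counter


def shannon_entropy(responses):
    """Shannon entropy (in bits) of the empirical distribution of responses."""
    counts = Counter(responses)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def response_number(responses):
    """Response diversity as a simple count of distinct responses."""
    return len(set(responses))


# Hypothetical items with 60 answers each (invented for illustration).
item_a = ["cut"] * 57 + ["slice", "chop", "saw"]  # one dominant answer
item_b = ["cut"] * 15 + ["slice"] * 15 + ["chop"] * 15 + ["saw"] * 15  # evenly spread

for name, item in (("item_a", item_a), ("item_b", item_b)):
    print(name, response_number(item), round(shannon_entropy(item), 3))
```

      In this toy case the response number is identical for both items, while the entropy of the evenly spread item is higher. Across a real stimulus set the two measures can still be strongly correlated, as reported above.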

      Furthermore, we have added a comprehensive description of the partial correlation analysis conducted across all three experiments in the methodology section: for Experiment 1, please refer to Lines 213–222: ‘To account for potential confounds related to multiple candidate representations, we conducted partial correlation analyses between the tDCS effects and gesture entropy, speech entropy, and MI, controlling for the number of responses provided for each gesture and speech, as well as the total number of combined responses. Given that HD-tDCS induces overall disruption at the targeted brain regions, we hypothesized that the neural activity within the left IFG and pMTG would be progressively affected by varying levels of multisensory convergence, as indexed by MI. Moreover, we hypothesized that the modulation of neural activity by MI would differ between the left IFG and pMTG, as reflected in the differential modulation of response numbers in the partial correlations, highlighting their distinct roles in semantic processing[37].’

      Experiment 2: ‘To control for potential confounds, partial correlations were also performed between the TMS effects and gesture entropy, speech entropy, and MI, controlling for the number of responses for each gesture and speech, as well as the total number of combined responses. By doing this, we can determine how the time-sensitive contribution of the left IFG and pMTG to gesture–speech integration was affected by gesture and speech information distribution.’ (Lines 242–246).

      Experiment 3: ‘Additionally, partial correlations were conducted, accounting for the number of responses for each respective metric’ (Lines 292–293).
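
      For readers unfamiliar with partial correlation, the logic of controlling for response number can be sketched as follows. This is an illustrative sketch only, not the analysis code used in the study: it uses the standard first-order formula with a single covariate, and the data are synthetic. The variable names (stim_effect, metric, n_responses) are hypothetical.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear influence of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Synthetic example: both variables are driven by a shared covariate.
rng = random.Random(0)
n_responses = [rng.gauss(0, 1) for _ in range(500)]           # covariate
metric = [z + 0.5 * rng.gauss(0, 1) for z in n_responses]      # e.g. an entropy-like metric
stim_effect = [-z + 0.5 * rng.gauss(0, 1) for z in n_responses]  # e.g. a stimulation effect

raw_r = pearson(stim_effect, metric)
partial_r = partial_correlation(stim_effect, metric, n_responses)
print(round(raw_r, 3), round(partial_r, 3))
```

      In this synthetic case the raw correlation is strongly negative, whereas the partial correlation is close to zero, which is the pattern expected when a covariate accounts for the association.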

      As anticipated by the reviewer, we observed a consistent modulation of response numbers across both regions as well as across the four ERP components and associated clusters. The detailed results are presented below:

      Experiment 1: ‘However, partial correlation analysis, controlling for the total response number, revealed that the initially significant correlation between the Cathodal-tDCS effect and MI was no longer significant (r = -0.303, p = 0.222, 95% CI = [-0.770, 0.164]). This suggests that the observed relationship between Cathodal-tDCS and MI may be confounded by semantic control difficulty, as reflected by the total number of responses. Specifically, the reduced activity in the IFG under Cathodal-tDCS may be driven by variations in the difficulty of semantic control rather than a direct modulation of MI.’ (Lines 310-316) and ‘Importantly, the reduced activity in the pMTG under Cathodal-tDCS was not influenced by the total response number, as indicated by the non-significant correlation (r = -0.253, p = 0.295, 95% CI = [-0.735, 0.229]). This finding was further corroborated by the unchanged significance in the partial correlation between Cathodal-tDCS and MI, when controlling for the total response number (r = -0.472, p = 0.048, 95% CI = [-0.903, -0.041]).’ (Lines 324-328).

      Experiment 2: ‘Notably, inhibition of pMTG activity in TW2 was not influenced by the number of speech responses (r = -0.539, p = 0.087, 95% CI = [-1.145, 0.067]). However, the number of speech responses did affect the modulation of speech entropy on the pMTG inhibition effect in TW2. This was evidenced by the non-significant partial correlation between pMTG inhibition and speech entropy when controlling for speech response number (r = -0.218, p = 0.545, 95% CI = [-0.563, 0.127]).

      In contrast, the interrupted IFG activity in TW6 appeared to be consistently influenced by the confound of semantic control difficulty. This was reflected in the significant correlations with gesture response number (r = -0.480, p = 0.032, 95% CI = [-0.904, -0.056]), speech response number (r = -0.729, p = 0.011, 95% CI = [-1.221, -0.237]), and total response number (r = -0.591, p = 0.008, 95% CI = [-0.993, -0.189]). Additionally, partial correlation analyses revealed non-significant relationships between interrupted IFG activity in TW6 and gesture entropy (r = -0.369, p = 0.120, 95% CI = [-0.810, -0.072]), speech entropy (r = -0.455, p = 0.187, 95% CI = [-1.072, 0.162]), and MI (r = -0.410, p = 0.091, 95% CI = [-0.856, -0.036]) when controlling for response numbers.’ (Lines 349-363)

      Experiment 3: ‘To clarify potential confounds of semantic control difficulty, partial correlation analyses were conducted to examine the relationship between the elicited ERP components and the relevant information matrices, controlling for response numbers. Results consistently indicated modulation by response numbers in the relationship of ERP components with the information matrix, as evidenced by the non-significant partial correlations between the P1 amplitude (P1 component over ML: r = -0.574, p = 0.082, 95% CI = [-1.141, -0.007]) and the P1 cluster (r = -0.503, p = 0.138, 95% CI = [-1.102, 0.096]) with speech entropy; the N1-P2 amplitude (N1-P2 component over LA: r = -0.080, p = 0.746, 95% CI = [-0.554, 0.394]) and N1-P2 cluster (r = -0.179, p = 0.464, 95% CI = [-0.647, 0.289]) with gesture entropy; the N400 amplitude (N400 component over LA: r = 0.264, p = 0.247, 95% CI = [-0.195, 0.723]) and N400 cluster (r = 0.394, p = 0.095, 95% CI = [-0.043, 0.831]) with gesture entropy; the N400 amplitude (N400 component over LA: r = -0.134, p = 0.595, 95% CI = [-0.620, 0.352]) and N400 cluster (r = -0.034, p = 0.894, 95% CI = [-0.524, 0.456]) with MI; and the LPC amplitude (LPC component over LA: r = -0.428, p = 0.217, 95% CI = [-1.054, 0.198]) and LPC cluster (r = -0.202, p = 0.575, 95% CI = [-0.881, 0.477]) with speech entropy.’ (Lines 424-438)

      Based on the above results, we conclude that there is a dynamic interplay between the difficulty of semantic representation and the control pressures that shape the resulting neural responses. Furthermore, while the role of the IFG in control processes remains consistent, the present study reveals a more segmented role for the pMTG. Specifically, although the pMTG is well-established in the processing of distributed speech information, the integration of multisensory convergence, as indexed by MI, did not elicit the same control-related modulation in pMTG activity. A comprehensive discussion of the control process in shaping neural responses, as well as the specific roles of the IFG and pMTG in this process, is provided in the Discussion section (Lines 493-511): ‘Given that control processes are intrinsically integrated with semantic processing[50], a distributed semantic representation enables dynamic modulation of access to and manipulation of meaningful information, thereby facilitating flexible control over the diverse possibilities inherent in a concept. Accordingly, an increased number of candidate responses amplifies the control demands necessary to resolve competing semantic representations. This effect was observed in the present study, where the association of the information matrix with the tDCS effect in IFG, the inhibition of pMTG activity in TW2, disruption of IFG activity in TW6, and modulation of four distinct ERP components collectively demonstrated that response quantity modulated neural activity. These results underscore the intricate interplay between the difficulty of semantic representation and the control pressures that shape the resulting neural responses.

      The IFG and pMTG, central components of the semantic control network, have been extensively implicated in previous research[50-52]. While the role of the IFG in managing both unisensory information and multisensory convergence remains consistent, as evidenced by the confounding difficulty results across Experiments 1 and 2, the current study highlights a more context-dependent function for the pMTG. Specifically, although the pMTG is well-established in the processing of distributed speech information, the multisensory convergence, indexed by MI, did not evoke the same control-related modulation in pMTG activity. These findings suggest that, while the pMTG is critical to semantic processing, its engagement in control processes is likely modulated by the specific nature of the sensory inputs involved.’

      References:

      Tesink, C.M.J.Y., Petersson, K.M., van Berkum, J.J.A., van den Brink, D., Buitelaar, J.K., and Hagoort, P. (2009). Unification of speaker and meaning in language comprehension: An fMRI study. J Cogn Neurosci 21, 2085-2099. 10.1162/jocn.2008.21161.

      Jackson, R.L. (2021). The neural correlates of semantic control revisited. Neuroimage 224, 117444. 10.1016/j.neuroimage.2020.117444.

      Jefferies, E. (2013). The neural basis of semantic cognition: converging evidence from neuropsychology, neuroimaging and TMS. Cortex 49, 611-625. 10.1016/j.cortex.2012.10.008.

      Noonan, K.A., Jefferies, E., Visser, M., and Lambon Ralph, M.A. (2013). Going beyond inferior prefrontal involvement in semantic control: evidence for the additional contribution of dorsal angular gyrus and posterior middle temporal cortex. J Cogn Neurosci 25, 1824-1850. 10.1162/jocn_a_00442.

      In terms of conceptual equivocation, the use of the term 'graded' by the authors seems to be different from the usage commonly employed in the semantic cognition literature (e.g., the 'graded hub hypothesis', Rice et al., 2015). The idea of a graded hub in the controlled semantic cognition framework (i.e., the anterior temporal lobe) refers to a progressive degree of abstraction or heteromodal information as you progress through the anatomy of the region (i.e., along the dorsal-to-ventral axis). The authors, on the other hand, seem to refer to 'graded manner' in the context of a correlation of entropy or MI and the change in the difference between Reaction Times (RTs) of semantically congruent vs incongruent gesture-speech. The issue is that the discourse through parts of the introduction and discussion seems to conflate both interpretations, and the ideas in the main text do not correspond to the references they cite. This is not overall very convincing. What is it exactly the authors are arguing about the correlation between RTs and MI indexes? As stated above, their measure of entropy captures the spread of responses, which could also be a measure of item difficulty (more diverse responses imply fewer correct responses, a classic index of difficulty). Capturing the diversity of responses means that items with high entropy scores are also likely to have multiple candidate representations, leading to increased selection pressures. Regions like pMTG and IFG have been widely implicated in difficult semantic processing and increased selection pressures (Jackson et al., 2021). How is this MI correlation evidence of integration that proceeds in a 'graded manner'? The conceptual links between these concepts must be made clearer for the interpretation to be convincing.

      Response 2: Regarding the concern of conceptual equivocation, we would like to emphasize that this study represents the first attempt to focus on the relationship between information quantity and neural engagement, a question addressed in three experiments. Experiment 1 (HD-tDCS) targeted the entire gesture-speech integration process in the IFG and pMTG to assess whether neural activity in these regions, previously identified as integration hubs, is modulated by changes in informativeness from both modalities (i.e., entropy) and their interactions (MI). The results revealed a gradual inhibition of neural activity in both areas as MI increased, evidenced by a negative correlation between MI and the tDCS inhibition effect in both regions. Building on this, Experiments 2 and 3 employed double-pulse TMS and ERPs to further assess whether the engaged neural activity was both time-sensitive and staged. These experiments also evaluated the contributions of various sources of information, revealing correlations between information-theoretic metrics and time-locked brain activity, providing insights into the ‘gradual’ nature of gesture-speech integration.

      Therefore, the incremental engagement of the integration hubs (IFG and pMTG), which tracks the informativeness of gesture and speech during multisensory integration, is distinct from the "graded hub," which refers to anatomical distribution. We sincerely apologize for this oversight. In the revised manuscript, we have removed this conceptual equivocation in Lines 44-60: ‘Consensus acknowledges the presence of 'convergence zones' within the temporal and inferior parietal areas[1], or the 'semantic hub' located in the anterior temporal lobe[2], pivotal for integrating, converging, or distilling multimodal inputs. Contemporary theories frame the semantic processing as a dynamic sequence of neural states[3], shaped by systems that are finely tuned to the statistical regularities inherent in sensory inputs[4]. These regularities enable the brain to evaluate, weight, and integrate multisensory information, optimizing the reliability of individual sensory signals[5]. However, sensory inputs available to the brain are often incomplete and uncertain, necessitating adaptive neural adjustments to resolve these ambiguities[6]. In this context, neuronal activity is thought to be linked to the probability density of sensory information, with higher levels of uncertainty resulting in the engagement of a broader population of neurons, thereby reflecting the brain’s adaptive capacity to handle diverse possible interpretations[7,8]. Although the role of 'convergence zones' and 'semantic hubs' in integrating multimodal inputs is well established, the precise functional patterns of neural activity in response to the distribution of unified multisensory information—along with the influence of unisensory signals—remain poorly understood.

      To this end, we developed an analytic approach to directly probe the cortical engagement during multisensory gesture-speech semantic integration.’  

      Furthermore, in the Discussion section, we have replaced the term 'graded' with 'incremental' (Line 456). Additionally, we have included a discussion on the progressive nature of neural engagement, as evidenced by the correlation between RTs and MI indices in Lines 483-492: ‘The varying contributions of unisensory gesture-speech information and the convergence of multisensory inputs, as reflected in the correlation between distinct ERP components and TMS time windows (TMS TWs), are consistent with recent models suggesting that multisensory processing involves parallel detection of modality-specific information and hierarchical integration across multiple neural levels[4,48]. These processes are further characterized by coordination across multiple temporal scales[49]. Building on this, the present study offers additional evidence that the multi-level nature of gesture-speech processing is statistically structured, as measured by information matrix of unisensory entropy and multisensory convergence index of MI, the input of either source would activate a distributed representation, resulting in progressively functioning neural responses.’

      References:

      Damasio, H., Grabowski, T.J., Tranel, D., Hichwa, R.D., and Damasio, A.R. (1996). A neural basis for lexical retrieval. Nature 380, 499-505. DOI 10.1038/380499a0.

      Patterson, K., Nestor, P.J., and Rogers, T.T. (2007). Where do you know what you know? The representation of semantic knowledge in the human brain. Nature Reviews Neuroscience 8, 976-987. 10.1038/nrn2277.

      Brennan, J.R., Stabler, E.P., Van Wagenen, S.E., Luh, W.M., and Hale, J.T. (2016). Abstract linguistic structure correlates with temporal activity during naturalistic comprehension. Brain and Language 157, 81-94. 10.1016/j.bandl.2016.04.008.

      Benetti, S., Ferrari, A., and Pavani, F. (2023). Multimodal processing in face-to-face interactions: A bridging link between psycholinguistics and sensory neuroscience. Front Hum Neurosci 17, 1108354. 10.3389/fnhum.2023.1108354.

      Noppeney, U. (2021). Perceptual Inference, Learning, and Attention in a Multisensory World. Annu Rev Neurosci 44, 449-473. 10.1146/annurev-neuro-100120-085519.

      Ma, W.J., and Jazayeri, M. (2014). Neural coding of uncertainty and probability. Annu Rev Neurosci 37, 205-220. 10.1146/annurev-neuro-071013-014017.

      Fischer, B.J., and Pena, J.L. (2011). Owl's behavior and neural representation predicted by Bayesian inference. Nat Neurosci 14, 1061-1066. 10.1038/nn.2872.

      Ganguli, D., and Simoncelli, E.P. (2014). Efficient sensory encoding and Bayesian inference with heterogeneous neural populations. Neural Comput 26, 2103-2134. 10.1162/NECO_a_00638.

      Meijer, G.T., Mertens, P.E.C., Pennartz, C.M.A., Olcese, U., and Lansink, C.S. (2019). The circuit architecture of cortical multisensory processing: Distinct functions jointly operating within a common anatomical network. Prog Neurobiol 174, 1-15. 10.1016/j.pneurobio.2019.01.004.

      Senkowski, D., and Engel, A.K. (2024). Multi-timescale neural dynamics for multisensory integration. Nat Rev Neurosci 25, 625-642. 10.1038/s41583-024-00845-7.

      Reviewer #2 (Recommendations for the authors):

      I have a number of small suggestions to make the paper more easy to understand.

      We sincerely thank the reviewer for their careful reading and thoughtful consideration. All suggestions have been thoroughly addressed and incorporated into the revised manuscript.

      (1) Lines 86-87, please clarify whether "chronometric double-pulse TMS" should lead to either excitation or inhibition of neural activities

      Double-pulse TMS elicits inhibition of neural activities (see responses to editors), which has been clarified in the revised manuscript in Lines 90-93: ‘we applied inhibitory chronometric double-pulse transcranial magnetic stimulation (TMS) to specific temporal windows associated with integration processes in these regions[23], assessing whether the inhibitory effects of TMS were correlated with unisensory entropy or the multisensory convergence index (MI)’

      (2) Line 106 "validated by replicating the semantic congruencey effect". Please specify what the task was in the validation study.

      The description of the validation task has been added in Lines 116-119: ‘To validate the stimuli, 30 participants were recruited to replicate the multisensory index of semantic congruency effect, hypothesizing that reaction times for semantically incongruent gesture-speech pairs would be significantly longer than those for congruent pairs.’

      (3) Line 112. "30 subjects". Are they Chinese speakers?

      Yes, all participants in the present study, including those in the pre-tests, are native Chinese speakers.

      (4) Line 122, "responses for each item" Please specify whether you mean here "the comprehensive answer" as you defined in 118-119.

      Yes, and this information has been added in Lines 136-137: ‘comprehensive responses for each item were converted into Shannon's entropy (H)’

      (5) Line 163 "one of three stimulus types (Anodal, Cathodal or Sham)". Please specify whether the order of the three conditions was counterbalanced across participants. Or, whether the order was fixed for all participants.

      The order of the three conditions was counterbalanced across participants. A clearer description has been added to the revised manuscript in Lines 184-189: ‘Participants were divided into two groups, with each group undergoing HD-tDCS stimulation at different target sites (IFG or pMTG). Each participant completed three experimental sessions, spaced one week apart, during which 480 gesture-speech pairs were presented across various conditions. In each session, participants received one of three types of HD-tDCS stimulation: Anodal, Cathodal, or Sham. The order of stimulation site and type was counterbalanced using a Latin square design to control for potential order effects.’

      (6) Line 191-192, "difference in reaction time between semantic incongruence and semantic congruent pairs)" Here, please specify which reaction time was subtracted from which one. This information is very crucial; without it, you cannot interpret your graphs.

      (17) Figure 3. Figure caption for (A). "The semantic congruence effect was calculated as the reaction time difference between...". You need to specify which condition was subtracted from what condition; otherwise, you cannot interpret this figure. "difference" is too ambiguous.

      Corrections have been made in the revised manuscript in Lines 208-211: ‘Neural responses were quantified based on the effects of HD-tDCS (active tDCS minus sham tDCS) on the semantic congruency effect, defined as the difference in reaction times between semantic incongruent and congruent conditions (Rt(incongruent) - Rt(congruent))’ and Lines 796-798: ‘The semantic congruency effect was calculated as the reaction time (RT) difference between semantically incongruent and semantically congruent pairs (Rt(incongruent) - Rt(congruent))’.
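
      The direction of the two subtractions quoted above can be made explicit with a short sketch. The reaction times below are invented mean values in milliseconds, and the function and variable names are hypothetical. The sketch shows only the order of subtraction, not the study's analysis pipeline.

```python
def congruency_effect(rt_incongruent, rt_congruent):
    """Semantic congruency effect: Rt(incongruent) - Rt(congruent)."""
    return rt_incongruent - rt_congruent


def stimulation_effect(effect_active, effect_sham):
    """Stimulation effect on the congruency effect: active minus sham."""
    return effect_active - effect_sham


# Invented mean reaction times in milliseconds (illustration only).
sham_effect = congruency_effect(rt_incongruent=560.0, rt_congruent=520.0)
cathodal_effect = congruency_effect(rt_incongruent=545.0, rt_congruent=520.0)
tdcs_effect = stimulation_effect(cathodal_effect, sham_effect)

print(sham_effect, cathodal_effect, tdcs_effect)
```

      Under this convention, a negative stimulation effect means that the congruency effect was smaller under active stimulation than under sham.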

      (7) Line 363 "progressive inhibition of IFG and pMTG by HD-tDCS as the degree of gesture-speech interaction, indexed by MI, advanced." This sentence is very hard to follow. I don't understand what part of the data in Figure 3 speaks to "inhibition of IFG". And what is "HD-tDCS"? I think it is easier to read if you talk about correlation (not "progressive" and "advanced").

      High-Definition transcranial direct current stimulation (HD-tDCS) was applied to modulate the activity of pMTG and IFG, with cathodal stimulation inducing inhibitory effects and anodal stimulation facilitating neural activity. In Figure 3, we examined the relationship between the tDCS effects on pMTG and IFG and the three information matrices (entropy and MI). Our results revealed significant correlations between MI and the cathodal-tDCS effects in both regions. We acknowledge that the original phrasing may have been unclear, and in the revised manuscript, we have provided a more explicit explanation to enhance clarity in Lines 443-445: ‘Our results, for the first time, revealed that the inhibition effect of cathodal-tDCS on the pMTG and IFG correlated with the degree of gesture-speech multisensory convergence, as indexed by MI’.

      (8) Lines 367-368 I don't understand why gesture is top down and speech is bottom up. Is that because gesture precedes speech (gesture is interpretable at the point of speech onset)?

      Yes, since we employed a semantic priming paradigm by aligning speech onset with the gesture comprehension point, we interpret the gesture-speech integration process as an interaction between the top-down prediction from gestures and the bottom-up processing of speech. In the revised manuscript, we have provided a clearer and more coherent description that aligns with the results. Lines 445-449: ‘Moreover, the gradual neural engagement was found to be time-sensitive and staged, as evidenced by the selectively interrupted time windows (Experiment 2) and the distinct correlated ERP components (Experiment 3), which were modulated by different information contributors, including unisensory entropy or multisensory MI’

      (9) Line 380 - 381. Can you spell out "TW" and "IP"?

      (16) Line 448, NIBS, Please spell out "NIBS".

      "TW" have been spelled out in Lines 459: ‘time windows (TW)’,"IP" in Line 460: ‘identification point (IP)’. The term "NIBS" was replaced with "HD-tDCS and TMS" to provide clearer specification of the techniques employed: ‘Consistent with this, the present study provides robust evidence, through the application of HD-tDCS and TMS, that the integration hubs for gesture and speech—the pMTG and IFG—operate in an incremental manner.’ (Lines 454-457). 

      (10) Line 419, The higher certainty of gesture => The higher the certainty of gesture is

      (13) Line 428, "a larger MI" => "a larger MI is"

      (12) Line 427-428, "the larger overlapped neural populations" => "the larger, the overlapped neural populations"

      Changes have been made in Line 522: ‘The higher the certainty of gesture is’, Line 531: ‘a larger MI is’, and Line 530: ‘the larger, overlapped neural populations’.

      (11) Line 423 "Greater TMS effect over the IFG" Can you describe the TMS effect?

      TMS effect has been described as ‘Greater TMS inhibitory effect’ (Line 526)

      (14) Line 423 "reweighting effect" What is this? Please describe (and say which experiment it is about).

      A clearer description has been provided in Lines 535-538: ‘As speech entropy increases, indicating greater uncertainty in the information provided by speech, more cognitive effort is directed towards selecting the targeted semantic representation. This leads to enhanced involvement of the IFG and a corresponding reduction in LPC amplitude’.

      (15) Line 437 "the graded functionality of every disturbed period is not guaranteed" (I don't understand this sentence).

      A clearer description has been provided in Lines 552-557: ‘Additionally, not all influenced TWs exhibited significant associations with entropy and MI. While HD-tDCS and TMS may impact functionally and anatomically connected brain regions[55,56], whether the absence of influence in certain TWs can be attributed to compensation by other connected brain areas, such as angular gyrus[57] or anterior temporal lobe[58], warrants further investigation. Therefore, caution is needed when interpreting the causal relationship between inhibition effects of brain stimulation and information-theoretic metrics (entropy and MI).’

      References:

      Humphreys, G. F., Lambon Ralph, M. A., & Simons, J. S. (2021). A Unifying Account of Angular Gyrus Contributions to Episodic and Semantic Cognition. Trends in neurosciences, 44(6), 452–463. https://doi.org/10.1016/j.tins.2021.01.006

      Bonner, M. F., & Price, A. R. (2013). Where is the anterior temporal lobe and what does it do?. The Journal of neuroscience : the official journal of the Society for Neuroscience, 33(10), 4213–4215. https://doi.org/10.1523/JNEUROSCI.0041-13.2013

      (18) Figure 4. "TW1", "TW2", etc. are not informative. Either replace them with the actual manuscript or add manuscript information (either in the graph itself or in the figure title).

      Information has been added to the figure title, ‘Figure 4. TMS impacts on semantic congruency effect across various time windows (TW).’ (Line 804), and a detailed description of each time window has been included in Lines 805-807: ‘(A) Five time windows (TWs) showing selective disruption of gesture-speech integration were chosen: TW1 (-120 to -80 ms relative to speech identification point), TW2 (-80 to -40 ms), TW3 (-40 to 0 ms), TW6 (80 to 120 ms), and TW7 (120 to 160 ms).’

      (19) Table 2C.

      The last column is titled "p(xi, yi)". I don't understand why the authors use this label for this column.

      In the formula, at the very end, there is "p(xi|yi). I wonder why it is p(xi|yi), as opposed to p(yi|xi).

      Mutual Information (MI) was calculated by subtracting the entropy of the combined gesture-speech dataset (Entropy(gesture + speech)) from the sum of the individual entropies of gesture and speech (Entropy(gesture) + Entropy(speech)). Thus, p(xi,yi) was intended to describe the entropy of the combined dataset. We acknowledge the potential ambiguity in the original description, and in the revised manuscript, we have changed the label p(xi,yi) to ‘p(xi+yi)’ (Line 848) in Table 2C, and the relevant equation of MI ‘’. We have also provided a clear description of the MI calculation in Lines 143-146: ‘MI was used to measure the overlap between gesture and speech information, calculated by subtracting the entropy of the combined gesture-speech dataset (Entropy(gesture + speech)) from the sum of their individual entropies (Entropy(gesture) + Entropy(speech)) (see Appendix Table 2C)’.

      Reviewer #3 (Recommendations for the authors):

      (1) The authors should try and produce data showing that the confound of difficulty due to the number of lexical or semantic representations is not underlying high-entropy items if they wish to improve the credibility of their claim that the disruption of the congruency effect is due to speech-gesture integration. Additionally, they should provide more evidence either in the form of experiments or references to better justify why mutual information is an index for integration in the first place.

      Response 1: An additional analysis has been conducted to assess whether the number of lexical or semantic representations affects the neural outcomes. Please see Response 1 in the responses to Reviewer #3 (Public Review) for details.

      Mutual information (MI), a concept rooted in information theory, quantifies the reduction in uncertainty about one signal when the other is known, thereby capturing the statistical dependence between them. MI is calculated as the difference between the individual entropies of each signal and their joint entropy, which reflects the total uncertainty when both signals are considered together. This metric aligns with the core principle of multisensory integration: different modalities reduce uncertainty about each other by providing complementary, predictive information. Higher MI values signify that the integration of sensory signals results in a more coherent and unified representation, while lower MI values indicate less integration or greater divergence between the modalities. As such, MI serves as a robust and natural index for assessing the degree of multisensory integration.
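
      For reference, the textbook information-theoretic identities behind this description can be written as follows. The notation is an assumption made here for illustration: X stands for the responses to a gesture and Y for the responses to a speech item, with entropy measured in bits. In the manuscript the joint term is described as the entropy of the combined gesture-speech dataset, and how that term is estimated may differ from the joint entropy H(X,Y) shown below.

```latex
\begin{align}
H(X)   &= -\sum_{x} p(x)\,\log_2 p(x), \\
I(X;Y) &= H(X) + H(Y) - H(X,Y) \\
       &= H(X) - H(X \mid Y) \;=\; H(Y) - H(Y \mid X).
\end{align}
```

      The last line expresses the "reduction in uncertainty about one signal when the other is known": I(X;Y) is zero when the two signals are statistically independent and grows as they constrain each other more strongly.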

      To date, the use of MI as an index of integration has been limited, with one notable study by Tremblay et al. (2016), cited in the manuscript, using pointwise MI to quantify the extent to which two syllables mutually constrain each other. While MI has been extensively applied in natural language processing to measure the co-occurrence strength between words (e.g., Lin et al., 2012), its application as an index of multisensory convergence, particularly in the context of gesture-speech integration as employed in this study, is novel. In the revised manuscript, we have clarified the relationship between MI and multisensory convergence: ‘MI assesses shared information between modalities[25], indicating multisensory convergence and acting as an index of gesture-speech integration’ (Lines 73-74).

      In addition, we calculated MI as per its original definition, by subtracting the entropy of the combined gesture-speech dataset from the sum of the individual entropies of gesture and speech. The detailed calculation method is provided in Lines 136-152: ‘To quantify information content, comprehensive responses for each item were converted into Shannon's entropy (H) as a measure of information richness (Figure 1A bottom). With no significant gender differences observed in either gesture (t(20) = 0.21, p = 0.84) or speech (t(20) = 0.52, p = 0.61), responses were aggregated across genders, resulting in 60 answers per item (Appendix Table 2). Here, p(xi) and p(yi) represent the distribution of 60 answers for a given gesture (Appendix Table 2B) and speech (Appendix Table 2A), respectively. High entropy indicates diverse answers, reflecting broad representation, while low entropy suggests focused lexical recognition for a specific item (Figure 2B). MI was used to measure the overlap between gesture and speech information, calculated by subtracting the entropy of the combined gesture-speech dataset (Entropy(gesture + speech)) from the sum of their individual entropies (Entropy(gesture) + Entropy(speech)) (see Appendix Table 2C). For specific gesture-speech combinations, equivalence between the combined entropy and the sum of individual entropies (gesture or speech) indicates absence of overlap in response sets. Conversely, significant overlap, denoted by a considerable number of shared responses between gesture and speech datasets, leads to a noticeable discrepancy between combined entropy and the sum of gesture and speech entropies. Elevated MI values thus signify substantial overlap, indicative of a robust mutual interaction between gesture and speech.’
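
      As an illustration of the calculation described above, the following minimal sketch computes Shannon's entropy from a list of answers and then the MI as Entropy(gesture) + Entropy(speech) - Entropy(gesture + speech). It assumes that the combined dataset is formed by pooling the two answer lists and that entropy is expressed in bits (log base 2); the function names and the example answers are hypothetical and are not data from the study.

```python
import math
from collections import Counter


def shannon_entropy(answers):
    """Shannon entropy (in bits, an assumed base) of the empirical answer distribution."""
    counts = Counter(answers)
    total = len(answers)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def pooled_mi(gesture_answers, speech_answers):
    """Entropy(gesture) + Entropy(speech) - Entropy(gesture + speech).

    The combined dataset is assumed to be the two answer lists pooled together.
    """
    combined = list(gesture_answers) + list(speech_answers)
    return (
        shannon_entropy(gesture_answers)
        + shannon_entropy(speech_answers)
        - shannon_entropy(combined)
    )


# Hypothetical answers for a single item (not data from the study).
gesture = ["sweep", "sweep", "wipe", "wave"]
speech = ["sweep", "sweep", "sweep", "sweep"]

gesture_entropy = shannon_entropy(gesture)  # diverse answers give higher entropy
speech_entropy = shannon_entropy(speech)    # unanimous answers give zero entropy
mi = pooled_mi(gesture, speech)
```

      In this sketch, an item named identically by all respondents has an entropy of 0, which matches the description of items with a nameability of 1.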

      Additional examples are outlined in Appendix Table 2 (Lines 841-848).

      This novel application of MI as a multisensory convergence index offers new insights into how different sensory modalities interact and integrate to shape semantic processing.

      Reference:

      Tremblay, P., Deschamps, I., Baroni, M., and Hasson, U. (2016). Neural sensitivity to syllable frequency and mutual information in speech perception and production. Neuroimage 136, 106-121. 10.1016/j.neuroimage.2016.05.018

      Lin, W., Wu, Y., & Yu, L. (2012). Online Computation of Mutual Information and Word Context Entropy. International Journal of Future Computer and Communication, 167-169.

      (2) Finally, if the authors wish to address the graded hub hypothesis as posited by the controlled semantic cognition framework (e.g., Rice et al., 2015), they would have to stimulate a series of ROIs progressing gradually through the anatomy of their candidate regions showing the effects grow along this spline, more than simply correlate MI with RT differences.

      Response 2: We appreciate the reviewer’s thoughtful consideration. The incremental engagement of the integration hub of IFG and pMTG along with the informativeness of gesture and speech during multisensory integration is different from the concept of "graded hub," which refers to anatomical distribution. See Responses to reviewer 3 (public review) response 2 for details.

      (3) The authors report significant effects with p values as close to the threshold as p=0.49 for the pMTG correlation in Experiment 1, for example. How confident are the authors these results are reliable and not merely their 'statistical luck'? Especially in view of sample sizes that hover around 22-24 participants, which have been called into question in the field of non-invasive brain stimulation (e.g., Mitra et al, 2021)?

      Response 3: In Experiment 1, a total of 52 participants were assigned to two groups, each undergoing HD-tDCS stimulation over either the inferior frontal gyrus (IFG) or posterior middle temporal gyrus (pMTG), yielding 26 participants per group for correlation analysis. Power analysis, conducted using G*Power, indicated that a sample size of 26 participants per group would provide sufficient power (0.8) to detect a large effect size (0.5) at an alpha level of 0.05, justifying the chosen sample size. To control for potential statistical artifacts, we compared the results to those from the unaffected control condition.

      In Experiment 1, participants performed a gender categorization task, responding as accurately and quickly as possible to the gender of the voice they heard, while gender congruency (e.g., a male gesture paired with a male voice or a female gesture with a male voice) was manipulated. This manipulation served as a direct control, enabling the investigation of automatic and implicit semantic interactions between gesture and speech. This information is provided in the manuscript in Lines 167-172: ‘An irrelevant factor of gender congruency (e.g., a man making a gesture combined with a female voice) was created[22,23,35]. This involved aligning the gender of the voice with the corresponding gender of the gesture in either a congruent (e.g., male voice paired with a male gesture) or incongruent (e.g., male voice paired with a female gesture) manner. This approach served as a direct control mechanism, facilitating the investigation of the automatic and implicit semantic interplay between gesture and speech[35]’. Correlation analyses were conducted to examine the tDCS effects on gender congruency, comparing reaction times for gender-incongruent versus congruent trials. No significant correlations were found between the tDCS effects on gender congruency and MI for either the IFG (Cathodal-tDCS effect with MI: r = 0.102, p = 0.677; Anodal-tDCS effect with MI: r = 0.178, p = 0.466) or the pMTG (Cathodal-tDCS effect with MI: r = -0.201, p = 0.410; Anodal-tDCS effect with MI: r = -0.232, p = 0.338).

      Moreover, correlations between the tDCS effect on semantic congruency and gesture entropy, speech entropy, and mutual information (MI) were examined, yielding p-values of 0.290, 0.725, and 0.049, respectively.

      The absence of a tDCS effect on gender congruency, coupled with the lack of significance for the other information matrices, supports the robustness of the significant finding at p = 0.049.

      (4) The distributions of entropy for gestures and speech are very unequal. Whilst entropy for gestures has high variability, (.12-4.3), that of speech is very low (ceiling effect?) with low variance. Can the authors comment on whether they think this might have affected their analyses or results in any way? For example, do they think this could be a problem when calculating MI, which integrates both measures? L130-131.

      Response 4: We sincerely thank the reviewer for raising this insightful question. The core premise of the current study is that brain activity is modulated by the degree of information provided. Accordingly, the 20 entropy values for gesture and speech represent a subset of the overall entropy distribution, with the degree of entropy correlating with a distributed pattern of neural activity, regardless of the scale of variation. This hypothesis aligns with previous studies suggesting that neuronal activity is linked to the probability density of sensory information, with higher levels of uncertainty resulting in the engagement of a broader population of neurons, thereby reflecting the brain’s adaptive capacity to handle diverse possible interpretations (Fischer & Pena, 2011; Ganguli & Simoncelli, 2014).

      Importantly, we conducted another EEG experiment with 30 subjects. Given the inherent differences between gesture and speech, it is important to note that speech, being more structurally distinct, tends to exhibit lower variability than gesture. To prevent an imbalance in the distribution of gesture and speech, we manipulated the information content of each modality. Specifically, we created three conditions for both gesture and speech (i.e., 0.75, 1, and 1.25 times the identification threshold), thereby ensuring comparable variance between the two modalities: gesture (mean entropy = 2.91 ± 1.01) and speech (mean entropy = 1.82 ± 0.71) (Author response table 6).

      Full-factorial RSA analysis revealed an early P1 effect (0-100 ms) for gesture and a late LPC effect (734-780 ms) for speech (Author response image 2b). Crucially, the identified clusters showed significant correlations with both gesture (Author response image 2c1) and speech entropy (Author response image 2c3), respectively. These findings replicate the results of the present study, demonstrating that, irrespective of the variance in gesture and speech entropy, both modalities elicited ERP amplitude responses in a progressive manner that aligned with their respective information distributions.

      Regarding the influence on MI values, since MI was calculated based on the overlapping responses between gesture and speech, a reduction in uncertainty during speech comprehension would naturally result in a smaller contribution to the MI value. However, as hypothesized above, the MI values were also assumed to represent a subset of the overall distribution, where the contributions of both gesture and speech are expected to follow a normal distribution. This hypothesis was further supported by our replication experiment. When the contributions of gesture and speech were balanced, a correlation between MI values and N400 amplitude was observed (Author response image 2c2), consistent with the results reported in the present manuscript. These findings not only support the idea that the correlation between MI and ERP components is unaffected by the subset of MI values but also confirm the replicability of our results.

      Author response table 6.

      Quantitative entropy for each gesture stimulus (BD: before discrimination point; DP: discrimination point; AD: after discrimination point) and speech stimulus (BI: before identification point; IP: identification point; AI: after identification point).

      Author response image 2.

      Results of group-level analysis and full-factorial RSA. a: The full-factorial representational similarity analysis (RSA) framework is illustrated schematically. Within the general linear model (GLM), the light green matrix denotes the representational dissimilarity matrix (RDM) for gesture semantic states, while light blue matrix represents speech semantic states, and the light red matrix illustrates the semantic congruency effect. The symbol ‘e’ indicates the random error term. All matrices, including the neural dissimilarity matrix, are structured as 18 * 18 matrices, corresponding to 18 conditions (comprising 3 gesture semantic states, 3 speech semantic states, and 2 congruency conditions). b: Coding strength for gesture states, speech states and congruency effect. Shaded clusters represent regions where each factor exhibited significant effects. Clusters with lower opacity correspond to areas where the grand-mean ERP amplitudes across conditions showed the highest correlation with unimodal entropy or MI. c1-c6: Topographical correlation maps illustrate the four significant RSA clusters (top), accompanied by the highest correlations between ERP amplitudes within the significant RSA clusters and the information matrices (bottom). Black dots represent electrodes exhibiting significant correlations, while black stars highlight the electrode with the highest correlation coefficient.
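
      The GLM structure described in panel a can be sketched as a regression of the vectorized neural dissimilarity matrix on the three model matrices. The sketch below is only an illustration under stated assumptions: the model RDMs are coded as binary (1 when two conditions differ on a factor, 0 otherwise), the three factors are fully crossed, an intercept is included, and the neural RDM is simulated without noise from hypothetical weights. None of these choices are taken from the study, and the function names are hypothetical.

```python
import itertools

# 18 conditions: 3 gesture states x 3 speech states x 2 congruency levels.
CONDITIONS = list(itertools.product(range(3), range(3), range(2)))


def model_rdm(factor):
    """Binary 18 x 18 RDM (assumed coding): 1 if two conditions differ on `factor`."""
    return [[float(a[factor] != b[factor]) for b in CONDITIONS] for a in CONDITIONS]


def upper_triangle(rdm):
    """Vectorize the off-diagonal upper triangle of a square matrix."""
    n = len(rdm)
    return [rdm[i][j] for i in range(n) for j in range(i + 1, n)]


def least_squares(columns, y):
    """Ordinary least squares via the normal equations and Gauss-Jordan elimination."""
    k = len(columns)
    aug = [
        [sum(p * q for p, q in zip(columns[i], columns[j])) for j in range(k)]
        + [sum(p * q for p, q in zip(columns[i], y))]
        for i in range(k)
    ]
    for i in range(k):
        pivot = max(range(i, k), key=lambda r: abs(aug[r][i]))
        aug[i], aug[pivot] = aug[pivot], aug[i]
        for r in range(k):
            if r != i:
                scale = aug[r][i] / aug[i][i]
                aug[r] = [u - scale * v for u, v in zip(aug[r], aug[i])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


gesture, speech, congruency = (upper_triangle(model_rdm(f)) for f in range(3))
intercept = [1.0] * len(gesture)

# Simulated, noise-free neural RDM built from hypothetical weights (not study data).
neural = [0.5 * g + 0.2 * s + 0.1 * c for g, s, c in zip(gesture, speech, congruency)]

# Estimated coding strengths: [intercept, gesture, speech, congruency].
betas = least_squares([intercept, gesture, speech, congruency], neural)
```

      With real data, the neural RDM would be computed from ERP patterns at each time point, and the estimated weights would play the role of the coding strength for each factor.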

      (5) L383: Why are the authors calling TW2 pre-lexical and TW6 post-lexical? I believe they must provide evidence or references justifying calling these periods pre- and post-lexical. This seems critical given the argument they're trying to make in this paragraph.

      Response 5: The time windows (TWs) selected for the current study were based on our previous work (Zhao et al., 2021, J. Neurosci). In that study, we employed a double-pulse TMS protocol, delivering stimulation across eight 40-ms time windows: three windows preceding the speech identification point (TWs 1-3) and five windows following it (TWs 4-8). The pre-lexical time windows (TWs 1-3) occur before speech identification, while the post-lexical time windows (TWs 4-8) occur after this point. In the revised manuscript, we have made this clear in Lines 462-466:

      “In TW2 of gesture-speech integration, which precedes the speech identification point[23] and represents a pre-lexical stage, the suppression effect observed in the pMTG was correlated with speech entropy. Conversely, during TW6, which follows the speech identification point[23] and represents a post-lexical stage, the IFG interruption effect was influenced by gesture entropy, speech entropy, and their MI”
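
      The time-window scheme described above can be sketched as follows, with all times in milliseconds relative to the speech identification point. The sketch assumes that the eight 40-ms windows are contiguous, which is consistent with the boundaries reported for TW1, TW2, TW3, TW6, and TW7; the boundaries it produces for TW4, TW5, and TW8 are therefore an inference rather than values stated in this response. The names are hypothetical.

```python
WINDOW_MS = 40         # duration of each time window
N_PRE, N_POST = 3, 5   # windows before and after the speech identification point


def time_windows():
    """Return {label: (start_ms, end_ms, stage)} relative to the identification point."""
    windows = {}
    for k in range(N_PRE + N_POST):
        start = (k - N_PRE) * WINDOW_MS
        end = start + WINDOW_MS
        # A window ending at or before 0 ms lies entirely before identification.
        stage = "pre-lexical" if end <= 0 else "post-lexical"
        windows[f"TW{k + 1}"] = (start, end, stage)
    return windows


tws = time_windows()
```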

      Reference:

      Zhao, W., Li, Y., and Du, Y. (2021). TMS reveals dynamic interaction between inferior frontal gyrus and posterior middle temporal gyrus in gesture-speech semantic integration. The Journal of Neuroscience, 10356-10364. 10.1523/jneurosci.1355-21.2021.

      (6) Below, I recommend the authors improve their description of the criteria employed to select ROIs. This is important for several reasons. For example, the lack of a control ROI presumably not implicated in integration makes the interpretation of the specificity of the results difficult. Additionally, other regions have been proposed more consistently by recent evidence as multimodal integrators, like for example, the angular gyrus (Humphreys, 2021), or the anterior temporal lobe. The inclusion of IFG as a key region for integration and the oversight of angular gyrus seems to me unjustified in the light of recent evidence.

      Response 6: We appreciate the reviewer’s thoughtful consideration. The selection of IFG and pMTG as ROIs was based on a meta-analysis of multiple fMRI studies on gesture-speech integration, in which these two locations were consistently identified as activated. See Author response table 7 for details of the studies and coordinates of brain locations reported.

      Author response table 7.

      Meta-analysis of previous studies on gesture-speech integration.

      Based on the meta-analysis of previous studies, we selected the IFG and pMTG as ROIs for gesture-speech integration. The rationale for selecting these brain regions is outlined in the introduction in Lines 65-68: ‘Empirical studies have investigated the semantic integration between gesture and speech by manipulating their semantic relationship[15-18] and revealed a mutual interaction between them[19-21] as reflected by the N400 latency and amplitude[14] as well as common neural underpinnings in the left inferior frontal gyrus (IFG) and posterior middle temporal gyrus (pMTG)[15,22,23]’.

      This is further described in Lines 79-80: ‘Experiment 1 employed high-definition transcranial direct current stimulation (HD-tDCS) to administer Anodal, Cathodal and Sham stimulation to either the IFG or the pMTG’, and in Lines 87-90: ‘Given the differential involvement of the IFG and pMTG in gesture-speech integration, shaped by top-down gesture predictions and bottom-up speech processing[23], Experiment 2 was designed to assess whether the activity of these regions was associated with relevant informational matrices’.

      In the Methods section, we clarified the selection of coordinates in Lines 193-199: ‘Building on a meta-analysis of prior fMRI studies examining gesture-speech integration[22], we targeted Montreal Neurological Institute (MNI) coordinates for the left IFG at (-62, 16, 22) and the pMTG at (-50, -56, 10). In the stimulation protocol for HD-tDCS, the IFG was targeted using electrode F7 as the optimal cortical projection site[36], with four return electrodes placed at AF7, FC5, F9, and FT9. For the pMTG, TP7 was selected as the cortical projection site[36], with return electrodes positioned at C5, P5, T9, and P9.’

      The selection of IFG or pMTG as integration hubs for gesture and speech has also been validated in our previous studies. Specifically, Zhao et al. (2018, J. Neurosci) applied TMS to both areas. Results demonstrated that disrupting neural activity in the IFG or pMTG via TMS selectively impaired the semantic congruency effect (reaction time costs due to semantic incongruence), while leaving the gender congruency effect unaffected. These findings identified the IFG and pMTG as crucial hubs for gesture-speech integration, guiding the selection of brain regions for our subsequent studies.

      In addition, Zhao et al. (2021, J. Neurosci) employed a double-pulse TMS protocol across eight 40-ms time windows to explore the temporal dynamics of the IFG and pMTG. The results revealed time-window-selective disruptions of the semantic congruency effect, further supporting the dynamic and temporally staged involvement of these regions in gesture-speech integration.

      While we have a solid rationale for selecting the IFG and pMTG as key regions, we acknowledge the reviewer's point that the involvement of additional functionally and anatomically connected brain areas cannot be excluded. We have included this as a limitation in the discussion in Lines 552-557: ‘Additionally, not all influenced TWs exhibited significant associations with entropy and MI. While HD-tDCS and TMS may impact functionally and anatomically connected brain regions[55,56], whether the absence of influence in certain TWs can be attributed to compensation by other connected brain areas, such as angular gyrus[57] or anterior temporal lobe[58], warrants further investigation. Therefore, caution is needed when interpreting the causal relationship between inhibition effects of brain stimulation and information-theoretic metrics (entropy and MI).’

      References:

      Willems, R.M., Ozyurek, A., and Hagoort, P. (2009). Differential roles for left inferior frontal and superior temporal cortex in multimodal integration of action and language. Neuroimage 47, 1992-2004. 10.1016/j.neuroimage.2009.05.066.

      Drijvers, L., Jensen, O., and Spaak, E. (2021). Rapid invisible frequency tagging reveals nonlinear integration of auditory and visual information. Human Brain Mapping 42, 1138-1152. 10.1002/hbm.25282.

      Drijvers, L., and Ozyurek, A. (2018). Native language status of the listener modulates the neural integration of speech and iconic gestures in clear and adverse listening conditions. Brain and Language 177, 7-17. 10.1016/j.bandl.2018.01.003.

      Drijvers, L., van der Plas, M., Ozyurek, A., and Jensen, O. (2019). Native and non-native listeners show similar yet distinct oscillatory dynamics when using gestures to access speech in noise. Neuroimage 194, 55-67. 10.1016/j.neuroimage.2019.03.032.

      Holle, H., and Gunter, T.C. (2007). The role of iconic gestures in speech disambiguation: ERP evidence. J Cognitive Neurosci 19, 1175-1192. 10.1162/jocn.2007.19.7.1175.

      Kita, S., and Ozyurek, A. (2003). What does cross-linguistic variation in semantic coordination of speech and gesture reveal?: Evidence for an interface representation of spatial thinking and speaking. J Mem Lang 48, 16-32. 10.1016/S0749-596x(02)00505-3.

      Bernardis, P., and Gentilucci, M. (2006). Speech and gesture share the same communication system. Neuropsychologia 44, 178-190. 10.1016/j.neuropsychologia.2005.05.007.

      Zhao, W.Y., Riggs, K., Schindler, I., and Holle, H. (2018). Transcranial magnetic stimulation over left inferior frontal and posterior temporal cortex disrupts gesture-speech integration. Journal of Neuroscience 38, 1891-1900. 10.1523/Jneurosci.1748-17.2017.

      Zhao, W., Li, Y., and Du, Y. (2021). TMS reveals dynamic interaction between inferior frontal gyrus and posterior middle temporal gyrus in gesture-speech semantic integration. The Journal of Neuroscience, 10356-10364. 10.1523/jneurosci.1355-21.2021.

      Hartwigsen, G., Bzdok, D., Klein, M., Wawrzyniak, M., Stockert, A., Wrede, K., Classen, J., and Saur, D. (2017). Rapid short-term reorganization in the language network. Elife 6. 10.7554/eLife.25964.

      Jackson, R.L., Hoffman, P., Pobric, G., and Ralph, M.A.L. (2016). The semantic network at work and rest: Differential connectivity of anterior temporal lobe subregions. Journal of Neuroscience 36, 1490-1501. 10.1523/JNEUROSCI.2999-15.2016.

      Humphreys, G. F., Lambon Ralph, M. A., & Simons, J. S. (2021). A Unifying Account of Angular Gyrus Contributions to Episodic and Semantic Cognition. Trends in neurosciences, 44(6), 452–463. https://doi.org/10.1016/j.tins.2021.01.006

      Bonner, M. F., & Price, A. R. (2013). Where is the anterior temporal lobe and what does it do?. The Journal of neuroscience : the official journal of the Society for Neuroscience, 33(10), 4213–4215. https://doi.org/10.1523/JNEUROSCI.0041-13.2013

      (7) Some writing is obscure or unclear, in part due to superfluous words like 'intricate neural processes' on L74. Or the sentence in L47 - 48 about 'quantitatively functional mental states defined by a specific parser unified by statistical regularities' which, even read in context, fails to provide clarity about what a quantitatively functional mental state is, or how it is defined by specific parsers (or what these are), and what is the link to statistical regularities. In some cases, this lack of clarity leads to difficulties assessing the appropriateness of the methods, or the exact nature of the claims. For example, do they mean degree of comprehension instead of comprehensive value? I provide some more examples below:

      Response 7: We appreciate the reviewer’s thoughtful consideration. The revised manuscript now includes a clear description and a detailed explanation of the association with the statistical logic, addressing the concerns raised, in Lines 47-55: ‘Contemporary theories frame semantic processing as a dynamic sequence of neural states[3], shaped by systems that are finely tuned to the statistical regularities inherent in sensory inputs[4]. These regularities enable the brain to evaluate, weight, and integrate multisensory information, optimizing the reliability of individual sensory signals[5]. However, sensory inputs available to the brain are often incomplete and uncertain, necessitating adaptive neural adjustments to resolve these ambiguities[6]. In this context, neuronal activity is thought to be linked to the probability density of sensory information, with higher levels of uncertainty resulting in the engagement of a broader population of neurons, thereby reflecting the brain’s adaptive capacity to handle diverse possible interpretations[7,8].’

      References:

      Brennan, J.R., Stabler, E.P., Van Wagenen, S.E., Luh, W.M., and Hale, J.T. (2016). Abstract linguistic structure correlates with temporal activity during naturalistic comprehension. Brain and Language 157, 81-94. 10.1016/j.bandl.2016.04.008.

      Benetti, S., Ferrari, A., and Pavani, F. (2023). Multimodal processing in face-to-face interactions: A bridging link between psycholinguistics and sensory neuroscience. Front Hum Neurosci 17, 1108354. 10.3389/fnhum.2023.1108354.

      Noppeney, U. (2021). Perceptual Inference, Learning, and Attention in a Multisensory World. Annual Review of Neuroscience, Vol 44, 2021 44, 449-473. 10.1146/annurev-neuro-100120-085519.

      Ma, W.J., and Jazayeri, M. (2014). Neural coding of uncertainty and probability. Annu Rev Neurosci 37, 205-220. 10.1146/annurev-neuro-071013-014017.

      Fischer, B.J., and Pena, J.L. (2011). Owl's behavior and neural representation predicted by Bayesian inference. Nat Neurosci 14, 1061-1066. 10.1038/nn.2872.

      Ganguli, D., and Simoncelli, E.P. (2014). Efficient sensory encoding and Bayesian inference with heterogeneous neural populations. Neural Comput 26, 2103-2134. 10.1162/NECO_a_00638.

      Comment 7.1: a) I am not too sure what they mean by 'response consistently provided by participants for four to six consecutive instances' [L117-118]. They should be clearer with the description of these 'pre-test' study methods.

      Response 7.1: Thank you for this insightful question. An example of a participant's response to the gesture 'an' is provided below (Author response table 8). Initially, within 240 ms, the participant provided the answer "an," which could potentially have been a guess. To ensure that the participant truly comprehended the gesture, we repeatedly presented it until the participant’s response stabilized, meaning that the same answer was given consistently over several trials. While one might consider fixing the number of repetitions (e.g., six trials), this could lead to participants predicting the rule and providing the same answer out of habit. To mitigate this potential bias, we allowed the number of repetitions to vary flexibly between four and six trials.

      We understand that the initial phrase might be ambiguous. In the revised manuscript, we have changed it to: ‘For each gesture or speech, the action verb consistently provided by participants across four to six consecutive repetitions—with the number of repetitions varied to mitigate learning effects—was considered the comprehensive response for the gesture or speech.’ (Lines 130-133)
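
      The stabilization rule described above can be sketched as a search for the first answer that is repeated on a required number of consecutive trials. This is a minimal illustration only: the function name, its parameters, and the example answer sequences are hypothetical, and the required run length would be drawn from four to six as described, by a procedure that is not specified in this response.

```python
def comprehensive_response(responses, required_run):
    """Return the first answer given `required_run` times in a row, or None."""
    run_answer, run_length = None, 0
    for answer in responses:
        if answer == run_answer:
            run_length += 1
        else:
            # A different answer starts a new run.
            run_answer, run_length = answer, 1
        if run_length >= required_run:
            return run_answer
    return None


# Hypothetical sequence: an early answer is interrupted, then the response stabilizes.
stable = comprehensive_response(["an", "ya", "an", "an", "an", "an"], required_run=4)
unstable = comprehensive_response(["an", "an", "ya", "an"], required_run=4)
```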

      Author response table 8.

      Example of participant's response to the gesture 'an'

      Comment 7.2: b) I do not understand the paragraph in L143 - 146. This is important to rephrase for clarification. What are 'stepped' neural changes? What is the purpose of 'aggregating' neural responses with identical entropy / MI values?

      Response 7.2: It is important to note that the 20 stimuli exhibit 20 increments of gesture entropy values, 11 increments of speech entropy values, and 19 increments of mutual information values (Appendix Table 3). This discrepancy arises from the calculation of entropy and mutual information, where the distributions were derived from the comprehensive set of responses contributed by all 30 participants. As a result, these values were affected not only by the distinct nameabilities of the stimuli but also by the entirety of the responses provided. Consequently, for speech entropy, 9 items have a nameability of 1, signifying unanimous comprehension among all 30 participants and resulting in an entropy of 0. Moreover, the stimuli 'ning' and 'jiao' share an identical distribution, leading to an entropy of 0.63 for both. Regarding MI, a value of 0.66 is computed for both the stimulus 'sao' (gesture entropy: 4.01, speech entropy: 1.12, Author response image 3) and the stimulus 'tui' (gesture entropy: 1.62, speech entropy: 0, Author response image 4). This indicates that these two stimuli manifest an equivalent degree of integration.

      Author response image 3.

      Example of gesture answers (gesture sao), speech answers (speech sao), and mutual information (MI) for the ‘sao’ item

      Author response image 4.

      Example of gesture answers (gesture tui), speech answers (speech tui), and mutual information (MI) for the ‘tui’ item

      To precisely assess whether lower entropy/MI corresponds to a smaller or larger neural response, neural responses (ERP amplitude or TMS inhibition effect) with identical entropy or MI values were averaged before undergoing correlational analysis. We understand that the original phrasing might be ambiguous, and we have clarified the description in the revised manuscript in Lines 157-160: ‘To determine whether entropy or MI values correspond to distinct neural changes, the current study first aggregated neural responses (including inhibition effects of tDCS and TMS or ERP amplitudes) that shared identical entropy or MI values, prior to conducting correlational analyses.’
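
      The aggregation step described above can be sketched as follows: neural responses that share an identical entropy or MI value are averaged, and a Pearson correlation is then computed over the aggregated pairs. The sketch assumes that identical values are represented by exactly equal numbers (for example, after rounding) and that a Pearson coefficient is the statistic of interest; the function names and the example values are hypothetical and are not data from the study.

```python
import math
from collections import defaultdict


def aggregate_by_value(values, responses):
    """Average the responses that share an identical entropy or MI value."""
    groups = defaultdict(list)
    for value, response in zip(values, responses):
        groups[value].append(response)
    xs = sorted(groups)
    ys = [sum(groups[x]) / len(groups[x]) for x in xs]
    return xs, ys


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical MI values and stimulation effects (ms); two items share MI = 0.66.
mi_values = [0.20, 0.66, 0.66, 1.10]
effects = [5.0, 12.0, 8.0, 15.0]

xs, ys = aggregate_by_value(mi_values, effects)  # the two 0.66 items are averaged
r = pearson_r(xs, ys)
```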

      Comment 7.3: c) The paragraph in L160-171 is confusing. Is it an attempt to give an overview of all three experiments? If so, consider moving to the end or summarising what each experiment is at the beginning of the paragraph giving it a name (i.e., TMS). Without that, it is unclear what each experiment is counterbalancing or what 'stimulation site' refers to, for example, leading to a significant lack of clarity.

      Response 7.3: We are sorry for the ambiguity. In the revised manuscript, we have moved the relevant phrasing to the beginning of each experiment.

      ‘Experiment 1: HD-tDCS protocol and data analysis

      Participants were divided into two groups, with each group undergoing HD-tDCS stimulation at different target sites (IFG or pMTG). Each participant completed three experimental sessions, spaced one week apart, during which 480 gesture-speech pairs were presented across various conditions. In each session, participants received one of three types of HD-tDCS stimulation: Anodal, Cathodal, or Sham. The order of stimulation site and type was counterbalanced using a Latin square design to control for potential order effects’ (Lines 183-189)

      ‘Experiment 2: TMS protocol and data analysis

      Experiment 2 involved 800 gesture-speech pairs, presented across 15 blocks over three days, with one week between sessions. Stimulation was administered at three different sites (IFG, pMTG, or Vertex). Within the time windows (TWs) spanning the gesture-speech integration period, five TWs that exhibited selective disruption of integration were selected: TW1 (-120 to -80 ms relative to the speech identification point), TW2 (-80 to -40 ms), TW3 (-40 to 0 ms), TW6 (80 to 120 ms), and TW7 (120 to 160 ms)[23] (Figure 1C). The order of stimulation site and TW was counterbalanced using a Latin square design.’ (Lines 223-230)

      ‘Experiment 3: Electroencephalogram (EEG) recording and data analysis

      Experiment 3, comprising a total of 1760 gesture-speech pairs, was completed in a single-day session.’ (Lines 249-250)
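
      The Latin square counterbalancing mentioned for Experiments 1 and 2 can be illustrated with a simple cyclic construction, in which each condition appears once in every row (participant order) and once in every column (session position). The response does not state which construction was used, so the cyclic scheme and the function name below are assumptions made purely for illustration.

```python
def cyclic_latin_square(conditions):
    """Build an n x n Latin square by cyclically shifting the condition list."""
    n = len(conditions)
    return [[conditions[(row + col) % n] for col in range(n)] for row in range(n)]


# Each row is one possible session order for the three stimulation types.
orders = cyclic_latin_square(["Anodal", "Cathodal", "Sham"])
```

      Participants would then be assigned to rows in rotation, so that each stimulation type occupies each session position equally often across the group.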

      Comment 7.4: d) L402-406: This sentence is not clear. What do the authors mean by 'the state of [the neural landscape] constructs gradually as measured by entropy and MI'? How does this construct a neural landscape? The authors must rephrase this paragraph using clearer language since in its current state it is very difficult to assess whether it is supported by the evidence they present.

      Response 7.4: We are sorry for the ambiguity. In the revised manuscript, we have provided a clearer description in Lines 483-492: ‘The varying contributions of unisensory gesture-speech information and the convergence of multisensory inputs, as reflected in the correlation between distinct ERP components and TMS time windows (TMS TWs), are consistent with recent models suggesting that multisensory processing involves parallel detection of modality-specific information and hierarchical integration across multiple neural levels[4,48]. These processes are further characterized by coordination across multiple temporal scales[49]. Building on this, the present study offers additional evidence that the multi-level nature of gesture-speech processing is statistically structured, as measured by information matrix of unisensory entropy and multisensory convergence index of MI. The input of either source would activate a distributed representation, resulting in progressively functioning neural responses’

      References:

      Benetti, S., Ferrari, A., and Pavani, F. (2023). Multimodal processing in face-to-face interactions: A bridging link between psycholinguistics and sensory neuroscience. Front Hum Neurosci 17, 1108354. 10.3389/fnhum.2023.1108354.

      Meijer, G.T., Mertens, P.E.C., Pennartz, C.M.A., Olcese, U., and Lansink, C.S. (2019). The circuit architecture of cortical multisensory processing: Distinct functions jointly operating within a common anatomical network. Prog Neurobiol 174, 1-15. 10.1016/j.pneurobio.2019.01.004.

      Senkowski, D., and Engel, A.K. (2024). Multi-timescale neural dynamics for multisensory integration. Nat Rev Neurosci 25, 625-642. 10.1038/s41583-024-00845-7.

      (8) Some writing suffers from conceptual equivocation. For example, the link between 'multimodal representation' and gesture as a type of multimodal extralinguistic information is not straightforward. What 'multimodal representations' usually refer to in semantic cognition is not the co-occurrence of gesture and speech, but the different sources or modalities that inform the structure of a semantic representation or concept (not the fact we use another modality vision to perceive gestures that enrich the linguistic auditory communication of said concepts). See also my comment in the public review regarding the conceptual conflation of the graded hub hypothesis.

      Response 8: We aimed to clarify that the integration of gesture and speech, along with the unified representation it entails, is not merely a process whereby perceived gestures enhance speech comprehension. Rather, there exists a bidirectional influence between these two modalities, affecting both their external forms (Bernardis & Gentilucci, 2006) and their semantic content (Kita & Ozyurek, 2003; Kelly et al., 2010). Given that multisensory processing is recognized as an interplay of both top-down and bottom-up mechanisms, we hypothesize that this bidirectional semantic influence between gesture and speech operates similarly. Consequently, we recorded neural responses (specifically, the inhibitory effects observed through TMS/tDCS or ERP components) beginning at the onset of speech, which marks the moment when both modalities are accessible.

      We prioritize gesture for two primary reasons. Firstly, from a naturalistic perspective, speech and gesture are temporally aligned; gestures typically precede their corresponding speech segments by less than one second (Morrel-Samuels & Krauss, 1992). This temporal alignment has prompted extensive research aimed at identifying the time windows during which integration occurs (Obermeier et al., 2011; Obermeier & Gunter, 2015). Results indicate that local integration of gesture and speech occurs within a time frame extending from -200 ms to +120 ms relative to gesture-speech alignment, where -200 ms indicates that gestures occur 200 ms before speech onset, and +120 ms signifies gestures occurring after the identification point of speech.

      Secondly, in our previous study (Zhao, 2023), we investigated this phenomenon by manipulating gesture-speech alignment across two conditions: (1) gestures preceding speech by a fixed interval of 200 ms, and (2) gestures preceding speech at its semantic identification point. Notably, only in the second condition did we observe time-window-selective disruptions of the semantic congruency effect in the IFG and pMTG. This led us to conclude that gestures serve a semantic priming function for co-occurring speech.

      We recognize that our previous use of the term "co-occurring speech" may have led to ambiguity. Therefore, in the revised manuscript, we have replaced those sentences with a detailed description of the properties of each modality in Lines 60-62: ‘Even though gestures convey information in a global-synthetic way, while speech conveys information in a linear segmented way, there exists a bidirectional semantic influence between the two modalities[9,10]’

      The conceptual conflation of the graded hub hypothesis is addressed in Response 2 to Reviewer 3 (public review).

      References:

      Bernardis, P., & Gentilucci, M. (2006). Speech and gesture share the same communication system. Neuropsychologia, 44(2), 178-190

      Kelly, S. D., Ozyurek, A., & Maris, E. (2010b). Two sides of the same coin: speech and gesture mutually interact to enhance comprehension. Psychological Science, 21(2), 260-267. doi:10.1177/0956797609357327

      Kita, S., & Ozyurek, A. (2003). What does cross-linguistic variation in semantic coordination of speech and gesture reveal?: Evidence for an interface representation of spatial thinking and speaking. Journal of Memory and Language, 48(1), 16-32. doi:10.1016/s0749-596x(02)00505-3

      Obermeier, C., & Gunter, T. C. (2015). Multisensory Integration: The Case of a Time Window of Gesture-Speech Integration. Journal of Cognitive Neuroscience, 27(2), 292-307. doi:10.1162/jocn_a_00688

      Obermeier, C., Holle, H., & Gunter, T. C. (2011). What Iconic Gesture Fragments Reveal about Gesture-Speech Integration: When Synchrony Is Lost, Memory Can Help. Journal of Cognitive Neuroscience, 23(7), 1648-1663. doi:10.1162/jocn.2010.21498

      Morrel-Samuels, P., & Krauss, R. M. (1992). Word familiarity predicts temporal asynchrony of hand gestures and speech. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18(3), 615-622. doi:10.1037/0278-7393.18.3.615

      Hostetter, A., and Mainela-Arnold, E. (2015). Gestures occur with spatial and Motoric knowledge: It's more than just coincidence. Perspectives on Language Learning and Education 22, 42-49. doi:10.1044/lle22.2.42.

      McNeill, D. (2005). Gesture and thought (University of Chicago Press). 10.7208/chicago/9780226514642.001.0001.

      Zhao, W. (2023). TMS reveals a two-stage priming circuit of gesture-speech integration. Front Psychol 14, 1156087. 10.3389/fpsyg.2023.1156087.

      (9) The last paragraph of the introduction lacks a conductive thread. The authors describe three experiments without guiding the reader through a connecting thread underlying the experiments. Feels more like three disconnected studies than a targeted multi-experiment approach to solve a problem. What is each experiment contributing to? What is the 'grand question' or thread unifying these?

      Response 9: The present study introduced three experiments to explore the neural activity linked to the amount of information processed during multisensory gesture-speech integration. In Experiment 1, we observed that the extent of inhibition in the pMTG and LIFG was closely linked to the overlapping gesture-speech responses, as quantified by mutual information. Building on the established roles of the pMTG and LIFG in our previous study (Zhao et al., 2021, JN), we then expanded our investigation to determine whether the dynamic neural engagement between the pMTG and LIFG during gesture-speech processing was also associated with the quality of the information. This hypothesis was further validated through high-temporal resolution EEG, where we examined ERP components related to varying information qualities. Notably, we observed a close time alignment between the ERP components and the time windows of the TMS effects, which were associated with the same informational matrices in gesture-speech processing.

      Linkage of the three experiments has been clarified in the introduction in Lines 75-102: ‘

      To investigate the neural mechanisms underlying gesture-speech integration, we conducted three experiments to assess how neural activity correlates with distributed multisensory integration, quantified using information-theoretic measures of MI. Additionally, we examined the contributions of unisensory signals in this process, quantified through unisensory entropy. Experiment 1 employed high-definition transcranial direct current stimulation (HD-tDCS) to administer Anodal, Cathodal and Sham stimulation to either the IFG or the pMTG. HD-tDCS induces membrane depolarization with anodal stimulation and membrane hyperpolarization with cathodal stimulation[26], thereby increasing or decreasing cortical excitability in the targeted brain area, respectively. This experiment aimed to determine whether the overall facilitation (Anodal-tDCS minus Sham-tDCS) and/or inhibition (Cathodal-tDCS minus Sham-tDCS) of these integration hubs is modulated by the degree of gesture-speech integration, as measured by MI.

      Given the differential involvement of the IFG and pMTG in gesture-speech integration, shaped by top-down gesture predictions and bottom-up speech processing [23], Experiment 2 was designed to further assess whether the activity of these regions was associated with relevant informational matrices. Specifically, we applied inhibitory chronometric double-pulse transcranial magnetic stimulation (TMS) to specific temporal windows associated with integration processes in these regions[23], assessing whether the inhibitory effects of TMS were correlated with unisensory entropy or the multisensory convergence index (MI).

      Experiment 3 complemented these investigations by focusing on the temporal dynamics of neural responses during semantic processing, leveraging high-temporal event-related potentials (ERPs). This experiment investigated how distinct information contributors modulated specific ERP components associated with semantic processing. These components included the early sensory effects as P1 and N1–P2[27,28], the N400 semantic conflict effect[14,28,29], and the late positive component (LPC) reconstruction effect[30,31]. By integrating these ERP findings with results from Experiments 1 and 2, Experiment 3 aimed to provide a more comprehensive understanding of how gesture-speech integration is modulated by neural dynamics’

      References:

      Bikson, M., Inoue, M., Akiyama, H., Deans, J.K., Fox, J.E., Miyakawa, H., and Jefferys, J.G.R. (2004). Effects of uniform extracellular DC electric fields on excitability in rat hippocampal slices. J Physiol-London 557, 175-190. 10.1113/jphysiol.2003.055772.

      Federmeier, K.D., Mai, H., and Kutas, M. (2005). Both sides get the point: hemispheric sensitivities to sentential constraint. Memory & Cognition 33, 871-886. 10.3758/bf03193082.

      Kelly, S.D., Kravitz, C., and Hopkins, M. (2004). Neural correlates of bimodal speech and gesture comprehension. Brain and Language 89, 253-260. 10.1016/s0093-934x(03)00335-3.

      Wu, Y.C., and Coulson, S. (2005). Meaningful gestures: Electrophysiological indices of iconic gesture comprehension. Psychophysiology 42, 654-667. 10.1111/j.1469-8986.2005.00356.x.

      Fritz, I., Kita, S., Littlemore, J., and Krott, A. (2021). Multimodal language processing: How preceding discourse constrains gesture interpretation and affects gesture integration when gestures do not synchronise with semantic affiliates. J Mem Lang 117, 104191. 10.1016/j.jml.2020.104191.

      Gunter, T.C., and Weinbrenner, J.E.D. (2017). When to take a gesture seriously: On how we use and prioritize communicative cues. J Cognitive Neurosci 29, 1355-1367. 10.1162/jocn_a_01125.

      Ozyurek, A., Willems, R.M., Kita, S., and Hagoort, P. (2007). On-line integration of semantic information from speech and gesture: Insights from event-related brain potentials. J Cognitive Neurosci 19, 605-616. 10.1162/jocn.2007.19.4.605.

      Zhao, W., Li, Y., and Du, Y. (2021). TMS reveals dynamic interaction between inferior frontal gyrus and posterior middle temporal gyrus in gesture-speech semantic integration. The Journal of Neuroscience, 10356-10364. 10.1523/jneurosci.1355-21.2021.

      (10) The authors should provide a clearer figure to appreciate their paradigm, illustrating clearly the stimulus presentation (gesture and speech).

      Response 10: To reduce ambiguity, unnecessary arrows were deleted from Figure 1.

      Comment 11.1: (11) Required methodological clarifications to better assess the strength of the evidence presented:

      a) Were the exclusion criteria only handedness and vision? Did the authors exclude based on neurological and psychiatric disorders? Psychoactive drugs? If not, do they think the lack of these exclusion criteria might have influenced their results?

      Response 11.1: Upon registration, each participant is required to complete a questionnaire alongside the consent form and handedness questionnaire. This procedure is designed to exclude individuals with potential neurological or psychiatric disorders, as well as other factors that may affect their mental state or reaction times. Consequently, all participants reported in the manuscript do not have any of the aforementioned neurological or psychiatric disorders. The questionnaire is attached below:

      Author response image 4.

      Comment 11.2: b) Are the subjects from the pre-tests (L112-113) and the replication study (L107) a separate sample or did they take part in Experiments 1-3?

      Response 11.2: The participants in each pre-test and experiment were independent, resulting in a total of 188 subjects. Since the stimuli utilized in this study were previously validated and reported (Zhao et al., 2021), the 90 subjects who participated in the three pre-tests are not included in the final count for the current study, leaving a total of 98 participants reported in the manuscript in Lines 103-104: ‘Ninety-eight young Chinese participants signed written informed consent forms and took part in the present study’.

      Comment 11.3: c) L176. The authors should explain how they selected ROIs. This is very important for the reasons outlined above.

      Response 11.3: Please see Response to Comment 6 for details.

      Comment 11.4: d) The rationale for Experiment 1 and its analysis approach should be explicitly described. Why perform Pearson correlations? What is the conceptual explanation of the semantic congruency effect and why should it be expected to correlate with the three information-theoretic metrics? What effects could the authors expect to find and what would they mean? There is a brief description in L187-195 but it is unclear.

      Response 11.4: We thank the reviewer for their rigorous consideration. The semantic congruency effect is widely used as an index of multisensory integration. Therefore, the effects of HD-tDCS on the IFG and pMTG, as measured by changes in the semantic congruency effect, serve as an indicator of altered neural responses to multisensory integration. In correlating these changes with behavioral indices of information degree, we aimed to assess whether the integration hubs (IFG and pMTG) function progressively during multisensory gesture-speech integration. The rationale for using Pearson correlations is based on the hypothesis that the 20 sets of stimuli used in this study represent a sample from a normally distributed population. Thus, even with changes in the sample (e.g., using another 20 values), the gradual relationship between neural responses and the degree of information would remain unchanged. This hypothesis is supported by the findings from another experiment (see details in Response to Comment 4).

      In the revised manuscript, we have provided a clear description of the rationale for Experiment 1 in Lines 206-219: ‘To examine the relationship between the degree of information and neural responses, we conducted Pearson correlation analyses using a sample of 20 sets. Neural responses were quantified based on the effects of HD-tDCS (active tDCS minus sham tDCS) on the semantic congruency effect, defined as the difference in reaction times between semantic incongruent and congruent conditions (Rt(incongruent) - Rt(congruent)). This effect served as an index of multisensory integration[35] within the left IFG and pMTG. The variation in information was assessed using three information-theoretic metrics. To account for potential confounds related to multiple candidate representations, we conducted partial correlation analyses between the tDCS effects and gesture entropy, speech entropy, and MI, controlling for the number of responses provided for each gesture and speech, as well as the total number of combined responses. Given that HD-tDCS induces overall disruption at the targeted brain regions, we hypothesized that the neural activity within the left IFG and pMTG would be progressively affected by varying levels of multisensory convergence, as indexed by MI.’
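
      The partial correlation described in this passage can be illustrated with a short sketch. This is not the authors' analysis code: the function names (`pearson`, `partial_corr`) and the toy values are hypothetical, and the sketch uses the standard recursive formula for higher-order partial correlations rather than any particular statistics package.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, covariates):
    """Correlation of x and y after removing the linear effect of each covariate.

    Uses the recursion r_xy.Z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)),
    applied one covariate at a time.
    """
    if not covariates:
        return pearson(x, y)
    z, rest = covariates[0], covariates[1:]
    r_xy = partial_corr(x, y, rest)
    r_xz = partial_corr(x, z, rest)
    r_yz = partial_corr(y, z, rest)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical toy values (one entry per stimulus set), chosen so that the
# apparent MI/tDCS-effect relationship is driven entirely by the covariate.
n_responses = [1, 2, 3, 4, 5]   # number of responses given per stimulus
mi = [2, 1, 3, 3, 6]            # mutual information per stimulus
tdcs_effect = [2, 3, -1, 5, 6]  # (active - sham) change in congruency effect

raw = pearson(mi, tdcs_effect)
controlled = partial_corr(mi, tdcs_effect, [n_responses])
print(f"raw r = {raw:.3f}, partial r = {controlled:.3f}")
```

      In this toy case the raw correlation is positive but vanishes once the number of responses is controlled, which is the kind of confound the partial correlation is meant to rule out. Passing several covariates in the list controls for all of them together.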

      Additionally, in the introduction, we have rephrased the relevant rationale in Lines 75-86: ‘To investigate the neural mechanisms underlying gesture-speech integration, we conducted three experiments to assess how neural activity correlates with distributed multisensory integration, quantified using information-theoretic measures of MI. Additionally, we examined the contributions of unisensory signals in this process, quantified through unisensory entropy. Experiment 1 employed high-definition transcranial direct current stimulation (HD-tDCS) to administer Anodal, Cathodal and Sham stimulation to either the IFG or the pMTG. HD-tDCS induces membrane depolarization with anodal stimulation and membrane hyperpolarization with cathodal stimulation[26], thereby increasing or decreasing cortical excitability in the targeted brain area, respectively. This experiment aimed to determine whether the overall facilitation (Anodal-tDCS minus Sham-tDCS) and/or inhibition (Cathodal-tDCS minus Sham-tDCS) of these integration hubs is modulated by the degree of gesture-speech integration, as measured by MI.’

      Reference:

      Kelly, S.D., Creigh, P., and Bartolotti, J. (2010). Integrating speech and iconic gestures in a Stroop-like task: Evidence for automatic processing. Journal of Cognitive Neuroscience 22, 683-694. 10.1162/jocn.2009.21254.

      Comment 11.5: e) The authors do not mention in the methods if FDR correction was applied to the Pearson correlations in Experiment 1. There is a mention in the Results Figure, but it is unclear if it was applied consistently. Can the authors confirm, and explicitly state the way they carried out FDR correction for this family of tests in Experiment 1? This is especially important in the light of some of their results having a p-value of p=.049.

      Response 11.5: FDR correction was applied to Experiment 1, and all reported p-values were corrected using this method. In the revised manuscript, we have included a reference to FDR correction in Lines 221-222: ‘False discovery rate (FDR) correction was applied for multiple comparisons.’

      In Experiment 1, since two separate participant groups (each N = 26) were recruited for the HD-tDCS over either the IFG or pMTG, FDR correction was performed separately for each group. Therefore, for each brain region, six comparisons (three information matrices × two tDCS effects: anodal-sham or cathodal-sham) were submitted for FDR correction.

      In Experiment 2, six comparisons (three information matrices × two sites: IFG or pMTG) were submitted for FDR correction. In Experiment 3, FDR correction was applied to the seven regions of interest (ROIs) within each component, resulting in five comparisons.
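
      For readers unfamiliar with how a family of six comparisons is corrected, the following is a minimal sketch. The response does not name the FDR procedure, so the Benjamini-Hochberg step-up method is assumed here; the function name `benjamini_hochberg` and the p-values are hypothetical and are not taken from the manuscript.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values, in the original order."""
    m = len(pvals)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical family of six tests for one stimulation site:
# three information metrics x two tDCS contrasts.
raw_p = [0.004, 0.030, 0.210, 0.012, 0.049, 0.600]
adj_p = benjamini_hochberg(raw_p)
for p, q in zip(raw_p, adj_p):
    print(f"raw p = {p:.3f} -> adjusted p = {q:.3f}")
```

      Because the adjustment depends on the family size, running the correction separately for each participant group, as described above, gives different adjusted values than pooling all twelve tests.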

      The confidence of a p-value of 0.049 was clarified in Response to Comment 3.

      Comment 11.6: f) L200. What does the abbreviation 'TW' stand for in this paragraph? When was it introduced in the main text? The description is in the Figure, but it should be moved to the main text.

      Comment 11.7: g) How were the TWs chosen? Is it the criterion in L201-203? If so, it should be moved to the start of the paragraph. What does the word 'selected' refer to in that description? Selected for what? The explanation seems to be in the Figure, but it should be in the main text. It is still not a complete explanation. What were the criteria for assigning TWs to the IFG or pMTG?

      Response 11.6 & 11.7: Since the two comments are related, we will provide a synthesized response. 'TW' refers to time window, the selection of which was based on our previous study (Zhao et al., 2021, J. Neurosci). In Zhao et al. (2021), we employed the same experimental protocol, using inhibitory double-pulse transcranial magnetic stimulation (TMS) over the IFG and pMTG in one of eight 40-ms time windows relative to the speech identification point (IP; the minimal length of lexical speech), with three time windows before the speech IP and five after. Based on this previous work, we believe that these time windows encompass the potential gesture-speech integration process. Results demonstrated a time-window-selective disruption of the semantic congruency effect (i.e., reaction time costs driven by semantic conflict), with no significant modulation of the gender congruency effect (i.e., reaction time costs due to gender conflict), when stimulating the left pMTG in TW1, TW2, and TW7, and when stimulating the left IFG in TW3 and TW6. Based on these findings, the present study selected the five time windows that showed a selective disruption effect during gesture-speech integration.

      Note that in the present study, we applied stimulation to both the IFG and pMTG across all five time windows, and further correlated the TMS disruption effects with the three information matrices.

      We recognize that the rationale for the choice of time windows was not sufficiently explained in the original manuscript. In the revised manuscript, we have added the relevant description in Lines 223-228: ‘Stimulation was administered at three different sites (IFG, pMTG, or Vertex). Within the time windows (TWs) spanning the gesture-speech integration period, five TWs that exhibited selective disruption of integration were selected: TW1 (-120 to -80 ms relative to the speech identification point), TW2 (-80 to -40 ms), TW3 (-40 to 0 ms), TW6 (80 to 120 ms), and TW7 (120 to 160 ms)[23] (Figure 1C). The order of stimulation site and TW was counterbalanced using a Latin square design.’
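
      The counterbalancing mentioned in the quoted passage can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: it assumes the three sites and five TWs are fully crossed into 15 conditions, and it builds a simple cyclic Latin square rather than whichever variant was actually used. The function name `cyclic_latin_square` is hypothetical.

```python
from itertools import product

SITES = ["IFG", "pMTG", "Vertex"]
TIME_WINDOWS = ["TW1", "TW2", "TW3", "TW6", "TW7"]

def cyclic_latin_square(conditions):
    """Build an n x n Latin square by cyclically shifting the condition list.

    Row r gives the block order for participant r (modulo n); every condition
    appears exactly once in each row and once in each serial position.
    """
    n = len(conditions)
    return [[conditions[(row + col) % n] for col in range(n)] for row in range(n)]

# Assumed crossing: 3 sites x 5 time windows = 15 conditions.
conditions = list(product(SITES, TIME_WINDOWS))
orders = cyclic_latin_square(conditions)

# Block order for the first two (hypothetical) participants.
for participant, order in enumerate(orders[:2], start=1):
    print(participant, [f"{site}-{tw}" for site, tw in order[:4]], "...")
```

      A cyclic square balances serial position but not first-order carry-over between conditions; a balanced (Williams) design would be needed for that.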

      Comment 11.8: h) Again, the rationale for the Pearson correlations of semantic congruency with information-theoretic metrics should be explicitly outlined. What is this conceptually?

      Response 11.8: Given that the rationale behind Experiment 1 and Experiment 2 is similar—both investigating the correlation between interrupted neural effects and the degree of information—we believe that the introduction of the Pearson correlation between semantic congruency and information-theoretic metrics, as presented in Experiment 1 (see Response to Comment 11.4 for details), is sufficient for both experiments.

      Comment 11.9: i) What does 'gesture stoke' mean in the Figure referring to Experiment 3? Figure 1D is not clear. What are the arrows referring to?

      Response 11.9: According to McNeill (1992), gesture phases differ based on whether the gesture depicts imagery. Iconic and metaphoric gestures are imagistic and typically consist of three phases: a preparation phase, a stroke phase, and a retraction phase. Figure 4 provides an example of these three phases using the gesture ‘break’. In the preparation phase, the hand and arm move away from their resting position to a location in gesture space where the stroke begins. As illustrated in the first row of Figure 4, during the preparation phase of the ‘break’ gesture, the hands, initially in a fist and positioned downward, rise to a center-front position. In the stroke phase, the meaning of the gesture is conveyed. This phase occurs in the central gesture space and is synchronized with the linguistic segments it co-expresses. For example, in the stroke phase of the ‘break’ gesture (second row of Figure 4), the two fists move 90 degrees outward before returning to a face-down position. The retraction phase involves the return of the hand from the stroke position to the rest position. In the case of the ‘break’ gesture, this involves moving the fists from the center front back into the resting position (see third row of Figure 4).

      Therefore, in studies examining gesture-speech integration, gestures are typically analyzed starting from the stroke phase (Habets et al., 2011; Kelly et al., 2010), a convention also adopted in our previous studies (Zhao et al., 2018, 2021, 2023). We acknowledge that this should be explained explicitly, and in the revised manuscript, we have added the following clarification in Lines 162-166: ‘Given that gestures induce a semantic priming effect on concurrent speech[33], this study utilized a semantic priming paradigm in which speech onset was aligned with the DP of each gesture[23,33], the point at which the gesture transitions into a lexical form[34]. The gesture itself began at the stroke phase, a critical moment when the gesture conveys its primary semantic content[34].’

      Additionally, Figure 1 has been revised in the manuscript to eliminate ambiguous arrows (see Response 10 for details).

      Author response image 5.

      An illustration of the gesture phases of the 'break' gesture.

      References:

      Habets, B., Kita, S., Shao, Z. S., Ozyurek, A., & Hagoort, P. (2011). The Role of Synchrony and Ambiguity in Speech-Gesture Integration during Comprehension. Journal of Cognitive Neuroscience, 23(8), 1845-1854. doi:10.1162/jocn.2010.21462

      Kelly, S. D., Creigh, P., & Bartolotti, J. (2010). Integrating Speech and Iconic Gestures in a Stroop-like Task: Evidence for Automatic Processing. Journal of Cognitive Neuroscience, 22(4), 683-694. doi:10.1162/jocn.2009.21254

      Comment 11.10: j) L236-237: "Consequently, four ERP components were predetermined" is very confusing. Were these components predetermined? Or were they determined as a consequence of the comparison between the higher and lower halves for the IT metrics described above in the same paragraph? The description of the methods is not clear.

      Response 11.10: The components selected were based on a comparison between the higher and lower halves of the information metrics. By stating that these components were predetermined, we aimed to emphasize that the components used in our study are consistent with those identified in previous research on semantic processing. We acknowledge that the phrasing may have been unclear, and in the revised manuscript, we have provided a more explicit description in Lines 267-276: ‘To consolidate the data, we conducted both a traditional region-of-interest (ROI) analysis, with ROIs defined based on a well-established work[40], and a cluster-based permutation approach, which utilizes data-driven permutations to enhance robustness and address multiple comparisons.

      For the traditional ROI analysis, grand-average ERPs at electrode Cz were compared between the higher (≥50%) and lower (<50%) halves for gesture entropy (Figure 5A1), speech entropy (Figure 5B1), and MI (Figure 5C1). Consequently, four ERP components were determined: the P1 effect observed within the time window of 0-100 ms[27,28], the N1-P2 effect observed between 150-250ms[27,28], the N400 within the interval of 250-450ms[14,28,29], and the LPC spanning from 550-1000ms[30,31].’
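
      The split into higher and lower halves, followed by a paired comparison, can be sketched as below. This is an illustration only: it assumes that "≥50%" means at or above the median of the metric across stimuli, and that the paired t-test compares per-participant mean amplitudes between the two halves. The function names and toy numbers are hypothetical.

```python
import math
from statistics import mean, median, stdev

def median_split(values):
    """Return (lower, higher) index lists: below the median vs. at or above it."""
    cut = median(values)
    lower = [i for i, v in enumerate(values) if v < cut]
    higher = [i for i, v in enumerate(values) if v >= cut]
    return lower, higher

def paired_t(a, b):
    """Paired-samples t statistic and degrees of freedom for a minus b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Hypothetical entropy values for four stimuli.
entropy = [0.2, 0.9, 0.5, 0.7]
low_idx, high_idx = median_split(entropy)

# Hypothetical per-participant mean amplitudes (microvolts) for each half.
amp_high = [2.1, 1.8, 2.6, 2.4, 1.9]
amp_low = [1.7, 1.9, 2.0, 2.1, 1.5]
t, df = paired_t(amp_high, amp_low)
print(f"low = {low_idx}, high = {high_idx}, t({df}) = {t:.3f}")
```

      With ties at the median the two halves can be unequal in size, so a real analysis would need an explicit tie-handling rule.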

      Reference: Habets, B., Kita, S., Shao, Z.S., Ozyurek, A., and Hagoort, P. (2011). The Role of Synchrony and Ambiguity in Speech-Gesture Integration during Comprehension. J Cognitive Neurosci 23, 1845-1854. 10.1162/jocn.2010.21462.

      (12) In the Results section for Experiment 2 (L292-295), it is not clear what the authors mean when they mention that a more negative TMS effect represents a stronger interruption of the integration effect. If I understand correctly, the correlation reported for pMTG was for speech entropy, which does not represent integration (that would be MI).

      Response 12: Since the TMS effect was defined as active TMS minus Vertex TMS, the inhibitory TMS effect is inherently negative. A greater inhibitory TMS effect corresponds to a larger negative value, such that a more negative TMS effect indicates a stronger disruption of the integration process. We acknowledge that the previous phrasing was somewhat ambiguous. In the revised manuscript, we have rephrased the sentence as follows: ‘a larger negative TMS effect signifies a greater disruption of the integration process’ (Lines 342-343)
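
      The sign convention in this response can be made concrete with a small worked example. It is a sketch under assumptions: the TMS effect is taken to be the semantic congruency effect under active stimulation minus the same effect under Vertex stimulation, and the reaction times are hypothetical values in milliseconds.

```python
from statistics import mean

def congruency_effect(rt_incongruent, rt_congruent):
    """Semantic congruency effect: mean RT(incongruent) - mean RT(congruent)."""
    return mean(rt_incongruent) - mean(rt_congruent)

def tms_effect(active, vertex):
    """Active-site congruency effect minus Vertex (control) congruency effect.

    Each argument is a (rt_incongruent, rt_congruent) pair of RT lists.
    """
    return congruency_effect(*active) - congruency_effect(*vertex)

# Hypothetical reaction times (ms).
vertex = ([560, 580], [520, 540])  # control: congruency effect of 40 ms
active = ([555, 565], [540, 560])  # active site: congruency effect of 10 ms

effect = tms_effect(active, vertex)
print(f"TMS effect = {effect} ms")
```

      Here stimulation shrinks the congruency effect from 40 ms to 10 ms, so the TMS effect is -30 ms; a more negative value means a larger reduction relative to the control site.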

      Multisensory integration transcends simple data amalgamation, encompassing complex interactions at various hierarchical neural levels and the parallel detection and discrimination of raw data from each modality (Benetti et al., 2023; Meijer et al., 2019). Therefore, we regard the process of gesture-speech integration as involving both unisensory processing and multisensory convergence. The correlation of gesture and speech entropy reflects contributions from unisensory processing, while the mutual information (MI) index indicates the contribution of multisensory convergence during gesture-speech integration. The distinction between these various source contributions will be the focus of Experiment 2 and Experiment 3, as described in the revised manuscript Lines 87-102: ‘Given the differential involvement of the IFG and pMTG in gesture-speech integration, shaped by top-down gesture predictions and bottom-up speech processing [23], Experiment 2 was designed to further assess whether the activity of these regions was associated with relevant informational matrices. Specifically, we applied inhibitory chronometric double-pulse transcranial magnetic stimulation (TMS) to specific temporal windows associated with integration processes in these regions[23], assessing whether the inhibitory effects of TMS were correlated with unisensory entropy or the multisensory convergence index (MI).

      Experiment 3 complemented these investigations by focusing on the temporal dynamics of neural responses during semantic processing, leveraging high-temporal event-related potentials (ERPs). This experiment investigated how distinct information contributors modulated specific ERP components associated with semantic processing. These components included the early sensory effects as P1 and N1–P2[27,28], the N400 semantic conflict effect[14,28,29], and the late positive component (LPC) reconstruction effect[30,31]. By integrating these ERP findings with results from Experiments 1 and 2, Experiment 3 aimed to provide a more comprehensive understanding of how gesture-speech integration is modulated by neural dynamics’.  

      References:

      Benetti, S., Ferrari, A., and Pavani, F. (2023). Multimodal processing in face-to-face interactions: A bridging link between psycholinguistics and sensory neuroscience. Front Hum Neurosci 17, 1108354. 10.3389/fnhum.2023.1108354.

      Meijer, G.T., Mertens, P.E.C., Pennartz, C.M.A., Olcese, U., and Lansink, C.S. (2019). The circuit architecture of cortical multisensory processing: Distinct functions jointly operating within a common anatomical network. Prog Neurobiol 174, 1-15. 10.1016/j.pneurobio.2019.01.004.

      (13) I find the description of the results for Experiment 3 very hard to follow. Perhaps if the authors have decided to organise the main text by describing the components from earliest to latest, the Figure organisation should follow suit (i.e., organise the Figure from the earliest to the latest component, instead of gesture entropy/speech entropy / mutual information). This might make the description of the results easier to follow.

      Response 13: As suggested, we have reorganized the results of experiment 3 based on components from earliest to latest, together with an updated Figure 5.

      The results are detailed in Lines 367-423: ‘Topographical maps illustrating amplitude differences between the lower and higher halves of speech entropy demonstrate a central-posterior P1 amplitude (0-100 ms, Figure 5B). Aligning with prior findings[27], the paired t-tests demonstrated a significantly larger P1 amplitude within the ML ROI (t(22) = 2.510, p = 0.020, 95% confidence interval (CI) = [1.66, 3.36]) when contrasting stimuli with higher 50% speech entropy against those with lower 50% speech entropy (Figure 5D1 left). Subsequent correlation analyses unveiled a significant increase in the P1 amplitude with the rise in speech entropy within the ML ROI (r = 0.609, p = 0.047, 95% CI = [0.039, 1.179], Figure 5D1 right). Furthermore, a cluster of neighboring time-electrode samples exhibited a significant contrast between the lower 50% and higher 50% of speech entropy, revealing a P1 effect spanning 16 to 78 ms at specific electrodes (FC2, FCz, C1, C2, Cz, and CPz, Figure 5D2 middle) (t(22) = 2.754, p = 0.004, 95% confidence interval (CI) = [1.65, 3.86], Figure 5D2 left), with a significant correlation with speech entropy (r = 0.636, p = 0.035, 95% CI = [0.081, 1.191], Figure 5D2 right).

      Additionally, topographical maps comparing the lower 50% and higher 50% gesture entropy revealed a frontal N1-P2 amplitude (150-250 ms, Figure 5A). In accordance with previous findings on bilateral frontal N1-P2 amplitude[27], paired t-tests displayed a significantly larger amplitude for stimuli with lower 50% gesture entropy than with higher 50% entropy in both ROIs of LA (t(22) = 2.820, p = 0.011, 95% CI = [2.21, 3.43]) and RA (t(22) = 2.223, p = 0.038, 95% CI = [1.56, 2.89]) (Figure 5E1 left). Moreover, a negative correlation was found between N1-P2 amplitude and gesture entropy in both ROIs of LA (r = -0.465, p = 0.039, 95% CI = [-0.87, -0.06]) and RA (r = -0.465, p = 0.039, 95% CI = [-0.88, -0.05]) (Figure 5E1 right). Additionally, through a cluster-permutation test, the N1-P2 effect was identified between 184 and 202 ms at electrodes FC4, FC6, C2, C4, C6, and CP4 (Figure 5E2 middle) (t(22) = 2.638, p = 0.015, 95% CI = [1.79, 3.48], Figure 5E2 left), exhibiting a significant correlation with gesture entropy (r = -0.485, p = 0.030, 95% CI = [-0.91, -0.06], Figure 5E2 right).

      Furthermore, in line with prior research[42], a left-frontal N400 amplitude (250-450 ms) was discerned from topographical maps of gesture entropy (Figure 5A). Specifically, stimuli with lower 50% values of gesture entropy elicited a larger N400 amplitude in the LA ROI compared to those with higher 50% values (t(22) = 2.455, p = 0.023, 95% CI = [1.95, 2.96], Figure 5F1 left). Concurrently, a negative correlation was noted between the N400 amplitude and gesture entropy (r = -0.480, p = 0.032, 95% CI = [-0.94, -0.03], Figure 5F1 right) within the LA ROI. The identified clusters showing the N400 effect for gesture entropy (282 – 318 ms at electrodes FC1, FCz, C1, and Cz, Figure 5F2 middle) (t(22) = 2.828, p = 0.010, 95% CI = [2.02, 3.64], Figure 5F2 left) also exhibited a significant correlation between the N400 amplitude and gesture entropy (r = -0.445, p = 0.049, 95% CI = [-0.88, -0.01], Figure 5F2 right).

      Similarly, a left-frontal N400 amplitude (250-450 ms) [42] was discerned from topographical maps for MI (Figure 5C). A larger N400 amplitude in the LA ROI was observed for stimuli with lower 50% values of MI compared to those with higher 50% values (t(22) = 3.00, p = 0.007, 95% CI = [2.54, 3.46], Figure 5G1 left). This was accompanied by a significant negative correlation between N400 amplitude and MI (r = -0.504, p = 0.028, 95% CI = [-0.97, -0.04], Figure 5G1 right) within the LA ROI. The N400 effect for MI, observed in the 294–306 ms window at electrodes F1, F3, Fz, FC1, FC3, FCz, and C1 (Figure 5G2 middle) (t(22) = 2.461, p = 0.023, 95% CI = [1.62, 3.30], Figure 5G2 left), also showed a significant negative correlation with MI (r = -0.569, p = 0.011, 95% CI = [-0.98, -0.16], Figure 5G2 right).

      Finally, consistent with previous findings[30], an anterior LPC effect (550-1000 ms) was observed in topographical maps comparing stimuli with lower and higher 50% speech entropy (Figure 5B). The reduced LPC amplitude was evident in the paired t-tests conducted in ROIs of LA (t(22) = 2.614, p = 0.016, 95% CI = [1.88, 3.35]); LC (t(22) = 2.592, p = 0.017, 95% CI = [1.83, 3.35]); RA (t(22) = 2.520, p = 0.020, 95% CI = [1.84, 3.24]); and ML (t(22) = 2.267, p = 0.034, 95% CI = [1.44, 3.10]) (Figure 5H1 left). Simultaneously, a marked negative correlation with speech entropy was evidenced in ROIs of LA (r = -0.836, p = 0.001, 95% CI = [-1.26, -0.42]); LC (r = -0.762, p = 0.006, 95% CI = [-1.23, -0.30]); RA (r = -0.774, p = 0.005, 95% CI = [-1.23, -0.32]) and ML (r = -0.730, p = 0.011, 95% CI = [-1.22, -0.24]) (Figure 5H1 right). Additionally, a cluster with the LPC effect (644 - 688 ms at electrodes Cz, CPz, P1, and Pz, Figure 5H2 middle) (t(22) = 2.754, p = 0.012, 95% CI = [1.50, 4.01], Figure 5H2 left) displayed a significant correlation with speech entropy (r = -0.699, p = 0.017, 95% CI = [-1.24, -0.16], Figure 5H2 right).’
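      For readers less familiar with the two statistics reported throughout the passage above, the following is a minimal sketch of a paired-samples t-test and a Pearson correlation, written with the Python standard library only. It is an illustration rather than the authors' analysis pipeline: the function names `paired_t` and `pearson_r` and the example amplitudes are hypothetical. Note that a statistic reported as t(22) has 22 degrees of freedom, which corresponds to 23 paired observations.

```python
import math
from statistics import mean, stdev


def paired_t(a, b):
    """Return (t, df) for a paired-samples t-test on two equal-length sequences."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length sequences with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    se = stdev(diffs) / math.sqrt(n)  # standard error of the mean difference
    if se == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean(diffs) / se, n - 1


def pearson_r(x, y):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical per-participant mean amplitudes (microvolts) for the two
    # halves of a median split on entropy; not data from the study.
    lower_half = [2.1, 3.4, 2.8, 3.9, 2.5]
    higher_half = [1.7, 3.0, 2.9, 3.1, 2.0]
    t, df = paired_t(lower_half, higher_half)
    print(f"t({df}) = {t:.3f}")

    # Hypothetical per-stimulus entropy values and amplitude differences.
    entropy_values = [0.5, 1.0, 1.5, 2.0, 2.5]
    amplitude_diff = [1.9, 1.6, 1.2, 1.1, 0.6]
    print(f"r = {pearson_r(entropy_values, amplitude_diff):.3f}")
```

      The sketch returns only the test statistic and degrees of freedom; p-values and confidence intervals would require the t distribution, which a statistics package would normally supply.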

      (14) In the Discussion (L394 - 395) the authors mention for the first time their task being a semantic priming paradigm. This idea of the task as a semantic priming paradigm allowing top-down prediction of gesture over speech should be presented earlier in the paper, perhaps during the final paragraph of the introduction (as part of the rationale) or during the explanation of the task. The authors mention top-down influences earlier and this is impossible to understand before this information about the paradigm is presented. It would also make the reading of the paper significantly clearer. Critically, an appropriate description of the paradigm is missing in the Methods (what are the subjects asked to do? It states that it replicates an effect in Ref 28, but this manuscript does not contain a clear description of the task). To further complicate things, the 'Experimental Procedure' section of the methods states this is a semantic priming paradigm of gestures onto speech (L148) and proceeds to provide two seemingly irrelevant references (for example, the Pitcher reference is to a study that employed faces and houses as stimuli). How is this a semantic priming paradigm? The study where I found the first mention of this paradigm seems to clearly classify it as a Stroop-like task (Kelly et al, 2010).

      Response 14: We appreciate the reviewer’s thorough consideration. The experimental paradigm employed in the current study differs from the Stroop-like task utilized by Kelly et al. (2010). In their study, the video presentation started with the stroke phase of the gesture, while speech occurred 200 ms after the gesture onset.

      As detailed in our previous study (Zhao et al., 2023, Frontiers in Psychology), we confirmed the semantic predictive role of gestures in relation to speech by contrasting two experimental conditions: (1) gestures preceding speech by a fixed 200 ms interval, and (2) gestures preceding speech at the semantic identification point of the gesture. Our findings revealed time-window-selective disruptions in the semantic congruency effect in the IFG and pMTG, but only in the second condition, suggesting that gestures exert a semantic priming effect on concurrent speech.

      This work reinforced the semantic priming role of gestures in speech integration reported in Zhao et al. (2021, Journal of Neuroscience). In that study, a comparable approach was adopted by segmenting speech into eight 40-ms time windows based on the speech discrimination point, while manipulating the speech onset to align with the gesture identification point. The results revealed time-window-selective disruptions in the semantic congruency effect, providing support for the dynamic and temporally staged roles of the IFG and pMTG in gesture-speech integration.

      Given that the present study follows the same experimental procedure as our prior work (Zhao et al., 2021, Journal of Neuroscience; Zhao et al., 2023, Frontiers in Psychology), we refer to this design as a "semantic priming" of gesture upon speech. We agree with the reviewer that a detailed description should be provided earlier in the manuscript. To address this, we have added a more explicit description of the semantic priming paradigm to the Methods section of the revised manuscript in Lines 162-166: ‘Given that gestures induce a semantic priming effect on concurrent speech[33], this study utilized a semantic priming paradigm in which speech onset was aligned with the DP of each gesture[23,33], the point at which the gesture transitions into a lexical form[34]. The gesture itself began at the stroke phase, a critical moment when the gesture conveys its primary semantic content [34].’

      The task participants completed was outlined immediately following the explanation of the experimental paradigm: ‘Gesture–speech pairs were presented randomly using Presentation software (www.neurobs.com). Participants were asked to look at the screen but respond with both hands as quickly and accurately as possible merely to the gender of the voice they heard’ (Lines 177-180).

      Wrongly cited references have been corrected.

      (15) L413-417: How do the authors explain that they observe this earlier ERP component and TMS effect over speech and a later one over gesture in pMTG when in their task they first presented gesture and then speech? Why mention STG/S when they didn't assess this?

      (19) L436-440: This paragraph renders the timing of the findings represented in Figure 6 even more confusing. If gesture precedes speech in the paradigm, why are the first TMS and ERP results observed in speech?

      Responses 15 & 19: Since these two comments are closely related, we address them together. Although gestures were presented before speech, the integration process occurs once both modalities are available. Consequently, ERP and TMS measurements were taken after speech onset to capture the integration of the two modalities. Neural responses were used as the dependent variable to reflect the degree of integration—specifically, gesture-speech semantic congruency in the TMS study and high-low semantic variance in the ERP study. Therefore, the observed early effect can be interpreted as an interaction between the top-down influence of gesture and the bottom-up processing of speech.

      To isolate the pure effect of gesture, neural activity would need to be recorded from gesture onset. However, if one aims to associate the strength of neural activity with the degree of gesture information, recording from the visual processing areas would be more appropriate.

      To avoid unnecessary ambiguity, the phrase "involved STG/S" has been removed from the manuscript.

      (16) L427-428: I find it hard to believe that MI, a behavioural metric, indexes the size of overlapped neural populations activated by gesture and speech. The authors should be careful with this claim or provide evidence in favour.

      Response 16: Mutual information (MI) is a behavioral metric that indexes the distribution of overlapping responses between gesture and speech (for further details, please see the Response to Comment 1). In the present study, MI was correlated with neural responses evoked by gesture and speech, with the goal of demonstrating that neural activity progressively reflects the degree of information conveyed, as indexed by MI.
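      As a reminder of what these information-theoretic quantities measure, the sketch below computes Shannon entropy and mutual information from categorical responses using the textbook definitions H(X) = -Σ p(x) log2 p(x) and I(X;Y) = H(X) + H(Y) - H(X,Y). This is an assumption-laden illustration: the exact computation used in the manuscript is described in the response to Comment 1, which is not reproduced here, and the function names and response labels below are hypothetical.

```python
import math
from collections import Counter


def entropy(responses):
    """Shannon entropy (bits) of the empirical distribution of a list of labels."""
    counts = Counter(responses)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mutual_information(x, y):
    """Mutual information (bits) of paired labels: H(X) + H(Y) - H(X, Y)."""
    if len(x) != len(y):
        raise ValueError("x and y must contain the same number of observations")
    joint = list(zip(x, y))  # each pair is treated as one joint outcome
    return entropy(x) + entropy(y) - entropy(joint)


if __name__ == "__main__":
    # Hypothetical naming responses to one gesture and to one spoken word.
    gesture_labels = ["cut", "cut", "saw", "chop", "cut", "saw"]
    speech_labels = ["cut", "cut", "saw", "saw", "cut", "saw"]
    print(f"H(gesture) = {entropy(gesture_labels):.3f} bits")
    print(f"H(speech)  = {entropy(speech_labels):.3f} bits")
    print(f"MI         = {mutual_information(gesture_labels, speech_labels):.3f} bits")
```

      Under these definitions, higher entropy means responses are spread across more alternatives, and higher MI means the two sets of responses overlap more.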

      (17) Why would you have easier integration (reduced N400) with larger gesture entropy in IFG (Figure 6(3))? Wouldn't you expect more difficult processing if entropy is larger?

      (18) L431-432: The claim that IFG stores semantic information is controversial. The authors provide two references from the early 2000s that do not offer support for this claim (the IFG's purported involvement according to these is in semantic unification, not storage).

      Responses 17 & 18: As outlined in the Responses to Comment 1 of the public review, we have re-explained the IFG as a semantic control region. Additionally, we have clarified the role of the IFG in relation to the various stages of gesture-speech integration in Lines 533-538: ‘Last, the activated speech representation would disambiguate and reanalyze the semantic information and further unify into a coherent comprehension in the pMTG[12,37]. As speech entropy increases, indicating greater uncertainty in the information provided by speech, more cognitive effort is directed towards selecting the targeted semantic representation. This leads to enhanced involvement of the IFG and a corresponding reduction in LPC amplitude.’

      (20) Overall, the grammar makes some parts of the discussion hard to follow (e.g. the limitation in L446-447: 'While HD tDCS and TMS may impact functionally and anatomically connected brain regions, the graded functionality of every disturbed period is not guaranteed')

      Response 20: A clearer description has been provided in the revised manuscript in Lines 552-557: ‘Additionally, not all influenced TWs exhibited significant associations with entropy and MI. While HD-tDCS and TMS may impact functionally and anatomically connected brain regions[55,56], whether the absence of influence in certain TWs can be attributed to compensation by other connected brain areas, such as angular gyrus[57] or anterior temporal lobe[58], warrants further investigation. Therefore, caution is needed when interpreting the causal relationship between inhibition effects of brain stimulation and information-theoretic metrics (entropy and MI).’

      References:

      Hartwigsen, G., Bzdok, D., Klein, M., Wawrzyniak, M., Stockert, A., Wrede, K., Classen, J., and Saur, D. (2017). Rapid short-term reorganization in the language network. Elife 6. 10.7554/eLife.25964.

      Jackson, R.L., Hoffman, P., Pobric, G., and Ralph, M.A.L. (2016). The semantic network at work and rest: Differential connectivity of anterior temporal lobe subregions. Journal of Neuroscience 36, 1490-1501. 10.1523/JNEUROSCI.2999-15.2016

      Humphreys, G. F., Lambon Ralph, M. A., & Simons, J. S. (2021). A Unifying Account of Angular Gyrus Contributions to Episodic and Semantic Cognition. Trends in neurosciences, 44(6), 452–463. https://doi.org/10.1016/j.tins.2021.01.006

      Bonner, M. F., & Price, A. R. (2013). Where is the anterior temporal lobe and what does it do?. The Journal of neuroscience : the official journal of the Society for Neuroscience, 33(10), 4213–4215. https://doi.org/10.1523/JNEUROSCI.0041-13.2013

      (21) Inconsistencies between terminology employed in Figures and main text (e.g., pre-test study in text, gating study in Figure?)

      Response 21: Consistency has been ensured by changing ‘gating study’ to ‘pre-tests’ in Figure 1 (Line 758).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      Lejeune et al. demonstrated sex-dependent differences in the susceptibility to MRSA infection. The authors demonstrated the role of the microbiota and sex hormones as potential determinants of susceptibility. Moreover, the authors showed that Th17 cells and neutrophils contribute to sex hormone-dependent protection in female mice.

      Strengths:

      The role of microbiota was examined in various models (gnotobiotic, co-housing, microbiota transplantation). The identification of responsible immune cells was achieved using several genetic knockouts and cell-specific depletion models. The involvement of sex hormones was clarified using ovariectomy and the FCG model.

      Weaknesses:

      The mechanisms by which specific microbiota confer female-specific protection remain unclear.

      We thank the reviewer for highlighting the strengths of the manuscript, including the models and techniques we employ. We agree that the relationship between the microbiota and sex-dependent protection is less developed compared with other aspects of the study. As detailed below, we are attempting to identify specific microbes that confer female-specific protection and their links with sex hormones. We have promising but preliminary results. Thus, in our revised manuscript, we added new data on the host response as suggested by the detailed comments from the Reviewers. We also elaborate on the potential role of the microbiota in the discussion section.

      Reviewer #1 (Recommendations for the authors):

      (1) The authors nicely showed that the transfer of the protective phenotype by FMT requires the female sex in recipients (Figure 2E). However, it remains unclear whether the female sex is required to develop protective microbiota in donor mice, as only the female NYU donor-male Jax recipient combination was tested. What happens if the microbiota from male NYU mice is transplanted into female Jax mice? If sex hormones act only on the downstream of the microbiota, such mice would show the protective phenotype. However, if sex hormones are required to establish a protective microbiota, the transplantation of microbiota from male NYU mice will not confer protection in recipient female Jax mice.

      The Reviewer’s comment is well taken. We have not conducted the suggested experiment of FMT from male NYU mice to JAX female mice yet because we are pursuing an in vitro approach that we hope will eventually provide a more definitive answer. We observed that stool from female NYU mice and not JAX mice inhibits MRSA when cultured under anaerobic conditions, and this inhibitory activity is eliminated by filtration (Author response image 1A). We also observed that stool from male NYU mice inhibits MRSA growth to a similar extent as stool from female NYU mice (Author response image 1B). This result suggests that the protective role of sex hormones is downstream of the microbiota. We are in the process of identifying the specific microbiota member to support this conclusion.

      Author response image 1.

      Stool from NYU mice inhibits MRSA growth in vitro. (A) MRSA CFU/mL in media (TSB) following culture with unfiltered or filtered stool homogenate from female NYU or JAX mice. Stool homogenate or TSB alone was added in a 1:1 ratio to 1x10<sup>6</sup> CFU/mL MRSA and cultured anaerobically for up to 24 hours. (B) MRSA CFU/mL in TSB following culture with unfiltered stool homogenate from NYU male or female mice. Stool homogenate or TSB alone was added in a 1:1 ratio to 1x10<sup>6</sup> CFU/mL MRSA. 3 experimental replicates performed; stool taken from 6 individual mice per condition. Mean MRSA burden ± SEM. Area under the curve analysis and one-way ANOVA with Sidak’s multiple comparisons test. ns: not significant.
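      The area-under-the-curve summary named in the legend above reduces each growth curve to a single number that can then be compared across conditions. The following is a minimal sketch of that reduction using the trapezoidal rule; the sampling times, the log10 transformation, and the CFU values are hypothetical and are not taken from the experiment, and the helper name `auc_trapezoid` is an illustrative choice.

```python
def auc_trapezoid(times, values):
    """Area under a sampled curve using the trapezoidal rule.

    `times` must be strictly increasing and the same length as `values`.
    """
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need at least two (time, value) samples of equal length")
    if any(t1 <= t0 for t0, t1 in zip(times, times[1:])):
        raise ValueError("times must be strictly increasing")
    return sum(
        (t1 - t0) * (v0 + v1) / 2
        for t0, t1, v0, v1 in zip(times, times[1:], values, values[1:])
    )


if __name__ == "__main__":
    # Hypothetical sampling times (hours) and log10 CFU/mL for two conditions.
    hours = [0, 6, 24]
    tsb_alone = [6.0, 7.5, 9.0]
    with_stool = [6.0, 5.0, 4.0]
    print("AUC, TSB alone: ", auc_trapezoid(hours, tsb_alone))
    print("AUC, with stool:", auc_trapezoid(hours, with_stool))
```

      One AUC value per replicate could then be entered into a one-way ANOVA with a multiple-comparisons correction, as the legend describes.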

      (2) The results clearly showed the involvement of the specific microbiota in NYU mice in the sex-dependent bias in susceptibility to MRSA. However, the mechanisms by which specific microbiota promotes female sex-mediated protection need to be better described. Is this simply attributed to the different Th17 cell numbers in NYU and Jax mice (i.e., increased commensal-specific Th17 cells in NYU like Taconic mice)? Or is it possible that NYU microbiota impacts the regulation of sex hormones or their downstream signaling? What about the level of sex hormones in NYU and Jax mice? Are these levels equivalent or different? Do NYU and Jax microbiotas regulate the expression of sex hormone receptors in immune cells differently?

      These are great questions. We do not observe baseline differences in Th17 cells like those between JAX and Taconic mice (Figure 5B), suggesting that the mechanism is different. However, it is quite possible that antigen-specific T cells, or Th17 cells specifically, are present at low levels and expand rapidly upon MRSA colonization. We have added this possibility to the discussion in the revised manuscript. To address the Reviewer’s question about the effect of the microbiota on sex hormones, we first sought to determine which sex hormone is necessary. Using estrogen receptor knockouts (Esr1<sup>-/-</sup>), we were able to implicate estrogen and have added this important finding to the manuscript (Fig. 6C). Then, we measured levels of estradiol in stool samples but did not observe a difference between NYU and JAX female mice (Author response image 2). We provide the results below but did not add them to the revised manuscript because we found it difficult to draw a conclusion without more extensive profiling as well as quantification of the receptor on specific immune cell subsets and cell-type specific knockouts. Also, see our response to Reviewer #3 regarding receptor expression. Although we have yet to explain the role of the microbiota, we hope the Reviewer agrees that we have promising yet preliminary results and that the new experiments we added to the manuscript have further strengthened the mechanism on the host side.

      Author response image 2.

      Estradiol levels in stool samples prior to MRSA inoculation. (A) Estradiol levels in stool samples collected prior to MRSA inoculation in male and female mice bred at NYU or purchased from Jackson Labs. Frozen stool samples were normalized by weight and processed using the DetectX® Estradiol ELISA Kit (Arbor Assays).

      (3) The authors claimed that Th17-mediated recruitment of neutrophils likely promotes the clearance of MRSA in female NYU mice. However, the experimental evidence supporting this claim could be stronger. The authors should show the neutrophil recruitment in the gut mucosa in female and male NYU mice. Also, the levels of neutrophils between NYU and Jax female mice should be examined. To further strengthen the link between Th17 and neutrophils, it would be ideal to analyze neutrophil recruitment in mice lacking Th17 cells (i.e., Rag2-/-, anti-CD4 treated, Rorgt-/- mice).

      We agree and now include a more detailed analysis of neutrophils. We found that the number of neutrophils in the intestine was not higher in NYU female mice compared with NYU male mice, with or without MRSA. Instead, we show that neutrophils in NYU female mice display higher levels of surface CD11b, a sign of activation, compared to males following inoculation with MRSA. We have added these findings to the revised manuscript (Fig. 5H and I). IL-17 can activate neutrophils and increase their antimicrobial activity. Consistent with this possibility, we now show that female mice lacking the IL-17 receptor lose the enhanced colonization resistance. Based on these findings, we have modified this aspect of the conclusion, and thank the reviewer for the helpful suggestion.

      Reviewer #2 (Public review):

      The current study by Lejeune et al. investigates factors that allow for persistent MRSA infection in the GI tract. They developed an intriguing model of intestinal MRSA infection that does not use the traditional antibiotic approach, thereby allowing for a more natural infection that includes the normal intestinal microbiota. This model is more akin to what might be expected to be observed in a healthy human host. They find that biological sex plays a clear role in bacterial persistence during infection but only in mice bred at an NYU Facility and not those acquired from Jackson Labs. This clearly indicates a role for the intestinal microbiome in affecting female bacterial persistence but not male persistence which was unaffected by the origin of the mice and thus the microbiome. Through a series of clever microbiome-specific transfer experiments, they determine that the NYU-specific microbiome plays a role in this sexual dimorphism but is not solely responsible. Additional experiments indicate that Th17 cells, estrogen, and neutrophils also participate in the resistance to persistent infection. Notably, they assess the role of sex chromosomes (X/Y) using the established four core genotype model and find that these chromosomes appear to play little role in bacterial persistence.

      Overall, the paper nicely adds to the growing body of literature investigating how biological sex impacts the immune system and the burden of infectious disease. The conclusions are mostly supported by the data although there are some aspects of the data that could be better addressed and clarified.

      We thank the Reviewer for appreciating our contribution and for these supportive comments. We have added several experiments to fill in gaps and made text revisions to increase clarity and acknowledge limitations.

      (1) There is something of a disconnect between the initial microbiome data and the later data that analyzes sex hormones and chromosomes. While there are clearly differences in microbial species across the two sites (NYU and JAX) how these bacterial species might directly interact with immune cells to induce female-specific responses is left unexplored. At the very least it would help to try and link these two distinct pieces of data to try and inform the reader how the microbiome is regulating the sex-specific response. Indeed, the reader is left with no clear exploration of the microbiota's role in the persistence of the infection and thus is left wanting.

      We agree. This comment is similar to Reviewer #1’s feedback. As mentioned above, we are attempting to clarify the association between sex differences and the microbiota and have included preliminary results for the Reviewers. However, addressing this disconnect will require substantially more investigation. Instead, we have added insightful new data that elaborate on aspects of the host response. We hope the Reviewer agrees that the revised manuscript is stronger and that further delineation of the microbiota can be addressed by future studies.

      (2) While the authors make a reasonable case that Th17 T cells are important for controlling infection (using RORgt knockout mice that cannot produce Th17 cells), it is not clear how these cells even arise during infection since the authors make most of the observations 2 days post-infection, which is long before a normal adaptive immune response would be expected to arise. The authors acknowledge this, but their explanation is incomplete. The increase in Th17 cells they observe is predicated on mitogenic stimulation, so they are not specific (at least in this study) for MRSA. It would be helpful to see a specific restimulation of these cells with MRSA antigens to determine if there are pre-existing, cross-reactive Th17 cells specific for MRSA and microbiota species which could then link these two as mentioned above.

      We acknowledge that this is a limitation of our study. Although an experiment demonstrating pre-existing, cross-reactive T cells would help support our conclusion, aspects of MRSA biology may make the results of this experiment difficult to interpret. We have consulted with an expert on MRSA virulence factors, co-lead author Dr. Victor Torres, about the feasibility of this experiment. MRSA possesses superantigens, such as Staphylococcal enterotoxin B, which bind directly to specific Vβ regions of T-cell receptors (TCR) and major histocompatibility complex (MHC) class II on antigen-presenting cells, resulting in hyperactivation of T lymphocytes and monocytes/macrophages. Additionally, other MRSA virulence factors, such as α-hemolysin and LukED, induce cell death of lymphocytes. MRSA’s enterotoxins are heat stable, so heat-inactivation of the bacterium may not help in this matter. For these reasons, it is unlikely that we can perform a simple restimulation of lymphocytes with MRSA antigens.

      A study by Shao et al. provides an example of a host commensal species inducing Th17 cells with cross-reactivity against MRSA. Upon intestinal colonization, the intestinal fungus Candida albicans influences T cell polarization towards a Th17 phenotype in the spleen and peripheral lymph nodes which provided protection to the host against systemic candidemia. Interestingly, this induction of protective Th17 cells, increased IL-17 and responsiveness in circulating Ly6G+ neutrophils also protected mice from intravenous infection with MRSA, indicating that T cell activation and polarization by intestinal C. albicans leads to non-specific protective responses against extracellular pathogens.

      Shao TY, Ang WXG, Jiang TT, Huang FS, Andersen H, Kinder JM, Pham G, Burg AR, Ruff B, Gonzalez T, Khurana Hershey GK, Haslam DB, Way SS. Commensal Candida albicans Positively Calibrates Systemic Th17 Immunological Responses. Cell Host & Microbe. 2019 Mar 13;25(3):404-417.e6. doi: 10.1016/j.chom.2019.02.004. PMID: 30870622; PMCID: PMC6419754.

      We have added a brief version of the above discussion to the revised manuscript. Also, as mentioned earlier, we have added new data strengthening the axis between Th17 and neutrophils, including showing that the IL-17 receptor is necessary and that neutrophils display signs of heightened activation in female mice during MRSA colonization.

      (3) The ovariectomy experiment demonstrates a role for ovarian hormones; however, it lacks a control of adding back ovarian hormones (or at least estrogen) so it is not entirely obvious what is causing the persistence in this experiment. This is especially important considering the experiments demonstrating no role for sex chromosomes thus demonstrating that hormonal effects are highly important. Here it leaves the reader without a conclusive outcome as to the exact hormonal mechanism.

      This is a great suggestion. Rather than adding back ovarian hormones, we performed the more direct experiment and tested whether the estrogen receptor (ERα, encoded by Esr1) is necessary for the enhanced colonization resistance. Indeed, we observed that Esr1<sup>-/-</sup> female mice have increased MRSA burden compared to Esr1<sup>+/-</sup> littermates. We have added this new result (Figure 6C) and thank the Reviewer for their guidance. 

      (4) The discussion is underdeveloped and is mostly a rehash of the results. It would greatly enhance the manuscript if the authors would more carefully place the results in the context of the current state of the field including a more enhanced discussion of the role of estrogen, microbiome, and T cells and how the field might predict these all interact and how they might be interacting in the current study as well.

      We thank the Reviewer for their feedback on improving the scholarship of the manuscript. We have expanded on the literature and the mechanistic model in both the discussion section and other parts to provide better context for our findings.

      Reviewer #3 (Public review):

      Summary:

      Using a mouse model of Staphylococcus aureus gut colonization, Lejeune et al. demonstrate that the microbiome, immune system, and sex are important contributing factors for whether this important human pathogen persists in the gut. The work begins by describing differential gut clearance of S. aureus in female B6 mice bred at NYU compared to those from Jackson Laboratories (JAX). NYU female mice cleared S. aureus from the gut but NYU male mice and mice of both sexes from JAX exhibited persistent gut colonization. Further experimentation demonstrated that differences between staphylococcal gut clearance in NYU and JAX female mice were attributed to the microbiome. However, NYU male and female mice harbor similar microbiomes, supporting the conclusion that the microbiome cannot account for the observed sex-dependent clearance of S. aureus gut colonization. To identify factors responsible for female clearance of S. aureus, the authors performed RNAseq on intestinal epithelial cells and cells enriched within the lamina propria. This analysis revealed sex-dependent transcriptional responses in both tissues. Genes associated with immune cell function and migration were distinctly expressed between the sexes. To determine which immune cell types contribute to S. aureus clearance Lejeune et al. employed genetic and antibody-mediated immune cell depletion. This experiment demonstrated that CD4+ IL17+ cells and neutrophils promote the elimination of S. aureus from the gut. Subsequent experiments, including the use of the 'four core genotype model' were conducted to discern between the roles of sex chromosomes and sex hormones. This work demonstrated that sex-chromosome-linked genes are not responsible for clearance, increasing the likelihood that hormones play a dominant role in controlling S. aureus gut colonization.

      Strengths:

      A strength of the work is the rigorous experimental design. Appropriate controls were executed and, in most cases, multiple approaches were conducted to strengthen the authors' conclusions. The conclusions are supported by the data.

      The following suggestions are offered to improve an already strong piece of scholarship.

      Weaknesses:

      The correlation between female sex hormones and the elimination of S. aureus from the gut could be further validated by quantifying sex hormones produced in the four core genotype mice in response to colonization. Additionally, and this may not be feasible, but according to the proposed model administering female sex hormones to male mice should decrease colonization. Finally, knowing whether the quantity of IL-17a CD4+ cells change in the OVX mice has the potential to discern whether abundance/migration of the cells or their activation is promoted by female sex hormones.

      In the Discussion, the authors highlight previous work establishing a link between immune cells and sex hormone receptors, but whether the estrogen (and progesterone) receptor is differentially expressed in response to S. aureus colonization could be assessed in the RNAseq dataset. Differential expression of known X and Y chromosome-linked genes was discussed but specific sex hormones or sex hormone receptors, like the estrogen receptor, were not. This potential result could be highlighted.

      We appreciate the comment on the scholarship and thank the Reviewer for the insightful suggestions to improve this manuscript. We apologize for not including references that address some of the Reviewer’s questions. Other research groups have compared the levels of hormones between XX and XY males and females in the four core genotypes model and have found similar levels of circulating testosterone in adult XX and XY males. No difference was found in circulating estradiol levels in XX vs XY- females when tested at 4-6 or 7-9 months of age.

      Karen M. Palaszynski, Deborah L. Smith, Shana Kamrava, Paul S. Burgoyne, Arthur P. Arnold, Rhonda R. Voskuhl, A Yin-Yang Effect between Sex Chromosome Complement and Sex Hormones on the Immune Response. Endocrinology, Volume 146, Issue 8, 1 August 2005, Pages 3280–3285, https://doi.org/10.1210/en.2005-0284

      Sasidhar MV, Itoh N, Gold SM, Lawson GW, Voskuhl RR. The XX sex chromosome complement in mice is associated with increased spontaneous lupus compared with XY. Ann Rheum Dis. 2012 Aug;71(8):1418-22. doi: 10.1136/annrheumdis-2011-201246. Epub 2012 May 12. PMID: 22580585; PMCID: PMC4452281.

      Administering female sex hormones to males is a good idea. We did not observe an effect of injecting males with estrogen on MRSA colonization (data not shown), perhaps due to the dose or timing, or because estrogen alone is not sufficient (i.e., additional hormones and factors may be required). Therefore, we analyzed the necessity of estrogen signaling and found that colonization resistance to MRSA is impaired in Esr1<sup>-/-</sup> female mice. We have added this new experiment to the revised manuscript (Fig. 6C).

      Examination of the levels of estrogen, progesterone, and androgen receptors in our cecal-colonic lamina propria RNA-seq dataset is an excellent idea. We observed a significant increase in the G-protein coupled estrogen receptor 1 (Gper1) and a non-significant increase in Estrogen receptor alpha (Esr1) following MRSA inoculation in the immune cell compartment. This analysis has been added to the revised manuscript (Supplemental Fig. 6).

      Reviewer #3 (Recommendations for the authors)

      Minor editing issues:

      The topic sentence of the last paragraph in the Results section states - 'male sex defining gene sex determining region Y (Sry) has been moved from the Y chromosome to an autosome'. 'Sex defining gene' and 'sex-determining region' seem redundant in this context. A sex-defining gene would presumably be located within a sex-determining region.

      Bold the letter 'F' in the Figure 5 legend.

      It's not clear from the Figure 6E legend when the IL-17A+ CD4+ cells were quantified, 2 dpi?

      In the third sentence of the second paragraph of the Discussion, the two references are merged together.

      We thank the Reviewer for pointing out these editing issues. They have been addressed in the revised manuscript.

    1. Author Response

      The following is the authors’ response to the original reviews.

      We would like to thank the reviewers for their thoughtful evaluation of our manuscript. We considered all the comments and prepared the revised version. The following are our responses to the reviewers’ comments. All references, including those in the original manuscript are included at the end of this point-by-point response.

      Reviewer #1 (Public Review):

      Weaknesses:

      1) The authors should better review what we know of fungal Drosophila microbiota species as well as the ecology of rotting fruit. Are the microbiota species described in this article specific to their location/setting? It would have been interesting to know if similar species can be retrieved in other locations using other decaying fruits. The term 'core' in the title suggests that these species are generally found associated with Drosophila but this is not demonstrated. The paper is written in a way that implies the microbiota members they have found are universal. What is the evidence for this? Have the fungal species described in this paper been found in other studies? Even if this is not the case, the paper is interesting, but there should be a discussion of how generalizable the findings are.

      The reviewer asks whether the microbial species described in this article are ubiquitously associated with Drosophila. Indeed, most of the microbes described in this manuscript are generally recognized as species associated with Drosophila spp. For example, yeasts such as Hanseniaspora uvarum, Pichia kluyveri, and Starmerella bacillaris have been detected in or isolated from Drosophila spp. collected in European countries as well as the United States and Oceania (Chandler et al., 2012; Solomon et al., 2019). As for bacteria, species belonging to the genera Pantoea, Lactobacillus, Leuconostoc, and Acetobacter have also previously been detected in wild Drosophila spp. (Chandler et al., 2011). These statements have been incorporated into our revised manuscript (lines 391-397). Nevertheless, the term “core” in the manuscript and title may lead to misunderstanding, as this general association does not ensure the presence of these microbial species in every individual fly. Considering this point, we replaced “core” with “key,” a term that is more appropriate to our context.

      2) Can the authors clearly demonstrate that the microbiota species that develop in the banana trap are derived from flies? Are these species found in flies in the wild? Did the authors check that the flies belong to the D. melanogaster species and not to the sister group D. simulans?

      Can the authors clearly demonstrate that the microbiota species that develop in the banana trap are derived from flies? Are these species found in flies in the wild?

      The reviewer asked whether the microbial species detected in the fermented banana samples were derived from flies. To address this question, additional experiments under more controlled conditions would be needed, such as artificially introducing wild flies onto fresh bananas in the laboratory. Nevertheless, the microbes potentially originate from wild flies, as supported by the literature cited in our response to Weakness 1).

      Alternative sources of microbes also merit consideration. For example, microbes may have been introduced to unfermented bananas by penetration through peel injuries (lines 1300-1301). In addition, they could be introduced by insects other than flies, given that rove beetles (Staphylinidae) and sap beetles (Nitidulidae) were observed in some of the traps. An explanation of these possibilities has been incorporated into the DISCUSSION (lines 414-427) of our revised manuscript.

      Did the authors check that the flies belong to the D. melanogaster species and not to the sister group D. simulans?

      Our sampling strategy was designed to target not only D. melanogaster but also other domestic Drosophila species, such as D. simulans, that inhabit human residential areas. For the traps where adult flies were caught, we identified the species of the drosophilids as shown in Table S1, thereby showing the presence of either or both D. melanogaster and D. simulans. We added these descriptions in MATERIALS AND METHODS (lines 511-512 and 560-562), and DISCUSSION (lines 378-379).

      3) Did the microarrays highlight a change in immune genes (ex. antibacterial peptide genes)? Whatever the answer, this would be worth mentioning. The authors described their microarray data in terms of fed/starved in relation to the Finke article. They should clarify if they observed significant differences between species (differences between species within bacteria or fungi, and more generally differences between bacteria versus fungi).

      Did the microarrays highlight a change in immune genes (ex. antibacterial peptide genes)? Whatever the answer, this would be worth mentioning.

      Regarding the antimicrobial peptide genes, statistical comparisons of our RNA-seq data across different conditions were impracticable because most of the genes showed low expression levels. The RNA-seq data of the yeast-fed larvae are shown in Author response table 1. While a subset of genes exhibited significantly elevated expression in the non-supportive conditions relative to the supportive ones, this may be due to intra-sample variability rather than the difference in the nutritional conditions. Similar expression profiles were observed in the bacteria-fed larvae as well (data not shown). Therefore, it is difficult to discuss a change in immune genes in the paper. Additionally, the previous study that conducted larval microarray analysis (Zinke et al., 2002) did not explicitly focus on immune genes.

      Author response table 1.

      Antimicrobial peptide genes are not up-regulated by any of the microbes. Antimicrobial peptide gene expression profiles of whole bodies of first-instar larvae fed on yeasts. TPM values of all samples and comparison results of gene expression levels in the larvae fed on supportive and non-supportive yeasts are shown. Antimicrobial peptide genes mentioned in Hanson and Lemaitre, 2020 are listed. NA or na, not available.
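
      For readers unfamiliar with the unit, the following is a minimal sketch of how TPM (transcripts per million) values are conventionally derived from raw read counts. It illustrates the standard definition only and is not the authors' pipeline; the function name and the example counts and gene lengths are hypothetical.

```python
def tpm(counts, lengths_kb):
    """Convert raw read counts to transcripts per million (TPM).

    counts     -- reads assigned to each gene (hypothetical example data)
    lengths_kb -- gene (or effective transcript) lengths in kilobases
    """
    if len(counts) != len(lengths_kb):
        raise ValueError("counts and lengths_kb must have the same length")
    # Step 1: length-normalise each gene (reads per kilobase).
    rates = [c / l for c, l in zip(counts, lengths_kb)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    # Step 2: scale so that all genes in the sample sum to one million.
    return [r / total * 1e6 for r in rates]


# Hypothetical sample with three genes.
values = tpm([10, 20, 0], [1.0, 2.0, 1.5])
```

      Because every sample is scaled to the same total, a lowly expressed gene contributes only a few reads, which is one reason statistical comparison of such genes across conditions is unreliable.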

      They should clarify if they observed significant differences between species (differences between species within bacteria or fungi, and more generally differences between bacteria versus fungi).

      We did not observe significant differences in the gene expression profiles of the larvae fed on different microbial species within bacteria or fungi, or between those fed on bacteria and those fed on fungi. For example, the gene expression profiles of larvae fed on the various supportive microbes showed striking similarities to each other, as evidenced by the heat map showing the expression of all genes detected in larvae fed either yeast or bacteria (Author response image 1). Similarities were also observed among larvae fed on various non-supportive microbes.

      Only a handful of genes showed different expression patterns between larvae fed on yeast and those fed on bacteria. Thus, it is challenging to discuss the potential differential impacts of yeast and bacteria on larval growth, if any.

      Author response image 1.

      Gene expression profiles of larvae fed on the various supportive microbes show striking similarities to each other. Heat map showing the gene expression of the first-instar larvae that fed on yeasts or bacteria. Freshly hatched germ-free larvae were placed on banana agar inoculated with each microbe and collected after 15 h feeding to examine gene expression of the whole body. Note that data presented in Figures 3A and 4C in the original manuscript, which were obtained independently, are combined to generate this heat map. The labels under the heat map indicate the microbial species fed to the larvae, with three samples analyzed for each condition. The lactic acid bacteria (“LAB”) include Lactiplantibacillus plantarum and Leuconostoc mesenteroides, while the acetic acid bacterium (“AAB”) represents Acetobacter orientalis. “LAB + AAB” signifies mixtures of the AAB and either one of the LAB species. The asterisks in the label highlight “LAB + AAB” or “LAB” samples clustered separately from the other samples in those conditions; “” indicates a sample in a “LAB + AAB” condition (Lactiplantibacillus plantarum + Acetobacter orientalis), and “*” indicates a sample in a “LAB” condition (Leuconostoc mesenteroides). Brown abbreviations of scientific names are for the yeast-fed conditions. H. uva, Hanseniaspora uvarum; K. hum, Kazachstania humilis; M. asi, Martiniozyma asiatica; Sa. cra, Saccharomycopsis crataegensis; P. klu, Pichia kluyveri; St. bac, Starmerella bacillaris; BY4741, Saccharomyces cerevisiae BY4741 strain.

      4) The whole paper - and this is one of its merits - points to a role of the Drosophila larval microbiota in processing the fly food. Are these bacterial and fungal species found in the gut of larvae/adults? Are these species capable of establishing a niche in the cardia of adults as shown recently in the Ludington lab (Dodge et al.,)? Previous studies have suggested that microbiota members stimulate the Imd pathway leading to an increase in digestive proteases (Erkosar/Leulier). Are the microbiota species studied here affecting gut signaling pathways beyond providing branched amino acids?

      The whole paper - and this is one of its merits - points to a role of the Drosophila larval microbiota in processing the fly food. Are these bacterial and fungal species found in the gut of larvae/adults? Are these species capable of establishing a niche in the cardia of adults as shown recently in the Ludington lab (Dodge et al.,)?

      Although we did not investigate the microbiota in the gut of either larvae or adults, we did compare the microbiota within surface-sterilized larvae or adults with the microbiota in food samples. We found that adult flies and early-stage foods, as well as larvae and late-stage foods, harbored similar microbial species (Figure 1F). Additionally, previous studies examining the gut microbiota in wild adult flies have detected microbes belonging to the same species or taxa as those isolated from our foods (Chandler et al., 2011; Chandler et al., 2012). We have elaborated on this in our response to Weakness 1).

      While we did not investigate whether these species are capable of establishing a niche in the cardia of adults, we have cited the study by Dodge et al., 2023 in our revised manuscript and discussed the possibility that predominant microbes in adult flies may show a propensity for colonization (lines 410-413).

      Previous studies have suggested that microbiota members stimulate the Imd pathway leading to an increase in digestive proteases (Erkosar/Leulier). Are the microbiota species studied here affecting gut signaling pathways beyond providing branched amino acids?

      The reviewer inquires whether the supportive microbes in our study stimulate gut signaling pathways and induce the expression of digestive protease genes, as demonstrated in a previous study (Erkosar et al., 2015). Based on our RNA-seq data, this is unlikely. The aforementioned study demonstrated that seven protease genes are upregulated through Imd pathway stimulation by a bacterium that promotes larval growth. In our RNA-seq analysis, these seven genes did not exhibit consistent upregulation in the presence of the supportive microbes (H. uva or K. hum in Author response table 2A; Le. mes + A. ori in Author response table 2B). Rather, they tended to be upregulated in the presence of non-supportive microbes (St. bac or Pi. klu in Author response table 2A; La. pla in Author response table 2B).

      Author response table 2.

      Most of the peptidase genes reported by Erkosar et al., 2015 are more highly expressed under the non-supportive conditions than the supportive conditions. Comparison of the expression levels of seven peptidase genes derived from the RNA-seq analysis of yeast-fed (A) or bacteria-fed (B) first-instar larvae. A previous report demonstrated that the expression of these genes is upregulated upon association with a strain of Lactiplantibacillus plantarum, and that the PGRP-LE/Imd/Relish signaling pathway, at least partially, mediates the induction (Erkosar et al., 2015). H. uva, Hanseniaspora uvarum; K. hum, Kazachstania humilis; P. klu, Pichia kluyveri; S. bac, Starmerella bacillaris; La. pla, Lactiplantibacillus plantarum; Le. mes, Leuconostoc mesenteroides; A. ori, Acetobacter orientalis; ns, not significant.

      Reviewer #2 (Public Review):

      Weaknesses:

      The experimental setting that, the authors think, reflects host-microbe interactions in nature is one of the key points. However, it is not explicitly mentioned whether isolated microbes are indeed colonized in wild larvae of Drosophila melanogaster who eat bananas. Another matter is that this work is rather descriptive and a few mechanical insights are presented. The evidence that the nutritional role of BCAAs is incomplete, and molecular level explanation is missing in "interspecies interactions" between lactic acid bacteria (or yeast) and acetic acid bacteria that assure their inhabitation. Apart from these matters, the future directions or significance of this work could be discussed more in the manuscript.

      The experimental setting that, the authors think, reflects host-microbe interactions in nature is one of the key points. However, it is not explicitly mentioned whether isolated microbes are indeed colonized in wild larvae of Drosophila melanogaster who eat bananas.

      The reviewer asks whether the isolated microbes were colonized in the larval gut. Previous studies on microbial colonization associated with Drosophila have predominantly focused on adults (Pais et al. PLOS Biology, 2018), rather than larval stages. Developing larvae continually consume substrates which are already subjected to microbial fermentation and abundant in live microbes until the end of the feeding larval stage. Therefore, we consider it difficult to discuss microbial colonization in the larval gut. We have mentioned this point in DISCUSSION of the revised manuscript (lines 408-410).

      Another matter is that this work is rather descriptive and a few mechanical insights are presented. The evidence that the nutritional role of BCAAs is incomplete, and molecular level explanation is missing in "interspecies interactions" between lactic acid bacteria (or yeast) and acetic acid bacteria that assure their inhabitation.

      While we recognize the importance of comprehensive mechanistic analysis, elucidation of more detailed molecular mechanisms lies beyond the scope of this study and will be a subject of future research.

      Regarding the nutritional role of BCAAs, the incorporation of BCAAs enabled larvae fed with the non-supportive yeast to grow to the second-instar stage. This observation implies that consumption of BCAAs upregulates diverse genes involved in cellular growth processes in larvae. We mentioned a previously reported interaction between lactic acid bacteria (LAB) and acetic acid bacteria (AAB) in the manuscript (lines 433-436). LAB may facilitate lactate provision to AAB, consequently enhancing the biosynthesis of essential nutrients such as amino acids. To test this hypothesis, future experiments will include the supplementation of lactic acid to AAB culture plates, and the co-inoculation of AAB with LAB mutant strains defective in lactate production to assess both larval growth and continuous larval association with AAB. With respect to AAB-yeast interactions, metabolites released from yeast cells might benefit AAB growth, and this possibility will be investigated through the supplementation of AAB culture plates with candidate metabolites identified in the cell suspension supernatants of the late-stage yeasts.

      Apart from these matters, the future directions or significance of this work could be discussed more in the manuscript.

      We appreciate the reviewer's recommendations. The explanation of the universality of our findings has been included in the revised DISCUSSION (lines 391-397). We have also added descriptions of the implications of compositional shifts occurring in adult microbiota (lines 404-413), possible inoculation routes of different microbes (lines 414-427), and hypotheses on the mechanism of larval growth promotion by yeasts (lines 469-476), all of which could be the focus of our future study.

      Reviewer #3 (Public Review):

      Weaknesses:

      Despite describing important findings, I believe that a more thorough explanation of the experimental setup and the steps expected to occur in the exposed diet over time, starting with natural "inoculation" could help the reader, in particular the non-specialist, grasp the rationale and main findings of the manuscript. When exactly was the decision to collect early-stage samples made? Was it when embryos were detected in some of the samples? What are the implications of bacterial presence in the no-fly traps? These samples also harbored complex microbial communities, as revealed by sequencing. Were these samples colonized by microbes deposited with air currents? Were they the result of flies that touched the material but did not lay eggs? Could the traps have been visited by other insects? Another interesting observation that could be better discussed is the fact that adult flies showed a microbiome that more closely resembles that of the early-stage diet, whereas larvae have a more late-stage-like microbiome. It is easy to understand why the microbiome of the larvae would resemble that of the late-stage foods, but what about the adult microbiome? Authors should discuss or at least acknowledge the fact that there must be a microbiome shift once adults leave their food source. Lastly, the authors should provide more details about the metabolomics experiments. For instance, how were peaks assigned to leucine/isoleucine (as well as other compounds)? Were both retention times and MS2 spectra always used? Were standard curves produced? Were internal, deuterated controls used?

      When exactly was the decision to collect early-stage samples made? Was it when embryos were detected in some of the samples?

      We collected traps and early-stage samples 2.5 days after setting up the traps. This duration was determined from pilot experiments. A shorter collection time resulted in a lower likelihood of obtaining traps visited by adult flies, whereas a longer collection time caused overcrowding of larvae as well as deaths of adults from drowning in the liquid seeping out of the fruits. These procedural details have been included in the MATERIALS AND METHODS section of the revised manuscript (lines 523-526).

      What are the implications of bacterial presence in the no-fly traps? These samples also harbored complex microbial communities, as revealed by sequencing. Were these samples colonized by microbes deposited with air currents? Were they the result of flies that touched the material but did not lay eggs? Could the traps have been visited by other insects?

      We assume that the origins of the microbes detected in the no-fly trap foods vary depending on the species. For instance, Colletotrichum musae, the fungus that causes banana anthracnose, may have been present in fresh bananas before trap placement. The filamentous fungi could have originated from airborne spores, but they could also have been introduced by insects that feed on these fungi. We have included these possibilities in the DISCUSSION section of the revised manuscript (lines 417-421).

      Another interesting observation that could be better discussed is the fact that adult flies showed a microbiome that more closely resembles that of the early-stage diet, whereas larvae have a more late-stage-like microbiome. It is easy to understand why the microbiome of the larvae would resemble that of the late-stage foods, but what about the adult microbiome? Authors should discuss or at least acknowledge the fact that there must be a microbiome shift once adults leave their food source.

      We are grateful for the reviewer's insightful suggestion regarding shifts in the adult microbiome. We have included in the DISCUSSION section of the revised manuscript the possibility that the microbial composition may change substantially during pupal stages or after adult eclosion (lines 404-413).

      Lastly, the authors should provide more details about the metabolomics experiments. For instance, how were peaks assigned to leucine/isoleucine (as well as other compounds)? Were both retention times and MS2 spectra always used?

      In this metabolomic analysis, LC-MS/MS with triple quadrupole MS monitors the formation of fragment ions from precursor ions specific to each target compound. The use of PFPP columns, which provide excellent separation of amino acids and nucleobases, allows chromatographic peaks of many structural isomers to be separated into independent peaks. In addition, all measured compounds are compared with data from a standard library to confirm retention time agreement. Structural isomers were separated either by retention time on the column or by compound-specific MRM signals (in fact, leucine and isoleucine have both unique MRM channels and column separations). Detailed MRM conditions are identical to the previously published study (Oka et al., 2017). These have been included in the revised ‘LC-MS/MS measurement’ section in MATERIALS AND METHODS (lines 810-824).

      Were standard curves produced?

      Since relative quantification of metabolite amounts was performed in this study, no standard curve was generated to determine absolute concentrations. However, a standard compound of known concentration (single point) was measured to confirm retention time and relative area values.

      Were internal, deuterated controls used?

      Deuterium-labeled internal standards were not used in this study, because a large number of compounds were measured and it is not realistic to obtain a deuterium-labeled counterpart for every one of them. However, an internal standard (L-methionine sulfone) was added to the extraction solvent to calculate the recovery rate. This has been included in the revised ‘LC-MS/MS measurement’ section in MATERIALS AND METHODS (lines 824-825).
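
      As an illustration of the arithmetic behind recovery correction and relative quantification, a minimal sketch follows. It assumes that recovery is taken as the ratio of the internal-standard peak area in an extracted sample to its area in an unextracted reference, and that metabolite levels are reported relative to a control condition. The function names and example areas are hypothetical, and the authors' exact calculation may differ.

```python
def recovery_rate(is_area_in_sample, is_area_reference):
    """Fraction of the internal standard recovered after extraction.

    Assumption: the reference area comes from the same amount of internal
    standard measured without going through the extraction procedure.
    """
    if is_area_reference <= 0:
        raise ValueError("reference area must be positive")
    return is_area_in_sample / is_area_reference


def recovery_corrected_area(peak_area, recovery):
    """Scale a metabolite peak area up to compensate for extraction loss."""
    if recovery <= 0:
        raise ValueError("recovery must be positive")
    return peak_area / recovery


def relative_level(sample_area, control_area):
    """Relative (not absolute) quantity: sample area divided by control area."""
    if control_area <= 0:
        raise ValueError("control area must be positive")
    return sample_area / control_area


# Hypothetical peak areas (arbitrary units).
rec = recovery_rate(80.0, 100.0)               # 80% of the standard recovered
leu_area = recovery_corrected_area(400.0, rec)  # metabolite area after correction
fold = relative_level(leu_area, 250.0)          # level relative to a control
```

      No standard curve is needed for this kind of comparison, because only ratios between conditions are reported rather than absolute concentrations.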

      Reviewer #1 (Recommendations For The Authors):

      Additional comments 1. The authors should do a better job of presenting their data. It took me quite a while to understand the protocol of Figure 1. Panel 1A, B, C could be improved. For instance, 1A suggests that flies are transferred to the lab while this is in fact the banana trap. Indicate 'Banana trap colonized by flies' rather than 'wild-type flies in the trap'. 1C: should indicate that the food suspension comes from the banana trap. 1B,D,D: do not use pale color as legend. Avoid the use of indices in Figure 2 (Y1 rather than Y<sub>1</sub>). Grey colors are difficult to distinguish in Figure 2. Etc. It is a pain for reviewers that figure legends are on the verso of each figure and not just below.

      We thank the reviewer for the detailed suggestions to improve the clarity and comprehensibility of our figures. We have improved the figures according to the suggestions. As for the figure legends, we have placed them below each respective figure whenever possible.

      1. Clarify in the text if 'sample' means food substratum or flies/larvae (ex. line 116 and elsewhere).

      We have revised the word “sample” throughout our manuscript and eliminated the confusion.

      1. Line 170 - clarify what you mean by fermented food.

      We have replaced the “fermented larval foods” with “fermented bananas” in our revised manuscript (line 165).

      1. Line 199 - what is the meaning of 'stocks'.

      We have replaced the “stocks” with “strains” (line 195).

      1. Line 320 - explain more clearly what the yeast-conditioned banana-agar plate and cell suspension supernatant are, and what the goals of using these media are. This will help in understanding the subsequent text.

      We have added a supplemental figure illustrating the sample preparation for the metabolomic analysis (Figure S6), with the following legend describing the procedure (lines 1335-1346): “Sample preparation process for the metabolomic analysis. We suspected that the supportive live yeast cells may release critical nutrients for larval growth, whereas the non-supportive yeasts may not. To test this possibility, we made three distinct sample preparations of individual yeast strains (yeast cells, yeast-conditioned banana-agar plates, and cell suspension supernatants). Yeast cells were for the analysis of intracellular metabolites, whereas yeast-conditioned banana-agar plates and cell suspension supernatants were for that of extracellular metabolites. The samples were prepared as the following procedures. Yeasts were grown on banana-agar plates for 2 days at 25°C, and then scraped from the plates to obtain “yeast cells.” Next, the remaining yeasts on the resultant plates were thoroughly removed, and a portion from each plate was cut out (“yeast-conditioned banana agar”). In addition, we suspended yeast cells from the agar plates into sterile PBS, followed by centrifugation and filtration to eliminate the yeast cells, to prepare “cell suspension supernatants.””

      1. Figure 5 is difficult to understand. Provide more explanation. Consider moving the 'all metabolites panel' to Supp. Better explain what this holidic medium is.

      The holidic medium is a medium that has been commonly used in the Drosophila research community, which contains ~40 known nutrients, and supports the larval development to pupariation (Piper et al., 2014; Piper et al., 2017). We have introduced this explanation to the RESULTS section of the manuscript (lines 322-327). However, the scope of our research reaches beyond the analysis of the holidic medium components, because feeding the holidic medium alone causes a significant delay in larval growth, suggesting a lack of nutritional components (Piper et al., 2014). Thus, we believe the "All Metabolites" panels should be placed alongside the corresponding “The holidic medium components” panels.

      1. I could not access Figure 6 when downloading the PDF. The page is white and an error message appears - it is problematic to review a paper lacking a figure.

      We regret any inconvenience caused, perhaps due to a system error. Please refer to the Author response image 2, which is identical to Figure 6 of our original manuscript.

      Author response image 2.

      Supportive yeasts facilitate larval growth by providing nutrients, including branched-chain amino acids, by releasing them from their cells (Figure 6 from the original manuscript). (A and B) Growth of larvae feeding on yeasts on banana agar supplemented with leucine and isoleucine. (A) The mean percentage of the live/dead individuals in each developmental stage. n=4. (B) The percentage of larvae that developed into second instar or later stages. The “Not found” population in Figure 6A was omitted from the calculation. Each data point represents data from a single tube. Unique letters indicate significant differences between groups (Tukey-Kramer test, p < 0.05). (C) The biosynthetic pathways for leucine and isoleucine with S. cerevisiae gene names are shown. The colored dots indicate enzymes that are conserved in the six isolated species, while the white dots indicate those that are not conserved. Abbreviations of genera are given in the key in the upper right corner. LEU2 is deleted in BY4741. (D-G) Representative image of Phloxine B-stained yeasts. The right-side images are expanded images of the boxed areas. The scale bar represents 50 µm. (H) Summary of this study. H. uvarum is predominant in the early-stage food and provides Leu, Ile, and other nutrients that are required for larval growth. In the late-stage food, AAB directly provides nutrients, while LAB and yeasts indirectly contribute to larval growth by enabling the stable larva-AAB association. The host larva responds to the nutritional environment by dramatically altering gene expression profiles, which leads to growth and pupariation. H. uva, Hanseniaspora uvarum; K. hum, Kazachstania humilis; Pi. klu, Pichia kluyveri; St. bac, Starmerella bacillaris; GF, germ-free.
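
      The calculation described for panel B can be sketched as follows, assuming that each tube is summarised as counts per developmental category and that every recovered individual stays in the denominator once the “Not found” category is dropped. The category labels and counts are hypothetical and serve only to illustrate the arithmetic.

```python
# Hypothetical labels for stages counted as "second instar or later".
LATER_STAGES = {"second instar", "third instar", "pupa", "adult"}


def percent_second_instar_or_later(stage_counts):
    """Percentage of recovered larvae that reached second instar or later.

    stage_counts -- dict mapping a category label to the number of
                    individuals in one tube; "Not found" is excluded
                    from the calculation, as stated in the legend.
    """
    found = {s: n for s, n in stage_counts.items() if s != "Not found"}
    total = sum(found.values())
    if total == 0:
        return None  # nothing recovered from this tube
    later = sum(n for s, n in found.items() if s in LATER_STAGES)
    return 100.0 * later / total


# One hypothetical tube: 30 larvae seeded, 20 recovered.
tube = {"first instar": 5, "second instar": 10, "third instar": 5, "Not found": 10}
pct = percent_second_instar_or_later(tube)
```

      Each value computed this way corresponds to one data point per tube in the panel.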

      1. Line 323 - Consider rewriting this sentence (too long, explain what the holidic medium is and why this is interesting). "In the yeast-conditioned banana-agar plates, which were anticipated to contain yeast-derived nutrients, many well-known nutrients included in a chemically defined synthetic (holidic) medium for Drosophila melanogaster (Piper et al., 2014, 2017) were not increased compared to the sterile banana-agar plates; instead, they exhibited drastic decreases irrespective of the yeast species."

      We thank the reviewer for the suggestion to improve the readability of our manuscript. We have rewritten the sentence in the revised manuscript (lines 320-328) as follows: “The yeast-conditioned banana-agar plates were expected to contain yeast-derived nutrients. On the contrary, the result revealed a depletion of various metabolites originally present in the sterile banana agar (Figure 5A). This result prompted us to focus on the metabolites in the chemically defined (holidic) medium for Drosophila melanogaster (Piper et al., 2014; Piper et al., 2017). This medium contains ~40 known nutrients, and supports the larval development to pupariation, albeit at the half rate compared to that on a yeast-containing standard laboratory food (Piper et al., 2014; Piper et al., 2017). Therefore, the holidic medium could be considered to contain the minimal essential nutrients required for larval growth. Our analysis indicated a substantial reduction of these known nutrients in the yeast-conditioned plates compared to their original quantities (Figure 5B).”

      Reviewer #2 (Recommendations For The Authors):

      Suggestions for improved or additional experiments, data or analyses.

      1. It should be clearly shown (or stated) that isolated microbes, such as H. uvarum and Pa. agglomerans, are indigenous microbes in wild Drosophila melanogaster in their outdoor sampling.

      We thank the reviewer for the suggestions. Addressing the presence of isolated microbes within wild D. melanogaster adults is important, but is not feasible with our data for the following reasons. Our microbiota analysis of adults was conducted using pooled individuals of multiple Drosophila species, rather than using D. melanogaster exclusively. Moreover, the microbial isolation and the analysis of adult microbiota were carried out in two independent samplings (Figures 1A and 1E in the original manuscript, respectively). As a result, the microbial species detected in the adults were slightly different from those isolated from the food samples collected in the previous sampling. Nevertheless, it is worth noting that H. uvarum dominated in 2 out of the 3 adult samples, constituting >80% of the fungal composition. Pantoea agglomerans was not detected in the adults, although Enterobacterales accounted for >59% in 2 out of the 3 samples. Therefore, these isolated microbial species, or at least their phylogenetically related species, are presumed to be indigenous to wild D. melanogaster.

      If the reviewer’s suggestion was to state the dominance of H. uvarum and Pantoea agglomerans in early-stage foods, we have added a supplemental figure showing the species-level microbial compositions corresponding to Figure 1B of the original manuscript (Figure S1), and further revised the manuscript (lines 180-186).

      1. The reviewer supposes that the indigenous microbes of flies may differ from what they usually eat. In this study, the authors use banana-based food, but is it justified in terms of the natural environment of the places where those microbes were isolated? In other words, did sampled wild flies eat bananas outside the laboratory at Kyoto University?

      Drosophila spp. inhabit human residential areas and feed on various fermented fruits and vegetables. In the areas surrounding Kyoto University, they can be found in garbage in residential dwellings as well as supermarkets. In this regard, fruits are natural food sources of wild Drosophila in the area.

      Among various fruits, bananas were selected based on the following two reasons. Firstly, bananas were commonly used in previous Drosophila studies as a trap bait or a component of Drosophila food (Anagnostou et al., 2010; Stamps et al., 2012; Consuegra et al., 2020). Secondly, and rather practically, bananas can be obtained in Japan all year at a relatively low cost. Previous studies have used various fruits such as grapes (Quan and Eisen, 2018), figs (Pais et al., 2018), and raspberries (Cho and Rohlfs, 2023). However, these fruits are only available during limited seasons and are more expensive per volume than bananas. Thus, they were not practical for our study, which required large amounts of fruit-based culture media. We have included a brief explanation regarding this point in MATERIALS AND METHODS (lines 514-518).

      1. In Fig. 6B, the Leu and Ile experiment, is the added amount of those amino acids appropriate in the context that they mention "...... supportive yeasts had concentrations of both leucine and isoleucine that were at least four-fold higher than those of non-supportive yeasts"?

      We acknowledge that the supplementation should be carried out ideally in a quantity equivalent to the difference between the released amounts of supportive and non-supportive species. However, achieving this has been highly challenging. Previous studies determined the amount of amino acid supplementation by quantifying their concentration in the bacteria-conditioned media (Consuegra et al., 2020; Henriques et al., 2020). However, we found that quantifying the exact concentrations of the amino acids is not feasible with our yeasts. As shown in Figure 5B in the original manuscript, the amino acid contents were markedly reduced in the yeast-conditioned banana agar compared to the agar without yeasts, presumably because of the uptake by the yeasts. Thus, the amino acids released from yeast cells on the banana-agar plate are not expected to accumulate in the medium. As this reviewer pointed out, in the cell suspension supernatants of the supportive yeasts, concentrations of both leucine and isoleucine were at least four-fold higher compared to those of non-supportive yeasts (Figures 5G-H in the original submission). However, this measurement does not give the absolute amount of either amino acid available for larvae. Given these constraints, we opted for the amino acid concentrations in the holidic medium, which support larval growth under axenic conditions (Piper et al., 2014). We also showed that the supplementation of the amino acids at that concentration to the banana-agar plate was not detrimental to larval growth (Figures 6A-B in the original manuscript). These rationales have been included in the revised ‘Developmental progression with BCAA supplementation’ section in MATERIALS AND METHODS of our manuscript (lines 840-847).

      1. In addition to the above, other amino acids or nutrients could be included as control experiments.

      As mentioned in our manuscript (lines 365-368), we did supplement other amino acids, lysine and asparagine, which failed to rescue the larval growth.

      1. In the experiment of Fig. 2E, how about examining larval development using heat-killed LAB or yeast with live AAB? The reviewer speculates that one possibility is that AAB needs nutrients from LAB.

      We did not feed larvae with heat-killed LAB and live AAB for the following reasons. LAB grows very poorly on banana agar compared to yeasts, and preparation of LAB required many banana-agar plates even when we fed live bacteria to larvae. Adding dead LAB to banana-agar tubes would require far more plates, but this preparation is impractical. Furthermore, heat-killing may not allow the investigation of the contribution of heat-unstable or volatile compounds.

      As for the reviewer's suggestion regarding the addition of heat-killed yeast with AAB, heat-killed yeast itself promotes larval growth, as shown in Figures 4G and 4H in the original manuscript, so the contribution of yeast cannot be examined using this method.

      Recommendations for improving the writing and presentation.

      1. It would be good to mention that during sample collection, other insects (other than Drosophila species) were not found in the food if this is true.

      Insects other than Drosophila spp. were found in several traps in the sampling shown in Figures 1C-F. These insects, rove beetles (Staphylinidae) and sap beetles (Nitidulidae), seemed to share a niche with Drosophila in nature. Therefore, we believe that the contamination of these insects did not interfere with our goal of obtaining larval food samples. We added these descriptions and explanations to MATERIALS AND METHODS (lines 527-531).

      1. There are many different kinds of bananas. Detailed information on the bananas used should be provided.

      We have included the information on the banana in the MATERIALS AND METHODS section (line 622).

      1. Concerning the place of sample collection, detailed longitude, and latitude information can be provided (this is easily obtained from Google Maps). When the collection was performed should also be mentioned. This may suggest the environment of the "wild flies" they collected.

      We added a table listing the dates of our collections, along with the longitude and latitude of each sampling place (Table S1A).

      1. The reviewer could not find how the authors conducted heat killing of yeast.

      We added the following procedure to the ‘Quantification of larval development’ section in MATERIALS AND METHODS (lines 680-688). “When feeding heat-killed yeasts to larvae, yeasts were added to the banana-agar tubes and subsequently heated as following procedures. The yeasts were revived from frozen stocks on banana-agar plates, incubated at 25°C, and then streaked on fresh agar plates. After 2-day incubation, yeast cells were scraped from the plates and suspended in PBS at the concentration of 400 mg of yeast cells in 500 µL of PBS. 125 µL of the suspensions were added to banana-agar tubes prepared as described, and after centrifugation at 3,000 x g for 5 min, the supernatants were removed. The amount of cells in each tube is ~50x compared to that when feeding live yeasts, which compensates for the reduced amount due to their inability to proliferate. The tubes were subsequently heated at 80°C for 30 min before adding germ-free larvae.”

      1. The reviewer prefers that all necessary information on how to see figures be provided in figure legends. For example, an explanation of some abbreviations is missing.

      We carefully re-examined the figure legends and added necessary information.

      1. Many of the figures are not kind to readers, i.e., one needs to refer to the legends and main text very frequently. Adding subheadings (titles) to each figure may help.

      We added subheadings to our figures to improve the comprehensibility.

      Reviewer #3 (Recommendations For The Authors):

      I have some minor questions/suggestions about the manuscript that, if addressed, may increase the clarity and quality of the work.

      1. Please, when referring to microbial species in the abbreviated form, use only the first letter of the genus. For example, P. agglomerans should be used, not Pa. agglomerans.

      We are concerned about the potential confusion caused by using only the first letter of genera, since several genera mentioned in our work share the first letters, such as P (Pichia and Pantoea), S (Starmerella, Saccharomyces, and Saccharomycopsis), or L (Lactiplantibacillus and Leuconostoc). Therefore, we used only the unabbreviated form of the above seven genera in our revised manuscript. We have also made every effort to avoid abbreviations in our figures and tables, but found it necessary to retain two-letter abbreviations when spaces are particularly limiting.

      1. In lines 294-298, how exactly was the experiment where yeasts were killed by anti-fungal agents performed? If these agents killed the yeast, how was the microbial growth on plates required to have biomass for fly inoculation obtained? Please, clarify this section.

      The yeasts were grown on normal banana-agar plates before the addition onto the anti-fungal agents-containing banana agar. We added the following procedure to MATERIALS AND METHODS (lines 689-695). “When feeding yeasts on banana agar supplemented with antifungal agents, the yeasts were individually grown on normal banana agar twice before being suspended in PBS at the concentration of 400 mg of yeast cells in 500 µL of PBS. 125 µL of the suspensions was introduced onto the anti-fungal agents (10 mL/L 10% p-hydroxybenzoic acid in 70% ethanol and 6 mL/L propionic acid, following the concentration described in Kanaoka et al., 2023)-containing banana agar in 1.5 mL tubes. After centrifugation, the supernatants were removed. The amount of cells in each tube is ~50x compared to that when feeding live yeasts.”

      1. In lines 557-558, please clarify how rDNA copy numbers can be calculated in this way.

      Considering the results of the ITS and 16S sequencing analysis, it was highly likely that rDNAs from bananas and Drosophila were amplified along with microbial rDNA in this qPCR. To estimate the microbial rDNA copy number, we assumed that the proportion of microbial rDNA within the total amplification products remains consistent between the qPCR and the corresponding sequencing analysis, because the template DNA samples and amplified regions were shared between the analyses. Based on this, the copy number of microbial rDNA was estimated by multiplying the qPCR results with the microbial rDNA ratio observed in the ITS or 16S sequencing analysis of each sample. This methodology has been detailed in the MATERIALS AND METHODS section (lines 609-615).
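
      The scaling described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' analysis code: the function name `estimate_microbial_copies` and the example numbers are hypothetical, and the microbial fraction is assumed to be computed from read counts of the matching ITS or 16S sequencing run.

```python
def estimate_microbial_copies(total_copies: float,
                              microbial_reads: int,
                              total_reads: int) -> float:
    """Scale a qPCR rDNA copy number by the microbial read fraction.

    total_copies    : rDNA copies measured by qPCR (microbial + banana + fly).
    microbial_reads : sequencing reads assigned to microbes in the same sample.
    total_reads     : all sequencing reads for that sample.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= microbial_reads <= total_reads:
        raise ValueError("microbial_reads must lie between 0 and total_reads")
    microbial_fraction = microbial_reads / total_reads
    return total_copies * microbial_fraction


# Hypothetical sample: 2.0e6 copies by qPCR, 7,500 of 10,000 reads microbial.
example_copies = estimate_microbial_copies(2.0e6, 7_500, 10_000)
```

      The estimate is only as good as the stated assumption that the microbial share of amplicons is the same in the qPCR and in the sequencing library.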

      1. In lines 609-611, how did you check for cells left from the previous day? Microscopy? Or do you mean that if there was liquid still in the sample you would not add more bacterial cultures? Please, clarify.

      We observed with the naked eye from outside the tubes to determine if additional AAB should be introduced. Since we placed AAB on the banana agar in a lump, we examined whether the lumps were gone or not. We have added these procedures in MATERIALS AND METHODS (lines 671-673).

      1. In Figure 2A, it is hard to differentiate between the gray tones. Please, improve this.

      We have distinguished the plots for different conditions by changing the shape of the markers on the graphs.

      1. In the legend of Figure 4, line 1101, I believe the panel letters are incorrect.

      We have corrected the manuscript (lines 1241-1242) from “heat-killed yeasts on banana agar (H and I) or live yeasts on a nutritionally rich medium (J and K)” to “heat-killed yeasts on banana agar (G and H) or live yeasts on a nutritionally rich medium (I and J).”

      1. In Figure S1, authors showed that bananas that were not inoculated still had detectable rDNA signal. Is this really because bacteria can penetrate the peel? Or could this be the “reagent microbiome”? Alternatively, could these microbes have been introduced during sample prep, such as cutting the bananas?

      The detection of rDNA in bananas that were not inoculated with microbes was unlikely to be due to microbial contamination during experimental manipulation. The reviewer pointed out the possibility that the “reagent microbiome”, presumably the microbes in PBS, are detected from the uninoculated bananas. This seems to be unlikely, considering the PBS was sterilized by autoclaving before use. To ensure that no viable microbe was left in the autoclaved PBS, we applied a portion of the PBS onto a banana-agar plate and confirmed no colony was formed after incubation for a few days. DNA derived from dead microbes might be present in the PBS, but the PBS-added bananas were incubated for 4 days, so it is also unlikely that a detectable amount of DNA remained until sample collection. Furthermore, we believe that no contamination occurred during sample preparation. Banana peels were treated with 70% ethanol before removing them extremely carefully to avoid touching the fruit inside. All tools were sterilized before use. Taking all of these into account, we speculate that the microbes were already present in the bananas before peeling. We added the details of the sample preparation processes in MATERIALS AND METHODS (lines 518-521 and 540).

      Other major revisions

      1. We deposited our yeast genome annotation data in the DDBJ Annotated/Assembled Sequences database, and the accession numbers have been added to the ‘Data availability’ section in MATERIALS AND METHODS (lines 868-873).

      2. The bacterial composition data in Figure 1B was corrected, because in the original version, the data for Place 3 and Place 4 was plotted in reverse. The original and revised plots are shown side by side in Author response image 3. We hope that the reviewers agree that this replacement of the plots does not affect our conclusion (p5, lines 117-120).

      Author response image 3.

      Comparison of the original and revised version of bacterial composition graph in Figure 1B. Comparison of the original (left) and revised (right) version of the graph at the bottom of Figure 1B, which shows the result of bacterial composition analysis. The color key, which is unmodified, is placed below the revised version.

      1. The plot data and labels in the RNA-seq result heatmaps (Figures 3A and 4C) have been corrected. In these figures, row Z-scores of log2(TPM + 1) were to be plotted, as indicated by the key in each figure. However, in the original version, row Z-scores of TPM was erroneously plotted. Thus, Figures 3A and 4C of the original version have been replaced with the correct plots, and the original and revised plots are shown side by side in Author response images 4A and 4B. We hope that the reviewers agree that this replacement of the plots does not affect our conclusion (p7, lines 222-226 and p9, lines 277-281).

      Author response image 4.

      Comparison of the original and revised version of Figures 3A and 4C. (A and B) Comparison of the original (left) and revised (right) version of Figures 3A (A) or 4C (B).
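
      To make the difference between the two heatmap scalings concrete, the following Python sketch computes row Z-scores either on raw TPM or on log2(TPM + 1). It is an illustration only: the TPM values are invented, the helper name `row_zscores` is hypothetical, and the use of the population standard deviation is an assumption (plotting tools may use the sample standard deviation instead).

```python
import math
import statistics


def row_zscores(values, log_transform=True):
    """Return Z-scores for one gene (one heatmap row).

    With log_transform=True the values are first mapped to log2(TPM + 1),
    which is the quantity named in the figure keys.
    """
    if log_transform:
        values = [math.log2(v + 1.0) for v in values]
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # assumption: population SD
    if sd == 0:
        return [0.0 for _ in values]  # flat row: no variation to display
    return [(v - mean) / sd for v in values]


# Hypothetical TPM values for one gene across four conditions.
tpm_row = [0.0, 1.0, 3.0, 1023.0]
z_log = row_zscores(tpm_row)                       # Z-scores of log2(TPM + 1)
z_raw = row_zscores(tpm_row, log_transform=False)  # Z-scores of raw TPM
```

      Both versions preserve the rank order of the conditions within a row, but the relative spacing of the values differs, which is why the plotted colors change even though the underlying expression data do not.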

      1. The keys in the original Figures 3D and 4F indicate that log2(fold change) was used to plot all data. However, when plotting the data from the previous study (Zinke et al., 2002), their “fold change value” was used. We have corrected the keys, plots, and legend of Figure 3D to reflect the different nature of the data from our RNA-seq analysis and those from microarray analysis by Zinke et al. The original and revised plots are shown side by side in Author response image 5. We hope that the reviewers agree that this replacement of the plots does not affect our conclusion (p7, lines 228-230 and p9, lines 277-284).

      Author response image 5.

      Comparison of the original and revised version of Figures 3D and 4F. (A and B) Comparison of the original (left) and revised (right) version of Figures 3D (A) or 4F (B).

      1. The labels in Figure S5C and S5D (Figure S4C and S4D in the original version) have been corrected (they are "Pichia kluyveri > Supportive" and "Starmerella bacillaris > Supportive" rather than "Non-support. > H. uva" and "Non-support. > K. hum"). Additionally, we have reintroduced the circle indicating the number of “dme04070: Phosphatidylinositol signaling system” DEGs in Figure S5D, which was missing in Figure S4D of the original version. The original and revised figures are shown in Author response image 6.

      Author response image 6.

      Comparison of the original and revised version of Figures S5C and S5D. (A and B) Comparison of the original (left) and revised (right) versions of Figures S5C (A) or S5D (B). The original figures corresponding to the aforementioned figures were Figures S4C and S4D, respectively.

      1. The "Fermentation stage" column in Table 1, which indicated whether each microbe was considered an early-stage microbe or a late-stage microbe, has been removed to avoid confusion. This is because some of the microbes (Hanseniaspora uvarum, Pichia kluyveri, and Pantoea agglomerans) were employed in both of the feeding experiments using the microbes detected from the early-stage foods (Figures 2A, 2B, S2A, and S2B) and those from the late-stage foods (Figures 2C, 2D, S2C, and S2D).

      2. The leftmost column in Table S7 has been edited to indicate species names rather than “Sample IDs,” because the IDs were not used anywhere else in the paper.

      References

      Chandler, J. A., Lang, J., Bhatnagar, S., Eisen, J. A. and Kopp, A. (2011). Bacterial communities of diverse Drosophila species: Ecological context of a host-microbe model system. PLoS Genetics 7, e1002272.

      Chandler, J. A., Eisen, J. A. and Kopp, A. (2012). Yeast communities of diverse Drosophila species: Comparison of two symbiont groups in the same hosts. Applied and Environmental Microbiology 78, 7327–7336.

      Cho, H. and Rohlfs, M. (2023). Transmission of beneficial yeasts accompanies offspring production in Drosophila—An initial evolutionary stage of insect maternal care through manipulation of microbial load? Ecology and Evolution 13, e10184.

      Consuegra, J., Grenier, T., Akherraz, H., Rahioui, I., Gervais, H., da Silva, P. and Leulier, F. (2020). Metabolic Cooperation among Commensal Bacteria Supports Drosophila Juvenile Growth under Nutritional Stress. iScience 23, 101232.

      Dodge, R., Jones, E. W., Zhu, H., Obadia, B., Martinez, D. J., Wang, C., Aranda-Díaz, A., Aumiller, K., Liu, Z., Voltolini, M., et al. (2023). A symbiotic physical niche in Drosophila melanogaster regulates stable association of a multi-species gut microbiota. Nat Commun 14, 1557.

      Erkosar, B., Storelli, G., Mitchell, M., Bozonnet, L., Bozonnet, N. and Leulier, F. (2015). Pathogen Virulence Impedes Mutualist-Mediated Enhancement of Host Juvenile Growth via Inhibition of Protein Digestion. Cell Host & Microbe 18, 445–455.

      Hanson, M. A. and Lemaitre, B. (2020). New insights on Drosophila antimicrobial peptide function in host defense and beyond. Current Opinion in Immunology 62, 22–30.

      Henriques, S. F., Dhakan, D. B., Serra, L., Francisco, A. P., Carvalho-Santos, Z., Baltazar, C., Elias, A. P., Anjos, M., Zhang, T., Maddocks, O. D. K., et al. (2020). Metabolic cross-feeding in imbalanced diets allows gut microbes to improve reproduction and alter host behaviour. Nat Commun 11, 4236.

      Oka, M., Hashimoto, K., Yamaguchi, Y., Saitoh, S., Sugiura, Y., Motoi, Y., Honda, K., Kikko, Y., Ohata, S., Suematsu, M., et al. (2017). Arl8b is required for lysosomal degradation of maternal proteins in the visceral yolk sac endoderm of mouse embryos. Journal of Cell Science jcs.200519.

      Pais, I. S., Valente, R. S., Sporniak, M. and Teixeira, L. (2018). Drosophila melanogaster establishes a species-specific mutualistic interaction with stable gut-colonizing bacteria. PLOS Biology 16, e2005710.

      Piper, M. D. W., Blanc, E., Leitão-Gonçalves, R., Yang, M., He, X., Linford, N. J., Hoddinott, M. P., Hopfen, C., Soultoukis, G. A., Niemeyer, C., et al. (2014). A holidic medium for Drosophila melanogaster. Nature Methods 11, 100–105.

      Piper, M. D. W., Soultoukis, G. A., Blanc, E., Mesaros, A., Herbert, S. L., Juricic, P., He, X., Atanassov, I., Salmonowicz, H., Yang, M., et al. (2017). Matching Dietary Amino Acid Balance to the In Silico-Translated Exome Optimizes Growth and Reproduction without Cost to Lifespan. Cell Metab 25, 610–621.

      Quan, A. S. and Eisen, M. B. (2018). The ecology of the drosophila-yeast mutualism in wineries. PLOS ONE 13, e0196440.

      Solomon, G. M., Dodangoda, H., McCarthy-Walker, T. T., Ntim-Gyakari, R. R. and Newell, P. D. (2019). The microbiota of Drosophila suzukii influences the larval development of Drosophila melanogaster. PeerJ 7, e8097.

      Zinke, I., Schütz, C. S., Katzenberger, J. D., Bauer, M. and Pankratz, M. J. (2002). Nutrient control of gene expression in Drosophila: microarray analysis of starvation and sugar-dependent response. The EMBO Journal 21, 6162–6173.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1:

      Comment 0: Summary: This work presents an Interpretable protein-DNA Energy Associative (IDEA) model for predicting binding sites and affinities of DNA-binding proteins. Experimental results demonstrate that such an energy model can predict DNA recognition sites and their binding strengths across various protein families and can capture the absolute protein-DNA binding free energies.

      We appreciate the reviewer’s careful assessment of the paper, and we thank the reviewer for the insightful suggestions and comments.

      Comment 1: Strengths: (1) The IDEA model integrates both structural and sequence information, although such an integration is not completely original. (2) The IDEA predictions seem to have agreement with experimental data such as ChIP-seq measurements.

      We appreciate the reviewer’s positive comments on the strength of the paper.

      Comment 2: Weaknesses: (1) The authors claim that the binding free energy calculated by IDEA, trained using one MAX-DNA complex, correlates well with experimentally measured MAX-DNA binding free energy (Figure 2) based on the reported Pearson Correlation of 0.67. However, the scatter plot in Figure 2A exhibits distinct clustering of the points and thus the linear fit to the data (red line) may not be ideal. As such, the use of the Pearson correlation coefficient that measures linear correlation between two sets of data may not be appropriate and may provide misleading results for non-linear relationships.

      We thank the reviewer for the insightful comments and agree that a linear fit between our predictions and the experimental data may not be the best measure of performance. The primary utility of the IDEA model is to predict high-affinity DNA-binding sequences for a given DNA-binding protein by assessing the relative binding affinities across different DNA sequences. In this regard, the ranked order of predicted sequence binding affinities serves as a better metric for evaluating the success of this model. To evaluate this, we calculated both Spearman’s rank correlation coefficient, which does not rely on linear correlation, and the Pearson correlation coefficient between our predictions and the experimental results. As shown in Figure 2, our computation shows a Spearman’s rank correlation coefficient of 0.65 for the MAX-based predictions using one MAX-DNA complex (PDB ID: 1HLO), supporting the model’s capability to effectively distinguish strong from weak binders.
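
      The distinction between the two coefficients can be illustrated with a small, self-contained Python sketch. The data below are synthetic and are not taken from the IDEA analysis; the helper names `pearson`, `ranks`, and `spearman` are illustrative. The example uses a monotonic but non-linear relationship, for which the rank-based coefficient is exactly 1 while the linear coefficient is noticeably lower.

```python
import math


def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Synthetic example: measured values grow exponentially with the prediction.
predicted = [1.0, 2.0, 3.0, 4.0, 5.0]
measured = [math.exp(v) for v in predicted]
r_linear = pearson(predicted, measured)   # below 1: relationship is not linear
r_rank = spearman(predicted, measured)    # 1.0: ordering is fully preserved
```

      When the goal is to rank sequences from strong to weak binders, the rank-based coefficient measures exactly that ordering and is insensitive to clustering or curvature in the scatter plot.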

      Although our model generally captures the relative binding affinities across different DNA sequences, its predictive accuracy diminishes for low-affinity sequences (Figure 2).

      This could be due to two limitations of the current modeling framework: (1) The model is residue-based and estimates binding free energy as the additive sum of contributions from individual contacting amino-acid-nucleotide pairs. This assumption does not account for cooperative effects caused by simultaneous changes at multiple nucleotide positions. One potential direction to further improve the model would be to use a finer-grained representation by incorporating more atom types within contacting residues, and to use a many-body potential to better capture cooperative effects from multiple mutations. (2) The model assumes that the target DNA adopts the same binding interface as in the reference crystal structure. However, sequence-dependent DNA shape has been shown to be important in determining protein-DNA binding affinity [1]. To address this limitation, a future direction is to use deep-learning-based methods to incorporate predicted DNA shape or protein-DNA complex structures based on their sequences [2, 3] into our model prediction.
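
      The first limitation follows directly from the additive form of the energy. The Python sketch below is a toy illustration, not the IDEA implementation: the energy table `GAMMA`, the two-residue protein, the contact list, and all numerical values are hypothetical. It shows that, for a pairwise-additive model with a fixed contact map, the energy change of a double mutant equals the sum of the two single-mutant changes, so no cooperative effect can be represented.

```python
# Hypothetical contact energies (arbitrary units), keyed by
# (amino acid, nucleotide). More negative means more favorable.
GAMMA = {
    ("R", "G"): -1.0,
    ("R", "A"): -0.25,
    ("K", "C"): -0.5,
    ("K", "T"): -0.125,
}


def binding_energy(protein, dna, contacts, gamma):
    """Sum pairwise energies over a fixed list of (protein_idx, dna_idx) contacts."""
    return sum(gamma.get((protein[i], dna[j]), 0.0) for i, j in contacts)


protein = "RK"
contacts = [(0, 0), (1, 1)]  # assumed identical for every DNA sequence

e_wild = binding_energy(protein, "GC", contacts, GAMMA)
e_mut1 = binding_energy(protein, "AC", contacts, GAMMA)    # first base mutated
e_mut2 = binding_energy(protein, "GT", contacts, GAMMA)    # second base mutated
e_double = binding_energy(protein, "AT", contacts, GAMMA)  # both mutated

# In an additive model the double-mutant effect is the sum of single effects.
additive_prediction = (e_mut1 - e_wild) + (e_mut2 - e_wild)
double_effect = e_double - e_wild
```

      A many-body potential, as suggested above, would add terms that depend on several positions at once, so that `double_effect` could differ from `additive_prediction`.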

      To fully evaluate the predictive power of IDEA, we have included Spearman’s rank correlation coefficient for every correlation plot in this manuscript and have updated the relevant texts. Across all our analyses, the Spearman’s rank correlation coefficients reveal similar predictive performance as the Pearson correlation coefficients. Additionally, we have included in our discussion the current limitations of our model and potential directions for future improvement.

      We have edited our Discussion Section to include a discussion on the limitations of the current model. Specifically, the added texts are:

      “Although IDEA has proved successful in many examples, it can be improved in several aspects. The model currently assumes the training and testing sequences share the same protein-DNA structure. While double-stranded DNA is generally rigid, recent studies have shown that sequence-dependent DNA shape contributes to their binding specificity [1, 2, 4]. To improve predictive accuracy, one could incorporate predicted DNA shapes or structures into the IDEA training protocol. In addition, the model is residue-based and evaluates the binding free energy as the additive sum of contributions from individual amino-acid-nucleotide contacts. This assumption does not account for cooperative effects that may arise from multiple nucleotide changes. A potential refinement could utilize a finer-grained model that includes more atom types within contacting residues and employs a many-body potential to account for such cooperative effects.”

      Comment 3: (2) In the same vein, the linear Pearson Correlation analysis performed in Figure 5A and the conclusion drawn may be misleading.

      We thank the reviewer for the insightful comments. As noted in our response to the previous comment, we have added Spearman’s rank correlation coefficient in addition to the Pearson correlation coefficient to all correlation plots, including Figure 5A.

      Comment 4: (3) The authors included the sequences of the protein and DNA residues that form close contacts in the structure in the training dataset, whereas a series of synthetic decoy sequences were generated by randomizing the contacting residues in both the protein and DNA sequences. In particular, synthetic decoy binders were generated by randomizing either the DNA (1000 sequences) or protein sequences (10,000 sequences) from the strong binders. However, the justification for such randomization and how it might impact the model’s generalizability and transferability remain unclear.

      We thank the reviewer for the insightful comments. The number of randomized sequences was chosen to strike a balance between sufficient sequence coverage and computational feasibility. Because proteins are built from more residue types (20 amino acids) than DNA (four nucleotides), we utilized more protein decoy sequences than DNA decoys. To examine the robustness of our choice against different numbers of decoy sequences, we repeated the transferability analysis within the bHLH superfamily (Figure 3A) and the generalizability analysis across 12 protein families (Figure 2E) using two additional decoy sequence combinations: (1) 1000 DNA sequences and 1000 protein sequences; (2) 100 DNA sequences and 1000 protein sequences. As shown in Figure S15, we achieved similar results to those reported using the original decoy set, demonstrating the robustness of our model prediction against the variations in the number of decoys. We have included this figure as Figure S15.
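
      As a rough illustration of how such decoy sets could be produced, the Python sketch below shuffles the contacting residues of a strong binder. This is an assumption-laden sketch rather than the authors' procedure: the response does not state whether randomization means shuffling or resampling, so shuffling is assumed here, and the function name `make_decoys`, the protein contact string, and the random seed are hypothetical.

```python
import random


def make_decoys(sequence: str, n_decoys: int, rng: random.Random) -> list[str]:
    """Return n_decoys random permutations of the given contacting residues.

    Shuffling preserves the composition of the sequence; a decoy may
    occasionally coincide with the native sequence.
    """
    residues = list(sequence)
    decoys = []
    for _ in range(n_decoys):
        shuffled = residues[:]
        rng.shuffle(shuffled)
        decoys.append("".join(shuffled))
    return decoys


rng = random.Random(0)  # fixed seed so the decoy set is reproducible

# Decoy counts follow the numbers quoted in the reviewer comment.
dna_decoys = make_decoys("CACGTG", 1000, rng)            # E-box motif
protein_decoys = make_decoys("HNERERKRR", 10_000, rng)   # hypothetical contacts
```

      Changing the two counts (for example to 1000 and 1000, or 100 and 1000) reproduces the kind of robustness check described in the response.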

      Comment 5: (4) The authors performed Receiver Operating Characteristic (ROC) analysis and reported the Area Under the Curve (AUC) scores in order to quantitate the successful identification of the strong binders by IDEA. It would be beneficial to analyze the precision-recall (PR) curve and report the PRAUC metric which could be more robust.

      We agree with the reviewer that more robust statistical metrics should be used to evaluate our model’s performance. We have included the PRAUC score as an additional evaluation metric of the model’s performance. Due to a significant imbalance in the number of strong and weak binders from the experimental data [5], where the experimentally identified strong binders are far fewer than the weak binders, we reweighted the sample to achieve a balanced evaluation [6], using 0.5 as the baseline for randomized prediction. As shown in Figure S5, IDEA achieves successful predictions in 18 out of 22 cases, demonstrating its predictive accuracy.

      The updated PRAUC result has been included as Figure S5 in the manuscript. We have also included the detailed precision-recall curves for each case in Figure S4.

      In addition, we have provided PRAUC scores for comparing the performance of IDEA with other models, and have summarized these results in Table S2.
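
      One way to obtain a class-balanced PRAUC with a random baseline of 0.5 is sketched below. This is an illustrative assumption about the reweighting, not necessarily the exact procedure of [6]: each class is given a total weight of 0.5, precision is computed from the weighted counts, and the area is accumulated as average precision. Ties in the scores are not treated specially, and the function name is hypothetical.

```python
def balanced_average_precision(scores, labels):
    """Average precision with strong and weak binders weighted equally.

    labels: 1 for strong binders, 0 for weak binders.
    Each class carries a total weight of 0.5, so a random ranking
    gives an expected precision of about 0.5.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    w_pos, w_neg = 0.5 / n_pos, 0.5 / n_neg

    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    tp = fp = 0.0   # weighted true and false positives so far
    area = 0.0
    for _, label in ranked:
        if label:
            tp += w_pos
            precision = tp / (tp + fp)
            area += precision * (1.0 / n_pos)  # recall step per positive
        else:
            fp += w_neg
    return area

# Hypothetical imbalanced example: 2 strong binders, 4 weak binders.
labels = [1, 1, 0, 0, 0, 0]
perfect = balanced_average_precision([0.9, 0.8, 0.3, 0.2, 0.1, 0.05], labels)
inverted = balanced_average_precision([0.1, 0.05, 0.9, 0.8, 0.7, 0.6], labels)
```

      With this weighting, a score above 0.5 indicates better-than-random identification of strong binders regardless of how few strong binders the dataset contains.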

      Reviewer #2:

      Comment 0: Summary: Zhang et al. present a methodology to model protein-DNA interactions via learning an optimizable energy model, taking into account a representative bound structure for the system and binding data. The methodology is sound and interesting. They apply this model for predicting binding affinity data and binding sites in vivo. However, the manuscript lacks discussion of/comparison with state-of-the-art and evidence of broad applicability. The interpretability aspect is weak, yet over-emphasized.

      We appreciate the reviewer’s excellent summary of the paper, and we thank the reviewer for the insightful suggestions and comments.

      Comment 1: Strengths: The manuscript is well organized with good visualizations and is easy to follow. The methodology is discussed in detail. The IDEA energy model seems like an interesting way to study a protein-DNA system in the context of a given structure and binding data. The authors show that an IDEA model trained on one system can be transferred to other structurally similar systems. The authors show good performance in discriminating between binding-vs-decoy sequences for various systems, and binding affinity prediction. The authors also show evidence of the ability to predict genome-wide binding sites.

      We appreciate the reviewer’s strong assessment of the strengths of this paper. We have further refined our Methods Section to ensure all modeling details are clearly presented.

      Comment 2: Weaknesses: An energy-based model that needs to be optimized for specific systems is inherently an uncomfortable idea. Is this kind of energy model superior to something like Rosetta-based energy models, which are generally applicable? Or is it superior to family-specific knowledge-based models? It is not clear.

      We thank the reviewer for the insightful comments. The protein-DNA energy model facilitates the calculation of protein-DNA binding free energy based on protein-DNA structures and sequences. Because this model is optimized using the structure-sequence relationship of given protein-DNA complexes, it features specificity based on the conserved structural interface characteristic of each protein family. Because of that, its predictive accuracy depends on the degree of protein-DNA interface similarity between the training and target protein-DNA pairs, and is distinct from a general protein-DNA energy model, such as a Rosetta-based energy model. The model has some connections to the family-specific energy model. As shown in Author response image 1, systems belonging to the same protein superfamily (MAX and PHO4) exhibit similar patterns in their learned energy models, in contrast to those from a different superfamily (PDX1).

      Author response image 1:

      Comparison of learned energy models for different protein-DNA complexes: MAX (A), PHO4 (B), and PDX1 (C). MAX and PHO4 are members of the Helix-loop-helix (HLH) CATH protein superfamily (4.10.280.10), while PDX1 belongs to another CATH protein superfamily, Homeodomain-like (1.10.10.60).

      To compare our approach with both general and family-specific knowledge-based energy models, we conducted two studies. First, we incorporated a knowledge-based generic protein-DNA energy model (DBD-Hunter) learned from the protein-DNA database, reported by Skolnick and coworkers [7], into our prediction protocol. This model assigns interaction energies to different functional groups within each DNA nucleotide (e.g., phosphate (PP), sugar (SU), pyrimidine (PY), and imidazole (IM) groups). For our comparison, we averaged the energy contributions of these groups within each nucleotide and replaced the IDEA-learned energy model with this generic one to test its ability to differentiate strong binders from weak binders in the HT-SELEX dataset [5]. As shown in Figure S6, the IDEA model generally achieves better performance than the generic energy model.

      Additionally, we compared IDEA with rCLAMPS, a family-specific energy model developed to predict protein-DNA binding specificity in the C2H2 and homeodomain families.

      As shown in Table S1 and Table S2, IDEA also shows better performance than rCLAMPS in most cases across the C2H2 and homeodomain families, demonstrating that it has better predictive accuracy than both state-of-the-art family-specific and generic knowledge-based models.

      We have included relevant texts in Appendix Section Comparison of IDEA predictive performance Using HT-SELEX data to clarify this point. The added texts are:

      In addition, we compared the performance of IDEA with both general and family-specific knowledge-based energy models. First, we incorporated a knowledge-based generic protein-DNA energy model (DBD-Hunter) learned from the protein-DNA database, reported by Skolnick and coworkers [7], into our prediction protocol. This model assigns interaction energies to different functional groups within each DNA nucleotide, including phosphate (PP), sugar (SU), pyrimidine (PY), and imidazole (IM) groups. For our comparison, we averaged the energy contributions of these groups within each nucleotide and replaced the IDEA-learned energy model with the DBD-Hunter model to assess its ability to differentiate strong binders from weak binders in the HT-SELEX dataset [5]. Additionally, we compared IDEA with rCLAMPS, a family-specific energy model developed to predict protein-DNA binding specificity in the C2H2 and homeodomain families. rCLAMPS learns a position-dependent amino-acid-nucleotide interaction energy model. To incorporate this model into the binding free energy calculation, we averaged the energy contributions across all occurrences of each amino-acid-nucleotide pair, which resulted in a 20-by-4 residue-type-specific energy matrix. This matrix is structurally analogous to the IDEA-trained energy model and can be directly integrated into the binding free energy calculations. As shown in Figure S6, Table S1, and Table S2, the IDEA model generally outperforms DBD-Hunter and rCLAMPS, demonstrating that it can achieve better predictive accuracy than both generic and family-specific knowledge-based models.
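
      The averaging step that collapses a position-dependent model into a 20-by-4 residue-type matrix can be sketched as follows. The input layout (a flat list of amino acid, nucleotide, energy entries), the function name, and the choice of 0.0 for pairs that never occur are assumptions made for illustration; they do not reflect the actual rCLAMPS output format.

```python
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
NUCLEOTIDES = "ACGT"

def average_by_residue_type(position_energies):
    """Average position-specific energies over each (amino acid, nucleotide) pair.

    position_energies: iterable of (amino_acid, nucleotide, energy) tuples,
    one tuple per position-specific entry of the source model.
    Returns a 20 x 4 nested list indexed as [amino acid][nucleotide].
    Pairs that never occur are set to 0.0 (an assumption of this sketch).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for aa, nt, energy in position_energies:
        sums[(aa, nt)] += energy
        counts[(aa, nt)] += 1
    return [
        [sums[(aa, nt)] / counts[(aa, nt)] if counts[(aa, nt)] else 0.0
         for nt in NUCLEOTIDES]
        for aa in AMINO_ACIDS
    ]

# Hypothetical entries: the R-G pair appears at two different positions.
entries = [("R", "G", -1.0), ("R", "G", -3.0), ("K", "A", 0.5)]
matrix = average_by_residue_type(entries)
r_g = matrix[AMINO_ACIDS.index("R")][NUCLEOTIDES.index("G")]
```

      The resulting matrix has the same shape as an IDEA-trained energy model, so it can be substituted for it without changing the downstream scoring.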

      Comment 3: Prediction of binding affinity is a well-studied domain and many competitors exist, some of which are well-used. However, no quantitative comparison to such methods is presented. To understand the scope of the presented method, IDEA, the authors should discuss/compare with such methods (e.g. PMID 35606422).

      We thank the reviewer for the insightful comments. As detailed in our response to Comment 5, we previously misused the term “binding specificity”, and would like to clarify that our model is designed to predict protein-DNA binding affinity. To compare the performance of IDEA with state-of-the-art protein-DNA predictive models, we examined the predictive accuracies of two additional popular computational models: ProBound [8] and DeepBind [9]. ProBound has been shown to have a better performance than several earlier predictive protein-DNA models, including JASPAR 2018 [11], HOCOMOCO [12], Jolma et al. [13], and DeepSELEX [14]. To benchmark these models’ performance, we examine each method’s capability to identify strong binders with the HT-SELEX datasets covering 22 proteins from 12 protein families [5]. As suggested by Reviewer 1, we also calculated the PRAUC score, reweighted to account for data imbalance [6], as a complementary metric for evaluating the model performance.

      As shown in Figure S6, Table S1, and Table S2, IDEA ranked second among the three predictive methods. It is important to note that both ProBound and DeepBind were trained on a curated version of the HT-SELEX data [13], which overlaps with the testing data [5]. Compared with them, IDEA was trained only on the given structural and sequence information from a single protein-DNA complex, thus independent of the testing data. In order to assess how IDEA performs when incorporating knowledge from HT-SELEX data, we augmented the training by randomly including half of the HT-SELEX data (see the Methods Section Enhanced Modeling Prediction with SELEX Data). The augmented IDEA model achieved the best performance among all the models. Overall, IDEA can be used to predict protein-DNA affinities in the absence of known binding sequence data, thereby filling a critical gap when such experimental datasets are unavailable.

      Additionally, we have conducted a 10-fold cross-validation using the same HT-SELEX data [5] and found that IDEA outperformed a recent regression model that considers the shape of DNA with different sequences [5].

      We have revised our text to include the comparison between IDEA and other predictive models. Specifically, we revised the text in Section: IDEA Generalizes across Various Protein Families.

      The revised text reads:

      “To examine IDEA’s predictive accuracy across different DNA-binding protein families, we applied it to calculate protein-DNA binding affinities using a comprehensive HT-SELEX dataset [5]. We focused on evaluating the capability of IDEA to distinguish strong binders from weak binders for each protein with an experimentally determined structure. We calculated the probability density distribution of the top and bottom binders identified in the SELEX experiment. A well-separated distribution indicates the successful identification of strong binders by IDEA (Figure 2D and S4). Receiver Operating Characteristic (ROC) analysis was performed to calculate the Area Under the Curve (AUC) and the precision-recall curve (PRAUC) scores for these predictions. Further details are provided in the Methods Section Evaluation of IDEA Prediction Using HT-SELEX Data. Our analysis shows that IDEA successfully differentiates strong from weak binders for 80% of the 22 proteins across 12 protein families, achieving AUC and balanced PRAUC scores greater than 0.5 (Figure 2D and S5). To benchmark IDEA’s performance against other leading methods, we compared its predictions with several popular models, including the sequence-based predictive models ProBound [8] and DeepBind [9], the family-based energy model rCLAMPS [10], and the knowledge-based energy model DBD-Hunter [7]. IDEA demonstrates performance comparable to these state-of-the-art approaches, and incorporating sequence features further improves its prediction accuracy (Figure S6, Table S1, and Table S2). We also performed 10-fold cross-validation on the binding affinities of protein–DNA pairs in this dataset and found that IDEA outperforms a recent regression model that considers the shape of DNA with different sequences [5] (Figure S7). Details are provided in Section: Comparison of IDEA predictive performance Using HT-SELEX data.”

      We also added one section Comparison of IDEA predictive performance Using HT-SELEX data in the Appendix to fully explain the comparison between IDEA and other popular models. The added texts are:

      “To benchmark the performance of IDEA against state-of-the-art protein-DNA predictive models, we evaluated its ability to recognize strong binders with the HT-SELEX datasets across 22 proteins from 12 families [5]. Specifically, we compared IDEA with two widely used sequence-based models: ProBound [8] and DeepBind [9]. ProBound has demonstrated superior performance over many other predictive protein-DNA models, including JASPAR 2018 [11], HOCOMOCO [12], Jolma et al. [13], and DeepSELEX [14]. To use ProBound, we retrieved the trained binding model for each protein from motifcentral.org and used the GitHub implementation of ProBoundTools to infer the binding scores between protein and target DNA sequences. Except for POU3F1, binding models are available for all proteins. Therefore, we excluded POU3F1 and evaluated the protein-DNA binding affinities for the remaining 21 proteins. To use DeepBind, sequence-specific binding affinities were predicted directly with its web server. The Area Under the Curve (AUC) and the Precision-Recall AUC (PRAUC) scores were used as metrics for comparison. An AUC score of 1.0 indicates a perfect separation between the strong- and weak-binder distributions, while an AUC score of 0.5 indicates no separation. Because there is a significant imbalance in the number of strong and weak binders from the experimental data [5], where the strong binders are far fewer than the weak binders, we reweighted the samples to achieve a balanced evaluation, using 0.5 as the baseline for randomized prediction [6]. As summarized in Figure S6, Table S1, and Table S2, IDEA ranked second among the three predictive models. To assess the performance of IDEA when supplied with additional protein-DNA binding data, we augmented its training with a randomly selected half of the HT-SELEX data (see the Methods Section Enhanced Modeling Prediction with SELEX Data). The augmented IDEA model achieved the best performance among all the models.”

      “We also performed 10-fold cross-validation using the same HT-SELEX datasets, following the protocol described in the Methods Section Enhanced Modeling Prediction with SELEX Data. For each protein, we divided the entire dataset into 10 equal, randomly assigned folds. In each iteration, we used 9 of the 10 folds as the training dataset and the remaining fold as the testing dataset. This process was repeated 10 times so that each fold served as the test set once. We then reported the average R² scores across these iterations to evaluate IDEA’s predictive performance. Our results are compared with the 1mer and 1mer+shape methods from [5], the latest regression model that considers the shape of DNA with different sequences (Figure S7). This comparative analysis shows that IDEA achieved higher predictive accuracy than the state-of-the-art sequence-based protein-DNA binding predictors for protein-DNA complexes that have available experimentally resolved structures.”

      “Overall, these results demonstrate that IDEA can be used to predict the protein-DNA pairs in the absence of known binding sequence data, thus filling an important gap in protein-DNA predictions when experimental binding sequence data are unavailable.”
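
      The 10-fold protocol quoted above can be sketched as follows. The fold assignment and the R² averaging follow the description in the text; the one-feature least-squares model and the toy data are hypothetical stand-ins for the IDEA training step, which is not reproduced here.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def r2_score(y_true, y_pred):
    """Coefficient of determination."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def cross_validate(features, targets, fit, predict, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, average R^2."""
    scores = []
    for test_idx in k_fold_indices(len(targets), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(targets)) if i not in held_out]
        model = fit([features[i] for i in train_idx],
                    [targets[i] for i in train_idx])
        predictions = [predict(model, features[i]) for i in test_idx]
        scores.append(r2_score([targets[i] for i in test_idx], predictions))
    return sum(scores) / len(scores)

# Placeholder model (NOT the IDEA optimization): 1-D least squares.
def fit_line(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def predict_line(model, x):
    slope, intercept = model
    return slope * x + intercept

# Hypothetical noise-free data, so every fold is predicted exactly.
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + 1.0 for x in xs]
folds = k_fold_indices(len(xs))
mean_r2 = cross_validate(xs, ys, fit_line, predict_line)
```

      Because every sample appears in exactly one test fold, the averaged R² reflects performance on data that the model did not see during training.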

      Comment 4: The term “interpretable” has been used lavishly in the manuscript while providing little evidence on the matter. The only evidence shown is the family-specific residue-nucleotide interaction/energy matrix and speculations on how these values are biologically sensible. Recent works already present more biophysical, fine-grained, and sometimes family-independent interpretability (e.g. PMID 39103447, 36656856, 38352411, etc.). The authors should put into context the scope of the interpretability of IDEA among such works.

      We thank the reviewer for the insightful comment and agree that “interpretability” should be discussed in a relevant context. In our work, interpretability refers to the family-specific amino-acid-nucleotide interaction energies identified from the model training, which reveal interaction preferences within protein-DNA binding interfaces. As detailed in our response to Comment 6, we performed principal component analysis (PCA) on the learned energy models and observed clustering of learned energy models corresponding to protein families. Therefore, the IDEA-learned energy models can be used as a signature to capture the energetic preferences of amino-acid-nucleotide interactions within a given protein family. This preference can be used to infer preferred sequence binding motifs, similar to those identified by other computational tools [10, 4, 15, 16].

      We have revised the text to clarify “interpretability” as the family-specific amino-acid-nucleotide interactions that govern sequence-dependent protein-DNA binding, and to discuss IDEA’s interpretability within the context of recent works, including those suggested by the reviewers.

      We have revised the text in Introduction. The new text reads:

      “Here, we introduce the Interpretable protein-DNA Energy Associative (IDEA) model, a predictive model that learns protein-DNA physicochemical interactions by fusing available biophysical structures and their associated sequences into an optimized energy model (Figure 1). We show that the model can be used to accurately predict the sequence-specific DNA binding affinities of DNA-binding proteins and is transferrable across the same protein superfamily. Moreover, the model can be enhanced by incorporating experimental binding data and can be generalized to enable base-pair resolution predictions of genomic DNA-binding sites. Notably, IDEA learns a family-specific interaction matrix that quantifies energetic interactions between each amino acid and nucleotide, allowing for a direct interpretation of the “molecular grammar” governing sequence-specific protein-DNA binding affinities. This interpretable energy model is further integrated into a simulation framework, facilitating mechanistic studies of various biomolecular functions involving protein-DNA dynamics.”

      We have revised the text in Results. The new text reads:

      “IDEA is a coarse-grained biophysical model at the residue resolution for investigating protein-DNA binding interactions (Figure 1). It integrates both structures and corresponding sequences of known protein-DNA complexes to learn an interpretable energy model based on the interacting amino acids and nucleotides at the protein-DNA binding interface. The model is trained using available protein-DNA complexes curated from existing databases [17, 18].

      Unlike existing deep-learning-based protein-DNA binding prediction models, IDEA aims to learn a physicochemical-based energy model that quantitatively characterizes sequence-specific interactions between amino acids and nucleotides, thereby interpreting the “molecular grammar” driving the binding energetics of protein-DNA interactions. The optimized energy model can be used to predict the binding affinity of any given protein-DNA pair based on its structures and sequences. Additionally, it enables the prediction of genomic DNA binding sites by a given protein, such as a transcription factor. Finally, the learned energy model can be incorporated into a simulation framework to study the dynamics of DNA-binding processes, revealing mechanistic insights into various DNA-templated processes. Further details of the optimization protocol are provided in Methods Section Energy Model Optimization.”

      The revised text in Section: Discussion now reads:

      “Another highlight of IDEA is its ability to present an interpretable, family-specific amino acid-nucleotide interaction energy model for given protein-DNA complexes. The optimized IDEA energy model can not only predict sequence-specific binding affinities of protein-DNA pairs but also provide a residue-specific interaction matrix that dictates the preferences of amino acid-nucleotide interactions within specific protein families (Figure S11). This interpretable energy matrix would facilitate the discovery of sequence binding motifs for target DNA-binding proteins, complementing both sequence-based [24, 16, 25] and structure-based approaches [10, 26, 4, 15]. Additionally, we integrated this physicochemical-based energy model into a simulation framework, thereby improving the characterization of protein-DNA binding dynamics. IDEA-based simulation enables the investigation into dynamic interactions between various proteins and DNA, facilitating molecular-level understanding of the physical mechanisms underlying many DNA-binding processes, such as transcription, epigenetic regulations, and their modulation by sequence variations, such as single-nucleotide polymorphisms (SNPs) [22, 23].”

      Comment 5: The manuscript disregards subtle yet important differences in commonly used terminology in the field. For example, the authors use the term “specificity” and “affinity” almost interchangeably (for example, the caption for Figure 3A uses “specificity” although the Methods text describes the prediction as about “affinity”). If the authors are looking to predict specificity, IDEA needs to be put in the context of the corresponding state-of-the-art (PMID 36123148, 39103447, 38867914, 36124796, etc).

      We thank the reviewer for pointing out the conflation of “specificity” and “affinity” in our manuscript. To clarify, the primary function of IDEA is to predict the binding affinities of protein-DNA pairs in a sequence-specific manner. We have revised the text to clarify the distinction between affinity and specificity and to acknowledge prior works, including those provided by the reviewers, that focus on predicting protein-DNA binding specificity.

      We have revised the Section title IDEA Accurately Predicts Protein-DNA Binding Specificity to IDEA Accurately Predicts Sequence-Specific Protein-DNA Binding Affinity; and Residue-Level Protein-DNA Energy Model for Predicting Protein-DNA Recognition Specificities to Predictive Protein-DNA Energy Model at Residue Resolution.

      We have revised the text in Introduction. The revised text reads:

      “Computational methods complement experimental efforts by providing the initial filter for assessing sequence-specific protein-DNA binding affinity. Numerous methods have emerged to enable predictions of binding sites and affinities of DNA-binding proteins [27, 9, 1, 5, 28, 29, 30, 31, 8]. These methods often utilized machine-learning-based training to extract sequence preference information from DNA or protein by utilizing experimental high-throughput (HT) assays [27, 9, 1, 5, 28, 8], which rely on the availability and quality of experimental binding assays. Additionally, many approaches employ deep neural networks [29, 30, 31], which could obscure the interpretation of interaction patterns governing protein-DNA binding specificities. Understanding these patterns, however, is crucial for elucidating the molecular mechanisms underlying various DNA-recognition processes, such as those seen in TFs [32].”

      We have revised the text in Section: IDEA Demonstrates Transferability across Proteins in the Same CATH Superfamily.

      The revised text reads:

      “Since IDEA relies on the sequence-structure relationship of given protein-DNA complexes to reach predictive accuracy, we inquired whether the trained energy model from one protein-DNA complex could be generalized to predict the sequence-specific binding affinities of other complexes. To test this, we assessed the transferability of IDEA predictions across all 11 structurally available protein-DNA complexes within the MAX TF-associated CATH superfamily (CATH ID: 4.10.280.10, Helix-loop-helix DNA-binding domain). We trained IDEA based on each of these 11 complexes and then used the trained model to predict the MAX-based MITOMI binding affinity. Our results show that IDEA generally makes correct predictions of the binding affinity when trained on proteins that are homologous to MAX, with Pearson and Spearman Correlation coefficients larger than 0.5 (Figure 3A and Figure S10).”

      We have revised the caption of Figure 3: The revised text reads:

      “IDEA prediction shows transferability within the same CATH superfamily. (A) The predicted MAX binding affinity, trained on other protein-DNA complexes within the same protein CATH superfamily, correlates well with experimental measurement. The proteins are ordered by their probability of being homologous to the MAX protein, determined using HHpred [33]. Training with a homologous protein (determined as a hit by HHpred) usually leads to better predictive performance (Pearson Correlation coefficient > 0.5) compared to non-homologous proteins. (B) Structural alignment between 1HLO (white) and 1A0A (blue), two protein-DNA complexes within the same CATH Helix-loop-helix superfamily. The alignment was performed based on the E-box region of the DNA [34]. (C) The optimized energy model for 1A0A, a protein-DNA complex structure of the transcription factor PHO4 and DNA, with 33.41% probability of being homologous to the MAX protein. The optimized energy model is presented in reduced units, as explained in the Methods Section: Training Protocol.”

      We have revised the text in Section Discussion: The revised text now reads:

      “The protein-DNA interaction landscape has evolved to facilitate precise targeting of proteins towards their functional binding sites, which underlie essential processes in controlling gene expression. These interaction specifics are determined by physicochemical interactions between amino acids and nucleotides. By integrating sequences and structural data from available protein-DNA complexes into an interaction matrix, we introduce IDEA, a data-driven method that optimizes a system-specific energy model. This model enables high-throughput in silico predictions of protein-DNA binding specificities and can be scaled up to predict genomic binding sites of DNA-binding proteins, such as TFs. IDEA achieves accurate de novo predictions using only protein-DNA complex structures and their associated sequences, but its accuracy can be further enhanced by incorporating available experimental data from other binding assay measurements, such as the SELEX data [35, 36, 37], achieving accuracy comparable to or better than state-of-the-art methods (Figures S2 and S7, Table S1 and S2). Despite significant progress in genome-wide sequencing techniques [38, 39, 40, 41], determining sequence-specific binding affinities of DNA-binding biomolecules remains time-consuming and expensive. Therefore, IDEA presents a cost-effective alternative for generating the initial predictions before pursuing further experimental refinement.”

      We have revised the text in Discussion to clarify that the acquired binding affinities of target DNA sequences can be used to help existing models to infer specific DNA binding motifs.

      The revised text now reads:

      “Another highlight of IDEA is its ability to present an interpretable, family-specific amino acid-nucleotide interaction energy model for given protein-DNA complexes. The optimized IDEA energy model can not only predict sequence-specific binding affinities of protein-DNA pairs but also provide a residue-specific interaction matrix that dictates the preferences of amino acid-nucleotide interactions within specific protein families (Figure S11). This interpretable energy matrix would facilitate the discovery of sequence binding motifs for target DNA-binding proteins, complementing both sequence-based [24, 16, 25] and structure-based approaches [10, 26, 4, 15]. Additionally, we integrated this physicochemical-based energy model into a simulation framework, thereby improving the characterization of protein-DNA binding dynamics. IDEA-based simulation enables the investigation into dynamic interactions between various proteins and DNA, facilitating molecular-level understanding of the physical mechanisms underlying many DNA-binding processes, such as transcription, epigenetic regulations, and their modulation by sequence variations, such as single-nucleotide polymorphisms (SNPs) [22, 23].”

      Comment 6: It is not clear how much the learned energy model is dependent on the structural model used for a specific system/family. It would be interesting to see the differences in learned model based on different representative PDB structures used. Similarly, the supplementary figures show a lack of discriminative power for proteins like PDX1 (homeodomain family), POU, etc. Can the authors shed some light on why such different performances?

      We thank the reviewer for the insightful comments and agree that the trained energy model should be presented in the context of protein families. To further analyze the dependence of the energy model on protein family, we visualized the trained energy models for 24 proteins, including all proteins from the HT-SELEX dataset as well as PHO4 (PDB ID: 1A0A) and CTCF (PDB ID: 8SSQ), spanning 12 distinct protein families. To quantitatively assess similarities and differences among these energy models, we flattened each normalized energy model into an 80-dimensional vector and performed principal component analysis (PCA). As shown in Author response image 1 and Figure S11, energy models optimized from the same protein family fall within the same cluster, while those from different protein families exhibit distinct patterns. Moreover, the relative distance between energy models in PCA space reflects the degree of transferability. For example, PHO4 (PDB ID: 1A0A) is positioned close to MAX (PDB ID: 1HLO), whereas USF1 (PDB ID: 1AN4) and TCF4 (PDB ID: 6OD3) are farther away. This is consistent with the results shown in Figure 3A, where the energy model trained from PHO4 has better transferability than those from the other two systems.
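
      The flatten-and-project step can be sketched as follows. This is a minimal illustration on synthetic data: the unit-norm normalization, the power-iteration solver for the first principal axis, and the two toy "families" are assumptions of the sketch, not the procedure or data used for Figure S11.

```python
import math
import random

def flatten_and_normalize(matrix):
    """Flatten a 20 x 4 energy matrix into an 80-vector of unit length."""
    flat = [v for row in matrix for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat]

def center(vectors):
    """Subtract the per-dimension mean across all energy models."""
    means = [sum(col) / len(vectors) for col in zip(*vectors)]
    return [[v - m for v, m in zip(vec, means)] for vec in vectors]

def first_principal_axis(centered, n_iter=200, seed=0):
    """Leading eigenvector of X^T X, found by power iteration."""
    rng = random.Random(seed)
    dim = len(centered[0])
    axis = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for _ in range(n_iter):
        proj = [sum(a * b for a, b in zip(row, axis)) for row in centered]
        new = [sum(p * row[j] for p, row in zip(proj, centered))
               for j in range(dim)]
        norm = math.sqrt(sum(v * v for v in new))
        axis = [v / norm for v in new]
    return axis

def project(centered, axis):
    """Coordinate of each energy model along the given axis."""
    return [sum(a * b for a, b in zip(row, axis)) for row in centered]

def toy_energy_model(favoured_column, rng):
    """Synthetic 20 x 4 matrix favouring one nucleotide column, plus noise."""
    return [[(-1.0 if col == favoured_column else 0.0) + rng.gauss(0.0, 0.05)
             for col in range(4)] for _ in range(20)]

rng = random.Random(0)
family_a = [toy_energy_model(0, rng) for _ in range(3)]
family_b = [toy_energy_model(3, rng) for _ in range(3)]

vectors = [flatten_and_normalize(m) for m in family_a + family_b]
centered = center(vectors)
pc1_scores = project(centered, first_principal_axis(centered))
scores_a, scores_b = pc1_scores[:3], pc1_scores[3:]
```

      In this toy setting, models from the same synthetic family land on the same side of the first principal component, which mirrors the family-level clustering described above.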

      We also greatly appreciate the reviewer’s suggestion to examine cases where IDEA failed to demonstrate strong discriminative power. When evaluating the model’s ability to distinguish between strong and weak binders, we used the available experimental structure most similar to the protein employed in the HT-SELEX experiments. In some instances, only the structure of the same protein from a different organism is available. For example, the HT-SELEX data for PDX1-DNA used the human PDX1 protein, but no human PDX1–DNA complex structure is available. Therefore, we used the mouse PDX1–DNA complex (PDB ID: 2H1K) for model training. The differences between species may limit the predictive accuracy of the model. A similar limitation applies to POU3F1, where an available mouse complex (PDB ID: 4Y60) was used to predict human protein–DNA interactions. Notably, DeepBind [9], a sequence-based prediction tool, also failed to distinguish strong from weak binders when using the mouse POU3F1 protein (AUC score: 0.457), but this was corrected with the human POU3F1 protein (AUC score: 0.956).

      We also examined the remaining cases where IDEA did not show a clear distinction between strong and weak binders: USF1, Egr1, and PROX1. For PROX1, we initially used the structure of a protein-DNA complex (PDB ID: 4Y60) in training. However, upon closer inspection, we discovered that this structure does not include the PROX1 protein, but SOX-18, a different transcription factor. This explains the inaccurate prediction made by IDEA. Since no experimental PROX1-DNA complex structure is currently available, we have removed this case from our HT-SELEX evaluation.

      IDEA also fails to fully resolve the binding preference of USF1. A closer examination of the HT-SELEX data reveals a lack of distinction among the sequences, as most sequences, including those with the lowest M-word (binding affinity) scores, contain the DNA-binding E-box sequence CACGTG. Therefore, USF1 represents a challenging example where the experimental data only consists of strong binders with limited variations in binding affinity, which likely results from differences in flanking sequences of the E-box motif.

      Egr1 stands as a peculiar example. Whereas IDEA does not effectively distinguish between the strong and weak binders in the current HT-SELEX dataset, its predictions are consistent with other experimental datasets, including binding affinities measured by k-MITOMI [42] (Figure S8A, B), preferred binding sequences from protein-binding microarray, an earlier HT-SELEX experiment, and bacterial one-hybrid data [43]. Therefore, further investigation of the current HT-SELEX data is needed to reconcile these differences.

      We have included additional text in Section: IDEA Demonstrates Transferability across Proteins in the Same CATH Superfamily to discuss the PCA analysis and the dependence of the model’s transferability on the similarity among the learned energy models.

      The revised text now reads:

      “The transferability of IDEA within the same CATH superfamily can be understood from the similarities in protein-DNA binding interfaces, which determine similar learned energy models. For example, the PHO4 protein (PDB ID: 1A0A) shares a highly similar DNA-binding interface with the MAX protein (PDB ID: 1HLO) (Figure 3B), despite sharing only a 33.41% probability of being homologous. Consequently, the energy model derived from the PHO4-DNA complex (Figure 3C) exhibits a similar amino-acid-nucleotide interactive pattern as that learned from the MAX-DNA complex (Figure 2B). To further evaluate the similarity between the learned energy models and their connection to protein families, we performed principal component analysis (PCA) on the normalized energy models across 24 proteins from 12 protein families [5]. Our analysis (Figure S11) reveals that most of the energy models from the same protein family fall within the same cluster, while those from different protein families exhibit distinct patterns. Moreover, the relative distance between energy models in PCA space reflects the degree of transferability between them. For example, PHO4 (PDB ID: 1A0A) is positioned close to MAX (PDB ID: 1HLO), whereas USF1 (PDB ID: 1AN4) and TCF4 (PDB ID: 6OD3) are farther away. This is consistent with the results in Figure 3A, where the energy model trained on PHO4 has better transferability than those trained on USF1 or TCF4.”

      We have also added an Appendix section titled Analysis of examples where IDEA fails to recognize strong DNA binders to discuss the examples in which IDEA did not perform well:

      “We examine IDEA’s capability in identifying strong binders from the HT-SELEX dataset across 12 protein families [5]. The model successfully predicts 18 out of 22 protein-DNA systems, but the performance is reduced in 4 cases. Closer investigations revealed the source of these limitations. In some instances, only the protein from a different organism is available. For example, the PDX1 HT-SELEX data utilized the human PDX1 protein, but no human PDX1–DNA complex structure is available. Therefore, the mouse PDX1–DNA complex structure (PDB ID: 2H1K) was used for model training. Differences between model organisms may reduce predictive accuracy. A similar limitation applies to POU3F1, where an available mouse complex (PDB ID: 4Y60) was used to predict human protein–DNA interactions. Notably, DeepBind [9], a sequence-based prediction tool, also failed to distinguish strong from weak binders when using the mouse POU3F1 protein (AUC score: 0.457), but this was corrected with the human POU3F1 protein (AUC score: 0.956).

      IDEA also fails to fully resolve the binding preference of USF1. A closer examination of the HT-SELEX data reveals a lack of distinction among the sequences, as most sequences, including those with the lowest M-word (binding affinity) scores, contain the DNA-binding E-box sequence CACGTG. Therefore, USF1 represents a challenging example where the experimental data only consists of strong binders with limited variations in binding affinity, which likely results from differences in flanking sequences of the E-box motif.

      Egr1 stands as a peculiar example. Whereas IDEA does not effectively distinguish between the strong and weak binders in the current HT-SELEX dataset, its predictions are consistent with other experimental datasets, including binding affinities measured by k-MITOMI [42] (Figure S8A, B), preferred binding sequences from protein-binding microarray, an earlier HT-SELEX experiment, and bacterial one-hybrid data [43]. Therefore, further investigation of the current HT-SELEX data is needed to reconcile these differences.”

      Comment 7: It is also not clear if IDEA’s prediction for reverse complement sequences is the same for a given sequence. If so, how is this property being modelled? Either this description is lacking or I missed it.

      We thank the reviewer for the insightful comments. Given a target protein-DNA sequence, the IDEA protocol substitutes it into a known protein-DNA complex structure to evaluate the binding free energy, which can be converted into binding affinity. IDEA uses sequence identity to determine whether the forward or reverse strand of the DNA should be replaced. Only the strand most similar to the target sequence is substituted. As a result, the model treats reverse-complement sequences differently. As the orientations of test sequences are specified from 5’ to 3’ in all datasets used in this study (e.g., processed MITOMI, HT-SELEX, and ChIP-seq data), this approach ensures that the target sequences are replaced and evaluated correctly. In cases where sequence orientation is not provided (though this was not an issue in this study), we recommend replacing both the forward and reverse strands with the target sequence separately and evaluating the corresponding protein–DNA binding free energies. Since strong binders are likely to dominate the experimental signals, the higher predicted binding affinity, with stronger binding free energies, should be taken as the model’s final prediction.

      We have added one section to the Methods Section titled Treatment of Complementary DNA Sequences to clarify these modeling details.

      The specific text reads:

      “To replace the DNA sequence in the protein-DNA complex structure with a target sequence, IDEA uses sequence identity to determine whether the target sequence belongs to the forward or reverse strand of the DNA in the protein-DNA structure. The more similar strand is selected and replaced with the target sequence. As the orientations of test sequences are specified from 5’ to 3’ in all datasets used in this study (e.g., processed MITOMI, HT-SELEX, and ChIP-seq data), this approach ensures that the target sequences are replaced and evaluated correctly. In cases where sequence orientation is not provided (though this was not an issue in this study), we recommend replacing both the forward and reverse strands with the target sequence separately and evaluating the corresponding protein–DNA binding free energies. Since strong binders are likely to dominate the experimental signals, the higher predicted binding affinity, with stronger binding free energy, should be taken as the model’s final prediction.”
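
      The strand-selection rule described above can be illustrated with a minimal sketch. It assumes equal-length sequences, resolves ties in favor of the forward strand, and treats lower free energy as stronger binding; the function names and the energy_fn callable are hypothetical stand-ins, not part of the IDEA code base.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def choose_strand(target: str, template_forward: str) -> str:
    """Pick the template strand that is more similar to the target.

    Ties go to the forward strand (an assumption of this sketch).
    """
    template_reverse = reverse_complement(template_forward)
    if identity(target, template_forward) >= identity(target, template_reverse):
        return "forward"
    return "reverse"


def strongest_binding_energy(target: str, energy_fn) -> float:
    """Orientation unknown: score both placements and keep the lower energy.

    energy_fn(target, strand) is a hypothetical callable standing in for the
    free-energy evaluation; lower values are assumed to mean stronger binding.
    """
    return min(energy_fn(target, "forward"), energy_fn(target, "reverse"))
```

      Note that a palindromic motif such as the E-box CACGTG is its own reverse complement, so for such cores only the flanking bases decide which strand is chosen.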

      Comment 8: Page 21 line 403, the E-box core should be CACGTG instead of CACGTC.

      We apologize for our oversight and have corrected the relevant text.

      Comment 9: The citation for DNAproDB is outdated and should be updated (PMID 39494533).

      We thank the reviewer for pointing this out and have updated our citation accordingly.

      Reviewer #3:

      Comment 0: Summary: Protein-DNA interactions and sequence readout represent a challenging and rapidly evolving field of study. Recognizing the complexity of this task, the authors have developed a compact and elegant model. They have applied well-established approaches to address a difficult problem, effectively enhancing the information extracted from sparse contact maps by integrating artificial sequences decoy set and available experimental data. This has resulted in the creation of a practical tool that can be adapted for use with other proteins.

      We appreciate the reviewer’s excellent summary of the paper, and we thank the reviewer for the insightful suggestions and comments.

      Comment 1: Strengths: (1) The authors integrate sparse information with available experimental data to construct a model whose utility extends beyond the limited set of structures used for training. (2) A comprehensive methods section is included, ensuring that the work can be reproduced. Additionally, the authors have shared their model as a GitHub project, reflecting their commitment to transparency of research.

      We appreciate the reviewer’s strong assessment of the strengths of this paper. In addition to sharing our model on GitHub, we have also uploaded the original data and the essential scripts required to reproduce the results presented in the manuscript. We hope this further demonstrates our commitment to transparency and reproducibility.

      Comment 2: Weaknesses: (1) The coarse-graining procedure appears artificial, if not confusing, given that full-atom crystal structures provide more detailed information about residue-residue contacts. While the selection procedure for distance threshold values is explained, the overall motivation for adopting this approach remains unclear. Furthermore, since this model is later employed as an empirical potential for molecular modeling, the use of P and C5 atoms raises concerns, as the interactions in 3SPN are modeled between Cα and the nucleic base, represented by its center of mass rather than P or C5 atoms.

      We appreciate the reviewer’s insightful comments. The selection of P and C5 atoms was based on different relative positions of protein and DNA across various complex structures, each with distinctive protein-DNA structural interfaces. To illustrate this, we selected two representative structures where our algorithm selected C5 and P atoms, respectively: MAX-DNA (PDB ID: 1HLO) and FOXP3 (PDB ID: 7TDW). As shown in Author response image 2, in the case of 1HLO, more C5 atoms are within the cutoff distance of 10 Å from the protein Cα atoms, thus capturing essential contacting interactions. In contrast, 7TDW has more P atoms within this cutoff. Importantly, several P atoms are distributed on the minor groove of the DNA, which were not captured by the C5 atoms. To maximize the inclusion of relevant structural contacts, we employed a filtering scheme that selectively chooses either P or C5 atoms based on their proximity to the protein to enhance the model prediction. We note that while this scheme is helpful, the IDEA predictions remain robust across different atom selections. To assess this robustness, we performed binding affinity predictions using only P atoms on the HT-SELEX dataset across 12 protein families [5]. Our predictions (Author response table 1) show comparable performance to that achieved using our filtering scheme.
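
      One way to picture the filtering scheme is the sketch below. It counts, for each candidate atom type, how many DNA atoms lie within the 10 Å cutoff of any protein Cα atom and keeps the type with the larger count. The exact decision rule and the tie-breaking toward C5 are assumptions made for illustration; coordinates are plain (x, y, z) tuples in Å.

```python
import math


def count_within_cutoff(dna_atoms, ca_atoms, cutoff=10.0):
    """Count DNA atoms within `cutoff` angstroms of at least one C-alpha atom."""
    return sum(
        any(math.dist(atom, ca) <= cutoff for ca in ca_atoms)
        for atom in dna_atoms
    )


def select_dna_atom_type(p_atoms, c5_atoms, ca_atoms, cutoff=10.0):
    """Return 'P' or 'C5', whichever atom type has more atoms near the protein.

    Ties fall back to 'C5' (an arbitrary choice made for this sketch).
    """
    n_p = count_within_cutoff(p_atoms, ca_atoms, cutoff)
    n_c5 = count_within_cutoff(c5_atoms, ca_atoms, cutoff)
    return "P" if n_p > n_c5 else "C5"
```

      In this picture, a structure like 1HLO would return "C5" and one like 7TDW would return "P", matching the counts described above.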

      Author response image 2.

      Comparison between P and C5 atoms in proximity to the protein. 3D structures of MAX–DNA (A) and FOXP3–DNA (B) complexes, where P atoms (red sphere) and C5 atoms (blue sphere) that are within 10 Å of Cα atoms are highlighted.

      When incorporating the trained IDEA energy model into a simulation model, we acknowledge a potential mismatch between the resolution of the data-driven model (one coarse-grained site per nucleotide) and the 3SPN simulation model (three coarse-grained sites per nucleotide). The selection of nucleic base sites for molecular interactions in the 3SPN model follows our previous work [44] and its associated code implementation. While revisiting this part of the manuscript, we identified an inconsistency in the reported results in Figure 5A of our initial version: Specifically, we previously used the protein side-chain atoms, rather than only the Cα atoms, in model training. Retraining the data using the Cα atoms results in reduced prediction performance for the IDEA model (Figure 5A). Nonetheless, incorporating this updated energy model into simulations still yielded high accuracy in the predicted absolute binding free energies (Author response image 3A), demonstrating the robustness of our simulation framework in predicting absolute binding free energies against variations in atom selection during the IDEA model training. Following the reviewer’s suggestion, we also incorporated the IDEA-trained energy model as short-range van der Waals interactions between protein Cα atoms and DNA P atoms. As shown in Author response image 3B, our simulation reveals a slightly improved performance over our original implementation, with higher Pearson and Spearman correlation coefficients and a fitted slope closer to 1.0. This result suggests that a more consistent atom selection scheme between the data-driven and simulation models can improve the overall predictions. Accordingly, we have updated Figure 5 with this improved setup, using the simulation model with short-range vdW interactions implemented between protein Cα atoms and DNA P atoms (Figure 5C), ensuring consistency between the IDEA model and simulation framework.

      Author response table 1.

      Comparison of IDEA performance using two DNA atom selection schemes: the filtering scheme presented in the manuscript (C5 and P atoms) versus using only P atoms. Cases where the two schemes result in different atom selections are highlighted in bold.

      We acknowledge that a gap still exists between the resolution of the data-driven and simulation models. To ensure a completely consistent coarse-grained level between these two models, we will work on implementing the IDEA model output for 1-bead-per-nucleotide DNA simulation models in the future.

      Comment 3: (2) Although the authors use a standard set of metrics to assess model quality and predictive power, some ∆∆G predictions compared to MITOMI-derived ∆∆G values appear nonlinear, which casts doubt on the interpretation of the correlation coefficient.

      Author response image 3.

      Comparison of simulations using different representative atoms. (A) Protein-DNA binding simulation with the IDEA model incorporated as short-range van der Waals interactions between protein Cα atoms and nucleic base sites. (B) Protein-DNA binding simulation with the IDEA model incorporated as short-range van der Waals interactions between protein Cα atoms and DNA P atoms. The predicted free energies are robust to the choice of DNA representative atoms. The predicted binding free energies are presented in physical units, and error bars represent the standard deviation of the mean.

      We thank the reviewer for the insightful comments and agree that the linear fit between our model’s prediction and the experimental data may not be the best measure of performance. The primary utility of the IDEA model is to predict high-affinity DNA-binding sequences for a given DNA-binding protein by assessing the relative binding affinities across different DNA sequences. In this regard, the ranked order of predicted sequence binding affinities serves as a better metric for evaluating the success of this model. To evaluate this, we calculated both Spearman’s rank correlation coefficient, which does not rely on linear correlation, and the Pearson correlation coefficient between our predictions and the experimental results. As shown in Figure 2, our computation shows a Spearman’s rank correlation coefficient of 0.65 for the MAX-based predictions using one MAX-DNA complex (PDB ID: 1HLO), supporting the model’s capability to effectively distinguish strong from weak binders.
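
      The distinction between the two coefficients can be made concrete with a small, dependency-free sketch. The helper names are illustrative only; the Spearman coefficient is computed as the Pearson coefficient of average ranks, which is the standard definition.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

      For a monotonic but nonlinear relationship, such as y = exp(x), the Spearman coefficient is 1 while the Pearson coefficient falls below 1, which is why the rank-based metric is the more appropriate summary when only the ordering of affinities matters.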

      As reflected in Figure 2 of the main text, although our model generally captures the relative binding affinities across different DNA sequences, its predictive accuracy diminishes for low-affinity sequences. This could be due to two limitations of the current modeling framework: (1) The model is residue-based and estimates binding free energy as the additive sum of contributions from individual contacting amino-acid-nucleotide pairs. This assumption does not account for cooperative effects caused by simultaneous changes at multiple nucleotide positions. One potential direction to further improve the model would be to use a finer-grained representation by incorporating more atom types within contacting residues, and to use a many-body potential to better capture cooperative effects from multiple mutations. (2) The model assumes that the target DNA adopts the same binding interface as in the reference crystal structure. However, sequence-dependent DNA shape has been shown to be important in determining protein-DNA binding affinity [1]. To address this limitation, a future direction is to use deep-learning-based methods to incorporate predicted DNA shape or protein-DNA complex structures based on their sequences [2, 3] into our model prediction.
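
      The consequence of the additive assumption in limitation (1) can be seen in a toy sketch. The energy values, the contact list, and the two-residue protein below are invented purely for illustration and are not fitted IDEA parameters.

```python
def binding_energy(contacts, protein_seq, dna_seq, gamma):
    """Additive energy: sum gamma[(aa, nt)] over contacting (i, j) index pairs."""
    return sum(gamma[(protein_seq[i], dna_seq[j])] for i, j in contacts)


# Toy pair energies (hypothetical values, arbitrary units).
gamma = {
    ("R", "G"): -1.0, ("R", "A"): -0.2,
    ("K", "C"): -0.5, ("K", "T"): -0.1,
}
contacts = [(0, 0), (1, 1)]  # residue 0 contacts base 0, residue 1 contacts base 1
protein = "RK"


def ddg(mutant_dna, wild_type_dna="GC"):
    """Change in binding energy of a mutant DNA relative to the wild type."""
    return (binding_energy(contacts, protein, mutant_dna, gamma)
            - binding_energy(contacts, protein, wild_type_dna, gamma))
```

      Under this form, the energy change of the double mutant "AT" equals the sum of the changes for the single mutants "AC" and "GT". Any experimentally observed deviation from that sum is a cooperative effect that a purely additive model cannot represent.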

      To fully evaluate the predictive power of IDEA, we have included Spearman’s rank correlation coefficient for every correlation plot in this manuscript. Across all our analyses, the Spearman’s rank correlation coefficients reveal similar predictive performance as the Pearson correlation coefficients. Additionally, we have included in our discussion the current limitations of our model and potential directions for future improvement.

      We have edited our Discussion Section to include a discussion on the limitations of the current model. Specifically, the added texts are:

      “Although IDEA has proved successful in many examples, it can be improved in several aspects. The model currently assumes the training and testing sequences share the same protein-DNA structure. While double-stranded DNA is generally rigid, recent studies have shown that sequence-dependent DNA shape contributes to their binding specificity [1, 2, 4]. To improve predictive accuracy, one could incorporate predicted DNA shapes or structures into the IDEA training protocol. In addition, the model is residue-based and evaluates the binding free energy as the additive sum of contributions from individual amino-acid-nucleotide contacts. This assumption does not account for cooperative effects that may arise from multiple nucleotide changes. A potential refinement could utilize a finer-grained model that includes more atom types within contacting residues and employs a many-body potential to account for such cooperative effects.”

      Comment 4: (3) The discussion section lacks information about the model’s limitations and a comprehensive comparison with other models. Additionally, differences in model performance across various proteins and their respective predictive powers are not addressed.

      We thank the reviewer for the insightful comments. As discussed in the response to Comment 3, the current structural model has several limitations, which may reduce predictive accuracy for weak DNA binders. We have noted these limitations in the Discussion section.

      To compare the performance of IDEA with state-of-the-art protein-DNA predictive models, we examined the predictive accuracies of two additional popular computational models: ProBound [8] and DeepBind [9]. ProBound has been shown to have a better performance than several earlier predictive protein-DNA models, including JASPAR 2018 [11], HOCOMOCO [12], Jolma et al. [13], and DeepSELEX [14]. To benchmark these models’ performance, we examine each method’s capability to identify strong binders with the HT-SELEX datasets covering 22 proteins from 12 protein families [5]. As suggested by Reviewer 1, we also calculated the PRAUC score, reweighted to account for data imbalance [6], as a complementary metric for evaluating the model performance.

      As shown in Figure S6, Table S1, and Table S2, IDEA ranked second among the three predictive methods. It is important to note that both ProBound and DeepBind were trained on a curated version of the HT-SELEX data [13], which overlaps with the testing data [5]. Compared with them, IDEA was trained only on the given structural and sequence information from a single protein-DNA complex, thus independent of the testing data. In order to assess how IDEA performs when incorporating knowledge from HT-SELEX data, we augmented the training by randomly including half of the HT-SELEX data (see the Methods Section Enhanced Modeling Prediction with SELEX Data). The augmented IDEA model achieved the best performance among all the models. We further benchmarked IDEA using a 10-fold cross-validation on the same HT-SELEX data [5] and found that IDEA outperformed a recent regression model that considers the shape of DNA with different sequences [5]. Overall, IDEA can be used to predict protein-DNA affinities in the absence of known binding sequence data, thereby filling a critical gap when such experimental datasets are unavailable.

      In addition, we compared the performance of IDEA with both general and family-specific knowledge-based energy models. First, we incorporated a knowledge-based generic protein-DNA energy model (DBD-Hunter) learned from the protein-DNA database, reported by Skolnick and coworkers [7], into our prediction protocol. This model assigns interaction energies to different functional groups within each DNA nucleotide (e.g., phosphate (PP), sugar (SU), pyrimidine (PY), and imidazole (IM) groups). For our comparison, we averaged the energy contributions of these groups within each nucleotide and replaced the IDEA-learned energy model with this generic one to test its ability to differentiate strong binders from weak binders in the HT-SELEX dataset [5]. As shown in Figure S6, the IDEA model generally achieves better performance than the generic energy model. Additionally, we compared IDEA with rCLAMPS, a family-specific energy model developed to predict protein-DNA binding specificity in the C2H2 and homeodomain families. As shown in Table S1 and Table S2, IDEA also shows better performance than rCLAMPS in most cases across the C2H2 and homeodomain families, demonstrating that it has better predictive accuracy than both family-specific and generic knowledge-based models.

      We have revised our text to include the comparison between IDEA and other predictive models. Specifically, we revised the text in Section: IDEA Generalizes across Various Protein Families.

      The revised text reads:

      “To examine IDEA’s predictive accuracy across different DNA-binding protein families, we applied it to calculate protein-DNA binding affinities using a comprehensive HT-SELEX dataset [5]. We focused on evaluating the capability of IDEA to distinguish strong binders from weak binders for each protein with an experimentally determined structure. We calculated the probability density distribution of the top and bottom binders identified in the SELEX experiment. A well-separated distribution indicates the successful identification of strong binders by IDEA (Figure 2D and S4). Receiver Operating Characteristic (ROC) and precision-recall analyses were performed to calculate the Area Under the ROC Curve (AUC) and the area under the precision-recall curve (PRAUC) for these predictions. Further details are provided in the Methods Section Evaluation of IDEA Prediction Using HT-SELEX Data. Our analysis shows that IDEA successfully differentiates strong from weak binders for 80% of the 22 proteins across 12 protein families, achieving AUC and balanced PRAUC scores greater than 0.5 (Figure 2E and S5). To benchmark IDEA’s performance against other leading methods, we compared its predictions with several popular models, including the sequence-based predictive models ProBound [8] and DeepBind [9], the family-based energy model rCLAMPS [10], and the knowledge-based energy model DBD-Hunter [7]. IDEA demonstrates performance comparable to these state-of-the-art approaches (Figure S6, Table S1, and Table S2), and incorporating sequence features further improves its prediction accuracy. We also performed 10-fold cross-validation on the binding affinities of protein–DNA pairs in this dataset and found that IDEA outperforms a recent regression model that considers the shape of DNA with different sequences [5] (Figure S7). Details are provided in Section: Comparison of IDEA predictive performance Using HT-SELEX data.”

      We also added one section Comparison of IDEA predictive performance Using HT-SELEX data in the Appendix to fully explain the comparison between IDEA and other popular models.

      The added texts are:

      “To benchmark the performance of IDEA against state-of-the-art protein-DNA predictive models, we evaluated its ability to recognize strong binders with the HT-SELEX datasets across 22 proteins from 12 families [5]. Specifically, we compare IDEA with two widely used sequence-based models: ProBound [8] and DeepBind [9]. ProBound has demonstrated superior performance over many other predictive protein-DNA models, including JASPAR 2018 [11], HOCOMOCO [12], Jolma et al. [13], and DeepSELEX [14]. To use ProBound, we retrieved the trained binding model for each protein from motifcentral.org and used the GitHub implementation of ProBoundTools to infer the binding scores between protein and target DNA sequences. Except for POU3F1, binding models are available for all proteins. Therefore, we excluded POU3F1 and evaluated the protein-DNA binding affinities for the remaining 21 proteins. To use DeepBind, sequence-specific binding affinities were predicted directly with its web server. The Area Under the Curve (AUC) and the Precision-Recall AUC (PRAUC) scores were used as metrics for comparison. An AUC score of 1.0 indicates a perfect separation between the strong- and weak-binder distributions, while an AUC score of 0.5 indicates no separation. Because there is a significant imbalance in the number of strong and weak binders from the experimental data [5], where the strong binders are far fewer than the weak binders, we reweighted the samples to achieve a balanced evaluation, using 0.5 as the baseline for randomized prediction [6]. As summarized in Figure S6, Table S1, and Table S2, IDEA ranked second among the three predictive models. In order to assess the performance of IDEA when augmented with additional protein-DNA binding data, we augmented IDEA using randomly selected half of the HT-SELEX data (see the Methods Section Enhanced Modeling Prediction with SELEX Data). The augmented IDEA model achieved the best performance among all the models.”
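
      The two metrics can be sketched as follows. The AUC is computed from pairwise comparisons of strong and weak binders. For the precision-recall score, this sketch assumes that reweighting means giving each class the same total weight, which makes the random baseline 0.5 regardless of class sizes; the exact scheme of [6] may differ. Scores are assumed to increase with predicted binding strength, so predicted free energies would need to be negated first.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the fraction of (strong, weak) pairs ranked correctly.

    Ties between a strong and a weak binder count as half a correct pair.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def balanced_average_precision(pos_scores, neg_scores):
    """Average precision with each class reweighted to total weight 1.

    Assumption of this sketch: class weights are 1/n_pos and 1/n_neg.
    Tied scores are ordered with strong binders first (slightly optimistic).
    """
    w_pos = 1.0 / len(pos_scores)
    w_neg = 1.0 / len(neg_scores)
    labeled = [(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores]
    labeled.sort(key=lambda item: -item[0])  # stable sort, highest score first
    tp = fp = 0.0
    average_precision = 0.0
    for _, is_positive in labeled:
        if is_positive:
            tp += w_pos
            average_precision += w_pos * tp / (tp + fp)
        else:
            fp += w_neg
    return average_precision
```

      With these definitions, a perfect ranking gives 1.0 for both metrics, and a ranking no better than chance is expected to score about 0.5 on both even when weak binders greatly outnumber strong ones.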

      “In addition, we compared the performance of IDEA with both general and family-specific knowledge-based energy models. First, we incorporated a knowledge-based generic protein-DNA energy model (DBD-Hunter) learned from the protein-DNA database, reported by Skolnick and coworkers [7], into our prediction protocol. This model assigns interaction energies to different functional groups within each DNA nucleotide, including phosphate (PP), sugar (SU), pyrimidine (PY), and imidazole (IM) groups. For our comparison, we averaged the energy contributions of these groups within each nucleotide and replaced the IDEA-learned energy model with the DBD-Hunter model to assess its ability to differentiate strong binders from weak binders in the HT-SELEX dataset [5]. Additionally, we compared IDEA with rCLAMPS, a family-specific energy model developed to predict protein-DNA binding specificity in the C2H2 and homeodomain families. rCLAMPS learns a position-dependent amino-acid-nucleotide interaction energy model. To incorporate this model into the binding free energy calculation, we averaged the energy contributions across all occurrences of each amino-acid-nucleotide pair, which resulted in a 20-by-4 residue-type-specific energy matrix. This matrix is structurally analogous to the IDEA-trained energy model and can be directly integrated into the binding free energy calculations. As shown in Figure S6, Table S1, and Table S2, the IDEA model generally outperforms DBD-Hunter and rCLAMPS, demonstrating that it can achieve better predictive accuracy than both generic and family-specific knowledge-based models.”
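
      The averaging step that collapses a position-dependent model into a 20-by-4 matrix can be sketched as below. The input format (one tuple per position-specific entry) and the choice to fill unobserved pairs with 0.0 are assumptions of this illustration, not a description of the rCLAMPS output format.

```python
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
NUCLEOTIDES = "ACGT"


def average_to_residue_matrix(position_energies):
    """Average position-specific energies into a 20-by-4 matrix.

    position_energies: iterable of (amino_acid, nucleotide, energy) tuples,
    one per position-specific occurrence of that pair.
    Pairs that never occur are set to 0.0 (an assumption of this sketch).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for aa, nt, energy in position_energies:
        totals[(aa, nt)] += energy
        counts[(aa, nt)] += 1
    return [
        [totals[(aa, nt)] / counts[(aa, nt)] if counts[(aa, nt)] else 0.0
         for nt in NUCLEOTIDES]
        for aa in AMINO_ACIDS
    ]
```

      The resulting nested list has the same row and column layout as a residue-type energy matrix, so it can be swapped in wherever such a matrix is looked up by amino acid and nucleotide type.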

      “We also performed 10-fold cross-validation using the same HT-SELEX datasets, following the protocol described in the Methods Section Enhanced Modeling Prediction with SELEX Data. For each protein, we divided the entire dataset into 10 equal, randomly assigned folds. In each iteration, we used 9 of the 10 folds as the training dataset and the remaining fold as the testing dataset. This process was repeated 10 times so that each fold served as the test set once. We then reported the average R² scores across these iterations to evaluate IDEA’s predictive performance. Our results are compared with the 1mer and 1mer+shape methods from [5], the latest regression model that considers the shape of DNA with different sequences (Figure S7). This comparative analysis shows that IDEA achieved higher predictive accuracy than the state-of-the-art sequence-based protein-DNA binding predictors for protein-DNA complexes that have available experimentally resolved structures.”
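
      The fold construction can be sketched in a few lines. The shuffling seed, the function names, and the reading of the reported score as the coefficient of determination are assumptions made for illustration; model fitting itself is left out.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists so that each fold is the test set once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Strided slicing gives k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for f, fold in enumerate(folds) if f != i for idx in fold]
        yield train, test


def r2_score(y_true, y_pred):
    """Coefficient of determination between measured and predicted values."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

      A typical use would fit the model on each train split, call r2_score on the corresponding test split, and average the 10 resulting values.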

      “Overall, these results demonstrate that IDEA can be used to predict the protein-DNA pairs in the absence of known binding sequence data, thus filling an important gap in protein-DNA predictions when experimental binding sequence data are unavailable.”

      We also greatly appreciate the reviewer’s suggestion to examine the model’s performance across different proteins. To do this, we first evaluated the dependence of IDEA prediction on the availability of experimental structures similar to the target protein-DNA complexes. To quantitatively assess similarities and differences among the IDEA-derived energy models, we flattened each normalized energy model into an 80-dimensional vector and performed principal component analysis (PCA). As shown in Author response image 1 and Figure S11, energy models optimized from the same protein family fall within the same cluster, while those from different protein families exhibit distinct patterns. Moreover, the relative distance between energy models in PCA space reflects the degree of transferability. For example, PHO4 (PDB ID: 1A0A) is positioned close to MAX (PDB ID: 1HLO), whereas USF1 (PDB ID: 1AN4) and TCF4 (PDB ID: 6OD3) are farther away. This is consistent with the results shown in Figure 3A, where the energy model trained from PHO4 has better transferability than those from the other two systems. Therefore, the availability of experimental structures from protein-DNA complexes more similar to the target can lead to better predictive performance.
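
      The flatten-then-project idea can be sketched without external libraries. This illustration extracts only the first principal component by power iteration, skips the normalization step (whose details are not given here), and uses hypothetical function names; a full analysis would normally rely on a standard PCA implementation.

```python
import math
import random


def flatten(matrix):
    """Flatten a 20-by-4 energy matrix (list of rows) into an 80-vector."""
    return [value for row in matrix for value in row]


def first_principal_component(vectors, n_iter=100, seed=0):
    """Return (direction, scores) for the first principal component.

    Power iteration on the mean-centered data; the sign of the direction
    is arbitrary, as in any PCA.
    """
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]

    rng = random.Random(seed)
    w = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    for _ in range(n_iter):
        proj = [sum(x[j] * w[j] for j in range(d)) for x in centered]
        new_w = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(c * c for c in new_w))
        if norm == 0.0:  # all models identical: no direction of variance
            break
        w = [c / norm for c in new_w]
    scores = [sum(x[j] * w[j] for j in range(d)) for x in centered]
    return w, scores
```

      Energy models whose scores lie close together along the leading components would, under the interpretation given above, be expected to transfer well to one another.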

      We also examine cases in which the IDEA model failed to show strong discriminative power for protein-DNA complexes in the HT-SELEX datasets [5] (Figures 2E and S5). When evaluating the model’s ability to distinguish between strong and weak binders, we used the available experimental structure most similar to the protein employed in the HT-SELEX experiments. In some instances, only the structure of the same protein from a different organism is available. For example, the HT-SELEX data for PDX1-DNA used the human PDX1 protein, but no human PDX1–DNA complex structure is available. Therefore, we used the mouse PDX1–DNA complex (PDB ID: 2H1K) for model training. The differences between species may limit the predictive accuracy of the model. A similar limitation applies to POU3F1, where an available mouse complex (PDB ID: 4Y60) was used to predict human protein–DNA interactions. Notably, DeepBind [9], a sequence-based prediction tool, also failed to distinguish strong from weak binders when using the mouse POU3F1 protein (AUC score: 0.457), but this was corrected with the human POU3F1 protein (AUC score: 0.956).

      We also examined the remaining cases where IDEA did not show a clear distinction between strong and weak binders: USF1, Egr1, and PROX1. For PROX1, we initially used the structure of a protein-DNA complex (PDB ID: 4Y60) in training. However, upon closer inspection, we discovered that this structure does not include the PROX1 protein, but SOX-18, a different transcription factor. This explains the inaccurate prediction made by IDEA. Since no experimental PROX1-DNA complex structure is currently available, we have removed this case from our HT-SELEX evaluation.

      IDEA also fails to fully resolve the binding preference of USF1. A closer examination of the HT-SELEX data reveals a lack of distinction among the sequences, as most sequences, including those with the lowest M-word (binding affinity) scores, contain the DNA-binding E-box sequence CACGTG. Therefore, USF1 represents a challenging example where the experimental data only consists of strong binders with limited variations in binding affinity, which likely results from differences in flanking sequences of the E-box motif.
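
      The kind of check described above, asking how many sequences in a dataset contain the E-box, can be sketched in a few lines. The sequences below are made up for illustration and the function names are hypothetical; this is not the authors' script. Because CACGTG is its own reverse complement, scanning one strand is sufficient for this motif, but the sketch also checks the reverse complement so that it works for non-palindromic motifs.

```python
EBOX = "CACGTG"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written with A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def fraction_with_motif(sequences, motif=EBOX):
    """Fraction of sequences containing the motif on either strand."""
    if not sequences:
        raise ValueError("need at least one sequence")
    motif = motif.upper()
    rc_motif = reverse_complement(motif)
    hits = sum(
        1 for seq in sequences
        if motif in seq.upper() or rc_motif in seq.upper()
    )
    return hits / len(sequences)


# Hypothetical reads: two contain the E-box, two do not.
reads = ["AACACGTGTT", "GGGGGGGGGG", "ttcacgtgaa", "ACGTACGTAC"]
print(fraction_with_motif(reads))
```

      If this fraction is close to 1 even among the lowest-scoring sequences, the dataset contains few true weak binders, which is the situation the authors describe for USF1.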

      Egr1 stands as a peculiar example. Whereas IDEA does not effectively distinguish between the strong and weak binders in the current HT-SELEX dataset, its predictions are consistent with other experimental datasets, including binding affinities measured by k-MITOMI [42] (Figure S8A, B), preferred binding sequences from protein-binding microarray, an earlier HT-SELEX experiment, and bacterial one-hybrid data [43]. Therefore, further investigation of the current HT-SELEX data is needed to reconcile these differences.

      In summary, IDEA’s predictive performance depends on the availability of experimental structures closely related to the target protein-DNA complexes, both in terms of protein sequences and model organisms.

      We have included additional text in Section: IDEA Demonstrates Transferability across Proteins in the Same CATH Superfamily to discuss the PCA analysis and the dependence of the model’s transferability on the similarity among the learned energy models.

      The revised text now reads:

      “The transferability of IDEA within the same CATH superfamily can be understood from the similarities in protein-DNA binding interfaces, which determine similar learned energy models. For example, the PHO4 protein (PDB ID: 1A0A) shares a highly similar DNA-binding interface with the MAX protein (PDB ID: 1HLO) (Figure 3B), despite sharing only a 33.41% probability of being homologous. Consequently, the energy model derived from the PHO4-DNA complex (Figure 3C) exhibits a similar amino-acid-nucleotide interactive pattern as that learned from the MAX-DNA complex (Figure 2B). To further evaluate the similarity between the learned energy models and their connection to protein families, we performed principal component analysis (PCA) on the normalized energy models across 24 proteins from 12 protein families [5]. Our analysis (Figure S11) reveals that most of the energy models from the same protein family fall within the same cluster, while those from different protein families exhibit distinct patterns. Moreover, the relative distance between energy models in PCA space reflects the degree of transferability between them. For example, PHO4 (PDB ID: 1A0A) is positioned close to MAX (PDB ID: 1HLO), whereas USF1 (PDB ID: 1AN4) and TCF4 (PDB ID: 6OD3) are farther away. This is consistent with the results in Figure 3A, where the energy model trained on PHO4 has better transferability than those trained on USF1 or TCF4.”

      We have also added an Appendix section titled Analysis of examples where IDEA fails to recognize strong DNA binders to discuss the examples in which IDEA did not perform well:

      “We examine IDEA’s capability in identifying strong binders from the HT-SELEX dataset across 12 protein families [5]. The model successfully predicts 18 out of 22 protein-DNA systems, but the performance is reduced in 4 cases. Closer investigations revealed the source of these limitations. In some instances, only the protein from a different organism is available. For example, the PDX1 HT-SELEX data utilized the human PDX1 protein, but no human PDX1–DNA complex structure is available. Therefore, the mouse PDX1–DNA complex structure (PDB ID: 2H1K) was used for model training. Differences between model organisms may reduce predictive accuracy. A similar limitation applies to POU3F1, where an available mouse complex (PDB ID: 4Y60) was used to predict human protein–DNA interactions. Notably, DeepBind [9], a sequence-based prediction tool, also failed to distinguish strong from weak binders when using the mouse POU3F1 protein (AUC score: 0.457), but this was corrected with the human POU3F1 protein (AUC score: 0.956).

      IDEA also fails to fully resolve the binding preference of USF1. A closer examination of the HT-SELEX data reveals a lack of distinction among the sequences, as most sequences, including those with the lowest M-word (binding affinity) scores, contain the DNA-binding E-box sequence CACGTG. Therefore, USF1 represents a challenging example where the experimental data only consists of strong binders with limited variations in binding affinity, which likely results from differences in flanking sequences of the E-box motif.

      Egr1 stands as a peculiar example. Whereas IDEA does not effectively distinguish between the strong and weak binders in the current HT-SELEX dataset, its predictions are consistent with other experimental datasets, including binding affinities measured by k-MITOMI [42] (Figure S8A, B), preferred binding sequences from protein-binding microarray, an earlier HT-SELEX experiment, and bacterial one-hybrid data [43]. Therefore, further investigation of the current HT-SELEX data is needed to reconcile these differences.”

      Comment 5: The authors provide an implementation of their model via GitHub, which is commendable. However, it unexpectedly requires the Modeller suite, despite no details about homology modeling being included in the methods section.

      We thank the reviewer for the helpful comments. We did not use the homology modeling module of Modeller. Instead, we only used a single Python script, buildseq.py, from the Modeller package to extract the protein and DNA sequences from the given PDB structure. We have clarified this in the README file on our GitHub repository.
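
      To make the role of this step concrete, the sketch below extracts per-chain protein and DNA sequences from the SEQRES records of a PDB file using only the Python standard library. It is explicitly not Modeller's buildseq.py and makes no claim about how that script works; the function name is hypothetical, and the sketch assumes a standard fixed-column PDB file in which the chain ID sits in column 12 and residue names start at column 20. Note that SEQRES lists the full construct, which may include residues that are not resolved in the coordinates.

```python
# Three-letter protein codes and two-letter DNA codes mapped to one-letter symbols.
RESIDUE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
    "DA": "A", "DC": "C", "DG": "G", "DT": "T",
}


def sequences_from_seqres(pdb_lines):
    """Return {chain_id: one-letter sequence} built from SEQRES records.

    Residue names that are not in the table (for example modified residues)
    are written as 'X'.
    """
    residues_by_chain = {}
    for line in pdb_lines:
        if not line.startswith("SEQRES") or len(line) <= 19:
            continue
        chain_id = line[11]            # column 12 in the PDB format
        names = line[19:].split()      # residue names start at column 20
        residues_by_chain.setdefault(chain_id, []).extend(names)
    return {
        chain_id: "".join(RESIDUE_TO_ONE.get(name, "X") for name in names)
        for chain_id, names in residues_by_chain.items()
    }


# Hypothetical two-chain fragment: one short peptide and one short DNA strand.
example = [
    "SEQRES   1 A    3  MET LYS ARG",
    "SEQRES   1 B    4   DC  DA  DC  DG",
]
print(sequences_from_seqres(example))
```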

      Comment 6: While the manuscript is written in clear and accessible English, some sentences are quite long and could benefit from rephrasing (e.g., lines 49-52).

      Thank you for the helpful suggestion. We agree that the original sentence was overly long and have revised it by splitting it into two for improved clarity and readability.

      The revised version reads:

      “The very robustness of evolution [46, 47, 48, 49] provides an opportunity to extract the sequence-structure relationships embedded in existing complexes. Guided by this principle, we can learn an interpretable binding energy landscape that governs the recognition processes of DNA-binding proteins.”

      Comment 7: In line 82, the citations appear out of place, as the context seems to suggest the use of the newly developed model.

      Thank you for this insightful suggestion. We have rephrased the sentence to better connect with the context of this section.

      The revised text now reads:

      “Finally, the learned energy model can be incorporated into a simulation framework to explore the dynamics of DNA-binding processes, revealing mechanistic insights into various DNA-templated processes.”

      Comment 8: Line 143 “different structure from the bHLH TFs and thus requires a different atom”: This is the first instance in the manuscript where the atom selection for distance thresholding is mentioned, making the text somewhat confusing.

      We thank the reviewer for the insightful comment and agree that the atom selection scheme appears abruptly in this section. To improve clarity, we have moved the detailed atom selection scheme and its rationale to the Methods Section titled Structural Modeling of Protein and DNA.

      Comment 9: Figures: Overall, the figures are visually appealing but could be further improved.

      We appreciate the positive feedback regarding the visual presentation of our figures. Following the reviewer’s suggestions and to further enhance clarity, we have revised several figures to improve labeling, layout, and annotations.

      Comment 10: Figure 1: Consider changing the description “highlighted in blue” to “highlighted in blue on the structure”.

      We have revised the text based on your suggestion.

      Comment 11: Figure 2: Panel B is missing a color bar legend and units, as is the case in Figure 3C. Additionally, the placement of Panel C is unconventional - it appears it should be Panel D. The color scheme for the spheres is not fully described. Panel E: There are too many colors used; consider employing different markers to improve clarity.

      Thank you for the helpful suggestions.

      For Figure 2B and Figure 3C, we would like to clarify that the predicted energies are presented in reduced units due to an undetermined prefactor introduced during the model optimization. This point has now been clarified in the figure captions and is also explained in the Methods section titled Training Protocol.

      Additionally, we have rearranged Panels C and D to improve the figure layout and have fully described the color coding used in the structural representations.

      We have updated it to read:

      “Results for MAX-based predictions. (A) The binding free energies calculated by IDEA, trained using a single MAX–DNA complex (PDB ID: 1HLO), correlate well with experimentally measured MAX–DNA binding free energies [50]. ∆∆G represents the changes in binding free energy relative to that of the wild-type protein–DNA complex. (B) The heatmap, derived from the optimized energy model, illustrates key amino acid–nucleotide interactions governing MAX–DNA recognition, showing pairwise interaction energies between 20 amino acids and the four DNA bases—DA (deoxyadenosine), DT (deoxythymidine), DC (deoxycytidine), and DG (deoxyguanosine). Both the predicted binding free energies and the optimized energy model are expressed in reduced units, as explained in the Methods Section Training Protocol. Each cell represents the optimized energy contribution, where blue indicates more favorable (lower) energy values, and red indicates less favorable (higher) values. (C) The 3D structure of the MAX–DNA complex (zoomed in with different views) highlights key amino acid–nucleotide contacts at the protein–DNA interface. Notably, several DNA deoxycytidines (red spheres) form close contacts with arginines (blue spheres). Additional nucleotide color coding: adenine (yellow spheres), guanine (green spheres), thymine (pink spheres). (D) Probability density distributions of predicted binding free energies for strong (blue) and weak (red) binders of the protein ZBTB7A. The mean of each distribution is marked with a dashed line. (E) Summary of AUC scores for protein–DNA pairs across 12 protein families, calculated based on the predicted probability distributions of binding free energies.”

      We fully agree that Panel E was visually overwhelming. We have revised the plot by using a combination of color and marker shapes to more clearly distinguish between different protein families, as suggested.

      Comment 12: Typos:

      Line 18: Gene expressions → Gene expression?

      Line 28: performed → utilized ?

      We really appreciate the suggestions and have corrected the text accordingly.

      References

      (1) Tianyin Zhou, Ning Shen, Lin Yang, Namiko Abe, John Horton, Richard S Mann, Harmen J Bussemaker, Raluca Gordân, and Remo Rohs. Quantitative modeling of transcription factor binding specificities using DNA shape. Proceedings of the National Academy of Sciences, 112(15):4654–4659, 2015.

      (2) Jinsen Li, Tsu-Pei Chiu, and Remo Rohs. Predicting DNA structure using a deep learning method. Nat Commun, 15(1):1243, February 2024.

      (3) Josh Abramson, Jonas Adler, Jack Dunger, Richard Evans, Tim Green, Alexander Pritzel, Olaf Ronneberger, Lindsay Willmore, Andrew J. Ballard, Joshua Bambrick, Sebastian W. Bodenstein, David A. Evans, Chia-Chun Hung, Michael O’Neill, David Reiman, Kathryn Tunyasuvunakool, Zachary Wu, Akvilė Žemgulytė, Eirini Arvaniti, Charles Beattie, Ottavia Bertolli, Alex Bridgland, Alexey Cherepanov, Miles Congreve, Alexander I. Cowen-Rivers, Andrew Cowie, Michael Figurnov, Fabian B. Fuchs, Hannah Gladman, Rishub Jain, Yousuf A. Khan, Caroline M. R. Low, Kuba Perlin, Anna Potapenko, Pascal Savy, Sukhdeep Singh, Adrian Stecula, Ashok Thillaisundaram, Catherine Tong, Sergei Yakneen, Ellen D. Zhong, Michal Zielinski, Augustin Žídek, Victor Bapst, Pushmeet Kohli, Max Jaderberg, Demis Hassabis, and John M. Jumper. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, pages 1–3, May 2024.

      (4) Raktim Mitra, Jinsen Li, Jared M. Sagendorf, Yibei Jiang, Ari S. Cohen, Tsu-Pei Chiu, Cameron J. Glasscock, and Remo Rohs. Geometric deep learning of protein–DNA binding specificity. Nat Methods, 21(9):1674–1683, September 2024.

      (5) Lin Yang, Yaron Orenstein, Arttu Jolma, Yimeng Yin, Jussi Taipale, Ron Shamir, and Remo Rohs. Transcription factor family-specific DNA shape readout revealed by quantitative specificity models. Mol Syst Biol, 13(2):910, February 2017.

      (6) Takaya Saito and Marc Rehmsmeier. The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLoS ONE, 10(3):e0118432, March 2015.

      (7) Mu Gao and Jeffrey Skolnick. DBD-Hunter: a knowledge-based method for the prediction of DNA-protein interactions. Nucleic Acids Res, 36(12):3978–3992, July 2008.

      (8) H. Tomas Rube, Chaitanya Rastogi, Siqian Feng, Judith F. Kribelbauer, Allyson Li, Basheer Becerra, Lucas A. N. Melo, Bach Viet Do, Xiaoting Li, Hammaad H. Adam, Neel H. Shah, Richard S. Mann, and Harmen J. Bussemaker. Prediction of protein–ligand binding affinity from sequencing data with interpretable machine learning. Nat Biotechnol, 40(10):1520–1527, October 2022.

      (9) Babak Alipanahi, Andrew Delong, Matthew T Weirauch, and Brendan J Frey. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol, 33(8):831–838, August 2015.

      (10) Joshua L. Wetzel, Kaiqian Zhang, and Mona Singh. Learning probabilistic protein-DNA recognition codes from DNA-binding specificities using structural mappings. Genome Res, 32(9):1776–1786, September 2022.

      (11) Aziz Khan, Oriol Fornes, Arnaud Stigliani, Marius Gheorghe, Jaime A Castro-Mondragon, Robin van der Lee, Adrien Bessy, Jeanne Chèneby, Shubhada R Kulkarni, Ge Tan, Damir Baranasic, David J Arenillas, Albin Sandelin, Klaas Vandepoele, Boris Lenhard, Benoît Ballester, Wyeth W Wasserman, François Parcy, and Anthony Mathelier. JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Research, 46(D1):D260–D266, January 2018.

      (12) Ivan V. Kulakovskiy, Ilya E. Vorontsov, Ivan S. Yevshin, Ruslan N. Sharipov, Alla D. Fedorova, Eugene I. Rumynskiy, Yulia A. Medvedeva, Arturo Magana-Mora, Vladimir B. Bajic, Dmitry A. Papatsenko, Fedor A. Kolpakov, and Vsevolod J. Makeev. HOCOMOCO: towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP-Seq analysis. Nucleic Acids Res, 46(D1):D252–D259, January 2018.

      (13) Arttu Jolma, Jian Yan, Thomas Whitington, Jarkko Toivonen, Kazuhiro R. Nitta, Pasi Rastas, Ekaterina Morgunova, Martin Enge, Mikko Taipale, Gonghong Wei, Kimmo Palin, Juan M. Vaquerizas, Renaud Vincentelli, Nicholas M. Luscombe, Timothy R. Hughes, Patrick Lemaire, Esko Ukkonen, Teemu Kivioja, and Jussi Taipale. DNA-Binding Specificities of Human Transcription Factors. Cell, 152(1-2):327–339, January 2013.

      (14) Maor Asif and Yaron Orenstein. DeepSELEX: inferring DNA-binding preferences from HT-SELEX data using multi-class CNNs. Bioinformatics, 36(Supplement 2):i634–i642, December 2020.

      (15) Oriol Fornes, Alberto Meseguer, Joachim Aguirre-Plans, Patrick Gohl, Patricia M Bota, Ruben Molina-Fernandez, Jaume Bonet, Altair Chinchilla-Hernández, Ferran Pegenaute, Oriol Gallego, Narcis Fernandez-Fuentes, and Baldo Oliva. Structure-based learning to predict and model protein–DNA interactions and transcription-factor co-operativity in cis-regulatory elements. NAR Genomics and Bioinformatics, 6(2):lqae068, April 2024.

      (16) Sofia Aizenshtein-Gazit and Yaron Orenstein. DeepZF: improved DNA-binding prediction of C2H2-zinc-finger proteins by deep transfer learning. Bioinformatics, 38(Suppl 2):ii62–ii67, September 2022.

      (17) Stephen K Burley, Charmi Bhikadiya, Chunxiao Bi, Sebastian Bittrich, Henry Chao, Li Chen, Paul A Craig, Gregg V Crichlow, Kenneth Dalenberg, Jose M Duarte, Shuchismita Dutta, Maryam Fayazi, Zukang Feng, Justin W Flatt, Sai Ganesan, Sutapa Ghosh, David S Goodsell, Rachel Kramer Green, Vladimir Guranovic, Jeremy Henry, Brian P Hudson, Igor Khokhriakov, Catherine L Lawson, Yuhe Liang, Robert Lowe, Ezra Peisach, Irina Persikova, Dennis W Piehl, Yana Rose, Andrej Sali, Joan Segura, Monica Sekharan, Chenghua Shao, Brinda Vallat, Maria Voigt, Ben Webb, John D Westbrook, Shamara Whetstone, Jasmine Y Young, Arthur Zalevsky, and Christine Zardecki. RCSB Protein Data Bank (RCSB.org): delivery of experimentally-determined PDB structures alongside one million computed structure models of proteins from artificial intelligence/machine learning. Nucleic Acids Research, 51(D1):D488–D508, November 2022.

      (18) Raktim Mitra, Ari S. Cohen, Jared M. Sagendorf, Helen M. Berman, and Remo Rohs. DNAproDB: an updated database for the automated and interactive analysis of protein-DNA complexes. Nucleic Acids Res, 53(D1):D396–D402, January 2025.

      (19) Natalia Petrenko, Yi Jin, Liguo Dong, Koon Ho Wong, and Kevin Struhl. Requirements for RNA polymerase II preinitiation complex formation in vivo. eLife, 8:e43654, January 2019.

      (20) Rudolf Jaenisch and Adrian Bird. Epigenetic regulation of gene expression: how the genome integrates intrinsic and environmental signals. Nat Genet, 33(3):245–254, March 2003.

      (21) Claire Marchal, Jiao Sima, and David M. Gilbert. Control of DNA replication timing in the 3D genome. Nat Rev Mol Cell Biol, 20(12):721–737, December 2019.

      (22) Lucia A. Hindorff, Praveen Sethupathy, Heather A. Junkins, Erin M. Ramos, Jayashri P. Mehta, Francis S. Collins, and Teri A. Manolio. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proceedings of the National Academy of Sciences, 106(23):9362–9367, June 2009.

      (23) Tuuli Lappalainen, Alexandra J Scott, Margot Brandt, and Ira M Hall. Genomic analysis in the age of human genome sequencing. Cell, 177(1):70–84, 2019.

      (24) Sonali Mukherjee, Michael F. Berger, Ghil Jona, Xun S. Wang, Dale Muzzey, Michael Snyder, Richard A. Young, and Martha L. Bulyk. Rapid analysis of the DNA-binding specificities of transcription factors with DNA microarrays. Nat Genet, 36(12):1331–1339, December 2004.

      (25) Shaoxun Liu, Pilar Gomez-Alcala, Christ Leemans, William J. Glassford, Lucas A. N. Melo, Xiang-Jun Lu, Richard S. Mann, and Harmen J. Bussemaker. Predicting the DNA binding specificity of transcription factor mutants using family-level biophysically interpretable machine learning. bioRxiv, page 2024.01.24.577115, April 2025.

      (26) Tsu-Pei Chiu, Satyanarayan Rao, and Remo Rohs. Physicochemical models of protein–DNA binding with standard and modified base pairs. Proc. Natl. Acad. Sci. U.S.A., 120(4):e2205796120, January 2023.

      (27) Matthew T Weirauch, Atina Cote, Raquel Norel, Matti Annala, Yue Zhao, Todd R Riley, Julio Saez-Rodriguez, Thomas Cokelaer, Anastasia Vedenko, Shaheynoor Talukder, and others. Evaluation of methods for modeling transcription factor sequence specificity. Nature biotechnology, 31(2):126–134, 2013.

      (28) Chaitanya Rastogi, H. Tomas Rube, Judith F. Kribelbauer, Justin Crocker, Ryan E. Loker, Gabriella D. Martini, Oleg Laptenko, William A. Freed-Pastor, Carol Prives, David L. Stern, Richard S. Mann, and Harmen J. Bussemaker. Accurate and sensitive quantification of protein-DNA binding affinity. Proc. Natl. Acad. Sci. U.S.A., 115(16), April 2018.

      (29) Rahmatullah Roche, Bernard Moussad, Md Hossain Shuvo, Sumit Tarafder, and Debswapna Bhattacharya. EquiPNAS: improved protein–nucleic acid binding site prediction using protein-language-model-informed equivariant deep graph neural networks. Nucleic Acids Research, 52(5):e27–e27, March 2024.

      (30) Yufan Liu and Boxue Tian. Protein–DNA binding sites prediction based on pretrained protein language model and contrastive learning. Briefings in Bioinformatics, 25(1):bbad488, November 2023.

      (31) Binh P. Nguyen, Quang H. Nguyen, Giang-Nam Doan-Ngoc, Thanh-Hoang Nguyen-Vo, and Susanto Rahardja. iProDNA-CapsNet: identifying protein-DNA binding residues using capsule neural networks. BMC Bioinformatics, 20(S23):634, December 2019.

      (32) Trevor Siggers and Raluca Gordân. Protein–DNA binding: complexities and multi-protein codes. Nucleic Acids Research, 42(4):2099–2111, February 2014.

      (33) Johannes Söding, Andreas Biegert, and Andrei N. Lupas. The HHpred interactive server for protein homology detection and structure prediction. Nucleic Acids Research, 33(suppl 2):W244–W248, July 2005.

      (34) William Humphrey, Andrew Dalke, and Klaus Schulten. VMD – Visual Molecular Dynamics. Journal of Molecular Graphics, 14:33–38, 1996.

      (35) Arttu Jolma, Teemu Kivioja, Jarkko Toivonen, Lu Cheng, Gonghong Wei, Martin Enge, Mikko Taipale, Juan M Vaquerizas, Jian Yan, Mikko J Sillanpää, and others. Multiplexed massively parallel SELEX for characterization of human transcription factor binding specificities. Genome research, 20(6):861–873, 2010.

      (36) Nobuo Ogawa and Mark D Biggin. High-throughput SELEX determination of DNA sequences bound by transcription factors in vitro. Gene Regulatory Networks: Methods and Protocols, pages 51–63, 2012.

      (37) Alina Isakova, Romain Groux, Michael Imbeault, Pernille Rainer, Daniel Alpern, Riccardo Dainese, Giovanna Ambrosini, Didier Trono, Philipp Bucher, and Bart Deplancke. SMiLE-seq identifies binding motifs of single and dimeric transcription factors. Nature methods, 14(3):316–322, 2017.

      (38) Paul G. Giresi, Jonghwan Kim, Ryan M. McDaniell, Vishwanath R. Iyer, and Jason D. Lieb. FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) isolates active regulatory elements from human chromatin. Genome Res., 17(6):877–885, January 2007.

      (39) Peter J Park. ChIP–seq: advantages and challenges of a maturing technology. Nature reviews genetics, 10(10):669–680, 2009.

      (40) Terrence S. Furey. ChIP–seq and beyond: new and improved methodologies to detect and characterize protein–DNA interactions. Nat Rev Genet, 13(12):840–852, December 2012.

      (41) Anna Bartlett, Ronan C. O’Malley, Shao-shan Carol Huang, Mary Galli, Joseph R. Nery, Andrea Gallavotti, and Joseph R. Ecker. Mapping genome-wide transcription-factor binding sites using DAP-seq. Nat Protoc, 12(8):1659–1672, August 2017.

      (42) Marcel Geertz, David Shore, and Sebastian J Maerkl. Massively parallel measurements of molecular interaction kinetics on a microfluidic platform. Proceedings of the National Academy of Sciences, 109(41):16540–16545, 2012.

      (43) Gary D. Stormo and Yue Zhao. Determining the specificity of protein–DNA interactions. Nat Rev Genet, 11(11):751–760, November 2010.

      (44) Xingcheng Lin, Rachel Leicher, Shixin Liu, and Bin Zhang. Cooperative DNA looping by PRC2 complexes. Nucleic Acids Research, 49(11):6238–6248, June 2021.

      (45) P. L. Privalov, A. I. Dragan, and C. Crane-Robinson. Interpreting protein/DNA interactions: distinguishing specific from non-specific and electrostatic from non-electrostatic components. Nucleic Acids Research, 39(7):2483–2491, April 2011.

      (46) J D Bryngelson and P G Wolynes. Spin glasses and the statistical mechanics of protein folding. Proc. Natl. Acad. Sci. U.S.A., 84(21):7524–7528, November 1987.

      (47) J. N. Onuchic, Z. Luthey-Schulten, and P. G. Wolynes. Theory of protein folding: the energy landscape perspective. Annu Rev Phys Chem, 48:545–600, 1997.

      (48) N. P. Schafer, B. L. Kim, W. Zheng, and P. G. Wolynes. Learning To Fold Proteins Using Energy Landscape Theory. Isr J Chem, 54(8-9):1311–1337, August 2014.

      (49) Wen-Ting Chu, Zhiqiang Yan, Xiakun Chu, Xiliang Zheng, Zuojia Liu, Li Xu, Kun Zhang, and Jin Wang. Physics of biomolecular recognition and conformational dynamics. Rep. Prog. Phys., 84(12):126601, December 2021.

      (50) Sebastian J. Maerkl and Stephen R. Quake. A Systems Approach to Measuring the Binding Energy Landscapes of Transcription Factors. Science, 315(5809):233–237, January 2007.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      This work employs both in vitro and in vivo/transplant methods to investigate the contribution of BDNF/TrkB signaling to enhancing differentiation and dentin-repair capabilities of dental pulp stem cells in the context of exposure to a variety of inflammatory cytokines. A particular emphasis of the approach is the employment of dental pulp stem cells in which BDNF expression has been enhanced using CRISPR technology. Transplantation of such cells is said to improve dentin regeneration in a mouse model of tooth decay.

      The study provides several interesting findings, including demonstrating that exposure to several cytokines/inflammatory agents increases the quantity of (activated) phospho-Trk B in dental pulp stem cells.

      However, a variety of technical issues weaken support for the major conclusions offered by the authors. These technical issues include the following:

      Thank you for your keen observation and evaluation, which helped us significantly improve our manuscript. We have addressed the concerns and comments point by point in detail and substantially revised the manuscript and Figures. We hope that the manuscript is acceptable in its current improved version.

      Detailed response to your comments/concerns is as follows:

      (1) It remains unclear exactly how the cytokines tested affect BDNF/TrkB signaling. For example, in Figure 1C, TNF-alpha increases TrkB and phospho-TrkB immunoreactivity to the same degree, suggesting that the cytokine promotes TrkB abundance without stimulating pathways that activate TrkB, whereas in Figure 2D, TNF-alpha has little effect on the abundance of TrkB, while increasing phospho-TrkB, suggesting that it affects TrkB activation and not TrkB abundance.

      Thank you for raising this concern. We recently demonstrated the effect and interaction of TNF-alpha and Ca2+/calmodulin-dependent protein kinase II on the regulation of the dentino-differentiation of inflammatory hDPSCs via BDNF/TrkB receptor signaling, using a TrkB inhibitor (Ref. below, and Figure 9). Moreover, we agree with your concern: we have re-analyzed our replicates and found a clearer trend and a significant abundance of TrkB as well (please refer to revised Figure 2D).

      Ref.: Kim, Ji Hyun, et al. (2025) "Ca2+/calmodulin-dependent protein kinase II regulates the inflammatory hDPSCs dentino-differentiation via BDNF/TrkB receptor signaling." Frontiers in Cell and Developmental Biology 13: 1558736.

      (2) I find the histological images in Figure 3 to be difficult to interpret. I would have imagined that DAPI nuclear stains would reveal the odontoblast layer, but this is not apparent. An adjacent section labeled with conventional histological stains would be helpful here. Others have described Stro-1 as a stem cell marker that is expressed on a minority of cells associated with vasculature in the dental pulp, but in the images in Figure 3, Stro-1 label is essentially co-distributed with DAPI, in both control and injured teeth, indicating that it is expressed in nearly all cells. Although the authors state that the Stro-1-positive cells are associated with vasculature, I see no evidence that this is true.

      Thank you for your concern. STRO-1 is a mesenchymal stem cell marker also expressed in dental pulp stem cells; both populations are distributed in the pulp. DPSCs can contribute to tissue repair and regeneration in inflamed pulp by differentiating into odontoblasts and forming reparative dentin. Moreover, in carious and inflamed pulp, these cells are disorganized to a degree that depends on the extent of infection/injury. Our purpose here was to point out the presence of DPSCs, which will differentiate into odontoblasts in such a scenario, rather than the vasculature. We have revised Figure 3 by adding magnified images and dotted lines to indicate the boundary between the pulp and dentin.

      Ref. Volponi A. A., Pang Y., Sharpe P. T. Stem cell-based biological tooth repair and regeneration. Trends in Cell Biology. 2010;20(12):715–722.

      (3) The data presented convincingly demonstrate that they have elevated BDNF expression in their dental pulp stem cells using a CRISPR-based approach. I have a number of questions about these findings. Firstly, nowhere in the paper do they describe the nature of the CRISPR plasmid they are transiently transfecting. Some published methods delete segments of the BDNF 3'-UTR while others use an inactivated Cas9 to position an active transactivator to sequences in the BDNF promoter. If it is the latter approach, transient transfection will yield transient increases in BDNF expression. Also, as BDNF employs multiple promoters, it would be helpful to know which promoter sequence is targeted, and finally, knowing the identity of the guide RNAs would allow assessment for the potential of off-target effects. I am guessing that the investigators employ a commercially obtained system from Santa Cruz, but nowhere is this mentioned. Please provide this information.

      Dear Reviewer, yes, you are right. We used a commercially obtained system from Santa Cruz, i.e., BDNF CRISPR Activation Plasmid (h): sc-400029-ACT and UltraCruz® Transfection Reagent (sc-395739), and these are described in the Chemicals and Reagents section of the Materials and Methods as follows.

      “BDNF CRISPR Activation Plasmid (h) is a synergistic activation mediator (SAM) transcription activation system designed to upregulate gene expression specifically. BDNF CRISPR Activation Plasmid (h) consists of three plasmids at a 1:1:1 mass ratio: a plasmid encoding the deactivated Cas9 (dCas9) nuclease (D10A and N863A) fused to the transactivation domain VP64, and a blasticidin resistance gene; a plasmid encoding the MS2-p65-HSF1 fusion protein, and a hygromycin resistance gene; a plasmid encoding a target-specific 20 nt guide RNA fused to two MS2 RNA aptamers, and a puromycin resistance gene.”

      The resulting SAM complex binds to a site-specific region approximately 200-250 nt upstream of the transcriptional start site and provides robust recruitment of transcription factors for highly efficient gene activation.

      Following transfection, gene activation efficiency can be assayed by WB, IF, or IHC using the pro-BDNF Antibody (5H8): sc-65514.

      Author response image 1.

      (4) Another question left unresolved is whether their approach elevated BDNF, proBDNF, or both. Their 28 kDa western blot band apparently represents proBDNF exclusively, with no mature BDNF apparent, yet only mature BDNF effectively activates TrkB receptors. On the other hand, proBDNF preferentially activates p75NTR receptors. The present paper never mentions p75NTR, which is a significant omission, since other investigators have demonstrated that p75NTR controls odontoblast differentiation.

      Dear reviewer, thank you for noticing the error.

      Pro-BDNF is produced as a 32-kDa precursor that undergoes N-glycosylation and glycosulfation on residues located within the pro-domain of the precursor. N-terminal cleavage of the precursor generates mature BDNF as well as a minor truncated form of the precursor (28 kDa) that arises by a different processing mechanism than mature BDNF. The precursor undergoes N-terminal cleavage within the trans-Golgi network and/or immature secretory vesicles to generate mature BDNF (14 kDa).

      We checked our data and band size and found a labeling mistake; thank you for your keen observation in pointing it out. The CRISPR protocol required verification of gene activation by checking pro-BDNF, as mentioned in the methodology. The labeling in the figure has been revised to pro-BDNF, and the actual blot with a ladder is shown below for clarification.

      (5) In any case, no evidence is presented to support the conclusion that the artificially elevated BDNF expression has any effect on the capability of the dental pulp stem cells to promote dentin regeneration. The results shown in Figures 4 and 5 compare dentin regeneration with BDNF-over-expressing stem cells with results lacking any stem cell transplantation. A suitable control is required to allow any conclusion about the benefit of over-expressing BDNF.

      We confirmed the presence of BDNF-overexpressing cells by their higher GFP expression. Moreover, the significant increase in dentin mineralization volume indicates the advantage of BDNF-overexpressing stem cells. We recently published the in vitro effects of BDNF/TrkB on the odontoblastic differentiation of DPSCs, which strongly support our in vivo data. At present, we are not in a position to conduct an additional animal study within a short period of time. We will certainly consider including a positive control in our future studies.

      Ref.: Kim, Ji Hyun, et al. (2025) "Ca 2+/calmodulin-dependent protein kinase II regulates the inflammatory hDPSCs dentino-differentiation via BDNF/TrkB receptor signaling." Frontiers in Cell and Developmental Biology 13: 1558736.

      (6) Whether increased BDNF expression is beneficial or not, the evidence that the BDNF-overexpressing dental pulp stem cells promote dentin regeneration is somewhat weak. The data presented indicate that the cells increase dentin density by only 6%. The text and figure legend disagree on whether the p-value for this effect is 0.05 or 0.01. In either case, nowhere is the value of N for this statistic mentioned, leaving uncertainty about whether the effect is real.

      A significant increment in the dentin mineralization volume by about 7.76% indicates the advantage of BDNF-over-expressing stem cells, and we believe this could be a breakthrough to advance stem cell engineering and therapy further to get this percentage higher in the future. The text in the result section shows that the p-value for this effect is 0.05. While N was 3 previously, we analyzed two more samples by CT scan and revised results, taking N = 5, which improved the results a little more to about 8.53%. Thank you for noticing; the figure legend has been corrected to 0.05.

      Similarly, our in vitro data in the current study support the notion that BDNF overexpression contributes to mineralization and odontoblastic differentiation. We recently published that BDNF/TrkB significantly enhances calcium deposits and mineralization, using a battery of in vitro experiments.

      Ref.: Kim, Ji Hyun, et al. (2025) "Ca 2+/calmodulin-dependent protein kinase II regulates the inflammatory hDPSCs dentino-differentiation via BDNF/TrkB receptor signaling." Frontiers in Cell and Developmental Biology 13: 1558736.

      (7) The final set of experiments applies transcriptomic analysis to address the mechanisms mediating functional differences in dental pulp stem cell behavior. Unfortunately, while the Abstract indicates "we conducted transcriptomic profiling of TNFα-treated DPSCs, both with and without TrkB antagonist CTX-B", that does not describe the experiment described, which compared the transcriptome of control cells with cells simultaneously exposed to TNF-alpha and CTX-B. Since CTX-B blocks the functional response of cells to TNF-alpha, I don't understand how any useful interpretation can be attached to the data without controls for the effect of TNF alone and CTX-B alone.

      Dear reviewer, yes, we performed the treatments both alone and in combination. Earlier, we showed only the combined results and mentioned the interaction between TNFα and TrkB. We have now included the results for TNFα alone and for TNFα combined with CTX-B for better comparison (please refer to Figure 8). Figure 8C1 clearly shows the reversal of certain factors upon treatment with the TrkB inhibitor, compared to Figure 8C for the group treated with TNFα alone.

      Reviewer #2 (Public review):

      Summary:

      In this manuscript, the authors investigate the potential for overexpressing BDNF in dental pulp stem cells to enhance dentin regeneration. They suggest that in the inflammatory environment of injured teeth, there is increased signaling of TrkB in response to elevated levels of inflammatory molecules.

      Strengths:

      The potential application to dentin regeneration is interesting.

      Weaknesses:

      There are a number of concerns with this manuscript to be addressed.

      Thank you for your compliments, keen observation, and evaluation, which helped us significantly improve our manuscript. We have addressed the concerns and comments point by point in detail and substantially revised the manuscript and Figures. We hope that the manuscript is acceptable in its current, improved version.

      Detailed response to your comments/concerns is as follows:

      (1) Insufficient citation of the literature. There is a vast literature on BDNF-TrkB regulating survival, development, and function of neurons, yet there is only one citation (Zhang et al 2012) which is on Alzheimer's disease.

      More references have been cited accordingly.

      (2) There are several incorrect statements. For example, in the introduction (line 80) TrkA is not a BDNF receptor.

      Thank you for noticing the typo; the sentence has been corrected.

      (3) Most important - Specific antibodies must be identified by their RRID numbers. To state that "Various antibodies were procured:... from BioLegend" is unacceptable, and calls into question the entire analysis. Specifically, their Western blot in Figure 4B indicates a band at 28 kDa that they say is BDNF, however the size of BDNF is 14 kDa, and the size of proBDNF is 32 and 37 kDa, therefore it is not clear what they are indicating at 28 kDa. The validation is critical to their analysis of BDNF-expressing cells.

      Dear reviewer, thank you for your kind concern. Sorry for the inconvenience; we have added RRID numbers of antibodies.

      Pro-BDNF is produced as a 32-kDa precursor that undergoes N-glycosylation and glycosulfation on residues located within the pro-domain of the precursor. N-terminal cleavage of the precursor generates mature BDNF as well as a minor truncated form of the precursor (28 kDa) that arises by a different processing mechanism than mature BDNF. The precursor undergoes N-terminal cleavage within the trans-Golgi network and/or immature secretory vesicles to generate mature BDNF (14 kDa).

      We checked our data and band size and found that we had misread the ladder. The band shown is actually at 14 kDa. The labeling has been revised in the figure, and the actual blot with a ladder is shown below for clarification. In addition, our data indicate that the observed cellular effects are more consistent with BDNF/TrkB-mediated pathways, which are known to promote survival and differentiation.

      (4) Figure 2 indicates increased expression of TrkB and TrkA, as well as their phosphorylated forms in response to inflammatory stimuli. Do these treatments elicit increased secretion of the ligands for these receptors, BDNF and NGF, respectively, to activate their phosphorylation? Or are they suggesting that the inflammatory molecules directly activate the Trk receptors? If so, further validation is necessary to demonstrate that.

      Thank you for your kind concern. TNF-α increases the number of TrkB receptors. The enhanced TrkB activation may result from a greater number of receptors and/or increased activation of individual receptors. In either case, inflammatory agents enhance the TrkB receptor signaling pathway.

      Recently, we have demonstrated the effect and interaction of TNF-alpha and Ca2+/calmodulin-dependent protein kinase II on the regulation of the inflammatory hDPSCs dentino-differentiation via BDNF/TrkB receptor signaling using TrkB inhibitor (Ref. below, and Figure 9). For now, we have added figure 9 for the proposed mechanism of action based on our recent and current study.

      Ref.: Kim, Ji Hyun, et al. (2025) "Ca 2+/calmodulin-dependent protein kinase II regulates the inflammatory hDPSCs dentino-differentiation via BDNF/TrkB receptor signaling." Frontiers in Cell and Developmental Biology 13: 1558736.

      (5) Figure 7 - RNA-Seq data, what is the rationale for treatment with TNF+ CTX-B? How does this identify any role for TrkB signaling? They never define their abbreviations, but if CTX-B refers to cholera toxin subunit B, which is what it usually refers to, then it is certainly not a TrkB antagonist.

      Thank you for your concern. Cyclotraxin-B (CTX-B) is a TrkB antagonist (now stated in the revised manuscript). To identify the underlying mechanism, we sought to locate transcription factors that interact with TrkB/BDNF signaling and lead to differentiation and dentinogenesis. Therefore, we treated the cells with a TrkB inhibitor.

      Earlier, we showed only the combined results and mentioned the interaction between TNFα and TrkB. We have now included the results for TNFα alone and for TNFα combined with CTX-B for better comparison (please refer to Figure 8). Figure 8C1 clearly shows the reversal of certain factors upon treatment with the TrkB inhibitor, compared to Figure 8C for the group treated with TNFα alone. We agree that the precise role of CTX-B in modulating TrkB signaling requires further clarification; we have included this point in the revised discussion and are currently working on this aspect.

      Reviewer #3 (Public review):

      In general, although the authors interpret their results as pointing towards a possible role of BDNF in dentin regeneration, the results are over-interpreted due to the lack of proper controls and focus on TrkB expression, but not its isoforms in inflammatory processes. Surprisingly, the authors do not study the possible role of p75 in this process, which could be one of the mechanisms intervening under inflammatory conditions.

      Thank you for your compliments, keen observation, and evaluation, which helped us significantly improve our manuscript. We have addressed the concerns and comments point by point in detail and substantially revised the manuscript and Figures. We hope that the manuscript is acceptable in its current, improved version.

      Detailed response to your comments/concerns is as follows:

      (1) The authors claim that there are two Trk receptors for BDNF, TrkA and TrkB. To date, I am unaware of any evidence that BDNF binds to TrkA to activate it. It is true that two receptors have been described in the literature, TrkB and p75 or NGFR, but the latter is not TrkA despite its name and capacity to bind NGF along with other neurotrophins. It is crucial for the authors to provide a reference stating that TrkA is a receptor for BDNF or, alternatively, to correct this paragraph.

      Dear reviewer, we apologize for the inconvenience; it was an error. BDNF binds to TrkB, and the sentence has been corrected.

      (2) The authors discuss BDNF/TrkB in inflammation. Is there any possibility of p75 involvement in this process?

      Mature BDNF binds to the high-affinity receptor tyrosine kinase B (TrkB), activating signaling cascades, while pro-BDNF binds to the p75 neurotrophin receptor (p75NTR). So, we don’t think there’s a possibility, as our data shows mature BDNF production. Here, we initially screened the TrkA and TrkB involvement in dentinogenesis and chose to work with BDNF and its receptor TrkB. Future studies can be directed to elucidate its mechanism of action in the context of dentinogenesis.

      (3) The authors present immunofluorescence (IF) images against TrkB and pTrkB in the first figure. While they mention in the materials and methods section that these antibodies were generated for this study, there is no proof of their specificity. It should be noted that most commercial antibodies labeled as anti-TrkB recognize the extracellular domain of all TrkB isoforms. There are indications in the literature that pathological and excitotoxic conditions change the expression levels of TrkB-Fl and TrkB-T1. Therefore, it is necessary to demonstrate which isoform of TrkB the authors are showing as increased under their conditions. Similarly, it is essential to prove that the new anti-p-TrkB antibody is specific to this Trk receptor and, unlike other commercial antibodies, does not act as an anti-phospho-pan-Trk antibody.

      Thank you for your kind concern.

      Human TrkB has 7 isoforms and predicted Mw ranges from 35 to 93kDa. It has 11 potential N-glycosylation sites. The given antibody (isotype: Mouse IgG2a, κ) has been shown to interact with SHC1, PLCG1 and/or PLCG2, SH2B1 and SH2B2, NGFR, SH2D1A, SQSTM1 and KIDINS220, FRS2.

      We also apologize for the misunderstanding and the error in the text. We procured all the antibodies commercially as established products and did not test for any specific isoform. We have provided the details of the antibodies and reagents in the chemicals section of the methodology.

      (4) I believe this initial conclusion could be significantly strengthened, without opening up other interpretations of the results, by demonstrating the specificity of the antibodies via Western blot (WB), both in the presence and absence of BDNF and other neurotrophins, NGF, and NT-3. Additionally, using WB could help reinforce the quantification of fluorescence intensity presented by the authors in Figure 1. It's worth noting that the authors fixed the cells with 4% PFA for 2 hours, which can significantly increase cellular autofluorescence due to the extended fixation time, favoring PFA autofluorescence. They have not performed negative controls without primary antibodies to determine the level of autofluorescence and nonspecific background. Nor have they indicated optimizing the concentration of primary antibodies to find the optimal point where the signal is strong without a significant increase in background. The authors also do not mention using reference markers to normalize specific fluorescence or indicating that they normalized fluorescence intensity against a standard control, which can indeed be done using specific signal quantification techniques in immunocytochemistry with a slide graded in black-and-white intensity controls. From my experience, I recommend caution with interpretations from fluorescence quantification assays without considering the aforementioned controls.

      Thank you for your insightful comments. We have now included a negative control image in the revised Figures. This control confirms that the observed fluorescence signal is specific and not due to autofluorescence or nonspecific background. In our lab, we have been using these antibodies and already optimized the concentration to use in certain cell types. Additionally, we followed the manufacturer’s recommended antibody concentration and protocol throughout our experiments to ensure an optimal signal-to-noise ratio.

      We agree that extended fixation with 4% PFA may increase autofluorescence; however, including negative controls helps account for this effect. We also ensured consistent imaging parameters and applied the same exposure settings across all samples to allow for a valid comparison of fluorescence intensity. We appreciate your emphasis on careful quantification and have clarified these methodological details in the revised Methods section.

      (5) In Figure 2, the authors determine the expression levels of TrkA and TrkB using qPCR. Although they specify the primers used for GAPDH as a control in materials and methods, they do not indicate which primers they used to detect TrkA and TrkB transcripts, which is essential for determining which isoform of these receptors they are detecting under different stimulations. Similarly, I recommend following the MIQE guidelines (Minimum Information for Publication of Quantitative Real-Time PCR experiments), so they should indicate the amplification efficiency of their primers, the use of negative and positive controls to validate both the primer concentration used, and the reaction, the use of several stable reference genes, not just one.

      We appreciate the reviewer’s suggestion regarding the specificity of primers and the amplification efficiency. In response, we have now included the primer sequences used for detecting TrkA and TrkB transcripts in the revised Materials and Methods section (Quantitative real-time PCR analysis of odontogenic differentiation marker gene expression in dental pulp stem cells). This clarifies which isoforms of these receptors were assessed under different conditions. We also acknowledge the importance of following the MIQE guidelines; the primers were supplied by Integrated DNA Technologies with standard desalting purification and guaranteed yield.

      (6) Moreover, the authors claim they are using the same amounts of cDNA for qPCRs since they have quantified the amounts using a Nanodrop. Given that dNTPs are used during cDNA synthesis, and high levels remain after cDNA synthesis from mRNA, it is not possible to accurately measure cDNA levels without first cleaning it from the residual dNTPs. Therefore, I recommend that the authors clarify this point to determine how they actually performed the qPCRs. I also recommend using two other reference genes like 18S and TATA Binding Protein alongside GAPDH, calculating the geometric mean of the three to correctly apply the 2^-ΔΔCt formula.

      Thank you for your kind concern. We agree that residual dNTPs from cDNA synthesis could impact the accuracy of cDNA quantification. To address this, we have used the commercially available and guaranteed kit. The kit used is mentioned in Materials and Methods. We will definitely consider using 18S and TATA Binding Protein alongside GAPDH in our future studies. For now, we request you consider the results generated against GAPDH control.
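      The normalization the reviewer recommends in point (6) can be sketched as follows. This is a minimal illustration only, not the analysis performed in the manuscript: the function name `fold_change` and all Ct values are hypothetical, and the sketch assumes roughly 100% amplification efficiency for every primer pair. Because Ct values lie on a log2 scale, the arithmetic mean of the reference-gene Ct values corresponds to the geometric mean of their relative quantities.

```python
from statistics import mean


def fold_change(ct_target_treated, ct_refs_treated,
                ct_target_control, ct_refs_control):
    """Relative expression by the 2^-ddCt method with several reference genes.

    The arithmetic mean of reference-gene Ct values is used, which is
    equivalent to the geometric mean of their relative quantities
    (assumption: ~100% amplification efficiency for every primer pair).
    """
    d_ct_treated = ct_target_treated - mean(ct_refs_treated)
    d_ct_control = ct_target_control - mean(ct_refs_control)
    dd_ct = d_ct_treated - d_ct_control
    return 2.0 ** (-dd_ct)


# Hypothetical Ct values (not from the manuscript).
# Reference genes, in order: GAPDH, 18S, TBP.
example = fold_change(
    ct_target_treated=22.0, ct_refs_treated=[18.0, 20.0, 25.0],
    ct_target_control=24.0, ct_refs_control=[18.0, 20.0, 25.0],
)
print(f"fold change vs. control: {example:.2f}")
```

      With a single reference gene the same function reduces to the usual GAPDH-only calculation by passing one-element lists.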

      (7) Similarly, given that the newly generated antibodies have not been validated, I recommend introducing appropriate controls for the validation of in-cell Western assays.

      We apologize for the text mistake. Antibodies were procured commercially and not generated. We have corrected the sentence.

      (8) The authors' conclusion that TrkB levels are minimal (Figure 2E) raises questions about what they are actually detecting in the previous experiments might not be the TrkB-Fl form. Therefore, it is essential to demonstrate beyond any doubt that both the antibodies used to detect TrkB and the primers used for qPCR are correct, and in the latter case, specify at which cycle (Ct) the basal detection of TrkB transcripts occurs. Treatment with TNF-alpha for 14 days could lead to increased cell proliferation or differentiation, potentially increasing overall TrkB transcript levels due to the number of cells in culture, not necessarily an increase in TrkB transcripts per cell.

      Thank you for your comments. We appreciate your kind concerns. Here, we are trying to demonstrate that TrkB gets activated in inflammatory conditions. We have also provided the details on primers and antibodies. We have used commercial antibodies and qPCR primers, and they have been extensively validated with previous publications. The efficiency and validation of qPCR primers were provided by a company.

      Moreover, we used the minimal concentration of TNF-alpha twice a week, and before using it, we did preliminary experiments to determine whether it affected any experimental condition.

      (9) Overall, there are reasonable doubts about whether the authors are actually detecting TrkB in the first three images, as well as the phosphorylation levels and localization of this receptor in the cells. For example, in Figure 3 A to J, it is not clear where TrkB is expressed, necessitating better resolution images and a magnified image to show in which cellular structure TrkB is expressed.

      Thank you for your comment. Here, we aimed to show the expression of TrkB receptors in inflamed/infected pulp, especially in minority-distributed DPSCs. TrkB is present on the cell membrane and perinuclear region. We have provided a single-cell (magnified) image in the figure for better clarification.

      (10) In Figure 4, the authors indicate they have generated cells overexpressing BDNF after recombination using CRISPR technology. However, the WB they show in Figure 4B, performed under denaturing conditions, displays a band at approximately 28kDa. This WB is absolutely incorrect with all published data on BDNF detection via this technique. I believe the authors should demonstrate BDNF presence by showing a WB with appropriate controls and BDNF appearing at 14kDa to assume they are indeed detecting BDNF and that the cells are producing and secreting it. What antibodies have been used by the authors to detect BDNF? Have the authors validated it? There are some studies reporting the lack of specificity of certain commercial BDNF antibodies, therefore it is necessary to show that the authors are convincingly detecting BDNF.

      Dear reviewer, thank you for your kind concern. Firstly, we apologize for the inconvenience.

      Pro-BDNF is produced as a 32-kDa precursor that undergoes N-glycosylation and glycosulfation on residues located within the pro-domain of the precursor. N-terminal cleavage of the precursor generates mature BDNF and a minor truncated form of the precursor (28 kDa) that arises by a different processing mechanism than mature BDNF. The precursor undergoes N-terminal cleavage within the trans-Golgi network and/or immature secretory vesicles to generate mature BDNF (14 kDa).

      We checked our data and band size and found that we had misread the ladder. The band shown is actually at 14 kDa. The labeling has been revised in the figure, and the actual blot with a ladder is shown below for clarification. In addition, our data indicate that the observed cellular effects are more consistent with BDNF/TrkB-mediated pathways, which are known to promote survival and differentiation.

      (11) While the RNA sequencing data indicate changes in gene expression in cells treated with TNFalpha+CTX-B compared to control, the authors do not show a direct relationship between these genetic modifications with the rest of their manuscript's argument. I believe the results from these RNA sequencing assays should be put into the context of BDNF and TrkB, indicating which genes in this signaling pathway are or are not regulated, and their importance in this context.

      Thank you for your concern. To identify the underlying mechanism, we sought to locate transcription factors that interact with TrkB/BDNF signaling and lead to differentiation and dentinogenesis. Therefore, we treated the cells with a TrkB inhibitor.

      Earlier, we showed only the combined results and mentioned the interaction between TNFα and TrkB. We have now included the results for TNFα alone and for TNFα combined with CTX-B for better comparison (please refer to Figure 8). Figure 8C1 clearly shows the reversal of certain factors upon treatment with the TrkB inhibitor, compared to Figure 8C for the group treated with TNFα alone. We agree that the precise role of CTX-B in modulating TrkB signaling requires further clarification, and we have included this point in the revised discussion while we continue to work on this aspect. In a parallel study, we are examining the TCF family in more depth, as its members have been documented to interact indirectly with BDNF and TrkB.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Some minor textual issues

      Line 120: It is obvious that TNFα stimulation caused significant phosphorylation of TrkB (p < 0.01) compared to TrkA (p < 0.05).

      Thank you for noticing the typo. The sentence has been corrected.

      The authors should consider rewording this sentence - I do not understand the intended meaning.

      Line 126: pronounced peak at 10 ng/mL. I am not convinced there is a peak. Looks like a plateau to me. To call it a peak one would have to show that the values at 10 ng/ml and 20 ng/ml are statistically different.

      We meant here the peak compared to 0.1 and 1ng/mL concentration and not compared to 20 ng/mL. The sentence has been elaborated accordingly.

      Reviewer #3 (Recommendations for the authors):

      The authors should show how they have validated the specificity of all the used antibodies as well as the efficiency and specificity of their qPCR data.

      We procured commercially available antibodies (all of which have been extensively validated in previous publications) and also performed negative controls (provided in the revised figures). We routinely use Western blotting and validate the antibodies by band size. Primer sequences are also provided in the revised manuscript. We checked qPCR specificity using a standard curve with R² ≥ 0.98 and a single peak in the melting curves. We edited line 263 accordingly.
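      For readers unfamiliar with the standard-curve check mentioned above, the following is a minimal sketch of how R² and amplification efficiency are commonly derived from a dilution series. It is not the authors' code: the function name `standard_curve` and the dilution data are hypothetical, and the efficiency formula E = 10^(-1/slope) - 1 is the conventional definition, assumed here rather than taken from the manuscript.

```python
import math


def standard_curve(log10_quantity, ct):
    """Least-squares fit of Ct against log10(template quantity).

    Returns (slope, r_squared, efficiency), where
    efficiency = 10**(-1/slope) - 1, so 1.0 means perfect doubling per cycle.
    """
    n = len(ct)
    mean_x = sum(log10_quantity) / n
    mean_y = sum(ct) / n
    sxx = sum((x - mean_x) ** 2 for x in log10_quantity)
    syy = sum((y - mean_y) ** 2 for y in ct)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(log10_quantity, ct))
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    efficiency = 10 ** (-1.0 / slope) - 1.0
    return slope, r_squared, efficiency


# Hypothetical 10-fold dilution series with the ideal slope -1/log10(2),
# i.e. about -3.32 cycles per 10-fold dilution (not data from the manuscript).
ideal_slope = -1.0 / math.log10(2.0)
logs = [0.0, 1.0, 2.0, 3.0, 4.0]
cts = [30.0 + ideal_slope * x for x in logs]
slope, r2, eff = standard_curve(logs, cts)
print(f"slope={slope:.3f}, R^2={r2:.4f}, efficiency={eff:.1%}")
```

      A single melting-curve peak addresses product specificity separately; the fit above only speaks to linearity and efficiency.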

      Once again, we thank all of you for your efforts in evaluating our study. It really helped us improve the quality of the manuscript. We hope all the queries have been answered and the revised manuscript is acceptable.

    1. Author Response

      The following is the authors’ response to the original reviews.

      We thank the reviewers for their insightful and constructive comments on our work, which have helped to strengthen the manuscript. In response to the additional suggestions provided by the reviewers, we have made revisions by adding or replacing five main figures and three supplementary figures, refining the text, and clarifying certain conclusions. Detailed responses to the reviewers’ points can be found below.

      Additional experiments, textual changes, or modulation of claims are needed to address weaknesses in the SOD1 portion of the study. Specifically:

      A) These studies require an assessment of the on-target efficacy of the inhibitors at the relevant concentration ranges. Ideally, they should have minimal effects against SOD1 knockout cell lines (an acute challenge at a time point before the growth defects become apparent) and show better efficacy in SOD1-overexpressing lines. Key experiments (changes in superoxide, OCR profiling, DNA alkaline comet assay) would be more convincing if they were carried out with SOD1 knockout lines to compare against the inhibitor effects (3-4 days after introducing sgSOD1 when growth defects are not apparent). In addition, SOD activity should be measured directly following inhibitor treatment.

      We agree with the reviewers that the question of on- versus off-target effects of the pharmacologic SOD1 inhibitors is critical to address. We have validated that SOD activity is reduced following treatment with ATN-224 in Figure 2–Figure supplement 1A.

      Nevertheless, we acknowledge that the potential for off-target effects of these inhibitors cannot be completely ruled out. To address this concern, we have incorporated a discussion regarding the potential off-target effects of both LCS-1 and ATN-224.

      B) Assays should be included to support that SOD1 activity is altered. ATN-224 and LCS-1 are used to inhibit SOD1 function in the majority of the experiments, which should be supported by SOD activity assays to confirm SOD inhibition. Further, the concentration of ATN-224 used in this paper (12.5 uM) is beyond the concentration of what has been reported to inhibit SOD1 function in human blood cells. In Figure 4D, the authors demonstrate comparable SOD1 total protein levels in WT and PPM1D-mutant cells. However, the authors should further address whether PPM1D-mutation alters SOD1 activity via SOD activity assays.

      We thank the reviewers for these suggestions. We have performed SOD activity assays, which confirmed that SOD activity is inhibited upon treatment with ATN-224 at two concentrations (6.25 and 12.5 uM). We performed the same assay on LCS-1-treated cells but, in our hands, did not see reduced SOD activity. However, LCS-1 has been shown to inhibit SOD activity in other publications, including PMID: 21930909 and PMID: 32424294. From these assays, we also found that PPM1D-mutant cells had increased SOD activity at baseline, despite having similar levels of SOD1 protein. These data have been added to Figure 2–Figure supplement 1A.

      C) Some conclusions are not fully supported by the data provided. The authors claimed that "upon inhibition of SOD1, there was an increase in ROS that was specific to the mutant cells" in Figure 2E. Comparison of ROS levels among untreated, ATN-224, and LCS-1 of PPM1D-mutant cells should have been made and the statistics analysis among these groups should have been provided. Moreover, in Figure 2-Figure Supplement 1E, LCS-1 treatment does not increase ROS levels in PPM1D mutant LCLs. Performing these experiments with control and SOD1 deletion cells would have strengthened the results. Along with this point, the authors should comment on why SOD2 is not identified as a top hit in the CRISPR screen, as SOD2 deletion accumulates superoxide in cells.

      After performing additional statistical analyses for Figure 2E, we found that the minor increase in ROS levels in the mutant cells after SOD1 inhibition was not statistically significant. We have revised the text accordingly.

      As for why SOD2 was not identified as a top hit, we postulate that this may be due to inherent dependency of the WT cell lines on SOD2.

      D) Fig. 1 - SOD1 appears to be clustered with several other genes in the volcano plot (including FANC proteins). Did any other ROS-detoxifying enzymes show similar fitness scores? The effects of the SOD1 sgRNA are striking, however, it would be useful to see qPCR or immunoblot data confirming robust depletion.

      Thank you for your suggestion. We have validated the loss of SOD1 protein expression after SOD1 sgRNA deletion by immunoblot and have added this data to Figure 1– figure supplement 1D. While other ROS-detoxifying enzymes were not significantly enriched in the top 37 hits, interestingly, the Fanconi Anemia pathway also has roles in counteracting oxidative stress. FA-deficient cells have mitochondrial dysfunction and redox imbalance, and several of the FA family proteins are implicated in mitophagy. Therefore, there may be an interesting interplay between SOD1 and the FA pathway that is worth highlighting in the discussion of our manuscript even though there was no experimental investigation performed.

      E) Fig. 2 - What are the relative SOD1 levels in the mutant PPM1D vs. WT cell lines? The effects of the chemical inhibitors are stronger in MOLM-13 than in the other two lines. These data could also point to whether LCS-1 and ATN-224 cytotoxicity are on-target or off-target at these concentrations, which is a key issue not currently addressed in these studies. This is a particular concern as the OCI-AML2 line shows a stronger growth defect with CRISPR SOD1 KO (in Fig 1) but the smallest effects with these chemical inhibitors. The authors should also include SOD1 levels for Figure 1D and Figure 4–Figure supplement 1C.

      SOD1 protein expression is similar between WT and PPM1D-mutant cell lines and the loss of SOD1 after SOD1 sgRNA deletion was validated by immunoblot. These data have been added to Figure 1- figure supplement 1D and Figure 4D.

      F) Does SOD1 co-expression in PPM1D-mutant patient AML correspond to poorer disease outcomes? This can be evaluated in publicly available patient datasets and would support the idea of SOD1 synthetic lethality.

      Unfortunately, there are no publicly available patient datasets with sufficient cases of de novo PPM1D-mutant AML to assess this question.

      G) While endogenous mitochondrial superoxide levels are elevated in PPM1D mutant lines, it is entirely unclear why SOD1 inhibition should affect mitochondrial superoxide as it detoxifies cytosolic superoxide. Also unclear why the DCFDA signal (which measures total hydroperoxides) is increased under SOD1 inhibition - SOD1 dismutates superoxide radicals into hydrogen peroxide, therefore unless SOD2 is compensating for SOD1 loss, one might expect hydroperoxides to be lower (unless some entirely different oxidase is increasing their levels). None of these outcomes appear to be considered. Finally, it is not explained how lipid peroxidation, which requires the production of hydroxyl or similarly high-potency radicals, is being caused by increased superoxide or peroxides. One possibility is there is an increase in labile iron, in which case this phenotype would be rescued by the iron chelator desferal, and by the lipophilic antioxidant, ferrostatin.

      We measured intracellular labile iron levels by flow cytometry by staining the cells with FerroOrange at baseline and after SOD1 inhibition with our pharmacologic inhibitors (ATN-224 at 12.5 µM and LCS-1 at 1.25 µM). Across the three leukemia cell lines, we saw variable results in iron levels with no appreciable patterns (see below). Therefore, we cannot make conclusions about the contribution of labile iron to our observed phenotypes.

      Author response image 1.

      H) Do the sgSOD1 cells also show similar increases in MitoSox green, DCFDA, and BODIPY signal? These experiments would clarify whether the effects of the inhibitors are directly related to SOD1 loss or if they represent off-target effects from the inhibitors and/or compensatory changes in SOD2.

      We do not observe changes in SOD2 in the several contexts in which we have examined this. We cannot exclude off-target effects of the inhibitors so have clarified this in the text.

      I) The authors may want to assess whether Rac1 or NADPH oxidase activity is altered in the SOD1 KO in WT vs. PPM1D cells. Their results may be the consequence of compromised ROS-driven survival signaling or DNA repair rather than direct ROS-induced damage, which is not caused directly by superoxide (or hydrogen peroxide).

      We appreciate the reviewer’s recommendations. However, due to time constraints, we regret not being able to assess Rac1 or NADPH oxidase activity. Nevertheless, we recognize the possibility of altered ROS-driven signaling rather than ROS-induced damage as a driver of our phenotype and have incorporated this possibility into our discussion.

      J) Fig. 3 - the effects on mitochondrial respiratory parameters, while statistically significant, do not seem biologically striking. Also, these data are shown for OCI-AML2 cells which show the smallest cytotoxic effects with the SOD1 inhibitors among the 3 lines tested. They do however show the most robust growth defect with sgSOD1. This discrepancy could suggest that mitochondrial dysfunction does not underlie the observed growth defect and/or the inhibitor cytotoxicity is not on-target. Ideally, mitochondrial profiling should also be carried out on this cell line with inducible SOD1 depletion. Have the authors assessed whether the mitochondrial Bcl family proteins are affected by the inhibitors?

      We assessed a few members of the mitochondrial Bcl-family proteins including MCL-1, BCL-2, and BCL-XL during the revision process. PPM1D-mutant cells have mildly increased expression of these anti-apoptotic proteins at baseline and the expression is not altered by pharmacologic SOD1 inhibition (see Author response image 2 below). Due to time constraints, we were unable to perform seahorse assays and mitochondrial profiling in the SOD1-deletion cells.

      Author response image 2.

      K) Fig. 4 - Currently the data in this figure do not support the authors' claim that PPM1D-mutant cells have impaired antioxidant defense mechanisms, leading to an elevation in ROS levels and reliance on SOD1 for protection. It should be noted that oxidative stress specifically refers to adverse cellular effects of increasing ROS, not baseline levels of various redox parameters. Ideally, levels of GSSG/GSH would be a better measure of potential redox stress tolerance than the total antioxidant capacity assay. Finally, oxidative stress can be assessed by challenging the wt and mutant PPM1D cell lines with oxidant stressors such as paraquat which elevates superoxide, or drugs like erastin which elevate mitochondrial ROS. The immunoblot shows negligible changes in the antioxidant proteins assayed. Again, this blot should include SOD2 which is the most relevant antioxidant in the context of mitochondrial superoxide.

      We measured intracellular glutathione levels by flow cytometry and found that PPM1D-mutant cells had a greater proportion of cells with low levels of GSH. This data has been added as Figure 4D. We have also repeated the western blot to look at the antioxidant proteins catalase, SOD1, and thioredoxin after SOD1-deletion and pharmacologic SOD1 inhibition. We evaluated SOD2 protein levels in these experiments, as suggested. Smooth muscle actin (SMA) is included in the antibody cocktail as a loading control. However, it is unclear to us why PPM1D-mutant cells consistently have significantly higher levels of SMA. Therefore, we included a separate loading control, Vinculin. Repeating these western blots showed a clearer difference between WT and PPM1D-mutant cells in the levels of these antioxidant proteins: PPM1D-mutant cells have decreased levels of catalase and thioredoxin. These blots also show that SOD2 levels may be mildly increased in the PPM1D-mutant cells at baseline but are not significantly upregulated upon SOD1 inhibition. We have replaced the original immunoblot from Figure 4D with the revised blots that more clearly demonstrate the reduced levels of catalase and thioredoxin, now Figure 4E.

      L) Fig. 5 - These data support that DNA breaks are elevated in PPM1D mutant vs. wt cells. However, the data with the chemical SOD1 inhibitor again do not convince us that the enhanced levels are due to on-target effects on SOD1. Use of the alkaline comet assay is appropriate for these studies and the 8-oxoguanine data do indicate contributions from oxidative DNA base damage. But these are unlikely to result directly from altered superoxide levels, as this species cannot directly oxidize DNA bases or cause DNA strand breaks.

      Thank you to the reviewers for raising this point. We have performed comet assays in SOD1-deletion cells to look at levels of DNA damage. Consistent with the reviewers’ point, we do not see a significant increase in DNA breaks after SOD1 deletion. We have removed the data using the SOD1 inhibitor and instead show the COMET analysis in the PPM1D-mut and SOD1-KO cells (see Figure 5F). We now make the point that increased DNA damage with SOD1 loss cannot explain the vulnerability of the double-mutant cells.

      M) Instead of using NAC, which elevates glutathione synthesis but also has several known side effects, the authors may want to determine whether Tempol, a SOD mimetic can rescue the effects of SOD1 knockout or inhibition. This would directly prove that SOD1 functional loss underlies the observed growth defect and cytotoxicity from genetic SOD1 knockdown or chemical inhibition.

      This is an excellent suggestion; we have added comments to this effect into the discussion.

      N) It is recommended the discussion focus more strongly on how the signaling function of superoxide vs. its reactions with other molecular entities to induce genotoxic outcomes could be contributing to the observed phenotypes. The discussion of FANC proteins, which were targets with similar fitness scores but not experimentally investigated at all, is an unwarranted digression.

      Thank you for this recommendation. We have expanded the discussion to focus more on the signaling functions of superoxide. However, considering the role of the Fanconi Anemia pathway in mitigating DNA damage and oxidative stress, we believe the discussion on the FANC proteins is important due to the possible intersection with SOD1. Therefore, we have refined this portion of the discussion to focus more on the interplay between SOD1 and FA.

      O) The complete lack of consideration of SOD2 in these studies is a missed opportunity as it reduces mitochondrial superoxide levels but elevates hydrogen peroxide levels. It would be very interesting to see whether SOD1 inhibition leads to compensatory increases in SOD2. SOD2 can be easily measured by immunoblot. Furthermore, measuring total superoxide via hydroethidium in a flow cytometric assay vs. mitochondrial ROS in PPM1D mut vs. wt cells and under SOD1 knockout would enable a determination of which species dominates (cytosolic or mitochondrial). These experiments are required to fill some logical gaps in the interpretation of their redox data.

      During the revision process, we have included SOD2 in our studies and have found that loss of SOD1 via genetic deletion and pharmacologic inhibition does not lead to compensatory increases in SOD2 (Figure 4D). Additionally, we have measured cytoplasmic superoxide levels using dihydroethidium to differentiate between cytoplasmic vs. mitochondrial superoxide. We found that at baseline levels, the mutant cells also harbored more cytoplasmic superoxide. We have added this figure as Figure 2C and moved the original mitochondrial superoxide data to Figure 2-figure supplement 1C.

      P) Given the DNA breaks observed in PPM1D mutant cells, it is highly recommended that the authors assess whether iron levels are elevated in mut vs. wt cells and whether desferal can rescue observed SOD1 inhibition defects. Also, it has been reported that PPM1D promotes homologous recombination by forming a stable complex with BRCA1-BARD1, thereby enhancing their recruitment to double-strand break sites. The authors should comment on why there is no difference in repair via HR in WT and PPM1D mutant cells in Figure 5C.

      Please see comment G regarding our findings about iron levels.

      The reviewers pose an interesting question as to why there is no difference in HR repair between WT and mutant cells, given the reported role of PPM1D in promoting HR. We have addressed this question in the main text. We believe that several factors can limit the extent of HR enhancement in PPM1D-mutant cells. For example, HR is typically confined to the S/G2 phase and thus may be constrained by cell cycling, among other regulatory mechanisms.

      Other comments:

      A) The authors described in the Method section that "The CRISPR Screen PPM1D mutant Cas9-expressing OCI-AML2 cell lines were transduced with lentivirus library supernatant." The authors need to provide information on whether the MOI of the CRISPR screen has been well controlled to ensure that the majority of the cell population has a single copy of sgRNA transduction.

      We performed a lentiviral titer curve prior to the screen to determine the volume of viral supernatant to add for a multiplicity of infection (MOI) of 0.3. This important detail has been added to our Methods.
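      As a rough illustration of why a low MOI favors single-copy integration, the arithmetic can be sketched as follows. This assumes that the number of integrations per cell follows a Poisson distribution with mean equal to the MOI, which is a common modeling simplification and not something stated in the response; the function names are hypothetical.

```python
import math


def poisson_pmf(k: int, moi: float) -> float:
    """Probability that a cell receives exactly k integrations (Poisson assumption)."""
    return math.exp(-moi) * moi**k / math.factorial(k)


def single_copy_fraction(moi: float) -> float:
    """Among transduced cells (>= 1 integration), the fraction carrying exactly one."""
    p_zero = poisson_pmf(0, moi)
    p_one = poisson_pmf(1, moi)
    return p_one / (1.0 - p_zero)


moi = 0.3  # value reported in the response
transduced = 1.0 - poisson_pmf(0, moi)
single = single_copy_fraction(moi)
print(f"Transduced cells: {transduced:.1%}")
print(f"Single-copy among transduced: {single:.1%}")
```

      Under this assumption, roughly a quarter of cells are transduced at an MOI of 0.3, and about 86% of those carry a single sgRNA.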

      B) The study convincingly shows differences between parental leukemic cells and the PPM1D mutants but one important control is missing in experiments related to Fig. 2 and 3. All PPM1D mutant clones used in this study were subjected to the blasticidin selection of the transduced cells to generate cells stably expressing Cas9 and subsequently, the clones with successful PPM1D targeting were expanded. The authors should demonstrate that increased ROS production is not just a consequence of the lentiviral transduction and antibiotic selection and that it corresponds to increased PPM1D activity in PPM1D mutant cells. To do that, authors could compare PPM1D clones to parental cells that underwent the same selection procedure (OCI-AML2-Cas9 cells and OCI-AML3-Cas9 cells).

      It is true that the parental OCI-AML2 and OCI-AML3 cell lines underwent four days of blasticidin selection to create the stably expressing Cas9 cell lines. However, after the four-day period, the blasticidin was removed from the cell culture media. From there, we induced the PPM1D-mutations into the Cas9-expressing “WT” cell lines using the RNP-based CRISPR/Cas9 delivery method and single cells were then sorted into 96-well plates. Clones were expanded and validated using Sanger sequencing, TIDE analysis, and western blot. In all of our assays, we compare the WT Cas9 cells to the PPM1D-mutant Cas9 cells. Additionally, the cells have been expanded and passaged several times after blasticidin-selection. Therefore, we believe it is unlikely that there are residual ROS-inducing effects from the antibiotic treatment.

      C) The authors mention that they identified 3530 genes differentially expressed in parental and PPM1D mutant cells (line 267) but it is unclear what was the threshold for statistical significance. They mention FDR<0.05 in the Methods but show GSEA analysis with FDR<0.25 in Figure 4A. Source data for Fig. 4 is missing and the list of differentially expressed genes is not shown.

      The source data files for Figures 1 and 4 will be uploaded with the revised manuscript. Upon reviewing the source data, we noticed an error in the number of differentially expressed genes. We have corrected this in line 274 and you will see that this correlates with Figure 4-source data 1. For the thresholds, we used an FDR<0.05 for the differential gene expression analysis, and an FDR <0.25 in the GSEA, which is an appropriate threshold for GSEA. We have clarified these thresholds in the methods section.
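      To illustrate how the choice of FDR cutoff changes the number of features called significant, here is a minimal sketch using a Benjamini-Hochberg adjustment. The response does not state which adjustment method was used, so the procedure is an assumption, and the p-values are invented purely for illustration. GSEA itself typically derives its FDR q-values from permutations of normalized enrichment scores rather than from this procedure; only the effect of the cutoff is shown here.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values (not taken from the study).
raw = [0.001, 0.008, 0.039, 0.041, 0.20]
adj = benjamini_hochberg(raw)

strict = sum(q < 0.05 for q in adj)   # cutoff used for differential expression
lenient = sum(q < 0.25 for q in adj)  # cutoff used for GSEA
print(f"Significant at FDR < 0.05: {strict}")
print(f"Significant at FDR < 0.25: {lenient}")
```

      The more permissive 0.25 cutoff is conventional for exploratory gene-set analysis, where the aim is to generate hypotheses from a limited number of gene sets.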

      D) Include a definition of MFI in Figure legend Fig.2 and also in the Methods section. The unit should be indicated at both the x and y axes.

      We have defined MFI in the figure legends and methods sections and have updated the figures accordingly.

      E) Legend to Figure 2 - Figure Supplement 1 E should define the grey and pink columns (likely WT and mutants LCLs).

      Thank you. We have defined the grey and pink columns as WT and PPM1D-mutant cell lines, respectively for Figure 2 – Figure supplement 2D and E.

      F) Reporter assays in Fig. 5 convincingly show that NHEJ capacity is reduced in PPM1D mut cells. In the text, the authors state that this might reflect the impact of PPM1D on LSD1 (line 365). Although this might be the case, other options are equally possible. It would be appropriate to include a reference to the ability of PPM1D to counteract gH2AX and ATM which generate the most upstream signals in DDR.

      Thank you to the reviewers for raising this excellent point. We have revised the text to incorporate the impact of PPM1D on γH2AX and ATM on NHEJ.

      G) The authors correctly state that truncation of PPM1D leads to protein stabilization (line 85) and that it is present in U2OS cells (line 355). These observations have first been reported by Kleiblova et al 2013 and therefore one reviewer believes that this reference should be included. This study also identified truncating PPM1D mutation in colon adenocarcinoma HCT116 cells, and the role of PPM1D mutation in promoting the growth of colon cancer has subsequently been tested in an animal model (Burocziova et al., 2019).

      Thank you. We have added this reference to our text in line 360.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      This is an interesting study that performs scRNA-Seq on infected and uninfected wounds. The authors sought to understand how infection with E. faecalis influences the transcriptional profile of healing wounds. The analysis demonstrated that there is a unique transcriptional profile in infected wounds with specific changes in macrophages, keratinocytes, and fibroblasts. They also speculated on potential crosstalk between macrophages and neutrophils and macrophages and endothelial cells using NicheNet analysis and CellChat. Overall the data suggest that infection causes keratinocytes to not fully transition which may impede their function in wound healing and that the infection greatly influenced the transcriptional profile of macrophages and how they interact with other cells.

      Strengths:

      It is a useful dataset to help understand the impact of wound infection on the transcription of specific cell types. The analysis is very thorough in terms of transcriptional analysis and uses a variety of techniques and metrics.

      Weaknesses:

      Some drawbacks of the study are the following. First, the fact that it only has two mice per group, and only looks at one time point after wounding decreases the impact of the study. Wound healing is a dynamic and variable process so understanding the full course of the wound healing response would be very important to understand the impact of infection on the healing wound. Including unwounded skin in the scRNA-Seq would also lend a lot more significance to this study. Another drawback of the study is that mouse punch biopsies are very different than human wounds as they heal primarily by contraction instead of reepithelialization like human wounds. So while the conclusions are generally supported the scope of the work is limited.

      Thank you for your thoughtful review and acknowledgment of the thoroughness of our analysis.

      First, the fact that it only has two mice per group, and only looks at one time point after wounding decreases the impact of the study.

      We acknowledge your concerns regarding the limitations of our study, particularly regarding the small number of mice per group and the examination of only one time point post-wounding. We agree that a more comprehensive analysis across multiple time points would provide a deeper understanding of the temporal changes induced by infection. While our primary focus in this study was to elucidate the foundational responses to bacteria-infected wounds, we attempted to augment our analysis by incorporating publicly available datasets of similar nature. However, these datasets lacked power in terms of cell number and populations. Nonetheless, we have bolstered our analysis by applying a cross-entropy test on the integrated dataset and reporting its significance (Figure S1F), ensuring the robustness of our single-cell RNA sequencing datasets.

      Including unwounded skin in the scRNA-Seq would also lend a lot more significance to this study.

      We also recognize the significance of comparing infected wounds to unwounded skin to establish a baseline for transcriptional changes. While we attempted to incorporate publicly available unwounded skin samples into our analysis, we encountered limitations in the number of cells, particularly within the immune population. This constraint is addressed in the Limitations section of the manuscript.

      Another drawback of the study is that mouse punch biopsies are very different than human wounds as they heal primarily by contraction instead of re-epithelialization like human wounds.

      Regarding the concern about differences between murine and human wound healing mechanisms, we took measures during tissue isolation to mitigate this issue, extracting incisions of the wounds rather than contracted tissues. Despite the primary mode of wound closure in mice being contraction, we believe our analysis still offers valuable insights into cellular responses to infection relevant to human wound healing.

      We appreciate your constructive criticism of our study. Despite these constraints, we believe our work provides valuable insights into the transcriptional changes induced by infection in healing wounds.

      Reviewer #2 (Public Review):

      Summary:

      The authors have performed a detailed analysis of the complex transcriptional status of numerous cell types present in wounded tissue, including keratinocytes, fibroblasts, macrophages, neutrophils, and endothelial cells. The comparison between infected and uninfected wounds is interesting and the analysis suggests possible explanations for why infected wounds are delayed in their healing response.

      Strengths:

      The paper presents a thorough and detailed analysis of the scRNAseq data. The paper is clearly written and the conclusions drawn from the analysis are appropriately cautious. The results provide an important foundation for future work on the healing of infected and uninfected wounds.

      Weaknesses:

      The analysis is purely descriptive and no attempt is made to validate whether any of the factors identified are playing functional roles in wound healing. The experimental setup is analyzing a single time point and does not include a comparison to unwounded skin.

      We are thankful for your acknowledgment of the thoroughness of our analysis and the cautious nature of our conclusions.

      The analysis is purely descriptive, and no attempt is made to validate whether any of the factors identified are playing functional roles in wound healing.

      Regarding your concern about the purely descriptive nature of our analysis and the lack of functional validation of identified factors, we agree on the importance of understanding the functional roles of transcriptional changes in wound healing. To address this limitation, we plan to conduct functional experiments, such as perturbation assays or in vivo validation studies, to validate the roles of specific factors identified in our analysis.

      The experimental setup is analyzing a single time point and does not include a comparison to unwounded skin.

      We acknowledge the importance of comparing wounded tissue to unwounded skin to establish a baseline for understanding transcriptional changes. This point is noted and acknowledged in the limitations section of our manuscript.

      We appreciate your feedback and assure you that we will consider your suggestions in future iterations of our research.

      Recommendations For The Authors:

      We are grateful for the positive overall assessment of our revised work by the reviewers. Critical comments on specific aspects of our work are listed verbatim below followed by our responses.

      Reviewer 1 (Recommendations for the Authors):

      (1) The figures are a bit cluttered and hard to parse out. The different parts of the figure seem to be scattered all over the place with no consistent order.

      Thank you for your feedback regarding the figures in our manuscript. We acknowledge your concern that some panels may appear cluttered and challenging to navigate. In response, we made concerted efforts to declutter certain panels, taking into account page size constraints and ensuring a minimum font size for readability.

      (2) I didn't really understand what the last sentence on page 6 meant. Is this meant to say that these could be biomarkers of infection?

      We thank the reviewer for noting this lack of clarity. We revised the statement.

      Updated manuscript (lines 111-113)

      “Overall, the persistent E. faecalis infection contributed to higher Tgfb1 expression, whilst Pdgfa levels remained low, correlating with delayed wound healing.”

      (3) A reference on page 19 didn't format correctly.

      We thank the reviewer for catching the typos. We corrected the reference formatting.

      Updated manuscript (lines 503-505)

      “We confirm the immune-suppressive role of E. faecalis in wound healing, consistent with previous findings in different experimental settings (Chong et al., 2017; Kao et al., 2023; Tien et al., 2017).”

      (4) The title doesn't really address the scope of the finding which goes beyond immunomodulatory.

      The reviewer is correct! We therefore revised the title to cover all aspects of the study as:

      “Decoding the complexity of delayed wound healing following Enterococcus faecalis infection”

      Reviewer 2 (Recommendations for the Authors):

      (1) On page 6, the expression of Tgfb1 is described as "aggravated" by wounding alone. I am not sure whether this means Tgfb1 levels are increased or decreased. It appears from the data that it is increased, which was confusing to me since I interpreted "aggravated" as meaning decreased. So perhaps a different more straightforward word could be used to describe the data.

      We modified this ambiguous statement to:

      Updated manuscript (lines 105-106)

      “By contrast, wounding alone resulted in higher transforming growth factor beta 1 (Tgfb1) expression.”

      (2) On page 7, the authors state that "cells from infected wounds...demonstrated distinct clustering patterns compared to cells from uninfected wounds (Figure S1F)" but when I look at the data in this figure, I cannot really see a difference. Perhaps the differences could be more clearly highlighted?

      Thank you for pointing out this issue. We utilized the cross-entropy test for statistical comparison, employing UMAP embedding space data. While the data underwent batch correction based on infection status, the UMAP plots for each condition may appear visually similar. However, it's important to note that the number of cells per cluster varies significantly between the infected and uninfected conditions. This aspect influences the selection of points (cells) and their nearest neighbours for statistical testing within each cluster in the embedding space. To address this concern, we have included a table indicating the number of cells per cell type alongside the plot (Figure S1F), providing additional context for the interpretation of our results.

      Author response table 1.

      Author response image 1.

      (3) On page 8, Zeb2hi cells are described as "immunosuppressive" and yet the genes they are highlighted as expressing include Cxcl2 and IL1b which I would classify as inflammatory, not immunosuppressive. Can the authors be a bit more clear on why they describe the phenotype of these cells as "immunosuppressive"?

      We agree with the reviewer that this is a bit counterintuitive. Conventionally, CXCL2 is thought to be chemoattractant for neutrophil recruitment. However, the infection-specific keratinocyte cluster expressing Cxcl2, Il1b, Wfdc17 along with Zeb2 and Thbs1 indicate their myeloid-derived suppressor cell-like features, which play immunosuppressive roles during infection and in cancer (Alshetaiwi et al., 2020; Siriwach et al., 2022; Veglia et al., 2021).

      Updated manuscript (lines 159-163)

      “As the barrier to pathogens, keratinocytes secrete a broad range of cytokines that can induce inflammatory responses (Alshetaiwi et al., 2020; Siriwach et al., 2022; Veglia et al., 2021). However, Zeb2hi keratinocytes co-expressing Cxcl2, Il1b, and Wfdc17, indicate myeloid-derived suppressor cell-like phenotype which implies an immunosuppressive environment (Hofer et al., 2021; Veglia et al., 2021).”

      (4) On pages 8-9, Keratinocytes are described to express MHC class II. I find this quite unexpected since class II is usually thought to be expressed primarily by APCs such as DCs and B cells. Is there a precedent for keratinocytes to express class II? The authors should acknowledge that this is unexpected and in need of further validation, or support the claim with references in which class II expression has been previously observed on keratinocytes (and is thus not unexpected)

      Although MHC class II expression is predominantly on immune cells, an antigen-presenting role for keratinocytes has been reported in many studies (Banerjee et al., 2004; Black et al., 2007; Carr et al., 1986; Gawkrodger et al., 1987; Jiang et al., 2020; Li et al., 2022; Oh et al., 2019; Tamoutounour et al., 2019). Therefore, the antigen-presenting role of keratinocytes is known and expected, and we think that this should be further investigated in the context of wound infection.

      Updated manuscript (lines 177-179)

      “These genes are associated with the major histocompatibility complex (MHC) class II, suggesting a self-antigen presenting keratinocyte population, which have a role in costimulation of T cell responses (Meister et al., 2015; Tamoutounour et al., 2019).”

      REFERENCES

      Alshetaiwi, H., Pervolarakis, N., McIntyre, L. L., Ma, D., Nguyen, Q., Rath, J. A., Nee, K., Hernandez, G., Evans, K., Torosian, L., Silva, A., Walsh, C., & Kessenbrock, K. (2020). Defining the emergence of myeloid-derived suppressor cells in breast cancer using single-cell transcriptomics. Science Immunology, 5(44), eaay6017. https://doi.org/10.1126/sciimmunol.aay6017

      Banerjee, G., Damodaran, A., Devi, N., Dharmalingam, K., & Raman, G. (2004). Role of keratinocytes in antigen presentation and polarization of human T lymphocytes. Scandinavian Journal of Immunology, 59(4), 385–394. https://doi.org/10.1111/j.0300-9475.2004.01394.x

      Black, A. P. B., Ardern-Jones, M. R., Kasprowicz, V., Bowness, P., Jones, L., Bailey, A. S., & Ogg, G. S. (2007). Human keratinocyte induction of rapid effector function in antigen-specific memory CD4+ and CD8+ T cells. European Journal of Immunology, 37(6), 1485–1493. https://doi.org/10.1002/eji.200636915

      Carr, M. M., McVittie, E., Guy, K., Gawkrodger, D. J., & Hunter, J. A. (1986). MHC class II antigen expression in normal human epidermis. Immunology, 59(2), 223–227.

      Gawkrodger, D. J., Carr, M. M., McVittie, E., Guy, K., & Hunter, J. A. (1987). Keratinocyte expression of MHC class II antigens in allergic sensitization and challenge reactions and in irritant contact dermatitis. The Journal of Investigative Dermatology, 88(1), 11–16. https://doi.org/10.1111/1523-1747.ep12464641

      Jiang, Y., Tsoi, L. C., Billi, A. C., Ward, N. L., Harms, P. W., Zeng, C., Maverakis, E., Kahlenberg, J. M., & Gudjonsson, J. E. (2020). Cytokinocytes: The diverse contribution of keratinocytes to immune responses in skin. JCI Insight, 5(20), e142067. https://doi.org/10.1172/jci.insight.142067

      Li, D., Cheng, S., Pei, Y., Sommar, P., Kärner, J., Herter, E. K., Toma, M. A., Zhang, L., Pham, K., Cheung, Y. T., Liu, Z., Chen, X., Eidsmo, L., Deng, Q., & Xu Landén, N. (2022). Single-Cell Analysis Reveals Major Histocompatibility Complex II‒Expressing Keratinocytes in Pressure Ulcers with Worse Healing Outcomes. The Journal of Investigative Dermatology, 142(3 Pt A), 705–716. https://doi.org/10.1016/j.jid.2021.07.176

      Oh, S., Chung, H., Chang, S., Lee, S.-H., Seok, S. H., & Lee, H. (2019). Effect of Mechanical Stretch on the DNCB-induced Proinflammatory Cytokine Secretion in Human Keratinocytes. Scientific Reports, 9(1), 5156. https://doi.org/10.1038/s41598-019-41480-y

      Siriwach, R., Ngo, A. Q., Higuchi, M., Arima, K., Sakamoto, S., Watanabe, A., Narumiya, S., & Thumkeo, D. (2022). Single-cell RNA sequencing identifies a migratory keratinocyte subpopulation expressing THBS1 in epidermal wound healing. iScience, 25(4), 104130. https://doi.org/10.1016/j.isci.2022.104130

      Tamoutounour, S., Han, S.-J., Deckers, J., Constantinides, M. G., Hurabielle, C., Harrison, O. J., Bouladoux, N., Linehan, J. L., Link, V. M., Vujkovic-Cvijin, I., Perez-Chaparro, P. J., Rosshart, S. P., Rehermann, B., Lazarevic, V., & Belkaid, Y. (2019). Keratinocyte-intrinsic MHCII expression controls microbiota-induced Th1 cell responses. Proceedings of the National Academy of Sciences of the United States of America, 116(47), 23643–23652. https://doi.org/10.1073/pnas.1912432116

      Veglia, F., Sanseviero, E., & Gabrilovich, D. I. (2021). Myeloid-derived suppressor cells in the era of increasing myeloid cell diversity. Nature Reviews. Immunology, 21(8), 485–498. https://doi.org/10.1038/s41577-020-00490-y

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer 3 (Public review):

      Major comments:

      (1) Can isolated mitochondria be transported to cultured cardiomyocytes, such as H9C2 cells, in vitro?

      Thank you for this insightful question. Mitochondria are highly dynamic organelles that play a crucial role in cellular energy metabolism. When cells encounter various stressors and increased energy demands, they can benefit from the incorporation of exogenous mitochondria. In 2013, Masuzawa et al. (Masuzawa, et al.,2013) were the first to demonstrate that transplanted mitochondria are internalized by cardiomyocytes 2 to 8 hours after transplantation, significantly contributing to the preservation of myocardial energetics. Ali et al. (Ali, et al.,2020) discovered that exogenous mitochondria could be internalized by H9C2 cardiomyocytes as quickly as 5 minutes after co-incubation, resulting in an acute enhancement of normal cellular bioenergetics following mitochondrial transplantation. Pacak et al. (Pacak, et al.,2015) established that the internalization of mitochondria into cardiomyocytes is time-dependent and occurs through actin-dependent endocytosis.

      Collectively, this evidence shows that exogenous mitochondria can be effectively internalized by H9C2 cells and other cardiomyocytes. Our experiments further confirmed that transplanted mitochondria can be incorporated by the myocardium in vivo.

      (2) The description of results in the manuscript is too simple. It lacks detail on the rationale behind the experiments and the significance of the data.

      Thank you for this suggestion. We have realized that the results in the submitted manuscript have not been adequately interpreted. We have added necessary details on the rationale behind the experiments and the significance of the data to the results section (Lines 57~59, 69~73, 81~88, 91~98, 100~102, 103~104, 109~115, 124~129, 135~146, 149~157, 159~161, 168~169, 178~179). We would like to express our gratitude to the reviewers once again and hope that our modifications will meet their requirements.

      (3) The authors demonstrate that mitochondrial transplantation reduces cardiomyocyte apoptosis. Therefore, Western blot analysis of apoptosis-related caspases could be provided for further confirmation.

      Thank you for this constructive comment. We fully agree with the reviewer's perspective on the detection of apoptosis-related caspases and have conducted a Western blot assay to investigate the impact of mitochondria on myocardial tissue. Our new evidence indicates that rats receiving mitochondrial transplantation exhibited reduced expression of cleaved caspase-3 compared with those in the NS and Vehicle groups (Fig. 6G, 6H, Lines 168~169), suggesting that mitochondrial transplantation decreased the level of apoptosis in the myocardium.

      (4) Do donor mitochondria fuse with recipient mitochondria? Relevant experiments and data should be provided to address this question.

      This is a very helpful comment. Investigating the fate of transplanted mitochondria in myocardial cells after CA is of great significance. The internalization of exogenous mitochondria has been observed across various cell types (Liu, et al.,2021; Shanmughapriya, et al.,2020). Notably, a recent study indicated that after being incorporated into host cells, isolated mitochondria are transported to endosomes and lysosomes. Subsequently, most of these mitochondria escape from these compartments and fuse with the endogenous mitochondrial network (Cowan, et al.,2017). We have discussed this in the manuscript. (Lines 217~220)

      Oxidative stress, a pathophysiological phenomenon common to cells suffering from ischemia/reperfusion insults after CA/CPR, has been implicated in promoting the internalization and survival of exogenous mitochondria (Aharoni-Simon, et al.,2022). In our study, we confirmed that mitochondrial transplantation can enhance the metabolism of cardiomyocytes, increase ATP levels, and reduce reactive oxygen species (ROS). Our results indirectly confirm that isolated mitochondria can successfully fuse with myocardial mitochondria.

      (5) In Figure 5A, the histograms are not labeled with the specific experimental groups.

      We apologize for this oversight. We have labeled the specific experimental groups in the histograms presented in Figure 6B and 6C (originally Figure 5A).

      Reviewer #1 (Recommendations For The Authors):

      (1) The age, gender, and strain of the donor rats should be specified in the Methods section. Additionally, it is not obvious what doses of mitochondria were injected into the rats and how the dosage was initially determined.

      Thanks for your suggestion. We have included relevant information about the donor rats in the Methods section (Lines 361~362).

      In the Mito group, each animal received 0.5 mL of a 1 × 10<sup>9</sup>/mL mitochondrial suspension (Lines 342~345). Considerable data have demonstrated the efficacy of mitochondrial transplantation in cellular, animal, and human research (Alemany, et al.,2024; Kaza, et al.,2017; Liu, et al.,2023). However, there is currently no evidence to determine the optimal dosage for transplantation. In previous research, isolated mitochondria (1 × 10<sup>9</sup>) were delivered to the left coronary ostium in pigs and proved to be a viable treatment modality for cardiac ischemia-reperfusion injury (Blitzer, et al.,2020; Guariento, et al.,2020). Additionally, a dose of 1 × 10<sup>9</sup> mitochondria achieves the maximal hyperemic effect when administered via intracoronary injection (Shin, et al.,2019). Considering that Sprague-Dawley (SD) rats are smaller than pigs and that there is a loss of mitochondria during pulmonary circulation, we adopted a mitochondrial transplantation dose of 5 × 10<sup>8</sup>. We will explore the optimal dosage in our future research.
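
      The dose stated above follows directly from the injected volume and the suspension concentration. A minimal sketch of that arithmetic, using only the figures quoted in this response (the function and variable names are illustrative assumptions, not part of the manuscript):

```python
# Illustrative dose arithmetic; all numbers are taken from the response text.
# Function and variable names are hypothetical.

def total_mitochondria(concentration_per_ml: float, volume_ml: float) -> float:
    """Return the number of mitochondria delivered in one injection."""
    return concentration_per_ml * volume_ml

concentration_per_ml = 1e9  # mitochondria per mL of suspension
injected_volume_ml = 0.5    # volume given to each animal in the Mito group
porcine_dose = 1e9          # dose reported in the cited porcine studies

rat_dose = total_mitochondria(concentration_per_ml, injected_volume_ml)

print(f"Rat dose: {rat_dose:.1e} mitochondria")
print(f"Fraction of the porcine dose: {rat_dose / porcine_dose:.2f}")
```

      Under these figures the rat dose is half of the porcine dose, which matches the 5 × 10<sup>8</sup> value given in the response.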

      (2) In Figure 4a, the number of transplanted mitochondria appears to be very low. Considering the high number of mitochondria present in cardiomyocytes, it is unclear whether this small amount of transplanted mitochondria can significantly impact complex II activity and ATP levels in myocardial tissues, as shown in Figures 4b-d, or improve survival post-ROSC, as shown in Figure 2d. Could the observed benefits of mitochondrial transplantation be due to the indirect effects of the injected mitochondria, such as the release of mitochondrial contents, rather than the mitochondria themselves, as discussed by Bertero et al. (2021, Circ. Research)? This issue should be addressed in the manuscript.

      Thanks for this wonderful comment. As presented in Fig. 4 (originally Figure 4A), our results indicated the internalization of mitochondria by the myocardium, shown by colocalization of Mito-tracker and a myocardium marker. We would like to make the following points regarding Fig. 4:

      (1) Significant left ventricular systolic and diastolic dysfunction that occurs in the myocardium shortly after ROSC is referred to as post-cardiac arrest myocardial dysfunction (PAMD) (Laurent, et al.,2002). The efficacy of mitochondrial transplantation for the heart following ischemia-reperfusion injury has been demonstrated in cellular, animal, and human studies, despite inadequate mitochondrial internalization (Liu, et al.,2023). Thus, even a low number of transplanted mitochondria may improve cardiac function.

      (2) Only biologically active mitochondria can be specifically labeled with Mito-tracker. Therefore, the mitochondria taken up by cardiomyocytes possess complete functionality. Previous results have demonstrated that mitochondrial contents, such as nonviable mitochondria, mitochondrial fractions, mitochondrial deoxyribonucleic acid, ribonucleic acid, exogenous adenosine diphosphate and ATP, do not provide protection to the ischemic heart (McCully, et al.,2017; McCully, et al.,2009).

      (3) The specific mechanism for mitochondrial internalization has yet to be fully elucidated. We fully agree with the reviewer’s opinion that other mechanisms of mitochondrial transplantation may play a role in cardiac protection. Multiple mechanisms may be involved in the cardioprotective effect of mitochondrial transplantation, and we are actively seeking a reasonable approach to test these hypotheses in an ongoing study (Lines 236~246).

      (3) In Figure 4g, the claims regarding sarcomere length, mitochondrial structure, the number of cristae, accumulated calcium etc. seem to rely on the visual interpretation of representative images. To ensure a reliable interpretation of the data, a blinded quantification of each image in each group should be conducted. The same applies to the claims made in Figure 5E.

      Thanks for this suggestion. We have quantitatively evaluated the electron microscope images and HE images of the myocardium to ensure reliable interpretation. Corresponding supplements have been added to the methods (Lines 433~441, 494~496), results sections (Lines 109~115, 178~179), and Figures 5C, 5D, 6K and 6H (originally Figures 4G and 5E).

      (4) In line 69, it is unclear why the authors claim that MAP and HR decrease at 1, 2, 3, and 4 hours after ROSC in all groups compared to the Sham group, despite stating in line 72 that "MAP and HR did not differ at any observational time points (P>0.05, Figure 2C)."

      We apologize for our inaccurate phrasing. In the present study, there was no statistically significant difference in MAP or HR at any observational time point (P>0.05, Figure 2C). In the NS, Vehicle and Mito groups, MAP and HR decreased at 1, 2, 3, and 4 hours after ROSC, reaching their nadir at 1 hour. Subsequently, MAP and HR increased gradually but did not show any statistically significant differences compared with the Sham group. (Lines 69~73).

      (5) The absence of increased mitochondrial content in the mito-groups should be discussed further in the manuscript.

      Thank you for your suggestion. We discussed the reasons why the mass of isolated mitochondria did not increase in Lines 224~235.

      (6) The N in Figure 5d should be provided.

      Thanks for your suggestion. We have revised the figure legend to include N of Figure 6F (originally Figures 5D).

      (7) Figure 6 demonstrates content beyond the findings in this manuscript. This reviewer recommends limiting the graphical abstract to the findings specifically in this paper.

      Thanks for your great advice. We have revised Figure 7 (originally Figure 6) and restricted the graphical abstract to the findings presented in this paper.

      Minor issues:

      (8) The order of data in Figure 4 should be consistent with the text in the manuscript. Figures 4E-F-G are described before Figures 4B-C-D in the text. Similarly, Figure 5F was described before Figure 5E in the text.

      Thanks for your great advice. We have rearranged the order of the pictures to align with the text. Thank you for your proposal.

      (9) In Figure 4A, the locations of the epicardium, muscle, and endocardium should be indicated for clarity. Also, it is not obvious where the close-up box refers to in the actual image.

      Thank you for your suggestion. We primarily sought evidence of mitochondrial internalization within the endocardium, as injury occurs there first during myocardial ischemia (Kuwada and Takenaka,2000). The close-up box in Fig. 4 refers to the endocardium.

      (10) In Figure 5A, the group annotations are missing from the MDA and SOD graphs. The standard deviation bars for the SOD vehicle and SOD mito groups (3rd and 4th columns) appear to overlap. Can the authors provide the actual p-values?

      We apologize for the omission of group annotations in the MDA and SOD graphs. The p-value between the Vehicle group and the Mito group was 0.004. The SOD activity levels of myocardial samples in the groups are presented in Table 1.

      Author response table 1.

      The SOD activity levels of myocardial samples in groups (U/mgprot)

      (11) In line 58, NS abbreviation is used without defining what NS is.

      We apologize for not including the full name of NS. NS is the abbreviation of normal saline. It has now been marked in the manuscript. (Line 58)

      (12) In line 118, what MDA stands for is not described until line 348. MDA should be defined in the text for the general audience.

      We apologize for this. We have defined it in the manuscript. (Lines 156~157)

      (13) In line 192, the authors state that "mitochondrial transplantation... increased the expression of antioxidant enzymes after four hours of ROSC," while only SOD activity levels were assessed in the manuscript. Increased activity levels do not necessarily imply an increase in expression levels. This discrepancy should be addressed in the Discussion section.

      We apologize for confusing ‘activity’ with ‘expression’. Although mitochondrial transplantation has been shown to be involved in the restoration of manganese superoxide dismutase levels after ischemic insults (Tashiro, et al.,2022), changes in antioxidant enzyme expression were not evaluated at the protein level in this paper. To avoid misunderstandings, we have replaced the term ‘expression’ with ‘activity’ as appropriate. (Lines 268~271)

      (14) Mitochondria from non-ischemic gastrocnemius muscle of health donor animals were isolated and a manner that maximized their healing potential. This sentence is not clear.

      We apologize for the confusing sentence in the original manuscript. To improve clarity, we have revised that sentence. We isolated mitochondria from allogeneic gastrocnemius muscle tissue of healthy rats and maintained optimal mitochondrial activity and therapeutic effects. (Lines 199~201)

      Minor grammar issues:

      In line 153, mitochondrial should be mitochondria.

      Figure 2D: Percent servival should be percent survival.

      There should be a blank in complex IIactivity Figure 4B, and complex IV activity in Figure 4C.

      In line 134, Four hours of ROSC, Tissue samples from. Tissue is capital.

      In line 190, Similaerly should be similarly.

      Thank you for your valuable comments. We apologize for the grammatical issues caused by our oversight. We have made the necessary corrections in the manuscript and figures. (Lines 198, 179, and 268), Figure 2D, Figure 5E (originally Figure 4B); Figure 5F (originally Figure 4C).

      Reviewer #2 (Recommendations For The Authors):

      Some details are lacking clarity, such as the rationale behind choosing certain doses or time points for interventions.

      Thank you for this valuable suggestion. We have explained the rationale behind the selection of the dosage and the timing of the intervention. (Lines 201~212)

      I would suggest verifying mitochondrial function using the seahorse experiment oxygen consumption, and to check mitochondrial oxidative stress. I would also suggest checking the mitochondrial permeability transition pore opening, using for example calcein cobalt quenching or simply a kit to examine this further.

      Thank you for your valuable advice. In our manuscript, we added results regarding mitochondrial reactive oxygen species (ROS) and the mitochondrial permeability transition pore (mPTP) opening. As anticipated, mitochondrial transplantation reduced the increase in mitochondrial ROS and the mPTP opening in ischemic myocardium. (Lines 135~146, 149~157, 442~455, 460~476, Figure 5H, 5I, 6A)

      We agree that Seahorse oxygen consumption measurements would be beneficial for understanding the intricacies of their interactions and enhancements. Ali et al. (Ali, et al.,2020) have demonstrated that introducing non-autologous mitochondria from healthy skeletal muscle cells into normal cardiomyocytes results in a short-term improvement in bioenergetics, as measured using a Seahorse Extracellular Flux Analyzer. We have not yet conducted cellular experiments, because the process of isolating cells from the myocardial tissue of adult SD rats for Seahorse analysis can lead to secondary damage to the myocardial cells (Jacobson, et al.,1985). In this experiment, we measured ATP content and the activity of mitochondrial complexes to evaluate energy changes after mitochondrial transplantation. We will conduct cell experiments and use Seahorse measurements to further clarify the alterations in myocardial energy in the future.

      For Figure 3B, it would be beneficial to include the relative quantification of the mitochondrial marker COX-IV. Additionally, if feasible, I suggest verifying the representation of the mitochondria outer membrane TOM20 or VDAC.

      Thank you for your great suggestion. As suggested, we added TOM20 to assess the purity of the isolated mitochondria and reached the same conclusion: the isolated mitochondria exhibited high purity (Figure 3B). TOM20 was expressed in both muscle lysates and isolated mitochondria, whereas GAPDH was exclusively found in the muscle lysate. (We re-validated the purity of the mitochondria by using relative quantification of TOM20 and COX IV.)

      In Figure 2C, the clarity of the graphs depicting both arterial pressure (MAP) and heart rate (HR) is lacking and could potentially confuse the reader. I recommend incorporating color coding instead of relying solely on symbols, or by presenting the data in a more comprehensible format and that aligns with graph B as well.

      Thank you for your constructive comments. We have color-coded the diagrams in Figure 2B and 2C.

      In Figure 4A, please include high-magnification of the mitochondria to provide a more detailed examination.

      Thank you for this insightful comment. We have provided a high-magnification image of the mitochondria in Figure 4.

      Regarding lines 81-82, I recommend specifying the sentence more precisely for better clarity and understanding.

      Thank you for your comments. We have revised the sentences in lines 83~86 to enhance their clarity for readers.

      In the Materials and Methods section, it is crucial to provide precise details. For instance, when staining the exogenous mitochondria with MitoTracker Red, it is important to specify the duration of staining, such as the standard 20 minutes for example. Additionally, it is advisable to mention the number of times these mitochondria were washed with the respiratory solution to ensure thorough removal of excess MitoTracker, thus preventing unintended staining of endogenous mitochondria with MitoTracker red upon injection of pre-labeled mitochondria.

      Thank you for your suggestion. We have added the necessary details regarding Mito-Tracker Red staining. (Lines 373~376) In addition, we have added other details where necessary (Lines 373~376, 379~382, 395~396, 397~400, 487~488). We appreciate your suggestion once again.

      The sensitivity of JC-1 dye to temperature and pH fluctuations underscores the necessity for meticulous experimental conditions. It is crucial for the authors to elucidate why they chose to maintain the samples at 4°C for 60 minutes, especially considering the dye's optimal operating temperature of 25°C. Providing a rationale behind this deviation from standard protocol would enhance the scientific rigor and reproducibility of the study. Please add more information on the objectives used in the fluorescence microscope (BX53, OLYMPUS, Tokyo, Japan) and the software used.

      We sincerely apologize for the mistake in this sentence. The purified mitochondria, which are stained with JC-1, should be stored at 4°C and examined using a fluorescence microscope within 60 minutes. Purified mitochondria were incubated with JC-1 staining solution at 37°C for 20 minutes. The fluorescence microscope used in our experiment is equipped with a WHN 10/22 eyepiece, and the software version is OLYMPUS cellSens Standard 3.2. (Lines 379~382)

      Moreover, in the context of immunoblotting, it is imperative for the authors to furnish detailed information regarding the preparation of muscle tissue homogenates. Specifically, clarification is needed regarding the solution utilized for tissue grinding. Did the authors employ ice-cold RIPA lysis buffer or an alternative lysis buffer, supplemented with a protease inhibitor cocktail? Such details are pivotal for methodological transparency.

      Thanks for this wonderful comment. In the methods section, we added detailed information about protein extraction. (Lines 383~385)

      Furthermore, it would be beneficial for the authors to specify the instrument employed for scanning the immunoblots, as well as the software utilized for subsequent analysis of the immunoblot images. Providing this information would not only enhance the reproducibility of the findings but also facilitate the evaluation of the experimental results.

      Thank you for your suggestion. We have included the instrument used for scanning the Western blot, as well as the software used for image analysis in the manuscript. (Lines 397~400)

      Authors must exercise caution against copy-pasting. In line 282, there's a query regarding how the mitochondria were isolated. It is recommended to cite a specific reference and offer more comprehensive details. Despite the authors referencing a number within the text, the absence of numbered references makes it challenging to cross-reference.

      Thank you for pointing this out; we have updated the citation accordingly (Line 361).

      Figure 5C please double check some misspelling label errors (e.g: Vehicle and not Vehucle).

      We apologize for the misspelling in Figure 6E (originally Figure 5C) and have corrected it. Additionally, we have thoroughly reviewed the text for spelling errors and sincerely apologize once again for the previous mistakes. (Lines 249~252, 322)

      References:

      Aharoni-Simon M, Ben-Yaakov K, Sharvit-Bader M, Raz D, Haim Y, Ghannam W, Porat N, Leiba H, Marcovich A, Eisenberg-Lerner A, Rotfogel Z. 2022. Oxidative stress facilitates exogenous mitochondria internalization and survival in retinal ganglion precursor-like cells. SCI REP-UK 12:5122. doi:10.1038/s41598-022-08747-3

      Alemany VS, Nomoto R, Saeed MY, Celik A, Regan WL, Matte GS, Recco DP, Emani SM, Del NP, McCully JD. 2024. Mitochondrial transplantation preserves myocardial function and viability in pediatric and neonatal pig hearts donated after circulatory death. J THORAC CARDIOV SUR 167: e6-e21. doi: 10.1016/j.jtcvs.2023.05.010

      Ali PP, Kenney MC, Kheradvar A. 2020. Bioenergetics Consequences of Mitochondrial Transplantation in Cardiomyocytes. J AM HEART ASSOC 9: e14501. doi:10.1161/JAHA.119.014501

      Blitzer D, Guariento A, Doulamis IP, Shin B, Moskowitzova K, Barbieri GR, Orfany A, Del NP, McCully JD. 2020. Delayed Transplantation of Autologous Mitochondria for Cardioprotection in a Porcine Model. ANN THORAC SURG 109:711-719. doi: 10.1016/j.athoracsur.2019.06.075

      Cowan DB, Yao R, Thedsanamoorthy JK, Zurakowski D, Del NP, McCully JD. 2017. Transit and integration of extracellular mitochondria in human heart cells. SCI REP-UK 7:17450. doi:10.1038/s41598-017-17813-0

      Guariento A, Blitzer D, Doulamis I, Shin B, Moskowitzova K, Orfany A, Ramirez-Barbieri G, Staffa SJ, Zurakowski D, Del NP, McCully JD. 2020. Preischemic autologous mitochondrial transplantation by intracoronary injection for myocardial protection. J THORAC CARDIOV SUR 160: e15-e29. doi: 10.1016/j.jtcvs.2019.06.111

      Jacobson SL, Banfalvi M, Schwarzfeld TA. 1985. Long-term primary cultures of adult human and rat cardiomyocytes. BASIC RES CARDIOL 80 Suppl 1:79-82. doi:10.1007/978-3-662-11041-6_15

      Kaza AK, Wamala I, Friehs I, Kuebler JD, Rathod RH, Berra I, Ericsson M, Yao R, Thedsanamoorthy JK, Zurakowski D, Levitsky S, Del NP, Cowan DB, McCully JD. 2017. Myocardial rescue with autologous mitochondrial transplantation in a porcine model of ischemia/reperfusion. J THORAC CARDIOV SUR 153:934-943. doi: 10.1016/j.jtcvs.2016.10.077

      Kuwada Y, Takenaka K. 2000. [Transmural heterogeneity of the left ventricular wall: subendocardial layer and subepicardial layer]. J CARDIOL 35:205-218.

      Laurent I, Monchi M, Chiche JD, Joly LM, Spaulding C, Bourgeois B, Cariou A, Rozenberg A, Carli P, Weber S, Dhainaut JF. 2002. Reversible myocardial dysfunction in survivors of out-of-hospital cardiac arrest. J AM COLL CARDIOL 40:2110-2116. doi:10.1016/s0735-1097(02)02594-9

      Liu D, Gao Y, Liu J, Huang Y, Yin J, Feng Y, Shi L, Meloni BP, Zhang C, Zheng M, Gao J. 2021. Intercellular mitochondrial transfer as a means of tissue revitalization. SIGNAL TRANSDUCT TAR 6:65. doi:10.1038/s41392-020-00440-z

      Liu Q, Liu M, Yang T, Wang X, Cheng P, Zhou H. 2023. What can we do to optimize mitochondrial transplantation therapy for myocardial ischemia-reperfusion injury? MITOCHONDRION 72:72-83. doi: 10.1016/j.mito.2023.08.001

      Masuzawa A, Black KM, Pacak CA, Ericsson M, Barnett RJ, Drumm C, Seth P, Bloch DB, Levitsky S, Cowan DB, McCully JD. 2013. Transplantation of autologously derived mitochondria protects the heart from ischemia-reperfusion injury. AM J PHYSIOL-HEART C 304:H966-H982. doi:10.1152/ajpheart.00883.2012

      McCully JD, Cowan DB, Emani SM, Del NP. 2017. Mitochondrial transplantation: From animal models to clinical use in humans. MITOCHONDRION 34:127-134. doi: 10.1016/j.mito.2017.03.004

      McCully JD, Cowan DB, Pacak CA, Toumpoulis IK, Dayalan H, Levitsky S. 2009. Injection of isolated mitochondria during early reperfusion for cardioprotection. AM J PHYSIOL-HEART C 296:H94-H105. doi:10.1152/ajpheart.00567.2008

      Pacak CA, Preble JM, Kondo H, Seibel P, Levitsky S, Del NP, Cowan DB, McCully JD. 2015. Actin-dependent mitochondrial internalization in cardiomyocytes: evidence for rescue of mitochondrial function. BIOL OPEN 4:622-626. doi:10.1242/bio.201511478

      Shanmughapriya S, Langford D, Natarajaseenivasan K. 2020. Inter and Intracellular mitochondrial trafficking in health and disease. AGEING RES REV 62:101128. doi: 10.1016/j.arr.2020.101128

      Shin B, Saeed MY, Esch JJ, Guariento A, Blitzer D, Moskowitzova K, Ramirez-Barbieri G, Orfany A, Thedsanamoorthy JK, Cowan DB, Inkster JA, Snay ER, Staffa SJ, Packard AB, Zurakowski D, Del NP, McCully JD. 2019. A Novel Biological Strategy for Myocardial Protection by Intracoronary Delivery of Mitochondria: Safety and Efficacy. JACC-BASIC TRANSL SC 4:871-888. doi: 10.1016/j.jacbts.2019.08.007

      Tashiro R, Bautista-Garrido J, Ozaki D, Sun G, Obertas L, Mobley AS, Kim GS, Aronowski J, Jung JE. 2022. Transplantation of Astrocytic Mitochondria Modulates Neuronal Antioxidant Defense and Neuroplasticity and Promotes Functional Recovery after Intracerebral Hemorrhage. J NEUROSCI 42:7001-7014. doi:10.1523/JNEUROSCI.2222-21.2022

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      In this study, the authors investigate the tolerance of aminoglycosides in E. coli mutants deleted in the Krebs cycle and respiratory chain enzymes. The motivation for this study is unclear. Transport of aminoglycosides is pmf-dependent, as the authors correctly note, and knocking out energy-producing components leads to tolerance of aminoglycosides, this has been well established. In S. aureus, clinically relevant "small colony" strains selected for in the course of therapy with aminoglycosides acquire null mutations in the biosynthesis of heme or ubiquinone, and have been studied in detail. In E. coli, such knockouts have not been reported in clinical isolates, probably due to severe fitness costs.

      Response: We sincerely appreciate the time and consideration the reviewer dedicated to evaluating our manuscript. It's important to highlight that while the transport of aminoglycosides is PMF-dependent, recent studies underscore the potential role of metabolic mutations in antibiotic tolerance, a facet that warrants further investigation. For instance, the study by Heinemann’s and Michiels' groups explored genomic changes in E. coli strains (including uropathogenic UTI89 strains) subjected to daily antibiotic exposure (Van den Bergh et al., 2022). Notably, mutations predominantly occurred in genes of the nuo operon, a key component of E. coli energy metabolism, suggesting a link between metabolic adaptations and antibiotic tolerance. Furthermore, the research by Collins' group revealed previously unrecognized genes related to central metabolism (e.g., icd, gltD, sucA) that contribute to antibiotic resistance in E. coli cells exposed to multiple antibiotics, including aminoglycosides (Lopatkin et al., 2021). These findings are corroborated by the presence of similar mutations in clinical E. coli pathogens, as evidenced by the analysis of a large library of 7243 E. coli genomes from NCBI Pathogen Detection (Lopatkin et al., 2021). The clinical relevance of metabolic mutations in antibiotic tolerance is increasingly recognized, yet their underlying mechanisms remain enigmatic. Therefore, elucidating the role of metabolic pathways in conferring antibiotic tolerance is highly critical. We have updated the introduction to clearly convey our motivation in this study (see page 4).

      At the same time, single-cell analysis has shown that individual cells with a decrease in the expression of Krebs cycle enzymes are tolerant of antibiotics and have lower ATP (Manuse et al., PLoS Biol 19: e3001194). The authors of the study under review report that knocking out ICD, isocitrate dehydrogenase that catalyzes the rate-limiting step in the Krebs cycle, has little effect on aminoglycoside tolerance and actually leads to an increase in the level of ATP over time. This observation does not seem to make much sense and contradicts previous reports, specifically that E. coli ICD is tolerant of antibiotics and, not surprisingly, produces less ATP (Kabir and Shimizu, Appl Microbiol Biotechnol. 2004; 65(1):84-96; Manuse et al., PLoS Biol 19: e3001194). Mutations in other Krebs cycle enzymes, unlike ICD, do lead to a dramatic increase in tolerance of aminoglycosides according to the paper under review. This is all very confusing.

      Response: Although our data cannot be directly compared to that of Kabir and Shimizu (Mohiuddin Kabir and Shimizu, 2004), due to the utilization of entirely different experimental procedures and measurement techniques, we can draw some parallels to the study conducted by Lewis’ group (Manuse et al., 2021), despite certain differences in experimental protocols. Furthermore, the reviewer has made strong assertions regarding our manuscript based on the findings of Lewis’ group. Thus, we believe it's pertinent to expand our response regarding that study.

      In the study of Lewis’ group, bacterial cells were inoculated at a ratio of 1:100 into LB medium from an overnight culture (approximately 16 hours). Subsequently, the cultures were incubated at 37°C for approximately 2 hours, and ATP levels were measured using the BacTiter Glo kit (Promega, Madison, WI, USA). ATP levels were then normalized to cell density, determined through optical density measurements, and represented on a linear diagram. As demonstrated in Supplementary Figure S1c of their paper, there was a 10-15% reduction in normalized ATP levels in the icd mutant compared to the wild type. In our experiments, cells were grown for 24 hours in overnight cultures, diluted 100-fold in fresh media, and ATP levels were measured at 3, 4, 5, and 6 hours using the same kit. ATP levels were normalized to cell counts quantified by flow cytometry. Upon analyzing our data of the icd mutant for around 3 hours (the time point closest to that of the study of Lewis’ group), we observed a reduction of approximately 15-20% (without statistical significance) in the icd mutant compared to the wild-type (see raw data, linear plot, and logarithmic plot below; Author response image 1), which aligns with the findings of Lewis’ group.
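
      The comparison described above rests on two steps: normalizing the ATP signal to a measure of cell number, then expressing the mutant value as a percent reduction relative to wild type. A minimal sketch of that calculation follows; the function names and all input readings are hypothetical placeholders, not data from either study:

```python
# Sketch of ATP normalization and relative reduction.
# All names and numbers here are hypothetical, chosen only for illustration.

def normalized_atp(luminescence: float, cell_count: float) -> float:
    """ATP signal per cell (luminescence units divided by cell count)."""
    if cell_count <= 0:
        raise ValueError("cell_count must be positive")
    return luminescence / cell_count

def percent_reduction(mutant: float, wild_type: float) -> float:
    """Percent decrease of the mutant value relative to the wild-type value."""
    return 100.0 * (wild_type - mutant) / wild_type

# Hypothetical readings for one time point.
wt_atp = normalized_atp(luminescence=2.0e6, cell_count=1.0e6)
icd_atp = normalized_atp(luminescence=1.7e6, cell_count=1.0e6)

reduction = percent_reduction(icd_atp, wt_atp)
print(f"ATP per cell is {reduction:.1f}% lower in the mutant")
```

      The same functions apply whether the denominator is optical density or a flow cytometry cell count, although the two normalizations are not numerically interchangeable.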

      We further investigated the gentamicin tolerance of both wild-type and icd mutant strains of E. coli BW25113 (Author response image 2). Our findings indicate that the increased gentamicin sensitivity of the icd mutant of the MG1655 strain is also observed in this second E. coli strain.

      Author response image 1.

      ATP levels in the icd mutant. ATP levels of both the mutant and wild-type strains were measured at t=3 hours of cell growth and normalized to cell counts. The figure presents the raw data (a), linear plot (b), and logarithmic plot (c) of the same dataset. This data corresponds to the first panel of Figure 3B in the manuscript.

      Author response image 2.

      Gentamicin tolerance of wild-type and icd mutant strains of E. coli BW25113. Both wild type and mutant strains were treated with gentamicin (50 µg/ml) for 5 hours at the mid-exponential phase. Cells were plated before and after treatment for CFU/ml counts. The dashed line represents the limit of detection. CFU: Colony forming units.
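
      For readers unfamiliar with this type of assay, the survival fraction behind such a plot can be sketched as follows. This is only an illustration under stated assumptions: the function name, the handling of counts below the limit of detection, and all numbers are hypothetical and are not taken from our analysis pipeline.

```python
def survival_fraction(cfu_after, cfu_before, limit_of_detection):
    """Return (fraction, at_limit) for one strain.

    Counts below the limit of detection are reported at the limit, so the
    returned fraction is then an upper bound rather than a measurement.
    """
    if cfu_before <= 0:
        raise ValueError("cfu_before must be positive")
    at_limit = cfu_after < limit_of_detection
    effective = limit_of_detection if at_limit else cfu_after
    return effective / cfu_before, at_limit


# Hypothetical CFU/ml values, for illustration only (not measured data).
fraction, at_limit = survival_fraction(
    cfu_after=2.0e3, cfu_before=1.0e8, limit_of_detection=1.0e2
)
print(f"survival fraction: {fraction:.1e} (at detection limit: {at_limit})")
```

      Flagging values at the limit of detection separately keeps an upper bound from being read as a measured survival level.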

      We think that there are two primary reasons why our study cannot contradict the findings of the Lewis group:

      Firstly, our study cannot be directly compared to theirs, as they did not comprehensively explore the impact of gene deletions on cell metabolism beyond the measurement of ATP levels at a single time point (Manuse et al., 2021). Our study encompasses various metabolic parameters such as cellular ATP, redox status, proton motive force (PMF), intracellular pH, and drug uptake throughout the exponential and/or early stationary phase. Additionally, we conducted proteomic analysis for five different strains including mutants and wild type. Moreover, we performed pathway enrichment analysis grounded in the statistical background of the entire genome, encompassing various functional pathway classification frameworks such as Gene Ontology annotations, KEGG pathways, and Uniprot keywords. The results of these pathway enrichment analyses are now available in the Supplementary File (see Supplementary Tables 11-17 in the current manuscript). Thus, we believe it is unjust to deem our study contradictory compared to the Lewis group's study, which does not have a comprehensive analysis of the metabolism of the mutant strains they investigated.

      Secondly, our study cannot be compared to that specific study (Manuse et al., 2021) due to the utilization of a distinct antibiotic (ciprofloxacin). Cell tolerance is heavily reliant on the mechanism of action of the antibiotic used. Therefore, the reviewer should have focused on studies closely related to aminoglycoside tolerance. Our study is not confusing or contradictory, as Lewis’ group also demonstrated that the tolerance of the icd mutant to gentamicin was significantly reduced while the tolerance of other TCA cycle mutant strains was increased in a different study (Shan et al., 2015). However, they did not delve into the metabolism of these mutant strains, as we did. We now mention this point in our manuscript (see pages 14-15).

      Apart from the confusing data, it is not clear what useful information may be obtained from the choice of the experimental system. The authors examine exponentially growing cells of E. coli for tolerance of aminoglycosides. The population at this stage of growth is highly susceptible to aminoglycosides, and only some rare persister cells can survive. However, the authors do not study persisters. A stationary population of E. coli is tolerant of aminoglycosides, and this is clinically relevant, but this is not the subject of the study.

      Response: Respectfully, we must express our disagreement with the reviewer's comments. Our experimental system is meticulously organized and logically structured. Mutant strains such as gltA, sucA, and nuoI deletions exhibit increased tolerance to all aminoglycosides tested, with their fractions clearly increasing around the mid-exponential phase between 3-4 hours (refer to Figure 2B in our manuscript). This surge in tolerance is evident at the population level as well (as depicted in Figure 1A in our manuscript, where certain mutant strains demonstrate complete survival to streptomycin, with survival fractions nearing 1). Given the pronounced increase observed around the mid-exponential phase, we primarily characterize the metabolism of these cells during this growth phase.

      It's essential to note that any investigation into antibiotic tolerance and/or resistance holds immense significance, regardless of the growth phase under scrutiny, as antibiotic tolerance/resistance poses a substantial healthcare challenge. Additionally, metabolic mutant strains do not necessarily entail severe fitness costs, as evidenced by Figure S2A published by the Lewis group (Manuse et al., 2021), a finding consistent with our study (see Figure 2B in our manuscript). This phenomenon could confer a survival advantage to bacterial cells, as they may acquire metabolic mutations to bolster their tolerance without incurring significant fitness costs. Furthermore, numerous studies suggest that bacterial cells may opt for the evolutionary pathway leading to increased tolerance before acquiring resistance mechanisms (Levin-Reisman et al., 2017; Santi et al., 2021). The presence of metabolic mutations in clinical E. coli pathogens has also been confirmed through the analysis of a large library of 7243 E. coli genomes from NCBI Pathogen Detection by Collins’ group (Lopatkin et al., 2021). Consequently, comprehending the tolerance mechanisms of metabolic mutations holds paramount importance.

      References

      Levin-Reisman I, Ronin I, Gefen O, Braniss I, Shoresh N, Balaban NQ. 2017. Antibiotic tolerance facilitates the evolution of resistance. Science 355:826–830. doi:10.1126/science.aaj2191

      Lopatkin AJ, Bening SC, Manson AL, Stokes JM, Kohanski MA, Badran AH, Earl AM, Cheney NJ, Yang JH, Collins JJ. 2021. Clinically relevant mutations in core metabolic genes confer antibiotic resistance. Science 371. doi:10.1126/science.aba0862

      Manuse S, Shan Y, Canas-Duarte SJ, Bakshi S, Sun WS, Mori H, Paulsson J, Lewis K. 2021. Bacterial persisters are a stochastically formed subpopulation of low-energy cells. PLoS Biol 19. doi:10.1371/journal.pbio.3001194

      Mohiuddin Kabir M, Shimizu K. 2004. Metabolic regulation analysis of icd-gene knockout Escherichia coli based on 2D electrophoresis with MALDI-TOF mass spectrometry and enzyme activity measurements. Appl Microbiol Biotechnol 65:84–96. doi:10.1007/s00253-004-1627-1

      Santi I, Manfredi P, Maffei E, Egli A, Jenal U. 2021. Evolution of antibiotic tolerance shapes resistance development in chronic Pseudomonas aeruginosa infections. mBio. doi:10.1128/mBio.03482-20

      Shan Y, Lazinski D, Rowe S, Camilli A, Lewis K. 2015. Genetic basis of persister tolerance to aminoglycosides in Escherichia coli. mBio 6. doi:10.1128/mBio.00078-15

      Van den Bergh B, Schramke H, Michiels JE, Kimkes TEP, Radzikowski JL, Schimpf J, Vedelaar SR, Burschel S, Dewachter L, Lončar N, Schmidt A, Meijer T, Fauvart M, Friedrich T, Michiels J, Heinemann M. 2022. Mutations in respiratory complex I promote antibiotic persistence through alterations in intracellular acidity and protein synthesis. Nat Commun 13:546. doi:10.1038/s41467-022-28141-x

      Reviewer #2 (Public Review):

      Summary:

      This interesting study challenges a dogma regarding the link between bacterial metabolism decrease and tolerance to aminoglycosides (AG). The authors demonstrate that mutants well-known for being tolerant to AG, such as those of complexes I and II, are not so due to a decrease in the proton motive force (PMF) and thus antibiotic uptake, as previously reported in the literature.

      Strengths:

      This is a complete study. These results are surprising and are based on various read-outs, such as ATP levels, pH measurement, membrane potential, and the uptake of fluorophore-labeled gentamicin. Utilizing a proteomic approach, the authors show instead that in tolerant mutants, there is a decrease in the levels of proteins associated with ribosomes (targets of AG), causing tolerance.

      Response: We sincerely appreciate the reviewer for taking the time to read our manuscript and offer valuable suggestions.

      Weaknesses:

      The use of a single high concentration of aminoglycoside: my main comment on this study concerns the use of an AG concentration well above the MIC (50 µg/ml or 25 µg/ml for uptake experiments), which is 10 times higher than previously used concentrations (Kohanski, Taber) in studies showing a link with PMF. This significant difference may explain the discrepancies in results. Indeed, a high concentration of AG can mask the effects of a metabolic disruption and lead to less specific uptake. However, this concentration highlights a second molecular level of tolerance. Adding experiments using lower concentrations (we propose 5 µg/ml to compare with the literature) would provide a more comprehensive understanding of AG tolerance mechanisms during a decrease in metabolism.

      Another suggestion would be to test iron limitation (using an iron chelator such as DIP), which has been shown to induce AG tolerance. Can the authors demonstrate if this iron limitation leads to a decrease in ribosomal proteins? This experiment would validate their hypothesis in the case of a positive result. Otherwise, it would help distinguish two types of molecular mechanisms for AG tolerance during a metabolic disruption: (i) PMF and uptake at low concentrations, (ii) ribosomal proteins at high concentrations.

      Response: While we acknowledge the intriguing possibility of exploring whether iron limitation results in a reduction of ribosomal proteins, we believe that this topic falls slightly outside the scope of our current study. This area warrants independent investigation since our current research did not specifically focus on iron-limited environments (LB medium is iron-rich, as referenced (Abdul-Tehrani et al., 1999; Rodríguez-Rojas et al., 2015)). However, we fully concur with the notion that experimental outcomes may be contingent upon the concentration of aminoglycosides (AG). Hence, we repeated the critical experiments using a lower concentration of gentamicin (5 µg/mL), as suggested by the reviewer. Before delving into a discussion of these results, we wish to emphasize two key points. Firstly, the majority of our metabolic measurements, including ATP levels, redox activities, intracellular pH, and metabolomics, were conducted in mutant and wild-type cells in the absence of drugs. Our objective was to elucidate the impact of genetic perturbations of the TCA cycle on cell metabolism. Secondly, it's important to emphasize that our study does not invalidate the hypothesis that AG uptake is proton motive force (PMF)-dependent. We observed similar drug uptake across the strains tested, which is reasonable considering that their energy metabolism and PMF are not significantly altered compared to the wild type (at least we did not observe a consistent trend in their metabolic levels). Consequently, our study does not necessarily contradict previous claims (Taber et al., 1987). We have now clarified this point in the manuscript (see pages 1 and 13).

      When we employed a lower gentamicin concentration, we still noted a significant elevation in tolerance among the gltA, sucA, and nuoI mutant strains compared to the wild type. Also, it remained evident that the observed tolerance in the mutant strains cannot be ascribed to differences in drug uptake or impaired PMF, as the levels of drug uptake and the disruption of PMF by gentamicin (at lower concentrations) in the mutant strains were comparable to those of the wild type. Moreover, since our metabolic measurements and proteomics analyses failed to reveal any notable alterations in energy metabolism in these strains, the consistency in drug uptake levels across both mutant and wild-type strains, even at lower concentrations, further bolsters the validity of our findings obtained at higher gentamicin concentrations. The new results have been incorporated into the Supplementary file (see Supplementary Figures S1, S5, S7, and S9) and discussed throughout the manuscript.

      Recommendations for the authors:

      Reviewer #2 (Recommendations For The Authors):

      Line 120: Luria-Bertani (LB), use Lysogeny Broth.

      Line 180: "RSG dye can be reduced by bacterial reductases of PMF" to be reformulated.

      Response: The suggested corrections have been incorporated into the manuscript.

      References

      Abdul-Tehrani H, Hudson AJ, Chang Y, Timms AR, Hawkins C, Williams JM, Harrison PM, Guest JR, Andrews SC. 1999. Ferritin mutants of Escherichia coli are iron deficient and growth impaired, and fur mutants are iron deficient. J Bacteriol.

      Rodríguez-Rojas A, Makarova O, Müller U, Rolff J. 2015. Cationic Peptides Facilitate Iron-induced Mutagenesis in Bacteria. PLoS Genet 11. doi:10.1371/journal.pgen.1005546

      Taber HW, Mueller JP, Miller PF, Arrow AS. 1987. Bacterial uptake of aminoglycoside antibiotics. Microbiol Rev 51:439–457. doi:10.1128/mr.51.4.439-457.1987

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Li et al describe a novel form of melanosome based iridescence in the crest of an Early Cretaceous enantiornithine avialan bird from the Jehol Group.

      Strengths:

      Novel set of methods applied to the study of fossil melanosomes.

      Weaknesses:

      (1) Firstly, several studies have argued that these structures are in fact not a crest, but rather the result of compression. Otherwise, it would seem that a large number of Jehol birds have crests that extend not only along the head but the neck and hindlimb. It is more parsimonious to interpret this as compression as has been demonstrated using actuopaleontology (Foth 2011).

      Firstly, we respectfully acknowledge the reviewer’s interpretation.

      However, the new specimen we report here differs, as preserved, from Confuciusornis (Foth 2011), which belongs to a different clade and exhibits a feather crest that is preserved differently and has a different shape from that of the species described in this study. In the specimen the referee refers to (Figure 3a in Foth 2011, Paläontologische Zeitschrift), the cervical feathers are much longer than the feathers of the head region; the crest is quite incompletely preserved and much shorter in proportion to the skull than in the specimen we sampled (see picture below).

      Author response image 1.

      Our new specimen, with a well-preserved feather crest that we interpret as retaining its original shape; the cervical feathers are largely absent or very short.

      In the new specimen there is a large feather crest that gradually extends from the cranial region of the fossil bird, rather than the cervical region, as observed in the previously proposed Confuciusornis crest. The feather crest extends in a consistent direction (caudodistally), and the feathers in the head region of the bird are exceptionally well-preserved, retaining their original shape. The feathers measure about 1-2 cm at their longest barb. Feathers in the neck are much shorter (see Confuciusornis picture above).

      (2) The primitive morphology of the feather with their long and possibly not interlocking barbs also questions the ability of such feathers to be erected without geologic compression.

      We acknowledge that the specimen must have undergone some degree of compression during diagenesis and fossilization. Given that the rachis, to which the ligaments everting a crest would attach, is already sufficiently thick, we conclude that the crest had the structural integrity to remain erect on the skull.

      (3) The feather is not in situ and therefore there is no way to demonstrate unequivocally that it is indeed from the head (it could just as easily be a neck feather)

      We conclude that it belongs to the head based on the similar suture, overall length, and its close position to the caudal part of the head. There are no similar types of feathers nearby, such as those found on the neck or other areas, which is why we reason that it is a head crest feather. Besides, the shape of the feather we sampled is dramatically different from the much softer and shorter ones detected on the neck.

      In addition, we sampled a crest feather barb from the in situ preserved feather crest and detected a melanosome packing pattern similar to the one we originally found. This is now added to the text.

      (4) Melanosome density may be taphonomic; in fact, in an important paper that is notably not cited here (Pan et al. 2019) the authors note dense melanosome packing and attribute it to taphonomy. This paper describes densely packed (taphonomic) melanosomes in non-avian avialans, specifically stating, "Notably, we propose that the very dense arrangement of melanosomes in the fossil feathers (Fig. 2 B, C, and G-I, yellow arrows) does not reflect in-life distribution, but is, rather, a taphonomic response to postmortem or postburial compression" and if this paper was taken into account it seems the conclusions would have to change drastically. If in this case the density is not taphonomic, this needs to be justified explicitly (although clearly these Jehol and Yanliao fossils are heavily compressed).

      We have added a line acknowledging this possibility. We have accounted for the shrinkage effects caused by heat and compression, as detailed in our Supplementary Information (SI) file. Even when these changes are considered, they do not alter the main conclusions of our study. Besides, given that the melanosomes we used for simulation are mostly complete and well preserved, we consider the distortion to be rather limited, or at least minor compared to the changes seen in taphonomic experiments.

      (5) Color in modern birds is affected by the outer keratin cortex thickness which is not preserved but the authors note the barbs are much thicker (10 µm) than extant birds; this surely would have affected color so how can the authors be sure about the color in this feather?

      In extant birds, feather barbs of similar size are primarily composed of air spaces and quasi-ordered keratin structures, largely lacking dense melanosomes. The color-producing barb we have described here does not directly correspond to a feather type in modern birds for comparison. Since there is no direct extant analog to inform the keratin thickness and similar melanosome density, we apply an advanced 3-D FDTD modeling approach to the question of coloration reconstruction, rather than relying on statistical DFA approaches. In addition to the packed melanosomes, the thin external keratin cortex layer is also considered in the simulation.

      Additionally, even in the thinner melanosome-packed layers of barbules in living birds, iridescent coloration is often observed (e.g., Maia et al., J. R. Soc. Interface 2009). This further supports the plausibility of our modeling approach and its relevance to understanding coloration in both extinct and extant species.

      (6) Authors describe very strange shapes that are not present in extant birds: "...different from all other known feather melanosomes from both extant and extinct taxa in having some extra hooks and an oblique ellipse shape in cross and longitudinal sections of individual melanosome" but again, how can it be determined that this is not the result of taphonomic distortion?

      We consistently observed similar hook-like structures not only in this feather but also in feathers from different positions of the crest. We do not believe that distortion would produce such a regular and consistent pattern; instead, distortion likely would result in random alterations, as demonstrated by prior taphonomic experiments.

      (7) The authors describe the melanosomes as hexagonally packed but this does not appear to be in fact the case, rather appearing quasi-periodic at best, or random. If the authors could provide some figures to justify this hexagonal interpretation?

      To further validate the regional hexagonal pattern, we expanded our sampling to additional sites. We observed similar patterns not only in various regions of the same barb but also across different feathers (see added SI Figures below). This extensive sampling supports the validity of the melanosome patterns identified in our original analysis.

      (8) One way to address these concerns would be to sample some additional fossil feathers to see if this is unique or rather due to taphonomy

      We sampled additional areas from the same feather as well as feathers from other regions of the head crest. The packing patterns are generally similar with slight variations in size (figure S6).

      (9) On a side note, why are the feet absent in the CT scan image?

      To achieve better image resolution, the field of view was adjusted, resulting in part of the feet being excluded from the CT scan.

      Reviewer #2 (Public review):

      Summary:

      The authors reconstructed the three-dimensional organization of melanosomes in fossilized feathers belonging to a spectacular specimen of a stem avialan from China. The authors then proceed to infer the original coloration and related ecological implications.

      Strengths:

      I believe the study is well executed and well explained. The methods are appropriate to support the main conclusions. I particularly appreciate how the authors went beyond the simple morphological inference and interrogated the structural implications of melanosome organization in three dimensions. I also appreciate how the authors were upfront with the reliability of their methods, results, and limitations of their study. I believe this will be a landmark study for the inference of coloration in extinct species and how to interrogate its significance in the future.

      We thank the referee for these positive comments.

      Weaknesses:

      I have a few minor comments.

      Introduction: I would suggest the authors move the paragraph on coloration in modern birds (lines 75-97) before line 64, as this is part of the reasoning behind the study. I believe this change would improve the flow of the introduction for the general reader.

      We thank the referee for the suggestion, and we made changes accordingly to improve the flow of introduction.

      Melanosome organization: I was surprised to find little information in the main text regarding this topic. As this is one of the major findings of the study, I would suggest the authors include more information regarding the general geometry/morphology of the single melanosomes and their arrangement in three dimensions.

      We thank the referee for this suggestion. We elaborated on the details of the melanosomes in the results as follows:

      Hooks are commonly observed on the oval-shaped melanosomes in cross-sectional views, with two dominant types identified on the dorsal and ventral sides (Figure 3c-d, red arrows). These hooks are deflected in opposing directions, linking melanosomes from different arrays (dorsal-ventral) together. The major axis (y) of the oval-shaped melanosomes (mean = 283 nm) is oriented toward the left side in cross-section, while the shorter axis (x) measures approximately 186 nm (Table S2). In oblique or near-longitudinal sections (Figure 3e-f), the hooked structures’ connections to the distal and proximal sides of neighboring melanosomes are clearly visible (blue arrows, Figure 3f). A similar pattern occurs in two additional regions of interest within the same feather (figure S5). Although the smaller proximal hooks in these sections are less distinct, this may reflect developmental variation during melanosome formation along the feather barb. Significantly smaller hooks were also observed in cross-sections of in-situ feather barbs from the anterior side of the feather crest (figure S6). The mean long axis (z) of the melanosomes is approximately 1774 nm (Table S2). Based on these observations, we propose that the hooked structures, particularly those on the dorsal, ventral, proximal, and distal sides of the melanosomes, enhance the structural integrity of the barb (figure S7). However, these features may be teratological and unique to this individual, as no similar structures have been reported in other sampled feathers. These hooks may stabilize the stacked melanosome rods and contribute to increased barb dimensions, such as diameter and length. The sections exhibit modified (or asymmetric) hexagonally packed melanosomes with the presence of extra hooked linkages (Figure 3c-d and e-f).
      The long rod-like melanosomes are different from all other known feather melanosomes from both extant and extinct taxa in having some extra hooks and an oblique ellipse shape in cross and longitudinal sections of individual melanosomes (Durrer 1986; Zhang, Kearns et al. 2010). The asymmetric packing of the melanosomes (the major axis leans leftward) played a major role in the reduction of fossilized keratinous matrix within the barbs, which may correspond to a novel structural coloration in this extinct bird. The close-packed hexagonal melanosome pattern found in extant avian feathers yields rounded melanosome outlines, in contrast to the oval-shaped melanosomes (see figure S8, x<y) in the perpendicular section here. The asymmetric compact hexagonal packing (ACHP) of the melanosomes is different from the known pattern of melanosomes formed in the structure of barbules among extant birds (Eliason and Shawkey 2012), which has been seen as a regular hexagonal organization. The packing of the melanosomes in an asymmetric pattern, on the microscopic level, might be related to the asymmetrical path of the barb extension direction observed at the macroscopic level (figure S5).
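
      To make the contrast between oval and rounded outlines concrete, the aspect ratios implied by the mean dimensions quoted above can be worked out directly. The short sketch below uses only the three mean values from Table S2 as cited in the text; the variable and function names are hypothetical, and the calculation is an illustration rather than part of the reported analysis.

```python
# Mean melanosome dimensions quoted in the text above (Table S2), in nm.
SHORT_AXIS_X_NM = 186.0   # shorter cross-sectional axis (x)
MAJOR_AXIS_Y_NM = 283.0   # major cross-sectional axis (y)
LONG_AXIS_Z_NM = 1774.0   # long axis of the rod (z)


def aspect_ratio(longer, shorter):
    """Return longer/shorter; a value of 1.0 would be a circular outline."""
    if shorter <= 0:
        raise ValueError("shorter must be positive")
    return longer / shorter


cross_section_ratio = aspect_ratio(MAJOR_AXIS_Y_NM, SHORT_AXIS_X_NM)
rod_ratio = aspect_ratio(LONG_AXIS_Z_NM, MAJOR_AXIS_Y_NM)

print(f"cross-section y/x: {cross_section_ratio:.2f}")
print(f"rod length z/y:    {rod_ratio:.2f}")
```

      A cross-sectional ratio well above 1 is what distinguishes the oval outlines described here from the rounded outlines expected for regular hexagonal packing.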

      Added Supplemental figure S5. STEM images of cross-sections taken from three different positions (indicated by white dashed lines in a) demonstrate similar melanosome packing styles. Dashed lines in (a) indicate the positions at which these sections were taken; black arrows indicate the individual barbs that accumulated together in this long crest feather. One distinct feature of these sections is the hooked-link structure that aligns the melanosomes into a modified hexagonal, packed arrangement. White arrows (in c, e, g) indicate the hooked structures observed in the selected melanosomes.

      Added Supplemental figure S6. STEM images showing melanosome structure from three fragments of the feather crest (indicated by dashed lines and white box in a) reveal the hooked linkages between melanosomes and their neighboring melanosomes in (b), (c) and (d). Due to the shorter length of these feather barbs, the hook structures are not as well-defined as those in the longer feather samples shown in the main text.

      Keratin: the authors use such a term pretty often in the text, but how is this inference justified in the fossil? Can the authors extend on this? Previous studies suggested the presence of degradation products deriving from keratin, rather than immaculated keratin per se.

      We now use "keratinous matrix" and "keratinous material" instead. We observed that the spaces between these melanosomes are filled by organic-rich material, which we propose may be taphonomically altered keratin.

      Ontogenetic assessment: the authors infer a sub-adult stage for the specimen, but no evidence or discussion is reported in the SI. Can the authors describe and discuss their interpretations?

      Thanks for the suggestion. We made an osteo-histological section and added our evaluation of the histology of the femoral bone tissue sampled from the specimen to justify our assessment of its ontogenetic stage.

      See Supplemental figure S2 for Femur Osteo-Histology

      SI file Femur Osteo-Histology

      Ground sections were acquired from the right side of the femur to assess the osteo-histological features of the bone and its ontogenetic stage. As shown in figure S2, long, flat-shaped lacunae are widely present and densely packed throughout the major part of the bone section. Very few secondary osteocytes are present, and parallel-fibered bone tissue is underdeveloped. The flattened osteocyte lacunae dominate the cellular shape, with observable vascular canals connecting different lacunae. Overall, the osteo-histology indicates that the bird was still in an active growth stage at the time of death, suggesting it was in its sub-adult growth phase.

      CT scan data: these data should be made freely available upon publication of the study.

      We will release our CT scanning on an open server (https://osf.io/kw7sd/) along with the final version of the manuscript.

      Reviewer #3 (Public review):

      Summary:

      The paper presents an in-depth analysis of the original colour of a fossil feather from the crest of a 125-million-year-old enantiornithine bird. From its shape and location, it would be predicted that such a feather might well have shown some striking colour and pattern. The authors apply sophisticated microscopic and numerical methods to determine that the feather was iridescent and brightly coloured and possibly indicates this was a male bird that used its crest in sexual displays.

      Strengths:

      The 3D micro-thin-sectioning techniques and the numerical analyses of light transmission are novel and state-of-the-art. The example chosen is a good one, as a crest feather is likely to have carried complex and vivid colours as a warning or for use in sexual display. The authors correctly warn that without such 3D study feather colours might be given simply as black from regular 2D analysis, and the alignment evidence for iridescence could be missed.

      Weaknesses: Trivial.

      Recommendations for the authors:

      Reviewer #3 (Recommendations for the authors):

      In a few places, the paper can be strengthened:

      Dimensionality of study method: In the first paragraph, you set things up (lines 60-62) to say that studies hitherto have been of melanosomes and packing in two dimensions... and I then expect you to say soon after, in the next paragraph, 'Here, we investigate a fossil feather in three dimensions...' or some such, but you don't.

      You come back to Methods at the end of the Introduction (lines 97-101), but again do not say whether you model the feather in three dimensions or not. Yes, you did - I finally learned at line 104 - you did micro serial sectioning. This needs to shift a long way forward into the Introduction.

      Thanks for the suggestions. We used serial sectioning to obtain different views of the microbodies that are proposed to be melanosomes, and reconstructed the three-dimensional volume of the melanosomes as well as the intercalated keratin.

      We restructured the introduction to make clear, at an earlier position in the text, that the three-dimensional data obtained in this study were also used for modeling.

      In the Results, there are not enough references to images. It's not enough to refer generally to 'Figures 3c-f' [line 133] and then go on to rapidly step through some amazing imagery (text lines 133-146) - you need to add an image citation to each observation so readers can know exactly which image is being described each time.

      We elaborated our description of the imaging to better describe the melanosomes in our results section. We added the description of the stacked melanosomes given above (see our reply to Reviewer #2).

      The 3D data in Figures 3 and 4 is great and based on huge technical wizardry. The sketch model in Figure 4a is excellent, but could you not attempt an actual 3D block diagram showing the hexagonal arrangement of clusters of aligned melanosomes?

      We also applied FIB-SEM at an additional location to validate our ultrathin-section data; see the SI file.

      We added Figure S7: targeted feather barb block prepared in FIB-SEM, with volume rendering reconstruction based on the acquired sequential cross-sectional images; the volume reconstruction is visualized in the x-y plane (c, cross-section view) and in the x-z plane (d, sagittal section view).

      Modified Figure S8d shows the 3D model of aligned melanosomes. To show the arrangement more clearly, the schematic XY cross-section of the melanosomes 3D model is shown below (also shown in Supplementary Figure S8d).

      35: delete 'yield'

      Changed

      73: 'feather fell' ? = 'feather that has fallen'

      Changed

      305: excises ?= exercises

      Changed

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      Tian et al. describe how TIPE regulates melanoma progression, stemness, and glycolysis. The authors link high TIPE expression to increased melanoma cell proliferation and tumor growth. TIPE causes dimerization of PKM2, as well as translocation of PKM2 to the nucleus, thereby activating HIF-1alpha. TIPE promotes the phosphorylation of S37 on PKM2 in an ERK-dependent manner. TIPE is shown to increase stem-like phenotype markers. The expression of TIPE is positively correlated with the levels of PKM2 Ser37 phosphorylation in murine and clinical tissue samples. Taken together, the authors demonstrate how TIPE impacts melanoma progression, stemness, and glycolysis through dimeric PKM2 and HIF-1alpha crosstalk.

      Strengths:

      The authors manipulated TIPE expression using both shRNA and overexpression approaches throughout the manuscript. Using these models, they provide strong evidence of the involvement of TIPE in mediating PKM2 Ser37 phosphorylation and dimerization. The authors also used mutants of PKM2 at S37A to block its interaction with TIPE and HIF-1alpha. In addition, an ERK inhibitor (U0126) was used to block the phosphorylation of Ser37 on PKM2. The authors show how dimerization of PKM2 by TIPE causes nuclear import of PKM2 and activation of HIF-1alpha and target genes. Pyridoxine was used to induce PKM2 dimer formation, while TEPP-46 was used to suppress PKM2 dimer formation. TIPE maintains stem cell phenotypes by increasing the expression of stem-like markers. Furthermore, the relationship between TIPE and Ser37 PKM2 was demonstrated in murine and clinical tissue samples.

      Weaknesses:

      The evaluation of how TIPE causes metabolic reprogramming can be better assessed using isotope tracing experiments and improved bioenergetic analysis.

      Thank you very much for your suggestions. Unfortunately, we could not complete the isotope tracing experiments because we lack the necessary instruments, and we were unable to arrange them externally after consulting several companies. We apologize for this limitation, which we have discussed in the manuscript. Moreover, owing to our oversight, only three metabolites were presented in the previous version of the manuscript. However, we have now performed routine untargeted metabolomics to demonstrate how TIPE causes metabolic reprogramming. We have added the detailed results as a new figure, Figure S3, which shows that glycolysis pathway metabolites, particularly pyruvate and lactic acid, are decreased after TIPE interference.

      Reviewer #2 (Public Review):

      In this article, Tian et al present a convincing analysis of the molecular mechanisms underpinning TIPE-mediated regulation of glycolysis and tumor growth in melanoma. The authors begin by confirming TIPE expression in melanoma cell lines and identify "high" and "low" expressing models for functional analysis. They show that TIPE depletion slows tumour growth in vivo, and using both knockdown and over-expression approaches, show that this is associated with changes in glycolysis in vitro. Compelling data using multiple independent approaches are presented to support an interaction between TIPE and the glycolysis regulator PKM2, and to show that over-expression of TIPE promoted nuclear translocation of PKM2 dimers. Mechanistically, the authors also demonstrate that PKM2 is required for TIPE-mediated activation of HIF1a transcriptional activity, as assessed using an HRE-promoter reporter assay, and that TIPE-mediated PKM2 dimerization is p-ERK dependent. Finally, the dependence of TIPE activity on PKM2 dimerization was demonstrated on tumor growth in vivo and in the regulation of glycolysis in vitro, and ectopic expression of HIF1a could rescue the inhibition of PKM2 dimerization in TIPE overexpressing cells and reduced induction of general cancer stem cell markers, showing a clear role for HIF1a in this pathway. The main conclusions of this paper are well supported by data, but some aspects of the experiments need clarification and some data panels are difficult to read and interpret as currently presented.

      The detailed mechanistic analysis of TIPE-mediated regulation of PKM2 to control aerobic glycolysis and tumor growth is a major strength of the study and provides new insights into the molecular mechanisms that underpin the Warburg effect in cancer cells. However, despite these strengths, some weaknesses were noted, which if addressed will further strengthen the study.

      (1) The analysis of patient samples should be expanded to more directly measure the relationship between TIPE levels and melanoma patient outcome and progression (primary vs metastasis), to build on the association between TIPE levels and proliferation (Ki67) and hypoxia gene sets that are currently shown.

      Thank you for your suggestions. We have added the relationship between TIPE levels and progression (non-lymph node metastasis vs lymph node metastasis). In addition, we added the association between TIPE and Ki67 or LDH levels as you advised, as shown in Figure 7.

      However, the relationship between TIPE levels and melanoma patient outcome is not presented in this article. One reason is that the tissue microarray lacks survival data. Interestingly, the TCGA dataset showed that higher TIPE expression is associated with a favorable prognosis in melanoma. We are also very curious about this. Our follow-up study indicates that TIPE might serve as a positive regulator of PD-L1. Higher TIPE expression may therefore confer greater sensitivity to immunotherapy, resulting in a favorable prognosis in melanoma. The detailed mechanisms will be discussed in our next article, and we hope this will become a continuing research topic for TIPE in melanoma.

      Here we disclose only limited information: TIPE has a survival and immune signature similar to those of PD-L1 and PD-1 in melanoma, as follows:

      Author response image 1.

      (2) The duration of the in vivo experiments was not clearly defined in the figures, however, it was clear from the tumor volume measurements that they ended well before standard ethical endpoints in some of the experiments. A rationale for this should be provided because longer-duration experiments might significantly change the interpretation of the data. For example, does TIPE depletion transiently reduce or lead to sustained reductions in tumor growth?

      Thank you for your suggestions. We performed a pilot experiment before the formal experiments, and all time points were chosen on that basis. Furthermore, we have added the detailed time points to the figure legends as you suggested.

      (3) The analysis of general cancer stem cell markers is solid and interesting, however inclusion of neural crest stem cell markers that are more relevant to melanoma biology would greatly strengthen this aspect of the study.

      Thank you for your advice. We selected two neural crest stem cell markers, Nestin and Sox10, and tested their expression after overexpression of TIPE in G361 cells or interference of TIPE in A375 cells.

      (4) The authors should take care that all data panels are clearly readable in the figures to facilitate appropriate interpretation by the reader.

      Thank you for your suggestions. We have amended the data panels according to your advice to ensure that they are clear and professionally presented.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Major points

      (1) In Figure 1D, glucose, pyruvate, and lactate were measured at a steady state. However, metabolites at steady state do not accurately depict changes in pathway activity. An isotope tracing experiment (i.e., using labelled 13C glucose) can be used to study glucose catabolism into pyruvate, as well as tracing into lactate or into the TCA cycle following changes in TIPE expression. In addition, although the authors point towards changes in metabolic reprogramming, only three metabolites were measured. The use of isotope tracing to monitor metabolites from more than one pathway would be suggested to support the claim that metabolism is being reprogrammed due to TIPE.

      Thank you very much for your suggestions. Unfortunately, we could not complete the isotope tracing experiments because we lack the necessary instruments, and we were unable to arrange them externally after consulting several companies. We apologize for this limitation, which we have discussed in the manuscript. Moreover, owing to our oversight, only three metabolites were presented in the previous version of the manuscript. However, we have now performed routine untargeted metabolomics to demonstrate how TIPE causes metabolic reprogramming. We have added the detailed results as a new figure, Figure S3, which shows that glycolysis pathway metabolites, particularly pyruvate and lactic acid, are decreased after TIPE interference.

      (2) In Figure 1H, extracellular acidification was used to determine glycolytic activity. However, bicarbonate secretion can also greatly affect pH, and should be considered (PMID 25449966). Although total ATP content was measured, the contribution of ATP from glycolysis can be also determined (see PMID 28270511) to provide a more accurate representation of glycolytic ATP production.

      Thank you again for your suggestions. As described above, we will improve our measurement methods in future work, and we have discussed this weakness in the manuscript.

      (3) On page 5, lines 108-111, the authors show that "This process represents an important regulator of the TIPE family switching between oxidative phosphorylation and aerobic glycolysis, paving the way for cancer-specific metabolism in response to low-oxygen challenge." However, there is no data on oxidative phosphorylation. What is the effect of TIPE on oxygen consumption?

      Thank you for your careful and professional advice. We have reviewed the manuscript thoroughly for language accuracy and corrected this statement to eliminate confusion and ensure that the text is clear and professionally presented.

      Minor points

      (1) On page 3, line 68, it is unclear what is increasing lactate levels, as lactate can be transported inside of cells.

      Thank you for your suggestions. We have corrected this inaccurate description to improve the overall quality and readability of the manuscript.

      (2) In Figure 1B, RNA sequencing was performed on TIPE overexpressing G361 cells. The "ribosome" pathway has the highest count and lowest p-value. However, there is no mention of this in the text.

      Thank you for your suggestions. We selected aerobic glycolysis as the main focus of our study based on the combined transcriptomics, metabolomics and Co-IP/MS results. The "ribosome" pathway that you pointed out may be a topic of our future research.

      (3) It would be helpful to include the cell line in Figure S1B-C as well as in the figure legend.

      Thanks for your suggestions, we have added the cell line into Figure S1B-C as well as in the figure legend.

      (4) Concerning supplementary figures, it would be helpful to include the panel numbers when referring to them in the main text (see line 120 or 122 as an example).

      Thanks for your suggestions, we have added the panel numbers when referring to them in the main text.

      (5) The sentence on lines 127-131 is very confusing.

      Thank you for your suggestions. We have corrected the unclear descriptions you mentioned.

      (6) In Figure S3, qPCR is misspelled in the figure legend. Also, it would be helpful to include what is meant by "relative expression" on the y-axis of Figure S3A.

      Thank you for your suggestions. We have corrected the errors you pointed out. Because the y-axis represents the expression of both TIPE and HIF-1α, we believe the present label is more suitable.

      (7) There is an extra space on line 196.

      Thank you for your suggestions. We have corrected this as you pointed out.

      (8) In Figure 7E LDH staining was performed. Which isoform of LDH was detected?

      Actually, we stained total LDH in Figure 7E.

      (9) On line 931, Warburg is misspelled.

      Thank you for your suggestion. We have corrected all the typos mentioned, including "Warburg" on line 931.

      Reviewer #2 (Recommendations For The Authors):

      Major comments:

      - Supplementary Figure 2G. Unit of time measurement for tumor growth panel needs to be defined. If this refers to days, 5 days is a relatively short period to assess tumor growth differences in vivo, and indeed, 1000-1200mm3 is a standard ethical end-point for these types of models, and this experiment was concluded well before reaching these tumor sizes. Can the authors explain why they ended this experiment at this timepoint?

      Thank you for your suggestions. As you suggested, we have added the detailed time points to the figure legends. We performed a pilot experiment before the formal experiments, and all time points were chosen on that basis.

      - Supplementary Figure 2j - Correlation analysis between TIPE expression and overall survival outcome in melanoma patients is more relevant to support the experimental observations described in the paper than the correlation with Ki67. This analysis should also be provided. In addition, is there any difference in TIPE expression between primary and metastatic melanoma patients which would then more directly link TIPE with melanoma progression in patients?

      The relationship between TIPE levels and melanoma patient outcome is not presented in this article. One reason is that the tissue microarray lacks survival data. Interestingly, the TCGA dataset showed that higher TIPE expression is associated with a favorable prognosis in melanoma. We are also very curious about this. Our follow-up study indicates that TIPE might serve as a positive regulator of PD-L1. Higher TIPE expression may therefore confer greater sensitivity to immunotherapy, resulting in a favorable prognosis in melanoma. The detailed mechanisms will be discussed in our next article, and we hope this will become a continuing research topic for TIPE in melanoma.

      Furthermore, we have added the relationship between TIPE levels and progression (non-lymph node metastasis vs lymph node metastasis), and Ki67 in Figure 7.

      - Figure 2 - The A2 domain protein represents a substantial reduction in the size of PKM2, which would likely have other structural effects that could affect interactions with TIPE. This should be discussed by the authors because, in this reviewer's opinion, the data presented do not shed light on the specific TIPE domain requirements for the interaction with PKM2.

      Thank you for your suggestions. We have discussed this point in the manuscript.

      - Figure 4: The authors show that PKM2 recruitment to the promoters of GLUT1 and LDHA is induced by TIPE expression. Is HIF1a recruitment also induced by TIPE? This is a key gap in the detailed molecular analysis provided by the authors.

      Thank you for your suggestions. The phenomenon you mention is very interesting. However, the expression of GLUT1 and LDHA was completely suppressed when we overexpressed TIPE together with PKM2 (S37A), compared with overexpression of TIPE together with wild-type PKM2. Therefore, we believe that the higher expression of GLUT1 and LDHA was primarily promoted by TIPE-induced PKM2 recruitment.

      - Figure 6: The authors present nice data for general pluripotency/stem cell markers however given melanocytes arise from the neural crest, and neural crest markers are expressed during melanoma initiation and response to therapies, analysis of neural crest stem cell markers would be appropriate to include in this analysis. For example, Sox10, Pax3, NGFR, and AQP2 have all been identified as neural crest stem cell markers expressed in both melanoma patients and experimental models.

      Thank you for your advice. We selected two neural crest stem cell markers, Nestin and Sox10, and tested their expression after overexpression of TIPE in G361 cells or interference of TIPE in A375 cells.

      Minor comments:

      - All Figure and Supplementary Figure legends should indicate how many replicate experiments the data represents, and all error bars should be defined (StDev vs SEM).

      We have added as you suggested.

      - Supplementary Figure S1C - can the authors confirm the densitometry values on the western, as the band looks to be considerably larger than 1.6 fold higher compared to the control?

      We repeated the densitometry measurement using ImageJ; however, the result remained the same.

      - FACs panels in Supplementary Figure 2C-D are unreadable and should be enlarged.

      - Supplementary Figure S2i - quantification of Ki67 images appears warranted.

      - Supplementary Figure S2j - The text in the figure panel is too small and needs to be increased so the data can be interpreted accurately. Also, the authors should confirm the data is specifically from melanoma patients in the figure legend.

      We have improved the quality of the figures and revised their descriptions for greater clarity and coherence, ensuring that they effectively highlight the key results of our study.

      - Figure 1A - text on the heat map cannot be read. Gene-level information can be removed, and sample labels should be made larger. In panel D, no statistical analysis is shown for the metabolomics analysis. These should be added, or the authors should modify the text when referring to these data.

      We have improved the quality of the figures and revised their descriptions for greater clarity and coherence, ensuring that they effectively highlight the key results of our study.

      - Line 127: RNAseq data does not indicate a change in metabolites; text should be changed to say "TIPE dramatically promoted expression of genes...".

      We have corrected as you suggested.

      - Supplementary Figure S3c - Labels and correlation values are not readable.

      - Figure 2A - The text and details in the figure are difficult to read.

      - Figure S4 D-H - text in figure panels too small to read.

      Thank you for the three comments above. We have carefully reviewed the entire document to ensure that all figures are clear and correctly cited, preventing any confusion and maintaining the integrity of our research findings.

      - Figure 3 - the legend restates the major observations and interpretations of the figure, however does not contain enough information about what the data represents or how it was generated. The interpretation of the data should be made in the main text. For example, in panel 3. F-G the number of individual cells quantified for the analysis should be stated. In addition, given the data are generated from two completely independent cell lines, it would be more appropriate to have separate graphs for the A375 cells and G361 cells. The signal levels in the respective controls at baseline are very different, and plotted together without clear labels, making the reader question the validity of the data when this just reflects different basal signals in different cell models.

      We have separated the graphs for the A375 cells and G361 cells.

      - Figure 4 B-C - IgG controls are missing in Co-IP experiments.

      We have added the IgG controls as you suggested.

      - Figure 5F - The unit of measure of time should be indicated on the axes; is this days?

      We measured the tumor volumes seven times, once every 5 days. We have added a detailed description to the Materials and Methods section.

      - Line 348: error in text, mammosphere which should presumably be tumorsphere if from melanoma cells.

      Thank you for your suggestions. We have corrected this term to "tumorsphere" and conducted a thorough language and grammar review of the manuscript to ensure its professional presentation.

      - Methods: more experimental details for the transcriptomic, mass spec, and metabolomics studies should be provided. There are insufficient details if readers wish to repeat these experiments.

      Thank you for your suggestions. We have revised the Methods as you advised.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The current work explored the link between the pulvinar intrinsic organisation and its functional and structural connectivity patterns of the cortex using different dimensional reduction techniques. Overall they find relationships between pulvinar-cortical organization and cortico-cortical organization, and little evidence for clustered organization. Moreover, they investigate PET maps to understand how neurotransmitter/receptor distributions vary within the pulvinar and along its structural and functional connectivity axes.

      Strengths:

      There is a replication dataset and different modalities are compared against each other to understand the structural and functional organisation of the pulvinar complex.

      Weaknesses:

      (1) What is the motivation of the study and how does this work extend previous assessments of the organization of the complete thalamus within the gradient framework?

      Thank you for raising this central question. As already mentioned in the main text, the pulvinar is one of the largest and most prototypical associative nuclei, yet its organizational principles in the human brain remain relatively unexplored. The substantial body of anatomical research conducted in primate species suggests the coexistence of multiple overlapping corticotopic representations on the pulvinar complex.

      Existing connectivity-based parcellation studies of pulvinar organization often overlook these organizational principles, as the resulting parcellation may reflect a linear combination of single overlapping connectopies rather than accurately capturing their distinct and unique spatial arrangement.

      Investigations of thalamic connectivity have already revealed overarching organizational principles within the thalamus, which are partially reflected in its cytoarchitectonic subdivisions. These principles are associated with core and matrix thalamic neuronal subpopulations, and their distinct contributions to large-scale connectivity networks.

      Since gradient selection relies on the explained variance of the diffusion embeddings, and pulvinar-cortical connectivity likely accounts for only a limited portion of the variance in thalamocortical connectivity, we chose to focus specifically on the pulvinar nucleus. This approach was intended to ensure that the local connectivity principles of the pulvinar are not overshadowed by the broader connectotopical organization of the entire thalamus.

      This rationale aligns with findings in topographically organized regions of the cerebral cortex, such as M1, S1 or visual areas. In these regions, distinct principles of topographical organization are not readily apparent when analyzing whole-brain connectivity embedding but emerge when dimensionality reduction is applied to region-specific connectivity data.

      (2) Why is the current atlas chosen for the delineation of the pulvinar and individualized maps not considered? Given the size of the pulvinar, more validation of the correctness of the atlas may be helpful.

      To improve signal-to-noise ratio and in alignment with previous studies, we performed diffusion embedding on the group-level, averaged connectivity matrices rather than estimating gradients at the individual subject level.
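
      The order of operations described here (average the connectivity matrices across subjects, then embed once) can be illustrated with a minimal sketch. It is not the authors' pipeline: the cosine-similarity affinity, the `alpha=0.5` normalisation, the function name `diffusion_embedding`, and the synthetic array sizes are all assumptions chosen for illustration.

```python
import numpy as np

def diffusion_embedding(conn, n_components=3, alpha=0.5):
    """Return diffusion-map coordinates for the rows of `conn`.

    conn: (n_voxels, n_targets) non-negative connectivity profiles
          with no all-zero rows (hypothetical input layout).
    """
    # Cosine similarity between voxel connectivity profiles (assumed affinity).
    unit = conn / np.linalg.norm(conn, axis=1, keepdims=True)
    affinity = np.clip(unit @ unit.T, 0.0, None)

    # Anisotropic (alpha) normalisation to reduce the influence of sampling density.
    degree = affinity.sum(axis=1)
    kernel = affinity / np.outer(degree ** alpha, degree ** alpha)

    # Symmetric form of the Markov matrix: same eigenvalues, symmetric solver.
    row_sum = kernel.sum(axis=1)
    sym = kernel / np.sqrt(np.outer(row_sum, row_sum))
    eigvals, eigvecs = np.linalg.eigh(sym)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Convert to right eigenvectors of the Markov matrix; drop the trivial first one.
    psi = eigvecs / np.sqrt(row_sum)[:, None]
    lambdas = eigvals[1:n_components + 1]
    gradients = psi[:, 1:n_components + 1] * lambdas
    return gradients, lambdas

# Synthetic stand-in for per-subject pulvinar-to-cortex connectivity matrices.
rng = np.random.default_rng(0)
subject_conn = rng.random((10, 40, 100))   # subjects x voxels x cortical targets
group_conn = subject_conn.mean(axis=0)     # average first, then embed once
gradients, lambdas = diffusion_embedding(group_conn)
```

      Each column of `gradients` assigns one value per pulvinar voxel, and `lambdas` gives the corresponding eigenvalues in descending order. Embedding the averaged matrix yields a single set of gradients on a fixed voxel set, which is why a common standard-space mask is convenient.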

      The decision to use a standard-space atlas for pulvinar delineation, rather than individualized parcellation, was driven by technical considerations: 1) functional MRI data were already transformed to MNI space; and 2) individualized parcellation of thalamic nuclei can result in varying pulvinar volumes across subjects, complicating the averaging of connectivity data. By using a standard-space atlas, we ensured that connectivity was consistently extracted from the same set of voxels across all subjects.

      We selected the AAL3 atlas (Rolls et al., 2020) over other existing thalamic atlases for practical reasons: the atlas incorporates an ex-vivo thalamic parcellation (Iglesias et al., 2018) with a specific delineation of pulvinar nuclei, which was necessary for subsequent analyses. In the revised version of the manuscript, to validate our findings, we replicated the pulvinar gradients using a different pulvinar delineation from a recent, thalamus-specific atlas (Su et al., 2019). Notably, the spatial distribution of pulvinar connectivity and coexpression gradients remained consistent regardless of the choice of thalamic atlas, underscoring the robustness of our results.

      (3) Overall the study feels a little incremental and a repetition of what others have done already in the thalamus. It would be good to know how focusing only on the pulvinar changes interpretation, for example by comparing thalamic and pulvinar gradients?

      We acknowledge the existing body of literature that has examined thalamic connectivity through the lens of the connectivity gradient framework. While these studies may provide valuable insights into the functional topography of the pulvinar complex, given its prominent role within the thalamus, we contend that a focused analysis of pulvinar connectivity offers a unique opportunity to uncover the specific organizational principles of this nuclear complex. By isolating the pulvinar, we aimed to avoid the potential overshadowing of its local connectivity patterns by the broader connectotopical organization of the entire thalamus. However, as we believe that our findings are best interpreted within the broader context of general thalamic connectivity organization, we have included an additional paragraph in the Discussion section, which explores the similarities and differences between thalamic and pulvinar gradients, offering a more integrative perspective on our results.

      “In recent years, different works have explored the spatial arrangement of thalamic connectivity within a connectivity gradient framework. Diffusion embedding of thalamocortical functional connectivity has revealed a principal, medio-lateral gradient that was found to correlate with thalamic structural subdivisions, and a secondary, antero-posterior gradient associated with thalamic functional subfields, and showing progression from unimodal sensorimotor cortical networks to multimodal attention and associative networks. Interestingly, the principal thalamic gradient shows a medio-lateral arrangement on the pulvinar axis while the secondary gradients correspond more to a ventral-dorsal pulvinar axis (Yang et al. 2020). In particular, further independent investigations have suggested that the progressing pattern of thalamic connectivity from unimodal to transmodal cortices is strongly associated with the local density of core and matrix cell types, thus establishing a link between molecular properties and functional connectivity dynamics (Müller et al. 2020; Huang et al. 2024). Our findings complement and expand the existing literature by revealing a similar arrangement of cortical connectivity patterns on the pulvinar complex, and elucidating its relationship to in-vivo estimates of molecular markers of neurotransmission. We found that the gradient associated with unimodal-transmodal cortical connectivity accounted for the highest percentage of variance in cortico-pulvinar connectivity, in line with its well-acknowledged role as an associative nucleus. It is noteworthy that, in analyses of thalamocortical gradients, the pulvinar complex is situated towards the “sensorimotor” extreme of the unimodal-to-transmodal thalamic gradient (Yang et al., 2020). This likely reflects its prominent connectivity to visual and sensory areas compared to other thalamic nuclei. 
Nevertheless, the extensive and intricate association of the pulvinar with multiple cortical networks is strongly evident in various functional connectivity investigations (Basile et al., 2021; Kumar et al., 2017, 2022). By isolating pulvinar-cortical from broader thalamocortical connectivity, our analysis was able to provide additional insights into the spatial organization of its connectivity with different cortical networks, highlighting the pulvinar's remarkable functional diversity and complexity.”

      (4) Could it be that the gradient patterns stem from lacking anatomical and functional resolutions (or low SNR) therefore generating no sharp boundaries?

      The gradient organization described in our results aligns with anatomical evidence from non-human primates (Shipp, 2003), and with existing neuroimaging studies in humans, which report limited correspondence between connectivity-based hard clustering solutions and histological delineation of pulvinar nuclei. However, we recognize the critical importance of assessing the impact of SNR on connectivity measures derived from functional and structural MRI. In the revised manuscript, we have included an additional analysis to investigate the potential impact of local noise on gradient reconstruction. This analysis involved sampling voxel-wise SNR estimates in the pulvinar from both BOLD and diffusion-weighted MRI data, and averaging these estimates to generate group-level, modality-specific SNR maps. We then assessed spatial correlations between these maps and the gradient embeddings using the same methodological framework employed throughout the study. Our findings indicate that functional connectivity gradients are weakly, but significantly, correlated with SNR, with the strongest correlation observed for the third gradient (left hemisphere G<sub>FC</sub>1 r = -0.30, SA-corrected p < 0.001, G<sub>FC</sub>2 r = 0.22, SA-corrected p = 0.05, G<sub>FC</sub>3 r = 0.55, SA-corrected p < 0.001; right hemisphere G<sub>FC</sub>1 r = -0.41, SA-corrected p < 0.001, G<sub>FC</sub>2 r = 0.22, SA-corrected p = 0.008, G<sub>FC</sub>3 r = 0.52, SA-corrected p = 0.017). In contrast, structural connectivity gradients showed no significant correlation with SNR (left hemisphere G<sub>SC</sub>1 r = 0.06, SA-corrected p = 0.82, G<sub>SC</sub>2 r = -0.33, SA-corrected p = 0.01; right hemisphere G<sub>SC</sub>1 r = 0.40, SA-corrected p = 0.28, G<sub>SC</sub>2 r = -0.19, SA-corrected p = 0.31).
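
      The two steps named in this reply (averaging voxel-wise SNR across subjects, then correlating the group map with a gradient) can be sketched as follows. This is an illustrative sketch on synthetic data; the function names and array sizes are assumptions. It computes only the correlation coefficient: the SA-corrected p-values reported above would additionally require null maps that preserve spatial autocorrelation, which are not reproduced here.

```python
import numpy as np

def group_snr_map(subject_snr):
    """Average voxel-wise SNR across subjects: (n_subjects, n_voxels) -> (n_voxels,)."""
    return np.asarray(subject_snr, dtype=float).mean(axis=0)

def spatial_correlation(map_a, map_b):
    """Pearson correlation between two maps sampled on the same voxels."""
    a = np.asarray(map_a, dtype=float) - np.mean(map_a)
    b = np.asarray(map_b, dtype=float) - np.mean(map_b)
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

# Synthetic stand-ins: 10 subjects, 40 pulvinar voxels (hypothetical sizes).
rng = np.random.default_rng(1)
subject_snr = rng.normal(loc=50.0, scale=5.0, size=(10, 40))
snr_map = group_snr_map(subject_snr)

gradient = rng.normal(size=40)              # placeholder for one gradient map
r = spatial_correlation(gradient, snr_map)  # uncorrected correlation coefficient
```

      Because neighbouring voxels are not independent, a parametric p-value for `r` would overstate significance, which is the reason for using a spatial-autocorrelation-aware correction in the analysis described above.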

      Reviewer #1 (Recommendations for the authors):

      (1) Please add more literature on thalamus gradients and interpret this with care.

      Thank you for the suggestion. We have added the following paragraph in the Discussion section:

      “In recent years, different works have explored the spatial arrangement of thalamic connectivity within a connectivity gradient framework. Diffusion embedding of thalamocortical functional connectivity has revealed a principal, medio-lateral gradient that was found to correlate with thalamic structural subdivisions, and a secondary, antero-posterior gradient associated with thalamic functional subfields, and showing progression from unimodal sensorimotor cortical networks to multimodal attention and associative networks. Interestingly, the principal thalamic gradient shows a medio-lateral arrangement on the pulvinar axis while the secondary gradients correspond more to a ventral-dorsal pulvinar axis (Yang et al. 2020). In particular, further independent investigations have suggested that the progressing pattern of thalamic connectivity from unimodal to transmodal cortices is strongly associated with the local density of core and matrix cell types, thus establishing a link between molecular properties and functional connectivity dynamics (Müller et al. 2020; Huang et al. 2024). Our findings complement and expand the existing literature by revealing a similar arrangement of cortical connectivity patterns on the pulvinar complex, and elucidating its relationship to in-vivo estimates of molecular markers of neurotransmission. We found that the gradient associated with unimodal-transmodal cortical connectivity accounted for the highest percentage of variance in cortico-pulvinar connectivity, in line with its well-acknowledged role as an associative nucleus. It is noteworthy that, in analyses of thalamocortical gradients, the pulvinar complex is situated towards the “sensorimotor” extreme of the unimodal-to-transmodal thalamic gradient (Yang et al., 2020). This likely reflects its prominent connectivity to visual and sensory areas compared to other thalamic nuclei. 
Nevertheless, the extensive and intricate association of pulvinar with multiple cortical networks emerges is strongly evident in various functional connectivity investigations (Basile et al., 2021; Kumar et al., 2017, 2022). By isolating pulvinar-cortical from broader thalamocortical connectivity, our analysis was able to provide additional insights into the spatial organization of its connectivity with different cortical networks, highlighting the pulvinar's remarkable functional diversity and complexity.

      As regards structural connectivity, existing accounts describe a medio-lateral organization of thalamocortical connections, corresponding to an antero-posterior gradient on the cortical mantle. This gradient organization appears to be anchored to genetic markers of different cell types (Oldham and Ball 2023). In line with their findings, we describe a principal axis of structural connectivity in the pulvinar complex that is arranged on the mediolateral axis, and we reinforce the notion of a deep relationship between structural connections and molecular expression of neurotransmission markers. On the other hand, the patterns of connectivity with the cerebral cortex do not correspond to a clear antero-posterior axis on the cerebral cortex, probably reflecting the predominance of local connectivity over the global thalamic structural topography. Further investigations are warranted to ascertain whether the structural gradients of the pulvinar complex may be in continuity with this general cortico-thalamic connectivity gradient.”

      (2) Please state the motivation of the work more clearly and what makes it different from related literature.

      Thank you for pointing us to this lack of clarity. We have added the following paragraph in the Introduction section:

      “In particular, investigations of thalamic connectivity within the gradient framework have uncovered general organizational principles within the thalamus, which are partially reflected in thalamic cytoarchitectonic subdivisions. These principles have been linked to core and matrix thalamic neuronal subpopulations, and to their differential contribution to large-scale connectivity networks (Müller et al., 2020; Yang et al., 2020). However, given the remarkable functional complexity and diversity of the pulvinar complex, these global spatial organization patterns likely capture only part of its functional topography. With this in mind, isolating pulvinar connectivity from the remaining thalamocortical connectome would ensure that local organizational principles are not obscured by the global connectotopic structure of the entire thalamus.”

      (3) Why did the authors opt for a whole brain labelling atlas, would a thalamus-specific atlas not be more suitable?

      Despite being a large-scale whole brain atlas, the labeling atlas of choice (AAL3) incorporates a thalamus-specific parcellation from previous work (Iglesias et al., 2018), derived from ex-vivo data and including subdivision of the pulvinar complex into anterior, inferior, lateral and medial nuclei. In the revised version of the manuscript, to validate our findings, we replicated the pulvinar gradient using a different pulvinar delineation from a recent, thalamus-specific atlas (Su et al., 2019). We show these results in Supplementary Figure 1. Notably, the spatial distribution of pulvinar connectivity and coexpression gradients remained consistent, regardless of the choice of the thalamic atlas, underscoring the robustness of our results.

      (4) How did the authors account for the potential low sensitivity of subcortical signals in the PET data?

      We acknowledge the inherent limitations in spatial sensitivity that are a common drawback of PET imaging. However, the PET data employed in the present study were derived from a high-quality dataset collected across multiple studies, predominantly acquired using high-resolution scanners (Hansen et al., 2022; see supplementary material at https://static-content.springer.com/esm/art%3A10.1038%2Fs41593-022-01186-3/MediaObjects/41593_2022_1186_MOESM3_ESM.xlsx for technical details). Furthermore, the reliability of neurotransmission marker measurements at the subcortical level has been validated against genetic transcription markers (Hansen, Markello, et al., 2022; Hansen, Shafiei, et al., 2022), supporting the robustness and biological meaningfulness of the results.

      (5) What about SNR of the metrics within the pulvinar?

      The referee raises a crucial and complex point, prompting us to conduct additional analyses. We recognize the critical importance of assessing the impact of SNR on connectivity measures derived from functional and structural MRI. In the revised manuscript, we have included an additional analysis to investigate the potential impact of local noise on gradient reconstruction. Therefore, we have incorporated the following text into the manuscript:

      Results (5. Reliability and Reproducibility):

      “To assess the influence of local noise on functional and structural connectivity gradients, we calculated the spatial correlation between gradient values and averaged voxel-wise estimates of signal-to-noise ratio (SNR) from functional and structural MRI data, respectively. We found that functional connectivity gradients are weakly, but significantly correlated with the SNR, with the strongest correlation observed for the third gradient (left hemisphere G<sub>FC</sub>1 r= -0.30, SA-corrected p < 0.001, G<sub>FC</sub>2 r= 0.22, SA-corrected p = 0.05, G<sub>FC</sub>3 r= 0.55, SA-corrected p < 0.001; right hemisphere G<sub>FC</sub>1 r= -0.41, SA-corrected p < 0.001, G<sub>FC</sub>2 r= 0.22, SA-corrected p = 0.008, G<sub>FC</sub>3 r= 0.52, SA-corrected p = 0.017). In contrast, structural connectivity gradients were not significantly associated with SNR (left hemisphere G<sub>SC</sub>1 r= 0.06, SA-corrected p = 0.82, G<sub>SC</sub>2 r= -0.33, SA-corrected p = 0.01; right hemisphere G<sub>SC</sub>1 r= 0.40, SA-corrected p = 0.28, G<sub>SC</sub>2 r=-0.19, SA-corrected p = 0.31) (Supplementary Figure 5).”

      Methods (4. Reliability and reproducibility assessment):

      “To evaluate the possible influence of SNR on connectivity-derived diffusion embeddings, we performed a voxel-wise, modality-specific SNR assessment to investigate the correlation between the spatial distribution of noise and diffusion embeddings. For each subject, we separately calculated voxel-wise SNR maps for the left and right pulvinar, using both functional (BOLD) volumes and DWI data. For BOLD volumes, we employed the widely accepted definition of temporal signal-to-noise ratio (tSNR) (Murphy et al., 2006):

      tSNR = T<sub>mean</sub> / T<sub>std</sub>

      where T<sub>mean</sub> and T<sub>std</sub> are, respectively, the mean and the standard deviation of each voxel’s signal across the time series.

      For the DWI data, we applied a similar approach (Cai et al., 2021) that allows estimation of SNR from multiple b=0 diffusion-weighted volumes:

      SNR = S<sub>mean</sub> / S<sub>std</sub>

      where S is the voxel’s signal intensity, and the mean (S<sub>mean</sub>) and standard deviation (S<sub>std</sub>) were computed across all the b0-weighted volumes (18 for the HCP dataset; 7 for the LEMON dataset). Individual pulvinar SNR maps were then averaged to generate group-level estimates of SNR spatial distribution. The resulting modality-specific average SNR maps were correlated with the diffusion gradients derived from the corresponding modality, following the same approach described in the previous section (Pearson’s correlation; p-values corrected using spatial null models for spatial autocorrelation, and Benjamini-Hochberg correction for FDR).”
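
      As an illustration of the SNR computation described above, the following minimal Python sketch computes a voxel-wise ratio of mean to standard deviation and then averages the per-subject maps. It is a sketch under stated assumptions, not the code used for the analysis: it assumes that SNR is the mean divided by the standard deviation (the usual tSNR definition), that the sample (n-1) standard deviation is used, and that each voxel's samples are available as a plain list. The function names `voxelwise_snr` and `group_average` are hypothetical.

```python
from statistics import mean, stdev


def voxelwise_snr(samples_by_voxel):
    """Return mean/std for each voxel.

    samples_by_voxel: one list of samples per voxel, e.g. a BOLD time
    series (tSNR) or the intensities across b0 volumes (DWI SNR).
    Assumption: sample (n-1) standard deviation.
    """
    snr = []
    for samples in samples_by_voxel:
        sd = stdev(samples)
        # A constant signal has zero variance, so SNR is undefined there.
        snr.append(mean(samples) / sd if sd > 0 else float("nan"))
    return snr


def group_average(snr_maps):
    """Average per-subject SNR maps voxel by voxel.

    snr_maps: one list per subject, all with the same voxel order
    (i.e. already in a common standard space).
    """
    return [mean(values) for values in zip(*snr_maps)]


# Two toy subjects, two voxels each, three samples per voxel.
subject_a = voxelwise_snr([[10, 12, 14], [100, 101, 102]])
subject_b = voxelwise_snr([[20, 24, 28], [50, 52, 54]])
group_map = group_average([subject_a, subject_b])
```

      The same function serves both modalities because only the meaning of the samples changes (time points for BOLD, b0 volumes for DWI).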

      (6) The numbers of the screeplot / numbers in figures are quite small and not so easy to read.

      Thank you for highlighting this point. We have fixed this issue in the revised version of the Figures.

      (7) How do you know the pulvinar mask is not also picking up on the cortical spinal tract?

      To ensure that pulvinar masks did not pick up streamlines from the corticospinal tracts, we performed a thorough visual inspection of the tractograms that were employed for structural connectivity estimation. For each subject-specific tractogram, we randomly subsampled 10000 streamlines after transformation into MNI standard space and summed up these results to generate a group-level tractogram in standard space. The resulting track-density images (Author response image 1) demonstrate only minimal involvement of descending/ascending tracts from/to the brainstem and spinal cord, confirming the specificity of the pulvinar masks.

      Author response image 1.

      Group-level structural connectivity of the pulvinar complex. Track-density images have been normalized and overlaid on the MNI152 standard template.

      (8) There is no mention of whether the within-pulvinar gradients, which are then correlated with PET patterns or with each other, are tested for spatial autocorrelation. I believe it is only mentioned for the cortex.

      Thanks for providing us with the opportunity to clarify this important aspect, which is mentioned in the Methods section (3. Gradient analysis and statistics):

      “To account for the spatial autocorrelation (SA) properties of gradient maps, for all the correlations described, statistical significance was assessed using the permutational approach described in Burt et al. (2020). Briefly, this method takes as input geometric distance matrices for SA estimation and involves the generation of a given number of SA-preserving permuted surrogate maps, which are then employed as nulls to estimate a permutational null distribution of the test statistic (Burt et al. 2020). Pairwise Euclidean distances between left or right pulvinar voxel coordinates were employed for pulvinar null models, while for cortical parcellated connectivity data Euclidean distances were estimated between centroids of each cortical ROI. In both cases, 1000 surrogates were generated to estimate the null distribution. Statistical tests were controlled for false discovery rate (FDR) using Benjamini and Hochberg’s correction.”
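
      The logic of the significance test quoted above can be sketched as follows. This is a minimal illustration, not the implementation of Burt et al. (2020): it assumes that the SA-preserving surrogate maps have already been generated by some other tool and are passed in as plain lists, it uses a two-sided test with the common "+1" correction (an assumption, since the quoted text does not specify either), and the function names `pearson_r`, `surrogate_p_value` and `benjamini_hochberg` are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def surrogate_p_value(x, y, surrogates):
    """Two-sided p-value of r(x, y) against a surrogate null distribution.

    surrogates: SA-preserving surrogate versions of x (assumed to be
    precomputed; generating them is outside this sketch).
    """
    r_obs = pearson_r(x, y)
    null = [pearson_r(s, y) for s in surrogates]
    exceed = sum(abs(r) >= abs(r_obs) for r in null)
    # "+1" keeps the p-value strictly positive (assumed convention).
    return r_obs, (exceed + 1) / (len(null) + 1)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

      With 1000 surrogates, as in the quoted Methods text, the smallest attainable p-value under this convention would be 1/1001.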

      However, to enhance readability, we have highlighted this concept in the Results section (3. The unimodal-to-transmodal gradient (G<sub>FC</sub>1) aligns with receptor expression on the dorso-ventral pulvinar axis):

      “To take into account the effects of spatial autocorrelation, we corrected the resulting p-values using a method based on SA-preserving spatial null models (Burt et al. 2020)”.

      (9) I don't fully understand why the mappings are so patchy of the structural connectivity gradient? Maybe some normalisation went wrong? Other papers on thalamic gradients show smoother patterns.

      We thank the Reviewer for the observation. After thoroughly reviewing the related code, we found no normalization errors. However, we identified a visualization issue, which has been addressed in the revised version. Specifically, the structural gradient representations shown in the figures were based on the averaged values of left and right pulvinar gradients, both of which include structural connectivity to either the ipsilateral or contralateral cerebral cortex. Since ipsilateral connectivity is more prominently represented than contralateral connectivity, this led to asymmetric gradient patterns between ipsilateral and contralateral cortical gradients, resulting in a patchy representation when gradients were averaged between left and right pulvinar. To resolve this, we adjusted the visualization by flipping the right pulvinar gradient representations along the x axis, aligning all the ipsilateral cortical connectivity on the left side and all the contralateral connectivity on the right. This adjustment produced smoother, more readable, and more interpretable visualizations. Additionally, it allowed the asymmetry between ipsilateral and contralateral connections to be more clearly appreciated.

      (10) The final statement of the abstract is misleading as we at this point don't know how making spatial pattern maps in the pulvinar may help understand the role of the pulvinar in health and disease.

      We appreciate the Reviewer’s suggestion and have updated the expression accordingly:

      “Our findings represent a significant step forward in advancing the understanding of pulvinar anatomy and function, offering an exploratory framework to investigate the role of this structure in both health and disease.”

      Reviewer #2 (Public review):

      Summary:

      The authors aimed to explore and better understand the complex topographical organization of the human pulvinar, a brain region crucial for various high-order functions such as perception and attention. They sought to move beyond traditional histological subdivisions by investigating continuous 'gradients' of cortical connections along the dorsoventral and mediolateral axes. Using advanced imaging techniques and a comprehensive PET atlas of neurotransmitter receptors, the study aimed to identify and characterize these gradients in terms of structural connections, functional coactivation, and molecular binding patterns. Ultimately, the authors sought to provide a more nuanced understanding of pulvinar anatomy and its implications for brain function in both healthy and diseased states.

      Strengths:

      A key strength of this study lies in the authors' effort to comprehensively combine multimodal data, encompassing both functional and structural connectomics, alongside the analysis of major neurotransmitter distributions. This approach enabled a more nuanced understanding of the overarching organizational principles of the pulvinar nucleus within the broader context of whole-brain connectivity. By employing cortex-wide correlation analyses of multimodal embedding patterns derived from 'gradients,' which provide spatial maps reflecting the underlying connectomic and molecular similarities across voxels, the study offers a thorough characterization of the functional neuroanatomy of the pulvinar.

      Weaknesses:

      Despite its strengths, the current manuscript falls short in presenting the authors' unique perspectives on integrating the diverse biological principles derived from the various neuroimaging modalities. The findings are predominantly reported as correlations between different gradient maps, without providing the in-depth interpretations that would allow for a more comprehensive understanding of the pulvinar's role as a central hub in the brain's network. Another limitation of the study is the lack of clarity regarding the application of pulvinar and its subnuclei segmentation maps to individual brains prior to BOLD signal extraction and gradient reconstruction. This omission raises concerns about the precision and reproducibility of the findings, leaving their robustness less transparently evaluable.

      We thank the Reviewer for the valuable comments. While commonalities and discrepancies between structural and functional connectivity have been extensively explored in the literature, the relationship between functional connectivity and modulatory neurotransmission remains poorly understood. Specifically, while the role of thalamic modulatory neurotransmission has been thoroughly investigated in experimental animal models from an electrophysiological perspective, it remains relatively underexplored in the human brain. In our study, we identified significant associations between the spatial distribution of serotonergic, noradrenergic, dopaminergic and mu-opioid systems and functional pulvinar-cortical connectivity to specific functional networks. Evidence from pharmacological challenge studies using resting-state fMRI suggests that these neurotransmission systems may modulate network-specific thalamocortical connectivity directly or influence neural gain in cortico-cortical connectivity, a process partially dependent on thalamocortical connections to associative thalamic nuclei. However, the limitations of spatial and receptor specificity inherent to this approach, coupled with the predominantly correlational nature of our study design, prevented us from drawing more definitive conclusions on the biological relationship between neurotransmitter expression and functional connectivity. As regards the lack of clarity concerning signal extraction, we have now clarified that all the relevant steps of time series extraction were performed in standard space, without any further registration to individual subjects.

      Reviewer #2 (Recommendations for the authors):

      In line with the weaknesses that I raised above, my recommendations to the authors are two-fold:

      (1) Please provide readers with a more holistic viewpoint to better digest all the correlation analyses. For instance, in p18, the summary says:

      "G<sub>FC</sub>1, GRC1, and G<sub>SC</sub>2 substantially delineate multiscale differences between the ventral and dorsal aspects of the pulvinar. Moving along the ventral-dorsal axis of the pulvinar complex, more ventral regions showed higher functional connectivity to unimodal sensory processing networks, higher levels of 5HTT and NAT expression, and preferentially higher structural connectivity to modality-independent or low-level sensory processing cortices."

      We already knew of the existence of the dorsoventral axis in the pulvinar, as the authors specified in the introduction. Beyond this simple report of a phenomenological observation, one may provide a more integrated discussion to pinpoint what commonalities or discrepancies the GFC, GRC, and GSC maps show and the potential common principles explaining their biological relationship (e.g., the high expression of 5HTT and NAT and functional connectivity). Such digested perspectives will grant the study unique insights into the functional system of the pulvinar.

      We have expanded on this topic in the Discussion section (Neurochemical correlates of pulvinar-cortical topographical organization) as follows:

      “Indeed, while commonalities and discrepancies between structural and functional connectivity have been extensively investigated, the relationship between functional connectivity and modulatory neurotransmission remains poorly understood. Our findings reveal stronger associations between pulvinar-cortical connectivity to specific functional networks and the spatial distribution of markers of serotonergic, noradrenergic, dopaminergic and opioid systems. Pharmacological challenge studies using resting-state functional MRI suggest that each of these neurotransmission systems may either directly modulate thalamocortical connectivity or influence neuronal gain in cortico-cortical functional connectivity, which is known to depend, in part, on cortical connections to associative thalamic nuclei, including the pulvinar.”

      (2) Specify the details if there was a QC procedure to check the signal extraction from the pulvinar subnuclei by applying the segmentation atlas at each individual.

      Preprocessed BOLD volumes were available in standard-space, and time series were extracted for each voxel within a standard-space mask of the pulvinar complex. All volumes underwent visual inspection to ensure the accuracy of the registration process. Regarding the pulvinar subnuclei, these structures were not segmented at the individual level.

      Reviewer #3 (Public review):

      Summary of the Study:

      The authors investigate the organization of the human pulvinar by analyzing DWI, fMRI, and PET data. The authors explore the hypothesis of the "replication principle" in the pulvinar.

      Strengths and Weaknesses of the Methods and Results:

      The study effectively integrates diverse imaging modalities to provide a view of the pulvinar's organization. The use of analysis techniques, such as diffusion embedding-driven gradients combined with detailed interpretations of the pulvinar, is a strength.

      Even though the study uses the best publicly available resolution possible with current MR-technology, the pulvinar is densely packed with many cell bodies, requiring even higher spatial resolution. In addition, the model order selection of gradients may vary with the acquired data quality. Therefore, the pulvinar's intricate organization needs further exploration with even higher spatial resolution to capture gradients closer to the biological organization of the pulvinar.

      Appraisal of the Study's Aims and Conclusions:

      The authors delineate the gradient organization of the pulvinar. The study provides a basis for understanding the pulvinar's role in mediating brain network communication.

      Impact and Utility of the Work:

      This work contributes to the field by offering insights into pulvinar organization.

      We thank the Reviewer for their positive assessment and constructive feedback. We agree with the Reviewer that the spatial resolution of currently available in-vivo imaging methods is limited, and that gradient representation would indeed benefit from higher-resolution data. However, we also note that the resolution of structural and functional volumes used in our study is consistent with existing literature on pulvinar connectivity. Additionally, the PET data employed in our work include multi-centric studies collected worldwide from healthy populations, and were primarily acquired using high-resolution scanners that allow spatial resolution up to 2 mm<sup>2</sup>. Nevertheless, further investigations employing higher-resolution imaging techniques, such as ultra-high field fMRI, may provide more detailed insights into pulvinar topographical organization.

      Reviewer #3 (Recommendations for the authors):

      (1) The HCP data contains genetically related datasets. Please mention whether the data-selection criteria for the selected 210 healthy subjects followed the genetically unrelated criteria.

      The HCP sample employed in this study consists of an initial cohort of 100 unrelated subjects, as provided in the HCP database, along with an additional random sample of 110 subjects. Subjects were selected without following a genetic criterion, as the family structure of the HCP dataset was part of a restricted-access subset that we did not have access to at the time of processing. Subsequently, we obtained access to this information and determined that 178 out of 210 subjects (85%) are genetically unrelated. Of the remaining, genetically related subjects, 22 (~10% of the total sample) were included with another subject from the same family group (11 pairs); 6 (3%) were included with two other family members (2 triplets); and 4 (2%) were all part of the same family group. This information has been included in the Methods section for clarity.
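
      As a quick consistency check of the sample composition reported above, the counts can be summed and converted to rounded percentages. The snippet below is only an arithmetic illustration; the group labels are ours and the counts are those stated in the response.

```python
# Counts as stated in the response; the labels are illustrative.
groups = {
    "unrelated": 178,
    "family pairs (11 x 2)": 22,
    "family triplets (2 x 3)": 6,
    "single family of four": 4,
}

total = sum(groups.values())
percentages = {name: round(100 * n / total) for name, n in groups.items()}

print(total)        # 210
print(percentages)  # 85, 10, 3 and 2 percent respectively
```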

      (2) The study uses HCP data with an fMRI resolution of 2mm isotropic and diffusion MRI with 1.25mm. Additionally, the LEMON dataset includes 1.7mm isotropic DWI data and fMRI with 2.3mm isotropic resolution. Furthermore, the available PET data from the Hansen et al. 2022b study has a rather coarser spatial resolution. Therefore, it may be important to mention in the discussion that the pulvinar is densely packed with cell bodies and that their gradient organization might be better reflected with even higher spatial resolution or improved measurement techniques used in the study.

      We have revised the concluding section of the Discussion into a paragraph titled “Future perspectives and limitations”, and added the following text:

      “One notable limitation of this study lies in the relatively small size of the pulvinar complex compared to larger cortical or subcortical structures. The high cellular density of the pulvinar poses a challenge for the relatively coarse resolution of currently available imaging techniques. Although the generally high quality of both the main and validation datasets, including rs-fMRI data (Uǧurbil et al. 2013; Babayan et al. 2019), aligns with current standards for imaging investigations of pulvinar connectivity, higher-resolution imaging approaches may offer more granular insights. Advanced techniques, such as ultra-high-field fMRI, hold promise for uncovering the fine-scale topographical organization of the pulvinar complex.”

      (3) The functional multiplicity of the Pulvinar nuclei among other thalamus nuclei is also illustrated in https://doi.org/10.1038/s42003-022-04126-w

      We thank the Reviewer for suggesting this important reference. We have added the following text in the Discussion section:

      “It is noteworthy that, in analyses of thalamocortical gradients, the pulvinar complex is situated towards the “sensorimotor” extreme of the unimodal-to-transmodal thalamic gradient (Yang et al., 2020). This likely reflects its prominent connectivity to visual and sensory areas compared to other thalamic nuclei. Nevertheless, the extensive and intricate association of the pulvinar with multiple cortical networks is strongly evident in various functional connectivity investigations (Basile et al., 2021; Kumar et al., 2017, 2022). By isolating pulvinar-cortical from broader thalamocortical connectivity, our analysis was able to provide additional insights into the spatial organization of its connectivity with different cortical networks, highlighting the pulvinar's remarkable functional diversity and complexity.”

      (4) In addition to DWI/DSI and PET, the study also uses fMRI, which allows for functional interaction in time. It may be worth reflecting in the discussion that the observed gradient organization of the pulvinar could have detailed aspects in the temporal domain, which might not be fully captured in the time-averaged embeddings.

      We thank the Reviewer for their insightful observation. We recognize that the exploration of brain temporal dynamics is a compelling area of research due to its extensive correlation with multiple hierarchical aspects of brain information processing. Examining the functional organization of the pulvinar complex in the temporal domain lies beyond the scope of the present work and will be the subject of further investigation. On the other hand, it is possible that certain aspects of the spatial organization of pulvinar connectivity may be influenced by the temporal dynamics of cortico-thalamic information processing. Intrinsic timescales have been consistently shown to progressively increase from unimodal to multimodal associative cortical regions. Furthermore, cortico-thalamic connectivity in matrix-rich regions has been correlated with cortical timescales.

      To address this point, we have added the following lines to the Discussion section:

      “In this context, it could be hypothesized that the observed gradient organization of the pulvinar may also exhibit specific patterns in the temporal domain. Indeed, multiple investigations have linked the temporal dynamics of cortical regions to different aspects of information processing (Rossi-Pool et al., 2021; Soltani et al., 2021). Notably, intrinsic neural timescales of functional activity have been associated with the functional specialization and gradient organization of the cerebral cortex (Golesorkhi et al., 2021), with shorter timescales in unimodal sensory regions and longer ones in transmodal networks (Ito et al., 2020; Murray et al., 2014). Moreover, thalamocortical connectivity has been shown to correlate with these patterns of intrinsic timescales (Müller et al., 2020). In addition, modulatory neurotransmitters such as serotonin and dopamine have been demonstrated to play a significant role in modulating functional cortical dynamics across different timescales (Hansen, Shafiei, et al., 2022; Luppi et al., 2023). Exploring how the spatial organization of the pulvinar relates to temporal dynamics and timescale modulation could provide valuable insights and represents a promising avenue for future investigations.”

      (5) The K-means clustering (Supplementary Figure 1) used has limitations, particularly with respect to the structure of the data. Another aspect is the reproducibility of the model-order selection. Did the reliability and reproducibility assessment produce a similar number of clusters with the LEMON data as with the HCP data?

      We acknowledge the limitations of k-means clustering, particularly regarding the stability and reproducibility of the model order. To address these concerns, we iteratively ran the clustering algorithm 50 times on bootstrap resamples to enhance the stability of the silhouette score estimates. In addition, we have now replicated the analysis on the secondary dataset, as suggested by the Reviewer (Author response image 2). The silhouette plots show a similar number of clusters between the two datasets for functional connectivity gradients, with minor differences observed in the results for structural connectivity gradients and multimodal gradient clustering. Notably, we did not find a high degree of similarity between the results of gradient clustering and histologically defined nuclei, further underscoring the distinct organizational patterns identified through our analysis.

      This reinforces the relevance of using gradient-based approaches to reveal insights into the functional and structural organization of the pulvinar complex that may not align strictly with discrete, histologically defined subdivisions.

      Author response image 2.

      K-means clustering of pulvinar gradients on the secondary dataset (LEMON) and their correspondence with histological pulvinar nuclei. Panels on the left show the silhouette plots for left and right pulvinar clustering solutions; error bars are standard error calculated across 50 resamples. Panels on the right show matrix plots of Dice similarity coefficients for pulvinar clusters against histological nuclei (AAL3 atlas). INF: inferior; ANT: anterior; LAT: lateral; MED: medial.
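
      For readers unfamiliar with the overlap measure named in the caption above, the Dice similarity coefficient between a cluster and a nucleus can be sketched as below. This is a minimal illustration under stated assumptions: each region is represented as a set of voxel indices in a common space, and the function names `dice` and `dice_matrix` as well as the toy voxel indices are hypothetical, not taken from the analysis code.

```python
def dice(region_a, region_b):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) between two voxel sets."""
    a, b = set(region_a), set(region_b)
    if not a and not b:
        return float("nan")  # undefined when both regions are empty
    return 2 * len(a & b) / (len(a) + len(b))


def dice_matrix(clusters, nuclei):
    """Dice coefficient for every (cluster, nucleus) pair.

    clusters, nuclei: dicts mapping a label to a set of voxel indices.
    """
    return {
        c_name: {n_name: dice(c_vox, n_vox) for n_name, n_vox in nuclei.items()}
        for c_name, c_vox in clusters.items()
    }


# Toy example with made-up voxel indices.
clusters = {"cluster 1": {1, 2, 3}, "cluster 2": {4, 5, 6, 7}}
nuclei = {"MED": {2, 3, 4}, "LAT": {5, 6, 7}}
overlap = dice_matrix(clusters, nuclei)
```

      A value of 1 indicates identical regions and 0 indicates no shared voxels, so low values across a whole row mean that a cluster does not map onto any single histological nucleus.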

      (6) The pulvinar correlates of the unimodal-transmodal cortical gradient (Figure 4) show an association with almost the entire brain (Figure 4C, violin plot). It would be interesting to back this association with known anatomical connectivity studies in animals that show connections to these network areas. To my limited knowledge, I am not aware of pulvinar tracer studies showing such extensive connectivity across the entire cortex.

      As our structural connectivity estimates are based on tractography, they are subject to the known limitation of potentially overestimating anatomical connectivity. A technical clarification is warranted: since structural connectivity is grouped by networks, it is strongly influenced by connections to specific cortical regions within each network. This explains the uneven and asymmetric distribution of structural gradient-weighted connectivity observed in our results and does not imply widespread connectivity across the entire cortex.

      Nonetheless, structural connectivity of the pulvinar to cortical regions in primates encompasses a remarkably broad array of cortical areas, including predominantly occipital (Adams et al., 2000; Benevento, 1976; Casanova et al., 1989), temporal (Berman & Wurtz, 2010; Gattass et al., 2018; Homman-Ludiye et al., 2020) and parietal cortices (Asanuma et al., 1985; Baleydier & Morel, 1992). Additionally, to a more limited extent, connections to the cingulate gyrus and portions of the lateral prefrontal cortex have also been documented (Baleydier & Mauguiere, 1985; Baleydier & Mauguiere, 1987). These connectivity patterns are in line with prior accounts of structural connectivity of the human pulvinar (Arcaro et al., 2015; Basile et al., 2021; Leh et al., 2008; Tamietto et al., 2012), and with the patterns identified in our work (Author response image 1). Such findings provide further validation of the structural connectivity profiles explored in the present study.

      References

      Adams, M. M., Hof, P. R., Gattass, R., Webster, M. J., & Ungerleider, L. G. (2000). Visual cortical projections and chemoarchitecture of macaque monkey pulvinar. The Journal of Comparative Neurology, 419(3), 377–393. https://doi.org/10.1002/(SICI)1096-9861(20000410)419:3<377::AID-CNE9>3.0.CO;2-E

      Arcaro, M. J., Pinsk, M. A., & Kastner, S. (2015). The anatomical and functional organization of the human visual pulvinar. Journal of Neuroscience. https://doi.org/10.1523/JNEUROSCI.1575-14.2015

      Asanuma, C., Andersen, R. A., & Cowan, W. M. (1985). The thalamic relations of the caudal inferior parietal lobule and the lateral prefrontal cortex in monkeys: Divergent cortical projections from cell clusters in the medial pulvinar nucleus. Journal of Comparative Neurology, 241(3), 357–381. https://doi.org/10.1002/cne.902410309

      Baleydier, C., & Mauguiere, F. (1985). Anatomical evidence for medial pulvinar connections with the posterior cingulate cortex, the retrosplenial area, and the posterior parahippocampal gyrus in monkeys. Journal of Comparative Neurology. https://doi.org/10.1002/cne.902320207

      Baleydier, C., & Mauguiere, F. (1987). Network organization of the connectivity between parietal area 7, posterior cingulate cortex and medial pulvinar nucleus: A double fluorescent tracer study in monkey. Experimental Brain Research, 66(2). https://doi.org/10.1007/BF00243312

      Baleydier, C., & Morel, A. (1992). Segregated thalamocortical pathways to inferior parietal and inferotemporal cortex in macaque monkey. Visual Neuroscience, 8(5), 391–405. https://doi.org/10.1017/S0952523800004922

      Basile, G. A., Bertino, S., Bramanti, A., Anastasi, G. P., Milardi, D., & Cacciola, A. (2021). In Vivo Super-Resolution Track-Density Imaging for Thalamic Nuclei Identification. Cerebral Cortex. https://doi.org/10.1093/cercor/bhab184

      Benevento. (1976). The Cortical Projections of the Inferior Pulvinar and Adjacent Lateral Pulvinar in the Rhesus Monkey ( Macaca. October, 108, 1–24.

      Berman, R. A., & Wurtz, R. H. (2010). Functional Identification of a Pulvinar Path from Superior Colliculus to Cortical Area MT. The Journal of Neuroscience, 30(18), 6342–6354. https://doi.org/10.1523/JNEUROSCI.6176-09.2010

      Cai, L. Y., Yang, Q., Hansen, C. B., Nath, V., Ramadass, K., Johnson, G. W., Conrad, B. N., Boyd, B. D., Begnoche, J. P., Beason-Held, L. L., Shafer, A. T., Resnick, S. M., Taylor, W. D., Price, G. R., Morgan, V. L., Rogers, B. P., Schilling, K. G., & Landman, B. A. (2021). PreQual: An automated pipeline for integrated preprocessing and quality assurance of diffusion weighted MRI images. Magnetic Resonance in Medicine, 86(1), 456. https://doi.org/10.1002/mrm.28678

      Casanova, C., Freeman, R. D., & Nordmann, J. P. (1989). Monocular and binocular response properties of cells in the striate-recipient zone of the cat’s lateral posterior-pulvinar complex. Journal of Neurophysiology. https://doi.org/10.1152/jn.1989.62.2.544

      Gattass, R., Soares, J. G. M., & Lima, B. (2018). Comparative Pulvinar Organization Across Different Primate Species (pp. 37–37). https://doi.org/10.1007/978-3-319-70046-5_8

      Golesorkhi, M., Gomez-Pilar, J., Tumati, S., Fraser, M., & Northoff, G. (2021). Temporal hierarchy of intrinsic neural timescales converges with spatial core-periphery organization. Communications Biology, 4(1), 277. https://doi.org/10.1038/s42003-021-01785-z

      Hansen, J. Y., Markello, R. D., Tuominen, L., Nørgaard, M., Kuzmin, E., Palomero-Gallagher, N., Dagher, A., & Misic, B. (2022). Correspondence between gene expression and neurotransmitter receptor and transporter density in the human brain. NeuroImage, 264, 119671. https://doi.org/10.1016/j.neuroimage.2022.119671

      Hansen, J. Y., Shafiei, G., Markello, R. D., Smart, K., Cox, S. M. L., Nørgaard, M., Beliveau, V., Wu, Y., Gallezot, J.-D., Aumont, É., Servaes, S., Scala, S. G., DuBois, J. M., Wainstein, G., Bezgin, G., Funck, T., Schmitz, T. W., Spreng, R. N., Galovic, M., … Misic, B. (2022). Mapping neurotransmitter systems to the structural and functional organization of the human neocortex. Nature Neuroscience, 25(11), 1569–1581. https://doi.org/10.1038/s41593-022-01186-3

      Homman-Ludiye, J., Mundinano, I. C., Kwan, W. C., & Bourne, J. A. (2020). Extensive Connectivity Between the Medial Pulvinar and the Cortex Revealed in the Marmoset Monkey. Cerebral Cortex, 30(3), 1797–1812. https://doi.org/10.1093/cercor/bhz203

      Iglesias, J. E., Insausti, R., Lerma-Usabiaga, G., Bocchetta, M., Van Leemput, K., Greve, D. N., van der Kouwe, A., Fischl, B., Caballero-Gaudes, C., & Paz-Alonso, P. M. (2018). A probabilistic atlas of the human thalamic nuclei combining ex vivo MRI and histology. NeuroImage, 183, 314–326. https://doi.org/10.1016/j.neuroimage.2018.08.012

      Ito, T., Hearne, L. J., & Cole, M. W. (2020). A cortical hierarchy of localized and distributed processes revealed via dissociation of task activations, connectivity changes, and intrinsic timescales. NeuroImage, 221, 117141. https://doi.org/10.1016/j.neuroimage.2020.117141

      Kumar, V. J., Beckmann, C. F., Scheffler, K., & Grodd, W. (2022). Relay and higher-order thalamic nuclei show an intertwined functional association with cortical-networks. Communications Biology, 5(1), 1–17. https://doi.org/10.1038/s42003-022-04126-w

      Kumar, V. J., van Oort, E., Scheffler, K., Beckmann, C. F., & Grodd, W. (2017). Functional anatomy of the human thalamus at rest. NeuroImage, 147, 678–691. https://doi.org/10.1016/j.neuroimage.2016.12.071

      Leh, S. E., Chakravarty, M. M., & Ptito, A. (2008). The Connectivity of the Human Pulvinar: A Diffusion Tensor Imaging Tractography Study. International Journal of Biomedical Imaging, 2008, 1–5. https://doi.org/10.1155/2008/789539

      Luppi, A. I., Hansen, J. Y., Adapa, R., Carhart-Harris, R. L., Roseman, L., Timmermann, C., Golkowski, D., Ranft, A., Ilg, R., Jordan, D., Bonhomme, V., Vanhaudenhuyse, A., Demertzi, A., Jaquet, O., Bahri, M. A., Alnagger, N. L. N., Cardone, P., Peattie, A. R. D., Manktelow, A. E., … Stamatakis, E. A. (2023). In vivo mapping of pharmacologically induced functional reorganization onto the human brain’s neurotransmitter landscape. Science Advances, 9(24), eadf8332. https://doi.org/10.1126/sciadv.adf8332

      Müller, E. J., Munn, B., Hearne, L. J., Smith, J. B., Fulcher, B., Arnatkevičiūtė, A., Lurie, D. J., Cocchi, L., & Shine, J. M. (2020). Core and matrix thalamic sub-populations relate to spatio-temporal cortical connectivity gradients. NeuroImage, 222, 117224. https://doi.org/10.1016/j.neuroimage.2020.117224

      Murphy, K., Bodurka, J., & Bandettini, P. A. (2006). How long to scan? The relationship between fMRI temporal signal to noise and necessary scan duration. NeuroImage, 34(2), 565. https://doi.org/10.1016/j.neuroimage.2006.09.032

      Murray, J. D., Bernacchia, A., Freedman, D. J., Romo, R., Wallis, J. D., Cai, X., Padoa-Schioppa, C., Pasternak, T., Seo, H., Lee, D., & Wang, X.-J. (2014). A hierarchy of intrinsic timescales across primate cortex. Nature Neuroscience, 17(12), 1661–1663. https://doi.org/10.1038/nn.3862

      Oldham, S., & Ball, G. (2023). A phylogenetically-conserved axis of thalamocortical connectivity in the human brain. Nature Communications, 14(1), 6032. https://doi.org/10.1038/s41467-023-41722-8

      Rolls, E. T., Huang, C.-C., Lin, C.-P., Feng, J., & Joliot, M. (2020). Automated anatomical labelling atlas 3. NeuroImage, 206, 116189. https://doi.org/10.1016/j.neuroimage.2019.116189

      Rossi-Pool, R., Zainos, A., Alvarez, M., Parra, S., Zizumbo, J., & Romo, R. (2021). Invariant timescale hierarchy across the cortical somatosensory network. Proceedings of the National Academy of Sciences, 118(3), e2021843118. https://doi.org/10.1073/pnas.2021843118

      Shipp, S. (2003). The functional logic of cortico-pulvinar connections. Philosophical Transactions of the Royal Society B: Biological Sciences, 358(1438), 1605–1624. https://doi.org/10.1098/rstb.2002.1213

      Soltani, A., Murray, J. D., Seo, H., & Lee, D. (2021). Timescales of cognition in the brain. Current Opinion in Behavioral Sciences, 41, 30–37. https://doi.org/10.1016/j.cobeha.2021.03.003

      Su, J. H., Thomas, F. T., Kasoff, W. S., Tourdias, T., Choi, E. Y., Rutt, B. K., & Saranathan, M. (2019). Thalamus Optimized Multi Atlas Segmentation (THOMAS): Fast, fully automated segmentation of thalamic nuclei from structural MRI. NeuroImage, 194, 272–282. https://doi.org/10.1016/j.neuroimage.2019.03.021

      Tamietto, M., Pullens, P., de Gelder, B., Weiskrantz, L., & Goebel, R. (2012). Subcortical Connections to Human Amygdala and Changes following Destruction of the Visual Cortex. Current Biology, 22(15), 1449–1455. https://doi.org/10.1016/j.cub.2012.06.006

      Yang, S., Meng, Y., Li, J., Li, B., Fan, Y.-S., Chen, H., & Liao, W. (2020). The thalamic functional gradient and its relationship to structural basis and cognitive relevance. NeuroImage, 218, 116960. https://doi.org/10.1016/j.neuroimage.2020.116960

    1. Author Response

      The following is the authors’ response to the original reviews.

      Firstly, we must take a moment to express our sincere gratitude to the editorial board for allowing this work to be reviewed, and to the peer reviewers for taking the time and effort to review our manuscript. The reviews are thoughtful and reflect the careful work of scientists who undoubtedly have many things on their schedule. We cannot express our gratitude enough. This is not a minor sentiment. We appreciate the engagement.

      Allow us to briefly highlight some of the changes made to the revised manuscript, most on behalf of suggestions made by the reviewers:

      1) A supplementary figure that includes the calculation of drug applicability and variant vulnerability for a different data set–16 alleles of dihydrofolate reductase, and two antifolate compounds used to treat malaria–pyrimethamine and cycloguanil.

      2) New supplementary figures that add depth to the result in Figure 1 (the fitness graphs): we demonstrate how the rank order of alleles changes across drug environments and offer a statistical comparison of the equivalence of these fitness landscapes.

      3) A new subsection that explains our specific method used to measure epistasis.

      4) Improved main text with clarifications, fixed errors, and other addendums.

      5) Improved referencing and citations, in the spirit of better scholarship (now with over 70 references).

      Next, we’ll offer some general comments that we believe apply to several of the reviews, and to the eLife assessment. We have provided the bulk of the responses in some general comments, and in response to the public reviews. We have also included the suggestions and made brief comments to some of the individual recommendations.

      On the completeness of our analysis

      In our response, we’ll address the completeness issue first, as iterations of it appear in several of the reviews, and it seems to be one of the most substantive philosophical critiques of the work (there are virtually no technical corrections, outside of formatting and grammar fixes, which we are grateful to the reviewers for identifying).

      To begin our response, we will relay that we have now included an analysis of a data set corresponding to mutants of a protein, dihydrofolate reductase (DHFR), from Plasmodium falciparum (a main cause of malaria), across two antifolate drugs (pyrimethamine and cycloguanil). We have also decided to include this new analysis in the supplementary material (see Figure S4).

      Author response image 1.

      Drug applicability and variant vulnerability for 16 alleles of dihydrofolate reductase.

      Here we compute the variant vulnerability and drug applicability metrics for two drugs, pyrimethamine (PYR) and cycloguanil (CYC), both antifolate drugs used to treat malaria. This is a completely different system than the one that is the focus of the submitted paper, for a different biomedical problem (antimalarial resistance), using different drugs, and targets. Further, the new data provide information on both drugs of different kinds, and drug concentrations (as suggested by Reviewer #1; we’ve also added a note about this in the new supplementary material). Note that these data have already been the subject of detailed analyses of epistatic effects, and so we did not include those here, but we do offer that reference:

      ● Ogbunugafor CB. The mutation effect reaction norm (mu-rn) highlights environmentally dependent mutation effects and epistatic interactions. Evolution. 2022 Feb 1;76(s1):37-48.

      ● Diaz-Colunga J, Sanchez A, Ogbunugafor CB. Environmental modulation of global epistasis is governed by effective genetic interactions. bioRxiv. 2022:202211.

      Computing our proposed metrics across different drugs is relatively simple, and we could have populated our paper with suites of similar analyses across data sets of various kinds. Such a paper would, in our view, be spread too thin–the evolution of antifolate resistance and/or antimalarial resistance are enormous problems, with large literatures that warrant focused studies. More generally, as the reviewers doubtlessly understand, simply analyzing more data sets does not make a study stronger, especially one like ours, that is using empirical data to both make a theoretical point about alleles and drugs and offer a metric that others can apply to their own data sets.

      Our approach focused on a data set that allowed us to discuss the biology of a system: a far stronger paper, a far stronger proof-of-concept for a new metric. We will revisit this discussion about the structure of our study. But before doing so, we will elaborate on why the “more is better” tone of the reviews is misguided.

      We also note that the study from which the data originate (Mira et al. 2015) is focused on a single data set of a single drug-target system. We should also point out that Mira et al. 2015 made a general point about drug concentrations influencing the topography of fitness landscapes, not unlike our general point about metrics used to understand features of alleles and different drugs in antimicrobial systems.

      This isn’t meant to serve as a feeble appeal to authority – just because something happened in one setting doesn’t make it right for another. But other than a nebulous appeal to the fact that things have changed in the 8 years since that study was published, it is difficult to argue why one study system was permissible for other work but is somehow “incomplete” in ours. Double standards can be appropriate when they are justified, but in this case, it hasn’t been made clear, and there is no technical basis for it.

      Our study does what countless other successful ones do: utilizes a biological system to make a general point about some phenomena in the natural world. In our case, we were focused on the need for more evolution-inspired iterations of widely used concepts like druggability. For example, a recent study of epistasis focused on a single set of alleles, across several drugs, not unlike our study:

      ● Lozovsky ER, Daniels RF, Heffernan GD, Jacobus DP, Hartl DL. Relevance of higher-order epistasis in drug resistance. Molecular biology and evolution. 2021 Jan;38(1):142-51.

      Next, we assert that there is a difference between an eagerness to see a new metric applied to many different data sets (a desire we share, and plan on pursuing in the future), and the notion that an analysis is “incomplete” without it. The latter is a more serious charge and suggests that the researcher-authors neglected to properly construct an argument because of gaps in the data. This charge does not apply to our manuscript, at all. And none of the reviewers effectively argued otherwise.

      Our study contains 7 different combinatorially-complete datasets, each composed of 16 alleles (this is not including the new analysis of antifolates that now appears in the revision). One can call these datasets “small” or “low-dimensional,” if they choose (we chose to put this front-and-center, in the title). They are, however, both complete and as large or larger than many datasets in similar studies of fitness landscapes:

      ● Knies JL, Cai F, Weinreich DM. Enzyme efficiency but not thermostability drives cefotaxime resistance evolution in TEM-1 β-lactamase. Molecular biology and evolution. 2017 May 1;34(5):1040-54.

      ● Lozovsky ER, Daniels RF, Heffernan GD, Jacobus DP, Hartl DL. Relevance of higher-order epistasis in drug resistance. Molecular biology and evolution. 2021 Jan;38(1):142-51.

      ● Rodrigues JV, Bershtein S, Li A, Lozovsky ER, Hartl DL, Shakhnovich EI. Biophysical principles predict fitness landscapes of drug resistance. Proceedings of the National Academy of Sciences. 2016 Mar 15;113(11):E1470-8.

      ● Ogbunugafor CB, Eppstein MJ. Competition along trajectories governs adaptation rates towards antimicrobial resistance. Nature ecology & evolution. 2016 Nov 21;1(1):0007.

      ● Lindsey HA, Gallie J, Taylor S, Kerr B. Evolutionary rescue from extinction is contingent on a lower rate of environmental change. Nature. 2013 Feb 28;494(7438):463-7.

      These are only five of very many such studies, some of them very well-regarded.

      Having now gone on about the point about the data being “incomplete,” we’ll next move to the more tangible comment-criticism about the low-dimensionality of the data set, or the fact that we examined a single drug-drug target system (β lactamases, and β-lactam drugs).

      The criticism, as we understand it, is that the authors could have analyzed more data.

      This is a common complaint, that “more is better” in biology. While we appreciate the feedback from the reviewers, we notice that no one specified what constitutes the right amount of data. Some pointed to other single data sets, but would analyzing two different sets qualify as enough? Perhaps to person A, but not to persons B - Z. This is a matter of opinion and is not a rigorous comment on the quality of the science (or completeness of the analysis).

      ● Should we analyze five more drugs of the same target (beta lactamases)? And what bacterial orthologs?

      ● Should we analyze 5 antifolates for 3 different orthologs of dihydrofolate reductase?

      ● And in which species or organism type? Bacteria? Parasitic infections?

      ● And why only infectious disease? Aren’t these concepts also relevant to cancer? (Yes, they are.)

      ● And what about the number of variants in the aforementioned target? Should one aim for small combinatorially complete sets? Or vaster swaths of sequence space, such as the ones generated by deep mutational scanning and other methods?

      We offer these options in part because, for the most part, we were not given an objective suggestion for an appropriate level of detail. This is because there is no answer to the question of what size of dataset would be most appropriate. Unfortunately, without a technical reason why a data set of unspecified size [X] or [Y] is best, then we are left with a standard “do more work” peer review response, one that the authors are not inclined to engage seriously, because there is no scientific rationale for it.

      The most charitable explanation for why more datasets would be better is tied to the abstract notion that seeing a metric measured in different data sets somehow makes it more believable. This, as the reviewers undoubtedly understand, isn’t necessarily true (in fact, many poor studies mask a lack of clarity with lots of data).

      To double down on this take, we’ll even argue the opposite: that our focus on a single drug system is a strength of the study.

      The focus on a single-drug class allows us to practice the lost art of discussing the peculiar biology of the system that we are examining. Even more, the low dimensionality allows us to discuss–in relative detail–individual mutations and suites of mutations. We do so several times in the manuscript, and even connect our findings to literature that has examined the biophysical consequences of mutations in these very enzymes.

      (For example: Knies JL, Cai F, Weinreich DM. Enzyme efficiency but not thermostability drives cefotaxime resistance evolution in TEM-1 β-lactamase. Molecular biology and evolution. 2017 May 1;34(5):1040-54.)

      Such detail is only legible in a full-length manuscript because we were able to interrogate a system in good detail. That is, the low-dimensionality (of a complete data set) is a strength, rather than a weakness. This was actually part of the design choice for the study: to offer a new metric with broad application but developed using a system where the particulars could be interrogated and discussed.

      Surely the findings that we recover are engineered for broader application. But to suggest that we need to apply them broadly in order to demonstrate their broad impact is somewhat antithetical to both model systems research and to systems biology, both of which have been successful in extracting general principles from singular (often simple) systems and models.

      An alternative approach, where the metric was wielded across an unspecified number of datasets, would lead to a manuscript that is unfocused, reading like many modern machine learning papers, where the analysis or discussion has little to do with actual biology. We very specifically avoided this sort of study.

      To close our comments regarding data: Firstly, we have considered the comments and analyzed a different data set, corresponding to a different drug-target system (antifolate drugs, and DHFR). Moreover, we don’t think more data has anything to do with a better answer or support for our conclusions or any central arguments. Our arguments were developed from the data set that we used but achieve what responsible systems biology does: introduces a framework that one can apply more broadly. And we develop it using a complete, and well-vetted dataset. If the reviewers have a philosophical difference of opinion about this, we respect it, but it has nothing to do with our study being “complete” or not. And it doesn’t speak to the validity of our results.

      Related: On the dependence of our metrics on drug-target system

      Several comments were made that suggest the relevance of the metric may depend on the drug being used. We disagree with this, and in fact, have argued the opposite: the metrics are specifically useful because they are not encumbered with unnecessary variables. They are the product of rather simple arithmetic that is completely agnostic to biological particulars.

      We explain, in the section entitled “Metric Calculations”:

      “To estimate the two metrics we are interested in, we must first quantify the susceptibility of an allelic variant to a drug. We define susceptibility as $1 - w$, where w is the mean growth of the allelic variant under drug conditions relative to the mean growth of the wild-type/TEM-1 control. If a variant is not significantly affected by a drug (i.e., growth under drug is not statistically lower than growth of wild-type/TEM-1 control, by t-test P-value < 0.01), its susceptibility is zero. Values in these metrics are summaries of susceptibility: the variant vulnerability of an allelic variant is its average susceptibility across drugs in a panel, and the drug applicability of an antibiotic is the average susceptibility of all variants to it.”

      That is, these can be animated to compute the variant vulnerability and drug applicability for data sets of various kinds. To demonstrate this (and we thank the reviewers for suggesting it), we have analyzed the antifolate-DHFR data set as outlined above.
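The arithmetic in the quoted passage can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the manuscript: the function names are hypothetical, the growth values are made up, and the significance test is represented by a precomputed boolean mask (True where growth under drug is significantly lower than the wild-type/TEM-1 control) instead of being run here.

```python
import numpy as np


def susceptibility(rel_growth, significant):
    """Susceptibility 1 - w, set to zero where the drug effect is not significant.

    rel_growth:  variants x drugs array of w, mean growth under drug relative
                 to the wild-type control.
    significant: boolean array of the same shape (assumed to be precomputed
                 from a statistical test; hypothetical input).
    """
    rel_growth = np.asarray(rel_growth, dtype=float)
    significant = np.asarray(significant, dtype=bool)
    return np.where(significant, 1.0 - rel_growth, 0.0)


def variant_vulnerability(s):
    """Average susceptibility of each variant across the drugs in the panel."""
    return np.asarray(s).mean(axis=1)


def drug_applicability(s):
    """Average susceptibility of all variants to each drug."""
    return np.asarray(s).mean(axis=0)


# Made-up example: 3 variants (rows) x 2 drugs (columns).
w = [[0.5, 1.0],
     [0.8, 0.6],
     [1.0, 0.9]]
sig = [[True, False],
       [True, True],
       [False, False]]

s = susceptibility(w, sig)
print("variant vulnerability:", variant_vulnerability(s))
print("drug applicability:", drug_applicability(s))
```

Because both metrics are plain row and column means of the same susceptibility matrix, the sketch applies unchanged to any variants-by-drugs table, which is the sense in which the calculation is agnostic to the drug-target system.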

      Finally, we will make the following light, but somewhat cynical point (that relates to the “more data” point more generally): the wrong metric applied to 100 data sets is little more than 100 wrong analyses. Simply applying the metric to a wide number of datasets has nothing to do with the veracity of the study. Our study, alternatively, chose the opposite approach: it used a data set for a focused study where metrics were extracted. We believe this to be a much more rigorous way to introduce new metrics.

      On the Relevance of simulations

      Somewhat relatedly, the eLife summary and one of the reviewers mentioned the potential benefit of simulations. Reviewer 1 correctly highlights that the authors have a lot of experience in this realm, and so generating simulations would be trivial. For example, the authors have been involved in studies such as these:

      ● Ogbunugafor CB, Eppstein MJ. Competition along trajectories governs adaptation rates towards antimicrobial resistance. Nature ecology & evolution. 2016 Nov 21;1(1):0007.

      ● Ogbunugafor CB, Wylie CS, Diakite I, Weinreich DM, Hartl DL. Adaptive landscape by environment interactions dictate evolutionary dynamics in models of drug resistance. PLoS computational biology. 2016 Jan 25;12(1):e1004710.

      ● Ogbunugafor CB, Hartl D. A pivot mutation impedes reverse evolution across an adaptive landscape for drug resistance in Plasmodium vivax. Malaria Journal. 2016 Dec;15:1-0.

      From the above and dozens of other related studies, we’ve learned that simulations are critical for questions about the end results of dynamics across fitness landscapes of varying topography. To simulate across the datasets in the submitted study would be a small ask. We do not provide this, however, because our study is not about the dynamics of de novo evolution of resistance. In fact, our study focuses on a different problem, no less important for understanding how resistance evolves: determining static properties of alleles and drugs, that provide a picture into their ability to withstand a breadth of drugs in a panel (variant vulnerability), or the ability of a drug in a panel to affect a breadth of drug targets.

      The authors speak on this in the Introduction:

      “While stepwise, de novo evolution (via mutations and subsequent selection) is a key force in the evolution of antimicrobial resistance, evolution in natural settings often involves other processes, including horizontal gene transfer and selection on standing genetic variation. Consequently, perspectives that consider variation in pathogens (and their drug targets) are important for understanding treatment at the bedside. Recent studies have made important strides in this arena. Some have utilized large data sets and population genetics theory to measure cross-resistance and collateral sensitivity. Fewer studies have made use of evolutionary concepts to establish metrics that apply to the general problem of antimicrobial treatment on standing genetic variation in pathogen populations, or for evaluating the utility of certain drugs’ ability to treat the underlying genetic diversity of pathogens”

      That is, the proposed metrics aren’t about the dynamics of stepwise evolution across fitness landscapes, and so, simulating those dynamics doesn’t offer much for our question. What we have done instead is much more direct and allows the reader to follow a logic: clearly demonstrate the topography differences in Figure 1 (and Supplemental Figures S2 and S3 with rank order changes).

      Author response image 2.

      These results tell the reader what they need to know: that the topography of fitness landscapes changes across drug types. Further, we should note that Mira et al. 2015 already told the basic story that one finds different adaptive solutions across different drug environments. (Notably, without computational simulations).

      In summary, we attempted to provide a rigorous, clean, and readable study that introduced two new metrics. Appeals to adding extra analysis would be considered if they augmented the study’s goals. We do not believe this to be the case.

      Nonetheless, we must reiterate our appreciation for the engagement and suggestions. All were made with great intentions. This is more than one could hope for in a peer review exchange. The authors are truly grateful.

      eLife assessment

      The work introduces two valuable concepts in antimicrobial resistance: "variant vulnerability" and "drug applicability", which can broaden our ways of thinking about microbial infections through evolution-based metrics. The authors present a compelling analysis of a published dataset to illustrate how informative these metrics can be, but the study is still incomplete, as only a subset of a single dataset on a single class of antibiotics was analyzed. Analyzing more datasets, with other antibiotic classes and resistance mutations, and performing additional theoretical simulations could demonstrate the general applicability of the new concepts.

      The authors disagree strongly with the idea that the study is “incomplete,” and encourage the editors and reviewers to reconsider this language. Not only are the data combinatorially complete, but they are also larger in size than many similar studies of fitness landscapes. Insofar as no technical justification was offered for this “incomplete” summary, we think it should be removed. Furthermore, we question the utility of “theoretical simulations.” They are rather easy to execute but distract from the central aims of the study: to introduce new metrics, in the vein of other metrics–like druggability, IC50, MIC–that describe properties of drugs or drug targets.

      Public Reviews:

      Reviewer #1 (Public Review):

      The manuscript by Guerrero and colleagues introduces two new metrics that extend the concept of "druggability" - loosely speaking, the potential suitability of a particular drug, target, or drug-target interaction for pharmacological intervention - to collections of drugs and genetic variants. The study draws on previously measured growth rates across a combinatorially complete mutational landscape involving 4 variants of the TEM-50 (beta lactamase) enzyme, which confers resistance to commonly used beta-lactam antibiotics. To quantify how growth rate - in this case, a proxy for evolutionary fitness - is distributed across allelic variants and drugs, they introduce two concepts: "variant vulnerability" and "drug applicability".

      Variant vulnerability is the mean vulnerability (1-normalized growth rate) of a particular variant to a library of drugs, while drug applicability measures the mean across the collection of genetic variants for a given drug. The authors rank the drugs and variants according to these metrics. They show that the variant vulnerability of a particular mutant is uncorrelated with the vulnerability of its one-step neighbors and analyze how higher-order combinations of single variants (SNPs) contribute to changes in growth rate in different drug environments.

      The work addresses an interesting topic and underscores the need for evolution-based metrics to identify candidate pharmacological interventions for treating infections. The authors are clear about the limitations of their approach - they are not looking for immediate clinical applicability - and provide simple new measures of druggability that incorporate an evolutionary perspective, an important complement to the orthodoxy of aggressive, kill-now design principles. I think the ideas here will interest a wide range of readers, but I think the work could be improved with additional analysis - perhaps from evolutionary simulations on the measured landscapes - that tie the metrics to evolutionary outcomes.

      The authors greatly appreciate these comments, and the proposed suggestions by reviewer 1. We have addressed most of the criticisms and suggestions in our comments above.

      Reviewer #2 (Public Review):

      The authors introduce the notions of "variant vulnerability" and "drug applicability" as metrics quantifying the sensitivity of a given target variant across a panel of drugs and the effectiveness of a drug across variants, respectively. Given a data set comprising a measure of drug effect (such as growth rate suppression) for pairs of variants and drugs, the vulnerability of a variant is obtained by averaging this measure across drugs, whereas the applicability of a drug is obtained by averaging the measure across variants.

      The authors apply the methodology to a data set that was published by Mira et al. in 2015. The data consist of growth rate measurements for a combinatorially complete set of 16 genetic variants of the antibiotic resistance enzyme beta-lactamase across 10 drugs and drug combinations at 3 different drug concentrations, comprising a total of 30 different environmental conditions. For reasons that did not become clear to me, the present authors select only 7 out of 30 environments for their analysis. In particular, for each chosen drug or drug combination, they choose the data set corresponding to the highest drug concentration. As a consequence, they cannot assess to what extent their metrics depend on drug concentration. This is a major concern since Mira et al. concluded in their study that the differences between growth rate landscapes measured at different concentrations were comparable to the differences between drugs. If the new metrics display a significant dependence on drug concentration, this would considerably limit their usefulness.

      The authors appreciate the point about drug concentration, and it is one that the authors have made in several studies.

      The quick answer is that whether the metrics are useful for drug type-concentration A or B will depend on drug type-concentration A or B. If there are notable differences in the topography of the fitness landscape across concentrations, then we should expect the metrics to differ. What Reviewer #2 points out as a “major concern” is in fact a strength of the metrics: they are agnostic with respect to type of drug, type of target, size of dataset, or topography of the fitness landscape. And so, the authors disagree: the fact that drug concentration could be a major factor in the value of the metrics does not limit their utility. It is simply another variable that one can consider when computing the metrics.

      As discussed above, we have analyzed data from a different data set, in a different drug-target problem (DHFR and antifolate drugs; see supplemental information). These analyses demonstrate how the metrics can be computed across different drug concentrations.

      As a consequence of the small number of variant-drug combinations that are used, the conclusions that the authors draw from their analysis are mostly tentative with weak statistical support. For example, the authors argue that drug combinations tend to have higher drug applicability than single drugs, because a drug combination ranks highest in their panel of 7. However, the effect profile of the single drug cefprozil is almost indistinguishable from that of the top-ranking combination, and the second drug combination in the data set ranks only 5th out of 7.

      We reiterate our appreciation for the engagement. Reviewer #2 generously offers some technical insight on measurements of epistasis, and their opinion on the level of statistical support for our claims. The authors are very happy to engage in a dialogue about these points. We disagree rather strongly, and in addition to the general points raised above (that speak to some of this), will raise several specific rebuttals to the comments from Reviewer #2.

      For one, Reviewer #2 is free to point to what arguments have “weak statistical support.” Having read the review, we aren’t sure what this is referring to. “Weak statistical support” generally applies to findings built from underpowered studies, or designs constructed in a manner that yields effect sizes or p-values that give low confidence that a finding is believable (or is replicable). This sort of problem doesn’t apply to our study for various reasons, the least of which being that our findings are strongly supported, based on a vetted data set, in a system that has long been the object of examination in studies of antimicrobial resistance.

      For example, we did not argue that magnetic fields alter the topography of fitness landscapes, a claim which must stand up to a certain sort of statistical scrutiny. Alternatively, we examined landscapes where the drug environment differed statistically from the non-drug environment and used them to compute new properties of alleles and drugs.

      We can imagine that the reviewer is referring to the low-dimensionality of the fitness landscapes in the study. Again: the features of the dataset are a detail that the authors put into the title of the manuscript. Further, we emphasize that it is not a weakness, but rather, allows the authors to focus, and discuss the specific biology of the system. And we responsibly explain the constraints around our study several times, though none of them have anything to do with “weak statistical support.”

      Even though we aren’t clear what “weak statistical support” means as offered by Reviewer 2, the authors have nonetheless decided to provide additional analyses, now appearing in the new supplemental material.

      We have included a new Figure S2, where we offer an analysis of the topography of the 7 landscapes, based on the Kendall rank order test. This tests the hypothesis that there is no correlation (concordance or discordance) between the topographies of the fitness landscapes.

      Author response image 3.

      Kendall rank test for correlation between the 7 fitness landscapes.
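
      For readers unfamiliar with the statistic, the idea behind a Kendall rank comparison of two landscapes can be sketched as follows. This is an illustrative implementation of the simple tau-a statistic only (no tie correction and no p-value), with made-up growth values; it is not the authors' analysis, and the exact variant they used is not stated here.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical growth rates of the same four genotypes in three drug environments.
landscape_a = [0.9, 0.7, 0.4, 0.1]
landscape_b = [0.8, 0.6, 0.5, 0.2]  # same rank order as landscape_a
landscape_c = [0.1, 0.3, 0.6, 0.9]  # reversed rank order

tau_ab = kendall_tau_a(landscape_a, landscape_b)
tau_ac = kendall_tau_a(landscape_a, landscape_c)
```

      A value near 1 indicates that two environments rank the genotypes concordantly, a value near -1 indicates a reversed ranking, and a value near 0 indicates no rank correlation.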

      In Figure S3, we test the hypothesis that the variant vulnerability values differ. To do this, we calculate a paired t-test. These are paired by haplotype/allelic variant, so the comparisons are change in growth between drugs for each haplotype.

      Author response image 4.

      Paired t-tests for variant vulnerability.
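
      The pairing described above (each haplotype measured under two drugs) can be illustrated with a small sketch. The function below computes only the paired t statistic and its degrees of freedom using the standard library; the growth values are hypothetical, and obtaining a p-value would require a t-distribution routine that is not shown.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b.

    Each position in `a` and `b` must refer to the same haplotype.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    n = len(diffs)
    return mean(diffs) / (sd / sqrt(n)), n - 1


# Hypothetical growth of four haplotypes under two drugs (same haplotype order).
growth_drug_1 = [1.0, 2.0, 3.0, 4.0]
growth_drug_2 = [0.5, 1.0, 2.5, 3.0]

t_stat, dof = paired_t_statistic(growth_drug_1, growth_drug_2)
```

      Because the test operates on per-haplotype differences, it asks whether growth shifts consistently between two drugs, rather than whether the two drugs have different average growth across unrelated samples.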

      To this point raised by Reviewer #2:

      “For example, the authors argue that drug combinations tend to have higher drug applicability than single drugs, because a drug combination ranks highest in their panel of 7. However, the effect profile of the single drug cefprozil is almost indistinguishable from that of the top-ranking combination, and the second drug combination in the data set ranks only 5th out of 7.”

      Our study does not argue that drug combinations are necessarily correlated with a higher drug applicability. Alternatively, we specifically highlight that one of the combinations does not have a high drug applicability:

      “Though all seven drugs/combinations are β-lactams, they have widely varying effects across the 16 alleles. Some of the results are intuitive: for example, the drug regime with the highest drug applicability of the set—amoxicillin/clavulanic acid—is a mixture of a widely used β-lactam (amoxicillin) and a β-lactamase inhibitor (clavulanic acid) (see Table 3). We might expect such a mixture to have a broader effect across a diversity of variants. This high applicability is hardly a rule, however, as another mixture in the set, piperacillin/tazobactam, has a much lower drug applicability (ranking 5th out of the seven drugs in the set) (Table 3).”

      In general, we believe that the submitted paper is responsible with regards to how it extrapolates generalities from the results. Further, the manuscript contains a specific section that explains limitations, clearly and transparently (not especially common in science). For that reason, we’d encourage reviewer #2 to reconsider their perspective. We do not believe that our arguments are built on “weak” support at all. And we did not argue anything particular about drug combinations writ large. We did the opposite— discussed the particulars of our results in light of the biology of the system.

      Thirdly, to this point:

      “To assess the environment-dependent epistasis among the genetic mutations comprising the variants under study, the authors decompose the data of Mira et al. into epistatic interactions of different orders. This part of the analysis is incomplete in two ways. First, in their study, Mira et al. pointed out that a fairly large fraction of the fitness differences between variants that they measured were not statistically significant, which means that the resulting fitness landscapes have large statistical uncertainties. These uncertainties should be reflected in the results of the interaction analysis in Figure 4 of the present manuscript.”

      The authors are uncertain with regards to the “uncertainties” being referred to, but we’ll do our best to understand: our study utilized the 7 drug environments from Mira et al. 2015 with statistically significant differences between growth rates with and without drug. And so, this point about how the original set contained statistically insignificant treatments is not relevant here. We explain this in the methods section:

      “The data that we examine comes from a past study of a combinatorial set of four mutations associated with TEM-50 resistance to β-lactam drugs [39 ]. This past study measured the growth rates of these four mutations in combination, across 15 different drugs (see Supplemental Information).”

      We go on to say the following:

      “We examined these data, identifying a subset of structurally similar β-lactams that also included β-lactams combined with β-lactamase inhibitors, cephalosporins and penicillins. From the original data set, we focus our analyses on drug treatments that had a significant negative effect on the growth of wild-type/TEM-1 strains (one-tailed t-test of wild-type treatment vs. control, P < 0.01). After identifying the data from the set that fit our criteria, we were left with seven drugs or combinations (concentration in μg/ml): amoxicillin 1024 μg/ml (β-lactam), amoxicillin/clavulanic acid 1024 μg/ml (β-lactam and β-lactamase inhibitor), cefotaxime 0.123 μg/ml (third-generation cephalosporin), cefotetan 0.125 μg/ml (second-generation cephalosporin), cefprozil 128 μg/ml (second-generation cephalosporin), ceftazidime 0.125 μg/ml (third-generation cephalosporin), piperacillin and tazobactam 512/8 μg/ml (penicillin and β-lactamase inhibitor). With these drugs/mixtures, we were able to embody chemical diversity in the panel.”

      Again: The goal of our study was to develop metrics that can be used to analyze features of drugs and targets and disentangle these metrics into effects.

      Second, the interpretation of the coefficients obtained from the epistatic decomposition depends strongly on the formalism that is being used (in the jargon of the field, either a Fourier or a Taylor analysis can be applied to fitness landscape data). The authors need to specify which formalism they have employed and phrase their interpretations accordingly.

      The authors appreciate this nuance. Certainly, how to measure epistasis is a large topic of its own. But we recognize that we could have addressed this more directly and have added text to this effect.

      In response to these comments from Reviewer #2, we have added a new section focused on these points (reference syntax removed here for clarity; please see main text for specifics):

      “The study of epistasis, and discussions regarding the means to detect and measure now occupies a large corner of the evolutionary genetics literature. The topic has grown in recent years as methods have been applied to larger genomic data sets, biophysical traits, and the "global" nature of epistatic effects. We urge those interested in more depth treatments of the topic to engage larger summaries of the topic.”

      “Here we will briefly summarize some methods used to study epistasis on fitness landscapes. Several studies of combinatorially-complete fitness landscapes use some variation of Fourier Transform or Taylor formulation. One in particular, the Walsh-Hadamard Transform, has been used to measure epistasis across a wide number of study systems. Furthermore, studies have reconciled these methods with others, or expanded upon the Walsh-Hadamard Transform in a way that can accommodate incomplete data sets. These methods are effective for certain sorts of analyses, and we strongly urge those interested to examine these studies.”

      “The method that we've utilized, the LASSO regression, determines effect sizes for all interactions (alleles and drug environments). It has been utilized for data sets of similar size and structure, on alleles resistant to trimethoprim. Among many benefits, the method can accommodate gaps in data and responsibly incorporates experimental noise into the calculation.”
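
      To make the Walsh-Hadamard Transform mentioned in the quoted passage concrete, here is a minimal sketch for a combinatorially complete landscape. It is an illustration of the general technique only, not the LASSO analysis the authors actually used; the ordering of genotypes (binary counting order), the normalization by the number of genotypes, and the sign convention are assumptions, and conventions differ between papers.

```python
def walsh_hadamard(values):
    """Normalized fast Walsh-Hadamard transform of a list of length 2**n.

    `values[k]` is the phenotype of the genotype whose binary label is k.
    """
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    half = 1
    while half < n:
        for start in range(0, n, 2 * half):
            for i in range(start, start + half):
                a, b = out[i], out[i + half]
                out[i], out[i + half] = a + b, a - b
        half *= 2
    return [v / n for v in out]


# Two-locus toy landscapes, genotypes ordered 00, 01, 10, 11 (hypothetical values).
additive = [0.0, 1.0, 2.0, 3.0]   # effects of the two loci simply add
epistatic = [0.0, 1.0, 2.0, 5.0]  # double mutant exceeds the additive expectation

coef_additive = walsh_hadamard(additive)
coef_epistatic = walsh_hadamard(epistatic)
```

      Under this ordering, the last coefficient corresponds to the interaction between both loci: it vanishes for the additive landscape and is nonzero when the double mutant deviates from the sum of the single-mutant effects.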

      As Reviewer #2 understands, there are many ways to examine epistasis on both high and low-dimensional landscapes. Reviewer #2 correctly offers two sorts of formalisms that allow one to do so. The two offered by Reviewer #2, are not the only means of measuring epistasis in data sets like the one we have offered. But we acknowledge that we could have done a better job outlining this. We thank Reviewer #2 for highlighting this, and believe our revision clarifies this.

      Reviewer #3 (Public Review):

      The authors introduce two new concepts for antimicrobial resistance borrowed from pharmacology, "variant vulnerability" (how susceptible a particular resistance gene variant is across a class of drugs) and "drug applicability" (how useful a particular drug is against multiple allelic variants). They group both terms under an umbrella term "drugability". They demonstrate these features for an important class of antibiotics, the beta-lactams, and allelic variants of TEM-1 beta-lactamase.

      The strength of the result is in its conceptual advance and that the concepts seem to work for beta-lactam resistance. However, I do not necessarily see the advance of lumping both terms under "drugability", as this adds an extra layer of complication in my opinion.

      Firstly, the authors greatly appreciate the comments from Reviewer #3. They are insightful, and prescriptive. And allow us to especially thank reviewer 3 for supplying a commented PDF with some grammatical and phrasing suggestions/edits. This is much appreciated. We have examined all these suggestions and made changes.

      In general, we agree with the spirit of many of the comments. In addition to our prior comments on the scope of our data, we’ll communicate a few direct responses to specific points raised.

      I also think that the utility of the terms could be more comprehensively demonstrated by using examples across different antibiotic classes and/or resistance genes. For instance, another good model with published data might have been trimethoprim resistance, which arises through point mutations in the folA gene (although, clinical resistance tends to be instead conferred by a suite of horizontally acquired dihydrofolate reductase genes, which are not so closely related as the TEM variants explored here).

      1. In our new supplemental material, we now feature an analysis of antifolate drugs, pyrimethamine and cycloguanil. We have discussed this in detail above and thank the reviewer for the suggestion.

      2. Secondly, we agree that the study will have a larger impact when the metrics are applied more broadly. This is an active area of investigation, and our hope is that others apply our metrics more broadly. But as we discussed, such a desire is not a technical criticism of our own study. We stand behind the rigor and insight offered by our study.

      The impact of the work on the field depends on a more comprehensive demonstration of the applicability of these new concepts to other drugs.

      The authors don’t disagree with this point, which applies to virtually every potentially influential study. The importance of a single study can generally only be measured by its downstream application. But this hardly qualifies as a technical critique of our study and does not apply to our study alone. Nor does it speak to the validity of our results. The authors share this interest in applying the metric more broadly.

      Reviewer #1 (Recommendations For The Authors):

      • The main weakness of the work, in my view, is that it does not directly tie these new metrics to a quantitative measure of "performance". The metrics have intuitive appeal, and I think it is likely that they could help guide treatment options-for example, drugs with high applicability could prove more useful under particular conditions. But as the authors note, the landscape is rugged and intuitive notions of evolutionary behavior can sometimes fail. I think the paper would be much improved if the authors could evaluate their new metrics using some type of quantitative evolutionary model. For example, perhaps the authors could simulate evolutionary dynamics on these landscapes in the presence of different drugs. Is the mean fitness achieved in the simulations correlated with, for example, the drug applicability when looking across an ensemble of simulations with the same drug but varied initial conditions that start from each individual variant? Similarly, if you consider an ensemble of simulations where each member starts from the same variant but uses a different drug, is the average fitness gain captured in some way by the variant vulnerability? All simulations will have limitations, of course, but given that the landscape is fully known I think these questions could be answered under some conditions (e.g. strong selection weak mutation limit, where the model could be formulated as a Markov Chain; see 10.1371/journal.pcbi.1004493 or doi: 10.1111/evo.14121 for examples). And given the authors' expertise in evolutionary dynamics, I think it could be achieved in a reasonable time. With that said, I want to acknowledge that with any new "metrics", it can be tempting to think that "we need to understand it all" before it is useful, and I don't want to fall into that trap here.

      The authors respect and appreciate these thoughtful comments.

      As Reviewer #1 highlighted, the authors are experienced with building simulations of evolution. For reasons we have outlined above, we don’t believe they would add to the arc of the current story and may encumber the story with unnecessary distractions. Simulations of evolution can be enormously useful for studies focused on particulars of the dynamics of evolution. This submitted study is not one of those. It is charged with identifying features of alleles and drugs that capture an allele’s vulnerability to treatment (variant vulnerability) and a drug’s effectiveness across alleles (drug applicability). Both features integrate aspects of variation (genetic and environmental), and as such, are improvements over both metrics used to describe drug targets and drugs.

      • The new metrics rely on means, which is a natural choice. Have the authors considered how variance (or other higher moments) might also impact evolutionary dynamics? I would imagine, for example, that the ultimate outcome of a treatment might depend heavily on the shape of the distribution, not merely its mean. This is also something one might be able to get a handle on with simulations.

      These are relevant points, and the authors appreciate them. Certainly, moments other than the mean might have utility. This is the reason that we computed the one-step neighborhood variant vulnerability: to see if the variant vulnerability of an allele was related to properties of its mutational neighborhood. We found no such correlation. There are many other sorts of properties that one might examine (e.g., shape of the distribution, properties of the mutational network, variance, Fano factor, etc.). As we don’t have an informed reason to pursue any one of these in lieu of the others, we are pleased to investigate this in the future.

      Also, while we’ve addressed general points about simulations above, we want to note that our analysis of environmental epistasis does consider the variance. We urge Reviewer #1 to see our new section on “Notes on Methods Used to Measure Epistasis” where we explain some of this and supply references to that effect.

      • As I understand it, the fitness measurements here are measures of per capita growth rate, which is reasonable. However, the authors may wish to briefly comment on the limitations of this choice-i.e. the fact that these are not direct measures of relative fitness values from head-to-head competition between strains.

      Reviewer #1 is correct: the metrics are computed from means. As Reviewer 1 definitely understands, debates over what measurements are proper proxies for fitness go back a long time. We added a slight acknowledgement about the existence of multiple fitness proxies in our revision.

      • The authors consider one-step variant vulnerability. Have the authors considered looking at 2-step, 3-step, etc analogs of the 1-step vulnerability? I wonder if these might suggest potential vulnerability bottlenecks associated with the use of a particular drug/drug combo or trajectories starting from particular variants.

      This is an interesting point. We provided one-step values as a means of interrogating the mutational neighborhood of alleles in the fitness landscape. While there could certainly be other pattern-relationships between the variant vulnerability and features of a fitness landscape (as the reviewer recognizes), we don’t have a rigorous reason to test them, other than an appeal to “I would be curious if [Blank].” As in, attempting to saturate the paper with these sorts of examinations might be fun, could turn up an interesting result, but this is true for most studies.

      To highlight just how serious we are about future questions along these lines, we’ll offer one specific question about the relationship between metrics and other features of alleles or landscapes. Recent studies have examined the existence of “evolvability-enhancing mutations,” that propel a population to high-fitness sections of a fitness landscape:

      ● Wagner, A. Evolvability-enhancing mutations in the fitness landscapes of an RNA and a protein. Nat Commun 14, 3624 (2023). https://doi.org/10.1038/s41467-023-39321-8

      One present and future area of inquiry involves whether there is any relationship between metrics like variant vulnerability and these sorts of mutations.

      We thank Reviewer 1 for engagement on this issue.

      • Fitness values are measured in the presence of a drug, but it is not immediately clear how the drug concentrations are chosen and, more importantly, how the choice of concentration might impact the landscape. The authors may wish to briefly comment on these effects, particularly in cases where the environment involves combinations of drugs. There will be a "new" fitness landscape for each concentration, but to what extent do the qualitative features, or whatever features drive evolutionary dynamics, change?

      This is another interesting suggestion. We have analyzed a new data set for dihydrofolate reductase mutants that contains a range of drug concentrations of two different antifolate drugs. The general question of how drug concentrations change evolutionary dynamics has been addressed in prior work of ours:

      ● Ogbunugafor CB, Wylie CS, Diakite I, Weinreich DM, Hartl DL. Adaptive landscape by environment interactions dictate evolutionary dynamics in models of drug resistance. PLoS computational biology. 2016 Jan 25;12(1):e1004710.

      ● Ogbunugafor CB, Eppstein MJ. Competition along trajectories governs adaptation rates towards antimicrobial resistance. Nature ecology & evolution. 2016 Nov 21;1(1):0007.

      There are a very large number of environment types that might alter the drug applicability or variant vulnerability metrics. In our study, we used an established data set composed of different alleles of a beta-lactamase, with growth rates measured across a number of drug environments. These drug environments consisted of individual drugs at certain concentrations, as outlined in Mira et al. 2015. For our study, we examined those drugs that had a significant impact on growth rate.

      For a new analysis of antifolate drugs in 16 alleles of dihydrofolate reductase (Plasmodium falciparum), we have examined a breadth of drug concentrations (Supplementary Figure S4). This represents a different sort of environment that one can use to measure the two metrics (variant vulnerability or drug applicability). As we suggest in the manuscript, part of the strength of the metric is precisely that it can incorporate drug dimensions of various kinds.

      • The metrics introduced depend on the ensemble of drugs chosen. To what extent are the chosen drugs representative? Are there cases where nonrepresentative ensembles might be advantageous?

      The authors thank the reviewer for this. The general point has been addressed in our comments above. Further, the general question of how a study of one set of drugs applies to other drugs applies to every study of every drug, as no single study interrogates every sort of drug ensemble. That said, we’ve explained the anatomy of our metrics, and have outlined how it can be directly applied to others. There is nothing about the metric itself that has anything to do with a particular drug type – the arithmetic is rather vanilla.

      Reviewer #2 (Recommendations For The Authors):

      1. Regarding my comment about the different formalisms for epistatic decomposition analysis, a key reference is

      Poelwijk FJ, Krishna V, Ranganathan R (2016). The Context-Dependence of Mutations: A Linkage of Formalisms. PLoS Comput Biol 12(6): e1004771.

      The authors appreciate this, are fans of this work, and have cited it in the revision.

      An example where both Fourier and Taylor analyses were carried out and the different interpretations of these formalisms were discussed is

      Unraveling the causes of adaptive benefits of synonymous mutations in TEM-1 β-lactamase. Mark P. Zwart, Martijn F. Schenk, Sungmin Hwang, Bertha Koopmanschap, Niek de Lange, Lion van de Pol, Tran Thi Thuy Nga, Ivan G. Szendro, Joachim Krug & J. Arjan G. M. de Visser. Heredity 121:406-421 (2018)

      The authors are grateful for these references. While we don’t think they are necessary for our new section entitled “Notes on methods used to detect epistasis,” we did engage them, and will keep them in mind for other work that more centrally focuses on methods used to detect epistasis. As the author acknowledges, a full treatment of this topic is too large for a single manuscript, let alone a subsection of one study. We have provided a discussion of it, and pointed the readers to longer review articles that explore some of these topics in good detail:

      ● C. Bank, Epistasis and adaptation on fitness landscapes, Annual Review of Ecology, Evolution, and Systematics 53 (1) (2022) 457–479.

      ● T. B. Sackton, D. L. Hartl, Genotypic context and epistasis in individuals and populations, Cell 166 (2) (2016) 279–287.

      ● J. Diaz-Colunga, A. Skwara, J. C. C. Vila, D. Bajic, Á. Sánchez, Global epistasis and the emergence of ecological function, bioRxiv

      1. Although the authors label Figure 4 with the term "environmental epistasis", as far as I can see it is only a standard epistasis analysis that is carried out separately for each environment. The analysis of environmental epistasis should instead focus on which aspects of these interactions are different or similar in different environments, for example, by looking at the reranking of fitness values under environmental changes [see Ref. [26] as well as more recent related work, e.g. Gorter et al., Genetics 208:307-322 (2018); Das et al., eLife 9:e55155 (2020)]. To some extent, such an analysis was already performed by Mira et al., but not on the level of epistatic interaction coefficients.

      The authors have provided a new analysis of how fitness value rankings have changed across drug environments, often a signature of epistatic effects across environments (Supplementary Figure S1).

      We disagree with the idea that our analysis is not a sort of environmental epistasis; we resolve coefficients between loci across different environments. As with every interrogation of G x E effects (G x G x E in our case), what constitutes an “environment” is a messy conversation. We have chosen the route of explaining very clearly what we mean:

      “We further explored the interactions across this fitness landscape and panels of drugs in two additional ways. First, we calculated the variant vulnerability for 1-step neighbors, which is the mean variant vulnerability of all alleles one mutational step away from a focal variant. This metric gives information on how the variant vulnerability values are distributed across a fitness landscape. Second, we estimated statistical interaction effects on bacterial growth through LASSO regression. For each drug, we fit a model of relative growth as a function of M69L x E104K x G238S x N276D (i.e., including all interaction terms between the four amino acid substitutions). The effect sizes of the interaction terms from this regularized regression analysis allow us to infer higher-order dynamics for susceptibility. We label this calculation as an analysis of “environmental epistasis.”
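
      The first of the two calculations in the quoted passage, the variant vulnerability of 1-step neighbors, can be sketched as below. This is a minimal illustration, assuming genotypes are encoded as bit strings with one position per substitution and that a variant vulnerability value is already available for every genotype; the two-locus values are invented, and the function names are hypothetical rather than taken from the authors' code.

```python
from statistics import mean


def one_step_neighbors(genotype):
    """All genotypes that differ from `genotype` at exactly one locus."""
    flip = {"0": "1", "1": "0"}
    return [
        genotype[:i] + flip[genotype[i]] + genotype[i + 1:]
        for i in range(len(genotype))
    ]


def neighbor_vulnerability(vulnerability, genotype):
    """Mean variant vulnerability over the one-step neighbors of `genotype`."""
    return mean(vulnerability[n] for n in one_step_neighbors(genotype))


# Hypothetical variant vulnerabilities on a two-locus landscape.
toy_vulnerability = {"00": 0.2, "01": 0.4, "10": 0.6, "11": 0.8}
neighborhood_of_wild_type = neighbor_vulnerability(toy_vulnerability, "00")
```

      For the four substitutions discussed in the text, each genotype would be a four-character string with four one-step neighbors, and comparing a genotype's own value with its neighborhood mean is what allows one to ask whether vulnerability is correlated across adjacent points on the landscape.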

      As the grammar for these sorts of analyses continues to evolve, the best one can do is be clear about what they mean. We believe that we communicated this directly and transparently.

      1. As a general comment, to strengthen the conclusions of the study, it would be good if the authors could include additional data sets in their analysis.

      The authors appreciate this comment and have given this point ample treatment. Further, other main conclusions and discussion points are focused on the biology of the system that we examined. Analyzing other data sets may demonstrate the broader reach of the metrics, but it would not alter the strength of our own conclusions (or if they would, Reviewer #2 has not told us how).

      1. There are some typos in the units of drug concentrations in Section 2.4 that should be corrected.

      The authors truly appreciate this. It is a great catch. We have fixed this in the revised manuscript.

      Reviewer #3 (Recommendations For The Authors):

      I would suggest demonstrating the concepts for a second drug class, and suggest folA variants and trimethoprim resistance, for which there is existing published data similar to what the authors have used here (e.g. Palmer et al. 2015, https://doi.org/10.1038/ncomms8385)

      The authors appreciate this insight. As previously described, we have analyzed a data set of folA mutants for the Plasmodium falciparum ortholog of dihydrofolate reductase, and included these results in new supplemental material. Please see the supplementary material.

      There are some errors in formatting and presentation that I have annotated in a separate PDF file (https://elife-rp.msubmit.net/eliferp_files/2023/04/11/00117789/00/117789_0_attach_8_30399_convrt.pdf), as the absence of line numbers makes indicating specific things exceedingly difficult.

      The authors apologize for the lack of line numbers (an honest oversight), but moreover, are tremendously grateful for this feedback. We have looked at the suggested changes carefully and have addressed many of them. Thank you.

      One thing to note: we have included a version of Figure 4 that has effects on the same axes. It appears in the supplementary material (Figure S4).

      In closing, the authors would like to thank the editors and three anonymous reviewers for engagement and for helpful comments. We are confident that the revised manuscript qualifies as a substantive revision, and we are grateful to have had the opportunity to participate.

    1. Author Response

      The following is the authors’ response to the original reviews.

      We are grateful for the comments from the reviewers, which helped us to strengthen our analyses and communicate more effectively the details of our findings and their significance. To address their criticisms, we have performed new analyses and revised the text and figures. We believe the manuscript has been significantly improved. In this letter, we provide the line numbers of important parts of the text that were changed. Below, we address the specific comments from the reviewers in detail.

      Reviewer #1 (Public Review):

      Gehr and colleagues used an elegant method, using neuropixels probes, to study retinal input integration by mouse superior collicular cells in vivo. Compared to a previous report of the same group, they opto-tagged inhibitory neurons and defined the differential integration onto each group. Through these experiments, the author concluded that overall, there is no clear difference between the retina connectivity to excitatory and inhibitory superior colliculus neurons. The exception to that rule is that excitatory neurons might be driven slightly stronger than inhibitory ones. Technically, this work is performed at a high level, and the plots are beautifully conceived, but I have doubts if the interpretation given by the authors is solid. I will elaborate below.

      Some thoughts about the interpretation of the results.

      My main concern is the "survivor bias" of this work, which can lead to skewed conclusions. From the data set acquired, 305 connections were measured, 1/3 inhibitory and 2/3 excitatory. These connections arise from 83 RGC onto 124 RGC (I'm interpreting the axis of Fig.2 C). Here it is worth mentioning that different RGC types have different axonal diameters (Perge et al., 2009). Here the diameter is also related to the way cells relay information (max frequencies, for example). It is possible that thicker axons are easier to measure, given the larger potential changes would likely occur, and thus, selectively being picked up by the neuropixels probe. If this is the case, we would have a clear case of "survival bias", which should be tested and discussed. One way to determine if the response properties of axonal termini are from an unbiased sample is to make a rough functional characterization as generally performed (see Baden et al. 2006). This is fundamental since all other conclusions are based on unbiased sampling.

      First of all, we want to thank the reviewer for the detailed and constructive comments based on which we refined the analysis and updated the figures. We hope that our changes adequately address the concerns of reviewer #1.

      We would like to clarify that Fig. 2C represents an example from a single experiment. In total, we recorded 326 RGCs and 680 SC neurons, with 161 individual RGCs making connections onto 183 individual SC neurons. Moreover, we thank the reviewer for bringing up the important point about the potential “survivor bias”. To address this concern, we would like to provide some clarifications (see below). In addition, we have now added the point that different RGCs can have different axonal diameters, as requested by the reviewer (line 605).

      It is important to note that our approach does not capture the total pool of retinal inputs. Moreover, we did not want to convey the impression that our approach equally captures all retinal inputs to a given SC neuron, as this is not the case. Likewise, it is important to note that our current method does not allow for the measurement of axonal diameters. To obtain an estimate of axonal thickness, complementary techniques such as imaging/staining or electron microscopy would be needed. Our study aimed at characterizing connected RGC-SC pairs and how excitatory and inhibitory neurons in the SC integrate retinal inputs, providing valuable insight on their wiring principles.

      We greatly appreciate the reviewer for highlighting this limitation and we now address these points in the discussion of the revised version of our manuscript (line 603).

      Regarding the suggested “rough functional characterization” of the RGCs: we considered this analysis, but unfortunately we did not present the necessary stimuli, e.g. chirp, in all experiments to be able to perform it. Moreover, the dataset presented in this work contains only 326 RGCs, with 161 identified RGCs making connections to SC neurons. Thus, it is unlikely that our dataset uniformly covers all ~30 RGC types in the mouse. However, given that our dataset is the first measurement of RGC inputs to SC INs and SC EXNs in vivo, we believe it provides a first step and a foundation for future studies focusing on specific RGC types to refine our understanding of the RGC-SC circuitry. We now discuss this point in the revised manuscript (line 586).

      One aspect that is not clear to me is to measure of connectivity strength in Figure 2. Here it seems that connectivity strength is directly correlated with the baseline firing rate of the SC neuron (see example plots). If this is a general case, the synaptic strength can be assumed but would only differ in strength due to the excitability of the postsynaptic cell. This should be tested by plotting the correlation coefficient analysis against the baseline firing rate.

      We appreciate the reviewer for bringing up this important point. From the analysis perspective, we would like to clarify that the efficacy measure is independent of the baseline firing rate. It quantifies the probability of adding spikes on top of the baseline rate by subtracting the baseline firing rate before measuring the area of the peak (Usrey et al., 1999).
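
      A minimal sketch of the baseline subtraction described above, in the spirit of Usrey et al. (1999). The function name, the bin indices for the peak and baseline windows, and the toy counts are illustrative assumptions and are not taken from the manuscript.

```python
def efficacy(ccg_counts, peak_bins, baseline_bins, n_pre_spikes):
    """Baseline-subtracted peak area of a cross-correlogram, per presynaptic spike.

    ccg_counts    : postsynaptic spike counts per lag bin
    peak_bins     : indices of the bins forming the short-latency peak (assumed)
    baseline_bins : indices of the bins used to estimate the baseline (assumed)
    n_pre_spikes  : number of presynaptic (RGC) spikes used to build the CCG
    """
    baseline = sum(ccg_counts[i] for i in baseline_bins) / len(baseline_bins)
    # Only spikes in excess of the baseline count toward the peak area.
    excess = sum(ccg_counts[i] - baseline for i in peak_bins)
    return max(excess, 0.0) / n_pre_spikes


# Toy cross-correlogram (hypothetical counts), peak in bins 3 and 4.
ccg = [10, 10, 10, 30, 20, 10, 10]
low_rate = efficacy(ccg, [3, 4], [0, 1, 2, 5, 6], n_pre_spikes=100)
# Adding 50 counts to every bin mimics a higher postsynaptic baseline rate.
high_rate = efficacy([c + 50 for c in ccg], [3, 4], [0, 1, 2, 5, 6], n_pre_spikes=100)
```

      In this toy case both calls give the same value, because a uniform shift in the baseline cancels in the subtraction. This shows only that the measure is not trivially scaled by the baseline; it does not rule out an empirical correlation between efficacy and firing rate.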

      Furthermore, we acknowledge the reviewer’s interesting and valuable observation about the relationship of the firing rate and the excitability of the SC neuron in the example plots. To test whether the efficacy is directly related to the mean firing rate, we conducted additional analyses to show the efficacy measure as a function of the mean firing rate (Author response image 1 and Figure 2G). To that end, we utilized two different measures of firing rate: the mean firing rate during spontaneous activity (gray screen) over a duration of 10 s (across 30 trials), which was interleaved with the natural movie presentations, and the overall firing rate throughout the entire recording session. Our findings indeed reveal a positive correlation, as predicted by the reviewer (Author response image 1; gray screen: EXC: r = 0.22721, p < 0.00081; INH: r = 0.34677, p = 0.00076; entire recording: EXC: r = 0.42685, p < 0.0005; INH: r = 0.43543, p = 0.00002).

      Author response image 1.

      Efficacy measure of connected RGC-SC pairs as a function of the mean firing rate during different stimulus conditions: during spontaneous activity (gray screen, left) and throughout the entire recording session (right).

      However, it is important to note that although we observe a correlation on the population level, the relationship between postsynaptic firing and efficacy is diverse. We identify pairs with strong connections despite the firing rate of the postsynaptic SC cell being low. Likewise, we also find pairs with weak connections despite the firing rate of the SC neuron being high (Author response image 2). These observations suggest that factors beyond the postsynaptic firing contribute to the efficacy of the connection. This is exemplified by the fact that SC neurons can receive both strong and weak connections from their convergent presynaptic RGC pool.

      Author response image 2.

      RGC-SC connectivity. Cross-correlograms showing 4 connected RGC-SC pairs (top) with two RGCs connecting onto the same SC neuron. Raster plots of SC neuron spiking activity in response to firing of the presynaptically connected RGC. The same SC neuron can receive both strong and weak RGC inputs.

      In summary, we thank the reviewer for bringing up this important question, and we believe that our additional analyses shed light on the relationship between firing rate and efficacy. This result is very interesting, and we include these findings in the updated Figure 2 in the revised manuscript (panel 2G) in exchange with the panel of the peak latency. Moreover, we also address this point now in the results and discussion section of the revised manuscript (line 280 and line 525).

      My third concern is the assessment of functional similarity in Fig. 3. It is not clear to me why the similarity value was taken by the arithmetic mean. For example, even if the responses are identical for one connected pair that exclusively responds either to the ON or OFF sparse noise, the maximal value can only be 0.67. Perhaps I misunderstood something.

      We thank the reviewer for raising this point, and we apologize for any confusion caused by our description of the similarity index calculation. To clarify, the similarity index was calculated specifically between the responses of the RGC and the responses of the postsynaptic SC neuron, rather than between the neurons and the visual stimulus. As a result, the similarity index reflects the degree of similarity in the responses of the connected pairs. Therefore, if the responses of the RGC and the connected postsynaptic neuron are identical, regardless of whether they respond exclusively to ON, only to OFF, or to a mixture of ON-OFF, the similarity index will be one. We have updated the relevant part in the methods section to make this point clearer to the reader (line 917).

      Secondly, correlations in natural(istic) movies can differ dramatically depending on the frame rate that the movie was acquired and the way it is displayed to the animal. What looks natural to us will elicit several artifacts at a retinal level, e.g., due to big jumps between frames (no direction-selective response) or overall little modulation (large spatial correlations). I would rather opt for uniform stimuli, as suggested previously. Of course, these are also approximations but can be easily reproduced by different labs and are not subjected to the intricacies of the detailed naturalistic stimulus used.

      We agree with the reviewer that spatiotemporal correlations of naturalistic stimuli are complex. To address this point, we added two stimuli with little spatiotemporal correlations to the similarity analysis. The first stimulus we added is a phase scrambled version of the natural movie (PSM, also taken from Froudarakis et al. (2014)). The second is a binary white noise checkerboard stimulus. These stimuli were presented randomly interleaved with the natural movie, for 30 trials each. The similarity index analysis revealed that even with uniform stimuli included, the average similarity index is correlated to the efficacy. We show this data now in Figure 3.

      Fourth. It is important to control the proportion of inhibitory cells activated optogenetically across the recording probe. Currently, it is not possible to assess if there are false negatives. One way of controlling for this would be to show that the number of inhibitory interneurons doesn't vary across the probe.

      We thank the reviewer for highlighting this important aspect of the experiment and analysis. We are aware of this point and therefore took extra care to minimize the biases that could be introduced by our recording and stimulation method. Our approach to including recorded excitatory and inhibitory neurons was conservative. Briefly:

      1. We included only excitatory and inhibitory neurons that were within the SC, defined by visually driven activity and continuous retinotopy (see method).

      2. We further restricted the included neurons to neurons that were located within the boundaries of the LED evoked responses, i.e. the recording channels with optogenetic evoked MUA responses within the SC (Figure 1 – figure supplement 1).

      3. Both excitatory and inhibitory SC neurons were selected in this way.

      These inclusion criteria were specifically designed to avoid sampling excitatory neurons from regions on the Neuropixels probe that lacked optogenetically evoked responses and thus to minimize the number of falsely labeled excitatory neurons.

      To illustrate these inclusion criteria and the resulting spatial distribution of the selected excitatory and inhibitory SC neurons along the 384 channels of the Neuropixels probe, we have now added a supplementary figure (Figure 1 – figure supplement 1). This figure shows the multi-unit activity in response to optogenetic stimulation and the distribution of inhibitory and excitatory single units within the range of channels that are activated via LED stimulation for 3/11 selected experiments. This highlights that we employed stringent criteria for determining the boundaries and selecting which neurons to include in our study. The distribution of excitatory and inhibitory SC neurons is not significantly different for 9/11 experiments (Wilcoxon rank-sum test, p values = 0.307, 0.0115, 0.755, 0.834, 5.01 × 10⁻⁶, 0.79, 0.80, 0.26, 0.33, 0.08, 0.13). Moreover, in the two significantly different experiments only 2 RGC-SC EXC pairs were located in the region without identified SC INs, and these thus do not affect the results. We now address this point in the methods section (line 859).
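
      As an illustration of the per-experiment comparison above, the following is a rough sketch of a two-sided Wilcoxon rank-sum test using the normal approximation. It is not the analysis code behind the reported p values: it applies no tie or continuity correction, and the channel positions are hypothetical.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum (Mann-Whitney U) test, normal approximation.

    Returns (U statistic for x, two-sided p-value). Tied values receive
    average ranks; no tie or continuity correction is applied.
    """
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mu) / sigma
    return u, math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical channel positions of excitatory and inhibitory units on the probe.
exc_channels = [120, 131, 140, 152, 160, 171]
inh_channels = [118, 133, 141, 150, 162, 170]
u_stat, p_value = rank_sum_test(exc_channels, inh_channels)
```

      With interleaved positions such as these, the test gives no evidence that the two populations occupy different parts of the probe. For small samples an exact test, as offered by standard statistics packages, would be preferable to this approximation.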

      Fifth. In Fig. 4, the ISI had a minimal bound of 5 ms. Why? This would cap the firing rate at 200Hz, but we know that RGC in explants can fire at higher frequencies for evoked responses. I would set a lower bound since it should come naturally from the after-depolarization block.

      The chosen 5 ms minimal bound was in the range used in previous literature, e.g. 4-30 ms in Usrey et al. (1998). To address the reviewer’s question, we re-analyzed the data with a lower bound of 2 ms (2-30 ms) to include RGCs that fire at frequencies higher than 200 Hz. We did not observe a clear difference between the 2-30 and 5-30 ms groups for inhibitory connections (SC IN: p = 0.604). Only the excitatory connections show a statistically significant difference (p = 0.011); however, the effect size is small (Cohen’s d: EXC = 0.063, INH = 0.030). Nonetheless, we updated a panel in Figure 4 to represent the 2-30 ms group (Figure 4F).
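
      For readers unfamiliar with the effect-size measure quoted above, here is a minimal sketch of Cohen’s d using the pooled standard deviation. The response does not state which variant of d was computed, so the pooled form is an assumption, and the input values are made up for illustration.

```python
import math
import statistics


def cohens_d(a, b):
    """Cohen's d for two samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)


# Hypothetical PSR values for the two ISI windows (not data from the study).
psr_2_30 = [1.0, 2.0, 3.0]
psr_5_30 = [2.0, 3.0, 4.0]
d = cohens_d(psr_2_30, psr_5_30)
```

      Because d scales the mean difference by the spread of the data, a comparison can reach statistical significance with many pairs while d stays close to zero, which is the situation described for the excitatory connections.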

      Another aspect that remains unclear is to what extent the paired-spike ratio depends on the baseline firing rate. This would change the interpretation from the particular synaptic connection to the intrinsic properties of the cell and is plausible since the baseline firing rate varies tremendously.

      To address how the paired-spike ratio depends on the baseline firing rate we plotted the change of PSR depending on ISI as suggested by the reviewer.

      One related analysis would be to plot the change of PSR depending on the ISI. It would be intuitive to make a scatter plot for all paired spikes of all recorded neurons (separated into inhibitory and excitatory) of ISI vs. PSR.

      We appreciate the valuable suggestion from the reviewer. We have now separated the ISIs into distinct groups spanning 5 ms intervals represented in Author response image 3, right. These intervals range from 5-10 ms up to 25-30 ms. Notably, we observe a difference between the excitatory and inhibitory populations. The excitatory population exhibits a monotonic decrease in mean PSR across the intervals, while the inhibitory population shows a peak around 10/15 ms.

      Author response image 3.

      Change of mean paired-spike ratio (PSR) depending on ISI. Left) Comparison of PSR between two groups of different ISIs. The 2-30 ms group ensures that high-firing RGCs are included (excitatory pairs 2-30 vs 5-30 ms p = 0.011; inhibitory pairs 2-30 vs 5-30 ms p = 0.604, Wilcoxon signed-rank). Right) PSR for groups of different ISI intervals. Mean PSR ± SEM for excitatory groups: 2.0 ± 0.09, 1.75 ± 0.09, 1.51 ± 0.05, 1.31 ± 0.05, 1.2 ± 0.05; inhibitory groups: 1.35 ± 0.06, 1.51 ± 0.09, 1.5 ± 0.1, 1.22 ± 0.06, 1.21 ± 0.07. p E vs I (within group): 1.55 × 10⁻⁵, 9.55 × 10⁻², 4.21 × 10⁻¹, 3.74 × 10⁻¹, 6.22 × 10⁻¹, Wilcoxon rank-sum test.
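
      The ISI grouping used for the right-hand panel can be sketched as follows. This is a simplified illustration: efficacy is approximated here as the fraction of presynaptic spikes followed by a postsynaptic spike, without baseline subtraction, and the record format and function name are hypothetical.

```python
def psr_by_isi(pairs, edges=(5, 10, 15, 20, 25, 30)):
    """Paired-spike ratio (efficacy of 2nd spike / efficacy of 1st) per ISI bin.

    pairs : iterable of (isi_ms, first_hit, second_hit), where the booleans
            mark whether the 1st / 2nd presynaptic spike of a pair was followed
            by a postsynaptic spike (simplified, no baseline subtraction).
    edges : ISI bin edges in ms; bins are half-open [lo, hi).
    """
    result = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = [p for p in pairs if lo <= p[0] < hi]
        if not in_bin:
            continue  # no spike pairs with an ISI in this bin
        eff_first = sum(p[1] for p in in_bin) / len(in_bin)
        eff_second = sum(p[2] for p in in_bin) / len(in_bin)
        if eff_first > 0:
            result[(lo, hi)] = eff_second / eff_first
    return result


# Hypothetical spike pairs: (ISI in ms, 1st spike drove SC, 2nd spike drove SC).
example_pairs = [
    (6, True, True), (7, False, True), (8, True, True), (9, False, True),
    (12, True, True), (13, True, False),
]
psr = psr_by_isi(example_pairs)
```

      A PSR above 1 in a bin indicates that second spikes were more effective than first spikes at that interval, so a decline across bins corresponds to facilitation fading with longer ISIs.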

      Panel 4E is confusing to me. Here what is plotted is efficacy 1st against PSR (which is efficacy 2nd/efficacy 1st). Given that you have a linear relation between efficacy 1st and efficacy 2nd (panel 4C), you are essentially re-plotting the same information, which should necessarily have a hyperbolic relationship: [ f(x) = y/x ]. Thus, fitting this with a linear function makes no sense and it has to be decaying if efficacy 2nd > efficacy1st as shown in 4C.

      We thank the reviewer for raising this question, which helped us to improve the representation and description of the results shown in Figure 4. Panel 4E is intended to investigate whether there is a correlation between the efficacy strength (eff 1st) and the amount of facilitation (PSR). From panel 4C it is already evident that the data points for high efficacies lie closer to the unity line, as compared to the data points for low efficacies. This suggests that the PSR is stronger for connections with smaller efficacies 1st. To quantify this relationship, we have plotted the efficacy 1st vs the PSR in panel 4E, which thus adds new information to the figure. Importantly, this panel is shown in log-log scales, and therefore the decaying relationship is not evident. If we had shown the data on linear-linear scale, the decaying function would have been evident (Author response image 4). And indeed, as the reviewer pointed out, we cannot fit a hyperbolic relationship with a linear function. This is exactly the reason why we show the data in log-log scale and also estimate the Pearson correlation from the logs of the efficacies and PSRs.
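
      The argument for computing the correlation on logarithms can be illustrated with a short sketch. The toy values follow an exact power law and are not data from the study; the helper names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def loglog_r(eff_first, psr):
    """Pearson correlation computed on log10-transformed values."""
    return pearson_r(
        [math.log10(v) for v in eff_first],
        [math.log10(v) for v in psr],
    )


# Toy power-law relationship: PSR falls as efficacy of the 1st spike rises.
eff_first = [0.01, 0.04, 0.16]
psr_values = [4.0, 2.0, 1.0]
r_linear = pearson_r(eff_first, psr_values)
r_loglog = loglog_r(eff_first, psr_values)
```

      For a decaying power-law relationship the log transform turns the curve into a straight line, so the correlation on logs captures the dependence that a linear-scale correlation underestimates.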

      In Author response image 4 we show the relationship plotted on linear scale using an approach to fit the hyperbolic relationship employing a hyperbolic cosecant function a / sinh(b · x) + c.

      Author response image 4.

      Relationship between efficacy to 1st RGC and PSR visualized on linear scale using a hyperbolic fitting approach a / sinh(b · x) + c.

      Finally, in Figure 5, the perspective is inverted, and the spike correlations are seen from the perspective of SC neurons. Here it would also be good to plot the cumulative histograms and not look at the averages.

      We added the cumulative histogram to Figure 5 (panel B), in addition to the raw data points and the mean.

      Regarding the similarity index and use of natural stats, please see my previous comments. Also, would it be possible to plot the contribution v/s the firing rate with the baseline firing rate with no stimulation or full-field stimulation? This is important since naturalistic movies have too many correlations and dependencies that make this plot difficult to interpret.

      We now show the contribution vs firing rates for different stimulus conditions in a new figure supplement (Figure 5 – figure supplement 1). We added the correlations for the different stimulus conditions: baseline firing rate with no stimulation (gray background), full-field stimulation (checkerboard) and the phase-scrambled natural movie.

      Overall, the paper only speaks from excitatory and inhibitory differences in the introduction and results. However, it is known that there are three clear morphologically distinct classes of excitatory neurons (wide-field, narrow-field, and stellate). This topic is touched in the discussion but not directly in the context of these results. Smaller cells might likely be driven much stronger. Wide-field cells would likely not be driven by one RGC input only and will probably integrate from many more cells than 6.

      We thank the reviewer for this comment. We agree with the reviewer that addressing how the different excitatory and inhibitory cell-types integrate RGC input is important to understand the visual processing mechanisms in the SC. The presented study aimed at comparing the excitatory and inhibitory population in general using the VGAT-ChR2 mouse line. Understanding how specific genetically defined cell-types integrate RGC inputs is clearly very interesting and should be done. Unfortunately, the mouse lines that would allow targeting genetically identified inhibitory cell-types are still limited and therefore we can only use functional measurements to assess different types of neurons in the SC. We now address this point about distinct SC cell-types in the discussion (line 643).

      One possible functional measurement is the size of the receptive field, which, to some degree, could be used as a proxy for different morphologies, i.e. small receptive fields could hint towards compact morphology while large receptive fields could indicate a wider morphology. It is known, for example, that narrow-field and stellate cells have small RF sizes, while wide-field cells have large RFs. We studied the relationship between the RF size and spike waveform duration but did not find a significant correlation (Figure R6). Moreover, the spike waveform duration, as discussed in the manuscript, is not a valid criterion to separate EXNs and INs in the SC, as is common practice in the cortex. We now also looked into whether the connectivity strength is related to the RF size. Interestingly, while in the current dataset we do not find a significant correlation between the efficacy and the receptive field size for either EXN or IN (Author response image 5, left), we do find a significant negative correlation between contribution and receptive field size for the excitatory neurons (Author response image 5, right). This result indicates that SC excitatory neurons with small receptive fields are more strongly coupled to the RGC input as compared to neurons with larger receptive fields.

      Author response image 5.

      Relationship between RF size and connectivity measures (efficacy and contribution) for RGC-SC EXN and RGC-SC IN pairs (two-sided Wilcoxon rank-sum test).

      Reviewer #2 (Public Review):

      This study follows up on a previous study by the group (Sibille et al Nature Communications 2022) in which high density Neuropixel probes were inserted tangentially through the superficial layers of the superior colliculus (SC) to record the activity of retinocollicular axons and postsynaptic collicular neurons in anesthetized mice. By correlating spike patterns, connected pairs could be identified which allowed the authors to demonstrate that functionally similar retinal axon-SC neuron pairs were strongly connected.

      In the current study, the authors use similar techniques in vGAT-ChR2 mice and add a fiber optic to identify light-activated GABAergic and non-light-activated nonGABAergic neurons. Using their previously verified techniques to identify connected pairs, within regions of optogenetic activation they identified 214 connected pairs of retinal axons and nonGABAergic neurons and 91 pairs of connected retinal axons and GABAergic neurons. The main conclusion is that retinal activity contributed more to the activity of postsynaptic nonGABAergic SC neurons than to the activity of postsynaptic GABAergic SC neurons.

      The study is very well done. The figures are well laid out and clearly establish the conclusions. My main comments are related to the comparison to other circuits and further questions that might be addressed in the SC.

      It is stated several times that the superior colliculus and the visual cortex are the two major brain areas for visual processing and these areas are compared throughout the manuscript. However, since both the dorsal lateral geniculate nucleus (dLGN) and SC include similar synaptic motifs, including triadic arrangements of retinal boutons with GABAergic and nonGABAergic neurons, it might be more relevant to compare and contrast retinal convergence and other features in these structures.

      Thank you for raising this crucial point. Indeed, the comparison to the thalamus is a valid argument, as both the SC and LGN are primary targets of RGC axon terminals. During the preparation of the manuscript, we extensively discussed whether it is more appropriate to compare our new SC dataset with existing literature on the LGN or on the primary visual cortex (V1). Ultimately, we decided on using the visual cortex as the main comparison for the following reasons:

      1. The SC is widely recognized as an evolutionary conserved circuit for visual computation and visually guided behaviors, while the dLGN is generally regarded as a relay station for RGC information to the visual cortex (Steriade, McCormick, 1997). Thus, we believe it is more relevant to compare the evolutionary older visual circuit (SC) to the evolutionary newer visual circuit (visual cortex).

      2. In the mouse, the dLGN contains only a limited number of inhibitory interneurons, which represent only approximately 6% of the total dLGN neuronal population (Butler, 2008; Evangelio et al., 2018). It has been suggested that the rodent somatosensory thalamus even lacks interneurons (Arcelli et al., 1997). Consequently, directly comparing inhibitory interneurons in the SC to those in the dLGN would pose challenges.

      3. Along the same line, the density and also the diversity of inhibitory neurons in the SC is high and likely more comparable to the density and diversity of inhibitory neurons in the visual cortex, than to the dLGN circuit. In the dLGN, TC projection neurons far outnumber inhibitory neurons (Arcelli et al., 1997; Evangelio et al., 2018) and the dLGN is inhabited by just 1-2 classes of GABAergic retinorecipient interneurons (Arcelli et al., 1997; Jaubert-Miazza et al., 2005; Krahe et al., 2011; Ling et al., 2012). Classification approaches (e.g. 3D reconstruction) so far have not revealed any subclasses except for distinctions in intrinsic membrane properties (Leist et al., 2016), suggesting low interneuron diversity in the dLGN. This is in contrast to the vLGN, where a recent study found a diversity of GABAergic neurons (Sabbagh et al., 2021).

      4. In the thalamo-cortical circuit, there exists a notable difference in how cortical excitatory and cortical inhibitory neurons are driven by their thalamic input (Alonso and Swadlow, 2005; Cruikshank et al., 2007). This discrepancy forms the basis for several models of visual processing in the visual cortex (Kremkow et al., 2016; Taylor et al., 2021), which is why we wanted to assess whether the SC follows similar or different rules.

      That said, the reviewer is correct that the dLGN and the SC share certain wiring motifs, such as the triadic arrangements of retinal boutons. Unfortunately, the VGAT-ChR2 mouse line used in our study does not specifically label SC inhibitory neurons that are involved in the formation of triadic arrangements. Therefore, we are unable to draw specific conclusions regarding this point. To further investigate this aspect, the use of GAD67 mice, which have been shown to selectively label intrinsic interneurons that receive RGC input and contact non-GABAergic dendrites (Whyland et al., 2020), would be necessary. Nonetheless, we acknowledge the question raised by the reviewer and, in response, we have now provided a more in-depth comparison to the dLGN in the discussion section of the revised manuscript (line 565).

      The GABAergic and nonGABAergic neurons showed a wide range of firing rates. It might be interesting to sort the cells by firing rates to see if they exhibit different properties. For example, since the SC contains both GABAergic interneurons and projection neurons it would be interesting to examine whether GABAergic neurons with higher firing rates exhibit narrower spikes, similar to cortical fast spiking interneurons. Similarly, it might be of interest to sort the neurons by their receptive field sizes since this is associated with different SC neuron types.

      We thank the reviewer for the interesting suggestions on classifying SC neurons into different categories. The relationship between connectivity measures and RF size has been addressed in Author response image 5. We have now studied the relationship of spike waveforms and several measures such as firing rate and RF size in more detail (Author response image 6).

      As the baseline firing is generally low in SC and our experiments are performed under anesthetized conditions, we used the evoked firing rates to sort the cells by firing rates or RF sizes. We have added an analysis showing the mean firing rate (calculated over the full recording duration) as a function of the spike width (peak-to-trough duration). We observe no significant relationship between the different groups of cell types. The same holds if we sort the SC neurons by their RF size. RF sizes were calculated from PSTHs and summed RF for SL and SD. We do not see a relationship between neuron type and firing or RF size.

      Author response image 6.

      Mean firing rate (left) and RF size (right) as a function of peak-to-trough (PT) duration for excitatory and inhibitory SC neurons. Neither measure is correlated with the PT duration (Pearson correlation coefficient, two-sided Wilcoxon rank-sum test).

      The recording techniques allowed for the identification of the distance between connected retinocollicular fibers and postsynaptic neurons. It might also be interesting to compare the properties of connected pairs recorded at dorsal versus ventral locations since neurons with different genetic identities and response properties are located in different dorsal/ventral locations (e.g. Liu et al. Neuron 2023). Also, regarding the strength of connections, previous electron microscopy studies have shown that the retinocollicular terminals differ in density and size in the dorsal/ventral dimension (e.g Carter et al JCN 1991).

      We thank the reviewer for raising this interesting and relevant point to compare the properties of the connected pairs across the dorsal and ventral location. Unfortunately, our tangential recording approach is not ideally suited for comparing the properties of neurons across the different SC depths. For comparing dorsal versus ventral located neurons in the SC, as done in Liu et al., Neuron 2023, vertical recordings would be more appropriate. We now provide a discussion on this aspect (line 589).

      Was optogenetic activation of GABAergic neurons ever paired with visual activation? It would be interesting to examine the receptive fields of the nonGABAergic neurons before and after activation of the GABAergic neurons (as in Gale and Murphy J Neurosci 2016).

      This is an important point and indeed we have paired activation of GABAergic neurons with visual stimulation (checkerboard stimulus) to assess the impact of the GABAergic neurons on the firing of the excitatory neurons. We observed a diversity of effects, with some EXNs being strongly suppressed and others being only weakly suppressed. Thus, we predict that the receptive field of those EXN that are suppressed by optogenetically evoked IN firing, should be affected in some way. However, the checkerboard stimulus was only presented for a short duration (1 s) and for only a few trials (n = 30). Therefore, estimating the receptive fields of EXN before and after optogenetic activation of GABAergic neurons is unfortunately not possible with the existing dataset. We now mention this point in the discussion (line 668).

      Reviewer #3 (Public Review):

      This study performs in vivo recordings of neurons in the mouse superior colliculus and their afferents from the retina, retinal ganglion cells (RGCs). Building on a preparation they previously published, this study adds the use of optogenetic identification of inhibitory neurons (aka optotagging) to compare RGC connectivity to excitatory and inhibitory neurons in SC. Using this approach, the authors characterize connection probability, strength, and response correlation between RGCs and their target neurons in SC, finding several differences from what is observed in the retina-thalamus-visual cortex pathway. As such, this may be a useful dataset for efforts to understand retinocollicular connectivity and computations.

      Recommendations:

      Reviewer #1 (Recommendations For The Authors):

      Some minor points.

      Fig.1G shows a difference in mean firing rates between inhibitory and excitatory cells. Please plot the cumulative distribution of firing rates to be able to scrutinize the data better.

      We have addressed this issue and updated panel G in Figure 1.

      Fig. 2C. The background color of this plot is black; it is not possible to decipher much, please change it to white.

      We have now changed panel C in Figure 2 to a white background.

      Fig. 4D would be better represented as a histogram since most points overlap.

      We now represent panel D in Figure 4 as a histogram.

      Citations. I would cite some of the foundational work, in some instances, e.g., in the first sentence (SC receives input from the retina)

      We have now addressed this issue and cited more foundational studies (e.g. line 68).

      The discussion is a bit long; the last paragraph can be removed, mainly because the previous section conflates superficial SC with the entire SC, which is confusing (e.g., Ayupe et al.). In this way, there is more space to discuss the direct implication of the study within the context of known cell types.

      We now shortened the discussion and provide more background about different SC cell types in the discussion (line 643).

      Reviewer #2 (Recommendations For The Authors):

      Minor correction: Whyland et al 2020 did not identify V1 input to horizontal cells. A more appropriate reference is Zingg et al Neuron 2017.

      We thank the reviewer for this important point and have now corrected the citation in line 613 in the discussion to Zingg et al 2017.

      Reviewer #3 (Recommendations For The Authors):

      Regarding the degree of convergence from RGC to SC, the Crair lab (Furman 2013) performed a quantal analysis in slice that is worth citing.

      We included this citation in the revised version of the manuscript (line 501).

      I have lost track at this point, but many labs (Heimel, Meister, Farrow, Cang, Isa, maybe others?) have observed that neighboring SC neurons have similar tuning for direction/orientation, but the circuit mechanisms are not well understood. Given the relatively weak correlation between response tuning of RGC axons and their SC target neurons, a useful comparison might be that of SC neurons and their neighbors, and whether SC neurons that show weaker correlation to their RGC axons show stronger correlations with their SC neighbors, which could implicate local connectivity within SC.

      We thank the reviewer for providing this interesting comment. With our recording approach we could study locally connected SC neurons. However, the focus of our study was to first characterize the retinocollicular connectivity, and therefore investigating the intracollicular connectivity is beyond the scope of the current study. We thank the reviewer for the valuable suggestion and will consider tackling this aspect in a separate study in the future.

      Is it possible any of these measurements are biased by laminar targeting of their probe within superficial SC? Their schematic seems to suggest they targeted the deeper part of superficial SC. Do they know whether they recorded throughout superficial SC or targeted the deeper layers closer to stratum opticum?

      Our recordings lie between the deeper and upper visual SC layers, depending on the recording site on the Neuropixels probe, as we use an angled insertion approach. Besides DiI staining (Author response image 7), we can estimate the location of the probe using functional measurements, i.e. visually driven channels and retinotopic locations of the recording sites. If the Neuropixels probe is inserted too superficially, the number of recording sites with visually driven activity is low. If the Neuropixels probe is inserted too deep in the visual layers, we see two separated regions on the probe with visually driven activity, in which the retinotopy is non-continuous (please refer to Figure 2 in Sibille et al., 2022). In the recordings included in this study, the number of visually driven channels was generally high and the retinotopy continuous, suggesting that we covered a region within the deeper and upper visual layers.

      Author response image 7.

      Functional estimation of probe location. DiI staining of Neuropixels probe (middle) and multi-unit activity across channels in response to visual stimulation (bottom). The white dashed lines in the middle and bottom panels mark the rough boundaries of the visual SC layers.

      In Fig. 4, the authors argue that firing in inhibitory neurons is less correlated with RGC input. Does their metric for contribution of retinal input control for the fact that inhibitory neurons have higher firing rates overall and, e.g., may be more depolarized at rest and likelier to fire spontaneous spikes but no less likely to be driven by retina? Or is the argument that their visual responses are more likely to be driven by V1 or local connections?

      We thank the reviewer for bringing up that point. The contribution measure estimates the fraction of SC spikes that were preceded by an RGC spike and it is thus, in theory, independent of the firing rate of the SC neuron. In practice, however, we agree that high firing SC neurons may be more likely to have a lower contribution value simply because a larger fraction of their spikes is not preceded by the activity of the presynaptic RGC. But this is exactly what we aimed at characterizing with this analysis. Where these non-RGC driven SC spikes originate from, whether from a more depolarized state of the neuron or by other sources such as V1 or local connections, we can only speculate about. That said, please note that despite SC INs having higher firing rates, not all of them show low contribution. Likewise, we also see SC neurons with low firing rates and low contribution values (new Supp Fig. 3).
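
      The firing-rate dependence described above can be illustrated with a minimal sketch. It assumes a simplified definition of contribution (the fraction of SC spikes with at least one RGC spike in a short preceding lag window); the function name, the lag window of 0.5 to 3 ms, and the spike times are all hypothetical and are not taken from the manuscript, whose measure may additionally correct for baseline coincidences.

```python
from bisect import bisect_left, bisect_right


def contribution(rgc_spikes_ms, sc_spikes_ms, min_lag_ms=0.5, max_lag_ms=3.0):
    """Fraction of SC spikes preceded by an RGC spike within the lag window.

    The lag window is a hypothetical choice for illustration only.
    """
    if not sc_spikes_ms:
        return float("nan")
    rgc = sorted(rgc_spikes_ms)
    preceded = 0
    for t in sc_spikes_ms:
        # Count RGC spikes falling in [t - max_lag_ms, t - min_lag_ms].
        first = bisect_left(rgc, t - max_lag_ms)
        last = bisect_right(rgc, t - min_lag_ms)
        if last > first:
            preceded += 1
    return preceded / len(sc_spikes_ms)


# Toy spike times in ms (invented, not data from the study).
rgc = [100.0, 200.0, 300.0]
sc_driven = [101.5, 202.0]  # each follows an RGC spike
sc_other = [150.0, 250.0, 400.0, 450.0, 500.0, 550.0]  # no preceding RGC spike

print(contribution(rgc, sc_driven))             # every SC spike is preceded
print(contribution(rgc, sc_driven + sc_other))  # same driven spikes, lower fraction
```

      In this toy case the number of RGC-driven SC spikes is unchanged, yet the contribution drops once additional non-RGC-driven spikes are added, which is the effect the measure is meant to capture.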

      Minor point: The optotagging in the example cell doesn't cause the cell to fire for ~50 ms? That is odd. Typically, cells classified as optotagged fire within 5-10 ms of light onset. Is that a strange example cell or is there something different about the optotagging approach?

      Unfortunately, transient LED light onsets and offsets can induce light artifacts on Neuropixels probes (Jun et al., 2017; Steinmetz et al., 2021), and it is therefore challenging to use brief LED pulses for optotagging with Neuropixels probes. To avoid this overlap of artifacts and LED-evoked spikes, we opted for a longer stimulus duration of 100 ms to activate VGAT neurons (Bennett et al., 2019; Siegle et al., 2019). Moreover, instead of a square pulse, we used a slow ramp for light onsets and offsets to minimize the magnitude of induced artifacts. In Author response image 8 we present examples of individual activated VGAT neurons responding to a 100 ms blue light pulse.

      Author response image 8.

      Optotagging approach. Example traces of a single stimulation pulse and protocol used for optogenetic stimulation. Evoked activity in response to LED stimulation (100 ms, 100 trials) for six example SC IN neurons.
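
      The idea of ramping light onsets and offsets can be sketched as follows. This is only an illustration under stated assumptions: a linear (trapezoidal) ramp, a 10 ms ramp time, and a 1 ms sampling step are hypothetical choices, since the actual ramp shape and duration are not specified here. The sketch compares the largest sample-to-sample jump of a ramped envelope with that of a square pulse of the same 100 ms duration.

```python
def light_envelope(duration_ms=100, ramp_ms=10, dt_ms=1, pad_ms=5):
    """Normalized light command: zero padding, ramp up, plateau, ramp down.

    ramp_ms=0 gives a square pulse. All parameter values are hypothetical.
    """
    n_pad = round(pad_ms / dt_ms)
    n_on = round(duration_ms / dt_ms)
    samples = []
    for i in range(-n_pad, n_on + n_pad + 1):
        t = i * dt_ms
        if t < 0 or t > duration_ms:
            level = 0.0  # light off before and after the pulse
        elif ramp_ms == 0:
            level = 1.0  # square pulse: full power immediately
        else:
            # Linear ramp up from onset and linear ramp down towards offset.
            level = min(1.0, t / ramp_ms, (duration_ms - t) / ramp_ms)
        samples.append(level)
    return samples


def max_step(samples):
    """Largest absolute change between consecutive samples."""
    return max(abs(b - a) for a, b in zip(samples, samples[1:]))


square = light_envelope(ramp_ms=0)
ramped = light_envelope(ramp_ms=10)
print(max_step(square), max_step(ramped))
```

      Both envelopes reach the same peak level, but the ramped command changes far more gradually at onset and offset, which is the property exploited to reduce transient artifacts.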

      References

      Alonso J-M, Swadlow HA. 2005. Thalamocortical specificity and the synthesis of sensory cortical receptive fields. J Neurophysiol 94:26–32. doi:10.1152/jn.01281.2004

      Arcelli P, Frassoni C, Regondi MC, De Biasi S, Spreafico R. 1997. GABAergic neurons in mammalian thalamus: a marker of thalamic complexity? Brain Res Bull 42:27–37. doi:10.1016/s0361-9230(96)00107-4

      Bennett C, Gale SD, Garrett ME, Newton ML, Callaway EM, Murphy GJ, Olsen SR. 2019. Higher-Order Thalamic Circuits Channel Parallel Streams of Visual Information in Mice. Neuron 102:477–492.e5. doi:10.1016/j.neuron.2019.02.010

      Butler AB. 2008. Evolution of the thalamus: a morphological and functional review. Thalamus & Related Systems 4:35–58. doi:10.1017/S1472928808000356

      Cruikshank SJ, Lewis TJ, Connors BW. 2007. Synaptic basis for intense thalamocortical activation of feedforward inhibitory cells in neocortex. Nat Neurosci 10:462–468. doi:10.1038/nn1861

      Evangelio M, García-Amado M, Clascá F. 2018. Thalamocortical Projection Neuron and Interneuron Numbers in the Visual Thalamic Nuclei of the Adult C57BL/6 Mouse. Frontiers in Neuroanatomy 12.

      Froudarakis E, Berens P, Ecker AS, Cotton RJ, Sinz FH, Yatsenko D, Saggau P, Bethge M, Tolias AS. 2014. Population code in mouse V1 facilitates readout of natural scenes through increased sparseness. Nat Neurosci 17:851–857. doi:10.1038/nn.3707

      Jaubert-Miazza L, Green E, Lo F-S, Bui K, Mills J, Guido W. 2005. Structural and functional composition of the developing retinogeniculate pathway in the mouse. Vis Neurosci 22:661–676. doi:10.1017/S0952523805225154

      Jun JJ, Steinmetz NA, Siegle JH, Denman DJ, Bauza M, Barbarits B, Lee AK, Anastassiou CA, Andrei A, Aydın Ç, Barbic M, Blanche TJ, Bonin V, Couto J, Dutta B, Gratiy SL, Gutnisky DA, Häusser M, Karsh B, Ledochowitsch P, Lopez CM, Mitelut C, Musa S, Okun M, Pachitariu M, Putzeys J, Rich PD, Rossant C, Sun W, Svoboda K, Carandini M, Harris KD, Koch C, O’Keefe J, Harris TD. 2017. Fully integrated silicon probes for high-density recording of neural activity. Nature 551:232–236. doi:10.1038/nature24636

      Krahe TE, El-Danaf RN, Dilger EK, Henderson SC, Guido W. 2011. Morphologically Distinct Classes of Relay Cells Exhibit Regional Preferences in the Dorsal Lateral Geniculate Nucleus of the Mouse. J Neurosci 31:17437–17448. doi:10.1523/JNEUROSCI.4370-11.2011

      Kremkow J, Perrinet LU, Monier C, Alonso J-M, Aertsen A, Frégnac Y, Masson GS. 2016. Push-Pull Receptive Field Organization and Synaptic Depression: Mechanisms for Reliably Encoding Naturalistic Stimuli in V1. Frontiers in Neural Circuits 10.

      Leist M, Datunashvilli M, Kanyshkova T, Zobeiri M, Aissaoui A, Cerina M, Romanelli MN, Pape H-C, Budde T. 2016. Two types of interneurons in the mouse lateral geniculate nucleus are characterized by different h-current density. Sci Rep 6:24904. doi:10.1038/srep24904

      Ling C, Hendrickson ML, Kalil RE. 2012. Morphology, Classification, and Distribution of the Projection Neurons in the Dorsal Lateral Geniculate Nucleus of the Rat. PLOS ONE 7:e49161. doi:10.1371/journal.pone.0049161

      Sabbagh U, Govindaiah G, Somaiya RD, Ha RV, Wei JC, Guido W, Fox MA. 2021. Diverse GABAergic neurons organize into subtype-specific sublaminae in the ventral lateral geniculate nucleus. J Neurochem 159:479–497. doi:10.1111/jnc.15101

      Sibille J, Gehr C, Teh KL, Kremkow J. 2022. Tangential high-density electrode insertions allow to simultaneously measure neuronal activity across an extended region of the visual field in mouse superior colliculus. J Neurosci Methods 376:109622. doi:10.1016/j.jneumeth.2022.109622

      Siegle JH, Jia X, Durand S, Gale S, Bennett C, Graddis N, Heller G, Ramirez TK, Choi H, Luviano JA, Groblewski PA, Ahmed R, Arkhipov A, Bernard A, Billeh YN, Brown D, Buice MA, Cain N, Caldejon S, Casal L, Cho A, Chvilicek M, Cox TC, Dai K, Denman DJ, de Vries SEJ, Dietzman R, Esposito L, Farrell C, Feng D, Galbraith J, Garrett M, Gelfand EC, Hancock N, Harris JA, Howard R, Hu B, Hytnen R, Iyer R, Jessett E, Johnson K, Kato I, Kiggins J, Lambert S, Lecoq J, Ledochowitsch P, Lee JH, Leon A, Li Y, Liang E, Long F, Mace K, Melchior J, Millman D, Mollenkopf T, Nayan C, Ng L, Ngo K, Nguyen T, Nicovich PR, North K, Ocker GK, Ollerenshaw D, Oliver M, Pachitariu M, Perkins J, Reding M, Reid D, Robertson M, Ronellenfitch K, Seid S, Slaughterbeck C, Stoecklin M, Sullivan D, Sutton B, Swapp J, Thompson C, Turner K, Wakeman W, Whitesell JD, Williams D, Williford A, Young R, Zeng H, Naylor S, Phillips JW, Reid RC, Mihalas S, Olsen SR, Koch C. 2019. A survey of spiking activity reveals a functional hierarchy of mouse corticothalamic visual areas (preprint). Neuroscience. doi:10.1101/805010

      Steinmetz NA, Aydin C, Lebedeva A, Okun M, Pachitariu M, Bauza M, Beau M, Bhagat J, Böhm C, Broux M, Chen S, Colonell J, Gardner RJ, Karsh B, Kloosterman F, Kostadinov D, Mora-Lopez C, O’Callaghan J, Park J, Putzeys J, Sauerbrei B, van Daal RJJ, Vollan AZ, Wang S, Welkenhuysen M, Ye Z, Dudman JT, Dutta B, Hantman AW, Harris KD, Lee AK, Moser EI, O’Keefe J, Renart A, Svoboda K, Häusser M, Haesler S, Carandini M, Harris TD. 2021. Neuropixels 2.0: A miniaturized high-density probe for stable, long-term brain recordings. Science 372:eabf4588. doi:10.1126/science.abf4588

      Taylor MM, Contreras D, Destexhe A, Frégnac Y, Antolik J. 2021. An Anatomically Constrained Model of V1 Simple Cells Predicts the Coexistence of Push–Pull and Broad Inhibition. J Neurosci 41:7797–7812. doi:10.1523/JNEUROSCI.0928-20.2021

      Usrey WM, Reppas JB, Reid RC. 1999. Specificity and Strength of Retinogeniculate Connections. Journal of Neurophysiology 82:3527–3540. doi:10.1152/jn.1999.82.6.3527

      Usrey WM, Reppas JB, Reid RC. 1998. Paired-spike interactions and synaptic efficacy of retinal inputs to the thalamus. Nature 395:384–387. doi:10.1038/26487

      Whyland KL, Slusarczyk AS, Bickford ME. 2020. GABAergic cell types in the superficial layers of the mouse superior colliculus. J Comp Neurol 528:308–320. doi:10.1002/cne.24754

    1. Author Response

      We are grateful for the insightful suggestions and comments provided by the reviewers. Your constructive feedback has been valuable, and we are thankful for the opportunity to address each point.

      We appreciate both reviewers’ recognition of our devotion to rigorous methodology and experimental control in this study, as evidenced by the comments: “remarkable efforts were made to isolate peripheral confounds”, “a clear strength of the study is the multitude of control conditions … that makes results very convincing”, and “thorough design of the study”. Indeed, we hope to have provided not merely solid but compelling evidence for sound-driven motor inhibitory effects of online TUS. We hope that this will be reflected in the assessment. Our conclusions are supported by multiple experiments across multiple institutions using exemplary experimental control, including (in)active controls and multiple sound-sham conditions. This contrasts with the sole use of flip-over sham or no-stimulation conditions in the majority of work to date. Indeed, the current study communicates that substantiated inferences on the efficacy of ultrasonic neuromodulation cannot be made under insufficient experimental control.

      In response to the reviewers' comments, we have substantially changed our manuscript. Specifically, we have open-sourced the auditory masking stimuli and specified them in better detail in the text, we have improved the figures to reflect the data more closely, we have clarified the intracranial dose-response relationship, we have elaborated in the introduction, and we have further discussed the possibility of direct neuromodulation. We hope that you agree these changes have helped to substantially improve the manuscript.

      Public reviews

      1.1) Despite the main conclusion of the authors stating that there are no dose-response effects of TUS on corticospinal inhibition, both the comparison of Isppa and MEP decrease for Exp 1 and 2, and the linear regression between MEP decrease (relative to baseline) and the estimated Isppa are significant, arguing the opposite, that there is a dose-response function which cannot be fully attributed to difference in sound (since the relationship is inverted, lower intracranial Isppa leads to higher MEP decrease). These results suggest that the dose-response function needs to be further studied in future studies.

      We thank the reviewer for bringing up this point. While we are convinced our study provides no evidence for a direct neuromodulatory dose-response relationship, we have realized that the manuscript could benefit from improved clarity on this point.

      A dose-response relationship between TUS intensity and motor cortical excitability was assessed by manipulating free-water Isppa (Figure 4C). Here, no significant effect of free-water stimulation intensity was observed for Experiment I or II, thus providing no evidence for a dose-response relationship (Section 3.2). To aid in clarity, ‘N.S.’ has been added to Figure 4C in the revised manuscript.

      However, it is likely that the efficacy of TUS would depend on realized intracranial intensity, which we estimated with 3D simulations for on-target stimulation. These simulations resulted in an estimated intracranial intensity for each applied free-water intensity (i.e., 6.35 and 19.06 W/cm2), for each participant. We then tested whether inter-individual differences in intracranial intensity during on-target TUS affected MEP amplitude. We have realized that the original visualization used to display these data and its explanation was unintuitive. Therefore, we have completely revised Supplementary Figure 6. Because of the substantial length of this section, we have not copied it here. Please see the Supplementary material for the implemented improvements.

      In brief, we now show MEP amplitudes on the y-axis, rather than expressing values as % change. This plot depicts how individuals with higher intracranial intensities during on-target TUS exhibit higher MEP amplitudes. However, this same relationship is observed for active control and sound-sham conditions. If there were a direct neuromodulatory dose-response relationship of TUS, this would be reflected as the difference between on-target and control conditions changing as the estimated intracranial intensity increases. This was not the case. Further, it is notable that the difference between on-target stimulation and baseline changes across intracranial intensities, but this occurs to an equal degree in the control conditions. Therefore, these data cannot be interpreted as evidence for a dose-response relationship.

      We hope the changes in Supplementary Figure 6 will make it clear that there is no evidence for direct intracranial dose-response effects.
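
      The reasoning above can be illustrated with a minimal sketch: a direct dose-response effect would appear as a non-zero slope of the on-target minus control difference across estimated intracranial intensity, not as a slope within a single condition. All numbers below are invented toy values (not data from the study), and the helper `ols_slope` is a hypothetical name for an ordinary least-squares slope.

```python
def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Toy values only: MEP amplitude rises with intensity in BOTH conditions.
intensity = [1.0, 2.0, 3.0, 4.0, 5.0]            # arbitrary units
on_target = [0.5 + 0.2 * i for i in intensity]   # hypothetical MEP amplitudes
control = [0.6 + 0.2 * i for i in intensity]     # same slope, different offset

difference = [a - b for a, b in zip(on_target, control)]

print(ols_slope(intensity, on_target))   # slope within on-target condition
print(ols_slope(intensity, control))     # same slope in the control condition
print(ols_slope(intensity, difference))  # slope of the difference
```

      In this toy case each condition shows a clear slope, while the slope of the difference is zero up to rounding, so the within-condition relationship alone would not count as evidence for a direct dose-response effect.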

      1.2) Other methods to test or mask the auditory confound are possible (e.g., smoothed ramped US wave) which could substantially solve part of the sound issue in future studies or experiments in deaf animals etc... 

      We agree with the reviewer’s statement. We aimed to replicate the findings of online motor cortical inhibition reported in prior work using a 1000 Hz square wave modulation frequency. While ramping can effectively reduce the auditory confound, as noted in the discussion, this is not feasible for the short pulse durations (0.1-0.3 ms) employed in the current study (Johnstone et al., 2021). We have further clarified this point in the methods section of the revised manuscript as follows:

      “While ramping the pulses can in principle mitigate the auditory confound (Johnstone et al., 2021; Mohammadjavadi et al., 2019), doing so for such short pulse durations (<= 0.3 ms) is not effective. Therefore, we used a rectangular pulse shape to match prior work.”

      Mitigation of the auditory confound by testing deaf subjects is a valid approach, and has now been added to the revised manuscript in the discussion as follows:

      “Alternative approaches could circumvent auditory confounds by testing deaf subjects, or perhaps more practically by ramping the ultrasonic pulse to minimize or even eliminate the auditory confound.”

      1.3) Dose-response function is an extremely important feature for a brain stimulation technique. It was assessed in Exp II by computing the relationship between the estimated intracranial intensities and the modulation of corticospinal excitability (Fig. 3b, 3c). It is not clear why data from Experiment I could not be integrated in a global intracranial dose-response function to explore wider ranges of intracranial intensities and MEP variability.

      We chose not to combine data from Experiment I in a global intracranial dose-response function because TUS was applied at different fundamental frequencies and focal depths (Experiment I: 500 kHz, 35 mm; Experiment II: 250 kHz, 28 mm). We have now explicitly communicated this under Supplementary Figure 6:

      “It was not appropriate to combine data from Experiments I and II given the different fundamental frequencies and stimulation depths applied… we ran simple linear models for Experiment II, which had a sufficient sample size (n = 27) to assess inter-individual variability.”

      1.4) Furthermore, the dose response function as computed with the MEP change relative to baseline shows a significant effect (6.35W/cm2) or a trend (19.06 W/cm2) for a positive linear relationship. This comparison cannot disentangle the auditory confound from the pure neuromodulatory effect but given the direction of the relationship (lower Isppa associated with larger neuromodulatory effect), it is unlikely that it is driven by sound. This relationship is absent for the Active control condition or the Sound Sham condition, more or less matched for peripheral confound. This needs to be further discussed. 

      Please refer to point 1.1

      1.5) The clear auditory confound arises from TUS pulsing at audible frequencies, which can be highly subject to inter-individual differences. Did the authors individually titrate the auditory mask to account for this intra- and inter-individual variability in auditory perception? 

      In Experiments I-III, the auditory mask was identical between participants. In Experiment IV, the auditory mask volume and signal-to-noise ratio were adjusted per participant. In the discussion we recommend individualized mask titration. However, we do note that masking successfully blinded participants in Experiment II, despite using uniform masking stimuli (Supplementary Figure 5).

      1.6) How different is the masking quality when using bone-conducting headphones (e.g., Exp. 1) compared to in-ear headphones (e.g., Exp. 2)?

      In our experience, bone conducting headphones produce a less clear, fuzzier, sound than in-ear headphones. However, in-ear headphones block the ear canal and likely result in the auditory confound being perceived as louder. We have included this information in the discussion of the revised manuscript:

      “Titrating auditory mask quality per participant to account for intra- and inter-individual differences in subjective perception of the auditory confound would be beneficial. Here, the method chosen for mask delivery must be considered. While bone-conducting headphones align with the bone conduction mechanism of the auditory confound, they might not deliver sound as clearly as in-ear headphones or speakers. Nevertheless, the latter two rely on air-conducted sound. Notably, in-ear headphones could even amplify the perceived volume of the confound by obstructing the ear canal.”

      1.7) I was not able to find any report on the blinding efficacy of Exp. 1. Do the authors have some data on this? 

      We do not have blinding data available for Experiment I. Following Experiment I, we decided it would be useful to include such an assessment in Experiment II.

      1.8) Was the possibility to use a smoothed, ramped US waveform ever tested as a control condition in this set of studies, to eventually reduce audibility? For such a fast PRF, the slope would still need to be steep to deliver the same power (AUC), so it might not be as efficient.

      We indeed tested smoothing (ramping) the waveform. There was no perceptible impact on the auditory confound volume. Indeed, prior research has also indicated that ramping over such short pulse durations is not effective (Johnstone et al., 2021). Taken together, we chose to continue with a square wave modulation as in prior TUS-TMS studies. We have updated the methods section of the manuscript with the following:

      “While ramping the pulses can in principle mitigate the auditory confound (Johnstone et al., 2021; Mohammadjavadi et al., 2019), doing so for such short pulse durations (<= 0.3 ms) is not effective. Therefore, we used a rectangular pulse shape to match prior work.”

      Importantly, our research shows that auditory co-stimulation can confound effects on motor excitability, and this likely occurred in multiple seminal TUS studies. While some preliminary work has been done on the efficacy of ramping in humans, future work is needed to determine what ramp shapes and lengths are optimal for reducing the auditory confound.

      1.9) There are other models or experiments that need to be discussed in order to clearly disassociate the TUS effect from the auditory confound effect, for instance, testing deaf animal models or participants, or experiments with multi-region recordings (to rule out the effects of the dense structural connectivity between the auditory cortex and the motor cortex). 

      The suggestion to consider multi-region recording in future experiments is important. Indeed, the effects of the auditory confound are expected to vary between brain regions. In the primary motor cortex, we observe a learned inhibition, which is perhaps supported by dense structural connectivity with the auditory system. In contrast, in perceptual areas such as the occipital cortex, one might expect tuned attentional effects in response to the auditory cue. We suggest that it is likely that the impact of the auditory confound also operates on a more global network level. It is reasonable to propose that, in a cognitive task for example, the confound will affect task performance and related brain activity, ostensibly regardless of the extent of direct structural connectivity between the auditory cortex and the (stimulated) region of interest.

      Regarding the testing of deaf subjects, this has been included in the revised discussion as follows:

      “Alternative approaches could circumvent auditory confounds by testing deaf subjects, or perhaps more practically by ramping the ultrasonic pulse to minimize or even eliminate the auditory confound.”

      1.10) The concept of stochastic resonance is interesting but traditionally refers to a mechanism whereby a particular level of noise actually enhances the response of non-linear systems to weak sensory signals. Whether it applies to the motor system when probed with suprathreshold TMS intensities is unclear. Furthermore, whether higher intensities induce higher levels of noise is not straightforward either, particularly considering the massive amount of work coming from other NIBS studies. Noise effects are indeed a function of noise intensity, but exhibit an inverted U-shape dose-response relationship (Potok et al., 2021, eNeuro). In general, SR is rather induced with low stimulation intensities, in particular in the perceptual domain (see Yamasaki et al., 2022, Neuropsychologia). In the same vein, did the authors compare inter-trial variability across the different conditions?

      We thank the reviewer for these insightful remarks. Indeed, stochastic resonance is a concept first formalized in the sensory domain. Recently, the same principles have been shown to apply in other domains as well. For example, transcranial electric noise (tRNS) exhibits similar stochastic resonance principles as sensory noise (Van Der Groen & Wenderoth, 2016). Indeed, tRNS has been applied to many cortical targets, including the motor system. In the current manuscript, we raise the question of whether TUS might engage with neuronal activity following principles similar to tRNS. One prediction of this framework would be that TUS might not modulate excitation/inhibition balance overall, but instead exhibit an inverted U-shape dose-dependent relationship with stochastic noise. Please note, we do not use the ‘suprathreshold TMS intensity’ to quantify whether noise could bring a sub-threshold input across the detection threshold, nor whether it could bring a sub-threshold output across the motor threshold. Instead, we use the MEP read-out to estimate the temporally varying excitability itself. We argue that MEP autocorrelation captures the mixture of temporal noise and temporal structure in corticospinal excitability. Building on the non-linear response of neuronal populations, low stochastic noise might strengthen weakly present excitability patterns, while high stochastic noise might override pre-existing excitability. It is therefore not the overall MEP amplitude, but the MEP timeseries that is of interest to us. Here, we observe a non-linear dose-dependent relationship, matching the predicted inverted U-shape. Importantly, we did not intend to assume stochastic resonance principles in the motor domain as a given. We have now clarified in the revised manuscript that we propose a putative framework and regard this as an open question:

      “Indeed, human TUS studies have often failed to show a global change in behavioral performance, instead finding TUS effects primarily around the perception threshold where noise might drive stochastic resonance (Butler et al., 2022; Legon et al., 2018). Whether the precise principles of stochastic resonance generalize from the perceptual domain to the current study is an open question, but it is known that neural noise can be introduced by brain stimulation (Van Der Groen & Wenderoth, 2016). It is likely that this noise is state-dependent and might not exceed the dynamic range of the intra-subject variability (Silvanto et al., 2007). Therefore, in an exploratory analysis, we exploited the natural structure in corticospinal excitability that exhibits as a strong temporal autocorrelation in MEP amplitude.”

      Following the above reasoning, we felt it critical to estimate noise in the timeseries, operationalized as a t-1 autocorrelation, rather than capture inter-trial variability that ignores the timeseries history and requires data aggregation, thereby reducing statistical power. Importantly, we would expect the latter index to capture global variability, putatively masking the temporal relationships which we were aiming to test. The reviewer raises an interesting option, inviting us to wonder if inter-trial variability might be sensitive enough, nonetheless. To this end, we compared inter-trial variability as suggested. This was achieved by first calculating the inter-trial variability for each condition, and then running a three-way repeated measures ANOVA on these values with the independent variables matching our autocorrelation analyses, namely, procedure (on-target/active control) × intensity (6.35/19.06) × masking (no mask/masked). This analysis did not reveal any significant interactions or main effects.

      Author response table 1.
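
      The distinction drawn above between a t-1 (lag-1) autocorrelation and an order-blind inter-trial variability index can be sketched as follows. This is an illustration only: the estimator shown is one common form of the lag-1 autocorrelation, the coefficient of variation stands in for inter-trial variability, and the two toy series are invented rather than taken from the MEP data.

```python
import statistics


def lag1_autocorrelation(x):
    """Lag-1 autocorrelation: covariance of x[t] with x[t-1] over total variance."""
    n = len(x)
    mean = sum(x) / n
    num = sum((x[i] - mean) * (x[i - 1] - mean) for i in range(1, n))
    den = sum((v - mean) ** 2 for v in x)
    return num / den


def coefficient_of_variation(x):
    """Order-independent variability index: sample SD divided by the mean."""
    return statistics.stdev(x) / statistics.mean(x)


# Toy amplitude series (invented): identical values, different trial order.
drifting = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]     # slow temporal structure
alternating = [1.0, 8.0, 2.0, 7.0, 3.0, 6.0, 4.0, 5.0]  # same values, reordered

for series in (drifting, alternating):
    print(lag1_autocorrelation(series), coefficient_of_variation(series))
```

      Because both series contain the same values, their coefficient of variation is identical, whereas the lag-1 autocorrelation is positive for the drifting series and negative for the alternating one. An index of this kind is therefore sensitive to temporal structure that aggregate variability cannot see.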

      1.11) State-dependency/Autocorrelations: These values were extracted from Exp2 which has baseline trials. Can the authors provide autocorrelation values at baseline, with and without auditory mask?  Can the authors comment on the difference between the autocorrelation profiles of the active TUS condition at 6.35W/cm2 or at 19.06W/cm2. They should somehow be similar to my understanding.  Besides, the finding that TUS induces noise only when sound is present and at lower intensities is not well discussed. 

      In the revised manuscript, we have now included baseline in the figure (Figure 4D). Regarding baseline with and without a mask, we must clarify that baseline involves only TMS (no mask), and sham involves TMS + masking stimulus (masked).

      The dose-dependent relationship of TUS intensity with autocorrelation is critical. One possible observation would have been that TUS at both intensities decreased autocorrelation, with higher intensities evoking a greater reduction. Here, we would have concluded that TUS introduced noise in a linear fashion.

      However, we observed that lower-intensity TUS in fact strengthened pre-existing temporal patterns in excitability (higher autocorrelation), while during higher-intensity TUS these patterns were overridden (lower autocorrelation). This non-linear relationship is not unexpected, given the non-linear responses of neurons.

      If this non-linear dependency is driven by TUS, one could expect it to be present during conditions both with and without auditory masking. However, the preparatory inhibition effect of TUS likely depends on the salience of the cue, that is, the auditory confound. In trials without auditory masking, the salience of the confound is highly dependent on (transmitted) intensity, with higher intensities being perceived as louder. In contrast, when trials are masked, the difference in cue salience between lower and higher intensity stimulation is minimized. Therefore, we would expect any nuanced dose-dependent direct TUS effect to be best detectable when the difference in dose-dependent auditory confound perception is minimized via masking. Indeed, the dose-dependent effect of TUS on autocorrelation is most prominent when the auditory confound is masked.

      “In sum, these preliminary exploratory analyses could point towards TUS introducing temporally specific neural noise to ongoing neural dynamics in a dose-dependent manner, rather than simply shifting the overall excitation-inhibition balance. One possible explanation for the discrepancy between trials with and without auditory masking is the difference in auditory confound perception, where without masking the confound’s volume differs between intensities, while with masking this difference is minimized. Future studies might consider designing experiments such that temporal dynamics of ultrasonic neuromodulation can be captured more robustly, allowing for quantification of possible state-dependent or nondirectional perturbation effects of stimulation.”
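      To make the distinction between the two indices discussed above concrete, the following is a minimal sketch, not the analysis code used for the manuscript. It assumes a lag-1 (t-1) autocorrelation computed on a single trial series and treats inter-trial variability as the sample standard deviation; the function names and the simulated series are hypothetical. It illustrates that two series with identical variability can differ sharply in autocorrelation, because only the latter depends on trial order.

```python
import numpy as np

def lag1_autocorrelation(series):
    """Correlation between each trial and the one before it (t vs. t-1)."""
    x = np.asarray(series, dtype=float)
    d = x - x.mean()
    return float(np.sum(d[1:] * d[:-1]) / np.sum(d * d))

def inter_trial_variability(series):
    """Sample standard deviation across trials; ignores trial order."""
    return float(np.std(np.asarray(series, dtype=float), ddof=1))

# Hypothetical MEP-like series (assumption, not real data):
# a slow drift, and the same values presented in random order.
rng = np.random.default_rng(0)
ordered = 1.0 + 0.2 * np.sin(np.linspace(0.0, 4.0 * np.pi, 200))
shuffled = rng.permutation(ordered)

for name, s in (("ordered", ordered), ("shuffled", shuffled)):
    print(name,
          "variability:", round(inter_trial_variability(s), 3),
          "lag-1 autocorrelation:", round(lag1_autocorrelation(s), 3))
```

      Shuffling leaves the variability unchanged while removing most of the temporal structure, which is why an order-blind index could mask the temporal relationships of interest.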

      1.12) Statistical considerations. Data from Figure 2 are considered in two-by-two comparisons. Why not reporting the ANOVA results testing the main effect of TUS/Auditory conditions as done for Figure 3. Statistical tables of the LMM should be reported. 

      Full-factorial analyses and main effects for TUS/Auditory conditions are discussed from Section 3.2 onwards. These are the same data supporting Figure 2 (now Figure 3). We would like to note that the main purpose of Figure 2 is to demonstrate to the reader that motor inhibition was observed, thus providing evidence that we replicated motor inhibitory effects of prior studies. A secondary purpose is to visually represent the absence of direct and spatially specific neuromodulation. However, the appropriate analyses to demonstrate this are reported in following sections, from Section 3.2 onwards, and we are concerned that mentioning these analyses earlier will negatively impact comprehensibility.

      Statistical tables of the LMMs are provided within the open-sourced data and code reported at the end of the paper, embedded within the output which is accessible as a pdf (i.e., analysis/analysis.pdf).

      1.13) Startle effects: The authors dissociate two mechanisms through which sound cuing can drive motor inhibition, namely some compensatory expectation-based processes or the evocation of a startle response. I find the dissociation somehow artificial. Indeed, it is known that the amplitude of the acoustic startle response habituates to repetitive stimulation. Therefore, sensitization can well explain the stabilization of the MEP amplitude observed after a few trials. 

      Thank you for bringing this to our attention. Indeed, an acoustic startle response would habituate over repetitive stimulation. A startle response would result in MEP amplitude being significantly altered in early trials. As the participant would habituate to the stimulus, the startle response would decrease. MEP amplitude would then return to baseline levels. However, this is not the pattern we observe. An alternative possibility is that participants learn the temporal contingency between the stimulus and TMS. Here, compensatory expectation-based change in MEP amplitude would be observed. In this scenario, there would be no change in MEP amplitude during early trials because the stimulus has not yet become informative of the TMS pulse timing. However, as participants learn how to predict TMS timing by the stimulus, MEP amplitude would decrease. This is the pattern we observe in our data. We have clarified these alternatives in the revised manuscript as follows:

      “Two putative mechanisms through which sound cuing may drive motor inhibition have been proposed, positing either that explicit cueing of TMS timing results in compensatory processes that drive MEP reduction (Capozio et al., 2021; Tran et al., 2021), or suggesting the evocation of a startle response that leads to global inhibition (Fisher et al., 2004; Furubayashi et al., 2000; Ilic et al., 2011; Kohn et al., 2004; Wessel & Aron, 2013). Critically, we can dissociate between these theories by exploring the temporal dynamics of MEP attenuation. One would expect a startle response to habituate over time, where MEP amplitude would be reduced during startling initial trials, followed by a normalization back to baseline throughout the course of the experiment as participants habituate to the startling stimulus. Alternatively, if temporally contingent sound-cueing of TMS drives inhibition, MEP amplitudes should decrease over time as the relative timing of TUS and TMS is being learned, followed by a stabilization at a decreased MEP amplitude once this relationship has been learned.”

      1.14) Can the authors further motivate the drastic change in intensities between Exp1 and 2? Is it due to the 250-500 carrier difference? Is this coming from the loss of power at 500 kHz?

      The change in intensities between Experiments I and II was not an intentional experimental manipulation. Following completion of data acquisition, our TUS system received a firmware update that differentially corrected the 250 kHz and 500 kHz stimulation intensities. In this manuscript, we report the actual free-water intensities applied during our experiments.

      1.15) Exp 3: Did 4 separate blocks of TUS-TMS and normalized for different TMS intensities used with respect to baseline. But how different was it. Why adjusting and then re adjusting intensities? 

      The TMS intensities required to evoke a 1 mV MEP under the four sound-sham conditions significantly differed from the intensities required for baseline. In the revised appendix, we have now included a figure depicting the TMS intensities for these conditions, as well as statistical tests demonstrating each condition required a significantly higher TMS intensity than baseline.

      TMS intensities were re-adjusted to avoid floor effects when assessing the efficacy of on-target TUS. Sound-sham conditions themselves attenuate MEP amplitude. This is also evident from the higher TMS intensities required to evoke a 1 mV MEP under these conditions. If direct neuromodulation by TUS would have further decreased MEP amplitude, the concern was that effects might not be detectable within such a small range of MEP amplitudes.

      1.16) In Exp 4, TUS targeted the ventromedial WM tract. Since direct electrical stimulation on white matter pathways within the frontal lobe can modulate motor output probably through dense communication along specific white matter pathways (e.g., Vigano et al., 2022, Brain), how did the authors ensure that this condition is really ineffective? Furthermore, the stimulation might have covered a lot more than just white matter. Acoustic and thermal simulations would be helpful here as well. 

      Thank you for pointing out this possibility. Ultrasonic and electrical stimulation have quite distinct mechanisms of action. Therefore, it is challenging to directly compare these two approaches. There is a small amount of evidence that ultrasonic neuromodulation of white matter tracts is possible. However, the efficacy of white matter modulation is likely much lower, given the substantially lesser degree of mechanosensitive ion channel expression in white matter as opposed to gray matter (Sorum et al., 2020, PNAS). Further, recent work has indicated that ultrasonic neuromodulation of myelinated axonal bundles occurs within the thermal domain (Guo et al., 2022, SciRep), which is not possible with the intensities administered in the current study. Nevertheless, based on Experiment IV in isolation, it cannot be definitively excluded that TUS induced direct neuromodulatory effects in addition to confounding auditory effects. However, Experiment IV does not possess sufficient inferential power on its own and must be interpreted in tandem with Experiments I-III. Taken together with those findings, it is unlikely that a veridical neuromodulation effect is seen here, given the equivalent or lower stimulation intensities, the substantially deeper stimulation site, and the absence of an additional control condition in Experiment IV. This likelihood is further decreased by the fact that inhibitory effects under masking descriptively scale with the audibility of TUS.

      Off-target effects such as unintended co-stimulation of gray matter when targeting white matter is always an important factor to consider. Unfortunately, individualized simulations for Experiment IV are not available. However, the same type of transducer and fundamental frequency was used as in Experiment II, for which we do have simulations. Given the size of the focus and the very low in-situ intensities extending beyond the main focal point, it is incredibly unlikely that effective stimulation was administered outside white matter in a meaningful number of participants. Nevertheless, the reviewer is correct that this can only be directly confirmed with simulations, which remain infeasible due to both technical and practical constraints. We have included the following in the revised manuscript:

      “The remaining motor inhibition observed during masked trials likely owes to, albeit decreased, persistent audibility of TUS during masking. Indeed, MEP attenuation in the masked conditions descriptively scales with participant reports of audibility. This points towards a role of auditory confound volume in motor inhibition (Supplementary Fig. 8). Nevertheless, one could instead argue that evidence for direct neuromodulation is seen here. This is unlikely for a number of reasons. First, white matter contains a lesser degree of mechanosensitive ion channel expression and there is evidence that neuromodulation of these tracts may occur primarily in the thermal domain (Guo et al., 2022; Sorum et al., 2021). Second, Experiment IV lacks sufficient inferential power in the absence of an additional control and must therefore be interpreted in tandem with Experiments I-III. These experiments revealed no evidence for direct neuromodulation using equivalent or higher stimulation intensities and directly targeting grey matter while also using multiple control conditions. Therefore, we propose that persistent motor inhibition during masked trials owes to continued, though reduced, audibility of the confound (Supplementary Fig. 8). However, future work including an additional control (site) is required to definitively disentangle these alternatives.”

      1.17) Still for Exp 4. the rationale for the 100% MSO or 120% of rMT is not clear, especially with respect to Exp 1 and 2. Equipment is similar as well as raw MEPs amplitudes, therefore the different EMG gain might have artificially increased TMS intensities. Could it have impacted the measured neuromodulatory effects?

      Experiment IV was conducted independently at a different institute than Experiments I-II. In contrast to Experiments I-II, a gel pad was used to couple TUS to the participant’s head. The increased TMS-to-cortex distance introduced by the gel pad necessitates higher TMS intensities to compensate for the increased offset. In fact, in 9/12 participants, the intended intensity at 120% rMT exceeded the maximum stimulator output. In those cases, we defaulted to the maximum stimulator output (i.e., 100% MSO). We have clarified in the revised supplementary material as follows:

      “We aimed to use 120% rMT (n = 3). However, if this intensity surpassed 100% MSO, we opted for 100% MSO instead (n = 9). The mean %MSO was 94.5 ± 10.5%. The TMS intensities required in this experiment were higher than those required in Experiments I-II using the same TMS coil, though still within approximately one standard deviation. This is likely due to the use of a gel pad, which introduces more distance between the TMS coil and the scalp, thus requiring a higher TMS intensity to evoke the same motor activity.”

      Regarding the EMG gain, this did not affect TMS intensities and did not impact the measured neuromodulatory effects. The EMG gain at acquisition is always considered during signal digitization and further analyses.
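      The intensity rule described above for Experiment IV can be written out as a short sketch. The function name and the example thresholds are hypothetical and are not taken from the study's code; the only rule assumed is the one stated in the text, namely 120% of rMT, capped at the maximum stimulator output.

```python
def tms_intensity_percent_mso(rmt_percent_mso, target_fraction=1.2, ceiling=100.0):
    """Return the TMS intensity in %MSO: target_fraction x rMT, capped at the ceiling.

    rmt_percent_mso is the resting motor threshold expressed in % of maximum
    stimulator output (MSO). The name and defaults are illustrative assumptions.
    """
    if rmt_percent_mso <= 0:
        raise ValueError("rMT must be a positive percentage of MSO")
    return min(target_fraction * rmt_percent_mso, ceiling)

# Hypothetical thresholds, not participant data.
for rmt in (70.0, 90.0):
    print(rmt, "->", tms_intensity_percent_mso(rmt))
```

      Under this rule, any participant whose rMT exceeds roughly 83% MSO is stimulated at 100% MSO, which is the case the authors describe for 9 of 12 participants.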

      1.18) Exp. 4. It would be interesting to provide the changes in MEP amplitudes for those subjects who rated "inaudible" in the self-rating compared to the others. That's an important part of the interpretation: inaudible conditions lead to inhibition, so there is an effect. The auditory confound is not additive to the TUS effect. 

      Previously, we only provided participant’s ratings of audibility, and showed that conditions that were rated as inaudible more often showed less inhibition, descriptively indicating that inaudible stimulation does not lead to inhibition. This interpretation is in line with our conclusion that the TUS auditory confound acts as a cue signaling the upcoming TMS pulse, thus leading to preparatory inhibition.

      We have now included an additional plot and discussion in Supplementary Figure 8 (Subjective Report of TUS Audibility). Here, we show the change in MEP amplitude from baseline for the three continuously masked TUS intensities as in the main manuscript, but now split by participant rating of audibility. Descriptively, less audible sounds result in no marked change or a smaller change in MEP amplitude. This supports our conclusion that direct neuromodulation is not being observed here. When participants were unsure whether they could hear TUS, or when they did hear TUS, more inhibition was observed. However, this is still to a lesser degree than unmasked stimulation which was nearly always audible, and likely also more salient. This also supports our conclusion that these results indicate a role of cue salience rather than direct neuromodulation. Regarding masked conditions where participants were uncertain whether they heard TUS, the sound was likely sufficient to act as a cue, albeit potentially subliminally. After all, preparatory inhibition is not a conscious action undertaken by the participant either. We would also like to note that participants reported perceived audibility after each block, not after each trial, so self-reported audibility was not a fine-grained measurement. The data from Experiment IV suggest that the volume of the cue has an impact on motor inhibition. Taken together with the points mentioned in 1.16, it is not possible to conclude there is evidence for direct neuromodulation in Experiment IV.

      1.19) I suggest to re-order sub panels of the main figures to fit with the chronologic order of appearance in the text. (e.g Figure 1 with A) Ultrasonic parameters, B) 3D-printed clamp, C) Sound-TMS coupling, D) Experimental condition). 

      We have restructured the figures in the manuscript to provide more clarity and to have greater alignment with the eLife format.

      2.1) Although auditory confounds during TUS have been demonstrated before, the thorough design of the study will lead to a strong impact in the field.

      We thank the reviewer for recognition of the impact of our work. They highlight that auditory confounds during TUS have been demonstrated previously. Indeed, our work builds upon a larger research line on auditory confounds. The current study extends on the confound’s presence by quantifying its impact on motor cortical excitability, but perhaps more importantly by invalidating the most robust and previously replicable findings in humans. Further, this study provides a way forward for the field, highlighting the necessity of (in)active control conditions and tightly matched sham conditions for appropriate inferences in future work. We have amended the abstract to better reflect these points:

      “Primarily, this study highlights the substantial shortcomings in accounting for the auditory confound in prior TUS-TMS work where only a flip-over sham control was used. The field must critically reevaluate previous findings given the demonstrated impact of peripheral confounds. Further, rigorous experimental design via (in)active control conditions is required to make substantiated claims in future TUS studies.”

      2.2) A few minor [weaknesses] are that (1) the overview of previous related work, and how frequent audible TUS protocols are in the field, could be a bit clearer/more detailed

      We have expanded on previous related work in the revised manuscript:

      “Indeed, there is longstanding knowledge of the auditory confound accompanying pulsed TUS (Gavrilov & Tsirulnikov, 2012). However, this confound has only recently garnered attention, prompted by a pair of rodent studies demonstrating indirect auditory activation induced by TUS (Guo et al., 2022; Sato et al., 2018). Similar effects have been observed in humans, where exclusively auditory effects were captured with EEG measures (Braun et al., 2020). These findings are particularly impactful given that nearly all TUS studies employ pulsed protocols, from which the pervasive auditory confound emerges (Johnstone et al., 2021).”

      2.3) The acoustic control stimulus can be described in more detail

      We have elaborated upon the masking stimulus for each experiment in the revised manuscript as follows:

      Experiment I: “In addition, we also included a sound-only sham condition that resembled the auditory confound. Specifically, we generated a 1000 Hz square wave tone with 0.3 ms long pulses using MATLAB. We then added white noise at a signal-to-noise ratio of 14:1. This stimulus was administered to the participant via bone-conducting headphones.”

      Experiment II: “In this experiment, the same 1000 Hz square wave auditory stimulus was used for sound-only sham and auditory masking conditions. This stimulus was administered to the participant over in-ear headphones.”

      Experiment III: “Auditory stimuli were either 500 or 700 ms in duration, the latter beginning 100 ms prior to TUS (Supplementary Fig. 3.3). Both durations were presented at two pitches. Using a signal generator (Agilent 33220A, Keysight Technologies), a 12 kHz sine wave tone was administered over speakers positioned to the left of the participant as in Fomenko and colleagues (2020). Additionally, a 1 kHz square wave tone with 0.5 ms long pulses was administered as in Experiments I, II, IV, and prior research (Braun et al., 2020) over noise-cancelling earbuds.”

      Experiment IV: “We additionally applied stimulation both with and without a continuous auditory masking stimulus that sounded similar to the auditory confound. The stimulus consisted of a 1 kHz square wave with 0.3 ms long pulses. This stimulus was presented through wired bone-conducting headphones (LBYSK Wired Bone Conduction Headphones). The volume and signal-to-noise ratio of the masking stimulus were increased until the participant could no longer hear TUS, or until the volume became uncomfortable.”

      In the revised manuscript we have also open-sourced the audio files used in Experiments I, II, and IV, as well as a recording of the output of the signal generator for Experiment III:

      “Auditory stimuli used for sound-sham and/or masking for each experiment are accessible here: https://doi.org/10.5281/zenodo.8374148.”
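      As a rough illustration of the stimulus described for Experiment I, the sketch below generates a 1 kHz pulsed square wave with 0.3 ms pulses and adds white noise. The authors generated their stimulus in MATLAB and their actual files are available at the link above; this Python version is not their code. The sampling rate, the 500 ms duration, the unipolar pulse shape, and the reading of the 14:1 signal-to-noise ratio as a ratio of mean power are all assumptions made for the example, and the function names are hypothetical.

```python
import numpy as np

def pulsed_square_tone(duration_s=0.5, rate_hz=1000.0, pulse_s=0.0003, fs=44100):
    """Unipolar pulse train: one pulse of length pulse_s every 1/rate_hz seconds."""
    t = np.arange(int(round(duration_s * fs))) / fs
    phase = np.mod(t, 1.0 / rate_hz)          # time elapsed within the current period
    return np.where(phase < pulse_s, 1.0, 0.0)

def add_white_noise(signal, snr=14.0, seed=0):
    """Add Gaussian white noise scaled so that signal power / noise power == snr."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(signal.shape)
    scale = np.sqrt(np.mean(signal ** 2) / (snr * np.mean(noise ** 2)))
    noise = noise * scale
    return signal + noise, noise

tone = pulsed_square_tone()
masked, noise = add_white_noise(tone)
print("duty cycle:", round(float(tone.mean()), 3))
print("power ratio:", round(float(np.mean(tone ** 2) / np.mean(noise ** 2)), 3))
```

      If the ratio in the manuscript refers to amplitude rather than power, the scaling line would need to change accordingly.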

      2.4) The finding that remaining motor inhibition is observed during acoustically masked trials deserves further discussion.

      We agree. Please refer to points 1.16 and 1.18.

      2.5) In several places, the authors state to have "improved" control conditions, yet remain somewhat vague on the kind of controls previous work has used (apart from one paragraph where a similar control site is described). It would be useful to include more details on this specific difference to previous work.

      In the revised manuscript, we have clarified the control condition used in prior studies as follows:

      Abstract:

      “Primarily, this study highlights the substantial shortcomings in accounting for the auditory confound in prior TUS-TMS work where only a flip-over sham control was used.”

      Introduction:

      “To this end, we substantially improved upon prior TUS-TMS studies implementing solely flip-over sham by including both (in)active control and multiple sound-sham conditions.”

      Methods:

      “We introduced controls that improve upon the sole use of flip-over sham conditions used in prior work. First, we applied active control TUS to the right-hemispheric face motor area, allowing for the assessment of spatially specific effects while also better mimicking on-target peripheral confounds. In addition, we also included a sound-only sham condition that closely resembled the auditory confound.”

      2.6) I also wondered how common TUS protocols are that rely on audible frequencies. If they are common, why do the authors think this confound is still relatively unexplored (this is a question out of curiosity). More details on these points might make the paper a bit more accessible to TUS-inexperienced readers. 

      Regarding the prevalence of the auditory confound, please refer to point 2.2.

      Peripheral confounds associated with brain stimulation can have a strong impact on outcome measures, often even overshadowing the intended primary effects. This is well known from electromagnetic stimulation. For example, the click of a TMS pulse can strongly modulate reaction times (Duecker et al., 2013, PlosOne) with effect sizes far beyond that of direct neuromodulation. Unfortunately, this consideration has not yet fully been embraced by the ultrasonic neuromodulation community. This is despite long known auditory effects of TUS (Gavrilov & Tsirulnikov, 2012, Acoustical Physics). It was not until the auditory confound was shown to impact brain activity by Guo et al., and Sato et al., (2018, Neuron) that the field began to attend to this phenomenon. Mohammadjavadi et al., (2019, BrainStim) then showed that neuromodulation persisted even in deaf mice, and importantly, also demonstrated that ramping ultrasound pulses could reduce the auditory brainstem response (ABR). Braun and colleagues (2020, BrainStim) were the first to bring attention to the auditory confound in humans, while also discussing masking stimuli. This was followed by a study from Johnstone and colleagues (2021, BrainStim) who did preliminary work assessing both masking and ramping in humans. Recently, Liang et al., (2023) proposed a new form of masking colourfully titled the ‘auditory Mondrian’. Further research into the peripheral confounds associated with TUS is on the way.

      However, we agree that the confound remains relatively unexplored, particularly given the substantial impact it can have, as demonstrated in this paper. What is currently lacking is an assessment of the reproducibility of previous work that did not sufficiently consider the auditory confound. The current study constitutes a strong first step to addressing this issue, and indeed shows that results are not reproducible when using control conditions that are superior to flip-over sham, like (in)active control conditions and tightly matched sound-sham conditions. This is particularly important given the fundamental nature of this research line, where TUS-TMS studies have played a central role in informing choices for stimulation protocols in subsequent research.

      We would speculate that, with TUS opening new frontiers for neuroscientific research, there comes a rush of enthusiasm wherein laying the groundwork for a solid foundation in the field can sometimes be overlooked. Therefore, we hope that this work sends a strong message to the field regarding how strong of an impact peripheral confounds can have, also in prior work. Indeed, at the current stage of the field, we see no justification not to include proper experimental control moving forward. Only when we can dissociate peripheral effects from direct neuromodulatory effects can our enthusiasm for the potential of TUS be warranted.

      2.7) Results, Fig. 2: Why did the authors not directly contrast target TUS and control conditions? 

      Please refer to point 1.1.

      2.8) The authors observe no dose-response effects of TUS. Does increasing TUS intensity also lead to an increase in TUS-produced sounds? If so, should this not also lead to dose-response effects?

      We thank the reviewer for this insightful question. Yes, increasing TUS intensity results in an increased volume of the auditory confound. Under certain circumstances this could lead to ‘dose-response’ effects. In the manuscript, we propose that the auditory confound acts as a cue for the upcoming TMS pulse, thus resulting in MEP attenuation once the cue is informative (i.e., when TMS timing can be predicted by the auditory confound). In this scenario, volume can be taken as the salience of the cue. When the auditory confound is sufficiently salient, it should cue the upcoming TMS pulse and thus result in a reduction of MEP amplitude.

      If we take Experiment II as an example (Figure 3B), the 19.06 W/cm2 stimulation would be louder than the 6.35 W/cm2 intensity. However, as both intensities are audible, they both cue the upcoming TMS pulse. One could speculate that the very slight (nonsignificant) further decrease for 19.06 W/cm2 stimulation could owe to a more salient cueing.

      One might notice that MEP attenuation is less strong in Experiment I, even though higher intensities were applied. Directly contrasting intensities from Experiments I and II was not feasible due to differences in transducers and experimental design. From the perspective of sound cueing of the upcoming TMS pulse, the auditory confound cue was less informative in Experiment I than Experiment II, because TUS stimulus durations of both 100 and 500 ms were administered, rather than solely 500 ms durations. This could explain why descriptively less MEP attenuation was observed in Experiment I, where cueing was less consistent.

      Perhaps more convincing evidence of a sound-based ‘dose-response’ effect comes from Experiment IV (Figure 4B). Here, we propose that continuous masking reduced the salience of the auditory confound (cue), and thus, less MEP attenuation was observed. Indeed, we see less MEP change for masked stimulation. For the lowest administered volume during masked stimulation, there was no change in MEP amplitude from baseline. For higher volumes, however, there was a significant inhibition of MEP amplitude, though it was still less attenuation than unmasked stimulation. These results indicate a ‘dose-response’ effect of volume. When the volume (intensity) of the auditory confound was low enough, it was inaudible over the continuous mask (also as reported by participants), and thus it did not act as a cue for the upcoming TMS pulse, therefore not resulting in motor inhibition. When the volume (intensity) was higher, fewer participants reported not being able to hear the stimulation, so the cue was to a given extent more salient, and in line with the cueing hypothesis more inhibition was observed.

      In summary, because the volume of the auditory confound scales with the intensity of TUS, there may be dose-response effects of the auditory confound volume. Along the border of (in)audibility of the confound, as in masked trials of Experiment IV, we may observe dose-response effects. However, at clearly audible intensities (e.g., Experiment I & II), the size of such an effect would likely be small, as both volumes are sufficiently audible to act as a cue for the upcoming TMS pulse leading to preparatory inhibition.

      2.9) I wonder if the authors could say a bit more on the acoustic control stimulus. Some sound examples would be useful. The authors control for audibility, but does the control sound resemble the one produced by TUS? 

      Please refer to point 2.3.

      2.10) The authors' claim that the remaining motor inhibition observed during masked trials is due to persistent audibility of TUS relies "only" on participants' descriptions. I think this deserves a bit more discussion. Could this be evidence that there is a TUS effect in addition to the sound effect? 

      Please refer to points 1.16 and 1.18.

    1. Author response:

      The following is the authors’ response to the current reviews.

      We again thank you for the positive and constructive feedback on our manuscript, and for highlighting its contributions to understanding the role of CARD8 in viral protease-triggered sensing of viral spread, and the potential impact of our findings on chronic inflammation and immune activation. We agree that it will be important for future work to address whether or not HIV-1 protease-triggered CARD8 inflammasome activation contributes to chronic inflammation in PLWH who are receiving ART.

      In response to the question about the baseline level of IL-1β in Fig. 4D, the figure below shows the mock condition for the CD4+ T cell:MDM coculture. We had done this control in parallel with the data presented in the submitted figure. Levels of IL-1β during HIV-1 infection are increased over background (i.e., mock infection). We note that for donor G the IL-1β concentration is below the limit of detection for this assay. Thus, it remains possible that other inflammasomes contribute modestly during cell-to-cell transmission of HIV-1; however, incomplete knockout of CARD8 in a minority of cells may also contribute to the observed levels of IL-1β in response to HIV-1 infection. Nonetheless, collectively, our data strongly support the role for CARD8 in HIV-1 protease-triggered inflammasome activation.


      The following is the authors’ response to the original reviews.

      Joint Public Review:

      Following up on their previous work, the authors investigated whether cell-to-cell transmission of HIV-1 activates the CARD8 inflammasome in macrophages, an important question given that inflammasome activation in myeloid cells triggers proinflammatory cytokine release. The data support the idea that CARD8 is activated by the viral protease and promotes inflammation. However, time-course analyses in primary T cells and macrophages and further information on the specific inflammasome involved would further increase the significance of the study.

      Strengths:

      The manuscript is well-written and the data is of good quality. The evidence that CARD8 senses the HIV-1 protease in the context of cell-to-cell transmission is important since cell-to-cell transmission is thought to play a key role in viral spread in vivo, and inflammation is a major driver of disease progression. Clean knockout experiments in primary macrophages are a notable strength and the results clearly support the role of CARD8 in protease-dependent sensing of viral spread and the induction of IL1β release and cell death. The finding that HIV-1 strains that are resistant to protease inhibitors differ in CARD8 activation and IL1β production is interesting and underscores the potential clinical relevance of these results.

      Weaknesses:

      One weakness is that the authors used T cell lines which might not faithfully reflect the efficiency of HIV-1 production and cell-cell transfer by primary T cells. To assess whether CARD8 is also activated by protease from incoming viral particles earlier time points should be analyzed. Finally, while the authors exclude the role of NLRP3 in IL-1b and the death of macrophages it would be interesting to know whether the effect is still Gasdermin D dependent.

      Recommendations for the authors

      (1) The co-culture assay should also be done between primary CD4 cells and primary MDMs, because T-cell lines produce much more virus, and the efficiency of cell-to-cell transmission might be dramatically different in primary cells compared to cell lines.

      We have now added data from experiments using infected primary CD4 cells as the donor cells in cell-to-cell HIV-1 transmission to MDMs in new Figure 4. The results largely phenocopy the SUPT1:MDM coculture in that we observe inflammasome activation after co-culture of HIV-infected primary T cells with primary MDMs. We find that the inflammasome activity induced by CD4:MDM cell-to-cell transmission is abrogated by knockout of CARD8 in the MDMs or by treatment with the HIV protease inhibitor lopinavir (LPV) or the caspase 1 inhibitor VX765, suggesting that this activation is dependent on CARD8, HIV protease, and caspase 1. Additionally, the signal persists in the presence of the reverse transcriptase inhibitor nevirapine (NVP), suggesting that the incoming protease is driving activation.

      (2) For all co-culture experiments, supernatants were collected at 48 or 72 hours. Since CARD8 activation is expected to be driven by incoming viral particles without RT, they should measure cytokine production at much earlier time points. 2-3 days co-culture raises concerns. Ideally, the authors can provide a time-course.

      We have now added a time course of the SUPT1:MDM coculture from 3 unique donors taken at 4, 24, 48, and 72 hours post coculture in the presence or absence of reverse transcriptase inhibitor (see new Figure 3B) as well as for the primary CD4 cells to MDM co-culture (see new Figure 4B). We detect IL-1β at the 24-hour time point (and later), but not at the 4-hour time point, which is slower than what was detected by direct cell-free infection (Kulsuptrakul et al., 2023). However, we still hypothesize that this is driven by active incoming viral protease because the signal is not abrogated by a reverse transcriptase inhibitor, which indicates that de novo protease production is not necessary. We also observed that IL-1β levels do not increase after plateauing 24 h after establishing the co-culture, suggesting that secondary infection does not further amplify inflammasome activation. We now speculate on this in the Discussion.

      (3) A potential confounder in the data in Figure 4 is that despite rightly including the cognate adaptations in the Gag cleavage sites with the PI-R protease mutants, some of these viruses still display Gag processing defects. Can the authors disentangle the potency of PR mutant cleavage with either reduced cell entry or reduced protease availability due to processing defects in the incoming virions?

      The reviewer is correct that although the western blot with the p24<sup>gag</sup> antibody suggests that Gag is processed, we cannot rule out that other variables contribute to the observed difference in CARD8 inflammasome activation. For example, PI-R clones may differ from the LAI strain in protease substrate specificity, in the efficiency or kinetics of viral assembly, or in Gag dimerization, and other factors may also ultimately influence CARD8 inflammasome activation. We have updated the text to reflect these possibilities. Nonetheless, this argument does not change the conclusion that CARD8 inflammasome activation is affected by protease mutations acquired during drug resistance.

      (4) There is considerable donor variation in the macrophages (unsurprising) but can the authors correlate this with CARD8 expression and are there any off-target effects on macrophage permissivity to HIV-1 infection?

      We have now considerably increased the number of primary cell donors from the first submission (see Author response table 1 below). We find that the non-responsive donor presented in the first submission is aberrant since all others do respond to a greater or lesser degree (Figure 3, Figure 4). However, the reviewer may be correct that the particular aberrant donor MDMs were poorly infected. We also note that despite donor variability in the degree of activation (IL-1β secretion) from cocultures with HIV<sub>BaL</sub>-infected SUPT1 cells, HIV-induced activation is comparable to the activation induced by VbP (see new Figure 3–figure supplement 1B). We do not see a notable difference in CARD8 expression between donors. Nonetheless, with the added number of primary cell donors, the data are consistent with a role of primary MDMs from nearly all donors in supporting a CARD8-dependent, HIV-protease dependent inflammasome response after co-culture with infected T cells. We have left in data from all of the donors so that readers can appreciate the variability among primary cells.

      Author response table 1.

      In addition, to address the reviewer's concerns about off-target effects of the sgRNAs on macrophage permissivity, we assessed our CD4:MDM cocultures for percent infectivity via intracellular p24<sup>gag</sup> in AAVS1 vs CARD8 KO MDMs and observed no significant difference in infectivity (see Author response image 1). This suggests that infection of MDMs after co-culture with T cells is not affected by any potential off-target effects of the sgRNAs.

      Author response image 1.

      Equivalent infection in AAVS1 vs CARD8 KO MDMs. AAVS1 or CARD8 KO MDMs from donor 12 were cocultured with mock or HIV-infected CD4 T cells as described in Figure 4D for 72 hours, then assessed for HIV infection of the MDMs by washing away CD4 T cells, harvesting MDMs, and staining attached MDMs for intracellular p24<sup>gag</sup> for flow cytometry analysis. Datasets represent mean ± SD (n=2 technical replicates from one donor). One-way ANOVA with Dunnett’s test using GraphPad Prism 10. ns = not significant, *p<0.05, **p<0.01, ***p<0.001, ****p<0.0001.

      (5) The authors suggest that NLRP3 is unlikely to be the mediator of IL-1b and cell death in the macrophages. Is this death still GSDMD-dependent, what other NLRs are expressed in this system, and does it make a difference what PAMP you use to prime the response?

      We have now added additional data in support of the conclusion that NLRP3 is not a mediator of the IL-1β secretion in the coculture of infected SUPT1 cells with primary MDMs. In addition to using an NLRP3 inhibitor, we have now also made NLRP3 KO MDMs and used these in the coculture experiments. These show that the IL-1β secretion after coculture of infected SUPT1 cells and primary MDMs is mediated by CARD8 and not NLRP3, because the signal is abrogated by CARD8 knockout but not by NLRP3 knockout. These new data are shown in Figure 3C and D.

      To assess the role of GSDMD, we treated SUPT1:MDM cocultures with disulfiram, a GSDMD inhibitor (Hu et al., 2020). Disulfiram treatment abrogated IL-1β secretion, suggesting that this activation is indeed GSDMD-mediated (see Author response image 2 below). We chose not to include the disulfiram result in the final manuscript since we have not ruled out cytotoxic effects of the drug.

      There are likely other NLRs expressed in primary MDMs; however, since inflammasome activation is completely absent in the CARD8 KO MDMs, we infer that CARD8 is the main inflammasome-forming sensor in this system. However, we cannot rule out the possibility of other innate sensors being activated downstream of CARD8 or under different differentiation conditions.

      To address the concern that alternative priming affects CARD8 activation, we compared pre-treatment of cells with Pam3CSK4 or lipopolysaccharide (LPS) in the presence or absence of HIV protease inhibitor and reverse transcriptase inhibitor. Regardless of the priming agent used, we observed HIV protease-dependent activation that persisted in the presence of reverse transcriptase inhibitor, suggesting that CARD8 is the main sensor under LPS and Pam3CSK4 priming (new Figure 3–figure supplement 1A).

      Author response image 2.

      Inflammasome activation following cell-to-cell HIV infection is mediated by GSDMD. SUPT1-CCR5 cells were either mock-infected or infected with HIV-1<sub>NL4.3BaL</sub> for 20 hours before coculturing with MDMs in either the presence or absence of the GSDMD inhibitor disulfiram (25 μM). Cocultures were harvested 24 hours later to assess (left) IL-1β secretion via IL-1 reporter assay and (right) cell viability via CellTiter-Glo® assay. Viability was calculated by normalizing to relative luminescence units in the mock untreated control. Dotted line indicates limit of detection (LoD). Dashed line indicates 100% viability as determined by untreated mock control. Datasets represent mean ± SD (n=2 technical replicates for one donor). Two-way ANOVA with Sidak’s test using GraphPad Prism 10. ns = not significant, *p<0.05, **p<0.01, ***p<0.001, ****p<0.0001.

      Minor points

      (1) In Figure 1, the authors should clarify whether LAI or LAI-VSV-G was used.

      Wild-type virus (LAI strain) was used in Figure 1. This has now been clarified in the figure legend.

      (2) In Figure 1, the fraction of infected cells without DEAE was ~20% in both WT and CARD8 KO THP-1, suggesting somewhat efficient viral entry even in the absence of DEAE. How do the authors reconcile this with the lack of IL-1β production? The increase in infection observed in WT THP-1 +DEAE was overall modest (from ~20% to 25-30%) compared to the dramatic difference in IL-1β production. Can they provide more evidence or discuss how DEAE might be impacting cytokine production? If differences in viral entry are the explanation for differences in inflammasome activation, then they should be able to overcome this by using virus at a higher MOI in the absence of DEAE. Experiments proposed in Figure 1 +/- DEAE should be repeated using a range of MOI for LAI and showing the corresponding percent infection in THP-1 cells (which is not shown in Figure S2 for LAI-VSVG).

      We hypothesize that the lack of IL-1β production without DEAE is likely due to an insufficient amount of incoming viral protease to induce CARD8 activation. Though the increase in infection with DEAE is modest by intracellular p24<sup>gag</sup> at 24 hours post infection, we infer that intracellular p24<sup>gag</sup> may be largely underestimating the actual increase in viral efficiency achieved with DEAE (now in Supplemental Note). We have also updated Figure S2 (now Figure 2–figure supplement 1) legend to include the percent infection for HIV-1<sub>LAI</sub> and HIV-1<sub>LAI-VSVG</sub> infections. We agree that activation in the absence of DEAE could be overcome by infecting with a more concentrated viral stock to increase the MOI. Indeed, our decision to use the cell-to-cell transmission model achieves this in a more physiologic context.

      (3) In Figure S1, the authors point out that RT-activity in the supernatants was similar in the cell-free vs. cell-to-cell model. While in the transwell system THP-1 cells are the only cells capable of producing new virions, how are they able to differentiate viral production from sup-T1 vs. THP-1 in the cell-to-cell system? At a minimum, they should provide some data on the observed RT activity in matching wells containing the same number of infected sup-T1 cells utilized in coculture experiments.

      We think this may have been a misinterpretation. In Figure S1 (now Figure 1B, right), we compare the amount of virus available in the lower chamber of the transwell versus the cell-to-cell condition. We are not comparing cell-free to cell-to-cell infection. We have changed the text and figure title to clarify this point.

      (4) Can the authors provide additional comments on the lack of IL-1β release in donor C in Figure 3? The donor did not produce IL-1β in response to VbP or HIV, although the WB for CARD8 appears similar to the other two donors.

      We have now tested MDMs from additional donors and continue to find a range of IL-1β secretion after the coculture. However, donor C is aberrant since each of the other donors had detectable IL-1β secretion in response to VbP and HIV-1 to greater or lesser extents. Nonetheless, we have included additional donors summarized in the table above corresponding to major comment #4.

      (5) For Figure 3, can the authors provide information on the fraction of MDMs that were infected after coculture with sup-T1 cells? Why didn't the authors measure cell death in MDMs?

      It is difficult to measure the fraction of MDMs infected or dying in the cocultures since it is hard to separate signal from the T cells. Although it would be possible to do so, in this manuscript, we instead prefer to focus on the potential contribution of CARD8 inflammasome activation in exacerbating chronic inflammation in response to HIV rather than the depletion of macrophages.

      (6) In Figure 4, did the authors introduce the mutations associated with PI resistance into the same LAI backbone? If not, this is not a fair comparison, as viral protein expression levels were not at the same level, indicated in Figure 4A. Additionally, such comparison will be further strengthened by using cells other than 293T cells for the coculture assay.

      No, we did not introduce these mutations into LAI, since they were already in an NL4.3 backbone and NL4.3 and LAI differ by only 1 amino acid in protease. We have updated Table S1 to report this amino acid difference. We also note that in our previous manuscript we tested much more diverse proteases such as a clade A HIV-1, HIV-2, and SIVs and find comparable CARD8 cleavage to LAI.

      Additions not requested by Reviewers:

      THP-1 characterization

      In our previous work, we noticed that different “wildtype” THP-1 lines behaved uniquely in response to DEAE-dextran. In particular, one THP-1 line showed inflammasome activation in response to DEAE-dextran alone at the concentration used for spinoculations (20 μg/mL), whereas the other THP-1 line did not. Thus, we performed STR profiling on each THP-1 cell line and determined that the THP-1 cells used in our studies (JK THP1s) are distinct from THP-1 cells from ATCC at 3 different loci. These data are now included in the Supplemental Note (Figure A1). Please note that all experiments in this and the accompanying manuscript were performed in JK THP-1 cells.

      Whole plasmid sequencing of the PI-resistant HIV clones

      Since preprint submission, we have done whole plasmid Oxford Nanopore sequencing on the PI-resistant HIV clones obtained from the NIAID HIV/AIDS Specimen Repository Program. Of note, there were a handful of previously unreported mutations included in these plasmid stocks within protease. We have updated Table S1 to include an additional column titled “Additional amino acid changes in HIV<sup>PR</sup> relative to NL4.3.”

      References

      Hu JJ, Liu X, Xia S, Zhang Z, Zhang Y, Zhao J, Ruan J, Luo X, Lou X, Bai Y, Wang J, Hollingsworth LR, Magupalli VG, Zhao L, Luo HR, Kim J, Lieberman J, Wu H. 2020. FDA-approved disulfiram inhibits pyroptosis by blocking gasdermin D pore formation. Nat Immunol 21:736–745. doi:10.1038/s41590-020-0669-6

      Kulsuptrakul J, Turcotte EA, Emerman M, Mitchell PS. 2023. A human-specific motif facilitates CARD8 inflammasome activation after HIV-1 infection. eLife 12:e84108. doi:10.7554/eLife.84108

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      By utilizing UGGT-KO cells and a number of different ERAD substrates, the authors show that UGGTs are involved in preventing premature degradation of misfolded glycoproteins. They propose a concept by which the fate of glycoproteins can be determined by a tug-of-war between UGGTs and EDEMs.

      Strengths:

      The authors provided a wealth of data to indicate that UGGT1 competes with EDEMs, which promotes glycoprotein degradation.

      Weaknesses:

      Less clear, though, is the involvement of UGGT2 in the process. Also, to this reviewer, some data do not necessarily support the conclusion.

      Major criticisms:

      (1) One of the biggest problems I had on reading through this manuscript is that, while the authors appeared to generate UGGTs-KO cells from HCT116 and HeLa cells, it was not clearly indicated which cell line was used for each experiment. I assume that it was HCT116 cells in most cases, but I did not see that it was clearly mentioned. As the expression level of UGGT2 relative to UGGT1 is quite different between the two cell lines, it would be critical to know which cells were used for each experiment.

      Thank you for this comment. We have clarified this point, especially in the figure legends.

      (2) While most of the authors' conclusion is sound, some claims, to this reviewer, were not fully supported by the data. Especially I cannot help being puzzled by the authors' claim about the involvement of UGGT2 in the ERAD process. In most of the cases, KO of UGGT2 does not seem to affect the stability of ERAD substrates (ex. Fig. 1C, 2A, 3D). When the author suggests that UGGT2 is also involved in the ERAD, it is far from convincing (ex. Fig. 2D/E). Especially because now it has been suggested that the main role of UGGT2 may be distinct from UGGT1, playing a role in lipid quality control (Hung, et al., PNAS 2022), it is imperative to provide convincing evidence if the authors want to claim the involvement of UGGT2 in a protein quality control system. In fact, it was not clear at all whether even UGGT1 is also involved in the process in Fig. 2D/E, as the difference, if any, is so subtle. How can the authors be sure that this is significant enough? While the authors claim that the difference is statistically significant (n=3), this may turn out to be an experimental artifact. To say the least, I would urge the authors to try rescue experiments with UGGT1 or 2, to clarify that the defect in UGGT-DKO cells can be reversed. It may also be interesting to see that the subtle difference the authors observed is indeed N-glycan-dependent by testing a non-glycosylated version of the protein (just like NHK-QQQ mutants in Fig. 2C).

      We appreciate this comment. In light of it, we reevaluated the importance of UGGT2 for ER protein quality control. As this reviewer mentioned, KO of UGGT2 does not affect the stability of ATF6a, NHK, rRI332-Flag or EMC1-△PQQ-Flag (Fig. 1E, 2A, and 3DE). Furthermore, we tested whether overexpression of UGGT2 reverses the phenotype of UGGT-DKO regarding the degradation rate of NHK, and we found that it did not affect the degradation rate of NHK, whereas overexpression of UGGT1 restored the degradation rate to that in WT cells.

      Author response image 1.

      Collectively, these facts suggest that the role of UGGT2 in ER protein quality control is rather limited in HCT116 cells. Therefore, we have decided not to mention UGGT2 in the title, and weakened the overall claim that UGGT2 contributes to ER protein quality control. Tissues with high expression of UGGT2 or cultured cells other than HCT116 would be appropriate for revealing the detailed function of UGGT2.

      To this reviewer, it is still possible that the involvement of UGGT1 (or 2, if any) could be totally substrate-dependent, and the substrates used in Fig 2D or E happen not to be dependent on the action of UGGTs. To the reviewer, even without the data of Fig. 2D and E, the authors provide enough evidence to demonstrate the involvement of UGGT1 in preventing premature degradation of glycoprotein ERAD substrates. I am just afraid that the authors may have overinterpreted the data, as if the UGGTs are involved in stabilization of all glycoproteins destined for ERAD.

      Based on the point this reviewer raised, we decided to delete the previous Fig. 2D and 2E. The efficacy of UGGT1 in preventing early degradation may vary among substrates.

      (3) I am a bit puzzled by the DNJ treatment experiments. First, I do not see the detailed conditions of the DNJ treatment (concentration? Time?). Then, I was a bit surprised to see that there were so little G3M9 glycans formed, and there was about the same amount of G2M9 also formed (Figure 1 Figure supplement 4B-D), despite the fact that glucose trimming of newly synthesized glycoproteins is expected to be completely impaired (unless the authors used a DNJ concentration which does not completely impair the trimming of the first Glc). Even considering the involvement of Golgi endo-alpha-mannosidase, a similar amount of G3M9 and G2M9 may suggest that the experimental conditions used for this experiment (i.e. concentration of DNJ, duration of treatment, etc) are not properly optimized.

      We think that our DNJ treatment conditions are appropriate for evaluating the effect of DNJ. Based on previous papers (Ali and Field, 2000; Karlsson et al., 1993; Lomako et al., 2010; Pearse et al., 2010; Tannous et al., 2015), 0.5 mM DNJ is appropriate. In our previously reported experiment, a 16 h treatment with the mannosidase inhibitor kifunensine prior to cell collection was sufficient for N-glycan composition analysis (Ninagawa et al., 2014), and we treated cells for a similar time in Figure 1-Figure Supplement 4 and 5 (and Figure 1-Figure Supplement 6). We observed a clear inhibitory effect of DNJ on ATF6a degradation with 2 hours of pretreatment (Fig. 1G). Furthermore, our results are very reasonable and consistent with previous findings that DNJ increased GM9 the most (Cheatham et al., 2023; Gross et al., 1983; Gross et al., 1986; Romero et al., 1985). In addition to DNJ, we used CST for further experiments in new figures (Fig. 1H and Figure 1-Figure supplement 6). DNJ and CST are inhibitors of glucosidase; DNJ is a stronger inhibitor of glucosidase II, while CST is a stronger inhibitor of glucosidase I (Asano, 2000; Saunier et al., 1982; Szumilo et al., 1987; Zeng et al., 1997). An increase in G3M9 and G2M9 was detected using CST (Figure 1-Figure Supplement 6). Like DNJ, CST also inhibited ATF6a degradation in UGGT-DKO cells (Fig. 1H). These findings show that our experimental conditions using glucosidase inhibitors are appropriate and strongly support our model (Fig. 5). Differences between the effects of DNJ and CST are now described in our manuscript on pages 8 to 10.

      Reviewer #2 (Public Review):

      In this study, Ninagawa et al., shed light on UGGT's role in ER quality control of glycoproteins. By utilizing UGGT1/UGGT2 DKO cells, they demonstrate that several model misfolded glycoproteins undergo early degradation. One such substrate is ATF6alpha where its premature degradation hampers the cell's ability to mount an ER stress response.

      While this study convincingly demonstrates early degradation of misfolded glycoproteins in the absence of UGGTs, my major concern is the need for additional experiments to support the "tug of war" model involving UGGTs and EDEMs in influencing the substrate's fate - whether misfolded glycoproteins are pulled into the folding or degradation route. Specifically, it would be valuable to investigate how overexpression of UGGTs and EDEMs in WT cells affects the choice between folding and degradation for misfolded glycoproteins. Considering previous studies indicating that monoglucosylation influences glycoprotein solubility and stability, an essential question is: what is the nature of glycoproteins in UGGTKO/EDEMKO and potentially UGGT/EDEM overexpression cells? Understanding whether these substrates become more soluble/stable when GM9 versus mannose-only translation modification accumulates would provide valuable insights.

      In the new figure 2DE, we conducted overexpression experiments of structure formation factors UGGT1 and/or CNX, and degradation factors EDEMs. While overexpression of structure formation factors (Fig. 2DE) and KO of degradation factors (Ninagawa et al., 2015; Ninagawa et al., 2014) increased stability of substrates, KO of UGGT1 (Fig. 1E, 2A and 3DF) and overexpression of degradation factors (Fig. 2DE) (Hirao et al., 2006; Hosokawa et al., 2001; Mast et al., 2005; Olivari et al., 2005) accelerated degradation of substrates. A comparison of the properties of N-glycan with the normal type and the type without glucoses was already reported (Tannous et al., 2015). The rate of degradation of substrate was unchanged, but efficiency of secretion of substrates was affected.

      The study delves into the physiological role of UGGT, but is limited in scope, focusing solely on the effect of ATF6alpha in UGGT KO cells' stress response. It is crucial for the authors to investigate the broader impact of UGGT KO, including the assessment of basal ER proteotoxicity levels, examination of the general efflux of glycoproteins from ER, and the exploration of the physiological consequences due to UGGT KO. This broader perspective would be valuable for the wider audience. Additionally, the marked increase in ATF4 activity in UGGTKO requires discussion, which the authors currently omit.

      We evaluated the sensitivity of WT and UGGT1-KO cells to ER stress (Figure 4G). KO of UGGT1 increased the sensitivity to ER stress inducer Tg, indicating the importance of UGGT1 for resisting ER stress.

      We added the following description to the manuscript about ATF4 activity in UGGT1-KO: “In addition to this, UGGT1 is necessary for proper functioning of ER resident proteins such as ATF6a (Fig. 4B-F). It is highly possible that ATF6a undergoes structural maintenance by UGGT1, which could be necessary to avoid degradation and maintain proper function, because ATF6a with more rigid in structure tended to remain in UGGT1-KO cells (Fig. 4C). Responses of ERSE and UPRE to ER stress, which require ATF6a, were decreased in UGGT1-KO cells (Fig. 4DE). In contrast, ATF4 reporter activity was increased in UGGT1-KO cells (Fig. 4F), while the basal level of ATF4 in UGGT1-KO cells was comparable with that in WT (Figure 1-Figure supplement 2B). The ATF4 pathway might partially compensate the function of the ERSE and UPRE pathways in UGGT1-KO cells in acute ER stress.” This is now described on Page 17 in our manuscript.

      The discussion section is brief and could benefit from being a separate section. It is advisable for the authors to explore and suggest other model systems or disease contexts to test UGGT's role in the future. This expansion would help the broader scientific community appreciate the potential applications and implications of this work beyond its current scope.

      Thank you for making this point. The DISCUSSION part has now been separated in our manuscript. We added some points in the manuscript about other model organisms and diseases in the DISCUSSION as follows: “ Our work focusing on the function of mammalian UGGT1 greatly advances the understanding how ER homeostasis is maintained in higher animals. Considering that Saccharomyces cerevisiae does not have a functional orthologue of UGGT1 (Ninagawa et al., 2020a) and that KO of UGGT1 causes embryonic lethality in mice (Molinari et al., 2005), it would be interesting to know at what point the function of UGGT1 became evolutionarily necessary for life. Related to its importance in animals, it would also be of interest to know what kind of diseases UGGT1 is associated with. Recently, it has been reported that UGGT1 is involved in ER retention of Trop-2 mutant proteins, which are encoded by a causative gene of gelatinous drop-like corneal dystrophy (Tax et al., 2024). Not only this, but since the ER is known to be involved in over 60 diseases (Guerriero and Brodsky, 2012), we must investigate how UGGT1 and other ER molecules are involved in diseases.”

      Reviewer #3 (Public Review):

      This manuscript focuses on defining the importance of UGGT1/2 in the process of protein degradation within the ER. The authors prepared HCT116 cells lacking UGGT1, UGGT2, or both UGGT1/UGGT2 (DKO) and then monitored the degradation of specific ERAD substrates. Initially, they focused on the ER stress sensor ATF6 and showed that loss of UGGT1 increased the degradation of this protein. This degradation was blocked by deletion of ERAD-specific factors (e.g., SEL1L, EDEM) or treatment with mannosidase inhibitors such as kifunensine, indicating that this is mediated through a process involving increased mannose trimming of the ATF6 N-glycan. This increased degradation of ATF6 impaired the function of this ER stress sensor, as expected, reducing the activation of downstream reporters of ER stress-induced ATF6 activation. The authors extended this analysis to monitor the degradation of other well-established ERAD substrates including A1AT-NHK and CD3d, demonstrating similar increases in the degradation of destabilized, misfolding protein substrates in cells deficient in UGGT. Importantly, they did experiments to suggest that re-overexpression of wild-type, but not catalytically deficient, UGGT rescues the increased degradation observed in UGGT1 knockout cells. Further, they demonstrated the dependence of this sensitivity to UGGT depletion on N-glycans using ERAD substrates that lack any glycans. Ultimately, these results suggest a model whereby depletion of UGGT (especially UGGT1 which is the most expressed in these cells) increases degradation of ERAD substrates through a mechanism involving impaired re-glucosylation and subsequent re-entry into the calnexin/calreticulin folding pathway.

      I must say that I was under the impression that the main conclusions of this paper (i.e., UGGT1 functions to slow the degradation of ERAD substrates by allowing re-entry into the lectin folding pathway) were well-established in the literature. However, I was not able to find papers explicitly demonstrating this point. Because of this, I do think that this manuscript is valuable, as it supports a previously assumed assertion of the role of UGGT in ER quality control. However, there are a number of issues in the manuscript that should be addressed.

      Notably, the focus on well-established, trafficking-deficient ERAD substrates, while a traditional approach to studying these types of processes, limits our understanding of global ER quality control of proteins that are trafficked to downstream secretory environments where proteins can be degraded through multiple mechanisms. For example, in Figure 1-Figure Supplement 2, UGGT1/2 knockout does not seem to increase the degradation of secretion-competent proteins such as A1AT or EPO, instead appearing to stabilize these proteins against degradation. They do show reductions in secretion, but it isn't clear exactly how UGGT loss is impacting ER Quality Control of these more relevant types of ER-targeted secretory proteins.

      We appreciate your comment. It is certainly difficult to assess in detail how UGGT1 functions against secretion-competent proteins, but we think that the folding state of these proteins is improved, which avoids their degradation and increases their secretion. In Figure 1-Figure supplement 2E, there is a clear decrease in secretion of EPO in UGGT1-KO cells, suggesting that UGGT1 also inhibits degradation of such substrates. Note that, as shown in Fig. 3A-C, once a protein forms a solid structure, it is rarely degraded in the ER.

      Lastly, I don't understand the link between UGGT, ATF6 degradation, and ATF6 activation. I understand that the idea is that increased ATF6 degradation afforded by UGGT depletion will impair activation of this ER stress sensor, but if that is the case, how does UGGT2 depletion, which only minimally impacts ATF6 degradation (Fig. 1), impact activation to levels similar to the UGGT1 knockout (Fig 4)? This suggests UGGT1/2 may serve different functions beyond just regulating the degradation of this ER stress sensor. Also, the authors should quantify the impaired ATF6 processing shown in Fig 4B-D across multiple replicates.

      In light of this valuable comment, we re-evaluated our manuscript. As the reviewer noted, the involvement of UGGT2 in the activation of ATF6a cannot be explained by the folding state of ATF6a alone. We therefore consider the question of whether UGGT2 contributes to ATF6 activation to be outside the scope of this paper, whose main focus is the contribution of UGGT1 to the ER protein quality control mechanism.

      Ultimately, I do think the data support a role for UGGT (especially UGGT1) in regulating the degradation of ERAD substrates, which provides experimental support for a role long-predicted in the field. However, there are a number of ways this manuscript could be strengthened to further support this role, some of which can be done with data they have in hand (e.g., the stats) or additional new experiments.

      During this revision period, to further elucidate the function of UGGT, we performed several additional experiments (new Fig. 1H, Fig. 2D and E, Fig. 4G, and Figure 1-Figure Supplement 6). We hope that these bring our paper up to the level you have requested.

      Reviewer #1 (Recommendations For The Authors):

      Minor points:

      (1) Abbreviations: GlcNAc, N-acetylglucosamines -> why plural?

      Corrected.

      (2) Abstract: to this reviewer, it may not be so common to cite references in the abstract.

      We submitted this manuscript to eLife as a “Research Advances” article. The eLife instructions for “Research Advances” state: “A reference to the original eLife article should be included in the abstract, e.g. in the format ‘Previously we showed that XXXX (author, year). Here we show that YYYY.’” We follow this format.

      (3) Introduction: "as the site of biosynthesis of approximately one-third of all proteins." Probably this statement needs a citation?

      We added the reference there. You can also confirm this in “The Human Protein Atlas” website. https://www.proteinatlas.org/humanproteome/tissue/secretome

      (4) Figure 1F - the authors claimed that maturation of HA was delayed also in UGGT2 cells, but it was not at all clear to me. Rescue experiments with UGGT2 would be desired.

      We agree with the reviewer that the difference is not large, although it was statistically significant at 80 min in the UGGT2-KO strain. It was previously reported that the maturation rate of HA is not affected by UGGT2 (Hung et al., 2022). A rescue experiment with UGGT2 on the degradation of NHK was conducted and is shown in this response to referees.

      (5) Figure 4A, here also the authors claim that UGGT2 is "slightly" involved in folding of ATF6alpha(P) but it is far from convincing to this reviewer.

      We now also think that the involvement of UGGT2 in ER protein quality control should be examined in the future.

      (6) Page 11, line 7 from the bottom: "peak of activation was shifted from 1 hour to 4 hours after the treatment of Tg in UGGT-KO cells". I found this statement a bit awkward; how can the authors be sure that "the peak" is 4 hours when the longest timing tested is 4 hours (i.e. peak may be even later)?

      Corrected. We deleted the description.

      (7) Page 11, line 4 "a more rigid structure that averts degradation" Can the authors speculate what this "rigid" structure actually means? The reviewer has to wonder what kind of change can occur to this protein with or without UGGT1. Binding proteins? The difference in susceptibility against trypsin appears very subtle anyway (Figure 4 Figure Supplement 1).

      Let us add our thoughts here. Poorly structured ATF6a is immediately routed for degradation in UGGT1-KO cells. As a result, only ATF6a with a stable or rigid structure remains in the UGGT1-KO strain. ATF6a in a metastable state tends to be degraded without the assistance of UGGT1.

      (8) Figure 1 Figure supplement 2; based on the information provided, I calculate the relative ratio of UGGT2/UGGT1 in HCT116 which is 4.5%, and in HeLa 26%. Am I missing something? Also significant figure, at best, should be 2, not 3 (i.e. 30%, not 29.8%).

      Corrected. Thank you for this comment.

      Reviewer #2 (Recommendations For The Authors):

      (1) The effect in Fig. 2B with UGGT1-D1358A add-back is minimal. Testing the inactive and active add-back on other substrates, such as ATF6alpha, which undergoes a more rapid degradation, would provide a more comprehensive assessment.

      To examine the effect of full-length UGGT1 and its inactive mutant on the degradation rate of endogenous ATF6a in UGGT1-KO and UGGT2-KO cells, we screened more than 300 colonies for stable expression of full-length Myc-UGGT1/2, UGGT1/2-Flag, and UGGT1/2 (no tag), and of their point mutants. However, no cell lines expressing UGGT1/2 at or above endogenous levels were obtained. The expression level of UGGT1 appears to be tightly regulated. A low-expressing stable cell line could not rescue the ATF6a degradation phenotype.

      We also tried to measure the degradation rate of exogenously expressed ATF6a. But overexpressed ATF6a is partially transported to the Golgi and cleaved by proteases, which makes it difficult to evaluate only the effect of degradation.

      (2) In reference to this statement on pg. 11:

      "This can be explained by the rigid structure of ATF6(P) lacking structural flexibility to respond to ER stress because the remaining ATF6(P) in UGGT1-KO cells tends to have a more rigid structure that averts degradation, which is supported by its slightly weaker sensitivity to trypsin (Figure 4-figure supplement 1A). "

      The rationale for testing ATF6(P) rigidity via trypsin digestion needs clarification. The authors should provide more background, especially if it relates to previous studies demonstrating UGGT's influence on substrate solubility. If trypsin digestion is indeed addressing this, it should be applied consistently to all tested misfolded glycoproteins, ensuring a comprehensive approach.

      We now provide more background with three references about trypsin digestion. Trypsin digestion allows us to evaluate the structure of proteins originating from the same gene, but it can be difficult to use it to compare the structures of proteins originating from different genes. For example, antitrypsin is resistant to trypsin by its nature, which does not necessarily mean that antitrypsin forms a more stable structure than other proteins. NHK, a truncated version of antitrypsin, is still resistant to trypsin compared with other substrates.

      (3) Many of the figures described in the manuscript weren't referred to a specific panel. For example, pg. 12 "Fig. 1E and Fig.5," the exact panel for Fig. 5 wasn't referenced.

      Thank you for this comment. Corrected.

      (4) For experiments measuring the composition of glycoproteins in different KO lines, it is necessary to do the experiment more than once for conducting statistical analysis and comparisons. Moreover, the authors did not include raw composition data for these experiments. Statistical analysis should also be done for Fig. 4E-F.

      Our N-glycan composition data (Figure 1-Figure supplement 5 and 6C) are consistent with our previous papers (George et al., 2021; George et al., 2020; Ninagawa et al., 2015; Ninagawa et al., 2014). We performed the experiment twice in the previous study; please refer to it regarding statistical analysis (George et al., 2020). We have added the raw N-glycan composition data (Figure 1-Figure supplement 4 and 6B). Statistical analysis is now included in Fig. 4D-F.

      Ali, B.R., and M.C. Field. 2000. Glycopeptide export from mammalian microsomes is independent of calcium and is distinct from oligosaccharide export. Glycobiology. 10:383-391.

      Asano, N. 2000. Glycosidase-Inhibiting Glycomimetic Alkaloids. Biological Activities and Therapeutic Perspectives. Journal of Synthetic Organic Chemistry, Japan. 58:666-675.

      Cheatham, A.M., N.R. Sharma, and P. Satpute-Krishnan. 2023. Competition for calnexin binding regulates secretion and turnover of misfolded GPI-anchored proteins. J Cell Biol. 222.

      George, G., S. Ninagawa, H. Yagi, J.I. Furukawa, N. Hashii, A. Ishii-Watabe, Y. Deng, K. Matsushita, T. Ishikawa, Y.P. Mamahit, Y. Maki, Y. Kajihara, K. Kato, T. Okada, and K. Mori. 2021. Purified EDEM3 or EDEM1 alone produces determinant oligosaccharide structures from M8B in mammalian glycoprotein ERAD. Elife. 10.

      George, G., S. Ninagawa, H. Yagi, T. Saito, T. Ishikawa, T. Sakuma, T. Yamamoto, K. Imami, Y. Ishihama, K. Kato, T. Okada, and K. Mori. 2020. EDEM2 stably disulfide-bonded to TXNDC11 catalyzes the first mannose trimming step in mammalian glycoprotein ERAD. Elife. 9:e53455.

      Gross, V., T. Andus, T.A. Tran-Thi, R.T. Schwarz, K. Decker, and P.C. Heinrich. 1983. 1-deoxynojirimycin impairs oligosaccharide processing of alpha 1-proteinase inhibitor and inhibits its secretion in primary cultures of rat hepatocytes. Journal of Biological Chemistry. 258:12203-12209.

      Gross, V., T.A. Tran-Thi, R.T. Schwarz, A.D. Elbein, K. Decker, and P.C. Heinrich. 1986. Different effects of the glucosidase inhibitors 1-deoxynojirimycin, N-methyl-1-deoxynojirimycin and castanospermine on the glycosylation of rat alpha 1-proteinase inhibitor and alpha 1-acid glycoprotein. Biochem J. 236:853-860.

      Hirao, K., Y. Natsuka, T. Tamura, I. Wada, D. Morito, S. Natsuka, P. Romero, B. Sleno, L.O. Tremblay, A. Herscovics, K. Nagata, and N. Hosokawa. 2006. EDEM3, a soluble EDEM homolog, enhances glycoprotein endoplasmic reticulum-associated degradation and mannose trimming. J Biol Chem. 281:9650-9658.

      Hosokawa, N., I. Wada, K. Hasegawa, T. Yorihuzi, L.O. Tremblay, A. Herscovics, and K. Nagata. 2001. A novel ER alpha-mannosidase-like protein accelerates ER-associated degradation. EMBO reports. 2:415-422.

      Hung, H.H., Y. Nagatsuka, T. Solda, V.K. Kodali, K. Iwabuchi, H. Kamiguchi, K. Kano, I. Matsuo, K. Ikeda, R.J. Kaufman, M. Molinari, P. Greimel, and Y. Hirabayashi. 2022. Selective involvement of UGGT variant: UGGT2 in protecting mouse embryonic fibroblasts from saturated lipid-induced ER stress. Proc Natl Acad Sci U S A. 119:e2214957119.

      Karlsson, G.B., T.D. Butters, R.A. Dwek, and F.M. Platt. 1993. Effects of the imino sugar N-butyldeoxynojirimycin on the N-glycosylation of recombinant gp120. Journal of Biological Chemistry. 268:570-576.

      Lomako, J., W.M. Lomako, C.A. Carothers Carraway, and K.L. Carraway. 2010. Regulation of the membrane mucin Muc4 in corneal epithelial cells by proteosomal degradation and TGF-beta. Journal of cellular physiology. 223:209-214.

      Mast, S.W., K. Diekman, K. Karaveg, A. Davis, R.N. Sifers, and K.W. Moremen. 2005. Human EDEM2, a novel homolog of family 47 glycosidases, is involved in ER-associated degradation of glycoproteins. Glycobiology. 15:421-436.

      Ninagawa, S., T. Okada, Y. Sumitomo, S. Horimoto, T. Sugimoto, T. Ishikawa, S. Takeda, T. Yamamoto, T. Suzuki, Y. Kamiya, K. Kato, and K. Mori. 2015. Forcible destruction of severely misfolded mammalian glycoproteins by the non-glycoprotein ERAD pathway. J Cell Biol. 211:775-784.

      Ninagawa, S., T. Okada, Y. Sumitomo, Y. Kamiya, K. Kato, S. Horimoto, T. Ishikawa, S. Takeda, T. Sakuma, T. Yamamoto, and K. Mori. 2014. EDEM2 initiates mammalian glycoprotein ERAD by catalyzing the first mannose trimming step. J Cell Biol. 206:347-356.

      Olivari, S., C. Galli, H. Alanen, L. Ruddock, and M. Molinari. 2005. A novel stress-induced EDEM variant regulating endoplasmic reticulum-associated glycoprotein degradation. J Biol Chem. 280:2424-2428.

      Pearse, B.R., T. Tamura, J.C. Sunryd, G.A. Grabowski, R.J. Kaufman, and D.N. Hebert. 2010. The role of UDP-Glc:glycoprotein glucosyltransferase 1 in the maturation of an obligate substrate prosaposin. J Cell Biol. 189:829-841.

      Romero, P.A., B. Saunier, and A. Herscovics. 1985. Comparison between 1-deoxynojirimycin and N-methyl-1-deoxynojirimycin as inhibitors of oligosaccharide processing in intestinal epithelial cells. Biochem J. 226:733-740.

      Saunier, B., R.D. Kilker, J.S. Tkacz, A. Quaroni, and A. Herscovics. 1982. Inhibition of N-linked complex oligosaccharide formation by 1-deoxynojirimycin, an inhibitor of processing glucosidases. Journal of Biological Chemistry. 257:14155-14161.

      Szumilo, T., G.P. Kaushal, and A.D. Elbein. 1987. Purification and properties of the glycoprotein processing N-acetylglucosaminyltransferase II from plants. Biochemistry. 26:5498-5505.

      Tannous, A., N. Patel, T. Tamura, and D.N. Hebert. 2015. Reglucosylation by UDP-glucose:glycoprotein glucosyltransferase 1 delays glycoprotein secretion but not degradation. Molecular biology of the cell. 26:390-405.

      Zeng, Y., Y.T. Pan, N. Asano, R.J. Nash, and A.D. Elbein. 1997. Homonojirimycin and N-methyl-homonojirimycin inhibit N-linked oligosaccharide processing. Glycobiology. 7:297-304.

    1. Author response:

      The following is the authors’ response to the original reviews.

      We would like to thank you and the two Reviewers for the thoughtful evaluation of the manuscript and the support for publication. We have addressed all points raised by the two Reviewers.

      - We have extensively streamlined the manuscript. Repetitive passages regarding the respective kinase cascades have been removed.

      - We improved the presentation of the main Figures (mainly labeling and font size):

      - Figure 1: C, D, E, F

      - Figure 2: C, E, F, G, I

      - Figure 3: D

      - Figure 4: F

      - Figure 5: A, B, C, D, E

      - We integrated new SI-data related to kinase functions, expression and the ‘cell-type comparisons’ of the KinCon reporter system (Figure Supplement 4, 5).

      Below you will find a detailed point-by-point response.

      Reviewer #1 (Recommendations For The Authors):

      Regarding the issue of the use of the word "dynamics," as described in the public review, here are a few examples of ambiguous use in different sentences:

      - Line 27: dynamics of full-length protein kinases. Is this referring to the dynamics of conformational interconversion between inactive and active states?

      - Line 138: dynamic functioning of kinases. It is not clear what this means.

      - Line 276: ... alters KinCon dynamics. Not clear if they are measuring time-dependent process or a single point.

      - Figure legend 4F: dynamics of CDK4/6 reporters. Again, not clear how the assay is measuring dynamics.

      In my opinion, the authors use proper terminology that describes their assay in which the term dynamics is not used: Title: "... impact of protein and small molecule interactions on kinase conformations" and Line 89 "... reporter can be used to track conformational changes of kinases...".

      We have replaced the “dynamics” sections. 

      - Line 27: The understanding of the structural dynamics of…

      - Line 91: This reporter can be used to track dynamic changes of kinases conformations…

      - Line 139: Conventional methods often fall short in capturing the dynamics of kinases within their native cellular environments…

      - Line 146: Such insights into the molecular structure dynamics of kinases in intact cells…

      - Line 199: In order to enhance our understanding of kinase structure dynamics…

      - Line 276: These findings underline that indeed the trimeric complex formation alters….

      - Figure Legend 4F: Quantification of alterations of CDK4/6 KinCon reporter bioluminescence signals…

      The authors state that KinCon has predictive capabilities (abstract and line 142). What do  the authors mean by this?

      Previously we have benchmarked the suitability of the KinCon reporter for target engagement assays of wt and mutated kinase activities. With this we determined specificities of melanoma drugs for mutated BRAF variants (Mayrhofer 2020, PNAS). 

      The authors indicate that KinCon is a highly sensitive assay. Can the authors elaborate on what high sensitivity means?  

      By sensitivity we mean that we can detect conformation dynamics of the reporter at low expression levels of the hybrid protein expressed in the cell line of choice.

      - Line 209: Immunoblotting of cell lysates following luminescence measurements showed expression levels of the reporters in the range and below the endogenous expressed kinases (Figure 1E).  …

      - Line 219:   Using this readout, we showed that at expression levels of the BRAF KinCon reporter below the immunoblotting detection limit, one hour of drug exposure exclusively converted BRAF-V600E to the more closed conformation (Figure 1F, G, Figure Supplement 1B). 

      - Line 221: These data underline that at expression levels far below the endogenous kinase, protein activity conformations can be tracked in intact cells. …

      For example, can they discuss how other fluorescence-based approaches that are less sensitive would not be able to accomplish the same type of results or derive similar conclusions? Can they provide a resolution metric both in space and time? Given that the authors state that this is a technical report, this information is of relevance.

      We highlight the key pros and cons of the KinCon reporter technology in the following sections:

      - Line 529: The KinCon technology, introduced here, seeks to address the previously mentioned challenges. It has the potential to become a valuable asset for tracking kinase functions in living cells which are hard to measure solely via phosphotransferase activities. Overall, it offers an innovative solution for understanding kinase activity conformations, which could pave the way for more novel intervention strategies for kinase entities with limited pharmaceutical targeting potential. So far, this relates to the tracking of kinase-scaffold and pseudo-kinase functions.

      - Line 535: A key advantage of the KinCon reporter technology is the robustness of the system to track kinase conformations at varying expression levels. However, in contrast to fluorescence-based reporter read-outs, subcellular analysis and cell sorting are still challenging due to comparably low levels of light emission.

      The authors nicely describe how KinCon works in Figure 1B and part of 1C. I do think that the bottom of panel 1C needs to be revised, as well as the text describing the potential scenarios of potency, efficacy, and synergism.

      One issue with this part of Figure 1C is that it is not clear what the x-axis in the 3 plots refers to. Is this time? Is this concentration of a small molecule, inhibitor, or binding partner? This was confusing also in the context of the term dynamics used throughout the text. The terms potency, efficacy, and synergism should be subtitles, or the panels and the x-axis should be better defined, especially for a non-specialized reader.

      Related to this part of Figure 1C is the text. The authors mention potency, effectiveness, and synergy (Line 195). Can the authors use more fundamental terminology related to these three scenarios, for example, changes in activation constant, and percent of protein activates? Also, why synergy is only related to effectiveness? Can synergy also be associated with potency?

      Thank you for bringing this up, we have revised Figure 1C to better reflect the mentioned effects of potency. To avoid confusion, we removed the illustration for drug synergism. Accordingly, we have integrated the axis descriptions for the presented dose-response curves.   

      Thus, we have further streamlined the text in the introduction – examples are shown below:

      - Line 195: Light recordings and subsequent calculations of time-dependent dosage variations of bioluminescence signatures of parallel implemented KinCon configurations aid in establishing dose-response curves. These curves are used for discerning pharmacological characteristics such as drug potency, effectiveness of drug candidates, and potential drug synergies (Figure 1C)

      - Figure 1C:  Shown is the workflow for the KinCon reporter construct engineering and analyses using KinCon technology. The kinase gene of interest is inserted into the multiple cloning site of a mammalian expression vector which is flanked by respective PCA fragments (-F[1], -F[2]) and separated with interjacent flexible linkers. Expression of the genetically encoded reporter in indicated multi-well formats allows to vary expression levels and define a coherent drug treatment plan. Moreover, it is possible to alter the kinase sequence (mutations) or to co-express or knock-down the respective endogenous kinase, interlinked kinases or proteinogenic regulators of the respective pathway. After systematic administration of pathway modulating drugs or drug candidates, analyses of KinCon structure dynamics may reveal alterations in potency, efficacy, and potential synergistic effects of the tested bioactive small molecules (schematic dose response curves are depicted)
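      The distinction between potency and efficacy in schematic dose-response curves can be sketched with a Hill-type equation. This is a minimal illustration under stated assumptions, not the analysis used in the manuscript: the function name `hill_response`, the parameters `emax`, `ec50`, and `hill`, and all numerical values are hypothetical. A change in potency shifts the half-maximal concentration (`ec50`), whereas a change in efficacy alters the maximal response (`emax`).

```python
def hill_response(concentration, emax, ec50, hill=1.0, baseline=0.0):
    """Sigmoidal dose-response: baseline + (emax - baseline) * c^h / (ec50^h + c^h)."""
    if concentration < 0 or ec50 <= 0:
        raise ValueError("concentration must be >= 0 and ec50 must be > 0")
    c_h = concentration ** hill
    return baseline + (emax - baseline) * c_h / (ec50 ** hill + c_h)


# Hypothetical compounds; every parameter value is made up for illustration.
compounds = {
    "reference": {"emax": 1.0, "ec50": 1.0},
    "more potent": {"emax": 1.0, "ec50": 0.1},      # same maximum, lower EC50
    "less effective": {"emax": 0.5, "ec50": 1.0},   # same EC50, lower maximum
}

doses = [0.01, 0.1, 1.0, 10.0, 100.0]  # arbitrary concentration units

for name, params in compounds.items():
    curve = [round(hill_response(dose, **params), 3) for dose in doses]
    print(name, curve)
```

      In this sketch the "more potent" compound reaches half of the same maximal response at a tenfold lower dose, while the "less effective" compound reaches its half-maximal response at the same dose as the reference but plateaus at a lower level.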

      Lastly, the use of these three cartoons gives the impression that the experimental results to come will follow a similar representation. Instead, the results are presented in bar plots for many different conditions. I think this will lead to confusion for a broad audience.

      The bottom panel of Figure 1C does not depict real experiments but rather illustrates fitted dose-response curves. For experimentally derived dose-response curves, we refer to our previous work using BRAF KinCon data and ERK phosphorylation (Röck 2019, Sci. Advances).

      We further agree with the reviewer and have therefore added a new part in the methods section addressing the evaluation of data extensively. 

      - Line 668: In Figure 1 E and F, a representative experiment of n=4 independent experiments is shown. In these cases, absolute bioluminescence values without any normalization are shown. Otherwise, data was indicated as RLU (relative light unit) fold change. This means the data was normalized on the indicated control condition (either with normalization of the western blot or without; as indicated).

      For a non-expert reader, can the authors clarify the use of tracking basal conformations vs. transient over-expression of the various KinCon constructs? Moreover, the authors use the term transient over-expression for 10, 16, 24, and 48 h (Line 203). This, to a non-expert reader, does not seem transient.

      We have revised the manuscript to clarify it:

      - Line 207: We showed that transient over-expression of these KinCon reporters for a time frame of 10h, 16h, 24h or 48h in HEK293T cells delivers consistently increasing signals for all KinCon reporters (Figure 1E, Figure Supplement 1A). 

      - Figure 1E) Representative KinCon experiments of time-dependent expressions of indicated KinCon reporter constructs in HEK293T cells are shown (mean ±SEM). Indicated KinCon reporters were transiently over-expressed in 24-well format in HEK293T cells for 10h, 16h, 24h and 48h each.

      Regarding Figure 1E and similar graphical representations: Why is the signal (RLU) nonlinear with time? If the fluorescence of the KinCon construct is linearly related to its expression or concentration inside the cell, one would expect a linear increase. Have the authors plotted RLU/Expression band intensity to account for changes in protein concentration? For instance, some of the results within Figure 3 are normalized to concentration on reporter expression level.

      Our intention was to show that varying expression levels can be used for the illustrated target engagement assays. Indeed, the observed increases in RLU might be due to factors such as:

      - Doubling times of cells

      - Cell density

      - Media composition (which changes over time)

      - Reporter protein stabilities

      - Abundance of interactors of kinases

      For the results with LKB1, the authors claim that intermediate fold change in fluorescence (Figure 2E) is due to a partially closed intermediate state (Line 262). Can the authors discard the possibility by which there is a change in populations of active and inactive that on average give intermediate values?

      Based on our experience with KinCon reporter conformation states of kinases we tested so far, we assume that the presented data reflects an intermediate state. We agree that it needs further validation. We have changed the text accordingly:

      - Line 264: Upon interaction with LKB1 this conformation shifts to a partially closed intermediate state.
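      The reviewer's alternative explanation can be made concrete with a simple two-state average. The following is a minimal sketch under stated assumptions: the function names and signal values are hypothetical, and a bulk readout is assumed to be the population-weighted mean of the signals of the open and closed states. Under that assumption, a mixed population and a uniform intermediate state yield the same averaged value, which is why further validation is needed.

```python
def ensemble_signal(fraction_closed, signal_open, signal_closed):
    """Population-averaged signal of a two-state (open/closed) mixture."""
    if not 0.0 <= fraction_closed <= 1.0:
        raise ValueError("fraction_closed must lie between 0 and 1")
    return fraction_closed * signal_closed + (1.0 - fraction_closed) * signal_open


def implied_fraction_closed(observed, signal_open, signal_closed):
    """Fraction of closed molecules that would explain an observed bulk signal."""
    if signal_open == signal_closed:
        raise ValueError("open and closed signals must differ")
    return (signal_open - observed) / (signal_open - signal_closed)


# Hypothetical fold-change signals for the fully open and fully closed states.
SIGNAL_OPEN = 1.0
SIGNAL_CLOSED = 2.0

# A 50/50 mixture of open and closed reporters ...
mixture = ensemble_signal(0.5, SIGNAL_OPEN, SIGNAL_CLOSED)

# ... gives the same bulk value as a uniform population whose single
# intermediate conformation emits this (hypothetical) signal.
uniform_intermediate = 1.5

print(mixture, uniform_intermediate)
print(implied_fraction_closed(mixture, SIGNAL_OPEN, SIGNAL_CLOSED))
```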

      The authors claim in Line 274 that mutations located at the interface of the LKB1/STRADalpha complex affect interactions and hypothesize that allosteric communication between LKB1 and STRADalpha is essential for function. Given that these mutations are at the interaction interface, why would the authors postulate an allosteric mechanism that evokes an effect distant from the interaction/active site? Could it be that function requires surface contacts alone that are disrupted by the mutations?

      We agree with the reviewer and changed our argumentation for this point:

      - Line 276: These findings underline that indeed the trimeric complex formation alters the opening and closing of the tested full-length kinase structures using the applied KinCon reporter read out

      I was unable to find text to explain the following: Figure 2I shows the mutation R74A as n.s., but in the text, only W308C is mentioned to not change fluorescence. Could the authors clarify why R74A is not discussed in the text?  Maybe this reviewer missed the text in which it was discussed.

      We adapted the manuscript and included the R74A mutation as follows:

      - Line 296: Among these mutations, only the W308C and R74A mutation prevented significant closing of the LKB1 conformation when co-expressed with STRAD𝛼 and MO25 (Figure 2I).

      In Figure 2I, the individual measurements of the LKB1-R74A KinCon are highlighted in red to better emphasize the deviations. In the case of the R74A mutation, the lack of significance might be due to the high deviation between the experiments. These deviations are much higher than those of either the wt or the W308C mutant, and can also be seen in the LKB1-R74A-KinCon only condition (white). Even though no significant closing of the LKB1 conformation could be observed for R74A, the trend towards conformation closing upon complex formation is still visible, so we believe that the effect is still present. Further replicates would be necessary to validate this interpretation.

      Similarly, the authors state in line 326 that the study included an analysis of RIPK2. However, I was unable to find results, graphs, or additional text discussing RIPK2.

      The RIPK2 conformation was analyzed in Figure 3C (page 12).

      Some figures of RLU use absolute values, percentages, and fold change. Is there a reason why the authors use different Y-axis values? These should be explained and justified in Methods. Similarly, bars for wt in Figures 3D, G, or 4D, E, F show no errors. How are the authors normalizing the data and repeats so that there is no error, and are they treating the rest of the data (i.e., mutants and/or treated with small molecules) in the same way?

      We have changed the Y-axis values. Throughout the manuscript we now show RLU fold change. The exceptions are selected experiments in which only absolute RLU values are shown (such as Figure 1E, F). We have also integrated a paragraph into the methods section (Line 655). Figure 3D was changed as well.

      - Line 668: In Figure 1 E and F, a representative experiment of n=4 independent experiments is shown.  In these cases absolute bioluminescence values without any normalisation are shown.  Otherwise, data was indicated as RLU fold change. This means the data was normalized on the indicated control condition (either with normalization of the western blot or without; as indicated).

      The data is generally normalized on wt or untreated conditions, when the cells were treated with small molecules for target engagement assays. 
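      A minimal sketch of fold-change normalization, assuming that each reading is divided by the control (here "wt") reading of the same independent experiment. The function names and all RLU values are hypothetical and are not taken from the manuscript. Under this assumption the control equals exactly 1 in every replicate, which is one way a control bar can end up without an error bar.

```python
from math import sqrt
from statistics import mean, stdev


def fold_change_per_experiment(experiments, control="wt"):
    """Divide every reading by the control reading of the same experiment."""
    normalized = []
    for readings in experiments:
        reference = readings[control]
        if reference == 0:
            raise ValueError("control reading must be non-zero")
        normalized.append({name: rlu / reference for name, rlu in readings.items()})
    return normalized


def mean_and_sem(values):
    """Return the mean and the standard error of the mean (needs n >= 2)."""
    return mean(values), stdev(values) / sqrt(len(values))


# Hypothetical raw RLU readings from three independent experiments.
experiments = [
    {"wt": 1000.0, "mutant": 500.0},
    {"wt": 2000.0, "mutant": 1200.0},
    {"wt": 800.0, "mutant": 320.0},
]

normalized = fold_change_per_experiment(experiments)
wt_mean, wt_sem = mean_and_sem([e["wt"] for e in normalized])
mutant_mean, mutant_sem = mean_and_sem([e["mutant"] for e in normalized])

print(f"wt: {wt_mean:.2f} +/- {wt_sem:.2f}")
print(f"mutant: {mutant_mean:.2f} +/- {mutant_sem:.2f}")
```

      Normalizing instead to the mean of the control across experiments would preserve the spread of the control readings, so the choice of reference should be stated alongside the plotted values.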

      Lastly, the section starting in Line 472 reads more like a discussion of results from different types of inhibitors used in this study that results on its own. The authors should consider a new subtitle such as results or make this section a discussion.

      We agree with the reviewer and have split this part of the results into a new subsection:

      - Line 455: “Effect of different kinase inhibitor types on the KinCon reporter system”.

      Reviewer #2 (Recommendations For The Authors):

      I have a few suggestions, since the paper is a distillation of a vast amount of work and tells a useful story.

      (1) The work is very solid, uses examples from the literature, and also extends into new experimental space. An obvious weakness is mentioned by the authors for the CDK data, in that measurements with Cyclin D (the activating subunit) are not characterized, although Cyclin D might be assumed to be present.

      We performed experiments with the CDK4/6 KinCon reporters and co-expressed CyclinD with a ratio of 1:3 (HEK293T cells, expression for 48h). However, in the context of inhibitor treatments we could not track conformation changes in these initial experiments. The cells were treated with the indicated CDK4/6i [1µM] for 3h. This seems to not impact the conformation of CDK4/6 wt or mutated KinCon reporters. There is a tendency that CyclinD co-expression promotes CDK4/6 conformation opening (data not shown).

      Author response image 1.

      Bioluminescence signal of CDK4/6 KinCon reporters with co-expressed CyclinD3 (HEK293T, expression for 48h) upon exposure to indicated CDK4/6i [1µM] or DMSO for 3h (mean ±SEM, n=3 ind. experiments). No significant changes using the current setting.

      (2) The work with the trimeric LKB1 complex involves pseudokinase, STRADalpha, whose conformation is also examined as a function of LKB1 status; since STRAD is an activator of LKB1. A future goal should be the evaluation of the complex in the presence of STRAD inhibitory/activating small molecules.

      Thank you for this great idea; we are currently preparing an FWF grant application to get support for such an R&D project.

      Minor points

      • Have any of the data been repeated in a different cell background? This came to mind because HeLa cells lack LKB1, which might be a useful place to test the LKB1 data in a different context.

      This experiment was performed and we show it in Figure Supplement 5. Further, we followed the advice of the reviewer and performed suggested experiments. We integrated the colon cancer cell line SW480 into the experimental setup. Overall, three cell settings showed the same pattern of KinCon reporter analyses for LKB1-STRADα-MO25 complex formation utilizing the LKB1- and STRADα-KinCon reporters.  

      • The study picks up the PKA Cushings Syndrome field, which makes sense, and data are presented for L206R. PMID 35830806 explains how different patient mutations drive different signaling outcomes through distinct complex formations, and it would be interesting to discuss how mutations in KinCon complexes, especially those with mutations, could affect sub-cellular localization. Could the authors explain if this was done for any of the proteins, whose low experimental expression is a clear advantage, but is presumably hard to maintain across experiments?

      The reviewer's feedback motivated us to perform subcellular fractionation experiments. They were performed with PKAc wt and L206R KinCon reporters as well as BRAF wt and V600E reporters. We did not see major differences between the wt and mutated reporter constructs with respect to their nucleus:cytoplasm localization (Figure Supplement 4). For your information, in an R&D project with the mitochondrial kinase PINK1 we see localization of the reporter, as expected, almost exclusively in the mitochondrial fraction.

      - Line 495: In this context of activating kinase mutations, using PKAc (wt and L206R) and BRAF (wt and V600E) reporters as examples, we could not track alterations of cytoplasmic and nuclear localization (Figure Supplement 4). Furthermore, subcellular localization of PKAc KinCon reporters did not change when the L206R mutation was introduced (Figure Supplement 4). As a control, BRAF wt and V600E KinCon reporters were used, and again no changes in localization were observed.

      • I suggest changing PMs (Figure 2 and others) simply to mutation, I read this as plasma membrane constantly.

      We agree and we have changed it to “patient mutation” in Figure 2C, Figure 3E, Figure 4B.

    1. Author Response

      The following is the authors’ response to the current reviews.

      Public Reviews:

      Reviewer #2 (Public Review):

      Summary:

      This paper tests the idea that schooling can provide an energetic advantage over solitary swimming. The present study measures oxygen consumption over a wide range of speeds, to determine the differences in aerobic and anaerobic cost of swimming, providing a potentially valuable addition to the literature related to the advantages of group living.

      Response: Thank you for the positive comments.

      Strengths:

      The strength of this paper is related to providing direct measurements of the energetics (oxygen consumption) of fish while swimming in a group vs solitary. The energetic advantages of schooling have been claimed to be one of the major advantages of schooling and therefore a direct energetic assessment is a useful result.

      Response: Thank you for the positive comments.

      Weaknesses:

      1) Regarding the fish to water volume ratio, the arguments raised by the authors are valid. However, the ratio used is still quite high (as high as >2000 in solitary fish), much higher than that recommended by Svendsen et al (2006). Hence this point needs to be discussed in the ms (summarising the points raised in the authors' response)

      Response: Thank you for the comments. We have addressed this point in the previous comments. In short, our ratio is within the range of the published literature. We conducted the additional signal-to-noise analysis for quality assurance.

      2) Wall effects: Fish in a school may have been swimming closer to the wall. The fact that the convex hull volume of the fish school did not change as speed increased is not a demonstration that fish were not closer to the wall, nor is it a demonstration that wall effects were not present. Therefore the issue of potential wall effects is a weakness of this paper.

      Response: Thank you for the comments. We have addressed this point in the previous comments. We provided many other considerations in addition to the convex hull volume. In particular, our boundary layer is < 2.5 mm thick, which is narrower than the ~10 mm body width of the giant danio.

      3) The authors stated "Because we took high-speed videos simultaneously with the respirometry measurements, we can state unequivocally that individual fish within the school did not swim closer to the walls than solitary fish over the testing period". This is however not quantified.

      Response: Thank you for the comments. We have addressed this point in the previous comments. We want to note that the statement in the response letter was intended to elaborate on the discussion points and is not presented as data in the manuscript. The bottom line is that very few studies have used PIV to quantify the thickness of the boundary layer, as we did in our experiment.

      4) Statistical analysis. The authors have dealt satisfactorily with most of the comments.

      However:

      (a) the following comment has not been dealt with directly in the ms "One can see from the graphs that schooling MO2 tends to have a smaller SD than solitary data. This may well be due to the fact that schooling data are based on 5 points (five schools) and each point is the result of the MO2 of five fish, thereby reducing the variability compared to solitary fish."

      (b) Different sizes were used for solitary and schooling fishes. The authors justify using larger fish as solitary to provide a better ratio of respirometer volume to fish volume in the tests on individual fish. However, mass scaling for tail beat frequency was not provided. Although (1) this is because of lack of data for this species and (2) using scaling exponent of distant species would introduce errors of unknown magnitude, this is still a weakness of the paper that needs to be acknowledged here and in the ms.

      Response: Thank you for the comments. We have addressed both points in the previous comments and provided comprehensive discussions. We also stated the caveats in the method section of the manuscript.
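      The reviewer's point (a) rests on a general statistical property: a value that is the mean of several independent individuals varies less than a single individual does. A minimal sketch of that effect follows, assuming independent, normally distributed individual values. The function names, the group size of five, and all numbers are illustrative assumptions and are not taken from the study's data.

```python
import math
import random
import statistics


def sd_of_group_mean(individual_sd: float, group_size: int) -> float:
    """Expected SD of the mean of `group_size` independent individuals."""
    return individual_sd / math.sqrt(group_size)


def simulate_sd_of_group_means(individual_sd: float = 1.0,
                               group_size: int = 5,
                               n_groups: int = 20000,
                               seed: int = 0) -> float:
    """Monte Carlo estimate of the SD of per-group means.

    Each hypothetical individual value is drawn from a normal
    distribution with an arbitrary mean of 10 and the given SD.
    """
    rng = random.Random(seed)
    group_means = []
    for _ in range(n_groups):
        values = [rng.gauss(10.0, individual_sd) for _ in range(group_size)]
        group_means.append(statistics.fmean(values))
    return statistics.stdev(group_means)


if __name__ == "__main__":
    print("analytic SD of a 5-fish mean:", sd_of_group_mean(1.0, 5))
    print("simulated SD of a 5-fish mean:", simulate_sd_of_group_means())
```

      Under these assumptions the spread of a five-fish mean is smaller than the individual spread by a factor of √5. Real fish within a school are unlikely to be fully independent, so the actual reduction may differ.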

      Reviewer #3 (Public Review):

      Zhang and Lauder characterized both aerobic and anaerobic metabolic energy contributions in schools and solitary fishes in the Giant danio (Devario aequipinnatus) over a wide range of water velocities. By using a highly sophisticated respirometer system, the authors measure the aerobic metabolisms by oxygen uptake rate and the non-aerobic oxygen cost as excess post-exercise oxygen consumption (EPOC). With these data, the authors model the bioenergetic cost of schools and solitary fishes. The authors found that fish schools have a J-shaped metabolism-speed curve, with reduced total energy expenditure per tail beat compared to solitary fish. Fish in schools also recovered from exercise faster than solitary fish. Finally, the authors conclude that these energetic savings may underlie the prevalence of coordinated group locomotion in fish.

      The conclusions of this paper are mostly well supported by data.

      Response: Thank you for the positive comments.

      Recommendations for the authors:

      Reviewer #3 (Recommendations For The Authors):

      I have read carefully the revised version of the manuscript and would like to thank the authors for addressing all my comments/suggestions.

      I have no additional comments/suggestions. Now, I strongly believe that this manuscript deserves to be published in eLife.

      Response: Thank you for the positive comments.


      The following is the authors’ response to the original reviews.

      General responses

      Many thanks to the reviewers and editors for their very helpful comments on our manuscript. Below we respond (in blue text) to each of the reviewer comments, both the public ones and the more detailed individual comments in the second part of each review. In some cases, we consider these together where the same point is made in both sets of comments. We have made several changes to the manuscript in response to reviewer suggestions, and we respond in detail to the comments of reviewer #2 who feels that we have overstated the significance of our manuscript and suggests several relevant literature references. We prepared a table summarizing these references and why they differ substantially from the approach taken in our paper here.

      Overall, we would like to emphasize to both reviewers and readers of this response document that previous studies of fish schooling dynamics (or collective movement of vertebrates in general, see Commentary Zhang & Lauder 2023 J. Exp. Biol., doi:10.1242/jeb.245617) have not considered a wide speed range and thus the importance of measuring EPOC (excess post-exercise oxygen consumption) as a key component of energy use. Quantifying both aerobic and non-aerobic energy use allows us to calculate the total energy expenditure (TEE) which we show differs substantially and, importantly, non-linearly with speed between schools and measurements on solitary individuals. Comparison between school total energy use and individual total energy use are critical to understanding the dynamics of schooling behaviour in fishes.

      The scope of this study is the energetics of fish schools. By quantifying the TEE over a wide range of swimming speeds, we also show that the energetic performance curve is concave upward, and not linear, and how schooling behaviour modifies this non-linear relationship.

      In addition, one key implication of our results is that kinematic measurements of fish in schools (such as tail beat frequency) are not a reliable metric by which to estimate energy use. Since we recorded high-speed video simultaneously with energetic measurements, we are able to show that substantial energy savings occur by fish in schools with little to no change in tail beat frequency, and we discuss in the manuscript the various fluid dynamic mechanisms that allow this. Indeed, studies of bird flight show that when flying in a (presumed) energy-saving V-formation, wing beat frequency can actually increase compared to flying alone. We believe that this is a particularly important part of our findings: understanding energy use by fish schools must involve actual measurements of energy use and not indirect and sometimes unreliable kinematic measurements such as tail beat frequency or amplitude.

      Reviewer #1 (Public Review):

      Summary:

      In the presented manuscript the authors aim at quantifying the costs of locomotion in schooling versus solitary fish across a considerable range of speeds. Specifically, they quantify the possible reduction in the cost of locomotion in fish due to schooling behavior. The main novelty appears to be the direct measurement of absolute swimming costs and total energy expenditure, including the anaerobic costs at higher swimming speeds.

      In addition to metabolic parameters, the authors also recorded some basic kinematic parameters such as average distances or school elongation. They find both for solitary and schooling fish, similar optimal swimming speeds of around 1BL/s, and a significant reduction in costs of locomotion due to schooling at high speeds, in particular at ~5-8 BL/s.

      Given the lack of experimental data and the direct measurements across a wide range of speeds comparing solitary and schooling fish, this appears indeed like a potentially important contribution of interest to a broader audience beyond the specific field of fish physiology, in particular for researchers working broadly on collective (fish) behavior.

      Response: Thank you for seeing the potential implications of this study. We also believe that this paper has broader implications for collective behaviour in general, and outline some of our thinking on this topic in a recent Commentary article in the Journal of Experimental Biology: (Zhang & Lauder 2023 doi:10.1242/jeb.245617). Understanding the energetics of collective behaviours in the water, land, and air is a topic that has not received much attention despite the widespread view that moving as a collective saves energy.

      Strengths:

      The manuscript is for the most part well written, and the figures are of good quality. The experimental method and protocols are very thorough and of high quality. The results are quite compelling and interesting. What is particularly interesting, in light of previous literature on the topic, is that the authors conclude that based on their results, specific fixed relative positions or kinematic features (tail beat phase locking) do not seem to be required for energetic savings. They also provide a review of potential different mechanisms that could play a role in the energetic savings.

      Response: Thank you for seeing the nuances we bring to the existing literature and for commenting on the quality of the experimental method and protocols. Despite a relatively large literature on fish schooling based on previous biomechanical research, our studies suggest that direct measurement of energetic cost clearly demonstrates the energy savings that result from the sum of different fluid dynamic mechanisms depending on where fish are, and also emphasizes that simple metrics like fish tail beat frequency do not adequately reflect energy savings during collective motion.

      Weaknesses:

      A weakness is the actual lack of critical discussion of the different mechanisms as well as the discussion on the conjecture that relative positions and kinematic features do not matter. I found the overall discussion on this rather unsatisfactory, lacking some critical reflections as well as different relevant statements or explanations being scattered across the discussion section. Here I would suggest a revision of the discussion section.

      Response: The critical discussion of the different possible energy-saving mechanisms is indeed an important topic. We provided a discussion about the overall mechanism of ‘local interactions’ in the first paragraph of “Schooling Dynamics and energy conservation”. To clarify, our aim with Figure 1 is to introduce the current mechanisms proposed in the existing engineering/hydrodynamic literature that have studied a number of possible configurations both experimentally and computationally. Thank you for the suggestion of better organizing the discussion to critically highlight different mechanisms that would enable a dynamic schooling structure to still save energy and why the appendage movement frequency does not necessarily couple with the metabolic energy expenditure. Much of this literature uses computational fluid dynamic models or experiments on flapping foils as representative of fish. This exact issue is of great interest to us, and we are currently engaged in a number of other experiments that we hope will shed light on how fish moving in specific formations do or don’t save energy.

      Our aim in presenting Figure 1 at the start of the paper was to show that there are several ways that fish could save energy when moving in a group as shown by engineering analyses, but before investigating these various mechanisms in detail we first have to show that fish moving in groups actually do save energy with direct metabolic measurements. Hence, our paper treats the various mechanisms as inspiration to determine experimentally if, in fact, fish in schools save energy, and if so how much over a wide speed range. Our focus is to experimentally determine the performance curve that shows energy use as speed increases, for schools compared to individuals. Therefore, we have elected not to go into detail about these different hydrodynamic mechanisms in this paper, but rather to present them as a summary of current engineering literature views and then proceed to document energy savings (as stated in the second last paragraph of Introduction). We have a Commentary paper in the Journal of Experimental Biology that addresses this issue generally, and we are reluctant to duplicate much of that discussion here (Zhang & Lauder 2023 doi:10.1242/jeb.245617). We are working hard on this general issue as we agree that it is very interesting. We have revised the Introduction (second last paragraph of Introduction) and Discussion (first paragraph of Discussion) to better indicate our approach, but we have not added any significant discussion of the different hydrodynamic energy saving proposals as we believe that it is outside the scope of this first paper and more suitable as part of follow-up studies.

      Also, there is a statement that Danio regularly move within the school and do not maintain inter-individual positions. However, there is no quantitative data shown supporting this statement, quantifying the time scales of neighbor switches. This should be addressed as core conclusions appear to rest on this statement and the authors have 3d tracks of the fish.

      Response: Thank you for pointing out this very important future research direction. Based on our observations and the hypothesized mechanisms for fish within the school to save energy (Fig. 1), we have been conducting follow-up experiments to decipher the multiple dynamic mechanisms that enable the fish within the school to save energy. Tracking the position of each individual fish body in 3D within the fish school has proven difficult. We currently have 3D data on the nose position obtained simultaneously with the energetic measurements, but we do not have full 3D fish body positional data. Working with our collaborators, we are developing a 3-D tracking algorithm that will allow us to quantify how long fish spend in specific formations, and we currently have a new capability to record high-speed video of fish schools moving in a flow tank for many hours (see our recent perspective by Ko et al., 2023 doi.org/10.1098/rsif.2023.0357). The new algorithms and the results will be published as separate studies and we think that these ongoing experiments are outside the scope of the current study with its focus on energetics. Nevertheless, the main point of Fig. 1 is to provide possible mechanisms to inspire future studies to dissect the detailed hydrodynamic mechanisms for energy saving, and the points raised by this comment are indeed extremely interesting to us and our ongoing experiments in this area. We provide a statement to clarify this point in the 1st paragraph of “Schooling dynamics and energy conservation” section.

      Further, there is a fundamental question on the comparison of schooling in a flow (like a stream or here flow channel) versus schooling in still water. While it is clear that, from a pure physics point of view, the situation for individual fish is equivalent, as it is about maintaining a certain velocity relative to the fluid, I do think that it makes a huge qualitative difference from a biological point of view in the context of collective swimming. In a flow, individual fish have to align with the external flow to ensure that they remain stationary and do not fall back, which then leads to highly polarized schools. However, this high polarization is induced also for completely non-interacting fish. At high speeds, also the capability of individuals to control their relative position in the school is likely very restricted, simply by being forced to put most of their effort into maintaining a stationary position in the flow. This appears to me fundamentally different from schooling in still water, where the alignment (high polarization) has to come purely from social interactions. Here, relative positioning with respect to others is much more controlled by the movement decisions of individuals. Thus, I see clearly how this work is relevant for natural behavior in flows and that it provides some insights on the fundamental physiology, but I at least have some doubts about how far it extends actually to “voluntary” highly ordered schooling under still water conditions. Here, I would wish at least some more critical reflection and/or explanation.

      Response: We agree completely with this comment that animal group orientations in still fluid can have different causes from their locomotion in a moving fluid. We very much agree with the reviewer that social interactions in still water, which typically involve low-speed locomotion and other behaviours such as searching for food by the group, can be important and could dictate fish movement patterns. In undertaking this project, we wanted to challenge fish to move at speed, and reasoned that if energy savings are important in schooling behaviour due to hydrodynamic mechanisms, we should see this when fish are moving forward against drag forces induced by fluid impacting the school. Drag forces scale as velocity squared, so we should see energy savings by the school, if any, as speed increases.
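      For reference, the velocity-squared scaling mentioned above is the standard quadratic drag law. The symbols below (fluid density ρ, drag coefficient C_D, reference area A, swimming speed U) are illustrative notation and are not taken from the manuscript, and treating C_D as constant is a simplifying assumption because it varies with Reynolds number.

```latex
F_D = \tfrac{1}{2}\,\rho\,C_D\,A\,U^{2},
\qquad
P = F_D\,U \;\propto\; U^{3}
```

      Under this idealisation, a fourfold increase in speed (for example from 2 to 8 BL/s) multiplies drag by 16 and the mechanical power needed to overcome it by 64, which is why any hydrodynamic saving would be expected to become more visible as speed increases.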

      We also quantified fish school swimming speeds in the field from the literature and presented a figure showing that in nature fish schools can and do move at considerable speeds. This figure is part of our recent overview of collective behaviour in J. Exp. Biol. (Zhang & Lauder 2023 doi:10.1242/jeb.245617). It is only by studying fish schools moving over a speed range that we can understand the performance curve relating energy use to swimming speed. Indeed, we wonder if fish moving in still water as a collective versus as solitary individuals would show energy savings at all. We now provide the justification for studying fish schooling in moving fluids in the second and third paragraphs of the Introduction. When animals are challenged hydrodynamically (e.g. at higher speed), the need to save energy arises. Movement in still water does not impose the same need. When fish do not need to save locomotor energy in still water, it is hard to justify why we would expect to observe energy saving and related physiological mechanisms in the first place. As the reviewer said, the ‘high polarization in still water has to come purely from social interactions’. Our study does not dispute this consideration, and indeed we agree with it! In our supplementary materials, we acknowledged that the different scenarios of fish schooling, as defined there, can have different behavioural and ecological drivers. Using these definitions, we explicitly stated, in the introduction, that our study focuses on active and directional schooling behaviour to understand the possible hydrodynamic benefits of energy expenditure for collective movements of fish schools. By stating the scope of our study at the outset, we hope that this will keep the discussion focused on the energetics and kinematics of fish schools, without unnecessarily addressing the many other possible reasons for fish schooling behaviours, such as anti-predator grouping, food searching, or reproduction, to name three examples.

      That being said, we acknowledge (in the 2nd paragraph of the introduction) that fish schooling behaviour can have other drivers when the flow is not challenging. Also, there are robot-animal interaction studies and computational fluid dynamic simulation studies (that we cited) showing that individuals in fish schools interact hydrodynamically. Hydrodynamic interactions are not the same as behavioural interactions, but this does not mean that individuals within fish schools in moving flow are not interacting and coordinating.

      Related to this, the reported increase in the elongation of the school at a higher speed could have also different explanations. The authors speculate briefly it could be related to the optimal structure of the school, but it could be simply inter-individual performance differences, with slower individuals simply falling back with respect to faster ones. Did the authors test for certain fish being predominantly at the front or back? Did they test for individual swimming performance before testing them in groups together? Again this should be at least critically reflected somewhere.

      Response: Thank you for raising this point. If the more streamlined schooling structure above 2 BL/s were due to weaker individuals not keeping up with the rest of the school, we would expect the weaker individuals to quit the swimming tests well before 8 BL/s. However, we did not observe this phenomenon. Although we did not specifically test for the two questions the reviewer raises here, our results suggest that inter-individual variation in the swimming performance of giant danio does not span the range of 2 to 8 BL/s (a fourfold range). While inter-individual differences certainly exist, we believe that they are small relative to the speeds tested, as we did not see any particular individuals consistently unable to keep up with the school or certain individuals maintaining a position near the back of the school. That being said, we provide additional interpretations for the elongated schooling structure at the end of the 2nd paragraph of the “schooling dynamics and energy conservation” section.

      Reviewer #1 (Recommendations For The Authors):

      Line 58: The authors write "How the fluid dynamics (...) enable energetic savings (...)". However, the paper focuses rather on the question of whether energetic savings exist and does not enlighten us on the dominant mechanisms. Although it gives a brief overview of all possible mechanisms, it remains speculative on the actual fluid dynamical and biomechanical processes. Thus, I suggest changing "How" to "Whether".

      Response: Great point! We changed “How” to “Whether”.

      Lines 129-140: In the discussion of the U-shaped aerobic rate, there is no direct comparison of the minimum cost values between the schooling and solitary conditions. Only the minimum costs during schooling are named/discussed. In addition to the data in the figure, I suggest explicitly comparing them as well for full transparency.

      Response: Thanks for raising this point. We did not belabor this point because there was no statistical significance. As requested, we added a statement to address this with statistics in the 1st paragraph of the Results section.

      Line 149: The authors note that the schooling fish have a higher turning frequency than solitary fish. Here, a brief discussion of potential explanations would be good, e.g. need for coordination with neighbors -> cost of schooling.

      Response: Thank you for the suggestion. In the original version of the manuscript, we discussed that the higher turning frequency could be related to higher postural costs for active stability adjustment at low speeds. As requested, we now added that high turn frequency can relate to the need for coordination with neighbours in the last paragraph of the “Aerobic metabolic rate–speed curve of fish schools” section. As indicated above, the suspected costs of coordination did not result in higher costs of schooling at the lower speed (< 2 BL s⁻¹, where the turn frequency is higher).

      Line 151: The authors discuss the higher maximum metabolic rate of schooling fish as a higher aerobic performance and lower use of aerobic capacity. This may be confusing for non-experts in animal physiology and energetics of locomotion. I recommend providing somewhere in a paper an additional explanation to clarify it to non-experts. While lines 234-240 and further below potentially address this, I found this not very focused or accessible to non-experts. Here, I suggest the authors consider revisions to make it more comprehensible to a wider, interdisciplinary audience.

      Response: We agree with the reviewer that the difference between maximum oxygen uptake and maximum metabolic rate can be confusing. In fact, among animal physiologists, these two concepts are often muddled. One of the authors is working on an invited commentary for J. Exp. Biol. to clearly define these two concepts. We have made the language in the section “Schooling dynamics enhances aerobic performance and reduces non-aerobic energy use” more accessible to a general audience. In addition, the original version presented the relevant framework in the first and the second paragraphs of the Introduction when discussing aerobic and non-aerobic energy contribution. In brief, when vertebrates exhibit maximum oxygen uptake, they use aerobic and non-aerobic energy contributions that both contribute to their metabolic rate. Therefore, the maximum total metabolic rate is higher than the one estimated from only maximum oxygen uptake. We used the method presented in Fig. 3a to estimate the maximum metabolic rate for metabolic energy use (combining aerobic and non-aerobic energy use). In kinesiology, maximum oxygen uptake is used to evaluate aerobic performance, whereas the energy use of human athletes is estimated with power meters or doubly labelled water.

      Line 211: The authors write that Danio regularly move within the school and do not maintain inter-individual positions. Given that this is an important observation, and the relative position and its changes are crucial to understanding the possible mechanisms for energetic savings in schools, I would expect some more quantitative support for this statement, in particular as the authors have access to 3d tracking data. For example introducing some simple metrics like average time intervals between swaps of nearest neighbors, possibly also resolved in directions (front+back versus right+left), should provide at least some rough quantification of the involved timescales, whether it is seconds, tens of seconds, or minutes.

      Response: As noted in our response above, 3-D tracking of both body position and body deformation of multiple individuals in a school is not a trivial research challenge and we have ongoing research on this issue. We hope to have results on the 3D positions of fish in schools soon! For this manuscript, we believe that the data in Figure 4E, which shows the turning frequency of fish in schools and solitary controls, illustrate the general phenomenon of fish moving around (as fish turn to change positions within the school), but we agree that more could be done to address this point and we are indeed working on it now.

      Lines 212-217: There is a very strong statement that energetic savings by collective motion do not require fixed positional arrangements or specific kinematic features. While possibly one of the most interesting findings of the paper, I found that in its current state, it was not sufficiently/satisfactorily discussed. For example for the different mechanisms summarized, there will clearly be differences in their relevance based on relative distance and position. For example mechanisms 3 and 4 likely have significant contributions only at short distances. Here, the question is how relevant can they be if the average distance is 1 BL? Also, 1 BL side by side is very much different from 1 BL front to back, given the elongated body shape. For mechanisms 1 and 2, it appears relative positioning is quite important. Here, having maybe at least some information from the literature (if available) on the range of wall or push effects or the required precision in relative positioning for having a significant benefit would be very much desired. Also, do the authors suggest that a) these different effects overlap giving any position in the school a benefit, or b) that there are specific positions giving benefits due to different mechanisms and that fish "on purpose" switch only between these energetic "sweet" spots? I guess this is what is referred to towards the end as the Lighthill conjecture. Given the small group size I find a) rather unlikely, while b) actually also leads to a coordination problem if every fish is looking for a sweet spot. Overall, a related question is whether the authors observed a systematic change in leading individuals, which likely have no, or very small, hydrodynamic benefits.

      Response: Thank you for the excellent discussion on this point. As we responded above, we have softened the tone of the statement. In the original version, we were clear that the known mechanisms as summarized in Fig. 1 lead us to ‘expect’ that fish do not need to be in a fixed position to save energy.

      In general, current engineering/hydrodynamic studies suggest that any fish positioned within one body length (both upstream and downstream and side by side) will benefit from one or more of the hydrodynamic mechanisms that we expect will reduce energy costs, relative to a solitary individual. Our own studies using robotic systems suggest that a leading fish will experience an added mass “push” from a follower when the follower is located within roughly ½ body length behind the leader. We cited a Computational Fluid Dynamic (CFD) study about the relative distance among individuals for energy saving to be in effect. Please keep in mind that CFD simulation is a simplified model of the actual locomotion of fish and involves many assumptions and currently only resolves the time scale of seconds (see commentary of Zhang & Lauder 2023 doi:10.1242/jeb.245617 in J. Exp. Biol. for the current challenges of CFD simulation). To really understand the dynamic positions of fish within the school, we will need 3-D tracking of fish schools with tools that are currently being developed. Ideally, we would also have simultaneous energetic measurements, but of course, this is enormously challenging and it is not clear at this time how to accomplish this.

      We certainly agree that the relative positions of fish (vertically staggered or in-line swimming) do affect the specific hydrodynamic mechanisms being used. We cited the study that discussed this, but the relative positions of fish remain an active area of research. More studies will be published in the next few years to provide more insight into the effects of the relative positions of fish on energy saving. The Lighthill conjecture is observed in flapping foils, and whether fish schools use the Lighthill conjecture for energy saving is an active area of research but still unclear. We also provided a citation about the implication of the Lighthill conjecture for fish schools. Hence, our original version stated ‘The exact energetic mechanisms….would benefit from more in-depth studies’. We agree with the reviewer that not all fish can benefit from the Lighthill conjecture (if fish schools use it) at any given time point, hence the fish might need to rotate in using the Lighthill conjecture. This is one more explanation for the dynamic positioning of fish in a school.

      Overall, in response to the question raised, we do not believe that fish are actively searching for “sweet spots” within the school, although this is only speculation on our part. We believe instead that fish, located in a diversity of positions within the school, get the hydrodynamic advantage of being in the group at that configuration.

      We believe that fish, once they group and maintain a grouping where individuals are all within around one body length distance from each other, will necessarily get hydrodynamic benefits. As a collective group, we believe that at any one time, several different hydrodynamic mechanisms are all acting simultaneously and result in reduced energetic costs (Fig. 1).

      Figure 4E: The y-axis is given in the units of 10-sec^-1 which is confusing is it 10 1/s or 1/(10s)? Why not use simply the unit of 1/s which is unambiguous?

      Response: Thank you for the suggestions. We counted the turning frequency over the course of 10 seconds. To reflect more accurately on what we did, we used the suggested unit of 1/(10s) to more correctly correspond to how we made the measurements and the duration of the measurement. We recognize that this is a bit non-standard but would like to keep these units if possible.
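
      The relationship between the two units can be made explicit with a minimal sketch. The function name and the example count are illustrative assumptions and are not values from the manuscript; the only detail taken from the text above is the 10-second counting window.

```python
def turns_per_second(turn_count: float, window_s: float = 10.0) -> float:
    """Convert turns counted within a window (default 10 s) to a rate in 1/s."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    return turn_count / window_s


# Hypothetical example: 4 turns counted in one 10 s window.
print(turns_per_second(4))  # 0.4
```

      A value plotted in units of 1/(10 s) is therefore ten times the same value expressed in 1/s.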

      Figure 4F: The unit in the school length is given in [mm], which suggests that the maximal measured school length is 4mm, this can't be true.

      Response: Thank you for pointing this out. The unit should be [cm], which we corrected.

      Reviewer #2 (Public Review):

      Summary:

      This paper tests the idea that schooling can provide an energetic advantage over solitary swimming. The present study measures oxygen consumption over a wide range of speeds, to determine the differences in aerobic and anaerobic cost of swimming, providing a potentially valuable addition to the literature related to the advantages of group living.

      Response: Thank you for acknowledging our contribution is a valuable addition to the literature on collective movement by animals.

      Strengths:

      The strength of this paper is related to providing direct measurements of the energetics (oxygen consumption) of fish while swimming in a group vs solitary. The energetic advantages of schooling have been claimed to be one of the major advantages of schooling and therefore a direct energetic assessment is a useful result.

      Response: Thank you for acknowledging our results are useful and provide direct measurements of energetics to prove a major advantage of schooling relative to solitary motion over a range of speeds.

      Weaknesses:

      The manuscript suffers from a number of weaknesses which are summarised below:

      1) The possibility that fish in a school show lower oxygen consumption may also be due to a calming effect. While the authors show that there is no difference at low speed, one cannot rule out that calming effects play a more important role at higher speed, i.e. in a more stressful situation.

      Response: Thank you for raising this creative point on “calming”. When vertebrates are moving at high speeds, their stress hormones (adrenaline, catecholamines & cortisol) increase. This phenomenon has been widely studied, and therefore, we do not believe that animals are ‘calm’ when moving at high speed and that somehow a “calming effect” explains our non-linear concave-upward energetic curves. “Calming” would have to have a rather strange non-linear effect over speed to explain our data, and act in contrast to known physiological responses involved in intense exercise (whether in fish or humans). It is certainly not true for humans that running at high speeds in a group causes a “calming effect” that explains changes in metabolic energy expenditure. We have added an explanation in the third paragraph in the section “Schooling dynamics enhances aerobic performance and reduces non-aerobic energy use”. Moreover, when animals move with a high frequency of appendage movement (for both solitary individual and group movement), they are also not ‘calm’ from a behavioural point of view. Therefore, we respectfully disagree with the reviewer that the ‘calming effect’ is a major contributor to the energy saving of group movement at high speed. It is difficult to believe that giant danio swimming at 8 BL/s, which is near or at their maximal sustainable locomotor limits, are somehow “calm”. In addition, we demonstrated by direct energetic measurement that solitary individuals do not have a higher metabolic rate at the lower speed and thus directly show that there is very likely no cost of “uncalm” stress that would elevate the metabolic rate of solitary individuals. Furthermore, the current version of this manuscript compared the condition factor of the fish in the school and solitary individuals and found no difference (see Experimental Animal Section in the Methods). This also suggests that the measurement on the solitary fish is likely not confounded by any stress effects.

      Finally, and as discussed further below, since we have simultaneous high-speed videos of fish swimming as we measure oxygen consumption at all speeds, we are able to directly measure fish behaviour. Since we observed no alteration in tail beat kinematics between schools and individuals (a key result that we elaborate on below), it’s very hard to justify that a “calming” effect explains our results. Fish in schools swimming at speed (not in still water) appear to be just as “calm” as solitary individuals.

      2) The ratio of fish volume to water volume in the respirometer is much higher than that recommended by the methodological paper by Svendsen et al. (J Fish Biol 2016)

      Response: The ratio of respirometer volume to fish volume is an important issue that we thought about in detail before conducting these experiments. While Svendsen et al., (J. Fish Biol. 2016) recommend a respirometer volume-to-fish volume ratio of 500, we are not aware of any experimental study comparing volumes with oxygen measuring accuracy that gives this number as optimal. In addition, the Svendsen et al. paper does not consider that their recommendation might result in fish swimming near the walls of the flume (as a result of having relatively larger fish volume to flume volume) and hence able to alter their energetic expenditure by being near the wall. In our case, we needed to be able to study both a school (with higher animal volumes) and an individual (relatively lower volume) in the same exact experimental apparatus. Thus, we had to develop a system to accurately record oxygen consumption under both conditions.

      The ratio of our respirometer to individual volume for schools is 693, while the value for individual fish is 2200. Previous studies (Parker 1973, Abrahams & Colgan, 1985, Burgerhout et al., 2013) that used a swimming-tunnel respirometer (i.e., a sealed treadmill) to measure the energy cost of group locomotion used values that range between 1116 and 8894 which are large and could produce low-resolution measurements of oxygen consumption. Thus, we believe that we have an excellent ratio for our experiments on both schools and solitary individuals, while maintaining a large enough value that fish don’t experience wall effects (see more discussion on this below, as we experimentally quantified the flow pattern within our respirometer).
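
      The arithmetic behind such ratios can be sketched as follows. The respirometer and fish volumes below are hypothetical placeholders chosen only to show the calculation; they are not the volumes used in our experiments, and the function name is likewise an assumption.

```python
def respirometer_to_fish_ratio(respirometer_volume_ml, fish_volumes_ml):
    """Ratio of respirometer volume to the total volume of all fish inside it."""
    total_fish_volume = sum(fish_volumes_ml)
    if total_fish_volume <= 0:
        raise ValueError("total fish volume must be positive")
    return respirometer_volume_ml / total_fish_volume


# Hypothetical numbers: one 2 ml fish versus eight 1 ml fish in the same 1000 ml chamber.
solitary_ratio = respirometer_to_fish_ratio(1000.0, [2.0])
school_ratio = respirometer_to_fish_ratio(1000.0, [1.0] * 8)
print(solitary_ratio, school_ratio)  # 500.0 125.0
```

      Because the chamber is the same in both conditions, the ratio for a school is necessarily lower than for a solitary fish whenever the school's total volume is larger.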

      The goal of the recommendation by Svendsen et al. is to achieve a satisfactory R² (coefficient of determination) value for oxygen consumption data. However, Chabot et al., 2020 (DOI: 10.1111/jfb.14650) pointed out that only relying on R² values is not always successful at excluding non-linear slopes. Much worse, only pursuing high R² values has a risk of removing linear slopes with low R² only because of a low signal-to-noise ratio and resulting in an overestimation of the low metabolic rate. Although we acknowledge the excellent efforts and recommendations provided by Svendsen et al., 2016, we perhaps should not treat the ratio of respirometer to organism volume of 500 as the gold standard for swim-tunnel respirometry. Svendsen et al., 2016 did not indicate how they reached the recommendation of using the ratio of respirometer to organism volume of 500. Moreover, Svendsen et al., 2016 stated that using an extended measuring period can help to resolve the low signal-to-noise ratio. Hence, the key consideration is to obtain a reliable signal-to-noise ratio, which we will discuss below.

      To ensure we obtain reliable data quality, we installed a water mixing loop (Steffensen et al., 1984) and used the best currently available oxygen probe technology (see method section of Integrated Biomechanics & Bioenergetic Assessment System) to improve the signal-to-noise ratio. The water mixing loop is not commonly used in swim-tunnel respirometers. Hence, if a previously published study used a respirometer-to-organism ratio up to 8894, our updated oxygen measuring system is completely adequate to produce reliable signal-to-noise ratios in our system with a respirometer-to-organism ratio of 2200 (individuals) and 693 (schools). In fact, our original version of the manuscript used a published method (Zhang et al., 2019, J. Exp. Biol. https://doi.org/10.1242/jeb.196568) to analyze the signal-to-noise ratio and provided the quantitative approach to determine the sampling window to reliably capture the signal (Fig. S5).

      3) Because the same swimming tunnel was used for schools and solitary fish, schooling fish may end up swimming closer to the wall (because of less volume per fish) than solitary fish. Distances to the wall of schooling fish are not given, and they could provide an advantage to schooling fish.

      Response: This is an issue that we considered carefully in designing these experiments. After considering the volume of the respirometer and the size of the fish (see the response above), we decided to use the same respirometer to avoid any other confounding factors when using different sizes of respirometers with potentially different internal flow patterns. In particular, different sizes of Brett-type swim-tunnel respirometers differ in the turning radius of water flow, which can produce different flow patterns in the swimming section. Please note that we quantified the flow pattern within the flow tank using particle image velocimetry (PIV) (so we have quantitative velocity profiles across the working section at all tested speeds), and modified the provided baffle system to improve the flow in the working section.

      Because we took high-speed videos simultaneously with the respirometry measurements, we can state unequivocally that individual fish within the school did not swim closer to the walls than solitary fish over the testing period (see below for the quantitative measurements of the boundary layer). Indeed, many previous respirometry studies do not obtain simultaneous video data and hence are unable to document fish locations when energetics is measured.

      In studying schooling energetics, we believe that it is important to control as many factors as possible when making comparisons between school energetics and solitary locomotion. We took great care as indicated in the Methods section to keep all experimental parameters the same (same light conditions, same flow tank, same O2 measuring locations with the internal flow loop, etc.) so that we could detect differences if present. Changing the flow tank respirometer apparatus between individual fish and the schools studied would have introduced an unacceptable alteration of experimental conditions and would be a clear violation of the best experimental practices.

      We have made every effort to be clear and transparent about the choice of experimental apparatus and explained at great length the experimental parameters and setup used, including the considerations about the wall effect in the extended Methods section and supplemental material provided.

      Our manuscript provides the measurement of the boundary layer (<2.5 mm at speeds > 2 BL s⁻¹) in the methods section of the Integrated Biomechanics & Bioenergetic Assessment System. We also state that the boundary layer is much thinner than the body width of the giant danio (~10 mm) so that the fish cannot effectively hide near the wall. Due to our PIV calibration, we are able to quantify flow near the wall.

      In the manuscript, we also provide details about the wall effects and fish schools as follows from the manuscript: “…the convex hull volume of the fish school did not change as speed increased, suggesting that the fish school was not flattening against the wall of the swim tunnel, a typical feature when fish schools are benefiting from wall effects. In nature, fish in the centre of the school effectively swim against a ‘wall’ of surrounding fish where they can benefit from hydrodynamic interactions with neighbours.” The notion that the lateral motion of surrounding slender bodies can be represented by a streamlined wall was also proposed by Newman et al., 1970 J. Fluid Mech. These considerations provide ample justification for the comparison of locomotor energetics by schools and solitary individuals.

      4) The statistical analysis has a number of problems. The values of MO2 of each school are the result of the oxygen consumption of each fish, and therefore the test is comparing 5 individuals (i.e. an individual is the statistical unit) vs 5 schools (a school made out of 8 fish is the statistical unit). Therefore the test is comparing two different statistical units. One can see from the graphs that schooling MO2 tends to have a smaller SD than solitary data. This may well be due to the fact that schooling data are based on 5 points (five schools) and each point is the result of the MO2 of five fish, thereby reducing the variability compared to solitary fish. Other issues are related to data (for example Tail beat frequency) not being independent in schooling fish.

      Response: We cannot agree with the reviewer that fish schools and solitary individuals are different statistical units. Indeed, these are the two treatments in the statistical sense: a school versus the individual. This is why we invested extra effort to replicate all our experiments on multiple schools of different individuals and compare the data to multiple different solitary individuals. This is a standard statistical approach, whether one is comparing a tissue with multiple cells to an individual cell, or multiple locations to one specific location in an ecological study. Our analysis treats the collective movement of the fish school as a functional unit, just like the solitary individual is a functional unit. At the most fundamental level of oxygen uptake measurements, our analysis results from calculating the declining dissolved oxygen as a function of time (i.e. the slope of oxygen removal). Comparisons are made between the slope of oxygen removal by fish schools and the slope of oxygen removal by solitary individuals. This is the correct statistical comparison.
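
      The slope-based calculation described above can be illustrated with a generic sketch. This is an ordinary least-squares fit written for illustration only, not the exact analysis pipeline of the manuscript; the function names, the example oxygen trace, and the 10 L water volume are all hypothetical assumptions.

```python
from statistics import mean


def linear_fit(times_s, oxygen_mg_per_l):
    """Ordinary least-squares fit of dissolved oxygen against time.

    Returns (slope, intercept, r_squared).
    """
    if len(times_s) != len(oxygen_mg_per_l) or len(times_s) < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    t_bar = mean(times_s)
    o_bar = mean(oxygen_mg_per_l)
    sxx = sum((t - t_bar) ** 2 for t in times_s)
    syy = sum((o - o_bar) ** 2 for o in oxygen_mg_per_l)
    sxy = sum((t - t_bar) * (o - o_bar) for t, o in zip(times_s, oxygen_mg_per_l))
    if sxx == 0:
        raise ValueError("time values must not all be identical")
    slope = sxy / sxx
    intercept = o_bar - slope * t_bar
    # R^2 is undefined for a perfectly flat trace, so report NaN in that case.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else float("nan")
    return slope, intercept, r_squared


def oxygen_uptake_mg_per_s(slope_mg_per_l_s, water_volume_l):
    """Oxygen removed per second from the whole chamber (positive when oxygen declines)."""
    return -slope_mg_per_l_s * water_volume_l


# Hypothetical trace: oxygen falls by 0.06 mg/L every 60 s in a 10 L chamber.
slope, intercept, r2 = linear_fit([0, 60, 120, 180], [8.00, 7.94, 7.88, 7.82])
uptake = oxygen_uptake_mg_per_s(slope, 10.0)  # roughly 0.01 mg O2 per second
```

      The same fit applies whether the chamber holds a school or a solitary fish, so the quantity being compared between treatments is the slope of oxygen removal in each case.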

      The larger SD in individuals can be due to multiple biological reasons other than the technical reasons suggested here. Fundamentally, the different SD between fish schools and individuals can be the result of differences between solitary and collective movement and the different fluid dynamic interactions within the school could certainly cause differences in the amount of variation seen. Our interpretation of the ‘numerically’ smaller SD in fish schools than that of solitary individuals suggests that interesting hydrodynamic phenomena within fish schools remain to be discovered.

      Reviewer #2 (Recommendations For The Authors):

      I have reviewed a previous version of this paper. This new draft is somewhat improved but still presents a number of issues which I have outlined below.

      Response: Thanks for your efforts to improve our paper with reviews, but a number of your comments apply to the previous version of the paper, and we have made a number of revisions before submitting it to eLife. We explain below how this version of the manuscript addresses many of your comments from both the previous and current reviews. As readers can see from our responses below, this version of the manuscript no longer uses only ‘two-way ANOVA’ as we have implemented an additional statistical model. (Please see the comments below for more detailed responses related to the statistical models).

      1) One of the main problems, and one of the reasons (see below) why many previous papers have measured TBF and not the oxygen consumption of a whole school, is that schooling also provides a calming effect (Nadler et al 2018) which is not easily differentiated from the hydrodynamic advantages (Abrahams and Colgan 1985). This effect can reduce the MO2 while swimming and the EPOC when recovering. The present study does not fully take this potential issue into account and therefore its results are confounded by such effects. The authors state (line 401) that "the aerobic locomotion cost of solitary individuals showed no statistical difference from (in fact, being numerically lower) that of fish schools at a very low testing speed. The flow speed is similar to some areas of the aerated home aquarium for each individual fish. This suggests that the stress of solitary fish likely does not meaningfully contribute to the higher locomotor costs". While this is useful, the possibility that at higher speeds (i.e. a more stressful situation) solitary fish may experience more stress than fish in a school, cannot be ruled out.

      Response: Thank you for finding our results and data useful. We have addressed the comments on calming or stress effects in our response above. The key point is that either solitary or school fish are challenged (i.e. stressed) at a high speed where the sizable increases in stress hormones are well documented in the exercise physiology literature. We honestly just do not understand how a “calming” effect could possibly explain the upward concave energetic curves that we obtained, and how “calming” could explain the difference between schools and solitary individuals. Since we have simultaneous high-speed videos of fish swimming as we measure oxygen consumption at all speeds, we are able to directly observe fish behaviour. It is not exactly clear what a “calming effect” would look like kinematically or how one would measure this experimentally, but since we observed no alteration in tail beat kinematics between schools and individuals (a key result that we elaborate on below), it’s very hard to justify that a “calming” effect explains our results. Fish in schools appear to be just as “calm” as solitary individuals.

      If the reviewer's “calming effect” is a general issue, then birds flying in a V-formation should also experience a “calming effect”, but at least one study shows that birds in a V-formation experience higher wing beat frequencies.

      In addition, Nadler et al., 2018 (https://doi.org/10.1242/bio.031997) did not study any such “calming effect”. We assume the reviewer is referring to Nadler et al., 2016, which showed that shoaling reduced fish metabolic rates in a resting respirometer that has little-to-no water current that would motivate fish to swim (which is very different from the swim-tunnel respirometer we used). Moreover, the inter-loop system used by Nadler et al., 2016 has the risk of mixing the oxygen uptake of the fish shoal and solitary individuals. Hence, we believe that it is not appropriate to extend the results of Nadler et al., 2016 to infer and insist on a calming effect for fish schools that we studied which are actively and directionally swimming over a wide speed range up to and including high speeds, especially since our data clearly show that ‘the aerobic locomotion cost of solitary individuals showed no statistical difference from (in fact, being numerically lower) that of fish schools at very low testing speeds’. More broadly, shoaling and schooling are very different in terms of polarization as well as the physiological and behavioural mechanisms used in locomotion. Shoaling behaviour by fish in still water is not the same as active directional schooling over a speed range. Our supplementary Table 1 provides a clear definition for a variety of grouping behaviours and makes the distinction between shoaling and schooling.

      Our detailed discussion about other literature mentioned by this reviewer can be seen in the comments below.

      2) The authors overstate the novelty of their work. Line 29: "Direct energetic measurements demonstrating the energy-saving benefits of fluid-mediated group movements remain elusive" The idea that schooling may provide a reduction in the energetic costs of swimming dates back to the 70s, with pioneering experimental work showing a reduction in tail beat frequency in schooling fish vs solitary (by Zuyev, G. V. & Belyayev, V. V. (1970)) and theoretical work by Weihs (1973). Work carried out in the past 20 years (Herskin and Steffensen 1998; Marras et al 2015; Burgerhout et al 2013; Hemelrijk et al 2014; Li et al 2021, Wiwchar et al 2017; Verma et al 2018; Ashraf et al 2019) based on a variety of approaches has supported the idea of a reduction in swimming costs in schooling vs solitary fish. In addition, group respirometry has actually been done in early and more recent studies testing the reduction in oxygen consumption as a result of schooling (Parker, 1973; Itazawa et al., 1978; Abrahams and Colgan 1985; Davis & Olla, 1992; Ross & Backman, 1992, Burgerhout et al 2013; Currier et al 2020). Specifically, Abrahams and Colgan (1985) and Burgerhout et al (2013) found that the oxygen consumption of fish swimming in a school was higher than when solitary, and Abrahams and Colgan (1985) made an attempt to deal with the confounding calming effect by pairing solitary fish up with a neighbor visible behind a barrier. These issues and how they were dealt with in the past (and in the present manuscript) are not addressed by the present manuscript. Currier et al (2020) found that the reduction of oxygen consumption was species-specific.

      Response: We cannot agree with this reviewer that we have overstated the novelty of our work, and, in fact, we make very specific comments on the new contributions of our paper relative to the large previous literature on schooling. We are well aware of the literature cited above and many of these papers have little or nothing to do with quantifying the energetics of schooling. In addition, many of these papers rely on simple kinematic measurements which are unrelated to direct energetic measurements of energy use. To elaborate on this, we present the ‘Table R’ below which evaluates and compares each of the papers this reviewer cites above. The key message (as we wrote in the manuscript) is that none of the previous studies measured non-aerobic cost (and thus did not calculate the total energy expenditure (TEE)), which we show to be substantial. In addition, many of these studies do not compare schools to individuals, do not quantify both energetics and kinematics, and do not study a wide speed range. Only 33% of previous studies used direct measurements of aerobic metabolic rate to compare the locomotion costs of fish schools and solitary individuals (an experimental control). We want to highlight that most of the citations in the reviewer’s comments are not about the kinematics or hydrodynamics of fish schooling energetics, although they provide peripheral information on fish schooling in general. We also provide an overview of the literature on this topic in our paper in the Journal of Experimental Biology (Zhang & Lauder 2023 doi:10.1242/jeb.245617) and do not wish to duplicate that discussion here. We summarized and cited the relevant papers about the energetics of fish schooling in Table 1.

      Author response table 1.

      Papers cited by Reviewer #2, and a summary of their contributions and approach.

      References cited above:

      Zuyev, G., & Belyayev, V. V. (1970). An experimental study of the swimming of fish in groups as exemplified by the horsemackerel [Trachurus mediterraneus ponticus Aleev]. J Ichthyol, 10, 545-549.

      Weihs, D. (1973). Hydromechanics of fish schooling. Nature, 241(5387), 290-291.

      Herskin, J., & Steffensen, J. F. (1998). Energy savings in sea bass swimming in a school: measurements of tail beat frequency and oxygen consumption at different swimming speeds. Journal of Fish Biology, 53(2), 366-376.

      Marras, S., Killen, S. S., Lindström, J., McKenzie, D. J., Steffensen, J. F., & Domenici, P. (2015). Fish swimming in schools save energy regardless of their spatial position. Behavioral ecology and sociobiology, 69, 219-226.

      Burgerhout, E., Tudorache, C., Brittijn, S. A., Palstra, A. P., Dirks, R. P., & van den Thillart, G. E. (2013). Schooling reduces energy consumption in swimming male European eels, Anguilla anguilla L. Journal of experimental marine biology and ecology, 448, 66-71.

      Hemelrijk, C. K., Reid, D. A. P., Hildenbrandt, H., & Padding, J. T. (2015). The increased efficiency of fish swimming in a school. Fish and Fisheries, 16(3), 511-521.

      Li, L., Nagy, M., Graving, J. M., Bak-Coleman, J., Xie, G., & Couzin, I. D. (2020). Vortex phase matching as a strategy for schooling in robots and in fish. Nature communications, 11(1), 5408.

      Wiwchar, L. D., Gilbert, M. J., Kasurak, A. V., & Tierney, K. B. (2018). Schooling improves critical swimming performance in zebrafish (Danio rerio). Canadian Journal of Fisheries and Aquatic Sciences, 75(4), 653-661.

      Verma, S., Novati, G., & Koumoutsakos, P. (2018). Efficient collective swimming by harnessing vortices through deep reinforcement learning. Proceedings of the National Academy of Sciences, 115(23), 5849-5854.

      Ashraf, I., Bradshaw, H., Ha, T. T., Halloy, J., Godoy-Diana, R., & Thiria, B. (2017). Simple phalanx pattern leads to energy saving in cohesive fish schooling. Proceedings of the National Academy of Sciences, 114(36), 9599-9604.

      Parker Jr, F. R. (1973). Reduced metabolic rates in fishes as a result of induced schooling. Transactions of the American Fisheries Society, 102(1), 125-131.

      Itazawa, Y., & Takeda, T. (1978). Gas exchange in the carp gills in normoxic and hypoxic conditions. Respiration physiology, 35(3), 263-269.

      Abrahams, M. V., & Colgan, P. W. (1985). Risk of predation, hydrodynamic efficiency and their influence on school structure. Environmental Biology of Fishes, 13, 195-202.

      Davis, M. W., & Olla, B. L. (1992). The role of visual cues in the facilitation of growth in a schooling fish. Environmental biology of fishes, 34, 421-424.

      Ross, R. M., Backman, T. W., & Limburg, K. E. (1992). Group-size-mediated metabolic rate reduction in American shad. Transactions of the American Fisheries Society, 121(3), 385-390.

      Currier, M., Rouse, J., & Coughlin, D. J. (2021). Group swimming behaviour and energetics in bluegill Lepomis macrochirus and rainbow trout Oncorhynchus mykiss. Journal of Fish Biology, 98(4), 1105-1111.

      Halsey, L. G., Wright, S., Racz, A., Metcalfe, J. D., & Killen, S. S. (2018). How does school size affect tail beat frequency in turbulent water?. Comparative Biochemistry and Physiology Part A: Molecular & Integrative Physiology, 218, 63-69.

      Johansen, J. L., Vaknin, R., Steffensen, J. F., & Domenici, P. (2010). Kinematics and energetic benefits of schooling in the labriform fish, striped surfperch Embiotoca lateralis. Marine Ecology Progress Series, 420, 221-229.

      3) In addition to the calming effect, measuring group oxygen consumption suffers from a number of problems as discussed in Herskin and Steffensen (1998) such as the fish volume to water volume ratio, which varies considerably when testing a school vs single individuals in the same tunnel and the problem of wall effect when using a small volume of water for accurate O2 measurements. Herskin and Steffensen (1998) circumvented these problems by measuring tailbeat frequencies of fish in a school and then calculating the MO2 of the corresponding tailbeat frequency in solitary fish in a swim tunnel. A similar approach was used by Johansen et al (2010), Marras et al (2015), Halsey et al (2018). However, it is not clear how these potential issues were dealt with here. Here, larger solitary D. aequipinnatus were used to increase the signal-to-noise ratio. However, using individuals of different sizes makes other variables not so directly comparable, including stress, energetics, and kinematics. (see comment 7 below).

      Response: We acknowledge the great efforts made by previous studies to understand the energetics of fish schooling. These studies, as detailed in the table and elaborated in the response above (see comment 2) are very different from our current study. Our study achieved a direct comparison of energetics (including both aerobic and non-aerobic cost) and kinematics between solitary individuals and fish schools that has never been done before. Our detailed response to the supposed “calming effect” is given above.

      As highlighted in the previous comments and opening statement, our current version has addressed the wall effect, tail beat frequency, and experimental and analytical efforts invested to directly compare the energetics between fish schools and solitary individuals. As readers can see in our comprehensive method section, achieving the direct comparison between solitary individuals and fish schools is not a trivial task. Now we want to elaborate on the role of kinematics as an indirect estimate of energetics. Our results here show that kinematic measurements of tail beat frequency are not reliable estimates of energetic cost, and the previous studies cited did not measure EPOC and those costs are substantial, especially as swimming speed increases. Fish in schools can save energy even when the tail beat frequency does not change (although school volume can change as we show). We elaborated (in great detail) on why kinematics does not always reflect on the energetics in the submitted version (see last paragraph of “Schooling dynamics and energy conservation” section). Somehow modeling what energy expenditure should be based only on tail kinematics is, in our view, a highly unreliable approach that has never been validated (e.g., fish use more than just tails for locomotion). Indeed, we believe that this is an inadequate substitute for direct energy measurements. We disagree that using slightly differently sized individuals is an issue since we recorded fish kinematics across all experiments and included the measurements of behaviour in our manuscript. Slightly altering the size of individual fish was done on purpose to provide a better ratio of respirometer volume to fish volume in the tests on individual fish, thus we regard this as a benefit of our approach and not a concern.

      Finally, in another study of the collective behaviour of flying birds (Usherwood, J. R., Stavrou, M., Lowe, J. C., Roskilly, K. and Wilson, A. M. (2011). Flying in a flock comes at a cost in pigeons. Nature 474, 494-497), the authors observed that wing beat frequency can increase during flight with other birds. Hence, again, we cannot regard movement frequency of appendages as an adequate substitute for direct energetic measurements.

      4) Svendsen et al (2016) provide guidelines for the ratio of fish volume to water volume in the respirometer. The ratio used here (2200) is much higher than that recommended. RFR values higher than 500 should be avoided in swim tunnel respirometry, according to Svendsen et al (2016).

      Response: Thank you for raising this point. Please see the detailed responses above to the same comment. We believe that our experimental setup and ratios are very much in line with those recommended, and represent a significant improvement on previous studies that use large ratios.

      5) Lines 421-436: The same goes for wall effects. Presumably, using the same size swim tunnel, schooling fish were swimming much closer to the walls than solitary fish but this is not specifically quantified here in this paper. Lines 421-436 provide some information on the boundary layer (though wall effects are not just related by the boundary layer) and some qualitative assessment of school volume. However, no measurement of the distance between the fish and the wall is given.

      Response: Please see the detailed responses above to the same comment. Specifically, we used the particle image velocimetry (PIV) system to measure the boundary layer (<2.5 mm at speeds > 2 BL s-1) and stated the parameters in the methods section of the Integrated Biomechanics & Bioenergetic Assessment System. We also state that the boundary layer is much thinner than the body width of the giant danio (~10 mm) so that the fish cannot effectively hide near the wall. Due to our PIV calibration, we are able to quantify flow near the wall.

      Based on our video data obtained simultaneously with the energetic measurements, we do not agree that fish were swimming closer to the wall in schools. We also note that we took care to modify the typical respirometer both to ensure that flow across the cross-section did not provide any refuges and to quantify flow velocities in the chamber using particle image velocimetry. We do not believe that any previous experiments on schooling behaviour in fish have taken the same precautions.

      6) The statistical tests used have a number of problems. Two-way ANOVA was based on school vs solitary and swimming speed. However, there are repeated measures at each speed and this needs to be dealt with. The degrees of freedom of one-way ANOVA and T-tests are not provided. These tests took into account five groups of fish vs. five solitary fish. The values of MO2 of each school are the result of the oxygen consumption of each fish, and therefore the test is comparing 5 individuals (i.e. an individual is the statistical unit) vs 5 schools (a school made out of 8 fish is the statistical unit). Therefore the test is comparing two different statistical units. One can see from the graphs that schooling MO2 tend to have a smaller SD than solitary data. This may well be due to the fact that schooling data are based on 5 points (five schools) and each point is the result of the MO2 of five fish, thereby reducing the variability compared to solitary fish. TBF, on the other hand, can be assigned to each fish even in a school, and therefore TBF of each fish could be compared by using a nested approach of schooling fish (nested within each school) vs solitary fish, but this is not the statistical procedure used in the present manuscript. The comparison between TBFs presumably is comparing 5 individuals vs all the fish in the schools (6x5=30 fish). However, the fish in the school are not independent measures.

      Response: We cannot agree with this criticism, which may be based on this reviewer having seen a previous version of the manuscript. We did not use two-way ANOVA in this version. This version of the manuscript reported the statistical value based on a General Linear Model (see statistical section of the method). We are concerned that this reviewer did not in fact read either the Methods section or the Results section. In addition, it is hard to accept that, from examination of the data shown in Figure 3, there is not a clear and large difference between schooling and solitary locomotion, regardless of the statistical test used.

      Meanwhile, the comments about the ‘repeated’ measures from one speed to the next are interesting, but we cannot agree. The ‘repeated’ measures are proper when one testing subject is assessed before and after treatment. Going from one speed to the next is not a treatment. Instead, the speed is a dependent and continuous variable. In our experimental design, the treatment is fish school, and the control is a solitary individual. Second, we never compared any of our dependent variables across different speeds within a school or within an individual. Instead, we compared schools and individuals at each speed. In this comparison, there are no ‘repeated’ measures. We agree with the reviewer that fish in the school are interacting (not independent). This is one more reason to support our approach of treating fish schools as a functional and statistical unit in our experiment design (more detailed responses are stated in the response to the comment above).

      7) The size of solitary and schooling individuals appears to be quite different (solitary fish range 74-88 cm, schooling fish range 47-65 cm). While scaling laws can correct for this in the MO2, was this corrected for TBF and for speed in BL/s? Using BL/s for speed does not completely compensate for the differences in size.

      Response: Our current version provides justification for not scaling the values of tail beat frequency. Our justification is: “The mass scaling for tail beat frequency was not conducted because of the lack of data for D. aequipinnatus and its related species. Using the scaling exponent of distant species for mass scaling of tail beat frequency will introduce errors of unknown magnitude.” Our current version also acknowledges the consideration about scaling as follows: “Fish of different size swimming at 1 BL s-1 will necessarily move at different Reynolds numbers, and hence the scaling of body size to swimming speed needs to be considered in future analyses of other species that differ in size.”
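
      The point about Reynolds number can be illustrated with a short calculation. The sketch below is only an illustration under stated assumptions: it uses the standard definition Re = U L / ν, an approximate kinematic viscosity for fresh water near 25 °C, and two hypothetical body lengths that are not taken from the manuscript. Because a speed of 1 BL s-1 means U equals L per second, Re scales with the square of body length at a fixed relative speed.

```python
# Approximate kinematic viscosity of fresh water near 25 °C, in m^2/s.
# This is an assumed round value, not a number reported in the manuscript.
NU_WATER = 0.9e-6


def reynolds_number(body_length_m, speed_bl_per_s, nu=NU_WATER):
    """Return Re = U * L / nu for a fish swimming at a speed given in BL/s."""
    # Convert the relative speed (body lengths per second) to metres per second.
    speed_m_per_s = speed_bl_per_s * body_length_m
    return speed_m_per_s * body_length_m / nu


# Hypothetical body lengths (metres), chosen only for illustration.
small_fish = 0.05
large_fish = 0.08

re_small = reynolds_number(small_fish, 1.0)
re_large = reynolds_number(large_fish, 1.0)

# At the same relative speed, the ratio of Reynolds numbers equals the
# squared ratio of body lengths, independent of the viscosity used.
print(f"Re (small fish) = {re_small:.0f}")
print(f"Re (large fish) = {re_large:.0f}")
print(f"ratio = {re_large / re_small:.2f}")
```

      Under these assumptions, two fish swimming at the same speed in BL s-1 operate at different Reynolds numbers, which is why expressing speed in BL s-1 does not by itself remove the effect of body size.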

      Reviewer #3 (Public Review):

      Summary:

      Zhang and Lauder characterized both aerobic and anaerobic metabolic energy contributions in schools and solitary fishes in the Giant danio (Devario aequipinnatus) over a wide range of water velocities. By using a highly sophisticated respirometer system, the authors measure the aerobic metabolisms by oxygen uptake rate and the non-aerobic oxygen cost as excess post-exercise oxygen consumption (EPOC). With these data, the authors model the bioenergetic cost of schools and solitary fishes. The authors found that fish schools have a J-shaped metabolism-speed curve, with reduced total energy expenditure per tail beat compared to solitary fish. Fish in schools also recovered from exercise faster than solitary fish. Finally, the authors conclude that these energetic savings may underlie the prevalence of coordinated group locomotion in fish.

      The conclusions of this paper are mostly well supported by data, but some aspects of methods and data acquisition need to be clarified and extended.

      Response: Thank you for seeing the value of our study. We provided clarification of the data acquisition system with a new panel of pictures included in the supplemental material to show our experimental system. We understand that our methods have more details and justifications than typical method sections. The details are there to promote the reproducibility of the experiments. The justifications are responses to Reviewer 2, who reviewed our previous manuscript version and posted the same critiques again after we provided the justifications for the construction of the system and the data acquisition.

      Strengths:

      This work aims to understand whether animals moving through fluids (water in this case) exhibit highly coordinated group movement to reduce the cost of locomotion. By calculating the aerobic and anaerobic metabolic rates of school and solitary fishes, the authors provide direct energetic measurements that demonstrate the energy-saving benefits of coordinated group locomotion in fishes. The results of this paper show that fish schools save anaerobic energy and reduce the recovery time after peak swimming performance, suggesting that fishes can allocate more energy to other fitness-related activities when they move collectively through water.

      Response: Thank you. We are excited to share our discoveries with the world.

      Weaknesses:

      Although the paper does have strengths in principle, the weakness of the paper is the method section. There is too much irrelevant information in the methods that sometimes is hard to follow for a researcher unfamiliar with the research topic. In addition, it was hard to imagine the experimental (respirometer) system used by the authors in the experiments; therefore, it would be beneficial for the article to include a diagram/scheme of that respiratory system.

      Response: We agree with the reviewer and have added pictures of the experimental system to the supplementary materials (Fig. S4). We think pictures present the system more realistically than schematics. We also provide a picture of the system during the energetic measurements, to show the care taken to ensure that fish are not affected by any external stimulation other than the water velocity. This careful experimental protocol is critical for revealing the concave-upward curve of bony fish schools, which has not been reported before. Many details in the methods have been included in response to Reviewer 2.

      Reviewer #3 (Recommendations For The Authors):

      Overall, this is a very interesting, well-written, and nice article. However, many times the method section looks like a discussion. Furthermore, the authors need to check the use of the word "which" throughout the text. I got the feeling that it is overused/misused sometimes.

      Response: Thank you for the positive comments. The method is written in that way to address the concerns of Reviewer 2 who reviewed our previous versions. We corrected the overuse of ‘which’ throughout the manuscript.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this manuscript, the authors investigate the contributions of the long noncoding RNA Snhg3 in liver metabolism and MAFLD. The authors conclude that liver-specific loss or overexpression of Snhg3 impacts hepatic lipid content and obesity through epigenetic mechanisms. More specifically, the authors propose that the nuclear activity of Snhg3 aggravates hepatic steatosis by altering the balance of activating and repressive chromatin marks at the Pparg gene locus. This regulatory circuit is dependent on a transcriptional regulator, SND1.

      Strengths:

      The authors developed tissue-specific lncRNA knockout and knock-in (KI) models. This effort is certainly appreciated, as few lncRNA knockouts have been generated in the context of metabolism. Furthermore, lncRNA effects can be compensated in a whole organism or show subtle effects in acute versus chronic perturbation, rendering the focus on in vivo function important and highly relevant. In addition, Snhg3 was identified through a screening strategy, and as a general rule the authors attempt to follow unbiased approaches to decipher the mechanisms of Snhg3.

      Weaknesses:

      Despite efforts at generating a liver-specific knockout, the phenotypic characterization is not focused on the key readouts. Notably missing are rigorous lipid flux studies and targeted gene expression/protein measurement that would underpin why the loss of Snhg3 protects from lipid accumulation. Along those lines, claims linking the Snhg3 to MAFLD would be better supported with careful interrogation of markers of fibrosis and advanced liver disease. In other areas, significance is limited since the presented data is either not clear or rigorous enough. Finally, there is an important conceptual limitation to the work since PPARG is not established to play a major role in the liver.

      We thank the reviewer for the detailed comment. In this study, hepatocyte-specific Snhg3 deficiency decreased body and liver weight and alleviated hepatic steatosis in DIO mice, whereas overexpression induced the opposite effect (Figure 2 and 3). Furthermore, we investigated the hepatic differentially expressed genes (DEGs) between the DIO Snhg3-HKI and control WT mice using RNA-Seq and revealed that Snhg3 exerts a global effect on the expression of genes involved in fatty acid metabolism using GSEA (Figure 4B). We validated the expression of some DEGs involved in fatty acid metabolism by RT-qPCR. The results showed that the hepatic expression levels of some genes involved in fatty acid metabolism, including Cd36, Cidea/c and Scd1/2 were upregulated in Snhg3-HKO mice and were downregulated in Snhg3-HKI mice compared to the controls (Figure 4C), respectively. Please check them in the first paragraph in p8.

      As a transcription regulator of Cd36 and Cidea/c, PPARγ is well known to play major adipogenic and lipogenic roles in adipose tissue. Although the expression of PPARγ in the liver is very low under healthy conditions, induced expression of PPARγ in both hepatocytes and non-parenchymal cells (Kupffer cells, immune cells, and HSCs) in the liver has a crucial role in the pathophysiology of MASLD (Lee et al., 2023b, Chen et al., 2023, Gross et al., 2017). The activation of PPARγ in the liver induces the adipogenic program to store fatty acids in lipid droplets as observed in adipocytes (Lee et al., 2018). Moreover, the inactivation of liver PPARγ abolished the rosiglitazone-induced increase in hepatic TG and improved hepatic steatosis in lipoatrophic AZIP mice (Gavrilova et al., 2003). Furthermore, there is a strong correlation between the onset of hepatic steatosis and hepatocyte-specific PPARγ expression. Clinical trials have also indicated that increased insulin resistance and hepatic PPARγ expression were associated with NASH scores in some obese patients (Lee et al., 2023a, Mukherjee et al., 2022). Even though PPARγ’s primary function is in adipose tissue, patients with MASLD have much higher hepatic expression levels of PPARγ, reflecting the fact that PPARγ plays different roles in different tissues and cell types (Mukherjee et al., 2022). Consistent with the studies mentioned above, our results also hint at the importance of PPARγ in the pathophysiology of MASLD. Snhg3 deficiency or overexpression respectively decreased or increased hepatic PPARγ. Moreover, administration of the PPARγ antagonist T0070907 mitigated the increase in hepatic Cd36 and Cidea/c and improved Snhg3-induced hepatic steatosis. However, conflicting findings suggest that the expression of hepatic PPARγ is not increased as steatosis develops in humans and in clinical studies, and that administration of PPARγ agonists did not aggravate liver steatosis (Gross et al., 2017). Thus, understanding how hepatic PPARγ expression is regulated may provide a new avenue to prevent and treat MASLD (Lee et al., 2018). We also discuss this in the revised manuscript; please refer to the first paragraph of the Discussion section on p13.

      Hepatotoxicity accelerates the development of progressive inflammation, oxidative stress and fibrosis (Roehlen et al., 2020). Chronic liver injury, including MASLD, can progress to liver fibrosis with the formation of a fibrous scar. Injured hepatocytes can secrete fibrogenic factors or exosomes containing miRNAs that activate HSCs, the major source of the fibrous scar in liver fibrosis (Kisseleva and Brenner, 2021). Apart from promoting lipogenesis, PPARγ also has a crucial function in improving inflammation and fibrosis (Chen et al., 2023). In this study, no hepatic fibrosis phenotype was seen in Snhg3-HKO and Snhg3-HKI mice (figures supplement 1D and 2D). Moreover, deficiency and overexpression of Snhg3 respectively decreased and increased the expression of profibrotic genes, such as collagen type I alpha 1/2 (Col1a1 and Col1a2), but had no effect on the pro-inflammatory factors, including transforming growth factor β1 (Tgfβ1), tumor necrosis factor α (Tnfα), and interleukin 6 and 1β (Il6 and Il1β) (figures supplement 3A and B). Inflammation is an absolute requirement for fibrosis because factors from injured hepatocytes alone are not sufficient to directly activate HSCs and lead to fibrosis (Kisseleva and Brenner, 2021). Additionally, previous studies indicated that exposure to HFD for more than 24 weeks causes less severe fibrosis (Alshawsh et al., 2022). In future, the effect of Snhg3 on hepatic fibrosis in mice needs to be elucidated by prolonged high-fat feeding or by adopting methionine- and choline-deficient diet (MCD) feeding. Please check this in the second paragraph of the Discussion section on p13.

      References

      ALSHAWSH, M. A., ALSALAHI, A., ALSHEHADE, S. A., SAGHIR, S. A. M., AHMEDA, A. F., AL ZARZOUR, R. H. & MAHMOUD, A. M. 2022. A Comparison of the Gene Expression Profiles of Non-Alcoholic Fatty Liver Disease between Animal Models of a High-Fat Diet and Methionine-Choline-Deficient Diet. Molecules, 27. DOI:10.3390/molecules27030858, PMID:35164140

      CHEN, H., TAN, H., WAN, J., ZENG, Y., WANG, J., WANG, H. & LU, X. 2023. PPAR-gamma signaling in nonalcoholic fatty liver disease: Pathogenesis and therapeutic targets. Pharmacol Ther, 245, 108391. DOI:10.1016/j.pharmthera.2023.108391, PMID:36963510

      GAVRILOVA, O., HALUZIK, M., MATSUSUE, K., CUTSON, J. J., JOHNSON, L., DIETZ, K. R., NICOL, C. J., VINSON, C., GONZALEZ, F. J. & REITMAN, M. L. 2003. Liver peroxisome proliferator-activated receptor gamma contributes to hepatic steatosis, triglyceride clearance, and regulation of body fat mass. J Biol Chem, 278, 34268-76. DOI:10.1074/jbc.M300043200, PMID:12805374

      GROSS, B., PAWLAK, M., LEFEBVRE, P. & STAELS, B. 2017. PPARs in obesity-induced T2DM, dyslipidaemia and NAFLD. Nat Rev Endocrinol, 13, 36-49. DOI:10.1038/nrendo.2016.135, PMID:27636730

      KISSELEVA, T. & BRENNER, D. 2021. Molecular and cellular mechanisms of liver fibrosis and its regression. Nat Rev Gastroenterol Hepatol, 18, 151-166. DOI:10.1038/s41575-020-00372-7, PMID:33128017

      LEE, S. M., MURATALLA, J., KARIMI, S., DIAZ-RUIZ, A., FRUTOS, M. D., GUZMAN, G., RAMOS-MOLINA, B. & CORDOBA-CHACON, J. 2023a. Hepatocyte PPARgamma contributes to the progression of non-alcoholic steatohepatitis in male and female obese mice. Cell Mol Life Sci, 80, 39. DOI:10.1007/s00018-022-04629-z, PMID:36629912

      LEE, S. M., MURATALLA, J., SIERRA-CRUZ, M. & CORDOBA-CHACON, J. 2023b. Role of hepatic peroxisome proliferator-activated receptor gamma in non-alcoholic fatty liver disease. J Endocrinol, 257. DOI:10.1530/JOE-22-0155, PMID:36688873

      LEE, Y. K., PARK, J. E., LEE, M. & HARDWICK, J. P. 2018. Hepatic lipid homeostasis by peroxisome proliferator-activated receptor gamma 2. Liver Res, 2, 209-215. DOI:10.1016/j.livres.2018.12.001, PMID:31245168

      MUKHERJEE, A. G., WANJARI, U. R., GOPALAKRISHNAN, A. V., KATTURAJAN, R., KANNAMPUZHA, S., MURALI, R., NAMACHIVAYAM, A., GANESAN, R., RENU, K., DEY, A., VELLINGIRI, B. & PRINCE, S. E. 2022. Exploring the Regulatory Role of ncRNA in NAFLD: A Particular Focus on PPARs. Cells, 11. DOI:10.3390/cells11243959, PMID:36552725

      ROEHLEN, N., CROUCHET, E. & BAUMERT, T. F. 2020. Liver Fibrosis: Mechanistic Concepts and Therapeutic Perspectives. Cells, 9. DOI:10.3390/cells9040875, PMID:32260126

      Reviewer #2 (Public Review):

      Through RNA analysis, Xie et al found that lncRNA Snhg3 was one of the most down-regulated Snhgs in mouse liver under a high-fat diet (HFD). Consequently, the authors sought to examine the mechanism through which Snhg3 is involved in the progression of metabolic dysfunction-associated fatty liver diseases (MASLD) in HFD-induced obese (DIO) mice. Interestingly, liver-specific Snhg3 knockout reduced, while Snhg3 over-expression potentiated, fatty liver in mice on an HFD. Using the RNA pull-down approach, the authors identified SND1 as a potential Snhg3-interacting protein. SND1 is a component of the RNA-induced silencing complex (RISC). The authors found that Snhg3 increased SND1 ubiquitination to enhance SND1 protein stability, which then reduced the level of the repressive chromatin mark H3K27me3 on the PPARg promoter. The upregulation of PPARg, a lipogenic transcription factor, thus contributed to hepatic fat accumulation.

      The authors propose a signaling cascade that explains how lncRNA Snhg3 may promote hepatic steatosis. Multiple molecular approaches have been employed to identify molecular targets of the proposed mechanism, which is a strength of the study. There are, however, several potential issues to consider before jumping to a conclusion.

      (1) First of all, it's important to ensure the robustness and rigor of each study. The manuscript was not carefully put together. The image qualities for several figures were poor, making it difficult for the readers to evaluate the results with confidence. The biological replicates and numbers of experimental repeats for cell-based assays were not described. When possible, the entire immunoblot imaging used for quantification should be presented (rather than showing n=1 representative). There were multiple mislabels in figure panels or figure legends (e.g., Figure 2I, Figure 2K, and Figure 3K). The b-actin immunoblot image was reused in Figure 4J, Figure 5G, and Figure 7B with different exposure times. These might be from the same cohort of mice. If the immunoblots were run at different times, the loading control should be included on the same blot as well.

      We thank the reviewer for the detailed comment. We have provided clearer figures in the revised manuscript; please check them.

      The biological replicates and numbers of experimental repeats for cell-based assays have been updated; please check them in the manuscript.

      The entire immunoblot images used for quantification have been provided in the primary data. Please check them.

      The original Figure 2I, Figure 2K, and Figure 3K have been revised and replaced with new Figure 2F, Figure 2H, and Figure 3H, and their corresponding figure legends have also been corrected in the revised manuscript.

      The protein levels of CD36, PPARγ and β-ACTIN were examined at the same time, and we have revised the manuscript accordingly; please check the revised Figure 7B and 7C.

      (2) The authors can do a better job in explaining the logic for how they came up with the potential function of each component of the signaling cascade. Snhg3 is down-regulated by HFD. However, the evidence presented indicates its involvement in promoting steatosis. In Figure 1C, one would expect PPARg expression to be up-regulated (when Snhg3 was down-regulated). If so, the physiological observation conflicts with the proposed mechanism. In addition, SND1 is known to regulate RNA/miRNA processing. How do the authors rule out this potential mechanism? How about the hosting snoRNA, Snord17? Is it involved in the progression of MASLD?

      We thank the reviewer for the detailed comment. Our results showed that the expression of Snhg3 was decreased in DIO mice, which led us to speculate that the downregulation of Snhg3 in DIO mice might be a protective stress response to a high nutritional state, but the specific details need to be clarified. This is probably similar to fibroblast growth factor 21 (FGF21) and growth differentiation factor 15 (GDF15), whose endogenous expression and circulating levels are elevated in obese humans and mice despite their beneficial effects on obesity and related metabolic complications (Keipert and Ost, 2021). Although FGF21 can be induced by oxidative stress and be activated in obese mice and in NASH patients, elevated FGF21 paradoxically protects against oxidative stress and reduces hepatic steatosis (Tillman and Rolph, 2020). We have added this content to the Discussion section; please check the second paragraph on p12.

      SND1 has multiple roles through associating with different types of RNA molecules, including mRNA, miRNA, circRNA, dsRNA and lncRNA. SND1 binds negative-sense SARS-CoV-2 RNA and promotes viral RNA synthesis (Schmidt et al., 2023). SND1 is also involved in hypoxia by negatively regulating hypoxia-related miRNAs (Saarikettu et al., 2023). Furthermore, a recent study revealed that lncRNA SNAI3-AS1 can competitively bind to SND1 and perturb the m6A-dependent recognition of Nrf2 mRNA 3'UTR by SND1, thereby reducing the mRNA stability of Nrf2 (Zheng et al., 2023). Huang et al. also reported that circMETTL9 can directly bind to and increase the expression of SND1 in astrocytes, leading to enhanced neuroinflammation (Huang et al., 2023). However, whether SND1/lncRNA-Snhg3 has a histone methylation-independent role in hepatic lipid metabolism needs to be further investigated. We also discuss this limitation in the manuscript; please refer to the third paragraph of the Discussion section on p17.

      Snhg3 serves as the host gene for the intronic U17 snoRNAs, which are H/ACA snoRNAs. A previous study found that a cholesterol trafficking phenotype was not due to reduced Snhg3 expression, but rather to haploinsufficiency of U17 snoRNA. Upregulation of hypoxia-upregulated mitochondrial movement regulator (HUMMR) in U17 snoRNA-deficient cells promoted the formation of ER-mitochondrial contacts, resulting in decreased cholesterol esterification and facilitating cholesterol trafficking to mitochondria (Jinn et al., 2015). Additionally, disruption of U17 snoRNA caused resistance to lipid-induced cell death and general oxidative stress in cultured cells. Furthermore, knockdown of U17 snoRNA in vivo protected against hepatic steatosis and lipid-induced oxidative stress and inflammation (Sletten et al., 2021). We determined the expression of hepatic U17 snoRNA and its effect on SND1 and PPARγ. The results showed that the expression of U17 snoRNA was decreased in the liver of DIO Snhg3-HKO mice and unchanged in the liver of DIO Snhg3-HKI mice, but overexpression of U17 snoRNA had no effect on the expression of SND1 and PPARγ (figure supplement 5A-C), indicating that Snhg3-induced hepatic steatosis was independent of U17 snoRNA. We also discuss this in the revised manuscript; please refer to the Discussion section on p15.

      References

      HUANG, C., SUN, L., XIAO, C., YOU, W., SUN, L., WANG, S., ZHANG, Z. & LIU, S. 2023. Circular RNA METTL9 contributes to neuroinflammation following traumatic brain injury by complexing with astrocytic SND1. J Neuroinflammation, 20, 39. DOI:10.1186/s12974-023-02716-x, PMID:36803376

      JINN, S., BRANDIS, K. A., REN, A., CHACKO, A., DUDLEY-RUCKER, N., GALE, S. E., SIDHU, R., FUJIWARA, H., JIANG, H., OLSEN, B. N., SCHAFFER, J. E. & ORY, D. S. 2015. snoRNA U17 regulates cellular cholesterol trafficking. Cell Metab, 21, 855-67. DOI:10.1016/j.cmet.2015.04.010, PMID:25980348

      KEIPERT, S. & OST, M. 2021. Stress-induced FGF21 and GDF15 in obesity and obesity resistance. Trends Endocrinol Metab, 32, 904-915. DOI:10.1016/j.tem.2021.08.008, PMID:34526227

      SAARIKETTU, J., LEHMUSVAARA, S., PESU, M., JUNTTILA, I., PARTANEN, J., SIPILA, P., POUTANEN, M., YANG, J., HAIKARAINEN, T. & SILVENNOINEN, O. 2023. The RNA-binding protein Snd1/Tudor-SN regulates hypoxia-responsive gene expression. FASEB Bioadv, 5, 183-198. DOI:10.1096/fba.2022-00115, PMID:37151849

      SCHMIDT, N., GANSKIH, S., WEI, Y., GABEL, A., ZIELINSKI, S., KESHISHIAN, H., LAREAU, C. A., ZIMMERMANN, L., MAKROCZYOVA, J., PEARCE, C., KREY, K., HENNIG, T., STEGMAIER, S., MOYON, L., HORLACHER, M., WERNER, S., AYDIN, J., OLGUIN-NAVA, M., POTABATTULA, R., KIBE, A., DOLKEN, L., SMYTH, R. P., CALISKAN, N., MARSICO, A., KREMPL, C., BODEM, J., PICHLMAIR, A., CARR, S. A., CHLANDA, P., ERHARD, F. & MUNSCHAUER, M. 2023. SND1 binds SARS-CoV-2 negative-sense RNA and promotes viral RNA synthesis through NSP9. Cell, 186, 4834-4850 e23. DOI:10.1016/j.cell.2023.09.002, PMID:37794589

      SLETTEN, A. C., DAVIDSON, J. W., YAGABASAN, B., MOORES, S., SCHWAIGER-HABER, M., FUJIWARA, H., GALE, S., JIANG, X., SIDHU, R., GELMAN, S. J., ZHAO, S., PATTI, G. J., ORY, D. S. & SCHAFFER, J. E. 2021. Loss of SNORA73 reprograms cellular metabolism and protects against steatohepatitis. Nat Commun, 12, 5214. DOI:10.1038/s41467-021-25457-y, PMID:34471131

      TILLMAN, E. J. & ROLPH, T. 2020. FGF21: An Emerging Therapeutic Target for Non-Alcoholic Steatohepatitis and Related Metabolic Diseases. Front Endocrinol (Lausanne), 11, 601290. DOI:10.3389/fendo.2020.601290, PMID:33381084

      ZHENG, J., ZHANG, Q., ZHAO, Z., QIU, Y., ZHOU, Y., WU, Z., JIANG, C., WANG, X. & JIANG, X. 2023. Epigenetically silenced lncRNA SNAI3-AS1 promotes ferroptosis in glioma via perturbing the m(6)A-dependent recognition of Nrf2 mRNA mediated by SND1. J Exp Clin Cancer Res, 42, 127. DOI:10.1186/s13046-023-02684-3, PMID:37202791

      (3) The role of PPARg in fatty liver diseases might be a rodent-specific phenomenon. PPARg agonist treatment in humans may actually reduce ectopic fat deposition by increasing fat storage in adipose tissues. The relevance of the findings to human diseases should be discussed.

      We thank the reviewer for the detailed comment. As a transcription regulator of Cd36 and Cidea/c, PPARγ is well known to play major adipogenic and lipogenic roles in adipose tissue. Although the expression of PPARγ in the liver is very low under healthy conditions, induced expression of PPARγ in both hepatocytes and non-parenchymal cells (Kupffer cells, immune cells, and hepatic stellate cells (HSCs)) in the liver has a crucial role in the pathophysiology of MASLD (Lee et al., 2023b, Chen et al., 2023, Gross et al., 2017). The activation of PPARγ in the liver induces the adipogenic program to store fatty acids in lipid droplets as observed in adipocytes (Lee et al., 2018). Moreover, the inactivation of liver PPARγ abolished the rosiglitazone-induced increase in hepatic TG and improved hepatic steatosis in lipoatrophic AZIP mice (Gavrilova et al., 2003). Apart from promoting lipogenesis, PPARγ also has a crucial function in improving inflammation and fibrosis (Chen et al., 2023). Furthermore, there is a strong correlation between the onset of hepatic steatosis and hepatocyte-specific PPARγ expression. Clinical trials have also indicated that increased insulin resistance and hepatic PPARγ expression were associated with NASH scores in some obese patients (Lee et al., 2023a, Mukherjee et al., 2022). Even though PPARγ’s primary function is in adipose tissue, patients with MASLD have much higher hepatic expression levels of PPARγ, reflecting the fact that PPARγ plays different roles in different tissues and cell types (Mukherjee et al., 2022). Consistent with the studies mentioned above, our results also hint at the importance of PPARγ in the pathophysiology of MASLD. Snhg3 deficiency or overexpression respectively decreased or increased hepatic PPARγ. Moreover, administration of the PPARγ antagonist T0070907 mitigated the increase in hepatic Cd36 and Cidea/c and improved Snhg3-induced hepatic steatosis. However, conflicting findings suggest that the expression of hepatic PPARγ is not increased as steatosis develops in humans and in clinical studies, and that administration of PPARγ agonists did not aggravate liver steatosis (Gross et al., 2017). Thus, understanding how hepatic PPARγ expression is regulated may provide a new avenue to prevent and treat MASLD (Lee et al., 2018). We also discuss this in the revised manuscript; please refer to the first paragraph of the Discussion section on p13.

      References

      CHEN, H., TAN, H., WAN, J., ZENG, Y., WANG, J., WANG, H. & LU, X. 2023. PPAR-gamma signaling in nonalcoholic fatty liver disease: Pathogenesis and therapeutic targets. Pharmacol Ther, 245, 108391. DOI:10.1016/j.pharmthera.2023.108391, PMID:36963510

      GAVRILOVA, O., HALUZIK, M., MATSUSUE, K., CUTSON, J. J., JOHNSON, L., DIETZ, K. R., NICOL, C. J., VINSON, C., GONZALEZ, F. J. & REITMAN, M. L. 2003. Liver peroxisome proliferator-activated receptor gamma contributes to hepatic steatosis, triglyceride clearance, and regulation of body fat mass. J Biol Chem, 278, 34268-76. DOI:10.1074/jbc.M300043200, PMID:12805374

      GROSS, B., PAWLAK, M., LEFEBVRE, P. & STAELS, B. 2017. PPARs in obesity-induced T2DM, dyslipidaemia and NAFLD. Nat Rev Endocrinol, 13, 36-49. DOI:10.1038/nrendo.2016.135, PMID:27636730

      LEE, S. M., MURATALLA, J., KARIMI, S., DIAZ-RUIZ, A., FRUTOS, M. D., GUZMAN, G., RAMOS-MOLINA, B. & CORDOBA-CHACON, J. 2023a. Hepatocyte PPARgamma contributes to the progression of non-alcoholic steatohepatitis in male and female obese mice. Cell Mol Life Sci, 80, 39. DOI:10.1007/s00018-022-04629-z, PMID:36629912

      LEE, S. M., MURATALLA, J., SIERRA-CRUZ, M. & CORDOBA-CHACON, J. 2023b. Role of hepatic peroxisome proliferator-activated receptor gamma in non-alcoholic fatty liver disease. J Endocrinol, 257. DOI:10.1530/JOE-22-0155, PMID:36688873

      LEE, Y. K., PARK, J. E., LEE, M. & HARDWICK, J. P. 2018. Hepatic lipid homeostasis by peroxisome proliferator-activated receptor gamma 2. Liver Res, 2, 209-215. DOI:10.1016/j.livres.2018.12.001, PMID:31245168

      MUKHERJEE, A. G., WANJARI, U. R., GOPALAKRISHNAN, A. V., KATTURAJAN, R., KANNAMPUZHA, S., MURALI, R., NAMACHIVAYAM, A., GANESAN, R., RENU, K., DEY, A., VELLINGIRI, B. & PRINCE, S. E. 2022. Exploring the Regulatory Role of ncRNA in NAFLD: A Particular Focus on PPARs. Cells, 11. DOI:10.3390/cells11243959, PMID:36552725

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      As a general strategy for the revision, I would advise the authors to focus on strengthening the analysis of the liver with the two most important figures being Figure 2 and Figure 3. The mechanism as it stands is problematic which reduces the impact of the animal studies despite substantial efforts from the authors. Consider removing or toning down some of the studies focused on mechanisms in the nucleus, including changing the title.

      We thank the reviewer for the detailed comment. In this study, hepatocyte-specific Snhg3 deficiency decreased body and liver weight, alleviated hepatic steatosis and promoted hepatic fatty acid metabolism in DIO mice, whereas overexpression induced the opposite effect. We investigated the hepatic differentially expressed genes (DEGs) between the DIO Snhg3-HKI and control WT mice using RNA-Seq, and GSEA revealed that Snhg3 exerts a global effect on the expression of genes involved in fatty acid metabolism (Figure 4B). RT-qPCR analysis confirmed that the hepatic expression levels of some genes involved in fatty acid metabolism, including Cd36, Cidea/c and Scd1/2, were upregulated in Snhg3-HKO mice and were downregulated in Snhg3-HKI mice compared to the controls (Figure 4C). Moreover, deficiency and overexpression of Snhg3 respectively decreased and increased the expression of profibrotic genes, such as Col1a1 and Col1a2, but had no effect on pro-inflammatory factors, including Tgfβ1, Tnfα, Il6 and Il1β (figure supplement 3A and B). These results indicate that Snhg3 is involved in hepatic steatosis through the regulation of fatty acid metabolism. Furthermore, PPARγ was selected for study of its role in Snhg3-induced hepatic steatosis through integrated analysis of the CUT&Tag-Seq, ATAC-Seq and RNA-Seq data. Finally, inhibition of PPARγ with T0070907 alleviated the Snhg3-induced increases in Cd36 and Cidea/c and improved Snhg3-aggravated hepatic steatosis. In summary, we confirmed that SND1/H3K27me3/PPARγ is partially responsible for Snhg3-induced hepatic steatosis. As the reviewer suggested, we replaced the title with “LncRNA-Snhg3 Aggravates Hepatic Steatosis via PPARγ Signaling”.

      (1) How is steatosis changing in the liver? Is this due to a change in fatty acid uptake, lipogenesis/synthesis, beta-oxidation, triglyceride secretion, etc.? The analysis in Figures 2 and 3 is mostly focused on metabolic chamber studies, which seem distracting, particularly in the absence of a mechanism and given a liver-specific perturbation. The authors should use a combination of targeted gene expression, protein blots, and lipid flux measurements to provide better insights here. The histology in Figure 2H suggests a very dramatic effect but does not match the lipid measurements in 2I.

      We thank the reviewer for the detailed comment. The pathogenesis of MASLD has not been entirely elucidated. Multiple factors, such as genetic and epigenetic factors, nutritional factors, insulin resistance, lipotoxicity, the microbiome, fibrogenesis and hormones secreted from adipose tissue, are recognized to be involved in the development and progression of MASLD (Buzzetti et al., 2016, Lee et al., 2017, Rada et al., 2020, Sakurai et al., 2021, Friedman et al., 2018). In this study, we investigated the hepatic differentially expressed genes (DEGs) between the DIO Snhg3-HKI and control WT mice using RNA-Seq, and GSEA revealed that Snhg3 exerts a global effect on the expression of genes involved in fatty acid metabolism (Figure 4B). We validated the expression of some DEGs involved in fatty acid metabolism by RT-qPCR. The results showed that the hepatic expression levels of some of these genes, including Cd36, Cidea/c and Scd1/2, were upregulated in Snhg3-HKO mice and downregulated in Snhg3-HKI mice compared to their respective controls (Figure 4C). Additionally, we re-analyzed the metabolic chamber data using CalR, and the results showed no obvious differences in heat production, total oxygen consumption, carbon dioxide production or RER between DIO Snhg3-HKO or DIO Snhg3-HKI mice and the corresponding control mice (figure supplement 1C and 2C). Unfortunately, we were unable to measure lipid flux owing to limited experimental conditions. Nevertheless, our results indicate that Snhg3 is involved in hepatic steatosis by regulating fatty acid metabolism. Please see the first paragraph on p8.

      Additionally, we determined the hepatic TC levels in another batch of DIO Snhg3-HKO and control mice and found no difference in hepatic TC (shown below) between DIO Snhg3-HKO and control mice fed an HFD for 18 weeks. An apparent difference in TC may require a longer period of high-fat diet feeding.

      Author response image 1.

      Hepatic TC contents in DIO Snhg3-Flox and Snhg3-HKO mice.

      References

      BUZZETTI, E., PINZANI, M. & TSOCHATZIS, E. A. 2016. The multiple-hit pathogenesis of non-alcoholic fatty liver disease (NAFLD). Metabolism, 65, 1038-48. DOI:10.1016/j.metabol.2015.12.012, PMID:26823198

      FRIEDMAN, S. L., NEUSCHWANDER-TETRI, B. A., RINELLA, M. & SANYAL, A. J. 2018. Mechanisms of NAFLD development and therapeutic strategies. Nat Med, 24, 908-922. DOI:10.1038/s41591-018-0104-9, PMID:29967350

      LEE, J., KIM, Y., FRISO, S. & CHOI, S. W. 2017. Epigenetics in non-alcoholic fatty liver disease. Mol Aspects Med, 54, 78-88. DOI:10.1016/j.mam.2016.11.008, PMID:27889327

      RADA, P., GONZALEZ-RODRIGUEZ, A., GARCIA-MONZON, C. & VALVERDE, A. M. 2020. Understanding lipotoxicity in NAFLD pathogenesis: is CD36 a key driver? Cell Death Dis, 11, 802. DOI:10.1038/s41419-020-03003-w, PMID:32978374

      SAKURAI, Y., KUBOTA, N., YAMAUCHI, T. & KADOWAKI, T. 2021. Role of Insulin Resistance in MAFLD. Int J Mol Sci, 22. DOI:10.3390/ijms22084156, PMID:33923817

      (2) Throughout the manuscript the authors make claims about liver disease models, but this is not well supported since markers of advanced liver disease are not examined. The authors should stain and show expression for fibrosis and inflammation.

      We thank the reviewer for the detailed comment. Metabolic dysfunction-associated steatotic liver disease (MASLD) is characterized by excess liver fat in the absence of significant alcohol consumption. It can progress from simple steatosis to metabolic dysfunction-associated steatohepatitis (MASH) and fibrosis and eventually to chronic progressive diseases such as cirrhosis, end-stage liver failure, and hepatocellular carcinoma (Loomba et al., 2021). As the reviewer suggested, we examined the effect of Snhg3 on liver fibrosis and inflammation. No hepatic fibrosis phenotype was observed in Snhg3-HKO or Snhg3-HKI mice (figure supplement 1D and 2D). Moreover, deficiency and overexpression of Snhg3 respectively decreased and increased the expression of profibrotic genes, such as collagen type I alpha 1/2 (Col1a1 and Col1a2), but had no effect on pro-inflammatory factors including Tgf-β, Tnf-α, Il-6 and Il-1β (figure supplement 3A and 3B). Inflammation is an absolute requirement for fibrosis because factors from injured hepatocytes alone are not sufficient to directly activate HSCs and lead to fibrosis (Kisseleva and Brenner, 2021). Additionally, previous studies indicated that exposure to an HFD for more than 24 weeks causes less severe fibrosis (Alshawsh et al., 2022). In future, the effect of Snhg3 on hepatic fibrosis in mice needs to be elucidated by prolonged high-fat feeding or by adopting methionine- and choline-deficient (MCD) diet feeding. Please see the second paragraph of the Discussion section on p13.

      References

      ALSHAWSH, M. A., ALSALAHI, A., ALSHEHADE, S. A., SAGHIR, S. A. M., AHMEDA, A. F., AL ZARZOUR, R. H. & MAHMOUD, A. M. 2022. A Comparison of the Gene Expression Profiles of Non-Alcoholic Fatty Liver Disease between Animal Models of a High-Fat Diet and Methionine-Choline-Deficient Diet. Molecules, 27. DOI:10.3390/molecules27030858, PMID:35164140

      KISSELEVA, T. & BRENNER, D. 2021. Molecular and cellular mechanisms of liver fibrosis and its regression. Nat Rev Gastroenterol Hepatol, 18, 151-166. DOI:10.1038/s41575-020-00372-7, PMID:33128017

      LOOMBA, R., FRIEDMAN, S. L. & SHULMAN, G. I. 2021. Mechanisms and disease consequences of nonalcoholic fatty liver disease. Cell, 184, 2537-2564. DOI:10.1016/j.cell.2021.04.015, PMID:33989548

      (3) Publicly available datasets show that PPARG protein is not expressed in the liver (Science 2015 347(6220):1260419, PMID: 25613900). Are the authors sure this is not an effect on another PPAR isoform like alpha? ChIP and RNA-seq pathway readouts do not distinguish between different isoforms.

      We thank the reviewer for the detailed comment. As a transcriptional regulator of Cd36 and Cidea/c, PPARγ is well known to play major adipogenic and lipogenic roles in adipose tissue. Although the expression of PPARγ in the liver is very low under healthy conditions, induced expression of PPARγ in both hepatocytes and non-parenchymal cells (Kupffer cells, immune cells, and hepatic stellate cells (HSCs)) in the liver has a crucial role in the pathophysiology of MASLD (Lee et al., 2023b, Chen et al., 2023, Gross et al., 2017). The activation of PPARγ in the liver induces the adipogenic program to store fatty acids in lipid droplets, as observed in adipocytes (Lee et al., 2018). Moreover, the inactivation of liver PPARγ abolished the rosiglitazone-induced increase in hepatic TG and improved hepatic steatosis in lipoatrophic AZIP mice (Gavrilova et al., 2003). Apart from promoting lipogenesis, PPARγ also has a crucial function in improving inflammation and fibrosis (Chen et al., 2023). Furthermore, there is a strong correlation between the onset of hepatic steatosis and hepatocyte-specific PPARγ expression. Clinical trials have also indicated that increased insulin resistance and hepatic PPARγ expression were associated with NASH scores in some obese patients (Lee et al., 2023a, Mukherjee et al., 2022). Even though PPARγ’s primary function is in adipose tissue, patients with MASLD have much higher hepatic expression levels of PPARγ, reflecting the fact that PPARγ plays different roles in different tissues and cell types (Mukherjee et al., 2022). Consistent with the studies mentioned above, our results also hint at the importance of PPARγ in the pathophysiology of MASLD. Snhg3 deficiency or overexpression respectively decreased or increased hepatic PPARγ. Moreover, administration of the PPARγ antagonist T0070907 mitigated the hepatic Cd36 and Cidea/c increase and improved Snhg3-induced hepatic steatosis. However, conflicting findings suggest that the expression of hepatic PPARγ is not increased as steatosis develops in humans and that, in clinical studies, PPARγ agonist administration did not aggravate liver steatosis (Gross et al., 2017). Thus, understanding how hepatic PPARγ expression is regulated may provide a new avenue to prevent and treat MASLD (Lee et al., 2018). We have also discussed this in the revised manuscript; please refer to the first paragraph of the Discussion section on p13.

      PPARα, which is most highly expressed in the liver, transcriptionally regulates lipid catabolism by controlling the expression of genes mediating triglyceride hydrolysis, fatty acid transport, and β-oxidation. Activators of PPARα decrease plasma triglycerides by inhibiting their synthesis and accelerating their hydrolysis (Chen et al., 2023). Mice with deletion of the Pparα gene exhibited more hepatic steatosis under HFD induction. As the reviewer suggested, we investigated the effect of Snhg3 on Pparα expression. Neither deficiency nor overexpression of Snhg3 affected the mRNA level of Pparα (shown below), indicating that Snhg3-induced lipid accumulation is independent of PPARα. Additionally, the exon, upstream 2k, 5’-UTR and intron regions of Pparγ, but not Pparα, were enriched with the H3K27me3 mark (fold_enrichment = 4.15697) in the liver of DIO Snhg3-HKO mice in the CUT&Tag assay (table supplement 8), which was further confirmed by ChIP (Figure 6F and G). Therefore, we chose PPARγ to study its role in Snhg3-induced hepatic steatosis through integrated analysis of the CUT&Tag-Seq, ATAC-Seq and RNA-Seq data.

      Author response image 2.

      The mRNA levels of hepatic Pparα expression in DIO Snhg3-HKO mice and Snhg3-HKI mice compared to the controls.

      References

      CHEN, H., TAN, H., WAN, J., ZENG, Y., WANG, J., WANG, H. & LU, X. 2023. PPAR-gamma signaling in nonalcoholic fatty liver disease: Pathogenesis and therapeutic targets. Pharmacol Ther, 245, 108391. DOI:10.1016/j.pharmthera.2023.108391, PMID:36963510

      GAVRILOVA, O., HALUZIK, M., MATSUSUE, K., CUTSON, J. J., JOHNSON, L., DIETZ, K. R., NICOL, C. J., VINSON, C., GONZALEZ, F. J. & REITMAN, M. L. 2003. Liver peroxisome proliferator-activated receptor gamma contributes to hepatic steatosis, triglyceride clearance, and regulation of body fat mass. J Biol Chem, 278, 34268-76. DOI:10.1074/jbc.M300043200, PMID:12805374

      GROSS, B., PAWLAK, M., LEFEBVRE, P. & STAELS, B. 2017. PPARs in obesity-induced T2DM, dyslipidaemia and NAFLD. Nat Rev Endocrinol, 13, 36-49. DOI:10.1038/nrendo.2016.135, PMID:27636730

      LEE, S. M., MURATALLA, J., KARIMI, S., DIAZ-RUIZ, A., FRUTOS, M. D., GUZMAN, G., RAMOS-MOLINA, B. & CORDOBA-CHACON, J. 2023a. Hepatocyte PPARgamma contributes to the progression of non-alcoholic steatohepatitis in male and female obese mice. Cell Mol Life Sci, 80, 39. DOI:10.1007/s00018-022-04629-z, PMID:36629912

      LEE, S. M., MURATALLA, J., SIERRA-CRUZ, M. & CORDOBA-CHACON, J. 2023b. Role of hepatic peroxisome proliferator-activated receptor gamma in non-alcoholic fatty liver disease. J Endocrinol, 257. DOI:10.1530/JOE-22-0155, PMID:36688873

      LEE, Y. K., PARK, J. E., LEE, M. & HARDWICK, J. P. 2018. Hepatic lipid homeostasis by peroxisome proliferator-activated receptor gamma 2. Liver Res, 2, 209-215. DOI:10.1016/j.livres.2018.12.001, PMID:31245168

      MUKHERJEE, A. G., WANJARI, U. R., GOPALAKRISHNAN, A. V., KATTURAJAN, R., KANNAMPUZHA, S., MURALI, R., NAMACHIVAYAM, A., GANESAN, R., RENU, K., DEY, A., VELLINGIRI, B. & PRINCE, S. E. 2022. Exploring the Regulatory Role of ncRNA in NAFLD: A Particular Focus on PPARs. Cells, 11. DOI:10.3390/cells11243959, PMID:36552725

      (4) Previous work suggests that SNHG3 regulates its neighboring gene MED18 which is an important regulator of global transcription. Could some of the observed effects be due to changes in MED18 or other neighboring genes?

      We thank the reviewer for the detailed comment. Previous work suggested that human SNHG3 promotes progression of gastric cancer by regulating methylation of the neighboring MED18 gene (Xuan and Wang, 2019). Here, we studied the effect of mouse Snhg3 on Med18 and found that Snhg3 had no effect on the mRNA level of Med18 (shown below). We also tested the effect of mouse Snhg3 on another neighboring gene, regulator of chromosome condensation 1 (Rcc1). Although deficiency of Snhg3 reduced the mRNA level of Rcc1, overexpression of Snhg3 did not affect it (shown below). RCC1, the only known guanine nucleotide exchange factor in the nucleus for Ran, a nuclear Ras-like G protein, directly participates in cellular processes such as nuclear envelope formation, nucleocytoplasmic transport, and spindle formation (Ren et al., 2020). RCC1 also regulates chromatin condensation in the late S and early M phases of the cell cycle. Many studies have found that RCC1 plays an important role in tumors. Whether Rcc1 mediates the effect of Snhg3 on MASLD remains to be investigated.

      Author response image 3.

      The mRNA levels of hepatic Rcc1 and Med18 expression in DIO Snhg3-HKO mice and Snhg3-HKI mice compared to the controls.

      References

      REN, X., JIANG, K. & ZHANG, F. 2020. The Multifaceted Roles of RCC1 in Tumorigenesis. Front Mol Biosci, 7, 225. DOI:10.3389/fmolb.2020.00225, PMID:33102517

      XUAN, Y. & WANG, Y. 2019. Long non-coding RNA SNHG3 promotes progression of gastric cancer by regulating neighboring MED18 gene methylation. Cell Death Dis, 10, 694. DOI:10.1038/s41419-019-1940-3, PMID:31534128

      (5) The claim that Snhg3 regulates SND1 protein stability seems subtle. There is data inconsistency between different panels regarding this regulation including Figure 5I, Figure 6A, and Figure 7E. In addition, is ubiquitination happening in the nucleus where Snhg3 is expressed?

      We thank the reviewer for the detailed comment. The induction of SND1 expression by Snhg3 was confirmed by western blotting; please see Figure 5I, Figure 6A, Figure 7E and the corresponding primary data. Additionally, the effect of Snhg3 on SND1 protein stability appeared subtle, suggesting that there may be other mechanisms by which Snhg3 promotes SND1, such as riboregulation. We have added this to the Discussion section; please see the second paragraph on p16.

      Additionally, we did not determine the sites at which SND1 is modified by ubiquitination. Our results showed that Snhg3 was localized mainly in the nucleus (Figure 1D) and that Snhg3 also promoted the nuclear localization of SND1 (Figure 5O). We have revised the diagram of Snhg3 action in Figure 8G. Please see the revised manuscript.

      (6) The authors show that the loss of Snhg3 changes the global H3K27me3 level. Few enzymes modify H3K27me3 levels. Did the authors check for an interaction between EZH2, Jmjd3, UTX, and Snhg3/SND1?

      We thank the reviewer for the detailed comment. It is crucial to ascertain whether SND1 itself functions as a new demethylase or whether it influences other H3K27-modifying enzymes, such as the demethylases Jmjd3 and ubiquitously transcribed tetratricopeptide repeat on chromosome X (UTX) and the methyltransferase enhancer of zeste homolog 2 (EZH2). The precise mechanism by which SND1 regulates H3K27me3 is still unclear and hence requires further investigation. We have added this limitation to the Discussion section; please see the third paragraph on p17.

      (7) Can the authors speculate if the findings related to Snhg3/SND1 extend to humans?

      We thank the reviewer for the detailed comment. Since the sequence of Snhg3 is not conserved between mice and humans, the findings in this manuscript may not be applicable to humans; this needs to be explored further.

      (8) As a general rule, the figures are too small or difficult to read, with limited details in the figure legends, which limits evaluation. For example, the labels in Figure 1B and almost all of Figure 4 cannot be read. In Figure 2, the snapshots of mice or livers cannot be seen. What figure supports the claim that Snhg3-KI livers are more 'hyper-accessible'? Can the authors clarify what Figure 4H is referring to?

      We thank the reviewer for the detailed comment. We have provided high-quality figures in our revised manuscript.

      The ‘hyper-accessible’ state in the liver of Snhg3-HKI mice was inferred from the differentially accessible regions (DARs): 4305 DARs were more accessible in Snhg3-HKI mice, whereas only 2505 DARs were more accessible in control mice (please refer to table supplement 3).
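      The arithmetic behind this comparison can be sketched in a few lines. This is only an illustration using the two DAR counts quoted above; the variable names are hypothetical, and no statistical test from the manuscript is implied.

```python
# DAR counts quoted in the response above (table supplement 3 of the manuscript).
more_accessible_in_hki = 4305
more_accessible_in_control = 2505

# Total number of differentially accessible regions across both directions.
total_dars = more_accessible_in_hki + more_accessible_in_control

# Share of DARs that are more open in Snhg3-HKI liver, and the HKI:control ratio.
fraction_hki = more_accessible_in_hki / total_dars
ratio_hki_to_control = more_accessible_in_hki / more_accessible_in_control

print(f"total DARs: {total_dars}")
print(f"fraction more accessible in Snhg3-HKI: {fraction_hki:.3f}")
print(f"HKI-to-control ratio: {ratio_hki_to_control:.2f}")
```

      Roughly 63% of the DARs are more accessible in Snhg3-HKI liver, which is the asymmetry the response describes as 'hyper-accessible'.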

      The Cd36 heatmap in Figure 4H was derived from hepatic RNA-seq of DIO Snhg3-HKI and control WT mice. To avoid ambiguity, we have removed it.

      (9) The authors stated that upon Snhg3 knockout, more genes are upregulated (1028) than downregulated (365). This description does not match Figure 4A. It seems that in Figure 4A there are equal numbers of up- and downregulated genes.

      We thank the reviewer for the detailed question. We apologize for this mistake and have corrected it.

      (10) Provide a schematic of the knockout and KI strategy in the supplement.

      We thank the reviewer for the detailed comment. We have included the knockout and KI strategy in figure supplement 1A and B, and 2A.

      Reviewer #2 (Recommendations For The Authors):

      (1) Metabolic cage data need to be reanalyzed with CalR (particularly when the body weights are significantly different).

      We thank the reviewer for the detailed comment. We reanalyzed the metabolic cage data using CalR (Mina et al., 2018). The results showed no obvious differences in heat production, total oxygen consumption, carbon dioxide production or the respiratory exchange ratio (RER) between DIO Snhg3-HKO and control mice. Similarly, there were no differences in heat production, total oxygen consumption, carbon dioxide production or RER between DIO Snhg3-HKI mice and WT mice. Please see figure supplement 1C and 2C, and Mouse Calorimetry in Materials and Methods.

      Reference

      MINA, A. I., LECLAIR, R. A., LECLAIR, K. B., COHEN, D. E., LANTIER, L. & BANKS, A. S. 2018. CalR: A Web-Based Analysis Tool for Indirect Calorimetry Experiments. Cell Metab, 28, 656-666 e1. DOI:10.1016/j.cmet.2018.06.019, PMID:30017358

      (2) ITT in Figure 2F should also be presented as % of the initial glucose level, which would reveal that there is no difference between WT and KO.

      We thank the reviewer for the detailed comment. We repeated the ITT experiment and included the new data in the revised manuscript; please see Figure 2C.
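      For reference, the normalization the reviewer asks for (glucose as a percentage of the time-0 value) can be sketched as follows. The function name and the glucose readings are hypothetical and purely illustrative; they are not data from the manuscript.

```python
def percent_of_baseline(glucose_series):
    """Express each glucose reading as a percentage of the time-0 reading."""
    baseline = glucose_series[0]
    if baseline <= 0:
        raise ValueError("baseline glucose must be positive")
    return [100.0 * value / baseline for value in glucose_series]


# Hypothetical readings (mg/dL) at 0, 15, 30, 60 and 90 min after insulin injection.
example_mouse = [200.0, 150.0, 100.0, 120.0, 160.0]
normalized = percent_of_baseline(example_mouse)
print(normalized)
```

      Presenting curves this way removes differences in starting glucose between groups, so any remaining separation reflects the response to insulin itself.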

      (3) The fasting glucose results are inconsistent between ITT and GTT. Is there any difference in fasting glucose?

      We thank the reviewer for the questions. The difference between GTT and ITT was due to the different fasting times: mice were fasted for 6 h before the ITT and for 16 h before the GTT. Snhg3 does not appear to affect fasting glucose levels after either the short or the long fast; please refer to Figures 2C and 3C.

    1. Author Response:

      The following is the authors' response to the original reviews.

      Reviewer #1 (Public Review):

      [...] The experiments are well-designed and carefully conducted. The conclusions of this work are in general well supported by the data. There are a couple of points that need to be addressed or tested.

      1) It is unclear how LC phasic stimulation used in this study gates cortical plasticity without altering cellular responses (at least at the calcium imaging level). As the authors mentioned that Polack et al 2013 showed a significant effect of NE blockers in membrane potential and firing rate in V1 layer2/3 neurons during locomotion, it would be useful to test the effect of LC silencing (coupled to mismatch training) on both cellular response and cortical plasticity or applying NE antagonists in V1 in addition to LC optical stimulation. The latter experiment will also address which neuromodulator mediates plasticity, given that LC could co-release other modulators such as dopamine (Takeuchi et al. 2016 and Kempadoo et al. 2016). LC silencing experiment would establish a causal effect more convincingly than the activation experiment.

      Regarding the question of how phasic stimulation could alter plasticity without affecting the response sizes or activity in general, we believe there are possibilities supported by previous literature. It has been shown that catecholamines can gate plasticity by acting on eligibility traces at synapses (He et al., 2015; Hong et al., 2022). In addition, all catecholamine receptors are metabotropic and influence intracellular signaling cascades, e.g., via adenylyl cyclase and phospholipases. Catecholamines can gate LTP and LTD via these signaling pathways in vitro (Seol et al., 2007). Both of these influences on plasticity at the molecular level do not necessitate or predict an effect on calcium activity levels. We have now expanded on this in the discussion of the revised manuscript.

      While a loss of function experiment could add additional corroborating evidence that LC output is required for the plasticity seen, we did not perform loss-of-function experiments for three reasons:

      1. The effects of artificial activity changes around the physiological set point are likely not linear for increases and decreases. The problem with a loss-of-function experiment here is that neuromodulators like noradrenaline affect general aspects of neuronal function. This is apparent in Polack et al., 2013: during the pharmacological blocking experiment, the membrane hyperpolarizes, membrane variance becomes very low, and the cells are effectively silenced (Figure 7 of (Polack et al., 2013)), demonstrating an immediate impact on neuronal function when noradrenaline receptor activation is presumably taken below physiological/waking levels. In light of this, if we reduce LC output/noradrenergic receptor activation and find that plasticity is prevented, this could be the result of a direct influence on the plasticity process, or the result of a disruption of another aspect of neuronal function, like synaptic transmission or spiking. We would therefore challenge the reviewer’s statement that a loss-of-function experiment would establish a causal effect more convincingly than the gain-of-function experiment that we performed.

      2. The loss-of-function experiment is technically more difficult both in implementation and interpretation. Control mice show no sign of plasticity in locomotion modulation index (LMI) on the 10-minute timescale (Figure 4J), thus we would not expect to see any effect when blocking plasticity in this experiment. We would need to use dark-rearing and coupled-training of mice in the VR across development to elicit the relevant plasticity ((Attinger et al., 2017); manuscript Figure 5). We would then need to silence LC activity across days of VR experience to prevent the expected physiological levels of plasticity. Applying NE antagonists in V1 over the entire period of development seems very difficult. This would leave optogenetically silencing axons locally, which in addition to the problems of doing this acutely (Mahn et al., 2016; Raimondo et al., 2012), has not been demonstrated to work chronically over the duration of weeks. Thus, a negative result in this experiment will be difficult to interpret, and likely uninformative: We will not be able to distinguish whether the experimental approach did not work, or whether local LC silencing does nothing to plasticity.

      Note that pharmacologically blocking noradrenaline receptors during LC stimulation in the plasticity experiment is also particularly challenging: they would need to be blocked throughout the entire 15 minute duration of the experiment with no changes in concentration of antagonist between the ‘before’ and ‘after’ phases, since the block itself is likely to affect the response size, as seen in Polack et al., 2013, creating a confound for plasticity-related changes in response size. Thus, we make no claim about which particular neuromodulator released by the LC is causing the plasticity.

      3. There are several loss-of-function experiments reported in the literature using different developmental plasticity paradigms alongside pharmacological or genetic knockout approaches. These experiments show that chronic suppression of noradrenergic receptor activity prevents ocular dominance plasticity and auditory plasticity (Kasamatsu and Pettigrew, 1976; Shepard et al., 2015). Almost absent from the literature, however, are convincing gain-of-function plasticity experiments.

      Overall, we feel that loss-of-function experiments may be a possible direction for future work but, given the technical difficulty and, in our opinion, the limited benefit that these experiments would provide in light of the evidence already provided for the claims we make, we have chosen not to perform these experiments at this time. Note that we already discuss some of the problems with loss-of-function experiments in the discussion.

      2) The cortical responses to NE often exhibit an inverted U-curve, with higher or lower doses of NE showing more inhibitory effects. It is unclear how responses induced by optical LC stimulation compare or interact with the physiological activation of the LC during the mismatch. Since the authors only used one frequency stimulation pattern, some discussion or additional tests with a frequency range would be helpful.

      This is correct; we do not know how the artificial activation of LC axons relates to physiological activation, e.g. under mismatch. The stimulation strength is intrinsically consistent in our study in the sense that the stimulation level used to test for changes in neuronal activity is similar to that used to probe for plasticity effects. We suspect that the artificial activation results in much stronger LC activity than seen during mismatch responses, given that no sign of the plasticity in LMI seen in high ChrimsonR mice occurs in low ChrimsonR or control mice (Figure 4J). Note that our conclusions do not rely on the assumption that the stimulation is matched to physiological levels of activation during the visuomotor mismatches that we assayed. The hypothesis that we put forward is that increasing levels of activation of the LC (reflecting increasing rates or amplitude of prediction errors across the brain) will result in increased levels of plasticity. We know that LC axons can reach levels of activity far higher than those seen during visuomotor mismatches, for instance during air puff responses, which constitute a form of positive prediction error (unexpected tactile input) (Figures 2C and S1C). The visuomotor mismatches in this study were used only to demonstrate that LC activity is consistent with prediction error signaling. We have now expanded on these points in the discussion as suggested.

      Reviewer #1 (Recommendations For The Authors):

      1) In Figure 3E, there is a rebound response of ChrimsonR at the offset of the mismatch. Is that common? If so, what does it mean? If not, maybe replace it with a more common example trace.

      This trace in fact represents the population average, so this offset response (or ‘rebound’) reflects a significant component of the population response to visual flow onset (i.e., mismatch offset), only under conditions of LC stimulation. See our response to reviewer 2 concerning this element of the response.

      2) It would be helpful to have some discussions on how a mismatch signal reaches and activates LC from cortical neurons.

      We have now added a short segment on this to the discussion.

      Reviewer #2 (Public Review):

      [...] The study provides very compelling data on a timely and fascinating topic in neuroscience. The authors carefully designed experiments and corresponding controls to exclude any confounding factors in the interpretation of neuronal activity in LC axons and cortical neurons. The quality of the data and the rigor of the analysis are important strengths of the study. I believe this study will have an important contribution to the field of system neuroscience by shedding new light on the role of a key neuromodulator. The results provide strong support for the claims of the study. However, I also believe that some results could have been strengthened by providing additional analyses and experimental controls. These points are discussed below.

      Calcium signals in LC axons tend to respond with pupil dilation, air puffs, and locomotion as the authors reported. A more quantitative analysis such as a GLM model could help understand the relative contribution (and temporal relationship) of these variables in explaining calcium signals. This could also help compare signals obtained in the sensory and motor cortical domains. Indeed, the comparison in Figure 2 seems a bit incomplete since only "posterior versus anterior" comparisons have been performed and not within-group comparisons. I believe it is hard to properly assess differences or similarities between calcium signal amplitude measured in different mice and cranial windows as they are subject to important variability (caused by different levels of viral expression for instance). The authors should at the very least provide a full statistical comparison between/within groups through a GLM model that would provide a more systematic quantification.

      To provide a more detailed comparison of responses, we have expanded on the analysis in Figure 2 to include comparative heatmaps from anterior and posterior imaging sites, as well as statistical comparisons of the response curves as a function of time. This shows how similar the responses are in the two regions.

      Beyond this, we are not sure how a regression analysis (GLM or otherwise) would help support the main point we aim to make here. The responses in anterior and posterior regions are similar, which supports a broadcast model of LC function in the cortex, rather than specialized routing of prediction error signals to cortical areas. Linear contributions of the signals are apparent from the stimulus triggered responses, and while non-linear interactions between the different variables are certainly an interesting question, they go beyond the point we aim to make and would also not be captured by a regression analysis. In addition, we have refined our language replacing descriptors of ‘the same’ or ‘indistinguishable’ between the two regions with ‘similar’, to highlight that while we find no evidence of a difference, our analysis does not cover all possible differences that might appear when looking at non-linear interactions.

      Previous studies using stimulations of the locus coeruleus or local iontophoresis of norepinephrine in sensory cortices have shown robust response modulations (see McBurney-Lin et al., 2019, https://doi.org/10.1016/j.neubiorev.2019.06.009 for a review). The weak modulations observed in this study seem at odds with these reports. Given that the density of ChrimsonR-expressing axons varies across mice and that there are no direct measurements of their activation (besides pupil dilation), it is difficult to appreciate how they impact the local network. How does the density of ChrimsonR-expressing axons compare to the actual density of LC axons in V1? The authors could further discuss this point.

      In terms of estimating the percentage of cortical axons labelled based on our axon density measurements: we refer to cortical LC axonal immunostaining in the literature to make this comparison.

      In motor cortex, an average axon density of 0.07 µm/µm² has been reported (Yin et al., 2021), and 0.09 µm/µm² in prefrontal cortex (Sakakibara et al., 2021). Density of LC axons varies by cortical area, with higher density in motor cortex and medial areas than sensory areas (Agster et al., 2013): V1 axon density is roughly 70% of that in cingulate cortex (adjacent to motor and prefrontal cortices) (Nomura et al., 2014). So, we approximate a maximum average axon density in V1 of approximately 0.056 µm/µm².

      Because these published measurements were made from images taken of tissue volumes with larger z-depth (~ 10 µm) than our reported measurements (~ 1 µm), they appear much larger than the ranges reported in our manuscript (0.002 to 0.007 µm/µm²). We repeated the measurements in our data using images of volumes with 10 µm z-depth, and find that the density of labelled axons in high ChrimsonR-expressing mice ranges from 0.012 to 0.039 µm/µm². This corresponds to between 20% and 70% of the density we would expect based on previous work. Note that this is potentially a significant underestimate and should therefore be treated as a lower bound: analyses in the literature use images from immunostaining, where the signal-to-background ratio is very high. In contrast, we did not transcardially perfuse our mice, leading to significant background (especially in the pia/L1, where axon density is high (Agster et al., 2013; Nomura et al., 2014)), and the intensity of the tdTomato is not especially high. We are therefore likely missing some narrow, dim, and superficial fibers in our analysis.
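
      The arithmetic behind these percentages can be retraced with a short sketch. It assumes that the reference density is the plain mean of the two published values (0.07 and 0.09), which reproduces the 0.056 figure quoted above; the variable names are illustrative only and are not taken from the manuscript.

```python
# Published LC axon densities (µm of axon per µm² of tissue), as quoted above.
motor_density = 0.07       # motor cortex (Yin et al., 2021)
prefrontal_density = 0.09  # prefrontal cortex (Sakakibara et al., 2021)

# Assumption: the reference density is the plain mean of the two values.
reference_density = (motor_density + prefrontal_density) / 2

# V1 density is taken as roughly 70% of the reference (Nomura et al., 2014).
v1_fraction = 0.70
expected_v1_density = v1_fraction * reference_density

# Range measured in high ChrimsonR-expressing mice at 10 µm z-depth.
measured_low, measured_high = 0.012, 0.039

# Measured density expressed as a fraction of the expected V1 density.
fraction_low = measured_low / expected_v1_density
fraction_high = measured_high / expected_v1_density

print(f"expected V1 density: {expected_v1_density:.3f}")
print(f"labelled fraction: {fraction_low:.0%} to {fraction_high:.0%}")
```

      Under this assumption the two fractions come out at about 21% and 70%, in line with the rounded range of 20% to 70% given above.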

      We also can quantify how our variance in axonal labelling affects our results: For the dataset in Figure 3, there doesn’t appear to be any correlation between the level of expression and the effect of stimulating the axons on the mismatch or visual flow responses for each animal (Author response image 1), while there is a significant correlation between the level of expression and the pupil dilation, consistent with the dataset shown in Figure 4. Thus, even in the most highly expressing mice, there is no clear effect on average response size at the level of the population. We have added these correlations to the revised manuscript as a new Figure S3.

      **Author response image 1.**

      Correlations between axon density and average effect of laser stimulation on stimulus responses and pupil dilation (data from manuscript Figure 3). Grey points show control mice, blue points show low ChrimsonR-expressing mice, and purple points show high ChrimsonR-expressing mice.

      To our knowledge, there has not yet been any similar experiment reported utilizing local LC axonal optogenetic stimulation while recording cortical responses, so when comparing our results to those in the literature, there are several important methodological differences to keep in mind. The vast majority of the work demonstrating an effect of LC output/noradrenaline on responses in the cortex has been done using unit recordings, and while results are mixed, these have most often demonstrated a suppressive effect on spontaneous and/or evoked activity in the cortex (McBurney-Lin et al., 2019). In contrast to these studies, we do not see a major effect of LC stimulation either on baseline or evoked calcium activity (Figure 3), and, if anything, we see a minor potentiation of transient visual flow onset responses (see also Author response image 2). There could be several reasons why our stimulation does not have the same effect as these older studies:

      1. Recording location: Unit recordings are often very biased toward highly active neurons (Margrie et al., 2002) and deeper layers of the cortex, while we are imaging from layer 2/3 – a layer notorious for sparse activity. In one of the few papers to record from superficial layers, it has been demonstrated that deeper layers in V1 are affected differently by LC stimulation methods compared to more superficial ones (Sato et al., 1989), with suppression more common in superficial layers. Thus, some differences between our results and those in the majority of the literature could simply be due to recording depth and the sampling bias of unit recordings.

      2. Stimulation method: Most previous studies have manipulated LC output/noradrenaline levels by either iontophoretically applying noradrenergic receptor agonists, or by electrically stimulating the LC. Arguably, even though our optogenetic stimulation is still artificial, it represents a more physiologically relevant activation compared to iontophoresis, since the LC releases a number of neuromodulators including dopamine, and these will be released in a more physiological manner in the spatial domain and in terms of neuromodulator concentration. Electrical stimulation of the LC as used by previous studies differs from our optogenetic method in that LC axons will be stimulated across much wider regions of the brain (affecting both the cortex and many of its inputs), and it is not clear whether the cause of cortical response changes is in cortex or subcortical. In addition, electrical LC stimulation is not cell type specific.

      3. Temporal features of stimulation: Few previous studies had the same level of temporal control over manipulating LC output that we had using optogenetics. Given that electrical stimulation generates electrical artifacts, coincident stimulation during the stimulus was not used in previous studies. Instead, the LC is often repeatedly or tonically stimulated, sometimes for many seconds, prior to the stimulus being presented. Iontophoresis also does not have the same temporal specificity and will lead to tonically raised receptor activity over a time course determined by washout times.

      4. State specificity: Most previous studies have been performed under anesthesia – which is known to impact noradrenaline levels and LC activity (Müller et al., 2011). Thus, the acute effects of LC stimulation are likely not comparable between anesthetized and awake animals.

      Due to these differences, it is hard to infer why our results differ compared to other papers. The study with the most similar methodology to ours is (Vazey et al., 2018), which used optogenetic stimulation directly into the mouse LC while recording spiking in deep layers of the somatosensory cortex with extracellular electrodes. Like us, they found that phasic optogenetic stimulation alone did not alter baseline spiking activity (Figure 2F of Vazey et al., 2018), and they found that in layers 5 and 6, short latency transient responses to foot touch were potentiated and recruited by simultaneous LC stimulation. While this finding appears more overt than the small modulations we see, it is qualitatively not so dissimilar from our finding that transient responses appear to be slightly potentiated when visual flow begins (Author response image 2). Differences in the degree of the effect may be due to differences in the layers recorded, the proportion of the LC recruited, or the fact anesthesia was used in Vazey et al., 2018.

      Note that we only used one set of stimulation parameters for optogenetic stimulation, and it is always possible that using different parameters would result in different effects. We have now added a discussion on the topic to the revised manuscript.

      In the analysis performed in Figure 3, it seems that red light stimulations used to drive ChrimsonR also have an indirect impact on V1 neurons through the retina. Indeed, figure 3D shows a similar response profile for ChrimsonR and control with calcium signals increasing at laser onset (ON response) and offset (OFF response). With that in mind, it is hard to interpret the results shown in Figure 3E-F without seeing the average calcium time course for Control mice. Are the responses following visual flow caused by LC activation or additional visual inputs? The authors should provide additional information to clarify this result.

      This is a good point. When we plot the average difference between the stimulus response alone and the optogenetic stimulation + stimulus response, we do indeed find that there is a transient increase in response at the visual flow onset (and the offset of mismatch, which is where visual flow resumes), and this is only seen in ChrimsonR-expressing mice (Author response image 2). We therefore believe that these enhanced transients at visual flow onset could be due to the effect of ChrimsonR stimulation, and indeed previous studies have shown that LC stimulation can reduce the onset latency and latency jitter of afferent-evoked activity (Devilbiss and Waterhouse, 2004; Lecas, 2004), an effect which could mediate the differences we see. We have added this analysis to the revised manuscript in Figure 3 and added discussion accordingly.

      **Author response image 2.**

      Difference in responses to visual stimuli caused by optogenetic stimulation, calculated by subtracting the average response when no laser was presented from the average response when the laser was presented concurrent with the visual stimulus. Pink traces show the response difference for ChrimsonR-expressing mice, and grey shows the same for control mice. Black blocks below indicate consecutive timepoints after stimulation showing a significant difference between ChrimsonR and control as determined by hierarchical bootstrapping (p<0.05).
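
      As a rough illustration of the subtraction described in this caption, the sketch below averages trials timepoint by timepoint and subtracts the no-laser average from the laser average. The traces are invented toy values and the function names are hypothetical; the hierarchical bootstrap used to mark significant timepoints is not reproduced here.

```python
from statistics import mean

def average_trace(trials):
    """Average equal-length trial traces timepoint by timepoint."""
    return [mean(samples) for samples in zip(*trials)]

def response_difference(laser_trials, no_laser_trials):
    """Laser-on average minus laser-off average, per timepoint."""
    laser_avg = average_trace(laser_trials)
    no_laser_avg = average_trace(no_laser_trials)
    return [on - off for on, off in zip(laser_avg, no_laser_avg)]

# Invented toy traces (arbitrary units), two trials of three timepoints each.
no_laser = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]
laser = [[0.0, 3.0, 2.0], [0.0, 5.0, 2.0]]

difference = response_difference(laser, no_laser)
print(difference)
```

      In this toy case the difference is non-zero only at the second timepoint, which is the kind of transient the pink traces are meant to expose.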

      Some aspects of the described plasticity process remained unanswered. It is not clear over which time scale the locomotion modulation index changes and how many optogenetic stimulations are necessary or sufficient to saturate this index. Some of these questions could be addressed with the dataset of Figure 3 by measuring this index over different epochs of the imaging session (from early to late) to estimate the dynamics of the ongoing plasticity process (in comparison to control mice). Also, is there any behavioural consequence of plasticity/update of functional representation in V1? If plasticity gated by repeated LC activations reproduced visuomotor responses observed in mice that were exposed to visual stimulation only in the virtual environment, then I would expect to see a change in the locomotion behaviour (such as a change in speed distribution) as a result of the repeated LC stimulation. This would provide more compelling evidence for changes in internal models for visuomotor coupling in relation to its behavioural relevance. An experiment that could confirm the existence of the LC-gated learning process would be to change the gain of the visuomotor coupling and see if mice adapt faster with LC optogenetic activation compared to control mice with no ChrimsonR expression. Authors should discuss how they imagine the behavioural manifestation of this artificially-induced learning process in V1.

      Regarding the question of plasticity time course: Unfortunately, owing to the paradigm used in Figure 3, the time course of the plasticity will not be quantifiable from this experiment. This is because in the first 10 minutes, the mouse is in closed loop visuomotor VR experience, undergoing optogenetic stimulation (this is the time period in which we record mismatches). We then shift to the open loop session to quantify the effect of optogenetic stimulation on visual flow responses. Since the plasticity is presumably happening during the closed loop phase, and we have no read-out of the plasticity during this phase (we do not have uncoupled visual flow onsets to quantify LMI in closed loop), it is not possible to track the plasticity over time.

      Regarding the behavioral relevance of the plasticity: The type of plasticity we describe here is consistent with predictive, visuomotor plasticity in the form of a learned suppression of responses to self-generated visual feedback during movement. Intuitive purposes of this type of plasticity would be 1) to enable better detection of external moving objects by suppressing the predictable (and therefore redundant) self-generated visual motion and 2) to better detect changes in the geometry of the world (near objects have a larger visuomotor gain than far objects). In our paradigm, we have no intuitive read-out of the mouse’s perception of these things, and it is not clear to us that they would be reflected in locomotion speed, which does not differ between groups (manuscript Figure S5). Instead, we would need to turn to other paradigms for a clear behavioral read-out of predictive forms of sensorimotor learning: for instance, sensorimotor learning paradigms in the VR (such as those used in (Heindorf et al., 2018; Leinweber et al., 2017)), or novel paradigms that reinforce the mouse for detecting changes in the gain of the VR, or moving objects in the VR, using LC stimulation during the learning phase to assess if this improves acquisition. This is certainly a direction for future work. In the case of a positive effect, however, the link between the precise form of plasticity we quantify in this manuscript and the effect on the behavior would remain indirect, so we see this as beyond the scope of the manuscript. We have added a discussion on this topic to the revised manuscript.

      Finally, control mice used as a comparison to mice expressing ChrimsonR in Figure 3 were not injected with a control viral vector expressing a fluorescent protein alone. Although it is unlikely that the procedure of injection could cause the results observed, it would have been a better control for the interpretation of the results.

      We agree that this indeed would have been a better control. However, we believe that this is fortunately not a major problem for the interpretation of our results for two reasons:

      1. The control and ChrimsonR expressing mice do not show major differences in the effect of optogenetic LC stimulation at the level of the calcium responses for all results in Figure 3, with the exception of the locomotion modulation indices (Figure 3I). Therefore, in terms of response size, there is no major effect compared to control animals that could be caused by the injection procedure, apart from marginally increased transient responses to visual flow onset – and, as the reviewer notes, it is difficult to see how the injection procedure would cause this effect.

      2. The effect on locomotion modulation index (Figure 3I) was replicated with another set of mice in Figure 4C, for which we did have a form of injected control (‘Low ChrimsonR’), which did not show the same plasticity in locomotion modulation index (Figure 4E). We therefore know that at least the injection itself is not resulting in the plasticity effect seen.

      Reviewer #2 (Recommendations For The Authors):

      In experiments where axonal imaging was performed on LC axons, the authors should indicate the number of mice used in addition to the number of Field of View (FoV). Indeed, samples (FoVs) are not guaranteed to be independent as LC axons can span large cortical areas and the same axon can end up in different FoVs. Please provide statistics across mice/cranial windows to confirm the robustness of the results.

      All information requested regarding animal numbers in axonal imaging is provided in the statistical Table S1, as well as in the text and figures (e.g., Figure 2A). Samples will be independent in time (as different FoVs were imaged on different days), but it is indeed possible that axon segments from different FoVs within an animal come from the same axon.

      Averaging across animals greatly reduces statistical power. We have therefore implemented hierarchical bootstrapping instead: bootstrapping first occurs at the level of animal and then at the level of FoV. All p-values that were reported as significant in the manuscript remained significant with this test, with no major reduction in significance level, with the exception of Figure S2B, where statistical significance was lost (p = 0.04 with rank-sum test, p = 0.07 with hierarchical bootstrapping). We therefore conclude that sampling from the same animals across days is not responsible for the significance of the results reported.
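
      For readers unfamiliar with the two-level resampling described here, a minimal sketch follows. It is not our analysis code: the nested data layout, the function names, the toy values, and the way the bootstrap distributions are compared are all assumptions made for illustration.

```python
import random

def hierarchical_bootstrap_means(data, n_boot=1000, seed=0):
    """Bootstrap means for nested data of the form {animal: {fov: [values]}}.

    Each iteration resamples animals with replacement, then resamples
    fields of view (FoVs) with replacement within each chosen animal.
    """
    rng = random.Random(seed)
    animals = list(data)
    boot_means = []
    for _ in range(n_boot):
        values = []
        for animal in rng.choices(animals, k=len(animals)):
            fovs = list(data[animal])
            for fov in rng.choices(fovs, k=len(fovs)):
                values.extend(data[animal][fov])
        boot_means.append(sum(values) / len(values))
    return boot_means

def prob_a_greater(boot_a, boot_b):
    """Fraction of paired bootstrap draws in which group A's mean exceeds B's."""
    return sum(a > b for a, b in zip(boot_a, boot_b)) / len(boot_a)

# Invented toy data: two animals per group, one or two FoVs per animal.
control = {"m1": {"f1": [0.1, 0.2], "f2": [0.0, 0.1]}, "m2": {"f1": [0.2, 0.1]}}
chrimson = {"m3": {"f1": [0.9, 1.1], "f2": [1.0, 1.2]}, "m4": {"f1": [0.8, 1.0]}}

boot_control = hierarchical_bootstrap_means(control, seed=2)
boot_chrimson = hierarchical_bootstrap_means(chrimson, seed=1)
p_greater = prob_a_greater(boot_chrimson, boot_control)
print(p_greater)
```

      Resampling animals first means that FoVs from the same animal are never treated as independent samples, which is the concern raised above; one minus `p_greater` can then be read as a one-sided p-value under this particular comparison scheme.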

      References

      Agster, K.L., Mejias-Aponte, C.A., Clark, B.D., Waterhouse, B.D., 2013. Evidence for a regional specificity in the density and distribution of noradrenergic varicosities in rat cortex. Journal of Comparative Neurology 521, 2195–2207. https://doi.org/10.1002/cne.23270

      Attinger, A., Wang, B., Keller, G.B., 2017. Visuomotor Coupling Shapes the Functional Development of Mouse Visual Cortex. Cell 169, 1291-1302.e14. https://doi.org/10.1016/j.cell.2017.05.023

      Devilbiss, D.M., Waterhouse, B.D., 2004. The Effects of Tonic Locus Ceruleus Output on Sensory-Evoked Responses of Ventral Posterior Medial Thalamic and Barrel Field Cortical Neurons in the Awake Rat. J. Neurosci. 24, 10773–10785. https://doi.org/10.1523/JNEUROSCI.1573-04.2004

      He, K., Huertas, M., Hong, S.Z., Tie, X., Hell, J.W., Shouval, H., Kirkwood, A., 2015. Distinct Eligibility Traces for LTP and LTD in Cortical Synapses. Neuron 88, 528–538. https://doi.org/10.1016/j.neuron.2015.09.037

      Heindorf, M., Arber, S., Keller, G.B., 2018. Mouse Motor Cortex Coordinates the Behavioral Response to Unpredicted Sensory Feedback. Neuron 0. https://doi.org/10.1016/j.neuron.2018.07.046

      Hong, S.Z., Mesik, L., Grossman, C.D., Cohen, J.Y., Lee, B., Severin, D., Lee, H.-K., Hell, J.W., Kirkwood, A., 2022. Norepinephrine potentiates and serotonin depresses visual cortical responses by transforming eligibility traces. Nat Commun 13, 3202. https://doi.org/10.1038/s41467-022-30827-1

      Kasamatsu, T., Pettigrew, J.D., 1976. Depletion of brain catecholamines: failure of ocular dominance shift after monocular occlusion in kittens. Science 194, 206–209. https://doi.org/10.1126/science.959850

      Lecas, J.-C., 2004. Locus coeruleus activation shortens synaptic drive while decreasing spike latency and jitter in sensorimotor cortex. Implications for neuronal integration. European Journal of Neuroscience 19, 2519–2530. https://doi.org/10.1111/j.0953-816X.2004.03341.x

      Leinweber, M., Ward, D.R., Sobczak, J.M., Attinger, A., Keller, G.B., 2017. A Sensorimotor Circuit in Mouse Cortex for Visual Flow Predictions. Neuron 95, 1420-1432.e5. https://doi.org/10.1016/j.neuron.2017.08.036

      Mahn, M., Prigge, M., Ron, S., Levy, R., Yizhar, O., 2016. Biophysical constraints of optogenetic inhibition at presynaptic terminals. Nat Neurosci 19, 554–556. https://doi.org/10.1038/nn.4266

      Margrie, T.W., Brecht, M., Sakmann, B., 2002. In vivo, low-resistance, whole-cell recordings from neurons in the anaesthetized and awake mammalian brain. Pflugers Arch. 444, 491–498. https://doi.org/10.1007/s00424-002-0831-z

      McBurney-Lin, J., Lu, J., Zuo, Y., Yang, H., 2019. Locus coeruleus-norepinephrine modulation of sensory processing and perception: A focused review. Neurosci Biobehav Rev 105, 190–199. https://doi.org/10.1016/j.neubiorev.2019.06.009

      Müller, C.P., Pum, M.E., Amato, D., Schüttler, J., Huston, J.P., De Souza Silva, M.A., 2011. The in vivo neurochemistry of the brain during general anesthesia. Journal of Neurochemistry 119, 419–446. https://doi.org/10.1111/j.1471-4159.2011.07445.x

      Nomura, S., Bouhadana, M., Morel, C., Faure, P., Cauli, B., Lambolez, B., Hepp, R., 2014. Noradrenalin and dopamine receptors both control cAMP-PKA signaling throughout the cerebral cortex. Front Cell Neurosci 8. https://doi.org/10.3389/fncel.2014.00247

      Polack, P.-O., Friedman, J., Golshani, P., 2013. Cellular mechanisms of brain-state-dependent gain modulation in visual cortex. Nat Neurosci 16, 1331–1339. https://doi.org/10.1038/nn.3464

      Raimondo, J.V., Kay, L., Ellender, T.J., Akerman, C.J., 2012. Optogenetic silencing strategies differ in their effects on inhibitory synaptic transmission. Nat Neurosci 15, 1102–1104. https://doi.org/10.1038/nn.3143

      Sakakibara, Y., Hirota, Y., Ibaraki, K., Takei, K., Chikamatsu, S., Tsubokawa, Y., Saito, T., Saido, T.C., Sekiya, M., Iijima, K.M., n.d. Widespread Reduced Density of Noradrenergic Locus Coeruleus Axons in the App Knock-In Mouse Model of Amyloid-β Amyloidosis. J Alzheimers Dis 82, 1513–1530. https://doi.org/10.3233/JAD-210385

      Sato, H., Fox, K., Daw, N.W., 1989. Effect of electrical stimulation of locus coeruleus on the activity of neurons in the cat visual cortex. Journal of Neurophysiology. https://doi.org/10.1152/jn.1989.62.4.946

      Seol, G.H., Ziburkus, J., Huang, S., Song, L., Kim, I.T., Takamiya, K., Huganir, R.L., Lee, H.-K., Kirkwood, A., 2007. Neuromodulators control the polarity of spike-timing-dependent synaptic plasticity. Neuron 55, 919–929. https://doi.org/10.1016/j.neuron.2007.08.013

      Shepard, K.N., Liles, L.C., Weinshenker, D., Liu, R.C., 2015. Norepinephrine is necessary for experience-dependent plasticity in the developing mouse auditory cortex. J Neurosci 35, 2432–2437. https://doi.org/10.1523/JNEUROSCI.0532-14.2015

      Vazey, E.M., Moorman, D.E., Aston-Jones, G., 2018. Phasic locus coeruleus activity regulates cortical encoding of salience information. Proceedings of the National Academy of Sciences 115, E9439–E9448. https://doi.org/10.1073/pnas.1803716115

      Yin, X., Jones, N., Yang, J., Asraoui, N., Mathieu, M.-E., Cai, L., Chen, S.X., 2021. Delayed motor learning in a 16p11.2 deletion mouse model of autism is rescued by locus coeruleus activation. Nat Neurosci 24, 646–657. https://doi.org/10.1038/s41593-021-00815-7

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      The authors describe a method to decouple the mechanisms supporting pancreatic progenitor self-renewal and expansion from feed-forward mechanisms promoting their differentiation. The findings are important because they have implications beyond a single subfield. The strength of evidence is solid in that the methods, data and analyses broadly support the claims with only minor weaknesses.

      We are grateful for the substantial effort that the reviewers put into reading our manuscript and providing such detailed feedback. We have strived to address, as much as possible, all comments and criticisms. Thanks to the feedback, we believe that we now have a significantly improved manuscript. Below is a point-by-point response.

      Reviewer #1 (Public Review)

      In this manuscript, the authors are developing a new protocol that aims at expanding pancreatic progenitors derived from human pluripotent stem cells under GMP-compliant conditions. The strategy is based on hypothesis-driven experiments that come from knowledge derived from pancreatic developmental biology.

      The topic is of major interest in the view of the importance of amplifying human pancreatic progenitors (both for fundamental purposes and for future clinical applications). There is indeed currently a major lack of information on efficient conditions to reach this objective, despite major recurrent efforts by the scientific community.

      Using their approach that combines stimulation of specific mitogenic pathways and inhibition of retinoic acid and specific branches of the TGF-beta and Wnt pathways, the authors claim to be able (in a highly robust and reproducible manner) to amplify the number of pancreatic progenitors (PP) 2,000-fold in 10 passages, which is really an impressive breakthrough.

      The work is globally well-performed and quite convincing. I have however some technical comments mainly related to the quantification of pancreatic progenitor amplification and to their differentiation into beta-like cells following amplification.

      We thank the reviewer for the positive assessment. Below we provide a point-by-point response to specific comments and criticisms.

      Reviewer #1 (Recommendations For The Authors)

      Figure 1:

      Panel A: What is exactly counted in Fig. 1A? Is it the number of PP (as indicated in the title) or the total number of cells? If it is PPs, was it done following PDX1/NKX6.1/SOX9 staining and FACS quantification? This question applies to a number of Figures and the authors should be clear on this point.

      We now define ‘PP cells’ as ‘PP-containing cells’ (PP cells) the first time we use the term in the RESULTS section.

      Panel D: I do not understand the source of TGFb1, GDF11, FGF18, PDGFA. Which cell type(s) express such factors in culture? I was not convinced that the signals are produced by PP and act through an autocrine loop. I have the same type of questions for the receptors: PDGFR on the second page of the results; RARs and RXRs on the third page.

      We refer to these factors/receptors as components of a tentative autocrine loop. We agree we do not prove it and we now comment on this in the discussion section.

      Figure 2:

      FACS plots are very difficult to analyze for two reasons: I do not understand the meaning of the y axes (PDX1/SOX9). Does that mean that 100% of the cells were PDX1+/SOX9+? The authors should show the separated FACS plots. More importantly, the x axes indicate that NKX6.1 FACS staining is very weak. This is very different from what can be read in publications performing the same types of experiments (publications by Millman, Otonkoski...as examples). How was quantification performed when it is so difficult to properly define positive vs negative populations? It is necessary to present proper "negative controls" for FACS experiments and to clearly indicate how positive versus negative cells were defined.

      We now explain the gating strategy better in the results section, and all controls are included in Figure S2.

      Figure 3:

      What is the exact "phenotype" of the cells that incorporated EdU: It would be really instructive to add PDX1/NKX6.1/SOX9 staining on top of EdU. I am also surprised that 20% of the cells stain positive for Annexin V. This is a huge fraction. Does that mean that many cells (20%) are dying and if the case, how amplification can take place under such deleterious conditions?

      This is an interesting mechanistic point but performing these experiments would delay the publication of the final manuscript for too long. These assays were done at p3 in order to catch CINI cells that do not expand in most cases. It is important to note that cell death also appears higher in CINI cells. It is likely that the combination of these effects results in reproducible expansion under C5. We comment on the possibilities in the discussion section.

      Figure 4:

      On FACS plots the intensity at the single cell level (see x-axis of the figure) of the NKX6.1 staining is found to increase in Fig. 4G by 50-100 folds when compared to Fig. 4E. Is it expected? This should be discussed in the text. Do the authors observe the same increase by immunocytochemistry?

      The apparent difference is actually 10-fold (from 2×10² to 2×10³). We think that the most likely reason for this apparent increase is that at p0 we typically used very few cells for the FC in order to keep as many as possible for the subsequent expansion. If we had used more, we would be able to also detect cells with higher expression. As we mention in the bioinformatics analysis, NKX6 expression does increase with passaging and therefore it is also possible that at least part of this increase is real. However, we don’t have suitable data (same number of cells analyzed at each passage) to address this in a reliable manner.

      Figure 5

      Previous data from the scientific literature indicate that in vitro, by default, PP gives rise to duct-like cells. This is a bit described in the result section and supplementary figures taking into account the expression of transcription factors. However the data are not clearly explained and described in quite a qualitative manner. They should appear in a quantitative fashion (and the main figures), adding additional duct cell markers such as Carbonic anhydrase, SPP1, CFTR, and others. I assume that the authors can easily use their transcriptomic data to produce a Figure to be described and discussed in detail.

      We think it can be misleading to use such markers (other than TFs and the latter only as a collective) because specific markers of terminal differentiation are more often than not expressed during development in multipotent progenitors, the most conspicuous example being CPA1. To illustrate the point, we used the RNA Seq data and plotted the expression values of a panel of duct genes in isolated human fetal progenitors (Ramond et al., 2017) together with their expression in p0 PP and ePP cells from all three different procedures (please see below). All raw RNA Seq data were processed together to enable direct comparison. According to the analysis of Ramond et al., the A population corresponds to MPCs, C to early endocrine progenitors (EP), D to late endocrine progenitors and, by inference and gene expression pattern, B to BPs. Expression levels of all these markers were very similar, suggesting that these markers cannot be used to distinguish between duct cells and progenitor cells. Importantly, SC-islets derived from either dPP or ePP cells express extremely low and similar levels of KRT19, a marker of duct cells. This latter information is now included in the last part of the results (Figure S7).

      Author response image 1.

      Fig. 7:

      The figure is a bit disappointing for 2 reasons. In A and B, the quality of INS, GCG, and SST staining is really poor. In E, GSIS is really difficult to interpret. They should not be presented as stimulatory indexes. The authors should present independently: INS content; INS secretion at low glucose; INS secretion at high glucose; INS secretion with KCL. Finally, the authors should indicate that glucose poorly (around 2 fold) activates insulin/C-Pept secretion in their stem-cell-derived islets.

      We disagree with the quality assessment of the immunofluorescence. Stimulation indexes are also used very widely but we now provide data for actual C-peptide secretion normalized for DNA content of the SC-islets. For technical reasons we do not have normalized C-peptide secretion for human islets. However, we provide a direct comparison to the stimulation index of human islets assayed under the same conditions (2.7 mM glucose / 16.7 mM glucose / 16.7 mM glucose + 30 mM KCl) without presenting SC-islets separately and tweaking the glucose basal (lowering) and stimulation (increasing) levels to inflate the stimulation index. This is unfortunately common. In any case, we do not claim an improvement in the differentiation conditions and our S5-S7 steps may not be optimal but this is not the subject of this work.
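
      The difference between the two ways of reporting GSIS can be illustrated with a minimal sketch. The numeric values below are hypothetical placeholders, not data from the manuscript, and the function names are illustrative; only the three assay conditions (2.7 mM glucose, 16.7 mM glucose, 16.7 mM glucose + 30 mM KCl) come from the response.

```python
def normalize_to_dna(c_peptide: float, dna: float) -> float:
    """Secreted C-peptide per unit of DNA (a proxy for cell number)."""
    if dna <= 0:
        raise ValueError("DNA content must be positive")
    return c_peptide / dna


def stimulation_index(stimulated: float, basal: float) -> float:
    """Secretion under stimulation divided by secretion at basal glucose."""
    if basal <= 0:
        raise ValueError("basal secretion must be positive")
    return stimulated / basal


# Hypothetical readings for one sample (arbitrary units), keyed by condition.
secreted = {"2.7 mM glucose": 1.0, "16.7 mM glucose": 2.0, "16.7 mM glucose + KCl": 4.0}
dna_content = 0.5  # hypothetical DNA content of the same sample

normalized = {cond: normalize_to_dna(v, dna_content) for cond, v in secreted.items()}
index_raw = stimulation_index(secreted["16.7 mM glucose"], secreted["2.7 mM glucose"])
index_norm = stimulation_index(normalized["16.7 mM glucose"], normalized["2.7 mM glucose"])

print(normalized)
print(index_raw, index_norm)
```

      Because the DNA content cancels in the ratio, the stimulation index is the same before and after normalization. It therefore says nothing about absolute secretion levels, which is why the normalized values and the index answer different questions.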

      Reviewer #2 (Public Review)

      Summary

      The paper presents a novel approach to expand iPSC-derived pdx1+/nkx6.1+ pancreas progenitors, making them potentially suitable for GMP-compatible protocols. This advancement represents a significant breakthrough for diabetes cell replacement therapies, as one of the current bottlenecks is the inability to expand PP without compromising their differentiation potential. The study employs a robust dataset and state-of-the-art methodology, unveiling crucial signaling pathways (eg TGF, Notch...) responsible for sustaining pancreas progenitors while preserving their differentiation potential in vitro.

      Strengths

      This paper has strong data, guided omics technology, clear aims, applicability to current protocols, and beneficial implications for diabetes research. The discussion on challenges adds depth to the study and encourages future research to build upon these important findings.

      We thank the reviewer for the positive assessment. Below we provide a point-by-point response to general comments and criticisms.

      Weaknesses

      The paper does have some weaknesses that could be addressed to improve its overall clarity and impact. The writing style could benefit from simplification, as certain sections are explained in a convoluted manner and difficult to follow, in some instances, redundancy is evident. Furthermore, the legends accompanying figures should be self-explanatory, ensuring that readers can easily understand the presented data without the need to be checking along the paper for information.

      We have simplified the text in several places and removed redundancies, particularly in the discussion. We revisited the figure legends and made minor corrections to increase clarity. However, regarding the figure legends, we think that adding the interpretation of the results would be redundant to the main text.

      The culture conditions employed in the study might benefit from more systematic organization and documentation, making them easier to follow.

      There is a comparative Table (Table S1) where all conditions are summarized. We refer to this Table every time that we introduce a new condition. We also have a Table (Table S4) which presents all different media and components used in the differentiation procedure.

      Another important aspect is the functionality of the expanded cells after differentiation. While the study provides valuable insights into the expansion of pancreas progenitors in vitro and does the basic tests to measure their functionality after differentiation the paper could be strengthened by exploring the behavior and efficacy of these cells deeper, and in an in vivo setting.

      This will be done in a future study where we will also introduce a number of modifications in S5-S7.

      Quantifications for immunofluorescence (IF) data should be displayed.

      We have not conducted quantifications of IFs because FC is much more objective and accurate. We have not conducted FC for CDX2 and AFP because all other data strongly favor C6 anyway. It should be noted that CDX2 and AFP expression is generally not addressed at all presumably because it raises uncomfortable questions and, to our knowledge, we are the first to address this so exhaustively.

      Some claims made in the paper may come across as somewhat speculative.

      We have now indicated so where applicable.

      Additionally, while the paper discusses the potential adaptability of the method to GMP-compatible protocols, there is limited elaboration on how this transition would occur practically or any discussion of the challenges it might entail.

      We have now added a paragraph discussing this in the discussion section.

      Reviewer #2 (Recommendations For The Authors)

      Related to Figure 1:

      • Unclear if CINI or SB431542 + CINI was used (first paragraph of results...)

      The paragraph was unclear and it is now rewritten

      • Was the differentiation to PP similar between the different attempts? A basic QC for each Stem Cell technology differentiation would be good to include.

      We added (Figure 1B) a comparison of expression data of general genes (QC) in PP cells showing very comparable patterns of expression. Some of these PP cells went on to expand and most did not but there is no apparent correlation of this with the gene expression data.

      • qPCR data - relative fold? over what condition? (indicate on axis label)

      We added a label as well as an explanation on p0 values in the figure legend

      • FGF18/ PDGFA - worth including background in pancreas development as in the other factors.

      Background information has been added

      • Bioinformatics is a bit biased with a few genes selected - what are the DEGs / top enriched pathways? Maybe worth showing a volcano plot of the DEGs for example.

      We have done all these standard analyses but we think that they did not contribute anything else useful to the study, with the exception of pointing to the finding that the TGFb pathway is negatively correlated with expansion, and this is included in the study. The ‘unbiased’ analysis that the reviewer suggests did not turn up anything else useful to exploit for the expansion. This does not mean that our approach is biased – in our view it is hypothesis-driven. As we also write in the manuscript, if in a certain pathway a key gene fails to be expressed, the pathway will not show up in any GO or GSEA analyses. However, the pathway will still be regulated. The RA and FGF18 cases clearly illustrate this. We realize that these analyses have become a standard but we think that they are not the only way to approach genomics data and these approaches did not offer much in the context of this study.

      • The E2F part is very speculative

      The pathway came up as a result of ‘unbiased’ GSEA analyses. However, we do agree and rephrased.

      • The authors claim ' the negative correlation of TGFb signalling with expansion retrospectively justifies the use of A83 '. However, p0 is not treated with A83 - how can they tell that there is a correlation between TGFb signalling and expansion?

      The correlation came from the RNA Seq data analysis during expansion. We have rephrased slightly to convey the message more clearly.

      • Typo: the TGFbeta inhibitor name is misspelled (A3801)

      Corrected

      • Page 5 - last paragraph - Table S3? (isn't it referring to S2?)

      Since Table S2 is the list of the regulated genes and S3 is the list of the regulated signaling pathway components both are relevant here, we now refer to both.

      • In the text Figure 2G should read Figure 1G (page 7, end of 1st paragraph).

      Corrected

      • 'Autocrine loop' existence – speculative

      Added the phrase ‘we speculated’. We refer to this only as a tentative interpretation. We also elaborate in the discussion now.

      Related to Figure 2:

      • I am not sure if I would refer to chemical "activation/inhibition" of pathways as 'gain/loss of function'. Maybe this term is more adequate for genetic modifications.

      For genetic manipulations, these terms are (supposed to be) accompanied by the adjective ‘genetic’ but to avoid misinterpretations we changed the terms to activation and inhibition as suggested.

      • It would be good to include a summary of the different conditions as a schematic in one of the figures, to make it very clear to the reader what the conditions are.

      We tried this in an early version of the manuscript but, in our view, it was adding complexity, rather than simplifying things. The problem is that as such the Table cannot be integrated in any figure if eg in Figure 2 it would be too early, if in Figure 4 it would be too late and so on. All conditions show up in detail in Table S1.

      • Nkx6.1 - is the image representative? It looks like Nkx6.1 decreases over the passages.

      We do mention in the text that ‘… even though expansion (in C5) appeared to somewhat reduce the number of NKX6.1+ cells’ (Figure 2E-G). As we mentioned, this was one of the reasons to continue with other conditions (C6-C8).

      • Upregulation of AFP/ CDX2 is a bit concerning - the IF for C5 p5 shows a high proportion of CDX2+ cells (Fig S2I). perhaps it would be good to quantify the IF.

      It was concerning – this is why we then tested conditions C6-8. Since it is C6 that we propose at the end, it would be, in our view, extraneous to quantify CDX2 in C5.

      • How do C5/C1/C0 compare to CINI?

      We now remind the reader in the results section that CINI was not reproducible - so any other comparison would be extraneous.

      Related to Figure 3:

      • There is a 'Lore Ipsum' label above B

      Corrected

      Related to Figure 4:

      • It is good that AFP expression is reduced at p10, but there seems to be a high proportion of AFP at p5. IF/FACS should be quantified.

      We think that this would not add significantly since there are several other criteria, particularly the increase of the PDX1+/SOX9+/NKX6.1+ that clearly show that the C6 condition is preferable. Further elaboration of C6 could use such additional criteria. We comment on CDX2 / AFP in the discussion.

      • CDX2 should be quantified by IF / FACS.

      We think that this would not add significantly since there are several other criteria, particularly the increase of the PDX1+/SOX9+/NKX6.1+ that clearly show that the C6 condition is preferable. Further elaboration of C6 could use such additional criteria. We comment on CDX2 / AFP in the discussion.

      • Karyotype analysis is good but not very precise when analyzing genetic micro alterations... what does a low-pass sequencing of the expanding lines look like? Are there any micro-deletions in the expanding lines?

      This is an unusual request. Microdeletions may occur at any point – during passaging of hPS cells, differentiation as well as expansion – but such data are so far not shown in publications, and reasonably so in our opinion. Thus, we have not done this analysis but it certainly would be appropriate in a clinical setting as part of QC.

      • Data supporting that the cells can be cryopreserved and recovered with >85% survival rate is not provided.

      We now provide data for the C6-mediated expansion (Figure 4J). The freezing procedure was developed during the time we were testing C5 and we don’t have sufficient data to show reliably the survival of the cells during C5 expansion. Thus, we have now removed the reference in the C5 part of the manuscript.

      Related to Figure 5:

      • Figure 5C - perhaps worth commenting on the different pathways that are enriched when cells undergo expansion and show some of the genes that are up/down regulated.

      This is indeed of interest but since it will not address any specific question in the context of this work (eg is the endocrine program repressed?) and since it would not be followed by additional experiments we think that it would burden the manuscript unnecessarily. The data are accessible for any type of analysis through the GEO database.

      • Figure S5D shows in vitro clustering away from in vivo PP - it would be good to explain how in vitro generated PP differs from their in vivo counterparts instead of restricting the comparison to the in vitro protocol.

      We have added a possible interpretation of this observation in the results section and discuss how one could properly go about this comparison.

      • Quantification of Fig5F should be included. Is GP2 expression detectable by IF at p5 too?

      We have quantified GP2 expression by FC at p10 but not at earlier stages. We now include the FC data in Fig5F.

      • Validation of Fig5G by qPCR would be good. PDX1 did not seem reduced by IF in Figure 4.

      The purpose of Fig5G is to compare the expression of the same genes across different expansion approaches. Therefore, in our view, qPCRs would not be appropriate since we do not have samples from the other approaches. We did not claim a reduction in PDX1 expression.

      • How can the authors explain the NGN3 expression at PP?

      In our view, differentiation is a dynamic process and not all cells are synchronized at the same cell type, this is true in vivo and in vitro. Sc-RNA Seq data indeed show a small population of cells at PP that are NEUROG3+ (our unpublished data). We have now included this in the discussion.

      Related to Figure 6:

      • How do the different lines differ? Any statistical comparison between lines?

      There is a paragraph dealing with the comparison of PP and ePP cells (p5 and p10) from different lines at the level of gene expression and the data are in Figure S6A-G. Then there is a paragraph addressing this at the level of PDX1/SOX9/NKX6.1 expression by FC. We have now expanded and rewritten the latter to include statistical comparisons across PPs from different lines at p0, p5 and p10.

      Related to Figure 7:

      • Mention the use of micropatterned wells - not really correct. They use Aggrewells; micropatterned plates are something else.

      We changed ‘micropatterned wells’ into ‘microwells’

      • Figure 7D, those are qPCR data. The label is inconsistent, why did they call it fold induction instead of fold change? Also, not sure if plotting the fold change to hPSC is the best here.

      We use fold change when comparing the expression of the same gene at different passages but fold induction when comparing to its expression in hPS cells. We made sure it is also explained in the figure legends.

      • Absolute values should be shown for the GSIS to determine basal insulin secretion. Also, sequential stimulation to address if the cells are able to respond to multiple glucose stimulations.

      We include now the secreted amounts of human C-peptide under the different conditions (Figure S7) normalized for cell numbers using their DNA content for the normalization. The many parameters we have used suggest that dPP and ePP SC-islets are very similar. If we were claiming a better S5-S7 procedure, such an assay would have been necessary but in this context, we think it is not absolutely necessary.

      • In vivo data would have strengthened the story. It is not clear if, in vivo, the cells will behave as the nonexpanded iPSC-derived beta cells.

      We agree and these studies are under way but we do not expect to complete them soon. We feel that it is important that this work appears sooner rather than later.

      Reviewer #3 (Public Review)

      Summary:

      In this work, Jarc et al. describe a method to decouple the mechanisms supporting progenitor self-renewal and expansion from feed-forward mechanisms promoting their differentiation.

      The authors aimed at expanding pancreatic progenitor (PP) cells, strictly characterized as PDX1+/SOX9+/NKX6.1+ cells, for several rounds. This required finding the best cell culture conditions that allow sustaining PP cell proliferation along cell passages, while avoiding their further differentiation. They achieve this by comparing the transcriptome of PP cells that can be expanded for several passages against the transcriptome of unexpanded (just differentiated) PP cells.

      The optimized culture conditions enabled the selection of PDX1+/SOX9+/NKX6.1+ PP cells and their consistent, 2000-fold, expansion over ten passages and 40-45 days. Transcriptome analyses confirmed the stabilization of PP identity and the effective suppression of differentiation. These optimized culture conditions consisted of substituting the Vitamin A containing B27 supplement with a B27 formulation devoid of vitamin A (to avoid retinoic acid (RA) signaling from an autocrine feed-forward loop), substituting A83-01 with the ALK5 II inhibitor (ALK5i II) that targets primarily ALK5, supplementation of medium with FGF18 (in addition to FGF2) and the canonical Wnt inhibitor IWR-1, and cell culture on vitronectin-N (VTN-N) as a substrate instead of Matrigel.
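
      To put the reported expansion in perspective, the sketch below converts the totals quoted above (2000-fold, ten passages, 40-45 days) into a per-passage fold and an apparent doubling time. It assumes constant exponential growth across all passages, which the summary does not state; the function names are illustrative.

```python
import math


def per_passage_fold(total_fold: float, passages: int) -> float:
    """Average fold expansion per passage, assuming each passage contributes equally."""
    return total_fold ** (1.0 / passages)


def doubling_time_days(total_fold: float, days: float) -> float:
    """Apparent population doubling time under constant exponential growth."""
    return days * math.log(2) / math.log(total_fold)


total_fold = 2000  # reported cumulative expansion
passages = 10      # reported number of passages

print(round(per_passage_fold(total_fold, passages), 2))
for days in (40, 45):  # reported duration range
    print(days, round(doubling_time_days(total_fold, days), 2))
```

      Under these assumptions the cells slightly more than double at each passage, with an apparent doubling time of roughly 3.6 to 4.1 days. Any cell loss at passaging would mean the true division rate is faster than this net figure.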

      Strengths:

      The strength of this work relies on a clever approach to identify cell culture modifications that allow expansion of PP cells (once differentiated) while maintaining, if not reinforcing, PP cell identity. Along the work, it is emphasized that PP cell identity is associated with the co-expression of PDX1, SOX9, and NKX6.1. The optimized protocol is unique (among the other datasets used in the comparison shown here) in inducing a strong upregulation of GP2, a unique marker of human fetal pancreas progenitors. Importantly GP2+ enriched hPS cell-derived PP cells are more efficiently differentiating into pancreatic endocrine cells (Aghazadeh et al., 2022; Ameri et al., 2017).

      The unlimited expansion of PP cells reported here would allow scaling-up the generation of beta cells, for the cell therapy of diabetes, by eliminating a source of variability derived from the number of differentiation procedures to be carried out when starting at the hPS cell stage each time. The approach presented here would allow the selection of the most optimally differentiated PP cell population for subsequent expansion and storage. Among other conditions optimized, the authors report a role for Vitamin A in activating retinoic acid signaling in an autocrine feed-forward loop, and the supplementation with FGF18 to reinforce FGF2 signaling.

      This is a relevant topic in the field of research, and some of the cell culture conditions reported here for PP expansion might have important implications in cell therapy approaches. Thus, the approach and results presented in this study could be of interest to researchers working in the field of in vitro pancreatic beta cell differentiation from hPSCs. Table S1 and Table S4 are clearly detailed and extremely instrumental to this aim.

      We thank the reviewer for the positive assessment. Below we provide a point-by-point response to general comments and criticisms.

      Weaknesses

      The authors strictly define PP cells as PDX1+/SOX9+/NKX6.1+ cells, and this phenotype was convincingly characterized by immunofluorescence, RT-qPCR, and FACS analysis along the work. However, broadly defined PDX1+/SOX9+/NKX6.1+ could include pancreatic multipotent progenitor cells (MPC, defined as PDX1+/SOX9+/NKX6.1+/PTF1A+ cells) or pancreatic bipotent progenitors (BP, defined as PDX1+/SOX9+/NKX6.1+/PTF1A-) cells. It has been indeed reported that Nkx6.1/Nkx6.2 and Ptf1a function as antagonistic lineage determinants in MPC (Schaffer, A.E. et al. PLoS Genet 9, e1003274, 2013), and that the Nkx6/Ptf1a switch only operates during a critical competence window when progenitors are still multipotent and can be uncoupled from cell differentiation. It would be important to define whether culturing PDX1+/SOX9+/NKX6.1+ PP (as defined in this work) in the best conditions allowing cell expansion is reinforcing either an MPC or BP phenotype. Data from Figure S2A (last paragraph of page 7) suggests that PTF1A expression is decreased in C5 culture conditions, thus more homogeneously keeping BP cells in this media composition. However, on page 15, 2nd paragraph it is stated that "the strong upregulation of NKX6.2 in our procedure suggested that our ePP cells may have retracted to an earlier PP stage". Evaluating the co-expression of the previously selected markers with PTF1A (or CPA2), or the more homogeneous expression of novel BP markers described, such as DCDC2A (Scavuzzo et al. Nat Commun 9, 3356, 2018), in the different culture conditions assayed would shed more light on this relevant aspect.

      This is certainly an interesting point. The RNA Seq data suggest that ePP cells resemble BP cells rather than MPCs and that this occurs during expansion. We have now added a new paragraph in the results section to illustrate this and added graphs of CPA2, PTF1A and DCDC2A expression during expansion in Figure 5, S5 as well as data in Table S5. In summary, we favor the interpretation that expanded cells are close but not identical to the BP identity and refer to that in the discussion. We have also amended the statement on page 15 that "the strong upregulation of NKX6.2 in our procedure suggested that our ePP cells may have retracted to an earlier PP stage".

      In line with the previous comment, it would be extremely insightful if the authors could characterize or at least discuss a potential role for YAP underlying the mechanistic effects observed after culturing PP in different media compositions. It is well known that the nuclear localization of the co-activator YAP broadly promotes cell proliferation, and it is a key regulator of organ growth during development. Importantly in this context, it has been reported that TEAD and YAP regulate the enhancer network of human embryonic pancreatic progenitors and disruption of this interaction arrests the growth of the embryonic pancreas (Cebola, I. et al. Nat Cell Biol 17, 615-26, 2015). More recently, it has also been shown that a cell-extrinsic and intrinsic mechanotransduction pathway mediated by YAP acts as gatekeeper in the fate decisions of BP in the developing pancreas, whereby nuclear YAP in BPs allows proliferation in an uncommitted fate, while YAP silencing induces EP commitment (Mamidi, A. et al. Nature 564, 114-118, 2018; Rosado-Olivieri et al. Nature Communications 10, 1464, 2019). This mechanism was further exploited recently to improve the in vitro pancreatic beta cell differentiation protocol (Hogrebe et al., Nature Protocols 16, 4109-4143, 2021; Hogrebe et al, Nature Biotechnology 38, 460-470, 2020). Thus, YAP in the context of the findings described in this work could be a key player underlying the proliferation vs differentiation decisions in PP.

      We do refer to these publications now and refer to the YAP pathway in the introduction and results sections as well as in the discussion. We have not investigated more because the kinetics of the different components of the pathway are complex and do not give an indication of whether the pathway becomes more or less active – please see below.

      Author response image 2.

      Regarding the improvements made in the PP cell culture medium composition to allow expansion while avoiding differentiation, some of the claims should be better discussed and contextualized with current state-of-the-art differentiation protocols. As an example, the use of ALK5 II inhibitor (ALK5i II) has been reported to induce EP commitment from PP, while RA was used to induce PP commitment from the primitive gut tube cell stage in recently reported in vitro differentiation protocols (Hogrebe et al., Nature Protocols 16, 4109-4143, 2021; Rosado-Olivieri et al. Nature Communications 10, 1464, 2019). In this context, and to the authors' knowledge, is Vitamin A (triggering autocrine RA signaling) usually included in the basal media formulations used in other recently reported state-of-the-art protocols? If so, at which stages? Would it be advisable to remove it?

      These points and our views are now included in the discussion

      In this line also, the supplementation of cell culture media with the canonical Wnt inhibitor IWR-1 is used in this work to allow the expansion of PP while avoiding differentiation. A role for Wnt pathway inhibition during endocrine differentiation using IWR-1 has been previously reported (Sharon et al. Cell Reports 27, 2281-2291.e5, 2019). In that work, Wnt inhibition in vitro causes an increase in the proportion of differentiated endocrine cells. It would be advisable to discuss these previous findings with the results presented in the current work. Could Wnt inhibition have different effects depending on the differential modulation of the other signaling pathways?

      These points are now included in the discussion together with the points above

      Reviewer #3 (Recommendations For The Authors)

      Recommendations for improving the writing and presentation and minor comments on the text and figures:

      • In the Introduction (page 3, line 1) it is stated: "Diabetes is a global epidemic affecting > 9% of the global population and its two main forms result from .....". The authors could rephrase/remove "global" repeated twice.

      Corrected

      • On page 4 of the introduction, in the context of "Unlimited expansion of PP cells in vitro will require disentangling differentiation signals from proliferation/maintenance signals. Several pathways have been implicated in these processes..." the authors are advised to consider mentioning the YAP mediated mechanisms as another key aspect underlying MPC phenotype (Cebola, I. et al. Nat Cell Biol 17, 615-26, 2015) and the BP to endocrine progenitor (EP) commitment (Mamidi, A. et al. Nature 564, 114-118, 2018; Rosado-Olivieri et al. Nature Communications 10, 1464, 2019). This should be better discussed in the context of the Weaknesses mentioned in the Public Review. It would be worth considering adding effectors and other molecules involved in YAP and Hippo pathway signaling to Table S3.

      We have added the role of the Hippo/YAP pathway in the introduction and mentioned in the results the finding that components of the pathway are generally not regulated except two that are now added in Table S3

      • In page 4, paragraph 3, near "and SB431542, another general (ALK4/5/7) TGFβ inhibitor", consider removing "another". SB431542 is the same inhibitor mentioned in the other protocols at the beginning of the paragraph.

      The paragraph is rewritten because it was not clear – we used A83-01 and not SB431542. Other approaches had used SB431542.

      • Page 5, Table S2 is cited after Table S3, please consider reordering.

      In fact, both S2 and S3 are relevant there, therefore we quote both now.

      • Page 8, 2nd paragraph, near "Expression of both AFP and CDX2 increased transiently upon expansion, at p5 (Figure S2H-J)." How do you explain results in FigS2C, D and FigS2E (AFP/CDX2)? RT-qPCR data does not suggest transient downregulation.

      AFP and CDX2 were – wrongly – italicized in the quoted passage. Therefore, in one case we refer to the protein and in the other to the transcript levels. We corrected and added the qualifier ‘appeared’. The difference is most likely due to translational regulation but we did not elaborate since we do not know. In any case, we have used the, less favorable but more robust, gene expression levels as the main criterion.

      • Page 9, end of 2nd paragraph, Figure 5A is cited but it looks like this should be Figure 4A.

      Corrected

      • Page 9, 3rd paragraph, when stating "C5 ePP cells of the same passage no..." please replace "no" with a number or a suitable abbreviation.

      Corrected

      • Page 9, 3rd paragraph. Expressing the values in the Y axis in a consistent manner for FigS2B-D and FigS4A would make a comparison easier.

      We strive to keep sections autonomous so that the reader would not have to flip between figures and sections – this is why we think that figure S4A is preferable as it is; it is a direct comparison of C6 to C5 for the different markers and has the additional advantage that one needs not to include p0 levels.

      • Page 9, 3rd paragraph. Green dots in FigS4A stand for p5 cells? if so, shouldn't these average 1 for all assayed genes?

      No, because the baseline (average 1) is the C5 expression at the corresponding passage no. We changed the y-axis label, hopefully it is clearer now.

      • Page 10 3rd paragraph, please include color labels in Fig. 5G.

      The different colors here correspond to the different expansion procedures that are compared. The samples are labelled on the x axis.

      • Page 10 3rd paragraph, Figure 6G is cited but it looks like this should be Figure 5G.

      Corrected

      • Page 11, 1st paragraph, at "TF genes such as FOXA2 and RBJ remained comparable", please double check if "RBJ" should be "RBPJ".

      Corrected

      • Page 11, end of 1st paragraph, when stating "Of note, expression of PTF1A was also undetectable in all ePP cells (Table S5)", is PTF1A expression level close to 1000 (which units?) in Table S5 considered undetectable?

      This statement regarding ‘undetectable PTF1A expression’ refers to expanded PP cells (ePP), not PP cells at p0. For the latter, expression is indeed close to 1000 in normalized RNA-sequence counts as mentioned in the Table legend.

      • Page 11, 4th paragraph, "In summary, the comparative transcriptome analyses suggested that our C6 expansion procedure is more efficient at strengthening the PP identity". In the context of comments made in the Public Review, more accuracy needs to be put when defining PP identity. Are these MPC or BP?

      The RNA-Seq data suggest that expansion promotes an MPC-to-BP transition. We have added a paragraph in the corresponding results section and a comment in the discussion.

      • Page 15, 2nd paragraph, the sentence "expression of PTF1A, recently shown to promote endocrine differentiation of hPS cells (Miguel-Escalada et al., 2022)" is confusing. Please double-check sentence syntax and reference. Does PTF1A expression "promote" or "create epigenetic competence" for endocrine differentiation?

      Its role is in the MPCs and it prepares the epigenetic landscape to allow for duct and endocrine specification later, thus it ‘creates epigenetic competence’. The paper was cited out of context and we have now corrected it.

      Additional recommendations by the Reviewing Editor:

      An insufficient number of experimental repetitions have been used for the following data: (Figure 1A, n = 2; Figures 2B-D, p10, n = 2; Figures 6A and B, VTN-N, n = 1).

      This is true but we do not draw quantitative conclusions from or do comparisons with these data.

    1. Author response:

      The following is the authors’ response to the current reviews.

      Public Reviews:

      Reviewer #2 (Public Review):

      I have read the authors' response to my comments as well as to the other reviewers. Summarizing briefly, I don't think they provide substantial answer to the questions/comments by me or reviewer 3, and generally do not quantify the results/effects data. I still remain unconvinced about the analyses and conclusions. Rather than rewriting another set of comments, I think it will be more useful for all (authors and readers) simply to be able to see the entire set of reviews and responses together with the paper.

      The authors disagree with the views of referees. The authors have provided point-wise precise responses to each of the previous comments. The authors find that the referee has not been able to engage with the responses and accompanying analysis that were provided while communicating the previous response.

      The authors performed the following extensive analyses for the round 2 revision to address the comments that Reviewer 2 and Reviewer 3 raised on the previous versions:

      (1) We calculated the distribution of multiple metrics for both the apo and holo simulations, including their secondary structure composition, and demonstrated the robustness of our findings.

      (2) We analyzed smaller 60 µs chunks from two parts of the 1.5 ms trajectory and showed how, in combination with the Markov state modeling (MSM) approach, these chunks effectively capture equilibrium properties.

      (3) We thoroughly investigated the choice of starting structures, examining parameters such as Rg, RMSD, secondary structure, and SASA, in response to Referee 3's concerns about the objectivity of our dimension reduction approach.

      (4) We conducted multiple analyses using VAMP-scores and justified the use of a Variational Autoencoder (VAE) over tICA.

      (5) We had extensively verified the choice of hyperparameters used in constructing the MSM.

      (6) To alleviate the referees' concerns, we had retrained a VAE with four latent dimensions and used it to build an MSM, ensuring the robustness of our approach.

      However, we find that the referee has not considered these additional analyses in response to his/her comments on the manuscript.

      Since referee 2 also draws comments from Referee 3, it is worth noting that some of the comments from Referee 2 and Referee 3 in Round 1 were mutually contradictory. In particular, Referee 3's suggestion in Round 1 to use the same initial configuration for simulations of intrinsically disordered proteins (IDPs) in both apo and ligand-bound forms contradicts the fundamental principle that IDPs should not possess structural bias. This recommendation also directly conflicts with Referee 2's request for greater diversity in starting structures. Our manuscript provided robust evidence that our initial configurations are indeed diverse, with one configuration coincidentally matching that used in the ligand-bound simulations. Despite this, we addressed both sets of concerns in our Round 2 revisions. Unfortunately, it seems that these efforts were overlooked in the subsequent round of review.

      Referee 2's suggestion in the previous round of review comments to mix both holo and apo simulation trajectories for MSM construction is conceptually wrong and indicates a lack of understanding of transition matrix building in this field. Nevertheless, we addressed these comments by performing additional analyses and demonstrating the robustness of our current MSM.

      Reviewer #3 (Public Review):

      Summary:

      While the authors have provided additional information in the updated manuscript, none of the additional analyses address the fundamental flaws of the manuscript.

      The additional analyses do not convincingly demonstrate that these two extremely different simulation datasets (1500 microsecond unbiased MD for a-synuclein + fasudil, 23 separate 1-4 microsecond simulations of apo a-synuclein) are directly comparable for the purposes of building MSMs.

      The 23 unbiased 1-4 microsecond simulations of apo αS total ~60 µs.

      Author response image 1.

      Left figure : Distribution of the radius of gyration (Rg) of the 23 apo simulation (as shown in the colourbar) and holo simulation (black). Right figure : Mean and standard deviation (as error bar) of the Rg of the 23 apo (colourbar) and holo simulations (black).

      We have plotted the distribution of the radius of gyration (Rg) for the 23 apo simulations (colour bar) and the holo simulation (black), as shown in the left figure, and compared the means and standard deviations (SD, horizontal error bars) of the Rg values (right figure). We find that our apo simulations span the entire Rg space spanned by the holo simulation. The fact that the apo simulations have means and SDs comparable to those of the holo ensemble suggests that the majority of the apo simulations sample a conformational space similar to that observed in the ligand-bound holo form and hence can be used for building the MSM.
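
      For readers who want to reproduce this kind of comparison, the following is a minimal sketch, not the authors' code, of how Rg and the overlap between two Rg distributions could be computed with NumPy. The function names, the equal-mass default, and the histogram-overlap measure are illustrative assumptions that do not appear in the manuscript.

```python
import numpy as np

def radius_of_gyration(coords, masses=None):
    """Mass-weighted radius of gyration of one conformation (N x 3 array)."""
    coords = np.asarray(coords, dtype=float)
    if masses is None:
        # Assumption: equal masses (e.g. a C-alpha-only representation).
        masses = np.ones(len(coords))
    masses = np.asarray(masses, dtype=float)
    center = np.average(coords, axis=0, weights=masses)
    sq_dist = np.sum((coords - center) ** 2, axis=1)
    return float(np.sqrt(np.average(sq_dist, weights=masses)))

def histogram_overlap(a, b, bins=50):
    """Overlap coefficient of two samples: 1 = identical histograms, 0 = disjoint."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Use a shared range so both histograms have identical bin edges.
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    pa, _ = np.histogram(a, bins=bins, range=(lo, hi))
    pb, _ = np.histogram(b, bins=bins, range=(lo, hi))
    return float(np.minimum(pa / pa.sum(), pb / pb.sum()).sum())
```

      Applied per frame, `radius_of_gyration` yields one Rg time series per trajectory; `histogram_overlap` then gives a single number for how much of the holo Rg range a given apo trajectory covers.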

      The additional analyses do not demonstrate that there are sufficient conformational transitions among kinetically metastable states observed in 23 separate 1-4 microsecond simulations of apo a-synuclein to build a valid MSM, or that the latent space of the VAE is kinetically meaningful.      

      We have performed the Chapman-Kolmogorov test to compare observed and predicted transition probabilities over increasing lag times and found good agreement between these probabilities, thereby suggesting that transitions between states are well-sampled for both the apo (Author response image 2) and holo simulation (Figure S9).

      Author response image 2.

      The Chapman-Kolmogorov test performed for the three state Markov State Model of the αS ensemble.
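
      As an illustration of what the Chapman-Kolmogorov test checks, the sketch below compares the k-th power of a transition matrix estimated at lag τ with the matrix estimated directly at lag kτ. It is a simplified stand-in, assuming discrete state trajectories and a plain row-normalized count estimator; dedicated MSM packages typically use reversible estimators and report per-state probabilities with uncertainties. All names are hypothetical.

```python
import numpy as np

def transition_matrix(dtrajs, n_states, lag):
    """Row-normalized transition counts at the given lag (in frames)."""
    counts = np.zeros((n_states, n_states))
    for dtraj in dtrajs:
        dtraj = np.asarray(dtraj)
        if len(dtraj) > lag:
            # Count i -> j transitions separated by `lag` frames.
            np.add.at(counts, (dtraj[:-lag], dtraj[lag:]), 1.0)
    rows = counts.sum(axis=1, keepdims=True)
    # Rows without any counts are left as zeros instead of dividing by zero.
    return np.divide(counts, rows, out=np.zeros_like(counts), where=rows > 0)

def chapman_kolmogorov(dtrajs, n_states, lag, k_max=5):
    """Return (k, max |T(lag)^k - T(k*lag)|) for k = 1..k_max."""
    t_lag = transition_matrix(dtrajs, n_states, lag)
    errors = []
    for k in range(1, k_max + 1):
        predicted = np.linalg.matrix_power(t_lag, k)
        estimated = transition_matrix(dtrajs, n_states, k * lag)
        errors.append((k, float(np.abs(predicted - estimated).max())))
    return errors
```

      Small deviations at all k are consistent with Markovian dynamics at the chosen lag; because short trajectories contribute few counts at long lags, the estimated matrices become noisier as k grows.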

      As for the latent space of the VAE, we have computed its VAMP2 score and compared it with that of tICA. The VAE has a higher VAMP2 score than tICA, indicating its efficacy in capturing slower modes for both the apo and holo simulations (Fig. S7 and S8).

      If one is interested in modeling the kinetics and thermodynamics of transitions between a set of conformational states, and they run a small number of MD simulations that are too short to see conformational transitions between conformational states - any kinetics and thermodynamics modeled by an MSM will be inherently meaningless. This is likely to be the case with the apo a-synuclein dataset analyzed in this investigation.

      We disagree with the referee’s view. The referee does not seem to understand the point of building Markov state models via short-time scale trajectories. The distribution of Rg of all the 23 apo simulations spans the entire Rg space sampled by the holo simulation, thereby suggesting that multiple short simulations can sample structures of varying sizes as sampled from the 1.5 ms holo simulation (see Author response image 1).

      Simulations of 1-4 microseconds are almost certainly far too short to see a meaningful sampling of conformational transitions of a highly entangled 140-residue IDP beyond a very local relaxation of the starting structures, and the authors provide no analyses to suggest otherwise.

      Author response image 3.

      Autocorrelation of the first principal component of the backbone dihedral for the apo (colourbar) and holo (black) simulation.

      Author response image 4.

      Autocorrelation of the second principal component of the backbone dihedral for the apo (colourbar) and holo (black) simulation.

      In order to assess the 23 short simulations in capturing meaningful kinetics and thermodynamics, we have computed the backbone dihedrals which were then reduced to two principal components for both the 23 apo and holo simulations. We then calculated the autocorrelation time for each of the components and for each of the apo and holo simulations which are plotted in Author response image 3 and Author response image 4 respectively.

      The autocorrelation is similar for the holo simulation and most of the apo simulations, suggesting that the apo simulations sufficiently sample conformational transitions between states and can therefore represent the structural changes of the system similarly to the long simulation.
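
      The autocorrelation analysis described above can be sketched as follows. This is an illustrative example rather than the authors' script: it assumes the input is a one-dimensional time series (for instance, one principal component of the backbone dihedrals), and the 1/e threshold used to define a decorrelation lag is an assumed convention.

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Normalized autocorrelation of a 1D time series for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    var = np.dot(x, x) / n
    return np.array([np.dot(x[: n - lag], x[lag:]) / ((n - lag) * var)
                     for lag in range(max_lag + 1)])

def decorrelation_lag(acf, threshold=1.0 / np.e):
    """First lag at which the ACF falls below the threshold, or None if it never does."""
    below = np.flatnonzero(np.asarray(acf) < threshold)
    return int(below[0]) if below.size else None
```

      A trajectory can only resolve a decay that is short compared with its own length, so `max_lag` should stay well below the number of frames in the shortest trajectory being compared.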

      Without convincingly demonstrating reasonable statistics of conformational changes from the very small apo simulation dataset analyzed here, it seems highly likely the apparent validity of the apo MSM results from learning a VAE latent space that groups structurally and kinetically distinct conformations into similar states, creating the spurious appearance of transitions between states. As such, the kinetics and thermodynamics of the resulting MSM are likely to be relatively meaningless, and comparisons with an MSM for a-synuclein in the presence of fasudil are likely to be meaningless.

      We have shown above that the short simulations are able to capture the structural changes in the long simulation. In addition, we have compared the VAMP2 scores of the VAE and tICA for the apo and holo simulations and found that the VAE is superior in capturing long-timescale dynamics for both (Fig. S7 and S8).

      In its present form, this study provides an example of how the use of black-box machine learning methods to analyze molecular simulations can lead to obtaining misleading results (such as the appearance of a valid MSM) - when more basic analyses are omitted.

      The authors disagree with the referee’s viewpoint on our manuscript. We find that the majority of the contents of the referee’s comments are cursory and lack objectivity.

      The referee’s loose reference to machine learning as a black box reflects a lack of the basic knowledge needed to comprehend the long-proven ability of deep artificial neural networks to objectively deduce an optimal lower-dimensional representation of the conformational subspace of a complex biomacromolecule. The referee’s views on the manuscript ignore the extensive hyperparameter optimization that the authors carried out in developing a suitable beta-variational autoencoder framework for deducing an optimal latent space representation of the complex and fuzzy conformational landscape of an IDP such as alpha-synuclein. We had thoroughly investigated the choice of starting structures, examining parameters such as Rg, RMSD, secondary structure, and SASA, in response to Referee 3's concerns about the objectivity of our dimension reduction approach. However, we find that Referee 3 has ignored the analysis provided to justify our choice.

      Referee 3's advocacy for linear dimensional reduction techniques overlooks the necessity and generality of non-linear approaches, as enabled by artificial deep neural network frameworks, demonstrated in the present manuscript. Nevertheless, our manuscript includes evidence demonstrating the optimality of our current reduced dimensions through varied dimensional analyses. Our extensive analysis, based on the VAMP-2 score, supports the sufficiency of the present dimensions compared to other linear reduction methods.

      The referee’s view that developing Markov state models (MSMs) of the apo form of alpha-synuclein from multiple 1-4 microsecond simulations is misleading suggests a lack of knowledge of the fundamental purpose and motivation of MSMs, which is to derive long-timescale equilibrium properties from significantly shorter, adaptively sampled trajectories. The referee has overlooked the extensive analysis that the authors provided to demonstrate that Markov state models built from short simulation trajectories of alpha-synuclein can statistically replicate the properties derived from very long trajectories.

      ---

      The following is the authors’ response to the original reviews.

      The following extensive analyses were performed to address the reviewer comments:

      (1) We have calculated the distribution of radius of gyration (Rg), end-to-end distance (Ree), solvent accessible surface area (SASA)  of the apo and holo simulations and also their secondary structure composition.

      (2) We have performed a similar analysis for the smaller 60 µs chunk from two parts of the 1.5 ms trajectory.

      (3) The choice of starting structures has been thoroughly investigated in terms of Rg, RMSD, secondary structure and SASA.

      (4) We have justified the use of VAE over tICA.

      (5) We have verified the choice of hyperparameters that were used to build the MSM.

      (6) We have retrained a VAE with four latent dimensions and used it to build an MSM.

      (7) As recommended by Referee 1, we have updated the title of the manuscript by introducing the word ‘expansion’.

      The manuscript has been accordingly revised by updating it with additional analysis.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      This is a well-conducted study about the mechanism of binding of a small molecule (fasudil) to a disordered protein (alpha-synuclein). Since this type of interaction has puzzled researchers for the last two decades, the results presented are welcome as they offer relevant insight into the physical principles underlying this interaction.

      Strengths:

      The results show convincingly that the mechanism of entropic expansion can explain the previously reported binding of fasudil to alpha-synuclein. In this context, the analysis of the changes in the entropy of the protein and of water is highly relevant. The combined use of machine learning for dimensional reduction and of Markov State Models could become a general procedure for the analysis of other systems where a compound binds a disordered protein.

      Weaknesses:

      It would be important to underscore the computational nature of the results, since the experimental evidence that fasudil binds alpha-synuclein is not entirely clear, at least to my knowledge.

      The experimental evidence that fasudil binds to α-synuclein and potentially prevents its aggregation is reported in the paper “Fasudil attenuates aggregation of α-synuclein in models of Parkinson’s disease. Tatenhorst et al. Acta Neuropathologica Communications (2016) 4:39 DOI 10.1186/s40478-016-0310-y”. In that work, solution-state 15N-1H HSQC NMR experiments on α-synuclein with increasing amounts of fasudil showed large chemical shift perturbations of residues Y133 and Y136. Additionally, fasudil treatment had no significant effect on the single and double mutants synT-Y133A and synT-Y136A (tyrosine replaced with alanine), as evident from immunochemistry, thereby indicating that α-synuclein aggregation can be inhibited by the interaction of the C-terminal tyrosines with fasudil. These two analyses point to specific binding sites for fasudil on α-synuclein.

      In our work, we have built an MSM using the latent dimensions of a deep learning method, the VAE, to address how fasudil interacts with α-synuclein. An analysis of the macrostates obtained from the MSM gives insights into how fasudil interacts with α-synuclein in terms of transition probabilities among the states, thereby predicting which states are most favorable for binding.

      Reviewer #2 (Public Review):

      The manuscript by Menon et al describes a set of simulations of alpha-Synuclein (aSYN) and analyses of these and previous simulations in the presence of a small molecule.

      While I agree with the authors that the questions addressed are interesting, I am not sure how much we learn from the present simulations and analyses. In parts, the manuscript reads more like an attempt to apply a whole range of tools rather than with a goal of answering any specific questions.

      In this manuscript, we have employed a variational Bayesian method, the VAE, which uses variational inference to approximate the distribution of the latent variables. Unlike conventional linear dimension reduction methods such as tICA (as provided in the SI), this method has been found to be better (higher VAMP2 score) at capturing slow modes and thereby facilitates the study of long-time dynamics. A Markov State Model was built on this lower-dimensional space, which indicated the presence of three and six states for the apo and holo simulations, respectively. The exclusivity of the states was justified by determining the backbone contact map and further mapping these states using a denoising CNN-VAE. The increase in the number of states in the presence of the small molecule was justified by calculating the entropy of the macrostates. The entropic contribution from water remained similar across all states, while for the protein in the holo ensemble, entropy was significantly modulated (either increased or decreased) compared to the apo state. In contrast, the entropy of the apo states showed much less modulation. This proves that an increase in the number of states is primarily an entropic effect caused by the small molecule. Finally, we have compared the mean first passage time (MFPT) from the other states to the most populated state, which reveals a strong correlation between transition time and the system's entropy for both the apo and holo ensembles. However, the transition times (to the most populated state) are much lower for the holo ensemble, thereby suggesting that fasudil may potentially trap the protein conformations in the intermediate states, thereby slowing down αS in exploring the large conformational space and eventually slowing down aggregation.
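
      To make the MFPT comparison concrete, the following sketch shows one standard way of obtaining mean first passage times to a target state from a discrete-time transition matrix. It is a generic textbook construction under the assumption of a row-stochastic matrix estimated at a fixed lag time; it is not taken from the manuscript, and the function name is hypothetical.

```python
import numpy as np

def mfpt_to_target(T, target, lag_time=1.0):
    """Mean first passage time from every state to `target`, in units of lag_time.

    For i != target the MFPTs satisfy
        m_i = lag_time + sum_{j != target} T[i, j] * m_j,
    which is solved here as a linear system; the entry for `target` is 0.
    """
    T = np.asarray(T, dtype=float)
    n = T.shape[0]
    others = [i for i in range(n) if i != target]
    A = np.eye(n - 1) - T[np.ix_(others, others)]
    m = np.linalg.solve(A, np.full(n - 1, lag_time))
    mfpt = np.zeros(n)
    mfpt[others] = m
    return mfpt
```

      Choosing the most populated macrostate as `target` gives one passage time per remaining state, which can then be set against the per-state entropies.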

      There's a lot going on in this paper, and I am not sure it is useful for the authors, readers or me to spell out all of my comments in detail. But here are at least some points that I found confusing/etc

      Major concerns

      p. 5 and elsewhere:

      I lack a serious discussion of convergence and the statistics of the differences between the two sets of simulations. On p. 5 it is described how the authors ran multiple simulations of the ligand-free system for a total of 62 µs; that is about 25 times less than for the ligand system. I acknowledge that running 1.5 ms is unfeasible, but at a bare minimum the authors should discuss and analyse the consequences for the relatively small amount of sampling. Here it is important to say that while 62 µs may sound like a lot it is probably not enough to sample the relevant properties of a 140-residue long disordered protein.

      As to Referee 2’s original comment that there is ‘a lot going on in the manuscript’, we believe that the complexity of the project demanded extensive analysis and objective machine learning approaches instead of routine collective variables or traditional linear dimensional reduction techniques. This is what has been accomplished in this manuscript. For readers who want the gist of the work, the last paragraph of the introduction and the first paragraph of the conclusion summarize the overall investigation and findings. First, a VAE-based machine learning approach demonstrates the modulation of the free energy landscape of alpha-synuclein in the presence of fasudil. Next, the Markov State Model elucidates distinct, competing binding states of alpha-synuclein in the presence of the small-molecule drug. Then the MSM-derived metastable states of the alpha-synuclein monomer are structurally characterized in the presence of fasudil. Next, we mapped the macrostates in the apo and bound-state ensembles using a denoising convolutional variational autoencoder to ensure that these are mutually distinct. Next, we show that fasudil exhibits conformation-dependent interactions with individual metastable states. Finally, the investigation quantitatively brings out the entropic signatures of small-molecule binding.

      We thank the reviewer for the question. For the apo simulations, we performed 1-4 μs long simulations from 23 different starting structures, which together amounted to an ensemble of ~62 μs. In the Supplementary figures, we show how the starting structures used for the apo simulations compare with the structure used to run the holo simulations, and we compare the apo and holo ensembles in terms of structural features such as Rg, Ree, solvent accessible surface area (SASA) and secondary structure properties. This is updated in the manuscript on pages 3 and 31-33 and in figures S1-S6 and S25-S30.

      Also, regarding the choice of starting structures, we chose multiple distinct conformations from a previous simulation of the alpha-synuclein monomer, reported in Robustelli et al., PNAS, 115 (21), E4758-E4766. The Rg values of the starting structures represent the entire Rg distribution of the holo ensemble, from compact through intermediate to extended states. Importantly, the Rg distributions of the apo and holo ensembles are highly comparable and overlapping, indicating that the apo simulations, although of short timescale, have sampled the phase space locally around each starting conformation and thus covered the protein phase space as in the holo simulation. Similarly, other structural properties such as SASA, Ree and secondary structure are comparable for the two ensembles. These analyses show that local sampling across a variety of starting conformations has ensured sufficient sampling of the IDP phase space. This is updated in the manuscript on pages 33-34 and in figures S1 and S25-S30.

      p. 7:

      The authors make it sound like a bad thing that some methods are deterministic. Why is that the case? What kind of uncertainty in the data do they mean? One can certainly have deterministic methods and still deal with uncertainty. Again, this seems like a somewhat ad hoc argument for the choice of the method used.

      We appreciate the reviewer’s comment. In this work, we have used a single VAE model to map the simulation of αS in its apo state and in the presence of fasudil, into two dimensions. If we had used an autoencoder, which is a deterministic model, we would have to train two independent models; one for the apo-state and one for fasudil. It would then be questionable to compare the two dimensions obtained from two different autoencoders as the model parameters are not shared. 

      VAE gives us this flexibility by not mapping it to a single point, but to a distribution, thereby encouraging it to learn more generalizable representation. The uncertainty is not in the data; but mapping a conformation (of the fasudil simulation) to a distribution would provide a new point for a similar structure (from the apo simulation). 

      p. 8:

      The authors should make it clear (i) what the reconstruction loss and KL is calculated over and (ii) what the RMSD is calculated over.

      (i) The reconstruction loss is calculated between the reconstructed and original pairwise distances, whereas the KL loss is calculated between the approximated posterior distribution and the prior distribution (for VAE it is a standard normal distribution)

      (ii) The RMSE is the root mean square error between the original data and the reconstructed data. 

      (i) is updated on page 34 and (ii) is updated in the revised manuscript on page 8.

      p. 9/figure 1:

      The authors select a beta value that may be the minimum, but then is just below a big jump in the cross-validation error. Why does the error jump so much and isn't it slightly dangerous to pick a value close to such a large jump?

      In this work, RMSE has been chosen as the metric for selecting the best VAE model. To do so, the β parameter (the weighting factor for the KL loss) was varied, and the β value that gave the minimum RMSE was chosen.

      This is updated on page 8.
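
      The loss terms described in the two responses above can be written compactly as below. This is a hedged sketch: it assumes a diagonal Gaussian encoder (so the KL divergence to the standard normal prior has a closed form), a mean-squared-error reconstruction term over the input features, and averaging over the batch; the manuscript may use different reductions or scalings. The function and key names are hypothetical.

```python
import numpy as np

def beta_vae_loss(x, x_recon, mu, log_var, beta):
    """Return the beta-VAE objective and its parts for one batch.

    x, x_recon : (batch, features) original and reconstructed inputs
                 (e.g. pairwise distances).
    mu, log_var: (batch, latent_dim) mean and log-variance of q(z | x).
    beta       : weighting factor of the KL term.
    """
    x = np.asarray(x, dtype=float)
    x_recon = np.asarray(x_recon, dtype=float)
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    # Reconstruction term: mean squared error between input and output.
    recon = np.mean((x - x_recon) ** 2)
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions.
    kl_per_sample = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=-1)
    kl = np.mean(kl_per_sample)
    return {"total": recon + beta * kl,
            "reconstruction": recon,
            "kl": kl,
            "rmse": np.sqrt(recon)}
```

      Scanning β then amounts to training one model per β value and comparing the held-out `rmse` entries.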

      p. 10:

      Why was a 2-dimensional representation used in the VAE? What evidence do the authors have that the representation is meaningful? The authors state "The free energy landscape represents a large number of spatially close local minima representative of energetically competitive conformations inherent in αS" but they do not say what they mean by "spatially close". In the original space? If so, where is the evidence.

      We thank the reviewer for the question. Even though an increase in the number of latent dimensions may make the model more accurate, this can also result in overfitting. The model can simply memorize the pattern in the data instead of generalizing them. A higher dimensional latent space is also more difficult to interpret; therefore, we chose two dimensions. 

      The reconstruction loss (which is the mean squared error between the input and the reconstructed data) is of the order of 10⁻⁴. Also, the MSM built on the latent space of VAE is able to identify states that are distinct for both apo and holo simulations, which ensures that the latent space representation is meaningful.

      We have also trained a model with 4 neurons in the latent space and built an MSM. The implied timescales indicate the presence of six states which is consistent with the model with two latent dimensions.

      This is updated in the manuscript on page 13 and figure S14-S15.

      No, not spatially close in the original space, but in the reduced two dimensional latent space.

      p. 10:

      It is not clear from the text whether the VAEs are the same for both aSYN and aSYN-Fasudil. I assume they are. Given that the Fasudil dataset is 25x larger, presumably the VAE is mostly driven by that system. Is the VAE an equally good representation of both systems?

      Yes, the same model is used for both aSYN and aSYN-Fasudil ensemble.

      The states obtained from the MSM of the aSyn ensemble are distinct when their Cα contact maps are analyzed. So we think it is a good representation for this system.

      p. 10/11:

      Do the authors have any evidence that the latent space representation preserves relevant kinetic properties? This is a key point because the entire analysis is built on this. The choice of using z1 and z2 to build the MSM seems somewhat ad hoc. What do the auto-correlation functions of z1 and z2 look like? Are they related to the dynamics of some key structural properties like Rg or transient helical structure?

      Autocorrelation of z1 and z2 of the latent space of VAE and the radius of gyration for asyn-fasudil simulation.

      Author response image 5.

      We find that z1 of VAE has a much slower decay as compared to Rg. This indicates that it is much better in capturing long-time-scale dynamics as compared to Rg.

      p. 11:

      What's the argument for not building an MSM with states shared for aSYN +- Fasudil?

      We have built two different Markov state models for the two aSYN simulations, one in the apo state and one in the presence of the ligand. Mixing the two latent spaces to build one MSM would give incorrect transition timescales among the states, as these are independent simulations.

      p. 12:

      Fig. 3b/c show quite clearly that the implied timescales are not converged at the chosen lag time (incidentally, it would have been useful with showing the timescales in physical time). The CK test is stated to be validated with "reasonable accuracy", though it is unclear what that means.

      We have mentioned the physical timescales in the main manuscript (Page no. 38), which is 36 and 32 ns for apo and holo simulations, respectively. We used “reasonable accuracy” in the context of the Chapman-Kolmogorov test. We note that for the ligand simulations, the estimated and predicted models are in excellent agreement as compared to some of the transitions in the apo state. This good agreement implies that the model has reached Markovianity and the timescales have converged. 

      The CK test is updated in the manuscript on page 12.
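
      For context on the convergence question, implied timescales are obtained from the eigenvalues of the transition matrix estimated at a lag time τ via t_i = -τ / ln|λ_i|. The sketch below is a generic illustration of that relation and is not taken from the manuscript; the function name is hypothetical, and `lag_time` is whatever physical time corresponds to the chosen lag.

```python
import numpy as np

def implied_timescales(T, lag_time):
    """Implied timescales t_i = -lag_time / ln|lambda_i| of a transition matrix.

    The largest eigenvalue (the stationary process, lambda = 1) is skipped.
    Eigenvalue magnitudes outside (0, 1) are dropped because they do not
    correspond to a finite positive relaxation time.
    """
    eigvals = np.abs(np.linalg.eigvals(np.asarray(T, dtype=float)))
    eigvals = np.sort(eigvals)[::-1][1:]
    eigvals = eigvals[(eigvals > 0.0) & (eigvals < 1.0)]
    return -lag_time / np.log(eigvals)
```

      For truly Markovian dynamics the result does not depend on the lag: a matrix estimated at twice the lag has squared eigenvalues, which cancels against the doubled `lag_time`. Timescales that keep rising with increasing lag are the usual sign that the chosen lag is too short.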

      p. 12:

      In Fig. 3d, what are the authors bootstrapping over? What are the errors if the authors analyse sampling noise (e.g. bootstrap over simulation blocks)?

      For bootstrapping, we randomly deleted a part of the simulation (simulation block) and rebuilt the MSM with this reduced dataset. We repeated this 10 times and reported the average value of the population and the transition timescales over the 10 iterations.  
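
      The resampling procedure described above (delete a random block, re-estimate, repeat, average) could be organized as in the following sketch. It is an assumption-laden illustration: the block boundaries, the random seed, and the `estimator` callback are hypothetical, and the toy estimator simply counts state populations, whereas the authors rebuild the full MSM on each reduced dataset.

```python
import numpy as np

def block_deletion_estimates(dtraj, n_blocks, n_repeats, estimator, seed=0):
    """Re-estimate an observable after deleting one random block per repeat.

    `estimator` maps a list of trajectory blocks to an array of observables
    (for example: rebuild the MSM and return the state populations).
    Returns the mean and sample standard deviation over the repeats.
    """
    rng = np.random.default_rng(seed)
    blocks = np.array_split(np.asarray(dtraj), n_blocks)
    samples = []
    for _ in range(n_repeats):
        drop = rng.integers(n_blocks)
        # Kept blocks stay separate so no transition is counted across the gap.
        kept = [b for i, b in enumerate(blocks) if i != drop]
        samples.append(estimator(kept))
    samples = np.asarray(samples, dtype=float)
    return samples.mean(axis=0), samples.std(axis=0, ddof=1)

def state_populations(blocks, n_states=2):
    """Toy estimator: fraction of frames in each state (no MSM involved)."""
    counts = sum(np.bincount(b, minlength=n_states) for b in blocks)
    return counts / counts.sum()
```

      Returning the spread alongside the mean gives an estimate of sampling noise in addition to the averaged populations and timescales.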

      p. 13:

      I appreciate that the authors build an MSM using only a subset of the fasudil simulations. Here, it would be important that this analysis includes the entire workflow so that the VAE is also rebuilt from scratch. Is that the case?

      The VAE model was trained over data points of the ligand simulation sampled at every 9 ns starting from time t=0, for the entire 1.5 ms. We did not train it for the subset of the fasudil simulation, but rather used the trained VAE model to get the latent space of the 60 μs of the fasudil simulation to build the MSM. Additionally, we have compared the distributions of Rg for this simulation block with the apo ensemble and found good agreement among them. 

      Rg distribution is updated in the manuscript on page 13 and see figure S10-S11.

      p. 18:

      I don't understand the goal of building the CVAE and DCVAE. Am I correct that the authors are building a complex ML model using only 3/6 input images? What is the goal of this analysis. As it stands, it reads a bit like simply wanting to apply some ML method to the data. Incidentally, the table in Fig. 6C is somewhat intransparent.

      We appreciate the reviewer’s valid question. With the ensemble-averaged contact maps of the aSyn macrostates in the apo state and in the presence of the ligand, it was challenging to find contacts that are exclusive to each state. Since VAEs are excellent at finding patterns, we employed a convolutional VAE (typically used for images). However, owing to the small number of contact maps, the model overfitted, and to prevent this we added noise to the data. Visual inspection of ensemble-averaged contact maps is difficult, especially for IDPs, and this lower-dimensional space gives a preliminary idea of how each macrostate differs from every other. The table in Fig. 6C provides scores for the denoised contact maps (SSIM and PSNR scores). An SSIM score above 0.9 and a PSNR score between 20 and 48 indicate that the reconstruction of the contact map is of good quality.
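
      As a point of reference for the scores quoted above, PSNR is a simple function of the mean squared error between an original and a reconstructed map. The sketch below shows that definition together with one possible way of generating noisy inputs for a denoising model; it assumes contact maps scaled to the range [0, 1] and Gaussian noise, neither of which is stated in the response, and it omits SSIM, which requires a windowed computation. All names are hypothetical.

```python
import numpy as np

def add_gaussian_noise(contact_map, sigma, seed=0):
    """Noisy copy of a contact map, clipped to the assumed [0, 1] range."""
    rng = np.random.default_rng(seed)
    contact_map = np.asarray(contact_map, dtype=float)
    noisy = contact_map + rng.normal(scale=sigma, size=contact_map.shape)
    return np.clip(noisy, 0.0, 1.0)

def psnr(reference, reconstruction, data_range=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(data_range^2 / MSE)."""
    reference = np.asarray(reference, dtype=float)
    reconstruction = np.asarray(reconstruction, dtype=float)
    mse = np.mean((reference - reconstruction) ** 2)
    if mse == 0:
        return float("inf")  # identical maps
    return float(10.0 * np.log10(data_range ** 2 / mse))
```

      With `data_range=1.0`, a PSNR of 20 dB corresponds to a root-mean-square error of 0.1 per map element, and each additional 20 dB reduces that error tenfold.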

      p. 22:

      "Our results indicate that the interaction of fasudil with αS residues governs the structural features of the protein."

      What results indicate this?

      By building Markov State Models and comparing them across the apo and holo ensembles, we showed that the interaction of fasudil with aSyn leads to the population of more states than in the apo ensemble. In these states, we observe that fasudil interacts with aSyn in different regions, as shown by the protein-ligand contact map in figure 7. Also, the contact maps and the extent of secondary structure are distinct across the six states. The location and extent of the helix and sheet-like character in the ensembles of the six macrostates are shown in figures S16-S17. Based on these observations, we state that the interaction of the small molecule favors the population of new aSyn states that are distinct in their structural features.

      p. 23:

      The authors should add some (realistic) errors to the entropy values quoted. Fig. 8 has some error bars, though they seem unrealistically small. Also, is the water value quoted from the same force field and conditions as for the simulations?

      The error values are the standard deviations provided by the PDB2ENTROPY package. Yes, the water value is from the same force field, and the conditions are the same as for the simulations, as reported in the section “Entropy of water”.

      p. 23:

      Has PDB2ENTROPY been validated for use with disordered proteins?

      Yes, it has been used in the following paper studying liquid-liquid phase separation of an IDP. 

      This paper has also been cited in the manuscript (reference 66).

      “Thermodynamic forces from protein and water govern condensate formation of an intrinsically disordered protein domain” by Saumyak Mukherjee & Lars V. Schäfer, Nature Communications volume  14, Article number: 5892 (2023) https://doi.org/10.1038/s41467-023-41586-y

      p. 23/24:

      It would be useful to compare (i) the free energies of the states (from their populations), (ii) the entropies (as calculated) and (iii) the enthalpies (as calculated e.g. as the average force field energy). Do they match up?

      Our analysis is motivated by previous studies in which enthalpy-driven drug design has not led to significant advances, particularly for IDPs. In the presence of the drug/ligand, the protein may be able to explore a larger conformational space, and hence the number of states accessible to the protein increases, which we found by building a Markov State Model using the latent space of the VAE. The entropy of the protein is calculated based on the torsional degrees of freedom relative to the random distribution (the protein with the most random configuration).
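
      For reference, the first quantity in the reviewer's question, the free energy of a state from its population, follows from Boltzmann inversion, ΔG_i = -k_B T ln(p_i / p_ref). The sketch below illustrates this under stated assumptions: the populations are hypothetical and the temperature of 300 K is assumed, neither being taken from the manuscript.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K)

def relative_free_energies(populations, temperature=300.0):
    """Free energy of each state relative to the most populated state (kcal/mol)."""
    p_ref = max(populations)
    return [-K_B * temperature * math.log(p / p_ref) for p in populations]

# Hypothetical stationary populations of three macrostates (assumption).
populations = [0.6, 0.3, 0.1]
for p, dg in zip(populations, relative_free_energies(populations)):
    print(f"population {p:.2f} -> dG = {dg:.2f} kcal/mol")
```

      Combining such values with separately computed entropies and average force-field energies would give the consistency check the reviewer describes.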

      p. 31:

      It is unclear which previous simulation the new aSYN simulations were launched from. What is the size of the box used?

      The starting conformations for the new aSYN simulations were randomly chosen from a previously reported 73 μs simulation in Robustelli et al. (PNAS, 115 (21), E4758-E4766).

      Box sizes for the 23 simulations have been added to the supplemental information in Table S1.

      Reviewer #3 (Public Review):

      Summary:

      In this manuscript Menon, Adhikari, and Mondal analyze explicit solvent molecular dynamics (MD) computer simulations of the intrinsically disordered protein (IDP) alpha-synuclein in the presence and absence of a small molecule ligand, Fasudil, previously demonstrated to bind alpha-synuclein by NMR spectroscopy without inducing folding into more ordered structures. In order to provide insight into the binding mechanism of Fasudil the authors analyze an unbiased 1500us MD simulation of alpha-synuclein in the presence of Fasudil previously reported by Robustelli et al. (Journal of the American Chemical Society, 144(6), pp.2501-2510). The authors compare this simulation to a very different set of apo simulations: 23 separate 1-4us simulations of alpha-synuclein seeded from different apo conformations taken from another simulation previously reported by Robustelli et al. (PNAS, 115 (21), E4758-E4766), for a total of ~62us.

      To analyze the conformational space of alpha-synuclein - the authors employ a variational autoencoder (VAE) to reduce the dimensionality of Ca-Ca pairwise distances to 2 dimensions, and use the latent space projection of the VAE to build Markov state Models. The authors utilize k-means clustering to cluster the sampled states of alpha-synuclein in each condition into 180 microstates on the VAE latent space. They then coarse grain these 180 microstates into a 3-macrostate model for apo alpha-synuclein and a 6-macrostate model for alpha-synuclein in the presence of fasudil using the PCCA+ coarse graining method. Few details are provided to explain the hyperparameters used for PCCA+ coarse graining and the rationale for selecting the final number of macrostates.

      The authors analyze the properties of each of the alpha-synuclein macrostates from their final MSMs - examining intramolecular contacts, secondary structure propensities, and in the case of alpha-synuclein:Fasudil holo simulations - the contact probabilities between Fasudil and alpha-synuclein residues.

      The authors utilize an additional variational autoencoder (a denoising convolutional VAE) to compare denoised contact maps of each macrostate, and project onto an additional latent space. The authors conclude that their apo and holo simulations are sampling distinct regions of the conformational space of alpha-synuclein projected on the denoising convolutional VAE latent space.

      Finally, the authors calculate water entropy and protein conformational entropy for each macrostate. To facilitate water entropy calculations - the authors take a single structure from each macrostate - and run a 20ps simulation at a finer timestep (4 femtoseconds) using a previously published method (DoSPT), which computes thermodynamic properties of water from MD simulations using autocorrelation functions of water velocities. The authors report that water entropy calculated from these individual 20ps simulations is very similar.

      For each macrostate the authors compute protein conformational entropy using a previously published Maximum Information Spanning tree approach based on torsion angle distributions - and observe that the estimated protein conformational entropy is substantially more negative for the macrostates of the holo ensemble.

      The authors calculate mean first passage times from their Markov state models and report a strong correlation between the protein conformational entropy of each state and the mean first passage time from each state to the highest populated state.

      As the authors observe that the conformational entropy estimated from macrostates of the holo alpha-synuclein:Fasudil ensemble is greater than that estimated from the apo alpha-synuclein macrostates - they suggest that the driving force of Fasudil binding is an increase in the conformational entropy of alpha-synuclein. No consideration/quantification of the enthalpy of alpha-synuclein Fasudil binding is presented.

      Strengths:

      The authors utilize MD simulations run with an appropriate force field for IDPs (a99SB-disp and a99SB-disp water; Robustelli et al., PNAS, 115 (21), E4758-E4766) - which has previously been used to perform MD simulations of alpha-synuclein that have been validated with extensive NMR data.

      The contact probability between Fasudil and each alpha-synuclein residue observed in the previously performed 1500us MD simulation of alpha-synuclein in the presence of Fasudil (Robustelli et al., Journal of the American Chemical Society, 144(6), pp.2501-2510) was previously found to be in good agreement with experimental NMR chemical shift perturbations upon Fasudil binding - suggesting that this simulation is a reasonable choice for understanding IDP:small molecule interactions.

      Weaknesses:

      Major Weakness 1: Simulations of apo alpha-synuclein and holo simulations of alpha-synuclein and fasudil are not comparable.

      The most robust way to determine how the presence of Fasudil affects the conformational ensemble of alpha-synuclein is to run apo and holo simulations of the same length from the same starting structures using the same simulation parameters.

      The 23 1-4 us independent simulations of apo alpha-synuclein and the long unbiased 1500us simulation of alpha-synuclein in the presence of fasudil are not directly comparable. The starting structures of simulations used to build a Markov state model to describe apo alpha-synuclein were taken from a previously reported 73us MD simulation of alpha-synuclein (run with the a99SB-disp force field and water model) with 100mM NaCl (Robustelli et al., PNAS, 115 (21), E4758-E4766). As the holo simulation of alpha-synuclein and Fasudil was run in 50mM NaCl, snapshots from the original apo alpha-synuclein simulation were resolvated with 50mM NaCl - and new simulations were run.

      No justification is offered for how starting structures were selected. We have no sense of the conformational variability of the starting structures selected and no sense of how these conformations compare to the alpha-synuclein conformations sampled in the holo simulation in terms of standard structural descriptors such as tertiary contacts, secondary structure, radius of gyration (Rg), solvent exposed surface area etc. (we only see a comparison of projections on an uninterpretable non-linear latent-space and average contact maps). Additionally, 1-4 us is a relatively short timescale for a simulation of a 140 residue IDP - and one is unlikely to see substantial evolution for many structural properties of interest (i.e. secondary structure, radius of gyration, tertiary contacts) in simulations this short. Without any information about the conformational space sampled in the 23 apo simulations (aside from a projection on an uninterpretable latent space) - we have no way to determine if we observe transitions between distinct states in these short simulations, and therefore if it is possible to construct a meaningful MSM from these simulations.

      If the structures used for apo simulations are on average more compact or contain more tertiary contacts - then it is unsurprising that in short independent simulations they sample a smaller region of conformational space. Similarly, if the starting structures have similar dimensions - but we only observe extremely local sampling around starting structures in apo simulations in the short simulation times - it would also not be surprising that we sample a smaller amount of conformational space. By only presenting comparisons of conformational states on an uninformative VAE latent space - it is not possible for a reader to ask simple questions about how the conformational ensembles compare.

      It is noted that the authors attempt to address questions about sampling by building an MSM of single contiguous 60us portion of the holo simulation of alpha-synuclein and Fasudil - noting that:

      "the MSM built using lesser data (and same amount of data as in water) also indicated the presence of six states of alphaS in presence of fasudil, as was observed in the MSM of the full trajectory. Together, this exercise invalidates the sampling argument and suggests that the increase in the number of metastable macrostates of alphaS in fasudil solution relative to that in water is a direct outcome of the interaction of alphaS with the small molecule."

      However, the authors present no data to support this assertion - and readers have no sense of how the conformational space sampled in this portion of the trajectory compares to the conformational space sampled in the independent apo simulations or the full holo simulation. As the analyzed 60us portion of the holo trajectory may have no overlap with conformational space sampled in the independent apo simulations - it is unclear if this control provides any information. There is no quantification of the conformational entropy of the 6 states obtained from this portion of the holo trajectory or the full conformational space sampled. No information is presented to determine if we observe similar states in the shorter portion of the holo trajectory. Furthermore - as the authors provide almost no justification for the criteria used to select the final number of macrostates for any of the MSMs reported in this work - and the number of macrostates is effectively a free parameter in the PCCA+ method, arriving at an MSM with 6 macrostates does not convey any information about the conformational entropy of alpha-synuclein in the presence or absence of ligands. Indeed - the implied timescale plot for the 60us holo MSM (Figure S2) - shows that at least 10 processes are resolved in the 120 microstate model - and no information is provided explaining/justifying how a final 6-macrostate model was determined. The authors also do not project the conformations sampled in this sub-trajectory onto the latent space of the final VAE.

      One certainly expects that an MSM built with 1/20th of the simulation data should have substantial differences from an MSM built from the full trajectory - so failing additional information and hyperparameter justification - one wonders if the emergence of a 6-state model could be the direct result of hardcoded VAE and MSM construction hyperparameter choices.

      Required Controls For Supporting the Conclusions of the Study: The authors should initiate apo and holo simulations from the same starting structures - using the same simulation software and parameters. This could be done by adding a Fasudil ligand to the apo structures - or by removing the Fasudil ligand from a subset of holo structures. This would enable them to make apples-to-apples comparisons about the effect of Fasudil on alpha-synuclein conformational space.

      Failing to add direct apples-to-apples comparisons, which would be required to truly support the study's conclusions, the authors should at least compare the conformational space sampled in the independent apo simulations and holo simulations using standard interpretable IDP order parameters (i.e. Rg, end-to-end distance, secondary structure order parameters) and/or principal components from PCA or tICA obtained from the holo simulation. The authors should quantify the number of transitions observed between conformational states in their apo simulations. The authors could also perform more appropriate holo controls, without additional calculations, by taking batches of a similar number of short 1-4us segments of simulations used to compute the apo MSMs and examining how the parameters/macrostates of the holo MSMs vary with random selections of the input.

      In the case of IDPs, one should not bias the simulation by starting from identical structures, as an IDP does not have a defined structure and the starting configuration has little significance. It is the microenvironment that matters most. As for the choice of simulation software and parameters, we have used the same force field that was used in the holo simulation, at the same temperature and the same salt concentration. We have performed multiple independent simulations that have varying structural signatures such as Rg, SASA and secondary structure content. In fact, the starting structures for the apo simulations covered the entire span of the Rg distribution of the holo simulation, including the starting structure of the holo simulation. The simulations are unbiased with respect to the starting structure. Although the fasudil simulation was run for 1.5 ms, we should also understand that it is difficult to run a millisecond-range simulation in reasonable time from a single starting structure. It is exactly for this reason that we start with different structures, so that we do not bias ourselves and sample every possible conformation.

      We have updated the manuscript on page 33-34 and figure S1, S25-S30.

      Considering the computational expense of simulating a 1.5 ms timescale for a 140-residue IDP, we generated an ensemble from multiple short runs amounting to ~60 µs. The premise of this investigation is a widely popular method, Markov State Models (MSMs), which can be used to estimate long-timescale kinetics and stationary populations of metastable states from ensembles of short simulations. We have also demonstrated that, when we build an MSM for asyn-fasudil (holo) using a 60 µs simulation block comparable to the apo data, the implied timescales (ITS) plot shows the same number of metastable states as for the 1.5 ms data.

      An intrinsically disordered protein (IDP) is not represented by a fixed structure. Therefore, it is most appropriate to run multiple simulations starting from different initial structures and simulate the local environment around those structures, thus generating an ensemble that effectively samples the phase space. Accordingly, for initiating the apo simulations, instead of biasing the initial structure (using the starting structure used for simulations with fasudil), we randomly chose 23 different conformations from the 73 µs long simulation of α-synuclein monomer reported in Robustelli et al., PNAS, 115 (21), E4758-E4766. Based on the reviewer’s comment on providing a justification for the choice of the starting structures for apo simulations, we provide a compilation of figures below comparing standard conformational properties of the chosen initial structures for apo simulations with the starting structure of the long holo simulation; we have also provided comparative analyses of the apo (~60 µs) and holo ensemble (1.5 ms) properties.

      Figure S1 compares the Rg of the apo and holo ensembles of ~60 μs and 1.5 ms, respectively. The distributions largely overlap, indicating that the apo ensemble is comparable to the holo ensemble in terms of the extent of compaction of the conformations. In Figure S1, we have also marked the Rg values corresponding to the starting structures used to seed the apo simulations. It is evident that the 23 starting conformations chosen represent the whole range of the Rg space that is sampled in the holo ensemble. Therefore, while the apo simulations are relatively short (1-4 μs), the local sampling of these multiple starting conformations of variable compaction (Rg) ensures that the phase space is efficiently sampled and the resulting ensemble is comparable to the holo ensemble. Furthermore, the implementation of MSM on such an ensemble can be efficiently used to identify metastable states and the long-timescale transitions happening between them.

      Another property that is closely related to Rg is the end-to-end distance of the protein conformations. Figure S2 shows that the distributions of this property in the apo and holo ensembles are highly similar.
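
      As a reminder of how these two descriptors are defined, the sketch below computes Rg and the end-to-end distance (Ree) from a list of 3D coordinates. It is a minimal illustration, assuming equal masses for all beads (for example, one Cα atom per residue); the coordinates are hypothetical and the function names are not taken from the manuscript.

```python
import math

def radius_of_gyration(coords):
    """Rg of a set of 3D points, assuming equal masses."""
    n = len(coords)
    center = [sum(c[k] for c in coords) / n for k in range(3)]
    mean_sq = sum(
        sum((c[k] - center[k]) ** 2 for k in range(3)) for c in coords
    ) / n
    return math.sqrt(mean_sq)

def end_to_end_distance(coords):
    """Distance between the first and last points of the chain."""
    return math.dist(coords[0], coords[-1])

# Hypothetical fully extended 4-bead chain along x with 0.38 nm spacing.
chain = [(0.38 * i, 0.0, 0.0) for i in range(4)]
print(radius_of_gyration(chain), end_to_end_distance(chain))
```

      Histograms of these values over all frames of each ensemble give the kind of distributions compared in the figures cited above.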

      Figure S3 depicts another fundamental structural descriptor i.e. solvent accessible surface area (SASA) that indicates the extent of folding and the exposure of the residues. The apo ensemble only shows a minimal shift in the distribution towards higher SASA values. The distributions of the two ensembles largely overlap. 

      In Figure S25, we have provided the root mean square deviation (RMSD) of the starting structures used in the apo simulations with the structure used to start the long simulation with fasudil. The RMSD values range from 1.6 to 3 nm, indicating that the starting structures used are highly variable. This is justifiable for IDPs since they are not identified by a single, fixed structure, but rather by an array of different conformations.  

      Figures S26-S28 show the fraction of the secondary structure elements i.e. helix, beta and coil in the starting structures of apo and holo simulations. All the conformations are mostly disordered in nature with the greatest extent of coil content. The helix content ranges from 3-10 % while sheet content varies from 3-15 % in the initial simulation structures. 

      Figures S4-S6 represent the residue-wise percentage of secondary structure elements (helix, beta and coil) in the apo and holo ensembles. It is evident that the extent of secondary structure is comparable in the two ensembles.

      The above analyses comparing distributions of several structural features clearly indicate that the apo simulations we performed from different starting structures have sampled the phase space as effectively as the single long simulation of the holo system.

      We have discussed the above in the manuscript: Computational Methods section, Page 33-34.

      The above VAMP score analyses (Figures S7 and S8) have now been presented in the manuscript: Results and Discussion (Page 8).

      Building the MSM

      While building the MSM, we iteratively varied the hyperparameters to build a reasonable model. In this process, we explored different values of the number of clusters, maximum number of iterations, tolerance, stride, metric, seed, chunk size and initialization methods. There is no possible way to optimize the choice of the above hyperparameters using gradient descent methods, as no convergence would be guaranteed. The parameters were tuned carefully so that we get the best possible implied timescales of the system. The quality of the MSM was further validated using the Chapman-Kolmogorov (CK) test on a state-by-state basis, i.e., by considering the transitions between each pair of the metastable states. In addition, we have built the contact maps to show that the states are mutually exclusive. This is also justified by the latent space of the denoising convolutional variational autoencoder.

      We have compared the conformational space in the independent apo and holo simulations for Rg, Ree, SASA and secondary structure. As for PCA/TICA, we have computed the VAMP-2 score for TICA and found it to be low compared to that of the VAE. In fact, neural networks have previously been shown to be a better dimension reduction technique than linear methods such as PCA or TICA, owing to their non-linearity.
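
      The idea behind the CK test mentioned above is that, for a Markovian model, the transition matrix estimated at lag kτ should match the k-th power of the matrix estimated at lag τ. The sketch below illustrates only this comparison, using hypothetical 2-state matrices; it is not the authors' validation code, and in practice the matrix at the longer lag is re-estimated from the trajectory data.

```python
def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matrix_power(t, k):
    """k-th power of a square matrix by repeated multiplication."""
    n = len(t)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = matmul(result, t)
    return result

def ck_max_deviation(t_tau, t_ktau, k):
    """Largest absolute difference between T(tau)^k and the estimate T(k*tau)."""
    predicted = matrix_power(t_tau, k)
    return max(abs(p - e)
               for row_p, row_e in zip(predicted, t_ktau)
               for p, e in zip(row_p, row_e))

# Hypothetical transition matrices at lag tau and 2*tau (assumption).
t_tau = [[0.9, 0.1], [0.2, 0.8]]
t_2tau = [[0.83, 0.17], [0.34, 0.66]]
print(ck_max_deviation(t_tau, t_2tau, 2))
```

      A deviation close to zero for all tested multiples of the lag time supports the Markovian assumption at that lag.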

      Author response image 6.

      Distribution of (a) Rg, (b) Ree, and (c) SASA of the apo ensemble and a 60 μs slice of the holo simulation trajectory. (d) ITS plot of the 60 μs chunk.

      First, someone familiar with MSMs would understand that the basic philosophy of an MSM is not to require long simulation trajectories, which would defeat the purpose of its usage. Rather, as motivated by Noe and coworkers in their seminal PNAS paper (vol. 106, page 9011, year 2009), MSMs play an important role in inferring long-timescale equilibrium properties from much shorter non-equilibrium trajectories.

      Considering the difference in the size of the ensembles in the apo and holo simulations, we verified how the MSM built using a 60 μs slice of the data from the 1.5 ms holo simulation differs in terms of the number of metastable states identified by the model. For this, we considered the 60 μs of data spanning 966 μs to 1026 μs. First, we compared the gross structural properties of these datasets. Author response image 6a-c compares the distributions of Rg, Ree and SASA. The distributions show that the apo and holo simulations are very similar with respect to these standard properties of protein conformations.

      We built the MSM for this 60 μs data of the holo ensemble from the reduced data obtained from the same VAE model. We would like to clarify that the hyperparameters of the model are not hardcoded but rather carefully fine-tuned to obtain a good model that performs good kinetic discretization of the underlying macrostates. The implied timescale plot of this new MSM shows distinct timescales corresponding to six macrostates. This led us to conclude that the six-state model is robust despite the differences in the ensemble size. The implied timescale is shown in Author response image 6d.

      The above analyses in Author response image 6 are presented in Results and Discussion, Page 13. 

      Major Weakness 2: There is little justification of how the hyperparameters MSMs were selected. It is unclear if the results of the study depend on arbitrary hyperparameter selections such as the final number of macrostates in each model.

      It is unclear what criteria were used to determine the appropriate number of microstates and macrostates for each MSM. Most importantly - as all analyses of water entropy and conformational entropy are restricted to the final macrostates - the criteria used to select the final number of macrostates with PCCA+ are extremely important to the conclusions of the study. From examining the ITS plots in Figure 3 - it seems both MSMs show the same number of resolved processes (at least 11) - suggesting that a 10-state model could be appropriate for both systems. If one were to simply select a large number of macrostates for the 20x longer holo simulation - do these states converge to the same conformational entropy as the states seen in the short apo simulations? Is there some MSM quality metric used to determine what number of macrostates is more appropriate?

      Required Controls For Supporting the Conclusions of the Study: The authors should specify the criteria used to determine the appropriate number of microstates and macrostates for their MSMs and present controls that demonstrate that the conformational entropies calculated for their final states are not simply a function of the ratio of the number of macrostates chosen to represent very disparate amounts of conformational sampling.

      The VAMP-2 score was used to determine the number of microstates. We calculated the VAMP-2 score while varying the number of microstates from 10 to 220. We find that the VAMP-2 score saturates at a higher number of microstates for both apo and holo simulations.

      The number of macrostates was determined from the gaps between the lines of the implied timescales plot, followed by a CK test (shown in figure S1). Since we plotted the 10 slowest timescales, the implied timescales plot shows 10 timescales, and this is not an indicator of the number of macrostates. The macrostates are separated by distinct gaps in the timescales, which do not merge as seen beyond 5 timescales in the plot. The timescales, when leveled off and distinct, indicate that the system has well-defined metastable states and that the MSM is accurate in identifying the macrostates. We find the number of macrostates to be three and six for the apo and holo simulations, respectively, from the corresponding implied timescales.
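
      To make the gap criterion concrete: each implied timescale follows from a transition-matrix eigenvalue λ_i estimated at lag time τ as t_i = -τ / ln|λ_i|, and a large ratio between consecutive timescales marks a separation between slow and fast processes. The sketch below illustrates one common heuristic (largest ratio between consecutive timescales), which is not necessarily the authors' exact procedure; the eigenvalues are hypothetical, and the 32 ns lag time is the value quoted elsewhere in this response.

```python
import math

def implied_timescales(eigenvalues, lag_time):
    """t_i = -lag / ln|lambda_i| for non-stationary eigenvalues (0 < |lambda| < 1)."""
    return sorted(
        (-lag_time / math.log(abs(lam))
         for lam in eigenvalues if 0.0 < abs(lam) < 1.0),
        reverse=True,
    )

def largest_gap_state_count(timescales):
    """Macrostate count suggested by the largest ratio of consecutive timescales."""
    ratios = [timescales[i] / timescales[i + 1]
              for i in range(len(timescales) - 1)]
    best = max(range(len(ratios)), key=ratios.__getitem__)
    # (best + 1) slow processes above the gap, plus the stationary process.
    return best + 2

# Hypothetical non-stationary eigenvalues at a lag time of 32 ns (assumption).
eigenvalues = [0.999, 0.998, 0.90, 0.85]
timescales = implied_timescales(eigenvalues, lag_time=32.0)
print(timescales, largest_gap_state_count(timescales))
```

      In this toy case two slow processes lie above the gap, which would suggest a three-state coarse graining.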

      The above is discussed in Computational Methods, Page 37-38.

      Major Weakness 3: The use of variational autoencoders (VAEs) obscures insights into the underlying conformational ensembles of apo and holo alpha-synuclein rather than providing new ones

      No rationale is offered for the selection of the VAE architecture or hyperparameters used to reduce the dimensionality of alpha-synuclein conformational space.

      It is not clear that the VAEs employed in this study are providing any new insight into the conformational ensembles and binding mechanisms of Fasudil to alpha-synuclein, or if the underlying latent space of the VAEs is more informative or kinetically meaningful than standard linear dimensionality reduction techniques like PCA and tICA. The initial VAE is used to reduce the dimensionality of alpha-synuclein conformational ensembles to 2 degrees of freedom - but it is unclear if this projection is structurally or kinetically meaningful. It is not clear why the authors chose to use a 2-dimensional projection instead of a higher number of dimensions to build their MSMs. Can they produce a more kinetically and structurally meaningful model using a higher dimensional VAE latent space?

      Additionally - it is not clear what insights are provided by the Denoising Convolutional Variational Autoencoder. The authors appear to be noising-and-denoising the contact maps of each macrostate, and then projecting the denoised values onto a new latent space - and commenting that they are different. Does this provide additional insight that looking at the contact maps in Figures 4&5 does not? Is this more informative than examining the distribution of the Radii of gyration or the secondary structure propensities of each ensemble? It is not clear what insight this analysis adds to the manuscript.

      Suggested controls to improve the study: The authors should project interpretable IDP structural descriptors (i.e. secondary structure, radius of gyration, secondary structure content, # of intramolecular contacts, # of intermolecular contacts between alpha-synuclein and Fasudil) onto this latent space to illustrate if any of these properties are meaningfully separated by the VAE projection. The authors should compare these projections, and MSMs built from these projections, to projections and MSMs built from projections using standard linear dimensionality projection techniques like PCA and tICA.

      We have already pointed out the IDP structural parameters for the first question.

      In the case of the VAE, the latent space captures the underlying pattern of the higher-dimensional data. A non-linear projection using the VAE has been shown to have a higher VAMP-2 score than linear dimension reduction methods such as tICA. The latent space of the VAE was then used to build the MSM, in order to obtain the macrostates and the transition timescales among them. We can project the data onto a higher dimension, but the goal is to reduce it to lower dimensions where it is easier to interpret. A higher number of dimensions would also risk overfitting: instead of learning the pattern, the model may simply memorize the data. The training and validation loss curves from the VAE have reached the order of 10^-4, thereby indicating good reconstruction of the original data.

      As for dimension reduction using tICA, the VAMP-2 score confirms that our VAE model performs better than tICA. This manuscript uses deep neural networks to understand the structural and kinetic process of IDP and small molecule interaction. Dimension reduction using tICA would give different reaction coordinates, and an MSM built using the projected data of tICA would not be one-to-one comparable with that obtained from the VAE.

      We had to perform noising, as we had only 9 contact maps, which led to overfitting of the CVAE model. To overcome this problem, we introduced white noise to our data to prevent the model from overfitting. The objective of the DCVAE model was to see how distinct these contact maps are based on their locations in a lower-dimensional space. Visual inspection of an ensemble-averaged contact map is much more difficult for IDPs than for folded proteins. So, even before computing the Rg, Ree, SASA or secondary structure, this lower-dimensional space gives us a preliminary idea of how each macrostate differs from every other.
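
      The noising step described above can be pictured as generating several noisy copies of each contact map and training the network to recover the clean map. The sketch below is a minimal illustration of that idea only: the Gaussian noise level, the clipping to [0, 1], the number of copies and the toy maps are all assumptions, not values from the manuscript.

```python
import random

def add_white_noise(contact_map, sigma=0.05, seed=None):
    """Add Gaussian white noise to a 2D map and clip values to [0, 1]."""
    rng = random.Random(seed)
    return [[min(1.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in row]
            for row in contact_map]

def augment(contact_maps, copies_per_map, sigma=0.05, seed=0):
    """Return (noisy, clean) pairs for training a denoising autoencoder."""
    pairs = []
    for map_index, clean in enumerate(contact_maps):
        for copy_index in range(copies_per_map):
            copy_seed = seed + map_index * copies_per_map + copy_index
            pairs.append((add_white_noise(clean, sigma, copy_seed), clean))
    return pairs

# Nine hypothetical 2x2 contact maps standing in for the macrostate maps.
maps = [[[1.0, 0.1 * i], [0.1 * i, 1.0]] for i in range(9)]
pairs = augment(maps, copies_per_map=10)
print(len(pairs))
```

      Each pair feeds the noisy map as input and the clean map as the reconstruction target, so the model cannot simply memorize the nine originals.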

      As for the distribution of Rg, we have plotted it in Author response image 7. The residue-wise percentage secondary structure is plotted in figure S4-S6  for the holo and apo simulation respectively.

      Author response image 7.

      Distribution of radius of gyration for the three and six macrostates in the apo and holo simulation respectively.

      As for training a model with a higher number of latent dimensions, we have retrained a VAE model with four dimensions in the latent space. The loss was of the order of 10^-4. We built an MSM with the appropriate number of microstates and found the presence of six macrostates, as evident from the ITS plot shown in Figures S14 and S15.

      This data is presented in Results and Discussion, Page 13

      Major Weakness 4: The MSMs produced in this study have large discrepancies with MSMs previously produced on the same dataset by the same authors that are not discussed.

      Previously - two of the authors of this manuscript (Menon and Mondal) authored a preprint titled "Small molecule modulates α-synuclein conformation and its oligomerization via Entropy Expansion" (https://www.biorxiv.org/content/10.1101/2022.10.20.513005v1.full) that analyzed the same 1500us holo simulation of alpha-synuclein binding Fasudil. In this study - they utilized the variational approach to Markov processes (VAMP) to build an MSM using a 1D order parameter as input (the radius of gyration), first discretizing the conformational space into 300 microstates before similarly building a 6 macrostate model. From examining the contact maps and secondary structure propensities of the holo MSMs from the current study and the previous study- some of the macrostates appear similar, however there appear to be orders of magnitude differences in the timescales of conformational transitions between the two models. The timescales of conformational transitions in the previous MSM are on the order of 10s of microseconds, while the timescales of transitions in this manuscript are 100s-1000s microseconds. In the previous manuscript, a 3 state MSM is built from an apo α-synuclein obtained from a continuous 73ms unbiased MD simulation of alpha-synuclein run at a different salt concentration (100mM) and an additional 33 ms of shorter simulations. The apo MSM from the previous study similarly reports very fast timescales of transitions between apo states (on the order ~1ms) - while the MSM reported in the current study (Figure 9) are on the order of 10s-100s of microseconds).

      These discrepancies raise further concerns that the properties of the MSMs built on these systems are extremely sensitive to the chosen projection methods and MSM modeling choices and hyperparameters, and that neither model may be an accurate description of the true underlying dynamics.

      Suggestions to improve the study: The authors should discuss the discrepancies with the MSMs reported in their previous studies.

      In the previous preprint, the radius of gyration was used as the collective variable to build the MSM. In this manuscript, we have used a much more general collective variable: pairwise distances reduced in dimensionality using a VAE. The two models therefore differ in three respects. Firstly, the collective variables used to build them are different. Secondly, the 73 μs apo simulation in the previous manuscript was run at a salt concentration of 100 mM, whereas in this work we have used 50 mM, the same salt concentration as in the holo simulations. Since the simulation conditions differ in salt concentration, the conformational space sampled will be different, and this will be reflected in the features of the metastable states and the associated transition kinetics. Thirdly, the lag time at which the MSM was built was 3.6 ns in the previous manuscript, whereas in this work we have used 32 ns, a difference of roughly a factor of 10, so the order of the timescales has also changed. Thus, both the collective variable and the lag time at which the system reaches Markovianity differ between the two models, and hence the timescales of transitions among the macrostates differ as well. Because of these differences, it would not be correct to compare the results of the two investigations directly.
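
      To make the role of the lag time concrete, the standard MSM relation between an eigenvalue λ of the transition matrix estimated at lag time τ and its implied timescale is t = −τ / ln λ. The following is a minimal sketch of that relation only. The two lag times are those quoted above, while the eigenvalue is a hypothetical placeholder and is not taken from either study.

```python
import math


def implied_timescale(eigenvalue, lag_time_ns):
    """Implied timescale t = -tau / ln(lambda) for one MSM eigenvalue.

    The eigenvalue must lie strictly between 0 and 1 (a slow, decaying process).
    """
    if not 0.0 < eigenvalue < 1.0:
        raise ValueError("eigenvalue must lie strictly between 0 and 1")
    return -lag_time_ns / math.log(eigenvalue)


# Lag times quoted in the response above.
LAG_PREVIOUS_NS = 3.6
LAG_CURRENT_NS = 32.0

# Hypothetical eigenvalue, used only to illustrate the formula.
example_eigenvalue = 0.99

t_previous = implied_timescale(example_eigenvalue, LAG_PREVIOUS_NS)
t_current = implied_timescale(example_eigenvalue, LAG_CURRENT_NS)

print(f"ratio of lag times: {LAG_CURRENT_NS / LAG_PREVIOUS_NS:.1f}")
print(f"implied timescale at 3.6 ns lag: {t_previous:.0f} ns")
print(f"implied timescale at 32 ns lag:  {t_current:.0f} ns")
```

      In practice the eigenvalues themselves depend on the lag time, which is why the ITS plot is inspected for a plateau before a lag time is chosen.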

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      To highlight the role of the entropic expansion mechanism, I would suggest modifying the title to capture this result, for example: "An Integrated Machine Learning Approach Delineates an Entropic Expansion Mechanism for the Binding of a Small Molecule to α-Synuclein".

      We have changed the title as suggested by the reviewer.

      To my knowledge the binding of fasudil to alpha-synuclein has been shown in the simulations by Robustelli et al (JACS 2022), but the experimental evidence is less clear cut. If an experimental binding affinity and the effect on alpha-synuclein aggregation have been measured, they should be reported.

      Reviewer #2 (Recommendations For The Authors):

      We thank the reviewer for the careful evaluation of our manuscript and for providing comments and questions that we have attempted to address and incorporate.

      Minor

      Abstract:

      In "which is able to statistically distinguish fuzzy ensemble", what does the word "statistically" mean in this context? Do the authors present evidence that the two ensembles are statistically different, and if so in what ways?

      We have analyzed the apo and holo ensembles of aSyn using the framework of Markov State Models, which provides the stationary populations of the states that the model identifies. For this reason, we have used ‘which is able to statistically distinguish fuzzy ensemble’ as we compare and contrast the metastable states that we resolve using MSM. The MSM provides metastable states which are identified through statistical analysis of the transitions between states (transition probability matrix). We characterize their structural features to distinguish them which gives a meaningful interpretation of the fuzzy ensemble.
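
      As an illustration of what "stationary populations" means here, the sketch below computes the stationary distribution of a small row-stochastic transition matrix by power iteration. It is not the authors' pipeline: the 3-state matrix is a hypothetical toy example, and the function name is chosen only for this illustration.

```python
def stationary_distribution(transition_matrix, max_iter=10000, tol=1e-12):
    """Stationary distribution pi (pi = pi T) of a row-stochastic matrix T.

    Uses plain power iteration starting from a uniform distribution.
    """
    n = len(transition_matrix)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        # One propagation step: new_pi[j] = sum_i pi[i] * T[i][j]
        new_pi = [
            sum(pi[i] * transition_matrix[i][j] for i in range(n))
            for j in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new_pi, pi)) < tol:
            return new_pi
        pi = new_pi
    return pi


# Hypothetical 3-state transition matrix (each row sums to 1).
toy_T = [
    [0.90, 0.10, 0.00],
    [0.05, 0.90, 0.05],
    [0.00, 0.20, 0.80],
]

populations = stationary_distribution(toy_T)
for state, p in enumerate(populations):
    print(f"state {state}: population {p:.3f}")
```

      Comparing such populations, together with the structural features of each state, between the apo and holo models is the sense in which the ensembles are distinguished statistically.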

      Abstract:

      What does "entropic ordering" mean?

      We thank the reviewer for pointing this out. Here, we mean that the presence of the small molecule only affects the protein backbone entropy while the entropy of water is not affected in the simulations with fasudil. We will rewrite this more clearly in the abstract. 

      The changed sentence is as follows: 

      “A thermodynamic analysis indicates that small-molecule modulates the structural repertoire of αS by tuning protein backbone entropy, however the entropy of the water remains unperturbed.”

      Abstract:

      What does "offering insights into entropic modulation" mean?

      In this investigation, we first discretized the ensemble of a small-molecule binding/interacting with a disordered aSyn into the underlying metastable states, followed by characterisation of these identified states. As small molecule interactions can affect the overall entropy of the IDP, we estimated the said effect of fasudil binding on aSyn. We find that small molecule binding effect is manifested in the protein backbone entropy and the solvent entropy is not affected. Through this work, we highlight these insights into the modulatory effect that fasudil brings about in the entropy of the system (entropic modulation).

      p. 3/4:

      When the authors write "However, a routine comparison of monomeric αS ensemble... ensemble" it is unclear whether they are referring to previous work (they only cite a paper with simulations of "apo" aSYN, and if so which. Do they mean Ref 32? Also, the word "routine" sounds odd in this context.

      We thank the reviewer for pointing this out. We compared the ensemble properties (such as the distributions of the radius of gyration, end-to-end distance, solvent accessible surface area, and secondary structure properties) of the α-synuclein monomer ensemble that we generated in neat water with those of the α-synuclein ensemble in the presence of the small molecule fasudil reported in Robustelli et al. (Journal of the American Chemical Society, 144(6), pp. 2501-2510). We have now modified this sentence in the main manuscript as follows (page 3):

      “However, comparison of the global and local structural features of the αS ensemble in neat water and that in the presence of fasudil [32] (see Figure S1-S6) did not indicate a significant difference that is a customary signature of the dynamic IDP ensemble.”

      p. 4:

      Regarding "Integrative approaches are therefore gaining importance in IDP studies", these kinds of integrative approaches have been used for 20 years for studies of IDPs (with increasing sophistication and success), so I think "gaining" is somewhat of a stretch.

      We thank the reviewer for this comment. We agree with the reviewer and have now changed this sentence  as follows:

      “Integrative approaches have been exploited in studying IDPs as well as small-molecule binding to IDPs.”

      p. 5:

      What does "large scale" mean in "This study showed no large-scale differences between the bound and unbound states of αS"? Do the authors mean substantially/significantly different, or differences on a large (length) scale?

      Here, we refer to the study of small-molecule (fasudil) binding to α-synuclein reported in Robustelli et al. (Journal of the American Chemical Society, 144(6), pp. 2501-2510). In this study, the authors report no substantial (“large scale”) differences, for example in the backbone conformation distributions, between the conformational ensembles of α-synuclein in the fasudil-bound and unbound states.

      p. 6:

      The authors write "In a clear departure from the classical view of ligand binding to a folded globular protein, the visual change in αS ensemble due to the presence of small molecule is not so strikingly apparent." I don't understand this. Normally, there is very little difference between apo and holo protein structures for folded proteins, so I don't understand the "in a clear departure" part. This seems like a strawman. Of course, for folded proteins one can generally see the ligand bound, but here the authors are talking about the protein.

      In the case of folded proteins, the overall tertiary structure of the protein remains mostly the same upon binding of the ligand. Structural changes are localized and occur primarily around the binding site. However, in the case of αSyn, binding of fasudil is transient and not as strong as that seen for folded proteins. “Clear departure” refers to the fact that, for αSyn, the effect of fasudil binding is more subtle and dispersed across the ensemble of conformations, rather than producing localized changes as in the case of folded proteins.

      p. 6:

      I don't think the term "data-agnostic" makes sense since these methods are based on data and also make some assumptions about how the data can/should be used.

      We have replaced this term with “model-agnostic”.

      p. 16:

      How are contacts defined; please add to caption.

      A contact is considered if the Cα atoms of two residues are within a distance of 8 Å of each other. We have updated the caption with this information in Figures 4 and 5.  
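
      The contact criterion above can be sketched as follows for a single frame. This is an illustrative implementation, not the authors' analysis code: the function name and the toy coordinates are hypothetical, and the choices to include the boundary (distance ≤ 8 Å) and to leave the diagonal unset are assumptions.

```python
import math

CONTACT_CUTOFF_ANGSTROM = 8.0  # Cα-Cα cutoff stated in the response


def contact_map(ca_coords, cutoff=CONTACT_CUTOFF_ANGSTROM):
    """Boolean residue-residue contact map from Cα coordinates (in Å).

    Two residues are in contact if their Cα atoms are within `cutoff`.
    Self-contacts on the diagonal are left as False (an assumption).
    """
    n = len(ca_coords)
    contacts = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                contacts[i][j] = True
                contacts[j][i] = True  # the map is symmetric
    return contacts


# Hypothetical Cα positions for four residues along the x axis (Å).
toy_ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_contacts = contact_map(toy_ca)

for row in toy_contacts:
    print(["x" if c else "." for c in row])
```

      Averaging such maps over all frames assigned to a metastable state gives a per-state contact probability map.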

      p. 20:

      What do the authors mean by "non-specific interactions" in this context?

      The interactions of fasudil are predominantly with the negatively charged residues in the C-terminal region of αSyn via charge-charge and π-stacking interactions (Robustelli et al., Journal of the American Chemical Society, 144(6), pp. 2501-2510).

      In addition, in some metastable states that we identify, we also observe transient interactions with residues in the hydrophobic NAC region and N-terminal region. We refer to these transient interactions as “non-specific” interactions.

      p. 27:

      Are the axes of Fig. 9c/d z1 and z2?

      Yes, the axes are z1 and z2.

      Smaller than minor

      Abstract:

      Rephrase "In particular, the presence of fasudil in milieu"

      We have rephrased the sentence as follows: 

      “In particular, the presence of fasudil in the solvent…”

      p. 4:

      What does the word "potentially" do in "ensemble of conformations potentially sampled"?

      Here, by potentially, we mean the various conformations that the protein can adopt, subject to the environmental conditions. 

      p. 10:

      "we trained a large array of inter-residue pairwise distances"

      The distances were not trained; please reformulate

      We have corrected this sentence as follows:  

      “We trained a VAE model using a large array of inter-residue pairwise distances.”

      p. 13:

      N/C-terminal -> terminus (or in the C-terminal region)

      We have made the changes in the manuscript at the required places. 

      p. 20:

      Precedent -> previous (?)

      We have made the change in the manuscript. 

      p. 30:

      As far as I understand, Anton does not use GPUs and does not run Desmond.

      We thank the reviewer for providing this information. We referred to the original paper on the αSyn-fasudil simulations (Robustelli et al., Journal of the American Chemical Society, 144(6), pp. 2501-2510). The authors performed equilibration with GPU/Desmond and used Anton for production runs. We have modified this sentence as:

      “A 1500 μs long all-atom MD simulation trajectory of αS monomer in aqueous fasudil solution was simulated by D. E. Shaw Research with the Anton supercomputer that is specially purposed for running long-time-scale simulations.” on page 31

      References:

      (1) Schütte C, Fischer A, Huisinga W, Deuflhard P (1999) A direct approach to conformational dynamics based on hybrid Monte Carlo. J Comput Phys 151:146–168.

      (2) Chodera JD, Swope WC, Pitera JW, Dill KA (2006) Long-time protein folding dynamics from short-time molecular dynamics simulations. Multiscale Model Simul 5:1214–1226.

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This important study identifies the mitotic localization mechanism for Aurora B and INCENP (parts of the chromosomal passenger complex, CPC) in Trypanosoma brucei. The mechanism is different from that in the more commonly studied opisthokonts and there is solid support from RNAi and imaging experiments, targeted mutations, immunoprecipitations with crosslinking/mass spec, and AlphaFold interaction predictions. The results could be strengthened by biochemically testing proposed direct interactions and demonstrating that the targeting protein KIN-A is a motor. The findings will be of interest to parasitology researchers as well as cell biologists working on mitosis and cell division, and those interested in the evolution of the CPC.

      We thank the editor and the reviewers for their thorough and positive assessment of our work and the constructive feedback to further improve our manuscript. Please find below our responses to the reviewers’ comments. Please note that the conserved glycine residue in the Switch II helix in KIN-A was mistakenly labelled as G209 in the original manuscript. We have now corrected it to G210 in the revised manuscript.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The CPC plays multiple essential roles in mitosis such as kinetochore-microtubule attachment regulation, kinetochore assembly, spindle assembly checkpoint activation, anaphase spindle stabilization, cytokinesis, and nuclear envelope formation, as it dynamically changes its mitotic localization: it is enriched at inner centromeres from prophase to metaphase but it is relocalized at the spindle midzone in anaphase. The business end of the CPC is Aurora B and its allosteric activation module IN-box, which is located at the C-terminal part of INCENP. In most well-studied eukaryotic species, Aurora B activity is locally controlled by the localization module of the CPC, Survivin, Borealin, and the N-terminal portion of INCENP. Survivin and Borealin, which bind the N terminus of INCENP, recognize histone residues that are specifically phosphorylated in mitosis, while anaphase spindle midzone localization is supported by the direct microtubule-binding capacity of the SAH (single alpha helix) domain of INCENP and other microtubule-binding proteins that specifically interact with INCENP during anaphase, which are under the regulation of CDK activity. One of these examples includes the kinesin-like protein MKLP2 in vertebrates.

      Trypanosoma is an evolutionarily interesting species to study mitosis since its kinetochore and centromere proteins do not show any similarity to other major branches of eukaryotes, while orthologs of Aurora B and INCENP have been identified. Combining molecular genetics, imaging, biochemistry, cross-linking IP-MS (IP-CLMS), and structural modeling, this manuscript reveals that two orphan kinesin-like proteins KIN-A and KIN-B act as localization modules of the CPC in Trypanosoma brucei. The IP-CLMS, AlphaFold2 structural predictions, and domain deletion analysis support the idea that (1) KIN-A and KIN-B form a heterodimer via their coiled-coil domain, (2) Two alpha helices of INCENP interact with the coiled-coil of the KIN-A-KIN-B heterodimer, (3) the conserved KIN-A C-terminal CD1 interacts with the heterodimeric KKT9-KKT11 complex, which is a submodule of the KKT7-KKT8 kinetochore complex unique to Trypanosoma, (4) KIN-A and KIN-B coiled-coil domains and the KKT7-KKT8 complex are required for CPC localization at the centromere, (5) CD1 and CD2 domains of KIN-A support its centromere localization. The authors further show that the ATPase activity of KIN-A is critical for spindle midzone enrichment of the CPC. The imaging data of the KIN-A rigor mutant suggest that dynamic KIN-A-microtubule interaction is required for metaphase alignment of the kinetochores and proliferation. Overall, the study reveals novel pathways of CPC localization regulation via KIN-A and KIN-B by multiple complementary approaches.

      Strengths:

      The major conclusion is collectively supported by multiple approaches, combining site-specific genome engineering, epistasis analysis of cellular localization, AlphaFold2 structure prediction of protein complexes, IP-CLMS, and biochemical reconstitution (the complex of KKT8, KKT9, KKT11, and KKT12).

      We thank the reviewer for her/his positive assessment of our manuscript.

      Weaknesses:

      • The predictions of direct interactions (e.g. INCENP with KIN-A/KIN-B, or KIN-A with KKT9-KKT11) have not yet been confirmed experimentally, e.g. by domain mutagenesis and interaction studies.

      Thank you for this point. It is true that we do not have evidence for direct interactions between KIN-A and KKT9-KKT11. However, the interaction between INCENP and KIN-A/KIN-B is strongly supported by our cross-linking IP-MS of native complexes. Furthermore, we show that deletion of the INCENPCPC1 N-terminus predicted to interact with KIN-A:KIN-B abolishes kinetochore localization.

      • The criteria used to judge a failure of localization are not clearly explained (e.g., Figure 5F, G).

      As suggested by the reviewer in recommendation #14, we have now included example images for each category (‘kinetochores’, ‘kinetochores + spindle’, ‘spindle’) along with a schematic illustration in Fig. 5F.

      • It remains to be shown that KIN-A has motor activity.

      We thank the reviewer for this important comment. Indeed, motor activity remains to be demonstrated using an in vitro system, which is beyond the scope of this study. What we show here is that the motor domain of KIN-A effectively co-sediments with microtubules and that spindle localization of KIN-A is abolished upon deletion of the motor domain. Moreover, mutation of a conserved glycine residue in the Switch II region (G210) to alanine (‘rigor mutation’; Rice et al., 1999) renders KIN-A incapable of translocating to the central spindle, suggesting that its ATPase activity is required for this process. To clarify this point in the manuscript, we have replaced ‘motor activity’ of KIN-A with ‘ATPase activity’ wherever we refer to experiments performed using the KIN-A rigor mutant. In addition, we have included a multiple sequence alignment (MSA) of KIN-A and KIN-B from different kinetoplastids with human Kinesin-1, human Mklp2 and yeast Klp9 in Figures 6A and S6A, showing the conservation of key motifs required for ATP coordination and tubulin interaction. In the corresponding paragraph in the main text, we describe these data as follows:

      ‘We therefore speculated that anaphase translocation of the kinetoplastid CPC to the central spindle may involve the kinesin motor domain of KIN-A. KIN-B is unlikely to be a functional kinesin based on the absence of several well-conserved residues and motifs within the motor domain, which are fully present in KIN-A (Li et al., 2008). These include the P-loop, switch I and switch II motifs, which form the nucleotide binding cleft, and many conserved residues within the α4-L12 elements, which interact with tubulin (Fig. S6A) (Endow et al., 2010). Consistent with this, the motor domain of KIN-B, contrary to KIN-A, failed to localize to the mitotic spindle when expressed ectopically (Fig. S2E) and did not co-sediment with microtubules in our in vitro assay (Fig. S6B).’

      • The authors imply that KIN-A, but not KIN-B, interacts with microtubules based on microtubule pelleting assay (Fig. S6), but the substantial insoluble fractions of 6HIS-KINA and 6HIS-KIN-B make it difficult to conclusively interpret the data. It is possible that these two proteins are not stable unless they form a heterodimer.

      This is indeed a possibility. We are currently aiming at purifying full-length recombinant KIN-A and KIN-B (along with the other CPC components), which will allow us to perform in vitro interaction studies and to investigate biochemical properties of this complex (including the role of the motor domains of KIN-A and KIN-B) within the framework of an in-depth follow-up study. To address the point above, we have added the following text in the legend corresponding to Fig. S6:

      ‘Microtubule co-sedimentation assay with 6HIS-KIN-A2-309 (left) and 6HIS-KIN-B2-316 (right). S and P correspond to supernatant and pellet fractions, respectively. Note that both constructs to some extent sedimented even in the absence of microtubules. Hence, lack of microtubule binding for KIN-B may be due to the unstable non-functional protein used in this study.’

      • For broader context, some prior findings should be introduced, e.g. on the importance of the microtubule-binding capacity of the INCENP SAH domain and its regulation by mitotic phosphorylation (PMID 8408220, 26175154, 26166576, 28314740, 28314741, 21727193), since KIN-A and KIN-B may substitute for the function of the SAH domain.

      We have modified the introduction to include the following text and references mentioned by the reviewer: ‘The localization module comprises Borealin, Survivin and the N-terminus of INCENP, which are connected to one another via a three-helical bundle (Jeyaprakash et al., 2007, 2011; Klein et al., 2006). The two modules are linked by the central region of INCENP, composed of an intrinsically disordered domain and a single alpha helical (SAH) domain. INCENP harbours microtubule-binding domains within the N-terminus and the central SAH domain, which play key roles for CPC localization and function (Samejima et al., 2015; Kang et al., 2001; Noujaim et al., 2014; Cormier et al., 2013; Wheatley et al., 2001; Nakajima et al., 2011; Fink et al., 2017; Wheelock et al., 2017; van der Horst et al., 2015; Mackay et al., 1993).’

      Reviewer #2 (Public Review):

      How the chromosomal passenger complex (CPC) and its subunit Aurora B kinase regulate kinetochore-microtubule attachment, and how the CPC relocates from kinetochores to the spindle midzone as a cell transitions from metaphase to anaphase are questions of great interest. In this study, Ballmer and Akiyoshi take a deep dive into the CPC in T. brucei, a kinetoplastid parasite with a kinetochore composition that varies greatly from other organisms.

      Using a combination of approaches, most importantly in silico protein predictions using alphafold multimer and light microscopy in dividing T. brucei, the authors convincingly present and analyse the composition of the T. brucei CPC. This includes the identification of KIN-A and KIN-B, proteins of the kinesin family, as targeting subunits of the CPC. This is a clear advancement over earlier work, for example by Li and colleagues in 2008. The involvement of KIN-A and KIN-B is of particular interest, as it provides a clue for the (re)localization of the CPC during the cell cycle. The evolutionary perspective makes the paper potentially interesting for a wide audience of cell biologists, a point that the authors bring across properly in the title, the abstract, and their discussion.

      The evolutionary twist of the paper would be strengthened 'experimentally' by predictions of the structure of the CPC beyond T. brucei. Depending on how far the authors can extend their in-silico analysis, it would be of interest to discuss a) available/predicted CPC structures in well-studied organisms and b) structural predictions in other euglenozoa. What are the general structural properties of the CPC (e.g. flexible linkers, overall dimensions, structural differences when subunits are missing etc.)? How common is the involvement of kinesin-like proteins? In line with this, it would be good to display the figure currently shown as S1D (or similar) as a main panel.

      We thank the reviewer for her/his encouraging assessment of our manuscript and for appreciating the evolutionary relevance of our work. As suggested, we have moved the phylogenetic tree previously shown in Fig. S1D to the main Fig. 1F. Our AF2 analysis of CPC proteins and (sub)complexes from other kinetoplastids failed to predict reliable interactions among CPC proteins except for that between Aurora B and the IN box. It therefore remains unclear whether CPC structures are conserved among kinetoplastids. Because components of the CPC other than Aurora B and INCENP remain unknown in other euglenozoa, we cannot perform structural predictions of the CPC in diplonemids or euglenids.

      It remains unclear how common the involvement of kinesin-like proteins with the CPC is in other eukaryotes, partly because we could not identify an obvious homolog of KIN-A/KIN-B outside of kinetoplastids. Addressing this question would require experimental approaches in various eukaryotes (e.g. immunoprecipitation and mass spectrometry of Aurora B) as we carried out in this manuscript using Trypanosoma brucei.

      Reviewer #3 (Public Review):

      Summary:

      The protein kinase, Aurora B, is a critical regulator of mitosis and cytokinesis in eukaryotes, exhibiting a dynamic localisation. As part of the Chromosomal Passenger Complex (CPC), along with the Aurora B activator, INCENP, and the CPC localisation module comprised of Borealin and Survivin, Aurora B travels from the kinetochores at metaphase to the spindle midzone at anaphase, which ensures its substrates are phosphorylated in a time- and space-dependent manner. In the kinetoplastid parasite, T. brucei, the Aurora B orthologue (AUK1), along with an INCENP orthologue known as CPC1, and a kinetoplastid-specific protein CPC2, also displays a dynamic localisation, moving from the kinetochores at metaphase to the spindle midzone at anaphase, to the anterior end of the newly synthesised flagellum attachment zone (FAZ) at cytokinesis. However, the trypanosome CPC lacks orthologues of Borealin and Survivin, and T. brucei kinetochores also have a unique composition, being comprised of dozens of kinetoplastid-specific proteins (KKTs). Of particular importance for this study are KKT7 and the KKT8 complex (comprising KKT8, KKT9, KKT11, and KKT12). Here, Ballmer and Akiyoshi seek to understand how the CPC assembles and is targeted to its different locations during the cell cycle in T. brucei.

      Strengths & Weaknesses:

      Using immunoprecipitation and mass-spectrometry approaches, Ballmer and Akiyoshi show that AUK1, CPC1, and CPC2 associate with two orphan kinesins, KIN-A and KIN-B, and with the use of endogenously expressed fluorescent fusion proteins, demonstrate for the first time that KIN-A and KIN-B display a dynamic localisation pattern similar to other components of the CPC. Most of these data provide convincing evidence for KIN-A and KIN-B being bona fide CPC proteins, although the evidence that KIN-A and KIN-B translocate to the anterior end of the new FAZ at cytokinesis is weak - the KIN-A/B signals are very faint and difficult to see, and cell outlines/brightfield images are not presented to allow the reader to determine the cellular location of these faint signals (Fig S1B).

      We thank the reviewer for their thorough assessment of our manuscript and the insightful feedback to further improve our study. To address the point above, we have acquired new microscopy data for Fig. S1B and S1C, which now include phase contrast images, and have chosen representative cells in late anaphase and telophase. We hope that the signal of Aurora BAUK1, KIN-A and KIN-B at the anterior end of the new FAZ can now be distinguished more clearly.

      They then demonstrate, by using RNAi to deplete individual components, that the CPC proteins have hierarchical interdependencies for their localisation to the kinetochores at metaphase. These experiments appear to have been well performed, although only images of cell nuclei were shown (Fig 2A), meaning that the reader cannot properly assess whether CPC components have localised elsewhere in the cell, or if their abundance changes in response to depletion of another CPC protein.

      We chose to show close-ups of the nucleus to highlight the different localization patterns of CPC proteins under the different RNAi conditions. In none of these conditions did we observe mis-localization of CPC subunits to the cytoplasm. To clarify this point, we added the following sentence in the legend for Figure 2A:

      ‘A) Representative fluorescence micrographs showing the localization of YFP-tagged Aurora BAUK1, INCENPCPC1, KIN-A and KIN-B in 2K1N cells upon RNAi-mediated knockdown of indicated CPC subunits. Note that nuclear close-ups are shown here. CPC proteins were not detected in the cytoplasm. RNAi was induced with 1 μg/mL doxycycline for 24 h (KIN-B RNAi) or 16 h (all others). Cell lines: BAP3092, BAP2552, BAP2557, BAP3093, BAP2906, BAP2900, BAP2904, BAP3094, BAP2899, BAP2893, BAP2897, BAP3095, BAP3096, BAP2560, BAP2564, BAP3097. Scale bars, 2 μm.’

      Ballmer and Akiyoshi then go on to determine the kinetochore localisation domains of KIN-A and KIN-B. Using ectopically expressed GFP-tagged truncations, they show that coiled-coil domains within KIN-A and KIN-B, as well as a disordered C-terminal tail present only in KIN-A, but not the N-terminal motor domains of KIN-A or KIN-B, are required for kinetochore localisation. These data are strengthened by immunoprecipitating CPC complexes and crosslinking them prior to mass spectrometry analysis (IP-CLMS), a state-of-the-art approach, to determine the contacts between the CPC components. Structural predictions of the CPC structure are also made using AlphaFold2, suggesting that coiled coils form between KIN-A and KIN-B, and that KIN-A/B interact with the N termini of CPC1 and CPC2. Experimental results show that CPC1 and CPC2 are unable to localise to kinetochores if they lack their N-terminal domains consistent with these predictions. Altogether these data provide convincing evidence of the protein domains required for CPC kinetochore localisation and CPC protein interactions. However, the authors also conclude that KIN-B plays a minor role in localising the CPC to kinetochores compared to KIN-A. This conclusion is not particularly compelling as it stems from the observation that ectopically expressed GFP-NLS-KIN-A (full length or coiled-coil domain + tail) is also present at kinetochores during anaphase unlike endogenously expressed YFP-KIN-A. Not only is this localisation probably an artifact of the ectopic expression, but the KIN-B coiled-coil domain localises to kinetochores from S to metaphase and Fig S2G appears to show a portion of the expressed KIN-B coiled-coil domain colocalising with KKT2 at anaphase. It is unclear why KIN-B has been discounted here.

      As the reviewer points out, a small fraction of GFP-NLS-KIN-B317-624 is indeed detectable at kinetochores in anaphase, although most of the protein shows diffuse nuclear staining. There are various explanations for this phenomenon: It is conceivable that the KIN-B motor domain may contribute to microtubule binding and translocation of the CPC from kinetochores onto the spindle in anaphase. In our experiments, ectopically expressed KIN-B317-624 likely outcompetes a fraction of endogenous KIN-B for binding to KIN-A, which could interfere with this translocation process, leaving a population of CPC ‘stranded’ at kinetochores in anaphase. Another possibility, hinted at by the reviewer, is that the C-terminus of KIN-B interacts with receptors at the kinetochore/centromere. Although we do not discount this possibility, we nevertheless decided to focus on KIN-A in this study, because the anaphase kinetochore retention phenotype for both full-length GFP-NLS-KIN-A and -KIN-A309-862 is much stronger than for KIN-B317-624. Two additional reasons were that (i) KIN-A is highly conserved within kinetoplastids, whereas KIN-B orthologs are missing in some kinetoplastids, and (ii) no convincing interactions between KIN-B and kinetochore proteins were predicted by AF2.

      To address the reviewer’s point, we decided to include KIN-B in the title of this manuscript, which now reads: ‘Dynamic localization of the chromosomal passenger complex is controlled by the orphan kinesins KIN-A and KIN-B in the kinetoplastid parasite Trypanosoma brucei’.

      Moreover, we modified the corresponding paragraph in the results section as follows:

      ‘Intriguingly, unlike endogenously YFP-tagged KIN-A, ectopically expressed GFP fusions of both full-length KIN-A and KIN-A310-862 clearly localized at kinetochores even in anaphase (Figs. 2, F and H). Weak anaphase kinetochore signal was also detectable for KIN-B317-624 (Fig. S2F). GFP fusions of the central coiled-coil domain or the C-terminal disordered tail of KIN-A did not localize to kinetochores (data not shown). These results show that kinetochore localization of the CPC is mediated by KIN-A and KIN-B and requires both the central coiled-coil domain as well as the C-terminal disordered tail of KIN-A.’

      Next, using a mixture of RNAi depletion and LacI-LacO recruitment experiments, the authors show that kinetochore proteins KKT7 and KKT9 are required for AUK1 to localise to kinetochores (other KKT8 complex components were not tested here) and that all components of the KKT8 complex are required for KIN-A kinetochore localisation. Further, both KKT7 and KKT8 were able to recruit AUK1 to an ectopic locus in the S phase, and KKT7 recruited KKT8 complex proteins, which the authors suggest indicates it is upstream of KKT8. However, while these experiments have been performed well, the reciprocal experiment to show that KKT8 complex proteins cannot recruit KKT7, which could have confirmed this hierarchy, does not appear to have been performed. Further, since the LacI fusion proteins used in these experiments were ectopically expressed, they were retained (artificially) at kinetochores into anaphase; KKT8 and KIN-A were both able to recruit AUK1 to LacO foci in anaphase, while KKT7 was not. The authors conclude that this suggests the KKT8 complex is the main kinetochore receptor of the CPC - while very plausible, this conclusion is based on a likely artifact of ectopic expression, and for that reason, should be interpreted with a degree of caution.

      We previously showed that RNAi-mediated depletion of KKT7 disrupts kinetochore localization of KKT8 complex members, whereas kinetochore localization of KKT7 is unaffected by disruption of the KKT8 complex (Ishii and Akiyoshi, 2020). Moreover, in contrast to the KKT8 complex, KKT7 remains at kinetochores in anaphase (Akiyoshi and Gull, 2014). These data show that KKT7 is upstream of the KKT8 complex. In this context, the LacI-LacO tethering approach can be very useful to probe whether two proteins (or domains of proteins) could interact in vivo either directly or indirectly. However, a recruitment hierarchy cannot be inferred from such experiments because the data show only whether X can recruit Y to an ectopic locus (but not whether X is upstream of Y or vice versa). Regarding the retention of Aurora BAUK1 at kinetochores in anaphase upon ectopic expression of GFP-KKT8-LacI, we agree with the reviewer that these data need to be carefully interpreted. Nevertheless, the notion that the KKT7-KKT8 complex recruits the CPC to kinetochores is also strongly supported by IP-MS, RNAi experiments, and AF2 predictions. For clarification and to address the reviewer’s point, we re-formulated the corresponding paragraph in the main text:

      ‘We previously showed that KKT7 lies upstream of the KKT8 complex (Ishii and Akiyoshi, 2020). Indeed, GFP-KKT72-261-LacI recruited tdTomato-KKT8, -KKT9 and -KKT12 (Fig. S4E). Expression of both GFP-KKT72-261-LacI and GFP-KKT8-LacI resulted in robust recruitment of tdTomato-Aurora BAUK1 to LacO foci in S phase (Figs. 4, E and F). Intriguingly, we also noticed that, unlike endogenous KKT8 (which is not present in anaphase), ectopically expressed GFP-KKT8-LacI remained at kinetochores during anaphase (Fig. 4F). This resulted in a fraction of tdTomato-Aurora BAUK1 being trapped at kinetochores during anaphase instead of migrating to the central spindle (Fig. 4F). We observed a comparable situation upon ectopic expression of GFP-KIN-A, which is retained on anaphase kinetochores together with tdTomato-KKT8 (Fig. S4F). In contrast, Aurora BAUK1 was not recruited to LacO foci marked by GFP-KKT72-261-LacI in anaphase (Fig. 4E).’

      Further IP-CLMS experiments, in combination with recombinant protein pull-down assays and structural predictions, suggested that within the KKT8 complex, there are two subcomplexes of KKT8:KKT12 and KKT9:KKT11, and that KKT7 interacts with KKT9:KKT11 to recruit the remainder of the KKT8 complex. The authors also assess the interdependencies between KKT8 complex components for localisation and expression, showing that all four subunits are required for the assembly of a stable KKT8 complex and present AlphaFold2 structural modelling data to support the two subcomplex models. In general, these data are of high quality and convincing with a few exceptions. The recombinant pulldown assay (Fig. 4H) is not particularly convincing as the 3rd eluate gel appears to show a band at the size of KKT11 (despite the labelling indicating no KKT11 was present in the input) but no pulldown of KKT9, which was present in the input according to the figure legend (although this may be mislabeled since not consistent with the text). The text also states that 6HIS-KKT8 was insoluble in the absence of KKT12, but this is not possible to assess from the data presented.

      We thank the reviewer for pointing out an error in the text: ‘Removal of both KKT9 and KKT11 did not impact formation of the KKT8:KKT12 subcomplex’ should read ‘Removal of either KKT9 or KKT11 did not impact formation of the KKT8:KKT12 subcomplex’. Regarding the very faint band perceived to be KKT11 in the 3rd eluate: This band runs slightly lower than KKT11 and likely represents a bacterial contaminant (which we have seen also in other preps in the past). We have made a note of this in the corresponding legend (new Fig. 4I). Moreover, we provide the estimated molecular weights for each subunit, as suggested by the reviewer in recommendation #14 (see below):

      ‘(I) Indicated combinations of 6HIS-tagged KKT8 (~46 kDa), KKT9 (~39 kDa), KKT11 (~29 kDa) and KKT12 (~23 kDa) were co-expressed in E. coli, followed by metal affinity chromatography and SDS-PAGE. The asterisk indicates a common contaminant.’

      The corresponding paragraph in the results section now reads:

      ‘To validate these findings, we co-expressed combinations of 6HIS-KKT8, KKT9, KKT11 and KKT12 in E. coli and performed metal affinity chromatography (Fig. 4I). 6HIS-KKT8 efficiently pulled down KKT9, KKT11 and KKT12, as shown previously (Ishii and Akiyoshi, 2020). In the absence of KKT9, 6HIS-KKT8 still pulled down KKT11 and KKT12. Removal of either KKT9 or KKT11 did not impact formation of the KKT8:KKT12 subcomplex. In contrast, 6HIS-KKT8 could not be recovered without KKT12, indicating that KKT12 is required for formation of the full KKT8 complex. These results support the idea that the KKT8 complex consists of KKT8:KKT12 and KKT9:KKT11 subcomplexes.’

      It is also surprising that data showing the effects of KKT8, KKT9, and KKT12 depletion on KKT11 localisation and abundance are not presented alongside the reciprocal experiments in Fig S4G-J.

      YFP-KKT11 is delocalized upon depletion of KKT8 and KKT9 (see below). Unfortunately, we were unsuccessful in our attempts at deriving the corresponding KKT12 RNAi cell line, rendering this set of data incomplete. Because these data are not of critical importance for this study, we decided not to invest more time in attempting further transfections.

      Author response image 1.

      The authors also convincingly show that AlphaFold2 predictions of interactions between KKT9:KKT11 and a conserved domain (CD1) in the C-terminal tail of KIN-A are likely correct, with CD1 and a second conserved domain, CD2, identified through sequence analysis, acting synergistically to promote KIN-A kinetochore localisation at metaphase, but not being required for KIN-A to move to the central spindle at anaphase. They then hypothesise that the kinesin motor domain of KIN-A (but not KIN-B which is predicted to be inactive based on non-conservation of residues key for activity) determines its central spindle localisation at anaphase through binding to microtubules. In support of this hypothesis, the authors show that KIN-A, but not KIN-B can bind microtubules in vitro and in vivo. However, ectopically expressed GFP-NLS fusions of full-length KIN-A or KIN-A motor domain did not localise to the central spindle at anaphase. The authors suggest this is due to the GFP fusion disrupting the ATPase activity of the motor domain, but they provide no evidence that this is the case. Instead, they replace endogenous KIN-A with a predicted ATPase-defective mutant (G209A), showing that while this still localises to kinetochores, the kinetochores were frequently misaligned at metaphase, and that it no longer concentrates at the central spindle (with concomitant mis-localisation of AUK1), causing cells to accumulate at anaphase. From these data, the authors conclude that KIN-A ATPase activity is required for chromosome congression to the metaphase plate and its central spindle localisation at anaphase. While potentially very interesting, these data are incomplete in the absence of any experimental data to show that KIN-A possesses ATPase activity or that this activity is abrogated by the G209A mutation, and the conclusions of this section are rather speculative.

      Thank you for this important comment, which relates to a similar point raised by Reviewer 1 (see above). Indeed, ATPase and motor activity of KIN-A remain to be demonstrated biochemically using recombinant proteins, which is beyond the scope of this study. We generated MSAs of KIN-A and KIN-B from different kinetoplastids with human Kinesin-1, human Mklp2 and yeast Klp9, which are now presented in Figure 6A and S6A. These clearly show that key motifs required for ATP or tubulin binding in other kinesins are highly conserved in KIN-A (but not KIN-B). This includes the conserved glycine residue in the Switch II helix (G234 in human Kinesin-1, G210 in T. brucei KIN-A), which forms a hydrogen bond with the γ-phosphate of ATP, and upon mutation has been shown to impair ATPase activity and trap the motor head in a strong microtubule-binding (‘rigor’) state (Rice et al., 1999; Sablin et al., 1996). The prominent rigor phenotype of KIN-AG210A is consistent with KIN-A having ATPase activity. In addition to the data in Fig. 6A and S6A, we made the following changes to the main text:

      ‘We therefore speculated that anaphase translocation of the kinetoplastid CPC to the central spindle may involve the kinesin motor domain of KIN-A. KIN-B is unlikely to be a functional kinesin based on the absence of several well-conserved residues and motifs within the motor domain, which are fully present in KIN-A (Li et al., 2008). These include the P-loop, switch I and switch II motifs, which form the nucleotide binding cleft, and many conserved residues within the α4-L12 elements, which interact with tubulin (Fig. S6A) (Endow et al., 2010). Consistent with this, the motor domain of KIN-B, contrary to KIN-A, failed to localize to the mitotic spindle when expressed ectopically (Fig. S2E) and did not co-sediment with microtubules in our in vitro assay (Fig. S6B).

      Ectopically expressed GFP-KIN-A and -KIN-A2-309 partially localized to the mitotic spindle but failed to concentrate at the midzone during anaphase (Figs. 2, F and G), suggesting that N-terminal tagging of the KIN-A motor domain may interfere with its function. To address whether the ATPase activity of KIN-A is required for central spindle localization of the CPC, we replaced one allele of KIN-A with a C-terminally YFP-tagged G210A ATP hydrolysis-defective rigor mutant (Fig. 6A) (Rice et al., 1999) and used an RNAi construct directed against the 3’UTR of KIN-A to deplete the untagged allele. The rigor mutation did not affect recruitment of KIN-A to kinetochores (Figs. S6, C and D). However, KIN-AG210A-YFP marked kinetochores were misaligned in ~50% of cells arrested in metaphase, suggesting that ATPase activity of KIN-A promotes chromosome congression to the metaphase plate (Figs. S6, E-H).’

      Impact:

      Overall, this work uses a wide range of cutting-edge molecular and structural predictive tools to provide a significant amount of new and detailed molecular data that shed light on the composition of the unusual trypanosome CPC and how it is assembled and targeted to different cellular locations during cell division. Given the fundamental nature of this research, it will be of interest to many parasitology researchers as well as cell biologists more generally, especially those working on aspects of mitosis and cell division, and those interested in the evolution of the CPC.

      We thank the reviewer for his/her feedback and thoughtful and thorough assessment of our study.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) Why did the authors omit KIN-B from the title?

      We decided to add KIN-B in the title. Please see our response to Reviewer #3 (public review).

      (2) Abstract, line 28, "Furthermore, the kinesin motor activity of KIN-A promotes chromosome alignment in prometaphase and CPC translocation to the central spindle upon anaphase onset." This must be revised - see public review.

      We changed this section of the abstract as follows:

      ‘Furthermore, the ATPase activity of KIN-A promotes chromosome alignment in prometaphase and CPC translocation to the central spindle upon anaphase onset. Thus, KIN-A constitutes a unique ‘two-in-one’ CPC localization module in complex with KIN-B, which directs the CPC to kinetochores (from S phase until metaphase) via its C-terminal tail, and to the central spindle (in anaphase) via its N-terminal kinesin motor domain.’

      (3) Line 87-90. The findings by Li et al., 2008 (KIN-A and KIN-B interacting with Aurora B and epistasis analysis) should be introduced more comprehensively in the Introduction section.

      We added the following sentence in the introduction:

      ‘In addition, two orphan kinesins, KIN-A and KIN-B, have been proposed to transiently associate with Aurora BAUK1 during mitosis (Li et al., 2008; Li, 2012).’

      (4) Figure 1B. The way the Trypanosoma cell cycle is defined should be briefly explained in the main text, rather than just referring to the figure.

      The ‘KN’ annotation of the trypanosome cell cycle is explained in the Figure 1 legend. We now also added a brief description in the main text:

      ‘We next assessed the localization dynamics of fluorescently tagged KIN-A and KIN-B over the course of the cell cycle (Figs. 1, B-E). T. brucei possesses two DNA-containing organelles, the nucleus (‘N’) and the kinetoplast (‘K’). The kinetoplast is an organelle found uniquely in kinetoplastids, which contains the mitochondrial DNA and replicates and segregates prior to nuclear division. The ‘KN’ configuration serves as a good cell cycle marker (Woodward and Gull, 1990; Siegel et al., 2008).’

      (5) Line 118. Throughout the paper, it is not clear why GFP-NLS fusion was used instead of GFP fusion. Please justify the fusion of NLS.

      NLS refers to a short ‘nuclear localization signal’ (TGRGHKRSREQ) (Marchetti et al., 2000), which ensures that the ectopically expressed construct is imported into the nucleus. When we previously expressed truncations of KKT2 and KKT3 kinetochore proteins, many fragments did not go into the nucleus presumably due to the lack of an NLS, which prevented us from determining which domains are responsible for their kinetochore localization. We have since consistently used this short NLS sequence in our inducible GFP fusions without any complications. We added a sentence in the Materials & Methods section under Trypanosome culture: ‘All constructs for ectopic expression of GFP fusion proteins include a short nuclear localization signal (NLS) (Marchetti et al., 2000).’ To avoid unnecessary confusion, we removed ‘NLS’ from the main text and figures.

      (6) Line 121, "Unexpectedly". It is not clear why this was unexpected.

      To clarify this point, we modified this paragraph in the results section:

      ‘To our surprise, KIN-A-YFP and GFP-KIN-B exhibited a CPC-like localization pattern identical to that of Aurora BAUK1: Both kinesins localized to kinetochores from S phase to metaphase, and then translocated to the central spindle in anaphase (Figs. 1, C-E). Moreover, like Aurora BAUK1, a population of KIN-A and KIN-B localized at the new FAZ tip from late anaphase onwards (Figs. S1, B and C). This was unexpected, because KIN-A and KIN-B were previously reported to localize to the spindle but not to kinetochores or the new FAZ tip (Li et al., 2008). These data suggest that KIN-A and KIN-B are bona fide CPC proteins in trypanosomes, associating with Aurora BAUK1, INCENPCPC1 and CPC2 throughout the cell cycle.’

      (7) Line 127-129. Defining homologs and orthologs is tricky - there are many homologs and paralogs of kinesin-like proteins. The method to define the presence or absence of KIN-A/KIN-B homologs should be described in the Materials and Methods section.

      Due to the difficulty in defining true orthologs for kinesin-like proteins, we took a conservative approach based on reciprocal best BLAST hits. We first searched for KIN-A homologs using BLAST in the TriTryp database or using hmmsearch with manually prepared HMM profiles. When the top hit in a given organism returned T. brucei KIN-A in a reciprocal BLAST search against the T. brucei proteome, we considered the hit a true ortholog. We modified the Materials and Methods section as follows:

      ‘Searches for homologous proteins were done using BLAST in the TriTryp database (Aslett et al., 2010) or using hmmsearch using manually prepared hmm profiles (HMMER version 3.0; Eddy, 1998). The top hit was considered as a true ortholog only if the reciprocal BLAST search returned the query protein in T. brucei.’
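
      The reciprocal-best-hit criterion described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the BLAST results have already been parsed into (query, subject, bit score) tuples, it interprets ‘returned the query protein’ as ‘the query is the top-scoring reverse hit’, and all protein identifiers are hypothetical placeholders.

```python
from typing import Dict, Iterable, Optional, Tuple

# One hit = (query_id, subject_id, bit_score), e.g. parsed from tabular BLAST output.
Hit = Tuple[str, str, float]


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Map each query to its single highest-scoring subject."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_ortholog(
    query_id: str,
    forward_hits: Iterable[Hit],
    reverse_hits: Iterable[Hit],
) -> Optional[str]:
    """Return the candidate ortholog only if the reverse search recovers the query."""
    candidate = best_hits(forward_hits).get(query_id)
    if candidate is None:
        return None  # no forward hit at all
    # Assumption: "reciprocal" means the query is the TOP reverse hit.
    if best_hits(reverse_hits).get(candidate) == query_id:
        return candidate
    return None


# Hypothetical toy data: T. brucei query searched against another organism,
# then that organism's proteins searched back against the T. brucei proteome.
forward_hits = [("Tb_KIN-A", "Lm_0001", 250.0), ("Tb_KIN-A", "Lm_0002", 120.0)]
reverse_hits = [
    ("Lm_0001", "Tb_KIN-A", 240.0),
    ("Lm_0001", "Tb_KINX", 100.0),
    ("Lm_0002", "Tb_KINX", 130.0),
]

print(reciprocal_ortholog("Tb_KIN-A", forward_hits, reverse_hits))  # prints Lm_0001
```

      In this toy case, a forward top hit of Lm_0002 would be rejected, because its best reverse hit is a different T. brucei kinesin; this is the situation the conservative criterion is meant to exclude.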

      (8) Line 156. For non-experts of Trypanosoma cell biology, it is not clear how the nucleolar localization is defined.

      The nucleolus in T. brucei is discernible as a DAPI-dim region in the nucleus.

      (9) Fig.2G and Fig.S2F. These data imply that the coiled-coil and C-terminal tail domains of KIN-A/KIN-B are important for anaphase spindle midzone enrichment. However, it is odd that this was not mentioned. This reviewer recommends that the authors quantify the midzone localization data of these constructs and discuss the role of the coiled-coil domains.

      One possibility is that KIN-A and KIN-B need to form a complex (via their coiled-coil domains) to localize to the spindle midzone. Another likely possibility, which is discussed in the manuscript, is that N-terminal tagging of KIN-A impairs motor activity. This is supported by the fact that the central spindle localization is also disrupted in full-length GFP-KIN-A. We decided not to provide a quantification for these data due to low sample sizes for some of the constructs (e.g. expression not observed in all cells).

      (10) Line 288-289, "pLDDT scores improved significantly for KIN-A CD1 in complex with KKT9:KKT11 (>80) compared to KIN-A CD1 alone (~20) (Figs. S3, A and B)." I can see that the pLDDT score is about 20 at KIN-A CD1 from Figs S3A, but the basis of pLDDT > 80 upon inclusion of KKT9:KKT11 is missing.

      We added the pLDDT and PAE plots for the AF2 prediction of KIN-A700-800 in complex with KKT9:KKT11 in Fig. S5B.
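
      For readers who wish to check such values directly from a model file, the sketch below averages per-residue pLDDT over a residue range. It relies on the fact that AlphaFold2/ColabFold PDB outputs store pLDDT in the B-factor column; the helper that fabricates ATOM records, the chain ID, the residue numbers and the scores are hypothetical toy values, not data from the manuscript.

```python
from typing import Iterable


def mean_plddt(pdb_lines: Iterable[str], chain: str, start: int, end: int) -> float:
    """Average pLDDT (B-factor column) over CA atoms of residues start..end."""
    scores = []
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        # Fixed-width PDB columns: atom name 13-16, chain ID 22,
        # residue number 23-26, B-factor 61-66 (1-based).
        if line[12:16].strip() != "CA" or line[21] != chain:
            continue
        resnum = int(line[22:26])
        if start <= resnum <= end:
            scores.append(float(line[60:66]))
    if not scores:
        raise ValueError("no CA atoms found in the requested range")
    return sum(scores) / len(scores)


def toy_ca_line(serial: int, chain: str, resnum: int, plddt: float) -> str:
    """Hypothetical helper: build one correctly aligned CA ATOM record."""
    return (
        f"ATOM  {serial:>5d}  CA  GLY {chain}{resnum:>4d}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}"
    )


# Toy model: three residues on chain A with made-up pLDDT values.
toy_lines = [
    toy_ca_line(1, "A", 700, 20.0),
    toy_ca_line(2, "A", 701, 22.0),
    toy_ca_line(3, "A", 702, 90.0),
]

print(mean_plddt(toy_lines, "A", 700, 701))  # prints 21.0
```

      Applied to the same residue range in the monomer and complex models, such a summary would give the kind of comparison quoted by the reviewer, although the plots in the figure remain the primary evidence.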

      (11) Fig. 5A. Since there is no supporting biochemical data for KIN-A-KKT9-KKT11 interaction, it is important to assess the stability of AlphaFold-based structural predictions of the KIN-A-KKT9-KKT11 interaction. Are there significant differences among the top 5 prediction results, and do these interactions remain stable after the "simulated annealing" process used in the AlphaFold predictions? Are predicted CD1-interacting regions/amino residues in KKT9 and KKT11 evolutionarily conserved?

      See above. The interaction was predicted in all 5 predictions as shown in Fig. S5B. Conservation of the CD1-interacting regions in KKT9 and KKT11 is shown below:

      Author response image 2.

      KKT9 (residues ~53 – 80 predicted to interact with KIN-A in T. brucei)

      Author response image 3.

      KKT11 (residues 61-85 predicted to interact with KIN-A in T. brucei)

      (12) Line 300, Fig. S5D and E, "failed to localize at kinetochores". From this resolution of the microscopy images, it is not clear if these proteins fail to localize at kinetochores as the KKT and KIN-A310-716 signals overlap. Perhaps, "failed to enrich at kinetochores" is a more appropriate statement.

      We changed this sentence according to the reviewer’s suggestion.

      (13) Line 309 and Fig 5D and F, "predominantly localized to the mitotic spindle". From this image shown in Fig 5D, it is not clear if KIN-A∆CD1-YFP and Aurora B are predominantly localized to the spindle or if they are still localized to centromeres that are misaligned on the spindle. Without microtubule staining, it is also not clear how microtubules are distributed in these cells. Please clarify how the presence or absence of kinetochore/spindle localization was defined.

      As shown in Fig. S5E and S5F, deletion of CD1 clearly impairs kinetochore localization of KIN-A (kinetochores marked by tdTomato-KKT2). Moreover, misalignment of kinetochores, as observed upon expression of the KIN-AG210A rigor mutant, would result in an increase in 2K1N cells and proliferation defects, which is not the case for the KIN-A∆CD1 mutant (Fig. 5H, Fig. S5I). KIN-A∆CD1-YFP appears to localize diffusely along the entire length of the mitotic spindle, whereas we still observe kinetochore-like foci in the rigor mutant. Unfortunately, we do not have suitable antibodies that would allow us to distinguish spindle microtubules from the vast subpellicular microtubule array present in T. brucei and hence need to rely on tagging spindle-associated proteins such as MAP103.

      (14) Fig. 5F, G, S5F. Along the same lines, it would be helpful to show example images for each category - "kinetochores", "kinetochores + spindle", and "spindle".

      As suggested by the reviewer, we have now included example images for each category (‘kinetochores’, ‘kinetochores + spindle’, ‘spindle’) along with a schematic illustration in Fig. 5F.

      (15) Line 332 and Fig. S6A. The experiment may be repeated in the presence of ATP or nonhydrolyzable ATP analogs.

      We thank the reviewer for the suggestion. We envisage such experiments for an in-depth follow-up study.

      (16) Line 342, "motor activity of KIN-A". Until KIN-A is shown to have motor activity, the result based on the rigor mutant does not show that the motor activity of KIN-A promotes chromosome congression. The result suggests that the ATPase activity of KIN-A is important.

      We changed that sentence as suggested by the reviewer.

      (17) Line 419 -. The authors base their discussion on the speculation that KIN-A is a plus-end directed motor. Please justify this speculation.

      Indeed, the notion that KIN-A is a plus-end directed motor remains a hypothesis, which is based on sequence alignments with other plus-end directed motors and the observation that the KIN-A motor domain is involved in translocation of the CPC to the central spindle in anaphase. We have modified the corresponding section in the discussion as follows:

      ‘It remains to be investigated whether KIN-A truly functions as a plus-end directed motor. The role of KIN-B in this context is equally unclear. Since KIN-B does not possess a functional kinesin motor domain, we deem it unlikely that the KIN-A:KIN-B heterodimer moves hand-over-hand along microtubules as do conventional (kinesin-1 family) kinesins. Rather, the KIN-A motor domain may function as a single-headed unit and drive processive plus-end directed motion using a mechanism similar to the kinesin-3 family kinesin KIF1A (Okada and Hirokawa, 1999).’

      (18) Line 422-423, "plus-end directed motion using a mechanism similar to kinesin-3 family kinesins (such as KIF1A)." Please cite a reference supporting this statement.

      See above. We now cite Okada and Hirokawa (1999).

      Reviewer #2 (Recommendations For The Authors):

      Please provide a quantification of data shown in Figure 2F-H and described in lines 151-166.

      We decided not to provide a quantification for these data due to low sample sizes for some of the constructs (e.g. expression not observed in all cells).

      It appears as if the paper more or less follows a chronological order of the experiments that were performed before AF multimer enabled the insightful and compelling structural analysis. That is a matter of style, but in some cases, the writing could be updated, shortened, or re-arranged into a more logical order. Concrete examples:

      (i) Line 144: "we did not include CPC2 for further analysis in this study" Although CPC2 features at a prominent and interesting position in the predicted structures of the kinetoplastid CPC, shown in later main figures.

      We attempted RNAi-mediated depletion of CPC2 using two different shRNA constructs. However, we cannot exclude the possibility that the knockdown of CPC2 was less efficient compared with the other CPC subunits. For this reason, we decided to remove all the data on CPC2 from Fig. S2.

      (ii) The work with the KIN-A motor domain only and KIN-A ∆motor domain (Fig 2) begs the question about a more subtle mutation to interfere with the motor domain. Which is ultimately presented in Fig 6. I think that the final paragraph and Figure 6 follow naturally after Figure 2.

      We appreciate the suggestion. However, we would prefer to keep Figure 6 in its current position.

      (iii) The high-confidence structural predictions in Fig 3 and Fig 4 are insightful. The XL-MS descriptions that precede them are not so helpful (Fig 3A and 4G and in the text). To emphasize their status as experimental support for the predicted structures, which is very important, it would be good to discuss the XL-MS after presenting the models.

      As suggested, we have re-arranged the text and/or figures such that the AF2 predictions are discussed first and the CLMS data are brought in afterwards.

      Figure 1A prominently features an arbitrary color code and a lot of protein IDs without a legend. That is not a very convincing start. Figure S1 is more informative, containing annotated protein names and results of the KIN-A and KIN-B IPs. Please improve Figure 1A, for example by presenting a modified version of Figure S1. In all these types of figures, please list both protein names and gene IDs.

      We agree with the reviewer that the IP-MS data in Fig. S1 is more informative and hence decided to swap the heatmaps in Fig. 1A and Fig. S1A. We further annotated the heatmap corresponding to the Aurora BAUK1 IP-MS (now presented in Fig. S1) as suggested by the reviewer.

      The visualization of the structural predictions is not consistent among figures:

      (i) The structure in Fig 4I is important and could be displayed larger. The pLDDT scores, and especially those of the non-displayed models, do not add much information and should not be a main panel. If the authors want to display the pLDDT scores, I recommend a panel (main or supplement) of the structure colored for local prediction confidences, as in Fig 5A.

      (ii) In Figure 5A itself, it is hard to follow the chains in general, and KIN-A in particular, since the structure is pLDDT-coloured. Please present an additional panel colored by chain (consistent with Fig 4I, as mentioned above).

      (iii) The summarizing diagram, currently displayed as Fig 4J, should be placed after Fig 5A and take the discovered KIN-A - KKT9-11 connection into account. Ideally, it also covers the suspected importance of the motor domain and serves as a summarising diagram.

      We thank the reviewer for the constructive comments. For each structure prediction, we now present two images side by side: one coloured by chain and one coloured by pLDDT. We recently re-ran AF2 for the full CPC and also for the KKT7N-KKT8 complex, and got improved predictions. Hence some of the models in Fig. 3/S3 and Fig. 4/S4 have been updated accordingly. For the CLMS plots, we also decided to colour the cross-links according to whether the 30 angstrom distance constraints were fulfilled or not in the AF2 prediction. We also increased the size of the structures shown in Fig. 4. Furthermore, we decided to remove the summarizing diagram from Fig. 4 and instead made a new main Fig. 7, which shows a more detailed schematic, which also takes into account the proposed function of the KIN-A motor domain, as suggested by the reviewer, and other points addressed in the Discussion.
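
      The cross-link colouring criterion mentioned above amounts to a simple distance check against the predicted model, which can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: it assumes the distance is measured between Cα atoms of the two cross-linked residues, and the coordinates, chain labels and residue numbers are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Residue = Tuple[str, int]             # (chain ID, residue number)
Coord = Tuple[float, float, float]    # Cα coordinates in angstrom

MAX_DISTANCE = 30.0  # angstrom cut-off quoted in the response


def classify_crosslinks(
    ca_coords: Dict[Residue, Coord],
    crosslinks: List[Tuple[Residue, Residue]],
    cutoff: float = MAX_DISTANCE,
) -> List[Tuple[Residue, Residue, float, bool]]:
    """Return (residue_a, residue_b, distance, satisfied) for each cross-link."""
    results = []
    for res_a, res_b in crosslinks:
        distance = math.dist(ca_coords[res_a], ca_coords[res_b])
        results.append((res_a, res_b, distance, distance <= cutoff))
    return results


# Hypothetical toy model with three residues on two chains.
toy_coords = {
    ("A", 10): (0.0, 0.0, 0.0),
    ("B", 20): (0.0, 0.0, 25.0),
    ("B", 40): (0.0, 40.0, 0.0),
}
toy_links = [(("A", 10), ("B", 20)), (("A", 10), ("B", 40))]

for res_a, res_b, distance, ok in classify_crosslinks(toy_coords, toy_links):
    label = "satisfied" if ok else "violated"
    print(res_a, res_b, f"{distance:.1f}", label)
```

      In the toy data, the first cross-link spans 25 angstrom and is counted as satisfied, whereas the second spans 40 angstrom and is counted as violated; the two classes would then be drawn in different colours.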

      The methods section for the structural predictions lacks essential information. Predictions can only be reproduced if the version of AF2 multimer v2.x is specified and key parameters are mentioned.

      As suggested, we have added the details in the Materials and Methods section as follows.

      ‘Structural predictions of KIN-A/KIN-B, KIN-A310-862/KIN-B317-624, CPC1/CPC2/KIN-A300-599/KIN-B317-624, and KIN-A700-800/KKT9/KKT11 were performed using ColabFold version 1.3.0 (AlphaFold-Multimer version 2), while those of AUK1/CPC1/CPC2/KIN-A1-599/KIN-B, KKT71-261/KKT9/KKT11/KKT8/KKT12, KKT9/KKT11/KKT8/KKT12, and KKT71-261/KKT9/KKT11 were performed using ColabFold version 1.5.3 (AlphaFold-Multimer version 2.3.1) using default settings, accessed via https://colab.research.google.com/github/sokrypton/ColabFold/blob/v1.3.0/AlphaFold2.ipynb and https://colab.research.google.com/github/sokrypton/ColabFold/blob/v1.5.3/AlphaFold2.ipynb.’

      Line 121, please explain the "Unexpectedly" by including a reference to the work from Li and colleagues. A statement with some details would be useful, as the difference between both studies appears to be crucial for the novelty of this paper. Alternatively, refer to this being covered in the discussion.

      To clarify this point, we modified this paragraph in the results section:

      ‘To our surprise, KIN-A-YFP and GFP-KIN-B exhibited a CPC-like localization pattern identical to that of Aurora BAUK1: Both kinesins localized to kinetochores from S phase to metaphase, and then translocated to the central spindle in anaphase (Figs. 1, C-E). Moreover, like Aurora BAUK1, a population of KIN-A and KIN-B localized at the new FAZ tip from late anaphase onwards (Figs. S1, B and C). This was unexpected, because KIN-A and KIN-B were previously reported to localize to the spindle but not to kinetochores or the new FAZ tip (Li et al., 2008). These data suggest that KIN-A and KIN-B are bona fide CPC proteins in trypanosomes, associating with Aurora BAUK1, INCENPCPC1 and CPC2 throughout the cell cycle.’

      Line 285 refers to "conserved" regions in the C-terminal part of KIN-A, referring to Figure 5. Please expand the MSA in Figure 5B to get an idea about the conservation/variation outside CD1 and CD2.

      We now present the full MSA for KIN-A proteins in kinetoplastids in Fig. S5A.

      Please specify what is meant by Line 367-369 for someone who is not familiar with the work by Komaki et al. 2022. Either clarify in the text or clarify in the text with data to support it.

      We updated the corresponding section in the discussion as follows:

      ‘Komaki et al. recently identified two functionally redundant CPC proteins in Arabidopsis, Borealin Related Interactor 1 and 2 (BORI1 and 2), which engage in a triple helix bundle with INCENP and Borealin using a conserved helical domain but employ an FHA domain instead of a BIR domain to read H3T3ph (Komaki et al., 2022).’

      Data presented in Figure S6A, the microtubule co-sedimentation assay, is not convincing since a substantial amount of KIN-A/B is pelleted in the absence of microtubules. Did the authors spin the proteins in BRB80 before the assay to continue with soluble material and reduce sedimentation in the absence of microtubules? If the authors want to keep the wording in lines 331-332, the MT-binding properties of KIN-A and KIN-B need to be investigated in more detail, for example with a titration and a quantification thereof. Otherwise, they should change the text and replace "confirms" with "is consistent with". In any case, the legend needs to be expanded to include more information.

      To address the point above, we have added the following text in the legend corresponding to Fig. S6:

      ‘Microtubule co-sedimentation assay with 6HIS-KIN-A2-309 (left) and 6HIS-KIN-B2-316 (right). S and P correspond to supernatant and pellet fractions, respectively. Note that both constructs to some extent sedimented even in the absence of microtubules. Hence, lack of microtubule binding for KIN-B may be due to the unstable non-functional protein used in this study.’

      We have also updated the main text in the results section:

      ‘We therefore speculated that anaphase translocation of the kinetoplastid CPC to the central spindle may involve the kinesin motor domain of KIN-A. KIN-B is unlikely to be a functional kinesin based on the absence of several well-conserved residues and motifs within the motor domain, which are fully present in KIN-A (Li et al., 2008). These include the P-loop, switch I and switch II motifs, which form the nucleotide binding cleft, and many conserved residues within the α4-L12 elements, which interact with tubulin (Fig. S6A) (Endow et al., 2010). Consistent with this, the motor domain of KIN-B, contrary to KIN-A, failed to localize to the mitotic spindle when expressed ectopically (Fig. S2E) and did not co-sediment with microtubules in our in vitro assay (Fig. S6B).’

      Details:

      The readability of the pAE plots could be improved by arranging sequences according to their position in the structure. For example in Fig4I, KKT8 could precede KKT12. If it is easy to update this, the authors might want to do so.

      We re-ran the AF2 predictions for the KKT7N – KKT8 complex in Fig. 4/S4 and changed the order according to the reviewer’s suggestion (KKT9:KKT11:KKT8:KKT12).

      The same paper is referred to as Je Van Hooff et al. 2017 and as Van Hooff et al. 2017

      Thank you for pointing this out. We have corrected the citation.

      Reviewer #3 (Recommendations For The Authors):

      (1) Please state at the end of the introduction/start of the results section that this work was performed in procyclic trypanosomes. Given that the cell cycles of procyclic and bloodstream forms differ, this is important.

      We added this information at the end of the introduction:

      ‘Here, by combining biochemical, structural and cell biological approaches in procyclic form T. brucei, we show that the trypanosome CPC is a pentameric complex comprising Aurora BAUK1, INCENPCPC1, CPC2 and the two orphan kinesins KIN-A and KIN-B.’

      (2) Please define NLS at first use (line 118), and for clarity, explain the rationale for using GFP with an NLS.

      NLS refers to a short ‘nuclear localization signal’ (TGRGHKRSREQ) (Marchetti et al., 2000), which ensures that the ectopically expressed construct is imported into the nucleus. When we previously expressed truncations of the KKT2 and KKT3 kinetochore proteins, many fragments did not enter the nucleus, presumably due to the lack of an NLS, which prevented us from determining which domains are responsible for their kinetochore localization. We have since consistently used this short NLS sequence in our inducible GFP fusions without any complications. We added a sentence in the Materials & Methods section under Trypanosome culture: ‘All constructs for ectopic expression of GFP fusion proteins include a short nuclear localization signal (NLS) (Marchetti et al., 2000).’ To avoid unnecessary confusion, we removed ‘NLS’ from the main text and figures.

      (3) Lines 148-150 - it would strengthen this claim if KIN-A/B protein levels were assessed by Western blot.

      We now present a Western blot in Fig. S2C, showing that bulk KIN-B levels are clearly reduced upon KIN-A RNAi. The same is true also to some extent for KIN-A levels upon KIN-B RNAi, although this is less obvious, possibly due to the lower efficiency of KIN-B compared to KIN-A RNAi as judged by fluorescence microscopy (quantified in Fig. 2D and 2E).

      (4) Line 253 - the text mentions the removal of both KKT9 and KKT11, which is not consistent with the figure (Fig 4H) - do you mean the removal of either KKT9 or KKT11?

      Yes, we thank the reviewer for pointing out this mistake in the text, which has now been corrected.

      (5) Line 337 - please include a reference for the G209A ATPase-defective rigor mutant - has this been shown to result in KIN-A being inactive previously?

      Please see above our answer in public review.

      (6) It is not always obvious when fluorescent fusion proteins are being expressed endogenously or ectopically, or when they are being expressed in an RNAi background or not without tracing the cell lines in Table S1 - please ensure this is clearly stated throughout the manuscript.

      We now made sure that this is clearly stated in the main text as well as in the figure legends.

      (7) Line 410 - 'KIN-A C-terminal tail is stuffed full of conserved CDK1CRK3 sites' - what does 'stuffed full' really mean (this is rather imprecise) and what are the consensus sites - are these CDK1 consensus sites that are assumed to be conserved for CRK3? I'm not aware of consensus sites for CRK3 having been determined, but if they have, this should be referenced.

      We have modified the corresponding section in the discussion as follows:

      ‘In support of this, the KIN-A C-terminal tail harbours many putative CRK3 sites (10 sites matching the minimal S/T-P consensus motif for CDKs) and is also heavily phosphorylated by Aurora BAUK1 in vitro (Ballmer et al. 2024). Finally, we speculate that the interaction of KIN-A motor domain with microtubules, coupled to the force generating ATP hydrolysis and possibly plus-end directed motion, eventually outcompetes the weakened interactions of the CPC with the kinetochore and facilitates the extraction of the CPC from chromosomes onto spindle microtubules during anaphase. Indeed, deletion of the KIN-A motor domain or impairment of its motor function through N-terminal GFP tagging causes the CPC to be trapped at kinetochores in anaphase. Central spindle localization is additionally dependent on the ATPase activity of the KIN-A motor domain as illustrated by the KIN-A rigor mutant.’
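
      As an illustration of what "matching the minimal S/T-P consensus motif" means in practice, the sketch below scans a protein sequence for a serine or threonine immediately followed by a proline. It is not the authors' analysis: the function name and the toy sequence are hypothetical, and the real KIN-A tail sequence is not reproduced here.

```python
import re


def find_minimal_cdk_sites(sequence: str) -> list[int]:
    """Return 1-based positions of S/T residues that are directly followed by P.

    This implements only the minimal S/T-P motif; it does not test the fuller
    S/T-P-x-K/R consensus or any kinase-specific preference.
    """
    return [match.start() + 1 for match in re.finditer(r"[ST]P", sequence.upper())]


# Hypothetical toy sequence, used only to show the output format.
toy_tail = "MSPAATPKSP"
sites = find_minimal_cdk_sites(toy_tail)
print(f"{len(sites)} putative sites at positions {sites}")
```

      A motif match of this kind only flags candidate sites; whether CRK3 phosphorylates any of them would have to be tested experimentally.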

      (8) Lines 412-416: this proposal is written rather definitively - given no motor activity has been demonstrated for KIN-A, please make clear that this is still just a theory.

      See above.

      (9) Fig 1: KKT2 is not highlighted in Fig 1A - given this has been used for colocalization in Fig 1C-E, was it recovered, and if not, why not? Fig 1B-E: the S phase/1K1N terminology is somewhat misleading. Not all S phase cells will have elongated kinetoplasts - usually an asterisk is used to signify replicated DNA, not kinetoplast shape. If it is to be used here for elongation, then for consistency, N should be used for G2/mitotic cells.

      Fig. 1A (now Fig. S1A) only shows the top 30 hits. KKT2 was indeed recovered with Aurora BAUK1 (see Table S2) and is often used as a kinetochore marker in trypanosomes by our lab and others since the signal of fluorescently tagged KKT2 is relatively bright and KKT2 localizes to centromeres throughout the cell cycle.

      (10) A general comment for all image figures is that these do not have accompanying brightfield images and it is therefore difficult to know where the cell body is, or sometimes which nuclei and kinetoplasts belong to which cell where DNA from more than one cell is within the image. It would be beneficial if brightfield images could be added, or alternatively, the cell outlines were traced onto DAPI or merged images. Also, brightfield images would allow the stage of cytokinesis (pre-furrowing/furrowing/abscission) in anaphase cells to be determined.

      Since this study primarily addresses the recruitment mechanism of the CPC to kinetochores from S phase to metaphase and to the central spindle in anaphase, and CPC proteins are not observed outside of the nucleus during these cell cycle stages, we did not present brightfield images in the figures. However, this point is particularly valid for discerning the localization of KIN-A and KIN-B to the new FAZ tip from late anaphase onwards. Hence, we acquired new microscopy data for Fig. S1B and S1C, which now include phase contrast images, and have chosen representative cells in late anaphase and telophase. We hope that the signal of Aurora BAUK1, KIN-A and KIN-B at the anterior end of the new FAZ can now be distinguished more clearly.

      (11) Fig 2A: legend should state that the micrographs show the localisation of the proteins within the nucleus as whole cells are not shown. 2C: can INCENP not be split into 2 lines - the 'IN' looks like 1N at first glance, which is confusing.

      We have applied the suggested change in Fig. 2.

      (12) Fig 3 (and other AF2 figures): Could the lines for satisfied & not satisfied in the key be thicker so they more closely resemble the lines in the figure and are less likely to be confused with the disordered regions of the CPC components?

      We have now made those lines thicker.

      (13) Why were different E value thresholds used in Fig 3 and Fig 4?

      The CLMS data in Fig. 3 and Fig. 4 now both use the same E value threshold of E-3 (previously E-4 was used in Fig. 4). To determine a sensible significance threshold, we included some yeast protein sequences (‘false positives’) in the database used in pLink2 for identification of crosslinked peptides. Note that we recently also re-ran AF2 for the full CPC and for the KKT7N-KKT8 complex and got improved predictions. Hence some of the models in Fig. 3/S3 and Fig. 4/S4 have been updated accordingly. For the CLMS plots, we also decided to colour the cross-links according to whether the 30 angstrom distance constraints were fulfilled or not in the AF2 prediction.

      (14) Fig 4H legend - please give the expected sizes of these recombinant proteins & check the 3rd elution panel (see public review comments).

      See above response in public review.

      (15) Fig 4I - please explain what the colours of the PAE plot and the values in the key signify, as well as how the Scored Residue values are arrived at. Please also define the pLDDT in the legend.

      We have cited DeepMind’s 2021 methods paper, in which the outputs of AlphaFold are explained in detail. We also added a short description of the pLDDT and PAE scores and the corresponding colour coding in the legends of Fig. 3 and Fig. 4, respectively.

      From figure 3 legend:

      ‘(B) Cartoon representation showing two orientations of the trypanosome CPC, coloured by protein on the left (Aurora BAUK1: crimson, INCENPCPC1: green, CPC2: cyan, KIN-A: magenta, and KIN-B: yellow) or according to their pLDDT values on the right, assembled from AlphaFold2 predictions shown in Figure S3. The pLDDT score is a per-residue estimate of the confidence in the AlphaFold prediction on a scale from 0 – 100. pLDDT > 70 (blue, cyan) indicates a reasonable accuracy of the model, while pLDDT < 50 (red) indicates a low accuracy and often reflects disordered regions of the protein (Jumper et al., 2021). BS3 crosslinks in (B) were mapped onto the model using PyXlinkViewer (blue = distance constraints satisfied, red = distance constraints violated, Cα-Cα Euclidean distance threshold = 30 Å) (Schiffrin et al., 2020).’

      From Figure 4 legend:

      ‘(G) AlphaFold2 model of the KKT7 – KKT8 complex, coloured by protein (KKT71-261: green, KKT8: blue, KKT12: pink, KKT9: cyan and KKT11: orange) (left) and by pLDDT (center). BS3 crosslinks in (H) were mapped onto the model using PyXlinkViewer (Schiffrin et al., 2020) (blue = distance constraints satisfied, red = distance constraints violated, Cα-Cα Euclidean distance threshold = 30 Å). Right: Predicted Aligned Error (PAE) plot of model shown on the left (rank_2). The colour indicates AlphaFold’s expected position error (blue = low, red = high) at the residue on the x axis if the predicted and true structures were aligned on the residue on the y axis (Jumper et al., 2021).’
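
      The satisfied/violated colouring described in both legends amounts to comparing the Cα-Cα Euclidean distance of each crosslinked residue pair in the model against the 30 Å threshold. The sketch below shows that comparison only. It does not use PyXlinkViewer or its API; the function name, the coordinates and the residue numbers are hypothetical, and treating a distance exactly equal to the threshold as satisfied is an assumption.

```python
import math

DISTANCE_THRESHOLD = 30.0  # Å, the Cα-Cα threshold quoted in the legends


def classify_crosslinks(ca_coords, crosslinks, threshold=DISTANCE_THRESHOLD):
    """Label each crosslink as satisfied (True) or violated (False).

    ca_coords maps (protein, residue_number) to an (x, y, z) Cα position in Å.
    crosslinks is a list of pairs of such keys.
    """
    classified = []
    for site_a, site_b in crosslinks:
        distance = math.dist(ca_coords[site_a], ca_coords[site_b])
        # Assumption: a distance equal to the threshold counts as satisfied.
        classified.append((site_a, site_b, distance, distance <= threshold))
    return classified


# Hypothetical coordinates and residue numbers, for illustration only.
toy_coords = {
    ("KKT8", 10): (0.0, 0.0, 0.0),
    ("KKT12", 25): (3.0, 4.0, 0.0),
    ("KKT9", 90): (0.0, 0.0, 40.0),
}
toy_links = [
    (("KKT8", 10), ("KKT12", 25)),
    (("KKT8", 10), ("KKT9", 90)),
]

for site_a, site_b, distance, satisfied in classify_crosslinks(toy_coords, toy_links):
    colour = "blue" if satisfied else "red"
    print(site_a, site_b, f"{distance:.1f} Å", colour)
```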

      (16) Fig 6 legend - Line 730 should say (F) not (C).

      Thank you for pointing out this typo.

      (17) Fig S1A - a key is missing for the colours. Fig S1B/C - cell outlines or a brightfield image are really needed here - see earlier comment. Fig S1D - there doesn't seem to be a method for how this tree was generated.

      See above response in public review regarding Fig. S1A and S1B/C. The tree in Fig. S1D is based on (Butenko et al., 2020).

      (18) Fig S2: A: how was protein knockdown validated (especially for CPC2 where there was little obvious phenotype)? Fig S2B: the y-axis should read proportion of cells, not percentage. Fig S2E - NLS should be labelled.

      Thank you for pointing out the mistake in the labelling.

      (19) Fig S3: PAE plots should be labelled with protein names, not A-E. Similarly, the pLDDT plots should be labelled as in Fig 4I.

      We have corrected the labelling in Fig. S3.

      (20) Fig S5A-D - cell cycle stage labels are missing from images.

      Thank you for pointing out the missing cell cycle stage labels.

      Addition by editor:

      In line 126 the statement that KIN-A and KIN-B "associate with Aurora-AUK1, INCENP-CPC1 and CPC2 throughout the cell cycle" seems too strong. There is no direct evidence for this. Please re-phrase as "likely associate" or "suggest... that ... may...".

      We have modified that sentence according to the editor’s suggestion.

      References:

      Akiyoshi, B., and K. Gull. 2014. Discovery of Unconventional Kinetochores in Kinetoplastids. Cell. 156. doi:10.1016/j.cell.2014.01.049.

      Butenko, A., F.R. Opperdoes, O. Flegontova, A. Horák, V. Hampl, P. Keeling, R.M.R. Gawryluk, D. Tikhonenkov, P. Flegontov, and J. Lukeš. 2020. Evolution of metabolic capabilities and molecular features of diplonemids, kinetoplastids, and euglenids. BMC Biol. 18:1–28. doi:10.1186/S12915-020-0754-1.

      Cormier, A., D.G. Drubin, and G. Barnes. 2013. Phosphorylation regulates kinase and microtubule binding activities of the budding yeast chromosomal passenger complex in vitro. J Biol Chem. 288:23203–23211. doi:10.1074/JBC.M113.491480.

      Endow, S.A., F.J. Kull, and H. Liu. 2010. Kinesins at a glance. J Cell Sci. 123:3420. doi:10.1242/JCS.064113.

      Fink, S., K. Turnbull, A. Desai, and C.S. Campbell. 2017. An engineered minimal chromosomal passenger complex reveals a role for INCENP/Sli15 spindle association in chromosome biorientation. J Cell Biol. 216:911–923. doi:10.1083/JCB.201609123.

      van der Horst, A., M.J.M. Vromans, K. Bouwman, M.S. van der Waal, M.A. Hadders, and S.M.A. Lens. 2015. Inter-domain Cooperation in INCENP Promotes Aurora B Relocation from Centromeres to Microtubules. Cell Rep. 12:380–387. doi:10.1016/J.CELREP.2015.06.038.

      Ishii, M., and B. Akiyoshi. 2020. Characterization of unconventional kinetochore kinases KKT10/19 in Trypanosoma brucei. J Cell Sci. doi:10.1242/jcs.240978.

      Jeyaprakash, A.A., C. Basquin, U. Jayachandran, and E. Conti. 2011. Structural Basis for the Recognition of Phosphorylated Histone H3 by the Survivin Subunit of the Chromosomal Passenger Complex. Structure. 19:1625–1634. doi:10.1016/J.STR.2011.09.002.

      Jeyaprakash, A.A., U.R. Klein, D. Lindner, J. Ebert, E.A. Nigg, and E. Conti. 2007. Structure of a Survivin–Borealin–INCENP Core Complex Reveals How Chromosomal Passengers Travel Together. Cell. 131. doi:10.1016/j.cell.2007.07.045.

      Jumper, J., R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvunakool, R. Bates, A. Žídek, A. Potapenko, A. Bridgland, C. Meyer, S.A.A. Kohl, A.J. Ballard, A. Cowie, B. Romera-Paredes, S. Nikolov, R. Jain, J. Adler, T. Back, S. Petersen, D. Reiman, E. Clancy, M. Zielinski, M. Steinegger, M. Pacholska, T. Berghammer, S. Bodenstein, D. Silver, O. Vinyals, A.W. Senior, K. Kavukcuoglu, P. Kohli, and D. Hassabis. 2021. Highly accurate protein structure prediction with AlphaFold. Nature. 596:583–589. doi:10.1038/s41586-021-03819-2.

      Kang, J.S., I.M. Cheeseman, G. Kallstrom, S. Velmurugan, G. Barnes, and C.S.M. Chan. 2001. Functional cooperation of Dam1, Ipl1, and the inner centromere protein (INCENP)-related protein Sli15 during chromosome segregation. J Cell Biol. 155:763–774. doi:10.1083/JCB.200105029.

      Klein, U.R., E.A. Nigg, and U. Gruneberg. 2006. Centromere targeting of the chromosomal passenger complex requires a ternary subcomplex of Borealin, Survivin, and the N-terminal domain of INCENP. Mol Biol Cell. 17:2547–2558. doi:10.1091/MBC.E05-12-1133.

      Komaki, S., E.C. Tromer, G. De Jaeger, N. De Winne, M. Heese, and A. Schnittger. 2022. Molecular convergence by differential domain acquisition is a hallmark of chromosomal passenger complex evolution. Proc Natl Acad Sci U S A. 119. doi:10.1073/PNAS.2200108119.

      Li, Z. 2012. Regulation of the Cell Division Cycle in Trypanosoma brucei. Eukaryot Cell. 11:1180. doi:10.1128/EC.00145-12.

      Li, Z., J.H. Lee, F. Chu, A.L. Burlingame, A. Günzl, and C.C. Wang. 2008. Identification of a Novel Chromosomal Passenger Complex and Its Unique Localization during Cytokinesis in Trypanosoma brucei. PLoS One. 3. doi:10.1371/journal.pone.0002354.

      Mackay, A.M., D.M. Eckley, C. Chue, and W.C. Earnshaw. 1993. Molecular analysis of the INCENPs (inner centromere proteins): separate domains are required for association with microtubules during interphase and with the central spindle during anaphase. J Cell Biol. 123:373–385. doi:10.1083/JCB.123.2.373.

      Marchetti, M.A., C. Tschudi, H. Kwon, S.L. Wolin, and E. Ullu. 2000. Import of proteins into the trypanosome nucleus and their distribution at karyokinesis. J Cell Sci. 113 ( Pt 5):899–906. doi:10.1242/JCS.113.5.899.

      Nakajima, Y., A. Cormier, R.G. Tyers, A. Pigula, Y. Peng, D.G. Drubin, and G. Barnes. 2011. Ipl1/Aurora-dependent phosphorylation of Sli15/INCENP regulates CPC-spindle interaction to ensure proper microtubule dynamics. J Cell Biol. 194:137–153. doi:10.1083/JCB.201009137.

      Noujaim, M., S. Bechstedt, M. Wieczorek, and G.J. Brouhard. 2014. Microtubules accelerate the kinase activity of Aurora-B by a reduction in dimensionality. PLoS One. 9. doi:10.1371/JOURNAL.PONE.0086786.

      Okada, Y., and N. Hirokawa. 1999. A processive single-headed motor: Kinesin superfamily protein KIF1A. Science. 283:1152–1157. doi:10.1126/SCIENCE.283.5405.1152.

      Rice, S., A.W. Lin, D. Safer, C.L. Hart, N. Naber, B.O. Carragher, S.M. Cain, E. Pechatnikova, E.M. Wilson-Kubalek, M. Whittaker, E. Pate, R. Cooke, E.W. Taylor, R.A. Milligan, and R.D. Vale. 1999. A structural change in the kinesin motor protein that drives motility. Nature. 402:778–784. doi:10.1038/45483.

      Sablin, E.P., F.J. Kull, R. Cooke, R.D. Vale, and R.J. Fletterick. 1996. Crystal structure of the motor domain of the kinesin-related motor ncd. Nature. 380:555–559. doi:10.1038/380555a0.

      Samejima, K., M. Platani, M. Wolny, H. Ogawa, G. Vargiu, P.J. Knight, M. Peckham, and W.C. Earnshaw. 2015. The Inner Centromere Protein (INCENP) Coil Is a Single α-Helix (SAH) Domain That Binds Directly to Microtubules and Is Important for Chromosome Passenger Complex (CPC) Localization and Function in Mitosis. J Biol Chem. 290:21460–21472. doi:10.1074/JBC.M115.645317.

      Schiffrin, B., S.E. Radford, D.J. Brockwell, and A.N. Calabrese. 2020. PyXlinkViewer: A flexible tool for visualization of protein chemical crosslinking data within the PyMOL molecular graphics system. Protein Sci. 29:1851–1857. doi:10.1002/PRO.3902.

    1. Author Response

      The following is the authors’ response to the original reviews.

      We greatly appreciate the reviewers' and editors' comments and suggestions on our manuscript "Transposable elements regulate thymus development and function." We performed additional analyses to validate our results and rephrased some manuscript sections according to the comments. We believe these changes significantly increase the solidity of our conclusions. Our point-by-point answer to the reviewers' and editors' comments is detailed below. New data and analyses are shown in Figure 1d, Figure 2g and h, Figure 5e and f, Figure 1 – figure supplement 1, Figure 2 – figure supplement 2, Figure 3 – figure supplement 1 and 2, Figure 4 – figure supplement 2, Figure 5 – figure supplement 1, as well as the corresponding text sections.

      Reviewer #1:

      (1) The authors sometimes made overstatements largely due to the lack or shortage of experimental evidence.

      For example in figure 4, the authors concluded that thymic pDCs produced higher copies of TE-derived RNAs to support the constitutive expression of type-I interferons in thymic pDCs, unlike peripheral pDCs. However, the data was showing only the correlation between the distinct TE expression pattern in pDCs and the abundance of dsRNAs. We are compelled to say that the evidence is totally too weak to mention the function of TEs in the production of interferon. Even if pDCs express a distinct type and amount of TE-derived transcripts, it may be a negligible amount compared to the total cellular RNAs. How many TE-derived RNAs potentially form the dsRNAs? Are they over-expressed in pDCs?

      The data interpretation requires more caution to connect the distinct results of transcriptome data to the biological significance.

      We contend that our manuscript combines the attributes of a research article (novel concepts) and a resource article (datasets of TEs implicated in various aspects of thymus function). The critical strength of our work is that it opens entirely novel research perspectives. We are unaware of previous studies on the role of TEs in the human thymus. The drawback is that, as with all novel multi-omic systems biology studies, our work provides a roadmap for a multitude of future mechanistic studies that could not be realized at this stage. Indeed, we performed wet lab experiments to validate some but not all conclusions: i) presentation of TE-derived MAPs by TECs and ii) formation of dsRNAs in thymic pDCs. In response to Reviewer #1, we performed supplementary analyses to increase the robustness of our conclusions. Also, we indicated when conclusions relied strictly on correlative evidence and clarified the hypotheses drawn from our observations.

      Regarding the Reviewer's questions about TE-derived dsRNAs, LINE, LTR, and SINE elements all have the potential to generate dsRNAs, given their highly repetitive nature and bi-directional transcription (1). As ~32% of TE subfamilies are overexpressed in pDCs, we hypothesized that these TE sequences might form dsRNA structures in these cells. To address the Reviewer's concerns regarding the amount of TE-derived RNAs among total cellular RNAs, we also computed the percentage of reads assigned to TEs in the different subsets of thymic APCs (see Reviewer 1 comment #4).

      (2) Lack of generality of specific examples. This manuscript discusses the whole genomic picture of TE expression. In addition, one good way is to focus on the specific example to clearly discuss the biological significance of the acquisition of TEs for the thymic APC functions and the thymic selection.

      In figure 2, the authors focused on ETS-1 and its potential target genes ZNF26 and MTMR3, however, the significance of these genes in NK cell function or development is unclear. The authors should examine and discuss whether the distinct features of TEs can be found among the genomic loci that link to the fundamental function of the thymus, e.g., antigen processing/presentation.

      We thank the Reviewer for this highly relevant comment. We investigated the genomic loci associated with NK cell biology to determine if ETS1 peaks would overlap with TE sequences in protein-coding genes' promoter region. Figure 2h illustrates two examples of ETS1 significant peaks overlapping TE sequences upstream of PRF1 and KLRD1. PRF1 is a protein implicated in NK cell cytotoxicity, whereas KLRD1 (CD94) dimerizes with NKG2 and regulates NK cell activation via interaction with the nonclassical MHC-I molecule HLA-E (2, 3). Thus, we modified the section of the manuscript addressing these results to include these new analyses:

      "Finally, we analyzed publicly available ChIP-seq data of ETS1, an important TF for NK cell development (4), to confirm its ability to bind TE sequences. Indeed, 19% of ETS1 peaks overlap with TE sequences (Figure 2g). Notably, ETS1 peaks overlapped with TE sequences (Figure 2h, in red) in the promoter regions of PRF1 and KLRD1, two genes important for NK cells' effector functions (2, 3)."

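      The "percentage of peaks overlapping TE sequences" statistic quoted above can be computed with a plain interval-overlap test. The sketch below is one minimal way to do this, not the authors' pipeline: it assumes BED-style half-open coordinates, counts a peak as overlapping if it shares at least one base with any TE annotation, and uses hypothetical function names and toy coordinates.

```python
from bisect import bisect_right
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def fraction_of_peaks_overlapping(peaks, te_annotations):
    """Fraction of peaks sharing at least one base with a TE annotation.

    Both inputs are lists of (chrom, start, end) with half-open coordinates.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in te_annotations:
        by_chrom[chrom].append((start, end))
    merged = {chrom: merge_intervals(ivs) for chrom, ivs in by_chrom.items()}
    starts = {chrom: [s for s, _ in ivs] for chrom, ivs in merged.items()}

    overlapping = 0
    for chrom, start, end in peaks:
        if chrom not in merged:
            continue
        # Last merged TE interval that starts before the peak ends.
        i = bisect_right(starts[chrom], end - 1) - 1
        if i >= 0 and merged[chrom][i][1] > start:
            overlapping += 1
    return overlapping / len(peaks) if peaks else 0.0


# Hypothetical toy coordinates, for illustration only.
toy_peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 80), ("chr3", 0, 10)]
toy_tes = [("chr1", 150, 300), ("chr1", 600, 700), ("chr2", 10, 60)]
print(f"{fraction_of_peaks_overlapping(toy_peaks, toy_tes):.0%} of peaks overlap a TE")
```

      In practice, the result depends on choices this sketch leaves out, such as the peak-calling significance cutoff and any minimum overlap length.
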
      (3) Since the deep analysis of the dataset yielded many intriguing suggestions, why not add a discussion of the biological reasons and significance? For example, in Figure 1, why is TE expression negatively correlated with proliferation? cTEC-TE is mostly postnatal, while mTEC-TE is more embryonic. What does this mean?

      We thank the Reviewer for this comment. To our knowledge, the relationship between cell division and transcriptional activity of TEs has not been extensively studied in the literature. However, a recent study has shown that L1 expression is induced in senescent cells. We therefore added the following sentences to our Discussion:

      "The negative correlation between TE expression and cell cycle scores in the thymus is coherent with recent data showing that transcriptional activity of L1s is increased in senescent cells (5). A potential rationale for this could be to prevent deleterious transposition events during DNA replication and cell division."

      We also added several discussion points regarding the regulation of TEs by KZFPs to answer concerns raised by Reviewer 2 (see Reviewer 2 comment #1).

      (4) To consolidate the experimental evidence about pDCs and TE-derived dsRNAs, one option is to show the amount of TE-derived RNA copies among total RNAs. The immunohistochemistry analysis in figure 4 requires additional data to demonstrate that overlapped staining was not caused by technical biases (e.g. uneven fixation may cause the non-specifically stained regions/cells). To show this, authors should have confirmed not only the positive stainings but also the negative staining (e.g. CD3, etc.). Another possible staining control was showing that non-pDC (CD303- cell fractions in this case) cells were less stained by the ds-RNA probe.

      We thank the Reviewer for this suggestion. We computed the proportion of reads in each cell assigned to two groups of sequences known to generate dsRNAs: TEs and mitochondrial genes (1). These analyses showed that the proportion of reads assigned to TEs is higher in pDCs than other thymic APCs by several orders of magnitude (~20% of all reads). In contrast, reads derived from mitochondrial genes had a lower abundance in pDCs. We included these results in Figure 4 – figure supplement 2 and included the following text in the Results section entitled "TE expression in human pDCs is associated with dsRNA structures":

      "To evaluate if these dsRNAs arise from TE sequences, we analyzed in thymic APC subsets the proportion of the transcriptome assigned to two groups of genomic sequences known as important sources of dsRNAs, TEs and mitochondrial genes (1). Strikingly, whereas the percentage of reads from mitochondrial genes was typically lower in pDCs than in other thymic APCs, the proportion of the transcriptome originating from TEs was higher in pDCs (~22%) by several orders of magnitude (Figure 4 – figure supplement 2)."

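      The per-cell proportion of reads assigned to TEs or to mitochondrial genes is a simple ratio of class totals to all counted reads. The sketch below shows that calculation for a single cell. It is an illustration under stated assumptions rather than the authors' code: the function name, the class labels and the toy counts are hypothetical, and the counts are assumed to be already quantified per feature.

```python
from collections import defaultdict


def read_fractions_by_class(counts, feature_class):
    """Return the fraction of a cell's reads falling into each feature class.

    counts maps feature name to read count for one cell.
    feature_class maps feature name to a class label such as "TE" or "mito";
    features without a label are grouped under "other".
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    class_totals = defaultdict(float)
    for feature, count in counts.items():
        class_totals[feature_class.get(feature, "other")] += count
    return {label: subtotal / total for label, subtotal in class_totals.items()}


# Hypothetical counts for one cell, for illustration only.
toy_counts = {"L1HS": 20, "AluY": 10, "MT-CO1": 5, "ACTB": 65}
toy_classes = {"L1HS": "TE", "AluY": "TE", "MT-CO1": "mito"}

fractions = read_fractions_by_class(toy_counts, toy_classes)
for label in sorted(fractions):
    print(f"{label}: {fractions[label]:.1%}")
```

      Comparing cell types would then mean summarising these per-cell fractions within each annotated population.
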
      As a negative control for the immunofluorescence experiments, we used CD123- cells. Indeed, flow cytometry analysis of the magnetically enriched CD303+ fraction, double-stained for CD123 and CD304 (two additional markers of pDCs), showed that it was around 90% pure: CD123- cells were also CD304-/lo, indicating that these cells are non-pDCs. Thus, we decided to compare the dsRNA signal between CD123+ cells (pDCs) and CD123- cells (non-pDCs). The difference between CD123+ and CD123- cells was striking (Figure 4d).

      Author response image 1.

      Reviewer #1 (Recommendations For The Authors):

      It was sometimes difficult for me to recognize the dot plots representing low expression against the white background. e.g., figure 1 supplement 1.

      We thank the Reviewer for their comment, and we modified Figure 1 – figure supplement 1 as well as Figure 3 – figure supplement 2 to improve the contrast between dots and background.

      Reviewer #2:

      Reviewer #2 (Recommendations For The Authors):

      (1) In the abstract, results and discussion, the following conclusions are drawn that are not supported by the data: a) TEs interact with multiple transcription factors in thymic cells, b) TE expression leads to dsRNA formation, activation of RIG-I/MDA5 and secretion of IFN-alpha, c) TEs are regulated by cell proliferation and expression of KZFPs in the thymus. All these statements derive from correlations. Only one TF has ChIP-seq data associated with it, dsRNA formation and/or IFN-alpha secretion could be independent of TE expression, and whilst KZFPs most likely regulate TEs in the thymus, the data do not demonstrate it. The authors also seem to suggest that AIRE, FEZF2 and CHD4 regulate TEs directly, but binding is not shown. The manuscript needs a thorough revision to be absolutely clear about the correlative nature of the described associations.

      We agree with Reviewer #2 that some of the conclusions in our initial manuscript were not fully supported by experimental data. In the revised manuscript, we clearly indicated when conclusions relied strictly on correlative evidence and clarified the hypotheses drawn from our observations. Regarding the regulation of TE expression by AIRE, FEZF2, and CHD4, we reanalyzed publicly available ChIP-seq data of AIRE and FEZF2 in murine mTECs. For AIRE, we found that ~30% of AIRE's statistically significant peaks overlap with TE sequences (see Reviewer 2, comment #6 for more details on read alignment and peak calling), confirming its ability to bind to TE sequences directly. We added these results to the main figures (Figure 5f) and modified the section "AIRE, CHD4, and FEZF2 regulate distinct sets of TE sequences in murine mTECs" as follows:

      “[…]. As a proof of concept, we validated that 31.42% of AIRE peaks overlap with TE sequences by reanalyzing ChIP-seq data, confirming AIRE's potential to bind TE sequences (Figure 5f)."

      A reanalysis of FEZF2's ChIP-seq data yielded no significant peaks while using stringent criteria. For this reason, we decided to exclude these data and only use AIRE as a proof of concept.

      Regarding KZFPs, we agree with Reviewer #2 that their impact on TE expression is probably significantly underestimated in our data. A potential reason for this is that KZFP expression is typically low; thus, transcriptomic signals from KZFPs could have been missed by the low depth of scRNA-seq. We mentioned this point in the Discussion:

      "On the other hand, the contribution of KZFPs to TE regulation in the thymus is likely underestimated due to their typically low expression (6) and scRNA-seq's limit of detection."

      (2) On the technical side, there are many dangers about analyzing RNA-seq data at the subfamily level and without stringent quality control checks. Outputs may be greatly confounded by pervasive transcription (see PMID 31425522), DNA contamination, and overlap of TEs with highly expressed genes. Whether TE transcripts are independent units or part of a gene also has important implications for the conclusions drawn. I would say that for most purposes of this work, an analysis restricted to independent TE transcripts, with appropriate controls for DNA contamination, would provide great reassurances that the results from subfamily-level analyses are sound. Showing examples from the genome browser throughout would also help.

      We agree with the Reviewer that contamination could have interfered with TE quantification. We used FastQ Screen (7) to evaluate the contamination of our human scRNA-seq data. As illustrated in the Figure below, most reads aligned with the human genome, and there were no reads uniquely assigned to another species analyzed, confirming the high purity of our dataset.

      Author response image 2.

      As stated by the Reviewer, pervasive expression is another factor that can lead to overestimation of TE expression. To evaluate if pervasive expression impacted the results of our differential expression analysis of TEs between APC subsets, we visualized read alignment to TE sequences using a genome browser. We selected two samples containing the highest numbers of mTEC(II) and pDCs (T07_TH_EPCAM and FCAImmP7277556, respectively) and used STAR to align reads to the human genome (GRCh38). We then visualized read alignment to randomly selected loci of two subfamilies identified as overexpressed by mTEC(II) or pDCs (HERVE-int and Harlequin-int, respectively). The examples below show that the signal detected is specific to the TE sequences located in introns. Even though this visualization cannot guarantee that pervasive expression did not affect TE quantification in any way, it increases the confidence that the signal detected by our analyses genuinely originates from TE expression.

      Author response image 3.

      Author response image 4.

      Author response image 5.

      Author response image 6.

      Author response image 7.

      (3) Related to the above, it would be useful to describe in the main text (and methods) how multi-mapping reads are being handled. It wasn't clear to me how kallisto handles this, and it has implications for the results. In the analysis suggested above, only uniquely mapped reads would have to be used, despite its limitations.

      We agree with the Reviewer that this information regarding assignment of multimapping reads is important. Kallisto uses an expectation-maximization (EM) algorithm to deal with multimapping reads, a strategy used by several algorithms developed to study TE expression (8). Briefly, the EM algorithm reassigns multimapping reads based on the number of uniquely mapped reads assigned to each sequence. Thus, we added the following details to the methods section:

      "Preprocessing of the scRNA-seq data was performed with the kallisto (9), which uses an expectation-maximization algorithm to reassign multimapping reads based on the frequency of unique mappers at each sequence, and bustools workflow."

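      The reassignment logic described above can be illustrated with a toy example. The sketch below is not kallisto's implementation (which works on equivalence classes of transcripts and accounts for effective lengths); under simplified assumptions, it only shows how an expectation-maximization loop splits multimapping reads in proportion to the current abundance estimates, which are seeded by the unique mappers. The function name `em_abundances`, its inputs, and the example counts are hypothetical.

```python
def em_abundances(unique_counts, multi_reads, n_iter=50):
    """Toy EM: estimate per-sequence read counts from unique and multimapping reads.

    unique_counts: dict mapping sequence name -> number of uniquely mapped reads
    multi_reads:   list of tuples; each tuple lists the sequences one
                   multimapping read is compatible with
    """
    names = sorted(set(unique_counts) | {s for read in multi_reads for s in read})
    # Seed with the unique counts; the tiny pseudocount avoids division by zero.
    est = {s: unique_counts.get(s, 0) + 1e-6 for s in names}
    for _ in range(n_iter):
        # E-step: split every multimapping read in proportion to current estimates.
        frac = {s: 0.0 for s in names}
        for read in multi_reads:
            total = sum(est[s] for s in read)
            for s in read:
                frac[s] += est[s] / total
        # M-step: unique reads plus the fractional share of multimapping reads.
        est = {s: unique_counts.get(s, 0) + frac[s] for s in names}
    return est


if __name__ == "__main__":
    # Hypothetical counts: 90 and 10 unique reads, 100 reads compatible with both.
    print(em_abundances({"A": 90, "B": 10}, [("A", "B")] * 100))
```

      With 90 and 10 unique reads on two sequences and 100 reads compatible with both, the estimates converge to roughly 180 and 20: the shared reads follow the 9:1 ratio set by the unique mappers.
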
      (4) Whilst I liked the basic idea, I am not convinced that correlating TE and TF expression is a good strategy for identifying TE-TF associations at enhancers. Enhancers express very low levels of short transcripts, which I doubt would be detected in low-depth scRNA-seq data. The transcripts the authors are using to make such associations may therefore have nothing to do with the enhancer roles of TEs. I would limit these analyses to cell types for which there is histone modification data and correlate TF expression with that instead.

      We agree with the Reviewer that it would have been interesting to correlate the expression of TFs with signals of histone marks at TE sequences. However, we could not perform this analysis because we did not have matched data of histone marks throughout thymic development. Therefore, we adopted an alternative, well-suited strategy.

      Our strategy to identify TE enhancer candidates is depicted in Figure 2a: i) correlation between the expression of the TF and the TE subfamily, ii) presence of the TF binding motif in the sequence of the TE enhancer candidate, and iii) colocalization of the TE enhancer candidate with significant peaks of H3K27ac and H3K4me3 in the same cell type from the ENCODE Consortium ChIP-seq data. We limited our analyses to the eight cell types present both in our dataset and the ENCODE Consortium: B cells, CD4 Single Positive T cells (CD4 SP), CD8 Single Positive T cells (CD8 SP), dendritic cells (DC), monocytes and macrophages (Mono/Macro), NK cells, Th17, and Treg.

      (5) Figure 2G: binding of ETS1 is unconvincing. Were there statistically meaningful peaks called in these regions? It would be good to also show a metaplot/heatmap of ETS1 profile over all elements of relevant subfamilies. Showing histone marks on the genome browser snapshots would also be useful. Is there any transcriptional evidence that the specific Alus shown act as alternative promoters?

      We agree with the Reviewer that the examples provided were not particularly convincing. Thus, we reanalyzed the data to determine if statistically significant ETS1 peaks (see the answer to Reviewer 2's comment #6 for details on the methods) located near gene transcription start sites overlapped with TEs. We thereby provided examples of significant ETS1 peaks overlapping TE sequences in the promoter region of two prototypical NK cell protein-coding genes (Figure 2h).

      (6) Why was -k 10 used with bowtie2? This will map the same read to multiple locations in the genome, increasing read density at more repetitive (younger) TEs. The authors should use either default settings, being clear about the outcome (random assignment of multimapping reads to one location), or use only uniquely aligned reads.

      We thank the Reviewer for their comment and agree that using the -k 10 parameter with bowtie2 was not optimal for TE analysis. To improve the strength of our analyses, we reanalyzed all ChIP-seq data of our manuscript (Figure 2g and h, Figure 5e and f) using the following strategy: alignment with bowtie2 using default parameters except --very-sensitive, removal of multimapping reads with samtools view -q 10, removal of duplicate reads with samtools markdup -r, peak calling with macs2 using the -m 5 50 parameter, and removal of peaks overlapping ENCODE's blacklist regions with bedtools intersect.

      These new analyses strengthen our evidence that TEs interact with multiple genes that regulate thymic development and function. We updated the results sections concerning ChIP-seq data analyses and the Methods section to include this information:

      "ChIP-seq reads were aligned to the reference Homo sapiens genome (GRCh38) using bowtie2 (version 2.3.5) (10) with the --very-sensitive parameter. Multimapping reads were removed using the samtools view function with the -q 10 parameter, and duplicate reads were removed using the samtools markdup function with the -r parameter (11). Peak calling was performed with macs2 with the -m 5 50 parameter (12). Peaks overlapping with the ENCODE blacklist regions (13) were removed with bedtools intersect (14) with default parameters. Overlap of ETS1 peaks with TE sequences was determined using bedtools intersect with default parameters. BigWig files were generated using the bamCoverage function of deeptools2 (15), and genomic tracks were visualized in the UCSC Genome Browser (16)."

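      For readers who want to follow the order of the filtering steps, the Methods text can be laid out as command lines. The Python sketch below only assembles the argument lists and does not run anything. The flags `--very-sensitive`, `-q 10`, `-r`, and `-m 5 50` come from the text; everything else (single-end input via `-U`, the `samtools sort` step, `-f BAM`, `-g hs`, the use of `-v` to drop blacklist overlaps, and all file names) is an assumption made for illustration and may differ from the commands the authors actually ran.

```python
def chipseq_commands(index, fastq, blacklist_bed, prefix):
    """Assemble argv lists for the ChIP-seq steps; nothing is executed here."""
    sam = f"{prefix}.sam"
    filtered = f"{prefix}.q10.bam"
    sorted_bam = f"{prefix}.q10.sorted.bam"
    dedup = f"{prefix}.dedup.bam"
    peaks = f"{prefix}_peaks.narrowPeak"  # assumed macs2 output name for "-n prefix"
    return [
        # Alignment with the --very-sensitive preset (single-end input assumed).
        ["bowtie2", "--very-sensitive", "-x", index, "-U", fastq, "-S", sam],
        # Keep alignments with MAPQ >= 10 to discard multimapping reads.
        ["samtools", "view", "-b", "-q", "10", "-o", filtered, sam],
        # Coordinate sort (assumed step; markdup expects sorted input).
        ["samtools", "sort", "-o", sorted_bam, filtered],
        # Remove duplicate reads.
        ["samtools", "markdup", "-r", sorted_bam, dedup],
        # Peak calling with the mfold bounds given in the text.
        ["macs2", "callpeak", "-t", dedup, "-f", "BAM", "-g", "hs",
         "-n", prefix, "-m", "5", "50"],
        # Report only peaks that do not overlap blacklist regions.
        ["bedtools", "intersect", "-v", "-a", peaks, "-b", blacklist_bed],
    ]


if __name__ == "__main__":
    # Hypothetical file names, printed as shell-style command lines.
    for cmd in chipseq_commands("GRCh38", "reads.fastq.gz", "blacklist.bed", "ETS1"):
        print(" ".join(cmd))
```

      The samtools manual describes additional preparation before `markdup` (for example `fixmate -m` for paired-end data), which is omitted from this sketch.
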
      (7) Figure 1d needs a y axis scale. Could the authors also provide details of how the random distribution of TE expression was generated?

      We agree with the Reviewer that Figure 1d was incomplete and made the appropriate modifications. Regarding the random distribution, we reproduced our dataset containing the expression of 809 TE subfamilies in 18 cell populations. For each combination of TE subfamily and cell type, we randomly assigned an "expression pattern" as identified by the hierarchical clustering of Figure 1b. Then, we computed the maximal occurrence of an expression pattern across cell types for each TE subfamily to generate the distribution curve in Figure 1d. We added the following details to the Methods section to clarify how the random distribution was generated:

      "As a control, a random distribution of the expression of 809 TE subfamilies in 18 cell populations was generated. A cluster (cluster 1, 2, or 3) was randomly attributed for each combination of TE subfamily and cell type, and the maximal occurrence of a given cluster across cell types was then computed for each TE subfamily. Finally, the distributions of LINE, LTR, and SINE elements were compared to the random distribution with Kolmogorov-Smirnov tests."

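      The control described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: it assumes each of the three clusters is drawn with equal probability, and the function name, the seed, and the use of Python's standard library are choices made here.

```python
import random
from collections import Counter


def random_max_occurrence(n_subfamilies=809, n_cell_types=18,
                          clusters=(1, 2, 3), seed=0):
    """For each simulated TE subfamily, randomly assign a cluster to every
    cell type and record how often the most frequent cluster occurs."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_subfamilies):
        # One random cluster label per cell population (uniform draw assumed).
        assigned = [rng.choice(clusters) for _ in range(n_cell_types)]
        # Maximal occurrence of any single cluster across the cell types.
        maxima.append(max(Counter(assigned).values()))
    return maxima


if __name__ == "__main__":
    maxima = random_max_occurrence()
    # Tabulate the random distribution of maximal occurrences.
    for value, count in sorted(Counter(maxima).items()):
        print(value, count)
```

      With 18 cell populations and three clusters, every simulated subfamily has a maximal occurrence between 6 and 18. The observed LINE, LTR, and SINE distributions could then be compared to this control with a two-sample Kolmogorov-Smirnov test, for example `scipy.stats.ks_2samp`, which is left out here to keep the sketch free of dependencies.
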
      (8) The motif analysis requires a minimum of 1 locus from each TE subfamily containing it in order to be reported, but this seems like a really low threshold that will output a lot of noise. What is the rationale here?

      We agree with the Reviewer that this threshold might appear low. Nonetheless, these analyses ultimately aimed to identify TE promoter and enhancer candidates. Hence, we did not want to put an arbitrary threshold at a higher value (e.g., a certain number or percentage of all loci of a given TE subfamily), as this might create a bias based on the total number of loci of a given TE subfamily. Moreover, our rationale was that a TE locus might act as a promoter/enhancer even if it is the only locus of its subfamily containing a TF binding site.

      Even though this strategy might have created some noise in the analyses of interactions between TFs and TEs of Figure 2 (panels a-e), we are confident that our bootstrap strategy efficiently removed low-quality identifications based on low correlation values or expression of TF and TE in low percentages of cells. Additionally, the subsequent analyses on TE promoter and enhancer candidates were performed exclusively for the TE loci containing TF binding sites to avoid adding noise to these analyses.

      (9) Figure 4e: is this a log2 enrichment? If not, the enrichments for some of the gene sets are not so high.

      The enrichment values represented in Figure 4e are not log-transformed. It is essential to highlight that gene set enrichment values were computed for each possible pair of thymic APCs (e.g., pDC vs. cDC1, pDC vs. mTEC(II), etc.), and the values represented in Figure 4e are an average of each comparison pictured at the bottom of the UpSet plot.

      However, we agree with Reviewer 2 that the average enrichment value is not extremely high. We thus made the following modifications to the Results section ("TE expression in human pDCs is associated with dsRNA structures") to better represent it:

      "Notably, thymic pDCs harbored moderate yet significant enrichment of gene signatures of RIG-I and MDA5-mediated IFN α/β signaling compared to all other thymic APCs (Figure 4e and Supplementary file 1 – Table 8)."

      (10) Please be clear on results subtitles when these refer to mouse.

      We apologize for the confusion and modified the subtitles to clarify if the results refer to mouse or human data.

      (11) Figure 1 - figure supplement 2: "assignation" should be 'assignment'.

      We thank the Reviewer for their keen eye and changed the title of Figure 1 – figure supplement 2.

      (1) Sadeq S, Al-Hashimi S, Cusack CM, Werner A. Endogenous Double-Stranded RNA. Noncoding RNA. 2021;7(1).

      (2) Kim N, Kim M, Yun S, Doh J, Greenberg PD, Kim TD, et al. MicroRNA-150 regulates the cytotoxicity of natural killers by targeting perforin-1. J Allergy Clin Immunol. 2014;134(1):195-203.

      (3) Gunturi A, Berg RE, Forman J. The role of CD94/NKG2 in innate and adaptive immunity. Immunol Res. 2004;30(1):29-34.

      (4) Taveirne S, Wahlen S, Van Loocke W, Kiekens L, Persyn E, Van Ammel E, et al. The transcription factor ETS1 is an important regulator of human NK cell development and terminal differentiation. Blood. 2020;136(3):288-98.

      (5) De Cecco M, Ito T, Petrashen AP, Elias AE, Skvir NJ, Criscione SW, et al. L1 drives IFN in senescent cells and promotes age-associated inflammation. Nature. 2019;566(7742):73-8.

      (6) Huntley S, Baggott DM, Hamilton AT, Tran-Gyamfi M, Yang S, Kim J, et al. A comprehensive catalog of human KRAB-associated zinc finger genes: insights into the evolutionary history of a large family of transcriptional repressors. Genome Res. 2006;16(5):669-77.

      (7) Wingett SW, Andrews S. FastQ Screen: A tool for multi-genome mapping and quality control. F1000Res. 2018;7:1338.

      (8) Lanciano S, Cristofari G. Measuring and interpreting transposable element expression. Nat Rev Genet. 2020;21(12):721-36.

      (9) Bray NL, Pimentel H, Melsted P, Pachter L. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol. 2016;34(5):525-7.

      (10) Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357-9.

      (11) Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, et al. Twelve years of SAMtools and BCFtools. Gigascience. 2021;10(2).

      (12) Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008;9(9):R137.

      (13) Amemiya HM, Kundaje A, Boyle AP. The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Sci Rep. 2019;9(1):9354.

      (14) Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26(6):841-2.

      (15) Ramirez F, Ryan DP, Gruning B, Bhardwaj V, Kilpert F, Richter AS, et al. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Res. 2016;44(W1):W160-5.

      (16) Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, et al. The human genome browser at UCSC. Genome Res. 2002;12(6):996-1006.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public Review):

      This publication applies 3D super-resolution STORM imaging to understanding the role of developmental neural activity in the clustering of retinal inputs to the mouse dorsal lateral geniculate nucleus (dLGN). The authors argue that retinal ganglion cell (RGC) synaptic boutons start forming clusters early in postnatal development (P2). They then argue that these clusters contribute to eye-specific segregation of retinal inputs by activity-dependent stabilization of nearby boutons from the same eye. The data provided is N=3 animals for each condition of P2, P4, and P8 animals in wild-type mice and in mice where early patterns of structured retinal activity are blocked.

      Strengths:

      The 3D STORM imaging of pre- and postsynaptic elements provides convincing high-resolution localization of synapses.

      The experimental design of comparing ipsilateral and contralateral RGC axon boutons in a region of the dLGN that is known to become contralateral is elegant. The design makes it possible to relate fixed time point structural data to a known outcome of activity-dependent remodeling.

      Weaknesses:

      Based on previous literature, it is known that synapse density, synapse clustering, and synaptic specificity increase during postnatal development. Previous work has also shown that both the changes in synaptic clustering and synaptic specificity are affected by retinal activity. The data and analysis provided by the authors add little unambiguous evidence that advances this understanding.

      We agree with the reviewer that previous literature shows that synapse density, synapse clustering, and synaptic specificity increase during postnatal development and that these processes are affected by retinal activity. The majority of studies on synaptic refinement have been performed after eye-opening, when eye-specific segregation is already complete. In contrast, most studies of eye-specific segregation focus on axonal refinement phenotypes. To our knowledge, only a small number of experiments have examined retinogeniculate synaptic properties at the nanoscale during eye-specific segregation (1-4). Our broad goal is to understand the mechanisms of synaptogenesis and competition at the earliest stages of eye-specific refinement, when spontaneous retinal activity is a major driver of activity-dependent remodeling. We hope that readers will appreciate that there is still much to discover in this fascinating model system of synaptic competition.

      General problem 1: Most of the statistical analysis is limited to ANOVA comparison of axons from the contralateral and ipsilateral retina in the contralateral dLGN. The hypothesis that ipsilateral and contralateral axons would be statistically identical in the contralateral dLGN is not a plausible hypothesis so rejecting the hypothesis with P < X does not advance the authors' arguments beyond what was already known.

      General problem 2: Most of the interpretation of data is qualitative. While error bars are provided, these error bars are not used to draw conclusions. Given the small sample size (N=3), there is a large degree of uncertainty regarding the magnitude of changes (synapse size, number, specificity). The authors base their conclusions on the averages of these values when the likely degree of uncertainty could allow for the opposite interpretation.

      We appreciate the reviewer’s concerns regarding the use of ANOVA for statistical testing in the original submission. We have generated new figures that show confidence intervals for each analysis in the manuscript and these are included in the response to reviewers document below. To address the underlying concern that our N=3 sample size limits the interpretation of our results, we have revised the manuscript to be cautious in our interpretations and to discuss additional possibilities that are consistent with the anatomical data.

      General problem 3: Two of the four results sections depend on using the frequency of single active zone vGlut2 clusters near multiple active zone vGlut2 as a proxy for synaptic stabilization of the single active zone vGlut2 clusters by the multiple active zone vGlut2 clusters. The authors argue that the increased frequency of same-eye single active zone clusters relative to opposite-eye single active zone clusters means that multiple active zone vGlut2 clusters are selectively stabilizing single active zone clusters. There are other plausible explanations for this observation that are not eliminated. An increased frequency of nearby single active zone clusters would also occur if RGC axons form more than one synapse in the dLGN. Eye-specific segregation is, by definition, a relative increase in the frequency of nearby boutons from the same eye. The authors were, therefore, guaranteed to observe a non-random relationship between boutons from the same eye. The authors do compare their measures to a random model, but I could not find a description of the model. I would expect that the model would need to account for RGC arbor size, arbor structure, bouton number, and segregation independent of multi-active-zone vGlut2 clusters. The most common randomization for the type of analysis described here, a shift in the positions of single-active zone boutons, would not be adequate.

      In discussing the claimed cluster-induced stabilization of nearby boutons, the authors state that the specificity increases with age due to activity-dependent refinement. Their quantification does not support an increase in specificity with age. In fact, the high degree of clustering "specificity" they observe at P2 argues for the trivial same axon explanation.

      We agree with the reviewer that individual RGC axons form multiple synapses and that, over time, eye-specific segregation must increase the frequency of like-eye synapses relative to opposite-eye synapses. Indeed, our previous study of eye-specific refinement showed that at P8, the density of eye-specific inputs had increased for the dominant-eye and decreased for the non-dominant-eye (1). However, at postnatal day 4, contralateral and ipsilateral input densities were the same in the future contralateral-eye territory. One of our goals in this study was to determine if the process of synaptic clustering begins at these earliest stages of synaptic competition and, if so, whether it is influenced by retinal wave activity. It is plausible that the RGC axons from the same eye could initially form synapses randomly and, at some later stage, synapses may be selectively added to produce mature glomeruli. Consistent with this possibility, previous analysis of JAM-B RGC axon refinement showed the progressive clustering of axonal boutons at later stages of development after eye-specific segregation (5).

      Regarding the randomization that we employed, we performed a repositioning of synapse centroids within the volume of the neuropil after accounting for neuronal soma volumes and edge effects. We agree that this type of randomization cannot account for the fine scale structure of axons and dendrites, which we did not have access to in this four-color volumetric super-resolution data set. To address this, we have performed additional clustering analyses surrounding both single-active zone and multi-active zone synapses. This new analysis showed that there is a modest clustering effect around single-active zone synapses compared to complete randomization described above. We now present this information using a normalized clustering index for direct comparison of clustering between multi-active zone and single-active zone synapses. We have measured effect sizes and confidence intervals, which we present in point-by-point responses below. We have restructured the manuscript figures and discussion to provide a balanced interpretation of our results and the limitations of our study.
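
      One way to picture the centroid repositioning described above is rejection sampling inside the imaged volume. The sketch below is a simplified illustration and not the procedure used in the study: it assumes a box-shaped volume, models each soma as a sphere, and handles edge effects with a fixed margin. The function name, parameters, and example geometry are hypothetical.

```python
import random


def randomize_centroids(n, bounds, somas, margin=0.0, seed=0, max_tries=100000):
    """Draw n random centroids inside a box, avoiding somas and box edges.

    bounds: ((xmin, xmax), (ymin, ymax), (zmin, zmax)) of the imaged volume
    somas:  list of (cx, cy, cz, radius) spheres excluded from the neuropil
    margin: distance kept from every face of the box (edge-effect buffer)
    """
    rng = random.Random(seed)
    points = []
    tries = 0
    while len(points) < n:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("could not place all centroids; check geometry")
        # Uniform draw inside the box, shrunk by the margin on every side.
        p = tuple(rng.uniform(lo + margin, hi - margin) for lo, hi in bounds)
        # Reject positions that fall inside any soma sphere.
        in_soma = any(
            sum((a - c) ** 2 for a, c in zip(p, soma[:3])) < soma[3] ** 2
            for soma in somas
        )
        if not in_soma:
            points.append(p)
    return points


if __name__ == "__main__":
    # Hypothetical geometry: a 10 x 10 x 5 box with one soma of radius 2.
    pts = randomize_centroids(5, ((0, 10), (0, 10), (0, 5)), [(5, 5, 2.5, 2.0)],
                              margin=0.5)
    for p in pts:
        print(p)
```

      As noted in the response, a randomization of this kind preserves synapse number and the excluded volumes but carries no information about axonal or dendritic structure.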

      Analysis of specific claims:

      Result Section 1

      Most of the figures show mean, error bars, and asterisks, but not the three data points from which these statistics are derived. Large changes in variance from condition to condition suggest that displaying the data points would provide more useful information.

      We thank the reviewer for their suggestion. We have updated all figures to display the means of all biological replicates as individual data points.

      Claim 1: Contralateral density increases more than ipsilateral in the contralateral region over the course of development. This claim is supported by the qualitative comparison of means and error bars in Figure 2D. The argument could be made quantitative by providing a confidence interval for synapse density increase for dominant and non-dominant synapse density. A confidence interval could then be generated for the difference in this change between the two groups. Currently, the most striking effect is a big difference in variance between P4 and P8 for dominant eye complex synapses. Given that N=3, I assume there is one extreme outlier here.

      We appreciate the comment and believe the reviewer was referring to the data presented in the original Figure 1D, rather than Figure 2D.

      We agree with the reviewer that our comment on the change in synapse density across ages was not quantitatively supported by the figure as we did not perform a proper age-wise statistical comparison. We have removed this claim in the revised manuscript.

      We also appreciate the suggestions to clarify the presentation of our statistical analyses and to utilize confidence interval measurements wherever possible. We present Author response image 1 below, showing the density of multi-AZ synapses in the contralateral-eye territory over time (P2-P8), for both CTB(+) contralateral (black) and CTB(-) ipsilateral inputs (red) featuring 5/95% confidence intervals:

      Author response image 1.

      More broadly, the reviewer has raised the concern that the low number of biological replicates (N=3) presents challenges in the use of ANOVA for statistical testing. We agree with the concern and have revised the manuscript to be cautious in our statistical tests and resulting claims. We have chosen to use paired T-tests to compare measurements of eye-specific synapse properties because these measurements were always made within each individual biological replicate (paired measurements). Below, we discuss our logic for this change and the effects on the results we present in the revised manuscript.

      Considering the above image:

      (1) ANOVA: In our initial submission, we used an ANOVA test which showed P<0.05 for the CTB(+) P4 vs. P8 comparison above, leading to our statement about an age-dependent increase in multi-AZ density. However, the figure above shows that P8 data has higher variance. Thus, the homogeneity of variance assumption of ANOVA may lead to false positives in this comparison.

      (2) Confidence interval for N=3: We calculated confidence intervals for P4 and P8 data (5/95% CI shown above). Overlap between the two groups indicates the true mean values of the two groups could be identical. However, the P8 confidence intervals (as well as other confidence intervals across other comparisons in the manuscript) also include the value of 0. This indicates there actually might be no multi-active zone synapses in the mouse dLGN. The failure arises because the low number of biological replicates (N=3 data points) precludes a reliable confidence interval measurement. CI measurements require sufficient sample sizes to determine the true population variance.

      (3) Difficulty in achieving sufficient sample sizes for CI analysis in ultrastructural studies of the brain: volumetric STORM experiments are technically complex and make use of sample preparation and analysis methods that are similar to volumetric electron microscopy (physical ultrathin sectioning and computational 3D stack alignment). For these technical reasons, it is difficult to collect imaging data from >10 mice for each group of data (e.g. age and tissue location) in one single project. Because of the technical challenges, most ultrastructural studies published to date present results from single biological replicates. In our STORM dataset, we collected imaging data of N=3 biological replicates for each age and genotype. We agree that in the future the collection of additional replicates will be important for improving the reliability of statistical comparisons in super-resolution and electron-microscopy studies. Continued advances in the throughput of imaging/analysis should help to make this easier over time. 

      (4) The use of paired T-tests: In this study, we have eye-specific CTB(+) and CTB(-) synapse imaging data from the same STORM fields within single biological replicates. When there is only one measurement from each replicate (e.g. synapse density, ratio of total synapses), using paired tests to compare these groups increases statistical power and does not assume similar variance. However, this limits our analysis to comparisons within each age, and not between ages. Accordingly, we have revised our discussion of the results and interpretations throughout the manuscript. When there are thousands of measurements of synapses from each replicate (e.g. Figure 2A-B on synapse volumes), we use a mixed linear model to analyze the variance. In the revised figures we present the results using standard error of the mean and link measurements from within the same individual replicates to show the paired data structure. In cases where specific comparisons are made across ages, we present 5/95% confidence interval measurements.
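
      The arithmetic behind points (2) and (4) can be made concrete with a short sketch. It assumes that "5/95%" denotes a two-sided 90% interval based on Student's t distribution, for which the critical value with 2 degrees of freedom (N=3) is about 2.920. The function names and all numbers in the example are hypothetical and are not data from this study.

```python
import math
import statistics

# Two-sided 90% critical value of Student's t with 2 degrees of freedom (N=3).
T_CRIT_DF2_90 = 2.920


def ci90_n3(values):
    """Return the (lower, upper) 90% confidence interval of the mean for N=3."""
    if len(values) != 3:
        raise ValueError("this sketch hard-codes the critical value for N=3")
    mean = statistics.mean(values)
    half_width = T_CRIT_DF2_90 * statistics.stdev(values) / math.sqrt(3)
    return mean - half_width, mean + half_width


def paired_t(a, b):
    """Paired t statistic: mean within-replicate difference over its standard error."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


if __name__ == "__main__":
    # Hypothetical densities from three replicates: the interval spans zero.
    print(ci90_n3([1.0, 2.0, 6.0]))
    # Hypothetical paired measurements (e.g. CTB(+) vs CTB(-) in the same replicate).
    print(paired_t([5, 7, 9], [4, 5, 6]))
```

      Because the critical value is large and the standard deviation is estimated from only three points, an interval such as the one for `[1.0, 2.0, 6.0]` spans zero even though every measurement is positive. The paired statistic, by contrast, depends only on the within-replicate differences, so variation between replicates does not inflate it.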

      Claim 2: The fraction of multiple-active zone vGlut2 clusters increases with age. This claim is weakly supported by a qualitative reading of panel 1E. The error bars overlap so it is difficult to know what the range of possible increases could be. In the text, the authors report mean differences without confidence intervals (or any other statistics). The reported results should, therefore, be interpreted as a description of their three mice and not as evidence about mice in general.

      We appreciate the reviewer’s concern that statistical accuracy of our synapse density comparisons over age is limited by the small sample size as discussed above. We have removed all strong claims about age-dependent changes in the density of multi-active zone and single-active zone synapses. Instead, we focus our analyses on comparisons between CTB(+) and CTB(-) synapse measurements, which are paired within each biological replicate. To specifically address the reviewer’s concern about figure panel 1E, we present Author response image 2 with confidence intervals below.

      Author response image 2.

      Figure S1. Panel A makes the point that the study could not be done without STORM by comparing the STORM images to "Conventional" images. The images are over-saturated low-resolution images. A reasonable comparison would be to a high-quality confocal image acquired with a high NA objective (~1.4) and low laser power (PSF ~ 0.2 x 0.2 x 0.6 um) that was acquired over the same amount of time it takes to acquire a STORM volume.

      We agree with the reviewer that the presentation of low-resolution conventional images is not necessary. We have deleted the panel and modified the text accordingly.

      Result section 2.

      Claim 1: The ipsi/contra (in contra LGN) difference in VGluT2 cluster volume increases with development. While there are many p-values listed, the main point is not directly quantified. A reasonable way to quantify the relative increase in volume could be in the form: the non-dominant volumes were 75%-95%(?) of the dominant volume at P2 and 60%-80% (?) at P8. The difference in change was -5 to 15%(?).

      We thank the reviewer for their helpful suggestion to improve the clarity of the results presented in this analysis of eye-specific synapse volumes. In our original report, we found differences in eye-specific VGluT2 volume at each time point (P2/P4/P8) in control mice (1). The original measurements used the entire synapse population. Here, we aimed to determine whether eye-specific differences in VGluT2 volumes were present for both multi-AZ synapses and single-AZ synapses, and whether one population may have a greater contribution to the previous population measurement that we reported. We found that at P4 (a time when the overall eye-specific synapse density is equivalent for both eyes in the dLGN), WT multi-AZ synapses showed a greater difference (372%) in eye-specific VGluT2 volume compared with single-AZ synapses (135%). In β2KO mice multi-AZ synapses showed a greater difference (110%) in eye-specific VGluT2 volume compared with single-AZ synapses (41%). In our initial manuscript submission, we included statistical comparisons of eye-specific volume differences across ages, but we did not highlight these differences in our discussion of the results. For clarity, we have removed all statistical comparisons across ages in the revised manuscript. We have modified the text to focus on eye-specific VGluT2 volume differences at P4 described above. To specifically address the reviewer’s question, we provide the percentage differences between multi- and single-AZ eye-specific synapses for each age/genotype below:

      Author response table 1.

      Claim 2: Complex synapses (vGlut2 clusters with multiple active zones) represent clusters of simple synapses and not single large boutons with multiple active zones. The authors argue that because vGlut2 cluster volume scales roughly linearly with active zone number, the vGlut2 clusters are composed of multiple boutons each containing a single active zone. Their analysis does not rule out the (known to be true) possibility that RGC bouton sizes are much larger in boutons with multiple active zones. The correlation of volume and active zone number, by itself, does not resolve the issue. A good argument for multiple boutons might be that the variance is smallest in clusters with 4 active zones (looks like it in the plot) since they would be the average of four active zones to vesicle pool ratios. It is very likely that the multi-active zone vGlut2 clusters represent some clustering and some multi-synaptic boutons. The reference cited by the authors as evidence for the presence of single active zone boutons in young tissue does not rule out the existence of multiple active zone boutons.

      We agree with the reviewer’s comments on the challenges of classifying multi-active zone synapses in STORM images as single terminals versus aggregates of terminals. To help address this, we have performed electron microscopy imaging of genetically labeled RGC axons and identified the existence of single retinogeniculate terminals with multiple active zones. Our EM imaging was limited to 2D sections and does not rule out the clustering of small, single-active zone synapses within 3D volumes. Future volumetric EM reconstructions will be informative for this question. We have significantly updated the figures and text to discuss the new results and provide a careful interpretation of the nature of multi-AZ synapses in STORM imaging data.

      Several arguments are made that depend on the interpretation of "not statistically significant" (n.s.) meaning that "two groups are the same" instead of "we don't know if they are different". This interpretation is incorrect and materially impacts the conclusions.

      Several arguments are made that interpret statistical significance for one group and a lack of statistical significance for another group meaning that the effect was bigger in the first group. This interpretation is incorrect and materially impacts the conclusions.

      We thank the reviewer for raising these concerns. We have extensively revised the manuscript text to report the data in a more precise way without overinterpreting the results. All references to “N.S.” and associated conclusions have been either removed or substantiated with 5/95% confidence interval testing.
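
      The response does not state how the 5/95% intervals were computed. As one possible reading, the sketch below uses a percentile bootstrap on the difference between two group means. The function names, the resampling scheme, and the example values are all hypothetical and are not taken from the manuscript; with only three replicates per group such an interval is coarse and should be read with caution.

```python
import random
import statistics


def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 100]) of a sorted list."""
    if not sorted_vals:
        raise ValueError("empty input")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def bootstrap_diff_ci(group_a, group_b, n_boot=5000, seed=0):
    """5th/95th percentile bootstrap interval for mean(a) - mean(b).

    Assumption: replicates are resampled with replacement within each group.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        resample_a = [rng.choice(group_a) for _ in group_a]
        resample_b = [rng.choice(group_b) for _ in group_b]
        diffs.append(statistics.mean(resample_a) - statistics.mean(resample_b))
    diffs.sort()
    return percentile(diffs, 5), percentile(diffs, 95)


# Hypothetical per-replicate values (illustration only, not study data).
group_a = [0.42, 0.47, 0.45]
group_b = [0.30, 0.33, 0.29]
low, high = bootstrap_diff_ci(group_a, group_b)
print(f"5/95% interval for the difference: [{low:.3f}, {high:.3f}]")
```

      Reporting the interval itself, rather than a "significant" or "n.s." label, shows the size of the effect and its uncertainty, which is the point raised by the reviewer.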

      Result Section 3.

      Claim 1: Complex synapses stabilize simple synapses. There are alternative explanations (mentioned above) for the observed clustering that negate the conclusions. 1) Boutons from the same axon tend to be found near one another. 2) Any form of eye-specific segregation would produce non-random associations in the analysis as performed. The authors compare each observation to a random model, but I cannot determine from the text if the model adequately accounts for alternative explanations.

      We thank the reviewer for their suggestion to consider alternative explanations for our results. We agree that our study does not provide direct molecular mechanistic data demonstrating synaptic stabilization effects. We have significantly revised the manuscript to be more cautious in our interpretations and specifically address alternative biological mechanisms that are consistent with the non-random arrangement of retinogeniculate synapses in our data.

      We agree with the reviewer that individual RGC axons form multiple synapses; however, nascent synapses might not always form close together. If synapses are initially added randomly within RGC axons, eye-specific segregation may conclude with a still-random pattern of dominant-eye inputs. At some later stage, synapses may be selectively refined to produce mature glomeruli. Consistent with this, individual RGCs undergo progressive clustering of axonal boutons at later stages of development after eye-specific segregation (5). One of our goals in this work was to determine if the process of synaptic clustering begins at the earliest stages of synapse formation and, if so, whether it is influenced by retinal wave activity.

      To measure synaptic clustering in our STORM data, we used a randomization of single-AZ synapse centroids within the volume of the neuropil after accounting for neuronal soma volumes and edge effects. Multi-AZ centroid positions were held fixed. Comparing the randomized result to the original distribution, we found a higher fraction of single-AZ synapses associated with multi-AZ synapses, arguing for a non-random clustering effect. However, we agree with the reviewer’s concern that this type of randomization cannot account for the fine-scale structure of axons, which we did not have access to in this four-color volumetric super-resolution data set. Thus, there could still be errors in a purely volumetric randomization (e.g. the assignment of synapses to regions in the volume that would not be synaptic locations in the original neuropil), which would effectively decrease the measured degree of clustering after the randomization. To address this, we have revised our analysis to measure the degree of synapse clustering nearby both multi-AZ and single-AZ synapses after an equivalent randomization of single-AZ synapse positions in the volume.

      We now present the revised results as a “clustering index” for both multi-AZ and single-AZ synapses. This measurement was performed in several steps: 1) randomization of single-AZ positions within the imaging volume while holding multi-AZ centroid positions fixed, 2) independent measurements of the fraction of single-AZ synapses within the local shell (1.5 μm search radius) around multi-AZ and single-AZ synapses within the random distribution, 3) comparison of the result from (2) with the actual fractional measurements in the raw STORM data to compute a “clustering index” value. 4) Because the randomization is equivalent for both multi-AZ and single-AZ synapse measurements, any measured differences in the degree of clustering reflect the synapse type.
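
      The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code used in the study: the index is assumed here to be the ratio of the observed fraction to the mean randomized fraction (the response only says the two are compared), positions are shuffled uniformly in a rectangular box, and the exclusion of soma volumes and the edge corrections mentioned earlier are omitted. All names and coordinates are hypothetical.

```python
import math
import random


def fraction_near(points, anchors, radius, same_set=False):
    """Fraction of `points` within `radius` of at least one anchor.

    When `same_set` is True, `anchors` is the same list as `points`
    and each point is not allowed to match itself.
    """
    if not points:
        return 0.0
    hits = 0
    for i, p in enumerate(points):
        for j, a in enumerate(anchors):
            if same_set and i == j:
                continue
            if math.dist(p, a) <= radius:
                hits += 1
                break
    return hits / len(points)


def clustering_index(single, multi, bounds, radius=1.5, n_shuffles=200, seed=0):
    """Return (index around multi-AZ, index around single-AZ) synapses."""
    rng = random.Random(seed)
    obs_multi = fraction_near(single, multi, radius)
    obs_single = fraction_near(single, single, radius, same_set=True)
    rand_multi = rand_single = 0.0
    for _ in range(n_shuffles):
        # Step 1: shuffle single-AZ positions; multi-AZ positions stay fixed.
        shuffled = [tuple(rng.uniform(lo, hi) for lo, hi in bounds) for _ in single]
        # Step 2: the same shuffle feeds both measurements.
        rand_multi += fraction_near(shuffled, multi, radius)
        rand_single += fraction_near(shuffled, shuffled, radius, same_set=True)
    rand_multi /= n_shuffles
    rand_single /= n_shuffles

    # Step 3 (assumed form): observed fraction divided by randomized fraction.
    def ratio(observed, randomized):
        return observed / randomized if randomized > 0 else float("nan")

    return ratio(obs_multi, rand_multi), ratio(obs_single, rand_single)


# Hypothetical centroids in micrometers (illustration only).
multi = [(3.0, 3.0, 3.0), (7.0, 7.0, 7.0)]
single = [(3.5, 3.0, 3.0), (3.0, 3.5, 3.0), (2.5, 3.0, 3.0),
          (7.0, 7.5, 7.0), (7.0, 7.0, 6.5), (7.5, 7.0, 7.0)]
bounds = [(0.0, 10.0)] * 3
multi_index, single_index = clustering_index(single, multi, bounds)
print(multi_index, single_index)
```

      Because one shuffled distribution serves as the baseline for both measurements, a difference between the two returned values cannot be attributed to the randomization itself.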

      We have updated Figure 3 in the revised manuscript to present the relative clustering index described above. We have updated the results, discussion, and methods sections accordingly.

      The authors claim that specificity increases over time. Figure 3b (middle) shows that the number of synapses near complex synapses might increase with time (needs confidence interval for effect size), but does not show that specificity (original relative to randomized) increases with time. The fact that nearby simple synapse density is always (P2) very different from random suggests a primarily non-activity-dependent explanation. The simplest explanation is that same-side boutons could be from the same axon whereas different-side axons could not be.

      We have significantly revised the analysis and presentation of results in Figure 3 to include a comparative measurement of synaptic clustering between multi-AZ and single-AZ synapses (discussed above). The data presented in the original Figure 3B have been moved to Supplemental Figure 4. Statistical comparisons in Figure S4 between the original and randomized synapse distributions are limited to within-age measurements. Cross-age comparisons were not performed or presented. To address the reviewer’s question concerning CI analysis in the original Figure 3B, we provide Author response image 3 below showing 5/95% confidence intervals for WT mice:

      Author response image 3.

      Claim 2: vGlut2 clusters more than 1.5 um away from multi-active zone vGlut2 clusters are not statistically significantly different in size than vGlut2 clusters within 1.5 um of multi-active zone vGlut2 clusters. Therefore "activity-dependent synapse stabilization mechanisms do not impact simple synapse vesicle pool size". The specific measure of 1.5 um from multi-active zone vGlut2 clusters does not represent all possible synapse stabilization mechanisms.

      We agree with the reviewer that this specific measure does not capture all possible synapse stabilization mechanisms. We have modified the text in the revised manuscript throughout to be more cautious in our data interpretation and have included additional discussion of alternative mechanisms consistent with our results.

      Result Section 4.

      Claim: The proximity of complex synapses with nearby simple synapses to other complex synapses with nearby simple synapses from the same eye is used to argue that activity is responsible for all this clustering.

      It is difficult to derive anything from the quantification besides 'not-random'. That is a problem because we already know that axons from the left and right eye segregate during the period being studied. All the measures in Section 4 are influenced by eye-specific segregation. Given this known bias, demonstrating a non-random relationship (P<X) doesn't mean anything. The test will reveal any non-random spatial relationship between same-eye and opposite-eye synapses.

      The results can be stated as: If you are a contralateral complex synapse, contralateral complex synapses that are also close to contralateral simple synapses will, on average, be slightly closer to you than contralateral complex synapses that are not close to contralateral ipsilateral synapses. That would be true if there is any eye-specific segregation (which there is).

      We appreciate the reviewer’s comments that our anatomical data are consistent with several possible mechanisms, suggesting the need for alternative interpretations of the results. In the original writing, we interpreted our results in the context of activity-dependent mechanisms of like-eye stabilization and opposite-eye competition. However, our results are also consistent with other mechanisms, including non-random molecular specification of eye-specific inputs onto subregions of postsynaptic target cells (e.g. distinct relay neuron dendrites). We have rewritten the manuscript to be more cautious in our interpretations and to provide a balanced discussion of alternative possibilities.

      Regarding the concern that the data in section four are influenced by eye-specific segregation, we previously found that synapse density from both eyes is equivalent in the contralateral region at the P4 time point presented (1), which is consistent with binocular axonal overlap at this age. Within our imaging volumes, ipsilateral and contralateral inputs were broadly intermingled throughout the volume, and we did not find evidence for regional segregation within the imaging fields. By these metrics, retraction of ipsilateral inputs from the contralateral territory has not yet occurred.

      It is an overinterpretation of the data to claim that the lack of a clear correlation between vGlut2 cluster volume and distance to vGlut2 clusters with multiple active zones provides support for the claim that "presynaptic protein organization is not influenced by mechanisms governing synaptic clustering".

      We agree with the reviewer that our original language was imprecise in referring to presynaptic protein organization broadly. We have revised this text to present a more accurate description of the results.

      Reviewer #2 (Public Review):

      In this manuscript, Zhang and Speer examine changes in the spatial organization of synaptic proteins during eye-specific segregation, a developmental period when axons from the two eyes initially mingle and gradually segregate into eye-specific regions of the dorsal lateral geniculate. The authors use STORM microscopy and immunostain presynaptic (VGluT2, Bassoon) and postsynaptic (Homer) proteins to identify synaptic release sites. Activity-dependent changes in this spatial organization are identified by comparing the β2KO mice to WT mice. They describe two types of presynaptic organization based on Bassoon clustering, the complex and the simple synapse. By analyzing the relative densities and distances between these proteins over age, the authors conclude that the complex synapses promote the clustering of simple synapses nearby to form the future mature glomerular synaptic structure.

      Strengths:

      The data presented is of good quality and provides an unprecedented view at high resolution of the presynaptic components of the retinogeniculate synapse during active developmental remodeling. This approach offers an advance to the previous mouse EM studies of this synapse because the CTB label allows identification of the eye from which the presynaptic terminal arises. Using this approach, the authors find that simple synapses cluster close to complex synapses over age and that complex synapse density increases with age.

      Weaknesses:

      From these data, the authors conclude that the complex synapse serves to "promote clustering of like-eye synapses and prohibit synapse clustering from the opposite eye". However, the authors show no causal data to support these ideas. There are a number of issues that the authors should consider:

      (1) Clustering of retinal synapses is in part due to the fact that retinal inputs synapse on the proximal dendrites. With increased synaptogenesis, there will be increased density of retinal terminals that are closely localized. And with development, perhaps simple synapses mature into complex synapses. Simple synapses may also represent ones that are in the process of being eliminated as previously described by Campbell and Shatz, JNeurosci 1992 (consider citing). Can the authors distinguish these scenarios from the ones that they conclude?

      We thank the reviewer for their thoughtful commentary and suggestions to improve our manuscript. We agree with the reviewer that our original interpretation of synaptic clustering by activity-dependent stabilization and punishment mechanisms is not directly supported by causal data. We have extensively revised the manuscript to take a more cautious view of the results and to discuss alternative mechanisms that are consistent with our data.

      During eye-specific circuit development, there is indeed increased synaptogenesis and, ultimately, RGC terminals are closely clustered within synaptic glomeruli. This process involves the selective addition and elimination of synapses. Bouton clustering has been shown to occur within individual RGC axons after eye-opening in the mouse (5). The convergence of other RGC types into clustered boutons has been shown at eye-opening by light and electron microscopy (3). There is also qualitative evidence that synaptic clusters may form earlier during eye-specific segregation in the cat (4). Our data provide additional evidence that synaptic clustering begins prior to eye-opening in the mouse (P2-P8). Although synapse numbers also increase during this period, the distribution of synapse addition is non-random. 

      Single-active zone synapses (we previously called these “simple”) may indeed mature into multi-active zone synapses (we previously called these “complex”). At the same time, single-active zone synapses may be eliminated. We believe that each of these events occurs as part of the synaptic refinement process. Our STORM images are static snapshots of eye-specific refinement, and we cannot infer the dynamic developmental trajectory of an individual synapse in our data. Future live imaging experiments in vivo/in situ will be needed to track the maturation and pruning of individual connections. We have expanded our discussion of these limitations and future directions in the manuscript.

      (2) The argument that "complex" synapses are the aggregate of "simple" synapses (Fig 2, S2) is not convincing.

      We agree with the reviewer’s concern about the ambiguous identity of complex synapses. To clarify the nature of multi-active zone synapses, we have performed RGC-specific dAPEX2 labeling to visualize retinogeniculate terminals by electron microscopy (EM). These experiments revealed the presence of synaptic terminals with multiple active zones. We have added images and text to the results section describing these findings. Our 2D EM images do not rule out the possibility that some multi-active zone synapses observed in STORM images are in fact clusters of individual RGC terminals. We have revised the text to provide a more accurate discussion of the nature of multi-active zone synapses.  

      (3) The authors use β2KO mice to assess changes in the organization of synaptic proteins in retinal terminals that have disrupted retinal waves. However, β2-nAChRs are also expressed in the dLGN and other areas of the brain, and glutamatergic synapse development has been reported in the CNS independent of the disruption in retinal waves. This issue should be considered when interpreting the total reduced retinal synapse density in the dLGN of the mutant.

      We thank the reviewer for their suggestion to consider non-retinal effects of the germline deletion of the beta 2 subunit of the nicotinic acetylcholine receptor. Previously, Xu and colleagues reported the development of a conditional transgenic mouse model lacking β2-nAChR expression specifically in the retina (6). These retina-specific β2-nAChR mutant mice (Rx-β2cKO) have disrupted retinal wave properties and defects in eye-specific axonal segregation in binocular anterograde tracing experiments. This work suggests that the defects seen in germline β2-nAChR KO mice arise from defects in retinal wave activity rather than the loss of nicotinic receptors elsewhere in the brain. Additionally, the development of brainstem cholinergic inputs to the dLGN is delayed until the closure of the eye-specific segregation period (7), further suggesting a limited role for cholinergic transmission in the retinogeniculate refinement process.

      (4) Outside of a total synapse density difference between WT and β2KO mice, the changes in the spatial organization of synaptic proteins over development do not seem that different. In fact % simple synapses near complex synapses from the non-dominant eye in the mutant is not that different from WT at P8 (Fig 3C), an age when eye-specific segregation is very different between the genotypes. Can the authors explain this discrepancy?

      We thank the reviewer for their question concerning differences between synapse organization in WT versus β2KO mice. In the original presentation of Figure 3C, the percentage of non-dominant eye single-AZ synapses near multi-AZ synapses increased at P4 in WT mice, but this did not occur in β2KO mice. This is consistent with our previous results showing that there is an increase in non-dominant eye synaptic density at this age, which does not occur in β2KO mice (1). At P8, this clustering effect is lost in WT as eye-specific segregation has taken place and non-dominant eye inputs have been eliminated. However, in β2KO mice, the overall synapse density is still low at this age. We interpret this result as a failure of synaptogenesis in the β2KO line, which leads to increased growth of individual RGC axons (8) and eye-specific overlap at P8 (9, 10). Evidence in support of this interpretation comes from live dynamic imaging studies of RGC axon branching in Xenopus and Zebrafish, showing that synapse formation stabilizes local axon branching and that disruptions of synapse formation or neurotransmission lead to enlarged axons (11-13).

      Our anatomical results do not provide a specific biological mechanism for the remaining clustering observed in the β2KO mice. We have revised our discussion of the fact that individual RGC axons may form multiple synaptic connections leading to clustering, which may be independent of changes in retinal wave properties in the β2KO mouse. We have also extensively revised the analysis and presentation of results in Figure 3 to directly compare synaptic clustering around both multi-AZ synapses and single-AZ synapses within the same imaging volumes.

      (5) The authors use nomenclature that has been previously used and associated with other aspects of retinogeniculate properties. For example, the phrases "simple" and "complex" synapses have been used to describe single boutons or aggregates of boutons from numerous retinal axons, whereas in this manuscript the phrases are used to describe vesicle clusters/release sites with no knowledge of whether they are from single or multiple boutons. Likewise, the use of the word "glomerulus" has been used in the context of the retinogeniculate synapse to refer to a specific pattern of bouton aggregates that involves inhibitory and neuromodulatory inputs. It is not clear how the release sites described by the authors fit in this picture. Finally the use of the word "punishment" is associated with a body of literature regarding the immune system and retinogeniculate refinement-which is not addressed in this study. This double use of the phrases can lead to confusion in the field and should be clarified by clear definitions of how they are used in the current study.

      We appreciate the reviewer’s concern that the terminology we used in the initial submission may cause confusion. We have revised the text throughout for clarity. “Simple” synapses are now referred to as “single-active zone synapses”. “Complex” synapses are now referred to as “multi-active zone synapses”. We have removed all text that previously referred to synaptic clusters in STORM images as glomeruli. We agree that we have not provided causal evidence for synaptic stabilization and punishment mechanisms, which would require additional molecular genetic studies. We have restructured the manuscript to remove these references and discuss our anatomical results impartially.  

      Reviewer #3 (Public Review):

      This manuscript is a follow-up to a recent study of synaptic development based on a powerful data set that combines anterograde labeling, immunofluorescence labeling of synaptic proteins, and STORM imaging (Cell Reports 2023). Specifically, they use anti-Vglut2 label to determine the size of the presynaptic structure (which they describe as the vesicle pool size), anti-Bassoon to label a number of active zones, and anti-Homer to identify postsynaptic densities. In their previous study, they compared the detailed synaptic structure across the development of synapses made with contra-projecting vs ipsi-projecting RGCs and compared this developmental profile with a mouse model with reduced retinal waves. In this study, they produce a new analysis on the same data set in which they classify synapses into "complex" vs. "simple" and assess the number and spacing of these synapses. From these measurements, they make conclusions regarding the processes that lead to synapse competition/stabilization.

      Strengths:

      This is a fantastic data set for describing the structural details of synapse development in a part of the brain undergoing activity-dependent synaptic rearrangements. The fact that they can differentiate eye of origin is also a plus.

      Weaknesses:

      The lack of details provided for the classification scheme as well as the interpretation of small effect sizes limit the interpretations that can be made based on these findings.

      We thank the reviewer for their reading of the manuscript and helpful comments to improve the work. We provide details on how single-active zone and multi-active zone synapses are classified in the methods section. We agree with the suggestion to be more careful in interpreting the results. We have extensively revised the manuscript to 1) include additional electron microscopy data demonstrating the presence of multi-active zone retinogeniculate synapses, 2) extend the synaptic clustering analysis to both single-active zone and multi-active zone synapses for comparison, and 3) improve the clarity and accuracy of the discussion throughout the manuscript.

      (1) The criteria to classify synapses as simple vs. complex is critical for all of the analysis in this study. Therefore this criteria for classification should be much more explicit and tested for robustness. As stated in the methods, it is based on the number of active zones which are designated by the number of Bassoon clusters associated with a Vglut2 cluster (line 697). A second part of the criteria is the size of the presynaptic terminal as assayed by "greater Vglut2 signal" (line 116). So how are these thresholds determined? For Bassoon clusters, is one voxel sufficient? Two? If it's one, how often do they see a Bassoon positive voxel with no Vglut2 cluster and therefore may represent "noise"? There is no distribution of Bassoon volumes that is provided that might be the basis for selecting this number of sites. Unfortunately, the images are not helpful. For example, does P8 WT in Figure 1B have 7 or 2? According to Figure 2C, it appears the numbers are closer to 2-4.

      The Vglut volume measurements also do not seem to provide a clear criterion. Figure 2 shows that the distributions of Vglut2 cluster volumes for complex and for simple synapses are significantly overlapping.

      The authors need to clarify the quantitative approach used for this classification strategy and test how sensitive the results of the study are to how robust this strategy is

      We thank the reviewer for their question concerning the STORM data analysis. Here we provide a brief overview of the complete analysis details, which are provided in the methods section.

      Our raw STORM data sets consisted of spectrally separate volumetric imaging channels of VGluT2, Bassoon, and Homer1 signals. For each of these channels, raw STORM data were processed as follows: 1) the corresponding low-resolution conventional image of each physical section was applied to the STORM data to filter artifacts in the STORM image that do not appear in the conventional image; 2) STORM images were thresholded using a 2-factor Otsu threshold that removes low-intensity background noise while preserving all single-molecule localizations that correspond to genuine antibody labeling as well as non-specific antibody labeling in the tissue; 3) the MATLAB function “conncomp” was applied to identify connected-component voxels in 3D across the image stack, and clusters were kept for further analysis only if they were connected across at least 2 continuous physical sections (140 nm Z depth); 4) for every connected component (clusters corresponding to genuine antibody labeling and background labeling), we measured the volume and signal density (intensity/volume); 5) a threshold was applied to retain clusters that have a higher volume and lower signal density. We exclude signals that have low volume and high density, which correspond to single antibody labels. This analysis retains larger clusters that correspond to synaptic objects and excludes non-specific antibody background.
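
      Steps 3 and 4 of this overview can be illustrated with a small, self-contained sketch. It is not the MATLAB code used in the study: a breadth-first flood fill stands in for the connected-component call, 6-connectivity is an assumption (the response does not state the connectivity), and the threshold is passed in directly rather than derived from an Otsu step. The function name and the toy volume are hypothetical.

```python
from collections import deque


def label_clusters(volume, threshold, min_sections=2):
    """Find 3D connected clusters in volume[z][y][x] above `threshold`.

    Keeps only clusters spanning at least `min_sections` z-sections and
    returns their voxel count and signal density (intensity / volume).
    """
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    neighbors = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                 (0, -1, 0), (0, 0, 1), (0, 0, -1)]  # assumed 6-connectivity
    seen = set()
    clusters = []
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                if (z, y, x) in seen or volume[z][y][x] <= threshold:
                    continue
                # Flood fill from this seed voxel.
                queue = deque([(z, y, x)])
                seen.add((z, y, x))
                voxels = []
                while queue:
                    cz, cy, cx = queue.popleft()
                    voxels.append((cz, cy, cx))
                    for dz, dy, dx in neighbors:
                        n = (cz + dz, cy + dy, cx + dx)
                        if (0 <= n[0] < nz and 0 <= n[1] < ny and 0 <= n[2] < nx
                                and n not in seen
                                and volume[n[0]][n[1]][n[2]] > threshold):
                            seen.add(n)
                            queue.append(n)
                sections = {v[0] for v in voxels}
                if len(sections) < min_sections:
                    continue  # confined to one physical section: discard
                intensity = sum(volume[vz][vy][vx] for vz, vy, vx in voxels)
                clusters.append({
                    "volume": len(voxels),
                    "density": intensity / len(voxels),
                    "sections": len(sections),
                })
    return clusters


# Toy 3-section volume: one cluster spanning z=0..1, one isolated voxel at z=2.
volume = [[[0] * 4 for _ in range(4)] for _ in range(3)]
volume[0][1][1] = 5
volume[1][1][1] = 7
volume[1][1][2] = 6
volume[2][3][3] = 9
clusters = label_clusters(volume, threshold=1)
print(clusters)
```

      The volume and density values returned for each cluster are the two quantities on which the final filter (step 5) operates.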

      The average size of WT synaptic Bassoon clusters ranges from 55 to 3532 voxels (0.00092 to 0.059 μm³), with a median size of 460 voxels (0.0077 μm³).

      The average size of WT synaptic VGluT2 clusters ranges from 50 to 73752 voxels (0.00084 to 1.2 μm³), with a median size of 980 voxels (0.016 μm³).

      The average size of WT synaptic Homer1 clusters ranges from 63 to 7118 voxels (0.0010 to 0.12 μm³), with a median size of 654 voxels (0.011 μm³).

      In practice, any Bassoon/VGluT2/Homer1 clusters with <10 voxels are immediately filtered at the Otsu thresholding step (2) above.
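
      As a reading aid, the voxel counts and volumes quoted above can be related by a single voxel volume. The voxel dimensions are not stated in this response, so the sketch below infers an approximate voxel volume from the reported median pairs; the inferred value and the helper name are assumptions for illustration, not parameters taken from the methods.

```python
# (voxel count, reported volume in cubic micrometers) for the WT medians above.
reported = {
    "Bassoon median": (460, 0.0077),
    "VGluT2 median": (980, 0.016),
    "Homer1 median": (654, 0.011),
}

# Implied volume of one voxel for each pair, then their mean.
voxel_volumes = {name: vol / n for name, (n, vol) in reported.items()}
mean_voxel_volume = sum(voxel_volumes.values()) / len(voxel_volumes)


def voxels_to_um3(n_voxels, voxel_volume=mean_voxel_volume):
    """Convert a voxel count to cubic micrometers using the inferred voxel volume."""
    return n_voxels * voxel_volume


print(f"inferred voxel volume: {mean_voxel_volume:.2e} um^3")
print(f"10-voxel filter cutoff: {voxels_to_um3(10):.2e} um^3")
```

      Under this inferred conversion, the 10-voxel cutoff corresponds to a volume well below the smallest retained synaptic cluster in any channel.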

      The reviewer is correct that we often see Bassoon(+) clusters that are not associated with VGluT2, and these may reflect synapses of non-retinal origin or retinogeniculate synapses that lack VGluT2 expression. To identify retinogeniculate synapses containing VGluT2, we performed a synapse pairing analysis that measured the association between VGluT2 and Bassoon clusters after the synapse cluster filtering described above. We first measured the centroid-centroid distance from each VGluT2 cluster to the closest cluster in the Bassoon channel. We next quantified the signal intensity of the Bassoon channel within a 140 nm shell surrounding each VGluT2 cluster. A 2D histogram was plotted based on the measured centroid-centroid distances and opposing channel signal densities of each cluster. Paired clusters with closely positioned centroids and high intensities of apposed channel signal were identified using the OPTICS algorithm (14).
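
      The first measurement in this pairing analysis, the distance from each VGluT2 centroid to the closest Bassoon centroid, can be sketched as below. This is a simplified illustration only: the shell-intensity measurement is not reproduced, and a fixed distance cutoff stands in for the OPTICS clustering of the 2D histogram used in the study. The cutoff value, function names, and coordinates are hypothetical.

```python
import math


def nearest_centroid(query, candidates):
    """Return (index, distance) of the candidate centroid closest to `query`."""
    best_index, best_dist = None, math.inf
    for i, c in enumerate(candidates):
        d = math.dist(query, c)
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index, best_dist


def pair_clusters(vglut2_centroids, bassoon_centroids, max_distance):
    """Pair each VGluT2 centroid with its nearest Bassoon centroid.

    Assumption: a plain distance cutoff replaces the density-based
    (OPTICS) selection described in the text.
    """
    pairs = []
    for v_index, v in enumerate(vglut2_centroids):
        b_index, dist = nearest_centroid(v, bassoon_centroids)
        if b_index is not None and dist <= max_distance:
            pairs.append((v_index, b_index, dist))
    return pairs


# Hypothetical centroids in micrometers (illustration only).
vglut2 = [(0.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
bassoon = [(0.1, 0.0, 0.0), (9.0, 9.0, 9.0)]
pairs = pair_clusters(vglut2, bassoon, max_distance=0.5)
print(pairs)
```

      In this toy case only the first VGluT2 cluster is paired; the second has no Bassoon cluster nearby and would be treated as unpaired.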

      In the original Figure 1B, the multi-active zone synapse in WT at P8 had two Bassoon clusters. To clarify this, we have revised the images in Figure 1 to include arrowheads that point to individual active zones. We have also revised Supplemental Figure 1 to show volumetric renderings of individual example synapses that help illustrate the 3D structure of these multi-active zone inputs. All details about synapse analysis and synapse pairing are provided in the methods section.

      (2) Effect sizes are quite small and all comparisons are made on medians of distributions. This leads to an n=3 biological replicates for all comparisons. Hence this small n may lead to significant results based on ANOVAS/t-tests, but the statistical power of these effects is quite weak. To accurately represent the variance in their data, the authors should show all three data points for each category (with a SD error bar when possible). They should also include the number of synapses in each category (e.g. the numerators in Figure 1D and the denominators for Figure 1E). For other figures, there are additional statistical questions described below.

      We thank the reviewer for their suggestion to improve the presentation of our results. We have added all three data points (individual biological replicates) to each figure plot when applicable. We have also included a supplemental table (Table S1) listing total eye-specific synapse numbers of each type (mAZ and sAZ) and AZ number for each biological replicate in both genotypes.

      (3) The authors need to add a caveat regarding their classification of synapses as "complex" vs. "simple" since this is a terminology that already exists in the field and it is not clear that these STORM images are measuring the same thing. For example, in EM studies, "complex" refers to multiple RGCs converging on the same single postsynaptic site. The authors here acknowledge that they cannot assign different AZs to different RGCs so this comparison is an assumption. In Figure 2 they argue this is a good assumption based on the finding that the Vglut column/active zone is constant and therefore each represents a single RGC. However, the authors should acknowledge that they are actually seeing quite different percentages than those in EM studies. For example, in Monavarfeshani et al, eLife 2018, there were no complex synapses found at P8. (Note this study also found many more complex vs. simple synapses in the adult - 70% vs. the 20% found in the current study - but this difference could be a developmental effect). In the future, the authors may want to take another data set in the adult dLGN to make a direct comparison based on numbers and see if their classification method for complex/simple maps onto the one that currently exists in the literature.

      We appreciate the reviewer’s comment that the use of the terms “complex” and “simple” may cause confusion. We have significantly revised the manuscript for clarity: 1) We now refer to “complex” synapses as “multi-active zone synapses” and “simple” synapses as “single-active zone synapses”. 2) We have performed electron microscopy analysis of dAPEX2-labeled retinogeniculate projections to confirm the existence of large synaptic terminals with multiple active zones. 3) We have expanded our discussion of previous electron microscopy results describing a lack of axonal convergence at P8 (3). 4) We have added a discussion on how individual RGCs may form multiple synapses in close proximity within their axonal arbor, which would create a clustering effect.

      We agree that it will be informative to collect a STORM data set in the adult mouse dLGN and we look forward to working on this project to compare with EM results in the future.  

      (4) Figure 3 assays the relative distribution of simple vs. complex synapses. They found that a larger percentage of simple synapses were within 1.5 microns of complex synapses than you would expect by chance for both ipsi and contra projecting RGCs, and hence conclude that complex synapses are sites of synaptic clustering. In contrast, there was no clustering of ipsi-simple to contra-complex synapses and vice versa. The authors also argue that this clustering decreases between P4 and P8 for ipsi projecting RGCs.

      This analysis needs much more rigor before any conclusions can be drawn. First, the authors need to justify the 1.5-micron criteria for clustering and how robust their results are to variations in this distance. Second, these age effects need to be tested for statistical significance with an ANOVA (all the stats presented are pairwise comparisons to means expected by random distributions at each age). Finally, the authors should consider what n's to use here - is it still grouped by biological replicate? Why not use individual synapses across mice? If they do biological replicates, then they should again show error bars for each data point in their biological replicates. And they should include the number of synapses that went into these measurements in the caption.

      We appreciate the suggestion to improve the rigor of our analysis of synaptic clustering presented in Figure 3. We have revised our analysis to measure the degree of synapse clustering nearby both multi-AZ and single-AZ synapses after an equivalent randomization of single-AZ synapse positions in the volume. 

      We now present the revised results as a “clustering index” for both multi-AZ synapses and single-AZ synapses. This measurement was performed in several steps: 1) randomization of single-AZ positions within the imaging volume while holding multi-AZ centroid positions fixed, 2) independent measurements of the fraction of single-AZ synapses within the local shell (1.5 μm search radius) around multi-AZ and single-AZ synapses within the random distribution, 3) comparison of the result from (2) with the actual fractional measurements in the raw STORM data to compute a “clustering index” value. 4) Because the randomization is equivalent for both multi-AZ and single-AZ synapse measurements, the measured differences in the degree of clustering reflect a synapse type-specific effect.
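
      To make the steps above concrete, the following is a minimal Python sketch of one way such a clustering index could be computed. It is illustrative only and is not the analysis code used in the study: the function names (`fraction_near`, `clustering_index`), the uniform shuffle within a box-shaped volume, and the definition of the index as the ratio of the observed fraction to the mean randomized fraction are all assumptions made for this example.

```python
import math
import random


def fraction_near(targets, singles, radius):
    """Fraction of `singles` that lie within `radius` of at least one target.

    A point is never counted as its own neighbor, so the same list can be
    passed as both `targets` and `singles`.
    """
    if not singles:
        return 0.0
    near = 0
    for s in singles:
        if any(t is not s and math.dist(s, t) <= radius for t in targets):
            near += 1
    return near / len(singles)


def clustering_index(singles, bounds, multis=None, radius=1.5,
                     n_shuffles=50, seed=0):
    """Observed fraction divided by the mean fraction after shuffling.

    ASSUMPTION: the index is a simple ratio; the study may define it
    differently. `bounds` is the (x, y, z) size of the imaging volume in
    micrometers. If `multis` is None, single-AZ synapses are measured
    against other single-AZ synapses; otherwise the multi-AZ centroids
    are held fixed while single-AZ positions are shuffled.
    """
    rng = random.Random(seed)
    observed = fraction_near(singles if multis is None else multis,
                             singles, radius)
    total = 0.0
    for _ in range(n_shuffles):
        # Hypothetical randomization: uniform positions inside the volume.
        shuffled = [tuple(rng.uniform(0.0, b) for b in bounds)
                    for _ in singles]
        total += fraction_near(shuffled if multis is None else multis,
                               shuffled, radius)
    expected = total / n_shuffles
    # If no shuffled synapse ever falls near a target, the ratio is unbounded.
    return observed / expected if expected > 0 else float("inf")
```

      Passing `multis=None` measures clustering of single-AZ synapses around other single-AZ synapses under the same shuffle, so the two index values can be compared directly.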

      We have also updated Supplemental Figure 3 showing the results of varying the search radius from 1-4 μm for both contralateral- and ipsilateral-eye synapses. The results showed that a search radius of 1.5 μm resulted in the largest difference between the original synapse distribution and a randomized synapse distribution (shuffling of single-active zone synapse position while holding multi-active zone synapse position fixed).

      Finally, we have removed all statistical comparisons of single measurements (means or ratios) across ages from the manuscript. We focus our statistical analysis on paired data comparisons within individual biological replicates.

      For the analysis of synapse clustering, we grouped the data by biological replicates (N=3) to look for a global effect on synapse clustering. In the revised manuscript, we added data points for each replicate in the figure and included the number of synapses in Supplementary Table 1.

      (5) Line 211-212 - the authors conclude that the absence of clustered ipsi-simple synapses indicates a failure to stabilize (Figure 3). Yet, the link between this measurement and synapse stabilization is not clear. In particular, the conclusion that "isolated" synapses are the ones that will be eliminated seems to be countered by their finding in Figure 3D/E which shows that there is no difference in vesicle pool volume between near and far synapses. If isolated synapses are indeed the ones that fail to stabilize by P8, wouldn't you expect them to be weaker/have fewer vesicles? Also, it's hard to tell if there is an age-dependent effect since the data presented in Figures 3D/E are merged across ages.

      We thank the reviewer for their suggestion to clarify the results in Figure 3. Based on the measured eye-specific differences in vesicle pool size and organization, we also expected that synapses outside of clusters would show a reduced vesicle population. However, across all ages, we found no differences in the vesicle pool size of single-active zone synapses based on their proximity to multi-active zone synapses. Below, we show cumulative distributions of these results across all ages (P2/P4/P8) for WT mice CTB(+) data. Kolmogorov-Smirnov tests show no significant differences (P = 0.880, 0.767, and 0.494, respectively). Separate 5/95% confidence interval calculations showed overlap between far and near populations at each age.

      Author response image 4.

      To clarify the presentation of the results, we have changed the text to state that the “vesicle pool size of sAZ synapses is independent of their distance to mAZ synapses”. We have removed references to stabilization and punishment from the results section of the manuscript.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Because none of the phenomena being measured can be expected to behave randomly (given what is already known about the system) and the sample size is small, I believe quantification of the data requires confidence intervals for effect sizes. Resolving the multi-bouton vs multi-active zone bouton with EM would also help.

      We thank the reviewer for their thorough reading of the manuscript and many helpful suggestions. We provide analysis with confidence intervals in a point-by-point response below. In the manuscript we revised our results and focused our statistical analyses on comparisons within the same biological replicate (paired effects). In addition, we have performed electron microscopy of RGC inputs to the dLGN at postnatal day 8 to demonstrate the presence of retinogeniculate synapses with multiple active zones.

      Figure 1:

      Please show data points in scatter bar plots and not just error bars.

      We have updated all plots to show data points for independent biological replicates.

      Please describe the image processing in more detail and provide an image in which the degree of off-target labeling can be evaluated.

      We have updated the description of the image processing in the methods sections. We have made all the code used in this analysis freely available on GitHub (https://github.com/SpeerLab). We have uploaded the raw STORM images of the full data set to the open-access Brain Imaging Library (16). These images can be accessed here: https://api.brainimagelibrary.org/web/view?bildid=ace-dud-lid (WTP2A data for example). All 18 datasets are currently searchable on the BIL by keyword “dLGN” or PI last name “Speer” and a DOI for the grouped dataset is pending.

      How does panel 1D get very small error bars with N = 3? Please provide scatter plots.

      We have updated panel 1D to show the means for each independent biological replicate.

      Line 129: over what volume is density measured? What are the n's? What is the magnitude (with confidence intervals) of increase?

      The volume we collected from each replicate was ~80 μm × 80 μm × 7 μm (total volume ~44,800 μm<sup>3</sup>). N=3 biological replicates for each age, genotype, and tissue location. Because of concerns with the use of ANOVA for low sample numbers, we have removed a majority of the age-wise comparisons from the manuscript and instead focus on within-replicate paired data comparisons. Author response image 5 shows 5/95% confidence intervals for WT data (left panel) and β2KO data (right panel):

      Author response image 5.

      The 5/95% CI range for the increase in synapse density from P2 to P8 for CTB(+) synapses is approximately -0.001 to 0.037 synapses / μm<sup>3</sup>.

      Line 131: You say that non-dominant increases and then decreases. It appears that the error bars argue that you do not have enough information to reliably determine how much or little density changes.

      Line 140: No confidence intervals. It appears the error bars allow both for the claimed effect of increased fraction and the opposite effect of decreased density.

      Because of concerns with the use of ANOVA for low sample numbers, we have removed age-wise comparisons of single-measurements (means and ratios) from the manuscript and instead focus on within-replicate paired data comparisons.

      Line 144: Confidence intervals would be a reasonable way to argue that fraction is not changed in KO: normal fraction XX%-XX%. KO fraction XX%-XX%.

      Author response image 6 shows panels for WT (left) and β2KO mice (right) with 5/95% CIs.

      Author response image 6.

      In the revised manuscript, we have updated the text to report the measurements, but we do not draw conclusions about changes over development.

      I find it hard to estimate magnitudes on a log scale.

      We appreciate the reviewer’s concern with the presentation of results on a log scale. Because the measured synapse properties are approximately log-normally distributed, we have elected to present the data on a log scale so that the distribution(s) can be seen clearly. Lognormal distributions enable us to use a mixed linear model for statistical analysis.

      Line 156: Needs confidence interval for difference.

      Line 158: Needs confidence interval for difference of differences.

      Line 160: Needs confidence interval for difference of differences.

      Why only compare at P4 where there is the biggest difference? The activity hypothesis would predict an even bigger effect at P8.

      Below is a table listing the mean volume (log10 μm<sup>3</sup>) and [5/95%] confidence intervals for comparisons of VGluT2 signal between CTB(+) and CTB(-) synapses from Figure 2A and 2B:

      Author response table 2.

      Based on the values given above, the mean difference of differences and [5/95%] confidence intervals are listed below:

      Author response table 3.

      We added these values to the manuscript. We have also reported the difference in median values on a linear scale (as below) so that the readers can have a straightforward understanding of the magnitude.

      Author response table 4.

      We elected to highlight the results at P4 based on our previous finding that the synapse density from each eye-of-origin is similar at this time point (1).

      At P8, there is a decrease in the magnitude of the difference between CTB(+)/CTB(-) synapses compared to P4. This may be due to an increase in VGluT2 volume within non-dominant eye synapses that survive competition between P4-P8.

      At P8 in the mutant, there is an increase in the magnitude of the difference between CTB(+)/CTB(-) synapses compared to P4. This may be due to delayed synaptic maturation in β2KO mice.

      Line 171: The correct statistical comparison was not performed for the claim. Lack of * at P2 does not mean they are the same. Why do you get the same result for KO?

      We have revised the statistical analysis, figure presentation, and text to remove discussion of changes in the number of active zones per synapse over development based on ANOVA. We now report eye-specific differences at each time point using paired t-test analysis, which is mathematically equivalent to checking whether the 5/95% confidence interval of the difference excludes zero.
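
      The relationship between a paired t-test and a confidence interval of the paired difference can be illustrated with a short Python sketch. This is not the analysis code used in the study. It assumes that "5/95%" denotes a two-sided 90% interval and that N = 3 pairs (2 degrees of freedom), for which the tabulated Student's t critical value is 2.920; the function names and the example numbers in the usage note are hypothetical.

```python
import math
import statistics

# ASSUMPTION: 5/95% limits correspond to a two-sided 90% interval.
# Tabulated critical value of Student's t for df = 2 (N = 3 pairs).
T_CRIT_DF2_90 = 2.920


def paired_differences(a, b):
    """Within-replicate differences a[i] - b[i]."""
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    return [x - y for x, y in zip(a, b)]


def paired_t_statistic(a, b):
    """Paired t statistic: mean difference divided by its standard error."""
    d = paired_differences(a, b)
    sem = statistics.stdev(d) / math.sqrt(len(d))
    return statistics.mean(d) / sem


def paired_ci(a, b, t_crit=T_CRIT_DF2_90):
    """Confidence interval of the mean paired difference.

    The interval excludes zero exactly when |t| exceeds `t_crit`, which is
    the sense in which the two procedures agree at the matching level.
    """
    d = paired_differences(a, b)
    mean = statistics.mean(d)
    half_width = t_crit * statistics.stdev(d) / math.sqrt(len(d))
    return mean - half_width, mean + half_width
```

      For example, with made-up replicate means `a = [1.0, 2.0, 3.0]` and `b = [0.5, 1.4, 2.6]`, the interval returned by `paired_ci(a, b)` lies entirely above zero and `abs(paired_t_statistic(a, b))` exceeds 2.920, so both views lead to the same decision.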

      Line 175: Qualitative claim. Correlation coefficients and magnitudes of correlation coefficients are not reported.

      Linear fit slopes and R-squared values are given below:

      Author response table 5.

      The values are added to the manuscript to support the conclusions.

      Line 177: n.s. does not mean that you have demonstrated the values are the same. An argument for similarity could be made by calculating a confidence interval a for potential range of differences. Example: Complex were 60%-170% of Simple.

      Author response image 7 with 5/95% CI is shown below (WT and β2KO):

      Author response image 7.

      Comparing the difference between multi-AZ synapse and single-AZ synapse revealed that the difference in average VGluT2 cluster volume per AZ is:

      Author response table 6.

      The values are added to the manuscript for discussion.

      Line 178: There is no reason to think that the vesicle pool for a single bouton does not scale with active zone number within the range of uncertainty presented here.

      We have collected EM images of multi-AZ synapses and modified our discussion and conclusions in the revised text.

      Line 196: "non-random clustering increased progressively" is misleading. The density of the boutons increases for both the Original and Randomized. Given the increase in variance at P8, it is unlikely that the data supports the claim that the non-randomness increased. Would be easy to quantify with confidence intervals for a measure of specificity (O/R).

      We have revised the manuscript to remove analysis and discussion of changes in clustering over development. We have modified this section of the manuscript and figures to present a normalized clustering index that describes the non-random clustering effect present at each time point.

      Line 209: Evidence is for correlation, not causation and there is a trivial potential explanation for correlation.

      We appreciate the reviewer’s concern with overinterpretation of the results. We have changed the text to more accurately reflect the data.

      Line 238-239: Authors failed to show effect is activity-dependent. Near/Far distinction is not necessarily a criterion for the effect of activity. The claim is likely false in other systems.

      We agree with the reviewer that the original text overinterpreted the results. We have changed the text to more accurately reflect the data. 

      Line 265-266: Assumes previous result is correct and measure of vGlut2 provides information about all presynaptic protein organization.

      We thank the reviewer for pointing out the incorrect reference to all presynaptic protein organization. We have corrected the text to reference only the VGluT2 and Bassoon signals that were measured.

      Line 276: There are many other interpretations that include trivial causes. It is unclear what the measure indicates about the biology and there is no interpretable magnitude of effect.

      We agree with the reviewer that the original text overinterpreted the results. We have changed the text to remove references to mechanisms of synaptic stabilization.

      Line 289: Differences cannot be demonstrated by comparing P-values. Try comparing confidence intervals for effect size or generate a confidence interval for the difference between the two groups.

      5/95% confidence intervals are given below for Figure 4C/D:

      Author response table 7.

      We have added these values to the manuscript to support our conclusion.

      Line 305: "This suggests that complex synapses from the non-dominant-eye do not exert a punishment effect on synapses from the dominant-eye" Even if all the other assumptions in this claim were true, "n.s." just means you don't know something. It cannot be compared with an asterisk to claim a lack of effect.

      We thank the reviewer for raising this concern. We have modified the text to remove references to synaptic punishment mechanisms in the results section.

      Below are the 5/95% confidence intervals for the results in Figure 4F:

      Author response table 8.

      We have added these values to the manuscript to support our conclusion.

      Line 308: "mechanisms that act locally". 6 microns is introduced based on differences in curves above(?). I don't see any analysis that would argue that longer-distance effects were not present.

      The original reference referred to the differences in the cumulative distribution measurements between multi-active zone synapses versus single-active zone synapses in their distance to the nearest neighboring multi-active zone synapse. For clarity, we have deleted the reference to the 6 micron distance in the revised text.

      Reviewer #2 (Recommendations For The Authors):

      (1) This data set would be valuable to the community. However, unless the authors can show experiments that manipulate the presence of complex synapses to test their concluding claims, the manuscript should be rewritten with a reassessment of the conclusions that is more grounded in the data.

      We thank the reviewer for their careful reading of the manuscript and we agree the original interpretations were not causally supported by the experimental results. We have made substantial changes to the text throughout the introduction, results, and discussion sections so that the conclusions accurately reflect the data.

      (2) To convincingly address the claim that "complex synapses" are aggregates of simple synapses, the authors should perform experiments at the EM level showing what the bouton correlates are to these synapses.

      We thank the reviewer for their suggestion to perform EM to gain a better understanding of retinogeniculate terminal structure. We generated an RGC-specific transgenic line expressing the EM reporter dAPEX2 localized to mitochondria. We have collected EM images of retinogeniculate terminals that demonstrate the presence of multiple active zones within individual synapses. These results are now presented in Figure 1. The text has been updated to reflect the new results.

      (3) Experiments using the conditional β2KO mice would help address questions of the contribution of β2-nAChRs in dLGN to the synaptic phenotype.

      We appreciate the reviewer’s concern that the germline β2KO model may show effects that are not retina-specific. To address this, Xu and colleagues generated a retina-specific conditional β2KO transgenic and characterized wave properties and defective eye-specific segregation at the level of bulk axonal tracing (6). The results from the conditional mutant study suggest that the main effects on eye-specific axon refinement in the germline β2KO model are likely of retinal origin through impacts on retinal wave activity. Additionally, anatomical data shows that brainstem cholinergic axons innervate the dLGN toward the second half of eye-specific segregation and are not fully mature at P8 when eye-specific refinement is largely complete (7). We agree with the reviewer that future synaptic studies of previously published wave mutants, including the conditional reporter line, would be needed to conclusively assess a contribution of non-retinal nAChRs. These experiments will take significant time and resources and we respectfully suggest this is beyond the scope of the current manuscript.

      Reviewer #3 (Recommendations For The Authors):

      (1) The authors need to be more transparent that they are using the same data set from the previous publication (right now it does not appear until line 471) and clarify what was found in that study vs what is being tested here.

      We thank the reviewer for their thoughtful reading of the manuscript and helpful recommendations to improve the clarity of the work. We have edited the text to make it clear that this study is a reanalysis of an existing data set. We have revised the text to discuss the results from our previous study and more clearly define how the current analysis builds upon that initial work. 

      (2) The authors restricted their competition argument in Figure 4 to complex synapses, but why not include the simple ones? This seems like a straightforward analysis to do.

      We appreciate the reviewer’s suggestion to measure spatial relationships between “clustered” and “isolated” single-AZ synapses as we have done for multi-AZ synapses in Figure 4. However, we are not able to perform a direct and interpretable comparison with the results shown for multi-AZ synapses. First, we would need to classify “clustered” and “isolated” single-AZ synapses. This classification convolves two effects: 1) a distance threshold to define clustering and 2) subsequent distance measurements between clustered synapses.

      If we apply an equivalent 1.5 μm distance threshold (or any other threshold) to define clustered synapses, the distance from each “clustered” single-AZ synapse to the nearest other single-AZ synapse will always be smaller than the defined threshold (1.5 μm). Alternatively, if all of the single-AZ synapses within each local 1.5 μm shell are excluded from the subsequent intersynaptic distance measurements, this will set a hard lower boundary on the distance between synaptic clusters (1.5 μm minimum). The two effects discussed above were separated in our original analysis of multi-AZ synapses defined as “clustered” and “isolated” based on their relationship to single-AZ synapses, but these effects cannot be separated when analyzing single-AZ distributions alone.
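
      The first of these confounds can be demonstrated with a small Python sketch. It is an illustration written for this response text, not the study's analysis code; the function names and the toy coordinates in the usage note are hypothetical. It labels each point as "clustered" or "isolated" by a nearest-neighbor threshold and then reports the nearest-neighbor distances within each group, showing that the "clustered" distances are bounded by the threshold by construction.

```python
import math


def nearest_neighbor_distances(points):
    """Distance from each point to its nearest other point."""
    return [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]


def split_by_threshold(points, threshold=1.5):
    """Split nearest-neighbor distances into 'clustered' and 'isolated'.

    A point is called clustered when another point lies within `threshold`.
    Every distance in the first list is therefore <= threshold, and every
    distance in the second list is > threshold, regardless of the biology.
    """
    distances = nearest_neighbor_distances(points)
    clustered = [d for d in distances if d <= threshold]
    isolated = [d for d in distances if d > threshold]
    return clustered, isolated
```

      With the toy coordinates `[(0, 0, 0), (1, 0, 0), (10, 0, 0)]` (in micrometers), the first two points are labeled clustered with nearest-neighbor distances of 1.0 and the third is isolated with a distance of 9.0, so any comparison of these distances between the two groups simply restates the threshold.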

      (3) The Discussion seems much too long and speculative from the current data that is represented - particularly without verification of complex synapses actually being inputs from different RGCs. Along the same lines, figure captions are misleading. For example, for Figure 4 - the title indicates that the complex synapses are driving the rearrangements. But of course, these are static images. The authors should use titles that are more reflective of their findings rather than this interpretation.

      We thank the reviewer for these helpful suggestions. We have changed each of the figure captions to more accurately reflect the results. We have deleted all of the speculative discussion and revised the remaining text to improve the accuracy of the presentation.

      (4) In the future, the authors may want to consider an analysis as to whether ipsi and contra projections contribute to the same synapses.

      We agree with the reviewer that it is of interest to investigate the contribution of binocular inputs to retinogeniculate synaptic clusters during development. At maturity, some weak binocular input remains in the dominant-eye territory (15). To look for evidence of binocular synaptic interactions, we measured the percentage of the total small single-active zone synapses that were within 1.5 micrometers of larger multi-active zone synapses of the opposite eye. On average, ~10% or less of the single-active zone synapses were near multi-active zone synapses of the opposite eye. This analysis is presented in Supplemental Figure S3C/D.

      It is possible that some large mAZ synapses might reflect the convergence of two or more smaller inputs from the two eyes. Our current analyses do not rule this out. However, previous EM studies have found limited evidence for convergence of multiple RGCs (3) at P8 and our own EM images show that larger terminals with multiple active zones are formed by a single RGC bouton. Future volumetric EM reconstructions with eye-specific labels will be informative to address this question.

      References

      (1) Zhang C, Yadav S, Speer CM. The synaptic basis of activity-dependent eye-specific competition. Cell Rep. 2023;42(2):112085.

      (2) Bickford ME, Slusarczyk A, Dilger EK, Krahe TE, Kucuk C, Guido W. Synaptic development of the mouse dorsal lateral geniculate nucleus. J Comp Neurol. 2010;518(5):622-35.

      (3) Monavarfeshani A, Stanton G, Van Name J, Su K, Mills WA, 3rd, Swilling K, et al. LRRTM1 underlies synaptic convergence in visual thalamus. Elife. 2018;7.

      (4) Campbell G, Shatz CJ. Synapses formed by identified retinogeniculate axons during the segregation of eye input. J Neurosci. 1992;12(5):1847-58.

      (5) Hong YK, Park S, Litvina EY, Morales J, Sanes JR, Chen C. Refinement of the retinogeniculate synapse by bouton clustering. Neuron. 2014;84(2):332-9.

      (6) Xu HP, Burbridge TJ, Chen MG, Ge X, Zhang Y, Zhou ZJ, et al. Spatial pattern of spontaneous retinal waves instructs retinotopic map refinement more than activity frequency. Dev Neurobiol. 2015;75(6):621-40.

      (7) Sokhadze G, Seabrook TA, Guido W. The absence of retinal input disrupts the development of cholinergic brainstem projections in the mouse dorsal lateral geniculate nucleus. Neural Dev. 2018;13(1):27.

      (8) Dhande OS, Hua EW, Guh E, Yeh J, Bhatt S, Zhang Y, et al. Development of single retinofugal axon arbors in normal and beta2 knock-out mice. J Neurosci. 2011;31(9):3384-99.

      (9) Rossi FM, Pizzorusso T, Porciatti V, Marubio LM, Maffei L, Changeux JP. Requirement of the nicotinic acetylcholine receptor beta 2 subunit for the anatomical and functional development of the visual system. Proc Natl Acad Sci U S A. 2001;98(11):6453-8.

      (10) Muir-Robinson G, Hwang BJ, Feller MB. Retinogeniculate axons undergo eye-specific segregation in the absence of eye-specific layers. J Neurosci. 2002;22(13):5259-64.

      (11) Fredj NB, Hammond S, Otsuna H, Chien C-B, Burrone J, Meyer MP. Synaptic Activity and Activity-Dependent Competition Regulates Axon Arbor Maturation, Growth Arrest, and Territory in the Retinotectal Projection. J Neurosci. 2010;30(32):10939.

      (12) Hua JY, Smear MC, Baier H, Smith SJ. Regulation of axon growth in vivo by activity-based competition. Nature. 2005;434(7036):1022-6.

      (13) Rahman TN, Munz M, Kutsarova E, Bilash OM, Ruthazer ES. Stentian structural plasticity in the developing visual system. Proc Natl Acad Sci U S A. 2020;117(20):10636-8.

      (14) Ankerst M, Breunig MM, Kriegel H-P, Sander J. OPTICS: ordering points to identify the clustering structure. SIGMOD Rec. 1999;28(2):49–60.

      (15) Bauer J, Weiler S, Fernholz MHP, Laubender D, Scheuss V, Hübener M, et al. Limited functional convergence of eye-specific inputs in the retinogeniculate pathway of the mouse. Neuron. 2021;109(15):2457-68.e12.

      (16) Benninger K, Hood G, Simmel D, Tuite L, Wetzel A, Ropelewski A, et al. Cyberinfrastructure of a Multi-Petabyte Microscopy Resource for Neuroscience Research. Practice and Experience in Advanced Research Computing; Portland, OR, USA: Association for Computing Machinery; 2020. p. 1–7.

    1. Author response:

      The following is the authors’ response to the current reviews.

      We thank the reviewers for their overall careful evaluation of our work, the constructive criticism, and their many helpful suggestions. We feel that our revision built on the strengths identified by the reviewers, and addressed all the concerns they have raised. Both reviewers recognize that our revisions have improved the paper. Since the first submission we have:

      • Rewritten large parts of the paper to improve clarity and make it more concise where possible

      • Simulated an alternative working memory model, as recommended by Reviewer 1

      • Included 4 new/revised supplementary figures, following the reviewers’ suggestions for additional analysis.

      Below we provide a brief response to the Reviewers’ comments on our manuscript revision.

      Reviewer #1: Public Review:

      Strengths:

      Overall, the work offers a very interesting approach to a topic which is hard to address experimentally, so the computational take is entirely justified and extremely useful. The authors carefully designed the computational experiments to shed light on the demyelination effects on working memory from multiple levels of description, increasing the reliability of their conclusions. I think this work now provides convincing evidence and has the potential to be influential in future studies of myelin alterations (and related disorders such as multiple sclerosis).

      Weaknesses:

      In its current form, the authors have improved the clarity of the results and the model details, and have provided a new set of simulations to complement and reinforce the original ones (including the development of a new spatial working memory model based on silent working memory principles). I do not appreciate any significant weaknesses at this point.

      We thank the reviewer for these positive comments on our revision and for the suggestion of adding the silent memory model, as we feel this has strengthened our findings.

      Reviewer #2: Public Review:

      This paper analyzes the effect of axon de-myelination and re-myelination on action potential speed, and propagation failure. Next, the findings are then incorporated in a standard spiking ring attractor model of working memory.

      I think the results are not very surprising or solid and there are issues with method and presentation.

      The authors did many simulations with random parameters, then averaged the result, and found for instance that the Conduction Velocity drops in demyelination. It gives the reader little insight into what is really going on. My personal preference is for a well understood simple model rather than a poorly understood complex model. The link between the model outcome of WM and data remains qualitative and is further weakened by the existence of known other age-related effects in PFC circuits.

      Comments on revised version:

      The paper has improved in the revision, although I still think a reduced model would have been nice.

      As noted above, in addition to our spiking bump attractor model, our revision includes a second network-level model: an activity-silent working memory model for continuous features. We found qualitatively similar effects as in our bump attractor network model, showing that our main conclusions do not critically depend on the exact working memory mechanism (active vs. activity-silent). This new model was described in two new supplementary figures and a new paragraph in the Results section.

      We did not add a reduced model in our revision to this paper, since neither reviewer explicitly recommended that we add one. As we noted in our private response to reviewers that accompanied our revision: we share the view that understanding simple models can provide critical insights into brain function (and we believe that many of our papers related to attractor dynamics in working memory and decision-making fall into this category, e.g. Wimmer et al. 2014, Esnaola-Acebes et al. 2022, Ibañez et al. 2020). We disagree with the reviewer on an important point: we feel that the model complexity that we have chosen is appropriate and necessary to study the phenomenon at hand. Our modeling efforts are principled, with complexity added as necessary. We started with a biophysical single neuron model with firing dynamics fit to empirical data in pyramidal neurons of rhesus monkey dlPFC (Rumbell et al. 2016) – the same type of neurons and cortical region analyzed in the Peters et al. work on structural changes to myelin seen during aging (e.g., Figure 1). Because simple models do not accurately capture the CV along thin axons like those in the PFC, we attached a multicompartment axon with detailed myelinated segments, and constructed a cohort of feasible models. We then used this cohort to get quantitative estimates of the effects of variable degrees of demyelination and remyelination. This would not be possible with a simpler model. We then studied the consequences of de- and re-myelination in a spiking neural network model. Again, we could not use a simpler model (e.g. a firing rate attractor model) without making gross assumptions about how demyelination affects circuit function. In sum, we believe that our models are relatively simple but comprehensive given the phenomenon that we are studying.

      The reviewer is correct in that there exist “known other age-related effects in PFC circuits”. These are reviewed in the introduction and we discuss future extensions of our model that would incorporate those effects as well. It is important to note that this is the first comprehensive study of demyelination effects in aging PFC, demonstrating that myelin changes alone predict working memory changes associated with aging.

      While we agree that averaging results across different parameter sets provides a limited understanding of the system, we maintain that such analyses provide an important baseline. We acknowledge that results vary across our model cohort; this is why we included the heatmaps of our single cell model perturbation results (Figure 3 and Supplementary Figure 3), and simulated network models representing a heterogeneity of neuronal axons with healthy and altered myelin sheaths to different degrees, as likely occurs in the aging brain (Figures 7 and 8). The model framework we present here is well-suited for more targeted analyses and better insights, including those which we are pursuing currently.


      The following is the authors’ response to the original reviews.

      We thank the reviewers for their careful evaluation of our work, the constructive criticism, and their many helpful suggestions. We feel that our revision builds on the strengths identified by the reviewers, and addresses all the concerns they have raised. We have:

      • Rewritten large parts of the paper to improve clarity and make it more concise where possible

      • Simulated an alternative working memory model

      • Included 4 new/revised supplementary figures, following the reviewer’s suggestions for additional analysis

      Reviewer #1 (Public Review):

      Summary:

      The authors study the effects of myelin alterations on working memory via the complementary use of two computational approaches: one based on de- and re-myelination in multicompartmental models of pyramidal neurons, and one based on synaptic changes in a spiking bump attractor model for spatial working memory. The first model provides the most precise angle (biophysically speaking) on the different effects (loss of myelin lamellae or segments, remyelination with thinner and shorter nodes, etc.), while the second model allows one to infer the consequences of myelin alterations for working memory performance, including memory stability, duration, and bump diffusion. The results indicate (i) a slowing down and failure of propagation of spikes with demyelination and partial recovery with remyelination, with detailed predictions on the role of nodes and myelin lamellae, and (ii) a decrease in memory duration and an increase in memory drift as a function of the demyelination, in agreement with multiple experimental studies.

      Strengths:

      Overall, the work offers a very interesting approach to a topic that is hard to address experimentally, so the computational take is entirely justified and extremely useful. The authors carefully designed the computational experiments to shed light on the effects of demyelination on working memory at multiple levels of description, increasing the reliability of their conclusions. I think this work is solid and has the potential to be influential in future studies of myelin alterations (and related disorders such as multiple sclerosis).

      We thank the reviewer for these positive comments on our manuscript.

      Weaknesses:

      In its current form, the study still presents several issues which prevent it from achieving a higher potential impact. These can be summarized in two main items. First, the manuscript is missing some important details about how demyelination and remyelination are incorporated in both models (and what is the connection between both implementations). For example, it is unclear whether an unperturbed axon and a fully remyelinated axon would be mathematically equivalent in the multicompartment model, or how the changes in the number of nodes, myelin lamella, etc, are implemented in the spiking neural network model.

      We thank the reviewer for these suggestions to improve the clarity of our manuscript. A ‘fully remyelinated’ axon is not mathematically equivalent to the unperturbed axon: it has shorter and thinner myelinated segments, and additional nodes in between. This is consistent with empirical observations in rhesus monkey dlPFC, as reviewed in Peters et al. (2009): a 90% increase in paranode profiles, and myelin sheaths that were thinner than expected for the size of the enclosed axon. With no empirical observations of fewer numbers of nodes (but rather, the opposite) or bare sections of axon, we assumed that the remyelination process also creates new nodes (which are identical to existing nodes), as also modeled in Scurfield & Latimer (2018). We have added two new sentences to the results to clarify this fact, before presenting the first set of results for the single cell model: (starting at line 137):

      “To simulate demyelination, we removed lamellae from selected myelinated segments; for remyelination we replaced a fraction of myelinated segments by two shorter and thinner segments with a node in between. As such, a ‘fully remyelinated axon’ had all the demyelinated segments subsequently remyelinated, but with fewer lamellae and additional nodes compared to the unperturbed control case, consistent with empirical observations (Peters, 2009).”

      We also state the maximal amount of remyelination more explicitly in the Results, starting on lines 164-165: “We next examined the extent to which remyelination with shorter and thinner segments, occurring after demyelination, restored axonal AP propagation (Figure 4).”

      Also on line 192-193: “Remyelinating all affected segments with 75% of lamellae (the maximal amount of remyelination) nearly eliminated AP failures (1.8 ± 1.1%).”

      Finally, in Methods we also clarified the structure of the added node (starting at line 634): “Remyelination was performed by replacing an affected (previously demyelinated) segment with two shorter segments, each including paranodes, juxtaparanodes, and an internode, and a new node between them that was identical to existing nodes.”
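
      As an illustration only (not the authors' code), the perturbation procedure quoted above can be sketched as operations on a list of myelinated segments. The `Segment` class, the function names, and the example values (8 segments, 20 lamellae, 100 µm) are hypothetical; the sketch also ignores paranodes, juxtaparanodes, and the length of the new node, all of which the actual model includes.

```python
import random
from dataclasses import dataclass


@dataclass
class Segment:
    length_um: float  # length of the myelinated segment (hypothetical units)
    lamellae: int     # number of myelin lamellae wrapping the segment


def demyelinate(segments, frac_segments, frac_lamellae_removed, rng):
    """Remove a fraction of lamellae from a randomly chosen subset of segments.

    Returns the perturbed segment list and the indices of affected segments.
    """
    n_affected = round(frac_segments * len(segments))
    affected = sorted(rng.sample(range(len(segments)), n_affected))
    perturbed = [Segment(s.length_um, s.lamellae) for s in segments]
    for i in affected:
        perturbed[i].lamellae = round(perturbed[i].lamellae * (1 - frac_lamellae_removed))
    return perturbed, affected


def remyelinate(original, perturbed, affected, frac_lamellae_restored):
    """Replace each affected segment by two shorter, thinner segments.

    A new node is implied between the two halves; its own length is
    neglected here for simplicity (an assumption of this sketch).
    """
    result = []
    for i, seg in enumerate(perturbed):
        if i in affected:
            lam = round(original[i].lamellae * frac_lamellae_restored)
            half = original[i].length_um / 2
            result += [Segment(half, lam), Segment(half, lam)]
        else:
            result.append(seg)
    return result


rng = random.Random(0)
control = [Segment(100.0, 20) for _ in range(8)]
# Remove all lamellae from 50% of segments, then remyelinate with 75% of lamellae.
demyelinated, affected = demyelinate(control, 0.5, 1.0, rng)
remyelinated = remyelinate(control, demyelinated, affected, 0.75)
```

      In this toy version, a "fully remyelinated" axon has more segments (and hence more nodes) and fewer lamellae per affected segment than the control, which is why it is not equivalent to the unperturbed axon.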

      We have also provided further details describing how myelin dystrophy was simulated in the network model in Results (lines 243 - 249) and in Methods (lines 722 - 747). How myelin alterations have been implemented in the network model is one of the questions of the reviewer (Question 5 in Reviewer #1: Recommendations for the Authors). We have addressed this question by describing in detail how we adjusted CV and AP failure rate to the values produced by the multicompartment neuron model. Please see our answer to Question 5 for the details.

      Second, it is unclear whether some of the conclusions are strong computational predictions or just a consequence of the model chosen. For example, the lack of effect of decreasing the conduction velocity on working memory performance could be due to the choice of considering a certain type of working memory model (continuous attractor), and therefore be absent under other valid assumptions (i.e. a silent working memory model, which has a higher dependence on temporal synaptic dynamics).

      Whether some conclusions are strong predictions or just a consequence of the model chosen is an important concern and indeed a general problem of computational modeling of working memory. For example, Stein et al. (Stein et al. Towards biologically constrained attractor models of schizophrenia, Curr. Opin. Neurobiol. 2021) showed that opposed manipulations of E/I ratio can produce the same behavioral pattern in different alternative, plausible biological network models. As long as we do not fully understand the neural mechanisms underlying working memory, modeling studies of how alterations (e.g. in E/I ratio or in the reliability and timing of axonal transmission, as we did here) affect circuit function need to be interpreted critically and tested against new experimental data.

      One way to strengthen model predictions is by showing that different computational models make similar predictions. To do this, we implemented an activity-silent working memory model for continuous features, as suggested by the reviewer, and we found qualitatively similar effects as in our bump attractor network model. Thus, our main conclusions do not critically depend on the exact working memory mechanism (active vs. activity-silent).

      In the revised manuscript, we have added two new supplementary figures (Supplementary Figure 8 and 9, see the next page) and a new paragraph in the Results section about activity silent working memory (starting at line 319):

      “Alternative working memory mechanisms. Working memory in our neural network is maintained in an attractor state with persistent neural activity (Compte et al., 2000; Hansel and Mato, 2013). Other mechanisms have been proposed, including that working memory maintenance may rely on activity-silent memory traces (Mongillo et al., 2008; Stokes, 2015; Barbosa et al., 2020). In activity-silent models, a slowly decaying transient of synaptic efficacy preserves information without the need for persistent ongoing activity. We implemented an activity-silent model, to our knowledge the first one for continuous spatial locations, and tested how working memory performance is affected by AP failures and propagation delays. We found that AP failures corresponding to demyelination caused working memory errors qualitatively similar to the delay-active network (Supplementary Figure 8). On the other hand, increasing propagation delays did not lead to additional working memory errors, unless we include unrealistically high values (uniform distribution in the range of 0 to 100 ms; Supplementary Figure 9). These results are qualitatively similar to the delay active network model. Thus, our main findings do not critically depend on the exact working memory mechanism (active vs. activity-silent).”
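
      To make the activity-silent mechanism concrete, here is a minimal sketch of the facilitation variable u and depression variable x at a single synapse, assuming the standard short-term plasticity equations of Mongillo et al. (2008). The parameter values (U, tau_f, tau_d), the spike trains, and the update order at a spike are illustrative assumptions, not values taken from the manuscript's network.

```python
import math


def stp_state(spike_times, t_end, U=0.2, tau_f=1.5, tau_d=0.2):
    """Return (u, x) at time t_end (seconds) after a presynaptic spike train.

    Between spikes, u relaxes to U with time constant tau_f and x relaxes
    to 1 with time constant tau_d. At each spike, u is incremented first and
    x is then depleted by u * x (one common convention).
    """
    u, x, t_prev = U, 1.0, 0.0
    for t in sorted(spike_times):
        dt = t - t_prev
        u = U + (u - U) * math.exp(-dt / tau_f)
        x = 1.0 + (x - 1.0) * math.exp(-dt / tau_d)
        u += U * (1.0 - u)   # facilitation jump
        x -= u * x           # resource depletion
        t_prev = t
    dt = t_end - t_prev
    u = U + (u - U) * math.exp(-dt / tau_f)
    x = 1.0 + (x - 1.0) * math.exp(-dt / tau_d)
    return u, x


# Cue-period burst: 10 spikes at 20 Hz; read the synapse 1 s after the last spike.
full_train = [0.05 * k for k in range(1, 11)]
# Crude stand-in for AP failures: every other spike never reaches the synapse.
thinned_train = full_train[1::2]

u_full, x_full = stp_state(full_train, t_end=1.5)
u_thin, x_thin = stp_state(thinned_train, t_end=1.5)
```

      Because tau_f is much longer than tau_d, x has recovered by the read-out time while u is still elevated, which is the silent memory trace. The thinned train leaves a weaker trace, in line with the statement that AP failures weaken synaptic enhancement.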

      Author response image 1.

      Action potential failures impair working memory performance in a network model with activity-silent memory traces. (A) Spiking and synaptic activity in an unperturbed, activity-silent working memory model. Top: Raster plot showing the activity for each excitatory neuron (labeled by its preferred direction) in a single trial with a cue stimulus presented at 180°. We modified our spiking neural network model such that it does not show elevated persistent firing throughout the delay period (see Figure 5B for comparison). In particular, we reduced the external background input to excitatory neurons by a factor of 3.61% and we increased the cue stimulus amplitude by 12.5%. Even though spiking activity decays to baseline (close to 0 Hz), a memory trace is imprinted in enhanced synaptic strength due to short-term synaptic facilitation (Mongillo et al., 2008). Selective spiking activity is recovered by a non-selective constant input applied during 300 ms to all excitatory neurons during the two reactivation periods (marked by yellow and green rectangles in the raster plot). The amplitude of the input was 11 mV during the first and 13 mV during the second reactivation period. Reactivation periods are marked in light gray shading in the remaining panels below and the cue period is indicated by dark gray shading. Firing rates (second row), synaptic facilitation variable u (third row), and synaptic depression variable x (bottom row) for the same trial, averaged for 500 neurons around the neuron with 180° as preferred direction (solid lines) and around the neuron with 0° as preferred direction (dashed lines). Note that reactivation recovers the activity bump (C) but also causes elevated firing and subsequent enhancement of synapses at all positions in the networks. (B) Activity in a network with demyelination of 50% of the myelinated segments by removing 60% of the myelin lamellae. AP failures lead to reduced firing rates in the cue and early delay periods and consequently to weaker synaptic enhancement. (C) Average spike counts of the excitatory neurons during the cue period (black lines), and the two reactivation periods indicated in the raster plots in A and B (yellow and green lines). Solid lines correspond to the control network and dashed lines to the perturbed network. (D) Memory strength as a function of time for the control and perturbed networks. (E-F) Trajectories of the bump center (i.e., remembered cue location) read out from the neural activity across the cue and delay periods using a population vector (see Methods). Cue position was 180° in all trials. The perturbed network (F) shows larger working memory errors towards the end of the delay period compared to the control network (E).
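
      The population-vector read-out mentioned in panels E-F can be sketched as follows. This is a generic textbook decoder, assumed here for illustration; the manuscript's Methods may differ in details such as time windows or normalization, and the bump shape and network size below are hypothetical.

```python
import cmath
import math


def population_vector(rates, preferred_deg):
    """Decode an angle (degrees, in [0, 360)) from firing rates.

    Each neuron votes for its preferred direction with a weight equal to
    its firing rate; the decoded angle is the direction of the summed vector.
    """
    resultant = sum(
        r * cmath.exp(1j * math.radians(theta))
        for r, theta in zip(rates, preferred_deg)
    )
    return math.degrees(cmath.phase(resultant)) % 360.0


n_neurons = 64
preferred = [i * 360.0 / n_neurons for i in range(n_neurons)]


def bump(center_deg, width=4.0):
    """Hypothetical bell-shaped activity profile centered on center_deg."""
    return [
        math.exp(width * (math.cos(math.radians(p - center_deg)) - 1.0))
        for p in preferred
    ]


decoded = population_vector(bump(180.0), preferred)
```

      Tracking this decoded angle over time in each trial gives a bump-center trajectory; its deviation from the cue position at the end of the delay is the working memory error.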

      Author response image 2.

      Effect of propagation delays on control and perturbed activity-silent network models. (A) Memory strength during the whole simulation time for the young, control networks relying on activity-silent working memory (Supplementary Figure 8) with zero propagation delays (blue line), and with propagation delays from a uniform distribution with a range between 0 and 40 ms (yellow line) and between 0 and 100 ms (orange line). (B) Memory strength for perturbed networks when demyelinating 25% of the myelinated segments by removing 50% of the myelin lamellae, without delays (red line), and with uniformly distributed delays between 0 and 40 ms (light gray line) and between 0 and 100 ms (black line). The cue period is indicated by dark gray shading and reactivation periods are marked in light gray. Memory strength was calculated by averaging across 280 trials for one network. Shaded areas indicate SEM for each case. For the young, control networks (A), working memory was not affected by including delays of up to 40 ms. Unrealistically long delays ranging up to 100 ms did cause an impairment (the longest delays found for the most extreme perturbation condition – demyelination of 75% of the segments by removing 100% of the myelin lamellae – were of 49.9 ms on average). When also incorporating AP failures to the networks (B), we observed a similar trend. For this perturbation condition, delays of up to 40 ms were already much larger than the delays quantified in the single neuron model (for the case of 25% of the segments demyelinated by removing 50% of the myelin lamellae, the average delay in the cohort was 3.75 ms).
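
      One simple way to picture how AP failures and propagation delays enter a spiking network is shown below. This is a hedged sketch, not the manuscript's implementation: `make_axon` is a hypothetical helper that assumes one fixed delay per axon drawn from a uniform distribution and an independent failure probability per spike. The values 0.35 and 40 ms are taken from the numbers quoted in this response purely as examples.

```python
import random


def make_axon(p_fail, max_delay_ms, rng):
    """Build a transmission function for one axon.

    The axon gets a fixed conduction delay drawn uniformly from
    [0, max_delay_ms]; each spike independently fails with probability p_fail.
    """
    delay = rng.uniform(0.0, max_delay_ms)

    def transmit(spike_times_ms):
        # Keep only spikes that survive propagation, shifted by the delay.
        return [t + delay for t in spike_times_ms if rng.random() >= p_fail]

    return transmit, delay


rng = random.Random(1)
spikes = [float(t) for t in range(10000)]  # one spike per ms, purely illustrative
transmit, delay = make_axon(p_fail=0.35, max_delay_ms=40.0, rng=rng)
arrivals = transmit(spikes)
surviving_fraction = len(arrivals) / len(spikes)
```

      In this picture, a delay only shifts when a spike arrives, whereas a failure removes the input altogether. That gives an intuition for why the network results are more sensitive to failure rates than to delays within the observed range.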

      With additional simulations to address these issues, I consider that the present study would become a convincing milestone in the computational modeling of myelin-related models, and an important study in the field of working memory.

      Again, we would like to thank the reviewer for the positive comments. We have addressed all the main issues raised (see below our response to the “recommendations for the authors”).

      Reviewer #2 (Public Review):

      This paper analyzes the effect of axon de-myelination and re-myelination on action potential speed and propagation failure. The findings are then incorporated in a standard spiking ring attractor model of working memory.

      I think the results are not very surprising or solid and there are issues with method and presentation.

      The authors did many simulations with random parameters, then averaged the result, and found for instance that the Conduction Velocity drops in demyelination. It gives the reader little insight into what is really going on. My personal preference is for a well understood simple model rather than a poorly understood complex model. The link between the model outcome of WM and data remains qualitative, and is further weakened by the existence of known other age-related effects in PFC circuits.

      We thank the reviewer for the critical assessment of our work. We share the view that understanding simple models can provide critical insights into brain function (and we believe that many of our papers related to attractor dynamics in working memory and decision making fall into this category, e.g. Wimmer et al. 2014, Esnaola-Acebes et al. 2022, Ibañez et al 2020). However, we respectfully disagree with the reviewer on an important point: the model complexity that we have chosen is appropriate and necessary to study the phenomenon at hand. Our modeling efforts are principled, with complexity added as necessary. We started with a biophysical single neuron model with firing dynamics fit to empirical data in pyramidal neurons of rhesus monkey dlPFC (Rumbell et al. 2016) – the same type of neurons and cortical region analyzed in the Peters et al. work on structural changes to myelin seen during aging (e.g., Figure 1). Because simple models do not accurately capture the CV along thin axons like those in the PFC, we attached a multicompartment axon with detailed myelinated segments, and constructed a cohort of feasible models. We then used this cohort to get quantitative estimates of the effects of variable degrees of demyelination and remyelination. This would not be possible with a simpler model. We then study the consequences of de- and re-myelination in a spiking neural network model. Again, we could not use a simpler model (e.g. a firing rate attractor model) without making gross assumptions about how demyelination affects circuit function. In sum, we believe that our models are relatively simple but comprehensive given the phenomenon that we are studying.

      The reviewer is correct in that there exist “known other age-related effects in PFC circuits”. These are reviewed in the introduction and we discuss future extensions of our model that would incorporate those effects as well. It is important to note that this is the first comprehensive study of demyelination effects in aging PFC, demonstrating that myelin changes alone predict working memory changes associated with aging.

      The specific issues about modeling choices and interpretation of the results are discussed below.

      Both for the de/re myelination the spatial patterns are fully random. Why is this justified?

      We agree that myelin dystrophy during aging could be non-random, that is, localized to certain regions of an axon. Our collaborators (Drs Jennifer Luebke, Maya Medalla, and Patrick Hof) are currently addressing this question using 3D electron microscopy and immunohistochemistry on axons of individual neurons and their associated myelin, but results are not available yet. Early on in this study we examined how the location of myelin alterations affected AP propagation. Focusing demyelination along a section of axon led to more AP slowing and failure than when spatially randomized. Likewise, remyelination of such spatially localized dystrophy led to greater recovery, as there were fewer transitions between long and short internodes (Supplemental Figure 4). Since otherwise the effects in the localized cases were largely similar to those in the spatially random case (see Author response image 3 below), for brevity in this paper we assumed myelin alterations were randomly distributed. Our next paper, extending this study to collateralized axons and which was presented as a poster at the 2023 Society for Neuroscience meeting, will include an examination of localized myelin dystrophy.

      Author response image 3.

      Effect of localized myelin alterations on CV change. Myelin alterations were either focused on the third of myelinated segments closest to the initial segment (‘proximally clustered’), the third of myelinated segments furthest from the initial segment (‘distally clustered’), or distributed according to a uniform distribution as in the current study. For demyelination, all lamellae were removed from 25% of myelinated segments (showing mean +/- SEM of all 50 cohort models, 30 randomized trials each). For remyelination, affected segments were replaced by two shorter segments with 75% of the original lamellae thickness and a node in between.

      We have added two sentences in Methods to justify this assumption more clearly (line 510): “Evidence suggests that aging affects oligodendrocytes in several ways, including the ability for oligodendrocyte precursor cells to mature (Dimovasili et al., 2022). Knowing that individual oligodendrocytes myelinate axons of many different neurons, but without data quantifying how oligodendrocyte dystrophy affects myelination in individual axons, we assumed that myelin alterations were randomly distributed.”

      We have also added a sentence in the Discussion alluding to our upcoming study (line 434): “Our model can also be extended to explore interactions between spatially localized myelin perturbations (such as those seen in multiple sclerosis) and axon collateralization (Sengupta et al., 2023), which would affect the distance-dependence of AP failures.”

      Similarly, the myelin model parameters were drawn from uniform distributions (Table 1, I guess). Again, why is this reasonable?

      The reviewer is correct that our initial Latin hypercube sample generated a uniform distribution. However, parameters of the random sample of models selected as biologically feasible were not uniformly distributed. We have added a new figure (Supplementary Figure 1A) to illustrate the parameter distributions, and have added two sentences in Methods (starting on line 596):

      “Of the 1600 simulated models, 138 met these criteria; for the present study, we randomly selected 50 models to comprise the young, control model cohort. Along most dimensions, the chosen cohort was approximately normally distributed (Supplementary Figure 1). The g-ratio (ratio of axon to fiber diameter) among models in the cohort was 0.71 ± 0.02, with total axon lengths of 1.2 ± 0.1 cm.”
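
      For readers unfamiliar with the sampling scheme, the sketch below shows a basic Latin hypercube sample followed by a feasibility filter and a random draw of a cohort. It is an assumption-laden illustration: the parameter names, their bounds, and the g-ratio window used as a filter are hypothetical stand-ins, since the actual feasibility criteria are defined in the manuscript's Methods. The g-ratio is computed as axon diameter over fiber diameter, taking fiber diameter as axon diameter plus twice the total myelin thickness.

```python
import random


def latin_hypercube(n_samples, bounds, rng):
    """Draw n_samples points so that each parameter's range is split into
    n_samples equal strata and every stratum is used exactly once."""
    columns = {}
    for name, (lo, hi) in bounds.items():
        strata = list(range(n_samples))
        rng.shuffle(strata)
        columns[name] = [
            lo + (k + rng.random()) / n_samples * (hi - lo) for k in strata
        ]
    return [{name: columns[name][i] for name in bounds} for i in range(n_samples)]


def g_ratio(axon_diameter, lamella_thickness, n_lamellae):
    """Ratio of axon diameter to total fiber diameter."""
    myelin_thickness = lamella_thickness * n_lamellae
    return axon_diameter / (axon_diameter + 2.0 * myelin_thickness)


rng = random.Random(0)
bounds = {  # hypothetical ranges, for illustration only
    "axon_diameter_um": (0.3, 1.0),
    "lamella_thickness_um": (0.01, 0.02),
    "n_lamellae": (5.0, 20.0),
}
samples = latin_hypercube(1600, bounds, rng)

# Hypothetical feasibility criterion standing in for the real ones.
feasible = [
    s for s in samples
    if 0.6 <= g_ratio(s["axon_diameter_um"],
                      s["lamella_thickness_um"],
                      round(s["n_lamellae"])) <= 0.8
]
cohort = rng.sample(feasible, min(50, len(feasible)))
```

      The initial sample is uniform along every axis by construction, but the filter step can reshape the distribution of the surviving models, consistent with the point that the selected cohort need not be uniformly distributed.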

      Author response image 4.

      Distribution of parameters and conduction velocities in the single neuron model cohort. (A) Histograms of axon morphology parameters of models selected for the single neuron cohort. Top: axon diameter; middle: length of unperturbed myelin segments; bottom: total myelin thickness in unperturbed segments, computed as the product of lamella thickness and number of lamellae. (B) Histograms of the CV for the 50 axons of the unperturbed model cohort (top), and representative demyelination and remyelination perturbations: mild demyelination (removing 25% of lamellae from 25% of the myelinated segments, second row); severe demyelination (removing all lamellae from 75% of the myelinated segments, third row); and complete (100%) remyelination (where the demyelinated segments from the third row were remyelinated by two shorter segments with 75% of lamellae, bottom row). CVs averaged over 30 trials in each case. (C) Changes in CV (measured in %) in response to demyelination and remyelination versus the magnitude of current clamp step (+180, +280, or +380 pA). Shown are mean +/- SEM for demyelinating 50% of myelinated segments (removing all lamellae), and subsequent remyelination of those segments by shorter segments with 75% of lamellae.

      The focus of most analysis is on the conduction velocity but in the end, this has no effect on WM, so the discussion of CV remains sterile.

      CV delays likely do affect brain functions that rely on neuronal oscillations and synchrony, as mentioned in the Discussion. As such, we feel that our single neuron model results on CV delays as well as AP failures are valuable for the scientific community. Yet, given the results of our network models here, the reviewer has a valid point. We have clarified in the introduction that AP failures but not CV delays affected the network output (line 115):

      “Higher degrees of demyelination led to slower propagation and eventual failure of APs along the axons of the multicompartment models. In the network models, an increase in AP failure rate resulted in progressive working memory impairment, whereas slower conduction velocities, in the range observed in the multicompartment models, had a negligible effect.”

      We have also revised the single neuron section of the Results throughout, to better highlight the effects of myelin dystrophy on AP failures. Revisions to address this in the demyelination section start on line 148:

      “AP propagation was progressively impaired as demyelination increased (Figure 3): CV became slower, eventually leading to AP failure. Removing 25% of lamellae had a negligible effect on CV, regardless of how many segments were affected. However, when all lamellae were removed, CV slowed drastically – by 38 ± 10% even when just 25% of the segments were demyelinated in this way, and 35 ± 13% of APs failed. When 75% of segments lost all their lamellae, CV slowed by 72 ± 8% and 45 ± 13% of APs failed.”

      Similarly, we have added several sentences about AP failures that remain after remyelination of the single neuron model (starting on line 190):

      “Results for the percentage of AP failures (Figure 4C,F) were consistent with those for CV recovery. Remyelinating all previously demyelinated segments, even adding just 10% of lamellae, brought AP failure rates down to 14.6 ± 5.1%. Remyelinating all affected segments with 75% of lamellae (the maximal amount of remyelination) nearly eliminated AP failures (1.8 ± 1.1%). Incomplete remyelination, where some segments were still demyelinated, still had relatively high AP failure rates. For example, when one eighth of segments were remyelinated with the maximal amount of lamellae and one eighth were left bare, 25.7 ± 11.5% of APs failed across the cohort (Figure 4C, red dashed line and arrow). AP failure rates were slightly lower when starting with partial demyelination: 10.6 ± 7.6% of APs failed in the analogous paradigm (Figure 4F, red dashed line and arrow). In short: combinations of demyelinated and remyelinated segments often led to sizable CV delays and AP failures.”

      The more important effect of de/re myelination is on failure. However, the failure is, AFAIK, just characterized by a constant current injection of 380pA. From Fig 2 it seems however that the first spike is particularly susceptible to failure. In other words, it has not been justified that it is fine to use the failure rates from this artificial protocol in the I&F model. I would expect the temporal current trace to affect whether the propagation fails or not.

      In general, we did not find the first spike to be more susceptible to failure than later spikes; the trace in Figure 2 is a representative snapshot intended to illustrate CV slowdown, AP failure, and recovery. Regarding the constant current injection: while the reviewer is correct that neurons do not receive such inputs in vivo, the applied current injections were designed to match in vitro current clamp protocols for these rhesus monkey neurons. While our future studies will include responses to more realistic synaptic inputs, we focused on somatic current injections here. We have added a new panel (C) to Supplementary Figure 1 (see previous response above) showing that the current step magnitude had little effect on the CV change after myelin perturbations; there was little effect on AP failure rates too. We now also state this finding more explicitly in Methods (starting on line 561):

      “As done during in vitro electrophysiological experiments (Chang et al., 2005; Ibanez et al., 2020) and past modeling studies (Coskren et al., 2015; Rumbell et al., 2016), we first applied a holding current to stabilize the somatic membrane potential at -70 mV, then injected a current step into the somatic compartment for 2 seconds. …The CV changes in response to myelin alterations were relatively insensitive to variations in the magnitude of suprathreshold somatic current steps (Supplementary Figure 1C), and whether the current was constant or included Gaussian noise. Therefore, here we quantified CV changes and AP failures from responses to constant +380 pA current steps only.”

      I don't know if there are many axon collaterals in the WM circuits and/or distance dependence in the connectivity, but if so, then the current implementation of failure would be questionable.

      We agree that axon collaterals may affect our results; our unpublished morphological analyses of individual neuron axons indicate that there is a high degree of local axon collateralization in Layer 3 pyramidal neurons in LPFC. In this first study from our group on myelin perturbations, we chose to focus here on unbranched axons. There was some distance dependence of AP failure along the length of the axon. For example, in our most extreme demyelination case (75% of segments losing all their lamellae), about 14% of the axons showed more AP failure at their distal ends relative to the middle (mean difference 6.33%). We are examining this distance dependence more broadly in our next study, now cited in the Discussion (line 434): “Our model can also be extended to explore interactions between spatially localized myelin perturbations (such as those seen in multiple sclerosis) and axon collateralization (Sengupta et al., 2023), which would affect the distance-dependence of AP failures.”

      I would also advise against thresholding at 75% failure in Fig3C. Why don't the authors simply plot the failure rate?

      We thank the reviewer for this suggestion, and have made this change. As suggested by the reviewer, we now show the AP failure rate in Figure 3 and Figure 4. The trends shown are nearly identical to those from the high failure trials.

      Regarding the presentation, there are a number of dead-end results that are not used further on. The paper is rather extensive, and it would be clearer if written up in half the space. In addition, much information is really supplementary. The issue of the CV I already mentioned, also the Lasso regression for instance remains unused.

      We understand the reviewer’s perspective, and we do value brevity when possible. During the revision process we examined the paper carefully, and made things more concise when it was feasible. As mentioned above, reporting CV results is important, though our revision places more emphasis on the results for AP failures. We combined the two Supplementary Figures about remyelination in the single neuron model into one (Supplementary Figure 3). We also moved the Lasso figure and associated methods to the Supplementary Material (Supplementary Figure 2), and have separated the Lasso results for demyelination and remyelination into their respective paragraphs (lines 154-160 and lines 200-204 respectively). While we do not use the Lasso results explicitly later in Results, we cite them in the Discussion when comparing our findings to previous work (starting on line 417):

      “Since our single neuron cohort sampled a wide range of parameter space, we used Lasso regression to identify which of the complex, interacting parameters contributed most to CV delays (which preceded AP failures). Parameters including axon diameter, node length, length of myelinated segments, and nodal ion channel densities predicted how our models responded to demyelination and remyelination; these findings are consistent with past modeling studies over more limited parameter ranges (e.g., Goldman and Albus, 1968; Moore et al., 1978; Babbs and Shi, 2013; Young et al., 2013; Schmidt and Knösche, 2019).”
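
      As background on the technique named in this passage, the following is a minimal pure-Python sketch of Lasso regression by coordinate descent, minimizing (1/2n)·||y − Xw||² + alpha·||w||₁. It is not the authors' analysis pipeline; the toy design matrix, the response, and the value of alpha are hypothetical and chosen only to show how the L1 penalty sets the weight of an uninformative predictor to exactly zero.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Fit Lasso weights (no intercept) by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # correlation of feature j with the partial residual
            norm = 0.0  # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                norm += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, alpha) / (norm / n) if norm > 0 else 0.0
    return w


# Toy data: the response depends on the first predictor only.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.0 * row[0] for row in X]
weights = lasso_coordinate_descent(X, y, alpha=0.5)
```

      Predictors whose weights survive the penalty are the ones the regression flags as most informative, which is the sense in which Lasso is used to rank parameters that contribute to CV delays.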

      We hope that our revision has struck an appropriate balance between clear and concise writing, and addressing concerns from both reviewers. We greatly value the time you have given to help us to improve our manuscript.

      Response to Recommendations for the Authors:

      Reviewer #1 (Recommendations for the Authors):

      As I mentioned above, I consider that this study is well designed and it offers very interesting results. I have detailed below some of the issues that should be addressed to improve its potential impact in the field:

      (1) Across the manuscript, it is not entirely clear how the results of the multicompartmental model compare to existing modeling results on demyelination and CV changes (such as in the papers cited by the authors). Is this section confirming previous results with a new (more accurate) computational model, or are there any new insights previously unreported? A new paragraph in the Discussion putting these results in context would be very useful for the reader.

      We thank the reviewer for this suggestion. We have added two new subheadings to organize the Discussion better, and have expanded the single neuron section to three paragraphs. We feel this now clarifies how our model fits in with previous work while stating its novelty more explicitly. Starting on line 391:

      “Myelin changes affect AP propagation in a cohort of model neurons

      The novelty of our neuron model lies in its systematic exploration of a combination of different myelin perturbation types known to occur in myelin dystrophies, across a wide range of biologically feasible models. Our single neuron model assumed that age-related myelin dystrophies (e.g., Figure 1) alter the insulative properties of lamellae analogously to demyelination, and examined interactions between demyelination and remyelination. Past studies of myelin dystrophy examined how either demyelination or remyelination of all segments affected AP propagation for a few representative axon morphologies. For example, Scurfield and Latimer (2018) explored how remyelination affected CV delays, finding that axons with more transitions between long and short myelinated segments had slower CV (Supplementary Figure 4), and were the first to explore how remyelination interacts with tight junctions. However, their study did not couple remyelination and demyelination together or examine AP failures. Other basic findings from our single neuron cohort are consistent with past modeling studies, including that demyelination caused CV slowing and eventual AP failures (Stephanova et al., 2005; Stephanova and Daskalova, 2008; Naud and Longtin, 2019), and, separately, that remyelination with shorter and thinner myelinated segments led to CV slowing (Lasiene et al., 2008; Powers et al., 2012; Scurfield and Latimer, 2018). However, by assuming that some previously demyelinated segments were remyelinated while others were not, we found that models could have much higher AP failure rates than previously reported. Such a scenario, in which individual axons have some segments that are normal, some demyelinated, and some remyelinated, is likely to occur. We also found a few neurons in our cohort showing a CV increase after remyelination, which has not generally been reported before and is likely due to an interplay between ion channels in the new nodes and altered electrotonic lengths in the perturbed myelinated segments (e.g., Waxman, 1978; Naud and Longtin, 2019).

      Since our single neuron cohort sampled a wide range of parameter space, we used Lasso regression to identify which of the complex, interacting parameters contributed most to CV delays (which preceded AP failures). Parameters including axon diameter, node length, length of myelinated segments, and nodal ion channel densities predicted how our models responded to demyelination and remyelination; these findings are consistent with past modeling studies over more limited parameter ranges (e.g., Goldman and Albus, 1968; Moore et al., 1978; Babbs and Shi, 2013; Young et al., 2013; Schmidt and Knösche, 2019). Better empirical measurements of these parameters in monkey dlPFC, for example from 3-dimensional electron microscopy studies or single neuron axon studies combined with markers for myelin, would help predict the extent to which myelin dystrophy and remyelination along individual axons with aging affect AP propagation.

      Another important feature of our multicompartment model is that it was constrained by morphologic and physiological data in rhesus monkey dlPFC —an extremely valuable dataset from an animal model with many similarities to humans (Upright and Baxter, 2021; Tarantal et al., 2022). While beyond the scope of the current study, this computational infrastructure –with a detailed axon, initial segment, soma, and apical and basal dendrites– enables simultaneous investigations of signal propagation through the dendritic arbor and axon. Our model can also be extended to explore interactions between spatially localized myelin perturbations (such as those seen in multiple sclerosis) and axon collateralization (Sengupta et al., 2023), which would affect the distance-dependence of AP failures. Integrating such results from single neuron models into network models of working memory, as we have done here, is a powerful way to connect empirical data across multiple scales.”

      (2) Although the authors provide a well-designed study for the multi-compartmental model, it would be useful to add more details about how an unperturbed model and a completely remyelinated model differ in practice, perhaps right before the first results on the single cell model are presented. Are the new myelin sheaths covering the same % of axon as in the original case? Are there the same number of nodes? It is hard to distinguish which of these results are due to a compensation by the new myelin sheaths and which ones are just the model coming back to its original (and mathematically equivalent) starting point.

      A ‘fully remyelinated’ axon is not mathematically equivalent to the unperturbed axon. Newly remyelinated segments had at most 75% of the original number of myelin wraps, with a new node in between, consistent with empirical observations in rhesus monkey dlPFC. Our manuscript changes in response to this recommendation are described in detail above in our response to the public review of the same reviewer.

      (3) The authors observe a directed component in the bias that is known to be caused by heterogeneities in network connectivity, as stated in the text. It occurs to me that similar effects could also be caused by a heterogeneous demyelination in parts of the network. Inducing these biases could be another potential effect of demyelination in practice, and could be easily revealed by the authors' current model (and displayed in a supplementary figure).

      As suggested by the reviewer, we have tested heterogeneous demyelination in parts of the network and the results confirm the reviewer’s intuition. We have included these new results as new Supplementary Figure 7 (see below) and we have added the following sentences in the Legend of Figure 5, line 1265: “When demyelination is restricted to a part of the network, diffusion only increases in the perturbed zone (Supplementary Figure 7).” and in the Discussion (line 457): “In addition to age-related changes in memory duration and precision, our network model predicts an age-related increase in systematic errors (bias) due to an increased drift of the activity bump (Supplementary Figure 11). Moreover, if demyelination is spatially localized in a part of the network, the model predicts a repulsive bias away from the memories encoded in the affected zone (Supplementary Figure 7).”

      Author response image 5.

      Effect of spatially heterogeneous demyelination of the model neurons according to their preferred angle. We also tested working memory performance in the network when demyelination affects only parts of the network. The figure shows the decoded bump center position during the cue and delay period for the eight possible cue directions when a fraction of neurons was perturbed and the rest of the neurons in the circuit were unaltered (Figure 5B). We perturbed 10% of the neurons around the neuron with preferred direction 90° (left panel), 25% of the neurons around -90° (middle panel), and 50% of the neurons around 180° (right panel). Bump traces for cues that lie inside the perturbed portion of the circuit are shown in blue. Network perturbation in the three cases consisted in demyelinating 25% of the segments along the axons of model neurons, by removing 70% of the myelin lamellae. In each case, 280 trials were simulated for one network. These simulations show an increased drift and diffusion inside the perturbed zone, consistent with the increased drift and diffusion when perturbing the entire network (Figure 6B and Supplementary Figure 11). In particular, spatially heterogeneous demyelination in our network leads to a bias away from the affected zone and to increased trial-to-trial variability. Note that this is a model prediction, but we are not aware of empirical data showing heterogeneous demyelination with aging. Further, note that while our network model has a topological ring structure, neurons in PFC are not anatomically arranged depending on their preferred features. Thus, spatially heterogeneous demyelination would likely affect neurons with different feature preferences (i.e., neurons throughout our ring model).
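
      The selection of a perturbed zone described in this legend can be sketched as follows. This is only an illustrative sketch: the function names, the evenly spaced preferred directions, and the half-open window around the center are assumptions made for the example and are not taken from the manuscript's code.

```python
def signed_angle_diff_deg(theta, center):
    """Signed circular difference theta - center, mapped to [-180, 180)."""
    return (theta - center + 180.0) % 360.0 - 180.0


def perturbed_indices(preferred_deg, center_deg, fraction):
    """Indices of neurons whose preferred direction lies in a window
    covering `fraction` of the ring, centered on `center_deg`.

    The window is half-open ([-half, +half)) so that, for evenly spaced
    preferred directions, exactly fraction * N neurons are selected
    (an assumption of this sketch).
    """
    half_width = 180.0 * fraction
    return [
        i
        for i, theta in enumerate(preferred_deg)
        if -half_width <= signed_angle_diff_deg(theta, center_deg) < half_width
    ]


# Hypothetical ring of 360 neurons with preferred directions -180..179 degrees.
n_neurons = 360
preferred = [i * 360.0 / n_neurons - 180.0 for i in range(n_neurons)]

zone_10 = perturbed_indices(preferred, 90.0, 0.10)   # 10% around 90 degrees
zone_25 = perturbed_indices(preferred, -90.0, 0.25)  # 25% around -90 degrees
zone_50 = perturbed_indices(preferred, 180.0, 0.50)  # 50% around 180 degrees
```

      Only the neurons returned for a given zone would then receive the perturbed AP failure rates, while the rest of the ring keeps the control values.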

      (4) The bump attractor model of WM relies on a continuous attractor dynamics to encode the information stored in memory --a fixed point dynamics that can only vary via the slow noise-driven drift. This means, as the authors mention, that changes in CV won't affect the performance of WM in their model. This seems to be a limitation of the model, or at least an effect which is highly dependent on the modeler's choice, rather than an accurate prediction. While testing the effects of oscillations (as the authors argue in the Discussion) might be out of the scope of this work, there are other WM models which are more sensitive to temporal differences in activity. The authors should test whether the same (lack of) effects are also found in other WM models. A silent WM model seems to be the ideal candidate for this, as the authors already have the key dynamics of that model incorporated in their computational framework (namely, short-term synaptic facilitation in excitatory synapses).

      We fully agree that considering the effects of demyelination in networks with alternative mechanisms would strengthen our manuscript. As suggested by the reviewer, we have simulated demyelination effects (AP failures and changes in CV) in an activity silent working memory model. The results are described in detail above in our response to the public review of the same reviewer.

      We also would like to mention that we have now also tested larger conduction delays in the bump attractor model, revealing additional working memory errors. This is shown in the revised version of Supplementary Figure 6 (see below). However, those delays are unrealistically large and thus the main effect in both the bump attractor and the activity-silent model is due to AP failures.

      Author response image 6.

      Effect of propagation delays on control and perturbed networks. (A) Memory strength (left panels) and diffusion (right panels) for the young, control networks with zero propagation delays (blue solid line), as in Figure 5, and with propagation delays from a uniform distribution with a range between 0 and 100 ms (yellow dashed line). (B) Memory strength and diffusion for perturbed networks when demyelinating 50% of the segments along the axons of model neurons, by removing 60% of the myelin lamellae without delays (red solid line), and with delays from a uniform distribution with a range between 0 and 40 ms (gray dashed line) and between 0 and 85 ms (black dash-dotted line). The measures of working memory performance were calculated by averaging across 20 networks and 280 trials for each network. Shaded areas indicate SEM for each case. For the young, control networks, there was no difference with and without propagation delays, even though the delays used in the network simulations were much larger than the delays quantified in the single neuron model (the longest delays found for the most extreme perturbation condition –demyelination of 75% of the segments by removing 100% of the myelin lamellae– were of 49.9 ms on average; A). Working memory performance was also unaffected in the perturbed network with AP failures for delays ranging between 0 and 40 ms, also larger than the ones quantified in the single neuron model (for the case of 50% of the segments demyelinated by removing 60% of the myelin lamellae, the average delay in the cohort was 4.6 ms and the maximum delay was 15.7 ms; B). However, including extremely long delays of up to 85 ms did further impair memory compared to the impairment level introduced by AP failures alone (B).

      (5) Impact of demyelination and remyelination on working memory: Could the authors explain here how these biologically detailed alterations are implemented in the bump attractor model? Is the CV and AP failure rate adjusted to the values produced by the multicompartment neuron model with these myelin alterations?

      Yes, the reviewer is right, the CV and AP failure rate have been adjusted to the values produced by the multicompartment neuron model. To clarify this in the manuscript, we have restated the text as follows:

      Lines 243 - 249 (Results):

      To investigate how myelin alterations affect working memory maintenance, we explored in the network model the same demyelination and remyelination conditions as we did in the single neuron model. Because our network model consists of point neurons (i.e., without detailed axons), we incorporated CV slowing as an effective increase in synaptic transmission delays (see Methods). To simulate AP failures, we adjusted the AP failure rate to the values given by the single neuron model, by creating a probabilistic model of spike transmission from the excitatory presynaptic neurons to both the excitatory and inhibitory postsynaptic neurons (see Methods).

      Lines 722 - 747 (Methods):

      Modeling action potential propagation failures in the network. The network model is composed of point neurons without an explicit model of the axon. To effectively model the action potential failures at the distal end of the axons quantified with the single neuron model under the different demyelination and remyelination conditions, the AP failure rate was adjusted to the values produced by the single neuron model. To do this, we perturbed the 10 control networks by designing a probabilistic model of spike transmission from the excitatory presynaptic neurons to both the excitatory and inhibitory postsynaptic neurons. From the single neuron model, for each demyelination/remyelination condition, we quantified the probability of AP failure for each of the neurons in the control cohort, as well as the percentage of those neurons that shared the same probabilities of failure. That is, the percentage of neurons that had probability of failure = 0, probability of failure = 1 or any other probability. Then, we computed the probability of transmission, $p_{\mathrm{transmission}} = 1 - p_{\mathrm{failure}}$, and we specified $p_{\mathrm{transmission}}$ for the corresponding percentages of excitatory neurons in the networks. Thus, in the network model, we took into account the heterogeneity observed in the single neuron model under each demyelination/remyelination condition.
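
      A minimal sketch of such a probabilistic transmission scheme is given below. It assumes that each excitatory neuron is assigned one transmission probability (one minus its failure probability) and that every presynaptic spike is then kept or dropped independently. The function names and the example distribution are hypothetical and are not values from the manuscript.

```python
import random


def assign_transmission_probabilities(failure_distribution, n_neurons, rng):
    """Assign one transmission probability to each excitatory neuron.

    failure_distribution: list of (p_failure, fraction_of_neurons) pairs
    whose fractions sum to 1 (hypothetical input format).
    """
    probs = []
    for p_failure, fraction in failure_distribution:
        count = round(fraction * n_neurons)
        probs.extend([1.0 - p_failure] * count)
    # Rounding can leave the list slightly short or long: pad or trim.
    pad_value = 1.0 - failure_distribution[-1][0]
    while len(probs) < n_neurons:
        probs.append(pad_value)
    probs = probs[:n_neurons]
    # Shuffle so the probabilities are not ordered along the network.
    rng.shuffle(probs)
    return probs


def transmit(spike_times, p_transmission, rng):
    """Keep each presynaptic spike independently with p_transmission."""
    return [t for t in spike_times if rng.random() < p_transmission]


rng = random.Random(0)
# Hypothetical cohort: 50% never fail, 25% always fail, 25% fail 1 in 4 times.
distribution = [(0.0, 0.50), (1.0, 0.25), (0.25, 0.25)]
probabilities = assign_transmission_probabilities(distribution, 100, rng)
```

      Shuffling the assigned probabilities reflects the assumption that AP failure rates are not spatially organized in the network.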

      Modeling conduction velocity slowing in the network. To explore the effect of CV slowing along the axons of model neurons, we simulated 20 young, control networks and 20 perturbed networks with AP failure rates adjusted for the case of single model neurons with 50% of the segments demyelinated along the axons by removing 60% of the myelin lamellae (we ran 280 trials for each network). Then, we added random delays uniformly distributed with a minimum value of 0 ms in both cases, a maximum value of 100 ms in the control networks, and maximum values of 40 ms and 85 ms in the perturbed networks, in both the AMPA and NMDA excitatory connections to both E and I neurons (Supplementary Figure 6). These large values were chosen because we wanted to illustrate the potential effect of CV slowing in our network and smaller, more realistic, values did not have any effect.
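
      The delay sampling can be illustrated with the short sketch below. It assumes one delay per excitatory connection, drawn once from a uniform distribution and then added to every spike travelling along that connection. This per-connection granularity and the function names are assumptions of the example; only the delay ranges (0 to 100 ms, 0 to 40 ms and 0 to 85 ms) come from the text above.

```python
import random


def sample_delays(n_connections, max_delay_ms, rng):
    """Draw one transmission delay (ms) per connection from U(0, max_delay_ms)."""
    return [rng.uniform(0.0, max_delay_ms) for _ in range(n_connections)]


def delayed_arrival_times(spike_times_ms, delay_ms):
    """Shift presynaptic spike times by the delay of the connection."""
    return [t + delay_ms for t in spike_times_ms]


rng = random.Random(1)
control_delays = sample_delays(1000, 100.0, rng)      # control networks
perturbed_delays_40 = sample_delays(1000, 40.0, rng)  # perturbed networks
perturbed_delays_85 = sample_delays(1000, 85.0, rng)  # perturbed networks
arrivals = delayed_arrival_times([0.0, 10.0], 5.0)
```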

      (6) "We also sought to reveal the effect on working memory performance of more biologically realistic network models with AP transmission probabilities matched to both axons with intact and with altered myelin sheaths, as likely occurs in the aging brain (Figure 1). Thus, we ran network model simulations combining AP failure probabilities corresponding to groups of neurons containing intact axons and axons presenting different degrees of demyelination." I fail to see the difference with respect to the results in previous sections. Is it that now we have subnetworks in which axons are intact and subnetworks with significant AP failures, while before there was no topological separation between both cases? Please clarify.

      In Figures 5 and 6 the AP failure rate of the neural population in the network simulations was matched to the AP failure rate of the cohort of single model neurons for each demyelination/remyelination condition. Since not all model neurons have equal features, a given condition produces different levels of impairment in each neuron. Thus, we quantified the probability of AP failure for each neuron in the control cohort, as well as the percentage of those neurons that shared the same probabilities of failure. Then, we computed the probability of AP transmission for the corresponding percentages of excitatory neurons in the networks. Thus, in the network model, we took into account the heterogeneity observed in the single neuron model under each demyelination/remyelination condition.

      However, in Figures 7 and 8, we consider additional heterogeneity due to a different degree of demyelination/remyelination of different neurons. Here, excitatory neurons in the network model are not perturbed according to a single demyelination/remyelination condition. Instead, we allowed different percentages of excitatory neurons to have AP failure rates corresponding to different demyelination/remyelination conditions: some were unperturbed, while others had different degrees of demyelination (Figure 7) and different degrees of remyelination (Figure 8). We have modified the text for clarification in several places.

      First, when we describe the impact of demyelination on working memory, we already mention that (line 271): “In each of the 10 networks, we set the AP failure rate of the excitatory neurons according to the distribution of failure probabilities of the neurons in the single neuron cohort for the given demyelination or remyelination condition. Thus, we took into account the heterogeneity of demyelination and remyelination effects from our single neuron cohort (Figure 3A; Supplementary Figure 3). Note that this heterogeneity originates from differences in axon properties, but probabilities of failure for all neurons in the network correspond to the same degree of demyelination (Figure 6). We will also consider networks that contain different combinations of axons with either intact or perturbed myelin (Figure 7 and Figure 8).”

      Second, we have combined the text describing Figures 7 and 8 under a single section title, which reads “Simulated heterogenous myelin alterations match empirical data” (line 334) and start this section with (line 337): “Up to this point we have studied network models with AP failure probabilities corresponding to a single degree of myelin alterations (i.e., with all excitatory neurons in the network having AP failure rates matched to those of the single neuron cohort for one particular demyelination or remyelination condition). Next, we sought to reveal the effect on working memory performance of more biologically realistic network models, where excitatory neurons in the networks were perturbed according to a combination of different demyelination or remyelination conditions. That is, we simulated networks with excitatory neurons having AP failure probabilities matched to both neuronal axons with intact and with altered myelin sheaths in different degrees, as likely occurs in the aging brain (Figure 1).”

      (7) "Unexpectedly, our model indicates that compared to the performance of networks composed of neurons possessing axons with intact myelin sheaths, both demyelination and remyelination leads to an impaired performance." This conclusion is quite interesting, but I lack intuition from the paper as of why it is happening. In fact, the authors say in the Discussion that "complete remyelination of all the previously demyelinated segments with sufficient myelin, with fewer transitions between long and short segments, recovered working memory function." Would we then see a minimum and then an increase in memory duration in Figure 9B if we extended the X-axis until we hit 100% of new myelin sheaths?

      This is a very important question that we have carefully addressed in Results and Discussion. We distinguish between two remyelination cases in the models. Complete remyelination: when all (100%) the previously demyelinated segments have been subsequently remyelinated, and incomplete remyelination: when less than 100% (25%, 50% or 75%) of the demyelinated segments have been remyelinated. Figure 6 (middle and right columns) shows the two cases (black lines for any percentage of lamellae added vs. colored lines): for 100% of the segments remyelinated, the network performance is nearly or completely (when enough lamellae are added) recovered to the young network performance. In fact, with the single neuron model we observe that (lines 192 - 193 in Results): “Remyelinating all affected segments with 75% of lamellae (the maximal amount of remyelination) nearly eliminated AP failures (1.8 ± 1.1%)”. However, incomplete remyelination recovers the performance compared to demyelination (middle and right columns in Figure 6 vs left column), but this performance is worse than the performance of the young networks. The single neuron model shows that (lines 194 - 197 in Results): “Incomplete remyelination, where some segments were still demyelinated, still had relatively high AP failure rates. For example, when one eighth of segments were remyelinated with the maximal amount of lamellae and one eighth were left bare, 25.7 ± 11.5% of APs failed across the cohort (Figure 4C, red dashed line and arrow).”

      In Figure 9B (now Figure 8B), we combine intact axons with axons that are only partially remyelinated (i.e., incomplete remyelination). Extending the X-axis in Figure 8B until 100% of new myelin sheaths would not imply a minimum and a subsequent increase, but a continuous impairment: the more axons we perturb (remyelinate) the higher is the impairment compared to the young cases where all the axons are intact.

      The sentence "Unexpectedly, our model indicates that compared to the performance of networks composed of neurons possessing axons with intact myelin sheaths, both demyelination and remyelination leads to an impaired performance." now reads as (lines 379 - 380 in Results): “Therefore, both demyelination and incomplete remyelination lead to impaired performance in our networks, compared to networks with intact myelin sheaths”. We have also rewritten the corresponding section in Discussion (lines 486 - 489) as follows: “Therefore, it is reasonable to assume that ineffective remyelination may lead to working memory impairment. In fact, complete remyelination of all previously demyelinated segments with sufficient myelin, with fewer transitions between long and short segments, led to full recovery of working memory function.”

      (8) [minor] "Our recent network model found that age-related changes in firing rates and synapse numbers in individual neurons can lead to working memory impairment (Ibañez et al., 2020), but did not consider myelin dystrophy." Could you be more precise about which age-related changes were studied in Ibanez et al. 2020? From the paper it seems like it was mostly cellular excitability and synaptic density, so this should be added here for more context.

      To clarify this, we have added the following sentences in the Introduction (line 105):

      “Our recent network model revealed that the empirically observed age-related increase in AP firing rates in prefrontal pyramidal neurons (modeled through an increased slope of the f-I curve) and loss of up to 30% of both excitatory and inhibitory synapses (modeled as a decrease in connectivity strength) can lead to working memory impairment (Ibañez et al., 2020), but this model did not incorporate the known changes to myelin structure that occur during normal aging.”

      (9) [minor] "Recurrent excitatory synapses are facilitating, which promotes robust and reliable persistent activity despite spatial heterogeneities in the connectivity or in the intrinsic properties of the neurons." It would be great to add a reference here to justify the inclusion of this type of plasticity in the excitatory circuit (for example Wang, Markram et al. Nat Neuro 2006).

      We have added the references suggested by the reviewer and a further one in the Results (line 216):

      “Recurrent excitatory synapses are facilitating, as has been empirically observed in PFC (Hempel et al., 2000; Wang et al., 2006), which promotes robust and reliable persistent activity despite spatial heterogeneities in the connectivity or in the intrinsic properties of the neurons.”

      References:

      Hempel, C. M., Hartman, K. H., Wang, X. J., Turrigiano, G. G., and Nelson, S. B. (2000). Multiple forms of short-term plasticity at excitatory synapses in rat medial prefrontal cortex. J. Neurophysiol. 83, 3031–3041. doi: 10.1152/jn.2000.83.5.3031

      Wang, Y., Markram, H., Goodman, P. H., Berger, T. K., Ma, J., and Goldman-Rakic, P. S. (2006). Heterogeneity in the pyramidal network of the medial prefrontal cortex. Nat. Neurosci. 9, 534–542. doi: 10.1038/nn1670

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Neuronal activity spatiotemporal fine-tuning of cerebral blood flow balances metabolic demands of changing neuronal activity with blood supply. Several 'feed-forward' mechanisms have been described that contribute to activity-dependent vasodilation as well as vasoconstriction leading to a reduction in perfusion. Involved messengers are ionic (K+), gaseous (NO), peptides (e.g., NPY, VIP), and other messengers (PGE2, GABA, glutamate, norepinephrine) that target endothelial cells, smooth muscle cells, or pericytes. Contributions of the respective signaling pathways likely vary across brain regions or even within specific brain regions (e.g., across the cortex) and are likely influenced by the brain's physiological state (resting, active, sleeping) or pathological departures from normal physiology.

      The manuscript "Elevated pyramidal cell firing orchestrates arteriolar vasoconstriction through COX-2-derived prostaglandin E2 signaling" by B. Le Gac, et al. investigates mechanisms leading to activity-dependent arteriole constriction. Here, mainly working in brain slices from mice expressing channelrhodopsin 2 (ChR2) in all excitatory neurons (Emx1-Cre; Ai32 mice), the authors show that strong optogenetic stimulation of cortical pyramidal neurons leads to constriction that is mediated through the cyclooxygenase-2 / prostaglandin E2 / EP1 and EP3 receptor pathway with contribution of NPY-releasing interneurons and astrocytes releasing 20-HETE. Specifically, using a patch clamp, the authors show that 10-s optogenetic stimulation at 10 and 20 Hz leads to vasoconstriction (Figure 1), in line with a stimulation frequency-dependent increase in somatic calcium (Figure 2). The vascular effects were abolished in the presence of TTX and significantly reduced in the presence of glutamate receptor antagonists (Figure 3). The authors further show with RT-PCR on RNA isolated from patched cells that ~50% of analyzed cells express COX-1 or -2 and other enzymes required to produce PGE2 or PGF2a (Figure 4). Further, blockade of COX-1 and -2 (indomethacin), or COX-2 (NS-398) abolishes constriction. In animals with chronic cranial windows that were anesthetized with ketamine and medetomidine, 10-s long optogenetic stimulation at 10 Hz leads to considerable constriction, which is reduced in the presence of indomethacin. Blockade of EP1 and EP3 receptors leads to a significant reduction of the constriction in slices (Figure 5). Finally, the authors show that blockade of 20-HETE synthesis caused moderate and NPY Y1 receptor blockade a complete reduction of constriction.

      The mechanistic analysis of neurovascular coupling mechanisms as exemplified here will guide further in-vivo studies and has important implications for human neuroimaging in health and disease. Most of the data in this manuscript uses brain slices as an experimental model which contrasts with neurovascular imaging studies performed in awake (head-fixed) animals. However, the slice preparation allows for patch clamp as well as easy drug application and removal. Further, the authors discuss their results in view of differences between brain slices and in vivo observations, including the absence of vascular tone as well as blood perfusion required for metabolite (e.g., PGE2) removal, and the presence of network effects in the intact brain. The manuscript and figures present the data clearly; regarding the presented mechanism, the data supports the authors' conclusions.

      We thank the reviewer for his/her supportive comments as well as for pointing out pros and cons of the brain slice preparation.

      Some of the data was generated in vivo in head-fixed animals under anesthesia; in this regard, the authors should revise the introduction and discussion to include the important distinction between studies performed in slices, in acute or chronic in-vivo preparations under anesthesia (reduced network activity and reduced or blocked neuromodulation), or in awake animals (virtually undisturbed network and neuromodulatory activity).

      We have now added a paragraph in the introduction (lines 52-64) to highlight the distinction between ex vivo and in vivo models. We now also discuss that anesthetized animals exhibit slower NVC (lines 308-309).

      Further, while discussed to some extent, the authors could improve their manuscript by more clearly stating if they expect the described mechanism to contribute to CBF regulation under 'resting state conditions' (i.e., in the absence of any stimulus), during short or sustained (e.g., visual, tactile) stimulation, or if this mechanism is mainly relevant under pathological conditions; especially in the context of the optogenetic stimulation paradigm being used (10-s long stimulation of many pyramidal neurons at moderate-high frequencies) and the fact that constriction leading to undersupply in response to strongly increased neuronal activity seems counterintuitive?

      We now discuss more extensively the physiological relevance (lines 422-434 and 436-439) and the conditions where the described mechanisms of neurogenic vasoconstriction may occur.

      We agree with the reviewer that vasoconstriction in response to a large increase in neuronal activity is counterintuitive as it leads to undersupply despite an increased energy demand. We now discuss its potential physio/pathological role in attenuating neuronal activity by reducing energy supply (lines 453-464).

      Reviewer #2 (Public review):

      Summary:

      The present study by Le Gac et al. investigates the vasoconstriction of cerebral arteries during neurovascular coupling. It proposes that pyramidal neurons firing at high frequency lead to prostaglandin E2 (PGE2) release and activation of arteriolar EP1 and EP3 receptors, causing smooth muscle cell contraction. The authors further claim that interneurons and astrocytes also contribute to vasoconstriction via neuropeptide Y (NPY) and 20-hydroxyeicosatetraenoic acid (20-HETE) release, respectively. The study mainly uses brain slices and pharmacological tools in combination with Emx1-Cre; Ai32 transgenic mice expressing the H134R variant of channelrhodopsin-2 (ChR2) in the cortical glutamatergic neurons for precise photoactivation. Stimulation with 470 nm light using 10-second trains of 5-ms pulses at frequencies from 1-20 Hz revealed small constrictions at 10 Hz and robust constrictions at 20 Hz, which were abolished by TTX and partially inhibited by a cocktail of glutamate receptor antagonists. Inhibition of cyclooxygenase-1 (COX-1) or -2 (COX-2) by indomethacin blocked the constriction both ex vivo (slices) and in vivo (pial artery), and inhibition of EP1 and EP3 showed the same effect ex vivo. Single-cell RT-PCR from patched neurons confirmed the presence of the PGE2 synthesis pathway.

      While the data are convincing, the overall experimental setting presents some limitations. How is the activation protocol comparable to physiological firing frequency? 

      As also suggested by Reviewer #1 we have now discussed more extensively the physiological relevance of our observations (lines 422-434 and 436-439).

      The delay (minutes) between the stimulation and the constriction appears contradictory to the proposed pathway, which would be expected to occur rapidly. The experiments are conducted in the absence of vascular "tone," which further questions the significance of the findings. 

      The slow kinetics observed ex vivo are probably due to the low recording temperature and the absence of pharmacologically induced vascular tone, as already discussed (lines 312-317). Furthermore, as recommended by reviewer #1, we have presented the advantages and limitations of ex vivo and in vivo approaches (lines 52-64).

      Some of the targets investigated are expressed by multiple cell types, which makes the interpretation difficult; for example, cyclooxygenases are also expressed by endothelial cells.

      Under normal conditions, endothelial cells only express COX-1 and barely COX-2, whose expression is essentially observed in pyramidal cells (see Tasic et al. 2016, Zeisel et al. 2015, Lacroix et al., 2015). As pointed out by Reviewer #1, our ex vivo pharmacological data clearly indicate that vasoconstriction is mostly due to COX-2 activity, and to a much lesser extent to COX-1. Since it is well established that the previously described vascular effects of pyramidal cells are essentially mediated by COX-2 activity (Iadecola et al., 2000; Lecrux et al., 2011; Lacroix et al., 2015), we are quite confident that the vasoconstriction described here is mainly due to COX-2 activity of pyramidal cells.

      Finally, how is the complete inhibition of the constriction by the NPY Y1 receptor antagonist BIBP3226 consistent with a direct effect of PGE2 and 20-HETE in arterioles? 

      We agree with both reviewers that the complete blockade of the constriction by the NPY Y1 receptor antagonist BIBP3226 needs to be more carefully discussed. We have now included in the discussion the possible involvement of Y1 receptors in pyramidal cells, which could promote glutamate release and possibly COX-2, thereby contributing to PGE2 and 20-HETE signaling (lines 402-409).

      Overall, the manuscript is well-written with clear data, but the interpretation and physiological relevance have some limitations. However, vasoconstriction is a rather understudied phenomenon in neurovascular coupling, and the present findings may be of significance in the context of pathological brain hypoperfusion.

      We thank the reviewer for his/her comment and suggestions, which have helped us to improve our manuscript.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Methods:

      It is not clear if brain slices (or animals) underwent one, two, or several optogenetic stimulations - especially for experiments where 'control' is compared to 'treated' - does this data come from the same vessels (before and after treatment) or from two independent groups of vessels? If repeated stimulations are performed, do these repeated stimulations cause the same vascular response?

      As indicated in the Materials and Methods section (line 543), “Only one arteriole was monitored per slice”, which implies that the comparisons between the ‘control’ and ‘treated’ groups were made from independent groups of vessels. To clarify this point, we have added “receiving a single optogenetic or pharmacological stimulation” to this sentence (lines 543-544).

      For in vivo experiments, animals underwent 10-20 optogenetic stimulations with a 5-minute interstimulus interval during an experiment lasting a maximum of 2 hours. Trials from the same vessel were averaged (with a 0.1 s interpolation) for analysis, and the mean per vessel is presented in the graphs.
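
      The per-vessel averaging described above can be illustrated with a minimal Python sketch. It assumes linear interpolation of each trial onto a shared 0.1 s time grid before averaging; the interpolation method, the function names `resample` and `mean_trace`, and any example values are illustrative assumptions, not the authors' analysis code.

```python
from bisect import bisect_left


def resample(times, values, grid):
    """Linearly interpolate one trial (times, values) onto a common time grid."""
    out = []
    for t in grid:
        i = bisect_left(times, t)
        if i == 0:
            # At or before the first sample: hold the first value.
            out.append(values[0])
        elif i == len(times):
            # After the last sample: hold the last value.
            out.append(values[-1])
        else:
            t0, t1 = times[i - 1], times[i]
            v0, v1 = values[i - 1], values[i]
            out.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
    return out


def mean_trace(trials, duration_s, step_s=0.1):
    """Average several trials from one vessel on a grid with step_s spacing.

    trials: list of (times, values) pairs; times must be sorted ascending.
    Returns (grid, mean_values).
    """
    n_points = int(round(duration_s / step_s)) + 1
    grid = [k * step_s for k in range(n_points)]
    resampled = [resample(times, values, grid) for times, values in trials]
    mean_values = [sum(column) / len(column) for column in zip(*resampled)]
    return grid, mean_values
```

      Resampling before averaging matters when trials are not sampled at identical time points; each vessel then contributes a single mean trace to the group statistics.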

      Figure 2:

      Can the authors speculate about the cause for the slow increase in indicator fluorescence from minute 1.5 onward, which seems dependent on stimulation frequency? Is this increase also present when slices from a ChR2-negative animal undergo the same stimulation paradigm?

      Rhod2 was delivered by the patch pipette as indicated in the Materials and Methods section (line 514). Although a period of “at least 15 min after passing in whole-cell configuration to allow for somatic diffusion of the dye” (lines 551-552) was observed, this single-wavelength Ca2+ indicator likely continued to diffuse into the cells during the optical recording, thereby inducing a slight increase in delta F/F0. This is consistent with the positive slopes of the mean fluorescence changes observed during the 30-s control baseline (Fig. 2b).

      Figure 4: Why did the authors include panel a) here? Also, do the authors observe that cells with different COX-1 or -2 expression profiles show different (electrical, morphological) properties?

      The purpose of panel a) in Fig. 4 was to ensure the regular spiking electrophysiological phenotype of the pyramidal neurons whose cytoplasm was harvested for subsequent RT-PCR analysis. Despite our efforts, we found no difference in the 32 electrophysiological features between COX-1 or COX-2 positive and negative cells. This is now clearly stated in the result section (lines 210-212) and a supplementary table of electrophysiological features is now provided. Because it is difficult to determine the morphology of neurons analyzed by single-cell RT-PCR (Devienne et al. 2018), these cells were not processed for biocytin labeling.

      Figure 5: (1) Maybe the authors could highlight panels b-f as in vivo experiments to emphasize that these are in-vivo observations while the other experiments (especially panels g, h) are made in slices? 

      We thank the reviewer for this suggestion. A black frame is now depicted in Figure 5 to emphasize in vivo experiments.

      (2) What is the power of the optogenetic stimulus in this experiment? 

      The power of the optogenetic stimulus was 38 mW/mm² in ex vivo experiments (see line 527). For in vivo experiments, 1 mW pulses of 5 ms were used, the intensity being measured at the fiber end. We now provide the information for in vivo experiments in the Methods (lines 639-640).

      (3) Experiments were performed with Fluorescein-Dextran at 920-nm excitation which would overlap with EYFP fluorescence from the ChR2-EYFP transgene. Did the authors encounter any issues with crosstalk between the two labels? 

      Crosstalk between EYFP and fluorescein fluorescence was indeed an issue. This is why arterioles were monitored at the pial level to avoid fluorescence contamination from the cortical parenchyma. Because of the perivascular space around pial arterioles, it was possible to measure vessel diameter without contamination from the parenchyma (see Author response image 1 below). To clarify this point, we added the statement “which are not compromised by the fluorescence from the ChR2-EYFP transgene in the parenchyma (Madisen et al. 2012),” (lines 628-629). Note that line scan acquisitions without photoactivation did not trigger any progressive change in the vessel size or resting fluorescence.

      Author response image 1.

      Example of a pial arteriole filled with fluorescein dextran (cyan) in an Emx1-EYFP mouse (parenchyma labeled with YFP, in cyan). The red line represents a line scan to record the change in diameter. Due to the perivascular space surrounding the arterioles, the vessel walls are clearly identified and separated from the fluorescent parenchyma.

      (4) Could the authors potentially extend the time course in panel e) to show the recovery of the preparation to the baseline? 

      Because arterioles were only monitored for a 40-s period during a session of optogenetic stimulation/imaging, we cannot extend panel e. Nonetheless, a 5-minute interstimulus interval was observed to allow the full recovery of the preparation to the baseline. This is now clarified (line 640). Of note, the arteriole shown in panel d before indomethacin treatment fully recovered to baseline after this treatment.

      Also, did the authors observe any 'abnormal' behavior of the vasculature after stimulation, such as large-amplitude oscillations?

      We did not specifically investigate resting state oscillations, such as vasomotion, but the 10-s long baseline recording for each measurement indicates no long-lasting, abnormal and de novo behavior with a frequency higher than 0.1-0.2 Hz.

      (5) Can the authors show in vivo data from control experiments in EYFP-expressing or WT mice that underwent the same stimulation paradigm (Supplementary Figure 1 shows data from brain slices)?

      The reviewer is correct to point out this important control, as optogenetic stimulation can induce a vascular response without channelrhodopsin activation at high power (see our study on the topic, Rungta et al, Nat Com 2017). We therefore tested this potential artefact in a WT mouse using our setup, with different intensities and durations of optogenetic stimulation.

      Author response image 2A shows that stimulations of 10 seconds, 10 Hz, 1 mW, 5 ms pulses, i.e. the conditions we used for the experiments in Emx1 mice, did not induce dilation or constriction. Stimulation for 5 seconds with the same number of pulses, but with a higher power (4 mW), longer duration (20 ms pulses) and at a higher frequency elicited a small dilation in 1 of 2 pial arterioles (Author response image 2B). For this reason, we used only shorter (5 ms) and less intense (1 mW) optogenetic stimulation to ensure that the observed dilation was solely due to Emx1 activation and not to light-induced artefactual dilation.

      Author response image 2.

      Optogenetic stimulation in a wild-type mouse. A. No diameter changes upon stimulations of 10 seconds, 10 Hz, 1 mW, 5 ms pulses, i.e. the conditions we used for the experiments in Emx1 mice. B. Stimulation of higher power (4 mW), longer duration (20 ms pulses) and at a higher frequency elicited a small dilation in 1 (grey traces) of 2 pial arterioles.

      Figures 6 and 7: It is surprising that blockade of NPY Y1 receptors leads to a complete loss of the constriction response. As shown in Figure 7, the authors suggest that pyramidal neuron-released PGE2 (and glutamate) initiate several cascades acting on smooth muscle directly (PGE2-EP1/EP3), through astrocytes (Glu/COX-1/PGE2 or 20-HETE), or through NPY interneurons (Glu/NPY/Y1 or PGE2/NPY/Y1). This would imply that COX-1/2 and NPY/Y1 pathways act in series (as discussed by the authors). Besides the potential effects on NPY release mentioned in the discussion, could the authors comment if both (NPY and PGE2) pathways need to be co-activated in smooth muscle cells to cause constriction?

      We thank the reviewer for raising this surprising complete loss of vasoconstriction by Y1 antagonism, despite the contribution of other vasoconstrictive pathways. We now discuss (lines 402-409) the possibility that activation of the neuronal Y1 receptors in pyramidal cells may also have contributed to the vasoconstriction by promoting glutamate and possibly PGE2 release. The combined activation of vascular and neuronal Y1 receptors may explain the complete blockage of optogenetically induced vasoconstriction by BIBP3226.

      Reviewer #2 (Recommendations for the authors):

      The complete block of the constriction by BIBP3226 needs to be carefully considered.

      We thank the reviewer for stressing this point also raised by Reviewer #1. As mentioned above we now discuss (lines 402-409) the possibility that activation of the neuronal Y1 receptors in pyramidal cells may also have contributed to the vasoconstriction by promoting glutamate and possibly PGE2 release. The combined activation of vascular and neuronal Y1 receptors may explain the complete blockage of optogenetically induced vasoconstriction by BIBP3226.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The authors aim to assess the effect of salt stress on root:shoot ratio, identify the underlying genetic mechanisms, and evaluate their contribution to salt tolerance. To this end, the authors systematically quantified natural variations in salt-induced changes in root:shoot ratio. This innovative approach considers the coordination of root and shoot growth rather than exploring biomass and the development of each organ separately. Using this approach, the authors identified a gene cluster encoding eight paralog genes with a domain-of-unknown-function 247 (DUF247), with the majority of SNPs clustering into SR3G (At3g50160). In the manuscript, the authors utilized an integrative approach that includes genomic, genetic, evolutionary, histological, and physiological assays to functionally assess the contribution of their genes of interest to salt tolerance and root development.

      Strengths:

      The holistic approach and integrative methodologies presented in the manuscript are essential for gaining a mechanistic understanding of a complex trait such as salt tolerance. The authors focused on At3g50160 but included in their analyses additional DUF247 paralogs, which further contributes to the strength of their approach. In addition, the authors considered the developmental stage (young seedlings, early or late vegetative stages) and growth conditions of the plants (agar plates or soil) when investigating the role of SR3G in salt tolerance and root or shoot development.

      Weaknesses:

      The authors' claims and interpretation of the results are not fully supported by the data and analyses. In several cases, the authors report differences that are not statistically significant (e.g., Figures 4A, 7C, 8B, S14, S16B, S17C), use inappropriate statistical tests (e.g., t-test instead of Dunnett Test/ANOVA as in Figures 10B-C, S19-23), present standard errors that do not seem to be consistent with the post-hoc Tukey HSD Test (e.g., Figures 4, 9B-C, S16B), or lack controls (e.g., Figure 5C-E, staining of the truncated versions with FM4-64 is missing).

      We thank the reviewer for their critical thoughts on the presented data. We have revised our data interpretation in the main text to more accurately reflect the results. Given the nature of our experimental setup, where we trace the roots of individual Arabidopsis seedlings grown on plates, there is considerable biological variation, which makes achieving strong statistical significance between samples or genotypes challenging. However, we think that representing the data as transparently as possible is necessary to give readers and reviewers a true picture of the variability that we are observing. Consequently, we have centered our data interpretation around observable trends that facilitate drawing conclusions.

      The choice of statistical test is closely tied to the specific biological question being addressed. In Figures 10A-C, as in Figures 6A-B, we compared all genotypes to the wild-type Col-0 within each condition. An ANOVA, which tests the general effect of genotype across both the mutants and wild-type Col-0, is therefore not appropriate. Similarly, in Figures S19-S23, we compared each mutant line to the wild-type Col-0 under each condition.

      We repeated the post-hoc Tukey HSD Test for Figures 4, 9B-C, and S16B and made adjustments where necessary (see tracked changes manuscript).

      The truncated versions do not localize to the plasma membrane; instead, they are targeted to the nucleus and cytosol, mimicking the localization pattern of free GFP, which was used as a control in Panel F. Therefore, we believe that FM4-64 staining would not be an informative control for these specific images, and that free GFP serves as a better control for these particular constructs.

      In other cases, traits of root system architecture and expression patterns are inconsistent between different assays despite similar growth conditions (e.g., Figures S17A-B vs. 10A-C vs. 6A, and Figures S16B vs. 4A/9B), or T-DNA insertion alleles of WRKY75 that are claimed to be loss-of-function show comparable expression of WRKY75 as WT plants. Additionally, several supplemental figures are mislabeled (Figures S6-9), and some figure panels are missing (e.g., Figures S16C and S17E).

      We thank the reviewer for raising these points and noticing the inconsistency between different assays (e.g., Figures S17A-B vs. 10A-C vs. 6A, and Figures S16B vs. 4A/9B). As mentioned above, considerable biological variation makes achieving strong statistical significance between samples, genotypes, or experiments challenging. Thus, we have centered our data interpretation around observable “trends” between experiments to facilitate drawing conclusions. Considering Figures S17A-B, 10A-C, and 6A, we acknowledge the reviewer's concern about inconsistencies in root system architecture across experiments. Initially, we observed that the sr3g mutant had reduced lateral root length compared to Col-0 under salt stress. This led us to focus on this specific phenotypic trait rather than the overall root system architecture. Despite some variation, the sr3g mutant consistently showed a similar trend/phenotype when compared to Col-0 under salt stress. We believe the variation in main root length and lateral root number between experiments is due to inherent differences between biological replicates.

      Regarding gene expression patterns between Figures S16B and 4A/9B, we included part of Figure 9B (SR3G gene expression in Col-0) in Figure 4A. Figure S16B represents a completely different assay. Despite variations between assays, the overall message remains consistent: SR3G gene expression is induced under salt stress in the root but not in the shoot.

      Both SR3G and WRKY75 are expressed at very low levels, even under the 75 mM salt stress condition we tested. When gene expression is so low, detecting changes is challenging due to inherent variations. Nonetheless, we observed a reduction in WRKY75 expression in the mutant lines compared to wild-type Col-0, though this reduction was not statistically significant. More importantly, we observed a similar phenotype in the wrky75 mutant, specifically reduced main root length under salt stress, consistent with the findings of the published paper in The Plant Cell by Lu et al. (2023) “Lu, K.K., Song, R.F., Guo, J.X., Zhang, Y., Zuo, J.X., Chen, H.H., Liao, C.Y., Hu, X.Y., Ren, F., Lu, Y.T. and Liu, W.C., 2023. CycC1; 1–WRKY75 complex-mediated transcriptional regulation of SOS1 controls salt stress tolerance in Arabidopsis. The Plant Cell, 35(7), pp.2570-2591”.

      We thank the reviewer for spotting the missing labels for Figures S6-9. We corrected them in the main text, figures, and legends. We added panel C to Figure S16 and removed panel E from the Figure S17 legend, so that the figures and legends now match.

      Consequently, the authors' decisions regarding subsequent functional assays, as well as major conclusions about gene function, including SR3G function in root system architecture, involvement in root suberization, and regulation of cellular damage are incomplete.

      We greatly appreciate the reviewer's thorough review of our manuscript and their critical comments. We have carefully addressed all comments and concerns.

      Reviewer #2 (Public Review):

      Salt stress is a significant and growing concern for agriculture in some parts of the world. While the effects of sodium excess have been studied in Arabidopsis and (many) crop species, most studies have focused on Na uptake, toxicity, and overall effects on yield, rather than on developmental responses to excess Na, per se. The work by Ishka and colleagues aims to fill this gap.

      Working from an existing dataset that exposed a diverse panel of A. thaliana accessions to control, moderate, and severe salt stress, the authors identify candidate loci associated with altering the root:shoot ratio under salt stress. Following a series of molecular assays, they characterize a DUF247 protein which they dub SR3G, which appears to be a negative regulator of root growth under salt stress.

      Overall, this is a well-executed study that demonstrates the functional role played by a single gene in plant response to salt stress in Arabidopsis.

      The abstract and beginning of the Discussion section highlight the "new tool" developed here for measuring biomass accumulation. I feel that this distracts from the central aims of the study, which is really about the role of a specific gene in root development under salt stress. I would suggest moving the tool description to less prominent parts of the manuscript.

      We appreciate the reviewer's suggestion. We believe that the innovative tool used to extract shoot-to-root ratio data from previous experiments underscores the value of reutilizing previously acquired data for new discoveries and demonstrates how reanalyzing the same data can provide fresh insights, such as identification of new allelic variation. Therefore, we decided to retain this section, as our discovery of the SR3G gene originated from this innovative tool.

      Recommendations for the authors:

      Reviewer #3 (Recommendations For The Authors):

      Line 58 (opening sentence) - salt accumulation in the soil is not caused by evaporation exceeding input; that scenario results in soil water deficit. The issue is when the input water has dissolved ions.

      We thank the reviewer for raising this important point. While this point is theoretically true, all of the water that is found in natural environments contains some dissolved ions. Therefore, drought conditions will lead, over time, to increased soil salinization. We have amended this sentence to represent our point better.

      “Salt stress is predominant in the dryland areas where evaporation rate exceeds water input. As all water contains dissolved ions, the prolonged exposure to drought stress results in increased accumulation of salts in the upper soil layers 1–3.”

      I feel that it would be helpful, for replication and for interpretation, if the authors could provide water potentials for the growing media used throughout. What water potentials are the plants experiencing when grown in 1/2 MS + agar at 0, 75, and 150mM NaCl? Juenger and Verslues present a great recent discussion of the importance of reporting these values (Juenger, T. E. and P. E. Verslues (2023). "Time for a drought experiment: Do you know your plants' water status?" Plant Cell 35(1): 10-23.)

      Critically, how do the water potentials experienced by agar-grown plants compare to those experienced in soil-grown plants? As a stated aim of this study is to allow translation to crops these data are very important to convince physiologists of the relevance of the results.

      We thank the reviewer for raising this important point. We completely agree that growing plants on agar plates is an artificial setup and that knowing the water potential of the plants within this setup would be highly informative. However, as indicated in the review by Juenger and Verslues (2023), the agar plate setup is much more reproducible than various soil conditions, and we report the media composition in sufficient detail for it to be reproduced in other laboratories.

      Furthermore, while investigating the water status of plants and soil is indeed intriguing, it is beyond the scope of this study and would require us to redo the experiments with specific tools listed in the Juenger and Verslues review, which are not currently available in our laboratory.

      Importantly, any changes reported in this manuscript apply equally to both wild-type and mutant lines under all conditions. We provide an extensive report on the soil type used, as well as the soil quantity. We use the gravimetric method to determine the water content and to apply salt stress, as described in previous works from our lab (Yu and Sussman et al., 2024 Plant Physiology and Awlia et al., 2016 Frontiers in Plant Science).

      Nonetheless, we have now included water content measurements for soil-grown plants under different conditions, calculated by subtracting dry weight from fresh weight (new Fig. S24). Although plant water content may not fully capture the water status of the media or soil, our measurements did not reveal any significant differences in water content between genotypes across the various conditions tested.

      Line 69- missing an "and" after "(ABA)."

      Thanks. We added the missing “and”.

      Line 79 - I think the association being made is between natural variation in root and shoot growth and genetic variants, not "underlying genes."

      We thank the reviewer for this suggestion. The cause for the identified association indeed relies on allelic variation within the genetic region. We have re-phrased this sentence within the manuscript.

      “Many forward genetic studies were highly successful in associating natural variation in root and shoot growth with allelic variation in gene coding and promoter regions, thereby identifying potential new target traits for improved stress resilience 18,20,21.”

      Figure 1 - what do "seGF" and "reGF" stand for? Shoot and root growth rate, respectively, but there are extra letters in there…

      The abbreviations stand for shoot exponential Growth Factor and root exponential Growth factor. An explanation of the acronym has been added to the text.

      “The increase in the projected area of shoot and root (Fig. S2) was used to estimate (A) shoot and (B) root exponential growth rate (seGR and reGR respectively).”
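
      As an illustration of how an exponential growth rate could be estimated from projected-area time series, the following Python sketch fits a straight line to the logarithm of area over time. The log-linear least-squares fit, the function name `exponential_growth_rate`, and the time unit are assumptions made for this example; the exact fitting procedure used in the manuscript is not specified in this response.

```python
import math


def exponential_growth_rate(days, areas):
    """Estimate r in area(t) = area0 * exp(r * t) by least squares on log(area).

    days:  measurement times (any consistent unit, e.g. days)
    areas: projected areas at those times (must be positive)
    """
    if len(days) != len(areas) or len(days) < 2:
        raise ValueError("need at least two paired measurements")
    logs = [math.log(a) for a in areas]
    n = len(days)
    mean_t = sum(days) / n
    mean_y = sum(logs) / n
    # Ordinary least-squares slope of log(area) against time.
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(days, logs))
    sxx = sum((t - mean_t) ** 2 for t in days)
    return sxy / sxx
```

      Applying the same function to the shoot and root areas of one seedling would give two rates that can then be compared between organs and treatments.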

      Figure 1 legend - there's an "s" missing in "across." And two "additionally" in the penultimate sentence.

      Thanks for spotting the errors. We fixed these errors.

      Line 109 - how was the white balance estimated for the images on the flatbed scanner?

      Within the developed tool, we have not adjusted or controlled for white balance in any way, as the white balance of the flatbed scanner is kept at one value. The tool sorts the imaged pixels into bins consisting of white (root), green (shoot), and blue (background) pixels based on the closest distance in the RGB scale to the particular color, which makes correcting for white balance unnecessary. We have provided an additional explanation for this within the M&M section.

      “A Matlab-based tool was developed to simplify and speed up the segmentation and analysis pipeline. For automatic segmentation, the tool uses a combination of image operations (histogram equalization), thresholding on different color spaces (e.g., RGB, YCbCr, Lab, HSV), and binary image processing (boundary and islands removal). As the tool digitizes various color scales and classifies pixels into either white (root), green (shoot) or blue (background) categories, adjustment for white balance is unnecessary.”
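
      The nearest-color binning described above can be sketched in a few lines of Python. This is not the authors' Matlab tool: the reference RGB values, the squared Euclidean distance, and the names `REFERENCE_COLORS`, `classify_pixel` and `count_classes` are assumptions chosen only to illustrate the idea.

```python
# Hypothetical reference colors for the three pixel classes (assumed values).
REFERENCE_COLORS = {
    "root": (255, 255, 255),      # white
    "shoot": (0, 128, 0),         # green
    "background": (0, 0, 255),    # blue
}


def classify_pixel(rgb):
    """Return the class whose reference color is closest to rgb in RGB space."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(
        REFERENCE_COLORS,
        key=lambda name: squared_distance(rgb, REFERENCE_COLORS[name]),
    )


def count_classes(pixels):
    """Count how many pixels fall into each class (a proxy for projected area)."""
    counts = {name: 0 for name in REFERENCE_COLORS}
    for rgb in pixels:
        counts[classify_pixel(rgb)] += 1
    return counts
```

      Because each pixel is assigned to whichever reference color is nearest, a moderate shift in overall brightness changes the distances but often not the class that wins.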

      GWAS was performed separately on traits measured at control, 75mM, and 150mM NaCl treatments. Would it also be informative to map the STI measurement (i.e. plasticity) introduced here?

      We thank the reviewer for this important point. We performed GWAS on both “raw” and STI traits; however, the associations identified with STI traits were less abundant than those identified with “raw” traits. This makes sense, as we are compounding the root or shoot growth under both conditions, and plastic responses to the environment are expected to be genetically more complex, as they involve more genetic regulators than phenotypes with low plasticity. We have added this to the description of the results, as we acknowledge that this might be an interesting observation for the field to build upon, and might provide fodder for new methods to deconvolute the complexity in mapping plastic traits.

      “To identify genetic components underlying salt-induced changes in root:shoot ratio, we used the collected data as an input for GWAS. The associations were evaluated based on the p-value, the number of SNPs within the locus, and the number of traits associated with individual loci. As the Bonferroni threshold differs depending on the minor allele count (MAC) considered, we identified significant associations based on a Bonferroni threshold for each subpopulation of SNPs based on MAC (Table S3). We conducted GWAS on directly measured traits as well as on their Salt Tolerance Index (STI) values; however, the number of associations with STI was much lower than with directly measured traits (Table S3). This observation aligns with the understanding that plastic responses to environmental conditions tend to be genetically more complex. This complexity likely stems from the involvement of more genetic regulators compared to low-plasticity phenotypes.”
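
      To make the MAC-dependent Bonferroni threshold concrete, here is a minimal Python sketch. It assumes that each SNP subpopulation consists of all SNPs with a MAC at or above a given cutoff and that the family-wise alpha is 0.05; both choices, as well as the function names, are assumptions for illustration and are not taken from the manuscript.

```python
import math


def bonferroni_by_mac(snp_macs, mac_cutoffs, alpha=0.05):
    """Compute a Bonferroni p-value threshold for each MAC cutoff.

    snp_macs:    minor allele count of every tested SNP
    mac_cutoffs: cutoffs defining the subpopulations (SNPs with MAC >= cutoff)
    Returns {cutoff: alpha / number_of_SNPs_in_subpopulation}.
    """
    thresholds = {}
    for cutoff in mac_cutoffs:
        n_tests = sum(1 for mac in snp_macs if mac >= cutoff)
        if n_tests > 0:
            thresholds[cutoff] = alpha / n_tests
    return thresholds


def neg_log10(p_value):
    """Express a p-value threshold on the -log10 scale used in Manhattan plots."""
    return -math.log10(p_value)
```

      Stricter MAC cutoffs leave fewer SNPs to correct for, so the corresponding p-value threshold is less stringent; this is why the threshold has to be reported per subpopulation.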

      Line 167 - how was LD incorporated into this analysis? Did you use a genome average? Or was LD allowed to vary (as it does) across the genome?

      Initially, we used the genome-average LD for this purpose (10 kbp for Arabidopsis), and extended the region of interest based on the number of coding genes within the window. We have added this to the description in our manuscript.

      “For the most promising candidate loci (Table S4), we have identified the gene open reading frames that were located within the genome-wide linkage-disequilibrium (LD) of the associated SNPs. The LD was expanded if multiple SNPs were identified within the region, and the region of interest was expanded based on the number of coding genes within the LD window. ”
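
      One way to picture the candidate-gene selection is the Python sketch below, which collects genes overlapping a window that spans all associated SNPs plus a fixed LD distance on either side. The 10 kbp default follows the genome-average value mentioned above; the window construction, the function name `genes_in_ld_window`, and the gene tuples are hypothetical and only illustrate the principle.

```python
def genes_in_ld_window(snp_positions, genes, ld_bp=10_000):
    """Return names of genes overlapping the LD window around associated SNPs.

    snp_positions: base-pair positions of associated SNPs on one chromosome
    genes:         iterable of (name, start, end) tuples on the same chromosome
    ld_bp:         LD distance added on each side of the outermost SNPs
    """
    if not snp_positions:
        return []
    window_start = min(snp_positions) - ld_bp
    window_end = max(snp_positions) + ld_bp
    # A gene is kept if any part of it falls inside the window.
    return [
        name
        for name, start, end in genes
        if end >= window_start and start <= window_end
    ]
```

      With several SNPs in one locus, the window automatically widens from the first to the last SNP, which mirrors the expansion of the region described in the quoted text.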

      Line 291 - I think the water potentials are essential, here. What does 50% of soil water holding capacity equal in these soils? In the substrate that we use in our lab, that would represent a considerable soil water deficit even without any salts in the soil.

      We thank the reviewer for this comment. As Arabidopsis occurs naturally in soils with a low water holding capacity (i.e. sandy soils), it typically grows better in soils that are not heavily saturated with water. Throughout the many experiments performed within this study, and in other studies performed in our lab (results reported in Awlia et al., 2016 Frontiers in Plant Science and Yu & Sussman et al., 2024 Plant Physiology), we have not observed any drought-like symptoms at 50% soil water holding capacity. The fact that this is reproducible across similar soil types in two laboratories (one in Saudi Arabia and one in the USA) is not to be dismissed. Again, we are currently not equipped to measure water potentials for these plants, as this is not (yet) a standard practice for stress experiments, but we are taking these comments on board for all of our future experiments.

      Moreover, our control plants are also “dried down” to 50% of SWHC, and soaked in non-saline water during the “salt stress treatment” to make sure that the soil water saturation is accounted for within the experimental setup. This “dry down” of soil is necessary to ensure equal and effective salt penetration into the soil particles. More details on this method can be found in Awlia et al., 2016.

      Again, we have added a new dataset measuring water content in individual soil-grown plants under different conditions as a proxy for soil water status (see new Fig. S24). While we did not observe any significant differences in water content between genotypes under the various conditions, the sr3g mutant showed a slightly higher, though non-significant, water content compared to wild-type Col-0 under control conditions.

      We have provided additional information and comments to warn the readers about this method:

      “The seeds were germinated in ½ MS media for one week, as described for the agar-based plate experiments. One week after germination, the seedlings were transplanted to the pot (12 x 4 cm insert) containing the Cornell Mix soil (per batch combine: 0.16 m3 of peat moss, 20.84 kg of vermiculite, 0.59 kg of Uni-Mix fertilizer, and 2.27 kg of lime) watered to 100% water holding capacity and placed in the walk-in growth chamber with the 16 h light / 8 h dark period, 22°C and 60% relative humidity throughout the growth period. When all of the pots dried down to the weight corresponding to 50% of their water holding capacity, they were soaked for 1 h in tap water or a 200 mM NaCl solution, resulting in an effective concentration of 100 mM NaCl based on the 50% soil water holding capacity, which corresponded to a moderate level of salt stress (Awlia et al., 2016). The control pots were soaked for the same length of time in 0 mM NaCl solution, to account for the soil saturation effect. We then allowed the pots to be drained for 2-3 h to eliminate excess moisture. The pots were placed under phenotyping rigs equipped with an automated imaging system (Yu et al., 2023) and the pot weight was measured daily to maintain the reference weight corresponding to 50% of the soil water holding capacity throughout the experiment. We would like to note that this gravimetric based method for application of salt stress has been developed for soils typically used for pot-grown plants, with relatively high water holding capacity (Awlia et al. 2016). Within these specific conditions, no drought stress symptoms were observed.”
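
      The arithmetic behind the gravimetric salt treatment can be checked with a short Python sketch. It assumes that the soak refills the pot from 50% to 100% of its water holding capacity and that the added solution mixes fully with the salt-free water already present; the function names and the example weights in the comments are hypothetical.

```python
def effective_concentration_mM(soak_mM, fraction_before_soak):
    """Effective NaCl concentration after soaking a partly dried pot.

    Assumes the pot is refilled to full water holding capacity and that the
    water already in the soil (fraction_before_soak of capacity) is salt-free.
    """
    added_fraction = 1.0 - fraction_before_soak
    return soak_mM * added_fraction


def target_pot_weight_g(dry_weight_g, saturated_weight_g, fraction):
    """Pot weight corresponding to a given fraction of water holding capacity.

    dry_weight_g:       pot with fully dried soil (hypothetical measurement)
    saturated_weight_g: same pot at 100% water holding capacity
    """
    water_at_capacity_g = saturated_weight_g - dry_weight_g
    return dry_weight_g + fraction * water_at_capacity_g
```

      Under these assumptions, a pot at 50% of capacity soaked in 200 mM NaCl ends up at 200 × 0.5 = 100 mM, which matches the effective concentration stated in the quoted methods; the second function gives the reference weight to maintain by daily weighing.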

      Lines 415-416 - are these contrasts significant? Figure S3 likewise does not have any notation for significant differences in the means.

      We had not previously tested whether 125 mM NaCl had a stronger effect than 75 mM on relative root and shoot growth, so these test results were not initially included in Fig. S3. We have now run the tests, included them in Fig. S3, and added a description of their significance to the main body of the manuscript:

      “In comparison, the growth rates of the shoot were significantly reduced to 0.71 and 0.43 of the control in 75 and 125 mM NaCl treatments, respectively (Fig. S3). While the mean value of root:shoot growth rate did not change upon salt stress treatment, the variance in the root:shoot ratio significantly expanded with the increasing concentrations of salt (Fig. 1C). These results suggest that while root and shoot growth are well coordinated under non-stress conditions, salt stress exposure results in loss of coordination of organ growth across Arabidopsis accessions.”

      Line 418 - same comment as preceding. Is this change in variance significant?

      We had not previously tested this. We have now added the ANOVA tests to each figure and described their significance in the main body of the manuscript (see the text above).

      Line 421 - why would we expect there to be a correlation between root:shoot growth ratio and seedling size?

      We used seedling size as a proxy for “fitness”, that is, how well the plants can survive under these specific conditions. We were testing whether any simple directional strategy, such as an increase or decrease in root:shoot ratio under salt stress, results in better salt tolerance, which would translate into larger overall seedlings. We have rephrased this in the manuscript to better explain the hypothesis being tested in this figure:

      “To test whether there is a clear directional correlation between the change in root:shoot ratio and overall salt stress tolerance, we have used the overall seedling size as a proxy for plant salt tolerance (Fig. S4, S5). No significant correlation was found between the root:shoot growth ratio and total seedling size (Fig. S4, S5), indicating that the relationship between coordination of root and shoot growth and salt tolerance during the early seedling establishment is complex.”

      Line 438 - I think a stable web link would be more appropriate than listing Dr. Nordborg's email address.

      Sorry about this. There is a glitch with our reference citing software. We agree, and thank the reviewer for noticing this! We assigned reference number 43 to it.

      Line 439 - I expect that many of your readers may not be experienced with GWAS. Can you provide an explanation as to why only one locus was detected with both the 250K SNP panel and the 4M SNP panel?

      We thank the reviewer for raising this point. We have added additional explanation to this observation:

      “Increased SNP density can provide more potential associations, highlighting the associated loci with more confidence, due to more SNPs being detected within a specific region. The different panels could capture different LD blocks across the genome. If the locus detected by both panels is in a region of strong LD or under selection, it could be detected consistently. In contrast, other loci may not be captured well by the lower-density 250K SNP panel. The new GWAS revealed 32 additional loci, with only one significantly associated locus being picked up by both 250K and 4M SNP GWAS (locus 30, Table S3). The detection of only one common locus between the two SNP panels is likely due to differences in resolution, statistical power, and how well each panel captures the genomic regions associated with the trait.”

      Figure 2A and B - I suggest adding the p-value cutoff to the y-axis of the Manhattan Plots

      We thank the reviewer for this suggestion, however this is not appropriate. The genome wide p-value cutoffs for GWAS studies are arbitrary, and we have not used a genome-wide cutoff for our SNPs, but rather used cutoffs depending on the minor allele frequency. Therefore, we think adding a straight line to the graphs in Fig. 2A-B representing the overall cutoff, would be misleading. Please see below the text where we explain how the threshold was calculated for individual groups of SNPs with varying MAF:

      “The GWAS associations were evaluated for minor allele count (MAC) and association strength above the Bonferroni threshold with -log10(p-value/#SNPs), calculated for each sub-population of SNPs above threshold MAC (Table S3, Bonf.threshold.MAC.specific)”
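      A minimal sketch of how a MAC-specific Bonferroni threshold of this kind could be computed is given below. It assumes that "p-value" in the quoted formula stands for a nominal significance level (0.05 is used here as an assumption) and that "#SNPs" is the number of SNPs whose minor allele count is at or above a given cutoff. The function names and the example counts are hypothetical and are not taken from Table S3.

```python
import math


def bonferroni_threshold(n_snps, alpha=0.05):
    """Return -log10(alpha / n_snps); alpha = 0.05 is an assumed level."""
    if n_snps <= 0:
        raise ValueError("need at least one SNP")
    return -math.log10(alpha / n_snps)


def mac_specific_thresholds(minor_allele_counts, mac_cutoffs, alpha=0.05):
    """Compute one threshold per MAC cutoff.

    For each cutoff, only SNPs with MAC >= cutoff are counted, so stricter
    cutoffs leave fewer tests and give a lower -log10 threshold.
    """
    thresholds = {}
    for cutoff in mac_cutoffs:
        n_passing = sum(1 for mac in minor_allele_counts if mac >= cutoff)
        thresholds[cutoff] = bonferroni_threshold(n_passing, alpha)
    return thresholds


# Hypothetical minor allele counts for six SNPs.
example_macs = [1, 3, 5, 10, 20, 40]
print(mac_specific_thresholds(example_macs, mac_cutoffs=[5, 10]))
```

      Because the threshold depends on how many SNPs pass each MAC cutoff, there is no single horizontal line that applies to every SNP, which is the point the response makes about the Manhattan plots.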

      Line 490-492 - Presents the results of the gene tree to support a model in which SR3G diverged from AT3G50150 prior to the speciation events leading to Capsella and Arabidopsis. But this topology requires at least two independent losses of SR3G - can you rule out the hypothesis that the position of SR3G on the gene tree is a result of long branch attraction? Given the syntenic orientation of AT3G50150 and SR3G, and apparent directional selection experienced by the latter lineage, it seems more parsimonious that AT3G50150 and SR3G arose from a very recent duplication event.

      We agree with the reviewer that it seemed most parsimonious for AT3G50160 (SR3G) to be a recent tandem duplication of AT3G50150 – and this was certainly our expectation given the other tandem duplications that have occurred in this genomic region. However, irrespective of the type of alignment from which we built the phylogeny (nucleotide vs AA; sometimes nucleotide is noisier but provides more information) we were never able to recapitulate a tree where AT3G50160 was immediately sister to AT3G50150 – even with a long branch for AT3G50160 indicating a rapid pace of nucleotide/AA change relative to AT3G50150. With regard to long branch attraction, it is our interpretation that long branch attraction typically requires multiple long branches that get placed together at a poorly supported node where sampling is sparse (https://www.nature.com/articles/s41576-020-0233-0), whereas we have the single long branch for AT3G50160, and all other A/C clade (Arabidopsis/Camelina/Capsella) members forming a lineage with a much shorter branch. To test the possibility of long branch attraction we subtracted out individual members of the AT3G50150/160 clade to see if there was algorithmic uncertainty in the placement of AT3G50160. We did not observe this in any of the branch subtractions that we performed (see below). Thus, it appears that we must stick with our original interpretation. If the reviewer would like us to soften this interpretation, we would be more than happy to do so, as it does not impact the overall conclusions for AT3G50160 being a rapidly evolving member of this clade.

      Author response image 1.

      Line 494 (and throughout) - I expect that all of the genes being studied herein are "experiencing selection," even if it's boring-old purifying selection on functionally conserved proteins. I think you mean to say "directional selection."

      We thank the reviewer for this comment and completely agree that our statement lacked precision. We have corrected this throughout the manuscript.

      Line 497 - state the background and foreground values of omega, here.

      We apologize for not including these values and have added them at this point in the manuscript (new Table S6).

      Line 511 and Line 673 - Inspection of Figure S13B suggests that SR3G is not "predominantly" expressed nor does it have the "highest enrichment" in the root stele. Certainly, among root cell types, this is predominant. But it appears to be quite highly expressed in late-stage seeds and some floral organs, as well.

      We appreciate the reviewer for recognizing that SR3G is not a highly expressed gene. In root cell types, its expression is enriched in the root stele. Overall, SR3G is expressed at both early and later developmental stages. Our investigation of later developmental stages related to seed production did not reveal any significant phenotypic differences in fertility.

      Line 514 - "54-folds" should be "54-fold."

      Thanks. We made corrections.

      Figure 7 - For symmetry, I suggest adding the "Beginning of salt stress" arrow to the "Early Stress" panel as well (even if it's right at day 0).

      Thanks. We added the arrow to Early Stress in both Panels A and B.

      Figure S2 - both graphs should have the same scale on the y-axis

      Thanks - we have now re-plotted the graph with the matching y-axis scales.

      Line 531 - I feel that this is a significant overstatement. The strongest statement supported by the results presented here is that SR3G is the most prominent DUF247 studied herein in root development under salt stress.

      Thanks for the comment. We have rephrased the statement:

      “These results suggest that SR3G is the most prominent DUF247 studied within our study to affect root development under salt stress.”

      Lines 583-605 - These data seem to me to be tangential to the central aims of the study. I suggest removing them for clarity/brevity.

      We greatly appreciate the reviewer's suggestion. Our study primarily focused on characterizing the main GWAS candidate, SR3G. Since SR3G is located within a cluster of other DUF247 genes on chromosome 3, we believe that screening the neighboring DUF247 genes could provide further insights into SR3G’s role in root development. Additionally, we believe that the generated data and lines will serve as a valuable resource for other researchers interested in studying these genes. For these reasons, we have decided to retain these datasets in the manuscript.

      Lines 650-652 - these sections 1-3 differences in suberization between SR3G and Col-0 under control conditions are not significant. At best, this may be described as a "trend" and not "higher levels." In section 4, it is VERY marginally significant (and probably not at all after the large number of tests performed, here.)

      We appreciate the reviewer's feedback and have revised the wording accordingly.

      Line 660 - this statement is only true for Section 1. I suggest adding this caveat.

      We appreciate the reviewer's comments on this matter. We quantified four suberin monomers in whole root seedlings rather than in individual root sections due to the technical challenges of separating the sections without microscopy and the limited availability of samples for GC-MS analysis.

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      The study is an important advancement to the consideration of antimalarial drug resistance: the authors make use of both modelling results and supporting empirical evidence to demonstrate the role of malaria strain diversity in explaining biogeographic patterns of drug resistance. The theoretical methods and the corresponding results are convincing, with the novel model presented moving beyond existing models to incorporate malaria strain diversity and antigen-specific immunity. This work is likely to be interesting to malaria researchers and others working with antigenically diverse infectious diseases.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The paper is an attempt to explain a geographic paradox between infection prevalence and antimalarial resistance emergence. The authors developed a compartmental model that importantly contains antigenic strain diversity and in turn antigen-specific immunity. They find a negative correlation between parasite prevalence and the frequency of resistance emergence and validate this result using empirical data on chloroquine-resistance. Overall, the authors conclude that strain diversity is a key player in explaining observed patterns of resistance evolution across different geographic regions.

      The authors pose and address the following specific questions:

      1. Does strain diversity modulate the equilibrium resistance frequency given different transmission intensities?

      2. Does strain diversity modulate the equilibrium resistance frequency and its changes following drug withdrawal?

      3. Does the model explain biogeographic patterns of drug resistance evolution?

      Strengths:

      The model built by the authors is novel. As emphasized in the manuscript, many factors (e.g., drug usage, vectorial capacity, population immunity) have been explored in models attempting to explain resistance emergence, but strain diversity (and strain-specific immunity) has not been explicitly included and thus explored. This is an interesting oversight in previous models, given the vast antigenic diversity of Plasmodium falciparum (the most common human malaria parasite) and its potential to "drive key differences in epidemiological features".

      The model also accounts for multiple infections, which is a key feature of malarial infections, with individuals often infected with either multiple Plasmodium species or multiple strains of the same species. Accounting for multiple infections is critical when considering resistance emergence, as with multiple infections there is within-host competition which will mediate the fitness of resistant genotypes. Overall, the model is an interesting combination of a classic epidemiological model (e.g., SIR) and a population genetics model.

      In terms of major model innovations, the model also directly links selection pressure via drug administration with local transmission dynamics. This is accomplished by the interaction between strain-specific immunity, generalized immunity, and host immune response.

      R: We thank the reviewer for his/her appreciation of the work.

      Weaknesses:

      In several places, the explanation of the results (i.e., why are we seeing this result?) is underdeveloped. For example, under the section "Response to drug policy change", it is stated that (according to the model) low diversity scenarios show the least decline in resistant genotype frequency after drug withdrawal; however, this result emerges mechanistically. Without an explicit connection to the workings of the model, it can be difficult to gauge whether the result(s) seen are specific to the model itself or likely to be more generalizable.

      R: We acknowledge that the explanation of certain results needs to be improved. We have now added the explanation of why low diversity scenarios show the least decline in resistance frequency after drug withdrawal: “Two processes are responsible for the observed trend: first, resistant genotypes have a much higher fitness advantage in low diversity regions even with reduced drug usage because infected hosts are still highly symptomatic; second, due to low transmission potential in low diversity scenarios (i.e., longer generation intervals between transmissions), the rate of change in parasite populations is slower.” (L243-247). We also compared the drug withdrawal response to that of the generalized-immunity-only model (L268-271). The medium transmission region has the fastest reduction in resistance frequency, followed by the high and low transmission regions, which differs from the full model that incorporates strain-specific diversity.

      In addition, to provide the context of different biogeographic transmission zones, we now include a new figure (now Fig. 3) that presents the parameter space of transmission potential and strain diversity of different continents, which demonstrates that PNG and South America have less strain diversity than expected by transmission potential (L179-184 and L198-202). Therefore, these two regions have low disease prevalence and high resistance frequency.

      The authors emphasize several model limitations, including the specification of resistance by a single locus (thus not addressing the importance of recombination should resistance be specified by more than one locus); the assumption that parasites are independently and randomly distributed among hosts (contrary to empirical evidence); and the assumption of a random association between the resistant genotype and antigenic diversity. However, each of these limitations is addressed in the discussion.

      R: As pointed out by the referee, our model presents several limitations that have all been addressed in the discussion and considered for future extensions.

      Did the authors achieve their goals? Did the results support their conclusion?

      Returning to the questions posed by the authors:

      1. Does strain diversity modulate the equilibrium resistance frequency given different transmission intensities? Yes. The authors demonstrate a negative relationship between prevalence/strain diversity and resistance frequency (Figure 2).

      2. Does strain diversity modulate the equilibrium resistance frequency and its changes following drug withdrawal? Yes. The authors find that, under resistance invasion and some level of drug treatment, resistance frequency decreased with the number of strains (Figure 4). The authors also find that lower strain diversity results in a slower decline in resistant genotypes after drug withdrawal and higher equilibrium resistance frequency (Figure 6).

      3. Does the model explain biogeographic patterns of drug resistance evolution? Yes. The authors find that their full model (which includes strain-specific immunity) produces the empirically observed negative relationship between resistance and prevalence/strain diversity, while a model only incorporating generalised immunity does not (Figure 8).

      Utility of work to others and relevance within and beyond the field?

      This work is important because antimalarial drug resistance has been an ongoing issue of concern for much of the 20th century and now 21st century. Further, this resistance emergence is not equitably distributed across biogeographic regions, with South America and Southeast Asia experiencing much of the burden of this resistance emergence. Not only can widespread resistant strains be traced back to these two relatively low-transmission regions, but these strains remain at high frequency even after drug treatment ceases.

      Reviewer #2 (Public Review):

      Summary:

      The evolution of resistance to antimalarial drugs follows a seemingly counterintuitive pattern, in which resistant strains typically originate in regions where malaria prevalence is relatively low. Previous investigations have suggested that frequent exposures in high-prevalence regions produce high levels of partial immunity in the host population, leading to subclinical infections that go untreated. These subclinical infections serve as refuges for sensitive strains, maintaining them in the population. Prior investigations have supported this hypothesis; however, many of them excluded important dynamics, and the results cannot be generalized. The authors have taken a novel approach using a deterministic model that includes both general and adaptive immunity. They find that high levels of population immunity produce refuges, maintaining the sensitive strains and allowing them to outcompete resistant strains. While general population immunity contributed, adaptive immunity is key to reproducing empirical patterns. These results are robust across a range of fitness costs, treatment rates, and resistance efficacies. They demonstrate that future investigations cannot overlook adaptive immunity and antigenic diversity.

      R: We thank the reviewer for his/her appreciation of the work.

      Strengths:

      Overall, this is a very nice paper that makes a significant contribution to the field. It is well-framed within the body of literature and achieves its goal of providing a generalizable, unifying explanation for otherwise disparate investigations. As such, this work will likely serve as a foundation for future investigations. The approach is elegant and rigorous, with results that are supported across a broad range of parameters.

      Weaknesses:

      Although the title states that the authors describe resistance invasion, they do not support or even explore this claim. As they state in the discussion (line 351), this work predicts the equilibrium state and doesn't address temporal patterns. While refuges in partially immune hosts may maintain resistance in a population, they do not account for the patterns of resistance spread, such as the rapid spread of chloroquine resistance in Africa once it was introduced from Asia.

      R: We do agree that resistance invasion is not the focus of our manuscript. Rather we mainly investigate the maintenance and decline after drug withdrawal. Therefore, we changed the title to “Antigenic strain diversity predicts different biogeographic patterns of maintenance and decline of anti-malarial drug resistance” (L1-4).

      We did, however, present a fast initial invasion phase for the introduction of resistant genotypes regardless of transmission scenarios in Fig. 5 (now Fig. 6). Even though the focus of the manuscript is to investigate long term persistence of resistant genotypes, we did emphasize that the initial invasion phase and how that changes the host immunity profile are key to the coexistence of resistant and wild-type genotypes (L228-239).

      As the authors state in the discussion, the evolution of compensatory mutations that negate the cost of resistance is possible, and in vitro experiments have found evidence of such. It appears that their results are dependent on there being a cost, but the lower range of the cost parameter space was not explored.

      R: It is true that compensatory mutations might mitigate the negative fitness consequences. We did not originally include a no-cost scenario because, in general, if there is no cost but only benefit (survival through drug usage), then resistant haplotypes will likely be fixed in the population. This is contingent on the assumption that these compensatory mutations are in perfect linkage with resistant alleles, which is unlikely in high-transmission scenarios. Our model does not incorporate recombination, but earlier models (Dye & Williams 1997, Hastings & D’Alessandro 2000) have demonstrated that recombination will delay the fixation of resistant alleles in high-transmission settings.

      As suggested, we ran our model with costs equal to 0 and 0.01 (Fig. 2C and L189-191). We found that resistant alleles almost always fix, except when diversity is extremely high and treatment/resistance efficacy is low. In these cases, the additional transmission gained by resistant alleles brings little benefit (as lower GI classes have a very small number of hosts). This finding does not contradict the wider range of coexistence between wild-type and resistant alleles when the cost is higher. We therefore added these scenarios to our updated results.

      Author response image 1.

      The use of a deterministic, compartmental model may be a structural weakness. This means that selection alone guides the fixation of new mutations on a semi-homogenous adaptive landscape. In reality, there are two severe bottlenecks in the transmission cycle of Plasmodium spp., introducing a substantial force of stochasticity via genetic drift. The well-mixed nature of this type of model is also likely to have affected the results. In reality, within-host selection is highly heterogeneous, strains are not found with equal frequency either in the population or within hosts, and there will be some linkage between the strain and a resistance mutation, at least at first. Of course, there is no recourse for that at this stage, but it is something that should be considered in future investigations.

      R: We thank the reviewer for their insightful comments on the constraints of the deterministic modeling approach. We’ve added these points to the discussion, in the paragraph on the second limitation of the model (L359-364).

      The authors mention the observation that patterns of resistance in high-prevalence Papua New Guinea seem to be more similar to Southeast Asia, perhaps because of the low strain diversity in Papua New Guinea. However, they do not investigate that parameter space here. If they did and were able to replicate that observation, not only would that strengthen this work, it could profoundly shape research to come.

      R: We appreciate the suggestion to investigate the parameter space of Papua New Guinea. We now include a new figure (now Fig. 3) that presents the parameter space of transmission potential and strain diversity of different continents, which demonstrates that PNG and South America have less strain diversity than expected by transmission potential (L179-184 and L198-202). This translates to low infectivity for most mosquito bites, and most infections only occur in hosts with lower generalized immunity. Therefore resistant genotypes will help ensure disease transmission in these symptomatic hosts and be strongly selected to be maintained.

      Reviewer #1 (Recommendations For The Authors):

      1. I found lines 41-49 difficult to follow. Please rephrase (particularly punctuation) for clarity.

      R: We have edited the lines to improve the writing (L41-50):

      “Various relationships between transmission intensity and stable frequencies of resistance were discovered, each of which has some empirical support: 1) transmission intensity does not influence the fate of resistant genotypes [Models: Koella and Antia (2003); Masserey et al. (2022); Empirical: Diallo et al. (2007); Shah et al. (2011, 2015)]; 2) resistance first increases in frequency and slowly decreases with increasing transmission rates [Models: Klein et al. (2008, 2012)]; and 3) Valley phenomenon: resistance can be fixed at both high and low end of transmission intensity [Model: Artzy-Randrup et al. (2010); Empirical: Talisuna et al. (2002)]. Other stochastic models predict that it is harder for resistance to spread in high transmission regions, but patterns are not systematically inspected across the parameter ranges [Model: Whitlock et al. (2021); Model and examples in Ariey and Robert (2003)].”

      2. Line 65: There should be a space after "recombination" and before the citation.

      R: Thank you for catching the error. We’ve added the space (L64).

      3. I'm interested in the dependency of the results on the assumption that there is a cost to resistance via lowered transmissibility (lines 142-145). I appreciate that variation in the cost(s) of resistance in single and mixed infections is explored; however, from what I can tell the case of zero cost is not explored.

      R: As suggested, we have now added the no-cost scenario. Please see the response to paragraph 2 of Reviewer #2’s weaknesses.

      4. I felt the commentary/explanation of the response to drug policy change was a bit underdeveloped. I would have liked a walk-through of why in your model low diversity scenarios show the slowest decline in resistant genotypes after switching to different drugs.

      R: We acknowledge that the explanation of the response to drug policy change needs to be improved. We have now added the explanation of why low diversity scenarios show the least decline in resistance frequency after drug withdrawal: “Two processes are responsible for the observed trend: first, resistant genotypes have a much higher fitness advantage in low diversity regions even with reduced drug usage because infected hosts are still highly symptomatic; second, due to low transmission potential in low diversity scenarios (i.e., longer generation intervals between transmissions), the rate of change in parasite populations is slower.” (L243-247). We also compared the drug withdrawal response to that of the generalized-immunity-only model. The medium transmission region has the fastest reduction in resistance frequency, followed by the high and low transmission regions, which differs from the full model that incorporates strain-specific diversity.

      5. Line 352: persistent drug usage?

      R: Yes, we meant persistent drug usage. We’ve clarified the writing (L389-391).

      6. The organisation of the manuscript would benefit from structuring around the focal questions so that the reader can easily find the answers to the focal questions within the results and discussion sections.

      R: This is a great suggestion. We modified the subheadings of results to provide answers to focal questions (L151, L179, L203-204, and L240).

      7. Line 353: Please remove either "shown" or "demonstrated".

      R: Thank you for catching the grammatical error. We’ve retained only “shown” in the sentence (L391-392).

      Reviewer #2 (Recommendations For The Authors):

      Overall, this was very nice work and a pleasure to read.

      Major:

      1. Please provide a much more thorough explanation of how resistance invasions are modeled. It is not clear from the text and could not be replicated.

      R: We have now added a section “drug treatment and resistance invasion” in Methods and Materials to explain how resistance invasions are modeled (L488-496):

      “Given each parameter set, we ran the ODE model six times until equilibrium with the following genotypic compositions: 1) wild-type only scenario with no drug treatment; 2) wild-type only scenario with 63.2% drug treatment (0.05 daily treatment rate); 3) wild-type only scenario with 98.2% drug treatment (0.2 daily treatment rate); 4) resistant-only scenario with no drug treatment; 5) resistance invasion with 63.2% drug treatment; 6) resistance invasion with 98.2% drug treatment. Runs 1-4 start with all hosts in G0,U compartment and ten parasites. Runs 5 and 6 (resistance invasion) start from the equilibrium state of 2 and 3, with ten resistant parasites introduced. We then followed the ODE dynamics till the next equilibrium.”
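      The pairing of treatment percentages with daily treatment rates in the quoted methods (63.2% with 0.05, 98.2% with 0.2) is consistent with a constant-rate exponential process observed over a 20-day window. The 20-day window is an inference from the numbers, not something stated in the response, and the function name below is hypothetical.

```python
import math


def treated_fraction(daily_rate, days=20.0):
    """Fraction of hosts treated at least once within `days`.

    Assumes treatment is a constant-rate (exponential) process. The 20-day
    default is an assumption chosen because it reproduces the quoted
    percentages; it is not given in the source.
    """
    return 1.0 - math.exp(-daily_rate * days)


for rate in (0.05, 0.2):
    print(rate, round(100 * treated_fraction(rate), 1))
```

      With these assumptions the two rates give 63.2% and 98.2%, matching the values in runs 2, 3, 5, and 6.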

      2. Please make your raw data, code, and replicable examples that produce the figures in the manuscript available.

      R: We have added a data availability section, which provides the GitHub site with all the code for the model, data processing, and figures: all the ODE code, numerically simulated data, empirical data, and analysis scripts are publicly available at https://github.itap.purdue.edu/HeLab/MalariaResistance.

      3. Regarding the limitations described in the paragraph about the model in the public response, these results would be strengthened if there were separate compartments for strains which could be further divided into sensitive and resistant. Could you explore this for at least a subset of the parameter space?

      R: In our model, sensitive and resistant pathogens are always modeled as separate compartments (Fig. S1B and Appendix 1). In Results/Model structure, L135-136, we stated the setup:

      “The population sizes of resistant (PR) or sensitive (wild-type; PW) parasites are tracked separately in host compartments of different G and drug status.”

      4. To what extent do these results rely on a cost to resistance? Were lower costs explored? This would be worth demonstrating. If this cannot be maintained without cost, do you think this is because there is no linkage between strain and resistance?

      R: As suggested, we have now added the no-cost scenario (Fig. 2C and L189-191). Please see the response to paragraph 2 of Reviewer #2’s weaknesses. In sum, under a no-cost scenario, if the treatment rate is low, wild-type alleles will still be maintained in high transmission scenarios; when the treatment rate is high, resistant alleles will always be fixed.

      Minor:

      1. "Plasmodium" should be italicized throughout. Ironically, italics aren't permitted in this form.

      R: We did italicize “Plasmodium” or “P. falciparum” throughout the text. If the reviewer is referring to “falciparum malaria”, the convention is not to italicize falciparum in this case.

      1. Fig 1A: the image is reversed for the non-infected host with prior exposure to strain A. Additionally, the difference between colors for WT and resistant is not visible in monochrome.

      R: Thank you for pointing out the problem of color choice in monochrome. We have modified the figure. The image in Fig 1A is not reversed for non-infected hosts with prior exposure to strain A. We now spell out “S” to be “specific immunity”, and explain it better in the figure legend.

      1. Fig 2B: add "compare to the pattern of prevalence shown in Fig 2A" or something similar to make the comparison immediately clear.

      R: We thank the reviewer for the suggestion. We’ve added a sentence to contrast Fig 2A and B in the figure legend: “A comparison between the prevalence pattern in (A) and resistance frequency in (B) reveals that high prevalence regions usually correspond to low resistance frequency at the end of resistance invasion dynamics.”

      1. Figs 2B & C: Please thoroughly explain how you produced this data in the methods section and briefly describe it in the results sections.

      R: We agree that the modeling strategies need to be explained better. Because the results section “Appropriate pairing of strain diversity and vectorial capacity” (now “Impact of strain diversity and transmission potential on disease prevalence”) already explains the rationale for the parameter ranges and the prevalence patterns we observe, we added sentences to that section describing how we run the models to equilibrium for wild-type-only infections with or without drug treatment (L152-178). Then, in the following section, “Drug-resistance and disease prevalence”, we explain how we obtained the resistance invasion data:

      “To investigate resistance invasion, we introduce ten resistant infections to the equilibrium states of drug treatment with wild-type only infections, and follow the ODE dynamics till the next equilibrium” (L180-181).
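
      The invasion procedure quoted above (equilibrate with wild-type infections only, seed ten resistant infections, integrate to the next equilibrium) can be illustrated with a deliberately simplified sketch. This is not the authors' model: it has no strain diversity or host immunity, and the two-genotype equations, parameter values, and function names below are hypothetical choices made only to show the workflow.

```python
# Toy two-genotype competition model (illustrative only; NOT the authors' ODE system).
# w = hosts infected with wild-type parasites, r = hosts infected with resistant parasites.

def derivatives(w, r, beta, clearance, treatment, cost, n_hosts):
    """Right-hand side of the toy model; all parameters are hypothetical."""
    s = n_hosts - w - r                                   # uninfected hosts
    dw = beta * w * s / n_hosts - (clearance + treatment) * w   # drug clears wild-type only
    dr = beta * (1.0 - cost) * r * s / n_hosts - clearance * r  # resistance carries a transmission cost
    return dw, dr

def run_to_equilibrium(w, r, params, dt=0.1, tol=1e-6, max_steps=2_000_000):
    """Forward-Euler integration until both derivatives are below tol."""
    for _ in range(max_steps):
        dw, dr = derivatives(w, r, **params)
        if abs(dw) < tol and abs(dr) < tol:
            break
        w += dt * dw
        r += dt * dr
    return w, r

def resistance_frequency_after_invasion(treatment, n_resistant=10.0):
    params = dict(beta=0.3, clearance=0.05, treatment=treatment,
                  cost=0.1, n_hosts=10_000.0)
    # Step 1: equilibrium with wild-type-only infections under drug treatment.
    w_eq, _ = run_to_equilibrium(10.0, 0.0, params)
    # Step 2: introduce resistant infections and follow the dynamics to the next equilibrium.
    w_end, r_end = run_to_equilibrium(w_eq, n_resistant, params)
    return r_end / (w_end + r_end)

if __name__ == "__main__":
    for rate in (0.0, 0.05):
        freq = resistance_frequency_after_invasion(rate)
        print(f"daily treatment rate {rate}: resistance frequency {freq:.3f}")
```

      In this toy setting resistance invades only when (1 - cost) x (clearance + treatment) exceeds clearance, which is why the untreated run ends with resistance lost and the treated run ends with resistance fixed. The published model adds strain-specific immunity on top of this kind of competition, so its outcomes need not follow this simple threshold.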

      1. Fig 3: The axis labels are not particularly clear. For the Y axis, please state in the label what it is the frequency of (either the mutation or the phenotype). In the X axis, it is better to spell that out in words, like "P. falciparum prevalence in children".

      R: Thank you for pointing this out. We’ve modified the axes labels of Fig. 3 (now Fig. 4): X-axis: “P. falciparum prevalence in children aged 2-10”; Y-axis: “Frequency of resistant genotypes (pfcrt 76T)”.

      1. Fig 4 and the rest of the figures of this nature: Showing an equilibrium-state timestep before treatment was introduced would improve the readers' understanding of the dynamics.

      R: We agree that the equilibrium state before treatment is important. In fact, we have those states in our figure 4 (now figure 5): the left panel- “Daily treatment rate 0” indicates the equilibrium-state timestep before treatment. We clarified this point in the caption.

      1. Fig 5 is very compelling, but the relationships in Fig 5 would be clearer if the Y axes were not all different. Consider using the same scale for the hosts, and the same scale for resistant parasites (both conditions) and WT parasites, 113 strains. It may be clearer to reference them if they are given as A-F instead of three figures each for A and B.

      R: We agree with the suggested changes and have modified figure 5 (now Fig. 6): we used one Y-axis scale for the hosts, and one Y-axis scale for the parasites. The wild-type one is very low for the low diversity scenario, thus we included one inset plot for that case.

      1. Fig 5 caption: High immune protection doesn't select against resistance. The higher relative fitness of the sensitive strain selects against resistance in a high-immunity environment.

      R: Thank you for pointing this out. Here we meant that a reduction in resistant population after the initial overshoot occurs in both diversity levels. We are not comparing resistant strains to sensitive ones. We’ve modified the sentence to: “The higher specific immunity reduces the infectivity of new strains, leading to a reduction of the resistant parasite population regardless of the diversity level”.

      1. Line 242: "keep" should be plural.

      R: We’ve corrected “keep” to “keeps” (L267).

      1. Line 360 and elsewhere: The strength of the results is somewhat overstated at times. This absolutely supports the importance of strain-specific immunity, but these results do not explain patterns of the origin of resistance and there are a number of factors that are not incorporated (a necessary evil of modeling to be sure).

      R: Thank you for pointing this out. We’ve modified the discussion to tone down the overstated claims:

      1) Original: “The inclusion of strain diversity in the model provides a new mechanistic explanation as to why Southeast Asia has been the original source of resistance to certain antimalarial drugs, including chloroquine.”

      Modified: “The inclusion of strain diversity in the model provides a new mechanistic explanation as to why Southeast Asia has persisting resistance to certain antimalarial drugs, including chloroquine, despite a lower transmission intensity than Africa.” (L328-330)

      2) In sum, we show that strain diversity and associated strain-specific host immunity, dynamically tracked through the macroparasitic structure, can predict the complex relationship between transmission intensity and drug-resistance frequencies.

      1. The color palettes are not discernible in grayscale, especially the orange/blue/gray in Fig 2. The heatmaps appear to be in turbo, the only viridis palette that isn't grayscale-friendly. Just something to keep in mind for the accessibility of individuals with achromatopsia and most people who print out papers.

      R: Thank you for the visualization suggestions. We updated all the figures with the “viridis:magma” palette. As for the orange/blue/gray scale used in Fig 2C, it is difficult to pick nine colors that are discernible by brightness in grayscale. Currently, the four colors correspond to clonal genotype cost (i.e. green, red, grey, and blue), and the three-level brightness maps to mixed genotype cost.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      (1) Some details are not described for experimental procedures. For example, what were the pharmacological drugs dissolved in, and what vehicle control was used in experiments? How long were pharmacological drugs added to cells?

      We apologise for the oversight. These details have now been added to the methods section of the manuscript as well as to the relevant figure legends.

      Briefly, latrunculin was used at a final concentration of 250 nM and Y27632 at a final concentration of 50 μM. Both drugs were dissolved in DMSO. Vehicle controls were performed with DMSO at the higher of the two final concentrations used for the drugs.

      The details of the drug treatments and their duration were added to the methods and to figures 6, S10, and S12.

      (2) Details are missing from the Methods section and Figure captions about the number of biological and technical replicates performed for experiments. Figure 1C states the data are from 12 beads on 7 cells. Are those same 12 beads used in Figure 2C? If so, that information is missing from the Figure 2C caption. Similarly, this information should be provided in every figure caption so the reader can assess the rigor of the experiments. Furthermore, how heterogenous would the bead displacements be across different cells? The low number of beads and cells assessed makes this information difficult to determine.

      We apologise for the oversight. We have now added this data to the relevant figure panels.

      To gain a further understanding of the heterogeneity of bead displacements across cells, we have replotted the relevant graphs using different colours to indicate different cells. This reveals that different cells appear to behave similarly and that the behaviour appears controlled by distance to the indentation or the pipette tip rather than cell identity.

      We agree with the reviewer that the number of cells examined is low. This is due to the challenging nature of the experiments, which means that many attempts are necessary to obtain a successful measurement.

      The experiments in Fig 1C are a verification of a behaviour documented in a previous publication [1]. Here, we just confirm the same behaviour and therefore we decided that only a small number of cells was needed.

      The experiments in Fig 2C (that allow for a direct estimation of the cytoplasm’s hydraulic permeability) require formation of a tight seal between the glass micropipette and the cell, something known as a gigaseal in electrophysiology. The success rate of this first step is 10-30% of attempts for an experienced experimenter. The second step is forming a whole cell configuration, in which a hydraulic link is formed between the cell and the micropipette. This step has a success rate of ~ 50%. Whole cell links are very sensitive to any disturbance. After reaching the whole cell configuration, we applied relatively high pressures that occasionally resulted in loss of link between the cell and the micropipette. In summary, for the 12 successful measurements, hundreds of unsuccessful attempts were carried out.
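
      As a rough consistency check on the attempt count described above, the two quoted success rates can be combined into an expected number of attempts. The sketch below assumes the two steps succeed independently and ignores the additional losses of the whole-cell link at high pressure, so it should be read as a lower bound; the function name is hypothetical.

```python
def expected_attempts(n_successes, p_gigaseal, p_whole_cell):
    """Expected number of attempts needed for n_successes when each attempt
    must pass two independent steps (gigaseal, then whole-cell configuration)."""
    return n_successes / (p_gigaseal * p_whole_cell)

# Success rates quoted in the response: 10-30% for the gigaseal, ~50% for whole cell.
best_case = expected_attempts(12, 0.30, 0.5)
worst_case = expected_attempts(12, 0.10, 0.5)
print(f"about {best_case:.0f} to {worst_case:.0f} attempts for 12 measurements")
```

      Under these assumptions the 12 measurements already imply on the order of a hundred or more attempts before any link losses during pressure application are counted, in line with the "hundreds of unsuccessful attempts" reported.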

      (3) The full equation for displacement vs. time for a poroelastic material is not provided. Scaling laws are shown, but the full equation derived from the stress response of an elastic solid and viscous fluid is not shown or described.

      We thank the reviewer for this comment. Based on our experiments, we found that the cytoplasm behaves as a poroelastic material. However, to understand the displacements of the cell surface in response to localised indentation, we show that we also need to take the tension of the submembranous cortex into account. In summary, the interplay between cell surface tension generated by the cortex and the poroelastic cytoplasm controls the cell behaviour. To our knowledge, no simple analytical solutions to this type of problem exist.

      In Fig 1, we show that the response of the cell to local indentation is biphasic with a short time-scale displacement followed by a longer time-scale one. In Figs 2 and 3, we directly characterise the kinetics of cell surface displacement in response to microinjection of fluid. These kinetics are consistent with the long time-scale displacement but not the short time-scale one. Scaling considerations led us to propose that tension in the cortex may play a role in mediating the short time-scale displacement. To verify this hypothesis, we have now added new data showing that the length-scale of an indentation created by an AFM probe depends on tension in the cortex (Fig S5).  

      In a previous publication [2], we derived the temporal dynamics of cell surface displacement for a homogeneous poroelastic material in response to a change in osmolarity. In the current manuscript, the composite nature of the cell (membrane, cortex, cytoplasm) needs to be taken into account as well as a realistic cell shape. Therefore, we did not attempt to provide an analytical solution for the displacement of the cell surface versus time in the current work. Instead, we turned to finite element modelling to show that our observations are qualitatively consistent with a cell that comprises a tensed submembranous actin cortex and a poroelastic cytoplasm (Fig 4). We have now added text to make this clearer for the reader.

      Reviewer #2 (Public review):

      Comments & Questions:

      The authors state, "Next, we sought to quantitatively understand how the global cellular response to local indentation might arise from cellular poroelasticity." However, the evidence presented in the following paragraph appears more qualitative than strictly quantitative. For instance, the length scale estimate of ~7 μm is only qualitatively consistent with the observed ~10 μm, and the timescale 𝜏𝑧 ≈ 500 ms is similarly described as "qualitatively consistent" with experimental observations. Strengthening this point would benefit from more direct evidence linking the short timescale to cell surface tension. Have you tried perturbing surface tension and examining its impact on this short-timescale relaxation by modulating acto-myosin contractility with Y-27632, depolymerizing actin with Latrunculin, or applying hypo/hyperosmotic shocks?

      Upon rereading our manuscript, we agree with the reviewer that some of our statements are too strong. We have now moderated these and clarified the goal of that section of the text.

      The reviewer asks if we have examined the effect of various perturbations on the short time-scale displacements. In our experimental conditions, we cannot precisely measure the time-scale of the fast relaxation because its duration is comparable to the frame rate of our image acquisition. However, we examined the amplitude of the displacement of the first phase in response to sucrose treatment and we have carried out new experiments in which we treat cells with 250nM Latrunculin to partially depolymerise cellular F-actin. Neither of these treatments had an impact on the amplitude of vertical displacements (Fig. S3).

      The absence of change in response to Latrunculin may be because the treatment decreases both the elasticity of the cytoplasm E and the cortical tension γ. As the length-scale l of the deformation of the surface scales as l ~ γ/E, the two effects of latrunculin treatment may therefore compensate one another and result in only small changes in l. We have now added this data to supplementary information and comment on this in the text.

      The reviewer’s comment also made us want to determine how cortical tension affects the length-scale of the cell surface deformation created by localised microindentation. To isolate the role of the cortex from that of cell shape, we decided to examine rounded mitotic cells. In our experiments, we indented a mitotic cell expressing a membrane targeted GFP with a sharp AFM tip (Fig. S5).

      In our experiments, we adjusted force to generate a 2μm depth indentation and we imaged the cell profile with confocal microscopy before and during indentation. Segmentation of this data allowed us to determine the cell surface displacement resulting from indentation and measure a length scale of deformation. In control conditions, the length scale created by deformation is on the order of 1.2μm. When we inhibited myosin contractility with blebbistatin, the length-scale of deformation decreased significantly to 0.8 μm, as expected if we decrease the surface tension γ without affecting the cytoplasmic elasticity. We have now added this data to our manuscript.

      The authors demonstrate that the second relaxation timescale increases (Figure 1, Panel D) following a hyperosmotic shock, consistent with cytoplasmic matrix shrinkage, increased friction, and consequently a longer relaxation timescale. While this result aligns with expectations, is a seven-fold increase in the relaxation timescale realistic based on quantitative estimates given the extent of volume loss?

      We thank the reviewer for this interesting question. Upon re-examining our data, we realised that the numerical values in the text related to the average rather than the median of our measurements. The median of the poroelastic time constant increases from ~0.4s in control conditions to 1.4s in sucrose, representing approximately a 3.5 fold increase.

      Previous work showed that HeLa cell volume decreases by ~40% in response to hyperosmotic shock [3]. The fluid volume fraction in cells is ~65-75%. If we assume that the water is contained in N pores of volume , we can express the cell volume as with the volume of the solid fraction. We can rewrite .

      With ∅ = 0.42  -0.6.  As  does not change in response to osmotic shock, we can rewrite the volume change to obtain the change in pore size .

      The poroelastic diffusion constant scales as and the poroelastic timescale scales as . Therefore, the measured change in volume leads to a predicted increase in poroelastic diffusion time of 1.7-1.9 fold, smaller than observed in our experiments. This suggests that some intuition can be gained in a straightforward manner assuming that the cytoplasm is a homogenous porous material.
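
      The symbols in the estimate above did not survive in this copy of the response, so the following sketch is only one reading of the argument. It assumes that all of the lost volume is water, that the number of pores and the solid volume stay fixed (so pore size scales as the cube root of the fluid volume), and that the poroelastic timescale scales as the inverse square of the pore size. The function name and these scaling choices are assumptions, not the authors' stated equations.

```python
def predicted_timescale_fold(fluid_fraction, volume_loss):
    """Fold increase in poroelastic timescale after a hyperosmotic shock.

    fluid_fraction: initial fluid volume as a fraction of cell volume.
    volume_loss: lost volume as a fraction of the initial cell volume
                 (assumed to be water only).
    """
    fluid_after = fluid_fraction - volume_loss
    if fluid_after <= 0:
        raise ValueError("volume loss exceeds the available fluid volume")
    # Fixed pore number: pore size ~ (fluid volume)^(1/3).
    pore_size_ratio = (fluid_after / fluid_fraction) ** (1.0 / 3.0)
    # Assumed scaling: diffusion constant ~ pore_size^2, timescale ~ 1 / diffusion constant.
    return pore_size_ratio ** -2

if __name__ == "__main__":
    for fluid_fraction in (0.75, 0.65):   # fluid fraction range quoted in the response
        fold = predicted_timescale_fold(fluid_fraction, 0.40)  # ~40% volume decrease
        print(f"fluid fraction {fluid_fraction}: predicted fold increase {fold:.2f}")
    print(f"observed fold increase (medians): {1.4 / 0.4:.1f}")
```

      With these assumptions the predicted increase falls close to the 1.7-1.9 fold range quoted in the response, and remains well below the roughly 3.5 fold change in the measured medians.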

      However, the reality is more complex and the hydraulic pore size is distinct from the entanglement length of the cytoskeleton mesh, as we discussed in a previous publication [4]. When the fluid fraction becomes sufficiently small, macromolecular crowding will impact diffusion further and non-linearities will arise. We have now added some of these considerations to the discussion.

      If the authors' hypothesis is correct, an essential physiological parameter for the cytoplasm could be the permeability k and how it is modulated by perturbations, such as volume loss or gain. Have you explored whether the data supports the expected square dependency of permeability on hydraulic pore size, as predicted by simple homogeneity assumptions?

      We thank the reviewer for this comment. As discussed above, we have explored such considerations in a previous publication (see discussion in [4]). Briefly, we find that the entanglement length of the F-actin cytoskeleton does play a role in controlling the hydraulic pore size but is distinct from it. Membrane bounded organelles could also contribute to setting the pore size. In our previous publication, we derived a scaling relationship that indicates that four different length-scales contribute to setting cellular rheology: the average filament bundle length, the size distribution of particles in the cytosol, the entanglement length of the cytoskeleton, and the hydraulic pore size. Many of these length-scales can be dynamically controlled by the cell, which gives rise to complex rheology. We have now added these considerations to our discussion.

      Additionally, do you think that the observed decrease in k in mitotic cells compared to interphase cells is significant? I would have expected the opposite naively as mitotic cells tend to swell by 10-20 percent due to the mitotic overshoot at mitotic entry (see Son Journal of Cell Biology 2015 or Zlotek Journal of Cell Biology 2015).

      We thank the reviewer for this interesting question. Based on the same scaling arguments as above, we would expect that a 10-20% increase in cell volume would give rise to 10-20% increase in diffusion constant. However, we also note that metaphase leads to a dramatic reorganisation of the cell interior and in particular membrane-bounded organelles. In summary, we do not know why such a decrease could take place. We now highlight this as an interesting question for further research.

      Based on your results, can you estimate the pore size of the poroelastic cytoplasmic matrix? Is this estimate realistic? I wonder whether this pore size might define a threshold above which the diffusion of freely diffusing species is significantly reduced. Is your estimate consistent with nanobead diffusion experiments reported in the literature? Do you have any insights into the polymer structures that define this pore size? For example, have you investigated whether depolymerizing actin or other cytoskeletal components significantly alters the relaxation timescale?

      We thank the reviewer for this comment. We cannot directly estimate the hydraulic pore size from the measurements performed in the manuscript. Indeed, while we understand the general scaling laws, the prefactors of such relationships are unknown.

      We carried out experiments aiming at estimating the hydraulic pore size in previous publications [3,4] and others have shown spatial heterogeneity of the cytoplasmic pore size [5]. In our previous experiments, we examined the diffusion of PEGylated quantum dots (14nm in hydrodynamic radius). In isosmotic conditions, these diffused freely through the cell but when the cell volume was decreased by a hyperosmotic shock, they no longer moved [3,4]. This gave an estimate of the pore radius of ~15nm.

      Previous work has suggested that F-actin plays a role in dictating this pore size but microtubules and intermediate filaments do not [4].

      There are no quantifications in Figure 6, nor is there a direct comparison with the model. Based on your model, would you expect the velocity of bleb growth to vary depending on the distance of the bleb from the pipette due to the local depressurization? Specifically, do blebs closer to the pipette grow more slowly?

      We apologise for the oversight. The quantifications are presented in Fig S10 and Fig S12. We have now modified the figure legends accordingly.

      Blebs are very heterogeneous in size and growth velocity, both within a cell and across cells in the population, in normal conditions [6]. Other work has shown that bleb size is controlled by a competition between pressure driving growth and actin polymerisation arresting it [7]. Therefore, we did not attempt to determine the impact of depressurisation on bleb growth velocity or size.

      In experiments in which we suddenly increased pressure in blebbing cells, we did notice a change in the rate of growth of blebs that occurred after we increased pressure (Author response image 1). However, the experiments are technically challenging and we decided not to perform more.

      Author response image 1.

      A. A hydraulic link is established between a blebbing cell and a pipette. At time t>0, a step increase in pressure is applied. B. Kymograph of bleb growth in a control cell (top) and in a cell subjected to a pressure increase at t=0s (bottom). Top: In control blebs, the rate of growth is slow and approximately constant over time. The black arrow shows the start of blebbing. Bottom: The black arrow shows the start of blebbing. The dashed line shows the timing of pressure application and the red arrow shows the increase in growth rate of the bleb when the pressure increase reaches the bleb. This occurs with a delay δt.

      I find it interesting that during depressurization of the interphase cells, there is no observed volume change, whereas in pressurization of metaphase cells, there is a volume increase. I assume this might be a matter of timescale, as the microinjection experiments occur on short timescales, not allowing sufficient time for water to escape the cell. Do you observe the radius of the metaphase cells decreasing later on? This relaxation could potentially be used to characterize the permeability of the cell surface.

      We thank the reviewer for this comment.

      First, we would like to clarify that both metaphase and interphase cells increase their volume in response to microinjection. The effect is easier to quantify in metaphase cells because we assume spherical symmetry and just monitor the evolution of the radius (Fig 3). However, the displacement of the beads in interphase cells (Fig 2) clearly shows that the cell volume increases in response to microinjection. For both interphase and metaphase cells, when the injection is prolonged, the membrane eventually detaches from the cortex and large blebs form until cell lysis. In contrast to the reviewer’s intuition, we never observe a relaxation in cell volume, probably because we inject fluid faster than the cell can compensate volume change through regulatory mechanisms involving ion channels.

      When we depressurise metaphase cells, we do not observe any change in volume (Fig S10). This contrasts with the increase that we observe upon pressurisation. The main difference between these two experiments is the pressure differential. During depressurisation experiments, this is the hydraulic pressure within the cell ~500Pa (Fig 6A); whereas during pressurisation experiments, this is the pressure in the micropipette, ranging from 1.4-10 kPa (Fig 3). We note in particular that, when we used the lowest pressures in our experiments, the increase in volume was very slow (see Fig 3C). Therefore, we agree with the reviewer that it is likely the magnitude of the pressure differential that explains these differences.

      I am curious about the saturation of the time lag at 30 microns from the pipette in Figure 4, Panel E for the model's prediction. A saturation which is not clearly observed in the experimental data. Could you comment on the origin of this saturation and the observed discrepancy with the experiments (Figure E panel 2)? Naively, I would have expected the time lag to scale quadratically with the distance from the pipette, as predicted by a poroelastic model and the diffusion of displacement. It seems weird to me that the beads start to move together at some distance from the pipette or else I would expect that they just stop moving. What model parameters influence this saturation? Does membrane permeability contribute to this saturation?

      We thank the reviewer for pointing this out. In our opinion, the saturation occurring at 30 microns arises from the geometry of the model. At the largest distance away from the micropipette, the cortex becomes dominant in the mechanical response of the cell because it represents an increasing proportion of the cellular material.

      To test this hypothesis, we will rerun our finite element models with a range of cell sizes. This will be added to the manuscript at a later date.

      Reviewer #3 (Public review):

      Weaknesses: I have two broad critical comments:

      (1) I sense that the authors are correct that the best explanation of their results is the passive poroelastic model. Yet, to be thorough, they have to try to explain the experiments with other models and show why their explanation is parsimonious. For example, one potential explanation could be some mechanosensitive mechanism that does not involve cytoplasmic flow; another could be viscoelastic cytoskeletal mesh, again not involving poroelasticity. I can imagine more possibilities. Basically, be more thorough in the critical evaluation of your results. Besides, discuss the potential effect of significant heterogeneity of the cell.

      We thank the reviewer for these comments and we agree with their general premise.

      Some observations could qualitatively be explained in other ways. For example, if we considered the cell as a viscoelastic material, we could define a time constant τ = η/E with η the viscosity and E the elasticity of the material. The increase in relaxation time with sucrose treatment could then be explained by an increase in viscosity. However, work by others has previously shown that, in the exact same conditions as our experiment, viscoelasticity cannot account for the observations [1]. In its discussion, this study proposed poroelasticity as an alternative mechanism but did not investigate that possibility. This was consistent with our work that showed that the cytoplasm behaves as a poroelastic material and not as a viscoelastic material [4]. Therefore, we decided not to consider viscoelasticity as a possibility. We now explain this reasoning better and have added a sentence about a potential role for mechanotransductory processes in the discussion.

      (2) The study is rich in biophysics but a bit light on chemical/genetic perturbations. It could be good to use low levels of chemical inhibitors for, for example, Arp2/3, PI3K, myosin etc, and see the effect and try to interpret it. Another interesting question - how adhesive strength affects the results. A different interesting avenue - one can perturb aquaporins. Etc. At least one perturbation experiment would be good.

      We agree with the reviewer. In our previous studies, we already examined what biological structures affect the poroelastic properties of cells [2,4]. Therefore, the most interesting aspect to examine in our current work would be perturbations to the phenomenon described in Fig 6G and, in particular, to investigate what volume regulation mechanisms enable sustained intracellular pressure gradients. However, these experiments are particularly challenging and have very low throughput. Therefore, we feel that they are outside the scope of the present report and we mention them as promising future directions.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Please add more information to Materials and methods and figure captions to more clearly share how many different cells and trials the data are coming from.

      This has been done.

      Please add the full equation for displacement vs. time for the poroelastic model and describe appropriately.

      This cannot be done but we explain why.

      Overall, the clarity of the writing in the manuscript could be improved.

      This has been done.

      Please increase text size in some of the figures.

      This has been done.

      Reviewer #2 (Recommendations for the authors):

      Figure 1 would benefit from some revisions for clarity. In Panel D, for the control experiment with 7 cells, why are only 3 data points shown?

      This was due to the use of Excel for generating the box plot. Some data points overlap. We have now used different software.

      In Panel E, there is no legend explaining the red dots in the whisker plots.

      This has now been added.

      Additionally, the inset in Panel D lacks a legend, and it is unclear how k was computed.

      This inset panel has been removed.

      Moreover, I find Figure 1, Panel C somewhat pixelated, which makes it challenging to interpret. As I am colorblind, I need to zoom in significantly to distinguish the colors, and the current resolution makes this difficult. Improving the image resolution would be helpful.

      Apologies for this. We have now verified the quality of images on our submission.  

      I am unsure about the method used to compute the relaxation timescale in Figure S2. If an exponential relaxation is assumed, I would expect a function of the form:

      d(t) = d1 + Delta d * (1 - exp(-(t - t1)/tau_p))

      which implies that for t=t1+tau_p, the result should be d1+0.6*Delta d, which does not correspond to the formula given. Have you tried fitting the data with an exponential function or using the model to extract tau_p without assuming a specific functional form?

      We thank the reviewer for pointing this out. We have now added further explanation of the fitting to the figure legend.
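
      The reviewer's point about the exponential form can be checked numerically. The sketch below assumes a single-exponential relaxation starting at time t1 from displacement d1 with total amplitude Delta d; the function names, the log-linear fitting approach, and the synthetic values are illustrative assumptions and do not reproduce the fitting procedure used for Figure S2.

```python
import math

def exponential_relaxation(t, t1, d1, delta_d, tau_p):
    """Displacement for a single-exponential relaxation starting at t1."""
    return d1 + delta_d * (1.0 - math.exp(-(t - t1) / tau_p))

def estimate_tau(times, displacements, t1, d1, delta_d):
    """Estimate tau_p by least squares (through the origin) on the linearised form
    -ln(1 - (d - d1)/delta_d) = (t - t1) / tau_p."""
    num = den = 0.0
    for t, d in zip(times, displacements):
        remaining = 1.0 - (d - d1) / delta_d
        if t <= t1 or remaining <= 0.0:
            continue  # skip points before onset or at/above the plateau
        x = t - t1
        y = -math.log(remaining)
        num += x * y
        den += x * x
    return den / num  # slope is 1/tau_p, so tau_p = den/num

if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    t1, d1, delta_d, tau_p = 1.0, 0.2, 1.5, 0.4
    fraction = (exponential_relaxation(t1 + tau_p, t1, d1, delta_d, tau_p) - d1) / delta_d
    print(f"fraction of Delta d reached at t1 + tau_p: {fraction:.3f}")

    times = [t1 + 0.05 * k for k in range(1, 21)]
    data = [exponential_relaxation(t, t1, d1, delta_d, tau_p) for t in times]
    print(f"recovered tau_p: {estimate_tau(times, data, t1, d1, delta_d):.3f}")
```

      The fraction reached after one time constant is 1 - 1/e, about 0.63, which matches the reviewer's "0.6*Delta d". On noisy data a direct nonlinear fit would usually be preferable to the log-linear shortcut, because the logarithm amplifies noise near the plateau.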

      References:

      (1) Rosenbluth, M. J., Crow, A., Shaevitz, J. W. & Fletcher, D. A. Slow stress propagation in adherent cells. Biophys J 95, 6052-6059 (2008). https://doi.org/10.1529/biophysj.108.139139

      (2) Esteki, M. H. et al. Poroelastic osmoregulation of living cell volume. iScience 24, 103482 (2021). https://doi.org/10.1016/j.isci.2021.103482

      (3) Charras, G. T., Mitchison, T. J. & Mahadevan, L. Animal cell hydraulics. J Cell Sci 122, 3233-3241 (2009). https://doi.org/10.1242/jcs.049262

      (4) Moeendarbary, E. et al. The cytoplasm of living cells behaves as a poroelastic material. Nat Mater 12, 253-261 (2013). https://doi.org/10.1038/nmat3517

      (5) Luby-Phelps, K., Castle, P. E., Taylor, D. L. & Lanni, F. Hindered diffusion of inert tracer particles in the cytoplasm of mouse 3T3 cells. Proc Natl Acad Sci U S A 84, 4910-4913 (1987). https://doi.org/10.1073/pnas.84.14.4910

      (6) Charras, G. T., Coughlin, M., Mitchison, T. J. & Mahadevan, L. Life and times of a cellular bleb. Biophys J 94, 1836-1853 (2008). https://doi.org/10.1529/biophysj.107.113605

      (7) Tinevez, J. Y. et al. Role of cortical tension in bleb growth. Proc Natl Acad Sci U S A 106, 18581-18586 (2009). https://doi.org/10.1073/pnas.0903353106

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Vision is a highly active process. Humans move their eyes 3-4 times per second to sample information with high visual acuity from our environment, and where eye movements are directed is critical to our understanding of active vision. Here, the authors propose that the cost of making a saccade contributes critically to saccade selection (i.e., whether and where to move the eyes). The authors build on their own recent work that the effort (as measured by pupil size) that comes with planning and generating an eye movement varies with saccade direction. To do this, the authors first measured pupil size for different saccade directions for each participant. They then correlated the variations in pupil size obtained in the mapping task with the saccade decision in a free-choice task. The authors observed a striking correlation: pupil size in the mapping task predicted the decision of where to move the eyes in the free choice task. In this study, the authors provide a number of additional insightful analyses (e.g., based on saccade curvature, and saccade latency) and experiments that further support their claim that the decision to move the eyes is influenced by the effort to move the eyes in a particular direction. One experiment showed that the same influence of assumed saccade costs on saccade selection is observed during visual search in natural scenes. Moreover, increasing the cognitive load by adding an auditory counting task reduced the number of saccades, and in particular reduced the costly saccades. In sum, these experiments form a nice package that convincingly establishes the association between pupil size and saccade selection.

      We thank the reviewer for highlighting the novelty and cogency of our findings.

      In my opinion, the causal structure underlying the observed results is not so clear. While the relationship between pupil size and saccade selection is compelling, it is not clear that saccade-related effort (i.e., the cost of a saccade) really drives saccade selection. Given the correlational nature of this relationship, there are other alternatives that could explain the finding. For example, saccade latency and the variance in landing positions also vary across saccade directions. This can be interpreted for instance that there are variations in oculomotor noise across saccade directions, and maybe the oculomotor system seeks to minimize that noise in a free-choice task. In fact, given such a correlational result, many other alternative mechanisms are possible. While I think the authors' approach of systematically exploring what we can learn about saccade selection using pupil size is interesting, it would be important to know what exactly pupil size can add that was not previously known by simply analyzing saccade latency. For example, saccade latency anisotropies across saccade directions are well known, and the authors also show here that saccade costs are related to saccade latency. An important question would be to compare how pupil size and saccade latency uniquely contribute to saccade selection. That is, the authors could apply the exact same logic to their analysis by first determining how saccade latencies (or variations in saccade landing positions; see Greenwood et al., 2017 PNAS) vary across saccade directions and how this saccade latency map explains saccade selection in subsequent tasks. Is it more advantageous to use one or the other saccade metric, and how well does a saccade latency map correlate with a pupil size map?

      We thank the reviewer for the detailed comment. The reviewer raises two points: 1) many of our results are correlational in nature, and 2) saccade latencies and landing precision may also predict saccade selection, in which case they could be considered alternative explanations to the idea that effort drives saccade selection. The reviewer further asks what pupil size can add to what can be learned from saccade latency.

      In brief, although we report a combination of correlational and causal findings, we do not know of a more parsimonious explanation for our findings than “effort drives saccade selection”. Moreover, we demonstrate that oculomotor noise cannot be construed as an alternative explanation for our findings.

      (1) Correlational nature of many findings.

      We acknowledge that many of our findings are predominantly correlational in nature. In our first tasks, we correlated pupil size during saccade planning with saccade preferences in a subsequent task. Although the link across tasks was correlational, the observed relationship clearly followed our previously specified directed hypothesis. Moreover, experiments 1 and 2 of the visual search data replicated and extended this relationship. We also directly manipulated cognitive demand in the second visual search experiment. In line with the hypothesis that effort affects saccade selection, participants executed fewer saccades overall when performing a (primary) auditory dual task, and cut the costly saccades most, which constitutes causal evidence for our hypothesis. A minimal oculomotor noise account would not directly predict a reduction in saccade rate under higher cognitive demand. To summarize, we have a combination of correlational and causal findings, although mediators cannot be ruled out fully for the latter. That said, we do not know of a more fitting and parsimonious explanation for our findings than effort predicting saccade selection (see following points for saccade latencies). We now address causality in the discussion for transparency and point more explicitly to the second visual search experiment for causal evidence.

      “We report a combination of correlational and causal findings. Despite the correlational nature of some of our results, they consistently support the hypothesis that saccade costs predict saccade selection [which we predicted previously, 33]. Causal evidence was provided by the dual-task experiment, as saccade frequencies, and especially costly saccades, were reduced under additional cognitive demand. Only a cost account predicts 1) a link between pupil size and saccade preferences, 2) a cardinal saccade bias, 3) reduced saccade frequency under additional cognitive demand, and 4) disproportional cutting of especially those directions associated with more pupil dilation. Together, our findings converge upon the conclusion that effort drives saccade selection.”

      (2) Do anisotropies in saccade latencies constitute an alternative explanation?

      First of all, we would like to stress that differences in saccade latencies are indeed thought to reflect oculomotor effort (Shadmehr et al., 2019; TINS). For example, saccades with larger amplitudes and saccades where distractors need to be ignored are associated with longer latencies. Therefore, even if saccade latencies predicted saccade selection, this would not contradict the idea that effort drives saccade selection. Instead, this would provide convergent evidence for our main novel conclusion: effort drives saccade selection. There are several reasons why pupil size can be used as a more general marker of effort (see responses to R2), but ultimately, our conclusions do not hinge on the employed measure of effort per se. As stressed above in 1), we see no equally parsimonious explanation besides the cost account. Moreover, we predicted this relationship in our previous publication before running the currently reported experiments and analyses (Koevoet et al., 2023). That said, we are open to discussing further alternative accounts and look forward to testing them against each other in future work; we welcome the reviewers’ (and readers’) suggestions.

      We now discuss this in the manuscript as follows:

      “We here measured cost as the degree of effort-linked pupil dilation. In addition to pupil size, other markers may also indicate saccade costs. For example, saccade latency has been proposed to index oculomotor effort [100], whereby saccades with longer latencies are associated with more oculomotor effort. This makes saccade latency a possible complementary marker of saccade costs (also see Supplementary Materials). Although relatively sluggish, pupil size is a valuable measure of attentional costs for (at least) two reasons. First, pupil size is highly established as a marker of effort, and is sensitive to effort more broadly than only in the context of saccades [36–45, 48]. Pupil size therefore allows capturing not only the costs of saccades, but also of covert attentional shifts [33], or shifts with other effectors such as head or arm movements [54, 101]. Second, as we have demonstrated, pupil size can measure saccade costs even when searching in natural scenes (Figure 4). During natural viewing, it is difficult to disentangle fixation duration from saccade latencies, complicating the use of saccade latency as a measure of saccade cost.

      Together, pupil size, saccade latency, and potential other markers of saccade cost could fulfill complementary roles in studying the role of cost in saccade selection.”

      Second, we followed the reviewer’s recommendation in testing whether other oculomotor metrics would predict saccade selection. To this end, we conducted a linear regression across directions. We calculated pupil size, saccade latency, landing precision and peak velocity maps from the saccade planning task. We then used AIC-based backward model selection to determine which combination of factors predicted saccade selection best. The best model included pupil size, latency and landing precision as predictors (Wilkinson notation: saccade preferences ~ pupil size + saccade latency + landing precision). Pupil size (b = -42.853, t = 4.791, p < .001) and saccade latency (b = -.377, t = 2.106, p = .043; see Author response image 1) predicted saccade preferences significantly. In contrast, landing precision did not reach significance (b = 23.631, t = 1.675, p = .104). This analysis shows that although saccade latency also predicts saccade preferences, pupil size remains a robust predictor of saccade selection. These findings demonstrate that minimizing oculomotor noise cannot fully explain the pattern of results.

      Author response image 1.

      The relationship between saccade latency (from the saccade planning task) and saccade preferences averaged across participants. Individual points reflect directions and shading represents bootstrapped 95% confidence intervals.

      We have added this argument into the manuscript, and discuss the analysis in the discussion. Details of the analysis have been added to the Supporting Information for transparency and further detail.

      “A control analysis ruled out that the correlation between pupil size and saccade preferences was driven by other oculomotor metrics such as saccade latency and landing precision (see Supporting Information).”

      “To ascertain whether pupil size or other oculomotor metrics predict saccade preferences, we conducted a multiple regression analysis. We calculated average pupil size, saccade latency, landing precision and peak velocity maps across all 36 directions. The model, determined using AIC-based backward selection, included pupil size, latency and landing precision as predictors (Wilkinson notation: saccade preferences ~ pupil size + saccade latency + landing precision). The analysis revealed that pupil size (β = -42.853, t = 4.791, p < .001) and saccade latency (β = -.377, t = 2.106, p = .043) predicted saccade preferences. Landing precision did not reach significance (β = 23.631, t = 1.675, p = .104). Together, this demonstrates that although other oculomotor metrics such as saccade latency contribute to saccade selection, pupil size remains a robust marker of saccade selection.”
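      The selection procedure described above can be illustrated with a minimal sketch. It is not the authors' analysis code: the data are synthetic, the function names (`ols_rss`, `aic`, `backward_select`) and the predictor labels are hypothetical, and the AIC is computed in the common Gaussian form n·ln(RSS/n) + 2k, which is assumed here rather than confirmed by the manuscript.

```python
import math
import random


def ols_rss(X, y):
    """Residual sum of squares of a least-squares fit, via the normal equations."""
    n, k = len(X), len(X[0])
    # Augmented system [X'X | X'y].
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((y[r] - sum(b * x for b, x in zip(beta, X[r]))) ** 2
               for r in range(n))


def aic(data, y, predictors):
    """Gaussian AIC up to an additive constant (intercept always included)."""
    n = len(y)
    X = [[1.0] + [data[p][r] for p in predictors] for r in range(n)]
    return n * math.log(ols_rss(X, y) / n) + 2 * (len(predictors) + 1)


def backward_select(data, y):
    """Start from the full model; drop one predictor at a time while AIC improves."""
    current = list(data)
    best = aic(data, y, current)
    while current:
        scores = {p: aic(data, y, [q for q in current if q != p]) for p in current}
        drop = min(scores, key=scores.get)
        if scores[drop] >= best:
            break  # no single removal lowers the AIC any further
        best = scores[drop]
        current.remove(drop)
    return current, best


# Synthetic example (NOT the study data): one value per direction, 36 directions.
rng = random.Random(0)
n_directions = 36
data = {name: [rng.gauss(0, 1) for _ in range(n_directions)]
        for name in ("pupil_size", "latency", "landing_precision", "peak_velocity")}
preference = [-2.0 * data["pupil_size"][i] - 1.0 * data["latency"][i]
              + rng.gauss(0, 0.5) for i in range(n_directions)]

selected, selected_aic = backward_select(data, preference)
print(selected, round(selected_aic, 2))
```

      In this toy setting the two predictors that actually generate the synthetic preferences survive the selection, whereas uninformative predictors are candidates for removal. With real data, which predictors are retained depends on how much unique variance each explains.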

      In addition to eye-movement-related anisotropies across the visual field, there are of course many studies reporting visual field anisotropies (see Himmelberg, Winawer & Carrasco, 2023, Trends in Neuroscience for a review). It would be interesting to understand how the authors think about visual field anisotropies in the context of their own study. Do they think that their results are (in)dependent on such visual field variations (see Greenwood et al., 2017, PNAS; Ohl, Kroell, & Rolfs, 2024, JEP:Gen for a similar discussion)?

      We agree that it is interesting to discuss established visual field anisotropies in the context of our own results. At the reviewer’s suggestion, we have now expanded this discussion.

      The observed anisotropies in terms of saccade costs are likely related to established anisotropies in perception and early visual cortex. However, the exact way that these anisotropies may be linked remains elusive (i.e. what is cause, what is effect, are links causal?), and more research is necessary to understand how these are related.

      “The observed differences in saccade costs across directions could be linked to established anisotropies in perception [80–86], attention [87–92], saccade characteristics [87, 88, 92, 93], and (early) visual cortex [94–98] [also see 99]. For example, downward saccades are more costly than upward saccades, which mimics a similar asymmetry in early visual areas wherein the upper visual field is relatively underrepresented [94–98]; similarly stronger presaccadic benefits are found for down- compared with upward saccades [87, 88]. Moreover, upward saccades are more precise than downward saccades [93]. Future work should elucidate where saccade cost or the aforementioned anisotropies originate from and how they are related - something that pupil size alone cannot address.”

      We also added that the finding that more precise saccades are coupled with worse performance in a crowding task might be attributed to the increased effort associated with more precise saccades (Greenwood et al., 2017).

      “Adaptive resource allocation from, and to the oculomotor system parsimoniously explains a number of empirical observations. For example, higher cognitive demand is accompanied by smooth pursuits deviating more from to-be-tracked targets [137], reduced (micro)saccade frequencies [Figure 4; 63, 64, 138, 139], and slower peak saccade velocities [140–142]. Relatedly, more precise saccades are accompanied by worse performance in a crowding task [93].”

      Finally, the authors conclude that their results "suggests that the eye-movement system and other cognitive operations consume similar resources that are flexibly allocated among each other as cognitive demand changes". The authors should speculate what these similar resources could mean. What are the specific operations of the auditory task that overlap in terms of resources with the eye movement system?

      We agree that the nature of joint resources is an interesting question. Our previous discussion was likely too simplistic here (see also responses to R3). We here specifically refer to the cognitive resources that one can flexibly distribute between tasks.

      Our data do not directly speak to the question of what the shared resources between the auditory and oculomotor tasks are. Nevertheless, both tasks tax working memory, as saccade targets are mandatorily encoded into working memory prior to saccade onset (Van der Stigchel & Hollingworth, 2018), and the counting task clearly engages working memory. This may indicate some domain-generality between visual and auditory working memory during natural viewing (see Nozari & Martin, 2024 for a recent review), but this remains speculative. Another possibility is that it is not the working memory encoding associated with saccades per se, but the execution of overt motor actions itself, that requires cognitive processing, as suggested by Beatty (1982): “the organization of an overt motor act places additional demands on information-processing resources that are reflected in the task-evoked pupillary response”.

      We have elaborated on this in the results and discussion sections.

      “Besides the costs of increased neural activity when exerting more effort, effort should be considered costly for a second reason: Cognitive resources are limited. Therefore, any unnecessary resource expenditure reduces cognitive and behavioral flexibility [22, 31, 36, 116]. As a result, the brain needs to distribute resources between cognitive operations and the oculomotor system. We found evidence for the idea that such resource distribution is adaptive to the general level of cognitive demand and available resources: Increasing cognitive demand through an additional primary auditory dual task led to a lower saccade frequency, and especially costly saccades were cut. In this case, it is important to consider that the auditory task was the primary task, which should cause participants to distribute resources from the oculomotor system to the counting task. In other situations, more resources could be distributed to the oculomotor system instead, for example to discover new sources of reward [22, 136]. Adaptive resource allocation from, and to the oculomotor system parsimoniously explains a number of empirical observations. For example, higher cognitive demand is accompanied by smooth pursuits deviating more from to-be-tracked targets [137], reduced (micro)saccade frequencies [Figure 4; 63, 64, 138, 139], and slower peak saccade velocities [140–142]. Relatedly, more precise saccades are accompanied by worse performance in a crowding task [93]. Furthermore, it has been proposed that saccade costs are weighed against other cognitive operations such as using working memory [33, 143–146]. How would the resources between the oculomotor system and cognitive tasks (like the auditory counting task) be related? One possibility is that both consume from limited working memory resources [147, 148]. Saccades are thought to encode target objects in a mandatory fashion into (visual) working memory [79], and the counting task requires participants to keep track of the auditory stream and maintain count of the instructed digit in working memory. However, the exact nature of which resources overlap between tasks remains open for future investigation [also see 149]. Together, we propose that cognitive resources are flexibly (dis)allocated to and from the oculomotor system based on the current demands to establish an optimal balance between performance and cost minimization.”

      Reviewer #2 (Public Review):

      The authors attempt to establish presaccadic pupil size as an index of 'saccade effort' and propose this index as one new predictor of saccade target selection. They only partially achieved their aim: When choosing between two saccade directions, the less costly direction, according to preceding pupil size, is preferred. However, the claim that with increased cognitive demand participants would especially cut costly directions is not supported by the data. I would have expected to see a negative correlation between saccade effort and saccade direction 'change' under increased load. Yet participants mostly cut upwards saccades, but not other directions that, according to pupil size, are equally or even more costly (e.g. oblique saccades).

      Strengths:

      The paper is well-written, easy to understand, and nicely illustrated.

      The sample size seems appropriate, and the data were collected and analyzed using solid and validated methodology.

      Overall, I find the topic of investigating factors that drive saccade choices highly interesting and relevant.

      We thank the reviewer for pointing out the strengths of our paper.

      Weaknesses:

      The authors obtain pupil size and saccade preference measures in two separate tasks. Relating these two measures is problematic because the computations that underlie saccade preparation differ. In Experiment 1, the saccade is cued centrally, and has to be delayed until a "go-signal" is presented; In Experiment 2, an immediate saccade is executed to an exogenously cued peripheral target. The 'costs' in Experiment 1 (computing the saccade target location from a central cue; withholding the saccade) do not relate to Experiment 2. It is unfortunate that measuring presaccadic pupil size directly in the comparatively more 'natural' Experiment 2 (where saccades did not have to be artificially withheld) does not seem to be possible. This questions the practical application of pupil size as an index of saccade effort.

      This is an important point raised by the reviewer and we agree that a discussion on these points improves the manuscript. We reply in two parts: 1) Although we agree that the underlying computations during saccade preparation likely differ between tasks, we can still predict saccade selection between tasks (saccade planning to saccade preference) and within tasks (visual search). 2) Pupil size is a sluggish physiological signal, but this is outweighed by its advantages over saccade latencies as a general marker of effort, including in the context of visual selection.

      (1) Are delayed saccades (cost task) and the much faster saccades (preference task) linked?

      As the reviewer notes, the underlying ‘type’ of oculomotor program may differ between voluntarily delayed saccades and those in the saccade preference task. There is, however, also considerable overlap between the oculomotor programs, as the directions and amplitudes are identical. Moreover, the different types of saccades have considerable overlap in their underlying neural circuitry. Nevertheless, the underlying oculomotor programs likely still differ in some regard. Despite these differences, we were able to measure differences across directions in both tasks, and costs and preferences were negatively and highly correlated between tasks. The finding itself therefore indicates that the costs of saccades measured during the saccade planning task generalize to those in the saccade preference task. Note also that we predicted this finding already in a previous publication before starting the present study (Koevoet et al., 2023).

      We now address this interesting point in the discussion as follows:

      “We observed that affordable saccades were preferred over costly ones. This is especially remarkable given that the delayed saccades in the planning task likely differ in their oculomotor program from the immediate saccades in the preference task in some regard.”

      (2) Is pupil size a sensible measure of saccade effort?

      As the reviewer points out, the pupillary signal is indeed relatively sluggish, and relatively slow and more artificial tasks are therefore preferred to quantify saccade costs. This does not preclude pupil size from being applied in more natural settings, as we demonstrate in the search experiments, but great care has to be taken to control for possible confounding factors, and many trials are needed.

      That said, as saccade latencies may also capture differences in oculomotor effort (Shadmehr et al., 2019) they are a possible alternative option to assess effort in some oculomotor tasks (see below on why saccade latencies do not provide evidence for an alternative to effort driving saccade selection, but converging evidence). Whilst we do maintain that pupil size is an established and versatile physiological marker of effort, saccade latencies provide converging evidence for our conclusion that effort drives saccade selection.

      As for the saccade preference task, we are not able to analyze the data in a similar manner as in the visual search task for two reasons. First, the number of saccades is much lower than in the natural search experiments. Second, in the saccade preference task, there were always two possible saccade targets. Therefore, even if we were able to isolate an effort signal, this signal could index a multitude of factors such as deciding between two possible saccade targets. Even simple binary decisions go hand in hand with reliable pupil dilations as they require effort (e.g. de Gee et al., 2014).

      There are three major reasons why pupil size is a more versatile marker of saccade costs than saccade latencies (although as mentioned, latencies may constitute another valuable tool to study oculomotor effort). First, pupil size is able to quantify the cost of attentional shifts more generally, including covert attention as well as other effector systems such as head and hand movements. This circumvents the issue of different latencies of different effector systems and also allows studying attentional processes that are not associated with overt motor movements. Second, saccade latencies are difficult to interpret in natural viewing data, as fixation duration and saccade latencies are inherently confounded by one another. This makes it very difficult to separate oculomotor processes and the extraction of perceptual information from a fixated target. Thus, pupil size is a versatile marker of attentional costs in a variety of settings, and can measure costs that saccade latencies cannot (i.e. covert attention). Lastly, pupil size is highly established as a marker of effort, which has been demonstrated across a wide range of cognitive tasks, and is therefore not bound to eye movements alone (Bumke, 1911; Koevoet et al., 2024; Laeng et al., 2012; Loewenfeld, 1958; Mathôt, 2018; Robison & Unsworth, 2019; Sirois & Brisson, 2014; Strauch et al., 2022; van der Wel & van Steenbergen, 2018).

      We now discuss this as follows:

      “We here measured cost as the degree of effort-linked pupil dilation. In addition to pupil size, other markers may also indicate saccade costs. For example, saccade latency has been proposed to index oculomotor effort [100], whereby saccades with longer latencies are associated with more oculomotor effort. This makes saccade latency a possible complementary marker of saccade costs (also see Supplementary Materials). Although relatively sluggish, pupil size is a valuable measure of attentional costs for (at least) two reasons. First, pupil size is highly established as a marker of effort, and is sensitive to effort more broadly than only in the context of saccades [36–45, 48]. Pupil size therefore allows capturing not only the costs of saccades, but also of covert attentional shifts [33], or shifts with other effectors such as head or arm movements [54, 101]. Second, as we have demonstrated, pupil size can measure saccade costs even when searching in natural scenes (Figure 4). During natural viewing, it is difficult to disentangle fixation duration from saccade latencies, complicating the use of saccade latency as a measure of saccade cost. Together, pupil size, saccade latency, and potential other markers of saccade cost could fulfill complementary roles in studying the role of cost in saccade selection.”

      The authors claim that the observed direction-specific 'saccade costs' obtained in Experiment 1 "were not mediated by differences in saccade properties, such as duration, amplitude, peak velocity, and landing precision (Figure 1e,f)". Saccade latency, however, was not taken into account here but is discussed for Experiment 2.

      The final model that was used to test for the observed anisotropies in pupil size across directions indeed did not include saccade latencies as a predictor. However, we did originally consider saccade latency as a potential predictor. During AIC-based backward model selection, this predictor was removed because saccade latency contributed only marginally beyond the other predictors explaining pupil size.

      For completeness, we here report the outcome of a linear mixed-effects model that does include saccade latency as a predictor. Here, saccade latencies did not predict pupil size (b = 1.859e-03, t = .138, p = .889). The asymmetry effects remained qualitatively unchanged: preparing oblique compared with cardinal saccades resulted in a larger pupil size (b = 7.635, t = 3.969, p < .001), and preparing downward compared with upward saccades also led to a larger pupil size (b = 3.344, t = 3.334, p = .003).

      The apparent similarity of saccade latencies and pupil size, however, is striking. Previous work shows shorter latencies for cardinal than oblique saccades, and shorter latencies for horizontal and upward saccades than downward saccades - directly reflecting the pupil sizes obtained in Experiment 1 as well as in the authors' previous study (Koevoet et al., 2023, PsychScience).

      As the reviewer notes, there are substantial asymmetries across the visual field in saccade latencies. These asymmetries in saccade latency could also predict saccade preferences. We reply to this in three points: 1) even if saccade latency is a predictor of saccade preferences, this would not constitute an alternative explanation to the conclusion that effort drives saccade selection, 2) saccade latencies show an up-down asymmetry, but oblique-cardinal effects in latency may not be generalizable across saccade tasks, 3) pupil size remains a robust predictor of saccade preferences even when saccade latencies are considered as a predictor of saccade preferences.

      (1) We first want to stress that saccade latencies are thought to reflect oculomotor effort (Shadmehr et al., 2019). For example, saccades with larger amplitudes and saccades where distractors need to be ignored are associated with longer latencies. Therefore, even if saccade latencies predict saccade selection, this would not contradict the idea that effort drives saccade selection. Instead, this would provide convergent evidence for our main conclusion: effort predicting saccade selection (rather than pupil size predicting saccade selection per se).

      “We here measured cost as the degree of effort-linked pupil dilation. In addition to pupil size, other markers may also indicate saccade costs. For example, saccade latency has been proposed to index oculomotor effort [100], whereby saccades with longer latencies are associated with more oculomotor effort. This makes saccade latency a possible complementary marker of saccade costs (also see Supplementary Materials). Although relatively sluggish, pupil size is a valuable measure of attentional costs for (at least) two reasons. First, pupil size is highly established as a marker of effort, and is sensitive to effort more broadly than only in the context of saccades [36–45, 48]. Pupil size therefore allows capturing not only the costs of saccades, but also of covert attentional shifts [33], or shifts with other effectors such as head or arm movements [54, 101]. Second, as we have demonstrated, pupil size can measure saccade costs even when searching in natural scenes (Figure 4). During natural viewing, it is difficult to disentangle fixation duration from saccade latencies, complicating the use of saccade latency as a measure of saccade cost. Together, pupil size, saccade latency, and potential other markers of saccade cost could fulfill complementary roles in studying the role of cost in saccade selection.”

      (2) We first tested anisotropies in saccade latency in the saccade planning task (Wilkinson notation: latency ~ obliqueness + updownness + leftrightness + saccade duration + saccade amplitude + saccade velocity + landing error + (1 + obliqueness + updownness | participant)). We found upward latencies to be shorter than downward saccade latencies (b = -.535, t = 3.421, p = .003). In addition, oblique saccades showed shorter latencies than cardinal saccades (b = -1.083, t = 3.096, p = .002), the opposite of what previous work has demonstrated.

      We then also tested these latency anisotropies in another dataset wherein participants (n = 20) saccaded toward a single peripheral target as fast as possible (Koevoet et al., submitted; same amplitude and eccentricity as in the present manuscript). There we did not find a difference in saccade latency between cardinal and oblique targets, but we did observe shorter latencies for up- compared with downward saccades. We are therefore not sure in which situations oblique saccades do, or do not differ from cardinal saccades in terms of latency, and even in which direction the effect occurs.

      In contrast, we have now demonstrated a larger pupil size prior to oblique compared with cardinal saccades in two experiments. This indicates that pupil size may be a more reliable and generalizable marker of saccade costs than saccade latency. However, this remains to be investigated further.

      (3) To gain further insight into which oculomotor metrics predict saccade selection, we conducted a linear regression across directions. We created pupil size, saccade latency, landing precision and peak velocity maps from the saccade planning task. We then used AIC-based model selection to identify the ‘best’ model, and thereby which factors predict saccade selection best. The selected model included pupil size, latency and landing precision as predictors (Wilkinson notation: saccade preferences ~ pupil size + saccade latency + landing precision). Pupil size (b = -42.853, t = 4.791, p < .001) and saccade latency (b = -.377, t = 2.106, p = .043) predicted saccade preferences significantly. In contrast, landing precision did not reach significance (b = 23.631, t = 1.675, p = .104). This analysis shows that although saccade latency predicts saccade preferences, pupil size remains a robust predictor of saccade selection.

      “To ascertain whether pupil size or other oculomotor metrics predict saccade preferences, we conducted a multiple regression analysis. We calculated average pupil size, saccade latency, landing precision and peak velocity maps across all 36 directions. The model, determined using AIC-based backward selection, included pupil size, latency and landing precision as predictors (Wilkinson notation: saccade preferences ~ pupil size + saccade latency + landing precision). The analysis revealed that pupil size (β = -42.853, t = 4.791, p < .001) and saccade latency (β = -.377, t = 2.106, p = .043) predicted saccade preferences. Landing precision did not reach significance (β = 23.631, t = 1.675, p = .104). Together, this demonstrates that although other oculomotor metrics such as saccade latency contribute to saccade selection, pupil size remains a robust marker of saccade selection.”
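
      The AIC-based backward selection described above can be sketched as follows. This is a minimal illustration, not the analysis code used for the manuscript: the helper names (`ols_rss`, `aic`, `backward_select`), the Gaussian AIC up to an additive constant, the stopping rule, and the randomly generated direction maps are all assumptions made for this example.

```python
import math
import random


def ols_rss(columns, y):
    """Residual sum of squares of an OLS fit of y on the columns plus an intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting (fails on collinear columns).
    for k in range(p):
        pivot = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[pivot] = A[pivot], A[k]
        for r in range(p):
            if r != k:
                f = A[r][k] / A[k][k]
                A[r] = [a - f * b for a, b in zip(A[r], A[k])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2 for i in range(n))


def aic(columns, y):
    """Gaussian AIC up to an additive constant: n * ln(RSS / n) + 2k."""
    n = len(y)
    k = len(columns) + 2  # slopes + intercept + residual variance
    return n * math.log(ols_rss(columns, y) / n) + 2 * k


def backward_select(predictors, y):
    """Drop one predictor at a time for as long as doing so lowers the AIC."""
    kept = dict(predictors)
    best = aic(list(kept.values()), y)
    while kept:
        trials = {
            name: aic([col for other, col in kept.items() if other != name], y)
            for name in kept
        }
        drop = min(trials, key=trials.get)
        if trials[drop] >= best:
            break  # no single removal improves the model
        best = trials[drop]
        del kept[drop]
    return list(kept), best


# Hypothetical maps over 36 directions (NOT the study's data).
rng = random.Random(0)
maps = {name: [rng.gauss(0, 1) for _ in range(36)]
        for name in ("pupil", "latency", "precision", "velocity")}
preference = [-2.0 * p - 0.5 * l + rng.gauss(0, 1)
              for p, l in zip(maps["pupil"], maps["latency"])]
kept, score = backward_select(maps, preference)
print(kept, round(score, 2))
```

      Backward selection starts from the full model, so a predictor is removed only when the fit does not justify its extra parameter; a forward or exhaustive search could select a different model on the same data.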

      The authors state that "from a costs-perspective, it should be efficient to not only adjust the number of saccades (non-specific), but also by cutting especially expensive directions the most (specific)". However, saccade targets should be selected based on the maximum expected information gain. If cognitive load increases (due to an additional task) an effective strategy seems to be to perform less - but still meaningful - saccades. How would it help natural orienting to selectively cut saccades in certain (effortful) directions? Choosing saccade targets based on comfort, over information gain, would result in overall more saccades to be made - which is non-optimal, also from a cost perspective.

      We thank the reviewer for this comment. Although we do not fully agree, the logic is quite close to our rationale and it is worth adding a point of discussion here. A vital part of the current interpretation is the instruction given to participants. In our second natural visual search task, participants performed a dual task in which the auditory task was primary and the search task was secondary. Participants are therefore likely to allocate their resources to optimize performance on the primary task, at the expense of the secondary task. Consequently, fewer resources are available for search in the dual than in the single task, because these resources are needed for the auditory task. Cutting expensive directions does not help search performance, but it does reduce the cost of search, so that more resources are available for the prioritized auditory task. Also note that the search task was rather difficult: participants completed it, but it was tough (see the original description of the dataset for more details), which provides another reason to go all in on the auditory task at the expense of the visual task. This, however, opens up a nice point of discussion: if one were to emphasize the importance of search (for example with punishment or reward), we would indeed expect participants to make whichever eye movements get them to their goal fastest, thus reducing the relative influence of costs on saccade behavior. This remains to be tested, however; we are working on this and look forward to discussing such findings in the future.

      Together, we propose that there is a trade-off between distributing resources either towards cognitive tasks or the oculomotor system (also see Ballard et al., 1995; Van der Stigchel, 2020). How these resources are distributed depends highly on the current task demands (also see Sahakian et al., 2023). This allows for adaptive behavior in a wide range of contexts.

      We now added these considerations to the manuscript as follows (also see our previous replies):

      “Do cognitive operations and eye movements consume from a similar pool of resources [44]? If so, increasing cognitive demand for non-oculomotor processes should result in decreasing available resources for the oculomotor system. In line with this idea, previous work indeed shows altered eye-movement behavior under effort as induced by dual tasks, for example by making less saccades under increased cognitive demand [62–64]. We therefore investigated whether less saccades were made as soon as participants had to count the occurrence of a specific digit in the auditory number stream in comparison to ignoring the stream (in Exp. 2; Figure 4a). Participants were instructed to prioritize the auditory digit-counting task over finding the visual search target. Therefore, resources should be shifted from the oculomotor system to the primary auditory counting task. The additional cognitive demand of the dual task indeed led to a decreased saccade frequency (t(24) = 7.224, p < .001, Cohen’s d = 1.445; Figure 4h).”
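
      As a side note on the statistics quoted above: for a paired comparison, the standardized effect size d_z and the t statistic are related by d_z = t / sqrt(n). The sketch below illustrates this with made-up saccade rates. It assumes that the reported Cohen's d is the paired d_z (the quoted passage does not say which variant was used); the function name and data are hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t_and_dz(a, b):
    """Return (t, d_z, df) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    sd = stdev(diffs)          # sample SD of the differences (n - 1 denominator)
    d_z = mean(diffs) / sd     # standardized mean difference
    t = d_z * math.sqrt(n)     # identical to mean / (sd / sqrt(n))
    return t, d_z, n - 1


# Hypothetical saccade rates (saccades per second), NOT the study's data.
single = [3.1, 2.9, 3.4, 3.0, 3.3, 2.8]
dual = [2.6, 2.7, 2.9, 2.8, 2.7, 2.5]
t, d_z, df = paired_t_and_dz(single, dual)
print(f"t({df}) = {t:.3f}, d_z = {d_z:.3f}")

# Consistency check on the reported values, with n = 25 participants (df = 24).
print(f"{7.224 / math.sqrt(25):.3f}")  # prints 1.445
```

      The last line reproduces the reported effect size from the reported t value, which is consistent with (but does not prove) the assumption that d_z was reported.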

      I would have expected to see a negative correlation between saccade effort and saccade direction 'change' under increased load. Yet participants mostly cut upwards saccades, but not other directions that, according to pupil size, are equally or even more costly (e.g. oblique saccades).

      The reviewer’s point is taken from the initial comment, which we will address here. First, we’d like to point out that it is not established that saccade costs in different directions are always the same. Instead, it is possible that saccade costs differ between natural viewing and our delayed-saccade task. Therefore, we used pupil size during natural viewing for the search experiments. Second, the reviewer correctly notes that oblique saccades are hardly cut under additional cognitive demand. However, participants already hardly execute oblique saccades when not confronted with the additional auditory task (Figure 4b, d), making it difficult to reduce those further (i.e. a floor effect). Participants chose to cut vertical saccades, possibly because these are more costly than horizontal saccades.

      We incorporated these points in our manuscript as follows:

      “To test this, we analyzed data from two existing datasets [63] wherein participants (total n = 41) searched for small targets (’Z’ or ’H’) in natural scenes (Figure 4a; [64]). Again, we tested whether pupil size prior to saccades negatively linked with saccade preferences across directions. Because saccade costs and preferences across directions could differ for different situations (i.e. natural viewing vs. saccade preference task), but should always be negatively linked, we established both cost and preferences independently in each dataset.”

      “We calculated a saccade-adjustment map (Figure 4g) by subtracting the saccade preference map in the single task (Figure 4f) from the dual task map (Figure 4d). Participants seemingly cut vertical saccades in particular, and made more saccades to the top right direction. This pattern may have emerged as vertical saccades are more costly than horizontal saccades (also see Figure 1d). Oblique saccades may not have been cut because there were very little oblique saccades in the single condition to begin with (Figure 4d), making it difficult to observe a further reduction of such saccades under additional cognitive demand (i.e. a floor effect).”
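
      The subtraction described in the passage above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: preferences are taken to be the proportion of saccades per direction bin, bins are 10 degrees wide and start at 0 degrees, and the function names and example directions are hypothetical.

```python
from collections import Counter

N_BINS = 36  # 36 direction bins of 10 degrees each


def preference_map(directions_deg, n_bins=N_BINS):
    """Return the proportion of saccades that fall in each direction bin."""
    if not directions_deg:
        raise ValueError("no saccades given")
    width = 360 / n_bins
    # The final modulo guards against float edge cases that land exactly on 360.
    counts = Counter(int((d % 360) // width) % n_bins for d in directions_deg)
    total = len(directions_deg)
    return [counts.get(i, 0) / total for i in range(n_bins)]


def adjustment_map(single_deg, dual_deg, n_bins=N_BINS):
    """Dual-task minus single-task preference for each direction bin."""
    single = preference_map(single_deg, n_bins)
    dual = preference_map(dual_deg, n_bins)
    return [d - s for d, s in zip(dual, single)]


# Hypothetical saccade directions in degrees (NOT the study's data).
single_task = [0, 90, 90, 180, 270, 0, 95, 185]
dual_task = [0, 0, 180, 180, 45, 0, 5, 175]
adjustment = adjustment_map(single_task, dual_task)
for i, change in enumerate(adjustment):
    if change:
        print(f"{i * 10:3d} deg: {change:+.3f}")
```

      Because both maps are proportions that sum to one, the adjustment map sums to zero: negative bins mark directions that were cut under the dual task, positive bins mark directions that gained relative share.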

      Overall, I am not sure what practical relevance the relation between pupil size (measured in a separate experiment) and saccade decisions has for eye movement research/vision science. Pupil size does not seem to be a straightforward measure of saccade effort. Saccade latency, instead, can be easily extracted in any eye movement experiment (no need to conduct a separate, delayed saccade task to measure pupil dilation), and seems to be an equally good index.

      There are two points here.

      (1) What is the practical relevance of a link between effort and saccade selection for eye-movement research and vision science?

      We see plenty – think of changing eye movement patterns under effort (be it smooth pursuits, saccade rates, distributions of gaze positions to images etc.), which have substantial implications for human factors research, but also neuropsychology. With a cost account, one may predict (rather than just observe) how eye movements change as soon as resources are reduced or non-visual demand increases. With a cost account, we can parsimoniously explain effects (e.g. lower saccade rates under effort, cardinal bias, perhaps also central bias) that cannot be explained by what is so far referred to as the three core drivers of eye movement behavior (saliency, selection history, goals, e.g., Awh et al., 2012). Conversely, one must wonder why eye-movement research/vision science simply accepts/dismisses these phenomena as such, without seeking overarching explanations.

      (2) What is the usefulness of using pupil size to measure effort?

      We hope that our replies to the comments above illustrate why pupil size is a sensible, robust and versatile marker of attentional costs. We briefly summarize our most important points here.

      - Pupil size is an established measure of effort irrespective of context, as demonstrated by hundreds of original works (e.g. working memory load, multiple object tracking, individual differences in cognitive ability). This allows pupil size to be a versatile marker of the effort, and therefore costs, of non-saccadic attentional shifts such as covert attention or those realized by other effector systems (i.e. head or hand movements).

      - Our new analysis indicates that pupil size remains a strong and robust predictor of saccade preference, even when considering saccade latency.

      - Pupil size allows saccade costs to be studied in natural viewing. In contrast, saccade latencies are difficult to assess in natural viewing as fixation durations and saccade latencies are intrinsically linked and very difficult to disentangle.

      - Note, however, that we think it is interesting and useful to study the effects of effort/cost on eye movement behavior, whichever index is used to do so. We see plenty of potential in this line of research, and this paper is a starting point.

      Reviewer #3 (Public Review):

      This manuscript extends previous research by this group by relating variation in pupil size to the endpoints of saccades produced by human participants under various conditions including trial-based choices between pairs of spots and search for small items in natural scenes. Based on the premise that pupil size is a reliable proxy of "effort", the authors conclude that less costly saccade targets are preferred. Finding that this preference was influenced by the performance of a non-visual, attention-demanding task, the authors conclude that a common source of effort animates gaze behavior and other cognitive tasks.

      Strengths:

      Strengths of the manuscript include the novelty of the approach, the clarity of the findings, and the community interest in the problem.

      We thank the reviewer for pointing out the strengths of our paper.

      Weaknesses:

      Enthusiasm for this manuscript is reduced by the following weaknesses:

      (1) A relationship between pupil size and saccade production seems clear based on the authors' previous and current work. What is at issue is the interpretation. The authors test one, preferred hypothesis, and the narrative of the manuscript treats the hypothesis that pupil size is a proxy of effort as beyond dispute or question. The stated elements of their argument seem to go like this:

      PROPOSITION 1: Pupil size varies systematically across task conditions, being larger when tasks are more demanding.

      PROPOSITION 2: Pupil size is related to the locus coeruleus.

      PROPOSITION 3: The locus coeruleus NE system modulates neural activity and interactions.

      CONCLUSION: Therefore, pupil size indexes the resource demand or "effort" associated with task conditions.

      How the conclusion follows from the propositions is not self-evident. Proposition 3, in particular, fails to establish the link that is supposed to lead to the conclusion.

      We inadvertently laid out this rationale as described above, and we thank the reviewer for pointing out this initial suboptimal structure of argumentation. The notion that the link between pupil size and effort is established in the literature because of its neural underpinnings is inaccurate. Instead, the tight link between effort and pupil size is established based on covariations of pupil diameter and cognition across a wide variety of tasks and domains. In line with this, we now introduce this tight link predominantly based on the relationships between pupil size and cognition instead of focusing on putative neural correlates of this relationship.

      As reviewed previously (Beatty, 1982; Bumke, 1911; Kahneman, 1973; Kahneman & Beatty, 1966; Koevoet et al., 2024; Laeng et al., 2012; Mathôt, 2018; Sirois & Brisson, 2014; Strauch et al., 2022; van der Wel & van Steenbergen, 2018), any increase in effort is consistently associated with an increase in pupil size. For instance, the pupil dilates when increasing load in working memory or multiple object tracking tasks, and such pupillary effects robustly explain individual differences in cognitive ability and fluctuations in performance across trials (Alnæs et al., 2014; Koevoet et al., 2024; Robison & Brewer, 2020; Robison & Unsworth, 2019; Unsworth & Miller, 2021). This extends to the planning of movements as pupil dilations are observed prior to the execution of (eye) movements (Koevoet et al., 2023; Richer & Beatty, 1985). The link between pupil size and effort has thus been firmly established for a long time, irrespective of the neural correlates of these effort-linked pupil size changes.

      We again thank the reviewer for spotting this logical mistake, and now revised the paragraph where we introduce pupil size as an established marker of effort as follows:

      “We recently demonstrated that the effort of saccade planning can be measured with pupil size, which allows for a physiological quantification of saccade costs as long as low-level visual factors are controlled for [33]. Pupil size is an established marker of effort [36–44]. For instance, loading more in working memory or tracking more objects results in stronger pupil dilation [44–52]. Pupil size not only reflects cognitive (or mental) effort but also the effort of planning and executing movements [37, 53, 54]. We leveraged this to demonstrate that saccade costs can be captured with pupil size, and are higher for oblique compared with cardinal directions [33]. Here, we addressed whether saccade costs predict where to saccade.”

      We now mention the neural correlates of pupil size only in the discussion, where we took care to also mention roles for other neurotransmitter systems:

      “Throughout this paper, we have used cost in the limited context of saccades. However, cost-based decision-making may be a more general property of the brain [31, 36, 114–116]. Every action, be it physical or cognitive, is associated with an intrinsic cost, and pupil size is likely a general marker of this [44]. Note, however, that pupil dilation does not always reflect cost, as the pupil dilates in response to many sensory and cognitive factors which should be controlled for, or at least considered, when interpreting pupillometric data [e.g., see 39, 40, 42, 117]. Effort-linked pupil dilations are thought to be, at least in part, driven by activity in the brainstem locus coeruleus (LC) [40, 118–120] [but other neurotransmitters also affect pupil size, e.g. 121, 122]. Activity in LC with its widespread connections throughout the brain [120, 123–127] is considered to be crucial for the communication within and between neural populations and modulates global neural gain [128–132]. Neural firing is costly [22, 133], and therefore LC activity and pupil size are (neuro)physiologically plausible markers of cost [40]. Tentative evidence even suggests that continued exertion of effort (accompanied by altered pupil dilation) is linked to the accumulation of glutamate in the lateral prefrontal cortex [134], which may be a metabolic marker of cost [also see 116, 134, 135].”

      (2) The authors test one, preferred hypothesis and do not consider plausible alternatives. Is "cost" the only conceivable hypothesis? The hypothesis is framed in very narrow terms. For example, the cholinergic and dopamine systems that have been featured in other researchers' consideration of pupil size modulation are missing here. Thus, because the authors do not rule out plausible alternative hypotheses, the logical structure of this manuscript can be criticized as committing the fallacy of affirming the consequent.

      As we have noted in the response to the reviewer’s first point, we did not motivate our use of pupil size as an index of effort clearly enough. For the current purpose, the neural correlates of pupil size are less relevant than the cognitive correlates (see previous point). We reiterate that the neuromodulatory underpinnings of the observed pupil size effects (which indeed possibly include effects of the cholinergic, dopaminergic and serotonergic systems), while interesting for the discussion on the neural origin of effects, are not crucial to our conclusion. We hope the new rationale (without focusing too much on the (irrelevant) exact neural underpinnings) convinces the reviewer and reader.

      Our changes to the manuscript are shown in our reply to the previous comment.

      The reviewer notes that other plausible hypotheses could explain the currently reported results. However, we did not find a more parsimonious explanation for our data than ‘Effort Drives Saccade Selection’. Effort explains why participants prefer saccading toward specific directions in (1) highly controlled and (2) more natural settings. Note that we also predicted this effect previously (Koevoet et al., 2023). Moreover, this account explains (3) why participants make fewer saccades under additional cognitive demand, and (4) why especially costly saccades are reduced under additional cognitive demand. We are very open to the reviewer presenting other possible interpretations of our data, so that these can be discussed and put to the test in future work.

      (3) The authors cite particular publications in support of the claim that saccade selection is influenced by an assessment of effort. Given the extensive work by others on this general topic, the skeptic could regard the theoretical perspective of this manuscript as too impoverished. Their work may be enhanced by consideration of other work on this general topic, e.g, (i) Shenhav A, Botvinick MM, Cohen JD. (2013) The expected value of control: an integrative theory of anterior cingulate cortex function. Neuron. 2013 Jul 24;79(2):217-40. (ii) Müller T, Husain M, Apps MAJ. (2022) Preferences for seeking effort or reward information bias the willingness to work. Sci Rep. 2022 Nov 14;12(1):19486. (iii) Bustamante LA, Oshinowo T, Lee JR, Tong E, Burton AR, Shenhav A, Cohen JD, Daw ND. (2023) Effort Foraging Task reveals a positive correlation between individual differences in the cost of cognitive and physical effort in humans. Proc Natl Acad Sci U S A. 2023 Dec 12;120(50):e2221510120.

      We thank the reviewer for pointing us toward this literature. These papers are indeed relevant for our manuscript, and we have now incorporated them. Specifically, we now discuss how the costs of effort are weighed in relation to possible rewards during decision-making. We have also incorporated work that has investigated how the biomechanical costs of arm movements contribute to action selection.

      “Our findings are in line with established effort-based models that assume costs to be weighed against rewards during decision-making [102–107]. In such studies, reward and cognitive/physical effort are often parametrically manipulated to assess how much effort participants are willing to exert to acquire a given (monetary) reward [e.g. 108, 109]. Whereas this line of work manipulated the extrinsic costs and/or rewards of decision options (e.g. perceptual consequences of saccades [110, 111] or consequences associated with decision options), we here focus on the intrinsic costs of the movement itself (in terms of cognitive and physical effort). Relatedly, the intrinsic costs of arm movements are also considered during decision-making: biomechanically affordable movements are generally preferred over more costly ones [26–28]. We here extend these findings in two important ways. First, until now, the intrinsic costs of saccades and other movements have been inferred from gaze behavior itself or by using computational modelling [23, 25–28, 34, 35, 112]. In contrast, we directly measured cost physiologically using pupil size. Secondly, we show that physiologically measured saccade costs predict where saccades are directed in a controlled binary preference task, and even during natural viewing. Our findings could unite state-of-the-art computational models [e.g. 23, 25, 34, 35, 113] with physiological data, to directly test the role of saccade costs and ultimately further our understanding of saccade selection.”

      (4) What is the source of cost in saccade production? What is the currency of that cost? The authors state (page 13), "... oblique saccades require more complex oculomotor programs than horizontal eye movements because more neuronal populations in the superior colliculus (SC) and frontal eye fields (FEF) [76-79], and more muscles are necessary to plan and execute the saccade [76, 80, 81]." This statement raises questions and concerns. First, the basis of the claim that more neurons in FEF and SC are needed for oblique versus cardinal saccades is not established in any of the publications cited. Second, the authors may be referring to the fact that oblique saccades require coordination between pontine and midbrain circuits. This must be clarified. Second, the cost is unlikely to originate in extraocular muscle fatigue because the muscle fibers are so different from skeletal muscles, being fundamentally less fatigable. Third, if net muscle contraction is the cost, then why are upward saccades, which require the eyelid, not more expensive than downward? Thus, just how some saccades are more effortful than others is not clear.

      Unfortunately, our current data do not allow for the specification of what the source is of differences in saccade production, nor what the currency is. We want to explicitly state that while pupil size is a sensitive measure of saccade costs, pupil size cannot directly inform what underlying mechanisms are causing differences in saccade costs across conditions (e.g. directions). Nevertheless, we do speculate about these issues because they are important to consider. We thank the reviewer for pointing out the shortcomings in our initial speculations.

      Broadly, we agree with the reviewer that a neural source of differences in costs between different types of saccades is more likely than a purely muscular account (also see Koevoet et al., 2023). Furthermore, we think that the observed differences in saccade costs for oblique vs. cardinal and up vs. down could be due to different underlying mechanisms. While we caution against overinterpreting single directions, tentative evidence for this may also be drawn from the different time courses of the effects for up/down versus cardinal/oblique (Figure 1c).

      Below we speculate about why some specific saccade directions may be more costly than others:

      Why would oblique saccades be more costly than cardinal saccades? We thank the reviewer for pointing out that oblique saccades additionally require coordination between pontine and midbrain circuits (Curthoys et al., 1984; King & Fuchs, 1979; Sparks, 2002). This point warrants a more thorough discussion than it received in our initial version. We have incorporated this as follows:

      “The complexity of an oculomotor program is arguably shaped by its neural underpinnings. For example, oblique but not cardinal saccades require communication between pontine and midbrain circuits [73–75]. Such differences in neural complexity may underlie the additional costs of oblique compared with cardinal saccades. Besides saccade direction, other properties of the ensuing saccade such as its speed, distance, curvature, and accuracy may contribute to a saccade’s total cost [22, 33, 53, 76, 77] but this remains to be investigated directly.”

      Why would downward saccades be more costly than upward saccades? As the reviewer points out: from a net muscular contraction account of cost, one would expect the opposite pattern due to the movement of the eyelid. Instead, we speculate that our findings may be associated with the well-established anisotropy in early visual cortex along the vertical meridian. Specifically, the upper vertical meridian is represented in substantially less detail than the lower vertical meridian (Himmelberg et al., 2023; Silva et al., 2018). Prior to a saccade, attention is deployed towards the intended saccadic endpoint (Deubel & Schneider, 1996; Kowler et al., 1995). Attention tunes neurons to preferentially process the attended location over non-attended locations. Because the lower visual field is represented in higher detail than the upper visual field, attention may tune neuronal responses differently when preparing up- compared with downward saccades (Hanning et al., 2024; Himmelberg et al., 2023). Thus, it may be more costly to prepare down- compared with upward saccades. This proposition, however, does not account for the lower costs associated with horizontal compared with up- and downward saccades, as the horizontal meridian is represented at a higher acuity than the vertical meridian. This makes it unlikely that this explains the pattern of results completely. Again, at this point we can only speculate why costs differ, yet we demonstrate that these differences in cost are decisive for oculomotor behavior. We now explicitly state the speculative nature of these ideas, all of which would need to be tested directly.

      We have updated our discussion of this issue as follows:

      “The observed differences in saccade costs across directions could be linked to established anisotropies in perception [80–86], attention [87–92], saccade characteristics [87, 88, 92, 93], and (early) visual cortex [94–98] [also see 99]. For example, downward saccades are more costly than upward saccades, which mimics a similar asymmetry in early visual areas wherein the upper visual field is relatively underrepresented [94–98]; similarly stronger presaccadic benefits are found for down- compared with upward saccades [87, 88]. Moreover, upward saccades are more precise than downward saccades [93]. Future work should elucidate where saccade cost or the aforementioned anisotropies originate from and how they are related - something that pupil size alone cannot address.”

      (5) The authors do not consider observations about variation in pupil size that seem to be incompatible with the preferred hypothesis. For example, at least two studies have described systematically larger pupil dilation associated with faster relative to accurate performance in manual and saccade tasks (e.g., Naber M, Murphy P. Pupillometric investigation into the speed-accuracy trade-off in a visuo-motor aiming task. Psychophysiology. 2020 Mar;57(3):e13499; Reppert TR, Heitz RP, Schall JD. Neural mechanisms for executive control of speed-accuracy trade-off. Cell Rep. 2023 Nov 28;42(11):113422). Is the fast relative to the accurate option necessarily more costly?

      We thank the reviewer for this interesting point that we will answer in two ways. First, we discuss the main point: the link between pupil size, effort, and cost. Second, we discuss the findings described specifically in these two papers and how we interpret these from a pupillometric account.

      First, one may generally ask 1) whether any effort results in pupil dilation, 2) whether any effort is costly, and 3) whether this means that pupil dilation always reflects effort and cost, respectively. Indeed, it has been argued repeatedly, prominently, and independently (e.g., Bumke, 1911; Mathôt, 2018) that any change in effort (no matter the specific origin) is associated with an evoked pupil dilation. Effort, in turn, is consistently and widely experienced as aversive, both across tasks and cultures (David et al., 2024). Effort minimization may therefore be seen as a universal law of human cognition and behavior, with effort as a to-be-minimized cost (Shadmehr et al., 2019; Hull, 1943; Tsai, 1932). However, this does not imply that any pupil dilation necessarily reflects effort or that, as a consequence thereof, any pupil dilation always signals cost. For instance, the pupil dark response, the pupil far response and changes in baseline pupil size are not associated with effort. Baseline and task-evoked pupil dilation responses have to be interpreted differently (see below). Moreover, the pupil also changes (and dilates) due to other factors (see Strauch et al., 2022; Mathôt, 2018; Bumke, 1911; Loewenfeld, 1999 for reviews).

      Second, as for Naber & Murphy (2020) and Reppert et al. (2023) specifically: both Reppert et al. (2023) and Naber & Murphy (2020) indeed demonstrate a larger baseline pupil size when participants made faster, less accurate responses. However, baseline pupil size is not an index of effort per se, but task-evoked pupil dilation responses are (as studied in the present manuscript) (Strauch et al., 2022). For work on differences between baseline pupil diameter and task-evoked pupil responses, and their respective links with exploration and exploitation, please see Jepma & Nieuwenhuis (2011). Indeed, the link between effort and larger pupil size holds for task-evoked responses, but not for baseline pupil size per se (also see Koevoet et al., 2023).

      Still, Naber (third author of the current paper) & Murphy (2020) also demonstrated larger task-evoked pupil dilation responses when participants were instructed to make faster, less accurate responses compared with making accurate and relatively slow responses. However, this difference in task-evoked response gains significance only after the onset of the movement itself, and peaks substantially later than response offset. Whilst pupil dilation may be sluggish, it isn’t extremely sluggish either. As feedback to the performance of the participant was displayed 1.25s after performing the movement and clicking (taking about 630ms), we deem it possible that this effect may in part result from appraising the feedback to the participant rather than the speed of the response itself (in fact, Naber and Murphy also discuss this option). In addition to not measuring saccades but mouse movements, it is therefore possible that the observed evoked pupil effects in Naber & Murphy (2020) are not purely linked to motor preparation and execution per se. Therefore, future work that aims to investigate the costs of movements should isolate the effects of feedback and other potential factors that may drive changes in pupil size. This will help clarify whether fast or more accurate movements could be linked to the underlying costs of the movements.

      Relatedly, we do not find evidence that pupil size during saccade planning predicts the onset latency of the ensuing saccade (please refer to our second response to Reviewer 2 for a detailed discussion).

      Together, we therefore do not see the results from Reppert et al. (2023) and Naber & Murphy (2020) to be at odds with our interpretation of evoked pupil size reflecting effort and cost in the context of planning saccades.

      We think that these are considerations important to the reader, which is why we now added them to the discussion as follows:

      “Throughout this paper, we have used cost in the limited context of saccades. However, cost-based decision-making may be a more general property of the brain [31, 36, 114–116]. Every action, be it physical or cognitive, is associated with an intrinsic cost, and pupil size is likely a general marker of this [44]. Note, however, that pupil dilation does not always reflect cost, as the pupil dilates in response to many sensory and cognitive factors which should be controlled for, or at least considered, when interpreting pupillometric data [e.g., see 39, 40, 42, 117].”

      (6) The authors draw conclusions based on trends across participants, but they should be more transparent about variation that contradicts these trends. In Figures 3 and 4 we see many participants producing behavior unlike most others. Who are they? Why do they look so different? Is it just noise, or do different participants adopt different policies?

      We disagree with the transparency point of the reviewer. Note that we deviated from the norm here by being more transparent than common: we added individual data points and relationships rather than showing pooled effects across participants with error bars alone (see Figures 2c, 3b,c, 4c,e,f).

      Moreover, our effects are consistent and stable across participants and are highly significant. To illustrate, for the classification analysis based on cost (Figure 2E) 16/20 participants showed an effect. As for the natural viewing experiments (total > 250,000 fixations), we also find that a majority of participants show the observed effects: Experiment 1: 15/16 participants; Experiment 2: 16/25 participants; Experiment 2 – adjustment: 22/25 participants.

      We fully agree that it’s interesting to understand where interindividual variation may originate from. We currently have too little data to allow robust analyses across individuals and zooming in on individual differences in cost maps, preference maps, or potential personalized strategies of saccade selection. That said, future work could study this further. For such work, we would recommend reducing the number of directions to gain more pupil size data per direction, and therefore cleaner signals that may be more informative at the individual level. With such stronger signals, studying (differences in) links at the individual level may be feasible and would be interesting to consider, and this will be a future direction in our own work too. Nonetheless, we again stress that the reported effects are robust and consistent across participants, and that interindividual differences are therefore not extensive. Moreover, our results from four experiments consistently support our conclusion that effort drives saccade selection.

      Recommendations for the authors:  

      Reviewer #1 (Recommendations For The Authors):

      - Based on the public review, I would recommend that the authors carefully review and correct the manuscript with regard to the causal conclusions. The study is largely correlational (i.e. the pupil was only observed, not manipulated) and therefore does not allow causal conclusions to be drawn about the relationship between pupil size and saccade selection. These causal conclusions become even more confusing when pupil size is equated with effort and saccade cost. As a consequence, an actual correlation between pupil size and saccade selection has led to the title that effort drives saccade selection. It would also be helpful for the reader to summarize in an additional section of the discussion what they consider to be a causal or correlational link based on their results.

      We agree with the reviewer, and we now state more explicitly and in detail which findings are correlational and which are causal. As outlined before, we do not see a more parsimonious explanation for our findings than our title, but we fully agree that the paper benefits from making the correlational/causal nature of the evidence for this idea explicit.

      “We report a combination of correlational and causal findings. Despite the correlational nature of some of our results, they consistently support the hypothesis that saccade costs predict saccade selection [which we predicted previously, 33]. Causal evidence was provided by the dual-task experiment, as saccade frequencies, and especially costly saccades, were reduced under additional cognitive demand. Only a cost account predicts 1) a link between pupil size and saccade preferences, 2) a cardinal saccade bias, 3) reduced saccade frequency under additional cognitive demand, and 4) disproportional cutting of especially those directions associated with more pupil dilation. Together, our findings converge upon the conclusion that effort drives saccade selection.”

      - Can the authors please elaborate in more detail on how they transformed the predictors of their linear mixed model for the visualization in Figure 1f? It is difficult to see how the coefficients in the table and the figure match.

      We used the ‘effectsize’ package to provide effect sizes for each predictor of the linear mixed-effects model (https://cran.r-project.org/web/packages/effectsize/index.html). We report absolute effect sizes to make it visually easier to compare different predictors. These details have now been included in the Methods section to be more transparent about how these effect sizes were computed.

      “Absolute effect sizes (i.e. r) and their corresponding 95% confidence intervals for the linear mixed-effects models were calculated using t and df values with the ’effectsize’ package (v0.8.8) in R.”
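
      As an illustration of the conversion described in this Methods passage, the following is a minimal Python sketch of how an effect size r can be obtained from a t value and its degrees of freedom. It assumes the standard conversion r = t / sqrt(t² + df); the function names `t_to_r` and `abs_effect_size` are chosen here for illustration only, and the sketch does not reproduce the package's confidence-interval computation.

```python
import math


def t_to_r(t: float, df_error: float) -> float:
    """Convert a t statistic to a (partial) correlation r.

    Assumption: the standard conversion r = t / sqrt(t^2 + df_error).
    """
    return t / math.sqrt(t * t + df_error)


def abs_effect_size(t: float, df_error: float) -> float:
    """Absolute effect size, so predictors with opposite signs can be compared."""
    return abs(t_to_r(t, df_error))


# Hypothetical values, not taken from the manuscript:
print(abs_effect_size(-3.0, 16.0))
```

      Taking the absolute value discards the sign of each predictor, which is why the coefficients in a model table and the plotted effect sizes need not match in direction.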

      - Could the authors please explain in more detail why they think that a trial-by-trial analysis in the free choice task adds something new to their conclusions? In fact, a trial-by-trial analysis somehow suggests that the pupil size data would enter the analysis at a single trial level. If I understand correctly, the pupil size data come from their initial mapping task. So there is only one mean pupil size for a given participant and direction that goes into their analysis to predict free choice in a single trial. If this is the case, I don't see the point of doing this additional analysis given the results shown in Figure 2c.

      The reviewer understands correctly that pupil size data is taken from the initial mapping task. We then used these mean values to predict which saccade target would be selected on a trial-by-trial basis. While showing the same conceptual result as the correlation analysis, we opted to include this analysis to show the robustness of the results across individuals. Therefore we have chosen to keep the analysis in the manuscript but now write more clearly that this shows the same conceptual finding as the correlation analysis.

      “As another test of the robustness of the effect, we analyzed whether saccade costs predicted saccade selection on a trial-by-trial basis. To this end, we first determined the more affordable option for each trial using the established saccade cost map (Figure 1d). We predicted that participants would select the more affordable option. Complementing the above analyses, the more affordable option was chosen above chance level across participants (M = 56.64%, 95%-CI = [52.75%-60.52%], one-sample t-test against 50%: t(19) = 3.26, p = .004, Cohen’s d = .729; Figure 2e). Together, these analyses established that saccade costs robustly predict saccade preferences.”
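
      The logic of this trial-by-trial analysis can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' analysis code: the data layout (a per-participant cost map keyed by direction, and trials given as chosen/non-chosen pairs) and the function names are hypothetical. Trials with equal costs are counted as "not more affordable" here, which is an arbitrary choice of this sketch.

```python
import math
import statistics


def proportion_affordable(trials, cost_map):
    """Share of trials in which the chosen option had the lower mapped cost.

    trials: list of (chosen_direction, other_direction) pairs (hypothetical layout).
    cost_map: dict mapping direction -> mean pupil size from the mapping task.
    """
    hits = sum(1 for chosen, other in trials if cost_map[chosen] < cost_map[other])
    return hits / len(trials)


def one_sample_t(values, mu0):
    """One-sample t statistic, Cohen's d, and degrees of freedom against mu0."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    t = (mean - mu0) / (sd / math.sqrt(n))
    d = (mean - mu0) / sd
    return t, d, n - 1
```

      Because d = t / sqrt(n) for a one-sample test, the reported values are internally consistent: with 20 participants, 3.26 / sqrt(20) ≈ .729.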

      Reviewer #2 (Recommendations For The Authors):

      The authors report that "Whenever the difference in pupil size between the two options was larger, saccades curved away more from the non-selected option (β = .004, SE = .001, t = 4.448, p < .001; Figure 3b), and their latencies slowed (β = .050, SE = .013, t = 4.323, p < .001; Figure 3c)". I suspect this effect might not be driven by the difference but by a correlation between pupil size and latency.

      The authors correlate differences in pupil size (Exp1) with saccade latencies (Exp2), I recommend correlating pupil size with the latency directly, in either task. This would show if it is actually the difference between choices or simply the pupil size of the respective individual option that is linked to latency/effort. Same for curvature.

      The reviewer raises a good point. Please see the previous analyses concerning the possible correlations between pupil size and saccade latency, and how they jointly predict saccade selection.

      Our data show that saccade curvature and latencies are linked with the difference in pupil size between the selected and non-selected options. Are these effects driven by a difference in pupil size or by the pupil size associated with the chosen option?

      To assess this, we conducted two linear mixed-effects models. We predicted saccade curvature and latency using pupil size (from the planning task) of the selected and non-selected options while controlling for the chosen direction (Wilkinson notation: saccade curvature/latency ~ selected pupil size + non-selected pupil size + obliqueness + vertical + horizontal + (1 + selected pupil size + non-selected pupil size|participant)). We found that saccades curved away more from costlier non-selected targets (β = 1.534, t = 8.151, p < .001), and saccades curved away from the non-selected target less when the selected target was cheaper (β = -2.571, t = -6.602, p < .001). As the costs of the selected and non-selected options show opposite effects on saccade curvature, this indicates that the difference between the two options drives oculomotor conflict.

      As for saccade latencies, we found saccade onsets to slow when the cost of the selected target was higher (β = .068, t = 2.844, p = .004). In contrast, saccade latencies were not significantly affected by the cost of the non-selected target (β = -.018, t = 1.457, p = .145), although numerically the effect was in the opposite direction. This shows that latencies were primarily driven by the cost of the selected target, but a difference account cannot be fully ruled out.

      Together, these analyses demonstrate that the difference in costs between two alternatives reliably affects oculomotor conflict, as indicated by the curvature analysis. However, saccade latencies are predominantly affected by the cost of the selected target, even when controlling for the obliqueness, updownness and leftrightness of the ensuing saccade. We have added these analyses here for completeness, but because the findings seem inconclusive for saccade latency, and to keep the paper concise and focused, we have chosen not to include them in the current paper. We are open to including these analyses in the supplementary materials if the reviewer and/or editor would like us to.

      I was wondering why the authors haven't analyzed the pupil size in Experiment 2. If the pupil size can be assessed during a free viewing task (Experiment 3), shouldn't it be possible to also evaluate it in the saccade choice task?

      We did not analyze the pupil size data from the saccade preference task for two reasons. First, the number of saccades is much lower than in the natural search experiments (~14,000 vs. ~250,000). Second, in the saccade preference task, there were always two possible saccade targets. Therefore, even if we were able to isolate an effort signal, this signal could index a multitude of factors, such as deciding between two possible saccade targets (de Gee et al., 2014) and the possibility of two oculomotor programs being realized instead of only a single one (Van der Stigchel, 2010).

      Discussion: "due to stronger presaccadic benefits for upward compared with downward saccades [93,94]". I think this should be the other way around.

      We thank the reviewer for pointing this out. We have corrected our mistake in the revised manuscript.

      Saccade latencies differ around the visual field; to account for that, results / pupil size should be (additionally) evaluated relative to saccade onset (rather than cue offset). It is interesting that latencies were not accounted for here (Exp1), since they are considered for Exp2 (where they correlate with a pupil size difference). I suspect that latencies not only correlate with the difference in pupil size, but directly with pupil size itself.

      We agree with the reviewer that locking the pupil size signal to saccade onset instead of cue offset may be informative. We included an analysis in the supporting information that investigates this (see Figure S1). The results of the analysis were conceptually identical.

      The reviewer writes that latencies were not accounted for in Experiment 1. Although saccade latency was not included in the final model reported in the paper, it was considered during AIC-based backward model selection. As saccade latency did not predict meaningful variance in pupil size, it was ultimately not included in the analysis as a predictor. For completeness, we here report the outcome of a linear mixed-effects model that does include saccade latency as a predictor. Here, saccade latencies did not predict pupil size (β = 1.859e-03, t = .138, p = .889). The asymmetry effects remained qualitatively unchanged: preparing oblique compared with cardinal saccades resulted in a larger pupil size (β = 7.635, t = 3.969, p < .001), and preparing downward compared with upward saccades also led to a larger pupil size (β = 3.344, t = 3.334, p = .003).

      In addition, we have included a new analysis in the supporting information that directly addresses this issue. We will reiterate the main results here:

      “To ascertain whether pupil size or other oculomotor metrics predict saccade preferences, we conducted a multiple regression analysis. We calculated average pupil size, saccade latency, landing precision and peak velocity maps across all 36 directions. The model, determined using AIC-based backward selection, included pupil size, latency and landing precision as predictors (Wilkinson notation: saccade preferences ~ pupil size + saccade latency + landing precision). The analysis revealed that pupil size (β = -42.853, t = 4.791, p < .001) and saccade latency (β = -.377, t = 2.106, p = .043) predicted saccade preferences. Landing precision did not reach significance (β = 23.631, t = 1.675, p = .104). Together, this demonstrates that although other oculomotor metrics such as saccade latency contribute to saccade selection, pupil size remains a robust marker of saccade selection.”
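
      For readers unfamiliar with AIC-based backward selection, the procedure can be sketched as below. This is a generic, dependency-free Python illustration and not the authors' implementation (which was presumably run in R): it assumes ordinary least squares with an intercept, uses the Gaussian AIC up to an additive constant (n·ln(RSS/n) + 2k), and drops one predictor at a time for as long as doing so lowers the AIC. All function and variable names are hypothetical.

```python
import math


def ols_rss(X, y):
    """Residual sum of squares of a least-squares fit of y on X plus an intercept."""
    n = len(y)
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    # Augmented normal equations [X'X | X'y]
    M = [[sum(rows[i][a] * rows[i][b] for i in range(n)) for b in range(p)]
         + [sum(rows[i][a] * y[i] for i in range(n))] for a in range(p)]
    # Gaussian elimination with partial pivoting
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, p):
            f = M[r][col] / M[col][col]
            for k in range(col, p + 1):
                M[r][k] -= f * M[col][k]
    # Back substitution
    beta = [0.0] * p
    for a in reversed(range(p)):
        beta[a] = (M[a][p] - sum(M[a][k] * beta[k] for k in range(a + 1, p))) / M[a][a]
    return sum((y[i] - sum(rows[i][a] * beta[a] for a in range(p))) ** 2 for i in range(n))


def aic(rss, n, k):
    """Gaussian AIC up to an additive constant; k counts fitted coefficients."""
    return n * math.log(rss / n) + 2 * k


def backward_select(data, y, names):
    """Drop predictors one at a time while that lowers the AIC.

    data: dict mapping predictor name -> list of values (one per observation).
    """
    n = len(y)

    def score(cols):
        X = [[data[c][i] for c in cols] for i in range(n)]
        return aic(ols_rss(X, y), n, len(cols) + 1)

    current = list(names)
    best = score(current)
    while current:
        cand_aic, drop = min((score([c for c in current if c != d]), d) for d in current)
        if cand_aic >= best:
            break  # no single removal improves the model
        best = cand_aic
        current.remove(drop)
    return current, best
```

      In the analysis quoted above, the four candidate maps (pupil size, latency, landing precision, peak velocity) would be the starting predictors, and the reported final model corresponds to peak velocity having been dropped.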

      We have also added this point in our discussion:

      “We here measured cost as the degree of effort-linked pupil dilation. In addition to pupil size, other markers may also indicate saccade costs. For example, saccade latency has been proposed to index oculomotor effort [100], whereby saccades with longer latencies are associated with more oculomotor effort. This makes saccade latency a possible complementary marker of saccade costs (also see Supplementary Materials). Although relatively sluggish, pupil size is a valuable measure of attentional costs for (at least) two reasons. First, pupil size is a highly established marker of effort, and is sensitive to effort more broadly than only in the context of saccades [36–45, 48]. Pupil size therefore makes it possible to capture not only the costs of saccades, but also of covert attentional shifts [33], or shifts with other effectors such as head or arm movements [54, 101]. Second, as we have demonstrated, pupil size can measure saccade costs even when searching in natural scenes (Figure 4). During natural viewing, it is difficult to disentangle fixation duration from saccade latencies, complicating the use of saccade latency as a measure of saccade cost. Together, pupil size, saccade latency, and potential other markers of saccade cost could fulfill complementary roles in studying the role of cost in saccade selection.”

      References

      Alnæs, D., Sneve, M. H., Espeseth, T., Endestad, T., van de Pavert, S. H. P., & Laeng, B. (2014). Pupil size signals mental effort deployed during multiple object tracking and predicts brain activity in the dorsal attention network and the locus coeruleus. Journal of Vision, 14(4), 1. https://doi.org/10.1167/14.4.1

      Awh, E., Belopolsky, A. V., & Theeuwes, J. (2012). Top-down versus bottom-up attentional control: A failed theoretical dichotomy. Trends in Cognitive Sciences, 16(8), 437–443. https://doi.org/10.1016/j.tics.2012.06.010

      Ballard, D. H., Hayhoe, M. M., & Pelz, J. B. (1995). Memory Representations in Natural Tasks. Journal of Cognitive Neuroscience, 7(1), 66–80. https://doi.org/10.1162/jocn.1995.7.1.66

      Beatty, J. (1982). Task-evoked pupillary responses, processing load, and the structure of processing resources. Psychological Bulletin, 91(2), 276–292. https://doi.org/10.1037/0033-2909.91.2.276

      Bumke, O. (1911). Die Pupillenstörungen bei Geistes-und Nervenkrankheiten (2nd ed.). Fischer.

      Curthoys, I. S., Markham, C. H., & Furuya, N. (1984). Direct projection of pause neurons to nystagmus-related excitatory burst neurons in the cat pontine reticular formation. Experimental Neurology, 83(2), 414–422. https://doi.org/10.1016/S0014-4886(84)90109-2

      David, L., Vassena, E., & Bijleveld, E. (2024). The unpleasantness of thinking: A meta-analytic review of the association between mental effort and negative affect. Psychological Bulletin, 150(9), 1070–1093. https://doi.org/10.1037/bul0000443

      de Gee, J. W., Knapen, T., & Donner, T. H. (2014). Decision-related pupil dilation reflects upcoming choice and individual bias. Proceedings of the National Academy of Sciences, 111(5), E618–E625. https://doi.org/10.1073/pnas.1317557111

      Deubel, H., & Schneider, W. X. (1996). Saccade target selection and object recognition: Evidence for a common attentional mechanism. Vision Research, 36(12), 1827–1837. https://doi.org/10.1016/0042-6989(95)00294-4

      Greenwood, J. A., Szinte, M., Sayim, B., & Cavanagh, P. (2017). Variations in crowding, saccadic precision, and spatial localization reveal the shared topology of spatial vision. Proceedings of the National Academy of Sciences, 114(17), E3573–E3582. https://doi.org/10.1073/pnas.1615504114

      Hanning, N. M., Himmelberg, M. M., & Carrasco, M. (2024). Presaccadic Attention Depends on Eye Movement Direction and Is Related to V1 Cortical Magnification. Journal of Neuroscience, 44(12). https://doi.org/10.1523/JNEUROSCI.1023-23.2023

      Himmelberg, M. M., Winawer, J., & Carrasco, M. (2023). Polar angle asymmetries in visual perception and neural architecture. Trends in Neurosciences, 46(6), 445–458. https://doi.org/10.1016/j.tins.2023.03.006

      Jepma, M., & Nieuwenhuis, S. (2011). Pupil Diameter Predicts Changes in the Exploration–Exploitation Trade-off: Evidence for the Adaptive Gain Theory. Journal of Cognitive Neuroscience, 23(7), 1587–1596. https://doi.org/10.1162/jocn.2010.21548

      Kahneman, D. (1973). Attention and Effort. Prentice-Hall.

      Kahneman, D., & Beatty, J. (1966). Pupil diameter and load on memory. Science (New York, N.Y.), 154(3756), 1583–1585. https://doi.org/10.1126/science.154.3756.1583

      King, W. M., & Fuchs, A. F. (1979). Reticular control of vertical saccadic eye movements by mesencephalic burst neurons. Journal of Neurophysiology, 42(3), 861–876. https://doi.org/10.1152/jn.1979.42.3.861

      Koevoet, D., Strauch, C., Naber, M., & Van der Stigchel, S. (2023). The Costs of Paying Overt and Covert Attention Assessed With Pupillometry. Psychological Science, 34(8), 887–898. https://doi.org/10.1177/09567976231179378

      Koevoet, D., Strauch, C., Van der Stigchel, S., Mathôt, S., & Naber, M. (2024). Revealing visual working memory operations with pupillometry: Encoding, maintenance, and prioritization. WIREs Cognitive Science, e1668. https://doi.org/10.1002/wcs.1668

      Kowler, E., Anderson, E., Dosher, B., & Blaser, E. (1995). The role of attention in the programming of saccades. Vision Research, 35(13), 1897–1916. https://doi.org/10.1016/0042-6989(94)00279-U

      Laeng, B., Sirois, S., & Gredebäck, G. (2012). Pupillometry: A Window to the Preconscious? Perspectives on Psychological Science, 7(1), 18–27. https://doi.org/10.1177/1745691611427305

      Loewenfeld, I. E. (1958). Mechanisms of reflex dilatation of the pupil. Documenta Ophthalmologica, 12(1), 185–448. https://doi.org/10.1007/BF00913471

      Mathôt, S. (2018). Pupillometry: Psychology, Physiology, and Function. Journal of Cognition, 1(1), 16. https://doi.org/10.5334/joc.18

      Naber, M., & Murphy, P. (2020). Pupillometric investigation into the speed-accuracy trade-off in a visuomotor aiming task. Psychophysiology, 57(3), e13499. https://doi.org/10.1111/psyp.13499

      Nozari, N., & Martin, R. C. (2024). Is working memory domain-general or domain-specific? Trends in Cognitive Sciences, 0(0). https://doi.org/10.1016/j.tics.2024.06.006

      Reppert, T. R., Heitz, R. P., & Schall, J. D. (2023). Neural mechanisms for executive control of speed-accuracy trade-off. Cell Reports, 42(11). https://doi.org/10.1016/j.celrep.2023.113422

      Richer, F., & Beatty, J. (1985). Pupillary Dilations in Movement Preparation and Execution. Psychophysiology, 22(2), 204–207. https://doi.org/10.1111/j.1469-8986.1985.tb01587.x

      Robison, M. K., & Brewer, G. A. (2020). Individual differences in working memory capacity and the regulation of arousal. Attention, Perception, & Psychophysics, 82(7), 3273–3290. https://doi.org/10.3758/s13414-020-02077-0

      Robison, M. K., & Unsworth, N. (2019). Pupillometry tracks fluctuations in working memory performance. Attention, Perception, & Psychophysics, 81(2), 407–419. https://doi.org/10.3758/s13414-018-1618-4

      Sahakian, A., Gayet, S., Paffen, C. L. E., & Van der Stigchel, S. (2023). Mountains of memory in a sea of uncertainty: Sampling the external world despite useful information in visual working memory. Cognition, 234, 105381. https://doi.org/10.1016/j.cognition.2023.105381

      Shadmehr, R., Reppert, T. R., Summerside, E. M., Yoon, T., & Ahmed, A. A. (2019). Movement Vigor as a Reflection of Subjective Economic Utility. Trends in Neurosciences, 42(5), 323–336. https://doi.org/10.1016/j.tins.2019.02.003

      Silva, M. F., Brascamp, J. W., Ferreira, S., Castelo-Branco, M., Dumoulin, S. O., & Harvey, B. M. (2018). Radial asymmetries in population receptive field size and cortical magnification factor in early visual cortex. NeuroImage, 167, 41–52. https://doi.org/10.1016/j.neuroimage.2017.11.021

      Sirois, S., & Brisson, J. (2014). Pupillometry. WIREs Cognitive Science, 5(6), 679–692. https://doi.org/10.1002/wcs.1323

      Sparks, D. L. (2002). The brainstem control of saccadic eye movements. Nature Reviews Neuroscience, 3(12), Article 12. https://doi.org/10.1038/nrn986

      Strauch, C., Wang, C.-A., Einhäuser, W., Van der Stigchel, S., & Naber, M. (2022). Pupillometry as an integrated readout of distinct attentional networks. Trends in Neurosciences, 45(8), 635–647. https://doi.org/10.1016/j.tins.2022.05.003

      Unsworth, N., & Miller, A. L. (2021). Individual Differences in the Intensity and Consistency of Attention. Current Directions in Psychological Science, 30(5), 391–400. https://doi.org/10.1177/09637214211030266

      Van der Stigchel, S. (2010). Recent advances in the study of saccade trajectory deviations. Vision Research, 50(17), 1619–1627. https://doi.org/10.1016/j.visres.2010.05.028

      Van der Stigchel, S. (2020). An embodied account of visual working memory. Visual Cognition, 28(5–8), 414–419. https://doi.org/10.1080/13506285.2020.1742827

      Van der Stigchel, S., & Hollingworth, A. (2018). Visuospatial Working Memory as a Fundamental Component of the Eye Movement System. Current Directions in Psychological Science, 27(2), 136–143. https://doi.org/10.1177/0963721417741710

      van der Wel, P., & van Steenbergen, H. (2018). Pupil dilation as an index of effort in cognitive control tasks: A review. Psychonomic Bulletin & Review, 25(6), 2005–2015. https://doi.org/10.3758/s13423-018-1432-y

    1. Author response:

      The following is the authors’ response to the current reviews.

      Reviewer #1 (Public review): 

      Summary: 

      Nitric oxide (NO) has been implicated as a neuromodulator in the retina. Specific types of amacrine cells (ACs) produce and release NO in a light-dependent manner. NO diffuses freely through the retina and can modulate intracellular levels of cGMP, or directly modify and modulate proteins via S-nitrosylation, leading to changes in gap-junction coupling, synaptic gain, and adaptation. Although these system-wide effects have been documented, it is not well understood how the physiological function of specific neuronal types is affected by NO. This study aims to address this gap in our knowledge. 

      There are two major findings. 1) About a third of the retinal ganglion cells display cell-type specific adaptation to prolonged stimulus protocols. 2) Application of NO specifically affected Off-suppressed ganglion cells designated as G32 cells. The G32 cluster likely contains 3 ganglion cell types that are differentially affected. 

      This is the first comprehensive analysis of the functional effects of NO on ganglion cells in the retina. The cell-type specificity of the effects is surprising and provides the field with valuable new information. 

      Strengths: 

      NO was expected to produce small effects, and considerable effort was expended in validating the system to ensure that changes in the state of the preparation would not confound any effects of NO. The authors used a sequential stimulus protocol to control for changes in the sensitivity of the retina during the extended recording periods. The approach potentially increases the sensitivity of the measurements and allows more subtle effects to be observed. 

      Neural activity was measured by Ca-imaging. Responsive ganglion cells were grouped into 32 types using a clustering analysis. Initial control experiments demonstrated that the cell types revealed by the analysis largely recapitulate those from their earlier landmark study using a similar approach. 

      Application of NO to the retina modulated responses of a single cluster of cells, labeled G32, while having little effect on the remaining 31 clusters. In separate experiments, ganglion cell spiking activity was recorded on a multi-electrode array (MEA). Together the Ca-imaging and MEA recordings provide complementary approaches and demonstrate that NO modulates the temporal but not spatial properties of affected cell-types.

      Weaknesses: 

      The concentration of NO used in these experiments was ~0.25µM, which is 5- to 10-fold lower than the endogenous concentration previously measured in rodent retina. It is perhaps surprising that this relatively low NO concentration produced significant effects. However, the endogenous measurements were done in an eye-cup preparation, while the current experiments were performed in a bare (no choroid) preparation. Perhaps the resting NO level is lower in this preparation. It is also possible that the low concentration of NO promoted more selective effects.

      Reviewer #2 (Public review): 

      Neuromodulators are important for circuit function, but their roles in the retinal circuitry are poorly understood. This study by Gonschorek and colleagues aims to determine the modulatory effect of nitric oxide on the response properties of retinal ganglion cells. The authors used two photon calcium imaging and multi-electrode arrays to classify and compare cell responses before and after applying a NO donor DETA-NO. The authors found that DETA-NO selectively increases activity in a subset of contrast-suppressed RGC types. In addition, the authors found cell-type specific changes in light response in the absence of pharmacological manipulation in their calcium imaging paradigm. This study focuses on an important question and the results are interesting. The limitations of the method and data interpretation are adequately discussed in the revised manuscript. 

      The authors have addressed my previous comments, included additional discussions on the limitations of the method, and provided a more careful interpretation of their data. 

      Recommendations for the authors: 

      Please correct the citation that reviewer #1 mentioned. In addition, a little more discussion of the NO concentration issue would be helpful. The low NO concentration is not a weakness in the data; it simply raises questions regarding the interpretation.

      Thank you for these recommendations.

      Regarding the citation error, we are not sure if Reviewer #1 refers to a citation formatting error or incorrect placement. In any case, we modified the text: We specified the extracted information regarding the NO concentrations and put the applied concentration into that context (Lines 621-635). In addition, we made clear that the citation of Guthrie (2014) refers to the dissertation, which can be easily retrieved via Google Scholar. We also cited the mentioned ARVO abstract by Guthrie and Mieler (2014). 

      We hope that these modifications solve the above-mentioned issues. 


      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):  

      Summary: 

      Nitric oxide (NO) has been implicated as a neuromodulator in the retina. Specific types of amacrine cells (ACs) produce and release NO in a light-dependent manner. NO diffuses freely through the retina and can modulate intracellular levels of cGMP, or directly modify and modulate proteins via S-nitrosylation, leading to changes in gap-junction coupling, synaptic gain, and adaptation. Although these system-wide effects have been documented, it is not well understood how the physiological function of specific neuronal types is affected by NO. This study aims to address this gap in our knowledge. 

      Strengths: 

      NO was expected to produce small effects, and considerable effort was expended in validating the system to ensure that any effects of NO would not be confounded by changes in the state of the preparation. The authors used a paired stimulus protocol to control for changes in the sensitivity of the retina during the extended recording periods. The approach potentially increases the sensitivity of the measurements and allows more subtle effects to be observed. 

      Neural activity was initially measured by Ca-imaging. Responsive ganglion cells were grouped into 32 types using a clustering analysis. Initial control experiments demonstrated that the cell-types revealed here largely recapitulate those from their earlier landmark study using the same approach (Fig. 2). 

      Application of NO to the retina strongly modulated responses of a single cluster of cells, labeled G32, while having little effect on the remaining 31 clusters. This result is evident in Fig. 3e. 

      Separate experiments measured ganglion cell spiking activity on a multi-electrode array (MEA). Clustering analysis of the peri-stimulus spike-time histograms (PSTHs) obtained from the MEA data also revealed 32 clusters. The PSTHs for each cluster were aligned to the Ca-imaging data using a convolution approach. The higher temporal resolution of the MEA recordings indicated that NO increased the speed of sub-cluster 2 responses but had no effect on receptive field size. The physiological significance of the small change in kinetics remains unclear. 

      We thank the reviewer for their detailed and constructive comments.

      Weaknesses: 

      The G32 cluster was further divided into three sub-types using Bayesian Information Criterion (BIC) based on the temporal properties of the Ca-responses. This sub-clustering result seems questionable due to the small difference in the BIC parameter between 2 and 3 clusters. Three sub-clusters of the G32 cluster were also revealed for the PSTH data, however, the BIC analysis was not applied to further validate this result. 

      (1.1) We agree with the reviewer that this is an important point to be clarified. To this end, we repeated the analysis with n=2 clusters (see Author response image 1 below). In brief, we found that the overall interpretation did not change: Both clusters in the Ctrl1-dataset showed barely any type-specific adaptational effects, whereas under NO application, temporal contrast responses decreased (see Author response image 1 below). If requested, we would be happy to add this image to the supplementary material. 

      Author response image 1.

      In an additional analysis, we evaluated whether n=2 or n=3 was the “better” choice for the number of clusters. In the new Supplementary Fig. S4, we compared the clusters with n=2 (top) and n=3 (bottom). For n=2, the two clusters are relatively strongly correlated for both visual stimuli, whereas for n=3, the clusters become more distinct, especially with respect to differences in the correlations for the two stimuli (Fig. S4b). For n=2, the low intra-cluster correlation (ICC) strongly suggests that cluster 2 contains multiple response types (ICC(C2) = 0.5 ± 0.48, mean ± s.d.; Fig. S4c). For n=3, the mean ICC values are high for all three clusters (ICC(C1) = 0.81 ± 0.16; ICC(C2) = 0.86 ± 0.07; ICC(C3) = 0.83 ± 0.1; mean ± s.d.). Together, this suggests that n=3 clusters capture the response diversity in G32 better than n=2 clusters.
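
      To illustrate how an intra-cluster correlation of this kind can be computed, here is a minimal sketch. It assumes that the ICC is the mean (and s.d.) of the pairwise Pearson correlations between the response traces within one cluster; the exact definition used for Fig. S4 may differ, and the function names and toy traces are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long traces."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant trace")
    return cov / math.sqrt(vx * vy)

def intra_cluster_correlation(traces):
    """Mean and s.d. of all pairwise correlations within one cluster."""
    if len(traces) < 2:
        raise ValueError("need at least two traces")
    rs = [pearson(traces[i], traces[j])
          for i in range(len(traces))
          for j in range(i + 1, len(traces))]
    mean = sum(rs) / len(rs)
    sd = math.sqrt(sum((r - mean) ** 2 for r in rs) / len(rs))
    return mean, sd

# Toy data (hypothetical): a homogeneous cluster and a mixed cluster.
homogeneous = [[0, 1, 2, 1], [0, 2, 4, 2], [1, 2, 3, 2]]
mixed = [[0, 1, 2, 1], [0, 2, 4, 2], [2, 1, 0, 1]]
print(intra_cluster_correlation(homogeneous))
print(intra_cluster_correlation(mixed))
```

      In this toy case, the mixed cluster yields a low mean correlation with a large spread, which is the pattern that suggests a cluster contains more than one response type.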

      Finally, we performed a BIC analysis for the MEA dataset and found the optimal number of clusters to be also n=3 (see new Suppl. Fig. S5).

      The alignment of sub-clusters 1, 2, and 3 identified in the Ca-imaging and the MEA recordings seemed questionable, because the temporal properties of clusters did not align well, nor did the effects of NO. 

      (1.2) To address this important point, we analyzed the correlations between the control responses of the three clusters from the Ca<sup>2+</sup>-dataset with the ones from the MEA-dataset (see new Suppl. Fig. S7). To avoid confusion, we named the clusters in the MEA-dataset i,ii,iii (see Fig. 8). We found two of the three clusters to be highly correlated (Ca<sup>2+</sup> clusters 2,3 and MEA clusters iii, ii), whereas one cluster was much less so (cluster 1 vs. cluster i), likely due to differences in response kinetics. In clusters i and ii NO application led to a release of suppression for temporal contrasts – similar to what we observed in the Ca<sup>2+</sup> data (see also our new analysis of the MEA data in Suppl. Fig. S6, as discussed further below).

      We agree that the cell types underlying the Ca<sup>2+</sup> and MEA G32 clusters may not be the same – aligning functional types between those two methods is challenging due to several factors, mainly because while Ca<sup>2+</sup> is a proxy for spiking activity, other Ca<sup>2+</sup> sources as well as sub-threshold membrane potential changes affect the intracellular Ca<sup>2+</sup>, potentially in a cell type-specific way. We explain this now better in the text.

      In any case, our main point was not to unambiguously align the cell types but to show that in both datasets, we find three subclusters of G<sub>32</sub>, which are affected by NO in a differential manner, particularly their suppression to temporal contrasts.

      The title of the paper indicates that nitric oxide modulates contrast suppression in a subset of mouse retinal ganglion cells, however, this result appears to be inferred from previous results showing that G32 is identified as a "suppressed-by-contrast" cell. The present study does not explicitly evaluate the amount of contrast-suppression in G32 cells. 

      (1.3) The reviewer is correct in that we did not quantify contrast-suppression in G<sub>32</sub> in detail but focused on the responses to temporal contrast (chirp and moving bar) and its modulation by NO (Fig. 5). In this context, please note that G<sub>32</sub>’s responses to the moving bar stimulus suggest that the cells are also suppressed by spatial contrast (i.e., an edge appearing in their RF). The functional RGC type G<sub>32</sub> (“Off suppressed 2”) was defined in an earlier study (Baden et al. 2016); it was assigned to the “Suppressed-by-Contrast” (SbC) category mainly because temporal contrast suppresses its responses. Already then, coverage analysis indicated that G<sub>32</sub> may indeed contain several RGC types – in line with our clustering analysis. It is still unclear if G<sub>32</sub> contains one (or more) of the SbC cells described by Jacoby & Schwartz (2018); in their recent study, Wienbar and Schwartz (2022) introduced the novel bursty-SbC RGC, which Goetz et al. (2022) speculated to potentially align with G<sub>32</sub>. We now discuss the relationship between G<sub>32</sub> and the SbC RGCs defined in other studies in the revised manuscript.

      In its current form, the work is likely to have limited impact, since the morphological and functional properties of the affected sub-cluster remain unknown. The finding that there can be cell-specific adaptation effects during experiments on in vitro retina is important new information for the field.

      (1.4) Again, we thank the reviewer for the detailed and helpful feedback. We hope that the reviewer finds our revised manuscript improved.

      Reviewer #1 (Recommendations For The Authors):  

      Most of the calcium activity traces (dF/F) throughout the paper have neither vertical nor horizontal calibration bars. Presumably, most values are positive, but this is unclear as a zero level is not indicated anywhere. Without knowing where zero dF/F is, it is not possible to determine whether the NO increased the Ca-signal or blocked a decrease in the Ca-signal. 

      Both ∆F/F and z-scoring, as we used here, are ways to normalize Ca<sup>2+</sup> traces. We decided against using ∆F/F<sub>0</sub> because this typically assumes that F<sub>0</sub> represents the cell’s Ca<sup>2+</sup> resting level (i.e. without activity). However, in our measurements, the “resting” Ca<sup>2+</sup> levels (i.e. before presenting a stimulus) may indeed reflect no spiking activity (e.g., in an ON RGC) but may also reflect baseline spiking activity (e.g., in a G<sub>32</sub>, which has a baseline firing rate of ~10 Hz; see Fig. S6). Hence, we used z-scoring, which carries no assumption of resting Ca<sup>2+</sup> level equal to no activity. In practice, we normalized all traces to the Ca<sup>2+</sup> level prior to the light stimulus and defined this as zero (as described in the Methods).

      We considered the reviewer’s suggestion of adding zero lines to every trace but felt that this would hamper the overall readability of the figures.

      Regarding calibration bars: We made sure that horizontal bars (indicating time) are present in all figures. We decided to leave out vertical bars in Ca<sup>2+</sup> responses, because as explained above, the traces are normalized (and unit-free), and within a figure all traces are scaled the same.

      Points of clarification for the Methods: 

      (1) The stimulus field was 800 x 600 µm. Presumably, both scan fields were contained within this region when scanning either Field 1 or Field 2 so that the adaptation level of the preparation at both locations was maintained? 

      Yes, the stimulation field is always kept centered on the respective recording (scan) field and the adaptation level for each recording field was maintained.

      (2) There appeared to be an indeterminate amount of time between the initial 10-minute adaptation period and Ctrl1, whereas there were no such gaps between subsequent scans. Is this likely to produce differences in adaptation state and thus represent a systematic error? 

      At this time point, recording (scan) fields were selected to make sure that the cells in the field were uniformly labelled with the Ca<sup>2+</sup> indicator and responsive to light stimuli. This typically happened already at the end of the light adaptation phase and/or right after. When selecting the fields, light stimuli were presented (to test responsiveness) and thereby the adaptation level was maintained independent of the duration of this procedure, minimizing systematic errors.

      (3) Was the dense white noise stimulus applied during the wash-in period to maintain the adaptation state of the preparation prior to the subsequent scan? 

      The dense noise was not applied throughout the wash-in period but for at least 5–10 min before the field was recorded with a drug (e.g., NO).

      Fig. 1d illustrates very nicely how the stimuli align with the responses. It would have been helpful to have this format continue throughout the paper but unfortunately, the vertical lines are dropped in Fig. 2a and then the stimulus waveform is omitted in Fig. 2e onwards. 

      Thanks, good idea. We added the vertical lines and the stimulus waveform to the figures where they were missing to improve the readability. 

      What was the rationale for selecting the concentration of the NO donor used? Is it likely to mimic natural levels? 

      A DETA/NO concentration of 100 µM is commonly used in studies investigating NO-induced effects. DETA/NO has a half-life time (t<sub>0.5</sub>) of 20 hours, which makes it more suitable for application in tissues (like our whole-mount preparation), because the donor can penetrate into the tissue before releasing NO. In turn, this long t<sub>0.5</sub> means that only a fraction of the bound NO is released per time unit.

      Based on t<sub>0.5</sub> for DETA/NO and NO, one can roughly estimate the NO range as follows: t<sub>0.5</sub> of NO strongly depends on the tissue and is estimated to be in the second-to-minute range (Beckman & Koppenol, 1996). Assuming a t<sub>0.5</sub> for NO of 2 minutes, a freshly prepared 100 µM DETA/NO solution is expected to yield, within the first hour, an NO concentration of approx. 0.25 µM (taking into account that 1 mole of DETA/NO releases 1.5 moles of NO molecules; see Ramamurthi & Lewis 1997).
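
      This rough estimate can be reproduced with a short calculation. The sketch below is for illustration only: it assumes first-order decay of both the donor and NO, a donor concentration that stays approximately constant during the first hour (reasonable for a 20 h half-life), and a quasi-steady state in which NO release equals NO decay. The function and parameter names are hypothetical.

```python
import math

def steady_state_no_um(donor_um=100.0, donor_half_life_h=20.0,
                       no_half_life_min=2.0, no_per_donor=1.5):
    """Quasi-steady-state NO concentration (in µM) under first-order kinetics."""
    # First-order rate constants, both expressed per minute.
    k_donor = math.log(2) / (donor_half_life_h * 60.0)
    k_no = math.log(2) / no_half_life_min
    # NO release rate (µM per minute), assuming a constant donor concentration.
    release_rate = no_per_donor * k_donor * donor_um
    # At steady state, release equals decay: release_rate = k_no * [NO].
    return release_rate / k_no

print(round(steady_state_no_um(), 3))  # 0.25
```

      With the values stated above, the ratio of the two half-lives (2 min / 1200 min) times 1.5 × 100 µM gives 0.25 µM. A shorter NO half-life in tissue would lower the estimate proportionally.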

      In general, it is difficult to determine the physiological concentration of NO in the retina. Different measurements point at peaks of a few 100 nM (e.g., frog retina, ganglion cells: 0.25 µM, Kalamkarov et al. 2016; rodent inner retina, 0.1 to 0.4 µM, Micah et al. 2014). Hence, the NO concentrations we apply should be within the measured physiological range.

      Fig. 3e: what are the diamond symbols? If these are the individual cells, it might be better to plot them on top of the box plots so all are visible. 

      Indeed, the diamond symbols represent individual cells, yet outliers only. We decided not to plot all cells as a dot plot on top of the box plots since the readability will suffer as there are too many individual dots to show, e.g., n=251 for G<sub>32</sub> Ctrl and n=135 for G<sub>32</sub> DETA/NO.

      Fig. 3: please explain more clearly the x-axis units in a-d and the y-axis units in e. 

      To estimate potential response differences between the first and the second scan (i.e. either Ctrl 2 or NO), the traces were subtracted cell-pairwise (∆ Ctrl: Ctrl 2 – Ctrl 1; ∆ DETA/NO: NO – Ctrl 1). As all Ca<sup>2+</sup> traces were normalized, they are unit-free. Therefore, the x-axes in Fig. 3a-d represent the mean differences of each cell per cell type, e.g., a value of zero would mean that the traces of Ctrl 1 and Ctrl 2 for a cell are identical. The y-axis in Fig. 3e is also unit-free, because technically, it is the same measure as Fig. 3a-d. But since it summarizes the control- and NO-data, we refer to this as “delta mean trace.” We tried to make this clearer in the revised manuscript and a detailed description can be found in the Methods.
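
      As a minimal sketch of this measure, assuming that the difference trace of each cell is simply averaged over time (the toy traces and the function name are hypothetical, and the full procedure is the one described in the Methods):

```python
def delta_mean_trace(first_scan, second_scan):
    """Mean of the pointwise difference (second - first) for one cell."""
    if len(first_scan) != len(second_scan):
        raise ValueError("traces must have the same number of samples")
    diffs = [b - a for a, b in zip(first_scan, second_scan)]
    return sum(diffs) / len(diffs)

# Toy normalized traces of a single cell (hypothetical values).
ctrl1 = [0.0, 0.5, 1.0, 0.5, 0.0]
ctrl2 = [0.0, 0.5, 1.0, 0.5, 0.0]   # identical to Ctrl 1
no = [0.2, 0.7, 1.0, 0.7, 0.2]      # less suppressed outside the peak

delta_ctrl = delta_mean_trace(ctrl1, ctrl2)   # Ctrl 2 - Ctrl 1
delta_no = delta_mean_trace(ctrl1, no)        # NO - Ctrl 1
print(delta_ctrl, delta_no)
```

      Identical traces give a value of zero, whereas a positive value indicates that the second scan had, on average, larger normalized responses than the first.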

      Fig. 3: "...a substantial number of RGC types (34%) changed their responses to chirp and/or moving bar stimuli in the absence of any pharmacological perturbation in a highly reproducible manner...". How many of the cell types showed a significant difference? Two cell-types with p<0.001 are highlighted with 3 asterisks. It would be helpful to indicate on this plot which of the other cells showed significant differences. 

      Yes, this is a good idea. Thank you. We tried to add this information to the figure, but it became rather crowded. Therefore, we added a new Suppl. Fig. S3 (same style as Fig. 3) where we exclusively summarized the control-dataset. 

      Fig. 7: To illustrate the transform from PSTH to Ca-imaging, why not use G32 data as an example?

      Fair point. We modified the figure and added G<sub>32</sub> as an example.

      It would be clearer if the cells were labeled consistently throughout the paper using their Baden cluster numbers rather than switching to the older nomenclature (JAM-B, local edge, alpha, etc), e.g. Fig. 7a,b. 

      In the revised manuscript, we have now changed the nomenclature to the Baden et al. (2016) terminology. We had used the alternative cell type names because, at the point where Fig. 7a is discussed in the manuscript, the cell type matching had not yet been performed. But we agree that a consistent nomenclature is helpful.

      The evidence supporting the sub-clustering of the G32 cells for the two recording methods could have been stronger. In Fig. 5, the BIC difference between 2 and 3 clusters is rather small. Is this result robust enough to justify 3 rather than 2 clusters? The BIC analysis should also be performed on the PSTH data-set to support the notion that the MEA G32 cluster also contains 3 rather than 2 sub-clusters. 

      Regarding the sub-clustering of G<sub>32</sub> into n=2 or n=3 clusters for both datasets, please see our detailed reply #1.1 in our response to the public comments above.

      The alignment of the three sub-clusters across the Ca-imaging and MEA data looked questionable. For example, the cluster 2 and cluster 3 traces in Fig. 5e,f look similar, with cluster 1 being more different. In Fig. 8c on the other hand, cluster 1 and 3 look similar with cluster 2 being more different. The pharmacological results also did not align well. For the Ca-imaging, NO appeared to have a large effect on cluster 1, a more modest effect on cluster 2 and less effect on cluster 3 (Fig. 5f). In comparison, the MEA results diverged, with NO producing the largest effect on cluster 2 and very modest if any effects on clusters 1 and 3 (Fig. 8c). Moreover, the temporal properties of cluster 1 and cluster 3 look very different between the Ca-imaging and MEA data. Without further comment, these differences raise concerns about the reliability of the clustering and the validity of comparisons made across the two sets of experiments. 

      We agree that this is a critical point. Please see our reply #1.2 in our response to the public comments above.

      Fig. 8: Transforming the PSTHs into Ca-traces is important to align the MEA recordings with the Ca-imaging data. It would also be very informative to see a more detailed overall presentation of the PSTH data since it provides a much higher temporal resolution of the responses. For example, illustrating the average PSTHs for the G32 cells under all the experimental conditions could be quite illuminating. 

      To address this point, we added a new Supplementary Fig. S6, which shows the pseudo-Ca<sup>2+</sup> traces for each cluster and condition next to the PSTHs. In addition, we quantified the cumulative firing rate for response features (time windows) where temporal suppression was observed in the Ca<sup>2+</sup> data. This new analysis shows that during NO-application, we can see an increase in firing rate in all clusters. Nevertheless, the effect of NO on the PSTHs is admittedly small and it is better visible in the pseudo-Ca<sup>2+</sup> transformed traces. One possible explanation for this difference may be that the overall firing rates are quite dynamic in G<sub>32</sub> such that a significant increase in “suppression” phases relative to the peak firing appears small.
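
      For readers unfamiliar with this type of transform, the following sketch shows the general idea of turning a PSTH into a pseudo-Ca<sup>2+</sup> trace by convolution. It is an illustration only: the causal exponential kernel and its time constant (`tau_s`) are assumptions chosen for the example and are not the kernel used for Fig. S6.

```python
import math

def pseudo_calcium(psth, dt_s, tau_s=0.5):
    """Convolve a PSTH with a causal, unit-area exponential kernel."""
    # Kernel spans about five time constants; tau_s is a hypothetical value.
    n_kernel = max(1, int(round(5 * tau_s / dt_s)))
    kernel = [math.exp(-k * dt_s / tau_s) for k in range(n_kernel)]
    norm = sum(kernel)
    kernel = [w / norm for w in kernel]
    out = []
    for t in range(len(psth)):
        acc = 0.0
        for k, w in enumerate(kernel):
            if t - k < 0:
                break  # no samples before the start of the recording
            acc += w * psth[t - k]
        out.append(acc)
    return out

# A single brief burst of firing (toy PSTH, spikes/s per 100 ms bin).
trace = pseudo_calcium([0, 0, 10, 0, 0, 0], dt_s=0.1)
print([round(v, 3) for v in trace])
```

      The burst is smeared into a slowly decaying transient, which illustrates why modest changes in firing rate can be easier to see in the transformed traces than in the PSTHs themselves.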

      Reviewer #2 (Public Review):  

      Neuromodulators are important for circuit function, but their roles in the retinal circuitry are poorly understood. This study by Gonschorek and colleagues aims to determine the modulatory effect of nitric oxide on the response properties of retinal ganglion cells. The authors used two photon calcium imaging and multi-electrode arrays to classify and compare cell responses before and after applying a NO donor DETA-NO. The authors found that DETA-NO selectively increases activity in a subset of contrast-suppressed RGC types.

      In addition, the authors found cell-type specific changes in light response in the absence of pharmacological manipulation in their calcium imaging paradigm. While this study focuses on an important question and the results are interesting, the following issues need further clarification for better interpretation of the data. 

      We thank the reviewer for her/his detailed and constructive comments.

      (1) Design of the calcium imaging experiments: the control-control pair has a different time course from the control-drug pair (Fig 1e). First, the control-control pair has a 10 minute interval while the control-drug pair has a 25 minute interval. Second, Control 1 Field 2 was imaged 10 min later than Control 1 Field 1 since the start of the calcium imaging paradigm. 

      Given that the control dataset is used to control for time-dependent adaptational changes throughout the experiment, I wonder why the authors did not use the same absolute starting time of imaging and the same interval between the first and second round of imaging for both the control-control and the control-drug pairs. This can be readily done in one of the two ways: 1. In one set of experiments, add DETA/NO between "Control 1 Field 1" and "Control 2 Field 1" in Fig. 1e as the drug group; or 2. Omit DETA/NO in the Fig. 1e protocol as the control group to monitor the time course of adaptational changes. 

      Thank you for raising this point. We hope that in the following we can clarify the reasoning behind our protocol and the analysis approach.

      (2.1) Initially, we performed these experiments in different ways (also in the sequence suggested by the reviewer), before homing in on the paradigm illustrated in Fig. 1. We chose this paradigm for two reasons: First, we wanted to have for each retina both Ctrl1/Ctrl2 and Ctrl1/NO data sets, to be sure that the time-dependent (adaptational) effects were not related to the general condition of an individual retina preparation. Second, we did not see obvious differences in time-dependent or NO-induced effects between paradigms. Therefore, while we cannot exclude that the absolute time between recordings can affect the observed changes, we do not think that such effects are substantial enough to change our conclusions.

      In the revised manuscript, we now explicitly point at the different intervals. 

      Related to the concern above, to determine NO-specific effect, the authors used the criterion that "the response changes observed for control (ΔR(Ctrl2−Ctrl1)) and NO (ΔR(NO−Ctrl1)) were significantly different". This criterion assumes that without DETA-NO, imaging data obtained at the time points of "Control 1 Field 2" and "DETA/NO Field 2" would give the same value of ΔR as ΔR(Ctrl2−Ctrl1) for all RGC types. It is not obvious to me why this should be the case, because of the unknown time-dependent trajectory of the adaptational change for each RGC type. For example, an RGC type could show stable response in the first 30 min and then change significantly in the following 30 min. DETA/NO may counteract this adaptational change, leading to the same ΔR as the control condition (false negative). Alternatively, DETA/NO may have no effect, but the nonlinear time-dependent response drift can give false positive results. 

      (2.2) Initially, we assumed that after adapting the retina to a certain light level, RGCs exhibit stable responses over time, such that when adding a pharmacological agent, we can identify drug-induced response changes (e.g., by calculating the response difference). To our surprise, we found that for some RGC types the responses changed between the first and the second recording (referred to as cell type-specific adaptational effects), which is why we devised the Ctrl1/Ctrl2 vs. Ctrl1/NO analysis. 

      The reviewer is correct in that we assume in our analysis that the adaptational- and NO-induced effects are independent and sum linearly. Further, we agree with the reviewer that there may be other possibilities, two of which are highlighted by the reviewer:

      (a) Interaction: for instance, if NO compensates for the adaptational effect, we would not be able to measure this; or, if this compensation was partial, underestimate both effects. 

      (b) More complex time-dependency: for example, an RGC type may show a pronounced adaptational effect only with a longer delay (i.e. only after the second scan), or a very transient NO effect may already have disappeared by the time we perform the second scan. On the one hand, as we only can take snapshots of the RGC responses, we cannot exclude these possibilities. On the other hand, both effects (adaptational- and NO-dependent) were type-specific and reproducible between experiments (also with varying timing, see reply #2.1), which makes complex time dependencies less likely.

      The revised manuscript now reflects these limitations of our recording paradigm and points out which effects can be detected, and which likely not.

      I also wonder why washing-out, a standard protocol for pharmacological experiments, was not done for the calcium protocol since it was done in the MEA experiments. A reversible effect by washing in and out DETA/NO in the calcium protocol would provide a much stronger support that the observed NO modulation is due to NO and not to other adaptive changes. 

      (2.3) We agree that a clear wash-out would strengthen our findings. Indeed, in the beginning of our experiments, we tried to wash-out the agent in the Ca<sup>2+</sup> recordings, as we did in the MEA recordings. We soon stopped doing this in the Ca<sup>2+</sup> experiments, because response quality decreased for the third scan of the same field, likely due to bleaching of fluorescent indicator and photopigment. This is why we typically restrict the total recording time of the same field of RGCs to about 30 min (~ two scans with all light stimuli). Moreover, our MEA data showed that DETA/NO can largely be washed-out, which supports that we observed NO-specific effects. Therefore, we decided against further attempts to establish the wash-out also in the Ca<sup>2+</sup> experiments (e.g., shortening the recording time by presenting fewer light stimuli).

      (2) Effects of Strychnine: In lines 215-219, "In the light-adapted retina, On-cone BCs boost light-Off responses in Off-cone BCs through cross-over inhibition (83, 84) and hence, strychnine affects Off-response components in RGCs - in line with our observations (Fig. S2)". However, Fig. S2 doesn't seem to show a difference in the Off-response components. Rather, the On response is enhanced with strychnine. In addition, suppressed-by-contrast cells are known to receive glycinergic inhibition from VGluT3 amacrine cells (Tien et al., 2016). However, the G32 cluster in Fig. S2 doesn't seem to show a change with strychnine. More explanation on these discrepancies will be helpful.

      (2.4) We thank the reviewer for this comment. Regarding the first part, we agree that the figure does not support differences in the Off-response components. We therefore rephrased the corresponding text accordingly. Additionally, we now show all RGC types with n>3 cells per recording condition in the revised Suppl. Fig. S2 and added statistics.

      Regarding the second part, there are several possible explanations for these discrepancies:

      (a) The SbC (transient Off SbC) studied in Tien et al. (2016) likely corresponds to the RGC type G<sub>28</sub> (see Höfling et al. 2024). As mentioned above (see reply #1.3), it is unclear if G<sub>32</sub> corresponds to a previously described SbC, and if so, to which. Goetz et al. (2022) proposed that G<sub>32</sub> may align with the bursty-SbC (bSbC) type (their Supplemental Table 3), as described also by Wienbar and Schwartz (2022). An important feature of the bSbC type is that its contrast response function is mainly driven by intrinsic properties rather than synaptic input. If G<sub>32</sub> indeed included the bSbC, this may explain why strychnine does not interfere with the suppression of temporal contrast.

      (b) In Tien et al. (2016), the authors genetically removed the VG3-ACs (see their Fig. 3) and show that this ablation reduces the inhibition of tSbC cells in a stimulus size-dependent manner. Specifically, larger light stimuli (600 µm) only show marginal effects on the IPSCs and inhibitory synaptic conductance (see their Figs. 3c,d and 3e,f, respectively). In our study, the full-field chirp had a size of 800 x 600 µm. Therefore – and assuming that G<sub>32</sub> indeed included tSbCs – our observation that strychnine did not affect temporal suppression in the full-field chirp responses would be in line with Tien et al. (2016).   

      (3) This study uses DETA-NO as an NO donor for enhancing NO release. However, a previous study by Thompson et al., Br J Pharmacol. 2009 reported that DETA-NO can rapidly and reversibly induce a cation current independent of NO release at the 100 µM used in the current study, which could potentially cause the observed effect in G32 cluster such as reduced contrast suppression and increased activity. This potential caveat should at least be discussed, and ideally excluded by showing the absence of DETA-NO effects in nNOS knockout mice, and/or by using another pharmacological reagent such as the NO donor SNAP or the nNOS inhibitor l-NAME. 

      Thank you for pointing out this potential caveat. We certainly cannot exclude such side effects. However, we think that this explanation of our observations is unlikely, because Thompson et al. barely see effects at 100 µM DETA/NO; in fact, their data suggest that clear NO-independent effects on the cation-selective channel occur at much higher DETA/NO concentrations, such as 3 mM. 

      In any case, in the revised manuscript, we refer to this paper in the Discussion.

      (4) Clarification of methods: In the Methods, lines 1119-1127, the authors describe the detrending, baseline subtraction, and averaging. Then, line 1129, "the mean activity r(t) was computed and then traces were normalized such that: max<sub>t</sub>(|r(t)|) = 1". How is the normalization done? Is it over the entire recording (control and wash in) for each ROI? Or is it normalized based on the mean trace under each imaging session (i.e. twice for each imaging field)? 

      The normalization (z-scoring) was done for each ROI individually per stimulus and condition (Ctrl 1, Ctrl 2, DETA/NO). We normalized the traces, because the absolute Ca<sup>2+</sup> signal depends on factors, such as “resting” state of the cell (e.g., silent vs. baseline spiking activity in the absence of a light stimulus) and its fluorescent dye concentration. This also means that absolute response amplitudes are difficult to interpret. Hence, we focused on analyzing relative changes per ROI and condition, which still allowed us to investigate adaptational and drug-induced effects. In the revised manuscript, we changed the corresponding paragraph for clarification.
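
      To make the per-ROI, per-condition normalization concrete, here is a minimal sketch. It follows the Methods sentence quoted by the reviewer (scaling so that the maximum absolute value is 1) and the convention that the pre-stimulus level is defined as zero; the function name, the number of baseline samples, and the toy trace are hypothetical, and the actual pipeline may differ in detail.

```python
def normalize_trace(trace, n_baseline):
    """Set the pre-stimulus level to zero, then scale so that max |r(t)| is 1."""
    if n_baseline <= 0 or n_baseline > len(trace):
        raise ValueError("n_baseline must be between 1 and len(trace)")
    # Baseline: mean of the samples recorded before the light stimulus.
    baseline = sum(trace[:n_baseline]) / n_baseline
    shifted = [x - baseline for x in trace]
    peak = max(abs(x) for x in shifted)
    if peak == 0:
        return shifted  # flat trace: nothing to scale
    return [x / peak for x in shifted]

# Applied separately to each ROI, stimulus, and condition (toy values).
raw = [2.0, 2.0, 4.0, 1.0, 2.0]
print(normalize_trace(raw, n_baseline=2))  # [0.0, 0.0, 1.0, -0.5, 0.0]
```

      Because each trace is scaled on its own, absolute amplitudes are lost, and only the relative shape of the response within one ROI and condition can be compared.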

      As for the clustering of RGC types, I assume that each ROI's cluster identity remains unchanged through the comparison. If so, it may be helpful to emphasize this in the text.

      Yes, this is correct. We identified G<sub>32</sub> RGCs based on their Ctrl1 responses and then compared these responses with those for Ctrl2 or NO. We now clarified this in the revised manuscript.

      Reviewer #2 (Recommendations For The Authors):  

      The manuscript would benefit from a discussion of how the findings in this study relate to known mechanisms of NO modulation and previously reported effects of NO manipulations on RGC activity. 

      Thank you for the recommendation. We already refer to known mechanisms of NO within the retina in the Introduction. In the revised manuscript, we now added information to the Discussion.

      In the abstract, "a paired-recording paradigm" could be misleading because paired recording generally refers to the simultaneous recording of two neurons. However, the paradigm in this study is essentially imaging experiments done at two time points. 

      We agree with the reviewer. To avoid any confusion with paired electrophysiological recordings, we changed the term “paired-recording paradigm” to “sequential recording paradigm” and replaced the term “pair-/ed” with “sequentially recorded”.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In this elegant and thorough study, Sánchez-León et al. investigate the effects of tDCS on the firing of single cerebellar neurons in awake and anesthetized mice. They find heterogeneous responses depending on the orientation of the recorded Purkinje cell.

      Strengths:

      The paper is important in that it may well explain part of the controversial and ambiguous outcomes of various clinical trials. It is a well-written paper on a deeply analyzed dataset.

      We sincerely thank Reviewer #1 for their positive feedback and insightful comments. We are pleased to know that you found our study elegant and thorough, and we appreciate your recognition of its potential to clarify the controversial and ambiguous outcomes seen in various clinical trials. Your acknowledgment of the depth of our analysis and the clarity of the writing is highly encouraging, and we are grateful for your thoughtful evaluation of our work.

      Weaknesses:

      The sample size could be increased for some of the experiments.

      We sincerely thank the reviewer for their thoughtful suggestion to increase the sample size. While we understand the importance of this consideration, we believe it is not feasible at this stage due to several factors. First, the complexity of our experiments, which include single-neuron recordings in awake animals during electric field application, juxtacellular neurobiotin injections post-tDCS (with a low success rate), and high-density recordings from Purkinje cells across different layers in awake animals, significantly limits the throughput of data collection. Second, the statistical outcomes obtained from our analyses, which combine multiple techniques, are robust and provide a strong basis for our conclusions. Third, the current study already involves a substantial number of animals (74 mice), which aligns with ethical considerations for minimizing animal use while ensuring robust results.

      We believe that the current sample size is sufficient to support the findings presented in the manuscript. Expanding the sample size further would require considerable additional resources and time, without a clear indication that it would fundamentally alter the conclusions of the study. We are grateful for the reviewer’s understanding of these limitations and their acknowledgment of the value of the current dataset.

      Reviewer #2 (Public review):

      Summary:

      In this study by Sánchez-León and colleagues, the authors attempted to determine the influence of neuronal orientation on the efficacy of cerebellar tDCS in modulating neural activity. To do this, the authors made recordings from Purkinje cells, the primary output neurons of the cerebellar cortex, and determined the inter-dependency between the orientation of these cells and the changes in their firing rate during cerebellar tDCS application.

      Strengths:

      (1) A major strength is the in vivo nature of this study. Being able to simultaneously record neural activity and apply exogenous electrical current to the brain during both an anesthetized state and during wakefulness in these animals provides important insight into the physiological underpinnings of tDCS.

      (2) The authors provide evidence that tDCS can modulate neural activity in multiple cell types. For example, there is a similar pattern of modulation in Purkinje cells and non-Purkinje cells (excitatory and inhibitory interneurons). Together, these data provide holistic insight into how tDCS can affect activity across different populations of cells, which has important implications for basic neuroscience and for clinical populations where there may be non-uniform or staged effects of neurological disease on these various cell types.

      (3) There is a systematic investigation into the effects of tDCS on neural activity across multiple regions of the cerebellum. The authors demonstrate that the pattern of modulation is dependent on the target region. These findings have important implications for determining the expected neuromodulatory effects of tDCS when applying this technique over different target regions noninvasively in animals and humans.

      We sincerely thank Reviewer #2 for their detailed and thoughtful comments on our study. We are pleased that you recognized the importance of our in vivo approach, allowing for simultaneous neural recordings and tDCS application in both anesthetized and awake states. Your acknowledgment of our findings regarding the modulation of neural activity across different cell types, including Purkinje and non-Purkinje cells, is greatly appreciated. We also value your recognition of the implications of our work for understanding how tDCS can affect diverse neuronal populations, particularly in the context of clinical applications. Additionally, your positive feedback on our systematic investigation across multiple cerebellar regions highlights the relevance of our work for determining the region-specific effects of tDCS. Thank you for your encouraging and insightful evaluation.

      Weaknesses:

      (1) In the introduction, there is a lack of context regarding why neuronal orientation might be a critical factor influencing the responsiveness to tDCS. The authors allude to in vitro studies that have shown neuronal orientation to be relevant for the effects of tDCS on neural activity but do not expand on why this might be the case. These points could be better understood by informing the reader about the uniformity/non-uniformity of the electric field induced by tDCS. In addition, there is a lack of an a priori hypothesis. For example, would the authors have expected neuronal orientation parallel or perpendicular to the electric field to be related to the effects of tDCS on neural activity?

      We thank Reviewer #2 for this insightful comment. In response, we have added two new paragraphs to the Introduction that give clearer context on how neuronal orientation influences the effects of tDCS.

      “For neurons whose somatodendritic axis is aligned with the electric field, the field induces a pronounced somatic polarization. In the case of anodal stimulation, where the positive electrode is positioned near the dendrites and the soma is oriented away, positively charged ions accumulate near the soma, leading to depolarization and increased excitability, thus facilitating action potential generation. Conversely, neurons whose orientation opposes the field, such as when the soma is closer to the positive electrode and the dendrites face away, experience hyperpolarization, reducing excitability. Lastly, neurons oriented perpendicular to the electric field would exhibit minimal somatic polarization, as the field does not induce significant redistribution of charges along the somatodendritic axis.”

      Additionally, we have now clarified our a priori hypothesis regarding neuronal orientation and its expected influence on tDCS efficacy.

      “We hypothesized that the orientation of PCs relative to the electric field would influence the effects of tDCS on neural activity. In the Vermis, PCs oriented parallel to the field are expected to exhibit stronger effects due to greater somatic polarization, leading to depolarization or hyperpolarization depending on the orientation of the somatodendritic axis. Conversely, PCs in Crus I/II, which are oriented obliquely to the field, are expected to exhibit intermediate effects, as the oblique alignment reduces the strength of polarization compared to parallel alignment.”
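      The orientation dependence described in the two quoted paragraphs is often summarized, to first order, by projecting the electric field onto the somatodendritic axis. The sketch below is illustrative only and is not taken from the manuscript: the function name, the linear cosine model, and the coupling constant of 0.1 mV per V/m are assumptions chosen for the example.

```python
import math


def somatic_polarization_mv(field_v_per_m, angle_deg, coupling_mv_per_v_per_m=0.1):
    """First-order estimate of somatic polarization in mV.

    angle_deg is the angle between the electric field vector and the
    dendrite-to-soma axis (0 means the field points from dendrites to soma).
    coupling_mv_per_v_per_m is a hypothetical placeholder, not a measured value.
    """
    return coupling_mv_per_v_per_m * field_v_per_m * math.cos(math.radians(angle_deg))


# Same field strength (10 V/m, illustrative) applied at four orientations.
for label, angle in [("aligned", 0), ("oblique", 45), ("perpendicular", 90), ("opposed", 180)]:
    print(f"{label:>13}: {somatic_polarization_mv(10.0, angle):+.3f} mV")
```

      Under this simplification, an aligned cell depolarizes, an opposed cell hyperpolarizes, a perpendicular cell is essentially unaffected, and an oblique cell falls in between, which mirrors the ordering stated in the hypothesis.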

      (2) It is unclear how specific stimulation parameters were determined. First, how were the tDCS intensities used in the present experiments determined/selected, and how does the relative strength of this induced electric field equate to the intensities used non-invasively during tDCS experiments in humans? Second, there is also a fundamental difference in the pattern of application used here (e.g., 15 s pulses separated by 10 s of no stimulation) compared to human studies (e.g., 10-20 min of constant stimulation).

      We thank Reviewer #2 for their observations. To address these concerns, we have included the following text in the Discussion section of the main manuscript:

      “We used higher values than those applied in human experiments to achieve more reliable results. As seen in Supplementary Fig. 3, neurons are modulated in a similar way for 100, 200 or 300 µA but higher intensities elicited significant changes in a greater proportion of these neurons. In addition, a previous study from our lab<sup>23</sup> using the same methodology showed that 100, 200 and 300 µA (eliciting from 5.9 to 125.7 V/m in the current study) were ideal to obtain reliable and robust results in neuronal modulation, while keeping animal awareness of the stimulation at a minimum level. Besides, Asan et al. have recently shown that using epidural stimulation in anesthetized rats under an electric field closer to human studies (1.5–7.5 V/m) was also able to modulate the activity of cerebellar neurons.”

      In addition, we added the following text to the Results section under ‘tDCS modulates Purkinje cell activity in awake mice in a heterogeneous manner’:

      “This protocol allows us to avoid the development of plasticity effects, which are known to require at least several minutes of tDCS administration, and to test the direct electrical modulation exerted by the externally applied currents.”

      (3) In their first experiment, the authors measure the electric field strength at increasing depths during increasing stimulation intensities. However, it appears that an alternating current rather than a direct current, which is usually employed in tDCS protocols, was used. There is a lack of rationale regarding why the alternating current was used for this component. Typically, this technique is more commonly used for entraining/boosting neural oscillations compared to studies using tDCS which aim to increase or decrease neural activity in general.

      We appreciate Reviewer #2’s assessment of the differences between tDCS and tACS. We will clarify this distinction. We chose tACS for measuring electric field strength for two main reasons:

      • Amplifier Limitations: The amplifiers commonly used in electrophysiology are designed to filter out low-frequency components, including direct current (DC) signals, using a high-pass filter. This is because the neuronal signals of interest, such as action potentials, typically occur at higher frequencies (several Hz to kHz). Consequently, any DC signal applied is filtered out from the recordings, preventing us from measuring changes in voltage effectively.

      • Impedance Changes: DC stimulation can alter the impedance of electrodes and surrounding tissue over time. To mitigate this effect and maintain stable recordings, it is advantageous to frequently alternate the polarity and intensity of the stimulation.

      The following text has been included in the 'Transcranial Electrical Stimulation' section of the 'Materials and Methods' part of the manuscript:

      “We selected tACS to measure electric field strength due to two main reasons: (1) amplifiers used in electrophysiology filter out low-frequency signals like DC, making voltage changes from tDCS undetectable, and (2) DC stimulation can alter electrode and tissue impedance over time, whereas alternating the polarity in tACS helps maintain stable recordings.”

      It is important to note that our aim with tACS is to provide an approximation of current propagation through the tissue, rather than to exactly replicate the baseline conditions encountered during continuous tDCS stimulation.
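      The amplifier limitation described above can be illustrated with a simple first-order high-pass filter. This is a minimal sketch under stated assumptions, not a model of the actual recording hardware: the 1 Hz cutoff, the 1 kHz sampling rate, and the 10 Hz test sine are hypothetical values chosen only to show the principle.

```python
import math


def high_pass(signal, fs_hz, cutoff_hz):
    """First-order RC high-pass filter (discrete-time approximation)."""
    dt = 1.0 / fs_hz
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = rc / (rc + dt)
    out = [0.0]
    for n in range(1, len(signal)):
        # Only sample-to-sample changes pass through; a constant level decays away.
        out.append(alpha * (out[-1] + signal[n] - signal[n - 1]))
    return out


fs = 1000.0  # assumed sampling rate in Hz
n_samples = 3000
# A DC step (tDCS-like) switching on after 1 s, and a 10 Hz sine (tACS-like).
dc_step = [0.0 if i < 1000 else 1.0 for i in range(n_samples)]
ac_sine = [math.sin(2.0 * math.pi * 10.0 * i / fs) for i in range(n_samples)]

dc_out = high_pass(dc_step, fs, cutoff_hz=1.0)
ac_out = high_pass(ac_sine, fs, cutoff_hz=1.0)

dc_residual = abs(dc_out[-1])                       # what is left of the step after 2 s
ac_amplitude = max(abs(v) for v in ac_out[-1000:])  # sine amplitude in the last second
print(f"DC residual: {dc_residual:.6f}, AC amplitude: {ac_amplitude:.3f}")
```

      In this toy example the sustained DC level vanishes from the filtered trace while the alternating signal passes almost unattenuated, which is why an alternating current makes the induced voltage measurable.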

      Reviewer #3 (Public review):

      Summary:

      In this study, Sanchez-Leon et al. combined extracellular recordings of Purkinje cell activity in awake and anesthetized mice with juxtacellular recordings and Purkinje cell staining to link Purkinje cell orientation to their stimulation response. The authors find a relationship between neuron orientation and firing rate, dependent on stimulation type (anodal/cathodal). They also show the effects of stimulation intensity and rebound effects.

      Strengths:

      Overall, the work is methodologically sound and the manuscript is well written. The authors have taken great care to explain their rationale and methodological choices.

      We sincerely thank Reviewer #3 for their positive feedback and constructive comments regarding our study. We are pleased that you found our work methodologically sound and well written. Your acknowledgment of our efforts to explain our rationale and methodological choices is greatly appreciated. We believe that the insights gained from linking Purkinje cell orientation to their stimulation response will contribute significantly to our understanding of cerebellar function and tDCS effects. Thank you for your thoughtful evaluation of our manuscript.

      Weaknesses:

      My only reservation is the lack of reporting of the precise test statistics, p-values, and multiple comparison corrections. The work would benefit from adding this and other information.

      We sincerely thank Reviewer #3 for their valuable feedback and for highlighting an important aspect of our analysis. We agree that the inclusion of precise test statistics, p-values, and details on multiple comparison corrections would strengthen the robustness of our findings. In response to your suggestion, we have now added this information to the Results section, ensuring that all statistical tests, exact p-values, and corrections for multiple comparisons are clearly reported. We believe these additions provide greater transparency and rigor to our analysis, and we appreciate your thoughtful recommendation.

      Major Comments:

      (1) The authors should report the exact test statistics. These are missing for all comparisons and hinder the reader from understanding what exactly was tested for each of the experiments. For example, having the exact test statistics would help better understand the non-significant differences in Figure 1h where there is at least a numeric difference in CS firing rate during tDCS.

      As mentioned before, we have now included the precise test statistics for all statistical comparisons throughout the manuscript. Specifically, in the case of Supplementary Figure 1h, we have added the exact values for the comparisons of CS firing rates during tDCS, even for nonsignificant differences, to ensure transparency and to clarify the observed numerical differences. We believe these additions will help readers better interpret the data and understand the statistical underpinnings of our findings. 

      However, given the large amount of data analyzed, particularly related to individual neuronal activity, it is not feasible to present all of the data for each individual neuron. We have aimed to provide a comprehensive statistical summary without overwhelming the reader with an excessive amount of detailed data.

      (2) Did the authors apply any corrections for multiple comparisons? Generally, it would be helpful if they could clarify the statistical analysis (which values were subjected to the tests, how many tests were performed for each question, etc.).

      We appreciate the reviewer’s comment regarding the need for clarification on the statistical analysis and the application of multiple comparison corrections. In response, we have updated the main text to include all the requested information. Specifically, we have added the appropriate multiple comparison tests (Tukey's or Nemenyi) where applicable to each analysis. These corrections have been applied to ensure that the results are robust and account for the number of comparisons made. We have also clarified the specific tests used for each analysis, the values subjected to these tests, and the number of comparisons performed for each question. This information is now detailed in the Methods section under 'Statistical Analysis' for transparency and to aid in the interpretation of the results.

      (3) The relationship shown in Figure 2g seems to be influenced by the two outliers. Have the authors confirmed the results using a robust linear regression method?

      We agree with the reviewer that the two neurons in Figure 2g could appear as outliers. To address this, we applied the ROUT method with a stringent Q = 1% to detect potential outliers, and none were found. In addition, we have confirmed the robustness of our results by performing a complementary analysis using robust linear regression methods (e.g., M-estimators), which showed consistent findings with our original analysis. For this purpose, we used the 'Huber' loss function, which combines least squares with robustness against outliers. The regression line obtained with this method (y = -0.5650x + 157.4556) differs minimally from the originally presented value, with the p-value of the slope and the intercept being p = 1.4846 × 10<sup>-4</sup> (t<sub>(22)</sub> = -4.5740) and p = 1.1382 × 10<sup>-11</sup> (t<sub>(22)</sub> = 12.8010), respectively. Author response image 1 shows both regression fits to facilitate their comparison. These additional steps ensure the reliability of the relationship observed in the figure, even when accounting for the potential influence of the two data points.

      Author response image 1.
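      For readers unfamiliar with Huber-type M-estimation, the following sketch shows one common way to compute such a fit, using iteratively reweighted least squares. It is not the code used for the analysis above: the data are hypothetical, the helper names are invented for the example, and the tuning constant k = 1.345 and the MAD-based scale estimate are conventional choices assumed here.

```python
import statistics


def weighted_line_fit(x, y, w):
    """Weighted least-squares fit of y = slope * x + intercept."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def huber_line_fit(x, y, k=1.345, n_iter=50):
    """Huber M-estimate of a line via iteratively reweighted least squares."""
    weights = [1.0] * len(x)
    slope, intercept = weighted_line_fit(x, y, weights)  # start from ordinary least squares
    for _ in range(n_iter):
        resid = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
        med = statistics.median(resid)
        # Robust estimate of the residual spread from the median absolute deviation.
        scale = statistics.median([abs(r - med) for r in resid]) / 0.6745
        if scale == 0:
            break
        # Huber weights: 1 inside the threshold, shrinking as k * scale / |r| outside it.
        weights = [1.0 if abs(r) <= k * scale else k * scale / abs(r) for r in resid]
        slope, intercept = weighted_line_fit(x, y, weights)
    return slope, intercept


# Hypothetical data: y = 2x + 1 with small deviations and one gross outlier at x = 9.
x = list(range(10))
noise = [0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3, 0.0, 0.0]
y = [2 * xi + 1 + ni for xi, ni in zip(x, noise)]
y[9] = 60.0

ols_slope, ols_intercept = weighted_line_fit(x, y, [1.0] * len(x))
huber_slope, huber_intercept = huber_line_fit(x, y)
print(f"OLS:   y = {ols_slope:.3f}x + {ols_intercept:.3f}")
print(f"Huber: y = {huber_slope:.3f}x + {huber_intercept:.3f}")
```

      When a robust fit and an ordinary least-squares fit nearly coincide, as reported above for Figure 2g, the extreme points are not driving the relationship; in the toy data here the two fits diverge precisely because the outlier does dominate the ordinary fit.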

      (4) The authors conclude that tDCS modulates vermal PCs more than Crus I/II PCs - but they don't seem to test this statistically. It would be helpful to submit the firing rate change values to an actual statistical test to conclude this directly from the data.

      We agree that it would be appropriate to apply a statistical test to determine whether there is similarity in the level of modulation. To this end, we have normalized the modulation so that all data are positive. For example, a neuron that increases or decreases its activity by 50% relative to the baseline period will be considered as having a modulation of 50% in both cases. This yields a mean modulation of 9.42% for neurons recorded in Crus I/II and 62.35% for those in the Vermis. Since the two distributions do not meet the normality assumption (Shapiro-Wilk test), we used a Mann-Whitney test, which resulted in a p-value < 0.0001, thus demonstrating a significant difference in modulation between the two cerebellar regions analyzed. We added this information to the main text. Additionally, we included a new panel in Supplementary Figure 3 (Supplementary Figure 3i) to visually represent these data.
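      The two steps described above (taking the absolute value of the modulation and comparing regions with a Mann-Whitney test) can be sketched as follows. The values are hypothetical and are not the recorded data; the function names are invented for the example, and the p-value uses the normal approximation without a tie correction, which is an assumption that may differ from the statistical software used for the manuscript.

```python
import math


def absolute_modulation(percent_changes):
    """Treat a 50% increase and a 50% decrease as the same 50% modulation."""
    return [abs(v) for v in percent_changes]


def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test (normal approximation, no tie correction)."""
    pooled = sorted((value, group) for group, data in enumerate((a, b)) for value in data)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # tied values share their mean rank
        for idx in range(i, j + 1):
            ranks[idx] = average_rank
        i = j + 1
    n1, n2 = len(a), len(b)
    rank_sum_a = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_a = rank_sum_a - n1 * (n1 + 1) / 2
    u = min(u_a, n1 * n2 - u_a)
    z = (u - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u, p_value


# Hypothetical firing-rate changes (% of baseline) for two regions.
crus = absolute_modulation([5, -8, 12, -3, 10, -15, 7, -9])
vermis = absolute_modulation([60, -45, 80, -70, 55, -90, 65, -50])
u_stat, p_val = mann_whitney_u(crus, vermis)
print(f"U = {u_stat}, p = {p_val:.5f}")
```

      Taking absolute values first matters because increases and decreases would otherwise cancel in a region where cells respond with opposite signs, understating how strongly that region is modulated.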

      Reviewer #1 (Recommendations for the authors):

      I have several suggestions to further improve the paper:

      (1) It remains unclear how many tDCS trials were done during each single-cell recording. What were the inclusion criteria? Were tens of trials done per cell or was a cell already included if the recording was stable during a few trials? Please clarify.

      For every single-cell recording, we applied the maximum number of trials that the recording stability allowed. A neuron was included in the analysis if the recording was stable for at least 2 trials at a given intensity and polarity, up to a maximum of 1 hour of recording. We have added a paragraph to the Methods section explaining this.
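      The inclusion rule can be expressed as a short filter. This is only a sketch of the stated criterion: the dictionary keys, the function name, and the reading of the 1-hour limit as a cap on trial start time are assumptions made for the example.

```python
from collections import Counter

MIN_TRIALS = 2          # at least 2 stable trials per intensity and polarity
MAX_RECORDING_S = 3600  # recordings were followed for up to 1 hour


def included_conditions(trials):
    """Return the (intensity, polarity) pairs that meet the inclusion rule.

    Each trial is a dict with hypothetical keys:
    'intensity_ua', 'polarity', 'stable', and 'start_s'.
    """
    counts = Counter(
        (t["intensity_ua"], t["polarity"])
        for t in trials
        if t["stable"] and t["start_s"] <= MAX_RECORDING_S
    )
    return {condition for condition, n in counts.items() if n >= MIN_TRIALS}


# Hypothetical trial log for one neuron.
trials = [
    {"intensity_ua": 200, "polarity": "anodal", "stable": True, "start_s": 10},
    {"intensity_ua": 200, "polarity": "anodal", "stable": True, "start_s": 35},
    {"intensity_ua": 200, "polarity": "cathodal", "stable": True, "start_s": 60},
    {"intensity_ua": 200, "polarity": "cathodal", "stable": False, "start_s": 85},
]
included = included_conditions(trials)
print(included)
```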

      (2) Along the same line, could the authors show cell responses to individual consecutive trials? Do the responses change over time? For example, does a cell increase the firing rate more during early trials compared to late trials? Please clarify.

      We appreciate the reviewer’s suggestion to investigate whether cell responses change over consecutive trials. In our data, when tDCS effects were observed, the changes in firing rate were evident from the very first trials in some neurons. To illustrate this, we have included Author response image 2, which shows examples of individual neuron responses (2 non-PC on the left and 2 PC on the right) across consecutive trials. Red and blue histogram bars indicate anodal and cathodal tDCS periods, respectively.

      Author response image 2.

      However, a rigorous analysis of the stimulation effect over time across trials was not feasible due to the considerable variability in the number of trials applied to different recorded neurons. This variability arose from differences in the duration for which stable recordings could be maintained.

      Despite this limitation, the early responses to tDCS provide valuable insights into the immediate effects of stimulation on neuronal activity.

      (3) Neurons are recorded very superficially, just below a 2 mm wide craniotomy. The temperature of the brain is likely lower than a normal physiological temperature. Did the authors consider the potential effects of temperature? Please address.

      We acknowledge the reviewer's concern regarding the potential effects of temperature on the recorded neurons. While it is challenging to precisely control the temperature of the tissue in the recording area, it is important to note that the temperature conditions were consistent across both the control and stimulation phases of the experiment. This consistency ensures that any potential effects of temperature are evenly distributed across conditions, thereby minimizing its impact on the observed changes in neuronal activity. Furthermore, although the recordings are conducted 2 mm below the craniotomy, this region is continuously bathed in saline, with an additional 3 mm of fluid maintained at physiological temperature, effectively preventing dehydration and cooling of the surface tissue. 

      (4) More general, but along the same line, is there any effect of the depth of the recorded cells on its response to stimulations for any of the data collected in this study? Figure 1 nicely shows that there is a significant electric field at depths up to 4 mm, but do more superficial cells have stronger/weaker responses to cathodal/anodal stimulation, as the electric field there is much stronger?

      We also expected to see some correlation between depth and degree of modulation. However, a linear regression analysis showed very low R<sup>2</sup> values (see Author response images 3-6), suggesting a negligible correlation between depth of recording and neuronal activity modulation. We did this analysis for Purkinje and non-Purkinje cells separately, as well as for recordings in Crus I/II or Vermis, with similarly negative results in all cases.

      Author response image 3.

      Author response table 1.

      Author response image 4.

      Author response table 2.

      Author response image 5.

      Author response table 3.

      Author response image 6.

      Author response table 4.
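      As a reminder of how the R<sup>2</sup> values in these regressions are obtained, the sketch below fits a line to depth versus modulation and reports the fraction of variance explained. The numbers are hypothetical and do not reproduce the data in the response images or tables; the function name is invented for the example.

```python
def linear_fit_r2(x, y):
    """Ordinary least-squares line fit returning slope, intercept, and R^2."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Hypothetical recording depths (mm) and firing-rate modulation (% of baseline).
depth_mm = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
modulation_pct = [40.0, -10.0, 25.0, -30.0, 35.0, -5.0]
slope, intercept, r2 = linear_fit_r2(depth_mm, modulation_pct)
print(f"slope = {slope:.2f} %/mm, R^2 = {r2:.3f}")
```

      An R<sup>2</sup> close to zero means that depth explains almost none of the cell-to-cell variance in modulation, which is the pattern reported for all four subsets.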

      (5) The authors are recording the movements of the mouse on a treadmill. Was there any correlation between tDCS and behavior? And between behavior and firing patterns? Please address.

      We appreciate the reviewer’s question regarding the potential correlation between tDCS and behavior, as well as between behavior and firing patterns. In our experimental setup, the movement of the mouse typically introduces electrical artifacts in the recordings, particularly during running on the treadmill. To ensure the accuracy of our data, trials that coincided with running or other significant movements were excluded from the analysis. This is explained in the Methods section of the main text under 'Data analysis' within the description of how single-cell activity was processed. Moreover, because animal movement or specific behaviors may themselves modulate neuronal firing rates, excluding these trials also avoids confounding movement-related changes with the effects of current application.

      (6) The strength of the electrical field seems highly variable. Do the authors have an explanation for this? Please address.

      We appreciate the reviewer’s observation regarding the variability in the strength of the electric field. This variability is indeed expected, given the inherent inter-individual differences in skull thickness across animals (which, as discussed in the main manuscript, attenuates around 20% of the current), as well as slight variations in the precise placement of the tES active electrode during surgery. These factors can lead to fluctuations in the electric field, although they remain within the same order of magnitude.

      (7) As the authors stated, even for cells recorded at a depth of over 2 mm, the electric fields are still much higher than the fields generated in human studies. Why were there no comparable strengths used? Please address.

      We thank the reviewer for raising this important point. Previous studies from our lab (Sánchez-León et al., 2021) demonstrated minimal modulation in neuronal activity (LFP) when using tDCS intensities below 200 µA in awake animals. To achieve stronger and more consistent effects, we selected an intensity of 200 µA for our experiments. It is well-established that small animals, such as mice, require higher electric field strengths than humans to induce observable effects (Ozen et al., 2010; Vöröslakos et al., 2018; Asan et al., 2020). This discrepancy may be attributed to several factors, including differences in neuronal density within the stimulated networks (Herculano-Houzel, 2009), as well as variations in axonal length and diameter (Chakraborty et al., 2018). However, as we stated in the Discussion, we also found modulated neurons for electric fields close to those in humans:

      “Importantly, we observe clear firing rate modulation of PCs and non-PCs at depths of 2.3 mm and tDCS intensity of 100 μA, where the measured electric field is as low as 5.9 V/m.”

      Despite these limitations, animal models remain invaluable for obtaining high-resolution invasive data that cannot be collected in human studies. Such experiments are crucial for understanding the basic mechanisms underlying non-invasive brain stimulation, validating computational models, and exploring the therapeutic potential of these techniques for various neurological conditions.

      References:

      Asan, A. S., Lang, E. J., & Sahin, M. (2020). Entrainment of cerebellar purkinje cells with directional AC electric fields in anesthetized rats. Brain stimulation, 13(6), 1548–1558. https://doi.org/10.1016/j.brs.2020.08.017 

      Chakraborty, D., Truong, D. Q., Bikson, M., & Kaphzan, H. (2018). Neuromodulation of Axon Terminals. Cerebral cortex (New York, N.Y. : 1991), 28(8), 2786–2794. https://doi.org/10.1093/cercor/bhx158

      Herculano-Houzel S. (2009). The human brain in numbers: a linearly scaled-up primate brain. Frontiers in human neuroscience, 3, 31. https://doi.org/10.3389/neuro.09.031.2009

      Ozen, S., Sirota, A., Belluscio, M. A., Anastassiou, C. A., Stark, E., Koch, C., & Buzsáki, G. (2010). Transcranial electric stimulation entrains cortical neuronal populations in rats. The Journal of neuroscience : the official journal of the Society for Neuroscience, 30(34), 11476–11485. https://doi.org/10.1523/JNEUROSCI.5252-09.2010

      Vöröslakos, M., Takeuchi, Y., Brinyiczki, K., Zombori, T., Oliva, A., Fernández-Ruiz, A., Kozák, G., Kincses, Z. T., Iványi, B., Buzsáki, G., & Berényi, A. (2018). Direct effects of transcranial electric stimulation on brain circuits in rats and humans. Nature communications, 9(1), 483. https://doi.org/10.1038/s41467-018-02928-3

      (8) It seems that there is a very high number of mice used for a relatively small number of cellular recordings. Can the authors explain this?

      We appreciate the reviewer’s observation regarding the number of mice used relative to the number of recorded neurons. There are several factors contributing to this:

      (1)  In vivo juxtacellular labeling is a complex, multi-step process where each step must be executed precisely to successfully label a neuron. During blind recordings, it is impossible to ensure with 100% certainty that the neuron targeted for juxtacellular labeling will later be recoverable with sufficient staining (Pinault, 1996). To maintain confidence in the correspondence between the recorded and labeled neuron, we typically limit our attempts to label one neuron per mouse, or at most, two neurons located far apart from each other.

      (2)  Recording duration limitations: The probability of maintaining a well-isolated, stable neuronal recording decreases significantly as the recording time increases. To obtain sufficient data with multiple tDCS trials, it is necessary to conduct numerous independent recordings. Additionally, each time the recording pipette penetrates the recording site, there is a minor but cumulative impact on the dura mater and neural tissue, leading to tissue degradation in subsequent recordings.

      (3)  Diverse experimental conditions: This study explores several conditions, including recordings in anesthetized and awake mice, targeting different cerebellar regions (Crus I/II and vermis), and utilizing a range of techniques (single-unit extracellular recordings using glass pipettes, juxtacellular recording and labeling, and high-density recordings using the Neuropixels system). These distinct approaches required the establishment of independent experimental animal groups, which contributed to the higher number of subjects used in the study.

      Although we were often able to record several neurons per mouse, the final number of neurons that met all criteria for analysis was reduced due to these limitations.

      References:

      Pinault D. (1996). A novel single-cell staining procedure performed in vivo under electrophysiological control: morpho-functional features of juxtacellularly labeled thalamic cells and other central neurons with biocytin or Neurobiotin. Journal of neuroscience methods, 65(2), 113–136. https://doi.org/10.1016/0165-0270(95)00144-1

      (9) The N for both the neurobiotin-stained neurons and the Neuropixels recordings was relatively low. If possible, it would be nice to see a few more cells.

      We sincerely thank the reviewer for their thoughtful suggestion to increase the sample size. While we understand the importance of this consideration, we believe it is not feasible at this stage due to several factors. First, the complexity of our experiments, which include single-neuron recordings in awake animals during electric field application, juxtacellular neurobiotin injections post-tDCS (with a low success rate), and high-density recordings from Purkinje cells across different layers in awake animals, significantly limits the throughput of data collection. Second, the statistical outcomes obtained from our analyses, which combine multiple techniques, are robust and provide a strong basis for our conclusions. Third, the current study already involves a substantial number of animals (74 mice), which aligns with ethical considerations for minimizing animal use while ensuring robust results.

      We believe that the current sample size is sufficient to support the findings presented in the manuscript. Expanding the sample size further would require considerable additional resources and time, without a clear indication that it would fundamentally alter the conclusions of the study. We are grateful for the reviewer’s understanding of these limitations and their acknowledgment of the value of the current dataset.

      (10) tDCS and tES seem to be used interchangeably; please make it consistent.

      We agree that this could cause confusion. To address this, we have added a clarification at the first mention of tES in the manuscript, indicating that tES (transcranial Electrical Stimulation) is an umbrella term that encompasses both tDCS (transcranial Direct Current Stimulation) and tACS (transcranial Alternating Current Stimulation). We have ensured consistent use of the appropriate term throughout the rest of the text.

      (11) Did the authors apply saline or agar to the craniotomy while recording? Or was the dura dried out? Can the authors clarify this, and relate the answer to a potential interaction of either the medium or dryness of the dura with the tDCS?

      We appreciate the reviewer’s inquiry. To prevent the dura from drying out during our recordings, we applied saline to the cranial window throughout the experiment. Additionally, in our setup, the tDCS ring-shaped electrode was placed over the skull and sealed with dental cement to prevent any leakage of currents into the craniotomy, which was positioned at the center of the preparation. This precaution also helped minimize electrical noise reaching the recording electrode. In instances where the seal was not perfectly executed, the electrical noise from tDCS leaked into the saline solution, causing amplifier saturation and rendering neuronal activity recordings impossible.

      (12) There are several mistakes in spelling and grammar throughout the document; please check carefully.

      We appreciate the reviewer’s attention to detail regarding spelling and grammar. We have carefully reviewed the manuscript and corrected all identified errors to ensure clarity and proper language use throughout the document.

      (13) Can the authors briefly explain why tACS (and not tDCS) is used to measure the effectiveness of the stimulation at the different depths as shown in Figure 1? As the rest of the paper focuses entirely on tDCS, it is important to understand why tACS is used in Figure 1.

      We will clarify this distinction. We chose tACS for measuring electric field strength for two main reasons:

      • Amplifier Limitations: The amplifiers commonly used in electrophysiology are designed to filter out low-frequency components, including direct current (DC) signals, using a high-pass filter. This is because the neuronal signals of interest, such as action potentials, typically occur at higher frequencies (several Hz to kHz). Consequently, any DC signal applied is filtered out from the recordings, preventing us from measuring changes in voltage effectively.

      • Impedance Changes: DC stimulation can alter the impedance of electrodes and surrounding tissue over time. To mitigate this effect and maintain stable recordings, it is advantageous to frequently alternate the polarity and intensity of the stimulation.

      The following text has been included in the 'Transcranial Electrical Stimulation' section of the 'Materials and Methods' part of the manuscript:

      “We selected tACS to measure electric field strength due to two main reasons: (1) amplifiers used in electrophysiology filter out low-frequency signals like DC, making voltage changes from tDCS undetectable, and (2) DC stimulation can alter electrode and tissue impedance over time, whereas alternating the polarity in tACS helps maintain stable recordings.”

      It is important to note that our aim with tACS is to provide an approximation of current propagation through the tissue, rather than to exactly replicate the baseline conditions encountered during continuous tDCS stimulation.
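
      The amplifier argument above can be illustrated with a minimal sketch. It assumes a first-order digital high-pass filter as a stand-in for the amplifier input stage; the 1 Hz cutoff, 1 kHz sampling rate, 10 Hz test frequency, and unit amplitudes are illustrative values chosen for this example, not the settings used in the study.

```python
import math


def high_pass(signal, fs_hz, cutoff_hz):
    """First-order RC high-pass filter (discrete-time approximation)."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / fs_hz
    alpha = rc / (rc + dt)
    out = [0.0] * len(signal)
    for i in range(1, len(signal)):
        out[i] = alpha * (out[i - 1] + signal[i] - signal[i - 1])
    return out


FS_HZ = 1000.0     # assumed sampling rate
CUTOFF_HZ = 1.0    # assumed amplifier high-pass cutoff
N = 10000          # 10 s of samples

t = [i / FS_HZ for i in range(N)]
# tDCS-like input: a constant offset switched on at t = 1 s.
dc_step = [0.0 if ti < 1.0 else 1.0 for ti in t]
# tACS-like input: a 10 Hz sine of the same amplitude.
ac_sine = [math.sin(2.0 * math.pi * 10.0 * ti) for ti in t]

dc_out = high_pass(dc_step, FS_HZ, CUTOFF_HZ)
ac_out = high_pass(ac_sine, FS_HZ, CUTOFF_HZ)

# Compare what survives in the last second of each filtered trace.
tail = slice(N - 1000, N)
dc_residual = max(abs(v) for v in dc_out[tail])
ac_amplitude = (max(ac_out[tail]) - min(ac_out[tail])) / 2.0

print("DC residual after filtering:", dc_residual)
print("AC amplitude after filtering:", ac_amplitude)
```

      In this toy setting the constant offset decays to essentially zero after the filter, while the alternating signal passes almost unattenuated, which is why an alternating waveform is the more practical probe for field-strength measurements.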

      (14) How do Figures 2e and f relate to each other? Figure 2e has 6 red lines, but 6f has 8 red explicitly states that 8 cells were recorded.

      We thank the Reviewer for highlighting this discrepancy. You are correct that in Figure 5e, the lines are too densely packed to easily distinguish all of them. Additionally, the activity of two neurons under anodal tDCS was greatly suppressed, which caused their corresponding arrowheads to lie close to the origin of the arrows, making them less visible. To clarify, while Figure 5f shows all 8 cells recorded, the compression of the data in Figure 5e makes it challenging to distinguish all individual responses visually. We have added a clarifying note to the figure legend explaining that “densely packed lines and suppressed activity of two neurons under anodal tDCS reduce the visibility of their responses”.

      (15) Figure 2g contains two outliers that seem critical to the correlation, this is noticeable as nearly all other cells seem to modulate much more modestly. Maybe add a few more cells to convince everyone?

      We agree with the reviewer that the two neurons in Figure 2g could appear as outliers. To address this, we applied the ROUT method with a stringent Q = 1% to detect potential outliers, and none were found. In addition, we confirmed the robustness of our results with a complementary analysis using robust linear regression methods (e.g., M-estimators), which gave findings consistent with our original analysis. For this purpose, we used the 'Huber' loss function, which combines least squares with robustness against outliers. The regression line obtained with this method (y = -0.5650x + 157.4556) differs minimally from the one originally presented, with p-values for the slope and the intercept of p = 1.4846 × 10<sup>-4</sup> (t<sub>(22)</sub> = -4.5740) and p = 1.1382 × 10<sup>-11</sup> (t<sub>(22)</sub> = 12.8010), respectively. Author response image 1 shows both regression fits to facilitate their comparison. These additional steps ensure the reliability of the relationship observed in the figure, even when accounting for the potential influence of the two data points.
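
      For readers unfamiliar with M-estimation, the following is a minimal sketch of a Huber-type line fit by iteratively reweighted least squares. It is not the analysis code used for Figure 2g: the data are synthetic, and the tuning constant (k = 1.345), the MAD-based scale estimate, and the fixed iteration count are common defaults assumed here for illustration.

```python
from statistics import median


def weighted_line_fit(x, y, w):
    """Weighted least-squares fit of y = slope * x + intercept."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def huber_line_fit(x, y, k=1.345, n_iter=50):
    """Huber M-estimate of a line via iteratively reweighted least squares."""
    slope, intercept = weighted_line_fit(x, y, [1.0] * len(x))
    for _ in range(n_iter):
        resid = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
        med = median(resid)
        # Robust residual scale from the median absolute deviation (MAD).
        mad = median([abs(r - med) for r in resid])
        scale = max(mad / 0.6745, 1e-12)
        # Huber weights: 1 for small residuals, shrinking for large ones.
        w = [1.0 if abs(r) <= k * scale else k * scale / abs(r) for r in resid]
        slope, intercept = weighted_line_fit(x, y, w)
    return slope, intercept


# Synthetic data (illustration only): true slope -0.5, small alternating
# noise, and two large deviations at the high-x end.
x = [float(i) for i in range(20)]
y = [-0.5 * xi + 150.0 + (0.5 if i % 2 else -0.5) for i, xi in enumerate(x)]
y[18] += 40.0
y[19] += 40.0

ols_slope, ols_intercept = weighted_line_fit(x, y, [1.0] * len(x))
huber_slope, huber_intercept = huber_line_fit(x, y)

print("OLS slope:  ", ols_slope)
print("Huber slope:", huber_slope)
```

      When ordinary and robust fits agree closely, as reported above for the real data, the extreme points are not driving the regression; in this synthetic case they do, and the two fits diverge.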

      (16) 'From these experiments we can conclude that 1) tDCS in vermis of anesthetized mice modulates PCs and non-PCs in a heterogeneous way'. Figure 4d shows no correlation between cathodal versus anodal stimulation for non-PCs, so how does the data suggest heterogeneous modulation of non-PCs? Is it simply heterogeneous because the data is very scattered?

      Thank you for your observation. By 'heterogeneous modulation,' we indeed refer to the scattered nature of the responses in non-PCs. Although Figure 4d shows a wide spread of data points and the linear regression is not statistically significant, a general trend can still be observed, where 11 out of 15 non-PCs show modulation in opposite directions with anodal and cathodal tDCS. However, this trend is not consistent across all neurons, hence our description of this modulation as heterogeneous. Importantly, this contrasts with the response observed in Purkinje cells (PCs), where a more consistent modulation pattern is evident, and the p-value for the linear regression is significant. Therefore, we conclude that while PCs show a clearer, more predictable modulation, the scattered data in non-PCs supports a more heterogeneous response.

      (17) The authors state that it is not possible to discriminate the non-PCs, even though some published papers suggest this is quite possible (see e.g., work by Simpson and Ruigrok; please discuss). For sure, the authors of the current manuscript should be able to discriminate the interneurons in the molecular layer from those in the granular layer (if it were only by identifying the polarity of the complex spikes). The authors may want to consider redoing the analyses of the non-PCs, and at least present and compare the outcomes of these two main subgroups of non-PCs.

      The authors are indeed familiar with the work of Simpson, Ruigrok, and others in linking electrophysiological recordings with neuronal class identity. Prior to proceeding with juxtacellular labeling, we conducted preliminary attempts to categorize non-PC neurons based on firing characteristics. However, we ultimately chose not to include neuronal sorting for non-PCs in this study for two main reasons. 

      First, the baseline recording period without tDCS was very short (10 seconds), and once tDCS was applied, the firing rate, coefficient of variation, and interspike intervals (ISI) of neurons were already altered. This made it difficult to reliably classify neurons based on their spontaneous activity, which is critical for precise sorting.

      Second, unlike PCs—where the presence of complex spikes and the resulting inhibition provide a clear ground truth—there is no analogous, unequivocal marker for non-PCs. Even following the reviewer's suggestion, while it might be possible in the molecular layer to identify a neuron as a molecular layer interneuron (MLI), this approach does not allow for a rigorous distinction between basket cells and stellate cells. These two cell types, despite their distinct morphologies—which could significantly affect their responses to tDCS—cannot be reliably differentiated without a true ground truth. Therefore, in the absence of such definitive markers, we believe that further subclassification of non-PCs based solely on electrophysiological properties would not be sufficiently rigorous for the purposes of our study.

      (18) Can the authors briefly discuss possible reasons why non-PCs in Crus1/2 do show heterogeneous responses similar to that of PCs, whereas the non-PCs in the vermis do not?

      We appreciate the reviewer’s insightful question regarding the different modulation patterns observed in non-PCs between Crus I/II and the vermis. Several potential factors could contribute to these differences, including variations in local cerebellar circuit connectivity between the two regions, differences in the cellular diversity of non-PCs due to the lack of a "ground truth" for their classification, or disparities in somatodendritic orientation and cell distribution. In the vermis, PCs are organized into different layers with opposing orientations (as shown in Figure 6), which could result in a more stable, polarity-dependent modulation, making their response more distinct from that of non-PCs. In contrast, in Crus I/II, the orientation of PCs is more heterogeneous and less aligned with the electric field, potentially leading to a more variable modulation pattern in both PCs and non-PCs. 

      However, it is important to note that we did not aim to juxtacellularly label non-PCs in this study, so we cannot offer a definitive answer regarding their precise orientation or identity. Additionally, the observed differences could be partially attributed to statistical power: we recorded 50 non-PCs in Crus I/II compared to only 25 in the vermis. Out of the 15 neurons in the vermis that showed statistically significant modulation, 11 displayed polarity-dependent modulation in opposite directions, but the smaller sample size might have limited our ability to detect the full range of possible effects. Furthermore, recordings in Crus I/II were conducted in awake animals, whereas the vermis neurons shown in Figure 4 were recorded in anesthetized animals. This difference in physiological state could also contribute to the observed differences.

      (19) 'The importance of PC axodendritic orientation in determining the effect of tDCS on firing rate modulation is further highlighted by our observation that pre-synaptic non-PC neurons providing inputs to PCs modulate their activity in a very heterogeneous way.' This is based on the finding that non-PCs modulate heterogeneously, but that is not what is shown for the vermis. Please address.

      Thank you for pointing this out. By 'heterogeneous modulation,' we are referring to the observation that non-Purkinje cells (non-PCs) respond in various ways under tDCS. Specifically, some non-PCs increase their activity under anodal stimulation and decrease it under cathodal stimulation (and vice versa), while others exhibit more complex patterns, such as increasing their activity under both anodal and cathodal stimulation or decreasing it under both polarities. Additionally, some non-PCs respond to only one polarity, and others show no response at all.

      Our reasoning is that if the presynaptic non-PCs providing inputs to Purkinje cells (PCs) were the primary drivers of PC modulation, we would expect them to behave in a manner opposite to how PCs are modulated. For instance, if most non-PCs increased their activity under anodal stimulation while PCs decreased theirs, this could suggest that tDCS modulates non-PCs to fire more, imposing greater inhibition on PCs since many non-PCs are inhibitory. However, what we observe is a highly heterogeneous response from non-PCs, with no clear pattern that would consistently explain the modulation of PCs through presynaptic inputs alone. While non-PCs must certainly exert some influence on PC activity, their variable responses suggest that the modulation of PCs may also be driven by direct effects of tDCS on the PCs themselves, in addition to any indirect presynaptic influence.

      (20) To help in reinforcing the hypothesis that stimulation response depends on dendritic orientation, the authors could show, with the existing data, how PCs in different layers of the vermis respond to cathodal or anodal stimulations. The data shown in Figure 4a-c already has a large number of PCs recorded in different layers of the vermis. As shown in Figure 4b, PCs in specific layers of the vermis have specific dendritic orientations. Can the authors show that PCs recorded for Figure 4, in the different layers (implying similar dendritic orientation) have similar (or different) stimulation responses? This would greatly improve their argument for the importance of dendritic orientation for tDCS responses.

      We appreciate the reviewer’s suggestion and the valuable insight it provides. In fact, this was one of the main motivations for performing the experiments shown in Figure 6, where we conducted simultaneous recordings of different Purkinje cells (PCs) in distinct layers. This allowed us to directly compare responses in neurons with different somatodendritic orientations. Unfortunately, the data presented in Figure 4 were obtained using glass micropipettes for juxtacellular labeling, a method that permits recording from only one neuron at a time, which precludes a robust analysis of the correlation between dendritic orientation and tDCS responses. Furthermore, it should be noted that Figure 4a represents an idealized approximation; since these recordings were performed in different animals with variations along the anteroposterior axis, precise dendritic orientation cannot be reliably attributed to each cell (except for those that were juxtacellularly labeled).

      Additionally, unlike recordings with Neuropixels, where we have numerous contacts positioned at known distances from each other, enabling us to precisely locate cells within the cerebellar layers, the localization of neurons recorded with glass pipettes is less accurate. This is due to factors such as tissue displacement during insertion and animal movements, which further complicates the precise determination of neuronal layer placement during the stimulation protocol.

      While the data in Figure 4 do not allow us to definitively test our hypothesis, the results shown in Figure 6 provide a more direct comparison of the responses of PCs across different layers to tDCS, thereby reinforcing the hypothesis that dendritic orientation is a key factor in modulating neuronal activity.

      (21) The data shown in Figure 5e-f feels underpowered, although the statistical correlation between dendritic orientation and response is strong. For example, currently, the authors show that at an angle of ~0 degrees, two cells increase their firing to anodal stimulation, and 1 cell at ~180 degrees decreases its firing. Again, the manuscript would be much improved if the authors could increase the sample sizes for these experiments.

      We appreciate the reviewer’s concern regarding the sample size in Figure 5e-f. While the statistical correlation between dendritic orientation and response to tDCS is strong, we understand that the data may feel underpowered, particularly given the limited number of cells observed at specific angles such as ~0 degrees and ~180 degrees.

      It is important to note that although visually it may appear there is only one neuron at 180 degrees during anodal stimulation, there are actually three neurons at this orientation. This is more clearly visible in the same figure during cathodal stimulation. However, the firing rate of these neurons during anodal stimulation is so low that the arrow representing their response appears very small, making it difficult to distinguish. (We have added a clarifying note to the figure legend explaining that “densely packed lines and suppressed activity of two neurons under anodal tDCS reduce the visibility of their responses”).

      Unfortunately, increasing the sample size for these specific experiments is not feasible within the current study due to the technical complexity and time-consuming nature of the recordings, especially when incorporating juxtacellular labeling or high-density electrode arrays. Despite these challenges, we believe the current sample provides valuable insights into the relationship between dendritic orientation and firing rate modulation under tDCS. The significant statistical correlation suggests that the observed trend is robust, even with the existing sample size. Additionally, the different experimental approaches used in this study—single-unit extracellular recordings in different regions of the cerebellum in both awake and anesthetized animals, juxtacellular recordings and labeling, and high-density multi-unit recordings—provide a robust and comprehensive view of the results. Each technique offers complementary insights, strengthening our conclusions and ensuring that the observed patterns are not the result of one specific method or condition. Future studies could aim to expand on these findings, but we are confident that the results presented here contribute meaningfully to our understanding of how dendritic orientation influences neuronal responses to tDCS.

      (22) The authors, rightly so, address the potential impact of plasticity in the discussion. Here, the authors may want to cite other studies that have directly addressed this question: E.g., Das et al., 2017 (Frontiers Neuroscience, 11:444; doi: 10.3389/fnins.2017.00444) and van der Vliet et al., 2018 (Brain Stimul, 11(4):759-771; doi: 10.1016/j.brs.2018.04.009).

      We appreciate the reviewer’s suggestion to include additional studies addressing the impact of plasticity on the effects of cerebellar tDCS. In response, we have added a new sentence in the discussion section that cites both Das et al. (2017) and van der Vliet et al. (2018), highlighting the importance of synaptic plasticity in the effects of tDCS. 

      “These findings are consistent with previous work suggesting that synaptic plasticity is crucial for the effects of tDCS, as demonstrated by the importance of PC plasticity in behavioral outcomes(51) and the role of BDNF-mediated plasticity in motor learning(52).”

      Reviewer #2 (Recommendations for the authors):

      In the introduction, it would be beneficial to provide additional context regarding the influence of neuronal orientation on modulation shown from in-vitro studies. In addition, some explanation of the uniformity/non-uniformity of the electrical field would help. From here, the authors should provide their specific hypotheses for these experiments.

      We thank Reviewer #2 for this insightful comment. In response, we have added two new paragraphs to the Introduction that provide clearer context regarding the influence of neuronal orientation on the effects of tDCS.

      “For neurons whose somatodendritic axis is aligned with the electric field, the field induces a pronounced somatic polarization. In the case of anodal stimulation, where the positive electrode is positioned near the dendrites and the soma is oriented away, positively charged ions accumulate near the soma, leading to depolarization and increased excitability, thus facilitating action potential generation. Conversely, neurons whose orientation opposes the field, such as when the soma is closer to the positive electrode and the dendrites face away, experience hyperpolarization, reducing excitability. Lastly, neurons oriented perpendicular to the electric field would exhibit minimal somatic polarization, as the field does not induce significant redistribution of charges along the somatodendritic axis.”

      Additionally, we have now clarified our a priori hypothesis regarding neuronal orientation and its expected influence on tDCS efficacy.

      “We hypothesized that the orientation of PCs relative to the electric field would influence the effects of tDCS on neural activity. In the Vermis, PCs oriented parallel to the field are expected to exhibit stronger effects due to greater somatic polarization, leading to depolarization or hyperpolarization depending on the orientation of the somatodendritic axis. Conversely, PCs in Crus I/II, which are oriented obliquely to the field, are expected to exhibit intermediate effects, as the oblique alignment reduces the strength of polarization compared to parallel alignment.”
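
      The orientation argument in the two quoted paragraphs can be sketched numerically. The example below assumes a simple first-order model in which somatic polarization scales with the component of the electric field along the somatodendritic axis, i.e. with the cosine of the angle between them. The function name, the unit field strength, and the sign convention (positive meaning depolarizing) are illustrative assumptions, not quantities taken from the manuscript.

```python
import math


def axial_field(field_strength, angle_deg):
    """Field component along the somatodendritic axis (cosine projection)."""
    return field_strength * math.cos(math.radians(angle_deg))


FIELD = 1.0  # arbitrary units, assumed for illustration

# Angle between the field and the somatodendritic axis for four cases.
cases = {
    "parallel (0 deg)": 0.0,
    "oblique (45 deg)": 45.0,
    "perpendicular (90 deg)": 90.0,
    "antiparallel (180 deg)": 180.0,
}

for label, angle in cases.items():
    print(f"{label}: axial component = {axial_field(FIELD, angle):+.3f}")
```

      Under this assumption, parallel and antiparallel neurons receive the full field with opposite signs, oblique neurons an intermediate fraction, and perpendicular neurons essentially none, which mirrors the ordering of effects stated in the hypothesis.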

      Justification of the stimulation parameters used (i.e., intensity and pattern) should be included in the Methods.

      A stimulation duration of only a few seconds was chosen to avoid the confounding effects of plasticity, which is known to require several minutes of tDCS administration. Regarding the intensities, we refer to previous studies from our lab using exactly the same methodology, in which we found that 100, 200 and 300 µA were ideal for obtaining reliable and robust neuronal modulation while keeping the animal's awareness of the stimulation to a minimum. We have also added this clarification to the main text.

      Please also justify the use of tACS rather than tDCS in the first experiment.

      We appreciate Reviewer #2’s assessment of the differences between tDCS and tACS. We will clarify this distinction. We chose tACS for measuring electric field strength for two main reasons:

      • Amplifier Limitations: The amplifiers commonly used in electrophysiology apply a high-pass filter that removes low-frequency components, including direct current (DC) signals. This is because the neuronal signals of interest, such as action potentials, typically occur at higher frequencies (several Hz to kHz). Consequently, any applied DC signal is filtered out of the recordings, which prevents us from measuring the resulting voltage changes.

      • Impedance Changes: DC stimulation can alter the impedance of electrodes and surrounding tissue over time. To mitigate this effect and maintain stable recordings, it is advantageous to frequently alternate the polarity and intensity of the stimulation.

      The following text has been included in the 'Transcranial Electrical Stimulation' section of the 'Materials and Methods' part of the manuscript:

      “We selected tACS to measure electric field strength due to two main reasons: (1) amplifiers used in electrophysiology filter out low-frequency signals like DC, making voltage changes from tDCS undetectable, and (2) DC stimulation can alter electrode and tissue impedance over time, whereas alternating the polarity in tACS helps maintain stable recordings.”

      It is important to note that our aim with tACS is to provide an approximation of current propagation through the tissue, rather than to exactly replicate the baseline conditions encountered during continuous tDCS stimulation.

      Reviewer #3 (Recommendations for the authors):

      (1) A suggestion would be to highlight which of the data points in Figure 2g are the neurons they show as representative in Figure 2e-f. This would give the reader insights into how a standard neuron would behave/how representative these neurons are.

      We appreciate the reviewer’s comment and, in response, we have highlighted the two exemplary neurons from Figures 2e-f in Figure 2g to provide better insight into how these representative neurons behave in the context of the overall data. This will help the reader understand how typical these neurons are in relation to the broader dataset. Additionally, we have applied the same approach to Figure 3, highlighting the representative neurons for further clarity.

      (2) It would also be interesting to add figures to the supplementary materials that show the waveforms of non-PC neurons during anodal and cathodal tDCS, as done for PC neurons in the supplementary materials (as stated at the bottom of page 14, the authors chose to mention but not show these).

      We understand the reviewer’s interest in visualizing the waveforms of non-Purkinje neurons during anodal and cathodal tDCS. To address this, we have carefully examined the waveforms of both non-Purkinje neurons under these conditions. However, given the absence of notable changes in their waveforms, we believe that this data does not have sufficient standalone significance to justify the inclusion of a new figure. We are, of course, happy to provide this data upon request or to incorporate it into the supplementary materials if deemed necessary.

      Author response image 7.

      Superimposed averaged SS waveforms under control (black), anodal (red) and cathodal (blue) tDCS from the example neurons shown in panels A and B in Fig. 3.

      (3) In Figure 5d, there is a significant aftereffect of the stimulation on the Purkinje cell firing rate - do the authors have an idea why this occurred?

      We appreciate the reviewer’s observation, as it highlights an interesting phenomenon that we have not been able to fully explain. We observed this aftereffect in many of the recorded neurons, and intriguingly, it often occurred in the opposite direction to the modulation seen during tDCS. We addressed a potential explanation for this in the discussion section:

      ‘Nonetheless, we cannot rule out the possibility of indirect synaptic effects. Indeed, the electric field gradient imposed by tDCS could indirectly modulate a specific neuron firing rate by increasing (or decreasing) its pre-synaptic activity, i.e. by modulating the firing rate of other neurons that synapse onto it. Indeed, these synaptic changes could explain the rebound effect observed after tDCS termination. The synapses involved in the modulation of firing rate may undergo a short-term plasticity process(47–50), which can continue to affect the firing rate even after the external currents have been turned off and no polarization is exerted on the neuron. These findings are consistent with previous work suggesting that synaptic plasticity is crucial for the effects of tDCS, as demonstrated by the importance of PC plasticity in behavioral outcomes(51) and the role of BDNF-mediated plasticity in motor learning(52).’

      This explanation highlights the potential role of synaptic plasticity and the indirect modulation of neuronal networks, but further investigation would be required to fully understand the mechanisms underlying this aftereffect.

      (4) I'm having trouble understanding the reference electrode positioning from schematics 1a & 1b: The text and 1a suggest that the reference electrode was positioned on the back of the mouse, outside of the brain. But Figure 1b looks as if the reference electrode was on the mouse cerebral cortex. Could the authors adapt schematic 1b to clarify the reference location or add this information to the legend?

      We agree that the figure showing two different reference electrodes was confusing, and we have now modified it to better clarify the distinction between the recording reference electrode and the stimulation reference electrode. Additionally, we have specified in Figures 1A and 1B whether the reference pertains to the transcranial alternating stimulation or to the electrophysiological recording.

      (9) In the discussion, (page 22) the authors highlight the importance of axodendritic orientation, but they analyze only somatodendritic orientation. Are the two so similar that they can be used synonymously? This would be good to clarify.

      We appreciate the reviewer’s clarification and fully agree. While Purkinje cells (PCs) do indeed have a highly polarized morphology, with the axon generally oriented in the opposite direction to the main dendrites, this is not always the case, especially for other types of neurons. Therefore, our results strictly refer to the somatodendritic axis, as this is the one we can most clearly observe through our juxtacellular labeling. In response, we have changed all instances where the term 'axodendritic' appeared in the text to 'somatodendritic' for accuracy.

      (10) It would be helpful to clarify that Supplementary Figure 3b and 3e are the same as Figures 4 c and 4d, respectively. This was confusing to me.

      We appreciate the reviewer’s feedback and have now modified the caption of Supplementary Figure 3 to indicate that Supplementary Figures 3b and 3e correspond to Figures 4c and 4d, respectively. This should help clarify any confusion.

      (11) Typo: 'consisting in' → 'consisting of'

      We thank the reviewer for their clarification. The typo has been corrected to 'consisting of'.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews: 

      Reviewer #1 (Public Review):

      Summary: 

      The authors compared four types of hiPSCs and four types of hESCs at the proteome level to elucidate the differences between hiPSCs and hESCs. Semi-quantitative calculations of protein copy numbers revealed increased protein content in iPSCs. Particularly in iPSCs, mitochondrial and cytoplasmic proteins were suggested to reflect the state of the original differentiated cells to some extent. However, the most important result of this study is the calculation of the protein copy numbers per cell, and the validity of this result is problematic. In addition, several experiments need to be improved, such as using cells of different genders (iPSC: female, ESC: male) in mitochondrial metabolism experiments.

      Strengths: 

      The focus on the number of copies of proteins is exciting and appreciated if the estimated calculation result is correct and biologically reproducible. 

      Weaknesses: 

      The proteome results in this study were likely obtained by simply looking at differences between clones, and the proteome data need to be validated. First, there were only a few clones for comparison, and the gender and number of cells did not match between ESCs and iPSCs. Second, no data show the accuracy of the protein copy number per cell obtained by the proteome data. 

      We agree with the reviewer that it would be useful to have data from more independent stem cell clones, ideally with an equal gender balance among the donors. As usual, practical cost-benefit considerations and the time available limit the scope of work that can be performed. We note that the impact of biological donor sex on proteome expression in iPSC lines has already been addressed in previous studies (13). We will, however, revise the manuscript to include specific mention of these limitations and propose a larger-scale follow-up when resources are available.

      Regarding the estimation of protein copy numbers in our study, we would like to highlight that the proteomic ruler approach we have used has been employed extensively in the field, with differences in copy numbers directly validated using methods orthogonal to MS, e.g., FACS (2-4, 7, 10). Furthermore, the original manuscript (14) directly compared the copy numbers estimated using the “proteomic ruler” to spike-in protein epitope signature tags and found remarkable concordance. That original study was performed with an older-generation mass spectrometer and lower peptide coverage than the instrumentation used in our present study. Further, we note that its authors predicted that higher peptide coverage, such as we report in our study, would further increase quantitative performance.
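
      As a rough illustration of the proteomic ruler idea referred to above, the sketch below scales a protein's MS signal by the summed histone signal, on the assumption that total histone mass per cell approximates the cellular DNA mass, and then converts mass to copy number. The function name, the DNA mass of about 6.5 pg for a diploid human cell, and the example intensities are assumptions made for illustration; they are not values or code from the study.

```python
AVOGADRO = 6.02214076e23          # molecules per mole
DNA_MASS_PER_CELL_G = 6.5e-12     # assumed diploid human DNA mass (~6.5 pg)


def copies_per_cell(protein_intensity, protein_mw_da, total_histone_intensity,
                    dna_mass_g=DNA_MASS_PER_CELL_G):
    """Estimate protein copies per cell with a histone-based 'ruler'.

    Assumes the summed histone MS signal corresponds to a histone mass
    roughly equal to the DNA mass per cell.
    """
    # Protein mass per cell, scaled by the histone signal.
    protein_mass_g = protein_intensity / total_histone_intensity * dna_mass_g
    # Convert grams to molecules via the molar mass (Da = g/mol).
    return protein_mass_g / protein_mw_da * AVOGADRO


# Hypothetical example: a 50 kDa protein whose MS signal equals the
# summed histone signal.
example_copies = copies_per_cell(
    protein_intensity=1.0e9,
    protein_mw_da=50_000.0,
    total_histone_intensity=1.0e9,
)
print(f"Estimated copies per cell: {example_copies:.3e}")
```

      Because the estimate is a ratio to the histone signal, it does not require spike-in standards, but it inherits any error in the assumed DNA content per cell.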

      Reviewer #2 (Public Review):

      Summary: 

      Pluripotent stem cells are powerful tools for understanding development, differentiation, and disease modeling. The capacity of stem cells to differentiate into various cell types holds great promise for therapeutic applications. However, ethical concerns restrict the use of human embryonic stem cells (hESCs). Consequently, induced human pluripotent stem cells (ihPSCs) offer an attractive alternative for modeling rare diseases, drug screening, and regenerative medicine. A comprehensive understanding of ihPSCs is crucial to establish their similarities and differences compared to hESCs. This work demonstrates systematic differences in the reprogramming of nuclear and non-nuclear proteomes in ihPSCs. 

      We thank the reviewer for the positive assessment.

      Strengths: 

      The authors employed quantitative mass spectrometry to compare protein expression differences between independently derived ihPSC and hESC cell lines. Qualitatively, protein expression profiles in ihPSC and hESC were found to be very similar. However, when comparing protein concentration at a cellular level, it became evident that ihPSCs express higher levels of proteins in the cytoplasm, mitochondria, and plasma membrane, while the expression of nuclear proteins is similar between ihPSCs and hESCs. A higher expression of proteins in ihPSCs was verified by an independent approach, and flow cytometry confirmed that ihPSCs had larger cell sizes than hESCs. The differences in protein expression were reflected in functional distinctions. For instance, the higher expression of mitochondrial metabolic enzymes, glutamine transporters, and lipid biosynthesis enzymes in ihPSCs was associated with enhanced mitochondrial potential, increased ability to uptake glutamine, and increased ability to form lipid droplets. 

      Weaknesses: 

      While this finding is intriguing and interesting, the study falls short of explaining the mechanistic reasons for the observed quantitative proteome differences. It remains unclear whether the increased expression of proteins in ihPSCs is due to enhanced transcription of the genes encoding this group of proteins or due to other reasons, for example, differences in mRNA translation efficiency. Another unresolved question pertains to how the cell type origin influences ihPSC proteomes. For instance, whether ihPSCs derived from fibroblasts, lymphocytes, and other cell types all exhibit differences in their cell size and increased expression of cytoplasmic and mitochondrial proteins. Analyzing ihPSCs derived from different cell types and by different investigators would be necessary to address these questions. 

      We agree with the Reviewer that our study does not extend to also providing a detailed mechanistic explanation for the quantitative differences observed between the two stem cell types and did not claim to have done so. We have now included an expanded section in the discussion where we discuss potential causes. However, in our view fully understanding the reasons for this difference is likely to involve extensive future in-depth analysis in additional studies and is not something that can be determined just by one or two additional supplemental experiments.

      We also agree studying hiPSCs reprogrammed from different cell types, such as blood lymphocytes, would be of great interest. Again, while we agree it is a useful way forward, in practice this will require a very substantial additional commitment of time and resources. We have now included a section discussing this opportunity within the discussion to encourage further research into the area.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) aizi1 and ueah1 clones, which were analyzed in Figure 1A, were excluded from the proteome analysis. In particular, the GAPDH expression level of the aizi1 clone is similar to that of ESCs and different from other iPSC clones. An explanation of how the clones were selected for proteome analysis is needed. Previously, the comparative analysis of iPSCs and ESCs reported in many studies from 2009-2017 (Ref#1-7) has already shown that the number of clones used in the comparative analysis is small, claiming differences (Ref#1-3) and that the differences become indistinguishable when the number of clones is increased (Ref#4-7). Certainly, few studies have been done at the proteome level, so it is important to examine what differences exist in the proteome. Also, it is interesting to focus on the amount of protein per cell. However, if the authors want to describe biological differences, it would be better to get the proteome data in biological duplicate and state the reason for selecting the clones used.

      (1) M. Chin, Cell Stem Cell, 2009, PMID: 19570518

      (2) K. Kim, Nat Biotechnol., 2011, PMID: 22119740

      (3) R. Lister, Nature, 2011, PMID: 21289626

      (4) A.M. Newman, Cell Stem Cell, 2010, PMID: 20682451

      (5) M.G. Guenther, Cell Stem Cell, 2010, PMID: 20682450

      (6) C. Bock, Cell, 2010, PMID: 21295703

      (7) S. Yamanaka, Cell Stem Cell, PMID: 22704507

      We agree with the reviewer that analysing more clones would be beneficial. We have included a section on this topic in the discussion. In our study, we only had access to the 4 hESC lines included; therefore, in the original proteomic study we also analysed 4 hiPSC lines, which were routinely grown within our stem cell facility. Although the stem cell facility expanded the culture of additional hiPSC lines as the study progressed, unfortunately we could not access additional hESC lines.

      We agree that ideally combining each biological replicate with additional technical replicates would provide extra robustness. As usual, cost and practical considerations at the time the experiments were performed affected the experimental design chosen. For the experimental design, each experiment was contained within 1 batch to avoid the strong batch effects present in TMT (Brenes et al 2019).

      (2) iPSC samples used in the proteome analysis are two types of female and two types of male, while ESC samples are three types of female and one type of female. The number of sexes of the cells in the comparative analysis should be matched because sex differences may bias the results.

      While we agree with the reviewer in principle, we have previously performed detailed comparisons of proteome expression in many independent iPSC lines from both biological male and female donors (see Brenes et al., Cell Reports 2021) and it seems unlikely that biological sex differences alone could account for the proteome differences between iPSC and ESC lines uncovered in this study. However, as this is a relevant point, we have revised the manuscript to explicitly mention this caveat within the discussion section.

      (3) In Figure 1h, I suspect that the variation of PCA plots is very similar between ESCs and iPSCs. In particular, the authors wrote "copy numbers for all 8 replicates" in the legend, but if Figure 1b was done 8 times, there should be 8 types of cells x 8 measurements = 64 points. Even if iPSCs and ESCs are grouped together, there should be 8 points for each cell type. Is it possible that there is only one TMT measurement for this analysis? If so, at least technical duplicates or biological duplicates would be necessary. I also think each cell should be plotted in the PCA analysis instead of combining the four types of ESCs and iPSCs into one.

      We thank the reviewer for bringing this error to our attention. The legend has been corrected to state, “for all 8 stem cell lines”. Each dot represents the proteome of each of the 4 hESCs and 4 hiPSCs that were analysed using proteomics.

      (4) It is necessary to show what functions are enriched in the 4408 proteins whose protein copies per cell were increased in the iPSCs obtained in Figure 2B.

      The enrichment analysis requested has been performed and is now included as a new supplemental figure 2. We find it very interesting that despite the large number of proteins involved here (4,408), the enrichment analysis still shows clear enrichment for specific cellular processes. The summary plot using affinity propagation within webgestalt is included here:

      Author response image 1.

      (5) The Proteomic Ruler method used in this study is a semi-quantitative method to calculate protein copy numbers and is a concentration estimation method. Therefore, if the authors want to have a biological discussion based on the results, they need to show that the estimated concentrations are correct. For example, there are Western Blotting (WB) results for genes with no change in protein levels in hESC and hiPSC in Fig. 6ij, but the WB results for the group of genes that are claimed to have changed are not shown throughout the paper. Also, there is no difference in the total protein level between iPSCs and ESCs from the ponceau staining in Fig.6ij. WB results for at least a few genes are needed to show whether the concentration estimates obtained from the proteome analysis are plausible. If the protein per cell is increased in these iPSC clones, performing WB analysis using an equal number of cells would be better.

      Regarding the ‘proteome ruler’ approach we would like to highlight that this method has previously been used extensively in the field, with detailed validation, as already explained above. It is also not ‘semi-quantitative’ and can estimate absolute abundance, as well as concentrations. Our work does not use their concentration formulas, but the estimation of protein copy numbers, which was shown to closely match the observed copy numbers as determined when spike-ins are used14.
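      The copy number estimate referred to above can be sketched in a few lines. This is a minimal illustration of the published proteomic ruler idea (ref. 14), in which the summed histone MS signal is taken to correspond to the DNA mass per cell; it is not the authors' actual pipeline. The function name, the protein identifiers, the intensities, and the default DNA mass of 6.5 pg per diploid human cell are all assumptions chosen for illustration.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def ruler_copy_numbers(intensities, mol_weights_da, histone_ids, dna_mass_pg=6.5):
    """Estimate protein copies per cell from MS intensities.

    Assumption (proteomic ruler): the total histone mass per cell is
    approximately equal to the DNA mass per cell, so the summed histone
    signal can be used to scale every other protein's signal to a mass.
    """
    histone_signal = sum(intensities[p] for p in histone_ids)
    if histone_signal <= 0:
        raise ValueError("summed histone signal must be positive")

    copies = {}
    for protein, signal in intensities.items():
        # Mass of this protein per cell, in grams.
        mass_g = signal / histone_signal * dna_mass_pg * 1e-12
        # Moles per cell times Avogadro's number gives copies per cell.
        copies[protein] = mass_g / mol_weights_da[protein] * AVOGADRO
    return copies


# Hypothetical toy input: one histone and one non-histone protein.
toy_intensities = {"HIST_A": 100.0, "PROT_X": 50.0}
toy_weights = {"HIST_A": 11_000.0, "PROT_X": 50_000.0}
toy_copies = ruler_copy_numbers(toy_intensities, toy_weights, ["HIST_A"])
```

      Because the scaling is anchored to the histone signal rather than to the median of the sample, a cell type with more total protein per cell yields proportionally higher copy numbers for its non-histone proteins.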

      In providing additional validation by Western Blotting (WB), we prioritised the proteins related to pluripotency markers, which are vital to determine the pluripotency state of the hESCs and hiPSCs, as well as histone markers. We have included a section in the discussion concerning additional validation data and agree in general that further validation is always useful.

      (6) Regarding the experiment shown in Figure 4l, the gender of iPSC used (wibj2) is female and WA01 (H1; WA01) is male. Certainly, there is a difference in the P/E control ratio, but isn't this just a gender difference? The sexes of the cells need to be matched.

      We accept that the sexes of donors should ideally have been matched and have mentioned this within the discussion. Nonetheless, as previously mentioned, our previous detailed proteomic analyses of multiple hiPSC lines13 derived from both biological male and female donors provide relevant evidence that the results shown in this study are not simply a reflection of the sex of the donors for the respective iPSC and ESC lines. When comparing eroded and non-eroded female hiPSCs to male hiPSCs we found no significant differences in any electron transport chain proteins, nor in TCA proteins, between males and females.

      Minor comments:

      (1) Method: Information on the hiPSCs and hESCs used in this study should be described. In particular, the type of differentiated cells, gender, and protocols that were used in the reprogramming are needed.

      We agree with the reviewer on this. The hiPSC lines were generated by the HipSci consortium, as described in the flagship HipSci paper15. We cite the flagship paper, which specifies in great detail the reprogramming protocols and quality control measures, including analysis of copy number variations15. However, we agree that this information may not be easily accessible for readers. We agree it is relevant to explicitly include this information in our present manuscript, instead of expecting readers to look at the flagship paper. These details have therefore been added to the revised version.

      (2) Method: In Figure1a, Figure 6i, j, the antibody information of Nanog, Oct4, Sox2, and Gapdh is not written in the method and needs to be shown.

      The data relating to these has now been included within the methods section.

      (3) Method: In Figure 1b and other figures, the authors should indicate which iPSC corresponds to which TMT label; the data in the Supplemental Table also needs to indicate which data is which clone.

      We have now added this to the methods section.

      (4) Method: The method of the FACS experiment used in Figure 2 should be described.

      The methods related to the FACS analysis have now been included within the manuscript.

      (5) Method: The cell name used in the mitochondria experiment shown in Figure 4 is listed as WA01, which is thought to be H1. Variations in notation should be corrected.

      This has now been corrected.

      (6) Method: The name of the cell clone shown in Figure 3l,m should be mentioned.

      We have now added these details on the corresponding figure and legend.

      Reviewer #2 (Recommendations For The Authors):

      This study utilized quantitative mass spectrometry to compare protein expression in independently derived 4 ihPSC and 4 hESC cell lines. The investigation quantified approximately 7,900 proteins, and employing the "Proteome ruler" approach, estimated protein copy numbers per cell. Principal component analyses, based on protein copy number per cell, clearly separated hiPSC and hESC, while different hiPSCs and hESCs grouped together. The study revealed a global increase in the expression of cytoplasmic, mitochondrial, membrane transporters, and secreted proteins in hiPSCs compared to hESCs. Interestingly, standard median-based normalization approaches failed to capture these differences, and the disparities became apparent only when protein copy numbers were adjusted for cell numbers. Increased protein abundance in hiPSC was associated with augmented ribosome biogenesis. Total protein content was >50% higher in hiPSCs compared to hESCs, an observation independently verified by total protein content measurement via the EZQ assay and further supported by the larger cell size of hiPSCs in flow cytometry. However, the cell cycle distribution of hiPSC and hESC was similar, indicating that the difference in protein content was not due to variations in the cell cycle. At the phenotypic level, differences in protein expression also correlated with increased glutamine uptake, enhanced mitochondrial potential, and lipid droplet formation in hiPSCs. ihPSCs also expressed higher levels of extracellular matrix components and growth factors.

      Overall, the presented conclusions are adequately supported by the data. Although the mechanistic basis of proteome differences in ihPSC and hESC is not investigated, the work presents interesting findings that are worthy of publication. Below, I have listed my specific questions and comments for the authors.

      (1) Figure 1a displays immunoblots from 6 iPSC and 4 ESC cell lines, with 8 cell lines (4 hESC, 4 hiPSC) utilized in proteomic analyses (Fig. 1b). The figure legend should specify the 8 cell lines included in the proteomic analyses. The manuscript text describing these results should explicitly mention the number and names of cell lines used in these assays.

      We agree with the reviewer and have now marked in figure 1 all the lines that were used for proteomics and have added a section in the methods specifying which cell lines were analysed in each TMT channel.

      (2) In most figures, the quantitative differences in protein expression between hiPSC and hESC are evident, and protein expression is highly consistent among different hiPSCs and hESCs. However, the glutamine uptake capacity of different hiPSC cell lines, and to some extent hESC cell lines, appears highly variable (Figure 3e). While proteome changes were measured in 4 hiPSCs and 4 hESCs, the glutamine uptake assays were performed on a larger number of cell lines. The authors should clarify the number of cell lines used in the glutamine uptake assay, clearly indicating the cell lines used in the proteome measurements. Given the large variation in glutamine uptake among different cell lines, it would be useful to plot the correlation between the expression of glutamine transporters and glutamine uptake in individual cell lines. This may help understand whether differences in glutamine uptake are related to variations in the expression of glutamine transporters.

      The “proteomic ruler” estimates protein copy numbers per cell; as such, changes in the absolute number of cells that were analysed do not cause major complications in quantification. Furthermore, TMT-based proteomics is the most precise proteomics method available, where the same peptides are detected in all samples across the same data points and peaks, as long as the analysis is done within a single batch, as is the case here.

      The glutamine uptake assay is much more sensitive to variation in the number of cells. The number of cells was estimated by plating approximately 5e4 cells two days before the assay, which creates variability. Furthermore, hESCs and hiPSCs are more adhesive than the cells used in the original protocol; hence, the quench data were noisier for these lines, making the data from the assay more variable.

      (3) In Figure 4j, it would be helpful to indicate whether the observed differences in the respiration parameters are statistically significant.

      We have now modified the plot to show which proteins were significantly different.

      (4) The iPSCs used here are generated from human primary skin fibroblasts. Different cells vary in size; for instance, fibroblast cells are generally larger than blood lymphocytes. This raises the question of whether the parent cell origin impacts differences in hiPSCs and hESC proteomes. For example, do the authors anticipate that hiPSCs derived from small somatic cells would also display higher expression of cytoplasmic, mitochondrial, and membrane transporters compared to ESC? The authors may consider discussing this point.

      This is a very interesting point. We have now added an extension to the discussion focussed on this subject.

      (5) One wonders if the "Proteome ruler" approach could be applied retrospectively to previously published ihPSC and hESC proteome data, confirming higher expression of cytoplasmic and mitochondrial proteins in ihPSCs, which may have been masked in previous analyses due to median-based normalization.

      We agree with the reviewer and think this is a very good suggestion. Unfortunately, in the main proteomic papers comparing hESCs and hiPSCs16,17 the authors did not upload their raw files to a public repository (as it was not mandatory at that time), and they also used the International Protein Index (IPI), which is a discontinued database. The raw files therefore cannot be reprocessed, and the database does not match the modern SwissProt entries. Reprocessing the previous data was thus impractical.
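      The masking effect of median-based normalization raised in this comment can be illustrated with a toy calculation. The sketch below uses hypothetical abundance values (not data from the study) and the simplest possible case, in which every protein in one sample is 1.5-fold more abundant than in the other. Dividing each sample by its own median removes that global shift entirely, whereas per-cell values retain it.

```python
from statistics import median


def median_normalise(values):
    """Scale a sample so that its median abundance equals 1."""
    m = median(values)
    return [v / m for v in values]


# Hypothetical per-cell abundances for five proteins in two samples.
sample_a = [100.0, 200.0, 400.0, 800.0, 1600.0]
# Sample B has 1.5 times more of every protein per cell.
sample_b = [1.5 * v for v in sample_a]

# Ratios computed on per-cell values keep the global 1.5-fold difference.
per_cell_ratios = [b / a for a, b in zip(sample_a, sample_b)]

# Ratios computed after median normalisation collapse to 1.
norm_a = median_normalise(sample_a)
norm_b = median_normalise(sample_b)
normalised_ratios = [b / a for a, b in zip(norm_a, norm_b)]
```

      In real data the shift is not uniform across compartments, so median normalization would distort rather than fully erase the differences, but the direction of the problem is the same.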

      (6) The work raises a fundamental question: what is the mechanistic basis for the higher expression of cytoplasmic and mitochondrial proteins in ihPSCs? Conceivably, this could be due to two reasons: (a) Genes encoding cytoplasmic and mitochondrial proteins are expressed at a higher level in ihPSCs compared to hESC. (b) mRNAs encoding cytoplasmic and mitochondrial proteins are translated at a higher level in ihPSCs compared to hESC. The authors may check published transcriptome data from the same cell lines to shed light on this point.

      This is a very interesting point. We believe that the reprogrammed cells contained mature mitochondria, which are not fully regressed upon reprogramming and that this can establish a growth advantage in the normoxic environments in which the cells are grown. Unfortunately, the available transcriptomic data lacked spike-ins, and thus only enables comparison of concentration, not of copy numbers13. Therefore, we could not determine with the available data if there was an increase in the copies of specific mRNAs. However, with a future study where there was a transcriptomic dataset with spike-ins included, this would be very interesting to analyse.

      Reviewer #3 (Recommendations For The Authors):

      It is unclear whether changes in protein levels relate to any phenotypic features of cell lines used. For example, the authors highlight that increased protein expression in hiPSC lines is consistent with the requirement to sustain high growth rates, but there is no data to demonstrate whether hiPSC lines used indeed have higher growth rates.

      We respectfully disagree with the reviewer on this point. Our data show that hESCs and hiPSCs show significant differences in protein mass and cell size, with the MS data validated by the EZQ assay and FACS, while having no significant differences in their cell cycle profiles. Thus, increased size and protein content would require higher growth rates to sustain the increased mass, which is what we observe.

      The authors claim that the cell cycle of the lines is unchanged. However, no details of the method for assessing the cell cycle were included so it is difficult to appreciate if this assessment was appropriately carried out and controlled for.

      We apologise for this omission; the details have been included in the revised version of the manuscript.

      Details and characterisation of iPSC and ESC lines used in this study are overall lacking. The lines used are merely listed in methods, but no references are included for published lines, how lines were obtained, what passage they were used at, their karyotype status etc. For details of basic characterisation, the authors should refer to the ISSCR Standards for the use of human stem cells in research. In particular, the authors should consider whether any of the changes they see may be attributed to copy number variants in different lines.

      We agree with the reviewer on this and refer to the reply above concerning this issue.

      The expression data for markers of undifferentiated state in Figure 1a would ideally be shown by immunocytochemistry or flow cytometry as it is impossible to tell whether cultures are heterogeneous for marker expression.

      We agree with the reviewer on this. FACS is indeed much more quantitative and a better method to study heterogeneity. However, we did not have protocols to study these markers using FACS.

      TEM analysis should ideally be quantified.

      We agree with the reviewer that it would be nice to have a quantitative measure.

      All figure legends should explicitly state what graphs are representing (e.g. average/mean; how many replicates (biological or technical), which lines)? Some data is included in Methods (e.g. glutamine uptake), but not for all of the data (e.g. TEM).

      We agree with the reviewer. These have been corrected in the revised version of the manuscript, with additional details included.

      Validation experiments were performed typically on one or two cell lines, but the lines used were not consistent (e.g. wibj_2 versus H1 for respirometry and wibj_2, oaqd_3 versus SA121 and SA181 for glutamine uptake). Can the authors explain how the lines were chosen?

      The validation experiments were performed at different time points, and the selection of lines reflected the availability of hiPSC and hESC lines within our stem cell facility at a given point in time.

      We chose to use a range of different lines for comparison, rather than always comparing only one set of lines, to try to avoid a possible bias in our conclusions and thus to make the results more general.

      The authors should acknowledge the need for further functional validation of the results related to immunosuppressive proteins.

      We agree with the reviewer and have added a sentence in the discussion making this point explicitly.

      Differences in H1 histones abundance were highlighted. Can the authors speculate as to the meaning of these differences?

      Regarding H1 histones, our study of the literature, as well as discussions with chromatin and histone experts, both within our institute and externally, have not shed light on what the differences could imply. We therefore think that this is a striking and interesting result that merits further study, but we have not yet been able to formulate a clear hypothesis on the consequences.

      (1) Howden, A. J. M. et al. Quantitative analysis of T cell proteomes and environmental sensors during T cell differentiation. Nat Immunol, doi:10.1038/s41590-019-0495-x (2019).

      (2) Marchingo, J. M., Sinclair, L. V., Howden, A. J. & Cantrell, D. A. Quantitative analysis of how Myc controls T cell proteomes and metabolic pathways during T cell activation. Elife 9, doi:10.7554/eLife.53725 (2020).

      (3) Damasio, M. P. et al. Extracellular signal-regulated kinase (ERK) pathway control of CD8+ T cell differentiation. Biochem J 478, 79-98, doi:10.1042/BCJ20200661 (2021).

      (4) Salerno, F. et al. An integrated proteome and transcriptome of B cell maturation defines poised activation states of transitional and mature B cells. Nat Commun 14, 5116, doi:10.1038/s41467-023-40621-2 (2023).

      (5) Antico, O., Nirujogi, R. S. & Muqit, M. M. K. Whole proteome copy number dataset in primary mouse cortical neurons. Data Brief 49, 109336, doi:10.1016/j.dib.2023.109336 (2023).

      (6) Edwards, W. et al. Quantitative proteomic profiling identifies global protein network dynamics in murine embryonic heart development. Dev Cell 58, 1087-1105 e1084, doi:10.1016/j.devcel.2023.04.011 (2023).

      (7) Barton, P. R. et al. Super-killer CTLs are generated by single gene deletion of Bach2. Eur J Immunol 52, 1776-1788, doi:10.1002/eji.202249797 (2022).

      (8) Phair, I. R., Sumoreeah, M. C., Scott, N., Spinelli, L. & Arthur, J. S. C. IL-33 induces granzyme C expression in murine mast cells via an MSK1/2-CREB-dependent pathway. Biosci Rep 42, doi:10.1042/BSR20221165 (2022).

      (9) Niu, L. et al. Dynamic human liver proteome atlas reveals functional insights into disease pathways. Mol Syst Biol 18, e10947, doi:10.15252/msb.202210947 (2022).

      (10) Murugesan, G., Davidson, L., Jannetti, L., Crocker, P. R. & Weigle, B. Quantitative Proteomics of Polarised Macrophages Derived from Induced Pluripotent Stem Cells. Biomedicines 10, doi:10.3390/biomedicines10020239 (2022).

      (11) Ryan, D. G. et al. Nrf2 activation reprograms macrophage intermediary metabolism and suppresses the type I interferon response. iScience 25, 103827, doi:10.1016/j.isci.2022.103827 (2022).

      (12) Nicolas, P. et al. Systems-level conservation of the proximal TCR signaling network of mice and humans. J Exp Med 219, doi:10.1084/jem.20211295 (2022).

      (13) Brenes, A. J. et al. Erosion of human X chromosome inactivation causes major remodeling of the iPSC proteome. Cell Rep 35, 109032, doi:10.1016/j.celrep.2021.109032 (2021).

      (14) Wisniewski, J. R., Hein, M. Y., Cox, J. & Mann, M. A "proteomic ruler" for protein copy number and concentration estimation without spike-in standards. Mol Cell Proteomics 13, 3497-3506, doi:10.1074/mcp.M113.037309 (2014).

      (15) Kilpinen, H. et al. Common genetic variation drives molecular heterogeneity in human iPSCs. Nature 546, 370-375, doi:10.1038/nature22403 (2017).

      (16) Phanstiel, D. H. et al. Proteomic and phosphoproteomic comparison of human ES and iPS cells. Nat Methods 8, 821-827, doi:10.1038/nmeth.1699 (2011).

      (17) Munoz, J. et al. The quantitative proteomes of human-induced pluripotent stem cells and embryonic stem cells. Mol Syst Biol 7, 550, doi:10.1038/msb.2011.84 (2011).

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Recommendations For The Authors):

      The conclusions of this paper are mostly well supported by data, but some aspects need to be corrected.

      1) Line 99. The title is not suitable for summarizing this part of the results. In this paragraph, the results mainly describe SRSF1 expression pattern and binding of spermatogonia-associated gene's transcripts in testes. There is no functional assay to conclude SRSF1 has an essential role in mouse testes. The data only indicate that SRSF1 may have a vital role in posttranscriptional regulation in the testes.

      Thank you for the professional suggestions. Following this advice, we have corrected the text in this revised version (Page 4, Line 98 and 112).

      2) Line 141. In the mating scheme, Vasa-Cre Srsf1Fl/del mice should be obtained instead of Vasa-Cre Srsf1Fl/Fl mice.

      Thank you for the professional suggestions. Following this advice, we have corrected the text in this revised version (Page 4, Line 118).

      3) Fig 2 C, "PZLF" should be corrected to "PLZF".

      Thank you very much for the helpful comments. We have corrected this in Figure 2C.

      4) Fig 5 B, "VASA" and "Merge" should be interchanged.

      Thank you very much for the helpful comments. We have interchanged "VASA" and "Merge" in Figure 5B.

      5) Fig 5 D, "Ctrl" should be added in the up panel.

      Thank you very much for the helpful suggestions. We have added "Ctrl" in Figure 5C.

      6) The legend for Figure 6 D should be revised.

      Thank you very much for the helpful suggestions. We have revised the legend for Figure 7D.

      7) The legend for Figure 7 G should be revised.

      Thank you very much for the helpful suggestions. We have revised the legend for Figure 8D.

      8) Immunoprecipitation mass spectrometry (IP-MS) data showed that SRSF1 interacts with other RNA splicing-related proteins (e.g., SRSF10, SART1, RBM15, SRRM2, SF3B6, and SF3A2). The authors should verify the interactions in testis or cells.

      We thank the reviewer for the professional comments and suggestions. Following this advice, we performed co-transfection and co-IP to verify the protein-protein interactions in 293T cells. The results showed that the RRM1 domain of SRSF1 interacted with SART1, RBM15 and SRSF10 in 293T cells. In addition, the fluorescence results showed complete co-localization of mCherry-SRSF1 with eGFP-SART1, eGFP-RBM15 and eGFP-SRSF10 in 293T cells. We have therefore incorporated the data into Figure 9G-J, and have added and highlighted the corresponding descriptions in the text (Page 17, Lines 338-347).

      9) To avoid overstatement, the authors should pay attention to the use of adjectives and adverbs in the article, especially when drawing conclusions about the role of Tial1.

      We thank the reviewer for the professional comments and suggestions. To avoid overstatement, we have revised the entire text (Page 4, Lines 98, and 112; Page 16, Lines 308; Page 17, Lines 346-347; Page 20, Lines 413-414; Page 21, Lines 432-433).

      Reviewer #2 (Recommendations For The Authors):

      Major

      1) I find the use of "SSC homing" misleading/confusing because this "homing" or relocation of postnatal gonocytes/nascent spermatogonia to the basement membrane precedes the maturation of the nascent spermatogonia into SSCs. In addition, "SSC homing" is commonly used in the SSC transplantation field to describe a transplanted SSC's ability to find and colonize its niche within the seminiferous tubules. I appreciate that "postnatal gonocytes/nascent spermatogonia homing" is not easily grasped by a broader audience. Perhaps "homing of precursor SSCs" is more appropriate.

      Thank you very much for the helpful comments and suggestions. Following this advice, we have corrected the text in this revised version (Line 1-2, 39, 44, 49, 54-55, 68, 70, 72-73, 77, 84, 93-95, 191, 201, 240, 384-387, 397, 417-422, and 433).

      2) If I am misunderstanding the description of the Srsf1 cKO phenotype, and the authors truly believe SSCs have formed in the Srsf1 cKO testis, I strongly recommend immunostaining to show that the cKO germ cells robustly express SSC markers, not just markers of undifferentiated spermatogonia.

      We thank the reviewer for the professional suggestions. We fully agree with the reviewer. Immunohistochemical staining for FOXO1 and statistical results indicated a reduced number of prospermatogonia (Figure 6C-E). So, we have corrected the text in this revised version (Line 1-2, 39, 44, 49, 54-55, 68, 70, 72-73, 77, 84, 93-95, 191, 201, 240, 384-387, 397, 417-422, and 433).

      3) If the authors have the available resources, the significance of this report would be enhanced by additional characterization of the cKO phenotype at the transition from gonocyte to nascent spermatogonia. Do any cKO germ cells exhibit defects in maturing from gonocytes to nascent spermatogonia at the molecular level? I.e., by P5-7, do all cKO germ cells express PLZF and localize FOXO1 to cytoplasm, as expected of nascent spermatogonia? If the cKO germ cells are actually a heterogenous population of gonocytes and nascent spermatogonia, what is the distribution of each subpopulation in the lumen vs basement membrane?

      Thank you for the professional suggestions. Following this advice, immunohistochemical staining for FOXO1 was performed on 5 dpp mouse testis sections (Figure 6C). Further, counts of germ cells with nuclear FOXO1 expression showed a reduced number of prospermatogonia in cKO mice (Figure 6D). Germ cells in which FOXO1 is expressed in the nucleus similarly undergo abnormal homing (Figure 6E). Thus, all the above data indicated that SRSF1 has an essential role in the homing of precursor SSCs. We have incorporated the data into Figure 6C-E, and have added and highlighted the corresponding descriptions in the text (Page 9, Lines 191-201; Page 20, Lines 389-391).

      Minor

      1) Could the authors clarify why Tial1 exon exclusion in the cKO results in reduced protein expression? Is it creating a transcript isoform that undergoes nonsense-mediated decay?

      Thank you for the professional suggestions. Following this advice, we analyzed Tial1 transcripts again, and we found that Tial1 exon exclusion resulted in reduced expression of protein isoform X2 (Figure 8J). Since this region is not in the CDS, no clear evidence of nonsense-mediated decay was found in the analysis.

      2) Could the authors confirm that the TIAL1 antibody is not detecting the portion of the protein encoded by the alternatively spliced exon?

      Thank you for the helpful comments. The TIAL1 monoclonal antibody is produced by Proteintech Group under the product number 66907-1-Ig. Immunogen is TIAL1 fusion protein Ag11981. The sequence is as follows. MDARVVKDMATGKSKGYGFVSFYNKLDAENAIVHMGGQWLGGRQIRTNWATRKPPAPKSTQENNTKQLRFEDVVNQSSPKNCTVYCGGIASGLTDQLMRQTFSPFGQIMEIRVFPEKGYSFVRFSTHESAAHAIVSVNGTTIEGHVVKCYWGKESPDMTKNFQQVDYSQWGQWSQVYGNPQQYGQYMANGWQVPPYGVYGQPWNQQGFGVDQSPSAAWMGGFGAQPPQGQAPPPVIPPPNQAGYGMASYQTQ The homology was 99% in mice and all TIAL1 isoforms were detected. So, TIAL1 antibody is detecting the portion of the protein encoded by the alternatively spliced exon.

      3) Lines 143: should "cKO" actually be "control"?

      Thank you for the helpful suggestions. There was indeed an error in the text description. We have corrected the text in this revised version (Page 6, Lines 138-139).

      4) Lines 272-3 "visual analysis using IGV showed the peak of Tial1/Tiar was stabilized in 5 dpp cKO mouse testes (Figure 7H)": "peak stabilization" is not evident to me from the figure nor do I see Tial1 listed as differentially expressed in the supplemental. I would refrain from using IGV visualization as the basis for the differential abundance of a transcript.

      Thank you very much for the helpful comments and suggestions. Tial1/Tiar is one of 39 stabilizing genes that are bound by SRSF1 and undergo abnormal AS. Following this advice, we have replaced the IGV peak visualization with FPKM values for Tial1/Tiar (Figure 8H). We have also corrected the text in this revised version (Page 15, Lines 296-300; Page 16, Lines 303-304).

      5) Lines 468-473: please clarify the background list used for GO enrichment analyses. By default, the genes expressed in the testis are enriched for spermatogenesis-related genes. To control for this and test whether a gene list is enriched for spermatogenesis-related genes beyond what is already seen in the testis, I recommend using a list of all expressed genes (for example, defined by TPM>=1) as the background list.

      We thank the reviewer for the professional comments and suggestions. Following this advice, all expressed genes (TPM sum across all samples >=1) were used as the background list for GO enrichment analyses. The results of the GO enrichment analysis of the AS genes were unchanged. The results of the GO enrichment analyses of the SRSF1 peak-containing genes, differential genes, and IP protein-associated genes have been corrected in the figures (Figure 2A, 7E, and 9E).
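
      The background restriction described in this response can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function names, the toy TPM values, and the choice of a one-sided hypergeometric test are all assumptions made here for clarity.

```python
from math import comb


def expressed_background(tpm_by_gene, min_tpm_sum=1.0):
    """Return the genes whose TPM summed over all samples meets the threshold."""
    return {gene for gene, tpms in tpm_by_gene.items() if sum(tpms) >= min_tpm_sum}


def enrichment_p_value(gene_list, term_genes, background):
    """One-sided hypergeometric P(X >= k) for one GO term.

    Both the query list and the term's gene set are first restricted to the
    background, so genes that are not expressed cannot inflate the enrichment.
    """
    term_in_bg = term_genes & background
    query = gene_list & background
    n_bg = len(background)          # size of the universe
    n_term = len(term_in_bg)        # annotated genes in the universe
    n_query = len(query)            # genes drawn
    k = len(query & term_in_bg)     # annotated genes among those drawn
    tail = sum(
        comb(n_term, i) * comb(n_bg - n_term, n_query - i)
        for i in range(k, min(n_term, n_query) + 1)
    )
    return tail / comb(n_bg, n_query)


# Hypothetical toy data: per-gene TPM values in two samples.
toy_tpm = {"A": [0.5, 0.6], "B": [0.0, 0.2], "C": [3.0, 4.0], "D": [1.0, 0.0], "E": [0.0, 0.0]}
toy_background = expressed_background(toy_tpm)
toy_p = enrichment_p_value({"A", "B"}, {"A", "E"}, toy_background)
```

      In practice a multiple-testing correction across GO terms would follow; it is omitted here to keep the sketch short.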

      6) Figure 2B: Could the authors mark where the statistically significant peaks appear on the tracks? There are many small peaks and it's unclear if they are significant or not.

      Thank you for the helpful suggestions. Following this advice, we have marked the areas of higher peaks in the figure (Figure 2B). We generally believe that any region above the peaks of IgG is likely to be a binding region, and of course, the higher the peak value, the more pre-mRNA is bound by SRSF1 in that region.

      7) Figure 7A: I assume the SRSF1 CLIP-seq genes are all the genes from the adult testis experiments. I would suggest limiting the CLIP-seq gene set to only those expressed in the P5 RNA-seq data, as if the target is not expressed at P5, there's no way it will be differentially expressed or differentially spliced in at P5.

      Thank you very much for the helpful comments and suggestions. Following this advice, we found that 3543 of the 4824 genes bound by SRSF1 were expressed in testes at 5 dpp. We have corrected the figure accordingly (Figure 8A), and the corresponding description has been added to the text and highlighted (Page 14, Lines 274-277).

      8) Figure 7F: Could the authors clarify where the alternatively spliced exon is relative to the total transcript, shown in 7H?

      Thank you for the helpful suggestions. Following this advice, we have labeled the numbers of the exons where alternative splicing occurs (Figure 8F).

      9) Please include where the sequencing and mass spec data will be publicly available.

      Thank you very much for the helpful comments and suggestions. Following this advice, these have been incorporated into the text, given descriptions, and highlighted (Page 25, Lines 560-565).

      Reviewer #3 (Recommendations For The Authors):

      Suggestions for improving the data and analysis

      1) The claim that TIAL1 mediates SRSF1 effects is not well supported; this claim should be adjusted or additional supporting data should be provided. To support a claim that alternative splicing of Tial1 mediates the effects of SRSF1, at least two additional pieces of data are needed: first, a demonstration that the two alternative protein isoforms have different molecular functions, either in vitro or in vivo; and second, a better quantitation of the levels and ratios of expression of the two different isoforms in vivo.

      Thank you for the helpful comments and suggestions. Following this advice, we quantified the expression levels and ratios of the two isoforms in vivo and found that Tial1 exon exclusion resulted in reduced expression of protein isoform X2 (Figure 8J). However, we are unable to demonstrate that the two alternative protein isoforms have different molecular functions, so this claim has been adjusted in the text. The changes have been incorporated into the text and highlighted (Lines 1-2, 43-45, 95, 306, 323-325, 408, 413-414).

      2) Likewise, the claim that "SRSF1 is required for "homing and self-renewal" of SSCs should be adjusted or better supported. As of now, the data supports a claim that SRSF1 is required for the establishment of the SSC population in the testis after birth. This could be due to defects in homing, self-renewal, or survival. To support claims about homing and self-renewal, these phenotypes should be tested more directly, for example by quantitating numbers of spermatogonia at the basal membrane in juvenile testes (homing) and expression of SSC markers in addition to the pan-germ cell marker VASA across early postnatal time points.

      Thank you very much for the helpful comments and suggestions. Immunohistochemical staining for FOXO1 was performed on 5 dpp mouse testis sections (Figure 6C). Quantification of germ cells with nuclear FOXO1 showed a reduced number of prospermatogonia in cKO mice (Figure 6D), and germ cells with nuclear FOXO1 similarly underwent abnormal homing (Figure 6E). Together, these data indicate that SRSF1 has an essential role in the homing of precursor SSCs. We have incorporated the data into Figure 6C-E, and the corresponding descriptions have been added to the text and highlighted (Page 9, Lines 191-201; Page 20, Lines 387-389). In addition, the "homing and self-renewal" claim for SSCs has been corrected throughout the text of this revised version (Lines 1-2, 39, 44, 49, 54-55, 68, 70, 72-73, 77, 84, 93-95, 191, 201, 240, 384-387, 397, 417-422, and 433).

      3) Additional, more detailed analyses of CLIP-seq and RNA-seq data at least showing that the libraries are of good quality should be provided.

      Thank you very much for the suggestions. Following this advice, detailed analyses of the RNA-seq data have been incorporated into the figures (Figure S2). However, detailed analyses of the CLIP-seq data have already been published in another paper (Sun et al., 2023), and we have not provided them here in order to avoid reusing a figure. Instead, we cite that paper in the article (Page 4, Line 105; Page 25, Lines 564-565).

      4) Gene Ontology analyses should be redone with a more appropriate background gene set.

      Thank you for the helpful suggestions. All expressed genes (TPM sum across all samples >=1) were used as the background list for GO enrichment analyses. The results of the GO enrichment analysis of the AS genes were unchanged. The results of the GO enrichment analyses of the SRSF1 peak-containing genes, differential genes, and IP protein-associated genes have been corrected in the figures (Figure 2A, 7E, and 9E).

      Minor points about the text and figures

      5) The species (mouse) should be stated earlier in the Introduction.

      Thank you for the professional suggestions. Following this advice, the mouse has been stated earlier in the Introduction (Page 3, Line 65).

      6) In Fig. 1C (Western blot), the results would be more convincing if quantitation of band intensities normalized to the loading control was added.

      Thank you very much for the comments and suggestions. Following this advice, we used ACTB as the loading control. The value in 16.5 dpc testes was set to 1.0, and the relative values for testes at other developmental stages are indicated. We have incorporated these data into the figure (Figure 1C).

      7) In Fig 5D, TUNEL signal in the single-channel image is difficult to see; please adjust the contrast.

      Thank you for the professional suggestions. Following this advice, the images of the channels have been replaced by enlarged images for better visibility (Figure 5C).

      Major comments

      1) In Fig 1D, it appears that SRSF1 is expressed most strongly in spermatogonia by immunofluorescence, but this is inconsistent with the sharp rise in expression detected by RT-qPCR at 20 days post partum (dpp) (Fig. 1B), which is when round spermatids are first added; this discrepancy should be explained or addressed.

      We appreciate the important comments from the reviewer. In another of our studies, we showed that SRSF1 expression is higher in pachytene spermatocytes and round spermatids (Sun et al., 2023). The sharp rise in expression detected by RT-qPCR at 20 days post partum (dpp) is therefore expected.

      Author response image 1.

      Dynamic localization of SRSF1 in male mouse germ cells. (Sun et al., 2023)

      2) It is important to provide a more comprehensive basic description of the CLIP-seq datasets beyond what is shown in the tracks shown in Fig. 2B. This would allow a better assessment of the data quality and would also provide information about the transcriptome-wide patterns of SRSF1 binding. No information or quality metrics are provided about the libraries, and it is not stated how replicates are handled to maximize the robustness of the analysis. The distribution of peaks across exons, introns, and other genomic elements should also be shown.

      Thank you very much for the helpful comments and suggestions. Detailed analyses of the CLIP-seq data have already been presented in another paper (Sun et al., 2023), and we have not provided them here in order to avoid reusing a figure. Instead, we cite that paper in the article (Page 4, Line 105; Page 25, Lines 564-565). In addition, the distribution of peaks in exons, introns, and other genomic elements is shown in Figure 2B.

      3) The claim that SRSF1 is required for "homing and self-renewal" of SSCs is made in multiple places in the manuscript. However, neither homing nor self-renewal is ever directly tested. A single image is shown in Fig. 5E of a spermatogonium at 5dpp that does not appropriately sit on the basal membrane, potentially indicating a homing defect, but this is not quantified or followed up. There is good evidence for depletion of spermatogonia starting at 7 dpp, but no further explanation of how homing and/or self-renewal fit into the phenotype.

      Thank you very much for the helpful comments and suggestions. Following this advice, immunohistochemical staining for FOXO1 was performed on 5 dpp mouse testis sections (Figure 6C). Quantification of germ cells with nuclear FOXO1 showed a reduced number of prospermatogonia in cKO mice (Figure 6D), and germ cells with nuclear FOXO1 similarly underwent abnormal homing (Figure 6E). Together, these data indicate that SRSF1 has an essential role in the homing of precursor SSCs. We have incorporated the data into Figure 6C-E, and the corresponding descriptions have been added to the text and highlighted (Page 9, Lines 191-201; Page 20, Lines 387-389). In addition, the "homing and self-renewal" claim for SSCs has been corrected throughout the text of this revised version (Lines 1-2, 39, 44, 49, 54-55, 68, 70, 72-73, 77, 84, 93-95, 191, 201, 240, 384-387, 397, 417-422, and 433).

      4) In Fig. 6A (lines 258-260) very few genes downregulated in the cKO are bound by SRSF1 and undergo abnormal splicing. The small handful that falls into this overlap could simply be noise. A much larger fraction of differentially spliced genes are CLIP-seq targets (~33%), which is potentially interesting, but this set of genes is not explored.

      Thank you for the helpful comments. We fully agree with the reviewer. Following this advice, we now state specifically that 39 stabilizing genes were bound by SRSF1 and underwent abnormal AS, and that Tial1/Tiar is one of these 39 genes. These statements have been added in this revised version (Page 14, Lines 279-280; Page 15, Lines 296-300).

      5) The background gene set for Gene Ontology analyses is not specified. If these were done with the whole transcriptome as background, one would expect enrichment of spermatogenesis genes simply because they are expressed in testes. The more appropriate set of genes to use as background in these analyses is the total set of genes that are expressed in testis.

      We thank the reviewer for the professional comments and suggestions. All expressed genes (TPM sum across all samples >=1) were used as the background list for GO enrichment analyses. The results of the GO enrichment analysis of the AS genes were unchanged. The results of the GO enrichment analyses of the SRSF1 peak-containing genes, differential genes, and IP protein-associated genes have been corrected in the figures (Figure 2A, 7E, and 9E).

      6) In general, the model is over-claimed: aside from interactions by IP-MS, little is demonstrated in this study about how SRSF1 affects alternative splicing in spermatogenesis, or how alternative splicing of TIAL1 specifically would result in the phenotype shown. It is not clear why Tial1/Tiar is selected as a candidate mediator of SRSF1 function from among the nine genes that are downregulated in the cKO, are bound by SRSF1, and undergo abnormal splicing. Although TIAL1 levels are reduced in cKO testes by Western blot (Fig. 7J), this could just be due to a depletion of germ cells from whole testis. The reported splicing difference for Tial1 seems very subtle and the ratio of isoforms does not look different in the Western blot image.

      Thank you very much for the helpful comments and suggestions. In our study, Tial1/Tiar is one of 39 stabilizing genes that are bound by SRSF1 and undergo abnormal AS. However, Western blotting showed that expression levels of TIAL1/TIAR isoform X2 were significantly suppressed (Figure 8J). So, the data indicate that SRSF1 is required for TIAL1/TIAR expression and splicing.

      Sun, L., Chen, J., Ye, R., Lv, Z., Chen, X., Xie, X., Li, Y., Wang, C., Lv, P., Yan, L., et al. (2023). SRSF1 is crucial for male meiosis through alternative splicing during homologous pairing and synapsis in mice. Sci Bull 68, 1100-1104. 10.1016/j.scib.2023.04.030.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      This paper presents a computational model of the evolution of two different kinds of helping ("work," presumably denoting provisioning, and defense tasks) in a model inspired by cooperatively breeding vertebrates. The helpers in this model are a mix of previous offspring of the breeder and floaters that might have joined the group, and can either transition between the tasks as they age or not. The two types of help have differential costs: "work" reduces "dominance value," (DV), a measure of competitiveness for breeding spots, which otherwise goes up linearly with age, but defense reduces survival probability. Both eventually might preclude the helper from becoming a breeder and reproducing. How much the helpers help, and which tasks (and whether they transition or not), as well as their propensity to disperse, are all evolving quantities. The authors consider three main scenarios: one where relatedness emerges from the model, but there is no benefit to living in groups, one where there is no relatedness, but living in larger groups gives a survival benefit (group augmentation, GA), and one where both effects operate. The main claim is that evolving defensive help or division of labor requires the group augmentation; it doesn't evolve through kin selection alone in the authors' simulations.

      This is an interesting model, and there is much to like about the complexity that is built in. Individual-based simulations like this can be a valuable tool to explore the complex interaction of life history and social traits. Yet, models like this also have to take care of both being very clear on their construction and exploring how some of the ancillary but potentially consequential assumptions affect the results, including robust exploration of the parameter space. I think the current manuscript falls short in these areas, and therefore, I am not yet convinced of the results. Much of this is a matter of clearer and more complete writing: the Materials and Methods section in particular is incomplete or vague in some important junctions. However, there are also some issues with the assumptions that are described clearly.

      Below, I describe my main issues, mostly having to do with model features that are unclear, poorly motivated (as they stand), or potentially unrealistic or underexplored.

      We would like to thank the reviewer for the thoughtful comments that helped us to greatly improve the clarity of our paper.  

      One of the main issues I have is that there is almost no information on what happens to dispersers in the model. Line 369-67 states dispersers might join another group or remain as floaters, but gives no further information on how this is determined. Poring through the notation table also comes up empty as there is no apparent parameter affecting this consequential life history event. At some point, I convinced myself that dispersers remain floaters until they die or become breeders, but several points in the text contradict this directly (e.g., l 107). Clearly this is a hugely important model feature since it determines fitness cost and benefits of dispersal and group size (which also affects relatedness and/or fitness depending on the model). There just isn't enough information to understand this crucial component of the model, and without it, it is hard to make sense of the model output.

      We use the same dispersal gene β to represent the likelihood an individual will either leave or join a group, thereby quantifying both dispersal and immigration using the same parameter. Specifically, individuals with higher β are more likely to remain as floaters (i.e., disperse from their natal group to become a breeder elsewhere), whereas those with lower β are either more likely to remain in their natal group as subordinates (i.e., queue in a group for the breeding position) or join another group if they dispersed.  

      We added in the text “Dispersers may migrate to another group to become subordinates or remain as floaters waiting for breeding opportunities, which is also controlled by the same genetic dispersal propensity as subordinates” to clarify this issue. We also added in Table 1 that β is the “genetic predisposition to disperse versus remain in a group”, and to Figure 1 that “subordinates in the group (natal and immigrants) […]” after we already clarified that “Dispersers/floaters may join a random group to become subordinates.”

      Related to that, it seems to be implied (but never stated explicitly) that floaters do not work, and therefore their DV increases linearly with age (H_work in eq.2 is zero). That means any floaters that manage to stick around long enough would have higher success in competition for breeding spots relative to existing group members. How realistic is this? I think this might be driving the kin selection-only results that defense doesn't evolve without group augmentation (one of the two main ways). Any subordinates (which are mainly zero in the no GA, according to the SI tables; this assumes N=breeder+subordinates, but this isn't explicit anywhere) would be outcompeted by floaters after a short time (since they evolve high H and floaters don't), which in turn increases the benefit of dispersal, explaining why it is so high. Is this parameter regime reasonable? My understanding is that floaters often aren't usually high resource holding potential individuals (either b/c high RHP ones would get selected out of the floater population by establishing territories or b/c floating isn't typically a thriving strategy, given that many resources are tied to territories). In this case, the assumption seems to bias things towards the floaters and against subordinates to inherit territories. This should be explored either with a higher mortality rate for floaters and/or a lower DV increase, or both.

      When it comes to floaters replacing dead breeders, the authors say a bit more, but again, the actual equation for the scramble competition (which only appears as "scramble context" in the notation table) is not given. Is it simply proportional to R_i/\sum_j R_j ? Or is there some other function used? What are the actual numbers of floaters per breeding territory that emerge under different parameter values? These are all very important quantities that have to be described clearly.

      Although it is true that dispersers do not work when they are floaters, they may later help if they immigrate into a group as a subordinate. Consequently, immigrant subordinates have no inherent competitive advantage over natal subordinates (as step 2.2. “Join a group” is followed by step 3. “Help”, which occurs before step 5. “Become a breeder”). Nevertheless, floaters can potentially outcompete subordinates of the same age if they attempt to breed without first queuing as a subordinate (step 5) when subordinates are engaged in work tasks. We believe that this assumption is realistic and constitutes part of the costs associated with work tasks. However, floaters are at a disadvantage for becoming a breeder because: (1) floaters incur higher mortality than individuals within groups (Eq. 3); and (2) floaters may only attempt to become breeders in some breeding cycles (versus subordinate group members, who are automatically candidates for an open breeding position in the group in each cycle). Therefore, due to their higher mortality, floaters are rarely older than individuals within groups, which heavily influences their dominance value and competitiveness. Additionally, any competitive advantage that floaters might have over other subordinate group members is unlikely to drive the kin selection-only results because subordinates would preferably choose defense tasks instead of work tasks so as not to be at a competitive disadvantage compared to floaters.

      Regarding the concern that floaters are not usually high resource holding potential (RHP) individuals and that our assumptions might therefore be unrealistic: empirical work in a number of species has shown that dispersers are not necessarily those of lower RHP or of lower quality. In fact, according to the ecological constraints hypothesis, one might predict that high quality individuals are the ones that disperse because only individuals in good condition (e.g., larger body size, better energy reserves) can afford the costs associated with dispersal (Cote et al., 2022). To allow differences in dispersal propensity depending on RHP, we extended our model in the Supplemental Materials by incorporating a reaction norm of dispersal based on individual rank (D = 1 / (1 + exp(β<sub>R</sub> * R − β<sub>0</sub>))) under the section “Dominance-dependent dispersal propensities” and now referenced in L195. This approach allows individuals to adjust their dispersal strategy to their competitiveness and to avoid kin competition by remaining as a subordinate in another group. Results show that the addition of the reaction norm of dispersal to rank did not qualitatively influence the results described in the main text.
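
      A rank-dependent dispersal reaction norm of this logistic kind can be sketched as below. This is an illustration only: the minus sign in front of the intercept, the parameter names `beta_r` and `beta_0`, and the example values are assumptions, not details confirmed by the response.

```python
import math


def dispersal_propensity(rank, beta_r, beta_0):
    """Logistic reaction norm: dispersal probability as a function of dominance rank.

    Assumed form: D = 1 / (1 + exp(beta_r * rank - beta_0)).
    beta_r sets how strongly rank shifts dispersal; beta_0 is the intercept.
    """
    return 1.0 / (1.0 + math.exp(beta_r * rank - beta_0))


# Hypothetical genotype: under this assumed form, a positive beta_r makes
# higher-ranked individuals less likely to disperse.
propensities = [dispersal_propensity(rank, beta_r=1.0, beta_0=0.0) for rank in range(4)]
```

      With `beta_r = 0` the propensity no longer depends on rank, which recovers a purely genetic dispersal probability.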

      We also added “number of floaters” present in the whole population to the summary tables as requested.  

      As a side note, the “scramble context” we mention was an additional implementation in which we made rank independent of age. However, since the main conclusions remained unchanged, we decided to remove it for simplicity from the final manuscript, but we forgot to remove it from Table 1 before submission.  

      I also think the asexual reproduction with small mutations assumption is a fairly strong one that also seems to bias the model outcomes in a particular way. I appreciate that the authors actually measured relatedness within groups (though if most groups under KS have no subordinates, that relatedness becomes a bit moot), and also eliminated it with their ingenious swapping-out-subordinates procedure. The fact remains that unless they eliminate relatedness completely, average relatedness, by design, will be very high. (Again, this is also affected by how the fate of the dispersers is determined, but clearly there isn't a lot of joining happening, just judging from mean group sizes under KS only.) This is, of course, why there is so much helping evolving (even if it's not defensive) unless they completely cut out relatedness.

      As we showed in the Supplementary Tables and the section on relatedness in the SI (“Kin selection and the evolution of division of labor”), high relatedness does not appear to explain our results. In evolutionary biology generally and in game theory specifically (with the exception of models on sexual selection or sex-specific traits), asexual reproduction is often modelled because it reduces unnecessary complexity. To further study the effect of relatedness on kin structures more closely resembling those of vertebrates, however, we created an additional “relatedness structure level”, where we shuffled half of the philopatric offspring using the same method used to remove relatedness completely, effectively reducing within-group relatedness structure by half. As shown in the new Figure S3, the conclusions of the model remain unchanged.

      Finally, the "need for division of labor" section is also unclear, and its construction also would seem to bias things against division of labor evolving. For starters, I don't understand the rationale for the convoluted way the authors create an incentive for division of labor. Why not implement something much simpler, like a law of minimum (i.e., the total effect of helping is whatever the help amount for the lowest value task is) or more intuitively: the fecundity is simply a function of "work" help (draw Poisson number of offspring) and survival of offspring (draw binomial from the fecundity) is a function of the "defense" help. As it is, even though the authors say they require division of labor, in fact, they only make a single type of help marginally less beneficial (basically by half) if it is done more than the other. That's a fairly weak selection for division of labor, and to me it seems hard to justify. I suspect either of the alternative assumptions above would actually impose enough selection to make division of labor evolve even without group augmentation.

      In nature, multiple tasks are often necessary to successfully rear offspring. We simplify this principle in the model by maximizing reproductive output when both tasks are carried out to a similar extent, allowing for some flexibility from the mean. We added to the manuscript “For example, in many cooperatively breeding birds, the primary reasons that individuals fail to produce offspring are (1) starvation, which is mitigated by the feeding of offspring, and (2) nest depredation, which is countered by defensive behavior. Consequently, both types of tasks are necessary to successfully produce offspring, and focusing solely on one while neglecting the other is likely to result in lower reproductive success than if both tasks are performed by individuals within the group.”

      Regarding making fecundity a function of work tasks and offspring survival as a function of defensive tasks, these are actually equivalent in model terms, as it’s the same whether breeders produce three offspring and two die, or if they only produce one. This represents, of course, an oversimplification of the natural context, where breeding unsuccessfully is more costly (in terms of time and energy investment) than not breeding at all.
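
      In terms of the number of surviving offspring, this equivalence corresponds to the standard Poisson thinning result: if brood size is Poisson with mean λ and each offspring survives independently with probability p, the number of survivors is Poisson with mean λp. A minimal numerical check follows, assuming Poisson fecundity and binomial survival as in the reviewer's suggestion; the function names and parameter values are hypothetical.

```python
from math import comb, exp, factorial


def poisson_pmf(k, lam):
    """Probability of exactly k events under a Poisson distribution with mean lam."""
    return exp(-lam) * lam**k / factorial(k)


def surviving_offspring_pmf(s, lam, p, max_brood=60):
    """P(s survivors) when brood size ~ Poisson(lam) and each offspring survives with prob p.

    The sum over brood sizes is truncated at max_brood, which is ample for small lam.
    """
    return sum(
        poisson_pmf(n, lam) * comb(n, s) * p**s * (1 - p) ** (n - s)
        for n in range(s, max_brood + 1)
    )


# Mean brood of three with one third surviving, versus a mean brood of one
# with no offspring mortality.
two_step = [surviving_offspring_pmf(s, lam=3.0, p=1 / 3) for s in range(6)]
one_step = [poisson_pmf(s, 1.0) for s in range(6)]
```

      The check covers only offspring numbers; it does not capture the time and energy costs of failed breeding that the response mentions.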

      Overall, this is an interesting model, but the simulation is not adequately described or explored to have confidence in the main conclusions yet. Better exposition and more exploration of alternative assumptions and parameter space are needed.

      We hope that our clarifications and extension of the model satisfy your concerns.  

      Reviewer #2 (Public review):

      Summary:

      This paper formulates an individual-based model to understand the evolution of division of labor in vertebrates. A main conclusion of the paper is that direct fitness benefits are the primary factor causing the evolution of vertebrate division of labor, rather than indirect fitness benefits.

      Strengths:

      The paper formulates an individual-based model that is inspired by vertebrate life history. The model incorporates numerous biologically realistic details, including the possibility to evolve age polyethism where individuals switch from work to defence tasks as they age or vice versa, as well as the possibility of comparing the action of group augmentation alone with that of kin selection alone.

      Weaknesses:

      The model makes assumptions that restrict the possibility that kin selection leads to the evolution of helping. In particular, the model assumes that in the absence of group augmentation, subordinates can only help breeders but cannot help non-breeders or increase the survival of breeders, whereas with group augmentation, subordinates can help both breeders and non-breeders and increase the survival of breeders. This is unrealistic as subordinates in real organisms can help other subordinates and increase the survival of non-breeders, even in the absence of group augmentation, for instance, with targeted helping to dominants or allies. This restriction artificially limits the ability of kin selection alone to lead to the evolution of helping, and potentially to division of labor. Hence, the conclusion that group augmentation is the primary factor driving vertebrate division of labor appears forced by the imposed restrictions on kin selection. The model used is also quite particular, and so the claimed generality across vertebrates is not warranted.

      We would like to thank the reviewer for the in-depth review. We respond to these and other comments below.  

      I describe some suggestions for improving the paper below, more or less in the paper's order.

      First, the introduction goes to great lengths trying to convince the reader that this model is the first in this or another way, particularly in being only for vertebrates, as illustrated in the abstract where it is stated that "we lack a theoretical framework to explore the conditions under which division of labor is likely to evolve" (line 13). However, this is a risky and unnecessary motivation. There are many models of division of labor and some of them are likely to be abstract enough to apply to vertebrates even if they are not tailored to vertebrates, so the claims for being first are not only likely to be wrong but will put many readers in an antagonistic position right from the start, which will make it harder to communicate the results. Instead of claiming to be the first or that there is a lack of theoretical frameworks for vertebrate division of labor, I think it is enough and sufficiently interesting to say that the paper formulates an individual-based model motivated by the life history of vertebrates to understand the evolution of vertebrate division of labor. You could then describe the life history properties that the model incorporates (subordinates can become reproductive, low relatedness, age polyethism, etc.) without saying this has never been done or that it is exclusive to vertebrates; indeed, the paper states that these features do not occur in eusocial insects, which is surprising as some "primitively" eusocial insects show them. So, in short, I think the introduction should be extensively revised to avoid claims of being the first and to make it focused on the question being addressed and how it is addressed. I think this could be done in 2-3 paragraphs without the rather extensive review of the literature in the current introduction.

      We have revised the novelty statements in the Introduction by more clearly emphasizing how our model addresses gaps in the existing literature. More details are provided in the comments below.

      Second, the description of the model and results should be clarified substantially. I will give specific suggestions later, but for now, I will just say that it is unclear what the figures show. First, it is unclear what the axes in Figure 2 show, particularly for the vertical one. According to the text in the figure axis, it presumably refers to T, but T is a function of age t, so it is unclear what is being plotted. The legend explaining the triangle and circle symbols is unintelligible (lines 227-230), so again it is unclear what is being plotted; part of the reason for this unintelligibility is that the procedure that presumably underlies it (section starting on line 493) is poorly explained and not understandable (I detail why below). Second, the axes in Figure 3 are similarly unclear. The text in the vertical axis in panel A suggests this is T, however, T is a function of t and gamma_t, so something else must be being done to plot this. Similarly, in panel B, the horizontal axis is presumably R, but R is a function of t and of the helping genotype, so again some explanation is lacking. In all figures, the symbol of what is being plotted should be included.

      We added the symbols of the variables to the Figure axes to increase clarity. In Figure 3A, we corrected the subindex t in the x-axis; it should be subindex R (reaction norm to dominance rank instead of age). As described in Table 1, all values of T, H and R are phenotypically expressed values. For instance, T values are the phenotypically expressed values from the individuals in the population according to their genetic gamma values and their current dominance rank at a given time point.  

      Third, the conclusions sound stronger than the results are. A main conclusion of the paper is that "kin selection alone is unlikely to select for the evolution of defensive tasks and division of labor in vertebrates" (lines 194-195). This conclusion is drawn from the left column in Figure 2, where only kin selection is at play, and the helping that evolves only involves work rather than defense tasks. This conclusion follows because the model assumes that without group augmentation (i.e., xn=0, the kin selection scenario), subordinates can only help breeders to reproduce but cannot help breeders or other subordinates to survive, so the only form of help that evolves is the least costly, not the most beneficial as there is no difference in the benefits given among forms of helping. This assumption is unrealistic, particularly for vertebrates where subordinates can help other group members survive even in the absence of group augmentation (e.g., with targeted help to certain group members, because of dominance hierarchies where the helping would go to the breeder, or because of alliances where the helping would go to other subordinates). I go into further details below, but in short, the model forces a narrow scope for the kin selection scenario, and then the paper concludes that kin selection alone is unlikely to be of relevance for the evolution of vertebrate division of labor. This conclusion is particular to the model used, and it is misleading to suggest that this is a general feature of such a particular model.

      The scope of this paper was to study division of labor in cooperatively breeding species with fertile workers (i.e., primarily vertebrates), in which help is exclusively directed towards breeders to enhance offspring production (i.e., alloparental care). Our focus is in line with previous work in most other social animals, including eusocial insects and humans, which emphasizes how division of labor maximizes group productivity. Other forms of “general” help are not considered in the paper, and such forms of help are rarely considered in cooperatively breeding vertebrates or in the division of labor literature, as they do not result in task partitioning to enhance productivity.

      Overall, I think the paper should be revised extensively to clarify its aims, model, results, and scope of its conclusions.

      Recommendations for the authors: 

      Reviewer #1 (Recommendations for the authors):

      I reserved this section for more minor comments, relating to clarity and a general admonition to give us more detail and exploration of some basic population genetic quantities.

      Another minor point, although depending on whether I assume right or wrong, it could be major: I am not entirely sure that dispersers help in the groups they join as helpers, because of line 399, which states specifically that individuals who do remain in natal territories do. But I assume dispersers help (elsewhere, the authors state helping is not conditional on relatedness to the breeder). Otherwise, this model becomes even weirder for me. Either way, please clarify.

      Apologies if this was not clear. Immigrants that join a group as subordinates (i.e., dispersers from another group) help and queue for a breeding position, as does any natal subordinate born into the group. We rephrased the sentence to “Subordinate group members, either natal or immigrants to the group, […]”

      More generally, in simulation studies like this, there can be interactions between the strength of selection (which affects overall genetic variation maintained in the population), population size, and mutation rate/size, which can affect, for example, relatedness values. None of these quantities is explored here (and their interactions are not quantified), so it is not possible to evaluate the robustness of any of these results.

      Thank you for your comments about the parameter landscape. It is important to point out that variations in the mutation rate do not qualitatively affect our results, as this is something we explored in previous versions of the model (not shown). Briefly, we find that variations in the mutation rates only alter the time required to reach equilibrium. Increasing the step size of mutation diminishes the strength of selection by adding stochasticity and reducing the genetic correlation between offspring and their parents. Population size could, in theory, affect our results, as small populations are more prone to extinction. Since this was not something we planned to explore in the paper directly, we specifically chose a large population size, or better said, a large number of territories (i.e. 5000) that can potentially host a large population.  

      The authors also never say how it is actually determined. There is the evolved helping variable, and there is also the evolved reaction norm. I assume that the actual amount of help of each type is given by the product of T (equation 1) and H (for defense) and (1-T) and H (for work), but this should be stated explicitly.  

      Help provided is an interaction between H (total effort) and T (proportion of total effort invested in each type of task). To clarify the distinction between these two processes, we have now added “Hence, the gene α regulates the amount of help expressed, while the genes γ determine which specific helping tasks are performed at different time points in the breeding cycle”.  
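
      The relationship between H and T described above can be sketched as follows. This is a minimal illustration, assuming the reviewer's reading that defense effort is T × H and work effort is (1 − T) × H; the names `HelpAllocation` and `splitHelp` are hypothetical and are not taken from the authors' model code.

```cpp
#include <algorithm>

// Hypothetical container for the two helping components (illustrative names).
struct HelpAllocation {
    double defense;  // effort invested in defensive tasks
    double work;     // effort invested in work tasks
};

// Split the total help H between the two tasks using T, the proportion of
// effort allocated to defense. Assumption: defense = T * H, work = (1 - T) * H.
// Inputs are clamped so that H >= 0 and 0 <= T <= 1.
inline HelpAllocation splitHelp(double totalHelp, double taskProportion) {
    const double H = std::max(0.0, totalHelp);
    const double T = std::min(1.0, std::max(0.0, taskProportion));
    return {T * H, (1.0 - T) * H};
}
```

      Under this assumed form, a subordinate with H = 2 and T = 0.25 would invest 0.5 units in defense and 1.5 units in work, and a subordinate with H = 0 provides no help regardless of T.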

      It is also weird that after introducing the T variable as a function of age, Figure 3 actually depicts it as a function of dominance value.

      Thank you for pointing out an error in Eq. 1. This inequality was indeed written incorrectly in the paper (but is correct in the model code); it is dominance rank instead of age (see code in Individual.cpp lines 99-119). We corrected this mistake throughout the manuscript.

      What is "scramble context"?

      “Scramble context” was an additional implementation that we decided to remove from the final manuscript but forgot to remove from Table 1 before submission. We have now removed it from the table.

      Reviewer #2 (Recommendations for the authors):

      Some specific comments:

      (1) L 31: "All theoretical..." These absolute statements are risky and unnecessary.

      Rephrased to “To date, most theoretical and empirical work…”

      (2) L 46: I believe Tom Wenseleers has published on the evolution of division of labor with reproductive workers and high within-colony conflict.

      Tom Wenseleers has indeed produced some models on the evolution of cooperation in social insects where some workers may reproduce. However, these models focus on how relatedness and policing select for a reduction in within-group conflict and for the evolution of reproductive division of labor. Our model focuses instead on division of labor among workers (helpers). We have rephrased this section to “task specialization is linked to sterility and where conflict of interest is generally low” to account for species of social insects in which variation in relatedness between group members and higher levels of reproductive conflict may arise. We also cited one of his papers.

      (3) L 57: Again, unnecessary categorical statements.

      Rephrased to “Although a great deal of recent empirical work highlights the importance of direct benefits in the evolution of cooperative breeding behavior in vertebrates [21–24], we lack understanding on the joint influence of direct and indirect fitness benefits in the evolution of division of labor.”

      (4) L 67: This is said to be a key distinction, but in the paper, such a key role is not clearly shown. This and other tangential points are unnecessary to keep the introduction to the point.

      The different fitness costs of different tasks are the basis of our model of division of labor. Therefore, this is a key distinction and the basis from which we describe different tasks in the model. We have left this sentence unchanged.

      (5) L 61-73: "In vertebrates, however, helpers may obtain fitness benefits directly via reproduction..." Some social insects may do so as well. It seems unnecessary and incorrect to say that vertebrate sociality is fundamentally different from invertebrate one. I think it is sufficiently interesting to say this work aims to understand vertebrate division of labor, by explicitly modeling aspects of its life history, without saying this can't happen in invertebrates or that no other model has ever done anything like it.

      Our point is not that workers in social insects can never obtain direct fitness benefits, but that previous models focusing on the colony's reproductive outcome are a good approximation only for eusocial insects with sterile workers. However, to make this clearer we have added “In vertebrates and social insect with fertile workers, however, helpers may obtain fitness benefits directly via […]”.

      (6) L 74-86: By this point, the introduction reads like a series of disconnected comments without a clear point.

      In L60 we added: “Understanding how direct and indirect benefits interact is particularly important in systems where individuals may differentially bear the fitness costs of cooperation”. By adding this sentence, we emphasize our focus on the largely unexplored direct fitness benefits and costs, as well as their interaction with indirect fitness. We then proceed to explain why it is crucial to consider that tasks have varying direct fitness costs and how the fitness benefits derived from cooperation change with age and resource-holding potential. These elements are essential for studying the division of labour in species with totipotent workers.

      (7) L 87: This sentence gives a clear aim. It would be clearer if the introduction focused on this aim.

      With the new sentence added in L60 (see previous comment), we bring the focus to the main question that we are trying to address in this paper earlier in the Introduction.  

      (8) L 88: "stochastic model" should be changed to "individual-based model".

      Done.

      (9) L 104: "limited number" is unclear. Say a fixed finite number, or something specific.

      Done.

      (10) L 105: "unspecified number" is unclear. Say the number of subordinates emerges from the population dynamics.

      Changed to “variable number of subordinate helpers, the number of which is shaped by population dynamics, with all group members capable of reproducing during their lifetime”.

      (11) L 112: "Dispersers" is used, but in the previous lines 107-109, the three categories introduced used different terms. Those three terms introduced should be used consistently throughout the paper, without using two or more terms for one thing.

      We use the term “disperser” to describe individuals that disperse from their natal group.

      Dispersers can assume one of three roles: (1) they can join another group as "subordinates"; (2) they can join another group as "breeders" if they successfully outcompete others; or (3) they can remain as "floaters" if they fail to join a group. "Floaters" are individuals who persist in a transient state without access to a breeding territory, waiting for opportunities to join a group in an established territory. We rephrased the sentence to “Dispersers cannot reproduce without acquiring a territory (denoted here as floaters)”. This was also clarified in other instances where the term “dispersers” was used (e.g. L407). In other instances where this might not have been clear, we replaced “dispersers” with “floaters”.

      (12) L 112: "(floaters)" Unclear parenthesis.

      See previous comment.  

      (13) L 115: There should be a reference to Methods around here.

      Added a reference to Figure 1.

      (14) L 117: To be clearer, say instead that dominance value is a linearly increasing function of age as a proxy of RHP and a linearly decreasing function of help provided due to the costs of working tasks. And refer to equation 2.

      Rephrased to “We use the term dominance value to designate the competitiveness of an individual compared to other candidates in becoming a breeder, regardless of group membership, that increases as a function of age, serving as a proxy for resource holding potential (RHP), and decreases as a function of help provided, reflecting costs to body condition from performing working tasks (Eq. 2).” We did not include “linearly” to keep it simpler, since it is clear from Eq. 2, which is now referenced here.  

      (15) L 119: "Subordinate helpers". As all subordinates are helpers, the helper qualifier is confusing.

      Subordinates are not necessarily helpers, as they can evolve help values of 0, which is why we make this explicit here.

      (16) L 119: "choose". This terminology may be misleading. The way things are implemented in the model is that individuals are assigned a task depending on their genetic traits gamma. Perhaps it would be better to use a less intentional term, like perform one of two tasks.

      We changed “choose between two” to “engage in one of two”, which has fewer connotations of intentionality.

      (17) L 124: "Subordinates can [...] exhibit task specialization that [...] varies with their dominance value". It should be that it varies with age.

      Apologies. The equation was wrong; it does vary with dominance value. We corrected it accordingly.

      (18) L 133: "maximised" This is apparently important for the modelling procedure, but it is completely unclear what it means. Equation 4 comes out of nowhere, and it is said that such an equation is the maximum amount of help that can affect fecundity. Why? What does this mean? If there is something that is maximised, this should be proven. This value is then used for something (line 507), but it is unclear why or what it is used for (it says "we use the value of Hmax instead" without saying what for, no justification for the listed inequalities are given, and the claimed maximisation of an unspecified variable at those H values is not proven). Moreover, the notation in this section is also unclear: what are the sums over? Also, Hdefence and Hwork should vary over the index that is summed over, but the notation suggests that those quantities don't vary.

      We changed “maximized” to “greatest”, and we added a clarification of the rationale behind maximizing the impact of help on the breeder’s productivity: “For example, in many cooperatively breeding birds, the primary reasons that breeders fail to produce offspring are (1) starvation, which is mitigated by the feeding of offspring, here considered as a work task, and (2) nest depredation, which is countered by defensive behavior. Consequently, both types of tasks are often necessary for successful reproduction, and focusing solely on one while neglecting the other is likely to result in lower reproductive success than if both tasks are performed by helpers within the group.”

      We now also clarify that the sums are for help given within a group (L 507), and added indexes to the equations.

      (19) L 152: "habitat saturation" How is this implemented? How is density dependence implemented? Or can the population size keep increasing indefinitely? It would be good to plot the population size over time, the group size over time, and the variance in group size over time. This could substantiate later statements about enhancing group productivity and could all be shown in the SI.

      Habitat saturation emerges from population dynamics due to the limited availability of territories and the fluctuating number of individuals, so that highly productive environments experience habitat saturation. Because the number of group members is not restricted in our model, the population could theoretically increase indefinitely. However, this is not observed in the results presented here, as we selected parameter landscapes that stabilize population numbers. We confined our parameters to those where the population neither increased indefinitely nor collapsed, as we did not incorporate density-dependent mortality for simplicity. Consequently, the group size reported in the SI, where the standard deviation is already included, closely represents group size at any other given time during equilibrium.

      L 336: we changed “environments with habitat saturation” to “environments that lead to habitat saturation”, to increase clarity.

      (20) L 152: "lifecycle". Rather than the lifecycle, the figure describes the cycle of events in a single time step. The lifecycle (birth to death) goes over multiple time steps (as individuals live over multiple steps). So this figure shouldn't be called a life cycle.

      We changed “lifecycle” to “breeding cycle”.

      (21) L 156: "generation". This is not a generation but a time step.

      We changed “generation” to “breeding cycle”.

      (22) L 157: "previous life cycle" would mean that the productivity of a breeder depends on the number of helpers that its parents had, which is not what is meant.

      We changed “lifecycle” to “breeding cycle”.

      (23) L 158: "Maximum productivity is achieved when different helping tasks are performed to a similar extent." Again, unclear why that is the case.

      We added a clarification on this, see response to comment 18.  

      (24) L 160: "Dispersers/floaters". Use just one term for a single thing.

      See response to comment 11.   

      (25) L 162: "dispersal costs". I don't recall these being described in Methods.

      Individuals that disperse do not enjoy the protection of living in a territory and within a group of other individuals, so they have a higher mortality risk, as described in Eq. 3.3 (negative values in the exponential part of the equation increase survival). The cost of dispersal is the same as that incurred by individuals that remain floaters at a given time step.

      (26) L 164: "generation" -> time step.

      We changed this to “breeding cycle”.  

      (27) L 170: "Our results show that division of labor initially emerges because of direct fitness benefits..." This is a general statement, but the results are only particular to the model. So this statement and others in the manuscript should be particular to the model. Also, Figure 2 doesn't say anything about what evolves "initially" as it only plots evolutionary equilibria.

      We rephrased this statement to “Our results suggest that voluntary division of labor involving tasks with different fitness costs is more likely to emerge initially because of direct fitness benefits”, to more accurately represent the conditions under which we modeled the division of labor.  

      Our reference to “initially” is regarding group formation (family groups versus aggregations of unrelated individuals or a mix). This is shown in the comparison between the different graphs at equilibrium. The initial state of the simulation is that all individuals disperse and do not cooperate.  

      (28) L 171: "but a combination of direct and indirect fitness benefits leads to higher rates and more stable forms of division of labor". What do you mean by "higher rates and more stable forms of division of labor"? Say how division of labor is shown in the figure (with intermediate T?).

      Yes, intermediate values of T show division of labor if γR ≠ 0. This is described under the section “The role of dominance in task specialization”. We added “with intermediate values suggesting a division of labor” to the Figure 2 legend.  

      (29) L173-175: "as depicted in Figure 2, intermediate values of task specialization indicate in all cases age/dominance-mediated task specialization (γt ≠ 0; Table 1) and never a lack of specialization (γt = 0; Table 1)". This sentence is unclear and imprecise. Does this sentence want to say that in Figure 2, all plots with intermediate values of T involve gamma t different from zero? If so, just say that.

      Rephrased to: “In Figure 2, all plots depicting intermediate values of T exhibit non-zero γR values and, hence, division of labor”.

      (30) L179-180: "forms of help that impact survival never evolve under any environmental condition when only kin selection occurs". This is misleading because under the KS scenario, help cannot positively impact survival in this model, so they never evolve.

      Help cannot affect survival but could potentially affect group persistence. If helpers increase breeder productivity and offspring remain philopatric and queue for the breeding position, then they will receive help from related individuals.   

      (31) L 210: "initially". What do you mean by that?

      Help only evolves in our model in family groups, which may then open the door for the evolution of help in mixed-kin groups. Therefore, we use “initially” to refer to the ancestral group structure that likely led to cooperation under benign environmental conditions. We rephrased this section to “in more benign (and often highly productive) environments that lead to habitat saturation, help likely evolved initially in family groups, and defensive tasks are favored because competition for the breeding position is lower under kin selection.”

      (32) L 212: "kin selection is achieved". What does that mean?

      Rephrased to “kin selection acts not only by selecting subordinates in their natal group to increase the productivity of a related breeder […]”

      (33) L 216: "division of labor seems to be more likely to evolve in increasingly harsh environments". Say in parentheses where this is shown.

      Added.  

      (34) L 218: "help evolves in benign environments". I don't see where this is shown. Figure 2 doesn't show that H is higher with lower m (e.g., in KS+GA column).

      Help does not evolve in benign environments under only direct fitness benefits derived from group augmentation (shown in Figure 2).  

      (35) L 225: "y-axis" should be "vertical axis", as y has another meaning in the model.

      Done.

      (36) L 226: "likelihood". Here and throughout, "likelihood" should be changed to probability. Likelihood means something else.

      Thank you for the advice; we have corrected this throughout the manuscript.

      (37) L 236: "the slope of the reaction norm for the dominance value in task specialization".

      Unclear. Clearer to say: the rate at which individuals shift from defense to work as they age.

      The important part is not so much the rate but the direction, that is, from work task to defense (or vice versa) as their rank increases. Changed to “the direction and rate of change in task specialization with dominance”.

      (38) L 257: "(task = 0; cost to dominance value)," This seems out of place.

      This aims to clarify that work tasks have a cost to dominance, while defense tasks have a cost to survival. This is particularly relevant in this model since different helping tasks are defined by their fitness costs.

      (39) L 258: "increase"-> "increase with age".

      Added “with dominance”.

      (40) L 262: "division of labor equilibria" What is that?

      Changed to “at equilibrium when division of labor evolves”

      (41) L 268: "Our findings suggest that direct benefits of group living play a driving role in the evolution of division of labor via task specialization in species with totipotent workers". This is a very general statement, but the results are much more circumscribed. First, the model is quite specific by assuming that, in the absence of group augmentation (xn=0), indirect fitness benefits can only be given to breeders (Equation 5) but not to other subordinates (Equations 2, 3.1). This is unrealistic, particularly for vertebrates, and reduces the possibility that indirect fitness benefits play a role.  

      As previously discussed, the scope of this paper was to study division of labor in cooperatively breeding species with fertile workers in which help is exclusively directed towards breeders to enhance offspring production through alloparental care. Other forms of “general” help do not result in task partitioning to enhance productivity.

      Second, the difference in costs of work and defense are what drive the evolution of "division of labor" (understood as intermediate T in case this is what the authors mean) in the KS scenario, but the functional forms of those two costs are quite specific and not of the same form, so these functions may bias the results found. Specifically, R is an unbounded linear function of work and the effect of this function becomes weaker as the individual ages due to the weakening force of selection with age (Equation 2) whereas Sh is a particular bounded nonlinear function of defense (Equation 3.1). These differences may tend to make the effect of Sh stronger due to the particular functions chosen.  

      The difference in costs is inherent to the nature of the different tasks (work versus defense): while survival is naturally bounded, with death as the lower bound, dominance costs are potentially unbounded, as they are influenced by dynamic social contexts and potential competitors. Therefore, we believe that the model’s cost structure is not too different from that in nature.  

      Third, no parameter sweep is given to see to what extent these results hold across the many parameters involved. So, in summary, the discussion should at least reflect that the results are of a restricted nature rather than giving the impression that they are of the suggested level of generality.

      During the exploratory phase of the model development, various parameters and values were assessed. However, the manuscript only details the ranges of values and parameters where changes in the behaviors of interest were observed, enhancing clarity and conciseness. For instance, variation in yh (the cost of help on dominance when performing “work tasks”) led to behavioral changes similar to those caused by changes in xh (the cost of help in survival when performing “defensive tasks”), as both are proportional to each other. Specifically, since an increase in defense costs raises the proportion of work relative to defense tasks, while an increase in the costs of work task has the opposite effect, only results for the variation of xh were included in the manuscript to avoid redundancy. Added to Table 1: “To maintain conciseness, further exploration of the parameter landscape was not included in the manuscript”.

      (42) L 270: "in eusocial insects often characterized by high relatedness and reproductive inhibition, sterile workers acquire fitness benefits only indirectly". This is misleading. Sterile workers of any taxa, be it insects or vertebrates, can only acquire fitness benefits indirectly as they are sterile, but eusocial insects involve not only sterile workers.

      Rephrased to “In contrast, in eusocial species characterized by high relatedness and permanent worker sterility, such as most eusocial insects, workers acquire fitness benefits only indirectly”. In any case, permanent sterility only occurs in eusocial invertebrates; in vertebrates with reproductive inhibition, sterility is only temporary and context dependent. Therefore, in vertebrates, sterile workers may potentially obtain direct fitness benefits if the social context changes, as is the case in naked mole-rats.

      (43) L 273: "Group members in eusocial species are therefore predicted to maximize colony fitness due to the associated lower within-group conflict". Again, this is incorrect. Primitively eusocial insects have high conflict.

      We added “Group members in such eusocial species” to clarify that we are not referring here to primitively eusocial species but those with permanent sterile workers.  

      (44) L 277: "when the benefits of cooperation are evenly distributed among group members". In this model, the benefits of cooperation are not evenly distributed among group members: breeders reproduce, but subordinates don't.

      Subordinates may reproduce if they become breeders later in life. However, subordinates also benefit from cooperation as subordinates directly (greater survival in larger groups), and indirectly if they are related to the breeder. Here we refer to the first one, and we expand on that in the following sentence.  

      (45) L 280: "survival fitness benefits derived from living in larger groups seem to be key for the evolution of cooperative behavior in vertebrates [22, 63], and may also translate into low within-group conflict. This suggests that selection for division of labor in vertebrates is stronger in smaller groups". I don't see how the previous sentence suggests this. The paper does not present results to support this statement (i.e., no selection gradients in smaller vs larger groups are shown).

      The benefits of living in a larger group entail diminishing returns, so individuals living in smaller groups benefit more from an increase in productivity and group size than those in larger groups.

      (46) L 284: "Our model demonstrates that vertebrates evolve a more stable division of labor". Where is that shown? How is "more stable" measured?

      Rephrased to “vertebrates are more likely to evolve division of labor”. This is shown in Figure 2, which illustrates that division of labor evolves across a wider range of environmental conditions and to a higher degree (intermediate values of T).

      (47) L 287: "direct fitness benefits in the form of group augmentation select more strongly for defensive tasks". Where is that shown? Establishing this would entail comparing selection gradients with direct fitness benefits of group augmentation and without them.

      In Figure 2, when we compare the GA column to the KS+GA column, we see that at equilibrium, more helpers choose defense tasks, especially when they are free to choose their preferred task (circles).

      (48) L 288: "kin selection alone seems to select only for work tasks." Again, this may be an artifact of the model assuming that helpers cannot increase non-breeders' fitness components except via group augmentation, and that defense tasks are inherently more costly than work tasks.

      As stated previously, we are studying task specialization in cooperative breeders where help is in the form of alloparental care (from allofeeding and egg care to defense from predators). We also assume that the costs are different, but whether one or the other is more costly depends on the relative context (e.g., a task can be more costly if it affects competitiveness in a very competitive environment). It is important to note that we name these tasks “work” and “defense” for practical reasons, but the focus of the paper is on tasks with different fitness costs that, because of their characteristics, may not fit so well under this terminology. While we acknowledge that most tasks have both kinds of fitness costs to a degree, here we focus on the main fitness costs of each kind of task (L430-436).

      (49) L 290: "are comparatively large". This sounds as if the tasks are large, which is presumably not what is meant.

      Rephrased to “costs to dominance value and to the probability of attaining a breeding position are comparatively larger than survival costs.”

      (50) L 298: "helpers are predicted to increase defensive tasks with age or rank, whereas in harsh environments, work tasks are predicted to increase with age or rank." Add parentheses referring to where this is shown.

      This is shown in Figure 3, but since this is described in the discussion, we did not add a reference to the figure. If the editor would like us to refer to figures here, we can (see also comments below relating to the same issue).

      (51) L 308: "the role of age and environmental harshness on the evolution of division of labor". What is the prediction? Simply, the role of age is an assumption, not a prediction.

      Rephrased to “the role of environmental harshness on the evolution of division of labor via age-dependent task specialization”.

      (52) L 315: "individuals shifting from work tasks such as foraging for food, digging, and maintaining the burrow system, to defensive tasks such as guarding and patrolling as individuals grow older and larger". Say in parentheses where this is predicted.

      This prediction comes from Figure 3, we do not reference it here since we are in the Discussion section.  

      (53) L 320: "Under these conditions, our model predicts the highest levels of task partitioning and division of labor." Where is this predicted? Add parentheses referring to where this is shown. As it is, it is not possible to check the validity of the statement.

      This prediction comes from Figure 2 column KS+GA, we do not reference it here since we are in the Discussion section. The results with references to the figures are found under the Results section. In the discussion, we reiterate the results already described and add some examples from real data that seem to confirm our predictions.  

      (54) L 322: "In line with our model predictions, larger and older helpers of this species invest relatively more in territory maintenance, whereas younger/smaller helpers defend the breeding shelter of the dominant pair to a greater extent against experimentally exposed egg predators". These predictions are neat, but are now very difficult to understand from the figures. Maybe at the bottom of 3A, you could add a diagram work->defense for negative gamma_t and defense>work for positive gamma_t (or whatever order it is).

      Done.

      (55) L 325: "Territory maintenance has been shown to greatly affect routine metabolic rates and, hence, growth rates [80], which directly translates into a decrease in the likelihood of becoming dominant and attaining breeding status, as predicted by our model." This seems to be an assumption, not a prediction.

      That is true. We removed: “as predicted by our model”.  

      (56) L 352: "controlled". This means something else.

      Changed to “addressed”.

      (57) L 356: "summary, our study represents the first theoretical model aimed at elucidating the potential mechanisms underlying division of labor between temporal non-reproductives via task specialization in taxa beyond eusocial organisms". Again, claiming to be the first is risky and unnecessary.

      Rephrased to “our study helps to elucidate”.

      (58) L 358: "Harsh environments, where individuals can obtain direct fitness benefits from group living, favor division of labor, thereby enhancing group productivity and, consequently, group size." I'm not sure about this conclusion as harsh environments (large m in Figure 2) also involve the evolution of no division of labor (from the triangles and circles that are zero in the right bottom panel) and perhaps more so than with less harsh environments (intermediate m). Incidentally, in the bottom right panel of Figure 2, do the two separate clusters of triangles and circles mean that there is some sort of evolutionary branching?

      Yes, there are two different equilibria for the same set of conditions. Although it is true that for m=0.3 less division of labor evolves when kin selection and group augmentation act together, this is not the case when only group augmentation takes place. In addition, we qualify m=0.2 as harsh, as opposed to benign (m=0.1), in which we observe the rise of habitat saturation. m=0.3 is then an extremely harsh environment, in which several parameter combinations cause population collapse (see figures in the Supplemental Material).

      (59) L 360: "Variation in the relative fitness costs of different helping tasks with age favors temporal polyethism". I don't see that this has been shown. Temporal polyethism evolves here whenever gamma_t evolves non-zero values. Figure 3A shows that non-zero gamma_t evolves with harsher environments, but I don't see what the "variation in relative fitness costs of different helping tasks" refers to.

      The evolved reaction norms of the model are towards different fitness costs depending on the task performed, since this is how we define the different types of tasks in the model.  

      (60) L 382: "undefined". Say variable. Undefined is something else.

      Undefined is more accurate, since we did not define how many subordinates there were per group, while “variable” could have been defined within a range, which was not the case in this model.  

      (61) L 390: "each genetic locus". Say earlier that each genetic trait is controlled by a single locus.

      Added.  

      (62) L 395: "complete" and "consistent" -> "certain".

      We changed one to “certain” and another to “absolute” to avoid using the same adjective twice in a sentence.  

      (63) L 396: What determines whether dispersers become subordinates or floaters? A trait? Or a fixed probability?

      We added “which is also controlled by the same genetic dispersal predisposition as for subordinates”.

      (64) L 412-413: "cycle". This should be a breeding step.

      Changed to “season” instead.

      (65) L 418: Say negatively impacts (it could also be positively impacts, which I guess is not what you mean).

      Done.

      (66) L 425: "a sample of floaters". Chosen how?

      Added “randomly drawn”.

      (67) L 426-428. But the equation in Table 1 indicates that all floaters compete for breeding spots, not a sample of floaters. This is not clear.

      The number of floaters sampled to try to breed at a given group is N<sub>f,b</sub> = f · N<sub>f</sub> / N<sub>b</sub> (Table 1).

      Therefore, N<sub>f,b</sub> is the sample size of floaters for a given open breeding position, and f is how many groups on average a floater attempts to access in each time step.  
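
      A minimal sketch of this sampling step, assuming that N<sub>b</sub> counts breeding positions (groups), that the sample size is rounded to a whole number, and that floaters are drawn uniformly at random without replacement. The function names are hypothetical and are not taken from the model code.

```python
import random


def floaters_per_vacancy(f: float, n_floaters: int, n_breeding_positions: int) -> int:
    """Sample size N_f,b = f * N_f / N_b, rounded to a whole number (rounding is an assumption)."""
    if n_breeding_positions <= 0:
        raise ValueError("n_breeding_positions must be positive")
    return round(f * n_floaters / n_breeding_positions)


def sample_floaters(floaters: list, f: float, n_breeding_positions: int, rng: random.Random) -> list:
    """Randomly draw the floaters that compete for one open breeding position."""
    k = floaters_per_vacancy(f, len(floaters), n_breeding_positions)
    # Never ask for more floaters than exist.
    return rng.sample(floaters, min(k, len(floaters)))
```

      With f = 2, 300 floaters, and 100 breeding positions, each vacancy would be contested by a random sample of six floaters rather than by the whole floater pool.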

      (68) L 432. In the figure, the breeding cycle is called a step, but here it is called a cycle. There should be a single term used throughout. Breeding is not really a cycle here (it doesn't involve multiple steps that are repeated cyclically), so it seems more appropriate to call this breeding steps or breeding seasons.

      Taking previous comments into account, we changed the terms “generation” and “life cycle” to “breeding cycle”. We added “or seasons”.

      (69) L 439: "generations". What are generations here, as generations are overlapping? You probably mean time steps or something else.

      Changed to “breeding cycles”.

      (70) L 439: "equilibrium was reached". Presumably, equilibrium is reached only asymptotically, so some cutoff is implemented in practice. So maybe say explicitly what cutoff was implemented.

      As mentioned, we ran the model for 200’000 time steps, and if the phenotypic values had not reached equilibrium, we ran the model for longer, with 400’000 time steps being the maximum needed for all simulations to reach equilibrium. In some cases, genetic values did not reach equilibrium in ranges where they had no impact on phenotypic values, so these were disregarded when assessing whether equilibrium was reached.

      (71) L 452: "Even though individuals are likely to change the total amount of help given throughout their lives". Do you mean in real organisms or in the model? Say which. If it is in the model, it is not clear how.

      We added “in nature” to clarify that this was not the case in the model.  

      (72) L 455: "For more details on how individuals may adapt their level of help with age and social and environmental conditions, see [63]." Do you mean real individuals or in the model? Again, if it is in the model, it is unclear how this is possible and should be explained in this paper at least briefly rather than citing another one.

      We rephrased it to “How individuals in the model may adapt their level of help with age and social and environmental conditions has been described elsewhere.” We do not go into detail here because it is not within the scope of the paper, and those results have been described elsewhere.  

      (73) L 475: "helpers". Make terminology consistent throughout.

      All helpers are subordinates, but not all subordinates are helpers, as they may evolve no help. Since here we are describing those subordinates that do help, we use that terminology. We added “subordinate helpers” to clarify this further.  

      (74) L 476: "proportional". The dependence in Equation 1 is not "proportional to". Say something like "a survival probability (not rate) that decreases with the amount of help provided".

      Done.

      (75) L 482: "environmental"-> baseline, as defined first.

      Done.

      (76) L 486: "benefits". Can you briefly say in parentheses what those benefits are in real organisms? As in line 475, where you reminded the reader of survival costs due to predator defense.

      Added “such as those offered by safety in numbers or increased resource defense potential”.

      (77) L 494. "we first outline a basic model in which individuals". It is not clear what this sentence says, and the remainder of this section does not clarify it.

      We made two models for comparison, one where individuals can choose freely which task they prefer to perform, and another in which there is an increase in productivity when both kinds of tasks are performed to a similar extent at group level. In the latter model, individuals may choose an unpreferred task at certain times during their lives to increase the effect of the help provided on the breeder’s (and group’s) productivity.

      We rephrased this section to “we first outline a basic model where individuals evolve their preferred helping task. Then we compare this to another model in which the breeder’s reproductive outcome is maximized when the group’s helping effort in each kind of tasks is performed to a roughly equal degree.”

      (78) L 496: "by performing both tasks". Sounds as if the breeder performs both tasks, not helpers.

      We changed to “when the group’s helping effort in each kind of tasks”.

      (79) L 497: "the maximum amount of cumulative help of each type (sigma Hmax) that can affect fecundity is given by Eq. 4:" This statement is imprecise. Presumably, what is meant is that this level of help maximises breeder productivity, as stated earlier in the paper. However, there is no proof that this level of help maximises breeder productivity, so this expression seems unjustified and it is unclear how it is used.

      This is a description of the model setup. As described later in the same section, the cumulative help of each type that can influence the breeder’s fecundity is capped at Hmax. Therefore, it does represent the maximum amount of cumulative help of each type that can affect the breeder’s fecundity.

      (80) L 500: "reproduced" -> "reproduce".

      Done.  

      (81) L 503. Say here what K is so that the reader knows what equation 5 is showing.

      Added “K” to the “The quantity of offspring produced (K)”.

      (82) L 503: "diminishing returns" -> "diminishing returns as help increases".

      Done.  

      (83) L 507: Why these inequalities?

      These inequalities explain the use of Hmax (response to comment 79). We rephrased it to “the cumulative defense effort is larger than or the cumulative work effort is larger than ”.

      (84) L 526: "removing the influence of relatedness from the model". It would be helpful to plot relatedness in this and the other scenario to check that it is indeed low here and high in the other.

      The actual values of relatedness are provided in the Supplemental Material Table S1. We added this reference to Figure 2.  

      (85) L 528: "It is possible that direct and indirect fitness benefits could have an additive effect on the evolution of alloparental care". This is technically incorrect. It is also unclear what the point of this sentence is.

      We have removed this sentence.  

      (86) Table 1: Say what are the allowed values for these genotypic traits (can they take negative values, be greater than one, are they continuous or discrete?): e.g., alpha \in [0,1] or alpha \in (-infinity, infinity). For phenotypic traits, it would be helpful if the third column lists the equation where the trait is defined. As the variables in the first column are scalars, they should not be bold face. Survival "rate" should be survival "probability" throughout.

      All genetic traits can take any real number (-infinity, infinity), but the phenotypic values are either constrained by the equation like for logistic formulas, or manually constrained like for dispersal propensity or help (only positive numbers allowed). We added “Each genetic trait is controlled by a single locus, and may take any real number” (L403), and added the boundaries for help and dominance value in Table 1. We decided against including the equations in the table due to space constraints. We removed the bold face as suggested. We changed all instances of “survival rate” to “survival probability”.

      (87) Figures S1, S2: I don't recall seeing references to these figures in the main text, but there should be, as well as for Tables S1-S3.

      Table S1 is now referenced in Figure 2. The other figures are now referenced in the main text when we reference the different sections in the Supplemental Materials (L190 and L198). Other Tables are referenced in their respective Figures in the SI.

    1. Author response:

      The following is the authors’ response to the original reviews.

      We thank all reviewers for their thorough and thoughtful comments. We have carefully addressed each point raised, conducting new experiments and analyses to strengthen the manuscript. Below is a summary:

      · Synchronous ensembles in new experiments: New experiments demonstrated synchronous ensembles during immobility in a novel environment (Figure 3-figure supplement 2) and revealed a significant reduction in such synchrony following familiarization training (Figure 4D).

      · Ripple-associated activity: We detected a much larger number of ripple events to confirm (a) the suppression of CA1PC spiking during ripples (Figure 4Ai) and (b) that synchronous ensembles mostly occur outside ripples (Figure 3-figure supplement 3). Additionally, spiking suppression was accompanied by decreased subthreshold membrane potentials (Figure 4Bi, Ci). Ripple-associated spiking and membrane potential dynamics shifted toward higher firing rates and more depolarization after familiarization training (Figure 4).

      · Public data analysis: Analysis of publicly available data identified theta-associated synchronous ensembles, demonstrating the generalizability of our findings across different experimental conditions (Supplementary Figure 5).

      · Neuron morphology and algorithm validation: Images of recorded neurons after experiments confirmed their intact morphology. We also provided details on validating spike detection algorithms (Methods and Supplementary Figure 1).

      · Cell soma locations: New data and analyses illustrate the distribution of cells labeled at different embryonic days along the radial axis of the pyramidal layer (Supplementary Figure 1).

      · Analyses testing the robustness of synchronous ensembles: Additional analyses examined the impact of complex bursts and theta-phase locking, confirming the robustness of synchronous ensemble detection (Supplementary Figures 3 and 4).

      · Additional analyses and figures: We conducted further analyses and created new figures to address all remaining concerns (Response to Reviewer Figures 1-6).

      We believe these revisions have significantly enhanced the paper, and we sincerely thank all reviewers for their invaluable feedback.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      For many years, there has been extensive electrophysiological research investigating the relationship between local field potential patterns and individual cell spike patterns in the hippocampus. In this study, using state-of-the-art imaging techniques, they examined spike synchrony of hippocampal cells during locomotion and immobility states. In contrast to conventional understanding of the hippocampus, the authors demonstrated that hippocampal place cells exhibit prominent synchronous spikes locked to theta oscillations.

      Strengths:

      The voltage imaging used in this study is a highly novel method that allows recording not only suprathreshold-level spikes but also subthreshold-level activity. With its high frame rate, it offers time resolution comparable to electrophysiological recordings. Moreover, it enables the visualization of actual cell locations, allowing for the examination of spatial properties (e.g., Figure 4G).

      We thank the reviewer for recognizing the strength of our study.

      Weaknesses:

      There is a notable deviation from several observations obtained through conventional electrophysiological recordings. Particularly, as mentioned below in detail, the considerable differences in baseline firing rates and no observations of ripple-triggered firing patterns raise some concerns about potential artifacts from imaging and analysis, such as cell toxicity, abnormal excitability, and false detection of spikes. While these findings are intriguing if the validity of these methods is properly proven, accepting the current results as new insights is challenging.

      We appreciate the reviewer’s insightful comments regarding the apparent deviation of our observation from conventional understanding, which we address in the following sections.

      Reviewer #1 (Recommendations For The Authors):

      (1) I am not particularly inclined to strongly adhere to conventional insights, but the findings obtained through this imaging method seem significantly different from those known from conventional electrophysiological recordings. For instance, there are noticeable differences in several basic firing characteristics. First, the average firing rates of 2.3-4.3 Hz (Line 97) appear higher than the distribution of firing frequencies reported in many electrophysiological recordings of pyramidal cells (e.g., Mizuseki et al., Cell Rep, 2013).

      We understand that some of our findings differ from conventional insights. However, it is important to emphasize that many of our observations align closely with prior electrophysiological recordings. For instance, individual neurons in our study exhibit expected modulation by locomotion, spatial locations, novelty, and theta oscillations, all of which are hallmarks of normal hippocampal physiology.

      Regarding the firing rates, it is important to highlight the heterogeneity of the firing rates, which range from 0.01 to 10 Hz, with a skewed distribution toward lower frequencies(1). While our values (2.3-4.3Hz) are higher than those reported by Mizuseki et al. (2013)(1) in rats, our recordings were obtained from mice and aligned with studies using mice, including firing rates of 2.1 Hz reported by McHugh et al. (1996) and 2.4-2.6 Hz by Buzsaki et al. (2003)(2,3).

      In addition, our recordings were performed in a novel environment, which is known to enhance the firing rates of the hippocampal neurons(4). Consistent with this, our new recordings in a familiar environment demonstrate significantly lower firing rates (see below).

      Results (line 279)

      “Mean firing rates were significantly reduced in the familiar group compared to the novel group (Familiar group: 1.1 to 5.2 Hz (25<sup>th</sup>-75<sup>th</sup> percentiles), median=2.3 Hz, n=66 cells, 6 sessions, 4 mice; Novel group: 1.7 to 6.0 Hz (25<sup>th</sup>-75<sup>th</sup> percentiles), median=4.2 Hz, n=111 cells, 6 sessions, 6 mice, p=0.0083, Wilcoxon signed-rank test).”

      Second, while this finding suggests that spike synchrony is entirely unrelated to ripple-triggered events, it is indeed difficult to comprehend (researchers who have analyzed electrophysiological data, at the very least, should have experienced some degree of correlation between ripples and spikes).

      We thank the reviewer for raising this important point. We, too, found it surprising that population synchrony appears largely unrelated to ripples. To ensure the robustness of this observation, we conducted new experiments under conditions optimized for ripple detection to (a) confirm that the lack of positive correlation is also observed under conditions where we can detect more ripples and (b) demonstrate that our imaging methods can detect a higher correlation between ripples and spikes in a familiar environment (see details below).

      Results (line 251)

      “It was puzzling that these CA1PCs exhibited robust spiking activities outside of ripples yet generated few spikes during ripples. To further investigate neuronal activities during ripples, we established a recording condition that allowed us to capture more ripple episodes. Specifically, we immobilized mice in a tube to promote behaviors favoring ripple generation. The mice were habituated to head fixation in a tube in a room distinct from the one where imaging experiments were conducted. On the imaging day, the mice were introduced to the recording room and head-fixed under the microscope for the first time.

      CA1PCs were labeled in utero on embryonic day (E) 14.5 (n=56 cells from 3 sessions in 3 mice) and E17.5 (n=55 cells from 3 sessions in 3 mice) and imaged in adult brains. Both neuronal populations exhibited prominent peaks in their grand average CCGs and significantly higher synchronous event rates compared to jittered data (Figure 3-figure supplement 2A, B). Approximately 40% of the recorded neurons participated in synchronous ensembles, indicating robust synchronous activity involving a substantial proportion of the recorded cells (Figure 3-figure supplement 2C).

      In total, 1052 synchronous ensembles and 174 ripple episodes were detected across these imaging sessions. Consistent with findings from walking animals, few synchronous ensembles occurred during ripples when animals were immobilized in a tube (Figure 3-figure supplement 3A, B). Moreover, no distinguishable ripple oscillations were observed in synchronous events, and the average firing rates during ripple episodes were near zero (Figure 3-figure supplement 3C, D). At the single-cell level, 90% of neurons showed significant negative spiking modulation during ripples, with ripple modulation indexes close to -1, indicating strong suppression of spiking (Figure 4Ai). This suppression extended to subthreshold membrane potentials, as nearly all cells exhibited decreased fluorescence during ripples compared to baseline (Figure 4Bi, Ci). These results demonstrate that spiking activity and subthreshold membrane potentials are robustly suppressed during ripples.

      Contextual novelty plays a critical role in shaping hippocampal neuronal activities. To assess its influence, we trained mice to become familiar with the imaging procedure and the recording environment over five days and recorded CA1PC activities on the final day. Mean firing rates were significantly reduced in the familiar group compared to the novel group (Familiar group: 1.1 to 5.2 Hz (25<sup>th</sup>-75<sup>th</sup> percentiles), median=2.3 Hz, n=66 cells, 6 sessions, 4 mice; Novel group: 1.7 to 6.0 Hz (25<sup>th</sup>-75<sup>th</sup> percentiles), median=4.2 Hz, n=111 cells, 6 sessions, 6 mice, p=0.0083, Wilcoxon signed-rank test). Additionally, 15% of the neurons in the familiar group exhibited significantly positive spiking modulation by ripples, while fewer cells showed negative modulation compared to the novel group (Figure 4A). During ripples, neurons in the novel group predominantly displayed hyperpolarizing membrane voltage responses, whereas a subset of neurons in the familiar group exhibited prominent depolarizing responses (Figure 4B). The mean fluorescence changes in the familiar group shifted toward depolarization compared to the novel group (Figure 4C). Finally, synchronous event frequencies were significantly lower in the familiar context, indicating weaker synchronous activities under familiar conditions (Figure 4D). These results demonstrate that hippocampal neuronal activities, particularly synchronous ensembles, are strongly influenced by contextual novelty.”
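
      The quoted Results describe ripple modulation indexes close to -1 but do not restate how the index is computed. The sketch below assumes a commonly used normalized-difference form, (rate during ripples - baseline rate) / (rate during ripples + baseline rate), which equals -1 when a cell is silent during ripples. The definition used in the manuscript may differ, and the function name is hypothetical.

```python
def ripple_modulation_index(rate_in_ripple_hz: float, rate_baseline_hz: float) -> float:
    """Normalized difference of firing rates, bounded between -1 and +1.

    -1: the cell fires only outside ripples (complete suppression).
     0: ripples do not change the firing rate.
    +1: the cell fires only during ripples.
    This definition is an assumption, not taken from the manuscript.
    """
    total = rate_in_ripple_hz + rate_baseline_hz
    if total == 0:
        # A cell with no spikes at all carries no modulation information.
        return 0.0
    return (rate_in_ripple_hz - rate_baseline_hz) / total
```

      Under this definition, a cell with a 4.2 Hz baseline rate that is silent during ripples has an index of exactly -1, matching the description of strong suppression.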

      Third, the fact that more than 40% of cells frequently exhibit synchronous firing other than during ripples has not been reported before, and if it were the case, many electrophysiologists would have likely noticed it. Overall, the excitability of cells seems too high.

      We thank the reviewer for raising this point. As discussed above, the reported spike rates are within the range expected from the previous electrophysiology recordings in mice, especially given that we record cells in a novel environment. In addition, our jittering procedure ensures that the observed synchrony exceeds what could be expected from the given level of spike rates alone. These analyses support the robustness of our observations.
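
      The logic of a jitter-based control can be sketched as follows. Spike times of one cell are shifted by small random offsets, which preserves its firing rate on slow timescales while destroying millisecond-scale alignment, and the observed number of coincident spikes is compared with the surrogate distribution. The coincidence window, jitter range, number of surrogates, and function names are illustrative assumptions, not the parameters or code used in the manuscript.

```python
import random
from bisect import bisect_left, bisect_right


def count_coincidences(spikes_a, spikes_b, window_s):
    """Count spikes in A that have at least one spike in B within +/- window_s."""
    b_sorted = sorted(spikes_b)
    n = 0
    for t in spikes_a:
        lo = bisect_left(b_sorted, t - window_s)
        hi = bisect_right(b_sorted, t + window_s)
        if hi > lo:
            n += 1
    return n


def jitter_test(spikes_a, spikes_b, window_s=0.005, jitter_s=0.075,
                n_surrogates=1000, seed=0):
    """Return (observed coincidences, one-sided p-value against jittered surrogates)."""
    rng = random.Random(seed)
    observed = count_coincidences(spikes_a, spikes_b, window_s)
    at_least_as_large = 0
    for _ in range(n_surrogates):
        # Shift every spike of B independently; the overall rate is unchanged.
        jittered = [t + rng.uniform(-jitter_s, jitter_s) for t in spikes_b]
        if count_coincidences(spikes_a, jittered, window_s) >= observed:
            at_least_as_large += 1
    # Add-one correction so the p-value is never exactly zero.
    p_value = (at_least_as_large + 1) / (n_surrogates + 1)
    return observed, p_value
```

      Because the surrogates keep each cell's spike count, a small p-value indicates synchrony beyond what the firing rates alone would produce.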

      As mentioned below, there are concerns about experimental artifacts and analytical issues from optical imaging.

      (2) Method: In surgery, the cortical tissue above the hippocampus was aspirated, which is a general method for in vivo calcium imaging from the hippocampus. Furthermore, they use a CAG promoter to express the sensors. To my knowledge, this promoter is excessively strong and may sometimes be toxic to cells. In addition, for imaging, they use DMSO and Pluronic F-127, which are relatively toxic materials (please describe their concentrations). These conditions might be damaging to hippocampal neurons.

      We thank the reviewer for raising these comments. As the reviewer mentioned, cortical aspiration is a general method for in vivo imaging from the hippocampus and has been employed in numerous studies, including behavioral and systems-level investigations(5-15). For example, place cells are routinely recorded in both familiar and novel environments using this method and other approaches. Additionally, synchronous population activities have been observed and studied in the hippocampus both with and without cortical aspiration(6,15-18). These findings demonstrate that the hippocampal neuronal network generates place cells and synchronous activities regardless of whether the cortical tissue above it has been aspirated.

      DMSO and Pluronic F-127 are used as solvents for dissolving the JF<sub>552</sub>-HaloTag ligand, and the resulting solution is injected into the bloodstream rather than directly into brain tissue. The concentrations of these reagents in the dye solution are now described in the text (see below). Assuming a blood volume of 2 ml in adult mice, the final concentrations of DMSO and Pluronic F-127 in the bloodstream are estimated to be 1% upon injection and then decrease rapidly as they are metabolized and excreted from the body. Moreover, the effective concentrations in the brain tissue would be even lower. These low concentrations have been demonstrated to have minimal impact on cells and tissue(19-22).

      Methods (line 616)

      “JF<sub>552</sub>-HaloTag ligand (a generous gift from Dr. Luke Lavis) was first dissolved in DMSO (20 μl, Sigma) and then diluted in Pluronic<sup>TM</sup> F-127 (20 μl, P3000MP, Invitrogen) and PBS to achieve a final concentration of 0.83 mM of JF<sub>552</sub>-HaloTag ligand. The solution was then injected intravenously through the retro-orbital sinus. Imaging sessions were initiated 3 hours after the injection of the JF<sub>552</sub>-HaloTag ligand.”
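
      The 1% estimate can be reproduced with a short calculation, assuming that each 20 μl aliquot is treated as pure reagent, that blood volume is 2 ml, and that the small injected volume is neglected. The function name is hypothetical.

```python
def percent_in_blood(reagent_ul: float, blood_ml: float) -> float:
    """Volume/volume percentage of a reagent immediately after injection.

    Neglects the injected volume itself and any clearance (both are assumptions).
    """
    blood_ul = blood_ml * 1000.0
    return 100.0 * reagent_ul / blood_ul


dmso_percent = percent_in_blood(20.0, 2.0)      # 20 ul DMSO in 2 ml blood
pluronic_percent = percent_in_blood(20.0, 2.0)  # 20 ul Pluronic F-127 in 2 ml blood
```

      Both values come out at about 1% (20 μl in 2000 μl), which is an upper bound at the moment of injection under these assumptions.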

      We understand that the CAG promoter may sometimes be toxic to cells if it drives high expression. However, it is important to note that we injected highly diluted virus (20x, final titer: 2.7x10<sup>12</sup> GC/ml) to avoid excessive expression levels. This titer was determined from serial dilution experiments to ensure an optimal expression level free from toxicity (see below). The same titer was used in a previous study(23) to label CA1 interneurons, which exhibited physiological spike rates and synchrony (see Abdelfattah 2023, Neuron, Figure 8). Furthermore, Voltron expression does not significantly affect key cellular properties, including membrane resistance, membrane capacitance, resting membrane potentials, spike amplitudes, and spike width (see Abdelfattah 2019, Science, Supplementary Figures 11 and 12). In our recordings, individual neurons exhibit the expected modulation by locomotion, spatial locations, novelty, and theta oscillations. We now include images of the recorded neurons to demonstrate their intact morphology and healthy appearance following imaging experiments (Supplementary Figure 1A, B), further supporting minimal cytotoxic effects.

      Methods (line 577)

      “A serial dilution experiment was conducted to determine an optimal titer of the virus carrying Voltron2 genes, minimizing cell toxicity, for use in this and in previous imaging experiments. A fine injection pipette (tip diameter 10-60 μm) was used to inject AAV2/1-CAG-flex-Voltron2-ST (2.7x10<sup>12</sup> GC/ml, a generous gift from Dr. Eric Schreiter and the GENIE team at HHMI Janelia Research Campus) into the exposed regions at a depth of 200 μm (up to six injection sites and 100-200 nL of viral suspension).”

      (3) Another concern is the relatively low number of cells simultaneously recorded during imaging compared to typical hippocampal imaging such as Inscopix which often records several hundred cells. In this study, however, this number is 20 or fewer. This is likely because the visualized cells at baseline were limited to this extent. It is possible that these cells represent particularly too strong sensor expression, which may facilitate visualization and high signal-to-noise ratio in voltage imaging. Consequently, there is a possibility of abnormal activity occurring in these cells.

      The Inscopix studies use calcium imaging, which has a temporal resolution that is too slow to resolve fast synchrony central to our study. To enable high-speed voltage imaging at 2000 frames per second, we employed strategies to achieve sparse labeling and carefully limited the number of labeled cells to minimize out-of-focus contamination. In our analysis, we applied a criterion to include only cells separated by 70 μm or longer, reducing the potential for channel cross-talk among nearby neurons. These criteria limited the number of simultaneously imaged cells in our experiments. To address this issue, we have now included new data from 12 additional animals with 177 neurons to support our findings.

      Furthermore, despite the limited number of simultaneously imaged cells, population synchrony beyond what could be expected by chance can be detected using rigorous statistical procedures. As discussed earlier, neuronal activities were within the expected range; they were modulated by animals’ locomotion (Figure 2 and Supplementary Figure 2), exhibited place tuning, and were significantly reduced when the recording context became familiar, supporting the normal physiology of the recorded cells.

      (4) Analysis: There are some criteria for detecting spikes (described in the Methods), but there are concerns about whether these criteria truly extract only spike activity. When examining the traces in Figure 1 and Figure 2, there appear to be some activities that show fluorescence increases up to the level of putative spikes. How can we determine that these are indeed subthreshold changes? Conversely, some activities detected as spikes may also be subthreshold synaptic potential (this possibility concerns me). There is a need for more precise validation of spike detection analysis to ensure its accuracy.

      Regarding spike detection, we used validated algorithms(23-25) to ensure robust and reliable spike identification. Spiking activity was first separated from slower subthreshold potentials using high-pass filtering. This approach prevents slow fluorescence increases from being misinterpreted as spikes, even if their amplitude is large. We benchmarked this detection algorithm in our recent publication (Huang et al., 2024)(24), demonstrating its high sensitivity and specificity in spike detection (see the figure below). While we acknowledge that a small number of spikes, particularly those occurring later in a burst, might be missed due to their smaller amplitudes (as illustrated in Figures 1 and 2 of the manuscript), we anticipate that any missed spikes would lead to a decrease, rather than an increase, in synchrony between neurons. Overall, we are confident that spike detection is performed in a rigorous and reliable manner.

      Method (line 670)

      “Previous studies have described and validated the procedure for imaging preprocessing and spike detection. In short, the fluorescence intensities of individual neurons were calculated by averaging the fluorescence intensities of pixels from the same ROIs. Bleaching was corrected by calculating the baseline fluorescence (F<sub>0</sub>) at each time point as an average of the fluorescence intensities within ±0.5 seconds around the time point. The dF/F was calculated as the F<sub>0</sub> minus the fluorescence intensity of the same time point divided by F<sub>0</sub>. Positive fluorescence transients were detected to identify spikes from the high-passed dF/F traces created by subtracting the dF/F traces from the median-filtered version with a 5-ms window. To simulate the noise of recordings, high-passed dF/F traces were inverted, and the amplitudes of the transients detected from the inverted traces were used to construct a noise distribution of the spike amplitudes. A threshold was set by comparing the amplitudes of the detected transients with the noise distribution of the spike amplitudes to minimize the sum of type I and type II errors. Spikes were first detected when transients were larger than the threshold. Then, spike amplitudes smaller than half of the top 5% spike amplitudes were excluded. The signal-to-noise ratio (SNR) was calculated for each neuron as a ratio of the averaged spike amplitudes over the standard deviation of the high-passed dF/F traces, excluding points 2 ms before and 4 ms after each detected spike to estimate the quality of the recordings.”
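The preprocessing steps quoted above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the spike threshold is passed in by hand rather than derived from the inverted-trace noise distribution, and the high-passed trace is taken as dF/F minus its 5-ms median-filtered version so that spikes remain positive-going.

```python
import numpy as np

def moving_mean(x, half_width):
    """Centered moving average; the window shrinks at the trace edges."""
    csum = np.concatenate(([0.0], np.cumsum(x)))
    idx = np.arange(len(x))
    lo = np.maximum(idx - half_width, 0)
    hi = np.minimum(idx + half_width + 1, len(x))
    return (csum[hi] - csum[lo]) / (hi - lo)

def moving_median(x, width):
    """Centered moving median with edge padding (window forced to odd length)."""
    half = width // 2
    padded = np.pad(x, half, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, 2 * half + 1)
    return np.median(windows, axis=1)

def highpassed_dff(f, fs):
    """dF/F = (F0 - F) / F0 with F0 averaged over +/-0.5 s, then high-passed."""
    f = np.asarray(f, dtype=float)
    f0 = moving_mean(f, int(round(0.5 * fs)))
    dff = (f0 - f) / f0
    # Assumption: subtract the 5-ms median-filtered trace from dF/F.
    return dff - moving_median(dff, int(round(0.005 * fs)))

def detect_spikes(hp, threshold):
    """Indices of local maxima above a user-supplied threshold (assumption)."""
    mid = hp[1:-1]
    is_peak = (mid > threshold) & (mid >= hp[:-2]) & (mid > hp[2:])
    return np.flatnonzero(is_peak) + 1

def spike_snr(hp, spikes, fs):
    """Mean spike amplitude over the SD of the trace outside spike windows."""
    keep = np.ones(len(hp), dtype=bool)
    before, after = int(round(0.002 * fs)), int(round(0.004 * fs))
    for s in spikes:
        keep[max(s - before, 0): s + after + 1] = False  # 2 ms before, 4 ms after
    return hp[spikes].mean() / hp[keep].std()
```

On a synthetic trace sampled at 2000 Hz, `highpassed_dff` followed by `detect_spikes` recovers brief fluorescence dips as positive peaks, and `spike_snr` then summarizes the recording quality per neuron.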

      (5) If the authors aim to establish this new physiological phenomenon, it is necessary to compare it with electrophysiological data or verify if similar phenomena can be detected from electrophysiological data. Recently, various datasets have been made publicly available (e.g. CRCNS and Mendeley data), and these should be easily verifiable without the need for conducting experiments.

      We thank the reviewer for the suggestion. To address this, we analyzed a publicly available dataset (hc-11 on CRCNS), which contains hippocampal recordings from rats navigating novel mazes for water rewards. Using our algorithm, we detected significant population synchrony in the dataset (Supplementary Figure 5A). The synchronous event rates were 6.4-fold higher than those in jittered controls, demonstrating the reliability of our findings.

      Additionally, these synchronous events mostly occurred in the absence of ripples and were coupled to theta oscillations (Supplementary Figure 5B-D). These results not only validate our findings using independent datasets but also highlight the generalizability of synchronous ensembles as a distinct network phenomenon relevant to hippocampal function.

      Results (line 366)

      “To further investigate synchronous ensembles across different datasets, we analyzed publicly available hippocampal recordings ‘hc-11’ from the CRCNS repository, where rats navigated novel mazes for water rewards (see Method). Using our algorithm, we identified a significant number of synchronous ensembles during the first three minutes of novel navigation. On average, the rates of synchronous events were 6.4-fold higher than those detected in jittered controls (mean event rate: 2.0 ± 0.3 Hz for the original data vs. 0.32 ± 0.03 Hz for jittered data, n = 8 sessions, p = 0.0078, W = 36, Wilcoxon signed-rank test; Supplementary Figure 5A). To assess whether ripple oscillations were associated with these synchronous ensembles, we analyzed ripple event rates and their relationship to population synchrony. During this period, ripple events were infrequent (mean ripple rate: 0.02 ± 0.01, n = 8 sessions), and ripple power peaked during ripple episodes but remained low at the timings of population synchrony (Supplementary Figure 5B). Nevertheless, LFP traces aligned to population synchrony revealed prominent theta oscillations (Supplementary Figure 5C). Synchronous ensembles were modulated by LFP theta oscillation (modulation strength: 0.30 ± 0.04, n = 8 sessions, p < 0.001), and the timings of individual ensembles were consistently locked to the preferred phase of each session, suggesting a functional coupling of synchronous ensembles to theta oscillations important for information processing (Supplementary Figure 5D).”
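As a reading aid for the statistics quoted above: with n = 8 sessions, W = 36 is the largest possible signed-rank sum (1 + 2 + ... + 8), which occurs when every session changes in the same direction, and the exact two-sided p-value in that case is 2/2^8 ≈ 0.0078. The sketch below checks this by enumeration. It assumes W denotes the sum of positive ranks with no ties or zero differences; the function name is hypothetical.

```python
from itertools import product

def exact_signed_rank_p(n, w_plus):
    """Exact two-sided p-value for the Wilcoxon signed-rank statistic W+.

    Enumerates all 2**n equally likely sign assignments under the null
    hypothesis (assumes no ties and no zero differences).
    """
    ranks = range(1, n + 1)
    total = upper = lower = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        total += 1
        upper += w >= w_plus
        lower += w <= w_plus
    return min(1.0, 2 * min(upper, lower) / total)

print(exact_signed_rank_p(8, 36))  # 0.0078125
```

This is also the smallest two-sided p-value an exact signed-rank test can return for eight paired observations.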

      (6) Please describe exact statistical information (e.g. statistical values, degree of freedom, and test types) throughout the manuscript.

      Statistical values, degrees of freedom, and test types have been included in the manuscript. Please see below an example from the manuscript:

      Result (line 96)

      “Consistent with previous studies, neurons labeled on E14.5 were located more on the deep side of the pyramidal layer than those labeled on E17.5 (t<sub>(601)</sub>=22.8, p<0.0001, Student’s t-test; Supplementary Figure 1C, D).”

      Minor comment - Figure 2A legend: what is "gray rectangles"?

      We apologize for the inconsistency in nomenclature in the figure legends. We have now corrected this issue and consistently use the term “gray vertical bars” to indicate the timings and durations of synchronous events throughout the article.

      Reviewer #2 (Public Review):

      Summary:

      This study employed voltage imaging in the CA1 region of the mouse hippocampus during the exploration of a novel environment. The authors report synchronous activity, involving almost half of the imaged neurons, occurred during periods of immobility. These events did not correlate with SWRs, but instead, occurred during theta oscillations and were phase-locked to the trough of theta. Moreover, pairs of neurons with high synchronization tended to display non-overlapping place fields, leading the authors to suggest these events may play a role in binding a distributed representation of the context.

      We thank the reviewer for a thorough and thoughtful review of our paper.

      Strengths:

      Technically this is an impressive study, using an emerging approach that allows single-cell resolution voltage imaging in animals, that while head-fixed, can move through a real environment. The paper is written clearly and suggests novel observations about population-level activity in CA1.

      We thank the reviewer for pointing out the technical strength and the novelty of our study.

      Weaknesses:

      The evidence provided is weak, with the authors making surprising population-level claims based on a very sparse data set (5 data sets, each with less than 20 neurons simultaneously recorded) acquired with exciting, but less tested technology. Further, while the authors link these observations to the novelty of the context, both in the title and text, they do not include data from subsequent visits to support this. Detailed comments are below:

      We understand the reviewer’s concerns regarding the dataset size. In the revised manuscript, we have included additional data to further strengthen our conclusions and provide a more robust dataset. Specifically, we expanded our analysis by increasing the number of sessions and neurons recorded, ensuring that the findings are more representative and less likely to be influenced by sample sizes.

      Moreover, synchronous ensembles exceeding what could be expected by chance were detected in all examined data, validating our claims regarding population synchrony. We have also carefully considered the potential impact of the technology used in our experiments and included additional validation and comparison with results from other studies employing complementary techniques to support the reliability of our conclusions.

      Regarding the link to novelty, we have included data from subsequent visits, as suggested by the reviewer. These new data demonstrate that the observed changes in synchronous ensembles are context-dependent and significantly influenced by novelty. This confirms the novelty-related effects observed during initial visits and further supports the conclusions drawn in the manuscript. Please see below for our detailed replies to each of the reviewer’s points.

      (1) My first question for the authors, which is not addressed in the discussion, is why these events have not been observed in the countless extracellular recording experiments conducted in rodent CA1 during the exploration of novel environments. Those data sets often have 10x the neurons simultaneously recorded compared to the present data, thus the highly synchronous firing should be very hard to miss. Ideally, the authors could confirm their claims via the analysis of publicly available electrophysiology data sets. Further, the claim of high extra-SWR synchrony is complicated by the observation that their recorded neurons fail to spike during the limited number of SWRs recorded during behavior, again not agreeing with much of the previous electrophysiological recordings.

      We thank the reviewer for raising these important questions. To address the first question, it is possible that synchronous ensembles were not previously detected in extracellular recordings due to differences in detection methods or analysis approaches. To investigate this further, we analyzed a publicly available dataset (hc-11 on CRCNS), which contains hippocampal recordings from rats navigating novel mazes for water rewards. Using our algorithm, we detected robust synchronous ensembles in the dataset (Supplementary Figure 5). The rates of synchronous events were significantly higher than those in jittered controls, demonstrating the reliability and generalizability of these synchronous ensembles.

      Results (line 366)

      “To further investigate synchronous ensembles across different datasets, we analyzed publicly available hippocampal recordings ‘hc-11’ from the CRCNS repository, where rats navigated novel mazes for water rewards (see Method). Using our algorithm, we identified a significant number of synchronous ensembles during the first three minutes of novel navigation. On average, the rates of synchronous events were 6.4-fold higher than those detected in jittered controls (mean event rate: 2.0 ± 0.3 Hz for the original data vs. 0.32 ± 0.03 Hz for jittered data, n = 8 sessions, p = 0.0078, W = 36, Wilcoxon signed-rank test; Supplementary Figure 5A). To assess whether ripple oscillations were associated with these synchronous ensembles, we analyzed ripple event rates and their relationship to population synchrony. During this period, ripple events were infrequent (mean ripple rate: 0.02 ± 0.01, n = 8 sessions), and ripple power peaked during ripple episodes but remained low at the timings of population synchrony (Supplementary Figure 5B). Nevertheless, LFP traces aligned to population synchrony revealed prominent theta oscillations (Supplementary Figure 5C). Synchronous ensembles were modulated by LFP theta oscillation (modulation strength: 0.30 ± 0.04, n = 8 sessions, p < 0.001), and the timings of individual ensembles were consistently locked to the preferred phase of each session, suggesting a functional coupling of synchronous ensembles to theta oscillations important for information processing (Supplementary Figure 5D).”

      To address the second question, we conducted new experiments under conditions optimized for ripple generation. Specifically, we recorded neurons in mice head-fixed in a novel environment, resulting in 174 ripple episodes across six sessions. Consistent with our original findings, spiking rates were significantly suppressed and membrane potentials were hyperpolarized during ripples (Figure 4Ai-Ci of the manuscript). Despite this suppression, the same neurons exhibit rich synchronous activities outside of ripples (Figure 3-figure supplement 3 of the manuscript). These results confirm that these synchronous ensembles are distinct from ripple-related neuronal activity and strengthen our claim that the observed synchronous ensembles represent a distinct physiological phenomenon, consistent across different datasets and experimental conditions.

      Results (line 251)

      “It was puzzling that these CA1PCs exhibited robust spiking activities outside of ripples yet generated few spikes during ripples. To further investigate neuronal activities during ripples, we established a recording condition that allowed us to capture more ripple episodes. Specifically, we immobilized mice in a tube to promote behaviors favoring ripple generation. The mice were habituated to head fixation in a tube in a room distinct from the one where imaging experiments were conducted. On the imaging day, the mice were introduced to the recording room and head-fixed under the microscope for the first time.

      CA1PCs were labeled in utero on embryonic day (E) 14.5 (n=56 cells from 3 sessions in 3 mice) and E17.5 (n=55 cells from 3 sessions in 3 mice) and imaged in adult brains. Both neuronal populations exhibited prominent peaks in their grand average CCGs and significantly higher synchronous event rates compared to jittered data (Figure 3-figure supplement 2A, B). Approximately 40% of the recorded neurons participated in synchronous ensembles, indicating robust synchronous activity involving a substantial proportion of the recorded cells (Figure 3-figure supplement 2C).

      In total, 1052 synchronous ensembles and 174 ripple episodes were detected across these imaging sessions. Consistent with findings from walking animals, few synchronous ensembles occurred during ripples when animals were immobilized in a tube (Figure 3-figure supplement 3A, B). Moreover, no distinguishable ripple oscillations were observed in synchronous events, and the average firing rates during ripple episodes were near zero (Figure 3-figure supplement 3C, D). At the single-cell level, 90% of neurons showed significant negative spiking modulation during ripples, with ripple modulation indexes close to -1, indicating strong suppression of spiking (Figure 4Ai). This suppression extended to subthreshold membrane potentials, as nearly all cells exhibited decreased fluorescence during ripples compared to baseline (Figure 4Bi, Ci). These results demonstrate that spiking activity and subthreshold membrane potentials are robustly suppressed during ripples.”

      (2) The authors posit that these events are linked to the novelty of the context, both in the text, as well as in the title and abstract. However, they do not include any imaging data from subsequent days to demonstrate the failure to see this synchrony in a familiar environment. If these data are available it would strengthen the proposed link to novelty if they were included.

      Following the reviewer’s suggestion, we recorded neuronal activities in a familiar context to test the proposed link between synchronous activity and contextual novelty. We found that synchronous activity levels were significantly lower in the familiar context compared to the novel context, demonstrating that synchronous activity is strongly modulated by contextual novelty (Figure 4D of the manuscript). These findings provide further support for a link between the synchronous ensembles and novel environmental contexts.

      Result (line 277)

      “Contextual novelty plays a critical role in shaping hippocampal neuronal activities. To assess its influence, we trained mice to become familiar with the imaging procedure and the recording environment over five days and recorded CA1PC activities on the final day. Mean firing rates were significantly reduced in the familiar group compared to the novel group (Familiar group: 1.1 to 5.2 Hz (25<sup>th</sup>-75<sup>th</sup> percentiles), median=2.3 Hz, n=66 cells, 6 sessions, 4 mice; Novel group: 1.7 to 6.0 Hz (25<sup>th</sup>-75<sup>th</sup> percentiles), median=4.2 Hz, n=111 cells, 6 sessions, 6 mice, p=0.0083, Wilcoxon signed-rank test). Additionally, 15% of the neurons in the familiar group exhibited significantly positive spiking modulation by ripples, while fewer cells showed negative modulation compared to the novel group (Figure 4A). During ripples, neurons in the novel group predominantly displayed hyperpolarizing membrane voltage responses, whereas a subset of neurons in the familiar group exhibited prominent depolarizing responses (Figure 4B). The mean fluorescence changes in the familiar group shifted toward depolarization compared to the novel group (Figure 4C). Finally, synchronous event frequencies were significantly lower in the familiar context, indicating weaker synchronous activities under familiar conditions (Figure 4D). These results demonstrate that hippocampal neuronal activities, particularly synchronous ensembles, are strongly influenced by contextual novelty.”

      (3) In the discussion the authors begin by speculating the theta present during these synchronous events may be slower type II or attentional theta. This can be supported by demonstrating a frequency shift in the theta recording during these events/immobility versus the theta recording during movement.

      We thank the reviewer for the suggestion. As the reviewer points out, we did observe a frequency shift in synchrony-associated theta during immobility compared to locomotion (see Figure 5B, red vs. blue curves). We have now highlighted this result in the discussion section. Please refer to the text below.

      Discussion (line 471)

      “On the other hand, type 2 theta, or attentional theta, is slightly slower and is blocked by muscarinic receptor antagonists, emerging during states of arousal or attention, such as when entering a new environment. Consistent with these distinctions, the peak of the power spectrum density shows a distinctively slower theta frequency during immobility compared to locomotion (Figure 5B).”

      (4) The authors mention in the discussion that they image deep-layer PCs in CA1, however, this is not mentioned in the text or methods. They should include data, such as imaging of a slice of a brain post-recording with immunohistochemistry for a layer-specific gene to support this.

      We thank the reviewer for the constructive suggestion. In response, we have added images of slices from both E14.5 and E17.5 brains and analyzed soma locations along the radial axis of the pyramidal layer. The results are included in the main text, Methods, and Supplementary Figure 1 of the manuscript (see below).

      Result (line 96)

      “Consistent with previous studies, neurons labeled on E14.5 were located more on the deep side of the pyramidal layer than those labeled on E17.5 (t<sub>(601)</sub>=22.8, p<0.0001, Student’s t-test; Supplementary Figure 1C, D).”

      Methods (line 563)

      “The injection resulted in Cre expression among neurons born on the day of injection, with earlier injection labeling neurons located on the deeper side of the cell layer.”

      Reviewer #3 (Public Review):

      Summary:

      In the present manuscript, the authors use a few minutes of voltage imaging of CA1 pyramidal cells in head-fixed mice running on a track while local field potentials (LFPs) are recorded. The authors suggest that synchronous ensembles of neurons are differentially associated with different types of LFP patterns, theta and ripples. The experiments are flawed in that the LFP is not "local" but rather collected in the other side of the brain, and the investigation is flawed due to multiple problems with the point process analyses. The synchrony terminology refers to dozens of milliseconds as opposed to the millisecond timescale referred to in prior work, and the interpretations do not take into account theta phase locking as a simple alternative explanation.

      We appreciate the reviewer’s feedback and acknowledge the concerns raised. However, we believe these concerns can be effectively addressed without compromising the validity of our conclusions. With this in mind, we respectfully disagree with the assessment that our experiments and investigation are flawed. Please allow us to address these concerns and offer additional context to support the validity of our study.

      Weaknesses:

      The two main messages of the manuscript indicated in the title are not supported by the data. The title gives two messages that relate to CA1 pyramidal neurons in behaving head-fixed mice: (1) synchronous ensembles are associated with theta (2) synchronous ensembles are not associated with ripples.

      There are two main methodological problems with the work: (1) experimentally, the theta and ripple signals were recorded using electrophysiology from the opposite hemisphere to the one in which the spiking was monitored. However, both signals exhibit profound differences as a function of location: theta phase changes with the precise location along the proximo-distal and dorso-ventral axes, and importantly, even reverses with depth. And ripples are often a local phenomenon - independent ripples occur within a fraction of a millimeter within the same hemisphere, let alone different hemispheres. Ripples are very sensitive to the precise depth - 100 micrometers up or down, and only a positive deflection/sharp wave is evident.

      We appreciate the reviewer’s concern regarding the collection of LFP from the contralateral hemisphere. While we acknowledge the limitation of this design, we believe these contralateral LFP recordings still provide valuable insights into the dynamics of synchronous ensembles. Despite potential variations in theta phases due to differences in recording locations and depths, the occurrence and amplitudes of theta oscillations are generally well coordinated across hemispheres (Buzsaki et al., 2003, Fig 5)(3). The presence of prominent contralateral LFP theta activity around the times of synchronous ensembles in our study (Figure 5A of the manuscript) strongly supports our conclusion about their association with theta oscillations, even with LFP collected from the opposite hemisphere.

      Additionally, we explicitly noted in the manuscript that the “preferred phases” varied between sessions, likely reflecting variability in recording locations (see below). Thus, we believe the concern about theta phase variability has already been adequately addressed in the current manuscript.

      Result (line 321)

      “Although the preferred phases varied from session to session due to differences in recording sites along the proximal-distal axis of the hippocampus, the timings of individual ensembles were consistently locked to the preferred phase of each session (Figure 5C).”

      While we acknowledge that ripple oscillations can sometimes occur locally, the majority of ripples occur synchronously in both hemispheres (up to 70%)(3,26), as demonstrated both in the literature (Szabo et al., 2022, Supplementary Figure 2) and by data from our lab (Huang et al., 2024, Figure S6). As a result, using contralateral LFP to infer ripple occurrence on the ipsilateral side is a well-established practice in the field, commonly employed by many studies published in reputable journals(26-29). Given the high co-occurrence of both theta and ripple oscillations across hemispheres, we maintain that the two main messages of our manuscript are supported by data, despite the concern regarding phase discrepancy mentioned by the reviewer.

      (2) The analysis of the point process data (spike trains) is entirely flawed. There are many technical issues: complex spikes ("bursts") are not accounted for; differences in spike counts between the various conditions ("locomotion" and "immobility") are not accounted for; the pooling of multiple CCGs assumes independence, whereas even conditional independence cannot be assumed; etc.

      We acknowledge the reviewer’s concern regarding spike train analysis. Complex bursts or differences in behavioral conditions can indeed lead to variations in spike counts, which could potentially affect the detection of synchronous ensembles. However, our jittering procedure is specifically designed to account for variations in spike counts. Notably, while the jittered spike trains retain the same spike count variations, we observed 7.8 times more synchronous events in our data compared to the jittered controls (Figure 1G of the manuscript). This indicates that the specific spike timings in the original data, which are disrupted in the jittered data, are responsible for the observed synchrony.
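The logic of a jitter control can be illustrated with a toy example. The sketch below is not the detection algorithm used in the manuscript: the jitter window (±75 ms), the bin width (25 ms), the minimum number of co-active cells (3), and both function names are assumptions chosen only to show that jittering preserves each neuron's spike count while disrupting fine-timescale co-firing.

```python
import random

def jitter_spike_train(spike_times, max_shift, rng):
    """Shift every spike by an independent uniform offset in [-max_shift, max_shift].

    The number of spikes is unchanged; only the fine timing is disrupted.
    """
    return sorted(t + rng.uniform(-max_shift, max_shift) for t in spike_times)

def count_synchronous_bins(trains, bin_width, min_cells):
    """Count time bins in which at least min_cells different neurons spike."""
    cells_per_bin = {}
    for cell, train in enumerate(trains):
        for t in train:
            cells_per_bin.setdefault(int(t // bin_width), set()).add(cell)
    return sum(len(cells) >= min_cells for cells in cells_per_bin.values())

# Toy data: 5 neurons that all fire together 20 times, 200 ms apart.
event_times = [0.1005 + 0.2 * k for k in range(20)]
trains = [list(event_times) for _ in range(5)]

rng = random.Random(0)
jittered = [jitter_spike_train(train, 0.075, rng) for train in trains]

observed = count_synchronous_bins(trains, 0.025, 3)   # every event is synchronous
control = count_synchronous_bins(jittered, 0.025, 3)  # far fewer after jittering
print(observed, control)
```

Because the jittered trains keep the same spike counts, any excess of synchronous bins in the original data over the control reflects spike timing rather than firing rate.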

      To further address the concern that complex bursts might influence the observed synchrony, we performed additional analyses in which we excluded all later spikes in bursts, considering only single spikes and the first spikes of bursts. Importantly, this procedure did not affect the rate or size of synchronous ensembles and did not significantly alter the grand-average CCG (Supplementary Figure 3). These results explicitly demonstrate that complex bursts do not significantly impact the analysis of synchronous ensembles.

      Result (line 131)

      “The observed population synchrony was not attributable to spikes in complex bursts, as synchronous event rates did not differ significantly with or without the inclusion of later spikes in bursts (Supplementary Figure 3).”
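The burst control described above, keeping single spikes and only the first spike of each burst, can be sketched as follows. The inter-spike-interval criterion used to define a burst (15 ms here) and the function name are assumptions for illustration; the manuscript's exact burst definition is not given in this response.

```python
def first_spikes_only(spike_times, max_burst_isi):
    """Keep isolated spikes and the first spike of each burst.

    A spike is treated as a later burst spike, and dropped, when it follows
    the previous spike by no more than max_burst_isi seconds (assumed criterion).
    """
    kept = []
    prev = None
    for t in sorted(spike_times):
        if prev is None or t - prev > max_burst_isi:
            kept.append(t)
        prev = t
    return kept

train = [0.100, 0.105, 0.112, 0.300, 0.500, 0.508]
print(first_spikes_only(train, 0.015))  # [0.1, 0.3, 0.5]
```

Running the synchrony detection on the reduced trains and comparing event rates with the full trains is then a direct test of whether later burst spikes contribute to the result.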

      Beyond those methodological issues, there are two main interpretational problems: (1) the "synchronous ensembles" may be completely consistent with phase locking to the intracellular theta (as even shown by the authors themselves in some of the supplementary figures).

      We agree with the reviewer that the synchronous ensembles are indeed consistent with theta phase locking. However, it is important to note that theta phase locking alone does not necessarily imply population synchrony. In fact, previous research has demonstrated that theta phase locking can “reduce” population synchrony(30). Thus, the presence of theta phase locking cannot be considered a simple alternative explanation for the synchronous ensembles.

      The idea that theta phase locking does not necessarily lead to population synchrony is illustrated in Author response image 1A. In this example, while all three neurons are perfectly locked to specific theta phases, no synchrony among neurons is evident. In contrast, our data align with the scenario depicted in Author response image 1B, where spikes occur not only at specific theta phases but also in the same cycles, thereby facilitating population synchrony.

      Author response image 1.

      Illustrative diagram of the relationship between theta phase coupling and population synchrony. (A) Illustration of theta phase coupling with low population synchrony. (B) Illustration of population synchrony with theta phase coupling.

      To directly assess the contribution of theta phase locking to synchronous ensembles, we performed a new analysis in which the specific theta cycles during which neurons spike were randomized while keeping the spike phases unchanged. This manipulation disrupts spike co-occurrence while preserving theta phase locking, allowing us to test whether theta phase locking alone can explain the population synchrony. We found that theta-cycle randomization significantly reduced the rate of synchronous events, by 4.5-fold (Supplementary Figure 4). This new analysis demonstrates that theta phase locking alone cannot account for the population synchrony observed in our data.

      Result (line 358)

      “Correlated intracellular theta and theta-phase locking of the synchronous ensembles raise the question of whether population synchrony among CA1PCs extends beyond synchrony derived from these effects. To address this, we analyzed population synchrony after randomizing the theta cycles during which neurons spiked, while keeping their theta phases unchanged. Supplementary Figure 4 illustrates a significant reduction in synchronous event rates following theta cycle randomization. The finding indicates spiking at specific theta cycles plays a major role in driving population synchrony.”
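The idea behind theta-cycle randomization can be sketched with a toy example in which each spike is represented as a (cycle index, phase) pair. This is only an illustration under simplifying assumptions: cycles are drawn uniformly at random with replacement, the helper names are hypothetical, and co-firing is summarized as the number of cycles shared by all neurons rather than by the manuscript's ensemble detection.

```python
import math
import random

def randomize_theta_cycles(spikes, n_cycles, rng):
    """Move each spike to a randomly drawn theta cycle, keeping its phase."""
    return [(rng.randrange(n_cycles), phase) for _cycle, phase in spikes]

def shared_cycles(neurons):
    """Number of theta cycles in which every neuron fires at least once."""
    cycle_sets = [{cycle for cycle, _phase in spikes} for spikes in neurons]
    return len(set.intersection(*cycle_sets))

# Toy data: 3 neurons, all firing at phase pi in the same 10 of 50 cycles.
n_cycles = 50
neurons = [[(cycle, math.pi) for cycle in range(10)] for _ in range(3)]

rng = random.Random(1)
shuffled = [randomize_theta_cycles(spikes, n_cycles, rng) for spikes in neurons]

before = shared_cycles(neurons)   # 10: all spikes co-occur in the same cycles
after = shared_cycles(shuffled)   # phase locking intact, co-firing mostly lost
print(before, after)
```

Every spike keeps its theta phase, so phase locking is untouched, yet co-firing within the same cycle largely disappears. This is the distinction the randomization analysis is designed to test.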

      (2) The definition of "synchrony" in the present work is very loose and refers to timescales of 20-30 ms. In previous literature that relates to synchrony of point processes, the timescales discussed are 1-2 ms, and longer timescales are referred to as the "baseline" which is actually removed (using smoothing, jittering, etc.).

      Regarding the timescale of synchronous ensembles, we acknowledge that it varies considerably across studies and cell types. However, it is important to note that a timescale of dozens or even hundreds of milliseconds is commonly used in the context of synchrony terminology for CA1 pyramidal neurons(6,31-33). In fact, a timescale of 20-30 ms is considered particularly important for information transmission and storage in CA1, as it aligns with the membrane time constant of pyramidal neurons, the period of hippocampal gamma oscillations, and the time window for synaptic plasticity. Therefore, we believe this timescale is highly relevant and consistent with established practices in the field.

      Reviewer #3 (Recommendations For The Authors):

      (1) L19-20: "these synchronous ensembles were not associated with ripple oscillations" - this is a main fallacy in the present work (ripples are from the other side; there are not enough ripples to obtain sufficient statistical power to even test the hypothesis; etc.). The sentence should be removed.

      As we have addressed in the public review, most ripples occur synchronously in both hemispheres(3,26). Many studies have used contralateral LFP to infer ripple occurrence on the ipsilateral side(26-29). Moreover, our new data now support the dissociation between synchronous ensembles and ripples with a much larger number of ripples and rigorous statistical testing (Figure 3-figure supplement 3 of the manuscript). These findings support our conclusion that synchronous ensembles are not associated with ripple oscillations.

      Result (line 266)

      “In total, 1052 synchronous ensembles and 174 ripple episodes were detected across these imaging sessions. Consistent with findings from walking animals, few synchronous ensembles occurred during ripples when animals were immobilized in a tube (Figure 3-figure supplement 3A, B). Moreover, no distinguishable ripple oscillations were observed in synchronous events, and the average firing rates during ripple episodes were near zero (Figure 3-figure supplement 3C, D). At the single-cell level, 90% of neurons showed significant negative spiking modulation during ripples, with ripple modulation indexes close to -1, indicating strong suppression of spiking (Figure 4Ai). This suppression extended to subthreshold membrane potentials, as nearly all cells exhibited decreased fluorescence during ripples compared to baseline (Figure 4Bi, Ci). These results demonstrate that spiking activity and subthreshold membrane potentials are robustly suppressed during ripples.”

      (2) L135/Figure 1: panel C and elsewhere: show the same traces after removing (clipping) the spikes. You may be able to see the intracellular theta nicely, which may be very strongly synchronized between neurons and could then be supplemented by ticks (as in conventional raster plots). This will allow a clearer visualization of the spiking and their relations with Vm.

      We have created the plot as suggested (Author response image 2). As demonstrated in our figures (Figure 5 in the manuscript), the subthreshold membrane potentials of individual neurons are strongly correlated and coherent at theta frequency, consistent with the reviewer’s viewpoint.

      Author response image 2.

      Fluorescence traces of 19 simultaneously recorded cells with truncated spikes replaced by dots. Horizontal scale bar: 25 ms; vertical scale bar: -3%.

      (3) Related to the above comment, in general, a much more robust approach with the present dataset may be to derive an estimate of the LFP from the intracellular records. Extracellular theta is related to intracellular theta (approximately the negative), and extracellular ripples co-occur with intracellular high-frequency oscillations. However, because the precise transfer function (TF) between the two is not well established, ground truth data should first be collected. This may be done by voltage imaging of even a single neuron in parallel with an extracellular glass pipette placed in near proximity of the same cell, at the same depth. Such datasets have been collected in the past, so it may be sufficient to contact those authors and derive the TF from existing data. Alternatively, new experiments may be required. It is possible that the TF will not be well defined - in which case there are two options: (1) limit the analyses to the relation between spikes in Vm, or (2) record new datasets with true LOCAL field potentials in every case.

      We thank the reviewer for the insightful suggestion. Establishing a precise TF between intracellular and extracellular recordings is indeed crucial when exact phase information is required to draw conclusions. However, our goal is to understand the occurrence of specific network oscillation states surrounding these synchronous ensembles, rather than pinpointing the precise phase at which they occur. Therefore, we believe that the strong bilateral co-occurrence of both theta and ripple oscillations provides a practical and valid foundation for supporting our objective.

      While the approach suggested by the reviewer is an excellent idea, conducting simultaneous voltage imaging and local LFP recording is currently not feasible due to technical constraints associated with the implanted glass windows. Nevertheless, we recognize the potential value of this approach and plan to incorporate it into future experimental designs, which could provide further insights into the specific oscillatory phases associated with population synchrony.

      (4) L135/Figure 1: panel D and elsewhere: Account for second-order spike train statistics (e.g., bursts). The simplest way to do this is to remove all spikes that are not the first spike in a burst. Otherwise, the zero-lag bin of a pair-wise CCG will be filled with counts that are due e.g., to the first spike of the second neuron co-occurring with the last spike in a burst of the first neuron. In other words, without accounting for bursts, sequential activity may be interpreted as synchrony.

      We thank the reviewer for this insightful comment. As recommended, we have performed the suggested analysis by removing all spikes that are not the first spike in a burst (Supplementary Figure 3). The results demonstrate that, even after removing the subsequent spikes in bursts, the rates of synchronous events remain unchanged compared to the original data, and the sizes of the synchronous ensembles are also unaffected. These findings indicate that our conclusions are robust and not confounded by the presence of later spikes within bursts.
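
      For illustration only (this is not the analysis code used for the manuscript), the burst control described above can be sketched as follows. The inter-spike-interval criterion `max_isi_ms` and the function name are hypothetical; the response does not state how bursts were defined.

```python
def first_spikes_in_bursts(spike_times_ms, max_isi_ms=10.0):
    """Keep isolated spikes and the first spike of each burst.

    A spike is dropped when it follows the previous spike by no more than
    max_isi_ms. The 10 ms default is an assumed burst criterion, not a value
    taken from the manuscript.
    """
    kept = []
    previous = None
    for t in sorted(spike_times_ms):
        if previous is None or t - previous > max_isi_ms:
            kept.append(t)
        previous = t  # compare each spike with its immediate predecessor
    return kept


# Toy spike train (ms): two bursts and one isolated spike.
spikes = [100.0, 104.0, 109.0, 300.0, 520.0, 523.0]
filtered = first_spikes_in_bursts(spikes)
```

      Synchrony detection can then be rerun on the filtered trains and compared with the result from the unfiltered trains.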

      Result (line 131)

      “The observed population synchrony was not attributable to spikes in complex bursts, as synchronous event rates did not differ significantly with or without the inclusion of later spikes in bursts (Supplementary Figure 3).”

      (5) L135/Figure 1: panel D and elsewhere: Related to the previous comment: the "grand average" CCG of a single neuron with all the other simultaneously recorded neurons is prone to a peak at zero lag ("synchrony") even if all pairs of neurons have pure mono-synaptic connections (e.g., at a 2 ms time lag). This is because neuron1 (N1) may precede N2, whereas then N3 may precede N2. In such a case, the pooled CCG will have two peaks - at e.g., 2 ms and -2 ms. However, if bursts occur (as is the case in CA1 and Figure 1C), there will also be non-zero counts around zero lag, which will accumulate as well. Together, these will build up to a peak around zero - even without any theta phase locking or any other alternative correlations.

      Please see our reply to comment #6 below.

      (6) L135/Figure 1: panel D and elsewhere: refrain from averaging "grand averages" over neurons. This problem is distinct from the above (where e.g., N2-N1 is averaged with N2-N3). In any case, all visualizations and measures should be derived from individual (pair-wise) CCGs, and not "grand averages"

      We thank the reviewer for the detailed comments and appreciate the opportunity to clarify our methods and analyses related to population synchrony. In response to the suggestion to replace grand average CCGs with pairwise CCGs, we have now included a heatmap to visualize individual pairwise CCGs for all recorded neuronal pairs that meet our inclusion criteria (497 pairs, Author response image 3). The heatmap provides a comprehensive view of the temporal relationships between neuron pairs.

      Author response image 3.

      Color-coded plot of pairwise CCGs for all cell pairs that meet our inclusion criteria.

      While we have chosen to keep the grand-average CCGs, we emphasize that they serve only to summarize the overall temporal scale of the population synchrony. Importantly, our conclusions regarding synchronous ensembles are not based on grand-average CCGs. Instead, we assess population synchrony using a rigorous approach: we compute spike counts across the population in 25-ms sliding windows and compare these counts to those derived from jittered data, where spike timings are randomly shifted by ±75 ms while preserving the overall spike count distribution. Synchrony is identified when the original spike counts exceed those from the jittered data by more than 4 standard deviations. This approach accounts for the potential accumulation of zero-lag counts arising from mixed mono-synaptic connections or bursting, as noted by the reviewer. By perturbing spike timings and preserving spike count distributions, our method identifies synchrony beyond what is expected by chance, ensuring robust and artifact-free conclusions.
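
      A minimal sketch of this detection procedure is given below for illustration; it is not the code used for the manuscript. The 25-ms window, ±75 ms jitter and 4-SD criterion come from the text above. The 5-ms step, the number of surrogates, the uniform jitter distribution and the pooling of jittered counts across windows to obtain a single threshold are assumptions made here, as are all function names.

```python
import random
import statistics


def window_counts(trains, window_ms, step_ms, duration_ms):
    """Population spike count in each sliding window [start, start + window_ms)."""
    all_spikes = [t for train in trains for t in train]
    counts = []
    start = 0.0
    while start + window_ms <= duration_ms:
        counts.append(sum(1 for t in all_spikes if start <= t < start + window_ms))
        start += step_ms
    return counts


def jitter(trains, jitter_ms, rng):
    """Shift every spike independently by a uniform offset in [-jitter_ms, jitter_ms]."""
    return [[t + rng.uniform(-jitter_ms, jitter_ms) for t in train] for train in trains]


def detect_synchrony(trains, duration_ms, window_ms=25.0, step_ms=5.0,
                     jitter_ms=75.0, n_surrogates=100, n_sd=4.0, seed=0):
    """Return window start times whose count exceeds the jitter mean by n_sd SDs."""
    rng = random.Random(seed)
    observed = window_counts(trains, window_ms, step_ms, duration_ms)
    pooled = []  # counts from all windows of all jittered surrogates (assumed pooling)
    for _ in range(n_surrogates):
        surrogate = jitter(trains, jitter_ms, rng)
        pooled.extend(window_counts(surrogate, window_ms, step_ms, duration_ms))
    threshold = statistics.mean(pooled) + n_sd * statistics.pstdev(pooled)
    starts = [i * step_ms for i, c in enumerate(observed) if c > threshold]
    return starts, threshold


# Toy data: 10 cells fire within 10 ms of each other near 500 ms,
# plus two background spikes per cell at random times.
toy_rng = random.Random(1)
trains = []
for i in range(10):
    background = [toy_rng.uniform(0.0, 1000.0) for _ in range(2)]
    trains.append(sorted(background + [500.0 + i]))
starts, threshold = detect_synchrony(trains, duration_ms=1000.0)
```

      Because the jitter moves spikes by up to 75 ms but leaves each cell's spike count untouched, slow co-variations in firing rate survive in the surrogates and only the finer-timescale coincidence is counted as synchrony.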

      (7) L135/Figure 1: panel D and elsewhere: after deriving measures (peak lag, FWHM, synchrony strength, etc.) from individual pairwise CCGs, show the measures as a function of the spike counts. For a pair of neurons N1-N2, derive the geometric mean spike count (or the mean, or the max). For instance, if there are 500 pairs of neurons, show e.g., pairwise synchrony strength as a function of the spike count geometric mean. While little correlation is expected when the timescale is small (1-2 ms), the "synchrony" effect at a timescale of 20-30 ms is expected to be very strongly related to the spike counts. Because the spike counts may differ between the lower and higher speed "states", many results reported in the present manuscript may be an epiphenomenon of that relationship.

      We thank the reviewer for these valuable comments. In response, we analyzed pairwise synchronization strengths as a function of the geometric mean of the spike counts of each neuron pair, as suggested. As shown in Author response image 4, the CCG peak counts in the original data (red dots) increase with the spike count geometric mean, consistent with the expected trend. However, this trend is also captured by the jitter control (black dots), which reflects synchrony levels expected by chance given the spike count levels.

      Importantly, the normalized synchronization strengths - defined as the ratio of CCG peak counts in the original data to the jitter control – are not positively correlated with spike counts and remain significantly greater than 1 across all spike count levels (Author response image 5). This demonstrates synchrony beyond what could be explained by spike count variations alone.

      While we understand the potential influence of state-dependent spike count variations, our jittering approach effectively controls for this by removing chance-level synchrony that could arise from these variations. This ensures that the observed synchrony reflects genuine neuronal interactions rather than an epiphenomenon of spike count variations between states.
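
      The normalization described above can be sketched as follows (an illustration under stated assumptions, not the code used for the manuscript). The count of spike pairs within a central lag window stands in for the "CCG peak count"; the ±12.5 ms half-width, the uniform ±75 ms jitter, the number of surrogates and all function names are assumptions.

```python
import math
import random


def central_ccg_count(train_a, train_b, half_width_ms):
    """Number of spike pairs (a, b) whose lag lies within ±half_width_ms."""
    return sum(1 for a in train_a for b in train_b if abs(a - b) <= half_width_ms)


def normalized_sync_strength(train_a, train_b, half_width_ms=12.5,
                             jitter_ms=75.0, n_surrogates=200, seed=0):
    """Observed central CCG count divided by its mean over jittered surrogates."""
    rng = random.Random(seed)
    observed = central_ccg_count(train_a, train_b, half_width_ms)
    surrogate_counts = []
    for _ in range(n_surrogates):
        ja = [t + rng.uniform(-jitter_ms, jitter_ms) for t in train_a]
        jb = [t + rng.uniform(-jitter_ms, jitter_ms) for t in train_b]
        surrogate_counts.append(central_ccg_count(ja, jb, half_width_ms))
    chance = sum(surrogate_counts) / n_surrogates
    return observed / chance if chance > 0 else float("inf")


def geometric_mean_count(train_a, train_b):
    """Geometric mean of the two cells' spike counts."""
    return math.sqrt(len(train_a) * len(train_b))


# Toy pair (ms): every spike of cell A has a partner in cell B within 5 ms.
cell_a = [100.0, 300.0, 500.0, 700.0, 900.0]
cell_b = [102.0, 305.0, 498.0, 703.0, 905.0]
strength = normalized_sync_strength(cell_a, cell_b)
gm = geometric_mean_count(cell_a, cell_b)
```

      A ratio above 1 means more near-coincident spikes than the jittered trains produce at the same spike counts, which is why plotting the ratio against the geometric mean separates synchrony from a pure spike-count effect.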

      Author response image 4.

      Plot of peak spike counts of pairwise CCGs (red) and mean spike counts from jittered data (black) against geometric means of pair spike counts.

      Author response image 5.

      Plot of normalized synchronization strengths against spike count geometric means.

      (8) L135/Figure 1: show all CCGs in a color matrix.

      We have generated a color matrix visualization of all pairwise CCGs, as recommended (Author response image 3). This visualization highlights the consistency of our results across neuron pairs.

      (9) L168/Figure 2: the LFPO is nearly irrelevant - it is from the other hemisphere, and it is unclear whether the depth is the same as in the "deep" (closer to the brain surface) imaging plane used for the voltage recordings.

      As previously explained, the LFPO is relevant because it reveals the occurrence of theta and ripple states, which are highly synchronous across both hemispheres and serve as reliable indicators of network states relevant to our findings.

      (10) L222/Figure 3: The ripple-related analyses are completely irrelevant - ripples are a local phenomenon, and recording from the other hemisphere is completely irrelevant.

      We thank the reviewer for the suggestions. As we have explained in the public review, as well as in our responses to the reviewer’s comments #1 and #3, the occurrences of theta and ripple oscillations are well-coordinated across hemispheres. As our analyses depend only on the occurrences of these oscillations, our conclusions regarding the association of the synchronous ensembles with theta but not ripple oscillations are supported by the data.

      (11) L292/Figure 4, panels A-E: please trigger Vm on the same-neuron spikes, not on the "synchrony events". This will already explain most of the observations. Some of this is already shown in the supplementary figures.

      As the reviewer correctly noted, we have already presented data triggered on same-neuron spikes in Figure 5-figure supplement 1C and D. The reason we show synchrony-triggered LFP and subthreshold Vm in the figure is to highlight the network dynamics during synchronous events. This approach provides a broader perspective on how neural networks function and interact during periods of synchrony, offering insights beyond individual neuron activity.

      (12) L351/Figure 5, panel C: typo - should read "strength"

      The typo has been corrected.

      (13) L351/Figure 5: show "spatial tuning correlation" vs. inter-soma distance (as in Fig. 4G). This may explain part (if not all) of the observations

      We have followed the reviewer’s suggestion and generated the plot (Author response image 6). Consistent with the literature, the plot demonstrates that the spatial tuning correlations of place cell pairs exhibit little relationship with their inter-soma distances.

      Author response image 6.

      Plot of spatial tuning correlation vs. inter-soma distance (Spearman correlation coefficient = 0.06, p = 0.54, n = 91 pairs).

      (14) L937/Figure S3: panel A: the ripples here appear to be recorded from the top part of the layer, i.e., the electrode is not in the center of the layer. Panel B: add statistical testing.

      We agree with the reviewer that this is possible, as we aimed to place our LFP electrodes in the stratum pyramidale. Regarding panel B of the figure, we verified the quality of LFP recordings by acquiring data from subsequent sessions following the initial imaging sessions. The detection of ripples in the same animals during these later sessions indicates that the absence of ripples during the first sessions is not due to deterioration in LFP recording quality. However, due to the small sample size, the statistical power is insufficient to demonstrate significance (n = 5 sessions, p = 0.06, Wilcoxon signed-rank test). Nevertheless, our conclusions are not contingent upon achieving statistical significance in this test.

      (15) L944/Figure S4: The "R=1" is very likely to be an outcome of n=1 spike. In other words, estimates of phase are unreliable when the spike count is very low. This is related to the problem referred to in Comment #7 above.

      We understand that phase estimates can be unreliable when the spike counts are low. We now highlight that this effect has been taken into account by a shuffling procedure that assesses the significance of phase modulation, and by excluding neurons with nonsignificant modulation strengths. Neurons with low spike count or inconsistent spike phases are typically excluded due to the non-significant strength of phase modulation.

      Method (line 828)

      “The significance of the modulation strength was tested by shuffling the spike timings and recalculating the modulation strength a thousand times to generate a distribution based on the shuffled spike timings. The original modulation strength was then compared to the distribution, with significance determined if it exceeded the 95% confidence interval of the shuffled values. Significant modulation strengths were plotted and compared across groups.”
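
      As an illustration of why this shuffle test removes the "R = 1 from a single spike" cases, a minimal sketch follows; it is not the code used for the manuscript. It assumes that modulation strength is the mean resultant length of spike phases, that "shuffling the spike timings" means redrawing spike times at random from the recording, and that significance means exceeding the 95th percentile of the shuffled values. All names are hypothetical.

```python
import cmath
import math
import random


def modulation_strength(phases):
    """Mean resultant length of phases in radians (1 = perfect phase locking)."""
    return abs(sum(cmath.exp(1j * p) for p in phases)) / len(phases)


def shuffle_test(spike_idx, phase_trace, n_shuffles=1000, alpha=0.05, seed=0):
    """Return (is_significant, observed_strength) for spikes at the given samples."""
    rng = random.Random(seed)
    observed = modulation_strength([phase_trace[i] for i in spike_idx])
    # Null distribution: the same number of spikes placed at random sample times.
    null = sorted(
        modulation_strength([rng.choice(phase_trace) for _ in spike_idx])
        for _ in range(n_shuffles)
    )
    cutoff = null[round((1 - alpha) * n_shuffles) - 1]
    # Small tolerance so floating-point noise cannot count as exceeding the cutoff.
    return observed - cutoff > 1e-9, observed


# Toy theta phase trace: 8 Hz sampled at 1 kHz for 10 s.
theta_hz, rate_hz = 8.0, 1000.0
phase_trace = [(2 * math.pi * theta_hz * i / rate_hz) % (2 * math.pi)
               for i in range(10000)]

locked_spikes = list(range(0, 10000, 125))  # one spike per cycle, same phase
locked_sig, locked_r = shuffle_test(locked_spikes, phase_trace)
single_sig, single_r = shuffle_test([0], phase_trace)  # a cell with one spike
```

      In this sketch the single-spike cell has a strength of 1, but every shuffle of one spike also gives 1, so it cannot exceed the null distribution and is excluded.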

      (16) L944/Figure S4: Putting the spike count issue (Comment #15) aside for a moment, the analyses in this figure are actually valid - they are carried out at the single-neuron level, with respect to the local (same-neuron) Vm. These findings provide a key alternative explanation to the observations purported in the main figures: (1) if spiking is locked to intracellular theta (occurring at the peak of Vm); and if (2) intra-cellular (Vm) theta is locked to extracellular theta (antiphase); and if (3) extracellular theta is similar for nearby neurons (the imaged neurons), then synchrony is a necessary outcome. The key question is then whether there is any EXTRA synchrony between the CA1PC - beyond that which necessarily derives from (1)+(2)+(3).

      We acknowledge the reviewer’s perspective. However, the factors (1)+(2)+(3) alone do not account for the synchrony we observed. As the reviewer points out (and as discussed in our response to the public review and in Supplementary Figure 4), theta phase locking does not necessarily imply population synchrony. To demonstrate that population synchrony extends beyond the contribution of (1)+(2)+(3), we performed an analysis where the theta cycles in which neurons spike were randomized, while the theta phases remained unchanged (Supplementary Figure 4). The analysis revealed that randomizing the theta cycles while preserving theta phases significantly reduces population synchrony. This finding indicates that spiking in specific theta cycles plays a major role in driving population synchrony.
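
      The control described above can be sketched as follows, for illustration only and not as the code used for the manuscript. It assumes that theta cycle onsets are available as a sorted list of times, that phase is taken as the fractional position within a cycle, and that the new cycle for each spike is drawn uniformly at random; the function name is hypothetical.

```python
import bisect
import random


def randomize_theta_cycles(spike_times, cycle_starts, rng):
    """Move each spike to a randomly chosen theta cycle, keeping its phase.

    cycle_starts holds the sorted onset times of successive cycles;
    cycle k spans [cycle_starts[k], cycle_starts[k + 1]).
    """
    n_cycles = len(cycle_starts) - 1
    shuffled = []
    for t in spike_times:
        k = bisect.bisect_right(cycle_starts, t) - 1
        if k < 0 or k >= n_cycles:
            continue  # spike falls outside the annotated cycles
        # Fractional position within the original cycle (phase / 2*pi).
        frac = (t - cycle_starts[k]) / (cycle_starts[k + 1] - cycle_starts[k])
        j = rng.randrange(n_cycles)
        shuffled.append(cycle_starts[j] + frac * (cycle_starts[j + 1] - cycle_starts[j]))
    return sorted(shuffled)


# Toy example: eight 125-ms cycles; two spikes 5 ms and 15 ms after a cycle onset.
cycle_starts = [125.0 * k for k in range(9)]
shuffled = randomize_theta_cycles([130.0, 140.0], cycle_starts, random.Random(0))
```

      Applying this independently to every cell preserves each cell's theta phase preference while destroying which cycles the cells share, so any drop in synchronous event rate reflects cycle-specific co-activation.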

      Result (line 358)

      “Correlated intracellular theta and theta-phase locking of the synchronous ensembles raise the question of whether population synchrony among CA1PCs extends beyond synchrony derived from these effects. To address this, we analyzed population synchrony after randomizing the theta cycles during which neurons spiked, while keeping their theta phases unchanged. Supplementary Figure 4 illustrates a significant reduction in synchronous event rates following theta cycle randomization. The finding indicates spiking at specific theta cycles plays a major role in driving population synchrony.”

      (17) L944/Fig. S4: Why 71 neurons in AB and only 59 in CD?

      In the previous version, panels A and B included 71 neurons, as we collected data from 71 cells across 5 mice (see the text below).

      Result (line 93)

      “…in total, 71 cells imaged from 5 fields of view in 5 mice; Figure 1B and Supplementary Figure 1A and 1B).”

      In the current version, we only include neurons with significant modulation strengths, reducing the number of cells from 71 to 65 in panel A and from 71 to 54 in panel B.

      Methods (line 828)

      “The significance of the modulation strength was tested by shuffling the spike timings and recalculating the modulation strength a thousand times to generate a distribution based on the shuffled spike timings. The original modulation strength was then compared to the distribution, with significance determined if it exceeded the 95% confidence interval of the shuffled values. Significant modulation strengths were plotted and compared across groups.”

      Figure 5-figure supplement 1 Figure legend (line 1231)

      “Polar plot comparing subVm theta modulation between spikes participating in synchronous ensembles (sync spikes) and spikes not participating in synchronous ensembles (other spikes) during immobility. Each dot represents the averaged modulation of a cell. Cells with modulation strengths that are not significant are excluded in the plot and in the comparison.”

      For panels C and D, we excluded neurons with four or fewer triggering events from the analysis, which reduced the number of cells from 71 to 59 (see the second text paragraph below).

      Method (line 835)

      “We extracted segments of fluorescence traces using a ±300 ms time window centered on the spike timings. To examine variations in fluorescence waveforms triggered by spikes within and outside synchronous events, we categorized the fluorescence traces based on whether the spikes occurred within or outside these events. Subsequently, we performed pairwise comparisons of the fluorescence values from the same neuron, concentrating on spikes occurring during corresponding behavioral states. Neurons with four or fewer triggering events in any of these categories were omitted from the analysis.”
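
      A minimal sketch of this spike-triggered comparison is shown below for illustration; it is not the code used for the manuscript. The half-window is expressed in samples because the sampling rate is not stated here, spikes too close to the trace edges are skipped (an assumption), and the function names are hypothetical. The exclusion of neurons with four or fewer events follows the quoted Methods text.

```python
def spike_triggered_average(trace, spike_idx, half_window):
    """Average trace segments of ±half_window samples centred on each spike.

    Returns (average_segment, number_of_segments_used).
    """
    segments = [
        trace[i - half_window:i + half_window + 1]
        for i in spike_idx
        if i - half_window >= 0 and i + half_window < len(trace)
    ]
    if not segments:
        return None, 0
    n = len(segments)
    average = [sum(column) / n for column in zip(*segments)]
    return average, n


def compare_sync_vs_other(trace, sync_idx, other_idx, half_window, min_events=5):
    """Return (sync_average, other_average), or None if either category has
    four or fewer usable triggering events."""
    sync_avg, n_sync = spike_triggered_average(trace, sync_idx, half_window)
    other_avg, n_other = spike_triggered_average(trace, other_idx, half_window)
    if n_sync < min_events or n_other < min_events:
        return None
    return sync_avg, other_avg


# Toy trace (a ramp) with five spikes in each category.
trace = [float(i) for i in range(100)]
result = compare_sync_vs_other(trace, [10, 20, 30, 40, 50], [15, 25, 35, 45, 55], 3)
too_few = compare_sync_vs_other(trace, [10, 20, 30, 40], [15, 25, 35, 45, 55], 3)
```

      For a ±300 ms window, `half_window` would be 0.3 multiplied by the imaging frame rate in Hz.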

      (1) Mizuseki, K. & Buzsaki, G. Preconfigured, skewed distribution of firing rates in the hippocampus and entorhinal cortex. Cell Rep 4, 1010-1021 (2013). https://doi.org:10.1016/j.celrep.2013.07.039

      (2) McHugh, T. J., Blum, K. I., Tsien, J. Z., Tonegawa, S. & Wilson, M. A. Impaired hippocampal representation of space in CA1-specific NMDAR1 knockout mice. Cell 87, 1339-1349 (1996). https://doi.org:10.1016/s0092-8674(00)81828-0

      (3) Buzsaki, G. et al. Hippocampal network patterns of activity in the mouse. Neuroscience 116, 201-211 (2003). https://doi.org:10.1016/s0306-4522(02)00669-3

      (4) Karlsson, M. P. & Frank, L. M. Network dynamics underlying the formation of sparse, informative representations in the hippocampus. J Neurosci 28, 14271-14281 (2008). https://doi.org:10.1523/JNEUROSCI.4261-08.2008

      (5) Dombeck, D. A., Harvey, C. D., Tian, L., Looger, L. L. & Tank, D. W. Functional imaging of hippocampal place cells at cellular resolution during virtual navigation. Nat Neurosci 13, 1433-1440 (2010). https://doi.org:10.1038/nn.2648

      (6) Malvache, A., Reichinnek, S., Villette, V., Haimerl, C. & Cossart, R. Awake hippocampal reactivations project onto orthogonal neuronal assemblies. Science 353, 1280-1283 (2016). https://doi.org:10.1126/science.aaf3319

      (7) Sheffield, M. E. J., Adoff, M. D. & Dombeck, D. A. Increased Prevalence of Calcium Transients across the Dendritic Arbor during Place Field Formation. Neuron 96, 490-504 e495 (2017). https://doi.org:10.1016/j.neuron.2017.09.029

      (8) Adam, Y. et al. Voltage imaging and optogenetics reveal behaviour-dependent changes in hippocampal dynamics. Nature 569, 413-417 (2019). https://doi.org:10.1038/s41586-019-1166-7

      (9) Go, M. A. et al. Place Cells in Head-Fixed Mice Navigating a Floating Real-World Environment. Front Cell Neurosci 15, 618658 (2021). https://doi.org:10.3389/fncel.2021.618658

      (10) Geiller, T. et al. Local circuit amplification of spatial selectivity in the hippocampus. Nature 601, 105-109 (2022). https://doi.org:10.1038/s41586-021-04169-9

      (11) Rolotti, S. V. et al. Local feedback inhibition tightly controls rapid formation of hippocampal place fields. Neuron 110, 783-794 e786 (2022). https://doi.org:10.1016/j.neuron.2021.12.003

      (12) Pettit, N. L., Yap, E. L., Greenberg, M. E. & Harvey, C. D. Fos ensembles encode and shape stable spatial maps in the hippocampus. Nature 609, 327-334 (2022). https://doi.org:10.1038/s41586-022-05113-1

      (13) Hainmueller, T. & Bartos, M. Parallel emergence of stable and dynamic memory engrams in the hippocampus. Nature 558, 292-296 (2018). https://doi.org:10.1038/s41586-018-0191-2

      (14) Gauthier, J. L. & Tank, D. W. A Dedicated Population for Reward Coding in the Hippocampus. Neuron 99, 179-193 e177 (2018). https://doi.org:10.1016/j.neuron.2018.06.008

      (15) Grosmark, A. D., Sparks, F. T., Davis, M. J. & Losonczy, A. Reactivation predicts the consolidation of unbiased long-term cognitive maps. Nat Neurosci 24, 1574-1585 (2021). https://doi.org:10.1038/s41593-021-00920-7

      (16) Farrell, J. S., Hwaun, E., Dudok, B. & Soltesz, I. Neural and behavioural state switching during hippocampal dentate spikes. Nature 628, 590-595 (2024). https://doi.org:10.1038/s41586-024-07192-8

      (17) McHugh, S. B. et al. Offline hippocampal reactivation during dentate spikes supports flexible memory. Neuron 112, 3768-3781 e3768 (2024). https://doi.org:10.1016/j.neuron.2024.08.022

      (18) Gava, G. P. et al. Organizing the coactivity structure of the hippocampus from robust to flexible memory. Science 385, 1120-1127 (2024). https://doi.org:10.1126/science.adk9611

      (19) Galvao, J. et al. Unexpected low-dose toxicity of the universal solvent DMSO. FASEB J 28, 1317-1330 (2014). https://doi.org:10.1096/fj.13-235440

      (20) Yuan, C. et al. Dimethyl sulfoxide damages mitochondrial integrity and membrane potential in cultured astrocytes. PloS one 9, e107447 (2014). https://doi.org:10.1371/journal.pone.0107447

      (21) Modrzynski, J. J., Christensen, J. H. & Brandt, K. K. Evaluation of dimethyl sulfoxide (DMSO) as a co-solvent for toxicity testing of hydrophobic organic compounds. Ecotoxicology 28, 1136-1141 (2019). https://doi.org:10.1007/s10646-019-02107-0

      (22) Hoyberghs, J. et al. DMSO Concentrations up to 1% are Safe to be Used in the Zebrafish Embryo Developmental Toxicity Assay. Front Toxicol 3, 804033 (2021). https://doi.org:10.3389/ftox.2021.804033

      (23) Abdelfattah, A. S. et al. Sensitivity optimization of a rhodopsin-based fluorescent voltage indicator. Neuron (2023). https://doi.org:10.1016/j.neuron.2023.03.009

      (24) Huang, Y. C. et al. Dynamic assemblies of parvalbumin interneurons in brain oscillations. Neuron 112, 2600-2613 e2605 (2024). https://doi.org:10.1016/j.neuron.2024.05.015

      (25) Abdelfattah, A. S. et al. Bright and photostable chemigenetic indicators for extended in vivo voltage imaging. Science 365, 699-704 (2019). https://doi.org:10.1126/science.aav6416

      (26) Szabo, G. G. et al. Ripple-selective GABAergic projection cells in the hippocampus. Neuron 110, 1959-1977 e1959 (2022). https://doi.org:10.1016/j.neuron.2022.04.002

      (27) Dudok, B. et al. Alternating sources of perisomatic inhibition during behavior. Neuron 109, 997-1012 e1019 (2021). https://doi.org:10.1016/j.neuron.2021.01.003

      (28) Terada, S. et al. Adaptive stimulus selection for consolidation in the hippocampus. Nature 601, 240-244 (2022). https://doi.org:10.1038/s41586-021-04118-6

      (29) Geiller, T. et al. Large-Scale 3D Two-Photon Imaging of Molecularly Identified CA1 Interneuron Dynamics in Behaving Mice. Neuron 108, 968-983 e969 (2020). https://doi.org:10.1016/j.neuron.2020.09.013

      (30) Mizuseki, K. & Buzsaki, G. Theta oscillations decrease spike synchrony in the hippocampus and entorhinal cortex. Philos Trans R Soc Lond B Biol Sci 369, 20120530 (2014). https://doi.org:10.1098/rstb.2012.0530

      (31) Csicsvari, J., Hirase, H., Mamiya, A. & Buzsaki, G. Ensemble patterns of hippocampal CA3-CA1 neurons during sharp wave-associated population events. Neuron 28, 585-594 (2000). https://doi.org:10.1016/s0896-6273(00)00135-5

      (32) Harris, K. D., Csicsvari, J., Hirase, H., Dragoi, G. & Buzsaki, G. Organization of cell assemblies in the hippocampus. Nature 424, 552-556 (2003). https://doi.org:10.1038/nature01834

      (33) Yagi, S., Igata, H., Ikegaya, Y. & Sasaki, T. Awake hippocampal synchronous events are incorporated into offline neuronal reactivation. Cell Rep 42, 112871 (2023). https://doi.org:10.1016/j.celrep.2023.112871

    1. Author Response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      Cheong et al. use a synapse-resolution wiring map of the fruit fly nerve cord to comprehensively investigate circuitry between descending neurons (DNs) from the brain and motor neurons (MNs) that enact different behaviours. These neurons were painstakingly identified, categorised, and linked to existing genetic driver lines; this allows the investigation of circuitry to be informed by the extensive literature on how flies walk, fly, and escape from looming stimuli. New motifs and hypotheses of circuit function were presented. This work will be a lasting resource for those studying nerve cord function.

      Strengths:

      The authors present an impressive amount of work in reconstructing and categorising the neurons in the DN to MN pathways. There is always a strong link between the circuitry identified and what is known in the literature, making this an excellent resource for those interested in connectomics analysis or experimental circuits neuroscience. Because of this, there are many testable hypotheses presented with clear predictions, which I expect will result in many follow-up publications. Most MNs were mapped to the individual muscles that they innervate by linking this connectome to pre-existing light microscopy datasets. When combined with past fly brain connectome datasets (Hemibrain, FAFB) or future ones, there is now a tantalising possibility of following neural pathways from sensory inputs to motor neurons and muscle.

      Weaknesses:

      As with all connectome datasets, the sample size is low, limiting statistical analyses. Readers should keep this in mind, but note that this is the current state-of-the-art. Some figures are weakened by relying too much on depictions of wiring diagrams as evidence of circuit function, similarity between neuropils, etc. without additional quantitative justification.

      We thank the reviewer for their helpful comments. We are excited about the release of this densely reconstructed connectome and its potential to facilitate circuit exploration in the VNC. We note that while statistical methods for analyzing complicated networks such as the connectome are still being developed, the wiring diagrams presented are themselves visualizations of quantitative data. We address specific concerns below.

      Reviewer #2 (Public Review):

      Summary:

      In Cheong et al., the authors analyze a new motor system (ventral nerve cord) connectome of Drosophila. Through proofreading, cross-referencing with another female VNC connectome, they define key features of VNC circuits with a focus on descending neurons (DNs), motor neurons (MNs), and local interneuron circuits. They define DN tracts, MNs for limb and wing control, and their nerves (although their sample suffers for a subset of MNs). They establish connectivity between DNs and MNs (minimal). They perform topological analysis of all VNC neurons including interneurons. They focus specifically on identifying core features of flight circuits (control of wings and halteres), leg control circuits with a focus on walking rather than other limbed behaviors (grooming, reaching, etc.), and intermediate circuits like those for escape (GF). They put these features in the context of what is known or has been posited about these various circuits.

      Strengths:

      Some strengths of the manuscript include the matching of new DN and MN types to light microscopy, including the serial homology of leg motor neurons. This is a valuable contribution that will certainly open up future lines of experimental work.

      Also, the analysis of conserved connectivity patterns within each leg neuromere and interconnecting connectivity patterns between neuromeres will be incredibly valuable. The standard leg connectome is very nice.

      Finally, the finding of different connectivity statistics (degrees of feedback) in different neuropils is quite interesting and will stimulate future work aimed at determining its functional significance.

      We thank the reviewer for their constructive feedback, and are optimistic about the utility of the MANC connectome to the Drosophila neurobiology community in dissecting VNC circuit function.

      Weaknesses:

      First, it seems like quite a limitation that the neurotransmitter predictions were based on training data from a fairly small set of cells, none of which were DNs. It's wonderful that the authors did the experimental work to map DN neurotransmitter identity using FISH, and great that the predictions were overall decently accurate for both ACh and Glu, but unfortunate that they were not accurate for GABA. I hope there are plans to retrain the neurotransmitter predictions using all of this additional ground truth experimental data that the authors collected for DNs, in order to provide more accurate neurotransmitter type predictions across more cell types.

      The reviewer makes an excellent suggestion, and collecting further ground truth data and retraining the neurotransmitter classifier is an ongoing research project. 

      Second, the degradation of many motor neurons is unfortunate. Figure 5 Supplement 1 shows that roughly 50% of the leg motor neurons have significantly compromised connectivity data, whereas, for non-leg motor neurons, few seem to be compromised. If that is the correct interpretation of this figure, perhaps a sentence like this that includes some percentages (~50% of leg MNs, ~5% of other MNs) could be added to the main text so that readers can get a sense of the impact more easily.

      Thank you for this suggestion. We have added a line describing the percentage of leg and other MNs affected (L416-417).

      As well, Figure 5 Supplement 1 caption says "Note that MN groups where all members of the group have reconstruction issues may not be flagged" - could the authors comment on how common they think this is based on manual inspection? If it changes the estimate of the percentage of affected leg motor neurons from 50% to 75% for example, this caveat in the current analysis would need to be addressed more directly. Comparing with FANC motor neurons could perhaps be an alternative/additional approach for estimating the number of motor neurons that are compromised.

      We agree that a direct comparison to another dataset, such as FANC, would aid in identifying reconstruction issues. However, a full analysis is not currently possible as only a minority of FANC neurons have been proofread or annotated. We were able to gain some insights into reconstruction quality by looking at T1 motor neurons, where FANC MN reconstruction is more complete. As reported in the submitted manuscript, we were able to confidently match T1 MNs between FANC and MANC for all but one MN (we are missing one ltm MN on the right side of MANC). While some of the MANC neurons had smaller/less dense arbors than FANC, none of them would have been flagged as having reconstruction issues. However, for FANC, we observe that neurons on the right have less dense arbors and fewer reconstructed synapses than neurons on the left. We have prepared a reviewer figure analyzing the consistency of synapse counts for the T1 (front leg) MNs:

      Author response image 1.

      In these results (MANC on the left, FANC on the right) we compare the number of input synapses on matched motor neurons on the left (LHS) and right hand side (RHS) of each dataset. We see that the MANC distribution is much more symmetric, indicating left and right hand side synapse counts for matched MNs are more similar in MANC. This is likely largely due to the left-right difference in reconstruction completeness in the FANC T1 leg neuropils. The number of synapses per cell type is also more variable in FANC. Overall, we recommend that end users should inspect the morphology and total synapse counts of individual MNs of interest in either dataset as part of any detailed analysis.

      This analysis might benefit from some sort of control for true biological variability in the number of MN synapses between left and right or across segments. I assume the authors chose the threshold of 0.7 because it seemed to do a good job of separating degraded neurons from differences in counts that could just be due to biological variability or reconstruction imperfections, but perhaps there's some way to show this more explicitly. For example, perhaps show how much variability there is in synapse counts across all homologs for one or two specific MN types that are not degraded and are reconstructed extremely well, so any variability in input counts for those neurons is likely to be biologically real. Especially because the identification of serial homologs among motor neurons is a key new contribution of this paper, a more in-depth analysis of similarities and differences in homologous leg MNs across segments could be interesting to the field if the degradation doesn't preclude it.

      We agree that there can be ambiguity in whether variability in synapse counts between left-right homologs of a MN type represents biological variability or technical issues. We have added a comparison of synapse counts of T1 leg MNs in MANC (Left) vs FANC (Right) as noted in the previous point. As the number of connectomes available to us increases, we will have a better idea of how synapse counts of MNs vary within and between animals.

      Fourth, the infomap communities don't seem to be so well controlled/justified. Community detection can be run on any graph - why should I believe that the VNC graph is actually composed of discrete communities? Perhaps this comes from a lack of familiarity with the infomap algorithm, but I imagine most readers will be similarly unfamiliar with it, so more work should be done to demonstrate the degree to which these communities are really communities that connect more within than across communities.

      A priori we expect that there is some degree of functional division between circuits controlling different limbs or motor systems, given current evidence that VNC neuropils and neural hemilineages are relatively specialized in controlling motor output. We have added this explanation to section 2.4.2 (L633-635).

      The Infomap algorithm was chosen out of several directed and undirected community detection methods that we tried, as it defined communities that each had connectivity with narrow and specific motor neuron subclasses. For example, it labeled populations in each of the six leg neuropils as belonging to distinct communities. We think this provides an interesting partitioning of the VNC network that could have biological relevance (which future functional studies should investigate). To the reviewer’s final sentence, we do show intra- vs inter-community connectivity in Fig. 9–supplement 1B. Notably, most communities except several small ones have far more intra-community connectivity than inter-community connectivity. We have added text highlighting this observation (L656-658).

      We do, however, agree with the general point of the reviewer that it is not yet known which community detection methods are ‘optimal’ for use with connectomics data, so we have added further text (L679-683) explaining that community detection in MANC will require further investigation and validation in the future.
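
      To make the intra- versus inter-community comparison concrete, the following is a minimal sketch of how such a split can be tallied from an edge list and a community assignment. The edge format, neuron names, and weights are hypothetical toy values, not MANC data, and this is not the code used to produce Fig. 9–supplement 1B.

```python
from collections import defaultdict


def community_weight_split(edges, community_of):
    """Sum synapse weights within vs. between communities.

    edges: iterable of (pre, post, weight) tuples (hypothetical toy format).
    community_of: dict mapping neuron id -> community label.
    Returns {community: (intra_weight, inter_weight)}, keyed by the
    community of the presynaptic neuron.
    """
    totals = defaultdict(lambda: [0, 0])
    for pre, post, weight in edges:
        same = community_of[pre] == community_of[post]
        totals[community_of[pre]][0 if same else 1] += weight
    return {label: tuple(pair) for label, pair in totals.items()}


# Toy graph with two communities, "A" and "B" (made-up numbers).
edges = [
    ("n1", "n2", 10),
    ("n2", "n1", 8),
    ("n1", "n3", 2),
    ("n3", "n4", 5),
    ("n4", "n1", 1),
]
community_of = {"n1": "A", "n2": "A", "n3": "B", "n4": "B"}
split = community_weight_split(edges, community_of)
```

      A community whose intra-community weight clearly exceeds its inter-community weight, as in both toy communities here, is the pattern described above for most MANC communities.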

      I think the length of this manuscript reduces its potential for impact, as I suspect the reality is that many people won't read through all 140 pages and 21 main figures of (overall excellent) work and analysis.

      We intend this paper to serve not only as a first look into the organization of descending-to-motor circuits, but also as a resource for future investigations in MANC. The provided detail is intended to serve these purposes.

      Reviewer #1 (Recommendations For The Authors):

      General comments:

      I find that there are too many main figures with too much content in them, as well as too much corresponding text. Much of the initial anatomical identification and description could be summarised in fewer main figures, with more supplementary figures if the authors desired. I think there is a lot of great insight in this paper, particularly in the second half, but I am concerned that the extensive detail in the initial sections may challenge reader engagement through to the later sections of the paper. It would also be useful to have a higher level and shorter discussion.

      Reiterating our response from above, we intend this paper to serve not only as a first look into the organization of descending-to-motor circuits, but also as a resource for future investigations in MANC. The provided detail is intended to serve these purposes.

      There is sometimes an over-reliance on wiring diagrams or complex plots as evidence without further quantification. I will mention several examples below, as well as additional suggestions.

      Specific comments:

      In Figure 2E, how are DNs divided into pair vs population type? This was a very interesting idea, particularly in light of "command-like" neurons vs ensembles of DNs controlling behaviour. However, it is not clear how this distinction is made. This concept is referenced throughout the manuscript, so I think a clear quantitative way of identifying "pair" vs "population" identity for each DN would be very useful. And at the very least, a thorough explanation of how it is done in the current manuscript.

      We have added additional text in the Figure 2 legend to point towards Materials and Methods where the DN grouping (pair vs. population) is explained. These groups were formed based on morphology and further split into types based on connectivity, if needed. However, as the connectome represents a static snapshot of connectivity with no functional data, it remains possible that some DNs that were grouped as populations may act functionally as multiple pairs. Future work should continue to update these annotations.

      In Figure 4, there are some inconsistencies between neurotransmitter predictions and experimental FISH data. Have the authors taken into consideration Lacin et al. 2019 (https://elifesciences.org/articles/43701)? Specifically in that paper, it is stated: "We did not find any cases of neurons using more than one neurotransmitter, but found that the acetylcholine specific gene ChAT is transcribed in many glutamatergic and GABAergic neurons, but these transcripts typically do not leave the nucleus and are not translated." I wonder if this might explain some of the inconsistencies between FISH (mRNA detection) and the neurotransmitter predictions (presumably based on indirect protein structures detected via EM imagery), or the presence of so much co-transmission.

      We agree and have added this possible explanation for apparent co-transmission in the text (L394-397).

      In Figure 8B, the authors state: "We found that individual DN and MN subclasses have direct downstream and upstream partners, respectively, that are relatively hemilineage-restricted (Figure 8B)." While the connectivity patterns highlighted are intriguing, further quantitative analysis could help strengthen this point. The connectivity matrices in Figure 8B are linked to activation phenotypes and hemilineages below. But I don't really know how to interpret "relatively hemilineage-restricted" in light of this plot. How does this connectivity pattern for example compare statistically to a randomly selected set of DNs (maintaining the same group size for example)? Would random DN sets be less hemilineage restricted? Similar quantification would be helpful to support this statement "...with high correspondence between the hemilineages connected to individual DN and MN subclasses that are expected to be functionally related."

      "both upper tectulum DNs (DNut) and wing MNs (MNwm) have significant connectivity with hemilineages 6A, 7B, 2A, 19B, 12A and 3B". What is significant connectivity? Looking at the plot in Figure 8B, why is DNut -> 16B not considered significant? Is there a threshold and if so, what is the justification?

      These plots aim to be descriptive rather than drawing hard quantitative thresholds between ‘significant’ and ‘non-significant’ connectivity. We have revised the text to remove the terms ‘restricted’ and ‘significant’ and to clarify our interpretation (L555-559).

      In Figure 9G-H, this is a very interesting finding, but how do we know that the difference is real? Why not do a statistical test to compare the brain and VNC? Or create a null model network with edge swaps, etc. to compare against.

      Statistical comparison between the brain and VNC may be problematic given differences in how these connectomes were generated, as well as missing connectivity (only half the brain is imaged) in the hemibrain connectome. Comparison to a null model is possible and, for the purpose of understanding motif frequency in general, has already been done (see, for example, Lin et al., 2024, Nature). However, a null or shuffled model is not required for comparing motif frequencies between brain and VNC neuropils, which is the point of this particular graph. At present, we simply highlight a qualitative observation that will require future work to investigate.

      Referring to Figure 12 in the main text, "we observe that the power MN upstream network is largely shared among all power MNs and is highly bilateral." Quantifying the fraction of shared upstream neurons from power MNs would make this statement much stronger. Particularly if compared to other non-power MNs. Or potentially using some other network comparison metric.

      This is a good point. We have added cosine similarity to Figure 6 for wing/haltere MNs to show the similarity of inputs across these MNs, and added text discussing the cosine similarity in sections 2.3 (L461-465) and 2.5.3 (L987-988).
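
      As an illustration of the metric, here is a minimal sketch of cosine similarity between the input profiles of two MNs. It assumes each MN is represented as a vector of synapse counts from a fixed, ordered set of upstream neurons; the vectors below are invented for the example and the function is not taken from the manuscript's analysis code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length input vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for neurons with no inputs
    return dot / (norm_u * norm_v)


# Hypothetical synapse counts from four upstream neurons onto three MNs.
mn_a = [10, 0, 5, 2]
mn_b = [20, 0, 10, 4]  # same input pattern as mn_a, twice the synapses
mn_c = [0, 7, 0, 0]    # input from a neuron that mn_a does not receive
```

      Because cosine similarity ignores overall scale, `mn_a` and `mn_b` score as maximally similar despite differing in total synapse count, whereas `mn_a` and `mn_c` share no inputs and score zero.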

      In Figure 13B, "Nearly 50% of these restricted neurons (totalling about 1200 per leg neuropil) have been serially matched across the six neuropils (Figure 13B)". There seems like a disconnect here. In the IR, CR, and BR columns, I see ~2750, ~500, and ~1250 neurons not in a serial set (~4500 total); I see ~1500, ~750, and ~1000 in a serial set (~3250 total). This would mean that ~58% of neurons are not in serial sets, ~42% are in serial sets. Shouldn't the conclusion be the opposite then? That surprisingly most intrinsic neurons are not repeated across leg neuropils. I find this fascinating if true. Perhaps there is some confusion on my part, however.

      We now find that about half of the leg-restricted neurons are serially repeated across the six leg neuropils with similar morphology and connectivity, especially to the downstream leg motor neurons. Since the first submission of this paper, we have identified some additional serial homologues while completing the systematic cell typing described in the accompanying paper, Marin et al. 2024. Figure 13B has now been updated to reflect this. In total, 3998 of 7684 restricted neurons (IR, CR, BR) have been assigned to a serial set or serial type. The sentence in the text has been adjusted to report that 52% of these restricted neurons are in serial sets (L1125).
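
      For transparency, the two percentages discussed in this exchange can be reproduced with the short calculation below. The reviewer's counts are the approximate values they read off the original Figure 13B; the updated counts are those stated above. Variable names are ours.

```python
def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# Reviewer's approximate reading of the original Figure 13B.
reviewer_in_serial = 3250
reviewer_not_in_serial = 4500
reviewer_pct = percent(reviewer_in_serial,
                       reviewer_in_serial + reviewer_not_in_serial)

# Updated counts reported in this response.
updated_pct = percent(3998, 7684)
```

      Rounded to whole numbers, these give roughly 42% for the reviewer's estimate and 52% for the updated annotation.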

      In Figure 13D-E, "the Tect INs are not a homogenous population." Providing additional evidence could strengthen this statement. A connectivity matrix is shown in (D), followed by examples of morphologies in (E). What makes a population homogenous or heterogenous? For example, compared to all possible INs, the Tect IN morphology actually looks quite similar. Are those connectivity matrices in (D) really so different? What would a random selection of neurons look like?

      Our sister paper, Marin et al. (2024), has looked into variation of connectivity across neurons of the entire VNC in much more detail, including clustering methods that include connectivity and other criteria for cell typing. Thus, we have now amended the text to direct the reader to that paper for more detail on variability of connectivity in the Tect INs, which were divided into 5 cell types in Marin et al. (2024) (L1027-1031). In addition, we have replaced our clustering by connectivity in Figure 13 with the cell type clusters from Marin et al. (2024).

      In reference to Figure 13 - Supplement 1, "This standard leg connectome was very similar across legs, but there were small deviations between T1, T2, and T3 legs, as shown in Figure 13-Supplement 1." - what makes a deviation considered small? T1 seems to generally have many more synapses, T2 many fewer, and T3 a mixture depending on the connection. Also, are there lost connections or new connections? A quantification of these issues would be helpful instead of simply depicting the wiring diagrams.

      The connections that differ are likely due to the reconstruction state of leg MNs. We have now stated this in the main text for clarification (L1143-1145). In the leg neuropils, T2 and T3 left hand side MNs have sparser dendritic arbors than the right hand side. Therefore the differences in Figure 13–Supplement 1, which are almost exclusively the connections between the leg restricted neurons onto leg MNs, seem stronger in T1. Future work, bolstered by additional datasets, will undoubtedly reveal further insight into the comparison of circuits for the different legs.

      In Figure 15 - Supplement 2, "We used effective connectivity to identify leg DNs with similar MN connectivity patterns (Figure 15-Supplement 2). Of previously identified DNs, we found that DNg13 showed a highly similar effective connectivity fingerprint."

      How was this similarity calculated? How do we know these particular DNs have similar effective connectivity? The connectivity matrix depicted is quite complex, with both layer and connectivity scores quantified at each location. A principled way of determining similarity would make this statement much stronger.

      The similarity was calculated simply as the Euclidean distance between the effective connectivity matrix for each DN onto the set of MNs. While this is a straightforward comparison mathematically, effective connectivity calculations (first introduced in this context in Li et al., 2020, by our collaborators Larry Abbott and Ashok Litwin-Kumar) have not yet been subject to functional validation. We therefore agree with the reviewer that this should not be overinterpreted at this point. Future functional work should explore hypotheses suggested here and more quantitatively compare the similarity of different DN-MN pathways.
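
      A minimal sketch of this kind of comparison is given below. It assumes each DN's effective connectivity onto the MN set has been flattened into a vector of scores in a fixed MN order; the DN labels and scores are hypothetical, and the code is illustrative rather than the analysis code used for Figure 15-Supplement 2.

```python
import math


def rank_by_euclidean_distance(reference, candidates):
    """Rank candidate DNs by Euclidean distance to a reference DN.

    reference: list of effective connectivity scores onto a fixed MN order.
    candidates: dict mapping DN label -> score list of the same length.
    Returns (label, distance) pairs, smallest distance (most similar) first.
    """
    distances = (
        (label, math.dist(reference, scores))
        for label, scores in candidates.items()
    )
    return sorted(distances, key=lambda item: item[1])


# Hypothetical scores onto three MN groups.
reference = [0.8, 0.1, 0.0]
candidates = {
    "DN_x": [0.7, 0.2, 0.0],  # similar fingerprint
    "DN_y": [0.0, 0.1, 0.9],  # different fingerprint
}
ranking = rank_by_euclidean_distance(reference, candidates)
```

      A smaller distance indicates a more similar fingerprint, so `DN_x` ranks first here. Unlike cosine similarity, Euclidean distance is sensitive to the overall magnitude of the scores.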

      Minor notes:

      In Figure 4E, the circles, squares, and triangles in the figure legend are too small. This is also true to some extent in the plot itself.

      We have increased the size of the symbols in the legend and plot.

      In Figure 8E right, the figure legend and x/y axes are not clear to me. Unfortunately, I'm not sure what the plot is showing because of this.

      The right plot in figure 8E is the number of DN groups each MN group receives input from, at a threshold of 1% input. As this plot is redundant to the left plot, we have decided to remove it.

      In Figure 8I, it would be interesting to see which neurons are directly downstream of DNs. One can't see layers 2/3/4 with the fan-out expansion of neurons and the y-axis scale.

      We have revised the plot to better show cell composition of individual layers.

      In Figure 19E, it would be helpful to also have a standard y-axis.

      The panel has been revised accordingly.

      Reviewer #2 (Recommendations For The Authors):

      General:

      In the Title, you do not mention DNs or MNs but these are a major focus of this study. The title could be more descriptive of the work.

      Per the reviewer’s comments, we have revised the title to “Transforming descending input into motor output: An analysis of the Drosophila Male Adult Nerve Cord connectome”.

      A glossary would be helpful, where all the paper's abbreviations and their definitions are provided in one place. Perhaps a hierarchical structure would help (for at least part of the glossary), so that terms like NTct, WTct, and HTct could be nested underneath UTct, for example.

      We do include a glossary in the sister paper, Marin et al. (2024) and in this paper have included a short glossary in the first Figure. Please refer to these sources for abbreviation reference.

      Introduction:

      Define 'Premotor'.

      We have defined ‘premotor circuits’ to be ‘circuits that directly or indirectly control motor output’ in lines 45-46.

      It might be worthwhile to start with a broader introduction sentence than the current one that focuses just on the fly, in order to emphasize the impact of MANC as the first complete connectome of a motor circuit in any animal with limbs or wings.

      We have revised the introductory paragraph per the reviewer’s suggestions.

      "Muscles in the leg are not innervated uniformly; indeed, in the T1 legs the number of MNs per muscle varies by as much as an order of magnitude" needs to specify the axis of variability more clearly - the authors probably mean variability across muscles in the leg (not variability across individuals for example) but I think the current sentence is a bit ambiguous in that respect.

      We have reworded this sentence to clarify this point (L132-133).

      Line 182 end of paragraph: It would be useful to point out explicitly what makes the MANC project valuable in the context of a similar FANC project - for example, that the MANC connectome is more complete, is a male (so interesting for anyone interested in sexual dimorphism), and gives the field an n=2 for VNC connectome datasets.

      We agree, and have added a sentence describing the benefits of the MANC connectome on L209-212.

      Line 213: A brief phrase or sentence of context could be provided to help unaware readers understand that 42% of synaptic connectivity being captured is in the same sort of range as previous datasets like the hemibrain and likely leads to the vast majority of important cell-cell connections being identified (perhaps cite Buhmann et al 2021 Nature Methods which does an analysis of this), and therefore is a reason to think highly of this dataset's quality and its potential for impact on the field. The sentence at the end of this paragraph doesn't quite do it for me.

      We have added the comparison of MANC synapse completeness to that of the Hemibrain, and revised the ending sentence in L234-237.

      Line 271: Clarify what happened to the remaining 15% of DNs that weren't able to be assigned to a tract. They travelled outside the tracts, or data quality issues prevented assignment, or something else?

      Indeed, some DNs could not be assigned to a tract as they traveled outside of all axon tracts and did not bundle with other DNs. We have added this explanation to the text (L300-301).

      Figure 1:

      The pie chart "DN postsynaptic partners by neuron class" is a bit hard to interpret without having another pie chart next to it showing "Neurons in MANC by neuron class". I know these numbers are written on the schematic but it would be nice to be able to easily tell which cell classes are overrepresented or underrepresented in the set of postsynaptic partners of DNs. e.g. It's obvious that ANs are overrepresented and DNs are underrepresented in the set of postsynaptic partners of DNs, but it would be nice if readers didn't have to do any mental math to figure out if INs or MNs are under/overrepresented.

      We agree and have added a pie chart of the neuron class composition of the entire VNC to Figure 1.

      "35.9% of leg MNs are matched to FANC" Why is this number so low? Because FANC motor neurons were only identified in T1, so the remaining 2/3rds of leg MNs in MANC weren't matched? How successful was matching for the neurons where it was actually attempted?

      For this work, we only matched the T1 neurons across the two datasets. This was both a way of checking that we found everything in these segments and a way of being more sure of muscle target assignments as our collaborators in the FANC dataset had generated extensive light level data to match motor neurons with their target leg muscles. The T2 and T3 MNs were not fully proofread or identified in FANC, precluding further analysis, and leading to the 35.9% matched number. We hope to be able to compare between these datasets more thoroughly in future, and have matched all the premotor leg restricted intrinsic neurons of our standard connectome to FANC. We report on their stereotypy in our latest preprint, Stürner, Brooks et al. 2024.

      Figure 2:

      Figure 2A: Perhaps darken the color of the MTD-III skeletons. Currently, they're so light it's hard to see, and this is one of the most interesting tracts because the claim is that it's a new tract.

      We take the reviewer’s point, however, the color scheme used for the tracts in Figure 2 is coordinated between multiple figures and figure panels, and thus we would prefer to keep it as is. If readers would like to examine DNs of a particular tract, we encourage them to retrieve said DNs using the tract annotations in NeuPrint.

      Figure 2 supplement 1: It's not clear to me what I should be getting out of seeing the right side DNs as well. If you want readers to be able to visually compare the left and right side morphologies and appreciate the high degree of symmetry, you may want to put the left and right side DN panels side-by-side. Perhaps do that (show both the left and right side DNs) for one or two tracts in the main Fig2, and then leave out the remaining panels - or if you want to include the remaining panels, explain more clearly what readers are supposed to learn from seeing them.

      We agree and have now removed Figure 2 supplement 1.

      Figure 2C caption: Instead of "DN primary neurites" I think the authors probably mean "longest single branch of each DN" or something along those lines. I think "primary neurite" is usually used to refer to the thick non-synaptic branch coming out of a neuron's soma, which can't be how it's being used here.

      We agree and have changed all references to ‘primary neurite’ for DNs to ‘longest neurite’.

      Figure 2D+E: Perhaps add an overall % of neurons of each class to the legend. I ask because I would be very interested to know what % of all DNs exist as single pairs versus as populations, and I imagine that could be a number that is quoted a fair amount by others in the field when talking about DNs.

      We agree and have added the overall percentage of each neuron class to the results (L275-276) and Figure 2 legend.

      Figure 3:

      UTct.IntTct neurons are by far the largest class of DNxn neurons, so would it be worth calling these the DNxt class (DN projecting to some combination of tectulum neuropils), to mirror the DNxl class? I would vote for doing that.

      Thanks for the suggestion. However, the subclass naming scheme for DNs had been coordinated between multiple groups of people working on MANC reconstruction and annotation. As making changes to subclasses will impact many analyses that have already been completed for existing work, we will refrain from doing so.

      Figure 3G feels a bit out of place in this figure and under-explained

      We have clarified in the text our citations to Figure 3G to better explain our interpretation of this data.

      Figure 4

      "DNp20 has few vesicles and may be electrically coupled": If I'm correct that DNp20 is also known as DNOVS1 and is the second largest diameter axon in the neck after the giant fiber, then yes, Suver et al. 2016 J Neurosci show that this DN is gap junction coupled to neck motor neurons (see their Fig 2F). This neuron (along with the giant fiber) is enough of an outlier that it might be more representative to show a different, more canonical DN that has a low prediction probability.

      The reviewer is right that DNp20 is also known as DNOVS1 with known gap junction coupling. We now clarify in the text (L366) how we think that could lead to a lower neurotransmitter prediction score, which is what we were trying to illustrate.

      Figure 4E: It looks like only a single DN has more inputs (~11000) than outputs (~9000), is that right? It could be interesting to dedicate some panels and text to the connectivity profile of that one unique neuron.

      Yes, that is correct: there is just one pair of DNs, DNxn166, that receives more input than it gives output (the two triangles lie on top of each other). We think that the other DN pair in that same box (more variable in total synapse number, so the triangles are further apart) also receives an unusually high amount of input versus output. The morphologies of these two types are shown in Figure 4F, and they both have fine processes that look more like dendrites, especially when compared to other DNs such as the ones in 4G. Unfortunately, neither of these two types has been matched to light microscopy images, so we cannot say if they have the same type of morphology in the brain, or further explore their brain connectivity, at this time.

      Figure 4E: "black rectangle ... gray rectangle" don't look different shades to me. It's obvious which is which based on where they are in the graph but if you want to color code this, pick more separate colors. Or code it with something other than colors.

      We have made the rectangle in Figure 4E a lighter shade of grey and added labels to refer to the panels D, F and G. The figure legend now also describes more clearly that we are plotting every DN as a single shape and exactly how many DN types are included in those rectangles to avoid confusion.

      Figure 5:

      "subclass is their two-letter muscle anatomical category" should be explained better, I'm not sure what "muscle anatomical category" means.

      We have changed the wording in the Figure 5 legend to better clarify that MN subclasses are the broad muscle category that they innervate (e.g. legs, wings).

      Figure 7:

      Leg MN identification and serial homology.

      Why are there no tarsus reductor (tarm1 and tarm2) motor neurons? Do we not know their anatomy from light microscopy well enough, perhaps? Were these MNs identified in FANC? Is it reasonable to guess that the remaining small number of unidentified T1 leg motor neurons in MANC would control these muscles? I think Marta Moita's lab has some ongoing projects on these muscles (see Twitter), so if more LM data is needed perhaps it will come from them.

      We now know that the small number of unidentified T1 leg motor neurons (a T1 pair with a serial T2 pair, serial set 17664) are not in fact MNs. A new and unpublished dataset (Janelia whole male CNS volume, the optic lobe from which has been published as Nern et al., 2025) shows they have axons within the VNC. The MN annotation for these neurons has been removed and they now have the type name INXXX471. Thus, we have no T1 leg MNs without a muscle target annotated. Our muscle target annotation comes from matching to the FANC dataset that has also not annotated tarsus reductor MNs. We suspect that the tarsus reductor MNs are hard to distinguish from the tarsus depressor MNs of which there are 5 per side and segment.

      It seems there are a few more leg motor neurons in MANC vs FANC. Any indication of which muscles they control?

      See above.

      Figure 7E: A qualitative comparison between the cosine similarity results here and from FANC could be useful. What generally is the same versus different? Any indication of male/female differences?

      We observe no differences in the cosine similarity of T1 leg MNs between MANC and FANC and only very minor differences between T1, T2 and T3, as shown in Figure 7. In our most recent work, now on bioRxiv (Stürner, Brooks et al., 2024), we were able to find all intrinsic leg serial sets that we included in our standard leg premotor circuit here in the FANC dataset. We do not see any differences between them in terms of morphology, and while we have several cases in which we are still missing 1 of the 6 neurons in a serial set in FANC, we see similar connectivity when comparing small circuits. We have also found almost all neurons interconnecting the legs, with some very interesting exceptions, mainly coming from the abdomen, that we believe are male specific. These male-specific neurons can also be found in this preprint (Stürner, Brooks et al., 2024).

      Figure 8

      Figure 8A: Why are ~1/3rd of the wing and leg motor neurons considered populations instead of pairs? I thought essentially all wing and leg motor neurons have unique morphologies.

      Pair vs populations are assigned based on MN morphology and connectivity. For the wing MNs, many sets of DVMns and DLMns have near-identical morphology and connectivity, are not easily distinguishable in the VNC and are categorized as a ‘population’. For the leg MNs, there are ‘true’ population MN types that provide multiple innervation of the same muscle.

      The text states "up to a maximum of 20% [traversal probability] (corresponding to a synapse input fraction of 1)" but I interpret the bottom of Figure 8G to have flipped values, where a synapse input fraction of 0.2 yields a traversal probability of 1. Is there a mistake here or have I misunderstood?

      Thank you for pointing this discrepancy out. The text description was indeed flipped, and we have corrected this error.

      Caption for J says "Layers without neurons are omitted". How is it possible to have a layer without neurons?? Something about how the traversal is done doesn't seem to be explained clearly enough. If it's really possible to have a layer without neurons, I think the approach might need to be revisited as this seems quite strange.

      Here, ‘layer’ should be viewed as a nonlinear measure of indirect connectivity combining path length and synaptic weights. Layers without neurons are possible due to the details of the calculation–layer position is assigned probabilistically by the downstream synapse connectivity of the source neurons, and the probability is scaled up to 1 at an input synapse fraction of 0.2. Neuron-to-neuron connectivity of an input synapse fraction of >=0.2 is very rare in the VNC connectome and thus neurons strictly assigned to layer 2 downstream of each DN type are similarly rare. We have updated the figure legend for figure 8 to better explain this.
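
      The scaling described above can be sketched as follows. This is an illustrative reading of the response, not the authors' implementation: the function names are hypothetical, and the probability is assumed to rise linearly from 0 to 1 as the input synapse fraction goes from 0 to 0.2.

```python
import random

# Input synapse fraction at which traversal becomes certain (from the response).
SATURATION_FRACTION = 0.2


def traversal_probability(input_fraction, saturation=SATURATION_FRACTION):
    """Probability of stepping to a downstream neuron in one traversal step.

    Assumes linear scaling up to `saturation`, capped at 1.
    """
    if not 0.0 <= input_fraction <= 1.0:
        raise ValueError("input_fraction must lie in [0, 1]")
    return min(1.0, input_fraction / saturation)


def reached_in_one_step(input_fraction, rng):
    """Draw once: is the downstream neuron reached directly from the source?"""
    return rng.random() < traversal_probability(input_fraction)


rng = random.Random(0)
for fraction in (0.01, 0.05, 0.2, 0.5):
    print(fraction, traversal_probability(fraction), reached_in_one_step(fraction, rng))
```

      Under this assumption, only a neuron receiving at least 20% of its input from the source is reached with certainty in the first step. Weaker partners are reached in only a fraction of runs, so their average layer position falls between integer values, which is consistent with some integer layers containing no neurons.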

      Section 2.6

      "flies have been shown to walk normally without proprioceptive feedback, suggesting that inter- and intra-leg coordination is not strictly dependent on sensory feedback loops from the legs" is quite a drastic overinterpretation of that paper's results. The ablation there was not complete (some subtypes of sensory neurons were not perturbed), and the perturbed flies certainly walked with some defects. This statement certainly should be removed or significantly softened.

      Thank you for pointing this detail out. The term ‘normally’ has been removed from this sentence to soften the statement.

      Figure 13, Standard leg connectome

      Unfortunately, the motor neurons controlling the tarsus could not be included here, I suppose due to the difficulty in identifying the T2 and T3 homologs for these motor neurons. This should be mentioned in the text. This version of the standard leg connectome is without a doubt still an incredibly valuable discovery, but readers should be made aware that this version of the standard leg connectome does in fact lack the motor neurons for one joint.

      The MNs controlling the tarsus could not be matched with high confidence. We have added a sentence pointing this out when the leg circuit is introduced (L1141-1142).

      The focus here is on locomotion in the absence of other behaviors, whereas the legs are also responsible for grooming, reaching, boxing, etc. How should we consider the leg connectome in light of this?

      This is a very good point, and we have indeed found known grooming neurons that target our leg premotor circuit (L1158-1161). We’ve now added this observation to the Discussion (L1949-1951).

      Minor points

      L84 - re: Descending neurons work together - cite Braun et al., bioRxiv 2023; cite Yang HH bioRxiv 2023 .

      We agree that these papers are relevant to the function of DNs in combination, and have added them to the introduction (L83-84, 86-87).

      L193 - "intrepid" is overly florid language; similar for L1507 "enigmatic".

      We have replaced these words with suitable synonyms.

      L273 - The acronym "ITD" is not explained. Please check all other acronyms. Related, it would be good to include a Table or Box with all acronyms for the reader.

      We have added the full name of the ITD to the text. A glossary is available in Figure 1, and a full glossary of MANC terms is available in Table 1 of our sister paper, Marin et al. 2024.

      -L514, you state that hemilineages 6A and 6B unexpectedly produce uncoordinated leg movements (flight-related was expected). However, Harris didn't study animals in tethered flight but headless on the ground.

      The experimental setup of Harris et al. was capable of assessing flight-like motor output even if not true flight, as seen in the predominantly wing movement phenotypes of activating hemilineages 7B, 11A/B and 2A. We now also note that hemilineage annotation in Marin et al., 2024, shows that the 6B hemilineage has some projections into the leg neuropils, in support of a leg motor role in addition to an upper tectular role (L570-571).

      L1425 - "the TTM" is repeated twice.

      This sentence addresses both the TTM and its MN (TTMn). We have revised this sentence to improve clarity by expanding the full name of TTM in that paragraph and leaving TTMn abbreviated.

      L1728 - Ascending neuron projections to the brain - cite Chen et al., Nat Neuro 2023.

      We agree that Chen et al. 2023 is relevant to the discussion of AN function, and have added this citation (L1836-1838).

      L1817, It is a good idea to compare with previous predictions for circuit control. But these originate from non-Drosophila work as well. Please cite and consider the original models from Buschges, Cruse, Holmes, and others.

      Thanks for the suggestion. We now cite the non-Drosophila literature as well. (L1971)

      L1827, how precisely should these "theories" be updated? Be explicit.

      We summarize in the sentences before what is different in comparison to one of the suggested models. We have now additionally added examples to the sentence (L1942-1945) to suggest that theoretical leg circuits need to account for the posterior-to-anterior as well as anterior-to-posterior connections between leg neuropils, as well as relative lack of connectivity between the left and right mesothoracic leg neuropils.

      L1831, include a discussion about another alternative which is through mechanical coupling and sensory feedback.

      We agree that leg sensory input likely contributes to leg locomotor circuits. We have added the following sentence to point out that annotations of sensory neurons in MANC are available through work in a companion paper (Marin et al. 2024), and future work is necessary to examine the contribution of sensory input to leg motor circuits (L1954-1956).

      Methods

      https://flyconnectome.github.io/malevnc/ link doesn't work.

      We have updated the link.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      The study presents valuable findings on the role of RIPK1 in maintaining liver homeostasis under metabolic stress. Strengths include the intriguing findings that RIPK1 deficiency sensitizes the liver to acute liver injury and apoptosis, but because the conclusions require additional experimental support, the evidence is incomplete.

      We are truly grateful to the reviewers and the editor for the time and effort spent in reviewing our manuscript. We highly appreciate the thorough and constructive comments, which have greatly improved our manuscript. We have conducted new experiments to address the reviewers' concerns. We also carefully checked and revised our manuscript according to the reviewers' constructive suggestions. We hope we have adequately addressed all the concerns. In the revised manuscript, changes are highlighted in yellow. Please find the detailed point-by-point responses below.

      Public Reviews:

      Reviewer #1 (Public Review):

      This study presents an investigation into the physiological functions of RIPK1 within the context of liver physiology, particularly during short-term fasting. Through the use of hepatocyte-specific Ripk1-deficient mice (Ripk1Δhep), the authors embarked on an examination of the consequences of Ripk1 deficiency in hepatocytes under fasting conditions. They discovered that the absence of RIPK1 sensitized the liver to acute injury and hepatocyte apoptosis during fasting, a finding of significant interest given the crucial role of the liver in metabolic adaptation. Employing a combination of transcriptomic profiling and single-cell RNA sequencing techniques, the authors uncovered intricate molecular mechanisms underlying the exacerbated proinflammatory response observed in Ripk1Δhep mice during fasting. While the investigation offers valuable insights into the consequences of Ripk1 deficiency in hepatocytes during fasting conditions, there appears to be a primarily descriptive nature to the study with a lack of clear connection between the experiments. Thus, a stronger focus is warranted, particularly on understanding the dialogue between hepatocytes and macrophages. Moreover, the data would benefit from reinforcement through additional experiments such as Western blotting, flow cytometry, and rescue experiments, which would offer a more quantitative aspect to the findings. By incorporating these enhancements, the study could achieve a more comprehensive understanding of the underlying mechanisms and ultimately strengthen the overall impact of the research.

      We thank the reviewer for the encouraging comments and helpful suggestions. We agree with the reviewer that additional experiments could reinforce our findings. Therefore, we conducted additional experiments including flow cytometry, western blotting, and using kinase-dead mutant mice to further investigate the underlying mechanisms. We carefully addressed every comment by the reviewer as indicated below.

      Detailed major concerns:

      (1) Related to Figure 1.

      It is imperative to ensure consistency in the number of animals analyzed across the different graphs. The current resolution of the images appears to be low, resulting in unsharp visuals that hinder the interpretation of data beyond the presence of "white dots". To address this issue, it is recommended to enhance the resolution of the images and consider incorporating zoom-in features to facilitate a clearer visualization of the observed differences. Moreover, it would be beneficial to include a complete WB analysis for the cell death pathways analyzed. These adjustments will significantly improve the clarity and interpretability of Figure 1.

      Thanks very much for the constructive advice. We carefully checked the number of animals and made sure that it was consistent across the different figures. We also updated Figure 1 to incorporate zoom-in panels and greatly improved the resolution of the figures. Western blot analyses are also included in updated Supplementary Figure 1.

      (2) Related to Figure 2.

      It is essential to ensure consistency in the number of animals analyzed across the different graphs, as indicated by n=6 in the figure legend (similar to Figure 1). Additionally, it is crucial to distinguish between male and female subjects in the dot plots to assess any potential gender-based differences, which should be consistent throughout the paper. To achieve this, the dots plot should be harmonized to clearly differentiate between males and females and investigate if there are any disparities between the genders. Moreover, it is imperative to correlate hepatic inflammation with the activation of Kupffer cells, infiltrating monocytes, and/or hepatic stellate cells (HSCs). Therefore, conducting flow cytometry would be instrumental in achieving this correlation. Additionally, the staining for Ki67 appears to be non-specific, showing a granular pattern reminiscent of bile crystals rather than the expected nuclear staining of hepatocytes or immune cells. It is crucial to ensure specific staining for Ki67, and conducting in vitro experiments on primary hepatocytes could further elucidate the proliferation process. These experiments are relatively straightforward to implement and would provide valuable insights into the mechanisms underlying hepatic inflammation and proliferation.

      Thanks very much for the helpful advice. First, we corrected the number of animals analyzed in the different graphs and made sure that the numbers listed in the figure legends are consistent with the graphs in all figures. Second, to distinguish the results between male and female mice, blue represents male mice, pink represents female mice, and green represents RIPK1 kinase-inactive mice. The majority of results were obtained from male mice, and our results indicated no difference between male and female mice.

      The percentages of immune cell subpopulations isolated from mouse liver tissue were determined. The results were consistent with the single-cell analysis in that a greater number of macrophages were recruited into the liver tissue of Ripk1<sup>Δhep</sup> mice upon 12-hour fasting (updated Figure 4F&G).

      To confirm the Ki67 results, we first measured the transcriptional expression of Ki67 using real-time qPCR, and the results were consistent with the protein expression measured by immunohistochemical analysis. The percentage of Ki67<sup>+</sup> cells among liver cells was also determined, and there were significantly more Ki67<sup>+</sup> cells in Ripk1<sup>Δhep</sup> mouse liver than in WT control mouse liver upon 12-hour fasting. Taken together, our transcriptional, immunohistochemical, and flow cytometry data indicated that Ki67 expression was higher in Ripk1<sup>Δhep</sup> mice than in Ripk1<sup>fl/fl</sup> mice (updated Figure 2).

      (3) Related to Figure 3 & related to Figure 4.

      The immunofluorescence data presented are not entirely convincing and are insufficient to conclusively demonstrate the recruitment of monocytes. Previous suggestions for flow cytometry studies remain pertinent and are indeed necessary to bolster the robustness of the data and conclusions. Conducting flow cytometry analyses would provide more accurate and quantitative assessments of monocyte recruitment, ensuring the reliability of the findings and strengthening the overall conclusions of the study. Regarding the single-cell RNA sequencing analysis presented in the manuscript, it's worth questioning its relevance and depth of information provided. While it successfully identifies a quantitative difference in the cellular composition of the liver between control and knockout mice, it may fall short in elucidating the intricate interactions between different cell populations, which are crucial for understanding the underlying mechanisms of hepatic inflammation. Therefore, I propose considering alternative bioinformatic analyses, such as CellPhone-CellChat, which could potentially provide a more comprehensive understanding of the cellular dynamics and interactions within the liver microenvironment. By examining the dialogue between different cell clusters, these analyses could offer deeper insights into the functional consequences of Ripk1 deficiency in hepatocytes and its impact on hepatic inflammation during fasting.

      Thanks very much for the constructive suggestion. We agree with the reviewer that flow cytometry analyses would provide accurate and quantitative assessments of monocyte recruitment, ensuring the reliability of the findings. Following this advice, both WT and Ripk1<sup>Δhep</sup> mice were fasted for 12 hours, and single hepatic cells were then isolated and analyzed by flow cytometry. As indicated in updated Figure 4F&G, the percentage of F4/80<sup>+</sup>CD11b<sup>+</sup> cells was significantly higher in Ripk1<sup>Δhep</sup> mice than in WT control mice, confirming that more monocytes were recruited into the liver.

      Additionally, we performed CellChat analysis on the single-cell transcriptomic data. As shown in updated Figures 4H-J, both the number of ligand-receptor pairs and the interaction strength among the eight cell types were significantly increased in Ripk1<sup>Δhep</sup> mice, particularly the interactions between macrophages and other cell types. Network analysis indicated that inflammation and proliferation signals were amplified in Ripk1<sup>Δhep</sup> mice. Consistent with the bulk RNA sequencing data, SAA signaling was upregulated in the hepatocytes of Ripk1<sup>Δhep</sup> mice (updated Figure 4K). SAA has been found to play a role in regulating immune responses and tumor development. Based on these findings, we speculate that fasting-induced liver injury in RIPK1 knockout mice may exacerbate the inflammatory response in liver tissue through enhanced SAA signaling. The above data analysis and interpretation were included in the updated Figure 4&S4 and line 421 - 443.

      (4) Related to Figure 5.

      What additional insights do the data from Figure 5 provide compared to the study published in Nat Comms, which demonstrated that RIPK1 regulates starvation resistance by modulating aspartate catabolism (PMID: 34686667)?

      Thank you very much for your constructive suggestion. As noted by the reviewer, this study (PMID: 34686667) primarily focuses on metabolomic analyses of Ripk1<sup>-/-</sup> neonatal mouse brain tissue and Ripk1<sup>-/-</sup> MEF cells. The authors propose that Ripk1 regulates starvation resistance by modulating aspartate catabolism.

      In our study, the global metabolic changes induced by fasting were monitored. Fasting-induced lipolysis in peripheral adipose tissue leads to hepatic lipid accumulation, and excessive deposition of free fatty acids has been shown to induce endoplasmic reticulum (ER) stress in the liver. Data from Figure 5 demonstrate that administering the ER stress inhibitor 4-PBA effectively mitigated fasting-induced liver injury and inflammatory responses in Ripk1<sup>Δhep</sup> mice. Our findings suggest that ER stress plays a critical role in fasting-induced liver injury and inflammation in Ripk1<sup>Δhep</sup> mice.

      (5) Related to Figure 6.

      The data presented in Figure 7 are complementary and do not introduce new mechanistic insights.

      Thank you very much for your insightful suggestion. As you mentioned, the AAV-TBG-Cre-mediated liver-specific RIPK1 knockout mice offer complementary validation of the results obtained from Ripk1<sup>Δhep</sup> mice. Moreover, TBG is a promoter that is exclusively expressed in mature hepatocytes, while the ALB promoter is active not only in mature hepatocytes but also in precursor cells and cholangiocytes. Therefore, we think that the inclusion of AAV-TBG-Cre further strengthens our finding that RIPK1 in hepatocytes is responsible for fasting-induced liver injury and inflammatory responses.

      (6) Related to Figure 7.

      The data from Figure 7 suggest that RIPK1 in hepatocytes is responsible for the observed damage. However, it has been previously demonstrated that inhibition of RIPK1 activity in macrophages protects against the development of MASLD (PMID: 33208891). One possible explanation for these findings could be that the overreaction of macrophages to fasting, coupled with the absence of RIPK1 in hepatocytes (an indirect effect), contributes to the observed damage. Considering this, complementing hepatocytes with a kinase-dead version of RIPK1 could be a valuable approach to further refine the molecular aspect of the study. This would allow for a more precise investigation into the specific role of RIPK1's scaffolding or kinase function in response to starvation in hepatocytes. Such experiments could provide additional insights into the mechanisms underlying the observed effects and help delineate the contributions of RIPK1 in different cell types to metabolic stress responses.

      Thank you very much for the constructive suggestion. We fully agree with the reviewer that employing RIPK1 kinase-inactive mutant mice would allow the specific roles of RIPK1's scaffolding and kinase functions in hepatocyte responses to starvation to be investigated precisely. In accordance with this advice, we established a 12-hour fasting model using Ripk1<sup>WT/WT</sup> and Ripk1<sup>K45A/K45A</sup> mice, a previously established line confirmed to lack RIPK1 kinase activity. As demonstrated in updated Supplementary Figure 2, these mice did not show significant liver damage or inflammatory responses after 12 hours of fasting. These findings suggest that the liver damage and inflammatory response induced by fasting in Ripk1<sup>Δhep</sup> mice may not be attributable to the kinase activity of RIPK1.

      Reviewer #2 (Public Review):

      Summary:

      Zhang et al. analyzed the functional role of hepatocyte RIPK1 during metabolic stress, particularly its scaffold function rather than kinase function. They show that Ripk1 knockout sensitizes the liver to cell death and inflammation in response to short-term fasting, a condition that would not induce obvious abnormality in wild-type mice.

      Strengths:

      The findings are based on a knockout mouse model and supported by bulk RNA-seq and scRNA-seq. The work consolidates the complex role of RIPK1 in metabolic stress.

      Weaknesses:

      However, the findings are not novel enough because the pro-survival role of RIPK1 scaffold is well-established and several similar pieces of research already exist. Moreover, the mechanism is not very clear and needs additional experiments.

      We thank the reviewer for the encouraging comments and helpful suggestions. Here we conducted additional experiments including flow cytometry, western blotting, and using kinase-dead mutant mice to further investigate the underlying mechanisms. We carefully addressed every comment by the reviewer as indicated below.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (7) I recommend that the authors consider reassessing their results, particularly with regards to elucidating the dialogue between macrophages and hepatocytes, as this could further strengthen the study's conclusions.

      Thank you very much for your constructive suggestion. We conducted additional experiments, including flow cytometry and western blotting, to reassess our findings. Furthermore, to clarify the interactions between cells, we employed CellChat for a more in-depth analysis of the single-cell sequencing results. In the revised manuscript, changes are highlighted in yellow. In this study, we demonstrated that the specific deletion of RIPK1 in hepatocytes exacerbated the liver's vulnerability to metabolic disturbances, such as short-term fasting and high-fat diet feeding, resulting in increased liver damage, apoptosis, inflammation, and compensatory proliferation. The data indicate that fasting-induced liver injury in mice lacking RIPK1 in hepatic parenchymal cells may exacerbate the inflammatory response in liver tissue through enhanced SAA signaling. In summary, we revealed a novel physiological role of RIPK1 as a scaffold in maintaining liver homeostasis during fasting and other nutritional disturbances.

      (8) It would be beneficial for the authors to address the minor weaknesses identified in the study, such as ensuring consistency in the number of animals analyzed across different graphs and enhancing the resolution of images to improve data clarity.

      Thank you for the suggestion. In the revised manuscript, we have addressed these minor weaknesses, and we checked the consistency in the number of animals in different graphs, as well as enhanced the resolution of all images.

      (9) I encourage the authors to incorporate additional experiments, such as Western blotting and flow cytometry, to provide a more quantitative assessment of the observed effects and enhance the robustness of their conclusions.

      Thank you for your insightful suggestion. We completely agree with the reviewer that incorporating flow cytometry and western blotting would strengthen the robustness of our conclusions. We conducted flow cytometry and western blot analyses, and the results are shown in updated Supplementary Figure 1, Figure 2, Figure 4 and Supplementary Figure 4.

      (10) Furthermore, the authors may consider conducting complementary experiments, such as rescue experiments involving complementing hepatocytes with a kinase-dead version of RIPK1, to further refine the molecular aspect of the study and elucidate the specific roles of RIPK1's scaffolding or kinase function in response to starvation.

      Thank you very much for your constructive suggestion. As shown in updated Supplementary Figure 2, we conducted fasting experiments using RIPK1 kinase-dead mice. These findings suggest that the liver damage and inflammatory response induced by fasting in Ripk1<sup>Δhep</sup> mice may not be attributable to the kinase activity of RIPK1.

      Reviewer #2 (Recommendations For The Authors):

      Major:

      (11) What is the upstream signal for RIPK1? The study investigated the change induced by short-term fasting, which is a metabolic stress. Although RIPK1 knockout promotes cell death and inflammation, how it is involved in this condition is unclear. RIPK1 has never been reported as a metabolic sensor, and its function is typically downstream of TNFR1 as well as other death receptors such as Fas, TRAIL-R1, and TRAIL-R2. Thus, it is probable that metabolic stress induces the expression and secretion of some ligand of the above receptors. Although TNFα expression is upregulated at both the mRNA and protein levels, it cannot be concluded that TNFα is the upstream signal for RIPK1, because a difference in expression does not always imply a functional role. In addition, a recent study, which is also reference 33, reports that knockout of TNFR1/2 does not protect against 18 h liver ischemia, a condition that is similar to the present study. Therefore, the link between the metabolic fluctuation and RIPK1 function is elusive and should be addressed. The expression difference analysis should be extended to other relevant ligands. A functional study using neutralizing antibodies in RIPK1ΔHep mice is encouraged. At the least, this should be discussed in the discussion section.

      Thank you very much for your insightful comments. The upstream signals of RIPK1 remain a significant area of scientific inquiry. Fasting, as one of the main causes of metabolic stress, is known to trigger a series of physiological changes, including but not limited to decreased blood glucose levels, hepatic glycogen depletion, increased production of hepatic glucose and ketone bodies, adipose tissue lipolysis, and the influx and accumulation of free fatty acids in the liver. It is well established that the elevated lipid influx and hepatic accumulation during fasting may cause lipotoxic stress in the liver. To investigate whether the elevated free fatty acid influx might act as the signal that induces cytotoxicity, we isolated primary hepatocytes but observed that a significant number of cells underwent spontaneous death during the isolation and perfusion processes. To address this question, we instead utilized CRISPR-Cas9 technology to generate Ripk1<sup>-/-</sup> AML12 cells, as illustrated in Author response image 1A.

      To mimic hepatic lipid accumulation induced by short-term fasting, we treated the cells with palmitic acid (PA) or oleic acid (OA) for 12 hours in vitro. Our results indicated a significant increase in cell death among Ripk1<sup>-/-</sup> AML12 cells after PA treatment compared to WT control cells (Author response image 1B). As shown in Author response image 1C, we also observed a marked increase in caspase-3 activity in Ripk1<sup>-/-</sup> AML12 cells following PA treatment.

      Collectively, our results highlight the crucial role of RIPK1 in hepatocytes in maintaining the liver's adaptive capacity to counteract lipotoxicity induced by metabolic stress. These in vitro results were not included in the manuscript; however, we addressed them in the discussion section (line 593 - 597). If the reviewer so suggests, we would be glad to incorporate them into our manuscript.

      Author response image 1.

      (12) What is the exact relationship between ER stress and RIPK1? In Figure 5A and Figure 6B, Ripk1 knockout only slightly promotes the expression of ER stress markers. The evidence of RIPK1 leading to ER stress is limited in the literature and poorly supported in this study. Also in reference 33, the hypothesis is proposed that ER stress leads to death receptor upregulation and activation, which induces RIPK1 activation. Although the ER stress inhibitor showed good efficacy in rescue experiments, it could not determine whether RIPK1 deficiency leads to ER stress-associated phenotype or ER stress leads to death receptor activation and RIPK1 deficiency-associated phenotype. If RIPK1 deficiency leads to ER stress, the possible mechanism should be investigated.

      Thank you very much for your insightful comments. As the reviewer noted, the specific relationship between endoplasmic reticulum (ER) stress and RIPK1 remains unclear. However, our data, along with findings from other studies (Piccolis M et al., Mol Cell. 2019; Geng Y et al., Hepatol Int. 2021), suggest that fasting-induced lipolysis in peripheral adipose tissue leads to hepatic lipid accumulation. Additionally, excessive deposition of free fatty acids has been shown to induce ER stress in the liver. One possible explanation is that ER stress may trigger the upregulation and activation of death receptors, and the scaffold function of RIPK1 may play a protective checkpoint role in this process. ER stress during fasting might therefore act upstream of RIPK1. This could help explain why short-term fasting results in liver damage in Ripk1<sup>Δhep</sup> mice while control mice remain unaffected, and why inhibition of ER stress using 4-PBA can effectively alleviate this damage.

      Minor:  

      (13) The study starts directly from functional experiments. However, it should be firstly explored whether RIPK1 expression or activation is modulated in wild-type mice.

      Thank you very much for your insightful observation. Previous studies showed that RIPK1 deficiency in hepatocytes does not impact the growth and development of mice, indicating that RIPK1 is dispensable for proper liver development and homeostasis (Filliol A et al., Cell Death Dis. 2016). Furthermore, we did not observe any changes in RIPK1 levels in wild-type mice induced by fasting across different experimental batches. In our bulk transcriptomic analysis, the expression of RIPK1 was not changed before and after 12-hour fasting in Ripk1<sup>fl/fl</sup> mice. Therefore, we focused our attention on the function of RIPK1 and started our study directly with functional experiments.

      (14) Knockout of RIPK1 deprived both its scaffold function and kinase function. It is encouraged to explore whether blocking RIPK1 kinase activity influences the outcome of metabolic stress.

      Thank you for your insightful suggestion. To investigate the role of RIPK1 kinase activity in response to metabolic stress, we added fasting experiments using RIPK1 kinase-inactive mice (updated Supplementary Figure 2). In these mice, blocking RIPK1 kinase activity did not affect the outcome of metabolic stress.

      (15) In Figure 1, the number of TUNEL+ cells is about 2 times of c-casp3. What is the possible reason?

      Thank you for your careful reading. Indeed, the number of TUNEL<sup>+</sup> cells in Figure 1 is about twice that of cleaved-caspase-3<sup>+</sup> cells. There are two possible reasons. First, we speculate that this discrepancy may be attributed to the higher sensitivity of the TUNEL assay compared to the cleaved-caspase-3 assay. Second, the TUNEL assay detects DNA fragmentation, indicating that these cells are in a pre-apoptotic state or poised to undergo apoptosis. In contrast, cleaved-caspase-3 staining specifically identifies cells that have already committed to the apoptotic pathway. The TUNEL assay can detect all types of apoptosis, and the mechanisms of apoptosis may involve more than just cleaved-caspase-3.

      (16) Infiltrated innate immune cells could lead to hepatocyte death. Is the hepatocyte death in this study partially caused by immune cells?

      Many thanks for the advice. As outlined in the response to the 11th comment from the second reviewer, our findings indicate that metabolic stress induced by short-term fasting is the primary cause of hepatocyte death. Additionally, we demonstrate that infiltrated innate immune cells may also play a partial role in hepatocyte death through subsequent cascade reactions.

      (17) Could the in vivo results be consolidated by in vitro experiments on primary mouse hepatocytes? This would be helpful to answer question 4.

      Thank you for your helpful comments. As demonstrated in the response to the 11th comment by the second reviewer, we attempted to conduct in vitro experiments using primary hepatocytes. However, during the isolation and perfusion processes, we observed that a significant number of cells underwent spontaneous death. To address this issue, we utilized CRISPR-Cas9 technology to generate Ripk1<sup>-/-</sup> AML12 cells, in which we observed a significant increase in cell death after palmitic acid (PA) treatment compared to WT control cells. We also observed a marked increase in caspase-3 activity in Ripk1<sup>-/-</sup> AML12 cells following PA treatment.

      (18) RIPK1 scaffold function is associated with NF-kB signal. Is NF-kB signal transduction influenced by Ripk1 deficiency? If so, to what extent does it contribute to the observed phenotype? If not, what is the direct downstream effect of Ripk1 deficiency?

      Thank you very much for your insightful perspective. As reported by Clucas J et al., RIPK1 serves as a scaffold for downstream NF-κB signaling through the ubiquitin chains generated by its ubiquitination (Clucas J et al., Nat Rev Mol Cell Biol. 2023). The deficiency of RIPK1 in hepatic parenchymal cells can disrupt NF-κB signaling and impair its pro-survival functions, resulting in increased cell death in response to stress. Our current findings suggest that the RIPK1-NF-κB axis serves as a crucial scaffold platform essential for the liver's adaptation to metabolic fluctuations. Any inappropriate inactivation or deletion of components within this scaffold disrupts the delicate balance between cell death, inflammation, and normal function, making the liver susceptible to metabolic changes, ultimately leading to liver damage, hepatic inflammation, and compensatory proliferation.

      (19) In Figure 6B, the 'RIP' should be changed to 'RIPK1'.

      Thank you for your careful observation. We have corrected "RIP" to "RIPK1" in updated Figure 6B.

      (20) For Western blot results, the blot height should be at least the lane width to reveal additional signals and the molecular weight as well as unspecific signals should be denoted.

      Thank you for your valuable advice. We appreciate your suggestions regarding the western blot results. We went through the previous western blot results and did not find any additional nonspecific signals. We added the molecular weights in the updated Figure 5, Figure 6, and Supplementary Figure 1.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review): 

      The authors of the study investigated the generalization capabilities of a deep learning brain age model across different age groups within the Singaporean population, encompassing both elderly individuals aged 55 to 88 years and children aged 4 to 11 years. The model, originally trained on a dataset primarily consisting of Caucasian adults, demonstrated a varying degree of adaptability across these age groups. For the elderly, the authors observed that the model could be applied with minimal modifications, whereas for children, significant fine-tuning was necessary to achieve accurate predictions. Through their analysis, the authors established a correlation between changes in the brain age gap and future executive function performance across both demographics. Additionally, they identified distinct neuroanatomical predictors for brain age in each group: lateral ventricles and frontal areas were key in elderly participants, while white matter and posterior brain regions played a crucial role in children. These findings underscore the authors' conclusion that brain age models hold the potential for generalization across diverse populations, further emphasizing the significance of brain age progression as an indicator of cognitive development and aging processes.

      Strengths: 

      (1) The study tackles a crucial research gap by exploring the adaptability of a brain age model across Asian demographics (Chinese, Malay, and Indian Singaporeans), enriching our knowledge of brain aging beyond Western populations.

      (2) It uncovers distinct anatomical predictors of brain aging between elderly and younger individuals, highlighting a significant finding in the understanding of age-related changes and ethnic differences.

      Weaknesses: 

      (1) Clarity in describing the fine-tuning process is essential for improved comprehension.

      (2) The analysis often limits its findings to p-values, omitting the effect sizes crucial for understanding the relationship with cognition.

      (3) Employing a predictive framework for cognition using brain age could offer more insight than mere statistical correlations.

      (4) Expanding the study's scope to evaluate the model's generalisability to unseen Caucasian samples is vital for establishing a comparative baseline.

      In summary, this paper underscores the critical need to include diverse ethnicities in model testing and estimation.

      Reviewer #1 (Recommendations for the authors): 

      Comment #1 - Fine-Tuning Process Clarity: Enhanced clarity in the fine-tuning process documentation is crucial for understanding how models are adapted to new datasets. This involves explaining parameter adjustments and choices, which facilitates replication and application in further research.

      We thank Reviewer #1 for this pertinent point. As advised, we have added a Supplementary Methods section with more details on the finetuning process. This includes the addition of Supplementary Figure S6, which shows examples of learning curves that helped inform our parameter adjustments and choices. We have added a reference to this section in Section 5.2 of the Methods.

      Comment #2 - Effect Sizes Reporting: The emphasis on reporting effect sizes alongside p-values addresses the need to quantify the strength of observed effects, particularly the relationship between brain age and cognition. Effect sizes provide insights into the practical significance of findings, crucial for clinical and practical applications.

      We thank Reviewer #1 for raising this important comment. As suggested, we have added standardized regression coefficients (as measures of effect size) alongside p-values in Figures 3 – 4, Supplementary Figures S2 – S4, Supplementary Tables S4 – S15, and the text of Sections 2.2 – 2.3 of the Results. We have additionally added 95% confidence intervals to Supplementary Tables S4 – S15.

      Comment #3 - Predictive Framework for Cognition: Adopting a predictive framework for cognition using brain age moves the research from mere correlation to actionable prediction, offering potentials based on predictive analytics.

      We thank Reviewer #1 for this insightful suggestion. Adopting a predictive framework would certainly be a useful and exciting avenue for the application of brain age. However, we note that the current study was primarily interested in the generalizability and interpretability of brain age in Asian children and older adults, as well as the added value of longitudinal measures of brain age. Thus, we believe our correlation-based analysis effectively demonstrated that deviations of brain age from chronological age were not merely random errors, but were informative of cognition. Furthermore, ongoing changes to these deviations were informative of future cognition. This helps to establish the brain age gap as a biomarker for aging, independent of chronological age. Additionally, we expect that the accurate prediction of future cognition would require a multitude of factors, in addition to T1-based brain age, as well as a large sample size to train and test. We believe such a dataset would be a promising avenue for future work, but it is outside the scope of the current study.

      Nonetheless, we were able to conduct a preliminary analysis using the current longitudinal data from SLABS and GUSTO. We extracted the same variables used in the original analyses of future cognition, corresponding to Figures 3D and 4B in the main text. To implement a predictive framework, we split the data into 10 stratified cross-validation folds. We also used kernel ridge regression (KRR) as the predictive model, as it has previously shown promising performance in behavioral and cognitive prediction [1]. We used a cosine kernel and nested 5-fold cross-validation to pick the optimal regularization strength (alpha).

      To investigate the added value of BAG and longitudinal changes in BAG, we compared 3 predictive models for each cognitive domain. The baseline model consisted of the demographic covariates used in the original analyses (i.e. chronological age, sex, and years of education for older adults). A second model combined demographics with baseline BAG, and the third model incorporated demographics, baseline BAG, and the (early) annual rate of change in BAG. Predictions were extracted from each test fold, and performance was measured by the correlation between test predictions and actual values of future cognition (or change in cognition). Models were statistically compared using the corrected resampled t-test for machine learning models [1], [2], [3]. The Benjamini-Hochberg procedure was used to correct for multiple comparisons.
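
      The two statistical steps named above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used for the analysis: the function names, the per-fold differences, and the fold sizes are hypothetical. The first function computes the corrected resampled t statistic of Nadeau and Bengio from per-fold differences in performance between two models, and the second applies the Benjamini-Hochberg adjustment to a list of p-values.

```python
import math
from statistics import mean, variance


def corrected_resampled_t(diffs, n_train, n_test):
    """Corrected resampled t statistic for per-fold score differences.

    diffs   : per-fold differences in performance between two models
    n_train : number of training samples in each fold (assumed constant)
    n_test  : number of test samples in each fold (assumed constant)
    Returns the statistic and its degrees of freedom (k - 1).
    """
    k = len(diffs)
    var = variance(diffs)  # sample variance (denominator k - 1)
    # The n_test / n_train term inflates the variance to account for
    # the overlap between training sets across folds.
    se = math.sqrt((1.0 / k + n_test / n_train) * var)
    return mean(diffs) / se, k - 1


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the
    # adjusted values are monotone in the rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-fold differences in correlation between two models.
t_stat, dof = corrected_resampled_t([0.02, 0.04, 0.03, 0.05, 0.01], 80, 20)
adjusted_p = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

      The p-value for the statistic would be taken from a t distribution with the returned degrees of freedom. With 10-fold cross-validation the ratio n_test / n_train is 1/9.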

      Author response image 1 shows the prediction results for SLABS and GUSTO. Notably, adding the early change in BAG significantly improves the prediction of future change in executive function in SLABS. There is also an improvement in predicting the future inhibition score in GUSTO, but this is not significant after multiple comparison correction. Encouragingly, these are the same domains that showed significant associations with the change in BAG in the original analyses. This suggests that longitudinal brain age continues to contribute information, independent of baseline factors, in a predictive framework. We hope that future work can expand on this analysis with, for instance, larger sample sizes, more varied and informative predictors, and state-of-the-art prediction methods, in order to establish actionable predictions of future cognition.

      Author response image 1.

      Predictive framework for cognition similarly suggests value of longitudinal change in BAG. Prediction performance (Pearson's correlation) of KRR across future cognitive outcomes. Each boxplot shows the distribution of performance over cross-validation folds. Model performances are statistically compared for each outcome. Significant outcomes from the original analyses are bolded. (A) Results for SLABS using the early change in BAG and future change in cognitive scores (non-overlapping). Early change in BAG again shows benefit for predicting future change in executive function. (B) Results for GUSTO using the early change in BAG (from 4.5-7.5 years old) and future cognitive score (at 8.5 years old). Early change in BAG again shows benefit for predicting future inhibition, but it is not significant after multiple comparison correction. Key - **: p < 0.01; * (ns): p < 0.05 but p<sub>corr</sub> > 0.05 after multiple comparison correction; ns: p > 0.05

      Comment #4 - Generalizability to Unseen Caucasian Samples: Evaluating the model's performance on unseen (longitudinal) Caucasian samples is important for benchmarking.

      We thank Reviewer #1 for this important comment. We agree that generalizability should be benchmarked against performance on unseen Caucasian samples. In the SFCN model paper [4], they conducted an out-of-sample test on unseen Caucasian samples from ages 13 to 95. In this age range, they reported a high correlation (r = 0.975) and low MAE (MAE = 3.90). This favorable generalization performance was verified in adults by independent evaluations [5], [6]. This is also in line with what we observed in Asian older adults, taking into account the different age ranges and sample sizes involved [7].

      However, this also highlights the difficulty in evaluating on younger ages in the range of GUSTO (4.5 – 10.5 years old). Most accessible developmental datasets (e.g. HBN, PING) were already included in model training, preventing an unbiased evaluation on these samples. Datasets such as PNC and ABCD were not included in training, but they primarily consist of an older age range than GUSTO. Holm et al. [8] previously tested the SFCN model in ABCD and reported satisfactory performance (low MAE) from 9 – 13 years old. However, to the best of our knowledge, there are no reported generalization results (for any ethnicity) from 4.5 – 7.5 years old, which is where we found the most performance degradation in GUSTO. We are also not aware of any datasets in this age range we could access to test this, unfortunately, but it would be an important area for future work.

      While benchmarking in Caucasian children is difficult, we were able to conduct a preliminary analysis with older adults using the ADNI dataset (which was not included in the model training [4]). We selected a longitudinal subset with cognitive data available and no dementia at baseline (N = 137). We used composite cognitive scores covering memory, executive function, language, and visuospatial function [9], [10], [11]. We followed the same methodology (e.g. preprocessing, finetuning, statistical analysis) as the main analyses on EDIS, SLABS, and GUSTO. To maximize the data available, we tested associations with future cognition (taken at the last available time point), similar to GUSTO. We again included chronological age, sex, and years of education as demographic covariates.

      Author response image 2 shows the brain age predictions for the pretrained and finetuned models on ADNI. Similar to Singaporean older adults, the pretrained model performs well, producing a high correlation (r = 0.8053; compared to r = 0.7389 for EDIS and r = 0.8136 for SLABS) and somewhat low MAE (MAE = 4.9735; compared to MAE = 3.9895 for EDIS and MAE = 3.4668 for SLABS). After finetuning, the MAE improves (MAE = 3.6837; compared to MAE = 3.3232 for EDIS and MAE = 3.2653 for SLABS) with a similar correlation (r = 0.7854; compared to r = 0.7445 for EDIS and r = 0.8138 for SLABS). This suggests that generalization to unseen Singaporean older adults is in line with the generalization to unseen Caucasian older adults.
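
      For readers less familiar with the two metrics quoted above, the sketch below shows how MAE, Pearson's correlation, and the brain age gap (BAG) can be computed from predicted and chronological ages. It assumes BAG is defined as predicted age minus chronological age; the ages are made-up values for illustration and the function names are hypothetical.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def brain_age_gap(predicted, chronological):
    """BAG per participant, assumed here to be predicted minus chronological age."""
    return [p - c for p, c in zip(predicted, chronological)]


# Made-up ages in years, for illustration only.
chronological = [60.0, 65.0, 70.0, 75.0]
predicted = [62.0, 64.0, 73.0, 74.0]

mae = mean_absolute_error(chronological, predicted)
r = pearson_r(chronological, predicted)
bag = brain_age_gap(predicted, chronological)
```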

      Author response image 2. 

      Brain age predictions on unseen Caucasian sample of older adults. Predictions from the A) pretrained and B) finetuned brain age models on ADNI participants. Compare to Figure 2 of the main text.

      For the associations with future cognition, we again find that baseline BAG does not associate with future cognition (Author response tables 1 and 2). However, encouragingly, we find that the early annual rate of change in BAG does associate with future memory, which is significant after multiple comparison correction for the finetuned model (Author response tables 3 and 4). This suggests a degree of replicability to the original results, but interestingly, in a different domain (memory vs. executive function). In contrast to SLABS, which consists of healthy older adults recruited from the community, ADNI consists of participants at risk of AD recruited from memory clinics. Thus, this difference in domain could be due to factors such as a stronger signal for memory in the testing battery or greater variations in memory function and decline. However, it could also reflect other population differences between ADNI and SLABS. This is an intriguing area for future study, ideally with larger sample sizes and more diverse populations included.

      Author response table 1.

      Linear relationship between pretrained baseline BAG and future cognitive score in ADNI. Compare to Supplementary Tables S4 – S15 of the original text.

      Author response table 2. 

      Linear relationship between finetuned baseline BAG and future cognitive score in ADNI. Compare to Supplementary Tables S4 – S15 of the original text.

      Author response table 3.

      Linear relationship between pretrained change in BAG and future cognitive score in ADNI. Compare to Supplementary Tables S4 – S15 of the original text.

      Author response table 4. 

      Linear relationship between finetuned change in BAG and future cognitive score in ADNI. Compare to Supplementary Tables S4 – S15 of the original text.

      References

      (1) L. Q. R. Ooi et al., “Comparison of individualized behavioral predictions across anatomical, diffusion and functional connectivity MRI,” NeuroImage, vol. 263, p. 119636, Nov. 2022, doi: 10.1016/j.neuroimage.2022.119636.

      (2) C. Nadeau and Y. Bengio, “Inference for the Generalization Error,” Mach. Learn., vol. 52, no. 3, pp. 239–281, Sep. 2003, doi: 10.1023/A:1024068626366.

      (3) R. R. Bouckaert and E. Frank, “Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms,” in Advances in Knowledge Discovery and Data Mining, H. Dai, R. Srikant, and C. Zhang, Eds., Berlin, Heidelberg: Springer, 2004, pp. 3–12. doi: 10.1007/978-3-540-24775-3_3.

      (4) E. H. Leonardsen et al., “Deep neural networks learn general and clinically relevant representations of the ageing brain,” NeuroImage, vol. 256, p. 119210, Aug. 2022, doi: 10.1016/j.neuroimage.2022.119210.

      (5) R. P. Dörfel et al., “Prediction of brain age using structural magnetic resonance imaging: A comparison of accuracy and test-retest reliability of publicly available software packages,” Neuroscience, preprint, Jan. 2023. doi: 10.1101/2023.01.26.525514.

      (6) J. L. Hanson, D. J. Adkins, E. Bacas, and P. Zhou, “Examining the reliability of brain age algorithms under varying degrees of participant motion,” Brain Inform., vol. 11, no. 1, p. 9, Apr. 2024, doi: 10.1186/s40708-024-00223-0.

      (7) A.-M. G. de Lange et al., “Mind the gap: Performance metric evaluation in brain-age prediction,” Hum. Brain Mapp., vol. 43, no. 10, pp. 3113–3129, Jul. 2022, doi: 10.1002/hbm.25837.

      (8) M. C. Holm et al., “Linking brain maturation and puberty during early adolescence using longitudinal brain age prediction in the ABCD cohort,” Dev. Cogn. Neurosci., vol. 60, p. 101220, Feb. 2023, doi: 10.1016/j.dcn.2023.101220.

      (9) P. K. Crane et al., “Development and assessment of a composite score for memory in the Alzheimer’s Disease Neuroimaging Initiative (ADNI),” Brain Imaging Behav., vol. 6, no. 4, pp. 502–516, Dec. 2012, doi: 10.1007/s11682-012-9186-z.

      (10) L. E. Gibbons et al., “A composite score for executive functioning, validated in Alzheimer’s Disease Neuroimaging Initiative (ADNI) participants with baseline mild cognitive impairment,” Brain Imaging Behav., vol. 6, no. 4, pp. 517–527, Dec. 2012, doi: 10.1007/s11682-012-9176-1.

      (11) S.-E. Choi et al., “Development and validation of language and visuospatial composite scores in ADNI,” Alzheimers Dement. Transl. Res. Clin. Interv., vol. 6, no. 1, p. e12072, 2020, doi: 10.1002/trc2.12072.

    1. Author response:

      The following is the authors’ response to the current reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      The authors describe a massively parallel reporter assay (MPRA) screen focused on identifying polymorphisms in 5' and 3' UTRs that affect translation efficiency and thus might have a functional impact on cells. The topic is of timely interest, and indeed, several related efforts have recently been published and preprinted (e.g., https://pubmed.ncbi.nlm.nih.gov/37516102/ and https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10635273/). This study has several major issues with the results and their presentation.

      Major comments:

      • The main issue remains that it appears that the screen has largely failed, and the reasons for that remain unclear, which makes it difficult to interpret how useful the resulting data are. The authors mention batch effects as a potential contributor. The authors start with a library that includes ~6,000 variants, which makes it a medium-size MPRA. But then, only 483 pairs of WT/mutated UTRs yield high-confidence information, which is already a small number for any downstream statistical analysis, particularly since most don't actually affect translation in the reporter screen setting (which is not unexpected). It is unclear why >90% of the library did not give high-confidence information. The profiles presented as base-case examples in Fig. 2B don't look very informative or convincing. All the subsequent analysis is done on a very small set of UTRs that have an effect, and it is unclear to this reviewer how these can yield statistically significant and/or biologically relevant associations.

      • From the variants that had an effect, the authors go on to carry out some protein-level validations, and see some changes, but it is not clear if those changes are in the same direction as observed in the screen. In their rebuttal the authors explain that they largely cannot infer directionality of changes from the screen, which further limits its utility.

      • It is particularly puzzling how the authors can build a machine learning predictor with >3,000 features when the dataset they use for training the model has just a few dozen translation-shifting variants.

      We recognize that RNA distribution within polysomes is inherently less stable than the associated protein components. This instability has been noted in previous studies, including those cited by the reviewer, which used RNA from bulk polysomes to infer the translatome without fractionation. Acknowledging this limitation, we purposely adopted a conservative strategy: (i) performing gross fractionation of polysomes, and (ii) collaborating with biostatisticians at the Institute of Statistical Science, Academia Sinica, to design a conservative yet optimized analysis pipeline that minimized batch effects.

      This approach proved robust: representative cases in Fig. 2B clearly demonstrate distinct distributions of reference and alternative alleles. From our high-confidence dataset, we applied a well-established statistical framework specifically designed to accommodate multiple influencing factors in relatively small datasets (Elements of Statistical Learning by Hastie, Tibshirani, and Friedman). We further conducted sensitivity analyses to select an optimal QC cutoff across a range of stringencies, ensuring maximal reliability of our results. We have therefore successfully shortlisted UTR variants that have a strong effect on translation.

      Building upon these conservative measures, we developed a predictive model for translation effects of UTR variants. Importantly, this model was validated not only with our internal test dataset but also with independent external datasets. In addition, the sequence features identified by the model were validated through reporter assays and in vivo CRISPR editing. These external and functional validations establish the generalizability and robustness of our approach.

      A more detailed analysis of the directionality of changes in translation efficiency is under active investigation. These results will be reported in a separate manuscript currently in preparation.


      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      The authors describe a massively parallel reporter assay (MPRA) screen focused on identifying polymorphisms in 5' and 3' UTRs that affect translation efficiency and thus might have a functional impact on cells. The topic is of timely interest, and indeed, several related efforts have recently been published and preprinted (e.g., https://pubmed.ncbi.nlm.nih.gov/37516102/ and https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10635273/). This study has several major issues with the results and their presentation.

      Major comments:

      (1) The main issue is that it appears that the screen has largely failed, yet the reasons for that are unclear, which makes it difficult to interpret. The authors start with a library that includes approximately 6,000 variants, which makes it a medium-sized MPRA. But then, only 483 pairs of WT/mutated UTRs yield high-confidence information, which is already a small number for any downstream statistical analysis, particularly since most don't actually affect translation in the reporter screen setting (which is not unexpected). It is unclear why >90% of the library did not give high-confidence information. The profiles presented as base-case examples in Figure 2B don't look very informative or convincing. All the subsequent analysis is done on a very small set of UTRs that have an effect, and it is unclear to this reviewer how these can yield statistically significant and/or biologically relevant associations.

      To make sure our final results are technically and statistically sound, we applied stringent selection criteria and cutoffs in our analytics workflow. First, from our RNA-seq dataset, we filtered the UTRs with at least 20 reads in a polysome profile across all three repeated experiments. Secondly, in the following main analysis using a negative binomial generalized linear model (GLM), we further excluded the UTRs that displayed batch effect, i.e. their batch-related main effect and interaction are significant. We believe our measure has safeguarded the filtered observations (UTRs) from the (potential) high variation of our massively parallel translation assays and thus gives high confidence to our results.
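
      The first filtering step can be illustrated with a short sketch. It assumes one reading of the criterion, namely that a UTR is kept only if its polysome profile (the sum of reads over the Mono, Light, and Heavy fractions) has at least 20 reads in every one of the three replicates. The function name, data layout, and read counts are hypothetical, and the subsequent batch-effect exclusion from the GLM is not reproduced here.

```python
MIN_READS = 20  # threshold stated in the response


def passes_read_filter(counts_by_replicate, min_reads=MIN_READS):
    """Check the read-depth criterion for one UTR.

    counts_by_replicate: list with one dict per replicate, mapping a
    polysome fraction name to its read count.
    Assumption: the threshold applies to the summed profile in each replicate.
    """
    return all(sum(rep.values()) >= min_reads for rep in counts_by_replicate)


# Hypothetical read counts for two UTRs across three replicates.
utr_a = [
    {"Mono": 10, "Light": 8, "Heavy": 5},
    {"Mono": 12, "Light": 9, "Heavy": 4},
    {"Mono": 11, "Light": 7, "Heavy": 6},
]
utr_b = [
    {"Mono": 10, "Light": 8, "Heavy": 5},
    {"Mono": 3, "Light": 2, "Heavy": 1},  # too shallow in this replicate
    {"Mono": 11, "Light": 7, "Heavy": 6},
]

kept = [name for name, counts in [("utr_a", utr_a), ("utr_b", utr_b)]
        if passes_read_filter(counts)]
```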

      Regarding the interpretation of Figure 2B, since we aimed to identify the UTRs whose interaction term of genotype and fractions is significant in our generalized linear model, it is statistically conventional to double-check the interaction of the two variables using such a graph. For instance, in the top left panel of Figure 2B (5'UTR of ANK2:c.-39G>T), we can see that read counts of WT samples congruously decreased from Mono to Light, whereas the read counts of mutant samples were roughly the same in the two fractions – the trend is different between WT and mutant. Ergo, the distinct distribution patterns of two genotypes across three fractions in Figure 2B offer the readers a convincing visual supplement to our statistics from GLM.

      In contrast to Figure 2B, the graphs of nonsignificant UTRs (shown below) reveal that the trends between the two genotypes are similar across the 'Mono and Light' and 'Light and Heavy' polysome fractions. Importantly, our analysis remains unaffected by differential expression levels between WT and mutant, as it specifically distinguishes polysome profiles with different distributions. This consistent trend further supports the lack of interaction between genotype and polysome fractions for these UTRs.

      Author response image 1.

      Examples of non-significant UTR pairs in massively parallel polysome profiling assays.

      (2) From the variants that had an effect, the authors go on to carry out some protein-level validations and see some changes, but it is not clear if those changes are in the same direction as observed in the screen.

      To infer the directionality of translation efficiency from polysome profiles, a common approach involves pooling polysome fractions and comparing them with free or monosome fractions to identify 'translating' fractions. However, this method has two major potential pitfalls: (i) it sacrifices resolution and does not account for potential bias toward light or heavy polysomes, and (ii) it fails to account for discrepancies between polysome load and actual protein output (as discussed in https://doi.org/10.1016/j.celrep.2024.114098 and https://doi.org/10.1038/s41598-019-47424-w). Therefore, our analysis focused on the changes within polysome profiles themselves. 'Significant' candidates were identified based on a significant interaction between genotype and polysome distribution using a negative binomial generalized linear model, without presupposing the direction of change on protein output. 

      (3) The authors follow up on specific motifs and specific RBPs predicted to bind them, but it is unclear how many of the hits in the screen actually have these motifs, or how significant motifs can arise from such a small sample size.

      We calculated the Δmotif enrichment in significant UTRs versus nonsignificant UTRs using Fisher’s exact test. For example, the enrichment of the Δ‘AGGG’ motif in 3’ UTRs is shown below:

      Author response table 1.

      This test yields a P-value of 0.004167 by Fisher’s exact test. The P-values and Odds ratios of Δmotifs in relation to polysome shifting are included in Supplementary Table S4, and we will update the detailed motif information in the revised Supplementary Table S4.
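
      A two-sided Fisher's exact test on a 2×2 table can be sketched with the standard library alone. The counts below are hypothetical placeholders, not the values from Author response table 1, so the sketch is not expected to reproduce the reported P-value. Rows stand for significant versus nonsignificant UTRs and columns for presence versus absence of the Δmotif; the function name is an assumption.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns (odds_ratio, p_value). The p-value sums the hypergeometric
    probabilities of all tables that are no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Probability of x counts in the top-left cell with margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # The small relative tolerance guards against floating-point ties.
    p_value = sum(prob(x) for x in range(lo, hi + 1)
                  if prob(x) <= p_obs * (1 + 1e-9))
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, min(p_value, 1.0)


# Hypothetical counts: [with motif change, without] for
# significant UTRs (first row) and nonsignificant UTRs (second row).
odds_ratio, p_value = fisher_exact_two_sided(3, 1, 1, 3)
```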

      (4) It is particularly puzzling how the authors can build a machine learning predictor with >3,000 features when the dataset they use for training the model has just a few dozen translation-shifting variants.

      We understand the concern regarding the relatively small number of translation-shifting variants compared to the large number of features. To address this, we employed LASSO regression, which, according to The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman, is particularly suitable for datasets where the number of features p is much larger than the number of samples N. LASSO effectively performs feature selection by shrinking less important coefficients to zero, allowing us to build a robust and generalizable model despite the limited number of variants.
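
      The shrink-to-zero behavior can be seen in a small coordinate-descent sketch. It assumes a linear model with the objective (1/(2N))·||y − Xb||² + penalty·||b||₁ and no intercept, which is a common textbook form and not necessarily the exact model fitted in the manuscript. The function names and the toy data are hypothetical.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; return 0.0 if it lies within the band."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_iter=200):
    """Minimize (1/(2N)) * ||y - Xb||^2 + penalty * ||b||_1 by cyclic updates.

    X is a list of N rows with p features each; y has N targets.
    """
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    residual = list(y)  # equals y - X @ coef while coef is all zeros
    col_scale = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_scale[j] == 0.0:
                continue  # an all-zero feature carries no information
            # Covariance of feature j with the partial residual, i.e. the
            # residual with feature j's own contribution added back.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * coef[j])
                      for i in range(n)) / n
            new_coef = soft_threshold(rho, penalty) / col_scale[j]
            delta = new_coef - coef[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                coef[j] = new_coef
    return coef


# Toy data: the first feature drives y, the second is nearly irrelevant.
X_toy = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
y_toy = [2.0, 0.1, -2.0, -0.1]
coef_toy = lasso_coordinate_descent(X_toy, y_toy, penalty=0.1)
```

      In this toy case the weak feature receives a coefficient of exactly zero while the informative one is kept with a slightly shrunken value, which is the selection property that makes the method usable when p exceeds N.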

      (5) The lack of meaningful validation experiments altering the SNPs in the endogenous loci by genome editing limits the impact of the results.

      Following the reviewer’s suggestion, we assessed the endogenous mutant effect by generating CRISPR knock-in clones carrying the IRF6:c.-4609G>A variant. We showed that this G>A variant generates a deleterious upstream open reading frame, which dramatically reduced protein expression of the main open reading frame (Fig. 7B-D). The genome editing further demonstrated that the G>A variant reduced endogenous IRF6 protein expression to 23% or 44% in two independent clones. We have incorporated the genome editing results in the revised main text and the new Figure 7E&F:

      “To further validate the endogenous effect of the novel upstream ATG (uATG), we generated CRISPR knock-in clones carrying the IRF6:c.-4609G>A variant and examined its impact on gene expression. The introduction of the uATG reduced RNA levels to 88% and 37% of the wild-type in two independent clones (Fig. 7E), and protein levels to 44% and 23%, respectively (Fig. 7F), resulting in an overall reduction of translation efficiency to 50–62%.” (p.18)
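
      The 50–62% range follows from dividing the relative protein level by the relative RNA level in each clone. The short check below uses the percentages quoted above; the clone labels and the function name are hypothetical, and the pairing of values follows the order in which they are reported.

```python
# Levels relative to wild-type, as quoted in the revised text.
rna_level = {"clone 1": 0.88, "clone 2": 0.37}
protein_level = {"clone 1": 0.44, "clone 2": 0.23}


def translation_efficiency(protein_rel, rna_rel):
    """Relative translation efficiency: protein output per unit of RNA."""
    return protein_rel / rna_rel


te = {clone: translation_efficiency(protein_level[clone], rna_level[clone])
      for clone in rna_level}

for clone, value in te.items():
    print(f"{clone}: {value:.0%} of wild-type translation efficiency")
```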

      Reviewer #2 (Public Review):

      Summary:

      In their paper "Massively Parallel Polyribosome Profiling Reveals Translation Defects of Human Disease-Relevant UTR Mutations" the authors use massively parallel polysome profiling to determine the effects of 5' and 3' UTR SNPs (from dbSNP/ClinVar) on translational output. They show that some UTR SNPs cause a change in the polysome profile with respect to the wild-type and that pathogenic SNPs are enriched in the polysome-shifting group. They validate that some changes in polysome profiles are predictive of differences in translational output using transiently expressed luciferase reporters. Additionally, they identify sequence motifs enriched in the polysome-shifting group. They show that 2 enriched 5' UTR motifs increase the translation of a luciferase reporter in a protein-dependent manner, highlighting the use of their method to identify translational control elements.

      Strengths:

      This is a useful method and approach, as UTR variants have been more difficult to study than coding variants. Additionally, their evidence that pathogenic mutations are more likely to cause changes in polysome association is well supported.

      Weaknesses:

      The authors acknowledge that they "did not intend to immediately translate the altered polysome profile into an increase or decrease in translation efficiency, as the direction of the shift was not readily evident. Additionally, sedimentation in the sucrose gradient may have been partially affected by heavy particles other than ribosomes." However, shifted polysome distribution is used as a category for many downstream analyses. Without further clarity or subdivision, it is very difficult to interpret the results (for example in Figure 5A, is it surprising that the polysome shifting mutants decrease structure? Are the polysome "shifts" towards the untranslated or heavy fractions?)

      Our approach, combining polysome fractionation of the UTR library with negative binomial generalized linear model (GLM) analysis of RNA-seq data, systematically identifies variants that affect translational efficiency. The GLM model is specifically designed to detect UTR pairs with significant interactions between genotype and polysome fractions, relying solely on changes in polysome profiles to identify variants that disrupt translation. Consequently, our analytical method does not determine the direction of translation alteration.

      Following the massively parallel polysome profiling, we sought to understand how these polysome-shifting variants influence the translation process. To do this, we examined their effects on RNA characteristics related to translation, such as RBP binding and RNA structure. In Figure 5A, we observed a notable trend in significant hits within 5’ UTRs—they tend to increase ΔG (weaker folding energy) in response to changes in polysome profiles, regardless of whether protein production increases or decreases (Fig. 3).

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Minor comments:

      (1) Figure 3A - the claim that 5'UTR variants had a stronger effect than 3'UTR is based on the two UTRs with the strongest effect. It is unclear how these differences between 5' and 3'UTRs are significant.

      We carried out a Wilcoxon rank-sum test to compare the mut/WT fold change in translation efficiency between the 3’ and 5’ UTR variants. The results showed that the 5’ UTR variants exhibited a greater change in translation efficiency. We have inserted this result in the revised Figure 3C and refer to this figure in the main text: “Furthermore, we observed that 5’ UTR variants had a greater impact on translation activity relative to 3’ UTR variants (Fig. 3C).” (p. 12)
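
      For readers unfamiliar with the test, a minimal sketch of a two-sided Wilcoxon rank-sum (Mann-Whitney U) comparison follows. It uses the normal approximation without a tie correction to the variance, and the input values are invented absolute log2 fold changes, not data from the study. In practice a library routine such as scipy.stats.mannwhitneyu would normally be used instead.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test using the normal approximation.

    Returns (U statistic for x, z score, p-value). Tied values receive
    midranks; no tie correction is applied to the variance.
    """
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        # Extend j over the run of values tied with combined[i].
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = midrank
        i = j + 1

    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))
    return u, z, p


# Invented |log2(mut/WT)| values, for illustration only.
five_prime = [1.2, 0.9, 1.6, 2.1, 0.8, 1.4]
three_prime = [0.3, 0.5, 0.4, 0.7, 0.2, 0.6]

u_stat, z_score, p_value = rank_sum_test(five_prime, three_prime)
print(f"U = {u_stat}, z = {z_score:.2f}, p = {p_value:.4f}")
```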

      (2) Figures 2B and S1, S2 - what is the meaning of less signal for a light chain and a similar signal for a heavy chain? How can this situation, while being a significant difference between the profiles, lead to a biologically relevant difference in eventual protein output?

      Taking 3’UTR ACADSB:c.*4177G>A (bottom-left panel in Figure 2B) as an example: WT transcripts have lower read counts (in units of log(CPM)) than transcripts carrying the mutant UTR in the light polysome-containing fraction, whereas the read counts of the two genotypes are approximately the same in the heavy polysome-containing fraction.

      In line with our reply to Reviewer 1’s major comment 1, we aimed to identify the UTRs whose genotype-by-fraction interaction term is significant in our generalized linear model (GLM). That is, our targets are the UTR pairs whose WT and mutant alleles show different trends across the fractions (Mono to Light and Light to Heavy). In Figure 2B, 3’UTR ACADSB:c.*4177G>A is a clear example of our significant hits, as the trends of the two genotypes diverge across the three fractions.
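
      The idea of "different trends across the fractions" can be illustrated with a difference-of-differences calculation on log counts. This is a simplified sketch, not the negative binomial GLM itself, and the log(CPM) values are invented to mimic the pattern described for the light and heavy fractions. In a log-link model, the genotype-by-fraction interaction coefficients are closely related to these contrasts.

```python
FRACTIONS = ["mono", "light", "heavy"]

# Invented log(CPM) values, for illustration only.
log_cpm = {
    "WT": {"mono": 6.0, "light": 5.0, "heavy": 6.5},
    "Mut": {"mono": 6.0, "light": 6.0, "heavy": 6.5},
}


def trends(profile):
    """Change in log(CPM) between consecutive fractions (Mono->Light, Light->Heavy)."""
    return [profile[b] - profile[a] for a, b in zip(FRACTIONS, FRACTIONS[1:])]


def interaction_contrasts(wt, mut):
    """Mutant trend minus WT trend for each transition.

    All zeros would mean the two genotype lines are parallel, i.e. no
    genotype-by-fraction interaction on the log scale.
    """
    return [m - w for w, m in zip(trends(wt), trends(mut))]


contrasts = interaction_contrasts(log_cpm["WT"], log_cpm["Mut"])
for (a, b), value in zip(zip(FRACTIONS, FRACTIONS[1:]), contrasts):
    print(f"{a} -> {b}: difference of trends = {value:+.1f}")
```

      A statistical model is still needed to decide whether such contrasts exceed batch-to-batch noise; the sketch only shows what quantity the interaction term captures.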

      It is well established that an altered polysome distribution indicates a change in translational efficiency. Our GLM helped us identify the UTR pairs whose WT and mutant alleles have different polysome profiles and thus likely have distinct translational efficiencies. Nevertheless, since we had only a limited number of polysome fractions in our experiments, we further validated our significant hits and confirmed the direction of regulation using luciferase reporter assays.

      (3) The paragraph starting with "Even with the high confidence dataset, we did not intend to immediately translate the altered polysome profile into an increase or decrease in translation efficiency" is confusing. The whole premise of the screen used by the authors is that polysome profiling is a useful proxy for estimating levels of translation, so claiming that it doesn't necessarily measure translation is counterintuitive.

      In line with our reply to the last question, our goal is to use altered polysome profiles as a proxy for changes in translational efficiency. However, due to the limited number of fractions in our experiment, we could not directly infer the direction of regulation, i.e. an increase or decrease in translational efficiency, for the statistically significant variants. That is why we refrained from drawing any conclusion about the direction of regulation for the significant hits and proceeded to validate them using luciferase reporter assays.

      (4) Figure S5A - this is normalized to the nucleotide distribution in 5' or 3'UTRs? Is this statistic being applied to 27 SNPs in 3'UTRs?

      To identify sequence features associated with altered polysome association, we systematically analyzed both significant and nonsignificant UTRs for nucleotide and motif-level changes. Fisher’s exact test was employed to evaluate whether specific nucleotide or motif alterations were enriched or depleted in polysome-shifting UTRs, compared to nonsignificant UTR pairs. For example, in the case of nucleotide C (see table below; also Table S4 and new Fig. S6A), only four significant 3’ UTRs involved a change in C, resulting in a significant depletion of this nucleotide change among polysome-shifting 3’ UTRs (odds ratio = 0.22, p = 0.0069). Expanding this approach to all 1-7 nt motifs, we identified multiple motif and nucleotide changes that were significantly associated with altered polysome association.

      Author response table 2.
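
      As an illustration of the enrichment test described above, the sketch below computes a sample odds ratio and a two-sided Fisher's exact p-value for a 2x2 table. Only the count of four significant 3’ UTRs with a change in C comes from the response; the other three counts are invented placeholders, so the output is not expected to reproduce the reported odds ratio or p-value.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Return (sample odds ratio, two-sided p-value) for the table [[a, b], [c, d]].

    The p-value sums the hypergeometric probabilities of all tables with the
    same margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Probability of x in the top-left cell given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)  # tolerance for float ties
    )
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, min(p_value, 1.0)


# Rows: significant vs. nonsignificant 3' UTR pairs.
# Columns: change involves C vs. does not involve C.
# Only the first count (4) is taken from the text; the rest are hypothetical.
odds_ratio, p_value = fisher_exact_two_sided(4, 46, 120, 330)
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")
```

      An odds ratio below 1 in this layout corresponds to depletion of the nucleotide change among the polysome-shifting UTRs.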

      (5) "uATG in the 5' UTR was not identified by the model as a widespread feature explaining polysome shifting". Is this because of the method of ribosome profiling or because of the sequences in the library? Can having more sequences in the library specifically looking at 5'UTR give more power for such an effect to emerge?

      Our assay design accounted for the presence of upstream ATG codons and the strength of adjacent Kozak sequences. However, additional factors known to influence the function of upstream open reading frames (uORFs), such as the reading frame of the uORF relative to the main coding sequence and the use of non-ATG initiation codons, were not systematically included. As a result, the current assay may have limited sensitivity in detecting uORF-related regulatory effects. A dedicated design specifically tailored to uORF variants is likely to enhance the detection power and better capture their contribution to translational control.

      (6) Figure 7B - it is not clear whether the luciferase reporter and the GFP reporter in the library function in a similar manner; is it creating an out-of-frame or an in-frame uORF? Also, it is not clear if the differences are statistically significant.

      In the MPRA library, the IRF6 uORF is out of frame relative to the GFP coding sequence. To directly assess its translational impact, we employed a luciferase reporter assay by fusing luciferase downstream of the IRF6 uORF. These constructs revealed a significant reduction in protein production, as shown in Figures 3 and 7B–F. Although the clinically relevant IRF6 uORF is out-of-frame with the main ORF, we engineered an in-frame uORF variant to validate translation initiation at the upstream ATG (uATG) (Fig. 7B-D). The in-frame construct confirmed uATG usage and led to a significant reduction in luciferase protein expression. Together, these results support the conclusion that the IRF6:c.-4609G>A variant gives rise to an active uORF that suppresses translation of the main ORF.

      Reviewer #2 (Recommendations For The Authors):

      (1) It would be helpful for the authors to subcategorize their data in ways that they consider meaningful and interpretable (e.g. shifts from all monosome to heavy, all heavy to monosome/free, etc.) Relatedly, what do the authors think the functional meaning is when a given transcript has high mono/heavy occupancy but low light occupancy (like what is shown in Figure 2B for ANK2) in the polysome profiling experiment? It is not apparent why a transcript with a high ribosome occupancy (heavy) would also have light occupancy (light).

      From the amplicon sequencing data, we obtained read counts for each UTR variant across the monosome, light, and heavy polysome fractions. Notably, this approach does not preserve the original relative abundance of transcripts among the three fractions. That is, despite a greater abundance of mRNAs in the heavy polysome fraction, comparable numbers of sequencing reads were recovered from the monosome and light fractions. As a result, this method is not suitable for interpreting the global directionality of translational shifts but is well-suited for detecting relative differences in polysome association. Therefore, our experimental and analytical design—combining targeted amplicon sequencing with generalized linear modeling (GLM)—was optimized to identify UTR variants that alter polysome association, independently of absolute transcript abundance in each fraction.

      (2) The method put forward in Figure 2 would be more convincing if there was data showing reproducibility in the massively parallel reporter assay. Perhaps the mut/WT ratio for all transcripts can be plotted against each other and a statistical test of correlation can be performed.

      Thank you for pointing this out. To demonstrate the reproducibility of our massively parallel reporter assay, we have plotted scatter plots of the ratios of all transcripts (summing the monosome, light, and heavy fractions) across different batches using our high-confidence dataset. We calculated the Pearson correlation coefficients and corresponding p-values for these comparisons. The results show strong correlation between each batch, supporting the reproducibility of our assay. We have incorporated this analysis in the main text as well as Supplemental Figure 3: “Pearson correlation analysis revealed R coefficients ranging from 0.59 to 0.71 for the mut-to-WT transcript ratios across three independent experiments (Supplemental Fig. 3).”
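
      The batch-to-batch comparison described above amounts to computing a Pearson correlation coefficient for each pair of batches. A minimal sketch follows; the mut/WT ratios are invented and the batch names are placeholders, so the resulting coefficients are not those reported in Supplemental Figure 3.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Invented mut/WT transcript ratios for the same six UTR pairs in three batches.
batches = {
    "batch_1": [0.8, 1.1, 1.5, 0.6, 1.0, 2.0],
    "batch_2": [0.9, 1.0, 1.3, 0.7, 1.2, 1.7],
    "batch_3": [0.7, 1.3, 1.4, 0.8, 0.9, 2.2],
}

pairwise = {
    (a, b): pearson_r(batches[a], batches[b])
    for a, b in combinations(batches, 2)
}
for (a, b), r in pairwise.items():
    print(f"{a} vs {b}: R = {r:.2f}")
```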

      (3) The dots in Figure 2B indicate separate experiments, but the y-axis is log(counts). Values could be normalized (perhaps a ratio of mut/WT) for comparison between experiments.

      We aimed to compare UTR distribution across polysome fractions and recognized the importance of presenting the distribution patterns for both genotypes. This approach allows us to more clearly illustrate the differences or similarities in polysome association between the two genotypes.

      (4) When describing the 5' UTRs used for the validation experiments in Figure 3, more information about the 5' UTR sequence used is necessary. It is not clear how much or what part of the 5' UTRs were removed, or why this was necessary considering the same experiment was conducted using full-length UTRs.

      In the initial library design, technical limitations of bulk oligonucleotide synthesis constrained the UTRs to 155 nucleotides, comprising 115-nt of endogenous human UTR sequence flanked by 20-nt priming sites on both ends. Variants were centered at the 58th nucleotide within the 115-nt UTR sequence. When one flanking region of the native UTR was shorter than 57 nt, the variant was shifted accordingly toward the shorter arm to maintain the 115-nt UTR length (Fig. 2A).
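
      The windowing rule described above can be sketched as follows. This is one reading of the text, not the authors' actual design script: the function name, the 0-based index convention, and the handling of UTRs shorter than 115 nt (rejected here) are assumptions.

```python
WINDOW = 115  # endogenous UTR sequence per oligo (155 nt total with 2 x 20-nt priming sites)
FLANK = 57    # nucleotides on each side when the variant sits at position 58


def utr_window(utr: str, variant_index: int):
    """Return (115-nt segment, 1-based position of the variant within it).

    variant_index is the 0-based position of the variant in the full UTR.
    The variant is centered when both flanks have at least 57 nt; otherwise
    the window slides so the segment stays 115 nt long and the variant ends
    up closer to the short arm.
    """
    if len(utr) < WINDOW:
        # The text does not describe this case; rejecting it is an assumption.
        raise ValueError("UTR is shorter than the 115-nt window")
    if not 0 <= variant_index < len(utr):
        raise ValueError("variant_index is outside the UTR")
    start = variant_index - FLANK
    start = max(0, min(start, len(utr) - WINDOW))
    return utr[start:start + WINDOW], variant_index - start + 1


# Toy sequence; real inputs would be native UTR sequences.
toy_utr = "ACGT" * 75  # 300 nt
for index in (150, 10, 295):
    segment, position = utr_window(toy_utr, index)
    print(f"variant at index {index}: segment length {len(segment)}, variant at position {position}")
```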

      Given that endogenous UTRs in the human genome are often longer than 155 nt, we further evaluated the functional consequences of variants within full-length UTR sequences (Fig. 3B). While the mutant effects observed in the library setting were largely recapitulated, their magnitude was diminished in the full-length context, likely due to the increased sequence and structural complexity.

      To clarify the experimental design related to Figure 3, we modified the text as the following: “The variants significantly altering the polysome profile were then individually validated by means of high-sensitivity luciferase reporter assays (Fig. 3A). To that end, we resynthesized both the variant and corresponding wildtype alleles in the same library format - 115-nt native UTR segments centered on the variant and flanked by 20-nt priming sites. These UTRs were then cloned upstream (5’) or downstream (3’) of the firefly luciferase coding sequence, depending on their genomic location.” (p. 11)

      (5) The conclusions from inserting RBP-binding motifs into 5' UTRs and assaying translational output (Figure 4) would be strengthened by including luciferase reporters containing endogenous 5' UTRs containing these motifs, and versions where the motifs are disrupted.

      Several variants that altered translation efficiency were validated in their native sequence contexts, including 5’ UTR variants in DMD and NF1 that affect SRSF1/2 binding sites, as well as a 3’ UTR variant in AL049650.1 that impacts a KHSRP binding site (Fig. 3 and Supplemental Figs. S1 & S2). To address the functional relevance of these variants within their native regulatory landscapes, we have incorporated the following clarification into the text (p. 13): “This observation is consistent with additional findings where variants that create or disrupt specific RBP binding sites—such as SRSF1/2 (e.g., in DMD and NF1; Fig. 2 and Supplementary Fig. S4) and KHSRP (e.g., in AL049650.1; Fig. 2 and Supplementary Figs. S4 & S5)—led to significant changes in translation efficiency within their native UTR contexts.”

      (6) Figure 5C shows that 5' UTR SNPs that form an uAUG are associated with greater structural changes, but this does not "indicate" that "structure‐modifying UTR variants may control primary ORF translation partly by interfering with translation initiation from a uORF." The data presented in Figure 5 and luciferase/polysome data presented previously do not distinguish whether translation is occurring at an uAUG or canonical AUG. The statement quoted above is speculative and it should be clear that it is a hypothesis generated by the data and is not conclusive.

      We appreciate the reviewer’s suggestion. We have therefore modified our text to: “Therefore, while changes in uATG may not be common explanatory factors for polysome-shifting mutations, our results suggest that structure-modifying UTR variants may control primary ORF translation partly by interfering with translation initiation from a uORF.” (p. 14)

      Minor points/questions

      (1) The authors should clarify whether during library construction for massively parallel polysome profiling the 3' UTR constructs contain a common 5' UTR? Likewise, do the 5' UTR constructs contain a common 3' UTR? Perhaps the lack of a 5' UTR in the 3' UTR constructs, which is implied by Figure 2A, would influence differences seen between 3' UTR pairs (and likewise for 5' UTR pairs).

      There are short common 5’ UTRs appended to the 3’ UTR library, and likewise, a common short 3’ UTR is included in the 5’ UTR library. The common 5’ UTR comprises partial sequences from the CMV promoter and the plasmid backbone of pEGFP-N1 vector. The common 3’ UTR includes sequences from the pEGFP-N1 backbone and a short polyadenylation signal from HBA1 (hemoglobin subunit alpha 1). While we cannot entirely rule out potential crosstalk between 5’ and 3’ UTRs, the design ensures that all constructs are compared in a controlled and consistent context, enabling valid pairwise comparisons between variant and wildtype alleles.

      To clarify the library design, we have revised the main text to include this explanation: 

      “The entire library of UTR oligonucleotides (UTR library) was subsequently ligated upstream or downstream of an enhanced GFP (EGFP) coding region, along with a CMV promoter and a common UTR sequence on the opposite end. Cells transfected with the UTR library were treated with cycloheximide 14 hours post transfection and then subjected to polysome fractionation (see Methods).” (p.11) 

      “The variants significantly altering the polysome profile were then individually validated through high-sensitivity luciferase reporter assays (Fig. 3A). To this end, we resynthesized both the variant and corresponding wildtype alleles in the same library format - 115-nt native UTR segments centered on the variant and flanked by 20-nt priming sites. These UTRs were then cloned upstream (5’) or downstream (3’) of the firefly luciferase coding sequence, depending on their genomic location. As in the initial library design, the test UTR segment differs only by one nucleotide, while a shared short UTR fragment is present on the opposite end of the coding sequence to ensure consistency across constructs (Fig. 2A).” (p. 12)

      (2) The lines connecting the polysome distribution points make the plots appear busy and difficult to read, the data would be easier to interpret if they were removed.

      We employed a generalized linear model (GLM) to identify the variants that altered the polysome association of the corresponding transcripts. Statistically speaking, we were looking for variants that led to a significant interaction between genotype and polysome fraction. Displaying the lines as they are in our plots therefore offers readers a direct visualization of the interaction: lines from the WT and Mut groups are not parallel, which indicates an interaction between genotype and polysome fraction. Moreover, showing the lines from three batches of experiments also helps to ascertain the reproducibility of our experiments. Taken together, the lines make our plots more informative.

    1. Author Response

      The following is the authors’ response to the original reviews.

      We are very grateful to both reviewers for taking the time to review our manuscript and data in great detail. We thank you for the fair assessment of our work, the helpful feedback, and for recognizing the value of our work. We have done our best to address your concerns below:

      eLife assessment This work reports a valuable finding on glucocorticoid signaling in male and female germ cells in mice, pointing out sexual dimorphism in transcriptomic responsiveness. While the evidence supporting the claims is generally solid, additional assessments would be required to fully confirm an inert GR signaling despite the presence of GR in the female germline and GR-mediated alternative splicing in response to dexamethasone treatment in the male germline. The work may interest basic researchers and physician-scientists working on reproduction and

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Cincotta et al set out to investigate the presence of glucocorticoid receptors in the male and female embryonic germline. They further investigate the impact of tissue-specific genetically induced receptor absence and/or systemic receptor activation on fertility and RNA regulation. They are motivated by several lines of research that report inter and transgenerational effects of stress and or glucocorticoid receptor activation and suggest that their findings provide an explanatory mechanism to mechanistically back parental stress hormone exposure-induced phenotypes in the offspring.

      Strengths:

      A chronological immunofluorescent assessment of GR in fetal and early life oocyte and sperm development.

      RNA seq data that reveal novel cell type specific isoforms validated by q-RT PCR E15.5 in the oocyte.

      2 alternative approaches to knock out GR to study transcriptional outcomes. Oocytes: systemic GR KO (E17.5) with low input 3-tag seq and germline-specific GR KO (E15.5) on fetal oocyte expression via 10X single cell seq and 3-cap sequencing on sorted KO versus WT oocytes both indicating little impact on polyadenylated RNAs

      2 alternative approaches to assess the effect of GR activation in vivo (systemic) and ex vivo (ovary culture): here the RNA seq did show again some changes in germ cells and many in the soma.

      They exclude oocyte-specific GR signaling inhibition via beta isoforms.

      Perinatal male germline shows differential splicing regulation in response to systemic Dex administration, results were backed up with q-PCR analysis of splicing factors.

      Weaknesses:

      COMMENT #1: The presence of a protein cannot be entirely excluded based on IF data

      We agree that very low levels of GR could escape the detection by IF and confocal imaging. We feel that our IF data do match transcript data in our validation studies of the GR KO using (1) qRT-PCR on fetal ovary in Fig 2E and (2) scRNA-seq in germ cells and ovarian soma in Fig S2B.

      COMMENT #2: (staining of spermatids is referred to but not shown).

      You are correct that this statement was based on a morphological identification of spermatids using DAPI morphology. We have performed a co-stain for GR with the spermatocyte marker SYCP3, and the spermatid/spermatozoa marker PNA (Peanut Agglutinin; from Arachis hypogaea) in adult testis tissue. We have updated Figure 4D to reflect this change, as well as the corresponding text in the Results section.

      COMMENT #3: The authors do not consider post-transcriptional level a) modifications also triggered by GR activation b) non-coding RNAs (not assessed by seq).

      We thank the reviewer for raising this very important point about potential post-transcriptional (non-genomic) effects of GR in the fetal oocyte. We agree that while our RNA-seq results show only a minimal transcriptional response, we cannot rule out a non-canonical signaling function of GR, such as the regulation of cellular kinases (as reviewed elsewhere; ref. 1), or the regulation of non-coding RNAs at the post-transcriptional level, and we have amended the discussion to include a sentence on this point. However, while we fully acknowledge the possibility of GR regulating non-genomic cellular signaling, we chose not to explore this option further based on the lack of any overall functional effect on meiotic progression when GR signaling was perturbed, either by KO (Figure 2D) or by dex-mediated activation (Figure S3C).

      COMMENT #4: Sequencing techniques used are not total RNA but either are focused on all polyA transcripts (10x) or only assess the 3' prime end and hence are not ideal to study splicing

      We thank the reviewer for raising this concern; however, this statement is not correct, and we have clarified this point in the Results section to explain how the sequencing libraries of the male germ cell RNA-seq were prepared. We agree that certain sequencing techniques (such as 3’ Tag-Seq) that generate sequencing libraries from a limited portion of an entire transcript molecule are not appropriate for analysis of differential splicing. This was not the case, however, for the RNA-seq libraries prepared from our male germ cells treated with dexamethasone. These libraries were constructed using full-length transcripts that were reverse transcribed using random hexamer priming, thus providing sequencing coverage across the full transcript length. As a result, this type of library preparation should be sufficient for capturing differential splicing events along the length of the transcript. We do, however, point out that these libraries were constructed on polyA-enriched transcripts. Thus, while we obtained full-length transcript coverage for these polyA transcripts, any differential splicing taking place in non-polyadenylated RNA species was not captured. While we are excited about the possibility of exploring GR-mediated splicing regulation of other RNA species in the future, we chose to focus the scope of our current study on polyA mRNA molecules specifically.

      COMMENT #5: The number of replicates in the low input seq is very low and hence this might be underpowered

      While the number of replicates (n=3-4 per condition) is sufficient for performing statistical analysis of a standard RNA-seq experiment, we do acknowledge and agree with the reviewer that low numbers of FACS-sorted germ cells from individual embryos combined with the low input 3’ Tag-Seq technique could have led to higher sample variability than desired. Given that we validated our bulk RNA-seq analysis of GR knockout ovaries using an orthogonal single-cell RNA-seq approach, we feel that our conclusions regarding a lack of transcriptional changes upon GR deletion remain valid.

      COMMENT #6: Since Dex treatment showed some (modest) changes in oocyte RNA - effects of GR depletion might only become apparent upon Dex treatment as an interaction.

      We may be missing the nuance of this point, but our interpretation of an effect that is seen only when the KO is treated with Dex would be that the mechanism would not be autonomous in germ cells but indirect or off-target.

      COMMENT #7: Effects in oocytes following systemic Dex might be indirect due to GR activation in the soma.

      As both the oocytes and ovarian soma express GR during the window of dex administration, we agree that it is possible that the few modest changes seen in the oocyte transcriptome are the result of indirect effects following robust GR signaling in the somatic compartment. However, given that these modest oocyte transcript changes in response to dex treatment did not significantly alter the ability of oocytes to progress through meiosis, we chose not to explore this mechanism further.

      COMMENT #8: Even though ex vivo culture of ovaries shows GR translocation to the nucleus it is not sure whether the in vivo systemic administration does the same.

      AND

      The conclusion that fetal oocytes are resistant to GR manipulation is very strong, given that "only" poly A sequencing and few replicates of 3-prime sequencing have been analyzed and information is lacking on whether GR is activated in germ cells in the systemically dex-injected animals.

      If we understand correctly, the first part refers to a technical limitation and the second part takes issue with our interpretation of the data. For the former, we appreciate this astute insight on the conundrum of detecting a response to systemic dex in fetal oocytes, which is generally monitored by nuclear translocation of GR. As shown in Figure 1A and 1B, GR localization is overwhelmingly nuclear in fetal oocytes of WT animals at E13.5 without addition of any dex. We could not, therefore, use GR translocation as a proxy for activation in response to dex treatment. We instead used ex vivo organ culture to monitor localization changes, as we were able to maintain fetal ovaries ex vivo in hormone-depleted and ligand negative conditions. As shown in Fig. 3, these defined culture conditions elicited a shift of GR to the cytoplasm of fetal oocytes. This led us to conclude that GR is capable of translocating between nucleus and cytoplasm in fetal oocytes, and we were able to counteract this loss in nuclear localization by providing dex ligand in the media.

      We feel that our conclusion that oocytes are resistant to manipulation of glucocorticoid signaling despite their possession of the receptor and capacity for nuclear translocation is substantiated by multiple results: meiotic phenotyping, bulk RNA-seq and scRNA-seq analysis of both GR KO and dex-dosed mice. Our basis for testing the timing and fidelity of meiotic prophase I was the coincident onset of GR expression in female germ cells at E13, and the disappearance of GR in neonatal oocytes as they enter meiotic arrest. The lack of transcriptional changes observed in oocytes in response to dex has made it even more challenging to demonstrate a bona fide “activation” of GR. Observation of a dose-dependent induction of the canonical GR response gene Fkbp5 in the somatic cells of the fetal ovary (Figure S3A and 3A) affirmed that dex traverses the placenta. We agree with the reviewer that it remains possible that dex or GR KO could lead to changes in epigenetic marks or small RNAs in oocytes, and have mentioned these possibilities in the discussion, but we note that even epigenetic perturbations during oocyte development such as the loss of Tet1 or Dnmt1 result in measurable changes in the transcriptome and the timing of meiotic prophase (refs. 2–4).

      COMMENT #9: This work is a good reference point for researchers interested in glucocorticoid hormone signaling fertility and RNA splicing. It might spark further studies on germline-specific GR functions and the impact of GR activation on alternative splicing. While the study provides a characterization of GR and some aspects of GR perturbation, and the negative findings in this study do help to rule out a range of specific roles of GR in the germline, there is still a range of other potential unexplored options. The introduction of the study alludes to implications for intergenerational effects via epigenetic modifications in the germline, however, it does not mention that the indirect effects of reproductive tissue GR signaling on the germline have indeed already been described in the context of intergenerational effects of stress.

      The reviewer raises an excellent point that we have not made sufficient distinction in our manuscript between prior studies of gestational stress and preconception stress and the light that our work may shed on those findings. We have revised the introduction to clarify this difference, and added reference to an outstanding study that identifies glucocorticoid-induced changes to microRNA cargo of extracellular vesicles shed by epididymal epithelial cells that when transferred to mature sperm can induce changes in the HPA axis and brain of offspring (ref. 5). Interestingly, this GR-mediated effect in the epididymal epithelial cells concurs with our observation in the adult testis that GR can be detected in cKit+ spermatogonia but not in subsequent spermatid stages.

      COMMENT #10: Also, the study does not assess epigenetic modifications.

      We agree with the reviewer that exploring the role of GR in regulating epigenetic modifications within the germline is an area of extreme interest given the potential links between stress and transgenerational epigenetic inheritance. As this is a broader topic that requires a more thorough and comprehensive set of experiments, we have intentionally chosen to keep this work separate from the current study, and hope to expand upon this topic in the future.

      COMMENT #11: The conclusion that the persistence of a phenotype for up to three generations suggests that stress can induce lasting epigenetic changes in the germline is misleading. For the reader who is unfamiliar with the field, it is important to define much more precisely what is referred to as "a phenotype". Furthermore, this statement evokes the impression that the very same epigenetic changes in the germline have been observed across multiple generations.

      We see how this may be misleading, and we have amended the text of the introduction and discussion accordingly to avoid the use of the term “phenotype”.

      COMMENT #12: The evidence of the presence of GR in the germline is also somewhat limited - since other studies using sequencing have detected GR in the mature oocyte and sperm.

      As described above in response to Comment #2, we have included immunostaining of adult testis in a revised Figure 4D and shown that we detect GR in PLZF+ and cKIT+ spermatogonia. We also show low/minimal expression in some (SYCP3+) early meiotic spermatocytes, but not in (Lectin+) spermatids. We are not aware of any studies that have shown expression of GR protein in the mature oocyte.

      COMMENT #13: The discussion ends again on the implications of sex-specific differences of GR signaling in the context of stress-induced epigenetic inheritance. It states that the observed differences might relate to the fact that there is more evidence for paternal lineage findings, without considering that maternal lineage studies in epigenetic inheritance are generally less prevalent due to some practical factors - such as more laborious study design making use of cross-fostering or embryo transfer.

      We thank the reviewer for this valid point, and we have amended the discussion section.

      Reviewer #2 (Public Review):

      Summary:

      There is increasing evidence in the literature that rodent models of stress can produce phenotypes that persist through multiple generations. Nevertheless, the mechanism(s) by which stress exposure produces phenotypes are unknown in the directly affected individual as well as in subsequent offspring that did not directly experience stress. Moreover, it has also been shown that glucocorticoid stress hormones can recapitulate the effects of programmed stress. In this manuscript, the authors test the compelling hypothesis that glucocorticoid receptor (GR)-signaling is responsible for the transmission of phenotypes across generations. As a first step, the investigators test for a role of GR in the male and female germline. Using knockouts and GR agonists, they show that although germ cells in male and female mice have GR that appears to localize to the nucleus when stimulated, oocytes are resistant to changes in GR levels. In contrast, the male germline exhibits changes in splicing but no overt changes in fertility.

      Strengths:

      Although many of the results in this manuscript are negative, this is a careful and timely study that informs additional work to address mechanisms of transmission of stress phenotypes across generations and suggests a sexually dimorphic response to glucocorticoids in the germline. The work presented here is well-done and rigorous and the discussion of the data is thoughtful. Overall, this is an important contribution to the literature.

      Reviewer #1 (Recommendations For The Authors):

      RECOMMENDATION #1: To assess whether in females the systemic Dex administration directly activates GR in oocytes it would be great to assess GR activation following Dex administration, and ideally to see the effects abolished when Dex is administered to germline-specific KO animals.

      In regard to the recommendation to assess GR activation in response to systemic dex administration, we refer the reviewer back to our response to Comment #8 highlighting the difficulties of defining and measuring GR activation in the germline.

      This therefore has made it difficult to assess whether any of the modest effects seen in response to dex are abolished in our germline-specific KO animals. While repeating our RNA-seq experiment in dex-dosed germline KO animals would address whether the ~60 genes induced in oocytes are the result of oocyte-intrinsic GR activity, we have decided not to explore this mechanism further due to the overall lack of a functional effect on meiotic progression in response to dex (Figure S3C).

      RECOMMENDATION #2: To further strengthen the link between GR and alternative splicing it would be great to see the dex administration experiment repeated in germline specific GR KO's.

      While we understand the reviewer’s suggestion to explore whether deletion of GR in the spermatogonia is sufficient to abrogate the dex-mediated decreases in splice factor expression, we chose not to explore the details of this mechanism given that deletion of GR in the male germline does not impair fertility (Figure 6).

      RECOMMENDATION #3: I am wondering how much a given reduction in one of the splicing factors indeed affects splicing events. Can the authors relate this to literature, or maybe an in vitro experiment can be done to see whether the level of differential splicing events detected is in a range that can be expected in the case of the magnitude of splicing factor reduction?

      It has been shown in many instances in the literature that a full genetic deletion of a single splice factor leads to impairments in spermatogenesis, and ultimately infertility 6–16. We suspect that dex treatment leads to fewer differential splicing events than a full splice factor deletion, given that dex treatment causes a broader decrease in splice factor expression without entirely abolishing any single splice factor. We have amended the discussion section to include this point. While we share the reviewer’s curiosity to compare the effects of dex vs genetic deletion of splicing machinery on the overall magnitude of differential splicing events, we unfortunately do not have access to mice with a floxed splice factor at this time. While we have considered knocking out one or more splice factors in an ex vivo cultured testis to compare alongside dex treatment, our efforts to date have proven unsuccessful due to high cell death upon culture of the postnatal testis for more than 24 hours.

      RECOMMENDATION #4: It is unclear from the methods whether in germline-specific KO's also the controls received tamoxifen.

      We thank the reviewer for catching this missing piece of information. All control embryos that were assessed received an equivalent dose of tamoxifen to the germline-specific KO embryos. The only difference between cKOs and controls was the presence of the Cre transgene. We have updated the Materials and Methods 3’ Tag-Seq sample preparation section to include the sentence: “Both GRcKO/cKO and control GRflox/flox embryos were collected from tamoxifen-injected dams, and thus were equally exposed to tamoxifen in utero”.

      Reviewer #2 (Recommendations For The Authors):

      I just have only a few comments/questions.

      RECOMMENDATION #5: It is somewhat surprising that GR is expressed in female germ cells, yet there doesn't seem to be a requirement. Is there any indication of what it does? Is the long-term stability of the germline compromised?

      We thank the reviewer for these questions, and we agree that it was quite surprising to find a lack of GR function in the female germline despite its robust expression. The question of whether loss of GR affects the long-term stability of the female germline is interesting, given that similar work in GR KO zebrafish has shown impairments to female reproductive capacity, yet only upon aging 17–19.

      While we have shared interest in this question, technical limitations thus far have prevented us from properly assessing the effect of GR loss in aged females. Homozygous deletion of GR results in embryonic lethality at approximately E17.5. Conditional deletion of GR using Oct4-CreERT2 with a single dose of tamoxifen (2.5 mg / 20g mouse) at E9.5 results in complete deletion of GR by E10.5, although dams consistently suffer from dystocia and are no longer able to deliver viable pups. While using the more active tamoxifen metabolite (4OHT) at 0.1 mg / 20g has allowed for successful delivery, the resulting deletion rate is very poor (see qPCR results in panel below, left). While using half the dose of standard tamoxifen (1.25 mg / 20g mouse) at E9.5 has on rare occasions led to a successful delivery, the resulting recombination efficiency is insufficient (Author response image 1 right panel).

      Author response image 1.

      While a Blimp1-Cre conditional KO model was used to assess male fertility on GR deletion, we believe this model may not be ideal for studying fertility in the context of aging. While Blimp1-Cre is highly specific to the germ cells within the gonad, there are many cell types outside of the gonad that express Blimp1, including the skin and certain cells of the immune system. It is unclear, particularly over the course of aging, whether any effects on fertility seen would be due to an oocyte-intrinsic effect, or the result of GR loss elsewhere in the body. While we hope to explore the role of GR in the aging oocyte further using alternative Cre models in the future, this is currently outside the scope of this work.

      RECOMMENDATION #6: Figure 5b: what is the left part of that panel? Is it the same volcano plot for germ cells as shown in part a but with splicing factors?

      We apologize if this panel was unclear. Yes, the left panel of Figure 5B is in fact the same volcano plot in 5A, labeled with splicing factors instead of top genes. We have edited Figure 5B and corresponding figure legend to clarify this.

      References:

      1. Oakley, R.H., and Cidlowski, J.A. (2013). The biology of the glucocorticoid receptor: New signaling mechanisms in health and disease. J. Allergy Clin. Immunol. 132, 1033–1044. 10.1016/j.jaci.2013.09.007.

      2. Hargan-Calvopina, J., Taylor, S., Cook, H., Hu, Z., Lee, S.A., Yen, M.-R., Chiang, Y.-S., Chen, P.-Y., and Clark, A.T. (2016). Stage-Specific Demethylation in Primordial Germ Cells Safeguards against Precocious Differentiation. Dev. Cell 39, 75–86. 10.1016/j.devcel.2016.07.019.

      3. Hill, P.W.S., Leitch, H.G., Requena, C.E., Sun, Z., Amouroux, R., Roman-Trufero, M., Borkowska, M., Terragni, J., Vaisvila, R., Linnett, S., et al. (2018). Epigenetic reprogramming enables the transition from primordial germ cell to gonocyte. Nature 555, 392–396. 10.1038/nature25964.

      4. Eymery, A., Liu, Z., Ozonov, E.A., Stadler, M.B., and Peters, A.H.F.M. (2016). The methyltransferase Setdb1 is essential for meiosis and mitosis in mouse oocytes and early embryos. Development 143, 2767–2779. 10.1242/dev.132746.

      5. Chan, J.C., Morgan, C.P., Leu, N.A., Shetty, A., Cisse, Y.M., Nugent, B.M., Morrison, K.E., Jašarević, E., Huang, W., Kanyuch, N., et al. (2020). Reproductive tract extracellular vesicles are sufficient to transmit intergenerational stress and program neurodevelopment. Nat Commun 11, 1499. 10.1038/s41467-020-15305-w.

      6. Kuroda, M., Sok, J., Webb, L., Baechtold, H., Urano, F., Yin, Y., Chung, P., Rooij, D.G. de, Akhmedov, A., Ashley, T., et al. (2000). Male sterility and enhanced radiation sensitivity in TLS−/− mice. Embo J 19, 453–462. 10.1093/emboj/19.3.453.

      7. Liu, W., Wang, F., Xu, Q., Shi, J., Zhang, X., Lu, X., Zhao, Z.-A., Gao, Z., Ma, H., Duan, E., et al. (2017). BCAS2 is involved in alternative mRNA splicing in spermatogonia and the transition to meiosis. Nat Commun 8, 14182. 10.1038/ncomms14182.

      8. Li, H., Watford, W., Li, C., Parmelee, A., Bryant, M.A., Deng, C., O’Shea, J., and Lee, S.B. (2007). Ewing sarcoma gene EWS is essential for meiosis and B lymphocyte development. J Clin Invest 117, 1314–1323. 10.1172/jci31222.

      9. O’Bryan, M.K., Clark, B.J., McLaughlin, E.A., D’Sylva, R.J., O’Donnell, L., Wilce, J.A., Sutherland, J., O’Connor, A.E., Whittle, B., Goodnow, C.C., et al. (2013). RBM5 Is a Male Germ Cell Splicing Factor and Is Required for Spermatid Differentiation and Male Fertility. Plos Genet 9, e1003628. 10.1371/journal.pgen.1003628.

      10. Zagore, L.L., Grabinski, S.E., Sweet, T.J., Hannigan, M.M., Sramkoski, R.M., Li, Q., and Licatalosi, D.D. (2015). RNA Binding Protein Ptbp2 Is Essential for Male Germ Cell Development. Mol Cell Biol 35, 4030–4042. 10.1128/mcb.00676-15.

      11. Xu, K., Yang, Y., Feng, G.-H., Sun, B.-F., Chen, J.-Q., Li, Y.-F., Chen, Y.-S., Zhang, X.-X., Wang, C.-X., Jiang, L.-Y., et al. (2017). Mettl3-mediated m6A regulates spermatogonial differentiation and meiosis initiation. Cell Res 27, 1100–1114. 10.1038/cr.2017.100.

      12. Horiuchi, K., Perez-Cerezales, S., Papasaikas, P., Ramos-Ibeas, P., López-Cardona, A.P., Laguna-Barraza, R., Balvís, N.F., Pericuesta, E., Fernández-González, R., Planells, B., et al. (2018). Impaired Spermatogenesis, Muscle, and Erythrocyte Function in U12 Intron Splicing-Defective Zrsr1 Mutant Mice. Cell Reports 23, 143–155. 10.1016/j.celrep.2018.03.028.

      13. Ehrmann, I., Crichton, J.H., Gazzara, M.R., James, K., Liu, Y., Grellscheid, S.N., Curk, T., Rooij, D. de, Steyn, J.S., Cockell, S., et al. (2019). An ancient germ cell-specific RNA-binding protein protects the germline from cryptic splice site poisoning. Elife 8, e39304. 10.7554/elife.39304.

      14. Legrand, J.M.D., Chan, A.-L., La, H.M., Rossello, F.J., Änkö, M.-L., Fuller-Pace, F.V., and Hobbs, R.M. (2019). DDX5 plays essential transcriptional and post-transcriptional roles in the maintenance and function of spermatogonia. Nat Commun 10, 2278. 10.1038/s41467-019-09972-7.

      15. Yuan, S., Feng, S., Li, J., Wen, H., Liu, K., Gui, Y., Wen, Y., and Wang, X. (2021). hnRNPH1 recruits PTBP2 and SRSF3 to cooperatively modulate alternative pre-mRNA splicing in germ cells and is essential for spermatogenesis and oogenesis. 10.21203/rs.3.rs-1060705/v1.

      16. Wu, R., Zhan, J., Zheng, B., Chen, Z., Li, J., Li, C., Liu, R., Zhang, X., Huang, X., and Luo, M. (2021). SYMPK Is Required for Meiosis and Involved in Alternative Splicing in Male Germ Cells. Frontiers Cell Dev Biology 9, 715733. 10.3389/fcell.2021.715733.

      17. Maradonna, F., Gioacchini, G., Notarstefano, V., Fontana, C.M., Citton, F., Valle, L.D., Giorgini, E., and Carnevali, O. (2020). Knockout of the Glucocorticoid Receptor Impairs Reproduction in Female Zebrafish. Int J Mol Sci 21, 9073. 10.3390/ijms21239073.

      18. Facchinello, N., Skobo, T., Meneghetti, G., Colletti, E., Dinarello, A., Tiso, N., Costa, R., Gioacchini, G., Carnevali, O., Argenton, F., et al. (2017). nr3c1 null mutant zebrafish are viable and reveal DNA-binding-independent activities of the glucocorticoid receptor. Sci Rep-uk 7, 4371. 10.1038/s41598-017-04535-6.

      19. Faught, E., Santos, H.B., and Vijayan, M.M. (2020). Loss of the glucocorticoid receptor causes accelerated ovarian ageing in zebrafish. Proc Royal Soc B 287, 20202190. 10.1098/rspb.2020.2190.

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This work presents some valuable information regarding the molecular mechanisms controlling the regeneration of pancreatic beta cells following induced cell ablation. However, the study lacks the critical lineage tracing result to support the conclusion about the origin of the regenerated beta cells. The results of the pharmacological manipulation of CaN signaling are also incomplete. In particular, these manipulations are not cell-specific, making them difficult to interpret, and thus a genetic approach is recommended.

      Public Reviews:

      Reviewer #1 (Public Review):

      Induction of beta cell regeneration is a promising approach for the treatment of diabetes. In this study, Massoz et al. identified calcineurin (CaN) as a new potential modulator of beta cell regeneration by using zebrafish as a model. They also showed that calcineurin (CaN) works together with Notch signaling to promote beta cell regeneration. Overall, the paper is well organized and technically sound. However, some of the evidence seems too weak to support the conclusions.

      Reviewer #2 (Public Review):

      This work started with transcriptomic profiling of ductal cells to identify the upregulation of calcineurin in the zebrafish after beta-cell ablation. By suppressing calcineurin with its chemical inhibitor cyclosporin A and expressing a constitutively active form of calcineurin ubiquitously or specifically in ductal cells, the authors found that inhibited calcineurin activity promoted beta-cell regeneration transiently while ectopic calcineurin activity hindered beta-cell regeneration in the pancreatic tail. They also showed similar effects in the basal state but only when it was within a particular permissive window of Notch activity. To further investigate the roles of calcineurin in the ductal cells, the authors demonstrated that calcineurin inhibition additionally induced the proliferation of the ductal cells in the regenerative context or under a limited level of Notch activity. Interestingly, the enhanced proliferation was followed by a depletion of ductal cells, suggesting that calcineurin inhibition would exhaust the ductal cells. Based on the data, the authors proposed a very attractive and intriguing model of the role of calcineurin in maintaining the balance of the progenitor proliferation and the endocrine differentiation. However, the conclusions of this paper are only partially supported by the data as some evidence from the data remains suggestive.

      (1) In the transcriptomic profiling, genes differentially regulated in the ablated adults could be solely due to the chemical effects of metronidazole instead of the beta-cell ablation. A control group without ins:NTR-mCherry but treated with metronidazole is necessary to exclude the side effects of metronidazole.

      We believe that it is unlikely that the differential regulation observed is due to metronidazole rather than the beta cell loss. This experimental strategy has proven successful in well-published studies to identify regulators of beta cell regeneration in the zebrafish larvae. Importantly, the candidates identified in these studies were subsequently functionally validated in mammalian models (Lu et al. 2016, Karampelias 2021). Moreover, in our study, we also used another chemical compound, nifurpirinol (Bergemann et al., 2018), to ablate the beta cells. Regardless of whether we employed metronidazole or nifurpirinol for beta cell ablation, our results consistently indicate a notable involvement of calcineurin. Of note, nifurpirinol is commonly used in fishkeeping without reported toxicity to the global health of the fish.

      (2) Although it has been shown that the pancreatic duct is a major source of the secondary islets in the pancreatic tail in previous studies, there is no direct evidence showing the cyclosporin A-induced cells share the source in this manuscript. Without any proper lineage tracing work, the origin of those cyclosporin A-induced cells cannot be concluded.

      Our experimental setting is similar to the one described in Ninov et al. 2013, where lineage tracing experiments demonstrate an increase in beta cell formation in the pancreatic tail, with the new beta cells originating from the pancreatic ducts. In our study, we performed the same experiment with the addition of CsA and showed more ductal cell proliferation (Figure 5G) followed by a 19% increase of beta cell regeneration compared to nonregenerative conditions (Figure 2B). It is unlikely that the additional 19% of regenerated beta cells under CaN inhibition come from a different source than the first 68%.

      On the other hand, the acinar cells cannot be considered as another source of regenerated beta cells, as they are not able to form beta cells unless they are artificially reprogrammed (Maddison et al., 2012). Therefore, the only other potential source of regenerated beta cells is the endocrine compartment. However, at the stage where we performed beta cell ablation, there are no endocrine cells in the pancreatic tail. Moreover, there is no evidence that secondary islets could come from the principal islet; they are tightly associated with the ducts and differentiate from ductal cells (Mi et al., 2023).

      Importantly, we demonstrated that overexpression of CaN specifically in the pancreatic ducts prevents beta cell regeneration. The effect of CaN is therefore intrinsic to the ducts. Moreover, we showed that CsA increases beta cell formation when Notch signalling is repressed. Given that Notch signalling is known to act on the ductal cell population, this again strongly suggests that CsA enhances beta cell formation from the ducts.

      Together, this evidence strongly supports the notion that the cyclosporin-induced beta cells originate from the ductal cells.

      (3) It is interesting to see an increase of beta cells in the primary islet after cyclosporin A treatment (Supplemental Fig 2B). However, it remains unclear if their formation shares the same mechanism with the newly formed beta cells in the pancreatic tail.

      There are indeed several sources of beta cell regeneration in the primary islet. However, a recent study showed that the contribution of alpha cells to regeneration is minor and that the main contributors are ductal and sst1.1 cells (Mi et al., 2023). In our previous publication, we indeed showed that a major source of beta cells in the principal islet is the delta 1.1 cell population. Those sst1.1 cells begin to express insulin and are therefore named ‘bihormonal’ (Carril et al., 2022). We tested whether this population is affected by CsA treatment, and we show below that CsA does not affect bi-hormonal cell formation (Figure 2D supplemental). These new results suggest that the CsA-mediated increase of beta cells in the principal islet arises from the ductal cells, as observed in the tail. These results were added to the manuscript as Figure 2D supplemental.

      Author response image 1.

      Tg (sst1.1:GFP); Tg (ins:NTR*-mCherry) larvae were treated at 3dpf with NFP 4µM to induce beta cell ablation. Larvae were then treated with CsA 1µM from 4 to 6 dpf (or with DMSO as control), prior to fixation and analysis of bi-hormonal cells in the principal islet at 6dpf.

      (4) The conclusion of the effect of cyclosporin A on the endocrine progenitors (Line 175) is not convincing because the data cannot distinguish the endocrine progenitors from the insulin-expressing cells. Indeed, Figure 2E shows that neurod1+ cells are fewer than ins+ cells (Figure 2D) in the pancreatic tail at 10 dpt, suggesting that all or at least the majority of neurod1+ cells are already ins+.

      The neurod1+ cell population indeed includes both endocrine progenitor cells and differentiated endocrine cells. However, we would like to point out that the timing of the analysis is essential to reach our conclusion. When we treat with CsA, we show an increase of neurod1+ cells already at 4dpt. At this time point, no hormone-producing cells can yet be detected (Figure 2E). Those additional neurod1+ cells are therefore endocrine progenitors and not beta cells. This result shows that CaN inhibition induces pro-endocrine cell formation in regenerative conditions.

      At 10dpt, the neurod1+ cell population includes beta cells as well as endocrine progenitor cells. We agree that the way the data are presented in Figures 2D and 2E can be confusing. Those two figures come from two separate experiments; the number of beta cells in Figure 2D therefore cannot be compared to the number of neurod1+ cells in Figure 2E. Indeed, from one experiment to another, the efficiency and rate of regeneration can vary, independently of calcineurin. To clarify, we added the number of beta cells regenerated in the experiment of Figure 2E (see Author response image 2, in red). As you can see, in this experiment regeneration was a bit slower than usual.

      Author response image 2.

      Tg (neurod1:GFP); Tg (ins:NTR*-mCherry) larvae were treated at 3dpf with NFP 4µM to induce beta cell ablation. Larvae were then treated with CsA 1µM from 4 to 6 dpf (or with DMSO as control), prior to fixation and analysis of GFP+ cells (in grey, pink, dark grey and green), and of mCherry+ cells for the ablated + CsA condition (in red), from 2 to 10 dpf.

      (5) Figure 5D shows a significant loss of nkx6.1+ cells in the combined treatment group but there is no direct evidence showing this was a result of differentiation as the authors suggested. This cell loss also outnumbered the increase in ins+ cells (Figure 4D). The cell fates of these lost cells are still undetermined, and the authors did not demonstrate if apoptosis could be a reason of the cell loss.

      Firstly, as you can see on the graphs, we encountered very high variability between individuals within the same condition. We decided to show this variability by presenting the raw data. This high variability could partially explain the differences that you underline. Moreover, we would like to point out that, independently of CaN inhibition, the progenitor loss (nkx6.1+ cells) outnumbers the gain of beta cells. Indeed, on average there is a loss of 29% (41 GFP+) of the nkx6.1+ cells and a gain of only 6 beta cells after Notch inhibitory treatment, the other progenitor cells having differentiated into other endocrine cell types (pro-endocrine, alpha, delta). In the combined treatment (Notch and CaN inhibitors), we decreased the number of progenitor cells by 50%, i.e. 21% (20 cells) more than without CaN inhibitor. However, we increased the number of regenerated beta cells two-fold (6 cells to 12 cells). In brief, the substantial loss of progenitor cells could be explained by precocious differentiation into the pro-endocrine and endocrine cell types. It is therefore expected that the number of regenerated beta cells does not match the number of progenitor cells lost, whether in the presence or absence of CaN inhibition.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Major concerns:

      (1) The evidence indicating that the proliferating ductal cells differentiate into beta cells is weak. They should use lineage tracing, or immunostaining of other marker genes, to confirm that.

      The experiment in Figure 5A-D is a short-term tracing experiment and should have been presented as such in the manuscript. After LY411575 (Notch inhibitor) and CsA treatments at 3dpf, we exposed the larvae to EdU at 4dpf for 8 hours (Figure 5A). We showed that EdU is incorporated in dividing ductal cells at 4dpf (Figure 5C) and that 2 days later there are newly formed beta cells that are EdU+ (see Author response image 3). To reinforce our conclusion, the image below will be added to the manuscript.

      Author response image 3.

      Tg (nkx6.1:GFP); Tg (ins:NTR*-mCherry) larvae were treated at 3dpf with both CsA 1µM and LY411575 5µM. At 4dpf, the larvae were exposed to EdU 4mM for 8 hours, before analysis at 6 dpf.

      (2) To inhibit the CaN and Notch pathways, they used only pharmacological approaches; genetic approaches should be used to obtain stronger evidence.

      We employed two distinct inhibitors specifically targeting calcineurin (CsA and FK506) for CaN inhibition. While these inhibitors have distinct chemical structures and potential non-specific effects, they both yield the same result of increased beta cell formation under Notch repression (see Figure 4D and Figure 4B in the supplementary data). This convergence of outcomes strongly suggests that the observed effect is primarily attributable to the specific inhibition of calcineurin.

      Furthermore, we complemented our inhibitor-based approach with a genetic strategy involving CaN overexpression (see Figure 3). Notably, the overactivation of CaN resulted in a reduction of beta cell regeneration. Given that this genetic approach generated an effect contrary to that achieved with the inhibitors, it provides robust support for our model, which postulates that calcineurin plays a critical role in the regulation of beta cell regeneration (see Figure 3, panels C-E).

      As for Notch inhibition, previously published data from our laboratory compared the effects of the Notch inhibitor (LY411575) and genetic approaches (mib mutant and transgenic line) on pro-endocrine cell (ascl1b+) and ductal cell (nkx6.1+) formation. This study showed that both the Notch inhibitor (LY411575) and Notch repression using genetic approaches produce the same effect: an induction of pro-endocrine cell formation. As the specificity of this inhibitor has been validated (Ghaye et al., 2015), we did not consider a genetic approach necessary.

      (3) The most enriched pathways among the up-regulated genes were DNA replication and cell cycle, which suggested that these genes are more important for the duct cell proliferation, how is Calcineurin related to these pathways, such as regulating the genes important for proliferation?

      The transcriptomic data presented in this manuscript suggest that the ductal cells undergo a strong proliferative response after beta cell ablation. This is in accordance with our experimental data showing activation of ductal proliferation after beta cell ablation (Ghaye et al., 2015) and data from this manuscript (Figure 1 I-J).

      Calcineurin is a well-known regulator of the cell cycle, and can either promote or repress the cell cycle depending on the cell type. For example, stressing the cell provokes an entry of calcium and subsequent CaN activation, which results in cell cycle arrest (Leech et al. 2020). Nevertheless, depending on the cell type, CaN can be either necessary or deleterious to cell proliferation (Goshima et al. 2019; Masaki and Shimada 2022). The intriguing dual role of CaN in the cell cycle is well illustrated in β cell regeneration. While CaN should be repressed to enable ductal progenitor amplification and subsequent endocrine differentiation, CaN is then necessary for β cell function and for their replication (Dai et al. 2017; Heit et al. 2006). Moreover, CaN is related to cellular senescence, and CaN function is important for proper fin regeneration in zebrafish.

      (4) It is hard to understand why they pick up the pathway of cellular senescence signature for the duct cell progenitor neogenesis? Moreover, among these senescence genes, many genes are cell cycle regulators.

      In response to beta cell ablation, the ductal cells undergo a strong proliferative response, as shown in our previous data (Ghaye 2015). It was therefore not surprising that many differentially expressed genes are cell cycle regulators. On the other hand, the cellular senescence signature was surprising. Indeed, senescence is usually associated with cell cycle arrest and aging. However, recent studies showed that cellular senescence is required for proper development and regeneration. We therefore wanted to investigate this pathway and more particularly the function of calcineurin, which can either promote or repress the cell cycle in different cell types (see comment above).

      (5) The RNA-seq data were obtained from adult fish, while the authors use larvae to explore the CaN functions; the conclusions may differ in adult fish. Moreover, it is unclear whether CaN is increased when the beta cells are ablated in young larvae.

      We decided to first perform functional experiments in the larvae, as this model enables the quantification of beta cell regeneration from the ducts in the pancreatic tail. However, to validate our results in non-developmental stages, we performed experiments in juveniles (2 months old) and adults. CsA treatments in juvenile zebrafish recapitulated the results obtained in larvae (Figure 2B and Figure 6A-C). Moreover, we showed that CaN overactivation delayed glycemia recovery after ablation in adults (Figure 6D-E), which is in accordance with an impaired regeneration. Altogether, these results strongly suggest that CaN acts as a regulator of beta cell regeneration in both the juvenile/adult and larval stages.

      Concerning the expression of CaN in the zebrafish larvae, we tried to detect the level of CaN in the different experimental conditions by in situ hybridization. However, we were not able to detect it using this technique. We also tried immunostaining with an anti-phospho-NFATc3 Ser165 polyclonal antibody (Invitrogen), but this antibody does not seem to work in zebrafish. Finally, we tried to sort ductal cells at larval stage to perform a transcriptomic analysis, but we were unable to collect enough ductal cells to proceed further. Indeed, our staining experiment showed that there are only around 150 ductal cells (nkx6.1+, Figure 5D) at this stage.

      (6) Beta cell regeneration in young larvae usually completes within ~5 days in the principal islet. Please also show the beta cell number in the principal islet (PI) during recovery after ablation.

      We did show beta cell regeneration in the principal islet in Figure 2A-B supplemental. While new beta cells appear quickly in this islet (Carril, Massoz, Dupont et al., 2023), the principal islet has not yet fully recovered at 5 dpt.

      (7) Since the study did not show the CaN level in Fig. 3, it is hard to know whether CaN is actually expressed.

      In Figure 3B, using Tg(hsp70:GFP-CaNCA), it is indeed not possible to see CaN expression at 10 dpt, as the heat shocks induce CaNCA overexpression only transiently. However, the transient expression was detected in live animals shortly after the heat shocks. On the other hand, in the transgenic line Tg(UAS:GFP-CaNCA); Tg(cftr:Gal4), GFP-CaNCA is continuously expressed, allowing us to show CaNCA expression in the pancreatic ducts (Figure 3).

      (8) In Fig. 6D and 6E, did these drug treatments change the glucose level in non-ablated fish?

      As you can see below, the CaN inhibitor CsA does not affect the glycemia of the fish in non-regenerative conditions.

      Author response image 4.

      Glycemia of non-ablated fish, 3 days after drug treatment.

      (9) The logic of writing in Results is very hard to understand.

      We proofread the paper in an effort to clarify it.

      Minor concerns,

      (1) Make a scheme for ablation and RNA-seq, and indicate the age of the fish used in Fig. 1.

      We added the scheme in Figure 1 supplemental.

      (2) In Fig. 1G, the mCherry+ cells indicated by the two arrows are hard to see in the non-ablated fish.

      One arrow was indeed misplaced; we moved it and tried to improve the intensity of the red signal. However, the single cells are indeed small and can be difficult to see.

      (3) In Fig. 6, it is hard to tell that the islets indicated by the arrows are small islets (up to 5 cells), how they compare with big islets, and how they were defined as small. Moreover, some of these islets are almost invisible.

      We now show a close up of a portion of the pancreatic tail and show the beta cells with arrows only in this picture, to enhance clarity.

      Reviewer #2 (Recommendations For The Authors):

      (1) This manuscript needs more proofreading and polishing to increase its readability.

      We proofread the manuscript and changed some paragraphs for greater clarity.

      (2) The extensive use of words like "modulate" or "regulate" sometimes makes the text ambiguous as the effect is not stated directly and clearly.

      We rewrote some parts of the text and tried to avoid using “regulate” as often.

      However, as we used both repression and over-activation of CaN, we still use words such as “regulate” to state general conclusions on the function of CaN.

      (3) The list of individual differentially regulated genes after the beta-cell ablation in the RNAseq seems missing. This list could be interesting and helpful for other researchers.

      We added it.

      (4) In Figure 1D, "modulated" genes are shown but were they all upregulated like those in Figure 1A? The modulation should be indicated more clearly (e.g. up- or down-regulated) in the figure. The authors can use different colours to illustrate that.

      Done.

      (5) Is Figure 2D showing the same data extracted from Figure 2B? Does Figure 2D add any information to the data?

      No, it does not add data. We added Figure 2D to better visualise the increase at 10 dpt.

      (6) In the y-axis of Figure 3E, it should be "mCherry".

      It already is. We checked all the axes again to be sure they are correct.

      (7) Line 219, "Figure 4E supplemental" instead of "Figure 4D supplemental"

      Done.

      (8) Line 266, "ablated juveniles" instead of "ablated larvae"

      Done. Thank you for noticing these mistakes.

      (9) In Figure 6A, many mCherry+ cells are hardly visible and there are some greyish white signals in the images that are supposed to show the mCherry channel only. What are those grey signals?

      There is no channel showing grey in the picture. We improved the overall quality of these pictures and now show close-ups to improve the figure.

      (10) In Figure 6D and 6E, CaNCA overexpression had a significant effect on the glycemia. But did the overexpression affect the beta cell formation or regeneration?

      We showed that CaNCA overexpression did not affect beta cell formation in the absence of regeneration in the larvae (Figure 3E). Moreover, it does not affect the glycemia of the fish in non-regenerative conditions (Author response image 5). As for regenerative conditions, CaN overexpression decreased the regeneration in the larvae (Figure 3E).

      Author response image 5.

      Glycemia of Tg(UAS:GFP-CaNCA); Tg(cftr:Gal4) fish, overexpressing CaNCA, compared to control fish, in non-regenerative conditions.

      (11) The role of calcineurin seems transient (e.g. Figure 2B and 4E) and does not play a significant role in long term. It would be interesting to see if long-term/repeated treatments of calcineurin inhibitors and overexpression/knockout of important members of calcineurin signaling would affect the pool of progenitors in long term.

      We were also interested in the long-term consequences of CaN overexpression. Our overexpression tool Tg(UAS:CaNCA) allows us to address this question, as CaN is overexpressed permanently. We assessed the structure of the ducts and the number of beta cells in transgenic larvae and did not see any duct defects, whether in a regenerative context or not. On the other hand, we showed in this manuscript that the CaN effect is specific to regenerative conditions. Consequently, it is unlikely that repeated treatments long after the ablation would continue to affect beta cell formation and the progenitor pool.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Public Reviews:

      “The study could also valuably explore what kinds of genes experienced what forms of expression evolution. A brief description of GO terms frequently represented in genes which showed strong patterns of expression evolution might be suggestive of which selective pressures led to the changes in expression in the C. bursa-pastoris lineage, and to what extent they related to adaptation to polyploidization (e.g. cell-cycle regulators), compensating for the initial pollen and seed inviability or adapting to selfing (endosperm- or pollen-specific genes), or adaptation to abiotic conditions.”

      We did not include a gene ontology (GO) analysis initially, as we did not have a clear expectation of which GO terms would be enriched among the genes that are differentially expressed between resynthesized and natural allotetraploids. Even if we only consider adaptive changes, the modifications could occur in various aspects, such as stabilizing meiosis, adapting to the new cell size, reducing hybrid incompatibility and adapting to self-fertilization, and each of these modifications involves numerous biological processes and molecular functions. As we could construct post-hoc stories for too many GO terms, extrapolating at this stage would have limited value and could be misleading.

      Nonetheless, ours is not the only study to compare newly resynthesized and established allopolyploids. GO terms that are repeatedly revealed by this type of exploratory analysis may give hints for future studies. For this reason, we have now reported the results of a simple GO analysis.

      Recommendations for the authors: please note that you control which, if any, revisions, to undertake

      The majority of concerns from reviewers and the reviewing editor are in regards to the presentation of the manuscript; that the framing of the manuscript does not help the general reader understand how this work advances our knowledge of allopolyploid evolution in the broad sense. The manuscript may be challenging to read for those who aren't familiar with the study system or the genetic basis of polyploidy/gene expression regulation. Further, it is difficult to understand from the introduction how this work is novel compared to the recently published work from Duan et al and compared to other systems. Because eLife is a journal that caters to a broad readership, re-writing the introduction to bring home the novelty for the reader will be key.

      Additionally, the writing is quite technical and contains many short-hands and acronyms that can be difficult to keep straight. Revising the full text for clarity (and additionally not using acronyms) would help highlight the findings for a larger audience.

      Reviewer #1 (Recommendations For The Authors):

      Most of my suggestions on this interesting and well-written study are minor changes to clarify the writing and the statistical approaches.

      The use of abbreviations throughout for both transcriptional phenomena and lines is logical because of word limits, but for me as a reader, it really added to the cognitive burden. Even though writing out "homoeolog expression bias" or "hybridization-first" every time would add length, I would find it easier to follow and suspect others would too.

      Thank you for this suggestion. Indeed, using fewer uncommon acronyms or shorthands should increase the readability of the text for a broader audience. Now, in most places, we refer to “Sd/Sh” and “Cbp” as “resynthesized allotetraploids” and “natural allotetraploids”, respectively. We have also replaced most occurrences of the acronyms for transcriptional phenomena (ELD, HEB and TRE) with full phrases, unless there are extra attributes before them (such as “Cg-/Co-ELD” and “relic/Cbp-specific ELD”).

      It would be helpful to include complete sample sizes to either a slightly modified Figure 1 or the beginning of the methods, just to reduce mental arithmetic ("Each of the five groups was represented by six "lines", and each line had six individuals" so there were 180 total plants, of which 167 were phenotyped - presumably the other 13 died? - and 30 were sequenced).

      The number 167 only applied to floral morphological traits (“Floral morphological traits were measured for all five groups on 167 plants…”); the exact total sample size differed for other traits. The total sample sizes of the other traits have now also been added to the beginning of the second paragraph of the methods.

      For this study, 180 seedlings were transplanted from Petri dishes to soil, but 8 seedlings died right after transplanting, seemingly because of mechanical damage and insufficient moistening. Later phenotyping (2020.02-2020.05) was also disrupted by the COVID-19 pandemic, and some individuals were not measured as we missed the right life stages. Specifically, 5 individuals were missing for floral morphological traits (sepal width, sepal length, petal width, petal length, pistil width, pistil length, and stamen length), 30 for pollen traits, 1 for stem length, and 2 for flowering time. As for seed traits, we only measured individuals with more than ten fruits, so apart from the reasons mentioned above, individuals that were self-incompatible and had received insufficient hand-pollination were also excluded. We spotted another mistake during the revision: two individuals with floral morphological measurements had no positional information (tray ID). These measurements were likely mis-sampled or mislabeled, and were therefore excluded from analysis. We assumed most of these missing values resulted from random technical mistakes and were not directly related to the measured traits.

      In general, the methods did a thorough job of describing the genomics approaches but could have used more detail for the plant growth (were plants randomized in the growth chamber, can you rule out block/position effects) and basic statistics (what statistical software was used to perform which tests comparing groups in each section, after the categories were identified).

      When describing the methods, mention whether the plants; this should be straightforward as a linear model with position as a covariate.

      Data used in the present study and in a previously published work (Duan et al., 2023) were different subsets of a single experiment. For this reason, we spent fewer words describing shared methods in this manuscript and tried to summarize only the methods that were essential for understanding the current paper. But as you have pointed out, we did omit many important details that should have been kept. We have now added some description and a table (Supplementary file 1) to the “Plant material” section to explain the randomization, and added more information about the software used for statistical tests to the “Phenotyping” section.

      Although we did not mention it in the present manuscript, we used a randomized block design for the experiment (Author response image 1).

      Author response image 1.

      Plant positions inside the growth chamber. Plants used in the present study and Duan et al. (2023) were different subsets of a single experiment. The entire experiment had eight plant groups, including the five plant groups used in the present study (diploid C. orientalis (Co2), diploid C. grandiflora (Cg2), “whole-genome-duplication-first” (Sd) and “hybridization-first” (Sh) resynthesized allotetraploids, and natural allotetraploids, C. bursa-pastoris (Cbp)), as well as three plant groups that were only used in Duan et al. (2023): tetraploid C. orientalis (Co4), tetraploid C. grandiflora (Cg4) and diploid hybrids (F). Each of the eight plant groups had six lines and each line was represented by six plants, resulting in 288 plants (8 groups x 6 lines x 6 individuals = 288 plants). The 288 plants were grown in 36 trays placed on six shelves inside the same growth chamber. Each tray had exactly one plant from each of the eight groups, and the positions of the eight plants within each tray (A-H) were randomized with the random.shuffle() method in Python (Supplementary file 1). The position of the 36 trays inside the growth chamber (1-36) was also random, and the positions of all trays were shuffled once again 28 days after germination (randomized with RAND() and sorting in a Microsoft Excel spreadsheet). (a) Plant distribution; (b) An example of one tray; (c) A view inside the growth chamber, showing the six shelves.
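
      As a rough illustration of the layout described in this legend, the following Python sketch places one plant from each of the eight groups at positions A-H within each of 36 trays and then shuffles the tray positions. It is only a sketch under stated assumptions: the function name `randomize_layout`, the seed argument and the data structures are hypothetical, and neither the authors' actual script nor the Excel-based tray shuffling is reproduced here.

```python
import random

# Group labels as given in the legend; positions A-H within a tray.
GROUPS = ["Co2", "Cg2", "Sd", "Sh", "Cbp", "Co4", "Cg4", "F"]
POSITIONS = "ABCDEFGH"
N_TRAYS = 36  # 8 groups x 6 lines x 6 individuals = 288 plants, 8 per tray


def randomize_layout(seed=None):
    """Return (layout, tray_slots) for a randomized block design.

    layout[tray][position] is the group placed at that position;
    tray_slots is a random ordering of the 36 trays in the chamber.
    All names here are hypothetical, chosen for illustration only.
    """
    rng = random.Random(seed)
    layout = {}
    for tray in range(1, N_TRAYS + 1):
        order = GROUPS[:]   # every tray (block) holds each group exactly once
        rng.shuffle(order)  # randomize positions within the tray
        layout[tray] = dict(zip(POSITIONS, order))
    tray_slots = list(range(1, N_TRAYS + 1))
    rng.shuffle(tray_slots)  # randomize where each tray sits in the chamber
    return layout, tray_slots


layout, tray_slots = randomize_layout(seed=1)
print(layout[1])
print(tray_slots[:6])
```

      Because every tray contains each group exactly once, tray acts as a complete block, which is what later allows tray ID to be used as a blocking factor.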

      With the randomized block design and one round of shuffling, positional effects are very unlikely to bias the comparison among the five plant groups. The main risk of not adding position to the statistical model is an increase in error variance and a decrease in the statistical power for detecting a group effect. As we had already observed significant among-group variation in all phenotypic traits (p-value < 2.2e-16 for group effect in most tests), further increasing statistical power is not our primary concern. In addition, during the experiment we did not notice any obvious differences in plant growth related to position. Although we could have added more variables to account for potential positional effects (tray ID, shelf ID, position within a tray, etc.), adding variables with little effect may reduce statistical power due to the loss of degrees of freedom.

      Because of the round of random shuffling, position cannot easily be added as a single continuous variable. We have now redone all the statistical tests on phenotypic traits and included tray ID as a categorical factor (Figure 2-Source Data 1). In general, the results were similar to those of the models without tray ID. The F-values of the group effect changed only slightly, and p-values were almost unchanged in most cases (still < 2.2e-16). The tray effect (df=35) was not significant in most tests and was significant only for petal length (p-value=0.0111), sepal length (p-value=0.0242) and the number of seeds in ten fruits (p-value=0.0367). As expected, position (tray ID) had a limited effect on phenotypic traits.
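
      To illustrate what adding tray ID as a categorical factor does to the degrees of freedom, here is a minimal pure-Python sketch of an additive two-way ANOVA for a complete block design (group + tray). It is not the authors' analysis: it assumes one measurement per group per tray with no missing values, uses synthetic data with hypothetical group effects and no true tray effect, and the function name `rcbd_anova` is made up for this example.

```python
import random


def rcbd_anova(table):
    """Additive ANOVA for a complete block design.

    table[tray][group] holds one measurement; every cell must be present
    (an assumption that real data with missing values would violate).
    """
    n_blocks, n_groups = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n_blocks * n_groups)
    group_means = [sum(row[g] for row in table) / n_blocks for g in range(n_groups)]
    block_means = [sum(row) / n_groups for row in table]

    ss_group = n_blocks * sum((m - grand) ** 2 for m in group_means)
    ss_block = n_groups * sum((m - grand) ** 2 for m in block_means)
    ss_total = sum((v - grand) ** 2 for row in table for v in row)
    ss_error = ss_total - ss_group - ss_block

    df_group, df_block = n_groups - 1, n_blocks - 1
    df_error = df_group * df_block  # residual df after fitting both factors
    ms_error = ss_error / df_error
    return {
        "df_group": df_group,
        "df_block": df_block,
        "df_error": df_error,
        "F_group": (ss_group / df_group) / ms_error,
        "F_block": (ss_block / df_block) / ms_error,
    }


# Synthetic example: 36 trays x 5 groups, hypothetical group effects, no tray effect.
rng = random.Random(0)
group_effect = [0.0, 1.0, 2.0, 3.0, 4.0]
table = [[10.0 + group_effect[g] + rng.gauss(0, 0.5) for g in range(5)]
         for _ in range(36)]
result = rcbd_anova(table)
print(result)
```

      With 36 trays the block factor uses 35 degrees of freedom, matching the df=35 quoted for the tray effect; in this synthetic case the group F-value is large while the tray F-value stays near 1.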

      Figure 2 - I assume the numbers at the top indicate sample sizes but perhaps add this to the figure caption.

      Statistical power depends on both the total sample size and the sample size of each group, especially the group with the fewest observations. We lost a different number of measurements for each phenotypic trait, and for pollen traits the loss was notable, so we chose to show sample sizes above each group to increase transparency. Since we had five different sets of sample sizes (for floral morphological traits, stem length, days to flowering, pollen traits and seed traits, respectively), it would be cumbersome to list all 25 numbers in the figure caption, and it could be hard for readers to match the sample sizes with the results. For this reason, we would like to keep the sample sizes in the figure, and we have now modified the legend to clarify that the numbers above the groups are sample sizes.

      ’The trend has been observed in a wide range of organisms, including ...’ - perhaps group Brassica and Raphanobrassica into one clause in the sentence, since separating them out undermines the diversity somewhat.

      Indeed, it is very strange to put “cotton” between two representatives from Brassicaceae. Now the sentence is changed to “… including Brassica (Wu et al., 2018; Li et al., 2020; Wei et al., 2021) and Raphanobrassica (Ye et al., 2016), cotton (Yoo et al., 2013)…”

      The diagrams under the graph in Figure 4B are particularly helpful for understanding the expression patterns under consideration! I appreciated them a lot!

      Thank you for the comment. We also feel the direction of expression level dominance is convoluted and hard to remember, so we adopted the convention of showing the directions with diagrams.

      Reviewer #2 (Recommendations For The Authors):

      The science is very interesting and thorough, so my comments are mostly meant to improve the clarity of the manuscript text:

      • I found it challenging to remember the acronyms for the different gene expression phenomena and had to consistently cross-reference different parts of the manuscript to remind myself. I think using the full phrase once or twice at the start of a paragraph to remind readers what the acronym stands for could improve readability.

      Thank you for this reasonable suggestion. We have now replaced most occurrences of acronyms with the full phrases.

      • There are some technical terms, such as "homoeologous synapsis" and "disomic inheritance", which I think are under-defined in the current text.

      Indeed, these terms were not well defined before being used in the manuscript. We have now added a brief explanation for each term.

      • "Under the joint action of these forces, allopolyploid subgenomes are further coordinated and degenerated, and subgenomes are often biasedly fractionated" This sentence has some unclear terminology. Does "coordinated" mean co-adapted, co-inherited, or something else? Is "biasedly fractionated" referring to biased inheritance or evolution of one of the parental subgenomes?

      We apologize for not using accurate terms. With “coordinated” we meant to emphasize that the evolution of both homoeologs depends on selection on the total expression of both homoeologs, and on both relative and absolute dosages, which may have shifted away from their optima after allopolyploidization. “Co-evolved” or “co-adapted” might be a better word.

      But the term “biased fractionation” has been commonly used to refer to the phenomenon that genes from one subgenome of polyploids are preferentially retained during diploidization (Woodhouse et al., 2014; Wendel, 2015). Instead of inventing a new term, we prefer to keep the same term for consistency, so that readers can link our findings with the numerous studies in this field. The sentence has now been changed to “Under the joint action of these forces, allopolyploid subgenomes are further co-adapted and degenerated, and subgenomes are often biasedly retained, termed biased fractionation”.

      • There are a series of paragraphs in the results, starting with "Resynthesized allotetraploids and the natural Cbp had distinct floral morphologies", which consistently reference Figure 1 where they should be referencing Figure 2.

      Thank you for spotting this mistake! Now the numbers have been corrected.

      • ‘The number of pollen grains per flower decreased in natural Cbp’ this wording implies it's the effect of some experimental treatment on Cbp, rather than just measured natural variation.

      Yes, it is not scientifically precise to say this in the Results section, especially when describing details of results. We meant that, assuming resynthesized allopolyploids are a good approximation of the initial state of natural allotetraploid C. bursa-pastoris, our results indicate that the number of pollen grains has decreased in natural C. bursa-pastoris. But this is an implication rather than an observation, so the sentence is better rewritten as “Natural allotetraploids had less pollen grains per flower.”

      • ‘The percentage of genes showing complete ELD was altogether limited but doubled between resynthesized allotetraploid groups and natural allotetraploids’ for clarity, I would suggest revising this to something like "doubled in natural allotetraploids relative to resynthesized allotetraploids".

      Thank you for the suggestion. The sentence has been revised as suggested.

      • I'm not sure I understand what the difference is between expression-level dominance and homeolog expression bias. It seems to me like the former falls under the umbrella of the latter.

      Expression-level dominance and homeolog expression bias are easily confused, but they are conceptually independent. One gene could have expression-level dominance without any homeolog expression bias, or strong homeolog expression bias without any expression-level dominance. The concepts were well explained in Grover et al., (2012) with nice figures.

      Expression level dominance compares the total expression level of both homoeologs in allopolyploids with the expression of the same gene in parental species, and judges whether the total expression level in allopolyploids is only similar to one of the parental species. The contributions from different homoeologs are not distinguished.

      In contrast, homoeolog expression bias compares the relative expression levels of the two homoeologs in allopolyploids, with no implication for the total expression of both homoeologs.

      Let the expression level of one gene in parental species X and Y be e(X) and e(Y), respectively. And let the expression level of x homoeolog (from species X) and y homoeolog (from species Y) in allopolyploids be e(x) and e(y), respectively.

      Then a (complete) expression level dominance toward species X means: e(x)+e(y)=e(X) and e(x)+e(y)≠e(Y);

      While a homoeolog expression bias toward species X means: e(x) > e(y), or e(x)/e(y) > e(X)/e(Y), depending on the definition adopted by each study.
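
      The independence of the two concepts can be checked with a small Python sketch that applies the definitions above to made-up expression values. It is purely illustrative: the function names, the 10% tolerance used in place of a statistical test for "equal" expression, and all numbers are assumptions, and only the simple e(x) > e(y) definition of bias is implemented.

```python
def expression_level_dominance(eX, eY, ex, ey, tol=0.1):
    """Return "X", "Y" or None for complete expression level dominance.

    The summed homoeolog expression e(x)+e(y) is compared with each parent.
    `tol` is a hypothetical relative tolerance standing in for a proper
    statistical test of equality.
    """
    total = ex + ey
    like_X = abs(total - eX) <= tol * eX
    like_Y = abs(total - eY) <= tol * eY
    if like_X and not like_Y:
        return "X"
    if like_Y and not like_X:
        return "Y"
    return None


def homoeolog_expression_bias(ex, ey, tol=0.1):
    """Return "x", "y" or None using the simple definition e(x) > e(y)."""
    if ex > ey * (1 + tol):
        return "x"
    if ey > ex * (1 + tol):
        return "y"
    return None


# Dominance toward X with no homoeolog bias.
print(expression_level_dominance(100, 50, 50, 50), homoeolog_expression_bias(50, 50))
# Homoeolog bias toward x with no dominance.
print(expression_level_dominance(200, 60, 80, 40), homoeolog_expression_bias(80, 40))
# Dominance toward X combined with bias toward y (opposite directions).
print(expression_level_dominance(100, 50, 30, 70), homoeolog_expression_bias(30, 70))
```

      The three cases show dominance without bias, bias without dominance, and the two pointing in opposite directions.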

      Both expression-level dominance and homeolog expression bias have been widely studied in allopolyploids (Combes et al., 2013; Li et al., 2014; Yoo et al., 2014; Hu & Wendel, 2019). As the two phenomena could be in opposite directions, and may be caused by different mechanisms, we think adopting the definitions in Grover et al., (2012) and distinguishing the two concepts would facilitate communication.

      • Is it possible to split up the results in Figure 7 to show which of the two homeologs was lost (i.e. orientalis vs. grandiflora)? Or at least clarify in the legend that these scenarios are pooled together in the figure?

      Perhaps using acronyms without explanation made the figure titles hard to understand, but in the original Figure 7 the losses of the two homoeologs were shown separately. Figure 7a,c showed the loss of the C. orientalis homoeolog (“co-expression loss”), and Figure 7b,d showed the loss of the C. grandiflora homoeolog (“cg-expression loss”). The legends have now been modified to explain the figure.

      • The paragraph starting with "The extant diploid species" is too long, should probably be split into two paragraphs and edited for clarity.

      The whole paragraph was used to explain why the resynthesized allotetraploids could be a realistic approximation of the early stage of C. bursa-pastoris with two arguments:

      1) The further divergence between C. grandiflora and C. orientalis after the formation of C. bursa-pastoris should be small compared to the total divergence between the two parental species; 2) The mating systems of the real parental populations were most likely the same as today. The two arguments have now been separated into two paragraphs, and the second paragraph has been shortened.

      • "On the other hand, the number of seeds per fruit" implies this is evidence for an alternative hypothesis, when I think it's really just more support for the same idea.

      “On the other hand” was used to contrast the reduced number of pollen grains with the increased number of seeds in natural allotetraploids. As both changes are typical of the selfing syndrome, the two indeed support the same idea. We replaced “On the other hand” with “Moreover”.

      • "has become self-compatible before the formation": "has become" should be "became".

      The tense of the word has been changed.

      • ‘If natural C. bursa-pastoris indeed originated from the hybridization between C. grandiflora-like outcrossing plants and C. orientalis-like self-fertilizing plants, the selfing syndrome in C. bursa-pastoris does not reflect the instant dominance effect of the C. orientalis alleles, but evolved afterward.’ This sentence should be closer to the end of the paragraph, after the main morphological results are summarized.

      Thank you for the suggestion. The paragraph is indeed more coherent after moving the conclusion sentence.

      References

      Combes, M.C., Dereeper, A., Severac, D., Bertrand, B. & Lashermes, P. (2013) Contribution of subgenomes to the transcriptome and their intertwined regulation in the allopolyploid Coffea arabica grown at contrasted temperatures. New Phytologist, 200, 251–260.

      Grover, C.E., Gallagher, J.P., Szadkowski, E.P., Yoo, M.J., Flagel, L.E. & Wendel, J.F. (2012) Homoeolog expression bias and expression level dominance in allopolyploids. New Phytologist, 196, 966–971.

      Hu, G. & Wendel, J.F. (2019) Cis – trans controls and regulatory novelty accompanying allopolyploidization. New Phytologist, 221, 1691–1700.

      Li, A., Liu, D., Wu, J., Zhao, X., Hao, M., Geng, S., et al. (2014) mRNA and Small RNA Transcriptomes Reveal Insights into Dynamic Homoeolog Regulation of Allopolyploid Heterosis in Nascent Hexaploid Wheat. The Plant Cell, 26, 1878–1900.

      Wendel, J.F. (2015) The wondrous cycles of polyploidy in plants. American Journal of Botany, 102, 1753–1756.

      Woodhouse, M.R., Cheng, F., Pires, J.C., Lisch, D., Freeling, M. & Wang, X. (2014) Origin, inheritance, and gene regulatory consequences of genome dominance in polyploids. Proceedings of the National Academy of Sciences of the United States of America, 111, 5283–5288.

      Yoo, M.J., Liu, X., Pires, J.C., Soltis, P.S. & Soltis, D.E. (2014) Nonadditive Gene Expression in Polyploids. Annual Review of Genetics, 48, 485–517. https://doi.org/10.1146/annurev-genet-120213-092159

    1. Author Response

      The following is the authors’ response to the original reviews.

      We greatly appreciate the editor and reviewers’ careful and professional assessment of this manuscript. We are delighted with the reviewers’ instructive comments and suggestions. We have tried to address the raised points comprehensively. The reviewers’ scrutiny has helped us immensely to discuss and present our work extensively and properly. We are grateful for the reviewers’ efforts and insights. The detailed responses are listed here.

      Recommendations for the authors

      (1) The intuition behind the model is not properly explained, i.e., the derivation of Eqs. 1-2 and the biological meaning of the AA/OO logic modes. A different notation could be helpful.

      We thank the reviewers for this comment, and agree that the interpretation of our model in the manuscript was indeed in need of improvement. We have incorporated this suggestion into the manuscript. For clarity, we have replaced the original AA/OO notation with AND-AND/OR-OR, and hope that the new notation is helpful for interpreting our work.

      In general, considering the diverse audience including those with experimental background, we feel that it is essential to present this manuscript in a more digestible manner. We therefore retain the entire derivation of Eqs. 1-2 in the supplementary method. We have added a qualitative introduction to model derivation and molecular biological significance underlying different logic motifs (AND-AND/OR-OR) in the revised manuscript. Please refer to Page 5 of the revised manuscript, lines 161-167 (see below).

      “X and Y are TFs in the CIS network. n1 and n2 are the coefficients of molecular cooperation. k1-k3 in Eq1 and k4-k6 in Eq2 represent the relative probabilities for possible configurations of binding of TFs and CREs (Fig2.A). d1 and d2 are degradation rates of X and Y, respectively. Here, we considered a total of four CRE’s configurations as shown in Figure 2A (i.e., TFs bind to the corresponding CREs or not, 2^2 = 4). Accordingly, depending on the transcription rates (i.e., r0x, r1, r2, r3 in Eq1, similarly in Eq2) of each configuration, we can model the dynamics of TFs in the Shea-Ackers formalism[1, 2].

      Thus, the distinct logic operations (AND/OR) of two inputs (e.g., activation by X itself and inhibition by Y) can be further implemented by assigning corresponding profile of transcription rates in four configurations (Fig2.A). From the perspective of molecular biology, the regulatory logics embody the complicated nature of TF regulation that TFs function in a context-dependent manner. Considering the CIS network, when X and Y bind respective CREs concurrently, whether the expression of target gene is turned on or off depends on the different regulatory logics (specifically, off in the AND logic and on in the OR logic; Fig2.A). Notably, instead of exploring the different logics of one certain gene[3, 4], we focus on different combinations of regulatory logics due to dynamics in cell fate decisions is generally orchestrated by GRN with multiple TFs.”
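
      For readers less familiar with the Shea-Ackers formalism, the quoted description can be made concrete with a short Python sketch for the X equation. This is a generic textbook form and not the authors' Eq. 1, which is given in their supplementary methods; the functional form, the function names, the rate profiles `AND_RATES` and `OR_RATES`, and all parameter values are assumptions chosen so that the both-bound configuration is off under AND logic and on under OR logic, as the quoted text states.

```python
def production_rate(x, y, r, k, n1=2, n2=2):
    """Shea-Ackers-style production rate of X (assumed generic form).

    r = (r0, r1, r2, r3): transcription rates when the CREs are unbound,
        bound by X only, bound by Y only, or bound by both.
    k = (k1, k2, k3): relative weights of the three bound configurations.
    """
    w_x = k[0] * x ** n1
    w_y = k[1] * y ** n2
    w_xy = k[2] * x ** n1 * y ** n2
    numerator = r[0] + r[1] * w_x + r[2] * w_y + r[3] * w_xy
    return numerator / (1.0 + w_x + w_y + w_xy)


def dx_dt(x, y, r, k, d1=1.0, n1=2, n2=2):
    """Production minus first-order degradation of X."""
    return production_rate(x, y, r, k, n1, n2) - d1 * x


# Hypothetical rate profiles for "activated by X" combined with "not inhibited by Y".
AND_RATES = (0.0, 1.0, 0.0, 0.0)  # on only when X is bound and Y is not
OR_RATES = (1.0, 1.0, 0.0, 1.0)   # off only when Y is bound and X is not
K = (1.0, 1.0, 1.0)

# With both factors abundant, the both-bound configuration dominates.
print(production_rate(100.0, 100.0, AND_RATES, K))  # close to 0
print(production_rate(100.0, 100.0, OR_RATES, K))   # close to 1
```

      The equation for Y would follow by symmetry with its own rates, weights k4-k6 and degradation rate d2.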

      (2) More clearly specify the used parameters and how these are chosen. This would be helpful to get a more quantitative grasp of the conditions that they compare.

      We appreciate the reviewers pointing out unspecified parts in the main text. We have now included related discussion in the revised manuscript. Please refer to Page 5 of the revised manuscript, lines 179-181 (“Benchmarking the Boolean models with different logic motifs (Fig2.B), we reproduced the geometry of the attractor basin in the continuous models resembling those represented by corresponding Boolean models (Fig2.C; see Methods).”).

      We would like to highlight that the Boolean models with different logic motifs (Fig. 2B) explicitly display the difference in state spaces (i.e., attractor basins). Moreover, as the focus of this work is on the role of regulatory logics in cell fate decisions, we consider it reasonable to specify the geometry of the landscape based on hints from the Boolean models. Therefore, we reason that it is intuitive and reliable to assign parameter values by mapping our ODE models (Eqs. 1-2) to the corresponding Boolean models qualitatively (refer to the statement in our original manuscript, Page 5, lines 162-163, “With appropriate parameters, we are able to reproduce the Boolean-like attractor basin in the continuous models”). In producing Figures 2-5, parameters were set heuristically, without a systematic search. However, to draw general conclusions, such as the "trade-offs between progression and accuracy" and the presence of the fully-connected stage, we sampled a substantial number of parameter sets to ensure statistically robust findings.
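
      What an attractor basin means in a Boolean model can be shown with a minimal Python sketch that enumerates the four states of a two-gene network under each logic motif. The update rules are an assumption inferred from the verbal description (each gene is activated by itself and inhibited by the other, combined with AND or OR) and a synchronous update is assumed; they are not taken from the authors' Fig. 2B, and the function names are hypothetical.

```python
from itertools import product


def update(state, logic):
    """One synchronous update of (X, Y) under the assumed rules."""
    x, y = state
    if logic == "AND-AND":
        # On only if self-activated AND not inhibited by the other gene.
        return (int(x and not y), int(y and not x))
    if logic == "OR-OR":
        # On if self-activated OR not inhibited by the other gene.
        return (int(x or not y), int(y or not x))
    raise ValueError(f"unknown logic: {logic}")


def attractor_basins(logic, max_steps=10):
    """Map each fixed point to the list of initial states that reach it.

    Assumes every trajectory reaches a fixed point within max_steps,
    which holds for the two rule sets above.
    """
    basins = {}
    for state in product((0, 1), repeat=2):
        current = state
        for _ in range(max_steps):
            nxt = update(current, logic)
            if nxt == current:
                break
            current = nxt
        basins.setdefault(current, []).append(state)
    return basins


for logic in ("AND-AND", "OR-OR"):
    print(logic, attractor_basins(logic))
```

      Under these assumed rules both motifs have three fixed points, but the state with both genes on falls into the (0, 0) basin under AND-AND, whereas the state with both genes off falls into the (1, 1) basin under OR-OR, which is the kind of difference in basin structure the response refers to.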

      (3) Include the explanation of how the nullclines and basins shown in the figures (e.g., Fig. 2C, Fig. 4C, Fig. 4F, etc.) are calculated.

      We thank the reviewers for this suggestion. We have incorporated this into the legend of corresponding figures when first mentioned in the main text. Please refer to Page 7 of the revised manuscript, lines 217-223 (see below).

      “Fig2.C:

      (C) State spaces of the AND-AND (top panel) and OR-OR (bottom panel) motifs in ODE models. Dark and red lines represent the nullclines of the two transcription factors. Stable steady states (SSSs) are denoted as orange dots. Unstable steady states (USSs) are denoted as white dots. Each axis represents the concentration of one transcription factor, whose units are arbitrary. Blue, green and purple areas in state spaces indicate attractor basins representing LX, S and LY, respectively. The color of each point in state space was assigned according to the attractor it finally enters in the deterministic models (Eq1, Eq2). These annotations are also used in Figures 3-7.”

      (4) Clarity on the decisions in the work is needed. For example, the "introduction" of asymmetry of the noise levels (as stated in line 215) appears completely arbitrary. The reason behind it can be guessed in the following paragraph, but the reader shouldn't have to guess.

      We agree entirely with the reviewers’ comment. Indeed, this should have been stated more explicitly. The motivation for incorporating asymmetry in the noise levels stems from our endeavor to mimic the inherent biological variability in gene expression within a cell population. We have adjusted the manuscript to better convey the motivation for investigating asymmetric noise level. Please refer to Page 8 of the revised manuscript, lines 237-238 (“In biological systems, it is unlikely that the noise level of different genes is kept perfectly the same.”).

      (5) Arbitrary and/or out-of-context jargon is used throughout the manuscript, making it hard to read and follow what the authors mean in some cases. For example, "temporal fully-connected stage" is used for the first time in line 290, and the term is not explained either in the main text or in the manuscript. Similarly, the reference to a Boolean-like and Boolean model (line 163 and Figure 1) without clarifying if this is just an analogy or if a formal model is built, nor the utility and implications of this comparison. Another problem related to jargon occurs on line 291, where the authors talk about "parameter sensibility", but such analysis (as it is normally understood in the field) is never performed; the authors perform a parameter exploration and make some general conclusions about the parameter space, but that is different than a parameter sensitivity analysis.

      We thank the reviewers for this comment, as it has prompted us to better clarify our manuscript. We have reviewed the manuscript and made the necessary adjustments to improve its clarity. We do hope that this revision meets the reviewers’ expectations on the clarity and comprehensiveness of our analysis.

      Regarding the term "temporal fully-connected stage", we realized that it was slightly vague and in need of improvement. Instead, we now use “transitory fully-connected stage” in the revised manuscript to underline the brief emergence of this particular stage. Please refer to Page 11 of the revised manuscript, line 323.

      We thank the reviewers for pointing out the lack of clarity concerning the Boolean models. We have now amended the manuscript to make this implicit point explicit. Please refer to Page 5 of the revised manuscript, lines 179-181 (“Benchmarking the Boolean models with different logic motifs (Fig2.B; see Methods), we reproduced the geometry of the attractor basin in the continuous models resembling those represented by corresponding Boolean models (Fig2.C; see Methods).”). Specifically, we employed the Boolean models (Fig.2B) as a reference to help us heuristically evaluate the applicability of the parameters used in the ODE models. The Boolean models are therefore built formally, and the corresponding update rules are listed in Fig.2A (refer to the middle row of the table, called “Logic Function”, now also noted in the legend of Fig.2B, Page 7, lines 213-214). Nevertheless, we do make use of the analogy between the attractor basins of the Boolean models and those of the ODE models (refer to Fig.2B-C). Accordingly, we used the term “Boolean-like” to describe the landscape presented by the continuous models (Eqs. 1-2; refer to the statement in our original manuscript, Page 5, lines 162-163, “With appropriate parameters, we are able to reproduce the Boolean-like attractor basin in the continuous models”).

      We thank the reviewers for this valuable comment, and agree that the usage of “parameter sensibility” was in need of adjustment. We have now amended the manuscript. Please refer to Page 10 of the revised manuscript, lines 318-321 (see below).

      “To demonstrate generality, we globally screened 6,213 groups of parameter sets under the AND-AND motif, and this logic-dependent intermediate stage can be observed for 82.7% of them (see Methods; Table S1), indicating little dependence on a particular parameter setting (1.8% in the OR-OR motif).”

      (6) Probably related just to the language clarity (i.e., the abuse of jargon), but we don't understand the conclusion on lines 296-298.

      We thank the reviewers for this comment. We have adjusted the manuscript accordingly. Please refer to Page 11 of the revised manuscript, lines 323-327 (see below). We hope the reviewers agree with our attempt to map this stage of the landscape onto a particular stage in cell fate decisions.

      “Furthermore, this transitory fully-connected stage is located between the fate-undetermined stage (Fig4.C top panel) and the fate-determined stage (Fig4.C 3rd panel), comparable to the initiation (or activation) stage before lineage commitment in experimental observations [5-7]. Therefore, we suspected that the robust fully-connected stage in the AND-AND motif may correspond to a specific period in cell fate decisions.”

      (7) The so-called "solution landscape" in Figure 4E needs to be better explained.

      We thank the reviewers for this comment. We have introduced the concept of solution landscape, which is a pathway map consisting of all stationary points and their connections, in lines 196-198 of the revised manuscript (see below).

      “Furthermore, we introduced the solution landscape method. A solution landscape is a pathway map consisting of all stationary points and their connections, which can describe different cell states and the transition paths between them [82-84].”

      In Figure 4E, we added a detailed explanation of the solution landscape for the AND-AND motif. Specifically, it describes a hierarchical structure including one 2-saddle (yellow triangle), three 1-saddles (crimson X-cross sign), and three attractors (green dot). The layer of 1-saddles is represented by a blue translucent plane, and the bottom layer is the flow field diagram. The connections from 2-saddle to 1-saddles and from 1-saddles to the attractors are represented by red and blue lines, respectively. The arrow and color of the heatmap correspond to the flow direction and the length of the acceleration at each point in the state space.

      (8) Table S1 is not properly annotated, and then it is impossible to interpret how it supports the observations in the paragraph in lines 342-342.

      We appreciate the reviewers’ useful feedback. We have refined the annotations of all tables in our manuscript (Table S1-3). Please refer to “Supplementary Table” in resubmitted files.

      Specifically, we randomly collected 6,231 sets of parameters for the AND-AND motif and 6,682 sets for the OR-OR motif (k1-k6 in Eq1 and Eq2; refer to Page 6 of the revised supplementary method, see below).

      “First, to collect parameter sets with 3 SSSs, we used Latin hypercube sampling (LHS) to screen k-series parameters symmetrically (i.e., k1 = k4, k2 = k5, k3 = k6) ranging from 0.001 to 5 both in the AND-AND and OR-OR motifs. We ultimately collected 6,231 sets for the AND-AND motif and 6,682 sets for the OR-OR motifs (Table S1).”
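
      The symmetric sampling step quoted above can be sketched as follows. This is a minimal, standard-library illustration and not the code used for the manuscript: the function names are hypothetical, sampling is assumed to be uniform on a linear scale over [0.001, 5], and the subsequent filter for parameter sets with 3 SSSs (which requires the ODE models) is not shown.

```python
import random


def latin_hypercube(n_samples, n_dims, low, high, seed=0):
    """Draw n_samples points so that each dimension has one point per stratum."""
    rng = random.Random(seed)
    columns = []
    for _ in range(n_dims):
        # One uniform draw inside each of n_samples equal-width strata of [0, 1).
        points = [(i + rng.random()) / n_samples for i in range(n_samples)]
        # Shuffle so that strata are paired at random across dimensions.
        rng.shuffle(points)
        columns.append([low + p * (high - low) for p in points])
    return [tuple(col[i] for col in columns) for i in range(n_samples)]


def symmetric_k_sets(n_samples, low=0.001, high=5.0, seed=0):
    """Sample (k1, k2, k3) and mirror them so that k1 = k4, k2 = k5, k3 = k6."""
    return [
        {"k1": a, "k2": b, "k3": c, "k4": a, "k5": b, "k6": c}
        for a, b, c in latin_hypercube(n_samples, 3, low, high, seed)
    ]


candidates = symmetric_k_sets(10)
print(candidates[0])
```

      Each candidate set would then be passed to a steady-state check, and only sets with the required number of SSSs retained.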

      To analyze the sequence of vanishing SSSs, we further retained only the parameter sets for which 2 SSSs remained as ux increased (corresponding to Eq3 in the revised manuscript, Page 10, line 293). This yielded a collection of 6,207 sets for the AND-AND motif and 6,634 sets for the OR-OR motif. Based on these parameter settings, we checked whether the observations (refer to Page 13, lines 377-378, “The distinct sequences of attractor basin disappearance as ux increasing can be viewed as a trade-off between progression and accuracy.”) are artifacts of a particular parameter choice.

      (9) The flow in Section 5 needs to be reorganised. For instance, it is not clear which question the authors are addressing in line 395, or how the proposed approach answers the question stated in lines 381-382.

      We greatly thank the reviewers for pointing this out, and acknowledge that Section 5 was in need of improvement. We have now amended the manuscript to make this implicit reasoning explicit. Please refer to Page 15 of the revised manuscript, lines 426-430 (see below).

      “In prior sections, we systematically investigated two logic motifs under the noise- and signal-driven modes in silico. With various combinations of logic motifs and driving forces, features of fate-decision behaviors were characterized by computational models. Next, we asked whether the computational observations can be mapped onto real biological systems. Discerning different logic motifs and driving modes is a prerequisite for answering this question.

      To this end, we first evaluated the performance of different models, specifically in simulating the process of stem cells differentiating towards LX (Fig6.A).”

      (10) There are two important weak points for the successful classification of the regulatory logic of real gene expression data as presented in the manuscript: (1) the small number of time-points in the datasets and clear peaks in gene expression heterogeneity cannot be identified, and (2) it is not always clear whether cell differentiation really exclusively relies on a CIS network, and which genes constitute it. These limitations should be solved or at least discussed in the manuscript.

      We thank the reviewer for this comment. First, we agree entirely that datasets with more time points would be more amenable to identifying trends in gene expression variation. We have made a concerted effort to search for such datasets, but unfortunately few are publicly available. Specifically, to apply our computational framework, the datasets need to fulfill the following three criteria: (i) sampling at multiple time points (as many as possible); (ii) to illustrate and validate our findings clearly and representatively, the cell fate decisions in the biological system should follow the classical binary tree-like pattern, i.e., one stem cell (or progenitor) fate and two downstream cell fates; (iii) the core GRN circuits orchestrating the fate-decision process have been experimentally confirmed (or at least clearly supported). We have also extended the discussion to include the above points and to note explicitly the limitations of the datasets used. Please refer to Page 25 of the revised manuscript, lines 762-766 (see below).

      “The gene expression datasets analyzed here are only available for a limited number of time points. Though they meet the need for discerning trends, it is evident that the application to the datasets with more time points will yield clearer and less ambiguous changing trends to support the conclusions of this paper more generally.”

      Regarding the second point, we acknowledge that the CIS network may not always be the core module in every fate-decision case (although, to our knowledge, this can be assumed in many cases, especially those with a binary tree-like pattern). For applicability and potential relevance to our intended readership, we developed the models and drew our conclusions primarily based on the CIS topology because of its representativeness. We intend to incorporate diverse topologies (such as mutual activation with self-activation, the feed-forward loop, etc.) into the computational framework presented here in the near future. Additionally, we have incorporated this point into the discussion in the revised manuscript. Please refer to Page 25 of the revised manuscript, lines 766-769 (see below).

      “Notwithstanding the fact that the CIS network is prevalent in fate-decision programs, there are other topologies of networks that serve important roles in the cell-state transitions, like feed-forward loop, etc. The framework presented in this work should further incorporate diverse network motifs in the future.”

      As the reviewers note, even given the CIS network, we may not be sure which genes constitute it in some cases. We agree that extending our framework to mine key regulators is an interesting question. We are also enthusiastic about recent work showing how to nominate core factors from high-throughput data[8, 9]. Of note, in the last section of our manuscript, titled “The chemical-induced reprogramming of human erythroblasts (EBs) to induced megakaryocytes (iMKs) is the signal-driven fate decisions with an OR-OR-like motif”, we leveraged patterns of temporal expression variance to filter out key regulators (Fig7.F and H). We thus underline the potential of mining genes comprising core GRN circuits through expression variance. Nevertheless, as the focus of the present paper is on the role of regulatory logic in cell fate decisions, we feel it is beyond the scope of the present article to develop our results further on this point. Instead, we have included a discussion of the case in which the genes comprising the CIS network are not defined. Please refer to Page 23 of the revised manuscript, lines 685-687 (see below).

      “Notably, if the genes constituting the CIS network are not specified, we can conversely leverage the patterns of temporal expression variance to nominate key regulators in a model-guided manner.”

      (11) The models used in Figure S5 are never clearly described.

      We thank the reviewers for pointing this out. We have now introduced the settings of the models used in Figure S5 more clearly in the legend (see below).

      Two logic motifs with the noise-driven mode (FigS5.A, see below):

      Author response image 1.

      “Initial values were identical to the attractor of the S fate in Figure 2C (SSSs in green attractor basins). Simulation was performed 1000 times for each pseudo-time point, with each temporal state (from left to right) recorded as a dot on the plot. Top panel: Noise level of X (σx) is set to 0.21, and σy is 0.09. Bottom panel: Noise level of Y (σy) is set to 0.21, and σx is 0.09. Red arrow represents the direction of fate transitions of S to LX. Other than the added white noise, parameters were identical to those in Figure 2C.”

      Two logic motifs with the signal-driven mode (FigS5.B, see below):

      Author response image 2.

      “Initial values were identical to the attractor of the S fate in Figure 2C (SSSs in green attractor basins). Top panel: Noise level of X (σx) and Y (σy) are both set to 0.06. Simulation was performed 1000 times, with each final state recorded as a dot on the plot. Parameter ux switched from 0 to 0.09 (0, 0.045, 0.09, from left to right). Bottom panel: Noise level of X (σx) and Y (σy) are both set to 0.05. Simulation was performed 1000 times, with each final state recorded as a dot on the plot. Parameter ux switched from 0 to 0.24 (0, 0.12, 0.24, from left to right). Red arrow represents the direction of fate transitions of S to LX. Other model parameters were identical to those in Figure 2C.”

      (12) Up until Section 5, "noise levels" have been used to refer to an input/parameter in the model. Here it is assumed as an emergent property. Are the authors talking about the variance in expression (e.g., see line 398)? Is it defined as the coefficient of variation? Clarity is essential to interpret the observations in this section, e.g., "different driving modes change in the patterns of noise rather than expression levels" (lines 399-400).

      We greatly appreciate the reviewers pointing out this ambiguity. The term “noise level” was indeed used to refer to the strength of the noise in the models in Sections 1-4. To classify different logic motifs under the two driving forces, we needed a practical metric that can be quantified from data, and we found population-level gene expression variance (i.e., “noise level” in line 398), defined as the coefficient of variation, to be useful. For clarity, we have substituted “expression variance” for “noise level” in Sections 5-6. We have amended the manuscript accordingly, and hope this revision will be helpful for interpreting our results. Please refer to Page 15 of the revised manuscript.
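
      For concreteness, the metric described above can be sketched as the coefficient of variation (standard deviation divided by mean) of one gene's expression across cells at each time point. The function names and toy values below are illustrative only, and the use of the population rather than the sample standard deviation is an assumption.

```python
from statistics import mean, pstdev


def coefficient_of_variation(expression_values):
    """Return std / mean for one gene across the cells of one time point."""
    mu = mean(expression_values)
    if mu == 0:
        raise ValueError("CV is undefined when the mean expression is zero")
    # Population standard deviation; a sample SD would be an equally valid choice.
    return pstdev(expression_values) / mu


def cv_time_course(samples_by_time):
    """Compute the CV at each time point, given one list of cells per time point."""
    return [coefficient_of_variation(cells) for cells in samples_by_time]


# Toy example: three time points, eight cells each (made-up numbers).
toy = [
    [5, 5, 5, 5, 5, 5, 5, 5],
    [2, 4, 4, 4, 5, 5, 7, 9],
    [4, 5, 5, 5, 5, 5, 5, 6],
]
print(cv_time_course(toy))
```

      A time course of this kind is what the monotonic versus nonmonotonic comparison in Section 5 operates on.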

      (13) "Pulse-like behaviour" is used in an arbitrary way, not as it is normally used in the field. Moreover, we consider this jargon expression does not contribute to the understanding of the paper. (The authors probably meant "discrete transitions" vs "gradual transitions".)

      We appreciate the reviewers’ valuable feedback regarding our use of the term “Pulse-like behavior”. We agree with the reviewers’ statement, and acknowledge that terminology of noise level’s patterns between different driving modes (noise-driven vs signal-driven; refer to Section 5 in our manuscript) was in need of improvement.

      After careful consideration, we decided to adopt the terms “monotonic transitions” and “nonmonotonic transitions” to describe the trends in noise level, which contrasts more clearly the distinct temporal noise patterns that the two driving forces produce in cell fate decisions. We hope that the revised terminology will aid the interpretation of our work. Please refer to Page 15 of the revised manuscript.

      (14) The temporal resolution of the scRNAseq datasets that the authors used is too low to unambiguously distinguish a discrete pattern of gene expression heterogeneity from a rising profile. This limitation needs to be at least acknowledged in the text. Alternatively, the authors might want to identify more recent datasets with higher time resolution.

      We appreciate the reviewers’ insightful suggestions. We agree that datasets with higher time resolution would allow trends in gene expression variation to be identified less ambiguously. We have made a concerted effort to search for such datasets, but unfortunately few are publicly available. Specifically, to apply our computational framework, the datasets need to fulfill the following three criteria: (i) sampling at multiple time points (as many as possible); (ii) to illustrate and validate our findings clearly and representatively, the cell fate decisions in the biological system should follow the classical binary tree-like pattern, i.e., one stem cell (or progenitor) fate and two downstream cell fates; (iii) the core GRN circuits orchestrating the fate-decision process have been experimentally confirmed (or at least clearly supported). Nevertheless, we recognize that this limitation should be mentioned in the paper, so we have extended the discussion to include the above points. Please refer to Page 25 of the revised manuscript, lines 762-766 (see below).

      “The gene expression datasets analyzed here are only available for a limited number of time points. Though they meet the need for discerning trends, it is evident that the application to the datasets with more time points will yield clearer and less ambiguous changing trends to support the conclusions of this paper more generally.”

      (15) In the case of embryonic stem cell differentiation, an additional complication is that this protocol yields heterogeneous cell type mixtures, whereas the authors' simulations usually are designed to give differentiation towards a single cell type. This difference makes it difficult to compare measures of gene expression heterogeneity between simulations and the experimental system, and makes the inference of regulatory logic questionable.

      We thank the reviewers for this valuable comment and realize that we were not clear enough in the manuscript regarding the case of embryogenesis. In the biological system devised by Semrau et al[10], mouse embryonic stem cells (mESCs) differentiate into two lineages simultaneously, just as the reviewers mention. We took note of this additional complication and performed additional simulations in the two logic motifs with increasing noise levels of genes X and Y, as presented in Fig.S6E (see below).

      Author response image 3.

      “(E) Time courses of the coefficient of variation in expression levels of X and Y genes in silico during differentiation under the noise-driven mode. Initial values were set to the attractors of the S fate in Figure 2C (SSSs in green attractor basins). Top panel: Noise level of X (σx) and Y (σy) are both set to 0.14. Bottom panel: Noise level of X (σx) and Y (σy) are both set to 0.1. Stochastic simulation was performed 1000 times for each pseudo-time point.”

      Given the noise-driven mode, we further employed the expression pattern of Gbx2-Tbx3 circuit to heuristically infer the logic motif.

      (16) In contrast to the hematopoiesis example, the authors do not focus on a specific gene regulatory circuit with the ESC dataset. How their approach is possible on genome-wide data needs to be discussed.

      We thank the reviewers for this comment. Indeed, the core GRN orchestrating the fate-decision process reported by Semrau et al[10] is not fully elucidated. We here focus on the Gbx2-Tbx3 circuit (Fig.6H, Fig.S6D). These two TFs were filtered out from 22 candidate TFs and suggested as potential key regulators in the original paper[10]. Accordingly, at this point we followed the original paper’s statement.

      Regarding the extension to biological systems without specific gene regulatory circuits, we have included a discussion of the possibility that the genes comprising the CIS network are not defined. Please refer to Page 23 of the revised manuscript, lines 685-687 (see below).

      “Notably, if the genes constituting the CIS network are not specified, we can conversely leverage the patterns of temporal expression variance to nominate key regulators in a model-guided manner.”

      (17) [In supplemental material, pp.1] Possible typo: "In our word, we considered a GRN comprised...".

      Thanks for spotting this typo. We have amended it in the revised supplemental method (refer to Page 1 of the revised supplementary method).

      (18) [In supplemental material, pp.1] In Eqs. (1), the notation for the function HX([X]) implies that HX only depends on X, leaving the combinatorial regulation out. HX([X],[Y]) would be more general and accurate.

      Thanks for pointing this out. We have incorporated this suggestion into the manuscript. Please refer to Page 1 of the revised supplementary method.

      (19) [In supplemental material, pp.1] There are several works that have shown that the Hill coefficient is rarely representative of the number of binding elements. The model can be more general. See, for example, «Santillán, Moisés. "On the Use of the Hill Functions in Mathematical Models of Gene Regulatory Networks." Mathematical Modelling of Natural Phenomena 3, no. 2 (October 22, 2008): 85-97. https://doi.org/10.1051/mmnp:2008056.» and «Nam, Kee-Myoung, Rosa Martinez-Corral, and Jeremy Gunawardena. "The Linear Framework: Using Graph Theory to Reveal the Algebra and Thermodynamics of Biomolecular Systems." Interface Focus 12, no. 4 (June 10, 2022): 20220013. https://doi.org/10.1098/rsfs.2022.0013.»;

      We thank the reviewer for drawing our attention to this and highlighting the above works. Indeed, this is important information to include in the manuscript. We have incorporated this suggestion into the revised supplemental method (refer to Page 1 of the revised supplementary method). These references have now been included in the revised supplemental method (refer to references [2]-[3]).

      (20) [Minor] The configuration labels can be confusing, especially the AA, which is rather an AND NOT gate.

      We thank the reviewers for this comment. For clarity, we have substituted AND-AND/OR-OR for original expression of AA/OO, and hope that new notations are helpful for interpreting our work.

      (21) [Minor] Very low printing quality in Figure 1.

      Thanks for the feedback regarding the printing quality of Figure 1. We have made the necessary adjustments to improve its quality. We have also ensured that all other figures in the manuscript meet the required standards.

      (22) [Minor] We suggest including a quantitative scale for the bias in Fig. 3E.

      Thanks, we have incorporated this suggestion into the manuscript.

      (23) [Recommendation] Authors could also evaluate the cell fate decision processes as mutations or other perturbations affect a regulatory network.

      We thank the reviewers for this valuable recommendation. We agree that including further cases would be helpful, especially mutation-driven, disease-related fate-decision processes, such as neutropenia in chemotherapy. However, given the considerable effort required to find appropriate datasets, we have decided not to make this change.

      (24) [Recommendation] The authors could include some discussion of the likely impact of the work on the field and the utility of the methods and data to the community. For example, understanding the fluidity of the epigenetic landscape and the regulatory forces behind cell fate decisions can be of great importance in designing synthetic gene regulatory circuits.

      We greatly appreciate the reviewers pointing this out. In the original manuscript, we intentionally limited the length of the discussion to keep the whole story more focused. We thank the reviewers for their insightful suggestions regarding the content of the discussion. We have incorporated this suggestion into the revised manuscript. Please refer to Page 25, lines 751-757 (see below).

      “Recently, synthetic biology has realized the insertion of the CIS network in mammalian cells. One of the prerequisites for recapitulating the complex dynamics of fate transitions in synthetic biology is a systematic understanding of the role of GRNs and driving forces in differentiation. And the logic motifs are the essential and indispensable elements in GRNs. Our work also provides a blueprint for designing logic motifs with particular functions. We are also interested in validating the conclusions drawn from our models in a synthetic biology system.”

      In addition, a longstanding question of interest to us in cell fate decisions is what contributes to the distinctive development across species, such as human and mouse. In addition to protein-coding sequences, regulatory interactions between genes (i.e., activation and inhibition) also exhibit conservation, as reported in recent work on a multi-species cell atlas [11], and it is generally acknowledged that gene regulatory networks (GRNs) orchestrate fate-decision processes. That is, conserved regulatory programs imply a conserved topology of core GRNs. Thus, the logic of regulation, as another vital element of GRNs, naturally comes into the spotlight (related to the introduction, lines 99-120 of the revised manuscript). Nevertheless, to our knowledge, regulatory logic in cell fate decisions has received only scant attention. We hope that our elucidation of the role of logic motifs in cell fate decisions will attract more inquiries from the community into the regulatory logic of GRNs.

      Public reviews

      In this manuscript, Xue and colleagues investigate the fundamental aspects of cellular fate decisions and differentiation, focusing on the dynamic behaviour of gene regulatory networks. The manuscript explores the debate between static (noise-driven) and dynamic (signal-driven) perspectives within Waddington's epigenetic landscape, highlighting the essential role of gene regulatory networks in this process. The authors propose an integrated analysis of fate-decision modes and gene regulatory networks, using the Cross-Inhibition with Self-activation (CIS) network as a model. Through mathematical modelling, they compare two logic modes and their effect on cell fate decisions: one that requires both the presence of an activator and the absence of a repressor (AA configuration), and one where transcription occurs as long as the repressor is not the only species on the promoter (OO configuration).

      The authors establish a relationship between noise profiles, logic-motifs, and fate-decision modes, showing that defining any two of these properties allows the inference of the third. They also identify, under the signal-driven mode, two fundamental patterns of cell fate decisions: either prioritising progression or accuracy in the differentiation process. The authors apply this analysis to available high-throughput datasets of cell fate decisions in hematopoiesis and embryogenesis, proposing the underlying driving force in each case and utilising the observed noise patterns to nominate key regulators.

      The paper makes a substantial contribution by rigorously evaluating assumptions in gene regulatory network modelling. Notably, it extensively compares two model configurations based on different integration logic, illuminating the consequences of these assumptions in a clear, understandable manner. The practical simulation results effectively bridge theoretical models with real biological systems, adding relevance to the study's insights. With its potential to enhance our understanding of gene regulatory networks across biological processes, the paper holds promise. Its implications extend practically to synthetic circuit design, impacting biotechnology. The conclusions stand out, addressing cell fate decisions and noise's role in gene networks, contributing significantly to our understanding. Moreover, the adaptable approach proposed offers versatility for broader applications in diverse scenarios, solidifying its relevance beyond its current scope.

      We thank the reviewers for their enthusiasm for our work, and appreciate the professional, insightful and encouraging assessment.

      However, the manuscript in its current form also has some important weaknesses, including the lack of clarity in the text and the questionable generality of specific observations.

      We thank the reviewers for this comment. We have reviewed the manuscript and made the necessary adjustments to improve its clarity. We do hope that this revision meets the reviewers’ expectations on the clarity and comprehensiveness of our analysis.

      For instance, even when focusing on the CIS network, the effect of alternative model implementations is not discussed. Notably, the input signals are only considered as an additive effect over the differential equations, while signals can potentially affect each of the individual processes.

      We agree with the reviewers’ comment that signals may act at each level of the central dogma, including transcription, translation, etc. We have included an additional section titled “limitation of this study” on this point in the revised manuscript, which explicitly notes the potential limitations of our models. Please refer to Page 25 of the revised manuscript, lines 769-771 (see below).

      “In addition, for simplicity and intuition, we here considered signals as uncoupled and additive effects in ODE models, due to feasible mapping in real biological systems, such as ectopic overexpression.”
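
      To make the distinction concrete, a generic additive-signal form for a two-gene model can be sketched as follows. This is only an illustration of what "uncoupled and additive" means: the symbols F, k and u are placeholders we introduce here, not the notation of equations 1-2 in the manuscript.

```latex
% F_x, F_y : regulated production terms (placeholder symbols)
% k_x, k_y : first-order degradation rates (placeholder symbols)
% u_x, u_y : externally applied signals, entering additively
\begin{aligned}
\frac{dx}{dt} &= F_x(x, y) - k_x\, x + u_x(t), \\
\frac{dy}{dt} &= F_y(x, y) - k_y\, y + u_y(t).
\end{aligned}
```

      Under this reading, a signal that acts on an individual process, as the reviewers suggest, would instead enter inside F or modify k rather than appear as a separate term.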

      The proposed model allows for a continuum of interactions/competition between transcription factors, yet only very restrictive scenarios are explored (strict AND/OR logic operations).

      We thank the reviewers for this comment, and appreciate them sharing the potential for further generalization of our framework. Indeed, in addition to logic operations, our framework can be applied to all two-node circuits (3<sup>4</sup> = 81 in total), including mutual activation with self-activation. As the focus of this work is to illustrate the role of logic motifs in cell fate decisions, we mainly concentrated on two classical, intuitive and representative (at least to us) logic operations, AND and OR, in the context of the CIS network. Even so, we already have four combinations to consider (two logic motifs and two driving forces), and we feel that these scenarios are sufficient to demonstrate the role of logic motifs. Hence, we decided not to explore further logic operations in this work. Instead, we have included an additional section titled “limitation of this study” in the revised manuscript. Please refer to Page 25 of the revised manuscript, lines 760-762.

      “Although our framework enables the investigation of more logic motifs, we chose two classical and symmetrical logic combinations for our analysis. Future work should involve more logic gates like XOR and explore asymmetrical logic motifs like AND-OR.”
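
      The count of 81 two-node circuits quoted in the response can be reproduced with a short enumeration. The sketch below assumes, as one plausible reading, that each of the four directed edges (two self-loops and two cross-regulations) is activating, inhibiting, or absent. The edge labels and the encoding are ours, not the manuscript's.

```python
from itertools import product

# Assumed encoding (not taken from the manuscript): each directed edge is
# +1 (activation), -1 (inhibition) or 0 (no regulation).
EDGE_STATES = (+1, -1, 0)
EDGES = ("X->X", "X->Y", "Y->X", "Y->Y")


def enumerate_two_node_circuits():
    """Return every assignment of an edge state to the four edges."""
    return [
        dict(zip(EDGES, states))
        for states in product(EDGE_STATES, repeat=len(EDGES))
    ]


circuits = enumerate_two_node_circuits()

# Cross-inhibition with self-activation, our reading of the CIS topology,
# is one member of the enumerated set.
cis = {"X->X": +1, "X->Y": -1, "Y->X": -1, "Y->Y": +1}
```

      The enumeration covers topologies only; it says nothing about the AND/OR integration logic, which is an additional choice layered on top of each topology.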

      Moreover, how the model parameters are chosen throughout the paper is not clear. Similarly, the concentration and times are not clearly specified, making their comparison to experimental data troublesome.

      We thank the reviewers for this comment. We have revised the manuscript to explain how the parameters of our model are specified. Please refer to Page 5 of the revised manuscript, lines 179-181 (“Benchmarking the Boolean models with different logic motifs (Fig2.B; see Methods), we reproduced the geometry of the attractor basin in the continuous models resembling those represented by corresponding Boolean models (Fig2.C; see Methods).”). In terms of concentration and time, we acknowledge that their units are arbitrary compared to a real experimental system. We have now noted this point in the legends of the corresponding figures (Fig2.C, Fig3.B&D, Fig6.B-C, Fig7.E).

      We would like to highlight that our entire work is organized in a model-driven fashion (also called top-down). We did not fine-tune the sets of parameters used in our model to specifically match the experimental data. Fitting parameters to data is a longstanding challenge in computational biology, since experimental datasets are usually insufficient to specify the parameters of a dynamical model. Doing so generally requires additional assumptions, such as non-Markov processes [12, 13], which may lead to artifacts. Thus, we decided to draw qualitative conclusions (e.g., trends over time) from a quantitative model with sampling of parameter sets. Hence, we did not intentionally tailor our models to fit different datasets (i.e., all models used in our work share the same basic parameter settings), and instead map them onto real biological systems in a top-down manner.

      Regarding clarity, how the general model (equations 1-2) transforms into the specific cases evaluated in the paper is not clearly stated in the main text, nor are the positive and negative effects of individual transcription factors adequately explained. Similarly, in the main text and Figure 2, the authors refer to a Boolean model. However, they do not clearly explain how this relates to the differential equation model, nor its relevance to understanding the paper.

      We thank the reviewers for this comment, as it has prompted us to better clarify our manuscript. We have adjusted the manuscript accordingly and made the necessary adjustments to improve its clarity.

      Additionally, the term "noise levels" is generally used to refer to noise introduced in the "noise-driven" analysis (i.e., as an input or parameter in the models). Nonetheless, it is later claimed to be evaluated as an intrinsic property of the network (likely referring to expression level variability measured by the coefficient of variation).

      We greatly appreciate the reviewers pointing out this ambiguity. The term “noise level” was indeed used to refer to the strength of the noise in the models in Sections 1-4. To classify different logic motifs under the two driving forces, we needed a practical metric that can be quantified from data, and we found that population-level gene expression variance (i.e., “noise level” in line 398), defined as the coefficient of variation, is useful.

      For clarity, we have decided to substitute “expression variance” for “noise level” in Sections 5-6. We have amended the manuscript accordingly.
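
      For readers unfamiliar with the metric, the coefficient of variation is the standard deviation divided by the mean. The sketch below computes it for one gene across a population of cells. The expression values are made up, and the use of the population (rather than sample) standard deviation is our assumption.

```python
from statistics import mean, pstdev


def coefficient_of_variation(expression_levels):
    """Return std / mean for one gene across a cell population."""
    mu = mean(expression_levels)
    if mu == 0:
        raise ValueError("coefficient of variation is undefined for zero mean")
    # pstdev is the population standard deviation (an assumption here).
    return pstdev(expression_levels) / mu


# Hypothetical expression values for one gene in five cells.
cells = [2.0, 4.0, 4.0, 4.0, 6.0]
cv = coefficient_of_variation(cells)
```

      Because the metric is dimensionless, it can be compared between model output in arbitrary units and experimental measurements.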

      Finally, some jargon is introduced without sufficient context about its meaning (e.g., "temporal fully-connected stage").

      Regarding the jargon of "temporal fully-connected stage", we have realized that this term was slightly vague and in need of improvement. Instead, we now employ “transitory fully-connected stage” in the revised manuscript to underline the short emergence of this particular stage. Please refer to Page 10-11 of the revised manuscript, lines 316-327 (see below).

      “Notably, in the AND-AND motif we observed a brief intermediated stage before S attractor disappears, where all three fates are directly interconnected (Fig4.C 2nd panel and D 2nd panel, Fig.4E). To manifest the generality, we globally screened 6,213 groups of parameter sets under the AND-AND motif, and this logic-dependent intermediated stage can be observed for 82.7% of them (see Methods; Table S1), indicating little dependence on particular parameter setting (1.8% in the OR-OR motif). Unlike the indirect attractor adjacency structure mediated by S attractor (Fig2.D), the solution landscape with fully-connected structure facilitates transitions between any two pairs of fates. Furthermore, this transitory fully-connected stage locates between the fate-undetermined stage (Fig4.C top panel) and fate-determined stage (Fig4.C 3rd panel), comparable to the initiation (or activation) stage before the lineage commitment in experimental observations [5-7]. Therefore, we suspected that the robust fully-connected stage in the AND-AND motif may correspond to a specific period in cell fate decisions.”

      Additionally, proper discussion of previous work is also missing. For instance, the dynamics of the CIS network investigated by the authors have been extensively characterised (see e.g., Huang et al., Dev Biol, 2007), and how the author's results compare to this previous work should be discussed. In particular, the central assumptions behind the derivation of the model proposed in the manuscript must be assessed in the context of previous work.

      Thanks for pointing this out. We have extended the discussion to include the above points. We have also discussed and cited the work of Huang mentioned above. Please refer to Page 22, lines 644-647 in the revised manuscript (see below).

      “One of the most representative work is that Huang et al. [14] modeled the bifurcation in hematopoiesis to reveal the lineage commitment quantitatively. Compared to simply modularizing activation or inhibition effect by employing Hill function in previous work, our models reconsidered the multiple regulations from the level of TF-CRE binding.”

      References

      (1) Ackers, G.K., A.D. Johnson, and M.A. Shea, Quantitative model for gene regulation by lambda phage repressor. Proc Natl Acad Sci U S A, 1982. 79(4): p. 1129.

      (2) Shea, M.A. and G.K. Ackers, The OR control system of bacteriophage lambda: A physical-chemical model for gene regulation. Journal of Molecular Biology, 1985. 181(2): p. 211-230.

      (3) Hunziker, A., et al., Genetic flexibility of regulatory networks. Proc Natl Acad Sci U S A, 2010. 107(29): p. 12998-3003.

      (4) Kittisopikul, M. and G.M. Suel, Biological role of noise encoded in a genetic network motif. Proc Natl Acad Sci U S A, 2010. 107(30): p. 13300-5.

      (5) Brand, M. and E. Morrissey, Single-cell fate decisions of bipotential hematopoietic progenitors. Curr Opin Hematol, 2020. 27(4): p. 232-240.

      (6) Zhang, Y., et al., Hematopoietic Hierarchy - An Updated Roadmap. Trends Cell Biol, 2018. 28(12): p. 976-986.

      (7) Arinobu, Y., et al., Reciprocal activation of GATA-1 and PU.1 marks initial specification of hematopoietic stem cells into myeloerythroid and myelolymphoid lineages. Cell Stem Cell, 2007. 1(4): p. 416-27.

      (8) Kamimoto, K., et al., Dissecting cell identity via network inference and in silico gene perturbation. Nature, 2023. 614(7949): p. 742-751.

      (9) Hammelman, J., et al., Ranking reprogramming factors for cell differentiation. Nat Methods, 2022. 19(7): p. 812-822.

      (10) Semrau, S., et al., Dynamics of lineage commitment revealed by single-cell transcriptomics of differentiating embryonic stem cells. Nat Commun, 2017. 8(1): p. 1096.

      (11) Li, J., et al., Deep learning of cross-species single-cell landscapes identifies conserved regulatory programs underlying cell types. Nature Genetics, 2022. 54(11): p. 1711-1720.

      (12) Stumpf, P.S., F. Arai, and B.D. MacArthur, Modeling Stem Cell Fates using Non-Markov Processes. Cell Stem Cell, 2021. 28(2): p. 187-190.

      (13) Stumpf, P.S., et al., Stem Cell Differentiation as a Non-Markov Stochastic Process. Cell Syst, 2017. 5(3): p. 268-282 e7.

      (14) Huang, S., et al., Bifurcation dynamics in lineage-commitment in bipotent progenitor cells. Dev Biol, 2007. 305(2): p. 695-713.

    1. Author response:

      Reviewer #1 (Public review):

      Summary:

      This manuscript uses molecular dynamics simulations to understand how forces felt by the intracellular domain are coupled to the opening of the mechanosensitive ion channel NOMPC. The concept is interesting - as the only clearly defined example of an ion channel that opens due to forces on a tethered domain, the mechanism by which this occurs is yet to be fully elucidated. The main finding is that twisting of the transmembrane portion of the protein - specifically via the TRP domain that is conserved within the broad family of channels- is required to open the pore. That this could be a common mechanism utilised by a wide range of channels in the family, not just mechanically gated ones, makes the result significant. It is intriguing to consider how different activating stimuli can produce a similar activating motion within this family. However, the support for the finding can be strengthened as the authors cannot yet exclude that other forces could open the channel if given longer or at different magnitudes. In addition, they do not see the full opening of the channel, only an initial dilation. Even if we accept that twist is essential for this, it may be that it is not sufficient for full opening, and other stimuli are required.

      Strengths:

      Demonstrating that rotation of the TRP domain is the essential requirement for channel opening would have significant implications for other members of this channel family.

      Thank you for your positive summary and comments.

      Weaknesses:

      The manuscript centres around 3 main computational experiments. In the first, a compression force is applied on a truncated intracellular domain and it is shown that this creates both a membrane normal (compression) and membrane parallel (twisting) force on the TRP domain. This is a point that was demonstrated in the authors’ prior eLife paper - so the point here is to quantify these forces for the second experiment.

      The second experiment is the most important in the manuscript. In this, forces are applied directly to two residues on the TRP domain with either a membrane normal (compression) or membrane parallel (twisting) direction, with the magnitude and directions chosen to match that found in the first experiment. Only the twisting force is seen to widen the pore in the triplicate simulations, suggesting that twisting, but not compression can open the pore. This result is intriguing and there appears to be a significant difference between the dilation of pore with the two force directions.

      However, there are two caveats to this conclusion. Firstly, is the magnitude of the forces - the twist force is larger than the applied normal force to match the result of experiment 1. However, it is possible that compression could also open the pore at the same magnitude or if given longer. It may be that twist acts faster or more easily, but I feel it is not yet possible to say it is the key and exclude the possibility that compression could do something similar.

      Thank you for your insightful comment. As you pointed out, the membrane-normal pushing forces exerted at residues E1571 and R1581 are approximately one-third and two-thirds, respectively, of the membrane-parallel twisting forces. These magnitudes were derived from a previous simulation (Wang et al., 2021), in which we decomposed the resultant force into its membrane-parallel and membrane-normal components upon applying a compressive force to the intracellular AR end. Our results indicated that, upon reaching the TRP helix, the induced twisting force is indeed greater, which partially reflects actual physiological conditions. Therefore, considering the magnitudes of the resultant forces alone, the twisting force is predominantly greater than the pushing force when the AR domain is subjected to compression.

      The question then becomes: if forces of the same magnitude are applied in either the membrane-normal or membrane-parallel directions, what would the outcome be? To address this, we conducted additional simulations. Considering the situations discussed above, we applied a smaller membrane-parallel force instead of a larger membrane-normal force that may disrupt the integrity of the protein and membrane structure. As shown in the new Figure S6, we adjusted the applied membrane-parallel force to either half or one-third of the original value. When we applied half of the force used in the original setup, the channel opened in two out of three trajectories. When applying one-third of the force, the channel opened in one out of three trajectories. Together with our previous results, these findings suggest that if forces of equal magnitude are applied in the membrane-normal and membrane-parallel directions, the membrane-parallel force has a higher probability of inducing channel opening.

      Still, one cannot completely exclude the possibility that the pushing force on the TRP helix can open the channel if given a very long time. This becomes unfeasible to examine with MD simulations, so we investigated the likely conformational changes of multiple TRP family proteins upon opening, and found that the TRP rotation is a universal conformational change, while the TRP tilt is much less consistent (Figure 6). These findings give us more confidence that the twist force plays a more crucial role in channel gating than the pushing force. We have added a new table (Table 1) and a new figure (Figure 6) to present this analysis.

      In addition, we did not intend to imply that compression is incapable of contributing to channel opening. In fact, our aim was to highlight that compression can generate both a twisting force and a pushing force, with the twisting force appearing to be the more critical component for facilitating channel opening. We concur that we cannot completely dismiss the possibility that the pushing component may also assist in channel opening. Consequently, we have revised our discussion on pages 4,6 to enhance clarity.

      I also note that when force was applied to the AR domain in experiment 1, the pore widened more quickly than with the twisting force alone, suggesting that compression is doing something to assist with opening.

      You are correct that the trajectory corresponding to Experiment 1 (Figure S1(b)) indicates pore opening around 300-400 ns, while the trajectory for Experiment 2 (800 ns) shows pore opening around 600 ns. This observation may suggest that the pore opens more rapidly in Experiment 1, assuming that the simulation conditions were identical for both experiments. However, it is important to note that in Experiment 1, an external force was applied to AR29. In contrast, in Experiment 2, the force was applied exclusively to two selected residues on the TRP domain, while other TRP residues also experienced mechanical forces, albeit to a lesser extent. The differing methods of force application in the two experiments complicate the comparison of pore opening speeds under these conditions.

      We acknowledge that the compression of the AR spring can facilitate pore opening. This compression generates both a twisting component and a pushing component on the TRP domain. Our simulations and structural analyses of multiple TRP channels suggest that the twisting component plays a predominant role in gating. However, we cannot entirely rule out the possibility that the pushing component may also contribute to this process. We have carefully revised our Result (page 6), Discussion (pages 10–12) and Methods (pages 14–17) sections to enhance clarity.

      Given that the forces are likely to be smaller in physiological conditions it could still be critical to have both twist and compression present. As this is the central aspect of the study, I believe that examining how the channel responds to different force magnitudes could strengthen the conclusions and recommend additional simulations be done to examine this.

      Thank you for your valuable comments. We agree that the force applied in Experiment 2 is likely larger than the forces present under physiological conditions. Therefore, we performed additional simulations to investigate the possibility of opening the pore using smaller torsional forces.

      As shown in the new Figure S6, we applied half and one-third of the original force and performed three replicate simulations for each condition. With half the force, the pore opened in two out of the three simulations. And with one-third of the applied force, the pore opened in one out of the three replicate simulations. The probability of pore opening within the same simulation time decreased as the applied force was reduced, consistent with our expectations. These new results are provided as supplementary figures (Figure S6) in the revised manuscript.

      We anticipate that further reductions in the forces will result in additional delays in the opening process; however, this would lead to prohibitive computational costs. Consequently, we have decided to conclude our analysis at this stage and have discussed this matter on page 6 of the revised manuscript.

      The second important consideration is that the study never sees a full pore opening, but rather a widening that is less than that seen in open state structures of other TRP channels and insufficient for rapid ion currents. This is something the authors acknowledge in their prior manuscript in eLife 2021. Although this may simply be due to the limited timescale of the simulations, it needs to be clearly stated as a caveat to the conclusions. Twist may be the key to getting this dilation, but we do not know if it is the key to full pore opening. To demonstrate that the observed dilation is a first step in the opening of pores, a structural comparison to open-state TRP channels would be beneficial in providing evidence that this motion is along the expected pathway of channel gating.

      We are grateful for this insightful comment. We acknowledge that our simulations do not capture a fully open state, but rather a dilation that is smaller than the open-state structures of other TRP channels. In our simulations, a pore radius exceeding 2 Å is considered as a partially open state, as this is generally sufficient for the permeation of water molecules or even small cations such as K<sup>+</sup> and Na<sup>+</sup>. However, the passage of larger molecules and ions, such as Ca<sup>2+</sup> and clusters of hydrated ions, remains challenging. As you noted, this partial opening may be attributed to the limited timescale of the simulations.
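
      The 2 Å criterion can be written as a one-line check on a pore radius profile. This is a minimal sketch: it assumes the criterion is applied to the narrowest point of the profile, and the function name and example radii are hypothetical rather than taken from the analysis scripts.

```python
PARTIALLY_OPEN_RADIUS_ANGSTROM = 2.0  # threshold quoted in the response


def is_partially_open(pore_radius_profile):
    """True when the narrowest point of the pore exceeds the 2 Å threshold.

    pore_radius_profile: radii (in Å) sampled along the pore axis.
    """
    return min(pore_radius_profile) > PARTIALLY_OPEN_RADIUS_ANGSTROM


# Hypothetical radius profiles (Å), for illustration only.
dilated_profile = [3.1, 2.4, 2.8]
closed_profile = [3.1, 1.2, 2.8]
```

      A profile whose minimum equals exactly 2 Å is treated as closed here, since the response speaks of a radius "exceeding" the threshold.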

      Furthermore, in accordance with your suggestion, we analyzed numerous TRP proteins for which multiple open or intermediate states have been resolved, and we have included a new figure (Figure 6). A clockwise rotation of the TRP domain is observed in the majority of these proteins upon gating. For instance, in the case of RnTRPV1, our analysis revealed that during TRPV1 activation, when different ligands are bound (RTX, DkTX), the pore undergoes gradual dilation, which involves a progressive clockwise rotation of the TRP domain. This analysis provides evidence that the observed motion aligns with expected gating transitions, supporting the notion that twist-induced TRP rotation and pore dilation may represent an initial step in the pore opening process.

      Nonetheless, we concur that further studies, including extended simulations, which are currently unfeasible, or experimental validation, will be necessary to ascertain whether our proposed mechanism is adequate for the complete opening of the pore. We have carefully discussed this on pages 10–12.

      Experiment three considers the intracellular domain and determines the link between compression and twisting of the intracellular AR domain. In this case, the end of the domain is twisted and it is shown that the domain compresses, the converse to the similar study previously done by the authors in which compression of the domain was shown to generate torque. While some additional analysis is provided on the inter-residue links that help generate this, this is less significant than the critical second experiment.

      Although experiment three is less significant in revealing the underlying gating mechanism, it provides quantitative measurements of the mechanical properties of the intriguing AR spring structure, which are currently challenging to obtain experimentally. These provide computational predictions for future experiments to validate.

      Reviewer #2 (Public review):

      This study uses all-atom MD simulation to explore the mechanics of channel opening for the NOMPC mechanosensitive channel. Previously the authors used MD to show that external forces directed along the long axis of the protein (normal to the membrane) result in AR domain compression and channel opening. This force causes two changes to the key TRP domains adjacent to the channel gate: 1) a compressive force pushes the TRP domain along the membrane normal, while 2) a twisting torque induces a clock-wise rotation on the TRP domain helix when viewing the bottom of the channel from the cytoplasm. Here, the authors wanted to understand which of those two changes is responsible for increasing the inner pore radius, and they show that it is the torque. The simulations in Figure 2 probe this question with different forces, and we can see the pore open with parallel forces in the membrane, but not with the membrane-normal forces. I believe this result as it is reproducible, the timescales are reaching 1 microsecond, and the gate is clearly increasing diameter to about 4 Å. This seems to be the most important finding in the paper, but the impact is limited since the authors already show how forces lead to channel opening, and this is further teasing apart the forces and motions that are actually the ones that cause the opening.

      Thank you for your insightful comments. We appreciate your recognition of our key finding that torque is responsible for increasing the inner pore radius. Indeed, our simulations illustrated in Figure 2 systematically explore the effects of different forces on pore opening. These results demonstrate that membrane-parallel forces are effective, while membrane-normal forces are not within the simulation time. We acknowledge that this study builds upon previous findings regarding force-induced channel opening. However, we believe that further decomposition of the specific forces and motions responsible for this process provides valuable mechanistic insights. By distinguishing the role of torque from the membrane-normal forces of the TRP helix, which is highly conserved across the TRP channel family, our work contributes to a more precise understanding of TRP channel gating. Moreover, in the revised manuscript, we conducted a systematic analysis of the structures of TRP family proteins and discovered that the clockwise rotation of the TRP domain is likely a universal gating mechanism among the TRP family, which significantly enhances and strengthens our original findings (Figure 6).

      Reviewer #3 (Public review):

      Summary:

      This manuscript by Duan and Song interrogates the gating mechanisms and specifically force transmission in mechanosensitive NOMPC channels using steered molecular dynamics simulations. They propose that the ankyrin spring can transmit force to the gate through torsional forces adding molecular detail to the force transduction pathways in this channel.

      Strengths:

      Detailed, rigorous simulations coupled with a novel model for force transduction.

      Thank you for your positive comments.

      Weaknesses:

      Experimental validation of reduced mechanosensitivity through mutagenesis of proposed ankyrin/TRP domain coupling interactions would greatly enhance the manuscript. I have some additional questions documented below:

      We attempted to measure the mechanical properties of the AR domain and conduct mutagenesis experiments in collaboration with Prof. Jie Yan’s laboratory at the Mechanobiology Institute, National University of Singapore; however, this proved to be a significant challenge at this time. Given the urgency of the publication, we have decided to first publish the computational results and reserve further experimental studies for future investigations.

      (1) The membrane-parallel torsion force can open NOMPC

      How does the TRP domain interact with the S4-S5 linker? In the original structural studies, the coordination of lipids in this region seems important for gating. In this manner does the TRP domain and S4-S5 linker combined act like an amphipathic helix as suggested first for MscL (Bavi et al., 2016 Nature Communications) and later identified in many MS channels (Kefauver et al., 2020 Nature).

      In our analysis of the compression trajectories (trajectory: CI-1, Figure S4), we identified stable interactions between the TRP domain and the S4-S5 linker. These interactions primarily involve the residues S1421 and F1422 of the S4-S5 linker, as indicated by the large pink data points in Figure S4. Therefore, we agree that the TRP helix and the S4–S5 linker can be considered an amphipathic helical unit, analogous to the amphipathic helix observed in MscL and other mechanosensitive channels. Moreover, the pocket adjacent to the S4-S5 linker has been recognized as a binding site for small molecules in other ligand-activated TRP channels, such as the vanilloid-binding TRPV1. We hypothesize that this unit is likely to play a critical role in the polymodal gating of the TRP channel family, including ligand-induced activation. In the revised manuscript, we have included an analysis of the interaction between the TRP domain and the transmembrane (TM) domain on page 4 (Figure S4), and we have briefly discussed its implications on pages 10 and 12.

      (2) Torsional forces on shorter ankyrin repeats of mammalian TRP channels

      Is it possible torsional forces applied to the shorter ankyrin repeats of mammalian TRPs may also convey force in a similar manner?

      This is an intriguing question.

      To answer your question, we studied the full-length squirrel TRPV1 (PDB: 7LQY, Nadezhdin et al. (2021)) using all-atom steered MD simulations. We applied pushing or torsional forces to the intracellular AR1-2 region of TRPV1, separately (Figure S10(a)). Similar to NOMPC, rotation of the TRP domain was observed under both types of mechanical stimulation (Figure S10(b-e)). The conformational change induced by the torsional force on the TRP domain resembles the change observed in NOMPC. This suggests that a torsional force applied to the shorter ankyrin repeats of mammalian TRPs may yield similar effects on channel gating. However, given that these ankyrin repeats do not act like tether elements, the implications of these results in the context of biological functions remain unclear. Additionally, in NOMPC, the AR domain is connected to the TRP domain through a linker helix (LH) domain, composed of multiple stacked helices that form a relatively compact structure (Figure 1(a)). In contrast, TRPV1 does not possess a similarly compact LH domain connecting the AR domain to the TRP domain (Figure S10(a)). These structural differences render our conclusions regarding NOMPC not directly applicable to TRPV1. We have included an additional discussion about this on page 12 (Figure S10).

      (3) Constant velocity or constant force

      For the SMD the authors write "and a constant velocity or constant force". It’s unclear from this reviewer’s perspective which is used to generate the simulation data.

      Thank you for pointing out this ambiguity. In our simulations, we first applied constant-velocity pulling to achieve specific force magnitudes, followed by constant-force pulling. This protocol allowed us to initiate the motion of the protein in a controlled manner and observe the response of the system under sustained forces. We have now clarified this in the revised Methods section.
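
      The two-stage protocol can be illustrated with a toy force schedule. The sketch below assumes a harmonic pulling spring during the constant-velocity stage and a switch to a fixed force once the target is reached. The class, parameter names and numerical values are hypothetical, and this is not the simulation engine's pull code.

```python
class TwoStagePull:
    """Constant-velocity ramp followed by a constant-force hold (toy model)."""

    def __init__(self, spring_constant, velocity, target_force):
        self.k = spring_constant    # kJ mol^-1 nm^-2 (hypothetical value)
        self.v = velocity           # nm ns^-1 (hypothetical value)
        self.target = target_force  # kJ mol^-1 nm^-1 (hypothetical value)
        self.holding = False

    def force(self, time_ns, displacement_nm):
        """Force applied at a given time and pulled-group displacement."""
        if not self.holding:
            # Stage 1: the spring anchor moves at constant velocity.
            ramp_force = self.k * (self.v * time_ns - displacement_nm)
            if ramp_force < self.target:
                return ramp_force
            self.holding = True     # target reached: switch to stage 2
        # Stage 2: the force is held constant.
        return self.target


pull = TwoStagePull(spring_constant=1000.0, velocity=0.01, target_force=50.0)
```

      Once the schedule has switched to the hold stage it stays there, so later motion of the pulled group no longer changes the applied force.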

      Reviewer #1 (Recommendations for the authors):

      The language in the paper requires some editing - particularly in the introduction. For example, what is meant by ion channels ’coalescing to form mechanical receptors’? Are the authors implying it requires multiple channels to form a receptor? It is stated that mechanically gated ion channels are only found in nerve endings when in fact they are found in almost every cell type. Another example is the statement ’In the meantime’ the TRP domain was observed to rotate when this observation came prior to the others mentioned before. While these sound like minor edits, they significantly change the meaning of the introduction. I recommend careful editing of the manuscript to avoid accidental inaccuracies like this.

      Thank you for your feedback on the clarity and accuracy of the introduction. We have carefully revised the manuscript, particularly the abstract and introduction sections, to address these concerns:

      (1) We have reworded the original sentence ’These mechanosensitive ion channels, coalescing to form mechanical receptors, are strategically positioned within the sensory neuron terminals intricately nestled within the epidermal layer.’ into ’In both vertebrates and invertebrates, mechanosensitive ion channels are widely expressed in peripheral sensory neurons located near or within the surface tissues responsible for detecting mechanical stimuli.’

      (2) We have replaced the phrase "In the meantime" with "Interestingly" to introduce the conformational change of the TRP domain that we believe is crucial.

      (3) We have carefully reviewed the entire manuscript and used a language editing tool, Writefull integrated within Overleaf, to proof-check the language problems.

      Reviewer #2 (Recommendations for the authors):

      How do the energy values in Figure 3b, compare with the continuum energy values reported by Argudo et al. JGP (2019)? I wonder what value the authors would get with a new replicate run slower - say 200 ns total aggregate simulation? This would probe the convergence of this energy value. It seems important to determine whether the loading velocity of the experiments performed here with the steered MD is slow enough to allow the protein to relax and adopt lower energy configurations during the transition. The true loading is likely to occur on the millisecond timescale, not the nanosecond to low microsecond timescale. That said, I don’t mean to detract from the result in Figure 2, as this is likely quite solid in my opinion given the nearly 1 microsecond simulations and the replicates showing the same results.

      Thank you for your valuable suggestions. It is important to note that we calculated different physical quantities from those reported in Argudo’s study. In Figure 3b, we calculated the torque (instead of the energy, although the two share the same dimensional units) of the long AR bundle (AR9-29 of the four filaments combined) and subsequently determined its torsion coefficient. Argudo’s study calculated the torsional spring constant (𝑘<sub>ɵ</sub>) of three 6-AR-unit stretches of one filament, which were designated as ANK1 (AR 12-17), ANK2 (AR 17-22) and ANK3 (AR 22–27). As the four filaments are coupled within the bundled structure and the torsional axes differ between an individual filament and the four-filament bundle, a direct comparison of the torsional spring constants reported in the two studies is not meaningful.

      We agree that extending the simulation time may provide deeper insights into the convergence of energy values. In accordance with your suggestion, we conducted additional simulations to further investigate convergence and compare the results with our existing data, thereby ensuring robustness and consistency. Specifically, we slowed down the original operation of twisting from 10 degrees over 100 ns to 10 degrees over 200 ns, and extended the holding time for selected frames (sampled every 2.5 degrees) from 100 ns to 200 ns. We have updated Figure 3 and relevant main text accordingly (page 7). The results of the new simulations are similar to those of the previous ones, with the fitted torsion coefficient revised from (2.31 ± 0.44) × 10<sup>3</sup> kJ mol<sup>−1</sup> rad<sup>−1</sup> to (2.30 ± 0.31) × 10<sup>3</sup> kJ mol<sup>−1</sup> rad<sup>−1</sup>. This close agreement indicates that our simulations are well-converged. Additionally, we updated the compression–twist coupling coefficient from (1.67 ± 0.14) nm rad<sup>−1</sup> to (1.32 ± 0.11) nm rad<sup>−1</sup>.
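
      As a rough illustration of the convergence argument, the sketch below fits a torsion coefficient as the slope of torque against twist angle and checks whether two estimates agree within their combined uncertainty. The linear fit through the origin, the function names, and the placeholder torque values are assumptions made for illustration only; they are not the fitting procedure or data of the manuscript. Only the two reported coefficients are taken from the response.

```python
import math


def fit_torsion_coefficient(angles_rad, torques):
    """Least-squares slope of torque = kappa * angle, forced through the origin."""
    numerator = sum(a * t for a, t in zip(angles_rad, torques))
    denominator = sum(a * a for a in angles_rad)
    return numerator / denominator


def agree_within_error(value_a, err_a, value_b, err_b):
    """True if two estimates differ by less than their combined uncertainty."""
    return abs(value_a - value_b) < math.hypot(err_a, err_b)


# Twist angles sampled every 2.5 degrees up to 10 degrees, as described above.
angles_rad = [math.radians(d) for d in (0.0, 2.5, 5.0, 7.5, 10.0)]

# Hypothetical torques (kJ/mol) generated from an assumed coefficient.
# These are placeholders for illustration, not simulation output.
assumed_kappa = 2.3e3
torques = [assumed_kappa * a for a in angles_rad]

kappa = fit_torsion_coefficient(angles_rad, torques)

# Reported coefficients in units of 10^3 kJ mol^-1 rad^-1: original vs. slower run.
converged = agree_within_error(2.31, 0.44, 2.30, 0.31)
```

      With real data, the torque list would hold the mean torque measured while holding each sampled frame, and a more careful comparison might use a proper statistical test rather than this simple overlap criterion.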

      As you suggested, we conducted an additional analysis to determine whether the loading velocity/force with the steered MD is sufficiently slow to facilitate the relaxation of the protein and its adoption of lower-energy configurations during the transition. For simulations involving the application of membrane-normal or membrane-parallel force on the TRP domain, we utilized DSSP (Define Secondary Structure of Proteins) analysis to assess the stability of the secondary structure of the TRP domain. The results indicated that, during the application of external forces, the secondary structure of the TRP domain maintained good stability, as illustrated in Figure S11. For simulations involving the rotation of the AR domain, we also analyzed the DSSP of the AR9 to AR11 units, which are positioned directly above the AR8 domain where the twisting force is applied. The secondary structure of the AR domain also exhibited good stability (Figure S12). These analyses are briefly discussed in the Methods section of the revised manuscript (page 17).

      It is unclear to me that the force transmission analysis in Figure 4 provides much insight into the mechanics of opening. Perhaps the argument was made, but I did not appreciate it. Related to this the authors state that the transfer velocity is 1.8 nm/ps based on their previous study. Is this value profound or is it simply the velocity of sound in the protein?

      The analysis of force transmission presented in Figure 4 offers detailed insights into the transfer of force along the AR domain. While this may appear straightforward, the information elucidates how a pushing force can induce a twisting force during its transmission through the AR spring structure, as well as the primary contributions that stabilize this transmission pathway. To enhance clarity, we have included an additional discussion on page 9.

      The force transfer velocity is expected to align with the velocity of sound within the protein. The value of 1.8 nm/ps, however, is specific to the unique structure of the AR spring, which is quite interesting to report in our opinion. Additionally, this rapid transfer speed suggests that the simulation timescale is sufficient for enabling the transfer of compression force from the bottom of the AR domain to the TRP domain in our simulations, given that the simulation timescale is considerably longer than the force propagation timescale within the protein.
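
      The timescale argument can be made concrete with a short back-of-the-envelope calculation. In the sketch below, only the transfer velocity of 1.8 nm/ps comes from the response; the path length through the AR domain is a hypothetical placeholder, and the 200 ns simulation length is simply the duration of the slower twisting runs, used here for illustration.

```python
def propagation_time_ps(path_length_nm, velocity_nm_per_ps=1.8):
    """Time for a force signal to travel a given path at a constant velocity."""
    return path_length_nm / velocity_nm_per_ps


# Hypothetical path length along the AR domain (assumption, not from the manuscript).
ASSUMED_PATH_LENGTH_NM = 15.0

# Illustrative simulation length: 200 ns expressed in picoseconds.
SIMULATION_TIME_PS = 200_000.0

transit_ps = propagation_time_ps(ASSUMED_PATH_LENGTH_NM)

# How many times longer the simulation is than a single force transit.
ratio = SIMULATION_TIME_PS / transit_ps
```

      Under these assumptions, a single transit takes a few picoseconds, so the simulation is longer than the propagation time by several orders of magnitude for any plausible path length.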

      The methods description is largely complete, but is missing some details on the MD simulations (barostat, thermostat, piston constants, etc.).

      Thank you for pointing out the missing details; we have added the additional information in the revised Methods section.

      References

      Nadezhdin, K. D., A. Neuberger, Y. A. Nikolaev, L. A. Murphy, E. O. Gracheva, S. N. Bagriantsev, and A. I. Sobolevsky (2021). Extracellular cap domain is an essential component of the TRPV1 gating mechanism. Nature Communications 12(1), 2154.

      Wang, Y., Y. Guo, G. Li, C. Liu, L. Wang, A. Zhang, Z. Yan, and C. Song (2021). The push-to-open mechanism of the tethered mechanosensitive ion channel NompC. eLife 10, e58388.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Public Review:

      1. Evidence for a disulfide bridge contained in membrane-associated FGF2 dimers

      This aspect was brought up in detail by both Reviewer #1 and Reviewer #3. It has been addressed in the revised manuscript by (i) new experimental and computational analyses, (ii) a more detailed discussion of previous work from our lab in which the experiments requested by the reviewers were performed, and (iii) a more general discussion of known examples of disulfide formation in protein complexes with a particular focus on membrane surfaces facing the cytoplasm, the inner plasma membrane leaflet being a prominent example. Please find our detailed comments in our direct response to Reviewers #1 and #3, see below.

      2. Affinity towards PI(4,5)P2 comparing FGF2 dimers versus monomers

      This is an aspect that has been raised by Reviewer #3 along with additional comments on the interaction of FGF2 with PI(4,5)P2. Please find our detailed response below. With regard to PI(4,5)P2 affinity aspects of FGF2 dimers versus FGF2 monomers, we think that the increased avidity of FGF2 dimers with two high affinity binding pockets for PI(4,5)P2 is a good explanation for the different values of free energies of binding that were calculated from the atomistic molecular dynamics simulations shown in Fig. 9. This phenomenon is well known for many biomolecular interactions and is also consistent with the cryoEM data contained in our manuscript, showing an FGF2 dimer with two PI(4,5)P2 binding sites facing the membrane surface.
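
      To give a sense of scale for such differences in binding free energy, the sketch below converts a free-energy difference between dimer and monomer binding into a fold change in affinity using the standard relation K = exp(−ΔG/RT). The function name, the temperature of 310 K, and the example value of −10 kJ/mol are hypothetical and chosen only for illustration; they are not the values reported in Fig. 9.

```python
import math

# Molar gas constant in kJ mol^-1 K^-1.
GAS_CONSTANT_KJ_PER_MOL_K = 8.314462618e-3


def affinity_fold_change(delta_delta_g_kj_mol, temperature_k=310.0):
    """Fold increase in binding affinity for ddG = dG(dimer) - dG(monomer).

    A negative ddG (dimer binds more favorably) gives a value above 1.
    """
    rt = GAS_CONSTANT_KJ_PER_MOL_K * temperature_k
    return math.exp(-delta_delta_g_kj_mol / rt)


# Hypothetical free-energy difference in kJ/mol (assumption, not from Fig. 9).
HYPOTHETICAL_DDG = -10.0

fold = affinity_fold_change(HYPOTHETICAL_DDG)
```

      Because the relation is exponential, even a modest gain in binding free energy from a second binding pocket corresponds to a large increase in apparent affinity.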

      3. C95-C95 FGF2 dimers as signaling units

      We have put forward this hypothesis since in structural studies analyzing the FGF ternary signaling complex consisting of FGF2, FGF receptor and heparin, FGF2 mutants were used that lack C95. Nevertheless, two FGF2 molecules are contained in FGF signaling complexes. In addition to the papers on the structure of the FGF signaling complex, we have cited work that showed that C95-C95 crosslinked FGF2 dimers are efficient FGF signaling modules (Decker et al, 2016; Nawrocka et al, 2020). Therefore, since the FGF2 secretion pathway is based on an assembly/disassembly mechanism with the transient formation of pore-forming FGF2 oligomers, we think it is an interesting idea that this pathway produces C95-C95 disulfide-linked FGF2 dimers at the outer plasma membrane leaflet that can engage in FGF2 ternary signaling complexes. While this is a possibility we put forward to stimulate the field, it of course remains a hypothesis, which has been clearly indicated as such in the revised manuscript.

      Reviewer #1:

      1. Evidence for disulfide-bridged FGF2 dimers and higher oligomers on non-reducing versus reducing SDS gels

      The experiment suggested by Reviewer #1 is an important one that has been published by our group in previous work. In these studies, we found FGF2 oligomers analyzed on non-reducing SDS gels to be sensitive to DTT, turning the vast majority of oligomeric FGF2 species into monomers [(Müller et al, 2015); Fig. 3, compare panel D with panel H]. This phenomenon could be observed most clearly after short periods of incubations (0.5 hours) of FGF2 with PI(4,5)P2-containing liposomes. These findings constituted the original evidence for PI(4,5)P2-induced FGF2 oligomerization to depend on the formation of intermolecular disulfide bridges.

      In the current manuscript, we established the structural principles underlying this process and identified C95 to be the only cysteine residue involved in disulfide formation. Based on biochemical cross-linking experiments in cells, cryo-electron tomography, predictions from AlphaFold-2 Multimer and molecular dynamics simulations, we demonstrated a strong FGF2 dimerization interface in which C95 residues are brought into close proximity when FGF2 is bound to membranes in a PI(4,5)P2-dependent manner. These findings provide the structural basis by which disulfide bridges can be formed from the thiols contained in the side chains of two C95 residues directly facing each other in the dimerization interface. In the revised manuscript, we included additional data that further strengthen this analysis. In the experiments shown in the new Fig. 10, we combined chemical cross-linking with mass spectrometry, further validating the reported FGF2 dimerization interface. In addition, illustrated in the new Fig. 8, we employed a new computational analysis combining 360 individual atomistic molecular dynamics simulations, each spanning 0.5 microseconds, with advanced machine learning techniques. This new data set corroborates our findings, demonstrating that the C95-C95 interface self-assembles independently of C95-C95 disulfide formation, based on electrostatic interactions. Intriguingly, it is consistent with our experimental findings based on cross-linking mass spectrometry (new Fig. 10) where cross-linked peptides could also be observed with the C77/95A variant form of FGF2, suggesting a protein-protein interface whose formation does not depend on disulfide formation. Therefore, we propose that disulfide formation occurs in a subsequent step, representing the committed step of FGF2 membrane translocation with the formation of disulfide-bridged FGF2 dimers being the building blocks for pore-forming FGF2 oligomers.

      As a more general remark on the mechanistic principles of disulfide formation in different cellular environments, we would like to emphasize that it is a common misconception that the reducing environment of the cytoplasm generally makes the formation of disulfide bridges unlikely or even impossible. From a biochemical point of view, the formation of disulfide bridges is not limited by a reducing cellular environment but is rather controlled by kinetic parameters when two thiols are brought into proximity. Indeed, it has become well established that disulfide bridges can also be formed in compartments other than the lumen of the ER/Golgi system, including the cytoplasm. For example, viruses maturing in the cytoplasm can form stable structural disulfide bonds in their coat proteins (Locker & Griffiths, 1999; Hakim & Fass, 2010). Moreover, many cytosolic proteins, including phosphatases, kinases and transcription factors, are now recognized to be regulated by thiol oxidation and disulfide bond formation, formed as a post-translational modification (Lennicke & Cocheme, 2021). In numerous cases with direct relevance for our studies on FGF2, disulfide bond formation and other forms of thiol oxidation occur in association with membrane surfaces. In fact, many of these processes are linked to the inner plasma membrane leaflet (Nordzieke & Medrano-Fernandez, 2018). Growth factors, hormones and antigen receptors are observed to activate transmembrane NADPH oxidases generating O2·-/H2O2 (Brown & Griendling, 2009). For example, the local and transient oxidative inactivation of membrane-associated phosphatases (e.g., PTEN) serves to enhance receptor associated kinase signaling (Netto & Machado, 2022). It is therefore conceivable that similar processes introduce disulfide bridges into FGF2 while assembling into oligomers at the inner plasma membrane leaflet. In the revised version of our manuscript, we have discussed the above-mentioned aspects in more detail, with the known role of NADPH oxidases in disulfide formation at the inner plasma membrane leaflet being highlighted.

      Reviewer #2:

      1. Potential effects of a C95A substitution on protein folding and comparison with a C95S substitution with regard to phenotypes observed in FGF2 secretion

      This is a valid point that we indeed addressed at the beginning of this project. Most importantly, we tested whether both FGF2 C95A and FGF2 C95S are characterized by severe phenotypes in FGF2 secretion efficiency. As shown in the revised Fig. 1, cysteine substitutions by serine showed very similar FGF2 secretion phenotypes compared to cysteine to alanine substitutions (Fig. 1C and 1D). In addition, in the pilot phase of this project, we also compared recombinant forms of FGF2 C95A and FGF2 C95S in various in vitro assays. For example, we tested the full set of FGF2 variants in membrane integrity assays such as those shown in Fig. 4. As shown in Author response image 1, FGF2 variant forms carrying a serine in position 95 behaved in a very similar manner to FGF2 C95A variant forms. Relative to FGF2 wild-type, membrane pore formation was strongly reduced for both types of C95 substitutions. By contrast, both FGF2 C77S and C77A did show activities that were similar to FGF2 wild-type.

      Author response image 1.

      From these experiments, we conclude that changes in protein structure are not the basis for the phenotypes we report on the C95A substitution in FGF2.

      2. Effects of a C77A substitution on FGF2 membrane recruitment in cells

      The effect of a C77A substitution in FGF2 recruitment to the inner plasma membrane leaflet is indeed a moderate one. This is likely to be the case because C77 is only one residue of a more complex surface that contacts the α1 subunit of the Na,K-ATPase. Stronger effects can be observed when K54 and K60 are changed, residues that are positioned in close proximity to C77 (Legrand et al, 2020). Nevertheless, as shown in the revised Fig. 1, we consistently observed a reduction in membrane recruitment when comparing FGF2 C77A with FGF2 wild-type. When analyzing the raw data without GFP background subtraction, a significant reduction of FGF2 C77A was observed compared to FGF2 wild-type (Fig. 1A and 1B). We therefore conclude that C77 plays a role in FGF2/α1 interactions not only in biochemical assays using purified components (Fig. 7) but also in a cellular context (Fig. 1A and 1B).

      3. Identity of the protein band in Fig. 3 labeled with an empty diamond

      This is a misunderstanding as we did not assign this band to an FGF2-GFP dimer. When we produced the corresponding cell lines, we used constructs that link FGF2 with GFP via a ‘self-cleaving’ P2A sequence. During translation, even though both are encoded on one mRNA, this causes the production of FGF2 and GFP as separate proteins in stoichiometric amounts, the latter being used to monitor transfection efficiency. However, a small fraction is always expressed as a complete FGF2-P2A-GFP fusion protein (a monomer). This band can be detected with the FGF2 antibodies used and was labeled in Fig. 3 by an empty diamond.

      4. Labeling of subpanels in Fig. 5A

      We have revised Fig. 5 according to the suggestion of Reviewer #2.

      5. FGF2 membrane binding efficiencies shown in Fig. 5C

      It is true that FGF2 variant forms defective in PI(4,5)P2-dependent oligomerization (C95A and C77/95A) bind to membranes with somewhat reduced efficiencies. This is also evident from the intensity profiles shown in Fig. 5A and was observed in biochemical in vitro experiments as well. A plausible explanation for this phenomenon would be the increased avidity when FGF2 oligomerizes, stabilizing membrane interactions (see also Fig. 9B).

      6. Residual activities of FGF2 C95A and C77/95A in membrane pore formation?

      We do not interpret the phenomenon in Fig. 5 that Reviewer #2 is referring to as controlled activity of FGF2 C95A and C77/95A in membrane pore formation. Rather, GUVs containing PI(4,5)P2 are relatively labile structures, so that a certain loss of integrity upon protein binding and extended incubation times is conceivable. It is basically a technical limitation of this assay, in which GUVs are incubated with proteins for 2 hours. Even after substitution of PI(4,5)P2 with a Ni-NTA membrane lipid, background levels of loss of membrane integrity can be observed (Fig. 6). Therefore, the critical point here is that, compared to FGF2 C95A and C77/95A, FGF2 wt and FGF2 C77A do display significantly higher levels of a loss of membrane integrity in PI(4,5)P2-containing GUVs, a phenomenon that we interpret as controlled membrane pore formation. By contrast, all variant forms of FGF2 show only background levels for loss of membrane integrity in GUVs containing the Ni-NTA lipid.

      7. Why does PI(4,5)P2 induce FGF2 dimerization?

      This has been studied extensively in previous work (Steringer et al, 2017). As also discussed in the current manuscript, the interaction of FGF2 with membranes through its high affinity PI(4,5)P2 binding pocket orients FGF2 molecules on a 2D surface, which increases the likelihood of the formation of the C95-containing FGF2 dimerization interface. Moreover, in the presence of cholesterol at levels typical for plasma membranes, PI(4,5)P2 forms clusters containing up to 4 PI(4,5)P2 molecules (Lolicato et al, 2022), a process that may further facilitate FGF2 dimerization.

      8. Is it possible to pinpoint the number of FGF2 subunits in oligomers observed in cryo-electron tomography?

      We indeed took advantage of the Halo tags that appear as dark globular structures in cryo-electron tomography. For most FGF2 oligomers with FGF2 subunits on both sides of the membrane, we could observe 4 to 6 Halo tags which is consistent with the functional subunit number that has been analyzed for membrane pore formation (Steringer et al., 2017; Sachl et al, 2020; Singh et al, 2023). However, since the number of higher FGF2 oligomers we observed in cryo-electron tomography was relatively small and the nature of these oligomers appears to be highly dynamic, caution should be taken to avoid overinterpretation of the available data.

      Reviewer #3:

      1. Conclusive demonstration of disulfide-linked FGF2 dimers

      A similar point was raised by Reviewer #1, so that we would like to refer to our response on page 2, see above.

      2. Identity of FGF2-P2A-GFP observed in Fig. 3

      Again, a similar point has been made, in this case by Reviewer #2 (Point 3). The observed band is not an FGF2-P2A-GFP dimer but rather the complete FGF2-P2A-GFP fusion protein (a monomer) that corresponds to a small population produced during mRNA translation where the P2A sequence did not cause the production of FGF2 and GFP as separate proteins in stoichiometric amounts.

      3. Quantification of GFP signals in Fig. 6

      Fig. 6 has been revised according to the suggestion of Reviewer #3. A comprehensive comparison of PI(4,5)P2 and the Ni-NTA membrane lipid in FGF2 membrane translocation assays is also contained in previous work that introduced the GUV-based FGF2 membrane translocation assay (Steringer et al., 2017).

      4. Experimental evidence for various aspects of FGF2 interactions with PI(4,5)P2

      Most of the points raised by Reviewer #3 have been addressed in previous work. For example, FGF2 has been demonstrated to dimerize only on membrane surfaces containing PI(4,5)P2 (Müller et al., 2015). In solution, FGF2 remained a monomer even after hours of incubation as analyzed by native gel electrophoresis and reducing vs. non-reducing SDS gels (see Fig. 3 in Müller et al, 2015). In the same paper, the first evidence for a potential role of C95 in FGF2 oligomerization has been reported, however, at the time, our studies were limited to FGF2 C77/95A. In the current manuscript, the in vitro experiments shown in Figs. 2 to 6 establish the unique role of C95 in PI(4,5)P2-dependent FGF2 oligomerization. As discussed above, FGF2 oligomers have been shown to contain disulfide bridges based on analyses on non-reducing gels in the absence and presence of DTT (Müller et al., 2015).

      References

      Brown DI, Griendling KK (2009) Nox proteins in signal transduction. Free Radic Biol Med 47: 1239-1253

      Decker CG, Wang Y, Paluck SJ, Shen L, Loo JA, Levine AJ, Miller LS, Maynard HD (2016) Fibroblast growth factor 2 dimer with superagonist in vitro activity improves granulation tissue formation during wound healing. Biomaterials 81: 157-168

      Hakim M, Fass D (2010) Cytosolic disulfide bond formation in cells infected with large nucleocytoplasmic DNA viruses. Antioxid Redox Signal 13: 1261-1271

      Legrand C, Saleppico R, Sticht J, Lolicato F, Muller HM, Wegehingel S, Dimou E, Steringer JP, Ewers H, Vattulainen I et al (2020) The Na,K-ATPase acts upstream of phosphoinositide PI(4,5)P2 facilitating unconventional secretion of Fibroblast Growth Factor 2. Commun Biol 3: 141

      Lennicke C, Cocheme HM (2021) Redox metabolism: ROS as specific molecular regulators of cell signaling and function. Mol Cell 81: 3691-3707

      Locker JK, Griffiths G (1999) An unconventional role for cytoplasmic disulfide bonds in vaccinia virus proteins. J Cell Biol 144: 267-279

      Lolicato F, Saleppico R, Griffo A, Meyer A, Scollo F, Pokrandt B, Muller HM, Ewers H, Hahl H, Fleury JB et al (2022) Cholesterol promotes clustering of PI(4,5)P2 driving unconventional secretion of FGF2. J Cell Biol 221

      Müller HM, Steringer JP, Wegehingel S, Bleicken S, Munster M, Dimou E, Unger S, Weidmann G, Andreas H, Garcia-Saez AJ et al (2015) Formation of Disulfide Bridges Drives Oligomerization, Membrane Pore Formation and Translocation of Fibroblast Growth Factor 2 to Cell Surfaces. J Biol Chem 290: 8925-8937

      Nawrocka D, Krzyscik MA, Opalinski L, Zakrzewska M, Otlewski J (2020) Stable Fibroblast Growth Factor 2 Dimers with High Pro-Survival and Mitogenic Potential. Int J Mol Sci 21

      Netto LES, Machado L (2022) Preferential redox regulation of cysteine-based protein tyrosine phosphatases: structural and biochemical diversity. FEBS J 289: 5480-5504

      Nordzieke DE, Medrano-Fernandez I (2018) The Plasma Membrane: A Platform for Intra- and Intercellular Redox Signaling. Antioxidants (Basel) 7

      Sachl R, Cujova S, Singh V, Riegerova P, Kapusta P, Muller HM, Steringer JP, Hof M, Nickel W (2020) Functional Assay to Correlate Protein Oligomerization States with Membrane Pore Formation. Anal Chem 92: 14861-14866

      Singh V, Macharova S, Riegerova P, Steringer JP, Muller HM, Lolicato F, Nickel W, Hof M, Sachl R (2023) Determining the Functional Oligomeric State of Membrane-Associated Protein Oligomers Forming Membrane Pores on Giant Lipid Vesicles. Anal Chem 95: 8807-8815

      Steringer JP, Lange S, Cujova S, Sachl R, Poojari C, Lolicato F, Beutel O, Muller HM, Unger S, Coskun U et al (2017) Key steps in unconventional secretion of fibroblast growth factor 2 reconstituted with purified components. eLife 6: e28985

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      The authors' finding that PARG hydrolase removal of polyADP-ribose (PAR) protein adducts generated in response to the presence of unligated Okazaki fragments is important for S-phase progression is potentially valuable, but the evidence is incomplete, and identification of relevant PARylated PARG substrates in S-phase is needed to understand the role of PARylation and dePARylation in S-phase progression. Their observation that human ovarian cancer cells with low levels of PARG are more sensitive to a PARG inhibitor, presumably due to the accumulation of high levels of protein PARylation, suggests that low PARG protein levels could serve as a criterion to select ovarian cancer patients for treatment with a PARG inhibitor drug.

      Thank you for the assessment and summary. Please see below for details as we have now addressed the deficiencies pointed out by the reviewers.

      We believe that PARP1 is one of the major relevant PARG substrates in S phase cells. Previous studies reported that PARP1 recognizes unligated Okazaki fragments and induces S phase PARylation, which recruits single-strand break repair proteins such as XRCC1 and LIG3 and thereby acts as a backup pathway for Okazaki fragment maturation (Hanzlikova et al., 2018; Kumamoto et al., 2021). In this study, we revealed that accumulation of PARP1/2-dependent S phase PARylation eventually led to cell death (Fig. 2). Furthermore, we found that chromatin-bound PARP1 as well as PARylated PARP1 increased in PARG KO cells (Fig. S4A and Fig. 4A), suggesting that PARP1 is one of the key substrates of PARG in S phase cells. Of course, PARG may have additional substrates besides PARP1 which are required for its roles in S phase progression, as PARG is known to be recruited to DNA damage sites through pADPr- and PCNA-dependent mechanisms (Mortusewicz et al., 2011). Precisely how PARG regulates S phase progression warrants further investigation.

      Public Reviews:

      Reviewer #1 (Public Review):

      I have a major conceptual problem with this manuscript: How can the full deletion of a gene (PARG) sensitize a cell to further inhibition by its chemical inhibitor (PARGi) since the target protein is fully absent?

      Please see below for details about this point. Briefly, we found that PARG is an essential gene (Fig. 7). There was residual PARG activity in our PARG KO cells, although the loss of full-length PARG was confirmed by Western blotting and DNA sequencing (Fig. S9). The residual PARG activity in these cells can be further inhibited by the PARG inhibitor, which eventually leads to cell death.

      The authors state in the discussion section: "The residual PARG dePARylation activity observed in PARG KO cells likely supports cell growth, which can be further inhibited by PARGi". What does this statement mean? Is the authors' conclusion that their PARG KOs are not true KOs but partial hypomorphic knockdowns? Were the authors working with KO clones or CRISPR deletion in populations of cells?

      The reviewer is correct that our PARG KOs are not true KOs. We were working with CRISPR edited KO clones. As shown in this manuscript, we validated our KO clones by Western blotting, DNA sequencing and MMS-induced PARylation. Despite these efforts and our inability to detect full-length PARG in our KO clones, we suspect that our PARG KO cells may still express one or more active fragments of PARG due to alternative splicing and/or alternative ATG usage.

      As shown in Fig. 7, we believe that PARG is essential for proliferation. Our initial KO cell lines are not complete PARG KO cells and residual PARG activity in these cells could support cell proliferation. Unfortunately, due to lack of appropriate reagents we could not draw solid conclusions regarding the isoforms or the truncated PARG expressed in these cells (Please see Western blots below).

      Are there splice variants of PARG that were not knocked down? Are there PARP paralogues that can complement the biochemical activity of PARG in the PARG KOs? The authors do not discuss these critical issues nor engage with this problem.

      There are five reviewed or potential PARG isoforms identified in the Uniprot database. The two sgRNAs (#1 and #2) used to generate initial PARG KO cells in this manuscript target all three catalytically active isoforms (isoforms 1, 2 and 3), and sgRNA#2 used in HeLa cells also targets isoforms 4 and 5, but these isoforms are considered catalytically inactive according to the Uniprot database. However, it is likely that sgRNA-mediated genome editing may lead to the creation of new alternatively spliced PARG mRNAs or the use of alternative ATG, which can produce catalytically active forms of PARG. Instead of searching for these putative spliced PARG RNAs, we used two independent antibodies that recognize the C-terminus of PARG for WB as shown below. Unfortunately, besides full-length PARG, these antibodies also recognized several other bands; some of these were reduced or absent in PARG KO cells, whereas others were not. Thus, we could not draw a clear conclusion as to which functional isoform was expressed in our PARG KO cells. Nevertheless, we directly measured PARG activity in PARG KO cells (Fig. S9) and showed that we were still able to detect residual PARG activity in these PARG KO cells. These data clearly indicate that residual PARG activity is present in our KO cells, but the precise nature of these truncated forms of PARG remains elusive.
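
      The isoform coverage described in this paragraph can be summarized as a simple set comparison. The sketch below encodes only what is stated in the response (which isoforms each sgRNA targets and which are considered catalytically active); the variable and function names are hypothetical, and the check says nothing about new splice variants or alternative start codons that editing itself might create.

```python
# Isoform numbering and catalytic status as stated in the response (from Uniprot).
CATALYTICALLY_ACTIVE = {1, 2, 3}

# Isoforms targeted by each sgRNA, as stated in the response.
SGRNA_TARGETS = {
    "sgRNA#1": {1, 2, 3},
    "sgRNA#2": {1, 2, 3, 4, 5},
}


def untargeted_active_isoforms(sgrna):
    """Annotated catalytically active isoforms that a given sgRNA does not target."""
    return CATALYTICALLY_ACTIVE - SGRNA_TARGETS[sgrna]


# Both sets are empty: every annotated active isoform is targeted by each sgRNA.
missed_by_1 = untargeted_active_isoforms("sgRNA#1")
missed_by_2 = untargeted_active_isoforms("sgRNA#2")
```

      Since neither sgRNA leaves an annotated active isoform untargeted, the residual activity must come from products outside this annotation, which is consistent with the explanation offered above.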

      Author response image 1.

      These issues have to be dealt with upfront in the manuscript for the reader to make sense of their work.

      We thank this reviewer for his/her constructive comments and suggestions. We will include the data above and additional discussion upfront in our revised manuscript to avoid any further confusion by our readers.

      Reviewer #2 (Public Review):

      Summary:

      In this manuscript, Nie et al investigate the effect of PARG KO and PARG inhibition (PARGi) on pADPR, DNA damage, cell viability, and synthetic lethal interactions in HEK293A and Hela cells. Surprisingly, the authors report that PARG KO cells are sensitive to PARGi and show higher pADPR levels than PARG KO cells, which are abrogated upon deletion or inhibition of PARP1/PARP2. The authors explain the sensitivity of PARG KO to PARGi through incomplete PARG depletion and demonstrate complete loss of PARG activity when incomplete PARG KO cells are transfected with additional gRNAs in the presence of PARPi. Furthermore, the authors show that the sensitivity of PARG KO cells to PARGi is not caused by NAD depletion but by S-phase accumulation of pADPR on chromatin coming from unligated Okazaki fragments, which are recognized and bound by PARP1. Consistently, PARG KO or PARG inhibition shows synthetic lethality with Pol beta, which is required for Okazaki fragment maturation. PARG expression levels in ovarian cancer cell lines correlate negatively with their sensitivity to PARGi.

      Thank you for your nice comments. The complete loss of PARG activity was observed in PARG complete/conditional KO (cKO) cells. These cKO clones were generated using wild-type cells transfected with sgRNAs targeting the catalytic domain of PARG in the presence of PARP inhibitor.

      Strengths:

      The authors show that PARG is essential for removing ADP-ribosylation in S-phase.

      Thanks!

      Weaknesses:

      1. This begs the question as to the relevant substrates of PARG in S-phase, which could be addressed, for example, by analysing PARylated proteins associated with replication forks in PARG-depleted cells (EdU pulldown and Af1521 enrichment followed by mass spectrometry).

      We believe that PARP1 is one of the major relevant PARG substrates in S phase cells. Previous studies reported that PARP1 recognizes unligated Okazaki fragments and induces S phase PARylation, which recruits single-strand break repair proteins such as XRCC1 and LIG3 and thereby acts as a backup pathway for Okazaki fragment maturation (Hanzlikova et al., 2018; Kumamoto et al., 2021). In this study, we revealed that accumulation of PARP1/2-dependent S phase PARylation eventually led to cell death (Fig. 2). Furthermore, we found that chromatin-bound PARP1 as well as PARylated PARP1 increased in PARG KO cells (Fig. S4A and Fig. 4A), suggesting that PARP1 is one of the key substrates of PARG in S phase cells. Of course, PARG may have additional substrates besides PARP1 which are required for its roles in S phase progression, as PARG is known to be recruited to DNA damage sites through pADPr- and PCNA-dependent mechanisms (Mortusewicz et al., 2011). Precisely how PARG regulates S phase progression warrants further investigation.

      1. The results showing the generation of a full PARG KO should be moved to the beginning of the Results section, right after the first Results chapter (PARG depletion leads to drastic sensitivity to PARGi), otherwise, the reader is left to wonder how PARG KO cells can be sensitive to PARGi when there should be presumably no PARG present.

      Thank you for your suggestion! However, we would like to keep the complete PARG KO result at the end of the Results section, since this was how this project evolved. Initially, we did not know that PARG is an essential gene. Thus, we speculated that PARGi may target not only PARG but also a second target, which only becomes essential in the absence of PARG. To test this possibility, we performed FACS-based and cell survival-based whole-genome CRISPR screens (Fig. 5). However, this putative second target was not revealed by our CRISPR screening data (Fig. 5). We then tested the possibility that these cells may have residual PARG expression or activity and that only cells with very low PARG expression are sensitive to PARGi, which turned out to be the case for ovarian cancer cells. Equipped with a PARP inhibitor and sgRNAs targeting the catalytic domain of PARG, we finally generated cells with complete loss of PARG activity to prove that PARG is an essential gene (Fig. 7). This series of experiments underscores the challenge of validating any KO cell line: the identification of frame-shift mutations, the absence of full-length proteins, and phenotypic changes may still not be sufficient to validate KO clones. This is an important lesson we learned and we would like to share it with the scientific community.

      To avoid further misunderstanding, we will include additional statements/comments at the end of the “PARG depletion leads to drastic sensitivity to PARGi” section and at the beginning of “CRISPR screens reveal genes responsible for regulating pADPr signaling and/or cell lethality in WT and PARG KO cells”. We hope that our revised manuscript will make this clear.

      1. Please indicate in the first figure which isoforms were targeted with gRNAs, given that there are 5 PARG isoforms. You should also highlight that the PARG antibody only recognizes the largest isoform, which is clearly absent in your PARG KO, but other isoforms may still be produced, depending on where the cleavage sites were located.

      The two sgRNAs (#1 and #2) used to generate the initial PARG KO cells in this manuscript target all three catalytically active isoforms (isoforms 1, 2 and 3). sgRNA#2, used in HeLa cells, also targets isoforms 4 and 5, but these isoforms are considered catalytically inactive according to the UniProt database. As suggested, we will modify Fig. S1D and the figure legends.

      Although the manufacturer's instructions state that the anti-PARG antibody (66564S) recognizes only isoform 1, this antibody could also recognize isoforms 2 and 3, albeit weakly, based on Western blot results with lysates prepared from PARG cKO cells reconstituted with different PARG isoforms, as shown below. As suggested, we will add a statement in the revised manuscript and provide the Western blotting data below.

      Author response image 2.

      To test whether other isoforms were expressed in 293A and/or HeLa cells, we used two independent antibodies that recognize the C-terminus of PARG for WB, as shown below. Unfortunately, besides full-length PARG, these antibodies also recognized several other bands, some of which were reduced or absent in PARG KO cells while others were not. Thus, we could not draw a clear conclusion as to which functional isoforms or truncated forms were expressed in our PARG KO cells.

      Author response image 3.

      1. FACS data need to be quantified. Scatter plots can be moved to Supplementary while quantification histograms with statistical analysis should be placed in the main figures.

      We agree with this reviewer that quantification of FACS data may provide straightforward results for some of our data. However, it is challenging to quantify positive S phase pADPr signaling in some panels, for example in Fig. 3A and Fig. 4C. In both panels, pADPr signaling was detected throughout the cell cycle, and therefore it is difficult to determine the percentage of S phase pADPr signaling in these samples. Thus, we decided to keep the scatter plots to demonstrate the dramatic and S phase-specific pADPr signaling in PARG KO cells treated with PARGi. We hope that these data are clear and convincing even without quantification.

      1. All colony formation assays should be quantified and sensitivity plots should be shown next to example plates.

      As suggested, we will include the sensitivity plot next to Fig. 3D. However, other colony formation assays in this study were performed with a single concentration of inhibitor and therefore we will not provide sensitivity plots for these experiments. Nevertheless, the results of these experiments are straightforward and easy to interpret.

      1. Please indicate how many times each experiment was performed independently and include statistical analysis.

      As suggested, we will add this information in the revised manuscript.

      Reviewer #3 (Public Review):

      Here the authors carried out a CRISPR/sgRNA screen with a DDR gene-targeted mini-library in HEK293A cells looking for genes whose loss increased sensitivity to treatment with the PARG inhibitor, PDD00017273 (PARGi). Surprisingly they found that PARG itself, which encodes the cellular poly(ADP-ribose) glycohydrolase (dePARylation) enzyme, was a major hit. Targeted PARG KO in 293A and HeLa cells also caused high sensitivity to PARGi. When PARG KO cells were reconstituted with catalytically-dead PARG, MMS treatment caused an increase in PARylation, not observed when cells were reconstituted with WT PARG or when the PARG KO was combined with PARP1/2 DKO, suggesting that loss of PARG leads to a strong PARP1/2-dependent increase in protein PARylation. The decrease in intracellular NAD+, the substrate for PARP-driven PARylation, observed in PARG KO cells was reversed by treatment with NMN or NAM, and this treatment partially rescued the PARG KO cell lethality. However, since NAD+ depletion with the FK866 nicotinamide phosphoribosyltransferase (NAMPT) inhibitor did not induce a similar lethality the authors concluded that NAD+ depletion/reduction was only partially responsible for the PARGi toxicity. Interestingly, PARylation was also observed in untreated PARG KO cells, specifically in S phase, without a significant rise in γH2AX signals. Using cells synchronized at G1/S by double thymidine blockade and release, they showed that entry into S phase was necessary for PARGi to induce PARylation in PARG KO cells. They found an increased association of PARP1 with a chromatin fraction in PARG KO cells independent of PARGi treatment, and suggested that PARP1 trapping on chromatin might account in part for the increased PARGi sensitivity. They also showed that prolonged PARGi treatment of PARG KO cells caused S phase accumulation of pADPr eventually leading to DNA damage, as evidenced by increased anti-γH2AX antibody signals and alkaline comet assays.
      Based on the use of emetine, they deduced that this response could be caused by unligated Okazaki fragments. Next, they carried out FACS-based CRISPR screens to identify genes that might be involved in cell lethality in WT and PARG KO cells, finding that loss of base excision repair (BER) and DNA repair genes led to increased PARylation and PARGi sensitivity, whereas loss of PARP1 had the opposite effects. They also found that BER pathway disruption exhibited synthetic lethality with PARGi treatment in both PARG KO cells and WT cells, and that loss of genes involved in Okazaki fragment ligation induced S phase pADPr signaling. In a panel of human ovarian cancer cell lines, PARGi sensitivity was found to correlate with low levels of PARG mRNA, and they showed that the PARGi sensitivity of cells could be reduced by PARPi treatment. Finally, they addressed the conundrum of why PARG KO cells should be sensitive to a specific PARG inhibitor if there is no PARG to inhibit and found that the PARG KO cells had significant residual PARG activity when measured in a lysate activity assay, which could be inhibited by PARGi, although the inhibited PARG activity levels remained higher than those of PARG cKO cells (see below). This led them to generate new, more complete PARG KO cells they called complete/conditional KO (cKO), whose survival required the inclusion of the olaparib PARPi in the growth medium. These PARG cKO cells exhibited extremely low levels of PARG activity in vitro, consistent with a true PARG KO phenotype.

      We thank this reviewer for his/her constructive comments and suggestions.

      The finding that human ovarian cancer cells with low levels of PARG are more sensitive to inhibition with a small molecule PARG inhibitor, presumably due to the accumulation of high levels of protein PARylation (pADPr) that are toxic to cells is quite interesting, and this could be useful in the future as a diagnostic marker for preselection of ovarian cancer patients for treatment with a PARG inhibitor drug. The finding that loss of base excision repair (BER) and DNA repair genes led to increased PARylation and PARGi sensitivity is in keeping with the conclusion that PARG activity is essential for cell fitness, because it prevents excessive protein PARylation. The observation that increased PARylation can be detected in an unperturbed S phase in PARG KO cells is also of interest. However, the functional importance of protein PARylation at the replication fork in the normal cell cycle was not fully investigated, and none of the key PARylation targets for PARG required for S phase progression were identified. Overall, there are some interesting findings in the paper, but their impact is significantly lessened by the confusing way in which the paper has been organized and written, and this needs to be rectified.

      We believe that PARP1 is one of the major relevant PARG substrates in S phase cells. Previous studies reported that PARP1 recognizes unligated Okazaki fragments and induces S phase PARylation, which recruits single-strand break repair proteins such as XRCC1 and LIG3, which act as a backup pathway for Okazaki fragment maturation (Hanzlikova et al., 2018; Kumamoto et al., 2021). In this study, we revealed that accumulation of PARP1/2-dependent S phase PARylation eventually led to cell death (Fig. 2). Furthermore, we found that chromatin-bound PARP1 as well as PARylated PARP1 increased in PARG KO cells (Fig. S4A and Fig. 4A), suggesting that PARP1 is one of the key substrates of PARG in S phase cells. Of course, PARG may have additional substrates besides PARP1 that are required for its roles in S phase progression, as PARG is known to be recruited to DNA damage sites through pADPr- and PCNA-dependent mechanisms (Mortusewicz et al., 2011). Precisely how PARG regulates S phase progression warrants further investigation.

      As suggested, we will revise our manuscript accordingly and provide additional explanation/statement upfront to avoid any misunderstandings.  

      Reviewer #1 (Recommendations For The Authors):

      1. Figure 1c. Why does the viability of PARG KO cells improve at higher doses of PARGi? How do the authors explain this paradox?

      This phenomenon was observed in 293A PARG KO cells and occurred in the CellTiter-Glo assay, especially at the top three PARGi concentrations (100 µM, 33.33 µM and 11.11 µM). This may be due to the low solubility of this PARGi in the medium, since we sometimes observed precipitation at high concentrations when the PARGi stock was diluted in medium.

      1. Figure 2d. The authors show that PARGi reduced NAD+ level by 20%. This reduction in NAD+ probably does not explain the cell death phenotype observed by parthanatos cell death. What pathway is activated by PARGi to induce cell death?

      Since PARGi treatment of PARG KO cells led to uncontrolled pADPr accumulation, it is possible that some of these cells die by parthanatos. However, we did not observe a dramatic reduction in NAD+ level. A previous study showed that Parg(-/-) mouse ES cells predominantly underwent caspase-dependent apoptosis (Shirai et al., 2013). Indeed, PARP1 cleavage was detected in PARG KO cells with prolonged PARGi treatment, indicating that at least some of these cells die by apoptosis (Fig. 2A). The cytotoxicity of PARGi in PARG KO cells may be due to several mechanisms, including apoptosis, parthanatos and NAD+ reduction.

      1. The authors refer to FK866 in the text without explaining what this agent is. FK866 is a noncompetitive inhibitor of nicotinamide phosphoribosyltransferase (NAMPT), a key enzyme in the regulation of NAD+ biosynthesis from the natural precursor nicotinamide. The authors should explain experimental tools in the text as they use them for clarity to the reader.

      Thanks for the suggestion! We will include additional citations and discuss how FK866 works in our revised manuscript.

      1. In addition to these issues, there are significant formatting and textual problems, such that there are multiple gaps in the body of the text that make coherent reading of the manuscript impossible. Examples are: Page 3 line 10. Page 6 line 5 and line 15, Page 7 line 2, 3, and line 8. Page 8, line 1, and line 3 from bottom. Page 9 line 1, line 7 from bottom and line 9 from the bottom, Page 18 of the results in several places, etc. etc. etc. These formatting errors convey the impression that the submitting authors did not adequately review the manuscript for technical problems prior to submission. The authors need to correct these errors.

      Sorry, we will edit the text and remove these gaps as suggested.

      Reviewer #3 (Recommendations For The Authors):

      1. The major problem with this paper is conceptual - namely, how could PARG knockout cells be hypersensitive to a selective PARG small molecular inhibitor. The evidence in Figure 7 that there is measurable residual PARG activity in the so-called PARG KO 293A and HeLa cells provides a partial explanation for why PARG inhibitor treatment might be deleterious to the PARG KO cells, i.e., because PARGi blocks this residual PARG activity. However, although the authors characterized the PARG alleles in the 293A PARG KO cells by sequencing, the molecular origin of the significant level of residual PARG activity remains unclear (see points 7-9).

      Yes, in our study we showed that PARGi treatment inhibited the residual PARG activity in PARG KO cells, which mimics complete loss of PARG, as PARG is an essential gene. These data agree with a previous study using Parg(-/-) mouse cells (Koh et al., 2004). We attempted to define the molecular origin of the residual PARG activity; unfortunately, this was challenging (please see below for additional discussion). Nevertheless, we showed that residual PARG activity could be detected in PARG KO cells and, more importantly, that cells with reduced PARG expression or activity are sensitive to PARGi. These results indicate that PARG expression and/or activity may be used as a biomarker for PARGi-based therapy.

      1. Although the most obvious explanation for the PARGi sensitivity data presented in Figures 1-4 is that the PARG KO cells have residual PARG activity, the authors wait until the discussion on page 26 to raise the possibility that the PARG KO cells might have residual PARG activity that renders them sensitive to PARGi. It would be more logical to move the PARG activity data in Figure 7 earlier in the paper as a supplementary figure, so that the reader is not left wondering how a PARG KO cell remains sensitive to a PARG inhibitor. For this reason, it is recommended that the whole paper be reorganized and rewritten to provide a more logical flow that allows the reader to understand what was done, and why it is hard to generate complete PARG KO cells because the accumulation of pADPR adducts is toxic to the cell.

      Thank you for your suggestion! However, we would like to keep the complete PARG KO result at the end of the Results section, since this was how this project evolved. Initially, we did not know that PARG is an essential gene. Thus, we speculated that PARGi may target not only PARG but also a second target, which only becomes essential in the absence of PARG. To test this possibility, we performed FACS-based and cell survival-based whole-genome CRISPR screens (Fig. 5). However, this putative second target was not revealed by our CRISPR screening data (Fig. 5). We then tested the possibility that these cells may have residual PARG expression or activity and that only cells with very low PARG expression are sensitive to PARGi, which turned out to be the case for ovarian cancer cells. Equipped with a PARP inhibitor and sgRNAs targeting the catalytic domain of PARG, we finally generated cells with complete loss of PARG activity to prove that PARG is an essential gene (Fig. 7). This series of experiments underscores the challenge of validating any KO cell line: the identification of frame-shift mutations, the absence of full-length proteins, and phenotypic changes may still not be sufficient to validate KO clones. This is an important lesson we learned and we would like to share it with the scientific community.

      To avoid further misunderstanding, we will include additional statements/comments at the end of the “PARG depletion leads to drastic sensitivity to PARGi” section and at the beginning of “CRISPR screens reveal genes responsible for regulating pADPr signaling and/or cell lethality in WT and PARG KO cells”. We hope that our revised manuscript will make this clear.

      1. Exactly how PARG activity would be coordinated with PARP1/2 activity during normal S phase to ensure that PARylation can serve its required function, whatever that may be, and is then removed by PARG is unclear - how would this be orchestrated at the level of a replication fork?

      PARG is known to be recruited to sites of DNA damage through pADPr- and PCNA-dependent mechanisms (Mortusewicz et al., 2011). Our current hypothesis is that PARP1 is one of the major PARG substrates in S phase cells. Previous studies reported that PARP1 recognizes unligated Okazaki fragments and induces S phase PARylation, which recruits single-strand break repair proteins such as XRCC1 and LIG3, which act as a backup pathway for Okazaki fragment maturation (Hanzlikova et al., 2018; Kumamoto et al., 2021). In this study, we revealed that accumulation of PARP1/2-dependent S phase PARylation eventually led to cell death (Fig. 2). Furthermore, we found that chromatin-bound PARP1 as well as PARylated PARP1 increased in PARG KO cells (Fig. S4A and Fig. 4A), suggesting that PARP1 is one of the key substrates of PARG in S phase cells. Of course, PARG may have additional substrates besides PARP1 that are required for its roles in S phase progression. Precisely how PARG regulates S phase progression warrants further investigation.

      1. Figure 2B: What gRNAs were used to generate the 293A and HeLa PARG knock clones, i.e., where are they located in the PARG gene? If they are not in the catalytic domain it might be possible to generate PARG proteins with N-terminal deletions that are still active (see points 8-10 below).

      The two sgRNAs (#1 and #2) used to generate the initial PARG KO cells in this manuscript target all three catalytically active isoforms (isoforms 1, 2 and 3). sgRNA#2, used in HeLa cells, also targets isoforms 4 and 5, but these isoforms are considered catalytically inactive according to the UniProt database. As suggested, we will modify Fig. S1D and the figure legends to show the location of the gRNAs.

      We agree with this reviewer that truncated but active forms of PARG may exist in these KO cells. We attempted to identify these truncated forms of PARG by using two independent antibodies that recognize the C-terminus of PARG for WB, as shown below. Unfortunately, besides full-length PARG, these antibodies also recognized several other bands, some of which were reduced or absent in PARG KO cells while others were not. Thus, we could not draw a clear conclusion as to which functional isoform or truncated form was expressed in our PARG KO cells. Nevertheless, we directly measured PARG activity in PARG KO cells (Fig. S9) and showed that residual PARG activity was still detectable in these cells. Based on these results, we stated that residual PARG activity was detected in our KO cells, but we were not able to specify the truncated variants of PARG in these cells.

      Author response image 4.

      1. Figure 3B/page 19: The authors state that "emetine, which diminishes Okazaki fragments, greatly inhibited S phase pADPr signaling in PARG KO cells", and from this deduced that Okazaki fragments on the lagging strand activate PARylation. However, emetine is not a specific lagging strand synthesis inhibitor, as implied here, but rather a protein synthesis inhibitor, which inhibits Okazaki fragment formation indirectly (see PMID: 36260751). The authors need to rewrite this section to explain how emetine works in this context.

      As suggested, we will cite this reference and discuss how emetine inhibits Okazaki fragment maturation in our revised manuscript. Additionally, we used three different POLA1 inhibitors to diminish Okazaki fragments. As shown in Fig. S3B, all three POLA1 inhibitors significantly abolished S-phase pADPr induced by PARGi in PARG KO cells. Furthermore, POLA1 inhibitors, adarotene and CD437, were able to rescue cell lethality caused by PARGi in PARG KO cells (Fig. 3E).

      1. Figure 7: It is not clear why these cells are called PARG complete/conditional KO cells (cKO). Generally, "conditional knockout" refers to a cell or animal in which a gene can be conditionally knocked out by inducible expression of Cre. Here, it appears that "conditional" refers to the fact that the PARG KO cells only grow in the presence of olaparib - is this the case?

      Yes, we used the name to separate these cells from our initial PARG KO cells. Moreover, we were only able to obtain and maintain these PARG cKO clones with complete loss of PARG activity in the presence of PARP inhibitor. Therefore, we called them PARG complete/conditional KO (cKO) cells.

      1. Figure 7B and D: The level of full-length PARG protein was much lower in the 293A and HeLa cKO cells compared to WT cells consistent with cKO cells representing a more complete PARG KO. The level of PARG protein in the 293A PARG cKO cells was apparently also lower than in the original PARG KO cells, but the KO and cKO samples should be run side by side to demonstrate this conclusively, and the bands need to be quantified. In panel B, it is not clear from the legend what cKO_3 and cKO_4 are, but presumably, they are different clones, and this should be stated.

      Full-length PARG was not detected in either PARG KO or PARG cKO cells by WB. The apparent lower level of endogenous PARG in Fig. 7D was due to the fact that reconstituted cells had high exogenous PARG expression and therefore we had to reduce exposure time for WB.

      As for cKO_3 and cKO_4 in Fig.7, they are different clones created by different sgRNAs. As suggested, we will include additional information in figure legends to clearly state which sgRNA was used to generate the respective KO and cKO clones.

      1. Figure S8: There is not enough information here or in the text to allow the reader to interpret these PARG allele sequences obtained from the PARG KO cells. From the Methods section, it appears that the PARG KO cells were clonal, with sequence data from one clone of each of the 293A and HeLa cell PARG KO cells being shown. If this is right, then in both cell types one out of four PARG alleles is wild type, and therefore one would expect the PARG protein signal to be ~25% of that in WT cells. However, based on the 293A PARG KO cells PARG immunoblot in Figure 2B the PARG protein signal is clearly much lower than 25% (these bands need to be quantified), and this discrepancy needs to be explained. What is the level of PARG protein in the PARG KO HeLa cells? If different PARG KO cell clones are analyzed by sequencing, do they all have an apparently intact PARG allele? Four different gRNA target sites in the PARG gene are shown in panel A in Figure 7, but the description in the text regarding how the four gRNAs were used is totally inadequate - were all four used simultaneously or only the two in the catalytic domain? Were pairs of gRNAs used in an attempt to generate a large intervening deletion - some Southern blots of the PARG gene region in the PARG cKO cells are needed to figure this out. The gRNAs are given numbers in Figure 7A, but it is unclear from the sequences shown in Figures S8 and S9 which gRNA sites are shown. All of this has to be clarified, so that the reader can understand the nature of the KO/cKO cells knockout alleles, and what PARG-related products, if any, they can express.

      Yes, all KO and cKO cells used in this study are single clones. As suggested, we will revise the figure legends of Fig. 7, S8 and S9 to include detailed information. To avoid any further misunderstanding, we will relabel the allele “WT” as “WT (reference)” in Fig. S8 and S9. We did not detect intact/wild-type PARG sequence in any single KO/cKO clone by DNA sequencing. Sequencing of single KO/cKO clones was performed using the TOPO TA Cloning kit. Briefly, genomic DNA was extracted from each single KO/cKO clone. Approximately 300 bp surrounding the sgRNA targeting sequence was amplified by PCR. The PCR product was cloned into the vector, and plasmids from approximately 10-15 bacterial clones were extracted and sent for sequencing. If any intact/wild-type PARG sequence was detected in these 10-15 bacterial clones, the KO/cKO clone was considered a heterozygous clone and was discarded.

      HEK293A and HeLa cells are not diploid cells and have complex karyotypes. The PARG gene is located on chromosome 10. Karyotyping by M-FISH shows that HeLa cells have 3 copies of chromosome 10 (Landry et al., 2013). HEK293 cells predominantly have 3 copies of chromosome 10, and sometimes 4 copies can be detected by G-banding (Binz et al., 2019). Therefore, it is anticipated that 1 to 4 mutant alleles would be detected in each KO/cKO clone by sequencing.
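
      As an illustrative aside, the screening depth described above can be put in rough numerical terms. The sketch below is not part of the original analysis and rests on simplifying assumptions that are ours: each sequenced bacterial colony is treated as an independent draw, and every PARG allele is assumed to be amplified and cloned with equal efficiency. The function name `prob_missing_wt` is hypothetical. Under these assumptions, with 3 or 4 alleles and 10 to 15 sequenced colonies, the chance that a single remaining wild-type allele escapes detection stays below about 6%.

```python
def prob_missing_wt(n_alleles: int, n_colonies: int, n_wt: int = 1) -> float:
    """Probability that none of the sequenced colonies carries a wild-type allele.

    Assumptions (illustrative only): colonies are independent draws and all
    alleles are amplified and cloned with equal efficiency.
    """
    if n_alleles <= 0 or n_colonies < 0 or not 0 <= n_wt <= n_alleles:
        raise ValueError("invalid allele or colony counts")
    # Chance that one colony carries a mutant allele, raised to the number of colonies
    return ((n_alleles - n_wt) / n_alleles) ** n_colonies


if __name__ == "__main__":
    # 3 or 4 copies of chromosome 10; 10 to 15 colonies sequenced per clone
    for copies in (3, 4):
        for colonies in (10, 15):
            p = prob_missing_wt(copies, colonies)
            print(f"{copies} alleles, {colonies} colonies: P(miss WT) = {p:.4f}")
```

      Unequal PCR amplification or cloning bias between alleles would change these figures, so they should be read only as an order-of-magnitude guide.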

      Only one sgRNA was transfected into cells for the selection of single clones. We did not use paired or multiple sgRNAs in any of these experiments. As shown in Fig. S1D and Fig. 7A, HEK293A derived and HeLa derived PARG KO single clones were generated with the use of different sgRNAs. In addition, the two PARG cKO single clones from HEK293A and HeLa cells were also generated by the use of two different sgRNAs, as shown in Fig. 7A-B. We will include all the information above in the revised manuscript, i.e. in Methods section as well as in figure legends.

      1. Figure S9A: The sequences of the 293A PARG alleles in the cKO cells suggest that these cells also have one intact PARG allele, which again does not fit with the very low level of intact PARG protein shown in Figure 7B. How do the authors explain this?

      Sorry, this is a misunderstanding. The allele “WT” in Fig. S8 and S9 is the reference sequence. We will change it to “Reference sequence” to avoid further confusion. As mentioned above, we did not detect any intact/wild-type PARG sequence in any of our single KO/cKO clones by sequencing.

      1. Figure S9B: These critical lysate activity data show that the PARG KO cells have ~50% of the PARG activity detected in WT cells. However, this is not consistent with the PARG protein level detected in PARG immunoblot in Figure 1B, which appears to be less than 5% of the PARG protein level in WT cells (with one intact PARG allele in these cells one would theoretically expect ~25%, although this depends on whether all four alleles are expressed equally). One possibility is that active PARG fragments are generated from one or more of the PARG KO alleles in the PARG KO cells. Targeted sequencing of PARG mRNAs might reveal whether there are shorter RNAs that could encode a protein containing the C-terminal catalytic domain (aa 570-910). In addition, the authors need to show the entire immunoblot to determine if there are smaller proteins recognized by the anti-PARG antibodies that might represent shorter PARG gene products (for this we need to know where the epitope against which the PARG antibodies are directed is located within the PARG protein - ideally the authors need to use an antibody directed against an epitope near the C-terminus).

      As stated in the Methods section, we incubated cell lysates with substrates overnight to evaluate the maximum level of pADPr hydrolysis, i.e. PARG activity, that we were able to detect in this assay. It is very likely that the PARG activity in PARG KO cells was much lower than 50%, due to saturation of signals for lysates isolated from wild-type cells. Thus, the data presented in our manuscript probably underestimate the reduction of PARG activity in PARG KO cells. Nevertheless, these data indicate that residual PARG activity was detected in PARG KO cells, whereas this activity was absent in PARG cKO cells.

      As mentioned above, we used two independent antibodies that recognize the C-terminus of PARG for WB. Unfortunately, we could not draw a clear conclusion as to which functional isoforms or truncated proteins were expressed in our PARG KO cells. The dePARylation assay used here may be the best way to test the residual PARG activity in our KO and cKO cells.

      1. Figure 7D: In this experiment, the level of re-expressed WT PARG protein was much higher than that of the endogenous PARG protein (quantification is needed) - how might this affect the interpretation of these experiments (N.B., WT and catalytically-dead PARG were also re-expressed for the experiments shown in Figure 1, but there are no PARG immunoblots to demonstrate how much the exogenous proteins were overexpressed, or activity measurements). If regulated pADPr signaling is important for a normal S phase, then one would have thought that expressing a very high level of active PARG would create problems.

      In Fig. S1E, we blotted for the endogenous PARG level in control cells and the exogenous PARG level in reconstituted cells. The reviewer is correct that exogenous PARG expression was much higher (~10-fold) than that of endogenous PARG in WT control cells. Nevertheless, we did not observe any obvious phenotypes in PARG KO/cKO cells reconstituted with a high level of exogenous PARG, which may reflect excess PARG level/activity in wild-type control cells.

      References:

      Binz, R. L., Tian, E., Sadhukhan, R., Zhou, D., Hauer-Jensen, M., and Pathak, R. (2019). Identification of novel breakpoints for locus- and region-specific translocations in 293 cells by molecular cytogenetics before and after irradiation. Sci Rep 9, 10554.

      Hanzlikova, H., Kalasova, I., Demin, A. A., Pennicott, L. E., Cihlarova, Z., and Caldecott, K. W. (2018). The Importance of Poly(ADP-Ribose) Polymerase as a Sensor of Unligated Okazaki Fragments during DNA Replication. Mol Cell 71, 319-331 e313.

      Koh, D. W., Lawler, A. M., Poitras, M. F., Sasaki, M., Wattler, S., Nehls, M. C., Stoger, T., Poirier, G. G., Dawson, V. L., and Dawson, T. M. (2004). Failure to degrade poly(ADP-ribose) causes increased sensitivity to cytotoxicity and early embryonic lethality. Proc Natl Acad Sci U S A 101, 17699-17704.

      Kumamoto, S., Nishiyama, A., Chiba, Y., Miyashita, R., Konishi, C., Azuma, Y., and Nakanishi, M. (2021). HPF1-dependent PARP activation promotes LIG3-XRCC1-mediated backup pathway of Okazaki fragment ligation. Nucleic Acids Res 49, 5003-5016.

      Landry, J. J., Pyl, P. T., Rausch, T., Zichner, T., Tekkedil, M. M., Stutz, A. M., Jauch, A., Aiyar, R. S., Pau, G., Delhomme, N., et al. (2013). The genomic and transcriptomic landscape of a HeLa cell line. G3 (Bethesda) 3, 1213-1224.

      Mortusewicz, O., Fouquerel, E., Ame, J. C., Leonhardt, H., and Schreiber, V. (2011). PARG is recruited to DNA damage sites through poly(ADP-ribose)- and PCNA-dependent mechanisms. Nucleic Acids Res 39, 5045-5056.

      Shirai, H., Fujimori, H., Gunji, A., Maeda, D., Hirai, T., Poetsch, A. R., Harada, H., Yoshida, T., Sasai, K., Okayasu, R., and Masutani, M. (2013). Parg deficiency confers radio-sensitization through enhanced cell death in mouse ES cells exposed to various forms of ionizing radiation. Biochem Biophys Res Commun 435, 100-106.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      In their paper, Kang et al. investigate rigidity sensing in amoeboid cells, showing that, despite their lack of proper focal adhesions, amoeboid migration of single cells is impacted by substrate rigidity. In fact, many different amoeboid cell types can durotax, meaning that they preferentially move towards the stiffer side of a rigidity gradient.

      The authors observed that NMIIA is required for durotaxis and, building on this observation, they generated a model to explain how durotaxis could be achieved in the absence of strong adhesions. According to the model, substrate stiffness alters the diffusion rate of NMIIA, with softer substrates allowing for faster diffusion. This allows for NMIIA accumulation at the back, which, in turn, results in durotaxis.

      The experiments support the main message of the paper regarding durotaxis by amoeboid cells. In my opinion, a few clarifications on the mechanism proposed to explain this phenomenon could strengthen this research:

      (1) According to your model, the rear end of the cell, which is in contact with softer substrates, will have slower diffusion rates of NMIIA. Does this mean that bigger cells will durotax better than smaller cells because the stiffness difference between front and rear is higher? Is it conceivable to attenuate the slope of the durotactic gradient to a degree where smaller cells lose their ability to durotax, while longer cells retain their capacity for directional movement?

      We thank the reviewer for this comment. In fact, bigger cells will not always durotax better than smaller cells. Although bigger cells sense a larger stiffness difference between the front and rear, cells placed on different regions of the underlying substrate may respond differently. This is because the difference in diffusion coefficient is not proportional to the stiffness difference in our theoretical model. Therefore, cells placed on a very stiff substrate may not durotax. When cells are placed on a region of suitable stiffness, where they are sensitive to the stiffness gradient, bigger cells will durotax better than smaller cells. In this situation, as you mentioned, lowering the stiffness gradient will make smaller cells adurotactic while longer cells still durotax.

      We tried to address this question further with our durotaxis assay, but there was a challenge: the amoeboid cells we use, including CD4+ Naïve T cells, neutrophils, dHL-60 cells and Dictyostelium, frequently protrude, retract and alter their contact area with the substrate, which makes it difficult to distinguish between bigger and smaller cells within a particular cell type. Previously reported durotactic cell lines, such as MDA-MB-231 and HT1080 cells, are bigger than the amoeboid cells we use, but they are mesenchymal cells and adopt distinct mechanisms that always involve stable focal adhesions. Therefore, although we are eager to answer this question experimentally and the stiffness gradient is tunable in our system, we have not found an appropriate approach and experimental setup.

      (2) Where did you place the threshold for soft, middle, and stiff regions (Figure 6)? Is it possible that you only have a linear rigidity gradient in the center of your gel and the more you approach the borders, the flatter the gradient gets? In this case, cells would migrate randomly on uniform substrates. Did you perform AFM over the whole length of the gel or just in the central part?

      We thank the reviewer for this comment. We have performed AFM over the whole length of our gradient gel (Fig. S1A). We divide the gel into three equal parts (stiff: 1-4 mm; middle: 4-7 mm; soft: 7-10 mm) and the stiffness gradient is almost linear within each part as shown in Fig. S1A.

      (3) In which region (soft, middle, stiff) did you perform all the cell tracking of the previous figures?

      We thank the reviewer for this question. We performed the cell tracking in the soft region of the gradient gel.

      (4) What is the level of confinement experienced by the cells? Is it possible that cells on the soft side of the gels experience less confinement due to a "spring effect" whereby the coverslips descending onto the cells might exert diminished pressure because the soft hydrogels act as buffers, akin to springs? If this were the case, cells could migrate following a confinement gradient.

      We thank the reviewer for this comment. Although the possibility that our thin hydrogel layers act as buffers cannot be completely excluded, we have performed the durotaxis assay without the upper gradient gel that provides confinement (Author response image 1A). In this case, CD4+ Naïve T cells, neutrophils, dHL-60 cells and Dictyostelium can still durotax (Author response image 1B-E), indicating that the stiffness gradient itself is sufficient to direct amoeboid cell migration.

      Author response image 1.

      Illustration of the durotaxis system without confinement (A) and y-FMI of CD4+ Naïve T cells (B), neutrophils (C), dHL-60 cells (D) and Dictyostelium (E) cultured on uniform substrate or gradient substrate (n ≥ 30 tracks were analyzed for each experiment, N = 3 independent experiments for each condition, replicates are biological). All error bars are SEM. ****, P < 0.0001, by Student’s t-test.
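
      For readers unfamiliar with the metric, the y-FMI reported above can be illustrated with a short calculation. This is a minimal sketch under the usual definition of a forward migration index (net displacement along one axis divided by the accumulated path length); the function name, the example tracks and the choice of +y as the stiff direction are assumptions made for illustration, not the analysis code used in the study.

```python
import math

def y_fmi(track):
    """Forward migration index along y: net y-displacement / accumulated path length."""
    path_length = sum(math.dist(p, q) for p, q in zip(track, track[1:]))
    if path_length == 0:
        return 0.0  # a cell that never moved has no directional bias
    return (track[-1][1] - track[0][1]) / path_length

# Hypothetical tracks as (x, y) positions in micrometres; +y is assumed to point
# towards the stiff end of the gradient.
directed_track = [(0, 0), (1, 2), (0, 4), (1, 6)]
wandering_track = [(0, 0), (2, 0), (2, 1), (0, 1)]

print("directed:", y_fmi(directed_track))
print("wandering:", y_fmi(wandering_track))
```

      A value close to 1 means the cell moved almost straight up the gradient, a value near 0 means no net bias along y, and negative values indicate movement towards the soft side.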

      Reviewer #2 (Public Review):

      Summary:

      The authors developed an imaging-based device that provides both spatial confinement and stiffness gradient to investigate if and how amoeboid cells, including T cells, neutrophils, and Dictyostelium, can durotax. Furthermore, the authors showed that the mechanism for the directional migration of T cells and neutrophils depends on non-muscle myosin IIA (NMIIA) polarized towards the soft-matrix-side. Finally, they developed a mathematical model of an active gel that captures the behavior of the cells described in vitro.

      Strengths:

      The topic is intriguing as durotaxis is essentially thought to be a direct consequence of mechanosensing at focal adhesions. To the best of my knowledge, this is the first report on amoeboid cells that do not depend on FAs to exert durotaxis. The authors developed an imaging-based durotaxis device that provides both spatial confinement and stiffness gradient and they also utilized several techniques such as quantitative fluorescent speckle microscopy and expansion microscopy. The results of this study have well-designed control experiments and are therefore convincing.

      Weaknesses:

      Overall this study is well performed but there are still some minor issues I recommend the authors address:

      (1) When using NMIIA/NMIIB knockdown cell lines to distinguish the role of NMIIA and NMIIB in amoeboid durotaxis, it would be better if the authors took compensatory effects into account.

      We thank the reviewer for this suggestion. We have investigated the compensation of myosin in NMIIA and NMIIB KD HL-60 cells using Western blot and added this result in our updated manuscript (Fig. S4B, C). The results showed that the level of NMIIB protein in NMIIA KD cells doubled while there was no compensatory upregulation of NMIIA in NMIIB KD cells. This is consistent with our conclusion that NMIIA rather than NMIIB is responsible for amoeboid durotaxis since in NMIIA KD cells, compensatory upregulation of NMIIB did not rescue the durotaxis-deficient phenotype.

      (2) The expansion microscopy assay is not clearly described and some details are missed such as how the assay is performed on cells under confinement.

      We thank the reviewer for this comment. We have updated the details of the expansion microscopy assay in lines 481-485 of our revised manuscript, including how the assay is performed on cells under confinement:

      Briefly, CD4+ Naïve T cells were seeded on a gradient PA gel with another upper gel providing confinement. 4% PFA was used to fix cells for 15 min at room temperature. After fixation, the upper gradient PA gel was carefully removed and the bottom gradient PA gel with seeded cells was immersed in an anchoring solution containing 1% acrylamide and 0.7% formaldehyde (Sigma, F8775) for 5 h at 37 °C.

      (3) In this study, an active gel model was employed to capture experimental observations. Previously, some active nematic models were also considered to describe cell migration, which is controlled by filament contraction. I suggest the authors provide a short discussion on the comparison between the present theory and those prior models.

      We thank the reviewer for this suggestion. Active nematic models have been employed to recapitulate many phenomena during cell migration (Nat Commun., 2018, doi: 10.1038/s41467-018-05666-8). The active nematic model describes the motion of cells using the orientation field, Q, and the velocity field, u. The director field n with (n = −n) is employed to represent the nematic state, which has head-tail symmetry. However, in our experiments, actin filaments are clearly polarized: they polymerize and flow in the direction of cell migration. Therefore, we chose an active gel model, which describes the polarized actin field during cell migration. In the Discussion, we have provided a comparison between the active gel model and the motor-clutch model. We have also added a short comparison between the present model and the active nematic model in lines 345-347 of the main text:

      The active nematic model employs active extensile or contractile agents to push or pull the fluid along their elongation axis to simulate cells flowing (61).

      (4) In the present model, actin flow contributes to cell migration while myosin distribution determines cell polarity. How does this model couple actin and myosin together?

      We thank the reviewer for this question. In our model, the polarization field P(r,t) is employed to couple actin and myosin together. Actin accumulates at the front, while myosin diffuses in the opposite direction. Therefore, we propose that actin and myosin flow in opposite directions, which is captured in the convection terms of the actin (∇[c(v+wP)]) and myosin (∇[m(-wP)]) density fields.
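
      The effect of the myosin convection term can be illustrated with a one-dimensional numerical sketch. This is not the model code used in the study. It assumes that ∇[m(-wP)] denotes a divergence, that P is uniform and points towards the cell front, and that myosin additionally diffuses with a constant coefficient D inside a cell with no-flux boundaries. The function name simulate_myosin and all parameter values are hypothetical.

```python
def simulate_myosin(n=50, length=10.0, w=0.5, D=0.2, P=1.0, dt=0.01, steps=5000):
    """Evolve a 1D myosin density m(x) under drift velocity -w*P plus diffusion.

    Cell index 0 is the rear and index n-1 is the front. An explicit upwind
    finite-volume scheme is used, so total myosin is conserved.
    """
    dx = length / n
    m = [1.0] * n          # start from a uniform myosin density
    u = -w * P             # myosin drifts opposite to the polarization
    for _ in range(steps):
        # flux[i] is the flux through the left face of cell i;
        # flux[0] and flux[n] stay zero (no-flux boundaries).
        flux = [0.0] * (n + 1)
        for i in range(1, n):
            upwind = m[i] if u < 0 else m[i - 1]
            flux[i] = u * upwind - D * (m[i] - m[i - 1]) / dx
        m = [m[i] - dt / dx * (flux[i + 1] - flux[i]) for i in range(n)]
    return m

profile = simulate_myosin()
print("rear density:", profile[0])
print("front density:", profile[-1])
```

      With P pointing forward, the drift carries myosin to the rear until it is balanced by diffusion, which gives a rear-enriched profile. The actin equation, with its convection velocity v+wP, is not simulated here.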

      Reviewing Editor (Recommendations For The Authors):

      We suggest that you cite the publication about confinement force microscopy from the Betz lab (https://doi.org/10.1101/2023.08.22.554088).

      We thank the editor for this suggestion. We have cited this publication in line 89 in our updated manuscript.

      Reviewer #1 (Recommendations For The Authors):

      Minor points and text corrections:

      - In line 288 you state that NMIIA basal diffusion rate is larger on softer substrates, while in line 315 you say that NMIIA is more diffusive on stiff. The two sentences seem to contradict each other.

      We thank the reviewer for pointing out this mistake. In our active gel model, the basal diffusion rate of NMIIA is larger on stiffer substrate. We have corrected this mistake in line 288 (line 283 in the updated manuscript) in our revised manuscript.

      - How were the non-muscle myosin images (Figure 3F) collected?

      We thank the reviewer for this question. The non-muscle myosin images in Fig. 3F are single planes collected by epifluorescence-confocal microscopy. We have updated the related method in our revised manuscript in line 477-478:

      After the mounting medium had solidified, single-plane images were captured using a 63×1.4 NA objective lens on an Andor Dragonfly epi-fluorescence confocal imaging system.

      - Is there a quantification of NMIIA accumulation at the back?

      We thank the reviewer for this question. We have a quantification of NMIIA distribution in Fig. 3G. We measured the fluorescence intensity of NMIIA and NMIIB in the soft and stiff regions of cells and found that the soft/stiff fluorescence ratio is about 0.95 for NMIIB and about 1.82 for NMIIA, indicating that NMIIA tends to localize at the back while NMIIB is evenly distributed between the soft and stiff regions of cells.

      - At which frequency were images acquired for Fluorescent Speckle Microscopy? Overall, I think it would help to state the length and frequency of videos in the legends.

      We thank the reviewer for this comment. We have updated the length (10 min for movies 6-10 and 80 sec for movie 11) and frequency (15 sec intervals for movies 6-10 and 2 sec intervals for movie 11) of the Fluorescent Speckle Microscopy videos in our revised manuscript.

      Reviewer #2 (Recommendations For The Authors):

      The cell contour of Figure S5C is not very clear.

      We thank the reviewer for this comment. We have marked the outline of the cell in Fig. S5C in our updated manuscript.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this study, Kroll et al. conduct an in-depth behavioral analysis of F0 knockouts of 4 genes associated with late-onset Alzheimer's Disease (AD), together with 3 genes associated with early-onset AD. Kroll and colleagues developed a web application (ZOLTAR) to compare sleep-associated traits between genetic mutants with those obtained from a panel of small molecules to promote the identification of affected pathways and potential therapeutic interventions. The authors make a set of potentially important findings vis-à-vis the relationship between AD-associated genes and sleep. First, they find that loss-of-function in late-onset AD genes universally results in night-time sleep loss, consistent with the well-supported hypothesis that sleep disruption contributes to Alzheimer's-related pathologies. psen1, an early-onset associated AD gene, which the authors find is principally responsible for the generation of Aβ40 and Aβ42 in zebrafish, also shows a slight increase in activity at night and slight decreases in night-time sleep. Conversely, psen2 mutations increase daytime sleep, while appa/appb mutations have no impact on sleep. Finally, using ZOLTAR, the authors identify serotonin receptor activity as potentially disrupted in sorl1 mutants, while betamethasone is identified as a potential therapeutic to promote reversal of psen2 knockout-associated phenotypes.

      This is a highly innovative and thorough study, yet a handful of key questions remain. First, are night-time sleep loss phenotypes observed in all knockouts for late-onset AD genes in the larval zebrafish a valid proxy for AD risk?

      We cannot say, but it is an interesting question. We selected the four late-onset Alzheimer’s risk genes (APOE, CD2AP, CLU, SORL1) based on human genetics data and brain expression in zebrafish larvae, not based on their likelihood to modify sleep behaviour, which we could have tried by searching for overlaps with GWAS of sleep phenotypes, for example. Consequently, we find it remarkable that all four of these genes caused a night-time sleep phenotype when mutated. We also find it reassuring that knockout of appa/appb and psen2 did not cause a night-time sleep phenotype, which largely excludes the possibility that the phenotype is a technical artefact (e.g. caused by the F0 knockout method) or a property of every gene expressed in the larval brain.

      Having said that, it could still be a coincidence, rather than a special property of genes associated with late-onset AD. In addition to testing additional late-onset Alzheimer’s risk genes, the ideal way to answer this question would be to test in parallel a random set of genes expressed in the brain at this stage of development. From this random set, one could estimate the proportion of genes that cause a night-time sleep phenotype when mutated. One could then use that information to test whether late-onset Alzheimer’s risk genes are indeed enriched for genes that cause a night-time sleep phenotype when mutated.
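
      The enrichment test described above could be run as a one-sided Fisher's exact test comparing the risk genes with the random gene set. The sketch below is illustrative only: the counts for the random set (10 of 100 genes causing a night-time sleep phenotype) are invented, and the function name is hypothetical. Only the 4 of 4 late-onset risk genes comes from the present study.

```python
from math import comb

def fisher_enrichment_p(hits_risk, n_risk, hits_random, n_random):
    """One-sided Fisher's exact test: probability of seeing at least `hits_risk`
    phenotype-causing genes among the risk genes if risk and random genes
    shared the same underlying rate (hypergeometric upper tail)."""
    total = n_risk + n_random
    total_hits = hits_risk + hits_random
    upper = min(n_risk, total_hits)
    favourable = sum(
        comb(total_hits, x) * comb(total - total_hits, n_risk - x)
        for x in range(hits_risk, upper + 1)
    )
    return favourable / comb(total, n_risk)

# 4/4 late-onset risk genes gave a night-time sleep phenotype (this study);
# 10/100 for the random brain-expressed set is a made-up placeholder.
p_value = fisher_enrichment_p(hits_risk=4, n_risk=4, hits_random=10, n_random=100)
print("one-sided p-value:", p_value)
```

      The smaller the proportion of random genes that cause the phenotype, the stronger the evidence that the risk genes are enriched. With only four risk genes tested, the result would remain sensitive to the size of the random set.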

      For those mutants that cause night-time sleep disturbances, do these phenotypes share a common underlying pathway? e.g. Do 5-HT reuptake inhibitors promote sleep across all 4 late-onset genes in addition to psen1? Can 5-HT reuptake inhibitors reverse other AD-related pathologies in zebrafish? Can compounds be identified that have a common behavioral fingerprint across all or multiple AD risk genes? Do these modify sleep phenotypes?

      To attempt to answer these questions, we used ZOLTAR to generate predictions for all the knockout behavioural fingerprints presented in the study, in the same way as for sorl1 in Fig. 5 and Fig. 5–supplement 1. Here are the indications, targets, and KEGG pathways which are shared by the largest number of knockouts (Author response image 1):

      – One indication is shared by 4/7 knockouts: “opioid dependence” (significant for appa/appb, psen1, apoea/apoeb, cd2ap).

      – Four targets are shared by 4/7 knockouts: “strychnine-binding glycine receptor” (psen1, apoea/apoeb, clu, sorl1); “neuronal acetylcholine receptor beta-2” (psen1, apoea/apoeb, cd2ap, clu); “thyroid peroxidase” (psen1, apoea/apoeb, cd2ap, clu); and “carbonic anhydrase IV” (appa/appb, psen1, psen2, cd2ap).

      – Three KEGG pathways are shared by 5/7 knockouts: “cholinergic synapse” (psen1, apoea/apoeb, cd2ap, clu, sorl1); “tyrosine metabolism” (psen2, apoea/apoeb, cd2ap, clu, sorl1); and “nitrogen metabolism” (appa/appb, psen1, psen2, apoea/apoeb, cd2ap).
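
      The tally behind this list amounts to counting, for each annotation, the number of knockouts in which it is significant. The sketch below does this for the four targets named above, using only the knockout lists given in this response. The variable names are hypothetical, and the real ZOLTAR output contains many more annotations per knockout than shown here.

```python
from collections import Counter

GLYCINE = "strychnine-binding glycine receptor"
NACHR = "neuronal acetylcholine receptor beta-2"
TPO = "thyroid peroxidase"
CA4 = "carbonic anhydrase IV"

# Significant targets per knockout, restricted to the four targets listed above.
significant_targets = {
    "appa/appb": {CA4},
    "psen1": {GLYCINE, NACHR, TPO, CA4},
    "psen2": {CA4},
    "apoea/apoeb": {GLYCINE, NACHR, TPO},
    "cd2ap": {NACHR, TPO, CA4},
    "clu": {GLYCINE, NACHR, TPO},
    "sorl1": {GLYCINE},
}

# Count in how many knockouts each target is significant.
counts = Counter(t for targets in significant_targets.values() for t in targets)

# Keep targets shared by at least three knockouts, most widely shared first.
shared = sorted(
    ((t, n) for t, n in counts.items() if n >= 3),
    key=lambda item: (-item[1], item[0]),
)
for target, n in shared:
    print(f"{target}: {n}/{len(significant_targets)} knockouts")
```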

      As a reminder, we hypothesised that loss of Sorl1 affected serotonin signalling based on the following annotations being significant: indication “depression”, target “serotonin transporter”, and KEGG pathway “serotonergic synapse”. Indication “depression” is only significant for sorl1 knockouts; target “serotonin transporter” is also significant for appa/appb and psen2 knockouts; and KEGG pathway “serotonergic synapse” is also significant for psen2 knockouts. ZOLTAR therefore does not predict serotonin signalling to be a major theme common to all mutants with a night-time sleep loss phenotype.

      Particularly interesting is cholinergic signalling appearing in the most common targets and KEGG pathways. Acetylcholine signalling is a major theme in research on AD. For example, the first four drugs ever approved by the FDA to treat AD were acetylcholinesterase inhibitors, which increase acetylcholine signalling by preventing its breakdown by acetylcholinesterase. These drugs are generally considered only to treat symptoms and not modify disease course, but this view has been called into question (Munoz-Torrero, 2008; Relkin, 2007). If, as ZOLTAR suggests, mutations in several Alzheimer’s risk genes affect cholinergic signalling early in development, this would point to a potential causal role of cholinergic disruption in AD.

      Author response image 1.

      Common predictions from ZOLTAR for the seven Alzheimer’s risk genes tested. Predictions from ZOLTAR which are shared by multiple knockout behavioural fingerprints presented in the study. Only indications, targets, and KEGG pathways which are significant for at least three of the seven knockouts tested are shown, ranked from the annotations which are significant for the largest number of knockouts.

      Finally, the web-based platform presented could be expanded to facilitate comparison of other behavioral phenotypes, including stimulus-evoked behaviors.

      Yes, absolutely. The behavioural dataset we used (Rihel et al., 2010) did not measure responses to stimuli other than day/night light transitions, but the “SauronX” platform and dataset (Myers-Turnbull et al., 2022) seems particularly well suited for this. To provide some context, we and collaborators have occasionally used the dataset by Rihel et al. (2010) to generate hypotheses or find candidate drugs that reverse a behavioural phenotype measured in the sleep/wake assay (Ashlin et al., 2018; Hoffman et al., 2016). The present work was the occasion to enable a wider and more intuitive use of this dataset through the ZOLTAR app, which has already proven successful. Future versions of ZOLTAR may seek to incorporate larger drug datasets using more types of measurements.

      Finally, the authors propose but do not test the hypothesis that sorl1 might regulate localization/surface expression of 5-HT2 receptors. This could provide exciting / more convincing mechanistic support for the assertion that serotonin signaling is disrupted upon loss of AD-associated genes.

      While working on the Author Response, we made some changes to the analysis run by ZOLTAR to calculate enrichments (see Methods and github.com/francoiskroll/ZOLTAR, notes on v2). With the new version, 5-HT receptor type 2 is not a significantly enriched target for the sorl1 knockout fingerprint but type 4 is. 5-HT receptor type 4 was also shown to interact with sorting nexin 27, a subunit of retromer, so is a promising candidate (Joubert et al., 2004). Antibodies against human 5-HT receptor type 2 and 4a exist; whether they would work in zebrafish remains to be tested. In our experience, the availability of antibodies suitable for immunohistochemistry in the zebrafish is a serious experimental roadblock.

      Note, all the results presented in the “Version of Record” are from ZOLTAR v2.

      Despite these important considerations, this study provides a valuable platform for high-throughput analysis of sleep phenotypes and correlation with small-molecule-induced sleep phenotypes.

      Strengths:

      - Provides a useful platform for comparison of sleep phenotypes across genotypes/drug manipulations.

      - Presents convincing evidence that night-time sleep is disrupted in mutants for multiple late onset AD-related genes.

      - Provides potential mechanistic insights for how AD-related genes might impact sleep and identifies a few drugs that modify their identified phenotypes

      Weaknesses:

      - Exploration of potential mechanisms for serotonin disruption in sorl1 mutants is limited.

      - The pipeline developed can only be used to examine sleep-related / spontaneous movement phenotypes and stimulus-evoked behaviors are not examined.

      - Comparisons between mutants/exploration of commonly affected pathways are limited.

      Thank you for these excellent suggestions, please see our answers above.

      Reviewer #2 (Public Review):

      Summary:

      This work delineates the larval zebrafish behavioral phenotypes caused by the F0 knockout of several important genes that increase the risk for Alzheimer's disease. Using behavioral pharmacology, comparing the behavioral fingerprint of previously assayed molecules to the newly generated knockout data, compounds were discovered that impacted larval movement in ways that suggest interaction with or recovery of disrupted mechanisms.

      Strengths:

      This is a well-written manuscript that uses newly developed analysis methods to present the findings in a clear, high-quality way. The addition of an extensive behavioral analysis pipeline is of value to the field of zebrafish neuroscience and will be particularly helpful for researchers who prefer the R programming language. Even the behavioral profiling of these AD risk genes, regardless of the pharmacology aspect, is an important contribution. The recovery of most behavioral parameters in the psen2 knockout with betamethasone, predicted by comparing fingerprints, is an exciting demonstration of the approach. The hypotheses generated by this work are important stepping stones to future studies uncovering the molecular basis of the proposed gene-drug interactions and discovering novel therapeutics to treat AD or co-occurring conditions such as sleep disturbance.

      Weaknesses:

      - The overarching concept of the work is that comparing behavioral fingerprints can align genes and molecules with similarly disrupted molecular pathways. While the recovery of the psen2 phenotypes by one molecule with the opposite phenotype is interesting, as are previous studies that show similar behaviorally-based recoveries, the underlying assumption that normalizing the larval movement normalizes the mechanism still lacks substantial support. There are many ways that a reduction in movement bouts could be returned to baseline that are unrelated to the root cause of the genetically driven phenotype. An ideal experiment would be to thoroughly characterize a mutant, such as by identifying a missing population of neurons, and use this approach to find a small molecule that rescues both behavior and the cellular phenotype. If the connection to serotonin in the sorl1 was more complete, for example, the overarching idea would be more compelling.

      Thank you for this cogent criticism.

      On the first point, we were careful not to claim that betamethasone normalises the molecular/cellular mechanism that causes the psen2 behavioural phenotype. Having said that, yes, to a certain extent that would be the hope of the approach. As you say, every compound which normalises the behavioural fingerprint will not normalise the underlying mechanism, but the opposite seems true: every compound that normalises the underlying mechanism should also normalise the behavioural fingerprint. We think this logic makes the “behaviour-first” approach innovative and interesting. The logic is to discover compounds that normalise the behavioural phenotype first, only subsequently test whether they also normalise the molecular mechanism, akin to testing first whether a drug resolves the symptoms before testing whether it actually modifies disease course. While in practice testing thousands of drugs in sufficient sample sizes and replicates on a mutant line is challenging, the dataset queried through ZOLTAR provides a potential shortcut by shortlisting in silico compounds that have the opposite effect on behaviour.
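
      The in silico shortlisting step can be pictured as ranking compound fingerprints by their similarity to the knockout fingerprint and keeping those that point in the opposite direction. The sketch below uses cosine similarity on made-up six-parameter fingerprints. The similarity measure, the compound names and all values are assumptions for illustration and should not be read as a description of ZOLTAR's internals.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length fingerprints."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Hypothetical fingerprints: one deviation from controls per behavioural parameter.
knockout = [-1.2, -0.8, 0.1, -0.5, 0.3, -0.9]
compounds = {
    "compound_A": [1.0, 0.9, -0.2, 0.4, -0.1, 1.1],    # roughly the opposite of the knockout
    "compound_B": [-1.0, -0.7, 0.0, -0.6, 0.2, -1.0],  # mimics the knockout
    "compound_C": [0.3, -0.4, 0.8, 0.1, -0.6, 0.2],    # largely unrelated
}

# Most negative similarity first: these are the candidates for behavioural rescue.
ranked = sorted(compounds, key=lambda name: cosine(knockout, compounds[name]))
for name in ranked:
    print(name, round(cosine(knockout, compounds[name]), 2))
```

      Compounds at the top of this ranking would be shortlisted for testing on the mutant, while those at the bottom mimic the knockout and are more useful for generating hypotheses about the disrupted pathway.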

      You mention a “reduction in movement bouts” but note here that the number of behavioural parameters tested is key to our argument. To take the two extremes, say the only behavioural parameter we measured in psen2 knockout larvae was time active during the day, then, yes, any stimulant used at the right concentration could probably normalise the phenotype. In this situation, claiming that the stimulant is likely to also normalise the underlying mechanism, or even that it is a genuine “phenotypic rescue”, would not be convincing. Conversely, say we were measuring thousands of behavioural parameters under various stimuli, such as swimming speed, position in the well, bout usage, tail movements, and eye angles, it seems almost impossible for a compound to rescue most parameters without also normalising the underlying mechanism. The present approach is somewhere in between: ZOLTAR uses six behavioural parameters for prediction (e.g. Fig 6a), but all 17 parameters calculated by FramebyFrame can be used to assess rescue during a subsequent experiment (Fig. 6c). For both, splitting each parameter into day and night increases the resolution of the approach, which partly answers your criticism. For example, betamethasone rescued the day-time hypoactivity without causing night-time hyperactivity, so we are not making the “straw man argument” explained above of using any broad stimulant to rescue the hypoactivity phenotype.

      Furthermore, for diseases where the behavioural defect is the primary concern, such as autism or bipolar disorder, perhaps this behaviour-first approach is all that is needed, and whether or not the compound precisely rescues the underlying mechanism is somewhat secondary. The use of lithium to prevent manic episodes in bipolar disorder is a good example. It was initially tested because mania was thought to be caused by excess uric acid and lithium can dissolve uric acid (Mitchell and Hadzi-Pavlovic, 2000). The theory is now discredited, but lithium continues to be used without a precise understanding of its mode of action. In this example, behavioural rescue alone, assuming the secondary effects are tolerable, is sufficient to be beneficial to patients, and whether it modulates the correct causal pathway is secondary.

      On the second point, we agree that first testing ZOLTAR on a mutant for which we have a fairly good understanding of the mechanism causing the behavioural phenotype could have been a productive approach. Note, however, that examples already exist in the literature (Ashlin et al., 2018; Hoffman et al., 2016). The example from Hoffman et al. (2016) is especially convincing. Drugs generating behavioural fingerprints that positively correlate with the cntnap2a/cntnap2b double knockout fingerprint were enriched with NMDA and GABA receptor antagonists. In experiments analogous to our citalopram and fluvoxamine treatments (Fig. 5c,d and Fig. 5–supplement 1c,d), cntnap2a/cntnap2b knockout larvae were overly sensitive to the NMDA receptor antagonist MK-801 and the GABAA receptor antagonist pentylenetetrazol (PTZ). Among other drugs tested, zolpidem, a GABAA receptor agonist, caused opposite effects on wild-type and cntnap2a/cntnap2b knockout larvae. Knockout larvae were found to have fewer GABAergic neurons in the forebrain. While these studies did not use precisely the same analysis that ZOLTAR runs, they used the same rationale and behavioural dataset to make these predictions (Rihel et al., 2010), which shows that approaches like ZOLTAR can point to causal processes.

      On your last point, we hope our experiment testing fluvoxamine, another selective serotonin reuptake inhibitor (SSRI), makes the connection between Sorl1 and serotonin signalling more convincing.

      - The behavioral difference between the sorl1 KO and scrambled at the higher dose of the citalopram is based on a small number of animals. The KO Euclidean distance measure is also more spread out than for the other datasets, and it looks like only five or so fish are driving the group difference. It also appears as though the numbers were also from two injection series. While there is nothing obviously wrong with the data, I would feel more comfortable if such a strong statement of a result from a relatively subtle phenotype were backed up by a higher N or a stable line. It is not impossible that the observed difference is an experimental fluke. If something obvious had emerged through the HCR, that would have also supported the conclusions. As it stands, if no more experiments are done to bolster the claim, the confidence in the strength of the link to serotonin should be reduced (possibly putting the entire section in the supplement and modifying the discussion). The discussion section about serotonin and AD is interesting, but I think that it is excessive without additional evidence.

      We mostly agree with this criticism. One could interpret the larger spread of the data for sorl1 KO larvae treated with 10 µM citalopram as evidence that the knockout larvae do indeed react differently to the drug at this dose, regardless of being driven by a subset of the animals. The result indeed does not survive removing the top 5 (p = 0.87) or top 3 (p = 0.18) sorl1 KO + 10 µM larvae, but this amounts to excluding roughly 35% (5/14) or 20% (3/14) of the datapoints as potential outliers, which is unreasonable. In fact, excluding the top 5 sorl1 KO + 10 µM is equivalent to calling any datapoint with z-score > 0.2 an outlier (z-scores of the top 5 datapoints are 0.2–1.8). Applying the same criterion consistently to the scrambled + 10 µM group would remove the top 6 datapoints (z-scores = 0.5–3.9). Comparing the resulting two distributions again gives the sorl1 KO + 10 µM distribution as significantly higher (p = 0.0015). We would also mention that Euclidean distance, as a summary metric for distance between behavioural fingerprints, has limitations. For example, the measure will be more sensitive to changes in some parameters but not others, depending on how much room there is for a given parameter to change. We included this metric to lend support to the observation one can draw from the fingerprint plot (Fig. 5c) that sorl1 mutants respond in an exaggerated way to citalopram across many parameters, while being agnostic to which parameter might matter most.
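
      The z-score argument above can be reproduced on any pair of groups with a few lines of code. The sketch below uses invented Euclidean distances for a group of 14 larvae, not the data from Fig. 5, and assumes z-scores are computed within each group using the sample standard deviation. The function names are hypothetical.

```python
from statistics import mean, stdev

def z_scores(values):
    """Within-group z-scores using the sample standard deviation."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd for v in values]

def drop_above(values, z_cutoff):
    """Remove datapoints whose within-group z-score exceeds z_cutoff."""
    return [v for v, z in zip(values, z_scores(values)) if z <= z_cutoff]

# Invented Euclidean distances for 14 larvae (not the values from the study).
group = [2.1, 2.3, 2.4, 2.6, 2.7, 2.9, 3.0, 3.1, 3.3, 4.0, 4.4, 4.9, 5.6, 6.2]

kept = drop_above(group, z_cutoff=0.2)
print(f"removed {len(group) - len(kept)} of {len(group)} datapoints")
```

      In this invented group, a cutoff of 0.2 discards 5 of 14 datapoints, which illustrates how permissive such a criterion is. Applying drop_above with the same cutoff to both groups before comparing them corresponds to the consistency check described above.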

      Given that the HCR did not reveal anything striking, we agree with you that too much of our argument relied on this result being robust. As you and Reviewer #3 suggested, we repeated this experiment with a different SSRI, fluvoxamine (Fig. 5–supplement 1). We cannot readily explain why the result was opposite to what we found with citalopram, but in both cases sorl1 knockout larvae reacted differently than their control siblings, which adds an argument to our claim that ZOLTAR correctly predicted serotonin signalling as a disrupted pathway from the behavioural fingerprint. Accordingly, we mostly kept the Discussion on Sorl1 the same, although we concede that we may not have identified the molecular mechanism.

      - The authors suggest two hypotheses for the behavioral difference between the sorl1 KO and scrambled at the higher dose of the citalopram. While the first is tested, and found to not be supported, the second is not tested at all ("Ruling out the first hypothesis, sorl1 knockouts may react excessively to a given spike in serotonin." and "Second, sorl1 knockouts may be overly sensitive to serotonin itself because post-synaptic neurons have higher levels of serotonin receptors."). Assuming that the finding is robust, there are probably other reasons why the mutants could have a different sensitivity to this molecule. However, if this particular one is going to be mentioned, it is surprising that it was not tested alongside the first hypothesis. This work could proceed without a complete explanation, but additional discussion of the possibilities would be helpful or why the second hypothesis was not tested.

      There are no strong scientific reasons why this hypothesis was not tested. The lead author (F Kroll) moved to a different lab and country so the project was finalised at that time. We do not plan on testing this hypothesis at this stage. However, we adapted the wording to make it clear this is one possible alternative hypothesis which could be tested in the future. The small differences found by HCR are actually more in line with the new results from the fluvoxamine experiment, so it may also be that both hypotheses (pre-synaptic neurons releasing less serotonin when reuptake is blocked; or post-synaptic neurons being less sensitive) contribute. The fluvoxamine experiment was performed in a different lab (ICM, Paris; all other experiments were done in UCL, London) in a different wild-type strain (TL in ICM, AB x Tup LF in UCL), which complicates how one interprets this discrepancy.

      - The authors claim that "all four genes produced a fairly consistent phenotype at night". While it is interesting that this result arose in the different lines, the second clutch for some genes did not replicate as well as others. I think the findings are compelling, regardless, but the sometimes missing replicability should be discussed. I wonder if the F0 strategy adds noise to the results and if clean null lines would yield stronger phenotypes. Please discuss this possibility, or others, in regard to the variability in some phenotypes.

      For the first part of this point, please see below our answer to Reviewer #3, point (2) c.

      Regarding the F0 strategy potentially adding variability, it is an interesting question which we tested in a larger dataset of behavioural recordings from F0 and stable knockouts for the same genes (unpublished). In summary, the F0 knockout method does not increase clutch-to-clutch or larva-to-larva variability in the assay. F0 knockout experiments found many more significant parameters and larger effect sizes than stable knockout experiments, but this difference could largely be explained by the larger sample sizes of F0 knockout experiments. In fact, larger sample sizes within individual clutches appear to be a major advantage of the F0 knockout approach over in-cross of heterozygous knockout animals as it increases sensitivity of the assay without causing substantial variability. We plan to report in more detail on this analysis in a separate paper as we think it would dilute the focus of the present work.

      - In this work, the knockout of appa/appb is included. While APP is a well-known risk gene, there is no clear justification for making a knockout model. It is well known that the upregulation of app is the driver of Alzheimer's, not downregulation. The authors even indicate an expectation that it could be similar to the other knockouts ("Moreover, the behavioural phenotypes of appa/appb and psen1 knockout larvae had little overlap while they presumably both resulted in the loss of Aβ." and "Comparing with early-onset genes, psen1 knockouts had similar night-time phenotypes, but loss of psen2 or appa/appb had no effect on night-time sleep."). There is no reason to expect similarity between appa/appb and psen1/2. I understand that the app knockouts could unveil interesting early neurodevelopmental roles, but the manuscript needs to be clarified that any findings could be the opposite of expectation in AD.

      On “there is no reason to expect similarity […]”, we disagree. Knockout of appa/appb and knockout of psen1 will both result in loss of Aβ (appa/appb encode Aβ and psen1 cleaves Appa/Appb to release Aβ, cf. Fig. 3e). Consequently, a phenotype caused by the loss of Aβ, or possibly other Appa/Appb cleavage products, should logically be found in both appa/appb and psen1 knockouts.

      On “it is well known that the upregulation of APP is the driver of Alzheimer’s, not downregulation”; we of course agree. Among others, the examples of Down syndrome, APP duplication (Sleegers et al., 2006), or mouse models overexpressing human APP show definitely that overexpression of APP is sufficient to cause AD. Having said that, we would not be so quick in dismissing APP knockout as potentially relevant to understanding of AD.

      Loss of soluble Aβ due to aggregation could contribute to pathology (Espay et al., 2023). Without getting too much into this intricate debate, links between levels of Aβ and risk of disease are often counter-intuitive too. For example, out of 138 PSEN1 mutations screened in vitro, 104 reduced total Aβ production and 11 even seemingly abolished the production of both Aβ40 and Aβ42 (Sun et al., 2017). In short, loss of soluble Aβ occurs in both AD and in our appa/appb knockout larvae.

      We added a sentence in Results (section psen2 knockouts […]) to briefly justify our appa/appb knockout approach. To be clear, we do not want to imply, for example, that the absence of a night-time sleep phenotype for appa/appb is contradictory to the body of literature showing links between Aβ and sleep, including in zebrafish (Özcan et al., 2020). As you say, our experiment tested loss of App, including Aβ, while the literature typically reports on overexpression of APP, as in APP/PSEN1-overexpressing mice (Jagirdar et al., 2021).

      Reviewer #3 (Public Review):

      In this manuscript by Kroll and colleagues, the authors describe combining behavioral pharmacology with sleep profiling to predict disease and potential treatment pathways at play in AD. AD is used here as a case study, but the approaches detailed can be used for other genetic screens related to normal or pathological states for which sleep/arousal is relevant. The data are for the most part convincing, although generally the phenotypes are relatively small and there are no major new mechanistic insights. Nonetheless, the approaches are certainly of broad interest and the data are comprehensive and detailed. A notable weakness is the introduction, which overly generalizes numerous concepts and fails to provide the necessary background to set the stage for the data.

      Major points

      (1) The authors should spend more time explaining what they see as the meaning of the large number of behavioral parameters assayed and specifically what they tell readers about the biology of the animal. Many are hard to understand--e.g. a "slope" parameter.

      We agree that some parameters do not tell something intuitive about the biology of the animal. It would be easy to speculate. For example, the “activity slope” parameter may indicate how quickly the animal becomes tired over the course of the day. On the other hand, fractal dimension describes the “roughness/smoothness” of the larva’s activity trace (Fig. 2–supplement 1a); but it is not obvious how to translate this into information about the physiology of the animal. We do not see this as an issue though. While some parameters do provide intuitive information about the animal’s behaviour (e.g. sleep duration or sunset startle as a measure of startle response), the benefit of having a large number of behavioural parameters is to compare behavioural fingerprints and assess rescue of the behavioural phenotype by small molecules (Fig. 6c). For this purpose, the more parameters the better. The “MoSeq” approach from Wiltschko et al., 2020 is a good example from literature that inspired our own Fig. 6c. While some of the “behavioural syllables” may be intuitive (e.g. running or grooming), it is probably pointless to try to explain the ‘meaning’ of the “small left turn in place with head motion” syllable (Wiltschko et al., 2020). Nonetheless, this syllable was useful to assess whether a drug specifically treats the behavioural phenotype under study without causing too many side effects. Unfortunately, ZOLTAR has to reduce the FramebyFrame fingerprint (17 parameters) to just six parameters to compare it to the behavioural dataset from Rihel et al., 2010, but here, more parameters would almost certainly translate into better predictions too, regardless of their intuitiveness.

      It is true however that we did not give much information on how some of the less intuitive parameters, such as activity slope or fractal dimension, are calculated or what they describe about the dataset (e.g. roughness/smoothness for fractal dimension). We added a few sentences in the legend of Fig. 2–supplement 1.

      (2) Because in the end the authors did not screen that many lines, it would increase confidence in the phenotypes to provide more validation of KO specificity. Some suggestions include:

      a. The authors cite a psen1 and psen2 germline mutant lines. Can these be tested in the FramebyFrame R analysis? Do they phenocopy F0 KO larvae?

      We unfortunately do not have those lines. We investigated the availability of importing a psen2 knockout line from abroad, but the process of shipping live animals is becoming more and more cost and time prohibitive. However, we observed the same pigmentation phenotype for psen2 knockouts as reported by Jiang et al., 2018, which is at least a partial confirmation of phenocopying a loss of function stable mutant.  

      b. psen2_KO is one of the larger centerpieces of the paper. The authors should present more compelling evidence that animals are truly functionally null. Without this, how do we interpret their phenotypes?

      We disagree that there should be significant doubt about these mutants being truly functionally null, given the high mutation rate and presence of the expected pigmentation phenotype (Jiang et al., 2018, Fig. 3f and Fig. 3–supplement 3a). The psen2 F0 knockouts were virtually 100% mutated at three exons across the gene (mutation rates were locus 1: 100 ± 0%; locus 2: 99.99 ± 0.06%; locus 3: 99.85 ± 0.24%). Additionally, two of the three mutated exons had particularly high rates of frameshift mutations (locus 1: 97 ± 5%; locus 2: 88 ± 17% frameshift mutation rate). It is virtually impossible that a functional protein is translated given this burden of frameshift mutations. Phenotypically, in addition to the pigmentation defect, double psen1/psen2 F0 knockout larvae had curved tails, the same phenotype as caused by a high dose of the γ-secretase inhibitor DAPT (Yang et al., 2008). These double F0 knockouts were lethal, while knockout of psen1 or psen2 alone did not cause obvious morphological defects. Evidently, most larvae must have been psen2 null mutants in this experiment, otherwise functional Psen2 would have prevented early lethality.

      Translation of zebrafish psen2 can start at downstream start codons if the first exon has a frameshift mutation, generating a seemingly functional Psen2 missing the N-terminus (Jiang et al., 2020). Zebrafish homozygous for this early frameshift mutation had normal pigmentation, showing it is a reliable marker of Psen2 function even when it is mutated. This mechanism is not a concern here as the alternative start codons are still upstream of two of the three mutated exons (the alternative start codons discovered by Jiang et al., 2020 are in exon 2 and 3, but we targeted exon 3, exon 4, and exon 6).

      We understand that the zebrafish community may be cautious about F0 phenotyping compared to stably generated mutants. As mentioned to Reviewer #2, we are planning to assemble a paper that expressly compares behavioural phenotypes measured in F0 vs. stable mutants to allay some of these concerns. Our current manuscript, which combines CRISPR-Cas9 rapid F0 screening with in silico pharmacological predictions, inevitably represents a first step in characterizing the functions of these genes.

      c. Related to the above, for cd2AP and sorl1 KO, some of the effect sizes seem to be driven by one clutch and not the other. In other words, great clutch-to-clutch variability. Should the authors increase the number of clutches assayed?

      Correct, there is substantial clutch-to-clutch variability in this behavioural assay. This is not specific to our experiments. Even within the same strain, wild-type larvae from different clutches (i.e. non-siblings) behave differently (Joo et al., 2021). This is why it is essential to compare behavioural phenotypes within individual clutches (i.e. from a single pair of parents, one male and one female), as we explain in Methods (section Behavioural video-tracking) and in the documentation of the FramebyFrame package. We often see two different experimental designs in literature: comparing non-sibling wild-type and mutant larvae, or pooling different clutches which include all genotypes (e.g. pooling multiple clutches from heterozygous in-crosses or pooling wild-type clutches before injecting them). The first experimental design causes false positive findings (Joo et al., 2021), as the clutch-to-clutch variability we and others observe gets interpreted as a behavioural phenotype. The second experimental design should not cause false positives but likely decreases the sensitivity of the assay by increasing the spread within genotypes. In both cases, the clutch-to-clutch variability is hidden, either by interpreting it as a phenotype (first case) or by adding it to animal-to-animal variability (second case). Our experimental design is technically more challenging as it requires obtaining large clutches from unique pairs of parents. However, this approach is better as it clearly separates the different sources of variability (clutch-to-clutch or animal-to-animal). As for every experiment, yes, a larger number of replicates would be better, but we do not plan to assay additional clutches at this time. Our work heavily focuses on the sorl1 and psen2 knockout behavioural phenotypes. The key aspects of these phenotypes were effectively tested in four experiments (five to six clutches) as sorl1 knockout larvae were also tracked in the citalopram and fluvoxamine experiments (Fig. 5 and Fig. 5–supplement 1), and psen2 knockout larvae were also tracked in the small molecule rescue experiment (Fig. 6 and Fig. 6–supplement 1).

      The psen2 behavioural phenotype replicated well across the six clutches tested (pairwise cosine similarities: 0.62 ± 0.15; Author response image 2a). 5/6 clutches were less active and initiating more sleep bouts during the day, as we claimed in Fig. 3.

      In the citalopram experiment, the H₂O-treated sorl1 knockout fingerprint replicated fairly well the baseline recordings in Fig. 4, despite the smaller sample size (cos = 0.30 and 0.78; Author response image 2b, see “KO Fig. 5”). 5/6 of the significant parameters presented in Fig. 4–supplement 4 moved in the same direction, and knockout larvae were also hypoactive during the day but hyperactive at night. Note that two clutches were tracked on the same 96-well plate in this experiment. We calculated each larva’s z-score using the average of its control siblings, then we averaged all the z-scores to generate the fingerprint. The H₂O-treated sorl1 knockout clutch from the fluvoxamine experiment did not replicate well the baseline recordings (cos = 0.08 and 0.11; Author response image 2b, see “KO Fig. 5–suppl. 1”). Knockout larvae were hypoactive during the day as expected, but behaviour at night was not as robustly affected. As mentioned above, knockouts were made in a different genetic background (TL, instead of AB x Tup LF used for all other experiments), which could explain the discrepancy.
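      To make the fingerprint and cosine comparison described above concrete, here is a minimal Python sketch. It is not the FramebyFrame code. The function names are hypothetical, and scaling by the standard deviation of the control siblings is an assumption (the text only states that z-scores use the average of the control siblings).

```python
import numpy as np

def fingerprint(ko, controls):
    """Mean z-score per parameter for knockout larvae of ONE clutch.

    ko, controls: arrays of shape (n_larvae, n_parameters).
    Each knockout larva is z-scored against its control siblings,
    then z-scores are averaged across larvae.
    """
    ko = np.asarray(ko, dtype=float)
    controls = np.asarray(controls, dtype=float)
    ctrl_mean = controls.mean(axis=0)
    # Assumption: the z-score denominator is the SD of the control siblings.
    ctrl_sd = controls.std(axis=0, ddof=1)
    z = (ko - ctrl_mean) / ctrl_sd
    return z.mean(axis=0)

def cosine_similarity(a, b):
    """Cosine of the angle between two fingerprints (range -1 to 1)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

      Because each clutch is normalised to its own siblings before any comparison, clutch-to-clutch differences in baseline behaviour do not leak into the fingerprint; the cosine then compares only the direction of the phenotype, not its magnitude.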

      We also took the opportunity to check whether our SSRI treatments replicated well the data from Rihel et al., 2010. For both citalopram (n = 3 fingerprints in the database) and fluvoxamine (n = 4 fingerprints in the database), replication was excellent (cos ≥ 0.67 for all comparisons of a fingerprint from this study vs. a fingerprint from Rihel et al. 2010; Author response image 2c,d). Note that the scrambled + 10 µM citalopram and + 10 µM fluvoxamine fingerprints correlate extremely well (cos = 0.92; can be seen in Author response image 2c,d), which was predicted by the small molecule screen dataset.

      Author response image 2.

      Replication of psen2 and sorl1 F0 knockout fingerprints and SSRI treatments from Rihel et al., 2010. a, (left) Every psen2 F0 knockout behavioural fingerprint generated in this study. Each dot represents the mean deviation from the same-clutch scrambled-injected mean for that parameter (z-score, mean ± SEM). From the experiments in Fig. 6, presented are the psen2 F0 knockout + H₂O fingerprints. The fingerprints in grey (“not shown”) are from a preliminary drug treatment experiment we did not include in the final study. These fingerprints are from psen2 F0 knockout larvae treated with 0.2% DMSO, normalised to scrambled-injected siblings also treated with 0.2% DMSO. (right) Pairwise cosine similarities (−1.0–1.0) for the fingerprints presented. b, Every sorl1 F0 knockout behavioural fingerprint, as in a). c, The scrambled-injected + citalopram (10 µM) fingerprints (grey) in comparison to the citalopram (10–15 µM) fingerprints from the Rihel et al., 2010 database (green). d, The scrambled-injected + fluvoxamine (10 µM) fingerprint (grey) in comparison to the fluvoxamine fingerprints from the Rihel et al., 2010 database (pink). In c) and d), the scrambled-injected fingerprints are from the experiments in Fig. 5 and Fig. 5–suppl. 1, but were converted here into the behavioural parameters used by Rihel et al., 2010 for comparison. Parameters: 1, average activity (sec active/min); 2, average waking activity (sec active/min, excluding inactive minutes); 3, total sleep (hr); 4, number of sleep bouts; 5, sleep bout length (min); 6, sleep latency (min until first sleep bout).

      (3) The authors make the point that most of the AD risk genes are expressed in fish during development. Is there public data to comment on whether the genes of interest are expressed in mature/old fish as well? Just because the genes are expressed early does not at all mean that early-life dysfunction is related to future AD (though this could be the case, of course). Genes with exclusive developmental expression would be strong candidates for such an early-life role, however. I presume the case is made because sleep studies are mainly done in juvenile fish, but I think it is really a pretty minor point and such a strong claim does not even need to be made.

      This is a fair criticism but we do not make this claim (“early-life dysfunction is related to future AD”) from expression alone. The reviewer is probably referring to the following quote:

      “[…] most of these were expressed in the brain of 5–6-dpf zebrafish larvae, suggesting they play a role in early brain development or function,” which does not mention future risk of AD. We do suggest that these genes have a function in development. After all, every gene that plays a role in brain development must be expressed during development, so this wording seemed reasonable. Nevertheless, we adapted the wording to address this point and Reviewer #2’s complaint below. As noted, the primary goal was to check that the genes we selected were indeed expressed in zebrafish larvae before performing knockout experiments. Our discussion does raise the hypothesis that mutations in Alzheimer’s risk genes impact brain development and sleep early in life, but this argument primarily relies on our observation that knockout of late-onset Alzheimer’s risk genes causes sleep phenotypes in 7-day old zebrafish larvae and from previous work showing brain structural differences in children at high genetic risk of AD (Dean et al., 2014; Quiroz et al., 2015), not solely on gene expression early in life.

      Please also see our answer to a similar point raised by Reviewer #2 below (cf. Author response image 7).

      (4) A common quandary with defining sleep behaviorally is how to rectify sleep and activity changes that influence one another. With psen2 KOs, the authors describe reduced activity and increased sleep during the day. But how do we know if the reduced activity drives increased behavioral quiescence that is incorrectly defined as sleep? In instances where sleep is increased but activity during periods during wake are normal or elevated, this is not an issue. But here, the animals might very well be unhealthy, and less active, so naturally they stop moving more for prolonged periods, but the main conclusion is not sleep per se. This is an area where more experiments should be added if the authors do not wish to change/temper the conclusions they draw. Are psen2 KOs responsive to startling stimuli like controls when awake? Do they respond normally when quiescent? Great care must be taken in all models using inactivity as a proxy for sleep, and it can harm the field when there is no acknowledgment that overall health/activity changes could be a confound. Particularly worrisome is the betamethasone data in Figure 6, where activity and sleep are once again coordinately modified by the drug.

      This is a fair criticism. We agree it is a concern, especially in the case of psen2 as we claim that day-time sleep is increased while zebrafish are diurnal. We do not rely heavily on the day-time inactivity being sleep (the ZOLTAR predictions or the small molecule rescue do not change whether the parameter is called sleep or inactivity), but our choice of labelling can fairly be challenged.

      To address “are psen2 KO responsive to startling stimuli like controls when awake/when quiescent”, we looked at the larvae’s behaviour immediately after lights abruptly switched on in the mornings. Almost every larva, regardless of genotype, responded strongly to every lights-off transition during the experiment. Instead, we chose the lights-on transition for this analysis because it is a weaker startling stimulus for the larvae than the lights-off transition (Fig. 3–supplement 3), potentially exposing differences between genotypes or behavioural states (quiescent or awake). We defined a larva as having reacted to the lights switching on if it made a swimming bout during the second (25 frames) after the lights-on transition. Across two clutches and two lights-on transitions, an average of 65% (range 52–73%) of all larvae reacted to the stimulus. psen2 knockout larvae were similarly likely, if not more likely, to respond (on average 69% responded, range 60–76%) than controls (60% average, range 44–75%). When the lights switched on, about half of the larvae (39–51%) would have been classified as asleep according to the one-minute inactivity definition (i.e. the larva did not move in the minute preceding the lights transition). This allowed us to also compare behavioural states, as suggested by the reviewer. For three of the four light transitions, larvae which were awake when lights switched on were more likely to react than asleep larvae, but this difference was not striking (overall, awake larvae were only 1.1× more likely to react; Author response image 3). Awake psen2 knockout larvae were 1.1× (range 1.04–1.11×) more likely to react than awake control larvae, so, yes, psen2 knockout larvae respond normally when awake. Asleep psen2 knockout larvae were 1.4× (range 0.63–2.19×) more likely to react than asleep control larvae, so psen2 knockouts are also more or equally likely to react than control larvae when asleep. In summary, the overall health of psen2 knockouts did not seem to be a significant confound in the experiment. As the reviewer suggested, if psen2 knockout larvae were seriously unhealthy, they would not be as responsive as control larvae to a startling stimulus.
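      The classification used in this analysis can be sketched in a few lines of Python. This is an illustration only, not the code used for the analysis: the function names are hypothetical, the input is assumed to be a per-frame movement trace for one larva at 25 frames per second, and any non-zero frame in the second after lights-on is treated as a swimming bout, which is a simplification.

```python
import numpy as np

FPS = 25  # frames per second, as stated in the text

def classify_at_transition(activity, t_on, fps=FPS):
    """Classify one larva at a lights-on transition.

    activity: 1-D per-frame movement trace (0 = no movement).
    t_on: index of the first frame after lights switch on.
    Returns (state, reacted): state is "asleep" if the larva did not
    move during the preceding minute, otherwise "awake"; reacted is
    True if it moved during the second following lights-on.
    """
    activity = np.asarray(activity)
    if t_on < 60 * fps:
        raise ValueError("need at least one minute of data before t_on")
    before = activity[t_on - 60 * fps : t_on]
    after = activity[t_on : t_on + fps]
    state = "awake" if before.any() else "asleep"
    return state, bool(after.any())

def fraction_reacting(results, state):
    """Fraction of larvae in a given state that reacted.

    results: list of (state, reacted) tuples, one per larva.
    """
    hits = [reacted for s, reacted in results if s == state]
    return sum(hits) / len(hits) if hits else float("nan")
```

      Dividing `fraction_reacting` for knockouts by the same value for controls, within one state and one transition, gives fold changes of the kind quoted above.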

      Author response image 3.

      psen2 F0 knockouts react normally to lights switching on, indicating they are largely healthy. At each lights-on transition (9 AM), each larva was categorised as awake if it had moved in the preceding one minute or asleep if it had been inactive for at least one minute. Darker tiles represent larvae which performed a swimming bout during the second following lights-on; lighter tiles represent larvae which did not move during that second. The total count of each waffle plot was normalised to 25 so plots can be compared to each other. The real count is indicated in the corner of each plot. Data is from the baseline psen2 knockout trackings presented in Fig. 3 and Fig. 3–suppl. 2.

      Next, we compared inactive period durations during the day between psen2 and control larvae. If psen2 knockout larvae indeed sleep more during the day compared to controls, we may predict inactive periods longer than one minute to increase disproportionately compared to the increase in shorter inactive periods. This broadly appeared to be the case, especially for one of the two clutches (Author response image 4). In clutch 1, inactive periods lasting 1–60 sec were equally frequent in both psen2 and control larvae (fold change 1.0× during both days), while inactive periods lasting 1–2 min were 1.5× (day 1) and 2.5× (day 2) more frequent in psen2 larvae compared to control larvae. In clutch 2, 1–60 sec inactive periods were also equally frequent in both psen2 and control larvae, while inactive periods lasting 1–2 min were 3.4× (day 1) and 1.5× (day 2) more frequent in psen2 larvae compared to control larvae. Therefore, psen2 knockouts disproportionately increased the frequency of inactive periods longer than one minute, suggesting they genuinely slept more during the day.
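      As a rough illustration of this comparison, the Python sketch below extracts inactive period durations from a per-frame movement trace and computes the fold change in their frequency between genotypes. It is not the analysis code behind Author response image 4: the function names and input format are assumptions, and averaging per-larva frequencies is one plausible reading of "mean distribution".

```python
import numpy as np

def inactive_bout_lengths(activity, fps=25):
    """Durations (sec) of consecutive runs of inactive frames."""
    inactive = np.asarray(activity) == 0
    # Pad with False so runs touching either end are still delimited.
    padded = np.concatenate(([False], inactive, [False]))
    edges = np.diff(padded.astype(int))
    starts = np.flatnonzero(edges == 1)
    stops = np.flatnonzero(edges == -1)
    return (stops - starts) / fps

def fold_change(ko_larvae, ctrl_larvae, lo, hi, fps=25):
    """Knockout/control ratio of the frequency of bouts with lo <= length < hi.

    ko_larvae, ctrl_larvae: lists of per-frame traces, one per larva.
    Bouts shorter than 1 sec are excluded before computing frequencies.
    """
    def mean_freq(larvae):
        freqs = []
        for trace in larvae:
            d = inactive_bout_lengths(trace, fps)
            d = d[d >= 1.0]
            freqs.append(np.mean((d >= lo) & (d < hi)) if d.size else 0.0)
        return np.mean(freqs)
    return mean_freq(ko_larvae) / mean_freq(ctrl_larvae)
```

      For example, `fold_change(ko, ctrl, 60, 120)` would correspond to the 1–2 min bin discussed above. Bins that are empty in controls give an infinite ratio, so sparse long-bout bins should be read with caution.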

      Author response image 4.

      psen2 F0 knockouts increased preferentially the frequency of longer inactive bouts. For each day and clutch, we calculated the mean distribution of inactive bout lengths across larvae of same genotype (psen2 F0 knockout or scrambled-injected), then compared the frequency of inactive bouts of different lengths between the two genotypes. For example, in clutch 1 during day 2, 0.01% of the average scrambled-injected larva’s inactive bouts lasted 111–120 seconds (X axis 120 sec) while 0.05% of the average psen2 F0 knockout larva’s inactive bouts lasted this long, so the fold change was 5×. Inactive bouts lasting < 1 sec were excluded from the analysis. In clutch 2, day 1 plot, two datapoints fall outside the Y axis limit: 140 sec, Y = 32×; 170 sec, Y = 16×. Data is from the baseline psen2 knockout trackings presented in Fig. 3 and Fig. 3–suppl. 2.

      Ultimately, this criticism seems challenging to definitely address experimentally. A possible approach could be to use a closed-loop system which, after one minute of inactivity, triggers a stimulus that is sufficient to startle an awake larva but not an asleep larva. If psen2 knockout larvae indeed sleep more during the day, the stimulus should usually not be sufficient to startle them. Nevertheless, we believe the two analyses presented here are consistent with psen2 knockout larvae genuinely sleeping more during the day, so we decided to keep this label. We agree with the reviewer that the one-minute inactivity definition has limitations, especially for day-time inactivity.

      (5) The conclusions for the serotonin section are overstated. Behavioural pharmacology purports to predict a signaling pathway disrupted with sorl1 KO. But is it not just possible that the drug acts in parallel to the true disrupted pathway in these fish? There is no direct evidence for serotonin dysfunction - that conclusion is based on response to the drug. Moreover, it is just one drug - is the same phenotype present with another SSRI? Likewise, language should be toned down in the discussion, as this hypothesis is not "confirmed" by the results (consider "supported"). The lack of measured serotonin differences further raises concern that this is not the true pathway. This is another major point that deserves further experimental evidence, because without it, the entire approach (behavioral pharm screen) seems more shaky as a way to identify mechanisms. There are any number of testable hypotheses to pursue such as a) Using transient transgenesis to visualize 5HT neuron morphology (is development perturbed: cell number, neurite morphology, synapse formation); b) Using transgenic Ca reporters to assay 5HT neuron activity.

      Regarding the comment, “is it not just possible that the drug acts in parallel to the true disrupted pathway”, we think no, assuming we understand correctly the question. Key to our argument is the fact that sorl1 knockout larvae react differently to the drug(s) than control larvae. As an example, take night-time sleep bout length, which was not affected by knockout of sorl1 (Fig. 4–supplement 4). For the sake of the argument, say only dopamine signalling (the “true disrupted pathway”) was affected in sorl1 knockouts and that serotonin signalling was intact. Assuming that citalopram specifically alters serotonin signalling, then treatment should cause the same increase in sleep bout length in both knockouts and controls as serotonin signalling is intact in both. This is not what we see, however. Citalopram caused a greater increase in sleep bout length in sorl1 knockouts than in scrambled-injected larvae. In other words, the effect is non-additive, in the sense that citalopram did not add the same number of z-scores to sorl1 knockouts or controls. We think this shows that serotonin signalling is somehow different in sorl1 knockouts. Nonetheless, we concede that the experiment does not necessarily say much about the importance of the serotonin disruption caused by loss of Sorl1. It could be, for example, that the most salient consequence of loss of Sorl1 is cholinergic disruption (see reply to Reviewer #1 above) and that serotonin signalling is a minor theme.

      Furthermore, we agree with the reviewer and Reviewer #2 that the conclusions were overly confident. As suggested, we decided to repeat this experiment with another SSRI, fluvoxamine. Please find the results of this experiment in Fig. 5–supplement 1. The suggestions to further test the serotonin system in the sorl1 knockouts are excellent as well, however we do not plan to pursue them at this stage.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Major Comments:

      - Data are presented in a variety of different ways, occasionally making comparisons across figures difficult. Perhaps at a minimum, behavioral fingerprints as in Figure 3 - Supplementary Figure 1 should be presented for all mutants in the main figures.

      We like this suggestion! Thank you. We brought the behavioural fingerprints figure (previously Fig. 4–supplement 5) as main Fig. 4, and put the figure focused on the sorl1 knockout behavioural phenotype in supplementary, with the other gene-by-gene figures.

      - It is not clear why some data were selected for supplemental rather than main figures. In many cases, detailed phenotypic data is provided for one example mutant in the main figures, and then additional mutants are described in detail in the supplement. Again, to facilitate comparisons between mutants, fingerprints could be provided for all mutants in a main figure, with detailed analyses moved to the supplements.

      The logic was to dedicate one main figure to psen2 (Fig. 3) as an example of an early-onset Alzheimer’s risk gene, and one to sorl1 (previously Fig. 4) as an example of a late-onset Alzheimer’s risk gene. We focused on them in main figures as they are both tested again later (Fig. 5 and Fig. 6). Having said that, we agree that the fingerprints may be a better use of main figure space than the parameters plots. In addition to the above (fingerprints of late-onset Alzheimer’s risk genes in main figure), we rearranged the figures in the early-onset AD section to have the psen2 F0 knockout fingerprint in main.

      - The explication of the utility of behavioral fingerprinting on page 35 is somewhat confusing. The authors describe drugs used to treat depression as enriched among small molecules anti-correlating with the sorl1 fingerprint. However, in Figure 5 - Supplementary Figure 1, drugs used to treat depression are biased toward positive cosines, which are indicated as having a more similar fingerprint to sorl1. These drugs should be described as more present among compounds positively correlating with the sorl1 fingerprint.

      Sorry, the confusion is about “(anti-)correlating”. Precisely, we meant “correlating and/or anti-correlating”, not just anti-correlating. We changed to that wording. In short, the analysis is by design agnostic to whether compounds with a given annotation are found more on the positive cosines side (left side in Fig. 5–supplement 1a) or the negative cosines side (right side). This is because the dataset often includes both agonists and antagonists to a given pathway but these are difficult to annotate. For example, say 10 compounds in the dataset target the dopamine D4 receptor, but these are an unknown mix of agonists and antagonists. In this case, we want ZOLTAR to generate a low p-value when all 10 compounds are found at extreme ends of the list, regardless of which end(s) that is (e.g. top 8 and bottom 2 should give an extremely low p-value). Initially, we were splitting the list, for each annotation, into positive-cosine fingerprints and negative-cosine fingerprints and testing enrichment on both separately, but we think the current approach is better as it better reflects the cases we want to detect and considers all available examples for a given annotation in one test. In sum, yes, in this case drugs used to treat depression were mostly on the positive-cosine side, but the other drugs on the negative-cosine side also contributed to the p-value, so it reflects the analysis better to say “correlating and/or anti-correlating”. You can read more about our logic for the analysis in Methods (section Behavioural pharmacology from sorl1 F0 knockout’s fingerprint).
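
      To make this logic concrete, here is a minimal Python sketch of a top-and/or-bottom enrichment test of this kind. It is an illustration only, not the ZOLTAR implementation: the function name `enrichment_p_value`, the toy cosines, and the choice to draw without replacement are assumptions. It sums the absolute cosines of the annotated fingerprints and compares that sum with the sums obtained from random draws of the same size.

```python
import random


def enrichment_p_value(cosines, annotated_idx, n_draws=10_000, seed=0):
    """Simulated p-value for enrichment at the top and/or bottom of a ranked list.

    cosines: cosine of every fingerprint vs. the knockout fingerprint.
    annotated_idx: positions of the fingerprints sharing one annotation.
    Sampling without replacement is an assumption of this sketch.
    """
    rng = random.Random(seed)
    abs_cos = [abs(c) for c in cosines]
    k = len(annotated_idx)
    # Observed statistic: sum of absolute cosines, so both ends of the list count
    observed = sum(abs_cos[i] for i in annotated_idx)
    # Null distribution: the same number of fingerprints drawn at random
    larger = sum(
        1 for _ in range(n_draws) if sum(rng.sample(abs_cos, k)) > observed
    )
    return observed, larger / n_draws


# Toy ranked list: 200 fingerprints, cosines evenly spaced from +1 to -1
cosines = [1 - 2 * i / 199 for i in range(200)]

# Hypothetical annotation A: 8 compounds at the top and 2 at the bottom
extreme = list(range(8)) + [198, 199]
# Hypothetical annotation B: 10 compounds in the middle (cosines near 0)
middle = list(range(95, 105))

obs_extreme, p_extreme = enrichment_p_value(cosines, extreme)
obs_middle, p_middle = enrichment_p_value(cosines, middle)
print(f"extreme: sum = {obs_extreme:.2f}, simulated p = {p_extreme:.4f}")
print(f"middle:  sum = {obs_middle:.2f}, simulated p = {p_middle:.4f}")
```

      With these toy inputs, the annotation split between the top and the bottom of the list yields a far lower simulated p-value than the annotation clustered around cosine = 0, which is the behaviour described above.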

      - The authors conclude the above-described section by stating: "sorl1 knockout larvae behaved similarly to larvae treated with small molecules targeting serotonin signaling, suggesting that the loss of Sorl1 disrupted serotonin signaling." Directionality here may be important. Are all of the drugs targeting the serotonin transporter SSRIs or similar? If so, then a correct statement would be that loss of Sorl1 causes similar phenotypes to drugs enhancing serotonin signaling. Finally, based on the correlation between serotonin transporter inhibitor trazodone and the sorl1 crispant phenotype, it is potentially surprising that the SSRI citalopram caused the opposite phenotype from sorl1, that is, increased sleep during the day and night. It is potentially interesting that this result was enhanced in mutants, and suggests dysfunction of serotonin signaling, but the statement that "our behavioral pharmacology approach correctly predicted from behaviour alone that serotonin signaling was disrupted" is too strong a conclusion.

      We understand “disrupt” as potentially going either way, but this may not be the common usage. We changed to “altered”.

      The point regarding directionality is excellent, however. We tested the proportion of serotonin transporter agonists and antagonists (SSRIs) on each side of the ranked list of small molecule fingerprints. We used the STITCH database for this analysis as it has more drug–target interactions than the Therapeutic Target Database, although they are likely less curated (Szklarczyk et al., 2016). As with the Therapeutic Target Database, most fingerprints of compounds interacting with the serotonin transporter SLC6A4 were found on the side of positive cosines (p ~ 0.005 using the custom permutation test), which replicates Fig. 5a with a different source for the drug–target annotations (Author response image 5). On the side of positive cosines (small molecules which generate behavioural fingerprints correlating with the sorl1 fingerprint), there were 2 agonists and 26 antagonists. On the side of negative cosines (small molecules which generate behavioural fingerprints anti-correlating with the sorl1 fingerprint), there were 3 agonists and 2 antagonists. A Chi-squared test suggests a significant (p = 0.002) over-representation of antagonists (SSRIs) on the positive side (expected count = 24, vs. 26 observed) and of agonists on the negative side (expected count = 1, vs. 3 observed). If SLC6A4 antagonists, i.e. SSRIs, indeed tend to cause a behavioural phenotype similar to that of sorl1 knockout, this would point in the direction of our original interpretation of the citalopram experiment, which was that excessive serotonin signalling is what causes the sorl1 behavioural phenotype.
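
      For readers who wish to retrace the arithmetic, the expected counts and the test statistic can be recomputed from the four counts given above. The short Python sketch below assumes a standard Pearson Chi-squared test on the 2 × 2 table without continuity correction; the response does not state which variant was used, so this is a check under that assumption rather than the exact original call.

```python
import math

# Observed counts: rows = side of the ranked list, columns = (agonists, antagonists)
observed = [
    [2, 26],  # positive cosines
    [3, 2],   # negative cosines
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under independence = row total * column total / grand total
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Pearson Chi-squared statistic (no continuity correction, an assumption here)
chi2 = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(2)
    for j in range(2)
)

# For 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2))
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"expected antagonists, positive side: {expected[0][1]:.1f}")
print(f"expected agonists, negative side:    {expected[1][0]:.1f}")
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

      Under these assumptions, the expected counts round to 24 and 1, and the p-value is close to the reported 0.002.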

      Author response image 5.

      Using the STITCH database as source of annotations also predicts SLC6A4 as an enriched target for the sorl1 behavioural fingerprint. Same figures as Fig. 5a,b but using the STITCH database (Szklarczyk et al., 2016) as source for the drug targets. a, Compounds annotated by STITCH as interacting with the serotonin transporter SLC6A4 tend to generate behavioural phenotypes similar to the sorl1 F0 knockout fingerprint. 40,522 compound–target protein pairs (vertical bars; 1,592 unique compounds) are ranked from the fingerprint with the most positive cosine to the fingerprint with the most negative cosine in comparison with the mean sorl1 F0 knockout fingerprint. Fingerprints of drugs that interact with SLC6A4 are coloured in yellow. Simulated p-value = 0.005 for enrichment of drugs interacting with SLC6A4 at the top (positive cosine) and/or bottom (negative cosine) of the ranked list by a custom permutation test. b, Result of the permutation test for top and/or bottom enrichment of drugs interacting with SLC6A4 in the ranked list. The absolute cosines of the fingerprints of drugs interacting with SLC6A4 (n = 52, one fingerprint per compound) were summed, giving sum of cosines = 15.9. To simulate a null distribution, 52 fingerprints were randomly drawn 100,000 times, generating a distribution of 100,000 random sums of cosines. Here, only 499 random draws gave a larger sum of cosines, so the simulated p-value was p = 499/100,000 = 0.005 **.

      If this were true, we would expect, as the reviewer suggested, SSRI treatment (citalopram or fluvoxamine) of control larvae to give a behavioural phenotype similar to that of sorl1 knockout. However, this generally did not appear to be the case (sorl1 knockout fingerprint vs. SSRI-treated control fingerprint, cosine = 0.08 ± 0.35; Author response image 6).

      Author response image 6.

      sorl1 F0 knockouts in comparison to controls treated with SSRIs. a, sorl1 F0 knockout fingerprints (baseline recordings and sorl1 + H<sub>2</sub>O fingerprint from the citalopram experiment) in comparison with the scrambled-injected + citalopram (1 or 10 µM) fingerprints. Each dot represents the mean deviation from the same-clutch scrambled-injected H<sub>2</sub>O-treated mean for that parameter (z-score, mean ± SEM). b, As in a), sorl1 F0 knockout fingerprints (baseline recordings and sorl1 + H<sub>2</sub>O fingerprint from the fluvoxamine experiment) in comparison with the scrambled-injected + fluvoxamine (10 µM) fingerprint.

      The comparison with trazodone is an interesting observation, but it is only a weak serotonin reuptake inhibitor (Ki for SLC6A4 = 690 nM, vs. 8.9 nM for citalopram; Owens et al., 1997) and it has many other targets, as agonist or antagonist, including serotonin, adrenergic, and histamine receptors (Mijur, 2011). In any case, the average trazodone fingerprint does not correlate particularly well with the sorl1 knockout fingerprint (cos = 0.3). Finally, the sorl1 knockout behavioural phenotype could be primarily caused by altered serotonin signalling in the hypothalamus, where we found both the biggest difference in tph1a/1b/2 HCR signal intensity (Fig. 5f) and the highest expression of sorl1 across scRNA-seq clusters (Fig. 1–supplement 2). In this case, it would be correct to expect sorl1 knockouts to react differently to SSRIs than controls, but it would be incorrect to expect SSRI treatment to cause the same behavioural phenotype, as it concurrently affects every other serotonergic neuron in the brain.

      Finally, we agree the quoted conclusion was too strong given the current evidence. We have since tested another SSRI, fluvoxamine, on sorl1 knockouts.

      - Also in reference to Figure 5: in panel c, data are presented as deviation from vehicle treated. Because of this data presentation choice, it's no longer possible to determine whether, in this experiment, sorl1 crispants sleep less at night relative to their siblings. Does citalopram rescue / reverse sleep deficits in sorl1 mutants?

      On your first point, please see our response to Reviewer #3 (2)c and Author Response 2b above.

      On “does citalopram rescue/reverse sleep deficits in sorl1 mutants”: citalopram (and fluvoxamine) tends to reverse the key aspects of the sorl1 knockout behavioural phenotype by reducing night-time activity (% time active and total Δ pixels), increasing night-time sleep, and shortening sleep latency (Author response image 7). Extrapolating from the hypothesis presented in Discussion, this may be interpreted as a hint that sorl1 knockouts have reduced levels of 5-HT receptors, as increasing serotonin signalling using an SSRI tends to rescue the phenotype. However, we do not think that focusing on the significant behavioural parameters necessarily makes sense here. Rather, one should take all parameters into account to conclude whether knockouts react differently to the drug than wild types (also see answer to Reviewer #3, (7) on this). For example, citalopram increased the night-time sleep bout length of sorl1 knockouts more than that of controls (Fig. 5), but this parameter was not modified by knockout of sorl1 (Fig. 4). To explain the rationale more informally, citalopram is only used here as a tool to probe serotonin signalling in sorl1 knockouts. Whether it worsens or rescues the behavioural phenotype is somewhat secondary; the key question is whether knockouts react differently than controls.

      Author response image 7.

      Comparing untreated sorl1 F0 knockouts vs. treated with SSRIs. a, sorl1 F0 knockout fingerprints (baseline recordings and sorl1 + H<sub>2</sub>O fingerprint from the citalopram experiment) in comparison with the sorl1 knockout + citalopram (1 or 10 µM) fingerprints. Each dot represents the mean deviation from the same-clutch scrambled-injected H<sub>2</sub>O-treated mean for that parameter (z-score, mean ± SEM). b, As in a), sorl1 F0 knockout fingerprints (baseline recordings and sorl1 + H<sub>2</sub>O fingerprint from the fluvoxamine experiment) in comparison with the sorl1 + fluvoxamine (10 µM) fingerprint.

      - Possible molecular pathways targeted by tinidazole, fenoprofen, and betamethasone are not described.

      Tinidazole is an antibiotic, fenoprofen is a non-steroidal anti-inflammatory drug (NSAID), and betamethasone is a steroidal anti-inflammatory drug. Interestingly, long-term use of NSAIDs reduces the risk of AD (in ’t Veld et al., 2001). Several mechanisms are possible (Weggen et al., 2007), including reduction of Aβ42 production by interacting with γ-secretase (Eriksen et al., 2003). However, we did not explore the mechanism of action of these drugs on psen2 knockouts so do not feel comfortable speculating. We do not know, for example, whether these findings apply to betamethasone.

      Minor Comments:

      - On page 25, panel "g" should be labeled as "f".

      Thank you!

      - On page 35, a reference should be provided for the statement "From genomic studies of AD, we know that mutations in genes such as SORL1 modify risk by disrupting some biological processes.".

      Thank you, this is now corrected. These were the same studies as mentioned in the Introduction.

      - On page 43, the word "and" should be added - "in wild-type rats and mice, overexpressing mutated human APP and PSEN1, AND restricting sleep for 21 days...".

      Right, this sentence could be misread; we edited it. “overexpressing […]” only applied to the mice, not the rats (as they are wild-type), and both are sleep-deprived.

      - On page 45, a reference should be provided for the statement "SSRIs can generally be used continuously with no adverse effects" and this statement should potentially be softened.

      The reference is at the end of that sentence (Cirrito et al., 2011). You are correct though; we reformulated this statement to: “SSRIs can generally be used safely for many years”. SSRIs indeed have side effects.

      - On page 54, a 60-minute rolling average is described as 45k rows, but this seems to be a 30-minute rolling average.

      Thank you! We corrected this. It should have been 90k rows, as in: 25 frames-per-second × 60 seconds × 60 minutes.
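
      The arithmetic can be checked with a few lines of Python. The helper name `rolling_window_rows` is hypothetical and only the 25 frames-per-second rate is taken from the text; the sketch shows that 45k rows indeed correspond to a 30-minute window.

```python
FRAMES_PER_SECOND = 25  # frame rate stated in the response


def rolling_window_rows(minutes, fps=FRAMES_PER_SECOND):
    """Number of frame-by-frame rows spanned by a rolling window of `minutes`."""
    return fps * 60 * minutes


rows_60_min = rolling_window_rows(60)  # 90000 rows
rows_30_min = rolling_window_rows(30)  # 45000 rows
print(rows_60_min, rows_30_min)
```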

      Reviewer #2 (Recommendations For The Authors):

      "As we observed in the scRNA-seq data, most genes tested (appa, appb, psen1, psen2, apoea, cd2ap, sorl1) were broadly expressed throughout the 6-dpf brain (Fig. 1d and Fig. 1supplement 3 and 4)."

      - apoea and appb are actually not expressed highly in the scRNA-seq data, and the apoea in situ looks odd, as if it has no expression. The appb gene mysteriously does not look as though it has high expression in the Raj data, but it is clearly expressed based on the in situ. I had previously noticed the same discrepancy, and I attribute it to the transcriptome used to map the Raj data, as the new DanioCell data uses a new transcriptome and indicates high appb expression in the brain. Please point out the discrepancy and possible explanation, perhaps in the figure legend.

      All excellent points, thank you. We included them directly in Results text.

      "most of these were expressed in the brain of 5-6-dpf zebrafish larvae, suggesting they play a role in early brain development or function."

      - Evidence of expression does not suggest function, particularly not a function in brain development. As one example, almost half of the genome is expressed prior to the maternal-zygotic transition but does not have a function in those earliest stages of development. There are numerous other instances where expression does not equal function. Please change the sentence even as simply as "it is possible that they".

      We mostly agree and edited to “[…], so they could play a role […]”.

      Out of curiosity, we plotted, for each zebrafish developmental stage, the proportion of Alzheimer’s risk gene orthologues expressed in comparison to the proportion of all genes expressed (Author response image 8). We defined “all genes” as every gene that is expressed in at least one of the developmental stages (n = 24,856), not the complete transcriptome, to avoid including genes that are never expressed in the brain or whose expression is always below detection limit. We counted a gene as “expressed” if at least three cells had detectable transcripts. Using these definitions, 82 ± 7% of genes are expressed during development. For every developmental stage except 5 dpf (so 11/12), a larger proportion of Alzheimer’s risk genes than all genes are expressed (+5 ± 4%).
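
      As an illustration of the counting rule, the Python sketch below applies the “at least three cells with detectable transcripts” threshold to a toy count table for a single developmental stage. The gene names, the counts, and the reading of “detectable” as a non-zero count are assumptions; the sketch does not reproduce the filtering of “all genes” across stages.

```python
def is_expressed(counts_per_cell, min_cells=3):
    """True if at least `min_cells` cells have a detectable (non-zero) count."""
    return sum(1 for count in counts_per_cell if count > 0) >= min_cells


def proportion_expressed(counts_by_gene, genes, min_cells=3):
    """Fraction of `genes` counted as expressed at this stage."""
    n_expressed = sum(
        1 for gene in genes if is_expressed(counts_by_gene[gene], min_cells)
    )
    return n_expressed / len(genes)


# Hypothetical transcript counts per cell at one developmental stage
stage_counts = {
    "gene_a": [0, 2, 1, 5, 0],  # detected in 3 cells: expressed
    "gene_b": [0, 0, 1, 0, 4],  # detected in 2 cells: not expressed at this stage
    "gene_c": [3, 1, 1, 2, 1],  # detected in 5 cells: expressed
}
risk_genes = ["gene_a", "gene_c"]  # hypothetical risk gene orthologues
all_genes = list(stage_counts)

prop_risk = proportion_expressed(stage_counts, risk_genes)
prop_all = proportion_expressed(stage_counts, all_genes)
print(f"risk genes: {prop_risk:.2f}, all genes: {prop_all:.2f}")
```

      Repeating this calculation for each stage would give the two proportions compared in the plot.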

      Author response image 8.

      Proportion of Alzheimer’s risk genes orthologues expressed throughout zebrafish development. Proportion of Alzheimer’s risk genes orthologues (n = 42) and all genes (n = 24,856) expressed in the zebrafish brain at each developmental stage, from 12 hours post-fertilisation (hpf) to 15 days post-fertilisation (dpf). “All genes” corresponds to every gene expressed in the brain at any of the developmental stages, not the complete transcriptome. A gene is considered “expressed” (green) if at least three cells had detectable transcripts. Single-cell RNA-seq dataset from Raj et al., 2020.

      "This frame-by-frame analysis has several advantages over previous methods that analysed activity data at the one-minute resolution."

      - Which methods are these? There are no citations. There are certainly existing methods in the zebrafish field that can produce similar data to the method developed for this project. This new package is useful, as most existing software is not written in R, so it would help scientists who prefer this programming language. However, I would be careful not to oversell its novelty, since many methods do exist that produce similar results.

      We added the references. They were referenced above after “we combined previous sleep/wake analysis methods”, but should have been referenced again here.

      We are not convinced by this criticism. We would obviously not claim that the FramebyFrame package is as sophisticated and versatile as video-tracking tools like SLEAP or DeepLabCut, but we do think it answers a genuine need that was not addressed by other methods. Specifically, we know of many labs recording pixel count data across multiple days using the Zebrabox or DanioVision (we added support for DanioVision data after submission), but there were no packages to extract behavioural parameters from these data. Other methods involved standalone scripts with no documentation or version tracking. We would concede the FramebyFrame package is mostly targeted at these labs, but we already know of six labs routinely using it and were recently contacted by a researcher tracking Daphnia in the Zebrabox.

      "F0 knockouts of both cutches" - "clutches"

      Thank you!

      Reviewer #3 (Recommendations For The Authors):

      I would suggest totally revamping the Introduction section, and being sure to provide readers with the context and background they need for the data that comes thereafter. Key areas to touch on, in no particular order, include:

      • Far more detail on the behavioral pharm screen upon which this paper builds, as a brief overview of that approach and the data generated are needed.

      Thank you for the suggestion, we added a sentence hinting at this work in the last Introduction paragraph.

      • Limitations of current zebrafish sleep/arousal assays that motivated the authors to develop a new, temporally high-resolution system.

      We think this is better explained in Results, as it currently is. For example, we need to point to Fig. 2–supplement 2a,b,c to explain that one-minute methods were missing sleep bouts and how FramebyFrame resolves this issue.

      • A paragraph about sleep and AD, that does a better job of citing work in humans, mammalian, and invertebrate models that motivate the interest in the connection pursued here.

      Sorry, we think this would place too much focus on sleep and AD. We want the main topic of the paper to be the behavioural pharmacology approach, not AD or sleep per se. As the Introduction states, we see Alzheimer’s risk genes as a case study for the behavioural pharmacology approach, rather than the reason why the approach was developed. Additionally, presenting sleep and AD in Introduction risks sounding like ZOLTAR is specifically designed for this context, while we conceived of it as much more generalisable and explicitly encourage its use to study genes associated with other diseases. Note that the paragraph you suggest is, we think, mostly present in Discussion (section Disrupted sleep and serotonin signalling […]).

      • I modestly suggest eliminating making such a strong case for a gene-first approach being the best way to understand disease. It is not a zero-sum game, and there is plenty to learn from proteomics, metabolomics, etc. I suspect nobody will argue with the authors saying they leveraged the strength of their system and focused on key AD genes of interest.

      From your point below, we understand the following quote is the source of the issue: “For finding causal processes, studying the genome, rather than the transcriptome or epigenome, is advantageous because the chronology from genomic variant to disease is unambiguous […]”. We did not want to suggest it is a zero-sum game, but we now understand how it can be read this way. We slightly adapted the wording. What we want to do is highlight the causality argument as the advantage of the genomics approach. We feel we do not read this argument often enough, while it remains a ‘magic power’ of genomics. One essentially does not have to worry about causality when studying a pathogenic germline variant, while it is a constant concern when studying the transcriptome or epigenome (i.e. did the change in this transcript’s level cause disease, or vice-versa?). To take an example in the context of AD, arguments based on genomics (e.g. Down syndrome or APP duplication) are often the definitive arbiters when debating the amyloid hypothesis, exactly because their causality cannot be doubted.

      Minor comments

      (1) The opening of the introduction is perhaps overly broad, spending an entire paragraph on genome vs transcriptome, etc and making the claim that a gene-first approach is the best path. It isn't zero-sum, and the authors could just get right into AD and study genes of interest. Similar issues occur throughout the manuscript, with sentences/paragraphs that are not necessarily needed.

      Please see our answer to your previous point. On the introduction being overly broad, we perfectly agree it is broad, but related to your point about presenting sleep and AD in the Introduction, we wish to talk about finding causal processes from genomics findings using behavioural pharmacology. We purposefully present research on AD as one instance of this broader goal, not the primary topic of the paper.

      Another example are these sentences, which could be totally removed as the following paragraph starts off making the same point much more succinctly. "From genomic studies of AD, we know that mutations in genes such as SORL1 modify risk by disrupting some biological processes. Presumably, the same processes are disrupted in zebrafish sorl1 knockouts, and some caused the behavioural alterations we observed. Can we now follow the thread backwards and predict some of the biological processes in which Sorl1 is involved based on the behavioural profile of sorl1 knockouts?"

      Thanks for the suggestion, but we think these sentences are useful to place this Results section back in the context of the Introduction. Think of the paper as mainly about the behavioural pharmacology approach, not about Alzheimer’s risk genes. The function of the paragraph here is not simply to explain the method by which we decided to study sorl1; it is to reiterate the rationale behind the behavioural pharmacology approach so that the reader understands where this Results section fits in the overall structure.

      (2) Related to the above, the authors use lecanemab as an example to support their approach, but there has been a great deal of controversy regarding this drug. I don't think such extensive justification is needed. This study uses AD risk genes as a case study in a newly developed behavioral pharm pipeline. A great deal of the rest of the intro seems to just fill space and could be more focused on the study at hand. Interestingly, after gene selection, the next step in their pipeline is sleep/wake analysis yet nothing is covered about AD and sleep in the intro. Some justification of that approach (why focus on sleep/wake as a starting point for behavioral pharm rather than learning and memory?) would be a better use of intro space.

      There has indeed been controversy about lecanemab, but even the harshest critiques of the amyloid hypothesis concede that it slows down cognitive decline (Espay et al., 2023). That is all that is needed to support our argument, which is that research on AD started primarily from genomics and thereby yielded a disease-modifying drug. The controversy seems mostly focused on whether this effect size is clinically significant, and we think we correctly represent this uncertainty (e.g. “antibodies against Aβ such as lecanemab show promise in slowing down disease progression” and “the beneficial effects from targeting Aβ aggregation currently remain modest”).

      Your next point is entirely fair. We mostly answered it above. To explain further, the primary reason why we measured sleep/wake behaviour is to match the behavioural dataset from Rihel et al., 2010 so we can use it to make predictions, not to study sleep in the context of AD per se. Sure, perhaps learning and memory would have been interesting, but we do not know of any study testing thousands of small molecules on zebrafish larvae during a memory task. We understand it can be slightly confusing though, as we then spend a paragraph of Discussion on sleep as a causal process in AD, but we obviously need to discuss this topic given the findings. However, to reiterate, we purposefully designed FramebyFrame and ZOLTAR to be useful beyond studying sleep/wake behaviour. For example, FramebyFrame would not calculate 17 behavioural parameters if the only goal was to measure sleep. We now mention the Rihel et al., 2010 study in the Introduction as you suggested above (“Far more detail on the behavioral pharm screen […]”), as that is the real reason why sleep/wake behaviour was measured in the first place.

      (3) Also related to the above, another more relevant point that could be talked about in the intro is the need for more refined approaches to analyze sleep in zebrafish, given the effort that went into the new analysis system described here. Again, I think the context for why the authors developed this system would be more meaningful than the current content.

      Thank you, we think we answered this point above (especially below Limitations of current zebrafish sleep/arousal assays […]).

      (4) GWAS can stand for Genome-wide associate studies (plural) so I do not think the extra "s" is needed (GWASs) .

      Indeed, that seems to be the common usage. Thank you.

      (5) AD candidate risk genes were determined from loci using "mainly statistic colocalization". Can the authors add a few more details about what was done and what the "mainly" caveat refers to?

      “Mainly” simply refers to the fact that other methods were used by Schwartzentruber et al. (2021) to annotate the GWAS loci with likely causal genes, but that most calls were ultimately made from statistic colocalisation. Readers can refer to this work to learn more about the methods used.

      (6) The authors write "The loss of psen1 only had mild effects on behaviour" but I think they mean "sleep behaviors" as there could be many other behaviors that are disrupted but were not assessed. The same issue a few sentences later with "Behaviour during the day was not affected" and at the end of the following paragraph.

      Yes, that would be more precise, thank you.

      (7) For the Sorl1 pharmacology data, it is very hard to understand what is being measured behaviorally. Are the authors measuring sleep +/- citalopram, or something else, and why the change to Euclidean distance rather than all the measures we were just introduced to earlier in the manuscript?

      We understand these plots (Fig. 5c,d) are less intuitive, but it is important that we show the difference in behaviour compared to H<sub>2</sub>O-treated larvae of the same genotype. The claim is that citalopram has a larger effect on knockouts than on controls, so the reader needs to focus on the effect of the drug on each genotype, not on the effect of sorl1 knockout. We added the standard fingerprints (i.e. setting controls to z-score = 0) here in Author response figures.

      Euclidean distance takes as input all the measures we introduced. The point is precisely not to select a single measure. For example, say we were only plotting active bout number during the day, we would conclude that 10 µM citalopram has the same effect on knockouts and controls. Conversely, if we had taken sleep bout length at night, we would conclude 10 µM has a stronger effect on knockouts. What is the correct parameter to select? Using Euclidean distance resolves this by taking all parameters into account, rather than arbitrarily choosing one.
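
      As a toy illustration of this point, the Python sketch below computes the Euclidean distance between a drug-treated fingerprint and the H2O-treated fingerprint of the same genotype. The three parameters and their z-scores are invented for the example (the actual fingerprints have more parameters), and the function name is hypothetical.

```python
import math


def euclidean_distance(fingerprint_a, fingerprint_b):
    """Euclidean distance between two fingerprints (lists of z-scores)."""
    if len(fingerprint_a) != len(fingerprint_b):
        raise ValueError("fingerprints must have the same number of parameters")
    return math.sqrt(
        sum((a - b) ** 2 for a, b in zip(fingerprint_a, fingerprint_b))
    )


# Invented z-scores for three parameters, each expressed relative to the
# H2O-treated larvae of the same genotype (so the untreated reference is all zeros)
untreated = [0.0, 0.0, 0.0]
controls_treated = [1.0, 0.5, 0.2]   # drug effect on scrambled-injected controls
knockouts_treated = [1.0, 0.5, 1.4]  # drug effect on knockouts

dist_controls = euclidean_distance(controls_treated, untreated)    # sqrt(1.29)
dist_knockouts = euclidean_distance(knockouts_treated, untreated)  # sqrt(3.21)
print(f"controls: {dist_controls:.2f}, knockouts: {dist_knockouts:.2f}")
```

      In this toy case, looking only at the first parameter would suggest an identical drug effect in both genotypes, whereas the distance over all parameters is larger for knockouts.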

      And what exactly is a "given spike in serotonin"? and how is this hypothesis the conclusion based on the lack of evidence for the second hypothesis? As the authors say, there could be other ways sorl1 knockouts are more sensitive to citalopram, so the absence of evidence for one hypothesis certainly does not support the other hypothesis.

      We mean a given release of serotonin in the synaptic cleft. We have fixed this wording. 

      We tend to disagree on the second point. We can think of two ways that sorl1 knockouts are more sensitive to citalopram: 1) they produce more serotonin, so blocking reuptake causes a larger spike in knockouts; or 2) blocking reuptake causes the same increase in both knockouts and wild-types but knockouts react more strongly to serotonin. We cannot in fact think of another way to explain the citalopram results. Not finding overwhelming evidence for 1) surely supports 2) somewhat, even if we do not have direct evidence for it. As an analogy, if two diagnoses are possible for a patient, testing negative for the first one supports the other one, even before it is directly tested.

      (8) Again some language is used without enough care. Fish are referred to as "drowsier" under some drug conditions. How do the authors know the animal is drowsy? The phenotype is more specific - more sleep, less activity.

      Thank you, we switched to “Furthermore, fenoprofen worsened the day-time hypoactivity of psen2 knockout larvae […]”.

      (9) This sentence is misleading as it gives the impression that results in this manuscript suggest the conclusion: "Our observation that disruption of genes associated with AD diagnosis after 65 years reduces sleep in 7-day zebrafish larvae suggest that disrupted sleep may be a common mechanism through which these genes exert an effect on risk." That idea is widely held in the field, and numerous other previous manuscripts/reviews should be cited for clarity of where this hypothesis came from.

      This idea is not widely held in the field. You likely read this point as “disrupted sleep is a risk factor for AD”, which, yes, is widely discussed in the field, but is not precisely what we are saying. We hypothesise that mutations in some of the Alzheimer’s risk genes cause disrupted sleep, possibly from a very early age, which then causes AD decades later. Studies and reviews on sleep and AD rarely make this hypothesis, at least not explicitly. The closest we know of are a few recent human genetics studies, typically using Mendelian Randomisation, finding that higher genetic risk of AD correlates with some sleep phenotypes, such as sleep duration (Chen et al., 2022; Leng et al., 2021). The work of Muto et al. (2021) is particularly interesting as it found correlations between higher genetic risk of AD and some sleep phenotypes in men in their early twenties, which seems unlikely to be a consequence of early pathology. Note, however, that even these studies do not mention sleep possibly being disrupted early in development, which is what our findings in zebrafish larvae support. As we mention, we think a team should test whether sleep is different in infants at higher genetic risk of AD, essentially performing an experiment analogous to, but obviously much more difficult than, the one we did in zebrafish larvae. We do not know of any study testing this or even raising this idea, so evidently it is not widely held. Having said that, the studies we mention here were not referenced in the Discussion paragraph. We have now corrected this.

      Ashlin TG, Blunsom NJ, Ghosh M, Cockcroft S, Rihel J. 2018. Pitpnc1a Regulates Zebrafish Sleep and Wake Behavior through Modulation of Insulin-like Growth Factor Signaling. Cell Rep 24:1389–1396. doi:10.1016/j.celrep.2018.07.012

      Chen D, Wang X, Huang T, Jia J. 2022. Sleep and Late-Onset Alzheimer’s Disease: Shared Genetic Risk Factors, Drug Targets, Molecular Mechanisms, and Causal Effects. Front Genet 13. doi:10.3389/fgene.2022.794202

      Cirrito JR, Disabato BM, Restivo JL, Verges DK, Goebel WD, Sathyan A, Hayreh D, D’Angelo G, Benzinger T, Yoon H, Kim J, Morris JC, Mintun MA, Sheline YI. 2011. Serotonin signaling is associated with lower amyloid-β levels and plaques in transgenic mice and humans. Proc Natl Acad Sci U S A 108:14968–14973. doi:10.1073/pnas.1107411108

      Dean DC, Jerskey BA, Chen K, Protas H, Thiyyagura P, Roontiva A, O’Muircheartaigh J, Dirks H, Waskiewicz N, Lehman K, Siniard AL, Turk MN, Hua X, Madsen SK, Thompson PM, Fleisher AS, Huentelman MJ, Deoni SCL, Reiman EM. 2014. Brain Differences in Infants at Differential Genetic Risk for Late-Onset Alzheimer Disease A Cross-sectional Imaging Study. JAMA Neurol 71:11–22. doi:10.1001/jamaneurol.2013.4544

      Eriksen JL, Sagi SA, Smith TE, Weggen S, Das P, McLendon DC, Ozols VV, Jessing KW, Zavitz KH, Koo EH, Golde TE. 2003. NSAIDs and enantiomers of flurbiprofen target γ-secretase and lower Aβ42 in vivo. J Clin Invest 112:440–449. doi:10.1172/JCI18162

      Espay AJ, Herrup K, Kepp KP, Daly T. 2023. The proteinopenia hypothesis: Loss of Aβ42 and the onset of Alzheimer’s Disease. Ageing Res Rev 92:102112. doi:10.1016/j.arr.2023.102112

      Hoffman EJ, Turner KJ, Fernandez JM, Cifuentes D, Ghosh M, Ijaz S, Jain RA, Kubo F, Bill BR, Baier H, Granato M, Barresi MJF, Wilson SW, Rihel J, State MW, Giraldez AJ. 2016. Estrogens Suppress a Behavioral Phenotype in Zebrafish Mutants of the Autism Risk Gene, CNTNAP2. Neuron 89:725–733. doi:10.1016/j.neuron.2015.12.039

      in ’t Veld Bas A, Ruitenberg A, Hofman A, Launer LJ, van Duijn CM, Stijnen T, Breteler MMB, Stricker BHC. 2001. Nonsteroidal Antiinflammatory Drugs and the Risk of Alzheimer’s Disease. N Engl J Med 345:1515–1521. doi:10.1056/NEJMoa010178

      Jagirdar R, Fu C-H, Park J, Corbett BF, Seibt FM, Beierlein M, Chin J. 2021. Restoring activity in the thalamic reticular nucleus improves sleep architecture and reduces Aβ accumulation in mice. Sci Transl Med 13:eabh4284. doi:10.1126/scitranslmed.abh4284

      Jiang H, Newman M, Lardelli M. 2018. The zebrafish orthologue of familial Alzheimer’s disease gene PRESENILIN 2 is required for normal adult melanotic skin pigmentation. PLOS ONE 13:e0206155. doi:10.1371/journal.pone.0206155

      Jiang H, Pederson SM, Newman M, Dong Y, Barthelson K, Lardelli M. 2020. Transcriptome analysis indicates dominant effects on ribosome and mitochondrial function of a premature termination codon mutation in the zebrafish gene psen2. PloS One 15:e0232559. doi:10.1371/journal.pone.0232559

      Joo W, Vivian MD, Graham BJ, Soucy ER, Thyme SB. 2021. A Customizable Low-Cost System for Massively Parallel Zebrafish Behavioral Phenotyping. Front Behav Neurosci 14.

      Joubert L, Hanson B, Barthet G, Sebben M, Claeysen S, Hong W, Marin P, Dumuis A, Bockaert J. 2004. New sorting nexin (SNX27) and NHERF specifically interact with the 5-HT4a receptor splice variant: roles in receptor targeting. J Cell Sci 117:5367–5379. doi:10.1242/jcs.01379

      Leng Y, Ackley SF, Glymour MM, Yaffe K, Brenowitz WD. 2021. Genetic Risk of Alzheimer’s Disease and Sleep Duration in Non-Demented Elders. Ann Neurol 89:177–181. doi:10.1002/ana.25910

      Mitchell PB, Hadzi-Pavlovic D. 2000. Lithium treatment for bipolar disorder. Bull World Health Organ 78:515–517.

      Mittur A. 2011. Trazodone: properties and utility in multiple disorders. Expert Rev Clin Pharmacol 4:181–196. doi:10.1586/ecp.10.138

      Munoz-Torrero D. 2008. Acetylcholinesterase Inhibitors as Disease-Modifying Therapies for Alzheimer’s Disease. Curr Med Chem 15:2433–2455. doi:10.2174/092986708785909067

      Muto V, Koshmanova E, Ghaemmaghami P, Jaspar M, Meyer C, Elansary M, Van Egroo M, Chylinski D, Berthomier C, Brandewinder M, Mouraux C, Schmidt C, Hammad G, Coppieters W, Ahariz N, Degueldre C, Luxen A, Salmon E, Phillips C, Archer SN, Yengo L, Byrne E, Collette F, Georges M, Dijk D-J, Maquet P, Visscher PM, Vandewalle G. 2021. Alzheimer’s disease genetic risk and sleep phenotypes in healthy young men: association with more slow waves and daytime sleepiness. Sleep 44. doi:10.1093/sleep/zsaa137

      Myers-Turnbull D, Taylor JC, Helsell C, McCarroll MN, Ki CS, Tummino TA, Ravikumar S, Kinser R, Gendelev L, Alexander R, Keiser MJ, Kokel D. 2022. Simultaneous analysis of neuroactive compounds in zebrafish. doi:10.1101/2020.01.01.891432

      Owens MJ, Morgan WN, Plott SJ, Nemeroff CB. 1997. Neurotransmitter receptor and transporter binding profile of antidepressants and their metabolites. J Pharmacol Exp Ther 283:1305–1322.

      Özcan GG, Lim S, Leighton PL, Allison WT, Rihel J. 2020. Sleep is bi-directionally modified by amyloid beta oligomers. eLife 9:e53995. doi:10.7554/eLife.53995

      Quiroz YT, Schultz AP, Chen K, Protas HD, Brickhouse M, Fleisher AS, Langbaum JB, Thiyyagura P, Fagan AM, Shah AR, Muniz M, Arboleda-Velasquez JF, Munoz C, Garcia G, Acosta-Baena N, Giraldo M, Tirado V, Ramírez DL, Tariot PN, Dickerson BC, Sperling RA, Lopera F, Reiman EM. 2015. Brain Imaging and Blood Biomarker Abnormalities in Children With Autosomal Dominant Alzheimer Disease: A Cross-Sectional Study. JAMA Neurol 72:912–919. doi:10.1001/jamaneurol.2015.1099

      Relkin NR. 2007. Beyond symptomatic therapy: a reexamination of acetylcholinesterase inhibitors in Alzheimer’s disease. Expert Rev Neurother 7:735–748. doi:10.1586/14737175.7.6.735

      Rihel J, Prober DA, Arvanites A, Lam K, Zimmerman S, Jang S, Haggarty SJ, Kokel D, Rubin LL, Peterson RT, Schier AF. 2010. Zebrafish Behavioral Profiling Links Drugs to Biological Targets and Rest/Wake Regulation. Science 327:348–351. doi:10.1126/science.1183090

      Sleegers K, Brouwers N, Gijselinck I, Theuns J, Goossens D, Wauters J, Del-Favero J, Cruts M, van Duijn CM, Van Broeckhoven C. 2006. APP duplication is sufficient to cause early onset Alzheimer’s dementia with cerebral amyloid angiopathy. Brain J Neurol 129:2977–2983. doi:10.1093/brain/awl203

      Sun L, Zhou R, Yang G, Shi Y. 2017. Analysis of 138 pathogenic mutations in presenilin-1 on the in vitro production of Aβ42 and Aβ40 peptides by γ-secretase. Proc Natl Acad Sci 114:E476–E485. doi:10.1073/pnas.1618657114

      Szklarczyk D, Santos A, von Mering C, Jensen LJ, Bork P, Kuhn M. 2016. STITCH 5: augmenting protein–chemical interaction networks with tissue and affinity data. Nucleic Acids Res 44:D380–D384. doi:10.1093/nar/gkv1277

      Weggen S, Rogers M, Eriksen J. 2007. NSAIDs: small molecules for prevention of Alzheimer’s disease or precursors for future drug development? Trends Pharmacol Sci 28:536–543. doi:10.1016/j.tips.2007.09.004

      Wiltschko AB, Tsukahara T, Zeine A, Anyoha R, Gillis WF, Markowitz JE, Peterson RE, Katon J, Johnson MJ, Datta SR. 2020. Revealing the structure of pharmacobehavioral space through motion sequencing. Nat Neurosci 23:1433–1443. doi:10.1038/s41593-020-00706-3

      Yang T, Arslanova D, Gu Y, Augelli-Szafran C, Xia W. 2008. Quantification of gamma-secretase modulation differentiates inhibitor compound selectivity between two substrates Notch and amyloid precursor protein. Mol Brain 1:15. doi:10.1186/1756-6606-1-15

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This study presents a useful modification of a standard model of genetic drift by incorporating variance in offspring numbers, claiming to address several paradoxes in molecular evolution. It is unfortunate that the study fails to engage prior literature that has extensively examined the impact of variance in offspring number, implying that some of the paradoxes presented might be resolved within existing frameworks.

      The prior literature the reviewers referred to consists entirely of "modified WF models". In the original submission, we lumped the standard and modified WF models together as the "generalized WF models". Because this lumping caused confusion, the distinction between them is now made clear. That said, the Haldane model in our proposal is not a modification of the standard WF model because, conceptually, the two models are very different. WF is based on sampling whereas the Haldane model is based on gene transmission.

      While the "modified WF models" often incorporate V(K) [variance in progeny number], the modification is still based on the WF model of population sampling. The modification is mathematically feasible but biologically untenable, as explained explicitly in the revised text. Most important, all four paradoxes are as incompatible with the modified WF models as with the standard model. Note that the Haldane model does not have the sampling step, which is absorbed into the V(K) term. In the integrated WF-Haldane model, these paradoxes are resolved (see the new sections of Discussion, quoted below).

      If readers do not have time to ponder all four paradoxes, they may simply read the first one, as follows. When the population size (N) is growing exponentially, such as in a bacterial culture, drift is nearly absent when N is small and becomes stronger as N increases, especially when approaching the carrying capacity. Such common observations are exactly the opposite of the WF model's central prediction. Any model based on sampling cannot escape the constraint of "greater drift, smaller N".

      Revision - The following text is a reproduction of the last 7 paragraphs of Discussion.

      “The standard WF model has been extended in several directions (overlapping generations, multiple alleles, ploidy, etc.). The modification most relevant to our studies here is the introduction of V(K) into the model, thus permitting V(K) ≠ E(K). While the modifications are mathematically valid, they are often biologically untenable. Kimura and Crow (1963) may be the first to offer a biological mechanism for V(K) ≠ E(K), effectively imposing the Haldane model on the WF model. Other models (Kimura and Crow 1963; Lynch, et al. 1995; Sjodin, et al. 2005; Der, et al. 2011; Cannings 2016) indeed model mathematically the imposition of the branching process on the population, followed by the WF sampling. The constructions of such models are biologically dubious but, more importantly, still unable to resolve the paradoxes. It would seem more logical to use the Haldane model in the first place by having two parameters, E(K) and V(K). 

      Even if we permit V(K) ≠ E(K) under the WF sampling, the models would face other difficulties. For example, a field biologist needs to delineate a Mendelian population and determine its size, N or Ne. In all WF models, one cannot know what the actual population being studied is. Is it the fly population in an orchard being sampled, in the geographical region, or in the entire species range? It is unsatisfactory when a population biologist cannot identify the population being studied. The Haldane model is an individual-output model (Chen, et al. 2017), which does not require the delineation of a Mendelian population.

      We shall now review the paradoxes specifically in relation to the modified WF models, starting with the multi-copy gene systems such as viruses and rRNA genes covered in the companion study (Wang, et al. 2024). These systems evolve both within and between hosts. Given the small number of virions transmitted between hosts, drift is strong in both stages as shown by the Haldane model (Ruan, Luo, et al. 2021; Ruan, Wen, et al. 2021; Hou, et al. 2023). Therefore, it does not seem possible to have a single effective population size in the WF models to account for the genetic drift in two stages. The inability to deal with multi-copy gene systems may explain the difficulties in accounting for the SARS-CoV-2 evolution (Deng, et al. 2022; Pan, Liu, et al. 2022; Ruan, Wen, et al. 2022; Hou, et al. 2023; Ruan, et al. 2023).

      We now discuss the first paradox of this study, which is about the regulation of N. In the general WF models, N is imposed from outside of the model, rather than self-generating within the model. When N is increasing exponentially as in bacterial or yeast cultures, there is almost no drift when N is very low and drift becomes intense as N grows to near the carrying capacity. As far as we know, no modifications of the WF model can account for this phenomenon that is opposite of its central tenet. In the general WF models, N is really the carrying capacity, not population size. 

      The second paradox of sex chromosomes is rooted in V(K) ≠ E(K). As E(K) is the same between sexes but V(K) is different, clearly V(K) = E(K) would not be feasible. The mathematical solution of defining separate Ne's for males and females (Kimura and Crow 1963; Lynch, et al. 1995; Sjodin, et al. 2005; Der, et al. 2011; Cannings 2016) unfortunately obscures the interesting biology. As shown in Wang et al. (2024; MBE), the kurtosis of the distribution of K indicates the presence of super-breeder males. While the Haldane model can incorporate the kurtosis, the modified WF models are able to absorb only up to the variance term, i.e., the second moment of the distribution. The third paradox of genetic drift is manifested in the fixation probability of an advantageous mutation, 2s/V(K). As explained above, the fixation probability is determined by the probability of reaching a low threshold that is independent of N itself. Hence, the key parameter of drift in the WF model, N (or Ne), is missing. This paradox supports the assertion that genetic drift is fundamentally about V(K) with N being a scaling factor.

      As the domain of evolutionary biology expands, many new systems do not fit into the WF models, resulting in the lack of a genetic drift component in their evolutionary trajectories. Multi-copy gene systems are obvious examples. Others include domestications of animals and plants that are processes of rapid evolution  (Diamond 2002; Larson and Fuller 2014; Purugganan 2019; Chen, Yang, et al. 2022; Pan, Zhang, et al. 2022; Wang, et al. 2022). Due to the very large V(K) in domestication, drift must have played a large role. Somatic cell evolution is another example with “undefinable” genetic drift (Wu, et al. 2016; Chen, et al. 2017; Chen, et al. 2019; Ruan, et al. 2020; Chen, Wu, et al. 2022). The Haldane (or WFH) model, as an "individual output" model, can handle these general cases of genetic drift.

      The Haldane model and the WF model are fundamentally different approaches to random forces of evolution. While the WF models encounter many biological contradictions, they have provided approximate mathematical solutions to more realistic scenarios. In systems such as in viral evolution (Ruan, Hou, et al. 2022; Hou, et al. 2023) or somatic cell evolution (Chen, Wu, et al. 2022; Zhai, et al. 2022) whereby the WF solution is absent, further development of the WFH model will be necessary.”

      In addition, while the modified model yields intriguing theoretical predictions, the simulations and empirical analyses are incomplete to support the authors' claims.

      This point is addressed in the responses to reviewers' comments. Since they are quite technical, they do not fit in the overview here.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The authors present a theoretical treatment of what they term the "Wright-Fisher-Haldane" model, a claimed modification of the standard model of genetic drift that accounts for variability in offspring number, and argue that it resolves a number of paradoxes in molecular evolution. Ultimately, I found this manuscript quite strange.

      The notion of effective population size as inversely related to the variance in offspring number is well known in the literature, and not exclusive to Haldane's branching process treatment. However, I found the authors' point about variance in offspring changing over the course of, e.g. exponential growth fairly interesting, and I'm not sure I'd seen that pointed out before.

      Weaknesses:

      I have several outstanding issues. First of all, the authors really do not engage with the literature regarding different notions of an effective population. Most strikingly, the authors don't talk about Cannings models at all, which are a broad class of models with non-Poisson offspring distributions that nonetheless converge to the standard Wright-Fisher diffusion under many circumstances, and to "jumpy" diffusions/coalescents otherwise (see e.g. Mohle 1998, Sagitov (2003), Der et al (2011), etc.). Moreover, there is extensive literature on effective population sizes in populations whose sizes vary with time, such as Sano et al (2004) and Sjodin et al (2005).

      Of course in many cases here the discussion is under neutrality, but it seems like the authors really need to engage with this literature more.

      The reviewer's summary and weakness statements reflect the general criticism summarized by the editors. Our reply and revisions addressing these criticisms are presented in the long reply to the eLife Assessment above.

      We hence re-emphasize only the key points here.

      (1) The literature that the reviewers fault us for not citing is about the modifications of the standard WF model. We now cite them as well as a few others in that vein. However, the WF-Haldane model we propose is conceptually very different from the modified WF models. This WFH model is in essence the Haldane model which may use the results of the WF models as the starting point to find the exact solutions.

      (2) The check of the power of the modified WF models is whether they can resolve the paradoxes. None of them can. The arguments apply to neutral cases as well as selection effects. Hence, our central point is that the modifications of the standard WF model [e.g., by incorporating V(K)] do not help the WF model in resolving the paradoxes.  Besides, the incorporation of V(K) is mathematically feasible but biologically untenable as presented in the new sections of Discussion.

      Nonetheless, I don't think the authors' modeling, simulations, or empirical data analysis are sufficient to justify their claims.

      The most interesting part of the manuscript, I think, is the discussion of the Density Dependent Haldane model (DDH). However, I feel like I did not fully understand some of the derivation presented in this section, …… - this is the whole notion of exchangeability, also neglected in this manuscript). As such, I don't believe that their analysis of the empirical data supports their claim. [Since the comments above are highly technical and fairly long, they are not copied verbatim.]

      We thank this reviewer for the detailed comments with respect to the potential confusion in the discussion of the Density Dependent Haldane (DDH) model.

      First, the reviewer appears to ask how Eqs (5-6) are derived. We should clarify that both Eq (5) and (6) are assumptions rather than derived results. Both equations are assumptions based on population ecology. Eq (7) is then derived by substituting the assumptions in Eq (5) and (6) into Eq (3).

      The definition in Equation (5) allows the growth rate of the population size to be dependent on N itself, such that growth rate E(K) (average offspring number per generation) is greater than 1 when N < Ck and less than 1 when N > Ck. The parameter z is introduced to adjust the sensitivity of E(K) to changes in population size (as shown in Fig. 3a).
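      The qualitative behaviour described for Eq. (5) can be illustrated with a minimal sketch. The functional form used here, E(K) = (Ck/N)^z, is a hypothetical stand-in chosen only because it satisfies the stated properties (E(K) > 1 when N < Ck, E(K) < 1 when N > Ck, with z tuning the sensitivity); it is not claimed to be the authors' Eq. (5), and the function and parameter names are illustrative.

```python
def expected_offspring(n, carrying_capacity, z):
    """Hypothetical density-dependent E(K): >1 below Ck, <1 above Ck.

    Assumed form (Ck / N) ** z; NOT taken from the manuscript.
    """
    return (carrying_capacity / n) ** z


def grow(n0, carrying_capacity, z, generations):
    """Deterministic trajectory N_{t+1} = N_t * E(K | N_t)."""
    sizes = [float(n0)]
    for _ in range(generations):
        n = sizes[-1]
        sizes.append(n * expected_offspring(n, carrying_capacity, z))
    return sizes


if __name__ == "__main__":
    # Larger z means stronger regulation: N approaches Ck in fewer generations.
    for z in (0.1, 0.5):
        trajectory = grow(n0=10, carrying_capacity=10_000, z=z, generations=60)
        print(z, [round(x) for x in trajectory[::10]])
```

      Under this assumed form, the population size is generated by the growth rule itself rather than supplied from outside, which is the feature the response attributes to the density-dependent Haldane model.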

      Second, we appreciate the comments regarding the use of individual-based simulations and the apparent lack of interaction between individuals. In our simulations, there is indeed an interaction among individuals, which is represented by Eq (5). This equation reflects how the competition between two alleles affects the expected growth rate E(K), which decreases as the population size increases. Furthermore, once E(K) for the entire population is determined, the offspring numbers of the alleles are independent.

      We believe that the primary purpose of our simulations was not clearly stated. This lack of clarity may be the root of the criticisms. We now note that the simulations are aimed at testing the accuracy of Equation (10).

      Note that Eq. (10) is a textbook result and quite important in our study. This equation shows that the strength of genetic drift, as given by Pf (the fixation probability of an advantageous mutation), is not a function of N at all. This approximate solution has been obtained using the WF model by Kimura.  The Haldane model solution that can explain Paradox 1 is based on Equation (7) as shown below

      Since the fixation probability of Equation (10) cannot be easily obtained using Eq. (7), we conducted simulations to confirm the accuracy of Eq. (10) when applied to the Haldane model.
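      The approximation Pf ≈ 2s/V(K) can also be checked numerically without stochastic simulation, under assumptions that are ours rather than the authors': a single mutant lineage reproduces as a simple (density-independent) branching process, and K follows a negative binomial distribution with mean 1 + s and variance V(K) > 1 + s. The survival probability is then one minus the smallest fixed point of the offspring probability generating function. The sketch below is illustrative only and is not the authors' simulation code.

```python
def nb_params(mean, var):
    """Convert (mean, variance) to negative binomial (r, p); needs var > mean."""
    if var <= mean:
        raise ValueError("negative binomial requires variance > mean")
    p = mean / var
    r = mean * mean / (var - mean)
    return r, p


def extinction_probability(mean, var, tol=1e-14, max_iter=1_000_000):
    """Smallest fixed point of the NB generating function G(q) = (p / (1 - (1-p) q)) ** r."""
    r, p = nb_params(mean, var)
    q = 0.0
    for _ in range(max_iter):
        q_next = (p / (1.0 - (1.0 - p) * q)) ** r
        if abs(q_next - q) < tol:
            return q_next
        q = q_next
    return q


def survival_probability(s, var):
    """Probability that a single mutant with E(K) = 1 + s avoids extinction."""
    return 1.0 - extinction_probability(1.0 + s, var)


if __name__ == "__main__":
    s = 0.02
    for var in (2.0, 5.0, 10.0):
        print(var, survival_probability(s, var), 2 * s / var)
```

      No population size enters the calculation, which matches the point made about Eq. (10): in this setting the survival probability depends on s and V(K) only.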

      We have revised the relevant sections of the manuscript to clarify these points and to better distinguish between assumptions and results. 

      Revision - Details of the DDH model are given in the Supplementary Information. A synopsis is given here: We consider a non-overlapping haploid population with two neutral alleles. The population size at time t is Nt. We assume that expected growth rate E(K) is greater than 1 when N < Ck and less than 1 when N > Ck, as defined by Eq. (5) below:

      The slope of E(K) vs. N (i.e., the sensitivity of the growth rate to changes in population size), as shown in Fig 3a, depends on z. To determine the variance V(K), we assume that K follows the negative binomial distribution whereby parents would suffer reproduction-arresting injury with a probability of pt at each birthing (Supplementary Information). Accordingly, V(K) can then be expressed as

      By Eq. (6), the ratio of V(K)/E(K) could be constant, decrease or increase with the increase of population size. With E(K) and V(K) defined, we could obtain the effective population size by substituting Eq. (5) and Eq. (6) into Eq. (3).

      Eq. (7) presents the relationship between effective population size (Ne) and the population size (N) as shown in Fig. 3. The density-dependent E(K) could regulate N with different strength (Fig. 3a). The steeper the slope in Fig. 3a, the stronger the regulation.

      Simulation of genetic drift in the Haldane model and the Wright-Fisher (WF) model. In both models, interactions between individuals are implicitly included through the dependency of the average number of offspring on population size, as defined by Eq. (5). This dependency leads to the logistic population growth, reflecting the density-dependent interactions.

      Thus, while I think there are some interesting ideas in this manuscript, I believe it has some fundamental issues:

      first, it fails to engage thoroughly with the literature on a very important topic that has been studied extensively. Second, I do not believe their simulations are appropriate to show what they want to show. And finally, I don't think their empirical analysis shows what they want to show.

      References omitted

      The comments are the summary of previous ones, which have been addressed in detail in the preceding sections.

      Reviewer #2 (Public Review):

      Summary:

      This theoretical paper examines genetic drift in scenarios deviating from the standard Wright-Fisher model. The authors discuss Haldane's branching process model, highlighting that the variance in reproductive success equates to genetic drift. By integrating the Wright-Fisher model with the Haldane model, the authors derive theoretical results that resolve paradoxes related to effective population size [Ne]

      Thanks.  The issue of Ne will be addressed below where the reviewer returns to this issue. The strength of the integrated WFH model is that N (or Ne) is generated by the model itself, rather than externally imposed as in WF models.

      Strengths:

      The most significant and compelling result from this paper is perhaps that the probability of fixing a new beneficial mutation is 2s/V(K). This is an intriguing and potentially generalizable discovery that could be applied to many different study systems.

      The authors also made a lot of effort to connect theory with various real-world examples, such as genetic diversity in sex chromosomes and reproductive variance across different species.

      Thanks. 

      Weaknesses:

      One way to define effective population size is by the inverse of the coalescent rate. This is where the geometric mean of Ne comes from. If Ne is defined this way, many of the paradoxes mentioned seem to resolve naturally. If we take this approach, one could easily show that a large N population can still have a low coalescent rate depending on the reproduction model. However, the authors did not discuss Ne in light of the coalescent theory. This is surprising given that Eldon and Wakeley's 2006 paper is cited in the introduction, and the multiple mergers coalescent was introduced to explain the discrepancy between census size and effective population size, superspreaders, and reproduction variance - that said, there is no explicit discussion or introduction of the multiple mergers coalescent.

      The Haldane model treats N’s very differently from the WF models. In the WF models, N’s are imposed externally (say, constant N, exponentially growing N, temporally fluctuating N’s and so on; all provided from outside of the model). Ne and coalescence are all derived from these given N’s. In order to account for the first paradox (see the next paragraph), N needs to be regulated, but the WF models cannot regulate N’s. The density-dependent Haldane model that Reviewer 1 inquired about above is a model that regulates N internally. It can thus account for the paradox.

      Paradox 1 - When the population size (N) is growing exponentially, such as in a bacterial culture, drift is nearly absent when N is small and is much stronger as N increases, especially when approaching the carrying capacity. Such a pattern is a common observation and is exactly the opposite of the WF model's central prediction. In short, a model that does not regulate N cannot explain the paradox.

      Ne is a fix of the WF model in order to account for the missing components of genetic drift. The paradoxes presented in this one and the companion study show that the fix is rather inadequate.  In contrast, by the WFH model, N is regulated within the model itself as E(K) and V(K) are both functions of N.

      The Wright-Fisher model is often treated as a special case of the Cannings 1974 model, which incorporates the variance in reproductive success. This model should be discussed. It is unclear to me whether the results here have to be explained by the newly introduced WFH model, or could have been explained by the existing Cannings model. The abstract makes it difficult to discern the main focus of the paper. It spends most of the space introducing "paradoxes".

      We greatly appreciate the illuminating advice. Nevertheless, we should explain, or should have explained, more clearly that the four paradoxes presented are central to this pair of eLife papers. The WF and Haldane models are conceptually very different. The choice between them should not be based on mathematical grounds but on how they help us understand biological evolution. We are using the four paradoxes to highlight the differences. We have said in the papers that the origin and evolution of COVID-19 caused a lot of confusion partly because the WF models cannot handle multi-copy gene systems, including viruses that evolve both within and between hosts.

      The standard Wright-Fisher model makes several assumptions, including hermaphroditism, non-overlapping generations, random mating, and no selection. It will be more helpful to clarify which assumptions are being violated in each tested scenario, as V(K) is often not the only assumption being violated. For example, the logistic growth model assumes no cell death at the exponential growth phase, so it also violates the assumption about non-overlapping generations.

      We appreciate the question, which has two aspects. First, why do we think the WF models are insufficient? After all, for each assumption of the WF model (as given in the reviewer’s examples), there is often a solution that relaxes the assumption by modifying Ne. In this sense, there is only one grand assumption made by the WF models. That is, however complex the biology is, it is possible to find an Ne that can make the WF model work. Our argument is that Ne is a cumbersome fix of the WF model and it does not work in many situations. That is why we emphasized the importance of the paradoxes above. We shall again use the first paradox as an example: because drift is stronger as N becomes larger, the fix has to make Ne negatively correlated with N. In reality, it does not appear possible to resolve this paradox in this way. Another paradox is the evolution of multi-copy gene systems. In short, it seems clear that Ne is not a useful or usable fix.

      The second aspect is: “why, among the many modifications the WF models make, do we only emphasize the inclusion of V(K)?” This is the essence of our two papers. Although incorporating V(K) is a modification of the WF models, it does not enable the WF models to resolve the paradoxes. In contrast, the Haldane model incorporates E(K) and V(K) from the outset. In presenting paradox 3, it was stated that

      This equation shows that the strength of genetic drift, as given by Pf (the fixation probability of an advantageous mutation), is not a function of N at all. It supports the view that the essence of genetic drift is V(K) with N as a scaling factor. Note that, if V(K) = 0, there is no genetic drift regardless of N. As V(K) is not an add-on to the Haldane model (unlike in WF models), the Haldane model can resolve the paradoxes.

      The theory and data regarding sex chromosomes do not align. The fact that \hat{alpha'} can be negative does not make sense. The authors claim that a negative \hat{alpha'} is equivalent to infinity, but why is that? It is also unclear how theta is defined. It seems to me that one should take the first principle approach e.g., define theta as pairwise genetic diversity, and start with deriving the expected pair-wise coalescence time under the MMC model, rather than starting with assuming theta = 4Neu. Overall, the theory in this section is not well supported by the data, and the explanation is insufficient.

      α' can be negative for the same reason that α (the male/female ratio in mutation rate) can be negative (Miyata, et al. 1987; Li, et al. 2002; Makova and Li 2002). Clearly, this has not been a problem in the large literature in which α becomes negative. In fact, in many reports, the estimate of α is negative, which is read as α approaching infinity. Imagine that our equation is α'^2 = 0.25; then α' can be 0.5 or -0.5, although the latter solution is not biologically meaningful.
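      As a worked illustration of why a negative estimate is read as infinity, consider the classical mutation-rate case only. The expression below is the standard Y-to-autosome relationship from Miyata et al. (1987); the authors' own expression for α' is not reproduced in this response, so this is an analogy under that assumption rather than their derivation.

```latex
% R = observed Y/autosome substitution-rate ratio, \alpha = male/female mutation-rate ratio
R = \frac{2\alpha}{1+\alpha}
\quad\Longrightarrow\quad
\hat{\alpha} = \frac{R}{2-R},
\qquad
\lim_{\alpha\to\infty} R = 2 .
% Example: sampling noise gives R = 2.2 > 2, so
\hat{\alpha} = \frac{2.2}{2-2.2} = -11 .
```

      Because R cannot exceed 2 for any finite positive α, an observed R slightly above 2 lies just past the α → ∞ boundary, and the resulting negative estimate is conventionally interpreted as a very large α.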

      As for theta, the reviewer asked why we do not use the pairwise genetic diversity, theta(pi), as the first-principle approach to estimating theta. While theta(pi) was the first estimator of theta used, the general principle is that every bin of the frequency spectrum can be used for estimating theta, since the expected count in each bin is theta/i, where i is the number of occurrences of the mutation in the sample. (If the sample size is 100, then i is between 1 and 99.) Hence, the issue is which part of the spectrum has the best statistical properties for the question at hand. While theta(pi) and theta(w) are most commonly used, there are in fact numerous ways to estimate theta. (Fu (2022) presents an excellent review.) For our purpose, we need a theta estimate least affected by selection, and we choose the lowest frequency bin of the spectrum, which is theta(1) based on the singletons. Theta(1), least affected by selection, is the basis of the Fu and Li test.
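      The statement that different bins of the frequency spectrum yield different estimators of the same theta can be made concrete with a short sketch. It assumes an unfolded site frequency spectrum stored as a list whose entry i-1 is the number of sites at which the derived allele occurs i times in a sample of n sequences; the function names are illustrative and do not come from the manuscript.

```python
def watterson_constant(n):
    """a_n = sum_{i=1}^{n-1} 1/i for a sample of n sequences."""
    return sum(1.0 / i for i in range(1, n))


def theta_singleton(sfs):
    """theta(1): the number of derived singletons (expected value theta / 1)."""
    return float(sfs[0])


def theta_watterson(sfs):
    """theta(w): total number of segregating sites divided by a_n."""
    n = len(sfs) + 1
    return sum(sfs) / watterson_constant(n)


def theta_pi(sfs):
    """theta(pi): mean number of pairwise differences."""
    n = len(sfs) + 1
    pairs = n * (n - 1) / 2.0
    weighted = sum(i * (n - i) * count for i, count in enumerate(sfs, start=1))
    return weighted / pairs


if __name__ == "__main__":
    n, theta = 100, 5.0
    neutral_sfs = [theta / i for i in range(1, n)]  # expected neutral spectrum
    print(theta_singleton(neutral_sfs), theta_watterson(neutral_sfs), theta_pi(neutral_sfs))
```

      On the expected neutral spectrum all three estimators recover the same theta; they diverge only when the observed spectrum departs from theta/i, which is why the choice of bin matters for the question at hand.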

      Reviewer #3 (Public Review):

      Summary:

      Ruan and colleagues consider a branching process model (in their terminology the "Haldane model") and the most basic Wright-Fisher model. They convincingly show that offspring distributions are usually non-Poissonian (as opposed to what's assumed in the Wright-Fisher model), and can depend on short-term ecological dynamics (e.g., variance in offspring number may be smaller during exponential growth). The authors discuss branching processes and the Wright-Fisher model in the context of 3 "paradoxes": (1) how Ne depends on N might depend on population dynamics; (2) how Ne is different on the X chromosome, the Y chromosome, and the autosomes, and these differences do not match the expectations based on simple counts of the number of chromosomes in the populations; (3) how genetic drift interacts with selection. The authors provide some theoretical explanations for the role of variance in the offspring distribution in each of these three paradoxes. They also perform some experiments to directly measure the variance in offspring number, as well as perform some analyses of published data.

      Strengths:

      (1) The theoretical results are well-described and easy to follow.

      (2) The analyses of different variances in offspring number (both experimentally and analyzing public data) are convincing that non-Poissonian offspring distributions are the norm.

      (3) The point that this variance can change as the population size (or population dynamics) change is also very interesting and important to keep in mind.

      (4) I enjoyed the Density-Dependent Haldane model. It was a nice example of the decoupling of census size and effective size.

      Thanks.

      Weaknesses:

      (1) I am not convinced that these types of effects cannot just be absorbed into some time-varying Ne and still be well-modeled by the Wright-Fisher process.

      Please allow us to refer, again, to two of the four paradoxes. We believe that no modification of the WF model can resolve the paradoxes.

      (1) When the population size (N) is growing exponentially, such as in a bacterial culture, drift is nearly absent when N is small and is much stronger as N increases, especially when approaching the carrying capacity. Such common observations are exactly the opposite of the WF model's key prediction. It is not possible for a model that does not regulate N to explain the paradox.

      (2) There is no way the WF models can formulate Ne for, say viruses or ribosomal RNA genes that have two levels of populations – the within-host populations as well as the host population itself.

      The fact that there are numerous Ne's suggests that Ne is a collection of cumbersome fixes of the WF model. By the WF-Haldane model, all factors are absorbed into V(K) resulting in a simpler model in the end. V(K) is often a measurable quantity. Note that, even if V(K) is incorporated into the WF model, the paradoxes remain unresolvable.

      (2) Along these lines, there is well-established literature showing that a broad class of processes (a large subset of Cannings' Exchangeable Models) converge to the Wright-Fisher diffusion, even those with non-Poissonian offspring distributions (e.g., Mohle and Sagitov 2001). E.g., equation (4) in Mohle and Sagitov 2001 shows that in such cases the "coalescent Ne" should be (N-1) / Var(K), essentially matching equation (3) in the present paper.

      The criticism of lack of engagement with well-established literature has been responded to extensively above. Briefly, the literature is about modifications of the WF model which share the same feature of population sampling. With that feature, the paradoxes are unresolvable. For example, however Ne is defined, the fixation probability of an advantageous mutation does not depend on N or Ne. This is the third paradox of the WF models.

      (3) Beyond this, I would imagine that branching processes with heavy-tailed offspring distributions could result in deviations that are not well captured by the authors' WFH model. In this case, the processes are known to converge (backward-in-time) to Lambda or Xi coalescents (e.g., Eldon and Wakely 2006 or again in Mohle and Sagitov 2001 and subsequent papers), which have well-defined forward-in-time processes.

      We admire the learned understanding of the literature expressed by the reviewer, who raises two points. First, our model may not be able to handle the heavy-tailed progeny distribution (i.e., the kurtosis of the distribution of k). Second, the Xi coalescence models (cited above) can do that. Below are our clarifications.

      First, the WFH model is based on the general distribution of K, which includes flexible and realistic representations of offspring number distributions. In fact, we have used various forms of K distribution in our publications on the evolution of SARS-CoV-2 (see the Ruan et al publications in the bibliography). Power-law distribution is particularly useful as the K-distribution in viral transmission is highly kurtotic. This is reflected in the super-spreader hypothesis. In short, the branching process on which the WFH model is based is mainly about the distribution of K. Nevertheless, the variance V(K) can often yield good approximations when the kurtosis is modest.

      Second, we would like to comment on the models of Eldon and Wakely 2006 or Mohle and Sagitov 2001 and subsequent papers. These papers are based on the Moran model by considering a highly skewed distribution of offspring numbers. Fundamentally, the Moran models generally behave like WF models (standard or modified) and hence have the same problems with the paradoxes that are central to our studies. In fact, the reservations about introducing V(K) into the WF models apply as well to the Moran models. The introduction of V(K) is mathematically valid but biologically untenable. Essentially, the WF models incorporate the Haldane model as a first step in the generation transition. The introduction of V(K) into the Moran model is even less biologically sensible. Furthermore, the model allows K to take only three discrete values: 0, 2, and Nψ (see Eq. (7) in Eldon and Wakely). Their model also assumes a constant population size, which contrasts with our model's flexibility in handling varying population sizes and more complex distributions for K.

      In short, the modifications of the WF (and Moran) models are unnecessarily complicated and biologically untenable, yet still fail to account for the paradoxes. The WFH model can rectify these problems.

      (4) These results that Ne in the Wright-Fisher process might not be related to N in any straightforward (or even one-to-one) way are well-known (e.g., Neher and Hallatschek 2012; Spence, Kamm, and Song 2016; Matuszewski, Hildebrandt, Achaz, and Jensen 2018; Rice, Novembre, and Desai 2018; the work of Lounès Chikhi on how Ne can be affected by population structure; etc...)

      The reviewer is correct in pointing out the inexact correlation between N and Ne. Nevertheless, it should still be true that the WF models predict qualitatively weaker drift as N increases. The first paradox is as stated:

      When the population size (N) is growing exponentially, such as in a bacterial culture, drift is nearly absent when N is small and is much stronger as N increases, especially when approaching the carrying capacity. Such common observations are exactly the opposite of the WF model's key prediction.

      (5) I was also missing some discussion of the relationship between the branching process and the Wright-Fisher model (or more generally Cannings' Exchangeable Models) when conditioning on the total population size. In particular, if the offspring distribution is Poisson, then conditioned on the total population size, the branching process is identical to the Wright-Fisher model.

      We thank the reviewer for this important comment. The main difference is that N is imposed from outside the WF models but can be generated from within the Haldane model (see the density-dependent Haldane model). In nature, N of the next generation is the sum of K’s among members of the population. This is how the Haldane model determines N(t+1) from N(t). In the WF models, N is imposed from outside the model and hence the given N determines the distribution of K. For this reason, N regulation is not possible in the WF models, thus resulting in the paradoxes.
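
      The contrast in the direction of causation can be made concrete with a minimal sketch. It is not the density-dependent Haldane model of the manuscript; the function names and the user-supplied offspring sampler `draw_k` are hypothetical and serve only to show where N(t+1) comes from in each scheme.

```python
import random


def haldane_generation(n_parents, draw_k, rng):
    """Haldane-type step: each parent draws its own offspring number K,
    and N(t+1) emerges as the sum of the K's."""
    ks = [draw_k(rng) for _ in range(n_parents)]
    return sum(ks), ks


def wf_generation(n_parents, n_next, rng):
    """Wright-Fisher step: N(t+1) is imposed from outside; each offspring
    picks a parent uniformly at random, which in turn fixes the K's."""
    ks = [0] * n_parents
    for _ in range(n_next):
        ks[rng.randrange(n_parents)] += 1
    return n_next, ks


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical offspring sampler: 0, 1, 2 or 3 offspring, equally likely.
    n1, _ = haldane_generation(100, lambda r: r.choice([0, 1, 2, 3]), rng)
    n2, _ = wf_generation(100, 100, rng)
    print(n1, n2)
```

      In the first function the next population size is an output of reproduction; in the second it is an input that constrains reproduction.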

      (6) In the discussion, it is claimed that the last glacial maximum could have caused the bottleneck observed in human populations currently residing outside of Africa. Compelling evidence has been amassed that this bottleneck is due to serial founder events associated with the out-of-Africa migration (see e.g., Henn, Cavalli-Sforza, and Feldman 2012 for an older review - subsequent work has only strengthened this view). For me, a more compelling example of changes in carrying capacity would be the advent of agriculture ~11kya and other more recent technological advances.

      We thank the reviewer and have used this more convincing case as suggested by the reviewer.

      Recommendations for the authors:

      General replies - We thank the editors and reviewers again. The points below are re-iterations of the comments received above, which have since been replied to in detail. Specific issues of wording and notation have also been rectified. Again, we are grateful for the inputs from which we learned a great deal.

      Reviewing Editor Comments:

      The reviewers recognize the value of this model and some of the findings, particularly results from the density-dependent Haldane model. However, they expressed considerable concerns with the model and overall framing of this manuscript.

      First, all reviewers pointed out that the manuscript does not sufficiently engage with the extensive literature on various models of effective population size and genetic drift, notably lacking discussion on Cannings models and related works.

      We have addressed this issue in the beginning of Introduction and Discussion, pointing to the long section in the new second half of Discussion. The essence is that the literature is all about the modified WF models. The WF-Haldane model is conceptually and operationally distinct from the WF models, either standard or modified ones.

      Second, there is a disproportionate discussion on the paradoxes, yet some of the paradoxes might already be resolved within current theoretical frameworks. All three reviewers found the modeling and simulation of the yeast growth experiment hard to follow or lacking justification for certain choices. The analysis approach of sex chromosomes is also questioned.

      This criticism is addressed together with the next one as they make the same point.

      The reviewers recommend a more thorough review of relevant prior literature to better contextualize their findings. The authors need to clarify and/or modify their derivations and simulations of the yeast growth experiment to address the identified caveats and ensure robustness. Additionally, the empirical analysis of the sex chromosome should be revisited, considering alternative scenarios rather than relying solely on the MSE, which only provides a superficial solution. Furthermore, the manuscript's overall framing should be adjusted to emphasize the conclusions drawn from the WFH model, rather than focusing on the "unresolved paradoxes", as some of these may be more readily explained by existing frameworks. Please see the reviewers' overall assessment and specific comments.

      Many thanks. We have carefully reframed and presented the WF-Haldane model to make it clear and logically consistent. Whether a new model (i.e., the WF-Haldane model) deserves to be introduced depends on whether it makes any contribution to understanding nature. That is why we emphasize the four paradoxes.

      A most important disagreement between the reviewers and the authors is about the nature of the paradoxes. While the reviewers suggest that they "may" be resolvable by the conventional WF model (standard or modified), they did not offer the possible resolutions.  To use the analogy in our provisional response: the WF vs. Haldane models are compared to gas cars vs electric vehicles.  We can say confidently that the internal combustion engine cannot resolve the conflicting demands of transportation and zero emission. Its design has limited its capability. 

      Reviewer #2 (Recommendations For The Authors):

      Many thanks.  We have incorporated all these suggestions.  When the incorporation is not straightforward, we have carefully revised the text to minimize mis-communications.

      In the introduction -- "Genetic drift is simply V(K)" -- this is a very strong statement. You can say it is inversely proportional to V(K), but drift is often defined based on changes in allele frequency.

      We change the word “simply” to “essentially”. This wording is supported by the fixation probability of advantageous mutations, 2s/V(K). We have shown in the text that N does not matter here because the fixation is nearly deterministic when the copy number reaches, say, 100, regardless of whether N is 10^4 or 10^8.
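
      A rough numerical sketch of this argument follows. It assumes the standard small-s branching-process approximation in which a single copy escapes loss with probability about 2s/V(K), and it treats copies as independent lineages; both function names are hypothetical, and the numbers are illustrative rather than taken from the manuscript.

```python
def establishment_probability(s, v_k):
    """Approximate probability that one new advantageous copy escapes
    stochastic loss: 2s / V(K). Assumes small s (an approximation)."""
    return min(1.0, 2.0 * s / v_k)


def loss_probability(copies, s, v_k):
    """Probability that all `copies` independent lineages are lost.
    Note that the population size N does not enter the calculation."""
    return (1.0 - establishment_probability(s, v_k)) ** copies


if __name__ == "__main__":
    for copies in (1, 10, 100):
        print(copies, loss_probability(copies, s=0.05, v_k=1.0))
```

      With s = 0.05 and V(K) = 1, the chance of losing all 100 copies is on the order of 10^-5, whatever the value of N.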

      Page 3 line 86. "sexes is a sufficient explanation."--> "sex could be a sufficient explanation"

      The strongest line of new results is about 2s/V(K). Perhaps, the paper could put more emphasis on this part and demonstrate the generality of this result with a different example.

      The math notations in the supplement are not intuitive. e.g., using i_k and j_k as probabilities. I also recommend using E[X] and V[X] for expectation and variance rather than \italic{E(X)} to improve the readability of many equations.

      Thank you for your careful reading. Regarding the use of i_k and j_k as probabilities, we initially considered using 𝑝 or 𝑞 to represent probabilities. However, since 𝑝 and 𝑞 are already used in the main text, we opted for 𝑖 and 𝑗 to avoid potential confusion. As for your recommendation to use E[X] and V[X] for expectation and variance, we would like to clarify that we follow the standard practice of italicizing these symbols to represent variables.

      Eq. A6, A7: while I manage to follow, P_{10}(t) and P_{10} are not defined anywhere in the text.

      Supplement page 7, the term "probability of fixation" is confusing in a branching model.

      Thank you for your observation. We have carefully revised the supplement to provide clarity on these points.

      Revision - In population genetics, the fixation of the M allele means that the population consists entirely of the M allele, with no W alleles remaining. We define the fixation probability of the M allele by generation t as follows:

      Given that the M and W alleles reproduce independently, this can be factored as:

      As t approaches infinity, the ultimate fixation probability of the M allele can be derived as follows:

      Eq. A28. It is unclear whether Eq. A1 could be used here directly. Some justification would be nice.

      We appreciate your careful review, and we will ensure this connection between the two equations is made clearer in the supplement. 

      Revision - We would like to clarify that Eq. (A1) and Eq. (A28) are essentially the same, with the only difference being the subscript 𝑡, which indicates the time dependence in the dynamic process.

      Supplement page 17. "the biological meaning of negative..". There is no clear justification for this claim. As a reader, I don't have any intuition as to why that is the case.

      Thank you for raising this concern. We have addressed this issue earlier.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This study has uncovered some important initial findings about cellular responses to aneuploidy through analysis of gene expression in a set of donated human embryos. While the study's findings are in general solid, some experiments lack statistical power due to small sample sizes. The authors should try to get much more insight with their data highlighting the novel findings.

      We thank the editor for considering our manuscript for publication at eLife, and for the helpful and thorough reviews of our work. Based on the suggestions of the reviewers, we have carried out additional experiments, expanded the sample size and reanalyzed the data. This has resulted in a thoroughly revised manuscript and much improved work, which we are convinced meets the requirements to be published as a version of record. Of note, the experiments for the revision required the support of 2 additional researchers from our lab, who are now coauthors.

      These are the main changes made to the initial manuscript:

      (1) The RNA-seq data (Figures 1+2) is now FDR-corrected and has been reanalyzed. This has not affected the initial observations on the activation of p53 and apoptosis in aneuploid human embryos, or the finding that the transcriptomic changes are driven by gene dosage effects.

      (2) We have included the transcriptome analysis of reversine-treated embryos in the supplementary data.

      (3) For validation of novel findings such as the presence of DNA-damage and the expression of DRAM1 in aneuploid embryos, we now include the stainings of 30 human blastocysts (Figure 3o-t). We found absence of DNA-damage in aneuploid embryos and that DRAM1 is increased in the TE but not the ICM of aneuploid embryos. 

      (4) We re-analyzed the co-expression of CASP8/HSP70 in reversine-embryos as suggested by reviewer 1 and found that both proteins tend to be co-expressed. 

      (5) We have added a new analysis of NANOG expression (Figure 4a,b) of the embryos used in Figure 3o-t and have found retention of NANOG protein in both the TE and ICM.

      (6) We have added 6 euploid and 4 aneuploid embryos to Figure 4l-s, which support the conclusions on the absence of autophagy activation in the ICM and failure of PrE formation in aneuploid embryos.

      (7) We have significantly changed the layout of the figures, revised the supplementary tables, added source data files and rewritten the discussion.

      Regarding the sample size of the study, it is important to emphasize that human embryos are ethically sensitive material and that those with the specific genetic content we used in this study are rare, limiting our ability to expand the sample size. For the revision, we have added 40 human blastocysts to our initial 85 embryos. Compared to similar and high-quality studies using human embryos, our study shows a relatively large sample size (n=125): Victor et al. 2021: 30 human blastocysts for immunostainings [1]; Martin et al. 2023: 14 human blastocysts [2]; Martin et al. 2024: 64 human blastocysts [3]; Domingo-Muelas et al. 2023: 23 human blastocysts [4].

      Public Reviews:

      Reviewer #1 (Public Review):

      This study investigated an important question in human reproduction: why most fully aneuploid embryos are incompatible with normal fetal development. Specifically, the authors investigated the cellular responses to aneuploidy through analysis of gene expression in a set of donated human blastocysts. The samples included uniform aneuploid embryos of meiotic origin and mosaic aneuploid embryos from the SAC inhibitor reversine treatment. The authors relied mainly on low-input RNA sequencing and immunofluorescence staining. Pathway analysis with RNA-seq data of trophectoderm cells suggested activation of p53 and possibly apoptosis, and this cellular signature appeared to be stronger in TE cells with a higher degree of aneuploidy. Immunostaining also found some evidence of apoptosis, increased expression of HSP70 and autophagy in some aneuploid cells. With combinational OCT4 and GATA4 as lineage markers, it appeared that aneuploidy could alter the second lineage segregation and primitive endoderm formation in particular.

      Although this study is largely descriptive, it generated valuable RNA-seq data from a set of aneuploid TE cells with known karyotypes. Immunostaining results in general were consistent with findings in mouse embryos and human gastruloids.

      We thank the reviewer for the thorough evaluation of our manuscript. We have implemented most of the suggestions, which have further strengthened the original findings.

      While there is a scarcity of human embryo materials for research, the lack of single cell level data limits further extension of the presented data on the consequences of mosaic embryos.  

      We did not include single cell RNA-seq data of mosaic human embryos in our study because we focused on embryos diagnosed with complex meiotic abnormalities. Our hypothesis was that the cellular consequences of aneuploidy would be strongest and most readily identifiable in this type of aneuploidy, which would allow us to provide a basis for the mechanisms of elimination of aneuploid cells in human embryos. In the manuscript (lines 596-626) we acknowledge the limitations of the extrapolation of our results to mosaic embryos.

      A major concern is that the gene list used for pathway analysis is not FDR controlled. It is also unclear how the many plots generated with the "supervised approach" were actually performed. 

      We agree with the concern that our list of differentially expressed genes was ranked by p-value rather than FDR-controlled. We followed the suggestion of the reviewer, revised the RNA-seq analysis and focused primarily on pathway analysis. We have also added the comparison between aneuploid and reversine-treated embryos to the supplementary data and expanded the analysis of high dosage and low dosage embryos. Importantly, the new analysis has not changed the original finding that aneuploid embryos show hallmarks of p53 activation and apoptosis, and that these effects are gene dosage dependent. The manuscript now includes completely revised Figures 1 and 2.

      Since we discarded the data generated from our previous approach, we do not use the term supervised approach anymore.

      The authors also appear to have ignored the possibility that the high-dosage group could have a higher mitotic defect.

      This is indeed a possibility. In the discussion (lines 504-508) we have now incorporated the notion that the high dosage embryos could have higher mitotic defects, although our data cannot provide any evidence for this. Of note, the gene expression data shows that all aneuploid embryos (including low dosage and reversine embryos) equally show an enrichment for mitotic spindle pathway genes.

      Assuming a fully aneuploid embryo, why do only some cells display p53 and autophagy markers?

      This is a very good question, on which we can only speculate, but the answer likely lies in the diversity across cells of the same embryo.

      Even in genetically homogenous tissues and cell cultures, individual cells can exhibit different levels of stress responses, such as p53 activation and apoptosis. This variation may be influenced by the local cellular environment, stochastic gene expression, or differences in cell cycle stages. Other studies on fully aneuploid human embryos could also not detect apoptotic responses in every cell [1,3].

      For instance, p53 activation differs even between cells that have a similar number of DNA breaks, and this activation is influenced by both cell-intrinsic factors and previous exposure to DNA damage [5].

      The cell cycle tightly regulates the response of cells to different stressors. For instance, cells in G1 or S-phase might be more sensitive to apoptosis signals [6], while those in G2/M might escape this response temporarily [7]. Autophagy is more induced in G1 and S phases, with reduced activity in G2 and M phases [8].

      Individual cells may also have different levels of success in the activation of the compensatory pathways, including the unfolded protein response, autophagy, or changes in metabolism, resulting in some cells adapting better than others.

      The expression of p53 and the sensitivity to apoptosis could also be influenced by epigenetic differences between cells, which may alter their transcriptional response to aneuploidy. Even in a genetically identical population, cells can have different epigenetic landscapes, leading to heterogeneous gene expression patterns.

      The conclusion about proteotoxic stress was largely based on staining of HSP70. It appears from Figure 3 d,h that the same cells exhibited increased HSP70 and CASP8 staining. Since HSP70 is known to have anti-apoptotic effect, could the increased expression of Hsp70 be an anti-apoptotic response?

      Our conclusion about proteotoxic stress was not solely based on HSP70 expression. We also stained for LC3B and p62, which are markers for autophagy and when highly expressed indirectly point towards underlying proteotoxic stress in the cells. 

      We reanalyzed the imaging of the stainings in the reversine-treated embryos, and found that most positive cells were positive for both HSP70 and CASP8, while a minority were single positive (shown now in Figure 3k,l).

      HSP70 indeed not only refolds misfolded and aggregated proteins but also has a function in cell survival and apoptosis [9]. HSP70 has, for instance, been found to inhibit the cleavage of Bid by active CASP8 within the extrinsic apoptosis pathway [10]. It is thus possible that it temporarily plays this role, and we have acknowledged this in the discussion (lines 623-626). On the other hand, the evidence points to active apoptosis in the TE, with concomitant cell loss, so if HSP70 is indeed having an anti-apoptotic effect, it is having a limited impact.

      Reviewer #2 (Public Review): 

      A high fraction of cells in early embryos carry aneuploid karyotypes, yet even chromosomally mosaic human blastocysts can implant and lead to healthy newborns with diploid karyotypes. Previous studies in other models have shown that genotoxic and proteotoxic stresses arising from aneuploidy lead to the activation of the p53 pathway and autophagy, which helps eliminate cells with aberrant karyotypes. These observations have been here evaluated and confirmed in human blastocysts. The study also demonstrates that the second lineage and formation of primitive endoderm are particularly impaired by aneuploidy.

      This is a timely and potentially important study. Aneuploidy is common in early embryos and has a negative impact on their development, but the reasons behind this are poorly understood. Furthermore, how mosaic aneuploid embryos with a fraction of euploidy greater than 50 % can undergo healthy development remains a mystery. Most of our current information comes from studies on murine embryos, making a substantial study on human embryos of great importance. However, there are only very few new findings or insights provided by this study. Some of the previous findings were reproduced, but it is difficult to say whether this is a real finding, or whether it is a consequence of a low sample number. The authors could get much more insight with their data.

      We thank the reviewer for the thorough evaluation of our manuscript and the valuable suggestions made in the private recommendations. We have expanded the sample size and have carried out additional experiments that have significantly improved the manuscript.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) Instead of using a cut-off to generate a list, the authors could just rank the entire detected transcriptome for GSEA. This method fits better the authors' intentions of "primarily focused on pathway analysis." The cut-off value "-log10(p-value)<0.05" is not correct. As we can see from the PCA plot, one would not expect many cut-off-defined DEGs at all. The most obvious transcriptome change is dosage dependent, as the authors clearly showed with InferCNV.

      We thank the reviewer for this suggestion and agree that this was an important concern of the study. We have entirely revised the RNA-seq analysis based on the proposed approach (Figure 1 and 2, Supplementary Figure 1). Also, we have included the analysis of aneuploid versus reversine treated embryos, which has allowed us to determine the differences between naturally occurring chromosomal abnormalities and those that are induced using reversine (Supplementary Figure 1). 

      We first performed differential gene expression analysis using DESeq2 with a cut-off value for significantly differentially expressed genes of |log2FC| > 1 and an FDR < 0.05. Based on the PCAs and the low number of differentially expressed genes for all comparisons, except high dosage versus euploid embryos, we focused primarily on pathway analysis.
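
      The logic of such a double threshold can be sketched as follows. This is not DESeq2 and not the pipeline used in the manuscript; it is a plain Benjamini-Hochberg adjustment followed by the two cut-offs, with hypothetical function names and made-up input values, shown only to make the filtering step explicit.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def significant_genes(genes, log2fc, pvalues, lfc_cut=1.0, fdr_cut=0.05):
    """Keep genes with |log2FC| > lfc_cut and adjusted p-value < fdr_cut."""
    padj = benjamini_hochberg(pvalues)
    return [g for g, lfc, q in zip(genes, log2fc, padj)
            if abs(lfc) > lfc_cut and q < fdr_cut]


if __name__ == "__main__":
    # Illustrative values only; gene names are placeholders.
    genes = ["geneA", "geneB", "geneC", "geneD"]
    print(significant_genes(genes, [2.0, -1.5, 0.2, 3.0], [0.001, 0.04, 0.03, 0.5]))
```

      A gene with a raw p-value below 0.05 can fail the adjusted threshold, which is the difference between a p-value-ranked list and an FDR-controlled one.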

      For that, based on the reviewer’s suggestion, we generated a ranked gene list using the GSEA software (version 4.2.2, MSigDatabase) based on the normalized count matrix of the whole transcriptome that was detected after differential gene expression. The ranked gene list was then subjected to the run GSEA function, and we searched the Hallmark and C2 library for significantly enriched pathways. Thus, we could generate normalized enrichment scores, allowing us to predict whether a pathway is activated or suppressed. The details of the new analysis are described in the Materials and Methods section (lines 220-232). Significance was determined using a cut-off value of 25% FDR. This cut-off is proposed in the GSEA user guide (https://www.gsea-msigdb.org/gsea/doc/GSEAUserGuideTEXT.htm) especially for incoherent gene expression datasets, as suggested by our PCAs, which allows for hypothesis-driven validation of the dataset.

      Indeed, we found that the most important transcriptome changes are aneuploidy dosage dependent. High dosage embryos show signatures of cellular unfitness, while low-dosage embryos still seem to activate survival pathways (lines 349-364). 

      This new analysis not only increased the robustness of our results but also introduced novel findings, which pave the way for future studies.

      The validity of our findings is supported by recent work by the Zernicka-Goetz lab. We found that hypoxia is upregulated in low dosage human aneuploid TE cells. In line with our data, the Zernicka-Goetz lab found in a mouse model of low degree chromosomal abnormalities that hypoxia inducible factor 1A (HIF1A) promotes survival of extraembryonic aneuploid cells by reducing levels of DNA damage [11].

      (2) It would be very helpful if the authors could perform co-staining of multiple stress markers to better understand the origins of apoptosis and autophagy cells. In Fig 3d and 3h, it seems that the same reversine treated embryo was stained with CASP8, LC3B and HSP70. Is there any correlation between CASP8 and HSP70 at the single cell level? Is there any correlation between p53 and LC3B as the authors suggested, possibly through DRAM1?

      We decided to use the complex aneuploid embryos that were left at our facility for the validation of novel findings such as upregulation of DRAM1 and presence and consequences of DNA damage in aneuploid embryos. As suggested by the editor and the other reviewer, we also added embryos to existing datasets to increase the sample size where necessary. Therefore, we did not include other co-stainings of multiple stress markers.

      Following the reviewer’s suggestion, we reanalyzed the existing stainings and evaluated whether there is a correlation between CASP8 and HSP70 at the single cell level. The reversine-treated embryos were the only embryo group that was co-stained for both CASP8 and HSP70. We quantified the percentage of cells that were single or double positive for CASP8 and HSP70 and found a higher proportion of double-positive cells than of single-positive cells. Therefore, we concluded that there is indeed a correlation between both proteins at the single cell level in reversine-treated embryos and included this data in Figure 3k,l.

      During the experiments for the revision, we found that the DRAM1 protein was upregulated in the cytoplasm of TE cells but not in the ICM of aneuploid embryos (Figure 3s,t), which validates the findings of the gene expression analysis. This data also supports our findings that autophagy is active in aneuploid TE cells while not significantly increased in aneuploid pluripotent ICM cells. Unfortunately, we could not stain LC3B and DRAM1 in the same embryo because the antibodies were raised in the same species.

      (3) While " the possibilities for functional studies and lineage tracing experiments in human embryos are very limited," the authors can leverage in silico modelling (ie, PMID: 28700688) to address the roles of aneuploidy in blastocyst formation and development. Is there any self-regulating mechanism underlying the ratios of PrE and EPI? Is apoptosis of ICM cells a natural process during PrE formation (PMID: 18725515)?

      It is a very interesting proposal to use in silico modelling to address the roles of aneuploidy during human blastocyst formation and lineage segregation. Although this type of analysis would yield very important insights, we are not able to address this point in the revision because our group lacks expertise in this type of analysis, and it would require setting up a collaboration with experts in this field. In the discussion we proposed that future studies can leverage our data to carry out in silico modelling and cited the proposed article (lines 608-610).

      On the second part of the question, we would like to discuss the differences between mouse and human embryo studies. Parts of this were included in the discussion on the possible mechanisms of PrE elimination. 

      Is there a self-regulating mechanism for EPI/PrE formation?

      To extrapolate the knowledge on mouse development to human, it is important to bear in mind that (1) human embryos are outbred, as compared to inbred super-fertile laboratory mouse strains and (2) the embryos are donated to research by subfertile couples, which could compromise the EPI/PrE ratios. For instance, Chousal and colleagues found that poor quality blastocysts have a reduced number of PrE cells12. In human embryos the proportion of EPI and PrE cells is indeed highly variable (20%-60%) and while the number of EPI cells does not increase between dpf6 and 7, the number of PrE cells does grow13. We found a similarly variable number of EPI and PrE cells in our study on the lineage segregation mechanisms in good quality human embryos, with an absolute number of 12.1±6.5 EPI cells and 8.4±3.44 PrE cells14.

      By comparison, in late mouse blastocysts, the ratio of EPI to PrE cells is consistent (2/3)15. Overall, self-regulating mechanisms in the human embryo have not yet been studied in detail owing to the limited possibilities for functional testing.

      Is apoptosis a natural process during PrE formation?

      Yes, in mice apoptosis is a natural process during PrE formation to eliminate misallocated cells of the inner cell mass through cell competition16,17. Yet, in the human embryo there is no evidence of such mechanisms. Although apoptosis is present even in human blastocysts of good quality18, the origin of such apoptotic cells has not yet been shown, although suboptimal culture conditions are known to increase cellular fragmentation19. Conversely, our data and those of others1,2 support the notion that the pluripotent inner cell mass in human embryos is more resistant to apoptosis than the trophectoderm, even in karyotypically aberrant cells.

      (4) The "count tables generated from the raw data files" could not be found in the source data files.

      This slipped our attention; we have now added the count tables to the source data files. Our apologies.

      (5) Citations on aneuploidy literature were not done in a fully scholarly manner. It appears that authors selectively cite previous papers that are in support of their hypothesis but left out those with alternative conclusions.

      We apologize if we missed any literature that contradicts our findings; this was not intentional. We would be grateful if the reviewer could provide such references.

      In the manuscript we describe the alignment and differences of key findings with several studies (listed below) and the limitations of our study are extensively described in lines 596-626.

      Our findings align with other work on these aspects:

      - RNA-sequencing data2,20–26

      - Gene dosage effects drive the transcriptome of the aneuploid human embryo27,28

      - Aneuploid cells are cleared by sustained proteotoxic stress followed by p53 activation, autophagy and eventually apoptosis29–37.

      - p53 is active in constitutional aneuploid cells38

      - The ICM is less sensitive to apoptosis1,2

      Our findings differ with other work on these points:

      - p53 activation is independent from DNA-damage39

      - p53 is active in constitutional aneuploid cells40,41

      - Apoptosis is only present in the aneuploid TE of aneuploid cells in the embryo29,30,42    

      Reviewer #2 (Recommendations For The Authors):

      Comments:

      (1) The main problem is that there is no substantial novelty. The authors look at previously identified factors affected by chromosome gains and losses, but none of the new one from their analysis. Anything what could be potentially novel is not carefully analyzed (e.g. the difference between reversine-treated and aneuploid samples, or new potential candidates) or explained. This is really a pity.

      In the revision, we have further elaborated on the DNA damage aspect by staining for DNA double-stranded breaks and have validated DRAM1 as an activated downstream effector of p53. We have also added the analyses of the gene-expression of the reversine-treated embryos.

      (2) Some of the general statements on aneuploidy are confusing and often borderline generalized. E.g. introduction line 106: "If this (proteotoxic stress) remains unresolved by the activation of autophagy..." I am not aware of any publication suggesting that autophagy resolves proteotoxic stress in aneuploid cells. Citations that replication stress causes DNA damage in aneuploid cells are wrong. This link was first shown by Passerini et al. in 2016. etc.

      We have clarified these statements in the introduction and added the proposed citations on replication stress that causes DNA damage in aneuploid cells (lines 95-108).

      (3) In the figures the authors show a representative image of aneuploid and diploid embryos. Given the aneuploid embryos have widely different karyotypes, it would be important to clarify which of the embryos has been actually shown. Similarly, in the heat maps it is not clear which line is which embryo. This would be very useful.

      We added the karyotypes of the aneuploid embryos to the images in figures 3 and 4. Since the heatmaps were removed from the figures, we added the karyotypes to the PCAs in all figures.

      (4) The authors constantly state that aneuploid embryo accumulate more DNA damage, which is supported by some of their observations, e.g. the DNA damage response is upregulated. It would be great if they would validated this statements with testing some markers for DNA damage.

      We agree with the reviewer that this was an important point, and addressing it has revealed that our initial assumption was incorrect and has provided new interesting findings. From the revised RNA-seq analysis, we found only one pathway (DNA damage response TP53) to be activated in all aneuploid embryos (Fig.1e). The ATM pathway was also activated specifically in high-dosage embryos. Following this, we set out to test whether DNA damage was indeed increased in aneuploid embryos by staining for DNA double-strand breaks with gH2AX.

      First, we investigated the gH2AX expression in 5dpf embryos in which we induced DNA damage with Bleomycin. We compared 6 untreated versus 6 Bleomycin-treated human embryos (Fig. 3m) and found that gH2AX foci were rarely present in the untreated embryos and that all cells of the treated embryos showed a pan-nuclear gH2AX staining.

      Second, we compared the presence of gH2AX foci in the TE (NANOG negative cells), ICM (NANOG positive cells) and the whole embryo of 7 euploid versus 11 aneuploid embryos. Interestingly, we found no differences in the number of gH2AX foci or pan-nuclear gH2AX nuclei between euploid and aneuploid embryos (Fig 3o). When dividing our aneuploid embryos into high and low dosage embryos, we again found no differences. Our data now suggest that complex aneuploid human embryonic cells of meiotic origin do not contain more DNA double-strand breaks, precluding DNA damage as the source of p53 activation. Last, in our previous experiment we found that phosphorylated S15p53 is increased in aneuploid embryos, supporting an active p53 pathway as suggested by our transcriptomic data. Since we could not find DNA damage in aneuploid human embryos, we speculate that p53 is phosphorylated on Serine15 through metabolic stress as suggested by Jones and colleagues43. We also argue that proteotoxic stress might induce p53 expression as proposed by Singla and colleagues29.

      (5) The source of embryos is only partially described in a figure legend. This should be expanded and described in the Materials and Methods section. The embryos are named, but this is nowhere explained. One can only assume that T is for trisomy and M is for monosomy.

      We have divided the embryos into different experimental series (Experiment 1-4). This is now described in the Material and methods section (lines 157-175). Also, we have added the experiment number of each embryo to the supplementary tables and to the source data. The abbreviations T = Trisomy and M = Monosomy were initially introduced only in the last paragraph of the figure legend of figure 4. We have now added them to every panel.

      (6) Recent works from non-embryonic cells suggest that the cellular response to monosomy is different than the response to trisomy. Did the authors try to test this possible difference? For example, one could compare embryos M174/21, M2/19 and M17 with T2/10, T10/22 and T1/15/18/22.

      We thank the reviewer for pointing this out. Our RNA-seq dataset consisted of three embryos that contained trisomies only and four embryos that contained monosomies only. When reanalyzing our data, we found different transcriptomic responses between monosomy-only and trisomy-only cells. Compared to euploid cells, monosomy-only cells mainly activate the p53 pathway and protein secretion, while translation, DNA replication, cell cycle G1/S, DNA synthesis and processing of DNA double-strand breaks are inhibited. Trisomy-only cells show activated oxidative phosphorylation, ribosome and translation, while protein secretion, apoptosis and cell cycle are inhibited. These differences were confirmed by testing transcriptomic differences between trisomic versus monosomic cells. Our results are similar to studies on human embryos20,26 and other monosomic and trisomic cell lines44,45. However, the interpretation of these results is very limited by the small sample size and the comparison of monosomies and trisomies of different chromosomes. Thus, we decided to keep this analysis out of the manuscript.

      Author response image 1.

      On the protein level, next to the small sample size, our results were also limited by the fact that not all embryos were stained with the same combinations of antibodies. LC3B was the only protein for which all embryos were immunostained. Thus, other protein data could not be re-analyzed due to even lower sample sizes. 

      Below we have separated the LC3B puncta per cell counts into euploid, trisomies only, monosomies only and all other aneuploid embryos. We performed a Kruskal-Wallis test with multiple comparisons. It is worth noting that the difference between euploid and monosomies only (and those that contained both) was statistically significant, while the differences between euploid vs trisomies only and trisomies only vs monosomies only were not statistically significant. These differences contradict studies on monosomic cell lines, which found that proteotoxic stress and autophagy are absent from monosomic cells and specific to trisomic cell lines. Here we also decided to keep this specific protein expression analysis out of the manuscript due to the above-mentioned limitations.

      Author response image 2.
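
      For readers unfamiliar with the test named above, the omnibus Kruskal-Wallis step can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the software used for the analysis in Author response image 2: the function names are hypothetical, the example values are made up, and the post-hoc multiple-comparison step is deliberately omitted.

```python
import math
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks (ties get their average rank) and the tie-group sizes."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(half) / math.gamma(1.5)
    for j in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * term
        term *= half / (j + 0.5)
    return total


def kruskal_wallis(groups):
    """Return the tie-corrected H statistic and its chi-square p-value."""
    if len(groups) < 2 or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups")
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks, tie_sizes = average_ranks(pooled)
    correction = 1.0 - sum(t ** 3 - t for t in tie_sizes) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    weighted, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        weighted += rank_sum ** 2 / len(g)
        start += len(g)
    h = (12.0 / (n * (n + 1)) * weighted - 3 * (n + 1)) / correction
    return h, chi2_sf(h, len(groups) - 1)


# Made-up puncta-per-cell values for three groups (not data from the study).
h_stat, p_value = kruskal_wallis([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

      The p-value uses the usual chi-square approximation with one degree of freedom fewer than the number of groups, which is only approximate for groups as small as those available here; pairwise comparisons would need a separate post-hoc procedure with correction for multiple testing.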

      (7) Line 329: "a trisomy 12 meiotic chromosomal abnormality in one reversine-treated embryo." What does it mean? Why meiotic chromosomal abnormality when the reversine treatment was administered 4 days after fertilization? In the discussion, the authors state "presumed meiotic," but this should be discussed and described more clearly.

      Since reversine induces mitotic abnormalities of different types leading to chromosomally mosaic embryos, we could not identify these induced abnormalities using inferCNV on the RNA-seq of TE biopsies of said embryos. However, we were not aware of the karyotype of the embryos that were used for these experiments, as they were thawed after they had been cryopreserved at day 3 of development and had not been subjected to genetic testing. This makes it possible that some of the embryos we used for the reversine experiments in fact carried endogenously acquired meiotic and mitotic chromosomal abnormalities. Since inferCNV can only detect aneuploidies that are homogeneously present in the majority of the cells of the sequenced biopsy, we only picked up this trisomy 12. It is possible that this was not a meiotic abnormality but a mitotic one originating at the first cleavage and present in a high percentage of cells in the blastocyst. At any rate, the exact origin of this aneuploidy has no further implications for the results of the study. We clarified this in the manuscript (lines 310-315).

      (8) Line 422: "The gene expression profiles suggest that the accumulation of autophagic proteins in aneuploid embryos is caused by increased autophagic flux due to differential expression of the p53 target gene DNA Damage Regulated Autophagy Modulator-1 (DRAM1), rather than by inhibition of autophagy (Supplementary Table 2)." This is highly speculative, as the authors do not have any evidence to support this statement.

      To validate this finding, we have now stained 7 euploid and 11 aneuploid embryos with a DRAM1 antibody. We found DRAM1 protein to be significantly enriched in the cytoplasm of TE cells but not in the ICM of aneuploid embryos when compared with euploid embryos (Fig. 3s,t). These data are consistent with the finding that autophagy is increased in the TE and not the ICM of aneuploid human embryos (Fig 4l-o). Potential implications of DRAM1 expression have been mentioned in the discussion.

      (9) The figure legends are confusing. They are mixed up with the methods and some key information are missing.

      We revised all figure legends accordingly and removed the experimental set-up figures from the manuscript to reduce any confusion. The methods section was revised and expanded.

      (10) In Figure 1, what is the difference between "activated" and "deregulated"?

      Since we have now analyzed our RNA-seq dataset with the method proposed by reviewer 1, we generated normalized enrichment scores. The terms activated and deregulated are thus no longer present.

      (11) The p62 images are not really clear. There might be more puncta (not obvious, though), but the staining intensity seems lower in the representative images.  

      We do not agree with the reviewer that there might be more p62 puncta (purple); however, we agree that this was not clearly visible from the pictures. Below we show an example of the counting mask (in green) of the aneuploid embryo from figure 3i, where one can clearly appreciate that all the puncta are captured by the counting mask. In this case, the software counted 1704 puncta. To further clarify, we have now added a zoom of a randomly chosen ROI of the p62 stainings to figure 3i.

      Author response image 3.

      (12) The authors claim that there are differences between lineages in response to aneuploidy, such as autophagy not being activated in the OCT4+ lineage, etc. However, the differences are very small and based on a small number of embryos. It is difficult to draw far-reaching conclusions based on a small number of experiments (Fig. 4n-r). The authors also claim in the Abstract that they demonstrated "clear differences with previous findings in the mouse", which are however difficult to identify in the text.

      We agree with the reviewer that our conclusions on figures 4l-o were based on a small number of embryos. We have increased the sample size as much as possible. This is challenging due to the restrictions on access to human embryos, and especially the limited number of embryos with meiotic complex aneuploidy. We have performed immunostainings for LC3B, OCT4 and GATA4 of six additional euploid and four additional aneuploid human embryos. This did not change our overall finding that aneuploid embryos upregulate autophagy in the TE rather than the ICM (Figure 4l-o). After the inclusion of additional embryos, we removed from the manuscript our speculation that autophagy is present in ICM cells that have already differentiated towards EPI/PrE.

      We have rephrased the abstract to state that we highlight a few differences with previous findings in the mouse. Here we focused especially on the different transcriptomic response of reversine treated embryos, that aneuploid mouse embryos do not seem to suffer from lineage segregation errors and that the ICM of aneuploid human embryos lacks apoptosis while aneuploid mouse embryos show elimination from the EPI. Likewise, we highlighted the similar stress responses and that we could give novel insights into p53 mediated autophagy and apoptosis activation through DRAM1 in aneuploid TE cells but not the ICM.  

      (13) The text needs thorough editing - long sentences, typos, and grammar errors are frequent. Punctuation is largely missing.

      We have revised the text.

      References

      (1) Victor, A. R. et al. One hundred mosaic embryos transferred prospectively in a single clinic: exploring when and why they result in healthy pregnancies. Fertil Steril 111, 280–293 (2019).

      (2) Martin, A. et al. Mosaic results after preimplantation genetic testing for aneuploidy may be accompanied by changes in global gene expression. Front Mol Biosci 10, 264 (2023).

      (3) Martín, Á. et al. Trophectoderm cells of human mosaic embryos display increased apoptotic levels and impaired differentiation capacity: a molecular clue regarding their reproductive fate? Human Reproduction 39, 709–723 (2024).

      (4) Domingo-Muelas, A. et al. Human embryo live imaging reveals nuclear DNA shedding during blastocyst expansion and biopsy. Cell 186, 3166-3181.e18 (2023).

      (5) Loewer, A., Karanam, K., Mock, C. & Lahav, G. The p53 response in single cells is linearly correlated to the number of DNA breaks without a distinct threshold. BMC Biol 11, 1–13 (2013).

      (6) Kim, H., Watanabe, S., Kitamatsu, M., Watanabe, K. & Ohtsuki, T. Cell cycle dependence of apoptosis photo-triggered using peptide-photosensitizer conjugate. Sci Rep 10, 1–8 (2020).

      (7) Pollak, N. et al. Cell cycle progression and transmitotic apoptosis resistance promote escape from extrinsic apoptosis. J Cell Sci 134, (2021).

      (8) Neufeld, T. P. Autophagy and cell growth--the yin and yang of nutrient responses. J Cell Sci 125, 2359–2368 (2012).

      (9) Lanneau, D. et al. Heat shock proteins: essential proteins for apoptosis regulation. J Cell Mol Med 12, 743 (2008).

      (10) Gabai, V. L., Mabuchi, K., Mosser, D. D. & Sherman, M. Y. Hsp72 and Stress Kinase c-jun N-Terminal Kinase Regulate the Bid-Dependent Pathway in Tumor Necrosis Factor-Induced Apoptosis. Mol Cell Biol 22, 3415 (2002).

      (11) Sanchez-Vasquez, E., Bronner, M. E. & Zernicka-Goetz, M. HIF1A contributes to the survival of aneuploid and mosaic pre-implantation embryos. bioRxiv 2023.09.04.556218 (2023) doi:10.1101/2023.09.04.556218.

      (12) Chousal, J. N. et al. Molecular profiling of human blastocysts reveals primitive endoderm defects among embryos of decreased implantation potential. Cell Rep 43, (2024).

      (13) Corujo-Simon, E., Radley, A. H. & Nichols, J. Evidence implicating sequential commitment of the founder lineages in the human blastocyst by order of hypoblast gene activation. Development (Cambridge) 150, (2023).

      (14) Regin, M. et al. Lineage segregation in human pre-implantation embryos is specified by YAP1 and TEAD1. Human Reproduction 38, 1484–1498 (2023).

      (15) Saiz, N., Williams, K. M., Seshan, V. E. & Hadjantonakis, A. K. Asynchronous fate decisions by single cells collectively ensure consistent lineage composition in the mouse blastocyst. Nat Commun 7, 1–14 (2016).

      (16) Plusa, B., Piliszek, A., Frankenberg, S., Artus, J. & Hadjantonakis, A. K. Distinct sequential cell behaviours direct primitive endoderm formation in the mouse blastocyst. Development 135, 3081–3091 (2008).

      (17) Hashimoto, M. & Sasaki, H. Epiblast Formation by TEAD-YAP-Dependent Expression of Pluripotency Factors and Competitive Elimination of Unspecified Cells. Dev Cell 50, 139-154.e5 (2019).

      (18) Hardy, K. Apoptosis in the human embryo. Rev Reprod 4, 125–134 (1999).

      (19) Ramos-Ibeas, P. et al. Embryo responses to stress induced by assisted reproductive technologies. Mol Reprod Dev 86, 1292–1306 (2019).

      (20) Licciardi, F. et al. Human blastocysts of normal and abnormal karyotypes display distinct transcriptome profiles. Sci Rep 8, 1–9 (2018).

      (21) Maxwell, S. M. et al. Investigation of Global Gene Expression of Human Blastocysts Diagnosed as Mosaic using Next-generation Sequencing. Reproductive Sciences 1–11 (2022) doi:10.1007/s43032-022-00899-x.

      (22) Groff, A. F. et al. RNA-seq as a tool for evaluating human embryo competence. Genome Res 29, 1705–1718 (2019).

      (23) Starostik, M. R., Sosin, O. A. & McCoy, R. C. Single-cell analysis of human embryos reveals diverse patterns of aneuploidy and mosaicism. Genome Res 30, 814–826 (2020).

      (24) Vera-Rodriguez, M., Chavez, S. L., Rubio, C., Pera, R. A. R. & Simon, C. Prediction model for aneuploidy in early human embryo development revealed by single-cell analysis. Nat Commun 6, 7601 (2015).

      (25) Sanchez-Ribas, I. et al. Transcriptomic behavior of genes associated with chromosome 21 aneuploidies in early embryo development. Fertil Steril 111, 991-1001.e2 (2019).

      (26) Fuchs Weizman, N. et al. Towards Improving Embryo Prioritization: Parallel Next Generation Sequencing of DNA and RNA from a Single Trophectoderm Biopsy. Sci Rep 9, 1–11 (2019).

      (27) Fernandez Gallardo, E. et al. A multi-omics genome-and-transcriptome single-cell atlas of human preimplantation embryogenesis reveals the cellular and molecular impact of chromosome instability. bioRxiv 2023.03.08.530586 (2023) doi:10.1101/2023.03.08.530586.

      (28) Dürrbaum, M. & Storchová, Z. Effects of aneuploidy on gene expression: implications for cancer. FEBS J 283, 791–802 (2016).

      (29) Singla, S., Iwamoto-Stohl, L. K., Zhu, M. & Zernicka-Goetz, M. Autophagy-mediated apoptosis eliminates aneuploid cells in a mouse model of chromosome mosaicism. Nat Commun 11, 1–15 (2020).

      (30) Bolton, H. et al. Mouse model of chromosome mosaicism reveals lineage-specific depletion of aneuploid cells and normal developmental potential. Nat Commun 7, 1–12 (2016).

      (31) Ohashi, A. et al. Aneuploidy generates proteotoxic stress and DNA damage concurrently with p53-mediated post-mitotic apoptosis in SAC-impaired cells. Nat Commun 6, 1–16 (2015).

      (32) Santaguida, S. & Amon, A. Short- and long-term effects of chromosome missegregation and aneuploidy. Nat Rev Mol Cell Biol 16, 473–485 (2015) doi:10.1038/nrm4025.

      (33) Santaguida, S., Vasile, E., White, E. & Amon, A. Aneuploidy-induced cellular stresses limit autophagic degradation. Genes Dev 29, 2010–2021 (2015).

      (34) Chunduri, N. K. & Storchová, Z. The diverse consequences of aneuploidy. Nat Cell Biol 21, 54–62 (2019).

      (35) Dürrbaum, M. et al. Unique features of the transcriptional response to model aneuploidy in human cells. BMC Genomics 15, 139 (2014).

      (36) Pan, J.-A., Ullman, E., Dou, Z. & Zong, W.-X. Inhibition of protein degradation induces apoptosis through a microtubule-associated protein 1 light chain 3-mediated activation of caspase-8 at intracellular membranes. Mol Cell Biol 31, 3158–70 (2011).

      (37) Stingele, S. et al. Global analysis of genome, transcriptome and proteome reveals the response to aneuploidy in human cells. Mol Syst Biol 8, 608 (2012).

      (38) Tang, Y.-C., Williams, B. R., Siegel, J. J. & Amon, A. Identification of aneuploidy-selective antiproliferation compounds. Cell 144, 499–512 (2011).

      (39) Janssen, A., Van Der Burg, M., Szuhai, K., Kops, G. J. P. L. & Medema, R. H. Chromosome segregation errors as a cause of DNA damage and structural chromosome aberrations. Science 333, 1895–1898 (2011).

      (40) Li, M. et al. The ATM-p53 pathway suppresses aneuploidy-induced tumorigenesis. Proc Natl Acad Sci U S A 107, 14188–14193 (2010).

      (41) Thompson, S. L. & Compton, D. A. Proliferation of aneuploid human cells is limited by a p53-dependent mechanism. J Cell Biol 188, 369–381 (2010).

      (42) Yang, M. et al. Depletion of aneuploid cells in human embryos and gastruloids. Nat Cell Biol 23, 314–321 (2021).

      (43) Jones, R. G. et al. AMP-activated protein kinase induces a p53-dependent metabolic checkpoint. Mol Cell 18, 283–293 (2005).

      (44) Chunduri, N. K., Barthel, K. & Storchova, Z. Consequences of Chromosome Loss: Why Do Cells Need Each Chromosome Twice? Cells 11, 1530 (2022).

      (45) Krivega, M., Stiefel, C. M. & Storchova, Z. Consequences of chromosome gain: A new view on trisomy syndromes. Am J Hum Genet 109, 2126–2140 (2022) doi:10.1016/j.ajhg.2022.10.014.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Although the reviewers found our work interesting, they raised several important concerns about our study. To address these concerns, we mostly performed new experiments. The most important changes are highlighted in the summary paragraphs below.

      First, in response to Reviewer 1’s suggestions, we have conducted the SFN experiments systematically. We further confirmed the mechanism of SFN-activated TFEB in HeLa NPC1 cells with new experiments, including the effects of BAPTA-AM (a calcium chelator), FK506+CsA (calcineurin inhibitors) and NAC (a ROS scavenger) on SFN-induced TFEB nuclear translocation in HeLa NPC1 cells (New Fig. S3) and the effect of SFN on NPC1 expression (New Fig. S5). In particular, we examined the colocalization of DiO (a PM marker) staining and surface LAMP1 staining in HeLa NPC1 cells under SFN treatment to confirm PM exocytosis. In the main text and figure legends, the accuracy of each sentence has been thoroughly checked. Hence, we have significantly improved the presentation and clarity in the revision.

      Second, in response to Reviewer 2’s suggestions, we have performed additional experiments to demonstrate the role of TFEB in SFN-evoked lysosomal exocytosis using TFEB-KO cells (New Fig. S7B). In TFEB KO cells, the increase of surface LAMP1 signal upon SFN treatment was significantly reduced, suggesting that SFN induces exocytosis in a TFEB-dependent manner. We also investigated the effect of U18666A on CF555-dextran endocytosis. By examining the localization of CF-dex and Lamp1, we found that CF555 is present in the lysosome with U18666A treatment (Fig for reviewers only A,B), suggesting that NPC1 deficiency/U18666A treatment has no effect on CF-dex endocytosis.

      Third, in response to Reviewer 3’s suggestions, and in addition to the experiments performed in response to the other reviewers, we examined the cytotoxicity of the SFN concentrations used in this study in various cell lines (New Fig.S10).

      In addition, according to the reviewers’ suggestions, we made clarifications and corrections wherever appropriate in the manuscript.

      Reviewer #1 (Public review):

      Summary:

      The authors are trying to determine if SFN treatment results in dephosphorylation of TFEB, subsequent activation of autophagy-related genes, exocytosis of lysosomes, and reduction in lysosomal cholesterol levels in models of NPC disease.

      Strengths:

      (1) Clear evidence that SFN results in translocation of TFEB to the nucleus.

      (2) In vivo data demonstrating that SFN can rescue Purkinje neuron number and weight in NPC1<sup>-/-</sup> animals.

      Thank you for the support!

      Weaknesses:

      (1) Lack of molecular details regarding how SFN results in dephosphorylation of TFEB leading to activation of the aforementioned pathways. Currently, datasets represent correlations.

      Thank you for raising this critical point! The reviewer is right that this manuscript did not discuss the molecular mechanism of SFN-evoked TFEB activation in detail, because we explored this mechanism in our previous study (Li, Shao et al. 2021). There we showed that SFN activates TFEB via a ROS-Ca<sup>2+</sup>-calcineurin-dependent but MTOR-independent pathway (Li, Shao et al. 2021). In the current manuscript we cited this paper but did not describe the mechanism in detail, which understandably caused confusion. Therefore, in the revised manuscript we added more details on the molecular mechanism of SFN-activated TFEB. We also further confirmed this mechanism in HeLa NPC1 cells with new experiments, including the effects of BAPTA-AM (a calcium chelator), FK506+CsA (calcineurin inhibitors) and NAC (a ROS scavenger) on SFN-induced TFEB nuclear translocation in NPC cells (New Fig.S3).

      (2) Based on the manuscript narrative, discussion, and data it is unclear exactly how steady-state cholesterol would change in models of NPC disease following SFN treatment. Yes, there is good evidence that lysosomal flux to (and presumably across) the plasma membrane increases with SFN. However, lysosomal biogenesis genes also seem to be increasing. Given that NPC inhibition, NPC1 knockout, or NPC1 disease mutations are constitutively present and the cell models of NPC disease contain lysosomes (even with SFN) how could a simple increase in lysosomal flux decrease cholesterol levels? It would seem important to quantify the number of lysosomes per cell in each condition to begin to disentangle differences in steady state number of lysosomes, number of new lysosomes, and number of lysosomes being exocytosed.

      Thank you for this constructive comment. According to our data, SFN reduced cholesterol levels in NPC1 cells by inducing lysosomal exocytosis and increasing lysosomal biogenesis. We understand the reviewer’s point that it would be really helpful to distinguish the three pools: the original number of lysosomes, the number of new lysosomes, and the number of lysosomes being exocytosed. Unfortunately, owing to technical limitations, there currently seems to be no appropriate method that can clearly determine which pool a given lysosome comes from. We hope that future techniques will allow us to explore this mechanism.

      (3) Lack of evidence supporting the authors' premise that "SFN could be a good therapeutic candidate for neuropathology in NPC disease".

      Suggestion was taken! We removed this sentence. Thanks!

      Reviewer #2 (Public review):

      (4) The in vivo experiments demonstrate the therapeutic potential of SFN for NPC. A clear dose response analysis would further strengthen the proposed therapeutic mechanism of SFN.

      Thank you for this constructive suggestion. We examined the effect of two doses of SFN, 30 and 50 mg/kg, on NPC mice. As shown in Fig. 6, SFN at 50 mg/kg, but not at 30 mg/kg, partially prevented Purkinje cell loss in lobule IV/V of the cerebellum, suggesting a dose-correlated protective effect of SFN. In future studies, we will continue to optimize the dosage form and amount of SFN and perform a dose-response analysis.

      (5) Additional data supporting the activation of TFEB by SFN for cholesterol clearance in vivo would strengthen the overall impact of the study.

      We thank the reviewer for this constructive comment. We detected a significant decrease of pS211-TFEB protein in brain tissues of NPC mice upon SFN treatment compared to vehicle, suggesting, for the first time, that SFN activates TFEB in brain tissue. It would be worthwhile to further examine lysosomal cholesterol levels in brain tissue to show the direct effect of SFN. However, in our hands and in the literature, filipin does not appear suitable for detecting lysosomal cholesterol accumulation in brain tissue, and so far there is no good method to directly measure lysosomal cholesterol in tissue.

      (6) In Figure 4, the authors demonstrate increased lysosomal exocytosis and biogenesis by SFN in NPC cells. Including a TFEB-KO/KD in this assay would provide additional validation of whether these effects are TFEB-dependent.

      Great suggestion! We investigated the role of TFEB in SFN-evoked lysosomal exocytosis using TFEB-KO cells. As shown in New Suppl. Fig. 7B, the SFN-induced (15 μM, 12 h) increase in surface LAMP1 signal was significantly reduced in TFEB-KO cells, suggesting that SFN induces exocytosis in a TFEB-dependent manner.

      (7) For lysosomal pH measurement, the combination of pHrodo-dex and CF-dex enables ratiometric pH measurement. However, the pKa of pHrodo red-dex (according to Invitrogen) is ~6.8, while lysosomal pH is typically around 4.7. This discrepancy may account for the lack of observed lysosomal pH changes between WT and U18666A-treated cells. Notably, previous studies (PMID: 28742019) have reported an increase in lysosomal pH in U18666A-treated cells.

      We understand the reviewer’s point. However, as stated in the methods and main text, we used pHrodo™ Green-Dextran (P35368, Invitrogen) rather than pHrodo Red-dextran. According to the product information from Invitrogen, pHrodo Green-dex conjugates are non-fluorescent at neutral pH but fluoresce bright green at acidic pH around 4, such as that in endosomes and lysosomes. Therefore, pHrodo Green-dex is suitable for monitoring lysosomal acidity (Hu, Li et al. 2022). We also used LysoTracker Red DND-99 (Thermo Scientific, L7528) to measure lysosomal pH (Fig. 4G, H), and the results were consistent with those from the pHrodo Green/CF measurement.

      The reviewer mentioned that previous studies have reported an increase in lysosomal pH in U18666A-treated cells. We understand this concern. However, in our hands, using two lysosomal pH sensors, we did not detect a change in lysosomal pH in U18666A-treated NPC1 cell models.

      (8) The authors are also encouraged to perform colocalization studies between CF-dex and a lysosomal marker, as some researchers may be concerned that NPC1 deficiency could reduce or block the trafficking of dextran along endocytosis.

      Thank you for raising this important point; the suggestion was taken! We investigated the effect of NPC1 deficiency on CF555-dextran trafficking into lysosomes by examining the localization of CF555-dex and LAMP1. To clearly determine whether CF555-dex is present in the lysosome, we first used apilimod to enlarge lysosomes and then examined the relative positions of CF555-dex and LAMP1. As shown in Author response image 1A, B, in HeLa cells treated with U18666A, CF555 signals (red) are clearly present inside lysosomes (LAMP1-labelled lysosomal membrane, green signal), suggesting that CF555-dex endocytosis is not affected by NPC1 deficiency (U18666A treatment).

      Author response image 1.

      The effect of NPC1 deficiency on CF555 endocytosis. HeLa cells were transiently transfected with LAMP1-GFP plasmid for 24 h. Cells were then treated with apilimod (100 nM) for 2 h to enlarge the lysosomes, followed by co-treatment with U18666A (2.5 μM, 24 h) and CF555 (12 h). (A) Each panel shows fluorescence images taken by confocal microscopy. (B) Each panel shows the fluorescence intensity of a line scan (white line) through the double-labeled object indicated by the white arrow. Scale bar, 20 μm or 2 μm (for zoom-in images).

      (9) In vivo data supporting the activation of TFEB by SFN for cholesterol clearance would significantly enhance the impact of the study. For example, measuring whole-animal or brain cholesterol levels would provide stronger evidence of SFN's therapeutic potential.

      We really appreciate the reviewer’s comments. Please see response to point #5.

      Reviewer #3 (Public review):

      (10) The manuscript is extremely hard to read due to the writing; it needs careful editing for grammar and English.

      We apologize for the shortcomings in writing and grammar. We have thoroughly checked the grammar and polished the English to improve the manuscript.

      (11) There are a number of important technical issues that need to be addressed.

      We will address the technical issues mentioned in the following questions.

      (12) The TFEB influence on filipin staining in Figure 1A is somewhat subtle. In the mCherry alone panels there is a transfected cell with no filipin staining and the mCherry-TFEBS211A cells still show some filipin staining.

      Thank you for raising this point. The reviewer is right that not all mCherry-alone cells show the same level of filipin signal and that not all mCherry-TFEBS211A-transfected cells completely lack filipin signal. The statistical results were obtained from randomly selected cells from 3 independent experiments. To avoid confusion, we have included more cells in the statistical analysis to cover all conditions, as shown in the new Fig. 1B. We hope this clarifies the issue.

      (13) Figure 1C is impressive for the upregulation of filipin with U18666A treatment. However, SFN is used at 15 microM. This must be hitting multiple pathways. Vauzour et al (PMID: 20166144) use SFN at 10 nM to 1microM. Other manuscripts use it in the low microM range. The authors should repeat at least some key experiments using SFN at a range of concentrations from perhaps 100 nM to 5 microM. The use of 15 microM throughout is an overall concern.

      We chose this concentration of SFN based on our previous study (Li, Shao et al. 2021), in which we showed that SFN (10–15 μM, 2–9 h) induces robust TFEB nuclear translocation in a dose- and time-dependent manner in HeLa cells as well as in other human cell lines without cytotoxicity. In addition, tissue concentrations of SFN can reach 3–30 μM upon broccoli consumption (Hu, Khor et al. 2006), so we used a low micromolar concentration of SFN (15 μM) in our study. Moreover, we confirmed that SFN (15 μM) induces TFEB nuclear translocation in HeLa NPC1 cells (Fig. 1F, G, Fig. 2B, G) and that this concentration of SFN is not cytotoxic (New Fig. S10).

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      The following comments are designed to improve and focus the authors' work.

      (14) Related to data in Figure 1. The mechanism through which TFEB can reduce Filipin in U18 conditions is unclear. Inhibition of NPC1 results in hyperactivation of mTOR through cholesterol transport at ER-Lysosome contacts (see Zoncu group publications). If mTORC is hyperactive in NPC disease models, TFEB would be expected to remain cytoplasmic and not enter the nucleus as the representative image in Figure 1A demonstrates.

      In our previous study, we showed that SFN induces TFEB nuclear translocation in an mTOR-independent manner (Li, Shao et al. 2021). Consistent with this result, in this study we confirmed that SFN-induced TFEB nuclear translocation is mTOR-independent in NPC1 cells (now Fig. S4A, B). Thus, SFN induced TFEB nuclear translocation in various NPC cells (Fig. 1F, G, Fig. 2B, G). Please also see the discussion of the mechanism of SFN in the response to point #1.

      (15) Therefore, how does overexpression of TFEB, which remains in the cytoplasm, result in a decreased filipin signal? Similar questions relate to Figure 1C-H.

      Medina et al. (Medina, Fraldi et al. 2011) showed that TFEB overexpression (not activation, so the overexpressed TFEB is in the cytoplasm) increases the pool of lysosomes in the proximity of the plasma membrane and promotes their fusion with the PM by raising intracellular Ca<sup>2+</sup> levels through the lysosomal Ca<sup>2+</sup> channel MCOLN1, leading to increased lysosomal exocytosis. Hence, TFEB overexpression alone (without TFEB activation) could reduce the filipin signal by increasing lysosomal exocytosis, and treatment with a TFEB agonist such as SFN could further boost this effect.

      (16) It would seem appropriate to measure the NPC1 and NPC2 proteins using western blot to ensure that SFN-dependent clearance of cholesterol is not due to enhanced expression of the native protein in U18-treated cells or enhanced folding of the protein in patient fibroblasts.

      Thank you for this constructive comment! NPC1 gene mutations account for about 95% of NPC cases and NPC2 mutations for about 5%, and this study focused on NPC1 deficiency. We therefore measured the effect of SFN on NPC1 expression in human NPC1-patient fibroblasts. Western blot analysis showed that SFN (15 μM, 24 h) treatment did not affect NPC1 expression in human NPC1-patient fibroblasts (new Fig. S5).

      (17) Related to data in Figures 1C-E. Controls are missing related to the effect SFN has on steady-state cholesterol levels. This may be insightful in providing information on the mode of action of this compound.

      Suggestion was taken! We have added the SFN-only control in new Fig. 1C-E.

      (18) The mechanism that links SFN to TFEB-dependent translocation is suggested to involve calcineurin-dependent dephosphorylation of TFEB. However, no data is provided. It would seem important to identify the mechanism(s) through which SFN positively regulates TFEB location. This would shift the manuscript and its model from correlations to causation. Experiments involving calcineurin inhibitors, or agonists of TRPML1 that have been reported as being a key source of Ca<sup>2+</sup> for calcineurin activation, may provide molecular insight.

      Please see the paragraph in response to point #1.

      (19) Related to Figure 4. Using a plasma membrane counterstain to quantify plasma membrane LAMP1 would increase the rigor of the analysis.

      Great idea! We examined the colocalization of DiO (a PM marker) staining and LAMP1 staining in HeLa NPC1 cells under SFN treatment. As shown in new Fig. 4A, the surface LAMP1 signal (red) colocalized with DiO (green).

      (20) Related to Figure 5. How do the authors explain the kinetic disparity between SFN treatment for 24 vs 72 hrs? IF TFEB is activated and promoting lysosomal biogenesis and increased lysosomal flux across the PM, why does cholesterol accumulation lag? Perhaps related to this point. Are other cholesterol metabolizing enzymes that may have altered activity in NPC sensitive to SFN? A similar comment applies to the Sterol regulatory element binding protein pathway, which has been shown to be activated in models of NPC disease.

      We understand the reviewer’s point. As shown in Fig. 5C, D, in NPC1<sup>-/-</sup> MEF cells, SFN treatment for 24 h produced relatively weaker cholesterol clearance than in human cells (Fig. 1C, D, Fig. 2E, I). We therefore explored a longer SFN treatment of 72 h (fresh SFN in medium was added every 24 h), which produced substantial cholesterol reduction (Fig. 5C, D). This difference could be attributed to the continuous action of SFN, which could prolong exocytosis and lead to more effective cholesterol clearance. As shown in the DMSO-treated MEF cells, cholesterol levels are similar at 24 and 72 h; thus, 24 h U18666A treatment has already reached the upper limit of accumulated cholesterol, and a longer treatment time would not change cholesterol levels. Thus, cholesterol accumulation does not lag.

      We did not investigate whether SFN regulates other cholesterol-metabolizing enzymes or sterol regulatory element binding proteins, although we cannot rule out this possibility. In this study we focused mainly on SFN-induced cholesterol clearance via TFEB-mediated pathways. In our data, TFEB KO significantly diminished SFN-evoked cholesterol clearance. Hence, the contribution of other cholesterol-metabolizing enzymes or sterol regulatory element binding proteins may not be as important as that of TFEB and is beyond the scope of this study. In the future, we may explore the possible involvement of other pathways in SFN’s effects.

      (21) Related to Figure 7. The western blots for pS211-TFEB are poor. It's suggested that whole blots are shown to increase rigor.

      Thank you for the comments. We now show the blots with more of the surrounding area to increase the rigor.

      (22) Data demonstrating the ability of SFN to improve Purkinje cell survival are exciting and pair well with the weight analysis, however, to address the overall goal of determining if "SFN could be a good therapeutic candidate for neuropathology in NPC disease" survival analysis should be tested as well.

      Please see the paragraph in response to point #3.

      Minor

      (23) Throughout the manuscript many different Fonts and font sizes are used. This is very jarring to readers. It is suggested that a more uniform approach is taken to presenting these nice datasets.

      We apologize for these oversights. We have thoroughly checked the entire manuscript to make sure that fonts and font sizes are consistent.

      (24) Related to data presentation. In general, there is a lack of alignment and organization of the figures.

      We apologize for this. We have reorganized the figures so that they are better aligned.

      (25) Line 149, SFN is missing.

      Corrected!

      Reviewer #3 (Recommendations for the authors):

      (26) In Figure 3 the authors should use multiple single siRNAs or perform a functional rescue to determine specificity.

      We understand the reviewer’s point. We designed several siRNAs and validated their efficiency. We used the siRNA with the best knockdown efficiency in this study, and the specificity of siTFEB was validated by Western blot as shown in Fig. 3A. Furthermore, we used TFEB knockout cells generated by CRISPR/Cas9 to further examine the role of TFEB in SFN-induced cholesterol clearance (Fig. 3D). Consistent with the results in the siTFEB-transfected HeLa NPC1 cells (Fig. 3B, C), SFN failed to diminish cholesterol in HeLa TFEB KO cells. The result from TFEB KO cells is even more convincing than the siRNA experiment. We also performed a functional rescue by re-expressing TFEB in TFEB KO cells, in which SFN-induced cholesterol clearance was restored (Fig. 3E, F). Collectively, these data indicate that TFEB is required for lysosomal cholesterol reduction upon SFN treatment. We therefore did not repeat the rescue experiment in the siTFEB-transfected HeLa NPC1 cells.

      (27) The label for 3D is missing.

      Corrected! Thanks!

      (28) Figure 4, although the authors use an antibody against the luminal domain of LAMP1 there could still be some permeabilization. A marker of the plasma membrane would be helpful.

      Please see the response to point #19.

      (29) Figure 4, cholesterol in the media because of lysosome exocytosis. This is where the high concentration of SFN is of concern. Is there any cell death that could explain the result? The authors should test for cell death with the SFN treatment.

      Thank you for raising this important point! We measured the cytotoxicity of SFN at the concentrations used in this study in various cell lines (New Fig. S10). Please also see the paragraph in response to point #13.

      (30) The blot in Figure 6A is unclear. It is very hard to see any change in pS211-TFEB levels, and, given the blurry signal, the detection of phospho-TFEB is uncertain.

      Please see the summary paragraph in response to point #21.

      References:

      Hu, M. Q., P. Li, C. Wang, X. H. Feng, Q. Geng, W. Chen, M. Marthi, W. L. Zhang, C. L. Gao, W. Reid, J. Swanson, W. L. Du, R. Hume and H. X. Xu (2022). "Parkinson's disease-risk protein TMEM175 is a proton-activated proton channel in lysosomes." Cell 185(13): 2292-+.

      Hu, R., T. O. Khor, G. Shen, W. S. Jeong, V. Hebbar, C. Chen, C. Xu, B. Reddy, K. Chada and A. N. Kong (2006). "Cancer chemoprevention of intestinal polyposis in ApcMin/+ mice by sulforaphane, a natural product derived from cruciferous vegetable." Carcinogenesis 27(10): 2038-2046.

      Li, D., R. Shao, N. Wang, N. Zhou, K. Du, J. Shi, Y. Wang, Z. Zhao, X. Ye, X. Zhang and H. Xu (2021). "Sulforaphane Activates a lysosome-dependent transcriptional program to mitigate oxidative stress." Autophagy 17(4): 872-887.

      Medina, D. L., A. Fraldi, V. Bouche, F. Annunziata, G. Mansueto, C. Spampanato, C. Puri, A. Pignata, J. A. Martina, M. Sardiello, M. Palmieri, R. Polishchuk, R. Puertollano and A. Ballabio (2011). "Transcriptional activation of lysosomal exocytosis promotes cellular clearance." Dev Cell 21(3): 421-430.

    1. Author response:

      The following is the authors’ response to the original reviews

      We would like to thank you and the reviewers for the valuable feedback on the first version of the manuscript. We have now addressed all of the issues raised by the reviewers, mostly by implementing the suggested changes and clarifying important details in the revised version of the manuscript. A detailed response to each comment is provided in the rebuttal letter. Briefly, the main changes were as follows:

      - We changed ‘homeostatic balance’ to ‘network balance’, especially when describing the main finding, because the response changes induced by the stimulation occurred on a fast timescale. We speculate that the sustained changes observed in the post-stimulation condition are the result of homeostatic mechanisms.

      - We added further verification of the targeted stimulation effect with a supplementary result comparing its effect between the target and off-target z-planes, and by demonstrating the minimal impact of the imaging laser on rsChRmine.

      - We added a simple toy model illustrating that suppression applied specifically to co-tuned cells yields the decrease in response amplitude, to further support our findings.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Kang et al. provide the first experimental insights from holographic stimulation of auditory cortex. Using stimulation of functionally-defined ensembles, they test whether overactivation of a specific subpopulation biases simultaneous and subsequent sensory-evoked network activations.

      Strengths:

      The investigators use a novel technique to investigate the sensory response properties in functionally defined cell assemblies in auditory cortex. These data provide the first evidence of how acutely perturbing specific frequency-tuned neurons impacts the tuning across a broader population.

      Weaknesses:

      I have several main concerns about the interpretation of these data:

      (1) The premise of the paper suggests that sensory responses are noisy at the level of neurons, but that population activity is reliable and that different neurons may participate in sensory coding on different trials. However, no analysis related to single trial variance or overall stability of population coding is provided. Specifically, showing that population activity is stable across trials in terms of total activity level or in some latent low dimensional representation would be required to support the concept of "homeostatic balancing".

      Thank you for raising an important point. We agree that ‘homeostatic balancing’ may not be the best term for explaining the main results. We have now toned down the homeostatic plasticity aspect of our explanation of the main result and changed the term to a simple ‘network balance’, which could arise from various factors including rapid synaptic plasticity. We speculate that the persistent activity of co-tuned cells in the post-stimulation session, rather than a rapid return of their responses to baseline, is a result of homeostatic balance. Relevant changes have been implemented throughout the manuscript, including the Introduction (e.g., lines 76-78) and Discussion sections (e.g., lines 453-456).

      (2) Rebalancing would predict either that the responses of stimulated neurons would remain A) elevated after stimulation due to a hebbian mechanism or B) suppressed due to high activity levels on previous trials, a homeostatic mechanism. The authors report suppression in targeted neurons after stimulation blocks, but this appears similar to all other non-stimulated neurons. How do the authors interpret the post-stimulation effect in stimulated neurons?

      It is true that in the post-stimulation session we observed no response change in either co-tuned or non-co-tuned neurons, in both stimulation and control sessions. This could be because neuronal activity had already adapted and decreased sufficiently through the consecutive presentation of the acoustic stimuli themselves. However, we still think that if the response decrease in co-tuned non-stimulated neurons were driven purely by stimulation, without homeostasis, their responses should at least rebound during the post-stimulation session. We agree that further investigation is required to confirm this effect. We elaborated on this as an additional point in the Discussion section (lines 457-464).

      (3) The authors suggest that ACtx is different from visual cortex in that neurons with different tuning properties are intermingled. While that is true at the level of individual neurons, there is global order, as demonstrated by the authors own widefield imaging data and others at the single cell level (e.g. Tischbirek et al. 2019). Generally, distance is dismissed as a variable in the paper, but this is not convincing. Work across multiple sensory systems, including the authors own work, has demonstrated that cortical neuron connectivity is not random but varies as a function of distance (e.g. Watkins et al. 2014). Better justification is needed for the spatial pattern of neurons that were chosen for stimulation. Further, analyses that account for center of mass of stimulation, rather than just the distance from any stimulated neuron would be important to any negative result related to distance.

      Thank you for the further suggestion regarding distance. While Watkins et al. (2014) and Levy and Reyes (2012) showed stronger connectivity for nearby cells as well as for more distant patches, on a functional level Winkowski & Kanold (2013) showed high frequency heterogeneity, especially in L2/3, which we targeted for imaging in this study. Thus, connected cells can have varied tuning, consistent with spine imaging (Konnerth paper). As an additional verification, we now also calculated distance from the center of mass of the target cells and still observed no distance-related stimulation effect. We have now replaced Figure 4B with the result from the center-of-mass calculation.
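
      For illustration only, the two distance measures discussed in this point can be sketched as below. This is a minimal sketch under stated assumptions, not the analysis code used in the study: the function names, the unweighted centroid, and the example coordinates are all hypothetical.

```python
import math

def center_of_mass(points):
    """Unweighted centroid of (x, y) positions, in µm (assumed definition)."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def distance_to_center_of_mass(cell, targets):
    """Distance from one cell to the centroid of all target cells."""
    cx, cy = center_of_mass(targets)
    return math.hypot(cell[0] - cx, cell[1] - cy)

def distance_to_nearest_target(cell, targets):
    """Distance from one cell to its closest target cell."""
    return min(math.hypot(cell[0] - tx, cell[1] - ty) for tx, ty in targets)

# Hypothetical positions (µm) for five target cells and one non-target cell.
targets = [(0.0, 0.0), (40.0, 0.0), (40.0, 30.0), (0.0, 30.0), (20.0, 15.0)]
cell = (20.0, 55.0)
d_com = distance_to_center_of_mass(cell, targets)
d_near = distance_to_nearest_target(cell, targets)
```

      The two measures can differ for the same cell, which is why a negative distance result is more convincing when it holds under both.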

      (4) Data curation and presentation: Broadly, the way the data were curated and plotted makes it difficult to determine how well-supported the authors claims are. In terms of curation, the removal of outliers 3 standard deviations above the mean in the analysis of stimulation effects is questionable. Given the single-cell stimulation data presented in Figure 1, the reader is led to believe that holographic stimulation is quite specific. However, the justification for removing these outliers is that there may be direct stimulation 20-30 um from the target. Without plotting and considering the outliers as well, it is difficult to understand if these outsized responses are due to strong synaptic connections with neighboring neurons or rather just direct off-target stimulation. Relatedly, data presentation is limited to the mean + SEM for almost all main effects and pre-post stimulation effects are only compared indirectly. Whether stimulation effects are driven by just a few neurons that are particularly suppressed or distinct populations which are suppressed or enhanced remains unclear.

      Thank you for pointing this out. We now specifically removed neighboring cells that are < 20 µm from the target point, and we observed similar results. We replaced all the relevant figures, text, and statistical results to ensure that the exclusion was specific to overlapping neighboring cells.
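
      As a hedged illustration of a distance-based exclusion of this kind (not the code used in the study), the sketch below drops any non-target cell closer than 20 µm to any target. The function name, the dictionary layout, and the example coordinates are hypothetical.

```python
import math

EXCLUSION_RADIUS_UM = 20.0  # cells closer than this to any target are dropped

def exclude_near_targets(cells, targets, radius_um=EXCLUSION_RADIUS_UM):
    """Return the non-target cells that lie at least radius_um from every target.

    cells:   dict mapping a cell id to its (x, y) position in µm
    targets: list of (x, y) positions of the stimulated cells in µm
    """
    kept = {}
    for cell_id, (x, y) in cells.items():
        nearest = min(math.hypot(x - tx, y - ty) for tx, ty in targets)
        if nearest >= radius_um:
            kept[cell_id] = (x, y)
    return kept

# Hypothetical example: one target at the origin and three non-target cells.
cells = {"a": (5.0, 5.0), "b": (20.0, 0.0), "c": (100.0, 100.0)}
kept = exclude_near_targets(cells, [(0.0, 0.0)])
```

      Unlike a cut based on response amplitude, this criterion depends only on position, so strongly responding cells far from the targets are retained.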

      Reviewer #2 (Public review):

      The goal of HiJee Kang et al. in this study is to explore the interaction between assemblies of neurons with similar pure-tone selectivity in mouse auditory cortex. Using holographic optogenetic stimulation in a small subset of target cells selective for a given pure tone (PTsel), while optically monitoring calcium activity in surrounding non-target cells, they discovered a subtle rebalancing process: co-tuned neurons that are not optogenetically stimulated tend to reduce their activity. The cortical network reacts as if an increased response to PTsel in some tuned assemblies is immediately offset by a reduction in activity in the rest of the PTsel-tuned assemblies, leaving the overall response to PTsel unchanged. The authors show that this rebalancing process affects only the responses of neurons to PTsel, not to other pure tones. They also show that assemblies of neurons that are not selective for PTsel don't participate in the rebalancing process. They conclude that assemblies of neurons with similar pure-tone selectivity must interact in some way to organize this rebalancing process, and they suggest that mechanisms based on homeostatic signaling may play a role.

      The conclusions of this paper are very interesting but some aspects of the study including methods for optogenetic stimulation, statistical analysis of the results and interpretation of the underlying mechanisms need to be clarified and extended.

      (1) This study uses an all-optical approach to excite a restricted group of neurons chosen for their functional characteristics (their frequency tuning), and simultaneously record from the entire network observable in the FOV. As stated by the authors, this approach is applied for the first time to the auditory cortex, which is a tour de force. However, such an approach is complex and requires precise controls to be convincing. In the manuscript, several methodological aspects are not sufficiently described to allow a proper understanding.

      (i) The use of CRmine together with GCaMP8s has been reported as problematic as the 2Ph excitation of GCaMP8s also excites the opsin. Here, the authors use a red-shifted version of CRmine to prevent such cross excitation by the imaging laser. To be convincing, they should explain how they controlled for the absence of rsCRmine activation by the 940nm light. Showing the fluorescence traces immediately after the onset of the imaging session would ensure that neurons are not excited as they are imaged.

      Thank you for pointing this out. We realized that an important reference had been omitted. Kishi et al. 2022 validated the efficacy of rsChRmine compared to ChRmine. In that paper, they compared the activity of regular ChRmine and rsChRmine at different wavelengths and settings and showed the efficiency of rsChRmine with reduced optical crosstalk. This reference is now included in the manuscript (line 98). We also checked the spontaneous baseline activity, which lasted about 10 s before any sound presentation, and observed relatively stable activity throughout rather than any activation related to imaging session onset; this is also similar to what we see in another group of GCaMP6s transgenic animals.

      Author response image 1.

      Baseline fluorescence activity across cells within FOVs from AAV9-hSyn-GCaMP8s-T2A-rsChRmine injected mice (top) and CBA X Thy1-GCaMP6s F1 transgenic mice (bottom). Fluorescence levels and activity patterns remain similar, suggesting no evident imaging laser-induced activation of rsChRmine. Note that GCaMP8s examples are smoothed using a 4-point moving average because GCaMP8s shows faster activity.
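
      For readers unfamiliar with the smoothing mentioned in the caption, a 4-point moving average can be sketched as follows. This is a minimal sketch and an assumption about the windowing: it returns only fully covered windows (no edge padding), and the function name and example trace are hypothetical.

```python
def moving_average(trace, window=4):
    """Smooth a fluorescence trace with a sliding mean over `window` samples.

    Only windows fully inside the trace are returned, so the output has
    len(trace) - window + 1 samples (empty if the trace is too short).
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    if len(trace) < window:
        return []
    running = sum(trace[:window])
    smoothed = [running / window]
    for i in range(window, len(trace)):
        # Slide the window by one sample: add the new value, drop the oldest.
        running += trace[i] - trace[i - window]
        smoothed.append(running / window)
    return smoothed

# Hypothetical trace values, used only to demonstrate the call.
example = moving_average([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
```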

      (ii) Holographic patterns used to excite 5 cells simultaneously may be associated with out-of-focus laser hot spots. Cells located outside of the FOV could be activated, therefore engaging other cells than the targeted ones in the stimulation. This would be problematic in this study as their tuning may be unrelated to the tuning of the targeted cells. To control for such an effect, one could in principle decouple the imaging and the excitation planes, and check for the absence of out-of-focus unwanted excitation.

      We further verified whether the laser power at the targeted z-plane influences cells’ activity at nearby z-planes. As the Reviewer pointed out, the previous x- and y-axis shifts were tested with single-cell stimulation. This time, we stimulated five cells simultaneously to match the actual experimental setup and to assess potential artifacts in other planes. We observed no stimulation-driven activity increase in cells at a z-plane shifted by 20 µm (Supplementary Figure 1). This confirms that the holographic stimulation accurately manipulates the pre-selected target cells and that the effects we observe are not likely due to out-of-focus stimulation artifacts. It is true that not all pre-selected cells that showed significant response changes prior to the main experiment were effectively activated on every trial during the experiments. We varied the target cell distances across FOVs, from nearby cells to those farther apart within the FOV, and did not observe a significant relationship between target cell distance and stimulation effect. Lastly, cells within < 20 µm of the target were excluded to prevent potential excitation due to the holographic stimulation power. Given the spontaneous movements of the FOV during imaging sessions due to the animal’s movement, despite our efforts to minimize them, we believe that any excitation of these neighboring neurons would result directly from the stimulation rather than from a light pattern artifact itself.

      (iii) The control shown in Figure 1B is intended to demonstrate the precision of the optogenetic stimulation: when the stimulation spiral is played at a distance larger or equal to 20 µm from a cell, it does not activate it. However, in the rest of the study, the stimulation is applied with a holographic approach, targeting 5 cells simultaneously instead of just one. As the holographic pattern of light could produce out-of-focus hot spots (absent in the single cell control), we don't know what is the extent of the contamination from non-targeted cells in this case. This is important because it would determine an objective criterion to exclude non-targeted but excited cells (last paragraph of the Result section: "For the stimulation condition, we excluded non-target cells that were within 15 µm distance of the target cells...")

      Neurons that are highly sensitive to a certain frequency also show the greatest adaptation effect, which can be observed in the control condition. Therefore, the greater amplitude change in highly sensitive neurons is first related to neuronal adaptation to their preferred frequency. However, when co-tuned target neurons are stimulated, other co-tuned non-target neurons show a significantly greater amplitude decrease compared to either stimulation of non co-tuned target neurons or the control (the latter did not meet the significance level).

      We also tried a more rigorous exclusion criterion of 20 µm instead of 15 µm, as you pointed out, since the spiral size was 20 µm. This yielded a further significant decrease in response amplitude due to stimulation, only in co-tuned non-target neurons processing their preferred frequency.

      (2) A strength of this study comes from the design of the experimental protocol used to compare the activity in non-target co-tuned cells when the optogenetic stimulation is paired with their preferred tone versus a non-preferred pure tone. The difficulty lies in the co-occurrence of the rebalancing process and the adaptation to repeated auditory stimuli, especially when these auditory stimuli correspond to a cell's preferred pure tones. To distinguish between the two effects, the authors use a comparison with a control condition similar to the optogenetic stimulation conditions, except that the laser power is kept at 0 mW. The observed effect is shown as an extra reduction of activity in the condition with the optogenetic paired with the preferred tone, compared to the control condition. The specificity of this extra reduction when stimulation is synchronized with the preferred tone, but not with a non-preferred tone, is a potentially powerful result, as it points to an underlying mechanism that links the assemblies of cells that share the same preferred pure tones.

      The evidence for this specificity is shown in Figure 3A and 3D. However, the universality of this specificity is challenged by the fact that it is observed for 16kHz preferring cells, but not so clearly for 54kHz preferring cells: these 54kHz preferring cells also significantly (p = 0.044) reduce their response to 54kHz in the optogenetic stimulation condition applied to 16kHz preferring target cells compared to the control condition. The proposed explanation for this is the presence of many cells with a broad frequency tuning, meaning that these cells could have been categorized as 54kHz preferring cells, while they also responded significantly to a 16kHz pure tone. To account for this, the authors divide each category of pure tone cells into three subgroups with low, medium and high frequency preferences. Following the previous reasoning, one would expect at least the "high" subgroups to show a strong and significant specificity for an additional reduction only if the optogenetic stimulation is targeted to a group of cells with the same preferred frequency. Figure 3D fails to show this. The extra reduction for the "high" subgroups is significant only when the condition of opto-stimulation synchronized with the preferred frequency is compared to the control condition, but not when it is compared to the condition of opto-stimulation synchronized with the non-preferred frequency.

      Therefore, the claim that "these results indicate that the effect of holographic optogenetic stimulation depends not on the specific tuning of cells, but on the co-tuning between stimulated and non-stimulated neurons" (end of paragraph "Optogenetic holographic stimulation decreases activity in non-target co-tuned ensembles") seems somewhat exaggerated. Perhaps increasing the number of sessions in the 54kHz target cell optogenetic stimulation condition (12 FOV) to the number of sessions in the 16kHz target cell optogenetic stimulation condition (18 FOV) could help to reach significance levels consistent with this claim.

      We previously tested this by randomly subselecting 12 FOVs from the 16 kHz stimulation condition to match the number of FOVs between the two groups, and did not see any difference in the results. However, to further verify the results, we have now added three more datasets to the 54 kHz target cell stimulation condition (now 15 FOVs), which yielded a similar outcome. We have updated the statistical values to include the added datasets.

      (3) To interpret the results of this study, the authors suggest that mechanisms based on homeostatic signaling could be important to allow the rebalancing of the activity of assemblies of co-tuned neurons. In particular, the authors try to rule out the possibility that inhibition plays a central role. Both mechanisms could produce effects on short timescales, making them potential candidates. The authors quantify the spatial distribution of the balanced non-targeted cells and show that they are not localized in the vicinity of the targeted cells. They conclude that local inhibition is unlikely to be responsible for the observed effect. This argument raises some questions. The method used to quantify spatial distribution calculates the minimum distance of a non-target cell to any target cell. If local inhibition is activated by the closest target cell, one would expect the decrease in activity to be stronger for non-target cells with a small minimum distance and to fade away for larger minimum distances. This is not what the authors observe (Figure 4B), so they reject inhibition as a plausible explanation. However, their quantification doesn't exclude the possibility that non-target cells in the minimum distance range could also be close and connected to the other 4 target cells, thus masking any inhibitory effect mediated by the closest target cell. In addition, the authors should provide a quantitative estimate of the range of local inhibition in layers 2/3 of the mouse auditory cortex to compare with the range of distances examined in this study (< 300 µm). Finally, the possibility that some target cells could be inhibitory cells themselves is considered unlikely by the authors, given the proportions of excitatory and inhibitory neurons in the upper cortical layers. On the other hand, it should be acknowledged that inhibitory cells are more electrically compact, making them easier to be activated optogenetically with low laser power.

      Minimum distance is defined as the smallest distance from a non-target cell to any of the target cells. Thus, if the effect were due to local inhibition, the closest target cell would likely have affected the non-target cells’ response changes. As an additional verification, based on both Reviewers’ comments, we also calculated the distance from the center of mass of the target cells, and still observed no distance-related stimulation effect. The result is now updated in Figure 4B.

      Based on previous literature, such as Levy & Reyes 2012, excitatory and inhibitory connectivity is known to extend to around 100 µm. Our results do not show any additional effect for cells at distances below 100 µm. This suggests that the effect is not limited to local inhibition. We also added further speculation on why our results are less likely to be due to increased inhibition, despite the biological characteristics of inhibitory neurons relevant to optogenetic activation.
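
      The two distance measures described above can be sketched as follows. This is only an illustration under stated assumptions: cell positions are taken to be 2D coordinates in µm, the "center of mass" is taken to be the unweighted centroid of the target cells, and the function names and example coordinates are hypothetical rather than taken from the analysis code.

```python
import math

def min_distance_to_targets(cell_xy, target_xys):
    """Smallest Euclidean distance from one non-target cell to any target cell."""
    return min(math.dist(cell_xy, t) for t in target_xys)

def distance_to_target_centroid(cell_xy, target_xys):
    """Distance from one non-target cell to the unweighted centroid of the targets.

    Assumption: the 'center of mass' is the plain mean of target coordinates.
    """
    cx = sum(x for x, _ in target_xys) / len(target_xys)
    cy = sum(y for _, y in target_xys) / len(target_xys)
    return math.dist(cell_xy, (cx, cy))

# Hypothetical coordinates (µm) for five target cells and one non-target cell.
targets = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0), (100.0, 100.0), (50.0, 50.0)]
cell = (10.0, 0.0)

d_min = min_distance_to_targets(cell, targets)      # nearest target is (0, 0)
d_com = distance_to_target_centroid(cell, targets)  # centroid is (50, 50)
```

      In this toy layout the cell sits close to one target but far from the centroid, which shows why the two measures can rank the same cell differently.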

      Reviewer #3 (Public review):

      Summary:

      The authors optogenetically stimulate 5 neurons all preferring the same pure tone frequency (16 or 54 kHz) in the mouse auditory cortex using a holography-based single cell resolution optogenetics during sound presentation. They demonstrate that the response boosting of target neurons leads to a broad suppression of surrounding neurons, which is significantly more pronounced in neurons that have the same pure tone tuning as the target neurons. This effect is immediate and spans several hundred micrometers. This suggests that the auditory cortical network balances its activity in response to excess spikes, a phenomenon already seen in visual cortex.

      Strengths:

      The study is based on a technologically very solid approach based on single-cell resolution two-photon optogenetics. The authors demonstrate the potency and resolution of this approach. The inhibitory effects observed upon targeted stimulation are clear and the relative specificity to co-tuned neurons is statistically clear although the effect size is moderate.

      Weaknesses:

      The evaluation of the results is brief and some aspects of the observed homeostatic are not quantified. For example, it is unclear whether stimulation produces a net increase or decrease of population activity, or if the homeostatic phenomenon fully balances activity. A comparison of population activity for all imaged neurons with and without stimulation would be instructive. The selectivity for co-tuned neurons is significant but weak. Although it is difficult to evaluate this issue, this result may be trivial, as co-tuned neurons fire more strongly. Therefore, the net activity decrease is expected to be larger, in particular, for the number of non-co-tuned neurons which actually do not fire to the target sound. The net effect for the latter neurons will be zero just because they do not respond. The authors do not make a very strong case for a specific inhibition model in comparison to a broad and non-specific inhibitory effect. Complementary modeling work would be needed to fully establish this point.

      Thank you for raising important points. We agree that the term homeostatic balancing may have been an overstatement. We have toned down the claims regarding homeostatic plasticity and now frame the result as rapid plasticity at the single-trial level. Regardless, the average activity level did not differ among stimulation conditions (control, 16kHz stim, and 54kHz stim), which seems to suggest that the overall activity level was maintained regardless of the stimulation. We added a new figure of the global activity change as Fig. 4A.

      We also added a simple model work in which a suppression term was applied either to all neurons or specifically to non-target co-tuned cells to test our results from the data.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) For the first holography paper in A1, more information is needed about how holographic stimulation was performed and how stimulation artifacts were avoided or removed from the data set, especially as the text states that the PMTs were left open for the duration of the experiment.

      We further clarified the rationale for leaving the shutter open, which is to avoid mechanical sounds that would activate neurons in the AC. We keep the uncaging shutter open because the Bruker default setting (software version 5.7) opens and closes the shutter for every iteration of the stimulation, which generates loud mechanical sounds and makes it unclear whether the activation is due to the sound or the stimulation.

      (2) The choice of the dF/F as the primary tool for quantifying data should be better justified. Presumably, cells have very different variances in baseline activity levels and baseline fluorescence levels that create a highly skewed distribution of responses across the population. Further, a

      To take the variance in baseline activity into account, we first calculate dF/F by normalising to the baseline period (about 330 ms before sound onset) immediately preceding each trial, for each cell. By doing so, we minimize any effect that could have been driven by variable baseline activity levels across neurons.
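
      A minimal sketch of this per-trial, per-cell normalisation is given below. It assumes the usual definition dF/F = (F − F0) / F0 with F0 the mean fluorescence in the pre-onset baseline window; the function name, the frame rate of 30 Hz (so that about 330 ms is roughly 10 frames) and the example trace are hypothetical.

```python
from statistics import mean

def dff_per_trial(trace, onset_idx, n_baseline):
    """Return dF/F for one trial of one cell.

    F0 is the mean of the n_baseline frames immediately before sound onset,
    so each trial is normalised to its own baseline.
    """
    if n_baseline <= 0 or onset_idx - n_baseline < 0:
        raise ValueError("baseline window must lie inside the trace")
    f0 = mean(trace[onset_idx - n_baseline:onset_idx])
    if f0 == 0:
        raise ValueError("baseline fluorescence is zero")
    return [(f - f0) / f0 for f in trace]

# Hypothetical trial: 10 baseline frames followed by a sound-evoked transient.
trial = [100.0] * 10 + [150.0, 200.0, 150.0]
dff = dff_per_trial(trial, onset_idx=10, n_baseline=10)
```

      Because F0 is recomputed for every trial and every cell, slow drifts in baseline fluorescence and differences in resting brightness between cells do not carry over into the response amplitude.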

      (3) More analysis should be performed to determine why 33% of stimulated cells are not activated, and instead are suppressed during stimulation. Is this related to a cells baseline fluorescence?

      Great point. Although we did our best to pre-select stimulation-responsive neurons before starting the actual experiments and to head-fix the animals as stably as possible, these neurons do not remain the “best stimulation-responsive neurons” throughout the entire imaging session. There are several possible reasons. First, their responses to optogenetic stimulation appear to change after they are exposed to acoustic stimulation. Second, since the AC is on the temporal side, it is likely to be more affected by movements of the animal and its brain throughout the imaging session, which could be larger than for visual or motor cortex. However, 33% of 5 cells is about 1.5 cells, so on average roughly one cell is missed, although in some sessions all 5 cells are stimulated while in others the holographic stimulation is clearly less effective.

      We even manually visualised the fluorescence change due to the holographic stimulation before starting any imaging session. Nevertheless, the cells do not remain the ‘best stimulation-responsive cells’ throughout, as we cannot control the natural biological variability of neuronal activity. Still, based on the significant stimulation effects observed when presenting different pure tone frequencies and when delivering different target stimulation and no-stimulation control conditions, we believe that the effect itself is valid. We added these caveats to the manuscript as a further discussion point.

      (4) The linear mixed-effects model should include time as a variable as A) the authors hypothesize that responses should be reduced over time due to sensory adaptation and that B) stimulation induced suppression might be dynamic (though they find it is not).

      Since the stimulation effect seems to be independent of trial-by-trial changes across stimulation conditions (Fig. 4) and we have now toned down the homeostasis claims, we kept the current mixed-effects model variables.

      (5) More speculation is needed on why stimulation suppresses responses from the first trial onwards.

      We now further speculate that such rapid response changes arise from activity-dependent synaptic changes, driven by the overall shift in network energy caused by optogenetic stimulation, that maintain the cortical circuit balance.

      (6) What does each dot represent in Figure 4a vs. Figure 4B? They are very different in number.

      In 4A, each dot is the average amplitude change per trial. There is exactly the same number of dots across frequencies, cell groups and conditions, as each dot represents one trial (20 each). They may look different only because some dots overlap between frequencies.

      In 4B, each dot is one cell. The 16kHz preferring cells panel of the stimulation conditions is denser because it had more FOVs and thus more cells to plot. We further clarified these details in the figure legend.

      (7) How sensory responsive neurons were selected should be shown in the figures. Specifically, which fraction of the 30% of most responsive neurons were stimulated should be stated. Depending on the exact yield in the field of view, all or only a minority of strongly sensory responsive neurons are being stimulated, which in either case would color the interpretation of the data.

      We varied the FOV as much as possible across sessions to ensure that FOVs were within A1 and covered a range of frequencies. If we could not observe more than 80 sound-responsive neurons in the processed suite2p data, we searched for another FOV.

      We now included an example FOV of the widefield imaging we first conducted to identify A1, and another example FOV of the 2-photon imaging where we conducted a short sound presentation session to identify the sensory responsive neurons, as an inset of the ‘Cell selection’ part in Figure 1.

      Reviewer #2 (Recommendations for the authors):

      Minor points:

      - p.4, last line: "of" probably missing "the processing the target..."

      Fixed.

      - p.5, top, end of the first paragraph of this page: Figure 3B and 3E don't show exemplar traces.

      Corrected as Figure 2A and 2D.

      - P.5, first sentence of the paragraph "Optogenetic holographic stimulation increases activity in targeted ensembles": reference to Figure 3A and 3D should rather be Figure 2A and 2D.

      Corrected.

      - P.9, 2nd paragraph: sentence with a strange syntax: "since their response amplitude..."

      Corrected.

      - Figure 2: panels C and F are missing.

      Corrected.

      - p.11, methods: "wasthen" should be "was then".

      Corrected.

      - p.12, analysis: it is not clearly explained why the sound evoked activity is computed based on the 160ms to 660ms after sound onset instead of 0ms to 660 ms. It is likely related to some potential contamination but it should be explicitly explained.

      We used this window because of the relatively slow calcium transient, to more accurately capture the sound-evoked responses. We added this detail.

      - Methods, analysis: the authors should better explain how they conducted the random permutation described in the Figures 1D, 2B and 2E. Which signals were permutated?

      The random permutation shuffled the target cell IDs.
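
      One way such a shuffle of target cell IDs could be implemented is sketched below. The response only states that target cell IDs were permuted, so the test statistic (mean response change of the labelled targets), the one-sided comparison, the number of permutations and all names and values are assumptions made for illustration.

```python
import random
from statistics import mean

def shuffled_target_pvalue(responses, target_ids, n_perm=2000, seed=0):
    """Permutation test in which the 'target' label is reassigned at random.

    responses  : dict mapping cell ID -> response change (hypothetical values)
    target_ids : list of cell IDs that were actually stimulated
    Returns the observed mean of the targets and a one-sided p-value.
    """
    rng = random.Random(seed)
    cell_ids = list(responses)
    k = len(target_ids)
    observed = mean(responses[c] for c in target_ids)
    exceed = 0
    for _ in range(n_perm):
        fake_targets = rng.sample(cell_ids, k)  # shuffled target IDs
        if mean(responses[c] for c in fake_targets) >= observed:
            exceed += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (exceed + 1) / (n_perm + 1)

# Hypothetical data: 50 cells, of which cells 0-4 are targets with a clear increase.
responses = {cell: (1.0 if cell < 5 else 0.0) for cell in range(50)}
observed, p_value = shuffled_target_pvalue(responses, target_ids=[0, 1, 2, 3, 4])
```

      The null distribution here answers the question of how large the mean response would be if five cells were labelled as targets at random from the same FOV.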

      - References 55 and 56 don't explicitly state that excitatory neurons generally have stronger responses to sound than inhibitory neurons.

      Thank you for pointing out this error. We replaced those references with Maor et al. 2016 and Kerlin et al. 2010, showing excitatory neurons show more selective tuning, and also changed the wording more appropriately.

      - It is not explained whether the imaging sessions are performed on awake or anaesthetized animals. It is probably done on awake animals, but then it is not clear what procedure is used to get the animals used to the head restraint. It usually takes a few days for the mice to get used to it, and the stress level is often different at the beginning and end of an experiment. Given the experimental protocol used in the study, in which sessions are performed sequentially and compared to each other, this aspect could play a role. However, the main comparison made is probably safe as it compares a control condition (laser at 0mW) and conditions with optogenetic stimulation, all done with similar sequences of sessions.

      The experiment was conducted on awake animals. Although we did not have a control comparing their state at the beginning and the end of the experiment, all animals had a widefield imaging session to identify the A1 region, which uses the same head-fixation setup, so they were more used to the setup by the time we conducted 2-photon imaging and stimulation. Regardless of the session, if animals show any sign of extra discomfort due to the unfamiliar setup, we keep them there for 10-15 minutes until they are accustomed to the setup and no longer move. If they still show signs of discomfort, we take them out and try again on another day. We have now included this detail in the manuscript.

      Reviewer #3 (Recommendations for the authors):

      - Evaluate the global effect of stimulation on the population activity averaged across all neurons (activated and non-activated).

      Thank you for your suggestions. We now included a new Figure 3A that presents the population activity across all responsive cells. The average activity level did not differ among stimulation conditions (control, 16kHz stim, and 54kHz stim).

      - Evaluate with a simple model if a population of neurons with different sound tuning receiving non-specific inhibition would not produce the observed effect.

      Thank you for the suggestion. We generated a simple model in which a suppression term was applied either to all neurons or specifically to non-target co-tuned cells to test our results from the data. We used a similar range of numbers of neurons and FOVs so that the simulation closely matched the structure of the real dataset. For 50 simulated calcium traces of neurons (n),

      Trace<sub>n(t)</sub> = R<sub>n(t)</sub> – theta<sub>n</sub> + epsilon<sub>n(t)</sub>

      where R<sub>n(t)</sub> is the response amplitude from either the baseline or the stimulation session, theta<sub>n</sub> is a suppression term applied either to all neurons or only to non-target co-tuned neurons, only during the stimulation session, and epsilon<sub>n(t)</sub> is additive noise. Theta was defined based on the average increase in activity amplitude generated in target neurons by the stimulation, taken from the real dataset, with extra neuron-level jitter. As in the real data analyses, we compared the response change between the trace amplitudes of the stimulation and baseline sessions. Comparing the two model outcomes and the real data, we observed a significant effect of model type (F(2, 2535) = 34.943, p < 0.0001) and an interaction between model type and cell group (F(2, 2535) = 36.348, p < 0.0001). Applying suppression only to non-target co-tuned cells during the stimulation session yielded a significant response amplitude decrease for co-tuned cells compared to non co-tuned cells (F(1, 2535) = 45.62, p < 0.0001), which resembles the real data. In contrast, applying suppression to all non-target cells led to similar amplitude changes in both co-tuned and non co-tuned neurons (F(1, 2535) = 0.87, p = 0.35), which was not observed in either the real data or the simulated data restricted to co-tuned cell suppression. Therefore, the model supports the conclusion that suppression applied specifically to co-tuned neurons drove the real data outcome. All of this information is now added to the Methods and Results sections, and the figure is added as Figure 3C.
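
      The structure of this model can be illustrated with the toy simulation below. It follows the equation above (trace = R − theta + noise, with theta applied only in the stimulation session), but every numerical value is a hypothetical placeholder: in the actual analysis R and theta were derived from the real dataset, whereas here R is fixed at 1.0, theta at 0.2 with small Gaussian jitter, and the 50 neurons are split evenly into co-tuned and non co-tuned groups.

```python
import random
from statistics import mean

def simulate_change(suppress, n_neurons=50, n_trials=20,
                    theta=0.2, jitter_sd=0.02, noise_sd=0.05, seed=0):
    """Mean amplitude change (stimulation minus baseline) per cell group.

    suppress = "co_tuned": theta applied only to co-tuned non-target neurons
    suppress = "all"     : theta applied to every non-target neuron
    """
    rng = random.Random(seed)
    changes = {"co_tuned": [], "non_co_tuned": []}
    for n in range(n_neurons):
        group = "co_tuned" if n % 2 == 0 else "non_co_tuned"
        r = 1.0  # response amplitude R_n (placeholder, identical for all neurons)
        if suppress == "all" or group == "co_tuned":
            theta_n = theta + rng.gauss(0.0, jitter_sd)  # neuron-level jitter
        else:
            theta_n = 0.0
        # Baseline session: no suppression term. Stimulation session: R - theta.
        baseline = [r + rng.gauss(0.0, noise_sd) for _ in range(n_trials)]
        stim = [r - theta_n + rng.gauss(0.0, noise_sd) for _ in range(n_trials)]
        changes[group].append(mean(stim) - mean(baseline))
    return {g: mean(v) for g, v in changes.items()}

specific = simulate_change("co_tuned")  # suppression restricted to co-tuned cells
broad = simulate_change("all")          # non-specific suppression
```

      With these placeholder values, the specific variant produces a decrease in the co-tuned group and essentially no change in the non co-tuned group, while the broad variant produces a similar decrease in both groups, which is the qualitative contrast the two model types are meant to distinguish.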

    1. Author Response

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      This work describes the mechanism of protein disaggregation by the ClpL AAA+ protein of Listeria monocytogenes. Using several model substrate proteins the authors first show that ClpL possesses a robust disaggregase activity that does not further require the endogenous DnaK chaperone in vitro. In addition, they found that ClpL is more thermostable than the endogenous L. monocytogenes DnaK and has the capacity to unfold tightly folded protein domains. The mechanistic basis for the robust disaggregase activity of ClpL was also dissected in vitro and in some cases, supported by in vivo data performed in chaperone-deficient E. coli strains. The data presented show that the two AAA domains, the pore-2 site and the N-terminal domain (NTD) of ClpL are critical for its disaggregase activity. Remarkably, grafting the NTD of ClpL to ClpB converted ClpB into an autonomous disaggregase, highlighting the importance of such a domain in the DnaK-independent disaggregation of proteins. The role of the ClpL NTD domain was further dissected, identifying key residues and positions necessary for aggregate recognition and disaggregation. Finally, using sets of SEC and negative staining EM experiments combined with conditional covalent linkages and disaggregation assays the authors found that ClpL shows significant structural plasticity, forming dynamic hexameric and heptameric active single rings that can further form higher assembly states via their middle domains.

      Strengths:

      The manuscript is well-written and the experimental work is well executed. It contains a robust and complete set of in vitro data that push further our knowledge of such important disaggregases. It shows the importance of the atypical ClpL N-terminal domain in the disaggregation process as well as the structural malleability of such AAA+ proteins. More generally, this work expands our knowledge of heat resistance in bacterial pathogens.

      Weaknesses:

      There is no specific weakness in this work, although it would have helped to have a drawing model showing how ClpL performs protein disaggregation based on their new findings. The function of the higher assembly states of ClpL remains unresolved and will need further extensive research. Similarly, it will be interesting in the future to see whether the sole function of the plasmid-encoded ClpL is to cope with general protein aggregates under heat stress.

      We thank the reviewer for the positive evaluation. We agree with the reviewer that it will be important to test whether ClpL can bind to and process non-aggregated protein substrates. Our preliminary analysis suggests that the disaggregation activity of ClpL is most relevant in vivo, pointing to protein aggregates as main target.

      We also agree that the role of dimers or tetramers of ClpL rings needs to be further explored. Our initial analysis suggests a function of ring dimers as a resting state. It will now be important to study the dynamics of ClpL assembly formation and test whether substrate presence shifts ClpL assemblies towards an active, single ring state.

      Reviewer #2 (Public Review):

      The manuscript by Bohl et al. is an interesting and carefully done study on the biochemical properties and mode of action of potent autonomous AAA+ disaggregase ClpL from Listeria monocytogenes. ClpL is encoded on plasmids. It shows high thermal stability and provides Listeria monocytogenes food-pathogen substantial increase in resistance to heat. The authors show that ClpL interacts with aggregated proteins through the aromatic residues present in its N-terminal domain and subsequently unfolds proteins from aggregates translocating polypeptide chains through the central pore in its oligomeric ring structure. The structure of ClpL oligomers was also investigated in the manuscript. The results suggest that mono-ring structure and not dimer or trimer of rings, observed in addition to mono-ring structures under EM, is an active species of disaggregase.

      Presented experiments are conclusive and well-controlled. Several mutants were created to analyze the importance of a particular ClpL domain.

      The study's strength lies in the direct comparison of ClpL biochemical properties with autonomous ClpG disaggregase present in selected Gram-negative bacteria and well-studied E. coli system consisting of ClpB disaggregase and DnaK and its cochaperones. This puts the obtained results in a broader context.

      We thank the reviewer for the detailed comments. There are no specific weaknesses indicated in the public review.

      Reviewer #3 (Public Review):

      Summary:

      This manuscript details the characterization of ClpL from L. monocytogenes as a potent and autonomous AAA+ disaggregase. The authors demonstrate that ClpL has potent and DnaK-independent disaggregase activity towards a variety of aggregated model substrates and that this disaggregase activity appears to be greater than that observed with the canonical DnaK/ClpB co-chaperone. Furthermore, Lm ClpL appears to have greater thermostability as compared to Lm DnaK, suggesting that ClpL-expressing cells may be able to withstand more severe heat stress conditions. Interestingly, Lm ClpP can provide thermotolerance to E. coli that have been genetically depleted of either ClpB or in cells expressing a mutant DnaK103. The authors further characterized the mechanisms by which ClpL interacts with protein aggregates, identifying that the N-terminal domain of ClpL is essential for disaggregase function. Lastly, by EM and mutagenesis analysis, the authors report that ClpL can exist in a variety of larger macromolecular complexes, including dimer or trimers of hexamers/heptamers, and they provide evidence that the N-terminal domains of ClpL prevent dimer ring formation, thus promoting an active and substrate-binding ClpL complex. Throughout this manuscript the authors compare Lm ClpL to ClpG, another potent and autonomous disaggregase found in gram-negative bacteria that have been reported on previously, demonstrating that these two enzymes share homologous activity and qualities. Taken together this report clearly establishes ClpL as a novel and autonomous disaggregase.

      Strengths:

      The work presented in this report amounts to a significant body of novel and significant work that will be of interest to the protein chaperone community. Furthermore, by providing examples of how ClpL can provide in vivo thermotolerance to both E. coli and L. gasseri the authors have expanded the significance of this work and provided novel insight into potential mechanisms responsible for thermotolerance in food-borne pathogens.

      Weaknesses:

      The figures are clearly depicted and easy to understand, though some of the axis labeling is a bit misleading or confusing and may warrant revision. While I do feel that the results and discussion as presented support the authors' hypothesis and overall goal of demonstrating ClpL as a novel disaggregase, interpretation of the data is hindered as no statistical tests are provided throughout the manuscript. Because of this only qualitative analysis can be made, and as such many of the concluding statements involving pairwise comparisons need to be revisited or quantitative data with stats needs to be provided. The addition of statistical analysis is critical and should not be difficult, nor do I anticipate that it will change the conclusions of this report.

      We thank the reviewer for the valid criticism. We addressed the major concern of the reviewer and added the requested statistical analysis to all relevant figures. The analysis confirms our conclusions. We also followed the advice of the reviewer and revised axis labeling to increase clarity.

      Reviewer #1 (Recommendations For The Authors):

      • It would really help to have a model showing how ClpL performs protein disaggregation based on their findings.

      We show that ClpL exerts a threading activity that is fueled by ATP hydrolysis in both AAA domains and executed by pore-located aromatic residues. The basic disaggregation mechanism of ClpL therefore does not differ from ClpB and ClpG disaggregases. Similarly, the specificity of ClpL towards protein aggregates is based on simultaneous interactions of multiple N-terminal domains with the aggregate surface. We recently described a similar mode of aggregate recognition for ClpG [1]. We therefore prefer not to add a model to the manuscript. We are currently preparing a review that includes the characterization of the novel bacterial disaggregases and will present models there, as we consider a review article more appropriate for such illustrations.

      • AAA2 domain of ClpL in Fig 3E should be the same color as in Fig 1A.

      We used light grey instead of dark grey for the ClpL AAA2 domain in Fig 3E, to distinguish between ClpL and ClpB AAA domains. This kind of illustration allows for clearer separation of both AAA+ proteins and the fusion construct LN-ClpB*. We therefore prefer keeping the color code.

      • Partial suppression of the dnaK mutant could be added in the main manuscript Figure.

      The main figure 3 is already very dense and we therefore prefer showing respective data as part of a supplementary figure.

      • It would have been interesting to know if the robust autonomous disaggregation activity of ClpL would be sufficient to rescue the growth of more severe E. coli chaperone mutants, like dnaK tig for example. Did the authors test this?

      We tested whether expression of clpL can rescue growth of E. coli dnaK103 mutant cells at 40°C on LB plates. This experiment is different from the restoration of heat resistance in dnaK103 cells (Figure 3, figure supplement 2A), as continuous growth at elevated temperatures (40°C) is monitored instead of cell survival upon abrupt severe heat shock (49°C). We did not observe rescue of the temperature-sensitive growth phenotype (40°C) of dnaK103 cells upon clpL expression, though expression of clpG complemented the temperature-sensitive growth phenotype (see Author response image 1 below). This finding points to differences in chaperone activities of ClpL and ClpG. It also suggests that ClpL activity is largely restricted to heat-shock generated protein aggregates, enabling ClpL to complement the missing disaggregation function of DnaK but not other Hsp70 activities including folding and targeting of newly synthesized proteins. We believe that dissecting the molecular reasons for differences in ClpG and ClpL complementation activities should be part of an independent study and prefer showing the growth-complementation data only in the response letter.

      Author response image 1.

      Serial dilutions (10⁻¹ – 10⁻⁶) of E. coli dnaK103 mutant cells expressing E. coli dnaK, L. monocytogenes clpL or P. aeruginosa clpG were spotted on LB plates including the indicated IPTG concentrations. Plates were incubated at 30°C or 40°C for 24 h. p: empty vector control.

      Reviewer #2 (Recommendations For The Authors):

      Based on results presented in Fig. 2B the authors conclude "that stand-alone disaggregases ClpL and ClpG but not the canonical KJE/ClpB disaggregase exhibit robust threading activities that allow for unfolding of tightly folded domains" (page 5 line 209). In this experiment, the threading power of disaggregases was assessed by monitoring YFP fluorescence during the disaggregation of aggregates formed by fusion luciferase-YFP protein. In my opinion, the results of the experiment depend not only on the threading power of disaggregases but also on the substrate recognition by analyzed disaggregating systems and/or processivity of disaggregases. N-terminal domain in the case of ClpL and KJE chaperones in the case of the KJE/ClpB system are involved in recognition. This is not discussed in the manuscript and the obtained result might be misinterpreted. The authors have created the LN-ClpB* construct (N-terminal domain of ClpL fused to derepressed ClpB) (Fig. 3 E and F). In my opinion, this construct should be used as an additional control in the experiment in Fig. 2 B. It possesses the same substrate recognition domain and therefore the direct comparison of disaggregases threading power might be possible.

      We performed the requested experiment (new Figure 3 - figure supplement 2D). We did not observe unfolding of YFP by LN-ClpB. Since ClpL and LN-ClpB do not differ in their aggregate targeting mechanisms, this finding underlines the differences in threading power between ClpL and activated (derepressed) ClpB. It also suggests that the AAA threading motors and the aggregate-targeting NTD largely function independently.

      Presented results suggest that tetramer and dimer of rings might be a "storage form" of disaggregase. It would be interesting to analyze the thermotolerance and/or phenotype of ClpL mutants that do not form tetramer and dimer (E352A). This variant possesses similar to WT disaggregation activity but does not form dimers and tetramers. If in vivo the differences are observed (for example toxicity of the mutant), the "storage form" hypothesis will be probable.

      When testing expression of clpL-MD mutants (E352A, F354A), which cannot form dimers and tetramers of ClpL rings, in E. coli ∆clpB cells, we observed reduced production levels as compared to ClpL wildtype and speculated that reduced expression might be linked to cellular toxicity. We therefore compared spotting efficiencies of E. coli ∆clpB cells expressing clpL, ∆N-clpL or the clpL-MD mutants at different temperatures. Expression of clpL at high levels abrogated colony formation at 42°C (new Figure 6 - figure supplement 3). ClpL toxicity was dependent on its NTD, as no effect was observed upon expression of ∆N-clpL. ClpL-MD mutants (E352A, F354A) were expressed at much lower levels and exhibited strongly increased toxicity as compared to ClpL-WT when produced at comparable levels (new Figure 6 – figure supplement 3). This implies a protective role of ClpL ring dimers and tetramers in the cellular environment by downregulating ClpL activity. We envision that the formation of ClpL assemblies restricts accessibility of the ClpL NTDs and reduces substrate interaction. Increased toxicity of ClpL-E352A and ClpL-F354A points to a physiological relevance of the dimers and tetramers of ClpL rings and is in agreement with the proposed function as storage forms. We added this potential role of ClpL ring assemblies to the discussion section. Due to the strongly reduced production levels of ClpL MD mutants and their enhanced toxicity at elevated temperatures, we did not test for their ability to restore thermotolerance in E. coli ∆clpB cells.

      Figure 6G and Figure 6 -figure supplement 2 - it is not clear what is the difference in the preparation of WT and WTox forms of ClpL.

      ClpL WT was purified under reduced conditions (+ 2 mM DTT), whereas WTox was purified in absence of DTT, thus serving as control for ClpL-T355C, which forms disulfide bonds upon purification without DTT. We have added respective information to the figure legend and the materials and methods section.

      Page 5 line 250 - wrong figure citation. Instead of Figure 1 - Figure Supplement 2A should be Figure 3 - Figure Supplement 2A.

      Page 5 line 251 - wrong figure citation. Instead of Figure 1 - Figure Supplement 2B/C should be Figure 3 - Figure Supplement 2B/C.

      Page 7 line 315 - wrong figure citation. Instead of Figure 4F, it should be Figure 4G.

      Figure 1 - Figure Supplement 2E - At first glance, this Figure does not correspond to the text and is confusing. It would be nice to have bars for Lm ClpL activity in the figure. Alternatively, the description of the y-axis might be changed to "relative to Lm ClpL disaggregation activity" instead of "relative disaggregation activity". One has to carefully read the figure legend to find out that 1 corresponds to Lm ClpL activity.

      We have corrected all mistakes and changed the description of y-axis (Figure 1 - figure Supplement 2E) as suggested.

      Reviewer #3 (Recommendations For The Authors):

      (1) While the authors make many experimental comparisons throughout their study, no statistical tests are described or presented with their results or figures, nor are these statistical tests described in the methods. While the data as presented does appear to support the author's conclusions, without these statistical tests no meaningful conclusions from paired analysis can be drawn. Critically, please report these statistical tests. As a general suggestion please include the statistics (p-values) in the results section when presenting this data, as well as in the figure legends, as this will allow the reader to better understand the authors' presentation and interpretation of the data.

      We have added statistical tests to all relevant figures. The analysis confirms our earlier statements. We have further clarified our approach for the statistical analysis in the methods section. We report p-values in the results section; however, due to the number of comparisons, we did not add individual p-values to the figure legends but used standard labeling with stars.

      (2) Some of the axis labels for the presented graphs are a bit misleading or confusing. Many describe a relative (%) disaggregation rate, but it is not clear from the methods or figure legends what this rate is relative to. Is it relative to non-denatured substrates, to no chaperone conditions, etc.? Is it possible to present the figures with the raw data rates/activity (ex. luciferase activity / time) vs. relative rates? I think that labeling these figure axes with "disaggregation rate" is a bit misleading as none of these experiments measure the actual rate of disaggregation of these model substrates per se (say by SEC-MALS or other biophysical measurements), but instead infer the extent of disaggregation by measuring a property of these substrates, i.e. luciferase activity or fluorescence intensity over time. Thus, labeling these figures with the appropriate axis for what is being measured, and then clarifying in the methods and results what is being inferred by these measurements, will help solidify the author's conclusions.

      Relative (%) disaggregation rate usually refers to the disaggregation activity of ClpL wildtype serving as reference. We clarified this point in the revised text and respective figure legends. We now also refer to the process measured (e.g. relative refolding activity of aggregated Luciferase instead of relative disaggregation activity) as suggested by the reviewer and added clarifications to text and materials and methods.

      Since we have many measurements for our most frequently used assays and have a reasonable estimate for the general variance within these assays, we found it reasonable to show activity data in relation to fixed controls. This reduces the impact of unspecific variance and thereby allows more accurate comparisons between different repetitions. The reference is now indicated in the axis title.
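
      The normalization described above can be illustrated with a minimal sketch. It assumes that each measured rate is divided by the mean rate of the reference replicates (here ClpL wildtype); the function name and the numerical values are hypothetical and serve only as an illustration, not as the analysis pipeline actually used.

```python
from statistics import mean


def relative_activity(rates, reference_rates):
    """Express each measured rate as a fraction of the mean reference rate."""
    ref = mean(reference_rates)
    if ref <= 0:
        raise ValueError("reference activity must be positive")
    return [r / ref for r in rates]


# Hypothetical refolding rates (arbitrary units per minute), for illustration only.
clpl_wt = [10.0, 12.0, 11.0]   # reference replicates: ClpL wildtype
mutant = [5.5, 6.6, 4.4]       # replicates of a hypothetical mutant

# Each mutant replicate relative to the wildtype mean (1.0 = wildtype level).
rel = relative_activity(mutant, clpl_wt)
```

      With this convention a value of 1 corresponds to the reference activity, which is why the reference has to be named in the axis title.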

      (3) The figures are well presented, clutter-free, and graphically easy to understand. Figure legends have sufficient information aside from the aforementioned statistical information and should include the exact number of independent replicates for each panel/experiment (ex. n=4), not just a greater than 3. While the figures do show each data point along with the mean and error, in some figures it is difficult to determine the number of replicate data points. Example figures 2c, 2d, and 3a. Also, please state whether the error is std. error or SEM.

      While we agree that this is valuable information, we fear that overloading the figure legends with information may take a toll on readability. We therefore decided to provide the number of replicates for each experiment in a separate supplementary table (Table S2). The depicted error represents the SD and not the SEM, which we have also specified in the figure legends.
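
      As a reminder of why the distinction between SD and SEM matters for readers, the following sketch computes both for the same replicates. The values are hypothetical, and the sample standard deviation (n - 1 denominator) is an assumption about how the SD would be calculated.

```python
from math import sqrt
from statistics import stdev


def sd_and_sem(values):
    """Return the sample standard deviation and the standard error of the mean."""
    sd = stdev(values)              # sample SD, n - 1 in the denominator
    sem = sd / sqrt(len(values))    # SEM shrinks as the number of replicates grows
    return sd, sem


# Hypothetical replicate measurements, for illustration only.
replicates = [2.0, 4.0, 4.0, 6.0]
sd, sem = sd_and_sem(replicates)
```

      For four replicates the SEM is half the SD, so error bars of the two kinds are not interchangeable.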

      (4) There are various examples throughout the results where qualitative descriptors are used to describe comparisons. Examples of this are "hardly enhanced" (Figure 1) and "partially reduced" (Figure 6). While this is not necessarily wrong, qualitative descriptions of comparisons in this manner would require further explanation. What is the definition of "hardly" or "partially"? My recommendation is to just state the data quantitatively, such as "% enhanced" or "reduced by x", this way there is no misinterpretation. Examples of this can be found in Figures 6C-G. This would require a full statistical overview and presentation of these stats in the results.

      We followed the reviewer's advice and no longer use the terms criticized (e.g. “hardly enhanced”). We instead provide the requested quantifications in the text.

      Questions for Figures:

      Figures 1B and 1C:

      (1) Is the disaggregase activity of ClpL towards heat-denatured luciferase and GFP ATP-dependent? While the authors later in the manuscript show that mutations within the Walker B domains dramatically impair reactivation (disaggregation) of denatured luciferase, this does not rule out an ATP-independent effect of these mutations. Thus, the authors should test whether disaggregase activity is observed when wild-type ClpL is incubated with denatured substrates without ATP present or in the presence of ADP only.

      We tested for ClpL disaggregation activity in absence of nucleotide and presence of ADP only (new Figure 1 – figure supplement 2A). We did not observe any activity, demonstrating that ClpL activity depends on ATP binding and hydrolysis (see also Figure 3 – figure supplement 1D: ATPase-deficient ClpL-E197A/E530A is lacking disaggregation activity).

      (2) The authors suggest that a reduction in disaggregase activity observed in samples combining Lm ClpL and KJE (Figure 1C, supp. 1C-E) could be due to competition for protein aggregate binding as observed previously with ClpG. Did the authors test this directly by pulldown assay or another interaction-based assay? While ClpL and ClpG appear to work in a similar manner, it would be good to confirm this. Also, clarification on how this competition operates would be useful. Is it that ClpL prevents aggregates from interacting with KJE, or vice versa?

      We probed for binding of ClpL to aggregated Malate Dehydrogenase in the presence of L. monocytogenes or E. coli Hsp70 (DnaK + respective J-domain protein DnaJ) by a centrifugation-based assay. Here, we used the ATPase-deficient ClpL-E197A/E530A (ClpL-DWB) mutant, ensuring stable substrate interaction in presence of ATP. We observed reduced binding of ClpL-DWB to protein aggregates in presence of DnaK/DnaJ (new Figure 1 – figure supplement 2G). This finding indicates that both chaperones compete for binding to aggregated proteins and explains the inhibition of ClpL disaggregation activity in presence of Hsp70.

      (3) Related to the above, while incubation of aggregated substrates with ClpL and KJE does appear to reduce aggregase activity towards GFP (Figure 1c), α-glucosidase (Supp. 1C), and MDH (Supp. 1D), this doesn't appear to be the case towards luciferase (Figure 1b, Supp. 1b). Furthermore, ClpL aggregase activity is reduced towards luciferase when combined with E. coli KJE (Supp. 1e) but not with Lm KJE (Figure 1b). The authors provide no commentary or explanation for these observations. Furthermore, these results complicate the concluding statement that "combining ClpL with Lm KJE always led to a strong reduction in disaggregation activity ... ".

      We suggest that the differing inhibitory degrees of the KJE system on ClpL disaggregation activities reflect diverse binding affinities of KJE and ClpL to the respective aggregates. While we usually observe strong inhibition of ClpL activity in presence of KJE, this is different for aggregated Luciferase. This points to specific structural features of Luciferase aggregates or the presence of distinct binding sites on the aggregate surface that favour ClpL binding. We have added a respective comment to the revised manuscript.

      The former statement that “combining ClpL with Lm KJE always led to a strong reduction in disaggregation activity” referred to aggregated GFP, MDH and α-Glucosidase for which a strong inhibition of ClpL activity was observed. We have specified this point.

      Figures 1D and 1E:

      (1) The authors conclude that the heat sensitivity of ΔClpL L. gasseri cells is because they do not express the canonical ClpB disaggregase. A good test to validate this would be to express KJE/ClpB in these Lg ΔClpL cells to see if heat-sensitivity could be fully or partially rescued.

      We agree that such an experiment would further support the in vivo function of ClpL as an alternative disaggregase. However, such an approach would require co-expression of E. coli ClpB with the authentic E. coli DnaK chaperone system (KJE), as ClpB and DnaK cooperate in a species-specific manner [2-4]. This makes the experiment challenging, also because the individual components need to be expressed at the correct stoichiometry. Furthermore, the presence of the authentic L. gasseri KJE system, which is likely competing with the E. coli KJE system for aggregate binding, will hamper E. coli KJE/ClpB disaggregation activity in L. gasseri. In view of these limitations, we would like to refrain from conducting such an experiment.

      (2) The rationale for investigating Lg ClpL, and the aggregase activity assays are compelling and support the hypothesis that ClpL contributes to thermotolerance in multiple Gram-positive species. Though, from Figure 1d, why was only Lg ClpL investigated? It appears that S. thermophilus also lacks the canonical ClpB disaggregase and demonstrates ΔClpL heat sensitivity. There are also other Lactobacillus sp. presented that lack ClpB but were not tested for heat sensitivity. Why only test and move forward with L. gasseri? Lastly, L. mesenteroides is ClpB-negative but doesn't demonstrate ΔClpL heat sensitivity. Why?

      We wanted to document high, partner-independent disaggregation activity for another ClpL homolog. We chose L. gasseri, as (i) this bacterial species lacks a ClpB homolog and (ii) a ∆clpL mutant exhibits reduced survival upon severe heat shock (thermotolerance phenotype), which is associated with defects in cellular protein disaggregation. The characterization of L. gasseri ClpL as a potent disaggregase in vitro represents a proof-of-concept and allows us to generalize our conclusion. We therefore did not further test S. thermophilus ClpL. L. mesenteroides encodes ClpL but not ClpB; however, to the best of our knowledge, a ∆clpL mutant has not yet been characterized in this species. As we wanted to link ClpL in vitro activity with an in vivo phenotype, we did not characterize L. mesenteroides ClpL.

      We agree with the reviewer that the characterization of additional ClpL homologs is meaningful and interesting, however, we strongly believe that such analysis should be part of an exhaustive and independent study.

      Figures 2A and 2B:

      (1) Figure 2B demonstrates that both ClpL and ClpG, but not the canonical KJE/ClpB, are able to unfold YFP during the luciferase disaggregation process, suggesting that ClpL and ClpG exhibit stronger threading activity. A technical question, can luciferase activity be measured alongside in the same assay sample? If so, would you expect to observe a concomitant increase in luciferase activity as YFP fluorescence decreases?

      KJE/ClpB can partially disaggregate and refold aggregated Luciferase-YFP without unfolding YFP during the disaggregation reaction [5]. YFP unfolding is therefore not linked to refolding of aggregated Luciferase-YFP. On the other hand, unfolding of YFP during disaggregation can hamper the refolding of the fused Luciferase moiety, as observed for the AAA+ protein ClpC in presence of its partner MecA [5]. These diverse effects make the interpretation of Luciferase-YFP refolding experiments difficult, as the degree of YFP unfolding activity does not necessarily correlate with the extent of Luciferase refolding. We therefore refrained from performing the suggested experiment.

      Figure 2C and 2D:

      (1) Thermal shift assays for ClpL, ClpG, and DnaK were completed with various nucleotides. Were these experiments also completed with samples in their nucleotide-free apo state? Also, while all these chaperones are ATPases, the nucleotides used differ, but no explanation is provided. Comparison should be made of these ATPases bound to the same molecules.

      We did not monitor thermal stabilities of chaperones without nucleotide, as such a state is likely not relevant in vivo. We used ATPγS in the case of ClpL to keep the AAA+ protein in the ATP conformation. ATP would be rapidly converted to ADP due to the high intrinsic ATPase activity of ClpL. In the case of DnaK, ATPγS cannot be used as it does not induce the ATP conformation [6]. The low intrinsic ATPase activity of DnaK allows determining the thermal stability of its ATP conformation in presence of ATP. This is confirmed by the reduced thermal stability determined for ADP-bound DnaK.

      (2) The authors suggest that incubation at 55⁰C will cause unfolding of Lm DnaK, but not ClpL, providing ClpL-positive Lm cells disaggregase activity at 55⁰C. While the thermal shift assays in Figures 2C and 2D support this, an experiment to test this would be to heat-treat Lm DnaK and ClpL at 55⁰C then test for disaggregase activity using either aggregated luciferase or GFP as in Figure 1.

      We followed the suggestion of the reviewer and incubated Lm ClpL and DnaK at 55-58°C in presence of ATP for 15 min prior to their use in disaggregation assays. We compared the activities of pre-heated chaperones with controls that were incubated at 30°C for 15 min. Notably, we did not observe a loss of DnaK disaggregation activity, suggesting that thermal unfolding of DnaK at this temperature is reversible. We provide these data as Figure 2 -figure supplement 1 and added a respective statement to the revised manuscript.

      Figure 3B:

      (1) The authors state that ATPase activity of ΔN-ClpL was "hardly affected", but from the data provided it appeared to result in an approximate 35% reduction. As discussed above, no stats are provided for this figure, but given the error bars, it is highly likely that this reduction is significant. Please perform this statistical test, and if significant, please reflect this in the written results as well as the figure. Lastly, if this reduction in ATPase activity is significant, why would this be so, and could this contribute to the reduction in aggregase activity towards luciferase and MDH observed in Figure 3A?

      We applied statistical tests as suggested by the reviewer, showing that the reduction in ATPase activity of ∆N-ClpL is statistically significant. N-terminal domains of Hsp100 proteins can modulate ATPase activity, as shown for the family member ClpB, functioning as auxiliary regulatory element for fine tuning of ClpB activity [7]. We speculate that the impact of the ClpL-NTD on the assembly state (stabilization of ClpL ring dimers) might affect ClpL ATPase activity. We would like to point out that other ClpL mutants (e.g. NTD mutant ClpL-Y51A; MD mutant ClpL-F354A) have a similarly reduced ATPase activity, yet exhibit substantial disaggregation activity (approx. 2-fold reduced compared to ClpL wildtype). In contrast, ∆N-ClpL does not exhibit any disaggregation activity. This suggests that the loss of disaggregation activity is caused by a substrate binding defect but not by a partial reduction in ATPase activity. We added a comment on the reduced ATPase activity and also discuss its potential reasons in the discussion section.

      (2) I think the authors' conclusion that deletion of the ClpL NTD does not contribute to structural defects of ClpL is premature given the apparent reduction in ATPase activity. Did the authors perform any biophysical analysis of ΔN-ClpL to confirm this conclusion? Thermal shift assays, Native-PAGE, or size-exclusion chromatography for aggregates would all be good assays to demonstrate that the wild-type and ΔN-ClpL have similar structural properties. Surprisingly, Figure 6 describes significant macromolecular changes associated with ΔN-ClpL such that it preferentially forms a dimer of rings. Furthermore, in Supp. Figure 6D the authors report that ΔN-ClpL appears to have an increased Tm as compared to WT- or ΔM-ClpL. The authors should reflect these observations as deletion of the ClpL NTD does appear to contribute to structural changes, though perhaps only at the macromolecular scale, i.e. dimerization of the rings.

      We have characterized the oligomeric state of ∆N-ClpL by size exclusion chromatography (Figure 6 – figure supplement 1A) and negative staining electron microscopy (Figure 6C), both showing that it forms assemblies similar to ClpL wildtype. We did not observe an increased tendency of ∆N-ClpL to form aggregates, and the protein remained fully soluble after several cycles of thawing and freezing. EM data reveal that ∆N-ClpL exclusively forms ring dimers, suggesting that the NTDs destabilize MD-MD interactions. The stabilized interaction between two ∆N-ClpL rings can explain the increased thermal stability (Figure 6 – figure supplement 1D). We speculate that the ClpL NTDs either affect MD-MD interactions through steric hindrance or by directly contacting MDs. We have added a respective statement to the discussion section.

      Figure 3C and 3D:

      (1) Given the larger error in samples expressing ClpG (100) or ClpL (100) statistical analysis with p-values is required to make conclusions regarding the comparison of these samples vs. plasmid-only control. The effect of ΔN-ClpL vs. wild-type ClpL looks compelling and does appear to attenuate the ClpL-induced thermotolerance. This is nicely demonstrated in Figure 3D.

      We quantified respective spot tests (new Figure 3E) and tested for statistical significance as suggested by the reviewer. We show that restoration of heat resistance is significant for the first 30 min. While we always observe rescue at later timepoints, significance is lost there due to larger deviations in the number of viable cells and thus in the degree of complementation.

      Figure 3F:

      (1) What is the role of the ClpB NTD? It appears to be dispensable for disaggregase activity, assuming that ClpB is co-incubated with KJE. A quick explanation of this domain in ClpB could be useful.

      The ClpB NTD is not required for disaggregation activity, as ClpB is recruited to protein aggregates by DnaK, which interacts with the ClpB MDs. Still, two functions have been described for the ClpB NTD. First, it can bind soluble unfolded substrates such as casein [8]. This substrate binding function can increase ClpB disaggregation activity towards some aggregated model substrates (e.g. Glucose-6-phosphate dehydrogenase) [9]. However, NTD deletion usually does not decrease ClpB disaggregation activity and can even lead to an increase [7, 10, 11]. An increased disaggregation activity of ∆N-ClpB correlates with an enhanced ATPase activity, which is explained by NTDs stabilizing a repressing conformation of the ClpB MDs, which function as main regulators of ClpB ATPase activity [7]. We added a short description on the role of the ClpB NTD to the respective results section.

      (2) The result of fusing the ClpL NTD to ClpB supports a role for this NTD in promoting autonomous disaggregase activity. What would you expect to observe if the fused Ln-ClpB protein was co-incubated with KJE? Would this further promote disaggregase activity, or potentially impair through competition? This experiment could potentially support the authors' hypothesis that ClpL and ClpB/KJE can compete with each other for aggregated substrates as suggested in Figure 1.

      We have performed the suggested experiment using aggregated MDH as model substrate. We did not observe an inhibition of LN-ClpB disaggregation activity in presence of KJE. In contrast, ClpL disaggregation activity towards aggregated MDH is inhibited upon addition of KJE due to competition for aggregate binding (Figure 1 – figure supplement 2D/F). Disaggregation activity of LN-ClpB in presence of KJE can be explained by functional cooperation between both chaperone systems, which involves interactions between aggregate-bound DnaK and the ClpB MDs of the LN-ClpB fusion construct. We prefer showing these data only in the response letter but not including them in the manuscript, as respective results distract from the main message of the LN-ClpB fusion construct: the ClpL NTD functions as autonomous aggregate-targeting unit that can be transferred to other Hsp100 family members.

      Author response image 2.

      LN-ClpB cooperates with DnaK in protein disaggregation. Relative MDH disaggregation activities of indicated disaggregation systems were determined. KJE: DnaK/DnaJ/GrpE. The disaggregation activity of Lm ClpL was set to 1. Statistical Analysis: One-way ANOVA, Welch’s Test for post-hoc multiple comparisons. Significance levels: **p < 0.001. n.s.: not significant.
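
      For readers unfamiliar with Welch's test named in the legend above, the sketch below computes the Welch t statistic and the Welch–Satterthwaite degrees of freedom for one pairwise comparison. It is a simplified illustration under stated assumptions: the input values are hypothetical, the ANOVA step and any correction for multiple comparisons are omitted, and it is not the software or procedure used for the figure.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's unequal-variance t statistic and its approximate degrees of freedom."""
    va = variance(a) / len(a)   # squared standard error of group a
    vb = variance(b) / len(b)   # squared standard error of group b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical relative activities for two conditions, for illustration only.
group_a = [1.0, 2.0, 3.0]
group_b = [2.0, 4.0, 6.0]
t_stat, dof = welch_t(group_a, group_b)
```

      Unlike Student's t-test, this statistic does not assume equal variances in the two groups, which suits replicate sets with different spread.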

      Figures 4E and 4F:

      (1) While the effect of various NTD mutations follows a similar trend in regard to the impairment of ClpL-mediated disaggregation of luciferase and MDH, the degree of these effects does appear different. For example, patch A and C mutations reduce ClpL disaggregase activity towards luciferase (~60% / 50% reduction) vs. MDH (>90%) respectively. While these results do suggest a critical role for residues in patches A and C of ClpL, these substrate-specific differences are not discussed. Why would we expect a difference in the effect of these patch A/C ClpL mutations on different substrates?

      We speculate that the aggregate structure and the presence or distributions of ClpL NTD binding sites differ between aggregated Luciferase and MDH. A difference between both aggregated model substrates was also observed when testing for an inhibitory effect of Lm KJE (and Ec KJE) on ClpL disaggregation activity (see comment above). We speculate that the mutated NTD residues make specific contributions to aggregate recognition. The severity of binding defects (and reduction of disaggregation activities) of these mutants will depend on specific features of the aggregated model substrates. We now point out that ClpL NTD patch mutants can differ in disaggregation activities depending on the aggregated model substrate used and refer to potential differences in aggregate structures.

      (2) The authors suggest that the loss of disaggregation activity of selected NTD mutants could be linked to reduced binding to aggregated luciferase. While this is likely given that these mutations do not appear to affect ATPase activity (Supp. 4), it could be possible that these mutants can still bind to aggregated luciferase and some other mechanism may impair disaggregation. A pull-down assay would help to prove whether reduced binding is observed in these NTD ClpL mutants. This also needs to be confirmed for Supp. Figure 4.2H.

      We have shown a strong correlation between loss of aggregate binding and disaggregation activity for several NTD mutants (Fig. 4G, Figure 4 – figure supplement 2H). We decided to perform the aggregate binding assay only with mutants that show a full but not a partial disaggregation defect, as in our experience the centrifugation-based assay provides clear and reproducible results for loss-of-activity mutants but has limitations in revealing differences for partially affected mutants. This might be explained by the use of nonhydrolyzable ATPγS in these experiments, which strongly stabilizes substrate interactions, potentially masking partial binding defects. We agree with the reviewer that some ClpL NTD mutants might have additional effects on disaggregation activity by e.g. controlling substrate transfer to the processing pore site. We have added a respective comment to the revised manuscript.

      (3) Supp. Figure 4.2H has no description in the figure legend. The Y-axes states % aggregate bound to chaperone. How was this measured? See the above comments for Figures 4E and 4F.

      We apologize for the omission and have added the description to the figure legend. The determination of % aggregate-bound chaperone is based on the quantification of chaperones present in the supernatant and pellet fractions after sample centrifugation. Background levels of chaperones in the pellet fractions in absence of protein aggregates were subtracted. We added this information to the materials and methods section.
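
      The calculation described above can be sketched as follows. The sketch assumes that band intensities of the chaperone in pellet and supernatant are available as numbers, that the bound fraction is the pellet share of the total, and that the background is the pellet share measured without aggregates; function names and values are hypothetical.

```python
def pellet_fraction(pellet, supernatant):
    """Percentage of the chaperone signal found in the pellet fraction."""
    total = pellet + supernatant
    if total <= 0:
        raise ValueError("total signal must be positive")
    return 100.0 * pellet / total


def percent_aggregate_bound(pellet, supernatant, bg_pellet, bg_supernatant):
    """Pellet share with aggregates minus the pellet share without aggregates."""
    bound = pellet_fraction(pellet, supernatant) - pellet_fraction(bg_pellet, bg_supernatant)
    return max(bound, 0.0)  # clamp small negative values caused by noise


# Hypothetical band intensities (arbitrary units), for illustration only.
pct_bound = percent_aggregate_bound(
    pellet=60.0, supernatant=40.0,       # sample with protein aggregates
    bg_pellet=5.0, bg_supernatant=95.0,  # control without aggregates
)
```

      Subtracting the aggregate-free control ensures that chaperone pelleting on its own is not counted as aggregate binding.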

      Figure 6G:

      The authors observed reduced disaggregase activity and ATPase activity of mutant T355C under both oxidative and reducing conditions. While this observation under oxidative conditions supports the authors' hypothesis, under reducing conditions (+DTT) we would expect the enzyme to behave similarly to wild-type ClpL unless this mutation has other effects. Can the authors please comment on this and provide an explanation or hypothesis?

      The reviewer is correct: ClpL-T355C exhibits a reduced disaggregation activity (Figure 6 – figure supplement 2B). We observe a similar reduction in disaggregation activity for the ClpL MD mutant F354A, pointing to an auxiliary function of the MD in protein disaggregation. We have made a respective comment in the discussion section of the revised manuscript. How exactly ClpL MDs support protein disaggregation is currently unclear and will be subject of future analysis in the lab. We strongly believe that such analysis should be part of an independent study.

      Discussion:

      In the fourth feature, it is discussed that one disaggregase feature of ClpL is that it does not cooperate with the ClpP protease. While a reference is provided for the canonical ClpB, no data in this paper, nor a reference, is provided demonstrating that ClpL does not interact with ClpP. As discussed, it is highly unlikely that ClpL interacts with ClpP given that ClpL does not contain the IGL/F loops that mediate the interaction of ClpP with cochaperones, such as ClpX, but data or a reference is needed to make such a factual statement.

      The absence of the IGL/F loop makes an interaction between ClpL and ClpP highly unlikely. However, the reviewer is correct: direct evidence for a ClpP-independent function of ClpL, though very likely, is not provided. We have therefore rephrased the respective statement: “Fourth, novel disaggregases lack the specific IGL/F signature motif, which is essential for cooperation of other Hsp100 proteins with the peptidase ClpP. This feature is shared with the canonical ClpB disaggregase [12] suggesting that protein disaggregation is primarily linked to protein refolding.”.

      References

      (1) Katikaridis P, Simon B, Jenne T, Moon S, Lee C, Hennig J, et al. Structural basis of aggregate binding by the AAA+ disaggregase ClpG. J Biol Chem. 2023:105336.

      (2) Glover JR, Lindquist S. Hsp104, Hsp70, and Hsp40: A novel chaperone system that rescues previously aggregated proteins. Cell. 1998;94:73-82.

      (3) Krzewska J, Langer T, Liberek K. Mitochondrial Hsp78, a member of the Clp/Hsp100 family in Saccharomyces cerevisiae, cooperates with Hsp70 in protein refolding. FEBS Lett. 2001;489:92-6.

      (4) Seyffer F, Kummer E, Oguchi Y, Winkler J, Kumar M, Zahn R, et al. Hsp70 proteins bind Hsp100 regulatory M domains to activate AAA+ disaggregase at aggregate surfaces. Nat Struct Mol Biol. 2012;19:1347-55.

      (5) Haslberger T, Zdanowicz A, Brand I, Kirstein J, Turgay K, Mogk A, et al. Protein disaggregation by the AAA+ chaperone ClpB involves partial threading of looped polypeptide segments. Nat Struct Mol Biol. 2008;15:641-50.

      (6) Theyssen H, Schuster H-P, Bukau B, Reinstein J. The second step of ATP binding to DnaK induces peptide release. J Mol Biol. 1996;263:657-70.

      (7) Iljina M, Mazal H, Goloubinoff P, Riven I, Haran G. Entropic Inhibition: How the Activity of a AAA+ Machine Is Modulated by Its Substrate-Binding Domain. ACS Chem Biol. 2021;16:775-85.

      (8) Rosenzweig R, Farber P, Velyvis A, Rennella E, Latham MP, Kay LE. ClpB N-terminal domain plays a regulatory role in protein disaggregation. Proc Natl Acad Sci U S A. 2015;112:E6872-81.

      (9) Barnett ME, Nagy M, Kedzierska S, Zolkiewski M. The amino-terminal domain of ClpB supports binding to strongly aggregated proteins. J Biol Chem. 2005;280:34940-5.

      (10) Beinker P, Schlee S, Groemping Y, Seidel R, Reinstein J. The N Terminus of ClpB from Thermus thermophilus Is Not Essential for the Chaperone Activity. J Biol Chem. 2002;277:47160-6.

      (11) Mogk A, Schlieker C, Strub C, Rist W, Weibezahn J, Bukau B. Roles of individual domains and conserved motifs of the AAA+ chaperone ClpB in oligomerization, ATP-hydrolysis and chaperone activity. J Biol Chem. 2003;278:15-24.

      (12) Weibezahn J, Tessarz P, Schlieker C, Zahn R, Maglica Z, Lee S, et al. Thermotolerance Requires Refolding of Aggregated Proteins by Substrate Translocation through the Central Pore of ClpB. Cell. 2004;119:653-65.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1

      1) Here are a few sentences that could potentially benefit from further discussion, particularly in the context of the plant developmental framework of an effective germline. It is important to note that the idea of an effective germline is supported by many, but not all, scientists. Nevertheless, as long as this concept remains relevant, a discussion based on it may be appropriate.

      The early establishment of germlines during development is crucial in addressing the impact of somatic mutation on the next generation. To emphasize this aspect, we have included an additional sentence addressing this point in ll. 242–244.

      2) Lines 161-163: The suggestion that long-lived tropical trees do not necessarily suppress somatic mutation rates to the same extent as their temperate counterparts might warrant additional examination.

      We have revised our statement to present a more balanced perspective, and we have also included a sentence to emphasize the importance of conducting further studies in future.

      3) Lines 200-202: The observation of potential influences of GC-biased gene conversion during meiosis or biased purifying selection for C>T inter-individual nucleotide substitutions could be further elaborated upon.

      Our data does not provide enough information to delve into a more detailed discussion regarding GC-biased gene conversion during meiosis or biased purifying selection for C>T substitution. However, future studies that obtain genome sequences from somatic cells, male or female gametophytes, and offspring (such as seeds or seedlings) would offer opportunities to assess these phenomena.

      4) Line 245: The statement "somatic mutations can be transmitted to seeds" might be correct, but it would be helpful to explore the extent to which this occurs.

      In response to the comments from Reviewer 1 (#4) and Reviewer 2 (#16), we have decided to remove the discussion about the heritability of somatic mutations in the next generation. We have completely rewritten the final paragraph to discuss the possibility of a disparity in the relationship between lifespan and somatic mutation rates between plants and animals.

      Reviewer #2

      5) l. 108-115: The authors seem to have done really great work assembling and annotating two reference genomes. Even if this does not represent the main result of the manuscript, these genomic resources are a plus for the community, especially given that reference genomes from tropical trees are known to be underrepresented in the literature (e.g. Plomion et al. 2016). The authors have made the particular effort of generating two high-quality reference genome assemblies for two species of the same genus, including one with an excellent contiguity. Even if they do not explicitly indicate the divergence time between the two species, it is clear that the cheapest solution would have been to map the reads of the two species against a single assembly, but this could have generated some biases. So by generating two de novo assemblies, the authors have used here the best design possible to control for some potential biases for the detection of somatic mutations. However, given the interest these two assemblies represent by themselves, I consider that a couple of additional investigations could have been made on local synteny and orthologous genes in particular. Thanks to whole-genome alignments and orthology (e.g. Lovell et al. 2022), they could have generated more general information regarding the two assemblies and investigated additional questions regarding mutations, e.g. mutations in collinear / non-collinear (if any) segments, intensity of purifying selection (or neutral evolution) at single vs. multiple copies or between shared vs. private genes, etc.

      To address the comment by Reviewer 2, we performed a synteny analysis using MCScanX in TBtools-II and added Supplementary Figure 3 to illustrate the conserved synteny relationship between S. laevis and S. leprosula. Detecting selection in the genome will be the subject of a future study, as our current data are not sufficient for this aim because of the limited number of individuals (n = 2 for each species).

      6) l. 123-124. Here, the authors indicate that they have "validated" 93.9% of the mutations. It would be more accurate to indicate that they have "validated" 31/33 mutations (94%), 22/24 mutations on S1 and 9/9 on S2 (Table S5). Can the authors indicate why no somatic mutations from the F1 and F2 were tested? According to me, the use of the word "validation" is not totally accurate (see also Schmitt et al. 2022), since amplicon sequencing can be viewed as a kind of validation but it doesn't represent a complete validation since it represents new sequencing data that are mapped against the same reference assembly, in such a way that we could always imagine that the same biases are at play, leading to a similarly false positive call. Reciprocally, a "non-validated" mutation could be associated to a mutation that is at a too low allele frequency, at least after amplification, in such a way that the call is not heterozygous despite the fact that the mutation is real. I think that another terminology than "validated" could be used, plus one or two sentences explaining this degree of complexity.

      To improve the clarity of the statement, we have modified the sentence as follows: “We conducted an independent evaluation of a subset of the inferred single nucleotide variants (SNVs) using amplicon sequencing. Our analysis demonstrated accurate annotation for 31 out of 33 mutations (94% overall), with 22 out of 24 mutations on S1 and all 9 mutations on S2 (Supplementary Table 5).”

      While we did not conduct additional assessments using F1 and F2, we anticipate a similarly high level of agreement between the somatic SNV calls and amplicon sequencing in these trees. We have included sentences in the Materials and Methods section to elucidate the challenges involved in validating true somatic mutations.
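
      The agreement figures quoted above can be reproduced with a few lines of arithmetic. The sketch below uses only the counts stated in the response (22/24 on S1, 9/9 on S2, 31/33 overall). The Wilson score interval is an addition made here for illustration, to show how wide the uncertainty is with so few tested sites; it is not an analysis reported by the authors, and the function names are hypothetical.

```python
import math


def concordance(n_confirmed: int, n_tested: int) -> float:
    """Fraction of tested SNV calls that agreed with amplicon sequencing."""
    if n_tested <= 0:
        raise ValueError("n_tested must be positive")
    return n_confirmed / n_tested


def wilson_interval(n_confirmed: int, n_tested: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a binomial proportion.

    Illustrative addition only; the authors did not report an interval.
    """
    p = concordance(n_confirmed, n_tested)
    denom = 1 + z ** 2 / n_tested
    center = (p + z ** 2 / (2 * n_tested)) / denom
    half = z * math.sqrt(p * (1 - p) / n_tested + z ** 2 / (4 * n_tested ** 2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at p = 0 or p = 1.
    return max(0.0, center - half), min(1.0, center + half)


# Counts taken from the response text (Supplementary Table 5 of the manuscript).
counts = {"S1": (22, 24), "S2": (9, 9), "overall": (31, 33)}
for label, (k, n) in counts.items():
    low, high = wilson_interval(k, n)
    print(f"{label}: {k}/{n} = {concordance(k, n):.1%} "
          f"(95% Wilson interval {low:.1%} to {high:.1%})")
```

      The overall value, 31/33, rounds to 93.9%, which matches the figure the reviewer quotes from the original manuscript.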

      7) l. 135-137: the reasoning appears to be quite circular to me. As indicated by the authors in the line just before, an incongruent pattern could also be explained biologically, in such a way that the overall congruency between the phylogenetic tree and the tree architecture cannot be considered as a way to prove the reliability of the detection. In some species, it seems clear that the phylogenetic tree does not follow the plant architecture (Zahradnikova et al. 2020), in such a way that we should argue to not consider the plant architecture in the design and not consider that this represents either a way to validate mutations or a way to validate the methodological framework. I suggest removing this sentence.

      We have removed the sentence as suggested by Reviewer 2.

      8) l. 150. It seems that the differences in length and diameter between the two species come from two different studies and therefore that no statistical test has been performed to test its significance.

      We agree with Reviewer 2. To clarify this point, we have replaced “significantly” with “substantially” in the revised text.

      9) l. 156-159: the same sentence is repeated twice.

      We have removed the repeated sentence.

      10) l. 159-161: Comparing somatic mutation rates between studies is difficult. It is too sensitive to the methodology used, here again see Schmitt et al. 2022. I propose to remove these two sentences. It represents an interesting working hypothesis but would require a better design, or at least, to reanalyze all the data with the same pipeline.

      We have toned down our statement, and added a sentence that additional studies are required to compare somatic mutation rates among trees in tropical, temperate, and boreal regions, employing standardized methodologies.

      11) l. 171-175: Here I am wondering if the authors could provide more information regarding the enrichment at CpG sites? I suggest first estimating the proportion of CpG sites thanks to the two genome assemblies and then using this information as a way to weight the results and therefore to estimate the level of enrichment of mutations at CpG sites.

      In response to the comment by Reviewer 2, we first determined the proportion of CpG sites as 0.030 and 0.028 for S. laevis and S. leprosula, respectively, based on the triplet matrix using the reference genome of each species. Subsequently, we estimated the proportion of somatic mutations at CpG sites. The results revealed a 4.54-fold and 3.53-fold increase in somatic mutations at CpG sites for S1 and S2, and a 3.38-fold and 2.56-fold increase for F1 and F2, respectively. We have incorporated this finding into ll. 172–175.
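
      The weighting the reviewer asked for amounts to dividing the share of somatic mutations that fall at CpG sites by the share of CpG sites in the genome. The sketch below illustrates that ratio only. The mutation counts are hypothetical placeholders, the function name is an assumption, and the triplet-matrix step the authors used to obtain the CpG proportion is not reproduced here.

```python
def cpg_fold_enrichment(n_cpg_mutations: int,
                        n_total_mutations: int,
                        cpg_site_proportion: float) -> float:
    """Observed share of mutations at CpG sites divided by the genomic share of CpG sites."""
    if n_total_mutations <= 0:
        raise ValueError("n_total_mutations must be positive")
    if not 0 < cpg_site_proportion <= 1:
        raise ValueError("cpg_site_proportion must be in (0, 1]")
    observed_share = n_cpg_mutations / n_total_mutations
    return observed_share / cpg_site_proportion


# Hypothetical mutation counts for illustration only.
# 0.030 is the CpG-site proportion reported above for S. laevis.
example = cpg_fold_enrichment(n_cpg_mutations=15,
                              n_total_mutations=100,
                              cpg_site_proportion=0.030)
print(f"fold enrichment at CpG sites: {example:.2f}")
```

      A value above 1 means that mutations occur at CpG sites more often than the genomic frequency of those sites alone would predict.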

      12) l. 176-187. Interesting comparison and insights. You could also indicate that SBS5 is detected in all human cancers. So the detection of SBS1 and SBS5 signatures indeed suggests some shared mutation biases. Note that in humans, a specific signature of UV is associated with TCG -> TTG mutations (Martincorena & Campbell, 2015). It seems that there is a substantial difference in the mutation spectra between the two trees for this specific category; not sure if this difference could be associated with UV.

      We slightly modified the sentence to indicate that SBS5 is also detected in all human cancers. We are very interested in the potential impact of UV on somatic mutations in tropical trees, considering the high levels of UVR in the tropics. Conducting a comparative analysis of the mutational spectrum among trees inhabiting diverse UVR environments would provide valuable insights to substantiate this hypothesis.

      13) l. 206: I rather suggest "the somatic mutation rate per year is roughly the same, suggesting that somatic mutations rates are independent of growth rate".

      In response to the suggestion from Reviewer 2, we have revised the sentence as follows: "The somatic mutation rate per year remains largely consistent, indicating that somatic mutation rates are independent of the growth rate."

      14) l. 207-232: Here, the section looks like a mixture of results and discussion. I guess the authors consider that it remains a verbal model at this stage and that it therefore belongs more in the discussion. If so, I agree, but it would be good to discuss this part further, in particular how this model could be improved and empirically tested.

      The argument based on the model will be more accurate when the cell cycle duration can be directly estimated for each tree. We have added this explanation in the revised text.

      15) l. 238-239: The parallel drawn with the molecular clock is interesting but according to me, it remains a working hypothesis at this stage, since it is not validated outside the two focal species. I encourage the readers to continue to work on this question and to investigate also some annual plants for instance in the future (assuming that they have a higher α) in order to be able to derive a global model. In addition, even if I consider that the authors use and interpret this parallel wisely, I consider that the use of this terminology could be misleading for some readers. That's why I also suggest removing "molecular clock" from the title and using a more explicit one, e.g. "Somatic mutation rates scale with time not growth rate in dipterocarp trees".

      We agree with Reviewer 2. We have changed the title to “Somatic mutation rates scale with time not growth rate in long-lived tropical trees.”

      16) l. 245-249: The results rather suggest that (i) there is little diversity due to somatic mutations and that (ii) most heritable non-synonymous mutations are deleterious and therefore purged from the population. So rather than this last section of this discussion that has little interest and could be quite debatable, I consider that the authors could extend their discussion, e.g. the differences with somatic mutations in mammals (recently, Cagan and coauthors (2022) demonstrated that somatic mutation rates are inversely correlated with lifespan in mammals) or the overall low rate of molecular evolution in trees could be some directions. But there are many others.

      We have completely rewritten the final paragraph to propose the possibility of a disparity in the relationship between lifespan and somatic mutation rates between plants and animals, rather than discussing the heritability of somatic mutations in the next generation.

      17) l. 570-571: I guess, the reader should understand here "fixed at the heterozygous state"

      To avoid confusion, we have modified the text as follows: “If the alternative allele was present or absent in all eight branches in the amplicon sequence, the site was determined as fixed within an individual tree.” We have also removed “heterozygote” in Supplementary Figure 5.

      18) Fig. 4d. the y-axis would be easier to interpret by writing "Delta Inter-individual vs. Somatic SNPs" and/or by adding arrows on the right margin of the plot to indicate the directions with some short sentences such as "more somatic mutations observed than expected assuming the inter-individual comparison", "less somatic mutation than expected". According to me, some statistical tests are lacking here. Are the differences in the mutation spectra significant given the relatively limited amount of somatic mutations detected?

      We have added short sentences explaining the directions.

      19) Supplementary Tables (excel file): please correct the typos. There are many on these supplementary tables.

      We carefully checked supplementary tables and corrected the typos.

      Reviewer #3

      20) To estimate false negative rates, the authors might consider using mutation insertion tools such as Bamsurgeon (https://github.com/adamewing/bamsurgeon) to create simulated mutations. Alternatively, one could assess the calling rate of high-confidence SNPs that differ between individuals of the same species to get at the FNR.

      We agree with Reviewer 3. To calibrate our pipeline, we previously performed simulations to estimate the false negative and false positive rates in a different tree species (Betula platyphylla) using wgsim v0.1.11 (https://github.com/lh3/wgsim). Based on our simulations, we found that the false negative and false positive rates were very low, averaging 0.050 and 0.046, respectively. It is important to note that the estimated false positive rate obtained from the simulation data was substantially lower than the proportion of potential false positive SNVs (as shown in Supplementary Fig. 5). This observation suggests that simulation-based evaluation of the false positive rate is not reliable, at least for the tree species we studied. The same argument could be applied to the false negative rate. Therefore, we conclude that the simulation-based analysis for estimating false positive and false negative rates is not informative for our study.

      The rate of true-positive or false-negative mutation calls can be estimated only when the true mutational status is known, but the data are not currently available. However, under the assumption that the final set of SNVs represents true somatic mutations, we were able to calculate the potential false negative rate. Our findings indicate that this rate is low, specifically less than 10%, when using less stringent filtering thresholds such as BQ20 and MQ20. While these estimated values may not precisely represent the true false negative rate, we included them as potential false negative rates in Supplementary Figure 7 of the revised manuscript. This information provides additional insights into the performance of our pipeline under different filtering thresholds and contributes to the overall assessment of our study.
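
      One way to read the "potential false negative rate" described above is as the fraction of SNVs in the final call set that are not recovered under a given filtering setting. The sketch below illustrates that reading only. It is an assumption about the calculation, not the authors' pipeline, and the variant keys and function name are hypothetical.

```python
def potential_false_negative_rate(final_set: set, calls_at_threshold: set) -> float:
    """Fraction of final-set SNVs that are missing from the calls made at one threshold.

    Assumes, as the response does, that the final set contains only true
    somatic mutations; the result is therefore a "potential" rate.
    """
    if not final_set:
        raise ValueError("final_set must not be empty")
    missed = final_set - calls_at_threshold
    return len(missed) / len(final_set)


# Hypothetical variants keyed by (chromosome, position, reference, alternative).
final_set = {
    ("chr1", 100, "C", "T"),
    ("chr1", 250, "G", "A"),
    ("chr2", 75, "C", "T"),
    ("chr3", 900, "A", "G"),
}
# Hypothetical calls obtained at one filtering setting (for example BQ20 and MQ20).
calls_at_threshold = {
    ("chr1", 100, "C", "T"),
    ("chr1", 250, "G", "A"),
    ("chr2", 75, "C", "T"),
}
rate = potential_false_negative_rate(final_set, calls_at_threshold)
print(f"potential false negative rate: {rate:.2f}")
```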

      21) It may be interesting to examine the mutation trees for constancy (or not) in mutation rate per meter. Examining Figure 1, it appears that the number of mutations near the crown "4" node is consistently higher than in nearby nodes (3-1 and 3-2).

      We calculated the branch-level increment of SNVs per meter by dividing the number of single nucleotide variations (SNVs) by the physical distance. Our analysis revealed a slight increase in the number of SNVs per meter as the branch position became higher in S. laevis, as shown in Author response table 1. However, this trend was not clearly observed in S. leprosula. We found this observation in S. laevis intriguing, particularly because our recent analysis (Tomimoto et al., in preparation) demonstrated that genetic distance increases in branch pairs located in the upper part of a tree. This was elucidated through a mathematical model that describes the dynamics of the stem cell population during elongation and branching. We opted not to delve further into the findings in the current manuscript, as this topic will be extensively investigated in a future study.
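
      The branch-level calculation described above is a simple ratio of SNV count to branch length. The sketch below shows it with hypothetical segment labels, counts, and lengths; none of these values are taken from Author response table 1.

```python
def snvs_per_meter(n_snvs: int, branch_length_m: float) -> float:
    """Number of SNVs accumulated along a branch segment divided by its length in meters."""
    if branch_length_m <= 0:
        raise ValueError("branch_length_m must be positive")
    return n_snvs / branch_length_m


# Hypothetical segments (label, SNV count, length in meters), ordered from base to crown.
segments = [("lower", 10, 10.0), ("middle", 12, 8.0), ("upper", 9, 4.5)]
for label, n_snvs, length_m in segments:
    print(f"{label}: {snvs_per_meter(n_snvs, length_m):.2f} SNVs per meter")
```

      A rate that rises from the lower to the upper segments would correspond to the trend the authors describe for S. laevis.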

      Author response table 1.

      The branch-level increment of SNVs per meter.

      22) Line 150: Use of "significantly different" is confusing as the phrase is usually reserved for statistical significance. Consider replacing with "substantially different."

      We have replaced “significantly” with “substantially” in the revised text.

      23) In the Discussion, a clearer explanation of the assumptions that underlie the authors' reasoning would be welcome: e.g., constancy in mutation rate per meter within an individual tree. In particular, the authors assume that mutations that are seen in one leaf and not in another cannot have predated the most recent common meristematic node linking the two leaves. Is this a reasonable assumption? Since the meristem is multicellular, is it possible for a mutation to have arisen earlier in development and "assorted" into one cell lineage but not another?

      We greatly appreciate this important comment. It is true that when the meristem is multicellular and the stem cell lines are retained during mutation accumulation (e.g. a structured meristem, as analyzed in Tomimoto and Satake 2023), a mutation may have arisen before the bifurcation. Using a mathematical model, we have shown that the intercept and slope of the linear regression between the pairwise genetic distance and the physical distance are influenced by the type of meristem (the strength of somatic genetic drift in the meristem) as well as by the branching architecture of the tree. We have included an explanation of this point in the revised manuscript (ll. 244–249).
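
      The intercept and slope mentioned above come from a linear regression of pairwise genetic distance on the physical distance between branch tips. The sketch below is a minimal ordinary least-squares fit with hypothetical data points. It does not implement the authors' mathematical model of meristem dynamics, and the function name is an assumption.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical branch-tip pairs: physical distance along the tree (m) and
# number of SNVs that differ between the two tips.
physical_distance_m = [5.0, 12.0, 20.0, 31.0, 44.0]
genetic_distance_snvs = [6, 11, 19, 27, 40]

intercept, slope = fit_line(physical_distance_m, genetic_distance_snvs)
print(f"intercept = {intercept:.2f} SNVs, slope = {slope:.2f} SNVs per meter")
```

      Under the reasoning in the response, a non-zero intercept is compatible with mutations that predate the bifurcation and were sorted unevenly into the two lineages, so the intercept should be inspected alongside the slope rather than assumed to be zero.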

      24) Supplementary Data 7: Column J should be "2_2"

      We corrected the typo.

    1. Author response:

      The following is the authors’ response to the current reviews.

      Reviewer #1 (Public Review):

      Summary - This study was designed to investigate changes in gene expression and associated chromatin accessibility patterns in spermatogonia in mice at different postnatal stages from pups to adults. The objective was to describe dynamic changes in these patterns that potentially correlate with functional changes in spermatogonia as a function of development and reproductive maturation. The potential utility of this information is to serve as a reference against which similar data from animals subjected to various disruptive environmental influences can be compared.

      Major Strengths and Weaknesses of the Methods and Results - A strength of the study is that it reviews previously published datasets describing gene expression and chromatin accessibility patterns in mouse spermatogonia. A weakness of the study is that it is not clear what new information is provided by the data provided that was not already known from previously published studies (see below). Specific weaknesses include the following:

      • Terminology - in the Abstract and first part of the Introduction the authors use the generic term "spermatogonial cells" in a manner that seems to be referring primarily to spermatogonial stem cells (SSCs) but initially ignores the well-known heterogeneity among spermatogonia - particularly the fact that only a small proportion of developing spermatogonia become SSCs - and ONLY those SSCs and NOT other developing spermatogonia - support steady-state spermatogenesis by retaining the capacity to either self-renew or contribute to the differentiating spermatogenic lineage throughout the male reproductive lifespan. The authors eventually mention other types of developing male germ cells, but their description of prospermatogonial stages that precede spermatogonial stages is deficient in that M-prospermatogonia - which occur after PGCs but before T1-prospermatogonia - are not mentioned. This description also seems to imply that all T2-prospermatogonia give rise to SSCs which is far from the case. It is the case that prospermatogonia give rise to spermatogonia, but only a very small proportion of undifferentiated spermatogonia form the foundational SSCs and ONLY SSCs possess the capacity to either self-renew or give rise to sequential waves of spermatogenesis.

      We thank Reviewer 1 for the comments and clarifications. As suggested in the previous revision, we use the term spermatogonial cells (SPGs) to make it clear that our cell preparations do not exclusively contain SSCs but all SPGs since they derive from a FACS enrichment strategy. This is explained in the manuscript. Further, we conducted deconvolution analyses on the datasets to examine the composition of the enriched SPGs preparations and provide new sequencing information confirming the presence of SSCs and differentiating SPGs.

      • Introduction - Statements regarding distinguishing transcriptional signatures in spermatogonia at different postnatal stages appear to refer to ALL subtypes of spermatogonia present at each stage collectively, thereby ignoring the well-known fact that there are distinct spermatogonial subtypes present at each postnatal stage and that some of those occur at certain stages but not at others. This brings into question the usefulness of the authors' discussion of what types of genes are expressed and/or what types of changes in chromatin accessibility are detected in spermatogonia at each stage.

      We agree that our data do not provide information about the transcriptional program of each subtype of SPGs. Rather they provide information about the dynamics of transcriptional programs in the transition from postnatal stage to adulthood in an enriched population of SPGs. The datasets are comprehensive and contain mRNA and non-coding RNA (with and without a polyA+ tail), which provides more precise transcriptomic information than classical single cell methods.

      • Methodology - The authors based recovery (enrichment) of spermatogonia from male pups on FACS sorting for THY1 and RMV-1. While sorting total testis cells for THY1+ cells does enrich for spermatogonia, this approach is now known to not be highly specific for spermatogonia (somatic cells are also recovered) and definitely not for SSCs. There are more effective means for isolating SSCs from total testis cells that have been validated by transplantation experiments (e.g. use of the Id4/eGFP transgene marker).

      We acknowledge the technical limitations of our enrichment strategy and made them clear in our revised manuscript.

      The authors then used "deconvolution" of bulk RNA-seq data in an attempt to discern spermatogonial subtype-specific transcriptomes. It is not clear why this is necessary or how it is beneficial given the availability of multiple single-cell RNA-seq datasets already published that accomplish this objective quite nicely - as the authors essentially acknowledge. Beyond this concern, a potential flaw with the deconvolution of bulk RNA-seq data is that this is a derivative approach that requires assumptions/computational manipulations of apparent mRNA abundance estimates that may confound interpretation of the relative abundance of different cellular subtypes within the heterogeneous cell population from which the bulk RNA-seq data is derived. Bottom line, it is not clear that this approach affords any experimental advantage over use of the publicly available scRNA-seq datasets and it is possible that attempts to employ this approach may be flawed, yielding misleading data.

      The deconvolution analyses were necessary to address the question of the cell composition of our preparations raised by reviewers. These analyses were highly beneficial because they clarify the presence of different SPGs, including SSCs, in the samples. They are also advantageous because the datasets on which they are conducted have significantly higher sequencing coverage than published single-cell datasets. They contain the full transcriptome and not just polyA+ transcripts, as 10x datasets do; thus they provide considerably richer and more comprehensive transcriptomic information. This is very important for correctly interpreting the results and for gaining additional biological information. For the deconvolution analyses, we used state-of-the-art methods with proper computational controls for calibration. We selected published single-cell RNA-seq datasets of the highest quality. These analyses are extremely useful because they confirm the predominance of SSCs in the postnatal and adult cell samples and minimal contamination by somatic cells. Our approach also provides a useful workflow that can easily be used by other researchers who cannot afford single-cell RNA-seq and allows them to gain more information about the cellular composition of their samples. Finally, any computational analysis, including the analysis of single-cell RNA-seq datasets, requires making assumptions during the development and use of a method. The assumptions made for deconvolution analyses are not special in this respect and do not introduce more confounds than other methods. What is critical for such analyses is to include proper controls for calibration, which we carefully did and validated using our own previously published datasets for Sertoli cells.

      • Results & Discussion - In general, much of the information reported in this study is not novel. The authors' discussion of the makeup of various spermatogonial subtypes in the testis at various ages does not really add anything to what has been known for many years on the basis of classic morphological studies. Further, as noted above, the gene expression data provided by the authors on the basis of their deconvolution of bulk RNA-seq data does not add any novel information to what has been shown in recent years by multiple elegant scRNA-seq studies - and, in fact, as also noted above - represents an approach fraught with potential for misleading results. The potential value of the authors' report of "other cell types" not corresponding to major somatic cell types identified in earlier published studies seems quite limited given that they provide no follow-up data that might indicate the nature of these alternative cell types. Beyond this, much of the gene expression and chromatin accessibility data reported by the authors - by their own admission given the references they cite - is largely confirmatory of previously published results. Similarly, results of the authors' analyses of putative factor binding sites within regions of differentially accessible chromatin also appear to confirm previously reported results. Ultimately, it is not at all novel to note that changes in gene expression patterns are accompanied by changes in patterns of chromatin accessibility in either related promoters or enhancers. The discussion of these observations provided by the authors takes on more of a review nature than that of any sort of truly novel results. As a result, it is difficult to discern how the data reported in this manuscript advance the field in any sort of novel or useful way beyond providing a review of previously published studies on these topics.

      • Likely impact - The likely impact of this work is relatively low because, other than the value it provides as a review of previously published datasets, the new datasets provided are not novel and so do not advance the field in any significant manner.

      We acknowledge that much of the reported information is not novel, but this is not necessarily a drawback, as sequencing datasets on the same tissues or cells produced by different groups using comparable methods are common. This does not diminish the validity and usefulness of the datasets but rather enriches the respective fields, as omics methods and data analyses can deliver different findings. Thus, our study cannot be criticized and disqualified because other datasets have been published; instead, it should be acknowledged for providing high-resolution full-transcriptome information from SCs at different stages and in adults that other studies do not provide. In this respect, the subjective nature of Reviewer 1’s statements is of concern. For instance, the statement “…represents an approach fraught with potential for misleading results” suggests that all studies that previously used enrichment strategies are “fraught with potential for misleading results”, which disqualifies the work of many colleagues. Further, this wrongly assumes that newer technologies are exempt from “potential for misleading results”, which is not the case. Single-cell RNA-seq methods, extensively used to study SPGs, have been questioned for their limitations and potential biases due to low sequencing coverage, issues with transcript detection, low capture efficiency and a higher degree of noise than bulk RNA datasets. Thus, caution is needed when interpreting single-cell datasets on SPGs, as these datasets also have their biases. For our datasets, we made major efforts to address the criticisms raised by the reviewer and to reduce any potentially misleading information by conducting additional analyses, by providing more details on the methods and enrichment strategy and by being careful with data interpretation. We would be grateful if these efforts could be acknowledged and if the improvements to the manuscript and the value of the datasets could be evaluated with objectivity.

      Reviewer #2 (Public Review):

      This revised manuscript attempts to explore the underlying chromatin accessibility landscape of spermatogonia from the developing and adult mouse testis. The key criticism of the first version of this manuscript was that bulk preparations of mixed populations of spermatogonia were used to generate the data that form the basis of the entire manuscript. To address this concern, the authors applied a deconvolution strategy (CIBERSORTx (Newman et al., 2019)) in an attempt to demonstrate that their multi-parameter FACS isolation (from Kubota 2004) of spermatogonia enriched for PLZF+ cells recovered spermatogonial stem cells (SSCs). PLZF (ZBTB16) protein is a transcription factor known to mark all or nearly all undifferentiated spermatogonia and some differentiating spermatogonia (KIT+ at the protein level) - see Niedenberger et al., 2015 (PMID: 25737569). The authors' deconvolution using single-cell transcriptomes produced at postnatal day 6 (P6) argue that 99% of the PLZF+ spermatogonia at P8 are SSCs, 85% at P15 and 93% in adults. Quite frankly given the established overlap between PLZF and KIT and known identity of spermatogonia at these developmental stages, this is impossible. Indeed - the authors' own analysis of the reference dataset demonstrates abundant PLZF mRNA in P6 progenitor spermatogonia - what is the authors' explanation for this observation? The same is essentially true in the use of adult references for celltype assignment. The authors found 63-82% of SSCs using this different definition of types (from a different dataset), begging the question of which of these results is true.

      For full transparency, we provided information about the deconvolution analyses for all libraries that use cell-type specific matrices generated from PND6 and adult single-cell RNA-seq reference datasets in our previous response (Fig1-3, response to reviewer 1). However, we don’t claim “that 99% of the PLZF+ spermatogonia at P8 are SSCs, 85% at P15 and 93% in adults”. Of these percentages, the ones that correspond to our postnatal libraries are the ones reported in our updated manuscript (please see FigS2). Importantly, we never claimed that these percentages correspond exclusively to “PLZF+ spermatogonia”. Rather, they were inferred using gene expression-specific signature matrices (Fig1-c response to Reviewer 1 as example). As is clearly evident in the feature maps in FigS2 of our updated manuscript, the cellular population identified as SSCs using the dataset from Hermann et al., 2018 shows overlap for the expression of Ddx4, Zbtb16 (PLZF), Gfra1 and Id4 but minimal Kit. In agreement with the reviewer’s observation, progenitors also show a signal for Zbtb16 but have a different gene expression signature matrix (see Fig.1c and 2c for an example of gene signature matrices from PND6 and adult samples from the same publication).

      Regarding the question of which of these results are true, we observed that deconvolution analyses of our postnatal libraries using two different single-cell postnatal RNA-seq reference datasets consistently suggest a high contribution (>90%) by SSCs (defined using cell-specific expression matrices following identification of the cell types that most closely match those reported by each study; see FigS2 of the updated manuscript). The analyses of our adult libraries using published adult datasets from the same group (Hermann et al., 2018; Fig1 response to Reviewer 1 and FigS2 updated manuscript) suggest that the contribution of adult SSCs to the cell population is lower than at postnatal stages, but SSCs are still the most abundant cell stage identified in our libraries (FigS2g). We reported these analyses and acknowledge that our adult samples likely also contain differentiating SPGs.

      In their rebuttal, the authors also raise a fair point about the precision of differential gene expression among spermatogonial subsets. At the mRNA level, Kit is definitely detectable in undifferentiated spermatogonia, but it is never observed at the protein level until progenitors respond to retinoic acid (see Hermann et al., 2015). I agree with the authors that the mRNAs for "cell type markers" are rarely differentially abundant at absolute levels (0 or 1), but instead, there are a multitude of shades of grey in mRNA abundance that "separate" cell types, particularly in the male germline and among the highly related spermatogonial subtypes of interest (SSCs, progenitor spermatogonia and differentiating spermatogonia). That is, spermatogonial biology should be considered as a continuous variable (not categorical), so examining specific cell populations with defined phenotypes (markers, function) likely oversimplifies the underlying heterogeneity in the male germ lineage. But, here, the authors have ignored this heterogeneity entirely by selecting complex populations and examining them in aggregate. We already know that PLZF protein marks a wide range of spermatogonia, complicating the interpretation of aggregate results emerging from such samples. In their rebuttal, the authors nicely demonstrate the existence of these mixtures using deconvolution estimation. What remains a mystery is why the authors did not choose to perform single-cell multiome (RNA-seq + ATAC-seq) to validate their results and provide high-confidence outcomes. This is an accessible technique and was requested after the initial version, but essentially ignored by the authors.

      We agree with the reviewer that the male germ lineage should be considered as a continuous variable and that examining specific cell populations with defined features oversimplifies its heterogeneity. Regarding the use of single-cell multiome (RNA-seq + ATAC-seq), we also agree that this technology can provide additional insight by integrating RNA and chromatin accessibility in the same cells. However, it is a refined method that is expensive and time-consuming, and it requires human resources that are beyond our capacity for this project.

      A separate question is whether these data are novel. A prior publication by the Griswold lab (Schleif et al., 2023; PMID: 36983846) already performed ATAC-seq (and prior data exist for RNA-seq) from germ cells isolated from synchronized testes. These existing data are higher resolution than those provided in the current manuscript because they examine germ cells before and after RA-induced differentiation, which the authors do not base on their selection methods. Another prior publication from the Namekawa lab extensively examined the transcriptome and epigenome in adult testes (Maezawa et al., 2000; PMID: 32895557; and several prior papers). The authors should explain how their results extend our knowledge of spermatogonial biology in light of the preceding reports.

      Our data do extend previous studies because they provide high-resolution transcriptomic (full transcriptome) and chromatin accessibility profiling in postnatal and adult stages. They now also provide an approach for deconvolution analyses of bulk RNA datasets that can be of use to the community. Novelty in the field of omics is usually not a prime feature and it is common that datasets on the same tissues or cells be published by different groups using comparable methods and analyses.

      The authors are also encouraged to improve their use of terminology to describe the samples of interest. The mitotic male germ cells in the testis are called spermatogonia (not spermatogonial cells, because spermatogonia are cells). Spermatogonia arise from Prospermatogonia. Spermatogonia are divisible into two broad groups: undifferentiated spermatogonia (comprised of few spermatogonial stem cells or SSCs and many more progenitor spermatogonia - at roughly 1:10 ratio) and differentiating spermatogonia that have responded to RA. The authors also improperly indicate that SSCs directly produce differentiating spermatogonia - indeed, SSCs produce transit-amplifying progenitor spermatogonia, which subsequently differentiate in response to retinoic acid stimulation. Further, the use of Spermatogonial cells (and SPGs) is imprecise because these terms do not indicate which spermatogonia are in question. Moreover, there have been studies in the literature which have used similar terms inappropriately to refer to SSCs, including in culture. A correct description of the lineage and disambiguation by careful definition and rigorous cell type identification would benefit the reader.

      Overall, my concern from the initial version of this manuscript stands - critical methodological flaws prevent interpretation of the results and the data are not novel. Readers should take note that results in essentially all Figures do not reflect the biology of any one type of spermatogonium.

      We revised and improved the terminology wherever possible, also taking into account the terminology requests made by other reviewers.

      Reviewer #3 (Public Review):

      In this study, Lazar-Contes and colleagues aimed to determine whether chromatin accessibility changes in the spermatogonial population during different phases of postnatal mammalian testis development. Because actions of the spermatogonial population set the foundation for continual and robust spermatogenesis and the gene networks regulating their biology are undefined, the goal of the study has merit. To advance knowledge, the authors used mice as a model and isolated spermatogonia from three different postnatal developmental age points using cell sorting methodology that was based on cell surface markers reported in previous studies and then performed bulk RNA-sequencing and ATAC-sequencing. Overall, the technical aspects of the sequencing analyses and computational/bioinformatics seem sound, but there are several concerns with the cell population isolated from testes and a lack of acknowledgement of previous studies that have also performed ATAC-sequencing on spermatogonia of mouse and human testes. The limitations, described below, call into question the validity of the interpretations and reduce the potential merit of the findings.

      I suggest changing the acronym for spermatogonial cells from SC to SPG for two reasons. First, SPG is the commonly used acronym in the field of mammalian spermatogenesis. Second, SC is commonly used for Sertoli Cells.

      This was suggested in the previous review by Reviewer 1 and was modified in the revised version of the manuscript.

      The authors should provide a rationale for why they used postnatal day 8 and 15 mice. The FACS sorting approach used was based on cell surface proteins that are not germline specific so there was undoubtedly somatic cells in the samples used for both RNA and ATAC sequencing. Thus, it is essential to demonstrate the level of both germ cell and undifferentiated spermatogonial enrichment in the isolated and profiled cell populations. To achieve this, the authors used PLZF as a biomarker of undifferentiated spermatogonia. Although PLZF is indeed expressed by undifferentiated spermatogonia, there have been several studies demonstrating that expression extends into differentiating spermatogonia. In addition, PLZF is not germ cell specific and single cell RNA-seq analyses of testicular tissue has revealed that there are somatic cell populations that express Plzf, at least at the mRNA level. For these reasons, I suggest that the authors assess the isolated cell populations using a germ cell specific biomarker such as DDX4 in combination with PLZF to get a more accurate assessment of the undifferentiated spermatogonial composition. This assessment is essential for interpretation of the RNA-seq and ATAC-seq data that was generated.

      A previous study by the Namekawa lab (PMID: 29126117) performed ATAC-seq on a similar cell population (THY1+ FACS sorted) that was isolated from pre-pubertal mouse testes. It was surprising not to see this study referenced in the current manuscript. In addition, it seems prudent to cross-reference the two ATAC-seq datasets for commonalities and differences. There are also several published studies on scATAC-seq of human spermatogonia that might be of interest to cross-reference with the ATAC-seq data presented in the current study to provide an understanding of translational merit for the findings.

      These points have been addressed in our previous response and in the revised manuscript.


      The following is the authors’ response to the original reviews.

      Reviewer #1:

      Weaknesses:

      There appears to be a lack of basic knowledge of the process of spermatogenesis. For instance, the statement that "During the first week of postnatal life, a population of SCs continues to proliferate to give rise to undifferentiated Asingle (As), Apaired (Apr) and Aaligned (Aal) cells. The remaining SCs differentiate to form chains of daughter cells that become primary and secondary spermatocytes around postnatal day (PND) 10 to 12." is inaccurate. The Aal cells are the spermatogonial chains, the two are not distinct from one another. In addition, the authors fail to mention spermatogonial stem cells which form the basis for steady-state spermatogenesis. The authors also do not acknowledge the well-known fact that, in the mouse, the first wave of spermatogenesis is distinct from subsequent waves. Finally, the authors do not mention the presence of both undifferentiated spermatogonia (aka - type A) and differentiating spermatogonia (aka - type B). The premise for the study they present appears to be the implication that little is known about the dynamics of chromatin during the development of spermatogonia. However, there are published studies on this topic that have already provided much of the information that is presented in the current manuscript.

      Regarding the inaccuracy and incompleteness of some of the statements about spermatogonial cells and spermatogenesis: in the Introduction, we replaced the following statement: "During the first week of postnatal life, a population of SCs continues to proliferate to give rise to undifferentiated Asingle (As), Apaired (Apr) and Aaligned (Aal) cells. The remaining SCs differentiate to form chains of daughter cells that become primary and secondary spermatocytes around postnatal day (PND) 10 to 12." with: “Spermatogonial cells (SPGs) are the initiators and supporting cellular foundation of spermatogenesis in testis in many species, including mammals. In the mammalian testis, the founding germ cells are primordial germ cells (PGCs), which give rise sequentially to different populations of SPGs: primary transitional (T1)-prospermatogonia (ProSG), secondary transitional (T2)-ProSG, and then spermatogonial stem cells (SSCs) (McCarrey, 2013; Rabbani et al., 2022; Tan et al., 2020). The ProSG population is exhausted by postnatal day (PND) 5 (Drumond et al., 2011) and by PND6-8, distinct SPG subtypes can be distinguished on the basis of specific marker proteins and regenerative capacity (Cheng et al., 2020; Ernst et al., 2019; Green et al., 2018; Hermann et al., 2018; Tan et al., 2020).

      SSCs represent an undifferentiated population of SPGs that retain regenerative capacity and divide to either self-renew or generate progenitors that initiate spermatogenic differentiation, giving rise to differentiating SPGs (diff-SPGs). Diff-SPGs form chains of daughter cells that become primary and secondary spermatocytes around PND10 to 12. Spermatocytes then undergo meiosis and give rise to haploid spermatids that develop into spermatozoa. Spermatozoa are then released into the lumen of seminiferous tubules and continue to mature in the epididymis until becoming capable of fertilization by PND42-48 in mice (Kubota and Brinster, 2018; Rooij, 2017).”

      Regarding the premise and implications of our findings: we clarified the premise of our findings in the revised manuscript. The following statement was included in the Discussion: "our findings complement existing datasets on spermatogonial cells by providing parallel transcriptomic and chromatin accessibility maps at high resolution from the same cell populations at early postnatal, late postnatal and adult stages collected from single individuals (for adults)".

      It is not clear which spermatogonial subtype the authors intended to profile with their analyses. On the one hand, they used PLZF to FACS sort cells. This typically enriches for undifferentiated spermatogonia. On the other hand, they report detection in the sorted population of markers such as c-KIT which is a well-known marker of differentiating spermatogonia, and that is in the same population in which ID4, a well-known marker of spermatogonial stem cells, was detected. The authors cite multiple previously published studies of gene expression during spermatogenesis, including studies of gene expression in spermatogonia. It is not at all clear what the authors' data adds to the previously available data on this subject.

      The authors analyzed cells recovered at PND 8 and 15 and compared those to cells recovered from the adult testis. The PND 8 and 15 cells would be from the initial wave of spermatogenesis whereas those from the adult testis would represent steady-state spermatogenesis. However, as noted above, there appears to be a lack of awareness of the well-established differences between spermatogenesis occurring at each of these stages.

      We applied computational deconvolution to our bulk RNA-seq datasets, employing publicly available single-cell RNA-seq datasets, to estimate and identify cellular composition. Trained on high-quality RNA-seq datasets from pure or single-cell populations, deconvolution algorithms create expression matrices reflecting the cellular diversity in reference datasets. These cell-type-specific expression matrices are subsequently used to determine the cellular composition of bulk RNA-seq samples with unknown cellular components (Cobos et al., 2023).
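
      The principle described above can be illustrated with a minimal sketch. It is not the CIBERSORTx algorithm: non-negative least squares, solved here by projected gradient descent, stands in for the actual estimator, and the signature matrix, gene counts and function name are hypothetical values chosen only for illustration. A bulk profile is modelled as a weighted sum of cell-type signature columns, and the non-negative weights, rescaled to sum to 1, are read as estimated proportions.

```python
def estimate_proportions(signature, bulk, n_iter=2000):
    """Estimate cell-type proportions for one bulk sample.

    signature: list of rows (one per gene), each holding the expression of
               that gene in every reference cell type (hypothetical values).
    bulk:      list with the bulk expression of the same genes.
    Illustrative stand-in only: solves non-negative least squares by
    projected gradient descent, not the method used by CIBERSORTx.
    """
    n_genes = len(signature)
    n_types = len(signature[0])

    # The squared Frobenius norm bounds the largest eigenvalue of S^T S,
    # so its inverse is a safe (conservative) gradient step size.
    frob_sq = sum(v * v for row in signature for v in row)
    step = 1.0 / frob_sq

    weights = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        # Residual of the linear mixture model: S w - b
        residual = [
            sum(s * w for s, w in zip(row, weights)) - b
            for row, b in zip(signature, bulk)
        ]
        # Gradient of 0.5 * ||S w - b||^2 is S^T (S w - b)
        grad = [
            sum(signature[g][k] * residual[g] for g in range(n_genes))
            for k in range(n_types)
        ]
        # Gradient step followed by projection onto non-negative weights
        weights = [max(0.0, w - step * g) for w, g in zip(weights, grad)]

    total = sum(weights)
    if total == 0.0:
        raise ValueError("all estimated weights are zero")
    # Rescale so the weights can be read as proportions
    return [w / total for w in weights]


if __name__ == "__main__":
    # Hypothetical signature: 4 genes x 3 cell types (e.g. SSC, progenitor, somatic)
    signature = [
        [10.0, 1.0, 0.0],
        [1.0, 8.0, 1.0],
        [0.0, 1.0, 9.0],
        [5.0, 5.0, 5.0],
    ]
    mixture = [0.8, 0.15, 0.05]
    bulk = [sum(s * p for s, p in zip(row, mixture)) for row in signature]
    print(estimate_proportions(signature, bulk))
```

      In this toy setting the bulk profile is an exact, noise-free mixture, so the known proportions are recovered. Real bulk libraries contain noise, platform differences and cell types absent from the reference, which is why dedicated tools add normalization and batch-correction steps.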

      For our analysis, we chose CIBERSORTx (Newman et al., 2019), recognized as the most advanced deconvolution algorithm to date, employing it with three high-quality, publicly available single-cell RNA-seq datasets. First, we assessed the cellular composition of all our RNA-seq libraries using datasets generated by Hermann et al. (2018), which characterized the single-cell transcriptomes of testicular cells and various populations of spermatogonial progenitor cells (SPGs) at early postnatal (PND6) and adult stages. This enabled us not only to address potential somatic cell contamination but also to analyse the composition of isolated SPGs using a unified dataset source.

      Author response image 1.

      Deconvolution analysis of bulk RNA-seq samples using PND6 single-cell RNA-seq from Hermann et al., 2018. a. Seurat clusters from PND6 single-cell RNA-seq. b. Feature maps of gene expression for markers of SPGs and somatic cells. c. Gene expression signature matrix from PND6 single-cell RNA-seq datasets. d. Barplot of estimated cellular proportions for all bulk RNA-seq libraries reported in this study. e. Dotplot of the average estimated proportion of SSCs in all bulk RNA-seq libraries reported in this study.

      By re-analyzing the single-cell RNA-seq datasets, we identified distinct cell-type clusters, marked by specific cellular markers as reported in the original and subsequent studies (Author response image 1a,b and Author response image 2a,b). Then, CIBERSORTx generated gene-expression signature matrices and estimated the cell-type proportions within our 18 bulk RNA-seq libraries. Evaluation of our postnatal libraries (PND8 and 15) against a PND6 signature matrix revealed a predominant derivation from SPGs, with average estimated proportions of spermatogonial stem cells (SSCs) being 0.99 and 0.85 for PND8 and PND15 samples, respectively (Author response image 1c-e). Notably, the analysis of PND15 libraries also suggested the presence of additional SPGs types, including progenitors and differentiating SPGs (Author response image 1d), albeit at lower frequency. 

      Similarly, evaluation of our adult RNA-seq libraries, using an adult signature matrix, showed an average SSC proportion of 0.82, indicating a primary derivation from SSCs. Consistent with the findings from PND15 libraries, our deconvolution analysis also suggests the presence of additional SPG types, including progenitors and differentiating SPGs (Author response image 1d). However, unlike our early and late postnatal stage libraries, the deconvolution analysis of adult libraries indicated the presence of other cell types (labeled "Other") that do not correspond to the major somatic cell types identified by Hermann et al., 2018. The estimated average proportion of these cells was less than 0.05 in two adult libraries and 0.10 in the others. This variance in cellular composition underlines the deconvolution method's effectiveness in dissecting complex cellular compositions in bulk RNA-seq samples.

      Author response image 2.

      Deconvolution analysis of bulk RNA-seq samples using adult single-cell RNA-seq (Hermann et al., 2018). a. Seurat clusters from adult single-cell RNA-seq. b. Feature maps of gene expression for markers of SPG and somatic cells. c. Gene expression signature matrix from adult single-cell RNA-seq datasets. d. Barplot of estimated cellular proportions for all bulk RNA-seq libraries reported in this study. e. Dotplot of the average estimated proportion of SSCs in all bulk RNA-seq libraries reported in this study.

      To further validate our observations, we re-analyzed two additional testicular single-cell RNA-seq datasets derived from an early postnatal stage (PND7) (Tan et al., 2020) and from adults (Green et al., 2018) (Author response image 3a,b and Author response image 4a,b). We identified distinct cell-type clusters, marked by specific cellular markers, and proceeded with the deconvolution analysis using CIBERSORTx. Evaluation of our postnatal libraries (PND8 and 15) against the PND7 signature matrix from Tan et al., 2020 confirmed a derivation from germ cells (Author response image 3d,e), in particular from SSCs (Author response image 3g), with average estimated proportions of SSCs being 0.93 and 0.86 for PND8 and PND15 samples, respectively, and the remainder estimated to originate from differentiating SPGs (Author response image 3g,h). In the case of the adult samples, evaluation against the adult signature matrix from Green et al., 2018 confirmed a predominant derivation from SSCs, with an average estimated SSC proportion of 0.79, consistent with the 0.82 estimated using the Hermann et al., 2018 reference.

      Author response image 3.

      Deconvolution analysis of bulk RNA-seq samples with additional single-cell datasets. a. Seurat clusters from PND7 single-cell RNA-seq (Tan et al., 2020). b. Barplot of estimated cellular proportions for all bulk RNA-seq libraries reported in this study. c. Dotplot of the average estimated proportion of germ cells in all bulk RNA-seq libraries reported in this study. d. Re-clustering of germ cell cluster shown in a. e. Barplot of estimated cellular proportions for all bulk RNA-seq libraries reported in this study. f. Dotplot of the average estimated proportion of SSCs in all bulk RNA-seq libraries reported in this study. g. Seurat clusters from adult single-cell RNA-seq (Green et al., 2018). h. Barplot of estimated cellular proportions for all bulk RNA-seq libraries reported in this study. i. Dotplot of the average estimated proportion of germ cells in all bulk RNA-seq libraries reported in this study.

      To further validate our deconvolution strategy, we interrogated the cellular composition of bulk RNA-seq libraries derived from cellular populations enriched in Sertoli cells, generated by our group using a similar enrichment/sorting strategy (Thumfart et al., 2022). As expected, our results show that all our libraries are mainly composed of Sertoli cells suggesting that the deconvolution strategy employed is accurate in detecting cell-type composition (Author response image 4).

      Author response image 4.

      Deconvolution analysis of Sertoli bulk RNA-seq samples. Barplots of estimated cellular proportions for bulk RNA-seq libraries reported in Thumfart et al., 2022. Expression matrices were derived from the analysis of single-cell RNA-seq datasets used to assess the cellular composition of the SPG bulk libraries.

      Author response image 5.

      Id4 and Kit are transcribed in SSCs. Seurat clusters from PND6 single-cell RNA-seq (left) and feature maps of gene expression for Id4 (center) and Kit (right). Zoom-in on SSCs (red).

      Finally, regarding the following observation by the reviewer: "On the other hand, they report detection in the sorted population of markers such as c-KIT which is a well-known marker of differentiating spermatogonia, and that is in the same population in which ID4, a well-known marker of spermatogonial stem cells, was detected." It was recently shown using single-cell RNA-seq that “nearly all differentiating spermatogonia at P3 (delineated as c-KIT+) are ID4-eGFP” (Law et al., 2019). While this finding does not exclude the possibility that we have a mixture of SPGs, it supports the possibility that SPGs express markers of both undifferentiated and differentiated cells, particularly in the early stages of postnatal development. Indeed, we observe that some cells labeled as SSCs show signals for both Id4 and Kit in single-cell RNA-seq data from Hermann et al., 2018 (Author response image 5).

      Therefore, the results from the deconvolution analysis and our immunofluorescence data showing 85-95% PLZF+ cells in our cellular preparations underscore that our bulk RNA-seq libraries are mainly composed of SPGs. The deconvolution analysis also suggests a cellular composition consisting predominantly of SSCs and, to a lesser degree, of differentiating SPGs. Our adult RNA-seq libraries show a small proportion of somatic cells (<0.10).

      In the revised manuscript, we compiled the deconvolution analyses and presented them in a condensed version in Supplementary Fig 2.

      In general, the authors present observational data of the sort that is generated by RNA-seq and ATAC-seq analyses, and they speculate on the potential significance of several of these observations. However, they provide no definitive data to support any of their speculations. This further illustrates the fact that this study contributes little if any new information beyond that already available from the numerous previously published RNA-seq and ATAC-seq studies of spermatogenesis. In short, the study described in this manuscript does not advance the field.

      We acknowledge that RNA-seq and ATAC-seq datasets like ours are observational and that their interpretation can be speculative. Nevertheless, our datasets represent an additional useful resource for the community because they are comprehensive and high-resolution, and can be exploited, for instance, in studies in environmental epigenetics and epigenetic inheritance that examine the immediate and long-term effects of postnatal exposure and their dynamics. The depth of our RNA sequencing allowed us to detect transcripts with a high dynamic range, which has been limited with classical RNA sequencing analyses of spermatogonial cells and with single-cell analyses (which have comparatively low coverage). Further, our experimental pipeline is more affordable than single-cell sequencing approaches and, in the case of adults, provides data per animal, informing on the intrinsic variability in transcriptional and chromatin regulation across males. These points will be discussed in the revised manuscript.

      Relevant information for both points was included in the Discussion of the revised manuscript.  

      The phenomenon of epigenetic priming is discussed, but then it seems that there is some expression of surprise that the data demonstrate what this reviewer would argue are examples of that phenomenon. The authors discuss the "modest correspondence between transcription and chromatin accessibility in SCs." Chromatin accessibility is an example of an epigenetic parameter associated with the primed state. The primed state is not fully equivalent to the actively expressing state. It appears that certain histone modifications along with transcription factors are critical to the transition between the primed and actively expressing states (in either direction). The cell types that were investigated in this study are closely related spermatogenic, and predominantly spermatogonial cell types. It is very likely that the differentially expressed loci will be primed in both the early (PND 8 or 15) and adult stages, even though those genes are differentially expressed at those stages. Thus, it is not surprising that there is not a strict concordance between +/- chromatin accessibility and +/- active or elevated expression.

      Relevant information was included in the Discussion of the revised manuscript.

      Reviewer #2:

      The objective of this study from Lazar-Contes et al. is to examine chromatin accessibility changes in "spermatogonial cells" (SCs) across testis development. Exactly what SCs are, however, remains a mystery. The authors mention in the abstract that SCs are undifferentiated male germ cells and have self-renewal and differentiation activity, which would be true for Spermatogonial STEM Cells (SSCs), a very small subset of total spermatogonia, but then the methods they use to retrieve such cells using antibodies that enrich for undifferentiated spermatogonia encompass both undifferentiated and differentiating spermatogonia. Data in Fig. 1B prove that most (85-95%) are PLZF+, but PLZF is known to be expressed both by undifferentiated and differentiating (KIT+) spermatogonia (Niedenberger et al., 2015; PMID: 25737569). Thus, the bulk RNA-seq and ATAC-seq data arising from these cells constitute the aggregate results comprising the phenotype of a highly heterogeneous mixture of spermatogonia (plus contaminating somatic cells), NOT SSCs. Indeed, Fig. 1C demonstrates this by showing the detection of Kit mRNA (a well-known marker of differentiating spermatogonia - which the authors claim on line 89 is a marker of SCs!), along with the detection of markers of various somatic cell populations (albeit at lower levels).

      The reviewer is correct that our spermatogonial cell populations are mixed and include undifferentiated and differentiated cells, hence the name spermatogonial cells (SCs), and that they probably also contain some somatic cells. We acknowledge that this is a limitation of our isolation approach. To circumvent this limitation, we will conduct in silico deconvolution analysis using publicly available single-cell RNA sequencing datasets to obtain information about markers corresponding to undifferentiated and differentiated spermatogonial cells, and to somatic cells. These additional analyses will provide information about the cellular composition of the samples and clarify the representation of undifferentiated and differentiated spermatogonial cells and other cells.

      This admixture problem influences the results - the authors show ATAC-seq accessibility traces for several genes in Fig. 2E (exhibiting differences between P15 and Adult), including Ihh, which is not expressed by spermatogenic cells, and Col6a1, which is expressed by peritubular myoid cells. Thus, the methods in this paper are fundamentally flawed, which precludes drawing any firm conclusions from the data about changes in chromatin accessibility among spermatogonia (SCs?) across postnatal testis development.

      The reviewer raises a concern about the lack of correspondence between chromatin accessibility and expression observed for some genes, arguing that this precludes drawing firm conclusions. However, a dissociation between chromatin accessibility and gene expression is normal and expected. Chromatin accessibility is only a readout of protein deposition and occupancy (e.g., by transcription factors, chromatin regulators, or nucleosomes) at specific genomic loci; it gives no functional information about whether transcription is ongoing. A gene that is repressed or poised for expression can still show a clear signal of chromatin accessibility at regulatory elements. The dissociation between chromatin accessibility and transcription has been reported in many different cells and conditions (PMID: 36069349, PMID: 33098772), including in spermatogonial cells (PMID: 28985528) and in gonads in different species (PMID: 36323261). Therefore, the dissociation between accessibility and transcription is not a reason to conclude that our data are flawed.

      In addition, there already are numerous scRNA-seq datasets from mouse spermatogenic cells at the same developmental stages in question.

      This is true, but full transcriptomic profiling of cell populations, as in our study, provides transcriptional information that is deeper and more comprehensive. Our datasets identified >17,000 genes, while scRNA-seq typically identifies a few thousand genes. Our analyses also identified full-length transcripts, variants, isoforms, and low-abundance transcripts. These datasets are therefore a valuable addition to existing scRNA-seq data.

      Moreover, several groups have used bulk ATAC-seq to profile enriched populations of spermatogonia, including from synchronized spermatogenesis which reflects a high degree of purity (see Maezawa et al., 2018 PMID: 29126117 and Schlief et al., 2023 PMID: 36983846 and in cultured spermatogonia - Suen et al., 2022 PMID: 36509798) - so this topic has already begun to be examined. None of these papers was cited, so it appears the authors were unaware of this work.

      We apologize for not mentioning these studies in our manuscript; we will cite them in the revised version.

      The authors' methodological choice is even more surprising given the wealth of single-cell evidence in the literature since 2018 demonstrating the exceptional heterogeneity among spermatogonia at these developmental stages (the authors DID cite some of these papers, so they are aware). Indeed, it is currently possible to perform concurrent scATAC-seq and scRNA-seq (10x Genomics Multiome), which would have made these data quite useful and robust. As it stands, given the lack of novelty and critical methodological flaws, readers should be cautioned that there is little new information to be learned about spermatogenesis from this study, and in fact, the data in Figures 2-5 may lead readers astray because they do not reflect the biology of any one type of male germ cell. Indeed, not only do these data not add to our understanding of spermatogonial development, but they are damaging to the field if their source and identity are properly understood. Here are some specific examples of the problems with these data:

      Fig. 2D - Gata4 and Lhcgr are not expressed by germ cells in the testis.

      Fig. 3A - WT1 is expressed by Sertoli cells, so the change in accessibility of regions containing a WT1 motif suggests differential contamination with Sertoli cells. Since Wt1 mRNA was differentially high in P15 (Fig. 3B) - this seems to be the most likely explanation for the results. How was this excluded?

      Fig. 3D - Since Dmrt1 is expressed by Sertoli cells, the "downregulation" likely represents a reduction in Sertoli cell contamination in the adult, like the point above. Did the authors consider this?

      Regarding concerns about contamination by somatic cells (transcription). In addition to the results of our deconvolution analysis (see response to Reviewer #1), we addressed the specific concern of the paradoxical expression of genes considered markers of somatic cells in the testis. For instance, we plotted the expression values of Ihh, Lhcgr, Gata4, Col16a1, Wt1, and Dmrt1 along with the expression values of Ddx4 and Zbtb16. We observe that the expression level of Ddx4 and Zbtb16, genes expressed predominantly in SPGs, is orders of magnitude higher than that of the other genes, with the notable exception of Dmrt1, which is also highly expressed (Author response image 6). Indeed, our analysis of publicly available single-cell RNA-seq datasets shows that Dmrt1 is robustly expressed in germ cells (Author response image 7) and, as also noted by the reviewer, in Sertoli cells at postnatal stages. Notably, we observe a significant stepwise decrease in the expression of Dmrt1 across the postnatal maturation of SPG cells. This is highly unlikely to be a result of major contamination by Sertoli cells of just our postnatal libraries. We base this statement on three observations. First, the deconvolution analysis of all our RNA-seq libraries using four different expression signature matrices from high-quality single-cell RNA-seq from testis showed that our libraries are largely derived from SPGs. Second, the evaluation of our adult libraries with the PND6 signature matrix from Green et al., 2018 suggested that the proportion of Sertoli cells in our adult libraries, if any, would be higher than in our postnatal libraries (Author response image 3d, blue bars). This makes it unlikely that the observed decrease in expression of Dmrt1 in adult samples is due to prominent somatic contamination of the postnatal libraries. Third, the stepwise decrease in Dmrt1 expression seems to correlate with progression during postnatal development (Author response image 7), as feature maps of Dmrt1 expression derived from public single-cell RNA-seq experiments show a reduction in expression in adult SPGs in comparison with early postnatal stages (Author response image 7, last two panels). Thus, the observed effects are likely the result of gene regulatory processes that operate during the developmental maturation of SPGs.

      Author response image 6.

      Expression of germ and somatic cell markers in our RNA-seq datasets. Boxplots of log2(CPM) (top) and CPM (bottom) values for selected genes from our RNA-seq datasets. Each point in the boxplots represents the expression value of a biological replicate.

      Author response image 7.

      Expression of germ and somatic cell markers in publicly available single-cell RNA-seq datasets. Seurat clusters from all analyzed single-cell RNA-seq datasets (first column from left) and feature maps of gene expression for Zbtb16, Dmrt1 and Wt1.

      Consistent with the reviewer’s observation, Ihh is not expressed in germ cells, and indeed we detect no signal at this locus or at Lhcgr. Furthermore, while we indeed observe a significant increase in the expression of Wt1 in PND15 samples, its expression level is considerably lower than that of SPG markers. This is even more evident when plotting expression data on a linear scale rather than as a log2 transformation of the expression values. Whether such transcriptional profiles reflect developmentally regulated transcription, stochastic effects on gene expression, or potential somatic contamination is difficult to determine. However, based on our deconvolution data, we believe it is unlikely that major contamination could account for our observations.

      Notably, while Wt1 is robustly expressed in nearly all Sertoli cells across postnatal development (Author response image 7), it is also detected in other cell types including SPGs (although in fewer cells and at lower expression levels), consistent with our observations (Author response images 6 and 8). Therefore, the assignment of a gene as a marker of a particular cell type does not imply that the gene is expressed uniquely in that cell type; rather, it is expressed in more cells of that type and likely at higher levels.

      Author response image 8.

      Expression of Wt1 in publicly available single-cell RNA-seq datasets. Feature maps of gene expression for Wt1. Dashed boxes show a zoom-in on the germ cell cluster, in which some cells express Wt1.

      Regarding concerns about contamination by somatic cells (chromatin accessibility). In Figure 2 of our manuscript, we show the chromatin accessibility landscape of different genes, including a gene not expressed in testicular cells (Ihh) and genes believed to be expressed exclusively in somatic cells (Lhcgr, Gata4, Col16a1, Wt1). For some of these genes, we reported changes in chromatin accessibility at specific sites between PND15 and adults (e.g. Wt1 and Col16a1). The observation of "traces of chromatin accessibility" at these loci and the reported changes in accessibility raised concerns of potential contamination which "fundamentally flaw" our results, as stated by the reviewer. While we acknowledge that all enrichment methods have a margin of potential contamination, we fundamentally disagree with the reviewer's observations.

      The term chromatin accessibility can be misleading. In principle, the term accessibility might suggest the literal lack of protein deposition at a given place in the genome. Rather, chromatin accessibility as evaluated by ATAC-seq (as in this case) must be interpreted as a measure of protein occupancy genome-wide (PMID: 30675018). Depending on the type of fragments analyzed, we can obtain information regarding the occupancy of transcription factors (TFs), nucleosomes, and other chromatin-associated proteins that are present at genomic locations at a given time within a population of cells. The detection of chromatin accessibility at a given locus does not necessarily indicate transcription of the gene in a given cell type. A gene can be repressed or poised for expression and still show a clear signal of chromatin accessibility at its regulatory elements or along the gene body. For instance, in agreement with the reviewer's observation, neither Ihh nor Lhcgr is expressed in our datasets (Author response image 6 and Author response image 9); however, they show a distinctive pattern of chromatin accessibility in our datasets and in publicly available ATAC-seq data derived from undifferentiated (Id4-bright) and differentiating (Id4-dim) SPGs (Cheng et al., 2020) (Author response image 9). A similar argument can be applied regarding other loci such as Wt1 and Col6a1, for which we also observe extremely low levels of transcription. Therefore, the lack of transcription does not exclude that these loci display clear patterns of chromatin accessibility (Author response image 9). Notably, while traces of chromatin accessibility can also be observed in ATAC-seq datasets from embryonic Sertoli cells (Garcia-Moreno et al., 2019) and other somatic stem cells (hematopoietic stem cells; HSCs) (Xiang et al., 2020) (Author response image 9), the pattern of chromatin accessibility differs markedly from that observed in SPG cells. Therefore, the observed changes in chromatin accessibility are unlikely to result from contaminating somatic cells.

      To strengthen our observation, we identified regions of chromatin accessibility in SPGs, Sertoli cells, and HSCs using both our datasets and publicly available ATAC-seq datasets. Overlap analysis revealed at least four groups of ATAC-seq peaks: 1) peaks shared among all analyzed cell types, 2) peaks shared just among SPG cells, 3) peaks specific to Sertoli cells and 4) peaks specific to HSCs (Author response image 10). Peaks shared among all tested cell types are predominantly located at promoters of genes involved in translation and DNA replication (GO analysis adj p-value<0.05). In contrast, cell-type-specific peaks are localized at intergenic and intragenic regions, suggesting localization at enhancer elements (Author response image 10). Indeed, GO analysis of cell-type-specific peaks revealed enrichment for genes involved in male meiosis for SPGs, vesicle-mediated transport for Sertoli cells, and immune system process for HSCs, consistent with cell-type-specific functions. If contamination by somatic cells, such as Sertoli cells, were prominent as stated by the reviewer, we would expect to observe prominent ATAC-seq signal from our datasets at peaks specific to Sertoli cells. Notably, we do not observe ATAC-seq signal at Sertoli-specific peaks in our ATAC-seq samples, whereas we observe robust signals at shared peaks and at peaks specific to SPG cells. This observation strongly argues against the possibility of major contamination by somatic cells.
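
      As an illustration only, the grouping of peaks into shared and cell-type-specific sets can be sketched in Python. This is not the pipeline used for the analysis above: the function names, the toy coordinates, and the "at least one shared base" overlap rule are all assumptions made for the example, and the sketch groups peaks by cell type rather than by individual SPG dataset.

```python
from collections import defaultdict


def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def classify_peaks(reference_peaks, peaks_by_cell_type):
    """Map each reference peak to the set of cell types with an overlapping peak."""
    labels = {}
    for ref in reference_peaks:
        labels[ref] = frozenset(
            cell_type
            for cell_type, peaks in peaks_by_cell_type.items()
            if any(overlaps(ref, p) for p in peaks)
        )
    return labels


def group_peaks(labels, all_cell_types):
    """Split peaks into 'shared', '<cell type>-specific', and other groups."""
    everything = frozenset(all_cell_types)
    groups = defaultdict(list)
    for peak, hits in labels.items():
        if not hits:
            groups["unassigned"].append(peak)
        elif hits == everything:
            groups["shared"].append(peak)
        elif len(hits) == 1:
            groups[f"{next(iter(hits))}-specific"].append(peak)
        else:
            groups["partially shared"].append(peak)
    return dict(groups)


# Hypothetical toy peaks (made-up coordinates, not real data).
peaks_by_cell_type = {
    "SPG": [("chr1", 100, 300), ("chr1", 1000, 1200)],
    "Sertoli": [("chr1", 150, 350), ("chr2", 500, 700)],
    "HSC": [("chr1", 200, 400), ("chr3", 10, 90)],
}
reference_peaks = [
    ("chr1", 100, 400),
    ("chr1", 1000, 1200),
    ("chr2", 500, 700),
    ("chr3", 10, 90),
]

labels = classify_peaks(reference_peaks, peaks_by_cell_type)
groups = group_peaks(labels, peaks_by_cell_type.keys())
for name, members in groups.items():
    print(name, members)
```

      Under this scheme, prominent Sertoli cell contamination of an SPG library would be expected to appear as signal at the peaks placed in the Sertoli-specific group.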

      Author response image 9.

      Chromatin accessibility profiles at specific loci differ between SPG cells and other cell types. Genome-browser tracks for Ihh, Wt1, Col16a1 and Zbtb16. For each gene, an extended locus view is presented with RNA-seq data (this study) and normalized ATAC-seq tracks from our study and public sources (SPG Id4; GSE131657; Sertoli; GSM3346484; HSC; ENCFF204JEE). Public ATAC-seq datasets were generated using enrichment methods similar to the one employed in our study.

      Author response image 10.

      Shared and cell-type-specific ATAC-seq peaks among SPGs, Sertoli cells and HSCs. Top, normalized ATAC-seq signal heatmaps of shared and unique ATAC-seq peaks. PND15 and Adult samples are derived from our study. ATAC-seq signal is plotted +/- 500 bp from peak center. Bottom, pie charts of the genomic distribution of ATAC-seq peaks.

      Reviewer #3:

      In this study, Lazar-Contes and colleagues aimed to determine whether chromatin accessibility changes in the spermatogonial population during different phases of postnatal mammalian testis development. Because actions of the spermatogonial population set the foundation for continual and robust spermatogenesis and the gene networks regulating their biology are undefined, the goal of the study has merit. To advance knowledge, the authors used mice as a model and isolated spermatogonia from three different postnatal developmental age points using a cell sorting methodology that was based on cell surface markers reported in previous studies and then performed bulk RNA-sequencing and ATAC-sequencing. Overall, the technical aspects of the sequencing analyses and computational/bioinformatics seem sound but there are several concerns with the cell population isolated from testes and lack of acknowledgment for previous studies that have also performed ATAC-sequencing on spermatogonia of mouse and human testes. The limitations, described below, call into question the validity of the interpretations and reduce the potential merit of the findings. I suggest changing the acronym for spermatogonial cells from SC to SPG for two reasons. First, SPG is the commonly used acronym in the field of mammalian spermatogenesis. Second, SC is commonly used for Sertoli Cells.

      We thank the reviewer for the suggestion and will rename SCs into SPG cells in the revised manuscript.

      The authors should provide a rationale for why they used postnatal day 8 and 15 mice.

      We will provide a rationale for the use of postnatal 8 and 15 stages in the revised manuscript. Briefly, these stages are interesting to study because early to mid postnatal life is a critical window of development for germ cells during which environmental exposure can have strong and persistent effects. The possibility that changes in germ cells can happen during this period and persist until adulthood is an important area of research linked to disciplines like epigenetic toxicology and epigenetic inheritance.

      The FACS sorting approach used was based on cell surface proteins that are not germline-specific so there were undoubtedly somatic cells in the samples used for both RNA and ATAC sequencing. Thus, it is essential to demonstrate the level of both germ cell and undifferentiated spermatogonial enrichment in the isolated and profiled cell populations. To achieve this, the authors used PLZF as a biomarker of undifferentiated spermatogonia. Although PLZF is indeed expressed by undifferentiated spermatogonia, there have been several studies demonstrating that expression extends into differentiating spermatogonia. In addition, PLZF is not germ-cell specific and single-cell RNA-seq analyses of testicular tissue have revealed that there are somatic cell populations that express Plzf, at least at the mRNA level. For these reasons, I suggest that the authors assess the isolated cell populations using a germ-cell specific biomarker such as DDX4 in combination with PLZF to get a more accurate assessment of the undifferentiated spermatogonial composition. This assessment is essential for the interpretation of the RNA-seq and ATAC-seq data that was generated.

      In agreement with the reviewer’s observation, Zbtb16 (PLZF) is expressed in germ cells but also in somatic cells, in particular in the dataset derived from Green et al., 2018 (Author response image 11). However, when evaluating the expression patterns of Ddx4, we noticed that similar to Zbtb16, it is expressed both in the germ line and in the somatic compartment (Author response image 11). Notably, we observe expression of Ddx4 in SSC but also in progenitors and differentiating SPGs (Author response image 11g). These observations suggest that at least at the transcript level, both genes are transcribed in germ cells and to a lesser degree in somatic cells. 

      Author response image 11.

      Single-cell expression of Ddx4 and Zbtb16. Seurat clusters from all analyzed single-cell RNA-seq datasets (a,c,e,g,i) and feature maps of gene expression for Ddx4 and Zbtb16 (b,d,f,h,j).

      Finally, our deconvolution analysis using gene expression signature matrices for different cellular populations suggests that our RNA-seq and ATAC-seq libraries are largely derived from SPG cells, and in particular from SSCs.

      Furthermore, while this analysis suggested the presence of somatic cells, their proportion is minimal in comparison with germ cells (Author response images 1-4). This is also supported by ATAC-seq analysis of somatic cells from testis (Author response images 9 and 10). 

      A previous study by the Namekawa lab (PMID: 29126117) performed ATAC-seq on a similar cell population (THY1+ FACS sorted) that was isolated from pre-pubertal mouse testes. It was surprising to not see this study referenced in the current manuscript. In addition, it seems prudent to cross-reference the two ATAC-seq datasets for commonalities and differences. In addition, there are several published studies on scATACseq of human spermatogonia that might be of interest to cross-reference with the ATAC-seq data presented in the current study to provide an understanding of translational merit for the findings.

      We compared our ATAC-seq datasets with those from Maezawa et al., 2017 and Cheng et al., 2020. All these datasets were generated from FACS-sorted cells enriched for undifferentiated and differentiating SPGs. Sequencing files from Cheng et al., 2020 were processed as described in our Methods section, while our pipeline was adjusted to process files from Maezawa et al., 2017 because they were single-end sequencing files. We generated a reference set of peaks from SPGs and calculated signal scores for all peaks across all samples. We then calculated the Pearson correlation for all pairwise comparisons and generated a heatmap of correlations (Author response image 12). Two clusters emerge that separate the SPG samples from the pachytene spermatocytes and round spermatids reported by Maezawa et al., 2017. As expected, SPG samples clustered together based on study of origin. Consistently, our postnatal samples formed one cluster next to but separated from the adult one. Similarly, the Id4-bright samples clustered together and next to the Id4-dim samples, and the same applied to the Thy1 and cKit samples. Notably, our samples and the ones from Cheng et al., 2020 have a higher correlation with each other than with the ones from Maezawa et al., 2017. Given the fundamental difference in library sequencing (single-end instead of the paired-end sequencing widely used for ATAC-seq experiments), we reasoned that a comparison with the Maezawa et al., 2017 datasets is not optimal. Therefore, these data, in addition to those presented before (see response to Reviewers 1 and 2), strongly support a predominantly SPG derivation of all our sequencing libraries.
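
      For readers unfamiliar with this type of comparison, the pairwise Pearson correlation over a common peak set can be sketched as follows. This is a minimal illustration in plain Python, not the deepTools computation used for Author response image 12; the sample names and signal scores are invented for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(signal_by_sample):
    """All pairwise correlations; each vector holds one score per reference peak."""
    names = list(signal_by_sample)
    return {
        a: {b: pearson(signal_by_sample[a], signal_by_sample[b]) for b in names}
        for a in names
    }


# Hypothetical signal scores at five reference peaks (made-up numbers).
signal_by_sample = {
    "SPG_postnatal": [8.0, 7.5, 1.0, 0.5, 6.0],
    "SPG_adult": [7.0, 8.0, 1.5, 1.0, 5.0],
    "spermatocyte": [1.0, 2.0, 9.0, 8.0, 1.5],
}

corr = correlation_matrix(signal_by_sample)
for name, row in corr.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

      Hierarchical clustering of such a matrix is what groups samples with similar accessibility profiles next to each other in a correlation heatmap.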

      Author response image 12.

      Pearson correlation at the peak level among different ATAC-seq datasets. a) Our ATAC-seq libraries and ATAC-seq libraries from b) Cheng et al., 2020 and c) Maezawa et al., 2017. Thy1-1 and cKit libraries correspond to undifferentiated and differentiating SPGs, respectively. PS, pachytene spermatocytes; RS, round spermatids. Correlation analysis was done using deepTools.

      References

      Cheng K, Chen I-C, Cheng C-HE, Mutoji K, Hale BJ, Hermann BP, Geyer CB, Oatley JM, McCarrey JR. 2020. Unique Epigenetic Programming Distinguishes Regenerative Spermatogonial Stem Cells in the Developing Mouse Testis. iScience 23:101596. doi:10.1016/j.isci.2020.101596

      Cobos FA, Panah MJN, Epps J, Long X, Man T-K, Chiu H-S, Chomsky E, Kiner E, Krueger MJ, Bernardo D di, Voloch L, Molenaar J, Hooff SR van, Westermann F, Jansky S, Redell ML, Mestdagh P, Sumazin P. 2023. Effective methods for bulk RNA-seq deconvolution using scnRNA-seq transcriptomes. Genome Biol 24:177. doi:10.1186/s13059-023-03016-6

      Drumond AL, Meistrich ML, Chiarini-Garcia H. 2011. Spermatogonial morphology and kinetics during testis development in mice: a high-resolution light microscopy approach. Reproduction 142:145–155. doi:10.1530/rep-10-0431

      Ernst C, Eling N, Martinez-Jimenez CP, Marioni JC, Odom DT. 2019. Staged developmental mapping and X chromosome transcriptional dynamics during mouse spermatogenesis. Nat Commun 10:1251. doi:10.1038/s41467-019-09182-1

      Garcia-Moreno SA, Futtner CR, Salamone IM, Gonen N, Lovell-Badge R, Maatouk DM. 2019. Gonadal supporting cells acquire sex-specific chromatin landscapes during mammalian sex determination. Dev Biol 446:168–179. doi:10.1016/j.ydbio.2018.12.023

      Green CD, Ma Q, Manske GL, Shami AN, Zheng X, Marini S, Moritz L, Sultan C, Gurczynski SJ, Moore BB, Tallquist MD, Li JZ, Hammoud SS. 2018. A Comprehensive Roadmap of Murine Spermatogenesis Defined by Single-Cell RNA-Seq. Dev Cell 46:651-667.e10. doi:10.1016/j.devcel.2018.07.025

      Hermann BP, Cheng K, Singh A, Cruz LR-DL, Mutoji KN, Chen I-C, Gildersleeve H, Lehle JD, Mayo M, Westernströer B, Law NC, Oatley MJ, Velte EK, Niedenberger BA, Fritze D, Silber S, Geyer CB, Oatley JM, McCarrey JR. 2018. The Mammalian Spermatogenesis Single-Cell Transcriptome, from Spermatogonial Stem Cells to Spermatids. Cell Rep 25:1650-1667.e8. doi:10.1016/j.celrep.2018.10.026

      Kubota H, Brinster RL. 2018. Spermatogonial stem cells. Biol Reprod 99:52–74. doi:10.1093/biolre/ioy077

      Law NC, Oatley MJ, Oatley JM. 2019. Developmental kinetics and transcriptome dynamics of stem cell specification in the spermatogenic lineage. Nat Commun 10:2787. doi:10.1038/s41467-019-10596-0

      Maezawa S, Yukawa M, Alavattam KG, Barski A, Namekawa SH. 2017. Dynamic reorganization of open chromatin underlies diverse transcriptomes during spermatogenesis. Nucleic Acids Res 46:gkx1052-. doi:10.1093/nar/gkx1052

      McCarrey JR. 2013. Toward a More Precise and Informative Nomenclature Describing Fetal and Neonatal Male Germ Cells in Rodents. Biol Reprod 89:Article 47, 1-9. doi:10.1095/biolreprod.113.110502

      Newman AM, Steen CB, Liu CL, Gentles AJ, Chaudhuri AA, Scherer F, Khodadoust MS, Esfahani MS, Luca BA, Steiner D, Diehn M, Alizadeh AA. 2019. Determining cell type abundance and expression from bulk tissues with digital cytometry. Nat Biotechnol 37:773–782. doi:10.1038/s41587-019-0114-2

      Rabbani M, Zheng X, Manske GL, Vargo A, Shami AN, Li JZ, Hammoud SS. 2022. Decoding the Spermatogenesis Program: New Insights from Transcriptomic Analyses. Annu Rev Genet 56:339–368. doi:10.1146/annurev-genet-080320-040045

      Rooij DG de. 2017. The nature and dynamics of spermatogonial stem cells. Development 144:3022–3030. doi:10.1242/dev.146571

      Tan K, Song H-W, Wilkinson MF. 2020. Single-cell RNAseq analysis of testicular germ and somatic cell development during the perinatal period. Development 147:dev183251. doi:10.1242/dev.183251

      Thumfart KM, Lazzeri S, Manuella F, Mansuy IM. 2022. Long-term effects of early postnatal stress on Sertoli cells. Front Genet 13:1024805. doi:10.3389/fgene.2022.1024805

      Xiang G, Keller CA, Heuston EF, Giardine BM, An L, Wixom AQ, Miller A, Cockburn A, Sauria MEG, Weaver K, Lichtenberg J, Göttgens B, Li Q, Bodine D, Mahony S, Taylor J, Blobel GA, Weiss MJ, Cheng Y, Yue F, Hughes J, Higgs DR, Zhang Y, Hardison RC. 2020. An integrative view of the regulatory and transcriptional landscapes in mouse hematopoiesis. Genome Res 30:gr.255760.119. doi:10.1101/gr.255760.119

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Recommendations for The Authors):

      Q1: Please replace lymphocytes with lymphatic endothelial cells throughout the manuscript.

      A1: Thank you for your conscientious review. Per your suggestion, we have replaced “lymphocytes” with “lymphatic endothelial cells (LECs)” throughout the manuscript.

      Q2: Please re-analyse lymphatics using LYVE1 and CD68 or another macrophage marker, as Lyve1 is NOT specific for lymphatics.

      A2: Thank you for your suggestion. We completely agree with your opinion. Because both the CD68 (CST, 97778S) and LYVE1 (Abcam, ab14917) antibodies are rabbit multiclonal antibodies, and to label cardiac lymphatics more accurately, we performed immunofluorescence co-staining using LYVE1 and PDPN (Thermo, 53-5381-82) antibodies and re-measured the lymphatic vessel area using ImageJ software (version 1.53). The result is shown in Figure 1A and 1B. Further, we performed co-staining with PDPN and CD68 to observe the relationship between macrophage and cardiac lymphatic vessel distributions at different time points post-myocardial infarction (MI) (Figure 1-figure supplement 1F). Consistent with your comment, some macrophages in the heart tissue are LYVE1-positive but may be PDPN-negative. We have added the catalog numbers of anti-PDPN and anti-CD68 to the Methods (Page 10, Lines 351‒352) and updated them in the KRT template and MDAR checklist.

      Q3: Rephrase title 2.6, 2.7 to fit the results in these sections that are purely descriptive and do not add any insight into the functional relevance of the findings.

      A3: Thank you for your suggestion. We have rephrased titles 2.6 and 2.7 as follows:

      2.6 AQP1 in LEC is correlated with myocardial edema occurrence and resolution post-MI.

      2.7 Gal9 secreted by LEC can affect macrophage migration.

      Q4: Please refrain from extensive discussion of non-significant findings, such as Figures 6D, and 7A, B, and M (ifng vs ifng + antiGal9 is n.s).

      A4: Thank you for your suggestions. Lymphatic endothelial cells (LECs) are a type of cell that exists in the myocardial tissue in small quantities. Owing to the extremely small number of LECs, elucidating their biological functions and regulation may be challenging during MI. To gain a deeper understanding of the role of the lymphatic system post-MI, we attempted to analyze the transcriptomic changes of LEC subsets at different time points after MI by combining single-cell sequencing and spatial transcriptomics data. We have selected relevant molecules with significant differences in transcription levels and conducted the validation analysis in LECs at different time points after MI. Among them, AQP1 and GAL9 showed significant differences. CD44, as a receptor for GAL9, showed significant differences in its expression in macrophages at different time points after MI. Therefore, we have added the relevant information to the discussion section (marked with yellow) on Page 9, Lines 299‒312.

      Q5: Please explain the method used to calculate lymphatic areas in Figure 1.

      A5: Thank you for your observation. The method we used is consistent with that described in previous studies [1,2] (PMID: 30582443 and PMID: 32404007). The detailed methods have been described in the Methods as follows (Page 10, Lines 358‒363):

      For quantification of vessel area, vessels with visible co-staining were measured using ImageJ software. First, we selected an image, converted it to 8-bit, and then applied a suitable threshold adjustment (to capture co-stained areas wherever possible). Second, five equally sized squares were selected in the respective zones (remote, infarct, and border zones) of each slice. ROI Manager tools were used for automatic quantification of signal intensity by the software within each square. Finally, GraphPad software was used to plot the results as a bar graph.
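
      The logic of this quantification can be illustrated with a short Python sketch. It is not the ImageJ workflow itself: it assumes that the measured quantity is the fraction of above-threshold pixels in each square region of interest (ROI), and the function names, the toy 4 × 4 image, the threshold, and the ROI positions are all hypothetical.

```python
def to_8bit(image, max_value):
    """Rescale pixel intensities to 0-255, analogous to an 8-bit conversion."""
    return [[round(255 * px / max_value) for px in row] for row in image]


def positive_area_fraction(image, top, left, size, threshold):
    """Fraction of pixels at or above `threshold` inside one square ROI."""
    roi = [row[left:left + size] for row in image[top:top + size]]
    positive = sum(px >= threshold for row in roi for px in row)
    return positive / (size * size)


def mean_area_fraction(image, roi_corners, size, threshold):
    """Average the positive-area fraction over equally sized square ROIs."""
    fractions = [
        positive_area_fraction(image, top, left, size, threshold)
        for top, left in roi_corners
    ]
    return sum(fractions) / len(fractions)


# Hypothetical 4 x 4 image with raw intensities up to 4000 (made-up values).
raw = [
    [0, 0, 4000, 4000],
    [0, 0, 4000, 4000],
    [0, 0, 0, 0],
    [0, 4000, 0, 0],
]
image = to_8bit(raw, max_value=4000)

# Five 2 x 2 squares given as (top, left) corners; positions are arbitrary here.
corners = [(0, 0), (0, 2), (2, 0), (2, 2), (1, 1)]
result = mean_area_fraction(image, corners, size=2, threshold=128)
print(f"mean positive area fraction: {result:.2f}")
```

      In practice the squares would be placed within one zone (remote, infarct, or border) of a stained section, and one value per zone and section would be carried forward to the bar graph.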

      Q6: In Figure 1 supp C, the upper and lower panels don't seem to have the same zoom factor.

      A6: Thank you for pointing this out. The upper and lower images in Figure S1C have the same magnification. To facilitate your review, we have added a 1× image and re-labeled the position and scale information of the image. The revised Figure S1C was added to the manuscript and is shown as follows:

      Q7: In Figure 2d please include aqp1 among displayed genes.

      A7: Thank you for your suggestion. The Aqp1 gene is already displayed as the 11th gene, and we have now labeled it.

      Q8: In Figure 2f include markers of LECs such as Prox1, Flt4, Itga9, and also show Aqp1 here.

      A8: Thank you for your valuable comment. We have updated Figure 2f.

      Q9: Please indicate in Figure 3a what the y axis means? % of total LECs? % of total LECs at a given time point? The data is really not clear.

      A9: Thank you for your suggestion. The y-axis represents the percentage of the total number of LECs at d1, d3, d7, d14, and d28 post-MI, relative to the number of LECs at d0, which is used as the reference value set at 100%. Meanwhile, different colors were applied to represent the proportion of different cell subtypes at different time points. We have updated Figure 3a.
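
      One way to read this normalization is shown in the short Python sketch below. It reflects our interpretation of the description rather than the code used to draw Figure 3a; the subtype names and cell counts are invented for the example.

```python
def percent_of_baseline(counts_by_day, baseline="d0"):
    """Express each subtype count as a percentage of the total LEC count at baseline."""
    base_total = sum(counts_by_day[baseline].values())
    return {
        day: {subtype: 100 * n / base_total for subtype, n in subtypes.items()}
        for day, subtypes in counts_by_day.items()
    }


# Hypothetical LEC counts per subtype and time point (made-up numbers).
counts_by_day = {
    "d0": {"LEC_a": 60, "LEC_b": 40},
    "d3": {"LEC_a": 30, "LEC_b": 20},
}

percentages = percent_of_baseline(counts_by_day)
for day, subtypes in percentages.items():
    # The height of each stacked bar is the sum over subtypes.
    print(day, subtypes, "total:", sum(subtypes.values()))
```

      With these toy counts the d0 bar sums to 100% by construction, and the d3 bar sums to 50%, with the colored segments showing how much each subtype contributes.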

      Q10: Add n of LECs per time point in Figures 3a and b.

      A10: Thank you for your suggestion. We have updated Figure 3b.

      Q11: For Figure 3c please explain what marker genes were used to identify LEC enriched areas. What was the spatial resolution of the transcriptomic screens? How do these images relate to the localization of lymphatics in the heart?

      A11: We appreciate your observation. We have added the required information to the Methods on Page 13, Lines 442‒448, as follows:

      “We conducted spatial transcriptome data analysis using the deconvolution algorithm. The deconvolution algorithm refers to the application of feature genes to infer the full matrix information of single-cell transcriptome of cell subclusters. We then compared and anchored the matrix information of the single-cell transcriptome with the information of each SPOT in the spatial transcriptome, predicting cell types based on the similarity between the two sets of information.”

      Q12: In Figure 6, explain the y-axis in panel A, the timepoint in panel G, and the absence of Aqp1 staining in blood vessels in images d1 and d3 in panel D.

      A12: Thank you for your suggestion. The y-axis in Figure 6A (Figure to reviewer 7A) shows Aqp1 expression in LECs at different time points from the scRNA-seq data. We have also added the timepoint in Figure 6G, which is 24 hours. To show the expression trend of AQP1 more clearly, we performed immunofluorescence staining of AQP1 and LYVE1 at different time points after MI (d0, d1, d3, d7, and d14). The results are shown in Figure to reviewer 7C. AQP1 expression was found to be increased in the border zone of the infarct at d3 post-MI, adjacent to the LYVE1-positive area.

      Q13: Explain the y-axis unit in Figure 7a.

      A13: Thank you for your comment. The y-axis in Figure 7A shows Lgals9 gene expression in LECs at different time points from the scRNA-seq data.

      Q14: In Figure 7c, d how was the induction of cell death excluded as a cause of IFNg-mediated effects in LECs?

      A14: Thank you for your suggestion. To rule out apoptosis as a confounder, we performed TUNEL staining of LECs after stimulation with different concentrations of IFN-γ for 24 h. As shown in the Figure to reviewer 9, little apoptosis of LECs was observed in this concentration range. Therefore, we can exclude the potential impact of IFN-γ-induced cell apoptosis.

      Author response image 1.

      TUNEL staining of LECs after stimulation with different concentrations of IFN-γ for 24 h.

      Q15: Results with hypoxia in Figure 7 are mentioned but not shown.

      A15: Thank you for your observation. In the revised article, we added measurements of Gal9 expression after hypoxic stimulation. We conducted hypoxia experiments using two methods. First, we exposed cells to 1% oxygen and measured Gal9 expression at 0 h, 2 h, 4 h, 8 h, 12 h, and 24 h. Second, we applied CoCl2 to activate HIF1α expression and mimic hypoxia, and then measured Gal9 expression. Both results showed that hypoxia did not stimulate LECs to secrete galectin 9. The results are presented in Figure 7-figure Supplement 1 (A-D).

      Reviewer #3 (Recommendations For The Authors):

      Q1: In Figure 1, the so-called "LYVE1-labeled lymphatic capillaries with discontinuous walls" might be macrophages. The authors measured lymphatic area by measuring "vessels with visible lumens", which is unclear. This may underestimate the number of capillaries that expand after MI in the border zone of the infarct area. The authors need to use CD68 and Pdpn markers, as Lyve1 is not specific for lymphatics and also stains macrophages, and Pdpn is more reliable for assessing lymphatic identity.

      A1: Thank you for your suggestion; we agree with your point. Because both the CD68 (CST, 97778S) and LYVE1 (Abcam, ab14917) antibodies are raised in rabbit, and to label cardiac lymphatics more accurately, we performed immunofluorescence co-staining using LYVE1 and PDPN (Thermo, 53-5381-82) antibodies and re-measured the lymphatic vessel area using ImageJ software (version 1.53). The result is shown in Figure to reviewer 1 (Figure 1A and 1B in the manuscript). Further, we performed co-staining with PDPN and CD68 to observe the relationship between macrophage and cardiac lymphatic vessel distributions at different time points post-myocardial infarction (Figure to reviewer 2, and Figure 1-figure supplement 1F in the manuscript). As you noted, some macrophages in the heart tissue are LYVE1-positive, whereas they may be PDPN-negative. We have added notes on the catalog numbers of anti-PDPN and anti-CD68 in the Methods (Page 10, Lines 351‒352) and updated them in the KRT template and MDAR checklist.

      Q2: It is not clear how they analyse the lymphatic area in Figure 1, please explain.

      A2: Thank you for your observation. The method we used is consistent with that described in previous studies [1,2] (PMID: 30582443 and PMID: 32404007). The detailed methods have been described in the Methods as follows (Page 10, Lines 347‒352):

      For quantification of vessel area, vessels with visible co-staining were measured using ImageJ software. First, we selected an image, converted it to 8-bit, and then applied a suitable threshold adjustment (to capture co-stained areas wherever possible). Second, five equally sized squares were selected in the respective zones (remote, infarct, and border zones) of each slice. The ROI Manager tool was used to quantify the signal automatically within each square. Finally, GraphPad software was used to plot the results as a bar graph.
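
      The quantification steps described in A2 (threshold an 8-bit image, then measure inside equally sized square regions) can be illustrated with a short sketch. This is not the ImageJ workflow itself: the function names, the fixed threshold, and the choice to report the thresholded area fraction are assumptions made for illustration, and the image is a plain nested list rather than a microscopy file.

```python
def area_fraction(image, top, left, size, threshold):
    """Fraction of pixels at or above `threshold` inside one square ROI.

    `image` is a list of rows of 8-bit intensities (0-255); the ROI is the
    `size` x `size` square whose upper-left corner is at (top, left).
    """
    positive = 0
    for row in image[top:top + size]:
        for value in row[left:left + size]:
            if value >= threshold:
                positive += 1
    return positive / (size * size)


def mean_area_fraction(image, rois, size, threshold):
    """Average the thresholded area fraction over several equally sized ROIs.

    `rois` is a list of (top, left) corners, e.g. five squares per zone.
    """
    fractions = [area_fraction(image, top, left, size, threshold)
                 for top, left in rois]
    return sum(fractions) / len(fractions)
```

      Averaging over several squares per zone, as in the described procedure, reduces the influence of where any single square happens to be placed.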

      Q3: Figure 1-supplement 1D: The authors claim that the observed structure is a lymphatic valve, however in 2D sections, this shape might result from membrane destruction due to the cutting and staining process. To accurately identify valves, the authors should employ 3D imaging of the lymphatic network, such as using a clearing protocol followed by lightsheet microscopy.

      A3: Thank you for your suggestion. We performed a 3D scan using a confocal microscope on another slice. The results are shown in Figure 1-supplement 1D. We believe the structure is more consistent with a lymphatic valve than with debris from membrane damage.

      Q4: In Figure 2, the number of LECs is too little. Indeed, 242 LECs were identified over 44860 total cell numbers and 5688 endothelial cells cannot be representative and cannot afford to distinguish 4 different clusters.

      A4: We further analyzed the percentage of LECs in the adult mouse heart in the physiological state at d0 based on single-nucleus sequencing results from a public database (GSE214611). A total of 292 LECs were obtained from the 26,779 cells captured across three samples, meaning that LECs account for 1.09% of cells in the normal adult mouse heart. Cardiac LECs are indeed rare; enrichment methods for cardiac LECs, such as flow cytometry and magnetic bead separation, are under active investigation and may provide stronger evidence in future studies.
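
      The percentages at issue in Q4 and A4 follow directly from the quoted counts. The short check below uses only those counts; the helper name `percent` is introduced here for illustration.

```python
def percent(part, total):
    """Return `part` as a percentage of `total`."""
    return 100.0 * part / total


# Counts quoted in A4 (public dataset GSE214611, d0).
lec_percent_public = percent(292, 26779)

# Counts quoted in Q4 (LECs among all cells in the study's dataset).
lec_percent_study = percent(242, 44860)
```

      Rounded to two decimals, these evaluate to 1.09% and 0.54%, respectively, so LECs are a small minority of captured cells in both datasets.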

      Q5: The authors claimed that there is transcriptional heterogeneity in regenerated cardiac LECs post-MI, based on their over-clusterization. However, to substantiate this claim, they need to include a control comparison. Currently, the observed differences in cardiac LEC profiles lack a direct connection to the disease condition.

      A5: Thank you for pointing this out. Because we could not download spatial transcriptome data for day d0 in the public database (GSE214611) or from the authors, we have used data of 1 h after IR as a reference for approximating the physiological state in Figure 3 and in Supplemental Figure 1.

      Q6: Line 131, what is the regeneration ratio the authors cite here?

      A6: Thank you for the comment. Regeneration ratio is an inappropriate use of the word, and we apologize for this confusion. We were actually referring to the regenerative potential of LECs.

      Q7: Line 132, it is not clear what is the "normal myocardial tissue" in the graphs presented Figures 3A and B. Is it d0 time point?

      A7: Thank you for your suggestion. The d0 time point means LECs in the normal adult mouse heart.

      Q8: In Figure 2D, please add more lymphatic markers as Ccl21, Flt4, Itga9, FoxC2 and Aqp1.

      A8: Thank you for your suggestion. We have added these markers (except Ccl21, whose expression is too low to display) to Figure 2D in the revised manuscript.

      Q9: The authors must replace "lymphocyte" with "lymphatic" from 2.5, where they start to present interactions between lymphatic and immune cells.

      A9: Thank you for your good comments. We have corrected these words.

      Q10: In Figure 3, please indicate what the color scale means.

      A10: Thank you for your suggestion. We have supplied a color scale label.

      Q11: In Figures 3C and D, the authors distinguished the same LECs clusters in the spatial transcriptomic as in the scRNAseq analysis. This is not clear whether they used the same markers.

      A11: We appreciate your observation. We have added the required information to the Methods on Page 12, Lines 429‒434, as follows:

      “We conducted spatial transcriptome data analysis using the deconvolution algorithm. The deconvolution algorithm refers to the application of feature genes to infer the full matrix information of single-cell transcriptome of cell subclusters. We then compared and anchored the matrix information of the single-cell transcriptome with the information of each SPOT in the spatial transcriptome, predicting cell types based on the similarity between the two sets of information.”
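
      The idea of predicting a spot's cell type "based on the similarity between the two sets of information" can be illustrated with a deliberately simplified sketch. This is an assumption-laden stand-in, not the deconvolution algorithm the authors used: real deconvolution tools typically estimate mixtures of cell types per spot, whereas this example assigns each spot to the single most similar reference profile using cosine similarity. All names and values are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length expression vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_spot_type(spot, signatures):
    """Return the name of the reference profile most similar to `spot`.

    `signatures` maps a subcluster name to its mean expression over the
    same ordered list of feature genes used for `spot`.
    """
    return max(signatures,
               key=lambda name: cosine_similarity(spot, signatures[name]))
```

      The key requirement, which also bears on the reviewer's question, is that the spot vector and every reference profile are built from the same feature genes in the same order.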

      Q12: In 2.5, it is not clear whether the main message is about macrophage interactions with lymphocytes or with lymphatics (LECs interacting with other cells).

      A12: Thank you for your suggestion. We have revised the title 2.5 as “Assessment of Cell-Cell Communication between LECs and immune cells,” which is clearer for the reader.

      Q13: In 2.6, the authors claim that they reveal "that fluid retention occurs in LEC ca I and LEC co". They don't show any data supporting this.

      A13: Thank you for your comment. “…that fluid retention occurs in LEC ca I and LEC co” is mainly supported by the KEGG enrichment in Figure 3D. LEC ca I is related to vasopressin-regulated water reabsorption, and LEC co is related to renin secretion.

      Q14: In Figure 6A, please add statistical values, as the authors claim a significant correlation. Please also add a figure to support the correlation between Aqp1 and edema score, as mentioned in 2.6.

      A14: Thank you for pointing this out. We have presented the information on statistical values in Figure 6A. Moreover, we calculated the correlation between Aqp1 and edema score in Figure 6D (shown in Author response image 2).

      Author response image 2.

      Correlation between Aqp1 expression intensity and edema score.

      Q15: In Figure 6B, myocardial edema assessment using H&E staining is not accurate. If the authors wish to analyse cardiac edema, they must use gravimetry or MRI techniques.

      A15: Thank you for your comment. We totally agree with your opinion. However, owing to limitations in experimental conditions, we could not perform MRI detection of mouse myocardial injury. To evaluate whether edema occurred in the mouse heart tissue, we used classic pathological evaluation methods described in the literature (PMID: 30582443). This method has been described in detail as follows (Page 11, Lines 365‒370):

      Four high-power (40×) representative images were chosen per animal under the H&E stained section; each image must have a clear border of the section visible. Images were blinded, and five visual fields per sample were evaluated. Subsequently, an edema score was determined for each sample (Score 1=no edema, 2=mild edema, 3=severe edema). Graphs represent the average score value per animal.
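
      The scoring rule quoted in A15 (a 1 to 3 score per visual field, averaged per animal) can be written out as a small sketch. The function name and the validation step are assumptions added for illustration; the score labels come from the quoted Methods.

```python
EDEMA_LABELS = {1: "no edema", 2: "mild edema", 3: "severe edema"}


def animal_edema_score(field_scores):
    """Average the per-field edema scores (1, 2, or 3) for one animal."""
    if not field_scores:
        raise ValueError("at least one scored field is required")
    for score in field_scores:
        if score not in EDEMA_LABELS:
            raise ValueError(f"invalid edema score: {score!r}")
    return sum(field_scores) / len(field_scores)
```

      For example, hypothetical field scores of 1, 2, 2, 3, and 2 give an animal-level score of 2.0, which is the value that would be plotted for that animal.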

      Q16: Line 227, please correct "LVEC" with "LEC".

      A16: Thank you for your careful review. We have revised this in the manuscript.

      Q17: In Figure 6D, IF co-staining of Aqp1 and lymphatic vessels is mentioned as "significantly reduced". However, we don't see any quantification data supporting this.

      A17: Thank you for your comment. To show the expression trend of AQP1 more clearly, we performed immunofluorescence staining of AQP1 and LYVE1 at different time points post-MI (d0, d1, d3, d7, and d14). The results are shown in the corrected Figure 6-figure supplement 1A. AQP1 expression increased in the border zone of the infarct at d3 post-MI, adjacent to the LYVE1-positive area.

      Q18: As Gal9 was not significantly impaired in LECs post-MI, Figure 7A does not support any real finding concerning the role of this molecule in monocytes/macrophages interaction with cardiac lymphatics.

      A18: Thank you for your comment. The Lgals9 gene is significantly impaired in LEC post-MI, as well as the Cd44 gene in macrophage. We have updated them in Figures 7A and 7B.

      Q19: In Figure 7, please replace INF with IFN.

      A19: Thank you for your careful review. We have revised this in the manuscript.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary.

      The authors' goal was to map the neural circuitry underlying cold-sensitive contraction in Drosophila. The circuitry underlying most sensory modalities has been characterized, but noxious cold sensory circuitry has not been well studied. The authors achieve their goal and map out sensory and post-sensory neurons involved in this behavior.

      Strengths.

      The manuscript provides convincing evidence for sensory and post sensory neurons involved in noxious cold sensitive behavior. They use both connectivity data and functional data to identify these neurons. This work is a clear advance in our understanding of noxious cold behavior. The experiments are done with a high degree of experimental rigor.

      Positive comments

      - CaMPARI is nicely done to map cold responsive neurons, although it doesn't give data on individual neurons.

      - Chrimson and TNT experiments are nicely done.

      - Cold temperature activates basin neurons, it's a solid and convincing result.

      Weaknesses.

      Among the few weaknesses in this manuscript are the failure to trace the circuit from sensory neuron to motor neuron and the omission of any analysis of the muscles driving cold-induced contraction. The authors also need to elaborate more on the novel aspects of their work in the introduction or abstract.

      We have performed a more thorough EM connectivity analysis of the CIII md neuron circuit (Figure 1A, Figure 1 – Figure supplement 1, Figure 10A). We now report all premotor neurons that are connected to CIII md neurons along with two additional projection/command-like neurons. These additional premotor neurons (A01d3, A02e, A02f, A02g, A27k, and A31k), which are primarily implicated in locomotion, were not required for cold nociception (Figure 5 – Figure supplement 2). Collectively, we have tested the requirement in cold nociception for ~94% of synapses between CIII md->premotor neurons and for all premotor neurons with available driver lines. The requirement in cold nociception was also assessed for the two projection/command-like neurons dLIP7 and A02o, which are required for sensory integration and directional avoidance to noxious touch, respectively (Figure 7 – Figure supplement 2) (Hu et al., 2017; Takagi et al., 2017). Silencing dLIP7 neurons resulted in a modest reduction in cold-evoked behaviors, whereas A02o neurons were not required for cold nociception (Figure 7 – Figure supplement 2). To complete the analysis from thermosensation to evoked behavior, we analyzed cold-evoked Ca<sup>2+</sup> responses of larval musculature (Figure 10). Premotor neurons, which are connected to CIII md neurons, target multiple muscle groups (DL, DO, LT, VL, and VO) (Figure 10A). Individual larval segments have unique cold-evoked Ca<sup>2+</sup> responses, with the strongest cold-evoked Ca<sup>2+</sup> response occurring in the central abdominal segments (Figure 10B-D). When motor neuron activity is inhibited or an anesthetic (ethyl ether) is applied, there is a negligible cold-evoked Ca<sup>2+</sup> response compared to controls (Figure 10 – Figure supplement 1). Analysis of cold-evoked Ca<sup>2+</sup> in individual muscles reveals unique Ca<sup>2+</sup> dynamics for individual muscle groups (Figure 10E-H).

      Major comments.

      - Class three sensory neuron connectivity is known, and its role in cold response is known (Turner 16, 18). Need to make it clearer what the novelty of the experiments is.

      In Figure 1, we guide the audience to CIII md neuron circuitry and emphasize the necessity and sufficiency of CIII md neurons in cold nociception. Previously, only transient (GCaMP6) cold-evoked Ca<sup>2+</sup> responses were reported (Turner et al., 2016, 2018). Here, however, using CaMPARI, we performed dendritic spatial (Sholl) analysis of cold-evoked Ca<sup>2+</sup> responses (Figure 1B-C). During the revision, we evaluated both CIII- and cold-evoked CT throughout larval development (Figure 1G, H). All in all, the findings from the first figure reiterate and replicate previous findings for the role of CIII md neurons in cold nociception. CIII md connectivity might be known; however, we investigated the functional and physiological roles of individual circuit neurons.

      - Why focus on premotor neurons in mechano nociceptive pathways? Why not focus on PMNs innervating longitudinal muscles, likely involved in longitudinal larval contraction? Especially since chosen premotor neurons have only weak effects on cold induced contraction?

      We assessed requirements for all premotor neurons that are connected to CIII md neurons and for which there are validated driver lines. Only premotor neurons previously implicated in mechanosensation (DnB, mCSI, and Chair-1) were also required for cold nociception. Premotor neurons previously implicated in locomotion (A01d3, A02e, A02f, A02g, A27k, and A31k) are not required for cold-evoked behaviors (Figure 5 – Figure supplement 2).

      Reviewer #2 (Public Review):

      Patel et al perform the analysis of neurons in a somatosensory network involved in responses to noxious cold in Drosophila larvae. Using a combination of behavioral experiments, calcium imaging, optogenetics, and synaptic connectivity analysis in the Drosophila larva, they assess the function of circuit elements in the somatosensory network downstream of multimodal somatosensory neurons involved in innocuous and noxious stimuli sensing and probe their function in noxious cold processing. Consistent with their previous findings, they find the multidendritic class III neurons to be the key cold-sensing neurons that are both required and sufficient for the CT behavioral response (shown to be evoked by noxious cold). They further investigate the downstream neurons identified based on literature and connectivity from EM at different stages of sensory processing, characterize the different phenotypes upon activating/silencing those neurons, and monitor their responses to noxious cold. The work reveals diverse phenotypes for the different neurons studied and provides the groundwork for understanding how information is processed in the nervous system from sensory input to motor output and how information from different modalities is processed by neuronal networks. However, at times the writing could be clearer and some interpretations of the results more rigorous.

      Specific comments

      (1) In Figure 1 -supplement 6D-F (Cho co-activation)

      The authors find that Ch neurons are cold sensitive and required for cold nociceptive behavior but do not facilitate behavioral responses induced by CIII neurons.

      The authors show that coactivating mdIII and cho inhibits the CT (a typically observed cold-induced behavioral response) in the second part of the stimulation period, while Cho was required for cold-induced CT. Different levels of activation of md III and Cho (different light intensities) could bring some insights into the observed phenotypes upon Cho manipulation as different levels activate different downstream networks that could correspond to different stimuli. Also, it would be interesting to activate chordotonal during exposure to cold to determine how a behavioral response to cold is affected by the activation of chordotonal sensory neurons.

      Modulating both CIII md and Ch activation to assess the contribution of each sensory neuron type to thermosensation would certainly provide unique insights. However, we believe that such analyses are beyond the scope of the current manuscript and better suited to future follow-up studies.

      (2) Throughout the paper the co-activation experiments investigate whether co-activating the different candidate neurons and md III neurons facilitates the md III-induced CT response. However, the cold noxious stimuli will presumably activate different neurons downstream than optogenetic activation of MdIII and thus can reveal more accurately the role of the different candidate neurons in facilitating cold nociception.

      We agree that CIII md neuron activation of the downstream circuitry would differ from cold-evoked activation of neurons downstream of primary sensory neurons. We believe that our current findings lay the foundation for future work to evaluate how multiple sensory neurons work in concert to generate stimulus-specific behavioral responses.

      (3) Use of blue lights in behavioral and imaging experiments

      Strong Blue and UV have been shown to activate MDIV neurons (Xiang, Y., Yuan, Q., Vogt, N. et al. Light-avoidance-mediating photoreceptors tile the Drosophila larval body wall. Nature 468, 921-926 (2010). https://doi.org/10.1038/nature09576) and some of the neurons tested receive input from MdIV.

      In their experiments, the authors used blue light to optogenetically activate CDIII neurons and then monitored calcium responses in Basin neurons, premotor neurons, and ascending neurons, and UV light is necessary for photoconversion in CaMPARI experiments. Therefore, some of the neurons monitored could be activated by blue light and not cdIII activation. Indeed, responses of Basin-4 neurons can be observed in the no ATR condition (Fig 3HI), as can quite strong responses of DnB neurons (Figure 6E). How do the authors discern that the effects they see on the different neurons are indeed due to cold nociception and not to the synergy of cold and blue light responses? This could especially be the case for DnB, which could have a role in facilitating the response to cold in a multisensory context (where mdIV are activated by light).

      In addition, the silencing of DNB neurons during cold stimulation does not seem to give very robust phenotypes (no significant CT decrease compared to empty GAL4 control).

      It would be important to show, for example, that even in the absence of blue light the DNB facilitates the mdIII activation or cold-induced CT, by using red light and Chrimson or TrpA activation (for coactivation with md III).

      Alternatively, in some other cases, the phenotype upon co-activation could be inhibited by blue light (e.g. chair-1 (Figure 5 H-I)).

      More generally, given the multimodal nature of stimuli activating mdIV, MdIII (and Cho) and their shared downstream circuitry, it is important either to control for the use of blue light in these stimuli or to take the presence of the stimulus into account in interpreting the results, as the coactivation of, for example, Cho and mdIII using blue light could also activate mdIV (and downstream neurons) and alter the state of the network, which could inhibit the md III-induced CT responses.

      Assessing the differences in behavioral phenotypes in the different conditions could give an idea of the influence of combining different modalities in these assays. For example, did the authors observe any other behaviors upon co-activation of MDIII and Cho (at the expense of CT in the second part of the stimulation) or did the larvae resume crawling? Blue light typically induces reorientation behavior. What about when co-activating mdIII and Basin-4?

      Using Chrimson and red light or TrpA in some key experiments, e.g. with Cho, Basin-4, and DNB, would clarify the implication of these neurons in cold nociception.

      We agree that exposure to a bright light source results in avoidance behaviors in Drosophila larvae, which is primarily mediated by CIV md neurons. However, the light intensities used in our assays are much milder than those required to activate sensory neurons. Specifically, based on Xiang et al., 470 nm light does not evoke any electrical response at the lowest tested light intensity (0.74 mW mm<sup>-2</sup>), whereas the light intensity used in our behavioral experiments was much lower, at 0.15 mW mm<sup>-2</sup>. Additionally, we assessed larval mobility and turning for control conditions ±ATR and also for sensory neuron activation. As expected, there is an increase in larval immobility upon CIII md neuron activation (Author response image 1). Only activation of CIV md neurons resulted in light-evoked turning, whereas the remaining conditions did not show a stimulus-time-locked turning response (Author response image 1). Furthermore, we tested whether the intensity of 470 nm light used in our behavior experiments was enough to result in a light-evoked Ca<sup>2+</sup> response in CIII md and CIV md neurons. We expressed RCaMP in sensory neurons using a pan-neural driver (GMR51C10<sup>GAL4</sup>). There was no detectable increase in light-evoked Ca<sup>2+</sup> response in either CIII md or CIV md neurons (Author response image 1).

      Furthermore, we also tested multiple optogenetic actuators (ChR2, ChR2-H134R, and CsChrimson) and two CIII md driver lines (19-12<sup>Gal4</sup> and R83B04<sup>Gal4</sup>). Regardless of the optogenetic actuator or the wavelength of light used, we observe light-evoked CT responses (Figure 1– Figure supplement 6). We found that using CsChrimson raises several procedural challenges with our current experimental setup. In our hands, CsChrimson showed extreme sensitivity to any amount of ambient white light, whereas others have used infrared imaging to counteract ambient light sensitivity. Our imaging setup is equipped for visible spectrum imaging and cannot be retrofitted to record infrared light sources. Thus, we have limited the use of CsChrimson to optogenetic-Ca<sup>2+</sup> imaging experiments, where we are not recording larval behavior.

      The use of TrpA1 would require heat stimulation for activating the channels, which in turn would impact downstream circuit neurons that are shared amongst sensory neurons.

      For CaMPARI experiments, the PC light was delivered using a custom filter cube similar to the one used in the original CaMPARI paper (Fosque et al., 2015). This filter cube delivers 440 nm light as the PC light. PC light exposure in the absence of cold stimulus does not result in differential CaMPARI conversion between CIII md and CIV md (F<sub>red/green</sub> = 0.086 and 0.097, respectively). For the same condition, Ch neurons have high CaMPARI conversion, which is expected as they function in proprioception. Therefore, the chances of downstream neurons being solely activated by PC light remain low. The differential baseline CaMPARI F<sub>red/green</sub> ratios of individual circuit neurons could be a result of varying resting state cytosolic Ca<sup>2+</sup> concentrations.

      Lastly, for optogenetic-GCaMP experiments, we use CIII md>CsChrimson and Basin-2/-4 or DnB>GCaMP to visualize CIII md-evoked Ca<sup>2+</sup> responses in downstream neurons. Xiang et al. reported that confocal laser excitation for GCaMP does not activate CIV md neurons, which is consistent with what we have observed as well.

      Author response image 1.

      (A) For optogenetic experiments, percent turning was assessed in control conditions and upon sensory neuron activation. Only CIV md neuron activation results in an increase in bending response. Other conditions do not show blue light-evoked turning. (A’) We assessed larval turning based on ellipse fitting using FIJI; the aspect ratio of the radii is indicative of larval bending state. We empirically determined that a radii ratio of <2.5 represents larval turning/bending. This method of ellipse fitting has previously been used to identify C. elegans postures using WrMTrck in FIJI (Nussbaum-Krammer et al., 2015). (B) Percent immobility for all control conditions plus sensory activation driver lines. Only CIII md neuron activation leads to a sustained stimulus-locked increase in immobility. There are also no blue light-evoked reductions in immobility, indicating that there was no increase in larval movement due to blue light. (C) We assessed the responses of CIII md (ddaF) and CIV md (ddaC) neurons to blue light at a light intensity similar to that used in behavioral optogenetic experiments. There is no blue light-evoked increase in RCaMP fluorescence.
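
      The turning criterion in panel A’ (an ellipse radii ratio below 2.5) can be illustrated with a moment-based sketch. This is an approximation written for illustration, not the FIJI/WrMTrck procedure: it estimates the major-to-minor axis ratio from the second moments of the larva's pixel coordinates, and the function names are hypothetical. Only the 2.5 cutoff comes from the legend.

```python
import math


def aspect_ratio(pixels):
    """Major/minor axis ratio of the ellipse matching the shape's second moments.

    `pixels` is a list of (x, y) coordinates belonging to the larva's
    silhouette in a thresholded frame.
    """
    n = len(pixels)
    mean_x = sum(x for x, _ in pixels) / n
    mean_y = sum(y for _, y in pixels) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pixels) / n
    syy = sum((y - mean_y) ** 2 for _, y in pixels) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pixels) / n
    # Eigenvalues of the 2x2 covariance matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    major, minor = half_trace + root, half_trace - root
    if minor <= 0:
        return math.inf  # degenerate, perfectly straight line of pixels
    return math.sqrt(major / minor)


def is_turning(pixels, max_ratio=2.5):
    """Classify a frame as turning/bending when the ratio is below the cutoff."""
    return aspect_ratio(pixels) < max_ratio
```

      An elongated silhouette (a straight larva) yields a large ratio, while a curled or bent silhouette is more compact and yields a ratio closer to 1, which is why a low ratio is read as bending.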

      (4) Basins

      - Page 17 line 442-3 "Neural silencing of all Basin (1-4) neurons, using two independent driver lines (R72F11<sup>GAL4</sup> and R57F07<sup>GAL4</sup>)".

      Did the authors check the expression profile of the R57F07 line that they use to probe "all basins"? The expression profile published previously (Ohyama et al, 2015, extended data) shows one basin neuron (identified as Basin-4) and some neurons in the brain lobes. Also, the split GAL4 that labels Basin-4 (SS00740) is the intersection between R72F11 and R57F07 neurons. Thus R57F07 likely labels Basin-4, and if that is the case, the data in Figure 2 (and supplement) and Figure 3 related to this driver line should be annotated as Basin-4, and the results and their interpretation modified to take into account the different phenotypes for all basins and Basin-4 neurons.

      Due to the non-specific nature of R57F07<sup>GAL4</sup> in labeling Basin-4 and additional neuron types, we have decided to remove the driver line from our current analysis. We would need to perform further independent investigations to identify the other cell types and validate their role in cold nociception.

      Page 19 l. 521-525 I am confused by these sentences as the authors claim that Basin-4 showed reduced Calcium responses upon repetitive activation of CDIII md neurons but then they say they exhibit sensitization. Looking at the plots in FIG 3 F-I the Basin-4 responses upon repeated activation seem indeed to decrease on the second repetition compared to the first. What is the sensitization the authors refer to?

      We have rephrased this section.

      Page 47: In this section of the discussion, the authors propose an interesting hypothesis that the Basin-1 neuron could modulate the gain of behavioral responses. While this is an interesting idea, I wonder what would be the explanation for the finding that co-activation of Cho and MDIII does not facilitate cold nociceptive responses. Would activation of Basin-1 facilitate the cold response in different contexts (in addition to Cho-mediated stimuli)?

      Page 48 Thus the implication of the inhibitory network in cold processing should be better contextualized.

      The authors explain the difference in the lower basin-2 Ca- response to Cold/ mdIII activation (compared to Basin-4) despite stronger connectivity, due a stronger inputs from inhibitory neurons to Basin-2 (compared to Basin-4). The previously described inhibitory neurons that synapse onto Basin-2 receive rather a small fraction of inputs from the class III sensory neurons. The differences in response to cold could be potentially assigned to the activation of the inhibitory neurons by the cold-sensing cho- neurons. However, that cannot explain the differences in responses induced by class III neurons. Do the authors refer to additional inhibitory neurons that would receive significant input from MdIII?

      Alternative explanations could exist for this difference in activation: electrical synapses from mdIII onto Basin-4, and by stronger inputs from mdIV (compared to Basin-2 in the case of responses to Cold stimulus (Cold induces responses in md IV sensory neurons). Different subtypes of CD III may differentially respond to cold and the cold-sensing ones could synapse preferentially on basin-4 etc.

      A possible explanation for lack of CT facilitation when Ch and CIII md neurons are both activated are likely the competing sensory inputs going into Basins and yet unknown role of the inhibitory network between sensory and Basin neurons in cold nociception (Jovanic et al., 2016). Mechanical activation of Ch leads to several behavioral responses (hunch, back-up, pause, crawl, and/or bend) and transition between behaviors (Kernan et al., 1994; Tsubouchi et al., 2012; Zhang et al., 2015; Turner et al., 2016, 2018; Jovanic et al., 2019; Masson et al., 2020).

      Meanwhile, primary CIII md-/cold-evoked is CT (Turner et al., 2016, 2018, Patel et al., 2022, Himmel et al., 2023). Certain touch- versus cold- evoked behaviors are mutually exclusive, where co-activation of Ch and CIII md likely leads to competing neural impulses leading to lack of any single behavioral enhancement. Furthermore, the mini circuit motif between Ch and Basins consisting of feedforward, feedback and lateral inhibitory neurons that play a role in behavioral selection and transitions might impact the overall output of Basin neurons. Upon Ch and CIII md neuron co-activation, the cumulative Basin neuronal output may be biased towards increased behavioral transitions instead of sustained singular behavior response.

      We posited one possible mechanism explaining the differences between cold- or CIII md-evoked Ca<sup>2+</sup> responses in Basin 2 and 4 neurons: the differences in evoked Ca<sup>2+</sup> responses may arise from differential connectivity of TePns and inhibitory network neurons to Basin 2 and/or 4. Furthermore, ascending A00c neurons are connected to a descending feedback SEZ neuron, SeIN128, which has connectivity to Basins (1-3, strongest with Basin 2), A02o, DnB, Chair-1 and A02m/n (Ohyama et al., 2015; Zhu et al., 2024). However, how the 5 different subtypes of CIII md neurons respond to cold is unknown. Electrical recordings of the dorsal CIII md neurons revealed that, within and between neuron subtypes, there is variability in the temperature sensitivity of individual neurons, where population coding results in a fine-tuned central temperature representation (Maksymchuk et al., 2022). Evaluating how individual CIII md subtypes contribute to Basin activation could reveal important insights into the precise relationship between CIII md and multisensory integration Basin neurons. However, as yet there are no known CIII md neuron driver lines that mark a subset of CIII md neurons, which limits further clarification of how primary sensory information is transduced to integration neurons.

      (5) A00c

      Page 26 Figure 4F-I line While Goro may not be involved in cold nociception the A00c (and A05q) seems to be.

      A00c could convey information to other neurons other than Goro and thus be part of a pathway for cold-induced CT.

      A deeper look into A00c connectivity reveals that there is a reciprocal relationship between A00c and the SEZ descending neuron, SeIN128 (Ohyama et al., 2015; Zhu et al., 2024). Additionally, this feedback SEZ descending neuron synapses onto A02o, A05q, Basins (highest connectivity to Basin 2 and weak connectivity to Basin 1 & 3), and select premotor neurons (Chair-1, DnB, and A02m/n) (Ohyama et al., 2015; Zhu et al., 2024). Interestingly, this SEZ feedback neuron likely plays a role in the observed cold-/CIII md neuron-evoked differential calcium activity and behavioral requirement amongst Basin-2 and -4 in cold nociception. We have added this to our discussion section.

      (6) Page 31 766-768 the conclusion that "premotor function is required for and can facilitate cold nociception" seems odd to stress as one would assume that some premotor neurons would be involved in controlling the behavioral responses to a stimulus. It would be more pertinent in the summary to specify which premotor neurons are involved and what is their function

      We have updated the section regarding premotor neurons’ role in cold nociception and now there’s a more specific concluding statement.

      (7) There are several Split GAL4 lines used in the study (with transgenes inserted in the attP40 and attP2 sites). A recent study points to a mutation related to attP40 that can have an effect on muscle function: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9750024/. The controls used in behavioral experiments do not contain the attP40 site. It would be important to check a control genotype bearing an attP40 site, characterize the different parameters of the CT behavior to cold, and take this into account in interpreting the results of the experiments using the SplitGAL4 lines.

      We have performed control experiments using larvae bearing empty attP40;attP2 sites in our neural silencing experiments. The observed muscle phenotypes were present in larvae bearing homozygous copies of attP40 (attP40/attP40) (van der Graaf et al., 2022). However, in our experiments, none of the larvae that we tested behaviorally had homozygous attP40;attP2 insertions. We have updated Table 1 to now include insertion sites.

      Reviewer #3 (Public Review):

      Summary:

      The authors follow up on prior studies where they have argued for the existence of cold nociception in Drosophila larvae. In the proposed pathway, mechanosensitive Class III multidendritic neurons are the noxious cold responding sensory cells. The current study attempts to explore the potential roles of second and third order neurons, based on information of the Class III neuron synaptic outputs that have been obtained from the larval connectome.

      Strengths:

      The major strength of the manuscript is the detailed discussion of the second and third order neurons that are downstream of the mechanosensory Class III multidendritic neurons. These will be useful in further studies of gentle touch mechanosensation and mechanonociception both of which rely on sensory input from these cells. Calcium imaging experiments on Class III activation with optogenetics support the wiring diagram.

      Weaknesses:

      The scientific premise is that a full body contraction in larvae that are exposed to noxious cold is a sensorimotor behavioral pathway. This premise is, to start with, questionable. A common definition of behavior is a set of "orderly movements with recognizable and repeatable patterns of activity produced by members of a species (Baker et al., 2001)." In the case of nociception behaviors, the patterns of movement are typically thought to play a protective role and to protect from potential tissue damage.

      Does noxious cold elicit a set of orderly movements with a recognizable and repeatable pattern in larvae? Can the patterns of movement that are stimulated by noxious cold allow the larvae to escape harm? Based on the available evidence, the answer to both questions is seemingly no. In response to noxious cold stimulation many, if not all, of the muscles in the larva, simultaneously contract (Turner et al., 2016), and as a result the larva becomes stationary. In response to cold, the larva is literally "frozen" in place and it is incapable of moving away. This incapacitation by cold is the antithesis of what one might expect from a behavior that protects the animals from harm.

      Extensive literature has investigated the physiological responses of insects to cold (reviewed in Overgaard and MacMillan, 2017). In numerous studies of insects across many genera (excluding cold adapted insects such as snow flies), exposure to very cold temperatures quickly incapacitates the animal and induces a state that is known as a chill coma. During a chill coma, the insect becomes immobilized by the cold exposure, but if the exposure to cold is very brief the insect can often be revived without apparent damage. Indeed, it is common practice for many laboratories that use adult Drosophila for studies of behavior to use a brief chilling on ice as a form of anesthesia because chilling is less disruptive to subsequent behaviors than the more commonly used carbon dioxide anesthesia. If flies were to perceive cold as a noxious nociceptive stimulus, then this "chill coma" procedure would likely be disruptive to behavioral studies but is not. Furthermore, there is no evidence to suggest that larval sensation of "noxious cold" is aversive.

      The insect chill coma literature has investigated the effects of extreme cold on the physiology of nerves and muscles and the consensus view of the field is that the paralysis that results from cold is due to complex and combined action of direct effects of cold on muscle and on nerves (Overgaard and MacMillan, 2017). Electrophysiological measurements of muscles and neurons find that they are initially depolarized by cold, and after prolonged cold exposure they are unable to maintain potassium homeostasis and this eventually inhibits the firing of action potentials (Overgaard and MacMillan, 2017). The very small thermal capacitance of a Drosophila larva means that its entire neuromuscular system will be quickly exposed to the effect of cold in the behavioral assays under consideration here. It would seem impossible to disentangle the emergent properties of a complex combination of effects on physiology (including neuronal, glial, and muscle homeostasis) on any proposed sensorimotor transformation pathway.

      Nevertheless, the manuscript before us makes a courageous attempt at attempting this. A number of GAL4 drivers tested in the paper are found to affect parameters of contraction behavior (CT) in cold exposed larvae in silencing experiments. However, notably absent from all of the silencing experiments are measurements of larval mobility following cold exposure. Thus, it is not known from the study if these manipulations are truly protecting the larvae from paralysis following cold exposure, or if they are simply reducing the magnitude of the initial muscle contraction that occurs immediately following cold (ie reducing CT). The strongest effect of silencing occurs with the 19-12-GAL4 driver which targets Class III neurons (but is not completely specific to these cells).

      Optogenetic experiments for Class III neurons relying on the 19-12-GAL4 driver combined with a very strong optogenetic actuator (ChETA) show the CT behavior that was reported in prior studies. It should be noted that this actuator drives very strong activation, and other studies with milder optogenetic stimulation of Class III neurons have shown that these cells produce behavioral responses that resemble gentle touch responses (Tsubouchi et al 2012 and Yan et al 2013). As well, these neurons express mechanoreceptor ion channels such as NompC and Rpk that are required for gentle touch responses. The latter makes the reported Calcium responses to cold difficult to interpret in light of the fact that the strong muscle contractions driven by cold may actually be driving mechanosensory responses in these cells (ie through deformation of the mechanosensitive dendrites). Are the cIII calcium signals still observed in a preparation where cold induced muscle contractions are prevented?

      A major weakness of the study is that none of the second or third order neurons (that are downstream of CIII neurons) are found to trigger the CT behavioral responses even when strongly activated with the ChETA actuator (Figure 2 Supplement 2). These findings raise major concerns for this and prior studies and it does not support the hypothesis that the CIII neurons drive the CT behaviors.

      Later experiments in the paper that investigate strong CIII activation (with ChETA) in combination with other second and third order neurons do support the idea that activating those neurons can facilitate body-wide muscle contractions. But many of the co-activated cells in question are either repeated in each abdominal neuromere or they project to cells that are found all along the ventral nerve cord, so it is therefore unsurprising that their activation would contribute to what appears to be a non-specific body-wide activation of muscles along the AP axis. Also, if these neurons are already downstream of the CIII neurons the logic of this coactivation approach is not particularly clear. A more convincing experiment would be to silence the different classes of cells in the context of the optogenetic activation of CIII neurons to test for a block of the effects, a set of experiments that is notably absent from the study.

      The authors' argument that the co-activation studies support "a population code" for cold nociception is a very optimistic interpretation of a brute force optogenetics approach that ultimately results in an enhancement of a relatively non-specific body-wide muscle convulsion.

      We have responded extensively to reviewer 3’s comments in our provisional response to address the critiques regarding conceptual merit of this paper.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Review:

      We would like to thank the reviewers for providing constructive feedback on the manuscript. To address their concerns, we have performed additional experiments, analyzed the new data, and revised the manuscript.

      (1) The utility of a pipeline depends on the generalization properties.

      While the proposed pipeline seems to work for the data the authors acquired, it is unclear if this pipeline will actually generalize to novel data sets possibly recorded by a different microscope (e.g. different brand), or different imaging conditions (e.g. illumination or different imaging artifacts) or even to different brain regions or animal species, etc.

      The authors provide a 'black-box' approach that might work well for their particular data sets and image acquisition settings but it is left unclear how this pipeline is actually widely applicable to other conditions as such data is not provided.

      In my experience, without well-defined image pre-processing steps and without training on a wide range of image conditions pipelines typically require significant retraining, which in turn requires generating sufficient amounts of training data, partly defying the purpose of the pipeline.

      It is unclear from the manuscript, how well this pipeline will perform on novel data possibly recorded by a different lab or with a different microscope.

      To address the generalizability of our DL segmentation model, we performed several validation experiments, deploying our model on out-of-distribution data that 1) had distinct channels, 2) were acquired in a different species (rat) with a different vascular fluorescent label and a different imaging protocol, and 3) were acquired on a different microscope and with a different vascular label. We first used our model to segment images (507x507 um lateral FOV, 170-250 um axial range) from three C57BL/6 mice imaged on the same two-photon fluorescence microscope following the same imaging protocol. The vasculature was labelled by intravenous injection of Texas Red dextran (70 kDa MW, Thermo Fisher Scientific Inc, Waltham MA), as in the current experiment. In lieu of the EYFP signal from pyramidal neurons that was present in the original data, we added Gaussian noise with a mean and standard deviation identical to those of the acquired vascular channel in the out-of-distribution dataset. Second, we applied our model to images (507x507 um lateral FOV, 300-400 um axial range) from two Fischer rats that were injected with 2000-kDa Alexa680-dextran via a tail vein catheter. These rats were imaged on the same two-photon fluorescence microscope, but with Galvano scanners (instead of resonant scanners). As before, a second channel of Gaussian noise was added to simulate the missing EYFP signal. Finally, we segmented an image of vasculature from an ex-vivo cleared mouse brain (1665x1205x780 um) acquired on a light sheet fluorescence microscope (Miltenyi UltraMicroscope Blaze), with Lectin-DyLight 649 labelling the vessel walls. The Dice Score, Precision, Recall, Hausdorff 95%, and Mean surface distance were reported for segmentations of the 2PFM data sets, following the generation of ground truth images by assisted manual segmentation in ilastik. Examples of the generated segmentation masks are presented in Supplementary Figure 9 for visual comparison.

      We have described the image pre-processing steps/transforms applied before model inference in the revised Methods section. In general, should the segmentation results on a data set be deemed unsatisfactory, our model can be further fine-tuned on out-of-distribution data. Furthermore, the image analyses downstream from segmentation are applicable irrespective of the method utilized to arrive at a robust vascular segmentation.
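
      The overlap metrics and the noise substitution described above can be illustrated with a minimal sketch. It assumes binary masks flattened to sequences of 0/1 and a vascular channel given as a flat list of intensities; the function names `overlap_metrics` and `simulated_second_channel` are hypothetical and this is not the evaluation code behind Author response table 1.

```python
import random
import statistics


def overlap_metrics(pred, truth):
    """Dice, precision and recall for two equally sized binary masks (flat 0/1 sequences)."""
    if len(pred) != len(truth):
        raise ValueError("masks must contain the same number of voxels")
    tp = sum(1 for p, t in zip(pred, truth) if p and t)      # vessel in both
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)  # predicted only
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)  # ground truth only
    # Convention (an assumption): two empty masks agree perfectly.
    dice = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return dice, precision, recall


def simulated_second_channel(vascular, seed=0):
    """Gaussian noise with the mean and SD of the vascular channel, standing in for a missing channel."""
    rng = random.Random(seed)
    mu = statistics.fmean(vascular)
    sigma = statistics.pstdev(vascular)
    return [rng.gauss(mu, sigma) for _ in vascular]
```

      Hausdorff 95% and mean surface distance require surface extraction and are left out of this sketch.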

      Author response table 1.

      Dataset performance comparison for UNETR

      (2) Some of the chosen analysis results seem to not fully match the shown data, or the visualization of the data is hard to interpret in the current form.

      We have updated the visualizations to make them more accessible and ensure close correspondence between tables and figures.

      (3) Additionally, some measures seem not fully adapted to the current situation (e.g. the efficiency measure does not consider possible sources or sinks). Thus, some additional analysis work might be required to account for this.

      Thank you for your comment. The efficiency metric was selected as it does not consider sources or sinks. We do agree that accounting for vessel subtypes in the analysis (thus classifying larger vessels as either suppliers/sources or drainers/sinks) would be very useful: notwithstanding, this classification is extremely laborious, as we have noted in our prior work<sup>1</sup>. We are therefore leveraging machine learning in a parallel project to afford vessel classification by type. Notwithstanding, the source/sink analysis based on in vivo 2PFM data is confounded by the small FOV.
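
      To make concrete why an efficiency measure of this kind is blind to sources and sinks, here is a minimal sketch. It assumes the standard unweighted global efficiency (the mean inverse shortest-path length over all ordered node pairs); the manuscript's exact definition and any edge weighting may differ, and `global_efficiency` is a hypothetical name.

```python
from collections import deque


def global_efficiency(adjacency):
    """Mean of 1/d(u, v) over ordered node pairs of an unweighted, undirected graph.

    adjacency: dict mapping each node to an iterable of its neighbours.
    Unreachable pairs contribute 0. No node is treated as a source or a sink.
    """
    nodes = list(adjacency)
    n = len(nodes)
    if n < 2:
        return 0.0
    total = 0.0
    for source in nodes:
        # Breadth-first search gives hop-count distances from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(1.0 / d for node, d in dist.items() if node != source)
    return total / (n * (n - 1))
```

      The value depends only on path lengths between branch points, so flow direction never enters the calculation.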

      (4) The authors apply their method to in vivo data. However, there are some weaknesses in the design that make it hard to accept many of the conclusions and even to see that the method could yield much useful data with this type of application. Primarily, the acquisition of a large volume of tissue is very slow. In order to obtain a network of vascular activity, large volumes are imaged with high resolution. However, the volumes are scanned once every 42 seconds following stimulation. Most vascular responses to neuronal activation have come and gone in 42 seconds so each vessel segment is only being sampled at a single time point in the vascular response. So all of the data on diameter changes are impossible to compare since some vessels are sampled during the initial phase of the vascular response, some during the decay, and many probably after it has already returned to baseline. The authors attempt to overcome this by alternating the direction of the scan (from surface to deep and vice versa). But this only provides two sample points along the vascular response curve and so the problem still remains.

      We thank the Reviewer for bringing up this important point. Although vessels can show relatively rapid responses to perturbation, vascular responses to photostimulation of Channelrhodopsin-2 in neighbouring neurons are long-lasting: they do not come and go in 42 seconds. To demonstrate this point, we acquired higher temporal-resolution images of smaller volumes of tissue over 5 minutes preceding and 5 minutes following the 5-s photoactivation with the original photostimulation parameters. The imaging protocol was different in that we utilized a piezoelectric motor, a smaller field of view (512um x (80-128)um x (34-73)um), and only 3x frame averaging, resulting in a temporal resolution of 1.57-3.17 seconds per frame. This acquisition was repeated at different cortical depths in three Thy1-ChR2 mice and the vascular radii were estimated using our presented pipeline. Significantly responding vessels here were selected via an F-test of radius estimates before vs. after stimulation. LOESS fits to the time-dependent radius of significantly responding vessels are shown in Supplementary Figure 5. Vessels shorter than 20 um in length were excluded from the analysis so as to focus on vessel segments where averaging the vascular radius over many vertices was possible. A video of one of the acquisitions is shown along with the timecourses of select vessels’ calibre changes in Author response image 1. The vascular calibre changes following photostimulation persisted for several minutes, consistent with earlier observations by us and others<sup>2–5</sup>. These small-volume acquisitions demonstrated that dilations consistently lasted longer than 42 seconds (i.e. our original temporal resolution).

      Our temporal sampling was chosen to permit a large field of view acquisition while still being well within the span of the vascular response to look at larger scale vascular coordination that has not previously been studied. The pipeline readily adapts to smaller fields of view at a finer temporal sampling, though such an acquisition precludes the study of the response coordination across hundreds of vessels. While a greater number of baseline frames would help with the baseline variability estimation, maintaining animals under anesthesia during prolonged imaging is exceedingly difficult, precluding us from extending our total acquisition time.

      Author response image 1.

      Estimated vascular radius at each timepoint for select vessels from the imaging stack shown in the following video: https://flip.com/s/kB1eTwYzwMJE

      (5) A second problem is the use of optogenetic stimulation to activate the tissue. First, it has been shown that blue light itself can increase blood flow (Rungta et al 2017). The authors note the concern about temperature increases but that is not the same issue. The discussion mentions that non-transgenic mice were used to control for this with "data not shown". This is very important data given these earlier reports that have found such effects and so should be included.

      We have updated the manuscript to incorporate the data on volumetric scanning in (nontransgenic) C57BL/6 mice undergoing blue light stimulation, with identical parameters as those used in Thy1-ChR2 mice (Supplementary Figure 8). As before, responders were identified as vessels that, following blue light stimulation, showed a radius change greater than twice the standard deviation of their baseline radius: their estimated radii changes are shown in Supplementary Figure 8. There was no statistical difference between the radii distributions of any of the photostimulation conditions and the pre-photostimulation baseline.
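
      The responder criterion can be sketched as follows. This is an illustration only: it assumes the change is measured against the mean of the baseline frames and that the sample standard deviation is used, neither of which is specified above, and `classify_response` is a hypothetical name.

```python
import statistics


def classify_response(baseline_radii, post_radius, n_sd=2.0):
    """Label one vessel by comparing its post-stimulation radius to its baseline.

    baseline_radii: radius estimates (um) across baseline frames (at least two).
    post_radius: radius estimate (um) after photostimulation.
    Returns (label, change), where change = post_radius - baseline mean.
    """
    base_mean = statistics.fmean(baseline_radii)
    base_sd = statistics.stdev(baseline_radii)  # sample SD (assumption)
    change = post_radius - base_mean
    if abs(change) <= n_sd * base_sd:
        return "non-responder", change
    return ("dilator" if change > 0 else "constrictor"), change
```

      With baseline radii of 2.0, 2.1, 1.9 and 2.0 um, the threshold is about 0.16 um, so a post-stimulation radius of 2.5 um counts as a dilation while 2.1 um does not.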

      (6) Secondly, there doesn't seem to be any monitoring of neural activity following the photo-stimulation. The authors repeatedly mention "activated" neurons and claim that vessel properties change based on distance from "activated" neurons. But I can't find anything to suggest that they know which neurons were active versus just labeled. Third, the stimulation laser is focused at a single depth plane. Since it is single-photon excitation, there is likely a large volume of activated neurons. But there is no way of knowing the spatial arrangement of neural activity and so again, including this as a factor in the analysis of vascular responses seems unjustified.

      Given the high fidelity of Channelrhodopsin-2 activation with blue light photostimulation found by us and others<sup>3</sup>, we assume that all labeled neurons within the volume of photostimulation are being activated. Depending on their respective connectivities, their postsynaptic neurons (whether or not they are labeled) may also get activated. We therefore agree with the reviewer that the spatial distribution of neuronal activation is not well defined. The manuscript has been revised to update the terminology from activated to labeled neurons and to stress in the Discussion that the motivation for assessing the distance to the closest labeled neuron as one of our metrics is purely to demonstrate the possibility of linking vascular responses to activations in their neighbouring neurons and of including morphological metrics in the computational pipeline.

      (7) The study could also benefit from more clear illustration of the quality of the model's output. It is hard to tell from static images of 3-D volumes how accurate the vessel segmentation is. Perhaps some videos going through the volume with the masks overlaid would provide some clarity. Also, a comparison to commercial vessel segmentation programs would be useful in addition to benchmarking to the ground truth manual data.

      We generated a video demonstrating the deep-learning model outputs and have made the video available here: https://flip.com/s/_XBs4yVxisNs. We aimed to develop an open-source method for the research community as the vast majority of groups do not have access to commercial software for vessel segmentation.

      (8) Another useful metric for the model's success would be the reproducibility of the vessel responses. Seeing such a large number of vessels showing constrictions raises some flags and so showing that the model pulled out the same response from the same vessels across multiple repetitions would make such data easier to accept.

      We have generated a figure demonstrating the repeatability of the vascular responses following photostimulation in a volume and presented them next to the corresponding raw acquisitions for visual inspection (Supplementary Figure 6). It is important to note that there is significant biological variability in vessels’ responses to repeated stimulation, as described previously<sup>3,6</sup>: a well-performing model should be able to quantify biological heterogeneity, as it may itself be of interest. Constrictions have been reported in the literature by our group and others<sup>1,2,4,5,7</sup>, though their prevalence has not been systematically studied to date. Concerning the reproducibility of our analysis, we have demonstrated model reproducibility (as a metric of its success) on a dataset where vessels visually appeared to dilate consistently following 452 nm light stimulation: these results are now presented in Supplementary Figure 6 of the revised Manuscript. We thus observed that the model repeatedly detected the vessels that appeared to dilate on visual inspection as dilating. Examples of vessels constricting repeatedly were also examined, and maximal intensity projections of the vessel before and after photostimulation were inspected, confirming their repeated constriction (Author response image 2).

      It is also worth noting that while the presence of the response (defined as change above 2 standard deviations of the radius across baseline frames) was infrequent (2107 vessels responded at least once, out of a total of 10,552 unique vessels imaged), the direction of the response was highly consistent across trials. Given twice the baseline variability as the threshold for response, of the vessels that responded more than once, 31.7% dilated on some trials while constricting on others; 41.1% dilated on each trial; and 27.2% constricted on each trial. (Note that some trials use 1.1 vs. 4.3 mW/mm2 and some have opposite scanning directions).
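
      The tally of response direction across trials can be sketched as below. The input layout (a mapping from vessel identifier to the signed radius changes on the trials where that vessel responded) and the name `direction_consistency` are assumptions made for illustration, not the analysis code itself.

```python
def direction_consistency(responses_by_vessel):
    """Percentage of repeat responders that always dilated, always constricted, or did both.

    responses_by_vessel: dict mapping a vessel id to a list of signed radius
    changes (um), one per trial on which the vessel crossed the response threshold.
    Vessels with fewer than two responses are skipped.
    """
    counts = {"always dilated": 0, "always constricted": 0, "mixed": 0}
    for changes in responses_by_vessel.values():
        if len(changes) < 2:
            continue  # direction consistency needs at least two responses
        if all(c > 0 for c in changes):
            counts["always dilated"] += 1
        elif all(c < 0 for c in changes):
            counts["always constricted"] += 1
        else:
            counts["mixed"] += 1
    total = sum(counts.values())
    if total == 0:
        return {label: 0.0 for label in counts}
    return {label: 100.0 * n / total for label, n in counts.items()}
```

      The three percentages sum to 100 by construction, as do the 41.1%, 27.2% and 31.7% reported above.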

      Author response image 2.

      Sample capillary constrictions from maximum intensity projections at repeated time points following optogenetic stimulation. The baseline (pre-stimulation) image is shown on the left and the post-stimulation image is on the right, with the estimated radius changes listed to the left.

      (9) A number of findings are questionable, at least in part due to these design properties. There are unrealistically large dilations and constrictions indicated. These are likely due to artifacts of the automated platform. Inspection of these results by eye would help understand what is going on.

      Some of the dilations were indeed large in magnitude. We present select examples of large dilations and constrictions ranging in magnitude from 2.08 to 10.80 um for visual inspection (Author response image 3) (for reference, the average magnitude of radius changes across vessels and stimuli was 0.32 +/- 0.54 um). Diameter changes above 5 um were visually inspected.

      Author response image 3.

      Additional views of diameter change in maximum intensity projections ranging in magnitude from 2.08 um to 10.80 um.

      (10) In Figure 6, there doesn't seem to be much correlation between vessels with large baseline level changes and vessels with large stimulus-evoked changes. It would be expected that large arteries would have a lot of variability in both conditions and veins much less. There is also not much within-vessel consistency. For instance, the third row shows what looks like a surface vessel constricting to stimulation but a branch coming off of it dilating - this seems biologically unrealistic.

      We now plot photostimulation-elicited vessel-wise radius changes vs. their corresponding baseline radius standard deviations (Author response image 4). The Pearson correlation coefficient between the baseline standard deviation and the radius change was 0.08 (p<1e-5) for  552nm 4.3 mW/mm^2 stimulation,  -0.08 (p<1e-5) for  458nm 1.1 mW/mm^2 stimulation, and -0.04 (p<1e-5) for  458nm 4.3 mW/mm^2 stimulation. For non-control (i.e. blue) photostimulation conditions, the change in the radius is thus negatively correlated to the vessel’s baseline radius standard deviation: this small negative correlation indicates that there is little correlation between vessel radius change and the baseline variability in the vessel radius. Classification of vessels by type (arteries vs. veins) is needed before we can comment on differences between these vascular components. The between-vessel (i.e. between parent vessels and their daughter branches separated by branch points) consistency is explicitly evaluated by the assortativity metric, in Figure 9: vessels do somewhat tend to react similarly to their downstream branches: we observed a mean assortativity of 0.4. As for the instance of a surface vessel constricting while a downstream vessel dilates, it is important to remember that the 2PFM FOV restricts us to imaging a very small portion of the cortical microvascular network: one (among many) daughter vessels showing changes in the opposite direction to the parent vessel is not violating the conservation of mass; in addition, mural cells on adjacent branches can respond differently.
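
      Both statistics used in this response can be written compactly. The sketch below assumes that assortativity is computed as the Pearson correlation of the response values at the two ends of every edge, counting each undirected edge in both directions (the usual numeric assortativity for undirected graphs); the manuscript's implementation may differ, and the names `pearson` and `response_assortativity` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def response_assortativity(edges, response):
    """Correlation of radius changes between vessels that share a branch point.

    edges: iterable of (vessel_u, vessel_v) pairs of adjacent vessels.
    response: dict mapping each vessel to its radius change (um).
    """
    xs, ys = [], []
    for u, v in edges:
        # Count each undirected edge in both directions to keep the measure symmetric.
        xs += [response[u], response[v]]
        ys += [response[v], response[u]]
    return pearson(xs, ys)
```

      A value of 1 means adjacent vessels always respond identically and a value near 0 means a neighbour's response carries no information; `pearson` can equally be applied to the baseline standard deviations and the radius changes.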

      Author response image 4.

      Vessel radius change elicited by photostimulation vs. baseline radius standard deviation across all vessels. The threshold level for response identification is shown as the black line.

      (11) As mentioned, the large proportion of constricting capillaries is not something found in the literature. Do these happen at a certain time point following the stimulation? Did the same vessel segments show dilation at times and constriction at other times? In fact, the overall proportion of dilators and constrictors is not given. Are they spatially clustered? The assortativity result implies that there is some clustering, and the theory of blood stealing by active tissue from inactive tissue is cited. However, this theory would imply a region where virtually all vessels are dilating and another region away from the active tissue with constrictions. Was anything that dramatic seen?

      The kinetics of the vascular responses are not accessible via the current imaging protocol and acquired data; however, this computational pipeline can readily be adapted to test hypotheses surrounding the temporal evolution of the vascular responses, as shown in Supplementary Figure 2 (with higher temporal-resolution data). Some vessels dilate at some time points and constrict at others as shown in Supplementary Figure 2. As listed in Table 2, 4.4% of all vessels constrict and 7.5% dilate for 452nm stimulation at 4.3 mW/mm^2. There was no obvious spatial clustering of dilators or constrictors: we expect such spatial patterns to be more common with different modes of stimulation and/or in the presence of pathology. The assortativity peaked at 0.4 (quite far from 1 where each vessel’s response exactly matches that of its neighbour).

      (12) Why were nearly all vessels > 5um diameter not responding >2SD above baseline? Did they have highly variable baselines or small responses? Usually, bigger vessels respond strongly to local neural activity.

      In Author response image 5, we now present the stimulation-induced radius changes vs. baseline radius variability across vessels with a radius greater than 5 um. The Pearson correlation between the radius change and the baseline radius standard deviation across time was low: r=0.05 (p=0.5) for 552nm 4.3 mW/mm^2 stimulation, r=-0.27 (p<1e-5) for 458nm 1.1 mW/mm^2 stimulation, and r=-0.31 (p<1e-5) for 458nm 4.3 mW/mm^2 stimulation. These results demonstrate that the changes following optogenetic stimulation are lower than twice the baseline standard deviation across time for most of these vessels. The pulsatility of arteries results in significant variability in their baseline radius<sup>8</sup>; in turn, literature to date suggests very limited radius changes in veins. Both of these effects could contribute to the radius response not being detected in many larger vessels.

      Author response image 5.

      The change in the vessel radius elicited by photostimulation vs. baseline vessel radius standard deviation in vessels with a baseline radius greater than 5 um. The threshold level for response identification is shown as the black line.
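      The comparison summarized in Author response image 5 can be sketched as follows. This is a minimal sketch on made-up numbers, assuming each vessel has repeated baseline and post-stimulation radius estimates; the function name and array layout are hypothetical, and only the correlation coefficient is computed here (a p-value would need an additional statistics routine).

```python
import numpy as np

def compare_change_to_variability(baseline_radii, stim_radii, n_sd=2.0):
    """Relate stimulation-induced radius change to baseline variability.

    baseline_radii: array (n_vessels, n_baseline_scans) of radius estimates.
    stim_radii: array (n_vessels, n_stim_scans) of radius estimates.
    """
    baseline_radii = np.asarray(baseline_radii, dtype=float)
    stim_radii = np.asarray(stim_radii, dtype=float)
    # Baseline variability: standard deviation across repeated baseline scans.
    baseline_sd = baseline_radii.std(axis=1, ddof=1)
    radius_change = stim_radii.mean(axis=1) - baseline_radii.mean(axis=1)
    # Pearson correlation between the change and the baseline variability.
    pearson_r = float(np.corrcoef(radius_change, baseline_sd)[0, 1])
    # A vessel counts as responding if the change exceeds n_sd baseline SDs.
    exceeds_threshold = np.abs(radius_change) > n_sd * baseline_sd
    return radius_change, baseline_sd, pearson_r, exceeds_threshold

# Made-up radii (um) for three vessels.
baseline = [[5.0, 5.2, 4.8], [6.0, 6.1, 5.9], [7.0, 7.5, 6.5]]
stimulated = [[5.1, 5.1], [7.0, 7.0], [7.2, 7.2]]
change, sd, r, responds = compare_change_to_variability(baseline, stimulated)
```

      In this toy data only the second vessel exceeds the 2-SD threshold, because its baseline is stable and its change is large; the third vessel changes by 0.2 um but has a noisy baseline.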

      References

      (1) Mester JR, Rozak MW, Dorr A, Goubran M, Sled JG, Stefanovic B. Network response of brain microvasculature to neuronal stimulation. NeuroImage. 2024;287:120512. doi:10.1016/j.neuroimage.2024.120512

      (2) Alarcon-Martinez L, Villafranca-Baughman D, Quintero H, et al. Interpericyte tunnelling nanotubes regulate neurovascular coupling. Nature. 2020;585(7823):91-95. doi:10.1038/s41586-020-2589-x

      (3) Mester JR, Bazzigaluppi P, Weisspapir I, et al. In vivo neurovascular response to focused photoactivation of Channelrhodopsin-2. NeuroImage. 2019;192:135-144. doi:10.1016/j.neuroimage.2019.01.036

      (4) O’Herron PJ, Hartmann DA, Xie K, Kara P, Shih AY. 3D optogenetic control of arteriole diameter in vivo. Nelson MT, Calabrese RL, Nelson MT, Devor A, Rungta R, eds. eLife. 2022;11:e72802. doi:10.7554/eLife.72802

      (5) Hartmann DA, Berthiaume AA, Grant RI, et al. Brain capillary pericytes exert a substantial but slow influence on blood flow. Nat Neurosci. Published online February 18, 2021:1-13. doi:10.1038/s41593-020-00793-2

      (6) Mester JR, Bazzigaluppi P, Dorr A, et al. Attenuation of tonic inhibition prevents chronic neurovascular impairments in a Thy1-ChR2 mouse model of repeated, mild traumatic brain injury. Theranostics. 2021;11(16):7685-7699. doi:10.7150/thno.60190

      (7) Hall CN, Reynell C, Gesslein B, et al. Capillary pericytes regulate cerebral blood flow in health and disease. Nature. 2014;508(7494):55-60. doi:10.1038/nature13165

      (8) Meng G, Zhong J, Zhang Q, et al. Ultrafast two-photon fluorescence imaging of cerebral blood circulation in the mouse brain in vivo. Proc Natl Acad Sci U S A. 2022;119(23):e2117346119. doi:10.1073/pnas.2117346119

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Line 207: a superfluous '.' before the references.

      This has been corrected.

      Line 273 ff:

      While the metrics are described in mathematical terms which is very useful, the appearing distances (d) and mathematical symbols are not. While mostly intuitively clear, precise definitions of all symbols introduced should be given to avoid ambiguities.

      The description has been clarified.

      This applies to all formulas appearing in the manuscript and the authors might want to check them carefully.

      We have updated them wherever needed.

      The mean surface distance seems not to reflect the mean MINIMAL surface distance but just the overall mean surface distance. Or a different definition of the appearing symbols is used, highlighting the need for introducing every mathematical symbol carefully.

      The definitions have been updated for clarity, specifying the distinction between Hausdorff 95% distance and mean surface distance.

      Line 284:

      It is unclear to me why center-line detection was performed in MATLAB and not Python. Using multiple languages/software packages and in addition relying on one that is not freely available/open source makes this tool much less attractive as a real open-source tool for the community. The authors stress in the manuscript abstract that their pipeline is an open and accessible tool, the use of MATLAB defies this logic to some extent in my view.

      Centerline detection for large volumetric data is available in Python, see e.g. Scipy packages as well for large data sets via ClearMap or VesselVio.

      We tested centerline detection in both Python (scipy 1.9.3) and MATLAB. The MATLAB implementation performed better because it includes a branch length parameter for identifying terminal branches, which greatly reduced the number of false branches; the Python implementation does not include this feature (in any version), and its output had many more such “hair” artifacts. ClearMap skeletonization uses an algorithm by Palagyi & Kuba (1999) to thin segmentation masks, which does not include hair removal. VesselVio uses a parallelized version of the scipy implementation of the Lee et al. (1994) algorithm, which does not perform hair removal based on a terminal branch length filter; instead, VesselVio performs a threshold-based hair removal that is frequently overly aggressive (it removes true positive vessel branches), as highlighted by the authors.
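      The idea of a terminal branch length filter can be sketched on an already extracted skeleton graph. This is not the MATLAB implementation referred to above; the adjacency-dictionary layout, the function name, and the rule of pruning only twigs attached to a junction are assumptions made for illustration.

```python
def prune_short_terminal_branches(adjacency, min_length):
    """Remove short terminal branches ("hairs") from a skeleton graph.

    adjacency: dict {node: {neighbour: branch_length}} for an undirected graph.
    min_length: terminal branches shorter than this are removed.
    """
    pruned = {node: dict(nbrs) for node, nbrs in adjacency.items()}
    endpoints = [node for node, nbrs in pruned.items() if len(nbrs) == 1]
    for node in endpoints:
        if node not in pruned or len(pruned[node]) != 1:
            continue
        (neighbour, length), = pruned[node].items()
        # Only prune twigs hanging off a junction (degree >= 3), so that
        # genuine vessel ends and isolated segments are left untouched.
        if len(pruned[neighbour]) >= 3 and length < min_length:
            del pruned[neighbour][node]
            del pruned[node]
    return pruned

# Toy skeleton: a-b-c-d is a real vessel path, h is a 1-um hair at junction b.
skeleton = {
    "a": {"b": 10.0},
    "b": {"a": 10.0, "c": 10.0, "h": 1.0},
    "c": {"b": 10.0, "d": 10.0},
    "d": {"c": 10.0},
    "h": {"b": 1.0},
}
cleaned = prune_short_terminal_branches(skeleton, min_length=3.0)
```

      Filtering on branch length rather than on a global intensity or size threshold keeps long terminal vessels (such as a-b here) while dropping only the short spur.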

      Moreover, the authors mention that robust center-line detection was critical. In my view, robust center-line extraction typically requires some additional processing of the binarized data, e.g. using a binary smoothing step. Various binary smoothers are available in the literature and as Python code.

      Indeed, binary smoothing was performed: background “holes” located within the vasculature were filled, and the masks were dilated (3x) and then eroded to the centreline. Scipy’s binary closing function smooths the morphology of binary segmentation masks by dilating and then eroding them (as a part of the selected skeletonization algorithm).
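      A minimal sketch of this kind of mask smoothing with scipy.ndimage is given below. The function name, the padding step, and the use of three closing iterations are assumptions for illustration and may differ from the settings in the actual pipeline.

```python
import numpy as np
from scipy import ndimage

def smooth_vessel_mask(mask, iterations=3):
    """Fill enclosed holes, then apply binary closing (dilate, then erode).

    mask: boolean segmentation volume; iterations must be at least 1.
    """
    mask = np.asarray(mask, dtype=bool)
    # Fill background "holes" fully enclosed by the vessel mask.
    filled = ndimage.binary_fill_holes(mask)
    # Pad so that the erosion step does not eat into vessels at the border.
    padded = np.pad(filled, iterations, mode="constant", constant_values=False)
    structure = ndimage.generate_binary_structure(mask.ndim, 1)
    closed = ndimage.binary_closing(padded, structure=structure,
                                    iterations=iterations)
    # Crop the padding away again.
    core = tuple(slice(iterations, -iterations) for _ in range(mask.ndim))
    return closed[core]

# Toy volume: a solid block with a one-voxel hole in its centre.
volume = np.zeros((9, 9, 9), dtype=bool)
volume[2:7, 2:7, 2:7] = True
volume[4, 4, 4] = False
smoothed = smooth_vessel_mask(volume)
```

      The smoothed mask can then be passed to whichever skeletonization routine is used for centreline extraction.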

      Line 303:

      'RBC' is not defined (red blood cells?)

      This has been updated.

      Line 398:

      pPhotonsimulation -> Photostimulation

      This has been corrected.

      Line 400 ff: Efficiency:

      I am not sure how useful the measure really is without any information about the 'sources' (i.e. arteries) and sinks (i.e. veins) as blood does not need to be moved between any two arbitrary nodes.

      While blood reversals are observed, blood is typically not moved arbitrarily between two arbitrary nodes in capillary networks.

      We agree with the reviewer that classifying the vessels by type is important and are currently working on deep learning-based algorithms for the classification of microvasculature into arterioles and venules for future work.

      In addition, short paths between two nodes with low resistivity will potentially dominate the sum and the authors excluded vessels 10um and above. This threshold seems arbitrary.

      The 10-um diameter threshold was not applied in the computation of the network metrics. The 10-um thresholding was restricted to “capillary” identification in Figure 8: the 10-um cutoff for referring to a vessel as a capillary has long been applied in the literature [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11].

      Figure 3:

      It's unclear what the units are for the Mean Surface and Hausdorff Distances (pixel or um?).

      The units have now been specified (um).

      Figure 4:

      The binarized data, and particularly the crops are difficult to interpret in black and white. It would be much more useful to present the segmentation results in a way that is interpretable (e.g. improving the rendering of the 3d information, particularly in the crops by using shadows or color codes for depth, etc).

      We have updated these visualizations and shaded them based on cortical depth.

      Panel C indicates that ilastik is performing badly due to changes in imaging conditions (much higher background level). As pointed out before, in my view, a reasonable pipeline should start by removing and standardizing background levels as well as dynamic ranges and possibly other artifacts before performing a more detailed analysis. This would also make the pipeline more robust against data from other microscopes etc as only a few preprocessing parameters might need to be adjusted.

      I wonder whether after such a pre-processing step, UNET / UNETR would still perform in a way that was superior to ilastik, as ground truth data was generated with the aid of ilastik initially.

      The Ilastik model is based on semi-automatically generated foreground labels in small batches. We had to break the data up into small groups during manual labelling because larger groups could not be run within the computational limits of Ilastik. Ilastik is typically trained in an iterative fashion on a few patches at a time because it takes 2-3 hours per patch to train, and the resulting model does not generalize to the remaining patches or to out-of-distribution data, even with image pre-processing steps. Following the reviewer's comment, we did try inputting normalized images into Ilastik, but this did not improve its results. UNET and UNETR inputs have been normalized for signal intensities.

      Typical pre-processing/standard computer vision techniques with parameter tuning do not generalize on out-of-distribution data with different image characteristics, motivating the shift to DL-based approaches.

      Figure 5:

      This is a validation figure that might be better shown in an appendix or as a supplement.

      Since this is a methodological paper, we think it is important to highlight the validation of the proposed method.

      Line 476:

      It's surprising that the number of vessel segments almost doubles when taking the union. Is the number of RBC plugs expected to be so high?

      The etiology of discontinuities includes, but is not limited to, RBC plugs; we expect discontinuities to arise also from a very short pixel dwell time (0.067us) of the resonant scanning and have indeed observed apparent vessel discontinuities on resonant scanning that are not present with Galvano scanning using a pixel dwell time of 2us.

      Section 4.4 / 4.5 :

      The analysis in these sections provides mostly tables with numbers that are more difficult to read and hides possible interesting structures in the distribution of the various measures/quantities. For example, why is 5um a good choice to discriminate between small and large vessels, why not resolve this data more precisely via scatter plots?

      Some distributions are shown in the appendix and could be moved to the main analysis.

      Generally, visualizing the data and providing more detailed insights into the results would make this manuscript more interesting for the general reader.

      The radius of vessel segments drops off after 5.0 um, as shown in Supplementary Figure 4A. The 10-um diameter thresholding is based on prior literature [1], [12], [13], [14], [15], [16], [17], [18], [19] and is used to segregate different vessel types in a conservative manner. The smallest capillaries are expected to have pericytes on their vessel walls whereas arteries are expected to have smooth muscle cells on their vessel walls. These differences in mural cells also may lead to differences in respective vessels’ reactivity.

      The data summarized in Tables 1 and 2 are shown as scatter plots in Figures 8, Supplementary Fig 4 and Supplementary Fig 5.

      Line 556:

      The authors deem a certain change in radius as the relevant measure for responding vessels. They deem a vessel responding if it dilates by twice the std deviation in the radius.

      Based on this measure they find that large vessels rarely respond.

      However, I think this analysis might obscure some interesting effects:

      (1) The standard deviation of the radius depends on the correct estimation of the center point. Given the limited spatial resolution the center point (voxel) obtained from the binarization and skeletonization might not lie in the actual center of the vessel. This effect will be stronger for larger vessels. Center point coordinates should thus be corrected to minimize the std in radius.

      (2) Larger vessels will not necessarily have a perfectly circular shape, and thus the std measure is not necessarily a good measure of 'uncertainty' of estimating the actual radius.

      (3) The above reasons possibly contribute to the fact that from Figure 6 it seems vessels with larger radii have higher std in general (as indicated above some more detailed visualization of the data instead of plain tables could reveal such effects better, e.g. scatter radius vs std). This higher std is making it harder to detect changes in larger vessels. However, with respect to the blood flow, the critical factor is the cross-section of the vessel that scales with the radius squared. Thus, a fixed change in radius for a vessel (say 1um) will induce a larger increase in the flow rate in larger vessels as the change in cross-section is also proportional to the radius of the vessel.

      Thus, larger vessels to be deemed responders should probably have lower thresholds, thresholds should be taken on the cross-section change, or at least thresholds should not be higher for larger vessels as it is the case now using the higher std.

      (1) The radius estimate does not depend on the precise placement of the center point as the radius is not being estimated by the distance from the center point to the boundary of the vessel. Instead, our strategy is to estimate the cross-sectional area (A) of the vessel by the Riemann sum of the sectors with the apex at the center point; the radius is then quoted as sqrt(A/pi) (Supplementary figure 3B). The vessel radius estimates in each cross-sectional plane are then averaged across the cross-sectional planes placed every ~1um along the vessel length. The uncertainty in the cross-sectional plane’s vessel radius, the uncertainty in the vessel radius (upon averaging the cross-sectional planes), and the uncertainty in the radius estimate across repeated measures of a state (i.e. across different samples of the baseline vs. post-photostimulation states) are all reported, and the last one is used to define responding vessels.

      To demonstrate the insensitivity to the precise placement of the vessel’s centrepoint, we have jittered the centerline in the plane perpendicular to the vessel tangent at each point along the vessel and then estimated the mean radius in 71 cross-sectional planes of larger vessels (mean radius > 5 um). The percent difference in the estimated radius at our selected vessel centrepoints vs. the jittered centrepoints is plotted above. The percent difference in the mean radius estimated was 0.64±3.44% with 2.45±0.30 um centerpoint jittering. (In contrast, photostimulation was estimated to elicit an average 25.4±18.1% change in the magnitude of the radius of larger vessels, i.e. those with a baseline radius >5um.)

      (2) Indeed, the cross-sectional areas of either large or small vessels are not circles. Consequently, we are placing the vessel boundary, following other published work[20], at the minimum of the signal intensity gradients computed along thirty-six spokes emanating from the centrepoint (cf Figure 2H,K). The cross-sectional area of the vessel in the said cross-sectional plane is then estimated by summing the areas of the sectors flanked by neighbouring spokes. We do not make an assumption about the cross-sectional area being circular. We report radii of circles with the equivalent area as that of the cross-sectional areas merely for ease of communication (as most of the literature to date reports vessel radii, rather than vessel cross-sectional areas.)

      To demonstrate the robustness of this approach, we show the sensitivity of vessel-wise radius estimate on the number of spokes used to estimate the radius in Supplementary Figure 3a. The radius estimate converges after 20 spokes have been used for estimation. Our pipeline utilizes 36 spokes and then excludes minima that lie over 2 STD away from the mean radius estimate across those 36 spokes. With 36 spokes, the vesselwise mean radius estimation was within 0.24±0.62% of the mean of radius estimates using 40-60 spokes.

      (3) Across-baseline sample uncertainty in vessel radius is not dependent on baseline vessel caliber (i.e. this uncertainty is not larger in larger vessels).
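      The sector-based radius estimate described in points (1) and (2) above can be sketched as follows. This is a minimal sketch under stated assumptions: spokes are taken to be equally spaced in angle, each sector is approximated as 0.5 * r^2 * dtheta, and spokes excluded as outliers are simply dropped with the remaining sectors widened equally. The function name and these details are hypothetical and may not match the pipeline's implementation.

```python
import numpy as np

def equivalent_radius(spoke_lengths, n_sd=None):
    """Radius of the circle whose area equals the summed sector areas.

    spoke_lengths: distances from the centrepoint to the vessel boundary
        along equally spaced spokes (e.g. 36 of them).
    n_sd: if given, spokes further than n_sd standard deviations from the
        mean spoke length are excluded before the area is computed.
    """
    spokes = np.asarray(spoke_lengths, dtype=float)
    if n_sd is not None:
        keep = np.abs(spokes - spokes.mean()) <= n_sd * spokes.std()
        spokes = spokes[keep]
    sector_angle = 2.0 * np.pi / spokes.size
    # Riemann sum of sector areas with the apex at the centrepoint.
    area = np.sum(0.5 * spokes ** 2 * sector_angle)
    return float(np.sqrt(area / np.pi))

# Circular lumen of radius 5 um probed from a centrepoint displaced by 2 um.
angles = np.arange(36) * 2.0 * np.pi / 36
true_radius, offset = 5.0, 2.0
off_centre_spokes = (-offset * np.cos(angles)
                     + np.sqrt(true_radius ** 2 - (offset * np.sin(angles)) ** 2))
radius_off_centre = equivalent_radius(off_centre_spokes)

# One spurious boundary position among 36 spokes.
spokes_with_outlier = np.array([4.0] * 35 + [12.0])
radius_filtered = equivalent_radius(spokes_with_outlier, n_sd=2.0)
```

      In this idealized case the area-based estimate recovers the 5 um radius despite the displaced centrepoint, whereas the mean spoke length would underestimate it; the 2-SD exclusion removes the single spurious spoke so that the second estimate returns 4 um.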

      Supplementary Figure 5 shows vessel radius changes for large vessels without a threshold defining responding or non-responding vessels. To explore the dependence of the outcomes on the threshold used to identify the responding vessels, we have explored an alternative strategy, whereby responding small vessels are identified as those vessels that show a post-photostimulation (vs. baseline) radius change of more than 10%. These data are now plotted for capillaries in Supplementary Figure 10 and are in agreement with Figure 8. These points are now also discussed in the Discussion section of the revised manuscript:

      “Additionally, alternative definitions of responding vessels may be useful depending on the end goal of a study (e.g., this could mean selecting a threshold for the radius change based on a percentage change from the baseline level).”

      Section 4.5.1

      Why is the distance to the next neuron a good measure here? If two or more neurons are just a bit further away there will be twice or multiple times the 'load' while the measure would only indicate the distance to the shortest neuron. I wonder how the results change if those 'ensemble' effects are taken into account.

      In this direction, looking for network-level effects with respect to the full spatial organization of the neurons would be very interesting to look at.

      We agree with the reviewer that this question is interesting; however, it is not addressable using present data: activated neuronal firing will have effects on their postsynaptic neighbors, yet we have no means of measuring the spread of activation using the current experimental model.

      Figure 8

      The scatter plots shown are only partly described (e.g. what's the line with error bars in C, why does it only appear for the high-intensity stimulation?).

      A quadratic polynomial fit is shown only in C because a significant response was observed only for this condition, i.e. for the higher intensity blue photostimulation.

      From the scatter plots as shown it is not clear to me why dilations happen on average further away. This might be a density effect not well visible in this representation. The data does not seem to show a clear relationship between neuron distance and Delta R.

      Particularly in the right panel (high stimulation) there seems to be a similar number of close by neurons responding in both directions, but possibly a few more contracting at larger distances?

      So, the overall effect does not seem as 'simple' as suggested in the title of section 4.5.1 in my view, but rather more cells start to contract at larger distances while there seems to be a more intricate balance nearby.

      A more thorough analysis and visualization of the densities etc. might be needed to clarify this point.

      The language has been revised to:

      458-nm photostimulation resulted in a mix of constrictions and dilations with 44.1% of significantly responding vessels within 10 um of a labelled pyramidal neuron constricting and 55.1% dilating, while 53.3% of vessels further than 30 um constricted and 46.7% dilated. The cutoff distances from the closest labelled neuron were based on estimates of cerebral metabolic rate of oxygen consumption that showed a steep gradient in oxygen consumption with distance from arteries, CMRO2 being halved by 30 μm away.

      We added a probability density plot for significant constrictors and dilators to Figure 8 and Supplementary Figure 5.

      Figure 8 Panel D / Section 4.5.2

      This is a very interesting result in my view found in this study.

      I am unclear how to interpret the effect. The authors state that dilators tend to be closer to the surface. Looking at the scatter plot (without real density information except the alpha value) it seems again the number of responders in both directions is about the same, but in deeper regions the contraction is just larger? This would be different, than how the authors interpret the data. It is unclear from the provided analysis/plots what is actually the case.

      We added a probability density function plot of the constrictors and dilators, which shows a greater incidence of constrictions (vs. dilations). The text of the paper was then clarified to include the proportion of significant constrictors/ dilators closer than 10 um vs. further than 30 um away from the closest labeled neuron.

      For the analyses above involving $\Delta R$, I recommend also looking at how those results change when considering changes in cross section instead, i.e. taking into account the actual vessel radius as well, as discussed above.

      It would be interesting to speculate here or in the discussion on a reason why vessels in deeper regions might need to contract more?

      Unaddressed is the question if e.g. contraction in a vessel for small stimulation is predictive of contractions for larger stimulation or any other relationships?

      Thank you for your comment. Given its hierarchical organization and high within-vessel response heterogeneity, we believe that the vasculature is best analyzed as a network. Our radius estimates come from averaged cross-sectional estimates allowing us to examine heterogeneity within individual vessel segments.

      The discussion has been updated to include reasons as to why deeper vessels may contract more:

      “As the blue light stimulation power increased, the mean depth of both constricting and dilating vessels increased, likely resulting from higher intensity light reaching ChR2-expressing neurons deeper in the tissue and exciting superficial neurons (and thus their postsynaptic neurons) to a greater level [21], [22]. The blue light would be expected to excite a lower number of neurons farther from the cortical surface at lower powers.”

      Also, how consistent are contractions/dilations observed at a particular vessel etc.

      To look at the consistency of a particular vessel's response to the 1.1 or 4.3 mW/mm^2 blue light photostimulation, we categorized all significant responses as constrictions or dilations, defining a responding vessel as that showing a change that is either > 2 x baseline vessel radius variability or >10% of the vessel’s mean baseline radius.

      Given twice the baseline variability as the threshold for response, of the vessels that responded more than once, 31.7% dilated on some trials while constricting on others; 41.1% dilated on each trial; and 27.2% constricted on each trial. (Note that some trials use 1.1 vs. 4.3 mW/mm^2 and some have opposite scanning directions.)
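      The two response definitions and the per-vessel consistency tally described above can be sketched as follows. The function names, the label strings, and the toy radii are hypothetical; the thresholds (twice the baseline variability, or 10% of the mean baseline radius) follow the text.

```python
import numpy as np

def classify_response(baseline_radii, post_radius, n_sd=2.0, min_fraction=None):
    """Label one trial as "dilation", "constriction" or "none".

    By default the threshold is n_sd times the baseline standard deviation;
    if min_fraction is given, it is that fraction of the mean baseline radius.
    """
    baseline = np.asarray(baseline_radii, dtype=float)
    baseline_mean = baseline.mean()
    change = post_radius - baseline_mean
    if min_fraction is None:
        threshold = n_sd * baseline.std(ddof=1)
    else:
        threshold = min_fraction * baseline_mean
    if abs(change) <= threshold:
        return "none"
    return "dilation" if change > 0 else "constriction"

def summarize_trials(labels):
    """Summarize the consistency of one vessel's responses across trials."""
    responses = [label for label in labels if label != "none"]
    if len(responses) < 2:
        return "responded fewer than twice"
    kinds = set(responses)
    if kinds == {"dilation"}:
        return "dilated on each trial"
    if kinds == {"constriction"}:
        return "constricted on each trial"
    return "mixed"

# Toy vessel: baseline radius near 4 um, three post-stimulation trials.
baseline = [4.0, 4.2, 3.8]
labels = [classify_response(baseline, r) for r in (4.5, 3.5, 4.3)]
summary = summarize_trials(labels)
```

      Tallying such summaries over all vessels that responded more than once would give proportions of the kind quoted above.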

      Section 4.5.3

      The results in assortativity are interesting. It would be interesting to look at how the increase in assortativity is mediated. For, example, is this in localized changes in some parts of the graph as visible in A or are there other trends? Do certain sub-graphs that systematically change their radius have certain properties (e.g. do activated neurons cluster there) or are these effects related to some hotspots that also show a coordinated change in control conditions (the assortativity seems not zero there)?

      I already discussed if the efficiency measure is necessarily the best measure to use here without taking into account 'sources' and 'sinks'.

      We plan to address this in future work once we have successfully trained models for the classification of vessels into arteries, veins, and capillaries. Capillaries will be classified based on their branch order from parent arteries to specify where in the network changes are occurring.

      Figure 9

      It's unclear to me why the Ohm symbol needs to be bold?

      It is not bolded (just the font’s appearance).

      Line 707:

      "458-nm photostimulation caused capillaries to dilate when pyramidal neurons were close, and constrict when they were further away."

      In my view, this interpretation is too simple, given the discussion above. A more detailed analysis could clarify this point.

      The discussion on this point has been revised to:

      458-nm photostimulation resulted in a mix of constrictions and dilations, with 44.1% of significantly responding vessels within 10 μm of a labelled pyramidal neuron constricting, and 55.1% dilating; while 53.3% of vessels further than 30 μm constricted and 46.7% dilated. The cutoff distances from the closest labelled neuron were based on estimates of cerebral metabolic rate of oxygen consumption that showed a steep gradient in oxygen consumption with distance from arteries, CMRO2 being halved by 30 μm away [23].

      Line 740:

      "The network efficiency here can be thought of as paralleling mean transit time, i.e., the time it takes blood to traverse the capillary network from the arteries to the veins".

      The network efficiency as defined by the authors seems not to rely on artery/vein information and thus this interpretation is not fully correct in my view.

      The authors might want to reconsider this measure for one that accounts for sources and sinks, if they like to interpret their results as in this line.

      Yes, the efficiency described does not account for sources and sinks. It estimates the resistivity of capillaries, as a proxy for the ease of moving through the observed capillary nexus. Looking at the efficiency metric from graph theory does not require knowledge of the direction of blood flow, and can comment on the resistivity changes across capillary networks.

      For future work, we are investigating methods of classifying vessels as arteries, capillaries, or veins. This type of analysis will provide more detailed information on paths between arteries and veins; it will not provide insight into large-scale network-wide modifications, as those require larger fields of view. 
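      A direction-agnostic efficiency of the kind discussed in this response can be sketched as the mean inverse shortest-path resistance over all node pairs. The Poiseuille-type weighting (resistance proportional to length / radius^4), the function names, and the toy network are assumptions for illustration; the manuscript's exact definition is not restated here.

```python
import heapq

def build_adjacency(segments):
    """segments: iterable of (node_u, node_v, length, radius) tuples."""
    adjacency = {}
    for u, v, length, radius in segments:
        # Assumed Poiseuille-type resistance, up to a constant factor.
        resistance = length / radius ** 4
        adjacency.setdefault(u, {})[v] = resistance
        adjacency.setdefault(v, {})[u] = resistance
    return adjacency

def shortest_path_costs(adjacency, source):
    """Dijkstra: lowest summed resistance from source to each reachable node."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > best.get(node, float("inf")):
            continue
        for neighbour, weight in adjacency[node].items():
            new_cost = cost + weight
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                heapq.heappush(heap, (new_cost, neighbour))
    return best

def global_efficiency(adjacency):
    """Mean of 1 / (shortest-path resistance) over ordered node pairs."""
    nodes = list(adjacency)
    if len(nodes) < 2:
        return 0.0
    total = 0.0
    for source in nodes:
        costs = shortest_path_costs(adjacency, source)
        # Unreachable pairs contribute zero to the sum.
        total += sum(1.0 / c for target, c in costs.items() if target != source)
    return total / (len(nodes) * (len(nodes) - 1))

# Toy chain a-b-c before and after both segments dilate from 2 um to 4 um.
baseline_eff = global_efficiency(build_adjacency(
    [("a", "b", 16.0, 2.0), ("b", "c", 16.0, 2.0)]))
dilated_eff = global_efficiency(build_adjacency(
    [("a", "b", 16.0, 4.0), ("b", "c", 16.0, 4.0)]))
```

      Because no sources or sinks are specified, every node pair contributes equally, which is why this quantity reflects overall resistivity of the network rather than arteriovenous transit.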

      Line 754 Pipeline Limitations and Adaptability

      I think the additional 'problem' of generating new training data for novel data sets or data from other microscopes etc should be addressed or the pipeline tested on such data sets.

      Generating training data is typically the biggest time investment when adapting pipelines.

      The generalization properties of the current pipeline are not discussed (e.g. performance on a different microscope / different brain area / different species etc.).

      The public response to reviews has been updated with out-of-distribution data from other imaging protocols, microscopes, and species showing generalizability. These results have also been added to the paper as Supplementary Table 4, and Figure 6. The performance of our pipeline on these out-of-distribution data is now discussed in the updated Discussion section.

      Line 810

      Code availability should be coupled with the publication of this paper as it seems the main contribution. I don't see how the code can be made available after publication only. It should be directly available once the manuscript is published and it could help to make it available to the reviewers before that. It can be updated later of course.

      The code is being made available.

      Reviewer #2 (Recommendations For The Authors):

      This analytical pipeline could be quite useful but it needs to be better demonstrated. If faster volumetric imaging is not possible, perhaps using it over a small volume would still demonstrate its utility at a smaller but more believable scale.

      The higher temporal resolution scans (over smaller tissue volumes) have now been performed and the results of applying our pipeline to these data are summarized in Supplementary Figure 2.

      Using sensory stimuli for neuronal activation might be a better idea than optogenetic stimulation. It isn't necessary but it would avoid the blue light issue.

      The pipeline is readily applicable for analysis of vasoreactivity following different perturbers; however, the robustness of vessels’ response is higher with blue light photostimulation of ChR2 than with sensory stimuli [24]. Notwithstanding, an example of the vascular response to electrical stimulation of the contralateral forepaw is now included in Supplementary Figure 2.

      This tool could be quite useful even without neural activity mapping. It obviously makes it even more powerful, but again, the utility could be demonstrated with just vascular data or even anatomical neuronal data without function.

      We agree with both points, and have emphasized them in the revised discussion section.

      Line 559 says the average capillary diameter change was 1.04 um. The next sentence and the table below all have different values so this is unclear.

      The wording was updated to make this clearer.

      Line 584 - should 458 be 552?

      458 is correct.

      Figure 1 - the schematic doesn't seem right - the 650 LPF with the notches is positioned to pass short light and reflect long wavelengths and the notch bands.

      The figure has been updated to reflect this. The original layout was done for compactness.

      References

      (1) D. A. Hartmann, V. Coelho-Santos, and A. Y. Shih, “Pericyte Control of Blood Flow Across Microvascular Zones in the Central Nervous System,” Annu. Rev. Physiol., vol. 84, pp. 331–354, Feb. 2022, doi: 10.1146/annurev-physiol-061121-040127.

      (2) J. Batista, “An adaptive gradient-based boundary detector for MRI images of the brain,” in 7th International Conference on Image Processing and its Applications, Manchester, UK: IEE, 1999, pp. 440–444. doi: 10.1049/cp:19990360.

      (3) Y. Le, X. Xu, L. Zha, W. Zhao, and Y. Zhu, “Tumor boundary detection in ultrasound imagery using multi-scale generalized gradient vector flow,” J. Med. Ultrason., vol. 42, no. 1, pp. 25–38, Jan. 2015, doi: 10.1007/s10396-014-0559-3.

      (4) X. Ren, “Multi-scale Improves Boundary Detection in Natural Images,” in Computer Vision – ECCV 2008, D. Forsyth, P. Torr, and A. Zisserman, Eds., Berlin, Heidelberg: Springer, 2008, pp. 533–545. doi: 10.1007/978-3-540-88690-7_40.

      (5) C. Grigorescu, N. Petkov, and M. A. Westenberg, “Contour and boundary detection improved by surround suppression of texture edges,” Image Vis. Comput., vol. 22, no. 8, pp. 609–622, Aug. 2004, doi: 10.1016/j.imavis.2003.12.004.

      (6) J. Tang and S. T. Acton, “Vessel Boundary Tracking for Intravital Microscopy Via Multiscale Gradient Vector Flow Snakes,” IEEE Trans. Biomed. Eng., vol. 51, no. 2, pp. 316–324, Feb. 2004, doi: 10.1109/TBME.2003.820374.

      (7) J. Merkow, A. Marsden, D. Kriegman, and Z. Tu, “Dense Volume-to-Volume Vascular Boundary Detection,” in Medical Image Computing and Computer-Assisted Intervention - MICCAI 2016, S. Ourselin, L. Joskowicz, M. R. Sabuncu, G. Unal, and W. Wells, Eds., Cham: Springer International Publishing, 2016, pp. 371–379. doi: 10.1007/978-3-319-46726-9_43.

      (8) F. Orujov, R. Maskeliūnas, R. Damaševičius, and W. Wei, “Fuzzy based image edge detection algorithm for blood vessel detection in retinal images,” Appl. Soft Comput., vol. 94, p. 106452, Sep. 2020, doi: 10.1016/j.asoc.2020.106452.

      (9) M. E. Martinez-Perez, A. D. Hughes, S. A. Thom, A. A. Bharath, and K. H. Parker, “Segmentation of blood vessels from red-free and fluorescein retinal images,” Med. Image Anal., vol. 11, no. 1, pp. 47–61, Feb. 2007, doi: 10.1016/j.media.2006.11.004.

      (10) A. M. Mendonca and A. Campilho, “Segmentation of retinal blood vessels by combining the detection of centerlines and morphological reconstruction,” IEEE Trans. Med. Imaging, vol. 25, no. 9, pp. 1200–1213, Sep. 2006, doi: 10.1109/TMI.2006.879955.

      (11) A. F. Frangi, W. J. Niessen, K. L. Vincken, and M. A. Viergever, “Multiscale vessel enhancement filtering,” in Medical Image Computing and Computer-Assisted Intervention — MICCAI’98, W. M. Wells, A. Colchester, and S. Delp, Eds., Berlin, Heidelberg: Springer, 1998, pp. 130–137. doi: 10.1007/BFb0056195.

      (12) K. Bisht et al., “Capillary-associated microglia regulate vascular structure and function through PANX1-P2RY12 coupling in mice,” Nat. Commun., vol. 12, no. 1, p. 5289, Sep. 2021, doi: 10.1038/s41467-021-25590-8.

      (13) Y. Wu et al., “Quantitative relationship between cerebrovascular network and neuronal cell types in mice,” Cell Rep., vol. 39, no. 12, p. 110978, Jun. 2022, doi: 10.1016/j.celrep.2022.110978.

      (14) T. Kirabali et al., “The amyloid-β degradation intermediate Aβ34 is pericyte-associated and reduced in brain capillaries of patients with Alzheimer’s disease,” Acta Neuropathol. Commun., vol. 7, no. 1, p. 194, Dec. 2019, doi: 10.1186/s40478-019-0846-8.

      (15) X. Ren et al., “Linking cortical astrocytic neogenin deficiency to the development of Moyamoya disease–like vasculopathy,” Neurobiol. Dis., vol. 154, p. 105339, Jul. 2021, doi: 10.1016/j.nbd.2021.105339.

      (16) J. Steinman, M. M. Koletar, B. Stefanovic, and J. G. Sled, “3D morphological analysis of the mouse cerebral vasculature: Comparison of in vivo and ex vivo methods,” PLOS ONE, vol. 12, no. 10, p. e0186676, Oct. 2017, doi: 10.1371/journal.pone.0186676.

      (17) A.-A. Berthiaume et al., “Dynamic Remodeling of Pericytes In Vivo Maintains Capillary Coverage in the Adult Mouse Brain,” Cell Rep., vol. 22, no. 1, pp. 8–16, Jan. 2018, doi: 10.1016/j.celrep.2017.12.016.

      (18) S. Katz, R. Gattegno, L. Peko, R. Zarik, Y. Hagani, and T. Ilovitsh, “Diameter-dependent assessment of microvascular leakage following ultrasound-mediated blood-brain barrier opening,” iScience, vol. 26, no. 6, p. 106965, Jun. 2023, doi: 10.1016/j.isci.2023.106965.

      (19) J. Drouin-Ouellet et al., “Cerebrovascular and blood-brain barrier impairments in Huntington’s disease: Potential implications for its pathophysiology,” Ann. Neurol., vol. 78, no. 2, pp. 160–177, Aug. 2015, doi: 10.1002/ana.24406.

      (20) K. P. McDowell, A.-A. Berthiaume, T. Tieu, D. A. Hartmann, and A. Y. Shih, “VasoMetrics: unbiased spatiotemporal analysis of microvascular diameter in multi-photon imaging applications,” Quant. Imaging Med. Surg., vol. 11, no. 3, pp. 969–982, Mar. 2021, doi: 10.21037/qims-20-920.

      (21) E. L. Johnson et al., “Characterization of light penetration through brain tissue, for optogenetic stimulation.” bioRxiv, p. 2021.04.08.438932, Apr. 08, 2021. doi: 10.1101/2021.04.08.438932.

      (22) S. I. Al-Juboori, A. Dondzillo, E. A. Stubblefield, G. Felsen, T. C. Lei, and A. Klug, “Light scattering properties vary across different regions of the adult mouse brain,” PloS One, vol. 8, no. 7, p. e67626, 2013, doi: 10.1371/journal.pone.0067626.

      (23) P. Mächler et al., “Baseline oxygen consumption decreases with cortical depth,” PLOS Biol., vol. 20, no. 10, p. e3001440, Oct. 2022, doi: 10.1371/journal.pbio.3001440.

      (24) J. R. Mester et al., “In vivo neurovascular response to focused photoactivation of Channelrhodopsin-2,” NeuroImage, vol. 192, pp. 135–144, May 2019, doi: 10.1016/j.neuroimage.2019.01.036.

    1. Author response:

      The following is the authors’ response to the current reviews.

      We have significant concerns about the eLife assessment and the reviews. The reviewers acknowledged substantial strengths in our work:

      • Reviewer 3 noted that “the single-unit analyses of tuning direction are robustly characterized”, “the differences in neural correlations across behaviors, regions and perturbations are robust”, and “The evidence for these claims is solid.”

      • Reviewer 2 stated that “the manuscript has been improved” with “new analyses [that] provide improved rigor”.

      Despite these, the final eLife assessment inexplicably downplayed the significance of the findings and strength of evidence.

      Broader Impact and Significance. The findings, not only the data, have theoretical and/or practical implications extending well beyond a single subfield relevant to:

      1. behavioral neuroscientists studying sensorimotor integration

      2. systems and theoretical neuroscientists

      3. neural and biomechanical engineers working on brain-computer interfaces for speech or oral or limb prosthetics

      4. soft robotics researchers

      5. comparative motor control researchers

      6. clinicians involved in the evaluation and rehabilitation of orolingual function (e.g., after stroke or glossectomy, dysphagia)

      Given this broad relevance, we question why the significance was characterized as merely "useful" rather than "important."

      Dismissive Tone Toward Descriptive Research. Some reviews displayed a dismissive or skeptical tone toward the findings and their significance, even when methods were solid and support for the claims was strong. They critiqued the “descriptive nature” of our study, faulting the lack of mechanistic explanation. However, in poorly understood fields such as orofacial sensorimotor control, descriptive studies provide the empirical foundation for mechanistic studies. Rich descriptive data generate testable hypotheses that drive mechanistic discoveries forward, while mechanistic studies conducted without this groundwork often pursue precise answers to poorly formulated questions.

      Specific Issues with Reviews:

      1. Significant omission in study description:

      The eLife Assessment’s second sentence states: “The data, which include both electrophysiology and nerve block manipulations, will be of value to neuroscientists and neural engineers interested in tongue use.”

      This description omits our simultaneously recorded high-resolution 3D kinematics data—a significant oversight given that combining high-density electrophysiological recording from multiple cortical regions with high-resolution 3D tongue kinematics during naturalistic behaviors in non-human primates represents one of our study's key strengths. Currently, only two research labs in the US possess this capability.

      2. Overemphasis on the “smaller” and “inconsistent” findings

      While we acknowledge some inconsistent findings between animals, the reviews overemphasized these inconsistencies in ways that cast unwarranted doubt on our more significant and consistent results.

      a. Reviewer 1: “[...] the discrepancies in tuning changes across the two NHPs, coupled with the overall exploratory nature of the study, render the interpretation of these subtle differences somewhat speculative.” “[...] in some recording sessions, they blocked sensory feedback using bilateral nerve block injections, which seemed to result in fewer directionally tuned units and changes in the overall distribution of the preferred direction of the units.”

      The skeptical tone of this critique is at odds with Reviewer 3’s statement that “The evidence for these claims is solid.” Reviewer 1 characterized our findings as “somewhat speculative”, seemingly overlooking the robust and consistent changes we documented:

      • “Following nerve block, MIo and SIo showed significant decreases in the proportion of directionally modulated neurons across both tasks (Fig. 10A; Chi-square, MIo: p <0.001, SIo: p < 0.05).”

      • “Nerve block significantly altered PD distributions during both tasks. During feeding, MIo neurons in both subjects exhibited a significant clockwise shift in mean PD toward the center (0°), resulting in more uniform distributions (Fig. 11A; circular k-test, p < 0.01).”

      These results were obtained through careful subsampling of trials with similar kinematics for both feeding and drinking tasks, ensuring that the tuning changes in the nerve block experiments could not be attributed to differing kinematics.

      b. Reviewer 2: “One weakness of the current study is that there is substantial variability in results between monkeys.”

      This vague critique, without specifying which results showed “substantial variability”, reads as though most findings were inconsistent, unfairly casting doubt on our study’s validity.

      3. Inaccurate statements in the Reviewers’ summaries

      Several reviewer statements contain factual inaccuracies:

      a. Reviewer 2: “A majority of neurons in MIo and a (somewhat smaller) percentage of SIo modulated their firing rates during tongue movements, with different modulation depending on the direction of movement (i.e., exhibited directional tuning).”

      Reviewer 2's characterization of directional tuning misrepresents our findings. We reported substantial differences in the proportion of directionally tuned neurons between MIo and SIo during the feeding task but a smaller difference in the drinking task:

      • “The proportion of directionally tuned neurons [...] differed significantly between MIo and SIo during the feeding task in both subjects (Chi-square, p < 0.001). In rostral and caudal MIo, 80% of neurons were modulated to 3D direction (bootstrap, p < 0.05, Fig. 3B, left), compared to 52% in areas 1/2 and 3a/3b.”

      • “During drinking, the proportion of directionally modulated neurons was more similar between regions (69% in MIo vs. 60% in SIo: Chi-square, p > 0.05, Fig. 3B right).”

      b. Reviewer 2: “There were differences observed in the proportion and extent of directional tuning between the feeding and licking behaviors, with stronger tuning overall during licking.”

      Reviewer 2's claim about task differences directly contradicts our findings. We consistently reported stronger tuning in feeding compared to drinking across multiple measures:

      • “The proportion of directionally tuned neurons was higher in the feeding vs. drinking task (Chi-square, p < 0.05, feeding: 72%, drinking: 66%)”;

      • “Cumulative explained variance for the first three factors was higher in feeding (MIo: 82%, SIo: 81%) than in drinking (MIo: 74%, SIo: 63%)”;

      • “Decoding using LSTM showed consistently higher accuracies in feeding compared to drinking regardless of the length of intervals used ..., behavioral window .., and directional angles ...”

      These results were also summarized in the Discussion.

      c. Reviewer 1: In Figure 12, factor 2 and 3 are plotted against each other? and factor 1 is left out?

      Reviewer 1’s observation about Figure 12 is incorrect. Factor 1 was included: Top subplots (feeding) show Factor 1 vs 3 (MIo) and Factor 1 vs 2 (SIo) while the bottom subplots (drinking) show Factor 2 vs 3 (MIo) and Factor 1 vs 2 (SIo). We plotted the two latent factors with highest explained variance for clarity, though all 20 factors were included in intertrajectory distance calculations.

      4. Framing and interpretive over-scrutiny

      Several critiques targeted framing rather than methodological rigor and emphasized that interpretations were speculative even when appropriately hedged:

      a. Reviewer 2: “A revised version of the manuscript incorporates more population-level analyses, but with inconsistent use of quantifications/statistics and without sufficient contextualization of what the reader is to make of these results.”

      Reviewer 2 mentioned "inconsistent use of quantifications/statistics" without specifying which analyses were problematic or updating their summary to include our additional population-level findings.

      b. Reviewer 2: “The described changes in tuning after nerve block could also be explained by changes in kinematics between these conditions, which temper the interpretation of these interesting results”

      Despite our addressing kinematic concerns through subsampled data analysis, Reviewer 2 remained unsatisfied, contrasting sharply with Reviewer 3's assessment that our arguments were "convincing" with "solid" evidence.

      c. Reviewer 2: “I am not convinced of the claim that tongue directional encoding fundamentally changes between drinking and feeding given the dramatically different kinematics and the involvement of other body parts like the jaw”

      Reviewer 2 expressed skepticism about fundamental encoding differences between tasks, despite our comprehensive controls including subsampled data with similar kinematics and multiple verification analyses (equal neuron numbers, stable neurons, various interval lengths, behavioral windows, and directional angles).

      Without describing why these analyses were insufficient, this criticism goes beyond methods or statistics. It casts doubt and challenges whether the conclusions are even worth drawing despite careful experimental controls.

      d. Reviewer 2: “The manuscript states that "An alternative explanation be more statistical/technical in nature: that during feeding, there will be more variability in exactly what somatosensation afferent signals are being received from trial to trial (because slight differences in kinematics can have large differences in exactly where the tongue is and the where/when/how of what parts of it are touching other parts of the oral cavity)? This variability could "smear out" the apparent tuning using these types of trial-averaged analyses. Given how important proprioception and somatosensation are for not biting the tongue or choking, the speculation that somatosensory cortical activity is suppressed during feedback is very counter-intuitive to this reviewer".

      By not updating this section, Reviewer 2 failed to acknowledge our responsive revisions, including Fano factor analysis showing higher variability in SIo during feeding versus drinking, and our updated discussion addressing their concerns about trial-to-trial variability: “Varying tongue shape, tongue’s contact with varying bolus properties (size and texture) and other oral structures (palate, teeth) may weaken the directional signal contained in SIo activity. Thus, small differences in tongue kinematics might create large differences in sensory signals across trials. When looking at trial-averaged signals, this natural variability could make the neural response patterns appear less precise or specific than they are. These are consistent with our findings that for both tasks, spiking variability was higher in SIo.”

      Authors’ Response to Recommendations for the authors:

      We thank the editors and the reviewers for their helpful comments. We have provided a response to reviewers’ recommendations and made some revisions on the manuscript. 

      Reviewer #1 (Recommendations for the authors): 

      In the newly added population factor analysis, several methodological decisions remain unclear to me:

      In Figure 7, why do the authors compare the mean distance between conditions in the latent spaces of MIo and SIo? Since these latent spaces are derived separately, they exist on different scales (with MIo appearing roughly four times larger than SIo), and this discrepancy is reflected in the reported mean distances (Figure 7, inset plots). Wouldn't this undermine a direct comparison?

      Thank you for this helpful feedback. The reviewer is correct that the latent spaces are derived separately for MIo and SIo, thus they exist on different scales as we have noted in the caption of Figure 7: “Axes for SIo are 1/4 scale of MIo.” 

      To allow for a direct comparison between MIo and SIo, we corrected the analysis by comparing their normalized mean inter-trajectory distances. We first calculated the geometric index (GI) of the inter-trajectory distances, d, between each pair of population trajectories per region as GI = (d<sub>1</sub> - d<sub>2</sub>) / (d<sub>1</sub> + d<sub>2</sub>). We then performed the statistics on the GIs and found a significant difference between mean inter-trajectory distances in MIo vs. SIo. We performed the same analysis on the distance travelled by MIo and SIo trajectories, using the normalized difference in distances travelled, and still found a significant difference in both tasks. We have updated the results and figure inset to reflect these changes.
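
      As an illustration of why a normalized index of this form permits comparison across latent spaces of different scale, here is a minimal sketch. The function name and the distance values are hypothetical, and how d<sub>1</sub> and d<sub>2</sub> are paired in the actual analysis is not specified here; the sketch only assumes they are two positive distances measured within the same region.

```python
import numpy as np


def normalized_index(d1, d2):
    """Return (d1 - d2) / (d1 + d2) element-wise for positive distances."""
    d1 = np.asarray(d1, dtype=float)
    d2 = np.asarray(d2, dtype=float)
    return (d1 - d2) / (d1 + d2)


# Hypothetical distances in arbitrary latent-space units.
gi_mio = normalized_index([4.0, 8.0], [2.0, 6.0])
# The same distance ratios on a 4x smaller scale (illustrative only).
gi_sio = normalized_index([1.0, 2.0], [0.5, 1.5])
```

      Because numerator and denominator scale together, multiplying every distance in a region by a constant leaves the index unchanged, and the index is bounded between -1 and 1.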

      In Figure 12, unlike Figure 7 which shows three latent dimensions, only two factors are plotted. While the methods section describes a procedure for selecting the optimal number of latent factors, Figure 7 - figure supplement 3 shows that variance explained continues to increase up to about five latent dimensions across all areas. Why, then, are fewer dimensions shown?

      Thank you for the opportunity to clarify the figure. The number of latent factors, m, obtained from the 3-fold cross-validation varied for the full sample and was 20 factors for the subsample. We clarify that all statistical analyses were done using 20 latent factors. Using the full sample of neurons, the first 3 factors explained 81% of variance in feeding data compared to 71% in drinking data. When extended to 5 factors, feeding maintained its advantage with 91% variance explained versus 82% for drinking. Because feeding showed higher variance explained than drinking across 3 or 5 factors, only three factors were shown in Figure 7 for better visualization. We added this clarification to the Methods and Results.

      Figure 12 shows the differences in the neural trajectories between the control and nerve block conditions. The control vs. nerve block comparison complicated the visualization of the results. Thus, we plotted only the two latent factors with the highest separation between population trajectories. This was clarified in the Methods and caption of Figure 12.

      In Figure 12, factor 2 and 3 are plotted against each other? and factor 1 is left out?

      This observation is incorrect; Factor 1 was included: Top subplots (feeding) show Factor 1 vs 3 (MIo) and Factor 1 vs 2 (SIo) while the bottom subplots (drinking) show Factor 2 vs 3 (MIo) and Factor 1 vs 2 (SIo).  We have clarified this in the Methods and caption of Figure 12.

      Finally, why are factor analysis results shown only for monkey R? 

      Factor analysis was performed on both animals, but the results were shown only for monkey R to limit the number of figures in the manuscript. Figure 7- figure supplement 1 shows the data for both monkeys. Here are the equivalent Figure 7 plots for monkey Y. 

      Author response image 1.

      Reviewer #2 (Recommendations for the authors): 

      Overall, the manuscript has been improved. 

      New analyses provide improved rigor (as just one example, organizing the feeding data into three-category split to better match the three-direction drinking data decoding analysis and also matching the neuron counts).

      The updated nerve block change method (using an equal number of trials with a similar left-right angle of movement in the last 100 ms of the tongue trajectory) somewhat reduces my concern that kinematic differences could account for the neural changes, but on the other hand the neural analyses use 250 ms (meaning that the neural differences could be related to behavioral differences earlier in the trial). Why not subselect to trials with similar trajectories throughout the whole movement (or at least show that as an additional analysis, albeit one with lower trial counts)?

      As the reviewer pointed out, selecting similar trajectories throughout the whole movement would result in lower trial counts that lead to poor statistical power. We think that the 100 ms prior to maximum tongue protrusion is a more important movement segment to control for similar kinematics between the control and nerve block conditions since this represents the subject’s intended movement endpoint. 

      A lot of the Results seemed like a list of measurements without sufficient hand-holding or guide-posting to explain what the take-away for the reader should be. Just one example to make concrete this broadly-applicable feedback: "Cumulative explained variance for the first three factors was higher in feeding (MIo: 82%, SIo: 81%) than in drinking (MIo: 74%, SIo: 63%) when all neurons were used for the factor analysis (Fig. 7)": why should we care about 3 factors specifically? Does this mean that in feeding, the neural dimensionality is lower (since 3 factors explain more of it)? Does that mean feeding is a "simpler" behavior (which is counter-intuitive and does not conform to the authors' comments about the higher complexity of feeding)? And from later in that paragraph: what are we to make of the differences in neural trajectory distances (aside from quantifying using a different metric the same larger changes in firing rates that could just as well be quantified as statistics across single-neuron PETHs)?

      Thank you for the feedback on the writing style. We have made some revisions to describe the takeaway for the reader. That fewer latent factors explain 80% of the variance in the feeding data means that the underlying network activity is relatively simple despite apparent complexity. When neural population trajectories are farther away from each other in state space, it means that the patterns of activity across tongue directions are more distinct and separable, thus, less likely to be confused with each other. This signifies that neural representations of 3D tongue directions are more robust. When there is better neural discrimination and more reliable information processing, it is easier for downstream brain regions to distinguish between different tongue directions.  
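
      To make the notion of trajectory separation concrete, the following is a minimal sketch of an inter-trajectory distance computed as the Euclidean distance over m latent factors, averaged across time bins. The function name, the averaging over time, and the toy trajectories are assumptions for illustration; they are not taken from the manuscript's Methods.

```python
import numpy as np


def mean_intertrajectory_distance(traj_a, traj_b):
    """Mean over time bins of the Euclidean distance between two
    population trajectories, each shaped (time_bins, m_factors)."""
    a = np.asarray(traj_a, dtype=float)
    b = np.asarray(traj_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("trajectories must share the same shape")
    # Distance across the m latent factors at each time bin, then average.
    return float(np.linalg.norm(a - b, axis=1).mean())


# Hypothetical trajectories: 5 time bins, 3 latent factors.
traj_left = np.zeros((5, 3))
traj_right = np.tile([3.0, 4.0, 0.0], (5, 1))
separation = mean_intertrajectory_distance(traj_left, traj_right)
```

      A larger value indicates that the two direction-specific trajectories occupy more distinct regions of state space, which is the sense of "more separable" used above.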

      The addition of more population-level analyses is nice as it provides a more efficient summary of the neural measurements. However, it's a surface-level dive into these methods; ultimately the goal of ensemble "computation through dynamics" analyses is to discover simpler structure / organizational principles at the ensemble level (i.e., show things not evident from single neurons), rather than just using them as a way to summarize data. For instance, here neural rotations are remarked upon in the Results, without referencing influential prior work describing such rotations and why neural circuits may use this computational motif to separate out conditions and shape muscle activity-generating readouts (Churchland et al. Nature 2012 and subsequent theoretical iterations including the Russo et al.). That said, the Russo et al tangling study was well-referenced and the present tangling results were effectively contextualized with respect to that paper in terms of the interpretation. I wish more of the results were interpreted with comparable depth. 

      Speaking of Russo et al: the authors note qualitative differences in tangling between brain areas, but do not actually quantify tangling in either. These observations would be stronger if quantified and accompanied with statistics.

      Contrary to the reviewer’s critique, we did frame these results in the context of structure/organizational principles at the ensemble level. We had already cited prior work of Churchland et al., 2012; Michaels et al., 2016; and Russo et al., 2018. In the Discussion, Differences across behaviors, we wrote: “In contrast, MIo trajectories in drinking exhibited a consistent rotational direction regardless of spout location (Fig. 7). This may reflect a predominant non-directional information such as condition-independent time-varying spiking activity during drinking (Kaufman et al., 2016; Kobak et al., 2016; Arce-McShane et al., 2023).” 

      Minor suggestions: 

      Some typos, e.g. 

      • no opening parenthesis in "We quantified directional differences in population activity by calculating the Euclidean distance over m latent factors)"

      • missing space in "independent neurons(Santhanam et al., 2009;..."); 

      • missing closing parentheses in "followed by the Posterior Inferior (Figure 3 - figure supplement 1."

      There is a one-page long paragraph in the Discussion. Please consider breaking up the text into more paragraphs each organized around one key idea to aid readability.

      Thank you, we have corrected these typos.

      Could it be that the Kaufman et al 2013 reference was intended to be Kaufman et al 2015 eNeuro (the condition-invariant signal paper)?

      Thank you, we have corrected this reference.

      At the end of the Clinical Implications subsection of the Discussion, the authors note the growing field of brain-computer interfaces with references for motor read-out or sensory write-in of hand motor/sensory cortices, respectively. Given that this study looks at orofacial cortices, an even more clinically relevant development is the more recent progress in speech BCIs (two recent reviews: https://www.nature.com/articles/s41583-024-00819-9, https://www.annualreviews.org/content/journals/10.1146/annurev-bioeng-110122012818) many of which record from human ventral motor cortex and aspirations towards FES-like approaches for orofacial movements (e.g., https://link.springer.com/article/10.1186/s12984-023-01272-y).  

      Thank you, we have included these references.

      Reviewer #3 (Recommendations for the authors): 

      Major Suggestions 

      (1) For the factor analysis of feeding vs licking, it appears that the factors were calculated separately for the two behaviors. It could be informative to calculate the factors under both conditions and project the neural data for the two behaviors into that space. The overlap/separations of the subspace could be informative. 

      We clarify that we performed a factor analysis that included both feeding and licking for MIo, as stated in the Results: “To control for factors such as different neurons and kinematics that might influence the results, we performed factor analysis on stable neurons across both tasks using all trials (Fig. 7- figure supplement 2A) and using trials with similar kinematics (Fig. 7- figure supplement 2B).” We have revised the manuscript to reflect this more clearly.

      (2) For the LSTM, the Factor analyses and the decoding it is unclear if the firing rates are mean subtracted and being normalized (the methods section was a little unclear). Typically, papers in the field either z-score the data or do a softmax.

      The firing rates were z-scored for the LSTM and KNN. For the factor analysis, the spike counts were not z-scored, but the results were normalized. We clarified this in the Methods section.
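
      For readers unfamiliar with the normalization step, a minimal sketch of z-scoring firing rates per neuron follows. Computing the mean and standard deviation on the training split only and guarding against silent neurons are assumptions of this sketch; the response does not state how the statistics were estimated. All names and the simulated data are hypothetical.

```python
import numpy as np


def zscore_with_train_stats(train, held_out, eps=1e-8):
    """Z-score each neuron (column) using statistics from the training split."""
    train = np.asarray(train, dtype=float)
    held_out = np.asarray(held_out, dtype=float)
    mu = train.mean(axis=0)
    sd = train.std(axis=0)
    # Neurons with (near-)constant rates would divide by zero; leave them centered.
    sd = np.where(sd < eps, 1.0, sd)
    return (train - mu) / sd, (held_out - mu) / sd


# Simulated spike counts: 200 training bins, 50 held-out bins, 4 neurons.
rng = np.random.default_rng(0)
train_rates = rng.poisson(5, size=(200, 4)).astype(float)
train_rates[:, 3] = 2.0  # a neuron with a constant rate
val_rates = rng.poisson(5, size=(50, 4)).astype(float)
train_z, val_z = zscore_with_train_stats(train_rates, val_rates)
```

      Reusing the training statistics on held-out data avoids leaking information from the validation set into the decoder's inputs.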

      Minor: 

      Page 1: Abstract- '... how OSMCx contributes to...' 

      Since there are no direct causal manipulations of OSMCx in this manuscript, this study doesn't directly study the OSMCx's contribution to movement - I would recommend rewording this sentence.

      Similarly, Page 2: 'OSMCx plays an important role in coordination...' the citations in this paragraph are correlative, and do not demonstrate a causal role.

      There are similar usages of 'OSMCx coordinates...' in other places e.g. Page 8. 

      Thank you, we revised these sentences.

      Page 7: the LSTM here has 400 units, which is a very large network and contains >12000 parameters. Networks of this size are prone to memorization, it would be wise to test the R-squared of the validation set against a shuffled dataset to see if the network is actually working as intended. 

      Thank you for bringing up this important point of verifying that the network is learning meaningful patterns versus memorizing. Considering the size of our training samples, the ratio of samples to parameters is appropriate and thus the risk of memorization is low. Indeed, validation tests and cross-validation performed indicated expected network behavior and the R squared values obtained here were similar to those reported in our previous paper (Laurence-Chasen et al., 2023).
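
      The shuffled-data check raised in this exchange can be sketched as follows. This is an illustration of the general idea, not the analysis that was run: a linear least-squares decoder stands in for the LSTM to keep the example small, the data are simulated, and all names are hypothetical. The decoder is refit on training targets whose pairing with the neural data has been permuted, and validation R-squared is compared with that of the decoder fit on intact data.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination pooled over all output dimensions."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot


def fit_linear(X, Y):
    """Least-squares weights with an intercept column (stand-in decoder)."""
    Xb = np.column_stack([X, np.ones(len(X))])
    W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)
    return W


def predict_linear(W, X):
    return np.column_stack([X, np.ones(len(X))]) @ W


# Simulated data: 20 "neurons" linearly encoding a 3D kinematic variable.
rng = np.random.default_rng(0)
n_train, n_val, n_neurons = 400, 100, 20
W_true = rng.normal(size=(n_neurons, 3))
X = rng.normal(size=(n_train + n_val, n_neurons))
Y = X @ W_true + 0.5 * rng.normal(size=(n_train + n_val, 3))
X_tr, X_va = X[:n_train], X[n_train:]
Y_tr, Y_va = Y[:n_train], Y[n_train:]

# Validation R^2 with intact neural-kinematic pairing.
r2_real = r_squared(Y_va, predict_linear(fit_linear(X_tr, Y_tr), X_va))

# Null distribution: permute training targets to break the pairing, then refit.
r2_shuffled = []
for _ in range(20):
    perm = rng.permutation(n_train)
    W_null = fit_linear(X_tr, Y_tr[perm])
    r2_shuffled.append(r_squared(Y_va, predict_linear(W_null, X_va)))
```

      If the real validation R-squared sits well above every value in the shuffled distribution, the decoder is exploiting a genuine relationship between neural activity and kinematics rather than memorizing the training set.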


      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In their paper, Hosack and Arce-McShane investigate how the 3D movement direction of the tongue is represented in the orofacial part of the sensory-motor cortex and how this representation changes with the loss of oral sensation. They examine the firing patterns of neurons in the orofacial parts of the primary motor cortex (MIo) and somatosensory cortex (SIo) in non-human primates (NHPs) during drinking and feeding tasks. While recording neural activity, they also tracked the kinematics of tongue movement using biplanar videoradiography of markers implanted in the tongue. Their findings indicate that most units in both MIo and SIo are directionally tuned during the drinking task. However, during the feeding task, directional tuning was more frequent in MIo units and less prominent in SIo units. Additionally, in some recording sessions, they blocked sensory feedback using bilateral nerve block injections, which resulted in fewer directionally tuned units and changes in the overall distribution of the preferred direction of the units.

      Strengths:

      The most significant strength of this paper lies in its unique combination of experimental tools. The author utilized a video-radiography method to capture 3D kinematics of the tongue movement during two behavioral tasks while simultaneously recording activity from two brain areas. Moreover, they employed a nerve-blocking procedure to halt sensory feedback. This specific dataset and experimental setup hold great potential for future research on the understudied orofacial segment of the sensory-motor area.

      Weaknesses:

      Aside from the last part of the result section, the majority of the analyses in this paper are focused on single units. I understand the need to characterize the number of single units that directly code for external variables like movement direction, especially for less-studied areas like the orofacial part of the sensory-motor cortex. However, as a field, our decade-long experience in the arm region of sensory-motor cortices suggests that many of the idiosyncratic behaviors of single units can be better understood when the neural activity is studied at the level of the state space of the population. By doing so, for the arm region, we were able to explain why units have "mixed selectivity" for external variables, why the tuning of units changes in the planning and execution phase of the movement, why activity in the planning phase does not lead to undesired muscle activity, etc. See (Gallego et al. 2017; Vyas et al. 2020; Churchland and Shenoy 2024) for a review. Therefore, I believe investigating the dynamics of the population activity in orofacial regions can similarly help the reader go beyond the peculiarities of single units and in a broader view, inform us if the same principles found in the arm region can be generalized to other segments of sensory-motor cortex.

      We thank and agree with the reviewer on the value of information gained from studying population activity. We also appreciate that population analyses have led to the understanding that individual neurons have “mixed selectivity”. We have shown previously that OSMCx neurons exhibit mixed selectivity in their population activity and clear separation between latent factors associated with gape and bite force levels (Arce-McShane FI, Sessle BJ, Ram Y, Ross CF, Hatsopoulos NG (2023) Multiple regions of primate orofacial sensorimotor cortex encode bite force and gape. Front Systems Neurosci. doi: 10.3389/fnsys.2023.1213279. PMID: 37808467 PMCID: 10556252), and chew-side and food types (Li Z & Arce-McShane FI (2023). Cortical representation of mastication in the primate orofacial sensorimotor cortex. Program No. NANO06.05. 2023 Neuroscience Meeting Planner. Washington, D.C.: Society for Neuroscience, 2023. Online.). 

      The primary goal of this paper was to characterize single units in the orofacial region and to do a follow-up paper on population activity. In the revised manuscript, we have now incorporated the results of population-level analyses. The combined results of the single unit and population analyses provide a deeper understanding of the cortical representation of 3D direction of tongue movements during natural feeding and drinking behaviors. 

      Further, for the nerve-blocking experiments, the authors demonstrate that the lack of sensory feedback severely alters how the movement is executed at the level of behavior and neural activity. However, I had a hard time interpreting these results since any change in neural activity after blocking the orofacial nerves could be due to either the lack of the sensory signal or, as the authors suggest, due to the NHPs executing a different movement to compensate for the lack of sensory information or the combination of both of these factors. Hence, it would be helpful to know if the authors have any hint in the data that can tease apart these factors. For example, analyzing a subset of nerve-blocked trials that have similar kinematics to the control.

      Thank you for bringing up this important point. We agree with the reviewer that any change in the neural activity may be attributed to a lack of sensory signal, to compensatory changes, or to a combination of these factors. To tease apart these factors, we sampled an equal number of trials with similar kinematics for both control and nerve block feeding sessions. We added a clarifying description of this approach in the Results section of the revised manuscript: “To confirm this effect was not merely due to altered kinematics, we conducted parallel analyses using carefully subsampled trials with matched kinematic profiles from both control and nerve-blocked conditions.”

      Furthermore, we ran an additional analysis for the drinking datasets by subsampling a similar distribution of drinking movements from each condition. We compared the neural data, and the directional tuning in particular, across an equal number of trials with a similar left-right angle of movement in the last 100 ms of the tongue trajectory, nearest the spout. These analyses, which control for similar kinematics, showed that there was still a decrease in the proportion of directionally modulated neurons with nerve block compared to the control. This supports the interpretation that the results may be attributed to the lack of tactile information. These analyses are now integrated in the revised paper under Methods section: Directional tuning of single neurons, as well as Results section: Effects of nerve block: Decreased directional tuning of MIo and SIo neurons and Figure 10 – figure supplement 1.
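
      As an illustration only, the kinematics-matched subsampling described above could be sketched as follows. This is not the authors' code: the function name, the 10-degree bin width, and the use of a single left-right angle per trial are assumptions made for the example. The idea is to bin trials by angle and draw the same number of trials per bin from each condition, so that both subsamples share the same binned angle distribution.

```python
import random
from collections import defaultdict


def match_trials_by_angle(control_angles, block_angles, bin_width=10.0, seed=0):
    """Return (control_idx, block_idx): equal-sized lists of trial indices
    whose left-right angles have the same binned distribution.

    bin_width (degrees) and the per-trial angle summary are assumptions
    for this sketch, not values taken from the manuscript.
    """
    rng = random.Random(seed)

    def bin_trials(angles):
        bins = defaultdict(list)
        for i, angle in enumerate(angles):
            bins[int(angle // bin_width)].append(i)  # floor-based bin label
        return bins

    control_bins = bin_trials(control_angles)
    block_bins = bin_trials(block_angles)

    control_idx, block_idx = [], []
    # Only bins populated in both conditions can be matched.
    for b in sorted(set(control_bins) & set(block_bins)):
        n = min(len(control_bins[b]), len(block_bins[b]))
        control_idx += rng.sample(control_bins[b], n)
        block_idx += rng.sample(block_bins[b], n)
    return sorted(control_idx), sorted(block_idx)
```

      Any tuning analysis would then be run on the returned trial indices only, so that a difference between conditions cannot be driven by a difference in the sampled range of movements.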

      Reviewer #2 (Public review):

      Summary:

      This manuscript by Hosack and Arce-McShane examines the directional tuning of neurons in macaque primary motor (MIo) and somatosensory (SIo) cortex. The neural basis of tongue control is far less studied than, for example, forelimb movements, partly because the tongue's kinematics and kinetics are difficult to measure. A major technical advantage of this study is using biplanar video-radiography, processed with modern motion tracking analysis software, to track the movement of the tongue inside the oral cavity. Compared to prior work, the behaviors are more naturalistic (feeding and licking water from one of three spouts), although the animals were still head-fixed.

      The study's main findings are that:

      • A majority of neurons in MIo and a (somewhat smaller) percentage of SIo modulated their firing rates during tongue movements, with different modulations depending on the direction of movement (i.e., exhibited directional tuning). Examining the statistics of tuning across neurons, there was anisotropy (e.g., more neurons preferring anterior movement) and a lateral bias in which tongue direction neurons preferred that was consistent with the innervation patterns of tongue control muscles (although with some inconsistency between monkeys).

      • Consistent with this encoding, tongue position could be decoded with moderate accuracy even from small ensembles of ~28 neurons.

      • There were differences observed in the proportion and extent of directional tuning between the feeding and licking behaviors, with stronger tuning overall during licking. This potentially suggests behavioral context-dependent encoding.

      • The authors then went one step further and used a bilateral nerve block to the sensory inputs (trigeminal nerve) from the tongue. This impaired the precision of tongue movements and resulted in an apparent reduction and change in neural tuning in MIo and SIo.

      Strengths:

      The data are difficult to obtain and appear to have been rigorously measured, and provide a valuable contribution to this under-explored subfield of sensorimotor neuroscience. The analyses adopt well-established methods, especially from the arm motor control literature, and represent a natural starting point for characterizing tongue 3D direction tuning.

      Weaknesses:

      There are alternative explanations for some of the interpretations, but those interpretations are described in a way that clearly distinguishes results from interpretations, and readers can make their own assessments. Some of these limitations are described in more detail below.

      One weakness of the current study is that there is substantial variability in results between monkeys, and that only one session of data per monkey/condition is analyzed (8 sessions total). This raises the concern that the results could be idiosyncratic. The Methods mention that other datasets were collected, but not analyzed because the imaging pre-processing is very labor-intensive. While I recognize that time is precious, I do think in this case the manuscript would be substantially strengthened by showing that the results are similar on other sessions.

      We acknowledge the reviewer’s concern about inter-subject variability. Animal feeding and drinking behaviors are quite stable across sessions; thus, we do not think that additional sessions will address the concern that the results could be idiosyncratic. Each of the eight datasets analyzed here has sufficient neural and kinematic data to capture neural and behavioral patterns. Nevertheless, we performed some of the analyses on a second feeding dataset from Monkey R. The results from analyses on a subset of this data were consistent across datasets; for example, (1) similar proportions of directionally tuned neurons, (2) similar distances between population trajectories (t-test p > 0.9), and (3) a consistently smaller distance between Anterior-Posterior pairs than others in MIo (t-test p < 0.05) but not SIo (p > 0.1).

      This study focuses on describing directional tuning using the preferred direction (PD) / cosine tuning model popularized by Georgopoulos and colleagues for understanding neural control of arm reaching in the 1980s. This is a reasonable starting point and a decent first-order description of neural tuning. However, the arm motor control field has moved far past that viewpoint, and in some ways, an over-fixation on static representational encoding models and PDs held that field back for many years. The manuscript would benefit from drawing the readers' attention (perhaps in the Discussion) to the point that PDs are a very simple starting point for characterizing how cortical activity relates to kinematics, but that there is likely much richer population-level dynamical structure and that a more mechanistic, control-focused analytical framework may be fruitful. A good review of this evolution in the arm field can be found in Vyas S, Golub MD, Sussillo D, Shenoy K. 2020. Computation Through Neural Population Dynamics. Annual Review of Neuroscience. 43(1):249-75.

      Thank you for highlighting this important point. Research on orofacial movements hasn't progressed at the same pace as limb movement studies. Our manuscript focused specifically on characterizing the 3D directional tuning properties of individual neurons in the orofacial area—an analysis that has not been conducted previously for orofacial sensorimotor control. While we initially prioritized this individual neuron analysis, we recognize the value of broader population-level insights.

      Based on your helpful feedback, we have incorporated additional population analyses to provide a more comprehensive picture of orofacial sensorimotor control and expanded our discussion section. We appreciate your expertise in pushing our work to be more thorough and aligned with current neuroscience approaches.

      Can the authors explain (or at least speculate) why there was such a large difference in behavioral effect due to nerve block between the two monkeys (Figure 7)?

      We acknowledge this as a variable inherent to this type of experimentation. Previous studies have found large kinematic variation in the effect of oral nerve block as well as in the following compensatory strategies between subjects. Each animal’s biology and response to perturbation vary naturally. Indeed, our subjects exhibited different feeding behavior even in the absence of nerve block perturbation (see Figure 2 in Laurence-Chasen et al., 2022). This is why each individual serves as its own control.

      Do the analyses showing a decrease in tuning after nerve block take into account the changes (and sometimes reduction in variability) of the kinematics between these conditions? In other words, if you subsampled trials to have similar distributions of kinematics between Control and Block conditions, does the effect hold true? The extreme scenario to illustrate my concern is that if Block conditions resulted in all identical movements (which of course they don't), the tuning analysis would find no tuned neurons. The lack of change in decoding accuracy is another yellow flag that there may be a methodological explanation for the decreased tuning result.

      Thank you for bringing up this point. We accounted for the changes in the variability of the kinematics between the control and nerve block conditions in the feeding dataset where we sampled an equal number of trials with similar kinematics for both control and nerve block. However, we did not control for similar kinematics in the drinking task. In the revised manuscript, we have clarified this and performed a similar analysis for the drinking task. We sampled a similar distribution of drinking movements from each condition. We compared the neural data from an equal number of trials with a similar left-right angle of movement in the last 100 ms of the tongue trajectory, nearest the spout. There was a decrease in the percentage of neurons that were directionally modulated (between 30 and 80%) with nerve block compared to the control. These results have been included in the revised paper under Methods section: Directional tuning of single neurons, as well as Results section: Effects of nerve block: Decreased directionality of MIo and SIo neurons.

      While the results from decoding using KNN did not show significant differences between decoding accuracies in control vs. nerve block conditions, the results from the additional factor analysis and decoding using LSTM were consistent with the decrease in directional tuning at the level of individual neurons.  

      The manuscript states that "Our results suggest that the somatosensory cortex may be less involved than the motor areas during feeding, possibly because it is a more ingrained and stereotyped behavior as opposed to tongue protrusion or drinking tasks". Could an alternative explanation be more statistical/technical in nature: that during feeding, there will be more variability in exactly what somatosensory afferent signals are being received from trial to trial (because slight differences in kinematics can have large differences in exactly where the tongue is and the where/when/how of what parts of it are touching other parts of the oral cavity)? This variability could "smear out" the apparent tuning using these types of trial-averaged analyses. Given how important proprioception and somatosensation are for not biting the tongue or choking, the speculation that somatosensory cortical activity is suppressed during feedback is very counter-intuitive to this reviewer.

      Thank you for bringing up this point. We have now incorporated this in our revised Discussion (see Comparison between MIo and SIo). We agree with the reviewer that trial-by-trial variability in the afferent signals may account for the lower directional signal in SIo during feeding than in drinking. Indeed, SIo’s mean-matched Fano factor in feeding was significantly higher than that in drinking (Author response image 1). Moreover, the results of the additional population and decoding analyses also support this.

      Author response image 1.

      Comparison of mean-matched Fano Factor between SIo neurons during feeding and drinking control tasks across both subjects (Wilcoxon rank sum test, p < 0.001).
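
      For readers unfamiliar with the measure, the Fano factor is the variance of the spike count across trials divided by its mean. The sketch below computes the plain Fano factor per movement direction; it does not implement the mean-matching step mentioned in the response, and the function names and direction labels are hypothetical.

```python
from statistics import mean, variance


def fano_factor(spike_counts):
    """Variance-to-mean ratio of spike counts across trials.

    Uses the sample variance (n - 1 denominator), so at least two
    trials are required.
    """
    m = mean(spike_counts)
    if m == 0:
        raise ValueError("Fano factor is undefined when the mean count is zero")
    return variance(spike_counts) / m


def fano_by_direction(counts_by_direction):
    """Map each direction label to the Fano factor of its trials."""
    return {d: fano_factor(c) for d, c in counts_by_direction.items()}
```

      A value near 1 is what a Poisson process would give; larger values indicate more trial-to-trial variability than Poisson spiking predicts.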

      Reviewer #3 (Public review):

      Summary:

      In this study, the authors aim to uncover how 3D tongue direction is represented in the Motor (M1o) and Somatosensory (S1o) cortex. In non-human primates implanted with chronic electrode arrays, they use X-ray-based imaging to track the kinematics of the tongue and jaw as the animal is either chewing food or licking from a spout. They then correlate the tongue kinematics with the recorded neural activity. Using linear regressions, they characterize the tuning properties and distributions of the recorded population during feeding and licking. Then, they recharacterize the tuning properties after bilateral lidocaine injections in the two sensory branches of the trigeminal nerve. They report that their nerve block causes a reorganization of the tuning properties. Overall, this paper concludes that M1o and S1o both contain representations of the tongue direction, but their numbers, their tuning properties, and susceptibility to perturbed sensory input are different.

      Strengths:

      The major strengths of this paper are in the state-of-the-art experimental methods employed to collect the electrophysiological and kinematic data.

      Weaknesses:

      However, this paper has a number of weaknesses in the analysis of this data.

      It is unclear how reliable the neural responses are to the stimuli. The trial-by-trial variability of the neural firing rates is not reported. Thus, it is unclear if the methods used for establishing that a neuron is modulated and tuned to a direction are susceptible to spurious correlations. The authors do not use shuffling or bootstrapping tests to determine the robustness of their fits or determining the 'preferred direction' of the neurons. This weakness colors the rest of the paper.

      Thank you for raising these points. We have performed the following additional analyses: (1) We have added analyses to ensure that the results could not be explained by neural variability. To show the trial-by-trial variability of the neural firing rates, we have calculated the Fano factor (mean overall = 1.34747; control = 1.46471; nerve block = 1.23023). The distribution was similar across directions, suggesting that responses of MIo and SIo neurons to varying 3D directions were reliable. (2) We have used a bootstrap procedure to ensure that directional tuning cannot be explained by mere chance. (3) To test the robustness of our PDs we also performed a bootstrap test, which yielded the same results for >90% of neurons, and a multiple linear regression test for fit to a cosine-tuning function. In the revised manuscript, the Methods and Results sections have been updated to include these analyses.  
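
      To make the idea of a bootstrap test of PD robustness concrete, a minimal sketch follows. It is an assumption-laden illustration, not the procedure used in the manuscript: the PD is estimated here by a rate-weighted vector sum of movement directions rather than by the multiple linear regression mentioned above, and all function names are hypothetical. Trials are resampled with replacement, the PD is recomputed each time, and the spread of angles between the resampled PDs and the full-data PD is reported.

```python
import math
import random


def preferred_direction(directions, rates):
    """Unit vector: sum of 3D movement directions weighted by mean-subtracted rate."""
    mean_rate = sum(rates) / len(rates)
    pd = [0.0, 0.0, 0.0]
    for d, r in zip(directions, rates):
        for k in range(3):
            pd[k] += (r - mean_rate) * d[k]
    norm = math.sqrt(sum(c * c for c in pd))
    if norm == 0:
        raise ValueError("no directional modulation; PD is undefined")
    return [c / norm for c in pd]


def angle_deg(u, v):
    """Angle in degrees between two unit vectors."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    return math.degrees(math.acos(dot))


def bootstrap_pd_spread(directions, rates, n_boot=1000, seed=0):
    """Return (full-data PD, 95th percentile of bootstrap angular deviation)."""
    rng = random.Random(seed)
    pd_full = preferred_direction(directions, rates)
    n = len(rates)
    angles = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample trials with replacement
        try:
            pd_b = preferred_direction([directions[i] for i in idx],
                                       [rates[i] for i in idx])
        except ValueError:
            continue  # skip degenerate resamples with no modulation
        angles.append(angle_deg(pd_full, pd_b))
    angles.sort()
    return pd_full, angles[int(0.95 * (len(angles) - 1))]
```

      A small angular spread indicates that the PD estimate does not depend on a few particular trials; what counts as "small" is a judgment call that the sketch does not settle.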

      Author response image 2.

      Comparison of Fano Factor across directions for MIo and SIo Feeding Control (Kruskal-Wallis, p > 0.7).

      The authors compare the tuning properties during feeding to those during licking but only focus on the tongue-tip. However, the two behaviors are different also in their engagement of the jaw muscles. Thus many of the differences observed between the two 'tasks' might have very little to do with an alteration in the properties of the neural code - and more to do with the differences in the movements involved.

      Using the tongue tip for the kinematic analysis of tongue directional movements was a deliberate choice as the anterior region of the tongue is highly mobile and sensitive due to a higher density of mechanoreceptors. The tongue tip is the first region that touches the spout in the drinking task and moves the food into the oral cavity for chewing and subsequent swallowing. 

      We agree with the reviewer that the jaw muscles are engaged differently in feeding vs. drinking (see Fig. 2). For example, a wider variety of jaw movements along the three axes are observed in feeding compared to the smaller amplitude and mostly vertical jaw movements in drinking. Also, the tongue movements are very different between the two behaviors. In feeding, the tongue moves in varied directions to position the food between left-right tooth rows during chewing, whereas in the drinking task, the tongue moves to discrete locations to receive the juice reward. Moreover, the tongue-jaw coordination differs between tasks; maximum tongue protrusion coincides with maximum gape in drinking but with minimum gape in the feeding behavior. Thus, the different tongue and jaw movements required in each behavior may account for some of the differences observed in the directional tuning properties of individual neurons and population activity. These points have been included in the revised Discussion.

      Author response image 3.

      Tongue tip position (mm) and jaw pitch (degrees) during feeding (left) and drinking (right) behaviors. Most protruded tongue position coincides with minimum gape (jaw pitch at 0°) during feeding but with maximum gape during drinking.

      Many of the neurons are likely correlated with both jaw movements and tongue movements - this complicates the interpretations and raises the possibility that the differences in tuning properties across tasks are trivial.

      We thank the reviewer for raising this important point. In fact, we verified in a previous study whether the correlation between the tongue and jaw kinematics might explain differences in the encoding of tongue kinematics and shape in MIo (see Supplementary Fig. 4 in Laurence-Chasen et al., 2023): “Through iterative sampling of sub-regions of the test trials, we found that correlation of tongue kinematic variables with mandibular motion does not account for decoding accuracy. Even at times where tongue motion was completely un-correlated with the jaw, decoding accuracy could be quite high.” 

      The results obtained from population analyses showing distinct properties of population trajectories in feeding vs. drinking behaviors provide strong support to the interpretation that directional information varies between these behaviors.

      The population analyses for decoding are rudimentary and provide very coarse estimates (left, center, or right); it is also unclear what the major takeaways from the population decoding analyses are. The reduced classification accuracy could very well be a consequence of linear models being unable to account for the complexity of feeding movements, while the licking movements are 'simpler' and thus are better accounted for.

      We thank the reviewer for raising this point. The population decoding analyses provide additional insight on the directional information in population activity, as well as a point of comparison with the results of numerous decoding studies on the arm region of the sensorimotor cortex. In the revised version, we have included the results from decoding tongue direction using a long short-term memory (LSTM) network for sequence-to-sequence decoding. These results differed from the KNN results, indicating that a linear model such as KNN was better for drinking and that a non-linear and continuous decoder was better suited for feeding. These results have been included in the revised manuscript.

      The nature of the nerve block and what sensory pathways are being affected is unclear - the trigeminal nerve contains many different sensory afferents. Is there a characterization of how effectively the nerve impulses are being blocked? Have the authors confirmed or characterized the strength of their inactivation or block? I was unable to find any electrophysiological evidence characterizing the perturbation.

      The strength of the nerve block is characterized by a decrease in the baseline firing rate of SIo neurons, as shown in Supplementary Figure 6 of “Loss of oral sensation impairs feeding performance and consistency of tongue–jaw coordination” (Laurence-Chasen et al., 2022).

      Overall, while this paper provides a descriptive account of the observed neural correlations and their alteration by perturbation, a synthesis of the observed changes and some insight into neural processing of tongue kinematics would strengthen this paper.

      We thank the reviewer for this suggestion. We have revised the Discussion to provide a synthesis of the results and insights into the neural processing of tongue kinematics.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) The procedure for anesthesia explained in the method section was not clear to me. The following information was missing: what drug/dose was used? How long the animal was under anesthesia? How long after the recovery the experiments were done?

      The animals were fully sedated with ketamine (100 mg/ml, 10 mg/kg) for less than 30 minutes, and all of the data was collected within 90 minutes after the nerve block was administered.

      (2) In Figure 10, panels A and B are very close together, it was not at first clear whether the text "Monkey R, Monkey Y" belongs to panel A or B.

      We have separated the two panels further in the revised figure.

      (3) I found Figure 11 very busy and hard to interpret. Separating monkeys, fitting the line for each condition, or using a bar plot can help with the readability of the figure.

      Thank you for the suggestion. We agree with you and have reworked this figure. To simplify it we have shown the mean accuracy across iterations.

      (4) I found the laterality discussions like "This signifies that there are more neurons in the left hemisphere contributes toward one direction of tongue movement, suggesting that there is some laterality in the PDs of OSMCx neurons that varies between individuals" to be a bit of an over-interpretation of the data, given the low n value and the dissimilarity in how strongly the nerve blocking altered the monkeys' behavior.

      Thank you for sharing this viewpoint. We do think that laterality is a good point of comparison with studies on M1 neurons in the arm/hand region. In our study, we found that the peak of the PD distribution coincides with leftward tongue movements in feeding. The distribution of PDs provides insight into how tongue muscles are coordinated during movement. Intrinsic and extrinsic tongue muscles are involved in shaping the tongue (e.g., elongation, broadening) and positioning the tongue (e.g., protrusion/retraction, elevation/depression), respectively. These muscles receive bilateral motor innervation except for genioglossus. Straight tongue protrusion requires the balanced action of the right and left genioglossi while the lateral protrusion involves primarily the contralateral genioglossus. Given this unilateral innervation pattern, we hypothesized that left MIo/SIo neurons would preferentially respond to leftward tongue movements, corresponding to right genioglossus activation. 

      Reviewer #2 (Recommendations for the authors):

      Is the observation that tuning peaks occur most frequently toward the anterior and superior directions consistent with the statistics of the movements the tongue typically makes? This could be analogous to anisotropies previously reported in the arm literature, e.g., Lillicrap TP, Scott SH. 2013. Preference Distributions of Primary Motor Cortex Neurons Reflect Control Solutions Optimized for Limb Biomechanics. Neuron. 77(1):168-79.

      Thank you for bringing our attention to analogous findings by Lillicrap & Scott, 2013. Indeed, we do observe the highest number of movements in the Anterior Superior directions, followed by the Posterior Inferior. This does align with the distribution of tuning peaks that we observed. Author response image 4 shows the proportions of observed movements in each group of directions across all feeding datasets. We have incorporated this data in the Results section: Neuronal modulation patterns differ between MIo and SIo, as well as added this point in the Discussion.

      Author response image 4.

      Proportion of feeding trials in each group of directions. Error bars represent ±1 standard deviation across datasets (n = 4).

      "The Euclidean distance was used to identify nearest neighbors, and the number of nearest neighbors used was K = 7. This K value was determined after testing different Ks which yielded comparable results." In general, it's a decoding best practice to tune hyperparameters (like K) on fully held-out data from the data used for evaluation. Otherwise, this tends to slightly inflate performance because one picks the hyperparameter that happened to give the best result. It sounds like that held-out validation set wasn't used here. I don't think that's going to change the results much at all (especially given the "comparable results" comment), but providing this suggestion for the future. If the authors replicate results on other datasets, I suggest they keep K = 7 to lock in the method.

      K = 7 was chosen based on the size of our smallest training dataset (n = 55). The purpose of testing different K values was not to select which value gave the best result, but to demonstrate that similar K values did not affect the results significantly. We tested the different K values on a subset of the feeding data, but that data was not fully held-out from the training set. We will keep your suggestion in mind for future analysis.
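
      The reviewer's point about held-out hyperparameter selection can be illustrated with a small sketch. It assumes a three-way split (training, validation, test) and a plain majority-vote KNN classifier with Euclidean distance; the function names and candidate K values are hypothetical, and this is not the decoder code used in the study. The key property is that K is chosen using the validation split only, so the test split stays untouched until the final evaluation.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=7):
    """Majority label among the k training points nearest to `query`."""
    neighbors = sorted(
        ((math.dist(x, query), label) for x, label in zip(train_x, train_y)),
        key=lambda pair: pair[0],  # sort by Euclidean distance only
    )
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]


def accuracy(train_x, train_y, eval_x, eval_y, k):
    """Fraction of evaluation points whose predicted label is correct."""
    hits = sum(knn_predict(train_x, train_y, q, k) == y
               for q, y in zip(eval_x, eval_y))
    return hits / len(eval_y)


def choose_k(train_x, train_y, val_x, val_y, candidates=(3, 5, 7, 9)):
    """Pick k on the validation split; the test split is never consulted here."""
    return max(candidates,
               key=lambda k: accuracy(train_x, train_y, val_x, val_y, k))
```

      Once K is fixed this way (or fixed in advance, as the reviewer suggests for K = 7), accuracy on the test split can be reported without the selection step inflating it.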

      The smoothing applied to Figure 2 PSTHs appears perhaps excessive (i.e., it may be obscuring interesting finer-grained details of these fast movements). Can the authors reduce the 50 ms Gaussian smoothing (I assume this is the s.d.)? A value of ~25 ms is often used in studying arm kinematics. It also looks like the movement-related modulation may not be finished in these 200 ms / 500 ms windows. I suggest extending the shown time window. It would also be helpful to show some trial-averaged behavior (e.g. speed or % displacement from start) under or behind the PSTHs, to give a sense of what phase of the movement the neural activity corresponds to.

      Thank you for the suggestion. We have taken your suggestions into consideration and modified Figure 2 accordingly. We decreased the Gaussian kernel to 25 ms and extended the time window shown. The trial-averaged anterior/posterior displacement was also added to the drinking PSTHs.
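
      The effect of the kernel width can be seen in a short sketch of Gaussian smoothing applied to a binned firing-rate trace. The 1 ms bin size, the truncation of the kernel at three standard deviations, and the edge renormalization are all assumptions made for the example, and the function names are hypothetical.

```python
import math


def gaussian_kernel(sd_ms, bin_ms=1.0, n_sd=3):
    """Discrete Gaussian weights truncated at n_sd standard deviations, summing to 1."""
    half = int(round(n_sd * sd_ms / bin_ms))
    weights = [math.exp(-0.5 * (i * bin_ms / sd_ms) ** 2)
               for i in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_psth(rates, sd_ms=25.0, bin_ms=1.0):
    """Convolve binned rates with a Gaussian, renormalizing near the edges."""
    kernel = gaussian_kernel(sd_ms, bin_ms)
    half = len(kernel) // 2
    smoothed = []
    for t in range(len(rates)):
        acc = 0.0
        weight_sum = 0.0
        for j, w in enumerate(kernel):
            s = t + j - half
            if 0 <= s < len(rates):  # ignore kernel taps that fall outside the trace
                acc += w * rates[s]
                weight_sum += w
        smoothed.append(acc / weight_sum)
    return smoothed
```

      Halving the standard deviation roughly doubles the height that a brief transient retains after smoothing, which is why a narrower kernel reveals finer temporal structure at the cost of a noisier trace.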

      Reviewer #3 (Recommendations for the authors):

      The major consideration here is that the data reported for feeding appears to be very similar to that reported in a previous study:

      "Robust cortical encoding of 3D tongue shape during feeding in macaques"

      Are the neurons reported here the same as the ones used in this previous paper? It is deeply concerning that this is not reported anywhere in the methods section.

      These are the same neurons as in our previous paper, though here we include several additional datasets of the nerve block and drinking sessions. We have now included this in the methods section.

      Second, I strongly recommend that the authors consider a thorough rewrite of this manuscript and improve the presentation of the figures. As written, it was not easy to follow the paper, the logic of the experiments, or the specific data being presented in the figures.

      Thank you for this suggestion. We have done an extensive rewrite of the manuscript and revision of the figures.

      A few recommendations:

      (1) Please structure your results sections and use descriptive topic sentences to focus the reader. In the current version, it is unclear what the major point being conveyed for each analysis is.

      Thank you for this suggestion. We have added topic sentences to begin each section of the results.

      (2) Please show raster plots for at least a few example neurons so that the readers have a sense of what the neural responses look like across trials. Is all of Figure 2 one example neuron or are they different neurons? Error bars for PETH would be useful to show the reliability and robustness of the tuning.

      Figure 2 shows different neurons, one from MIo and one from SIo for each task. There is shading showing ±1 standard error around the line for each direction; however, this was a bit difficult to see. In addition to the other changes we have made to these figures, we made the lines thinner and darkened the error bar shading to accentuate this. We also added raster plots corresponding to the same neurons represented in Figure 2 as a supplement.

      (3) Since there are only two data points, I am not sure I understand why the authors have bar graphs and error bars for graphs such as Figure 3B, Figure 5B, etc. How can one have an error bar and means with just 2 data points?

      Those bars represent the standard error of the proportion. We have changed the y-axis label on these figures to make this clearer.
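
      A standard error of a proportion can be computed from a single count, which is how an error bar can exist without multiple data points. The sketch below assumes the usual binomial formula, sqrt(p(1 - p)/n); the response does not state which formula was used, and the function name and counts are hypothetical.

```python
import math


def proportion_se(n_tuned, n_total):
    """Return (p, SE) for a proportion, with SE = sqrt(p * (1 - p) / n)."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    p = n_tuned / n_total
    return p, math.sqrt(p * (1.0 - p) / n_total)


# Hypothetical example: 50 directionally tuned neurons out of 100 recorded.
p, se = proportion_se(50, 100)
```

      The error bar therefore reflects uncertainty from the finite number of neurons within one dataset, not variability across animals.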

      (4) Results in Figure 6 could be due to differential placement of the electrodes across the animals. How is this being accounted for?

      Yes, this is a possibility which we have mentioned in the discussion. Even with careful placement there is no guarantee to capture a set of neurons with the exact same function in two subjects, as every individual is different. Rather we focus on analyses of data within the same animal. The purpose of Figure 6 is to show the difference between MIo and SIo, and between the two tasks, within the same subject. The more salient result from calculating the preferred direction is that there is a change in the distribution between control and nerve block within the same exact population. Discussions relating to the comparison between individuals are speculative and cannot be confirmed without the inclusion of many more subjects.

      (5) For Figure 7, I would recommend showing the results of the Sham injection in the same figure instead of a supplement.

      Thank you for the suggestion, we have added these results to the figure.

      (6) I think the effects of the sensory block on the tongue kinematics are underexplored in Figure 7 and Figure 8. The authors could explore the deficits in tongue shape, and the temporal components of the trajectory.

      Some of these effects on feeding have been explored in a previous paper, Laurence-Chasen et al., 2022. We performed some additional analyses on changes to kinematics during drinking, including the number of licks per 10-second trial and the length of individual licks. The results of these are included below. We also calculated the difference in the speed of tongue movement during drinking, which generally decreased and exhibited an increase in variance with nerve block (f-test, p < 0.001). However, we have not included these figures in the main paper as they do not inform us about directionality.

      Author response image 5.

      Left halves of hemi-violins (black) are control and right halves (red) are nerve block for an individual. Horizontal black lines represent the mean and horizontal red lines the median. Results of two-tailed t-test and f-test are indicated by asterisks and crosses, respectively: *,† p < 0.05; **,†† p < 0.01; ***,††† p < 0.001.

      (9) In Figures 9 and 10, are the same neurons being recorded before and after the nerve block? It is unclear if the overall "population" properties are different, or if the properties of individual neurons are changing due to the nerve block.

      Yes, the same neurons are being recorded before and after nerve block. Specifically, Figure 9B shows that the properties of many individual neurons do change due to the nerve block. Differences in the overall population response may be attributed to some of the units having reduced/no activity during the nerve block session.

      Additionally, I recommend that the authors improve their introduction and provide more context to their discussion. Please elaborate on what you think are the main conceptual advances in your study, and place them in the context of the existing literature. By my count, there are 26 citations in this paper, 4 of which are self-citations - clearly, this can be improved upon.

      Thank you for this suggestion. We have extensively rewritten the Introduction and Discussion. We now discuss the main conceptual advances of our study and place them in the context of the existing literature.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This study uses state-of-the-art methods to label endogenous dopamine receptors in a subset of Drosophila mushroom body neuronal types. The authors report that DopR1 and Dop2R receptors, which have opposing effects on intracellular cAMP, are present in axon termini of Kenyon cells, as well as those of two classes of dopaminergic neurons that innervate the mushroom body, indicative of autocrine modulation by dopaminergic neurons. Additional experiments showing opposing effects of starvation on DopR1 and DopR2 levels in mushroom body neurons are consistent with a role for dopamine receptor levels increasing the efficiency of learned food-odour associations in starved flies. Supported by solid data, this is a valuable contribution to the field.

      We thank the editors for the assessment, but request that “DopR2” be changed to “Dop2R”. The dopamine receptors in Drosophila have confusing names, but the receptors we characterized in this study are called Dop1R1 (according to FlyBase; aka DopR1, dDA1, Dumb) and Dop2R (ibid; aka Dd2R). DopR2 is the name of a different dopamine receptor.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      This is an important and interesting study that uses the split-GFP approach. Localization of receptors and correlating them to function is important in understanding the circuit basis of behavior.

      Strengths:

      The split-GFP approach allows visualization of subcellular enrichment of dopamine receptors in the plasma membrane of GAL4-expressing neurons allowing for a high level of specificity.

      The authors resolve the presynaptic localization of DopR1 and Dop2R, in "giant" Drosophila neurons differentiated from cytokinesis-arrested neuroblasts in culture as it is not clear in the lobes and calyx.

      Starvation-induced opposite responses of dopamine receptor expression in the PPL1 and PAM DANs provide key insights into models of appetitive learning.

      Starvation-induced increase in D2R allows for increased negative feedback that the authors test in D2R knockout flies where appetitive memory is diminished.

      This dual autoreceptor system is an attractive model for how amplitude and kinetics of dopamine release can be fine-tuned and controlled depending on the cellular function and this paper presents a good methodology to do it and a good system where the dynamics of dopamine release can be tested at the level of behavior.

      Weaknesses:

      LI measurements of Kenyon cells and lobes indicate that Dop2R was approximately twice as enriched in the lobe as the average density across the whole neuron, while the lobe enrichment of Dop1R1 was about 1.5 times the average. Are these levels consistent across different times of the day and states of the animal? How were these conditions controlled, and how sensitive is receptor expression to the time of day of dissection, staining, etc.?

      To answer this question, we repeated the experiment in two replicates at different times of day and confirmed that the receptor localization was consistent (Figure 3 – figure supplement 1); LI measurements showed that Dop2R is enriched more in the lobe and less in the calyx compared to Dop1R1 (Figure 3D). The states of animals that could affect LI (e.g. feeding state and anesthesia for sorting, see methods) were kept constant. 
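
      To make the enrichment comparison above concrete, a localization index of this kind could be sketched as below. This is a minimal sketch under a stated assumption: LI is taken here to be the receptor-to-membrane signal ratio within one compartment, divided by the same ratio over the whole neuron, so that LI = 2 means the receptor is twice as enriched there as the neuron-wide average. That definition, the function name `localization_index`, and the intensity values are illustrative assumptions, not the pipeline or data of the study.

```python
def localization_index(receptor, membrane, region):
    """Relative enrichment of a receptor signal in one compartment.

    receptor, membrane: dicts mapping compartment name -> summed
    fluorescence intensity of the receptor reporter and of a membrane
    marker (hypothetical inputs).
    Returns (receptor/membrane in region) / (receptor/membrane overall).
    """
    local_ratio = receptor[region] / membrane[region]
    global_ratio = sum(receptor.values()) / sum(membrane.values())
    return local_ratio / global_ratio


# Made-up intensities, chosen only to illustrate the arithmetic.
receptor = {"lobe": 60.0, "calyx": 10.0, "soma": 30.0}
membrane = {"lobe": 30.0, "calyx": 30.0, "soma": 40.0}

lobe_li = localization_index(receptor, membrane, "lobe")
calyx_li = localization_index(receptor, membrane, "calyx")
```

      Normalizing to a membrane marker within the same cells is what makes such an index comparable across replicates, since differences in driver expression level cancel out of the ratio.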

      The authors assume without discussion as to why and how presynaptic enrichment of these receptors is similar in giant neurons and MB.

      In the revision, we added a short summary to recapitulate that the giant neurons exhibit many characteristics of mature neurons (Lines #152-156): "Importantly, these giant neurons exhibit characteristics of mature neurons, including firing patterns (Wu et al., 1990; Yao & Wu, 2001; Zhao & Wu, 1997) and acetylcholine release (Yao et al., 2000), both of which are regulated by cAMP and CaMKII signaling (Yao et al., 2000; Yao & Wu, 2001; Zhao & Wu, 1997)." In addition, we found punctate Brp accumulations localized to the axon terminals of the giant neurons (former Figure 4D and 4E). Therefore, the giant neuron serves as an excellent model to study the presynaptic localization of dopamine receptors in isolated large cells.

      Figures 1-3 show the extensive expression of receptors in alpha and beta lobes, while Figure 5 focuses on PAM and localization in γ and β' projections of PAM, leading to the conclusion that presynaptic dopamine neurons express these and have feedback regulation. Consistency between lobes or discussion of these differences is important to consider.

      In the revised manuscript, we show data in the γ KCs (Figure 4C, Figure 5 - figure supplement 1) in addition to α/β KCs, and demonstrate the consistent synaptic localization of Dop1R1 and Dop2R as in α/β KCs (Figure 4B and 5A). 

      Receptor expression in any learning-related MBONs is not discussed, and it would be intriguing to know how receptors are organized in those cells. Given that these PAMs provide input to both KCs and MBONs, these will have to work in some coordination.

      The subcellular localization of dopamine receptors in MBONs indeed provides important insights into the site of dopaminergic signaling in these neurons (Takemura et al., 2017; Pavlowsky et al., 2018; Pribbenow et al., 2022). Therefore, we added new data for Dop1R1 and Dop2R in MBON-γ1pedc>αβ (Figure 6). Interestingly, these receptors are localized to the dendritic projection in the γ1 compartment as well as to presynaptic boutons (Figure 6).

      Although authors use the D2R enhancement post starvation to show that knocking down receptors eliminated appetitive memory, the knocking out is affecting multiple neurons within this circuit including PAMs and KCs. How does that account for the observed effect? Are those not important for appetitive learning? 

      In the appetitive memory experiment (Figure 9C), we knocked down Dop2R only in the select neurons of the PPL1 cluster, and this manipulation does not directly affect Dop2R expression in PAMs and KCs.

      Starvation-induced enhancement of Dop2R expression in the PPL1 neurons (Figure 8F) would attenuate their outputs and therefore disinhibit expression of appetitive memory in starved flies (Krashes et al., 2009). Consistently, Dop2R knock-down in PPL1 impaired appetitive memory in starved flies (Figure 9C). We revised the corresponding text to make this point clearer (Lines #224-227).

      The evidence for fine-tuning is completely based on receptor expression and one behavioral outcome which could result from many possibilities. It is not clear if this fine-tuning and presynaptic feedback regulation-based dopamine release is a clear possibility. Alternate hypotheses and outcomes could be considered in the model as it is not completely substantiated by data at least as presented.

      The reviewer’s concern is valid, and the presynaptic dopamine tuning by autoreceptors may need more experimental support. We therefore additionally discussed another possibility (Lines #289-291): “Alternatively, these presynaptic receptors could potentially receive extrasynaptic dopamine released from other DANs. Therefore, the autoreceptor functions need to be experimentally clarified by manipulating the receptor expression in DANs.”

      Reviewer #2 (Public Review):

      Summary:

      Hiramatsu et al. investigated how cognate neurotransmitter receptors with antagonizing downstream effects localize within neurons when co-expressed. They focus on mapping the localization of the dopaminergic Dop1R1 and Dop2R receptors, which correspond to the mammalian D1- and D2-like dopamine receptors, which have opposing effects on intracellular cAMP levels, in neurons of the Drosophila mushroom body (MB). To visualize specific receptors in single neuron types within the crowded MB neuropil, the authors use existing dopamine receptor alleles tagged with 7 copies of split GFP to target reconstitution of GFP tags only in the neurons of interest as a read-out of receptor localization. The authors show that both Dop1R1 and Dop2R, with differing degrees, are enriched in axonal compartments of both the Kenyon Cells cholinergic presynaptic inputs and in different dopamine neurons (DANs), which project axons to the MB. Co-localization studies of dopamine receptors with the presynaptic marker Brp suggest that Dop1R1 and, to a larger extent Dop2R, localize in the proximity of release sites. This localization pattern in DANs suggests that Dop1R1 and Dop2R work in dual-feedback regulation as autoreceptors. Finally, they provide evidence that the balance of Dop1R1 and Dop2R in the axons of two different DAN populations is differentially modulated by starvation and that this regulation plays a role in regulating appetitive behaviors.

      Strengths:

      The authors use reconstitution of GFP fluorescence of split GFP tags knocked into the endogenous locus at the C-terminus of the dopamine receptors as a readout of dopamine receptor localization. This elegant approach preserves the endogenous transcriptional and post-transcriptional regulation of the receptor, which is essential for studies of protein localization.

      The study focuses on mapping the localization of dopamine receptors in neurons of the mushroom body. This is an excellent choice of system to address the question posed in this study, as the neurons are well-studied, and their connections are carefully reconstructed in the mushroom body connectome. Furthermore, the role of this circuit in different behaviors and associative memory permits the linking of patterns of receptor localization to circuit function and resulting behavior. Because of these features, the authors can provide evidence that two antagonizing dopamine receptors can act as autoreceptors within the axonal compartment of MB innervating DANs. The differential regulation of the balance of the two receptors under starvation in two distinct DAN innervations provides evidence of the role that regulation of this balance can play in circuit function and behavioral output.

      Weaknesses:

      The approach of using endogenously tagged alleles to study localization is a strength of this study, but the authors do not provide sufficient evidence that the insertion of 7 copies of split GFP to the C terminus of the dopamine receptors does not interfere with the endogenous localization pattern or function. Both sets of tagged alleles (1X Venus and 7X split GFP tagged) were previously reported (Kondo et al., 2020), but only the 1X Venus tagged alleles were further functionally validated in assays of olfactory appetitive memory. Despite the smaller size of the 7X split-GFP array tag knocked into the same location as the 1X Venus tag, the reconstitution of 7 copies of GFP at the C terminus of the dopamine receptor might substantially increase the molecular bulk at this site, potentially impeding the function of the receptor more significantly than the smaller, single Venus tag. The data presented by Kondo et al., 2020, is insufficient to conclude that the two alleles are equivalent.

      In the revision, we validated the function of these engineered receptors with a new set of olfactory learning experiments. Both receptors in KCs were shown to be required for aversive memory (Kim et al., 2007; Scholz-Kornehl et al., 2016). As in the anatomical experiments, we induced GFP<sub>1-10</sub> expression in the KCs of flies homozygous for 7xGFP<sub>11</sub>-tagged receptors using MB-Switch and 3 days of RU486 feeding. We confirmed that the STM performance of these flies was not significantly different from that of the control (Figure 2 – figure supplement 1). Thus, these fusion receptors are functional.

      The authors' conclusion that the receptors localize to presynaptic sites is weak. The analysis of the colocalization of the active zone marker Brp whole-brain staining with dopamine receptors labeled in specific neurons is insufficient to conclude that the receptors are localized at presynaptic sites. Given the highly crowded neuropil environment, the data cannot differentiate between the receptor localization postsynaptic to a dopamine release site or at a presynaptic site within the same neuron. The known distribution of presynaptic sites within the neurons analyzed in the study provides evidence that the receptors are enriched in axonal compartments, but co-labeling of presynaptic sites and receptors in the same neuron or super-resolution methods are needed to provide evidence of receptor localization at active zones.  The data presented in Figures 5K-5L provides compelling evidence that the receptors localize to neuronal varicosities in DANs where the receptors could play a role as autoreceptors.

      Given the highly crowded environment of the mushroom body neuropil, the analysis of dopamine receptor localization in Kenyon cells is not conclusive. The data is sufficient to conclude that the receptors are preferentially localizing to the axonal compartment of Kenyon cells, but co-localization with brain-wide Brp active zone immunostaining is not sufficient to determine if the receptor localizes juxtaposed to dopaminergic release sites, in proximity of release sites in Kenyon cells, or both.

      To better resolve the microcircuits of KCs, we triple-labeled the plasma membrane and DAR::rGFP in KCs, and Brp, and examined their localizations with high-resolution imaging with  Airyscan. This strategy revealed the receptor clusters associated with Brp accumulation within KCs (Figure 4). To further verify the association of DARs and active zones within KCs, we co-expressed Brp<sup>short</sup>::mStraw and GFP<sub>1-10</sub> and confirmed their colocalization (Figure 5A), suggesting presynaptic localization of DARs in KCs. With these additional characterizations, we now discuss the significance of receptors at the presynaptic sites of KCs.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      This is an important and interesting study that uses the split-GFP approach. Localization of receptors and correlating them to function is important in understanding the circuit basis of behavior.

      For Figure 1, the authors show PAM, PPL1 neurons, and the ellipsoid body as a validation of their tools (Dop1R1-T2A-GAL4 and Dop2R-T2A-GAL4) and the idea that these receptors are colocalized. However, it appears that the technique was applied to the whole brain so it would be great to see the whole brain to understand how much labelling is specific and how stochastic. Methods could include how dissection conditions were controlled and how sensitive are receptor expression to the time of day of dissection, staining, etc.

      The expression patterns of the receptor T2A-GAL4 lines (Figure 1A and 1B) are consistent across multiple whole brains (Kondo et al., 2020, Author response image 1).

      Author response image 1.

      The significance of the expression of these two receptors in an active zone is not clearly discussed and presynaptic localization is not elaborated on. Would something like expansion microscopy be useful in resolving this? It would be important to discuss that as giant neurons in culture don't replicate many aspects of the MB system.

      In the revised manuscript, we expanded the discussion of the function of the two antagonizing receptors at the AZ (Lines #226-275).

      Does MB-GeneSwitch > GFP1-10 reliably express in gamma lobes? Most of the figures show alpha/beta lobes.

      Yes. MB-GeneSwitch is also expressed in γ KCs, but weakly. 12 hours of RU486 feeding, which we did in the previous experiments, was insufficient to induce GFP reconstitution in the γ KCs. By extending the time of transgene induction, we visualized expression of Dop1R1 and Dop2R more clearly in γ KCs. Their localization is similar to that in the α/β KCs (Figure 4C, Figure 5 - figure supplement 1).

      Figure 6, y-axis says protein level. At first, I thought it was related to starvation so maybe authors can be more specific as the protein level doesn't indicate any aspect of starvation.

      We appreciate this comment, and the labels on the y-axis have now been changed to “rGFP levels” (Figure 8C and 8F, Figure 8 - figure supplement 1B, 1D and 1F).

      Reviewer #2 (Recommendations For The Authors):

      Title:

      The title of the manuscript focuses on the tagging of the receptors and their synaptic enrichment.

      Given that the alleles used in the study were generated in a previously published study (Kondo et al, 2020), which describes the receptor tagging and that the data currently provided is insufficient to conclude that the receptors are localizing to synapses, the title should be changed to reflect the focus on localizing antagonistic cognate neurotransmitter receptors in the same neuron and their putative role as autoreceptors in DANs.

      Following this advice, we removed the methodology from the title and revised it to “Synaptic enrichment and dynamic regulation of the two opposing dopamine receptors within the same neurons”.

      Minor issues with text and figures:

      Figure 1

      A conclusion from Figure 1 is that the two receptors are co-expressed in Kenyon cells. Please provide panels equivalent to the ones shown in D-G, with Kenyon cells cell bodies, or mark these cells in the existing panels, if present. Line 111 refers to panel 1D as the Kenyon cells panel, which is currently a PAM panel.

      We added images for coexpression of these receptors in the cell bodies of KCs (Figure 1 - figure supplement 1) and revised the text accordingly (Lines #89-90).

      Given that most of the study centers on visualizing receptor localization, it would benefit the reader to include labels in Figure 1 that help understand that these panels reflect expression patterns rather than receptor localization. For instance, rCD2::GFP could be indicated in the Dop1R1-LexA panels.

      As suggested, labels were added to indicate the UAS and lexAop markers (Figure 1D, 1E, 1G-1I and Figure 1 – figure supplement 1).

      Given that panels D-E focus on the cell bodies of the neurons, it could be beneficial for the reader to present the ellipsoid body neurons using a similar view that only shows the cell bodies. Similarly, one could just show the glial cell bodies.

      We now show the cell bodies of ring neurons (Figure 1G) and ensheathing glia (Figure 1I).

      For panel 1E, please indicate the subset of PPL1 neurons that both expressed Dop1R1 and Dop2R, as indicated in the text, as it is currently unclear from the image.

      Dop1R1-T2A-LexA was barely detected in all PPL1 (Figure 1E). We corrected the confusing text (Lines #95-96).

      Figure 2

      The cartoon of the cell-type-specific labeling should show that the tag is 7XFP-11 and the UAS component FP-10, as the current cartoon leads the reader to conclude that the receptors are tagged with a single copy of split GFP. The detail that the receptors are tagged with 7 copies of split GFP is only provided through the genotype of the allele in the resource table. This design aspect should be made clear in the figure and the text when describing the allele and approach used to tag receptors in specific neuron types.

      We have now added the construct design to the scheme (Figure 2A) and revised the corresponding text (Lines #101-103).

      Panel A. The arrow representing the endogenous promoter in the yellow gene representation should be placed at the beginning of the coding sequence. Currently, the different colors of what I assume are coding (yellow) and non-coding (white) transcript regions are not described in the legend.  I would omit these or represent them in the same color as thinner boxes if the authors want to emphasize that the tag is inserted at the C terminus within the endogenous locus.

      The color scheme was revised to be more consistent and intuitive (Figure 2A).

      Figure 3

      Labels of the calyx and MB lobes would benefit readers not as familiar with the system used in the study. In addition, it would be beneficial to the reader to indicate in panel A the location of the compartments analyzed in panel H (e.g., peduncle, α3).

      Figure 3A was amended to clearly indicate the analyzed MB compartments.

      Adding frontal and sagittal to panels B-E, as in Figure 2, would help the reader interpret the data. 

      In Figure 3B, “Frontal” and “Sagittal” were indicated.

      Panel F-G. A scale bar should be provided for the data shown in the insets. Could the author comment on the localization of Dop1R1 in KCs? The data in the current panel suggests that only a subset of KCs express high levels of receptors in their axons, as a portion of the membrane is devoid of receptor signals. This would be in line with differential dopamine receptor expression in subsets of Kenyon cells, as shown in Kondo et al., 2020, which is currently not commented on in the paper. 

      We confirmed that the majority of the KCs express both Dop1R1 and Dop2R genes (Figure 1 - figure supplement 1). LIs should be compared within the same cells rather than the differences of protein levels between cell types as they also reflect the GAL4 expression levels. 

      Panel H. Some P values are shown as n.s. (p> 0.05). Other non-significant p values in this panel and in other figures throughout the paper are instead reported (e.g. peduncle P=0.164). For consistency, please report the values as n.s. as indicated in the methods for all non-significant tests in this panel and throughout the manuscript.

      We now present the new dataset, and the graph represents the appropriate statistical results (Figure 3D; see the methods section for details).

      The methods of labeling the receptors through the expression of the GeneSwitch-controlled GFP1-10 in Kenyon cells induced by RU486 are not provided in the methods. Please provide a description of this as referenced in the figure legend and the genotypes used in the analysis shown in the panels.

      The method of RU486 feeding has been added. We apologize for the missing method.

      Figure 4

      Please provide scale bars for the inset in panels A-B.

      Scale bars were added to all confocal images.

      The current analysis cannot distinguish between postsynaptic and presynaptic dopamine receptors in KCs, and the figure title should reflect this.

      We now present new data on dopamine receptors in KCs and clearly distinguish the Brp clusters of KCs from those of other cell types (Figure 4, Figure 5).

      The reader could benefit from additional details of using the giant neuron model, as it is not commonly used, and it is not clear how to relate this to interpret the localization of dopaminergic receptors within Kenyon cells. The use of the venus-tagged receptor variant should be introduced in the text, as using a different allele currently lacks context. Figures 4F-4J show that the receptor is localizing throughout the neuron. Quantifying the fraction of receptor signal colocalizing with Brp could aid in interpreting the data.  However, it would still not be clear how to interpret this data in the context of understanding the localization of the receptors in neurons within fly brain circuits. In the absence of additional data, the data provided in Figure 4 is inconclusive and could be omitted, keeping the focus of the study on the analysis of the two receptors in DANs. Co-expressing a presynaptic marker in Kenyon cells (e.g., by expressing Brp::SNAP)  in conjunction with rGFP labeled receptor would provide additional evidence of the relationship of release sites in Kenyon cells and tagged dopamine receptors in these same cells and could add evidence in support to the current conclusion.

      Following the advice, we added a short summary to recapitulate that the giant neurons exhibit many characteristics of mature neurons (Lines #152-156): "Importantly, these giant neurons exhibit characteristics of mature neurons, including firing patterns (Wu et al., 1990; Yao & Wu, 2001; Zhao & Wu, 1997) and acetylcholine release (Yao et al., 2000), both of which are regulated by cAMP and CaMKII signaling (Yao et al., 2000; Yao & Wu, 2001; Zhao & Wu, 1997)." Therefore, the giant neuron serves as an excellent model to study the presynaptic localization in large cells in isolation.

      To clarify polarized localization of Brp clusters and dopamine receptors but not "localizing throughout the neuron", we now show less magnified data (Figure 5C). It clearly demonstrates punctate Brp accumulations localized to the axon terminals of the giant neurons (former Figure 4D and 4E). This is the same membrane segment where Dop1R1 and Dop2R are localized (Figure 5C). Therefore, the association of Brp clusters and the dopamine receptors in the isolated giant neurons suggests that the subcellular localization in the brain neurons is independent of the circuit context. 

      As the giant neurons do not form intermingled circuits, Venus-tagged receptors are sufficient for this experiment and genetically simpler to use.

      Following the suggestion to clarify the AZ association of the receptors in KCs, we coexpressed Brp<sup>short</sup>::mStraw and GFP<sub>1-10</sub> in KCs and confirmed their colocalization (Figure 5A).

      Figure 6

      The data and analysis show that starvation induces changes in the α3 compartment in PPL1 neurons only, while the data provided shows no significant change for PPL1 neurons innervating other MB compartments. This should be clearly stated in lines 174-175, as it is implied that there is a difference in the analysis for compartments other than α3. Panel L of Figure 6 - supplement 1 shows no significant change for all three compartments analyzed and should be indicated as n.s. in all instances, as stated in the methods. 

      We revised the text to clarify that the starvation-induced differences of Dop2R expression were not significant (Lines #217-219). The reason to highlight the α3 compartment is that both Dop1R1 and Dop2R are coexpressed in this PPL1 neuron (Figure 8D).

      Additional minor comments:

      There are a few typos and errors throughout the manuscript. The text should be carefully proofread to correct these. Here are the ones that came to my attention:

      Please reference all figure panels in the text. For instance, Figure 3A is not mentioned and should be revised in line 112 as Figure 3A-E.

      Lines 103-104. The sentence "LI was visualized as the color of the membrane signals" is unclear and should be revised. 

      Figure 4 legend - dendritic claws should likely be B and C and not B and E.

      Lines 147 - Incorrect figure panels, should be 5C-L or 5D-E.

      Line 241 - DNAs should be DANs.

      Methods - please define what the abbreviation CS stands for.

      We greatly appreciate the reviewer's careful reading. All of these have been corrected.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This valuable study investigates how the neural representation of individual finger movements changes during the early period of sequence learning. By combining a new method for extracting features from human magnetoencephalography data and decoding analyses, the authors provide incomplete evidence of an early, swift change in the brain regions correlated with sequence learning, including a set of previously unreported frontal cortical regions. The addition of more control analyses to rule out that head movement artefacts influence the findings, and to further explain the proposal of offline contextualization during short rest periods as the basis for improvement performance would strengthen the manuscript.

      We appreciate the Editorial assessment on our paper’s strengths and novelty. We have implemented additional control analyses to show that neither task-related eye movements nor increasing overlap of finger movements during learning account for our findings, which are that contextualized neural representations in a network of bilateral frontoparietal brain regions actively contribute to skill learning. Importantly, we carried out additional analyses showing that contextualization develops predominantly during rest intervals.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This study addresses the issue of rapid skill learning and whether individual sequence elements (here: finger presses) are differentially represented in human MEG data. The authors use a decoding approach to classify individual finger elements and accomplish an accuracy of around 94%. A relevant finding is that the neural representations of individual finger elements dynamically change over the course of learning. This would be highly relevant for any attempts to develop better brain machine interfaces - one now can decode individual elements within a sequence with high precision, but these representations are not static but develop over the course of learning.

      Strengths:

      The work follows a large body of work from the same group on the behavioural and neural foundations of sequence learning. The behavioural task is well established and neatly designed to allow for tracking learning and how individual sequence elements contribute. The inclusion of short offline rest periods between learning epochs has been influential because it has revealed that a lot, if not most of the gains in behaviour (i.e., speed of finger movements) occur in these so-called micro-offline rest periods. The authors use a range of new decoding techniques, and exhaustively interrogate their data in different ways, using different decoding approaches. Regardless of the approach, impressively high decoding accuracies are observed, but when using a hybrid approach that combines the MEG data in different ways, the authors observe decoding accuracies of individual sequence elements from the MEG data of up to 94%.

      We have previously shown that neural replay of MEG activity representing the practiced skill was prominent during rest intervals of early learning, and that the replay density correlated with micro-offline gains (Buch et al., 2021). These findings are consistent with recent reports (from two different research groups) that hippocampal ripple density increases during these inter-practice rest periods and predicts offline learning gains (Chen et al., 2024; Sjøgård et al., 2024). However, decoder performance in our earlier work (Buch et al., 2021) left room for improvement. Here, we report a strategy to improve decoding accuracy that could benefit future studies of neural replay or BCI using MEG.

      Weaknesses:

      There are a few concerns which the authors may well be able to resolve. These are not weaknesses as such, but factors that would be helpful to address as these concern potential contributions to the results that one would like to rule out. Regarding the decoding results shown in Figure 2 etc, a concern is that within individual frequency bands, the highest accuracy seems to be within frequencies that match the rate of keypresses. This is a general concern when relating movement to brain activity, so is not specific to decoding as done here. As far as reported, there was no specific restraint to the arm or shoulder, and even then it is conceivable that small head movements would correlate highly with the vigor of individual finger movements. This concern is supported by the highest contribution in decoding accuracy being in middle frontal regions - midline structures that would be specifically sensitive to movement artefacts and don't seem to come to mind as key structures for very simple sequential keypress tasks such as this - and the overall pattern is remarkably symmetrical (despite being a unimanual finger task) and spatially broad. This issue may well be matching the time course of learning, as the vigor and speed of finger presses will also influence the degree to which the arm/shoulder and head move. This is not to say that useful information is contained within either of the frequencies or broadband data. But it raises the question of whether a lot is dominated by movement "artefacts" and one may get a more specific answer if removing any such contributions.

      Reviewer #1 expresses concern that the combination of the low-frequency narrow-band decoder results, and the bilateral middle frontal regions displaying the highest average intra-parcel decoding performance across subjects is suggestive that the decoding results could be driven by head movement or other artefacts.

      Head movement artefacts are highly unlikely to contribute meaningfully to our results for the following reasons. First, in addition to ICA denoising, all “recordings were visually inspected and marked to denoise segments containing other large amplitude artifacts due to movements” (see Methods). Second, the response pad was positioned in a manner that minimized wrist, arm or more proximal body movements during the task. Third, while online monitoring of head position was not performed for this study, it was assessed at the beginning and at the end of each recording. The head was restrained with an inflatable air bladder, and head movement between the beginning and end of each scan did not exceed 5 mm for any participant included in the study.

      The Reviewer states a concern that “it is conceivable that small head movements would correlate highly with the vigor of individual finger movements”. We agree that despite the steps taken above, it is possible that minor head movements could still contribute to some remaining variance in the MEG data in our study. However, such correlations between small head movements and finger movements could only meaningfully contribute to decoding performance if: (A) they were consistent and pervasive throughout the recording (which might not be the case if the head movements were related to movement vigor and vigor changed over time); and (B) they systematically varied between different finger movements, and also between the same finger movement performed at different sequence locations (see 5-class decoding performance in Figure 4B). The possibility of any head movement artefacts meeting all these conditions is unlikely. Alternatively, for this task design a much more likely confound could be the contribution of eye movement artefacts to the decoder performance (an issue raised by Reviewer #3 in the comments below).

      Remember from Figure 1A in the manuscript that an asterisk marks the current position in the sequence and is updated at each keypress. Since participants make very few performance errors, the position of the asterisk on the display is highly correlated with the keypress being made in the sequence. Thus, it is possible that if participants are attending to the visual feedback provided on the display, they may generate eye movements that are systematically related to the task. Since we did record eye movements simultaneously with the MEG recordings (EyeLink 1000 Plus; Fs = 600 Hz), we were able to perform a control analysis to address this question. For each keypress event during trials in which no errors occurred (which is the same time-point that the asterisk position is updated), we extracted three features related to eye movements: 1) the gaze position at the time of asterisk position update (triggered by a KeyDown event), 2) the gaze position 150ms later, and 3) the peak velocity of the eye movement between the two positions. We then constructed a classifier from these features with the aim of predicting the location of the asterisk (ordinal positions 1-5) on the display. As shown in the confusion matrix below (Author response image 1), the classifier failed to perform above chance levels (overall cross-validated accuracy = 0.21817):

      Author response image 1.

      Confusion matrix showing that three eye movement features fail to predict asterisk position on the task display above chance levels (Fold 1 test accuracy = 0.21718; Fold 2 test accuracy = 0.22023; Fold 3 test accuracy = 0.21859; Fold 4 test accuracy = 0.22113; Fold 5 test accuracy = 0.21373; Overall cross-validated accuracy = 0.2181). Since the ordinal position of the asterisk on the display is highly correlated with the ordinal position of individual keypresses in the sequence, this analysis provides strong evidence that keypress decoding performance from MEG features is not explained by systematic relationships between finger movement behavior and eye movements (i.e. – behavioral artefacts) (end of figure legend).
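
      As a quick arithmetic check on the legend above, the sketch below averages the five reported fold accuracies and compares the result with the 1-in-5 guessing rate for five asterisk positions. It assumes that the overall cross-validated accuracy is the unweighted mean of the fold accuracies, which the legend does not state; the variable names are illustrative only.

```python
from statistics import mean

# Fold-wise test accuracies as reported in the legend of Author response image 1.
fold_accuracies = [0.21718, 0.22023, 0.21859, 0.22113, 0.21373]

# Five ordinal asterisk positions, so a uniform guess is correct 1 time in 5.
n_classes = 5
chance_level = 1 / n_classes

# Assumption: overall accuracy is the unweighted mean across folds.
overall_accuracy = mean(fold_accuracies)
margin_over_chance = overall_accuracy - chance_level

print(f"overall = {overall_accuracy:.5f}, chance = {chance_level:.2f}, "
      f"margin = {margin_over_chance:.5f}")
```

      Under this assumption the mean lands within two percentage points of the guessing rate. Whether such a margin is statistically distinguishable from chance depends on the number of test samples, which is not given here.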

      Remember that the task display does not provide explicit feedback related to performance, only information about the present position in the sequence. Thus, it is possible that participants did not actively attend to the feedback. In fact, inspection of the eye position data revealed that on the majority of trials, participants displayed random-walk-like gaze patterns around a central fixation point located near the center of the screen. This pattern indicates that participants did not attend to the asterisk position on the display, but instead intrinsically generated the action sequence. A similar real-world example would be manually inputting a long password into a secure online application. In this case, one intrinsically generates the sequence from memory and receives similar feedback about the password sequence position (also provided as asterisks) as provided in the study task – feedback which is typically ignored by the user.

      The minimal participant engagement with the visual task display observed in this study highlights another important point – that the behavior in explicit sequence learning motor tasks is highly generative in nature rather than reactive to stimulus cues as in the serial reaction time task (SRTT). This is a crucial difference that must be carefully considered when designing investigations and comparing findings across studies.

      We observed that keypress decoding accuracy was predominantly driven by contralateral primary sensorimotor cortex in the initial practice trials before transitioning to bilateral frontoparietal regions by trials 11 or 12 as performance gains plateaued. The contribution of contralateral primary sensorimotor areas to early skill learning has been extensively reported in humans and non-human animals (Buch et al., 2021; Classen et al., 1998; Karni et al., 1995; Kleim et al., 1998). Similarly, the increased involvement of bilateral frontal and parietal regions in decoding during early skill learning in the non-dominant hand is well known. Enhanced bilateral activation in both frontal and parietal cortex during skill learning has been extensively reported (Doyon et al., 2002; Grafton et al., 1992; Hardwick et al., 2013; Kennerley et al., 2004; Shadmehr & Holcomb, 1997; Toni, Ramnani, et al., 2001), and appears to be even more prominent during early fine motor skill learning in the non-dominant hand (Lee et al., 2019; Sawamura et al., 2019). The frontal regions identified in these studies are known to play crucial roles in executive control (Battaglia-Mayer & Caminiti, 2019), motor planning (Toni, Thoenissen, et al., 2001), and working memory (Andersen & Buneo, 2002; Buneo & Andersen, 2006; Shadmehr & Holcomb, 1997; Toni, Ramnani, et al., 2001; Wolpert et al., 1998) processes, while the same parietal regions are known to integrate multimodal sensory feedback and support visuomotor transformations (Andersen & Buneo, 2002; Buneo & Andersen, 2006; Shadmehr & Holcomb, 1997; Toni, Ramnani, et al., 2001; Wolpert et al., 1998), in addition to working memory (Grover et al., 2022). Thus, it is not surprising that these regions increasingly contribute to decoding as subjects internalize the sequential task. We now include a statement reflecting these considerations in the revised Discussion.

      A somewhat related point is this: when combining voxel and parcel space, a concern is whether a degree of circularity may have contributed to the improved accuracy of the combined data, because it seems to use the same MEG signals twice - the voxels most contributing are also those contributing most to a parcel being identified as relevant, as parcels reflect the average of voxels within a boundary. In this context, I struggled to understand the explanation given, ie that the improved accuracy of the hybrid model may be due to "lower spatially resolved whole-brain and higher spatially resolved regional activity patterns".

      We disagree with the Reviewer’s assertion that the construction of the hybrid-space decoder is circular for the following reasons. First, the base feature set for the hybrid-space decoder constructed for all participants includes whole-brain spatial patterns of MEG source activity averaged within parcels. As stated in the manuscript, these 148 inter-parcel features reflect “lower spatially resolved whole-brain activity patterns” or global brain dynamics. We then independently test how well spatial patterns of MEG source activity for all voxels distributed within individual parcels can decode keypress actions. Again, the tests of these intra-parcel spatial patterns, intended to capture “higher spatially resolved regional brain activity patterns”, are completely independent of one another and of the weighting of individual inter-parcel features. These intra-parcel features could, for example, provide additional information about muscle activation patterns or the task environment. These approximately 1150 intra-parcel voxels (on average, with the total number varying between subjects) are then combined with the 148 inter-parcel features to construct the final hybrid-space decoder. In fact, this varied spatial filter approach shares some similarities with the construction of convolutional neural networks (CNNs) used to perform object recognition in image classification applications (Srinivas et al., 2016). One could also view this hybrid-space decoding approach as a spatial analogue to common time-frequency based analyses such as theta-gamma phase amplitude coupling (θ/γ PAC), which assess interactions between two or more narrow-band spectral features derived from the same time-series data (Lisman & Jensen, 2013).

      We directly tested this hypothesis – that spatially overlapping intra- and inter-parcel features portray different information – by constructing an alternative hybrid-space decoder (Hybrid<sub>Alt</sub>) that excluded average inter-parcel features which spatially overlapped with intra-parcel voxel features, and comparing the performance to the decoder used in the manuscript (Hybrid<sub>Orig</sub>). The prediction was that if the overlapping parcel contained similar information to the more spatially resolved voxel patterns, then removing the parcel features (n=8) from the decoding analysis should not impact performance. In fact, despite making up less than 1% of the overall input feature space, removing those parcels resulted in a significant drop in overall performance greater than 2% (78.15% ± 7.03% SD for Hybrid<sub>Orig</sub> vs. 75.49% ± 7.17% for Hybrid<sub>Alt</sub>; Wilcoxon signed rank test, z = 3.7410, p = 1.8326e-04; Author response image 2).

      Author response image 2.

      Comparison of decoding performances with two different hybrid approaches. Hybrid<sub>Alt</sub>: Intra-parcel voxel-space features of top ranked parcels and inter-parcel features of remaining parcels. Hybrid<sub>Orig</sub>: Voxel-space features of top ranked parcels and whole-brain parcel-space features (i.e. – the version used in the manuscript). Dots represent decoding accuracy for individual subjects. Dashed lines indicate the trend in performance change across participants. Note, that Hybrid<sub>Orig</sub> (the approach used in our manuscript) significantly outperforms the Hybrid<sub>Alt</sub> approach, indicating that the excluded parcel features provide unique information compared to the spatially overlapping intra-parcel voxel patterns (end of figure legend).
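
      To make the difference between the two feature sets concrete, the following sketch assembles both variants from toy data. It is only an illustration under stated assumptions: the function name, the zero-valued placeholder activity, and the per-parcel voxel counts are hypothetical (chosen so that the voxel total equals the approximate average of 1150 quoted in the response), and eight top-ranked parcels are assumed from the "n=8" given above. It is not the authors' implementation.

```python
def build_hybrid_features(parcel_means, voxels_by_parcel, top_parcels,
                          drop_overlapping_parcels=False):
    """Concatenate parcel-average features with voxel features of top parcels.

    parcel_means: dict parcel_id -> mean source activity within the parcel
    voxels_by_parcel: dict parcel_id -> list of voxel activity values
    top_parcels: parcel ids whose voxel-level patterns are included
    drop_overlapping_parcels: False mimics Hybrid_Orig (all parcel means kept);
        True mimics Hybrid_Alt (parcel means of the top parcels removed).
    """
    top = set(top_parcels)
    features = []
    # Inter-parcel (whole-brain, lower spatial resolution) features.
    for parcel_id in sorted(parcel_means):
        if drop_overlapping_parcels and parcel_id in top:
            continue
        features.append(parcel_means[parcel_id])
    # Intra-parcel (regional, higher spatial resolution) features.
    for parcel_id in sorted(top):
        features.extend(voxels_by_parcel[parcel_id])
    return features


# Hypothetical toy data: 148 parcels, 8 top parcels, 1150 voxels in total.
n_parcels = 148
top_parcels = list(range(8))
voxel_counts = [144, 144, 144, 144, 144, 144, 143, 143]

parcel_means = {p: 0.0 for p in range(n_parcels)}
voxels_by_parcel = {p: [0.0] * voxel_counts[p] for p in top_parcels}

hybrid_orig = build_hybrid_features(parcel_means, voxels_by_parcel, top_parcels)
hybrid_alt = build_hybrid_features(parcel_means, voxels_by_parcel, top_parcels,
                                   drop_overlapping_parcels=True)

removed_fraction = (len(hybrid_orig) - len(hybrid_alt)) / len(hybrid_orig)
print(len(hybrid_orig), len(hybrid_alt), f"{removed_fraction:.2%}")
```

      With these counts the removed parcel averages account for 8 of 1298 features, which is consistent with the statement that they make up less than 1% of the input feature space.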

      Firstly, there will be a relatively high degree of spatial contiguity among voxels because of the nature of the signal measured, i.e. nearby individual voxels are unlikely to be independent. Secondly, the voxel data gives a somewhat misleading sense of precision; the inversion can be set up to give an estimate for each voxel, but there will not just be dependence among adjacent voxels, but also substantial variation in the sensitivity and confidence with which activity can be projected to different parts of the brain. Midline and deeper structures come to mind, where the inversion will be more problematic than for regions along the dorsal convexity of the brain, and a concern is that in those midline structures, the highest decoding accuracy is seen.

      We agree with the Reviewer that some inter-parcel features representing neighboring (or spatially contiguous) voxels are likely to be correlated. This is an important confound in connectivity analyses (Colclough et al., 2015; Colclough et al., 2016), which were not performed in our investigation.

      In our study, correlations between adjacent voxels effectively reduce the dimensionality of the input feature space. However, as long as there are multiple groups of correlated voxels within each parcel (i.e. – the rank is greater than 1), the intra-parcel spatial patterns could meaningfully contribute to the decoder performance, as shown by the following results:

      First, we obtained higher decoding accuracy with voxel-space features (74.51% ± 7.34% SD) compared to parcel space features (68.77% ± 7.6%; Figure 3B), indicating individual voxels carry more information in decoding the keypresses than the averaged voxel-space features or parcel space features. Second, individual voxels within a parcel showed varying feature importance scores in decoding keypresses (Author response image 3). This finding shows that correlated voxels form mini subclusters that are much smaller spatially than the parcel they reside within.

      Author response image 3.

      Feature importance score of individual voxels in decoding keypresses: MRMR was used to rank the individual voxel space features in decoding keypresses and the min-max normalized MRMR score was mapped to a structural brain surface. Note that individual voxels within a parcel showed different contributions to decoding (end of figure legend).

      Some of these concerns could be addressed by recording head movement (with enough precision) to regress out these contributions. The authors state that head movement was monitored with 3 fiducials, and their time courses ought to provide a way to deal with this issue. The ICA procedure may not have sufficiently dealt with removing movement-related problems, but one could eg relate individual components that were identified to the keypresses as another means for checking. An alternative could be to focus on frequency ranges above the movement frequencies. The accuracy for those still seems impressive and may provide a slightly more biologically plausible assessment.

      We have already addressed the issue of movement related artefacts in the first response above. With respect to a focus on frequency ranges above movement frequencies, the Reviewer states the “accuracy for those still seems impressive and may provide a slightly more biologically plausible assessment”. First, it is important to note that cortical delta-band oscillations measured with local field potentials (LFPs) in macaques are known to contain important information related to end-effector kinematics (Bansal et al., 2011; Mollazadeh et al., 2011), muscle activation patterns (Flint et al., 2012) and temporal sequencing (Churchland et al., 2012) during skilled reaching and grasping actions. Thus, there is a substantial body of evidence that low-frequency neural oscillatory activity in this range contains important information about the skill learning behavior investigated in the present study. Second, our own data show (as the Reviewer also points out) that significant information related to the skill learning behavior is also present in higher frequency bands (see Figure 2A and Figure 3—figure supplement 1). As we pointed out in our earlier response to questions about the hybrid space decoder architecture (see above), it is likely that different, yet complementary, information is encoded across different temporal frequencies (just as it is encoded across different spatial frequencies) (Heusser et al., 2016). Again, this interpretation is supported by our data as the highest performing classifiers in all cases (when holding all parameters constant) were always constructed from broadband input MEG data (Figure 2A and Figure 3—figure supplement 1).

      One question concerns the interpretation of the results shown in Figure 4. They imply that during the course of learning, entirely different brain networks underpin the behaviour. Not only that, but they also include regions that would seem rather unexpected to be key nodes for learning and expressing relatively simple finger sequences, such as here. What then is the biological plausibility of these results? The authors seem to circumnavigate this issue by moving into a distance metric that captures the (neural network) changes over the course of learning, but the discussion seems detached from which regions are actually involved; or they offer a rather broad discussion of the anatomical regions identified here, eg in the context of LFOs, where they merely refer to "frontoparietal regions".

      The Reviewer notes the shift in brain networks driving keypress decoding performance between trials 1, 11 and 36 as shown in Figure 4A. The Reviewer questions whether these shifts in brain network states underpinning the skill are biologically plausible, as well as the likelihood that bilateral superior and middle frontal and parietal cortex are important nodes within these networks.

      First, previous fMRI work in humans assessed changes in functional connectivity patterns while participants performed a similar sequence learning task to our present study (Bassett et al., 2011). Using a dynamic network analysis approach, Bassett et al. showed that flexibility in the composition of individual network modules (i.e. – changes in functional brain region membership of orthogonal brain networks) is up-regulated in novel learning environments and explains differences in learning rates across individuals. Thus, consistent with our findings, it is likely that functional brain networks rapidly reconfigure during early learning of novel sequential motor skills.

      Second, frontoparietal network activity is known to support motor memory encoding during early learning (Albouy et al., 2013; Albouy et al., 2012). For example, reactivation events in the posterior parietal (Qin et al., 1997) and medial prefrontal (Euston et al., 2007; Molle & Born, 2009) cortex (MPFC) have been temporally linked to hippocampal replay, and are posited to support memory consolidation across several memory domains (Frankland & Bontempi, 2005), including motor sequence learning (Albouy et al., 2015; Buch et al., 2021; F. Jacobacci et al., 2020). Further, synchronized interactions between MPFC and hippocampus are more prominent during early as opposed to later learning stages (Albouy et al., 2013; Gais et al., 2007; Sterpenich et al., 2009), perhaps reflecting “redistribution of hippocampal memories to MPFC” (Albouy et al., 2013). MPFC contributes to very early memory formation by learning associations between contexts, locations, events and adaptive responses during rapid learning (Euston et al., 2012). Consistently, coupling between hippocampus and MPFC has been shown during initial memory encoding and during subsequent rest (van Kesteren et al., 2010; van Kesteren et al., 2012). Importantly, MPFC activity during initial memory encoding predicts subsequent recall (Wagner et al., 1998). Thus, the spatial map required to encode a motor sequence memory may be “built under the supervision of the prefrontal cortex” (Albouy et al., 2012), which is also engaged in the development of an abstract representation of the sequence (Ashe et al., 2006). In more abstract terms, the prefrontal, premotor and parietal cortices support novice performance “by deploying attentional and control processes” required during early learning (Doyon et al., 2009; Hikosaka et al., 2002; Penhune & Steele, 2012). The dorsolateral prefrontal cortex (DLPFC) specifically is thought to engage in goal selection and sequence monitoring during early skill practice (Schendan et al., 2003), all consistent with the schema model of declarative memory in which prefrontal cortices play an important role in encoding (Morris, 2006; Tse et al., 2007). Thus, several prefrontal and frontoparietal regions contributing to long term learning (Berlot et al., 2020) are also engaged in early stages of encoding. Altogether, there is strong biological support for the involvement of bilateral prefrontal and frontoparietal regions in decoding during early skill learning. We now address this issue in the revised manuscript.

      If I understand correctly, the offline neural representation analysis is in essence the comparison of the last keypress vs the first keypress of the next sequence. In that sense, the activity during offline rest periods is actually not considered. This makes the nomenclature somewhat confusing. While it matches the behavioural analysis, having only key presses one can't do it in any other way, but here the authors actually do have recordings of brain activity during offline rest. So at the very least calling it offline neural representation is misleading to this reviewer because what is compared is activity during the last and during the next keypress, not activity during offline periods. But it also seems a missed opportunity - the authors argue that most of the relevant learning occurs during offline rest periods, yet there is no attempt to actually test whether activity during this period can be useful for the questions at hand here.

      We agree with the Reviewer that our previous “offline neural representation” nomenclature could be misinterpreted. In the revised manuscript we refer to this difference as the “offline neural representational change”. Please note that our previous work did link offline neural activity (i.e. – 16-22 Hz beta power (Bonstrup et al., 2019) and neural replay density (Buch et al., 2021) during inter-practice rest periods) to observed micro-offline gains.

      Reviewer #2 (Public review):

      Summary

      Dash et al. asked whether and how the neural representation of individual finger movements is "contextualized" within a trained sequence during the very early period of sequential skill learning by using decoding of MEG signal. Specifically, they assessed whether/how the same finger presses (pressing index finger) embedded in the different ordinal positions of a practiced sequence (4-1-3-2-4; here, the numbers 1 through 4 correspond to the little through the index fingers of the non-dominant left hand) change their representation (MEG feature). They did this by computing either the decoding accuracy of the index finger at the ordinal positions 1 vs. 5 (index_OP1 vs index_OP5) or pattern distance between index_OP1 vs. index_OP5 at each training trial and found that both the decoding accuracy and the pattern distance progressively increase over the course of learning trials. More interestingly, they also computed the pattern distance for index_OP5 for the last execution of a practice trial vs. index_OP1 for the first execution in the next practice trial (i.e., across the rest period). This "off-line" distance was significantly larger than the "on-line" distance, which was computed within practice trials and predicted micro-offline skill gain. Based on these results, the authors conclude that the differentiation of representation for the identical movement embedded in different positions of a sequential skill ("contextualization") primarily occurs during early skill learning, especially during rest, consistent with the recent theory of the "micro-offline learning" proposed by the authors' group. I think this is an important and timely topic for the field of motor learning and beyond.

      Strengths

      The specific strengths of the current work are as follows. First, the use of temporally rich neural information (MEG signal) has a large advantage over previous studies testing sequential representations using fMRI. This allowed the authors to examine the earliest period (= the first few minutes of training) of skill learning with finer temporal resolution. Second, through the optimization of MEG feature extraction, the current study achieved extremely high decoding accuracy (approx. 94%) compared to previous works. As claimed by the authors, this is one of the strengths of the paper (but see my comments). Third, although some potential refinement might be needed, comparing "online" and "offline" pattern distance is a neat idea.

      Weaknesses

      Along with the strengths I raised above, the paper has some weaknesses. First, the pursuit of high decoding accuracy, especially the choice of time points and window length (i.e., 200 msec window starting from 0 msec from key press onset), casts a shadow on the interpretation of the main result. Currently, it is unclear whether the decoding results simply reflect behavioral change or true underlying neural change. As shown in the behavioral data, the key press speed reached 3~4 presses per second already at around the end of the early learning period (11th trial), which means inter-press intervals become as short as 250-330 msec. Thus, in almost more than 60% of training period data, the time window for MEG feature extraction (200 msec) spans around 60% of the inter-press intervals. Considering that the preparation/cueing of subsequent presses starts ahead of the actual press (e.g., Kornysheva et al., 2019) and/or potential online planning (e.g., Ariani and Diedrichsen, 2019), the decoder likely has captured these future press information as well as the signal related to the current key press, independent of the formation of genuine sequential representation (e.g., "contextualization" of individual press). This may also explain the gradual increase in decoding accuracy or pattern distance between index_OP1 vs. index_OP5 (Figure 4C and 5A), which co-occurred with performance improvement, as shorter inter-press intervals are more favorable for the dissociating the two index finger presses followed by different finger presses. The compromised decoding accuracies for the control sequences can be explained in similar logic. Therefore, more careful consideration and elaborated discussion seem necessary when trying to both achieve high-performance decoding and assess early skill learning, as it can impact all the subsequent analyses.

      The Reviewer raises the possibility that (given the windowing parameters used in the present study) an increase in “contextualization” with learning could simply reflect faster typing speeds as opposed to an actual change in the underlying neural representation.

      We now include a new control analysis that addresses this issue as well as additional re-examination of previously reported results with respect to this issue – all of which are inconsistent with this alternative explanation that “contextualization” reflects a change in mixing of keypress related MEG features as opposed to a change in the underlying representations themselves. As correct sequences are generated at higher and higher speeds over training, MEG activity patterns related to the planning, execution, evaluation and memory of individual keypresses overlap more in time. Thus, increased overlap between the “4” and “1” keypresses (at the start of the sequence) and “2” and “4” keypresses (at the end of the sequence) could artefactually increase contextualization distances even if the underlying neural representations for the individual keypresses remain unchanged. One must also keep in mind that since participants repeat the sequence multiple times within the same trial, a majority of the index finger keypresses are performed adjacent to one another (i.e. - the “4-4” transition marking the end of one sequence and the beginning of the next). Thus, increased overlap between consecutive index finger keypresses as typing speed increased should increase their similarity and mask contextualization related changes to the underlying neural representations.

      We addressed this question by conducting a new multivariate regression analysis to directly assess whether the neural representation distance score could be predicted by the 4-1, 2-4 and 4-4 keypress transition times observed for each complete correct sequence (both predictor and response variables were z-score normalized within-subject). The results of this analysis indicate that the alternative explanation, namely that contextualization effects simply reflect increased mixing, is not supported by the data (Adjusted R<sup>2</sup> = 0.00431; F = 5.62). We now include this new negative control analysis in the revised manuscript.
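
      Two ingredients of the regression described above, within-subject z-scoring and the adjusted R<sup>2</sup> statistic, can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names and example values are hypothetical, the population standard deviation is used for z-scoring (the response does not specify which estimator was applied), and the regression fit itself is not reproduced.

```python
from statistics import mean, pstdev


def zscore_within_subject(values, subject_ids):
    """Z-score each value using the mean and SD of its own subject's values."""
    by_subject = {}
    for value, subject in zip(values, subject_ids):
        by_subject.setdefault(subject, []).append(value)
    # Assumption: population SD; a sample SD would rescale the values slightly.
    stats = {s: (mean(vs), pstdev(vs)) for s, vs in by_subject.items()}
    normalized = []
    for value, subject in zip(values, subject_ids):
        mu, sd = stats[subject]
        normalized.append((value - mu) / sd if sd > 0 else 0.0)
    return normalized


def adjusted_r2(r2, n_observations, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_observations - 1) / (n_observations - n_predictors - 1)


# Hypothetical transition times (seconds) for two subjects.
transition_times = [0.30, 0.25, 0.20, 0.50, 0.40, 0.30]
subjects = ["s1", "s1", "s1", "s2", "s2", "s2"]
z_times = zscore_within_subject(transition_times, subjects)

# Hypothetical fit with three predictors (the 4-1, 2-4 and 4-4 transition times).
example_adjusted = adjusted_r2(r2=0.5, n_observations=11, n_predictors=3)
print(z_times, example_adjusted)
```

      Normalizing within subject removes between-subject differences in overall typing speed, so that any relationship detected by the regression reflects trial-to-trial covariation within individuals.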

      We also re-examined our previously reported classification results with respect to this issue. We reasoned that if mixing effects reflecting the ordinal sequence structure are an important driver of the contextualization finding, these effects should be observable in the distribution of decoder misclassifications. For example, “4” keypresses would be more likely to be misclassified as “1” or “2” keypresses (or vice versa) than as “3” keypresses. The confusion matrices presented in Figures 3C and 4B and Figure 3—figure supplement 3A display a distribution of misclassifications that is inconsistent with an alternative mixing effect explanation of contextualization.
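
      The logic of this check can be sketched in code. In the sequence 4-1-3-2-4, a "4" keypress is adjacent to "1" and "2" but not to "3", so a mixing account predicts more errors per adjacent class than per non-adjacent class. The confusion counts below are hypothetical and are not taken from the manuscript figures; the function name is likewise illustrative.

```python
def neighbor_error_ratio(confusion, labels, true_label, neighbor_labels):
    """Mean errors per neighbor class divided by mean errors per other class.

    confusion: rows are true labels, columns are predicted labels.
    A ratio well above 1 would point toward mixing between adjacent keypresses;
    a ratio near or below 1 would not.
    """
    row = confusion[labels.index(true_label)]
    neighbor_errors, other_errors = [], []
    for label, count in zip(labels, row):
        if label == true_label:
            continue  # skip correct classifications
        if label in neighbor_labels:
            neighbor_errors.append(count)
        else:
            other_errors.append(count)
    mean_neighbor = sum(neighbor_errors) / len(neighbor_errors)
    mean_other = sum(other_errors) / len(other_errors)
    return mean_neighbor / mean_other if mean_other else float("inf")


labels = [1, 2, 3, 4]
# Hypothetical confusion counts (rows: true keypress, columns: predicted keypress).
confusion = [
    [90, 3, 4, 3],
    [4, 88, 4, 4],
    [3, 4, 90, 3],
    [6, 5, 6, 183],
]

ratio_for_4 = neighbor_error_ratio(confusion, labels, true_label=4,
                                   neighbor_labels={1, 2})
print(ratio_for_4)
```

      In this toy matrix the misclassified "4" keypresses are spread roughly evenly across the other three classes, which is the kind of pattern that would argue against a mixing explanation.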

      Based upon the increased overlap between adjacent index finger keypresses (i.e. – “4-4” transition), we also reasoned that the decoder tasked with separating individual index finger keypresses into two distinct classes based upon sequence position, should show decreased performance as typing speed increases. However, Figure 4C in our manuscript shows that this is not the case. The 2-class hybrid classifier actually displays improved classification performance over early practice trials despite greater temporal overlap. Again, this is inconsistent with the idea that the contextualization effect simply reflects increased mixing of individual keypress features.

      In summary, both the re-examination of previously reported data and the new control analyses converged on the idea that the proximity between keypresses does not explain contextualization.

      We do agree with the Reviewer that the naturalistic, generative, self-paced task employed in the present study results in overlapping brain processes related to planning, execution, evaluation and memory of the action sequence. We also agree that there are several tradeoffs to consider in the construction of the classifiers depending on the study aim. Given our aim of optimizing keypress decoder accuracy in the present study, the set of trade-offs resulted in representations reflecting more the latter three processes, and less so the planning component. Whether separate decoders can be constructed to tease apart the representations or networks supporting these overlapping processes is an important future direction of research in this area. For example, work presently underway in our lab constrains the selection of windowing parameters in a manner that allows individual classifiers to be temporally linked to specific planning, execution, evaluation or memory-related processes to discern which brain networks are involved and how they adaptively reorganize with learning. Results from the present study (Figure 4—figure supplement 2) showing hybrid-space decoder prediction accuracies exceeding 74% for temporal windows spanning as little as 25ms and located up to 100ms prior to the KeyDown event strongly support the feasibility of such an approach.

      Related to the above point, testing only one particular sequence (4-1-3-2-4), aside from the control ones, limits the generalizability of the finding. This also may have contributed to the extremely high decoding accuracy reported in the current study.

      The Reviewer raises a question about the generalizability of the decoder accuracy reported in our study. Fortunately, a comparison between decoder performances on Day 1 and Day 2 datasets does provide insight into this issue. As the Reviewer points out, the classifiers in this study were trained and tested on keypresses performed while practicing a specific sequence (4-1-3-2-4). The study was designed this way to avoid the impact of interference effects on learning dynamics. The cross-validated performance of classifiers on MEG data collected within the same session was 90.47% overall accuracy (4-class; Figure 3C). We then tested classifier performance on data collected during a separate MEG session conducted approximately 24 hours later (Day 2; see Figure 3 — figure supplement 3). We observed a reduction in overall accuracy rate to 87.11% when tested on MEG data recorded while participants performed the same learned sequence, and 79.44% when they performed several previously unpracticed sequences. Both changes in accuracy are important with regard to the generalizability of our findings. First, 87.11% performance accuracy for the trained sequence data on Day 2 (a reduction of only 3.36%) indicates that the hybrid-space decoder performance is robust over multiple MEG sessions, and thus, robust to variations in SNR across the MEG sensor array caused by small differences in head position between scans. This indicates a substantial advantage over sensor-space decoding approaches. Furthermore, when tested on data from unpracticed sequences, overall performance dropped an additional 7.67%. This difference reflects the performance bias of the classifier for the trained sequence, possibly caused by high-order sequence structure being incorporated into the feature weights. In the future, it will be important to understand in more detail how random or repeated keypress sequence training data impacts overall decoder performance and generalization.
We strongly agree with the Reviewer that the issue of generalizability is extremely important and have added a new paragraph to the Discussion in the revised manuscript highlighting the strengths and weaknesses of our study with respect to this issue.

      In terms of clinical BCI, one of the potential relevance of the study, as claimed by the authors, it is not clear that the specific time window chosen in the current study (up to 200 msec since key press onset) is really useful. In most cases, clinical BCI would target neural signals with no overt movement execution due to patients' inability to move (e.g., Hochberg et al., 2012). Given the time window, the surprisingly high performance of the current decoder may result from sensory feedback and/or planning of subsequent movement, which may not always be available in the clinical BCI context. Of course, the decoding accuracy is still much higher than chance even when using signal before the key press (as shown in Figure 4 Supplement 2), but it is not immediately clear to me that the authors relate their high decoding accuracy based on post-movement signal to clinical BCI settings.

      The Reviewer questions the relevance of the specific window parameters used in the present study for clinical BCI applications, particularly for paretic patients who are unable to produce finger movements or for whom afferent sensory feedback is no longer intact. We strongly agree with the Reviewer that any intended clinical application must carefully consider the specific input feature constraints dictated by the clinical cohort, and in turn impose appropriate and complementary constraints on classifier parameters that may differ from the ones used in the present study. We now highlight this issue in the Discussion of the revised manuscript and relate our present findings to published clinical BCI work within this context.

      One of the important and fascinating claims of the current study is that the "contextualization" of individual finger movements in a trained sequence specifically occurs during short rest periods in very early skill learning, echoing the recent theory of micro-offline learning proposed by the authors' group. Here, I think two points need to be clarified. First, the concept of "contextualization" is kept somewhat blurry throughout the text. It is only at the later part of the Discussion (around line #330 on page 13) that some potential mechanism for the "contextualization" is provided as "what-and-where" binding. Still, it is unclear what "contextualization" actually is in the current data, as the MEG signal analyzed is extracted from 0-200 msec after the keypress. If one thinks something is contextualizing an action, that contextualization should come earlier than the action itself.

      The Reviewer requests that we: 1) more clearly define our use of the term “contextualization” and 2) provide the rationale for assessing it over a 200ms window aligned to the KeyDown event. This choice of window parameters means that the MEG activity used in our analysis was coincident with, rather than preceding, the actual keypresses. We define contextualization as the differentiation of representation for the identical movement embedded in different positions of a sequential skill. That is, representations of individual action elements progressively incorporate information about their relationship to the overall sequence structure as the skill is learned. We agree with the Reviewer that this can be appropriately interpreted as “what-and-where” binding. We now incorporate this definition in the Introduction of the revised manuscript as requested.

      The window parameters for optimizing accurate decoding of individual finger movements were determined using a grid search of the parameter space (a sliding window of variable width between 25-350 ms with 25 ms increments variably aligned from 0 to +100ms with 10ms increments relative to the KeyDown event). This approach generated 140 different temporal windows for each keypress for each participant, with the final parameter selection determined through comparison of the resulting performance between each decoder. Importantly, the decision to optimize for decoding accuracy placed an emphasis on keypress representations characterized by the most consistent and robust features shared across subjects, which in turn maximize statistical power in detecting common learning-related changes. In this case, the optimal window encompassed a 200ms epoch aligned to the KeyDown event (t<sub>0</sub> = 0 ms). We then asked if the representations (i.e. – spatial patterns of combined parcel- and voxel-space activity) of the same digit at two different sequence positions changed with practice within this optimal decoding window. Of course, our findings do not rule out the possibility that contextualization can also be found before or even after this time window, as we did not directly address this issue in the present study. Ongoing work in our lab, as pointed out above, is investigating contextualization within different time windows tailored specifically for assessing sequence skill action planning, execution, evaluation and memory processes.
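
The window search described above can be illustrated with a minimal Python sketch. It is not the analysis code used in the study: the function names and the toy scoring function are hypothetical, and the sketch assumes the upper bound of the onset range is excluded (onsets of 0, 10, ..., 90 ms), since that choice reproduces the 140 candidate windows quoted in the text.

```python
from itertools import product

def window_grid(min_width=25, max_width=350, width_step=25,
                min_onset=0, max_onset=100, onset_step=10):
    """Enumerate candidate (onset_ms, width_ms) windows relative to KeyDown (t0 = 0 ms)."""
    # Widths 25, 50, ..., 350 ms -> 14 candidate widths.
    widths = range(min_width, max_width + 1, width_step)
    # Assumption: the upper onset bound is exclusive (0, 10, ..., 90 ms),
    # which yields 14 * 10 = 140 windows.
    onsets = range(min_onset, max_onset, onset_step)
    return [(onset, width) for onset, width in product(onsets, widths)]

def best_window(windows, score_fn):
    """Return the window whose score (e.g. cross-validated accuracy) is highest."""
    return max(windows, key=score_fn)

def toy_score(window):
    """Hypothetical stand-in for decoder accuracy; peaks at onset 0 ms, width 200 ms."""
    onset, width = window
    return -abs(width - 200) - onset

windows = window_grid()
print(len(windows), best_window(windows, toy_score))
```

In a real pipeline `toy_score` would be replaced by a function that trains and cross-validates a decoder on features extracted from the given window.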

      The second point is that the result provided by the authors is not yet convincing enough to support the claim that "contextualization" occurs during rest. In the original analysis, the authors presented the statistical significance regarding the correlation between the "offline" pattern differentiation and micro-offline skill gain (Figure 5. Supplement 1), as well as the larger "offline" distance than "online" distance (Figure 5B). However, this analysis looks like regressing two variables (monotonically) increasing as a function of the trial. Although some information in this analysis, such as what the independent/dependent variables were or how individual subjects were treated, was missing in the Methods, getting a statistically significant slope seems unsurprising in such a situation. Also, curiously, the same quantitative evidence was not provided for its "online" counterpart, and the authors only briefly mentioned in the text that there was no significant correlation between them. It may be true looking at the data in Figure 5A as the online representation distance looks less monotonically changing, but the classification accuracy presented in Figure 4C, which should reflect similar representational distance, shows a more monotonic increase up to the 11th trial. Further, the ways the "online" and "offline" representation distance was estimated seem to make them not directly comparable. While the "online" distance was computed using all the correct press data within each 10 sec of execution, the "offline" distance is basically computed by only two presses (i.e., the last index_OP5 vs. the first index_OP1 separated by 10 sec of rest). Theoretically, the distance between the neural activity patterns for temporally closer events tends to be closer than that between the patterns for temporally far-apart events. It would be fairer to use the distance between the first index_OP1 vs. the last index_OP5 within an execution period for "online" distance, as well.

      The Reviewer suggests that the current data is not enough to show that contextualization occurs during rest and raises two important concerns: 1) the relationship between online contextualization and micro-online gains is not shown, and 2) the online distance was calculated differently from its offline counterpart (i.e. - instead of calculating the distance between last Index<sub>OP5</sub> and first Index<sub>OP1</sub> from a single trial, the distance was calculated for each sequence within a trial and then averaged).

      We addressed the first concern by performing individual subject correlations between 1) contextualization changes during rest intervals and micro-offline gains; 2) contextualization changes during practice trials and micro-online gains, and 3) contextualization changes during practice trials and micro-offline gains (Figure 5 – figure supplement 4). We then statistically compared the resulting correlation coefficient distributions and found that within-subject correlations for contextualization changes during rest intervals and micro-offline gains were significantly higher than online contextualization and micro-online gains (t = 3.2827, p = 0.0015) and online contextualization and micro-offline gains (t = 3.7021, p = 5.3013e-04). These results are consistent with our interpretation that micro-offline gains are supported by contextualization changes during the inter-practice rest periods.
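
The logic of this comparison (one correlation coefficient per subject for each pairing, followed by a test on the resulting coefficient distributions) can be sketched as follows. This is an illustration only: the response does not state which t-test variant was used, so the paired t statistic below is an assumption, and both helper names are hypothetical.

```python
import math
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences (one subject's trials)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def paired_t(a, b):
    """Paired t statistic for two per-subject lists of correlation coefficients.

    Assumption: a paired test across subjects; the degrees of freedom are len(a) - 1.
    """
    diffs = [p - q for p, q in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
```

For each subject, `pearson_r` would be applied once to (offline contextualization change, micro-offline gain) and once to (online contextualization change, micro-online gain) across trials; `paired_t` then compares the two lists of coefficients across subjects.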

      With respect to the second concern, we agree with the Reviewer that one limitation of the analysis comparing online versus offline changes in contextualization, as presented in the original manuscript, is that it does not eliminate the possibility that any differences could simply be explained by the passage of time (which is smaller for the online analysis compared to the offline analysis). The Reviewer suggests an approach that addresses this issue, which we have now carried out. When quantifying online changes in contextualization from the first Index<sub>OP1</sub> to the last Index<sub>OP5</sub> keypress in the same trial, we observed no learning-related trend (Figure 5 – figure supplement 5, right panel). Importantly, offline distances were significantly larger than online distances regardless of the measurement approach and neither predicted online learning (Figure 5 – figure supplement 6).

      A related concern regarding the control analysis, where individual values for max speed and the degree of online contextualization were compared (Figure 5 Supplement 3), is whether the individual difference is meaningful. If I understood correctly, the optimization of the decoding process (temporal window, feature inclusion/reduction, decoder, etc.) was performed for individual participants, and the same feature extraction was also employed for the analysis of representation distance (i.e., contextualization). If this is the case, the distances are individually differently calculated and they may need to be normalized relative to some stable reference (e.g., 1 vs. 4 or average distance within the control sequence presses) before comparison across the individuals.

      The Reviewer makes a good point here. We have now implemented the suggested normalization procedure in the analysis provided in the revised manuscript.

      Reviewer #3 (Public review):

      Summary:

      One goal of this paper is to introduce a new approach for highly accurate decoding of finger movements from human magnetoencephalography data via dimension reduction of a "multiscale, hybrid" feature space. Following this decoding approach, the authors aim to show that early skill learning involves "contextualization" of the neural coding of individual movements, relative to their position in a sequence of consecutive movements. Furthermore, they aim to show that this "contextualization" develops primarily during short rest periods interspersed with skill training and correlates with a performance metric which the authors interpret as an indicator of offline learning.

      Strengths:

      A clear strength of the paper is the innovative decoding approach, which achieves impressive decoding accuracies via dimension reduction of a "multi-scale, hybrid space". This hybrid-space approach follows the neurobiologically plausible idea of the concurrent distribution of neural coding across local circuits as well as large-scale networks. A further strength of the study is the large number of tested dimension reduction techniques and classifiers (though the manuscript reveals little about the comparison of the latter).

      We appreciate the Reviewer’s comments regarding the paper’s strengths.

      A simple control analysis based on shuffled class labels could lend further support to this complex decoding approach. As a control analysis that completely rules out any source of overfitting, the authors could test the decoder after shuffling class labels. Following such shuffling, decoding accuracies should drop to chance level for all decoding approaches, including the optimized decoder. This would also provide an estimate of actual chance-level performance (which is informative over and beyond the theoretical chance level). Furthermore, currently, the manuscript does not explain the huge drop in decoding accuracies for the voxel-space decoding (Figure 3B). Finally, the authors' approach to cortical parcellation raises questions regarding the information carried by varying dipole orientations within a parcel (which currently seems to be ignored?) and the implementation of the mean-flipping method (given that there are two dimensions - space and time - what do the authors refer to when they talk about the sign of the "average source", line 477?).

      The Reviewer recommends that we: 1) conduct an additional control analysis on classifier performance using shuffled class labels, 2) provide a more detailed explanation regarding the drop in decoding accuracies for the voxel-space decoding following LDA dimensionality reduction (see Fig 3B), and 3) provide additional details on how problems related to dipole solution orientations were addressed in the present study.

      In relation to the first point, we have now implemented a random shuffling approach as a control for the classification analyses. The results of this analysis indicated that the chance level accuracy was 22.12% (± SD 9.1%) for individual keypress decoding (4-class classification), and 18.41% (± SD 7.4%) for individual sequence item decoding (5-class classification), irrespective of the input feature set or the type of decoder used. Thus, the decoding accuracy observed with the final model was substantially higher than these chance levels.
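
A shuffled-label control of this kind can be sketched in a few lines of Python. The sketch below is a generic illustration, not the decoder used in the study: it substitutes a simple nearest-centroid classifier with interleaved k-fold cross-validation, and all function names and default parameters are hypothetical.

```python
import random
from collections import defaultdict

def fit_centroids(X, y):
    """Compute the mean feature vector of each class."""
    groups = defaultdict(list)
    for features, label in zip(X, y):
        groups[label].append(features)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()}

def predict(centroids, features):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    return min(centroids, key=lambda label: sum(
        (f - c) ** 2 for f, c in zip(features, centroids[label])))

def cv_accuracy(X, y, n_folds=5):
    """Cross-validated accuracy with interleaved folds."""
    correct = 0
    for fold in range(n_folds):
        test_idx = set(range(fold, len(X), n_folds))
        train_idx = [i for i in range(len(X)) if i not in test_idx]
        centroids = fit_centroids([X[i] for i in train_idx],
                                  [y[i] for i in train_idx])
        correct += sum(predict(centroids, X[i]) == y[i] for i in test_idx)
    return correct / len(X)

def shuffled_label_chance(X, y, n_permutations=200, seed=0):
    """Empirical chance level: mean CV accuracy after shuffling the class labels."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_permutations):
        shuffled = list(y)
        rng.shuffle(shuffled)  # breaks any feature-label relationship
        scores.append(cv_accuracy(X, shuffled))
    return sum(scores) / len(scores)
```

Because shuffling destroys the relationship between features and labels, any accuracy that survives it reflects the empirical chance level of the pipeline rather than decodable information.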

      Second, please note that the dimensionality of the voxel-space feature set is very high (i.e. – 15684). LDA attempts to map the input features onto a much smaller dimensional space (number of classes – 1; e.g. – 3 dimensions, for 4-class keypress decoding). Given the very high dimension of the voxel-space input features in this case, the resulting mapping exhibits reduced accuracy. Despite this general consideration, please refer to Figure 3—figure supplement 3, where we observe improvement in voxel-space decoder performance when utilizing alternative dimensionality reduction techniques.

      The decoders constructed in the present study assess the average spatial patterns across time (as defined by the windowing procedure) in the input feature space. We now provide additional details in the Methods of the revised manuscript pertaining to the parcellation procedure and how the sign ambiguity problem was addressed in our analysis.

      Weaknesses:

      A clear weakness of the paper lies in the authors' conclusions regarding "contextualization". Several potential confounds, described below, question the neurobiological implications proposed by the authors and provide a simpler explanation of the results. Furthermore, the paper follows the assumption that short breaks result in offline skill learning, while recent evidence, described below, casts doubt on this assumption.

      We thank the Reviewer for giving us the opportunity to address these issues in detail (see below).

      The authors interpret the ordinal position information captured by their decoding approach as a reflection of neural coding dedicated to the local context of a movement (Figure 4). One way to dissociate ordinal position information from information about the moving effectors is to train a classifier on one sequence and test the classifier on other sequences that require the same movements, but in different positions (Kornysheva et al., 2019). In the present study, however, participants trained to repeat a single sequence (4-1-3-2-4). As a result, ordinal position information is potentially confounded by the fixed finger transitions around each of the two critical positions (first and fifth press). Across consecutive correct sequences, the first keypress in a given sequence was always preceded by a movement of the index finger (=last movement of the preceding sequence), and followed by a little finger movement. The last keypress, on the other hand, was always preceded by a ring finger movement, and followed by an index finger movement (=first movement of the next sequence). Figure 4 - Supplement 2 shows that finger identity can be decoded with high accuracy (>70%) across a large time window around the time of the key press, up to at least +/-100 ms (and likely beyond, given that decoding accuracy is still high at the boundaries of the window depicted in that figure). This time window approaches the keypress transition times in this study. Given that distinct finger transitions characterized the first and fifth keypress, the classifier could thus rely on persistent (or "lingering") information from the preceding finger movement, and/or "preparatory" information about the subsequent finger movement, in order to dissociate the first and fifth keypress. 
Currently, the manuscript provides no evidence that the context information captured by the decoding approach is more than a by-product of temporally extended, and therefore overlapping, but independent neural representations of consecutive keypresses that are executed in close temporal proximity - rather than a neural representation dedicated to context.

      Such temporal overlap of consecutive, independent finger representations may also account for the dynamics of "ordinal coding"/"contextualization", i.e., the increase in 2-class decoding accuracy, across Day 1 (Figure 4C). As learning progresses, both tapping speed and the consistency of keypress transition times increase (Figure 1), i.e., consecutive keypresses are closer in time, and more consistently so. As a result, information related to a given keypress is increasingly overlapping in time with information related to the preceding and subsequent keypresses. The authors seem to argue that their regression analysis in Figure 5 - Figure Supplement 3 speaks against any influence of tapping speed on "ordinal coding" (even though that argument is not made explicitly in the manuscript). However, Figure 5 - Figure Supplement 3 shows inter-individual differences in a between-subject analysis (across trials, as in panel A, or separately for each trial, as in panel B), and, therefore, says little about the within-subject dynamics of "ordinal coding" across the experiment. A regression of trial-by-trial "ordinal coding" on trial-by-trial tapping speed (either within-subject or at a group-level, after averaging across subjects) could address this issue. Given the highly similar dynamics of "ordinal coding" on the one hand (Figure 4C), and tapping speed on the other hand (Figure 1B), I would expect a strong relationship between the two in the suggested within-subject (or group-level) regression. Furthermore, learning should increase the number of (consecutively) correct sequences, and, thus, the consistency of finger transitions. 
Therefore, the increase in 2-class decoding accuracy may simply reflect an increasing overlap in time of increasingly consistent information from consecutive keypresses, which allows the classifier to dissociate the first and fifth keypress more reliably as learning progresses, simply based on the characteristic finger transitions associated with each. In other words, given that the physical context of a given keypress changes as learning progresses - keypresses move closer together in time and are more consistently correct - it seems problematic to conclude that the mental representation of that context changes. To draw that conclusion, the physical context should remain stable (or any changes to the physical context should be controlled for).

      The issues raised by Reviewer #3 here are similar to two issues raised by Reviewer #2 above. We agree they must both be carefully considered in any evaluation of our findings.

      As both Reviewers pointed out, the classifiers in this study were trained and tested on keypresses performed while practicing a specific sequence (4-1-3-2-4). The study was designed this way to avoid the impact of interference effects on learning dynamics. The cross-validated performance of classifiers on MEG data collected within the same session was 90.47% overall accuracy (4-class; Figure 3C). We then tested classifier performance on data collected during a separate MEG session conducted approximately 24 hours later (Day 2; see Figure 3—figure supplement 3). We observed a reduction in overall accuracy rate to 87.11% when tested on MEG data recorded while participants performed the same learned sequence, and 79.44% when they performed several previously unpracticed sequences. This classification performance difference of 7.67% when tested on the Day 2 data could reflect the performance bias of the classifier for the trained sequence, possibly caused by mixed information from temporally close keypresses being incorporated into the feature weights.

      Along these same lines, both Reviewers also raise the possibility that an increase in “ordinal coding/contextualization” with learning could simply reflect an increase in this mixing effect caused by faster typing speeds as opposed to an actual change in the underlying neural representation. The basic idea is that as correct sequences are generated at higher and higher speeds over training, MEG activity patterns related to the planning, execution, evaluation and memory of individual keypresses overlap more in time. Thus, increased overlap between the “4” and “1” keypresses (at the start of the sequence) and “2” and “4” keypresses (at the end of the sequence) could artefactually increase contextualization distances even if the underlying neural representations for the individual keypresses remain unchanged (assuming this mixing of representations is used by the classifier to differentially tag each index finger press). If this were the case, it follows that such mixing effects reflecting the ordinal sequence structure would also be observable in the distribution of decoder misclassifications. For example, “4” keypresses would be more likely to be misclassified as “1” or “2” keypresses (or vice versa) than as “3” keypresses. The confusion matrices presented in Figures 3C and 4B and Figure 3—figure supplement 3A in the previously submitted manuscript do not show this trend in the distribution of misclassifications across the four fingers.

      Following this logic, it’s also possible that if the ordinal coding is largely driven by this mixing effect, the increased overlap between consecutive index finger keypresses during the 4-4 transition marking the end of one sequence and the beginning of the next one could actually mask contextualization-related changes to the underlying neural representations and make them harder to detect. In this case, a decoder tasked with separating individual index finger keypresses into two distinct classes based upon sequence position might show decreased performance with learning as adjacent keypresses overlapped in time with each other to an increasing extent. However, Figure 4C in our previously submitted manuscript does not support this possibility, as the 2-class hybrid classifier displays improved classification performance over early practice trials despite greater temporal overlap.

      As noted in the above reply to Reviewer #2, we also conducted a new multivariate regression analysis to directly assess whether the neural representation distance score could be predicted by the 4-1, 2-4 and 4-4 keypress transition times observed for each complete correct sequence (both predictor and response variables were z-score normalized within-subject). The results of this analysis affirmed that the possible alternative explanation put forward by the Reviewer is not supported by our data (Adjusted R<sup>2</sup> = 0.00431; F = 5.62). We now include this new negative control analysis result in the revised manuscript.
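
The two ingredients named in this control analysis, within-subject z-score normalization and the adjusted R<sup>2</sup> of a regression with several predictors, can be sketched as below. The helper names are hypothetical, the use of the sample standard deviation is an assumption, and the regression fit itself (which yields the unadjusted R<sup>2</sup>) is omitted.

```python
from collections import defaultdict
from statistics import mean, stdev

def zscore_within_subject(values, subjects):
    """Z-score each value against the mean and SD of its own subject.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    by_subject = defaultdict(list)
    for value, subject in zip(values, subjects):
        by_subject[subject].append(value)
    stats = {s: (mean(v), stdev(v)) for s, v in by_subject.items()}
    return [(value - stats[s][0]) / stats[s][1]
            for value, s in zip(values, subjects)]

def adjusted_r2(r2, n_observations, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_observations - 1) / (n_observations - n_predictors - 1)
```

With three predictors (the 4-1, 2-4 and 4-4 transition times), `n_predictors` would be 3; an adjusted R<sup>2</sup> near zero means the predictors account for almost none of the variance in the response variable.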

      Finally, the Reviewer hints that one way to address this issue would be to compare MEG responses before and after learning for sequences typed at a fixed speed. However, given that the speed-accuracy trade-off should improve with learning, a comparison between unlearned and learned skill states would dictate that the skill be evaluated at a very low fixed speed. Essentially, such a design presents the problem that the post-training test evaluates the representation in an unlearned behavioral state that is not representative of the acquired skill. Thus, this approach would miss most learning effects on a task in which speed is the main learning metric.

      A similar difference in physical context may explain why neural representation distances ("differentiation") differ between rest and practice (Figure 5). The authors define "offline differentiation" by comparing the hybrid space features of the last index finger movement of a trial (ordinal position 5) and the first index finger movement of the next trial (ordinal position 1). However, the latter is not only the first movement in the sequence but also the very first movement in that trial (at least in trials that started with a correct sequence), i.e., not preceded by any recent movement. In contrast, the last index finger of the last correct sequence in the preceding trial includes the characteristic finger transition from the fourth to the fifth movement. Thus, there is more overlapping information arising from the consistent, neighbouring keypresses for the last index finger movement, compared to the first index finger movement of the next trial. A strong difference (larger neural representation distance) between these two movements is, therefore, not surprising, given the task design, and this difference is also expected to increase with learning, given the increase in tapping speed, and the consequent stronger overlap in representations for consecutive keypresses. Furthermore, initiating a new sequence involves pre-planning, while ongoing practice relies on online planning (Ariani et al., eNeuro 2021), i.e., two mental operations that are dissociable at the level of neural representation (Ariani et al., bioRxiv 2023).

      The Reviewer argues that the last finger movement of a trial and the first finger movement of the next trial are performed in different circumstances and contexts. This is an important point and one we tend to agree with. For this task, the first sequence in a practice trial is pre-planned before the first keypress is performed. This occurs in a somewhat different context from the sequence iterations that follow, which involve temporally overlapping planning, execution and evaluation processes. The Reviewer is concerned that the temporal mixing effect raised above differs between the first and last keypresses performed in a trial. Please note that since neural representations of individual actions are competitively queued during the pre-planning period in a manner that reflects the ordinal structure of the learned sequence (Kornysheva et al., 2019), mixing effects are most likely also present for the first keypress in a trial.

      Separately, the Reviewer suggests that contextualization during early learning may reflect preplanning or online planning. This is an interesting proposal. Given the decoding time-window used in this investigation, we cannot dissect separate contributions of planning, memory and sensory feedback to contextualization. Taking advantage of the superior temporal resolution of MEG relative to fMRI tools, work under way in our lab is investigating decoding time-windows more appropriate to address each of these questions.

      Given these differences in the physical context and associated mental processes, it is not surprising that "offline differentiation", as defined here, is more pronounced than "online differentiation". For the latter, the authors compared movements that were better matched regarding the presence of consistent preceding and subsequent keypresses (online differentiation was defined as the mean difference between all first vs. last index finger movements during practice). It is unclear why the authors did not follow a similar definition for "online differentiation" as for "micro-online gains" (and, indeed, a definition that is more consistent with their definition of "offline differentiation"), i.e., the difference between the first index finger movement of the first correct sequence during practice, and the last index finger of the last correct sequence. While these two movements are, again, not matched for the presence of neighbouring keypresses (see the argument above), this mismatch would at least be the same across "offline differentiation" and "online differentiation", so they would be more comparable.

      This is the same point made earlier by Reviewer #2, and we agree with this assessment. As stated in the response to Reviewer #2 above, we have now carried out quantification of online contextualization using this approach and included it in the revised manuscript. We thank the Reviewer for this suggestion.

      A further complication in interpreting the results regarding "contextualization" stems from the visual feedback that participants received during the task. Each keypress generated an asterisk shown above the string on the screen, irrespective of whether the keypress was correct or incorrect. As a result, incorrect (e.g., additional, or missing) keypresses could shift the phase of the visual feedback string (of asterisks) relative to the ordinal position of the current movement in the sequence (e.g., the fifth movement in the sequence could coincide with the presentation of any asterisk in the string, from the first to the fifth). Given that more incorrect keypresses are expected at the start of the experiment, compared to later stages, the consistency in visual feedback position, relative to the ordinal position of the movement in the sequence, increased across the experiment. A better differentiation between the first and the fifth movement with learning could, therefore, simply reflect better decoding of the more consistent visual feedback, based either on the feedback-induced brain response, or feedback-induced eye movements (the study did not include eye tracking). It is not clear why the authors introduced this complicated visual feedback in their task, besides consistency with their previous studies.

      We strongly agree with the Reviewer that eye movements related to task engagement are important to rule out as a potential driver of the decoding accuracy or contextualization effect. We address this issue above in response to a question raised by Reviewer #1 about the impact of movement-related artefacts on our findings.

      First, the assumption the Reviewer makes here about the distribution of errors in this task is incorrect. On average across subjects, 2.32% ± 1.48% (mean ± SD) of all keypresses performed were errors, which were evenly distributed across the four possible keypress responses. While errors increased progressively over practice trials, they did so in proportion to the increase in correct keypresses, so that the overall ratio of correct-to-incorrect keypresses remained stable over the training session. Thus, the Reviewer’s assumptions that there is a higher relative frequency of errors in early trials, and a resulting systematic trend in phase-shift differences between the visual display updates (i.e. – a change in asterisk position above the displayed sequence) and the keypress performed, are not substantiated by the data. To the contrary, the asterisk position on the display and the keypress being executed remained highly correlated over the entire training session. We now include a statement about the frequency and distribution of errors in the revised manuscript.

      Given this high correlation, we firmly agree with the Reviewer that the issue of eye movement related artefacts is still an important one to address. Fortunately, we did collect eye movement data during the MEG recordings so were able to investigate this. As detailed in the response to Reviewer #1 above, we found that gaze positions and eye-movement velocity time-locked to visual display updates (i.e. – a change in asterisk position above the displayed sequence) did not reflect the asterisk location above chance levels (Overall cross-validated accuracy = 0.21817; see Author response image 1). Furthermore, an inspection of the eye position data revealed that most participants on most trials displayed random walk gaze patterns around a center fixation point, indicating that participants did not attend to the asterisk position on the display. This is consistent with intrinsic generation of the action sequence, and congruent with the fact that the display does not provide explicit feedback related to performance. As pointed out above, a similar real-world example would be manually inputting a long password into a secure online application. In this case, one intrinsically generates the sequence from memory and receives similar feedback about the password sequence position (also provided as asterisks), which is typically ignored by the user.

      The minimal participant engagement with the visual display in this explicit sequence learning motor task (which is highly generative in nature) contrasts markedly with behavior observed when reactive responses to stimulus cues are needed in the serial reaction time task (SRTT). This is a crucial difference that must be carefully considered when comparing findings across studies using the two sequence learning tasks.

      The authors report a significant correlation between "offline differentiation" and cumulative micro-offline gains. However, it would be more informative to correlate trial-by-trial changes in each of the two variables. This would address the question of whether there is a trial-by-trial relation between the degree of "contextualization" and the amount of micro-offline gains - are performance changes (micro-offline gains) less pronounced across rest periods for which the change in "contextualization" is relatively low? Furthermore, is the relationship between micro-offline gains and "offline differentiation" significantly stronger than the relationship between micro-offline gains and "online differentiation"?

      In response to a similar issue raised above by Reviewer #2, we now include new analyses comparing correlation magnitudes between (1) “online differentiation” vs micro-online gains, (2) “online differentiation” vs micro-offline gains and (3) “offline differentiation” vs micro-offline gains (see Figure 5 – figure supplement 4, 5 and 6). These new analyses and results have been added to the revised manuscript. Once again, we thank both Reviewers for this suggestion.

      The authors follow the assumption that micro-offline gains reflect offline learning.

      We disagree with this statement. The original (Bonstrup et al., 2019) paper clearly states that micro-offline gains do not necessarily reflect offline learning in some cases and must be carefully interpreted based upon the behavioral context within which they are observed. Further, the paper lays out the conditions under which one can have confidence that micro-offline gains reflect offline learning. In fact, the excellent meta-analysis of (Pan & Rickard, 2015), which re-interprets the benefits of sleep in overnight skill consolidation from a “reactive inhibition” perspective, was a crucial resource in the experimental design of our initial study (Bonstrup et al., 2019), as well as in all our subsequent work. Pan & Rickard state:

      “Empirically, reactive inhibition refers to performance worsening that can accumulate during a period of continuous training (Hull, 1943). It tends to dissipate, at least in part, when brief breaks are inserted between blocks of training. If there are multiple performance-break cycles over a training session, as in the motor sequence literature, performance can exhibit a scalloped effect, worsening during each uninterrupted performance block but improving across blocks (Brawn et al., 2010; Rickard et al., 2008). Rickard, Cai, Rieth, Jones, and Ard (2008) and Brawn, Fenn, Nusbaum, and Margoliash (2010) demonstrated highly robust scalloped reactive inhibition effects using the commonly employed 30 s–30 s performance break cycle, as shown for Rickard et al.’s (2008) massed practice sleep group in Figure 2. The scalloped effect is evident for that group after the first few 30 s blocks of each session. The absence of the scalloped effect during the first few blocks of training in the massed group suggests that rapid learning during that period masks any reactive inhibition effect.”

      Crucially, Pan & Rickard make several concrete recommendations for reducing the impact of the reactive inhibition confound on offline learning studies. One of these recommendations was to reduce practice times to 10s (most prior sequence learning studies up until that point had employed 30s long practice trials). They state:

      “The traditional design involving 30 s-30 s performance break cycles should be abandoned given the evidence that it results in a reactive inhibition confound, and alternative designs with reduced performance duration per block used instead (Pan & Rickard, 2015). One promising possibility is to switch to 10 s performance durations for each performance-break cycle (Pan & Rickard, 2015). That design appears sufficient to eliminate at least the majority of the reactive inhibition effect (Brawn et al., 2010; Rickard et al., 2008).”

      We mindfully incorporated recommendations from (Pan & Rickard, 2015) into our own study designs including 1) utilizing 10s practice trials and 2) constraining our analysis of micro-offline gains to early learning trials (where performance monotonically increases and 95% of overall performance gains occur), which are prior to the emergence of the “scalloped” performance dynamics that are strongly linked to reactive inhibition effects.

      However, there is no direct evidence in the literature that micro-offline gains really result from offline learning, i.e., an improvement in skill level.

      We strongly disagree with the Reviewer’s assertion that “there is no direct evidence in the literature that micro-offline gains really result from offline learning, i.e., an improvement in skill level.” The initial (Bonstrup et al., 2019) report was followed up by a large online crowd-sourcing study (Bonstrup et al., 2020). This second (and much larger) study provided several additional important findings supporting our interpretation of micro-offline gains in cases where the important behavioral conditions clarified above were met (see Author response image 4 below for further details on these conditions).

      Author response image 4.

      This figure shows that micro-offline gains observed in learning and non-learning contexts are attributed to different underlying causes. Micro-offline and online changes relative to overall trial-by-trial learning. This figure is based on data from (Bonstrup et al., 2019). During early learning, micro-offline gains (red bars) closely track trial-by-trial performance gains (green line with open circle markers), with minimal contribution from micro-online gains (blue bars). The stated conclusion in Bönstrup et al. (2019) is that micro-offline gains only during this Early Learning stage reflect rapid memory consolidation (see also (Bonstrup et al., 2020)). After early learning, at about practice trial 11, skill plateaus. This plateau skill period is characterized by a striking emergence of coupled (and relatively stable) micro-online drops and micro-offline increases. Bönstrup et al. (2019), as well as others in the literature (Brooks et al., 2024; Gupta & Rickard, 2022; Florencia Jacobacci et al., 2020), argue that micro-offline gains during the plateau period likely reflect recovery from inhibitory performance factors such as reactive inhibition or fatigue, and thus must be excluded from analyses relating micro-offline gains to skill learning. The Non-repeating groups in Experiments 3 and 4 from Das et al. (2024) suffer from a lack of consideration of these known confounds (end of Fig legend).

      Evidence documented in that paper (Bonstrup et al., 2020) showed that micro-offline gains during early skill learning were: 1) replicable and generalized to subjects learning the task in their daily living environment (n=389); 2) equivalent when significantly shortening practice period duration, thus confirming that they are not a result of recovery from performance fatigue (n=118); 3) reduced (along with learning rates) by retroactive interference applied immediately after each practice period relative to interference applied after passage of time (n=373), indicating stabilization of the motor memory at a microscale of several seconds consistent with rapid consolidation; and 4) not modified by random termination of the practice periods, ruling out a contribution of predictive motor slowing (n=71) (Bonstrup et al., 2020). Altogether, our findings were strongly consistent with the interpretation that micro-offline gains reflect memory consolidation supporting early skill learning. This is precisely the portion of the learning curve that Pan & Rickard (2015) refer to when they state “…rapid learning during that period masks any reactive inhibition effect”.

      This interpretation is further supported by brain imaging evidence linking known memory-related networks and consolidation mechanisms to micro-offline gains. First, we reported that the density of fast hippocampo-neocortical skill memory replay events increases approximately three-fold during early learning inter-practice rest periods with the density explaining differences in the magnitude of micro-offline gains across subjects (Buch et al., 2021). Second, Jacobacci et al. (2020) independently reproduced our original behavioral findings and reported BOLD fMRI changes in the hippocampus and precuneus (regions also identified in our MEG study (Buch et al., 2021)) linked to micro-offline gains during early skill learning. These functional changes were coupled with rapid alterations in brain microstructure in the order of minutes, suggesting that the same network that operates during rest periods of early learning undergoes structural plasticity over several minutes following practice (Deleglise et al., 2023). Crucial to this point, Chen et al. (2024) and Sjøgård et al (2024) provided direct evidence from intracranial EEG in humans linking sharp-wave ripple density during rest periods (which are known markers for neural replay (Buzsaki, 2015)) in the human hippocampus (80-120 Hz) to micro-offline gains during early skill learning.

      Thus, there is now substantial converging evidence in humans across different indirect noninvasive and direct invasive recording techniques linking hippocampal activity, neural replay dynamics and offline performance gains in skill learning.

      On the contrary, recent evidence questions this interpretation (Gupta & Rickard, npj Sci Learn 2022; Gupta & Rickard, Sci Rep 2024; Das et al., bioRxiv 2024). Instead, there is evidence that micro-offline gains are transient performance benefits that emerge when participants train with breaks, compared to participants who train without breaks, however, these benefits vanish within seconds after training if both groups of participants perform under comparable conditions (Das et al., bioRxiv 2024).

      The recent work of (Gupta & Rickard, 2022, 2024) does not present any data that directly opposes our finding that early skill learning (Bonstrup et al., 2019) is expressed as micro-offline gains during rest breaks. These studies are an extension of the Rickard et al (2008) paper that employed a massed (30s practice followed by 30s breaks) vs spaced (10s practice followed by 10s breaks) experimental design to assess if recovery from reactive inhibition effects could account for performance gains measured after several minutes or hours. Gupta & Rickard (2022) added two additional groups (30s practice/10s break and 10s practice/10s break as used in the work from our group). The primary aim of the study was to assess whether it was more likely that changes in performance when retested 5 minutes after skill training (consisting of 12 practice trials for the massed groups and 36 practice trials for the spaced groups) had ended reflected memory consolidation effects or recovery from reactive inhibition effects. The Gupta & Rickard (2024) follow-up paper employed a similar design with the primary difference being that participants performed a fixed number of sequences on each trial as opposed to trials lasting a fixed duration. This was done to facilitate the fitting of a quantitative statistical model to the data.

      To reiterate, neither study included any analysis of micro-online or micro-offline gains, nor any comparison focused on skill gains during early learning trials (only at retest 5 min later). Instead, Gupta & Rickard (2022) reported evidence for reactive inhibition effects for all groups over much longer training periods than early learning. In fact, we reported the same findings for trials following the early learning period in our original 2019 paper (Bonstrup et al., 2019) (Author response image 4). Please note that we also reported that cumulative micro-offline gains over early learning did not correlate with overnight offline consolidation measured 24 hours later (Bonstrup et al., 2019) (see the Results section and further elaboration in the Discussion). We interpreted these findings as indicative that the mechanisms underlying offline gains over the micro-scale of seconds during early skill learning versus over minutes or hours very likely differ.

      In the recent preprint from (Das et al., 2024), the authors make the strong claim that “micro-offline gains during early learning do not reflect offline learning” which is not supported by their own data. The authors hypothesize that if “micro-offline gains represent offline learning, participants should reach higher skill levels when training with breaks, compared to training without breaks”. The study utilizes a spaced vs. massed practice groups between-subjects design inspired by the reactive inhibition work from Rickard and others to test this hypothesis.

      Crucially, their design incorporates only a small fraction of the training used in other investigations to evaluate early skill learning (Bonstrup et al., 2020; Bonstrup et al., 2019; Brooks et al., 2024; Buch et al., 2021; Deleglise et al., 2023; F. Jacobacci et al., 2020; Mylonas et al., 2024). A direct comparison between the practice schedule designs for the spaced and massed groups in Das et al., and the training schedule all participants experienced in the original Bönstrup et al. (2019) paper highlights this issue as well as several others (Author response image 5):

      Author response image 5.

      This figure shows (A) a comparison of the Das et al. Spaced & Massed group training session designs with the training session design from the original (Bonstrup et al., 2019) paper. Similar to the approach taken by Das et al., all practice is visualized as 10-second practice trials with a variable number (either 0, 1 or 30) of 10-second-long inter-practice rest intervals to allow for direct comparisons between designs. The two key takeaways from this comparison are that (1) the intervention differences (i.e. – practice schedules) between the Massed and Spaced groups from the Das et al. report are extremely small (less than 12% of the overall session schedule) (gaps in the red shaded area) and (2) the overall amount of practice is much less than in the design from the original Bönstrup report (Bonstrup et al., 2019) (which has been utilized in several subsequent studies). (B) Group-level learning curve data from Bönstrup et al. (2019) is used to estimate the performance range accounted for by the equivalent periods covering Test 1, Training 1 and Test 2 from Das et al. (2024). Note that the intervention in the Das et al. study is limited to a period covering less than 50% of the overall learning range (end of figure legend).

      Participants in the original study (Bonstrup et al., 2019) experienced 157.14% more practice time and 46.97% less inter-practice rest time than the Spaced group in the Das et al. study (Author response image 5). Thus, the overall amounts of practice and rest differ substantially between studies, with much more limited training occurring for participants in Das et al.

      In addition, the training interventions (i.e. – the practice schedule differences between the Spaced and Massed groups) were designed in a manner that minimized any chance of effectively testing their hypothesis. First, the interventions were applied over an extremely short period relative to the length of the total training session (5% and 12% of the total training session for Massed and Spaced groups, respectively; see gaps in the red shaded area in Author response image 5). Second, the intervention was applied during a period in which only half of the known total learning occurs. Specifically, we know from Bönstrup et al. (2019) that only 46.57% of the total performance gains occur in the practice interval covered by the Das et al. Training 1 intervention. Thus, early skill learning as evaluated by multiple groups (Bonstrup et al., 2020; Bonstrup et al., 2019; Brooks et al., 2024; Buch et al., 2021; Deleglise et al., 2023; F. Jacobacci et al., 2020; Mylonas et al., 2024) is truncated to about half in the Das et al. experiment.

      Furthermore, a substantial amount of learning takes place during Das et al.’s Test 1 and Test 2 periods (32.49% of total gains combined). The fact that substantial learning is known to occur over both the Test 1 (18.06%) and Test 2 (14.43%) intervals presents a fundamental problem described by Pan and Rickard (Pan & Rickard, 2015). They reported that averaging over intervals where substantial performance gains occur (i.e. – performance is not stable) injects crucial artefacts into analyses of skill learning:

      “A large amount of averaging has the advantage of yielding more precise estimates of each subject’s pretest and posttest scores and hence more statistical power to detect a performance gain. However, calculation of gain scores using that strategy runs the risk that learning that occurs during the pretest and (or) posttest periods (i.e., online learning) is incorporated into the gain score (Rickard et al., 2008; Robertson et al., 2004).”

      The above statement indicates that the Test 1 and Test 2 performance scores from Das et al. (2024) are substantially contaminated by the learning rate within these intervals. This is particularly problematic if the intervention design results in different Test 2 learning rates between the two groups. This, in fact, is apparent in their data (Figure 1C,E of the Das et al., 2024 preprint) as the Test 2 learning rate for the Spaced group is negative (indicating a unique interference effect observable only for this group). Specifically, the Massed group continues to show an increase in performance during Test 2 and 4 relative to the last 10 seconds of practice during Training 1 and 2, respectively, while the Spaced group displays a marked decrease. This post-training performance decrease for the Spaced group is in stark contrast to the monotonic performance increases observed for both groups at all other time-points. One possible cause could be related to the structure of the Test intervals, which include 20 seconds of uninterrupted practice. For the Spaced group, this effectively is a switch to a Massed practice environment (i.e., two 10-second-long practice trials merged into one long trial), which interferes with the greater Training 1 interval gains observed for the Spaced group. Interestingly, when statistical comparisons between the groups are made at the time-points when the intervention is present (Figure 1E) then the stated hypothesis, “If micro-offline gains represent offline learning, participants should reach higher skill levels when training with breaks, compared to training without breaks”, is confirmed.

      In summary, the experimental design and analyses used by Das et al. do not contradict the view that early skill learning is expressed as micro-offline gains during rest breaks. The data presented by Gupta and Rickard (2022, 2024) and Das et al. (2024) is in many ways more confirmatory of the constraints employed by our group and others with respect to experimental design, analysis and interpretation of study findings, rather than contradictory. Still, it does highlight a limitation of the current micro-online/offline framework, which was originally only intended to be applied to early skill learning over spaced practice schedules when reactive inhibition effects are minimized (Bonstrup et al., 2019; Pan & Rickard, 2015). Extrapolation of this current framework to post-plateau performance periods, longer timespans, or non-learning situations (e.g. – the Non-repeating groups from Das et al. (2024)), when reactive inhibition plays a more substantive role, is not warranted. Ultimately, it will be important to develop new paradigms allowing one to independently estimate the different coincident or antagonistic features (e.g. - memory consolidation, planning, working memory and reactive inhibition) contributing to micro-online and micro-offline gains during and after early skill learning within a unifying framework.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) I found Figure 2B too small to be useful, as the actual elements of the cells are very hard to read.

      We have removed the grid colormap panel (top-right) from Figure 2B. All of this colormap data is actually a subset of data presented in Figure 2 – figure supplement 1, so can still be found there.

      Reviewer #2 (Recommendations for the authors):

      (1) Related to the first point in my concerns, I would suggest the authors compare decoding accuracy between correct presses followed by correct vs. incorrect presses. This would clarify if the decoder is actually taking the MEG signal for subsequent press into account. I would also suggest the authors use pre-movement MEG features and post-movement features with shorter windows and compare each result with the results for the original post-movement MEG feature with a longer window.

      The present study does not contain enough errors to perform the analysis proposed by the Reviewer. As noted above, we did re-examine our data and now report a new control regression analysis, which indicates that the proximity between keypresses does not explain contextualization effects.

      (2) I was several times confused by the author's use of "neural representation of an action" or "sequence action representations" in understanding whether these terms refer to representation on the level of whole-brain, region (as defined by the specific parcellation used), or voxels. In fact, what is submitted to the decoder is some complicated whole-brain MEG feature (i.e., the "neural representation"), which is a hybrid of voxel and parcel features that is further dimension-reduced and not immediately interpretable. Clarifying this point early in the text and possibly using some more sensible terms, such as adding "brain-wise" before the "sequence action representation", would be the most helpful for the readers.

      We have now clarified this terminology in the revised manuscript.

      (3) Although comparing many different ways in feature selection/reduction, time window selection, and decoder types is undoubtedly a meticulous work, the current version of the manuscript seems still lacking some explanation about the details of these methodological choices, like which decoding method was actually used to report the accuracy, whether or not different decoding methods were chosen for individual participants' data, how training data was selected (is it all of the correct presses in Day 1 data?), whether the frequency power or signal amplitude was used, and so on. I would highly appreciate these additional details in the Methods section.

      The reported accuracies were based on a linear discriminant analysis (LDA) classifier. A comparison of different decoders (Figure 3 – figure supplement 4) shows that LDA was the optimal choice.

      Whether or not different decoding methods were chosen for individual participants' data

      We used the same decoder (LDA) for all participants when reporting the final accuracy.

      How training data was selected (is it all of the correct presses in Day 1 data?),

      Decoder training was conducted as a randomized split of the data (all correct keypresses of Day 1) into training (90%) and test (10%) samples for 8 iterations.

      Whether the frequency power or signal amplitude was used

      Signal amplitude was used for feature calculation.
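
      The randomized 90%/10% split repeated for 8 iterations, as described above, can be sketched as follows. This is a minimal illustration only, not the analysis code used in the study: the function name `random_split_indices`, the sample count of 200 and the fixed seed are hypothetical, and the classifier fitting step (LDA in the manuscript) is omitted. Whether the original splits were stratified by keypress class is not stated, so none is assumed here.

```python
import random


def random_split_indices(n_samples, test_fraction=0.10, n_iterations=8, seed=0):
    """Yield (train, test) index lists for repeated randomized hold-out splits.

    Hypothetical helper: each iteration reshuffles all sample indices and
    holds out `test_fraction` of them for testing.
    """
    rng = random.Random(seed)
    n_test = max(1, round(n_samples * test_fraction))
    for _ in range(n_iterations):
        order = list(range(n_samples))
        rng.shuffle(order)
        # First n_test shuffled indices form the test set, the rest the training set.
        yield sorted(order[n_test:]), sorted(order[:n_test])


# Example with a hypothetical 200 correct keypresses.
splits = list(random_split_indices(n_samples=200))
```

      In practice, a decoder would be fit on each training set and scored on the matching test set, with accuracy averaged over the 8 iterations.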

      (4) In terms of the Methods, please consider adding some references about the 'F1 score', the 'feature importance score,' and the 'MRMR-based feature ranking,' as the main readers of the current paper would not be from the machine learning community. Also, why did the LDA dimensionality reduction reduce accuracy specifically for the voxel feature?

      We have now added the following statements to the Methods section that provide more detailed descriptions and references for these metrics:

      “The F1 score, defined as the harmonic mean of the precision (percentage of true predictions that are actually true positive) and recall (percentage of true positives that were correctly predicted as true) scores, was used as a comprehensive metric for all one-versus-all keypress state decoders to assess class-wise performance that accounts for both false-positive and false-negative prediction tendencies [REF]. A weighted mean F1 score was then computed across all classes to assess the overall prediction performance of the multi-class model.”

      and

      “Feature Importance Scores

      The relative contribution of source-space voxels and parcels to decoding performance (i.e. – feature importance score) was calculated using minimum redundant maximum relevance (MRMR) and highlighted in topography plots. MRMR, an approach that combines both relevance and redundancy metrics, ranked individual features based upon their significance to the target variable (i.e. – keypress state identity) prediction accuracy and their non-redundancy with other features.”

      As stated in the Reviewer responses above, the dimensionality of the voxel-space feature set is very high (i.e. – 15684). LDA attempts to map the input features onto a much smaller dimensional space (number of classes-1; e.g. – 3 dimensions for 4-class keypress decoding). It is likely that the reduction in accuracy observed only for the voxel-space feature was due to the loss of relevant information during the mapping process that resulted in reduced accuracy. This reduction in accuracy for voxel-space decoding was specific to LDA. Figure 3—figure supplement 3 shows that voxel-space decoder performance actually improved when utilizing alternative dimensionality reduction techniques.
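
      To make the F1 definition quoted above concrete, the following minimal sketch computes one-versus-all F1 scores and their weighted mean. The source does not state how the mean is weighted; weighting by the number of true samples per class is an assumption here, and the function names and example labels are hypothetical.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} treating each label one-versus-all."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # Harmonic mean of precision and recall.
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores


def weighted_mean_f1(y_true, y_pred):
    """Mean of per-class F1, weighted by class support (an assumed weighting)."""
    support = Counter(y_true)
    scores = per_class_f1(y_true, y_pred)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Hypothetical keypress labels: one class-1 press is misclassified as class 2.
y_true = [1, 1, 2, 2, 3, 4]
y_pred = [1, 2, 2, 2, 3, 4]
f1_by_class = per_class_f1(y_true, y_pred)
overall_f1 = weighted_mean_f1(y_true, y_pred)
```

      In this toy example the missed class-1 keypress lowers recall for class 1 and precision for class 2, so both error types are reflected in the class-wise scores.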

      (5) Paragraph 9, lines #139-142: "Notably, decoding associated with index finger keypresses (executed at two different ordinal positions in the sequence) exhibited the highest number of misclassifications of all digits (N = 141 or 47.5% of all decoding errors; Figure 3C), raising the hypothesis that the same action could be differentially represented when executed at different learning state or sequence context locations."

      This does not seem to be a fair comparison, as the index finger appears twice as many as the other fingers do in the sequence. To claim this, proper statistical analysis needs to be done taking this difference into account.

      We thank the Reviewer for bringing this issue to our attention. We have now corrected this comparison to evaluate relative false negative and false positive rates between individual keypress state decoders, and have revised this statement in the manuscript as follows:

      “Notably, decoding of index finger keypresses (executed at two different ordinal positions in the sequence) exhibited the highest false negative (0.116 per keypress) and false positive (0.043 per keypress) misclassification rates compared with all other digits (false negative rate range = [0.067 0.114]; false positive rate range = [0.020 0.037]; Figure 3C), raising the hypothesis that the same action could be differentially represented when executed within different contexts (i.e. - different learning states or sequence locations).”

      (6) Finally, the authors could consider acknowledging in the Discussion that the contribution of micro-offline learning to genuine skill learning is still under debate (e.g., Gupta and Rickard, 2023; 2024; Das et al., bioRxiv, 2024).

      We have added a paragraph in the Discussion that addresses this point.

      Reviewer #3 (Recommendations for the authors):

      In addition to the additional analyses suggested in the public review, I have the following suggestions/questions:

      (1) Given that the authors introduce a new decoding approach, it would be very helpful for readers to see a distribution of window sizes and window onsets eventually used across individuals, at least for the optimized decoder.

      We have now included a new supplemental figure (Figure 4 – figure Supplement 2) that provides this information.

      (2) Please explain in detail how you arrived at the (interpolated?) group-level plot shown in Figure 1B, starting from the discrete single-trial keypress transition times. Also, please specify what the shading shows.

      Instantaneous correct sequence speed (skill measure) was quantified as the inverse of time (in seconds) required to complete a single iteration of a correctly generated full 5-item sequence. Individual keypress responses were labeled as members of correct sequences if they occurred within a 5-item response pattern matching any possible circular shifts of the 5-item sequence displayed on the monitor (41324). This approach allowed us to quantify a measure of skill within each practice trial at the resolution of individual keypresses. The dark line indicates the group mean performance dynamics for each trial. The shaded region indicates the 95% confidence limit of the mean (see Methods).
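
      The labeling rule described above can be sketched as follows. This is an illustrative simplification rather than the study's analysis code: the function names are hypothetical, and the time to complete a sequence iteration is assumed here to be the interval between the first and last keypress of each matching 5-item window (the response does not specify the exact timing convention).

```python
SEQUENCE = "41324"


def circular_shifts(seq):
    """All rotations of the sequence, e.g. 41324, 13244, 32441, ..."""
    return {seq[i:] + seq[:i] for i in range(len(seq))}


def correct_sequence_speeds(keys, times, seq=SEQUENCE):
    """Label keypresses belonging to correct sequences and compute speeds.

    keys  : string of pressed digits, in order
    times : keypress times in seconds, same length as keys
    Returns (is_correct, speeds), where speeds is a list of
    (time of last keypress in window, sequences per second).
    """
    n = len(seq)
    targets = circular_shifts(seq)
    is_correct = [False] * len(keys)
    speeds = []
    for start in range(len(keys) - n + 1):
        if keys[start:start + n] in targets:
            for k in range(start, start + n):
                is_correct[k] = True
            # Assumed convention: duration spans first to last press of the window.
            duration = times[start + n - 1] - times[start]
            speeds.append((times[start + n - 1], 1.0 / duration))
    return is_correct, speeds


# Hypothetical trial: a correct sequence, one erroneous keypress, another correct sequence.
keys = "41324" + "3" + "41324"
times = [i * 0.25 for i in range(len(keys))]
is_correct, speeds = correct_sequence_speeds(keys, times)
```

      Because every keypress can close its own 5-item window, the speed estimate updates at the resolution of individual keypresses rather than once per completed sequence.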

      (3) Similarly, please explain how you arrived at the group-level plot shown in Figure 1C. What are the different colored lines (rows) within each trial? How exactly did the authors reach the conclusion that KTT variability stabilizes by trial 6?

      Figure 1C complements the correct sequence speed measure above by also tracking the composition of individual transition speeds over learning. Figure 1C thus represents both changes in overall correct sequence speed dynamics (indicated by the overall narrowing of the horizontal speed lines moving from top to bottom) and the underlying composition of the individual transition patterns within and across trials. The coloring of the lines is a shading convention used to discriminate between different keypress transitions. These curves were sampled with 1 ms resolution, as in Figure 1B. Addressing the underlying keypress transition patterns requires within-subject normalization before averaging across subjects. The distribution of KTTs was normalized to the median correct sequence time for each participant and centered on the mid-point for each full sequence iteration during early learning.

      (4) Maybe I missed it, but it was not clear to me which of the tested classifiers was eventually used. Or was that individualized as well? More generally, a comparison of the different classifiers would be helpful, similar to the comparison of dimension reduction techniques.

      We have now included a new supplemental figure that provides this information.

      (5) Please add df and effect sizes to all statistics.

      Done.

      (6) Please explain in more detail your power calculation.

      A power analysis was performed to determine the minimum sample size needed to detect a significant change in skill performance following training using a one-sample t-test (two-sided; alpha = 0.05; 95% statistical power; Cohen’s d effect size = 0.8115, calculated from previously acquired data in our lab). The calculated minimum sample size was 22. The included study sample size (n = 27) exceeded this minimum.

      This information is now included in the revised manuscript.
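
      As a rough cross-check of the sample-size figure quoted above, the calculation can be approximated with the Python standard library. This sketch uses a normal approximation with a commonly used small-sample correction term; it is not necessarily the procedure or software the authors used, and an exact calculation would rely on the noncentral t distribution. The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def approx_n_one_sample_t(d: float, alpha: float = 0.05, power: float = 0.95) -> int:
    """Approximate minimum n for a two-sided one-sample t-test.

    Normal approximation: n ~ ((z_{1-alpha/2} + z_{power}) / d)^2,
    plus a small-sample correction of z_{1-alpha/2}^2 / 2 that accounts
    for estimating the standard deviation from the data.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    n_normal = ((z_alpha + z_power) / d) ** 2
    return ceil(n_normal + z_alpha ** 2 / 2)


if __name__ == "__main__":
    # Effect size, alpha and power as stated in the response.
    print(approx_n_one_sample_t(0.8115))
```

      With the stated inputs (d = 0.8115, alpha = 0.05, power = 0.95), this approximation gives a minimum sample size consistent with the value of 22 reported above.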

      (7) The cut-off for the high-pass filter is unusually high and seems risky in terms of potential signal distortions (de Cheveigne, Neuron 2019). Why did the authors choose such a high cut-off?

      The 1 Hz high-pass cut-off frequency for the 1-150 Hz band-pass filter applied to the continuous raw MEG data during preprocessing has been used in multiple previous MEG publications (Barratt et al., 2018; Brookes et al., 2012; Higgins et al., 2021; Seedat et al., 2020; Vidaurre et al., 2018).

      (8) "Furthermore, the magnitude of offline contextualization predicted skill gains while online contextualization did not", lines 336/337 - where is that analysis?

      Additional details pertaining to this analysis are now provided in the Results section (Figure 5 – figure supplement 4).

      (9) How were feature importance scores computed?

      We have now added a new subheading in the Methods section with a more detailed description of how feature importance scores were computed.

      (10) Please add x and y ticks plus tick labels to Figure 5 - Figure Supplement 3, panel A.

      Done.

      (11) Line 369, what does "comparable" mean in this context?

      The sentence in the “Study Participants” part of the Methods section referred to here has now been revised for clarity.

      (12) In lines 496/497, please specify what t=0 means (KeyDown event, I guess?).

      Yes, the KeyDown event occurs at t = 0. This has now been clarified in the revised manuscript.

      (13) Please specify consistent boundaries between alpha- and beta-bands (they are currently not consistent in the Results vs. Methods (14/15 Hz or 15/16 Hz)).

      We thank the Reviewer for alerting us to this discrepancy caused by a typographic error in the Methods. We have now corrected this so that the alpha (8-14 Hz) and beta-band (15-24 Hz) frequency limits are described consistently throughout the revised manuscript.

      References

      Albouy, G., Fogel, S., King, B. R., Laventure, S., Benali, H., Karni, A., Carrier, J., Robertson, E. M., & Doyon, J. (2015). Maintaining vs. enhancing motor sequence memories: respective roles of striatal and hippocampal systems. Neuroimage, 108, 423-434. https://doi.org/10.1016/j.neuroimage.2014.12.049

      Albouy, G., King, B. R., Maquet, P., & Doyon, J. (2013). Hippocampus and striatum: dynamics and interaction during acquisition and sleep-related motor sequence memory consolidation. Hippocampus, 23(11), 985-1004. https://doi.org/10.1002/hipo.22183

      Albouy, G., Sterpenich, V., Vandewalle, G., Darsaud, A., Gais, S., Rauchs, G., Desseilles, M., Boly, M., Dang-Vu, T., Balteau, E., Degueldre, C., Phillips, C., Luxen, A., & Maquet, P. (2012). Neural correlates of performance variability during motor sequence acquisition. NeuroImage, 60(1), 324-331. https://doi.org/10.1016/j.neuroimage.2011.12.049

      Andersen, R. A., & Buneo, C. A. (2002). Intentional maps in posterior parietal cortex. Annu Rev Neurosci, 25, 189-220. https://doi.org/10.1146/annurev.neuro.25.112701.142922

      Ashe, J., Lungu, O. V., Basford, A. T., & Lu, X. (2006). Cortical control of motor sequences. Curr Opin Neurobiol, 16(2), 213-221. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=16563734

      Bansal, A. K., Vargas-Irwin, C. E., Truccolo, W., & Donoghue, J. P. (2011). Relationships among low-frequency local field potentials, spiking activity, and three-dimensional reach and grasp kinematics in primary motor and ventral premotor cortices. J Neurophysiol, 105(4), 1603-1619. https://doi.org/10.1152/jn.00532.2010

      Barratt, E. L., Francis, S. T., Morris, P. G., & Brookes, M. J. (2018). Mapping the topological organisation of beta oscillations in motor cortex using MEG. NeuroImage, 181, 831-844. https://doi.org/10.1016/j.neuroimage.2018.06.041

      Bassett, D. S., Wymbs, N. F., Porter, M. A., Mucha, P. J., Carlson, J. M., & Grafton, S. T. (2011). Dynamic reconfiguration of human brain networks during learning. Proc Natl Acad Sci U S A, 108(18), 7641-7646. https://doi.org/10.1073/pnas.1018985108

      Battaglia-Mayer, A., & Caminiti, R. (2019). Corticocortical Systems Underlying High-Order Motor Control. J Neurosci, 39(23), 4404-4421. https://doi.org/10.1523/JNEUROSCI.2094-18.2019

      Berlot, E., Popp, N. J., & Diedrichsen, J. (2020). A critical re-evaluation of fMRI signatures of motor sequence learning. Elife, 9. https://doi.org/10.7554/eLife.55241

      Bonstrup, M., Iturrate, I., Hebart, M. N., Censor, N., & Cohen, L. G. (2020). Mechanisms of offline motor learning at a microscale of seconds in large-scale crowdsourced data. NPJ Sci Learn, 5, 7. https://doi.org/10.1038/s41539-020-0066-9

      Bonstrup, M., Iturrate, I., Thompson, R., Cruciani, G., Censor, N., & Cohen, L. G. (2019). A Rapid Form of Offline Consolidation in Skill Learning. Curr Biol, 29(8), 1346-1351 e1344. https://doi.org/10.1016/j.cub.2019.02.049

      Brawn, T. P., Fenn, K. M., Nusbaum, H. C., & Margoliash, D. (2010). Consolidating the effects of waking and sleep on motor-sequence learning. J Neurosci, 30(42), 13977-13982. https://doi.org/10.1523/JNEUROSCI.3295-10.2010

      Brookes, M. J., Woolrich, M. W., & Barnes, G. R. (2012). Measuring functional connectivity in MEG: a multivariate approach insensitive to linear source leakage. NeuroImage, 63(2), 910-920. https://doi.org/10.1016/j.neuroimage.2012.03.048

      Brooks, E., Wallis, S., Hendrikse, J., & Coxon, J. (2024). Micro-consolidation occurs when learning an implicit motor sequence, but is not influenced by HIIT exercise. NPJ Sci Learn, 9(1), 23. https://doi.org/10.1038/s41539-024-00238-6

      Buch, E. R., Claudino, L., Quentin, R., Bonstrup, M., & Cohen, L. G. (2021). Consolidation of human skill linked to waking hippocampo-neocortical replay. Cell Rep, 35(10), 109193. https://doi.org/10.1016/j.celrep.2021.109193

      Buneo, C. A., & Andersen, R. A. (2006). The posterior parietal cortex: sensorimotor interface for the planning and online control of visually guided movements. Neuropsychologia, 44(13), 2594-2606. https://doi.org/10.1016/j.neuropsychologia.2005.10.011

      Buzsaki, G. (2015). Hippocampal sharp wave-ripple: A cognitive biomarker for episodic memory and planning. Hippocampus, 25(10), 1073-1188. https://doi.org/10.1002/hipo.22488

      Chen, P.-C., Stritzelberger, J., Walther, K., Hamer, H., & Staresina, B. P. (2024). Hippocampal ripples during offline periods predict human motor sequence learning. bioRxiv, 2024.10.06.614680. https://doi.org/10.1101/2024.10.06.614680

      Churchland, M. M., Cunningham, J. P., Kaufman, M. T., Foster, J. D., Nuyujukian, P., Ryu, S. I., & Shenoy, K. V. (2012). Neural population dynamics during reaching. Nature, 487(7405), 51-56. https://doi.org/10.1038/nature11129

      Classen, J., Liepert, J., Wise, S. P., Hallett, M., & Cohen, L. G. (1998). Rapid plasticity of human cortical movement representation induced by practice. J Neurophysiol, 79(2), 1117-1123. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=9463469

      Colclough, G. L., Brookes, M. J., Smith, S. M., & Woolrich, M. W. (2015). A symmetric multivariate leakage correction for MEG connectomes. NeuroImage, 117, 439-448. https://doi.org/10.1016/j.neuroimage.2015.03.071

      Colclough, G. L., Woolrich, M. W., Tewarie, P. K., Brookes, M. J., Quinn, A. J., & Smith, S. M. (2016). How reliable are MEG resting-state connectivity metrics? NeuroImage, 138, 284-293. https://doi.org/10.1016/j.neuroimage.2016.05.070

      Das, A., Karagiorgis, A., Diedrichsen, J., Stenner, M.-P., & Azanon, E. (2024). “Micro-offline gains” convey no benefit for motor skill learning. bioRxiv, 2024.07.11.602795. https://doi.org/10.1101/2024.07.11.602795

      Deleglise, A., Donnelly-Kehoe, P. A., Yeffal, A., Jacobacci, F., Jovicich, J., Amaro, E., Jr., Armony, J. L., Doyon, J., & Della-Maggiore, V. (2023). Human motor sequence learning drives transient changes in network topology and hippocampal connectivity early during memory consolidation. Cereb Cortex, 33(10), 6120-6131. https://doi.org/10.1093/cercor/bhac489

      Doyon, J., Bellec, P., Amsel, R., Penhune, V., Monchi, O., Carrier, J., Lehéricy, S., & Benali, H. (2009). Contributions of the basal ganglia and functionally related brain structures to motor learning. [Review]. Behavioural brain research, 199(1), 61-75. https://doi.org/10.1016/j.bbr.2008.11.012

      Doyon, J., Song, A. W., Karni, A., Lalonde, F., Adams, M. M., & Ungerleider, L. G. (2002). Experience-dependent changes in cerebellar contributions to motor sequence learning. Proc Natl Acad Sci U S A, 99(2), 1017-1022. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=11805340

      Euston, D. R., Gruber, A. J., & McNaughton, B. L. (2012). The role of medial prefrontal cortex in memory and decision making. Neuron, 76(6), 1057-1070. https://doi.org/10.1016/j.neuron.2012.12.002

      Euston, D. R., Tatsuno, M., & McNaughton, B. L. (2007). Fast-forward playback of recent memory sequences in prefrontal cortex during sleep. Science, 318(5853), 1147-1150. https://doi.org/10.1126/science.1148979

      Flint, R. D., Ethier, C., Oby, E. R., Miller, L. E., & Slutzky, M. W. (2012). Local field potentials allow accurate decoding of muscle activity. J Neurophysiol, 108(1), 18-24. https://doi.org/10.1152/jn.00832.2011

      Frankland, P. W., & Bontempi, B. (2005). The organization of recent and remote memories. Nat Rev Neurosci, 6(2), 119-130. https://doi.org/10.1038/nrn1607

      Gais, S., Albouy, G., Boly, M., Dang-Vu, T. T., Darsaud, A., Desseilles, M., Rauchs, G., Schabus, M., Sterpenich, V., Vandewalle, G., Maquet, P., & Peigneux, P. (2007). Sleep transforms the cerebral trace of declarative memories. Proc Natl Acad Sci U S A, 104(47), 18778-18783. https://doi.org/10.1073/pnas.0705454104

      Grafton, S. T., Mazziotta, J. C., Presty, S., Friston, K. J., Frackowiak, R. S., & Phelps, M. E. (1992). Functional anatomy of human procedural learning determined with regional cerebral blood flow and PET. J Neurosci, 12(7), 2542-2548.

      Grover, S., Wen, W., Viswanathan, V., Gill, C. T., & Reinhart, R. M. G. (2022). Long-lasting, dissociable improvements in working memory and long-term memory in older adults with repetitive neuromodulation. Nat Neurosci, 25(9), 1237-1246. https://doi.org/10.1038/s41593-022-01132-3

      Gupta, M. W., & Rickard, T. C. (2022). Dissipation of reactive inhibition is sufficient to explain post-rest improvements in motor sequence learning. NPJ Sci Learn, 7(1), 25. https://doi.org/10.1038/s41539-022-00140-z

      Gupta, M. W., & Rickard, T. C. (2024). Comparison of online, offline, and hybrid hypotheses of motor sequence learning using a quantitative model that incorporate reactive inhibition. Sci Rep, 14(1), 4661. https://doi.org/10.1038/s41598-024-52726-9

      Hardwick, R. M., Rottschy, C., Miall, R. C., & Eickhoff, S. B. (2013). A quantitative meta-analysis and review of motor learning in the human brain. NeuroImage, 67, 283-297. https://doi.org/10.1016/j.neuroimage.2012.11.020

      Heusser, A. C., Poeppel, D., Ezzyat, Y., & Davachi, L. (2016). Episodic sequence memory is supported by a theta-gamma phase code. Nat Neurosci, 19(10), 1374-1380. https://doi.org/10.1038/nn.4374

      Higgins, C., Liu, Y., Vidaurre, D., Kurth-Nelson, Z., Dolan, R., Behrens, T., & Woolrich, M. (2021). Replay bursts in humans coincide with activation of the default mode and parietal alpha networks. Neuron, 109(5), 882-893 e887. https://doi.org/10.1016/j.neuron.2020.12.007

      Hikosaka, O., Nakamura, K., Sakai, K., & Nakahara, H. (2002). Central mechanisms of motor skill learning. Curr Opin Neurobiol, 12(2), 217-222. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=12015240

      Jacobacci, F., Armony, J. L., Yeffal, A., Lerner, G., Amaro, E., Jr., Jovicich, J., Doyon, J., & Della-Maggiore, V. (2020). Rapid hippocampal plasticity supports motor sequence learning. Proc Natl Acad Sci U S A, 117(38), 23898-23903. https://doi.org/10.1073/pnas.2009576117

      Karni, A., Meyer, G., Jezzard, P., Adams, M. M., Turner, R., & Ungerleider, L. G. (1995). Functional MRI evidence for adult motor cortex plasticity during motor skill learning. Nature, 377(6545), 155-158. https://doi.org/10.1038/377155a0

      Kennerley, S. W., Sakai, K., & Rushworth, M. F. (2004). Organization of action sequences and the role of the pre-SMA. J Neurophysiol, 91(2), 978-993. https://doi.org/10.1152/jn.00651.2003

      Kleim, J. A., Barbay, S., & Nudo, R. J. (1998). Functional reorganization of the rat motor cortex following motor skill learning. J Neurophysiol, 80, 3321-3325.

      Kornysheva, K., Bush, D., Meyer, S. S., Sadnicka, A., Barnes, G., & Burgess, N. (2019). Neural Competitive Queuing of Ordinal Structure Underlies Skilled Sequential Action. Neuron, 101(6), 1166-1180 e1163. https://doi.org/10.1016/j.neuron.2019.01.018

      Lee, S. H., Jin, S. H., & An, J. (2019). The difference in cortical activation pattern for complex motor skills: A functional near-infrared spectroscopy study. Sci Rep, 9(1), 14066. https://doi.org/10.1038/s41598-019-50644-9

      Lisman, J. E., & Jensen, O. (2013). The theta-gamma neural code. Neuron, 77(6), 1002-1016. https://doi.org/10.1016/j.neuron.2013.03.007

      Mollazadeh, M., Aggarwal, V., Davidson, A. G., Law, A. J., Thakor, N. V., & Schieber, M. H. (2011). Spatiotemporal variation of multiple neurophysiological signals in the primary motor cortex during dexterous reach-to-grasp movements. J Neurosci, 31(43), 15531-15543. https://doi.org/10.1523/JNEUROSCI.2999-11.2011

      Molle, M., & Born, J. (2009). Hippocampus whispering in deep sleep to prefrontal cortex--for good memories? Neuron, 61(4), 496-498. https://doi.org/10.1016/j.neuron.2009.02.002

      Morris, R. G. M. (2006). Elements of a neurobiological theory of hippocampal function: the role of synaptic plasticity, synaptic tagging and schemas. [Review]. The European journal of neuroscience, 23(11), 2829-2846. https://doi.org/10.1111/j.1460-9568.2006.04888.x

      Mylonas, D., Schapiro, A. C., Verfaellie, M., Baxter, B., Vangel, M., Stickgold, R., & Manoach, D. S. (2024). Maintenance of Procedural Motor Memory across Brief Rest Periods Requires the Hippocampus. J Neurosci, 44(14). https://doi.org/10.1523/JNEUROSCI.1839-23.2024

      Pan, S. C., & Rickard, T. C. (2015). Sleep and motor learning: Is there room for consolidation? Psychol Bull, 141(4), 812-834. https://doi.org/10.1037/bul0000009

      Penhune, V. B., & Steele, C. J. (2012). Parallel contributions of cerebellar, striatal and M1 mechanisms to motor sequence learning. Behav. Brain Res., 226(2), 579-591. https://doi.org/10.1016/j.bbr.2011.09.044

      Qin, Y. L., McNaughton, B. L., Skaggs, W. E., & Barnes, C. A. (1997). Memory reprocessing in corticocortical and hippocampocortical neuronal ensembles. Philos Trans R Soc Lond B Biol Sci, 352(1360), 1525-1533. https://doi.org/10.1098/rstb.1997.0139

      Rickard, T. C., Cai, D. J., Rieth, C. A., Jones, J., & Ard, M. C. (2008). Sleep does not enhance motor sequence learning. J Exp Psychol Learn Mem Cogn, 34(4), 834-842. https://doi.org/10.1037/0278-7393.34.4.834

      Robertson, E. M., Pascual-Leone, A., & Miall, R. C. (2004). Current concepts in procedural consolidation. Nat Rev Neurosci, 5(7), 576-582. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=15208699

      Sawamura, D., Sakuraba, S., Suzuki, Y., Asano, M., Yoshida, S., Honke, T., Kimura, M., Iwase, Y., Horimoto, Y., Yoshida, K., & Sakai, S. (2019). Acquisition of chopstick-operation skills with the non-dominant hand and concomitant changes in brain activity. Sci Rep, 9(1), 20397. https://doi.org/10.1038/s41598-019-56956-0

      Schendan, H. E., Searl, M. M., Melrose, R. J., & Stern, C. E. (2003). An FMRI study of the role of the medial temporal lobe in implicit and explicit sequence learning. Neuron, 37(6), 1013-1025. https://doi.org/10.1016/s0896-6273(03)00123-5

      Seedat, Z. A., Quinn, A. J., Vidaurre, D., Liuzzi, L., Gascoyne, L. E., Hunt, B. A. E., O'Neill, G. C., Pakenham, D. O., Mullinger, K. J., Morris, P. G., Woolrich, M. W., & Brookes, M. J. (2020). The role of transient spectral 'bursts' in functional connectivity: A magnetoencephalography study. NeuroImage, 209, 116537. https://doi.org/10.1016/j.neuroimage.2020.116537

      Shadmehr, R., & Holcomb, H. H. (1997). Neural correlates of motor memory consolidation. Science, 277, 821-824.

      Sjøgård, M., Baxter, B., Mylonas, D., Driscoll, B., Kwok, K., Tolosa, A., Thompson, M., Stickgold, R., Vangel, M., Chu, C., & Manoach, D. S. (2024). Hippocampal ripples mediate motor learning during brief rest breaks in humans. bioRxiv. https://doi.org/10.1101/2024.05.02.592200

      Srinivas, S., Sarvadevabhatla, R. K., Mopuri, K. R., Prabhu, N., Kruthiventi, S. S. S., & Babu, R. V. (2016). A Taxonomy of Deep Convolutional Neural Nets for Computer Vision [Technology Report]. Frontiers in Robotics and AI, 2. https://doi.org/10.3389/frobt.2015.00036

      Sterpenich, V., Albouy, G., Darsaud, A., Schmidt, C., Vandewalle, G., Dang Vu, T. T., Desseilles, M., Phillips, C., Degueldre, C., Balteau, E., Collette, F., Luxen, A., & Maquet, P. (2009). Sleep promotes the neural reorganization of remote emotional memory. J Neurosci, 29(16), 5143-5152. https://doi.org/10.1523/JNEUROSCI.0561-09.2009

      Toni, I., Ramnani, N., Josephs, O., Ashburner, J., & Passingham, R. E. (2001). Learning arbitrary visuomotor associations: temporal dynamic of brain activity. Neuroimage, 14(5), 1048-1057. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=11697936

      Toni, I., Thoenissen, D., & Zilles, K. (2001). Movement preparation and motor intention. NeuroImage, 14(1 Pt 2), S110-117. https://doi.org/10.1006/nimg.2001.0841

      Tse, D., Langston, R. F., Kakeyama, M., Bethus, I., Spooner, P. A., Wood, E. R., Witter, M. P., & Morris, R. G. (2007). Schemas and memory consolidation. Science, 316(5821), 76-82. https://doi.org/10.1126/science.1135935

      van Kesteren, M. T., Fernandez, G., Norris, D. G., & Hermans, E. J. (2010). Persistent schema-dependent hippocampal-neocortical connectivity during memory encoding and postencoding rest in humans. Proc Natl Acad Sci U S A, 107(16), 7550-7555. https://doi.org/10.1073/pnas.0914892107

      van Kesteren, M. T., Ruiter, D. J., Fernandez, G., & Henson, R. N. (2012). How schema and novelty augment memory formation. Trends Neurosci, 35(4), 211-219. https://doi.org/10.1016/j.tins.2012.02.001

      Vidaurre, D., Hunt, L. T., Quinn, A. J., Hunt, B. A. E., Brookes, M. J., Nobre, A. C., & Woolrich, M. W. (2018). Spontaneous cortical activity transiently organises into frequency specific phase-coupling networks. Nat Commun, 9(1), 2987. https://doi.org/10.1038/s41467-018-05316-z

      Wagner, A. D., Schacter, D. L., Rotte, M., Koutstaal, W., Maril, A., Dale, A. M., Rosen, B. R., & Buckner, R. L. (1998). Building memories: remembering and forgetting of verbal experiences as predicted by brain activity. [Comment]. Science (New York, N.Y.), 281(5380), 1188-1191. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&id=9712582&retmode=ref&cmd=prlinks

      Wolpert, D. M., Goodbody, S. J., & Husain, M. (1998). Maintaining internal representations: the role of the human superior parietal lobe. Nat Neurosci, 1(6), 529-533. https://doi.org/10.1038/2245

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In this manuscript, the authors investigate the role of microtubule dynamics and its effects on neuronal aging. Using C. elegans as a model, the authors investigate the role of evolutionarily conserved Hippo pathway in microtubule dynamics of touch receptor neurons (TRNs) in an age-dependent manner. Using genetic, molecular, behavioral, and pharmacological approaches, the authors show that age-dependent loss of microtubule dynamics might underlie structural and functional aging of TRNs. Further, the authors show that the Hippo pathway specifically functions in these neurons to regulate microtubule dynamics. Specifically, authors show that hyperactivation of YAP-1, a downstream component of the Hippo pathway that is usually inhibited by the kinase activity of the upstream components of the pathway, results in microtubule stabilization and that might underlie the structural and functional decline of TRNs with age. However, how the Hippo pathway regulates microtubule dynamics and neuronal aging was not investigated by the authors.

      Strengths:

      This is a well-conducted and well-controlled study, and the authors have used multiple approaches to address different questions.

      Weaknesses:

      There are no major weaknesses identified, except that the effect of the Hippo pathway seems to be specific to only a subset of neurons. I would like the authors to address the specificity of the effect of the Hippo pathway in TRNs, in their resubmission.

      Our genetic experiments, including TRN-specific rescue/overexpression of YAP-1 and knockdown of WTS-1, strongly suggest a cell-autonomous function of the WTS-1-YAP-1 axis in TRNs; nevertheless, the Hpo pathway could have broader roles in neuroprotection. While this pathway may regulate microtubule stability in multiple neurons, other characteristics of TRNs, such as their anatomical location near the cuticle or their long projections along the body axis, could contribute to their susceptibility to age-related deformation. Alternatively, the function of the Hpo pathway may be truly TRN-specific. TRNs have microtubules that are unique in terms of both composition and structure. Among the nine α- and six β-tubulin genes in C. elegans, one α-tubulin (mec-12) and one β-tubulin (mec-7) show highly enriched expression in TRNs [1, 2], and TRNs contain a special 15-protofilament microtubule structure, whereas all other neurons in C. elegans have 11-protofilament microtubules [3]. Transcriptional regulation through YAP-1 may affect the specific microtubule structure of TRNs, leading to premature neuronal deformation. We have included this in the discussion section of the revised manuscript.

      Reviewer #2 (Public review):

      Summary:

      This study examines a novel role of the Hpo signaling pathway, specifically of wts-1/LATS and the downstream regulator of gene expression, yap, in age-related neurodegeneration in C. elegans touch-responsive mechanosensory neurons, ALM and PLM. The study shows that knockdown or deletion of wts-1/LATS causes age-associated morphological abnormalities of these neurons, accompanied by functional loss of touch responsiveness. This is further associated with enhanced, abnormal, microtubule stabilization in these neurons.

      Strengths:

      This study examines a novel role of the Hpo signaling pathway, specifically of wts-1/LATS and the downstream regulator of gene expression, yap, in age-related neurodegeneration in C. elegans touch-responsive mechanosensory neurons, ALM and PLM. The study shows that knockdown or deletion of wts-1/LATS causes age-associated morphological abnormalities of these neurons, accompanied by functional loss of touch responsiveness. This is further associated with enhanced, abnormal, microtubule stabilization in these neurons. Strong pharmacological and especially genetic manipulations of MT-stabilizing or severing proteins show a strong genetic link between yap and regulation of MTs stability. The study is strong and uses robust approaches, especially strong genetics. The demonstrations on the aging-related roles of the Hpo signaling pathway, and the link to MTs, are novel and compelling. Nevertheless, the study also has mechanistic weaknesses (see below).

      Weaknesses:

      Specific comments:

      (1) The study demonstrates age-specific roles of the Hpo pathway, specifically of wts-1/LATS and yap, specifically in TRN mechanosensory neurons, without observing developmental defects in these neurons, or effects in other neurons. This is a strong demonstration. Nevertheless, the study does not address whether there is a correlation of Hpo signaling pathway activity decline specifically in these neurons, and not other neurons, and at the observed L4 stage and onwards (including the first day of adulthood, 1DA stage). Such demonstrations of spatio-temporal regulation of the Hpo signaling pathway and its activation seem important for linking the Hpo pathway with the observed age-related neurodegeneration. Can this age-related response indeed be correlated with a decline in Hpo signaling during adulthood? Especially at L4 and onwards? It will be informative to measure this by examining the decline in wts-1 as well as yap levels and yap nuclear localization.

      As described above, we have included possible explanations for the specificity of the Hpo pathway in TRNs. Since components of the Hpo pathway are expressed in various tissues, including the intestine and hypodermis, this pathway could have broader neuroprotective roles across multiple neurons. Alternatively, it could function specifically in TRNs. Given that TRNs possess microtubules that are unique in both structure and composition, and that the Hpo pathway has crucial roles in regulating microtubule stability, the roles of the Hpo pathway may indeed be TRN-specific. As we described in the manuscript, our observations, along with those of others, indicate that neuronal deformation of TRNs begins around the 4th day of adulthood. Additionally, the degree of morphological deformation in wts-1 mutants at the L4 stage is comparable to that of aged wild-type worms on the 15th day of adulthood. Therefore, to assess the functional decline of WTS-1 or nuclear localization of YAP-1, observations should begin in 4-day-old animals. Using fluorescence-tagged YAP-1 under the mec-4 promoter, we could not detect a significant increase in nuclear YAP-1 in TRNs of 4-day-old adults. Additionally, we were unable to assess YAP-1 intracellular localization in older animals, such as 10-day-old animals, possibly due to the small cell size of neurons or the morphological alterations of TRNs that accompany aging. Although we did not detect functional decline of WTS-1 or increased nuclear YAP-1 in TRNs, nuclear localization of YAP-1 increases with age in other tissues, such as the intestine and hypodermis (Author response image 1). This may result from inactivation of the Hippo (Hpo) pathway, from an indirect consequence of structural and functional decline (such as tissue stiffness associated with aging), or from a combination of both. Additionally, given that morphological deformation of TRNs appears to begin around the fourth day of adulthood, nuclear localization of YAP-1 in the intestine and hypodermis seems to have a later onset and to be more moderate. It is possible that YAP-1 nuclear localization in TRNs occurs earlier or that other factors contribute to early-stage touch neuronal deformation.

      Author response image 1.

      Quantification of the proportion of worms exhibiting nuclear localization of YAP-1. We used GFP-tagged YAP-1 driven by its own 4 kb promoter. A total of 90 animals were observed each day.

      (2) The Hpo pathway eventually activates gene expression via yap. Although the study uses robust genetic manipulations of yap and wts-1/LATS, it is not clear whether the observed effects are attributed to yap-mediated regulation of gene expression (see 3).

      Given that the neuronal deformation in the wts-1 mutant was completely restored by the loss of yap-1 or egl-44, it strongly suggests that YAP-TEAD-mediated transcriptional regulation is responsible for the premature neuronal degeneration of the wts-1 mutant. However, in this study, we were unable to identify specific transcriptional target genes associated with these phenomena, which represents a limitation of our research (please see below).

      (3) The observations on the abnormal MT stabilization, and the subsequent genetic examinations of MT-stability/severing genes, are a significant strength of the study. Nevertheless, despite the strong genetic links to yap and wts-1/LATS, it is not clear whether MT-regulatory genes are regulated by transcription downstream of the Hpo pathway, thus not enabling a strong causal link between MT regulation and Hpo-mediated gene expression, making this strong part of the study mechanistically circumstantial. Specifically, it will be good to examine whether the genes addressed herein, for example, Spastin, are transcriptionally regulated downstream of the Hpo pathway. This comment is augmented by the finding that in the wts-1/yap-1 double mutants, MT abnormality, and subsequent neuronal morphology and touch responses are restored, clearly indicating that there is an associated transcriptional regulation.

      If the target genes of YAP-1 are not identified, it will be difficult to fully understand how YAP-1 regulates microtubule stability. Microtubule-stabilizing genes, whose knockdown alleviates wts-1 mutant neuronal deformation, could be potential transcriptional targets of YAP-1. Among these genes, PTRN-1 and DLK-1 contain MCAT sequences (CATTCCA/T), a well-conserved DNA motif recognized by the TEAD transcription factor, in their promoters near the transcription start site (TSS). We hypothesized that the expression of fluorescence-tagged reporters of promoter regions containing these MCAT sequences would be enhanced in the absence of wts-1 activity. Although both reporters were expressed in TRNs, they did not show significant changes in the wts-1 mutant background. We also focused on spv-1, a worm homolog of ARHGAP29, which negatively regulates RhoA. YAP is known to modulate actin cytoskeleton rigidity through transcriptional regulation of ARHGAP29 [4]. The promoter of spv-1 contains two MCAT sequences, and loss of spv-1 mitigated neuronal deformation of the wts-1 mutant. However, reporters of promoter regions containing MCAT sequences were only weakly expressed in the processes of TRNs. More importantly, ectopic expression of a dominant-negative form of rho-1/rhoA did not lead to significant deformation of TRNs. While YAP typically functions as a transcriptional co-activator, it has also been reported to repress target gene expression, such as DDIT4 and Trail, in collaboration with the TEAD transcription factor [5]. As the reviewer pointed out, spas-1 might be transcriptionally repressed by yap-1, given that its loss leads to premature deformation of TRNs. However, since the phenotype of the spas-1 mutant has a later onset than the wts-1 mutant and is relatively restricted to ALM, we excluded it from our candidate gene search. Despite extensive genetic approaches, we were unable to establish a strong causal link between YAP-1 and the regulation of microtubule stability. Unbiased screenings, such as tissue-specific transcriptome analysis, may help address the remaining questions. We have outlined the limitations of this study in the discussion section of the revised manuscript.
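
      The promoter scan for MCAT sequences described above can be illustrated with a short sketch. It is not the procedure used in the study; it only assumes that the motif CATTCCA/T is read as CATTCC followed by A or T and that both strands are searched. The function name and the example sequence are hypothetical.

```python
import re

# MCAT motif as given in the text: CATTCCA/T, i.e. CATTCC followed by A or T.
# Lookaheads are used so that overlapping matches would also be reported.
MCAT_FORWARD = re.compile(r"(?=(CATTCC[AT]))")
# Reverse complement of the motif, to catch sites on the opposite strand.
MCAT_REVERSE = re.compile(r"(?=([AT]GGAATG))")


def find_mcat_sites(promoter):
    """Return sorted (position, strand, sequence) tuples; positions are 0-based."""
    seq = promoter.upper()
    sites = [(m.start(), "+", m.group(1)) for m in MCAT_FORWARD.finditer(seq)]
    sites += [(m.start(), "-", m.group(1)) for m in MCAT_REVERSE.finditer(seq)]
    return sorted(sites)


# Hypothetical promoter fragment, made up for illustration only.
example_promoter = "GGCATTCCAAACGTTGGAATGCC"
for position, strand, motif in find_mcat_sites(example_promoter):
    print(position, strand, motif)
```

      Reporting the strand and position makes it straightforward to relate each candidate site to the TSS when real promoter coordinates are available.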

      Other comments:

      (1) The TRN-specific knockdown of wts-1 and yap-1 is a clear strength. Nevertheless, these do not necessarily show cell-autonomous effects, as the yap transcription factor may regulate the expression of external cues, secreted or otherwise, thus generating non-cell autonomous effects. For example, it is known that yap regulates TGF-beta expression and signaling.

      In the absence of LATS1/2 activity, activated YAP has been reported to drive biliary epithelial cell lineage specification by directly regulating TGF-β transcription during and after liver development [6]. Even when functioning in an autocrine manner, TGF-β can exhibit non-cell-autonomous effects. While it primarily acts on the same cell that secretes it, some molecules may also affect neighboring cells, leading to paracrine effects. Additionally, TGF-β can modify the extracellular matrix (ECM), indirectly affecting surrounding cells. Similarly, if YAP regulates the transcription of secretory proteins in TRNs, the resulting extracellular factors or surrounding cells may influence touch neuronal microtubules in a non-cell-autonomous manner. Although our genetic data strongly suggest a cell-autonomous function of WTS-1-YAP-1 in TRNs, we could not exclude the possibility that YAP-1 functions non-cell-autonomously, as we were unable to identify its transcriptional targets. We have included this in the discussion section of the revised manuscript.

      (2) Continuing from comment (3) above, it seems that many of the MT-regulators chosen here for genetic examinations were chosen based on demonstrated roles in neurodegeneration in other studies. It would be good to show whether these MT-associated genes are directly regulated by transcription by the Hpo pathway.

      As we described above, several MT-associated genes, such as ptrn-1, dlk-1, and spv-1, contain MCAT sequences in their promoters, and their knockdown alleviated wts-1-induced neuronal deformation. These genes were tested to determine whether they were directly regulated by WTS-1-YAP-1. Based on our findings, we concluded that they were unlikely to be regulated by the Hpo pathway in TRNs.

      (3) The impairment of the touch response may not be robust: it is only a 30-40% reduction at L4, and even less reduction at 1DA. It would be good to offer possible explanations for this finding.

      As pointed out by the reviewer, the touch responses of wts-1 mutants were reduced by approximately 33% at both L4 and 1DA compared to age-matched wild-type animals. At the L4 stage, control worms responded to nearly every gentle touch (94%), whereas wts-1 mutants responded to only 60% of stimuli. By 1DA, control worms exhibited a slight decline in touch responses compared to L4 (82.5%), whereas wts-1 mutants displayed more pronounced impairment (55.7%) (Fig 1E). Given the severity and frequency of structural degeneration in wts-1 mutants at both stages, this functional impairment appears relatively moderate. As we noted in the manuscript, our observations, along with those of others, indicate that structural abnormalities in ALM and PLM neurons begin to appear around the fourth day of adulthood and progressively worsen as the worms age [7]. In a previous study, Tank et al. categorized day 10-aged worms into two groups based on their movement ability and then assessed structural deformation in each animal to determine whether structural and functional degeneration of TRNs were correlated. In the same two groups of animals, they examined the gentle touch response and found that animals responded to 46 ± 5.1% and 84 ± 12.2% of gentle touches, respectively [8]. Taken together, day 10 animals showed a touch response of about 65% on average, which is consistent with our observation in day 10 animals (Fig. 5E, 56.3%). Given these observations, the function of TRNs in wts-1 mutants or aged animals appears to be preserved despite severe structural failures. The gentle touch response evokes an escape behavior in which animals quickly move away from the stimulus; thus, proper touch responses are essential for avoiding predators and ensuring survival. It has been reported to be necessary for evading fungal predation, such as escaping from a constricting hyphal ring [9]. Given that the gentle touch response is crucial for survival, its function is likely well preserved despite structural abnormalities, such as age-related deformation.
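
      The percentages quoted in this response can be checked with a few lines of arithmetic. The sketch below uses only the response rates stated above; it assumes that the reduction is expressed relative to the age-matched control and that the day 10 figure is the unweighted mean of the two subgroup means. The function name is illustrative.

```python
def relative_reduction(control_pct, mutant_pct):
    """Fractional drop of the mutant response rate relative to the control."""
    return (control_pct - mutant_pct) / control_pct


# Response rates quoted in the text (percent of gentle touches answered).
l4_reduction = relative_reduction(94.0, 60.0)     # L4 stage
day1_reduction = relative_reduction(82.5, 55.7)   # 1DA

# Unweighted mean of the two day-10 subgroup means from Tank et al. [8]
# (an assumption; subgroup sizes are not given in this response).
day10_mean = (46.0 + 84.0) / 2

print(f"L4 reduction: {l4_reduction:.1%}")
print(f"1DA reduction: {day1_reduction:.1%}")
print(f"Day 10 mean response: {day10_mean:.0f}%")
```

      Both reductions come out at roughly one third of the control response.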

      Reviewer #1 (Recommendations for the authors):

      Major comments:

      (1) Why is the effect of the Hippo pathway on microtubule dynamics specific to TRNs? Is it the structure of TRNs that makes them prone to the effects of age-dependent decline in microtubule dynamics? The authors are advised to discuss it in their resubmission.

      As described above, we have included possible explanations for the tissue specificity of the Hpo pathway in TRNs and the vulnerability of TRNs to age-associated decline in the discussion section of the revised manuscript.

      (2) The authors are advised to explain the shorter life span of wts-1; yap-1 double mutants (with restored TRNs) compared to wts-1 single mutants in Figure 2F. The life span of yap-1 single mutants should be included in Figure 2F. Further, based on the data, the shorter lifespan of wts-1 mutants cannot be attributed to abnormal TRNs as the lifespan of wts-1; yap-1 double mutants is even shorter. The authors are advised to explain the shorter life span of wts-1 mutants compared to wild-type controls.

      wts-1 is known to be involved in various developmental processes, including the maintenance of apicobasal polarity in the intestine, growth rate control, and dauer formation [10-12]. Since WTS-1 activity is restored in the intestine of the mutant used for lifespan measurement, the shorter lifespan of the wts-1 mutant may result from the loss of WTS-1 in tissues other than the intestine. Although we were unable to include lifespan data for the yap-1 mutant, recent studies indicate that the yap-1(tm1416) mutant or yap-1 RNAi-treated worms exhibit a shortened lifespan [13, 14]. Thus, our data showing a slightly shorter lifespan of the wts-1; yap-1 mutant compared with the wts-1 mutant may result from the synergistic action of yap-1 and yap-1-independent downstream factors of wts-1. While this study does not provide an explanation for the shortened lifespan of wts-1 or wts-1; yap-1 mutants, the fact that the wts-1; yap-1 double mutant with restored TRNs still has a shorter lifespan than the wts-1 mutant strongly suggests that premature deformation of the wts-1 neurons is a touch neuron-specific event rather than one associated with the whole body, as described in the manuscript.

      Minor comments:

      (1) In the abstract, please provide definitions for LATS and YAP. Authors can mention that LATS is a kinase and YAP a transcriptional co-activator in the Hippo pathway.

      (2) In the last paragraph on page 9, change "these function" to "this function", and change "knock-downed" to "knocked down".

      (3) On page 10, paragraph 2, change "regarding the action mechanism" to "regarding the mechanism of action".

      (4) On page 11, paragraph 1, change "endogenous WTS-1 could inhibits" to "endogenous WTS-1 could inhibit".

      (5) On page 16, paragraph 1, change "consistent to the hypothesis" to "consistent with this hypothesis".

      (6) Overall, the paper is well written. However, there is still room to improve the language and diction used by the authors.

      We have addressed all minor comments raised by the reviewer in the revised manuscript.

      References

      (1) Hamelin M, Scott IM, Way JC, Culotti JG. The mec-7 beta-tubulin gene of Caenorhabditis elegans is expressed primarily in the touch receptor neurons. EMBO J. 1992;11(8):2885-93. Epub 1992/08/01. doi: 10.1002/j.1460-2075.1992.tb05357.x. PubMed PMID: 1639062; PubMed Central PMCID: PMC556769.

      (2) Fukushige T, Siddiqui ZK, Chou M, Culotti JG, Gogonea CB, Siddiqui SS, et al. MEC-12, an alpha-tubulin required for touch sensitivity in C. elegans. J Cell Sci. 1999;112 ( Pt 3):395-403. Epub 1999/01/14. doi: 10.1242/jcs.112.3.395. PubMed PMID: 9885292.

      (3) Chalfie M, Thomson JN. Structural and functional diversity in the neuronal microtubules of Caenorhabditis elegans. J Cell Biol. 1982;93(1):15-23. Epub 1982/04/01. doi: 10.1083/jcb.93.1.15. PubMed PMID: 7068753; PubMed Central PMCID: PMC2112106.

      (4) Qiao Y, Chen J, Lim YB, Finch-Edmondson ML, Seshachalam VP, Qin L, et al. YAP Regulates Actin Dynamics through ARHGAP29 and Promotes Metastasis. Cell Rep. 2017;19(8):1495-502. Epub 2017/05/26. doi: 10.1016/j.celrep.2017.04.075. PubMed PMID: 28538170.

      (5) Kim M, Kim T, Johnson RL, Lim DS. Transcriptional co-repressor function of the hippo pathway transducers YAP and TAZ. Cell Rep. 2015;11(2):270-82. Epub 2015/04/07. doi: 10.1016/j.celrep.2015.03.015. PubMed PMID: 25843714.

      (6) Lee DH, Park JO, Kim TS, Kim SK, Kim TH, Kim MC, et al. LATS-YAP/TAZ controls lineage specification by regulating TGFbeta signaling and Hnf4alpha expression during liver development. Nat Commun. 2016;7:11961. Epub 2016/07/01. doi: 10.1038/ncomms11961. PubMed PMID: 27358050; PubMed Central PMCID: PMC4931324.

      (7) Toth ML, Melentijevic I, Shah L, Bhatia A, Lu K, Talwar A, et al. Neurite sprouting and synapse deterioration in the aging Caenorhabditis elegans nervous system. J Neurosci. 2012;32(26):8778-90. Epub 2012/06/30. doi: 10.1523/JNEUROSCI.1494-11.2012. PubMed PMID: 22745480; PubMed Central PMCID: PMC3427745.

      (8) Tank EM, Rodgers KE, Kenyon C. Spontaneous age-related neurite branching in Caenorhabditis elegans. J Neurosci. 2011;31(25):9279-88. Epub 2011/06/24. doi: 10.1523/JNEUROSCI.6606-10.2011. PubMed PMID: 21697377; PubMed Central PMCID: PMC3148144.

      (9) Maguire SM, Clark CM, Nunnari J, Pirri JK, Alkema MJ. The C. elegans touch response facilitates escape from predacious fungi. Curr Biol. 2011;21(15):1326-30. Epub 2011/08/02. doi: 10.1016/j.cub.2011.06.063. PubMed PMID: 21802299; PubMed Central PMCID: PMC3266163.

      (10) Cai Q, Wang W, Gao Y, Yang Y, Zhu Z, Fan Q. Ce-wts-1 plays important roles in Caenorhabditis elegans development. FEBS Lett. 2009;583(19):3158-64. Epub 2009/09/10. doi: 10.1016/j.febslet.2009.09.002. PubMed PMID: 19737560.

      (11) Kang J, Shin D, Yu JR, Lee J. Lats kinase is involved in the intestinal apical membrane integrity in the nematode Caenorhabditis elegans. Development. 2009;136(16):2705-15. Epub 2009/07/15. doi: 10.1242/dev.035485. PubMed PMID: 19605499.

      (12) Lee H, Kang J, Ahn S, Lee J. The Hippo Pathway Is Essential for Maintenance of Apicobasal Polarity in the Growing Intestine of Caenorhabditis elegans. Genetics. 2019;213(2):501-15. Epub 2019/07/29. doi: 10.1534/genetics.119.302477. PubMed PMID: 31358532; PubMed Central PMCID: PMC6781910.

      (13) Teuscher AC, Statzer C, Goyala A, Domenig SA, Schoen I, Hess M, et al. Longevity interventions modulate mechanotransduction and extracellular matrix homeostasis in C. elegans. Nat Commun. 2024;15(1):276. Epub 2024/01/05. doi: 10.1038/s41467-023-44409-2. PubMed PMID: 38177158; PubMed Central PMCID: PMC10766642.

      (14) Saul N, Dhondt I, Kuokkanen M, Perola M, Verschuuren C, Wouters B, et al. Identification of healthspan-promoting genes in Caenorhabditis elegans based on a human GWAS study. Biogerontology. 2022;23(4):431-52. Epub 2022/06/25. doi: 10.1007/s10522-022-09969-8. PubMed PMID: 35748965; PubMed Central PMCID: PMC9388463.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Review:

      Reviewer #1 (Public review): 

      Summary: 

      Odor- and taste-sensing are mediated by two different systems, the olfactory and gustatory systems, and have different behavioral roles. In this study, Wei et al. challenge this dichotomy by showing that odors can activate gustatory receptor neurons (GRNs) in Drosophila to promote feeding responses, including the proboscis extension response (PER) that was previously thought to be driven only by taste. While previous studies suggested that odors can promote PER to appetitive tastants, Wei et al. go further to show that odors alone cause PER, this effect is mediated through sweet-sensing GRNs, and sugar receptors are required. The study also shows that odor detection by bitter-sensing GRNs suppresses PER. The authors' conclusions are supported by behavioral assays, calcium imaging, electrophysiological recordings, and genetic manipulations. The observation that both attractive and aversive odors promote PER leaves an open question as to why this effect is adaptive. Overall, the study sheds new light on chemosensation and multimodal integration by showing that odor and taste detection converge at the level of sensory neurons, a finding that is interesting and surprising while also being supported by another recent study (Dweck & Carlson, Sci Advances 2023).

      Strengths: 

      (1) The main finding that odors alone can promote PER by activating sweet-sensing GRNs is interesting and novel.

      (2) The study uses video tracking of the proboscis to quantify PER rather than manual scoring, which is typically used in the field. The tracking method is less subjective and provides a higher-resolution readout of the behavior.

      (3) The study uses calcium imaging and electrophysiology to show that odors activate GRNs. These represent complementary techniques that measure activity at different parts of the GRN (axons versus dendrites, respectively) and strengthen the evidence for this conclusion. 

      (4) Genetic manipulations show that odor-evoked PER is primarily driven by sugar GRNs and sugar receptors rather than olfactory neurons. This is a major finding that distinguishes this work from previous studies of odor effects on PER and feeding (e.g., Reisenman & Scott, 2019; Shiraiwa, 2008) that assumed or demonstrated that odors were acting through olfactory neurons.

      We appreciate the reviewer’s positive assessment of the novelty and significance of our work.

      Weaknesses/Limitations: 

      (1) The authors may want to discuss why PER to odors alone has not been previously reported, especially as they argue that this is a broad effect evoked by many different odors. Previous studies testing the effect of odors on PER only observed odor enhancement of PER to sugar (Oh et al., 2021; Reisenman & Scott, 2019; Shiraiwa, 2008) and some of these studies explicitly show no effect of odor alone or odor with low sugar concentration; regardless, the authors likely would have noticed if PER to odor alone had occurred. Readers of this paper may also be aware of unpublished studies failing to observe an effect of PER on odor alone (including studies performed by this reviewer and unrelated work by other colleagues in the field), which of course the authors are not expected to directly address but may further motivate the authors to provide possible explanations.

      We appreciate the reviewer’s comment. We believe that the difference in genotype is likely the main reason. The strength of odor-evoked PER varied widely across genotypes and was quite weak in some strains, including the commonly used w[1118] empty Gal4 and w[1118] empty split Gal4, as shown in Figure 1-figure supplement 3 (Figure S3 in original submission). However, given that we observed odor-evoked PER in various genotypes (many in the main figures and three in Figure 1-figure supplement 3, including Drosophila simulans), the data illustrate that it is a general phenomenon in Drosophila. Indeed, although Oh et al. (2021) did not emphasize it in the text, their Fig. 1E showed that yeast odor evoked PER at a probability of 20%, which is much higher than the rate of spontaneous PER in many genotypes. Therefore, this report may provide further support for the presence of odor-evoked PER. We have expanded our text in the Discussion to describe these issues.

      Another possibility is our use of DeepLabCut to quantitatively track the kinematics of proboscis movement, which may have facilitated the detection of PER.

      (2) Many of the odor effects on behavior or neuronal responses were only observed at very high concentrations. Most effects seemed to require concentrations of at least 10<sup>-2</sup> (0.01 v/v), which is at the high end of the concentration range used in olfactory studies (e.g., Hallem et al., 2004), and most experiments in the paper used a far higher concentration of 0.5 v/v. It is unclear whether these are concentrations that would be naturally encountered by flies.

      We acknowledge that the concentrations used are on the higher side, suggesting that GRNs may need to be stimulated with relatively concentrated odors to induce PER. Although it is difficult to determine the naturalistic range of odor concentration, it is widely reported that olfactory neurons, including olfactory receptor neurons and projection neurons, do not saturate and exhibit odor identity-dependent responses at a concentration of 10<sup>-2</sup>, where odor-evoked PER can be observed. Furthermore, we have shown in Figure 6 that a low concentration (10<sup>-4</sup>) of banana odor, ethyl butyrate, and 4-methylcyclohexanol all significantly increased the rate of odor-taste multisensory PER even in flies whose olfactory organs had been removed, suggesting that low-concentration odors can influence feeding behavior via GRNs in a natural context where odors and tastants coexist at food sites. Finally, we note that odors were further diluted by a factor of 0.375 by mixing the odor stream with the main air stream before being applied to the flies, as described in Methods.

      (3) The calcium imaging data showing that sugar GRNs respond to a broad set of odors contrasts with results from Dweck & Carlson (Sci Adv, 2023) who recorded sugar neurons with electrophysiology and observed responses to organic acids, but not other odors. This discrepancy is not discussed.  

      As the reviewer points out, Dweck and Carlson (Sci Adv, 2023) reported, using single sensillum electrophysiology (base recording), that sugar GRNs only respond to organic acids, whereas we found, using calcium imaging from a group of axons and single sensillum electrophysiology (tip recording), that these GRNs respond to a wide variety of odors. Given that we observed odor responses using two methods, the discrepancy is likely due to differences in the genotypes examined. We have now discussed this point in the text.

      (4) Related to point #1, it would be useful to see a quantification of the percent of flies or trials showing PER for the key experiments in the paper, as this is the standard metric used in most studies and would help readers compare PER in this study to other studies. This is especially important for cases where the authors are claiming that odor-evoked PER is modulated in the same way as previously shown for sugar (e.g., the effect of starvation in Figure S4).

      For starved flies, we would like to remind the reviewer that the percentage of trials showing PER is reported in Fig. 1E, which shows a similar trend as the integrated PER duration. For fed flies, we have analyzed the percentage of PER and added the result to Figure 2-figure supplement 1C (Figure S4 in original submission).

      (5) Given the novelty of the finding that odors activate sugar GRNs, it would be useful to show more examples of GCaMP traces (or overlaid traces for all flies/trials) in Figure 3. Only one example trace is shown, and the boxplots do not give us a sense of the reliability or time course of the response. A related issue is that the GRNs appear to be persistently activated long after the odor is removed, which does not occur with tastes. Why should that occur? Does the time course of GRN activation align with the time course of PER, and do different odors show differences in the latency of GRN activation that correspond with differences in the latency of PER (Figure S1A)?

      Following the reviewer’s suggestion, we now report GCaMP responses for all the trials in all the flies (both Gr5a>GCaMP and Gr66a>GCaMP flies), where the time course and trial-to-trial/animal-to-animal variability of calcium responses can be observed (Figure 3-figure supplement 2).

      Regarding the second point, we recorded responses to both sucrose and odors in some flies and found that calcium responses of GRNs are long-lasting not only to odors but also to sucrose, as shown in Author response image 1. This may be due in part to the properties of GCaMP6s and slower decay of intracellular calcium concentration as compared to spikes.

      Author response image 1.

      Example calcium responses to sucrose and odor (MCH) in the same fly (normalized by the respective peak responses to better illustrate the time course of responses). Sucrose (blue) and odor (orange) concentrations are 100 mM and 10<sup>-1</sup>, respectively. Odor stimulation begins at 5 s and lasts for 2 s. Sucrose was applied with the same timing and for the same duration, although there was a limitation in controlling the precise timing and duration of tastant application. Because of this limitation, we did not quantify the off time constants of the two responses.

      To address whether the time course of GRN activation aligns with the time course of PER, and whether different odors evoke different latencies of GRN activation that correspond to latencies of PER, we plotted the time course of GRN responses and PER, and further compared the response latencies across odors and across two types of responses in Gr5a>GCaMP6s flies. As shown in Author response image 2, no significant differences were found in response latency between the six odors for PER and odor responses. Furthermore, Pearson correlation between GRN response latencies and PER latencies was not significant (r = 0.09, p = 0.872).

      Author response image 2.

      (A) PER duration in each second in Gr5a-Gal4>UAS-GCaMP6s flies. The black lines indicate the mean and the shaded areas indicate standard error of the mean. n = 25 flies. (B) Time course of calcium responses (ΔF/F) to nine odors in Gr5a GRNs. n = 5 flies. (C) Latency to the first odor-evoked PER in Gr5a-Gal4>UAS-GCaMP6s flies. Green bar indicates the odor application period. p = 0.67, one-way ANOVA. Box plots indicate the median (orange line), mean (black dot), quartiles (box), and 5-95% range (bar). Dots are outliers. (D) Latency of calcium responses (10% of rise to peak time) in Gr5a GRNs. Green bar indicates the odor application period. p = 0.32, one-way ANOVA. Box plots indicate the median (orange line), mean (black dot), quartiles (box), and 5-95% range (bar). Dots are outliers.
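
      For readers who wish to reproduce this kind of latency comparison, the Pearson correlation coefficient can be computed as in the minimal sketch below. It is written in plain Python, the latency values are hypothetical placeholders rather than the measured data, and it returns only r; a p-value, as reported above, would require a statistics package.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations and the two sums of squared deviations.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-odor mean latencies in seconds (illustration only).
grn_latency = [0.42, 0.55, 0.48, 0.61, 0.50, 0.47]
per_latency = [1.10, 0.95, 1.30, 1.05, 1.20, 0.90]
print(f"r = {pearson_r(grn_latency, per_latency):.2f}")
```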

      (6) Several controls are missing, and in some cases, experimental and control groups are not directly compared. In general, Gal4/UAS experiments should include comparisons to both the Gal4/+ and UAS/+ controls, at least in cases where control responses vary substantially, which appears to be the case for this study. These controls are often missing, e.g. the Gal4/+ controls are not shown in Figure 2C-G and the UAS/+ controls are not shown in Figure 2J-L (also, the legend for the latter panels should be revised to clarify what the "control" flies are). For the experiments in Figure S5, the data are not directly compared to any control group. For several other experiments, the control and experimental groups are plotted in separate graphs (e.g., Figure 2C-G), and they would be easier to visually compare if they were together. In addition, for each experiment, the authors should denote which comparisons are statistically significant rather than just reporting an overall p-value in the legend (e.g., Figure 2H-L).

      We thank the reviewer for the input. We have conducted additional experiments for four Gal4/+ controls in Figure 2 and added detailed information about control flies in the figure legend (Figure 2C-F).

      For the RNAi flies shown in Figure 2 and Figure 2-figure supplement 3, we used the recommended controls suggested by the VDRC. These control flies were crossed with tubulin-Gal4 lines to include both Gal4 and UAS control backgrounds.

      Regarding Figure S5 in original submission (current Figure 2-figure supplement 2), we now present the results of statistical tests which revealed that PER to certain odors is statistically significantly stronger than that to the solvent control (mineral oil) for both wing-removed and wing-leg-removed flies.

      For Figure 2C-F, we now plot the results for experimental and control groups side by side in each figure.

      Regarding the results of statistical tests, we have provided more information in the legend and also prepared a summary table (supplemental table). 

      (7) Additional controls would be useful in supporting the conclusions. For the Kir experiments, how do we know that Kir is effective, especially in cases where odor-evoked PER was not impaired (e.g., Orco/Kir)? The authors could perform controls testing odor aversion, for example. For the Gr5a mutant, few details are provided on the nature of the control line used and whether it is in the same genetic background as the mutant. Regardless, it would be important to verify that the Gr5a mutant retains a normal sense of smell and shows normal levels of PER to stimuli other than sugar, ruling out more general deficits. Finally, as the method of using DeepLabCut tracking to quantify PER was newly developed, it is important to show the accuracy and specificity of detecting PER events compared to manual scoring.  

      A previous study (Sato, 2023, Front Mol Neurosci) showed that avoidance of 100 μM 2-methylthiazoline (2MT) was abolished, and avoidance of 1 mM 2MT was partially impaired, in Orco>Kir2.1 flies. However, because Orco-Gal4 does not label all the ORNs, and because we have more concrete results from flies in which all the olfactory organs are removed as well as from flies in which specific GRNs and Grs are manipulated, we decided to remove the data for Orco>Kir2.1 flies and have updated the text and Figure 2 accordingly.

      For the Gr5a mutant and its control, we have added detailed information about the genotype in the figure legend and in the Methods. We used the exact same lines as reported in Dahanukar et al. (2007), which we obtained from Dr. Dahanukar. Dahanukar et al. have already carefully shown that the Gr5a mutant loses responses only to certain types of sugars (it retains normal responses to some other sugars), demonstrating that Gr5a mutants do not exhibit general deficits.

      As for the PER scoring method, we manually scored PER duration and compared the results with those obtained using DeepLabCut in wild-type flies for the representative data. The two results were similar (no statistically significant difference). We have reported the result in Figure 1-figure supplement 1C.

      (8) The authors' explanation of why both attractive and aversive odors promote PER (lines 249-259) did not seem convincing. The explanation discusses the different roles of smell and taste but does not address the core question of why it would be adaptive for an aversive odor, which flies naturally avoid, to promote feeding behavior.  

      We have extended our explanation in the Discussion by adding the following possibility: “Enhancing PER to aversive odors might also be adaptive as animals often need to carry out the final check by tasting a trace amount of potentially dangerous substances to confirm that those should not be further consumed.”

      Reviewer #2 (Public review): 

      Summary: 

      A gustatory receptor and neuron enhances an olfactory behavioral response, proboscis extension. This manuscript clearly establishes a novel mechanism by which a gustatory receptor and neuron evokes an olfactory-driven behavioral response. The study expands recent observations by Dweck and Carlson (2023) that suggest new and remarkable properties among GRNs in Drosophila. Here, the authors articulate a clear instance of a novel neural and behavioral mechanism for gustatory receptors in an olfactory response.

      Strengths: 

      The systematic and logical use of genetic manipulation, imaging and physiology, and behavioral analysis makes a clear case that gustatory neurons are bona fide olfactory neurons with respect to proboscis extension behavior.

      Weaknesses: 

      No weaknesses were identified by this reviewer.  

      We appreciate the reviewer’s recognition of the novelty and significance of our work.

      Reviewer #3 (Public review): 

      Summary: 

      Using flies, Kazama et al. combined behavioral analysis, electrophysiological recordings, and calcium imaging experiments to elucidate how odors activate gustatory receptor neurons (GRNs) and elicit a proboscis extension response, which is interpreted as a feeding response. 

      The authors used DeepLabCut v2.0 to estimate the extension of the proboscis, which represents an unbiased and more precise method for describing this behavior compared to manual scoring.

      They demonstrated that the probability of eliciting a proboscis extension increases with higher odor concentrations. The most robust response occurs at a 0.5 v/v concentration, which, despite being diluted in the air stream, remains a relatively high concentration. Although the probability of response is not particularly high it is higher than control stimuli. Notably, flies respond with a proboscis extension to both odors that are considered positive and those regarded as negative.

      The authors used various transgenic lines to show that the response is mediated by GRNs.

      Specifically, inhibiting Gr5a reduces the response, while inhibiting Gr66a increases it in fed flies. Additionally, they find that odors induce a strong positive response in both types of GRNs, which is abolished when the labella of the proboscis are covered. This response was also confirmed through electrophysiological tip recordings.

      Finally, the authors demonstrated that the response increases when two stimuli of different modalities, such as sucrose and odors, are presented together, suggesting clear multimodal integration.

      Strengths: 

      The integration of various techniques that collectively support the robustness of the results.

      The assessment of electrophysiological recordings in intact animals, preserving natural physiological conditions.

      We appreciate the reviewer’s recognition of the novelty and significance of our work.

      Weaknesses: 

      The behavioral response is observed in only a small proportion of animals.  

      We acknowledge that the probability of odor-evoked PER is lower than that of sucrose-evoked PER, which is close to 100% depending on the concentration. To further quantify what proportion of animals exhibit odor-evoked PER, we now report this number alongside the probability of PER for each odor shown in Fig. 1E. We found that, in wild-type Dickinson flies, 73% and 68% of flies exhibited PER to at least one odor presented at concentrations of 0.5 and 0.1, respectively.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors): 

      Minor comments/suggestions: 

      - Define "MO" in Figure 1D.  

      We have defined it as mineral oil in the figure legend.

      - Clarify how peak response was calculated for GCaMP traces (is it just the single highest frame per trial?).

      We extended the description in the Methods as follows: “The peak stimulus response was quantified by averaging ΔF/F across five frames at the peak, followed by averaging across three trials for each stimulus. Odor stimulation began at frame 11, and the frames used for peak quantification were 12 to 16.” We made sure that information about the image acquisition frame rate was provided earlier in the text.
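
      The quantification quoted above can be sketched in a few lines. This is an illustration rather than the analysis code used in the study: it assumes that frames are numbered from 1, that each trial is a list of ΔF/F values with frame 1 at index 0, and that all names are hypothetical.

```python
FIRST_PEAK_FRAME = 12  # first frame of the peak window (1-based, from the Methods text)
LAST_PEAK_FRAME = 16   # last frame of the peak window (inclusive)


def peak_response(trials):
    """Mean ΔF/F over frames 12-16 of each trial, then averaged across trials."""
    per_trial_means = []
    for trace in trials:
        # Convert the 1-based inclusive frame range to a 0-based slice.
        window = trace[FIRST_PEAK_FRAME - 1:LAST_PEAK_FRAME]
        if len(window) != LAST_PEAK_FRAME - FIRST_PEAK_FRAME + 1:
            raise ValueError("trace is too short to contain the peak window")
        per_trial_means.append(sum(window) / len(window))
    return sum(per_trial_means) / len(per_trial_means)


def make_trace(peak_value, n_frames=20):
    """Hypothetical trace: flat baseline with a plateau in frames 12-16."""
    return [0.0] * 11 + [peak_value] * 5 + [0.0] * (n_frames - 16)


# Three hypothetical trials for one stimulus.
print(peak_response([make_trace(1.0), make_trace(2.0), make_trace(3.0)]))
```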

      - Clarify how the labellum was covered in Figure 3 and show that this does not affect the fly's ability to do PER (e.g., test PER to sugar stimulation on tarsus) - otherwise one might think that gluing the labella could affect PER.

      In Figure 3, only calcium responses were recorded, and PER was not recorded simultaneously from the same flies. To ensure stable recording from GRN axons in the SEZ, we kept the fly’s proboscis in an extended position as gently as possible using a strip of parafilm. In some of the imaging experiments, we covered the labellum with UV-curable glue, whose purpose was not to fix the labellum in an extended position but to prevent the odors from interacting with GRNs on the labellum. We have added text to the Methods to explain how we covered the labellum.

      - Clarify how the coefficients for the linear equation were chosen in Figure 3G.  

      We used linear regression (implemented in Python using scikit-learn) to model the relationship between neural activity and behavior, aiming to predict the PER duration based on the calcium responses of two GRN types, Gr5a and Gr66a. The coefficients were estimated using the LinearRegression function. We added this description to the Methods. 

      - Typo in "L-type", Figure 4A.  

      We appreciate the reviewer for pointing out this error and have corrected it.

      - Clarify over what time period ephys recordings were averaged to obtain average responses.

      We have modified the description in the Methods as follows: “The average firing rate was quantified by using the spikes generated between 200 and 700 ms after the stimulus contact following the convention to avoid the contamination of motion artifact (Dahanukar and Benton, 2023; Delventhal et al., 2014; Hiroi et al., 2002).”
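
      As an illustration of the window described above, the firing rate could be computed along these lines. This is a sketch under stated assumptions: the function name, the use of milliseconds for all times and the half-open window are choices made for the example, not details taken from the Methods.

```python
def average_firing_rate(spike_times_ms, contact_time_ms,
                        window_start_ms=200, window_end_ms=700):
    """Firing rate (Hz) from spikes 200-700 ms after stimulus contact.

    spike_times_ms: spike times in ms on the same clock as contact_time_ms.
    The window is treated as half-open, [start, end), which is an assumption.
    """
    lo = contact_time_ms + window_start_ms
    hi = contact_time_ms + window_end_ms
    n_spikes = sum(1 for t in spike_times_ms if lo <= t < hi)
    window_s = (window_end_ms - window_start_ms) / 1000.0
    return n_spikes / window_s

# Hypothetical spike train with stimulus contact at 1000 ms.
spikes = [1050, 1250, 1300, 1450, 1650, 1800]
print(average_firing_rate(spikes, contact_time_ms=1000))
```

      Spikes in the first 200 ms after contact are excluded by construction, which is the point of the convention cited in the Methods text.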

      - The data and statistics indicate that MCH does not enhance feeding in Figure 6G, so the text in lines 207-208 is not accurate.

      We have modified the text as follows: “A similar result was observed with ethyl butyrate, and a slight, although not significant, increase was also observed with 4-methylcyclohexanol (Figure 6G).”

      - P-value for Figure S9 correlation is not reported.  

      We appreciate the reviewer for pointing this out. The p-value is 0.00044, and we have added it to the figure legend (current Figure 5-figure supplement 1).

      Reviewer #2 (Recommendations for the authors): 

      Honestly, I have no recommendations for improvement. The manuscript is extremely well-written and logical. The experiments are persuasive. A lapidary piece of work.

      We appreciate the reviewer for the positive assessment of our work.

      Reviewer #3 (Recommendations for the authors): 

      - I suggest explaining the rationale for selecting a 4-second interval, beginning 1 second after the onset of stimulation.

      Integrated PER duration was defined as the sum of PER duration over 4 s starting 1 s after the odor onset. This definition was set based on the following data.

      (1) We used a photoionization detector (PID) to measure the actual time that the odor reaches the position of a tethered fly, which was approximately 1.1 seconds after the odor valve was opened. Therefore, we began analyzing PER responses 1 second after the odor onset (valve opening) to align with the actual timing of stimulation.

      (2) As shown in Fig.1D and 1F, the majority of PER occurred within 4 s after the odor arrival.

      We have now added the above rationale in the Methods.
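
      The definition of integrated PER duration given above can be sketched as follows. The bout representation (start and end times in seconds relative to valve opening) and the clipping of bouts that straddle the window edges are assumptions made for this example.

```python
def integrated_per_duration(bouts, window_start=1.0, window_len=4.0):
    """Sum PER duration within a 4 s window starting 1 s after odor onset.

    bouts: iterable of (start_s, end_s) PER episodes, timed from valve opening.
    Bouts that cross a window edge are clipped to the window (an assumption).
    """
    window_end = window_start + window_len
    total = 0.0
    for start, end in bouts:
        overlap = min(end, window_end) - max(start, window_start)
        if overlap > 0:
            total += overlap
    return total

# Hypothetical bouts: one starts before the window, one ends after it.
example_bouts = [(0.5, 1.5), (2.0, 3.0), (4.5, 6.0)]
print(integrated_per_duration(example_bouts))
```

      Starting the window at 1 s mirrors the roughly 1.1 s delay between valve opening and odor arrival measured with the PID.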

      - I could not find the statistical analysis for Figures 1E and 1G. If these figures are descriptive, I suggest the authors revise the sentences: 'Unexpectedly, we found that the odors alone evoked repetitive PER without an application of a tastant (Figures 1D-1G, and Movie S1). Different odors evoked PER with different probability (Figure 1E), latency (Figure S1A), and duration (Figures 1F, 1G, and S2)'.

      We have added the results of statistical analysis to the figure legend.

      - In Figure 2, the authors performed a Scheirer-Ray-Hare test, which, to my knowledge, is a nonparametric test for comparing responses across more than two groups with two factors. If this is the case, please provide the p-values for both factors and their interaction

      We now show the p-values for both factors, odor and group as well as their interaction in the supplementary table. 

      - In line 83, I suggest the authors avoid claiming that 'these data show the olfactory system modulates but is not required for odor-evoked PER,' as they are inhibiting most, but not all olfactory receptor neurons. In this regard, is it possible to measure the olfactory response to odors in these flies?  

      We thank the reviewer for the comment. Because Orco-Gal4 does not label all the ORNs, and because we have more concrete results from flies in which all the olfactory organs are removed and in which specific GRNs and Grs are manipulated, we decided to remove the data for Orco>kir2.1 flies and have updated the text and Figure 2 accordingly.

      - In Figure 2, I wonder if there are differences in the contribution of various receptors in detecting different odors. A more detailed statistical analysis might help address this question.

      Although it might be possible to infer the contribution of different gustatory receptors by constructing a quantitative model to predict PER, this is somewhat difficult because, except for Gr5a, Figure 2 manipulates the activity of individual GRNs rather than individual Grs. The idea could be tested in the future by more systematically manipulating the many Grs that are encoded in the fly genome.

      - For Figures 2J-L, please clarify which group serves as the control.  

      We have added this information to the legend. 

      - In Figure 3, I recommend including an air control in panels D and F to better appreciate the magnitude of the response under these conditions.

      The responses to all three controls (air, mineral oil and water) were almost zero. As the other reviewer suggested presenting trial-to-trial variability as well, we now show responses to all the controls in all the trials in all the animals tested in Figure 3-figure supplement 2.

      - I had difficulty understanding Figure 3G. Could the authors provide a more detailed explanation of the model?

      We used linear regression (implemented in Python using scikit-learn) to model the relationship between neural activity and behavior, aiming to predict the PER duration based on the calcium responses of two GRN types, Gr5a and Gr66a. The weights for GRNs were estimated using the LinearRegression function. The weights for Gr5a and Gr66a were positive and negative, respectively, indicating that Gr5a activity enhances PER whereas Gr66a activity reduces it.

      To evaluate the model performance, we calculated the coefficient of determination (R<sup>2</sup>), which was 0.81, meaning the model explained 81% of the variance in the PER data.

      The scatter plot in Fig. 3G shows a tight relationship between the predicted PER duration (y-axis) plotted against the actual PER duration (x-axis), demonstrating a strong predictive power of the model.

      We added the details to the Methods.
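
      The structure of such a model can be sketched as below. To keep the example dependency-light, it uses NumPy's least-squares solver as a stand-in for the scikit-learn call; both fit the same ordinary least-squares model. The function name and the data are hypothetical and synthetic; they are not the values behind Figure 3G.

```python
import numpy as np

def fit_per_model(gr5a, gr66a, per_duration):
    """Fit PER duration = w5a * Gr5a + w66a * Gr66a + b by least squares.

    Returns (w5a, w66a, intercept, r_squared).
    """
    gr5a = np.asarray(gr5a, dtype=float)
    gr66a = np.asarray(gr66a, dtype=float)
    y = np.asarray(per_duration, dtype=float)
    # Design matrix with a column of ones for the intercept.
    X = np.column_stack([gr5a, gr66a, np.ones_like(gr5a)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    predicted = X @ coef
    ss_res = np.sum((y - predicted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r_squared = 1.0 - ss_res / ss_tot  # coefficient of determination
    return coef[0], coef[1], coef[2], r_squared

# Synthetic responses (hypothetical): PER rises with Gr5a and falls with Gr66a.
gr5a_resp = [0.1, 0.4, 0.8, 0.3, 0.6]
gr66a_resp = [0.5, 0.2, 0.1, 0.7, 0.3]
per = [2.0 * a - 1.0 * b + 0.5 for a, b in zip(gr5a_resp, gr66a_resp)]
w5a, w66a, intercept, r2 = fit_per_model(gr5a_resp, gr66a_resp, per)
print(w5a, w66a, intercept, r2)
```

      Because the synthetic data are noise-free, the fit recovers the generating weights; with real data the coefficient of determination would be below 1, as with the reported value of 0.81.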

      - In Figure S4a, the reported p-value is 0.88, which seems to be a typo, as the text indicates that PER is enhanced in a starved state.

      Thank you for pointing this out. We have modified the figure legend to describe that PER was enhanced in a starved state only for the experiments conducted with odors at 10<sup>-1</sup> concentration (current Figure 2-figure supplement 1).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this study, Basha and colleagues aim to test whether the thalamic nucleus reuniens can facilitate the hippocampus/prefrontal cortex coupling during sleep. Considering the importance of sleep in memory consolidation, this study is important to understand the functional interaction between these three majorly involved regions. This work suggests that the thalamic nucleus reuniens has a functional role in synchronizing the hippocampus and prefrontal cortex.

      Strengths:

      The authors performed recordings in naturally sleeping cats, and analysed the correlation between the main slow wave sleep oscillatory hallmarks: slow waves, spindles, and hippocampal ripples, and with reuniens' neurons firing. They also associated intracellular recordings to assess the reuniens-prefrontal connectivity, and computational models of large networks in which they determined that the coupling of oscillations is modulated by the strength of hippocampal-thalamic connections.

      Thank you for your positive evaluation.

      Weaknesses:

      The authors' main claim is made on slow waves and spindle coupling, which are recorded both in the prefrontal cortex and surprisingly in reuniens. Known to be generated in the cortex by cortico-thalamic mechanisms, the slow waves and spindles recorded in reuniens show no evidence of local generation in the reuniens, which is not anatomically equipped to generate such activities. Until shown differently, these oscillations recorded in reuniens are most likely volume-conducted from nearby cortices. Therefore, such a caveat is a major obstacle to analysing their correlation (in time or frequency domains) with oscillations in other regions.

      (1) We fully agree with the reviewer that reuniens likely generates neither slow waves nor spindles. We do not make such a claim, as we clearly stated in the discussion (lines 319-324). We propose that Reuniens neurons mediate different forms of activity. In the model, we introduced the MD nucleus only because without MD we were unable to generate spindles. While the slow waves and spindles are generated in other thalamocortical regions, the REU neurons show these rhythms due to long-range projections from these regions to REU, as shown in the model.

      (2) Definitely, we cannot exclude some influence of volume conductance on obtained LFP recordings in REU nucleus. However, we show modulation of spiking activity within REU by spindles. Spike modulation cannot be explained by volume conductance but can be explained by either synaptic drive (likely the case here) or some intrinsic neuronal processes (like T-current).

      (3) In our REU recordings, we used tetrodes for spike identification. If slow waves and spindles were volume conducted, then slow waves and spindles recorded with tetrodes should have identical shapes. Following the reviewer's comment, we took these recordings and subtracted one channel from another. The difference in signal during slow waves is on the order of 0.1 mV. Considering that the distance between electrodes is on the order of 20 µm, such a difference in voltage is major and can only be explained by local extracellular currents, likely due to synaptic activities originating in afferent structures.

      Finally, the choice of the animal model (cats) is not the best suited one, as too few data, particularly anatomical ones regarding reuniens connectivity, are available to support functional results.

      (1) The thalamus of the majority of mammals (definitely primates and carnivores, including cats) contains local circuit interneurons (about 30% of all neurons). The vast majority of studies in rodents (except in the LGN nucleus) report either an absence or an extremely low number of thalamic interneurons (e.g., Jager P, Moore G, Calpin P, et al. Dual midbrain and forebrain origins of thalamic inhibitory interneurons. eLife. 2021; 10: e59272). Studies in species other than rodents are therefore necessary and provide information that is impossible to obtain in rodents.

      (2) The cat brain is much larger than that of mice or rats; therefore, the effects of volume conductance from cortex to REU are much smaller, if not negligible. The distance between REU and the closest cortical structure (ectosylvian gyrus) in cats is about 15 mm.

      (3) Indeed, there are far fewer anatomical data on cats than on rodents. This is why we performed the experiments shown in Figure 1, which contains functional anatomy data. Antidromic responses show that the recorded structure projects to the stimulated structure. Orthodromic responses show that the stimulated structure projects to the recorded structure.

      Reviewer #2 (Public Review):

      Summary:

      The interplay between the medial prefrontal cortex and ventral hippocampal system is critical for many cognitive processes, including memory and its consolidation over time. A prominent idea in recent research is that this relationship is mediated at least in part by the midline nucleus reuniens with respect to consolidation in particular. Whereas the bulk of evidence has focused on neuroanatomy and the effects of temporary or permanent lesions of the nucleus reuniens, the current work examined the electrophysiology of these three structures and how they inter-relate, especially during sleep, which is anticipated to be critical for consolidation. They provide evidence from intracellular recordings of the bi-directional functional connectivity among these structures. There is an emphasis on the interactions between these regions during sleep, especially slow-wave sleep. They provide evidence, in cats, that cortical slow waves precede reuniens slow waves and hippocampal sharp-wave ripples, which may reflect prefrontal control of the timing of thalamic and hippocampal events. They also find evidence that hippocampal sharp wave ripples trigger thalamic firing and precede the onset of reuniens and medial prefrontal cortex spindles. The authors suggest that the effectiveness of bidirectional connections between the reuniens and the (ventral) CA1 is particularly strong during non-rapid eye movement sleep in the cat. This is a very interesting, complex study on a highly topical subject.

      Strengths:

      An excellent array of different electrophysiological techniques and analyses are conducted. The temporal relationships described are novel findings that suggest mechanisms behind the interactions between the key regions of interest. These may be of value for future experimental studies to test more directly their association with memory consolidation.

      We thank the reviewer for the very positive evaluation of our study.

      Weaknesses:

      Given the complexity and number of findings provided, clearer explanation(s) and organisation that directed the specific value and importance of different findings would improve the paper. Most readers may then find it easier to follow the specific relevance of key approaches and findings and their emphasis. For example, the fact that bidirectional connections exist in the model system is not new per se. How and why the specific findings add to existing literature would have more impact if this information was addressed more directly in the written text and in the figure legends.

      Thank you for this comment. In the revised version, we will do our best to simplify the presentation and explain our findings more clearly.

      Reviewing Editor (Recommendations for Authors):

      Please discuss the ability of reuniens to generate spindles?

      We briefly discussed this in the previous version. We have now extended the discussion (p. 18).

      For population data, how many cats were used in acute and chronic experiments, where does the population data originate in Fig. 2? How repeatable were the findings across animals? Was histology verified in each animal?

      As stated at the beginning of the Methods section, we used a total of 20 cats: 16 anesthetized (acute) and 4 non-anesthetized (chronic). We added the number of cats in the appropriate places in the Results section. Population data in Figure 2 come from 48, 49 or 52 recording sessions (depending on the type of analysis, as indicated in the figure legend) from the 4 chronic cats; we clarified this information in the legend. Results were highly repeatable across animals. Histology was verified in all chronic and acute animals; we added a sentence to this effect in the Methods section.

      Explanation of figures is very poor, values in figures should be reported in results so they can be compared in the context of the description.

      In this revised version, we report most of the numbers shown in the figures and their legends in the main text (Results section).

      The depth of the recording tungsten electrodes are meaningless without the AP and ML coordinates given how heterogenous mPFC is. What is the ventromedial wall of the mPFC in the cat?

      We added the ML and AP coordinates to the Methods section. We replaced “ventromedial wall” with “ventroposterior part of the mPFC”.

      What are the two vertical lines in 1F?

      This was an error while preparing the figure. The panel was corrected.

      Line 90 mean +-SD of what? There are no numbers.

      Thanks, we now indicate the values.

      Panel 2L does not show increased spindling in reuniens prior to PFC as indicated in the results, please explain. It does show SWR in the hippocampus prior to spindles, what is the meaning of such a time relationship?

      Panel 2L did show increased spindling in reuniens prior to mPFC, but at the time scale shown this was indeed not very clear. In this revised manuscript, we added an inset zooming in around time zero to make this point clearer.

      Panel 2L indeed shows an increase in SWRs prior to the increase in spindles in both Reuniens and mPFC.

      As stated in the discussion, ‘We found that hippocampal SWRs trigger thalamic firing and precede the onset of reuniens and mPFC spindles, which points to SWRs as one of candidate events for spindle initiation.’

      It is unclear what the slow waves of PFC mean, these represent filtered PFC lfp, but is this a particular oscillation? They continue to occur during the spindle, while the slow waves supposedly trigger the spindle. Please explain and clarify.

      We recently published a review article, written with several scientists studying both human and animal sleep, that includes Box 1 (Timofeev I, Schoch S, LeBourgeois M, Huber R, Riedner B, Kurth S. Spatio-temporal properties of sleep slow waves and implications for development. Current Opinion in Physiology. 2020; 15: 172–182). In this box, among other terms, we provide the current definitions of slow waves and slow oscillation. Briefly, if slow waves are repeated with a given rhythm, they typically form a slow oscillation. However, if they occur in isolation or are not rhythmic, they remain slow waves but cannot be called a slow oscillation.

      Regarding the relation between spindles and slow oscillation: we are currently systematically analyzing data on spindles and slow waves obtained from head-restrained and freely behaving cats. One of the main findings is that a majority of ‘cortical’ spindles are local, to the extent that spindles can occur in alternation in two neighboring cortical cells. Largely, LFP sleep spindles occur more or less synchronously within the suprasylvian gyrus of cats, where indeed a large majority of them were triggered by slow waves. The synchrony between LFP spindles in the suprasylvian gyrus and other cortical areas is much less clear. It is therefore not surprising that spindles in one brain region can occur when a slow wave is present in some other brain region. Something similar was also shown in humans (Mölle M, Bergmann TO, Marshall L, Born J. Fast and slow spindles during the sleep slow oscillation: disparate coalescence and engagement in memory processing. Sleep. 2011; 34 (10): 1411-1421).

      In this regard, we are not ready to include modifications in the manuscript.

      Line 134, where is spindle amplitude shown? Plots report power within the spindle frequency band, which obviously captures more than just spindles.

      No, the plots in Figure 3B and C show the phase-amplitude coupling (PAC) strength. These were calculated with detected spindles; therefore, while we cannot exclude some false spindle detections, we are confident that they are at a negligible level. We modified the text so that, instead of spindle amplitude, we describe SW-spindle amplitude coupling. This reflects our analysis exactly.

      The discussion must include the medio dorsal nucleus which is the largest thalamic input to the prefrontal cortex and also receives input from the hippocampus. In particular, the case must be made for why reuniens would play a more important or different role than MD? (For example: Occurrence of Hippocampal Ripples is Associated with Activity Suppression in the Mediodorsal Thalamic Nucleus - PMC (nih.gov)).

      We cited the suggested study. We cannot say whether reuniens plays a more or less important role. What is clear is that hippocampal ripples at the onset of spindles trigger increased firing in both MD and reuniens. Our extracellular recordings (Fig. 4K) suggest that the increased firing is associated with spike-bursts. We also have a parallel unpublished study done on anesthetized mice showing SWR-triggered inhibitory potentials in both reuniens and MD that reverse at around -65 to -70 mV. Because the majority of SWRs occurred at the onset of the cortical up state, the relative roles of cortico-thalamic and hippocampo-thalamic drive are not easy to separate. We hope to do this convincingly in our forthcoming study, with the limitation that it was done on anesthetized mice.

      Reviewer #1 (Recommendations For The Authors):

      I strongly encourage the authors to perform current source density analyses on the LFP signals recorded in the nucleus reuniens to make sure that the observed oscillations are indeed locally generated. So far, the anatomical organisation in reuniens cannot support the local generation of oscillations, such as spindles and slow wave. At least in rodents (the cat reuniens does not seem too different, until shown differently), there were no oscillators found in reuniens, and at least not arranged like in cortical areas, allowing the summation in time, and particularly space, of rhythmic input currents. Bipolar recordings with pairs of twisted electrodes might also be useful to assess the local existence of spindles and slow waves.

      Current source density calculation is possible when one knows the exact distance between recording sites. As we used tetrodes made with 4 twisted platinum-iridium wires, we know more or less the range of distance between recording sites, but not the exact distance between any given pair of electrodes.

      Then, the physical distance between the reuniens and any cortical structure is about 8-9 mm. With such distances, volume conductance is expected to be negligible. If slow waves and spindles were volume conducted, then slow waves and spindles recorded with tetrodes should have identical shapes. Following the reviewer's comment, we took these recordings and subtracted one channel from another. The difference in signal during slow waves is on the order of 0.1 mV. Considering that the distance between electrodes is on the order of 20 µm, such a difference in voltage is major and can only be explained by local extracellular currents, likely due to synaptic activities originating in afferent structures.

      Below, we plotted the voltage of one channel of the tetrode versus another channel of the same tetrode. If the signal was simply volume conducted, one would expect to see the vast majority of points on the x=y line (red).

      Author response image 1.

      Below is a segment of mPFC LFP recording (upper black trace), the mPFC LFP filtered for spindle frequency (7-15 Hz) and the spindles detected (black lines above the filtered trace). Two LFP traces from a tetrode in the Reuniens (orange and light blue) are then overlaid. The second trace from the bottom (blue) represents the subtraction of Reuniens channel 2 from Reuniens channel 1, and just below it (lower blue trace) is this subtraction trace filtered for spindle frequency (7-15 Hz), showing a clear voltage difference in the spindle range between the two electrodes. Note also that around time 179-179.5 s, there is a clear spindle oscillation in the mPFC recording which is not present in the Reuniens recordings.

      Author response image 2.

      Therefore, we are convinced that in our recordings, volume conductance did not play any significant role.
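
      The logic of the channel-subtraction check described above can be illustrated with a small sketch. The function name and the signals are hypothetical; the traces are synthetic sine waves in mV, not recorded data. A purely volume-conducted signal is modeled as identical on both tetrode channels, and a local component is added to one channel only.

```python
import numpy as np

def channel_difference(ch1_mv, ch2_mv):
    """Return the sample-by-sample difference and its peak magnitude (mV)."""
    ch1 = np.asarray(ch1_mv, dtype=float)
    ch2 = np.asarray(ch2_mv, dtype=float)
    diff = ch1 - ch2
    return diff, float(np.max(np.abs(diff)))

# Synthetic illustration only (not recorded data).
t = np.arange(1000) / 1000.0                  # 1 s sampled at 1 kHz
shared = 0.5 * np.sin(2 * np.pi * 1.0 * t)    # component common to both channels
local = 0.1 * np.sin(2 * np.pi * 10.0 * t)    # component seen by one channel only

_, peak_volume_conducted = channel_difference(shared, shared)
_, peak_with_local = channel_difference(shared + local, shared)
print(peak_volume_conducted, peak_with_local)
```

      If the two channels carried only a common signal, every point would fall on the identity line and the difference would vanish; a residual of the size of the local component indicates a source that the two electrodes see differently.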

      Another concern regarding delays between events, like slow waves, measured between two regions (as exemplified by Figure 3). It appears that the delays were calculated from the filtered signal. Figure 3G shows a delay between the peak of the mPFC slow wave between the raw and the filtered signal, which might be artifactual of the processing. It is though not (or less) visible for the reuniens recording. Such mismatch might explain the observed differences in delays.

      Thanks for this comment. We recomputed the analysis using the original (smoothed) signal and obtained very similar results. Panels H and I of Figure 3 were updated using the new analysis performed on the original signal.

      The overall analyses of LFP-triggered reuniens MUA activity lack of statistics (at least z-scored firing to normalise the firings).

      Fig. 2H and I are representative examples of histograms; statistical data are shown in circular plots, as explained in the legend. Fig. 2L shows population data, and we now provide the standard error. Fig. 4C and D show individual examples. Fig. 4E shows histograms of the activity of all identified putative single units. Units that show significant modulation are displayed above the white line. Fig. 4F shows population data for significantly modulated units.

      A last point of detail in the model, which surprisingly shows reuniens to excitatory hippocampal cells' connectivity. Recent literature reports that reuniens only connect hippocampal interneurons, and not principal cells (at least in rodents, I could not find any report in cats). I wonder how changing this parameter would affect the results of the computational investigation, particularly the results shown in Figure 6.

      There are several studies in the literature showing a direct excitation from the Reuniens to pyramidal cells in the CA1, here are three of them:

      Goswamee, P., et al. (2021). "Nucleus Reuniens Afferents in Hippocampus Modulate CA1 Network Function via Monosynaptic Excitation and Polysynaptic Inhibition." Frontiers in Cellular Neuroscience 15.

      Dolleman-Van der Weel MJ, Lopes da Silva FH, Witter MP (1997) Nucleus Reuniens Thalami Modulates Activity in Hippocampal Field CA1 through Excitatory and Inhibitory Mechanisms. The Journal of Neuroscience 17:5640.

      Dolleman-van der Weel MJ, Lopes da Silva FH, Witter MP (2017) Interaction of nucleus reuniens and entorhinal cortex projections in hippocampal field CA1 of the rat. Brain Structure and Function 222:2421-2438.

      Because this is not a review paper, we opted to not cite all the papers describing connectivity between mPFC, hippocampus and thalamus.

      Reviewer #2 (Recommendations For The Authors):

      I respectfully suggest that the earlier (public) comments listed above should be addressed. In addition, it would be useful to make it clearer when non-rapid eye movement sleep was being addressed and when rapid eye movement was being addressed. Is it of value to use a single term instead of adding "slow wave sleep" or else clarify when either term is used? The addition of more subheadings might help. Moreover, the relative contribution/value of evidence from these two sleep states was not addressed or was not very clear.

      We tried to make it clearer when NREM and when REM was analysed.

      We replaced slow-wave sleep with NREM sleep in the figure 5 title.

      We added several subheadings in the discussion.

      Relative contribution of NREM vs REM sleep was not addressed? Sorry but we do not clearly understand your question. Figs. 2 and 3 deal mainly with NREM sleep (Fig 2.B has an example of REM sleep). Fig. 4 essentially describes results obtained during REM sleep.

      I was not sure if the Abstract summarised the key take-home messages from the large amount of evidence provided. Some choices are needed, of course, but "evidence of bidirectional connectivity" struck me as less novel than other evidence provided. Given the huge amount of findings provided, which is commendable, it is still useful to present it perhaps in a more digestible fashion. For example, the headings or the first sentence(s) below headings could indicate the aim or the outcome of the specific method/analysis/findings.

      We rewrote the abstract and also added a conclusion to highlight the major findings and their meaning.

      It is more common to use NRe or Re, rather than REU.

      We avoided using RE because, for decades, we have used RE to abbreviate the thalamic reticular nucleus in several publications. In this revised version, we spell the name out in full: Reuniens.

      Line 49 mentions "short-term" memory. Please specify this more clearly as it is otherwise ambiguous. Also, line 303.

      We rephrased the sentence: In particular, the hierarchical coupling of slow waves, spindles and SWRs is thought to play a key role in memory consolidation.

      Line 303 was likely about the ventromedial wall: we corrected that sentence.

      Line 62: the word, "required" (for memory function) is too strong because there is evidence that it is not always required.

      We changed the wording to “plays a major role”.

      The focus within the medial prefrontal cortex could be specified more clearly / earlier.

      The mPFC is mentioned in the second sentence of the abstract and in the first sentence of the introduction.

      Line 134: The heading states "determine" and then mentions modulation. These terms may not be interchangeable or they need clarification.

      We changed it to slow wave-spindle amplitude coupling. This represents exactly our analysis.

      Line 204: Does "cortical network" mean "prefrontal cortex network"?

      Yes, as described in lines 192-193, the two cortical networks (N1 and N2) of the model represent mPFC layers 5 and 6, respectively.

      Lines 283 to 289: These were not very clear to me.

      These lines described the potential mechanisms for the responses to hippocampal and reuniens stimulation recorded intracellularly (results in figure 1). We modified this paragraph for clarity.

      Line 296: Specify the "claim".

      We modified the sentence to read “[…] provides supporting evidence for this claim that nucleus Reuniens might synchronize the activity of ventral hippocampus and mPFC.”

      The discussion naturally focuses on the thalamic nucleus reuniens, but also occasionally mentions the thalamic mediodorsal nucleus. The distinction, assuming this is highly relevant, could be expressed more clearly (direct comparison with their previous papers).

      We never published a study on the mediodorsal nucleus. We do have some unpublished results from recordings in the MD nucleus and they reveal the presence of an inhibitory component at the beginning of cortical active states, therefore behaving in a similar way to first order nuclei. It is then possible that spindles recorded in the reuniens are actually generated in the MD nucleus and then transmitted to Reuniens through the thalamic reticular nucleus, as both MD and reuniens are connected to the rostral thalamic reticular nucleus. We added some discussion about this.

      Figure 1B: Do the authors have any additional evidence of the placements in the reuniens, because the photo provided suggests a large area beyond the reuniens boundary. Also, please confirm is the CEM between Rh and Re in the cat (I think the Rh and Re are adjacent in the rat).

      Figure 1B is from an electrolytic lesion, which is necessarily bigger than the tip of the electrode. Therefore the center of the electrolytic lesion indicates where the electrode tip was located which is well within the reuniens nucleus.

      Also, yes, CE (Nucleus centralis thalami, pars medialis) is located between the reuniens and rhomboid nuclei in cats. This can be found in two cat atlases:

      Reinoso-Suárez, F. (1961). Topographischer Hirnatlas der Katze für experimental-physiologische Untersuchungen (Merck).

      Berman AL, Jones EG (1982) The Thalamus and Basal Telencephalon of the Cat: A Cytoarchitectonic Atlas with Stereotaxic Coordinates: University of Wisconsin Press.

      The first mention of hippocampus in the figure legends should remind the reader by stating "ventral hippocampus".

      In this revised version, we added “ventral” in several instances both in the main text and in figure legend.

      Figure 2: It seems unusual to mention "unusually short NREM". Presumably, things are the same otherwise - if so, perhaps mention that, especially if some of the effects reflect an "unusual" episode.

      We display this particular segment because we want to show a continuous recording in which the individual elements characterizing specific states are still visible.

      Some effects look like they are strong and others perhaps weaker. If so, how do these impact the final conclusions?

      We are sorry, but we did not clearly understand what the reviewer means here. In general, we consider any effect with a statistically significant difference (at the conventional 0.05 threshold) to be significant. All other cases are described on an individual basis.

      Perhaps "MAD" should be in full on the first occasion, if not already.

      It was spelled out at line 659, but we now spell it out also in the results section and in figure 2 legend.

      Methods: the key question is the use of rodent recordings to classify cat recordings. It would be good to have a reference indicating that this can be directly used for cats, which may have different sleep cycles and patterns compared to rats.

      We did not use rodent recordings to classify cat recordings; however, we did use a state detection script that was developed with rodent recordings. As mentioned in the methods section, we adapted the script to cat mPFC recordings, and manual corrections were then made to correctly detect REM episodes. Respectfully, our lab has investigated sleep-wake states in non-anesthetized animals for a few decades; we have developed state detection algorithms in mice, cats, and marmosets when needed (to analyse months of recordings), and we have extensive expertise in identifying states of vigilance from electrophysiological recordings.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Weaknesses:

      INTRODUCTION & THEORY

      (1) Can the authors please clarify why the first trial of extinction in a standard protocol does NOT produce the retrieval-extinction effect? Particularly as the results section states: "Importantly, such a short-term effect is also retrieval dependent, suggesting the labile state of memory is necessary for the short-term memory update to take effect (Fig. 1e)." The importance of this point comes through at several places in the paper:

      1A. "In the current study, fear recovery was tested 30 minutes after extinction training, whereas the effect of memory reconsolidation was generally evident only several hours later and possibly with the help of sleep, leaving open the possibility of a different cognitive mechanism for the short-term fear dementia related to the retrieval-extinction procedure." ***What does this mean? The two groups in study 1 experienced a different interval between the first and second CS extinction trials; and the results varied with this interval: a longer interval (10 min) ultimately resulted in less reinstatement of fear than a shorter interval. Even if the different pattern of results in these two groups was shown/known to imply two different processes, there is absolutely no reason to reference any sort of cognitive mechanism or dementia - that is quite far removed from the details of the present study.

      Indeed, the only difference between the standard extinction paradigm and the retrieval-extinction paradigm is the interval between the first and second CS extinction trials. It has been shown before that a second CS+ presented 1 hour after the initial retrieval CS+ resulted in the dephosphorylation of GluR1 in rats, which was indicative of memory destabilization. A second CS+ presented only 3 minutes after the initial retrieval CS+, as in standard extinction training, did not cause the GluR1 dephosphorylation effect (Monfils et al., 2009). Therefore, an isolated presentation of the CS+ seems to be important in preventing the return of fear expression. Behaviorally, when the CSs were presented in a more temporally spaced (vs. massed) or a more gradual manner during extinction training, the fear amnesia effects were more salient (Cain et al., 2003, Gershman et al., 2013). It has also been suggested that the old memory can be successfully modified only when the old memory and the new experience (through extinction) can be inferred to have been generated from the same underlying latent cause (Gershman et al., 2017). On the other hand, if the new experiences are believed to be generated by a different latent cause, then the old memory is less likely to be subject to modification. Therefore, from a theoretical perspective, the way the first and second CS are temporally organized (retrieval-extinction or standard extinction) might affect how the latent cause is inferred and lead to different levels of fear expression. These findings, together with studies of both fear and drug memories using the retrieval-extinction paradigm (Liu et al., 2014, Luo et al., 2015, Schiller et al., 2010, Xue et al., 2012), seem to suggest that the retrieval-extinction and the standard extinction procedures engage different cognitive and molecular mechanisms that lead to significantly different behavioral outcomes.

      In our study, we focus on the short-term and long-term amnesia effects of the retrieval-extinction procedure but also point out the critical role of retrieval in eliciting the short-term effect.

      1B. "Importantly, such a short-term effect is also retrieval dependent, suggesting the labile state of memory is necessary for the short-term memory update to take effect (Fig. 1e)." ***As above, what is "the short-term memory update"? At this point in the text, it would be appropriate for the authors to discuss why the retrieval-extinction procedure produces less recovery than a standard extinction procedure as the two protocols only differ in the interval between the first and second extinction trials. References to a "short-term memory update" process do not help the reader to understand what is happening in the protocol.

      Sorry for the lack of clarity here. By short-term memory update we meant the short-term amnesia in fear expression.

      (2) "Indeed, through a series of experiments, we identified a short-term fear amnesia effect following memory retrieval, in addition to the fear reconsolidation effect that appeared much later."

      ***The only reason for supposing two effects is because of the differences in responding to the CS2, which was subjected to STANDARD extinction, in the short- and long-term tests. More needs to be said about how and why the performance of CS2 is affected in the short-term test and recovers in the long-term test. That is, if the loss of performance to CS1 and CS2 is going to be attributed to some type of memory updating process across the retrieval-extinction procedure, one needs to explain the selective recovery of performance to CS2 when the extinction-to-testing interval extends to 24 hours. Instead of explaining this recovery, the authors note that performance to CS1 remains low when the extinction-to-testing interval is 24 hours and invoke something to do with memory reconsolidation as an explanation for their results: that is, they imply (I think) that reconsolidation of the CS1-US memory is disrupted across the 24-hour interval between extinction and testing even though CS1 evokes negligible responding just minutes after extinction.

      In our results, we did not focus only on the fear expression related to CS2. In fact, we also demonstrated that the CS1-related fear expression diminished in the short-term memory test but re-appeared in the long-term memory test after the CS1 retrieval-extinction training.

      The “…recovery of performance to CS2 when the extinction-to-testing interval extends to 24 hours…” is a result that has been demonstrated in various previous studies (Kindt and Soeter, 2018, Kindt et al., 2009, Nader et al., 2000, Schiller et al., 2013, Schiller et al., 2010, Xue et al., 2012). That is, the reconsolidation framework stipulates that the pharmacological or behavioral intervention during the labile states of the reconsolidation window only modifies the fear memory linked to the reminded retrieval cue, but not for the non-reminded CS-US memory expression (but also see (Liu et al., 2014, Luo et al., 2015) for using the unconditioned stimulus as the reminder cue and the retrieval-extinction paradigm to prevent the return of fear memory associated with different CS).  In fact, we hypothesized the temporal dynamics of CS1 and CS2 related fear expressions were due to the interplay between the short-term and long-term (reconsolidation) effects of the retrieval-extinction paradigm in the last figure (Fig. 6). 

      (3) The discussion of memory suppression is potentially interesting but, in its present form, raises more questions than it answers. That is, memory suppression is invoked to explain a particular pattern of results but I, as the reader, have no sense of why a fear memory would be better suppressed shortly after the retrieval-extinction protocol compared to the standard extinction protocol; and why this suppression is NOT specific to the cue that had been subjected to the retrieval-extinction protocol.

      We discussed memory suppression as one of the potential mechanisms to account for the three characteristics of the short-term amnesia effects: cue-independence, temporal dynamics (short-term) and thought-control-ability relevance. According to the memory suppression theory, the memory suppression effect is NOT specific to the cue and this effect was demonstrated via the independent cue test in a variety of studies (Anderson and Floresco, 2022, Anderson and Green, 2001, Gagnepain et al., 2014, Zhu et al., 2022). Therefore, we suggest in the discussion that it might be possible the CS1 retrieval cue prompted an automatic suppression mechanism and yielded the short-term fear amnesia consistent with various predictions from the memory suppression theory:

      “In our experiments, subjects were not explicitly instructed to suppress their fear expression, yet the retrieval-extinction training significantly decreased short-term fear expression. These results are consistent with the short-term amnesia induced with the more explicit suppression intervention (Anderson et al., 1994; Kindt and Soeter, 2018; Speer et al., 2021; Wang et al., 2021; Wells and Davies, 1994). It is worth noting that although consciously repelling unwanted memory is a standard approach in memory suppression paradigm, it is possible that the engagement of the suppression mechanism can be unconscious. For example, in the retrieval-induced forgetting (RIF) paradigm, recall of a stored memory impairs the retention of related target memory and this forgetting effect emerges as early as 20 minutes after the retrieval procedure, suggesting memory suppression or inhibition can occur in a more spontaneous and automatic manner (Imai et al., 2014). Moreover, subjects with trauma histories exhibited more suppression-induced forgetting for both negative and neutral memories than those with little or no trauma (Hulbert and Anderson, 2018). Similarly, people with higher self-reported thought-control capabilities showed more severe cue-independent memory recall deficit, suggesting that suppression mechanism is associated with individual differences in spontaneous control abilities over intrusive thoughts (Küpper et al., 2014). It has also been suggested that similar automatic mechanisms might be involved in organic retrograde amnesia of traumatic childhood memories (Schacter et al., 2012; Schacter et al., 1996).”

      3A. Relatedly, how does the retrieval-induced forgetting (which is referred to at various points throughout the paper) relate to the retrieval-extinction effect? The appeal to retrieval-induced forgetting as an apparent justification for aspects of the present study reinforces points 2 and 3 above. It is not uninteresting but needs some clarification/elaboration.

      We introduced the retrieval-induced forgetting (RIF) to make the point that RIF was believed to be related to the memory suppression mechanism and the RIF effect can appear relatively early, consistent with what we observed in the short-term amnesia effect. We have re-written the manuscript to make this point clearer:

      “It is worth noting that although consciously repelling unwanted memory is a standard approach in memory suppression paradigm, it is possible that the engagement of the suppression mechanism can be unconscious. For example, in the retrieval-induced forgetting (RIF) paradigm, recall of a stored memory impairs the retention of related target memory and this forgetting effect emerges as early as 20 minutes after the retrieval procedure, suggesting memory suppression or inhibition can occur in a more spontaneous and automatic manner (Imai et al., 2014). Moreover, subjects with trauma histories exhibited more suppression-induced forgetting for both negative and neutral memories than those with little or no trauma (Hulbert and Anderson, 2018). Similarly, people with higher self-reported thought-control capabilities showed more severe cue-independent memory recall deficit, suggesting that suppression mechanism is associated with individual differences in spontaneous control abilities over intrusive thoughts (Küpper et al., 2014).”

      (4) Given the reports by Chalkia, van Oudenhove & Beckers (2020) and Chalkia et al (2020), some qualification needs to be inserted in relation to reference 6. That is, reference 6 is used to support the statement that "during the reconsolidation window, old fear memory can be updated via extinction training following fear memory retrieval". This needs a qualifying statement like "[but see Chalkia et al (2020a and 2020b) for failures to reproduce the results of 6]."

      https://pubmed.ncbi.nlm.nih.gov/32580869/

      https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7115860/

      We have incorporated the reviewer’s suggestion into the revised manuscript in both the introduction:

      “Pharmacological blockade of protein synthesis and behavioral interventions can both eliminate the original fear memory expression in the long-term (24 hours later) memory test ( Lee, 2008; Lee et al., 2017; Schiller et al., 2013; Schiller et al., 2010), resulting in the cue-specific fear memory deficit (Debiec et al., 2002; Lee, 2008; Nader, Schafe, & LeDoux, 2000). For example, during the reconsolidation window, retrieving a fear memory allows it to be updated through extinction training (i.e., the retrieval-extinction paradigm (Lee, 2008; Lee et al., 2017; Schiller et al., 2013; Schiller et al., 2010), but also see (Chalkia, Schroyens, et al., 2020; Chalkia, Van Oudenhove, et al., 2020; D. Schiller, LeDoux, & Phelps, 2020)”

      And in the discussion:

      “It should be noted that while our long-term amnesia results were consistent with the fear memory reconsolidation literatures, there were also studies that failed to observe fear prevention (Chalkia, Schroyens, et al., 2020; Chalkia, Van Oudenhove, et al., 2020; Schroyens et al., 2023). Although the memory reconsolidation framework provides a viable explanation for the long-term amnesia, more evidence is required to validate the presence of reconsolidation, especially at the neurobiological level (Elsey et al., 2018). While it is beyond the scope of the current study to discuss the discrepancies between these studies, one possibility to reconcile these results concerns the procedure for the retrieval-extinction training. It has been shown that the eligibility for old memory to be updated is contingent on whether the old memory and new observations can be inferred to have been generated by the same latent cause (Gershman et al., 2017; Gershman and Niv, 2012). For example, prevention of the return of fear memory can be achieved through gradual extinction paradigm, which is thought to reduce the size of prediction errors to inhibit the formation of new latent causes (Gershman, Jones, et al., 2013). Therefore, the effectiveness of the retrieval-extinction paradigm might depend on the reliability of such paradigm in inferring the same underlying latent cause. Furthermore, other studies highlighted the importance of memory storage per se and suggested that memory retention was encoded in the memory engram cell ensemble connectivity whereas the engram cell synaptic plasticity is crucial for memory retrieval (Ryan et al., 2015; Tonegawa, Liu, et al., 2015; Tonegawa, Pignatelli, et al., 2015). It remains to be tested how the cue-independent short-term and cue-dependent long-term amnesia effects we observed could correspond to the engram cell synaptic plasticity and functional connectivity among engram cell ensembles (Figure 6). 
      This is particularly important, since the cue-independent characteristic of the short-term amnesia suggest that either different memory cues fail to evoke engram cell activities, or the retrieval-extinction training transiently inhibits connectivity among engram cell ensembles. Finally, SCR is only one aspect of the fear expression, how the retrieval-extinction paradigm might affect subjects’ other emotional (such as the startle response) and cognitive fear expressions such as reported fear expectancy needs to be tested in future studies since they do not always align with each other (Kindt et al., 2009; Sevenster et al., 2012, 2013).”

      5A. What does it mean to ask: "whether memory retrieval facilitates update mechanisms other than memory reconsolidation"? That is, in what sense could or would memory retrieval be thought to facilitate a memory update mechanism?

      It is widely documented in the literature that memory retrieval renders the old memory labile and thus susceptible to the memory reconsolidation process. However, as we mentioned in the manuscript, studies have shown that memory reconsolidation requires de novo protein synthesis and usually takes hours to complete. What remains unknown is whether old memories are subject to modifications other than the reconsolidation process. Our task specifically tested the short-term effect of the retrieval-extinction paradigm and found that fear expression diminished 30 min after the retrieval-extinction training. Such an effect cannot be accounted for by the memory reconsolidation effect.

      5B. "First, we demonstrate that memory reactivation prevents the return of fear shortly after extinction training in contrast to the memory reconsolidation effect which takes several hours to emerge and such a short-term amnesia effect is cue independent (Study 1, N = 57 adults)."

      ***The phrasing here could be improved for clarity: "First, we demonstrate that the retrieval-extinction protocol prevents the return of fear shortly after extinction training (i.e., when testing occurs just min after the end of extinction)." Also, cue-dependence of the retrieval-extinction effect was assessed in study 2.

      We thank the reviewer and have modified the phrasing of the sentence:

      “First, we demonstrate that memory retrieval-extinction protocol prevents the return of fear expression shortly after extinction training and this short-term effect is memory reactivation dependent (Study 1, N = 57 adults).”

      5C. "Furthermore, memory reactivation also triggers fear memory reconsolidation and produces cue-specific amnesia at a longer and separable timescale (Study 2, N = 79 adults)." ***In study 2, the retrieval-extinction protocol produced a cue-specific disruption in responding when testing occurred 24 hours after the end of extinction. This result is interesting but cannot be easily inferred from the statement that begins "Furthermore..." That is, the results should be described in terms of the combined effects of retrieval and extinction, not in terms of memory reactivation alone; and the statement about memory reconsolidation is unnecessary. One can simply state that the retrieval-extinction protocol produced a cue-specific disruption in responding when testing occurred 24 hours after the end of extinction.

      We have revised the text according to the reviewer’s comment.

      “Furthermore, across different timescales, the memory retrieval-extinction paradigm triggers distinct types of fear amnesia in terms of cue-specificity and cognitive control dependence, suggesting that the short-term fear amnesia might be caused by different mechanisms from the cue-specific amnesia at a longer and separable timescale (Study 2, N = 79 adults).”

      5D. "...we directly manipulated brain activities in the dorsolateral prefrontal cortex and found that both memory retrieval and intact prefrontal cortex functions were necessary for the short-term fear amnesia."

      ***This could be edited to better describe what was shown: E.g., "...we directly manipulated brain activities in the dorsolateral prefrontal cortex and found that intact prefrontal cortex functions were necessary for the short-term fear amnesia after the retrieval-extinction protocol."

      Edited:

      “Finally, using continuous theta-burst stimulation (Study 3, N = 75 adults), we directly manipulated brain activity in the dorsolateral prefrontal cortex, and found that both memory reactivation and intact prefrontal cortex function were necessary for the short-term fear amnesia after the retrieval-extinction protocol.”

      5E. "The temporal scale and cue-specificity results of the short-term fear amnesia are clearly dissociable from the amnesia related to memory reconsolidation, and suggest that memory retrieval and extinction training trigger distinct underlying memory update mechanisms."

      ***The pattern of results when testing occurred just minutes after the retrieval-extinction protocol was different from that obtained when testing occurred 24 hours after the protocol. Describing this in terms of temporal scale is unnecessary, and suggesting that memory retrieval and extinction trigger different memory update mechanisms is not obviously warranted. The results of interest are due to the combined effects of retrieval+extinction and there is no sense in which different memory update mechanisms should be identified with retrieval (mechanism 1) and extinction (mechanism 2).

      We did not argue for different memory update mechanisms for the “retrieval (mechanism 1) and extinction (mechanism 2)” in our manuscript. Instead, we proposed that the retrieval-extinction procedure, which was mainly documented in the previous literature for its association with the reconsolidation-related fear memory retention (the long-term effect), also had a much faster effect (the short-term effect). These two effects differed in many aspects, suggesting that different memory update mechanisms might be involved.

      5F. "These findings raise the possibility of concerted memory modulation processes related to memory retrieval..."

      ***What does this mean?

      As we mentioned in our response to the previous comment, we believe that the retrieval-extinction procedure triggers different types of memory update mechanisms working on different temporal scales.

      (6) "...suggesting that the fear memory might be amenable to a more immediate effect, in addition to what the memory reconsolidation theory prescribes..."

      ***What does it mean to say that the fear memory might be amenable to a more immediate effect?

      We intended to state that the retrieval-extinction procedure can produce a short-term amnesia effect and have thus revised the text.

      (7) "Parallel to the behavioral manifestation of long- and short-term memory deficits, concurrent neural evidence supporting memory reconsolidation theory emphasizes the long-term effect of memory retrieval by hypothesizing that synapse degradation and de novo protein synthesis are required for reconsolidation."

      ***This sentence needs to be edited for clarity.

      We have rewritten this sentence:

      “Corresponding to the long-term behavioral manifestation, concurrent neural evidence supporting memory reconsolidation hypothesis emphasizes that synapse degradation and de novo protein synthesis are required for reconsolidation.”

      (8) "previous behavioral manipulations engendering the short-term declarative memory effect..."

      ***What is the declarative memory effect? It should be defined.

      We meant the amnesia reported in declarative memory research, such as the memory deficit caused by the think/no-think paradigm. The text has been modified for clarity:

      “On the contrary, previous behavioral manipulations engendering the short-term amnesia on declarative memory, such as the think/no-think paradigm, hinges on the intact activities in brain areas such as dorsolateral prefrontal cortex (cognitive control) and its functional coupling with specific brain regions such as hippocampus (memory retrieval) (Anderson and Green, 2001; Wimber et al., 2015).”

      (9) "The declarative amnesia effect emerges much earlier due to the online functional activity modulation..."

      ***Even if the declarative memory amnesia effect had been defined, the reference to online functional activity modulation is not clear.

      We have rephrased the sentence:

      “The declarative amnesia effect arises much earlier due to the more instant modulation of functional connectivity, rather than the slower processes of new protein synthesis in these brain regions.”

      (10) "However, it remains unclear whether memory retrieval might also precipitate a short-term amnesia effect for the fear memory, in addition to the long-term prevention orchestrated by memory consolidation."

      ***I found this sentence difficult to understand on my first pass through the paper. I think it is because of the phrasing of memory retrieval. That is, memory retrieval does NOT precipitate any type of short-term amnesia for the fear memory: it is the retrieval-extinction protocol that produces something like short-term amnesia. Perhaps this sentence should also be edited for clarity.

      We have changed “memory retrieval” to “retrieval-extinction” where applicable.

      I will also note that the usage of "short-term" at this point in the paper is quite confusing: Does the retrieval-extinction protocol produce a short-term amnesia effect, which would be evidenced by some recovery of responding to the CS when tested after a sufficiently long delay? I don't believe that this is the intended meaning of "short-term" as used throughout the majority of the paper, right?

      By “short-term”, we meant the lack of fear expression in the test phase (measured by skin conductance responses) shortly after the retrieval-extinction procedure (30 mins in studies 1 & 2 and 1 hour in study 3). It does not indicate that the effect is by itself “short-lived”.

      (11) "To fully comprehend the temporal dynamics of the memory retrieval effect..."

      ***What memory retrieval effect? This needs some elaboration.

      We’ve changed the phrase “memory retrieval effect” to “retrieval-extinction effect” to refer to the effect of retrieval-extinction on fear amnesia.

      (12) "We hypothesize that the labile state triggered by the memory retrieval may facilitate different memory update mechanisms following extinction training, and these mechanisms can be further disentangled through the lens of temporal dynamics and cue-specificities."

      ***What does this mean? The first part of the sentence is confusing around the usage of the term "facilitate"; and the second part of the sentence that references a "lens of temporal dynamics and cue-specificities" is mysterious. Indeed, as all rats received the same retrieval-extinction exposures in Study 2, it is not clear how or why any differences between the groups are attributed to "different memory update mechanisms following extinction".

      As the reviewer mentioned, if data from only one time point were collected, we could not differentiate whether different memory update mechanisms are involved. In study 2, however, the 3 groups differed only in the time at which the reinstatement test was conducted. Accordingly, our results showed that the fear amnesia effects for CS1 and CS2 cannot be simply explained by forgetting: different memory update mechanisms must be at work to explain the characteristics of the SCR related to both CS1 and CS2 at three different time scales (30 min, 6 h and 24 h). It was based on these results, together with the results from the TMS study (study 3), that we proposed the involvement of a short-term memory update mechanism in addition to the reconsolidation-related fear amnesia (which should become evident much later) induced by the retrieval-extinction protocol.

      (13) "In the first study, we aimed to test whether there is a short-term amnesia effect of fear memory retrieval following the fear retrieval-extinction paradigm."

      ***Again, the language is confusing. The phrase, "a short-term amnesia effect" implies that the amnesia itself is temporary; but I don't think that this implication is intended. The problem is specifically in the use of the phrase "a short-term amnesia effect of fear memory retrieval." To the extent that short-term amnesia is evident in the data, it is not due to retrieval per se but, rather, the retrieval-extinction protocol.

      We have changed the wording and replaced “memory retrieval” with “retrieval-extinction” where applicable.

      (14) The authors repeatedly describe the case where there was a 24-hour interval between extinction and testing as consistent with previous research on fear memory reconsolidation. Which research exactly? That is, in studies where a CS re-exposure was combined with a drug injection, responding to the CS was disrupted in a final test of retrieval from long-term memory which typically occurred 24 hours after the treatment. Is that what the authors are referring to as consistent? If so, which aspect of the results are consistent with those previous findings? Perhaps the authors mean to say that, in the case where there was a 24-hour interval between extinction and testing, the results obtained here are consistent with previous research that has used the retrieval-extinction protocol. This would clarify the intended meaning greatly.

      Our 24-hour test results after the retrieval-extinction protocol were consistent with both pharmacological and behavioral intervention studies of fear memory reconsolidation (Kindt and Soeter, 2018, Kindt et al., 2009, Liu et al., 2014, Luo et al., 2015, Monfils et al., 2009, Nader et al., 2000, Schiller et al., 2013, Schiller et al., 2010, Xue et al., 2012), in which the final test phase typically occurred 24 hours after the treatment. At the 24-hour interval, the memory reconsolidation effect would become evident after either drug administration or behavioral intervention (extinction training).

      DATA

      (15) Points about data:

      15A. The eight participants who were discontinued after Day 1 in study 1 were all from the no-reminder group. Can the authors please comment on how participants were allocated to the two groups in this experiment so that the reader can better understand why the distribution of non-responders was non-random (as it appears to be)?

      15B. Similarly, in study 2, of the 37 participants that were discontinued after Day 2, 19 were from Group 30 min, and 5 were from Group 6 hours. Can the authors comment on how likely these numbers are to have been by chance alone? I presume that they reflect something about the way that participants were allocated to groups, but I could be wrong.

      We went back and checked our data. As we mentioned in the supplementary materials, we categorized subjects as non-responders if their SCR response to any CS was less than 0.02 µS on Day 1 (fear acquisition). Most of the discontinued participants (non-responders) in the no-reminder group (study 1) and the 30min & 24h groups (study 2) were tested when the heating season had just ended or had yet to start, respectively. It has been documented that human body thermal conditions are related to the quality of skin conductance response (SCR) measurements (Bauer et al., 2022, Vila, 2004). We suspect that the non-response might be related to body thermal conditions caused by the lack of central heating.

      15C. "Post hoc t-tests showed that fear memories were resilient after regular extinction training, as demonstrated by the significant difference between fear recovery indexes of the CS+ and CS- for the no-reminder group (t26 = 7.441, P < 0.001; Fig. 1e), while subjects in the reminder group showed no difference of fear recovery between CS+ and CS- (t29 = 0.797, P = 0.432, Fig. 1e)."

      ***Is the fear recovery index shown in Figure 1E based on the results of the first test trial only? How can there have been a "significant difference between fear recovery indexes of the CS+ and CS- for the no-reminder group" when the difference in responding to the CS+ and CS- is used to calculate the fear recovery index shown in 1E? What are the t-tests comparing exactly, and what correction is used to account for the fact that they are applied post-hoc?

      As we mentioned in the results section of the manuscript, the fear recovery index was defined as “the SCR difference between the first test trial and the last extinction trial of a specific CS”. We then calculated the “differential fear recovery index” (figure legend of Fig. 1e) between CS+ and CS- for both the reminder and no-reminder groups. The post-hoc t-tests were used to examine whether there was significant fear recovery (compared to 0) in the reminder (t<sub>29</sub> = 0.797, P = 0.432, Fig. 1e) and no-reminder (t<sub>26</sub> = 7.441, P < 0.001; Fig. 1e) groups. We realize that the Bonferroni correction was not specified in the original manuscript and have added it in the revision where applicable.
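
      As an illustration only (this is not the authors' analysis code), the two indices defined above can be sketched as follows. The function names and the SCR values are hypothetical, and the Bonferroni helper assumes the simplest form of the correction, multiplying each P value by the number of tests.

```python
def fear_recovery_index(last_extinction_scr, first_test_scr):
    """SCR on the first test trial minus SCR on the last extinction trial, for one CS."""
    return first_test_scr - last_extinction_scr


def differential_fri(cs_plus, cs_minus):
    """Differential index: FRI of the CS+ minus FRI of the CS-.

    Each argument is a (last_extinction_scr, first_test_scr) pair.
    """
    return fear_recovery_index(*cs_plus) - fear_recovery_index(*cs_minus)


def bonferroni(p_values):
    """Bonferroni-adjusted P values, capped at 1.0 (assumed form of the correction)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical SCR values (in microsiemens) for a single participant.
example = differential_fri(cs_plus=(0.10, 0.40), cs_minus=(0.10, 0.15))
print(round(example, 3))
```

      A positive differential index means that responding recovered more for the CS+ than for the CS-; the post-hoc t-tests described above compare the group mean of this quantity against 0.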

      15D. "Finally, there is no statistical difference between the differential fear recovery indexes between CS+ in the reminder and no reminder groups (t55 = -2.022, P = 0.048; Fig. 1c, also see Supplemental Material for direct test for the test phase)."

      ***Is this statement correct - i.e., that there is no statistically significant difference in fear recovery to the CS+ in the reminder and no reminder groups? I'm sure that the authors would like to claim that there IS such a difference; but if such a difference is claimed, one would be concerned by the fact that it is coming through in an uncorrected t-test, which is the third one of its kind in this paragraph. What correction (for the Type 1 error rate) is used to account for the fact that the t-tests are applied post-hoc? And if no correction, why not?

      We are sorry about the typo. The reviewer was correct that we meant to claim here that “… there is a significant difference between the differential fear recovery indexes between CS+ in the reminder and no-reminder groups (t<sub>55</sub> = -2.022, P = 0.048; Fig. 1e)”. Note that the t-test performed here was a confirmatory test following our two-way ANOVA with main effects of group (reminder vs. no-reminder) and time (last extinction trial vs. first test trial) on the differential CS SCR response (CS+ minus CS-), in which we found a significant group x time interaction effect (F<sub>1,55</sub> = 4.087, P = 0.048, η<sup>2</sup> = 0.069). The significant difference between the differential fear recovery indexes is simply a restatement of the interaction effect mentioned above, and therefore no correction for multiple comparisons is needed. We have reorganized the sequence of the sentences such that this t-test now directly follows the results of the ANOVA:

      “The interaction effect was confirmed by the significant difference between the differential fear recovery indexes between CS1+ and CS2+ in the reminder and no-reminder groups (t<sub>55</sub> = -2.022, P = 0.048; Figure 1E, also see Supplemental Material for the direct test of the test phase).”
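
      The claim that the t-test restates the interaction can be checked arithmetically. In a design with two groups and two time points, the interaction F equals the square of the pooled-variance independent-samples t computed on the within-subject difference scores. The short sketch below uses only the statistics reported above (t = -2.022, F = 4.087, error df = 55); the helper name is ours, and it assumes the reported η<sup>2</sup> is the partial eta squared.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


t_reported = -2.022   # between-group t on the differential fear recovery index
f_reported = 4.087    # group x time interaction from the two-way ANOVA

# Squaring t should reproduce F up to rounding of the reported values.
print(round(t_reported ** 2, 3))                            # 4.088
print(round(partial_eta_squared(f_reported, 1, 55), 3))     # 0.069
```

      The squared t (4.088) matches the reported F (4.087) within rounding error, and the recovered effect size matches the reported value of 0.069, which is consistent with the two tests being one and the same comparison.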

      15E. In study 2, why is responding to the CS- so high on the first test trial in Group 30 min? Is the change in responding to the CS- from the last extinction trial to the first test trial different across the three groups in this study? Inspection of the figure suggests that it is higher in Group 30 min relative to Groups 6 hours and 24 hours. If this is confirmed by the analysis, it has implications for the fear recovery index which is partly based on responses to the CS-. If not for differences in the CS- responses, Groups 30 minutes and 6 hours are otherwise identical.

      Following the reviewer’s comments, we went back and calculated the mean SCR difference of CS- between the first test trial and the last extinction trial for all three studies (see Author response image 1 below). In study 1, there was no difference in the mean CS- SCR (between the first test trial and last extinction trial) between the reminder and no-reminder groups (Kruskal-Wallis test, panel a), though both groups showed significant fear recovery even in the CS- condition (Wilcoxon signed rank test, reminder: P = 0.0043, no-reminder: P = 0.0037). Next, we examined the mean SCR for CS- for the 30min, 6h and 24h groups in study 2 and found that there was indeed a group difference (one-way ANOVA, F<sub>2,76</sub> = 5.3462, P = 0.0067, panel b), suggesting that the CS- related SCR was influenced by the test time (30min, 6h or 24h). We also tested the CS- related SCR for the 4 groups in study 3 (where the test was conducted 1 hour after the retrieval-extinction training) and found that, across TMS stimulation types (PFC vs. VER) and reminder types (reminder vs. no-reminder), the ANOVA yielded neither a main effect of TMS stimulation type (F<sub>1,71</sub> = 0.322, P = 0.572) nor a main effect of reminder type (F<sub>1,71</sub> = 0.0499, P = 0.824, panel c). We added the R-VER group results in study 3 (see panel c) to panel b, plotted the CS- SCR difference across 4 different test time points, and found that the CS- SCR decreased as the test-extinction delay increased (Jonckheere-Terpstra test, P = 0.00028). These results suggest a natural “forgetting” tendency for CS- related SCR and highlight the importance of having the CS- as a control condition against which the CS+ related SCR is compared.

      Author response image 1.

      15F. Was the 6-hour group tested at a different time of day compared to the 30-minute and 24-hour groups; and could this have influenced the SCRs in this group?

      For the 30min and 24h groups, the test phase could be arranged in the morning, in the afternoon or at night. However, for the 6h group, the test phase was inevitably in the afternoon or at night, since we wanted to exclude the potential influence of night sleep on the expression of fear memory (see Author response table 1 below). If we had restricted the test time to the afternoon or night for all three groups, the timing of their extinction training would not have been matched.

      Author response table 1.

      Nevertheless, we also went back and examined the data for only those subjects in the 30min and 24h groups who were tested in the afternoon or at night, to match the 6h group, where all the subjects were tested either in the afternoon or at night. According to Author response table 1 above, we have 17 subjects for the 30min group (9 + 8), 18 subjects for the 24h group (9 + 9) and 26 subjects for the 6h group (12 + 14). As Author response image 2 shows, the SCR patterns in the fear acquisition, extinction and test phases were similar to the results presented in the original figure.

      Author response image 2.

      15G. Why is the range of scores in "thought control ability" different in the 30-minute group compared to the 6-hour and 24-hour groups? I am not just asking about the scale on the x-axis: I am asking why the actual distribution of the scores in thought control ability is wider for the 30-minute group?

      We went back and tested whether the TCAQ score variance was the same across the three groups. We found that there was a significant difference in the variance of the TCAQ score distribution across the three groups (F<sub>2,155</sub> = 4.324, P = 0.015, Levene test). However, post-hoc analyses found that the variance of TCAQ was not significantly different between the 30min and 6h groups (F<sub>26,25</sub> = 0.4788, P = 0.0697), nor between the 30min and 24h groups (F<sub>26,25</sub> = 0.4692, P = 0.0625). To further validate our correlational results between the TCAQ score and the fear recovery index, we removed from the 30min group the TCAQ scores that were outside the TCAQ score range of the 6h & 24h groups (4 “outlier” TCAQ scores in the 30min group, panel a in Author response image 3 below), and the Levene test confirmed that the variance of the TCAQ scores showed no difference across groups after removing these 4 “outlier” data points (F<sub>2,147</sub> = 0.74028, P = 0.4788). Even with the 4 “outliers” removed from the 30min group, the correlational analysis of the TCAQ scores and the fear recovery index still yielded a significant result in the 30min group (beta = -0.0148, t = -3.731, P = 0.0006, see panel b below), indicating that our results were not likely due to the inclusion of subjects with extreme TCAQ scores.
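
      The two steps described above (restricting one group's scores to the range spanned by the other groups, then re-testing homogeneity of variance) can be sketched as follows. This is a hedged illustration rather than the analysis actually run: the function names and scores are hypothetical, and the statistic uses the original mean-centered form of Levene's test, whereas the response does not state which variant was used.

```python
from statistics import mean


def trim_to_reference_range(scores, reference_groups):
    """Keep only the scores that lie inside the pooled min-max range of the reference groups."""
    pooled = [x for group in reference_groups for x in group]
    low, high = min(pooled), max(pooled)
    return [x for x in scores if low <= x <= high]


def levene_w(groups):
    """Levene's W statistic based on absolute deviations from each group mean."""
    deviations = []
    for group in groups:
        center = mean(group)
        deviations.append([abs(x - center) for x in group])
    n_total = sum(len(zs) for zs in deviations)
    k = len(deviations)
    grand_mean = mean([z for zs in deviations for z in zs])
    between = sum(len(zs) * (mean(zs) - grand_mean) ** 2 for zs in deviations)
    within = sum((z - mean(zs)) ** 2 for zs in deviations for z in zs)
    return (n_total - k) / (k - 1) * between / within


# Hypothetical questionnaire scores for three groups.
group_30min = [40, 55, 62, 70, 78, 95]
group_6h = [58, 63, 66, 71, 74]
group_24h = [57, 61, 69, 72, 76]

trimmed_30min = trim_to_reference_range(group_30min, [group_6h, group_24h])
print(levene_w([group_30min, group_6h, group_24h]))
print(levene_w([trimmed_30min, group_6h, group_24h]))
```

      W is referred to an F distribution with (k - 1, N - k) degrees of freedom, which is why the degrees of freedom change once data points are removed.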

      Author response image 3.

      (16) During testing in each experiment, how were the various stimuli presented? That is, was the presentation order for the CS+ and CS- pseudorandom according to some constraint, as it had been in extinction? This information should be added to the method section.

      We described the order of the stimuli in the testing phase in the methods section: “… For studies 2 & 3, …a pseudo-random stimulus order was generated for fear acquisition and extinction phases of three groups with the rule that no same trial-type (CS1+, CS2+ and CS-) repeated more than twice. In the test phase, to exclude the possibility that the difference between CS1+ and CS2+ was simply caused by the presentation sequence of CS1+ and CS2+, half of the participants completed the test phase using a pseudo-random stimuli sequence and the identities of CS1+ and CS2+ reversed in the other half of the participants.”
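
      One simple way to generate a sequence that satisfies the stated rule (no trial type repeated more than twice in a row) is rejection sampling: shuffle the trial list and keep the first shuffle that obeys the constraint. The sketch below is an assumption about how such an order could be produced, not a description of the software used in the studies; the trial counts and function names are hypothetical.

```python
import random
from collections import Counter


def max_run_length(sequence):
    """Length of the longest run of identical consecutive items."""
    longest = 0
    run = 0
    previous = object()  # sentinel that never equals a trial label
    for item in sequence:
        run = run + 1 if item == previous else 1
        previous = item
        longest = max(longest, run)
    return longest


def pseudo_random_order(counts, max_repeat=2, rng=None, max_tries=10000):
    """Shuffle until no trial type occurs more than max_repeat times in a row."""
    if rng is None:
        rng = random.Random()
    trials = [label for label, n in counts.items() for _ in range(n)]
    for _ in range(max_tries):
        rng.shuffle(trials)
        if max_run_length(trials) <= max_repeat:
            return list(trials)
    raise RuntimeError("no valid order found; relax the constraint")


def swap_cs_identities(sequence):
    """Reverse the roles of CS1+ and CS2+, as done for half of the participants."""
    mapping = {"CS1+": "CS2+", "CS2+": "CS1+"}
    return [mapping.get(label, label) for label in sequence]


# Hypothetical trial counts; a fixed seed makes the order reproducible.
order = pseudo_random_order({"CS1+": 8, "CS2+": 8, "CS-": 8}, rng=random.Random(0))
print(order)
print(swap_cs_identities(order))
```

      Swapping the two CS+ labels preserves the run-length constraint, so the counterbalanced sequence needs no further checking.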

      (17) "These results are consistent with previous research which suggested that people with better capability to resist intrusive thoughts also performed better in motivated dementia in both declarative and associative memories."

      ***Which parts of the present results are consistent with such prior results? It is not clear from the descriptions provided here why thought control ability should be related to the present findings or, indeed, past ones in other domains. This should be elaborated to make the connections clear.

      In the 30min group, we found that subjects’ TCAQ scores were negatively correlated with their fear recovery indices. That is, people with better capacity to resist intrusive thoughts were also less likely to experience the return of fear memory, which is consistent with previous results. Together with our brain stimulation results, this suggests that the short-term amnesia is related to subjects’ cognitive control ability and intact dlPFC function. It is because of these similarities that we propose that the short-term amnesia might be related to the automatic memory suppression mechanism described in declarative memory research. Since we have not provided all the evidence at this point of the results section, we briefly listed the connections with previous declarative and associative memory research.

      Reviewer #2 (Public Review):

      The fear acquisition data is converted to a differential fear SCR and this is what is analysed (early vs late). However, the figure shows the raw SCR values for CS+ and CS- and therefore it is unclear whether the acquisition was successful (despite there being an "early" vs "late" effect - no descriptives are provided).

      As the reviewer mentioned, the fear acquisition data were converted to a differential fear SCR and we conducted a two-way mixed ANOVA of group (reminder vs. no-reminder) x time (early vs. late part of fear acquisition) on the differential SCRs. We found a significant main effect of time (early vs. late; F<sub>1,55</sub> = 6.545, P = 0.013, η<sup>2</sup> = 0.106), suggesting successful fear acquisition in both groups. Fig. 1c also shows the mean differential SCR for the latter half of the acquisition phase in both the reminder and no-reminder groups, and there was no significant difference in acquired SCRs between groups (early acquisition: t<sub>55</sub> = -0.063, P = 0.950; late acquisition: t<sub>55</sub> = -0.318, P = 0.751; Fig. 1c).

      In Experiment 1 (Test results) it is unclear whether the main conclusion stems from a comparison of the test data relative to the last extinction trial ("we defined the fear recovery index as the SCR difference between the first test trial and the last extinction trial for a specific CS") or the difference relative to the CS- ("differential fear recovery index between CS+ and CS-"). It would help the reader assess the data if Figure 1e presents all the indexes (both CS+ and CS-). In addition, there is one sentence that I could not understand "there is no statistical difference between the differential fear recovery indexes between CS+ in the reminder and no reminder groups (P=0.048)". The p-value suggests that there is a difference, yet it is not clear what is being compared here. Critically, any index taken as a difference relative to the CS- can indicate recovery of fear to the CS+ or absence of discrimination relative to the CS-, so ideally the authors would want to directly compare responses to the CS+ in the reminder and no-reminder groups. The latter issue is particularly relevant in Experiment 2, in which the CS- seems to vary between groups during the test and this can obscure the interpretation of the result.

      In all the experiments, the fear recovery index (FRI) was defined as the SCR difference between the first test trial and the last extinction trial for any CS. Subsequently, the differential FRI was defined as the difference between the FRI of a specific CS+ and the FRI of the CS-. The differential FRI effectively removes the non-specific time-related effect (using the CS- FRI as the baseline). We have revised the text accordingly.

      As we responded to reviewer #1, the CS- fear recovery indices (FRI) for the reminder and no-reminder groups were not statistically different (Kruskal-Wallis test, panel a, Author response image 1), though both groups showed significant fear recovery even in the CS- condition (Wilcoxon signed rank test, reminder: P = 0.0043, no-reminder: P = 0.0037, panel a). Next, we examined the mean SCR for CS- for the 30min, 6h and 24h groups in study 2 and found that there was indeed a group difference (one-way ANOVA, F<sub>2,76</sub> = 5.3462, P = 0.0067, panel b), suggesting that the CS- SCR was influenced by the test time delay. We also tested the CS- SCR for the 4 groups in study 3 and found that, across TMS stimulation types (PFC vs. VER) and reminder types (reminder vs. no-reminder), the ANOVA yielded neither a main effect of TMS stimulation type (F<sub>1,71</sub> = 0.322, P = 0.572) nor a main effect of reminder type (F<sub>1,71</sub> = 0.0499, P = 0.824, panel c). We added the R-VER group results in study 3 (see panel c) to panel b, plotted the CS- SCR difference across 4 different test time points, and found that the CS- SCR decreased as the test-extinction delay increased (Jonckheere-Terpstra test, P = 0.00028). These results suggest a natural “forgetting” tendency for the CS- fear recovery index and highlight the importance of having the CS- as a control condition against which the CS+ recovery index is compared (resulting in the differential recovery index). Parametric and non-parametric analyses were adopted based on whether the data met the assumptions of the parametric analyses.

      In Experiment 1, the findings suggest that there is a benefit of retrieval followed by extinction in a short-term reinstatement test. In Experiment 2, the same effect is observed on a cue that did not undergo retrieval before extinction (CS2+), a result that is interpreted as resulting from cue-independence, rather than a failure to replicate in a within-subjects design the observations of Experiment 1 (between-subjects). Although retrieval-induced forgetting is cue-independent (the effect on items that are suppressed [Rp-] can be observed with an independent probe), it is not clear that the current findings are similar. Here, both cues have been extinguished and therefore been equally exposed during the critical stage.

      We appreciate the reviewer’s insight on this issue. Although in the discussion we raised the possibility of memory suppression to account for the short-term amnesia effect, we did not intend to compare our paradigm side-by-side with retrieval-induced forgetting. In our previous work (Wang et al., 2021), we reported that the active suppression effect on CS+ related fear memory during standard extinction training generalized to other CS+, yielding a cue-independent effect. In the current experiments, we did not implement active suppression; instead, we used the CS+ retrieval-extinction paradigm. It is thus possible that the CS+ retrieval cue may function to facilitate automatic suppression. Indeed, in the no-reminder group (standard extinction) of study 1, we did observe the return of fear expression, suggesting a critical role for the CS+ reminder before the extinction training. Based on the results mentioned above, we believe our short-term amnesia results are consistent with the hypothesis that the retrieval CS+ (reminder) might prompt subjects to adopt an automatic suppression mechanism in the following extinction training, yielding cue-independent amnesia effects.

      The findings in Experiment 2 suggest that the amnesia reported in Experiment 1 is transient, in that no effect is observed when the test is delayed by 6 hours. The phenomena whereby reactivated memories transition to extinguished memories as a function of the amount of exposure (or number of trials) is completely different from the phenomena observed here. In the former, the manipulation has to do with the number of trials (or the total amount of time) that the cues are exposed to. In the current study, the authors did not manipulate the number of trials but instead the retention interval between extinction and test. The finding reported here is closer to a "Kamin effect", that is the forgetting of learned information which is observed with intervals of intermediate length (Baum, 1968). Because the Kamin effect has been inferred to result from retrieval failure, it is unclear how this can be explained here. There needs to be much more clarity on the explanations to substantiate the conclusions.

      Indeed, in our studies, we did not manipulate the amount of exposure (or number of trials) but only the retention interval between extinction and test. Our results demonstrated that the retrieval-extinction protocol yielded short-term amnesia for fear memory, qualitatively different from the reconsolidation-related amnesia proposed in the previous literature. After examining the temporal dynamics, cue-specificity and TCAQ association of the short-term amnesia, we speculated that the short-term effect might be related to an automatic suppression mechanism. Of course, further studies will be required to test such a hypothesis.

      Our results might not be easily compared with the “Kamin effect”, a term coined to describe the “retention of a partially learned avoidance response over varying time intervals” using a learning-re-learning paradigm (Baum, 1968, Kamin, 1957). The retrieval-extinction procedure used in our studies differed from both the learning-re-learning paradigm in the original paper (Kamin, 1957) and the reversal-learning paradigm the reviewer mentioned (Baum, 1968).

      There are many results (Ryan et al., 2015) that challenge the framework that the authors base their predictions on (consolidation and reconsolidation theory), therefore these need to be acknowledged. Similarly, there are reports that failed to observe the retrieval-extinction phenomenon (Chalkia et al., 2020), and the work presented here is written as if the phenomenon under consideration is robust and replicable. This needs to be acknowledged.

      We thank the reviewer for pointing out the related literature and have added a separate paragraph about other results in the discussion (as well as citing relevant references in the introduction) to provide a full picture of the reconsolidation theory to the audience:

      “It should be noted that while our long-term amnesia results were consistent with the fear memory reconsolidation literatures, there were also studies that failed to observe fear prevention (Chalkia, Schroyens, et al., 2020; Chalkia, Van Oudenhove, et al., 2020; Schroyens et al., 2023). Although the memory reconsolidation framework provides a viable explanation for the long-term amnesia, more evidence is required to validate the presence of reconsolidation, especially at the neurobiological level (Elsey et al., 2018). While it is beyond the scope of the current study to discuss the discrepancies between these studies, one possibility to reconcile these results concerns the procedure for the retrieval-extinction training. It has been shown that the eligibility for old memory to be updated is contingent on whether the old memory and new observations can be inferred to have been generated by the same latent cause (Gershman et al., 2017; Gershman and Niv, 2012). For example, prevention of the return of fear memory can be achieved through gradual extinction paradigm, which is thought to reduce the size of prediction errors to inhibit the formation of new latent causes (Gershman, Jones, et al., 2013). Therefore, the effectiveness of the retrieval-extinction paradigm might depend on the reliability of such paradigm in inferring the same underlying latent cause. Furthermore, other studies highlighted the importance of memory storage per se and suggested that memory retention was encoded in the memory engram cell ensemble connectivity whereas the engram cell synaptic plasticity is crucial for memory retrieval (Ryan et al., 2015; Tonegawa, Liu, et al., 2015; Tonegawa, Pignatelli, et al., 2015). It remains to be tested how the cue-independent short-term and cue-dependent long-term amnesia effects we observed could correspond to the engram cell synaptic plasticity and functional connectivity among engram cell ensembles (Figure 6). This is particularly important, since the cue-independent characteristic of the short-term amnesia suggest that either different memory cues fail to evoke engram cell activities, or the retrieval-extinction training transiently inhibits connectivity among engram cell ensembles. Finally, SCR is only one aspect of the fear expression, how the retrieval-extinction paradigm might affect subjects’ other emotional (such as the startle response) and cognitive fear expressions such as reported fear expectancy needs to be tested in future studies since they do not always align with each other (Kindt et al., 2009; Sevenster et al., 2012, 2013).”

      The parallels between the current findings and the memory suppression literature are speculated in the general discussion, and there is the conclusion that "the retrieval-extinction procedure might facilitate a spontaneous memory suppression process". Because one of the basic tenets of the memory suppression literature is that it reflects an "active suppression" process, there is no reason to believe that in the current paradigm, the same phenomenon is in place, but instead, it is "automatic". In other words, the conclusions make strong parallels with the memory suppression (and cognitive control) literature, yet the phenomena that they observed are thought to be passive (or spontaneous/automatic).

      Ultimately, it is unclear why 10 mins between the reminder and extinction learning will "automatically" suppress fear memories. Further down in the discussion, it is argued that "For example, in the well-known retrieval-induced forgetting (RIF) phenomenon, the recall of a stored memory can impair the retention of related long-term memory and this forgetting effect emerges as early as 20 minutes after the retrieval procedure, suggesting memory suppression or inhibition can occur in a more spontaneous and automatic manner". I did not follow with the time delay between manipulation and test (20 mins) would speak about whether the process is controlled or automatic.

      In our previous research, we showed that the memory suppression instruction together with the extinction procedure successfully prevented the return of fear expression in the reinstatement test trials 30 min after the extinction training (Wang et al., 2021). In the current experiments, we replaced the suppression instruction with the retrieval cue before the extinction training (retrieval-extinction protocol) and observed similar short-term amnesia effects. These results prompted us to hypothesize in the discussion that the retrieval cue might facilitate an automatic suppression process. We made the analogy to the RIF phenomenon in the discussion to suggest that the suppression of (competing) memories could be unintentional and fast (20 min), both of which are consistent with our results. We agree with the reviewer that this hypothesis is more of a speculation (hence its place in the discussion), and more studies are required to test it further. However, what we want to emphasize in this paper is the report of the short-term amnesia effects, which clearly differed from the memory reconsolidation effect in a variety of aspects.

      Among the many conclusions, one is that the current study uncovers the "mechanism" underlying the short-term effects of retrieval extinction. There is little in the current report that uncovers the mechanism, even in the most psychological sense of the mechanism, so this needs to be clarified. The same applies to the use of "adaptive".

      Whilst I could access the data on the OFS site, I could not make sense of the Matlab files as there is no signposting indicating what data is being shown in the files. Thus, as it stands, there is no way of independently replicating the analyses reported.

      We have re-organized the data on the OSF site, and they should be accessible now.

      The supplemental material shows figures with all participants, but only some statistical analyses are provided, and sometimes these are different from those reported in the main manuscript. For example, the test data in Experiment 1 is analysed with a two-way ANOVA with the main effects of group (reminder vs no-reminder) and time (last trial of extinction vs first trial of the test) in the main report. The analyses with all participants in the sup mat used a mixed two-way ANOVA with a group (reminder vs no reminder) and CS (CS+ vs CS-). This makes it difficult to assess the robustness of the results when including all participants. In addition, in the supplementary materials, there are no figures and analyses for Experiment 3.

      We are sorry for the lack of clarity in the supplementary materials. Supplementary figures Fig. S1 & S2 show the data re-analysis with all the responders (learners + non-learners). The statistical analyses performed on the responders in both figures yielded results similar to those in the main text. For the other analyses reported in the supplementary materials, we deliberately provided different analyses to demonstrate the robustness of our results. For example, to rule out the possibility that the effects we observed in the two-way ANOVA in the main text were driven by different SCR responses on the last extinction trial, we ran the two-way ANOVA on the first-trial SCR of the test phase only, and these analyses provided similar results. Please note that we did not include non-learners in these analyses (the text of the supplementary materials).

      Since we did not exclude any non-learners in study 3, all the results were already reported in the main text.

      One of the overarching conclusions is that the "mechanisms" underlying reconsolidation (long term) and memory suppression (short term) phenomena are distinct, but memory suppression phenomena can also be observed after a 7-day retention interval (Storm et al., 2012), which then questions the conclusions achieved by the current study.

      As we stated before, the focus of the manuscript was to demonstrate a novel short-term fear amnesia effect following the retrieval-extinction procedure. We discussed memory suppression as one of the potential mechanisms for such a short-term effect. In fact, the durability of the memory suppression effect is still under debate. Although Storm et al. (2012) suggested that retrieval-induced forgetting can persist for as long as a week, other studies failed to observe long-term forgetting (after 24 hrs; Carroll et al., 2007, Chan, 2009). It is also worth noting that Storm et al. (2012) tested RIF one week later using half of the items, the other half of which had been tested 5 minutes after the retrieval practice. Therefore, it can be argued that the long-term RIF effect may be contaminated by the test/re-test process on the same set of (albeit different) items at different time onsets (5 min & 1 week).

      Reviewer #3 (Public Review):

      (1) The entire study hinges on the idea that there is memory 'suppression' if (1) the CS+ was reminded before extinction and (2) the reinstatement and memory test takes place 30 minutes later (in Studies 1 & 2). However, the evidence supporting this suppression idea is not very strong. In brief, in Study 1, the effect seems to only just reach significance, with a medium effect size at best, and, moreover, it is unclear if this is the correct analysis (which is a bit doubtful, when looking at Figure 1D and E). In Study 2, there was no optimal control condition without reminder and with the same 30-min interval (which is problematic, because we can assume generalization between CS1+ and CS2+, as pointed out by the authors, and because generalization effects are known to be time-dependent). Study 3 is more convincing, but entails additional changes in comparison with Studies 1 and 2, i.e., applications of cTBS and an interval of 1 hour instead of 30 minutes (the reason for this change was not explained). So, although the findings of the 3 studies do not contradict each other and are coherent, they do not all provide strong evidence for the effect of interest on their own.

      Related to the comment above, I encourage the authors to double-check if this statement is correct: "Also, our results remain robust even with the "non-learners" included in the analysis (Fig. S1 in the Supplemental Material)". The critical analysis for Study 1 is a between-group comparison of the CS+ and CS- during the last extinction trial versus the first test trial. This result only just reached significance with the selected sample (p = .048), and Figures 1D and E even seem to suggest otherwise. I doubt that the analysis would reach significance when including the "non-learners" - assuming that this is what is shown in Supplemental Figure 1 (which shows the data from "all responded participants").

      Our subjects were categorized based on the criteria specified in Supplementary Table S1. More specifically, we excluded the non-responders (mean CS SCR < 0.02 µS in the fear acquisition phase) and the non-learners, and focused our analyses on the learners. Non-responders were dismissed after Day 1 (the day of fear acquisition), but both learners and non-learners completed the experiments. This gave us the opportunity to examine data for both the learners and the responders (learners + non-learners). Fig. 1D and E show the differential SCRs (CS+ minus CS-) of the last extinction trials and the differential fear recovery indices (CS+ minus CS-), respectively. We have double-checked the figures: both the learner (Fig. 1) and the responder (i.e., learners and non-learners; Supplementary Fig. 1) results showed significant differences between the reminder and no-reminder groups on the differential fear recovery index.
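
      The non-responder criterion stated above can be written as a short check. The following is a minimal Python sketch, not the authors' analysis code: the function name and the input layout (a list of acquisition-phase SCRs to the CSs, in microsiemens) are hypothetical, and exactly which trials enter the mean is an assumption. The learner/non-learner criterion is given only in Table S1 and is deliberately left out.

```python
from statistics import mean

# Threshold quoted in the response: mean CS SCR < 0.02 microsiemens
# during fear acquisition marks a non-responder.
NON_RESPONDER_THRESHOLD_US = 0.02


def is_non_responder(acquisition_cs_scrs, threshold=NON_RESPONDER_THRESHOLD_US):
    """Return True if the mean acquisition-phase SCR to the CSs is below threshold.

    `acquisition_cs_scrs` is a hypothetical input: one SCR value (in
    microsiemens) per CS trial of the acquisition phase for one participant.
    """
    if not acquisition_cs_scrs:
        raise ValueError("need at least one acquisition trial")
    return mean(acquisition_cs_scrs) < threshold


# Made-up example values, not data from the study.
weak_participant = [0.0, 0.01, 0.02]
strong_participant = [0.1, 0.3, 0.2]
print(is_non_responder(weak_participant))
print(is_non_responder(strong_participant))
```

      Participants flagged by this check would correspond to those dismissed after Day 1; everyone else counts as a responder, whether learner or non-learner.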

      Also related to the comment above, I think that the statement "suggesting a cue-independent short-term amnesia effect" in Study 2 is not correct and should read: "suggesting extinction of fear to the CS1+ and CS2+", given that the response to the CS+'s is similar to the response to the CS-, as was the case at the end of extinction. Also the next statement "This result indicates that the short-term amnesia effect observed in Study 2 is not reminder-cue specific and can generalize to the non-reminded cues" is not fully supported by the data, given the lack of an appropriate control group in this study (a group without reinstatement). The comparison with the effect found in Study 1 is difficult because the effect found there was relatively small (and may have to be double-checked, see remarks above), and it was obtained with a different procedure using a single CS+. The comparison with the 6-h and 24-h groups of Study 2 is not helpful as a control condition for this specific question (i.e., is there reinstatement of fear for any of the CS+'s) because of the large procedural difference with regard to the intervals between extinction and reinstatement (test).

      In Fig. 2e, we showed the differential fear recovery indices (FRI) for the CS+ in all three groups. Since the FRI was calculated as the SCR difference between the first test trial and the last extinction trial for any CS, a differential FRI (the difference between CS+ FRI and CS- FRI) that is not significantly different from 0 should be interpreted as a lack of fear expression in the test phase. Since spontaneous recovery, reinstatement and renewal are considered canonical phenomena demonstrating that extinction training does not really “erase” the conditioned fear response, a no-reinstatement control group would effectively be a spontaneous recovery group, and the comparison between the reinstatement and no-reinstatement groups would become a test of the difference in fear recovery between two methods (reinstatement vs. spontaneous recovery).

      (2) It is unclear which analysis is presented in Figure 3. According to the main text, it either shows the "differential fear recovery index between CS+ and CS-" or "the fear recovery index of both CS1+ and CS2+". The authors should clarify what they are analyzing and showing, and clarify to which analyses the ** and NS refer in the graphs. I would also prefer the X-axes and particularly the Y-axes of Fig. 3a-b-c to be the same. The image is a bit misleading now. The same remarks apply to Figure 5.

      We are sorry about the lack of clarity here. Figures 3 & 5 show the correlational analyses between TCAQ scores and the differential fear recovery index (FRI) between CS+ and CS-, that is, the differential FRI of CS1+ (CS1+ FRI minus CS- FRI) and the differential FRI of CS2+ (CS2+ FRI minus CS- FRI).

      We have rescaled both X and Y axes for figures 3 & 5 (please see the revised figures). 

      (3) In general, I think the paper would benefit from being more careful and nuanced in how the literature and findings are represented. First of all, the authors may be more careful when using the term 'reconsolidation'. In the current version, it is put forward as an established and clearly delineated concept, but that is not the case. It would be useful if the authors could change the text in order to make it clear that the reconsolidation framework is a theory, rather than something that is set in stone (see e.g., Elsey et al., 2018 (https://doi.org/10.1037/bul0000152), Schroyens et al., 2022 (https://doi.org/10.3758/s13423-022-02173-2)).

      In addition, the authors may want to reconsider if they want to cite Schiller et al., 2010 (https://doi.org/10.1038/nature08637), given that neither the main findings of this paper nor the analyses could be replicated (see Chalkia et al., 2020; https://doi.org/10.1016/j.cortex.2020.04.017; https://doi.org/10.1016/j.cortex.2020.03.031).

      We thank the reviewer for these comments and have incorporated the mentioned papers into our revised manuscript, pointing out the extant debate surrounding the reconsolidation theory in the introduction:

      “Pharmacological blockade of protein synthesis and behavioral interventions can both eliminate the original fear memory expression in the long-term (24 hours later) memory test (Lee, 2008; Lee et al., 2017; Schiller et al., 2013; Schiller et al., 2010), resulting in the cue-specific fear memory deficit (Debiec et al., 2002; Lee, 2008; Nader, Schafe, & LeDoux, 2000). For example, during the reconsolidation window, retrieving a fear memory allows it to be updated through extinction training (i.e., the retrieval-extinction paradigm; Lee, 2008; Lee et al., 2017; Schiller et al., 2013; Schiller et al., 2010; but also see Chalkia, Schroyens, et al., 2020; Chalkia, Van Oudenhove, et al., 2020; D. Schiller, LeDoux, & Phelps, 2020).”

      As well as in the discussion:

      “It should be noted that while our long-term amnesia results were consistent with the fear memory reconsolidation literature, there were also studies that failed to observe fear prevention (Chalkia, Schroyens, et al., 2020; Chalkia, Van Oudenhove, et al., 2020; Schroyens et al., 2023). Although the memory reconsolidation framework provides a viable explanation for the long-term amnesia, more evidence is required to validate the presence of reconsolidation, especially at the neurobiological level (Elsey et al., 2018). While it is beyond the scope of the current study to discuss the discrepancies between these studies, one possibility to reconcile these results concerns the procedure for the retrieval-extinction training. It has been shown that the eligibility for old memory to be updated is contingent on whether the old memory and new observations can be inferred to have been generated by the same latent cause (Gershman et al., 2017; Gershman and Niv, 2012). For example, prevention of the return of fear memory can be achieved through a gradual extinction paradigm, which is thought to reduce the size of prediction errors to inhibit the formation of new latent causes (Gershman, Jones, et al., 2013). Therefore, the effectiveness of the retrieval-extinction paradigm might depend on the reliability of such a paradigm in inferring the same underlying latent cause. Furthermore, other studies highlighted the importance of memory storage per se and suggested that memory retention was encoded in the memory engram cell ensemble connectivity whereas the engram cell synaptic plasticity is crucial for memory retrieval (Ryan et al., 2015; Tonegawa, Liu, et al., 2015; Tonegawa, Pignatelli, et al., 2015). It remains to be tested how the cue-independent short-term and cue-dependent long-term amnesia effects we observed could correspond to the engram cell synaptic plasticity and functional connectivity among engram cell ensembles (Figure 6).
This is particularly important, since the cue-independent characteristic of the short-term amnesia suggests that either different memory cues fail to evoke engram cell activities, or the retrieval-extinction training transiently inhibits connectivity among engram cell ensembles. Finally, SCR is only one aspect of fear expression; how the retrieval-extinction paradigm might affect subjects’ other emotional (such as the startle response) and cognitive (such as reported fear expectancy) fear expressions needs to be tested in future studies, since they do not always align with each other (Kindt et al., 2009; Sevenster et al., 2012, 2013).”

      Relatedly, it should be clarified that Figure 6 is largely speculative, rather than a proven model as it is currently presented. This is true for all panels, but particularly for panel c, given that the current study does not provide any evidence regarding the proposed reconsolidation mechanism.

      We agree with the reviewer that Figure 6 is largely speculative. We realize that there are still debates regarding the retrieval-extinction procedure and the fear reconsolidation hypothesis. We have provided a more elaborate discussion and pointed out that Figure 6 is only a working hypothesis and that more work is needed to test it:

      “Although mixed results have been reported regarding the durability of suppression effects in the declarative memory studies (Meier et al., 2011; Storm et al., 2012), future research will be needed to investigate whether the short-term effect we observed is specifically related to associative memory or the spontaneous nature of suppression (Figure 6C).”

      Lastly, throughout the paper, the authors equate skin conductance responses (SCR) with fear memory. It should at least be acknowledged that SCR is just one aspect of a fear response, and that it is unclear whether any of this would translate to verbal or behavioral effects. Such effects would be particularly important for any clinical application, which the authors put forward as the ultimate goal of the research.

      Again, we agree with the reviewer on this issue, and we have acknowledged that SCR is only one aspect of the fear response and that caution should be exercised in clinical application:

      “Finally, SCR is only one aspect of fear expression; how the retrieval-extinction paradigm might affect subjects’ other emotional (such as the startle response) and cognitive (such as reported fear expectancy) fear expressions needs to be tested in future studies, since they do not always align with each other (Kindt et al., 2009; Sevenster et al., 2012, 2013).”

      (4) The Discussion quite narrowly focuses on a specific 'mechanism' that the authors have in mind. Although it is good that the Discussion is to the point, it may be worthwhile to entertain other options or (partial) explanations for the findings. For example, have the authors considered that there may be an important role for attention? When testing very soon after the extinction procedure (and thus after the reminder), attentional processes may play an important role (more so than with longer intervals). The retrieval procedure could perhaps induce heightened attention to the reminded CS+ (which could be further enhanced by dlPFC stimulation)?

      We thank the reviewer for this suggestion and have added more discussion on the potential mechanisms involved. Unfortunately, since the literature on attention and fear recovery is rather scarce, and our study design and results mainly concern subjects’ skin conductance responses (SCR), any account of attentional processes would be even more speculative.

      (5) There is room for improvement in terms of language, clarity of the writing, and (presentation of the) statistical analyses, for all of which I have provided detailed feedback in the 'Recommendations for the authors' section. Idem for the data availability; they are currently not publicly available, in contrast with what is stated in the paper. In addition, it would be helpful if the authors would provide additional explanation or justification for some of the methodological choices (e.g., the 18-s interval and why stimulate 8 minutes after the reminder cue, the choice of stimulation parameters), and comment on reasons for (and implications of) the large amount of excluded participants (>25%).

      We have addressed the data accessibility issue and added justifications for the methodological choices and for the excluded participants. As we mentioned in the manuscript and the supplementary materials, adding the non-learners to the data analysis did not change the results. Since the non-responders discontinued after Day 1 because their spontaneous SCR signals to the different CSs were not measurable, it is hard to speculate whether or how the results might have changed had they been included. We note, however, that participant exclusion rates in SCR studies are generally relatively high (Hu et al., 2018; Liu et al., 2014; Raio et al., 2017; Schiller et al., 2010; Schiller et al., 2012; Wang et al., 2021). In our tasks, the non-responders were mostly participants tested in the winter; cold weather and dry skin in the winter are likely to have made the SCR hard to measure (Bauer et al., 2022; Vila, 2004). Different intervals between the reinstating US (electric shock) and the test trials have been used in the previous literature, such as 10 min (Schiller et al., 2010; Schiller et al., 2013) and 18 or 19 s (Kindt and Soeter, 2018; Kindt et al., 2009; Wang et al., 2021). We used the 18-s reinstatement interval in the current experiment. For the cTBS, since the stimulation itself lasted less than 2 min, we started it 8 min after the onset of the reminder cue to ensure that any effect caused by the stimulation occurred during the hypothesized time window in which the old fear memory becomes labile after memory retrieval. All the stimulation parameters were determined based on previous literature, which showed that transcranial magnetic stimulation (TMS) over the human dorsolateral prefrontal cortex could disrupt fear memory reconsolidation (Borgomaneri et al., 2020; Su et al., 2022).

      Finally, I think several statements made in the paper are overly strong in light of the existing literature (or the evidence obtained here) or imply causal relationships that were not directly tested.

      We have revised the texts accordingly.

      Reviewer #2 (Recommendations For The Authors):

      On numerous occasions there are typos and the autocorrect has changed "amnesia" for "dementia".

      We are sorry about this mistake and have revised the text accordingly.

      Reviewer #3 (Recommendations For The Authors):

      *"Neither of the studies reported in this article was preregistered. The data for both studies are publicly accessible at https://osf.io/9agvk". This excerpt from the text suggests that there are 2 studies, but there are 3 in the paper. Also, the data are only accessible upon request, not publicly available. I haven't requested them, as this could de-anonymize me as a reviewer.

      We apologize for the problem with the link. The data should now be publicly available.

      *Please refrain from causal interpretations when they are not supported by the data:

      - Figure 3 "thought-control ability only affected fear recovery"; a correlation does not provide causal evidence.

      - "establishing a causal link between the dlPFC activity and short-term fear amnesia." I feel this statement is too strong; to what extent do we know for sure what the applied stimulation of (or more correct: near) the dlPFC does exactly?

      We thank the reviewer for the suggestion and have changed the wording related to Figure 3. On the other hand, we would argue that the causal relationship between dlPFC activity and short-term fear amnesia is supported by the results of Study 3. Although the exact functional effect of TMS on the dlPFC can be debated, the fact that TMS over the dlPFC (compared with the vertex group) brought back the otherwise diminished fear memory expression can be viewed as causal evidence linking dlPFC activity and short-term fear amnesia.

      *The text would benefit from language editing, as it contains spelling and grammar mistakes, as well as wording that is vague or inappropriate. I suggest the authors check the whole text, but below are already some excerpts that caught my eye:

      "preludes memory reconsolidation"; "old fear memory can be updated"; "would cause short-term memory deficit"; "the its functional coupling"; "Subjects (...) yielded more severe amnesia in the memory suppression tasks"; "memory retrieval might also precipitate a short-term amnesia effect"; "more SEVERE amnesia in the memory suppression tasks"; "the effect size of reinstatement effect"; "the previous literatures"; "towards different CS"; "failed to show SCR response to the any stimuli"; "significant effect of age of TMS"; "each subject' left hand"; "latter half trials"; "Differntial fear recovery"; "fear dementia"; "the fear reinstatement effects at different time scale is related to"; "fear reocery index"; "thought-control abiliites"; "performed better in motivated dementia"; "we tested that in addition to the memory retrieval cue (reminder), whether the"; "during reconsolidation window"; "consisitent with the short-term dementia"; "low level of shock (5v)"

      We thank the reviewer for the thorough reading and apologize for the typos in the manuscript. We have corrected all the typos and grammar mistakes we could find.

      *In line with the remark above, there are several places where the text could still be improved.

      - The last sentence of the Abstract is rather vague and doesn't really add anything.

      - Please reword or clarify: "the exact functional role played by the memory retrieval remains unclear".

      - Please reword or clarify: "the unbinding of the old memory trace".

      - "suggesting that the fear memory might be amenable to a more immediate effect, in addition to what the memory reconsolidation theory prescribes" shouldn't this rather read "in contrast with"?

      We have modified the manuscript.

      - In the Introduction, the authors state: "Specifically, memory reconsolidation effect will only be evident in the long-term (24h) memory test due to its requirement of new protein synthesis and is cue-dependent". They then continue about the more immediate memory update mechanisms that they want to study, but it is unclear from how the rationale is presented whether (and why (not)) they also expect this mechanism to be cue-dependent.

      Most of the previous studies on fear memory reconsolidation using the CS as the memory retrieval cue have demonstrated that the reconsolidation effect is cue-dependent (Kindt and Soeter, 2018; Kindt et al., 2009; Monfils et al., 2009; Nader et al., 2000; Schiller et al., 2013; Schiller et al., 2010; Xue et al., 2012). However, other studies using an unconditioned stimulus retrieval-extinction paradigm showed that such a protocol was able to prevent the return of fear memory expression associated with different CSs (Liu et al., 2014; Luo et al., 2015). In our task, we used the CS+ as the memory retrieval cue, and our results were consistent with those of previous studies using similar paradigms.

      - "The effects of cTBS over the right dlPFC after the memory reactivation were assessed using the similar mixed-effect four-way ANOVA". Please clarify what was analyzed here.

      - "designing novel treatment of psychiatric disorders". Please make this more concrete or remove the statement.

      This sentence came right after a similar analysis described in the previous paragraph. While the previous paragraph focused on how the SCRs in the acquisition phase were modulated by factors such as CS+ (CS1+ and CS2+), reminder (reminder vs. no-reminder), cTBS site (right dlPFC vs. vertex) and trial number, this analysis focused instead on the SCRs in the extinction training phase. We have made the modifications the reviewer suggested.

      *I have several concerns related to the (presentation) of the statistical analyses/results:

      - Some statistical analyses, as well as calculation of certain arbitrary indices (e.g., differential fear recovery index) are not mentioned nor explained in the Methods section, but only mentioned in the Results section.

      We have added the explanation of the differential fear recovery index into the methods section:

      “To measure the extent to which fear returns after the presentation of unconditioned stimuli (US, electric shock) in the test phase, we defined the fear recovery index as the SCR difference between the first test trial and the last extinction trial for a specific CS for each subject. Similarly, in studies 2 and 3, differential fear recovery index was defined as the difference between fear recovery indices of CS+ and CS- for both CS1+ and CS2+.”
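
      The two definitions quoted above can be illustrated with a few lines of arithmetic. The following is a minimal Python sketch under stated assumptions: the function names and the dictionary layout are hypothetical, and the SCR values are made up for illustration rather than taken from the study.

```python
def fear_recovery_index(first_test_scr, last_extinction_scr):
    """FRI for one CS: SCR on the first test trial minus SCR on the last extinction trial."""
    return first_test_scr - last_extinction_scr


def differential_fri(cs_plus_fri, cs_minus_fri):
    """Differential FRI: FRI of a CS+ minus FRI of the CS-."""
    return cs_plus_fri - cs_minus_fri


# Made-up SCRs (microsiemens) for one hypothetical subject in a
# two-CS+ design such as Studies 2 and 3.
scr = {
    "CS1+": {"last_extinction": 0.25, "first_test": 0.75},
    "CS2+": {"last_extinction": 0.25, "first_test": 0.5},
    "CS-": {"last_extinction": 0.125, "first_test": 0.25},
}

# One FRI per CS, then one differential FRI per CS+ (each relative to the CS-).
fri = {
    cs: fear_recovery_index(v["first_test"], v["last_extinction"])
    for cs, v in scr.items()
}
differential = {
    cs: differential_fri(fri[cs], fri["CS-"]) for cs in ("CS1+", "CS2+")
}
print(fri)
print(differential)
```

      In this sketch, a differential FRI near zero means the CS+ recovered no more than the CS- did, which is how the response interprets a lack of fear expression in the test phase.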

      - Figure 1C-E: It is unclear what the triple *** mean. Do they have the same meaning in Figure 1C and Figure 1E? I am not sure that that makes sense. The meaning is not explained in the figure caption (I think it is different from the single asterisk*) and is not crystal clear from the main text either.

      We explained the triple *** in the figure legend (Fig. 1): ***P < 0.001. The asterisk placed within each bar in Figure 1C-E indicates the statistical results of the post-hoc test of whether each bar was significant. For example, the *** placed inside bars in Figure 1E indicates that the differential fear recovery index is statistically significant in the no-reminder group (P < 0.001).

      - Supplemental Figure 1: "with all responded participants" Please clarify how you define 'responded participants' and include the n's.

      We presented the criteria for both the responder/non-responder and the learner/non-learner in the table of the supplementary materials and reported the number of subjects in each category (please see supplement Table 1).

      - "the differential SCRs (difference between CS+ and CS-) for the CS+". Please clarify what this means and/or how it is calculated exactly.

      Sorry, it means the difference between the SCRs invoked by CS+ and CS- for both CS1+ (CS1+ minus CS-) and CS2+ (CS2+ minus CS-).

      *I suggest that the authors provide a bit more explanation about the thought-control ability questionnaire. For example, the type of items, etc, as this is not a very commonly used questionnaire in the fear conditioning field.

      We provided a brief introduction to the thought-control ability questionnaire in the methods section:

      “The control ability over intrusive thought was measured by the 25-item Thought-Control Ability Questionnaire (TCAQ) scale (30). Participants were asked to rate on a five-point Likert-type scale the extent to which they agreed with the statement from 1 (completely disagree) to 5 (completely agree). At the end of the experiments, all participants completed the TCAQ scale to assess their perceived control abilities over intrusive thoughts in daily life (17).”

      We have added further description of the item types to the TCAQ scale.

      *The authors excluded more than 25% of the participants. It would be interesting to hear reasons for this relatively large number and some reflection on whether they think this selection affects their results (e.g., could being a (non)responder in skin conductance influence the susceptibility to reactivation-extinction in some way?).

      Participant exclusion rates in SCR studies are generally relatively high (Hu et al., 2018; Liu et al., 2014; Raio et al., 2017; Schiller et al., 2010; Schiller et al., 2012; Wang et al., 2021). In our tasks, the non-responders were mostly participants tested in the winter; cold weather and dry skin in the winter are likely to have made the SCR hard to measure (Bauer et al., 2022; Vila, 2004).

      *Minor comments that the authors may want to consider:

      - Please explain abbreviations upon first use, e.g., TMS.

      - In Figure 6, it is a bit counterintuitive that the right Y-axis goes from high to low.

      We added the explanation of TMS:

      “Continuous theta burst stimulation (cTBS), a specific form of repetitive transcranial magnetic stimulation (rTMS)…”

      We agree that the right Y-axis is rather counterintuitive. However, since the fear recovery index (which is what we measured in the experiment) and the short/long-term amnesia effect run in opposite directions, plotting one from low to high inevitably causes the other to go from high to low.

      References:

      Anderson, M. C. and Floresco, S. B. 2022. Prefrontal-hippocampal interactions supporting the extinction of emotional memories: The retrieval stopping model. Neuropsychopharmacology, 47, 180-195.

      Anderson, M. C. and Green, C. 2001. Suppressing unwanted memories by executive control. Nature, 410, 366-9.

      Bauer, E. A., Wilson, K. A. and Macnamara, A. 2022. 3.03 - cognitive and affective psychophysiology. In: ASMUNDSON, G. J. G. (ed.) Comprehensive clinical psychology (second edition). Oxford: Elsevier.

      Baum, M. 1968. Reversal learning of an avoidance response and the kamin effect. J Comp Physiol Psychol, 66, 495-7.

      Borgomaneri, S., Battaglia, S., Garofalo, S., Tortora, F., Avenanti, A. and Di Pellegrino, G. 2020. State-dependent tms over prefrontal cortex disrupts fear-memory reconsolidation and prevents the return of fear. Curr Biol, 30, 3672-3679.e4.

      Cain, C. K., Blouin, A. M. and Barad, M. 2003. Temporally massed cs presentations generate more fear extinction than spaced presentations. J Exp Psychol Anim Behav Process, 29, 323-33.

      Carroll, M., Campbell-Ratcliffe, J., Murnane, H. and Perfect, T. 2007. Retrieval-induced forgetting in educational contexts: Monitoring, expertise, text integration, and test format. European Journal of Cognitive Psychology, 19, 580-606.

      Chan, J. C. K. 2009. When does retrieval induce forgetting and when does it induce facilitation? Implications for retrieval inhibition, testing effect, and text processing. Journal of Memory and Language, 61, 153-170.

      Gagnepain, P., Henson, R. N. and Anderson, M. C. 2014. Suppressing unwanted memories reduces their unconscious influence via targeted cortical inhibition. Proc Natl Acad Sci U S A, 111, E1310-9.

      Gershman, S. J., Jones, C. E., Norman, K. A., Monfils, M. H. and Niv, Y. 2013. Gradual extinction prevents the return of fear: Implications for the discovery of state. Front Behav Neurosci, 7, 164.

      Gershman, S. J., Monfils, M. H., Norman, K. A. and Niv, Y. 2017. The computational nature of memory modification. Elife, 6.

      Hu, J., Wang, W., Homan, P., Wang, P., Zheng, X. and Schiller, D. 2018. Reminder duration determines threat memory modification in humans. Sci Rep, 8, 8848.

      Kamin, L. J. 1957. The retention of an incompletely learned avoidance response. J Comp Physiol Psychol, 50, 457-60.

      Kindt, M. and Soeter, M. 2018. Pharmacologically induced amnesia for learned fear is time and sleep dependent. Nat Commun, 9, 1316.

      Kindt, M., Soeter, M. and Vervliet, B. 2009. Beyond extinction: Erasing human fear responses and preventing the return of fear. Nat Neurosci, 12, 256-8.

      Liu, J., Zhao, L., Xue, Y., Shi, J., Suo, L., Luo, Y., Chai, B., Yang, C., Fang, Q., Zhang, Y., Bao, Y., Pickens, C. L. and Lu, L. 2014. An unconditioned stimulus retrieval extinction procedure to prevent the return of fear memory. Biol Psychiatry, 76, 895-901.

      Luo, Y.-X., Xue, Y.-X., Liu, J.-F., Shi, H.-S., Jian, M., Han, Y., Zhu, W.-L., Bao, Y.-P., Wu, P., Ding, Z.-B., Shen, H.-W., Shi, J., Shaham, Y. and Lu, L. 2015. A novel ucs memory retrieval-extinction procedure to inhibit relapse to drug seeking. Nature Communications, 6, 7675.

      Monfils, M. H., Cowansage, K. K., Klann, E. and Ledoux, J. E. 2009. Extinction-reconsolidation boundaries: Key to persistent attenuation of fear memories. Science, 324, 951-5.

      Nader, K., Schafe, G. E. and Le Doux, J. E. 2000. Fear memories require protein synthesis in the amygdala for reconsolidation after retrieval. Nature, 406, 722-6.

      Raio, C. M., Hartley, C. A., Orederu, T. A., Li, J. and Phelps, E. A. 2017. Stress attenuates the flexible updating of aversive value. Proc Natl Acad Sci U S A, 114, 11241-11246.

      Schiller, D., Kanen, J. W., Ledoux, J. E., Monfils, M. H. and Phelps, E. A. 2013. Extinction during reconsolidation of threat memory diminishes prefrontal cortex involvement. Proc Natl Acad Sci U S A, 110, 20040-5.

      Schiller, D., Monfils, M. H., Raio, C. M., Johnson, D. C., Ledoux, J. E. and Phelps, E. A. 2010. Preventing the return of fear in humans using reconsolidation update mechanisms. Nature, 463, 49-53.

      Schiller, D., Raio, C. M. and Phelps, E. A. 2012. Extinction training during the reconsolidation window prevents recovery of fear. J Vis Exp, e3893.

      Su, S., Deng, J., Yuan, K., Gong, Y., Zhang, Y., Li, H., Cao, K., Huang, X., Lin, X., Wu, P., Xue, Y., Bao, Y., Shi, J., Shi, L. and Lu, L. 2022. Continuous theta-burst stimulation over the right dorsolateral prefrontal cortex disrupts fear memory reconsolidation in humans. iScience, 25, 103614.

      Vila, J. 2004. Psychophysiological assessment. In: SPIELBERGER, C. D. (ed.) Encyclopedia of applied psychology. New York: Elsevier.

      Wang, Y., Zhu, Z., Hu, J., Schiller, D. and Li, J. 2021. Active suppression prevents the return of threat memory in humans. Commun Biol, 4, 609.

      Xue, Y. X., Luo, Y. X., Wu, P., Shi, H. S., Xue, L. F., Chen, C., Zhu, W. L., Ding, Z. B., Bao, Y. P., Shi, J., Epstein, D. H., Shaham, Y. and Lu, L. 2012. A memory retrieval-extinction procedure to prevent drug craving and relapse. Science, 336, 241-5.

      Zhu, Z., Anderson, M. C. and Wang, Y. 2022. Inducing forgetting of unwanted memories through subliminal reactivation. Nature communications, 13, 6496-6496.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Many drugs have off-target effects on the gut microbiota but the downstream consequences for drug efficacy and side effect profiles remain unclear. Herein, Wang et al. use a mouse model of liver injury coupled to antibiotic and microbiota transplantation experiments. Their results suggest that metformin-induced shifts in gut microbial community structure and metabolite levels may contribute to drug efficacy. This study provides valuable mechanistic insights that could be dissected further in future studies, including efforts to identify which specific bacterial species, genes, and metabolites play a causal role in drug response. Importantly, although some pilot data from human subjects is shown, the clinical relevance of these findings for liver disease remain to be determined.

      Thank you for reviewing our manuscript. We appreciate your valuable feedback. We agree that the downstream consequences of off-target effects on the gut microbiota by various drugs remain unclear. Our study aimed to shed light on this aspect by utilizing a mouse model of liver injury and conducting antibiotic and microbiota transplantation experiments. Our findings suggest that shifts in the structure and metabolite levels of the gut microbial community induced by metformin play a role in the drug’s efficacy. We believe that these mechanistic insights provide a strong foundation for further investigations. Specifically, future studies could focus on identifying the specific bacterial species, genes, and metabolites that have a causal role in drug response. While we have included some pilot data from human subjects, we acknowledge that the clinical relevance of our findings in the context of liver disease still requires further determination. In the human data, we focused on the metformin-induced alterations of the microbiota and its metabolism, which capture changes in a more complex clinical setting and help elucidate the potential role of metformin. We appreciate your attention to this aspect and thank you again for your thoughtful review and valuable suggestions.

      The major strength of this work is its scope, including detailed mouse phenotyping, inter-disciplinary methods, and numerous complementary experiments. The antibiotic depletion and FMT experiments provide support for a role of the gut microbiota in this mouse model.

      A major limitation is the lack of studies narrowing down which microbes are responsible. Sequencing data is shown, but no follow-up studies are done with bacterial isolates or defined communities.

      We acknowledge the limitation of our study in not narrowing down the specific microbes responsible for the observed effects. We hold the opinion that metformin exerts its effects through modulation of specific metabolic pathways unique to the microbial community. A previous study has shown that metformin can inhibit microbial folate metabolism, leading to longevity-promoting effects that are not attributed to a single colony or strain[1]. Similarly, the impact of metformin on amino acid metabolism in the microbial community appears to be widespread. While further investigations with bacterial isolates or defined communities are needed, our findings suggest that metformin's effects on microbial metabolism are complex and involve multiple members of the microbial community.

      The link to GABA is also somewhat tenuous. While it does match the phenotypic data, there are no targeted experiments in which GABA producing microbial communities/strains are compared to a control community/strain. As such, it seems difficult to know how much of the effects in this model are due to GABA vs. other metabolites.

      We agree with your point regarding the tenuous link to GABA in our study. GABA was the only amino acid that we observed to increase following metformin treatment, a finding that has not been reported previously; nevertheless, we acknowledge the need for targeted experiments comparing GABA-producing microbial communities/strains to control communities/strains. Previous studies suggest that metformin's modulation of the microbiota can vary significantly depending on the disease context, with different microbial populations exhibiting differential responses[2-4]. Given this complexity, we opted to study the overall microbial community response to metformin rather than focusing on specific strains. Additionally, our detection of key enzymes involved in GABA synthesis at the community level further supports our findings.

      My major recommendation would be to revise the title, abstract, and discussion to provide more qualification and to consider alternative interpretations.

      We appreciate your feedback and understand your concern regarding the need for more qualification and consideration of alternative interpretations. We would welcome any more specific and detailed suggestions for improving the clarity and qualification of our title and abstract. Furthermore, we have revised the Discussion to enhance the scientific rigor and logical coherence of our study. If you have any specific recommendations or insights, we would be more than willing to make further revisions to address those concerns.

      Some key controls are also missing, which could be addressed by repeat experiments in the mouse model.

      We appreciate your suggestion to include additional key controls in the mouse model experiments. We have conducted repeat experiments to test the effect of antibiotics in the absence of metformin, in order to differentiate between the effects of the model itself and an interaction of metformin with antibiotics. As the liver injury indicators show, there were no significant differences among the Control, Control+Met, Control+FMT and Control+Abx groups, indicating that metformin, feces from metformin-treated mice, and antibiotics had no effect on liver function in normal mice (Figure 1).

      Author response image 1.

      Figure 1 a: Liver MDA detection; b: Serum ALT level; c: Serum AST level.

      The antibiotic depletion experiment would be improved by testing the effect of antibiotics in the absence of metformin, to see if the effect is just driven by the model itself as opposed to an interaction between metformin and antibiotics.

      For the antibiotic depletion experiment, we administered antibiotics (Abx) to the model mice. The survival rate and liver function tests suggested that Abx had no additional effect on the liver, indicating that the effect is driven by the model itself rather than by an interaction between metformin and antibiotics (Figure 2).

      Author response image 2.

      Figure 2 a: Survival rate between IR and IR + Abx group; b: Serum ALT level; c: Serum AST level.

      References

      [1] CABREIRO F, AU C, LEUNG K Y, et al. Metformin Retards Aging in C. elegans by Altering Microbial Folate and Methionine Metabolism [J]. Cell, 2013, 153(1): 228-39.

      [2] LIANG H, SONG H, ZHANG X, et al. Metformin attenuated sepsis-related liver injury by modulating gut microbiota [J]. Emerg Microbes Infect, 2022, 11(1): 815-28.

      [3] SUN L, XIE C, WANG G, et al. Gut microbiota and intestinal FXR mediate the clinical benefits of metformin [J]. Nat Med, 2018, 24(12): 1919-29.

      [4] ZHAO H Y, LYU Y J, ZHAI R Q, et al. Metformin Mitigates Sepsis-Related Neuroinflammation via Modulating Gut Microbiota and Metabolites [J]. Frontiers in Immunology, 2022, 13:797312.

      Reviewer #2 (Public Review):

      The authors examine the use of metformin in the treatment of hepatic ischemia/reperfusion injury (HIRI) and suggest the mechanism of action is mediated in part by the gut microbiota and changes in hepatic ferroptosis. While the concept is intriguing, the experimental approaches are inadequate to support these conclusions.

      The histological and imaging studies were considered a strength and reveal a significant impact of metformin post-HIRI.

      Thank you for reviewing our paper titled “Gut microbiota-derived gamma-aminobutyric acid from metformin treatment reduces hepatic ischemia/reperfusion injury through inhibiting ferroptosis”. We appreciate your insightful comments and suggestions, which have helped us improve the quality and credibility of our research. We agree with your assessment that the experimental approaches used in this study may have limitations in supporting the conclusions drawn, and we appreciate your recognition of the strength of our histological and imaging studies, which clearly demonstrate the impact of metformin post-HIRI.

      Weaknesses largely stem from the experimental design. First, use of the iron chelator DFO would be strengthened using the ferroptosis inhibitor, liproxstatin.

      Your suggestion to employ the ferroptosis inhibitor, liproxstatin, in addition to the iron chelator DFO is well-taken. Incorporating liproxstatin into our experimental setup would provide a more comprehensive understanding of the involvement of hepatic ferroptosis in the mechanism of action of metformin. Therefore, we employed liproxstatin in HIRI and measured several core indicators of liver injury. As Figure 3 shows, liproxstatin can reduce liver injury, restore the liver GSH level and inhibit Fe accumulation, suggesting that ferroptosis plays an important role in HIRI. We hope this modification will enhance the credibility of our conclusions.

      Author response image 3.

      Figure 3 a: Liver MDA detection; b: Serum ALT level; c: Serum AST level; d: Liver GSH level; e: Liver Fe level.

      Second, the impact of metformin on the microbiota is profound resulting in changes in bile acid, lipid, and glucose homeostasis. Throughout the manuscript no comparisons are made with metformin alone which would better capture the metformin-specific effects.

      Thank you for raising an important point regarding the impact of metformin on the microbiota and its potential effects on bile acid, lipid, and glucose homeostasis. It is well known that the effects of metformin on normal blood glucose and lipid metabolism are minimal. Metformin primarily exerts its effects in cases of impaired glucose tolerance, which is why it is widely used for non-diabetic conditions. Regarding the changes in bile acid metabolism and chronic cholesterol and lipid elevation, these associations are typically observed in chronic liver disease models. Since our study focuses on an acute model of HIRI, we did not specifically investigate these changes.

      Lastly, the absence of proper controls, including germ-free mice, metformin-treated mice, FMT-treated mice, etc., makes it difficult to understand the outcomes and to properly reproduce the findings in other labs.

      Lastly, we acknowledge your concern regarding the absence of proper controls, including germ-free mice, metformin-treated mice, and FMT-treated mice. We understand that these controls are essential for robustly interpreting and reproducing our findings. Therefore, we have added a batch of experiments for verification. As the results show, there were no significant differences among the Control, Control+Met, Control+FMT and Control+Abx groups, indicating that metformin, feces from metformin-treated mice, and antibiotics had no effect on liver function in normal mice (Figure 1). We hope the results of these controls address your valid point and provide a more comprehensive framework for understanding the outcomes.

      Author response image 4.

      Figure 1 a: Liver MDA detection; b: Serum ALT level; c: Serum AST level.

      Overall, while the concept is interesting and has the potential to better understand the pleiotropic functions of metformin, the limitations with the experimental design and lack of key controls make it challenging to support the conclusions.

      We genuinely appreciate your constructive criticism and the time you have taken to evaluate our work. Your feedback has shed light on the limitations of our experimental design and the need for key controls, which we have addressed in the revised manuscript. If you have any further recommendations or concerns, we would be more than willing to incorporate them into our future work.

      Reviewer #3 (Public Review):

      The study presented in this paper explores the role of gut microbiota in the therapeutic effect of metformin on HIRI, as supported by fecal microbiota transplantation (FMT) experiments. Through high throughput sequencing and HPLC-MS/MS, the authors have successfully demonstrated that metformin administration leads to an increase in GABA-producing bacteria. Moreover, the study provides compelling evidence for the beneficial impact of GABA on HIRI.

      Thank you for your valuable feedback on our paper exploring the role of gut microbiota in the therapeutic effect of metformin on hepatic ischemia-reperfusion injury (HIRI). We appreciate your positive remarks and suggestions for improvement. In response to your comments, we have revised the manuscript accordingly. We have included additional details on the high throughput sequencing and HPLC-MS/MS methods used to analyze the gut microbiota and GABA levels. This should provide readers with a clearer understanding of our experimental approach and the evidence supporting our findings.

      Regarding your suggestion to further investigate the mechanisms underlying the beneficial impact of GABA on HIRI, we agree that this is an important direction for future research. We plan to conduct additional studies to explore the specific mechanisms by which GABA exerts its protective effects on HIRI. We have also added a discussion of potential therapeutic strategies targeting GABAergic pathways to the Discussion section.

      Thank you once again for your insightful comments. We believe that these revisions have strengthened the manuscript and improved its scientific rigor. We hope that you find the revised version to be satisfactory and look forward to your further feedback.

      Reviewer #1 (Recommendations For The Authors):

      The writing could be improved. Multiple typos are found throughout and there is an overuse of adverbs like "expectedly". You should let the reader decide what is or is not expected. Try to avoid terms like "confirmed" or "validated", which only applies if you knew the result a priori. Remove underscores in species names. The Results section is also very difficult to interpret given the lack of explanation of experimental design. For example, the human study is only briefly mentioned within a larger paragraph on mouse data, without any explanation as to the study design. Similar issues are true for the transcriptomics and amplicon sequencing - it would help the reader to explain what samples were processed, the timepoints, etc.

      Thank you for your valuable feedback on our manuscript entitled “Gut microbiota-derived gamma-aminobutyric acid from metformin treatment reduces hepatic ischemia/reperfusion injury through inhibiting ferroptosis”. We appreciate your constructive comments and insightful suggestions for improvement.

      We have carefully reviewed your comments and have made several revisions to enhance the clarity and readability of the manuscript. We have addressed the issue of multiple typos and have removed the overuse of adverbs, such as “expectedly,” to allow readers to draw their own conclusions from the results. Additionally, we have eliminated terms like “confirmed” or “validated” that may imply a priori knowledge of the results.

      We apologize for the lack of clarity regarding the experimental design in the Results section. We have now provided a more detailed explanation of the study design for the human study, transcriptomics, and amplicon sequencing experiments. This includes information on the samples processed, timepoints, and other relevant details, to aid readers in understanding the experimental procedures.

      In response to your comment about removing underscores in species names, we have revised the text accordingly to ensure consistency and accuracy in the species nomenclature used throughout the manuscript.

      Once again, we sincerely appreciate your valuable input, which has helped us improve the quality of our manuscript. We hope that the revised version now meets your expectations and look forward to any further feedback you may have.

      Thank you for your time and attention.

      Line 53 - prebiotics aren't "microbial agents"

      We apologize for this error, which we have corrected. (line 55: “Microbial agents, such as synbiotics and probiotics…”)

      Line 88 - sequencing doesn't "verify the critical role of gut microbiota"

      We apologize for this error, which we have corrected. (line 90: “In order to clarify the critical role of gut microbiota in the pleiotropic actions of metformin,22-24 fecal samples were collected from the mice to perform 16S rRNA sequencing.”)

      Line 92 - missing a citation for the "microbiota-gut-liver axis theory"

      We have corrected this in the manuscript. (line 93: “Next, as the microbiota-gut-liver axis theory indicates,25 HIRI-induced dysfunction of the gut barrier may aggravate liver damage by disrupting the gut microbiota.”)

      Line 112 - it's very surprising to me that FMT led to lower alpha diversity, which seems impossible.

      We understand your surprise regarding the observed decrease in alpha diversity after FMT. Our findings indeed deviate from the commonly observed pattern of increased alpha diversity post-FMT. We have carefully re-examined our data and conducted additional analyses to ensure the accuracy of our results. After thorough investigation, we have identified a potential reason for this unexpected outcome, which we believe could shed light on this phenomenon. We hypothesize that the lower alpha diversity observed in our study might be attributed to the specific characteristics of the donor microbiota used for FMT. While the donor microbiota exhibited certain beneficial properties associated with the therapeutic effect on HIRI, it could have presented a limited diversity compared to the recipient’s original gut microbiota. This discrepancy in diversity could have contributed to the observed decrease in alpha diversity following FMT.

      To further support our hypothesis, we have included a discussion on this unexpected finding in the revised manuscript. We believe that this addition will provide a more comprehensive understanding of the results and help contextualize the observed decrease in alpha diversity following FMT.

      Line 117 - Antibiotics don't "identify the function of gut microbes." Need to specify which antibiotics were used and for how long.

      We have corrected this in the manuscript. (line 119: “To further identify the function of gut microbes, experiments were designed, and combination treatment of antibiotics (1 mg/mL penicillin sulfate, 1 mg/mL neomycin sulfate, 1 mg/mL metronidazole and 0.16 mg/mL gentamicin) and metformin were employed for 1 week before IR treated.”)

      Line 120 - this experiment shows that the gut microbiota (or antibiotics more precisely) matters, not the "reshaped gut microbiota"

      We have corrected this in the manuscript. (line 124: “The results confirmed that reshaped gut microbiota is critical for the effect of metformin against HIRI.”)

      Line 122 - need to reword this subheading and the concluding sentence. The main takeaway is that the FMT improved markers of ferroptosis, but no additional causal links are provided here.

      We have revised this in the manuscript. (line 125: “FMT alleviates HIRI-induced ferroptosis through reshaped fecal microbiota.”)

      Line 141 - need to explain what transcriptomics data was generated and how it was analyzed.

      We have revised this in the manuscript. (line 144: “To elucidate the molecular pathways involved in metformin-treated IR injury, we analysed the gene expression profiles of mice in each group. Transcriptome sequencing analysis revealed that 9697 genes were in common among four groups (Supplementary Figure 6). Therefore, we used these common genes for KEGG analysis, showing that similar mRNA changes between Met group and FMT group are mainly concentrated in the three top pathways: lipid metabolism, carbohydrate metabolism, and amino acid metabolism (Fig 4a).”)

      Line 150 - change to "16S rRNA gene sequencing". Typo: "mice microbes".

      We have revised this in the manuscript. (line 156: “Moreover, it was observed that the genus of Bacteroides had a significant increase based on the 16S rRNA gene sequencing of metformin-treated mice microbes.”)

      Line 152 - upregulated refers to gene expression, change to enriched.

      We have revised this in the manuscript. (line 171: “In detail, the species of Bacteroides containing Bacteroides thetaiotaomicron, Bacteroides uniformis, and Bacteroides salyersiae, were enriched in human gut after metformin administration (Fig. 4i).”)

      Line 159 - typo: "prokaryotes"

      We have revised this in the manuscript. (line 165: “In order to further identify the increased GABA originates from gut microbiota, two key enzymes of prokaryotic GABA synthesis, GAD and PAT, were detected on DNA level, finding that both of them are significantly increased in the feces from IR+Met and IR+FMT groups (Fig. 4h).”)

      Line 161 - the human study should be under a new sub-heading and provide more details.

      We have revised this in the manuscript. (line 168: “In order to clarify the specific effects of metformin on microbiota, given the big safety margin, healthy volunteers were recruited for a 1-week trial of a daily oral 500 mg dose of metformin. Fecal samples were collected before and after oral administration of metformin for metagenomic analysis.”)

      Line 197 - It's unclear why the current study conflicts with prior literature. Is it due to the disease model, the starting microbiota, something else? Please add more discussion.

      Thank you for bringing this important point to our attention, and we appreciate your valuable input. We agree that it is important to discuss the potential reasons for the discrepancy between our findings and prior literature on metformin-reshaped microbiota. In our study, we used a disease model of HIRI, which may have unique characteristics compared to other disease models. It is possible that the specific disease model influenced the response of the gut microbiota. Additionally, the starting microbiota of the recipients and the characteristics of the donor microbiota used for FMT could also play a role in the disparity. We have expanded the discussion section of our revised manuscript to further address these potential factors and their implications. We hope that this additional information will provide a more comprehensive explanation for the discrepancy between our study and prior literature.

      Figure 1a - change to Kaplan Meier not ANOVA. Specify the contrast - which groups are being compared?

      We have revised Figure 1a accordingly.
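      For readers unfamiliar with the requested analysis, the following is a minimal sketch of the Kaplan-Meier product-limit estimator for a single group. It is illustrative only: the function name `kaplan_meier` and the data are hypothetical and are not taken from the manuscript, and comparing groups would additionally require a test such as the log-rank test, which is not shown.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times  -- follow-up time of each animal (hypothetical units, e.g. days)
    events -- 1 if death was observed at that time, 0 if the animal was censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving this time point.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Every animal with time == t (dead or censored) leaves the risk set.
        at_risk -= sum(1 for x in times if x == t)
    return curve


# Hypothetical data for illustration only (not from the study).
example_times = [1, 2, 2, 3, 5, 7, 7, 7]
example_events = [1, 1, 0, 1, 0, 0, 0, 0]
for t, s in kaplan_meier(example_times, example_events):
    print(t, round(s, 3))
```

      Unlike ANOVA on survival percentages, this estimator uses each animal's time to event and handles censored animals explicitly.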

      Figure 1e, alpha diversity - relabel "sobs" with "observed OTUs". Change to 3 bars with error and add statistics.

      We have revised Figure 1e accordingly.
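      As a sketch of what the relabelled panel summarizes, the code below computes observed OTUs (richness) and the Shannon index from a vector of per-OTU read counts, plus a group mean and standard deviation suitable for a bar with an error bar. The function names and the counts are hypothetical; the manuscript's actual pipeline is not described in this response.

```python
import math


def observed_otus(counts):
    """Richness: number of OTUs with at least one read in the sample."""
    return sum(1 for c in counts if c > 0)


def shannon(counts):
    """Shannon diversity index (natural log) from per-OTU read counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def mean_sd(values):
    """Mean and sample standard deviation, e.g. for one bar with an error bar."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(var)


# Hypothetical OTU count vectors for three samples of one group (not study data).
group = [[10, 0, 5, 1], [3, 3, 0, 0], [8, 1, 1, 1]]
richness = [observed_otus(sample) for sample in group]
print(richness, mean_sd(richness))
```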

      Figure 1e, PCA - this should be a separate panel (1f). Color of big red circle doesn't match the points. Add PERMANOVA p-value/R2. Change to OTUs not genera. Better yet, use amplicon sequence variants from DADA2.

      We have revised Figure 1e accordingly.
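      The PERMANOVA p-value and R2 requested above can be sketched as follows. This is a simplified one-factor implementation on Bray-Curtis distances, written under stated assumptions: the function names, the choice of Bray-Curtis, the number of permutations, and the data are all hypothetical, and a published analysis would normally rely on an established package rather than this sketch.

```python
import random


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count vectors."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den else 0.0


def pseudo_f(dist, labels):
    """Return (pseudo-F, R2) for a one-factor design from a distance matrix."""
    n = len(labels)
    groups = sorted(set(labels))
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for g in groups:
        idx = [i for i, lab in enumerate(labels) if lab == g]
        ss_within += sum(
            dist[i][j] ** 2 for k, i in enumerate(idx) for j in idx[k + 1:]
        ) / len(idx)
    ss_among = ss_total - ss_within
    f = (ss_among / (len(groups) - 1)) / (ss_within / (n - len(groups)))
    return f, ss_among / ss_total


def permanova(samples, labels, permutations=999, seed=0):
    """Return (pseudo-F, R2, permutation p-value) by shuffling group labels."""
    n = len(samples)
    dist = [[bray_curtis(samples[i], samples[j]) for j in range(n)] for i in range(n)]
    f_obs, r2 = pseudo_f(dist, labels)
    rng = random.Random(seed)
    shuffled = list(labels)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled)[0] >= f_obs:
            hits += 1
    # Add one to numerator and denominator so the observed labelling is counted.
    return f_obs, r2, (hits + 1) / (permutations + 1)


# Hypothetical count tables for two groups of four samples (not study data).
samples = [
    [10, 0, 1], [9, 1, 0], [11, 0, 2], [10, 2, 1],
    [0, 10, 8], [1, 9, 9], [0, 11, 7], [2, 10, 10],
]
labels = ["A"] * 4 + ["B"] * 4
f_stat, r2, p_value = permanova(samples, labels)
print(round(f_stat, 2), round(r2, 3), p_value)
```

      R2 here is the fraction of total distance-based variation explained by group membership, which is the quantity usually reported next to the ordination plot.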

      Figure 2a - Change to Kaplan Meier. Also, it's unclear if residual metformin could be in the donor samples.

      We have revised Figure 2a accordingly.

      Figure 2f, alpha diversity - relabel "sobs" with "observed OTUs". Change to 3 bars with error and add statistics.

      We have revised Figure 2f accordingly.

      Figure 2f, PCA - this should be a separate panel (2g). Color of big orange circle doesn't match the points. Add PERMANOVA p-value/R2. Change to OTUs not genera. Better yet, use amplicon sequence variants from DADA2.

      We have revised Figure 2f accordingly.

      Figure 4b - check units, shouldn't this be ng/mg (i.e. weight not volume).

      We have revised Figure 4b accordingly.

      Figure 4c,d - need more explanation in the legend and Results as to what is shown here.

      We have revised Figure 4c,d accordingly.

      Figure 4d - unclear why only Bacteroides are shown here or if the p-values are adjusted for multiple comparisons.

      Thank you for your comment regarding Figure 4d in our manuscript. We apologize for the confusion caused. The reason why only Bacteroides is shown in Figure 4d is because we specifically wanted to investigate the changes in Bacteroides abundance following metformin treatment.

      In the mouse experiments, we observed a significant increase in Bacteroides after metformin treatment. To investigate if a similar change occurs in healthy volunteers, we examined the levels of Bacteroides in fecal samples before and after oral administration of metformin. We found that the abundance of Bacteroides also increased in the human gut after metformin administration, consistent with the results from the animal experiments. Regarding the p-values, we apologize for not mentioning whether they were adjusted for multiple comparisons in the figure legend. In our revised manuscript, we have provided a clarification stating that the p-values were adjusted using the appropriate method. We appreciate your feedback and hope that this explanation clarifies the rationale behind Figure 4d. Thank you for your valuable input.
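      The response does not name the adjustment method. As one common choice, and purely as an assumption for illustration, the Benjamini-Hochberg false discovery rate procedure can be sketched as below; the function name and the p-values are hypothetical and are not taken from the manuscript.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values, e.g. one per taxon tested (not study data).
raw = [0.01, 0.04, 0.03, 0.005]
print([round(q, 4) for q in benjamini_hochberg(raw)])
```

      Reporting which taxa were tested matters here, because the adjustment depends on the total number of comparisons and not only on the taxa shown in the figure.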

      Reviewer #2 (Recommendations For The Authors):

      Below I've listed several suggestions to improve the paper.

      1. Controls - the authors should include metformin only treated mice, FMT only treated mice, etc. Additionally, germ free mice treated with metformin and HIRI would be helpful to better implicate the gut microbiome in these beneficial effects.

      Thank you for your suggestion regarding the inclusion of additional control groups in our study. We agree that including metformin only treated mice, FMT only treated mice, and germ-free mice treated with metformin and HIRI would provide valuable insights into the role of the gut microbiome in the observed beneficial effects.

      Therefore, we have included metformin-only, FMT-only and Abx-only treated mice as supplementary controls to better assess their specific contributions to the observed effects. As the results show, there were no significant differences among the Control, Control+Met, Control+FMT and Control+Abx groups, indicating that metformin, feces from metformin-treated mice, and antibiotics had no effect on liver function in normal mice (Figure 1).

      We appreciate your input and believe that the inclusion of these additional control groups will strengthen our study and provide a more comprehensive understanding of the role of the gut microbiome in the therapeutic effects observed.

      Author response image 5.

      Figure 1 a: Liver MDA detection; b: Serum ALT level; c: Serum AST level.

      2. More thorough characterization of metabolite pools. Metformin is known to influence many pathways including bile acids and lipids. These important molecules should be measured as they likely play a key role in the observed protective effect. In fact, many of the key changes displayed in Figure 3H are involved in lipid metabolism.

      Thank you for your valuable feedback regarding the characterization of metabolite pools in our study. We appreciate your suggestion to measure the influence of metformin on bile acids and lipid metabolism, as they are crucial pathways that may play a significant role in the observed protective effect.

      Regarding bile acids, we agree that they are important in the context of metformin’s influence on metabolic pathways. However, it is important to note that the impact of metformin on bile acids appears to be more prominent in chronic liver disease models. In our acute model, the changes in bile acids were not as significant. Instead, our results primarily indicate a close association between lipid changes and hepatic ferroptosis. Metformin significantly modulates lipid metabolism, thereby alleviating liver ferroptosis.

      Additionally, we have conducted metagenomic sequencing on the gut microbiota of healthy volunteers before and after oral administration of metformin. While analyzing the data, we did not observe significant changes in key genes involved in regulating bile acid variations. This might be attributed to the healthy volunteers used in our study, where significant changes in bile acids were not induced.

      We appreciate your insightful comments and suggestions, which have shed light on the importance of characterizing bile acids and lipid metabolism in our study. While the impact of bile acids may be more evident in chronic liver disease models, our findings highlight the significant influence of metformin on lipid metabolism, closely related to hepatic ferroptosis. We will take your suggestions into account for future studies to further explore the role of bile acids and their regulation by metformin.

      3. Imaging of lipid ROS is not quantitative. The authors should conduct more standard assays with BODIPY 581/591 C11 using cell lysates.

      We appreciate your suggestion to conduct more standard assays using BODIPY 581/591 C11 with cell lysates.

      We would like to clarify that we did indeed utilize assays with BODIPY 581/591 C11 to detect and measure lipid ROS in our study. The detailed description of these assays can be found in the Methods section of our paper. We followed established protocols and guidelines to ensure accurate and reliable measurements of lipid ROS levels.

      We acknowledge that imaging techniques may have limitations in providing quantitative data. However, we employed BODIPY 581/591 C11 assays as a widely accepted and commonly used method to assess lipid ROS levels. This allowed us to obtain qualitative and semi-quantitative information on the changes in lipid ROS levels in response to metformin treatment.

      4. Liproxstatin may be a better drug choice or at the very least should be used to compare with the DFO data.

      Thank you for your suggestion. We have taken your advice into consideration and conducted an evaluation of Liproxstatin as a ferroptosis inhibitor. Our findings indicate that Liproxstatin significantly improves HIRI (Figure C). We believe that incorporating Liproxstatin in our research will provide valuable insights and allow for a comprehensive comparison with the DFO data.

      Author response image 6.

      Figure 3 a: Liver MDA detection; b: Serum ALT level; c: Serum AST level; d: Liver GSH level; e: Liver Fe level.

      5. The rationale for how GABA was selected is not clear. I am surprised that there were not more significant metabolite changes. It might be better to show a volcano plot or heatmap of the significantly changed features.

      Thank you for raising an important question regarding the rationale for selecting GABA as the focus metabolite in our study. Initially, we also had concerns about the limited number of significant metabolite changes observed. However, through our comprehensive metabolomic profiling, we identified GABA as the most significantly altered metabolite following HIRI.

      It is worth noting that we specifically focused on the measurement of 22 essential amino acids in our analysis. While it is possible that changes in non-essential amino acids may have occurred, we did not examine them in this study. Nevertheless, we have since used additional methods to validate the upregulation of GABA levels, and the biological effects observed support the specific role of GABA in protecting against HIRI. Because GABA was the only amino acid that changed significantly, a volcano plot would add little information, so we did not include one.

      We appreciate your valuable input and thank you for bringing up this important issue.

      6. The manuscript needs to be proofread and edited. There are a variety of typos and grammar issues throughout.

      Thank you for your feedback. We acknowledge that the manuscript requires proofreading and editing, as we have identified several typos and grammar issues. We will try to ensure that the necessary revisions are made to improve the overall quality of the manuscript.

      Reviewer #3 (Recommendations For The Authors):

      However, I have some major concerns for the manuscript.

      1. Line 26 16S rRNA and metagenomic sequencing alone can't accurately confirm the improvement effect of GABA producing bacteria on HIRI. In fact, transcriptome analysis, HPLC-MS/MS and other methods were also used in this paper, so the language expression here is not appropriate

      Thank you for pointing out the language expression issue in line 26 of the manuscript. We apologize for any confusion caused. You are correct in stating that 16S rRNA and metagenomic sequencing alone may not accurately confirm the improvement effect of GABA-producing bacteria on HIRI. In our study, we employed a combination of multiple methods, including transcriptome analysis, HPLC-MS/MS, and in particular detection of the key bacterial GABA synthesis enzymes PAT and GAD, to comprehensively investigate the impact of GABA-producing bacteria on HIRI.

      We have revised the language in line 26 to reflect the broader range of methods used in our study to support the conclusions regarding the improvement effect of GABA-producing bacteria on HIRI.

      2. The Introduction section needs to add a description of the previous research on the association between HIRI and ferroptosis

      Thank you for your suggestion regarding the inclusion of a description of the association between HIRI and ferroptosis in the Introduction section. We agree that this is an important aspect to address. However, upon further consideration, we have decided to move the discussion of ferroptosis and its potential role in HIRI to the Discussion section, as it aligns better with the logical flow of the manuscript. This allows us to discuss the potential implications and future directions in a more organized and coherent manner.

      1. Authors should provide quantified figure or table next to the results of western blot that are more convenient to understand.

      We have revised the manuscript accordingly (see Supplementary Figure 7).

      1. In this paper, FMT experiments are used to verify that metformin remodeled gut microbiota can play a role in improving HIRI. The operation steps of FMT should be described more specifically in the method part

      *What is the fecal donor information for FMT?

      *Line 272: Did the IR + FMT group put the transplanted microbiota of FMT directly into the drinking water like the other treatment groups? Will such an operation affect the quality and quantification of the transplanted microbiota and lead to the loss of microbiota species? It is crucial for the authors to provide a clear and thorough clarification regarding these matters within the context of their FMT experiment.

      Thank you for your feedback regarding the need for a more detailed description of the fecal microbiota transplantation (FMT) procedure and clarification regarding the IR + FMT group in our manuscript. We appreciate your suggestions and we have taken them into consideration.

      In our study, the fecal donor for FMT was obtained from mice that had been orally administered metformin. The fecal microbiota was collected and processed to remove any residual metformin before transplantation. Specifically, the microbiota for the IR + FMT group was administered through gavage, as stated in line 272. This method does not affect the quality or quantity of the transplanted microbiota, nor does it lead to a loss of microbiota species. We understand the importance of providing clear and thorough clarification regarding these matters. Therefore, we have included additional specific details of the FMT procedure in the revised version of the manuscript. We hope that this clarification addresses your concerns and provides a more comprehensive understanding of our FMT experiment.

      1. The presentation of transcriptomic analysis results in the manuscript is insufficiently comprehensive and specific, as they are solely depicted through Fig 4a. Relying solely on Fig 4a is inadequate to establish the definitive roles of the met group and FMT group in ferroptosis compared to other groups. Therefore, the authors should provide additional transcriptomic analysis results to ascertain the specific effects of the met group and FMT group in ferroptosis, as well as their comparison with other groups.

      Thank you for your feedback regarding the comprehensiveness of our transcriptomic analysis results in the manuscript. We understand your concerns and appreciate your suggestion. In our study, we have provided additional data beyond Fig 4a to support the specific effects of the met group and FMT group in ferroptosis, as well as their comparison with other groups. Specifically, in Figure 3, we have included Western blot (WB) and quantitative real-time polymerase chain reaction (qRT-PCR) data to confirm the involvement of ferroptosis in HIRI and the role of metformin in attenuating ferroptosis. Moreover, we have presented transcriptomic analysis results in Figure 3h, which includes a heatmap of genes related to lipid metabolism. These findings can strengthen our conclusions regarding the importance of ferroptosis in HIRI and the protective effects of metformin against ferroptosis. We hope that these data address your concerns and provide a more comprehensive understanding of our research findings.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      This study explores the sequence characteristics and features of high-occupancy target (HOT) loci across the human genome. The computational analyses presented in this paper provide insight into the correlation of TF binding and regulatory networks at HOT loci that were regarded as lacking sequence specificity.

      By leveraging hundreds of ChIP-seq datasets from the ENCODE Project to delineate HOT loci in HepG2, K562, and H1-hESC cells, the investigators identified the regulatory significance and participation in 3D chromatin interactions of HOT loci. Subsequent exploration focused on the interaction of DNA-associated proteins (DAPs) with HOT loci using computational models. The models established that the potential formation of HOT loci is likely embedded in their DNA sequences and is significantly influenced by GC contents. Further inquiry exposed contrasting roles of HOT loci in housekeeping and tissue-specific functions spanning various cell types, with distinctions between embryonic and differentiated states, including instances of polymorphic variability. The authors conclude with a speculative model that HOT loci serve as anchors where phase-separated transcriptional condensates form. The findings presented here open avenues for future research, encouraging more exploration of the functional implications of HOT loci.

      Strengths:

      The concept of using computational models to define characteristics of HOT loci is refreshing and allows researchers to take a different approach to identifying potential targets. The major strength of the study lies in the very large number of datasets analyzed, with hundreds of ChIP-seq datasets for both HepG2 and K562 cells as part of the ENCODE project. Such quantitative power allowed the authors to delve deeply into HOT loci, which were previously thought to be artifacts.

      Weaknesses:

      While this study contributes to our knowledge of HOT loci, there are critical weaknesses that need to be addressed. There are questions on the validity of the assumptions made for certain analyses. The speculative nature of the proposed model involving transcriptional condensates needs either to be further validated or to be toned down. Furthermore, some apparent contradictions exist among the main conclusions, and these either need to be better explained or corrected. Lastly, several figure panels could be better explained or described in the figure legends.

      We thank the reviewer for their valuable comments.

      - We have extended the study and included a new chapter focusing on the condensate hypothesis, added more supporting evidence (including the ones suggested by the reviewer), and made explicit statements on the speculative nature of this model.

      - We have restructured the text to remove the sentences which might be construed as contradictory.

      Reviewer #2 (Public Review):

      Summary:

      The paper 'Sequence characteristics and an accurate model of abundant hyperactive loci in the human genome' by Hudaiberdiev and Ovcharenko offers comprehensive analyses and insights about the 'high-occupancy target' (HOT) loci in the human genome. These are considered genomic regions that overlap with transcription factor binding sites. The authors provided very comprehensive analyses of the TF composition characteristics of these HOT loci. They showed that these HOT loci tend to overlap with annotated promoters and enhancers, GC-rich regions, open chromatin signals, and highly conserved regions, and that these loci are also enriched with potentially causal variants for different traits.

      Strengths:

      Overall, the definition of HOT loci is clear and the data of HOT regions across the genome can be a useful dataset for studies that use HepG2 or K562 as a model. I appreciate the authors' efforts in presenting many analyses and plots backing up each statement.

      Weaknesses:

      It is noteworthy that the HOT concept and their signature characteristics as being highly functional regions of the genome are not presented for the first time here. Additionally, I find the main manuscript, though very comprehensive, long-winded and can be put in a shorter, more digestible format without sacrificing scientific content.

      The introduction's mention of the blacklisted region can be rather misleading because when I read it, I was anticipating that we are uncovering new regulatory regions within the blacklisted region. However, the paper does not seem to address the question of whether the HOT regions overlap, if any, with the ENCODE blacklisted regions afterward. This plays into the central assessment that this manuscript is long-winded.

      The introduction also mentioned that HOT regions correspond to 'genomic regions that seemingly get bound by a large number of TFs with no apparent DNA sequence specificity' (this point of 'no sequence specificity' is reiterated in the discussion lines 485-486). However, later on in the paper, the authors also showed that models such as convolutional neural networks, which take in one-hot-encoded DNA sequence to predict HOT loci, performed really well. It means that the sequence contexts with potential motifs can still play a role in forming the HOT loci. At the same time, lines 59-60 also cited studies that "detected putative drive motifs at the core segments of the HOT loci". The authors should edit the manuscript to clarify (or eradicate) contradictory statements.

      We thank the reviewer for their valuable comments. Below are our responses to each paragraph in the given order:

      We added the following sentence to the introduction, summarizing other publications that studied the functional aspects of HOT loci:

      “Other studies have concluded that these regions are highly functionally consequential regions enriched in epigenetic signals of active regulatory elements such as histone modification regions and high chromatin accessibility”.

      We significantly shortened the manuscript by a) moving the detailed analyses of the computational model to the supplemental materials, and b) shortening the discussions by around half, focusing on core analyses that would be most beneficial to the field.

      Given that the ENCODE guidelines recommend avoiding the blacklisted regions when mapping ChIP-seq (and other NGS) data, we excluded them from our analyzed regions before mapping to the genome. Instead, we relied on the conclusions of other publications on HOT loci, namely that a fraction of the HOT loci in the initial assessments resulted from including loci that were later added to the blacklisted regions.

      We addressed the potential confusion caused by the expression “no sequence specificity” by a) clarifying the sentence in the introduction to read “... with no apparent DNA sequence specificity in terms of detectible binding motifs of corresponding motifs” and b) removing that part from the sentence in the discussion.

      Reviewer #3 (Public Review):

      Summary:

      Hudaiberdiev and Ovcharenko investigate regions within the genome where a high abundance of DNA-associated proteins are located and identify DNA sequence features enriched in these regions, their conservation in evolution, and variation in disease. Using ChIP-seq binding profiles of over 1,000 proteins in three human cell lines (HepG2, K562, and H1) as a data source they're able to identify nearly 44,000 high-occupancy target loci (HOT) that form at promoter and enhancer regions, thus suggesting these HOT loci regulate housekeeping and cell identity genes. Their primary investigative tool is HepG2 cells, but they employ K562 and H1 cells as tools to validate these assertions in other human cell types. Their analyses use RNA pol II signal, super-enhancer, regular-enhancer, and epigenetic marks to support the identification of these regions. The work is notable, in that it identifies a set of proteins that are invariantly associated with high-occupancy enhancers and promoters and argues for the integration of these molecules at different genomic loci. These observations are leveraged by the authors to argue HOT loci as potential sites of transcriptional condensates, a claim that they are well poised to provide information in support of. This work would benefit from refinement and some additional work to support the claims.

      Comments:

      (1) Condensates are thought to be scaffolded by one or more proteins or RNA molecules that are associated together to induce phase separation. The authors can readily provide from their analysis a check of whether HOT loci exist within different condensate compartments (or a marker for them). Generally, ChIPSeq signal from MED1 and Ronin (THAP11) would be anticipated to correspond with transcriptional condensates of different flavors, other coactivator proteins (e.g., BRD4), would be useful to include as well. Similarly, condensate scaffolding proteins of facultative and constitutive heterochromatin (HP1a and EZH2/1) would augment the authors' model by providing further evidence that HOT Loci occur at transcriptional condensates and not heterochromatin condensates. Sites of splicing might be informative as well, splicing condensates (or nuclear speckles) are scaffolded by SRRM/SON, which is probably not in their data set, but members of the serine arginine-rich splicing factor family of proteins can serve as a proxy-SRSF2 is the best studied of this set. This would provide a significant improvement to their proposed model and be expected since the authors note that these proteins occur at the enhancers and promoter regions of highly expressed genes.

      (2) It is curious that MAX is found to be highly enriched without its binding partner Myc, is Myc's signal simply lower in abundance, or is it absent from HOT loci? How could it be possible that a pair of proteins, which bind DNA as a heterodimer are found in HOT loci without invoking a condensate model to interpret the results?

      (3) Numerous studies have linked the physical properties of transcription factor proteins to their role in the genome. The authors here provide a limited analysis of the proteins found at different HOT-loci by employing go terms. Is there evidence for specific types of structural motifs, disordered motifs, or related properties of these proteins present in specific loci?

      (4) Condensates themselves possess different emergent properties, but it is a product of the proteins and RNAs that concentrate in them and not a result of any one specific function (condensates can have multiple functions!)

      (5) Transcriptional condensates serve as functional bodies. The notion the authors present in their discussion is not held by practitioners of condensate science, in that condensates exist to perform biochemical functions and are dissolved in response to satisfying that need, not that they serve simply as reservoirs of active molecules. For example, transcriptional condensates form at enhancers or promoters that concentrate factors involved in the activation and expression of that gene and are subsequently dissolved in response to a regulatory signal (in transcription this can be the nascently synthesized RNA itself or other factors). The association reactions driving the formation of active biochemical machinery within condensates are materially changed, as are the kinetics of assembly. It is unnecessary and inaccurate to qualify transcriptional condensates as depots for transcriptional machinery.

      (6) This work has the potential to advance the field by providing a detailed perspective on what proteins are located in what regions of the genome. Publication of this information alongside the manuscript would advance the field materially.

      We thank the reviewer for constructive comments and suggestions. Below are our point-by-point responses:

      (1) We added a new short section “Transcriptional condensates as a model for explaining the HOT regions” with additional support for the condensate hypothesis, wherein some of the points raised here were addressed. Specifically, we used a curated database of LLPS proteins (CD-CODE) and provided statistics on the DAPs annotated as condensate-related.

      Regarding the DAPs mentioned in this question, we observed that the distributions of the corresponding ChIP-seq peaks confirm the patterns expected by the reviewer (Author response image 1). Namely:

      - MED1 and Ronin (THAP11) are abundant in the HOT loci, being present in 67% and 64% of HOT loci, respectively.

      - While BRD4 is present in 28% of the HOT loci, we observed that the presence of DAPs with annotated LLPS activity in the HOT loci ranged from 3% to 73%, providing further support for the condensate hypothesis.

      - The ENCODE database does not contain a ChIP-seq dataset for HP1A. EZH2 peaks were largely absent from the HOT loci (0.4% overlap), suggesting the lack of heterochromatin condensate involvement.

      - Serine/arginine-rich splicing factor family proteins were present in only 7.7% of the HOT loci, suggesting the absence of or limited overlap with splicing condensates or nuclear speckles.

      Author response image 1.

      (2) In this study we selected the TF ChIP-seq datasets with stringent quality metrics, excluding those with attached audit warnings or errors. As a result, the set of DAPs analyzed in HepG2 did not include MYC, since the corresponding ChIP-seq dataset had the audit warning tags "borderline replicate concordance, insufficient read length, insufficient read depth, extremely low read depth". Analyses in K562 and H1 did include the MYC (alongside MAX) ChIP-seq datasets.

      To address this question, we added the mentioned ChIP-seq dataset (ENCODE ID: ENCFF800JFG) and analyzed the colocalization patterns of MYC and MAX. We observed that the MYC ChIP-seq peaks in HepG2 display spurious results, overlapping with only 5% of HOT loci. Meanwhile in K562 and H1, MYC and MAX are jointly present in 54% and 44% of the HOT loci, respectively (Author response image 2).

      Author response image 2.

      These observations were also supported by Jaccard indices between the MYC and MAX ChIP-seq peaks. For this analysis, we calculated the pairwise Jaccard indices between MYC and MAX and divided them by the average Jaccard index of 2000 randomly selected DAP pairs. In K562 and H1, the Jaccard indices between MYC and MAX are 5.72x and 2.53x greater than the random background, respectively. For HepG2, the ratio was 0.21x, indicating that the HepG2 MYC ChIP-seq dataset is likely erroneous.

      Author response image 3.
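
      As an illustration only, the background-normalized Jaccard index described above can be sketched as follows. The sketch assumes that each DAP's peaks have already been reduced to a set of locus identifiers; the function names, the toy data, and the choice to sample background pairs uniformly from all DAPs are assumptions made for this example and are not taken from the study's codebase.

```python
import random


def jaccard(a, b):
    """Jaccard index of two sets: |A & B| / |A | B| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def normalized_jaccard(peaks, dap_x, dap_y, n_pairs=2000, seed=0):
    """Jaccard index of (dap_x, dap_y) divided by the mean Jaccard index
    of n_pairs randomly drawn DAP pairs, which serves as the background."""
    rng = random.Random(seed)
    names = sorted(peaks)
    background = []
    for _ in range(n_pairs):
        p, q = rng.sample(names, 2)  # two distinct DAPs
        background.append(jaccard(peaks[p], peaks[q]))
    mean_background = sum(background) / len(background)
    return jaccard(peaks[dap_x], peaks[dap_y]) / mean_background


# Hypothetical toy data: each DAP maps to the set of locus IDs its peaks overlap.
toy_peaks = {
    "MYC": {1, 2, 3, 4},
    "MAX": {2, 3, 4, 5},
    "DAP_A": {1, 6},
    "DAP_B": {5, 7, 8},
    "DAP_C": {2, 9},
}
ratio = normalized_jaccard(toy_peaks, "MYC", "MAX")
```

      A ratio above 1 means the pair co-occurs more often than a typical DAP pair, and a ratio below 1 means it co-occurs less often. The sketch does not guard against a background mean of zero, which real data with many non-overlapping DAPs might require.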

      (3) Despite numerous publications focusing on different structural domains in transcription factors, we could not find an extensive database or a survey study focusing on annotations of structural motifs in human TFs. Therefore, a survey at such a scale would be outside the scope of this study. We added only the analysis of intrinsically disordered regions, as it pertains to the condensate hypothesis. To emphasize this shortcoming, we added the following sentence to the end of the discussion section.

      “Further, one of the hallmarks of LLPS proteins that have been associated with their abilities to phase-separate is the overrepresentation of certain structural motifs, which we did not pursue due to size limitations.”

      (4, 5) We agree with these statements and thank the reviewer for pointing out this faulty statement. We modified the sections of the discussion related to condensates and removed the part implying that condensates mostly serve a single function as a TF reservoir.

      (6) We added a table to the supplemental materials (Zenodo repository) with detailed annotation of HOT and non-HOT DAP-bound loci in the genome.

      Recommendations for the authors:

      Reviewing Editor (Recommendations For The Authors):

      The clause with "inadequate" would be dropped if the authors sufficiently address reviewer concerns about clarity of writing, including:

      (1) Editing the title to better reflect the findings of the paper.

      (2) Making clear that the condensate model is speculative and not explicitly tested in this study (and may be better described as a hypothesis).

      (3) Resolving apparent contradictions regarding DNA sequence specificity and the interpretation of ChIP-seq signal intensity.

      (4) Better specifying and justifying model parameters, thresholds, and assumptions.

      (5) Shortening the manuscript to emphasize the main, well-supported claims and to enhance readability (especially the discussion section).

      We thank the Editor for their work. We followed their advice and implemented changes and additions to address all 5 points.

      Reviewer #1 (Recommendations For The Authors):

      (1) The title "Sequence characteristics and an accurate model of abundant hyperactive loci in the human genome" does not accurately reflect the findings of the paper. We are unclear as to what the 'accurate model' refers to. Is it the proposed model 'based on the existence of large transcriptional condensates' (abstract)? If so, there are concerns below regarding this statement (see comment 2). If the authors are referring to the computational modeling presented in Figure 5, it is unclear that any one of them performed that much better than the others and the best single model was not identified. Furthermore, the models being developed in the study constitute only a portion of the paper and lacked validation through additional datasets. Additionally, sequence characteristics were not a primary focus of the study. Only figure 5 talks about the model and sequence characteristics, the rest of the figures are left out of the equation.

      We agree with and thank the reviewer for this idea of clarifying the intended meaning.

      (1) We changed the title and clarified that the computational model is meant:

      “Functional characteristics and a computational model of abundant hyperactive loci in the human genome”.

      (2) We shortened the part of the manuscript discussing the computational models and identified the CNNs as “the best single model”.

      (2) The abstract and discussion (and perhaps the title) propose a model of transcriptional condensates in relation to HOT loci. However, there is no data provided in the manuscript that relates to condensates. Therefore, anything relating to condensates is primarily speculative. This distinction needs to be properly made, especially in the abstract (and cannot be included in the title). Otherwise, these statements are misleading. Although the field of transcriptional condensates is relatively new, there have been several factors studied. The authors could include in Figure 2d which factors have been shown to form transcriptional condensates. This might provide some support for the model, though it would still largely remain speculative unless further testing is done.

      We added a new short chapter “Transcriptional condensates as a model for explaining the HOT regions”, with additional analyses testing the condensates hypothesis. We provided supportive evidence by analyzing the metrics used as hallmarks of condensates including the distributions of annotated condensate-related proteins, nascent transcription, and protein-RNA interaction levels in HOT loci. Still, we acknowledge that this is a speculative hypothesis and we clarified that with the following statement in the discussions:

      “It is important to note here that our proposed condensate model is a speculative hypothesis. Further experimental studies in the field are needed to confirm or reject it.”

      (3) Several apparent contradictions exist throughout the manuscript. For example, "HOT locus formation are likely encoded in their DNA sequences" (lines 329-330) vs the proposed model of formation through condensates (abstract). These two statements do not seem compatible, or at the very least, the authors can explain how they are consistent with each other. Another example: "ChIP-seq signal intensity as a proxy for... binding affinity" (line 229) vs. "ChIP-seq signal intensities do not seem to be a function of the DNA-binding properties of the DAPs" (lines 259-260). The first statement is the assumption for subsequent analyses, which has its own concerns (see comment 4). But the conclusion from that analysis seems to contradict the assumption, at least as it is stated.

      In this study, we argue that the two statements may not necessarily contradict each other. We aimed to a) demonstrate that the observed intensity of DAP-DNA interactions as measured by ChIP-seq experiments at HOT loci cannot be explained with direct DNA-binding events of the DAPs alone and b) propose a hypothesis that this observation can be at least partially explained if the HOT loci have the propensity to either facilitate or take part in the formation of transcriptional condensates.

      One of the conditions for condensates to form at enhancers was shown to be the presence of strong binding sites of key TFs (Shrinivas et al. 2019, “Enhancer features that drive the formation of transcriptional condensates”); that study was conducted using only one TF (OCT4) and one coactivator (MED1). To the best of our knowledge, no such study has been conducted involving many TFs and cofactors simultaneously. We also know that the factors that lead to liquid-liquid phase separation include weak multivalent IDR-IDR, IDR-DNA, and IDR-RNA interactions. As a result, the observed total sum of ChIP-seq peaks in HOT loci reflects the direct DNA-binding events combined with the indirect DAP-DNA interactions, some of which may be facilitated by condensates. Moreover, the fact that CNNs can recognize the HOT loci with high accuracy suggests that there must be an underlying motif grammar specific to HOT loci.

      We emphasized this conclusion in the discussions.

      The comment on using the ChIP-seq signal as a proxy for DNA-binding affinity is addressed under comment 4.

      (4) In lines 229-230, the authors used "the ChIP-seq signal intensity as a proxy for the DAP binding affinity." What is the basis for this assumption? If there is a study that can be referenced, it should be added. However, ChIP-seq signal intensity is generally regarded as a combination of abundance, frequency, or percentage of cells with binding. RNA Pol2 is a good example of this as it has no specific binding affinity but the peak heights indicate level of expression. Therefore, the analyses and conclusions in Figure 4, particularly panel A, are problematic. In addition, clarification from lines 258-260 is needed as it contradicts the earlier premise of the section (see comment 3).

      We thank the reviewer for pointing out this error. The main conclusion of the paragraph is that the average ChIP-seq signal values at HOT loci do not correlate well with the sequence-specificity of TFs. We reworded the paragraph to state that we are analyzing the patterns of ChIP-seq signals across the HOT loci, and removed the statement that we use them as a proxy for sequence-specific binding affinity.

      (5) In Figure 1A, the authors show that "the distribution of the number of loci is not multimodal, but rather follows a uniform spectrum, and thus, this definition of HOT loci is ad-hoc" (lines 92-95). The threshold to determine how a locus is considered to be HOT is unclear. How did the authors decide to use the current threshold given the uniform spectrum observed? How does this method of calling HOT loci compare to previous studies? How much overlap is there in the HOT loci in this study versus previous ones?

      We moved the corresponding explanation from the supplemental methods to the main methods section of the manuscript.

      Briefly, our reasoning was as follows: assuming that an average TFBS is 8bp long and given that we analyze the loci of length 400bp, we can set the theoretical maximum number of simultaneous binding events to be 50. Hence, if there are >50 TF ChIP-seq peaks in a given 400bp locus, it is highly unlikely that the majority of ChIP-seq peaks can be explained by direct TF-DNA interactions. The condition of >50 TFs corresponded to the last four bins of our binning scale, which was used as an operational definition for HOT loci.
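
      The arithmetic behind this threshold can be written out as a short sketch. The locus length and the average TFBS length are the values stated above; the constant and function names are hypothetical and chosen only for this illustration.

```python
LOCUS_LENGTH_BP = 400   # length of each analyzed locus, as stated in the text
AVG_TFBS_LENGTH_BP = 8  # assumed average TFBS length, as stated in the text

# Upper bound on the number of non-overlapping binding sites that fit in one locus.
MAX_DIRECT_BINDING_EVENTS = LOCUS_LENGTH_BP // AVG_TFBS_LENGTH_BP


def is_hot(n_bound_daps, threshold=MAX_DIRECT_BINDING_EVENTS):
    """Operational definition: a locus is HOT if more DAPs are bound than
    could plausibly bind the DNA directly and simultaneously."""
    return n_bound_daps > threshold
```

      With these values the bound is 50, so a locus with 51 overlapping ChIP-seq peaks would be called HOT, while one with exactly 50 would not.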

      We have compared our definition of HOT loci to those reported in previous studies by Ramaker et al. and Boyle et al. The results of our analyses are in lines 147-154.

      (6) In Figure 3B, the authors state that of "the loop anchor regions with >3 overlapping loops, 51% contained at least one HOT locus, suggesting an interplay between chromatin loops and HOT loci." However, it is unclear how "51%" is calculated from the figure. Similarly, in the following sentence, "94% of HOT loci are located in regions with at least one chromatin interaction". It is unclear as to how the number was obtained based on the referenced figure.

      Initially, the x-axis labels in Figure 3B were missing, making it hard to understand what we meant. We added the x-axis numbers and changed “51%” to “more than half”. We intended to say that, of the loci with 4 and 5 overlapping loops, exactly 50% contain at least one HOT locus. However, since the percentage for x=6 is 100% (there is only one such locus), the overall percentage is technically “more than half”.

      The percentage of HOT loci engaging in chromatin interaction regions (91%) was calculated by simply overlapping the HOT regions with Hi-C long-range contact anchors. The details of extracting these regions using FitHiChIP are described in Supplemental Methods 1.3.
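
      A minimal sketch of such an overlap calculation is given below. It assumes that both HOT loci and contact anchors are available as half-open (chromosome, start, end) intervals; the function names and toy coordinates are hypothetical, and the anchor-calling step itself is not reproduced here.

```python
def overlaps(a, b):
    """True if two half-open intervals (chrom, start, end) share at least 1 bp."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def fraction_overlapping(loci, anchors):
    """Fraction of loci that overlap at least one anchor."""
    if not loci:
        return 0.0
    hits = sum(any(overlaps(locus, anchor) for anchor in anchors) for locus in loci)
    return hits / len(loci)


# Hypothetical toy coordinates: four 400 bp loci and two contact anchors.
hot_loci = [
    ("chr1", 100, 500),
    ("chr1", 1000, 1400),
    ("chr2", 100, 500),
    ("chr3", 0, 400),
]
anchors = [("chr1", 450, 5450), ("chr2", 0, 5000)]
fraction = fraction_overlapping(hot_loci, anchors)
```

      The pairwise comparison keeps the sketch short. For genome-scale inputs, a sorted-interval or interval-tree approach would be the more practical choice.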

      (7) While we have a limited basis to evaluate computational models, we would like to see a clearer explanation of the model set-up in terms of the number of trained vs. test datasets. In addition, it would be interesting to see if the models can be applied to data from different cell lines.

      We added the table with the sizes of the datasets used for classification in Supplemental Methods 1.6.1.

      Evaluating the models trained on the HOT loci of HepG2 and K562 on other cell lines would pose challenges, since significantly fewer ENCODE TF ChIP-seq datasets are available for other cell lines than for these two. Therefore, we conducted the proposed analysis between the studied cell lines. Specifically, we used the CNN models trained on HOT and regular enhancers of HepG2 and K562. Then, we evaluated each model on the test sets of each classification experiment (Author response image 4). We observed that the classification results of the HOT loci demonstrated a higher level of tissue-specificity compared to the same classification results of the regular enhancers.

      Author response image 4.

      (8) Lines 349-351. The significance of highly expressed genes being more prone to having multiple HOT loci, and vice versa, appears conventional and remains unclear. Intuitively, it makes sense for higher expressed genes to have more of the transcriptional machinery bound, and would bias the analysis. One way to circumvent this is to only analyze sequence-specific TFs and remove ones that are directly related to transcription machinery.

      We thank the reviewer for this suggestion. Our attempt to re-annotate the HOT loci with only sequence-specific TFs led to a significantly different set of loci, which would not be strictly comparable to the HOT loci defined by this study. Analyzing these new sets of loci would create a noticeable departure from the flow of the manuscript and further extend the already long scope of the study.

      Moreover, numerous studies have shown that super-enhancers recruit large numbers of TFs via transcriptional condensates (Boija et al., 2018; Cho et al., 2018; Sabari et al., 2018). We hope that our results can serve as data-driven supportive evidence for those studies.

      (9) Lines 393-396. We would like to see a reference to the models shown in the figures, if these models have been published previously.

      We could not understand the question. Lines 393-396 contain the following sentence:

      “However, many of the features of the loci that we’ve analyzed so far demonstrated similar patterns (GC contents, target gene expressions, ChIP-seq signal values etc.) when compared to the DAP-bound loci in HepG2 and K562, suggesting that albeit limited, the distribution of the DAPs in H1 likely reflects the true distribution of HOT loci.”

      In case the question was about the models that we trained to classify the HOT loci, we have deposited the models and codebase in Zenodo and a GitHub repository.

      (10) Values in Figure 7D are not reflected in the text. Specifically, the text states "Average ... phastCons of the developmental HOT loci are 1.3x higher than K562 and HepG2 HOT loci (Figure 7D)" (lines 408-409). Figure 7D shows conservation scores between HOT enhancers vs promoters for each cell line, and does not seem to reflect the text.

      We modified the figure to reflect the statement appropriately.

      (11) Methodology should include a justification for the use of the Mann-Whitney U-test (non-parametric) over other statistical tests.

      We added the following description to the methods section:

      “For calculating the statistical significance, we used the non-parametric Mann-Whitney U-test when the compared data points are non-linearly correlated and multi-modal. When the data distributions are bell-curve shaped, the Student’s t-test was used.”
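
      To make the non-parametric option concrete, the following is a minimal sketch of a two-sided Mann-Whitney U-test using the normal approximation. It is an illustration only, not the code used in the study: the function names are hypothetical, and the sketch applies neither a tie correction nor a continuity correction, so a statistics library would be preferable for real analyses.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """Return (U statistic of x, two-sided p-value from the normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:n1])
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # assumption: no tie correction
    z = (u1 - mean_u) / sd_u
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u1, p_two_sided
```

      Because the statistic depends only on ranks, it makes no assumption about the shape of the two distributions, which is the motivation given in the quoted methods text.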

      Minor:

      (1) Figure 2b was never mentioned in the paper. This can be added alongside Figure S6C, line 148.

      Indeed, Figure 2B was supposed to be listed together with Figure S6C, which was omitted by mistake. It was corrected.

      (2) Supplementary Figure 8 has two Cs. Needs to be corrected to D.

      Fixed.

      (3) Figure 3B is missing labels on the x-axis.

      Fixed.

      (4) The horizontal bar graph on the bottom left of Figure 1E needs to be described in the figure legend.

      Description added to the figure caption.

      (5) Line 345, Fig 15A should be Fig S15A.

      Corrected.

      Reviewer #2 (Recommendations For The Authors):

      I listed all my concerns about the paper in the public comments. I think the manuscript is very comprehensive and it is valuable, but it should be cut short and presented in a more digestible way.

      We thank the reviewer for their valuable comments and suggestions. We addressed all the concerns listed in the public comments. We shortened the manuscript by condensing the section on computational classification models and cutting the discussion to about half its original length.

      Line 55: What are chromatin-associated proteins, i.e. are they histone modifications?

      To clarify the definition used from the citation we changed the sentence to the following:

      “For instance, Partridge et al. studied the HOT loci in the context of 208 proteins including TFs, cofactors, and chromatin regulators which they called chromatin-associated proteins.”

      Though most of the paper can be cut short to avoid analysis paralysis for readers, there are details that still need filling in. For example, how did the authors perform PCA analysis, i.e. what are the features of each data point in the PCA analysis? Lines 214-215: How do we calculate the number of multi-way contacts in Hi-C data?

      We added clarifying descriptions and changed the mentioned sentences to the following:

      PCA:

      “To analyze the signatures of unique DAPs in HOT loci, we performed a PCA analysis where each HOT locus is represented by a binary (presence/absence) vector of length equal to the total number of DAPs analyzed.”
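
      The encoding described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function names and the toy DAP sets are hypothetical, and only the first principal component is computed, using power iteration on the covariance matrix instead of a full decomposition.

```python
import math

def encode_loci(loci_to_daps, all_daps):
    """One binary presence/absence row per locus; columns follow all_daps."""
    return [[1.0 if dap in bound else 0.0 for dap in all_daps]
            for bound in loci_to_daps]

def first_principal_component(rows, n_iter=200):
    """Return (unit-length loading vector, per-locus scores) for the first PC."""
    n, d = len(rows), len(rows[0])
    if n < 2:
        raise ValueError("need at least two loci")
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Arbitrary non-uniform start vector, normalized to unit length.
    v = [float(j + 1) for j in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:  # no variance in the data
            break
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores
```

      Each locus becomes one row whose length equals the number of DAPs analyzed, so loci bound by similar sets of DAPs receive similar scores along the component.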

      Multi-way contacts on loop anchors:

      “To investigate further, we analyzed the loop anchor regions harboring HOT loci and observed that the number of multi-way contacts on loop anchors (i.e. loci which serve as anchors to multiple loops) correlates with the number of bound DAPs (rho=0.84 p-value<10E-4; Pearson correlation).”

      - Lines 251-252: How did the referenced study categorize DAPs? It is important for any manuscript to be self-contained.

      We added the explanation and changed the sentence to the following:

      “To test this hypothesis, we classified the DAPs into those two categories using the definitions provided in the study (Lambert et al. 2018) 28, where the TFs are classified by manual curation through extensive literature review and supported by annotations such as the presence of DNA-binding domains and validated binding motifs. Based on this classification, we categorized the ChIP-seq signal values into these two groups.”

      - Lines 181-185, sentences starting with 'To test' can be moved to the methods, leaving only brief mentions of the statistic tests if needed.

      We removed the mentioned sentence from the main text and moved it to the supplemental methods (1.4).

      - Lines 217-220: I find this sentence extremely redundant unless it can offer more specific insights about a particular set of DAPs or if the DAPs are closer/or a proven distal enhancer to a confirmed causal gene.

      We removed the mentioned sentence from the text.

      - Lines 243-246: How did the authors determine the set DAPs that have stabilizing effects, and how exactly are the 'stabilizing effects' observed/measured?

      We added explanations to Supplemental Methods 3.1 and Fig S18, S19.

      While addressing this comment we realized that the reported value of the ratio is 1.91x, not 1.7x. We corrected that value in the main text and added the p-value.

      - When discussing the phastCons scores analyses, such as in lines 268-271, how did the authors calculate the relationship between phastCons scores and HOT loci, i.e. was the score averaged across the 400-bp locus to obtain a locus-specific conservation score?

      Yes, per-locus conservation scores were averaged over the bps of loci. We added this clarification to the methods.
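
      As a minimal sketch of this averaging, the function below takes the per-base scores of one locus and returns their mean. The function name is hypothetical, and skipping bases without a score (represented as None) is an assumption made for illustration, not a detail taken from the methods.

```python
def mean_locus_conservation(per_base_scores):
    """Average per-base conservation scores over one locus.

    Bases with no score (None) are skipped; returns None if no base is scored.
    """
    observed = [s for s in per_base_scores if s is not None]
    if not observed:
        return None
    return sum(observed) / len(observed)
```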

      - Line 311: What is the role of the 'control sets' in the analyses of the sequence's relationship with HOT?

      In this specific case, the control sets are used as background or negative sets to set up the classification tasks. In other words, we are asking, whether the HOT loci can be distinguished when compared to random chromatin-accessible regions, promoters, or regular enhancers. We clarified this in the text.

      - I also find the discussion about different machine learning methods that classify HOT loci based on sequence contexts quite redundant UNLESS the authors decide to go further into the features' importance (such as motifs) in the models that predict/ are associated with HOT loci, which in itself can constitute another study.

      We agree with the reviewer. We shortened the discussion of the models by limiting it to the 3 main models and moved the rest to the supplemental materials.

      - Can the authors clarify where they obtain data on super-enhancers?

      We obtained the super-enhancer definitions from the original study (Hnisz et al. 2013, PMID: 24119843) where the super-enhancers were defined for multiple cell lines. We clarified this in the methods.

      - Figure 1B, the x and y axis should be clarified.

      We clarified it by using MAX as an example case in the figure caption as follows:

      “Prevalence of DAPs in HOT loci. Each dot represents a DAP. X-axis: percentage of HOT loci in which DAP is present (e.g. MAX is present in 80% of HOT loci). Y-axis: percentage of total peaks of DAPs that are located in HOT loci (e.g. 45% of all the ChIP-seq peaks of MAX is located in the HOT loci). Dot color and size are proportional to the total number of ChIP-seq peaks of DAP.”
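
      The two axis values in this caption can be computed per DAP as in the sketch below. The function and argument names are hypothetical, and the counts of peaks inside HOT loci are assumed to be supplied by an upstream overlap step that is not shown.

```python
def dap_prevalence(hot_loci, peaks_in_hot, total_peaks):
    """Return {DAP: (x_percent, y_percent)} for a Figure 1B style scatter plot.

    hot_loci:     list of sets, the DAPs present at each HOT locus
    peaks_in_hot: dict DAP -> number of its ChIP-seq peaks inside HOT loci
    total_peaks:  dict DAP -> total number of its ChIP-seq peaks
    """
    n_hot = len(hot_loci)
    result = {}
    for dap, n_total in total_peaks.items():
        n_present = sum(1 for bound in hot_loci if dap in bound)
        x_percent = 100.0 * n_present / n_hot                   # X-axis
        y_percent = 100.0 * peaks_in_hot.get(dap, 0) / n_total  # Y-axis
        result[dap] = (x_percent, y_percent)
    return result
```

      With 10 HOT loci of which 8 contain MAX, and 9 of 20 MAX peaks falling in HOT loci, the function returns (80.0, 45.0) for MAX, matching the example values in the caption.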

      Reviewer #3 (Recommendations For The Authors):

      The list of proteins associated with different types of genomic loci at a meta level (enhancers, promoters, and gene body etc.), and an annotation of the genome at the specific loci level.

      The authors use a wide range of acronyms throughout the text and figure legends, they do a reasonably good job, but the main text section "HOT-loci are enriched in causal variants" and Figure 8 would be materially improved if they held it to the same standard.

      Size is a physical property and not a physicochemical property.

      We thank the reviewer for their comments and suggestions. We added a table to supplemental files with detailed annotations of analyzed loci.

      We reviewed the section “HOT loci are enriched in causal variants” and corrected a few mismatches in the acronyms.

    1. Author Response

      We thank you for the time you took to review our work and for your feedback!

      The major changes to the manuscript are:

      1. We have extended the range of locomotion velocity over which we compare its dependence with cholinergic activity in Figures 2E and S2H.

      2. We have quantified the contributions of cholinergic stimulation on multiplicative and additive gains on visual responses (Figure S7).

      3. We have provided single cell examples for the change in latency to visual response (Figure S12).

      4. We have added an analysis to compare layer 2/3 and layer 5 locomotion onset responses as a function of visuomotor condition (Figure S8).

      A detailed point-by-point response to all reviewer concerns is provided below.  

      Reviewer #1 (Public Review):

      The paper submitted by Yogesh and Keller explores the role of cholinergic input from the basal forebrain (BF) in the mouse primary visual cortex (V1). The study aims to understand the signals conveyed by BF cholinergic axons in the visual cortex, their impact on neurons in different cortical layers, and their computational significance in cortical visual processing. The authors employed two-photon calcium imaging to directly monitor cholinergic input from BF axons expressing GCaMP6 in mice running through a virtual corridor, revealing a strong correlation between BF axonal activity and locomotion. This persistent activation during locomotion suggests that BF input provides a binary locomotion state signal. To elucidate the impact of cholinergic input on cortical activity, the authors conducted optogenetic and chemogenetic manipulations, with a specific focus on L2/3 and L5 neurons. They found that cholinergic input modulates the responses of L5 neurons to visual stimuli and visuomotor mismatch, while not significantly affecting L2/3 neurons. Moreover, the study demonstrates that BF cholinergic input leads to decorrelation in the activity patterns of L2/3 and L5 neurons.

      This topic has garnered significant attention in the field, drawing the interest of many researchers actively investigating the role of BF cholinergic input in cortical activity and sensory processing. The experiments and analyses were thoughtfully designed and conducted with rigorous standards, leading to convincing results which align well with findings in previous studies. In other words, some of the main findings, such as the correlation between cholinergic input and locomotor activity and the effects of cholinergic input on V1 cortical activity, have been previously demonstrated by other labs (Goard and Dan, 2009; Pinto et al., 2013; Reimer et al., 2016). However, the study by Yogesh and Keller stands out by combining cutting-edge calcium imaging and optogenetics to provide compelling evidence of layer-specific differences in the impact of cholinergic input on neuronal responses to bottom-up (visual stimuli) and top-down inputs (visuomotor mismatch).

      We thank the reviewer for their feedback.

      Reviewer #2 (Public Review):

      The manuscript investigates the function of basal forebrain cholinergic axons in mouse primary visual cortex (V1) during locomotion using two-photon calcium imaging in head-fixed mice. Cholinergic modulation has previously been proposed to mediate the effects of locomotion on V1 responses. The manuscript concludes that the activity of basal forebrain cholinergic axons in visual cortex provides a signal which is more correlated with binary locomotion state than locomotion velocity of the animal. Cholinergic axons did not seem to respond to grating stimuli or visuomotor prediction error. Optogenetic stimulation of these axons increased the amplitude of responses to visual stimuli and decreased the response latency of layer 5 excitatory neurons, but not layer 2/3 neurons. Moreover, optogenetic or chemogenetic stimulation of cholinergic inputs reduced pairwise correlation of neuronal responses. These results provide insight into the role of cholinergic modulation to visual cortex and demonstrate that it affects different layers of visual cortex in a distinct manner. The experiments are well executed and the data appear to be of high quality. However, further analyses are required to fully support several of the study's conclusions.

      We thank the reviewer for their feedback.

      1) In experiments analysing the activity of V1 neurons, GCaMP6f was expressed using a ubiquitous Ef1a promoter, which is active in all neuronal cell types as well as potentially non-neuronal cells. The manuscript specifically refers to responses of excitatory neurons but it is unclear how excitatory neuron somata were identified and distinguished from that of inhibitory neurons or other cell types.

      This might be a misunderstanding. The Ef1α promoter has been reported to drive highly specific expression in neurons (Tsuchiya et al., 2002) with 99.7% of labeled cells in layer 2/3 of rat cortex being NeuN+ (a neuronal marker), and only 0.3% of labeled cells being GFAP+ (a glial marker) (Yaguchi et al., 2013). This bias was even stronger in layer 5 with 100% of labeled cells being NeuN+ and none GFAP+ (Yaguchi et al., 2013). The Ef1α promoter in an AAV vector, as we use it here, also biases expression to excitatory neurons. In layer 2/3 of mouse visual cortex, we have found that 96.8% ± 0.7% of labeled neurons are excitatory three weeks after viral injection (Attinger et al., 2017). Similar results have also been found in rats (Yaguchi et al., 2013), where, when GFP was expressed under the Ef1α promoter delivered by lentivirus, 95.2% of labeled neurons in layer 2/3 and 94.1% in layer 5 were excitatory. These numbers are comparable to the ones obtained with promoters commonly used to target expression to excitatory neurons. For this purpose, two variants of promoters based on the transcription start region of the CaMKIIα gene are typically used. The first, the CaMKIIα-0.4 promoter, results in 95% excitatory specificity (Scheyltjens et al., 2015). The second, the CaMKIIα-1.3 promoter, results in only 82% excitatory specificity (Scheyltjens et al., 2015), and is thus not far from chance. We have clarified this in the manuscript. Nevertheless, we have removed the qualifier “excitatory” when talking about neurons in most instances, throughout the manuscript.

      2) The manuscript concludes that cholinergic axons convey a binary locomotion signal and are not tuned to running speed. The average running velocity of mice in this study is very slow - slower than 15 cm/s in the example trace in Figure 1D and speeds <6 cm/s were quantified in Figure 2E. However, mice can run at much faster speeds both under head-fixed and freely moving conditions (see e.g. Jordan and Keller, 2020, where example running speeds are ~35 cm/s). Given that the data in the present manuscript cover such a narrow range of running speeds, it is not possible to determine whether cholinergic axons are tuned to running speed or convey a binary locomotion signal.

      Our previous analysis window of 0-6.25 cm/s covered approximately 80% of all data. We have increased the analysis window to 0-35 cm/s that now covers more than 99% of the data (see below). Also, note that very high running speeds are probably overrepresented in the Jordan and Keller 2020 paper as mice had to be trained to run reliably before all experiments given the relatively short holding times of the intracellular recordings. The running speeds in our current dataset are comparable to other datasets we have acquired in similar experiments.

      Figure 2E has now been updated to reflect the larger range of data. Please note, as the number of mice that contribute to the data now differs as a function of velocity (some mice run faster than others), we have now switched to a variant of the plot based on hierarchical bootstrap sampling (see Methods). This does not overtly change the appearance of the plot. See Author response image 1 for a comparison of the original plot, the extended range without bootstrap sampling, and the extended range with bootstrap sampling currently used in the paper.

      Author response image 1.

      Average activity of cholinergic axons as a function of locomotion velocity. (A) As in the previous version of the manuscript. (B) As in A, but with the extended velocity range. (C) As in B, but using hierarchical bootstrap sampling to estimate median (red dots) and 95% confidence interval (shading) for each velocity bin.
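
      For readers unfamiliar with the approach in panel C, the following is a minimal sketch of a two-level hierarchical bootstrap for one velocity bin. It is not the implementation described in the Methods: the function name, the choice of levels (mice, then observations within each mouse), and the percentile-based interval are assumptions made for illustration.

```python
import random
import statistics

def hierarchical_bootstrap_median(data_by_mouse, n_boot=1000, seed=0):
    """Bootstrap the median of one velocity bin, resampling mice and then data.

    data_by_mouse: dict mouse_id -> list of activity values in this bin
    Returns (median of bootstrap medians, 2.5th percentile, 97.5th percentile).
    """
    rng = random.Random(seed)
    # Only mice that contributed data to this bin can be resampled.
    mice = [m for m, values in data_by_mouse.items() if values]
    boot_medians = []
    for _ in range(n_boot):
        sample = []
        for mouse in rng.choices(mice, k=len(mice)):  # level 1: mice
            values = data_by_mouse[mouse]
            sample.extend(rng.choices(values, k=len(values)))  # level 2: data
        boot_medians.append(statistics.median(sample))
    boot_medians.sort()
    lower = boot_medians[int(0.025 * (n_boot - 1))]
    upper = boot_medians[int(0.975 * (n_boot - 1))]
    return statistics.median(boot_medians), lower, upper
```

      Resampling mice first means that a bin to which only a few fast-running mice contribute yields a wider interval, which is why this variant suits data where the number of contributing mice differs across velocity bins.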

      3) The analyses in Figure 4 only consider the average response to all grating orientations and directions. Without further analysing responses to individual grating directions it is unclear how stimulation of cholinergic inputs affects visual responses. Previous work (e.g. Dadarlat and Stryker, 2017) has shown that locomotion can have both additive and multiplicative effects and it would be valuable to determine the type of modulation provided by cholinergic stimulation.

      We thank the reviewer for this suggestion. To address this, we quantified how cholinergic stimulation influenced the orientation tuning of V1 neurons. The stimuli we used were full field sinusoidal drifting gratings of 4 different orientations (2 directions each). For each neuron, we identified the preferred orientation and plotted responses relative to this preferred orientation as a function of whether the mouse was running, or we were stimulating cholinergic axons. Consistent with previous work, we found a mixture of multiplicative and additive components during running. With cholinergic axon stimulation, the multiplicative effect was stronger than the additive effect. This is now quantified in Figure S7.
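
      One common way to separate the two components, shown here as a minimal sketch and not as the quantification used in Figure S7, is to fit the modulated tuning curve as a linear function of the baseline tuning curve: the slope estimates the multiplicative gain and the intercept the additive offset. The function name and the ordinary least-squares fit are assumptions made for illustration.

```python
def fit_gain_and_offset(baseline, modulated):
    """Least-squares fit of modulated = gain * baseline + offset.

    baseline, modulated: responses to the same stimuli, ordered relative to
    the preferred orientation, without and with modulation.
    Returns (multiplicative gain, additive offset).
    """
    n = len(baseline)
    if n != len(modulated) or n < 2:
        raise ValueError("need two equally long curves with at least 2 points")
    mean_x = sum(baseline) / n
    mean_y = sum(modulated) / n
    sxx = sum((x - mean_x) ** 2 for x in baseline)
    if sxx == 0:
        raise ValueError("baseline tuning curve is flat; gain is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(baseline, modulated))
    gain = sxy / sxx
    offset = mean_y - gain * mean_x
    return gain, offset
```

      A purely multiplicative effect yields a gain above 1 with an offset near 0, whereas a purely additive effect yields a gain near 1 with a positive offset.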

      4) The difference between the effects of locomotion and optogenetic stimulation of cholinergic axons in Figure 5 may be confounded by differences in the visual stimulus. These experiments are carried out under open-loop conditions, where mice may adapt their locomotion based on the speed of the visual stimulus. Consequently, locomotion onsets are likely to occur during periods of higher visual flow. Since optogenetic stimulation is presented randomly, it is likely to occur during periods of lower visual flow speed. Consequently, the difference between the effect of locomotion and optogenetic stimulation may be explained by differences in visual flow speed and it is important to exclude this possibility.

      We find that in general locomotion is unaffected by visual flow in open loop conditions in this type of experiment (in this particular dataset, there was a small negative correlation between locomotion and visual flow in the open loop condition, Author response image 2).

      Author response image 2.

      Correlation between visual flow and locomotion in open loop conditions. Average correlation of locomotion velocity and visual flow speed in open loop for all mice in Figure 5. Each dot is an imaging site. In the open loop, the correlation between locomotion and visual flow speed is close to zero, but significantly negative in this dataset.

      However, to directly address the concern that our results are influenced by visual flow, we can restrict our analysis only to locomotion onsets that occurred in the absence of visual flow (Author response image 3A and 3B). These responses are not substantially different from those when including all data (Figures 5A and 5B). Thus, the difference between the effect of locomotion and optogenetic stimulation cannot be explained by differences in visual flow speed.

      Author response image 3.

      Open loop locomotion onset responses without visual flow. (A) Average calcium response of layer 2/3 neurons in visual cortex to locomotion onset in open loop in the absence of visual flow. Shading indicates SEM. (B) As in A, but for layer 5 neurons.

      5) It is unclear why chemogenetic manipulations of cholinergic inputs had no effect on pairwise correlations of L2/3 neuronal responses while optogenetic stimulation did.

      This is correct – we do not know why that is the case and can only speculate. There are at least two possible explanations for this difference:

      1) Local vs. systemic. The optogenetic manipulation is relatively local, while the chemogenetic manipulation is systemic. It is not clear how cholinergic release in other brain regions influences the correlation structure in visual cortex. It is conceivable that a cortex-wide change in cholinergic release results in a categorically different state with a specific correlation structure in layer 2/3 neurons different from the one induced by the more local optogenetic manipulation.

      2) Layer-specificity of activation. Cholinergic projections to visual cortex arrive both in superficial and deep layers. We activate the axons in visual cortex optogenetically by illuminating the cortical surface. Thus, in our optogenetic experiments, we are primarily activating the axons arriving superficially, while in the chemogenetic experiment, we are likely influencing superficial and deep axons similarly. Thus, we might expect the optogenetic activation to be biased toward influencing superficial layers more strongly than the chemogenetic activation does.

      6) The effects of locomotion and optogenetic stimulation on the latency of L5 responses in Figure 7 are very large - ~100 ms. Indeed, typical latencies in mouse V1 measured using electrophysiology are themselves shorter than 100 ms (see e.g. Durand et al., 2016). Visual response latencies in stationary conditions or without optogenetic stimulation appear surprisingly long - much longer than reported in previous studies even under anaesthesia. Such large and surprising results require careful analysis to ensure they are not confounded by artefacts. However, as in Figure 4, this analysis is based only on average responses across all gratings and no individual examples are shown.

      This is correct and we speculate this is the consequence of a combination of different reasons.

      1) Calcium imaging is inherently slower than electrophysiological recordings. While measuring spiking responses using electrophysiology, response latencies on the order of 100 ms have indeed been reported, as the reviewer points out. Using calcium imaging, these latencies are typically 4 times longer (Kuznetsova et al., 2021). This is likely a combination of a) calcium signals that are slower than electrical changes, b) delays in the calcium sensor itself, and c) temporal sampling used for imaging that is about 3 orders of magnitude slower than what is typically used for electrophysiology.

      2) Different neurons included in analysis. The calcium imaging likely has very different biases than electrophysiological recordings. Historically, the fraction of visually responsive neurons in visual cortex based on extracellular electrophysiological recordings has been systematically overestimated (Olshausen and Field, 2005). One key contributor to this is the fact that recordings are biased to visually responsive neurons. The criteria for inclusion of “responsive neurons” strongly influence the “average” response latency. In addition, calcium imaging has biases that relate to the vertical position of the somata in cortex. Both layer 2/3 and layer 5 recordings are likely biased to superficial layer 2/3 and superficial layer 5 neurons. Conversely, electrical recordings are likely biased to layer 4 and layer 5 neurons. Thus, comparisons at this level of resolution between data obtained with these two methods are difficult to make.

      We have added example neurons as Figure S12, as suggested.  

      Reviewer #1 (Recommendations For The Authors):

      While the study showcases valuable insights, I have a couple of concerns regarding the novelty of their research and the interpretation of results. By addressing these concerns, the authors can clarify the positioning of their research and strengthen the significance of their findings.

      (Major comments)

      1) Page 1, Line 21: The authors claim, "Our results suggest that acetylcholine augments the responsiveness of layer 5 neurons to inputs from outside of the local network, enabling faster switching between internal representations during locomotion." However, it is not clear which specific data or results support the claim of "switching between internal representations." Overall, their study primarily presents responses averaged across all neurons imaged, lacking a detailed exploration of individual neuron response patterns. Population analysis, such as PCA and decoding, can be used to assess the encoding of each stimulus by V1 neurons - "internal representation."

      To strengthen their claim regarding "switching between internal representations," the authors could consider an experiment measuring the speed at which the population activity pattern A transitions to the population activity pattern B when the visual stimulus switches from A to B. Such experiments would significantly enhance the impact of their study, providing a clearer understanding of how BF cholinergic input influences the dynamic representation of stimuli during locomotion.

      We thank the reviewer for bringing this up. That acetylcholine enables a faster switching between internal representations in layer 5 is a speculation. We have attempted to make this clearer in the discussion. Our speculation is based on the finding that the population response in layer 5 to sensory input is faster under high levels of acetylcholine (Figures 4D and 7B). In line with the reviewer’s intuition, the neuronal response to a change in visual stimulus, in our experiment from a uniform grey visual stimulus to a sinusoidal grating stimulus, is indeed faster. Based on evidence in favor of layer 5 encoding internal representation (Heindorf and Keller, 2023; Keller and Mrsic-Flogel, 2018; Suzuki and Larkum, 2020), we interpret the decrease in latency of the population response as a faster change in internal representation. We are not sure a decoding analysis would add much to this, given that a trivial decoder simply based on mean population response would already find a faster transition. We have expanded on our explanation of these points in the manuscript.

      2) Page 4, Line 103: "..., a direct measurement of the activity of cholinergic projection from basal forebrain to the visual cortex during locomotion has not been made." This statement is incorrect. An earlier study by Reimer et al. indeed imaged cholinergic axons in the visual cortex of mice running on a wheel. They found that "After walking onset, ... ACh activation, and a large pupil diameter, were sustained throughout the walking period in both cortical areas V1 and A1." Their findings are very similar to the results presented by Yogesh and Keller - that is, BF cholinergic axons exhibited locomotion state-dependent activity. The authors should clarify the positioning of this study relative to previous studies.

      Reimer, J., McGinley, M., Liu, Y. et al. Pupil fluctuations track rapid changes in adrenergic and cholinergic activity in cortex. Nat Commun 7, 13289 (2016). https://doi.org/10.1038/ncomms13289

      We have clarified this as suggested. However, we disagree slightly with the reviewer here. The key question is whether the cholinergic axons imaged originate in basal forebrain. While Reimer et al. 2016 did set out to do this, we believe a number of methodological considerations prevent this conclusion:

      1) In their analysis, Reimer et al. 2016 combine data from mice with cholinergic axons labeled either by viral injection to basal forebrain or by a germline cross of ChAT-cre mice with a reporter line. Unfortunately, it is unclear what the exact number of mice labeled with either strategy was. Based on the information in the paper, we can conclude that of the 6 mice used for experiments between 2 and 5 were germline crosses. The problem with germline labeling of ChAT positive neurons is that when using a cross, VIP-ChAT+ neurons in cortex are also labeled. Reimer et al. 2016 find an anticipatory increase in activity on locomotion onset, which is also seen by Larsen et al. 2018 (who use a germline cross strategy) but which we do not see in our data. Based on this, we speculate that a significant part of the signals reported in the Reimer et al. 2016 paper are from local VIP-ChAT+ neurons.

      2) In their analysis, Reimer et al. 2016 also combine all imaging data obtained from both primary auditory cortex and primary visual cortex. Given the heterogeneity in the basal forebrain cholinergic neuronal population and their projection selectivity, to better understand these signals, it’s important to acquire the signals from cholinergic axons selectively in specific cortical regions, which we do in visual cortex. Based on the information provided in their paper, we were unfortunately not able to discern the injection location for their viral labeling strategy. Given the topographic selectivity in projection from basal forebrain, this could give hints as to the relative contribution of cholinergic projections to A1 vs V1 in their data. The injection coordinates given in the methods of the Reimer paper, of 4 mm lateral and 0.5 mm posterior to bregma to target basal forebrain, are likely wrong (they fall outside the head of the mouse).

      Given the heterogeneity in the basal forebrain cholinergic neuronal population and their projection selectivity, to better understand these signals, it’s important to acquire the signals from cholinergic axons both selectively in a cortical region, as we do in visual cortex, and purely originating from basal forebrain. Collins et al. 2023 inject more laterally and thus characterize cholinergic input to S1 and A1, while Lohani et al. 2022 use GRAB sensors which complement our findings. Please note, we don’t think there is any substantial disagreement in the results of previous studies and ours, with very few exceptions, like the anticipatory increase in cholinergic activity that precedes locomotion onset in the Reimer et al. 2016 data, but not in ours. This is a rather critical point in the context of the literature of motor-related neuronal activity in mouse V1. Based on early work on the topic, it is frequently assumed that motor-related activity in V1 is driven by a cholinergic input. This is very likely incorrect given our results, hence we feel it is important to highlight this methodological caveat of earlier work.

      3) Fig. 4H: The authors found that L5 neurons exhibit positive responses at the onset of locomotion in a closed-loop configuration. Moreover, these responses are further enhanced by photostimulation of BF axons.

      In a previous study from the same authors' group (Heindorf and Keller, 2023), they reported 'negative' responses in L5a IT neurons during closed-loop locomotion. This raises a question about the potential influence of different L5 neuron types on the observed results between the two studies. Do the authors think that the involvement of the other neuronal type in L5, the PT neurons, might explain the positive responses seen in the present study? Discussing this point in the paper would provide valuable insights into the underlying mechanisms.

      Yes, we do think the positive response observed on locomotion onset in closed loop is due to non-Tlx3+ neurons. Given that Tlx3-cre only labels a subset of inter-telencephalic (IT) neurons (Gerfen et al., 2013; Heindorf and Keller, 2023), it’s not clear whether the positive response is explained by the pyramidal tract (PT) neurons, or the non-Tlx3+ IT neurons. Dissecting the response profiles of different subsets of layer 5 neurons is an active area of research in the lab and we hope to be able to answer these points more comprehensively in future publications. We have expanded on this in the discussion as suggested.

      Furthermore, it would be valuable to investigate whether the effects of photostimulation of BF axons vary depending on neuronal responsiveness. This could help elucidate how neurons with positive responses, potentially putative PT neurons, differ from neurons with negative responses, putative IT neurons, in their response to BF axon photostimulation during locomotion.

      We have attempted an analysis of the form suggested. In short, we found no relationship between a neuron’s response to optogenetic stimulation of ChAT axons and its response to locomotion onset, or its mean activity. Based on their response to locomotion onset in closed loop, we split layer 5 neurons into three groups, 30% most strongly decreasing (putative Tlx3+), 30% most strongly increasing, and the rest. We did not see a response to optogenetic stimulation of basal forebrain cholinergic axons in any of the three groups (Author response image 4A). We also found no obvious relationship between the mean activity of neurons and their response to optogenetic stimulation (Author response image 4B).
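      The three-way split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code behind Author response image 4: the function name `split_by_onset_response`, the `percent` parameter, and the example response values are all hypothetical.

```python
def split_by_onset_response(responses, percent=30):
    """Split neuron indices into (decreasing, increasing, rest).

    `responses` holds one locomotion-onset response per neuron (hypothetical
    input format). The `percent` most negative responses form the first group,
    the `percent` most positive form the second, and the remainder the third.
    """
    # Rank neuron indices from most negative to most positive response.
    order = sorted(range(len(responses)), key=lambda i: responses[i])
    # Integer arithmetic avoids floating-point rounding in the group size.
    k = len(responses) * percent // 100
    decreasing = order[:k]
    increasing = order[len(order) - k:]
    rest = order[k:len(order) - k]
    return decreasing, increasing, rest


# Made-up responses for ten neurons, for illustration only.
example = [-0.5, 0.2, 0.9, -0.1, 0.0, 0.4, -0.8, 0.6, 0.1, -0.3]
decreasing, increasing, rest = split_by_onset_response(example)
```

      Each group could then be averaged separately for its response to optogenetic stimulation, which is the comparison shown in panel A.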

      Author response image 4.

      Neither putative layer 5 cell types nor neuronal responsiveness correlates with the response to optogenetic stimulation of cholinergic axons. (A) Average calcium response of layer 5 neurons split into putative Tlx3 (closed loop locomotion onset suppressed) and non-Tlx3 like (closed loop locomotion onset activated) to optogenetic stimulation of cholinergic axons. (B) Average calcium response of layer 5 neurons to optogenetic stimulation of cholinergic axons as a function of their mean response throughout the experimental session. Left: Each dot is a neuron. Right: Average correlation in the response of layer 5 to optogenetic stimulation and mean activity over all neurons per imaging site. Each dot is an imaging site.

      (Minor comments)

      1) It is unclear which BF subregion(s) were targeted in this study.

      Thanks for pointing this out. We targeted the entire basal forebrain (medial septum, vertical and horizontal limbs of the diagonal band, and nucleus basalis) with our viral injections. All our axonal imaging data comes from visual cortex and given the sensory modality-selectivity of cholinergic projections to cortex, the labeled axons originate from medial septum and the diagonal bands (Kim et al., 2016). We have now added the labels for basal forebrain subregions targeted next to the injection coordinates in the manuscript.

      2) Page 43, Line 818: The journal name of the cited paper Collins et al. is missing.

      Fixed.

      3) In the optogenetic experiments, how long is the inter-trial interval? Stimulation of BF is known to have long-lasting effects on cortical activity and plasticity. It is, therefore, important to have a sufficient interval between trials.

      The median inter-trial intervals for the different stimulation events are as follows:

      • Optogenetic stimulation only: 15 s

      • Optogenetic stimulation + grating: 12 s

      • Optogenetic stimulation + mismatch: 35 s

      • Optogenetic stimulation + locomotion onset: 45 s

      We have added this information to the methods in the manuscript.

      Assuming locomotion is the primary driver of acetylcholine release (as we argue in Figures 1 and 2), the frequency of stimulation roughly corresponds to the frequency of acetylcholine release experienced endogenously. It is of course possible that being awake and mobile puts the entire system in a long-lasting acetylcholine-driven state different from what would be observed during long-term quiet wakefulness or during sleep. But the main focus of the optogenetic stimulation experiments we performed was to investigate the consequences of the rapid acetylcholine release driven by locomotion.

      4) Page 11, Line 313: "..., we cannot exclude the possibility of a systemic contribution to the effects we observe through shared projections between different cortical and subcortical target." This possibility can be tested by examining the effect of optogenetic stimulation of cholinergic axons on locomotor activity, as they did for the chemogenetic experiments (Fig. S7). If the optogenetic manipulation changes locomotor activity, it is likely that this manipulation has some impact on subcortical activity and systemic contribution to the changes in cortical responses observed.

      Based on the reviewer’s suggestion, we tested this and found no change in the locomotor activity of the mice on optogenetic stimulation of cholinergic axons locally in visual cortex (we have added this as Figure S5 to the manuscript). Please note, however, that we of course cannot exclude a systemic contribution based on this.

      5) Fig. 4 and 5: In a closed-loop configuration, L2/3 neurons exhibit a transient increase in response at the onset of locomotion, while in an open-loop configuration, their response is more prolonged. On the other hand, L5 neurons show a sustained response in both configurations. Do the authors have any speculation on this difference?

      This is correct. Locomotion onset responses in layer 2/3 are strongly modulated by whether the locomotion onset occurs in closed loop or open loop configurations (Widmer et al., 2022). This difference is absent in our layer 5 data here. We suspect this is a function of a differential within-layer cell type bias in the different recordings. In the layer 2/3 recordings we are likely biased strongly towards superficial L2/3 neurons that tend to be negative prediction error neurons (top-down excited and bottom-up inhibited), see e.g. (O’Toole et al., 2023). A reduction of locomotion onset responses in closed loop is what one would expect for negative prediction error neurons. While layer 5 neurons exhibit mismatch responses, they do not exhibit opposing top-down and bottom-up input that would result in such a suppression (Jordan and Keller, 2020).

      We can illustrate this by splitting all layer 2/3 neurons based on their response to gratings and to visuomotor mismatch into a positive prediction error (PE) type (top 30% positive grating response), a negative prediction error type (top 30% positive visuomotor mismatch response), and the rest (remaining neurons and neurons responsive to both grating and visuomotor mismatch). Plotting the response of these neurons to locomotion onset in closed loop and open loop, we find that negative PE neurons have a transient response to locomotion onset in closed loop while positive PE neurons have a sustained increase in response in closed loop. In open loop the response of the two populations is indistinguishable. Splitting the layer 5 neurons using the same criteria, we don’t find a striking difference between closed and open loop between the two groups of neurons. We have added this as Figure S8.

      Reviewer #2 (Recommendations For The Authors):

      Major concerns:

      1) As a ubiquitous promoter was used to drive GCaMP expression, please explain how excitatory neurons were identified.

      2) As the data cover a very small range of running speeds, it is important to confirm that the binary locomotion signal model still applies when mice run at higher speeds - either by selecting recordings where mice have a wider range of running speeds or conducting additional experiments. In addition, please show the running speed tuning of individual axons.

      3) Please provide a more detailed analysis of the effects of locomotion and cholinergic modulation on visual responses. How does cholinergic modulation affect orientation and direction tuning? Are the effects multiplicative or additive? How does this compare to the effects of locomotion on single neurons?

      4) To ensure that the analyses in Figure 5 are not confounded by differences in the visual stimulus, please include average visual flow speed traces for each condition.

      5) Please clarify why chemogenetic manipulations of cholinergic inputs had no effect on pairwise correlations in L2/3.

      6) The latency effect is quite an extraordinary claim and requires careful analysis. Please provide examples of single neurons illustrating the latency effect - including responses across individual grating orientations/directions. One possible confound is that grating presentation could itself trigger locomotion or other movements. In the stationary / noOpto conditions, the grating response might not be apparent in the average trace until the animal begins to move. Thus the large latency in the stationary / noOpto conditions may reflect movement-related rather than visual responses.

      Please see our responses to these points in the public review part above.

      There are some minor points where text and figures could be improved:

      1) When discussing the decorrelation of neuronal responses by cholinergic axon activation, it is important to make it clear that Figure 6D quantifies the responses of layer 5 apical dendrites rather than neurons.

      We have added this information to the results section.

      2) In Figure S7, please clarify why velocity is in arbitrary units.

      This was an oversight and has been fixed.

      3) Please clarify how locomotion and stationary trials are selected in Figure 4.

      We thank the reviewers for pointing this out. Trials were classified as occurring during locomotion or while mice were stationary as follows. We used a time-window of -0.5 s to +1 s around stimulus onset. If mice exhibited uninterrupted locomotion above a threshold of 0.25 cm/s in this time-window, we considered the stimulus as occurring during locomotion; otherwise, it was defined as occurring while the mice were stationary. Note that the same criteria for defining locomotion state were used to isolate visuomotor mismatch events, and also during control optogenetic stimulation experiments. We have added this information to the methods.
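      The classification rule above can be sketched as follows. This is a minimal illustration, assuming a sampled speed trace in cm/s with matching timestamps in seconds; the function name `is_locomotion_trial` and the example traces are hypothetical, and only the window (-0.5 s to +1 s) and the 0.25 cm/s threshold come from the response.

```python
def is_locomotion_trial(times, speeds, onset, threshold=0.25, pre=0.5, post=1.0):
    """Return True if speed stays above `threshold` throughout the window.

    The window spans [onset - pre, onset + post] in seconds. A single sample
    at or below threshold inside the window makes the trial stationary.
    """
    window = [v for t, v in zip(times, speeds) if onset - pre <= t <= onset + post]
    # An empty window carries no evidence of locomotion, so it counts as stationary.
    return bool(window) and all(v > threshold for v in window)


# Made-up traces sampled at 10 Hz, for illustration only.
times = [i / 10 for i in range(50)]
running = [1.0] * 50
interrupted = list(running)
interrupted[22] = 0.1  # brief stop at t = 2.2 s, inside the window around onset = 2.0 s

during_locomotion = is_locomotion_trial(times, running, onset=2.0)
during_interruption = is_locomotion_trial(times, interrupted, onset=2.0)
```

      Requiring every sample to exceed the threshold is one way to implement "uninterrupted" locomotion; any smoothing of the speed trace beforehand would be a further assumption.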

      4) When testing whether cholinergic activation is sufficient to explain locomotion-induced decorrelation in Figure 6G-H, please show pre-CNO and post-CNO delta-correlation, not just their difference.

      We can do that, but the results are harder to parse this way. We have added this as Figure S11 to the manuscript. The problem with parsing the figure is that the pre-CNO levels are different in different groups. This is likely a function of mouse-to-mouse variability and makes it harder to identify what the CNO induced changes are. Using the pre-post difference removes the batch influence. Hence, we have left this as the main analysis in Figure 6G and 6H.
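      The argument for using the pre-post difference can be illustrated with a small sketch. All numbers are made up for illustration and are not data from the study; the helper names `mean` and `cno_induced_change` are hypothetical. The point is only that two groups with different pre-CNO baselines can show the same CNO-induced change once each site is compared with itself.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def cno_induced_change(pre, post):
    """Per-site post-CNO minus pre-CNO value (paired difference)."""
    return [b - a for a, b in zip(pre, post)]


# Hypothetical delta-correlation values per imaging site (not real data).
group_a_pre = [0.10, 0.12, 0.11]
group_a_post = [0.07, 0.09, 0.08]
group_b_pre = [0.20, 0.22, 0.21]   # higher baseline, e.g. mouse-to-mouse variability
group_b_post = [0.17, 0.19, 0.18]

baseline_gap = mean(group_b_pre) - mean(group_a_pre)
change_a = mean(cno_induced_change(group_a_pre, group_a_post))
change_b = mean(cno_induced_change(group_b_pre, group_b_post))
```

      In this toy case the baselines differ by about 0.10 while the paired changes agree, which is why the difference is easier to compare across groups than the raw pre and post values.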

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Wang et al. generate XAP5 and XAP5L knockout mice and find that they are male infertile due to meiotic arrest and reduced sperm motility, respectively. RNA-Seq was subsequently performed and the authors concluded that XAP5 and XAP5L are antagonistic transcription factors of ciliogenesis (in XAP5-KO P16 testis: 554 genes were upregulated and 1587 genes were downregulated; in XAP5L-KO sperm: 2093 genes were upregulated and 267 genes were downregulated).

      We are grateful for the comprehensive summary.

      Strengths:

      Knockout mouse models provided strong evidence to indicate that XAP5 and XAP5L are critical for spermatogenesis and male fertility.

      Thank you for your positive comment.

      Weaknesses:

      The key conclusions are not supported by evidence. First, the authors claim that XAP5 and XAP5L transcriptionally regulate sperm flagella development; however, detailed molecular experiments related to transcription regulation are lacking. How do XAP5 and XAP5L regulate their targets? Only RNA-Seq is not enough. Second, the authors declare that XAP5 and XAP5L are antagonistic transcription factors; however, how do XAP5 and XAP5L regulate sperm flagella development antagonistically? Only RNA-Seq is not enough. Third, I am concerned about whether XAP5 really regulates sperm flagella development. XAP5 is specifically expressed in spermatogonia and XAP5-cKO mice are in meiotic arrest, indicating that XAP5 regulates meiosis rather than sperm flagella development.

      Thank you for the critical comments. To strengthen our conclusions, we have included XAP5/XAP5L CUT&Tag data in our revised manuscript. This highly sensitive method has allowed us to identify direct target genes of XAP5 and XAP5L (Table S1, Figure S6). Notably, our results demonstrate that both FOXJ1 and RFX2 are occupied by XAP5 (Figure 4G). Additionally, real-time PCR validation confirmed that RFX2 is also associated with XAP5L, even though enriched peaks for the RFX2 gene were not detected in the initial CUT&Tag data (Figure 4G). These findings indicate that XAP5 and XAP5L regulate the expression of FOXJ1 and RFX2 by directly binding to these genes. De novo motif analyses revealed that XAP5 and XAP5L shared a conserved binding sequence (CCCCGCCC/GGGCGGGG) (Figure S6C), and the bound regions of FOXJ1 and RFX2 contain this sequence. Further analysis shows that many XAP5L target genes are also targets of XAP5 (Figure S6G), despite the limited number of identified XAP5L target genes. This differential binding and regulation of shared target genes underscore the antagonistic relationship between XAP5 and XAP5L. Collectively, these findings provide additional support for the idea that XAP5 and XAP5L function as antagonistic transcription factors, acting upstream of transcription factor families, including FOXJ1 and RFX factors, to coordinate ciliogenesis during spermatogenesis.

      While we agree that XAP5 primarily regulates meiosis during spermatogenesis, our data also indicate that many cilia-related genes, including key transcription regulators of spermiogenesis such as RFX2 and SOX30, are downregulated in XAP5-cKO mice and are bound by XAP5 (Figure 4, Figures S4 and S6). It is important to note that genes coding for flagella components are expressed sequentially and in a germ cell-specific manner during development. When we refer to "regulating sperm flagella development", we mean the spatiotemporal regulation. We have revised the manuscript to clarify this point.

      Reviewer #2 (Public Review):

      In this study, Wang et al. report the significance of XAP5L and XAP5 in spermatogenesis, where they are involved in transcriptional regulation of ciliary genes in testes. In previous studies, the authors demonstrated that XAP5 is a transcription factor required for flagellar assembly in Chlamydomonas. Continuing from their previous study, the authors examine the conserved role of XAP5 and XAP5L, which form an orthologue pair in mammals.

      XAP5 and XAP5L are expressed ubiquitously and testis-specifically, respectively, and their absence in the testes causes male infertility with defective spermatogenesis. Interestingly, XAP5 deficiency arrests germ cell development at the pachytene stage, whereas XAP5L absence causes impaired flagellar formation. RNA-seq analyses demonstrated that XAP5 deficiency suppresses ciliary gene expression including Foxj1 and Rfx family genes in early testis. By contrast, XAP5L deficiency leads to abnormal retention of Foxj1 and Rfx gene expression in mature sperm. From the results, the authors conclude that XAP5 and XAP5L are the antagonistic transcription factors that function upstream of Foxj1 and Rfx family genes.

      This reviewer thinks the overall experiments are performed well and that the manuscript is clear. However, the current results do not directly support the authors' conclusion. For example, the transcriptional function of XAP5 and XAP5L requires more evidence. In addition, this reviewer wonders about the conserved XAP5 function of ciliary/flagellar gene transcription in mammals - the gene is ubiquitously expressed despite its functional importance in flagellar assembly in Chlamydomonas. Thus, this reviewer thinks authors are required to show more direct evidence to clearly support their conclusion with more descriptions of its role in ciliary/flagellar assembly.

      Thank you for your thoughtful review of our work. We appreciate your positive feedback on the overall quality of the experiments and the clarity of the manuscript. In response to your concerns, we have included new experimental data and made revisions to the manuscript (lines 193-217) to better support our conclusions, particularly regarding the transcriptional function of XAP5 and XAP5L. Additionally, we have expanded on the role of XAP5 in ciliary and flagellar assembly to provide more direct evidence for its functional importance. Thank you for your insights.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      The title (Control of ciliary transcriptional programs during spermatogenesis by antagonistic transcription factors) is not specific and does tend to exaggerate.

      Thank you for the comment, and we appreciate the opportunity to clarify the appropriateness of the title. Our paper extensively investigates the transcriptional regulation of ciliary genes during spermatogenesis. It demonstrates that XAP5/XAP5L are key transcription factors involved in this process. The title reflects our primary focus on the transcriptional programs that govern ciliary gene expression. Moreover, our paper shows that XAP5 positively regulates the expression of ciliary genes, particularly during the early stages of spermatogenesis, while XAP5L negatively regulates these genes. This antagonistic relationship is a crucial aspect of the study and is effectively conveyed in the title. In addition, our revised paper provides detailed insights into how XAP5/XAP5L control ciliary gene expression during spermatogenesis.

      Figure 4C: FOXJ1 and RFX2 are absent in sperm from WT mice. Are you sure? They are highly expressed in WT testes.

      Thank you for your careful review. While FOXJ1 and RFX2 are indeed highly expressed in the testes of wild-type (WT) mice, our data show that they are not detectable in mature sperm. This observation is consistent with published single-cell RNA-seq data (Jung et al., 2019), which indicate that FOXJ1 and RFX2 are primarily expressed in spermatocytes but not in spermatids (Figure S7). This expression pattern aligns with that of IFT-particle proteins, which are essential for the formation but not the maintenance of mammalian sperm flagella (San Agustin, Pazour, & Witman, 2015).

      XAP5 is specifically expressed in spermatogonia and XAP5-cKO mice are in meiotic arrest, indicating that XAP5 regulates meiosis rather than sperm flagella development.

      We appreciate your insightful comments. As mentioned above, we agree that XAP5 primarily regulates meiosis during spermatogenesis. When we mentioned "regulating sperm flagella development," we were referring to the spatiotemporal regulation of these processes. We have revised the manuscript to clarify this distinction. Thank you for your understanding.

      The title of Figure 2 (XAP5L is required for normal sperm formation) is not accurate because the progress of spermatogenesis and sperm count is normal in XAP5L-KO mice (only sperm motility is reduced).

      We apologize for any confusion caused by the previous figure. It did not accurately convey the changes in sperm count. In the revised Figure 2B, we clearly demonstrate that the sperm count in XAP5L-KO mice is indeed lower than that in WT mice. This revision aims to provide a more accurate representation of the effects of XAP5L deficiency on spermatogenesis. Thank you for bringing this to our attention.

      Reviewer #2 (Recommendations For The Authors):

      (1) Although XAP5 and XAP5L deficiency alters the transcription of Foxj1 and Rfx family genes, which are the essential transcription factors for ciliogenesis, current data do not directly support that XAP5 and XAP5L are the upstream transcription factors. The authors need to show more direct evidence such as ChIP-Seq data.

      Thank you for your valuable feedback! In this revised manuscript, we have included data identifying candidate direct targets of XAP5 and XAP5L using the highly sensitive CUT&Tag method (Kaya-Okur et al., 2019). Our results show that XAP5 occupies both FOXJ1 and RFX2 (Figure 4G). Furthermore, real-time PCR validation of the CUT&Tag experiments confirmed that RFX2 is also occupied by XAP5L (Figure 4G), despite the initial CUT&Tag data not revealing enriched peaks for the RFX2 gene (Table S1). Unfortunately, the limited number of enriched peaks identified for XAP5L (Table S1) suggests that the XAP5L antibody used in the CUT&Tag experiment might have suboptimal performance, which prevented us from detecting occupancy on the FOXJ1 promoter. Nevertheless, these additional data provide strong evidence that XAP5 and XAP5L function as upstream transcription factors for FOXJ1 and RFX family genes, supporting their essential roles in ciliogenesis.

      (2) Shared transcripts that are altered by the absence of either XAP5 or XAP5L do not clearly support they are antagonistic transcription factors.

      Thank you for your insightful comment. In our revised manuscript, we performed CUT&Tag analysis to identify target genes of XAP5 and XAP5L. Motif enrichment analysis revealed conserved binding sequences for both factors (Figure S6C), indicating a subset of shared downstream genes between XAP5 and XAP5L. Among the downregulated genes in XAP5 cKO germ cells, 891 genes were bound by XAP5 (Figure S6D). Although the number of enriched peaks identified for XAP5L was limited, 75 of the upregulated genes in XAP5L KO sperm were bound by XAP5L (Figure S6E). Importantly, of these 75 XAP5L target genes, approximately 30% (22 genes) were also identified as targets of XAP5 (Figure S6G), further supporting the idea that XAP5 and XAP5L function as antagonistic transcription factors.

      (3) XAP5 seems to be an ancient transcription factor for cilia and flagellar assembly. However, XAP5 is expressed ubiquitously in mice. How can this discrepancy be explained? Is it also required for primary cilia assembly? Is its expression also directly linked to ciliogenesis in other types of cells?

      Thank you for the thoughtful questions. The ubiquitous expression of XAP5 in mice can be understood in light of its role as an ancient transcription factor for cilia and flagellar assembly. Given that cilia are present on nearly every cell type in the mammalian body (O'Connor et al., 2013), this broad expression pattern makes sense. In fact, XAP5 serves not only as a master regulator of ciliogenesis but also as a critical regulator of various developmental processes (Kim et al., 2018; Lee et al., 2020; Xie et al., 2023).

      Our current unpublished work demonstrates that XAP5 is essential for primary cilia assembly in different cell lines. The loss of XAP5 protein results in abnormal ciliogenesis, further supporting its vital role in ciliary formation across different cell types.

      We believe that the widespread expression of XAP5 reflects its fundamental importance in multiple cellular processes, including ciliogenesis, development, and potentially other cellular functions yet to be discovered.

      (4) XAP5L deficiency impairs flagellar assembly. Have the authors observed any other physiological defects in the absence of XAP5L in mouse models, such as hydrocephalus and/or tracheal defects?

      Thank you for the questions. We have carefully examined XAP5L KO mice for other physiological defects. To date, we have not observed any additional physiological abnormalities. Specifically, we assessed the condition of tracheal cilia in XAP5L KO mice and found no significant differences compared to wild-type (WT) mice, as illustrated in Author response image 1 below.

      Author response image 1.

      References

      Jung, M., Wells, D., Rusch, J., Ahmad, S., Marchini, J., Myers, S. R., & Conrad, D. F. (2019). Unified single-cell analysis of testis gene regulation and pathology in five mouse strains. eLife, 8. doi:10.7554/eLife.43966

      Kaya-Okur, H. S., Wu, S. J., Codomo, C. A., Pledger, E. S., Bryson, T. D., Henikoff, J. G., . . . Henikoff, S. (2019). CUT&Tag for efficient epigenomic profiling of small samples and single cells. Nat Commun, 10(1), 1930. doi:10.1038/s41467-019-09982-5

      Kim, Y., Hur, S. W., Jeong, B. C., Oh, S. H., Hwang, Y. C., Kim, S. H., & Koh, J. T. (2018). The Fam50a positively regulates ameloblast differentiation via interacting with Runx2. J Cell Physiol, 233(2), 1512-1522. doi:10.1002/jcp.26038

      Lee, Y.-R., Khan, K., Armfield-Uhas, K., Srikanth, S., Thompson, N. A., Pardo, M., . . . Schwartz, C. E. (2020). Mutations in FAM50A suggest that Armfield XLID syndrome is a spliceosomopathy. Nat Commun, 11(1). doi:10.1038/s41467-020-17452-6

      O'Connor, A. K., Malarkey, E. B., Berbari, N. F., Croyle, M. J., Haycraft, C. J., Bell, P. D., . . . Yoder, B. K. (2013). An inducible CiliaGFP mouse model for in vivo visualization and analysis of cilia in live tissue. Cilia, 2(1), 8. doi:10.1186/2046-2530-2-8

      San Agustin, J. T., Pazour, G. J., & Witman, G. B. (2015). Intraflagellar transport is essential for mammalian spermiogenesis but is absent in mature sperm. Mol Biol Cell, 26(24), 4358-4372. doi:10.1091/mbc.E15-08-0578

      Xie, X., Li, L., Tao, S., Chen, M., Fei, L., Yang, Q., . . . Chen, L. (2023). Proto-Oncogene FAM50A Can Regulate the Immune Microenvironment and Development of Hepatocellular Carcinoma In Vitro and In Vivo. Int J Mol Sci, 24(4). doi:10.3390/ijms24043217

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment:

      This important study advances the understanding of physiological mechanisms in deep-sea Planctomycetes bacteria, revealing unique characteristics such as being the only known Phycisphaerae using a budding mode of division, extensive involvement in nitrate assimilation, and the release of phage particles without cell death. The study uses convincing evidence, based on experiments using growth assays, phylogenetics, transcriptomics, and gene expression data. The work will be of interest to bacteriologists and microbiologists in general.

      Response: We thank the Editor and Reviewers for their positive comments, which helped us improve the quality of our manuscript entitled “Physiological and metabolic insights into the first cultured anaerobic representative of deep-sea Planctomycetes bacteria” (paper#eLife-RP-RA-2023-89874). The comments are all valuable; we have studied them carefully and made corresponding revisions according to the suggestions. Revised portions are marked in blue in the modified manuscript.

      Please find the detailed responses below.

      Public Reviews:

      Reviewer #1 (Public Review):

      The authors of the manuscript cultivated a Planctomycetes strain affiliated with Phycisphaerae. The strain was one of the few Planctomycetes from deep-sea environments and demonstrated several unique characteristics, such as being the only known Phycisphaerae using a budding mode of division, extensive involvement in nitrate assimilation, and being able to release phage particles without cell death. The manuscript is generally well-written. However, a few issues need to be more clearly addressed, especially regarding the identification and characterization of the phage.

      Response: Thanks for your positive comments. Please find the detailed responses below.

      Reviewer #1 (Recommendations For The Authors):

      • Line 75-77, add a reference for this statement.

      Response: Thanks for your suggestion. We have added a reference (Fuerst and Sagulenko, 2011) for this statement in the revised manuscript (Line 77).

      References related to this response:

      Fuerst, J.A., and Sagulenko, E. Beyond the bacterium: planctomycetes challenge our concepts of microbial structure and function. Nat Rev Microbiol. 2011;9:403-413.

      • Line 124-134, add key statistics (such as ANI) of strain ZRK32 and KS4 to this section.

      Response: Thanks for your suggestion. We added the key statistics of strain ZRK32 and KS4, and described as “Based on the 16S rRNA sequence of strain ZRK32, a sequence similarity calculation using the NCBI server indicated that the closest relatives of strain ZRK32 were Poriferisphaera corsica KS4T (98.06%), Algisphaera agarilytica 06SJR6-2T (88.04%), Phycisphaera mikurensis NBRC 102666T (85.28%), and Tepidisphaera mucosa 2842T (82.94%). Recently, the taxonomic threshold for species based on 16S rRNA gene sequence identity value was 98.65% (Kim et al., 2014). Based on these criteria, we proposed that strain ZRK32 might be a novel representative of the genus Poriferisphaera. In addition, to clarify the phylogenetic position of strain ZRK32, the genome relatedness values were calculated by the average nucleotide identity (ANI), the tetranucleotide signatures (Tetra), and in silico DNA-DNA similarity (isDDH), against the genomes of strains ZRK32 and KS4. The ANIb, ANIm, Tetra, and isDDH values were 72.89%, 85.34%, 0.97385, and 20.90%, respectively (Table S1). These results together demonstrated the strain ZRK32 genome to be obviously below established ‘cut-off’ values (ANIb: 95%, ANIm: 95%, Tetra: 0.99, isDDH: 70%) for defining bacterial species, suggesting strain ZRK32 represents a novel strain within the genus Poriferisphaera.” in the revised manuscript (Lines 124-139).
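      The comparison against the species cut-offs quoted above can be checked with a short sketch. The observed values and cut-offs are copied from the response; the dictionary layout and the function name `below_cutoffs` are assumptions made for illustration only.

```python
# Cut-off values for defining bacterial species, as quoted in the response.
CUTOFFS = {"ANIb": 95.0, "ANIm": 95.0, "Tetra": 0.99, "isDDH": 70.0}

# Genome relatedness values between strains ZRK32 and KS4 (Table S1 in the response).
OBSERVED = {"ANIb": 72.89, "ANIm": 85.34, "Tetra": 0.97385, "isDDH": 20.90}

# 16S rRNA identity to the closest relative and the species threshold (Kim et al., 2014).
IDENTITY_16S = 98.06
THRESHOLD_16S = 98.65


def below_cutoffs(observed, cutoffs):
    """Map each metric to True when the observed value is below its cut-off."""
    return {name: observed[name] < cutoff for name, cutoff in cutoffs.items()}


genome_checks = below_cutoffs(OBSERVED, CUTOFFS)
all_genome_metrics_below = all(genome_checks.values())
rrna_below_threshold = IDENTITY_16S < THRESHOLD_16S
```

      With the quoted numbers, every genome-level metric and the 16S rRNA identity fall below their respective thresholds, which is the basis for treating strain ZRK32 as distinct from KS4.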

      • Fig. 2A missing description for figure key.

      Response: Thanks for your comments. We modified the Figure 2A, shown as below:

      Author response image 1.

      Figure 2. Growth assay and transcriptomic analysis of P. heterotrophicis ZRK32 strains cultivated in basal medium and rich medium.

      • Regarding the phage released, could this be a membrane vesicle-engulfed phage? I would recommend checking "Spontaneous Prophage Induction Contributes to the Production of Membrane Vesicles by the Gram-Positive Bacterium Lacticaseibacillus casei BL23" and "Chronic Release of Tailless Phage Particles from Lactococcus lactis" for further references.

      Response: Thanks for your valuable comments. We carefully read these two papers and found that phage ZRK32 is most likely a membrane vesicle-engulfed phage. We added the corresponding description as “Moreover, it has recently been reported that the tailless Caudoviricetes phage particles are enclosed in lipid membrane and are released from the host cells by a nonlytic mechanism (Liu et al., 2022), and the prophage induction contributes to the production of membrane vesicles by Lacticaseibacillus casei BL23 during cell growth (da Silva Barreira et al., 2022). Considering that strain ZRK32 has a large number of membrane vesicles during cell growth (Figure S9), we speculated that Phage-ZRK32 might be a membrane vesicle-engulfed phage and its release should be related to membrane vesicles.” in the revised manuscript (Lines 381-388).

      References related to this response:

      Liu Y, Alexeeva S, Bachmann H, Guerra Martínez J.A, Yeremenko N, Abee T et al. Chronic release of tailless phage particles from Lactococcus lactis. Appl Environ Microbiol. 2022; 88: e0148321.

      Silva Barreira, D., Lapaquette, P., Novion Ducassou, J., Couté, Y., Guzzo, J., and Rieu, A. Spontaneous prophage induction contributes to the production of membrane vesicles by the gram-positive bacterium Lacticaseibacillus casei BL23. mBio. 2022;13:e0237522.

      • How were the reference sequences for Fig. S10-S13 retrieved, was it by blasting the phage gene against the entire NCBI database, or only the virus sequence within the NCBI? Please clarify this.

      Response: Thanks for your comments. The reference sequences for Fig. S10-S13 were retrieved by blasting the phage gene against the entire NCBI database. We clarified this as “The reference sequences of four AMGs encoding amidoligase, glutamine amidotransferase, gamma-glutamylcyclotransferase, and glutathione synthase were retrieved by blasting the phage gene against the entire NCBI database, respectively.” in the revised manuscript (Lines 444-447).

      Reviewer #2 (Public Review):

      Summary:

      Planctomycetes encompass a group of bacteria with unique biological traits; their compartmentalized cells make them appear to be organisms in between prokaryotes and eukaryotes. However, only a few of the Planctomycetes bacteria have been cultured thus far, and this hampers insight into the biological traits of these evolutionarily important organisms. This work reports the methodological details of how to isolate deep-sea bacteria that could be recalcitrant to laboratory cultivation, and further reveals the distinct characteristics of a new species of deep-sea Planctomycetes bacterium, such as chronic phage release without breaking the host, which promotes nitrogen utilization by the host and related bacteria. Therefore, the findings of this work are of importance in extending our knowledge of bacteria.

      Response: Thanks for your positive comments.

      Strengths:

      Through the combination of microscopic, physiological, genomic, and molecular biological approaches, this work reports the isolation and comprehensive investigation of the first anaerobic representative of the deep-sea Planctomycetes bacteria, in particular its budding division and its release of phage without lysis of the cells. Most of the results and conclusions are supported by the experimental evidence.

      Response: Thanks for your positive comments.

      Weaknesses:

      1. While EMP glycolysis is predicted to be involved in energy conservation, no experimental evidence indicated any sugar utilization by the bacterium.

      Response: Thanks for your comments. We have previously tested the sugar utilization of strain ZRK32, and now added this description as “Consistent with the presence of EMP glycolysis pathway in strain ZRK32, we found that it could use a variety of sugars including glucose, maltose, fructose, isomaltose, galactose, D-mannose, and rhamnose (Table S2).” in the revised manuscript (Lines 281-284).

      2. "Anaerobic representative" is indicated in the title; on the contrary, a TCA cycle is predicted for the energy metabolism of the bacterium.

      Response: Thanks for your valuable comments. Currently, anaerobic microorganisms can use other alternative electron acceptors (such as sulfate reducers, nitrate reducers, iron reducers, etc) in place of oxygen for the TCA cycle. For example, Proteus mirabilis uses the whole oxidative TCA cycle without using oxygen as the final electron acceptor when it performs multicellular swarming (Alteri et al., 2012). In this study, all the genes involved in the TCA cycle were present in anaerobic strain ZRK32 and most of them are upregulated, thus we speculate that it might function through the complete TCA metabolic pathway to obtain energy. We added the related description as “Notably, when growing in the rich medium, the expressions of most genes involved in the TCA cycle and EMP glycolysis pathway in strain ZRK32 were upregulated (Figure 2B-D, Figure S5B and Figure S6), suggesting that strain ZRK32 might function through the complete TCA metabolic pathway and EMP glycolysis pathway to obtain energy for growth (Figure S8) (Zheng et al., 2021b). Consistent with the presence of EMP glycolysis pathway in strain ZRK32, we found that it could use a variety of sugars including glucose, maltose, fructose, isomaltose, galactose, D-mannose, and rhamnose (Table S2). As for the presence of TCA cycle in the anaerobic strain ZRK32, we propose that it might use other alternative electron acceptors (such as sulfate reducers, nitrate reducers, iron reducers, etc) in place of oxygen for the TCA cycle, as shown in other anaerobic bacteria (Alteri et al., 2012).” in the revised manuscript (Lines 277-287).

      References related to this response:

      Alteri CJ, Himpsl SD, Engstrom MD, Mobley HL. Anaerobic respiration using a complete oxidative TCA cycle drives multicellular swarming in Proteus mirabilis. mBio. 2012; 3(6): e00365-12.

      3. The possible mechanisms of the chronic phage release without breaking the host are not discussed.

      Response: Thanks for your valuable comments. The possible mechanism of the chronic phage release without breaking the host might be that it was enclosed in lipid membrane and released from the host cells by a nonlytic mechanism. We added the corresponding description as “Moreover, it has recently been reported that the tailless Caudoviricetes phage particles are enclosed in lipid membrane and are released from the host cells by a nonlytic mechanism (Liu et al., 2022), and the prophage induction contributes to the production of membrane vesicles by Lacticaseibacillus casei BL23 during cell growth (da Silva Barreira et al., 2022). Considering that strain ZRK32 has a large number of membrane vesicles during cell growth (Figure S9), we speculated that Phage-ZRK32 might be a membrane vesicle-engulfed phage and its release should be related to membrane vesicles.” in the revised manuscript (Lines 381-388).

      References related to this response:

      Liu Y, Alexeeva S, Bachmann H, Guerra Martínez J.A, Yeremenko N, Abee T et al. Chronic release of tailless phage particles from Lactococcus lactis. Appl Environ Microbiol. 2022; 88: e0148321.

      da Silva Barreira, D., Lapaquette, P., Novion Ducassou, J., Couté, Y., Guzzo, J., and Rieu, A. Spontaneous prophage induction contributes to the production of membrane vesicles by the gram-positive bacterium Lacticaseibacillus casei BL23. mBio. 2022;13:e0237522.

      Reviewer #2 (Recommendations For The Authors):

      • Have you tested whether strain ZRK32 uses any sugars? If not, why would it use the EMP pathway to obtain energy?

      Response: Thanks for your comments. We have previously tested the sugar utilization of strain ZRK32, and now added this description as “Consistent with the presence of EMP glycolysis pathway in strain ZRK32, we found that it could use a variety of sugars including glucose, maltose, fructose, isomaltose, galactose, D-mannose, and rhamnose (Table S2).” in the revised manuscript (Lines 281-284).

      • Further discussion on possible mechanisms of the chronic phage release without breaking the host is expected.

      Response: Thanks for your valuable comments. The possible mechanism of the chronic phage release without breaking the host might be that it was enclosed in lipid membrane and released from the host cells by a nonlytic mechanism. We added the corresponding description as “Moreover, it has recently been reported that the tailless Caudoviricetes phage particles are enclosed in lipid membrane and are released from the host cells by a nonlytic mechanism (Liu et al., 2022), and the prophage induction contributes to the production of membrane vesicles by Lacticaseibacillus casei BL23 during cell growth (da Silva Barreira et al., 2022). Considering that strain ZRK32 has a large number of membrane vesicles during cell growth (Figure S9), we speculated that Phage-ZRK32 might be a membrane vesicle-engulfed phage and its release should be related to membrane vesicles.” in the revised manuscript (Lines 381-388).

      References related to this response:

      Liu Y, Alexeeva S, Bachmann H, Guerra Martínez J.A, Yeremenko N, Abee T et al. Chronic release of tailless phage particles from Lactococcus lactis. Appl Environ Microbiol. 2022; 88: e0148321.

      da Silva Barreira, D., Lapaquette, P., Novion Ducassou, J., Couté, Y., Guzzo, J., and Rieu, A. Spontaneous prophage induction contributes to the production of membrane vesicles by the gram-positive bacterium Lacticaseibacillus casei BL23. mBio. 2022;13:e0237522.

      • It is recommended that the writing is improved, including presentation style and grammar.

      Response: Thanks for your comments. We have invited a native English speaker (Dr. Diana Walsh from Life Science Editors, USA) to revise our manuscript, which we hope will meet with your approval.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this study, Millard and colleagues investigated if the analgesic effect of nicotine on pain sensitivity, assessed with two pain models, is mediated by Peak Alpha Frequency (PAF) recorded with resting state EEG. The authors found indeed that nicotine (4 mg, gum) reduced pain ratings during phasic heat pain but not cuff pressor algometry compared to placebo conditions. Nicotine also increased PAF (globally). However, mediation analysis revealed that the reduction in pain ratings elicited by the phasic heat pain after taking nicotine was not mediated by the changes in PAF. Also, the authors only partially replicated the correlation between PAF and pain sensitivity at baseline (before nicotine treatment). At the group-level no correlation was found, but an exploratory analysis showed that the negative correlation (lower PAF, higher pain sensitivity) was present in males but not in females. The authors discuss the lack of correlation.

      In general, the study is rigorous, methodology is sound and the paper is well-written. Results are compelling and sufficiently discussed.

      Strengths:

      Strengths of this study are the pre-registration, proper sample size calculation, and data analysis. But also the presence of the analgesic effect of nicotine and the change in PAF.

      Weaknesses:

      It would even be more convincing if they had manipulated PAF directly.

      We thank Reviewer #1 for their positive and constructive comments regarding our study. We appreciate the view that the study was rigorous and methodologically sound, that the paper was well-written, and that the strengths included our pre-registration, sample size calculation, and data analysis.

      In response to the reviewer's comment about more directly manipulating Peak Alpha Frequency (PAF), we agree that such an approach could provide a more direct investigation of the role of PAF in pain processing. We chose nicotine to modulate PAF as the literature suggested it was associated with a reliable increase in PAF speed. As mentioned in our Discussion, there are several alternative methods to manipulate PAF, such as non-invasive brain stimulation (NIBS) techniques like transcranial alternating current stimulation (tACS), or neurofeedback training. These approaches could help clarify whether a causal relationship exists between PAF and pain sensitivity. However, methods such as NIBS still require further investigation, as there is little evidence that these approaches change PAF (Millard et al., 2024).

      Reviewer #2 (Public Review):

      Summary:

      The study by Millard et al. investigates the effect of nicotine on alpha peak frequency and pain in a very elaborate experimental design. According to the statistical analysis, the authors found a factor-corrected significant effect for prolonged heat pain but not for alpha peak frequency in response to the nicotine treatment.

      Strengths:

      I very much like the study design and that the authors followed their research line by aiming to provide a complete picture of the pain-related cortical impact of alpha peak frequency. This is very important work, even in the absence of any statistical significance. I also appreciate the preregistration of the study and the well-written and balanced introduction. However, it is important to give access to the preregistration beforehand.

      Weaknesses:

      The weakness of the study revolves around three aspects:

      (1) I am not entirely convinced that the authors' analysis strategy provides a sufficient signal-to-noise ratio to estimate the peak alpha frequency in each participant reliably. A source separation (ICA or similar) would have been better suited than electrode ROIs to extract the alpha signal. By using a source separation approach, different sources of alpha (mu, occipital alpha, laterality) could be disentangled.

      (2) Also, there's a hint in the literature (reference 49 in the manuscript) that the nicotine treatment may not work as intended. Instead, the authors' decision to use nicotine to modulate the peak alpha frequency and pain relied on other, not suitable work on chronic pain and permanent smokers. In the present study, the authors use nicotine treatment and transient painful stimulation on nonsmokers.

      (3) In my view, the discussion could be more critical in some aspects, and the authors speculate in directions for which their findings cannot provide any evidence. Speculations are indeed very important to generate new ideas but should be restricted to the context of the study (experimental pain, acute interventions). The unfortunate decision to use nicotine severely hampered the authors' aim of the study.

      Impact:

      The impact of the study could be to show what has not worked to answer the research questions of the authors. The authors claim that their approach could be used to define a biomarker of pain. This is highly desirable but requires refined methods and, in order to make the tool really applicable, more accurate approaches at subject level.

      We thank reviewer #2 for their recognition of the study’s design, the importance of this research area, and the pre-registration of our study. In response to the weaknesses highlighted:

      (1) We appreciate the reviewer’s suggestion to improve the signal-to-noise ratio by applying source separation techniques, such as ICA, which have now been performed and incorporated into the manuscript. Our original decision to use sensor-level ROIs followed the precedent set in previous studies, our rationale being to improve reproducibility and avoid biases from picking individual electrodes or manually picking sources. We have added analyses using an automated pipeline that selects components based on the presence of a peak in the alpha range and alignment with a predefined template topography representing sensorimotor sites. Here again we found no significant differences in the mediation results that used a sensor space sensorimotor ROI, further supporting the robustness of the chosen approach. ICA could still potentially disentangle different sources of alpha, such as occipital alpha and mu rhythm, and provide new insights into the PAF-pain relationship. We have now added a discussion in the manuscript about the potential advantages of source separation techniques and suggest that the possible contributions of separate alpha sources be investigated and compared to sensor space PAF as a direction for future research.

      (2) We recognise the reviewer's concern regarding our choice of nicotine as a modulator of pain and alpha peak frequency (PAF). The meta-analysis by Ditre et al. (2016) indeed points to small effect sizes for nicotine's impact on experimental pain and highlights the potential for publication bias. However, our decision to use nicotine in this study was not primarily based on its direct analgesic effects, but rather on its well-documented ability to modulate PAF, in smoking and non-smoker populations, as outlined in our study aims.

      In this regard, the intentional use of nicotine was to assess whether changes in PAF could mediate alterations in pain. This approach aligns with the broader concept that a direct effect of an intervention is not necessary to observe indirect effects (Fairchild & McDaniel, 2017). We have, however, revised our introduction to further clarify this rationale, highlighting that nicotine was used as a tool for PAF modulation, not solely for its potential analgesic properties.

      (3) We agree with the reviewer’s observation that certain aspects of the Discussion could be more cautious, particularly regarding speculations about nicotine’s effects and PAF as a biomarker of pain. We have revised the Discussion to ensure that our interpretations are better grounded in the data from this study, clearly stating the limitations and avoiding overgeneralization. This revision focuses on a more critical evaluation of the potential relationships between PAF, nicotine, and pain sensitivity based solely on our experimental context.

      Finally, we also apologize for not providing access to the preregistration earlier. This was an oversight on our end, and we will ensure that future preregistrations are made available upfront.

      Reviewer #3 (Public Review):

      In this manuscript, Millard et al. investigate the effects of nicotine on pain sensitivity and peak alpha frequency (PAF) in resting state EEG. To this end, they ran a pre-registered, randomized, double-blind, placebo-controlled experiment involving 62 healthy adults who received either 4 mg nicotine gum (n=29) or placebo (n=33). Prolonged heat and pressure were used as pain models. Resting state EEG and pain intensity (assessed with a visual analog scale) were measured before and after the intervention. Additionally, several covariates (sex at birth, depression and anxiety symptoms, stress, sleep quality, among others) were recorded. Data was analyzed using ANCOVA-equivalent two-wave latent change score models, as well as repeated measures analysis of variance. Results do not show *experimentally relevant* changes of PAF or pain intensity scores for either of the prolonged pain models due to nicotine intake.

      The main strengths of the manuscript are its solid conceptual framework and the thorough experimental design. The researchers make a good case in the introduction and discussion for the need to further investigate the association of PAF and pain sensitivity. Furthermore, they proceed to carefully describe every aspect of the experiment in great detail, which is excellent for reproducibility purposes. Finally, they analyse the data from almost every possible angle and provide an extensive report of their results.

      The main weakness of the manuscript is the interpretation of these results. Even though some of the differences are statistically significant (e.g., global PAF, pain intensity ratings during heat pain), these differences are far from being experimentally or clinically relevant. The effect sizes observed are not sufficiently large to consider that pain sensitivity was modulated by the nicotine intake, which puts into question all the answers to the research questions posed in the study.

      We would like to express our gratitude to Reviewer #3 for their thoughtful and constructive review, including the positive feedback on the strengths of our study's conceptual framework, experimental design, and thorough methodological descriptions.

      We acknowledge the concern regarding the experimental and clinical relevance of some statistically significant results (e.g., global PAF and pain intensity during heat pain) and agree that small effect sizes may limit their practical implications. However, our primary goal was to assess whether nicotine-induced changes in PAF mediate pain changes, rather than to demonstrate large direct effects on pain sensitivity. Nicotine was chosen for its known ability to modulate PAF, and our focus was on the mechanistic role of PAF in pain perception. To clarify this, we have revised the discussion to better differentiate between statistical significance, experimental relevance, and clinical applicability. We emphasize that this study represents a preliminary step towards understanding PAF’s mechanistic role in pain, rather than a direct clinical application.

      We appreciate the suggestion to refine our interpretation. We have adjusted our language to ensure it aligns with the effect sizes observed and made recommendations for future research, such as testing different nicotine doses, to potentially uncover stronger or more clinically relevant effects.

      Although modest, we believe these findings offer valuable insights into the potential mechanisms by which nicotine affects alpha oscillations and pain. We have also discussed how these small effects could become more pronounced in different populations (e.g., chronic pain patients) and over time, offering guidance for future research on PAF modulation and pain sensitivity.

      Recommendations for the authors:

      Reviewer #2 (Recommendations For The Authors):

      I have a number of points that the authors may want to consider for this or future work.

      (1) By reviewing the literature provided by the authors in the introduction I think that using nicotine as a means to modulate pain and alpha peak frequency was a mistake. The only work that may give a hint on whether nicotine can modulate experimental pain is the meta-analysis by Ditre and colleagues (2016). They suggest that their small effect may contain a publication bias. I think the other "large body of evidence" is testing something else than analgesia.

      Thank you for your consideration of our choice of nicotine in the study. The meta-analysis by Ditre and colleagues (2016) suggests small effect sizes for nicotine's impact on experimental pain, compared to the moderate effects claimed in some papers, especially when accounting for the potential publication bias you mentioned. However, our selection of nicotine was primarily driven by its documented ability to modulate PAF rather than its direct analgesic effects, as clearly stated in our aims. Therefore, we do not view our decision to use nicotine as a mistake; instead, it was aligned with our goal of assessing whether changes in PAF mediate alterations in pain and thus served as a valuable tool. This perspective aligns with the broader concept that a direct effect is not a prerequisite for observing indirect effects of an intervention on an outcome (Fairchild & McDaniel, 2017). To further enhance clarity, we've revised the introduction to emphasize the role of nicotine in manipulating PAF in relation to our study's aims.

      Previously we wrote: “A large body of evidence suggests that nicotine is an ideal choice for manipulating PAF, as both nicotine and smoking increase PAF speed [37,40–47] as well as pain thresholds and tolerance [48–52].” This has been changed to read: “Because evidence suggests that nicotine can modulate PAF, where both nicotine and smoking increase PAF speed [37,40–47], we chose nicotine to assess our aim of whether changes in PAF mediate changes in pain in a ‘mediation by design’ approach [48]. In addition, given evidence that nicotine may increase experimental pain thresholds and tolerance [49–53], nicotine could also influence pain ratings during tonic pain.”

      (2) As mentioned above, the OSF page is not accessible.

      We apologise for this. We had not realised that the pre-registration was under embargo, but we have now made it available.

      (3) I generally struggle with the authors' approach to investigating alpha. With the approach the authors used to detect peak alpha frequency it might be that the alpha signal may just show such a low amplitude that it is impossible to reliably detect it at electrode level. In my view, the approach is not accurate enough, which can be seen by the "jagged" shape of the individual alpha peak frequency. In my view, a source separation technique would have been more useful. I wonder which of the known cortical alphas contributes to the effects the authors have reported previously: occipital, mu rhythms projections or something else? A source separation approach disentangles the different alphas and will increase the SNR. My suggestion would be to work on ICA components or similar approaches. The advantage is that the components are almost completely free of any artefacts. ICAs could be run on the entire data or separately for each individual. In the latter case, it might be that some participants do not exhibit any alpha component.

      We appreciate your thoughtful consideration of our approach to investigating alpha. The calculation of PAF involves various methods and analysis steps across the literature (Corcoran et al., 2018; Gil Avila et al., 2023; McLain et al., 2022). Your query about which known cortical alphas contribute to reported effects is important. Although Furman et al. (2018) initially focused on a sensorimotor component from an ICA, subsequent work from our labs suggested a broader relationship between PAF and pain across the scalp (Furman et al., 2019; Furman et al., 2020; Millard et al., 2022), and reflected a desire to conduct analyses at the sensor level in order to improve the reproducibility of the methods (Furman et al., 2020). However, based on your comment we have made several additions to the manuscript: we now explain why we did not use manual ICA methods, suggest this for future research, and have added an exploratory analysis using a recently developed automated pipeline that selects components based on the presence of a peak in the alpha range and alignment with a predefined template topography representing activity from occipital or motor sites.

      While we acknowledge that ICA components can offer a better signal-to-noise ratio (SNR) and possibly smoother spectral plots, we opted for our chosen method to avoid potential bias inherent in deciding on a component following source separation. The desire for a quick, automated, replicable, and unbiased pipeline, crucial for potential clinical applications of PAF as a biomarker, influenced this decision. At the time of analysis registration, automated methods for deciding which alpha components to extract following ICA were not apparent. We have now added this reasoning to Methods.

      “Contrary to some previous studies that used ICA to isolate sensory region alpha sources (Furman et al., 2018; De Martino et al., 2021; Valentini et al., 2022), we used pre-determined sensor level ROIs to improve reproducibility and reduce the potential for bias when individually selecting ICA components. Using sensor level ROIs may decrease the signal-to-noise ratio of the data; however, this approach has still been effective for observing the relationship between PAF and experimental pain (Furman et al., 2019; Furman et al., 2020).”

      We have also added use of ICA and development of methods as a suggestion for future research in the discussion:

      “Additionally, the use of global PAF may have introduced mediation measurement error into our mediation analysis. The spatial precision used in the current study was based on previous literature on PAF as a biomarker of pain sensitivity, which have used global and/or sensorimotor ROIs (Furman et al., 2018; Furman et al., 2020). Identification and use of the exploratory electrode clusters found in this study could build upon the current work (e.g., Furman et al., 2021). However, exploratory analysis of the clusters found in the present analysis demonstrated no influence on mediation analysis results (Supplementary Materials 3.8-3.10). Alternatively, independent component analysis (ICA) could be used to identify separate sources of alpha oscillations (Choi et al., 2005), as used in other experimental PAF-pain studies (Furman et al., 2018; Valentini et al., 2022), which could aid to disentangle the potential relevance of different alpha sources in the PAF-pain relationship. Although this comes with the need to develop more reproducible and automated methods for identifying such components.”

      The specific location or source of PAF that relates to pain remains unclear. Because of this, we did employ an exploratory cluster-based permutation analysis to assess the potential for variations in the presence of PAF changes across the scalp at sensor level, and emphasise that location of PAF change could be explored in future. However, we have now conducted the mediation analysis (difference score 2W-LCS model) using averages from the data-driven parietal cluster, frontal cluster, and both clusters together. For these we see a stronger effect of gum on PAF change, which was expected given the data driven approach of picking electrodes. There was still a total and direct effect of nicotine on pain during the PHP model, but still no indirect effect via change in PAF. For the CPA models, there were still no significant total, direct, or indirect effects of nicotine on CPA ratings. Therefore, using these data-driven clusters did not alter results compared to the model using the global PAF variable.

      The reader has been directed to this supplementary material as follows:

      “The potential mediating effect of this change in PAF on change in PHP and CPA was explored (not pre-registered) by averaging within each cluster (central-parietal: CP1, CP2, CPz, P1, P2, P3, P4, Pz, POz; right-frontal: F8, FT8, FT10) and across both clusters. This averaging across electrodes produced three new variables, each assessed in relation to mediating effects on PHP and CPA ratings. The resulting six exploratory mediation analysis (difference score 2W-LCS) models demonstrated minimal differences from the main analysis of global PAF (8-12 Hz), except for the expected stronger effect of nicotine on change in PAF (bs = 0.11-0.14, ps < .003; Supplementary Materials 3.8-3.10).”

      Moreover, our team has been working on an automated method for selecting ICA components, so in response to your comment we assessed whether using this method altered the results of the current analysis. The in-depth methodology behind this new automatic pipeline will be published with a validation from some co-authors in the current collaboration in due course. At present, in summary, this automatic pipeline conducts independent component analysis (ICA) 10 times for each resting state, and selects the component with the highest topographical correlation to a template created of a sensorimotor alpha component from Furman et al., (2018). 
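
      The selection step described above can be illustrated with a minimal sketch. This is not the authors' pipeline, which is stated to be unpublished; it only shows the general idea of ranking ICA components by how well their scalp maps correlate with a template. The function names, the toy five-electrode maps, the optional alpha-peak flag, and the use of the absolute correlation (ICA maps have arbitrary sign) are all assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat map carries no topographical information
    return sxy / math.sqrt(sxx * syy)


def select_component(topographies, template, has_alpha_peak=None):
    """Return (index, |r|) of the component map that best matches the template.

    topographies: one list of electrode weights per ICA component (hypothetical input).
    has_alpha_peak: optional list of booleans; components flagged False are skipped.
    """
    best_index, best_score = None, -1.0
    for i, topo in enumerate(topographies):
        if has_alpha_peak is not None and not has_alpha_peak[i]:
            continue
        score = abs(pearson(topo, template))  # sign of an ICA map is arbitrary
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score


# Toy example with five electrodes (made-up numbers, not study data).
template = [0.1, 0.8, 1.0, 0.7, 0.2]
topographies = [
    [0.9, 0.1, 0.2, 0.8, 0.1],       # unrelated map
    [-0.2, -1.6, -2.0, -1.4, -0.4],  # sign-flipped, rescaled copy of the template
    [0.5, 0.5, 0.4, 0.6, 0.5],       # nearly flat map
]
best, score = select_component(topographies, template)
print(best, round(score, 3))
```

      Repeating the decomposition several times, as the response describes, would simply mean calling such a selection step on each run and keeping the best-matching component.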

      The results of the PHP or CPA mediation models were not substantially different using the PAF calculated from independent components than that using the global PAF. For the PHP model, the total effect (b = -0.648, p = .033) and direct effects (b = -0.666, p = .035) were still significant, and there was still no significant indirect effect (b = 0.018, p = .726). The general fit was reduced, as although the CFI was above 0.90, akin to the original model, the RMSEA and SRMR were not below 0.08, unlike the original models (Little, 2013). For the CPA model, there were still no significant total (b = -0.371, p = .357), direct (b = -0.364, p = .386), or indirect effects (b = -0.007, p = .906), and the model fit also decreased, with CFI below 0.90 and RMSEA and SRMR above 0.08. See supplementary material (3.11). Note that still no correlations were seen between this IC sensorimotor PAF and pain (PHP: r = 0.11, p = .4; CPA: r = -0.064, p = .63).

      Interestingly, in both models, there was now no longer a significant a-path (PHP: b = 0.08, p = 0.292; CPA: b = 0.039, p = 0.575), unlike previously observed (PHP: b = 0.085, p = 0.018; CPA: b = 0.089, p = 0.011). We interpret this as supporting the previously highlighted difference between finding an effect on PAF globally but not in a sensorimotor ROI (and now a sensorimotor IC), justifying the exploratory CBPA and the suggestion in the discussion to explore methodology.
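
      For readers less familiar with the terminology, the relationship between the a-path, the indirect effect, and the total and direct effects can be sketched as follows. This is a simplified illustration using ordinary least squares on simulated data, not the two-wave latent change score models used in the study; the function name, the simulated effect sizes, and the variable names are assumptions. The last lines only check that the coefficients reported above are consistent with total = direct + indirect.

```python
import random


def _cov(u, v):
    """Population covariance of two equally long lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)


def mediation_paths(x, m, y):
    """Product-of-coefficients mediation for a single mediator, estimated by OLS.

    a:      effect of x on m               (m ~ x)
    b:      effect of m on y, given x      (y ~ x + m)
    direct: effect of x on y, given m      (y ~ x + m)
    total:  effect of x on y               (y ~ x)
    """
    sxx, smm = _cov(x, x), _cov(m, m)
    sxm, sxy, smy = _cov(x, m), _cov(x, y), _cov(m, y)
    a = sxm / sxx
    total = sxy / sxx
    det = sxx * smm - sxm ** 2  # determinant of the predictor covariance matrix
    direct = (sxy * smm - smy * sxm) / det
    b = (smy * sxx - sxy * sxm) / det
    return {"a": a, "b": b, "direct": direct, "indirect": a * b, "total": total}


# Simulated data (hypothetical effect sizes, not study data).
rng = random.Random(0)
x = [i % 2 for i in range(200)]                   # 0 = placebo, 1 = nicotine
m = [0.1 * xi + rng.gauss(0, 0.3) for xi in x]    # change in PAF
y = [-0.6 * xi + 0.2 * mi + rng.gauss(0, 1.0) for xi, mi in zip(x, m)]  # change in pain
paths = mediation_paths(x, m, y)
print({name: round(value, 3) for name, value in paths.items()})

# Consistency check on the coefficients reported in the response above.
reported = {"PHP": (-0.648, -0.666, 0.018), "CPA": (-0.371, -0.364, -0.007)}
for model, (total, direct, indirect) in reported.items():
    print(model, abs((direct + indirect) - total) < 1e-9)
```

      In this framing, a non-significant a-path means the intervention did not reliably move the mediator, so an indirect effect through that mediator cannot be expected regardless of the size of the total effect.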

      We understand that this analysis does not fully answer the reviewer’s question about which of the known cortical alphas contributes to the effects reported in our previous work. However, we consider this exploration to be beyond the scope of the current paper, as it would be more appropriately addressed with larger datasets or combinations of datasets, potentially incorporating MEG to better disentangle oscillatory sources. The highlighted differences seen between global PAF, sensorimotor ROI PAF, sensorimotor IC PAF, as well as the CBPA of PAF changes provide ample directions for future research to build upon: 1) which alpha signals (sensor or source space) are related to pain, 2) how these alpha signals can be represented robustly and replicably, and 3) which alpha signals (sensor or source space) can be manipulated through interventions. These are all excellent questions for future studies to investigate.

      The below text has been added to the Discussion:

      In-house code was developed to compare a sensorimotor component to the results presented in this manuscript (Supplementary Material 3.11), showing similar results to the sensorimotor ROI mediation analysis presented here. However, examination of which alpha - be it sensor or source space - are related to pain, how they can be robustly represented, and how they can be manipulated are ripe avenues for future study.

      (4) I have my doubts that you can get a reliable close to bell-shaped amplitude distribution for every participant. The argument that the peak detection procedure is hampered by the high-amplitude lower frequency can be easily solved by subtracting the "slope" before determining the peak. My issue is that the entire analysis is resting on the assumption that each participant has a reliable alpha effect at electrode level. This is not the case. Non-alpha participants can severely distort the statistics. ICA-based analyses would be more sensitive but not every participant will show alpha. You may want to argue with robust group effects but in my view, every single participant counts, particularly for this type of data analysis, where in the case of a low SNR the "peak" can easily shift to the extremes. In case there is an alpha effect for a specific subject, we should see a smooth bump in the frequency spectrum between 8 and 12 Hz. Anything beyond that is hard to believe. The long stimulation period allows a broad FFT analysis window with a good frequency resolution in order to detect the alpha frequency bump.

      The reviewer is correct that non-alpha participants can distort the statistics. We did visually assess the EEG of each individual’s spectra at baseline to establish the presence of global peaks, as we believe this is good practice to aid understanding of the data. Please see Author response image 1 for individual spectra seen at baseline. Although not all participants had a ‘smooth bump in the frequency spectrum between 8 and 12 Hz’, we prefer to not apply/necessitate this assumption to our data. Chiang et al., (2011) suggest that ~3% of individuals do not have a discernible alpha peak, and in our data we observed only one participant without a very obvious spectral peak (px-39). But, this participant does have enough activity within the alpha range to identify PAF by the CoG method (i.e. not just flat spectra and activity on top of 1/f characteristics). Without a pre-registered and standardised decision process to remove such a participant in place, we opted to not remove any participants to avoid curation of our data.

      Author response image 1.
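
      For readers unfamiliar with the centre-of-gravity (CoG) estimate referred to in this response, a minimal sketch follows. It assumes an alpha band of 8 to 12 Hz (taken from the reviewer's comment, not necessarily the band used in the manuscript) and a synthetic spectrum. The function name `cog_paf` is hypothetical and this is not the authors' in-house code.

```python
import math


def cog_paf(freqs, power, band=(8.0, 12.0)):
    """Centre-of-gravity PAF: the power-weighted mean frequency within `band`.

    `band` is an assumed alpha range; adjust it to the analysis at hand.
    """
    lo, hi = band
    pairs = [(f, p) for f, p in zip(freqs, power) if lo <= f <= hi]
    total = sum(p for _, p in pairs)
    if total <= 0:
        raise ValueError("no power inside the requested band")
    return sum(f * p for f, p in pairs) / total


# Synthetic spectrum (illustrative only): 1/f background plus a bump at 10 Hz.
freqs = [2.0 + 0.25 * i for i in range(113)]  # 2 Hz to 30 Hz in 0.25 Hz steps
power = [1.0 / f + 2.0 * math.exp(-((f - 10.0) ** 2) / 2.0) for f in freqs]

paf = cog_paf(freqs, power)

# A completely flat spectrum still yields a value: the midpoint of the band.
flat_paf = cog_paf([8.0, 9.0, 10.0, 11.0, 12.0], [1.0] * 5)
```

      Because a flat spectrum still returns the band midpoint, a CoG value alone does not show that an alpha peak exists, which is one reason visual inspection of individual spectra remains a useful companion to the estimate.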

      (5) I find reports on frequent channel rejections reflect badly on the data quality. Bad channels can be avoided with proper EEG preparation. EEG should be continuously monitored during recording in order to obtain best data quality. Have any of the ROI channels been rejected?

      We appreciate your attention to the channel rejection. We believe that the average channels removed (0.94, 0.98, 0.74, and 0.87 [range: 0-4] for each of the four resting states out of 64 channels) does not suggest overly frequent rejection, as it was less than one electrode on average and the numbers are below the accepted number of bad channels to remove/interpolate (i.e. 10%) in EEG pipelines (Debnath et al., 2020; Kayhan et al., 2022). To maintain data quality, consistently poor channels were identified and replaced over time. We hope you will accept our transparency on this issue and note that by stating how channel removal decisions were made (i.e. 8 or more deviations) and reporting the number of channels removed, we adhere to the COBIDAS guidelines (Pernet et al., 2018; 2020).
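
      The proportions behind this statement can be checked with simple arithmetic. The sketch below uses only the figures quoted above (64 channels, the four mean removal counts, a maximum of 4 channels, and the 10% guideline); the variable names are illustrative.

```python
N_CHANNELS = 64
MEAN_REMOVED = [0.94, 0.98, 0.74, 0.87]  # mean channels removed per resting state
MAX_REMOVED = 4                          # upper end of the reported range
GUIDELINE_PCT = 10.0                     # commonly cited limit for bad channels

# Convert counts to percentages of the full montage.
mean_pct = [100.0 * m / N_CHANNELS for m in MEAN_REMOVED]
worst_pct = 100.0 * MAX_REMOVED / N_CHANNELS

for m, pct in zip(MEAN_REMOVED, mean_pct):
    print(f"mean removed {m:.2f} -> {pct:.2f}% of channels")
print(f"worst case {MAX_REMOVED} -> {worst_pct:.2f}% of channels")
```

      Even the worst case (4 of 64 channels, 6.25%) stays under the 10% guideline.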

      During analysis, cases of sensorimotor ROI channels being rejected were noted and are now specified in our manuscript. “Out of 248 resting states recorded, 14 resting states had 4 ROI channels instead of 5. Importantly, no resting state had fewer than 4 channels for the sensorimotor ROI.”

      Note, we also realised that we had not specified that we did interpolate channels for the cluster based permutation analysis. This has been corrected with the following sentence:

      “Removed channels were not interpolated for the pre-registered global and sensorimotor ROI averaged analyses, but were interpolated for an exploratory cluster based permutation analysis using the nearest neighbour average method in `Fieldtrip`.”

      (6) I have some issues buying the authors' claims that there is an effect of nicotine on prolonged pain. By looking at the mean results for the nicotine and placebo condition, this can not be right. What was the point in including the variables in the equation? In my view, in this within-subject design the effect of nicotine should be universal, no matter what gender, age, or depression. The unconditional effect of nicotine is close to zero. I can not get my head around how any of the variables can turn the effects into significance. There must be higher or lower variable scores that might be related to a higher or lower effect on nicotine. The question is not to consider these variables as a nuisance but to show how they modulate the pain-related effect of nicotine treatment. Still, the overall nicotine effect of the entire group is basically zero.

      Another point is that for within-subject analyses even tiny effects can become statistically significant if they are systematically in one direction. This might be the case here. There might be a significant effect of nicotine on pain but the actual effect size (5.73 vs. 5.78) is actually not interpretable. I think it would be interesting for the reader how (in terms of pain rating difference) each of the variables can change the effect of nicotine.

      Thank you for your comments. We recognize the concern about interpreting the effect of nicotine on prolonged pain solely based on mean results, and in fact wish to discourage this approach. It's crucial to note that both PAF and pain are highly individual measures (i.e. high inter-individual variance), necessitating the use of random intercepts for participants in our analyses to acknowledge the inherent variability at baseline across participants. Including random intercepts rather than only considering the means helps address the heterogeneity in baseline levels among participants. We also recognise that displaying the mean PHP ratings for all participants in Table 2 could be misleading, firstly because these means do not have weight in an analysis that takes into account a random-effects intercept for participants, and secondly because two participants (one from each group) did not have post-gum PHP assessments and were not included in the mediation analysis due to list-wise deletion of missing data. Therefore, to reduce the potential for misinterpretation, we have added extra detail to display both the full sample used for the CPA mediation analysis (i.e. N=62) and the data used for the PHP mediation analysis (i.e. n=60) in Table 2. We hope that the extra details added to this table will help the reader's interpretation of the results.

      In light of this, we have also altered the PAF Table 3 to reflect both the pre-post values used for the CPA mediation and baseline correlations with CPA and PHP pain (i.e. N=62), and the pre-post values used for the PHP mediation (i.e. n=60).

      It is inherently difficult to visualise the findings of a mediation analysis with confounding variables that also used latent change scores (LCS) and random-effect intercepts for participants. LCS was specifically used because of issues of regression to the mean that occur if you calculate a straightforward ‘difference-score’; therefore, calculating the difference in order to demonstrate the results of the statistical model in a figure, for example, does not provide a full description of the data assessed (Valente & MacKinnon, 2017). Nevertheless, if we look at the data descriptively with this in mind, then calculating the change in PHP ratings does indicate that, for the nicotine group, the mean change in PHP ratings was -0.047 (SD = 1.05, range: -4.13, 1.45). Meanwhile, for the placebo group the mean change in PHP ratings was 0.33 (SD = 0.75, range: -1.37, 1.66). This suggests a slight decrease in pain ratings on average for the nicotine group compared to a slight increase on average for the placebo group. With control for pre-determined confounders, we found that the latent change score was 0.63 lower for the nicotine group compared to the control group (i.e. the direct effect of nicotine on change in pain).

      If the reviewer is only discussing the effect of nicotine on pain, we do not believe that this effect ‘should be universal’. There is clear evidence that effects of nicotine on other measures can vary greatly across individuals (Ettinger et al., 2009; Falco & Bevins, 2015; Pomerleau et al., 1995). Our intention would not be to propose a universal effect but to understand how these variables may influence nicotine's impact on pain for individuals. Here we focus on the effects of nicotine on PAF and pain sensitivity, but attempted to control for the potential influence of these other confounding factors. Therefore, our statistical approach goes beyond mean values, incorporating variables like sex at birth, age, and depression to control for and explore potential modulating factors. Control for confounding factors is an important aspect of mediation analysis (Lederer et al., 2019; VanderWeele, 2019).

      Regarding the seemingly small effect size, we understand your concern. Indeed ‘tiny effects can become statistically significant if they are systematically in one direction’, which may be what we see in this analysis. We do not agree that the effect is ‘not interpretable’, rather that it should be interpreted in light of its small effect size (effect size being the beta coefficient in our analysis, rather than the mean group difference). We agree on the importance of considering practical significance alongside statistical significance and hope to conduct additional experiments and analyses in future to elucidate the contribution of each variable to the subtle and therefore not entirely conclusive overall effect you mention.

      Your feedback on this is valuable, and we have ensured a more detailed discussion in the revised manuscript on how these factors should be interpreted alongside some additional post-hoc analyses of confounding factors that were significant in our mediation, with the note that investigation of these interactions is exploratory. We had already discussed the potential contribution of sex on the effect of nicotine on PAF, with exploratory post-hoc analysis on this included in supplementary materials. In addition, we have now added an exploratory post-hoc analysis on the potential contribution of stress on the effect of nicotine on pain. This then shows the effects stratified by the covariates that our model suggests are influencing change in PAF and pain.

      Results edits:

      “There was also a significant effect of perceived stress at baseline on change in PHP ratings when controlling for group allocation and other confounding variables (b = -0.096, p = .048, bootstrapped 95% CI: [-0.19, -0.000047]), where higher perceived stress resulted in larger decreases in PHP ratings (see Supplementary Material 3.3 for post-hoc analysis of stress).”

      Supplementary material addition:

      “3.3 Exploratory analysis of the influence of perceived stress on the effects of nicotine on change in PHP ratings”

      “Due to the significant estimated effects of perceived stress on change in PHP ratings in the 2WLCS mediation model, we also explored post-hoc effects of stress on change in PHP ratings. We found that there is strong evidence for a negative correlation between stress and change in PHP rating within the nicotine group (n = 28, r = −0.39, BF10 = 13.65; Figure 3) that is not present in the placebo group, with equivocal evidence (n = 32, r = −0.14, BF10 = 0.46). This suggests that those with higher baseline stress who had nicotine gum experienced greater decreases in PHP ratings. Note that there was less, but still sufficient evidence for this relationship within the nicotine group when the participant who was a potential outlier for change in PHP rating was removed (n = 27, r = −0.32, BF10 = 1.45).”

      Author response image 2.

      Spearman correlations of baseline perceived stress with the change in phasic heat pain (PHP) ratings suggest strong evidence for a negative relationship for the nicotine gum group in orange (n=28; BF10=13.65) but not for the placebo group in grey (n=32; BF10=0.46). Regression lines and 95% confidence intervals are shown.

      Discussion edits:

      “For example, in addition to the effect of nicotine on prolonged heat pain ratings, our results suggest an effect of stress on changes in heat pain ratings, with those self-reporting higher stress at baseline having greater reductions in pain. Our post-hoc analysis suggested that this relationship between higher stress and larger decrease in PHP ratings was only present for the nicotine group (Supplementary Material 3.3). As stress is linked to nicotine use [69,70] and pain [71–73], these interactions should be explored in future.”

      (7) Is the differential effect of nicotine vs. placebo based on the pre vs. post treatment effect of the placebo condition or on the pre vs. post effect of the nicotine treatment? Can the mediation model be adapted and run for each condition separately? The placebo condition seems to have a stronger effect and may have driven the result.

      Thank you for your comments. In our mediation analysis, the differential effect of nicotine vs. placebo is assessed as a comparison between the pre-post difference within each condition. A latent change score (i.e. pre-post) is calculated for each condition (nicotine and placebo), and then the effect of being in the nicotine group (dummy coded as 1) is compared to being in the placebo group (dummy coded as 0). The comparison between conditions is needed for this model (Valente & MacKinnon, 2017), as we are assessing the change in PAF and pain in the nicotine group compared to the change in the placebo group.
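
      The logic of comparing change between dummy-coded groups can be illustrated with a deliberately simplified, observed-score analogue. This is not the latent change score mediation model used in the manuscript (which was chosen precisely to avoid problems with raw difference scores); the data and the helper `ols_slope_intercept` are hypothetical. With a 0/1 predictor, the regression slope equals the difference in mean change between groups, and the intercept equals the mean change in the reference (placebo) group.

```python
from statistics import mean


def ols_slope_intercept(x, y):
    """Ordinary least squares fit of y on a single predictor x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical pain ratings (0-10 scale), purely for illustration.
pre = [5.0, 6.0, 4.0, 7.0, 5.5, 6.5, 4.5, 5.0]
post = [4.5, 6.0, 3.5, 6.0, 6.0, 7.0, 4.5, 5.5]
group = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = nicotine, 0 = placebo (dummy coding)

# Observed change score (post minus pre) for each participant.
change = [b - a for a, b in zip(pre, post)]

slope, intercept = ols_slope_intercept(group, change)
print(f"placebo mean change (intercept): {intercept:.3f}")
print(f"nicotine minus placebo change (slope): {slope:.3f}")
```

      In this toy example the placebo group rises by 0.375 on average and the nicotine group falls by 0.5, so the group coefficient is -0.875: a negative coefficient can arise partly from an increase in the reference group.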

      However, to address your question, it is possible to simplify and assess the relationship between the change in peak alpha frequency (PAF) and change in pain within each gum group (nicotine and placebo) independently, without including the intervention as a factor. To do this, the mediation model can be simplified to a regression analysis with latent change scores that focuses purely on these relationships. The results of this can help to understand whether change in PAF influences change in pain within each group separately. As with the main analysis, we see no significant influence of change in PAF on change in pain while controlling for the same confounding variables within the nicotine group (Beta = -0.146 +/- 1.105, p = 0.895, 95% CI: -2.243, 2.429) or the placebo group (Beta = 0.730 +/- 2.061, p = 0.723, 95% CI: -4.177, 3.625).

      When suggesting that “the placebo condition seems to have a stronger effect and may have driven the result”, we believe you are referring to the increase in mean PHP ratings within the placebo group from pre (5.51 +/- 2.53) to post-placebo gum (5.84 +/- 2.67). Indeed there was a significant increase in pain ratings pre to post chewing placebo gum (t(31) = -2.53, p = 0.0165, 95% CI: -0.603, -0.0653) that was not seen after chewing nicotine gum (t(27) = 0.237, p = 0.81, 95% CI: -0.358, 0.452). In lieu of a control where no gum was chewed (i.e. simply a second pain assessment ~30 minutes after the first), we assume the gum without nicotine is a good reference that controls for the effect of time plus expectation of chewing nicotine gum. With this in mind, as we describe in our results, the change in PHP ratings is reduced in the nicotine group compared to the placebo group. Note that this phrasing keeps the effect of placebo on pain as our reference from which to view the effect of nicotine on pain. However, you are correct that we need to ensure we emphasise that the change in pain in the nicotine group is reduced in comparison to the change seen after placebo.
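
      As a rough cross-check of the paired comparison reported here, the sketch below computes a paired t statistic two ways: from raw pre/post values (hypothetical data) and from rounded summary values quoted in this response (placebo mean change 0.33, SD 0.75, n = 32). It assumes the sign convention pre minus post, so that an increase in pain gives a negative t, matching the reported statistic. The function names are illustrative.

```python
import math
from statistics import mean, stdev


def paired_t(pre, post):
    """Paired t statistic on (pre - post) differences; returns (t, df)."""
    diffs = [a - b for a, b in zip(pre, post)]
    n = len(diffs)
    se = stdev(diffs) / math.sqrt(n)  # standard error of the mean difference
    return mean(diffs) / se, n - 1


def t_from_summary(mean_diff, sd_diff, n):
    """Same statistic computed from summary values of the differences."""
    return mean_diff / (sd_diff / math.sqrt(n))


# Hypothetical raw data, for illustration only.
t_raw, df_raw = paired_t([5.0, 6.0, 4.0, 7.0], [5.5, 6.5, 4.0, 7.5])

# Placebo group: post minus pre change of +0.33 means pre minus post is -0.33.
t_placebo = t_from_summary(-0.33, 0.75, 32)
print(f"t from rounded summary values: {t_placebo:.2f} (df = 31)")
```

      The summary-based value comes out at roughly -2.49, close to the reported t(31) = -2.53; the small gap is consistent with rounding of the inputs.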

      We have not included these extra statistics in our revised manuscript, but hope that they aid your understanding and interpretation of the included analyses. We have highlighted these nuances in the discussion.

      “However, we note that the observed effect of nicotine on pain was small in magnitude, and most prominent in comparison to the effect of placebo, where pain ratings increased after chewing, which brings into question whether this reduction in pain is meaningful in practice.”

      (8) I would not dare to state that nicotine can function as an acute analgesic. Acute analgesics need to work for everyone. The average effect here is close to zero.

      In light of your feedback, we have refined our language to avoid a sweeping assertion of universal analgesic effects and emphasize individual variability. Nicotine's role as a coping strategy for pain is acknowledged in the literature (Robinson et al., 2022), with the meta-analysis by Ditre et al. (2016) discussing its potential as an acute analgesic in humans, along with some evidence from animal research (Zhang et al., 2020). Our revised discussion underscores the need for further exploration into factors influencing nicotine's potential impact on pain. We have also specified the short-term nature of nicotine use in this context to distinguish acute effects from potential opposing effects after long-term use (Zhang et al., 2020).

      “Short-term nicotine use is thought to have acute analgesic properties in experimental settings, with a review reporting that nicotine increased pain thresholds and pain tolerance [49]. In addition, research in a rat model suggests analgesic effects on mechanical thresholds after short-term nicotine use (Zhang et al., 2020). However, previous research has not assessed the acute effects of nicotine on prolonged experimental pain models. The present study found that 4 mg of nicotine reduced heat pain ratings during prolonged heat pain compared to placebo for our human participants, but that prolonged pressure pain decreased irrespective of which gum was chewed. Our findings are thus partly consistent with the idea that nicotine may have acute analgesic properties [49], although further research is required to explore factors that may influence nicotine’s potential impact on a variety of prolonged pain models. We further advance the literature by reporting this effect in a model of prolonged heat pain, which better approximates the experience of clinical pain than short lasting models used to assess thresholds and tolerance [50]. However, we note that the observed effect of nicotine on pain was small in magnitude, and most prominent in comparison to the effect of placebo, where pain ratings increased after chewing, which brings into question whether this reduction in pain is meaningful in practice. Future research should examine whether effects on pain increase in magnitude with different nicotine administration regimens (i.e. dose and frequency).”

      (9) Figures 2E and 2F are not particularly intuitive. Usually, the colour green in "jet" colour coding is being used for "zero" values. I would suggest to cut off the blue and use only the range between red green and red.

      We have chosen to retain the current colour scale for several reasons. In our analysis, green represents the middle of the frequency range (approx 10 Hz in this case), and if we were to use green as zero, it would effectively remove both blue and green from the plot, resulting in only red shades. Additionally, we have provided a clear colour scale for reference next to the plot, which allows readers to interpret the data accurately. Our intention is to maintain clarity and precision in representing the data, rather than conforming strictly to conventional practices in color coding.

      We believe that the current representation effectively conveys the results of our study while allowing readers to interpret the data within the context provided. Thank you again for your suggestion, and we hope you understand our reasoning in this matter.

      (10) Did the authors do their analysis on the parietal ROI or on the pre-registered ROI?

      The analysis was conducted on the pre-registered sensorimotor ROI and on the global values. We have now also conducted the analysis with the regions suggested with the cluster based permutation analysis as requested by reviewer 2, comment 3.

      (11) Point 3.2 in the discussion. I would be very cautious to discuss smoking and chronic pain in the context of the manuscript. The authors can not provide any additional knowledge with their design targeting non-smokers, acute nicotine and experimental pain. The information might be interesting in the introduction in order to provide the reader with some context but is probably misleading in the discussion.

      We appreciate your perspective and agree with your caution regarding the discussion of smoking and chronic pain. While our study specifically targets non-smokers and focuses on acute nicotine effects in experimental pain, we understand the importance of contextual clarity. We have removed these points from the discussion to not mislead the reader.

      Previously we wrote, and have removed: “For those with chronic pain, smoking and nicotine use is reported as a coping strategy for pain [52]; abstinence can increase pain sensitivity [48,50], and pain is thus seen as a barrier to smoking cessation due to fear of worsening pain [51,52]. Therefore, continued understanding of the acute effects of nicotine on models of prolonged pain could improve understanding of the role of nicotine and smoking use in chronic pain [49,51,52].”

      (12) I very much appreciate section 3.3 of the discussion. I would not give up on PAF as a target to modulate pain. A modulation might not be possible in such a short period of experimental intervention. PAF might need longer and different interventions to gradually shift in order to attenuate the intensity of pain. As discussed by the authors themselves, I would also consider other targets for alpha analysis (as mentioned above not other electrodes or ROIs but separated sources.)

      Thank you for your comments on section 3.3. We appreciate your recognition of the potential significance of PAF as a target for pain modulation. Your insights align with our considerations that the experimental intervention duration or type might be a limiting factor in observing substantial shifts in PAF to attenuate pain intensity. We had mentioned the use of the exploratory electrode clusters in future work, but have now also mentioned that the use of ICA to identify separate ICA sources may provide an alternative approach. See responses to your previous ICA comment regarding separate sources.

      REFERENCES for responses to reviewer 2

      Chiang, A. K. I., Rennie, C. J., Robinson, P. A., Van Albada, S. J., & Kerr, C. C. (2011). Age trends and sex differences of alpha rhythms including split alpha peaks. Clinical Neurophysiology, 122(8), 1505-1517.

      Debnath, R., Buzzell, G. A., Morales, S., Bowers, M. E., Leach, S. C., & Fox, N. A. (2020). The Maryland analysis of developmental EEG (MADE) pipeline. Psychophysiology, 57(6), e13580.

      Ettinger, U., Williams, S. C., Patel, D., Michel, T. M., Nwaigwe, A., Caceres, A., ... & Kumari, V. (2009). Effects of acute nicotine on brain function in healthy smokers and non-smokers: estimation of inter-individual response heterogeneity. Neuroimage, 45(2), 549-561.

      Falco, A. M., & Bevins, R. A. (2015). Individual differences in the behavioral effects of nicotine: a review of the preclinical animal literature. Pharmacology Biochemistry and Behavior, 138, 80-90.

      Kayhan, E., Matthes, D., Haresign, I. M., Bánki, A., Michel, C., Langeloh, M., ... & Hoehl, S. (2022). DEEP: A dual EEG pipeline for developmental hyperscanning studies. Developmental cognitive neuroscience, 54, 101104.

      Lederer, D. J., Bell, S. C., Branson, R. D., Chalmers, J. D., Marshall, R., Maslove, D. M., ... & Vincent, J. L. (2019). Control of confounding and reporting of results in causal inference studies. Guidance for authors from editors of respiratory, sleep, and critical care journals. Annals of the American Thoracic Society, 16(1), 22-28.

      Little, T. D. (2013). Longitudinal structural equation modeling. Guilford Press.

      Pernet, C., Garrido, M., Gramfort, A., Maurits, N., Michel, C. M., Pang, E., ... & Puce, A. (2018). Best practices in data analysis and sharing in neuroimaging using MEEG.

      Pernet, C., Garrido, M. I., Gramfort, A., Maurits, N., Michel, C. M., Pang, E., ... & Puce, A. (2020). Issues and recommendations from the OHBM COBIDAS MEEG committee for reproducible EEG and MEG research. Nature neuroscience, 23(12), 1473-1483.

      Pomerleau, O. F. (1995). Individual differences in sensitivity to nicotine: implications for genetic research on nicotine dependence. Behavior genetics, 25(2), 161-177.

      Robinson, C. L., Kim, R. S., Li, M., Ruan, Q. Z., Surapaneni, S., Jones, M., ... & Southerland, W. (2022). The Impact of Smoking on the Development and Severity of Chronic Pain. Current Pain and Headache Reports, 26(8), 575-581.

      Xia, J., Mazaheri, A., Segaert, K., Salmon, D. P., Harvey, D., Shapiro, K., ... & Olichney, J. M. (2020). Event-related potential and EEG oscillatory predictors of verbal memory in mild cognitive impairment. Brain communications, 2(2), fcaa213.

      VanderWeele, T. J. (2019). Principles of confounder selection. European journal of epidemiology, 34, 211-219.

      Valente, M. J., & MacKinnon, D. P. (2017). Comparing models of change to estimate the mediated effect in the pretest–posttest control group design. Structural Equation Modeling: A Multidisciplinary Journal, 24(3), 428-450.

      Vimolratana, O., Aneksan, B., Siripornpanich, V., Hiengkaew, V., Prathum, T., Jeungprasopsuk, W., ... & Klomjai, W. (2024). Effects of anodal tDCS on resting state eeg power and motor function in acute stroke: a randomized controlled trial. Journal of NeuroEngineering and Rehabilitation, 21(1), 1-15.

      Zhang, Y., Yang, J., Sevilla, A., Weller, R., Wu, J., Su, C., ... & Candiotti, K. A. (2020). The mechanism of chronic nicotine exposure and nicotine withdrawal on pain perception in an animal model. Neuroscience letters, 715, 134627.

      Reviewer #3 (Recommendations For The Authors):

      Introduction

      (1) Rationale and link to chronic pain. I am not sure I agree with the statement "The ability to identify those at greater risk of developing chronic pain is limited". I believe there is an abundance of literature associating risk factors with the different instances of chronic pain (e.g., Mills et al., 2019). The fact that the authors cite studies involving potential neuroimaging biomarkers leads me to believe that they perhaps did not intend to make such a broad statement, or that they wanted to focus on individual prediction instead of population risk.

      We thank the reviewer for the thought put into this comment. We did indeed wish to refer to individual prediction, but also realise that the focus on predicting pain might not be the most appropriate opening for this manuscript. Therefore, we have adjusted the below sentence to refer to the need to identify modifiable factors rather than the need to predict pain.

      “Identifying modifiable factors that influence pain sensitivity could be a key step in reducing the presence and burden of chronic pain (van der Miesen et al., 2019; Davis et al., 2020; Tracey et al., 2021).”

      (2) The statement "Individual peak alpha frequency (PAF) is an electro-physiological brain measure that shows promise as a biomarker of pain sensitivity, and thus may prove useful for predicting chronic pain development" is a non sequitur. PAF may very well be a biomarker of pain sensitivity, but the best measures of pain sensitivity we have (self-reported pain intensity ratings) in general are not in themselves predictive of the development of chronic pain. Conversely, features that are not related to pain sensitivity could be useful for predicting chronic pain (e.g., Tanguay-Sabourin et al., 2023).

      We agree that it is essential to acknowledge that self-reported pain intensity ratings alone are not definitive predictors of chronic pain development. To align with this, we have revised the sentence, removing the second clause to avoid overstatement. The adjusted sentence now reads, "Individual peak alpha frequency (PAF) is an electrophysiological brain measure that shows promise as a biomarker of pain sensitivity."

      (3) Finally, some of the statements in the discussion comparing a tonic heat pain model with chronic neuropathic pain might be an overstatement. Whereas it is true that some of the descriptors are similar, the time courses and mechanisms are vastly different.

      We appreciate this comment, and agree that it is difficult to compare the heat pain model used to clinical neuropathic pain. This was an oversight and with further understanding we have removed this comment from the introduction and the discussion:

      “In parallel, we saw no indication of a relationship between PAF and pain ratings during CPA. The introduction of the CPA model, specifically calibrated to a moderate pain threshold, provides further support for the notion that the relationship between PAF and pain is specific to certain pain types [17,28]. Prolonged heat pain was pre-dominantly described as moderate/severe shooting, sharp, and hot pain, whereas prolonged pressure pain was predominantly described as mild/moderate throbbing, cramping, and aching in the present study. It is possible that the PAF–pain relationship is specific to particular pain models and protocols [12,17].”

      Methodology

      (4) or the benefit of good science. However, I am compelled to highlight that I could not access the preregistered files, even though I waited for almost two weeks after requesting permission to do so. This was a problem on two levels: the main one is that I could not check the hypothesized effect sizes of the sample size estimation, which are not only central to my review, and in general negate all the benefits that should go with preregistration (i.e., avoiding p-hacking, publication bias, data dredging, HARKing, etc.). The second one is that I had to provide an email address to request access. This allows the authors to potentially identify the reviewers. Whereas I have no issues with this and I support transparent peer review practices (https://elifesciences.org/inside-elife/e3e90410/increasing-transparency-in-elife-s-review-process), I also note that this might condition other reviewers.

      We apologise for this. We had not realised that the pre-registration was under embargo, but we have now made it available.

      Interpretation of results

      (5) To be perfectly clear, I trust the results of this study more than some of the cited studies regarding nicotine and pain because it was preregistered, the sample size is considerably larger, and it seems carefully controlled. I just do not agree with the interpretation of the results, stated in the first paragraph of the Discussion. Quoting J. Cohen, "The primary product of a research inquiry is one or more measures of effect size, not P values" (Cohen, 1990). As I am sure the authors are aware, even tiny differences between conditions, treatments or groups will eventually be statistically significant given arbitrarily large sample sizes. What really matters then is the magnitude of these differences. In general, the authors hypothesize on why there were no differences on the pressure pain model, and why decreases in heat pain were not mediated by PAF, but do not seem to consider the possibility that the intervention just did not cause the intended effect on the nociceptive system, which would be a much more straightforward explanation for all observations.

      While acknowledging and agreeing with the concern that 'even tiny differences between conditions, treatments, or groups will eventually be statistically significant given arbitrarily large sample sizes,' it's crucial to clarify that our sample size of N=62 does not fall into the category of arbitrarily large. We carefully considered the observed outcomes in the pressure pain model and the lack of PAF mediation in heat pain, as dictated by our statistical approach and the obtained results.

      The suggestion of a straightforward explanation aligning with the intervention not causing the intended effect on the nociceptive system is a valid consideration. We did contemplate the possibility of a false positive, emphasising this in the limitations of our findings and the need for replication to draw stronger conclusions to follow up this initial study.
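
      The reviewer's point that tiny differences become statistically significant at large sample sizes can be illustrated with a minimal sketch. It assumes a two-sided z-test on a standardized mean difference between two equal groups; the effect size of 0.05 and the group sizes (including 31 per group as a stand-in for N=62) are illustrative assumptions, not values taken from the study's analysis.

```python
import math


def two_sided_p_value(effect_size_d, n_per_group):
    """Two-sided z-test p-value for a standardized mean difference d
    between two independent groups of equal size n_per_group."""
    # Standard error of the difference in standardized units is sqrt(2 / n).
    z = effect_size_d * math.sqrt(n_per_group / 2.0)
    # 2 * (1 - Phi(|z|)) written with the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    # Hypothetical standardized effect, held fixed while the sample grows.
    for n in (31, 1000, 10000):
        print(n, round(two_sided_p_value(0.05, n), 4))
```

      With the effect held fixed, the p-value falls as the group size grows, which is why the magnitude of the effect has to be judged separately from its significance.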

      (6) In this regard, I do not believe that an average *increase* of 0.05 / 10 (Nicotine post - pre) can be considered a "reduction of pain ratings", regardless of the contrast with placebo (average increase of 0.24 / 10). This tiny effect size is more relevant in the context of the considerable inter-individual variation, in which subjects scored the same heat pain model anywhere from 1 to 10, and the same pressure pain model anywhere from 1 to 8.5. In this regard, the minimum clinically or experimentally important differences (MID) in pain ratings vary from study to study and across painful conditions but are rarely below 1 / 10 on a VAS or NRS scale; see, for example, Olsen et al. (2017). It is not my intention to question whether nicotine can function as an acute analgesic in general (as stated in the Discussion), but instead, if it worked as such under these very specific experimental conditions. I also acknowledge that the authors note this issue in two lines in the Discussion, but I believe that this is not weighed properly.

      We appreciate your perspective on the interpretation of the effect size, and we understand the importance of considering it in the context of individual variation.

      As also discussed in response to comment 6 from Reviewer 2, we recognize the concern about interpreting the effect of nicotine on prolonged pain solely based on mean results, and in fact wish to discourage this approach. It's crucial to note that both PAF and pain are highly individual measures (i.e. high inter-individual variance), necessitating the use of random intercepts for participants in our analyses to acknowledge the inherent variability at baseline across participants. Including random intercepts rather than only considering the means helps address the heterogeneity in baseline levels among participants. We also recognise that displaying the mean PHP ratings for all participants in Table 2 could be misleading, firstly because these means do not have weight in an analysis that takes into account a random-effects intercept for participants, and secondly because two participants (one from each group) did not have post-gum PHP assessments and were not included in the mediation analysis due to list-wise deletion of missing data. Therefore, to reduce the potential for misinterpretation, we have added extra detail to display both the full sample and CPA mediation analysis (i.e. N=62) and the data used for PHP mediation analysis (i.e. n=60) in Table 2. We hope that the extra details added to this table will help the reader's interpretation of results.
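
      Why participant-level baselines matter can be shown with a simplified sketch. It uses within-participant centring as a stand-in for a random-intercept model, which is a swapped-in, simpler technique and not the authors' analysis; the ratings below are hypothetical numbers chosen only to show large baseline differences with small within-person changes.

```python
from statistics import mean, pvariance

# Hypothetical (pre, post) pain ratings on a 0-10 scale for three participants.
RATINGS = {"p1": (2.0, 2.1), "p2": (5.0, 5.3), "p3": (8.0, 8.2)}


def centre_within_participant(ratings):
    """Subtract each participant's own mean from their ratings, removing
    the participant-specific baseline level."""
    centred = {}
    for participant, values in ratings.items():
        baseline = mean(values)
        centred[participant] = tuple(v - baseline for v in values)
    return centred


def post_variance(ratings):
    """Population variance of the post-intervention ratings across participants."""
    return pvariance([post for _, post in ratings.values()])


if __name__ == "__main__":
    print("raw post variance:", post_variance(RATINGS))
    print("centred post variance:", post_variance(centre_within_participant(RATINGS)))
```

      Once each participant's baseline is removed, nearly all of the between-participant spread disappears, so group means computed on raw ratings say little about the within-person change.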

      Moreover, we have made sure to refer to the comparison with the placebo group when discussing the reduction or decrease in pain seen in the nicotine group, for example:

      “2) nicotine reduced prolonged heat pain intensity but not prolonged pressure pain intensity compared to placebo gum;”

      “The nicotine group had a decrease in heat pain ratings compared to the placebo group and increased PAF speed across the scalp from pre to post-gum, driven by changes at central-parietal and right-frontal regions.”

      We have kept our original comment of whether this effect on pain is meaningful in practice to refer to the minimum clinically or experimentally important differences in pain ratings as highlighted by Olsen et al., 2017.

      “While acknowledging the modest effect size, it’s essential to consider the broader context of our study’s focus. Assessing the clinical relevance of pain reduction is pertinent in applications involving the use of any intervention for pain management [69]. However, from a mechanistic standpoint, particularly in understanding the implications of and relation to PAF, the specific magnitude of the pain effect becomes less pivotal. Nevertheless, future research should examine whether effects on pain increase in magnitude with different nicotine administration regimens (i.e. dose and frequency).”
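
      The arithmetic behind comment (6) can be sketched as follows. The two mean changes and the threshold of 1 point on a 0-10 scale are the figures quoted in the reviewer's comment; the function names are hypothetical and the calculation is a plain difference of mean changes, not the mixed-model estimate used in the manuscript.

```python
# Mean post-minus-pre changes in heat pain ratings quoted in comment (6), 0-10 scale.
NICOTINE_CHANGE = 0.05
PLACEBO_CHANGE = 0.24
# Lower bound on the minimum important difference mentioned by the reviewer.
MID = 1.0


def relative_change(treatment_change, control_change):
    """Change in the treatment group relative to the control group."""
    return treatment_change - control_change


def fraction_of_mid(change, mid):
    """Size of a change expressed as a fraction of the minimum important difference."""
    return abs(change) / mid


if __name__ == "__main__":
    diff = relative_change(NICOTINE_CHANGE, PLACEBO_CHANGE)
    print(round(diff, 2), round(fraction_of_mid(diff, MID), 2))
```

      On these numbers the nicotine group sits about 0.19 points below placebo, roughly a fifth of the smallest difference usually treated as meaningful.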

      (7) In line with the topic of effect sizes, average effect sizes for PAF in the study cited in the manuscript range from around 1 Hz (Boord et al., 2008; Wydenkeller et al., 2009; Lim et al., 2016), to 2 Hz (Foulds et al., 1994), compared with changes of 0.06 Hz (Nicotine post - pre) or -0.01 Hz (Placebo post - pre). MIDs are not so clearly established for peak frequencies in EEG bands, but they should be certainly larger than some fractions of a Hertz (which is considerably below the reliability of the measurement).

      We appreciate your care of these nuances. We acknowledge the differences in effect sizes between our study and those referenced in the manuscript. Given the current state of the literature, it's noteworthy that ‘MIDs’ for peak frequencies in EEG bands, particularly PAF changes, are not clearly established, other than a recent publication suggesting that even small changes in PAF are reliable and meaningful (Furman et al., 2021). In light of this, we have addressed the uncertainty around the existence and determination of MIDs in our revision, highlighting the need for further research in this area.

      In addition, our study employed a greater frequency resolution (0.2 Hz) compared to some of the referenced studies, with approximately 0.5 Hz resolution (Boord et al., 2008; Wydenkeller et al., 2009; Foulds et al., 1994). This improved resolution allows for a more precise measurement of changes in PAF. Considering this, it is plausible that studies with lower resolution might have conflated increases in PAF, and our higher resolution contributes to a more accurate representation of the observed changes.

      We have also incorporated this insight into the manuscript, emphasising the methodological advancements in our study and their potential impact on the interpretation of PAF changes. Thank you for your thoughtful feedback.

      “The ability to detect changes in PAF can be considerably impacted by the frequency resolution used during Fourier Transformations, an element that is overlooked in recent methodological studies on PAF calculation [16,95]. Changes in PAF within individuals might be obscured or conflated by lower frequency resolutions, which should be considered further in future research.”
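
      The link between analysis window length and frequency resolution can be sketched as follows. The 500 Hz sampling rate, the 5 s and 2 s windows, the 8-12 Hz alpha band, and the centre-of-gravity estimator are all assumptions chosen to reproduce the 0.2 Hz and 0.5 Hz resolutions mentioned above; none of them is confirmed as the authors' pipeline.

```python
def frequency_resolution(sampling_rate_hz, n_samples):
    """Spacing between adjacent DFT bins: sampling rate divided by window length in samples."""
    return sampling_rate_hz / n_samples


def alpha_bins(resolution_hz, low_hz=8.0, high_hz=12.0):
    """Frequencies of the DFT bins that fall inside the assumed alpha band."""
    n_steps = int(round((high_hz - low_hz) / resolution_hz))
    return [low_hz + i * resolution_hz for i in range(n_steps + 1)]


def centre_of_gravity(freqs, power):
    """Power-weighted mean frequency, one common way to summarise a spectral peak."""
    return sum(f * p for f, p in zip(freqs, power)) / sum(power)


if __name__ == "__main__":
    fine = frequency_resolution(500.0, 2500)    # 5 s window at 500 Hz
    coarse = frequency_resolution(500.0, 1000)  # 2 s window at 500 Hz
    print(fine, len(alpha_bins(fine)))
    print(coarse, len(alpha_bins(coarse)))
```

      A longer window gives more bins across the alpha band, so a peak estimate is averaged over a finer grid; whether that is enough to resolve shifts of a few hundredths of a Hertz is the open question raised in comment (7).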

      (8) The authors also ran alternative statistical models to analyze the data and did not find consistent results in terms of PHP ratings (PAF modulation was still statistically significantly different). The authors attribute this to the necessity of controlling for covariates. Now, considering the effect sizes, aren't these statistically significant differences just artifacts stemming from the inclusion of too many covariates (Simmons et al., 2011)? How much influence should be attributable to depression and anxiety symptoms, stress, sleep quality and past pain, considering that these are healthy volunteers? Should these contrasting differences call the authors to question the robustness of the findings (i.e., whether the same data subjected to different analysis provides the same results), particularly when the results do not align with the preregistered hypothesis (PAF modulation should occur on sensorimotor ROIs)?

      Thank you for your comments on our alternative statistical models. By including these covariates, we aim to provide a more nuanced understanding of the complexities within our data by considering their potential impact on the effects of interest. The decision to include covariates was preregistered (apologies again that this was not available) and made with consideration of balancing model complexity and avoiding potential confounding. Moreover, we hope that the insights gained from these analyses will offer valuable information about the behaviour of our data and aid future research in terms of power calculations, expected variance, and study design.

      (9) Beyond that, I believe in some cases that the authors overreach in an attempt to provide explanations for their results. While I agree that sex might be a relevant covariate, I cannot say whether the authors are confirming a pre-registered hypothesis regarding the gender-specific correlation of PAF and pain, or if this is just a post hoc subgroup analysis. Given the large number of analyses performed (considering the main document and the supplementary files), caution should be exercised on the selective interpretation of those that align with the researchers' hypotheses.

      We chose to explore the influence of sex on the correlation between PAF and pain, because this has also been investigated in previous publications of the relationship (Furman et al., 2020). We state that the assessment by sex is exploratory in our results on p.17: “in an exploratory analysis of separate correlations in males and females (Figure 5, plot C)”. For clarity regarding whether this was a pre-registered exploration or not, we have adjusted this to be: “in an exploratory analysis (not pre-registered) of separate correlations in males and females (Figure 5, plot C), akin to those conducted in previous research on this topic (Furman et al., 2020),”

      We have made sure to state this in the discussion also. Therefore, when we previously said on p.22:

      “Regarding the relationship between PAF and pain at baseline, the negative correlation between PAF and pain seen in previous work [7–11,15] was only observed here for male participants during the PHP model for global PAF.” We have now changed this to: “Regarding the relationship between PAF and pain at baseline, the negative correlation between PAF and pain seen in previous work [7– 11,15] was only observed here for male participants during the PHP model for global PAF in an exploratory analysis.”

      Please also note that we altered the colour and shape of points on the correlation plot (Figure 5 in initial submission), the male brown was changed to a dark brown as we realised that the light brown colour was difficult to read. The shape was then changed for male points so that the two groups can be distinguished in grey-scale.

      Overall, your thoughtful feedback is instrumental in refining the interpretation of our findings, and we look forward to presenting a more comprehensive and nuanced discussion. Thank you for your comments.

      REFERENCES for responses to reviewer 3

      Arendt-Nielsen, L., & Yarnitsky, D. (2009). Experimental and clinical applications of quantitative sensory testing applied to skin, muscles and viscera. The Journal of Pain, 10(6), 556-572.

      Chowdhury, N. S., Skippen, P., Si, E., Chiang, A. K., Millard, S. K., Furman, A. J., ... & Seminowicz, D. A. (2023). The reliability of two prospective cortical biomarkers for pain: EEG peak alpha frequency and TMS corticomotor excitability. Journal of Neuroscience Methods, 385, 109766.

      Fishbain, D. A., Lewis, J. E., & Gao, J. (2013). Is There Significant Correlation between Self-Reported Low Back Pain Visual Analogue Scores and Low Back Pain Scores Determined by Pressure Pain Induction Matching?. Pain practice, 13(5), 358-363.

      Furman, A. J., Prokhorenko, M., Keaser, M. L., Zhang, J., Chen, S., Mazaheri, A., & Seminowicz, D. A. (2021). Prolonged pain reliably slows peak alpha frequency by reducing fast alpha power. bioRxiv, 2021-07.

      Heitmann, H., Ávila, C. G., Nickel, M. M., Dinh, S. T., May, E. S., Tiemann, L., ... & Ploner, M. (2022). Longitudinal resting-state electroencephalography in patients with chronic pain undergoing interdisciplinary multimodal pain therapy. Pain, 163(9), e997.

      McLain, N. J., Yani, M. S., & Kutch, J. J. (2022). Analytic consistency and neural correlates of peak alpha frequency in the study of pain. Journal of neuroscience methods, 368, 109460.

      Ngernyam, N., Jensen, M. P., Arayawichanon, P., Auvichayapat, N., Tiamkao, S., Janjarasjitt, S., ... & Auvichayapat, P. (2015). The effects of transcranial direct current stimulation in patients with neuropathic pain from spinal cord injury. Clinical Neurophysiology, 126(2), 382-390.

      Parker, T., Huang, Y., Raghu, A. L., FitzGerald, J., Aziz, T. Z., & Green, A. L. (2021). Supraspinal effects of dorsal root ganglion stimulation in chronic pain patients. Neuromodulation: Technology at the Neural Interface, 24(4), 646-654.

      Petersen-Felix, S., & Arendt-Nielsen, L. (2002). From pain research to pain treatment: the role of human experimental pain models. Best Practice & Research Clinical Anaesthesiology, 16(4), 667-680.

      Sarnthein, J., Stern, J., Aufenberg, C., Rousson, V., & Jeanmonod, D. (2006). Increased EEG power and slowed dominant frequency in patients with neurogenic pain. Brain, 129(1), 55-64.

      Sato, G., Osumi, M., & Morioka, S. (2017). Effects of wheelchair propulsion on neuropathic pain and resting electroencephalography after spinal cord injury. Journal of Rehabilitation Medicine, 49(2), 136-143.

      Sufianov, A. A., Shapkin, A. G., Sufianova, G. Z., Elishev, V. G., Barashin, D. A., Berdichevskii, V. B., & Churkin, S. V. (2014). Functional and metabolic changes in the brain in neuropathic pain syndrome against the background of chronic epidural electrostimulation of the spinal cord. Bulletin of experimental biology and medicine, 157(4), 462-465.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Weaknesses:

      The weaknesses of the study include the following. 

      (1) It remains unclear whether the function described for CDK2 is regulatory, that is, it affects TBK1 levels during physiological responses such as viral infection or cell cycle progression, or if it is homeostatic, governing the basal abundance of TBK1 but not responding to signaling.

      The regulation of TBK1 by CDK2 described in this article occurs during viral infection. Simultaneously, we also investigated the effects of CDK2 overexpression and knockdown on TBK1 levels under non-infected state and observed a slight reduction, as shown in Figure 4K and 4L. Thus, we speculate that the regulation of TBK1 by CDK2 serves, on one hand, to maintain cellular homeostasis and, on the other hand, to respond to signaling triggered by viral infection.

      (2) The authors have not explored whether the catalytic activity of CDK2 is required for TBK1 ubiquitination and, if so, what its target specificity is.

      We found that the ubiquitination modification of TBK1 was not affected by treatment with a CDK2 kinase activity inhibitor (SNS-032), as demonstrated in the results below (Author response image 1).

      Author response image 1.

      (3) Given the multitude of CDK isoforms in fish, it remains unexplored whether the identified fish CDK2 homolog is a requisite cell cycle regulator or if its action in the cell cycle is redundant with other CDKs.

      A comparison of the protein sequences of fish CDK2 and human CDK2 revealed a 90% similarity (Author response image 2). It has also been reported that the kinase activity of goldfish CDK2 significantly increases during oocyte maturation (ref. 1). Furthermore, UHRF1 phosphorylation by cyclin A2/CDK2 is crucial for zebrafish embryogenesis (ref. 2). Additionally, Red grouper nervous necrosis virus (RGNNV) infection activated the p53 pathway, leading to the upregulation of p21 and downregulation of cyclin E and CDK2, which forces infected cells to remain in the G1/S replicative phase (ref. 3). All this evidence suggests that fish CDK2 plays a vital role in cell cycle regulation, and there have been no reports of other CDKs demonstrating CDK2-like functions.

      References:

      (1) Hirai T, et al. (1992) Isolation and Characterization of Goldfish Cdk2, a Cognate Variant of the Cell-Cycle Regulator Cdc2. Developmental biology 152(1):113-120.

      (2) Chu J, et al. (2012) UHRF1 phosphorylation by cyclin A2/cyclin-dependent kinase 2 is required for zebrafish embryogenesis. Molecular biology of the cell 23(1):59-70. 

      (3) Mai WJ, Liu HX, Chen HQ, Zhou YJ, & Chen Y (2018) RGNNV-induced cell cycle arrest at G1/S phase enhanced viral replication via p53-dependent pathway in GS cells. Virus Res 256:142-152.

      Author response image 2.
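
      A pairwise sequence comparison of the kind summarised above can be sketched as follows. The sketch computes percent identity over an existing alignment, which is stricter than the "similarity" reported in the response (similarity usually also counts conservative substitutions); the function name and the toy strings are hypothetical and are not real CDK2 sequences.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of aligned, non-gap positions at which two sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for residue_a, residue_b in zip(aligned_a, aligned_b):
        # Skip columns where either sequence has a gap.
        if residue_a == gap or residue_b == gap:
            continue
        compared += 1
        if residue_a == residue_b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Toy aligned strings that differ at one of ten positions.
    print(percent_identity("ACDEFGHIKL", "ACDEFGHIKV"))
```

      Reporting which measure was used (identity or similarity) and which alignment method produced it makes a figure such as 90% easier to compare across species.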

      Reviewer #2 (Public Review):

      Weaknesses:

      (1) While the study focuses on fish, the broader implications for other lower vertebrates and higher vertebrates are not extensively discussed.

      Thanks to your comment, we have added a paragraph to the Discussion section of the manuscript regarding the implications of the negative regulation of IFN expression by fish CDK2 for other vertebrates (lines 398-403). The details are as follows: first, we selected representative species from each of the six major vertebrate groups and compared their CDK2 protein sequences, finding that they are over 90% similar to one another (Author response image 3). This suggests that the function of CDK2 may be conserved to some extent across vertebrates. Additionally, CDK2 inhibition has been shown to enhance anti-tumor immunity by increasing the IFN response to endogenous retroviruses (ref. 1). Our studies provide evidence that fish CDK2 inhibits the IFN response by promoting the ubiquitination and degradation of TBK1, strongly supporting the role of CDK2 in the regulation of the immune response.

      Reference:

      (1) Chen Y, et al. (2022) CDK2 Inhibition Enhances Antitumor Immunity by Increasing IFN Response to Endogenous Retroviruses. Cancer Immunol Res 10(4):525-539.

      Author response image 3.

      (2) The study heavily relies on specific fish models, which may limit the generalizability of the findings across different species.

      Thank you for your comment. First, we compared the amino acid sequences of CDK2 proteins from fish and other vertebrates, which show over 90% similarity. Moreover, the small size, low cost, and external development of zebrafish make it an excellent model for vertebrate developmental biology. It has been reported that due to the high genomic and molecular similarities between zebrafish and other vertebrates, including humans, many significant discoveries in zebrafish development are relevant to humans (ref. 2). Our study concentrated on CDK2 in zebrafish, and the findings should be valuable for other vertebrates.

      Reference:

      (2) Veldman MB & Lin S (2008) Zebrafish as a Developmental Model Organism for Pediatric Research. Pediatr Res 64(5):470-476.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      The following additional data/discussion could improve the manuscript.

      (1) Investigate whether the catalytic activity of CDK2 is required to regulate TBK1 abundance. It is common for E3 ligases to be directed towards phosphorylated substrates, so it would be of interest to know if CDK2 phosphorylates TBK1 to facilitate its recognition for ubiquitinylation.

      We examined the effect of CDK2 on the TBK1 protein after inhibiting its kinase activity with SNS-032 treatment and found that it could still affect TBK1 expression, as shown in the results below (Author response image 4). Our previous experiments investigating the effect of CDK2 on TBK1 did not show that CDK2 caused the migration of TBK1 bands (typically, proteins that undergo phosphorylation exhibit band migration). Furthermore, in this study, CDK2 did not function as an E3 ligase; instead, it recruited the E3 ligase Dtx4 to ubiquitinate TBK1.

      Author response image 4.

      (2) Investigate how CDK2 abundance is regulated by viral infection and whether viral infection impacts cell cycle progression in a CDK2-dependent manner.

      In fact, as illustrated in Figure 1, we investigated the changes in CDK2 at both the mRNA and protein levels following viral infection. Our findings revealed that SVCV infection resulted in an increase in CDK2 mRNA and protein expression. Additionally, our earlier reports have indicated that SVCV infection can induce alterations in the cell cycle, resulting in a notable increase in the S phase (Figure 1 of ref. 1). However, whether SVCV infection impacts cell cycle progression in a CDK2-dependent manner will be explored in our upcoming study.

      Reference:

      (1) Li S, et al. Spring viraemia of carp virus modulates p53 expression using two distinct mechanisms. PLoS Pathog 15, e1007695 (2019).

      (3) Provide data/discussion concerning the role of fish CDK2 in the regulation of cell cycle progression and whether this process is impacted by viral infection (part 1). Are TBK1 abundance and interferon production differentially regulated across the cell cycle due to the action of CDK2 (part 2).

      Thank you for your advice. This concern is addressed in two parts, as follows: 

      For part 1: To date, there has been limited research conducted on fish CDK2 in the regulation of cell cycle progression. The details are as follows: It has been reported that the kinase activity of goldfish CDK2 significantly increases during oocyte maturation (ref. 1). Furthermore, UHRF1 phosphorylation by cyclin A2/CDK2 is crucial for zebrafish embryogenesis (ref. 2). Additionally, a novel CDK2 homolog has been identified in Japanese lamprey, which plays a crucial role in apoptosis (ref. 3). Red grouper nervous necrosis virus (RGNNV) infection activates the p53 pathway, leading to the upregulation of p21 and downregulation of cyclin E and CDK2, which forces infected cells to remain in the G1/S replicative phase (ref. 4). All this evidence suggests that fish CDK2 plays a vital role in cell cycle regulation, and this process is also impacted by viral infection. Relevant content has been added to the Discussion section in the revised manuscript (lines 389-398).

      References:

      (1) Hirai T, et al. (1992) Isolation and Characterization of Goldfish Cdk2, a Cognate Variant of the Cell-Cycle Regulator Cdc2. Developmental biology 152(1):113-120.

      (2) Chu J, et al. (2012) UHRF1 phosphorylation by cyclin A2/cyclin-dependent kinase 2 is required for zebrafish embryogenesis. Molecular biology of the cell 23(1):59-70.

      (3) Xu Y, Tian Y, Zhao H, Zheng N, Ren KX, Li QW. A novel CDK-2 homolog identified in lamprey, with roles in apoptosis. Fish Physiol Biochem 47, 189-189 (2021). 

      (4) Mai WJ, Liu HX, Chen HQ, Zhou YJ, & Chen Y (2018) RGNNV-induced cell cycle arrest at G1/S phase enhanced viral replication via p53-dependent pathway in GS cells. Virus Res 256:142-152.

      For part 2: TBK1 plays a crucial role in regulating IFN production. Variations in CDK2 activity during different phases of the cell cycle may lead to changes in the expression and function of TBK1. Our findings suggest that heightened CDK2 activity may suppress TBK1 expression, thereby hindering the cell's capacity to produce IFN. Conversely, during the late phase of the cell cycle or in an inhibited state, TBK1 expression may rise, enhancing IFN synthesis and release. In summary, CDK2 is involved in intracellular signaling by modulating TBK1 levels and IFN production, affecting the cellular immune response and cycle regulation—two processes that are notably distinct at various stages of the cell cycle. Relevant content has been added to the Discussion section in the revised manuscript (lines 377-384).

      Minor suggestions:

      (1) The authors introduce their study with the consideration that knowledge of fish signaling pathways can inform mammalian biology because mammals evolved from fish. This is not strictly true, since mammals and fish both evolved from an ancient common ancestor and the diversification of signaling in each species likely occurred in response to distinct evolutionary selective pressures.

      Thank you for your suggestion. We have revised the statement in the manuscript to eliminate the notion that mammals evolved from fish (lines 98-99). The immune systems of higher vertebrates (e.g., humans) and lower vertebrates (e.g., fish) generally exhibit some consistency, although there are notable differences.

      (2) On line 210 and line 276, the authors appear to have misstated the data. CDK2 knockout increases not decreases TBK1 and Dtx4 knockdown abrogated rather than restored CDK2 suppression of TBK1.

      Thank you for the reminder. We drew the wrong conclusions in these two places (line 204 and line 267) and have corrected them as you suggested.

      Reviewer #2 (Recommendations For The Authors):

      The manuscript has some shortcomings that, if addressed, could improve the overall quality of the article.

      (1) Line 63-72, line 77-79, line 88-90- please add additional references for these sentences.

      Thanks to your comment, we have added references for these sentences (Line 63-72, line 77-79, line 88-90).

      (2) It is of the utmost importance to quantify the data presented in Figures 4J and 5D, as this will facilitate the visualization of the immunoblot.

      Thank you for your comment. We have quantified the data presented in Figures 4J and 5D to enhance the clarity of the immunoblot.

      (3) The scale in Figure 4E is difficult to discern.

      Thanks for your comment. To improve the visual clarity of the image, we have enlarged the scale label in Figure 4E.

      (4) In Figure 3B, shCDK2 is shown in italics, preferably in line with other standards such as Figures 3C and 3F.

      Thank you for your comment. We have revised the shCDK2 in Figure 3B.

      (5) It would be helpful to discuss the functions of CDK family members in immunity.

      Thanks for your suggestion. We have discussed the functions of CDK family members in immunity (lines 363-387). The details are as follows: Recent studies have demonstrated that CDK activity is crucial for virus-induced innate immune responses. Reports indicate that CDKs are involved in the Toll-like receptor (TLR) signaling pathway, the nuclear factor-κB (NF-κB) signaling pathway, and the JAK-STAT signaling pathway. For instance, CDK8 and/or CDK19 enhanced the transcription of inflammatory genes, such as IL-8 and IL-10, in cells following TLR9 stimulation. CDKs and NF-κB establish a remarkable paradigm where CDKs can act directly on substrate proteins rather than depending solely on transcriptional control. It has been reported that CDK1 serves as a positive regulator of the IFN-I signaling pathway, facilitating STAT1 phosphorylation, which subsequently boosts the expression of ISGs. Furthermore, inhibiting CDK activity has been shown to obstruct STAT phosphorylation, proinflammatory gene activation, and ISG mRNA induction in response to SeV infection. It is important to note that no evidence suggests the involvement of CDKs in RLR signaling pathways. This study has shown that fish CDK2 functions as a negative regulator of the key kinase TBK1, which is involved in the RLR signaling pathway. A better understanding of the relationship between CDK2 and RLR signaling pathways will enhance our grasp of the regulatory mechanisms of CDKs in antiviral innate immunity.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Amaral et al. presents a study investigating the mesoscale modelling and dynamics of bolalipids.

      Strengths:

      The figures in this paper are exceptional. Both those to outline and introduce the lipid types, but also the quality and resolution of the plots. The data held within also appears to be outstanding and of significant (hopefully) general interest.

      We thank the reviewer for their kind words and the appreciation of our work.

      Weaknesses:

      In the introduction, I would like to have read more specifics on the biological role of bolalipids. Archaea are mentioned, but this kingdom is huge - there must be specific species that can be discussed where bolalipids are integral to archaeal life. The authors should go beyond ’extremophiles’. In short, they should unpack why the general audience should be interested in these lipids, within a subset of organisms that are often forgotten about.

      Following the reviewer’s advice we have revised the introduction of the manuscript, in which we now discuss specific species (Sulfolobus acidocaldarius and Thermococcus kodakarensis) and how in these species bolalipids are integral to archaeal life. We explain that the ratio between bilayer and bolalipids, and the number of cyclopentane rings contained within bolalipids can change to adapt to the environment. The revised parts of the introduction read (p. 1):

      “Like bacteria and eukaryotes, archaea must keep their lipid membranes in a fluid state (homeoviscous adaptation). This is important even under extreme environmental conditions, such as hot and cold temperatures, or high and low pH values [7]. Because of this, many archaea adapt to changes in their environment by tuning the lipid composition of their membranes: altering the ratio between bola- and bilayer lipids in their membranes [8, 9] and/or by changing the number of cyclopentane rings in their lipid tails, which are believed to make lipid molecules more rigid [5]. For example, Thermococcus kodakarensis increases its tetraether bolalipid ratio from around 50% to over 80% when the temperature of the environment increases from 60 to 85 °C [10]. Along the same lines, the cell membrane of Sulfolobus acidocaldarius can contain over 90% of bolalipids with up to 8 cyclopentane rings at 70 °C and pH 2.5 [5, 11]. It is worth mentioning that in exceptional cases bacteria also synthesise bolalipids in response to high temperatures [12], highlighting that the study of bolalipid membranes is relevant not only for archaeal biology but also from a general membrane biophysics perspective.”

      Reviewer #2 (Public review):

      Summary:

      The authors aimed to understand the biophysical properties of archaeal membranes made of bolalipids. Bacterial and eukaryotic membranes are made of lipids that self-assemble into bilayers. Archaea, instead, use bolalipids, lipids that have two headgroups and can span the entire bilayer. The authors wanted to determine if the unique characteristics of archaea, which are often extremophiles, are in part due to the fact that their membranes contain bolalipids.

      The authors develop a minimal computational model to compare the biophysics of bilayers made of lipids, bolalipids, and mixtures of the two. Their model enables them to determine essential parameters such as bilayer phase diagrams, mechanical moduli, and the bilayer behaviour upon cargo inclusion and remodelling.

      The authors demonstrate that bolalipid bilayers behave as binary mixtures, containing bolalipids organized either in a straight conformation, spanning the entire bilayer, or in a u-shaped one, confined to a single leaflet. This dynamic mixture allows bolalipid bilayers to be very sturdy but also provides remodelling. However, remodelling is energetically more expensive than with standard lipids. The authors speculate that this might be why lipids were more abundant in the evolutionary process.

      Strengths:

      This is a wonderful paper, a very fine piece of scholarship. It is interesting from the point of view of biology, biophysics, and material science. The authors mastered the modelling and analysis of these complex systems. The evidence for their findings is really strong and complete. The paper is written superbly, the language is precise and the reading experience is very pleasant. The plots are very well-thought-out.

      Weaknesses:

      I would not talk about weaknesses, because this is really a nice paper. If I really had to find one, I would have liked to see some clear predictions of the model expressed in such a way that experimentalists could design validation experiments.

      We thank the reviewer for their very kind assessment. We incorporated their recommendations regarding experimental validation in the discussion section, as follows (p.14):

      “Our model makes a number of predictions that could be tested by experiment either in cells or in vitro. First, it predicts that a small increase in the fraction of archaeal bilayer lipids should be sufficient to soften a bolalipid-rich membrane. While this could be tested in the future, so far only very few studies have reported experimental analysis of archaeal membrane mixtures [18, 50]. Second, we observed that membranes with moderate bolalipid molecular rigidity k<sub>bola</sub> exhibit curvature-dependent bending rigidity. To experimentally verify this, one could extrude membrane tethers from cells while controlling for membrane tension. Finally, to get to the core mechanism underlying our findings, it will be important to develop experimental methods that will allow the fraction of U-shaped bolalipid conformers per leaflet to be imaged and measured.”

      Reviewer #3 (Public review):

      Summary:

      The authors have studied the mechanics of bolalipid and archaeal mixed-lipid membranes via comprehensive molecular dynamics simulations. The Cooke-Deserno 3-bead-per-lipid model is extended to bolalipids with 6 beads. Phase diagrams, bending rigidity, mechanical stability of curved membranes, and cargo uptake are studied. Effects such as the formation of U-shaped bolalipids, pore formation in highly curved regions, and changes in membrane rigidity are studied and discussed. The main aim has been to show how the mixture of bolalipids and regular bilayer lipids in archaeal membrane models enhances the fluidity and stability of these membranes.

      Strengths:

      The authors have presented a wide range of simulation results for different membrane conditions and conformations. For the most part, the analyses and their results are presented clearly and concisely. Figures, supplementary information, and movies very well present what has been studied. The manuscript is well-written and is easy to follow.

      We thank the reviewer for the detailed assessment of our work and their constructive feedback.

      Major issues

      R3.Q1: The Cooke-Deserno model, while very powerful for biophysical analysis of membranes at the mesoscale, is very much void of chemical information. It is parametrized such that it is good in producing fluid membranes and predicting values for bending rigidity, compressibility, and even thermal expansion coefficient falling in the accepted range of values for bilayer membranes. But it still represents a generic membrane. Now, the authors have suggested a similar model for the archaeal bolalipids, which have chemically different lipids (the presence of cyclopentane rings for one), and there is no good justification for using the same pairwise interactions between their representative beads in the coarse-grained model. This does not necessarily diminish the worth of all the authors’ analyses. What is at risk here is the confusion between “what we observe this model of bolalipid- or mixed-membranes do” and “how real bolalipid-containing archaeal membranes behave at these mechanical and thermal conditions.”

      As the reviewer correctly notes, Cooke and Deserno used a minimal model, devoid of chemical detail, to represent fluid lipid membranes composed of bilayer lipids. Indeed, archaeal lipids are chemically different from non-archaeal lipids, but just like non-archaeal lipids, they can be very different from one another. Given the chemical diversity among bolalipids, instead of representing their complexity in a complicated model with many experimentally unconstrained parameters, we here defined a minimal model for bolalipids. The power of this minimal model is to represent the key physical/geometrical characteristics of archaeal membranes, namely the fact that lipid heads on two sides of the membrane are often connected, that bolalipids can exhibit a conformational change, and that bolalipids mix with some percentage of bilayer molecules. We then ask a general question: how do these unique geometrical characteristics of archaeal membranes influence their mechanics and reshaping? The reviewer is however right in pointing out that a model, regardless of its level of detail (atomistic, coarse-grained, minimal), is still a model.

      Our approach of extending an established coarse-grained model for bilayer lipids to bolalipids is further supported by experimental observations, which report that archaeal bilayer lipids can form membranes of comparable bending rigidity to those of non-archaeal bilayer membranes [53]. Hence, different lipid linkages (archaeal vs. non-archaeal) give rise to fluid, deformable membranes of not too dissimilar rigidities, suggesting that both archaeal and non-archaeal bilayer lipids can be represented by a similar minimal coarse-grained model for the purpose of mesoscopic biophysical investigations. Since archaeal bolalipids have the same core chemical structure as two archaeal bilayer lipids joined by their tail ends, similarly we model a bolalipid by joining two bilayer lipids. Such an approach also efficiently enables us to compare bolalipid with bilayer membranes, and connect to the large body of knowledge on the physics of bilayer membranes.

      To conclude, our coarse-grained model is indeed intended to capture the main physical properties of bolalipid membranes, and not their chemical diversity.

      R3.Q2: Another more specific, major issue has to do with using the Hamm-Kozlov model for fitting the power spectrum of thermal undulations. The 1/q<sup>2</sup> term can very well be attributed to membrane tension. While a barostat is indeed used, have the authors made absolutely sure that the deviation from 1/q<sup>4</sup> behaviour does not correspond to lateral tension?

      To the casual observer, any 1/q<sup>2</sup> trend might point at membrane tension. However, the precise functional form is relevant as it determines whether the 1/q<sup>2</sup> dominates the 1/q<sup>4</sup> trend for small or large values of the wave number q in the fitted power spectrum.

      The first model (including lipid tilt) exhibits the functional form 1/(kq<sup>4</sup>) + 1/(k<sub>θ</sub>q<sup>2</sup>). In contrast, the second model (including membrane tension) exhibits the functional form 1/(kq<sup>4</sup> + Σq<sup>2</sup>). Importantly, the two models obey different functional forms. Here k and k<sub>θ</sub> are the bending and tilt moduli, which are assumed positive, and Σ is the membrane tension, which can be either positive or negative. For the first model (with tilt), while for small q the amplitude is proportional to q<sup>-4</sup>, for large q the amplitude is proportional to q<sup>-2</sup>. In contrast, for the second model (with positive tension), while for small q the amplitude is proportional to q<sup>-2</sup>, for large q the amplitude is proportional to q<sup>-4</sup>. If membrane tension were negative in the second model, the slope would cross from negative infinity for small q to -4 for large q. The functional dependencies are summarized in Author response image 1A.
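      The opposite limiting behaviour of the two functional forms can be checked numerically. The following is a minimal sketch, not the fitting code used for the manuscript: the thermal prefactor is omitted, and the parameter values (bending modulus, tilt modulus, tension all set to 10) are arbitrary illustrative assumptions chosen only to make the crossover visible.

```python
import math

def tilt_spectrum(q, kappa, kappa_theta):
    """Tilt model: 1/(kappa q^4) + 1/(kappa_theta q^2), prefactor omitted."""
    return 1.0 / (kappa * q**4) + 1.0 / (kappa_theta * q**2)

def tension_spectrum(q, kappa, sigma):
    """Tension model: 1/(kappa q^4 + sigma q^2), prefactor omitted."""
    return 1.0 / (kappa * q**4 + sigma * q**2)

def log_slope(f, q, eps=1e-4):
    """Numerical log-log slope d ln f / d ln q via a central difference."""
    upper = f(q * math.exp(eps))
    lower = f(q * math.exp(-eps))
    return (math.log(upper) - math.log(lower)) / (2.0 * eps)

# Illustrative (hypothetical) parameters, not values from the simulations.
KAPPA, KAPPA_THETA, SIGMA = 10.0, 10.0, 10.0

def tilt(q):
    return tilt_spectrum(q, KAPPA, KAPPA_THETA)

def tension(q):
    return tension_spectrum(q, KAPPA, SIGMA)

for q in (1e-3, 1e3):
    print(f"q = {q:g}: tilt slope = {log_slope(tilt, q):.2f}, "
          f"tension slope = {log_slope(tension, q):.2f}")
```

      With these values, the tilt model has a slope near -4 at small q and near -2 at large q, whereas the tension model with positive tension shows the reverse ordering, which is the distinction the fits rely on.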

      For rigid bolalipid membranes, it is clearly visible that the magnitude of the log-log slope of the power spectrum plotted against the wave number q decreases with increasing q (Author response image 1B). While the slope is initially close to -4, it gradually approaches -2 for larger values of q. We conclude that only the model including lipid tilt can fit the power spectrum of membrane fluctuations appropriately (solid-dashed line), whereas the model with tension fails to fit the data (dashed line). We note that the combined model containing both lipid tilt and membrane tension does not give a better fit (dotted line).

      To demonstrate that the tension model cannot fit the data, we included the best fits for both models for rigid bolalipid membranes in the new SI section 16 (p. S22) and show that only the tilt model leads to acceptable fits. We also measured the projected membrane tension Σ = −L<sub>z</sub>(P<sub>x</sub> + P<sub>y</sub>)/2, where P<sub>x</sub> and P<sub>y</sub> are the pressures in the x and y directions, respectively, and L<sub>z</sub> is the dimension of the simulation box along the z axis. We found the projected membrane tension to be negligible, similar to the value that we indirectly measured by fitting a combined model with both tension and tilt, further confirming our conjecture.

      Author response image 1.

      (A) Schematic showing the decay of the power spectrum as a function of the wave number q in the tilt model (top), in the tension model with positive membrane tension (middle), and in the tension model with negative membrane tension (bottom). (B) Fitted power spectrum as a function of q for rigid bolalipid membranes (k<sub>bola</sub>=5k<sub>B</sub>T). The fit shows that while the model with tension (dashed line) cannot fit the data, the model with tilt nicely fits the spectrum (solid-dashed line). The combined model including both tension and tilt does not fit the spectrum any better (dotted line).

      R3.Q3: I got more worried when I noticed in the SI that the simulations had been done with combined ”fix langevin” and ”fix nph” LAMMPS commands. This combination does not result in a proper isothermal-isobaric ensemble. The importance of tilt terms for bolalipids is indeed very interesting, but I believe more care is needed to establish that.

      In what follows, we show that there is no reason to worry. First of all, we want to clarify that the physical setup we simulate is that of a membrane contained in a heat bath under negligible tension with correct diffusional dynamics. To achieve this physical setup, we use a Langevin thermostat combined with pressure control via an overdamped barostat, which we implement in LAMMPS by combining “fix langevin” and “fix nph”.

      In more detail: we simulated particles in an implicit solvent, for which we use a Langevin thermostat to get the right diffusional dynamics. To apply the theory of fitting fluctuation spectrums, the simulation box length needs to be (near) constant. However, simulating membranes at a fixed box size results in an average non-zero membrane tension, making it hard to measure bending rigidity. The reason is that the effect of membrane tension is most influential on the largest wavelength modes, which are also most decisive when determining mechanical membrane properties like membrane rigidity. To minimize the effect of tension, we perform our simulation with an overdamped barostat (𝜏<sub>baro</sub> = 10𝜏<sub>langevin</sub>), which keeps the membrane near tensionless, as also done before [32]. In the revised manuscript, we have clarified the statement on the physical ensemble used (p.S2):

      “For simulating flat membrane patches of bolalipids, we combined the previously used Langevin thermostat with relaxation time of 1𝜏 with a Nosé–Hoover barostat with relaxation time of 10𝜏. In LAMMPS this amounts to combining the commands ’fix langevin’ with ’fix nph’. We configured the barostat to set lateral pressure P<sub>xy</sub> to zero by re-scaling the simulation box in the x-y plane. We compare this setup to a fixed box length setup, and an NPT ensemble setup, in SI section 17.”

      To connect our results with statistical mechanics ensemble theory we tested alternative setups. Similar setups, including the formal isothermal-isobaric ensemble, where N,P,T are kept constant using Nose-Hoover style equations for thermostating and barostating with modern corrections [34], which the reviewer refers to, result in very similar fluctuation spectrums. Consequently, our measurements of bending and tilt modulus hold true regardless of the integration scheme. However, such a setup does not correctly capture implicit solvent and diffusional dynamics.

      In even more detail: we tested our setup (implemented via “fix langevin” + “fix nph”) versus an isothermal-isobaric ensemble (implemented via “fix npt”). We measured the mean and standard deviation of the volume, and found them to match for a reference LJ gas.

      To be completely sure, and to please the reviewer, we have performed additional verifications in the new SI section 17, which we summarize in the following. We simulated three representative membranes with different integration schemes: “fix npt”, “fix langevin” + “fix nph”, and “fix langevin” (Langevin dynamics with projected area fixed at the average value obtained from a “langevin+nph” run). We checked that the “fix nph” barostat is merely equilibrating the membrane to a tensionless configuration, after which the projected membrane area (A<sub>p</sub> = L<sub>x</sub>L<sub>y</sub>) is practically constant. Consequently, the different schemes resulted in minor changes in the longest wavelength modes that we tracked down to small changes in the negligible tension. The resulting measurements of bending modulus change by less than 10%, and our main text conclusions do not change. Author response image 2 compares the fluctuation spectrums for the different integration schemes.

      Author response image 2.

      Height fluctuation spectrum, for a bilayer membrane at T<sub>eff</sub> =1.1, simulated with Langevin dynamics (pink, ‘langevin‘), our setup (purple, ‘nph+langevin‘), and under an isothermal-isobaric ensemble (blue, ‘npt‘); fits are shown as dotted lines.

      R3.Q4: This issue is reinforced when considering Figure 3B. These results suggest that increasing the fraction of regular lipids increases the tilt modulus, with the maximum value achieved for a normal Cooke-Deserno bilayer void of bolalipids. But this is contradictory. For these bilayers, we don’t need the tilt modulus in the first place.

      We understand the concern why this might be counter-intuitive, and we thank the reviewer for pointing it out. We first want to stress that the tilt modulus can also be measured for bilayer membranes even if it is not needed to fit the fluctuation spectrum. If we measure the tilt modulus for a bilayer membrane, we obtain a value similar to the previously measured one [36]. Importantly, here we also report measurements for the tilt modulus for bolalipid membranes.

      To understand the seemingly contradictory behaviour of the tilt modulus, it is insightful to rewrite the expression for the fluctuation spectrum as done in Eq. (1):

      ⟨|h<sub>q</sub>|<sup>2</sup>⟩ ∝ 1/(𝜅q<sup>4</sup>) + 1/(𝜅<sub>𝜃</sub>q<sup>2</sup>) = (1 + l<sub>𝜃</sub><sup>2</sup>q<sup>2</sup>)/(𝜅q<sup>4</sup>),

      where l<sub>𝜃</sub> = √(𝜅/𝜅<sub>𝜃</sub>) is a characteristic length scale related to tilt, which we call the tilt persistence length. From the last equation it is easy to see that the tilt modulus 𝜅<sub>𝜃</sub> becomes relevant for the fluctuation spectrum if the tilt persistence length l<sub>𝜃</sub> is not negligible. In other words, we have to consider the tilt modulus 𝜅<sub>𝜃</sub> as relevant if it is sufficiently small compared to the bending rigidity 𝜅.

      However, this is not only counter-intuitive, but also difficult to communicate graphically. Following the reviewer’s excellent suggestion, to make the interpretation more accessible, we converted the tilt modulus in the main text and its figures to the more directly interpretable tilt persistence length l<sub>𝜃</sub>, which is small when tilt is irrelevant (for bilayer lipids and flexible bolalipids) and large otherwise (for rigid bolalipids). This includes changes to the main text on p.6 and p.8, and to the insets in Figs. 2C and 3B. We note that for completeness we also report the tilt modulus 𝜅<sub>𝜃</sub> in the SI.
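      The role of the tilt persistence length can be made concrete with a short calculation. The sketch below assumes the definition l<sub>𝜃</sub> = √(𝜅/𝜅<sub>𝜃</sub>), which follows from factoring the bending term out of the tilt-model spectrum; the function names and the numerical moduli are hypothetical and serve only as illustration.

```python
import math

def tilt_persistence_length(kappa, kappa_theta):
    """Assumed definition: l_theta = sqrt(kappa / kappa_theta)."""
    return math.sqrt(kappa / kappa_theta)

def tilt_to_bending_ratio(q, kappa, kappa_theta):
    """Ratio of the tilt term 1/(kappa_theta q^2) to the bending term
    1/(kappa q^4), which equals (l_theta * q)^2."""
    return (tilt_persistence_length(kappa, kappa_theta) * q) ** 2

# Hypothetical moduli: a soft membrane with a stiff tilt response, and a
# stiff membrane with a soft tilt response.
soft_membrane = tilt_persistence_length(kappa=10.0, kappa_theta=40.0)
stiff_membrane = tilt_persistence_length(kappa=40.0, kappa_theta=10.0)
print(soft_membrane, stiff_membrane)
```

      In this picture, a large tilt modulus relative to the bending rigidity gives a short l<sub>𝜃</sub>, so the tilt term only matters at wave numbers above 1/l<sub>𝜃</sub>, which may lie outside the fitted range; a small tilt modulus relative to the bending rigidity pushes that crossover into the fitted range.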

      R3.Q5: Also, from the SI, I gathered that the authors have neglected the longest wavelength mode because it is not equilibrated. If this is indeed the case, it is a dangerous thing to do, because with a small membrane patch, this mode can very well change the general trend of the power spectrum. As a lot of other analyses in the manuscript rely on these measurements, I believe more elaboration is in order.

      We thank the reviewer for the careful examination of our supplementary material. For each fluctuation spectrum measurement, we ran multiple replicas. We observed that the largest wavelength modes were not fully equilibrated. In the simulations the first mode of the fluctuation spectrum is probed at different amplitudes and phases. We thus expected the potential systematic error would show up clearly when comparing spectrums of the different replicas. As we saw no correlation in these systematic offsets between replicas, we concluded that the simulations are sufficiently equilibrated and we could safely exclude the first mode of the fluctuation spectrum from our analysis.

      To show without doubt that this procedure does not bias our results, we also ran simulations for three representative membranes until all modes were equilibrated. On the modes previously equilibrated, the resulting spectrums agree with our previous shorter simulations. On the largest wavelength modes that were previously not fully equilibrated, we noticed a small deviation from theory, specifically for flexible membranes (small bending modulus). These small deviations can be explained by including a negligible negative tension. Importantly, however, the resulting bending modulus 𝜅 stays nearly the same. We note that the small negative tension disappears when we halve the timestep (see Author response image 3). This verification is shown in SI section 17.

      R3.Q6: The authors have found that ”there is a strong dependency of the bending rigidity on the membrane mean curvature of stiffer bolalipids.” The effect is negative, with the membrane becoming less stiff at higher mean curvatures. Why is that? I would assume that with more flexible bolalipids, the possibility of reorganization into U-shaped chains should affect the bending rigidity more (as Figure 2E suggests). While for a stiff bolalipid, not much would change if you increase the mean curvature. This should be either a tilt effect, or have to do with asymmetry between the leaflets. But on the other hand, the tilt modulus is shown to decrease with increasing bolalipid rigidity. The authors get back to this issue only on page 10, when they consider U-shaped lipids in the inner and outer leaflets and write, ”this suggested that an additional membrane-curving mechanism must be involved.” But then again, in the Discussion, the authors write, ”It is striking that membranes made from stiffer bolalipids showed a curvature-dependent bending modulus, which is a clear signature that bolalipid membranes exhibit plastic behaviour during membrane reshaping,” adding to the confusion.

      Author response image 3.

      Height fluctuation spectrum, for a bilayer membrane at T<sub>eff</sub> = 1.1, as simulated in the main text (grey, for 60×10<sup>3</sup>τ), for a longer duration of 1.44×10<sup>6</sup>τ (pink), and with the longer duration and a halved timestep of 0.005τ (purple); fits are shown as dotted lines (tension and tilt) or dash-dot lines (tilt only).

      We thank the reviewer for asking this important question. Membrane bending rigidity in bolalipid membranes decreases dramatically once a small fraction of U-shapes is allowed to form, but then plateaus once this U-shape fraction reaches 20%. In a curved bolalipid membrane, U-shapes must accumulate in the outer leaflet to accommodate the area difference. Together, the non-linear dependence of the bending rigidity on U-shape fraction and the promotion of U-shapes by curvature explain why, in a membrane made of moderately stiff bolalipids (k<sub>bola</sub> = 1k<sub>B</sub>T), which contains very few U-shapes in the flat state, the bending rigidity of the membrane decreases as curvature increases. By contrast, in a membrane made of flexible bolalipid molecules (k<sub>bola</sub> = 0), where many U-shapes are present in the flat membrane, the bending rigidity does not change with curvature.

      Bending rigidity 𝜅 in flat membranes composed of bolalipids decreases dramatically once a small fraction of U-shapes is allowed to form, but plateaus once more than 20% of U-shaped bolalipids are present. In detail, our data show that with increasing bolalipid molecular rigidity k<sub>bola</sub>, the number of U-shaped bolalipids decreases (Fig. 2B) and the membrane rigidity 𝜅 increases (Fig. 2C). Thus, the correlation suggests that U-shaped bolalipids soften the membrane in a non-linear way, where most of the change in membrane bending rigidity happens for a U-shaped bolalipid fraction < 20% (Figure S11).

      Separately, membrane curvature affects the area difference between curved membrane leaflets and thus drives U-shape accumulation. To be specific, a cylindrical membrane with area A, mean curvature H and thickness h has an outer leaflet with area A(1 + Hh) and an inner leaflet with smaller area A(1 − Hh). This area difference can be large, in our simulations up to an area change of Hh = 25%. For pure bolalipid membranes, straight bolalipids occupy the same space in each leaflet. An area difference can then be achieved only by having a different number of U-shaped bolalipids in each leaflet, which can result in a different U-shape fraction between leaflets and thus “asymmetry between leaflets”. Figure S10 confirms a U-shape head fraction asymmetry that increases with curvature, for both flexible (k<sub>bola</sub> = 0) and moderately stiff bolalipids (k<sub>bola</sub> = 1k<sub>B</sub>T).
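      As a worked illustration of the leaflet area argument, the snippet below evaluates the outer and inner leaflet areas A(1 + Hh) and A(1 − Hh) for the largest case quoted, Hh = 25%. It is only a sketch: the function name and the individual values of A, H and h are hypothetical, chosen so that their product Hh equals 0.25.

```python
def leaflet_areas(area, mean_curvature, thickness):
    """Return (outer, inner) leaflet areas of a curved membrane,
    using A(1 + Hh) for the outer and A(1 - Hh) for the inner leaflet."""
    hh = mean_curvature * thickness
    return area * (1.0 + hh), area * (1.0 - hh)

# Hypothetical numbers in simulation units: H * h = 0.125 * 2.0 = 0.25.
outer, inner = leaflet_areas(area=100.0, mean_curvature=0.125, thickness=2.0)
print(f"outer = {outer}, inner = {inner}, outer/inner = {outer / inner:.3f}")
```

      At Hh = 0.25 the outer leaflet must cover about 1.67 times the area of the inner one. Because straight bolalipids contribute equally to both leaflets, this mismatch has to be absorbed by U-shaped molecules.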

      Together, these two effects result in membrane softening under curvature for the moderately stiff bolalipids, but constant rigidity for flexible bolalipids (Fig. 2F). In detail: for membranes composed of moderately stiff bolalipid molecules (k<sub>bola</sub> = 1k<sub>B</sub>T), the U-shape bolalipid head fraction only increases in the outer leaflet, going from 10 to 20% (Figure S10). This is in the high sensitivity region where the bending rigidity is expected to change the most (Figure S11). We hypothesize that the molecular rigidity of a U-shaped bolalipid creates compression on the outer leaflet that stabilizes the membrane curvature and thus causes membrane softening. We suspect that for membranes composed of rigid bolalipids (k<sub>bola</sub> > 1k<sub>B</sub>T), the effect is likely not present due to the absence of U-shape formation even under strong bending.

      By contrast, for membranes composed of flexible bolalipids (k<sub>bola</sub> = 0), the U-shaped bolalipid head fraction changes relatively little from its value for flat membranes (from 50% to 60% and 40% for the outer and inner leaflet, respectively; Figure S10). This is in the region where the membrane bending rigidity is expected to respond weakly to U-shape fraction (Figure S11). Additionally, the change is symmetric, so presumably the outer leaflet becomes softer as the inner leaflet becomes stiffer, thus creating opposing effects and only weakly affecting the membrane bending rigidity as a whole. We note that the distinction between the U-shape head fraction that we plot (Figure S10) and U-shape fraction (Figure S11) matters little for this analysis.

      We have added this deduction and its plots to SI section 8, and revised the corresponding statement in the main text accordingly (p.7 ).

      “Changing membrane curvature alters the area differently in the two membrane leaflets. To adapt to the area difference, we thus expect the fraction of U-shaped bolalipids to change as the membrane curvature changes. Moreover, the results of Fig. 2B and Fig. 2C showed that the U-shaped bolalipid fraction and the membrane bending rigidity are correlated. As a result, we predict that the fraction of straight versus U-shaped bolalipids in a membrane will change in response to membrane bending, in a way that makes the bending rigidity of a bolalipid membrane curvature dependent.”

      R3.Q7: This issue is repeated when the authors study nanoparticle uptake. They write: “to reconcile these seemingly conflicting observations we reason that the bending rigidity, similar to Figure 2F, is not constant but softens upon increasing membrane curvature, due to dynamic change in the ratio between bolalipids in straight and U-shaped conformation. Hence, bolalipid membranes show striking plastic behaviour as they soften during reshaping.” But the softening effect that they refer to, as shown in Figure 4B, occurs for very stiff bolalipids, for which not much switching to U-shaped conformation should occur.

      We thank the reviewer for locating a particularly dense sentence. We changed the text to explicitly refer to the range k<sub>bola</sub> ∈ [0,2] k<sub>B</sub>T, for which there is a significant change in U-shape fraction (p.8):

      “To reconcile these seemingly conflicting observations we reason that the bending rigidity κ, similar to Fig. 2F, is not constant but softens in the range k<sub>bola</sub> ∈ [0,2] k<sub>B</sub>T, upon increasing membrane curvature. This is due to the dynamic change in the ratio between bolalipids in straight and U-shaped conformation.”

      As for Fig. 4B, for k<sub>bola</sub> > 2k<sub>B</sub>T pores form, which explains the plateau in adsorption energy.

      R3.Q8: Another major issue is with what the authors refer to as the ”effective temperature”. While plotting phase diagrams for kT/eps value is absolutely valid, I’m not a fan of calling this effective temperature. It is a dimensionless quantity that scales linearly with temperature, but is not a temperature. It is usually called a ”reduced temperature”. Then the authors refer to their findings as studying the stability of archaeal membranes at high temperatures. I have to disagree because eps is not the only potential parameter in the simulations (there are at least space exclusion and angle-bending stiffnesses) so one cannot identify changing eps with changing the global simulation temperature. This only works when you have one potential parameter, like an LJ gas.

      We indeed thought about this before and found that it makes little difference in our set-up. To thoroughly show that the distinction matters very little, per reviewer’s question, we computed our phase diagrams by scaling temperature T explicitly (and not lipid tail interactions T<sub>eff</sub> = k<sub>B</sub>T /ϵ<sub>p</sub>). We added these results to the SI section 14 and found no significant difference when comparing scaling tail interactions (Figure S15A) with scaling temperature explicitly (Figure S15B).

      We also computed Fig. 2A-C for scaling interactions (Figure S17A) and scaling temperature explicitly (Figure S17B). We found a slightly increased U-shaped bolalipid fraction for low k<sub>bola</sub> when comparing scaling interactions (Figure S17A) with temperature scaling (Figure S17B). The reason is that the U-shaped fraction depends on temperature, as at higher temperature bolalipids can more easily transition into the U-shape. Most importantly, however, we found no qualitative changes in the liquid region or the mechanical membrane properties when we compared the different scaling variants.

      The reason why both scaling variants match so well can be understood easily. All pair potentials, including volume exclusion interactions between head beads and other membrane beads, were also scaled in the same manner as tail-to-tail interactions, as described in the SI. In contrast, the energy scales for maintaining the lipid bonds, the bilayer lipid angles and the bolalipid angles are relatively large compared to the energy scales involved in tail-to-tail interactions. This separation of energy scales guarantees that there will be little effect when increasing global temperature. Regarding nomenclature, we take the reviewer’s advice and have added ’reduced temperature’ as an alias for T<sub>eff</sub> in the main text.
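      The separation-of-energy-scales argument can be illustrated by expressing the pair and bonded energy scales in thermal units under the two scaling variants. This is a minimal sketch under stated assumptions: the bonded stiffness of 30 and the target T<sub>eff</sub> of 1.3 are hypothetical illustrative numbers, not parameters taken from the SI, and the helper function is introduced here only for this example.

```python
def in_thermal_units(eps_pair, k_bonded, k_b_t):
    """Return (pair, bonded) energy scales divided by k_B T."""
    return eps_pair / k_b_t, k_bonded / k_b_t

T_EFF = 1.3        # hypothetical target reduced temperature
K_BONDED = 30.0    # hypothetical stiffness of bonded/angle terms

# Variant A: scale all pair interactions, keep k_B T = 1.
pair_a, bonded_a = in_thermal_units(1.0 / T_EFF, K_BONDED, 1.0)
# Variant B: keep pair interactions, scale the global temperature.
pair_b, bonded_b = in_thermal_units(1.0, K_BONDED, T_EFF)

print(f"pair:   A = {pair_a:.3f}, B = {pair_b:.3f}")
print(f"bonded: A = {bonded_a:.1f}, B = {bonded_b:.1f}")
```

      The pair scale in thermal units is identical in both variants by construction. The bonded scale differs by the factor T<sub>eff</sub>, but as long as it stays much larger than the pair scale in either variant, bonds and angles remain effectively stiff, which is consistent with the small differences reported between the two phase diagrams.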

      In the revised version of the manuscript, we mention these observations in the SI section 14 and point towards these results in the main text (p.4 ):

      “This interaction strength governs the membrane phase behaviour and can be interpreted as the effective temperature or reduced temperature T<sub>eff</sub> = k<sub>B</sub>T /ϵ<sub>p</sub>. As the distinction between scaling interactions (T<sub>eff</sub>) or temperature (T) is not important for our analysis (see Supplemental Information (SI) section 14), for simplicity we refer to T<sub>eff</sub> as temperature in the following.”

      Minor issues

      R3.Q9: As the authors have noted, the fact that the membrane curvature can change the ratio of U-shaped to straight bolalipids would render the curvature elasticity non-linear (though the term ”plastic” should not be used, as this is still structurally reversible when the stress is removed. Technically, it is hypoelastic behaviour, possibly with hysteresis.) With this in mind, when the authors use essentially linear elastic models for fluctuation analysis, they should make a comparison of maximum curvatures occurring in simulations with a range that causes significant changes in bolalipid conformational ratios.

      We thank the reviewer for their suggestion on calling the non-linear behaviour of the curvature elasticity hypoelastic. We have edited the main text accordingly (p.8 ):

      “In an elastic material, the strain modulus holds constant and deformation is reversible. For bolalipid membranes at k<sub>bola</sub> = 1k<sub>B</sub>T, however, the bending modulus decreases when deformation increases, rendering bolalipid membranes hypoelastic.”

      Moreover, regarding the maximum curvatures occurring in the fluctuation simulations, we first note that the ensemble average of the mean curvature H from the fluctuation measurements is indicated as a vertical line in Fig. 2F. As the average value is nearly zero, the membrane can be considered flat to a good approximation. To investigate the question in more detail, we extended the SI with a careful analysis of the maximum membrane curvature and of the validity of the Monge gauge approximation (SI section 15).

      In short, we found that the involved membrane curvatures are small and therefore are unlikely to trigger any significant changes of the bending modulus. Moreover, since we are dealing with two bolalipid conformations, we also tested the homogeneity of the membrane. In our simulations of flat membrane patches we did not observe clustering or phase separation between the two bolalipid conformations beyond the [2,3]σ range. Furthermore, we get good agreement between our fluctuation measurement and the cylinder simulations in Fig. 2F. We now mention this verification in the revised version of the manuscript (p.8 ):

      “Fortunately, this dependency on curvature does not invalidate our fluctuation results, where the curvature is small enough that its effect on the bending modulus is negligible (SI section 15).”

      Last but not least, simulating bending/unbending cycles of an arc-shaped membrane (frozen endpoints) shows agreement with cylinder membrane simulations, and no hysteresis at the rates of deformation employed (cf. M. Amaral’s thesis [54], soon to be out of the embargo period).

      R3.Q10: The Introduction section of the manuscript is written with a biochemical approach, with very minor attention to the simulation works on this system. Some molecular dynamics works are only cited as existing previous work, without mentioning what has already been studied in archaeal membranes. Some information, like the binding of ESCRT proteins to archaeal membranes, though interesting, helps little to place the study within the discipline. The Introduction should be revised to show what has already been studied with simulations (as the authors mention in the Discussion) and how the presented research complements it.

      The present research covers, for the first time, archaeal membranes with a single coarse-grained model that can assume both in-membrane bolalipid conformations, and it sweeps through temperature, membrane composition, and molecular rigidity. The work shows the first curvature-dependent bending modulus for pure bolalipid membranes. It also systematically investigates the bending modulus and the Gaussian modulus, and tests the model in an all-encompassing budding simulation that incorporates topology changes. Existing atomistic or coarse-grained MD simulations (MARTINI or similar force fields) are limited to small membrane patches, with no study of large-scale deformations or topology changes; in addition, they rely on force fields that were parametrized for bilayer membranes.

      To give a comprehensive overview of the field, we revised the introduction section of the manuscript, in which we now discuss previous computational work investigating membrane diffusivity, U-shaped lipid fraction, and bending rigidity (p.3 ):

      “By contrast, only a few studies have investigated bolalipid membranes applying computational or theoretical tools [24, 25]. Specifically, the pore closure time in bolalipid membranes, and the role of cyclopentane rings for membrane properties has been investigated using all-atom simulations, showing decreased lateral mobility, reduced permeability to water, and increased lipid packing [26–28]. Moreover, using coarse-grained simulations, it was suggested that bolalipid membranes are thicker [29], exhibit a gel-to-liquid phase transition at higher temperature [30], and exhibit a reduced diffusivity [31]. However, little research has been devoted to investigating mechanics and reshaping of bolalipid membranes at the mesoscale despite the obvious importance of this question from evolutionary, biophysics, and biotechnological perspectives and although different membrane physics is expected to manifest.”

      Following the reviewer’s advice and to keep the introduction concise and focused on bolalipid membranes, we have removed the paragraph on ESCRT-III proteins in the revised manuscript.

      R3.Q11: The authors have been a bit loose with using the term ”stability”. I’d like to see the distinction in each case, as in ”chemical/thermal/mechanical/conformational stability”.

      Where applicable, we have clarified the type of stability throughout the manuscript. In all other instances, unless clear from context, we simply mean that the membrane persists as a membrane. At our coarse-grained level, this means that the membrane does not disassemble into a gas phase.

      R3.Q12: In the original Cooke-Deserno model, a so-called ”poorman’s angle-bending term” is used, which is essentially a bond-stretching term between the first and third particle. However, I notice the authors using the full harmonic angle-bending potential. This should be mentioned.

      This is made clear in the SI (Eq. (S3)). Cooke and Deserno mention the harmonic angle potential as a valid alternative in their original publication. We now also added this detail to the main text (p.3 ):

      “The angle formed by the chain of three beads is kept near 180° via an angular potential with strength k<sub>0</sub>, instead of the approximation by a bond between end beads of the original model [32].”
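
      To make the distinction between the two bending terms concrete, the sketch below contrasts a harmonic angle potential with a harmonic spring between the two end beads of a three-bead chain. It is only an illustration: the prefactor convention (k/2), the parameter names, and the use of two equal bond lengths are assumptions, and the exact form used in the manuscript is the one given in SI Eq. (S3).

```python
import math


def harmonic_angle_energy(theta, k_angle, theta0=math.pi):
    """Harmonic angle potential (k/2) (theta - theta0)^2, theta in radians."""
    return 0.5 * k_angle * (theta - theta0) ** 2


def end_to_end_distance(theta, bond_length):
    """Distance between beads 1 and 3 for two equal bonds meeting at angle theta."""
    return 2.0 * bond_length * math.sin(0.5 * theta)


def end_bead_spring_energy(theta, k_spring, bond_length, rest_length):
    """Harmonic spring between beads 1 and 3: (k/2) (r13 - rest_length)^2."""
    r13 = end_to_end_distance(theta, bond_length)
    return 0.5 * k_spring * (r13 - rest_length) ** 2
```

      Both terms penalise bending of the chain. As long as the rest length of the end-bead spring is at least twice the bond length, its energy is minimal for the straight chain (angle of 180°), which is also the minimum of the angle potential.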

      R3.Q13: The analysis of energy of U-shaped lipids with the linear model E = c<sub>0</sub> + c<sub>1</sub>k<sub>bola</sub> is indeed very interesting. I am curious, can this also be corroborated with mean energy measurements? The minor issue is calling the source of the favorability of U-shaped lipids ”entropic”, while clearly an energetic contribution is found. The two conformations, for example, might differ in the interactions with the neighbouring lipids.

      We were also curious and thank the reviewer for the suggestion of mean energy measurements. We concluded that there must be either an entropic contribution to the free energy or an intermolecular interaction energy favouring U-shaped bolalipids. We have now included these measurements in SI section 6 (p.S5 ):

      “By splitting the average potential energy between an internal contribution (bonds, angles and pair interactions between particles in the same molecule) and an external contribution (pair interactions between a molecule and its neighbours), we determined the transition energy from straight to U-shaped bolalipids in detail. We found that this transition lowers the internal potential energy of the bolalipid while increasing its interaction energy. In total, we obtained an energy barrier for the transition of ΔE<sub>s→u</sub> = 0.79±0.01k<sub>B</sub>T. Since the fit indicates, however, that the U-shaped bolalipid conformation is preferred over the straight conformation, we conclude that there must be either an entropic contribution to the free energy or an intermolecular interaction energy favouring U-shaped bolalipids.”

      We refer to these measurements in the main text (p.6 ):

      “For the fit it appears that c<sub>0</sub> < 0, which implies that bolalipids in U-shape conformation are slightly favoured over straight bolalipids at k<sub>bola</sub> = 0 (explored in SI section 6).”
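
      Why a negative c<sub>0</sub> implies a preference for the U-shaped conformation can be seen in a simple two-state picture. The sketch below is an assumption made for illustration and not the fitting procedure of the manuscript: it treats E as an effective free-energy difference between the U-shaped and the straight conformation and converts it into a U-shaped fraction with a Boltzmann weight. The function names and the values of c0 and c1 used in any call are hypothetical.

```python
import math


def u_shape_energy(k_bola, c0, c1):
    """Linear model E = c0 + c1 * k_bola for the U-shaped conformation."""
    return c0 + c1 * k_bola


def u_shape_fraction(k_bola, c0, c1, kT=1.0):
    """Two-state Boltzmann estimate of the U-shaped fraction (assumed model).

    E is interpreted as the effective free energy of the U-shaped state
    relative to the straight state, in the same units as kT.
    """
    energy = u_shape_energy(k_bola, c0, c1)
    return 1.0 / (1.0 + math.exp(energy / kT))
```

      In this picture E = 0 corresponds to equal populations, c<sub>0</sub> < 0 gives a U-shaped fraction slightly above one half at zero rigidity, and a positive c<sub>1</sub> lowers the fraction as the molecule becomes stiffer.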

      R3.Q14: The authors write in the Discussion, ”In any case, our results indicate that membrane remodelling, such as membrane fission during membrane traffic, is much more difficult in bolalipid membranes [34].” Firstly, I’m not sure if studying the dependence of budding behaviour on adhesion energy with nanoparticles is enough to make claims about membrane fission. Secondly, why is the 2015 paper by Markus Deserno cited here?

      We thank the reviewer for giving us the opportunity to clarify. We make an energetic argument on membrane fission based on the observed difference in the ratio of the Gaussian modulus κ̄ to the bending modulus 𝜅.

      Splitting a spherical membrane vesicle into two spherical vesicles (fission) increases the bending energy by 8𝜋𝜅 and, since κ̄ is negative, decreases the energy related to the Gaussian bending modulus by −4𝜋κ̄. The second part of the argument is given, for example, in the review by Markus Deserno (p.23, right column), which is why we cite the paper here. Together, this gives an energy barrier required for membrane fission in the considered geometry of ∆E<sub>fission</sub> = 8𝜋𝜅 + 4𝜋κ̄. We found that −κ̄/𝜅 is around 0.5 for bolalipid membranes and around 1 for bilayer membranes. Since 𝜅 was typically larger in bolalipid membranes, we thus expect the energy barrier for fission ∆E<sub>fission</sub> to be larger for bolalipid membranes. We therefore predict that membrane remodelling, such as membrane fission during membrane trafficking, is harder in bolalipid membranes. We explain our reasoning in the discussion of the revised manuscript (p.13):

      “Membrane remodelling, such as the fission of one spherical vesicle into two, increases the bending energy by 8𝜋𝜅 but decreases the energy related to the Gaussian modulus by –4𝜋κ̄ [39], giving rise to a fission energy barrier of ∆E<sub>fission</sub> = 8𝜋𝜅 + 4𝜋κ̄. Our results indicated that while in bolalipid membranes 𝜅 is larger, –κ̄/𝜅 is smaller compared to bilayer membranes. Our results thus predict a larger energy barrier for membrane fission ∆E<sub>fission</sub> in bolalipid membranes compared to bilayer membranes.”
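
      The energetic argument can be made concrete with a short sketch. It assumes the standard Helfrich result that a closed spherical vesicle costs 8πκ in bending energy and contributes 4πκ̄ through the Gaussian term, where κ̄ denotes the Gaussian modulus (negative for stable membranes). The function names are hypothetical, and the ratios used in the usage note are the rounded values quoted in the response (around 1 for bilayers, around 0.5 for bolalipids).

```python
import math


def fission_barrier(kappa, kappa_bar):
    """Energy change when one spherical vesicle splits into two.

    One extra sphere adds 8*pi*kappa of bending energy and
    4*pi*kappa_bar of Gaussian energy (kappa_bar is negative).
    """
    return 8.0 * math.pi * kappa + 4.0 * math.pi * kappa_bar


def fission_barrier_from_ratio(kappa, ratio):
    """Same barrier written with ratio = -kappa_bar / kappa."""
    return 4.0 * math.pi * kappa * (2.0 - ratio)
```

      For equal bending moduli, a ratio of 1 gives a barrier of 4πκ, whereas a ratio of 0.5 gives 6πκ. A smaller ratio therefore raises the barrier even before the larger bending modulus of bolalipid membranes is taken into account.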

      R3.Q15: In the SI, where the measurement of the diffusion coefficient is discussed, the expression for D is missing the power 2 of displacement.

      We thank the reviewer for spotting this oversight. We corrected it in the revised version of the SI (p.S5 ).
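
      For reference, the role of the squared displacement can be sketched as follows. The SI expression itself is not reproduced here; the sketch assumes the usual Einstein relation for lateral diffusion in two dimensions, MSD(t) = 4Dt, and a hypothetical trajectory layout in which trajectory[t][i] holds the (x, y) position of lipid i at frame t.

```python
def mean_squared_displacement(trajectory):
    """MSD relative to the first frame, averaged over all particles.

    trajectory[t][i] is the (x, y) position of particle i at frame t.
    """
    origin = trajectory[0]
    msd = []
    for frame in trajectory:
        total = 0.0
        for (x, y), (x0, y0) in zip(frame, origin):
            # The displacement enters squared, summed over both dimensions.
            total += (x - x0) ** 2 + (y - y0) ** 2
        msd.append(total / len(origin))
    return msd


def diffusion_coefficient_2d(times, msd):
    """Estimate D from MSD(t) = 4 D t with a least-squares line through the origin."""
    slope = sum(t * m for t, m in zip(times, msd)) / sum(t * t for t in times)
    return slope / 4.0
```

      In practice the fit would be restricted to the time window in which the MSD grows linearly; that choice is left to the user in this sketch.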

      R3.Q16: Where cargo uptake is discussed, the term ”adsorption energy” is used. I think the more appropriate term would be ”adhesion energy”.

      For the sake of simplicity, we changed the term to adhesion energy (caption of Fig. 4, and p.10). We do not have a strong opinion on this, but we believe that adsorption energy would be equally correct as we describe the adsorption of many lipid head beads to a nanoparticle.

      R3.Q17: Typos:

      Page 1, paragraph 2: Adaption → Adaptation. Page 10, paragraph 1: Stroking → Striking.

      We thank the reviewer for spotting these typos which we have corrected in the revised version of the manuscript.

      Recommendations for the authors

      Reviewer #1 (Recommendations for the authors):

      A few thoughts (likely out of the scope of this paper but possibly to consider upon revision):

      R1.Q1: Do bolalipids always have the same headgroup? I don’t recall reading this in the introduction/discussion. R1 and R2 are in Figure 1, but I don’t know whether there are standard types. Could this be expanded upon? Is the model able to take these differences into account?

      We thank the reviewer for raising this important question. As in bacteria and eukaryotes, archaeal lipids can carry a wide variety of head groups, which gives rise to a correspondingly large variety of lipids. Most archaeal lipids have head groups that contain either phosphate groups or sugar residues. Typically, archaeal bolalipids are asymmetric and contain a phosphatidyl and a sugar moiety at the two ends of the lipid molecule. Within the membrane, the lipid is oriented such that the phosphatidyl moiety points towards the interior of the cell, whereas the sugar moiety, which occupies more space, points towards the outside of the cell [5].

      In our computational model, however, we consider symmetric bolalipids for the sake of simplicity and to decouple the role of ”connected geometry” from other effects. In principle, we could investigate the effect of lipid asymmetry by increasing the size of one of the lipid head beads. However, this investigation exceeds the scope of the present study and therefore requires future work.

      In the revised version of the manuscript, we now clarify that bolalipids can have different headgroups (p.1 and the caption of Fig. 1):

      “The hydrophilic heads can be composed of different functional groups with phosphatidyl and sugar being the most relevant moieties. For bolalipids the two head groups at either end of the molecule are typically distinct (Fig. 1A right) [5].”

      “The hydrophilic head of a bolalipid can be composed of different functional groups represented by R1 and R2 (right).”

      We also explicitly state that we neglect lipid head group asymmetry for the sake of simplicity (p.4 ):

      “To decouple the effect of the connected geometry of the bolalipids from that of lipid asymmetry, we assume both head beads of a bolalipid to share the same properties.”

      R1.Q2: Is it possible to compare the mesoscale models to either Coarse-grained or even all-atom lipid models? Have simulations previously been performed for bolalipids at those levels of description?

      A few studies have previously investigated bolalipid membranes in simulations, using either all-atom or coarse-grained models. However, none of these studies investigated how bolalipids respond to membrane deformations. Therefore, it is currently not possible to directly compare our results to studies in the literature. Recapitulating our predictions experimentally, however, is certainly something that could and should be done in the future. As a reply to this reviewer and reviewer 3, we discuss the current state of modelling bolalipid membranes in simulations in the revised version of the manuscript (p.3):

      “By contrast, only a few studies have investigated bolalipid membranes applying computational or theoretical tools [24, 25]. Specifically, the pore closure time in bolalipid membranes, and the role of cyclopentane rings for membrane properties has been investigated using all-atom simulations, showing decreased lateral mobility, reduced permeability to water, and increased lipid packing [26–28]. Moreover, using coarse-grained simulations, it was suggested that bolalipid membranes are thicker [29], exhibit a gel-to-liquid phase transition at higher temperature [30], and exhibit a reduced diffusivity [31]. However, little research has been devoted to investigating mechanics and reshaping of bolalipid membranes at the mesoscale despite the obvious importance of this question from evolutionary, biophysics, and biotechnological perspectives and although different membrane physics is expected to manifest.”

      We want to mention, however, that we do compare membrane diffusivity, U-shaped lipid fraction, and bending rigidity to the behaviour and values that have been previously measured in simulations in the discussion section. In general, we find good agreement between our results and previously reported behaviour/values (p.13 ):

      “While flexible bolalipid membranes are liquid under the same conditions as bilayer membranes, we found that stiff bolalipids form membranes that operate in the liquid regime at higher temperatures. These results agree well with previous molecular dynamics simulations that suggested that bolalipid membranes are more ordered and have a reduced diffusivity compared to bilayer membranes [24, 29]. In our simulations, this is due to the fact that completely flexible bolalipid molecules adopt both the straight (transmembrane) and the U-shaped (loop) conformation with approximately the same frequency. In contrast, stiff bolalipids typically only take on the straight conformation when assembled in a membrane. These results agree with the previous coarse-grained molecular dynamics simulations using the MARTINI force field, which showed that the ratio of straight to U-shaped bolalipids increased upon stiffening the linker between the lipid tails [29].

      [...]

      When we determined the bending rigidity of bolalipid membranes by measuring their response to thermal fluctuations, we found that membranes made from flexible bolalipids are only slightly more rigid than bilayer membranes. This result is consistent with previous atomistic simulations, which showed that the membrane rigidity was similar for membranes composed of bilayer lipids and flexible synthetic bolalipids [45].”

      R1.Q3: How would membrane proteins alter the behaviour of bolalipids? Either those integral to the membrane or those binding peripherally?

      The reviewer asks an important question. However, the question is difficult to answer due to its scope and the gaps in the current literature. Important examples of integral or peripheral membrane proteins that alter the behaviour of bolalipids and archaeal bolalipid membranes are involved in cell homeostasis, cell division, membrane trafficking, and lipid synthesis.

      The cells of many archaeal species are enclosed in a paracrystalline protein layer called the S-layer, which is attached to the lipid membrane [4, 55]. The main function of the S-layer is to keep the cell’s shape and to protect it against osmotic stress. Due to the embedding of the S-layer in the membrane at specific locations, it is to be expected that the membrane properties are influenced by the S-layer. Furthermore, archaea execute cell division by locally reshaping the membrane using FtsZ and ESCRT-III proteins [56]. While Asgard archaeal genomes encode proteins with homology to those regulating aspects of eukaryotic membrane remodelling and trafficking [57], they have yet to be observed undergoing a process like endocytosis [58]. In addition, it has been speculated that the proteins that drive the synthesis of two diether lipids into a tetraether lipid are either membrane associated or integral membrane proteins [59].

      However, to the best of our knowledge it is not known how membrane proteins specifically alter the behaviour of bolalipids. Future work will need to be executed to answer this question. Following the advice of reviewer 3 and to keep the introduction concise and focused on bolalipid membranes, we do not mention these observations in the revised manuscript.

      R1.Q4: Is there a mechanism in cells to convert or switch bolalipids from a straight to a u-shaped description? Does this happen spontaneously or are there enzymes responsible for this?

      We thank the reviewer for bringing up this important point. Despite the relevance of the question, little is currently known about the mechanisms that make bolalipids transition between a straight and a U-shaped configuration, mainly because there is to date no established experimental method.

      Besides our own results, most of what we know comes from coarse-grained molecular dynamics simulations, which showed that bolalipids can spontaneously transition between the straight and U-shaped configuration [29]. In addition, by using comparative genomic analysis, it has been predicted that many archaeal species contain flippases, i.e., membrane proteins that are able, upon the consumption of energy, to transfer (flip-flop) bilayer lipids between the two membrane leaflets [43]. Moreover, it has been shown that Halobacterium salinarum (an archaeon with a bilayer lipid membrane) [44] contains scramblases, which are membrane proteins that passively transfer bilayer lipids from one membrane leaflet to the other. It is therefore tempting to speculate that similar proteins might exist for bolalipids which could facilitate the straight to U-shaped transition.

      In addition, it has been reported that vesicles composed of bolalipid membranes can undergo fusion with enveloped influenza viruses [17]. In this context, it has been suggested that the influenza fusion protein hemagglutinin may locally induce U-shaped bolalipids to facilitate membrane fusion. However, none of these hints proves the existence of a mechanism that drives the straight to U-shaped bolalipid transition, and further work is needed to investigate this question in detail.

      In the revised version of the manuscript, we now discuss what is known about potential mechanisms to facilitate the straight to U-shaped transition in the discussion section (p.13 ):

      “While previous coarse-grained simulations predicted that bolalipids spontaneously transition between the straight and U-shaped conformations [29], how this happens in archaeal membranes and whether membrane proteins are involved in this conformational transition needs to be clarified in the future. Experimental studies suggest that archaeal membranes contain flippases and scramblases for the transitioning of bilayer lipids between membrane leaflets [43, 44], raising the possibility that similar proteins could also facilitate conformational transitions in bolalipids. In addition, it has been suggested that the viral fusion protein hemagglutinin could cause a transition from straight to U-shaped bolalipid conformation during the fusion of bolalipid vesicles with influenza viruses [17]. However, future investigation is required.”

      R1.Q5: Ideally, coordinates and any parameter files required to run the molecular simulations should be included for reproducibility.

      We absolutely share the reviewer’s concern for reproducibility. As part of the data availability section of our original submission, we included a link to a code repository (available at: https://doi.org/10.5281/zenodo.13934991 [51]) that allows initializing and simulating flat membrane patches, with user control of the parameters explored in this paper (𝜔, T<sub>eff</sub>, k<sub>bola</sub>, f<sup>bi</sup>).

      Reviewer #2 (Recommendations for the authors):

      This is a great paper and I congratulate the authors for writing such a fine piece of scholarship. The only nitty-gritty feedback that I have is summarized in the following three points:

      R2.Q1: In the introduction the authors talk about archaea adapting their membrane to retain membrane fluidity. However, homeoviscous adaptation is also fundamental in bacteria and eukaryotes.

      The reviewer is correct: like those of archaea, the membranes of bacteria and eukaryotes must balance flexibility and stability. Moreover, the cell membranes in all three domains of life need to maintain membrane fluidity and provide mobility to the embedded lipids and membrane proteins (homeoviscous adaptation). The general idea is that these organisms change the ratio of different lipids to change membrane properties and thereby optimally adapt to their environments [10]. Importantly, however, there are differences in how homeoviscous adaptation is achieved across the different domains of life. As a reply to this reviewer and reviewer 3, we now discuss the underlying mechanisms in the revised parts of the introduction (p.1):

      “As for bacteria and eukaryotes, archaea must keep their lipid membranes in a fluid state (homeoviscous adaptation). This is important even under extreme environmental conditions, such as hot and cold temperatures, or high and low pH values [7]. Because of this, many archaea adapt to changes in their environment by tuning the lipid composition of their membranes: altering the ratio between bola- and bilayer lipids in their membranes [8, 9] and/or changing the number of cyclopentane rings in their lipid tails, which are believed to make lipid molecules more rigid [5]. For example, Thermococcus kodakarensis increases its tetraether bolalipid ratio from around 50% to over 80% when the temperature of the environment increases from 60 to 85 °C [10]. Along the same lines, the cell membrane of Sulfolobus acidocaldarius can contain over 90% of bolalipids with up to 8 cyclopentane rings at 70 °C and pH 2.5 [5, 11]. It is worth mentioning that in exceptional cases bacteria also synthesise bolalipids in response to high temperatures [12], highlighting that the study of bolalipid membranes is relevant not only for archaeal biology but also from a general membrane biophysics perspective.”

      R2.Q2: Uncertainties in Gaussian rigidity modulus estimates are not properly reported.

      The large uncertainties in the Gaussian rigidity modulus were due to the way it was calculated. In short, the Gaussian rigidity modulus κ̄ is determined in cap folding simulations [41] (SI section 9) from the measured values of the dimensionless parameter 𝜉, which is related to the folding probability, the bending modulus 𝜅, the membrane line tension, and the cap radius R. In our case, the main source of uncertainty in κ̄ comes from the uncertainty in the measurement of the bending rigidity 𝜅. Previously, to obtain 𝜅, we fitted the fluctuation spectra of the different seeds separately and only then averaged the obtained values. In the revised version of the manuscript, we now first pool the fluctuation spectra of the different simulation seeds before we fit all spectra at the same time. This new approach results in smaller uncertainties for the bending rigidity 𝜅 and also for the Gaussian rigidity modulus κ̄.
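
      The difference between the two fitting workflows can be sketched as follows. The sketch is only illustrative and does not reproduce the fit used in the manuscript: it assumes a tensionless fluctuation spectrum of the form S(q) = k<sub>B</sub>T/(κq⁴), with the area normalisation absorbed into S, and a simple log-space least-squares estimate of κ. All function names are hypothetical.

```python
import math


def fit_kappa(q_values, spectrum, kT=1.0):
    """Fit kappa from S(q) = kT / (kappa q^4) by least squares in log space."""
    logs = [math.log(s * q**4) for q, s in zip(q_values, spectrum)]
    return kT * math.exp(-sum(logs) / len(logs))


def kappa_per_seed_then_average(seeds, kT=1.0):
    """Earlier workflow: fit each seed separately, then average the fitted values.

    seeds is a list of (q_values, spectrum) pairs, one pair per simulation seed.
    """
    kappas = [fit_kappa(q, s, kT) for q, s in seeds]
    return sum(kappas) / len(kappas)


def kappa_pooled(seeds, kT=1.0):
    """Revised workflow: pool the spectra of all seeds and fit them together."""
    q_all = [q for q_values, _ in seeds for q in q_values]
    s_all = [s for _, spectrum in seeds for s in spectrum]
    return fit_kappa(q_all, s_all, kT)
```

      With noiseless data both workflows return the same modulus. With noisy spectra the pooled fit uses all data points in a single estimate, so a few poorly constrained per-seed fits cannot dominate the result.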

      As a consistency check, in addition to the simulations that we previously performed at T<sub>eff</sub> = 1.3, we have repeated the cap folding and line tension simulations at T<sub>eff</sub> = 1.2, resulting in similar values for the Gaussian rigidity modulus. In the revised version of the manuscript, we report the newly calculated values and uncertainties at T<sub>eff</sub> = 1.2 in the main text (p.8):

      “At T<sub>eff</sub> = 1.2, we obtained = 4.30 ± 0.22 k<sub>B</sub>T and thus a ratio of = 0.89 ± 0.04 for bilayer membranes, similar to what has been reported previously [41]. For flexible bolalipid membranes, we got a slightly smaller value for = 5.04 ± 0.37 k<sub>B</sub>T. Due to the larger bending modulus, however, flexible bolalipid membranes show a significantly smaller ratio = 0.64 ± 0.04 (k<sub>bola</sub> = 0). At larger temperature (T<sub>eff</sub> = 1.3), the ratio can be even smaller = 0.45 ± 0.07 (see SI section 9).”

      In addition, we report the values at T<sub>eff</sub> = 1.3 and T<sub>eff</sub> = 1.2 in the SI (p.S15, Table S4).

      We have also adapted the discussion of the Gaussian bending modulus accordingly (p.13 ):

      “Another marked difference between bilayer and flexible bolalipid membranes is the ratio of the Gaussian rigidity to the bending modulus. Instead of being around 1 as for bilayer membranes [41], it is around 1/2 and therefore only half of that of bilayer lipids.”

      Reviewer #3 (Recommendations for the authors):

      While I think the bulk of the work presented is useful, some of the issues that I raised in my review are indeed major. Without properly addressing them, it is hard to accept the conclusions of the manuscript. I hope the authors can address them by revising their analysis.

      We thank the reviewer for their constructive feedback, which helped us to improve the manuscript. We have addressed all points raised by the reviewer in our detailed point-by-point response to the reviewer (see above). We hope the reviewer will now find it easier to accept our conclusions.

      (1) R. Phillips, J. Kondev, J. Theriot, and H. Garcia, Physical biology of the cell (Garland Science, New York, 2012).

      (2) H. T. McMahon and J. L. Gallop, Membrane curvature and mechanisms of dynamic cell membrane remodelling, Nature 438, 590 (2005).

      (3) S. B. Gould, Membranes and evolution, Curr. Biol. 28, R381 (2018).

      (4) S.-V. Albers and B. H. Meyer, The archaeal cell envelope, Nat. Rev. Microbiol. 9, 414 (2011).

      (5) P. M. Oger and A. Cario, Adaptation of the membrane in Archaea, Biophys. Chem. 183, 42 (2013).

      (6) K. Rastädter, D. J. Wurm, O. Spadiut, and J. Quehenberger, The Cell Membrane of Sulfolobus spp.—Homeoviscous Adaption and Biotechnological Applications, International Journal of Molecular Sciences 21, 3935 (2020).

      (7) P. L.-G. Chong, Archaebacterial bipolar tetraether lipids: Physico-chemical and membrane properties, Chem. Phys. Lipids 163, 253 (2010).

      (8) M. Tourte, P. Schaeffer, V. Grossi, and P. M. Oger, Functionalized Membrane Domains: An Ancestral Feature of Archaea?, Front. Microbiol. 11, 526 (2020).

      (9) Y. H. Kim, G. Leriche, K. Diraviyam, T. Koyanagi, K. Gao, D. Onofrei, J. Patterson, A. Guha, N. Gianneschi, G. P. Holland, M. K. Gilson, M. Mayer, D. Sept, and J. Yang, Entropic effects enable life at extreme temperatures, Sci. Adv. 5, eaaw4783 (2019).

      (10) M. F. Siliakus, J. van der Oost, and S. W. M. Kengen, Adaptations of archaeal and bacterial membranes to variations in temperature, pH and pressure, Extremophiles 21, 651 (2017).

      (11) D. W. Grogan, Phenotypic characterization of the archaebacterial genus Sulfolobus: comparison of five wild-type strains, J. Bacteriol. 171, 6710 (1989).

      (12) D. X. Sahonero-Canavesi, M. F. Siliakus, A. Abdala Asbun, M. Koenen, F. von Meijenfeldt, S. Boeren, N. J. Bale, J. C. Engelman, K. Fiege, L. Strack van Schijndel, J. S. Sinninghe Damsté, and L. Villanueva, Disentangling the lipid divide: Identification of key enzymes for the biosynthesis of membrane-spanning and ether lipids in Bacteria, Sci. Adv. 8, eabq8652 (2022).

      (13) M. van Wolferen, A. A. Pulschen, B. Baum, S. Gribaldo, and S.-V. Albers, The cell biology of archaea, Nat. Microbiol. 10.1038/s41564-022-01215-8 (2022).

      (14) U. Bakowsky, U. Rothe, E. Antonopoulos, T. Martini, L. Henkel, and H.-J. Freisleben, Monomolecular organization of the main tetraether lipid from Thermoplasma acidophilum at the water–air interface, Chem. Phys. Lipids 105, 31 (2000).

      (15) C. Jeworrek, F. Evers, M. Erlkamp, S. Grobelny, M. Tolan, P. L.-G. Chong, and R. Winter, Structure and Phase Behavior of Archaeal Lipid Monolayers, Langmuir 27, 13113 (2011).

      (16) D. P. Brownholland, G. S. Longo, A. V. Struts, M. J. Justice, I. Szleifer, H. I. Petrache, M. F. Brown, and D. H. Thompson, Phase Separation in Binary Mixtures of Bipolar and Monopolar Lipid Dispersions Revealed by 2H NMR Spectroscopy, Small Angle X-Ray Scattering, and Molecular Theory, Biophysical Journal 97, 2700 (2009).

      (17) A. Bhattacharya, I. D. Falk, F. R. Moss, T. M. Weiss, K. N. Tran, N. Z. Burns, and S. G. Boxer, Structure–function relationships in pure archaeal bipolar tetraether lipids, Chem. Sci. 15, 14273 (2024).

      (18) V. Vitkova, D. Mitkova, V. Yordanova, P. Pohl, U. Bakowsky, G. Staneva, and O. Batishchev, Elasticity and phase behaviour of biomimetic membrane systems containing tetraether archaeal lipids, Colloids Surf. A Physicochem. Eng. Asp. 601, 124974 (2020).

      (19) E. Chang, Unusual thermal stability of liposomes made from bipolar tetraether lipids, Biochem. Biophys. Res. Commun. 202, 673 (1994).

      (20) O. V. Batishchev, A. S. Alekseeva, D. S. Tretiakova, T. R. Galimzyanov, A. Y. Chernyadyev, N. R. Onishchenko, P. E. Volynsky, and I. A. Boldyrev, Cyclopentane rings in hydrophobic chains of a phospholipid enhance the bilayer stability to electric breakdown, Soft Matter 16, 3216 (2020).

      (21) U. Seifert, Configurations of fluid membranes and vesicles, Adv. Phys. 46, 13 (1997).

      (22) H. Noguchi, Membrane Simulation Models from Nanometer to Micrometer Scale, J. Phys. Soc. Jpn. 78, 041007 (2009).

      (23) F. Frey and T. Idema, More than just a barrier: using physical models to couple membrane shape to cell function, Soft Matter 17, 3533 (2021).

      (24) C. Huguet, S. Fietz, A. Rosell-Melé, X. Daura, and L. Costenaro, Molecular dynamics simulation study of the effect of glycerol dialkyl glycerol tetraether hydroxylation on membrane thermostability, Biochimica et Biophysica Acta (BBA) - Biomembranes 1859, 966 (2017).

      (25) T. R. Galimzyanov, P. I. Kuzmin, P. Pohl, and S. A. Akimov, Elastic deformations of bolalipid membranes, Soft Matter 12, 2357 (2016).

      (26) T. R. Galimzyanov, P. E. Volynsky, and O. V. Batishchev, Continuum elasticity and molecular dynamics of a pore in archaeal bolalipid membranes, Soft Matter 21, 687 (2025).

      (27) A. O. Chugunov, P. E. Volynsky, N. A. Krylov, I. A. Boldyrev, and R. G. Efremov, Liquid but Durable: Molecular Dynamics Simulations Explain the Unique Properties of Archaeal-Like Membranes, Sci. Rep. 4, 7462 (2015).

      (28) L. F. Pineda De Castro, M. Dopson, and R. Friedman, Biological Membranes in Extreme Conditions: Simulations of Anionic Archaeal, PLoS One 11, e0155287 (2016).

      (29) M. Bulacu, X. Périole, and S. J. Marrink, In Silico Design of Robust Bolalipid Membranes, Biomacromolecules 13, 196 (2012).

      (30) C. H. Davis, H. Nie, and N. V. Dokholyan, Insights into thermophilic archaebacterial membrane stability from simplified models of lipid membranes, Phys. Rev. E 75, 051921 (2007).

      (31) S. Dey and J. Saha, Minimal Coarse-Grained Modeling toward Implicit Solvent Simulation of Generic Bolaamphiphiles, J. Phys. Chem. B 124, 2938 (2020).

      (32) I. R. Cooke and M. Deserno, Solvent-free model for self-assembling fluid bilayer membranes: Stabilization of the fluid phase based on broad attractive tail potentials, J. Chem. Phys. 123, 224710 (2005).

      (33) P. L.-G. Chong, U. Ayesa, V. Prakash Daswani, and E. C. Hur, On Physical Properties of Tetraether Lipid Membranes: Effects of Cyclopentane Rings, Archaea 2012, 1 (2012).

      (34) A. P. Thompson, H. M. Aktulga, R. Berger, D. S. Bolintineanu, W. M. Brown, P. S. Crozier, P. J. in ’t Veld, A. Kohlmeyer, S. G. Moore, T. D. Nguyen, R. Shan, M. J. Stevens, J. Tranchida, C. Trott, and S. J. Plimpton, LAMMPS - a flexible simulation tool for particle-based materials modeling at the atomic, meso, and continuum scales, Comput. Phys. Commun. 271, 108171 (2022).

      (35) A. Stukowski, Visualization and analysis of atomistic simulation data with OVITO–the Open Visualization Tool, Modelling and Simulation in Materials Science and Engineering 18, 015012 (2009).

      (36) E. R. May, A. Narang, and D. I. Kopelevich, Role of molecular tilt in thermal fluctuations of lipid membranes, Physical Review E 76, 021913 (2007).

      (37) W. Helfrich, Elastic Properties of Lipid Bilayers: Theory and Possible Experiments, Z. Naturforsch. C 28, 693 (1973).

      (38) M. Hamm and M. Kozlov, Elastic energy of tilt and bending of fluid membranes, Eur. Phys. J. E 3, 323 (2000).

      (39) M. Deserno, Fluid lipid membranes: From differential geometry to curvature stresses, Chemistry and Physics of Lipids 185, 11 (2015).

      (40) V. A. Harmandaris and M. Deserno, A novel method for measuring the bending rigidity of model lipid membranes by simulating tethers, The Journal of Chemical Physics 125, 204905 (2006).

      (41) M. Hu, J. J. Briguglio, and M. Deserno, Determining the Gaussian Curvature Modulus of Lipid Membranes in Simulations, Biophys. J. 102, 1403 (2012).

      (42) M. Deserno, Elastic deformation of a fluid membrane upon colloid binding, Phys. Rev. E 69, 031903 (2004), arXiv: cond-mat/0303656.

      (43) K. S. Makarova, M. Y. Galperin, and E. V. Koonin, Comparative genomic analysis of evolutionarily conserved but functionally uncharacterized membrane proteins in archaea: Prediction of novel components of secretion, membrane remodeling and glycosylation systems, Biochimie 118, 302 (2015).

      (44) A. Verchère, W.-L. Ou, B. Ploier, T. Morizumi, M. A. Goren, P. Bütikofer, O. P. Ernst, G. Khelashvili, and A. K. Menon, Light-independent phospholipid scramblase activity of bacteriorhodopsin from Halobacterium salinarum, Sci. Rep. 7, 9522 (2017).

      (45) T. B. H. Schroeder, G. Leriche, T. Koyanagi, M. A. Johnson, K. N. Haengel, O. M. Eggenberger, C. L. Wang, Y. H. Kim, K. Diraviyam, D. Sept, J. Yang, and M. Mayer, Effects of lipid tethering in extremophile-inspired membranes on H(+)/OH(-) flux at room temperature, Biophys. J. 110, 2430 (2016).

      (46) R. Xu, A. Dehghan, A.-C. Shi, and J. Zhou, Elastic property of membranes self-assembled from diblock and triblock copolymers, Chem. Phys. Lipids 221, 83 (2019).

      (47) Z. Dogic and S. Fraden, Ordered phases of filamentous viruses, Curr. Opin. Colloid Interface Sci. 11, 47 (2006).

      (48) E. Barry and Z. Dogic, Entropy driven self-assembly of nonamphiphilic colloidal membranes, Proc. Natl. Acad. Sci. U.S.A. 107, 10348 (2010).

      (49) A. J. Balchunas, R. A. Cabanas, M. J. Zakhary, T. Gibaud, S. Fraden, P. Sharma, M. F. Hagan, and Z. Dogic, Equation of state of colloidal membranes, Soft Matter 15, 6791 (2019).

      (50) M. Saracco, P. Schaeffer, M. Tourte, S.-V. Albers, Y. Louis, J. Peters, B. Demé, S. Fontanay, and P. M. Oger, Bilayer-Forming Lipids Enhance Archaeal Monolayer Membrane Stability, Int. J. Mol. Sci. 26, 3045 (2025).

      (51) M. Amaral, archaeal_membranes: code and examples (2024), available at https://doi.org/10.5281/zenodo.13934991.

      (52) M. F. Ergüder and M. Deserno, Identifying systematic errors in a power spectral analysis of simulated lipid membranes, The Journal of Chemical Physics 154, 214103 (2021).

      (53) J. Genova, N. Ulrih, V. Kralj-Iglič, A. Iglič, and I. Bivas, Bending Elasticity Modulus of Giant Vesicles Composed of Aeropyrum Pernix K1 Archaeal Lipid, Life 5, 1101 (2015).

      (54) M. Amaral, Archaeal Membranes: In Silico Modelling and Design, Ph.D. thesis, Institute of Science and Technology Austria (2024).

      (55) M. Pohlschroder, F. Pfeiffer, S. Schulze, and M. F. A. Halim, Archaeal cell surface biogenesis, FEMS Microbiol. Rev. 42, 694 (2018).

      (56) K. S. Makarova, N. Yutin, S. D. Bell, and E. V. Koonin, Evolution of diverse cell division and vesicle formation systems in Archaea, Nat. Rev. Microbiol. 8, 731 (2010).

      (57) C. W. Stairs and T. J. Ettema, The Archaeal Roots of the Eukaryotic Dynamic Actin Cytoskeleton, Curr. Biol. 30, R521 (2020).

      (58) B. Baum and D. A. Baum, The merger that made us, BMC Biol. 18, 72 (2020).

      (59) Z. Zeng, H. Chen, H. Yang, Y. Chen, W. Yang, X. Feng, H. Pei, and P. V. Welander, Identification of a protein responsible for the synthesis of archaeal membrane-spanning GDGT lipids, Nat. Commun. 13, 1545 (2022).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews: 

      Reviewer #1 (Public Review): 

      Summary: 

      In Ryu et al., the authors use a cortical mouse astrocyte culture system to address the functional contribution of astrocytes to circadian rhythms in the brain. The authors' starting point is transcriptional output from serum-shocked culture, comparative informatics with existing tools and existing datasets. After fairly routine pathway analyses, they focus on the calcium homeostasis machinery and one gene, Herp, in particular. They argue that Herp is rhythmic at both mRNA and protein levels in astrocytes. They then use a calcium reporter targeted to the ER, mitochondria, or cytosol and show that Herp modulates calcium signaling as a function of circadian time. They argue that this occurs through the regulation of inositol receptors. They claim that the signaling pathway is clock-controlled by a limited examination of Bmal1 knockout astrocytes. Finally, they switch to calcium-mediated phosphorylation of the gap junction protein Connexin 43 but do not directly connect HERP-mediated circadian signaling to these observations. While these experiments address very important questions related to the critical role of astrocytes in regulating circadian signaling, the mechanistic arguments for HERP function, its role in circadian signaling through inositol receptors, the connection to gap junctions, and ultimately, the functional relevance of these findings is only partially substantiated by experimental evidence. 

      Strengths: 

      - The paper provides useful datasets of astrocyte gene expression in circadian time. 

      - Identifies HERP as a rhythmic output of the circadian clock. 

      - Demonstrates the circadian-specific sensitivity of ATP -> calcium signaling. 

      - Identifies possible rhythms in both Connexin 43 phosphorylation and rhythmic movement of calcium between cells. 

      Weaknesses: 

      - It is not immediately clear why the authors chose to focus on Ca2+ homeostasis or Herp from their initial screens as neither were the "most rhythmic" pathways in their primary analyses. 

      We appreciate the reviewer’s comment. We chose to focus on Ca2+ homeostasis because intracellular Ca2+ signaling plays a crucial role in numerous astrocyte functions and is notably associated with the sleep/wake status of animals, which is our primary interest (Bojarskaite et al., 2020; Ingiosi et al., 2020; Blum et al., 2021; Szabó et al., 2017). Among the genes involved in calcium ion homeostasis, Herp exhibited the most robust rhythmicity (Supplementary Table 1). The rationale for our focus on Ca2+ homeostasis and Herp is explained in the Results section (lines 143-150). We hope this provides a clear justification for our focus.

      - It would have been interesting (and potentially important) to know whether various methods of cellular synchronization would also render HERP rhythmic (e.g., temperature, forskolin, etc). If Herp is indeed relatively astrocyte-specific and rhythmic, it should be easy to assess its rhythmicity in vivo. 

      We thank the reviewer for the insightful comment. In response, we examined HERP expression in cultured astrocytes synchronized by either dexamethasone or forskolin treatment. We found that Herp exhibited rhythmic expression at both the mRNA and protein levels under these conditions. These results have been added to Figure S3 and are explained in the manuscript (lines 173-175).

      Additionally, we measured HERP levels in the prefrontal cortex of mice at CT58 and CT70 and found no rhythmicity, as shown in Author response image 1. Herp is expressed in various brain cell types, including microglia, endothelial cells, neurons, oligodendrocytes, and astrocytes, with the highest expression in microglia (Cahoy et al., 2008). We therefore reason that any rhythmic expression of HERP in astrocytes might be masked by its continuous expression in other cell types. To assess HERP rhythmicity specifically in astrocytes in vivo, we attempted immunostaining with several anti-HERP antibodies, but none were successful. Consequently, we were unable to determine whether HERP exhibits rhythmic expression in astrocytes in vivo.

      Author response image 1.

      HERP levels were constant at CT58 and CT70. (A, B) Mice were entrained under a 12h:12h LD cycle and maintained in constant darkness. Prefrontal cortices were harvested at the indicated times and processed for Western blot analysis. The representative image shows three independent samples. (B) Quantification of HERP levels normalized to VINCULIN. Values in graphs are mean ± SEM (*p < 0.05, **p < 0.005, ***p < 0.0005, and ****p < 0.00005; t-test).

      - The authors show that Herp suppression reduces ATP-mediated suppression of calcium whereas it initially increases Ca2+ in the cytosol and mitochondria and then suppresses it. The dynamics of the mitochondrial and cytosolic responses are not discussed in any detail and it is unclear what their direct relationship is to Herp-mediated ER signaling. What is the explanation for Herp (which is thought to be ER-specific) to calcium signaling in other organelles? 

      Our examination of cytosolic and mitochondrial Ca2+ responses was aimed at corroborating HERP’s effect on the ER Ca2+ response. Upon ATP stimulation, Ca2+ is released from the ER via IP3 receptors (IP3Rs) and subsequently transmitted to other organelles, including mitochondria (Carreras-Sureda et al., 2018; Giorgi et al., 2018). Ca2+ is transferred directly to the cytosol by IP3Rs located on the ER membrane, and to the mitochondria through a complex formed by IP3R and the voltage-dependent anion channel (VDAC) on the mitochondria (Giorgi et al., 2018). Consistent with previous reports, we observed an increase in cytosolic and mitochondrial Ca2+ levels accompanied by a decrease in ER Ca2+ levels following ATP treatment (see Fig. 3B, E, H, control siRNA). The ATP-stimulated ER Ca2+ release was enhanced by Herp knockdown. We reasoned that if Ca2+ release was enhanced, then cytosolic and mitochondrial Ca2+ uptake would also be enhanced. The results were consistent with this hypothesis (see Fig. 3B, E, H, Herp siRNA). These observations are described in the Results section (lines 202-208) and in the Discussion (lines 333-348). We hope this explanation clarifies the relationship between the Herp-mediated ER Ca2+ response and the Ca2+ response in other organelles. Thank you for your consideration.

      - What is the functional significance of promoting ATP-mediated suppression of calcium in ER? 

      In astrocytes, intracellular Ca2+ plays a crucial role in regulating several processes. In this study, among the various downstream effects of intracellular Ca2+, we examined gap junction channel (GJC) conductance, which affects astrocytic communication. As discussed in the manuscript (lines 357-381), circadian variation in HERP results in rhythmic Cx43 (S368) phosphorylation, which is linked with GJC conductance. We propose that during the subjective night phase, heightened ATP-induced ER Ca2+ release reduces GJC conductance, uncoupling astrocytes from the syncytium and making them better equipped for localized responses. During the subjective day phase, on the other hand, increased GJC conductance may allow astrocytes to control a larger area for synchronous neuronal activity, which is a key feature of sleep.

      - The authors then nicely show that the effect of ATP is dependent on intrinsic circadian timing but do not explain why these effects are antiphase in cytosol or mitochondria.

      Moreover, the ∆F/F for calcium in mitochondria and cytosol both rise, cross the abscissa, and then diminish - strongly suggesting a biphasic signaling event. Therefore, one wonders whether measuring the area under the curve is the most functionally relevant measurement of the change. 

      We appreciate the reviewer’s insightful comments. As explained in our previous response, Ca2+ released from the ER is transferred to the cytosol and mitochondria. This transfer explains why the fluorescent intensities of cytosolic and mitochondrial Ca2+ indicators show anti-phasic responses to those of the ER.

      We agree that cytosolic and mitochondrial Ca2+ responses may be biphasic. The decrease below the abscissa in mitochondria and cytosol likely reflects Ca2+ extrusion from these compartments. However, our primary focus was on the initial uptake of Ca2+ following ER Ca2+ release. Thus, when calculating the area under the curve (AUC), we measured the area between the ∆F/F trace and the line y = 0 (the x-axis) for both mitochondria and cytosol. We reason that measuring the area under the curve above the abscissa fits our objective.

      While addressing your concerns, we noticed errors in the Y-axis labels of Fig. 3C, 4D, and 5C. For the ER Ca2+ dynamics, we measured the area above the curve. These mistakes have now been corrected.
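      The area measurements described above can be illustrated with a minimal sketch. It assumes a ∆F/F trace sampled at known time points and linear interpolation between samples; the function name and the toy traces are hypothetical and are not taken from the authors' analysis pipeline.

```python
def positive_area(times, values):
    """Trapezoidal area between a trace and y = 0, counting only the part above the axis.

    Segments that cross the axis contribute only the triangle lying above y = 0,
    assuming the trace is linear between neighbouring samples.
    """
    area = 0.0
    for i in range(len(times) - 1):
        dt = times[i + 1] - times[i]
        y0, y1 = values[i], values[i + 1]
        if y0 >= 0 and y1 >= 0:
            # Whole segment is on or above the axis: ordinary trapezoid.
            area += 0.5 * (y0 + y1) * dt
        elif y0 > 0 > y1:
            # Falls through the axis: keep the triangle before the crossing.
            area += 0.5 * y0 * (y0 / (y0 - y1)) * dt
        elif y0 < 0 < y1:
            # Rises through the axis: keep the triangle after the crossing.
            area += 0.5 * y1 * (y1 / (y1 - y0)) * dt
    return area


# Toy traces (invented values, arbitrary units).
times = [0, 1, 2, 3, 4]
cytosol_dff = [0, 2, 1, -1, -1]   # rises, then dips below the axis
er_dff = [0, -1, -2, -1, 0]       # ER signal decreases after ATP

cytosol_auc = positive_area(times, cytosol_dff)
# For the ER trace, the "area above the curve" is the area of the negated trace.
er_area_above_curve = positive_area(times, [-v for v in er_dff])
```

      In this sketch the part of the cytosolic trace below the abscissa is ignored, which matches a focus on initial uptake rather than on later extrusion.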

      - Why are mitochondrial and cytosolic calcium not also demonstrated for Bmal1 KO astrocytes? 

      In two sets of experiments (Fig. 3 and Fig. 4), we demonstrated that the increase in cytosolic and mitochondrial Ca2+ aligns with ER Ca2+ release. Since there were no circadian time differences in ER Ca2+ release in the Bmal1 KO cultures, we concluded that it was unnecessary to measure Ca2+ levels in the mitochondria and cytosol. Additionally, our primary focus is on the ER Ca2+ response rather than the Ca2+ dynamics in subcellular organelles. We hope this clarifies our rationale and maintains the focus of our study.

      - The authors claim that Herp acts by regulating the degradation of ITPRs but this hypothesis - rather central to the mechanisms proposed in this study - is not experimentally substantiated. 

      We appreciate the reviewer’s insightful comments regarding the role of HERP in the degradation of IP3Rs. In the original manuscript, we demonstrated that treating cells with Herp siRNA increases the levels of ITPR1 and ITPR2, suggesting that HERP might be involved in regulating IP3R stability. This observation is consistent with previous studies, which showed that Herp siRNA treatment increases ITPR levels in HeLa and cardiac cells (Paredes et al., 2016; Torrealba et al., 2017). Torrealba et al. also showed that HERP regulates the polyubiquitination of IP3Rs. Based on our results and these reports, we hypothesized that HERP similarly regulates ITPR degradation in cultured astrocytes.

      However, as the reviewer rightly pointed out, further evidence is needed to confirm that HERP specifically regulates ITPR degradation. To address this, we conducted new experiments examining the effect of XesC, an inhibitor of IP3Rs, on ER Ca2+ release. Treatment with XesC reduced ER Ca2+ release and abolished the enhancement of ER Ca2+ release by Herp KD. These results demonstrate that HERP influences the ER Ca2+ response through IP3Rs. These new findings have been added to Fig. 3N – 3P and are explained in the Results section (lines 217-221).

      We believe these additional experiments and clarifications strengthen our hypothesis that HERP regulates IP3R degradation, thereby modulating ER Ca2+ responses.

      - There is no clear demonstration of the functional relevance of the circadian rhythms of ATP-mediated calcium signaling.

      As mentioned in the previous response, we examined Cx43 phosphorylation, which is linked with GJC conductance, in the context of ATP-mediated Ca2+ signaling. Our results demonstrated circadian variations in Cx43 Ser368 phosphorylation leading to variations in gap junction channel (GJC) conductance (Fig. 6C – F and Fig. 7D - I). We have discussed the significance of this circadian rhythm in ATP-driven ER Ca2+ signaling for astrocytic function during sleep/wake states in the manuscript (lines 357 – 382) as follows.

      “ATP-stimulated Cx43 (S368) phosphorylation is higher at 30hr (subjective night phase) than at 42hr (subjective day phase) (Fig. 6C and 6D), a finding further supported by in vivo experiments showing higher pCx43(S368) levels in the prefrontal cortex during the subjective night than during the day (Fig. 6E and 6F). What are the implications of this day/night variation in Cx43 (S368) phosphorylation? We reasoned that the circadian variation in Cx43 phosphorylation could significantly impact astrocyte functionality within the syncytium. Indeed, our cultured astrocytes exhibited circadian phase-dependent variation in gap junctional communication (Fig. 7D – 7F). Astrocytes influence synaptic activity through the release of gliotransmitters such as glutamate, GABA, D-serine, and ATP, triggered by increases in intracellular Ca2+ in response to the activity of adjacent neurons and astrocytes (Verkhratsky & Nedergaard, 2018). Importantly, this increase in Ca2+ spreads to adjacent astrocytes through GJCs (Fujii et al., 2017), influencing a large area of the neuronal network. Considering that Cx43 Ser368 phosphorylation occurs to uncouple specific pathways in the astrocytic syncytium to focus local responses (Enkvist & McCarthy, 1992), our findings suggest that astrocytes are better equipped for localized responses when presented with a stimulus during the active phase in mice. Conversely, during the rest period, characterized by more synchronous neuronal activity across broad brain areas (Vyazovskiy et al., 2009), higher GJC conductance might allow astrocytes to exert control over a larger area. In support of this idea, a recent study showed that synchronized astrocytic Ca2+ activity advances the slow wave activity (SWA) of the brain, a key feature of non-REM sleep (Szabó et al., 2017). Blocking GJC was found to reduce SWA, further supporting this interpretation. However, conflicting findings have also been reported. For instance, Ingiosi et al. (2020) found that astrocytic synchrony was higher during wakefulness than sleep in the mouse frontal cortex. Whether these differing results in astrocyte synchrony during resting and active periods are attributable to differences in experimental context (e.g., brain regions, sleep-inducing condition) remains unclear. Indeed, astrocyte Ca2+ dynamics during wakefulness/sleep vary according to brain regions (Tsunematsu et al., 2021). While the extent of astrocyte synchrony might differ depending on brain region and/or stimulus, our results suggest that the baseline state of astrocyte synchrony, which is affected by GJC conductance, varies with the day/night cycle.”

      Reviewer #2 (Public Review): 

      Summary: 

      The article entitled "Circadian regulation of endoplasmic reticulum calcium response in mouse cultured astrocytes" submitted by Ryu and colleagues describes the circadian control of astrocytic intracellular calcium levels in vitro. 

      Strengths: 

      The authors used a variety of technical approaches that are appropriate 

      We appreciate the reviewer’s acknowledgement of the strengths of our manuscript.

      Weaknesses: 

      Statistical analysis is poor and could lead to a misinterpretation of the data 

      Thank you for the comment. We have carefully reviewed our statistical analyses and applied appropriate methods where necessary. Please see below for the specific revisions and improvements made.

      For Fig. 2D-E, we initially used a t-test. However, after adding more replicates and conducting a normality test, we found that the data did not follow a normal distribution. Therefore, we switched to the Mann-Whitney U test. In Fig. 5D-E, we originally used a repeated measures two-way ANOVA, but we have now changed it to a standard two-way ANOVA. For Fig. 7C and I, we also observed non-normal distribution in the normality test and consequently replaced the t-test with the Mann-Whitney U test. For other analyses not specifically mentioned, normality tests confirmed normal distribution, allowing us to use t-tests or ANOVA as appropriate for statistical analysis.
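      As an illustration of the rank-based comparison mentioned above, the following is a minimal sketch of the Mann-Whitney U statistic with an exact two-sided p-value obtained by enumerating all group assignments. It is written in plain Python for small samples only; the function names and the toy values are hypothetical, and this is not a record of how the authors ran their analysis.

```python
from itertools import combinations


def mann_whitney_u(x, y):
    """U statistic for sample x: number of (xi, yj) pairs with xi > yj; ties count 0.5."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


def exact_two_sided_p(x, y):
    """Exact two-sided p-value: share of all relabelings at least as extreme as observed."""
    pooled = list(x) + list(y)
    n = len(x)
    center = n * len(y) / 2.0          # expected U under the null hypothesis
    observed = abs(mann_whitney_u(x, y) - center)
    extreme = total = 0
    for idx in combinations(range(len(pooled)), n):
        chosen = set(idx)
        gx = [pooled[i] for i in idx]
        gy = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if abs(mann_whitney_u(gx, gy) - center) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Toy data (invented): four replicates per group, one group with a large outlier.
control = [1.0, 1.2, 0.9, 1.1]
knockdown = [1.8, 2.5, 1.6, 9.0]

u_control = mann_whitney_u(control, knockdown)
p_value = exact_two_sided_p(control, knockdown)
```

      With four replicates per group and complete separation, the exact two-sided p-value is 2/70, since only two of the 70 possible group assignments are as extreme as the observed one. The outlier does not affect the result because only the ordering of the values matters.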

      Several conceptual issues have been identified. 

      We have addressed the reviewer’s concerns. Please see our detailed point-by-point responses below.

      Overinterpretation of the data should be avoided. This is a mechanistic paper done completely in vitro, all references to the in vivo situation are speculative and should be avoided. 

      We appreciate the reviewer’s insightful comment. Following the reviewer’s suggestion, we have removed the interpretations of GO pathways in the context of the in vivo situation.

      Reviewer #3 (Public Review): 

      Astrocyte biology is an active area of research and this study is timely and adds to a growing body of literature in the field. The RNA-seq, Herp expression, and Ca2+ release data across wild-type, Bmal1 knockout, and Herp knockdown cellular models are robust and lend considerable support to the study's conclusions, highlighting their importance. Despite these strengths, the manuscript presents a gap in elucidating the dynamics of HERP and the involvement of ITPR1/2 in modulating Ca2+ release patterns and their circadian variations, which remains insufficiently supported and characterized. While the Connexin data underscore the importance of rhythmic Ca2+ release triggered by ATP, the relationship here appears correlational and the role of HERP and ITPR in Cx function remains to be characterized. Moreover, enhancing the manuscript's clarity and readability could significantly benefit the presentation and comprehension of the findings. 

      We appreciate the reviewer’s acknowledgement of the strengths of our manuscript. Regarding the identified gaps, we have conducted several new experiments to clearly demonstrate the HERP-ITPR-Cx phosphorylation axis. Please see our detailed point-by-point responses below.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors): 

      - While HERP appears to be a clock-controlled gene and its protein levels appear to demonstrate rhythmicity as well, the data quality of the western blotting in Bmal1 knockout raises some concern about the accuracy of HERP protein quantification. 

      We understand the reviewer’s concern regarding the proximity of the HERP band to a nonspecific band in the Western blotting for the Bmal1 knockout. However, we took great care to ensure the accuracy of our HERP band quantification. We selected only the specific HERP band and excluded the nonspecific band. Therefore, we are confident in the accuracy of our HERP protein measurements.

      - If HERP is rhythmic and ITPRs are not, if their model is correct, might we expect HERP suppression to result in 'unmasking' an ITPR rhythm? 

      Our model suggests that both HERP and ITPRs are rhythmic, with HERP regulating the degradation of ITPR proteins and driving their rhythms. Consistent with this, we observed day/night variations in ITPR2 levels (Fig. 4N and 4O). Therefore, we concluded that circadian variations in HERP are sufficient to drive ITPR2 rhythms. We have explained this in detail in the Results section (lines 236-241) and the Discussion section (lines 324-332).

      - The authors make a rather abrupt switch to examining gap junctions and connexin 43 phosphorylation. While the data demonstrating that the phosphorylation of S368 may indeed be rhythmic - the authors do not connect these data to the rest of the manuscript by showing a connection to HERP-mediated calcium signaling, limiting the coherence of the narrative. 

      We thank the reviewer for the insightful comments. To address the reviewer's concern regarding the connection between Herp and the phosphorylation of Cx43 at S368, we conducted new experiments to test whether KD of Herp abolishes the rhythm of Cx43 phosphorylation at S368. We found that phosphorylation of Cx43 at S368 is significantly higher at 30hrs post sync than at 42hrs post sync in control siRNA-treated astrocytes, consistent with our previous results (Fig. 6C & 6D). In contrast, this circadian phase-dependent difference in phosphorylation was abolished in Herp siRNA-treated astrocytes. These results indicate that circadian variations in Cx43 phosphorylation are driven by HERP. The new results are now included in Fig. 6G and 6H and explained in the Results section (lines 276-281).

      - Comment on data presentation: the authors repeatedly present histograms with attached lines between data points - from my understanding of the experiments, this is inappropriate unless these were repeated measures from the same cells. Otherwise, the lines connecting one data point to another between different conditions (e.g., Ctrl or Herp knockdown) are arbitrary and possibly misleading (i.e., Figure 3K, 3M, 4L, 6D). 

      We thank the reviewer for the comment. We have removed the lines connecting data points in the relevant figures (Fig. 3K, 3M, Fig. 4N, and Fig. 6D).

      Reviewer #2 (Recommendations For The Authors): 

      Most of the suggestions of this reviewer are related to the conceptual interpretation and presentation of the data and to the statistical analysis 

      In Figure 1 the authors analyzed the rhythmic transcriptome of cortical astrocytes synchronized with a serum shock in two different ways. The authors need to discuss what is the difference between the two methods used to detect rhythmic transcripts and make sense of them. 

      Following the reviewer’s suggestion, we have provided a more detailed explanation of MetaCycle and BioCycle, as well as the rationale for using both packages in our analysis, as follows: “Various methods have been used to identify periodicity in time-series data, such as Lomb-Scargle (Glynn et al., 2006), JTK_CYCLE (Hughes et al., 2010) and ARSER (Yang & Su, 2010), each with distinct advantages and limitations. MetaCycle integrates these three methods, facilitating the evaluation of periodicity in time-series data without requiring the selection of an optimal algorithm (Wu et al., 2016). Additionally, BioCycle has been developed using a deep neural network trained with extensive synthetic and biological time-series datasets (Agostinelli et al., 2016). Because MetaCycle and BioCycle identify periodic signals based on different algorithms, we applied both packages to identify periodicity in our time-series transcriptome data. BioCycle and MetaCycle analyses detected 321 and 311 periodic transcripts, respectively (FDR corrected, q-value < 0.05) (Fig. 1B). Among these, 220 (53.4%) were detected by both methods, but many transcripts did not overlap. MetaCycle is known for its inability to detect asymmetric waveforms (Mei et al., 2020). In our analysis, genes with increasing waveforms like Adora1 and Mybph were identified as rhythmic only by BioCycle, while Plat and Il34 were identified as rhythmic only by MetaCycle (Fig. S1C). Despite these discrepancies, the clear circadian rhythmic expression profiles of these genes led us to conclude that using the union of the two lists compensates for the limitations of each algorithm.”

      Please refer to lines 105-117 in the Results section.
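      The union-and-overlap bookkeeping described in the quoted passage can be sketched as follows. This is a minimal illustration, assuming each tool's output has already been reduced to a list of significant transcript names; the function name and the "Shared1"/"Shared2" entries are hypothetical placeholders, and the toy lists are not the actual results.

```python
def overlap_summary(hits_a, hits_b):
    """Summarize how two lists of rhythmic transcripts overlap."""
    a, b = set(hits_a), set(hits_b)
    union = a | b
    shared = a & b
    return {
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        "shared": sorted(shared),
        "union_size": len(union),
        # Share of the union that both methods agree on.
        "percent_shared": 100.0 * len(shared) / len(union) if union else 0.0,
    }


# Toy lists: the four named genes follow the text above;
# "Shared1" and "Shared2" are placeholders for transcripts found by both tools.
biocycle_hits = ["Adora1", "Mybph", "Shared1", "Shared2"]
metacycle_hits = ["Plat", "Il34", "Shared1", "Shared2"]
summary = overlap_summary(biocycle_hits, metacycle_hits)

# The same arithmetic applied to the reported counts (321, 311, 220 shared).
reported_union = 321 + 311 - 220
reported_percent = round(100.0 * 220 / reported_union, 1)
```

      Applied to the reported counts, the union contains 412 transcripts, and 220 of 412 corresponds to the 53.4% quoted above.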

      The reasoning for comparing CT0 with the phase of the clock 8 hs after SS needs to be explained. Circadian time (CT) conceptually refers to the clock phase in the absence of entrainment cues in vivo, the direct transformation of "time after synchronization" in vitro to CT is misleading. 

      We thank the reviewer for the insightful comments. Initially, we believed that transforming time after serum shock (TASS) to CT, despite the data being in vitro, might provide a more intuitive and physiologically relevant interpretation of our results. However, we agree that this approach might be misleading. Following the reviewer’s suggestion, we have revised our terminology by changing “CT” to “Time post sync (hr)”. Nonetheless, in the circular peak phase map in Fig. 1F, we set 8hrs post sync to ZT0, based on the phase comparison in Fig. 1D, to allow a physiologically relevant interpretation. We hope these revisions clarify our approach.

      Moreover, also by definition a CT cannot be defined in terms of "dark" or "light". Figure 6M needs to be changed. 

      Following the reviewer’s suggestion, we removed the labels CT22 and CT34. Instead, we have labeled the respective periods as “30hr post sync” and “42hr post sync”.

      In Figure 1D, the authors present a gene ontology analysis that is certainly interesting, however, it should not be overinterpreted when trying to explain processes that take place only in vivo (e.g. wound repair). 

      Thank you for the insightful comment. Following the reviewer’s feedback, we have removed the paragraph interpreting the cell migration process in relation to wound repair and have focused instead on Ca2+ ion homeostasis.

      In Figure 2A the relative expression of clock genes and Herp is again misleading by a white/grey shading indicating subjective night and subjective day when the system under study is a cell culture. 

      We understand the reviewer’s concern that a cell culture system is not equivalent to a light/dark entrainment condition. However, we apply time-synchronizing stimuli to recapitulate in vivo entrainment. In addition, by comparing our data with CircaDB, we defined 8hrs post sync as corresponding to ZT0, thus aligning it with the beginning of the day. We have retained the shading to facilitate interpretation of our data in relation to in vivo situations. However, in response to the reviewer’s concern, we have revised the shading from white/grey to light grey/dark grey. We hope this adjustment addresses the reviewer’s concern; if the reviewer still believes it is inappropriate, please let us know and we will gladly update it.

      In the Figure 2A legend, it is indicated that rhythmicity is assessed using MetaCycle with mean values obtained from n=2. The authors need to make clear whether this n=2 mean: 2 biological replicates or 2 technical replicates. This difference is relevant because it would make the analysis statistically valid or invalid, respectively. 

      Thank you for your feedback. n=2 refers to 2 biological replicates. Therefore, the analysis is statistically valid.

      In Figures 2C and D the authors applied a T-test, a parametric statistical test for one-to-one comparison that requires normality distribution of the data to be tested first. To test normality, the authors need at least 4 biological replicates. The suggestion of this reviewer is that these experiments have to be repeated and proper statistics applied. 

      Thank you for your feedback. In response to the reviewer's suggestion, we conducted additional experiments to increase the number of biological replicates to 4. After testing the normality of the data, we applied a t-test for Figure 2C and a Mann-Whitney test for Figure 2D and 2E. These tests confirmed statistically significant differences between groups.

      As further evidence of Bmal1-dependent control of HERP circadian expression, the authors could check for the presence of E-Box elements in the Herp promoter. 

      We thank the reviewer for the insightful comment. In the Discussion section of the original manuscript, we mentioned the absence of a canonical E-Box upstream of the Herp gene. However, following the reviewer’s suggestion and considering the potential role of non-canonical E-Boxes, we conducted an additional analysis. This analysis identified several non-canonical E-Boxes within the 6 kb upstream region of the Herp gene (Table S2). Notably, we found that one non-canonical E-Box, “CACGTT,” which is known to regulate circadian expression (Yoo et al., 2005), lies close to the transcription start site (chr8:94386194-94386543). Moreover, this element is evolutionarily conserved across various mammals, including humans, rats, mice, dogs, and opossums (see Author response image 2). Therefore, we reasoned that these non-canonical E-Boxes might drive the CLOCK/BMAL1-dependent expression of Herp. We have updated the Discussion to reflect these findings in lines 315-319.

      Author response image 2.
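      A motif search of the kind described in the response above can be sketched as a simple two-strand scan. This is a minimal illustration only: the function names are hypothetical, the toy sequence is invented and is not the Herp upstream region, and the tool actually used for the analysis is not stated in the response.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence written with A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_motif(seq, motif):
    """Return (0-based start, strand) for every exact match of motif on either strand."""
    seq, motif = seq.upper(), motif.upper()
    rc = reverse_complement(motif)
    hits = []
    for i in range(len(seq) - len(motif) + 1):
        window = seq[i:i + len(motif)]
        if window == motif:
            hits.append((i, "+"))
        elif window == rc:
            # A match to the reverse complement means the motif sits on the minus strand.
            hits.append((i, "-"))
    return hits


# Invented toy sequence, for illustration only.
toy_upstream = "TTCACGTTGGAACGTGAACACGTGTT"

noncanonical_hits = find_motif(toy_upstream, "CACGTT")  # non-canonical E-Box
canonical_hits = find_motif(toy_upstream, "CACGTG")     # canonical E-Box (palindromic)
```

      Because the canonical E-Box CACGTG is its own reverse complement, each occurrence is reported once on the plus strand, whereas CACGTT can appear on either strand.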

      The calcium experiments shown in Figures 3A-I, could be more convincing if the authors showed that the different Ca2+ sensors are compartment-specific by showing co-localization with a subcellular marker. In the pictures shown it is not even possible to recognize the cell dimensions. 

      Following the reviewer’s suggestion, we performed co-staining experiments with organelle-specific Ca2+ indicators and organelle markers. First, astrocytes were co-transfected with G-CEPIA1er, an ER-specific Ca2+ indicator, and ER-targeted DsRed2 (with the Calreticulin signal sequence). Live imaging analysis showed that the fluorescence intensities of G-CEPIA1er and DsRed2-ER-5 significantly overlapped in co-transfected cells. Second, astrocytes were transfected with Mito-R-GECO1, and Mitotracker, a cell-permeable mitochondrial dye, was applied. The fluorescence intensities of Mito-R-GECO1 and Mitotracker also significantly overlapped. These new data are included in Figure S4 and explained in the Results section (lines 194-195).

      Data analysis in Figure 3K and M is misleading. According to the explanations of the results, each of the experiments to assess ITPR1 or ITPR2 was run independently. It is therefore not clear why the relative levels obtained with control or Herp siRNA are plotted as pairs. The same comment applies to Figure 4L and Figure 6D.

      Thank you for the reviewer’s insightful comments. Reviewer #1 raised similar issues. Following the reviewers’ suggestions, we have removed the lines connecting the data points in Fig. 3K, 3M, 4L, and 6D.

      In Figure 5E the authors need to explain why they consider repeated measures 2-way ANOVA the right statistical test to apply. According to the explained experimental design, cells were transfected, synchronized, and then harvested independently at the indicated times after synchronization.

      Thank you for the reviewer’s insightful comment. Upon reviewing the statistical methods as suggested, we have revised our approach. Instead of using repeated measures 2-way ANOVA, we have now applied a standard 2-way ANOVA, which is more appropriate given the experimental procedures were independent, as the reviewer pointed out.

      The English language needs to be revised throughout the text. 

      We have thoroughly revised the English language throughout the text.

      Reviewer #3 (Recommendations For The Authors): 

      (1) Figure 3. Clarify the physiological importance of 100 µM ATP. Would the Herp rhythm warrant Ca2+ release rhythms under basal conditions? In 3J-K, the relatively weak effect of Herp knockdown on ITPR1/2 levels, albeit statistically significant, may not be physiologically significant. This calls into question the claimed Herp-ITPR axis that underlies the Ca2+ release phenotype. Further, the correlation certainly exists but further characterization of Herp KD cells would be required to address the mechanism. 

      As previously reported, a broad range of ATP concentrations can induce Ca2+ activity in astrocytes (Neary et al., 1988). We originally conducted an ATP dose-response analysis to observe ER Ca2+ release in our primary astrocyte culture. Our results show that ER Ca2+ release begins at 50 µM ATP and plateaus at 500 µM (please refer to Author response image 3). We selected 100 µM ATP for our experiments because it induces an intermediate level of ER Ca2+ response. Importantly, although measuring ATP concentrations at the synapse in vivo is challenging (Tan et al., 2017), estimates suggest that synaptic ATP concentrations range from 5 to 500 µM (Pankratov et al., 2006). Thus, 100 µM ATP is a physiologically relevant concentration that can affect nearby cells, including astrocytes, in the nervous system.

      Author response image 3.

      Cultured astrocytes were transfected with G-CEPIA1er and, at 48 h post-transfection, treated with various concentrations of ATP, after which Ca2+ imaging analysis was performed. (A) ΔF/F0 values over time following ATP application. (B) Area above curve values. Values in graphs are mean ± SEM (*p < 0.05, **p < 0.005, ***p < 0.0005, and ****p < 0.00005; one-way ANOVA).
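
      The two quantities in this caption can be illustrated with a short sketch. The response does not give the exact definitions used, so the following assumes that F0 is the mean of the pre-stimulus frames and that "area above curve" is the trapezoidal area between the ΔF/F0 trace and zero wherever the trace dips below baseline (ER Ca2+ release lowers the ER indicator signal). The function names and the toy trace are hypothetical.

```python
# Hypothetical sketch of dF/F0 normalization and an "area above curve" measure.

def delta_f_over_f0(trace, n_baseline):
    """Normalize a fluorescence trace to its baseline: (F - F0) / F0.

    F0 is assumed to be the mean of the first n_baseline frames.
    """
    f0 = sum(trace[:n_baseline]) / n_baseline
    return [(f - f0) / f0 for f in trace]


def area_above_curve(dff, dt):
    """Trapezoidal area between zero and the trace where dF/F0 is negative."""
    depth = [max(0.0, -value) for value in dff]  # how far the trace dips below baseline
    return sum((depth[i] + depth[i + 1]) / 2 * dt for i in range(len(depth) - 1))


# Toy trace (arbitrary units), sampled once per second; not real data.
toy_trace = [100, 100, 80, 60, 80, 100]
dff = delta_f_over_f0(toy_trace, n_baseline=2)
print(dff)
print(area_above_curve(dff, dt=1.0))
```

      Clipping at zero means that only the depletion phase contributes to the area; a different convention (for example, integrating the signed trace) would give different values.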

      Regarding the comment on Ca2+ release rhythms under basal conditions, we interpret this as referring to Ca2+ release in the absence of a stimulus. We typically observe Ca2+ release only upon stimulation, such as ATP treatment. However, we acknowledge that the modest effects of HERP knockdown on ITPR1/2 levels could call into question the role of the HERP-ITPR axis in ER Ca2+ release.

      To address this, we analyzed whether the Herp KD-induced increases in ER Ca2+ release were mediated through ITPRs by treating cells with Xestospongin C (XesC), an IP3R inhibitor. XesC treatment reduced ATP-induced ER Ca2+ release and eliminated the differences in ER Ca2+ release between control and Herp KD astrocytes (Fig. 3N – 3P). These results indicate that the HERP-ITPR axis plays a critical role in controlling ER Ca2+ release. These new experiments have been included in Fig. 3 and explained in the Results section (lines 217-221).

      Furthermore, following the reviewer’s suggestion, we examined whether HERP rhythms underlie the rhythms of the ER Ca2+ response by analyzing the ER Ca2+ response in Herp KD astrocytes at two different times following synchronization. In control astrocytes, ATP-induced ER Ca2+ responses varied depending on time, whereas these time-dependent variations were abolished in Herp KD astrocytes. These new experiments have been included in Fig. 4K – 4M and explained in the Results section (lines 232-235).

      Collectively, these results indicate that HERP rhythms lead to time-dependent differences in ER Ca2+ response through ITPRs.

      (2) Figure 4K-L. As data suggested the involvement of ITPR1 and ITPR2 (circadian effect), a reasonable next step is to determine their involvement, but the study did not pursue the hypothesis. 

      Thank you for your insightful comment. Our results indeed suggest that rhythms in ITPR2 levels may drive the time-dependent variations in ATP-induced ER Ca2+ release following synchronization. The newly conducted experiments demonstrated that treatment with the ITPR inhibitor XesC suppressed ATP-induced ER Ca2+ release under both control and Herp siRNA treatment conditions (Fig. 3). These findings further support the idea that rhythms of ITPR levels, specifically ITPR2, underlie the circadian variations in ER Ca2+ release. While examining the effect of ITPR2 siRNA would directly prove the involvement of ITPR2, we have decided to pursue this experiment in future studies.

      (3) Figure 5A-C. Data from WT cells should be included side by side with Bmal1-/- cells for comparison which is expected to be consistent with the HERP levels as in 5D-E. Again, the role of ITPR2 is suggested but not demonstrated. 

      Following the reviewer's suggestion, we conducted additional experiments including both WT and Bmal1-/- cultured astrocytes side by side. The results were consistent with our previous findings: WT astrocytes showed rhythms of ER Ca2+ release while Bmal1-/- astrocytes did not. We have updated Figures 5A to 5C and the corresponding Results section (lines 242-245) accordingly.

      Regarding the second comment, as mentioned in our previous response, we plan to examine the role of ITPR2 in further studies.

      (4) Figure 6. The Connexin data seems an addon and is correlative with the Ca2+ release. The role of Herp and Itpr in Connexin function is not addressed. Figure 6E-F was not called out in the results section. Suggest providing additional data to support the role of the HERP-ITPR axis in regulating Ca2+ release and Connexin activity. 

      We agree that additional data are needed to support the role of HERP in regulating Cx43 phosphorylation. Therefore, we conducted further experiments to determine whether rhythms of Cx43 phosphorylation are regulated by HERP. In control astrocytes, ATP treatment induced time-dependent variations in Cx43 phosphorylation. However, these rhythms were abolished in Herp KD astrocytes. These results indicate that rhythms in HERP levels contribute to the time-dependent variations in Cx43 phosphorylation. These new experiments have been included in Fig. 6G and 6H and explained in the Results section (lines 276-281).

      Regarding the second comment, we have corrected our oversight by properly referencing Figures 6E-F in the Results section. Please refer to lines 357-359 for clarification.

      (5) Discussion. This section should focus on noteworthy points to discuss, not repeating the results. 

      Based on the reviewer's valuable suggestions, we have revised the Discussion section to minimize repetition of the results. Thank you for your guidance.

      (6) The manuscript exhibits numerous grammatical and textual inaccuracies that necessitate careful revision by the authors. My observations here are confined to the title and the abstract alone. I recommend altering the title from "mouse cultured astrocytes" to "cultured mouse astrocytes" for clarity and grammatical correctness. The abstract, meanwhile, needs enhancements both in terms of its content and language. It should incorporate the results of the partitioning among the ER, cytoplasm, and mitochondria, and provide clear definitions for some of the critical terms used. It's worth noting that the abstract's second sentence contains a grammatical error. 

      Thank you for the reviewer’s valuable feedback. We have carefully revised the title, abstract, and main text to address the grammatical and textual issues. The title has been changed to “cultured mouse astrocytes”. Additionally, the abstract now includes results related to cytoplasmic Ca2+ dynamics and has been revised in several places. We appreciate your insights and have worked to enhance the content and language accordingly.

      References

      Agostinelli, F., Ceglia, N., Shahbaba, B., Sassone-Corsi, P., & Baldi, P. (2016). What time is it? Deep learning approaches for circadian rhythms. Bioinformatics, 32(12), i8-i17. https://doi.org/10.1093/bioinformatics/btw243

      Cahoy, J. D., Emery, B., Kaushal, A., Foo, L. C., Zamanian, J. L., Christopherson, K. S., Xing, Y., Lubischer, J. L., Krieg, P. A., Krupenko, S. A., Thompson, W. J., & Barres, B. A. (2008). A transcriptome database for astrocytes, neurons, and oligodendrocytes: a new resource for understanding brain development and function. J Neurosci, 28(1), 264-278. https://doi.org/10.1523/JNEUROSCI.4178-07.2008

      Carreras-Sureda, A., Pihán, P., & Hetz, C. (2018). Calcium signaling at the endoplasmic reticulum: fine-tuning stress responses. Cell Calcium, 70, 24-31. https://doi.org/10.1016/j.ceca.2017.08.004

      Enkvist, M. O., & McCarthy, K. D. (1992). Activation of protein kinase C blocks astroglial gap junction communication and inhibits the spread of calcium waves. J Neurochem, 59(2), 519-526. https://doi.org/10.1111/j.1471-4159.1992.tb09401.x

      Fujii, Y., Maekawa, S., & Morita, M. (2017). Astrocyte calcium waves propagate proximally by gap junction and distally by extracellular diffusion of ATP released from volume-regulated anion channels. Scientific Reports, 7(1), 13115. https://doi.org/10.1038/s41598-017-13243-0

      Giorgi, C., Marchi, S., & Pinton, P. (2018). The machineries, regulation and cellular functions of mitochondrial calcium. Nature Reviews Molecular Cell Biology, 19(11), 713-730. https://doi.org/10.1038/s41580-018-0052-8

      Glynn, E. F., Chen, J., & Mushegian, A. R. (2006). Detecting periodic patterns in unevenly spaced gene expression time series using Lomb-Scargle periodograms. Bioinformatics, 22(3), 310-316. https://doi.org/10.1093/bioinformatics/bti789

      Hughes, M. E., Hogenesch, J. B., & Kornacker, K. (2010). JTK_CYCLE: an efficient nonparametric algorithm for detecting rhythmic components in genome-scale data sets. J Biol Rhythms, 25(5), 372-380. https://doi.org/10.1177/0748730410379711

      Ingiosi, A. M., Hayworth, C. R., Harvey, D. O., Singletary, K. G., Rempe, M. J., Wisor, J. P., & Frank, M. G. (2020). A Role for Astroglial Calcium in Mammalian Sleep and Sleep Regulation. Curr Biol, 30(22), 4373-4383.e4377. https://doi.org/10.1016/j.cub.2020.08.052

      Mei, W., Jiang, Z., Chen, Y., Chen, L., Sancar, A., & Jiang, Y. (2020). Genome-wide circadian rhythm detection methods: systematic evaluations and practical guidelines. Briefings in Bioinformatics, 22(3). https://doi.org/10.1093/bib/bbaa135

      Neary, J. T., van Breemen, C., Forster, E., Norenberg, L. O., & Norenberg, M. D. (1988). ATP stimulates calcium influx in primary astrocyte cultures. Biochem Biophys Res Commun, 157(3), 1410-1416. https://doi.org/10.1016/s0006-291x(88)81032-5

      Pankratov, Y., Lalo, U., Verkhratsky, A., & North, R. A. (2006). Vesicular release of ATP at central synapses. Pflugers Arch, 452(5), 589-597. https://doi.org/10.1007/s00424-006-0061-x

      Paredes, F., Parra, V., Torrealba, N., Navarro-Marquez, M., Gatica, D., Bravo-Sagua, R., Troncoso, R., Pennanen, C., Quiroga, C., Chiong, M., Caesar, C., Taylor, W. R., Molgó, J., San Martin, A., Jaimovich, E., & Lavandero, S. (2016). HERPUD1 protects against oxidative stress-induced apoptosis through downregulation of the inositol 1,4,5-trisphosphate receptor. Free Radic Biol Med, 90, 206-218. https://doi.org/10.1016/j.freeradbiomed.2015.11.024

      Szabó, Z., Héja, L., Szalay, G., Kékesi, O., Füredi, A., Szebényi, K., Dobolyi, Á., Orbán, T. I., Kolacsek, O., Tompa, T., Miskolczy, Z., Biczók, L., Rózsa, B., Sarkadi, B., & Kardos, J. (2017). Extensive astrocyte synchronization advances neuronal coupling in slow wave activity in vivo. Scientific Reports, 7(1), 6018. https://doi.org/10.1038/s41598-017-06073-7

      Tan, Z., Liu, Y., Xi, W., Lou, H. F., Zhu, L., Guo, Z., Mei, L., & Duan, S. (2017). Glia-derived ATP inversely regulates excitability of pyramidal and CCK-positive neurons. Nat Commun, 8, 13772. https://doi.org/10.1038/ncomms13772

      Torrealba, N., Navarro-Marquez, M., Garrido, V., Pedrozo, Z., Romero, D., Eura, Y., Villalobos, E., Roa, J. C., Chiong, M., Kokame, K., & Lavandero, S. (2017). Herpud1 negatively regulates pathological cardiac hypertrophy by inducing IP3 receptor degradation. Sci Rep, 7(1), 13402. https://doi.org/10.1038/s41598-017-13797-z

      Tsunematsu, T., Sakata, S., Sanagi, T., Tanaka, K. F., & Matsui, K. (2021). Region-specific and state-dependent astrocyte Ca2+ dynamics during the sleep-wake cycle in mice. The Journal of Neuroscience, JN-RM-2912-2920. https://doi.org/10.1523/jneurosci.2912-20.2021

      Verkhratsky, A., & Nedergaard, M. (2018). Physiology of Astroglia. Physiol Rev, 98(1), 239-389. https://doi.org/10.1152/physrev.00042.2016

      Vyazovskiy, V. V., Olcese, U., Lazimy, Y. M., Faraguna, U., Esser, S. K., Williams, J. C., Cirelli, C., & Tononi, G. (2009). Cortical firing and sleep homeostasis. Neuron, 63(6), 865-878. https://doi.org/10.1016/j.neuron.2009.08.024

      Wu, G., Anafi, R. C., Hughes, M. E., Kornacker, K., & Hogenesch, J. B. (2016). MetaCycle: an integrated R package to evaluate periodicity in large scale data. Bioinformatics, 32(21), 3351-3353. https://doi.org/10.1093/bioinformatics/btw405

      Yang, R., & Su, Z. (2010). Analyzing circadian expression data by harmonic regression based on autoregressive spectral estimation. Bioinformatics, 26(12), i168-174. https://doi.org/10.1093/bioinformatics/btq189

      Yoo, S. H., Ko, C. H., Lowrey, P. L., Buhr, E. D., Song, E. J., Chang, S., Yoo, O. J., Yamazaki, S., Lee, C., & Takahashi, J. S. (2005). A noncanonical E-box enhancer drives mouse Period2 circadian oscillations in vivo. Proc Natl Acad Sci U S A, 102(7), 2608-2613. https://doi.org/10.1073/pnas.0409763102

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Yang, Hu et al. examined the molecular mechanisms underlying astrocyte activation and its implications for multiple sclerosis. This study shows that the glycolytic enzyme PKM2 relocates to astrocyte nuclei upon activation in EAE mice. Inhibiting PKM2's nuclear import reduces astrocyte activation, as evidenced by decreased proliferation, glycolysis, and inflammatory cytokine release. Crucially, the study identifies TRIM21 as pivotal in regulating PKM2 nuclear import via ubiquitination. TRIM21 interacts with PKM2, promoting its nuclear translocation and enhancing its activity, affecting multiple signaling pathways. Confirmatory analyses using single-cell RNA sequencing and immunofluorescence demonstrate TRIM21 upregulation in EAE astrocytes. Modulating TRIM21 expression in primary astrocytes impacts PKM2-dependent glycolysis and proliferation. In vivo experiments targeting this mechanism effectively mitigate disease severity, CNS inflammation, and demyelination in EAE.

      The authors supported their claims with various experimental approaches; however, some results should be supported with higher-quality images that clearly depict the conclusions and with additional quantitative analyses of Western blots.

      We thank the reviewer for the comments. We agree with the reviewer and have added higher-magnification images, for example in Fig. 2A to better visualize the localization of PKM2 in DASA-58-treated conditions, and in Fig. 3A and Fig. 3B to better visualize pSTAT3 and pp65. Moreover, we have added quantitative analyses of Western blots for some key experiments: quantitative results for Fig. 2D are added in Fig. S3 to show the change of PKM2 and p-c-myc in DASA-58-treated conditions, and quantitative results for Fig. 3D are added in Fig. S4B and S4C to show the change of nuclear and cytoplasmic PKM2, STAT3 and NF-κB in different conditions.

      Strength:

      This study presents a comprehensive investigation into the function and molecular mechanism of metabolic reprogramming in the activation of astrocytes, a critical aspect of various neurological diseases, especially multiple sclerosis. The study uses the EAE mouse model, which closely resembles MS. This makes the results relevant and potentially translational. The research clarifies how TRIM21 regulates the nuclear import of PKM2 through ubiquitination by integrating advanced techniques. Targeting this axis may have therapeutic benefits since lentiviral vector-mediated knockdown of TRIM21 in vivo significantly reduces disease severity, CNS inflammation, and demyelination in EAE animals.

      We thank the reviewer for their positive and constructive comments on the manuscript.

      Weaknesses:

      The authors reported that PKM2 levels are elevated in the nucleus of astrocytes at different EAE phases compared to cytoplasmic localization. However, Figure 1 also shows elevated cytoplasmic expression of PKM2. The authors should clarify the nuclear localization of PKM2 by providing zoomed-in images. An explanation for the increased cytoplasmic PKM2 expression should be provided. Similarly, while PKM2 translocation is inhibited by DASA-58, a decrease in the cytoplasmic localization of PKM2 is observed in addition to the decrease in its nuclear localization. This raises the possibility that a degradation mechanism is involved when the nuclear translocation of PKM2 is inhibited.

      Immunofluorescence staining of PKM2 in the spinal cord of EAE mice and in cultured primary astrocytes showed, in addition to PKM2 nuclear translocation under EAE conditions, elevated expression of PKM2 in astrocytes in both the cytoplasm and the nucleus. Studies of other neurological diseases have reported consistent results; for example, following spinal cord injury (SCI), both upregulated expression of PKM2 and its nuclear translocation were observed in astrocytes (Zhang et al., 2015). Under EAE conditions, CNS inflammation is elevated, and several proinflammatory cytokines and chemokines might contribute to the upregulated expression of PKM2 in astrocytes. We tested TNFα and IL-1β, which are recognized to play important roles in EAE and MS (Lin and Edelson, 2017, Wheeler et al., 2020), and western blots showed increased expression of PKM2 upon stimulation with TNFα and IL-1β (Author response image 1). Moreover, following the reviewer’s suggestion, we have added zoomed-in images for Figure 2A.

      Additionally, as the reviewer noted, the cytoplasmic PKM2 level decreased; a degradation-related mechanism or other mechanisms might be involved in this process.

      Author response image 1.

      Upregulated expression of PKM2 in astrocytes following stimulation with TNF-α and IL-1β. Primary astrocytes were stimulated with TNF-α and IL-1β (50 ng/mL) for 48 h and western blotting analysis was performed.

      In Figure 3D, the authors claim that PKM2 expression causes nuclear retention of STAT3, p65, and p50, and inhibiting PKM2 localization with DASA-58 suppresses this retention. The western blot results for the MOG-stimulated group show high levels of STAT3, p50, and p65 in nuclear localization. However, in the MOG and DASA-58 treated group, one would expect high levels of p50, p65, and STAT3 proteins in the cytoplasm, while their levels decrease in the nucleus. These western blot results could be expanded. Additionally, intensity quantification for these results would be beneficial to see the statistical difference in their expressions, especially to observe the nuclear localization of PKM2.

      We agree with the reviewer’s comments and have incorporated the quantification of STAT3, p50 and p65 for Fig. 3D in Fig. S4B and Fig. S4C. However, given that DASA-58 did not trigger a notable increase in the cytoplasmic level of PKM2, we did not detect an upregulation of STAT3, p50, or p65 in the cytoplasm of the MOG and DASA-58-treated groups. The quantification results make the changes of these proteins under different conditions easier to see.

      The discrepancy between Figure 7A and its explaining text is confusing. The expectation from the knocking down of TRIM21 is the amelioration of activated astrocytes, leading to a decrease in inflammation and the disease state. The presented results support these expectations, while the images showing demyelination in EAE animals are not highly supportive. Clearly labeling demyelinated areas would enhance readers' understanding of the important impact of TRIM21 knockdown on reducing the disease severity.

      Thank you for pointing this out. We sincerely apologize for our carelessness. Based on your comments, we have made the corrections in the manuscript. As there is indeed a statistical difference in the mean clinical scores between the shTRIM21-treated group and the shVec group, we have revised the sentence for Figure 7A to state, “At the end time point at day 22 p.i., shTRIM21-treated group showed reduced disease scores compared to control groups (Fig. 7A).”

      Additionally, we have added the whole image of the spinal cord stained for MBP in Author response image 2. Moreover, we have labelled the demyelinated areas to facilitate readers’ understanding.

      Author response image 2.

      MBP staining of the whole spinal cord in EAE mice from shVec and shTRIM21 group. Scale bar: 100 μm. Demyelinated areas are marked with dashed lines.

      Reviewer #2 (Public Review):

      This study significantly advances our understanding of the metabolic reprogramming underlying astrocyte activation in neurological diseases such as multiple sclerosis. By employing an experimental autoimmune encephalomyelitis (EAE) mouse model, the authors discovered a notable nuclear translocation of PKM2, a key enzyme in glycolysis, within astrocytes.

      Preventing this nuclear import via DASA-58 substantially attenuated primary astrocyte activation, characterized by reduced proliferation, glycolysis, and inflammatory cytokine secretion.

      Moreover, the authors uncovered a novel regulatory mechanism involving the ubiquitin ligase TRIM21, which mediates PKM2 nuclear import. TRIM21 interaction with PKM2 facilitated its nuclear translocation, enhancing its activity in phosphorylating STAT3, NFκB, and c-myc. Single-cell RNA sequencing and immunofluorescence staining further supported the upregulation of TRIM21 expression in astrocytes during EAE.

      Manipulating this pathway, either through TRIM21 overexpression in primary astrocytes or knockdown of TRIM21 in vivo, had profound effects on disease severity, CNS inflammation, and demyelination in EAE mice. This comprehensive study provides invaluable insights into the pathological role of nuclear PKM2 and the ubiquitination-mediated regulatory mechanism driving astrocyte activation.

      The authors' use of diverse techniques, including single-cell RNA sequencing, immunofluorescence staining, and lentiviral vector knockdown, underscores the robustness of their findings and interpretations. Ultimately, targeting this PKM2-TRIM21 axis emerges as a promising therapeutic strategy for neurological diseases involving astrocyte dysfunction.

      While the strengths of this piece of work are undeniable, some concerns could be addressed to further refine its impact and clarity, as outlined in the recommendations for the authors.

      We thank the reviewer for the comments and the positive evaluation of our work. We have answered each question in the recommendations section.

      Reviewer #3 (Public Review):

      Summary:

      Pyruvate kinase M2 (PKM2) is a rate-limiting enzyme in glycolysis, and its translocation to the nucleus in astrocytes in various nervous system pathologies has been associated with a metabolic switch to glycolysis, which is a sign of reactive astrogliosis. The authors investigated whether this occurs in experimental autoimmune encephalomyelitis (EAE), an animal model of multiple sclerosis (MS). They show that in EAE, PKM2 is ubiquitinated by TRIM21 and transferred to the nucleus in astrocytes. Inhibition of the TRIM21-PKM2 axis efficiently blocks reactive gliosis and partially alleviates symptoms of EAE. The authors conclude that this axis can be a potential new therapeutic target in the treatment of MS.

      Strengths:

      The study is well-designed, controls are appropriate and a comprehensive battery of experiments has been successfully performed. Results of in vitro assays, single-cell RNA sequencing, immunoprecipitation, RNA interference, molecular docking, and in vivo modeling etc. complement and support each other.

      Weaknesses:

      Though EAE is a valid model of MS, a proposed new therapeutic strategy based on this study needs to have support from human studies.

      We agree that, although we have clarified the therapeutic potential of targeting TRIM21 or PKM2 in the treatment of EAE, a mouse model of MS, application in humans warrants further study. Before TRIM21 can be considered as a target for treating multiple sclerosis in clinical trials, several issues need to be addressed to ensure safety, efficacy and feasibility. One such issue is the development of drugs that specifically target TRIM21 in the brain, can cross the blood-brain barrier and have minimal off-target effects. Translating preclinical findings into clinical trials poses a significant challenge. To provide evidence for the similarities between the EAE model and multiple sclerosis, we screened GEO datasets (Author response image 3). GSE214334 analyzed transcriptional profiles of normal-appearing white matter from non-MS controls and from different disease subtypes (RRMS, SPMS and PPMS). Although no statistical difference was observed among the groups, TRIM21 expression tended to increase in SPMS (secondary progressive MS) and PPMS (primary progressive MS) patients. In GSE83670, astrocytes from 3 control white matter samples and 4 multiple sclerosis normal-appearing white matter (NAWM) samples were analyzed. TRIM21 mRNA expression was higher in the MS group (78.73 ± 10.44) than in the control group (46.67 ± 24.15). Although neither GEO dataset yielded statistically significant differences, TRIM21 expression appears to be elevated in the white matter of MS patients compared to controls.

      To address this limitation, we have incorporated the following statement in the discussion section: “However, whether TRIM21-PKM2 could potentially serve as therapeutic targets in multiple sclerosis warrants further studies.”

      Author response image 3.

      TRIM21 expression in control and MS patients based on published GEO datasets. (A) The expression of TRIM21 in normal-appearing white matter in non-MS (Ctl) and different clinical subtypes of MS (RRMS, SPMS, PPMS) based on GSE214334 (one-way ANOVA). (B) The expression of TRIM21 in multiple sclerosis normal-appearing white matter (NAWM) and control WM based on GSE83670 (unpaired Student's t test). RRMS, relapsing-remitting MS; SPMS, secondary progressive MS; PPMS, primary progressive MS. Data are represented as the means ± SEM.
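
      As a rough illustration of the GSE83670 comparison in panel (B), an unpaired Student's t statistic can be recomputed from summary values alone. This sketch assumes that the "±" values quoted in the response (78.73 ± 10.44 for 4 MS samples, 46.67 ± 24.15 for 3 controls) are SEMs, as the caption states for the plotted data, and that a pooled-variance test was used. The function name is hypothetical, and this is not the authors' analysis code.

```python
# Hypothetical sketch: pooled-variance (Student's) t statistic from mean, SEM and n.
import math


def pooled_t_from_summary(mean_a, sem_a, n_a, mean_b, sem_b, n_b):
    """Return (t statistic, degrees of freedom) for an unpaired Student's t test."""
    # Convert each SEM back to a sample variance: SD = SEM * sqrt(n).
    var_a = (sem_a ** 2) * n_a
    var_b = (sem_b ** 2) * n_b
    df = n_a + n_b - 2
    # Pool the two variances, weighting each by its own degrees of freedom.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    se_diff = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / se_diff, df


# Values quoted in the response; treating "±" as SEM is an assumption.
t_stat, df = pooled_t_from_summary(78.73, 10.44, 4, 46.67, 24.15, 3)
print(f"t = {t_stat:.2f}, df = {df}")
```

      The statistic can then be compared with the two-sided 5% critical value of the t distribution for the resulting degrees of freedom (about 2.571 for 5 degrees of freedom).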

      Reviewer #4 (Public Review):

      Summary:

      The authors report the role of the Pyruvate Kinase M2 (PKM2) enzyme nuclear translocation as fundamental in the activation of astrocytes in a model of autoimmune encephalitis (EAE). They show that astrocytes, activated through culturing in EAE splenocytes medium, increase their nuclear PKM2 with consequent activation of NFkB and STAT3 pathways. Prevention of PKM2 nuclear translocation counteracts this activation. The authors found that the E3 ubiquitin ligase TRIM21 interacts with PKM2 and promotes its nuclear translocation. In vivo, either silencing of TRIM21 or inhibition of PKM2 nuclear translocation ameliorates the severity of the disease in the EAE model.

      Strengths:

      This work contributes to the knowledge of the complex action of the PKM2 enzyme in the context of an autoimmune-neurological disease, highlighting its nuclear role and a novel partner, TRIM21, and thus adding a novel rationale for therapeutic targeting.

      Weaknesses:

      Despite the relevance of the work and its goals, some of the conclusions drawn would require more thorough proof:

      I believe that the major weakness is the fact that TRIM21 is known to have per se many roles in autoimmune and immune pathways and some of the effects observed might be due to a PKM2-independent action. Some of the experiments to link the two proteins, besides their interaction, do not completely clarify the issue. On top of that, the in vivo experiments address the role of TRIM21 and the nuclear localisation of PKM2 independently, thus leaving the matter unsolved.

      We agree that TRIM21 has multifunctional roles and that some of its effects may be due to PKM2-independent actions. TRIM21 functions as a ubiquitin ligase and has various substrates. Here we identify PKM2 as one of its interacting proteins, and our focus is the relationship between TRIM21 and the nuclear translocation of PKM2. We used diverse experiments to clarify this relationship, for example immunoprecipitation, western blotting, immunofluorescence, and cyto-nuclear protein extraction. These experiments are key points of our study. The in vitro results suggest that both TRIM21 and PKM2 might be potential targets for EAE treatment. As expected, in the in vivo experiments, targeting either TRIM21 or PKM2 nuclear transport ameliorated EAE. To test the relationship between TRIM21 and PKM2 nuclear transport in vivo, we stained PKM2 in shVec- and shTRIM21-treated mice. As expected, knocking down TRIM21 led to a decrease in the nuclear staining of PKM2 in spinal cord astrocytes in EAE models (Figure S7A). This observation suggests that the therapeutic potential of inhibiting TRIM21 in astrocytes in vivo might be partially due to reduced nuclear translocation of PKM2.

      Some experimental settings are not described to a level that is necessary to fully understand the data, especially for a non-expert audience: e.g. the EAE model and MOG treatment; action and reference of the different nuclear import inhibitors; use of splenocyte culture medium and the possible effect of non-EAE splenocytes.

      Following the reviewer’s suggestions, we have added more detailed descriptions to the Materials and Methods section, including the use of splenocyte culture medium, mass spectrometry, and HE and LFB staining. More details have been incorporated into the part on “EAE induction and isolation and culture of primary astrocytes”. Moreover, references for the use of DASA-58 in vitro and TEPP-46 in vivo as inhibitors of PKM2 nuclear transport have been added.

      The statement that PKM2 is a substrate of TRIM21 ubiquitin ligase activity is an overinterpretation. There is no evidence that this interaction results in ubiquitin modification of PKM2; the ubiquitination experiment is minimal and is not performed in conditions that would allow us to see ubiquitination of PKM2 (e.g. denaturing conditions, reciprocal pull-down, catalytically inactive TRIM21, etc.).

      To prevent the misunderstanding, we have revised certain statements in the manuscript. In the updated version, the description is as follows: Hereby, we recognized PKM2 as an interacting protein of TRIM21, and further studies are required to determine if it is a substrate of E3 ligase TRIM21.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      General recommendations:

      - The whole manuscript needs language editing.

      We appreciate the comments of the reviewers. We have improved the writing of the manuscript. All modifications are underlined.

      - Details of many experiments are not given in the materials and methods.

      According to the reviewer’s suggestions, we have added more details for experiments in the materials and methods. For example, “Splenocyte isolation and supernatant of MOG35-55-stimulated-splenocytes”, “mass spectrometry”, “Hematoxylin-Eosin (HE) and Luxol Fast Blue (LFB) staining” were added in the section of Materials and Methods. More detailed information is given for EAE induction and isolation and culture of primary astrocytes.

      - Line properties in graphics should be corrected, some lines in box plots and error bars are very weak and hardly visible. Statistical tests should be included in figure legends as well. Statistical differences should be mentioned for control vs DASA-58 (alone) in all related figures.

      We have revised the figures to enhance their visibility by thickening the lines and error bars. In accordance with the reviewer’s suggestions, we have included the statistical tests in the figure legends. Moreover, statistical analysis was performed among all groups; where no asterisk is shown in the figure legend and figure panels, there is no statistical difference between the control and DASA-58 groups. For most of the experiments in our study, including lactate production, glucose consumption, the EdU and CCK8 analyses, and the changes in the STAT3 and NF-κB pathways, no statistical difference was observed between the control and DASA-58 groups. A possible reason is that in unstimulated astrocytes PKM2 expression is low and little PKM2 translocates to the nucleus, which may explain why DASA-58 did not exert the anticipated effect. We therefore used MOGsup to stimulate astrocytes, enabling us to observe the impact of DASA-58 on astrocyte proliferation and glycolysis under this condition.

      - Scale bars, arrows, and labeling in the images are not visible.

      We have improved the images according to the reviewer’s suggestions. The scale bars and arrows have been made thicker and the labels larger, so that they are now clearly visible in the updated figures.

      - Quantitative analysis of all western blot results and their statistics could be provided in every image and for every protein.

      For western blot results that were further processed by quantitative analysis (for example, Fig. 2D, Fig. 5G, Fig. 6A and 6B, and Fig. S4), we have added the statistics to the raw data sections. The other western blot results, such as the IP analyses used to assess protein-protein binding, were not further quantified.

      - Proteins that are used for normalizations in western blots should be stated in the text.

      We have added a description of the proteins used for normalization in western blots to the figure legends. Moreover, the proteins used for normalization are indicated in the figure panels. Whole-cell protein levels are normalized to β-actin. For nuclear and cytoplasmic fractions, nuclear protein is normalized to lamin and cytoplasmic protein is normalized to tubulin.
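
      As an illustration only, the normalization scheme described above can be sketched as follows. The function names and the band intensities are hypothetical and are not taken from the study; the sketch simply divides each target band by its loading control (β-actin, lamin or tubulin) and then expresses the result relative to the control group.

```python
# Loading controls named in the response. The intensity values used below are
# hypothetical placeholders, not measurements from the study.
LOADING_CONTROLS = {
    "whole": "beta-actin",
    "nuclear": "lamin",
    "cytoplasmic": "tubulin",
}


def normalize_band(target: float, loading_control: float) -> float:
    """Return the target band intensity divided by its loading-control intensity."""
    if loading_control <= 0:
        raise ValueError("loading-control intensity must be positive")
    return target / loading_control


def fold_change(normalized: dict, reference: str) -> dict:
    """Express each normalized value relative to the reference group."""
    ref = normalized[reference]
    if ref == 0:
        raise ValueError("reference group has zero normalized intensity")
    return {group: value / ref for group, value in normalized.items()}


# Hypothetical nuclear-fraction densitometry: (PKM2 band, lamin band).
nuclear_bands = {"control": (250.0, 1000.0), "MOGsup": (750.0, 1000.0)}
normalized = {g: normalize_band(t, c) for g, (t, c) in nuclear_bands.items()}
print(fold_change(normalized, "control"))
```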

      - The manuscript investigates the role of TRIM21 in the nuclear localization of PKM2 in astrocytes in EAE mice, however almost no information is given about TRIM21 in the introduction. Extra information is given for PKM2, yet can be concisely explained.

      We have added a paragraph that describes the information of TRIM21 in the introduction section. The description is as follows: “TRIM21 belongs to the TRIM protein family which possess the E3 ubiquitin ligase activity. In addition to its well-recognized function in antiviral responses, emerging evidences have documented the multifaceted role of TRIM21 in cell cycle regulation, inflammation and metabolism (Chen et al., 2022). Nevertheless, the precise mechanisms underlying the involvement of TRIM21 in CNS diseases remain largely unexplored.”

      - "As such, deciphering glycolysis-dominant metabolic switch in astrocytes is the basis for understanding astrogliosis and the development of neurological diseases such as multiple sclerosis." The sentence could be supported by references.

      To support this sentence, we have added the following references:

      (1) Xiong XY, Tang Y, Yang QW. Metabolic changes favor the activity and heterogeneity of reactive astrocytes. Trends in endocrinology and metabolism: TEM 2022;33(6):390-400.

      (2) das Neves SP, Sousa JC, Magalhães R, Gao F, Coppola G, Mériaux S, et al. Astrocytes Undergo Metabolic Reprogramming in the Multiple Sclerosis Animal Model. Cells 2023;12(20):2484.

      Figure 1/Result 1:

      - Figure 1A-B: Quality of the images should be improved.

      Following the reviewer’s suggestion, we have improved the image quality; higher-resolution images were added to Figure 1A and Figure 1B.

      - Control images of Figure 1B are not satisfying. GFAP staining is very dim. Images from control cells should be renewed.

      As suggested by the reviewer, we have replaced the control images and added DAPI staining images for all groups. Compared with MOGsup-stimulated astrocytes, the control cells are not in an activated state and their GFAP levels are relatively low.

      - Labelings on the images are not sufficient, arrows and scale bars are not visible.

      We have improved the images including labels, arrows and scale bars in all figures.

      - How splenocytes were obtained from MOG induced mice were not given in the material and methods section. Thus, it should be clearly stated how splenocyte supernatant is generated (treatment details).

      We have added the detailed information relating to splenocyte isolation and splenocyte supernatant entitled “Splenocyte isolation and supernatant of MOG35-55-stimulated-splenocytes” in the section of Materials and methods. “Splenocytes were isolated from EAE mice 15 d (disease onset) after MOG35-55 immunization. Briefly, spleen cells were suspended in RPMI-1640 medium containing 10% FBS. Splenocytes were plated in 12-well plates at 1×10^6 cells/well containing 50 μg/mL MOG35-55 and cultured at 37°C in 5% CO2. After stimulation for 60 h, cell suspension was centrifuged at 3000 rpm for 5 min and supernatants were collected. For the culture of MOGsup-stimulated astrocytes, astrocytes were grown in medium containing 70% DMEM supplemented with 10% FBS and 30% supernatant from MOG35-55-stimulated-splenocytes.”
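
      For readers preparing the MOGsup medium, the stated composition can be turned into volumes with a short calculation. This is a sketch under one assumption that the text does not spell out: "DMEM supplemented with 10% FBS" is read as FBS making up 10% (v/v) of the DMEM portion. The function name is hypothetical.

```python
def mogsup_medium_volumes(total_ml: float) -> dict:
    """Split a total volume into 70% complete DMEM and 30% splenocyte supernatant.

    Assumption: "DMEM supplemented with 10% FBS" means FBS makes up 10% (v/v)
    of the complete DMEM portion.
    """
    if total_ml <= 0:
        raise ValueError("total volume must be positive")
    complete_dmem = total_ml * 70 / 100
    supernatant = total_ml * 30 / 100
    fbs = complete_dmem * 10 / 100
    return {
        "dmem_base_ml": complete_dmem - fbs,
        "fbs_ml": fbs,
        "mogsup_supernatant_ml": supernatant,
    }


# Example: volumes for 10 mL of astrocyte stimulation medium.
print(mogsup_medium_volumes(10.0))
```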

      - For general astrocyte morphology: authors showed the cells are GFAP+ astrocytes. It is surprising that these cells do not bear classical astrocyte morphology in cell culture. How long do you culture astrocytes before treatment? How do you explain their morphological difference?

      Astrocytes were cultured for 2 to 3 weeks, corresponding to 2-3 passages, before treatment. There are several possible reasons for the difference between the observed morphology of GFAP+ astrocytes and their classical morphology. The first is cell density: in low-density culture, as shown in Figure 1B, astrocytes adopt a more flattened morphology, whereas in high-density cultures they adopt a stellate shape. Moreover, variations in culture conditions, such as the use of different fetal bovine serum, can also influence astrocyte morphology. In addition, mechanical injury induced by the isolation procedure might contribute to variations in morphology during in vitro cultivation. In summary, the morphological differences observed in GFAP+ astrocytes in cell culture likely result from a combination of culture conditions, cell density, and mechanical injury that occurred during astrocyte isolation.

      - Additional verification of reactive astrocytes could be performed by different reactive astrocyte markers, such as GLAST, Sox9, S100ß. Thus, quantitative analysis of activated astrocytes can be done by counting DAPI vs GLAST, Sox9 or S100ß positive cells.

      We agree with the reviewer that there are other markers of reactive astrocytes, such as GLAST, Sox9 and S100β. However, substantial evidence supports GFAP as the most commonly used marker of reactive astrocytes. In most cases, reactive astrocytes overexpress GFAP. GFAP is one of the most consistently induced genes in transcriptomic datasets of reactive astrocytes, confirming its usefulness as a reactive marker (Escartin et al., 2019). Thus, we used GFAP as the marker of astrocyte activation in our study.

      - How you performed quantifications for Figures 1C and 1D should be clearly explained, details are not given.

      The quantification methods for Figures 1C and 1D were added to the figure legend. In brief, the mean fluorescence intensity of PKM2 in the different groups of (B) was calculated with ImageJ. The number of cells with nuclear PKM2 was counted manually in Image-Pro Plus software (each cell was classified as nuclear or cytoplasmic based on the blue DAPI staining). The proportion of nuclear PKM2 was determined by normalizing the count of nuclear PKM2 to the count of DAPI-stained nuclei, which represents the number of cells.
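
      The ratio described above amounts to a simple count normalization. The following is a minimal sketch, assuming per-field manual counts are available as pairs of (cells with nuclear PKM2, DAPI-stained nuclei); the function names and the example counts are hypothetical and are not data from the study.

```python
def nuclear_pkm2_proportion(nuclear_pkm2_cells: int, dapi_nuclei: int) -> float:
    """Return the fraction of DAPI-stained nuclei that show nuclear PKM2."""
    if dapi_nuclei <= 0:
        raise ValueError("at least one DAPI-stained nucleus is required")
    if not 0 <= nuclear_pkm2_cells <= dapi_nuclei:
        raise ValueError("nuclear PKM2 count must lie between 0 and the DAPI count")
    return nuclear_pkm2_cells / dapi_nuclei


def pooled_proportion(fields: list) -> float:
    """Pool the counts from several imaging fields before taking the ratio."""
    total_nuclear = sum(nuclear for nuclear, _ in fields)
    total_dapi = sum(dapi for _, dapi in fields)
    return nuclear_pkm2_proportion(total_nuclear, total_dapi)


# Hypothetical counts per field: (cells with nuclear PKM2, DAPI-stained nuclei).
example_fields = [(12, 48), (30, 50)]
per_field = [nuclear_pkm2_proportion(n, d) for n, d in example_fields]
print(per_field, pooled_proportion(example_fields))
```

      Pooling the counts weights each field by its cell number, whereas averaging the per-field ratios weights every field equally; which of the two the authors used is not stated in the response.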

      - "Together, these data demonstrated the nuclear translocation of PKM2 in astrocytes from EAE mice." Here the usage of "suggests" instead of "demonstrated".

      Based on the reviewer's suggestion, we have revised the use of "demonstrated" to "suggest" in this sentence.

      Result 2 and 3:

      - In the literature, DASA-58 is shown to be the activator of PKM2 (https://www.nature.com/articles/nchembio.1060; https://doi.org/10.1016/j.cmet.2019.10.015).

      - Providing references for the inhibitory use of DASA-58 for PKM2 would be appreciated.

      DASA-58 is referred to as a “PKM2 activator” because it enforces the tetramerization of PKM2, enhancing the enzymatic ability of PKM2 to catalyze the conversion of PEP to pyruvate. However, enforced tetramerization reduces the dimeric form of PKM2 and thereby inhibits its nuclear translocation. For this reason, DASA-58 is also used as an inhibitor of PKM2 nuclear translocation. In primary BMDMs, LPS induced nuclear PKM2, whereas driving PKM2 into tetramers using DASA-58 and TEPP-46 inhibited LPS-induced PKM2 nuclear translocation (Palsson-McDermott et al., 2015). Consistently, FSTL1-induced PKM2 nuclear translocation was inhibited by DASA-58 in BMDMs (Rao et al., 2022). Accordingly, we have added these references to the manuscript.

      - Western blot results and statistics for PKM2 should be quantitatively given for all groups.

      According to the reviewer’s suggestions, we have added the quantification of PKM2 for western blots in figure 2 and figure 3. Quantification of PKM2 in figure 2D is added in Fig S3. Quantification of PKM2 in figure 3D is added in Fig.S4B and Fig. S4C.

      - Figure 3A-B: staining method/details are not mentioned in materials and methods.

      The staining methods are described in the paragraph entitled “Immunofluorescence” in the Materials and Methods section, as follows:

      For cell immunochemistry, cells cultured on glass coverslips were fixed with 4% PFA for 10 min at RT, followed by permeabilization with 0.3% Triton X-100. Non-specific binding was blocked with buffer containing 3% BSA for 30 min at RT. Briefly, samples were then incubated with primary antibodies and secondary antibodies. DAPI was used to stain the nuclei. Tissues and cells were observed and images were acquired using an EVOS FL Auto 2 Cell image system (Invitrogen). The fluorescence intensity was measured by ImageJ.

      - In Figure 3A, in only DASA-58 treated cells, it looks like GFAP staining is decreased. It would be better to include MFI analysis for GFAP in the supplementary information.

      We have added the MFI analysis of GFAP for Figure 3A in Fig. S4A. GFAP expression is decreased after DASA-58 treatment (in both control and MOGsup conditions). A possible reason is that DASA-58 inhibits PKM2 nuclear transport, which in turn suppresses astrocyte activation and leads to decreased GFAP expression.

      Result 4

      - Detailed explanation of the mass spectrometry and IP experiments should be given in materials and methods. What are the conditions of the cells? Which groups were analyzed? Are they only MOG stimulated, MOG-DASA-58 treated, or only primary astrocytes without any treatment? The results should be interpreted according to the experimental group that has been analyzed.

      We have added detailed information on mass spectrometry and immunoprecipitation to the Materials and Methods. In brief, two groups of cells were subjected to mass spectrometry analysis: primary astrocytes without any treatment and MOGsup-stimulated primary astrocytes. Both groups were immunoprecipitated with anti-PKM2 antibody. Moreover, we have revised the sentence describing the mass spectrometry in the manuscript as follows: “To illustrate underlying mechanism accounting for nuclear translocation of PKM2 in astrocytes, we sought to identify PKM2-interacting proteins. Here, unstimulated and MOGsup-stimulated primary astrocytes were subjected to PKM2 immunoprecipitation, followed by mass spectrometry”. Furthermore, a description of these two groups of cells was added to the figure legend of Fig. 4.

      Result 5:

      - For the reader, it would be better to start this part by explaining the role of TRIM21 in cells by referring to the literature.

      We agree with the reviewer that it is better to begin this part by explaining the role of TRIM21. Accordingly, we have added the following description at the beginning of this part: “TRIM21 is a multifunctional E3 ubiquitin ligase that plays a crucial role in orchestrating diverse biological processes, including cell proliferation, antiviral responses, cell metabolism and inflammatory processes (Chen X. et al., 2022).” The relevant literature has been included: Chen X, Cao M, Wang P, Chu S, Li M, Hou P, et al. The emerging roles of TRIM21 in coordinating cancer metabolism, immunity and cancer treatment. Front Immunol 2022;13:968755.

      - The source and the state of the cells (control vs MOG induced) should be stated (Figure 5A).

      For Figure 5A to 5D, single-cell RNA-seq was performed on CNS tissues from naive mice and from EAE mice at different phases (peak and chronic). We have added this information to the legend of Figure 5.

      - Figure 5D can be placed after 5A. Data in Figure 5A is probably from naive animals, if so, it should be stated in the legend where A is explained. The group details of the data shown in Figure 5 should be clearly stated.

      Following the reviewer’s suggestions, we have placed 5D after 5A. Single-cell RNA-seq analysis was performed on CNS tissues from naive mice and EAE mice. This information is stated in the legend of Figure 5A-D: “Single-cell RNA-seq profiles from naive and EAE mice (peak and chronic phase) CNS tissues. Naive (n=2); peak (dpi 14–24, n=3); chronic (dpi 21–26, n=2).”

      - Immunofluorescence images should be replaced with better quality images, in control images, stainings are not visible.

      We have replaced the images in Figure 5H with better-quality ones, and the staining is now visible in the control images.

      Result 6:

      - Experimental procedures should be given in detail in materials and methods.

      We have revised the Materials and Methods section and added more details. Detailed information was added for astrocyte isolation and immunoprecipitation. Moreover, sections on mass spectrometry, Hematoxylin-Eosin (HE) and Luxol Fast Blue (LFB) staining, and splenocyte isolation and supernatant of MOG35-55-stimulated splenocytes were added.

      Result 7:

      - In Figure 7A, the mean clinical score seems significantly reduced in the shTRIM21-treated group, although it is explained in the result text that it is not significant. Explain to us the difference between Figure 7A and the explaining text?

      Thank you for pointing this out. We sincerely apologize for our carelessness. Based on your comments, we have made the corrections in the manuscript. As there is indeed a statistical difference in the mean clinical scores between the shTRIM21-treated group and the shVec group, we have revised the sentence for Figure 7A to state, “At the end time point at day 22 p.i., shTRIM21-treated group showed reduced disease scores compared to control groups (Fig. 7A).”

      - The staining methods for Luxol fast blue and HE are not given in materials and methods.

      According to the reviewer’s comments, we have added the staining methods for HE and LFB in materials and methods.

      - In Figure 7E, authors claim that MBP staining is low in an image, however the image covers approximately 500 um area. One would like to see the demyelinated areas in dashed lines, and also the whole area of the spinal cord sections.

      In Author response image 2, we have added the images for MBP staining of the whole area of spinal cord sections. Demyelinated areas are marked with dashed lines.

      - "TEPP-46 is an allosteric activator that blocks the nuclear translocation of PKM2 by promoting its tetramerization." should be supported by references.

      We have added two references for this sentence. Anastasiou D et al. showed that TEPP-46 acts as an activator by stabilizing subunit interactions and promoting tetramer formation of PKM2. Angiari S et al. showed that TEPP-46 prevented the nuclear transport of PKM2 by promoting its tetramerization in T cells.

      These two references are added:

      Angiari S, Runtsch MC, Sutton CE, Palsson-McDermott EM, Kelly B, Rana N, et al. Pharmacological Activation of Pyruvate Kinase M2 Inhibits CD4(+) T Cell Pathogenicity and Suppresses Autoimmunity. Cell metabolism 2020;31(2):391-405.e8.

      Anastasiou D, Yu Y, Israelsen WJ, Jiang JK, Boxer MB, Hong BS, et al. Pyruvate kinase M2 activators promote tetramer formation and suppress tumorigenesis. Nature chemical biology 2012;8(10):839-47.

      - Could you explain what the prevention stage is?

      The term “prevention stage” was used to describe the administration of TEPP-46 before disease onset. To be more accurate, we have revised the phrase from “prevention stage” to “preventive treatment” as described in other references. For example, Ferrara et al. (Ferrara et al., 2020) used “preventive” and “preventive treatment” to mean administration before disease onset.

      The revised sentences are as follows: “To test the effect of TEPP-46 on the development of EAE, the “preventive treatment” (i.e, administration before disease onset) was administered. Intraperitoneal treatment with TEPP-46 at a dosage of 50 mg/kg every other day from day 0 to day 8 post-immunization with MOG35-55 resulted in decreased disease severity (Fig. S8A).”
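
      To make the quoted regimen concrete, the schedule and per-animal dose can be worked out as below. This is only a sketch: the 20 g body weight is a hypothetical example value, and the function names are illustrative.

```python
def dosing_days(first_day: int = 0, last_day: int = 8, interval_days: int = 2) -> list:
    """Return the post-immunization days on which a dose is given (inclusive)."""
    if interval_days <= 0:
        raise ValueError("interval must be positive")
    return list(range(first_day, last_day + 1, interval_days))


def dose_mg(body_weight_g: float, dose_mg_per_kg: float = 50.0) -> float:
    """Return the amount of compound per injection for one animal, in mg."""
    if body_weight_g <= 0:
        raise ValueError("body weight must be positive")
    return dose_mg_per_kg * body_weight_g / 1000


# Every other day from day 0 to day 8 gives five injections.
print(dosing_days())
# Hypothetical 20 g mouse at 50 mg/kg.
print(dose_mg(20.0))
```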

      - In in vitro experiments, authors used DASA-58, and in vivo they used TEPP-46. What might be the reason that DASA-58 is not applied in vivo?

      The effects of DASA-58 and TEPP-46 in promoting PKM2 tetramerization have been tested in vitro and are well documented. Based on in vitro absorption, distribution, metabolism and excretion profiling studies, Anastasiou et al. predicted that TEPP-46 would have better in vivo drug exposure than DASA-58. Moreover, TEPP-46, but not DASA-58, has been pharmacokinetically validated in vivo (Anastasiou et al., 2012). Thus, we used TEPP-46 for the in vivo studies.

      - Authors claim that TEPP-46 activates PKM2 and leads to its nuclear translocation, however, they did not verify PKM2 expression in the nucleus.

      To confirm that TEPP-46 inhibits PKM2 nuclear translocation both in vivo and in vitro, we performed western blotting analysis and immunofluorescence staining. In vitro, TEPP-46 inhibited MOGsup-induced PKM2 nuclear translocation, an effect similar to that of DASA-58 (Author response image 4). The in vivo effect of TEPP-46 was analyzed by co-immunostaining of PKM2 and GFAP. The results showed reduced nuclear staining of PKM2 in spinal cord astrocytes in TEPP-46-treated EAE mice compared with control EAE mice (Figure S7B).

      Author response image 4.

      TEPP-46 inhibited the nuclear transport of PKM2 in primary astrocytes. Nuclear-cytoplasmic protein extraction analysis showed the nuclear and cytoplasmic changes of PKM2 in TEPP-46 treated astrocytes and MOGsup-stimulated astrocytes. Primary astrocytes were pretreated with 50 μM TEPP-46 for 30 min and stimulated with MOGsup for 24 h.

      Supplementary Figure 3:

      - In Figure 3D, merge should be stated on top of the merged images, it is confusing to the reader.

      Following the reviewer’s comments, we have added the label “Merge” on top of the merged images.

      Discussion:

      All results should be discussed in detail by interpreting them according to the literature.

      We have expanded the discussion section. First, we added a paragraph describing the role of PKM2 nuclear translocation in diverse CNS diseases. Moreover, a paragraph discussing the nuclear function of PKM2 as a protein kinase or transcriptional co-activator was added. The discussion section is now more comprehensive and interprets nearly all of the results in detail in light of the literature.

      Reviewer #2 (Recommendations For The Authors):

      The authors could address the following points:

      (1) In Figure 1A, the authors present immunofluorescence staining of PKM2 in both control mice and MOG35-55-induced EAE mice across different stages of disease progression: onset, peak, and chronic stages. Observing the representative images suggests a notable increase in PKM2 levels, particularly within the nucleus of MOG35-55-induced EAE mice. However, to provide a more comprehensive analysis, it would be beneficial for the authors to include statistical data, such as average intensities ± standard deviation (SD), along with the nuclear PKM2 ratio, akin to the presentation for cultured primary astrocytes in vitro in panels B-D. Additionally, the authors should clearly specify the number of technical repeats and the total number of animals utilized for these data sets to ensure transparency and reproducibility of the findings.

      We thank the reviewer for the suggestion. Accordingly, for Figure 1A, we have added the nuclear PKM2 ratio in astrocytes from control mice and from EAE mice at different stages in Supplementary Figure S1A. The quantification of mean fluorescence intensity (MFI) for PKM2 was added in Figure S1B. In addition, we have added the number of animals used in each group to the figure legend.

      (2) The blue hue observed in the merged images of Figure 1B (lower panel) presents a challenge for interpretation. The source of this coloration remains unclear from the provided information. Did the authors also include a co-stain for the nucleus in their imaging? To enhance clarity, especially for individuals with color vision deficiency, the authors might consider utilizing different color combinations, such as presenting PKM2 in green and GFAP in magenta, which would aid in distinguishing the two components. Furthermore, for in vitro cell analysis, incorporating a nuclear stain could provide valuable insights into estimating the cytosolic-to-nuclear ratio of PKM2.

      For the question relating to the merged images in figure 1B, PKM2 was presented in green, GFAP was presented in red and blue represents the nuclear staining by DAPI. “Merge” represents the merged images of these three colors. To enhance the clarity, we have added the images for the nuclear staining of DAPI.

      (3) To substantiate the conclusion of the authors regarding the enhancement of aerobic glycolysis due to PKM2 expression and nuclear translocation in MOGsup-stimulated astrocytes, employing supplementary methodologies such as high-resolution respirometry and metabolomics could offer valuable insights. These techniques would provide a more comprehensive understanding of metabolic alterations and further validate the observed changes in glycolytic activity.

      While we recognize the merits of techniques such as high-resolution respirometry and metabolomics, we believe that the conclusions regarding the enhancement of aerobic glycolysis due to PKM2 expression and nuclear translocation in MOGsup-stimulated astrocytes are sufficiently supported by the current experimental evidence. Our study has relied on a robust set of experiments, including lactate production, glucose consumption, cyto-nuclear localization analysis and western blotting analysis of key enzymes in glycolysis. These results, in conjunction with the literature on the role of PKM2 in various cancer cells, keratinocytes and immune cells, provide a strong foundation for our conclusions. Although metabolomics could offer a global view of the changes in metabolic states in astrocytes, lactate is the end product of aerobic glycolysis, so our analysis of lactate levels under different experimental conditions might be more direct. However, we fully acknowledge that future studies employing these advanced methodologies could provide further insights into the precise mechanisms underlying PKM2's effects on aerobic glycolysis.

      (4) Minor: Why is the style of the columns different in Fig 2 panel D compared to those shown in panels B, C, and G of Figure 2?

      To maintain a consistent column style across Figure 2, we have updated the columns in Figure 2D. We now use the same column style in Fig. 2B, C, D and G.

      (5) The effect of stimulating astrocytes with MOGsup on cell proliferation, as shown in Figure 2E, is very moderate. Does DASA-58 reduce the proliferation of control cells in this assay?

      In response to the reviewer’s questions, we conducted a CCK8 analysis in astrocytes subjected to DASA-58 treatment. As depicted in Author response image 5, administration of DASA-58 did not reduce the proliferation of control cells. This result aligns with our other findings in the glycolysis assays and EdU analysis, where there is no statistical difference between control group and DASA-58-treated group. One plausible explanation for this is that in their steady state, astrocytes in the control group are not in a hyperproliferative state. Under such conditions, inhibiting the translocation of PKM2 via DASA-58 or other inhibitors did not significantly affect the proliferation of astrocytes.

      Author response image 5.

      CCK8 analysis of astrocyte proliferation. Primary astrocytes were pretreated with 50 μM DASA-58 for 30 min before stimulation with MOGsup. Data are represented as mean ± SEM. ***P<0.001. SEM, standard error of the mean.

      (6) The tables and lists in Figure 4, panels A-D, are notably small, hindering readability and comprehension. Consider relocating these components to the supplementary materials as larger versions.

      We have updated the tables and lists and made the lines thicker. As suggested by the reviewer, we have relocated these components to Supplementary Figure S5.

      Reviewer #3 (Recommendations For The Authors):

      Higher magnification images that more clearly show nuclear translocation of PKM2 and pp65 and pSTAT3 immunoreactivity should be added to the figure panels, for example as insets.

      Thank you for pointing out this issue in the manuscript. Following the reviewer’s comments, we have included higher magnification images as insets for Figure 3A, Figure 3B and Figure 2A. These enlarged images now provide a clearer visualization of the nuclear translocation state of PKM2, pp65, and pSTAT3.

      There are seldom wording errors like features => feathers at line 364.

      We are very sorry for our incorrect writing. We have corrected this spelling mistake in the manuscript.

      Reviewer #4 (Recommendations For The Authors):

      Here below are major and minor concerns on the data presented:

      (1) It is not clear from the Methods section what are the culture conditions defined as 'control' in Figure 1B-D. I believe the control should be culturing with the conditioned medium of normal (non-EAE) mice splenocytes to be sure the effect is not from cytokines naturally secreted by these cells.

      We thank the reviewer for the comments and fully understand the concern. The control refers to non-treated primary astrocytes cultured in standard DMEM medium supplemented with 10% FBS. We have performed experiments to exclude the possibility that the observed effect of MOGsup on astrocyte activation is due to cytokines naturally secreted by splenocytes. Splenocytes from normal (non-EAE) mice were isolated and cultured in RPMI-1640 medium containing 10% FBS for 60 hours, and the supernatant was collected. Immunofluorescence staining of PKM2 and GFAP was performed in non-treated primary astrocytes and in astrocytes stimulated with supernatant from control splenocytes. As shown in Figure S1C, no difference in PKM2 expression or localization was observed between the two groups; PKM2 was located mainly in the cytoplasm in both conditions. These results indicate that the effect on PKM2 observed in the MOGsup-stimulated condition is not due to cytokines naturally secreted by splenocytes. Thus, we used non-treated primary astrocytes as controls in our study. To clarify the control group, we have revised the description in the figure legend as follows: “Immunofluorescence staining of PKM2 (green) with GFAP (red) in non-treated primary astrocytes (control) or primary astrocytes cultured with splenocytes supernatants of MOG35–55-induced EAE mice (MOGsup) for different time points (6 h, 12 h and 24 h).”

      (2) Figure 3D: the presence of PKM2 in the nuclear fraction upon MOGSUP together with the DASA-58 (last lane of Figure 3D) is not supporting the hypothesis proposed and further may indicate that the reduction of pSTAT3, pp65, etc. observed is independent of PKM2 nuclear translocation/astrocyte activation being observed even in absence of MOGSUP.

      Thank you for pointing out this problem. The representative immunoblot originally shown in Figure 3D did not clearly display the change in the nuclear level of PKM2, which understandably raised the reviewer’s doubts. To strengthen our conclusion that the reduction of the STAT3 and p65 pathways is related to the DASA-58-induced decrease in nuclear PKM2, the nuclear PKM2 level was quantified and added in Figure S4B. The quantification shows that DASA-58 administration decreased the nuclear level of PKM2 in MOGsup-stimulated astrocytes. To address this concern, we have updated the PKM2 immunoblot image in Figure 3D and incorporated the quantification results in Supplementary Figure S4.

      (3) Molecular docking indication and deletion co-immunoprecipitation reported in Figure 4 data are not concordant on TRIM21: N-terminal Phe23 and Thr87 (Figure 4E) predicted by MD to bind PKM2 are not in the PRY-SPRY domain suggested by the co-IP experiment (Figure 4I).

      The discrepancy between the molecular docking prediction and the co-immunoprecipitation can be explained as follows:

      Firstly, molecular docking is a computational method that predicts protein-protein interactions based on the 3D structures of the proteins. However, the accuracy of this prediction can be influenced by the choice of 3D structural models of TRIM21 and PKM2, as well as by factors such as post-translational modifications and protein flexibility. Proteins in vivo are subject to post-translational modifications that can affect their interactions, and these modifications are not fully captured in molecular docking analysis. For example, in our analysis, the predicted N-terminal residues Phe23 and Thr87 in TRIM21 have the potential to interact with PKM2 through hydrogen bonds. However, such binding can be influenced by the biological context, such as cell type and pathological condition. Molecular docking may suggest specific residues and a binding pocket within the protein complex; however, its accuracy should be verified by experimental techniques such as immunoprecipitation. To reflect the predictive nature of the docking results, the description has been revised as follows: “TRIM21 is predicted to bound to PKM2 via hydrogen bonds between the amino acids of the two molecules.”

      Co-immunoprecipitation with truncated domains of TRIM21 and PKM2 is an experimental technique that relies on the specific interaction between an antibody and its target protein. This technique can provide insight into the binding domains between TRIM21 and PKM2. As demonstrated in our study, the PRY-SPRY domain of TRIM21 is involved in this binding. In summary, while molecular docking and co-IP are both valuable tools for studying protein-protein interactions, their differing focus and limitations may result in discrepancies between the predicted interaction sites and the experimentally identified interaction domains.

      (4) The Authors state that PKM2 is a substrate of TRIM21 E3 ligase activity, however, this is not proved: i) interaction does not imply a ligase-substrate relationship; ii) the ubiquitination shown in Figure 6C is not performed in denaturing conditions thus the K63-Ub antibody can detect also interacting FLAG-IPed proteins (besides, only a single strong band is seen, not a chain; molecular weights in immunoblot should be indicated); iii) use of a catalytically inactive TRIM21 would be required as well.

      We appreciate the reviewer’s comments regarding the limitations of the immunoprecipitation and K63-antibody experiments, which do not support the conclusion that PKM2 is a substrate of TRIM21. To avoid any misunderstanding, we have revised the relevant sentence from “Hereby, we recognized PKM2 as a substrate of TRIM21” to “Hereby, we recognized PKM2 as an interacting protein of TRIM21, and further studies are required to determine if it is a substrate of E3 ligase TRIM21”. We have also revised the title of the relevant part of the Results section: the previous title, “TRIM21 ubiquitylates and promotes the nuclear translocation of PKM2”, has been replaced with “TRIM21 promotes ubiquitylation and the nuclear translocation of PKM2”. In addition, molecular weights are now indicated for all proteins in the western blots.

      (5) As above, molecular weights should always be indicated in immunoblot.

      Thank you for pointing out this problem in the figures. Accordingly, we have added the molecular weights for every protein tested in the immunoblots.

      (6) The authors should describe the EAE mouse model in the text and in the material and methods as it may not be so well known to the entire reader audience, and the basic principle of MOG35-55 stimulation, in order to understand the experimental plan meaning.

      We appreciate the reviewer’s comments highlighting the importance of clarifying EAE model for a broader understanding of the reader audience. In response, we have described the EAE model both in the text and in the materials and methods section. In the text, the description of EAE model was added at the beginning of the first paragraph in the Results section. The description is as follows: “EAE is widely used as a mouse model of multiple sclerosis, which is typically induced by active immunization with different myelin-derived antigens along with adjuvants such as pertussis toxin (PTX). One widely used antigen is the myelin oligodendrocyte glycoprotein (MOG) 35-55 peptide (Nitsch et al., 2021), which was adopted in our current studies.”

      We have also added the detailed experimental procedures for EAE induction in the materials and methods section.

      (7) The authors should better explain and give the rationale for the use of splenocytes and why directly activated astrocytes (isolated from the EAE model) cannot be employed to confirm/prove some of the presented data.

      Firstly, splenocytes offer a heterogenous cell population, encompassing T cells and antigen presenting cells (APC), which may better mimic the microenvironment and complex immune responses observed in vivo.

      Myelin oligodendrocyte glycoprotein (MOG) 35-55 peptide is a widely used antigen for EAE induction. MOG35-55 elicits strong T-cell responses and is highly encephalitogenic. Moreover, MOG35-55 induces a T cell-mediated phenotype of multiple sclerosis in animal models. Thus, by isolating splenocytes, which contain APCs and effector T cells, from mice at the onset stage of EAE and stimulating them with the MOG35-55 antigen in vitro for 60 hours, the T-cell response of the acute stage of EAE can be mimicked in vitro. The supernatant from MOG35-55-stimulated splenocytes has high levels of IFN-γ and IL-17A, which in part mimics the pathological process and environment in EAE, and this technique has been documented previously (Chen et al., 2009, Kozela et al., 2015).

      Correspondingly, we have revised the sentence describing the use of MOG35-55-stimulated splenocytes from EAE mice and added the relevant references: “Supernatant of MOG35-55-stimulated splenocytes isolated from EAE mice were previously shown to elicit a T-cell response in the acute stage of EAE and are frequently used as an in vitro autoimmune model to investigate MS and EAE pathophysiology (Chen et al., 2009, Du et al., 2019, Kozela et al., 2015).”

      Secondly, activated astrocytes (isolated from the EAE model) cannot be employed for in vitro culture for the following reasons:

      (1) Low cell viability. Compared to embryonic or neonatal mice, adult mice yield a limited number of viable cells. This is mainly because adult tissues possess less proliferative capacity.

      (2) Disease changes. Astrocytes in EAE mice are exposed to a microenvironment that includes inflammatory cytokines, antigens and other pathological factors. Outside this environment, the function and morphology of astrocytes change, which makes in vitro results difficult to interpret.

      For these reasons, the primary astrocytes cultured in vitro were obtained from neonatal mice.

      (8) The authors should indicate the phosphorylation sites they are referring to when analysing p-c-myc, pSTAT3, pp65, etc...

      According to the reviewer’s suggestions, we have added the phosphorylation sites for pSTAT3 (Y705), pp65 (S536), p-c-myc (S62) and pIKK (S176+S180) in the figure panels.

      (9) Reference of DASA-58 and TEPP-46 inhibitors and their specificity should be given.

      According to the reviewer’s comments, we have added the relevant references for the use of DASA-58 and TEPP-46 as inhibitors of PKM2 nuclear transport. In primary BMDMs, LPS induced nuclear PKM2. However, driving PKM2 into tetramers using DASA-58 and TEPP-46 inhibited LPS-induced PKM2 nuclear translocation (Palsson-McDermott et al., 2015). Consistently, FSTL1-induced PKM2 nuclear translocation was inhibited by DASA-58 in BMDMs (Rao et al., 2022). We have added these references to the manuscript.

      To address the selectivity of TEPP-46 and add the references, the relevant sentence has been revised from “TEPP-46 is an allosteric activator that blocks the nuclear translocation of PKM2 by promoting its tetramerization” to “TEPP-46 is a selective allosteric activator for PKM2, showing little or no effect on other pyruvate isoforms. It promotes the tetramerization of PKM2, thereby diminishing its nuclear translocation (Anastasiou et al., 2012, Angiari et al., 2020).”

      Reviewing Editor (Recommendations For The Authors):

      The reviewing editor would appreciate it if the original blots from the western blot analysis, which were used to generate the final figures, could be provided.

      We thank the reviewing editor for the comment. Accordingly, we will provide the original blots for the western blot analysis.

      References

      Anastasiou D, Yu Y, Israelsen WJ, Jiang JK, Boxer MB, Hong BS, et al. Pyruvate kinase M2 activators promote tetramer formation and suppress tumorigenesis. Nature chemical biology 2012;8(10):839-47.

      Escartin C, Guillemaud O, Carrillo-de Sauvage M-A. Questions and (some) answers on reactive astrocytes. Glia 2019;67(12):2221-47.

      Ferrara G, Benzi A, Sturla L, Marubbi D, Frumento D, Spinelli S, et al. Sirt6 inhibition delays the onset of experimental autoimmune encephalomyelitis by reducing dendritic cell migration. Journal of neuroinflammation 2020;17(1):228.

      Lin CC, Edelson BT. New Insights into the Role of IL-1β in Experimental Autoimmune Encephalomyelitis and Multiple Sclerosis. Journal of immunology (Baltimore, Md : 1950) 2017;198(12):4553-60.

      Palsson-McDermott Eva M, Curtis Anne M, Goel G, Lauterbach Mario AR, Sheedy Frederick J, Gleeson Laura E, et al. Pyruvate Kinase M2 Regulates Hif-1α Activity and IL-1β Induction and Is a Critical Determinant of the Warburg Effect in LPS-Activated Macrophages. Cell metabolism 2015;21(1):65-80.

      Rao J, Wang H, Ni M, Wang Z, Wang Z, Wei S, et al. FSTL1 promotes liver fibrosis by reprogramming macrophage function through modulating the intracellular function of PKM2. Gut 2022;71(12):2539-50.

      Wheeler MA, Clark IC, Tjon EC, Li Z, Zandee SEJ, Couturier CP, et al. MAFG-driven astrocytes promote CNS inflammation. Nature 2020;578(7796):593-9.

      Zhang J, Feng G, Bao G, Xu G, Sun Y, Li W, et al. Nuclear translocation of PKM2 modulates astrocyte proliferation via p27 and β-catenin pathway after spinal cord injury. Cell Cycle 2015;14(16):2609-18.

    1. Author Response

      The following is the authors’ response to the original reviews.

      We sincerely thank the reviewers for their constructive feedback. We have revised our manuscript to address some important concerns. The main changes are summarized as follows:

      (1) A major concern, as reflected in the eLife assessment and reviewer comments, was that the “evidence supporting the conclusion that striatal neurons encode single-limb gait is incomplete.” We have now provided an expanded analysis of gait phase-locking to different limbs in Figure 2 – figure supplement 1. The analysis reveals three key new insights: 1) most striatal neurons are significantly entrained to only one or two limbs; 2) for neurons entrained to two limbs, most limb pairs are diagonal pairs, whose phases are closely aligned; 3) the strength of phase-locking, as measured by the mean vector length, is biased toward a single limb. From these results we conclude that striatal neurons are indeed better correlated with single-limb (as opposed to multiple limbs’) gait. However, we speculate that because of the inherently correlated motion across limbs, some neurons also display significant phase-locking to multiple limbs, particularly to diagonal pairs.

      (2) Reviewer 2 noted the lack of a manipulation experiment which would help establish the striatum’s relationship to gait control. We have therefore included the results of new experimental data in Figure 6 – figure supplement 2, in which we show that optogenetically activating D2 MSNs alters both some measures of whole-body motion and single-limb gait. We recognize that these experiments are not ideal, for example, the optical stimulation was not entrained to limb phase. Nevertheless, they hopefully allay any concern that the striatum is incapable of influencing gait performance.

      (3) We have further characterized the relationship between vector length and firing rate, and firing rate between D1 and D2 MSNs. We now show that: 1) vector length is negatively correlated with session-wide firing rate (Figure 2 – figure supplement 1E); 2) session-wide firing rates are similar between D1 and D2 MSNs in both healthy and dopamine lesioned animals (Figure 4D and Figure 6H). Thus, the imbalance in the vector length between D1 and D2 MSNs following dopamine lesions is unlikely to be explained by changes in the overall firing rates of these cells.

      (4) We have added new data similar to Figure 1 with distributions of stride frequency, duration, and length to illustrate the difference between sham and 6OHDA mice (Figure 5 – figure supplement 1B,C).

      (5) We have expanded the Discussion section to discuss a number of important points raised by the reviewers. These include: 1) speculating on the origins of gait coding in the striatum; 2) discussion of some literature which reported similar levels of D1/D2 MSN start coding in contrast to our results in healthy mice; 3) discussion of the finding that almost all phase-locked cells also have a firing rate related to speed or start/stop signals; 4) discussion of one of the limitations of the unilateral 6OHDA model, namely, the strong turning bias, and its potential implications for our results.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Yang et al combine high-speed video tracking of the limbs of freely moving mice with in vivo electrophysiology to demonstrate how striatal neurons encode single-limb gait. They also examine encoding other well-known aspects of locomotion, such as movement velocity and the initiation/termination of movement. The authors show that striatal neurons exhibit rhythmic firing phase-locked with mouse gait, while mice engage in spontaneous locomotion in an open field arena. Moreover, they describe gait deficits induced by severe unilateral dopamine neuron degeneration and associate these deficits with a relative strengthening of gait-modulation in the firing of D2-expressing MSNs. Although the source and function of this gait-modulation remain unclear, this manuscript uncovers an important physiological correlate of striatal activity with gait, which may have implications for gait deficits in Parkinson's Disease.

      Strengths:

      While some previous work has looked at the encoding of gait variables in the striatum and other basal ganglia nuclei, this paper uses more careful quantification of gait with video tracking. In addition, few if any papers do this in combination with optically-labeled recordings as were performed here.

      Weaknesses:

      The data collected has a great richness at the physiological and behavioral levels, and this is not fully described or explored in the manuscript. Additional analysis and display of data would greatly expand the interest and interpretability of the findings.

      There are also some caveats to the interpretation of the analyses presented here, including how to compare encoding of gait variables when animals have markedly different behaviors (eg comparing sham and unilaterally 6-OHDA treated mice), or how to interpret the loss of gait modulation when single unit activity is overall very low.

      (1) The authors use circular analysis to quantify the degree to which striatal neurons are phase-locked to individual limbs during gait. The result of this analysis is shown as the proportion of units phase-locked to each limb, vector length, and vector angle (Fig 2H-K; Fig 4E-F; Fig 6E-F). Given that gait is a cyclic oscillation of the trajectories of all four limbs, one could expect that if one unit is phase-locked to one limb, it will also be phase-locked to the other three limbs but at a different phase. Therefore, it is not clear in the manuscript how the authors determine to which limb each unit is locked, and how some units are locked to more than one limb (Fig 2H). More methodological/analytical detail would be especially helpful.

      We thank the reviewer for raising this important issue, which was not sufficiently explored in our original manuscript. This relates to a major concern that “evidence supporting the conclusion that striatal neurons encode single-limb gait is incomplete.” We have now prepared a new figure supplement to address whether neurons are preferentially entrained to only one or multiple limbs (Figure 2 – figure supplement 1, panels A-C).

      Author response image 1.

      Panels A-C. Phase-locking to different limbs.

      Panel A shows the percentage of striatal neurons (all neurons including untagged cells) with significant phase-locking to only 1, 2, 3, or all 4 limbs. The results indicate that most phase-locked cells are entrained to either only 1, or only 2 limbs, as opposed to 3 or all 4 limbs. We next looked more closely at the cells which were entrained to only 2 limbs: Panel B shows that a significant majority of those cells were coupled to diagonal limb pairs. This finding is insightful because diagonal limb pairs move at nearly the same phase during walking, thus some overlap in phase-locking to these limbs is to be expected. Finally, Panel C shows the mean vector length per neuron ranked from the highest to lowest value. The results reveal that the vector length is significantly biased toward the highest ranked limb. This bias would be absent if neurons were entrained to all 4 limbs with similar strength. Together, these results support the conclusion that striatal neuron spiking is preferentially coupled to single limbs as opposed to multiple limbs. However, we speculate that because of the inherently correlated motion across limbs, some neurons also display significant phase-locking to multiple limbs, particularly to diagonal pairs.
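
      For readers who wish to see how a per-limb phase-locking measure of this kind can be computed, a minimal Python sketch follows. It is an illustration under stated assumptions, not the analysis code used in the study: the Rayleigh test (with Zar's approximation for the p-value), the significance threshold `alpha`, the function names, and the synthetic spike phases are all assumptions introduced here. Only the limb labels (LF, LR, RF, RR), the mean vector length, and the notion of diagonal limb pairs come from the text above.

```python
import math

# Diagonal limb pairs (left front with right rear, right front with left rear).
DIAGONAL_PAIRS = {frozenset(("LF", "RR")), frozenset(("RF", "LR"))}


def mean_vector_length(phases):
    """Return (R, mean angle) for spike phases in radians.

    R is 0 for uniformly spread phases and 1 when all spikes share one phase.
    """
    n = len(phases)
    c = sum(math.cos(p) for p in phases) / n
    s = sum(math.sin(p) for p in phases) / n
    return math.hypot(c, s), math.atan2(s, c)


def rayleigh_p(phases):
    """Approximate Rayleigh-test p-value for non-uniformity (assumed test)."""
    n = len(phases)
    r, _ = mean_vector_length(phases)
    big_r = n * r
    p = math.exp(math.sqrt(1 + 4 * n + 4 * (n * n - big_r * big_r)) - (1 + 2 * n))
    return min(1.0, p)


def locked_limbs(phases_by_limb, alpha=0.05):
    """Limbs with significant phase-locking, strongest vector length first."""
    stats = {
        limb: (mean_vector_length(ph)[0], rayleigh_p(ph))
        for limb, ph in phases_by_limb.items()
    }
    significant = [limb for limb, (_, p) in stats.items() if p < alpha]
    return sorted(significant, key=lambda limb: stats[limb][0], reverse=True)


def is_diagonal_pair(limb_a, limb_b):
    """True if the two limbs form a diagonal pair."""
    return frozenset((limb_a, limb_b)) in DIAGONAL_PAIRS


# Synthetic (hypothetical) spike phases for one neuron, one list per limb.
n_spikes = 100
uniform = [2 * math.pi * k / n_spikes for k in range(n_spikes)]
example = {
    "LF": [1.0 + 0.2 * ((k % 5) - 2) for k in range(n_spikes)],  # tight locking
    "RR": [1.2 + 0.6 * ((k % 5) - 2) for k in range(n_spikes)],  # weaker locking
    "RF": uniform,                                               # no locking
    "LR": uniform,                                               # no locking
}

limbs = locked_limbs(example)
print(limbs, is_diagonal_pair(*limbs) if len(limbs) == 2 else None)
```

      In this toy case the neuron is significantly locked to two limbs that form a diagonal pair, with a larger vector length for one of them, which is the pattern summarized in Panels B and C.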

      (2) In Figures 2 and 3, the authors describe the modulation of striatal neurons by gait, velocity, and movement transitions (start/end), with most of their examples showing firing rates compatible with rates typical of striatal interneurons, not MSNs. In order to have a complete picture of the relationship between striatal activity and gait, a cell type-specific analysis should be performed. This could be achieved by classifying units into putative MSN, FS interneurons, and TANs using a spike waveform-based unit classification, as has been done in other papers using striatal single-unit electrophysiology. An example of each cell type's modulation with gait, as well as summary data on the % modulation, would be especially helpful.

      We appreciate the reviewer’s suggestion to analyze our data after classifying units into different putative cell types (MSN, FSI, TAN). Indeed, we have frequently adopted this practice in our other publications (e.g., Bakhurin & Masmanidis 2016, 2017; Lee & Masmanidis 2019). However, this study already relies on a more rigorous method – optogenetic tagging – to identify D1 and D2 MSNs. We felt that adding a second, more subjective and therefore less rigorous identification method based on spike waveforms would add unnecessary confusion in how the results are presented and interpreted. For example, we were unsure how to address the situation where an opto-tagged D1 or D2 MSN may be classified as a putative FSI or TAN according to spike waveform criteria. For this reason, we decided not to perform an analysis by putative MSN, FSI, and TAN. Finally, we have made all our electrophysiological data available should someone want to perform this analysis themselves.

      (3) By normalizing limb trajectories to the nose-tail axis, the analysis ignores whether the mouse is walking straight, or making left/right turns. Is the gait-modulation of striatal activity shaped by ipsi- and contralateral turning? This would be especially important to understand changes in the unilateral disease model, given the imbalance in turning of 6-OHDA mice.

      This is an important question, which our data are unfortunately underpowered to address. Lesioned mice turn sharply for nearly the entire duration of walking, while healthy mice walk in a nearly straight line, with occasional brief turning bouts. Thus, we do not have sufficient stride numbers during healthy turning to enable a rigorous analysis of gait phase locking during left/right turns. This raises some questions about the interpretation of the higher D2 MSN vector length in dopamine lesioned mice – does the higher vector length relate to the impaired gait, or the higher incidence of turning in this PD model? We have acknowledged this issue in the Discussion section as a limitation of the unilateral 6OHDA model. And, in future work we hope to investigate turning effects in more detail using behavioral arenas which force animals to turn left or right at specific locations.

      (4) It looks like the data presented in Figure 4 D-F comes from all opto-identified D1- and D2-MSNs. How many of these are gait-modulated? This information is missing (line 110). Pooling all units may dilute differences specific to gait-modulated units, therefore a similar analysis only on gait-modulated units should be performed.

      The reviewer is correct that the data presented in Figure 4 come from all optogenetically tagged cells. We have now included a new panel, Figure 4H, which shows the proportion of D1 and D2 MSNs which encode limb phase, body speed, or start/stop. The reviewer suggested that a similar analysis should be performed only on gait-modulated units. We prefer to stick to our current approach (of using all cells, regardless of whether they show significant gait modulation) because it is less biased. For example, even cells which do not pass our threshold for statistical significance may display weak but visible gait modulation.

      (5) Since 6-OHDA lesions are on the right hemisphere, we would expect left limbs to be more affected than right limbs (although right limbs may also compensate). It is therefore surprising that RF and RR strides seem slightly shorter than LF and LR (Fig 5G), and no differences in other stride parameters (Fig 5H-J). Could the authors comment on that? It may be that this is due to rotational behavior. One interesting analysis would be to compare activity during similar movements in healthy and 6-OHDA mice, eg epochs in which mice are turning right (which should be present in both groups) or walking a few steps straight ahead (which are probably also present in both groups).

      Unilateral 6OHDA lesions are associated with ipsiversive turning (in this case, toward the right). The reviewer noted that the stride length is shorter for the two right compared to the two left limbs (Figure 5G), which is consistent with a right turning bias. In line with this observation, the stride speed for the right limbs also seemed slower than for the left limbs (Figure 5I), though we agree this is a bit difficult to see in the plot due to the choice of y-axis range. We appreciate the reviewer’s suggestion to analyze activity during similar movements in healthy and lesioned mice. As discussed in reply to their third comment above, our data did not contain sufficient bouts of straight walking in lesioned mice, or turning in healthy mice, to make such analysis possible. We have acknowledged this issue in the Discussion section as a limitation of the unilateral 6OHDA model. And, in future work we hope to investigate turning effects in more detail using behavioral arenas which force animals to turn left or right at specific locations.

      (6) Multiple publications have shown that firing rates of D1-MSN and D2-MSN are dramatically changed after dopamine neuron loss. Is it possible that changes observed in gait-modulation might be biased by changes in firing rates? For example, dMSNs have exceptionally low overall activity levels after dopamine depletion (eg Parker...Schnitzer, 2018; Ryan...Nelson, 2018; Maltese...Tritsch, 2021); this might reduce the ability to detect modulation in the firing of dMSNs as compared to iMSNs, which have similar or increased levels of activity in dopamine depleted mice. Does vector length correlate with firing rate? In addition, the normalization method used (dividing firing rate by minimum) may amplify very small changes in absolute rates, given that the firing rates for MSN are very low. The authors could show absolute values or Z-score firing rates (Figure 6 A, D).

      The reviewer asked a number of important questions here. First, is it possible that changes in gait modulation are biased by changes in firing rates? We have included a new analysis comparing the average session-wide firing rate of D1 and D2 MSNs (Figure 6D & 6H). This showed that firing rates were statistically similar between D1 and D2 MSNs for both sham and dopamine lesioned mice. Thus, it seems unlikely that the imbalance in vector length is purely due to changes in firing rate. The reviewer referenced some literature (e.g. Parker & Schnitzer; Ryan & Nelson; Maltese & Tritsch) which does appear to show significant changes in the relative firing levels of D1/D2 MSNs after dopamine lesions. While we can only speculate about the reason for the discrepancy (e.g., differences in measurement method, behavioral task, or analysis method), we note that not all prior literature has reported such changes (e.g., Ketzef & Silberberg 2017).

      Author response image 2.

      Panels D & H. No difference in firing between D1 and D2 MSNs.

      Second, does vector length correlate with firing rate? Interestingly, we found that indeed it does. We now show that vector length is negatively correlated with firing rate (Figure 2 – figure supplement 1E), implying that cells with higher overall firing rates tend to have weaker phase-locking to the gait cycle. Though not shown in the manuscript, we found a similar negative correlation for D1 and D2 MSNs in both healthy and dopamine lesioned mice.

      Author response image 3.

      Panel E. Vector length is negatively correlated to firing rate.
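
      As a rough illustration of how such a relationship can be checked across units, the sketch below computes a Spearman rank correlation between session-wide firing rate and vector length. The choice of Spearman's rho, the helper names, and the five example values are assumptions made for illustration; the correlation measure actually used is not specified in this response, and the numbers are not data from the study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks starting at 1; ties are not handled in this simple sketch."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))


# Hypothetical per-unit values: firing rate (Hz) and mean vector length.
firing_rate = [0.5, 1.0, 2.0, 4.0, 8.0]
vector_length = [0.30, 0.22, 0.15, 0.12, 0.05]

rho = spearman_rho(firing_rate, vector_length)
print(f"Spearman rho = {rho:.2f}")
```

      A negative rho in such an analysis corresponds to the pattern described above, in which units with higher overall firing rates show weaker phase-locking.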

      Third, the reviewer asked about our normalization method in Figure 6A etc, in which we divide by the minimum rate. We would like to clarify that this normalization method was only used for visualizing our data, but not for calculating the vector length. Therefore, we chose to leave the plots as they are.
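
      The reviewer's point about normalization can be made concrete with a small Python sketch. It compares dividing a phase-binned firing rate profile by its minimum with z-scoring it, for two hypothetical units that share the same absolute modulation (±0.2 Hz) but differ in baseline rate. The function names and rate values are assumptions for illustration only.

```python
import statistics


def min_normalize(rates):
    """Divide each bin by the minimum rate (requires a minimum above zero)."""
    lowest = min(rates)
    return [r / lowest for r in rates]


def zscore(rates):
    """Subtract the mean and divide by the population standard deviation."""
    mu = statistics.fmean(rates)
    sd = statistics.pstdev(rates)
    return [(r - mu) / sd for r in rates]


# Hypothetical firing rates (Hz) across four gait-phase bins.
low_rate_unit = [0.1, 0.3, 0.5, 0.3]   # low baseline, +/-0.2 Hz modulation
high_rate_unit = [4.8, 5.0, 5.2, 5.0]  # high baseline, same +/-0.2 Hz modulation

print(max(min_normalize(low_rate_unit)))   # roughly a 5-fold peak
print(max(min_normalize(high_rate_unit)))  # roughly a 1.08-fold peak
print(zscore(low_rate_unit))
print(zscore(high_rate_unit))
```

      Under min-normalization the low-rate unit appears far more strongly modulated even though the absolute change is identical, whereas the z-scored profiles of the two units coincide. This affects only how the traces look; as noted above, the vector length is computed independently of this normalization.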

      (7) The analysis shown in Fig 3C should also be done for opto-identified D1- and D2-MSNs (and for waveform-based classified units as noted above).

      We have now performed the same analysis for optogenetically tagged D1 and D2 MSNs from healthy mice (Figure 4H). As with our original analysis, both populations showed a similar proportion of neurons which encoded limb phase, start of movement, body speed, and the combination of these. We did not perform this analysis for waveform-based classified units as per our reason outlined in reply to the reviewer’s second comment above.

      Author response image 4.

      Panel H. Venn diagrams showing the percentage of D1 and D2 MSNs with significant responses to limb phase of at least one limb, body speed, and start and/or stop of motion.

      (8) Discussion: the origin of the gait-modulation as well as the possible mechanisms driving the alterations observed in 6-OHDA mice should be discussed in more detail.

      Our Discussion section includes the following paragraph speculating on the origin of gait modulation: “Movement-related neural activity is widespread in many brain areas, and it is plausible that the striatum receives both motor and sensory signals involved in gait generation. For example, the primary motor cortex, which projects to dorsal striatum, has been shown to exhibit rhythmic spiking activity consistent with gait phase coding (Armstrong & Drew 1984), suggesting a shared mechanism underlying the production of this code.” We appreciate the request to also discuss the possible mechanisms driving the alterations in 6OHDA mice. But this is a very complex topic which our study is not aimed at addressing. The range of possible mechanisms uncovered in the literature is vast – from synaptic changes in striatal microcircuits, to altered intrinsic excitability of D1/D2 MSNs, and network-level alterations. Therefore, we preferred to keep the discussion focused on gait and movement coding.

      Reviewer #2 (Public Review):

      Summary:

      Yang et al. recorded the activity of D1- and D2-MSNs in the dorsal striatum and analyzed their firing activity in relation to single-limb gait in normal and 6-OHDA lesioned mice. Although some of the observations of striatal encoding are interesting, the novelty and implications of this firing activity in relation to gait behavior remain unclear. More specifically, the authors made two major claims. First, the striatal D1- and D2-MSNs were phase-locked to the walking gait cycles of individual limbs. Second, dopamine lesions led to enhanced phase-locking between D2-MSN activity and walking gait cycles. The second claim was supported by the increase of vector length in D2-MSNs after unilateral 6-OHDA administration to the medial forebrain bundle. However, for the first claim, the authors failed to convincingly demonstrate that striatal MSNs were more phase-locked to gait with single-limb and step resolution than to the global gait cycles.

      We thank the reviewer for their feedback and for their comment that “the authors failed to convincingly demonstrate that striatal MSNs were more phase-locked to gait with single-limb and step resolution than to the global gait cycles.” We now present new analysis demonstrating that neurons are more phase-locked to single-limb gait rather than multiple limbs (Figure 2 – figure supplement 1, panels A-C). These results are discussed in detail in response to Reviewer #1’s first comment. For conciseness we will not repeat the same response here but instead refer the reviewer to Reviewer #1, comment #1.

      Strengths:

      It is a technically advanced study.

      Weaknesses:

      (1) The authors focused on striatal encoding of gait information in current studies. However, it remains unclear whether the part of the striatum for which the authors performed neuronal recording is really responsible for or contributing to gait control. A lesion or manipulation experiment disrupting the part of the striatum recorded seems a necessary step to test or establish its relationship to gait control.

      We agree that our study – like many others which employ recordings – is largely correlative, and that a direct causal relationship was lacking. We have therefore decided to present some data which, despite some caveats, shows that the striatum is in principle capable of altering gait performance (Figure 6 – figure supplement 2).

      Author response image 5.

      Optogenetic activation of D2 MSNs alters whole-body movement and single-limb gait.

      These new results are from healthy mice (n=4) receiving optogenetic stimulation of D2 MSNs over a 5 minute period. Panels A-E show changes in a variety of whole-body measures of motion, mostly replicating the results of Kravitz & Kreitzer 2010. Panels F-I show changes (statistically significant or trending) in a variety of gait parameters, with the greatest effects found on the single-limb stride duration and stride speed. Interestingly, Kravitz & Kreitzer 2010 actually examined effects of this stimulation on gait; quoting from their paper: “we examined gait parameters in D1-ChR2 and D2-ChR2 mice in response to illumination, using a treadmill equipped with a high-speed camera. We quantified multiple gait parameters with the laser on and off, and found no significant differences in the average or variance of stride length, stance width, stride frequency, stance duration, swing duration, paw angle and paw area on belt for either line….This indicates that activation of direct and indirect pathways in the dorsomedial striatum regulates the pattern of motor activity, without changing the coordination of ambulation itself.” We wonder therefore if the reviewer’s comment about causality may have stemmed from the negative result in Kravitz & Kreitzer 2010. In any event, we now present results which firmly show a link between striatal D2 MSNs and gait. To be clear, we are not claiming that Kravitz & Kreitzer’s study was fundamentally flawed, but that perhaps their ability to resolve gait changes using a commercial treadmill system, or their choice of dorsomedial as opposed to more lateral regions of the striatum may have contributed to the negative result.

      It is also important to acknowledge a limitation of our optogenetic stimulation experiment. Our optical stimulation was not phase-locked to the gait cycle; thus, technically, we did not address whether the phase code per se is involved in producing gait. We mention this caveat in the manuscript. Despite this, we believe the new data address the reviewer’s concern about lack of causality.

      (2) The authors attributed one of the major novelties to phase-locking of striatal neural activities with single-limb gait cycles. The claim was not clearly supported, as the authors did not demonstrate that phase-locking to single-limb gaits was more significant than phase-locking to global walking gait cycles. In rhythmic walking, the LR and RF limbs were roughly anti-phase with the LF and RR limbs (Fig. 1D, E). In line with this relationship, striatal neurons were mainly in-phase with LR and RF limbs and anti-phase with LF and RR limbs (Fig. 2J, K). One could instead interpret this as the striatal neurons spanned all the phases of the global walking gait cycles (Fig. 3D). To demonstrate phase-locking with individual limb movements, the authors need to show that neural activities were better correlated with a specific limb than to the global gait cycles.

      We sincerely appreciate the reviewer’s comment. As described above we now present new analysis demonstrating that neurons are more phase-locked to single-limb gait rather than multiple limbs (Figure 2 – figure supplement 1, panels A-C). These results are discussed in detail in response to Reviewer #1’s first comment. For conciseness we will not repeat the same response here but instead refer the reviewer to Reviewer #1, comment #1.

      (3) The observation of the enhancement of coupling between D2 MSN firing and the gait cycles was interesting, but the physiological interpretation was not clear (as the authors also noted in the Discussion), which hampers the significance of the observation.

      In the Discussion we comment on the potential behavioral significance of our findings, keeping in mind the reviewer’s earlier concern about the correlative nature of recordings. For example, we speculate that the increase in D2 MSN limb phase-locking strength contributes to bradykinetic symptoms, specifically the production and maintenance of a normal gait cycle and rhythm. We respectfully disagree with the reviewer about the limited significance of the observations, as this is the first study to describe striatal gait phase coding in detail, noting that gait impairments are a major motor symptom in PD. We believe that progress in better understanding and eventually treating PD will be made through a combination of correlative observations (i.e., neural recordings) and causal manipulations. There are both advantages and disadvantages to correlative as well as causal experiments.

      (4) Due to the lack of causality experiments as mentioned in the first comment above, the observations of coupling between striatal neuronal activity and gait control might well result from a third brain region/factor serving as the common source to both, whether in normal or dopamine lesioned brain. If this is the case, the significance and implications of current findings will be greatly limited.

      As mentioned above we have included new data to address this concern (Figure 6 – figure supplement 2). Please refer to Reviewer #2, comment #4 for a detailed discussion of these results and their caveats.

      Reviewer #3 (Public Review):

      In this study, Yang et al. address a fundamental question of the role of dorsal striatum in neural coding of gait. The authors study the respective roles of D1 and D2 MSNs by linking their balanced activity to detailed gait parameters. In addition, they put in parallel the striatal activity related to whole-body measures such as initiation/cessation of movement or body speed. They are using an elegant combination of high-resolution single-limb motion tracking, identification of bouts of movements, and electrophysiological recordings of striatal neurons to correlate those different parameters. Subpopulations of striatal output neurons (D1 and D2 expressing neurons) are identified in neural recordings with optogenetic tagging. Those complementary approaches show that a subset of striatal neurons have phase-locked activity to individual limbs. In addition, more than a third of MSNs appear to encode all three aspects of motor behavior addressed here, initiation/cessation of movement, body speed, and gait. This activity is balanced between D1 and D2 neurons, with a higher activity of D1 neurons only for movement initiation. Finally, alterations of gait, and the associated striatal activity, are studied in a mouse model of Parkinson's Disease, using 6-OHDA lesions in the medial forebrain bundle (MFB). In the 6OHDA mice, there is an imbalance toward D2 activity.

      Strengths:

      There is a long-standing debate on the respective role of D1 and D2 MSNs on the control of movement. This study goes beyond prior work by providing detailed quantification of individual limb kinematics, in parallel with whole-body motion, and showing a high proportion of MSNs to be phase-locked to precise gait cycle and also encoding whole-body motion. The temporal resolution used here highlights the preferential activity of D1 MSN at the movement starts, whereas previous studies described a more balanced involvement. Finally, they reveal neural mechanisms of dopamine depletion-induced gait alterations, with a preponderant phase-locked activity of D2 neurons. The results are convincing, and the methodology supports the conclusions presented here.

      Weaknesses:

      Some more detailed explanations would improve the clarity of the results in the corresponding section. Analysis of the 6OHDA experiments could be expanded to extract more relevant information.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) Panels I and J from Figure 6 are referred to in the text (line 158) but they don't exist.

      Thank you, we have corrected this in the text.

      (2) For the classification of striatal units into putative MSN, FS interneurons, and TANs, see Gage et al. DOI: 10.1016/j.neuron.2010.06.034 or Thorn et al. DOI: 10.1523/JNEUROSCI.1782-13.2014.

      As explained in the Public Reviews, Reviewer #1 comment #2 we opted not to perform an analysis by putative MSN, FSI, and TAN. We have performed analysis of different putative cell types in several of our other publications (e.g., Bakhurin & Masmanidis 2016, 2017; Lee & Masmanidis 2019). However, this study already relies on a more rigorous method – optogenetic tagging – to identify D1 and D2 MSNs. We felt that adding a second, more subjective and therefore less rigorous identification method based on spike waveforms would add unnecessary confusion in how the results are presented and interpreted. For example, we were unsure how to address the situation where an opto-tagged D1 or D2 MSN may be classified as a putative FSI or TAN according to spike waveform criteria. For this reason, we decided not to perform an analysis by putative MSN, FSI, and TAN. Finally, we have made all our electrophysiological data available should someone want to perform this analysis themselves.

      (3) The discussion section could be improved by elaborating on the origin and function of these gait signals in the striatum, as well as the mechanisms underlying changes in the 6-OHDA model. In addition, it would be important to discuss the limitations of this model, since unilateral 6-OHDA lesions may not accurately recapitulate parkinsonian gait deficits, as it results in a very asymmetric gait.

      Our Discussion section includes a paragraph speculating on the origin of gait modulation in the striatum, and another paragraph addressing the limitation that unilateral 6OHDA lesions induce gait asymmetry. We appreciate the request to also discuss the possible mechanisms driving the alterations in 6OHDA mice. But this is a very complex topic which our study is not aimed at addressing. The range of possible mechanisms uncovered in the literature is vast – from synaptic changes in striatal microcircuits, to altered intrinsic excitability of D1/D2 MSNs, and network-level alterations. Therefore, we preferred to keep the discussion focused on gait and movement coding.

      Reviewer #2 (Recommendations For The Authors):

      (1) The authors denoted the limb movement sequences as LR-LF-RR-RF, with limbs on the same left/right side moving first. However, considering multiple gait cycles, the sequence could also be described as RF-LR-LF-RR, with movements of the diagonal limbs temporally closer to each other, which was more intuitive from the visual inspection of Fig. 1D. The LR-LF-RR-RF denotation would make more sense if the authors could demonstrate that a walking bout almost always started from LR, as seen in the two examples in Fig. 1D.

      We designated the sequence as LR-LF-RR-RF to illustrate the lateral sequence pattern. But the reviewer is correct that a shifted version of this sequence, such as RF-LR-LF-RR, is also valid. We are not making any claim that the LR limb is always the first to move in a walking bout, but rather, that limbs on the same side of the body move one after the other, followed by the limbs on the opposite side. We have edited the text to hopefully clarify this point: “Mice walked with a lateral sequence gait pattern (e.g., LR-LF-RR-RF), with the limbs on the same side of the body moving one after the other, followed by movement of limbs on the opposite side (Figure 1E).”

      (2) The study identified a biased D1-MSN activation at movement initiation, which was not reported in previous studies that relied on measuring calcium dynamics. The authors attributed the difference to the temporal resolution of electrophysiological versus optic methods. The authors would probably notice that in some previous studies that relied also on optic-tagging and electrophysiological recordings, start/stop activity was not found to be different between direct and indirect pathway MSNs. The authors should discuss these studies and offer some possible explanations.

      This is an oversight on our part, and we thank the reviewer for noting this. We are aware of one such study (Jin & Costa 2014); we apologize if other studies were missed. The Discussion has been updated as follows to discuss this paper: “We also note that another study employing optogenetic tagging did not find significant D1/D2 MSN differences in start/stop activity (Jin & Costa 2014). However, the movement being measured was an instrumental action (reward-guided lever pressing), as opposed to self-initiated motion examined in our work. This suggests either that imbalances between D1 and D2 MSN start activity may be more pronounced under specific behavioral conditions, or that results vary depending on how movement initiation and cessation events are identified.”

      (3) The authors could add some denotations to the peak firing rates in Fig. 3D to aid visualization, so that readers could get a sense of the distribution of neurons preferring each phase of the movements.

      We appreciate this suggestion. We tried adding various colored lines to denote the peak firing rates, but ultimately, we felt the lines were not helpful and potentially deleterious for some readers. We thus decided not to add any lines to the plot.

      (4) Although the relative strength of D1/D2-MSN coding of body speed and movement cessation was found after dopamine lesion, it seemed that D1-MSNs cessation coding, as well as D1- and D2-MSN speed coding, were all altered after dopamine lesion (Fig. S3). The authors could mention these to avoid misunderstandings.

      We thank the reviewer for their observation. In the Results, we now mention that “while speed coding remained balanced between D1 and D2 MSNs, there was a substantial reduction in the speed coding score of both cell types after dopamine lesions.” The stop modulation index did not change appreciably.

      Reviewer #3 (Recommendations For The Authors):

      (1) A suggestion would be to put more emphasis in the title on the first parts of the study, i.e. detailed correlation between striatal activity and quantified motion, and not only focus on the dopamine depletion model.

      We considered other titles, but felt that our current choice is appropriate given that the study’s climax is with the dopamine lesion results in Figures 5 & 6.

      (2) The calculation and the significance of the vector length should be more detailed in the results as it is used all along as a measure of "the strength of neural entrainment to the gait cycle".

      We have added the following statement in the Results section to clarify the significance of vector length: “The vector length is a unitless parameter which can theoretically vary from 0 to 1, with 0 representing a neuron whose spikes occur at random limb phases, and 1 representing a neuron which always spikes at the same phase. Thus, higher vector length indicates a stronger entrainment of spiking activity to a specific limb phase.” For details on how vector length is calculated we refer readers to our Methods, specifically the section entitled “Gait phase coding analysis.”
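
      As an illustration of the vector length described above, here is a minimal sketch. It assumes the measure is the standard mean resultant length from circular statistics, with each spike assigned a limb phase in radians. The function name `vector_length` and the example phases are hypothetical; the authoritative procedure is the one in the Methods section cited above.

```python
import cmath
import math


def vector_length(spike_phases):
    """Mean resultant vector length of spike phases (in radians).

    Each spike is treated as a unit vector pointing at its limb phase.
    The length of the average vector ranges from 0 to 1.
    """
    if not spike_phases:
        raise ValueError("need at least one spike phase")
    resultant = sum(cmath.exp(1j * p) for p in spike_phases) / len(spike_phases)
    return abs(resultant)


# Hypothetical neuron that always spikes at the same limb phase.
locked = vector_length([math.pi / 2] * 50)

# Hypothetical neuron whose spikes are spread evenly over the gait cycle.
uniform = vector_length([2 * math.pi * k / 8 for k in range(8)])
```

      Under these assumptions, `locked` equals 1 and `uniform` equals 0 up to floating-point rounding, matching the two limiting cases in the quoted statement.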

      (3) There is no difference in the ipsi- or contralateral limbs while recordings are made only in the right hemisphere. Given that MSNs receive inputs from IT and PT neurons from the motor cortex, would it not be expected to have differences in the phase-locked activity to right versus left limbs? This is a question also with the dopamine depletion model which is performed with unilateral 6OHDA injections.

      We also wondered about this and were somewhat surprised by the lack of a contralateral bias in the phase locking vector length, as shown in Figure 2 – figure supplement 1D. We have two hypotheses as to why there is no ipsi/contra-lateral bias. First, it is possible that striatal neurons receive similar levels of synaptic input signaling ipsi/contra-lateral limb movements. Second, the strongly correlated motion of diagonally opposed limbs may give the appearance that neurons that are phase-locked to one limb (e.g., LF) are also locked to the diagonally opposite limb (i.e., RR). We see evidence of this diagonal limb coupling in Figure 2 – figure supplement 1B.

      (4) Among the 45% of striatal neurons that display significant phase-locking to at least one limb, it would be interesting to describe the % of neurons being phase-locked to several limbs and whether they are specific subtypes. Are there animals with more phase-locked cells in several limbs?

      This is indeed a very interesting and important point which relates to the major concern that “evidence supporting the conclusion that striatal neurons encode single-limb gait is incomplete.” As described above we now present new analysis demonstrating that neurons are more phase-locked to single-limb gait rather than multiple limbs (Figure 2 – figure supplement 1, panels A-C). These results are discussed in detail in response to Reviewer #1’s first comment. For conciseness we will not repeat the same response here but instead refer the reviewer to Reviewer #1, comment #1. With regard to whether there are specific subtypes, we performed the same analysis on optogenetically identified D1/D2 MSNs and found similar trends, but did not show these results in the manuscript to avoid redundancy.

      (5) The Venn diagram in Fig. 3C shows ~40% of striatal cells encoding body speed, single-limb and start/stop information. Nevertheless, this percentage is limited by the number of single-limb phase-locked cells as almost all have a firing rate related to body speed and start/stop signals. This could be discussed.

      This is a very interesting observation. Basically, the reviewer is noting that almost all the phase-locked cells also encode start/stop and/or speed. We have now updated the Discussion to specifically discuss this observation: “We found a different percentage of striatal neurons which encoded limb phase, movement initiation or cessation, and speed (Figure 3). Among these three categories, limb phase coding cells represented the smallest population with ~45% of neurons, as opposed to ~90% for start/stop or speed. In addition, nearly all phase coding cells were also significantly responsive to start/stop or speed, whereas a sizable proportion of start/stop or speed coding cells were not entrained to limb phase. It is unclear, however, whether these population size differences reflect a proportionally smaller role for the striatum in regulating single-limb gait as opposed to whole-body movement initiation, cessation or speed.”

      (6) D1/D2 analysis:

      For optogenetic identification of D1 and D2 neurons, 39 D1 neurons and 40 D2 neurons were extracted from the total of 274 recorded neurons while 222 neurons were optogenetically tagged according to the mat and meth. Were there any technical difficulties that made it difficult to identify more neurons?

      The low yield of optogenetic tagging is quite common in the literature due to the rigorous criteria which must be satisfied in order to qualify as a tagged neuron (e.g., Kvitsiani & Kepecs 2013). The number 222 neurons quoted in the methods reflects the entirety of optogenetically tagged neurons in this study. Our study contained 33 mice, thus the average number of tagged units per animal was 222/33 ~ 6.7 units/animal. This is actually comparable to or slightly better than the yield reported in some other striatal literature (see for example, Figure 1 of Ryan & Nelson 2018).

      It is mentioned that "a subset" of these were phase-locked to a single limb. It would be interesting to specify the exact percentage of those neurons for D1 and D2 populations.

      Phase-locking of D2 neurons seems less sharp than D1 neurons, with a lower firing rate (Fig. 4D), please comment. Also difference in vector length for LR while none for other limbs, why? There is a balanced activity of D1 and D2 MSNs during walking (speed) and single-limb movements, but more D1 MSNs active at movement initiation. Is it also true for stop signals? Are they separated based on the speed threshold of 20 mm/s?

      As mentioned above, our new analysis specifically examines the percentage of all neurons which are phase-locked to a single limb (Figure 2 – figure supplement 1, panels A-C). We have performed the same analysis on optogenetically tagged D1/D2 MSNs and found similar trends, but did not show these results in the manuscript to avoid redundancy. With regard to whether phase-locking of D2 is less sharp than D1 MSNs, the “sharpness” of phase-locking is characterized by the mean vector length, and we show that on average, the vector length is statistically the same for D1 and D2 MSNs in healthy mice (Figure 4F). The reviewer noted that the D2 vector length in Figure 4F appears visibly higher for LR while not for other limbs; however, this difference is not statistically significant. With regard to whether more D1 MSNs are active during movement cessation, we show that both sham and dopamine lesioned mice have similar levels of D1/D2 MSN activity during stop (Figure 6 – figure supplement 1, panels A & B). Details of how start, stop, and speed are calculated are provided in the Methods.

      The relationship between firing and body speed (Fig. 4H) displays differences between D1 and D2. If a speed inferior to 20 mm/s, corresponds to "start or stop signal" as mentioned in the mat and meth, then early difference would correspond to start, but still there is a difference between 20 and 100 mm/s and after 150 mm/s. These results should be commented on.

      The reviewer is correct that in the plot of firing rate vs body speed (Figure 4J), there visibly appears to be a difference between D1 and D2 MSNs at low speeds. However, according to our pre-determined measure of speed coding which relies on the correlation coefficient between firing rate and speed, D1 and D2 MSNs have similar speed coding indices. Since there is a precedent for using the correlation coefficient to quantify speed coding (Fobbs & Kravitz 2020; Kropff & Moser 2015), we prefer to stick with this measure despite some caveats. Furthermore, the apparent difference between D1 and D2 MSNs in Figure 4J is not seen in either sham or dopamine lesioned mice (Figure 6 – figure supplement 1, panels D & E). Taken together, we do not believe the apparent speed coding difference in Figure 4J rises to the level of a consistent result.
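
      To illustrate why a visible offset between two firing-rate curves need not change a correlation-based speed coding measure, here is a minimal sketch. It assumes the index is a Pearson correlation between binned firing rate and body speed. The function name `speed_coding_index`, the binning, and all numbers are hypothetical and are not taken from the study.

```python
import math


def speed_coding_index(firing_rates, speeds):
    """Pearson correlation between firing rate and body speed."""
    n = len(firing_rates)
    if n != len(speeds) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_r = sum(firing_rates) / n
    mean_s = sum(speeds) / n
    cov = sum((r - mean_r) * (s - mean_s) for r, s in zip(firing_rates, speeds))
    var_r = sum((r - mean_r) ** 2 for r in firing_rates)
    var_s = sum((s - mean_s) ** 2 for s in speeds)
    if var_r == 0 or var_s == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_r * var_s)


# Hypothetical speed bins (mm/s) and mean firing rates (Hz) for two cells.
speeds = [20, 60, 100, 140, 180]
cell_a = [2, 4, 6, 8, 10]
cell_b = [5, 7, 9, 11, 13]  # same slope as cell_a, shifted up by 3 Hz

index_a = speed_coding_index(cell_a, speeds)
index_b = speed_coding_index(cell_b, speeds)
```

      In this toy case the two cells differ in firing rate at every speed, yet `index_a` and `index_b` are identical, because the correlation coefficient is insensitive to a constant offset in firing rate.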

      (7) The timing of normalized firing rate in relation to start/stop signals might be also quite interesting to comment on. D1 neurons have stronger activation for start signals and it seems that it is also earlier, with D2 activated after the onset of the movement (Fig. 4G).

      We appreciate the observation that D1 neurons appear to fire a little earlier than D2 neurons in Figure 4I. However, this did not rise to the level of a statistically significant result by our attempted quantitative analysis (not shown). Furthermore, the earlier timing of D1 is not apparent in sham lesioned animals in Figure 6I, thus overall we cannot make any confident statements about earlier timing of D1 start signals.

      In dopamine lesion experiments, in sham mice, it seems that both D1 and D2 have higher activity after the onset of the movement and that the peak of D2 activity is earlier (Fig. 6G). In 6OHDA mice, both peaks are after the onset of the movement although they are much less clearly defined.

      Both peaks become less sharp after 6OHDA lesions, but in terms of amplitude the main effect is a reduction in the D1 start signal. This is reflected in the reduced D1 start modulation index whereas the D2 index remains relatively constant.

      (8) 6OHDA model displays much fewer walking bouts with lower speed and initiation rate. It would be important to include in the figure a similar representation to Fig.1 with distributions of stride frequency, duration, and length to illustrate the difference between control and 6OHDA mice. On average, how many walking bouts were analyzed in control and 6OHDA animals?

      We have added new data similar to Figure 1 with distributions of stride frequency, duration, and length to illustrate the difference between sham and 6OHDA mice (Figure 5 – figure supplement 1, panels B & C). We also added the following information on the number of walking bouts: “The mean number of walking bouts per session was reduced from 124 ± 42 in sham to 47 ± 19 in dopamine lesioned mice (mean ± SD).”

      The initiation rate is particularly low in 6OHDA animals, 3-4 per minute, did the authors make longer behavioral recordings to extract enough initiation/stop signals for neural correlation analysis?

      All of our recordings were of the same duration (30 minutes). This duration was pre-determined at the beginning of the study to ensure consistency.

      The stride length seems smaller on the right limbs in 6OHDA mice and vector length in D2 neurons as well, while there is no change in D1 neurons. Is it a significant effect? If yes, it would be important to comment on this.

      The ANOVA test in those figures was not designed to perform post-hoc multiple comparisons between different limbs. However, if one changes the ANOVA design then the effect for stride length is significant. This is probably related to the ipsiversive turning bias in the unilateral 6OHDA lesion model. Though we have not changed the ANOVA design, in the Discussion we do comment on the shorter stride length on the right limbs in 6OHDA mice in Figure 5G. There is no significant difference in D2 vector length between different limbs.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Debeuf et al. introduce a new, fast method for the selection of suitable T cell clones to generate TCR transgenic mice, a method claimed to outperform traditional hybridoma-based approaches. Clone selection is based on the assessment of the expansion and phenotype of cells specific for a known epitope following immune stimulation. The analysis is facilitated by a new software tool for TCR repertoire and function analysis termed DALI. This work also introduces a potentially invaluable TCR transgenic mouse line specific for SARS-CoV-2.

      Strengths:

      The newly introduced method proved successful in the quick generation of a TCR transgenic mouse line. Clone selection is based on more comprehensive phenotypical information than traditional methods, providing the opportunity for a more rational T cell clone selection.

      The study provides a software tool for TCR repertoire analysis and its linkage with function.

      The findings entail general practical implications in the preclinical study of a potentially very broad range of infectious diseases or vaccination.

      A novel SARS-CoV-2 spike-specific TCR transgenic mouse line was generated.

      Weaknesses:

      The authors attempt to compare their novel method with a more conventional approach to developing TCR transgenic mice. In this reviewer's opinion, this comparison appears imperfect in several ways:

      (1) Work presenting the "traditional" method was inadequate to justify the selection of a suitable clone. It is therefore not surprising that it yielded negative results. More evidence would have been necessary to select clone 47 for further development of the TCR transgenic line, especially considering the significant time and investment required to create such a line.

      Based on Supplementary Figure 1A only, we understand the concern of the reviewer. However, the data presented in Supplementary Figure 1A were collected during the first rough screening of clones, where only the production of IL-2 and IFN-γ was measured as a readout for activation. Thereafter, a large selection of responsive clones was further grown and co-cultured with a dose-titration of the antigenic peptide pool. In this second co-culture, flow cytometry readouts such as CD69 expression were also included (as shown in Supplementary Figure 1B). Finally, a narrower selection of responder clones was co-cultured with the different individual peptides to unravel the specificity of the TCR of the clone. In conclusion, the clone was tested at least three times in three distinct set-ups with multiple different readouts.

      However, a good evaluation of a clone in an in vitro setting does not necessarily translate into optimal functioning of the cells in a biological context. For instance, some clones survive better in an in vitro setting than others or already have a more activated profile before stimulation.

      (2) The comparison is somewhat unfair, because the methods start at different points: while the traditional method was attempted using a pool of peptides whose immunogenicity does not appear to have been established, the new method starts by utilising tetramers to select T cells specific for a well-established epitope.

      Given the costs and time involved, only a single clone could be tested for either method, intrinsically making a proper comparison unfeasible. Even for their new method, the authors' ability to demonstrate that the selected clone is ideal is limited unless they made different clones with varying profiles to show that a particular profile was superior to others.

      In my view, there was no absolute need to compare this method with existing ones, as the proposed method holds intrinsic value.

      We acknowledge the importance of the well-established hybridoma technology and in no way intended to compare these methods head-to-head, nor do we want to question the validity of the classical methods. The reason why we also wanted to show the failed CORSET8 mouse was to highlight the parts of the TCR generating process which could be rationalized. We again want to emphasize that we do not want to compare methods in any way and recognise that we started from two different bases in terms of clone selection (peptide pool stimulation versus tetramer staining). While the tetramer staining that was employed in the generation of CORSET8 mice allowed us to enrich the samples for specific responder clones, this enrichment step is not an absolute requirement for the implementation of the presented method or for the successful generation of a TCR Tg mouse model. An alternative approach could be to use the described method to select for activated and expanded clones upon immunisation and test their reactivity in subsequent steps using peptide stimulation before selecting a receptor. In conclusion, we merely wish to present a novel roadmap for others to use for the generation of their TCR Tg mouse to aid in the selection of the most preferable clone for their purposes.

      (3) While having more data to decide on clone selection is certainly beneficial, given the additional cost, it remains unclear whether knowing the expression profiles of different proteins in Figure 2 aids in selecting a candidate. Is a cell expressing more CD69 preferable to a cell expressing less of this marker? Would either have been effective? Are there any transcriptional differences between clonotype 1 and 2 (red colour in Figure 2G) that justify selecting clone 1, or was the decision to select the latter merely based on their different frequency? If all major clones (i.e. by clonotype count) present similar expression profiles, would it have been necessary to know much more about their expression profiles? Would TCR sequencing and an enumeration of clones have sufficed, and been a more cost-effective approach?

      The method we present in the paper serves as a proof-of-concept, to be adapted to the researcher’s own needs. We agree with the reviewer that for our intentions with the CORSET8 mice, TCRseq in combination with an enumeration of the clones could also have sufficed and would lower the cost of sequencing. However, we wish to present a roadmap for others to use for the generation of their TCR Tg mouse. Importantly, the cellular phenotype and activation state can be taken into consideration, which might be essential for some projects.

      Nonetheless, we do see clear interclonal differences regarding the expression of “activation” genes, where clone 1 is clearly one of the well-activated and interferon-producing clones (as shown in Author response image 1). As such, researchers could expand these types of analysis to probe for specific phenotypes or characteristics.

      Author response image 1.

      (4) Lastly, it appears that several of the experiments presented were conducted only once. This information should have been explicitly stated in the figure legends.

      To control for interexperimental variation, every experiment represented in the manuscript has been performed at least two times. We have added the additional information regarding the experimental repetitions and groups in the figure legends.

      Reviewer #2 (Public Review):

      Summary:

      The authors seek to use single-cell sequencing approaches to identify TCRs specific for the SARS CoV2 spike protein, select a candidate TCR for cloning, and use it to construct a TCR transgenic mouse. The argument is that this process is less cumbersome than the classical approach, which involves the identification of antigen-reactive T cells in vitro and the construction of T cell hybridomas prior to TCR cloning. TCRs identified by single-cell sequencing that are already paired to transcriptomic data would more rapidly identify TCRs that are likely to contribute to a functional response. The authors successfully identify TCRs that have expanded in response to SARS CoV2 spike protein immunization, bind to MHC tetramers, and express genes associated with functional response. They then select a TCR for cloning and construction of a transgenic mouse in order to test the response of resulting T cells in vivo following immunization with spike protein or coronavirus infection.

      Strengths:

      (1) The study provides proof of principle for the identification and characterization of TCRs based on single-cell sequencing data.

      (2) The authors employ a recently developed software tool (DALI) that assists in linking transcriptomic data to individual clones.

      (3) The authors successfully generate a TCR transgenic animal derived from the most promising T cell clone (CORSET8) using the TCR sequencing approach.

      (4) The authors provide initial evidence that CORSET8 T cells undergo activation and proliferation in vivo in response to immunization or infection.

      (5) Procedures are well-described and readily reproducible.

      Weaknesses:

      (1) The purpose of presenting a failed attempt to generate TCR transgenic mice using a traditional TCR hybridoma method is unclear. The reasons for the failure are uncertain, and the inclusion of this data does not really provide information on the likely success rate of the hybridoma vs single cell approach for TCR identification, as only a single example is provided for either.

      We refer to comments 2 and 3 of reviewer 1 for an answer to this point.

      (2) There is little information provided regarding the functional differentiation of the CORSET8 T cells following challenge in vivo, including expression of molecules associated with effector function, cytokine production, killing activity, and formation of memory. The study would be strengthened by some evidence that CORSET8 T cells are successfully recapitulating the functional features of the endogenous immune response (beyond simply proliferating and expressing CD44). This information is important to evaluate whether the presented sequencing-based identification and selection of TCRs is likely to result in T-cell responses that replicate the criteria for selecting the TCR in the first place.

      We agree with the reviewer that the data in the initial manuscript included only a limited in vivo functional validation of the CORSET8 T cells. Therefore, we extended these in vivo readouts and measured IFN-γ production, CD69 and T-bet expression (as a measure of activation) and Ki-67 expression (as an alternative readout to CTV for proliferation). In the single cell data, we saw that these markers were more pronounced in the selected clone compared to other clones. We could confirm these findings in vivo, and found a stronger induction of IFN-γ, CD69, T-bet and Ki-67 in CORSET8 T cells compared to endogenous CD45.2 cells and even Spike-Tetramer+ CD45.2 endogenous cells. We added these data in Figure 4.

      (3) While I find the argument reasonable that the approach presented here has a lot of likely advantages over traditional approaches for generating TCR transgenic animals, the use of TCR sequencing data to identify TCRs for study in a variety of areas, including cancer immunotherapy and autoimmunity, is in broad use. While much of this work opts for alternative methods of TCR expression in primary T cells (i.e. CRISPR or retroviral approaches), the process of generating a TCR transgenic mouse from a cloned TCR is not in itself novel. It would be helpful if the authors could provide a more extensive discussion explaining the novelty of their approach for TCR identification in comparison to other more modern approaches, rather than only hybridoma generation.

      By integrating the recent technological advances in single cell sequencing into the generation of TCR Tg mice, possibilities arise to rationalize clone selection regarding clonal size, lineage/phenotype and functional characteristics. Often, the selection process based on hybridoma selection yields multiple epitope-specific clones that upregulate CD69 or IL-2, and only minimal functional and phenotypic parameters are checked before prioritizing one clone to proceed with. In our experience, TCR clones selected in this way are sometimes unable to compete with endogenous polyclonal T cell clones in vivo. Taking all these caveats into account, the novelty we present here is that the researcher is fully able to select clones based on several layers of information without the need for extensive or repeated screening. Moreover, the selection of the TCR Tg clone can be done via the interactive and easily interpretable DALI tool. Owing to the browser-based interactive GUI, immunologists with limited coding experience can effectively analyse their complex datasets.
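
      As an illustration of selecting a clone on several layers of information at once, the following minimal Python sketch ranks clonotypes by clonal size and then by a mean activation score. It is not the DALI interface: the record keys (`clonotype`, `activation_score`), the function name `rank_clones` and the `min_cells` threshold are hypothetical names chosen for this example, and the toy values are illustrative only, not data from the study.

```python
from collections import defaultdict


def rank_clones(cells, min_cells=2):
    """Rank clonotypes by clonal size, then by mean activation score.

    `cells` is an iterable of dicts with the hypothetical keys
    'clonotype' (str) and 'activation_score' (float).
    """
    scores = defaultdict(list)
    for cell in cells:
        scores[cell["clonotype"]].append(cell["activation_score"])

    summary = [
        {
            "clonotype": clone,
            "size": len(vals),
            "mean_activation": sum(vals) / len(vals),
        }
        for clone, vals in scores.items()
        if len(vals) >= min_cells  # drop clones too small to count as expanded
    ]
    # Largest clones first; ties are broken by higher mean activation.
    summary.sort(key=lambda row: (-row["size"], -row["mean_activation"]))
    return summary


# Toy input (illustrative values only, not data from the study).
toy_cells = [
    {"clonotype": "clone_A", "activation_score": 0.9},
    {"clonotype": "clone_A", "activation_score": 0.7},
    {"clonotype": "clone_A", "activation_score": 0.8},
    {"clonotype": "clone_B", "activation_score": 0.4},
    {"clonotype": "clone_B", "activation_score": 0.2},
    {"clonotype": "clone_C", "activation_score": 1.0},
]
ranked = rank_clones(toy_cells)
```

      In this sketch a singleton clone is filtered out even if its activation score is high, which mirrors the idea that clonal expansion and phenotype are weighed together rather than separately.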

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Regarding Supplementary Figure 1A, was the experiment conducted more than once? Clone 47 seems minimally superior to the other clones. Incorporating a positive control, such as the response of the OT-I hybridoma to SIINFEKL, could have provided a benchmark to gauge the strength of the observed responses.

      Also, what was the concentration of the peptide used to restimulate the T cells in vitro? High peptide concentrations can lead to non-specific responses. Ideally, a titration should have been performed, perhaps in a subsequent experiment that only tested those clones that responded well initially. Given the resources required to create and maintain a transgenic mouse line, proceeding with the chosen clone based on the data presented seems to carry considerable risk.

      The experiment has been performed three times. The data presented in Supplementary Figure 1A were collected during the first rough screening of clones, where only the production of IL-2 and IFN-γ was measured as a readout for activation. Thereafter, a large selection of responsive clones was further grown and co-cultured with a dose-titration of the antigenic peptide pool. In this second co-culture, flow cytometry readouts such as CD69 expression were also included (as shown in Supplementary Figure 1B). Finally, a narrower selection of responder clones was co-cultured with the different individual peptides to unravel the specificity of the TCR of the clone. In conclusion, the clone was tested at least three times in three distinct set-ups with multiple different readouts.

      In Supplementary Figure 1C, no response to stimulation was detected. Ideally, this figure should have included a positive control, such as PMA/Ionomycin or aCD3/CD28 stimulation.

      We agree with the reviewer that this experiment should have included a positive control to validate the non-specific responsiveness of the clone and the technical feasibility of the experiment. Unfortunately, the initial CORSET8 line is frozen and is thus not easily available to repeat the experiment.

      Can the authors clarify their gating strategy in the legend of Supplementary Figure 1D?

      Plotted cells are non-debris > single cells > viable cells > CD45+. We have added the information to the legend of Supplementary Figure 1D.
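
      The gating hierarchy above can be read as a chain of filters, each applied only to the events that passed the previous gate. The Python sketch below illustrates that reading. The event fields (`fsc_a`, `fsc_h`, `viability_dye`, `cd45`) and all thresholds are hypothetical placeholders and not the authors' cytometer settings; only the order of the gates is taken from the text.

```python
def apply_gates(events, gates):
    """Apply gates in order; return the surviving events and per-gate counts."""
    remaining = list(events)
    counts = []
    for name, keep in gates:
        remaining = [event for event in remaining if keep(event)]
        counts.append((name, len(remaining)))
    return remaining, counts


# Hypothetical gate definitions; thresholds are placeholders for illustration.
GATES = [
    ("non-debris", lambda e: e["fsc_a"] > 50_000),
    ("single cells", lambda e: 0.8 <= e["fsc_h"] / e["fsc_a"] <= 1.2),
    ("viable cells", lambda e: e["viability_dye"] < 1_000),
    ("CD45+", lambda e: e["cd45"] > 2_000),
]

# Toy events, one failing each gate in turn and one passing all gates.
toy_events = [
    {"fsc_a": 20_000, "fsc_h": 19_000, "viability_dye": 100, "cd45": 5_000},
    {"fsc_a": 120_000, "fsc_h": 60_000, "viability_dye": 100, "cd45": 5_000},
    {"fsc_a": 90_000, "fsc_h": 88_000, "viability_dye": 9_000, "cd45": 5_000},
    {"fsc_a": 90_000, "fsc_h": 88_000, "viability_dye": 100, "cd45": 300},
    {"fsc_a": 95_000, "fsc_h": 93_000, "viability_dye": 150, "cd45": 8_000},
]
plotted, counts = apply_gates(toy_events, GATES)
```

      Reporting the per-gate counts alongside the final population makes it easy to see how many events each step removes.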

      In Figure 2, the figure legend should provide more detail on which cells were sorted for the single-cell RNA sequencing analysis. The materials and methods section explains that cells were stained for CD44. Were activated cells then sorted (either tetramer-positive or -negative), plus naïve CD8 T cells from a naïve mouse?

      Supplementary Figure 2 contains the detailed gating strategy during the sort for the single cell experiment. We have added additional red gates to the plots to clarify which samples were sent for sequencing. This has been adapted in the figure legends of both Figure 2 and Supplementary Figure 2. 

      In Figure 3, Rag1 sufficient transgenic mice display similar numbers of CD4 and CD8 T cells as WT mice in the spleen. Typically, transgenic mice present skewed frequencies of T cells towards the type generated (CD8 in this case), which the authors only found in the thymus of CORSET8 mice. Could this be discussed?

      The comment of the reviewer is valid, as there is indeed a skewing towards CD8 T cells in the thymi of the CORSET8 mice. We looked back into the data of the experiments and noticed that poor resolution of some markers might have led to inaccurate results. We have repeated the experiment and added another T cell marker (TCRbeta) next to the already included CD3e marker. By including both markers, we were able to show that the skewing towards the CD8 T cell phenotype is also present in the spleen.

      How many repetitions were performed for the experiments in Figures 3D and 3E? How many mice were analyzed for Figure 3E? Please provide this information in the figure legend. Also, include a proper quantification and statistical analysis of the data shown.

      New quantification graphs with statistical analysis have been added to Figure 3E. The accompanying figure legend has been adapted. The co-culture displayed in Figure 3D is a representative experiment of two repetitions.

      Figure 4C includes 3-4 mice per group. This experiment should have been replicated, and this information should be indicated in the figure legend.

      We apologise for omitting this information from the figure legend. The experiment presented in Figure 4A-C has been repeated twice, yielding results following the same trend. We were unable to pool the data as two different proliferation dyes were used in the separate experiments (CFSE and CTV). Furthermore, in the in vivo BSL3 experiments represented in Figure 4E-H, we always included the Spike/CpG group as a positive control. We have added the additional information regarding the experimental repetitions and groups to the figure legend.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews: 

      Reviewer #1 (Public Review):

      Aging is associated with a number of physiologic changes including perturbed circadian rhythms. However, mechanisms by which rhythms are altered remain unknown. Here authors tested the hypothesis that age-dependent factors in the sera affect the core clock or outputs of the core clock in cultured fibroblasts. They find that both sera from young and old donors are equally potent at driving robust ~24h oscillations in gene expression, and report the surprising finding that the cyclic transcriptome after stimulation by young or old sera differs markedly. In particular, genes involved in the cell cycle and transcription/translation remain rhythmic in both conditions, while genes associated with oxidative phosphorylation and Alzheimer's Disease lose rhythmicity in the aged condition. Also, the expression of cycling genes associated with cholesterol biosynthesis increases in the cells entrained with old serum. Together, the findings suggest that age-dependent blood-borne factors, yet to be identified, affect circadian rhythms in the periphery. The most interesting aspect of the paper is that the data suggest that the same system (BJ-5TA), may significantly change its rhythmic transcriptome depending on how the cells are synchronized. While there is a succinct discussion point on this, it should be expanded and described whether there are parallels with previous works, as well as what would be possible mechanisms for such an effect.

      We’ve expanded our discussion in the manuscript to discuss possible mechanisms and also how the genes/pathways implicated in our study relate to other aging literature.  

      Major points: 

      Fig 1 and Table S1. Serum composition and levels of relevant blood-borne factors probably change in function of time. At what time of the day were the serum samples from the old and young groups collected? This important information should be provided in the text and added to Table S1. 

      We made sure to highlight the collection time in the abstract of the manuscript: “We collected blood from apparently healthy young (age 25-30) and old (age 70-76) individuals at 14:00¹ and used the serum to synchronize cultured fibroblasts.” The time of blood draw is also stated in the Introduction and Methods. Since Table S1 contains demographic information, we did not think that the blood draw time fit best there, but hopefully it is now clear in the text.

      Fig 2A. Luminescence traces: the manuscript would greatly benefit from inclusion of raw luminescence traces.

      Raw luminescence traces have been added to Figure S3 (S3A).

      Fig 2. Of the many genes that change their rhythms after stimulation with young and old sera, what are the typical fold changes? For example, it would be useful to show histograms for the two groups. Does one group tend to have transcript rhythms of higher or lower fold changes? 

      We’ve presented these data in Figure S5. There are a few significant differences, but largely the groups are similar in terms of fold change.

      Fig. 2 Gene expression. Also here, the presentation would benefit from showing a few key examples for different types of responses. 

      Sample traces of genes that gain rhythmicity, lose rhythmicity, phase shift, and change MESOR are now illustrated in Figure S6.

      What was the rationale to use these cells over the more common U2OS cells? Are there similarities between the rhythmic transcriptomes of the BJ-5TA cells and that of U2OS cells or other human cells? This could easily be assessed using published datasets. 

      The original rationale to use BJ-5TA fibroblast cells was that we were aiming to build upon an observation found in a previous study² which showed that circadian period changes with age in human fibroblasts. While our findings did not match theirs, we think an added benefit of using the BJ-5TA line is that, unlike U2OS cells, it is not a carcinoma-derived cell line. We’ve added this point in lines 98-101.

      Our study finds many more rhythmic transcripts compared to the previous studies examining U2OS cells. This can be attributed to several factors including differences in methods, including the use of human serum in our study, cell type differences, or decoupling of rhythms in some cancer cells. While a comparison of BJ-5TA cells and U2OS cells could be interesting, a proper comparison requires investigation of many data sets, since any pair of BJ-5TA and U2OS data sets will most likely differ in some detail of experimental design or data processing pipeline, which could contribute to observed differences in rhythmic transcripts.

      That being said, we compared clock reference genes (see Author response image 1) between BJ-5TA and U2OS cells, comparing circadian profiles obtained from our data with those available on CircaDB. These circadian profiles exhibit many similarities and a few differences. The peak to trough ratios (amplitudes) are quite similar for ARNTL, NR1D1, NR1D2, PER2, PER3, and are about 25% lower for CRY1 and somewhat higher for TEF (about 15%) in our data. We find that the MESORs are generally similar, with the exception of NR1D1, which is much lower, and NR1D2, which is much higher in our data.

      Author response image 1.

      BJ-5TA and U2OS Cells Exhibit Similar Profiles of Circadian Gene Transcription. We compared the transcriptomic profiles of the BJ-5TA cells in young and old serum (left) to the U2OS transcriptomic data (right) available on CircaDB, a database containing profiles of several circadian reference genes in U2OS cells. This figure suggests that circadian profiles of these genes exhibit many similarities. We find that the peak to trough ratios (amplitudes) are similar for ARNTL, NR1D1, NR1D2, PER2, PER3, and that the MESORs are similar (with the exception of NR1D1, which is much lower, and NR1D2, which is much higher in the BJ-5TA cells). We find that the amplitude of CRY1 is ~25% lower and that of TEF is ~15% higher for the BJ-5TA cells. The axes of the plots on the left show counts divided by 3.5 in order to make the MESORs of ARNTL similar and ease comparison.
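
      For readers less familiar with the quantities compared above, the sketch below shows one common way to estimate a MESOR and a peak-to-trough ratio: a single-component cosinor fit, y = M + β·cos(ωt) + γ·sin(ωt), solved by ordinary least squares. This is an assumption made for illustration, since the response does not state which fitting procedure was used; the function names (`solve_3x3`, `cosinor_fit`) and the synthetic test signal are likewise hypothetical.

```python
import math


def solve_3x3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def cosinor_fit(times_h, values, period_h=24.0):
    """Least-squares fit of y = mesor + beta*cos(w*t) + gamma*sin(w*t)."""
    w = 2.0 * math.pi / period_h
    rows = [(1.0, math.cos(w * t), math.sin(w * t)) for t in times_h]
    # Normal equations: (X^T X) p = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, values)) for i in range(3)]
    mesor, beta, gamma = solve_3x3(xtx, xty)
    amplitude = math.hypot(beta, gamma)
    # beta*cos(wt) + gamma*sin(wt) = A*cos(wt - phi) with phi = atan2(gamma, beta),
    # so the fitted peak falls at t = phi / w.
    acrophase_h = (math.atan2(gamma, beta) / w) % period_h
    if mesor > amplitude:
        peak_to_trough = (mesor + amplitude) / (mesor - amplitude)
    else:
        peak_to_trough = math.inf
    return {
        "mesor": mesor,
        "amplitude": amplitude,
        "acrophase_h": acrophase_h,
        "peak_to_trough": peak_to_trough,
    }


# Synthetic signal (not study data): MESOR 100, amplitude 20, peak at 6 h.
times = list(range(0, 48, 4))
signal = [100 + 20 * math.cos(2 * math.pi * (t - 6) / 24) for t in times]
fit = cosinor_fit(times, signal)
fit_scaled = cosinor_fit(times, [v / 3.5 for v in signal])
```

      Dividing all counts by a constant, as done for the left-hand plots, rescales the MESOR and the cosine amplitude by the same factor, so the peak-to-trough ratio is unchanged; `fit` and `fit_scaled` therefore report the same ratio.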

      For the rhythmic cell cycle genes, could this be the consequence of the serum which synchronizes also the cell cycle, or is it rather an effect of the circadian oscillator driving rhythms of cell cycle genes? 

      This is an interesting point. Given our previous data showing that the cell cycle gene cyclin D1 is regulated by clock transcription factors³, we believe the circadian oscillator drives, or at least contributes to, rhythms of cell cycle genes. However, the serum clearly makes a difference as we find that MESORs of cell cycle genes decrease with aged serum. This is consistent with the decreased proliferation previously observed in aged human tissue⁴.

      While the reduction of rhythmicity in the old serum for oxidative phosphorylation transcripts is very interesting and fits with the general theme that metabolic function decreases with age, it is puzzling that the recipient cells are the same, but it is only the synchronization by the old and young serum that changes. Are the authors thus suggesting that decrease of metabolic rhythms is primarily a non cell-autonomous and systemic phenomenon? What would be a potential mechanism? 

      We are indeed suggesting this, although it is also possible that it is not cycling per se, but rather an overall inefficiency of oxidative phosphorylation that is conveyed by the serum. Relating other work in the field to our findings, we’ve added the following to our discussion: “Previous work in the field demonstrates that synchronization of the circadian clock in culture results in cycling of mitochondrial respiratory activity⁵,⁶ further underscoring the different effects of old serum, which does not support oscillations of oxidative phosphorylation associated transcripts. Age-dependent decrease in oxidative phosphorylation and increase in mitochondrial dysfunction⁷ has been seen in aged fibroblasts⁸ and contributes to age-related diseases⁹. We suggest that the age-related inefficiency of oxidative phosphorylation is conferred by serum signals to the cells such that oxidative phosphorylation cycles are mitigated. On the other hand, loss of cycling could contribute to impairments in mitochondrial function with age.”

      The delayed shifts after aged serum for clock transcripts (but not for Bmal1) are interesting and indicate that there may be a decoupling of Bmal1 transcript levels from the other clock gene phases. How do the authors interpret this? Could it be related to altered chronotypes in the elderly?

      One possible explanation is that the delay of NPAS2, BMAL1’s binding partner, results in the delay of the transcription of clock-controlled genes/negative arm genes. Since the RORs do not seem to be affected, Bmal1 is transcribed/translated as usual, but there isn’t enough NPAS2 to bind with BMAL1. In this case downstream genes are slower to transcribe, causing the phase delay.

      Reviewer #2 (Public Review): 

      Schwarz et al. have presented a study aiming to investigate whether circulating factors in the sera of subjects are able to synchronize the circadian rhythms of fibroblasts in an age-dependent manner. The authors used human serum taken from either old (age 70-76) or young (age 25-30) individuals to synchronise cultured fibroblasts containing a clock gene promoter-driven luciferase reporter, followed by RNA sequencing to investigate whole gene expression.

      This study has the potential to be very interesting, as evidence of circulating factors in sera that mediate peripheral rhythms has long been sought after. Moreover, those factors may be affected by age, which could contribute to the weakened circadian rhythmicity observed with aging.

      Here, the authors concluded that both old and young sera are equally competent at driving robust 24 hour oscillations, in particular for clock genes, although the cycling behaviour and nature of different genes is altered between the two groups, which is attributed to the age of the individuals. This conclusion could however be influenced by individual variabilities within and between the two age groups. The groups are relatively small, with only four individuals (two females and two males) per group. In addition, factors such as food intake and exercise prior to the blood draw, and/or chronotype, which are known to affect systemic signals, are not taken into consideration. As seen in Figure 4, traces from different individuals vary heavily in terms of their patterns, which is not addressed in the text. Only analysing the summary average curve of the entire group may be masking the true data. More focus should be attributed to investigating the effects of serum from each individual and observing common patterns. Additionally, there are many potential causes of variability, instead of or in addition to age, that may be contributing to the variation, both between the groups and between individuals within groups. All of this should be addressed by the authors and commented on appropriately in the text.

      We are not aware of any specific feature distinguishing the subjects (other than age) that could account for the differences between old and young. The fact that we see significant differences between the two groups, even with the relatively small size of the groups, suggests strongly that these differences are largely due to age. Nevertheless, we acknowledge that individual variability can be a contributing factor. For instance, the change in phase of clock genes appears to be driven largely by two subjects. We have commented on this and individual differences, in general, in the discussion.  

      The authors also note in the introduction that rhythms in different peripheral tissues vary in different ways with age; however, the entire study is performed only on fibroblasts, classified as peripheral tissue by the authors. It would be very interesting to investigate whether the changes observed in fibroblasts extend to other cell lines of diverse organ origin. This could provide information about whether circulating circadian synchronising factors exert their function systemically or on specific tissues. At the very least, this hypothesis should be addressed within the discussion.

      It is likely that factors circulating in serum act on several tissues, and so their effects are relatively broad. However, this would require extensive investigation of other tissues. We now discuss this in the manuscript.

      In addition to the limitations indicated above, I consider that the data of the study are insufficiently analysed beyond the rhythmicity analysis. Results from the STRING and IPA analysis were merely descriptive, and a more comprehensive bioinformatic analysis would provide additional information about potential molecular mechanisms explaining the differential gene expression. For example, enrichment of transcription factor binding sites in those genes with different patterns could pinpoint chromatin regulatory pathways.

      We performed LinC similarity analysis (LISA) to study enrichment of transcription factor binding. Results are displayed in Fig 3B and in lines 157-168. 

      Recommendations for the authors:

      The two reviewers and reviewing editor have agreed on the following recommendations for the authors: 

      Major: 

      (1) The bioinformatic analysis would benefit from a more thorough focus on variability between individuals. Specifically, the main conclusion of the manuscript could be significantly influenced by individual variabilities within and between the two age groups. This is of particular concern, as the groups are relatively small (four individuals, two females and two males, per group). In addition, the consideration of factors such as food intake and exercise prior to the blood draw, and/or chronotype, which are known to affect systemic signals, should be more adequately explained. The lab is an experienced chronobiology lab, and thus we are confident that these factors have been thought of, but this needs to be made clearer.

      As seen in Figure 4, traces from different individuals vary heavily in terms of their patterns, which is not addressed in the text. Only analysing the summary average curve of the entire group may be masking the relevant data. Furthermore, there are many potential causes of variability, instead of or in addition to age, that may be contributing to the variation, both between the groups and between individuals within groups. All of this should be addressed by the authors and commented on appropriately in the text.

      We are not aware of any specific feature distinguishing the subjects (other than age) that could account for the differences between old and young. The fact that we see significant differences between the two groups, even with the relatively small size of the groups, suggests strongly that these differences are largely due to age. Nevertheless, we acknowledge that individual variability can be a contributing factor. For instance, the change in phase of clock genes appears to be driven largely by two subjects. We have commented on this and individual differences, in general, in the discussion. 

      (2) The study would benefit from a more thorough analysis of the data beyond the rhythmicity analysis. Results from the STRING and IPA analysis were merely descriptive, and a more comprehensive bioinformatic analysis would provide additional information about potential molecular mechanisms explaining the differential gene expression. For example, enrichment of transcription factor binding sites in those genes with different patterns could pinpoint chromatin regulatory pathways. This would provide additional value to the study, especially given the otherwise apparent lack of any mechanistic explanation.

      We performed LinC similarity analysis (LISA) to study enrichment of transcription factor binding. Results are displayed in Fig 3B and in lines 157-168.

      (3) Some questions were raised about the amplitude of the core circadian clock gene rhythms, which in other human cell types would be much higher. A comment on this matter and the provision of the raw luminescence traces for Fig 2A would be greatly beneficial.

      Addressing the same topic: what are the typical fold changes of the many genes that change their rhythms after stimulation with young and old sera? For example, it would be useful to show histograms for the two groups. Does one group tend to have transcript rhythms of higher or lower fold changes? The presentation of the manuscript would further benefit from showing a few key examples for different types of responses. 

      The average luminescence trace for each individual serum sample from Fig 2A has been added to Fig S3A.

      We’ve presented the fold change data in Figure S5. There are a few significant differences, but largely the groups are similar in terms of fold change.

      (4) There are several points that we recommend to consider to add to the discussion: 

      What was the rationale to use these cells over the more common U2OS cells? Are there similarities between the rhythmic transcriptomes of the BJ-5TA cells and that of U2OS cells or other human cells? It should be relatively easy to address this point by assessing published datasets. 

      The original rationale to use BJ-5TA fibroblast cells was that we were aiming to build upon an observation found in a previous study² which showed that circadian period changes with age in human fibroblasts. While our findings did not match theirs, we think an added benefit of using the BJ-5TA line is that, unlike U2OS cells, it is not a carcinoma-derived cell line. We’ve added this point in lines 98-101.

      Our study finds many more rhythmic transcripts compared to the previous studies examining U2OS cells. This can be attributed to several factors including differences in methods, including the use of human serum in our study, cell type differences, or decoupling of rhythms in some cancer cells. While a comparison of BJ-5TA cells and U2OS cells could be interesting, a proper comparison requires investigation of many data sets, since any pair of BJ-5TA and U2OS data sets will most likely differ in some detail of experimental design or data processing pipeline, which could contribute to observed differences in rhythmic transcripts.

      That being said, we compared clock reference genes (see Author response image 1) between BJ-5TA and U2OS cells, comparing circadian profiles obtained from our data with those available on CircaDB. These circadian profiles exhibit many similarities and a few differences. The peak to trough ratios (amplitudes) are quite similar for ARNTL, NR1D1, NR1D2, PER2, PER3, and are about 25% lower for CRY1 and somewhat higher for TEF (about 15%) in our data. We find that the MESORs are generally similar, with the exception of NR1D1, which is much lower, and NR1D2, which is much higher in our data.

      For the rhythmic cell cycle genes, could this be the consequence of the serum which synchronizes also the cell cycle, or is it rather an effect of the circadian oscillator driving rhythms of cell cycle genes? 

      This is an interesting point. Given our previous data showing that the cell cycle gene cyclin D1 is regulated by clock transcription factors³, we believe the circadian oscillator drives, or at least contributes to, rhythms of cell cycle genes. However, the serum clearly makes a difference as we find that MESORs of cell cycle genes decrease with aged serum. This is consistent with the decreased proliferation previously observed in aged human tissue.

      While the reduction of rhythmicity in the old serum for oxidative phosphorylation transcripts is very interesting and fits with the general theme that metabolic function decreases with age, it is puzzling that the recipient cells are the same, but it is only the synchronization by the old and young serum that changes. Are the authors thus suggesting that decrease of metabolic rhythms is primarily a non cell-autonomous and systemic phenomenon? What would be a potential mechanism? 

      It may not be the cycling per se, but rather an overall inefficiency of oxidative phosphorylation that is conveyed by the serum. Relating other work in the field to our findings, we’ve added the following to our discussion: “Previous work in the field demonstrates that synchronization of the circadian clock in culture results in cycling of mitochondrial respiratory activity⁵,⁶ further underscoring the different effects of old serum, which does not support oscillations of oxidative phosphorylation associated transcripts. Age-dependent decrease in oxidative phosphorylation and increase in mitochondrial dysfunction⁷ is seen also in aged fibroblasts⁸ and contributes to age-related diseases⁹. We suggest that the age-related inefficiency of oxidative phosphorylation is conferred by serum signals to the cells such that oxidative phosphorylation cycles are mitigated. On the other hand, loss of cycling could contribute to impairments in mitochondrial function with age.”

      The delayed shifts after aged serum for clock transcripts (but not for Bmal1) are interesting and indicate that there may be a decoupling of Bmal1 transcript levels from the other clock gene phases. How do the authors interpret this? Could it be related to altered chronotypes in the elderly? 

      One possible explanation is that the delay of NPAS2, BMAL1’s binding partner, results in the delay of the transcription of clock-controlled genes/negative arm genes. Since the RORs do not seem to be affected, Bmal1 is transcribed/translated as usual, but there isn’t enough NPAS2 to bind with BMAL1. In this case downstream genes are slower to transcribe, causing the phase delay.

      The discussion would also benefit from mentioning parallels and dissimilarities with previous works, as well as what would be possible mechanisms for such an effect.

      We’ve expanded our discussion in the manuscript to discuss possible mechanisms and also how the genes/pathways implicated in our study relate to other aging literature.  

      Minor: 

      While time of serum collection is provided in the methods, it would be very useful to provide this information, along with the accompanying argumentation also at a more prominent position and to also add it to Table S1. 

      We made sure to highlight the collection time in the abstract of the manuscript: “We collected blood from apparently healthy young (age 25-30) and old (age 70-76) individuals at 14:00 [1] and used the serum to synchronize cultured fibroblasts.” The time of blood draw is also in sections of the paper (Intro and Methods). Since Table S1 is demographic information, we did not think that the blood draw time fit best there, but hopefully it is now clear in the text.

      L73 EKG: define the abbreviation 

      We rewrote this paragraph, but defined the term where it is used in the paper.

      L77: transfected BJ-5TA fibroblasts. Mention in the text that these are stably transfected cells. 

      We added this to the text.

      L88: Day 2 also revealed different phases of cyclic expression between young and old "groups" for a larger number of genes. Here it is only two donors, right? 

      Yes, we swapped out the word “groups” for “subjects”.

      L115. MESORs of steroid biosynthesis genes, particularly those relating to cholesterol biosynthesis, were also increased in the old sera condition. This is quite interesting, can the authors speculate on the significance of this finding? 

      We’ve added discussion about this finding in the context of the literature in our discussion.

      Fig 3. - FDRs are only listed for certain KEGG pathways, and gene counts for each pathway are also missing, which excludes some valuable context for drawing conclusions. Full tables of KEGG pathway enrichment outputs should be provided in supplementary materials. Input gene lists should also be uploaded as supplementary data files.

      Both output and input files are included in this submission as additional files.  

      Line 322 - How many replicates were excluded in the end for each group? Providing this information would strengthen the claim that the ability of both old and young serum to drive 24h oscillations in fibroblasts is robust and not only individual. 

      Each serum was tested in triplicate in two individual runs of the experiment. Of the 15 serum samples, on one of the runs, a triplicate for each of two serum samples (one old, one young) was excluded. Given that only one technical replicate in one run of the experiment had to be excluded for one old and one young individual out of all the samples assayed, this supports the idea that young and old serum drive robust oscillations.

      Line 373 - Should list which active interaction sources were used for analysis. 

      In this manuscript we used STRING (search tool for retrieval of interacting genes) analysis to broadly identify relevant pathways defined by different algorithms. From these data, we focused in particular on KEGG pathways.

      Reviewer #1 (Recommendations For The Authors): 

      These comments are in addition to those provided above: 

      Minor: 

      L73 EKG: define the abbreviation 

      We rewrote this paragraph, but defined the term where it is used in the paper.

      L77: transfected BJ-5TA fibroblasts. Mention in the text that these are stably transfected cells. 

      We added this to the text.

      L88: Day 2 also revealed different phases of cyclic expression between young and old "groups" for a larger number of genes. Here it is only two donors, right?

      Yes, we swapped out the word “groups” for “subjects”.

      L115. MESORs of steroid biosynthesis genes, particularly those relating to cholesterol biosynthesis, were also increased in the old sera condition. This is quite interesting, can the authors speculate on the significance of this finding? 

      We’ve added discussion about this finding in the context of the literature.

      Fig.4 The fold change amplitude of the clock gene seems quite a bit lower than what is usually expected (for Nr1d1 it is usually 10 fold). The authors should provide an explanation and discuss this. 

      There are a variety of factors that contribute to the fold change amplitude of clock genes. First, the change in amplitude of clock genes is lower in vitro compared to in vivo samples. For example, in U2OS cell cultures the fold change in the cycling of Nr1d1 is only 2 fold and is not significantly different from the fold change we observe (as shown in the U2OS data from CircaDB plotted in Figure 1R). Second, the method of synchronization contributes to the strength of the rhythms. Serum synchronization is generally less effective at driving strong clock cycling than forskolin or dexamethasone although, as noted in the manuscript, it may promote the cycling of more genes. Lastly, rhythm amplitude is also dependent on the cell type in question so cell to cell variability also contributes to differences. However, overall, we do not find major differences in comparing the U2OS data and ours. Please note that the y-axis has a logarithmic scale.

      What is the authors' strategy to identify which serum components are responsible for the reported changes? This should be discussed.

      In the future, we intend to analyze the serum factors using a combination of fractionation and either proteomics or metabolomics to identify relevant factors. We have added this to the discussion.

      Reviewer #2 (Recommendations For The Authors): 

      Overall, the article is well-written but lacks some more rigorous data analysis as mentioned in the public review above. In addition to a more thorough analysis approach focusing much more heavily on individual variability, several other changes can be made to strengthen this study:

      Fig 3. - FDRs are only listed for certain KEGG pathways, and gene counts for each pathway are also missing, which excludes some valuable context for drawing conclusions. Full tables of KEGG pathway enrichment outputs should be provided in supplementary materials. Input gene lists should also be uploaded as supplementary data files. 

      Both output and input files are included in this submission as additional files.

      Fig 1A. - Only n=5 participants were used for this analysis, explanation of the exclusion criteria for the other participants would be useful. 

      As Figure 1A is a schematic, we assume the reviewer is referring to Figure 1B. We’ve provided a flow chart of subject inclusion/exclusion in Figure S2.

      Fig 2. - For circadian transcriptome analysis only n=4 participants were used - what criteria were used to exclude individuals, and why were only these individuals used in the end?

      As patient recruitment was interrupted by COVID, we selected samples where we had sufficient serum to effectively carry out the RNA seq experiment and control for age and sex.

      Line 322 - How many replicates were excluded in the end for each group? Providing this information would strengthen the claim that the ability of both old and young serum to drive 24h oscillations in fibroblasts is robust and not only individual. 

      Each serum was tested in triplicate in two individual runs of the experiment. Of the 15 serum samples, on one of the runs, a triplicate for each of two serum samples (one old, one young) was excluded. Given that only one technical replicate in one run of the experiment had to be excluded for one old and one young individual out of all the samples assayed, this supports the idea that young and old serum drive robust oscillations.

      Line 373 - Should list which active interaction sources were used for analysis. 

      In this manuscript we used STRING (search tool for retrieval of interacting genes) analysis to identify relevant pathways. We do not present any STRING networks in the paper.

      Line 68 - "These novel findings suggest that it may be possible to treat impaired circadian physiology and the associated disease risks by targeting blood borne factors." This is a complete overstatement that cannot be sustained by the limited findings provided by the authors.

      We’ve modified this statement to avoid overstating results.

      (1) Pagani, L. et al. Serum factors in older individuals change cellular clock properties. Proceedings of the National Academy of Sciences 108, 7218–7223 (2011).

      (2) Pagani, L. et al. Serum factors in older individuals change cellular clock properties. Proc Natl Acad Sci U S A 108, 7218–7223 (2011).

      (3) Lee, Y. et al. G1/S cell cycle regulators mediate effects of circadian dysregulation on tumor growth and provide targets for timed anticancer treatment. PLOS Biology 17, e3000228 (2019).

      (4) Tomasetti, C. et al. Cell division rates decrease with age, providing a potential explanation for the age-dependent deceleration in cancer incidence. Proceedings of the National Academy of Sciences 116, 20482–20488 (2019).

      (5) Cela, O. et al. Clock genes-dependent acetylation of complex I sets rhythmic activity of mitochondrial OxPhos. Biochimica et Biophysica Acta (BBA) - Molecular Cell Research 1863, 596–606 (2016).

      (6) Scrima, R. et al. Mitochondrial calcium drives clock gene-dependent activation of pyruvate dehydrogenase and of oxidative phosphorylation. Biochimica et Biophysica Acta (BBA) - Molecular Cell Research 1867, 118815 (2020).

      (7) Lesnefsky, E. J. & Hoppel, C. L. Oxidative phosphorylation and aging. Ageing Research Reviews 5, 402–433 (2006).

      (8) Greco, M. et al. Marked aging-related decline in efficiency of oxidative phosphorylation in human skin fibroblasts. The FASEB Journal 17, 1706–1708 (2003).

      (9) Federico, A. et al. Mitochondria, oxidative stress and neurodegeneration. Journal of the Neurological Sciences 322, 254–262 (2012).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer 1:

      We thank Reviewer 1 for their helpful comments and hope that the changes made to the revised manuscript have addressed their points.

      This study presents a novel application of the inverted encoding (i.e., decoding) approach to detect the correlates of crossmodal integration in the human EEG (electrophysiological) signal. The method is successfully applied to data from a group of 41 participants, performing a spatial localization task on auditory, visual, and audiovisual events. The analyses clearly show a behavioural superiority for audio-visual localization. Like previous studies, the results when using traditional univariate ERP analyses were inconclusive, showing once more the need for alternative, more sophisticated approaches. Instead, the principal approach of this study, harnessing the multivariate nature of the signal, captured clear signs of super-additive responses, considered by many as the hallmark of multisensory integration. Unfortunately, the manuscript lacks many important details in the descriptions of the methodology and analytical pipeline. Although some of these details can eventually be retrieved from the scripts that accompany this paper, the main text should be self-contained and sufficient to gain a clear understanding of what was done. (A list of some of these is included in the comments to the authors). Nevertheless, I believe the main weakness of this work is that the positive results obtained and reported in the results section are conditioned upon eye movements. When artifacts due to eye movements are removed, then the outcomes are no longer significant. 

      Therefore, whether the authors finally achieved the aims and showed that this method of analysis is truly a reliable way to assess crossmodal integration, does not stand on firm ground. The worst-case scenario is that the results are entirely accounted for by patterns of eye movements in the different conditions. In the best-case scenario, the method might truly work, but further experiments (and/or analyses) would be required to confirm the claims in a conclusive fashion.

      One first step toward this goal would be, perhaps, to facilitate the understanding of results in context by reporting both the uncorrected and corrected analyses in the main results section. Second, one could try to support the argument given in the discussion, pointing out the origin of the super-additive effects in posterior electrode sites, by also modelling frontal electrode clusters and showing they aren't informative as to the effect of interest.

      We performed several additional analyses to address concerns that our main result was caused by different eye movement patterns between conditions. We re-ran our key analyses using activity exclusively from frontal electrodes, which revealed poorer decoding performance than that from posterior electrodes. If eye movements were driving the non-linear enhancement in the audiovisual condition, we would expect stronger decoding using sensors closer to the source, i.e., the extraocular muscles. We also computed the correlations between average eye position and stimulus position for each condition to evaluate whether participants made larger eye movements in the audiovisual condition, which might have contributed to better decoding results. Though we did find evidence for eye movements toward stimuli, the degree of movement did not significantly differ between conditions.

      Furthermore, we note that the analysis using a stricter eye movement criterion, acknowledged in the Discussion section of the original manuscript, resulted in very similar results to the original analysis. There was significantly better decoding in the AV condition (as measured by d') than the MLE prediction, but this difference did not survive cluster correction. The most likely explanation for this is that the strict eye movement criterion combined with our conservative measure of (mass-based) cluster correction led to reduced power to detect true differences between conditions. Taken together with the additional analyses described in the revised manuscript and supplementary materials, the results show that eye movements are unlikely to account for differences between the multisensory and unisensory conditions. Instead, our decoding results likely reflect nonlinear neural integration between audio and visual sensory information.

      “Any experimental design that varies stimulus location needs to consider the potential contribution of eye movements. We computed correlations between participants’ average eye position and each stimulus position between the three sensory conditions (auditory, visual and audiovisual; Figure S1) and found evidence that participants made eye movements toward stimuli. A re-analysis of the data with a very strict eye-movement criterion (i.e., removing trials with eye movements >1.875º) revealed that the super-additive enhancement in decoding accuracy no longer survived cluster correction, suggesting that our results may be impacted by the consistent motor activity of saccades towards presented stimuli. Further investigation, however, suggests this is unlikely. Though the correlations were significantly different from 0, they were not significantly different from each other. If consistent saccades to audiovisual stimuli were responsible for the nonlinear multisensory benefit we observed, we would expect to find a higher positive correlation between horizontal eye position and stimulus location in the audiovisual condition than in the auditory or visual conditions. Interestingly, eye movements corresponded more to stimulus location in the auditory and audiovisual conditions than in the visual condition, indicating that it was the presence of a sound, rather than a visual stimulus, that drove small eye movements. This could indicate that participants inadvertently moved their eyes when localising the origin of sounds. We also re-ran our analyses using the activity measured from the frontal electrodes alone (Figure S2). If the source of the nonlinear decoding accuracy in the audiovisual condition was due to muscular activity produced by eye movements, there should be better decoding accuracy from sensors closer to the source. 
      Instead, we found that decoding accuracy of stimulus location from the frontal electrodes (peak d' = 0.08) was less than half that of decoding accuracy from the more posterior electrodes (peak d' = 0.18). These results suggest that the source of neural activity containing information about stimulus position was located over occipito-parietal areas, consistent with our topographical analyses (inset of Figure 3).”

      The univariate ERP analyses use an outdated contrast, AV <> A + V, to capture multisensory integration. A number of authors have pointed out the potential problem of double baseline subtraction when using this contrast, and have recommended a number of solutions, experimental and analytical. See for example: [1] and [2].

      (1) Teder-Salejarvi, W. A., McDonald, J. J., Di Russo, F., & Hillyard, S. A. (2002). Cognitive Brain Research, 14, 106-114. 

      (2) Talsma, D., & Woldorff, M. G. (2005). Journal of cognitive neuroscience, 17(7), 1098-1114.

      We thank the reviewer for raising this point. Comparing ERPs across different sensory conditions requires careful analytic choices to discern genuine sensory interactions within the signal. The AV <> (A + V) contrast has often been used to detect multisensory integration, though any non-signal related activity (i.e. anticipatory waves; Talsma & Woldorff, 2005) or pre-processing manipulation (e.g. baseline subtraction; Teder-Sälejärvi et al., 2002) will be doubled in (A + V) but not in AV. Critically, we did not apply a baseline correction during preprocessing and thus our results are not at risk of double-baseline subtraction in (A + V). Additionally, we temporally jittered the presentation of our stimuli to mitigate the potential influence of consistent overlapping ERP waves (Talsma & Woldorff, 2005).

      The results section should provide the neurometric curve/s used to extract the slopes of the sensitivity plot (Figure 2B). 

      We thank the reviewer for raising this point of clarification. The sensitivity plots for Figures 2B and 2C were extracted from the behavioural performance of the behavioural and EEG tasks, respectively. The sensitivity plot for Figure 2B was extracted from individual psychometric curves, whereas the d’ values for Figure 2C were calculated from the behavioural data for the EEG task. This information has been clarified in the manuscript.

      “Figure 1. Behavioural performance is improved for audiovisual stimuli. A) Average accuracy of responses across participants in the behavioural session at each stimulus location for each stimulus condition, fitted to a psychometric curve. Steeper curves indicate greater sensitivity in identifying stimulus location. B) Average sensitivity across participants in the behavioural task, estimated from psychometric curves, for each stimulus condition. The red cross indicates estimated performance assuming optimal (MLE) integration of unisensory cues. C) Average behavioural sensitivity across participants in the EEG session for each stimulus condition. Error bars indicate ±1 SEM.”

      The encoding model was fitted for each electrode individually; I wonder if important information contained as combinations of (individually non-significant) electrodes was then lost in this process and if the authors consider that this is relevant. 

      Although the encoding model was fitted for each electrode individually for the topographic maps (Figure 4B), in all other analyses the encoding model was fitted across a selection of electrodes (see final inset of Figure 3). As this electrode set was used for all other neural analyses, our model would allow for the detection of important information contained in the neural patterns across electrodes. This information has been clarified in the manuscript.

      “Thus, for all subsequent analyses we only included signals from the central-temporal, parietal-occipital, occipital and inion sensors for computing the inverse model (see final inset of Figure 2). As the model was fitted for multiple electrodes, subtle patterns of neural information contained within combinations of sensors could be detected.”

      Neurobehavioral correlations could benefit from outlier rejection and the use of robust correlation statistics. 

      We thank the reviewer for raising this issue. Note, however, that the correlations we report are resistant to the influence of outliers because we used Spearman’s rho [1] (as opposed to Pearson’s). This information has been communicated in the manuscript.

      (1) Wilcox, R.R. (2016), Comparing dependent robust correlations. British Journal of Mathematical & Statistical Psychology, 69(3), 215-224. https://doi.org/10.1111/bmsp.12069

      “Neurobehavioural correlations. As behavioural and neural data violated assumptions of normality, we calculated rank-order correlations (Spearman’s rho) between the average decoding sensitivity for each participant from 150-250 ms poststimulus onset and behavioural performance on the EEG task. As Spearman’s rho is resistant to outliers (Wilcox, 2016), we did not perform outlier rejection.”

      “Wilcox, R.R. (2016), Comparing dependent robust correlations. British Journal of Mathematical & Statistical Psychology, 69(3), 215-224. https://doi.org/10.1111/bmsp.12069”
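
      The point about rank-order correlations can be illustrated with a minimal sketch. It is not the authors' analysis code: the function names and the toy data are hypothetical, and the implementation uses only the Python standard library. It computes Spearman's rho as a Pearson correlation on average ranks and shows that a single extreme value changes Pearson's r far more than it changes rho.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical data: a monotonic relationship with one extreme value in y.
neural = [1, 2, 3, 4, 5]
behaviour = [2, 4, 6, 8, 1000]
rho = spearman(neural, behaviour)
r = pearson(neural, behaviour)
```

      In this toy case the ordering of the data is perfectly monotonic, so rho is unaffected by the magnitude of the extreme value, while Pearson's r is pulled well below 1.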

      Many details that are important for the reader to evaluate the evidence and to understand the methods and analyses aren't given; this is a non-exhaustive list:  

      We thank the reviewer for highlighting these missing details. We have updated the manuscript where necessary to ensure the methods and analyses are fully detailed and replicable.

      - specific parameters of the stimuli and performance levels. Just saying "similarly difficult" or "marginally higher volume" is not enough to understand exactly what was done.  

      “The perceived source location of auditory stimuli was manipulated via changes to interaural level and timing (Whitworth & Jeffress, 1961; Wightman & Kistler, 1992). The precise timing of when each speaker delivered an auditory stimulus was calculated from the following formula:

      where x and z are the horizontal and forward distances in metres between the ears and the source of the sound on the display, respectively, r is the head radius, and s is the speed of sound. We used a constant approximate head radius of 8 cm for all participants. r was added to x for the left speaker and subtracted for the right speaker to produce the interaural time difference. For ±15° source locations, interaural timing difference was 1.7 ms. To simulate the decrease in sound intensity as a function of distance, we calculated interaural level differences for the left and right speakers by dividing the sounds by the left and right distance vectors. Finally, we resampled the sound using linear interpolation based on the calculations of the interaural level and timing differences. This process was used to calculate the soundwaves played by the left and right speakers for each of the possible stimulus locations on the display. The maximum interaural level difference between speakers was 0.14 A for ±15° auditory locations, and 0.07 A for ±7.5°.”
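
      The formula referred to in the quoted passage is not reproduced in this response. As a hedged sketch only, one reading consistent with the definitions given (x, z, r and s) is that each speaker's delay is a straight-line path length divided by the speed of sound, with r added to x for the left speaker and subtracted for the right. The function names, the value used for s, and the sign convention for x are assumptions, not taken from the manuscript.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value, the response does not state s
HEAD_RADIUS = 0.08      # m; the constant head radius given in the response

def arrival_time(x, z, ear_offset, s=SPEED_OF_SOUND):
    """Travel time in seconds over a path with horizontal leg
    (x + ear_offset) and forward leg z, both in metres."""
    return math.hypot(x + ear_offset, z) / s

def interaural_time_difference(x, z, r=HEAD_RADIUS, s=SPEED_OF_SOUND):
    """Left-minus-right delay: r is added to x for the left speaker
    and subtracted from x for the right speaker."""
    t_left = arrival_time(x, z, +r, s)
    t_right = arrival_time(x, z, -r, s)
    return t_left - t_right
```

      Under this reading a source on the midline (x = 0) gives zero time difference, and the difference grows in magnitude as the source moves sideways. The geometry of the actual display is not given here, so no numerical comparison with the reported 1.7 ms value is attempted.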

      - were stimulus parameters adjusted individually or as a group? Which method was followed?

      To clarify, stimulus parameters (frequency, size, luminance, volume, location, etc.) were manipulated throughout pilot testing only. Parameters were adjusted to achieve similar pilot behavioural results between the auditory and visual conditions. For the experiment proper, parameters remained constant for both tasks and were the same for all participants.

      “During pilot testing, stimulus features (size, luminance, volume, frequency etc.) were manipulated to make visual and auditory stimuli similarly difficult to spatially localize. These values were held constant in the main experiment.”

      - specify which response buttons were used.

      “Participants were presented with two consecutive stimuli and tasked with indicating, via button press, whether the first (‘1’ number-pad key) or second (‘2’ number-pad key) interval contained the more leftward stimulus.”

      “At the end of each sequence, participants were tasked with indicating, via button press, whether more presentations appeared on the right (‘right’ arrow key) or the left (‘left’ arrow key) of the display.”

      - no information is given as to how many trials per condition remained on average, for analysis.  

      The average number of remaining trials per condition after eye-movement analysis is now included in the Methods section of the revised manuscript.

      “We removed trials with substantial eye movements (>3.75º away from fixation) from the analyses. After the removal of eye movements, on average 2365 (SD = 56.94), 2346 (SD = 152.87) and 2350 (SD = 132.47) trials remained for auditory, visual and audiovisual conditions, respectively, from the original 2400 per condition.”
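
      The trial-exclusion rule can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the trial layout (a dictionary with a "gaze" list of (x, y) positions in degrees relative to fixation) and the function names are hypothetical, and the use of Euclidean distance from fixation is an assumption, since the response does not say whether the criterion was applied to horizontal position only.

```python
import math

def keep_trial(gaze_deg, threshold_deg=3.75):
    """True if every gaze sample lies within threshold_deg of fixation,
    where fixation is taken to be the origin (0, 0)."""
    return all(math.hypot(x, y) <= threshold_deg for x, y in gaze_deg)

def reject_eye_movements(trials, threshold_deg=3.75):
    """Return only the trials whose gaze never exceeds the threshold."""
    return [t for t in trials if keep_trial(t["gaze"], threshold_deg)]

# Hypothetical trials for illustration.
example_trials = [
    {"id": 1, "gaze": [(0.1, 0.0), (1.0, -0.5)]},
    {"id": 2, "gaze": [(0.0, 0.0), (4.0, 0.0)]},
    {"id": 3, "gaze": [(2.0, 0.0)]},
]
kept_default = reject_eye_movements(example_trials)
kept_strict = reject_eye_movements(example_trials, threshold_deg=1.875)
```

      Passing 1.875 as the threshold mirrors the stricter criterion mentioned in the response to Reviewer 1; a tighter threshold removes more trials, which is one way statistical power can drop in a re-analysis.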

      - no information is given on the specifics of participant exclusion criteria. (even if the attrition rate was surprisingly high, for such an easy task).  

      The behavioural session also served as a screening task. Although the task instructions were straightforward, perceptual discrimination was not easy due to the ambiguity of the stimuli. Auditory localization is not very precise, and the visual stimuli were brief, dim, and diffuse. The behavioural results reflect the difficulty of the task. The attrition rate was high because participants who scored below 60% correct in any condition were deemed unable to accurately perform the task, were not invited to complete the subsequent EEG session, and were omitted from the analyses. We have included the specific criteria in the manuscript.

      “Participants were first required to complete a behavioural session with above 60% accuracy in all conditions to qualify for the EEG session (see Behavioural session for details).”

      - EEG pre-processing: what filter was used? How was artifact rejection done? (no parameters are reported); How were bad channels interpolated?  

      We used a 0.25 Hz high-pass filter to remove baseline drifts, but no low-pass filter. In line with recent studies on the undesirable influence of EEG preprocessing on ERPs [1], we opted to avoid channel interpolation and artifact rejection. This was erroneously reported in the manuscript and has now been clarified. For the sake of clarity, here we demonstrate that a reanalysis of data using channel interpolation and artifact rejection returned the same pattern of results.

      (1) Delorme, A. (2023). EEG is better left alone. Scientific Reports, 13, 2372. https://doi.org/10.1038/s41598-023-27528-0

      - specific electrode locations must be given or shown in a plot (just "primarily represented in posterior electrodes" is not sufficiently informative).  

      A diagram of the electrodes used in all analyses is included within Figure 3, and we have drawn readers’ attention to this in the revised manuscript.

      “Thus, for all subsequent analyses we only included signals from the central-temporal, parietal-occipital, occipital and inion sensors for computing the inverse model (see final inset of Figure 2).” 

      - ERP analysis: which channels were used? What is the specific cluster correction method?

      We used a conservative mass-based cluster correction from Pernet et al. (2015) - this information has been clarified in the manuscript.

      “A conservative mass-based cluster correction was applied to account for spurious differences across time (Pernet et al., 2015).” 

      “Pernet, C. R., Latinus, M., Nichols, T. E., & Rousselet, G. A. (2015). Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. Journal of Neuroscience Methods, 250, 85-93. https://doi.org/10.1016/j.jneumeth.2014.08.003”
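
      To make the idea of a mass-based cluster statistic concrete, the sketch below groups consecutive supra-threshold time points into clusters and sums the absolute test statistic within each one. It is a generic illustration, not the procedure of Pernet et al. (2015) or the authors' code: the function name, the threshold and the example values are hypothetical, and the step that builds a null distribution of maximum cluster masses (by permutation or bootstrap) is deliberately left out.

```python
def cluster_masses(t_values, threshold):
    """Return (start, end, mass) for each run of consecutive time points
    whose |t| exceeds threshold and whose sign is constant.
    Mass is the sum of |t| over the run; indices are inclusive."""
    clusters = []
    start = None
    for i, t in enumerate(t_values):
        above = abs(t) > threshold
        same_sign = start is not None and (t > 0) == (t_values[start] > 0)
        if above and same_sign:
            continue  # still inside the current cluster
        if start is not None:
            # Close the cluster that ended at the previous time point.
            mass = sum(abs(v) for v in t_values[start:i])
            clusters.append((start, i - 1, mass))
            start = None
        if above:
            start = i  # open a new cluster at this time point
    if start is not None:
        mass = sum(abs(v) for v in t_values[start:])
        clusters.append((start, len(t_values) - 1, mass))
    return clusters

# Hypothetical t-values across seven time points.
example = cluster_masses([0.5, 2.5, 3.0, 1.0, -2.2, -2.4, 0.1], threshold=2.0)
```

      Each observed mass would then be compared against the null distribution of the largest cluster mass. A short or weak cluster can exceed the pointwise threshold yet fall below that null criterion, which is how an effect can be significant uncorrected but fail to survive cluster correction.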

      - results: descriptive stats on performance must be given (instead of saying "participants performed well").  

      The mean and standard deviation of participants’ performance for each condition in the behavioural and EEG experiments are now explicitly mentioned in the manuscript.

      “A quantification of the behavioural sensitivity (i.e., steepness of the curves) revealed significantly higher sensitivity for the audiovisual stimuli (M = .04, SD = .02) than for the auditory stimuli alone (M = .03, SD = .01; Z = -3.09, p = .002), and than for the visual stimuli alone (M = .02, SD = .01; Z = -5.28, p = 1.288e-7; Figure 1B). Sensitivity for auditory stimuli was also significantly higher than sensitivity for visual stimuli (Z = 2.02, p = .044).” 

      “We found a similar pattern of results to those in the behavioural session; sensitivity for audiovisual stimuli (M = .85, SD = .33) was significantly higher than for auditory (M = .69, SD = .41; Z = -2.27, p = .023) and visual stimuli alone (M = .61, SD = .29; Z = -3.52, p = 4.345e-4), but not significantly different from the MLE prediction (Z = -1.07, p = .285).” 
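
      The MLE prediction referred to above can be illustrated with the standard optimal-combination rule for two independent cues, in which the predicted sensitivity is the root sum of squares of the unisensory d' values. Whether this is exactly the computation used in the manuscript is an assumption, and the function name is hypothetical. Plugging in the group means reported for the EEG session is for illustration only, because a prediction computed per participant and then averaged need not equal the prediction computed from group means.

```python
import math

def mle_dprime(d_auditory, d_visual):
    """Predicted audiovisual d' for optimal combination of two
    independent cues: sqrt(d_A**2 + d_V**2)."""
    return math.hypot(d_auditory, d_visual)

# Group means reported for the EEG session: auditory .69, visual .61.
predicted_av = mle_dprime(0.69, 0.61)
observed_av = 0.85
```

      With these inputs the predicted value is about 0.92, close to the observed mean of .85, which is in keeping with the reported lack of a significant difference between audiovisual sensitivity and the MLE prediction in behaviour.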

      - sensitivity in the behavioural and EEG sessions is said to be different, but no comparison is given. It is not even the same stimulus set across the two tasks...  

      This relationship was noted as a potential explanation for the higher sensitivities obtained in the EEG task, and was not intended to stand up to statistical scrutiny. We agree it makes little sense to compare statistically between the EEG and behavioural results as they were obtained from different tasks. We would like to clarify, however, that the stimuli used in the two tasks were the same, with the exception that in the EEG task the stimuli were presented from 5 locations versus 8 in the behavioural task. To avoid potential confusion, we have removed the offending sentence from the manuscript:

      Reviewer 2:

      Their measure of neural responses is derived from the decoder responses, and this takes account of the reliability of the sensory representations - the d' statistics - which is an excellent thing. It also means if I understand their analysis correctly (it could bear clarifying - see below), that they can generate from it a prediction of the performance expected if an optimal decision is made combining the neural signals from the individual modalities. I believe this is the familiar root sum of squares d' calculation (or very similar). Their decoding of the audiovisual responses comfortably exceeds this prediction and forms part of the evidence for their claims. 

      Yet, superadditivity - including that in evidence in the principle of inverse effectiveness - more typically quantifies the excess over the sum of proportions correct in each modality. Their MLE d' statistic can already predict this form of superadditivity. Therefore, the superadditivity they report here is not the same form of superadditivity that is usually referred to in behavioural studies. It is in fact a stiffer definition. What their analysis tests is that decoding performance exceeds what would be expected from an optimally weighted linear integration of the unisensory information. As this is not the common definition, it is difficult to relate to the behavioural superadditivity (of percentage correct) reported in much of the literature. This distinction is not at all clear from the manuscript.

      But the real puzzle is here: The behavioural data for this task do not exceed the optimal statistical decision predicted by signal detection theory (the MLE d'). Yet, the EEG data would suggest that the neural processing is exceeding it. So why, if the neural processing is there to yield better performance, is it not reflected in the behaviour? I cannot explain this, but it strikes me that the behaviour and neural signals are for some reason not reflecting the same processing.

      Be explicit and discuss this mismatch they observe between behaviour and neural responses. 

      Thank you, we agree that it is worth expanding on the observed disconnect between MSI in behaviour and neural signals. We have included an additional paragraph in the Discussion of the revised manuscript. Despite the mismatch, we believe the behavioural and neural responses still reflect the same underlying processing, but at different levels of sensitivity. The behavioural result likely reflects a coarse down-sampling of the precision in location representation, and is thus less likely to reflect subtle MSI enhancements.

      “An interesting aspect of our results is the apparent mismatch between the behavioural and neural responses. While the behavioural results meet the optimal statistical threshold predicted by MLE, the decoding analyses suggest that the neural response exceeds it. Though non-linear neural responses and statistically optimal behavioural responses are reliable phenomena in multisensory integration (Alais & Burr, 2004; Ernst & Banks, 2002; Stanford & Stein, 2007), the question remains – if neural super-additivity exists to improve behavioural performance, why is it not reflected in behavioural responses? A possible explanation for this neurobehavioural discrepancy is the large difference in timing between sensory processing and behavioural responses. A motor response would typically occur some time after the neural response to a sensory stimulus (e.g., 70-200 ms), with subsequent neural processes between perception and action that introduce noise (Heekeren et al., 2008) and may obscure super-additive perceptual sensitivity. In the current experiment, participants reported either the distribution of 20 serially presented stimuli (EEG session) or compared the positions of two stimuli (behavioural session), whereas the decoder attempts to recover the location of every presented stimulus. While stimulus location could be represented with higher fidelity in multisensory relative to unisensory conditions, this would not necessarily result in better performance on a binary behavioural task in which multiple temporally separated stimuli are compared. One must also consider the inherent differences in how super-additivity is measured at the neural and behavioural levels. Neural super-additivity should manifest in responses to each individual stimulus. In contrast, behavioural super-additivity is often reported as proportion correct, which can only emerge between conditions after being averaged across multiple trials. 
      The former is a biological phenomenon, while the latter is an analytical construct. In our experiment, we recorded neural responses for every presentation of a stimulus, but behavioural responses were only obtained after multiple stimulus presentations. Thus, the failure to find super-additivity in behavioural responses might be due to their operationalisation, with between-condition comparisons lacking sufficient sensitivity to detect super-additive sensory improvements. Future work should focus on experimental designs that can reveal super-additive responses in behaviour.”

      Re-work the introduction to explain more clearly the relationship between the behavioural superadditivities they review, the MLE model, and the superadditivity it actually tests. 

      We agree it is worth discussing how super-additivity is operationalised across neural and behavioural measures. However, we do not believe the behavioural studies we reviewed claimed super-additive behavioural enhancements. While MLE is often used as a behavioural marker of successful integration, it is not necessarily used as evidence for super-additivity within the behavioural response, as it relies on linear operations. 

      “It is important to consider the differences in how super-additivity is classified between neural and behavioural measures. At the level of single neurons, superadditivity is defined as a non-linear response enhancement, with the multisensory response exceeding the sum of the unisensory responses. In behaviour, meanwhile, it has been observed that the performance improvement from combining two senses is close to what is expected from optimal integration of information across the senses (Alais & Burr, 2004; Stanford & Stein, 2007). Critically, behavioural enhancement of this kind does not require non-linearity in the neural response, but can arise from a reliability-weighted average of sensory information. In short, behavioural performance that conforms to MLE is not necessarily indicative of neural super-additivity, and the MLE model can be considered a linear baseline for multisensory integration.”

      Regarding the auditory stimulus, this reviewer notes that interaural time differences are unlikely to survive free field presentation.

      Despite the free field presentation, in both the pilot test and the study proper participants were able to localize auditory stimuli significantly above chance. 

      "However, other studies have found super-additive enhancements to the amplitude of sensory event-related potentials (ERPs) for audiovisual stimuli (Molholm et al., 2002; Talsma et al., 2007), especially when considering the influence of stimulus intensity (Senkowski et al., 2011)." - this makes it obvious that there are some studies which show superadditivity. It would have been good to provide a little more depth here - as to what distinguished those studies that reported positive effects from those that did not.

      We have provided further detail on how super-additivity appears to manifest in neural measures.

      “In EEG, meanwhile, the evoked response to an audiovisual stimulus typically conforms to a sub-additive principle (Cappe et al., 2010; Fort et al., 2002; Giard & Peronnet, 1999; Murray et al., 2016; Puce et al., 2007; Stekelenburg & Vroomen, 2007; Teder-Sälejärvi et al., 2002; Vroomen & Stekelenburg, 2010). However, when the principle of inverse effectiveness is considered and relatively weak stimuli are presented together, there has been some evidence for super-additive responses (Senkowski et al., 2011).”

      “While behavioural outcomes for multisensory stimuli can be predicted by MLE, and single neuron responses follow the principles of inverse effectiveness and super-additivity, among others (Rideaux et al., 2021), how audiovisual super-additivity manifests within populations of neurons is comparatively unclear given the mixed findings from relevant fMRI and EEG studies. This uncertainty may be due to biophysical limitations of human neuroimaging techniques, but it may also be related to the analytic approaches used to study these recordings. For instance, super-additive responses to audiovisual stimuli in EEG studies are often reported from very small electrode clusters (Molholm et al., 2002; Senkowski et al., 2011; Talsma et al., 2007), suggesting that neural super-additivity in humans may be highly specific. However, information encoded by the brain can be represented as increased activity in some areas, accompanied by decreased activity in others, so simplifying complex neural responses to the average rise and fall of activity in specific sensors may obscure relevant multivariate patterns of activity evoked by a stimulus.”

      P9. "(25-75 W, 6 Ω)." This is not important, but it is a strange way to cite the power handling of a loudspeaker. 

      “The loudspeakers had a power handling capacity of 25-75 W and a nominal impedance of 6 Ω.” 

      I am struggling to understand the auditory stimulus: 

      "Auditory stimuli were 100 ms clicks". Is this a 100-ms long train of clicks? A single pulse which is 100ms long would not sound like a click, but two clicks once filtered by the loudspeaker. Perhaps they mean 100us. 

      "..with a flat 850 Hz tone embedded within a decay envelope". Does this mean the tone is gated - i.e. turns on and off slowly? Or is it constant?

      We thank the reviewer for catching this. ‘Click’ may not be the most apt way of defining the auditory stimulus. It was a 100 ms square wave tone with decay, i.e., with an onset at maximal volume before fading gradually. Given that the length of the stimulus was 100 ms, the decay occurs quickly and provides a more ‘click-like’ percept than a pure tone. We have provided a representation of the sound below for further clarification. This represents the amplitude from the L and R speakers for maximally-left and maximally-right stimuli. We have added this clarification in the revised manuscript. 

      Author response image 1.

      “Auditory stimuli were 100 ms, 850 Hz tones with a decay function (sample rate = 44,100 Hz; volume = 60 dBA SPL, as measured at the ears).”

      P10. "Stimulus modality was either auditory, visual, or audiovisual. Trials were blocked with short (~2 min) breaks between conditions".

      Presumably the blocks were randomised across participants.

      Condition order was not randomised across participants, but counterbalanced. This has been clarified in the manuscript.

      “Stimulus modality was auditory, visual or audiovisual, presented in separate blocks with short breaks (~2 min) between conditions (see Figure 6A for an example trial). The order of conditions was counterbalanced across participants.” 

      P15. Feels like there is a step not described here: "The d' of the auditory and visual conditions can be used to estimate the predicted 'optimal' sensitivity of audiovisual signals as calculated through MLE." Do they mean sqrt[ (d'A)^2 + (d'V)^2] ? If it is so simple then it may as well be made explicit here. A quick calculation from eyeballing Figures 2B and 2C suggests this is the case.

      We thank the reviewer for raising this point of clarification. Yes, the ‘optimal’ audiovisual sensitivity was calculated as the hypotenuse of the auditory and visual sensitivities. This calculation has been made explicit in the revised manuscript.

      The d’ from the auditory and visual conditions can be used to estimate the predicted ‘optimal’ sensitivity to audiovisual signals as calculated through the following formula:

      $d'_{AV} = \sqrt{(d'_{A})^{2} + (d'_{V})^{2}}$
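      The root-sum-of-squares prediction confirmed in the response above can be sketched in a few lines. This is only an illustration: the function names are hypothetical and the numbers are made up, not values from the study.

```python
import math


def predicted_av_dprime(d_auditory: float, d_visual: float) -> float:
    """MLE ('optimal') audiovisual sensitivity: the hypotenuse of the
    auditory and visual d' values."""
    return math.hypot(d_auditory, d_visual)


def exceeds_mle(d_audiovisual: float, d_auditory: float, d_visual: float) -> bool:
    """True if the observed audiovisual d' is larger than the MLE prediction,
    i.e. super-additive in the stricter sense discussed above."""
    return d_audiovisual > predicted_av_dprime(d_auditory, d_visual)


# Illustrative values only (not taken from the manuscript).
print(predicted_av_dprime(3.0, 4.0))  # 5.0
print(exceeds_mle(5.5, 3.0, 4.0))     # True
```

      Note that this baseline already exceeds either unisensory d', which is why performance that merely matches it is treated as a linear (non-super-additive) outcome.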

      "The perceived source location of auditory stimuli was manipulated via changes to interaural intensity and timing (Whitworth & Jeffress, 1961; Wightman & Kistler, 1992)." The stimuli were delivered by a pair of loudspeakers, and the incident sound at each ear would be a product of both speakers. And - if there were a time delay between the two speakers, then both ears could potentially receive separate pulses one after the other at different delays. Did they record this audio stimulus with manikin? If not, it would be very difficult to know what it was at the ears. I don't doubt that if they altered the relative volume of the loudspeakers then some directionality would be perceived but I cannot see how the interaural level and timing differences could be matched - as if the sound were from a single source. I doubt that this invalidates their results, but to present this as if it provided matched spatial and timing cues is wrong, and I cannot work out how they can attribute an azimuthal location to the sound. For replication purposes, it would be useful to know how far apart the loudspeakers were and what the timing and level differences actually were.

      The behavioural tasks each had evenly distributed ‘source locations’ on the horizontal azimuth of the computer display (8 for the behavioural session, 5 for the EEG session). We manipulated the perceived location of auditory stimuli through interaural time delays and interaural level differences. By first measuring the forward (z) and horizontal (x) distance of each source location to each ear, the method worked by calculating what the time-course of a sound wave should be at the location of the ear given the sound wave at the source. Then, for each source location, we can calculate the time delay between speakers given the vectors of x and z, the speed of sound and the width of the head.  As the intensity of sound drops inversely with the square of the distance, we can divide the sound wave by the distance for each source location to provide the interaural level difference. Though we did not record the auditory stimulus with a manikin, our behavioural analyses show that participants were able to detect the directions of auditory stimuli from our manipulations, even to a degree that significantly exceeded the localisation accuracy for visual stimuli (for the behavioural session task). This information has been clarified in the manuscript.

      “Auditory stimuli were played through two loudspeakers placed either side of the display (80 cm apart for the behavioural session, 58 cm apart for the EEG session).” 

      “The perceived source location of auditory stimuli was manipulated via changes to interaural level and timing (Whitworth & Jeffress, 1961; Wightman & Kistler, 1992). The precise timing of when each speaker delivered an auditory stimulus was calculated from the following formula:

      where x and z are the horizontal and forward distances in metres between the ears and the source of the sound on the display, respectively, r is the head radius, and s is the speed of sound. We used a constant approximate head radius of 8 cm for all participants. r was added to x for the left speaker and subtracted for the right speaker to produce the interaural time difference. For ±15° source locations, the interaural timing difference was 1.7 ms. To simulate the decrease in sound intensity as a function of distance, we calculated interaural level differences for the left and right speakers by dividing the sounds by the left and right distance vectors. Finally, we resampled the sound using linear interpolation based on the calculations of the interaural level and timing differences. This process was used to calculate the soundwaves played by the left and right speakers for each of the possible stimulus locations on the display. The maximum interaural level difference between speakers was 0.14 A for ±15° auditory locations, and 0.07 A for ±7.5°.”
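      One way to read the description above is as a simple geometric model. The sketch below is an assumption-laden reconstruction, not the authors' code: it assumes the travel time is the straight-line distance divided by the speed of sound, that x is positive to the right of the midline, that the speed of sound is 343 m/s (not stated in the response), and that amplitude is scaled by 1/distance. All function names are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed, the response does not give a value
HEAD_RADIUS = 0.08      # m; the constant 8 cm radius quoted above


def ear_distances(x: float, z: float, r: float = HEAD_RADIUS):
    """Straight-line distances (m) from a source at horizontal offset x and
    forward distance z to the left and right ears (r added for the left,
    subtracted for the right, as described in the text)."""
    left = math.hypot(x + r, z)
    right = math.hypot(x - r, z)
    return left, right


def arrival_times(x: float, z: float, r: float = HEAD_RADIUS,
                  s: float = SPEED_OF_SOUND):
    """Assumed travel times (s) to each ear: distance / speed of sound."""
    left, right = ear_distances(x, z, r)
    return left / s, right / s


def level_scaling(x: float, z: float, r: float = HEAD_RADIUS):
    """Amplitude scaling per channel, dividing the waveform by distance."""
    left, right = ear_distances(x, z, r)
    return 1.0 / left, 1.0 / right


# Hypothetical geometry: source 0.2 m right of the midline, 0.6 m ahead.
t_left, t_right = arrival_times(0.2, 0.6)
print(t_left - t_right)  # positive: the sound reaches the left ear later
```

      Under these assumptions a source on the midline yields identical timing and level in both channels, and the differences grow as the source moves laterally.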

      I am confused about this statement: "A quantification of the behavioural sensitivity (i.e., steepness of the curves) revealed significantly greater sensitivity for the audiovisual stimuli than for the auditory stimuli alone (Z = -3.09, p = .002)," It is not clear from the methods how they attributed sound source angle to the sounds. Conceivably they know the angle of the loudspeakers, and this would provide an outer bound on the perceived location of the sound for extreme interaural level differences (although free field interaural timing cues can create a wider sound field). 

      Our analysis of behavioural sensitivity was dependent on the set ‘source locations’ that were used to calculate the position of auditory and audiovisual stimuli.  In the behavioural task, participants judged the position of the target stimulus relative to a central stimulus. Thus, for each source location, we recorded how often participants correctly discriminated between presentations. The quoted analysis acknowledges that participants were more sensitive to audiovisual stimuli than auditory stimuli in the context of this task. A full explanation of how source location was implemented for auditory stimuli has been clarified in the manuscript. 

      It would be very nice to see some of the "channel" activity - to get a feel for the representation used by the decoder. 

      We have included responses for the five channels as a Supplemental Figure.

      Figure 6 appears to show that there is some agreement between behaviour and neural responses - for the audiovisual case alone. The positive correlation of behavioural and decoding sensitivity appears to be driven by one outlier - who could not perform the audiovisual task (and indeed presumably any of them). Furthermore, if we were to simply Bonferroni-correct for the three comparisons, this would become non-significant. It is also puzzling why the unisensory behaviour and EEG do not correlate - which seems to again suggest a poor correspondence between them. Opposite to the claim made.

      We understand the reviewer’s concern here. We would like to note, however, that each correlation used unique data sets – that is, the behavioural and neural data for each separate condition. In this case, we believe a Bonferroni correction for multiple comparisons is too conservative, as no data set was compared more than once. Neither the behavioural nor the neural data were normally distributed, and both contained outliers. Rather than reduce power through outlier rejection, we opted to test correlations using Spearman’s rho, which is resistant to outliers (1). It is also worth noting that, without outlier rejection, the audiovisual correlation (p = .003) would survive a Bonferroni correction for 3 comparisons. The nonsignificant correlation in the auditory and visual conditions might be due to the weaker responses elicited by unisensory stimuli, with the reduced signal-to-noise ratio obscuring potential correlations. Audiovisual stimuli elicited more precise responses both behaviourally and neurally, increasing the power to detect a correlation.
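      The two statistical points in this reply, rank-based correlation and the Bonferroni threshold, can be illustrated with a small standard-library sketch. It is not the authors' analysis code; the alpha level of .05 is an assumption, and the toy data are invented to show why a rank correlation is insensitive to the magnitude of an extreme value.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman_rho(a, b):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


def survives_bonferroni(p, n_comparisons, alpha=0.05):
    """True if p is below alpha / n_comparisons (alpha = .05 is assumed)."""
    return p < alpha / n_comparisons


# Toy data: the extreme value 100 only contributes its rank, not its size.
print(spearman_rho([1, 2, 3, 4, 100], [2, 4, 6, 8, 9]))
print(survives_bonferroni(0.003, 3))  # True: .003 < .05 / 3
```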

      (1) Wilcox, R.R. (2016), Comparing dependent robust correlations. British Journal of Mathematical & Statistical Psychology, 69(3), 215-224. https://doi.org/10.1111/bmsp.12069

      “We also found a significant positive correlation between participants’ behavioural judgements in the EEG session and decoding sensitivity for audiovisual stimuli. This result suggests that participants who were better at identifying stimulus location also had more reliably distinct patterns of neural activity. The lack of neurobehavioural correlation in the unisensory conditions might suggest a poor correspondence between the different tasks, perhaps indicative of the differences between behavioural and neural measures explained previously. However, multisensory stimuli have consistently been found to elicit stronger neural responses than unisensory stimuli (Meredith & Stein, 1983; Puce et al., 2007; Senkowski et al., 2011; Vroomen & Stekelenburg, 2010), which has been associated with behavioural performance (Frens & Van Opstal, 1998; Wang et al., 2008). Thus, the weaker signal-to-noise ratio in unisensory conditions may prevent correlations from being detected.”

      Further changes:

      (1)   To improve clarity, we shifted the Methods section to after the Discussion. This change included updating the figure numbers to match the new order (Figure 1 becomes Figure 6, Figure 2 becomes Figure 1, and so on).

      (2)   We also resolved an error on Figure 2 (previously Figure 3). The final graph (Difference between AV and A + V) displayed incorrect values on the Y axis.

      This has now been remedied.

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This study of extrachromosomal DNA (ecDNA) aims to identify genes that distinguish ecDNA+ and ecDNA- tumors. This timely study is important in addressing the genes responding to the amplification of the ecDNA. The data presented are for the most part solid, but there were concerns regarding the clarity in the description of the analysis methods and whether the evidence for specific genes required to maintain the ecDNA+ state was entirely conclusive.

      Public Reviews:

      Reviewer #1 (Public Review):

      Recently discovered extrachromosomal DNA (ecDNA) provides an alternative non-chromosomal means for oncogene amplification and a potent substrate for selective evolution of tumors. The current work aims to identify key genes whose expression distinguishes ecDNA+ and ecDNA- tumors and the associated processes to shed light on the biological mechanisms underlying ecDNA genesis and their oncogenic effects. While this is clearly an important question, the analysis and the evidence supporting the claims are weak. The specific machine learning approach seems unnecessarily convoluted, insufficiently justified and explained, and the language used by the authors conflates correlation with causality. This work points to specific GO processes associated (up and down) with ecDNA+ tumors, many of which are expected but some seem intriguing, such as association with DSB pathways. My specific comments are listed below.

      Response. As some of the specific questions below address similar concerns, we have answered them briefly here. As a high level point, the reviewer is correct in that other statistical or ML approaches could potentially have been used, and that some are simpler. However, the test used here directly addresses the question: Find a collection of genes whose expression value is predictive of ecDNA status in the sample. Because the underlying method in the Boruta analysis uses random forests, it can test predictive power without relying on a linearity assumption implicit in other methods. In this revision, we also compare against a Generalized Linear Model and show that it is less suited to the specific task above. We also address the reviewer concerns about specific parameter choices by showing robustness to the specific parameter.

      (A) The claim of identifying genes required to 'maintain' ecDNA+ status is not justified - predictive features are not necessarily causal.

      Response. We agree with the reviewer that predictive features are correlative and not causal. In the manuscript, we identify genes whose expression (when used as a feature) is predictive of ecDNA presence or absence. Such predictive genes are consistently over-expressed or consistently under-expressed in ecDNA(+) samples relative to ecDNA(-) samples even though they are not required to be on ecDNA. To our knowledge, we did not claim that these genes are causal for ecDNA formation or maintenance, only that such genes and the underlying biological processes are worth investigating. In the beginning of the manuscript, we had written the following paragraph, but we have removed the last line (struck out here):

      “In lieu of identifying genes that are highly differentially expressed between ecDNA(+) and ecDNA(-) samples but driven by a small subset of cases (e.g. gene A in Fig. S1a), we sought to identify genes (e.g. gene B) whose expression level was predictive of ecDNA presence. We assumed that genes that were persistently over-expressed or under-expressed in ecDNA(+) samples relative to ecDNA(-) samples were more likely to be involved in ecDNA biogenesis or maintenance, or in mediating the cellular response to the presence of ecDNA.”

      We revised the manuscript to make sure that there are no claims that refer to causality. We revisited all phrases where the words like “maintain” were used and added appropriate disclaimers, or replaced them by the phrase, “ecDNA presence.” The remaining statements say, for example, “These results are consistent with a pan-cancer role of CorEx genes in ecDNA biogenesis and maintenance,” and do not claim causality.

      (B) The methods and procedures used to identify the key genes are hyper-parameterized and convoluted, which casts doubt on the robustness of the findings given the size and heterogeneity of the data.

      (a) In the first two paragraphs of Boruta Analysis Methods section, authors describe an iterative procedure where in each iteration, a binomial p-value is computed for each gene based on number of iterations thus far in which the gene was selected (higher GINI index than max of shadow features). But then in the third paragraph they simply perform Random Forest in 200 random 80% of samples and pick a gene if it is selected in at least 10/200. It is ultimately not clear what was done. Why 10/200? Also "the probability that a gene is a "hit" or "non-hit" in each iteration is 0.5" is unclear. That probability is of a gene achieving GINI index higher than the max of shadow features. How can it be 0.5?

      Response. We believe that there is some misunderstanding about the algorithm, and we agree that the description should have been clearer. We have greatly simplified the description in the manuscript. However, we want to provide some higher-level explanation here. Boruta is a standard feature extraction algorithm (Kursa, Journal of Statistical Software September 2010, Volume 36, Issue 11), and we used a Python implementation of the method. Given a gene expression data-set with class labels on samples, Boruta extracts features (genes) that best predict the class labels using a Random Forest Classifier, as long as the features are more predictive than permuted features added in each iteration. As we are using an implementation of a published method, we have removed non-essential details, referring directly to the publication. Nevertheless, to address the reviewer’s specific critique, the number of false-features added changes in each iteration (it equals the number of accepted+uncommitted features). Therefore, the choice of 0.5 by Boruta (it is fixed in the published method and not a user-specified parameter) is a conservative approach. If a gene was no better than a randomly chosen feature, the probability that its predictive performance exceeds that of the most predictive randomly chosen feature would be at most 0.5 (but could be lower, making the choice of 0.5 conservative).

      While Boruta iteratively picks genes that are significantly better than random features, the list of genes predicted might be specific to the data-set, and might change with different data-sets. Therefore, we employed a bootstrapping strategy: we performed 200 trials each time picking 80% of the ecDNA(+) samples and 80% of the ecDNA(-) samples at random, thus generating many data-sets while maintaining class imbalance. For each of the 200 trials, we performed a Boruta analysis. Finally, we picked a gene if it was selected as a Boruta feature in at least 10 of 200 trials.
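      The resampling scheme described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: `select_features` is a hypothetical callback standing in for one Boruta run on a subsample, and the toy selector exists only to make the example runnable.

```python
import random
from collections import Counter


def stratified_subsample(labels, fraction, rng):
    """Pick `fraction` of the sample indices from each class separately,
    so the class imbalance of the full data-set is preserved."""
    chosen = []
    for cls in sorted(set(labels)):
        members = [i for i, lab in enumerate(labels) if lab == cls]
        chosen.extend(rng.sample(members, int(round(fraction * len(members)))))
    return sorted(chosen)


def stable_features(labels, select_features, n_trials=200, fraction=0.8,
                    min_hits=10, seed=0):
    """Run the selector on `n_trials` stratified subsamples and keep the
    features selected in at least `min_hits` trials."""
    rng = random.Random(seed)
    hits = Counter()
    for _ in range(n_trials):
        subset = stratified_subsample(labels, fraction, rng)
        # Hypothetical stand-in for one Boruta analysis on this subsample.
        hits.update(set(select_features(subset)))
    kept = {feature for feature, count in hits.items() if count >= min_hits}
    return kept, hits


# Toy selector: always returns "geneB"; returns "noise" in the first 3 calls.
calls = [0]

def toy_selector(subset):
    calls[0] += 1
    return ["geneB", "noise"] if calls[0] <= 3 else ["geneB"]


labels = [1] * 10 + [0] * 40  # 1 = ecDNA(+), 0 = ecDNA(-); made-up sizes
kept, hits = stable_features(labels, toy_selector)
print(kept)  # {'geneB'}: "noise" was picked in only 3 of 200 trials
```

      Raising `min_hits` corresponds to the more stringent cut-offs explored in the robustness analysis.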

      The reviewer has a reasonable critique about why 10 (of 200) specifically, and why not fewer or more. Most genes are weak predictors by themselves. For example, RAE1, which is the top ranked gene, picked in all 200 Boruta trials, can only predict ecDNA status with poor recall for any meaningful precision.

      Author response image 1.

      Given the weakness of an individual gene as a classifier, its repeated selection in multiple Boruta trials is already a significant event. By requiring a gene to be picked in 5% of the trials (10/200), we were selecting a small, but more robust list of genes. However, to further explore the reviewer’s concerns, we also applied 8 other selection criteria ranging from 5 (of 200 Boruta trials) to 200 of 200 Boruta trials. See Figure below. The number of CorEx genes expectedly decreases. However, of the 187 GO terms that were enriched by 262 UP-genes using 10 of 200 Boruta trials as the selection criteria, 93 terms (49.7%) were enriched for each cut-off (see Author response image 2), and 155 terms (82.9%) were enriched in at least 5 of the 8 cut-off criteria. Given that the remaining analysis works on the hierarchy of GO terms and finds 4 GO-categories (Mitotic Cell Cycle, G1/S, G2/M; cell-division; DSB DNA Damage response; and the HOX Gene cluster) enriched by UP-regulated genes, those conclusions would hold regardless of the specific cut-off.

      Author response image 2.

      The number of GO terms that were enriched by DOWN-regulated genes is smaller, only 73, and falls rapidly for higher cut-offs, with 25 at a cut-off of 15. Therefore we see fewer terms enriched for more stringent cut-offs. However, they all support immune processes. These results do suggest that there are fewer genes that are consistently down-regulated in ecDNA(+) cancers, and expression change in a small number of genes may be sufficient to promote conditions for ecDNA.

      Finally, we note that in the final section we discuss the 65 most highly ranked genes with a harmonic mean rank <= 3. These 65 CorEx genes (or a member of their cluster) appear in each of 200 Boruta trials. Thus, their choice is also not dependent on the cut-off of 10 in 200. In summary, the conclusions of the paper do not depend upon the specific cut-off of 10 in 200 trials.

      We have added the figure as a supplemental figure and have added the following text to the manuscript on pages 17 and 18.

      “Any CorEx gene is either a Core gene that was selected as a feature in at least 5% of 200 Boruta trials, or be highly co-expressed with a Core gene. Because the selection criterion of 5% is arbitrary, we also tested robustness with 8 other cut-offs ranging from 5-of-200 to 200-of-200 Boruta trials. The number of CorEx genes expectedly decreases with more stringent cut-offs. However, of the 187 GO terms that were enriched by 262 CorEx UP-genes using 10 of 200 Boruta trials as the selection criteria, 93 terms (49.7%) were enriched for each cut-off (Fig. S9), and 155 terms (82.9%) were enriched in at least 5 of the 8 cut-offs. Given that our subsequent analyses utilized the hierarchy of GO terms and identified 4 GO-categories enriched by UP-regulated genes, the conclusions would hold regardless of the specific cut-off.”

      (b) The approach of combining genes with clusters is arbitrary. Why not start with clusters and evaluate each cluster (using some gene set summary score) for their ability to discriminate? Ultimately, one needs additional information to disambiguate correlated genes (i.e. in a coexpression cluster) in terms of causality.

      Response. In general, the approach proposed by the reviewer is reasonable. However, we did consider that possibility and found that our approach was easier to implement. For example, if we clustered first, we would have the challenge of choosing the correct set of clusters. Also, the Boruta analysis would become very difficult while dealing with clusters (e.g., how to define false features?). We tested other methods of picking genes that were suggested by other reviewers such as generalized linear models. They turned out not to be as predictive of ecDNA status, as described later in the response. Finally, we performed many experiments to ensure the validity of the clustering. Specifically, we had the following text in the paper:

      “Notably, among the 354 clusters, only 2 clusters (with 14 total genes) did not contain any Core genes. As most genes do not have completely identical expression patterns, we would expect one gene to be consistently picked as a Boruta gene over another co-expressed gene. Consistent with this hypothesis, most (344/354) clusters contained only 1 or 2 Core genes (Fig. 1c). When selecting clusters that contained at least 1 Core and 1 co-expressed gene, 53 of 71 clusters contained 1 to 3 Core genes (Fig. S1b), confirming that a few genes per co-expressed cluster provide sufficient predictive value, but other co-expressed genes might still play an important functional role in maintaining ecDNA(+) status.”

      These experiments suggest that the genes found by extending the Core genes through clustering do not radically change the Core genes, but only enhance the set.

      (c) The cross-validation procedure is not clear at all. There is a mention of 80-20 split but exactly how/if the evaluation is done on the 20% is muddled. The way precision-recall procedure is also a bit convoluted - why not simply use the area under the PR curve?

      Response. We apologize if the method was unclear. We have rewritten the methods part to make things clearer. As a high level point, there are two places where we use the same 80-20 split, and that resulted in some confusion. We start by randomly picking 80% of the ecDNA(+) and 80% of ecDNA(-) samples to create an 80-20 split of all samples. This procedure is repeated to generate 200 80-20 split data-sets. These data-sets are hereafter called 200 training and test samples.

      In the first usage, we use only the ‘training’ part of the 200 samples. We apply Boruta to each training set, and this helps us select the Core genes, which are then expanded to form the CorEx set. At this point, the CorEx genes are frozen for analysis in the rest of the paper. One question that we subsequently answer is what is the predictive power of the CorEx genes in determining if the sample is ecDNA(+) or ecDNA(-)? We also compare the predictive performance of CorEx genes relative to (a) Core genes, (b) LFC genes, and (c) random genes. In the revised manuscript, we have added another list of 3,012 genes selected using a single gene generalized linear model (GLM) for feature prediction. To make these comparisons, we utilized the same 200 training and test data-sets as before. In each test, we trained a random forest classifier on the training set and predicted on the ‘test’ set, for each of the 5 gene lists. This provided a uniform and fair method for testing which of the 5 gene lists was the better predictor of ecDNA status.

      The precision recall values are plotted in Fig. 2b (also included below). We note that none of the gene lists was a great predictor of ecDNA status of a sample. However, the CorEx and Core genes were significantly more predictive than GLM, LFC, and random genes. The predictive power of GLM genes was very similar to LFC, and better than random.

      For each of these 200 tests, we obtained a separate area under the precision-recall curve number for each of the gene-sets. To address the reviewer’s comments regarding a single number, we reported the average of the AUPRC for each of the gene-sets in the revision. The mean AUPRC values were added to the manuscript and are described here as well:

      Core_408_genes: 0.495

      CorEx_643_genes: 0.48

      Random_643_genes: 0.36

      top_lfc_643_genes: 0.429

      GLM_R_3012_genes: 0.426
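      As a hedged illustration of how a per-split AUPRC can be computed and then averaged, the sketch below uses the step-wise "average precision" estimate of the area under the precision-recall curve. Whether the manuscript used this estimator or another interpolation is not stated, so treat the choice as an assumption; the labels and scores are toy values, and tied scores are not handled specially.

```python
def average_precision(y_true, scores):
    """Step-wise estimate of the area under the precision-recall curve.

    Samples are ranked by decreasing score; precision is accumulated at
    every rank where a true positive is recovered.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("at least one positive sample is required")
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    true_pos = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            true_pos += 1
            total += true_pos / rank  # precision at this recall step
    return total / n_pos

def mean_auprc(per_split_values):
    """Average the per-split AUPRC values into a single summary number."""
    return sum(per_split_values) / len(per_split_values)

# Toy example: 1 = ecDNA(+), 0 = ecDNA(-); scores are made up.
y_true = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
ap = average_precision(y_true, scores)
```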

      We also changed Figure 2b to show box-plots showing distribution of recall values for specific precision windows instead of maximum recall. For ease of checking, the figure is reproduced below.

      Author response image 3.

      (d) The claim is that Boruta genes are different from differentially expressed genes but the differential expression seems to be estimated without regards to cancer type, which would certainly be highly biased and misleading. Why not do a simple regression of gene expression by ecDNA status, cancer type and select the genes that show significant coefficient for ecDNA status?

      Response. As requested by the reviewer, and in the more detailed questions below, we added an alternative model with a generalized linear model (GLM) analysis that controlled for tumor subtype. The method itself is described in the Methods section and pasted below. The GLM genes were tested along with the LFC, CorEx, Core genes as described in response to the previous question, and those results are now presented in Figure 2b and on pages 6 and 7 of the revised manuscript.

      “We tested each of 16,309 genes independently in a separate logistic regression model using the glm() function in the R stats package (v4.2.0), and retained genes that were significant (p-value < 0.01). Specifically, the model was defined as glm(y ~ g_j + t, data = M, family = binomial(link = 'logit')), where $y$ is the response vector with $y_i = 1$ if sample $i \in \{1, \ldots, 870\}$ is ecDNA(+) and $y_i = 0$ otherwise, $g_j$ is the vector of expression values for gene $j \in \{1, \ldots, 16309\}$ in samples $i \in \{1, \ldots, 870\}$, $t$ is the covariate vector representing the tumor subtypes of samples $i \in \{1, \ldots, 870\}$, and $M$ is the data matrix containing values of gene expression, tumor subtype, and ecDNA status for all samples. The binomial logistic regression described above is formulated as $\ln\left(\frac{p}{1-p}\right) = \beta_0 + \sum_{k=1}^{2} \beta_k X_k$, where $p$ is the probability that the dependent variable $y$ is 1, $X_k$ are the independent variables, and $\beta_k$ are the coefficients of the model. In this case, $k=1$ represents the independent variable gene $j$ and $k=2$ represents the tumor subtype covariate $t$. Of the 16,309 genes tested independently, 3,012 genes were significant at p-value < 0.01.”

      (C) After identifying key features (which the authors inappropriately imply to be causal) they perform a series of enrichment/correlative analyses.

      Response. We have reviewed the document to ensure that we did not use the word ‘causal.’ If the reviewer can point to specific text, we are happy to change the phrasing.

      (a) It is known that ecDNA status associates with poor survival, and so do cell cycle related signals. Then the association between Boruta genes and those processes is entirely expected. Is it not? The same goes for downregulation of immune processes.

      Response. We agree with the reviewer that cell cycle related signals and immune related signals are associated with low survival, and so is ecDNA. However, many cellular processes could be associated with low survival (including, for example, metabolic processes, protein and DNA biosynthesis, etc.). The unexpected part is that there appear to be only 4 major processes that are upregulated in ecDNA(+) cancers relative to ecDNA(-) cancers, and only one (immune response) that is downregulated.

      (b) The association with DSB specifically is interesting. Further analysis or discussion of why this should be would strengthen the work.

      Response. We thank the reviewer for their comment, and agree with their perspective. Note that we devoted a fair amount of text to analysis of DSB pathways. Specifically, we parsed the 4 main pathways in Figure 3b, and found our data to suggest that many genes in the classical nonhomologous end joining repair pathway are down-regulated in ecDNA(+) samples relative to ecDNA(-) samples. In contrast, Alternative end-joining and homology directed repair pathways are upregulated. This is a surprising result because c-NHEJ is considered to be an important mechanism of DSB repair. We have some lines in the discussion that address this:

      “The DNA damage genes are broadly up-regulated in ecDNA(+) samples, especially in double-strand break repair. Within this broad category of mechanisms, our analysis suggests that alternative DSB repair pathways such as Alt-EJ are preferred relative to classical NHEJ. This is consistent with previous observations of small microhomologies at breakpoint junctions, and has important implications in therapeutic selection that will need to be validated in future experimental studies. We note, however, the microhomology analyses typically study breakpoint junctions, and might ignore double-strand breaks in non-junctional sequences which could be observed, for example at replication-transcription junctions.”

      We note that additional experimental work to corroborate these findings is a significant effort and will be part of ongoing research in our collaborators’ laboratories.

      (c) On page 15, second paragraph, when providing the up versus down CorEx genes, please also provide up versus down for non-CorEx genes as well to get a sense of magnitude.

      Response. We thank the reviewer for the comment. We note that Supplementary Table S15 has the complete contingency tables as well as the Fisher exact test statistic for all categories. For the specific categories mentioned in the paper, the contingency tables are reproduced below. As we are citing Table S15 (containing all numbers and the p-values) in the main text, we thought it was better to leave the text as it was.

      Category: Inflammation (p-value: 0.005)

      CorEx: 18 (UP), 76 (DOWN)

      Non-CorEx: 325 (UP), 657 (DOWN)

      Category: Leukocyte migration and chemotaxis (p-value: 0.03)

      CorEx: 13 (UP), 49 (DOWN)

      Non-CorEx: 213 (UP), 410 (DOWN)

      Category: Lymphocyte activation (p-value: 0.0075)

      CorEx: 23 (UP), 75 (DOWN)

      Non-CorEx: 334 (UP), 560 (DOWN)

      Category: Cytokine production (p-value: 0.117)

      CorEx: 6 (UP), 28 (DOWN)

      Non-CorEx: 93 (UP), 208 (DOWN)
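      For readers who wish to re-derive p-values from the counts listed above, a minimal standard-library sketch of a Fisher exact test on a 2x2 table follows. It assumes a two-sided test with rows CorEx / Non-CorEx and columns UP / DOWN; the response does not state the sidedness used for Table S15, so the values it produces may differ slightly from the reported ones.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-7))

# Counts copied from the lists above: (CorEx UP, CorEx DOWN, Non-CorEx UP, Non-CorEx DOWN).
tables = {
    "Inflammation": (18, 76, 325, 657),
    "Leukocyte migration and chemotaxis": (13, 49, 213, 410),
    "Lymphocyte activation": (23, 75, 334, 560),
    "Cytokine production": (6, 28, 93, 208),
}
p_values = {name: fisher_exact_two_sided(*counts) for name, counts in tables.items()}
```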

      (d) The finding that Boruta genes are associated with high mutation burden is intriguing because in general mutation burden is associated with better survival and immunotherapy response. This counter-intuitive result should be scrutinized more to strengthen the work.

      Response. We agree with the reviewer that it is an intriguing observation. However, we are cautious in our interpretation. This is for the following reasons (all mentioned in the text):

      (1) The total mutation burden was significantly higher in ecDNA(+) samples relative to ecDNA(-) samples (Fig. 5a). However, when controlling for cancer type, only glioblastoma, low-grade gliomas, and uterine corpus endometrial carcinoma continued to show differential total mutational burden (Fig. S7b).

      (2) We tested if specific genes were differentially mutated between the two classes (Fig. 5b). For deleterious/high-impact mutations, TP53 was the only gene whose mutational patterns were significantly higher in ecDNA(+) compared to ecDNA(-) (OR 2.67, Bonferroni adjusted p-value 4.22e-07). BRAF mutations, however, were more common in ecDNA(-) samples and were significant to an adjusted p-value < 0.1 (OR 0.27).

      (3) In response to another reviewer’s comment, we also tested correlation with variant allele frequencies, and did not find any significant correlation except for TP53. We decided not to include that result in the paper.

      These tissue specific cases might be confounding the main observation, but we have placed all of them together so that the reader can gain a better understanding. It is worth noting that the correlation between high TMB and immunotherapy response is also now controversial, and perhaps not true for all cancer types. See for example (https://www.annalsofoncology.org/article/S0923-7534(21)00123-X/fulltext), which suggests that this relationship is not true for Glioma, and in Glioma (which is ecDNA enriched), higher TMB is associated with worse immunotherapy response. Our results are consistent with that finding. We have modified the discussion paragraph to better reflect this.

      “Mutation data alone does not provide as clear a picture of the genes involved in ecDNA maintenance. We did observe that the total mutation burden (TMB) was higher in ecDNA(+) samples. However, that relationship is much less clear after controlling for cancer type. High TMB has been positively correlated with sensitivity to immunotherapy [52], and better patient outcomes; however, the gene expression patterns suggest that immunomodulatory genes are downregulated in ecDNA(+) samples, and patients with ecDNA(+) tumors have worse outcomes [2]. Notably, other results have suggested that the correlation between TMB and response to immunotherapy is not uniform, and it can vary across different tumor subtypes [53]. Specifically, our data is consistent with previous results which showed that gliomas with high TMB have worse response to immunotherapy relative to gliomas with low TMB [53]. In general, no collection of gene mutations was predictive of ecDNA status, although mutations in TP53 were more likely in ecDNA(+) samples, and perhaps are an important driver for ecDNA formation [5].”

      (e) On page 17 "12 of the 47 genes not specifically enriching any known GO biological Process" is confusing. How can individual gene enrich for a GO process?

      Response. We agree that the statement was incorrectly phrased. We have changed it to state that “Only 12 of the 47 genes were not included in the gene sets of any enriched GO term.”

      Reviewer #2 (Public Review):

      In their manuscript entitled "Transcriptional immune suppression and upregulation of double stranded DNA damage and repair repertoires in ecDNA-containing tumors" Lin et al. describe an important study on the transcriptional programs associated with the presence of extrachromosomal DNA in a cohort of 870 cancers of different origin. The authors find that compared to cancers lacking such amplifications, ecDNA+ cancers express higher levels of DNA damage repair-associated genes, but lower levels of immune-related gene programs.

      This work is very timely and its findings have the potential to be very impactful, as the transcriptional context differences between ecDNA+ and ecDNA- cancers are currently largely unknown. The observation that immune programs are downregulated in ecDNA+ cancers may initiate new preclinical and translational studies that impact the way ecDNA+ cancers are treated in the future. Thus, this study has important theoretical implications that have the potential to substantially advance our understanding of ecDNA+ cancers.

      Strengths

      The authors provide compelling evidence for their conclusions based on large patient datasets. The methods they used and analyses are rigorous.

      Weaknesses

      The biological interpretation of the data remains observational. The direct implication of these genes in ecDNA(+) tumors is not tested experimentally.

      Response. We agree with the reviewer that experimental tests would be ideal. Towards that, there are some challenges. The immune system genes cannot be tested in cell line models as they need a tumor microenvironment. Tests of DSB repair mechanisms and cell cycle control can be performed in cell-lines, but not with the TCGA samples which are not available. Some of our collaborators are actively working on these topics, but that extensive experimental work is beyond the scope of this paper.

      Reviewer #3 (Public Review):

      Summary:

      Using a combination of approaches, including automated feature selection and hierarchical clustering, the author identified a set of genes persistently associated with extrachromosomal DNA (ecDNA) presence across cancer types. The authors further validated the gene set identified using gene ontology enrichment analysis and identified that upregulated genes in extrachromosomal DNA-containing tumors are enriched in biological processes like DNA damage and cell proliferation, whereas downregulated genes are enriched in immune response processes.

      Major comments:

      (1) The authors presented a solid comparative analysis of ecDNA-containing and ecDNA-free tumors. An established automated feature selection approach, Boruta, was used to select differentially expressed genes (DEG) in ecDNA(+) and ecDNA(-) TCGA tumor samples, and the iterative selection process and two-tier multiple hypothesis testing ensured the selection of reliable DEGs. The author showed that the DEG selected using Boruta has stronger predictive power than genes with top log-fold changes.

      (2) The author performed a thorough interpretation of the findings with GO enrichment analysis of biological processes enriched in the identified DEG set, and presented interesting findings, including the enrichment in DNA damage process among the genes upregulated in ecDNA(+) tumors.

      (3) Overall, the authors achieved their aims with solid data mining and analysis approaches applied to public data tumor data sets.

      (4) While it may not be the scope of this study, it will be interesting to at least have some justification for choosing Boruta over other feature selection methods, such as Recursive Feature Elimination (RFE) and backward stepwise selection.

      Response. We agree with the reviewer that some other feature selection methods could work just as well, and note that the Boruta analysis is not our creation, but a published feature selection method (Kursa, Journal of Statistical Software September 2010, Volume 36, Issue 11). We use Boruta to identify relevant genes, but the bulk of the paper is devoted to understanding the biological processes driven by that gene selection. Even if we had chosen another method that performed slightly better, it likely would not change the main conclusions. However, to address the reviewer’s concerns about over-reliance on one method, we added a different gene list created by a generalized linear model analysis, with the goal of checking if the expression of a gene could predict the ecDNA status of the sample after controlling for tumor subtype. Thus, we tested 5 different gene lists in terms of their power in predicting ecDNA. While none of the lists is a great predictor of ecDNA status, the Core and CorEx gene lists are significantly better than the other lists. The Figure below replaces the previous Figure panels 2b and 2c.

      Author response image 4.

      (1) The authors showed that DESEQ-selected DEGs with top log-fold changes have less strong predictive power and speculated that this may be due to the fact that genes with top log-fold changes (LFC) are confined only to a small subset of samples. It will be interesting to select DEGs with top log-fold changes after first partitioning the tumor samples. For example, randomly partition the tumor samples, identify the DEGs with top LFC, combine the DEGs identified from each partition, then evaluate the predictive power of these DEGs against the Boruta-selected DEGs.

      Response. This is a great comment. We added a generalized linear model test for selecting genes whose expression is predictive of ecDNA status. The GLM list described above uses a standard methodology (Analysis of Variance) and controls for tumor type as a covariate. Its predictive performance is only slightly better than that of the Top-|LFC| genes, while improving over a random gene set.

      (2) While the authors showed that the presence of mutations was not able to classify ecDNA(+) and (-) tumor samples, it will be interesting to see if variant allele frequencies of the genes containing these mutations have predictive power.

      Response. This is a great suggestion. To address the reviewer’s question, we used allelic counts (REFs and ALTs) information from the MC3 variant callset, and calculated allele frequencies of all variants from samples where ecDNA status was available. Next, we conducted a Wilcoxon rank-sum test between VAFs of the ecDNA(+) group and VAFs of the ecDNA(-) group for every mutated gene. We found 1,073 genes with p<0.05, but among them, only TP53 passed the multiple testing correction (padj<0.05, Benjamini-Hochberg). As the results are identical to the tests based solely on presence of mutations, we decided not to include this data.
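      The two bookkeeping steps in this analysis, computing a variant allele frequency from REF/ALT counts and applying the Benjamini-Hochberg correction across genes, can be sketched as below. This is an illustrative sketch under stated assumptions rather than the authors' pipeline: the function names are hypothetical, the per-gene p-values are assumed to come from a separate rank-sum test that is not shown, and the numbers are toy values.

```python
def vaf(ref_count, alt_count):
    """Variant allele frequency: ALT reads divided by total reads at the site."""
    total = ref_count + alt_count
    return alt_count / total if total else 0.0

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy per-gene p-values (made up); a gene passes if its adjusted value is < 0.05.
raw_p = [0.01, 0.04, 0.03, 0.005]
adj_p = benjamini_hochberg(raw_p)
passing = [i for i, p in enumerate(adj_p) if p < 0.05]
```

      With many genes tested, a raw p < 0.05 is expected for a sizeable number of genes by chance alone, which is why far fewer genes survive the adjusted threshold.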

      Reviewer #1 (Recommendations For The Authors):

      (A) The presentation should be substantially streamlined.

      (B) Preferably use a more intuitive simpler ML approach with fewer parameters to make it more credible. Because there are relatively few samples across numerous cancer types with greater variability in representation, a simpler procedure with transparent controls will be more convincing.

      Response. We accept the reviewer’s criticism in that other statistical or ML approaches could potentially have been used, and that some are simpler. However, the test used here directly addresses the question: Find a collection of genes whose expression value is predictive of ecDNA status in the sample. Because the underlying method in the Boruta analysis uses random forests, it can test predictive power without relying on a linearity assumption implicit in other methods. In this revision, we also compare against a Generalized Linear Model (regression analysis) and show that it is less suited to the specific task above. We address the reviewer concerns about specific parameter choices by showing robustness to the specific parameter. All details are provided in the initial questions, and in the revised manuscript.

      (C) Avoid using any term implying causality unless you can bring in direct experimental evidence (e.g. mutagenesis experiment followed by ecDNA measurement). In some places you use the word 'maintain ecDNA' and in other places 'ecDNA impact'. But these are all associations. How can you distinguish causal genes from downstream effects without additional data?

      Response. We note that the word causal does not appear anywhere in the manuscript, and was not intended. Additionally, we have revised the manuscript and are open to specific changes requested by the reviewer or the editors.

      (D) Along these lines, if Boruta genes are indeed causal, one would expect Boruta-Up genes to be amplified more than expected in the ecDNA+; converse for Boruta-down genes.

      Response. We did not understand the reviewer’s question. By “amplified,” if the reviewer means “amplification of transcript level,” then that is exactly what the Boruta analysis is showing. Specifically, for each gene, we have the ability to pick a transcript level cut-off ‘t’ so that samples in which the expression is higher than t are more likely to be ecDNA(+). However, we are not claiming that there is causality, just that the transcript level is (weakly) predictive of the ecDNA status of the sample.

      (E) A strawman control should be a simple regression-based gene identification that controls for ecDNA status and cancer type.

      Response. We agree that this was a very good suggestion. In the revision, we have applied a GLM, which controls for tumor type. Thus, we have 5 gene-lists (including the Core and CorEx genes). As described in the revised manuscript but also in response to the main comments above, none of the lists are a great predictor. However, the CorEx and Core genes are significantly better at predicting ecDNA status of a sample.

      Reviewer #2 (Recommendations For The Authors):

      Comments

      (1) The analysis hinges on a classification of tumors into ecDNA(+) and ecDNA(-) using AmpliconClassifier. It would be good to know how robust the outcomes are with respect to the performance of AmpliconClassifier - how many false positives and negatives will AmpliconClassifier generate on this dataset and how would this influence the CorEx genes?

      Response. This is a very reasonable request. AmpliconArchitect (AA) has been extensively tested on established cell lines for its ability to predict ecDNA status, and this information is published in multiple venues, including Kim, Nature Genetics 2020, which reports a precision of 85% at a recall of 83%. For completeness, we have reproduced the relevant plot and text from that paper here, but are not including them in the manuscript.

      “To evaluate the accuracy of the AmpliconArchitect predictions, we analyzed whole-genome sequencing data from a panel of 44 cancer cell lines, and examined tumor cells in metaphase. We used 35 unique fluorescence in-situ hybridization (FISH) probes in combination with matched centromeric probes (81 distinct “cell-line, probe” combinations) to determine the intranuclear location of amplicons (Supplementary Table 2). Following automated analysis >1,600 images, we observed that 85% of amplicons characterized as ‘Circular’ by whole genome sequencing profile demonstrated an extrachromosomal fluorescent signal, representing the positive predictive value. Of the amplicons corresponding to extrachromosomally located FISH probes, 83% were classified as Circular, representing the sensitivity (Extended Data Fig. 1A).”

      Author response image 5.

      (2) It is unclear why genes are labeled Boruta genes when they are present in 10 out of 200 runs, this seems like an unexpectedly low number. How did the authors arrive at this number? Do the authors have any ground truth to estimate how well Boruta works in this setting and implementation?

      Response. This is a great question and was asked by another reviewer as well. Given the weakness of an individual gene as a classifier, its repeated selection in multiple Boruta trials is already a significant event. By requiring a gene to be picked in 5% of the trials (10/200), we were selecting a small, but more robust list of genes. However, to further explore the reviewer’s concerns, we also applied 8 other selection criteria ranging from 5 of 200 to 200 of 200 Boruta trials. See Figure below. The number of CorEx genes expectedly decreases with increasing stringency. However, of the 187 GO terms that were enriched by UP-genes, 93 terms (50%) were enriched regardless of the cut-off (see Figure below), and 153 terms (82%) were enriched in at least 5 of the 8 cut-offs. Given that the remaining analysis works on the hierarchy of GO terms and finds 4 GO-categories (Mitotic Cell Cycle, G1/S, G2/M; cell-division; DSB DNA Damage response; and the HOX Gene cluster) enriched by UP-regulated genes, those conclusions would hold regardless of the specific cut-off.

      Author response image 6.

      The number of GO terms that were enriched by DOWN-regulated genes is smaller, only 73, and falls rapidly for higher cut-offs, with 25 at a cut-off of 15. Therefore we see fewer terms enriched for more stringent cut-offs. However, they all support immune processes. These results do suggest that there are fewer genes that are consistently down-regulated in ecDNA(+) cancers, and expression change in a small number of genes may be sufficient to promote conditions for ecDNA.

      We have added the figure as a supplemental figure and have added the following text to the manuscript on pages 17 and 18.

      “Any CorEx gene is either a Core gene that was selected as a feature in at least 5% of 200 Boruta trials, or is highly co-expressed with a Core gene. Because the selection criterion of 5% is arbitrary, we also tested robustness with 8 other cut-offs ranging from 5-of-200 to 200-of-200 Boruta trials. The number of CorEx genes expectedly decreases with more stringent cut-offs.

      However, of the 187 GO terms that were enriched by 262 CorEx UP-genes using 10 of 200 Boruta trials as the selection criteria, 93 terms (49.7%) were enriched for each cut-off (Fig. S9), and 155 terms (82.9%) were enriched in at least 5 of the 8 cut-offs. Given that our subsequent analyses utilized the hierarchy of GO terms and identified 4 GO-categories enriched by UP-regulated genes, the conclusions would hold regardless of the specific cut-off.”
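      The cut-off logic discussed here can be illustrated with a short sketch: count how often each gene is selected across trials, then see how the selected set shrinks as the required number of trials grows. The gene names and trial outcomes below are made up, and the helper names are hypothetical; the sketch covers only the counting step, not Boruta itself or the GO enrichment.

```python
from collections import Counter

def selection_counts(trials):
    """Count the number of trials in which each gene was selected."""
    counts = Counter()
    for selected in trials:
        counts.update(set(selected))  # a gene counts once per trial
    return counts

def genes_at_cutoff(counts, min_trials):
    """Genes selected in at least min_trials trials."""
    return {gene for gene, n in counts.items() if n >= min_trials}

# Toy data: 4 trials over hypothetical genes A, B, C.
trials = [["A", "B"], ["A"], ["A", "C"], ["B"]]
counts = selection_counts(trials)

# Stricter cut-offs can only remove genes, never add them.
by_cutoff = {k: genes_at_cutoff(counts, k) for k in (1, 2, 3)}
stable = set.intersection(*by_cutoff.values())
```

      The same intersection idea applies one level up: a GO term is robust if it is enriched for the gene set obtained at every cut-off.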

      (3) Authors extend the core gene set with co-expressed genes, arguing that "gene C" would not add predictive power in addition to "gene B" and is therefore not identified as a Boruta gene. However, from its description in the manuscript (summarized: "Boruta [...] selects the highest feature importance score, s, of shadow features as a cut off, and returns features with a higher score than s."), it isn't immediately obvious to me why Boruta would not return both genes B and C. Maybe the authors could explain this better.

      Response. Consider the following example.

      (1) Consider 100 ecDNA(+) and 100 ecDNA(-) samples. Let the expression levels of genes B and C in the data-sets be as described in the figure below; y-axis is the gene expression, and x-axis is just a listing of all samples, with green color denoting ecDNA(+) samples and orange color denoting ecDNA(-) samples.

      Author response image 7.

      (2) Then, if we choose gene B and a transcript level of 1.25, we have a perfect prediction of ecDNA status because all samples where gene B has a transcript level higher than 1.25 are ecDNA(+) and otherwise they are ecDNA(-). Similarly, using Gene C, we can get perfect predictions. Thus, when Boruta has to select a gene, it will pick either Gene B or Gene C, because picking both will not improve prediction. We can therefore use Boruta to pick one gene, and then co-expression clustering to pick the other gene.
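      The redundancy argument can be checked with a toy simulation. The sketch below assumes made-up expression values in which gene C is a scaled copy of gene B and the two classes are perfectly separated; only the 1.25 cut-off for gene B comes from the example above, and the cut-off for gene C is simply its scaled equivalent. It does not run Boruta; it only shows that a second, co-expressed gene adds no accuracy once the first is used.

```python
import random

def threshold_accuracy(values, labels, cutoff):
    """Accuracy of the rule 'predict ecDNA(+) when expression > cutoff'."""
    predictions = [1 if v > cutoff else 0 for v in values]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

rng = random.Random(0)
labels = [1] * 100 + [0] * 100  # 100 ecDNA(+) and 100 ecDNA(-) samples

# Gene B: high in ecDNA(+), low in ecDNA(-) (toy values).
gene_b = [rng.uniform(1.5, 3.0) for _ in range(100)] + \
         [rng.uniform(0.0, 1.0) for _ in range(100)]
# Gene C: perfectly co-expressed with gene B (a scaled copy).
gene_c = [3.0 * b for b in gene_b]

acc_b = threshold_accuracy(gene_b, labels, 1.25)
acc_c = threshold_accuracy(gene_c, labels, 3.75)

# Using both genes together cannot beat an already perfect single-gene rule.
both = [1 if (b > 1.25 and c > 3.75) else 0 for b, c in zip(gene_b, gene_c)]
acc_both = sum(p == y for p, y in zip(both, labels)) / len(labels)
```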

      As an example, cluster #3 consists of 21 genes that were up-regulated in ecDNA(+) samples and enriched in cell-cycle related biological processes (Table S3). While these genes were expressed similarly in ecDNA(+) samples, and separately, in ecDNA(-) samples, out of the 21 genes, only 9 genes were selected in at least 10 out of 200 Boruta trials (i.e., Core genes). Of the 12 remaining genes (i.e., CorEx genes), 8 genes were not selected by the Boruta method at all, 3 genes were selected in less than 5 out of 200 Boruta trials, and 1 gene was selected in 9 out of 200 Boruta trials.

      Author response image 8.

      (4) In Fig 2a, I would like to see the variability of the precision and recall in the main text, not only the maximum values. Authors could plot mean + standard deviation for precision and recall separately, or use S2a/b.

      Response. We have replaced Figures 2b and 2c with a combined figure (Fig. 2b) that gives a box-plot describing the distribution of recall values for 5 gene lists: four from the original manuscript, and another gene list created using a Generalized Linear Model (GLM).

      Author response image 9.

      (5) Since the authors analyze bulk RNA, the gene expression signatures they notice could, in principle, originate from non-tumor cells as well. I do not believe this is the case, however, the paper would be strengthened by an analysis that shows that the difference in expression patterns of the Corex genes between ecDNA(+) and ecDNA(-)-samples does come from tumor cells. One way of showing this would be by using single-cell mRNA-sequencing data, and another way of showing this would be to show that Corex gene-expression correlates with tumor purity in bulk samples.

      Response. The reviewer is correct. Unfortunately, our analysis requires data with whole-genome sequencing (WGS) for ecDNA prediction, as well as RNA-seq for transcriptome profiling. The TCGA data-set is the only available data-set with a significant number of samples that includes both WGS and RNA-seq. They have not made tissue samples available for scRNA analysis, to our knowledge. The reviewer raises an important question regarding purity, but testing if CorEx gene expression correlates with tumor purity would require a large range of purity values, something that scientists would avoid when collecting samples.

      However, the presence of non-cancer tissue (impurity) could reduce sensitivity of ecDNA detection, and therefore, change the results. To better investigate this, we started with a publication that investigated multiple tumor purity metrics and devised a composite score (CPE; Aran et al., 2015). Using their composite tumor purity, we find that ecDNA(-) samples have slightly lower purity than ecDNA(+) samples (p-value 0.0036; Fig. S2a).

      This result is not surprising because one would expect lower detection of ecDNA in less pure samples. The presence of undetected ecDNA in ecDNA(-) samples would confound the results by reducing the discriminating power of genes, but would not give false results. To test this, we measured the expression directionality in CorEx genes in all samples versus samples which had a high tumor purity (CPE 0.8). The results suggest that the p-values of directionality in the pure samples were highly correlated with the expression data from all samples (Fig. S2b).

      Author response image 10.

      (6) The biological interpretation of the data remains a bit too observational. Can the authors offer an interpretation of the enriched GO terms? And are any of these genes already implicated in ecDNA(+) tumors?

      Response. To answer the second question first, prior to our study, the focus was on genes that were amplified on ecDNA. Indeed many oncogenes known to be amplified in cancer are in fact amplified on ecDNA (Turner, Nature 2017; Kim, Nature Genetics 2020). This study is unique in that it identifies genes whose expression values are predictive of ecDNA(+) status. The Figure below lists 24 genes most frequently amplified on ecDNA from Kim, Nature Genetics 2020. With the exception of EGFR and CDK4, none of these 24 genes was included in the list of the 65 genes reported by us as the most frequently selected genes in the Boruta trials (lowest harmonic rank). Thus, most persistent CorEx genes do not lie on ecDNA. However, they all play important roles in biological processes relevant to cancer pathology including Immune Response, Mitotic Cell Cycle, Cell Division, and DSB repair. We agree with the reviewer that the results are observational (although statistically significant in populations), and some of our collaborators are actively working to experimentally validate some of these genes. The experimental work, however, is beyond the scope of this paper.

      We have added the following statement to the manuscript. “Notably, of the 24 genes most frequently expressed on ecDNA [2], only EGFR and CDK4 were included in the list of 65 genes, suggesting that the most persistent CorEx genes do not themselves appear frequently on ecDNA.”

      Author response image 11.

      Reviewer #3 (Recommendations For The Authors):

      Minor comments:

      (1) The authors performed gene ontology enrichment test but referred to it as gene set enrichment analysis. Usually gene set enrichment analysis does not refer to Fisher's exact test-based analysis but rather the one described in Subramanian et al 2005. The term correction should be made to avoid confusion.

      Response. We have rephrased text in the manuscript to prevent confusion between enrichment analysis on gene sets using a one-sided Fisher’s exact test and the Gene Set Enrichment Analysis (GSEA) method that exists as software. We have also revised the header in the methods section from “Gene set enrichment analysis” to “Gene Ontology (GO) enrichment analysis”.

      (2) A couple of figures could use more detailed labels and captions. In Figure 2c, it is unclear what the numbers 100 and 54 right next to the Cliff's Delta heatmap indicate. In Figures 3a and 4a, it is not immediately clear what the barplot on top of the heatmap indicates and there is no label for the y-axis.

      Response. These are good suggestions, and we have added descriptions to the figure captions.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Chen and colleagues first compared the cartilage tissues collected from OA and HA patients using histology and immunostaining. Then, a genome-wide DNA methylation analysis was performed, which informed the changes of a novel gene, TNXB. IHC confirmed that TNXB has a lower expression level in HA cartilage than OA. Next, the authors demonstrated that TNXB levels were reduced in the HA animal model, and intraarticular injection of AAV carrying TNXB siRNA induced cartilage degradation and promoted chondrocyte apoptosis. Based on KEGG enrichment, histopathological analysis, and western blot, the authors also showed the relationship between TNXB and AKT phosphorylation. Lastly, AKT agonist, specifically SC79 in this study, was shown to partially rescue the changes of in vitro-cultured chondrocytes induced by Tnxb knock-down. Overall, this is an interesting study and provided sufficient data to support their conclusion.

      Strengths:

      (1) Both human and mouse samples were examined.

      (2) The HA model was used.

      (3) Genome-wide DNA methylation analysis was performed.

      Weaknesses:

      (1) In some experiments, the selection of the control groups was not ideal.

      Thank you for your comments. The reviewer raised concerns about using human OA cartilage, instead of healthy cartilage, as the control. This is an important detail that we did not describe in the previous version. We have added our explanation to the revised Methods.

      (2) More details on analyzing methods and information on replicates need to be included.

      We greatly appreciate your careful review and helpful suggestions. We have added detailed information to our revised draft.

      (3) Discussion can be improved by comparing findings to other relevant studies.

      We thank the reviewer very much for the opportunity to improve our manuscript. We have improved the Discussion as the reviewer suggested in Recommendation 13.

      (4) The use of transgenic mice with conditional Tnxb depletion can further define the physiological roles of Tnxb.

      Thank you for this valuable comment. We understand that conditional Tnxb-KO mice would be very helpful for studying the biological roles of Tnxb; they will be constructed and used in our future studies.

      Recommendations For the Authors:

      (1) Please add more information about HA such as incidence to highlight the importance of the study.

      We greatly appreciate your careful review and helpful suggestions. We have provided more information about the importance of HA study in revised Introduction. Please see lines 90-93 and 103-112.

      (2) Please justify the use of OA cartilage, instead of normal tissues, as the control.

      Thank you for your suggestion. We certainly would have liked to use healthy cartilage as the control, but it was extremely difficult to obtain enough control samples from healthy individuals. Despite the mechanistic and phenotypic differences between HA and OA, OA is often used as a “disease” control to reveal the characteristics of HA 1,2. Thus, we measured differences in cartilage degeneration and DNA methylation between HA and OA patients. We have provided this statement and the supporting evidence in the revised manuscript. Please see lines 144-145.

      (3) Please provide details of how to calculate the Cartilage wear area ratio in Figure 1D, and measure the positive staining area in Figure 1F.

      We apologize for the omission. Here, we provide detailed information on how the areas were calculated. In Figure 1D, we obtained the cartilage area ratio by calculating the ratio of the blue-stained cartilage area to the whole tissue area using ImageJ software. In Figure 1F, the area of positive staining was determined after secondary antibody treatment and color development with DAB chromogen (brown stain). We then obtained the positive staining area ratio by calculating the ratio of the positively stained area to the whole cartilage area using ImageJ software.
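
      The arithmetic behind both area ratios described above can be sketched as follows. The measurements themselves were made in ImageJ; this snippet only illustrates the calculation on binary masks, and the function name and the tiny example masks are hypothetical.

```python
def area_ratio(positive_mask, region_mask):
    """Fraction of region pixels that are also positively stained.

    Both masks are equally sized 2D lists of 0/1 values, e.g. obtained by
    thresholding the stain color and outlining the tissue or cartilage area.
    """
    region = 0
    positive = 0
    for pos_row, reg_row in zip(positive_mask, region_mask):
        for pos, reg in zip(pos_row, reg_row):
            if reg:
                region += 1
                if pos:
                    positive += 1
    if region == 0:
        raise ValueError("region mask is empty")
    return positive / region


# Hypothetical 2 x 4 example: 3 stained pixels inside a 6-pixel region.
stained = [[1, 1, 0, 0],
           [1, 0, 0, 1]]
cartilage = [[1, 1, 1, 0],
             [1, 1, 1, 0]]
print(area_ratio(stained, cartilage))
```

      Stained pixels outside the region mask are ignored, so the ratio always lies between 0 and 1.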

      (4) Please label the location of hemorrhagic ferruginous deposits in Figure 1.

      Thank you for your valuable suggestion. We have used black arrows to indicate hemorrhagic ferruginous deposits in revised Figure 1A.

      (5) Please define the meaning of "n" in all figure legends, such as technical or biological replicates.

      Thanks for your suggestion. We have defined the meaning of "n" in all figure legends in revised manuscript.

      (6) In Figure 3, please increase the font size of B, D, F, H, and J. The same applies to other figures.

      Thank you for your valuable suggestion. We have increased the font size of figures in our revised manuscript.

      (7) Line 327, "(Figure 1, F and G)" should be Figure 2F, G.

      Thank you for the reminder. We have corrected it in the revision. Please see line 347.

      (8) Reduced TNXB levels in human HA cartilage are one of the major findings in this study. Currently, only semi-quantitative IHC was used to draw the conclusion. A second method, such as real-time PCR or western blot, is required.

      Thank you for your suggestion. We regret that we did not have enough human HA cartilage samples for qPCR and WB experiments, owing to severe erosion of the HA cartilage. We have pointed out this limitation in the revised draft. Please see lines 445-448.

      (9) Figure 3 shows that reduced Tnxb was accompanied by the increased Dnmt1. In addition, this study is about methylation. Have the authors tested the change of Dnmt1 levels when Tnxb was knocked down?

      Thanks for your suggestion. According to the reviewer's suggestion, we have tested the expression of Dnmt1 in Tnxb-KD chondrocytes, and no significant alteration was observed. Please see the following Figure.

      Author response image 1.

      Figure Legend: Representative IHC staining of Dnmt1 in articular cartilage from Tnxb-KD HA mice. Corresponding quantification of the proportion of Dnmt1 positive regions. Red arrows indicate positive cells. Scale bar: 100 μm. Data were presented as means ± SD; n = 5 in each group. ns = no significance by unpaired Student’s t test.

      (10) Also, is there a causal relationship between Tnxb levels and the distribution of methylation levels? Any related study was performed?

      Following the valuable suggestion of the reviewer, we used two well-known DNA methyltransferase inhibitors (RG108 or 5-Aza-dc) 3 to examine whether DNA methylation regulates transcriptional expression of TNXB. We found that both inhibitors significantly up-regulated Tnxb mRNA level. We have added this result to the revised Supplementary Figure 4 and draft (lines 292-296 and 369-374).

      (11) In Figure 6, what was the control of "AKT agnost" group?

      Thank you for your suggestion. We apologize for the oversight; we have added a vehicle group as the control for the AKT agonist in Figure 6 of our revised manuscript.

      (12) Previous studies have reported the involvement of TNXB in TGF-β signaling. Have the authors examined the effect of TNXB on TGF-β signaling in chondrocytes?

      Thank you for your suggestion. We examined TGF-β signaling in Tnxb-KD chondrocytes, and no significant changes were observed. We have added this result to the revised Supplementary Figure 7 and discussed it in the revised draft (lines 475-479).

      (13) Discussion can be improved. For example, have previous studies reported the association between TNXB and methylation in other cells/tissues? In addition to apoptosis, are there other potential mechanisms underlying the protective role of TNXB in chondrocytes?

      Thank you for your valuable comments. Previous studies have shown differential DNA methylation of TNXB in whole blood from rheumatoid arthritis patients and in retinal pigment epithelium from patients with age-related macular degeneration 4,5. Here, we are the first to report the association between DNA methylation of TNXB and HA cartilage degeneration. There are limited published studies on the physiological function of TNXB, most of which report the effect of TNXB on extracellular matrix organization 6,7. In our work, we found that TNXB regulated the phosphorylation of AKT. Since previous reports showed that AKT controls the expression of Mmp13 8, we thought that TNXB might regulate chondrocyte extracellular matrix organization, in addition to its function in apoptosis. We have discussed these points in the revised manuscript (lines 462-464 and 495-501).

      (14) The manuscript writing needs to be improved. Typos and grammar issues were noted.

      Thank you. We have revised and polished the language, and we hope the revised version is acceptable.

      Reviewer #2 (Public Review):

      Summary:

      This manuscript mainly studied the biological effect of tenascin XB (TNXB) on hemophilic arthropathy (HA) progression. Using bioinformatic and histopathological approaches, the authors identified the novel candidate gene TNXB for HA. Next, the authors showed that TNXB knockdown leads to chondrocyte apoptosis, matrix degeneration, and subchondral bone loss in vivo/vitro. Furthermore, AKT agonists promoted extracellular matrix synthesis and prevented apoptosis in TNXB knockdown chondrocytes.

      Strengths:

      In general, this study significantly advances our understanding of HA pathogenesis. The authors utilize comprehensive experimental strategies to demonstrate the role of TNXB in cartilage degeneration associated with HA. The results are clearly presented, and the conclusions appear appropriate.

      Weaknesses:

      Additional clarification is required regarding the gender of the F8-/- mouse in the study. Is the mouse male or female?

      We apologize that we did not provide enough information about the gender of the F8-/- mice in the previous draft. We used male F8-/- mice as the study subjects for our experiments. Hemophilia A is predominantly seen in males because of its X-chromosome linkage 9.

      Recommendations For The Authors:

      Some issues need to be addressed in the manuscript:

      (1) During the progression of HA, in addition to cartilage degeneration, synovial hypertrophy and inflammation are also significant symptoms. How is the expression of TNXB in HA synovium?

      Thank you for your valuable comments. Following the reviewer's suggestion, we tested the expression of TNXB in the synovium and found no statistically significant difference in its expression level (Supplementary Figure 2). Please see lines 347-349.

      (2) Lines 183-188. The methods of virus infection should be more detailed. What was the concentration of the AAVs injected? And how many doses were administrated?

      Thank you for your suggestion. We have added an explanation of virus infection and injected doses in revised methods section (lines 205-206).

      (3) Line 197-198. Could the author double-check the decalcification time for human cartilage samples? Is it for 3 months? Or for 3 weeks?

      Thank you for your suggestion. We have reconfirmed that the human cartilage samples were decalcified for 3 months.

      (4) Line 343-344 "Above results suggest that TNXB might be protective against HA and its cartilage suppression is closely related to HA development." The conclusion is inappropriate, please revise it.

      Thanks for your suggestion. We have revised this conclusion into “Above results suggest that the suppression of TNXB in cartilage promotes the HA development”. Please see lines 365-366.

      (5) Line 326-327, the IHC staining for human samples is shown in Figure 2, not Figure 1. Please double check and revise it.

      Thank you for the reminder. We apologize for the oversight and have corrected it in the revision.

      (6) For Figure 1B, it shows the MRI images of knee joints. However, the method section lacks details regarding the MRI imaging scan and analysis. Could the author include this information in the method section?

      Thank you for your valuable comments. We have added the method of MRI imaging scan and analysis in revised Methods. Please see lines 154-163.

      (7) In Figure 5, The statistical result of Bcl-2 is inconsistent with its Western blot band. Please check.

      Thank you for the reminder. We have corrected it in the revision.

      (8) Please read through the text carefully to check for language problems. For example, in Line 68 "Our" not "our".

      Thank you for the reminder. We have corrected it in the revision. Please see line 68.

      Reviewer #3 (Public Review):

      Summary:

      The manuscript by Dr. Chen et al. investigates the genes that are differentially methylated and associated with cartilage degeneration in hemophilia patients. The study demonstrates the functional mechanisms of the TNXB gene in chondrocytes and F8-/- mice. The authors first showed significant DNA methylation differences between hemophilic arthritis (HA) and osteoarthritis through genome-wide DNA methylation analysis. Subsequently, they showed a decreased expression of the differentially methylated TNXB gene in cartilage from HA patients and mice. By knocking down TNXB in vivo and in vitro, the results indicated that TNXB regulates extracellular matrix homeostasis and apoptosis by modulating p-AKT. The findings are novel and interesting, and the study presents valuable information in blood-induced arthritis research.

      Strengths:

      The authors adopted a comprehensive approach by combining genome-wide DNA methylation analysis, in vivo and in vitro experiments using human and mouse samples to illustrate the molecular mechanisms involved in HA progression, which is crucial for developing targeted therapeutic strategies. The study identifies Tenascin XB (TNXB) as a central mediator in cartilage matrix degradation. It provides mechanistic insights into how TNXB influences cartilage matrix degradation by regulating the activation of AKT. It opens avenues for future research and potential therapeutic interventions using AKT agonists for cartilage protection in hemophilic arthropathy. The conclusions drawn from the study are clear and directly tied to the findings.

      Weaknesses:

      (1) The study utilizes a small sample size (N=5 for both osteoarthritis and hemophilic arthropathy). A larger sample size would enhance the generalizability and statistical power of the findings.

      Thank you for pointing out this deficiency. Indeed, our sample size is relatively small, although the overall sample size was sufficient for statistical analyses. We have added this limitation to the Discussion in the revised manuscript. Please see lines 445-448. Considering the small sample size, we subsequently performed a functional validation study for TNXB, one of the most significant genes, and demonstrated that TNXB exerts a critical impact on chondrocyte apoptosis in HA pathogenesis in vivo and in vitro.

      (2) The use of an animal model (F8-/- mouse) to investigate the role of TNXB may not fully capture the complexity of human hemophilic arthropathy. Differences in the biology between species may affect the translatability of the findings to human patients.

      Thank you for your valuable comments. We recognize that biological differences between species can affect the clinical translation of research findings. In our work, we sequenced human cartilage samples to identify the differentially methylated gene TNXB. We also demonstrated that TNXB protein expression was significantly down-regulated in human HA cartilage and in F8-/- transgenic mouse cartilage. The F8-/- transgenic mouse serves as a well-accepted model for the study of hemophilia: it is phenotypically similar to human patients suffering from the disease and bleeds spontaneously into the joints and soft tissues. In addition, this mouse model has been widely used in the study of hemophilia and hemophilic arthritis 9-11.

      (3) The study primarily focuses on TNXB as a central mediator, but it might overlook other potentially relevant factors contributing to cartilage degradation in hemophilic arthropathy. A more holistic exploration of genetic and molecular factors could provide a broader understanding of the condition.

      Thank you for your suggestion. Since our human sample size is relatively small, we should interpret the differentially methylated genes cautiously. Therefore, we mainly focused on the most significant gene, TNXB, for the functional study. In future studies, we will expand the sample size to explore the molecular mechanisms of HA more comprehensively.

      Recommendations For The Authors:

      The following are my suggestions:

      (1) Why do the authors choose to concentrate on the knee joint in the introduction when hemophilia, characterized by a deficiency in clotting factor F8, is recognized as a systemic disease?

      Thank you for your valuable comments. Although hemophilia is a systemic disease, approximately 80%-90% of bleeding episodes in patients with hemophilia occur within the musculoskeletal system, especially in the knee joint 12.

      (2) While Figure 1 illustrates distinct expressions of Dnmt1 and Dnmt3a, only Dnmt1 results are presented in HA mice models in Figure 3. To address this, it is suggested that the expression of Dnmt3a be explored in animal models.

      Thank you for your suggestion. According to the reviewer's suggestion, we examined the expression of Dnmt3a in mouse articular cartilage, and the expression level of Dnmt3a was significantly up-regulated in both the 4W and 8W model groups compared with the control group (Figure 3). Please see line 364.

      (3) In Figure 3, the sample size for Dnmt1 is smaller than the other indicators; therefore, supplementing the sample count is recommended.

      Thank you for the reminder. We have corrected it in the revision.

      (4) Regarding Figure 4G, a few apoptotic cells were observed in the AAV NC group. It is advised that this figure be reviewed for accuracy.

      Thank you for your suggestion. In Figure 5D, the AAV-NC group received a needle injection of AAV. Therefore, it is normal for some apoptotic cells to appear in the cartilage layer.

      (5) The authors concluded that TNXB plays a role in apoptosis and AKT signaling. Providing expression data for Caspase9 would be valuable to strengthen this assertion, as PI3K/AKT signaling directly influences its activation during apoptosis.

      Thank you for your comments. We have examined the expression of Cleaved-Caspase9 protein and found that knockdown of TNXB resulted in upregulation of Cleaved-Caspase9 protein expression, which was reversed by the addition of SC79. This result has been added to the revised Figure 6 and manuscript. Please see line 414.

      (6) Quantitative analysis of the differences between the two groups in Supplemental Figures is necessary.

      Thank you for your suggestion. We have added the quantitative analysis of the differences between the two groups in Supplemental Figures.

      (7) With three major isoforms (homologs) of AKT in mammals-AKT1, 2, and 3 - why did the authors specifically focus on AKT1?

      Thank you for your comments. Based on the results of the KEGG enrichment analysis of differentially methylated genes, we investigated the role of the PI3K/AKT pathway in apoptosis of HA chondrocytes. AKT is universally acknowledged as a core factor in the PI3K/AKT pathway and plays critical roles in various cellular activities such as cell proliferation, cell differentiation, apoptosis, and metabolism 13,14. More notably, several studies demonstrated that, within the AKT family, Akt1 is primarily involved in the regulation of chondrocyte survival and proteoglycan synthesis 15. Therefore, we examined the phosphorylation of AKT1 in HA cartilage and TNXB-KD chondrocytes, and found that TNXB regulates chondrocyte ECM and apoptosis through AKT1.

      References:

      (1) Cooke, E.J., Zhou, J.Y., Wyseure, T., Joshi, S., Bhat, V., Durden, D.L., Mosnier, L.O., and von Drygalski, A. (2018). Vascular Permeability and Remodelling Coincide with Inflammatory and Reparative Processes after Joint Bleeding in Factor VIII-Deficient Mice. Thromb Haemost 118, 1036-1047. 10.1055/s-0038-1641755.

      (2) Kleiboer, B., Layer, M.A., Cafuir, L.A., Cuker, A., Escobar, M., Eyster, M.E., Kraut, E., Leavitt, A.D., Lentz, S.R., Quon, D., et al. (2022). Postoperative bleeding complications in patients with hemophilia undergoing major orthopedic surgery: A prospective multicenter observational study. J Thromb Haemost 20, 857-865. 10.1111/jth.15654.

      (3) Weiland, T., Weiller, M., Kunstle, G., and Wendel, A. (2009). Sensitization by 5-azacytidine toward death receptor-induced hepatic apoptosis. J Pharmacol Exp Ther 328, 107-115. 10.1124/jpet.108.143560.

      (4) Anaparti, V., Agarwal, P., Smolik, I., Mookherjee, N., and El-Gabalawy, H. (2020). Whole Blood Targeted Bisulfite Sequencing and Differential Methylation in the C6ORF10 Gene of Patients with Rheumatoid Arthritis. J Rheumatol 47, 1614-1623. 10.3899/jrheum.190376.

      (5) Porter, L.F., Saptarshi, N., Fang, Y., Rathi, S., den Hollander, A.I., de Jong, E.K., Clark, S.J., Bishop, P.N., Olsen, T.W., Liloglou, T., et al. (2019). Whole-genome methylation profiling of the retinal pigment epithelium of individuals with age-related macular degeneration reveals differential methylation of the SKI, GTF2H4, and TNXB genes. Clin Epigenetics 11, 6. 10.1186/s13148-019-0608-2.

      (6) Mao, J.R., Taylor, G., Dean, W.B., Wagner, D.R., Afzal, V., Lotz, J.C., Rubin, E.M., and Bristow, J. (2002). Tenascin-X deficiency mimics Ehlers-Danlos syndrome in mice through alteration of collagen deposition. Nat Genet 30, 421-425. 10.1038/ng850.

      (7) Zhang, K., Wang, X., Zeng, L.T., Yang, X., Cheng, X.F., Tian, H.J., Chen, C., Sun, X.J., Zhao, C.Q., Ma, H., and Zhao, J. (2023). Circular RNA PDK1 targets miR-4731-5p to enhance TNXB expression in ligamentum flavum hypertrophy. FASEB J 37, e22877. 10.1096/fj.202200022RR.

      (8) Guo, H., Yin, W., Zou, Z., Zhang, C., Sun, M., Min, L., Yang, L., and Kong, L. (2021). Quercitrin alleviates cartilage extracellular matrix degradation and delays ACLT rat osteoarthritis development: An in vivo and in vitro study. J Adv Res 28, 255-267. 10.1016/j.jare.2020.06.020.

      (9) Weitzmann, M.N., Roser-Page, S., Vikulina, T., Weiss, D., Hao, L., Baldwin, W.H., Yu, K., Del Mazo Arbona, N., McGee-Lawrence, M.E., Meeks, S.L., and Kempton, C.L. (2019). Reduced bone formation in males and increased bone resorption in females drive bone loss in hemophilia A mice. Blood Adv 3, 288-300. 10.1182/bloodadvances.2018027557.

      (10) Haxaire, C., Hakobyan, N., Pannellini, T., Carballo, C., McIlwain, D., Mak, T.W., Rodeo, S., Acharya, S., Li, D., Szymonifka, J., et al. (2018). Blood-induced bone loss in murine hemophilic arthropathy is prevented by blocking the iRhom2/ADAM17/TNF-alpha pathway. Blood 132, 1064-1074. 10.1182/blood-2017-12-820571.

      (11) Vols, K.K., Kjelgaard-Hansen, M., Ley, C.D., Hansen, A.K., and Petersen, M. (2019). Bleed volume of experimental knee haemarthrosis correlates with the subsequent degree of haemophilic arthropathy. Haemophilia 25, 324-333. 10.1111/hae.13672.

      (12) Lobet, S., Peerlinck, K., Hermans, C., Van Damme, A., Staes, F., and Deschamps, K. (2020). Acquired multi-segment foot kinematics in haemophilic children, adolescents and young adults with or without haemophilic ankle arthropathy. Haemophilia 26, 701-710. 10.1111/hae.14076.

      (13) Garcia, D., and Shaw, R.J. (2017). AMPK: Mechanisms of Cellular Energy Sensing and Restoration of Metabolic Balance. Mol Cell 66, 789-800. 10.1016/j.molcel.2017.05.032.

      (14) Johnson, J., Chow, Z., Lee, E., Weiss, H.L., Evers, B.M., and Rychahou, P. (2021). Role of AMPK and Akt in triple negative breast cancer lung colonization. Neoplasia 23, 429-438. 10.1016/j.neo.2021.03.005.

      (15) Rao, Z., Wang, S., and Wang, J. (2017). Peroxiredoxin 4 inhibits IL-1beta-induced chondrocyte apoptosis via PI3K/AKT signaling. Biomed Pharmacother 90, 414-420. 10.1016/j.biopha.2017.03.075.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The manuscript by Rühling et al analyzes the mode of entry of S. aureus into mammalian cells in culture. The authors propose a novel mechanism of rapid entry that involves the release of calcium from lysosomes via NAADP-stimulated activation of TPC1, which in turn causes lysosomal exocytosis; exocytic release of lysosomal acid sphingomyelinase (ASM) is then envisaged to convert exofacial sphingomyelin to ceramide. These events not only induce the rapid entry of the bacteria into the host cells but are also described to alter the fate of the intracellular S. aureus, facilitating escape from the endocytic vacuole to the cytosol.

      Strengths:

      The proposed mechanism is novel and could have important biological consequences.

      Weaknesses:

      Unfortunately, the evidence provided is unconvincing and insufficient to document the multiple, complex steps suggested. In fact, there appear to be numerous internal inconsistencies that detract from the validity of the conclusions, which were reached mostly based on the use of pharmacological agents of imperfect specificity.

      We thank the reviewer for the detailed evaluation of our manuscript. We will address the criticism below.

      We agree with the reviewer that many of the experiments presented in our study rely on the use of inhibitors. However, we want to emphasize that the main conclusion (the invasion pathway affects the intracellular fate/phagosomal escape) was demonstrated without the use of inhibitors or genetic ablation in two key experiments (Figure 5D/E). These experiments were in line with the results we obtained with inhibitors (amitriptyline [Figure 4D], ARC39 and PCK310 [Figure 4C], and Vacuolin-1 [Figure 4E]). Importantly, the hypothesis was also supported by another key experiment, in which we showed that the intracellular fate of bacteria is affected by removal of SM from the plasma membrane before invasion, but not by removal of SM from phagosomal membranes after bacterial internalization (Figure 5A-C). Taken together, we thus believe that the main hypothesis is strongly supported by our data.

      Moreover, we either used different inhibitors for the same molecule (ASM was inhibited by ARC39, amitriptyline and PCK310 with similar outcome) or supported our hypothesis with gene-ablated cell pools (TPC1, Syt7, SARM1), as we will point out in more detail below.

      Firstly, the release of calcium from lysosomes is not demonstrated. Localized changes in the immediate vicinity of lysosomes need to be measured to ascertain that these organelles are the source of cytosolic calcium changes. In fact, 9-phenantrol, which the authors find to be the most potent inhibitor of invasion and hence of the putative calcium changes, is not a blocker of lysosomal calcium release but instead blocks plasmalemmal TRPM4 channels. On the other hand, invasion is seemingly independent of external calcium. These findings are inconsistent with each other and point to non-specific effects of 9-phenantrol. The fact that ionomycin decreases invasion efficiency is taken as additional evidence of the importance of lysosomal calcium release. It is not clear how these observations support involvement of lysosomal calcium release and exocytosis; in fact treatment with the ionophore should itself have induced lysosomal exocytosis and stimulated, rather than inhibited invasion. Yet, manipulations that increase and others that decrease cytosolic calcium both inhibited invasion.

      With respect to lysosomal Ca²⁺ release, we agree with the reviewer that direct visual demonstration of lysosomal Ca²⁺ release upon infection would improve the manuscript. We therefore performed live cell imaging to visualize lysosomal Ca²⁺ release using a previously published method [1]. The approach is based on two dextran-coupled fluorophores that are incubated with host cells. The dyes are endocytosed and eventually stain the lysosomes. One of the dyes, Rhod-2, is Ca²⁺-sensitive and can be used to estimate the lysosomal Ca²⁺ content. The second dye, AF647, is Ca²⁺-insensitive and is used to visualize the lysosomes. A decrease in the Rhod-2/AF647 ratio within the lysosomes indicates lysosomal Ca²⁺ release. We monitored lysosomal Ca²⁺ content during S. aureus infection with this method (Author response image 1 and Author response video 1). However, the lysosomes are very dynamic, and it is challenging to monitor the fluorescence intensities over time. Thus, quantitative measurements are not possible with our methodology, and we decided not to include these data in the main manuscript. Nevertheless, one could speculate that the lysosomal Ca²⁺ content in the selected ROI (Author response image 1 and Author response video 1) decreases upon attachment of S. aureus to the host cells, as indicated by a decrease in the Rhod-2/AF647 ratio.

      Author response image 1.

      Lysosomal Ca²⁺ imaging during S. aureus infection. The lysosomes of HuLEC were stained with two dextran-coupled fluorescent dyes: the Ca²⁺-sensitive dye Rhod-2 and the Ca²⁺-insensitive dye AF647. Cells were infected with fluorescent S. aureus JE2 and monitored by live cell imaging (see Author response video 1). The intensities of Rhod-2 and AF647 were measured close to a S. aureus-host contact site, and the ratio of Rhod-2 to AF647 fluorescence intensity was calculated.
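
      The ratio readout described above can be sketched as follows. This is an illustration only, not the analysis used for Author response image 1: the function name and the intensity values are hypothetical, and normalizing each ratio to the first frame is an added assumption that makes a decrease easier to read.

```python
def normalized_ratio(rhod2, af647):
    """Per-frame Rhod-2/AF647 ratio, normalized to the first frame.

    rhod2, af647: background-corrected ROI intensities, one value per frame.
    A value below 1.0 means the ratio has dropped relative to the first frame.
    """
    if not rhod2 or len(rhod2) != len(af647):
        raise ValueError("traces must be non-empty and of equal length")
    if any(value <= 0 for value in af647):
        raise ValueError("AF647 intensities must be positive")
    ratios = [r / a for r, a in zip(rhod2, af647)]
    if ratios[0] == 0:
        raise ValueError("first-frame ratio is zero; cannot normalize")
    return [ratio / ratios[0] for ratio in ratios]


# Hypothetical ROI traces: the Ca-sensitive signal falls while AF647 stays flat.
trace = normalized_ratio([100.0, 80.0, 50.0], [50.0, 50.0, 50.0])
print(trace)
```

      Dividing by the Ca²⁺-insensitive channel corrects for changes in dye amount or focus within the ROI, which matters because the lysosomes move during imaging.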

      As to the involvement of TRPM4 in S. aureus host cell internalization, it has been reported that TRPM4 is activated by cytosolic Ca²⁺. However, the channel conducts monovalent cations such as K⁺ or Na⁺ but is impermeable to Ca²⁺ [2, 3]. The following observations support this:

      i) S. aureus invasion is dependent on intracellular Ca²⁺, but is independent of extracellular Ca²⁺ (Figure 1A).

      ii) 9-phenantrol treatment reduces S. aureus internalization by host cells, illustrating the dependence of this process on TRPM4 (data removed from the manuscript). We therefore hypothesize that TRPM4 is activated by Ca²⁺ released from lysosomes (see above).

      TRPM4 is localized to focal adhesions and is connected to the actin cytoskeleton [4, 5], a prerequisite for host cell entry of S. aureus [6, 7]. This suggests an important function of TRPM4 in the uptake of S. aureus in general, but TRPM4 does not necessarily have to be involved exclusively in the rapid uptake pathway.

      TRPM4 itself is not permeable to Ca²⁺ but is activated by the cation. Thus, it is unlikely to cause lysosomal exocytosis. The stronger reduction in bacterial uptake by treatment with 9-phenantrol compared with Ned19 may thus be caused by the involvement of TRPM4 in additional pathways of S. aureus host cell entry involving the association of TRPM4 with focal adhesions or, as pointed out by the reviewer, by unspecific side effects of 9-phenantrol that we currently cannot exclude. However, we think that the experiments with 9-phenantrol distract from the main story (lysosomal Ca²⁺ and exocytosis) and might be confusing for the reader. We thus removed all data and discussion concerning 9-phenantrol from the revised manuscript.

      Regarding the reduced S. aureus invasion after ionomycin treatment, we agree with the reviewer that ionomycin is known to lead to lysosomal exocytosis, as was previously shown by others[8] as well as our laboratory[9].

      We hypothesized that pretreatment with ionomycin would trigger lysosomal exocytosis and thus would reduce the pool of lysosomes that can undergo exocytosis before host cells are contacted by S. aureus. As a result, we should observe a marked reduction of S. aureus internalization in such “lysosome-depleted cells”, if the lysosomal exocytosis is coupled to bacterial uptake. Our observation of reduced bacterial internalization after ionomycin treatment supports this hypothesis.

      However, ionomycin treatment and S. aureus infection of host cells are distinct processes.  

      While ionomycin results in strong global and non-directional lysosomal exocytosis of all “releasable” lysosomes (~5-10 % of all lysosomes according to previous observations)[8], we hypothesize that lysosomal exocytosis upon contact with S. aureus only involves a small proportion of lysosomes at host-bacteria contact sites. This is supported by experiments that demonstrate that ~30% of the lysosomes that are released by ionomycin treatment are exocytosed during S. aureus infection (see below and Figure 2, A-C). We added these new data as well as a corresponding section to the discussion (line 563 ff). Moreover, we moved the data obtained with ionomycin to Figure 2E and described our idea behind this experiment more precisely (line 166 ff).

      The proposed role of NAADP is based on the effects of "knocking out" TPC1 and on the pharmacological effects of Ned-19. It is noteworthy that TPC2, rather than TPC1, is generally believed to be the primary TPC isoform of lysosomes. Moreover, the gene ablation accomplished in the TPC1 "knockouts" is only partial and rather unsatisfactory. Definitive conclusions about the role of TPC1 can only be reached with proper, full knockouts. Even the pharmacological approach is unconvincing because the high doses of Ned-19 used should have blocked both TPC isoforms and presumably precluded invasion. Instead, invasion is reduced by only ≈50%. A much greater inhibition was reported using 9-phenantrol, the blocker of plasmalemmal calcium channels. How is the selective involvement of lysosomal TPC1 channels justified?

      As to partial gene ablation of TPC1: To avoid clonal variance, we usually perform pool sorting to obtain a cell population that predominantly contains cells deficient in the target gene (here TPC1) but also a small proportion of wild-type cells, as seen by the residual TPC1 protein on the Western blot. We observe a significant reduction in bacterial uptake in this cell pool, suggesting that the uptake reduction in a pure K.O. population may be even more pronounced.

      As to the inhibition by Ned19: 

      The scale of invasion reduction upon Ned19 treatment (50%, Figure 1B) is comparable with the reduction caused by other compounds that influence the ASM-dependent pathway (such as amitriptyline, ARC39 [Figure 2G], BAPTA-AM [Figure 1A], Vacuolin-1 [Figure 2D], β-toxin [Figure 2L] and ionomycin [Figure 2E]). Further, the partial reduction of invasion is most likely due to the concurrent activity of multiple internalization pathways, not all of which are targeted by the compounds used and which we briefly discuss in the manuscript.

      We agree with the reviewer that Ned19 inhibits TPC1 and TPC2. Since ablation of TPC1 reduced invasion of S. aureus, we concluded that TPC1 is important for S. aureus host cell invasion. We nevertheless agree with the reviewer that a role for TPC2 cannot be excluded. We clarified this in the revised manuscript (line 552). It needs to be noted, however, that deficiency in either TPC1 or TPC2 alone was sufficient to prevent Ebola virus infection[10], which is in line with our observations.

      In order to address the role of TPC2 for this review process, we were kindly provided with TPCN1/TPCN2 double knock-out HeLa cells by Norbert Klugbauer (Freiburg, Germany), which we tested for S. aureus internalization. We found that invasion was reduced in these cell lines, supporting a role of lysosomal Ca<sup>2<sup>+</sup></sup> release in S. aureus host cell entry and a role for both TPC channels (Author response image 2, see end of the document). Since we did not have a single TPCN2 knock-out available, we decided to exclude these data from the main manuscript.

      Author response image 2.

      Invasion efficiency is reduced in TPC1/TPC2 double K.O. HeLa cells. Invasion efficiency of S. aureus JE2 was determined in TPC1/TPC2 double K.O. cells after 10 and 30 min. Results were normalized to the parental HeLa WT cell line (set to 100 %).  
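
      The normalization mentioned in this legend (parental WT set to 100 %) can be sketched as follows. The sketch assumes that each knock-out value is divided by the mean of the WT replicates; whether normalization was done per experiment or across pooled replicates is not stated here, and the function name and numbers are hypothetical.

```python
def normalize_to_wt(ko_values, wt_values):
    """Express invasion efficiencies relative to the wild-type mean (= 100 %).

    ko_values: raw invasion efficiencies of the knock-out cells.
    wt_values: raw invasion efficiencies of the parental WT cells.
    Assumption: all values come from comparable experiments, so the WT mean
    is a meaningful reference.
    """
    if not wt_values:
        raise ValueError("at least one WT measurement is required")
    wt_mean = sum(wt_values) / len(wt_values)
    if wt_mean <= 0:
        raise ValueError("WT mean must be positive")
    return [100.0 * value / wt_mean for value in ko_values]


# Illustrative numbers only, not measured data.
relative = normalize_to_wt([5.0, 7.5], [10.0, 10.0])
```

      A value below 100 then reads directly as a reduction of invasion relative to the parental cell line.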

      Invoking an elevation of NAADP as the mediator of calcium release requires measurements of the changes in NAADP concentration in response to the bacteria. This was not performed. Instead, the authors analyzed the possible contribution of putative NAADP-generating systems and reported that the most active of these, CD38, was without effect, while the elimination of SARM1, another potential source of NAADP, had a very modest (≈20%) inhibitory effect that may have been due to clonal variation, which was not ruled out. In view of these data, the conclusion that NAADP is involved in the invasion process seems unwarranted.

      Our results from two independent experimental set-ups (Ned19 [Figure 1B] and TPC1 K.O. [Figure 1C & Figure 2N]) indicate the involvement of NAADP in the process. Together with the metabolomics unit at the Biocenter Würzburg, we attempted to measure cellular NAADP levels; however, this proved to be non-trivial and requires further optimization. We can nevertheless rule out clonal variation in the SARM1 mutant, since experiments were conducted with a cell pool as described above.

      The mechanism behind biosynthesis of NAADP is still debated. CD38 was the first enzyme discovered to possess the ability to produce NAADP. However, it requires acidic pH to produce NAADP[11], which does not match the characteristics of a cytosolic NAADP producer. HeLa cells do not express CD38 and hence, it is not surprising that inhibition of CD38 had no effect on S. aureus invasion in HeLa cells. However, NAADP production by HeLa cells was observed in the absence of CD38[12]. Thus, CD38-independent NAADP generation is likely. SARM1 can produce NAADP at neutral pH[13] and is expressed in HeLa cells, thus providing a more promising candidate.

      We agree with the reviewer that the reduction of S. aureus internalization after ablation of SARM1 is less pronounced than in other experiments of ours. This may be explained by NAADP originating from other enzymes, such as the recently discovered DUOX1, DUOX2, NOX1 and NOX2[14], which – with the exception of DUOX2 – are expressed at low levels in HeLa cells. We added this to the discussion in the revised manuscript (line 579).

      We can, however, rule out clonal variation for the inhibitory effect. As stated above we generated K.O. cell pools specifically to avoid inherent problems of clonality. Thus, we also detect some residual wildtype cells within our cell pools.  

      The involvement of lysosomal secretion is, again, predicated largely on the basis of pharmacological evidence. No direct evidence is provided for the insertion of lysosomal components into the plasma membrane, or for the release of lysosomal contents to the medium. Instead, inhibition of lysosomal exocytosis by vacuolin-1 is the sole source of evidence. However, vacuolin-1 is by no means a specific inhibitor of lysosomal secretion: it is now known to act primarily as a PIKfyve inhibitor and to cause massive distortion of the endocytic compartment, including gross swelling of endolysosomes. The modest (20-25%) inhibition observed when using synaptotagmin 7 knockout cells is similarly not convincing proof of the requirement for lysosomal secretion.

      We agree with the reviewer that the manuscript will benefit from a functional analysis of lysosomal exocytosis and therefore conducted assays to investigate exocytosis in the revised manuscript. We previously i) showed by addition of specific antisera that LAMP1 is transiently exposed on the plasma membrane during ionomycin and pore-forming toxin challenge and ii) demonstrated the release of ASM activity into the culture medium under these conditions.[9] However, neither measurement is compatible with S. aureus infection, since LAMP1 antibodies are also bound non-specifically by protein A and other IgG-binding proteins on the S. aureus surface, which would bias the results. Since protein A may also serve as an adhesin in the investigated pathway, we cannot simply delete the ORF without changing other aspects of staphylococcal virulence. Further, FBS contains an ASM background activity that impedes activity measurements of cell culture medium. We previously removed this background activity by a specific heat-inactivation protocol.[9] However, S. aureus invasion is strongly reduced in culture medium containing this heat-inactivated FBS.

      We therefore developed a luminescence assay based on split NanoLuc luciferase that enables detection of LAMP1 exposed on the plasma membrane without usage of antibodies (Figure 2, A-C). We added a section on the assay in the revised manuscript. Briefly, we generated reporter cells by fusing a short peptide fragment of NanoLuc called HiBiT between the signal peptide and the mature luminal domain of LAMP1 and stably expressed the resulting protein in HeLa cells by lentiviral transduction. The LgBiT protein domain of NanoLuc luciferase (Promega) as well as the substrate Furimazine are added to the culture medium. HiBiT can reconstitute a functional NanoLuc with LgBiT and process Furimazine when lysosomes are exocytosed thereby generating luminescence measurable in a suitable plate reader. 

      With this assay we detected that about 30% of lysosomes that were “releasable” by treatment with ionomycin are exocytosed during S. aureus infection. Lysosomal exocytosis was strongly reduced (even below the levels of untreated controls) if we treated cells with Vacuolin-1 or Ned19.
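
      How a luminescence read-out can be expressed as a fraction of the ionomycin-releasable pool is sketched below. This is only one plausible calculation: it assumes that the signal of untreated cells is subtracted as baseline and that the ionomycin signal defines 100 % of “releasable” lysosomes. The function name and all numbers are hypothetical and do not reproduce the values behind Figure 2, A-C.

```python
def fraction_of_releasable(sample, untreated, ionomycin):
    """Return exocytosis in a sample as a fraction of the ionomycin response.

    sample: luminescence of the condition of interest (e.g. infected cells).
    untreated: luminescence of untreated control cells (baseline, assumption).
    ionomycin: luminescence after ionomycin treatment (maximum, assumption).
    Negative results mean the signal is below the untreated baseline.
    """
    dynamic_range = ionomycin - untreated
    if dynamic_range <= 0:
        raise ValueError("ionomycin signal must exceed the untreated baseline")
    return (sample - untreated) / dynamic_range


# Illustrative numbers only (arbitrary luminescence units).
infected = fraction_of_releasable(sample=130.0, untreated=100.0, ionomycin=200.0)
inhibited = fraction_of_releasable(sample=90.0, untreated=100.0, ionomycin=200.0)
```

      With such a baseline-corrected measure, a condition that falls below the untreated control yields a negative fraction, which corresponds to the statement that exocytosis after Vacuolin-1 or Ned19 treatment was below the level of untreated controls.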

      We agree with the reviewer that Vacuolin-1 to some extent has unspecific side effects as has been shown by others and which we addressed in the revised version of the manuscript (line 541 ff). However, our new results with the HiBiT reporter cell line clearly demonstrate a reduction of lysosomal exocytosis after Vacuolin-1 treatment. Supported by this and our other results we hypothesize that Vacuolin-1 decreases S. aureus internalization due to the inhibition of lysosomal exocytosis.

      As to the involvement of synaptotagmin 7: The effect of Syt7 K.O. on invasion was moderate in initial experiments, likely due to a high culture passage and presumably overgrowth of WT cells. However, reduction of invasion in Syt7 K.O.s was more pronounced in experiments with β-toxin complementation (Figure 2, N) and hence, we combined the two data sets (Figure 2, F). This demonstrates the reduction of bacterial invasion by ~40% in Syt7 K.O. cell pools. Moreover, Syt7 is not the only protein possibly involved in Ca<sup>2<sup>+</sup></sup>-dependent exocytosis. For instance, Syt1 has been shown to possess an overlapping function.[15] This may explain the differences between our Vacuolin-1 and Syt7 ablation experiments. We added this information to the discussion. 

      ASM is proposed to play a central role in the rapid invasion process. As above, most of the evidence offered in this regard is pharmacological and often inconsistent between inhibitors or among cell types. Some drugs affect some of the cells, but not others. It is difficult to reach general conclusions regarding the role of ASM. The argument is made even more complex by the authors' use of exogenous sphingomyelinase (beta-toxin). Pretreatment with the toxin decreased invasion efficiency, a seemingly paradoxical result. Incidentally, the effectiveness of the added toxin is never quantified/validated by directly measuring the generation of ceramide or the disappearance of SM.

      Although pharmacological inhibitors can have unspecific side effects, we want to emphasize that the inhibitors used in our study act on the enzyme ASM by completely different mechanisms. Amitriptyline is a so-called functional inhibitor of ASM (FIASMA), which induces the detachment of ASM from lysosomal membranes, resulting in degradation of the enzyme.[16] By contrast, ARC39 is a competitive inhibitor.[17, 18]

      There are no inconsistencies in our data obtained with ASM inhibitors. Amitriptyline and ARC39 both reduce the invasion of S. aureus in HuLEC, HuVEC and HeLa cells (Figure 2G). ARC39 needs a longer pre-incubation, since its uptake by host cells is slower (to be published elsewhere). We observe a different outcome in 16HBE14o- and Ea.Hy 926 cells, with 16HBE14o- even demonstrating a slightly increased invasion of S. aureus upon ARC39 treatment. Amitriptyline had no effect (Figure 2G). 

      Thus, the ASM-dependent S. aureus internalization is cell type/line specific, which we state in the manuscript. The molecular origin of these differences is unclear and will require further investigation, e.g. in testing cell lines for potential differences in surface receptors. In a separate study we have already developed a biotinylation-based approach to identify potential novel host cell surface interaction partners during S. aureus infection.[19]

      Moreover, both inhibitors affected the invasion dynamics (Figure 3D), phagosomal escape (Figure 4C and Figure 4D) and Rab7 recruitment (Figure 4A and Supp. Figure 4A-C) in a similar fashion. Proper inhibition of ASM by both compounds in all cell lines used was validated by enzyme assays (Supp. Figure 2H), which again suggests that the ASM-dependent pathway exists only in specific cell lines and also supports the view that we do not observe unspecific side effects of the compounds. We clarified this in the revised manuscript.

      ASM is a key player for SM degradation and recycling. In a clinical context, deficiency in ASM results in the so-called Niemann-Pick disease type A/B. The lipid profile of ASM-deficient cells is massively altered[20], which will result in severe side effects. Short-term inhibition by small molecules therefore poses a clear benefit when compared to the usage of ASM K.O. cells. In order to satisfy the query of the reviewer, we generated two ASM K.O. cell pools (generated with two different sgRNAs) and tested these for S. aureus invasion efficiency (Figure 2, I). We did not observe bacterial invasion differences between WT and K.O. cells. However, when we treated the cells additionally with ASM inhibitor, we observed a strongly reduced invasion in WT cells, while invasion efficiency in ASM K.O. was only slightly affected (Figure 2, J). We concluded that the reduced invasion observed in inhibitor-treated WT cells is predominantly due to the loss of ASM activity, while the small reduction observed in ARC39-treated ASM K.O.s is likely due to unspecific side effects.

      We performed lipidomics on these cells and demonstrated a strongly altered sphingolipid profile in ASM K.O. cells compared to untreated and inhibitor-treated WT cells (Figure 2, K). We speculate that other ASM-independent bacterial invasion pathways are upregulated in ASM K.O.s, thereby obscuring the effect contributed by absence of ASM. We discussed this in the revised manuscript (line 518 ff).

      Moreover, we introduced the RFP-CWT escape marker into the ASM K.O. cells and measured phagosomal escape of S. aureus JE2 and Cowan I. The latter strain is non-cytotoxic and serves as negative control, since it is known to possess a very low escape rate due to its inability to produce toxin. Again, we compared early invaders (infection for 10 min) with early+late invaders (infection for 30 min). As observed for JE2, “early invaders” possess lower escape rates than “early+late invaders”.

      We did not observe differences between WT and ASM K.O. cells if we infected for only 10 min. By contrast, we observed a lower escape rate in ASM K.O. compared to WT cells when we infected for 30 min (Author response image 3, see end of the document).

      However, we usually observe an increased phagosomal escape when we treat host cells with ASM inhibitors (Figure 4C and D). Reduced phagosomal escape of intracellular S. aureus in ASM K.O. cells may be caused by the altered sphingolipid profile (e.g., by interference with binding of bacterial toxins to phagosomal membranes or altered vesicular acidification). We hence think that these data are difficult to interpret, and clarification would require intense additional experimentation. Thus, we did not include these data in the manuscript.

      Author response image 3.

      Phagosomal escape rates were established in either HeLa wild-type or ASM K.O. cells expressing the phagosomal escape reporter RFP-CWT. Host cells were infected with the cytotoxic S. aureus strain JE2 or the non-cytotoxic strain Cowan I for 10 or 30 minutes, and escape rates were determined by microscopy 3 h p.i.

      As to the treatment with a bacterial sphingomyelinase:

      Treatment with the bacterial SMase (bSMase, here: β-toxin) was performed in two different ways:

      i) Pretreatment of host cells with β-toxin to remove SM from the host cell surface before infection. This removes the substrate of ASM from the cell surface prior to addition of the bacteria (Figure 2L, Figure 4A-C). Since SM is not present on the extracellular plasma membrane leaflet after treatment, a release of ASM cannot cause localized ceramide formation at the sites of lysosomal exocytosis. Similar observations were made by others.[21] 

      ii) Addition of bSMase to host cells together with the bacteria to complement for the absence of ASM (Figure 2N).  

      Removal of the ASM substrate before infection (i) prevents localized ASM-mediated conversion of SM to Cer during infection and resulted in a decreased invasion, while addition of the SMase during infection resulted in an increased invasion in TPC1 and Syt7 ablated cells. Thus, both experiments are consistent with each other and in line with our other observations. 

      Removal of SM from the plasma membrane by β-toxin was indirectly demonstrated by the absence of Lysenin recruitment to phagosomes/escaped bacteria when host cells were pretreated with the toxin before infection (Figure 5C). We also added another data set that demonstrates degradation of a fluorescent SM derivative upon β-toxin treatment of host cells (Supp Figure 2, M). In another publication, we recently quantified the effectiveness of β-toxin treatment, albeit with slightly longer treatment times (75 min vs. 3h).[22]

      To clarify our experimental approaches for the readership, we added an explanatory section to the revised manuscript (line 287 ff) and a scheme in Figure 2M describing the experimental settings.

      As to the general conclusions regarding the role of ASM: ASM and lysosomal exocytosis have been shown to be involved in uptake of a variety of pathogens[21, 23-27], supporting their role in the process.

      The use of fluorescent analogs of sphingomyelin and ceramide is not well justified and it is unclear what conclusions can be derived from these observations. Despite the low resolution of the images provided, it appears as if the labeled lipids are largely in endomembrane compartments, where they would presumably be inaccessible to the secreted ASM. Moreover, considering the location of the BODIPY probe, the authors would be unable to distinguish intact sphingomyelin from its breakdown product, ceramide. What can be concluded from these experiments? Incidentally, the authors report only 10% of BODIPY-positive events after 10 min. What are the implications of this finding? That 90% of the invasion events are unrelated to sphingomyelin, ASM, and ceramide?

      During the experiments with fluorescent SM analogues (Figure 3A, B), S. aureus was added to the samples immediately before the start of video recording. Hence, bacteria are slowly trickling onto the host cells, and we can thus image the initial contact between host cells and bacteria. For instance, the bacteria depicted in Figure 3A contact the host cell about 9 min before becoming BODIPY-FL-positive (see Supp. Video 1, 55 min). Hence, in these cases we see the formation of phagosomes around bacteria rather than bacteria in endomembrane compartments. Since generation of phagosomes happens at the plasma membrane, SM is accessible to secreted ASM.

      The “trickling” approach for infection is an experimental difference to our invasion measurements, in which we synchronized the infection by centrifugation. This ensures that all bacteria have contact with host cells and are not just floating in the culture medium. However, live cell imaging of initial bacterial-host contact and synchronization of infection are technically hard to combine.

      In our invasion measurements (with synchronization), we typically see internalization of ~20% of all added bacteria after 30 min. Hence, most bacteria that are visible in our videos are likely still extracellular and only a small proportion was internalized. This explains why only 10% of total bacteria are positive for BODIPY-FL-SM after 10 min. The proportion of internalized bacteria that are positive for BODIPY-FL-SM should be considerably higher but cannot be determined with this method.

      We agree with the reviewer that we cannot observe conversion of BODIPY-FL-SM by ASM. In order to do that, we attempted to visualize the conversion of a visible-range SM FRET probe (Supp. Figure 3), but the structure of the probe is not compatible with measurement of conversion on the plasma membrane, since the FITC fluorophore is released into the culture medium by the ASM activity and is thereby lost for imaging. In general, the visualization of SM conversion with subcellular resolution is challenging, and even with novel tools developed in our lab[28] visualization of SM on the plasma membrane is difficult.

      The conclusions we draw from these experiments are that i) S. aureus invasion is associated with SM and ii) SM-associated invasion can be very fast, since bacteria are rapidly engulfed by BODIPY-FL-SM containing membranes.

      It is also unclear how the authors can distinguish lysenin entry into ruptured vacuoles from the entry of RFP-CWT, used as a criterion of bacterial escape. Surely the molecular weights of the probes are not sufficiently different to prevent the latter one from traversing the permeabilized membrane until such time that the bacteria escape from the vacuole.

      We here want to clarify that both Lysenin as well as the CWT reporter have access to ruptured vacuoles (Figure 4B). We used the Lysenin reporter in these experiments for estimation of SM content of phagosomal membranes. If a vacuole is ruptured, both the bacteria and the luminal leaflet of the phagosomal membrane remnants get in contact with the cytosol and hence with the cytosolically expressed reporters YFP-Lysenin as well as RFP-CWT resulting in “Lysenin-positive escape” when phagosomes contained SM (see Figure 5C). By contrast, either β-toxin expression by S. aureus or pretreatment with the bSMase resulted in absence of Lysenin recruitment suggesting that the phagosomal SM levels were decreased/undetectable (Figure 5C, Supp Figure 6F, G, I, J).

      Although this approach does not enable a quantitative measurement of phagosomal SM, this method is sufficient to show that β-toxin expression and pretreatment result in markedly decreased phagosomal SM levels in the host cells.

      The approach we used here to analyze “Lysenin-positive escape” can clearly be distinguished from Lysenin-based methods that were used by others.[29] There, Lysenin was used to show trans-bilayer movement of SM before rupture of bacteria-containing phagosomes.

      To clarify the function of Lysenin in our approach we added additional figures (Figure 4F, Supp. Figure 5) and a movie (Supp. Video 4) to the revised manuscript.

      Both SMase inhibitors (Figure 4C) and SMase pretreatment increased bacterial escape from the vacuole. The former should prevent SM hydrolysis and formation of ceramide, while the latter treatment should have the exact opposite effects, yet the end result is the same. What can one conclude regarding the need and role of the SMase products in the escape process?

      As pointed out above, pretreatment of host cells with SMase removes SM from the plasma membrane, so that ASM does not have access to its substrate. Thus, both treatment with ASM inhibitors and pretreatment with bacterial SMase prevent ASM from being active on the plasma membrane and thereby block the ASM-dependent uptake (Figure 2 G, L). Although overall fewer bacteria were internalized by host cells under these conditions, the bacteria that invaded host cells did so in an ASM-independent manner.

      Since blockage of the ASM-dependent internalization pathway (with ASM inhibitor [Figure 4C, D], SMase pretreatment [Figure 5B] and Vacuolin-1 [Figure 4E]) always resulted in enhanced phagosomal escape, we conclude that bacteria that were internalized in an ASM-independent fashion cause enhanced escape. Vice versa, bacteria that enter host cells in an ASM-dependent manner demonstrate lower escape rates.

      This is supported by comparing the escape rates of “early” and “late” invaders [Figure 5D, E], which in our opinion is a key experiment that supports this hypothesis. The “early” invaders are predominantly ASM-dependent (see e.g. Figure 3E) and thus, bacteria that entered host cells in the first 10 min of infection should have been internalized predominantly in an ASM-dependent fashion, while slower entry pathways are active later during infection. The early ASM-dependent invaders possessed lower escape rates, which is in line with the data obtained with inhibitors (e.g. Figure 4C, D).

      We hypothesize that the activity of ASM on the plasma membrane during invasion mediates the recruitment of a specific subset of receptors, which then influences downstream phagosomal maturation and escape. This hypothesis is supported by the fact that the subset of receptors interacting with S. aureus is altered upon inhibition of the ASM-dependent uptake pathway. We describe this in another study that is currently under evaluation elsewhere.  

      Reviewer #2 (Public review):

      Summary:

      In this manuscript, Ruhling et al propose a rapid uptake pathway that is dependent on lysosomal exocytosis, lysosomal Ca<sup>2<sup>+</sup></sup> and acid sphingomyelinase, and further suggest that the intracellular trafficking and fate of the pathogen is dictated by the mode of entry.

      The evidence provided is solid, methods used are appropriate and results largely support their conclusions, but can be substantiated further as detailed below. The weakness is a reliance on chemical inhibitors that can be non-specific to delineate critical steps.

      Specific comments:

      A large number of experiments rely on treatment with chemical inhibitors. While this approach is reasonable, many of the inhibitors employed such as amitriptyline and vacuolin1 have other or nondefined cellular targets and pleiotropic effects cannot be ruled out. Given the centrality of ASM for the manuscript, it will be important to replicate some key results with ASM KO cells.

      We thank the reviewer for the critical evaluation of our manuscript and the many constructive comments.

      We agree with the reviewer that ASM inhibitors such as the functional inhibitors of ASM (FIASMAs) like amitriptyline used in our study have unspecific side effects given their mode of action. FIASMAs induce the detachment of ASM from lysosomal membranes, resulting in degradation of the enzyme.[16] However, we want to emphasize that we also used the competitive inhibitor ARC39 in our study[17, 18], which acts on the enzyme by a completely different mechanism. All phenotypes (reduced invasion [Figure 2G], effect on invasion dynamics [Figure 3D], enhanced escape [Figure 4C, D] and differential recruitment of Rab7 [Supp. Figure 4A-C]) were observed with both inhibitors, thereby supporting the role of ASM in the process.

      We further agree that experiments with genetic evidence usually support and improve scientific findings. However, ASM is a cellular key player for SM degradation and recycling. In a clinical context, deficiency in ASM results in the so-called Niemann-Pick disease type A/B. The lipid profile of ASM-deficient cells is massively altered[20], which in itself will result in severe side effects. Thus, the usage of inhibitors provides a clear benefit when compared to ASM K.O. cells, since ASM activity can be targeted in a short-term fashion, thereby preventing larger alterations in cellular lipid composition.

      We nevertheless generated two ASM K.O. cell pools (generated with two different sgRNAs) and tested for invasion efficiency (Figure 2, I). Here, we did not observe differences between WT and mutants. However, if we treated the cells additionally with ASM inhibitor, we observed a strongly reduced invasion in WT cells, while invasion efficiency in ASM K.O. was only slightly affected (Figure 2, J). We concluded that the reduced invasion observed in WT cells upon inhibitor treatment is predominantly due to inhibition of ASM, whereas the small reduction observed in ARC39-treated ASM K.O.s is likely due to unspecific side effects. We also demonstrated a strongly altered sphingolipid profile in ASM K.O. cells when compared to untreated and inhibitor-treated WT cells (new Figure 2, K). We speculate that other ASM-independent invasion pathways are upregulated in ASM K.O.s, thereby making up for the absence of ASM. We discuss this in the revised manuscript (line 518 ff).

      We introduced the RFP-CWT escape marker into the ASM K.O. cells and measured phagosomal escape of S. aureus JE2 and Cowan I (Author response image 3). The latter serves as negative control, since it is known to possess a very low escape rate due to its inability to produce toxin. Again, we compared early invaders (infection for 10 min) with early+late invaders (infection for 30 min). As seen before for JE2, early invaders possess lower escape rates than early+late invaders. We did not observe differences between WT and K.O. cells if we infected for 10 min. By contrast, we observed a lower escape rate in ASM K.O. compared to WT cells when we infected for 30 min. However, we usually observe an increased phagosomal escape when we treat host cells with ASM inhibitors (Figure 4C and D). We think that the reduced phagosomal escape in ASM K.O. is caused by the altered sphingolipid profile, which could have versatile effects (e.g., interference with binding of bacterial toxins to phagosomal membranes or changes in acidification). We hence think that these data are difficult to interpret, and clarification would require intense additional experimentation. Thus, we did not include these data in the manuscript.

      Most experiments are done in HeLa cells. Given the pathway is projected as generic, it will be important to further characterize cell type specificity for the process. Some evidence for a similar mechanism in other cell types S. aureus infects, perhaps phagocytic cell type, might be good. 

      Whenever possible we performed the experiments not only in HeLa but also in HuLECs. For example, we refer to experiments concerning the role of Ca<sup>2<sup>+</sup></sup> (Figure 1A/Supp.Figure1A), lysosomal Ca<sup>2<sup>+</sup></sup>/Ned19 (Figure1B/Supp Figure 1C), lysosomal exocytosis/Vacuolin-1 (Figure 2D/Supp. Figure2D), ASM/ARC39 and amitriptyline (Figure 2G), surface SM/β-toxin (Figure 2L/Supp. Figure 2L), analysis of invasion dynamics (complete Figure 3) and measurement of cell death during infection (Figure 6C<sup>+</sup>E, Supp. Figure 8A<sup>+</sup>B).

      HuLECs, however, are not readily amenable to genetic manipulation; hence, we were not able to generate gene deletions in these cells, and the cells do not grow readily after introduction of the fluorescent escape reporter.

      As to ASM involvement in phagocytic cells: a role for ASM during the uptake of S. aureus by macrophages was previously reported by others.[25] However, in professional phagocytes S. aureus does not escape from the phagosome and replicates within the phagosome.[30]

      I'm a little confused about the role of ASM on the surface. Presumably, it converts SM to ceramide, as the final model suggests. Overexpression of b-toxin results in the near complete absence of SM on phagosomes (having representative images will help appreciate this), but why is phagosomal SM detected at high levels in untreated conditions? If bacteria are engulfed by SM-containing membrane compartments, what role does ASM play on the surface? If surface SM is necessary for phagosomal escape within the cell, do the authors imply that ASM is tuning the surface SM levels to a certain optimal range? Alternatively, can there be additional roles for ASM on the cell surface? Can surface SM levels be visualized (for example, in Figure 4 E, F)?

      We initially hypothesized that we would detect higher phagosomal SM levels upon inhibition of ASM, since our model suggests SM cleavage by ASM on the host cell surface during bacterial cell entry. However, we did not detect any changes in our experiments (Supp. Figure 4F). We currently favor the following explanation: SM is the most abundant sphingolipid in human cells.[31] If peripheral lysosomes are exocytosed and thereby release ASM, only a localized and relatively small proportion of SM may be converted to Cer, which is most likely below our detection limit. In addition, the detection of cytosolically exposed phagosomal SM by YFP-Lysenin is not quantitative and provides a “Yes or No” measurement. Hence, we think that the rather limited SM to Cer conversion, in combination with the high abundance of SM in cellular membranes, does not visibly affect the recruitment of the Lysenin reporter.

      In our experiments that employ BODIPY-FL-SM (Figure 3a+b), we cannot distinguish between native SM and downstream metabolites such as Cer. Hence, again we cannot make any assumptions on the extent to which SM is converted on the surface during bacterial internalization. Although our laboratory recently used trifunctional sphingolipid analogs to analyze the SM to Cer conversion[22], the visualization of this process on the plasma membrane is currently still challenging.

      Overall, we hypothesize that the localized generation of Cer on the surface by released ASM leads to generation of Cer-enriched platforms. Subsequently, a certain subset of receptors may be recruited to these platforms and influence the uptake process. These platforms are supposed to be very small, which also would explain that we did not detect changes in Lysenin recruitment.

      Related to that, why is ASM activity on the cell surface important? Its role in non-infectious or other contexts can be discussed.

      ASM release by lysosomal exocytosis is implicated in plasma membrane repair upon injury. We added a short description of the role of extracellular ASM in the introduction (line 35).

      If SM removal is so crucial for uptake, can exocytosis of lysosomes alone provide sufficient ASM for SM removal? How much or to what extent is lysosomal exocytosis enhanced by initial signaling events? Do the authors envisage the early events in their model happening in localized confines of the PM, this can be discussed.

      Ionomycin treatment led to a release of ~10% of all lysosomes and also increased extracellular ASM activity.[8, 9] In the revised manuscript, we developed an assay to determine lysosomal exocytosis during S. aureus infection (Figure 2, A-C). During infection, we detected lysosomal exocytosis amounting to ~30% of that observed with ionomycin treatment. Since this is only a fraction of the “releasable lysosomes”, we assume that the effects (lysosomal Ca<sup>2+</sup> liberation, lysosomal exocytosis and ASM activity) are very localized and take place only at host-pathogen contact sites (see also above). We discuss this in the revised manuscript (line 563 ff). To our knowledge, it is currently unclear to what extent the released ASM affects surface SM levels. We attempted to visualize the local ASM activity on the cell surface by using a visible-range FRET probe (Supp. Fig. 3). Cleavage of the probe by ASM on the surface leads to release of FITC into the cell culture medium, which does not contribute a measurable signal at the surface.

      How are inhibitor doses determined? How efficient is the removal of extracellular bacteria at 10 min? It will be good to substantiate the cfu experiments for infectivity with imaging-based methods. Are the roles of TPC1 and TPC2 redundant? If so, why does silencing TPC1 alone result in a decrease in infectivity? For these and other assays, it would be better to show raw values for infectivity. Please show alterations in lysosomal Ca<sup>2+</sup> at the doses of inhibitors indicated. Is lysosomal Ca<sup>2+</sup> released upon S. aureus binding to the cell surface? Will be good to directly visualize this.

      Concerning the inhibitor concentrations, we either used values established in published studies or followed the recommendations of the suppliers (e.g. 2-APB, Ned19, Vacuolin-1). For ASM inhibitors, we confirmed proper inhibition of ASM by activity assays. Concentrations of ionomycin resulting in Ca<sup>2+</sup> influx and lysosomal exocytosis were determined in earlier studies from our lab.[9, 32]

      As to the removal of bacteria at 10 min p.i.: Lysostaphin is very efficient for removal of extracellular S. aureus and sterilizes the tissue culture supernatant. It significantly lyses bacteria within a few minutes, as determined by turbidity assays.[33]

      As to imaging-based infectivity assays: We performed imaging-based invasion assays to show reduced invasion efficiency with two ASM inhibitors in the revised manuscript with similar results as obtained by CFU counts (Supp. Figure 2, J).

      Regarding the roles of TPC1 and TPC2: from our data we cannot conclude whether the roles of TPC1 and TPC2 are redundant. Since blockage of TPC1 alone is sufficient to reduce internalization of bacteria, one could speculate that the two channels have distinct roles. On the other hand, there might be a Ca<sup>2+</sup> threshold for initiating lysosomal exocytosis that can only be attained if TPC1 and TPC2 are activated in parallel. Our observations are in line with another study that shows reduced Ebola virus infection in the absence of either TPC1 or TPC2.[34] In order to address the role of TPC2 for this review process, we were kindly gifted TPCN1/TPCN2 double knock-out HeLa cells by Norbert Klugbauer (Freiburg, Germany), which we tested for S. aureus internalization. We found that invasion was reduced in these double KO cell lines even further supporting a role of lysosomal Ca<sup>2+</sup> release in S. aureus host cell entry (Author response image 2, see end of the document). Since we did not have a single TPCN2 knockout available, we decided to exclude these data from the main manuscript.

      As to raw CFU counts: whereas the observed effects upon blocking the invasion of S. aureus are stable, the number of internalized bacteria varies between individual biological replicates, for instance, by differences in host cell fitness or growth differences in bacterial cultures, which are prepared freshly for each experiment.

      With respect to visualization of lysosomal Ca<sup>2+</sup> release: we agree with the reviewer that direct visual demonstration of lysosomal Ca<sup>2+</sup> release upon infection would improve the manuscript. We therefore performed live cell imaging to visualize lysosomal Ca<sup>2+</sup> release by a previously published method.[1] The approach is based on two dextran-coupled fluorophores that are incubated with host cells. The dyes are endocytosed and eventually stain the lysosomes. One of the dyes, Rhod-2, is Ca<sup>2+</sup>-sensitive and can be used to estimate the lysosomal Ca<sup>2+</sup> content. The second dye, AF647, is Ca<sup>2+</sup>-insensitive and is used to visualize the lysosomes. A decrease in the Rhod-2/AF647 ratio within the lysosomes indicates lysosomal Ca<sup>2+</sup> release. We monitored lysosomal Ca<sup>2+</sup> content during S. aureus infection with this method (Author response image 1 and Author response video 1). However, the lysosomes are very dynamic, and it is challenging to monitor the fluorescence intensities over time. Thus, quantitative measurements are not possible with our methodology, and we decided not to include these data in the final manuscript. However, one could speculate that lysosomal Ca<sup>2+</sup> content in the selected ROI (Author response image 1 and Author response video 1) is decreased upon attachment of S. aureus to the host cells, as indicated by a decrease in the Rhod-2/AF647 ratio.
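
      The ratiometric readout described above can be illustrated with a minimal sketch. It is not the analysis used for Author response image 1; the function name, the baseline window and all intensity values are hypothetical and serve only to show how a falling Rhod-2/AF647 ratio would be read as Ca<sup>2+</sup> release from a lysosomal ROI.

```python
from statistics import mean


def ratio_trace(rhod2, af647, baseline_frames=3):
    """Return the Rhod-2/AF647 ratio per frame, normalized to the baseline.

    rhod2, af647: mean ROI intensities per frame (same length).
    baseline_frames: number of initial frames averaged as the baseline
    (an assumption made for this illustration).
    """
    if len(rhod2) != len(af647):
        raise ValueError("both channels need the same number of frames")
    # The Ca2+-insensitive AF647 signal corrects for dye load and focus drift.
    ratios = [r / a for r, a in zip(rhod2, af647)]
    baseline = mean(ratios[:baseline_frames])
    return [x / baseline for x in ratios]


# Hypothetical ROI intensities: Rhod-2 drops while AF647 stays constant.
rhod2 = [100.0, 100.0, 100.0, 80.0, 60.0]
af647 = [50.0, 50.0, 50.0, 50.0, 50.0]
trace = ratio_trace(rhod2, af647)
```

      In this made-up trace the normalized ratio stays at 1.0 during the baseline and falls in the last two frames, which is the pattern that would be interpreted as lysosomal Ca<sup>2+</sup> release. Moving lysosomes that leave the ROI would distort both channels, which is the limitation noted above.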

      The precise identification of cytosolic vs phagosomal bacteria is not very easy to appreciate. The methods section indicates how this distinction is made, but how do the authors deal with partial overlaps and ambiguities generally associated with such analyses? Please show respective images.

      The number of events (individual bacteria) for the live cell imaging data should be clearly mentioned.

      We apologize for not having sufficiently explained the technology to detect escaped S. aureus. The cytosolic location of S. aureus is indicated by recruitment of RFP-CWT.[35] CWT is the cell wall targeting domain of lysostaphin, which efficiently binds to the pentaglycine cross bridge in the peptidoglycan of S. aureus. This reporter is exclusively and homogeneously expressed in the host cytosol. Only upon rupture of phagoendosomal membranes can the reporter be recruited to the cell wall of the now cytosolically located bacteria. S. aureus mutants, for instance in the agr quorum sensing system, cannot break down the phagosomal membrane in non-professional phagocytes and thus remain unlabeled by the CWT reporter.[35] We include several images (Figure 4, F, Supp. Figure 5) and movies (Supp. Video 4) of escape events in the revised manuscript. The bacteria numbers for live cell experiments are now shown in Supp. Figure 7.

      In the phagosome maturation experiments, what is the proportion of bacteria in Rab5 or Rab7 compartments at each time point? Will the decreased Rab7 association be accompanied by increased Rab5? Showing raw values and images will help appreciate such differences. Given the expertise and tools available in live cell imaging, can the authors trace Rab5 and Rab7 positive compartment times for the same bacteria?

      We included the proportion of Rab7-associated bacteria in the revised manuscript (Supp. Figure 4A and C) and also shortly mention these proportions in the text (line 353). Usually, we observe that Rab5 is only transiently (for a few minutes) present on phagosomes and only afterwards the phagosomes become positive for Rab7. We do not think that a decrease in Rab7-positive phagosomes would increase the proportion of Rab5-positive phagosomes. However, we cannot exclude this hypothesis with our data.

      We can trace the recruitment of Rab5/Rab7 to individual bacteria only manually, which impedes a quantitative evaluation. However, we included a video (Supp. Video 3) that illustrates the consecutive recruitment of the GTPases.

      The results with longer-term infection are interesting. Live cell imaging suggests that ASM-inhibited cells show accelerated phagosomal escape that reduces by 6 hpi. Where are the bacteria at this time point? Presumably, they should have reached lysosomes. The relationship between cytosolic escape, replication, and host cell death is interesting, but the evidence, as presented, is correlative for the populations. Given the use of live cell imaging, can the authors show these events in the same cell?

      We think that most bacteria-containing phagoendosomes should have fused with lysosomes 6 h p.i. as we have previously shown by acidification to pH of 5 and LAMP1 decoration.[36]

      The correlation between phagosomal escape and replication in the cytosol of non-professional phagocytes has been observed by us and others. In the revised manuscript we also provide images (Supp. Figure 5)/videos (Supp. Video 4) to show this correlation in our experiments.

      Given the inherent heterogeneity in uptake processes and the use of inhibitors in most experiments, the distinction between ASM-dependent and independent pathways might not be as clear-cut as the authors suggest. Some caution here will be good. Can the authors estimate what fraction of intracellular bacteria are taken up ASM-dependent?

      We agree with the reviewer that an overlap between internalization pathways is likely. A clear distinction is therefore certainly non-trivial. Alternative to ASM-dependent and ASM-independent pathways, the ASM activity may also accelerate one or several internalization pathways. We address this limitation in the discussion of the revised manuscript (line 596 ff).

      Early in infection (~10 min after contact with the cells), the proportion of bacteria that enter host cells ASM-dependently is relatively high, amounting to roughly 75-80% in HuLEC. After 30 min, this proportion decreases to about 50%. We included a paragraph in the discussion of the revised manuscript (line 593 ff).
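
      One simple way to arrive at such a proportion from invasion data is sketched below. This is an illustration under stated assumptions, not the calculation used in the manuscript: it assumes that the inhibitor blocks the ASM-dependent route completely and leaves all other routes unaffected, and the function name and CFU values are hypothetical.

```python
def asm_dependent_fraction(cfu_control, cfu_inhibited):
    """Estimate the fraction of internalized bacteria that entered ASM-dependently.

    cfu_control: intracellular CFU recovered from untreated host cells.
    cfu_inhibited: intracellular CFU recovered from ASM-inhibited host cells.
    Assumes complete and specific inhibition (a simplification).
    """
    if cfu_control <= 0:
        raise ValueError("cfu_control must be positive")
    return 1.0 - cfu_inhibited / cfu_control


# Hypothetical counts for a 10 min and a 30 min infection.
early = asm_dependent_fraction(cfu_control=1000, cfu_inhibited=250)
late = asm_dependent_fraction(cfu_control=4000, cfu_inhibited=2000)
```

      With these made-up counts the estimate falls from 0.75 at the early time point to 0.5 at the later one. If ASM activity merely accelerates uptake rather than defining a separate route, the same numbers would overstate the ASM-dependent share.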

      Reviewer #2 (Recommendations for the authors):

      (1) The experiment in Figure 4H is interesting. Details on what proportion of the cell is double positive, and if only this fraction was used for analysis will be good.

      We used all bacteria found in the images, regardless of whether host cells were infected with only one or both strains. Unfortunately, we cannot properly determine the proportion of cells that are double infected, since (i) we record the samples with CLSM and hence cannot exclude that there are intracellular bacteria in higher or lower optical sections, and (ii) we visualized cells by staining nuclei and did not stain the cell borders, so we cannot precisely tell to which host cell the bacteria localize.

      (2) Data is sparse for steps 5 and 6 of the model (line 330).

      We apologize for the inconvenience. There is a related study published elsewhere[19], in which we identified NRCAM and PTK7 as putative receptors involved in this invasion pathway. We included a section in the discussion with the corresponding citation (line 569).

      (3) Data for the reduced number of intracellular bacteria upon blocking ASM-dependent uptake (line 235) is not clear. Do they mean decreased invasion efficiency? These two need not be the same.

      We changed “reduced number of intracellular bacteria” to “invasion efficiency”.

      (4) b-toxin added to the surface can get endocytosed. Can its surface effect be delineated from endo/phagosomal effect?

      We attempted to delineate effects contributed by the toxin activity on the surface vs. within phagosomes (Figure 5 A-C). We see increased phagosomal escape when we pretreated host cells with β-toxin (removal of SM from the surface) and infected either in the presence (toxin will be taken up together with the bacteria into the phagosome) or in the absence (toxin was washed away shortly before infection) of β-toxin. By contrast, overexpression of β-toxin by S. aureus did not affect phagosomal escape rates. The proper activity of β-toxin was confirmed by the absence of Lysenin recruitment during phagosomal escape in all three conditions. We concluded that the activity on the surface, and not the activity in the phagosome, is important.

      (5) The potential role(s) of bacterial factors in the uptake and subsequent intracellular stages can be discussed.

      Multiple bacterial adhesins are known in S. aureus. These include factors covalently attached to the bacterial cell wall, such as the sortase-dependently anchored fibronectin-binding proteins A and B, but also secreted and “cell wall binding” proteins as well as non-proteinaceous factors such as wall teichoic acids. A discussion of these factors would thus be out of the scope of this manuscript, and we refer readers to specialized reviews on that topic.

      (6) The manuscript is not very easy to read. The abstract could be rephrased for better clarity and succinctness, with a clearly stated problem statement. The introduction is somewhat haphazard, I feel it can be better structured.

      We apologize for the inconvenience. We stated the problem/research question in the abstract and tried to improve the introduction without adding too much unnecessary detail. In general, we tried to improve the readability of the manuscript and hope that our results and conclusions are easier for the reader to understand in the revised version.

      (7) Typo in Figure 5F. Step 6 should read "accessory receptors"

      The typo was corrected.

      References

      (1) Lloyd-Evans, E. et al. Niemann-Pick disease type C1 is a sphingosine storage disease that causes deregulation of lysosomal calcium. Nature Medicine 14, 1247-1255 (2008).

      (2) Launay, P. et al. TRPM4 Is a Ca<sup>2+</sup>-Activated Nonselective Cation Channel Mediating Cell Membrane Depolarization. Cell 109, 397-407 (2002).

      (3) Nilius, B. et al. The Ca<sup>2+</sup>‐activated cation channel TRPM4 is regulated by phosphatidylinositol 4,5‐biphosphate. The EMBO Journal 25, 467-478 (2006).

      (4) Cáceres, M. et al. TRPM4 Is a Novel Component of the Adhesome Required for Focal Adhesion Disassembly, Migration and Contractility. PLoS One 10, e0130540 (2015).

      (5) Silva, I., Brunett, M., Cáceres, M. & Cerda, O. TRPM4 modulates focal adhesion-associated calcium signals and dynamics. Biophysical Journal 123, 390a (2024).

      (6) Schlesier, T., Siegmund, A., Rescher, U. & Heilmann, C. Characterization of the Atl-mediated staphylococcal internalization mechanism. International Journal of Medical Microbiology 310, 151463 (2020).

      (7) Jevon, M. et al. Mechanisms of Internalization of Staphylococcus aureus by Cultured Human Osteoblasts. Infection and Immunity 67, 2677-2681 (1999).

      (8) Rodriguez, A., Webster, P., Ortego, J. & Andrews, N.W. Lysosomes behave as Ca<sup>2+</sup>-regulated exocytic vesicles in fibroblasts and epithelial cells. J Cell Biol 137, 93-104 (1997).

      (9) Krones & Rühling et al. Staphylococcus aureus alpha-Toxin Induces Acid Sphingomyelinase Release From a Human Endothelial Cell Line. Front Microbiol 12, 694489 (2021).

      (10) Sakurai, Y. et al. Two-pore channels control Ebola virus host cell entry and are drug targets for disease treatment. Science 347, 995-998 (2015).

      (11) Aarhus, R., Graeff, R.M., Dickey, D.M., Walseth, T.F. & Lee, H.C. ADP-ribosyl cyclase and CD38 catalyze the synthesis of a calcium-mobilizing metabolite from NADP. J Biol Chem 270, 30327-30333 (1995).

      (12) Schmid, F., Fliegert, R., Westphal, T., Bauche, A. & Guse, A.H. Nicotinic acid adenine dinucleotide phosphate (NAADP) degradation by alkaline phosphatase. J Biol Chem 287, 32525-32534 (2012).

      (13) Angeletti, C. et al. SARM1 is a multi-functional NAD(P)ase with prominent base exchange activity, all regulated by multiple physiologically relevant NAD metabolites. iScience 25, 103812 (2022).

      (14) Gu, F. et al. Dual NADPH oxidases DUOX1 and DUOX2 synthesize NAADP and are necessary for Ca(2+) signaling during T cell activation. Sci Signal 14, eabe3800 (2021).

      (15) Schonn, J.-S., Maximov, A., Lao, Y., Südhof, T.C. & Sørensen, J.B. Synaptotagmin-1 and -7 are functionally overlapping Ca<sup>2+</sup> sensors for exocytosis in adrenal chromaffin cells. Proceedings of the National Academy of Sciences 105, 3998-4003 (2008).

      (16) Kornhuber, J. et al. Functional Inhibitors of Acid Sphingomyelinase (FIASMAs): a novel pharmacological group of drugs with broad clinical applications. Cell Physiol Biochem 26, 9-20 (2010).

      (17) Naser, E. et al. Characterization of the small molecule ARC39, a direct and specific inhibitor of acid sphingomyelinase in vitro. J Lipid Res 61, 896-910 (2020).

      (18) Roth, A.G. et al. Potent and selective inhibition of acid sphingomyelinase by bisphosphonates. Angew Chem Int Ed Engl 48, 7560-7563 (2009).

      (19) Rühling, M., Schmelz, F., Kempf, A., Paprotka, K. & Fraunholz, M.J. Identification of the Staphylococcus aureus endothelial cell surface interactome by proximity labeling. mBio 0, e03654-03624 (2025).

      (20) Schuchman, E.H. & Desnick, R.J. Types A and B Niemann-Pick disease. Mol Genet Metab 120, 27-33 (2017).

      (21) Miller, M.E., Adhikary, S., Kolokoltsov, A.A. & Davey, R.A. Ebolavirus Requires Acid Sphingomyelinase Activity and Plasma Membrane Sphingomyelin for Infection. Journal of Virology 86, 7473-7483 (2012).

      (22) M. Rühling, L.K., F. Wagner, F. Schumacher, D. Wigger, D. A. Helmerich, T. Pfeuffer, R. Elflein, C. Kappe, M. Sauer, C. Arenz, B. Kleuser, T. Rudel, M. Fraunholz, J. Seibel Trifunctional sphingomyelin derivatives enable nanoscale resolution of sphingomyelin turnover in physiological and infection processes via expansion microscopy. Nat Commun accepted in principle (2024).

      (23) Peters, S. et al. Neisseria meningitidis Type IV Pili Trigger Ca(2+)-Dependent Lysosomal Trafficking of the Acid Sphingomyelinase To Enhance Surface Ceramide Levels. Infect Immun 87 (2019).

      (24) Grassmé, H. et al. Acidic sphingomyelinase mediates entry of N. gonorrhoeae into nonphagocytic cells. Cell 91, 605-615 (1997).

      (25) Li, C. et al. Regulation of Staphylococcus aureus Infection of Macrophages by CD44, Reactive Oxygen Species, and Acid Sphingomyelinase. Antioxid Redox Signal 28, 916-934 (2018).

      (26) Fernandes, M.C. et al. Trypanosoma cruzi subverts the sphingomyelinase-mediated plasma membrane repair pathway for cell invasion. J Exp Med 208, 909-921 (2011).

      (27) Luisoni, S. et al. Co-option of Membrane Wounding Enables Virus Penetration into Cells. Cell Host & Microbe 18, 75-85 (2015).

      (28) Rühling, M. et al. Trifunctional sphingomyelin derivatives enable nanoscale resolution of sphingomyelin turnover in physiological and infection processes via expansion microscopy. Nature Communications 15, 7456 (2024).

      (29) Ellison, C.J., Kukulski, W., Boyle, K.B., Munro, S. & Randow, F. Transbilayer Movement of Sphingomyelin Precedes Catastrophic Breakage of Enterobacteria-Containing Vacuoles. Curr Biol 30, 2974-2983 e2976 (2020).

      (30) Moldovan, A. & Fraunholz, M.J. In or out: Phagosomal escape of Staphylococcus aureus. Cell Microbiol 21, e12997 (2019).

      (31) Slotte, J.P. Biological functions of sphingomyelins. Progress in Lipid Research 52, 424-437 (2013).

      (32) Stelzner, K. et al. Intracellular Staphylococcus aureus Perturbs the Host Cell Ca(2+) Homeostasis To Promote Cell Death. mBio 11 (2020).

      (33) Kunz, T.C. et al. The Expandables: Cracking the Staphylococcal Cell Wall for Expansion Microscopy. Front Cell Infect Microbiol 11, 644750 (2021).

      (34) Sakurai, Y. et al. Ebola virus. Two-pore channels control Ebola virus host cell entry and are drug targets for disease treatment. Science 347, 995-998 (2015).

      (35) Grosz, M. et al. Cytoplasmic replication of Staphylococcus aureus upon phagosomal escape triggered by phenol-soluble modulin alpha. Cell Microbiol 16, 451-465 (2014).

      (36) Giese, B. et al. Staphylococcal alpha-toxin is not sufficient to mediate escape from phagolysosomes in upper-airway epithelial cells. Infect Immun 77, 3611-3625 (2009).

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public Review):

      The study starts with the notion that in an AD-like disease model, ILC2s in the Rag1 knockout were expanded and contained relatively more IL-5<sup>+</sup> and IL-13<sup>+</sup> ILC2s. This was confirmed in the Rag2 knock-out mouse model.

      By using a chimeric mouse model in which wild-type knock-out splenocytes were injected into irradiated Rag1 knock-out mice, it was shown that even though the adaptive lymphocyte compartment was restored, there were increased AD-like symptoms and increased ILC2 expansion and activity. Moreover, in the reverse chimeric model, i.e. injecting a mix of wild-type and Rag1 knock-out splenocytes into irradiated wild-type animals, it was shown that the Rag1 knock-out ILC2s expanded more and were more active. Therefore, the authors could conclude that the RAG1 mediated effects were ILC2 cell-intrinsic.

      Subsequent fate-mapping experiments using the Rag1Cre;reporter mouse model showed that there were indeed RAGnaïve and RAGexp ILC2 populations within naïve mice. Lastly, the authors performed multi-omic profiling, using single-cell RNA sequencing and ATAC-sequencing, in which a specific gene expression profile was associated with ILC2. These included well-known genes but the authors notably also found expression of Ccl1 and Ccr8 within the ILC2. The authors confirmed their earlier observations that in the RAGexp ILC2 population, the Th2 regulome was more suppressed, i.e. more closed, compared to the RAGnaïve population, indicative of the suppressive function of RAG on ILC2 activity. I do agree with the authors' notion that the main weakness was that this study lacks the mechanism by which RAG regulates these changes in ILC2s.

      The manuscript is very well written and easy to follow, and the compelling conclusions are well supported by the data. The experiments are meticulously designed and presented. I wish to commend the authors for the study's quality.

      Even though the study is compelling and well supported by the presented data, some additional context could increase the significance:

      (1) The presence of the RAGnaïve and RAGexp ILC2 populations raises some questions on the (different?) origin of these populations. It is known that there are different waves of ILC2 origin (most notably shown in the Schneider et al Immunity 2019 publication, PMID 31128962). I believe it would be very interesting to further discuss or possibly show if there are different origins for these two ILC populations.

      Several publications describe the presence and origin of ILC2s in/from the thymus (PMIDs 33432227, 24155745). Could the authors discuss whether there might be a common origin for the RAGexp ILC2 and Th2 cells from a thymic lineage? If true that the two populations would be derived from different populations, e.g. being the embryonic (possibly RAGnaïve) vs. adult bone marrow/thymus (possibly RAGexp), this would show a unique functional difference between the embryonic derived ILC2 vs. adult ILC2.

      We agree with the Reviewer that our findings raise important questions about ILC ontogeny. These are areas of ongoing investigation for us, and it is our hope this study may inform further investigation by others as well.

      Regarding the Schneider et al study, we have considered the possibility that RAG expression may mark a particular wave of ILC2 origin. In that study, the authors used a tamoxifen-based inducible Cre strategy in their experiments to precisely time the lineage tracing of a reporter from the Rosa26 locus. Those lineage tracing mice would overlap genetically with the RAG lineage tracing mice we used in our current study, thus performing combined timed migration fate mapping and RAG fate mapping experiments would require creating novel mouse strains.

      Similarly, the possible influence of the thymic or bone marrow environment on RAG expression in ILCs is an exciting possibility. Perhaps there are signals common to those environments that can influence all developing lymphocytes, including not only T and B cells but also ILCs, with one consequence being induction of RAG expression. While assessing levels of RAG-experienced ILCs in these tissues using our lineage tracing mouse may hint at these possibilities, conclusive evidence would require more precise control over the timing of RAG lineage tracing than our current reagents allow (e.g. to control for induction in those environments vs migration of previously fate-mapped cells to those environments).

      To answer these questions directly, we are developing orthogonal lineage tracing mouse strains, which can report on both timing of ILC development and RAG expression, but these mice are not available yet. Given the limitations of our currently available reagents, we were careful to focus our manuscript on the skin phenotype and the more descriptive aspects of the RAG-induced phenotype. We have elaborated on these important questions and referenced all the studies noted by the Reviewer in the Discussion section as areas of future inquiry on lines 421-433.  

      (2) On line 104 & Figures 1C/G etc. the authors describe that in the RAG knock-out ILC2 are relatively more abundant in the lineage negative fraction. On line 108 they further briefly mentioned that this observation is an indication of enhanced ILC2 expansion. Since the study includes an extensive multi-omics analysis, could the authors discuss whether they have seen a correlation of RAG expression in ILC2 with regulation of genes associated with proliferation, which could explain this phenomenon?

      We thank the Reviewer for pointing out this opportunity to further correlate our functional and multiomic findings. To address this, we first looked deeper into our prior analyses and found that among the pathways enriched in GSEA analysis of differentially expressed genes (DEGs) between RAG<sup>+</sup> and RAG<sup>-</sup> ILC2s, one of the pathways suppressed in RAG<sup>+</sup> ILC2s was “GOBP_EPITHELIAL_CELL_PROLIFERATION” (Author response image 1). There are a few other gene sets present in other databases such as MSigDB with terms including “proliferation,” but these are often highly specific to a particular cell type and experimental or disease condition (e.g. tissue-specific cancers). We did not find any of these enriched in our GSEA analysis.

      Author response image 1.

      GSEA plot of GOBP epithelial proliferation pathway in RAG-experienced vs RAG-naïve ILC2s.

      The ability to predict cellular proliferation states from transcriptomic data is an area of active research, and there does not appear to be any universally accepted method to do this reliably. We found two recent studies (PMIDs 34762642; 36201535) that identified novel “proliferation signatures.” Since these gene sets are not present in any curated database, we repeated our GSEA analysis using a customized database with the addition of these gene sets. However, we did not find enrichment of these sets in our RAG+/- ILC2 DEG list. We also applied our GPL strategy integrating analysis of our epigenomic data to the proliferation signature genes, but we did not see any clear trend. Conversely, our GSEA analysis did not identify any enrichment for apoptotic signatures as a potential mechanism by which RAG may suppress ILC2s.
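
      For readers unfamiliar with how a custom gene set is scored against a ranked DEG list, the sketch below computes an unweighted, Kolmogorov-Smirnov-style running-sum enrichment score. It is a simplified illustration and not the GSEA implementation or parameters used for the analyses described here; the function name, gene identifiers and gene sets are hypothetical placeholders, and no permutation-based significance testing is included.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score for one gene set.

    ranked_genes: gene identifiers ordered from most up- to most down-regulated.
    gene_set: collection of identifiers making up the signature of interest.
    Returns the running-sum value with the largest absolute deviation from zero.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list only partially")
    running = 0.0
    best = 0.0
    for is_hit in hits:
        # Step up at signature genes, step down at all other genes.
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranked list and two placeholder signatures.
ranked = ["G1", "G2", "G3", "G4", "G5", "G6"]
top_score = enrichment_score(ranked, {"G1", "G2"})
bottom_score = enrichment_score(ranked, {"G5", "G6"})
```

      A signature concentrated at the top of the ranking gives a positive score and one concentrated at the bottom gives a negative score; a signature scattered evenly through the list stays close to zero, which corresponds to the absence of enrichment reported above for the proliferation signatures.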

      Notwithstanding the limitations of inferring ILC2 proliferation states from transcriptomic and epigenomic data, our experimental data suggest RAG exerts a suppressive effect on ILC2 proliferation. To formally test the hypothesis that RAG suppresses proliferation in the most rigorous way, we feel new mouse strains are needed that allow simultaneous RAG fate mapping and temporally restricted fate mapping. We elaborate on this in new additions to the discussion on lines 421-433.

      Reviewer #2 (Public Review):

      Summary:

      The study by Ver Heul et al., investigates the consequences of RAG expression for type 2 innate lymphoid cell (ILC2) function. RAG expression is essential for the generation of the receptors expressed by B and T cells and their subsequent development. Innate lymphocytes, which arise from the same initial progenitor populations, are in part defined by their ability to develop in the absence of RAG expression. However, it has been described in multiple studies that a significant proportion of innate lymphocytes show a history of Rag expression. In compelling studies several years ago, members of this research team revealed that early Rag expression during the development of Natural Killer cells (Karo et al., Cell 2014), the first described innate lymphocyte, had functional consequences.

      Here, the authors revisit this topic, a worthwhile endeavour given the broad history of Rag expression within all ILCs and the common use of RAG-deficient mice to specifically assess ILC function. Focusing on ILC2s and utilising state-of-the-art approaches, the authors sought to understand whether early expression of Rag during ILC2 development had consequences for activity, fitness, or function. Having identified cell-intrinsic effects in vivo, the authors investigated the causes of this, identifying epigenetic changes associated with the accessibility of genes associated with core ILC2 functions.

      The manuscript is well written and does an excellent job of supporting the reader through reasonably complex transcriptional and epigenetic analyses, with considerate use of explanatory diagrams. Overall I think that the conclusions are fair, the topic is thought-provoking, and the research is likely of broad immunological interest. I think that the extent of functional data and mechanistic insight is appropriate.

      Strengths:

      - The logical and stepwise use of mouse models to first demonstrate the impact on ILC2 function in vivo and a cell-intrinsic role. Initial analyses show enhanced cytokine production by ILC2 from RAG-deficient mice. Then through two different chimeric mice (including BM chimeras), the authors convincingly show this is cell intrinsic and not simply as a result of lymphopenia. This is important given other studies implicating enhanced ILC function in RAG-/- mice reflect altered competition for resources (e.g. cytokines).

      - Use of Rag expression fate mapping to support analyses of how cells were impacted - this enables a robust platform supporting subsequent analyses of the consequences of Rag expression for ILC2.

      - Use of snRNA-seq supports gene expression and chromatin accessibility studies - these reveal clear differences in the data sets consistent with altered ILC2 function.

      - Convincing evidence of epigenetic changes associated with loci strongly linked to ILC2 function. This forms a detailed analysis that potentially helps explain some of the altered ILC2 functions observed in ex vivo stimulation assays.

      - Provision of a wealth of expression data and bioinformatics analyses that can serve as valuable resources to the field.

      We appreciate the strengths noted by the Reviewer for our study. We would like to especially highlight the last point about our single cell dataset and provision of supplemental data tables. Although our study is focused on AD-like skin disease and skin draining lymph nodes, we hope that our findings can serve as a valuable resource for future investigation into mechanisms of RAG modulation of ILC2s in other tissues and disease states.  

      Weaknesses:

      - Lack of insight into precisely how early RAG expression mediates its effects, although I think this is beyond the scale of this current manuscript. Really this is the fundamental next question from the data provided here.

      We thank the Reviewer for their recognition of the context of our current work and its future implications. We aimed to present compelling new observations within the scope of what our current data can substantiate. We believe answering the next fundamental question of the mechanisms by which RAG mediates its effects in ILC2s will require development of novel reagents. We are actively pursuing this, and we look forward to others building on our findings as well.

      - The epigenetic analyses provide evidence of differences in the state of chromatin, but there is no data on what may be interacting or binding at these sites, impeding understanding of what this means mechanistically.

      We thank the Reviewer for pointing out this aspect of the epigenomic data analysis and the opportunity to expand the scope of our manuscript. We performed additional analyses of our data to identify DNA binding motifs and infer potential transcription factors that may be driving the effects of a history of RAG expression that we observed. We hope that these additional data, analyses, and interpretation add meaningful insight for our readers.

      We first performed the analysis for the entire dataset and validated that the analysis yielded results consistent with prior studies (e.g. finding EOMES binding motifs as a marker in NK cells). Then, we examined the differences in RAG fate-mapped ILC2s. These analyses are in new Figure S10 and discussed on lines 277-316.  

      We also performed an analysis specifically on the Th2 locus, given the effects of RAG on type 2 cytokine expression. These analyses are in new Figure S12 and discussed on lines 366-378.

      - Focus on ILC2 from skin-draining lymph nodes rather than the principal site of ILC2 activity itself (the skin). This may well reflect the ease at which cells can be isolated from different tissues.

      We appreciate the Reviewer’s insight into the limitations of our study. Difficulties in isolating ILC2s from the skin were indeed a constraint in our study. In particular, we were unable to isolate enough ILC2s from the skin for stimulation and cytokine staining. Given that one of our main hypotheses was that RAG affects ILC2 function, we focused our studies on skin draining lymph nodes, which allowed measurement of the two main ILC2 functional cytokines, IL-5 and IL-13, as readouts in the key steady state and AD-like disease experiments.

      - Comparison with ILC2 from other sites would have helped to substantiate findings and compensate for the reliance on data on ILC2 from skin-draining lymph nodes, which are not usually assessed amongst ILC2 populations.

      We agree with the Reviewer that a broader survey of the RAG-mediated phenotype in other tissues and by extension other disease models would strengthen the generalizability of our observations. Indeed, we did a more expansive survey of tissues in our BM chimera experiments. We found a similar trend to our reported findings in the sdLN in tissues known to be affected by ILC2s (Author response image 2) including the skin and lung and in other lymphoid tissues including spleen and mesenteric lymph nodes (mLN). We found that donor reconstitution in each tissue was robust except for the skin, where there was no significant difference between host and donor CD45<sup>+</sup> immune cells and where CD45<sup>-</sup> parenchymal cells predominated (Author response image 2A,C,E,G,I). This may explain why Rag1<sup>-/-</sup> donor ILC2s were significantly higher in proportion in all tissues except the skin, where we observed a similar trend that was not statistically significant (Author response image 2B,D,F,H,J).

      Notwithstanding these results, given that we unexpectedly observed enhanced AD-like inflammation in the MC903 model in Rag1 KO mice, we concentrated our later experiments and analyses on defining the differences in skin draining ILC2s modulated by RAG. Our subsequent findings in the skin provoke many new hypotheses about the role of RAG in ILC2s in other tissues, and our tissue survey in the BM chimera provides additional rationale to pursue similar studies in disease models in other tissues. While this is an emerging area of investigation in our lab, we opted to focus this manuscript on our findings related to the AD-like disease model. We have ongoing studies to investigate other tissues, and we are still in the early stages of developing disease models to expand on these findings. However, if the reviewer feels strongly this additional data should be included in the manuscript, we are happy to add it. Considering the complexity of the data and concepts in the manuscript, we hoped to keep it focused to where we have strong molecular, cellular, and phenotypic outcomes.

      Author response image 2.

      Comparison of immune reconstitution and ILC2 donor proportions in different tissues from BM chimeras. Equal quantities of bone marrow cells from Rag1<sup>-/-</sup> (CD45.2, CD90.2) and WT (CD45.2, CD90.1) C57Bl/6J donor mice were used to reconstitute the immune systems of irradiated recipient WT (CD45.1) C57Bl/6J mice. The proportion of live cells that are donor-derived (CD45.2), host-derived (CD45.1), or parenchymal (CD45-) [above] and the proportion of ILC2s that are from Rag1<sup>-/-</sup> (CD90.2) or WT (CD90.1) donors [below] for A,B) skin, C,D) sdLN, E,F) lung, G,H) spleen, and I,J) mLN.

      - The studies of how ILC2 are impacted are a little limited, focused exclusively on IL-13 and IL-5 cytokine expression.

      We agree with the reviewer that our functional readout on IL-5 and IL-13 is relatively narrow. However, this focused experimental design was based on several considerations. First, IL-5 and IL-13 are widely recognized as major ILC2 effector molecules (Vivier et al, 2018, PMID 30142344). Second, in the MC903 model of AD-like disease, we have previously shown a clear correlation between ILC2s, levels of IL-5 and IL-13, and disease severity as measured by ear thickness (Kim et al, 2013, PMID 23363980). Depletion of ILC2s led to decreased levels of IL-13 and IL-5 and correspondingly reduced ear inflammation. However, while ILC2s are also recognized to produce other effector molecules such as IL-9 and Amphiregulin, which are likely involved in human atopic dermatitis (Namkung et al, 2011, PMID 21371865; Rojahn et al, 2020, PMID 32344053), there is currently no evidence linking these effectors to disease severity in the MC903 model. Third, IL-13 is emerging as a key cytokine driving atopic dermatitis in humans (Tsoi et al, 2019, PMID 30641038). Drugs targeting the IL-4/IL-13 receptor (dupilumab), or IL-13 itself (tralokinumab, lebrikizumab), have shown clear efficacy in treating atopic dermatitis. Interestingly, drugs targeting more upstream molecules, like TSLP (tezepelumab) or IL-33 (etokimab), have failed in atopic dermatitis. Taken together, these findings from both mouse and human studies suggest IL-13 is a critical therapeutic target, and thus functional readout, in determining the clinical implications of type 2 immune activation in atopic dermatitis.

      Aside from effector molecules, other readouts such as surface receptors may be of interest in understanding the mechanism of how RAG influences ILC2 function. For example, IL-18 has been shown to be an important co-stimulatory molecule along with TSLP in driving production of IL-13 by cutaneous ILC2s (Ricardo-Gonzalez et al, 2018, PMID 30201992). Our multiomic analysis showed decreased IL-18 receptor regulome activity in RAG-experienced ILC2s, which may be a mechanism by which RAG suppresses IL-13 production. Ultimately, in that study the role of IL-18 in enhancing MC903-induced inflammation through ILC2s was via increased production of IL-13, which was one of our major functional readouts. To clearly define mechanisms like these will require generation of new mice to interrogate RAG status in the context of tissue-specific knockout of other genes, such as the IL-18 receptor. We plan to perform these types of experiments in follow up studies. Notwithstanding this, we have now included additional discussion on lines 476-508 to highlight why understanding how RAG impacts other regulatory and effector pathways would be an interesting area of future inquiry.

      Reviewer #3 (Public Review):

      In this study, Ver Heul et al. investigate the role of RAG expression in ILC2 functions. While RAG genes are not required for the development of ILCs, previous studies have reported a history of expression in these cells. The authors aim to determine the potential consequences of this expression in mature cells. They demonstrate that ILC2s from RAG1 or RAG2 deficient mice exhibit increased expression of IL-5 and IL-13 and suggest that these cells are expanded in the absence of RAG expression. However, it is unclear whether this effect is due to a direct impact of RAG genes or a consequence of the lack of T and B cells in this condition. This ambiguity represents a key issue with this study: distinguishing the direct effects of RAG genes from the indirect consequences of a lymphopenic environment.

      The authors focus their study on ILC2s found in the skin-draining lymph nodes, omitting analysis of tissues where ILC2s are more enriched, such as the gut, lungs, and fat tissue. This approach is surprising given the goal of evaluating the role of RAG genes in ILC2s across different tissues. The study shows that ILC2s derived from RAG-/- mice are more activated than those from WT mice, and RAG-deficient mice show increased inflammation in an atopic dermatitis (AD)-like disease model. The authors use an elegant model to distinguish ILC2s with a history of RAG expression from those that never expressed RAG genes. However, this model is currently limited to transcriptional and epigenomic analyses, which suggest that RAG genes suppress the type 2 regulome at the Th2 locus in ILC2s.

      We agree with the Reviewer that understanding the role of RAG in ILC2s across different tissues is an important goal. One of the primary inspirations for our paper was the clinical paradox that patients with Omenn syndrome, despite having profound adaptive T cell deficiency, develop AD with much greater penetrance than in the general population. Thus, there was always an appreciation for the likelihood that skin ILC2s have a unique proclivity towards the development of AD-like disease. Notwithstanding this, given the profound differences that can be found in ILC2s based on their tissue residence and disease state (as the Reviewer also points out below), we focused our investigations on characterizing the skin draining lymph nodes to better define factors underlying our initial observations of enhanced AD-like disease in Rag1<sup>-/-</sup> mice. While our findings in skin provoke the hypothesis that similar effects may be observed in other tissues and influence corresponding disease states, we were cautious not to suggest this may be the case by reporting surveys of other tissues without development of additional disease models to formally test these hypotheses. We present this manuscript now as a short, skin-focused study, rather than delaying publication to expand its scope. Truthfully, this project started in 2015 and has undergone many delays with the hopes of newer technologies and reagents coming to add greater clarity. We hope our study will enable others to pursue the goal of understanding the broader effects of RAG in ILC2s, and potentially other innate lymphoid lineages as well.

      We did a more expansive survey of tissues in our BM chimera experiments. We found a similar trend to our reported findings in the sdLN in tissues known to be affected by ILC2s (Author response image 2) including the skin and lung and in other lymphoid tissues including spleen and mesenteric lymph nodes (mLN). We found that donor reconstitution in each tissue was robust except for the skin, where there was no significant difference between host and donor CD45<sup>+</sup> immune cells and where CD45<sup>-</sup> parenchymal cells predominated (Author response image 2A,C,E,G,I). This may explain why Rag1<sup>-/-</sup> donor ILC2s were significantly higher in proportion in all tissues except the skin, where we observed a similar trend that was not statistically significant (Author response image 2B,D,F,H,J). Given the lack of correlation to disease readouts in other organ systems, we chose not to include these data in our manuscript. However, if the Reviewer feels these data should be included, we would be happy to include them as a supplemental figure.

      The authors report a higher frequency of ILC2s in RAG-/- mice in skin-draining lymph nodes, which is expected as these mice lack T and B cells, leading to ILC expansion. Previous studies have reported hyper-activation of ILCs in RAG-deficient mice, suggesting that this is not necessarily an intrinsic phenomenon. For example, RAG-/- mice exhibit hyperphosphorylation of STAT3 in the gut, leading to hyperactivation of ILC3s. This study does not currently provide conclusive evidence of an intrinsic role of RAG genes in the hyperactivation of ILC2s. The splenocyte chimera model is artificial and does not reflect a normal environment in tissues other than the spleen. Similarly, the mixed BM model does not demonstrate an intrinsic role of RAG genes, as RAG1-/- BM cells cannot contribute to the B and T cell pool, leading to an expected expansion of ILC2s. As the data are currently presented, it is expected that a proportion of IL-5-producing cells will come from the RAG1-/- BM.

      The Reviewer raises an important point about the potential cell-intrinsic roles of RAG vs the many cell-extrinsic explanations that could affect ILC2 populations, with the most striking being the lack of T and B cells in RAG knockout mice. It is well-established that splenocyte transfer into T and B cell-deficient mice reconstitutes T cell-mediated effects (such as the T cell transfer colitis model pioneered by Powrie and others), and we were careful in our interpretation of the splenocyte chimera experiment to conclude only that lack of Tregs was unlikely to explain the enhanced AD-like disease in T (and B) cell-deficient mice.

      We agree with the Reviewer that the Rag1<sup>-/-</sup> BM will not contribute to the B and T cell pool. However, BM from the WT mice would be expected to contribute to development of the adaptive lymphocyte pool. Indeed, we found that most of the CD45<sup>+</sup> immune cells in the spleens of BM chimera mice were donor-derived (Author response image 3A), and total levels of B cells and T cells showed reconstitution in a pattern similar to control spleens from donor WT mice, while spleens from donor Rag1<sup>-/-</sup> mice expectedly had essentially no detectable adaptive lymphocytes (Author response image 3B-D). From this, we concluded the BM chimera experiment was successful in establishing an immune environment with the presence of adaptive lymphocytes, and the differences in ILC2 proportions we observed were in the context of developing alongside a normal number of B and T lymphocytes. Notwithstanding the potential role of the adaptive lymphocyte compartment in shaping ILC2 development, since we transplanted equal amounts of WT and Rag1<sup>-/-</sup> BM into the same recipient environment, we are not able to explain how cell-extrinsic effects alone would account for the unequal numbers of WT vs Rag1<sup>-/-</sup> ILC2s we observed after immune reconstitution.

      Author response image 3.

      Comparison of immune reconstitution in BM chimeras to controls. Equal quantities of bone marrow cells from Rag1<sup>-/-</sup> (CD45.2) and WT (CD45.2) C57Bl/6J donor mice were used to reconstitute the immune systems of irradiated recipient WT (CD45.1) C57Bl/6J mice. A) Number of WT recipient CD45.1+ immune cells in the spleens of recipient mice compared to number of donor CD45.2+ cells (WT and Rag1<sup>-/-</sup>) normalized to 100,000 live cells. Comparison of numbers of B cells, CD4+ T cells, and CD8+ T cells in spleens of B) BM chimera mice, C) control WT mice and D) control Rag1<sup>-/-</sup> mice.

      We also subsequently found transcriptional and epigenomic differences in RAG-experienced ILC2s compared to RAG-naïve ILC2s. Critically, these differences were present in ILC2s from the same mice that had developed normally within an intact immune system, rather than in the setting of a BM transplant or a defective immune background such as in Rag1<sup>-/-</sup> mice.

      We recognize that there are almost certainly cell-extrinsic factors affecting ILC2s in Rag1<sup>-/-</sup> mice due to lack of B and T cells, and that BM chimeras are not perfect substitutes for simulating normal hematopoietic development. However, the presence of cell-extrinsic effects does not negate the potential contribution of cell-intrinsic factors as well, and we respectfully stand by our conclusion that our data support a role, however significant, for cell-intrinsic effects of RAG in ILC2s.

      Finally, the Reviewer mentions the interesting observation that gut ILC3s exhibit hyperphosphorylation of STAT3 in Rag1<sup>-/-</sup> mice compared to WT as an example of cell-extrinsic effects of RAG deficiency (we assume this is in reference to Mao et al, 2018, PMID 29364878 and subsequent work). We now reference this paper and have included additional discussion on how our observations of ILC2s may be generalizable to not only other organ systems, but also other ILC subsets, limitations on these generalizations, and future directions on lines 477-520.

      Overall, the level of analysis could be improved. Total cell numbers are not presented, the response of other immune cells to IL-5 and IL-13 (except the eosinophils in the splenocyte chimera mice) is not analyzed, and the analysis is limited to skin-draining lymph nodes.

      We thank the Reviewer for the suggestions to add rigor to our analysis. ILC2 populations are relatively rare, and we designed our experiments to assess frequencies, rather than absolute numbers. We did not utilize counting beads, so our counts may not be comparable between samples. We have added additional data for absolute cell counts normalized to 100,000 live cells for each experiment (see below for a summary of new panels in each figure). Our new data on total cell numbers are consistent with the initial observations regarding frequency of ILC2s we reported from our experiments. For the BM chimera experiments, we presented the proportions of ILC2s, and IL-5 and IL-13 positive ILC2s, by donor source, as this is the critical question of the experiment. Notwithstanding our analysis by proportion, we found that the frequency of Rag1<sup>-/-</sup> ILC2s, IL-5<sup>+</sup> cells, or IL-13<sup>+</sup> cells within the Lin- population was also significantly increased. While our initial submission included only the proportions for clarity and simplicity, we now include frequency and absolute numbers in new panels for more critical appraisal of our data by readers.
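
      For readers less familiar with these readouts, the two quantities discussed above can be written out as simple arithmetic. The sketch below is illustrative only and is not the authors' analysis code; the function names and the example event counts are hypothetical.

```python
def per_100k_live(population_count, live_count, scale=100_000):
    """Scale a gated population count to a fixed number of live cells."""
    if live_count <= 0:
        raise ValueError("live_count must be positive")
    # Multiply before dividing so that whole-number results stay exact.
    return population_count * scale / live_count


def donor_proportions(counts_by_donor):
    """Return the fraction of a population contributed by each donor."""
    total = sum(counts_by_donor.values())
    if total == 0:
        raise ValueError("no cells in the gated population")
    return {donor: count / total for donor, count in counts_by_donor.items()}


# Hypothetical flow cytometry event counts for one sample.
ilc2_per_100k = per_100k_live(population_count=50, live_count=250_000)

# Hypothetical ILC2 counts split by congenic donor marker.
proportions = donor_proportions({"WT (CD90.1)": 10, "Rag1-/- (CD90.2)": 30})
```

      Normalizing to a fixed number of live cells makes samples with different acquisition sizes comparable, whereas the donor proportion addresses the within-recipient comparison that the mixed chimera is designed to test.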

      In New Figure 1, we added new panels for ILC2 cell number in both the AD-like disease experiment (C) and in steady state (H).

      In New Figure S2, we added a panel for ILC2 cell number in steady state (B).

      In Figure 2 and associated supplemental data in Figure S4, we added several more panels. For the splenocyte chimera, we added a panel for ILC2 cell number in New Figure 2C.

      We incorporated multiple new panels in New Figure S4 to address the need for more data to be shown for the BM chimera (also requested by Reviewer #2). These included total cell counts and frequency for ILC2 (New Figure S4F,G), and IL-5<sup>+</sup> (New Figure S4I,K) and IL-13<sup>+</sup> (New Figure S4J,L) ILCs in addition to the proportions originally presented in Figure 2.  

      In terms of the limited analysis of other tissues, our initial observation of enhanced AD-like disease in Rag1<sup>-/-</sup> compared to WT mice built on our prior work elucidating the role of ILC2s in the MC903 model of AD-like disease in mice and AD in humans (Kim et al, 2013, PMID 23363980). Consequently, we focused on the skin to further develop our understanding of the role of RAG1 in this model. As in our prior studies, technical limitations in obtaining sufficient numbers of ILC2s from the skin itself for ex vivo stimulation to assess effector cytokine levels required performing these experiments in the skin draining lymph nodes.

      We agree that IL-5 and IL-13 are major mediators of type 2 pathology and studying their effects on immune cells is an important area of inquiry, particularly since there are multiple drugs available or in development targeting these pathways. However, our goal was not to study what was happening downstream of increased cytokine production from ILC2s, but instead to understand what was different about RAG-deficient or RAG-naïve ILC2s themselves that drives their expansion and production of effector cytokines compared to RAG-sufficient or RAG-experienced ILC2s. By utilizing the same MC903 model in which we previously showed a critical role for ILC2s in driving IL-5 and IL-13 production and subsequent inflammation in the skin, we were able to instead focus on defining the cell-intrinsic aspects of RAG function in ILC2s.

      The authors have a promising model in which they can track ILC2s that have expressed RAG or not. They need to perform a comprehensive characterization of ILC2s in these mice, which develop in a normal environment with T and B cells. Approximately 50% of the ILC2s have a history of RAG expression. It would be valuable to know whether these cells differ from ILC2s that never expressed RAG, in terms of proliferation and expression of IL-5 and IL-13. These analyses should be conducted in different tissues, as ILC2s adapt their phenotype and transcriptional landscape to their environment. Additionally, the authors should perform their AD-like disease model in these mice.

      We agree with the Reviewer (and a similar comment from Reviewer #2) that a broader survey of the RAG-mediated phenotype in other tissues and by extension other disease models would strengthen the generalizability of our observations. Indeed, we did a more expansive survey of tissues in our BM chimera experiments. We found a similar trend to our reported findings in the sdLN in tissues known to be affected by ILC2s (Author response image 2) including the skin and lung and in other lymphoid tissues including spleen and mesenteric lymph nodes (mLN). We found that donor reconstitution in each tissue was robust except for the skin, where there was no significant difference between host and donor CD45<sup>+</sup> immune cells and where CD45<sup>-</sup> parenchymal cells predominated (Author response image 2A,C,E,G,I). This may explain why Rag1<sup>-/-</sup> donor ILC2s were significantly higher in proportion in all tissues except the skin, where we observed a similar trend that was not statistically significant (Author response image 2B,D,F,H,J). We omitted these analyses to maintain the focus on the skin, but we will be happy to add these data to the manuscript if the Reviewer feels this figure would be helpful.

      Notwithstanding these results, given that we unexpectedly observed enhanced AD-like inflammation in the MC903 model in Rag1 KO mice, we concentrated our later experiments and analyses on defining the differences in skin draining ILC2s modulated by RAG. Our subsequent findings in the skin provoke many new hypotheses about the role of RAG in ILC2s in other tissues, and our tissue survey in the BM chimera provides additional rationale to pursue similar studies in disease models in other tissues. While this is an emerging area of investigation in our lab, we opted to focus this manuscript on our findings related to the AD-like disease model. We have ongoing studies to investigate other tissues, and we are still in the early stages of developing disease models to expand on these findings. However, if the reviewer feels strongly this additional data should be included in the manuscript, we are happy to add it. Considering the complexity of the data and concepts in the manuscript, we hoped to keep it focused to where we have strong molecular, cellular, and phenotypic outcomes. We elaborate on the implications of our work for future studies, including limitations of our study and currently available reagents and need for new mouse strains to rigorously answer these questions on lines 476-508.

      The authors provide a valuable dataset of single-nuclei RNA sequencing (snRNA-seq) and ATAC sequencing (snATAC-seq) from RAGexp (RAG fate map-positive) and RAGnaïve (RAG fate map-negative) ILC2s. This elegant approach demonstrates that ILC2s with a history of RAG expression are epigenomically suppressed. However, key genes such as IL-5 and IL-13 do not appear to be differentially regulated between RAGexp and RAGnaïve ILC2s according to Table S5. Although the authors show that the regulome activity of IL-5 and IL-13 is decreased in RAGexp ILC2s, how do the authors explain that these genes are not differentially expressed between the RAGexp and RAGnaïve ILC2? I think that it is important to validate this in vivo.

      We thank the Reviewer for highlighting the value and possible elegance of our data. The Reviewer brings up an important issue that we grappled with in this study and that highlights a major technical limitation of single cell sequencing studies. Genes for secreted factors such as cytokines are often transcribed at low levels and are poorly detected in transcriptomic studies. This is particularly true in single cell studies with lower sequencing depth. Various efforts have been made to overcome these issues such as computational approaches to estimate missing data (e.g. van Dijk et al, 2018, PMID 29961576; Huang et al, 2018, PMID 29941873), or recent use of cytokine reporter mice and dial-out PCR to enhance key cytokine signals in sequenced ILCs (Bielecki et al, 2021, PMID 33536623). To avoid the risk of introducing artifacts into the data, we did not utilize these computational methods, and we did not perform our study in cytokine reporter mice. Thus, cytokines were poorly detected in our transcriptomic data, as evidenced by lack of identification of cytokines as markers for specific clusters (e.g. IL-5 for ILC2s) or significant differential expression between RAG-naïve and RAG-experienced ILC2s.

      However, the multiomic features of our data allowed a synergistic analysis to identify effects on cytokines. For example, transcripts for IL-4 and IL-5 were not detected at a high enough level to qualify as marker genes of the ILC2 cluster in the gene expression (GEX) assay but were identified as markers for the ILC2 cluster in the ATAC-seq data in the differentially accessible chromatin (DA) assay. Using the combined RNA-seq and ATAC-seq gene-to-peak links (GPL) analyses, many GPLs were identified in the Th2 locus for ILC2s, including for IL-13, which was not identified as a marker for ILC2s by any of the assays alone. Thus, our combined analysis took advantage of the potential of multiomic datasets to overcome a general weakness inherent to most scRNAseq datasets.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      - Line 168; Reference 23 also showed expression in the NK cells, please add this reference to reference 24.

      We thank the reviewer for catching this oversight, and we have corrected it in the revised manuscript.

      - Please add the full names for GPL and sdLN in the text of the manuscript when first using these abbreviations. They are now only explained in the legends.

      We reviewed the manuscript text and found that we defined sdLNs for the first time on line 104. We defined GPLs for the first time on line 248. We believe these definitions are placed appropriately near the first references to the corresponding figures/analysis, but if the Reviewer believes we should move these definitions earlier, we are happy to do so.

      Reviewer #2 (Recommendations For The Authors):

      I would suggest that the following reanalyses would improve the clarity of the data:

      - Can ILC2 numbers, rather than frequency, be used (e.g. in Figure 1C, S2B, and so on). This would substantiate the data that currently relies on percentages.

      This was a weakness also noted by Reviewer #3. We have added data on ILC2 numbers for each experiment as outlined below:

      In New Figure 1, we added new panels for ILC2 cell number in both the AD-like disease experiment (C) and in steady state (H).

      In New Figure S2, we added a panel for ILC2 cell number in steady state (B).

      In Figure 2 and associated supplemental data in Figure S4, we added several more panels. For the splenocyte chimera, we added a panel for ILC2 cell number in New Figure 2C.

      We incorporated multiple new panels in New Figure S4 to address the need for more data to be shown for the BM chimera (also requested by Reviewer #2). These included total cell counts and frequency for ILC2 (New Figure S4F,G), and IL-5<sup>+</sup> (New Figure S4I,K) and IL-13<sup>+</sup> (New Figure S4J,L) ILCs in addition to the proportions originally presented in Figure 2.  

      - Can the authors provide data on IL-33R expression on sdLN ILC2s? Expression of ST-2 (IL-33R) does vary between ILC2 populations and is impacted by the digestion of tissue. All of the data provided here requires ILC2 to be IL-33R<sup>+</sup>. In the control samples, the ILC2 compartment is very scarce - in LNs, ILC2s are rare. The gating strategy with limited resolution of positive and negative cells in the lineage gate doesn't help this analysis.

      The Reviewer raises a valid point regarding the IL-33R marker and ILC2s. We designed our initial experiments to be consistent with our earlier observations of skin ILC2s, which were defined as CD45<sup>+</sup>Lin<sup>-</sup>CD90<sup>+</sup>CD25<sup>+</sup>IL-33R<sup>+</sup>, and the scarcity of skin-draining lymph node ILC2s at steady state was consistent with our prior findings (Kim et al, 2013, PMID 23363980). We can include MFI data on IL-33R expression in these cells if the Reviewer feels strongly that this would add to the manuscript, but we did not include other ILC2-specific markers in these experiments that would give us an alternative total ILC2 count from which to calculate the frequency of IL-33R<sup>+</sup> ILC2s, which would also make the context of the IL-33R MFI difficult to interpret.

      Other studies defining tissue-specific expression patterns in ILC2s have called into question whether IL-33R is a reliable marker to define skin ILC2s (Ricardo-Gonzalez et al, 2018, PMID 30201992). However, there is evidence for region-specific expression of IL-33R (Kobayashi et al, 2019, PMID 30712873), with ILC2s in the subcutis expressing high levels of IL-33R and of both IL-5 and IL-13, while ILC2s in the epidermis and dermis have low levels of IL-33R and IL-5 expression. In contrast to the Kobayashi et al study, Ricardo-Gonzalez et al sequenced ILC2s from whole skin; thus the region-specific expression patterns were not preserved, and the lower expression of IL-33R in the epidermis and dermis may have diluted the signal from the ILC2s in the subcutis. These may also be the ILC2s most likely to drain into the lymph nodes, which is the tissue on which we focused our analyses (consistent with our prior work in Kim et al, 2013).

      - In Figure 2 (related to 2H, 2I) can flow plots of the IL-5 versus IL-13 gated on either CD90.1+CD45.2+ or CD90.2+CD45.2+ ILC2 be shown? I.e. gate on the ILC2s and show cytokine expression, rather than the proportion of donor IL5/13. The proportion of donor ILC2 is shown to be significantly higher in 2G. Therefore gating on the cells of interest and showing on a cellular basis their ability to produce the cytokines would better make the point I think.

      We agree that this is important additional data to include. We have added flow plots of sdLN ILC2s from the BM chimera divided by donor genotype showing IL-5 and IL-13 expression in New Figure S4H.

      I assume the authors have looked and there is no obvious data, but does analysis of transcription factor consensus binding sequences in the open chromatin provide any new insight?

      The Reviewer also commented on this in the public review. As copied from our response above:

      We found that the most enriched sites in the ILC2 gene loci contained the consensus sequence GGGCGG (or its reverse complement), a motif recognized by a variety of zinc finger transcription factors (TFs). Predictions from our analyses predicted the KLF family of zinc finger TFs as most likely to be enriched at the identified open chromatin regions. To infer which KLFs might be occupying these sites in the RAG-experienced or RAG-naïve cells, we also assessed the expression levels of these identified TFs. Interestingly, KLF2 and KLF6 are more expressed in RAG-experienced ILC2s. KLF6 is a tumor suppressor (PMID: 11752579), and both KLF6 and KLF2 were recently shown to be markers of “quiescent-like” ILCs (PMID: 33536623). Further, upon analysis of the Th2 locus, the (A/T)GATA(A/G) consensus site (or reverse complement) was enriched in identified open chromatin at that locus. The algorithm predicted multiple TFs from the GATA family as possible binding partners, but expression analysis showed only GATA3 was highly expressed in ILC2s, consistent with what would be predicted from prior studies (PMID: 9160750).

      We have added this data in new Figure S10 and new Figure S12, with corresponding text in the Results section on lines 277-316 and lines 366-378.
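
      The two consensus sites named above can be searched for on both strands with a few lines of code. The following is only a minimal sketch of that idea, not the motif-enrichment pipeline used for the analysis: the function names and the example sequence are hypothetical, and a real analysis would score position weight matrices against background rather than match exact strings.

```python
import re

# Consensus sites quoted in the response, written as regular expressions.
# (A/T)GATA(A/G) becomes a character class at the first and last position.
GC_BOX = "GGGCGG"
GATA_SITE = "[AT]GATA[AG]"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_sites(seq: str, pattern: str) -> list[tuple[int, str]]:
    """Return (0-based start on the forward strand, strand) for every match.

    A hit on the '-' strand means the reverse complement of the motif
    occurs at that forward-strand position.
    """
    seq = seq.upper()
    n = len(seq)
    # A lookahead with a capture group also reports overlapping matches.
    regex = re.compile(f"(?=({pattern}))")
    hits = [(m.start(), "+") for m in regex.finditer(seq)]
    for m in regex.finditer(reverse_complement(seq)):
        # Map the position on the reversed strand back to forward coordinates.
        hits.append((n - m.start() - len(m.group(1)), "-"))
    return sorted(set(hits))


# Hypothetical toy sequence: one GC box on each strand.
example = "AAGGGCGGTTCCGCCCAA"
print(find_sites(example, GC_BOX))  # [(2, '+'), (10, '-')]
```

      Reporting minus-strand hits in forward-strand coordinates keeps the output directly comparable with peak coordinates from an accessibility assay.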

      In terms of phrasing and presentation:

      - It would help to provide some explanation of why all analyses focus on the draining LNs rather than the actual site of inflammation (the ear skin). I do not think it appropriate to ask for data on this as this would require extensive further experimentation, but there should be some discussion on this topic. This feels relevant given that the skin is the site of inflammatory insult and ILC2 is present here. How the ILC2 compartment in the skin-draining lymph nodes relates to those in the skin is not completely clear, particularly given the prevailing dogma that ILC2 are tissue-resident.

      Given the limitations of assessing cytokine production in the relatively rare population of skin-resident ILC2s, we focused on the skin-draining lymph nodes (sdLN). Our findings in the current manuscript are consistent with our prior work in Kim et al, 2013 (PMID 23363980), and more recently in Tamari et al, 2024 (PMID 38134932), which demonstrated correlation of increased ILC2s in sdLN with increased skin inflammation in the MC903 model. Similarly, Dutton et al (PMID 31152090) have demonstrated expansion of the sdLN ILC2 pool in response to MC903-induced AD-like inflammation in mice. On lines 476-508, we elaborate on the implications of our work for future studies, the limitations of our study (including the focus on the sdLN), and the currently available reagents and the need for new mouse strains to rigorously answer these questions.

      - I think the authors should explicitly state that cytokine production is assessed after ex vivo restimulation (e.g. Lines 112-113).

      We have added this statement to the revised text.

      - I also think that it would help to be consistent with axis scales where analyses are comparable (e.g. Figure 1D vs Figure 1H).

      We agree with the Reviewer and we have adjusted the axes for consistency. The data remains unchanged, but axes are slightly adjusted in New Figure 1 (D&I, E&J, F&K) and New Figure S2 (C-E match New Figure 1 D-F). This same axis scaling scheme is carried forward to New Figure 2 (D-E) and New Figure S4 (G,K,L). New data on cell counts is also included per request by Reviewers 2 and 3 (see above). However, we found results for total cells, including ILC2s (New Figure 1C,H, New Figure S2B, New Figure 2C, New Figure S4F), were consistent within experiments, but not between experiments, likely representing issues with normalizing counts (we did not include counting beads for more accurate total counts). Thus, the y-axes in those panels are not consistent between experiments/figures.

      We feel reporting the proportion of WT vs Rag1<sup>-/-</sup> donor cells for the BM chimera is most illustrative of the effect of RAG and have kept it in the main New Figure 2, but for the BM chimera experiment panels we also include the total counts of IL-5<sup>+</sup> and IL-13<sup>+</sup> ILC2s (New Figure S4I,J).

    1. Author response:

      The following is the authors’ response to the original reviews.

      In summary, the changes made in the revision process include:

      An addition of a paragraph in the result section that discusses the absolute values of measured Young’s moduli in the light of probing frequencies, accompanied by a new supplementary figure and a supplementary table that support that discussion

      - Fig. S10. Absolute Young’s modulus values across the frequencies characteristic for the three measurement methods.

      - Table S9. Operation parameters of the three methods used for characterizing the mechanical properties of cells.

      Three new supplementary figures that display the expression matrices for the genes from the identified modules in carcinoma datasets used for validation:

      - Fig. S4. Expression of identified target genes in the CCLE microarray dataset used for validation.

      - Fig. S5. Expression of identified target genes in the CCLE RNA-Seq dataset used for validation.

      - Fig. S6. Expression of identified target genes in the Genentech dataset used for validation.

      An addition of a paragraph in the discussion section that discusses the intracellular origins of resistance to deformation and the dominance of actin cortex at low deformations.

      Refinement of the manuscript text and figures based on the specific feedback from the Reviewers.

      Please see below for detailed responses to the Reviewers’ comments.

      Reviewer #1 (Public Review)

      In this work, Urbanska and colleagues use a machine-learning based crossing of mechanical characterisations of various cells in different states and their transcriptional profiles. Using this approach, they identify a core set of five genes that systematically vary together with the mechanical state of the cells, although not always in the same direction depending on the conditions. They show that the combined transcriptional changes in this gene set are strongly predictive of a change in the cell mechanical properties, in systems that were not used to identify the genes (a validation set). Finally, they experimentally alter the expression level of one of these genes, CAV1, that codes for the caveolin 1 protein, and show that, in a variety of cellular systems and contexts, perturbations in the expression level of CAV1 also induce changes in cell mechanics, cells with lower CAV1 expression being generally softer.

      Overall the approach seems accessible, sound and is well described. My personal expertise is not suited to judge its validity, novelty or relevance, so I do not make comments on that. The results it provides seem to have been thoroughly tested by the authors (using different types of mechanical characterisations of the cells) and to be robust in their predictive value. The authors also show convincingly that one of the genes they identified, CAV1, is not only correlated with the mechanical properties of cells, but also that changing its expression level affects cell mechanics. At this stage, the study appears mostly focused on the description and validation of the methodological approach, and it is hard to really understand what the results obtained really mean, the importance of the biological finding - what is this set of 5 genes doing in the context of cell mechanics? Is it really central, or is it just one of the set of knobs on which the cell plays - and it is identified by this method because it is systematically modulated but maybe, for any given context, it is not the dominant player - all these fundamental questions remain unanswered at this stage. On one hand, it means that the study might have identified an important novel module of genes in cell mechanics, but on the other hand, it also reveals that it is not yet easy to interpret the results provided by this type of novel approach.

      We thank Reviewer #1 for the thoughtful evaluation of our manuscript. The primary goal of the manuscript was to present a demonstration of an unbiased approach for the identification of genes involved in the regulation of cell mechanics. The manuscript further provides a comprehensive computational validation of all genes from the identified network, and experimental validation of a selected gene, CAV1.

      We agree that at the current stage, far-reaching conclusions about the biological meaning of the identified network cannot be made. We are, however, convinced that the identification of an apparently central player such as CAV1 across various cellular systems is per se meaningful, in particular since CAV1 modulation shows clear effects on the cell mechanical state in several cell types. 

      We anticipate that our findings will encourage more mechanistic studies in the future, investigating how these identified genes regulate mechanical properties and interact with each other. Notwithstanding, the identified genes (after testing in the specific system of interest) can be readily used as genetic targets for modulating the mechanical properties of cells. Access to such modifications is of huge relevance not only for further research on the functional consequences of cell mechanics changes (in particular in in vivo systems, where using chemical perturbations is not always possible), but also for potential future use in modulating the mechanical properties of cells to prevent disease (for example to inhibit cancer metastasis or increase the efficacy of cancer cell killing by cytotoxic T cells).

      We have now added the following sentence to the first paragraph of the discussion to acknowledge the open ends of our study:

      “(...). Here we leveraged this opportunity by performing discriminative network analysis on transcriptomes associated with mechanical phenotype changes to elucidate a conserved module of five genes potentially involved in cell mechanical phenotype regulation. We provided evidence that the inferred conserved functional network module contains an ensemble of five genes that, in particular when combined in a unique combinatorial marker, are universal, specific and trustworthy markers of mechanical phenotype across the studied mouse and human systems. We further demonstrated on the example of a selected marker gene, CAV1, that its experimental up- and downregulation impacts the stiffness of the measured cells. This demonstrates that the level of CAV1 not only correlates with, but also is causative of mechanical phenotype change. The mechanistic insights into how precisely the identified genes are involved in regulating mechanical properties, how they interact with each other, and whether they are universal and dominant in various contexts all remain to be established in future studies.”

      Reviewer #2 (Public Review)

      A key strength is that the quantitative approaches all add rigor to what is being attempted. The approach with very different cell culture lines will in principle help identify constitutive genes that vary in a particular and predictable way. To my knowledge, one other study that should be cited posed a similar pan-tissue question using mass spectrometry proteomics instead of gene expression, and also identified a caveolae component (cavin-1, PTRF) that exhibited a trend with stiffness across all sampled tissues. The study focused instead on a nuclear lamina protein that was also perturbed in vitro and shown to follow the expected mechanical trend (Swift et al 2013).

      We thank Reviewer #2 for the positive evaluation of the breadth of the results and for pointing us to the relevant reference for the proteomic analysis related to tissue stiffness (Swift et al., 2013). This study, which focused primarily on tissue-level mechanical properties, identified PTRF, a caveolar component, which links to our observation of another caveolar component, CAV1, at the single-cell level.

      We have now included the citation in the following paragraph of the discussion:

      “To our knowledge, there are no prior studies that aim at identifying gene signatures associated with single-cell mechanical phenotype changes, in particular across different cell types. There are, however, several studies that investigated changes in expression upon exposure of specific cell types to mechanical stimuli such as compression (87, 88) or mechanical stretch (22, 80, 89), and one study that investigated difference in expression profiles between stiffer and softer cells sorted from the same population (90). Even though the studies concerned with response to mechanical stimuli answer a fundamentally different question (how gene expression changes upon exposure to external forces vs which genes are expressed in cells of different mechanical phenotype), we did observe some similarities in the identified genes. For example, in the differentially expressed genes identified in the lung epithelia exposed to compression (87), three genes from our module overlapped with the immediate response (CAV1, FHL2, TAGLN) and four with the long-term one (CAV1, FHL2, TAGLN, THBS1). We speculate that this substantial overlap is caused by the cells undergoing change in their stiffness during the response to compression (and concomitant unjamming transition). Another previous study explored the association between the stiffness of various tissues and their proteomes. Despite the focus on the tissue-scale rather than single-cell elasticity, the authors identified polymerase I and transcript release factor (PTRF, also known as cavin 1 and encoding for a structural component of the caveolae) as one of the proteins that scaled with tissue stiffness across samples (91).”

      Reviewer #3 (Public Review)

      In this work, Urbanska et al. link the mechanical phenotypes of human glioblastoma cell lines and murine iPSCs to their transcriptome, and using machine learning-based network analysis identify genes with putative roles in cell mechanics regulation. The authors identify 5 target genes whose transcription creates a combinatorial marker which can predict cell stiffness in human carcinoma and breast epithelium cell lines as well as in developing mouse neurons. For one of the target genes, caveolin1 (CAV1), the authors perform knockout, knockdown, overexpression and rescue experiments in human carcinoma and breast epithelium cell lines. They determine the cell stiffness via RT-DC, AFM indentation and AFM rheology and confirm that high CAV1 expression levels correlate with increased stiffness in those model systems. This work brings forward an interesting approach to identify novel genes in an unbiased manner, but surprisingly the authors validate caveolin 1, a target gene with known roles in cell mechanics regulation. 

      I have two main concerns with the current version of this work: 

      (1) The authors identify a network of 5 genes that can predict mechanics. What is the relationship between the 5 genes? If the authors aim to highlight the power of their approach by knockdown, knockout or over-expression of a single gene, why choose CAV1 (which has an individual p-value of 0.16 in Fig S4)? To justify their choice, the authors claim that there is limited data supporting the direct impact of CAV1 on mechanical properties of cells, but several studies have previously shown its role in, for example, zebrafish heart stiffness, where a knockout leads to higher stiffness (Grivas et al., Scientific Reports 2020), in cancer cells, where a knockdown leads to cell softening (Lin et al., Oncotarget 2015), or in endothelial cells, where a knockout leads to cell softening (Le Master et al., Scientific Reports 2022).

      We thank the reviewer for their comments. First, we do acknowledge that studying the relationship between the five identified genes is an intriguing question and would be a natural extension of the currently presented work. It is, however, beyond the scope of the presented manuscript, in which our primary goal was to introduce a general pipeline for de novo identification of genes related to cell mechanics. We added the following statement to the discussion (yellow highlight) to acknowledge the open ends of our study:

      “The mechanical phenotype of cells is recognized as a hallmark of many physiological and pathological processes. Understanding how to control it is a necessary next step that will facilitate exploring the impact of cell mechanics perturbations on cell and tissue function (76).

      The increasing availability of transcriptional profiles accompanying cell state changes has recently been complemented by the ease of screening for mechanical phenotypes of cells thanks to the advent of high-throughput microfluidic methods (77). This provides an opportunity for data-driven identification of genes associated with the mechanical cell phenotype change in a hypothesis-free manner. Here we leveraged this opportunity by performing discriminative network analysis on transcriptomes associated with mechanical phenotype changes to elucidate a conserved module of five genes potentially involved in cell mechanical phenotype regulation. We provided evidence that the inferred conserved functional network module contains an ensemble of five genes that, in particular when combined in a unique combinatorial marker, are universal, specific and trustworthy markers of mechanical phenotype across the studied mouse and human systems. We further demonstrated on the example of a selected marker gene, CAV1, that its experimental up- and downregulation impacts the stiffness of the measured cells. This demonstrates that the level of CAV1 not only correlates with, but also is causative of mechanical phenotype change. The mechanistic insights into how precisely the identified genes are involved in regulating mechanical properties, how they interact with each other, and whether they are universal and dominant in various contexts all remain to be established in future studies.”

      Regarding the selection of CAV1 as the gene that we used for validation experiment; as mentioned in the introductory paragraph of the result section “Perturbing expression levels of CAV1 changes cells stiffness” (copied below), we were encouraged by the previous data already linking CAV1 with cell mechanics when selecting it as our first target. The relationship between CAV1 and cell mechanics regulation, however, is not very well established (of note, two of the latest manuscripts came out after the initial findings of our study). 

      Regarding the citations suggested by the reviewer: two are already included in the original manuscript (Lin et al., Oncotarget 2015 – Ref (63); Le Master et al., 2022 – Ref (67)), along with an additional one (Hsu et al., 2018 (66)), and the third one (Grivas et al., 2020 (68)) is now also added to the manuscript. We would like to highlight, however, that even though Grivas et al. state that the CAV1 KO cells are stiffer, the AFM indentation measurements were performed on cardiac tissue with a spherical tip of 30 μm radius and likely reflect primarily supracellular, tissue-scale properties, as opposed to the cell-scale measurements performed in our study (we used cultured cells, which mostly lack the extracellular tissue structures; deformability cytometry was performed on dissociated cells and picks up on cell properties exclusively; and for the AFM measurements a spherical tip with 5 μm radius was used).

      “We decided to focus our attention on CAV1 as a potential target for modulating mechanical properties of cells, as it has previously been linked to processes intertwined with cell mechanics. In the context of mechanosensing, CAV1 is known to facilitate buffering of the membrane tension (45), play a role in β1-integrin-dependent mechanotransduction (58) and modulate the mechanotransduction in response to substrate stiffness (59). CAV1 is also intimately linked with actin cytoskeleton — it was shown to be involved in cross-talk with Rho-signaling and actin cytoskeleton regulation (46, 60–62), filamin A-mediated interactions with actin filaments (63), and co-localization with peripheral actin (64). The evidence directly relating CAV1 levels with the mechanical properties of cells (47, 62, 65, 66) and tissues (66, 67) is only beginning to emerge.”

      Regarding the cited p-value of 0.16, we would like to clarify that it is the p-value associated with the coefficient of the crude linear regression model fitted to the data for illustrative purposes in Fig S4. This value only says that, from the linear fit, we cannot conclude much about the correlation of the level of CAV1 with the Young’s modulus change. Much more relevant parameters to look at are the AUC-ROC values and associated p-values reported in Table 4 in the main text (see below), which show good performance of CAV1 in separating soft and stiff cell states.

      The positive hypothesis I assumes that markers are discriminative of samples with stiff/soft mechanical phenotype regardless of the studied biological system, and CAV1 has a clear trend with the minimum AUC-ROC on 3 datasets of 0.78, even though the p-value is below the significance level. The positive hypothesis II assumes that markers are discriminative of samples with stiff/soft mechanical phenotype in carcinoma regardless of data source, and CAV1 has a clear significance because the minimum AUC-ROC on 3 datasets is 0.89 and the p-value is 0.02.
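
      For readers less familiar with the metric, AUC-ROC for a single marker is the probability that a randomly chosen sample from one state has a higher marker value than a randomly chosen sample from the other state. The following is a minimal sketch of that definition only; the expression values are invented for illustration and are not data from the manuscript, and the function name is hypothetical.

```python
def auc_roc(scores_pos, scores_neg):
    """AUC-ROC via pairwise comparison (equivalent to the Mann-Whitney U statistic).

    Counts the fraction of (positive, negative) pairs in which the positive
    sample has the higher score; ties count as half.
    """
    if not scores_pos or not scores_neg:
        raise ValueError("both groups need at least one sample")
    wins = 0.0
    for p in scores_pos:
        for q in scores_neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Invented marker expression values (arbitrary units), NOT data from the study.
stiff_samples = [5.1, 6.3, 4.8, 7.0]
soft_samples = [3.2, 4.9, 2.7, 3.9]

print(auc_roc(stiff_samples, soft_samples))  # 0.9375
```

      An AUC of 0.5 corresponds to no separation between the two states, while values near 1 (or near 0, if the marker decreases with stiffness) indicate strong separation. Unlike the slope of a linear fit, this measure does not assume a linear relationship between expression and Young’s modulus.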

      (2) The authors do not show how much PC-Corr outperforms classical co-expression network analysis or an alternative gold standard. It is worth noting that PC-Corr was previously published by the same authors to infer phenotype-associated functional network modules from omics datasets (Ciucci et al., Scientific Reports 2017).

      As pointed out by the Reviewer, PC-corr has been introduced and characterized in detail in a previous publication (Ciucci et al, 2017, Sci. Rep.), where it was compared against standard co-expression analysis (below reported as: p-value network) on molecules selected using univariate statistical analysis. 

      See the following fragment of Discussion in Ciucci et al, 2017:

      “The PC-corr networks were always compared to P-value networks. The first strategical difference lies in the way features are selected: while the PC-corr adopts a multivariate approach, i.e. it uses a combination of features that are responsible for the sample discrimination, in the P-value network the discriminating features are singly selected (one by one) with each Mann-Whitney test (followed by Benjamini-Hochberg procedure). The second strategical difference lies in the generation of the correlation weights in the network. PC-corr combines in parallel and at the same time in a unique formula the discrimination power of the PC-loadings and the association power of the Pearson correlation, directly providing in output discriminative omic associations. These are generated using a robust (because we use as merging factor the minimum operator, which is a very penalizing operator) mathematical trade-off between two important factors: multivariate discriminative significance and correlation association. In addition, as mentioned above, the minimum operator works as an AND logical gate in a digital circuit, therefore in order to have a high link weight in the PC-corr network, both the discrimination (the PC-loadings) and the association (the Pearson correlations) of the nodes adjacent to the link should be simultaneously high. Instead, the P-value procedure begins with the pre-selection of the significant omic features and, only in a second separated step, computes the associations between these features. Therefore, in P-value networks, the interaction weights are the result neither of multivariate discriminative significance, nor of a discrimination/association interplay.”

      Here we implement PC-corr for a particular application and do not see it as central to the message of the present manuscript to compare it with other available methods. We considered it much more relevant to focus on an in-silico validation on datasets not used during the PC-corr analysis (see Tables 3 and 4 for details).
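
      The "minimum operator as AND gate" idea in the quoted passage can be illustrated with a short sketch. It assumes a link weight of the form sign(c_ij) · min(|V_i|, |V_j|, |c_ij|), where V_i is the scaled loading of feature i on the discriminating principal component and c_ij is the Pearson correlation between features i and j. The max-abs scaling of the loadings used here is a simplification chosen for illustration, and the function name and toy numbers are hypothetical; please refer to Ciucci et al, 2017 for the exact definition.

```python
def pc_corr_weights(loadings, corr):
    """Illustrative PC-corr-style link weights (simplified, see assumptions above).

    loadings: PC loadings of the features on the discriminating component.
    corr:     square matrix (list of lists) of Pearson correlations.
    Returns a square matrix of signed link weights with a zero diagonal.
    """
    scale = max(abs(x) for x in loadings)
    v = [abs(x) / scale for x in loadings]  # assumed scaling to [0, 1]
    n = len(v)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            c = corr[i][j]
            # The link is only as strong as its weakest ingredient (AND-like).
            magnitude = min(v[i], v[j], abs(c))
            weights[i][j] = magnitude if c >= 0 else -magnitude
    return weights


# Toy example with three features (invented numbers).
toy_loadings = [0.8, -0.4, 0.1]
toy_corr = [
    [1.0, 0.9, -0.7],
    [0.9, 1.0, 0.2],
    [-0.7, 0.2, 1.0],
]
for row in pc_corr_weights(toy_loadings, toy_corr):
    print(row)
```

      In this toy case the strongly correlated pair (features 0 and 1) receives a weight limited by the weaker loading, and any pair involving the weakly discriminating feature 2 receives a small weight regardless of its correlation, which is the penalizing behavior the quoted passage describes.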

      Altogether, the authors provide an interesting approach to identify novel genes associated with cell mechanics changes, but the current version does not fulfill such potential by focusing on a single gene with known roles in cell mechanics. 

      Our manuscript presents a demonstration of an overall approach for the identification of genes involved in the regulation of cell mechanics, and the perturbations performed on CAV1 have a demonstrative role (please also refer to the explanations of why we decided to perform the verification focused on CAV1 above). The fact that we identify CAV1, which has been implicated in regulating cell mechanics in a handful of studies, de novo and in an unbiased way speaks to the power of our approach. We do agree that investigation into the effect of manipulating the expression of the remaining genes from the identified network module, as well as into the mutual relationships between those genes and their covariance in perturbation experiments, constitutes a desirable follow-up on the presented results. It is, however, beyond the scope of the current manuscript. Regardless, the other genes identified can be readily tested in systems of interest and used as potential knobs for tuning mechanical properties on demand.

      Reviewer #1 (Recommendations For Authors)

      I am not a specialist of the bio-informatics methods used in this study, so I will not make any specific technical comments on them. 

      In terms of mechanical characterisation of cells, the authors use well established methods and the fact that they systematically validate their findings with at least two independent methods (RT-DC and AFM for example) makes them very robust. So I have no concerns with this part. The experiments of perturbations of CAV1 are also performed to the best standards and the results are clear, no concern on that.

      My main concerns are rather questions I was asking myself and could not answer when reading the article. Maybe the authors could find ways to clarify them - the discussion of their article is already very long and maybe it should not be lengthened too much. In my opinion, some of the points discussed are not really essential and rather redundant with other parts of the paper. This could be improved to give some space to clarify some of the points below:

      We thank Reviewer #1 for an overall positive evaluation of the manuscript, as well as for the points of criticism, which we address point by point below.

      (1) This might be a misunderstanding of the method on my side, but I was wondering whether it is possible to proceed through the same steps but choose other pairs of training datasets amongst the 5 systems available (there are 10 such pairs if I am not mistaken) and ask whether they always give the same set of 5 genes. And if not, are the other sets also then predictive, robust, etc. Or is it that there are 'better' pairs than others in this respect. Or the set of 5 genes is the only one that could be found amongst these 5 datasets - and then could it imply that it is the only 'universal' group of predictive genes for cell mechanics (when applied to any other dataset comprising similar mechanical measures and expression profiles, for other cells, other conditions)?

      I apologize in case this question is just the result of a basic misunderstanding of the method on my side. But I could not answer the question myself based on what is in the article and it seems to be important to understand the significance of the finding and the robustness of the method. 

      We thank the Reviewer for this question. To clarify: while in general it is possible to proceed through the same analysis steps choosing a different pair of datasets (see below for examples), we have purposefully chosen those two and not any other datasets because they encompassed the highest number of samples per condition in the RNAseq data (see Fig 4 and Table R1 below), originated from two different species, and concerned the least related tissues (the other option for mouse would be neural progenitors, which in combination with the glioblastoma would likely result in focusing on genes expressed in neural tissues). This is briefly explained in the following fragment of the manuscript on Page 10:

      “For the network construction, we chose two datasets that originate from different species, concern unrelated biological processes, and have a high number of samples included in the transcriptional analysis: human glioblastoma and murine iPSCs (Table 1).”

      To further address the comment of the reviewer: there is indeed a total of 10 possible two-set combinations of datasets, 6 of those pairs are human-mouse combinations (highlighted in orange in Author response Table 1), 3 are human-human combinations (highlighted in blue), and 1 is mouse-mouse (marked in green).
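      The count of combinations can be checked with a short enumeration. The sketch below is illustrative only: the set labels A to E follow the labels used in this response, and the species assigned to sets B, C and E are an assumption inferred from the pairs listed here, not taken from the manuscript tables.

```python
from itertools import combinations

# Species per dataset label. A (glioblastoma) and D (iPSCs) are named in the
# response; the species of B, C and E are assumed from the listed pairs.
SPECIES = {"A": "human", "B": "human", "C": "human", "D": "mouse", "E": "mouse"}


def classify_pairs(species):
    """Group all two-set combinations by the species of their two members."""
    groups = {}
    for a, b in combinations(sorted(species), 2):
        key = "-".join(sorted((species[a], species[b])))
        groups.setdefault(key, []).append(a + "&" + b)
    return groups


pairs = classify_pairs(SPECIES)
for kind, members in pairs.items():
    print(kind, len(members), members)
```

      Under these assumptions the enumeration reproduces the split into human-mouse, human-human and mouse-mouse pairs described above.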

      Author response table 1.

      Possible two-set combinations of datasets. For each combination, the number of common genes is indicated. The number on the diagonal represents total number of transcripts in the individual datasets, n corresponds to the number of samples in the respective datasets.  * include non-coding genes.

      To reiterate, we have chosen the combination of set A (glioblastoma) and set D (iPSCs) to choose datasets from different species and with highest sample number. 

      As for the other combinations of human-mouse datasets:

      • set A & E lead to derivation of a conserved module, however as expected this module includes genes specific for neuronal tissues (such as brain & testis specific immunoglobulin IGSF11, or genes involved in neuronal development such as RFX4, SOX8)

      Author response image 1.

      • the remaining combinations (set B&D, B&E, C&D and C&E) do not lead to a derivation of a highly interconnected module

      Author response image 2.

      Author response image 3.

      Author response image 4.

      Author response image 5.

      Finally, it would have also been possible to perform the combined PC-corr procedure on all 5 datasets. However, this would prevent us from doing validation using unknown datasets.

      Hence, we decided to proceed with the 2 discovery and 4 validation datasets.

      For the sake of completeness, we present below some of the networks obtained from the analysis performed on all 5 datasets (which intersect at 8059 genes).

      Author response image 6.

      The above network was created by calculating mean/minimum PC-corr among all five datasets and applying the threshold. The thresholding can be additionally restricted in that we:

      a. constrain the directionality of the correlation between the genes (𝑠𝑔𝑛(𝑐) ) to be the same among all or at least n datasets

      b. constrain the directionality of the correlation between the cell stiffness and gene expression level (𝑠𝑔𝑛(𝑉)) for individual genes.
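      The thresholding with the two optional restrictions can be sketched as follows. This is a minimal illustration and not the code used for the analysis: the function name, the array layout, the use of absolute PC-corr values for the mean/minimum, and the reading of restriction (b) as "same sign of V in every dataset" are all assumptions.

```python
import numpy as np


def combine_pc_corr(pc, corr, loadings, cutoff, min_agree=None, how="min"):
    """Combine per-dataset PC-corr matrices into one thresholded network.

    pc       : (n_datasets, n_genes, n_genes) PC-corr values
    corr     : (n_datasets, n_genes, n_genes) gene-gene correlations c
    loadings : (n_datasets, n_genes) loadings V per gene
    Returns the combined score matrix and a boolean adjacency mask.
    """
    pc = np.asarray(pc, dtype=float)
    corr = np.asarray(corr, dtype=float)
    loadings = np.asarray(loadings, dtype=float)
    n_datasets = pc.shape[0]
    if min_agree is None:
        min_agree = n_datasets  # default: sign must agree in all datasets

    # Mean or minimum of the absolute PC-corr across datasets (assumption).
    magnitude = np.abs(pc)
    score = magnitude.min(axis=0) if how == "min" else magnitude.mean(axis=0)
    keep = score >= cutoff

    # (a) sgn(c) must be the same in at least `min_agree` datasets.
    n_pos = (corr > 0).sum(axis=0)
    n_neg = (corr < 0).sum(axis=0)
    keep &= np.maximum(n_pos, n_neg) >= min_agree

    # (b) sgn(V) of each gene must be non-zero and identical in all datasets.
    signs = np.sign(loadings)
    gene_ok = np.all(signs == signs[0], axis=0) & (signs[0] != 0)
    keep &= gene_ok[:, None] & gene_ok[None, :]

    np.fill_diagonal(keep, False)  # no self-edges
    return score, keep


# Toy input with 2 datasets and 3 genes (placeholder numbers, not real data).
toy_pc = np.array([
    [[0.0, 0.8, 0.7], [0.8, 0.0, 0.2], [0.7, 0.2, 0.0]],
    [[0.0, 0.9, -0.75], [0.9, 0.0, 0.3], [-0.75, 0.3, 0.0]],
])
toy_v = np.array([[0.9, 0.8, 0.7], [0.85, 0.9, 0.8]])
score, strict = combine_pc_corr(toy_pc, toy_pc, toy_v, cutoff=0.6)
_, relaxed = combine_pc_corr(toy_pc, toy_pc, toy_v, cutoff=0.6, min_agree=1)
```

      In the toy input, the edge between genes 0 and 2 passes the cutoff but flips sign between the two datasets, so it is kept only when the sign restriction is relaxed.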

      Some of the resulting networks for such restrictions are presented below.

      Author response image 7.

      Author response image 8.

      Of note, some of the nodes from the original network presented in the paper (CAV1, FHL2, and IGFBP7) are preserved in the 5-set network (and highlighted with blue rims).

      (2) The authors already use several types of mechanical characterisation of the cells, but there are even more of them, in particular, some that might not directly correspond to global cell stiffness but to other aspects, like traction forces, or cell cortex rheology, or cell volume or passage time through constrictions (active or passive) - they might all be in one way or another related, but they are a priori independent measures. Would the authors anticipate finding very different 'universal modules' for these other mechanical properties, or again the same one? Is there a way to get at least a hint based on some published characterisations for the cells used in the study? Basically, the question is whether the gene set identified is specific for a precise type of mechanical property of the cell, or is more generally related to cell mechanics modulation - maybe, as suggested by the authors because it is a set of molecular knobs acting upstream of general mechanics effectors like YAP/TAZ or acto-myosin?

      We thank the Reviewer for this comment. We would like to first note that in our study, we focused on the single-cell mechanical phenotype, understood as a response of the cells to deformation at a global (RT-DC) or semi-local (AFM indentation with 5-μm bead) level and at comparatively low deformations (1-3 μm, see Table S9). There is of course a variety of other methods for measuring cell mechanics and mechanics-related features, such as traction force microscopy mentioned by the reviewer. However, traction force microscopy probes how the cells apply forces and interact with their environment rather than the inherent mechanical properties of the cells themselves, which were the main interest of our study.

      Nevertheless, as mentioned in the discussion, we found some overlap with the genes identified in other mechanical contexts, for example in the context of mechanical stretching of cells:

      “Furthermore, CAV1 is known to modulate the activation of transcriptional cofactor yes-associated protein, YAP, in response to changes in stiffness of cell substrate (60) and in the mechanical stretch-induced mesothelial to mesenchymal transition (74).”

      This suggests that the genes identified here may be more broadly related to mechanical aspects of cells.

      Of note, we do have some insights connected to the changes of cell volume — one of the biophysical properties mentioned by the reviewer — from our experiments.  For all measurements performed with RT-DC, we can also calculate cell volumes from 2D cell contours (see Author response images 9, 10, and 11). For most of the cases (all apart from MEF CAV1KO), the stiffer phenotype of the cells, associated with higher levels of CAV1, shows a higher volume.
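      To illustrate what estimating a volume "by rotation of the cell contours" means, a minimal sketch is given below. It is not the Shape-Out implementation, which may differ in detail: the function name is hypothetical, the rotation axis is assumed to be horizontal and to pass through the mean of the contour's y coordinates, and the volume is taken as the average of the solids obtained by rotating the upper and the lower half of the contour.

```python
import numpy as np


def volume_by_rotation(x, y, axis_y=None):
    """Volume of the solid obtained by rotating a closed 2D contour about a
    horizontal axis; the upper and lower halves are rotated separately and
    the two resulting volumes are averaged."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if axis_y is None:
        axis_y = y.mean()  # simplification: vertex mean as rotation axis
    r = y - axis_y         # signed distance of each vertex from the axis

    total = 0.0
    n = len(x)
    for i in range(n):
        j = (i + 1) % n
        segments = [(x[i], r[i], x[j], r[j])]
        if r[i] * r[j] < 0:
            # Edge crosses the axis: split it at the crossing point.
            t = r[i] / (r[i] - r[j])
            xc = x[i] + t * (x[j] - x[i])
            segments = [(x[i], r[i], xc, 0.0), (xc, 0.0, x[j], r[j])]
        for xa, ra, xb, rb in segments:
            # Integral of r^2 dx along a straight edge (conical frustum / pi),
            # signed by the side of the axis the edge lies on.
            side = 1.0 if (ra + rb) >= 0 else -1.0
            total += side * (xb - xa) * (ra * ra + ra * rb + rb * rb) / 3.0
    return 0.5 * np.pi * abs(total)


# Sanity check on a circular contour with a radius of 7.5 µm (sphere expected).
theta = np.linspace(0.0, 2.0 * np.pi, 720, endpoint=False)
radius_um = 7.5
volume = volume_by_rotation(radius_um * np.cos(theta), radius_um * np.sin(theta))
sphere = 4.0 / 3.0 * np.pi * radius_um ** 3
```

      For a circular contour the estimate approaches the volume of a sphere of the same radius, which is a convenient check before applying such a routine to measured contours.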

      Author response image 9.

      Cell volumes for the divergent cell states in the five characterized biological systems. (A) Glioblastoma. (B) Carcinoma, (C) MCF10A, (D) iPSCs, (E) Developing neurons. Data corresponds to Figure 2. Cell volumes were estimated using Shape-Out 1.0.10 by rotation of the cell contours.

      Author response image 10.

      Cell volumes for CAV1 perturbation experiments. (A) CAV1 knock down performed in TGBC cells. (B) CAV1 overexpression in ECC4 and TGBC cells. Data corresponds to Figure 5. Cell volumes were estimated using Shape-Out 1.0.10 by rotation of the cell contours.  

      Author response image 11.

      Cell volumes for WT and CAV1KO MEFs. Data corresponds to Figure S9. Cell volumes were estimated using Shape-Out 1.0.10 by rotation of the cell contours.  

      (3) The authors have already tested a large number of conditions in which perturbations of the level of expression of CAV1 correlates with changes in cell mechanics, but I was wondering whether it also has some direct explanatory value for the initial datasets used - for example for the glioblastoma cells from Figure 2, in the different media, would a knock-down of CAV1 prevent the increase in stiffness observed upon addition of serum, or for the carcinoma cells from different tissues treated with different compounds - if I understand well, the authors have tested a subset of these (ECC4 versus TGBC in figure 5) - how did they choose these and how general is it that the mechanical phenotype changes reported in Figure 2 are all mostly dependant on CAV1 expression level? I must say that the way the text is written and the results shown, it is hard to tell whether CAV1 is really having a dominant effect on cell mechanics in most of these contexts or only a partial effect. I hope I am being clear in my question - I am not questioning the conclusions of Figures 5 and 6, but asking whether the level of expression of CAV1, in the datasets reported in Figure 2, is the dominant explanatory feature for the differences in cell mechanics. 

      We thank the reviewer for this comment and appreciate the value of the question about the generality and dominance of CAV1 in influencing cell mechanics.

      On the computational side, we have addressed these issues by looking at the performance of CAV1 (among other identified genes) in classifying soft and stiff phenotypes across biological systems (positive hypothesis I), as well as across data of different type (sequencing vs microarray data) and origin (different research institutions) (positive hypothesis II). CAV1 showed strong classification performance (Table 4), suggesting it is a general marker of stiffness changes.  

      On the experimental side, we conducted the perturbation experiments in two systems of choice: two intestinal carcinoma cell lines (ECC4 and TGBC) and the MCF10A breast epithelial cell line. These choices were driven by ease of handling, accessibility, as well as (for MCF10A) connection with a former study (Tavares et al, 2017). While we observed correlations between CAV1 expression and cell mechanics in a wide range of datasets, the precise role of CAV1 in each system may vary, and further perturbation experiments in specific systems could be performed to solidify the direct/dominant role of CAV1 in cell mechanics. We hypothesize that the suggested knockdown of CAV1 upon serum addition in glioblastoma cells could reduce or prevent the increase in stiffness observed, though this experiment has not been performed.

      In conclusion, while the computational analysis gives us confidence that CAV1 is a good indicator of cell stiffness, we predict that it acts in concert with other genes and in specific contexts could be replaced by other changes. We suggest that the suitability of CAV1 for manipulation of the mechanical properties should be tested in each system of interest before use.

      To highlight the fact that the relevance of CAV1 for modulating cell mechanics in specific systems of interest should be tested and the mechanistic insights into how CAV1 regulates cell mechanics are still missing, we have added the following sentence in the discussion:

      “The mechanical phenotype of cells is recognized as a hallmark of many physiological and pathological processes. Understanding how to control it is a necessary next step that will facilitate exploring the impact of cell mechanics perturbations on cell and tissue function (76). The increasing availability of transcriptional profiles accompanying cell state changes has recently been complemented by the ease of screening for mechanical phenotypes of cells thanks to the advent of high-throughput microfluidic methods (77). This provides an opportunity for data-driven identification of genes associated with the mechanical cell phenotype change in a hypothesis-free manner. Here we leveraged this opportunity by performing discriminative network analysis on transcriptomes associated with mechanical phenotype changes to elucidate a conserved module of five genes potentially involved in cell mechanical phenotype regulation. We provided evidence that the inferred conserved functional network module contains an ensemble of five genes that, in particular when combined in a unique combinatorial marker, are universal, specific and trustworthy markers of mechanical phenotype across the studied mouse and human systems. We further demonstrated on the example of a selected marker gene, CAV1, that its experimental up- and downregulation impacts the stiffness of the measured cells. This demonstrates that the level of CAV1 not only correlates with, but also is causative of mechanical phenotype change. The mechanistic insights into how precisely the identified genes are involved in regulating mechanical properties, how they interact with each other, and whether they are universal and dominant in various contexts all remain to be established in future studies.”

      (4) It would be nice that the authors try to more directly address, in their discussion, what is the biological meaning of the set of 5 genes that they found - is it really mostly a product of the methodology used, useful but with little specific relevance to any biology, or does it have a deeper meaning? Either at a system level, or at an evolutionary level. 

      We would like to highlight that our manuscript is focused on the method that we introduce to identify sets of genes involved in the regulation of cell mechanics. The first implementation included here is only the beginning of this line of work which, in the future, will include looking in detail at the biological meaning and the interconnectivity of the genes identified. Most likely, there is a deeper meaning of the identified module which could be revealed with a lot of dedicated future work. As this is mere speculation at this point, we would like to refrain from going into more detail about it in the current manuscript. We provide below a few words of extended explanation and additional analysis that can shed light on the current limited knowledge of the connections between the genes and evolutionary preservation of the genes.

      While it is difficult to prove at present, we do believe that the identified node of genes may have an actual biological meaning and is not a mere product of the used methodology. The PC-corr score used for applying the threshold and obtaining the gene network is high only if the Pearson’s correlation between the two genes is high, meaning that the genes in the highly connected module show correlated expression and are likely co-regulated. Additionally, we performed the GO Term analysis using DAVID to assess the connections between the genes (Figure S3). We have now performed an additional analysis using two orthogonal tools: the functional protein association tool STRING and KEGG Mapper.

      With STRING, we found a moderate connectivity using the five network nodes identified in our study, and many of the obtained connections were based on text mining and co-expression, rather than direct experimental evidence (Author response image 12A). A more connected network can be obtained by allowing STRING to introduce further nodes (Author response image 12B). Interestingly, some of the nodes included by STRING in the extended network are nodes identified with milder PC-corr thresholds in our study (such as CNN2 or IGFBP3, see Table S3).

      With KEGG Mapper, we did not find an obvious pathway-based clustering of the genes from the module either. A maximum of two genes were assigned to one pathway and those included: 

      • focal adhesions (pathway hsa04510): CAV1 and THBS1

      • cytoskeleton in muscle cells (pathway hsa04820): FHL2 and THBS1

      • proteoglycans in cancer (pathway hsa05205): CAV1 and THBS1.

      As for the BRITE hierarchy, the following classification was found:

      • membrane trafficking (hsa04131): CAV1, IGFBP7, TAGLN, THBS1, with the following subcategories:

      - endocytosis / lipid raft mediated endocytosis / caveolin-mediated endocytosis: CAV1

      - endocytosis / phagocytosis / opsonins: THBS1

      - endocytosis / others / insulin-like growth factor-binding proteins: IGFBP7

      - others / actin-binding proteins / others: TAGLN.

      Taken together, all these analyses (DAVID, STRING, KEGG) show that at present no direct relationship/single pathway can be found that integrates all the genes from the identified modules. Future experiments, including investigations of how other module nodes are affected when one of the genes is manipulated, will help to establish actual physical or regulatory interactions between the genes from our module.

      To touch upon the evolutionary perspective, we provide an overview of occurrence of the genes from the identified module across the evolutionary tree. This overview shows that the five identified genes are preserved in phylum Chordata with quite high sequence similarity, and even more so within mammals (Author response image 13).

      Author response image 12.

      Visualisation of interactions between the nodes in the identified module using functional protein association networks tool STRING. (A) Connections obtained using multiple proteins search and entering the five network nodes. (B) Extended network that includes further genes to increase indirect connectivity. The genes are added automatically by STRING. Online version of STRING v12.0 was used with Homo sapiens as species of interest.   

      Author response image 13.

      Co-occurrence of genes from the network module across the evolutionary tree. Mammals are indicated with the green frame, glires (include mouse), as well as primates (include human) are indicated with yellow frames. The view was generated using online version of STRING 12.0.

      Reviewer #2 (Recommendations For Authors) 

      (1) The authors need to discuss the level of sensitivity of their mechanical measurements with RT-DC for changes to the membrane compared to changes in microtubules, nucleus, etc. The limited AFM measurements also seem membrane/cortex focused. For these and further reasons below, "universal" doesn't seem appropriate in the title or abstract, and should be deleted. 

      We thank the reviewer for this comment. Indeed, RT-DC is a technique that deforms the entire cell to a relatively low degree (inducing ca 17% mean strain, i.e. a deformation of approximately 2.5 µm on a cell with a 15 µm diameter, see Table S9 and Urbanska et al., Nat Methods 2020). Similarly, the AFM indentation experiments performed in this study (using a 5-µm diameter colloidal probe and 1 µm indentation) induce low strains, at which, according to current knowledge, the actin cortex dominates the measured deformations. However, other cellular components, including the membrane, microtubules, intermediate filaments, nucleus, other organelles, and cytoplasmic packing, can also contribute. We have reviewed these contributions in detail in a recent publication (Urbanska and Guck, 2024, Ann Rev Biophys., PMID 38382116). For a particular system, it is hard to speculate without further investigation which parts of the cell have a dominant effect on the measured deformability. We have now added the following paragraph to the discussion to include this information:

      “The mechanical phenotype of single cells is a global readout of cell’s resistance to deformation that integrates contributions from all cellular components. The two techniques implemented for measuring cell mechanics in this study, RT-DC and AFM indentation using a spherical indenter with 5 µm radius, exert comparatively low strain on cells (< 3 µm, see Table S9), at which the actin cortex is believed to dominate the measured response. However, other cellular components, including the membrane, microtubules, intermediate filaments, nucleus, other organelles, and cytoplasmic packing, also contribute to the measured deformations (reviewed in detail in (79)) and, for a particular system, it is hard to speculate without further investigation which parts of the cell have a dominant effect on the measured deformability.”

      The key strength of measuring the global mechanics is that such measurements are agnostic of the specific origin of the resistance to shape change. As such, the term “universal” could be seen as rather appropriate, as we are not testing specific contributions to cell mechanics, and we see the two methods used (RT-DC and AFM indentation) as representative when it comes to measuring global cell mechanics. We have also highlighted throughout the text that we are measuring the global single-cell mechanical phenotype.

      Most importantly, however, we have used the term “universal” to capture that the genes are preserved across different systems and species, not in relation to the type of mechanical measurements performed and as such we would like to retain the term in the title.

      (2) Fig.2 cartoons of tissues is a good idea to quickly illustrate the range of cell culture lines studied. However, it obligates the authors to examine the relevant primary cell types in single-cell RNAseq of human and/or mouse tissues (e.g. Tabula Muris). They need to show CAV1 is expressed in glioblastoma, iPSCs, etc and not a cell culture artifact. CAV1 and the other genes also need to be plotted with literature values of tissue stiffness.

      We thank the reviewer for this comment; however, we do believe that the cartoons in Figure 2 should assist the reader to readily understand whether cultured cells derived from the respective tissues were used (see cartoons representing dishes), or the cells directly isolated from the tissue were measured (this is the case for the developing neurons dataset).

      We did, however, follow the suggestion of the reviewer to use available resources and checked the expression of genes from the identified network module across various tissues in mouse and human. We first used the Mouse Genome Informatics (MGI; https://www.informatics.jax.org/) to visualize the expression of the genes across organs and organ systems (Author response image 14) as well as across more specific tissue structures (Author response image 15). These two figures show that the five identified genes are expressed quite broadly in mouse. We next looked at the expression of the five genes in the scRNASeq dataset from Tabula Muris (Author response image 16). Here, the expression of respective genes seemed more restricted to specific cell clusters. Finally, we also collected the cross-tissue expression of the genes from our module in human tissues from Human Protein Atlas v23 at both mRNA (Author response image 17) and protein (Author response image 18) levels. CAV1, IGFBP7, and THBS1 showed low tissue specificity at mRNA level, FHL2 was enriched in heart muscle and ovary (the heart enrichment is also visible in Author response image 15 for mouse) and TAGLN in endometrium and intestine. Interestingly, the expression at the protein level (Author response image 18) did not seem to follow faithfully the mRNA levels (Author response image 17). Overall, we conclude that the identified genes are expressed quite broadly across mouse and human tissues. 

      Author response image 14.

      Expression of genes from the identified module across various organs and organ systems in mouse. The expression matrices for organs (A) and organ systems (B) were generated using Tissue x Gene Matrix tool of Gene eXpression Database (https://www.informatics.jax.org/gxd/, accessed on 22nd September 2024). No pre-selection of stage (age) and assay type (includes RNA and protein-based assays) was applied. The colors in the grid (blues for expression detected and reds for expression not detected) get progressively darker when there are more supporting annotations. The darker colors do not denote higher or lower levels of expression, just more evidence.

      Author response image 15.

      Expression of genes from the identified module across various mouse tissue structures. The expression matrices for age-selected mouse marked as adult (A) or young individuals (collected ages labelled P42-84 / P w6-w12 / P m1.5-3.0) (B) are presented and were generated using RNASeq Heatmap tool of Gene eXpression Database (https://www.informatics.jax.org/gxd/, accessed on 2nd October 2024).

      Author response image 16.

      Expression of genes from the identified module across various cell types and organs in t-SNE embedding of Tabula Muris dataset. (A) t-SNE clustering color-coded by organ. (B-F) t-SNE clustering color-coded for expression of CAV1 (B), IGFBP7 (C), FHL2 (D), TAGLN (E), and THBS1 (F). The plots were generated using FACS-collected cells data through the visualisation tool available at https://tabulamuris.sf.czbiohub.org/ (accessed on 22nd September 2024).

      Author response image 17.

      Expression of genes from the identified module at the mRNA level across various human tissues. (A-E) Expression levels of CAV1 (A), IGFBP7 (B), FHL2 (C), TAGLN (D), and THBS1 (E). The plots were generated using consensus dataset from Human Protein Atlas v23 https://www.proteinatlas.org/ (accessed on 22nd September 2024).

      Author response image 18.

      Protein levels of genes from the identified module across various human tissues. (A-E) Protein levels of CAV1 (A), IGFBP7 (B), FHL2 (C), TAGLN (D), and THBS1 (E). The plots were generated using Human Protein Atlas v23 https://www.proteinatlas.org/ (accessed on 22nd September 2024).

      Regarding literature values and tissue stiffness, we would like to argue that cell stiffness is not equivalent to tissue stiffness, and we are interested in the former. Tissue stiffness is governed by a combination of cell mechanical properties, cell adhesions, packing and the extracellular matrix. There can be, in fact, mechanically distinct cell types (for example characterized by different metabolic state, malignancy level etc) within one tissue of given stiffness. Hence, we consider that testing for the correlation between tissue stiffness and expression of identified genes is not immediately relevant.

      (3) Fig.5D,H show important time-dependent mechanics that need to be used to provide explanations of the differences in RT-DC (5B,F) and in standard AFM indentation expts (5C,G). In particular, it looks to me that RT-DC is a high-f/short-time measurement compared to the AFM indentation, and an additional Main or Supp Fig needs to somehow combine all of this data to clarify this issue. 

      We thank the reviewer for this comment. It is indeed the case that cells typically display higher stiffness when probed at higher rates. We have now expanded on this aspect of the results and added a supplementary figure (Fig. S10) that illustrates the frequencies used in different methods and summarizes the apparent Young’s moduli values into one plot in a frequency-ordered manner. Of note, we typically acquire RT-DC measurements at up to three flow rates, and the increase in measurement frequency accompanying the increase in flow rate also results in higher extracted apparent Young’s moduli (see Fig. S10 B,D). We have further added Table S9 that summarizes operating parameters of all three methods used for probing cell mechanics in this manuscript:

      “The three techniques for characterizing mechanical properties of cells — RT-DC, AFM indentation and AFM microrheology — differ in several aspects (summarized in Table S9), most notably in the frequency at which the force is applied to cells during the measurements, with RT-DC operating at the highest frequency (~600 Hz), AFM microrheology at a range of frequencies in-between (3–200 Hz), and AFM indentation operating at lowest frequency (5 Hz) (see Table S9 and Figure S10A). Even though the apparent Young’s moduli obtained for TGBCS cells were consistently higher than those for ECC4 cells across all three methods, the absolute values measured for a given cell line varied depending on the methods: RT-DC measurements yielded higher apparent Young’s moduli compared to AFM indentation, while the apparent Young’s moduli derived from AFM microrheology measurements were frequency-dependent and fell between the other two methods (Fig. 5B–D, Fig. S10B). The observed increase in apparent Young’s modulus with probing frequency aligns with previous findings on cell stiffening with increased probing rates observed for both AFM indentation (68, 69) and microrheology assays (70–72).”

      (4) The plots in Fig.S4 are important as main Figs, particularly given the cartoons of different tissues in Fig.1,2. However, positive correlations for a few genes (CAV1, IGFBP7, TAGLN) are most clear for the multiple lineages that are the same (stomach) or similar (gli, neural & pluri). The authors need to add green lines and pink lines in all plots to indicate the 'lineage-specific' correlations, and provide measures where possible. Some genes clearly don't show the same trends and should be discussed.

      We thank the reviewer for this comment. It is indeed an interesting observation (and worth highlighting by adding the fits to lineage-restricted data) that the relationship between relative change in Young’s modulus and the selected gene expression becomes steeper for samples from similar tissue contexts.

      For the sake of keeping the main manuscript compact, we decided to keep Fig. S7 (formerly Fig. S4) in the supplement, however, we did add the linear fit to the glioblastoma dataset (pink line) and a fit to the related neural/embryonic datasets (gli, neural & pluri – purple line) as advised — see below.
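      A lineage-restricted linear fit of this kind can be sketched as below. This is an illustration only: the function name and group labels are hypothetical, the numbers are made-up placeholders rather than values from Fig. S7, and the choice of expression fold-change as the x variable and normalized Young’s modulus change as the y variable is an assumption.

```python
import numpy as np


def fit_by_group(x, y, groups, selected):
    """Least-squares line through the points whose group label is in `selected`.

    Returns (slope, intercept).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mask = np.isin(np.asarray(groups), list(selected))
    if mask.sum() < 2:
        raise ValueError("need at least two points to fit a line")
    slope, intercept = np.polyfit(x[mask], y[mask], 1)
    return slope, intercept


# Placeholder values, NOT data from the manuscript.
fold_change = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 4.0]
modulus_change = [2.0, 4.0, 6.0, 1.0, 1.5, 2.0, 5.0]
lineage = ["gli", "gli", "gli", "neural", "pluri", "neural", "other"]

# Fit restricted to one lineage, and fit to the pooled related lineages.
gli_fit = fit_by_group(fold_change, modulus_change, lineage, ["gli"])
related_fit = fit_by_group(fold_change, modulus_change, lineage,
                           ["gli", "neural", "pluri"])
```

      Comparing the slope of a restricted fit with that of a pooled fit is one simple way to quantify how much steeper the relationship is within a single lineage.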

      We did not pool the stomach data since it is represented by a single point in the figure, aligning with how the data is presented in the main text—stomach adenocarcinoma cell lines (MKN1 and MKN45) are pooled in Fig. 1B (see below).

      We have also amended the respective results section to emphasize that, in certain instances, the correlation between changes in mechanical phenotype and alterations in the expression of analysed genes may be less pronounced:

      “The relation between normalized apparent Young’s modulus change and fold-change in the expression of the target genes is presented in Fig. S7. The direction of changes in the expression levels between the soft and stiff cell states in the validation datasets was not always following the same direction (Fig. 4, C to F, Fig. S7). This suggests that the genes associated with cell mechanics may not have a monotonic relationship with cell stiffness, but rather are characterized by different expression regimes in which the expression change in opposite directions can have the same effect on cell stiffness. Additionally, in specific cases a relatively high change in Young’s modulus did not correspond to marked expression changes of a given gene — see for example low CAV1 changes observed in MCF10A PIK3CA mutant (Fig. S7A), or low IGFBP7 changes in intestine and lung carcinoma samples (Fig. S7C). This indicates that the importance of specific targets for the mechanical phenotype change may vary depending on the origin of the sample.”

      (5) Table-1 neuro: Perhaps I missed the use of the AFM measurements, but these need to be included more clearly in the Results somewhere. 

      To clarify: there were no AFM measurements performed for the developing neurons (neuro) dataset, and it is not marked as such in Table 1. There are previously published AFM measurements for the iPSCs dataset (maybe that caused the confusion?), and we referred to them as such in the table by citing the source (Urbanska et al (30)) as opposed to the statement “this paper” (see the last column of Table 1). We did not consider it necessary to include these previously published data. We have added additional horizontal lines to the table that will hopefully improve its readability.

      Reviewer #3 (For Authors) 

      Major 

      -  I strongly encourage the authors to validate their approach with a gene for which mechanical data does not exist yet, or explore how the combination of the 5 identified genes is the novel regulator of cell mechanics. 

      We appreciate the reviewer’s insightful comment and agree that it would be highly interesting to validate further targets and perform combinatorial perturbations. However, it is not feasible at this point to expand the experimental data beyond the one already provided. We hope that in the future, the collective effort of the cell mechanics community will establish more genes that can be used for tuning of mechanical properties of cells.

      - If this paper aims at highlighting the power of PC-Corr as a novel inference approach, the authors should compare its predictive power to that of classical co-expression network analysis or an alternative gold standard. 

      We thank the reviewer for the suggestion to compare the predictive power of PC-Corr with classical co-expression network analysis or an alternative gold standard. PC-corr has been introduced and characterized in detail in a previous publication (Ciucci et al, 2017, Sci. Rep.), where it was compared against standard co-expression analysis methods. Here we implement PC-corr for a particular application. Thus, we do not see it as central to the message of the present manuscript to compare it with other available methods again.

      - The authors call their 5 identified genes "universal, trustworthy and specific". While they provide a great amount of data all is derived from human and mouse cell lines. I suggest toning this down. 

      We thank the reviewer for this comment. To clarify, the terms universal, trustworthy and specific are based on the specific hypotheses tested in the validation part of the manuscript, but we understand that they may cause confusion. We have now toned down the statement by adding “universal, trustworthy and specific across the studied mouse and human systems” in the following text fragments:

      (1) Abstract

      “(…) We validate in silico that the identified gene markers are universal, trustworthy and specific to the mechanical phenotype across the studied mouse and human systems, and demonstrate experimentally that a selected target, CAV1, changes the mechanical phenotype of cells accordingly when silenced or overexpressed. (...)”

      (2) Last paragraph of the introduction

      “(…) We then test the ability of each gene to classify cell states according to cell stiffness in silico on six further transcriptomic datasets and show that the individual genes, as well as their compression into a combinatorial marker, are universally, specifically and trustworthily associated with the mechanical phenotype across the studied mouse and human systems. (…)”

      (3) First paragraph of the discussion

      “We provided strong evidence that the inferred conserved functional network module contains an ensemble of five genes that, in particular when combined in a unique combinatorial marker, are universal, specific and trustworthy markers of mechanical phenotype across the studied mouse and human systems.”

      Minor suggestions 

      -  The authors point out how genes that regulate mechanics often display non-monotonic relations with their mechanical outcome. Indeed, in Fig.4 developing neurons have lower CAV1 in the stiff group. Perturbing CAV1 expression in that model could show the nonmonotonic relation and strengthen their claim. 

      We thank the reviewer for highlighting this important point. It would indeed be interesting to explore the changes in cell stiffness upon perturbation of CAV1 in a system that has the potential to show an opposing behavior. Unfortunately, we are unable to expand the experimental part of the manuscript at this time. We do hope that this point can be addressed in future research, either by our team or other researchers in the field.

      -  In their gene ontology enrichment assay, the authors claim that their results point towards reduced transcriptional activity and reduced growth/proliferation in stiff compared to soft cells. Proving this with a simple proliferation assay would be a nice addition to the paper. 

      This is a valuable suggestion that should be followed up on in detail in the future. To give a preliminary insight into this line of investigation, we have looked at the cell count data for the CAV1 knock-down experiments in TGBC cells. Since CAV1 is associated with the GO Term “negative regulation of proliferation/transcription” (high CAV1 – low proliferation), we would expect that lowering the levels of CAV1 results in increased proliferation and higher cell counts at the end of the experiment (3 days post transfection). As illustrated in Author response image 19 below, the cell counts were higher for the samples treated with CAV1 siRNAs, though not in a statistically significant way. Interestingly, the magnitude of the effect partially mirrored the trends observed for the cell stiffness (Figure 5F).

      Author response image 19.

      The impact of CAV1 knock-down on cell counts in TGBC cells. (A) Absolute cell counts per condition in a 6-well format. Cell counts were performed when harvesting for RT-DC measurements using an automated cell counter (Countess II, Thermo Fisher Scientific). (B) The event rates observed during the RT-DC measurements. The harvested cells are resuspended in a specific volume of measuring buffer standardized per experiment (50-100 μl); thus, the event rates reflect the absolute cell numbers in the respective samples. Horizontal lines delineate medians with mean absolute deviation (MAD) as error, datapoints represent individual measurement replicates, with symbols corresponding to matching measurement days. Statistical analysis was performed using a two-sample two-sided Wilcoxon rank sum test.
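      For readers who wish to see how a comparison of the kind named in this caption works, the following is a minimal Python sketch of an exact two-sided Wilcoxon rank sum test. It is not the analysis code used for the figure; the function name `rank_sum_exact` and the cell counts are hypothetical placeholders chosen for illustration.

```python
from itertools import combinations


def rank_sum_exact(x, y):
    """Exact two-sided Wilcoxon rank sum p-value for two small samples (lists)."""
    pooled = sorted(x + y)
    # Midranks handle ties: each value gets the average of the ranks it occupies.
    rank = {}
    for value in set(pooled):
        first = pooled.index(value) + 1
        count = pooled.count(value)
        rank[value] = first + (count - 1) / 2
    ranks = [rank[value] for value in x + y]
    n = len(x)
    observed = sum(ranks[:n])
    expected = n * (len(ranks) + 1) / 2
    # Enumerate every way of assigning n of the pooled ranks to the first sample.
    sums = [sum(c) for c in combinations(ranks, n)]
    tolerance = 1e-12
    extreme = sum(
        abs(s - expected) >= abs(observed - expected) - tolerance for s in sums
    )
    return extreme / len(sums)


# Hypothetical cell counts (not data from the study).
control_counts = [1.1e6, 1.3e6, 1.2e6]
knockdown_counts = [1.4e6, 1.6e6, 1.5e6]
p_value = rank_sum_exact(control_counts, knockdown_counts)
print(p_value)
```

      With three replicates per group there are only 20 possible rank assignments, so the smallest two-sided p-value such an exact test can return is 2/20 = 0.1. A consistent trend across few replicates can therefore remain above conventional significance thresholds.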

      Methods

      - The AFM indentation experiments are performed with a very soft cantilever at very high speeds. Why? Also, please mention whether the complete AFM curve was fitted with the Hertz/Sneddon model or only a certain area around the contact point. 

      We thank the reviewer for this comment. However, we believe that the spring constants and indentation speeds used in our study are typical for measurements of cells and not a cause of concern. 

      For the indentation experiments, we used Arrow-TL1 cantilevers (nominal spring constant k = 0.035-0.045 N m<sup>−1</sup>, Nanoworld, Switzerland) which are used routinely for cell indentation (with over 200 search results on Google Scholar using the term: "Arrow-TL1"+"cell", and several former publications from our group, including Munder et al 2016, Tavares et al 2017, Urbanska et al 2017, Taubenberger et al 2019, Abuhattum et al 2022, among others). Additionally, cantilevers with spring constants as low as 0.01 N m<sup>−1</sup> can be used for cell measurements (Radmacher 2002, Thomas et al, 2013).

      The indentation speed of 5 µm s<sup>−1</sup> is not unusually high and does not result in significant hydrodynamic drag. 

      For the microrheology experiments, we used slightly stiffer and shorter (100/200 µm compared to 500 µm for Arrow-TL1) cantilevers: PNP-TR-TL (nominal spring constant k = 0.08 N m<sup>−1</sup>, Nanoworld, Switzerland). The measurement frequencies of 3-200 Hz correspond to movements slightly faster than 5 µm s<sup>−1</sup>, but cells were indented only to 100 nm, and the data were corrected for the hydrodynamic drag (see equation (8) in Methods section).

      Author response image 20.

      Exemplary indentation curve obtained using an Arrow-TL1 cantilever decorated with a 5-µm sphere on an ECC4 cell. The shown plot is exported directly from JPK Data Processing software. The area shaded in grey is the area used for fitting the Sneddon model.
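      To illustrate what restricting a fit to a maximal indentation means in practice, the following is a minimal Python sketch. It is an assumption-laden illustration, not the JPK fitting routine: it uses the simpler parabolic Hertz relation F = (4/3) · E/(1 − ν²) · √R · δ^(3/2) rather than the Sneddon model for a sphere, assumes the contact point is already known, assumes a Poisson ratio of 0.5, and takes the 5-µm sphere size to be a diameter (radius 2.5 µm). The function name and the synthetic curve are hypothetical.

```python
import math


def fit_hertz_modulus(indentation_m, force_n, radius_m, max_indentation_m, poisson=0.5):
    """Estimate an apparent Young's modulus (Pa) from a force-indentation curve.

    Only points with 0 < indentation <= max_indentation_m enter the fit.
    The fit is a least-squares line through the origin of force vs indentation^1.5.
    """
    pairs = [
        (d, f)
        for d, f in zip(indentation_m, force_n)
        if 0.0 < d <= max_indentation_m
    ]
    x = [d ** 1.5 for d, _ in pairs]
    slope = sum(xi * f for xi, (_, f) in zip(x, pairs)) / sum(xi * xi for xi in x)
    # Invert F = (4/3) * E / (1 - nu^2) * sqrt(R) * delta^1.5 for E.
    return slope * 3.0 * (1.0 - poisson ** 2) / (4.0 * math.sqrt(radius_m))


# Synthetic, noise-free curve generated from the same Hertz relation (hypothetical values).
radius = 2.5e-6        # m, assumed sphere radius
true_modulus = 1000.0  # Pa, arbitrary value used to generate the curve
depths = [i * 0.1e-6 for i in range(1, 31)]  # 0.1 to 3.0 µm
prefactor = 4.0 / 3.0 * true_modulus / (1.0 - 0.5 ** 2) * math.sqrt(radius)
forces = [prefactor * d ** 1.5 for d in depths]

estimate = fit_hertz_modulus(depths, forces, radius, max_indentation_m=1.5e-6)
print(round(estimate, 3))
```

      The Hertz relation is strictly valid only for indentations that are small compared with the sphere radius, which is one reason a Sneddon-type model is the more appropriate choice at indentations approaching 1.5 µm with a sphere of this size.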

      In the indentation experiments, the curves were fitted to a maximal indentation of 1.5 μm (rarely exceeded, see Author response image 20). We have now added this information to the methods section:

      - Could the authors include the dataset wt #1 in Fig 4D? Does it display the same trend? 

      We thank the reviewer for this comment. To clarify: in the MCF10A dataset (GEO: GSE69822) there are exactly three replicates of each wt (wild type) and ki (knock-in, referring to the H1047R mutation in the PIK3CA) samples. The numbering wt#2, wt#3, wt#4 originated from the short names that were used in the working files containing non-averaged RPKM (possibly referring to three different measurement replicates that may not have been exactly paired with the ki samples). We have now renamed the samples as wt#1, wt#2 and wt#3 to avoid confusion. This naming also better reflects the sample description as deposited in the GSE69822 dataset (see Author response table 2).

      Author response table 2.

      - Reference (3) is an opinion article with the last author as the sole author. It is used twice as a self-standing reference, which is confusing, as it suggests there is previous experimental evidence. 

      We thank the reviewer for pointing this out and agree that it may not be appropriate to cite the article (Guck 2019 Biophysical Reviews, formerly Reference (3), currently Reference (76)) in all instances. The references to this opinion article have now been removed from the introduction:

      “The extent to which cells can be deformed by external loads is determined by their mechanical properties, such as cell stiffness. Since the mechanical phenotype of cells has been shown to reflect functional cell changes, it is now well established as a sensitive label-free biophysical marker of cell state in health and disease (1-2).”

      “Alternatively, the problem can be reverse-engineered, in that omics datasets for systems with known mechanical phenotype changes are used for prediction of genes involved in the regulation of mechanical phenotype in a mechanomics approach.”

      But has been kept in the discussion:

      “The mechanical phenotype of cells is recognized as a hallmark of many physiological and pathological processes. Understanding how to control it is a necessary next step that will facilitate exploring the impact of cell mechanics perturbations on cell and tissue function (76).”

      This reference seems appropriate to us as it expands on the point that our ability to control cell mechanics will enable the exploration of its impact on cell and tissue function, which is central to the discussion of the current manuscript. 

      -The authors should mention what PC-corr means. Principle component correlation? Pearson's coefficient correlation? 

      PC-corr is a combination of loadings from the principal component (PC) analysis and Pearson’s correlation for each gene pair. We have aimed at conveying this in the “Discriminative network analysis on prediction datasets” result section. We have now added an extra sentence at the first appearance of PC-corr to clarify this for the readers from the start:

      “After characterizing the mechanical phenotype of the cell states, we set out to use the accompanying transcriptomic data to elucidate genes associated with the mechanical phenotype changes across the different model systems. To this end, we utilized a method for inferring phenotype-associated functional network modules from omics datasets termed PC-Corr (28), that relies on combining loadings obtained from the principal component (PC) analysis and Pearson’s correlation (Corr) for every pair of genes. PC-Corr was performed individually on two prediction datasets, and the obtained results were overlayed to derive a conserved network module. Owing to the combination of the Pearson’s correlation coefficient and the discriminative information included in the PC loadings, the PC-corr analysis does not only consider gene co-expression, as is the case for classical co-expression network analysis, but also incorporates the relative relevance of each feature for discriminating between two or more conditions; in our case, the conditions representing soft and stiff phenotypes. The overlaying of the results from two different datasets allows for a multi-view analysis (utilizing multiple sets of features) and effectively merges the information from two different biological systems.”
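      The idea of combining PC loadings with Pearson’s correlation can be sketched as follows. This is a simplified illustration under stated assumptions rather than the published implementation: it assumes that each gene already has a loading normalized to the range [−1, 1], and that an edge weight for a gene pair is the smallest of the two absolute loadings and the absolute correlation, signed by the correlation. The loading normalization described in the original PC-corr publication is omitted, and all names and numbers below are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson's correlation coefficient of two equally long sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


def pc_corr_edge(loading_i, loading_j, expr_i, expr_j):
    """Edge weight for one gene pair (assumed form, see the text above).

    The weight is limited by the weakest of three quantities: how strongly each
    gene discriminates the conditions (its loading) and how strongly the two
    genes are co-expressed (their correlation).
    """
    c = pearson(expr_i, expr_j)
    weight = min(abs(loading_i), abs(loading_j), abs(c))
    return math.copysign(weight, c)


# Hypothetical expression profiles across four samples and hypothetical loadings.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # perfectly co-expressed with gene_a
gene_c = [8.0, 6.0, 4.0, 2.0]   # perfectly anti-correlated with gene_a

edge_ab = pc_corr_edge(0.8, -0.6, gene_a, gene_b)
edge_ac = pc_corr_edge(0.8, -0.6, gene_a, gene_c)
print(edge_ab, edge_ac)
```

      In this form, a pair of strongly co-expressed genes still receives a low weight if either gene contributes little to separating the conditions, which is the distinction from classical co-expression analysis drawn in the passage above.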

      - The formatting of Table 1 is confusing. Horizontal lines should be added to make it clear to the reader which datasets are human and which mouse as well as which accession numbers belong to the carcinomas. 

      Horizontal lines have now been added to improve the readability of Table 1. We hope that this makes the table easier to follow and satisfies the request. We assume that further modifications to the table appearance may occur during the publishing process in accordance with the publisher’s guidelines.

      - In many figures, data points are shown in different shapes without an explanation of what the shapes represent. 

      We thank the reviewer for this comment and apologize for not adding this information earlier. We have added explanations of the symbols to captions of Figures 2, 3, 5, and 6 in the main text:

      “Fig. 2. Mechanical properties of divergent cell states in five biological systems. Schematic overviews of the systems used in our study, alongside with the cell stiffness of individual cell states parametrized by Young’s moduli E. (…) Statistical analysis was performed using generalized linear mixed effects model. The symbol shapes represent measurements of cell lines derived from three different patients (A), matched experimental replicates (C), two different reprogramming series (D), and four different cell isolations (E). Data presented in (A) and (D) were previously published in ref (29) and (30), respectively.”

      “Fig. 3. Identification of putative targets involved in cell mechanics regulation. (A) Glioblastoma and iPSC transcriptomes used for the target prediction intersect at 9,452 genes. (B, C) PCA separation along two first principal components of the mechanically distinct cell states in the glioblastoma (B) and iPSC (C) datasets. The analysis was performed using the gene expression data from the intersection presented in (A). The symbol shapes in (B) represent cell lines derived from three different patients. (…)”

      “Fig. 5. Perturbing levels of CAV1 affects the mechanical phenotype of intestine carcinoma cells. (…) In (E), (F), (I), and (J), the symbol shapes represent experiment replicates.”

      “Fig. 6. Perturbations of CAV1 levels in MCF10A-ER-Src cells result in cell stiffness changes. (…)  Statistical analysis was performed using a two-sided Wilcoxon rank sum test. In (B), (D), and (E), the symbol shapes represent experiment replicates.”

      As well as to Figures S2, S9, and S11 in the supplementary material (in Figure S2, the symbol explanation was added to the legends in the figure panels as well): 

      “Fig. S2. Plots of area vs deformation for different cell states in the characterized systems. Panels correspond to the following systems: (A) glioblastoma, (B) carcinoma, (C) non-tumorigenic breast epithelia MCF10A, (D) induced pluripotent stem cells (iPSCs), and (E) developing neurons. 95%- and 50% density contours of data pooled from all measurements of given cell state are indicated by shaded areas and continuous lines, respectively. Datapoints indicate medians of individual measurements. The symbol shapes represent cell lines derived from three different patients (A), two different reprogramming series (D), and four different cell isolations (E), as indicated in the respective panels. (…).”

      “Fig. S9. CAV1 knock-out mouse embryonic fibroblasts (CAV1KO) have lower stiffness compared to the wild type cells (WT). (…) (C) Apparent Young’s modulus values estimated for WT and CAV1KO cells using area-deformation data in (B). The symbol shapes represent experimental replicates. (…)”

      “Fig. S11. Plots of area vs deformation from RT-DC measurements of cells with perturbed CAV1 levels. Panels correspond to the following experiments: (A and B) CAV1 knock-down in TGBC cells using esiRNA (A) and ONTarget siRNA (B), (C and D) transient CAV1 overexpression in ECC4 cells (C) and TGBC cells (D). Datapoints indicate medians of individual measurement replicates. The isoelasticity lines in the background (gray) indicate regions of the same apparent Young’s moduli. The symbol shapes represent experimental replicates.”

      - In Figure 2, the difference in stiffness appears bigger than it actually is because the y-axes are not starting at 0. 

      While we acknowledge that starting the y-axes at a value other than 0 is generally not ideal, we chose this approach to better display data variability and minimize empty space in the plots.

      A similar effect can be achieved with logarithmic scaling, which is a common practice (see Author response image 21 for visualization). We believe our choice of axes cut-off enhances the interpretability of the data without misleading the viewer.

      Author response image 21.

      Visualization of different axis scaling strategies applied to the five datasets presented in Figure 2 of the manuscript. 

      Of note, apparent Young’s moduli obtained from RT-DC measurements typically span 0.5-3.0 kPa (see Figure 2.3 from Urbanska et al 2021, PhD thesis). Differences between treatments rarely exceed a few hundred pascals. For example, in an siRNA screen of mitotic cell mechanics regulators in Drosophila cells (Kc167), the strongest hits (e.g., Rho1, Rok, dia) showed changes in stiffness of 100-150 Pa (see Supplementary Figure 11 from Rosendahl, Plak et al 2018, Nature Methods 15(5): 355-358).

      - In Figure 3, I don't personally see the benefit of showing different cut-offs for PC-corr. In the end, the paper focuses on the 5 genes in the pentagram. I think only showing one of the cutoffs and better explaining why those target genes were picked would be sufficient and make it clearer for the reader. 

      We believe it is beneficial to show the extended networks for a few reasons. First, it demonstrates how the selected targets connect to the broader panel of the genes, and that the selected module is indeed much more interconnected than other nodes. Second, the chosen PC-corr cut-off is somewhat arbitrary and it may be interesting to look through the genes from the extended network as well, as they are likely also important for regulating cell mechanics. This broader view may help readers identify familiar genes and recognize the connections to relevant signaling networks and processes of interest.

      - In Figure 4C, I suggest explaining why the FANTOM5 and not another dataset was used for the visualization here and mentioning whether the other datasets were similar. 

      In Figure 4C, we have chosen to present data corresponding to FANTOM5, because that was the only carcinoma dataset in which all the cell lines tested mechanically are presented. We have now added this information to the caption of Figure 4. Additionally, the clustergrams corresponding to the remaining carcinoma datasets (CCLE RNASeq, Genentech) are presented in supplementary figures S4-S6.

      “The target genes show clear differences in expression levels between the soft and stiff cell states and provide for clustering of the samples corresponding to different cell stiffnesses in both prediction and validation datasets (Fig. 4, Figs. S4-S6).”

      Typos 

      We would like to thank Reviewer #3 for their detailed comments on the typos and details listed below. This is much appreciated as it improved the quality of our manuscript.

      -  In the first paragraph of the results section the 'and' should be removed from this sentence: Each dataset encompasses two or more cell states characterized by a distinct mechanical phenotype, and for which transcriptomic data is available. 

      The sentence has been corrected and now reads:

      “Each dataset encompasses two or more cell states characterized by a distinct mechanical phenotype, for which transcriptomic data is available.”

      -  In the methods in the MCF10A PIK3CA cell lines part, it says cell liens instead of cell lines. 

      The sentence has been corrected and now reads:

      “The wt cells were additionally supplemented with 10 ng ml<sup>−1</sup> EGF (E9644, Sigma-Aldrich), while mutant cell lines were maintained without EGF.”

      -  In the legend of Figure 6 "accession number: GSE17941, data previously published in ())" the reference is missing. 

      The reference has been added.

      -  In the legend of Figure 5 "(E) Verification of CAV1 knock-down in TGBC cells using two knock-down system" 'a' between using and two is missing. 

      The legend has been corrected (no ‘a’ is missing, but it should say systems (plural)):

      -  In Figure 5B one horizontal line is missing. 

      The Figure 5B has been corrected accordingly. 

      -  Terms such as de novo or in silico should be written in cursive. 

      We thank the Reviewer for this comment; however, we believe that in the style used by eLife, common Latin expressions such as de novo or in vitro are used in regular font.

      -  In the heading of Table 4 "The results presented in this table can be reproducible using the code and data available under the GitHub link reported in the methods section." It should say reproduced instead of reproducible. 

      Yes, indeed. It has been corrected.

      -  The citation of reference 20 contains several author names multiple times. 

      Indeed, it has been fixed now:

      -  In Figure S2 there is a vertical line in the zeros of the y axis labels. 

      I am not sure if there was some rendering issue, but we did not see a vertical line in the zeros of the y axis label in Figure S2.

      - The text in Figure S4 is too small.

      We thank the reviewer for pointing this out. We have now revised Figure S7 (formerly Figure S4) to increase the text size, ensuring better readability. (It has also been updated to include additional fits as requested by Reviewer #2).

      - In Table 3 "positive hypothesis II markers are discriminative of samples with stiff/soft independent of data source" the words 'mechanical phenotype' are missing. 

      The column headings in Table 3 have now been updated accordingly.

      - In Table S3 explain in the table headline what vi1, vi2 and v are. I assume the loading for PC1, the loading for PC2 and the average of the previous two values. But it should be mentioned somewhere.

      The caption of table S3 has been updated to explain the meaning of vi1, vi2 and v.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this manuscript, the authors provide strong evidence that the cell surface E3 ubiquitin ligases RNF43 and ZNRF3, which are well known for their role in regulating cell surface levels of WNT receptors encoded by FZD genes, also target EGFR for degradation. This is a newly identified function for these ubiquitin ligases beyond their role in regulating WNT signaling. Loss of RNF43/ZNRF3 expression leads to elevated EGFR levels and signaling, suggesting a potential new axis to drive tumorigenesis, whereas overexpression of RNF43 or ZNRF3 decreases EGFR levels and signaling. Furthermore, RNF43 and ZNRF3 directly interact with EGFR through their extracellular domains.

      Strengths:

      The data showing that RNF43 and ZNRF3 interact with EGFR and regulate its levels and activity are thorough and convincing, and the conclusions are largely supported.

      Weaknesses:

      While the data support that EGFR is a target for RNF43/ZNRF3, some of the authors' interpretations of the data on EGFR's role relative to WNT's roles downstream of RNF43/ZNRF3 are overstated. The authors, perhaps not intentionally, promote the effect of RNF43/ZNRF3 on EGFR while minimizing their role in WNT signaling. This is the case in most of the biological assays (cell and organoid growth and mouse tumor models). For example, the conclusion of "no substantial activation of Wnt signaling" (page 14) in the prostate cancer model is currently not supported by the data and requires further examination. In fact, examination of the data presented here indicates effects on WNT/b-catenin signaling, consistent with previous studies.

      Cancers in which RNF43 or ZNRF3 are deleted are often considered to be "WNT addicted", and inhibition of WNT signaling generally potently inhibits tumor growth. In particular, treatment of WNT-addicted tumors with Porcupine inhibitors leads to tumor regression. The authors should test to what extent PORCN inhibition affects tumor (and APC-min intestinal organoid) growth. If the biological effects of RNF43/ZNRF3 loss are mediated primarily or predominantly through EGFR, then PORCN inhibition should not affect tumor or organoid growth.

      We thank the reviewer for appreciating the key strength of our study. We fully agree with the reviewer that RNF43/ZNRF3 play key roles in restraining WNT signaling and that their deletions activate WNT signaling, which leads to cancer promotion, as discussed and cited in our manuscript (Hao et al, 2012; Koo et al, 2012). We have revised the language in this manuscript to avoid any confusion or appearance of downplaying this known signaling pathway in cancer progression.

      What we would like to highlight in this work is that our study uncovered an effect of RNF43/ZNRF3 on EGFR, leading to biological impact in multiple model systems. In particular, we included the APC-mutated human cancer cell line HT29 and Apc min mouse intestinal tumor organoids. In the context of APC mutations, β-catenin stabilization and the activation of WNT target genes are essentially decoupled from upstream WNT ligand binding to WNT receptors, so we could primarily focus on the effect of RNF43/ZNRF3 on EGFR. Our statement of “no substantial activation of WNT signaling” as cited by the reviewer was made in describing the data in Fig. 7E, where we did not observe β-catenin accumulation in the nucleus and therefore inferred no substantial activation of canonical WNT signaling. We agree that further examination would help strengthen the conclusion and appreciate the reviewer’s suggestion of PORCN inhibition experiments. While PORCN inhibition is a valuable experiment in models with an abundance of WNT ligands/receptors and non-mutationally activated regulators of WNT signaling (Yu et al, 2020), in biological scenarios with existing APC mutations, another group has previously demonstrated that PORCN inhibition had no observable effect on WNT signaling in APC-deficient cells (PMID: 29533772). In our initial submission, we confirmed this predicted low response to manipulation of WNT signaling components upstream of a mutated APC. We showed that addition of RSPO1 in Apc min mouse intestinal tumor organoids failed to further activate WNT target expression (Fig. 6G). Furthermore, in this revised manuscript, we added new data on EGFR inhibition and PORCN inhibition in WT and Znrf3 KO MEFs (Fig. 6L). PORCN inhibition had no impact on cell growth in either WT or Znrf3 KO MEFs, suggesting that the growth-promoting effect of Znrf3 KO in MEFs is WNT independent. In contrast, inhibition of EGFR downstream signaling components (Fig. 6L) significantly blocked MEF growth and abolished the impact of Znrf3 KO on MEF growth. This new evidence further supports our main conclusion that RNF43/ZNRF3 controls EGFR signaling to regulate cell growth.

      Reviewer #2 (Public Review):

      Using proteogenomic analysis of human cancer datasets, Yu et al. found that EGFR protein levels negatively correlate with ZNRF3/RNF43 expression across multiple cancers. Interestingly, they found that CRC harbouring the frequent RNF43 G659Vfs*41 mutation exhibits higher levels of EGFR when compared to RNF43 wild-type tumors. This is highly interesting since this mutation is generally not thought to influence Frizzled levels and Wnt/β-catenin pathway activity. Using CRISPR knockouts and overexpression experiments, the authors show that EGFR levels are modulated by ZNRF3/RNF43. Supporting these findings, modulation of ZNRF3/RNF43 activity using R-spondin also leads to increased EGFR levels. Mechanistically, the authors show that ZNRF3/RNF43 ubiquitinate EGFR, leading to its degradation. Finally, the authors present functional evidence that loss of ZNRF3/RNF43 unleashes EGFR-mediated cell growth in 2D culture and organoids and promotes tumor growth in vivo.

      Overall, the conclusions of the manuscript are well supported by the data presented, but some aspects of the mechanism presented need to be reinforced to fully support the claims made by the authors. Additionally, the title of the paper suggests that ZNRF3 and RNF43 loss leads to the hyperactivity of EGFR and that its signalling activity contributes to cancer initiation/progression. I don't think the authors convincingly showed this in their study.

      We thank the reviewer for commenting that our “conclusions of the manuscript are well supported by the data presented.” We address the concerns raised by this reviewer point by point below:

      Major points:

      (1) EGFR ubiquitination. All of the experiments supporting that ZNRF3/RNF43 mediates EGFR ubiquitination are performed under overexpression conditions. A major caveat is also that none of the ubiquitination experiments are performed under denaturing conditions. Therefore, it is impossible to claim that the ubiquitin immunoreactivity observed on the western blots presented in Figure 4 corresponds to ubiquitinated-EGFR species. Another issue is that in Figure 4A, the experiments suggest that the RNF43-dependent ubiquitination of EGFR is promoted by EGF. However, there is no control showing the ubiquitination of EGFR in the absence of EGF but under RNF43 overexpression. According to the other experiments presented in Figures 4B, 4C, and 4F, there seems to be a constitutive ubiquitination of EGFR upon overexpression. How do the authors reconcile the role of ZNRF3/RNF43 vs c-cbl?

      We agree with the reviewer about the limitations of overexpression experiments. In this manuscript, we leveraged both overexpression and knockout systems to demonstrate that ZNRF3/RNF43 regulates EGFR ubiquitination: in Fig 4A, we showed that overexpression of RNF43 increased EGFR ubiquitination; in Fig 4B&C and Fig S3A, we showed that RNF43 knockout decreased EGFR ubiquitination; in Fig 4F, we showed that overexpression of ZNRF3 WT increased EGFR ubiquitination but overexpression of the ZNRF3 RING domain deletion mutant failed to increase EGFR ubiquitination.

      We also appreciate the rigor with which the reviewer has approached our methodology. We acknowledge that denaturing conditions can provide additional validation, but the technical challenges associated with denaturing conditions include the potential disruption of epitope structures recognized by these antibodies. Our methodology was chosen to balance the need for accurate detection with the preservation of protein structure and function, which are crucial for understanding the biological implications of EGFR ubiquitination. Moreover, our immunoprecipitation and subsequent Western blotting were stringent with high SDS and 2-ME, optimized to minimize non-specific binding and enhance the specificity of detection. We believe that the data presented are robust and contribute significantly to the existing body of knowledge on EGFR ubiquitination.

      CBL is a well-known E3 ligase of EGFR, and it induces EGFR ubiquitination upon EGF ligand stimulation. Therefore, in order to have a fair comparison of RNF43 and CBL on EGFR ubiquitination, we designed Fig 4A and related experiments in the setting of EGF stimulation. We observed that RNF43 overexpression increased EGFR ubiquitination as potently as CBL did. Following this result, we further demonstrated that knockout of RNF43 decreased the level of endogenous ubiquitinated EGFR in the unstimulated/basal condition (Fig 4B) as well as in the EGF-stimulated condition (Fig 4C). We acknowledge the importance and interest in fully understanding how ZNRF3/RNF43 interplays with the functions of CBL in regulating EGFR ubiquitination. This line of investigation indeed holds the potential to uncover novel regulatory mechanisms in detail. However, the primary focus of the current study was to establish a foundational understanding of the role of ZNRF3/RNF43 in regulating EGFR ubiquitination. We look forward to exploring this further in future work.

      (2) EGFR degradation vs internalization. In Figure 3C, the authors show experiments that demonstrate that RNF43 KO increases steady-state levels of EGFR and prevents its EGF-dependent proteolysis. Using flow cytometry they then present evidence that the reduction in cell surface levels of EGFR mediated by EGF is inhibited in the absence of RNF43. The authors conclude that this is due to inhibition of EGF-induced internalization of surface EGF. However, the experiments are not designed to study internalization and rather merely examine steady-state levels of surface EGFR pre- and post-treatment. These changes are an integration of many things (retrograde and anterograde transport mechanisms presumably modulated by EGF). What process(es) is/are specifically affected by ZNRF3/RNF43? Are these processes differently regulated by c-Cbl? If the authors are specifically interested in internalization/recycling, cell surface biotinylation experiments and time courses are needed to examine the effect of EGF in the presence or absence of the E3 ligases.

      We agree that our study design primarily assesses EGFR levels on the cell surface before and after EGF treatment and does not comprehensively measure the whole internalization process. In response to the reviewer’s comments, we have revised the relevant sections of the manuscript to clarify that our current findings are focused on changes in cell surface EGFR and do not extend to the detailed mechanisms of EGF-induced internalization or recycling.

      (3) RNF43 G659fs*41. The authors make a point in Figure 1D that this mutant leads to elevated EGFR in cancers but do not present evidence that this mutant is ineffective in mediating ubiquitination and degradation of EGFR. As this mutant maintains its ability to promote Frizzled ubiquitination and degradation, it would be important to show side by side that it does not affect EGFR. This would perhaps imply differential mechanisms for these two substrates.

      Fig 1D is based on bioinformatic analysis of colon cancer patient samples, showing that RNF43 G659Vfs*41 mutant tumors exhibited significantly higher levels of EGFR protein compared to RNF43 WT tumors. Following this lead, we investigated whether the RNF43 G659Vfs*41 hotspot mutation abolishes the ability of RNF43 to downregulate EGFR. To this end, we transfected the same amount of control vector, RNF43 WT, RING deletion mutant, or G659Vfs*41 mutant DNA into 293T cells and measured the level of co-transfected EGFR. As shown in Author response image 1, overexpression of RNF43 WT decreased the EGFR level, while overexpression of the RING deletion mutant had no impact on the EGFR level as compared with the Vector group, which is consistent with our findings in the manuscript. Cells transfected with the RNF43 G659Vfs*41 mutant exhibited nearly normal levels of EGFR; however, we also observed that RNF43 G659Vfs*41 was less expressed than WT, even though the same amounts of DNA were transfected. Therefore, the insubstantial impact on EGFR levels could be attributed to either functional loss or compromised stability of RNF43 G659Vfs*41 mRNA or protein. Further investigation of RNF43 G659Vfs*41 mRNA and protein stability vs. RNF43 G659Vfs*41 protein function is needed to draw a solid conclusion.

      Author response image 1.

      (4) "Unleashing EGFR activity". The title of the paper implies that ZNRF3/RNF43 loss leads to increased EGFR expression and hence increased activity that underlies cancer. However, I could find only one piece of direct evidence, showing that increased proliferation of the HT29 cell line mutant for RNF43 could be inhibited by the EGFR inhibitor Erlotinib. All the other evidence presented that I could find is correlative or indirect (e.g. RPPA showing increased phosphorylation of pathway members upon RNF43 KO, increased proliferation of a cell line upon ZNRF3/RNF43 KO, decreased proliferation of a cell line upon ZNRF3/RNF43 OE in vitro or in xeno...). Importantly, the authors claim that cancer initiation/progression in ZNRF3/RNF43 mutants may in some contexts be independent of their regulation of Wnt-β-catenin signaling and rely on EGFR activity upregulation. However, this has not been tested directly. Could the authors leverage their Znrf3/Rnf43 prostate cancer model to test whether EGFR inhibition could lead to reduced cancer burden whereas a Frizzled or Wnt inhibitor does not?

      More broadly, if EGFR signaling were to be unleashed in cancer, then one prediction would be that these cells would be more sensitive to EGFR pathway inhibition. Could the authors provide evidence that this is the case? Perhaps using isogenic cell lines or a panel of patient-derived organoids (with known genotypes).

      We appreciate the reviewer’s suggestion to provide more direct evidence demonstrating the importance of the ZNRF3/RNF43-EGFR axis in cancer cell proliferation. In this revised manuscript, we further studied this issue in WT vs. Znrf3 KO MEF cells. We observed that treatment with the EGFR inhibitor erlotinib did not affect WT MEFs but blunted the growth advantage of Znrf3 KO MEF cells (Fig. 6L). On the other hand, treatment with the porcupine inhibitor C59 did not impact either WT or Znrf3 KO MEF cells (Fig. 6L), suggesting a more important role of the ZNRF3/RNF43-EGFR axis in mediating the enhanced MEF growth caused by Znrf3 knockout. Furthermore, considering that EGFR is often mutated in human cancer, to increase the clinical relevance of our study, we also tested the effect of RNF43 knockout on EGFR L858R (Fig. 2D), a common oncogenic EGFR mutant, and found that RNF43 knockout in HT29 boosted levels of this EGFR mutant detected by its FLAG tag, suggesting that RNF43 degrades both WT and mutated EGFR and that its loss can enhance signaling of both WT EGFR and its oncogenic mutant. However, we emphasize again that this manuscript is in no way written to diminish the proven importance of the ZNRF3/RNF43-WNT-β-catenin axis in cancer and development.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      The main conclusion that EGFR is targeted for degradation by RNF43 and ZNRF3 is well supported and documented. Figures 1-5 and associated supplemental figures contain largely convincing data. Figures 6 and 7, however, require some modifications, as follows in order of appearance:

      Figure 6C: Growth of intestinal tumor organoids from Apcmin mice does not require Rspo, however, the authors show that these organoids grow larger in the presence of Rspo, an effect they attribute to increased EGFR activity, rather than increased WNT activity. While this conclusion may be correct, the authors should address this possibility by treating the organoids with PORCN inhibitor. The prediction would be that Rspo treatment still increases organoid size in the presence of PORCN inhibition. A further prediction would be that blocking EGFR (e.g. with Cetuximab) will abrogate the RSPO1 effect.

      Yes, we attributed the impact of Rspo on Apc min organoid growth to enhanced EGFR activity because we observed increased EGFR levels (Fig 6F) but no detectable increase in the eight WNT target genes assayed. We agree that additional pharmacologic experiments would further strengthen our conclusion, but our few attempts at treating organoids encountered technical difficulties. Hence, we switched to testing PORCN inhibition vs EGFR inhibition in WT and Znrf3 KO MEFs. As shown in the revised Fig. 6L, EGFR inhibition significantly reversed the growth advantage caused by Znrf3 KO but C59 did not.

      Figure 6G: It is unclear why the authors provide "8-day RSPO1 treatment" data. Here, EGFR mRNA appears to be elevated 2-fold (perhaps not statistically significant), and the Wnt targets Lef1 and Axin2 are decreased, as indicated by the statistical significance. What point is being made here?

      Our observations of increased size of Apc min mouse intestinal tumor organoids and increased EGFR protein levels were made at 8 days of RSPO1 treatment. Therefore, we measured mRNA levels at the same time point, with the 2-day time point also included for comparison. The goal of this qPCR experiment was to detect the contribution of WNT signaling, and we did not detect an increased transcriptional readout. We included EGFR mRNA levels for comparison, and we did not detect a statistically significant increase, consistent with our experiments concluding that ZNRF3/RNF43 regulate EGFR at the protein level. As stated in the preceding response, these data led us to attribute the impact of Rspo on Apc min organoid growth to enhanced EGFR activity.

      Figure 7A: This requires quantitation. How many mice were used per cell line? The data shown is not particularly convincing, with ZNRF3 overexpressing HT29 cells growing detectably. Showing representative mice is fine, but this should be supplemented with quantitation of all mice.

      We had provided this data. The BLI signal quantification was shown below the representative BLI images. Seven mice were used per cell line, as annotated at the top of the graph.

      Figure 7B: The authors assert that "canonical WNT signaling, based on levels of active-β-Catenin (non-phosphorylated at Ser33/37/Thr41; Figure 7B), remained unaffected". As shown, 2 of the 3 Myc-Znrf3 tumors have increased active-β-catenin signal over the GFP tumors. This indicates to me that canonical Wnt signaling was affected. The authors either need to present quantitative data that supports this claim or modify their conclusions. As presented, I don't think it is appropriate to decouple the effect of Znrf3 overexpression on EGFR from its effect on WNT.

      As requested, we have quantified the level of non-phospho β-Catenin at Ser33/37/Thr41 and found no significant differences (p > 0.05) between the control group vs. ZNRF3 overexpression group. We once again note that our manuscript was not meant to dispute the proven signaling and biological significance of WNT signaling regulation by ZNRF3/RNF43, and we have proof-read the manuscript multiple times to ensure that we did not make any generalized or misleading statements in this aspect.

      Author response image 2.

      Figure 7E: Here the authors assert that "no substantial activation of canonical Wnt signaling" in the Z&R KO tumors, however, the figure shows a substantial increase in active β-catenin staining. The current resolution is insufficient to claim that there is no increase in nuclear β-catenin. The authors' claim that WNT signaling is not involved here is not supported by the data presented here. One way to demonstrate that this effect is through EGFR activation and not through WNT activation is to treat mice with PORCN inhibitor. WNT-addicted tumors, such as by Rnf43 or Znrf3 deletion, regress upon PORCN inhibition. In this case, if the effect of Z&R KO is mediated through EGFR rather than WNT, then there should be no effect on tumor growth upon PORCN inhibition. This is a critical experiment in order to make this point.

      We appreciate the reviewer’s comments and suggested experiments. We based our initial statement on insubstantial nuclear β-catenin staining, but we agree that immunohistochemical staining lacks the resolution suitable for quantification. We could not generate an adequate number of KO animals for these in vivo experiments in the window of time planned for this revision. Instead, as shown in the newly added Fig. 6L, we tested EGFR inhibition and PORCN inhibition in Znrf3 KO MEFs and obtained strong data further supporting a role for EGFR in mediating the growth promotion of MEFs by Znrf3 KO. Notwithstanding, we have carefully revised our description of the in vivo data in Fig 7E to avoid any confusion or over-interpretation.

      Minor points:

      Figure 2A: provide quantitation of this immunoblot.

      We have revised the manuscript to show the quantification result next to the immunoblot.

      Figure 2B: provide more detail in the figure legend and in the Materials and Methods section on how the KO MEFs were generated. Confirmation that Znrf3 (or in cases of Rnf43 KO) expression is lost in KO would be advisable.

      We have confirmed Znrf3 KO by genotyping and RNF43 KO by immunofluorescent staining. We have also tested multiple commercial anti-ZNRF3 antibodies and anti-RNF43 antibodies for Western blotting, but they all failed.

      Figure 4C is a little misleading. The schematic indicates that ECD-TM and TM-ICD truncations were analyzed for both ZNRF3 and RNF43. However, Figure 4 only shows data for ZNRF3, and the corresponding Figure S4 lacks data for the TM-ICD of Rnf43. A recommendation is to show only those schematics for which data is presented in that figure. On a related topic, the results using the deltaRING constructs (Figure S5) are not mentioned/described in the text.

      We think that the reviewer meant Fig 5C. We have revised Fig 5C by removing the RNF43 label, and we confirm that the Results section does include the data in Fig S5.

      Figure S4A: Only ZNRF3 is indicated in this figure. Please explain why RNF43 is not represented here. Also, indicate what is plotted along the x-axis.

      We only detected the endogenous ZNRF3-EGFR interaction, possibly because the RNF43 protein level is relatively low in the cell line we used for the mass spec experiment. The x-axis shows the proteins ordered by their y-axis values, as detailed in the figure legend: each data point was arranged along the x-axis based on the fold change of iBAQ of EGFR-associated proteins identified in EGF-stimulated vs. control cells on the log2 scale, from low to high (from left to right on the x-axis). We have added the phrase “Proteins detected by Mass-Spec” as the x-axis label.

      Reviewer #2 (Recommendations For The Authors):

      Minor Points.

      (1) In Figure 2B, the authors claim that Znrf3 KO enhanced both EGFR and p-EGFR levels both in the absence and presence of EGF. Although it is clear in the presence of EGF, the increase in p-EGFR in the absence of EGF is less than clear.

      We have revised the manuscript to more clearly state the result in Fig 2B.

      (2) Importantly, the authors validated their findings using three independent RNF43 gRNAs (Fig S2D), but they do not show the editing efficiency obtained with the gRNAs.

      We did not include RNF43 IB in this figure due to the lack of specific antibodies for detecting RNF43 by IB. We have no reason to doubt that knockout efficiency was adequate, since EGFR was increased compared to the control group. As a result, we did not perform deep sequencing to validate knockout efficiency.

      (3) In S2E, the authors show that KO of either ZNRF3 or RNF43 enhances HER2 levels. This suggests that there is no redundancy between these E3 ligases, at least in this context. How do the authors reconcile that?

      The reviewer raised an interesting issue. Due to the lack of WB antibodies for these two proteins, we could not easily assess the feedback impact of knocking out either gene on the protein level of the other. We speculate that there may be a threshold level of the sum of the two proteins that is needed for adequate degradation of HER2, leading to a HER2 increase when either gene is knocked out. Detailed studies of this issue are beyond the scope of the current work.

      (4) Experiments performed in Fig 3C are performed in only one clone. The authors need to repeat in an additional clone or rescue this phenotype using a RNF43 cDNA.

      Our RNF43 KO HT29 line is a pool of KO cells, not a single clone.

      (5) In Figure 7E, the authors suggest that the absence of nuclear β-catenin means that canonical Wnt signaling is unaffected. It is widely known that nuclear β-catenin often does not correlate with pathway activity.

      As stated above, we have revised the manuscript to avoid confusion and misinterpretation.

      (6) What is the nature of the error bars in Fig 3c? Are the differences statistically significant?

      As mentioned in the figure legend, the error bars are SEM. The result is statistically significant, and p-value is noted in the graph.

      (7) In the Figure legends, it should be stated clearly how many biological replicates were performed for each experiment and single data points should be plotted where applicable (e.g. qPCR data). It would be helpful if the uncropped and unprocessed Western blot membranes and replicates that are not shown would be accessible to allow the reader a more comprehensive view of the acquired data, especially for blots that were quantified (e.g. Figure 2F, Figure 3C, there is clearly some defect on the blot).

      For WB representation, it would be helpful to include more size markers on the Western blots (especially on the IPs that show a ubiquitin smear) and in general to use a reference protein (GAPDH, Actin, Vinculin) that is closer in size to the protein being assessed.

      More details should be added in the Methods section to explain how protocols were performed in detail. For example, it should be explained how the viruses used for infecting cells were produced (which plasmids were transfected using which transfection reagent, how long was the virus collected for, etc). Then, it should be stated how long the cells were undergoing selection before being harvested. Because the expression of the viral constructs potentially has an effect on cell proliferation through EGFR, this information is quite relevant. This is just an example, there are details missing in nearly every section (Flow: washing protocols, gating protocols (Live/dead stain?), WB: RIPA lysis buffer composition? How much protein was loaded on blots? How was protein quantification done? IP: how were washes performed and how often repeated?)

      Missing: antibody dilutions for IF, IHC, and WB, plasmid backbones, sequences and availability, qPCR primer sequences from Origene.

      Incucyte experiments are not described.

      We have revised the relevant sections to include more details.

      (8) Line 141: revise text: 2x mRNA abundance in the same sentence.

      Line 162: define intermediate expression better.

      Line 197/198: revise text ('the predominant one'?).

      Line 218/219: revise text (Internalisation of surface EGFR?).

      Line 245: clarify in text that it is endogenous EGFR that is being pulled down.

      Line 264: typo: conserved instead of conservative.

      Line 324: revise text (What does 'unknown significance' mean).

      Line 396/397: revise text: 2x Co-IP in the same sentence.

      Figure 3 D/E: more details on the Method in the figure legend.

      We have revised them accordingly.

    1. Author Response

      The following is the authors’ response to the current reviews.

      I greatly appreciate your time and attention to our manuscript. I have carefully considered the reviewers’ comments and made modifications. Below are my responses to each comment and the revisions I have made.

      Reviewer #2 (Recommendations for The Authors):

      1) The authors have addressed most of my concerns well. I am fine with most of the responses except question 8. Actin is also reported to be located in the nucleus (PMID: 31481797). It would be better to utilize other markers, like GAPDH. Moreover, the authors did not address the issue of LXRα. I strongly suggest that the authors repeat this experiment to get a more solid result.

      Thank you for the comment! Actin is frequently used as a negative control for nuclear protein in many publications, such as DOI:10.1038/s41419-018-0428-x. β-Actin is so abundant in cytoplasmic protein that it takes only a few seconds of exposure to reveal a strong band when performing Western blot on cytoplasmic extracts. However, actin is not detected when Western blots of nuclear extracts are exposed for minutes in many studies, including this one. Although, as mentioned, actin is also present in the nucleus, such a tiny amount may not be revealed by Western blot with exposure times of seconds. However, if the nuclear protein is contaminated with total cell lysate, actin is quite easy to detect. As a result, the use of actin as a negative control for nuclear protein is well accepted.

      Author response image 1.

      2) In addition, the authors mention IL-1b in the text but present IL-6 in Figure 2F. Please correct.

      We appreciate your attention to detail. “IL-1b” has been corrected to “IL-6”.


      The following is the authors’ response to the original reviews.

      I greatly appreciate the time you and the reviewers have taken to review my paper and provide detailed feedback and suggestions. I have carefully considered the reviewers’ comments and made thorough modifications to the paper. Below are my responses to each comment and the revisions I have made.

      Reviewer #1 (Recommendations for The Authors):

      Although the paper has strengths in understanding better the pathway of activation leading to polarization, the mechanisms contributing to cytokine storm are weak. In the context of cellular in vitro changes, it would be very interesting to map these molecular changes to strengthen the pathways affected in this model. In vivo, stronger evidence is required to bridge the gap between the in vitro model and mechanisms regulating in vivo disease development. Reporting of experiments needs to be considerably strengthened. Individual data points are shown; however, it is unclear whether these represent biological or technical replicates, or how many experiments have been undertaken. The addition of this information is essential for understanding the robustness and repeatability of findings. Currently, these cannot be assessed from the information provided. Furthermore, it is unclear whether the error bars represent s.e.m. or s.d., which greatly impacts data interpretation.

      Answer: Thank you for the valuable comments! We have added in vivo experiments to strengthen the bridge between the in vitro and in vivo models. 1) Macrophages were depleted by i.v. injection of clodronate-liposomes (CLL) in endotoxemic mice receiving leucine. The alleviation of LPS-induced cytokine production by leucine was muted by macrophage depletion (Figure 2E, F), suggesting that the anti-inflammatory effect of leucine is exerted via the regulation of macrophages. 2) The LXRα inhibitor GSK2033 was administered to mice via i.v. injection prior to LPS challenge. In GSK2033-treated mice, the effects of leucine on the serum levels of inflammatory cytokines were neutralized (Supplementary Figure 4), partially indicating the importance of LXRα in the regulation of cytokine release. We acknowledge the limitation of LXRα inhibition by GSK2033 in this study. In a future study, we plan to use monocyte-specific LXRα knockout mice generated with LysM-cre to elucidate the importance of LXRα in the progression of CSS, and to focus specifically on the molecular mechanism by which mTORC1 interacts with LXRα to modulate M2 macrophage polarization. Additionally, we modified the manuscript to clarify that the error bars represent the standard error of the mean (SEM) (line 416).

      Reviewer #2 (Recommendations for The Authors):

      1. The whole manuscript is based on the 2% leucine from feed and 5% leucine from water. Is there any rationale for using these two types of different concentrations in this study? Often, a dose-dependent treatment is utilized in vivo in pharmacological study. Therefore, the authors should at least test two different concentrations in each type to confirm the conclusion.

      Answer: Thank you for your comment and suggestion. The 2% leucine in feed and 5% leucine in water used in this study were based on the literature. In those studies, leucine was reported to activate mTORC1 and regulate metabolism at these concentrations, as shown below, although reports on leucine in the regulation of macrophage activation are lacking. In this study, we found that leucine supplementation by either route significantly increased the average body weight gain of mice, suggesting that leucine promoted growth and was not toxic to mice.

      (1) Jiang X, Zhang Y, Hu W, Liang Y, Zheng L, Zheng J, Wang B, Guo X. 2021. Different Effects of Leucine Supplementation and/or Exercise on Systemic Insulin Sensitivity in Mice. Front Endocrinol (Lausanne) 12:651303. doi:10.3389/fendo.2021.651303

      (2) Holler M, Grottke A, Mueck K, Manes J, Jücker M, Rodemann HP, Toulany M. 2016. Dual Targeting of Akt and mTORC1 Impairs Repair of DNA Double-Strand Breaks and Increases Radiation Sensitivity of Human Tumor Cells. PLoS One 11:e0154745. doi:10.1371/journal.pone.0154745

      2. The authors focus on macrophage polarization as the major cellular event affected by leucine treatment; however, they also report that the proportion of multiple immune cell types has been suppressed by leucine treatment. As some of these immune cells can also produce inflammatory cytokines, the authors should confirm that the anti-inflammatory effects of leucine were mainly mediated by modulating macrophage polarization, as they suggest in the manuscript. For example, the authors could utilize anti-CSF1 or clodronate to deplete macrophages and observe whether the leucine-mediated reduction in inflammatory cytokine production is largely diminished.

      Answer: Thank you for your valuable suggestion! We used clodronate-liposome (CLL) i.v. injection to deplete macrophages to further validate the specific contribution of macrophage polarization to the anti-inflammatory effects of leucine. The results revealed that clodronate treatment decreased blood monocyte counts and eliminated the effect of leucine in lowering the serum inflammatory factors IL-6, IFN-γ and TNF-α (Figure 2E-F), suggesting the importance of leucine-mediated macrophage activation for its anti-inflammatory effect.

      3. It would be important to examine whether 10 mM leucine exhibits cytotoxicity toward bone marrow-derived monocytes/macrophages. This would clarify whether leucine treatment directly suppresses inflammatory cytokine production or reduces cell viability and thereby indirectly modulates inflammatory responses.

      Answer: Thank you for your valuable suggestion! We performed cell viability assays after treating BMDMs with 2 mM and 10 mM leucine for 6h or 24h (consistent with the timing of leucine treatment in the article). The results showed that at 6h, 2 mM leucine significantly increased cell viability, while 10 mM leucine had no significant effect on cell viability. At 24h, both 2 mM and 10 mM leucine significantly increased cell viability. In conclusion, 2 mM and 10 mM leucine were not cytotoxic to BMDMs, and the anti-inflammatory effect of leucine was not derived from a reduction in cell viability (Supplementary Figure 2).

      4. The authors found that leucine promotes mTORC1-LXRα for arginase-1 transcription and M2 polarization. The pathway the authors elucidated is not surprising, as it has already been reported in other studies. What about the other M2 markers? The authors could examine whether arginase-1 deficiency would abolish the leucine-induced increase in the expression of other M2 marker genes. Moreover, what about the molecular mechanism for leucine-reduced M1 polarization?

      Answer: Thank you for the valuable comments! To clarify, Arginase-1 activity and mRNA expression of Fizz1, Mgl1, Mgl2, and Ym1 are well-established markers of M2 macrophages. Specifically, Arginase-1 activity is important for defining M2 functionality. These markers were used to define the level of M2 macrophage polarization. Only a few studies have indicated the involvement of mTORC1 in M2 polarization, as shown below; however, no molecular mechanism has been described for how mTORC1 modulates this process. In this study, we provide evidence that LXRα mediates mTORC1-associated M2 polarization, and that leucine regulates mTORC1-LXRα to promote M2 polarization independently of IL-4-induced STAT6 signaling. In our future studies, we will focus on the molecular mechanism by which mTORC1 interacts with LXRα to modulate M2 macrophage polarization.

      (1) Byles V, Covarrubias AJ, Ben-Sahra I, Lamming DW, Sabatini DM, Manning BD, Horng T. 2013. The TSC-mTOR pathway regulates macrophage polarization. Nat Commun 4:2834. doi:10.1038/ncomms3834

      (2) Kimura T, Nada S, Takegahara N, Okuno T, Nojima S, Kang S, Ito D, Morimoto K, Hosokawa T, Hayama Y, Mitsui Y, Sakurai N, Sarashina-Kida H, Nishide M, Maeda Y, Takamatsu H, Okuzaki D, Yamada M, Okada M, Kumanogoh A. 2016. Polarization of M2 macrophages requires Lamtor1 that integrates cytokine and amino-acid signals. Nat Commun 7:13130. doi:10.1038/ncomms13130

      5. In Fig. 1A, what's the P-value between these two groups? Moreover, what about the result with combination treatment as the authors performed in other panels?

      Answer: Thank you for the valuable comments! In Figure 1A, the P-value between the LPS and LPS+2% Leucine groups is 0.0031, and the P-value between the LPS and LPS+5% Leucine groups is 0.0009. I have marked the significance in Figure 1A accordingly. Due to the limited number of mice, we treated mice only with the two methods separately. Initially, we performed the survival experiment and observed that the addition of leucine prolonged the survival of mice at a lethal dose. Based on these findings, we further investigated whether a combination of the two methods would yield better results in the regulation of inflammation. The combination exhibited a similar effect on cytokine production, so we did not consider it necessary to repeat the survival experiment with the combination.

      6. It seems not much difference could be observed between 2% leucine from feed and 5% leucine from water in the expression of inflammatory genes and anti-inflammation-related markers. However, it seems that 5% leucine from water would exhibit a better survival rate than 2% leucine from feed. The authors should explain potential reasons and at least examine it in vitro.

      Answer: We appreciate the valuable comments from the reviewer! There are two possible reasons: 1) when a lethal dose of LPS was applied, mice were too weak to eat but still drank a small amount of water; 2) the absorption of leucine from the water was much easier than from the feed, so leucine from the water was much more effective in the short period of the survival experiment. On the other hand, cytokine levels and expression were measured in non-lethal experiments, in which mice were in much better condition for leucine absorption.

      7. In Fig. 4A, the authors examined the expression of p-mTOR. The authors should further examine the expression of p-AKT (S473, T308) and p-S6 to clarify whether mTORC1 or mTORC2 has been modulated. As reported, leucine should act on GATOR2 for mTORC1 activation. However, the authors reported that Torin, a mTORC1/mTORC2 inhibitor, inhibited M2 polarization more significantly compared to rapamycin, a mTORC1 inhibitor. These observations seem to indicate that leucine has other targets except mTORC1, such as mTORC2, which might raise novel mechanisms that have never been reported before.

      Answer: Thank you for the valuable comments! Akt-mTORC1 signaling integrates metabolic inputs to control macrophage activation. Inhibition of AKT by wortmannin was followed by inhibition of M2 polarization, suggesting that AKT signaling is involved in M2 polarization. Studies have reported that mTORC1 activation inhibits p-Akt (T308), and that inhibition of mTORC1 in turn activates Akt (1), promoting M2 polarization as feedback that compensates for the suppression of M2 polarization induced by mTORC1 inhibition. mTORC2 directly phosphorylates Akt at S473, and inhibition of mTORC2 inhibits p-Akt (S473) (2), further inhibiting M2 polarization. Torin1 inhibits both complexes, while rapamycin is specific for mTORC1 (3). The explanation is included in lines 252-262.

      (1) Leontieva OV, Demidenko ZN, Blagosklonny MV. 2014. Rapamycin reverses insulin resistance (IR) in high-glucose medium without causing IR in normoglycemic medium. Cell Death Dis 5:e1214. doi:10.1038/cddis.2014.178

      (2) Holler M, Grottke A, Mueck K, Manes J, Jücker M, Rodemann HP, Toulany M. 2016. Dual Targeting of Akt and mTORC1 Impairs Repair of DNA Double-Strand Breaks and Increases Radiation Sensitivity of Human Tumor Cells. PLoS One 11:e0154745. doi:10.1371/journal.pone.0154745

      (3) Byles V, Covarrubias AJ, Ben-Sahra I, Lamming DW, Sabatini DM, Manning BD, Horng T. 2013. The TSC-mTOR pathway regulates macrophage polarization. Nat Commun 4:2834. doi:10.1038/ncomms3834

      1. In Fig.5B, frankly speaking, I do not observe much difference in LXRα expression. Also, the actin band is too poor to get any conclusion.

      Answer: Thank you for the valuable comments. In Fig. 5B, the extracted protein is nuclear protein, as specified in the text. Actin is expressed in the cytoplasm, whereas histone is expressed in the nucleus. The near absence of actin in the figure therefore demonstrates the purity of the extracted nuclear protein.

      1. In Fig. 5C and 5D, it is amazing that GSK2033 would reduce urea production even largely greater than the basal condition (lane 1). As GSK2033 normalized IL-4 or IL-4 combination with Leucine raised urea production in cells, how GSK2033 could reduce urea in medium. The authors should explain this discrepancy.

      Answer: thank you for the valuable comments from the reviewer! In Fig. 5C, urea production was measured directly in the culture medium using a commercial assay kit, and GSK2033 indeed led to a significant decrease in urea production. In Fig. 5D, on the other hand, we assessed the activity of arginase-1 by lysing the cells, activating arginase-1, providing the substrate arginine, and then measuring urea production. In response to your question, the explanation is that in the assay measuring arginase-1 activity, we supplied a sufficient amount of substrate arginine, which may better reflect the enzyme’s activity and the results were consistent with our expectations. Additionally, when GSK2033 was used in combination with IL-4 or IL-4 plus leucine, it might interact with the IL-4 signaling pathway or leucine metabolism pathway, leading to an increase in urea production. This is just our preliminary explanation for the contradictory results, and we acknowledge that further research is needed to explore the mechanism of action of GSK2033 and its interactions with IL-4 or leucine.

      1. Line 98, "INF-gamma" should be IFN-gamma.

      Answer: We appreciate your attention to detail. We apologize for the error in line 98, where “INF-gamma” should indeed be corrected to “IFN-gamma (IFN-γ).” We will make the necessary correction in the revised version of the manuscript.

    1. Author response:

      The following is the authors’ response to the current reviews.

      We thank you for sending our manuscript for the second round of review.  We are encouraged by the comments from reviewer #2 that our supplementary work on naïve T cells and antibody blockade work satisfied their previous concerns and is important for our work.

      The Editors raised concerns that we have shared preliminary data on Nrn1 and AMPAR double knockout mice.  We apologize for our enthusiasm for these studies.  Because of the publication model by eLife, we shared that data not because we needed to persuade the reviewer for publication purposes but rather to agree with the reviewer that the molecular target of Nrn1 is important, and we are progressing in understanding this subject.


      The following is the authors’ response to the original reviews.

      To Reviewer #1:

      Thank you for your thorough review and comments on our work, which you described as “the role of neuritin in T cell biology studied here is new and interesting.”  We have grouped your comments into two categories: (i) biology and investigation approach, and (ii) experimental rigor and data presentation.

      Biology and Investigation approach comments:

      (1) Questions regarding the T cell anergy model:

      Major point “(4) Figure 1E-H. The authors assume that this immunization protocol induces anergic cells, but they provide no experimental evidence for this. It would be useful to show that T cells are indeed anergic in this model, especially those that are OVA-specific. The lack of IL-2 production by Cltr cells could be explained by the presence of fewer OVA-specific cells, rather than by an anergic status.”

      T cell anergy is a well-established concept first described by Schwartz’s group. It refers to the hyporesponsive functional state of antigen-experienced CD4 T cells (Chappert and Schwartz, 2010; Fathman and Lineberry, 2007; Jenkins and Schwartz, 1987; Quill and Schwartz, 1987).  Anergic T cells are characterized by their inability to expand and to produce IL2 upon subsequent antigen re-challenge. In this paper, we adopted the existing in vivo T cell anergy induction model used by Mueller’s group (Vanasek et al., 2006).  Specifically, Thy1.1+ Ctrl or Nrn1-/- TCR transgenic OTII cells were co-transferred with congenically marked Thy1.2+ WT polyclonal Treg cells into TCR-/- mice.  After anergy induction, the congenically marked TCR transgenic T cells were recovered by sorting on the Thy1.1+ congenic marker and subsequently restimulated ex vivo with OVA323-339 peptide. We evaluated the T cell anergic state based on OTII cell expansion in vivo and IL2 production upon OVA323-339 restimulation ex vivo.

      “The authors assume that this immunization protocol induces anergic cells, but they provide no experimental evidence for this.”

      Because the anergy model by Mueller's group is well established (Vanasek et al., 2006), we did not feel that additional effort was required to validate this model as the reviewer suggested. Moreover, the limited IL2 production among the control cells upon restimulation confirms the validity of this model.

      “The lack of IL-2 production by Cltr cells could be explained by the presence of fewer OVA-specific cells, rather than by an anergic status”.

      Cells from Ctrl and Nrn1-/- mice on a homogeneous TCR transgenic (OTII) background were used in these experiments. It is therefore unlikely that substantial variability in TCR expression, or different expression levels of the transgenic TCR, rather than anergy induction, accounts for the difference in IL2 production.

      Overall, we used this in vivo anergy model to evaluate the Nrn1-/- T cell functional state in comparison to Ctrl cells under the anergy induction condition following the evaluation of Nrn1 expression, particularly in anergic T cells.  Through studies using this anergy model, we observed a significant change in Treg induction among OTII cells. We decided to pursue the role of Nrn1 in Treg cell development and function rather than the biology of T cell anergy as evidenced by subsequent experiments.

      Minor points “(6) On which markers are anergic cells sorted for RNAseq analysis?”

      Cells were sorted out based on their congenic marker marking Ctrl or Nrn1-/- OTII cells transferred into the host mice.  We did not specifically isolate anergic cells for sequencing.

      (2) Question regarding the validity of iTreg differentiation model.

      Major point: “(5) Figure 2A-C and Figure 3. The use of iTregs to try to understand what is happening in vivo is problematic. iTregs are cells that have probably no equivalent in vivo, and so may have no physiological relevance. In any case, they are different from pTreg cells generated in vivo. Working with pTreg may be challenging, that is why I would suggest generating data with purified nTreg. Moreover, it was shown in the article of Gonzalez-Figueroa 2021 that Nrn1-/- nTreg retained a normal suppressive function, which would not be what is concluded by the authors of this manuscript. Moreover, we do not even know what the % of Foxp3 cells is in the iTreg used (after differentiation and 20h of re-stimulation) and whether this % is the same between Ctlr and Nrn1 KO cells.”.

      We thank Reviewer #1 for their feedback. While it is true that iTregs made in vitro and pTregs generated in vivo display several distinctions (e.g., differences in Foxp3 expression stability), we strongly disagree with this statement by Reviewer #1: “The use of iTregs to try to understand what is happening in vivo is problematic. iTregs are cells that have probably no equivalent in vivo, and so may have no physiological relevance.”  The induced Treg cell (iTreg) model was established over 20 years ago (Chen et al., 2003; Zheng et al., 2002), and the model is widely adopted with over 2000 citations. Further, it has been instrumental in understanding different aspects of regulatory T cell biology (Hurrell et al., 2022; John et al., 2022; Schmitt and Williams, 2013; Sugiura et al., 2022).

      Because we observed reduced pTreg generation in vivo, we chose to use the in vitro iTreg model system to understand the mechanistic changes involved in Treg cell differentiation and function, specifically, neuritin’s role in this process. We have made no claim that iTreg cell biology is identical to that of pTreg cells generated in vivo or nTreg cells. However, the iTreg culture system has proved to be a good in vitro system for deciphering molecular events involved in complex processes. As such, it remains a commonly used approach by many research groups in the Treg cell field (Hurrell et al., 2022; John et al., 2022; Sugiura et al., 2022). Moreover, applying the iTreg in vitro culture system has been instrumental in helping us identify the cell electrical state change in Nrn1-/- CD4 cells and revealed the biological link between Nrn1 and the ionotropic AMPA receptor (AMPAR), which we discuss below. It is technically challenging to use nTreg cells for T cell electrical state studies because of their heterogeneous nature, which arises from development in an in vivo environment, and because of the effect of manipulation during the nTreg cell isolation process; both can affect the T cell electrical state.

      “Moreover, it was shown in the article of Gonzalez-Figueroa 2021 that Nrn1-/- nTreg retained a normal suppressive function, which would not be what is concluded by the authors of this manuscript.” 

      We have also carried out nTreg studies in vitro in addition to iTreg cells. Similar to Gonzalez-Figueroa et al.'s findings, we did not observe differences in suppression function between Nrn1-/- and WT nTreg using the in vitro suppression assay. However, Nrn1-/- nTreg cells revealed reduced suppression function in vivo (Fig. 2D-L). In fact, Gonzalez-Figueroa et al. observed reduced plasma cell formation after OVA immunization in Treg-specific Nrn1-/- mice, implicating reduced suppression from Nrn1-/- follicular regulatory T (Tfr) cells. Thus, our observation of the reduced suppression function of Nrn1-/- nTreg toward effector T cell expansion, as presented in Fig. 2D-L, does not contradict the results from Gonzalez-Figueroa et al. Rather, the conclusions of these two studies agree that Nrn1 can play important roles in immune suppression observable in vivo that are not captured readily by the in vitro suppression assay.

      “Moreover, we do not even know what the % of Foxp3 cells is in the iTreg used (after differentiation and 20h of re-stimulation) and whether this % is the same between Ctlr and Nrn1 KO cells.”

      We have stated in the manuscript on page 7 line 208 that “Similar proportions of Foxp3+ cells were observed in Nrn1-/- and Ctrl cells under the iTreg culture condition, suggesting that Nrn1 deficiency does not significantly impact Foxp3+ cell differentiation”. In the revised manuscript, we will include the data on the proportion of Foxp3+ cells before iTreg restimulation.

      (3) Confirmation of transcriptomic data regarding amino acids or electrolytes transport change

      Minor point “(3) Would not it be possible to perform experiments showing the ability of cells to transport amino acids or electrolytes across the plasma membrane? This would be a more interesting demonstration than transcriptomic data.”

      We appreciate Reviewer #1’s suggestion regarding “perform experiments showing the ability of cells to transport amino acids or electrolytes across the plasma membrane”.  We have indeed already performed such experiments, which corroborate the transcriptomics data on differential amino acid and nutrient transporter expression. Specifically, we loaded either iTreg or Th0 cells with membrane potential (MP) dye and measured the change in MP level after adding the complete set of amino acids (complete AA).  Upon entry, the charge carried by AAs may transiently affect cell membrane potential. Different AA transporter expression patterns may produce different patterns of MP change upon AA entry, as shown in Author response image 1. We observed reduced MP change in Nrn1-/- iTreg cells compared with the Ctrl, whereas in Th0 cells, Nrn1-/- cells showed enhanced MP change compared with the Ctrl. We can certainly include these data in the revised manuscript.

      Author response image 1.

      Membrane potential change induced by amino acid entry. a. Nrn1-/- or WT iTreg cells were loaded with MP dye, and MP change was measured upon the addition of a complete set of AAs. b. Nrn1-/- or WT Th0 cells were loaded with MP dye, and MP change was measured upon the addition of a complete set of AAs.

      (4) EAE experiment data assessment

      Minor point “(5) Figure 5F. How are cells re-stimulated? If polyclonal stimulation is used, the experiment is not interesting because the analysis is done with lymph node cells. This analysis should either be performed with cells from the CNS or with MOG restimulation with lymph node cells.”

      In the EAE study, the Nrn1-/- mice exhibit similar disease onset but a protracted, non-resolving disease phenotype compared to the WT control mice.  Several factors may contribute to this phenotype: 1. enhanced T effector cell infiltration/persistence in the central nervous system (CNS); 2. reduced Treg cell-mediated suppression of T effector cells in the CNS; 3. protracted, non-resolving inflammation at the immunization site, which has the potential to continue sending T effector cells into the CNS and so contribute to persistent inflammation. Based on this reasoning, we examined the infiltrating T effector cell number and Treg cell proportion in the CNS.  We also restimulated cells from draining lymph nodes close to the inflammation site, looking for evidence of persistent inflammation.  When mice were harvested around day 16 after immunization, the inflammation at the local draining lymph node should be at the contraction stage.  We stimulated cells with PMA and ionomycin in order to observe all potential T effector cells in the draining lymph node rather than only MOG antigen-specific cells.  We disagree with Reviewer #1’s assumption that “This analysis should either be performed with cells from the CNS or with MOG restimulation with lymph node cells.” We think the experimental approach we have taken is appropriately tailored to the biological questions we intended to answer.

      Experimental rigor and data presentation.

      (1) data labeling and additional supporting data

      Major points

      (2) The authors use Nrn1+/+ and Nrn1+/- cells indiscriminately as control cells on the basis of similar biology between Nrn1+/+ and Nrn1+/- cells at homeostasis. However, it is quite possible that the Nrn1+/- cells have a phenotype in situations of in vitro activation or in vivo inflammation (cancer, EAE). It would be important to discriminate Nrn1+/- and Nrn1+/+ cells in the data or to show that both cell types have the same phenotype in these conditions too.

      (3) Figure 1A-D. Since the authors are using the Nrp1 KO mice, it would be important to confirm the specificity of the anti-Nrn1 mAb by FACS. Once verified, it would be important to add FACS results with this mAb in Figures 1A-C to have single-cell and quantitative data as well.

      Minor points  

      (1) Line 119, 120 of the text. It is said that one of the most up-regulated genes in anergic cells is Nrn1 but the data is not shown.

      (2) For all figures showing %, the titles of the Y axes are written in an odd way. For example, it is written "Foxp3% CD4". It would be more conventional and clearer to write "% Foxp3+ / CD4+" or "% Foxp3+ among CD4+".

      (4) For certain staining (Figure 3E, H) it would be important to show the raw data, in addition to MFI or % values.

      We can adapt the labeling and provide additional data, including Nrn1 staining on Treg cells and flow graphs for pmTOR and pS6 staining (Fig. 3H), as requested by Reviewer #1.

      (2) Experimental rigor:

      General comments:

      “However, it is disappointing that reading this manuscript leaves an impression of incomplete work done too quickly.”

      We were discouraged to receive the comment, “this manuscript leaves an impression of incomplete work done too quickly.” Our study of this novel molecule began without any existing biological tools such as antibodies, knockout mice, etc.  Over the past several years, we have established our own antibodies for Nrn1 detection, obtained and characterized Nrn1 knockout mice, and utilized multiple approaches to identify the molecular mechanism of Nrn1 function. Through the use of the in vitro iTreg system described in this manuscript, we identified the association of Nrn1 deficiency with cell electrical state change, potentially connected to AMPAR function. We have further corroborated our findings by generating Nrn1 and AMPAR T cell specific double knockout mice and confirmed that T cell specific AMPAR deletion could abrogate the phenotype caused by the Nrn1 deficiency (see Support Figure 2).  We did not include the double knockout data in the current manuscript because AMPAR function has not yet been studied thoroughly in T cell biology, and we feel this topic warrants examination in its own right.  However, the unpublished data support the finding that Nrn1 modulates the T cell electrical state and, consequently, metabolism, ultimately influencing tolerance and immunity.  In its current form, the manuscript represents the first characterization of the novel molecule Nrn1 in anergic cells, Tregs, and effector T cells. While this work has led to several exciting additional questions, we disagree that the novel characterization we have presented is incomplete. We feel that our present data set, which squarely highlights Nrn1’s role as an important immune regulator while shedding unprecedented light on the molecular events involved, will be of considerable interest to a broad field of researchers.

      “Multiple models have been used, but none has been studied thoroughly enough to provide really conclusive and unambiguous data. For example, 5 different models were used to study T cells in vivo. It would have been preferable to use fewer, but to go further in the study of mechanisms.”

      We have indeed used multiple in vivo models to reveal Nrn1's function in Treg differentiation, Treg suppression function, T effector cell differentiation and function, and the overall impact on autoimmune disease. Because the impact of ion channel function is often context-dependent, we examined the biological outcome of Nrn1 deficiency in several in vivo contexts.  We would appreciate it if Reviewer#1 would provide a specific example, given the Nrn1 phenotype, of how to proceed deeper to investigate the electrical change in the in vivo models.

      “Major points

      (1) A real weakness of this work is the fact that in most of the results shown, there are few biological replicates with differences that are often small between Ctrl and Nrn1 -/-. The systematic use of student's t-test may lead to thinking that the differences are significant, which is often misleading given the small number of samples, which makes it impossible to know whether the distributions are Gaussian and whether a parametric test can be used. RNAseq bulk data are based on biological duplicates, which is open to criticism.”

      We respectfully disagree with Reviewer #1 on the question of statistical power and significance in our work. We have used 5-8 mice/group for each in vivo model and 3-4 technical replicates for the in vitro studies, with a minimum of 2-3 replicate experiments. These group sizes and replication numbers are in line with those seen in high-impact publications. While some differences between Ctrl and Nrn1-/- appear small, they have significant biological consequences, as evidenced by the various Nrn1-/- in vivo phenotypes. Furthermore, we believe we have subjected our data to the appropriate statistical tests to ensure rigorous analysis and representation of our findings.

      To Reviewer #2.

      We thank Reviewer #2 for the careful review of the manuscript. We especially appreciate the comments that “The characterizations of T cell Nrn1 expression both in vitro and in vivo are comprehensive and convincing. The in vivo functional studies of anergy development, Treg suppression, and EAE development are also well done to strengthen the notion that Nrn1 is an important regulator of CD4 responsiveness.”

      “The major weakness of this study stems from a lack of a clear molecular mechanism involving Nrn1.”

      We fully understand this comment from Reviewer #2. The main mechanism we identified contributing to the functional defect of Nrn1-/- T cells involves novel effects on the electric and metabolic state of the cells. Although we referenced neuronal studies that indicate Nrn1 is the auxiliary protein for the ionotropic AMPA-type glutamate receptor (AMPAR) and may affect AMPAR function, we did not provide any evidence in this manuscript as the topic requires further in-depth study.   

      For the benefit of this discussion, we include our preliminary Nrn1 and AMPAR double knockout data (Author response image 2), which indicates that abrogating AMPAR expression can compensate for the defect caused by Nrn1 deficiency in vitro and in vivo. This preliminary data supports the notion that Nrn1 modulates AMPAR function, which causes changes in T cell electric and metabolic state, influencing T cell differentiation and function.  

      Author response image 2.

      Deletion of AMPAR expression in T cells compensates for the defect caused by Nrn1 deficiency. Nrn1-/- mice were crossed with T cell-specific AMPAR knockout mice (AMPARfl/flCD4Cre+). The following mice were generated and used in the experiment: T cell specific AMPAR-knockout and Nrn1 knockout mice (AKONKO), Nrn1 knockout mice (AWTNKO), Ctrl mice (AWTNWT). a. Deletion of AMPAR compensates for the iTreg cell defect observed in Nrn1-/- CD4 cells. iTreg live cell proportion, cell number, and Ki67 expression among Foxp3+ cells 3 days after aCD3 restimulation. b. Deletion of AMPAR in T cells abrogates the enhanced autoimmune response in Nrn1-/- mice in the EAE disease model. Mouse relative weight change and disease score progression after EAE disease induction.

      Ion channels can influence cell metabolism through multiple means (Vaeth and Feske, 2018; Wang et al., 2020). First, ion channels are involved in maintaining cell resting membrane potential. This electrical potential difference across the cell membrane is essential for various cellular processes, including metabolism (Abdul Kadir et al., 2018; Blackiston et al., 2009; Nagy et al., 2018; Yu et al., 2022). Second, ion channels facilitate the movement of ions across cell membranes. These ions are essential for various metabolic processes. For example, ions like calcium (Ca2+), potassium (K+), and sodium (Na+) play crucial roles in signaling pathways that regulate metabolism (Kahlfuss et al., 2020). Third, ion channel activity can influence cellular energy balance due to ATP consumption associated with ion transport to maintain ion balances (Erecińska and Dagani, 1990; Gerkau et al., 2019). This, in turn, can impact processes like ATP production, which is central to cellular metabolism. Thus, ion channel expression and function determine the cell’s bioelectric state and contribute to cell metabolism (Levin, 2021).

      Because the AMPAR function has not been thoroughly studied using a genetic approach in T cells, we do not intend to include the double knockout data in this manuscript before fully characterizing the T cell-specific AMPAR knockout mice.  

      “Although the biochemical and informatics studies are well-performed, it is my opinion that these results are inconclusive in part due to the absence of key "naive" control groups. This limits my ability to understand the significance of these data.

      Specifically, studies of the electrical and metabolic state of Nrn1-/- inducible Treg cells (iTregs) would benefit from similar data collected from wild-type and Nrn1-/- naive CD4 T cells.”

      We appreciate the reviewer’s comments. This comment reflects two concerns in data interpretation:

      (1) Are Nrn1-/- naïve T cells fundamentally different from WT cells? Does this fundamental difference contribute to the observed electrical and metabolic phenotype in iTreg or Th0 cells? This is a very good question, and we will perform the experiments the reviewer suggested. While Nrn1 is expressed at a basal (low) level in naïve T cells, deletion of Nrn1 may cause changes in naïve T cell phenotype.

      (2) Is the Nrn1-/- phenotype caused by Nrn1 functional deficiency or due to the secondary effect of Nrn1 deletion, such as non-physiological cell membrane structure changes?

      We have done the following experiment to address this concern.  We have cultured WT T cells in the presence of Nrn1 antibody and compared the outcome with Nrn1-/- iTreg cells (Figure 3-figure supplement 2D,E,F). WT iTreg cells under antibody blockade exhibited similar changes as Nrn1-/- iTreg cells, confirming the physiological relevance of the Nrn1-/- phenotype.

      Manuscript Revision based on the Reviewer’s suggestions:

      Reviewer #1:

      Major points (3) Figure 1A-D. Since the authors are using the Nrp1 KO mice, it would be important to confirm the specificity of the anti-Nrn1 mAb by FACS. 

      Following the suggestion by Reviewer #1, we have included the Nrn1 Ab staining on activated Nrn1-/- CD4 cells in Figure 1D. We have also added the staining of cell surface Nrn1 on Treg cells in Figure 1-figure supplement 1D.

      Major point: (5) “Moreover, we do not even know what the % of Foxp3 cells is in the iTreg used (after differentiation and 20h of re-stimulation) and whether this % is the same between Ctlr and Nrn1 KO cells.”

      In the revised manuscript, we have included the proportion of Foxp3+ cells among Nrn1-/- and ctrl iTreg cells developed under the iTreg culture condition in Figure 2A.

      Minor points  

      (2) For all figures showing %, the titles of the Y axes are written in an odd way. For example, it is written "Foxp3% CD4". It would be more conventional and clearer to write "% Foxp3+ / CD4+" or "% Foxp3+ among CD4+".

      Following reviewer#1’s suggestion, we have changed the Y-axis label in all the relevant figures.

      (3) Would not it be possible to perform experiments showing the ability of cells to transport amino acids or electrolytes across the plasma membrane? This would be a more interesting demonstration than transcriptomic data.”

      We appreciate Reviewer #1’s suggestion regarding “perform experiments showing the ability of cells to transport amino acids or electrolytes across the plasma membrane”.  We have used AA-induced cellular MP changes to confirm differential AA transporter expression patterns and their impact on cellular MP levels.  The data are included in the revised manuscript in Figure 3H and Figure 4K.

      (4) For certain staining (Figure 3E, H) it would be important to show the raw data, in addition to MFI or % values.

      We appreciated Reviewer #1’s suggestion and have included the histogram staining data for Figure 3E. We have moved the original Figure 3H to the supplemental figure and included the histogram staining data in Figure 3-figure supplement 1C.  Similarly, we have included the histogram staining data in Figure 4-figure supplement 1C.

      Reviewer#2:

      “Although the biochemical and informatics studies are well-performed, it is my opinion that these results are inconclusive in part due to the absence of key "naive" control groups. This limits my ability to understand the significance of these data.

      Specifically, studies of the electrical and metabolic state of Nrn1-/- inducible Treg cells (iTregs) would benefit from similar data collected from wild-type and Nrn1-/- naive CD4 T cells.”

      We greatly appreciate Reviewer #2’s suggestion and have carried out experiments on naïve CD4 cells derived from Nrn1-/- and WT mice. We compared the membrane potential and AA-induced MP change of Nrn1-/- and WT naïve T cells, and assessed the metabolic state of Nrn1-/- and WT naïve T cells by carrying out glucose stress tests and mitochondria stress tests using a Seahorse assay.  Moreover, to investigate whether the phenotype revealed in Nrn1-/- CD4 cells was caused by a secondary effect of cell membrane structure change due to Nrn1 deletion, we carried out Nrn1 antibody blockade in WT CD4 cells and investigated the phenotypic change. These new results are included in Figure 3-figure supplement 2.

      References:

      Abdul Kadir, L., M. Stacey, and R. Barrett-Jolley. 2018. Emerging Roles of the Membrane Potential: Action Beyond the Action Potential. Front Physiol 9:1661.

      Blackiston, D.J., K.A. McLaughlin, and M. Levin. 2009. Bioelectric controls of cell proliferation: ion channels, membrane voltage and the cell cycle. Cell Cycle 8:3527-3536.

      Chappert, P., and R.H. Schwartz. 2010. Induction of T cell anergy: integration of environmental cues and infectious tolerance. Current opinion in immunology 22:552-559.

      Chen, W., W. Jin, N. Hardegen, K.J. Lei, L. Li, N. Marinos, G. McGrady, and S.M. Wahl. 2003. Conversion of peripheral CD4+CD25- naive T cells to CD4+CD25+ regulatory T cells by TGF-beta induction of transcription factor Foxp3. The Journal of experimental medicine 198:1875-1886.

      Erecińska, M., and F. Dagani. 1990. Relationships between the neuronal sodium/potassium pump and energy metabolism. Effects of K+, Na+, and adenosine triphosphate in isolated brain synaptosomes. J Gen Physiol 95:591-616.

      Fathman, C.G., and N.B. Lineberry. 2007. Molecular mechanisms of CD4+ T-cell anergy. Nat Rev Immunol 7:599-609.

      Gerkau, N.J., R. Lerchundi, J.S.E. Nelson, M. Lantermann, J. Meyer, J. Hirrlinger, and C.R. Rose. 2019. Relation between activity-induced intracellular sodium transients and ATP dynamics in mouse hippocampal neurons. The Journal of physiology 597:5687-5705.

      Hurrell, B.P., D.G. Helou, E. Howard, J.D. Painter, P. Shafiei-Jahani, A.H. Sharpe, and O. Akbari. 2022. PD-L2 controls peripherally induced regulatory T cells by maintaining metabolic activity and Foxp3 stability. Nature communications 13:5118.

      Jenkins, M.K., and R.H. Schwartz. 1987. Antigen presentation by chemically modified splenocytes induces antigen-specific T cell unresponsiveness in vitro and in vivo. The Journal of experimental medicine 165:302-319.

      John, P., M.C. Pulanco, P.M. Galbo, Jr., Y. Wei, K.C. Ohaegbulam, D. Zheng, and X. Zang. 2022. The immune checkpoint B7x expands tumor-infiltrating Tregs and promotes resistance to anti-CTLA-4 therapy. Nature communications 13:2506.

      Kahlfuss, S., U. Kaufmann, A.R. Concepcion, L. Noyer, D. Raphael, M. Vaeth, J. Yang, P. Pancholi, M. Maus, J. Muller, L. Kozhaya, A. Khodadadi-Jamayran, Z. Sun, P. Shaw, D. Unutmaz, P.B. Stathopulos, C. Feist, S.B. Cameron, S.E. Turvey, and S. Feske. 2020. STIM1-mediated calcium influx controls antifungal immunity and the metabolic function of nonpathogenic Th17 cells. EMBO molecular medicine 12:e11592.

      Levin, M. 2021. Bioelectric signaling: Reprogrammable circuits underlying embryogenesis, regeneration, and cancer. Cell 184:1971-1989.

      Nagy, E., G. Mocsar, V. Sebestyen, J. Volko, F. Papp, K. Toth, S. Damjanovich, G. Panyi, T.A. Waldmann, A. Bodnar, and G. Vamosi. 2018. Membrane Potential Distinctly Modulates Mobility and Signaling of IL-2 and IL-15 Receptors in T Cells. Biophys J 114:2473-2482.

      Quill, H., and R.H. Schwartz. 1987. Stimulation of normal inducer T cell clones with antigen presented by purified Ia molecules in planar lipid membranes: specific induction of a long-lived state of proliferative nonresponsiveness. Journal of immunology (Baltimore, Md. : 1950) 138:3704-3712.

      Schmitt, E.G., and C.B. Williams. 2013. Generation and function of induced regulatory T cells. Frontiers in immunology 4:152.

      Sugiura, A., G. Andrejeva, K. Voss, D.R. Heintzman, X. Xu, M.Z. Madden, X. Ye, K.L. Beier, N.U. Chowdhury, M.M. Wolf, A.C. Young, D.L. Greenwood, A.E. Sewell, S.K. Shahi, S.N. Freedman, A.M. Cameron, P. Foerch, T. Bourne, J.C. Garcia-Canaveras, J. Karijolich, D.C. Newcomb, A.K. Mangalam, J.D. Rabinowitz, and J.C. Rathmell. 2022. MTHFD2 is a metabolic checkpoint controlling effector and regulatory T cell fate and function. Immunity 55:65-81.e69.

      Vaeth, M., and S. Feske. 2018. Ion channelopathies of the immune system. Current opinion in immunology 52:39-50.

      Vanasek, T.L., S.L. Nandiwada, M.K. Jenkins, and D.L. Mueller. 2006. CD25+Foxp3+ regulatory T cells facilitate CD4+ T cell clonal anergy induction during the recovery from lymphopenia. Journal of immunology (Baltimore, Md. : 1950) 176:5880-5889.

      Wang, Y., A. Tao, M. Vaeth, and S. Feske. 2020. Calcium regulation of T cell metabolism. Current opinion in physiology 17:207-223.

      Yu, W., Z. Wang, X. Yu, Y. Zhao, Z. Xie, K. Zhang, Z. Chi, S. Chen, T. Xu, D. Jiang, X. Guo, M. Li, J. Zhang, H. Fang, D. Yang, Y. Guo, X. Yang, X. Zhang, Y. Wu, W. Yang, and D. Wang. 2022. Kir2.1-mediated membrane potential promotes nutrient acquisition and inflammation through regulation of nutrient transporters. Nature communications 13:3544.

      Zheng, S.G., J.D. Gray, K. Ohtsuka, S. Yamagiwa, and D.A. Horwitz. 2002. Generation ex vivo of TGF-beta-producing regulatory T cells from CD4+CD25- precursors. Journal of immunology (Baltimore, Md. : 1950) 169:4183-4189.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Koumoundourou et al. identify a pathway downstream of Bcl11b that controls synapse morphology and plasticity of hippocampal mossy fiber synapses. Using an elegant combination of in vivo, ex vivo, and in vitro approaches, the authors build on their previous work that indicated C1ql2 as a functional target of Bcl11b (De Bruyckere et al., 2018). Here, they examine the functional implications of C1ql2 at MF synapses in Bcl11b cKO mice and following C1ql2 shRNA. The authors find that Bcl11b KO and shRNA against C1ql2 significantly reduce the recruitment of synaptic vesicles and impair LTP at MF synapses. Importantly, the authors test a role for the previously identified C1ql2 binding partner, exon 25b-containing Nrxn3 (Matsuda et al., 2016), as relevant at MF synapses to maintain synaptic vesicle recruitment. To test this, the authors developed a K262E C1ql2 mutant that disrupts binding to Nrxn3. Curiously, while Bcl11b KO and C1ql2 KD largely phenocopy (reduced vesicle recruitment and impaired LTP), only vesicle recruitment is dependent on C1ql2-Nrxn3 interactions. These findings provide new insight into the functional role of C1ql2 at MF synapses. While the authors convincingly demonstrate a role for C1ql2-Nrxn3(25b+) interaction for vesicle recruitment and a Nrxn3(25b+)-independent role for C1ql2 in LTP, the underlying mechanisms remain inconclusive. Additionally, a discussion of how these findings relate to previous work on C1ql2 at mossy fiber synapses and how the findings contribute to the biology of Nrxn3 would increase the interpretability of this work.

      As suggested by reviewer #1, we extended our discussion of previous work on C1ql2 and additionally discussed the biology of Nrxn3 and how our work relates to it. Moreover, we extended our mechanistic analysis of how Bcl11b/C1ql2/Nrxn3 pathway controls synaptic vesicle recruitment as well as LTP (please see also response to reviewer #2 points 5 and 8 and reviewer #3 point 4 of public reviews below for detailed discussion).

      Reviewer #2 (Public Review):

      This manuscript describes experiments that further investigate the actions of the transcription factor Bcl11b in regulating mossy fiber (MF) synapses in the hippocampus. Prior work from the same group had demonstrated that loss of Bcl11b results in loss of MF synapses as well as a decrease in LTP. Here the authors focus on a target of Bcl11b, the secreted synaptic organizer C1ql2, which is almost completely lost in Bcl11b KO. Viral reintroduction of C1ql2 rescues the synaptic phenotypes, whereas direct KD of C1ql2 recapitulates the Bcl11b phenotype. C1ql2 itself interacts directly with Nrxn3, and replacement with a binding-deficient mutant C1ql2 was not able to rescue the Bcl11b KO phenotype. Overall there are some interesting observations in the study; however, there are also some concerns about the measures and interpretation of data.

      The authors state that they used a differential transcriptomic analysis to screen for candidate targets of Bcl11b, yet they do not present any details of this screen. This should be included and at the very least a table of all DE genes included. It is likely that many other genes are also regulated by Bcl11b so it would be important to the reader to see the rationale for focusing attention on C1ql2 in this study.

      The transcriptome analysis mentioned in our manuscript was published in detail in our previous study (De Bruyckere et al., 2018), including chromatin-immunoprecipitation that revealed C1ql2 as a direct transcriptional target of Bcl11b. Upon revision of the manuscript, we made sure that this was clearly stated within the main text module to avoid future confusion. In the same publication (De Bruyckere et al., 2018), we discuss in detail several identified candidate genes such as Sema5b, Ptgs2, Pdyn and Penk as putative effectors of Bcl11b in the structural and functional integrity of MFS. C1ql2 has been previously demonstrated to be almost exclusively expressed in DG neurons and localized to the MFS.

      There it bridges the pre- and post-synaptic sides through interaction with Nrxn3 and KAR subunits, respectively, and regulates synaptic function (Matsuda et al., 2016). Taken together, C1ql2 was a very good candidate to study as a potential effector downstream of Bcl11b in the maintenance of MFS structure and function. However, as our data reveal, not all Bcl11b mutant phenotypes were rescued by C1ql2 (see supplementary figures 2d-f of revised manuscript). We expect additional candidate genes, identified in our transcriptomic screen, to act downstream of Bcl11b in the control of MFS.

      All viral-mediated expression uses AAVs which are known to ablate neurogenesis in the DG (Johnston DOI: 10.7554/eLife.59291) through the ITR regions and leads to hyperexcitability of the dentate. While it is not clear how this would impact the measurements the authors make in MF-CA3 synapses, this should be acknowledged as a potential caveat in this study.

      We agree with reviewer #2 and are aware that it has been demonstrated that AAV-mediated gene expression ablates neurogenesis in the DG. To avoid potential interference of the AAVs with the interpretability of our phenotypes, we made sure during the design of the study that all of our control groups were treated in the same way as our groups of interest, and were, thus, injected with control AAVs. Moreover, the observed phenotypes were first described in Bcl11b mutants that were not injected with AAVs (De Bruyckere et al., 2018). Finally, we thoroughly examined the individual components of the proposed mechanism (rescue of C1ql2 expression, over-expression of C1ql3 and introduction of mutant C1ql2 in Bcl11b cKOs, KD of C1ql2 in WT mice, and Nrxn123 cKO) and reached similar conclusions. Together, this strongly supports that the observed phenotypes occur as a result of the physiological function of the proteins involved in the described mechanism and not due to interference of the AAVs with these biological processes. We have now addressed this point in the main text module of the revised manuscript.

      The authors claim that the viral re-introduction "restored C1ql2 protein expression to control levels". This is misleading given that the mean of the data is 2.5x the control (Figure 1d and also see Figure 6c). The low n and large variance are a problem for these data. Moreover, they are marked ns but the authors should report p values for these. At the least, this likely large overexpression and variability should be acknowledged. In addition, the use of clipped bands on Western blots should be avoided. Please show the complete protein gel in primary figures or supplemental information.

      We agree with reviewer #2 that C1ql2 expression after its re-introduction in Bcl11b cKO mice was higher compared to controls and that this should be taken into consideration for proper interpretation of the data. To address this, based also on the suggestion of reviewer #3 point 1 below, we overexpressed C1ql2 in DG neurons of control animals. We found no changes in synaptic vesicle organization upon C1ql2 over-expression compared to controls. This further supports that the observed effect upon rescue of C1ql2 expression in Bcl11b cKOs is due to the physiological function of C1ql2 and not a result of the overexpression. These data are included in supplementary figure 2g-j and are described in detail in the results part of the revised manuscript.

      Additionally, we looked at the effects of C1ql2 overexpression in Bcl11b cKO DGN on basal synaptic transmission. We plotted fEPSP slopes versus fiber volley amplitudes, measured in slices from rescue animals, as we had previously done for the control and Bcl11b cKO (Author response image 1a). Although regression analysis revealed a trend towards steeper slopes in the rescue mice (Author response image 1a and b), the observation did not prove to be statistically significant, indicating that C1ql2 overexpression in Bcl11b cKO animals does not strongly alter basal synaptic transmission at MFS. Overall, our previous and new findings support that the observed effects of the C1ql2 rescue are not caused by the artificially elevated levels of C1ql2, as compared to controls, but are rather a result of the physiological function of C1ql2.

      Following the suggestion of reviewer #2 all western blot clipped bands were exchanged for images of the full blot. This includes figures 1c, 4c, 6b and supplementary figure 2g of the revised manuscript. P-value for Figure 1d has now been included.

      Author response image 1.

      C1ql2 reintroduction in Bcl11b cKO DGN does not significantly alter basal synaptic transmission at mossy fiber-CA3 synapses. a Input-output curves generated by plotting fEPSP slope against fiber volley amplitude at increasing stimulation intensities. b Quantification of regression line slopes for input-output curves for all three conditions. Control+EGFP, 35 slices from 16 mice; Bcl11b cKO+EGFP, 32 slices from 14 mice; Bcl11b cKO+EGFP-2A-C1ql2, 22 slices from 11 mice. The data are presented as means, error bars represent SEM. Kruskal-Wallis test (non-parametric ANOVA) followed by Dunn’s post hoc pairwise comparisons. p=0.106; ns, not significant.
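
      The three-group comparison named in this legend can be illustrated with a minimal sketch of the Kruskal-Wallis test. It uses only the Python standard library, covers exactly three groups (so the chi-square approximation with 2 degrees of freedom reduces to exp(-H/2)), and leaves out Dunn's post hoc comparisons. The slope values and function names are hypothetical and are not the authors' data or analysis code.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_3(groups):
    """Return (H, p) for exactly three samples, with tie correction.

    p uses the chi-square approximation with 2 degrees of freedom,
    whose survival function is exp(-H / 2).
    """
    if len(groups) != 3:
        raise ValueError("this sketch only handles three groups")
    pooled = [x for g in groups for x in g]
    n_total = len(pooled)
    ranks = average_ranks(pooled)

    h = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        h += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n_total * (n_total + 1)) * h - 3.0 * (n_total + 1)

    # Tie correction: divide by 1 - sum(t^3 - t) / (N^3 - N).
    counts = {}
    for x in pooled:
        counts[x] = counts.get(x, 0) + 1
    tie_term = sum(c ** 3 - c for c in counts.values())
    h /= 1.0 - tie_term / (n_total ** 3 - n_total)

    return h, math.exp(-h / 2.0)


# Hypothetical input-output regression slopes, one value per slice.
control = [0.91, 1.02, 0.88, 1.10, 0.97]
cko = [0.93, 0.99, 1.05, 0.90, 1.01]
rescue = [1.08, 1.15, 0.98, 1.21, 1.04]
h_stat, p_value = kruskal_wallis_3([control, cko, rescue])
print(f"H = {h_stat:.3f}, p = {p_value:.3f}")
```

      A non-significant H, as reported in the legend, means the pairwise post hoc step would not change the conclusion.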

      Measurement of EM micrographs: As prior work suggested that MF synapse structure is disrupted the authors should report active zone length as this may itself affect "synapse score" defined by the number of vesicles docked. More concerning is that the example KO micrographs seem to have lost all the densely clustered synaptic vesicles that are away from the AZ in normal MF synapses e.g. compare control and KO terminals in Fig 2a or 6f or 7f. These terminals look aberrant and suggest that the important measure is not what is docked but what is present in the terminal cytoplasm that normally makes up the reserve pool. This needs to be addressed with further analysis and modifications to the manuscript.

      As requested by reviewer #2 we analyzed and reported in the revised manuscript the active zone length. We found that the active zone length remained unchanged in all conditions (control/Bcl11b cKO/C1ql2 rescue, WT/C1ql2 KD, control/K262E and control/Nrxn123 cKO), strengthening our results that the described Bcl11b/C1ql2/Nrxn3 mechanism is involved in the recruitment of synaptic vesicles. These data have been included in supplementary figures 2c, 4h, 5f and 6g and are described in the results part of the revised manuscript.

      We want to clarify that the synapse score is not defined by the number of docked vesicles to the plasma membrane. The synapse score, which is described in great detail in our materials and methods part and has been previously published (De Bruyckere et al., 2018), rates MFS based on the number of synaptic vesicles and their distance from the active zone and was designed according to previously described properties of the vesicle pools at the MFS. The EM micrographs refer to the general misdistribution of SV in the proximity of MFS. Upon revision of the manuscript, we made sure that this was clearly stated in the main text module to avoid further confusion.

      The study also presents correlated changes in MF LTP in Bcl11b KO which are rescued by C1ql2 expression. It is not clear whether the structural and functional deficits are causally linked and this should be made clearer in the manuscript. It is also not apparent why this functional measure was chosen as it is unlikely that C1ql2 plays a direct role in presynaptic plasticity mechanisms that are through a cAMP/ PKA pathway and likely disrupted LTP is due to dysfunctional synapses rather than a specific LTP effect.

      The inclusion of functional experiments in this and our previous study (De Bruyckere et al., 2018) was first and foremost intended to determine whether the structural alterations observed at MFB disrupt MFS signaling. From the signaling properties we tested, basal synaptic transmission (this study) and short-term potentiation (De Bruyckere et al., 2018) were unaltered by Bcl11b KO, whereas MF LTP was found to be abolished (De Bruyckere et al., 2018). Indeed, because MF LTP largely depends on presynaptic mechanisms, including the redistribution of the readily releasable pool and recruitment of new active zones (Orlando et al., 2021; Vandael et al., 2020), it appears to be particularly sensitive to the specific structural changes we observed. We therefore believe that it is valuable information that MF LTP is affected in Bcl11b cKO animals - it provides direct evidence for the functional importance of the observed morphological alterations, while basic transmission remains largely normal. Furthermore, it subsequently provided a functional marker for testing whether the reintroduction of C1ql2 in Bcl11b cKO animals or the KD of C1ql2 in WT animals can functionally recapitulate the control or the Bcl11b KO phenotype, respectively.

      We fully agree with the reviewer that C1ql2 is unlikely to directly participate in the cAMP/PKA pathway and that the ablation of C1ql2 likely disrupts MF LTP through an alternative mode of action. Our original wording in the paragraph describing the results of the forskolin-induced LTP experiment might have overstressed the importance of the cAMP pathway. We have now rephrased that paragraph to better describe the main idea behind the forskolin experiment, namely to circumvent the initial Ca2+ influx in order to test whether deficient presynaptic Ca2+ channel/KAR signaling might be responsible for the loss of LTP in Bcl11b cKO. The results are strongly indicative of a downstream mechanism and further investigation is needed to determine the specific mechanisms by which C1ql2 regulates MF LTP, especially in light of the result that C1ql2.K262E rescued LTP, while it was unable to rescue the SV recruitment at the MF presynapse. This raises the possibility that C1ql2 can influence MF LTP through additional, yet uncharacterized mechanisms, independent of SV recruitment. As such, a causal link between the structural and functional deficits remains tentative and we have now emphasized that point by adding a respective sentence to the discussion of our revised manuscript. Nevertheless, we again want to stress that the main rationale behind the LTP experiments was to assess the functional significance of structural changes at MFS and not to elucidate the mechanisms by which MF LTP is established.

      The authors should consider measures that might support the role of Bcl11b targets in SV recruitment during the depletion of synapses or measurements of the readily releasable pool size that would complement their findings in structural studies.

      We fully agree that functional measurements of the readily releasable pool (RRP) size would be a valuable addition to the reported redistribution of SV in structural studies. We have, in fact, attempted to use high-frequency stimulus trains in both field and single-cell recordings (details on single-cell experiments are described in the response to point 8) to evaluate potential differences in RRP size between the control and Bcl11b KO (Author response image 2a and b). Under both recording conditions we see a trend towards lower values of the intersection between a regression line of late responses and the y-axis. This could be taken as an indication of slightly smaller RRP size in Bcl11b mutant animals compared to controls. However, for several technical reasons we are extremely cautious about drawing such far-reaching conclusions based on these data. At most, they suffice to conclude that the availability of release-ready vesicles in the KO is likely not dramatically smaller than in the control.

      The primary issue with using high-frequency stimulus trains for RRP measurements at MFS is the particularly low initial release probability (Pr) at these synapses. This means that a large number of stimulations is required to deplete the RRP. As the RRP is constantly replenished, it remains unclear when steady state responses are reached (reviewed by Kaeser and Regehr, 2017). This is clearly visible in our single-cell recordings (Author response image 2b), which were additionally complicated by prominent asynchronous release at later stages of the stimulus train and by a large variability in the shapes of cumulative amplitude curves between cells. In contrast, while the cumulative amplitude curves for field potential recordings do reach a steady state (Author response image 2a), field potential recordings in this context are not a reliable substitute for single-cell or, in the case of MFB, single-bouton recordings. Postsynaptic cells in field potential recordings are not clamped, meaning that the massive release of glutamate due to continuous stimulation depolarizes the postsynaptic cells and reduces the driving force for Na+, irrespective of depletion of the RRP. This is supported by the fact that we consistently observed a recovery of fEPSP amplitudes later in the trains where RRP had presumably been maximally depleted. In summary, high-frequency stimulus trains at the field potential level are not a valid and established technique for estimating RRP size at MFS.

      Specialized laboratories have used highly advanced techniques, such as paired recordings between individual MFB and postsynaptic CA3 pyramidal cells, to estimate the RRP size of MFB (Vandael et al., 2020). These approaches are outside the scope of our present study which, while elucidating functional changes following Bcl11b depletion and C1ql2 rescue, does not aim to provide a high-end biophysical analysis of the presynaptic mechanisms involved.

      Author response image 2.

      Estimation of RRP size using high-frequency stimulus trains at mossy fiber-CA3 synapses. a Results from field potential recordings. Cumulative fEPSP amplitude in response to a train of 40 stimuli at 100 Hz. All subsequent peak amplitudes were normalized to the amplitude of the first peak. Data points corresponding to putative steady state responses were fit with linear regression (RRP size is indirectly reflected by the intersection of the regression line with the y-axis). Control+EGFP, 6 slices from 5 mice; Bcl11b cKO+EGFP, 6 slices from 3 mice. b Results from single-cell recordings. Cumulative EPSC amplitude in response to a train of 15 stimuli at 50 Hz. The last four stimuli were fit with linear regression. Control, 5 cells from 4 mice; Bcl11b cKO, 3 cells from 3 mice. Note the shallow onset of response amplitudes and the subsequent frequency potentiation. Due to the resulting increase in slope at higher stimulus numbers, intersection with the y-axis occurs at negative values. The differences shown were not found to be statistically significant; unpaired t-test or Mann-Whitney U-test.
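
      The back-extrapolation described in this legend can be sketched in a few lines: normalize each response to the first, accumulate, fit a line to the late responses, and read off the y-intercept. This is an illustrative sketch using only the Python standard library; the amplitudes, the function names, and the choice of how many late points to fit are assumptions, not the authors' analysis code.

```python
def cumulative(values):
    """Running sum of a sequence."""
    total = 0.0
    out = []
    for v in values:
        total += v
        out.append(total)
    return out


def linear_fit(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rrp_back_extrapolation(amplitudes, n_late):
    """Fit the last n_late points of the normalized cumulative amplitude curve.

    The intercept with the y-axis is the indirect readout of RRP size; the
    slope reflects the combined steady-state release and replenishment.
    """
    normalized = [a / amplitudes[0] for a in amplitudes]
    cum = cumulative(normalized)
    stimulus_number = list(range(1, len(cum) + 1))
    return linear_fit(stimulus_number[-n_late:], cum[-n_late:])


# Hypothetical train: responses depress and then settle at a steady state.
train = [10.0, 6.0, 4.0, 2.0, 2.0, 2.0, 2.0]
slope, intercept = rrp_back_extrapolation(train, n_late=4)
print(f"steady-state slope = {slope:.2f}, y-intercept = {intercept:.2f}")
```

      If responses facilitate late in the train instead of settling, the fitted slope steepens and the intercept can become negative, which is the situation the legend notes for the single-cell recordings.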

      Bcl11b KO reduces the number of synapses, yet the I-O curve reported in Supp Fig 2 is not changed. How is that possible? This should be explained.

      We agree with reviewer #2: this apparent discrepancy has indeed struck us as a counterintuitive result. It might be that synapses that are preferentially eliminated in Bcl11b cKO are predominantly silent or have weak coupling strength, such that their loss has only a minimal effect on basal synaptic transmission. Although perplexing, the result is fully supported by our single-cell data which shows no significant differences in MF EPSC amplitudes recorded from CA3 pyramidal cells between controls and Bcl11b mutants (Author response image 3; please see the response below for details and also our response to Reviewer #1 question 2).

      Matsuda et al DOI: 10.1016/j.neuron.2016.04.001 previously reported that C1ql2 organizes MF synapses by aligning postsynaptic kainate receptors with presynaptic elements. As this may have consequences for the functional properties of MF synapses including their plasticity, the authors should report whether they see deficient postsynaptic glutamate receptor signaling in the Bcl11b KO and rescue in the C1ql2 re-expression.

      We agree that the study by Matsuda et al. is of key importance for our present work. Although MF LTP is governed by presynaptic mechanisms and we previously did not see differences in short-term plasticity between the control and Bcl11b cKO (De Bruyckere et al., 2018), the clustering of postsynaptic kainate receptors by C1ql2 is indeed an important detail that could potentially alter synaptic signaling at MFS in Bcl11b KO. We, therefore, re-analyzed previously recorded single-cell data by performing a kinetic analysis on MF EPSCs recorded from CA3 pyramidal cells in control and Bcl11b cKO mice (Author response image 3a) to evaluate postsynaptic AMPA and kainate receptor responses in both conditions. We took advantage of the fact that AMPA receptors deactivate roughly 10 times faster than kainate receptors, allowing the contributions of the two receptors to mossy fiber EPSCs to be separated (Castillo et al., 1997 and reviewed by Lerma, 2003). We fit the decay phase of the second (larger) EPSC evoked by paired-pulse stimulation with a double exponential function, yielding a fast and a slow component, which roughly correspond to the fractional currents evoked by AMPA and kainate receptors, respectively. Analysis of both fast and slow time constants and the corresponding fractional amplitudes revealed no significant differences between controls and Bcl11b mutants (Author response image 3e-h), indicating that both AMPA and kainate receptor signaling is unaffected by the ablation of C1ql2 following Bcl11b KO.

      Importantly, MF EPSC amplitudes evoked by the first and the second pulse (Author response image 3b), paired-pulse facilitation (Author response image 3c) and failure rates (Author response image 3d) were all comparable between controls and Bcl11b mutants. These results further corroborate our observations from field recordings that basal synaptic transmission at MFS is unaltered by Bcl11b KO.
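
      The paired-pulse and failure-rate measures mentioned here reduce to simple arithmetic, sketched below with the Python standard library. The 5 pA noise threshold is taken from the legend of Author response image 3; the amplitudes and function names are hypothetical, and computing the ratio from mean amplitudes is an assumption about how the averaging was done.

```python
NOISE_THRESHOLD_PA = 5.0  # events smaller than this count as failures


def paired_pulse_ratio(first_amps, second_amps):
    """Mean second-pulse amplitude divided by mean first-pulse amplitude."""
    mean_first = sum(first_amps) / len(first_amps)
    mean_second = sum(second_amps) / len(second_amps)
    return mean_second / mean_first


def failure_rate(first_amps, threshold=NOISE_THRESHOLD_PA):
    """Percentage of first-pulse events with no detectable EPSC."""
    failures = sum(1 for a in first_amps if abs(a) < threshold)
    return 100.0 * failures / len(first_amps)


# Hypothetical peak amplitudes in pA, one entry per sweep.
first = [0.0, 10.0, 2.0, 20.0, 8.0]
second = [16.0, 24.0, 16.0, 32.0, 32.0]
print("PPR:", paired_pulse_ratio(first, second))
print("Failure rate (%):", failure_rate(first))
```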

      We note that the results from single cell recordings regarding basal synaptic transmission merely confirm the observations from field potential recordings, and that the attempted measurement of RRP size at the single cell level was not successful. Thus, our single-cell data do not add new information about the mechanisms underlying the effects of Bcl11b-deficiency and we therefore decided not to report these data in the manuscript.

      Author response image 3.

      Basal synaptic transmission at mossy fiber-CA3 synapses is unaltered in Bcl11b cKO mice. a Representative average trace (20 sweeps) recorded from CA3 pyramidal cells in control and Bcl11b cKO mice at minimal stimulation conditions, showing EPSCs in response to paired-pulse stimulation (PPS) at an interstimulus interval of 40 ms. The signal is almost entirely blocked by the application of 2 μM DCG-IV (red). b Quantification of MF EPSC amplitudes in response to PPS for both the first and the second pulse. c Ratio between the amplitude of the second over the first EPSC. d Percentage of stimulation events resulting in no detectable EPSCs for the first pulse. Events <5 pA were considered as noise. e Fast decay time constant obtained by fitting the average second EPSC with the following double exponential function: $I(t) = A_{\mathrm{fast}}\,e^{-t/\tau_{\mathrm{fast}}} + A_{\mathrm{slow}}\,e^{-t/\tau_{\mathrm{slow}}} + C$, where $I$ is the recorded current amplitude after time $t$, $A_{\mathrm{fast}}$ and $A_{\mathrm{slow}}$ represent fractional current amplitudes decaying with the fast ($\tau_{\mathrm{fast}}$) and slow ($\tau_{\mathrm{slow}}$) time constant, respectively, and $C$ is the offset. Starting from the peak of the EPSC, the first 200 ms of the decaying trace were used for fitting. f Fractional current amplitude decaying with the fast time constant. g-h Slow decay time constant and fractional current amplitude decaying with the slow time constant. For all figures: Control, 8 cells from 4 mice; Bcl11b cKO, 8 cells from 6 mice. All data are presented as means, error bars indicate SEM. None of the differences shown were found to be statistically significant; Mann-Whitney U-test for non-normally and unpaired t-test for normally distributed data.
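
      To illustrate the double exponential fit named in this legend, the sketch below recovers a fast and a slow component from a synthetic decay trace. It is not the authors' fitting routine: it swaps in a simple grid search over candidate time constants, solving for the two amplitudes and the offset by linear least squares at each grid point, so that it runs on the Python standard library alone. The time constants, amplitudes and grid values are illustrative assumptions.

```python
import math


def biexp(t, a_fast, tau_fast, a_slow, tau_slow, c):
    """Double exponential decay with a constant offset."""
    return a_fast * math.exp(-t / tau_fast) + a_slow * math.exp(-t / tau_slow) + c


def solve3(matrix, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    a = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            factor = a[r][col] / a[col][col]
            for k in range(col, 4):
                a[r][k] -= factor * a[col][k]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        x[r] = (a[r][3] - sum(a[r][k] * x[k] for k in range(r + 1, 3))) / a[r][r]
    return x


def fit_biexp(times, currents, tau_fast_grid, tau_slow_grid):
    """Return (a_fast, tau_fast, a_slow, tau_slow, c) with the lowest squared error.

    For each candidate pair of time constants the model is linear in
    a_fast, a_slow and c, so those are obtained from the normal equations.
    """
    best = None
    for tau_f in tau_fast_grid:
        for tau_s in tau_slow_grid:
            if tau_s <= tau_f:
                continue  # keep the components ordered and distinguishable
            basis = [(math.exp(-t / tau_f), math.exp(-t / tau_s), 1.0) for t in times]
            normal = [[sum(b[i] * b[j] for b in basis) for j in range(3)] for i in range(3)]
            rhs = [sum(b[i] * y for b, y in zip(basis, currents)) for i in range(3)]
            a_f, a_s, c = solve3(normal, rhs)
            sse = sum((y - (a_f * b[0] + a_s * b[1] + c)) ** 2
                      for b, y in zip(basis, currents))
            if best is None or sse < best[0]:
                best = (sse, a_f, tau_f, a_s, tau_s, c)
    return best[1:]


# Synthetic 200 ms decay in 1 ms steps, with a tenfold difference in time constants.
times = [float(t) for t in range(201)]
trace = [biexp(t, 80.0, 5.0, 20.0, 50.0, 0.0) for t in times]
a_fast, tau_fast, a_slow, tau_slow, offset = fit_biexp(
    times, trace, tau_fast_grid=[2.0, 5.0, 10.0], tau_slow_grid=[30.0, 50.0, 100.0])
fraction_fast = a_fast / (a_fast + a_slow)
print(tau_fast, tau_slow, round(fraction_fast, 3))
```

      With real, noisy traces a continuous optimizer would normally replace the coarse grid, but the decomposition into a fast and a slow fractional amplitude is the same.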

      Reviewer #3 (Public Review):

      Overall, this is a strong manuscript that uses multiple current techniques to provide specific mechanistic insight into prior discoveries of the contributions of the Bcl11b transcription factor to mossy fiber synapses of dentate gyrus granule cells. The authors employ an adult deletion of Bcl11b via Tamoxifen-inducible Cre and use immunohistochemical, electron microscopy, and electrophysiological studies of synaptic plasticity, together with viral rescue of C1ql2, a direct transcriptional target of Bcl11b or Nrxn3, to construct a molecular cascade downstream of Bcl11b for DG mossy fiber synapse development. They find that C1ql2 re-expression in Bcl11b cKOs can rescue the synaptic vesicle docking phenotype and the impairments in MF-LTP of these mutants. They also show that C1ql2 knockdown in DG neurons can phenocopy the vesicle docking and plasticity phenotypes of the Bcl11b cKO. They also use artificial synapse formation assays to suggest that C1ql2 functions together with a specific Nrxn3 splice isoform in mediating MF axon development, extending these data with a C1ql2-K262E mutant that purports to specifically disrupt interactions with Nrxn3. All of the molecules involved in this cascade are disease-associated and this study provides an excellent blueprint for uncovering downstream mediators of transcription factor disruption. Together this makes this work of great interest to the field. Strengths are the sophisticated use of viral replacement and multi-level phenotypic analysis while weaknesses include the linkage of C1ql2 with a specific Nrxn3 splice variant in mediating these effects.

      Here is an appraisal of the main claims and conclusions:

      1) C1ql2 is a downstream target of Bcl11b which mediates the synaptic vesicle recruitment and synaptic plasticity phenotypes seen in these cKOs. This is supported by the clear rescue phenotypes of synapse anatomy (Fig.2) and MF synaptic plasticity (Fig.3). One weakness here is the absence of a control assessing over-expression phenotypes of C1ql2. It's clear from Fig.1D that viral rescue is often greater than WT expression (totally expected). In the case where you are trying to suppress a LoF phenotype, it is important to make sure that enhanced expression of C1ql2 in a WT background does not cause your rescue phenotype. A strong overexpression phenotype in WT would weaken the claim that C1ql2 is the main mediator of the Bcl11b phenotype for MF synapse phenotypes.

      As suggested by reviewer #3, we carried out C1ql2 over-expression experiments in control animals. We show that the over-expression of C1ql2 in the DG of control animals had no effect on the synaptic vesicle organization in the proximity of MFS. This further supports that the observed effect upon rescue of C1ql2 expression in Bcl11b cKOs is due to the physiological function of C1ql2 and not a result of the artificial overexpression. These data are now included in supplementary figure 2g-j and are described in detail in the results part of the revised manuscript. Please also see response to point 3 of reviewer #2.

      2) Knockdown of C1ql2 via 4 shRNAs is sufficient to produce the synaptic vesicle recruitment and MF LTP phenotypes. This is supported by clear effects in the shRNA-C1ql2 groups as compared to nonsense-EGFP controls. One concern (particularly given the use of 4 distinct shRNAs) is the potential for off-target effects, which is best controlled for by a rescue experiment with shRNA-insensitive C1ql2 cDNA as opposed to nonsense sequences, which may not elicit the same off-target effects.

      We agree with reviewer #3 that the usage of shRNAs could potentially create unexpected off-target effects and that the introduction of a shRNA-insensitive C1ql2 in parallel to the expression of the shRNA cassette would be a very effective control experiment. However, the suggested experiment would require an additional 6 months (2 months for AAV production, 2-3 months from animal injection to sacrifice and 1-2 months for EM imaging/analysis and LTP measurements) and a high number of additional animals (minimum 8 for EM and 8 for LTP measurements). We note here that before the production of the shRNA-C1ql2 and the shRNA-NS, the individual sequences were systematically checked for off-target binding on the murine exome with up to two mismatches and presented with no other target except the proposed one (C1ql2 for shRNA-C1ql2 and no target for shRNA-NS). Taking into consideration our in-silico analysis, we feel that the interpretation of our findings is valid without this (very reasonable) additional control experiment.
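
      The kind of in-silico check described above, allowing up to two mismatches, can be sketched as a sliding-window Hamming-distance scan. The response does not name the tool that was used, so this is only an illustration of the principle: the sequences, transcript names and function names are hypothetical, and a real screen would also handle strand orientation and use an indexed aligner rather than a brute-force scan.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_hits(query, transcripts, max_mismatches=2):
    """Return (name, start, mismatches) for every window within the mismatch limit.

    transcripts maps a transcript name to its sequence; start is 0-based.
    """
    query = query.upper()
    hits = []
    for name, sequence in transcripts.items():
        sequence = sequence.upper()
        for start in range(len(sequence) - len(query) + 1):
            mismatches = hamming(query, sequence[start:start + len(query)])
            if mismatches <= max_mismatches:
                hits.append((name, start, mismatches))
    return hits


# Hypothetical 10-nt query and toy transcripts (real shRNA targets are longer).
query = "ACGTACGTAC"
toy_transcripts = {
    "transcript_1": "TTTACGTACGTACTTT",  # contains the exact target
    "transcript_2": "GGGACGTTCGTACGGG",  # contains a one-mismatch site
    "transcript_3": "CCCCCCCCCCCCCCCC",  # unrelated
}
for hit in find_hits(query, toy_transcripts):
    print(hit)
```

      A sequence passes such a screen when the only hits fall within the intended target, or, for a nonsense control, when no hits are returned at all.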

      3) C1ql2 interacts with Nrxn3(25b+) to facilitate MF terminal SV clustering. This claim is theoretically supported by the HEK cell artificial synapse formation assay (Fig.5), the inability of the K262-C1ql2 mutation to rescue the Bcl11b phenotype (Fig.6), and the altered localization of C1ql2 in the Nrxn1-3 deletion mice (Fig.7). Each of these lines of experimental evidence has caveats that should be acknowledged and addressed. Given the hypothesis that C1ql2 and Nrxn3b(25b) are expressed in DG neurons and work together, the heterologous co-culture experiment seems strange. Up till now, the authors are looking at pre-synaptic function of C1ql2 since they are re-expressing it in DGNs. The phenotypes they are seeing are also pre-synaptic and/or consistent with pre-synaptic dysfunction. In Fig.5, they are testing whether C1ql2 can induce pre-synaptic differentiation in trans, i.e. theoretically being released from the 293 cells "post-synaptically". But the post-synaptic ligands (Nlgn1 and GluKs) are not present in the 293 cells, so a heterologous synapse assay doesn't really make sense here. The effect that the authors are seeing likely reflects the fact that C1ql2 and Nrxn3 do bind to each other, so C1ql2 is acting as an artificial post-synaptic ligand, in that it can cluster Nrxn3 which in turn clusters synaptic vesicles. But this does not test the model that the authors propose (i.e. C1ql2 and Nrxn3 are both expressed in MF terminals). Perhaps a heterologous assay where GluK2 is put into HEK cells and the C1ql2 and Nrxn3 are simultaneously or individually manipulated in DG neurons?

      C1ql2 is expressed by DG neurons and is then secreted into the MFS synaptic cleft, while Nrxn3, which is also expressed by DG neurons, is anchored at the presynaptic side. In our work we used the well-established co-culture assay and cultured HEK293 cells secreting C1ql2 (an IgK secretion sequence was inserted at the N-terminus of C1ql2) together with hippocampal neurons expressing Nrxn3(25b+). We used the HEK293 cells as a delivery system of secreted C1ql2 to the neurons to create regions of high concentration of C1ql2. By interfering with the C1ql2-Nrxn3 interaction in this system, either by expression of the non-binding mutant C1ql2 variant in the HEK cells or by manipulating Nrxn expression in the neurons, we could show that C1ql2 binding to Nrxn3(25b+) is necessary for the accumulation of vGlut1. However, we did not examine and do not claim within our manuscript that the interaction between C1ql2 and Nrxn3(25b+) induces presynaptic differentiation. Our experiment only aimed to analyze the ability of C1ql2 to cluster SV through interaction with Nrxn3. Moreover, by not expressing potential postsynaptic interaction partners of C1ql2 in our system, we could show that C1ql2 controls SV recruitment through a purely presynaptic mechanism. Co-culturing GluK2-expressing HEK cells with simultaneous manipulation of C1ql2 and/or Nrxn3 in neurons would not allow us to appropriately answer our scientific question, but would rather focus on the potential synaptogenic function of the Nrxn3/C1ql2/GluK2 complex and the role of the postsynaptic ligand in it. Thus, we feel that the proposed experiment, while very interesting for the characterization of additional putative functions of C1ql2, may not provide additional information for the point we were addressing. In the revised manuscript we tried to make the aim and methodological approach of this set of experiments clearer.

      4) K262-C1ql2 mutation blocks the normal rescue through a Nrxn3(25b) mechanism (Fig.6). The strength of this experiment rests upon the specificity of this mutation for disrupting Nrxn3b binding (presynaptic) as opposed to any of the known postsynaptic C1ql2 ligands such as GluK2. While this is not relevant for interpreting the heterologous assay (Fig.5), it is relevant for the in vivo phenotypes in Fig.6. Similar approaches as employed in this paper can test whether binding to other known postsynaptic targets is altered by this point mutation.

      It has been previously shown that C1ql2 together with C1ql3 recruit postsynaptic GluK2 at the MFS. However, loss of just C1ql2 did not affect the recruitment of GluK2, which was disrupted only upon loss of both C1ql2 and C1ql3 (Matsuda et al., 2018). In our study we demonstrate a purely presynaptic function of C1ql2 through Nrxn3 in the synaptic vesicle recruitment. This function is independent of C1ql3, as C1ql3 expression is unchanged in all of our models and its over-expression did not compensate for C1ql2 functions (Fig. 2, 3a-c). Our in vitro experiments also reveal that C1ql2 can recruit both Nrxn3 and vGlut1 in the absence of any known postsynaptic C1ql2 partner (KARs and BAI3; Fig.5; please also see response above). Furthermore, we have now performed a kinetic analysis on single-cell data which we had previously collected to evaluate postsynaptic AMPA and kainate receptor responses in both the control and Bcl11b KO. Our analysis reveals no significant differences in postsynaptic current kinetics, making it unlikely that AMPA and kainate receptor signaling is altered upon the loss of C1ql2 following Bcl11b cKO (Author response image 3e-h; please also see our response to reviewer #2 point 8). Thus, we have no experimental evidence supporting the idea that a loss of interaction between C1ql2.K262E and GluK2 would interfere with the examined phenotype. However, to exclude that the K262E mutation disrupts interaction between C1ql2 and GluK2, we performed co-immunoprecipitation from protein lysate of HEK293 cells expressing GluK2-myc-flag and GFP-C1ql2 or GluK2-myc-flag and GFP-K262E and could show that both C1ql2 and K262E had GluK2 bound when precipitated. These data are included in supplementary figure 5k of the revised manuscript.

      5) Altered localization of C1ql2 in Nrxn1-3 cKOs. These data are presented to suggest that Nrxn3(25b) is important for localizing C1ql2 to the SL of CA3. Weaknesses of these data include both the lack of Nrxn specificity in the triple a/b KOs as well as the profound effects of Nrxn LoF on the total levels of C1ql2 protein. Some measure that isn't biased by this large difference in C1ql2 levels should be attempted (something like in Fig.1F).

      We acknowledge that the lack of specificity in the Nrxn123 model makes it difficult to interpret our data. We have now examined the mRNA levels of Nrxn1 and Nrxn2 upon stereotaxic injection of Cre in the DG of Nrxn123flox/flox animals and found that Nrxn1 was only mildly reduced. At the same time Nrxn2 showed a tendency for reduction that was not significant (data included in supplementary figure 6a of revised manuscript). Only Nrxn3 expression was strongly suppressed. Of course, this does not exclude that the mild reduction of Nrxn1 and Nrxn2 interferes with the C1ql2 localization at the MFS. We further examined the mRNA levels of C1ql2 in control and Nrxn123 mutants to ensure that the observed changes in C1ql2 protein levels at the MFS are not due to reduced mRNA expression and found no changes (data are included in supplementary figure 6b of the revised manuscript), suggesting that overall protein C1ql2 expression is normal.

      The reduced C1ql2 fluorescence intensity at the MFS was first observed when non-binding C1ql2 variant K262E was introduced to Bcl11b cKO mice that lack endogenous C1ql2 (Fig.6). In these experiments, we found that despite the overall high protein levels of C1ql2.K262E in the hippocampus (Fig. 6c), its fluorescence intensity at the SL was significantly reduced compared to WT C1ql2 (Fig. 6d-e). The remaining signal of the C1ql2.K262E at the SL was equally distributed and in a punctate form, similar to WT C1ql2. Together, this suggests that loss of C1ql2-Nrxn3 interaction interferes with the localization of C1ql2 at the MFS, but not with the expression of C1ql2. Of course, this does not exclude that other mechanisms are involved in the synaptic localization of C1ql2, beyond the interaction with Nrxn3, as both the mutant C1ql2 in Bcl11b cKO and the endogenous C1ql2 in Nrxn123 cKOs show residual immunofluorescence at the SL. Further studies are required to determine how C1ql2-Nrxn3 interaction regulates C1ql2 localization at the MFS.

      Reviewer #1 (Recommendations For The Authors):

      In addition to addressing the comments below, this study would benefit significantly from providing insight and discussion into the relevant potential postsynaptic signaling components controlled exclusively by C1ql2 (postsynaptic kainate receptors and the BAI family of proteins).

      We have now performed a kinetic analysis on single-cell data that we had previously collected to evaluate postsynaptic AMPA and kainate receptor responses in both the control and Bcl11b cKO. Our analysis reveals no significant differences in postsynaptic current kinetics, making it unlikely that AMPA and kainate receptor signaling differ between controls and upon the loss of C1ql2 following Bcl11b cKO (Author response image 3e-h; please also see our response to Reviewer #2 point 8). This agrees with previous findings that C1ql2 regulates postsynaptic GluK2 recruitment together with C1ql3 and only loss of both C1ql2 and C1ql3 results in a disruption of KAR signaling (Matsuda et al., 2018). In our study we demonstrate a purely presynaptic function of C1ql2 through Nrxn3 in the synaptic vesicle recruitment. This function is independent of C1ql3, as C1ql3 expression is unchanged in all of our models and its over-expression did not compensate for C1ql2 functions (Fig. 2, 3a-c). Our in vitro experiments also reveal that C1ql2 can recruit both Nrxn3 and vGlut1 in the absence of any known postsynaptic C1ql2 partner (KARs and BAI3; Fig.5; please also see our response to reviewer #3 point 4 above). We believe that further studies are needed to fully understand both the pre- and the postsynaptic functions of C1ql2. Because the focus of this manuscript was on the role of the C1ql2-Nrxn3 interaction and our investigation on postsynaptic functions of C1ql2 was incomplete, we did not include our findings on postsynaptic current kinetics in our revised manuscript. However, we increased the discussion on the known postsynaptic partners of C1ql2 in the revised manuscript to increase the interpretability of our results.

      Major Comments:

      The authors demonstrate that the ultrastructural properties of presynaptic boutons are altered after Bcl11b KO and C1ql2 KD. However, whether C1ql2 functions as part of a tripartite complex and the identity of the postsynaptic receptor (BAI, KAR) should be examined.

      Matsuda and colleagues have nicely demonstrated in their 2016 (Neuron) study that C1ql2 is part of a tripartite complex with presynaptic Nrxn3 and postsynaptic KARs. Moreover, they demonstrated that C1ql2, together with C1ql3, recruit postsynaptic KARs at the MFS, while the KO of just C1ql2 did not affect the KAR localization. In our study we demonstrate a purely presynaptic function of C1ql2 through Nrxn3 in the synaptic vesicle recruitment. This function is independent of C1ql3, as C1ql3 expression is unchanged in all of our models and its over-expression did not compensate for C1ql2 functions (Fig. 2, 3a-c). Our in vitro experiments also reveal that C1ql2 is able to recruit both Nrxn3 and vGlut1 in the absence of any known postsynaptic C1ql2 partner (Fig. 5; please also see our response to reviewer #3 point 4 above). Moreover, we were able to show that the SV recruitment depends on C1ql2 interaction with Nrxn3 through the expression of a non-binding C1ql2 (Fig. 6) that retains the ability to interact with GluK2 (supplementary figure 5k of revised manuscript) or by KO of Nrxns (Fig. 7). Furthermore, we have now performed a kinetic analysis on single-cell data which we had previously collected to evaluate postsynaptic AMPA and kainate receptor responses in both the control and Bcl11b cKO. Our analysis reveals no significant differences in postsynaptic current kinetics, making it unlikely that AMPA and kainate receptor signaling differ between controls and Bcl11b mutants (Author response image 3e-h; please also see our response to Reviewer #2 question 8). Together, we have no experimental evidence so far that would support that the postsynaptic partners of C1ql2 are involved in the observed phenotype. While it would be very interesting to characterize the postsynaptic partners of C1ql2 in depth, we feel this would be beyond the scope of the present study.

      Figure 1f: For a more comprehensive understanding of the Bcl11b KO phenotype and the potential role for C1ql2 on MF synapse number, a complete quantification of vGlut1 and Homer1 for all conditions (Supplement Figure 2e) should be included in the main text.

      In our study we focused on the role of C1ql2 in the structural and functional integrity of the MFS downstream of Bcl11b. Bcl11b ablation leads to several phenotypes in the MFS that have been thoroughly described in our previous study (De Bruyckere et al., 2018). As expected, re-expression of C1ql2 only partially rescued these phenotypes, with full recovery of the SV recruitment (Fig. 2) and of the LTP (Fig. 3), but had no effect on the reduced numbers of MFS nor the structural complexity of the MFB created by the Bcl11b KO (supplementary figure 2d-f of revised manuscript). We understand that including the quantification of vGlut1 and Homer1 co-localization in the main figures would help with a better understanding of the Bcl11b mutant phenotype. However, in our manuscript we investigate C1ql2 as an effector of Bcl11b and thus we focus on its functions in SV recruitment and LTP. As we did not find a link between C1ql2 and the number of MFS/MFB upon re-expression of C1ql2 in Bcl11b cKO or now also in C1ql2 KD (see response to comment #4 below), we believe it is more suitable to present these data in the supplement.

      Figure 3/4: Given the striking reduction in the numbers of synapses (Supplement Figure 2e) and docked vesicles (Figure 2d) in the Bcl11b KO and C1ql2 KD (Figure 4e-f), it is extremely surprising that basal synaptic transmission is unaffected (Supplement Figure 2g). The authors should determine the EPSP input-output relationship following C1ql2 KD and measure EPSPs following trains of stimuli at various high frequencies.

      We fully acknowledge that this is an unexpected result. It is, however, quite plausible that the modest displacement of SV fails to noticeably influence basal synaptic transmission. This would be the case, for example, if only a low number of vesicles are released by single stimuli, in line with the very low initial Pr at MFS. In contrast, the reduction in synapse numbers in the Bcl11b mutant might indeed be expected to be reflected in the input-output relationship. It is possible, however, that synapses that are preferentially eliminated in Bcl11b cKO are predominantly silent or have weak coupling strength, such that their loss has only a minimal effect on basal synaptic transmission. Finally, we cannot exclude compensatory mechanisms (homeostatic plasticity) at the remaining synapses. A detailed analysis of these potential mechanisms would be a whole project in its own right.

      As additional information, we can say that the largely unchanged input-output-relation in Bcl11b cKO is also present in the single-cell level data (Author response image 3; details on single-cell experiments are described in the response to Reviewer #2 point 8).

      As suggested by the reviewer, we have now additionally analyzed the input-output relationship following C1ql2 KD and again did not observe any significant difference between control and KD animals. We have incorporated the respective input-output curves into the revised manuscript under Supplementary figure 3c-d.

      Figure 4: Does C1ql2 shRNA also reduce the number of MFBs? This should be tested to further identify C1ql2-dependent and independent functions.

      As requested by reviewer #1, we quantified the number of MFBs upon C1ql2 KD. We show that C1ql2 KD in WT animals does not alter the number of MFBs. The data are presented in supplementary figure 4d of the revised manuscript. Re-expression of C1ql2 in Bcl11b cKO did not rescue the loss of MFS created by the Bcl11b mutation. Moreover, C1ql2 re-expression did not rescue the complexity of the MFB ultrastructure perturbed by the Bcl11b ablation. Together, this suggests that Bcl11b regulates MFS maintenance through additional C1ql2-independent pathways. In our previously published work (De Bruyckere et al., 2018) we identified and discussed in detail several candidate genes such as Sema5b, Ptgs2, Pdyn and Penk as putative effectors of Bcl11b in the structural and functional integrity of MFS (please also see response to reviewer #2, point 1 of public reviews).

      Figure 5: Clarification is required regarding the experimental design of the HEK/Neuron co-culture: 1. C1ql2 is a secreted soluble protein - how is the protein anchored to the HEK cell membrane to recruit Nrxn3(25b+) binding and, subsequently, vGlut1?

      C1ql2 was secreted by the HEK293 cells through an IgK signal peptide at the N-terminus of C1ql2. The high concentration of C1ql2 close to the secretion site, together with the sparse co-culturing of the HEK293 cells on the neurons, allows for the quantification of accumulation of neuronal proteins. We have now described the experimental conditions in greater detail in the main text module of the revised manuscript.

      2) Why are the neurons transfected and not infected? Transfection efficiency of neurons with lipofectamine is usually poor (1-5%; Karra et al., 2010), while infection of neurons with lentiviruses or AAVs encoding cDNAs routinely are >90% efficient. Thus, interpretation of the recruitment assays may be influenced by the density of neurons transfected near a HEK cell.

      We agree with reviewer #1 that viral infection of the neurons would have been a more effective way of expressing our constructs. However, due to safety allowances in the used facility and time limitation at the time of conception of this set of experiments, a lipofectamine transfection was chosen.

      However, as all of our examined groups were handled in the same way and multiple cells from three independent experiments were examined for each experimental set, we believe that possible biases introduced by the transfection efficiency have been minimized and are thus confident in our interpretation of these results.

      3) Surface labeling of HEK cells for wild-type C1ql2 and K262 C1ql2 would be helpful to assess the trafficking of the mutant.

      We recognize that potential changes to the trafficking of C1ql2 caused by the K262E mutation would be important to characterize, in light of the reduced localization of the mutant protein at the SL in the in vivo experiments (Fig. 6e). In our culture system, C1ql2 and K262E were secreted by the HEK cells through insertion of an IgK signal peptide at the N-terminus of the myc-tagged C1ql2/K262E. Thus, trafficking analysis on this system would not be informative, as the system is highly artificial compared to the in vivo model. Further studies are needed to characterize C1ql2 trafficking in neurons to understand how C1ql2-Nrxn3 interaction regulates the localization of C1ql2. However, labeling of the myc-tag in C1ql2 or K262E expressing HEK cells of the co-culture model reveals a similar signal for the two proteins (Fig. 5a,c). Nrxn-null mutation in neurons co-cultured with C1ql2-expressing HEK cells disrupted C1ql2-mediated vGlut1 accumulation in the neurons. Selective expression of Nrxn3(25b) in the Nrxn-null neurons restored vGlut1 clustering (Fig. 5e-f). Together, these data suggest that it is the interaction between C1ql2 and Nrxn3 that drives the accumulation of vGlut1.

      Figure 6: Bcl11b KO should also be included in 6f-h.

      As suggested by reviewer #1, we included the Bcl11b cKO in figures 6f-h and in corresponding supplementary figures 5c-j.

      Figure 7b: What is the abundance of mRNA for Nrxn1 and Nrxn2 as well as the abundance of Nrxns after EGFP-Cre injection into DG?

      We addressed this point raised by reviewer #1 by quantifying the relative mRNA levels of Nrxn1 and Nrxn2 via qPCR upon stereotaxic injection of EGFP-Cre in the DG of Nrxn123flox/flox animals. We found that Nrxn1 was only mildly reduced, while Nrxn2 showed a tendency for reduction that was not significant. The data are presented in supplementary figure 6a of the revised manuscript.

      Minor Comments for readability:

      Synapse score is referred to frequently in the text and should be defined within the text for clarification.

      'n' numbers should be better defined in the figure legends. For example, for protein expression analysis in 1c, n=3. Is this a biological or technical triplicate? For electrophysiology (e.g. 3c), does "n=7" reflect the number of animals or the number of slices? n/N (slices/animals) should be presented.

      Figure 7a: Should the diagrams of the cre viruses be EGFP-Inactive or active Cre and not CRE-EGFP as shown in the diagram?

      Figure 7b: the region used for the inset should be identified in the larger image.

      All minor points have been fixed in the revised manuscript according to the suggestions.

      Reviewer #3 (Recommendations For The Authors):

      -Please describe the 'synapse score' somewhere in the text - it is too prominently featured to not have a clear description of what it is.

      The description of the synapse score has been included in the main text module of the revised manuscript.

      -The claim that Bcl11b controls SV recruitment "specifically" through C1ql2 is a bit stronger than is warranted by the data. Particularly given that C1ql2 is expressed at 2.5X control levels in their rescue experiments. See pt.2

      Please see response to reviewer #3 point 1 of public reviews. To address this, we over-expressed C1ql2 in control animals and found no changes in the synaptic vesicle distribution (supplementary figure 2g-j of revised manuscript). This supports that the observed rescue of synaptic vesicle recruitment by re-expression of C1ql2 is due to its physiological function and not due to the artificially elevated protein levels. Of course, we cannot exclude the possibility that other, C1ql2-independent, mechanisms also contribute to the SV recruitment downstream of Bcl11b. Our data from the C1ql2 rescue, C1ql2 KD, the in vitro experiments and the interruption of C1ql2-Nrxn3 in vivo, strongly suggest C1ql2 to be an important regulator of SV recruitment.

      -Does Bcl11b regulate Nrxn3 expression? Considering the apparent loss of C1ql2 expression in the Nrxn KO mice, this is an important detail.

      We agree with reviewer #3 that this is an important point. We have previously done differential transcriptomics from DG neurons of Bcl11b cKOs compared to controls and did not find Nrxn3 among the differentially expressed genes. To further validate this, we now quantified the Nrxn3 mRNA levels via qPCR in Bcl11b cKOs compared to controls and found no differences. These data are included in supplementary figure 5a of the revised manuscript.

      -It appears that C1ql2 expression is much lower in the Nrxn123 KO mice. Since the authors are trying to test whether Nrxn3 is required for the correct targeting of C1ql2, this is a confounding factor. We can't really tell if what we are seeing is a "mistargeting" of C1ql2, loss of expression, or both. If the authors did a similar analysis to what they did in Figure 1 where they looked at the synaptic localization of C1ql2 (and quantified it) that could provide more evidence to support or refute the "mistargeting" claim.

      Please also see response to reviewer #3 point 5 of public reviews. To exclude that reduction of fluorescence intensity of C1ql2 at the SL in Nrxn123 KO mice is due to loss of C1ql2 expression, we examined the mRNA levels of C1ql2 in control and Nrxn123 mutants and found no changes (data are included in supplementary figure 6b of the revised manuscript), suggesting that C1ql2 gene expression is normal. The reduced C1ql2 fluorescence intensity at the MFS was first observed when non-binding C1ql2 variant K262E was introduced to Bcl11b cKO mice that lack endogenous C1ql2 (Fig.6). In these experiments, we found that despite the overall high protein levels of C1ql2.K262E in the hippocampus (Fig. 6c), its fluorescence intensity at the SL was significantly reduced compared to WT C1ql2 (Fig. 6d-e). The remaining C1ql2.K262E signal in the SL was equally distributed and in a punctate form, similar to WT C1ql2. Together, this indicates that the loss of C1ql2-Nrxn3 interaction interferes with the localization of C1ql2 along the MFS, but not with expression of C1ql2. Of course, this does not exclude that additional mechanisms regulate C1ql2 localization at the synapse, as both the mutant C1ql2 in Bcl11b cKO and the endogenous C1ql2 in Nrxn123 cKO show residual immunofluorescence at the SL.

      We note here that we have not previously quantified the co-localization of C1ql2 with individual synapses. C1ql2 is a secreted molecule that localizes at the MFS synaptic cleft. However, not much is known about the number of MFS that are positive for C1ql2 nor about the mechanisms regulating C1ql2 targeting, transport, and secretion to the MFS. Whether C1ql2 interaction with Nrxn3 is necessary for the protection of C1ql2 from degradation, its surface presentation and transport or stabilization to the synapse is currently unclear. Upon revision of our manuscript, we realized that we might have overstated this particular finding and have now rephrased the specific parts within the results to appropriately describe the observation and have also included a sentence in the discussion referring to the lack of understanding of the mechanism behind this observation.

      -Title of Figure S5 is "Nrxn KO perturbs C1ql2 localization and SV recruitment at the MFS", but there is no data on C1ql2 localization.

      This issue has been fixed in the revised manuscript.

      -S5 should be labeled more clearly than just Cre+/-

      This issue has been fixed in the revised manuscript.

      References

      Castillo, P.E., Malenka, R.C., Nicoll, R.A., 1997. Kainate receptors mediate a slow postsynaptic current in hippocampal CA3 neurons. Nature 388, 182–186. https://doi.org/10.1038/40645

      De Bruyckere, E., Simon, R., Nestel, S., Heimrich, B., Kätzel, D., Egorov, A.V., Liu, P., Jenkins, N.A., Copeland, N.G., Schwegler, H., Draguhn, A., Britsch, S., 2018. Stability and Function of Hippocampal Mossy Fiber Synapses Depend on Bcl11b/Ctip2. Front. Mol. Neurosci. 11. https://doi.org/10.3389/fnmol.2018.00103

      Kaeser, P.S., Regehr, W.G., 2017. The readily releasable pool of synaptic vesicles. Curr. Opin. Neurobiol. 43, 63–70. https://doi.org/10.1016/j.conb.2016.12.012

      Lerma, J., 2003. Roles and rules of kainate receptors in synaptic transmission. Nat. Rev. Neurosci. 4, 481–495. https://doi.org/10.1038/nrn1118

      Orlando, M., Dvorzhak, A., Bruentgens, F., Maglione, M., Rost, B.R., Sigrist, S.J., Breustedt, J., Schmitz, D., 2021. Recruitment of release sites underlies chemical presynaptic potentiation at hippocampal mossy fiber boutons. PLoS Biol. 19, e3001149. https://doi.org/10.1371/journal.pbio.3001149

      Vandael, D., Borges-Merjane, C., Zhang, X., Jonas, P., 2020. Short-Term Plasticity at Hippocampal Mossy Fiber Synapses Is Induced by Natural Activity Patterns and Associated with Vesicle Pool Engram Formation. Neuron 107, 509-521.e7. https://doi.org/10.1016/j.neuron.2020.05.013

    1. Author Response

      The following is the authors’ response to the original reviews.

      We are very grateful to the reviewers for their thoughtful comments on the manuscript and to the editors for their assessment.

      We thank the reviewers for their positive feedback and appreciate that they consider our method a valid addition to previously established systems for generating recombinant RNA viruses.

      To strengthen this point, we have now included additional validation by the rescue of recombinant Chikungunya and Dengue virus from viral RNA directly, using the CLEVER protocol. This strengthens the potential of this method as a reverse genetics platform for positive-stranded viruses in general.

      The supporting data have been added to the Results section and taken into account in Materials and Methods, and the corresponding supplementary figure (Figure S4) has been added.

      One key point raised by one of the reviewers, a comparison with different systems, could not be addressed in this manuscript, as our lab does not perform BAC cloning and we currently do not have the necessary expertise to conduct an unbiased side-by-side comparison.

      All other comments were addressed in detail, either by including additional data or through specific clarification in the revised text. We are grateful for the careful review and constructive criticisms raised by the reviewers and feel that the corrections and additions have significantly improved the manuscript.

      We have revised the latest version posted May 30, 2023 on bioRxiv (https://doi.org/10.1101/2023.05.11.540343).

      Reviewer #1:

      Public Review:

      In this manuscript, Kipfer et al describe a method for a fast and accurate SARS-CoV2 rescue and mutagenesis. This work is based on a published method termed ISA (infectious subgenomic amplicons), in which partially overlapping DNA fragments covering the entire viral genome and additional 5' and 3' sequences are transfected into mammalian cell lines. These DNA fragments recombine in the cells, express the full length viral genomic RNA and launch replication and rescue of infectious virus.

      CLEVER, the method described here significantly improves on the ISA method to generate infectious SARS-CoV2, making it widely useful to the virology community.

      Specifically, the strengths of this method are:

      1) The successful use of various cell lines and transfection methods.

      2) Generation of a four-fragment system, which significantly improves the method efficiency due to lower number of required recombination events.

      3) Flexibility in choice of overlapping sequences, making this system more versatile.

      4) The authors demonstrated how this system can be used to introduce point mutations as well as insertion of a tag and deletion of a viral gene.

      5) Fast-tracking generation of infectious virus directly from RNA of clinical isolates by RT-PCR, without the need for cloning the fragments or using synthetic sequences.

      One weakness of the latter point, which is also pointed out by the authors, is that the direct rescue of clinical isolates was not tested for sequence fidelity.

      The manuscript clearly presents the findings, and the proof-of-concept experiments are well designed.

      Overall, this is a very useful method for SARS-CoV2 research. Importantly, it can be applicable to many other viruses, speeding up the response to newly emerging viruses that threaten public health.

      We thank the reviewer for this positive feedback and the summary of the main points. Nevertheless, we would like to comment on point 5): “the direct rescue of clinical isolates was not tested for sequence fidelity”

      This impression by the reviewer suggests that the data was not sufficient on this point. However, the sequence fidelity after direct rescue from RNA was indeed tested in this study, even on a clonal level (please see: Table S2, or raw NGS data SRX20303605 - SRX20303607). For higher clarity, we added the following sentence to the manuscript: “Indeed, a slight increase of unintentional mutations was observed when sequencing clonal virus populations rescued from RNA directly”.

      Recommendations for the authors:

      Minor Points:

      1) On page 8, the authors write: "levels correlated very well with the viral phenotype". This sentence is not clear. Please clarify what you mean by "viral phenotype". Do you mean CPE on Vero cells?

      We corrected the sentence to: “(…) staining intensity and patterns correlated very well with the wild-type phenotype.”

      2) Page 9: "sequences were analyzed with a cut-off of 10%". Cutoff of what? Please clarify.

      The sentence was rephrased to: “(…) mutations with a relative abundance of >10% in the entire virus population were analyzed”

      3) Page 15: The authors refer to the time required for completion of each step of the process. It would be helpful and informative for the readers to include a panel in figure 4, visualizing the timelines.

      We included a timeline in Figure 4, Panel A.

      4) Materials and methods, first paragraph: Please specify which human samples were collected. Do the authors refer to clinical virus isolates?

      We added the following information to the Materials and Methods section: “Human serum samples for neutralization assays were collected from SARS-CoV-2 vaccinated anonymous donors (…)”

      Clinical virus isolates (Material and Methods; Virus) were used for control experiments, neutralization assays, or as templates for RT-PCR.

      5) Supplementary figure 4A: The color scheme makes it hard to differentiate between the BA.1 and BA.5 fragments. Please choose colors that are not as similar to each other.

      Colors were adapted for better distinction.

      Reviewer #2:

      Public Review:

      The authors of the manuscript have developed and used a cloning-free method. It is not entirely novel (rather, it is based on the previously described ISA method), but it is clearly efficient and a useful complement to the already existing methods. One of the strong points of the approach used by the authors is that it is very versatile, i.e. it can be used in combination with already existing methods and tools. I find this important, as many laboratories have already established their favorite methods to manipulate the SARS-CoV-2 genome and are probably unwilling to change their approach entirely. Though the authors highlight the benefits of their method, these are probably not absolute - other methods may be as efficient or as fast. Still, I find myself thinking that for certain purposes I would like to complement my current approach with elements from the authors' CLEVER method.

      The work does not contain much novel biological data - which is expected for a paper dedicated to the development of a new method (or to improving an existing one). This may be something of a shortcoming, as it is commonly expected that authors who have developed new methods apply them to discover something novel. The work stops at the step of rescuing the viruses and confirming their biological properties. This part is done very well and represents a strength of the study. The properties of the rescued viruses were also studied using NGS methods, which revealed the high accuracy of the method used; this is very important, as the method relies on PCR, which is known to generate random errors and is therefore not always the method of choice.

      What I found missing is a real head-to-head comparison of the developed system with existing alternatives, preferably some PCR-free standard methods such as the use of BAC clones. There are a lot of comparisons, but they are not direct; data from different studies have merely been compared. The authors could also be more open in discussing the limitations of the method. One of these seems to be the rather low rescue efficiency - 1 rescue event per 11,000 transfected cells. This is much lower than for infectious plasmids (about 1 event per 100 cells or so) and infectious RNAs (often 1 event per 10 cells; for smaller genomes most of the transfected cells become infected). This makes the CLEVER method poorly suited to the generation of large infectious virus libraries and excludes its use for studies of mutant viruses that harbor strongly attenuating mutations. Many such mutations may reduce virus genome infectivity by 3-4 orders of magnitude; with current efficiencies, the use of the CLEVER approach may result in false conclusions (mutant viruses will be classified as non-viable while in reality they are just strongly attenuated).

      We thank reviewer 2 for the careful review of our work and the valuable feedback. We agree that a direct comparison with other (PCR-free) methods, such as BAC cloning, could be useful for demonstrating the unique benefits of the CLEVER method. However, as our laboratory does not use any BAC or YAC cloning methods, we could not ensure an unbiased side-by-side comparison using different techniques.

      We would like to highlight the avoidance of any yeast/bacterial cloning steps that render the CLEVER protocol significantly faster and easier to handle. A visualization of the key steps that could be skipped using CLEVER in comparison to common reverse genetics methods is given in Figure 6.

      Further, we firmly believe that the benefits of the CLEVER method become especially apparent for large viral genomes such as that of SARS-CoV-2, where assembly, genome amplification and sequence verification of plasmid DNA are highly inefficient and more time-consuming than for small viruses like DENV, CHIKV or HIV.

      We agree with the reviewer that the overall transfection and recombination efficiencies observed with CLEVER seemed rather low. Although data on transfection/rescue efficiency is known for many techniques and viruses, we did not find any published data on the reconstitution of SARS-CoV-2 or viruses with similar genome sizes. Therefore, a useful comparator for our observations in relation to other techniques is currently simply missing. We therefore emphasize that the efficiencies of CLEVER were achieved with one of the largest plus-stranded RNA virus genomes, and our data can’t be directly compared to transfection efficiencies of short infectious RNAs.

      On the contrary, it was rather interesting to observe the very high rescue efficiency of infectious virus progeny. During the two years of establishing and validating the CLEVER protocol, we reached success rates for the genome reconstitution after transfection of >95 %. This was even obtained with highly attenuated mutants including rCoV2∆ORF3678 (joint deletion of ORF3a, ORF6, ORF7a, and ORF8) (Liu et al., 2022)(see Author response image 1). We amended this data in response to the reviewers’ comment and as an example of the successful rescue of an attenuated virus from five overlapping genome fragments (fragments A, B, C, D1, and D2∆ORF3678).

      The latter data were not added to the main manuscript since in this case the deletions were introduced using a different method: from the plasmid-based DNA fragment D2∆ORF3678 and not directly from PCR-based mutagenesis.

      Further, CLEVER was used for related substantial manipulations, including the complete deletion of the Envelope gene (E) which led to the creation of a single-cycle virus that may serve as a live, replication-incompetent vaccine candidate (Lett et al., 2023).

      Author response image 1.

      rCoV2∆ORF3678. Detection of intracellular SARS-CoV-2 nucleocapsid protein (N, green) and nuclei (Hoechst, blue) in Vero E6TMPRSS2 cells infected with rCoV2∆ORF3678 by immunocytochemistry. Scalebar is 200 µm in overview and 50 µm in ROI images.

      Recommendations for the authors:

      The work is nicely presented and the method the authors have developed is clearly valuable. As indicated in the Public Review section, the work would benefit from a direct comparison of CLEVER with infectious plasmid (or RNA) based methods; a direct comparison of data would be more convincing than an indirect one. The authors should also discuss possible limitations of the method - this is helpful for a reader.

      We were not able to perform a direct comparison of CLEVER with other methods (see our statement above).

      We added the following section to the discussion: “Along with the advantages of the CLEVER protocol, limitations must be considered: Interestingly, virus was never rescued after transfecting Vero E6 cells, as has been observed previously (Mélade et al., 2022). Whether this is due to low transfection efficiency or the cell’s inability to recombine remains to be elucidated. Other cell lines not tested within this study will have to be tested for efficient recombination and virus production first. Further, the high sequence integrity of rescued virus is highly dependent on the fidelity of the DNA polymerase used for amplification. The use of other enzymes might negatively influence the sequence integrity of recombinant virus, as it has been observed for the direct rescue from viral RNA using a commercially available one-step RT-PCR kit. Another limitation when performing direct mutagenesis is the synthesis of long oligos to create an overlapping region. Repetitive sequences, for example, can impair synthesis, and self-annealing and hairpin formation increase with prolonged oligos.”

      Some technical corrections of the text would be beneficial. In all parts of the text, terms applicable only to DNA or to RNA are mixed, which creates some confusion. For example, the authors state that "the human cytomegalovirus promoter (CMV) was cloned upstream of 5' UTR and poly(A) tail, the hepatitis delta ribozyme (HDVr) and the simian virus 40 polyadenylation signal downstream of the 3' UTR". Strictly speaking this is impossible, as such a construct would contain a dsDNA sequence (CMV promoter) followed by ssRNA (5'UTR, polyA tail and HDV ribozyme) and then again dsDNA (SV40 terminator). So, it would be better to be correct and add "sequences corresponding to" or "dsDNA copies of" to the description of RNA elements.

      We thank the reviewer for the advice but would like to state that in scientific language it is common to assume that nucleic acid cloning is based on DNA.

      We have corrected the description in the Methods section: “The human cytomegalovirus promoter (CMV) was cloned upstream of the DNA sequence of the viral 5’UTR; herein, the first five nucleotides (ATATT) correspond to the 5’UTR of SARS-CoV. Sequences corresponding to the poly(A) tail (n=35), the hepatitis delta virus ribozyme (HDVr), and the simian virus 40 polyadenylation signal (SV40pA) were cloned immediately downstream of the DNA sequence of the viral 3’UTR.”

      For ease of reading and for consistent terminology, we kept the original spelling in the rest of the manuscript.

      In description of neutralization assay authors have used temperature 34 C for incubation of virus with antibodies as well as for subsequent incubation of infected cells. Why this temperature was used?

      The following sentence was added (Materials and Methods; Cells): “A lower incubation temperature was chosen based on previous studies (V’kovski et al., 2021).”

      References

      Lett MJ, Otte F, Hauser D, Schön J, Kipfer ET, Hoffmann D, Halwe NJ, Ulrich L, Zhang Y, Cmiljanovic V, Wylezich C, Urda L, Lang C, Beer M, Mittelholzer C, Klimkait T. 2023. Single-cycle SARS-CoV-2 vaccine elicits high protection and sterilizing immunity in hamsters. doi:10.1101/2023.05.17.541127

      Liu Y, Zhang X, Liu J, Xia H, Zou J, Muruato AE, Periasamy S, Kurhade C, Plante JA, Bopp NE, Kalveram B, Bukreyev A, Ren P, Wang T, Menachery VD, Plante KS, Xie X, Weaver SC, Shi P-Y. 2022. A live-attenuated SARS-CoV-2 vaccine candidate with accessory protein deletions. Nat Commun 13:4337. doi:10.1038/s41467-022-31930-z

      V’kovski P, Gultom M, Kelly JN, Steiner S, Russeil J, Mangeat B, Cora E, Pezoldt J, Holwerda M, Kratzel A, Laloli L, Wider M, Portmann J, Tran T, Ebert N, Stalder H, Hartmann R, Gardeux V, Alpern D, Deplancke B, Thiel V, Dijkman R. 2021. Disparate temperature-dependent virus–host dynamics for SARS-CoV-2 and SARS-CoV in the human respiratory epithelium. PLoS Biol 19:e3001158. doi:10.1371/journal.pbio.3001158

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The study addresses how faces and bodies are integrated in two STS face areas revealed by fMRI in the primate brain. It builds upon recordings and analysis of the responses of large populations of neurons to three sets of images, that vary face and body positions. These sets allowed the authors to thoroughly investigate invariance to position on the screen (MC HC), to pose (P1 P2), to rotation (0 45 90 135 180 225 270 315), to inversion, to possible and impossible postures (all vs straight), to the presentation of head and body together or in isolation. By analyzing neuronal responses, they found that different neurons showed preferences for body orientation, head orientation, or the interaction between the two. By using a linear support vector machine classifier, they show that the neuronal population can decode head-body angle presented across orientations, in the anterior aSTS patch (but not middle mSTS patch), except for mirror orientation.

      Strengths:

      These results extend prior work on the role of Anterior STS fundus face area in face-body integration and its invariance to mirror symmetry, with a rigorous set of stimuli revealing the workings of these neuronal populations in processing individuals as a whole, in an important series of carefully designed conditions.

      Minor issues and questions that could be addressed by the authors:

      (1) Methods. While monkeys certainly infer/recognize that individual pictures refer to the same pose with varying orientations based on prior studies (Wang et al.), I am wondering whether in this study monkeys saw a full rotation of each of the monkey poses as a video before seeing the individual pictures of the different orientations, during recordings.

      The monkeys had not been exposed to videos of a rotating monkey pose before the recordings. However, they were reared and housed with other monkeys, providing them with ample experience of monkey poses from different viewpoints.

      (2) Experiment 1. The authors mention that neurons are preselected as face-selective, body-selective, or both-selective. Do the Monkey Sum Index and ANOVA main effects change per Neuron type?

      We have performed a new analysis to assess whether the Monkey Sum Index is related to the response strength for the face versus the body as measured in the Selectivity Test of Experiment 1. To do this we selected face- and body-category selective neurons, as well as neurons responding selectively to both faces and bodies. First, we selected those neurons that responded significantly to either faces, bodies, or the two control object categories, using a split-plot ANOVA for these 40 stimuli. From those neurons, we selected face-selective ones having at least a twofold larger mean net response to faces compared to bodies (faces > 2 * bodies) and the control objects for faces (faces  > 2* objects). Similarly, a body-selective neuron was defined by a twofold larger mean net response to bodies compared to faces and the control objects for bodies. A body-and-face selective neuron was defined as having a twofold larger net response to the faces compared to their control objects, and to bodies compared to their control objects, with the ratio between mean response to bodies and faces being less than twofold. Then, we compared the distribution of the Monkey Sum Index (MSI) for each region (aSTS; mSTS), pose (P1, P2), and centering (head- (HC) or monkey-centered (MC)) condition. Too few body-and-face selective neurons were present in each combination of region, pose, and centering (a maximum of 7) to allow a comparison of their MSI distribution with the other neuron types. The Figure below shows the distribution of the MSI for the different orientation-neuron combinations for the body- and face-selective neurons (same format as in Figure 3a, main text). The number of body-selective neurons, according to the employed criteria, varied from 21 to 29, whereas the number of face-selective neurons ranged from 14 to 24 (pooled across monkeys). 
The data of the two subjects are shown in a different color and the number of cases for each subject is indicated (n1: number of cases for M1; n2: number of cases for M2). The arrows indicate the medians for the data pooled across the monkey subjects. For the MC condition, the MSI tended to be more negative (i.e. relatively less response to the monkey compared to the sum of the body and face responses) for the face compared to the body cells, but this was significant only for mSTS and P1 (p = 0.043; Wilcoxon rank sum test; tested after averaging the indices per neuron to avoid dependence of indices within a neuron). No consistent, nor significant tendencies were observed for the HC stimuli. This absence of a consistent relationship between MSI and face- versus body-selectivity is in line with the absence of a correlation between the MSI and face- versus body-selectivity using natural images of monkeys in a previous study (Zafirova Y, Bognár A, Vogels R. Configuration-sensitive face-body interactions in primate visual cortex. Prog Neurobiol. 2024 Jan;232:102545).
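
The selection criteria described above can be written out as a small function. The following is a minimal sketch in Python, not the analysis code used for the study: the function name `classify_neuron`, its argument names, and the `"unclassified"` fallback are illustrative assumptions. It also assumes that the neuron has already passed the split-plot ANOVA responsiveness test, which is not reproduced here. Inputs are mean net responses to faces, bodies, and their respective control objects.

```python
def classify_neuron(faces, bodies, face_objects, body_objects):
    """Label one responsive neuron from its mean net responses.

    Hypothetical helper: thresholds follow the twofold criteria in the text,
    using strict inequalities as in the parenthetical (faces > 2 * bodies).
    """
    # Face-selective: faces exceed twice the bodies and twice the face controls.
    if faces > 2 * bodies and faces > 2 * face_objects:
        return "face"
    # Body-selective: bodies exceed twice the faces and twice the body controls.
    if bodies > 2 * faces and bodies > 2 * body_objects:
        return "body"
    # Body-and-face selective: each category beats its own controls twofold,
    # while the body/face ratio stays below twofold in either direction.
    if (faces > 2 * face_objects and bodies > 2 * body_objects
            and faces < 2 * bodies and bodies < 2 * faces):
        return "body-and-face"
    return "unclassified"
```

Under these assumptions, a neuron with net responses of 10 (faces), 8 (bodies), and 2 for both control sets would fall into the body-and-face group, because neither category response is twice the other.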

      We did not perform a similar analysis for the main effects of the two-way ANOVA because the very large majority of neurons showed a significant effect of body orientation and thus no meaningful difference between the two neuron types can be expected.

      Author response image 1.

      (3) I might have missed this information, but the correlation between P1 and P2 seems to not be tested although they carry similar behavioral relevance in terms of where attention is allocated and where the body is facing for each given head-body orientation.

      Indeed, we did not compute this correlation between the responses to the sitting (P1) and standing (P2) pose avatar images. However, as pointed out by the reviewer, one might expect such correlations because of the same head orientations and body-facing directions. Thus, we computed the correlation between the 64 head-body orientation conditions of P1 and P2 for those neurons that were tested with both poses and showed a response for both poses (Split-plot ANOVA). This was performed for the Head-Centered and Monkey-Centered tests of Experiment 1 for each monkey and region. Note that not all neurons were tested with both poses (because of failure to maintain isolation of the single unit in both tests or the monkey stopped working) and not all neurons that were recorded in both tests showed a significant response for both poses, which is not unexpected since these neurons can be pose selective. The distribution of the Pearson correlation coefficients of the neurons with a significant response in both tests is shown in Figure S1. The median correlation coefficient was significantly larger than zero for each region, monkey, and centering condition (outcome of Wilcoxon tests, testing whether the median was different from zero (p1 = p-value for M1; p2: p-value for M2) in Figure), indicating that the effect of head and/or body orientation generalizes across pose. We have noted this now in the Results (page 12) and added the Figure (New Figure S1) in the Suppl. Material.
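
The per-neuron correlation described above can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the authors' code: each neuron is assumed to be represented by two lists of 64 mean responses (one per head-body orientation condition, in the same order for P1 and P2), and the names `pearson_r` and `median_pose_correlation` are hypothetical. The Wilcoxon test on the resulting coefficients is deliberately left out of the sketch.

```python
from math import sqrt
from statistics import median


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def median_pose_correlation(p1_responses, p2_responses):
    """Correlate P1 and P2 responses neuron by neuron.

    p1_responses[i] and p2_responses[i] hold the 64 condition means of
    neuron i for the sitting (P1) and standing (P2) pose, respectively.
    Returns the list of coefficients and their median.
    """
    coefficients = [pearson_r(a, b) for a, b in zip(p1_responses, p2_responses)]
    return coefficients, median(coefficients)
```

A median coefficient above zero would then be tested for significance separately for each region, monkey, and centering condition, as described in the response.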

      (4) Is the invariance for position HC-MC larger in aSTS neurons compared to mSTS neurons, as could be expected from their larger receptive fields?

      Yes, the position tolerance of the interaction of body and head orientation was significantly larger for aSTS compared to mSTS neurons, as we described on pages 11 and 12 of the Results. This is in line with larger receptive fields in aSTS than in mSTS. However, we did not plot receptive fields in the present study.

      (5) L492 "The body-inversion effect likely results from greater exposure to upright than inverted bodies during development". Monkeys display more hanging upside-down behavior than humans, however, does the head appear more tilted in these natural configurations?

      Indeed, infant monkeys do spend some time hanging upside down from their mother's belly. While we lack quantitative data on this behavior, casual observations suggest that even young monkeys spend more time upright. The tilt of the head while hanging upside down can vary, just as it does in standing or sitting monkeys (as when they search for food or orient to other individuals). To our knowledge, no quantitative data exist on the frequency of head tilts in upright versus upside-down monkeys. Therefore, we refrain from further speculation on this interesting point, which warrants more attention.

      (6) Methods in Experiment 1. SVM. How many neurons are sufficient to decode the orientation?

      The number of neurons that are needed to decode the head-body orientation angle depends on which neurons are included, as we show in a novel analysis of the data of Experiment 1. We employed a neuron-dropping analysis, similar to Chiang et al. (Chiang FK, Wallis JD, Rich EL. Cognitive strategies shift information from single neurons to populations in prefrontal cortex. Neuron. 2022 Feb 16;110(4):709-721) to assess the positive (or negative) contribution of each neuron to the decoding performance. We performed cross-validated linear SVM decoding N times, each time leaving out a different neuron (using N-1 neurons; 2000 resamplings of pseudo-population vectors). We then ranked decoding accuracies from highest to lowest, identifying the ‘worst’ (rank 1) to ‘best’ (rank N) neurons. Next, we conducted N decodings, incrementally increasing the number of included neurons from 1 to N, starting with the worst-ranked neuron (rank 1) and sequentially adding the next (rank 2, rank 3, etc.). This analysis focused on zero versus straight angle decoding in the aSTS, as it yielded the highest accuracy. We applied it when training on MC and testing on HC for each pose. Plotting accuracy as a function of the number of included neurons suggested that less than half contributed positively to decoding. We show also the ten “best” neurons for each centering condition and pose. These have a variety of tuning patterns for head and body orientation suggesting that the decoding of head-body orientation angle depends on a population code. Notably, the best-ranked (rank N) neuron alone achieved above-chance accuracy. We have added this interesting and novel result to the Results (page 16) and Suppl. Material (new Figure S3).
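
The neuron-dropping procedure described above can be sketched as follows. This is a minimal Python illustration, not the code used for the study. The decoder is passed in as a callable `decode`, a hypothetical stand-in for the cross-validated linear SVM with resampled pseudo-population vectors; the names `neuron_dropping`, `toy_decode`, and `TOY_CONTRIBUTION` are assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple


def neuron_dropping(
    n_neurons: int, decode: Callable[[Sequence[int]], float]
) -> Tuple[List[int], List[float]]:
    """Rank neurons by leave-one-out decoding, then rebuild the population.

    decode(ids) must return the decoding accuracy obtained with the
    neurons whose indices are listed in ids.
    """
    all_ids = list(range(n_neurons))
    # Step 1: decode N times, each time leaving out one neuron (N-1 neurons).
    without = [decode([j for j in all_ids if j != i]) for i in all_ids]
    # Step 2: rank from highest to lowest accuracy. The neuron whose removal
    # leaves the highest accuracy is the 'worst' (rank 1); the last is 'best'.
    order = sorted(all_ids, key=lambda i: without[i], reverse=True)
    # Step 3: decode with 1..N neurons, adding them from worst to best.
    curve = [decode(order[:k]) for k in range(1, n_neurons + 1)]
    return order, curve


# Toy decoder for illustration only: accuracy is chance (0.5) plus a fixed,
# made-up contribution per included neuron.
TOY_CONTRIBUTION = [0.10, -0.05, 0.20, 0.0]


def toy_decode(ids: Sequence[int]) -> float:
    return 0.5 + sum(TOY_CONTRIBUTION[i] for i in ids)
```

With the toy decoder, `neuron_dropping(4, toy_decode)` ranks the neuron with the negative contribution first and the neuron with the largest contribution last, so the accuracy curve rises only once the positively contributing neurons are added.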

      (7) Figure 3D 3E. Could the authors please indicate for each of these neurons whether they show a main effect of face, body, or interaction, as well as their median corrected correlation to get a flavor of these numbers for these examples?

      We have indicated these now in Figure 3.

      (8) Methods and Figure 1A. It could be informative to specify whether the recordings were carried out in the lateral part of the STS or in the fundus of the STS, both for aSTS and mSTS, for comparison with other studies that use these distinctions (AF, AL, MF, ML).

      In experiment 1, the recording locations were not as medial as the fundus. For experiments 2 and 3, the ventral part of the fundus was included, as described in the Methods. We have added this to the Methods now (page 31).

      Wang, G., Obama, S., Yamashita, W. et al. Prior experience of rotation is not required for recognizing objects seen from different angles. Nat Neurosci 8, 1768-1775 (2005). https://doi-org.insb.bib.cnrs.fr/10.1038/nn1600

      Reviewer #2 (Public review):

      Summary:

      This paper investigates the neuronal encoding of the relationship between head and body orientations in the brain. Specifically, the authors focus on the angular relationship between the head and body by employing virtual avatars. Neuronal responses were recorded electrophysiologically from two fMRI-defined areas in the superior temporal sulcus and analyzed using decoding methods. They found that: (1) anterior STS neurons encode head-body angle configurations; (2) these neurons distinguish aligned and opposite head-body configurations effectively, whereas mirror-symmetric configurations are more difficult to differentiate; and (3) an upside-down inversion diminishes the encoding of head-body angles. These findings advance our understanding of how visual perception of individuals is mediated, providing a fundamental clue as to how the primate brain processes the relationship between head and body - a process that is crucial for social communication.

      Strengths:

      The paper is clearly written, and the experimental design is thoughtfully constructed and detailed. The use of electrophysiological recordings from fMRI-defined areas elucidated the mechanism of head-body angle encoding at the level of local neuronal populations. Multiple experiments, control conditions, and detailed analyses thoroughly examined various factors that could affect the decoding results. The decoding methods effectively and consistently revealed the encoding of head-body angles in the anterior STS neurons. Consequently, this study offers valuable insights into the neuronal mechanisms underlying our capacity to integrate head and body cues for social cognition-a topic that is likely to captivate readers in this field.

      Weaknesses:

      I did not identify any major weaknesses in this paper; I only have a few minor comments and suggestions to enhance clarity and further strengthen the manuscript, as detailed in the Private Recommendations section.

      Reviewer #3 (Public review):

      Summary:

      Zafirova et al. investigated the interaction of head and body orientation in the macaque superior temporal sulcus (STS). Combining fMRI and electrophysiology, they recorded responses of visual neurons to a monkey avatar with varying head and body orientations. They found that STS neurons integrate head and body information in a nonlinear way, showing selectivity for specific combinations of head-body orientations. Head-body configuration angles can be reliably decoded, particularly for neurons in the anterior STS. Furthermore, body inversion resulted in reduced decoding of head-body configuration angles. Compared to previous work that examined face or body alone, this study demonstrates how head and body information are integrated to compute a socially meaningful signal.

      Strengths:

      This work presents an elegant design of visual stimuli, with a monkey avatar of varying head and body orientations, making the analysis and interpretation straightforward. Together with several control experiments, the authors systematically investigated different aspects of head-body integration in the macaque STS. The results and analyses of the paper are mostly convincing.

      Weaknesses:

      (1) Using ANOVA, the authors demonstrate the existence of nonlinear interactions between head and body orientations. While this is a conventional way of identifying nonlinear interactions, it does not specify the exact type of the interaction. Although the computation of the head-body configuration angle requires some nonlinearity, it's unclear whether these interactions actually contribute. Figure 3 shows some example neurons, but a more detailed analysis is needed to reveal the diversity of the interactions. One suggestion would be to examine the relationship between the presence of an interaction and the neural encoding of the configuration angle.

      This is an excellent suggestion. To do this, one needs to identify the neurons that contribute to the decoding of head-body orientation angles. For that, we employed a neuron-dropping analysis, similar to Chiang et al. (Chiang FK, Wallis JD, Rich EL. Cognitive strategies shift information from single neurons to populations in prefrontal cortex. Neuron. 2022 Feb 16;110(4):709-721.) to assess the positive (or negative) contribution of each neuron to the decoding performance. We performed cross-validated linear SVM decoding N times, each time leaving out a different neuron (using N-1 neurons; 2000 resamplings of pseudo-population vectors). We then ranked decoding accuracies from highest to lowest, identifying the ‘worst’ (rank 1) to ‘best’ (rank N) neurons. Next, we conducted N decodings, incrementally increasing the number of included neurons from 1 to N, starting with the worst-ranked neuron (rank 1) and sequentially adding the next (rank 2, rank 3, etc.). This analysis focused on zero versus straight angle decoding in the aSTS, as it yielded the highest accuracy. We applied it when training on MC and testing on HC for each pose. Plotting accuracy as a function of the number of included neurons suggested that less than half contributed positively to decoding (see Figure S3). We examined the tuning for head and body orientation of the 10 “best” neurons (Figure S3). For half or more of those the two-way ANOVA showed a significant interaction. These are indicated by the red color in the Figure. They showed a variety of tuning patterns for head and body orientation, suggesting that the decoding of the head-body orientation angle results from a combination of neurons with different tuning profiles. Based on a suggestion from reviewer 2, we performed for each neuron of experiment 1 a one-way ANOVA with as factor head-body orientation angle. To do that, we combined all 64 trials that had the same head-body orientation angle. 
The percentage of neurons (required to be responsive in the tested condition) for which this one-way ANOVA was significant was low but larger than the expected 5% (Type 1 error), with a median of 16.5% (range: 3 to 23%) in aSTS and 8% for mSTS (range: 0-19%). However, a higher percentage of the 10 best neurons for each pose (indicated by the star) showed a significant one-way ANOVA for angle (for P1, MC: 50% (95% confidence interval (CI): 19% – 81%); P1, HC: 70% (CI: 35% – 93%); P2, MC: 70% (CI: 35% – 93%); P2, HC: 50% (CI: 19% – 81%)). These percentages were significantly higher than expected for a random sample from the population of neurons for each pose-centering combination (expected percentages listed in the same order as above: 16%, 13%, 16%, and 10%; all outside CI). Thus, for at least half of the “best” neurons, the response differed significantly among the head-orientation angles at the single neuron level. Nonetheless, the tuning profiles were diverse, suggesting a population code for head-body orientation angle. We have added this interesting and novel result to the Results (page 16) and Suppl. Material (Figure S3).
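
The response does not name the method used for the confidence intervals. The quoted values are consistent with the exact (Clopper-Pearson) binomial interval, so the sketch below assumes that method. It is a minimal Python illustration with hypothetical function names (`binom_cdf`, `clopper_pearson`), solving for the bounds by bisection rather than with a statistics library.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for k successes out of n."""

    def solve(cdf_at, target):
        # cdf_at(p) decreases as p grows, so bisect on [0, 1].
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if cdf_at(mid) > target:
                lo = mid  # p is still too small
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: P(X >= k | p) = alpha/2, i.e. P(X <= k-1 | p) = 1 - alpha/2.
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper bound: P(X <= k | p) = alpha/2.
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper
```

For 5 of 10 and 7 of 10 significant neurons, this yields intervals of roughly 19% to 81% and 35% to 93%, matching the values quoted above. The expected population percentages (16%, 13%, 16%, and 10%) fall below the respective lower bounds.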

      (2) Figure 4 of the paper shows a better decoding of the configuration angle in the anterior STS than in the middle STS. This is an interesting result, suggesting a transformation in the neural representation between these two areas. However, some control analyses are needed to further elucidate the nature of this transformation. For example, what about the decoding of head and body orientations - does absolute orientation information decrease along the hierarchy, accompanying the increase in configuration information?

      We have now performed two additional analyses, one in which we decoded the orientation of the head and another one in which we decoded the orientation of the body. We employed the responses to the avatar of experiment 1, using the same sample of neurons of which we decoded the head-body orientation angle. To decode the head orientation, the trials with identical head orientation, irrespective of their body orientation, were given the same label. For this, we employed only responses in the head-centered condition. To decode the body orientation, the trials with identical body orientation, irrespective of their head orientation, had the same label, and we employed only responses in the body-centered condition. The decoding was performed separately for each pose (P1 and P2) and region. We decoded either the responses of 20 neurons (10 randomly sampled from each monkey for each of the 1000 resamplings), 40 neurons (20 randomly sampled per monkey), or 60 neurons (30 neurons per monkey) since the sample of 60 neurons yielded close to ceiling performance for the body orientation decoding. For each pose, the body orientation decoding was worse for aSTS than for mSTS, although this difference reached significance only for P1 and for the 40 neurons sample of P2 (p < 0.025; two-tailed test; same procedure as employed for testing the significance of the decoding of whole-body orientation for upright versus inverted avatars (Experiment 3)). Face orientation decoding was significantly worse for aSTS compared to mSTS. These results are in line with the previously reported decreased decoding of face orientation in the anterior compared to mid-STS face patches (Meyers EM, Borzello M, Freiwald WA, Tsao D. Intelligent information loss: the coding of facial identity, head pose, and non-face information in the macaque face patch system. J Neurosci. 
2015 May 6;35(18):7069-81), and decreased decoding of body orientation in anterior compared to mid-STS body patches (Kumar S, Popivanov ID, Vogels R. Transformation of Visual Representations Across Ventral Stream Body-selective Patches. Cereb Cortex. 2019 Jan 1;29(1):215-229). As mentioned by the reviewer, this contrasts with the decoding of the head-body orientation angle, which increases when moving more anteriorly. We mention this finding now in the Discussion (page 27) and present the new Figure S10 in the Suppl. Material.    
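
The three decoding analyses differ only in how the same 64 head-body conditions are grouped into classes. The sketch below illustrates this relabeling in Python. It is an illustration under assumptions, not the authors' code: the names `condition_labels` and `class_sizes` are hypothetical, and the head-body angle is assumed to be the head orientation minus the body orientation, modulo 360 degrees, a sign convention that is not stated in the response.

```python
from collections import Counter
from itertools import product

# The eight orientations used for both head and body, in degrees.
ORIENTATIONS = [0, 45, 90, 135, 180, 225, 270, 315]


def condition_labels(head, body):
    """Return the class label of one condition under each decoding scheme."""
    return {
        "head": head,                   # head decoding ignores body orientation
        "body": body,                   # body decoding ignores head orientation
        "angle": (head - body) % 360,   # assumed sign convention
    }


# All 8 x 8 = 64 head-body orientation conditions.
CONDITIONS = [condition_labels(h, b) for h, b in product(ORIENTATIONS, repeat=2)]


def class_sizes(key):
    """Count how many of the 64 conditions share each label of a scheme."""
    return Counter(c[key] for c in CONDITIONS)
```

Under each scheme, every class pools eight conditions, so a head or body orientation class spans all orientations of the other part, and an angle class such as zero or straight (180 degrees) spans all eight viewpoints of the avatar.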

      (3) While this work has characterized the neural integration of head and body information in detail, it's unclear how the neural representation relates to the animal's perception. Behavioural experiments using the same set of stimuli could help address this question, but I agree that these additional experiments may be beyond the scope of the current paper. I think the authors should at least discuss the potential outcomes of such experiments, which can be tested in future studies.

      Unfortunately, we do not have behavioral data. One prediction would be that the discrimination of head-body orientation angle, irrespective of the viewpoint of the avatar, would be more accurate for zero versus straight angles compared to the right versus left angles. We have added this to the Discussion (page 28).

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) P22 L373. It should read Figure S5C instead of S4C.

      Thanks; corrected.

      (2) Figure 7B. All inverted decoding accuracies, although significantly lower than upright decoding accuracies, appear significantly above baseline. Should the title be amended accordingly?

      Thanks for pointing this out. To avoid future misunderstanding we have changed the title to:

      “Integration of head and body orientations in the macaque superior temporal sulcus is stronger for upright bodies”

      (3) Discussion L432-33. "with some neurons being tuned to a particular orientation of both the head and the body". Wouldn't that be visible as a diagonal profile on the normalized net responses in Fig 3D? Or can the Anova evidence such a tuning?

      We meant to say that some neurons were tuned to a particular combination of head and body orientation, like the third aSTS example neuron shown in Figure 3D. We have corrected the sentence.

      Reviewer #2 (Recommendations for the authors):

      Major comment:

      This paper effectively demonstrates that the angular relationship between the head and body can be decoded from population responses in the anterior STS. In other words, these neurons encode information about the head-body angle. However, how exactly do these neurons encode this information? Given that the study employed electrophysiological recordings from a local population of neurons, it might be possible to provide additional data on the response patterns of individual neurons to shed light on the underlying encoding mechanisms.

      Although the paper already presents example response patterns (Figures 3D, E) and shows that STS neurons encode interactions between head and body orientations (Figure 3B), it remains unclear whether the angle difference between the head and body has a systematic effect on neuronal responses. For instance, a description of whether some neurons preferentially encode specific head-body angle differences (e.g., a "45-degree angle neuron"), or additional population analyses such as a one-way ANOVA with angle difference as the main effect (or two-way ANOVA with angle difference as one of the main effect), would be very informative. Such data could offer valuable insights into how individual neurons contribute to the encoding of head-body angle differences-a detail that may also be reflected in the decoding results. Alternatively, it is possible that the encoding of head-body angle is inherently complex and only discernible via decoding methods applied to population activity. Either scenario would provide interesting and useful information to the field.

      We have performed two additional analyses which are relevant to this comment. First, we attempted to relate the tuning for body and head orientation with the decoding of the head-body orientation angle. To do this, one needs to identify the neurons that contribute to the decoding of head-body orientation angles. For that, we employed a neuron-dropping analysis, similar to Chiang et al. (Chiang FK, Wallis JD, Rich EL. Cognitive strategies shift information from single neurons to populations in prefrontal cortex. Neuron. 2022 Feb 16;110(4):709-721.) to assess the positive (or negative) contribution of each neuron to the decoding performance. We performed cross-validated linear SVM decoding N times, each time leaving out a different neuron (using N-1 neurons; 2000 resamplings of pseudo-population vectors). We then ranked decoding accuracies from highest to lowest, identifying the ‘worst’ (rank 1) to ‘best’ (rank N) neurons. Next, we conducted N decodings, incrementally increasing the number of included neurons from 1 to N, starting with the worst-ranked neuron (rank 1) and sequentially adding the next (rank 2, rank 3, etc.). This analysis focused on zero versus straight angle decoding in the aSTS, as it yielded the highest accuracy. We applied it when training on MC and testing on HC for each pose. Plotting accuracy as a function of the number of included neurons suggested that less than half contributed positively to decoding (see Figure S3). We examined the tuning for head and body orientation of the 10 “best” neurons (Figure S3). For half or more of those the two-way ANOVA showed a significant interaction. These are indicated by the red color in the Figure. They showed a variety of tuning patterns for head and body orientation, suggesting that the decoding of the head-body orientation angle results from a combination of neurons with different tuning profiles.
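
      The ranking logic of the neuron-dropping analysis described above can be sketched as follows. This is a minimal illustration, not the authors' code: the decoder is passed in as a generic function, and the toy decoder `toy_decode` with its per-neuron `CONTRIBUTION` values is a hypothetical stand-in for the cross-validated linear SVM applied to pseudo-population vectors.

```python
from typing import Callable, List, Sequence


def rank_neurons_by_dropping(
    n_neurons: int,
    decode: Callable[[Sequence[int]], float],
) -> List[int]:
    """Order neurons from 'worst' (rank 1) to 'best' (rank N).

    Each neuron is left out once. Leaving out a helpful neuron lowers the
    accuracy, so the neuron whose removal gives the HIGHEST accuracy is
    the worst one and comes first.
    """
    all_neurons = list(range(n_neurons))
    leave_one_out = []
    for left_out in all_neurons:
        kept = [n for n in all_neurons if n != left_out]
        leave_one_out.append((decode(kept), left_out))
    # Highest leave-one-out accuracy first; ties broken by neuron index.
    leave_one_out.sort(key=lambda pair: (-pair[0], pair[1]))
    return [neuron for _, neuron in leave_one_out]


def incremental_accuracy(
    ranked: Sequence[int],
    decode: Callable[[Sequence[int]], float],
) -> List[float]:
    """Accuracy when decoding from the first k ranked neurons, k = 1..N."""
    return [decode(ranked[:k]) for k in range(1, len(ranked) + 1)]


# Hypothetical toy decoder: each neuron adds or subtracts a fixed amount
# from chance level (0.5). A real analysis would call the SVM here.
CONTRIBUTION = [0.10, -0.05, 0.02, -0.01, 0.08]


def toy_decode(neurons: Sequence[int]) -> float:
    return 0.5 + sum(CONTRIBUTION[n] for n in neurons)


ranked = rank_neurons_by_dropping(len(CONTRIBUTION), toy_decode)
curve = incremental_accuracy(ranked, toy_decode)
```

      In this toy case the two neurons with negative contributions are ranked first, and the accuracy curve only rises once the positively contributing neurons are added, which is the pattern used above to judge how many neurons contribute positively.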

      Second, we have followed the suggestion of the reviewer to perform, for each neuron of experiment 1, a one-way ANOVA with head-body orientation angle as the factor. To do that, we combined all 64 trials that had the same head-body orientation angle. The percentage of neurons (required to be responsive in the tested condition) for which this one-way ANOVA was significant is shown in the Tables below for each region, separately for each pose (P1, P2), centering condition (MC = monkey-centered; HC = head-centered) and monkey subject (M1, M2). The percentages were low but larger than the expected 5% (Type 1 error), with a median of 16.5% (range: 3-23%) in aSTS and 8% (range: 0-19%) in mSTS.

      Author response table 1.

      Interestingly, a higher percentage of the 10 best neurons for each pose (indicated by the star in the Figure above) showed a significant one-way ANOVA for angle (for P1, MC: 50% (95% confidence interval (CI): 19% – 81%); P1, HC: 70% (CI: 35% – 93%); P2, MC: 70% (CI: 35% – 93%); P2, HC: 50% (CI: 19% – 81%)). These percentages were significantly higher than expected for a random sample from the population of neurons for each pose-centering combination (expected percentages listed in the same order as above: 16%, 13%, 16%, and 10%; all outside CI). Thus, for at least half of the “best” neurons, the response differed significantly among the head-body orientation angles at the single neuron level. Nonetheless, the tuning profiles were quite diverse, suggesting population coding of head-body orientation angle. We have added this interesting and novel result to the Results (page 16) and Suppl. Material (Figure S3).
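
      The response does not state how the confidence intervals were computed. As an illustration only, the exact (Clopper-Pearson) binomial interval for 5 or 7 significant neurons out of 10 gives bounds that agree with the reported values after rounding. The function names below are ours, and the bounds are found by simple bisection to avoid any external dependency.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(f, target: float) -> float:
    """Find p in [0, 1] with f(p) = target, for f increasing in p (bisection)."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, alpha: float = 0.05):
    """Exact two-sided (1 - alpha) confidence interval for a proportion k/n."""
    # Lower bound: P(X >= k | p) = alpha / 2
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2
    )
    # Upper bound: P(X <= k | p) = alpha / 2, i.e. P(X > k | p) = 1 - alpha / 2
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2
    )
    return lower, upper


ci_5_of_10 = clopper_pearson(5, 10)  # roughly (0.19, 0.81)
ci_7_of_10 = clopper_pearson(7, 10)  # roughly (0.35, 0.93)

# Expected population percentages quoted above, in the same order.
expected = [0.16, 0.13, 0.16, 0.10]
intervals = [ci_5_of_10, ci_7_of_10, ci_7_of_10, ci_5_of_10]
all_below_ci = all(e < low for e, (low, _) in zip(expected, intervals))
```

      With these intervals, each expected percentage falls below the lower bound of the corresponding interval, consistent with the statement that all expected values lie outside the CI.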

      Minor comments:

      (1) Figure 4A, Fourth Row Example (Zero Angle vs. Straight Angle, Bottom of the P2 Examples): The order of the example stimuli might be incorrect- the 0° head with 180° body stimulus (leftmost) might be swapped with the 180° head with 0° body stimulus (5th from the left). While this ordering may be acceptable, please double-check whether it reflects the authors' intended arrangement.

      We have changed the order of the two stimuli in Figure 4A, following the suggestion of the reviewer.

      (2) Page 12, Lines 192-194: The text states, "Interestingly, some neurons (e.g. Figure 3D) were tuned to a particular combination of a head and body irrespective of centering." However, Figure 3D displays data for a total of 10 neurons. Could you please specify which of these neurons are being referred to in this context?

      The wording was not optimal. We meant to say that some neurons were tuned to a particular combination of head and body orientation, like the third aSTS example neuron of Figure 3D. We have rephrased the sentence and clarified which example neuron we referred to.

      (3) Page 28, Lines 470-471: The text states, "We observed no difference in response strength between anatomically possible and impossible configurations." Please clarify which data were compared for response strength, as I could not locate the corresponding analyses.

      The anatomically possible and impossible configurations differ in the head-body orientation angle. However, as we reported before in the Results, there was no effect of head-body orientation angle on mean response strength across poses (Friedman ANOVA; all p-values for both poses and centerings > 0.1). We have clarified this now in the Discussion (page 28).

      (4) Pages 40-43, Decoding Analyses: In experiments 2 and 3, were the decoding analyses performed on simultaneously recorded neurons? If so, such analyses might leverage trial-by-trial correlations and thus avoid confounds from trial-to-trial variability. In contrast, experiment 1, which used single-shank electrodes, would lack this temporal information. Please clarify how trial numbers were assigned to neurons in each experiment and how this assignment may have influenced the decoding performance.

      For the decoding analyses of experiments 2 and 3, we combined data from different daily penetrations, with only units from the same penetration being recorded simultaneously. In the decoding analyses of each experiment, the trials were assigned randomly to the pseudo-population vectors, shuffling on each resampling the trial order per neuron. This shuffling abolishes noise correlations in the analysis of each experiment.
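
      The construction of pseudo-population vectors with per-neuron trial shuffling can be sketched as below. This is a simplified illustration under our own assumptions about the data layout (`responses[neuron][condition]` is a list of single-trial responses); the function name and toy values are hypothetical.

```python
import random


def build_pseudo_population(responses, n_trials, rng):
    """Build pseudo-population vectors from separately recorded neurons.

    responses[n][c] holds the single-trial responses of neuron n to
    condition c. For every condition, each neuron's trials are drawn in
    an independent random order, so trial t of one neuron is paired with
    an arbitrary trial of every other neuron. This removes any
    trial-by-trial (noise) correlation between neurons.
    """
    n_neurons = len(responses)
    n_conditions = len(responses[0])
    vectors = []  # list of (condition label, population response vector)
    for c in range(n_conditions):
        shuffled = [rng.sample(responses[n][c], n_trials) for n in range(n_neurons)]
        for t in range(n_trials):
            vectors.append((c, [shuffled[n][t] for n in range(n_neurons)]))
    return vectors


# Toy data: 3 neurons, 2 conditions, 4 trials per condition.
toy_responses = [
    [[1, 2, 3, 4], [5, 6, 7, 8]],
    [[10, 20, 30, 40], [50, 60, 70, 80]],
    [[100, 200, 300, 400], [500, 600, 700, 800]],
]
vectors = build_pseudo_population(toy_responses, 4, random.Random(0))
```

      Calling the function again on each resampling with a fresh random state gives a new trial assignment, which corresponds to shuffling the trial order per neuron on each resampling.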

      (5) Page 41, Lines 792-802: The authors state that "To assess the significance of the differences in classification scores between pairs of angles ... we computed the difference in classification score between the two pairs for each resampling and the percentile of 0 difference corresponded to the p-value." In a two-sided test under the null hypothesis of no difference between the distributions, the conventional approach would be to compute the p-value as the proportion of resampled differences that are as extreme or more extreme than the observed difference. Since a zero difference might be relatively rare, relying solely on its percentile could potentially misrepresent the tail probabilities relevant to a two-sided test. Could you clarify how their method addresses this issue?

      This test is based on the computation of the distribution of the difference between classification accuracies across resamplings. This is similar to the computation of the confidence interval of a difference. Thus, we assess whether the theoretical zero value (= no difference; = null hypothesis) is outside the 2.5 and 97.5 percentile interval of the computed distribution of the empirically observed differences. We have now clarified in the Methods (page 41) that for a two-tailed test the computed p-value (the percentile of the zero value) should be smaller than 0.025.
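
      A minimal sketch of this percentile-of-zero test is given below. It assumes that the input is the list of accuracy differences, one per resampling; the function names and the toy values are ours and not taken from the manuscript.

```python
def percentile_of_zero(differences):
    """Smaller tail fraction of the resampled differences at or beyond zero."""
    n = len(differences)
    at_or_below = sum(d <= 0 for d in differences) / n
    at_or_above = sum(d >= 0 for d in differences) / n
    return min(at_or_below, at_or_above)


def is_significant_two_tailed(differences, alpha=0.05):
    """Zero lies outside the central (1 - alpha) interval of the differences."""
    return percentile_of_zero(differences) < alpha / 2


# Toy distribution of 1000 differences, 11 of which are at or below zero.
shifted = [0.01 * i for i in range(-10, 990)]
p_shifted = percentile_of_zero(shifted)

# Toy distribution centred on zero: half of the values on each side.
centred = [-1.0, 1.0] * 500
p_centred = percentile_of_zero(centred)
```

      For the first toy distribution the percentile of zero is 0.011, which is below 0.025 and therefore significant in the two-tailed sense described above; for the second it is 0.5.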

      (6) Page 43, Lines 829-834: The manuscript explains: "The mean of 10 classification accuracies (i.e., of 10 resamplings) was employed to obtain a distribution (n=100) of the differences in classification accuracy ... The reported standard deviations of the classification accuracies are computed using also the means of 10 resamplings." I am unfamiliar with this type of analysis and am unclear about the rationale for calculating distributions and standard deviations based on the means of 10 resamplings rather than using the original distribution of classification accuracies. This resampling procedure appears to yield a narrower distribution and smaller standard deviations than the original data. Could you please justify this approach?

      The logic of the analysis is to reduce the noise in the data, by averaging across 10 randomly selected resamplings, but still keeping a sufficient number of data (100 values) for a test.
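
      The averaging step can be illustrated as follows. The sketch assumes 1000 resampled accuracies that are randomly grouped into 100 blocks of 10; the total of 1000 is inferred from the numbers quoted above, and the simulated accuracies are purely hypothetical.

```python
import random
from statistics import mean, stdev


def block_means(values, block_size, rng):
    """Randomly group values into blocks of block_size and average each block."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    n_blocks = len(shuffled) // block_size
    return [
        mean(shuffled[i * block_size:(i + 1) * block_size])
        for i in range(n_blocks)
    ]


rng = random.Random(1)
# Hypothetical classification accuracies from 1000 resamplings.
accuracies = [rng.gauss(0.8, 0.05) for _ in range(1000)]
averaged = block_means(accuracies, 10, rng)

sd_raw = stdev(accuracies)
sd_averaged = stdev(averaged)
```

      The overall mean is unchanged by this grouping, while the spread of the 100 block means is smaller than that of the 1000 raw values. This is the narrowing the reviewer points out, and any standard deviation reported this way describes means of 10 resamplings rather than single resamplings.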

      Reviewer #3 (Recommendations for the authors):

      (1) Some sentences are too long and difficult to parse. For example, in line 177: "the correlations between the responses to the 64 head-body orientation conditions of the two centerings for the neuron and pose combinations showing significant head-body interactions for the two centerings were similar to those observed for the whole population."

      We have modified this sentence: For neuron and pose combinations with significant head-body interactions in both centerings, the correlations between responses to the 64 head-body orientation conditions were similar to those observed in the whole population.

      (2) The authors argue in line 485: "in our study, a search bias cannot explain the body-inversion effect since we selected responsive units using both upright and inverted images." However, the body-selective patches were localized using upright images, correct?

      The monkey-selective patches were localized using upright images indeed. However, we recorded in experiment 3 (and 2) also outside the localized patches (as we noted before in the Methods:  “In experiments 2 and 3 we recorded from a wider region, which overlapped with the two monkey patches and the recording locations of experiment 1”). Furthermore, the preference for upright monkey images is not an all-or-nothing phenomenon: most units still responded to inverted monkeys. Also, we believe it is likely that the mean responses to the inverted bodies in the monkey patches, defined by upright bodies versus objects, would be larger than those to objects and we would be surprised to learn that there is a patch selective for inverted bodies that we would have missed with our localizer.

      (3) Typo: line 447, "this independent"->"is independent"?

      Corrected.

    1. Author Response

      The following is the authors’ response to the original reviews.

      We thank the reviewers for truly valuable advice and comments. We have made multiple corrections and revisions to the original pre-print accordingly per the following comments:

      1. Pro1153Leu is extremely common in the general population (allele frequency in gnomAD is 0.5). Further discussion is warranted to justify the possibility that this variant contributes to a phenotype documented in 1.5-3% of the population. Is it possible that this variant is tagging other rare SNPs in the COL11A1 locus, and could any of the existing exome sequencing data be mined for rare nonsynonymous variants?

      One possible avenue for future work is to return to any existing exome sequencing data to query for rare variants at the COL11A1 locus. This should be possible for the USA MO case-control cohort. Any rare nonsynonymous variants identified should then be subjected to mutational burden testing, ideally after functional testing to diminish any noise introduced by rare benign variants in both cases and controls. If there is a significant association of rare variation in AIS cases, then they should consider returning to the other cohorts for targeted COL11A1 gene sequencing or whole exome sequencing (whichever approach is easier/less expensive) to demonstrate replication of the association.

      Response: Regarding the genetic association of the common COL11A1 variant rs3753841 (p.(Pro1335Leu)), we do not propose that it is the sole risk variant contributing to the association signal we detected and have clarified this in the manuscript. We concluded that it was worthy of functional testing for reasons described here. Although there were several common variants in the discovery GWAS within and around COL11A1, none were significantly associated with AIS and none were in linkage disequilibrium (R²>0.6) with the top SNP rs3753841. We next reviewed rare (MAF<=0.01) coding variants within the COL11A1 LD region of the associated SNP (rs3753841) in 625 available exomes representing 46% of the 1,358 cases from the discovery cohort. The LD block was defined using Haploview based on the 1KG_CEU population. Within the ~41 kb LD region (chr1:103365089-103406616, GRCh37) we found three rare missense mutations in 6 unrelated individuals (see Table below). Two of them (NM_080629.2:c.G4093A:p.A1365T; NM_080629.2:c.G3394A:p.G1132S), from two individuals, are predicted to be deleterious based on CADD and GERP scores and are plausible AIS risk candidates. At this rate we could expect to find only 4-5 individuals with linked rare coding variants in the total cohort of 1,358, which collectively are unlikely to explain the overall association signal we detected. Of course, there also could be deep intronic variants contributing to the association that we would not detect by our methods. However, given this scenario, the relatively high predicted deleteriousness of rs3753841 (CADD=25.7; GERP=5.75), and its occurrence in a Gly-X-Y triplet repeat, we hypothesized that this variant itself could be a risk allele worthy of further investigation.

      Author response table 1.

      We also appreciate the reviewer’s suggestion to perform a rare variant burden analysis of COL11A1. We did conduct a pilot gene-based analysis in 4534 European ancestry exomes, including 797 of our own AIS cases and 3737 controls, and tested the burden of rare variants in COL11A1. The SKATO P value was not significant (COL11A1_P=0.18), but this could be due to lack of power and/or background from rare benign variants that could be screened out using the functional testing we have developed.

      2. COL11A1 p.Pro1335Leu is pursued as a direct candidate susceptibility locus, but the functional validation involves both: (a) a complementation assay in mouse GPCs, Figure 5; and (b) cultured rib cartilage cells from Col11a1-Ad5 Cre mice (Figure 4). Please address the following:

      2A. Is Pro1335Leu a loss of function, gain of function, or dominant negative variant? Further rationale for modeling this change in a Col11a1 loss of function cell line would be helpful.

      Response: Regarding functional testing, by knockdown/knockout cell culture experiments, we showed for the first time that Col11a1 negatively regulates Mmp3 expression in cartilage chondrocytes, an AIS-relevant tissue. We then tested the effect of overexpressing the human wt or variant COL11A1 by lentiviral transduction in SV40-transformed chondrocyte cultures. We deleted endogenous mouse Col11a1 by Cre recombination to remove the background of its strong suppressive effects on Mmp3 expression. We acknowledge that Col11a1 missense mutations could confer gain of function or dominant negative effects that would not be revealed in this assay. However, as indicated in our original manuscript, we have noted that spinal deformity is described in the cho/cho mouse, a Col11a1 loss of function mutant. We also note the recent publication by Rebello et al. showing that missense mutations in Col11a2 associated with congenital scoliosis fail to rescue a vertebral malformation phenotype in a zebrafish col11a2 KO line. Although the connection between AIS and vertebral malformations is not altogether clear, we surmise that loss of the components of collagen type XI disrupts spinal development. In vivo experiments in vertebrate model systems are needed to fully establish the consequences and genetic mechanisms by which COL11A1 variants contribute to an AIS phenotype.

      2B. Expression appears to be augmented compared to WT in Fig 5B, but there is no direct comparison of WT with variant.

      Response: Expression of the mutant (from the lentiviral expression vector) is increased compared to wildtype. We observed this effect in repeated experiments. Sequencing confirmed that the mutant and wildtype constructs differed only at the position of the rs3753841 SNP. At this time, we cannot explain the difference in expression levels. Nonetheless, even when the variant COL11A1 is relatively overexpressed, it fails to suppress MMP3 expression as observed for the wildtype form.

      2C. How do the authors know that their complementation data in Figure 5 are specific? Repetition of this experiment with an alternative common nonsynonymous variant in COL11A1 (such as rs1676486) would be helpful as a comparison with the expectation that it would be similar to WT.

      Response: We agree that testing an allelic series throughout COL11A1 could be informative, but we have shifted our resources toward in vivo experiments that we believe will ultimately be more informative for deciphering the mechanistic role of COL11A1 in MMP3 regulation and spine deformity.

      2D. The y-axes of histograms in panel A need attention and clarification. What is meant by power? Do you mean fold change?

      Response: Power is directly comparable to fold change but allows comparison of absolute expression levels between different genes.

      2E. Figure 5: how many technical and biological replicates? Confirm that these are stated throughout the figures.

      Response: Thank you for pointing out this oversight. This information has been added throughout.

      3. Figure 2: What does the gross anatomy of the IVD look like? Could the authors address this by showing an H&E of an adjacent section of the Fig. 2 A panels?

      Response: Panel 2 shows H&E staining. Perhaps the reviewer is referring to the WT and Pax1 KO images in Figure 3? We have now added H&E staining of WT and Pax1 KO IVD as supplemental Figure 3E to clarify the IVD anatomy.

      4. Page 9: "Cells within the IVD were negative for Pax1 staining ..." There seems to be specific PAX1 expression in many cells within the IVD, which is concerning if this is indeed a supposed null allele of Pax1. This data seems to support that the allele is not null.

      Response: We have now added updated images for the COL11A1 and PAX1 staining to include negative controls in which we omitted primary antibodies. As can be seen, there is faint autofluorescence in the PAX1 negative control that appears to explain the “specific staining” referred to by the reviewer. These images confirm that the allele is truly a null.

      5. There is currently a lack of evidence supporting the claim that "Col11a1 is positively regulated by Pax1 in mouse spine and tail". Therefore, it is necessary to conduct further research to determine the direct regulatory role of Pax1 on Col11a1.

      Response: We agree with the reviewer and have clarified that Pax1 may have either a direct or indirect role in Col11a1 regulation.

      6. There is no data linking loss of COL11A1 function and spine defects in the mouse model. Furthermore, due to the absence of P1335L point mutant mice, it cannot be confirmed whether P1335L can actually cause AIS, and the pathogenicity of this mutation cannot be directly verified. These limitations need to be clearly stated and discussed. A Col11a1 mouse mutant called chondrodysplasia (cho) was shown to be perinatal lethal with severe endochondral defects (https://pubmed.ncbi.nlm.nih.gov/4100752/). This information may help contextualize this study.

      Response: We partially agree with the reviewer. Spine defects are reported in the cho mouse (for example, please see reference 36 Hafez et al). We appreciate the suggestion to cite the original Seegmiller et al 1971 reference and have added it to the manuscript.

      7. A recent article (PMID: 37462524) reported mutations in COL11A2 associated with AIS and functionally tested in zebrafish. That study should be cited and discussed as it is directly relevant for this manuscript.

      Response: We agree with the reviewer that this study provides important information supporting loss of function in type XI collagen in spinal deformity. Language to this effect has been added to the manuscript and this study is now cited in the paper.

      8. Please reconcile the following result on page 10 of the results: "Interestingly, the AIS-associated gene Adgrg6 was amongst the most significantly dysregulated genes in the RNA-seq analysis (Figure 3c). By qRT-PCR analysis, expression of Col11a1, Adgrg6, and Sox6 were significantly reduced in female and male Pax1-/- mice compared to wild-type mice (Figure 3d-g)." In Figure 3f, the downregulation of Adgrg6 appears to be modest so how can it possibly be highlighted as one of the most significantly downregulated transcripts in the RNAseq data?

      Response: By “significant” we were referring to the P-value significance in RNAseq analysis, not in absolute change in expression. This language was clearly confusing, and we have removed it from the manuscript.

      9. It is incorrect to refer to the primary cell culture work as growth plate chondrocytes (GPCs), instead, these are primary costal chondrocyte cultures. These primary cultures have a mixture of chondrocytes at differing levels of differentiation, which may change differentiation status during the culturing on plastic. In sum, these cells are at best chondrocytes, and not specifically growth plate chondrocytes. This needs to be corrected in the abstract and throughout the manuscript. Moreover, on page 11 these cells are referred to as costal cartilage, which is confusing to the reader.

      Response: Thank you for pointing out these inconsistencies. We have changed the manuscript to say “costal chondrocytes” throughout.

      Minor points

      • On page 10 of the Results: "These data support a mechanistic link between Pax1 and Col11a1, and the AIS-associated genes Gpr126 and Sox6, in affected tissue of the developing tail." qRT-PCR validation of Sox6, although significant, appears to be very modestly downregulated in KO. Please soften this statement in the text.

      Response: We have softened this statement.

      • Have you got any information about how the immortalized (SV40) costal cartilage affected chondrogenic differentiation? The expression of SV40 seemed to stimulate Mmp13 expression. Do these cells still make cartilage nodules? Some feedback on this process and how it affects the nature of the culture would be appreciated.

      Response: The “+ or –” in Figure 5 refers to Ad5-cre. Each experiment was performed in SV40-immortalized costal chondrocytes. We have removed SV40 from the figure and have clarified the legend to say “qRT-PCR of human COL11A1 and endogenous mouse Mmp3 in SV40 immortalized mouse costal chondrocytes transduced with the lentiviral vector only (lanes 1,2), human WT COL11A1 (lane 3), or COL11A1 P1335L.” Otherwise, we absolutely agree that understanding Mmp13 regulation during chondrocyte differentiation is important. We plan to study this using in vivo systems.

      • Figure 1: is the average Odds ratio, can this be stated in the figure legend?

      Response: We are not sure what is being asked here. The “combined odds ratio” is calculated as a weighted average of the log of the odds.
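
      For readers unfamiliar with this calculation, a weighted average of log odds ratios can be sketched as below. The response does not specify the weights; the sketch assumes inverse-variance weights, as in a standard fixed-effect meta-analysis, and the cohort values used are hypothetical rather than taken from the study.

```python
from math import exp, log, sqrt


def combined_odds_ratio(odds_ratios, standard_errors):
    """Combine per-cohort odds ratios on the log scale.

    standard_errors are the standard errors of the log odds ratios.
    Weights are assumed to be inverse variances (1 / SE^2).
    Returns the combined odds ratio and the SE of its logarithm.
    """
    weights = [1.0 / se**2 for se in standard_errors]
    total = sum(weights)
    combined_log_or = sum(
        w * log(odds) for w, odds in zip(weights, odds_ratios)
    ) / total
    combined_se = sqrt(1.0 / total)
    return exp(combined_log_or), combined_se


# Hypothetical example: two cohorts with equal precision. With equal
# weights the result is the geometric mean of the odds ratios.
combined, se = combined_odds_ratio([1.0, 4.0], [0.1, 0.1])
```

      Averaging on the log scale means the combined value is not the arithmetic mean of the cohort odds ratios, which may be the source of the reviewer's question about an "average" odds ratio.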

      • A more consistent use of established nomenclature for mouse versus human genes and proteins is needed.

      Human:GENE/PROTEIN

      Mouse: Gene/PROTEIN

      Response: Thank you for pointing this out. The nomenclature has been corrected throughout the manuscript.

      • There is no Figure 5c, but a reference to results in the main text. Please reconcile. There is no Figure 5-figure supplement 5a, but there is a reference to it in the main text. Please reconcile.

      Response: Figure references have been corrected.

      • Please indicate dilutions of all antibodies used when listed in the methods.

      Response: Antibody dilutions have been added where missing.

      • On page 25, there is a partial sentence missing information in the Histologic methods; "#S36964 Invitrogen, CA, USA)). All images were taken..."

      Response: We apologize for the error. It has been removed.

      • Table 1: please define all acronyms, including cohort names.

      Response: We apologize for the oversight. The legend to the Table has been updated with definitions of all acronyms.

      • Figure 2: Indicate that blue staining is DAPI in panel B. Clarify that "-ab" as an abbreviation is primary antibody negative.

      Response: A color code for DAPI and COL11A1 staining has been added and “-ab” is now defined.

      • Page 4: ADGRG6 (also known as GPR126)...the authors set this up for ADGRG6 but then use GPR126 in the manuscript, which is confusing. For clarity, please use the gene name Adgrg6 consistently, rather than alternating with Gpr126.

      Response: Thank you for pointing this out. GPR126 has now been changed to ADGRG6 throughout the manuscript.

      • REF 4: Richards, B.S., Sucato, D.J., Johnston C.E. Scoliosis, (Elsevier, 2020). Is this a book, can you provide more clarity in the Reference listing?

      Response: Thank you for pointing this out. This reference has been corrected.

      • While isolation was addressed, the methods for culturing Rat cartilage endplate and costal chondrocytes are poorly described and should be given more text.

      Response: Details about the cartilage endplate and costal chondrocyte isolation and culture have been added to the Methods.

      • Page 11: 1st paragraph, last sentence "These results suggest that Mmp3 expression"... this sentence needs attention. As written, I am not clear what the authors are trying to say.

      Response: This sentence has been clarified and now reads “These results suggest that Mmp3 expression is negatively regulated by Col11a1 in mouse costal chondrocytes.”

      • Page 13: line 4 from the bottom, "ECM-clearing"? This is confusing do you mean ECM degrading?

      Response: Yes and thank you. We have changed to “ECM-degrading”.

      • Please use version numbers for RefSeq IDs: e.g. NM_080629.3 instead of NM_080629

      Response: This change has been made in the revised manuscript.

      • It would be helpful for readers if the ethnicity of the discovery case cohort was clearly stated as European ancestry in the Results main text.

      Response: “European ancestry” has been added at first description of the discovery cohort in the manuscript.

      • Avoid using the term "mutation" and use "variant" instead.

      Response: Thank you for pointing this out. “Variant” is now used throughout the manuscript.

      • Define error bars for all bar charts throughout and include individual data points overlaid onto bars.

      Response: Thank you. Error bars are now clarified in the Figure legends.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review): 

      Summary: 

      Campbell et al investigated the effects of light on the human brain, in particular the subcortical part of the hypothalamus during auditory cognitive tasks. The mechanisms and neuronal circuits underlying light effects in non-image forming responses are so far mostly studied in rodents but are not easily translated in humans. Therefore, this is a fundamental study aiming to establish the impact light illuminance has on the subcortical structures using the high-resolution 7T fMRI. The authors found that parts of the hypothalamus are differently responding to illuminance. In particular, they found that the activity of the posterior hypothalamus increases while the activity of the anterior and ventral parts of the hypothalamus decreases under high illuminance. The authors also report that the performance of the 2-back executive task was significantly better in higher illuminance conditions. However, it seems that the activity of the posterior hypothalamus subpart is negatively related to the performance of the executive task, implying that it is unlikely that this part of the hypothalamus is directly involved in the positive impact of light on performance observed. Interestingly, the activity of the posterior hypothalamus was, however, associated with an increased behavioural response to emotional stimuli. This suggests that the role of this posterior part of the hypothalamus is not as simple regarding light effects on cognitive and emotional responses. This study is a fundamental step towards our better understanding of the mechanisms underlying light effects on cognition and consequently optimising lighting standards. 

      Strengths: 

      While it is still impossible to distinguish individual hypothalamic nuclei, even with the high-resolution fMRI, the authors split the hypothalamus into five areas encompassing five groups of hypothalamic nuclei. This allowed them to reveal that different parts of the hypothalamus respond differently to an increase in illuminance. They found that higher illuminance increased the activity of the posterior part of the hypothalamus encompassing the MB and parts of the LH and TMN, while decreasing the activity of the anterior parts encompassing the SCN and another part of TMN. These findings are somewhat in line with studies in animals. It was shown that parts of the hypothalamus such as SCN, LH, and PVN receive direct retinal input in particular from ipRGCs. Also, acute chemogenetic activation of ipRGCs was shown to induce activation of LH and also increased arousal in mice. 

      Weaknesses: 

      While the light characteristics are well documented and EDI calculated for all of the photoreceptors, it is not very clear why these irradiances and spectra were chosen. It would be helpful if the authors explained the logic behind the four chosen light conditions tested. Also, the lights chosen have cone-opic EDI values in a high correlation with the melanopic EDI, therefore we can't distinguish if the effects seen here are driven by melanopsin and/or other photoreceptors. In order to provide a more mechanistic insight into the light-driven effects on cognition ideally one would use a silent substitution approach to distinguish between different photoreceptors. This may be something to consider when designing the follow-up studies. 

      Reviewer #1 (Recommendations For The Authors): 

      (1) As suggested in the public review more information regarding the reasons behind the chosen light condition is needed. 

      While the light characteristics are well documented and EDI calculated for all of the photoreceptors, it is not very clear why these irradiances and spectra were chosen. It would be helpful if the authors explained the logic behind the four chosen light conditions tested. Also, the lights chosen have cone-opic EDI values in a high correlation with the melanopic EDI, therefore we can't distinguish if the effects seen here are driven by melanopsin or cone opsins. In order to provide a more mechanistic insight into the light-driven effects on cognition ideally one would use a silent substitution approach to distinguish between different photoreceptors. 

      (2) In support of this work, it was shown in mice that acute activation of ipRGCs using chemogenetics induces c-fos in some of the hypothalamic brain areas discussed here including LH (Milosavljevic et al, 2016 Curr Biol). Another study to consider including in the discussion is by Sonoda et al 2020 Science, in which the authors showed that a subset of ipRGCs release GABA. 

      (3) Figure 1 looks squashed, especially the axes. Also, Figure 2 looks somewhat blurry. I would suggest that the authors edit the figures to correct this.

      We thank the reviewer for their positive comments and agree with the weaknesses they pointed out. 

      (1) The explanation regarding the choice of the illuminance is now included in the revised manuscript (PAGE 17): “Blue-enriched light illuminances were set according to the technical characteristics of the light source and to keep the overall photon flux similar to prior 3T MRI studies of our team (between ~10¹² and 10¹⁴ ph/cm²/s) (Vandewalle et al., 2010, 2011). The orange light was introduced as a control visual stimulation for potential secondary whole-brain analyses. For the present region of interest analyses, we discarded colour differences between the light conditions and only considered illuminance as indexed by mel EDI lux. This constitutes a limitation of our study as it does not allow attributing the findings to a particular photoreceptor class.”

      The revised discussion makes clear that these choices limit the interpretation about the photoreceptors involved (PAGES 12-13): “We based our rationale and part of our interpretations on ipRGC projections, which have been demonstrated in rodents to channel the NIF biological impact of light and incorporate the inputs from rods and cones with their intrinsic photosensitivity into a light signal that can impact the brain (Güler et al., 2008; Tri & Do, 2019). Given the polychromatic nature of the light we used, classical photoreceptors and their projections to visual brain areas are, however, very likely to have directly or indirectly contributed to the modulation by light of the regional activity of the hypothalamus.”

      The discussion also points out the promises of silent substitution (PAGE 13): “Future human studies could isolate the contribution of each photoreceptor class to the impact of light on cognitive brain functions by manipulating prior light history (Chellappa et al., 2014) or through the use of silent substitutions between metameric light exposures (Viénot et al., 2012)”.

      (2) We now refer to the studies by Milosavljevic et al. and Sonoda et al. 

      PAGE 9: “Our data may therefore be compatible with an increase in orexin release by the LH with increasing illuminance. In line with this assumption, chemoactivation of ipRGCs led to increased c-fos production, a marker of cellular activation, over several nuclei of the hypothalamus, including the lateral hypothalamus (Milosavljevic et al., 2016). If this initial effect of light we observe over the posterior part of the hypothalamus was maintained over a longer period of exposure, this would stimulate cognition and maintain or increase alertness (Campbell et al., 2023) and may also be part of the mechanisms through which daytime light increases the amplitude in circadian variations of several physiological features (Bano-Otalora et al., 2021; Dijk et al., 2012).”

      PAGE 10: “Chemoactivation of ipRGCs in rodents led to an increased activity of the SCN, over the inferior anterior hypothalamus, but had no impact on the activity of the VLPO, over the superior anterior hypothalamus (Milosavljevic et al., 2016). How our findings fit with these fine-grained observations and whether there are species-specific differences in the responses to light over the different parts of the hypothalamus remains to be established.”

      PAGE 10: “In terms of chemical communication, these changes in activity could be the results of an inhibitory signal from a subclass of ipRGCs, potentially through the release of gamma-aminobutyric acid (GABA), as a rodent study found that a subset of ipRGCs release GABA at brain targets including the SCN (and intergeniculate leaflet and ventral lateral geniculate nucleus), leading to a reduction in the ability of light to affect pupil size and circadian photoentrainment (Sonoda et al., 2020). Whatever the signalling of ipRGC, our finding over the anterior hypothalamus could correspond to a modification of GABA signalling of the SCN which has been reported to have excitatory properties, such that the BOLD signal changes we report may correspond to a reduction in excitation arising in part from the SCN (Albers et al., 2017).”

      (3) Figures 1 and 2 were modified. We hope their quality is now satisfactory. We are willing to provide separate figures prior to publication of the Version of Record.

      Reviewer #2 (Public Review): 

      Summary 

      The interplay between environmental factors and cognitive performance has been a focal point of neuroscientific research, with illuminance emerging as a significant variable of interest. The hypothalamus, a brain region integral to regulating circadian rhythms, sleep, and alertness, has been posited to mediate the effects of light exposure on cognitive functions. Previous studies have illuminated the role of the hypothalamus in orchestrating bodily responses to light, implicating specific neural pathways such as the orexin and histamine systems, which are crucial for maintaining wakefulness and processing environmental cues. Despite advancements in our understanding, the specific mechanisms through which varying levels of light exposure influence hypothalamic activity and, in turn, cognitive performance, remain inadequately explored. This gap in knowledge underscores the need for high-resolution investigations that can dissect the nuanced impacts of illuminance on different hypothalamic regions. Utilizing state-of-the-art 7 Tesla functional magnetic resonance imaging (fMRI), the present study aims to elucidate the differential effects of light on the hypothalamic dynamics and establish a link between regional hypothalamic activity and cognitive outcomes in healthy young adults. By shedding light on these complex interactions, this research endeavours to contribute to the foundational knowledge necessary for developing innovative therapeutic strategies aimed at enhancing cognitive function through environmental modulation. 

      Strengths: 

      (1) Considerable Sample Size and Detailed Analysis: The study leverages a robust sample size and conducts a thorough analysis of hypothalamic dynamics, which enhances the reliability and depth of the findings. 

      (2) Use of High-Resolution Imaging: Utilizing 7 Tesla fMRI to analyze brain activity during cognitive tasks offers high-resolution insights into the differential effects of illuminance on hypothalamic activity, showcasing the methodological rigor of the study. 

      (3) Novel Insights into Illuminance Effects: The manuscript reveals new understandings of how different regions of the hypothalamus respond to varying illuminance levels, contributing valuable knowledge to the field. 

      (4) Exploration of Potential Therapeutic Applications: Discussing the potential therapeutic applications of light modulation based on the findings suggests practical implications and future research directions. 

      Weaknesses: 

      (1) Foundation for Claims about Orexin and Histamine Systems: The manuscript needs to provide a clearer theoretical or empirical foundation for claims regarding the impact of light on the orexin and histamine systems in the abstract. 

      (2) Inclusion of Cortical Correlates: While focused on the hypothalamus, the manuscript may benefit from discussing the role of cortical activation in cognitive performance, suggesting an opportunity to expand the scope of the manuscript. 

      (3) Details of Light Exposure Control: More detailed information about how light exposure was controlled and standardized is needed to ensure the replicability and validity of the experimental conditions. 

      (4) Rationale Behind Different Exposure Protocols: To clarify methodological choices, the manuscript should include more in-depth reasoning behind using different protocols of light exposure for executive and emotional tasks. 

      Reviewer #2 (Recommendations For The Authors): 

      Attention to English language precision and correction of typographical errors, such as "hypothalamic nuclei" instead of "hypothalamus nuclei," is necessary for enhancing the manuscript.

      We thank the reviewer for recognising the interest and strength of our study.

      (1) As detailed in the discussion, we do believe orexin and histamine are excellent candidates for mediating the results we report. As also pointed out, however, we are in no position to know which neurons, nuclei, neurotransmitters and neuromodulators underlie the results. The last sentence of the abstract (PAGE 2) was therefore removed as we agree the statement was too strong. We carefully reconsidered the discussion and believe that no such overstatement was present.

      (2) Hypothalamic nuclei are connected to multiple cortical (and subcortical) structures. The relevance of these projections will vary with the cognitive task considered. In addition, we have not yet considered the cortex in our analyses such that truly integrating cortical structures appears premature. 

      We nevertheless added the following short statement (PAGE 11): “Subcortical structures, and particularly those receiving direct retinal projections, including those of the hypothalamus, are likely to receive light illuminance signal first before passing on the light modulation to the cortical regions involved in the ongoing cognitive process (Campbell et al., 2023).”

      (3) We now include the following as part of the method section (PAGES 16-17): “Illuminance and spectra could not be directly measured within the MRI scanner due to the ferromagnetic nature of measurement systems. The coil of the MRI and the light stand, together with the lighting system were therefore placed outside of the MR room to reproduce the experimental conditions in a completely dark room. A sensor was placed 2 cm away from the mirror of the coil that is mounted at eye level, i.e. where the eye of the first author of the paper would be positioned, to measure illuminance and spectra. The procedure was repeated 4 times for illuminance and twice for spectra and measurements were averaged. This procedure does not take into account interindividual variation in head size and orbit shape such that the reported illuminance levels may have varied slightly across subjects. The relative differences between illuminance are, however, very unlikely to vary substantially across participants such that statistics consisting of tests for the impact of relative differences in illuminance were not affected. The detailed values reported in Supplementary Table 2 were computed combining spectra and illuminance using the Excel calculator associated with a published work (Lucas et al., 2014).”

      (4) The explanation regarding the choice of the illuminance is now included in the revised manuscript (PAGE 17): “Blue-enriched light illuminances were set according to the technical characteristics of the light source and to keep the overall photon flux similar to prior 3T MRI studies of our team (between ~10¹² and 10¹⁴ ph/cm²/s) (Vandewalle et al., 2010, 2011). The orange light was introduced as a control visual stimulation for potential secondary whole-brain analyses. For the present region of interest analyses, we discarded colour differences between the light conditions and only considered illuminance as indexed by mel EDI lux. This constitutes a limitation of our study as it does not allow attributing the findings to a particular photoreceptor class.”

      (5) The manuscript was thoroughly rechecked, and we hope to have spotted all typos and language errors.

      Reviewer #3 (Public Review): 

      Summary: 

      Campbell and colleagues use a combination of high-resolution fMRI, cognitive tasks, and different intensities of light illumination to test the hypothesis that the intensity of illumination differentially impacts hypothalamic substructures that, in turn, promote alterations in arousal that affect cognitive and affective performance. The authors find evidence in support of a posterior-to-anterior gradient of increased blood flow in the hypothalamus during task performance that they later relate to performance on two different tasks. The results provide an enticing link between light levels, hypothalamic activity, and cognitive/affective function; however, clarification of some methodological choices will help to improve confidence in the findings. 

      Strengths: 

      * The authors' focus on the hypothalamus and its relationship to light intensity is an important and understudied question in neuroscience. 

      Weaknesses: 

      (1) I found it challenging to relate the authors' hypotheses, which I found to be quite compelling, to the apparatus used to test the hypotheses - namely, the use of orange light vs. different light intensities; and the specific choice of the executive and emotional tasks, which differed in key features (e.g., block-related vs. event-related designs) that were orthogonal to the psychological constructs being challenged in each task. 

      (4) Given the small size of the hypothalamus and the irregular size of the hypothalamic parcels, I wondered whether a more data-driven examination of the hypothalamic time series would have provided a more parsimonious test of their hypothesis. 

      Reviewer #3 (Recommendations For The Authors): 

      (1) The authors may wish to explain the importance of the orange light condition in the early section of the results -- i.e., when they first present the task structure. As it stands, I don't have a good appreciation of why the orange light was included -- was it a control condition? And if the differences between the light conditions (e.g., the narrow- vs. wide-band of light) were indeed ignored by focussing on the illuminance levels, are there any potential issues that the authors could then mitigate against with further experiments/analyses? 

      (2) Are there other explanations for why illuminance levels might improve cognitive performance? For instance, the capacity to more easily perceive the stimuli in an experiment could plausibly make it easier to complete a given task. If this is the case, can the authors conceptualise a way to rule out this hypothesis? 

      (3) Did the authors control for the differences in the number of voxels in each hypothalamic subregion? Or perhaps consider estimating the variance across voxels within the larger parcels, to determine whether the mean time series was comparable to the time series of the smaller parcels? 

      (4) An alternative strategy that would mitigate against the differences in the size of hypothalamic parcels would be to conduct analyses on the hypothalamus without parcellation, but instead using dimensionality reduction techniques to observe the natural spread of responses across the hypothalamus. From the authors' results, my intuition is that these analyses will lead to similar conclusions, albeit without any of the potential issues with respect to differently-sized parcels. 

      We thank the reviewer for acknowledging the originality and interest of our study. We agree that some methodological choices needed more explanation. We will address the weaknesses they pointed out as follows:

      (1) The explanation regarding the choice of the illuminance is now included in the revised manuscript (PAGE 17): “Blue-enriched light illuminances were set according to the technical characteristics of the light source and to keep the overall photon flux similar to prior 3T MRI studies of our team (between ~10¹² and 10¹⁴ ph/cm²/s) (Vandewalle et al., 2010, 2011). The orange light was introduced as a control visual stimulation for potential secondary whole-brain analyses. For the present region of interest analyses, we discarded colour differences between the light conditions and only considered illuminance as indexed by mel EDI lux. This constitutes a limitation of our study as it does not allow attributing the findings to a particular photoreceptor class.”

      The revised discussion makes clear that these choices limit the interpretation about the photoreceptors involved (PAGE 12-13): “We based our rationale and part of our interpretations on ipRGC projections, which have been demonstrated in rodents to channel the NIF biological impact of light and incorporate the inputs from rods and cones with their intrinsic photosensitivity into a light signal that can impact the brain (Güler et al., 2008; Tri & Do, 2019). Given the polychromatic nature of the light we used, classical photoreceptors and their projections to visual brain areas are, however, very likely to have directly or indirectly contributed to the modulation by light of the regional activity of the hypothalamus.”

      We further mention that (PAGE 13): “Furthermore, we cannot exclude that colour and/or spectral differences between the orange and 3 blue-enriched light conditions may have contributed to our findings. Research in a rodent model demonstrated that variation in the spectral composition of light was perceived by the suprachiasmatic nucleus to set circadian timing (Walmsley et al., 2015). No such demonstration has, however, been reported yet for the acute impact of light on alertness, attention, cognition or affective state.”

      Regarding the choice of tasks, we added the following to the method section (PAGE 18): “Prior work of our team showed that the n-back task and emotional task included in the present protocol were successful probes to demonstrate that light illuminance modulates cognitive activity, including within subcortical structures (though resolution did not allow precise isolation of nuclei or subparts) (e.g. (Vandewalle et al., 2007, 2010)). When taking the step of ultra-high-field imaging, we therefore opted for these tasks as our goal was to show that illuminance affects brain activity across cognitive domains while not testing for task-specific aspects of these domains.”

      We further added to the discussion (PAGE 8): “The pattern of light-induced changes was consistent across an executive and an emotional task which consisted of a block and an event-related fMRI design, respectively. This suggests that a robust anterior-posterior gradient of activity modulation by illuminance is present in the hypothalamus across cognitive domains.”

      (2) We are unsure what the reviewer refers to when they state that the experiment could make it easier to perceive a stimulus. Aside from the fact that illuminance can increase alertness and attention such that a stimulus may be better or more easily perceived/processed, we do not see how blocks of ambient light, i.e. a long-lasting visual stimulus, may render auditory stimulation (letters or pseudo-words in the present study) easier to perceive. To our knowledge, multimodal or cross-modal integration has been robustly demonstrated for short visual/auditory cues that would precede or accompany auditory/visual stimulation. 

      We are willing to clarify this issue in the text if we receive additional explanation from the reviewer.

      (3) We added subpart size as a covariate in the analyses (instead of subpart number) and it did not affect the output of the statistical analyses (Author response table 1). 

      For completeness, we further computed the standard deviation of the activity estimates of the voxels within each parcel for the main analysis of the n-back task and found a main effect of subpart (Author response table 2), indicating that the variability of the estimates varied across subparts. Post hoc contrasts and the display included in Author response image 1 show, however, that the differences were not related to subpart size per se. It is in fact the largest subpart (subpart 4) that shows the largest variability, while one of the smallest subparts (subpart 2) shows the lowest variability. Though it may have contributed, it is therefore unlikely to explain our findings. We consider the analyses reported in Author response tables 1 and 2 and Author response image 1 as very technical and did not include them in the supplementary material for conciseness. If the reviewer judges it essential, we can reconsider our decision.

      While computing these analyses, we realized that there were errors in Table 1 reporting the statistical outcomes of the main analyses of the emotional task. The main statistical outputs remain the same except for a nominal main effect of the task (emotional vs. neutral) and the fact that post hoc tests show a consistent difference between the posterior subpart (subpart 3) and all the other subparts, rather than all the other subparts except the superior tubular hypothalamus subpart (p-corrected = 0.09). We apologise for this slight error, the origin of which we were unable to isolate. It does not modify the rest of the analyses (which were also rechecked) or the interpretations. 
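
      As a purely illustrative sketch of the kind of variability check described above (not the code used for the reported analyses), the standard deviation of voxel-wise activity estimates could be computed per subpart as follows. The function name, the input arrays, and the toy values are hypothetical placeholders and do not correspond to the study data.

```python
import numpy as np


def parcel_variability(estimates, labels):
    """Return {parcel label: (number of voxels, SD of voxel estimates)}.

    `estimates` holds one activity estimate per voxel and `labels` holds
    the parcel (subpart) number of each voxel. Both names are hypothetical.
    """
    estimates = np.asarray(estimates, dtype=float)
    labels = np.asarray(labels)
    summary = {}
    for lab in np.unique(labels):
        vox = estimates[labels == lab]
        # ddof=1 gives the sample standard deviation across voxels;
        # a parcel with a single voxel has no defined sample SD.
        sd = float(np.std(vox, ddof=1)) if vox.size > 1 else float("nan")
        summary[int(lab)] = (int(vox.size), sd)
    return summary


# Made-up toy input: six voxels split over two parcels.
toy_estimates = [1.0, 2.0, 3.0, 10.0, 10.0, 10.0]
toy_labels = [1, 1, 1, 2, 2, 2]
result = parcel_variability(toy_estimates, toy_labels)
```

      Returning the voxel count next to the standard deviation makes it straightforward to check, as in the response, whether variability tracks subpart size.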

      Author response table 1.

      Recomputations of the main GLMMs using subpart sizes rather than subpart numbers as covariate of interest.

      Author response image 1.

      Activity estimate variability per hypothalamus subpart and subpart size.  

      Author response table 2.

      Difference in activity estimate standard deviation between hypothalamus subparts during the n-back task.

      Outputs of the generalized linear mixed model (GLMM) with subject as the random factor (intercept and slope), and task and subpart as repeated measures (ar(1) autocorrelation).

      * The corrected p-value for multiple comparisons over 2 tests is p < 0.025.

      # Refer to Fig.2A for correspondence of subpart numbers

      The text referring to Table 1 was modified accordingly (PAGE 5): “A nominal main effect of the task was detected for the emotional task [p = 0.049; Table 1] but not for the n-back task. For both tasks, there was no significant main effect for any of the other covariates and post hoc analyses showed that the index of the illuminance impact was consistently different in the posterior hypothalamus subpart compared to the other subparts [p-corrected ≤ 0.05]”.

      (4) We agree that a data-driven approach could have constituted an alternative means to test our hypothesis. We opted for the approach that we mastered best, while still allowing us to conclusively test for regional differences in activity across the hypothalamus. Examination of time series of the very same data we used will mainly confirm the results of our analyses (an anterior-posterior gradient in the impact of illuminance), while it may yield slight differences in the borders of the subparts of the hypothalamus undergoing decreased or increased activity with increasing illuminance. While the suggested approach may have been envisaged if we had been facing negative results (i.e. no differences between subparts, potentially because subparts would not reflect functional differences in response to illuminance change), it would constitute a circular confirmation of our main findings (i.e. using the same data). While we truly appreciate the suggestion, we do not consider that it would constitute a more parsimonious test of our hypothesis, now that we successfully applied GLM/parcellation and GLMM approaches.

      We added the following statement to the discussion to take this comment into account (PAGE 12): “Future research may consider data-driven analyses of hypothalamus voxels time series as an alternative to the parcellation approach we adopted here. This may refine the delineation of the subparts of the hypothalamus undergoing decreased or increased activity with increasing illuminance.”

      Response references

      Albers, H. E., Walton, J. C., Gamble, K. L., McNeill, J. K., & Hummer, D. L. (2017). The dynamics of GABA signaling: Revelations from the circadian pacemaker in the suprachiasmatic nucleus. Frontiers in Neuroendocrinology, 44, 35–82. https://doi.org/10.1016/J.YFRNE.2016.11.003

      Bano-Otalora, B., Martial, F., Harding, C., Bechtold, D. A., Allen, A. E., Brown, T. M., Belle, M. D. C., & Lucas, R. J. (2021). Bright daytime light enhances circadian amplitude in a diurnal mammal. Proceedings of the National Academy of Sciences of the United States of America, 118(22), e2100094118. https://doi.org/10.1073/PNAS.2100094118/SUPPL_FILE/PNAS.2100094118.SAPP.PDF

      Campbell, I., Sharifpour, R., & Vandewalle, G. (2023). Light as a Modulator of Non-Image-Forming Brain Functions: Positive and Negative Impacts of Increasing Light Availability. Clocks & Sleep, 5(1), 116. https://doi.org/10.3390/CLOCKSSLEEP5010012

      Chellappa, S. L., Ly, J. Q. M., Meyer, C., Balteau, E., Degueldre, C., Luxen, A., Phillips, C., Cooper, H. M., & Vandewalle, G. (2014). Photic memory for executive brain responses. Proceedings of the National Academy of Sciences of the United States of America, 111(16), 6087–6091. https://doi.org/10.1073/pnas.1320005111

      Dijk, D. J., Duffy, J. F., Silva, E. J., Shanahan, T. L., Boivin, D. B., & Czeisler, C. A. (2012). Amplitude reduction and phase shifts of melatonin, cortisol and other circadian rhythms after a gradual advance of sleep and light exposure in humans. PloS One, 7(2). https://doi.org/10.1371/JOURNAL.PONE.0030037

      Güler, A. D., Ecker, J. L., Lall, G. S., Haq, S., Altimus, C. M., Liao, H. W., Barnard, A. R., Cahill, H., Badea, T. C., Zhao, H., Hankins, M. W., Berson, D. M., Lucas, R. J., Yau, K. W., & Hattar, S. (2008). Melanopsin cells are the principal conduits for rod-cone input to non-image-forming vision. Nature, 453(7191), 102–105. https://doi.org/10.1038/nature06829

      Lucas, R. J., Peirson, S. N., Berson, D. M., Brown, T. M., Cooper, H. M., Czeisler, C. A., Figueiro, M. G., Gamlin, P. D., Lockley, S. W., O’Hagan, J. B., Price, L. L. A., Provencio, I., Skene, D. J., & Brainard, G. C. (2014). Measuring and using light in the melanopsin age. Trends in Neurosciences, 37(1), 1–9. https://doi.org/10.1016/j.tins.2013.10.004

      Milosavljevic, N., Cehajic-Kapetanovic, J., Procyk, C. A., & Lucas, R. J. (2016). Chemogenetic Activation of Melanopsin Retinal Ganglion Cells Induces Signatures of Arousal and/or Anxiety in Mice. Current Biology, 26(17), 2358–2363. https://doi.org/10.1016/j.cub.2016.06.057

      Sonoda, T., Li, J. Y., Hayes, N. W., Chan, J. C., Okabe, Y., Belin, S., Nawabi, H., & Schmidt, T. M. (2020). A noncanonical inhibitory circuit dampens behavioral sensitivity to light. Science (New York, N.Y.), 368(6490), 527–531. https://doi.org/10.1126/SCIENCE.AAY3152

      Tri, M., & Do, H. (2019). Melanopsin and the Intrinsically Photosensitive Retinal Ganglion Cells: Biophysics to Behavior. Neuron, 104, 205–226. https://doi.org/10.1016/j.neuron.2019.07.016

      Vandewalle, G., Hébert, M., Beaulieu, C., Richard, L., Daneault, V., Garon, M. Lou, Leblanc, J., Grandjean, D., Maquet, P., Schwartz, S., Dumont, M., Doyon, J., & Carrier, J. (2011). Abnormal hypothalamic response to light in seasonal affective disorder. Biological Psychiatry, 70(10), 954–961. https://doi.org/10.1016/j.biopsych.2011.06.022

      Vandewalle, G., Schmidt, C., Albouy, G., Sterpenich, V., Darsaud, A., Rauchs, G., Berken, P. Y., Balteau, E., Degueldre, C., Luxen, A., Maquet, P., & Dijk, D. J. (2007). Brain responses to violet, blue, and green monochromatic light exposures in humans: Prominent role of blue light and the brainstem. PLoS ONE, 2(11), e1247. https://doi.org/10.1371/journal.pone.0001247

      Vandewalle, G., Schwartz, S., Grandjean, D., Wuillaume, C., Balteau, E., Degueldre, C., Schabus, M., Phillips, C., Luxen, A., Dijk, D. J., & Maquet, P. (2010). Spectral quality of light modulates emotional brain responses in humans. Proceedings of the National Academy of Sciences of the United States of America, 107(45), 19549–19554. https://doi.org/10.1073/pnas.1010180107

      Viénot, F., Brettel, H., Dang, T.-V., & Le Rohellec, J. (2012). Domain of metamers exciting intrinsically photosensitive retinal ganglion cells (ipRGCs) and rods. Journal of the Optical Society of America A, 29(2), A366. https://doi.org/10.1364/josaa.29.00a366

      Walmsley, L., Hanna, L., Mouland, J., Martial, F., West, A., Smedley, A. R., Bechtold, D. A., Webb, A. R., Lucas, R. J., & Brown, T. M. (2015). Colour As a Signal for Entraining the Mammalian Circadian Clock. PLOS Biology, 13(4), e1002127. https://doi.org/10.1371/journal.pbio.1002127

    1. Author response:

      The following is the authors’ response to the current reviews.

      Public Reviews: 

      Reviewer #1 Comments on revisions: 

      The authors have addressed my concerns so I am fine with revision in principle.

      Thank you for taking the time to review our work and for your thoughtful feedback. We’re glad to hear that your concerns have been addressed.

      Reviewer #2 Comments on revisions:

      The authors have addressed many of the concerns raised in the initial review and provided alternative analytical approaches to address the relevant questions in this revision. Some of these are useful; however, they have not fully addressed one critical point. 

      In my original critique, I noted that the maternal KO might not be suitable as a control, given that there is no significant phenotypic difference between the maternal-only KO and the maternal-zygotic KO. While we did not dispute the molecular differences presented in Figure 2, how do the authors conclude in the Response that "embryos with a maternal KO or zygotic heterozygous KO of Oct4 or Sox2 show no noticeable ... molecular difference (Figure 2-figure supplement 4A)"? The authors should recheck whether this is a typographical error or a valid statement.

      Additionally, I recommend the removal of phrases such as "absolutely priority" and "pivotal" throughout the manuscript, as these terms are overly assertive without sufficient supporting evidence.

      We sincerely appreciate the reviewer’s feedback and would like to take this opportunity to provide further clarification, as there might have been a misunderstanding.

      We respectfully disagree with the reviewer’s statement that “there is no significant phenotypic difference between the maternal-only KO and the maternal-zygotic KO.” Based on previous publications, there is clear evidence that maternal-zygotic KO embryos exhibit significant defects: they fail to form a healthy primitive endoderm, are unable to give rise to embryonic stem cells (ESCs) in vitro, and die shortly after implantation (Frum et al., Dev Cell 2013; Wu et al., Nat Cell Biol 2013; Le Bin et al., Development 2014; Wicklow et al., PLoS Genet 2014). In contrast, maternal-only KO embryos develop as healthily as wild-type (WT) embryos and do not display any of these phenotypic abnormalities. We believe that this distinction validates our use of maternal KO embryos as proper controls in our experiments.

      To address the reviewer’s concerns and ensure clarity, we have also revised the following statement in the manuscript.

      Original manuscript: “Mouse embryos with a maternal KO or zygotic heterozygous KO of either factor show no noticeable phenotype or molecular difference (Figure 2-figure supplement 4A) (Avilion et al., 2003; Frum et al., 2013; Kehler et al, 2004; Nichols et al., 1998; Wicklow et al., 2014; Wu et al., 2013).” 

      Revised manuscript: “Maternal KO embryos (circles in Figure 2—figure supplement 4A) clustered together with wildtype embryos (triangles and squares) in the PCA analysis, consistent with previous studies reporting no observable phenotype in maternal KO embryos (Avilion et al., 2003; Frum et al., 2013; Kehler et al, 2004; Nichols et al., 1998; Wicklow et al., 2014; Wu et al., 2013).”

      While we acknowledge the potential for using maternal-only KO controls to underestimate differences between control and KO samples, we believe this approach does not introduce false positives in our RNA-seq and ATAC-seq experiments, only the possibility of more conservative conclusions. This minimizes the risk of overestimating the molecular impact.

      We appreciate the reviewer’s recommendation regarding the use of overly assertive terms. Upon careful review of the manuscript and response letter, we could not find instances of the term “absolutely priority.” However, we do use the term “pivotal” and would prefer to retain it as we believe it accurately reflects the importance of the findings presented in our manuscript.

      Thank you for your thoughtful comments and suggestions! We hope this response clarifies our rationale and addresses the concerns.

      ---

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review)

      Summary:

      Numerous mechanistic and structural studies have reported the cooperative role of Oct4 and Sox2 in establishing pluripotency during reprogramming. Owing to the difficulty of sample collection and of RNA-seq with low cell numbers, the precise mechanisms in early embryos remain unclear. This manuscript reports the role of OCT4 and SOX2 in early mouse embryos using knockout models with low-input ATAC-seq and RNA-seq. Compared to the control, chromatin accessibility and the transcriptome were affected when Oct4 and Sox2 were deleted in the early ICM. Specifically, decreased ATAC-seq peaks showed enrichment of TF motifs such as OCT, SOX, and OCT-SOX, indicating their importance during early development. Moreover, through in-depth analysis of the ATAC-seq and RNA-seq data, the authors found that Oct4 and Sox2 target enhancers to activate their downstream genes. In addition, they uncovered the role of OS during development from the morula to the ICM, providing the scientific community with a more comprehensive understanding.

      Strengths:

      On the whole, the manuscript is innovative, and the conclusions of this paper are mostly well supported by data, however, there are some issues that need to be addressed.

      Weaknesses:

      Major Points:

      (1) In Figure 1, a more detailed description of the knockout strategy should be provided to clarify itself. The knockout strategy in Fig1 is somewhat obscure, such as how is OCT4 inactivated in Oct4mKO2 heterozygotes. As shown in Figure 1, the exon of OCT4 is not deleted, and its promoter is not destroyed. Therefore, how does OCT4 inactivate to form heterozygotes?

      Thank you for helping clarify this. We will add a detailed description of the knockout strategy in the legends for Figure 1A and 1B, as shown below. Note that the same strategy was used by Nichols et al (Cell, 1998).

      Figure 1A. Schemes of mKO2-labeled Oct4 KO (Oct4<sup>mKO2</sup>) and Oct4<sup>flox</sup> alleles. In the Oct4<sup>mKO2</sup> allele, a PGK-pac∆tk-P2A-mKO2-pA cassette was inserted 3.6 kb upstream of the Oct4 transcription start site (TSS) and a promoter-less FRT-SA-IRES-hph-P2A-Venus-pA cassette was inserted into Oct4 intron 1. The inclusion of a stop codon followed by three sets of polyadenylation signal sequences (pA) after the Venus cassette ensures both transcriptional and translational termination, effectively blocking the expression of Oct4 exons 2–5.

      Figure 1B. Schemes of EGFP-labeled Sox2 KO (Sox2<sup>EGFP</sup>) and Sox2<sup>flox</sup> alleles. In the Sox2<sup>EGFP</sup> allele, the 5’ untranslated region (UTR), coding sequence and a portion of the 3’ UTR of Sox2 were deleted and replaced with a PGK-EGFP-pA cassette. Notably, 1,023 bp of the Sox2 3’ UTR remain intact.

      (2) Is ZP3-Cre expressed in the zygotes? Is there any residual protein?

      This is indeed a very important issue. Here is why we think we are on the safe side. ZP3 is specifically expressed in growing oocytes, thus making ZP3-Cre a widely used tool for deleting maternally inherited alleles. When we crossed Oct4<sup>flox/flox</sup>; ZP3-Cre<sup>-</sup> females with Oct4<sup>flox/flox</sup>; ZP3-Cre<sup>+</sup> males, we got ZP3-Cre<sup>+</sup> Oct4<sup>flox/flox</sup> but no Oct4<sup>flox/∆</sup> or Oct4<sup>∆/∆</sup> pups, suggesting that the paternally inherited ZP3-Cre allele is not functionally active in zygotes, which is consistent with reports from other researchers (e.g. Frum, et al., Dev Cell 2013; Wu, et al., Nat Cell Biol 2013).

      (3) What motifs are enriched in the rising ATAC-seq peaks after knocking out of OCT4 and SOX2?

      The motifs enriched in the rising ATAC-seq peaks in Oct4 KO and Sox2 KO ICMs are the GATA, TEAD, EOMES and KLF motifs, as shown in Figure 4A and Figure supplement 7.

      (4) The ordinate of Fig4c is lost.

      Thank you for pointing this out. The y-axis is average normalized signals (reads per million-normalized pileup signals). We will add it in the revised version.
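
      As a minimal sketch of what "average reads-per-million-normalized pileup signal" can mean in practice (not the authors' pipeline), the Python below scales each pileup profile by library size and then averages across peaks. The function names and the toy numbers are hypothetical and chosen only for illustration.

```python
def rpm_normalize(pileup, total_mapped_reads):
    """Scale raw pileup counts to reads per million mapped reads."""
    scale = 1_000_000 / total_mapped_reads
    return [value * scale for value in pileup]


def average_profile(profiles):
    """Average equally long profiles position by position (one profile per peak)."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]


# Toy example: two peaks, two positions each, from a library of 2 million reads.
library_size = 2_000_000
peak_profiles = [rpm_normalize(p, library_size) for p in ([2, 4], [6, 8])]
print(average_profile(peak_profiles))
```

      The value plotted on the y-axis would then correspond to one entry of the averaged profile per position around the peak center.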

      (5) Signals of H3K4me1, H3K27ac, and so on are usually used to define enhancers, and the loci of enhancers vary greatly in different cells. In the manuscript, the authors defined ATAC-seq peaks far from the TSS as enhancers. The definition in this manuscript is not strictly an enhancer.

      Thank you for this insightful comment. We analyzed the published H3K27ac ChIP-seq data of mouse ICM at 94-96 h post hCG (B. Liu, et al., Nat Cell Biol 2024) to assess the enrichment of H3K27ac around our ATAC-seq peaks. Unfortunately, the data quality is poor, e.g., inconsistent across replicates (Author response image 1A), and shows little enrichment around the well-defined enhancers (Author response image 1B). Nevertheless, as we admit not all the distal ATAC-seq peaks or open chromatin regions are enhancers, we have replaced “enhancers” with “open chromatin regions”, “ATAC-seq peaks” or “putative enhancers”.

      Author response image 1.

      Analysis of the published H3K27ac ChIP-seq dataset of mouse ICM at 94-96 h post hCG (B. Liu, et al., Nat Cell Biol 2024). A. ChIP-seq profiles of H3K27ac over the decreased, unchanged and increased ATAC-seq peaks in our Oct4-KO late ICMs. To exclude spurious peaks, only strong unchanged peaks (57,512 out of 142,096) were used in the analysis. B. IGV tracks displaying ATAC-seq and H3K27ac ChIP-seq profiles around Dppa3 and Oct4. Red boxes mark the known OCT-SOX enhancers.

      (6) If Oct4 and Sox2 truly activate Sap30 and Uhrf1, what effect does interfering with both genes have on gene expression and chromatin accessibility?

      This is indeed an interesting question. Unfortunately, we have not conducted this specific experiment, so we do not have direct results. However, Sap30 is a key component of the mSin3A corepressor complex, while Uhrf1 regulates the establishment and maintenance of DNA methylation. Both proteins are known to function as repressors. Therefore, we hypothesize that interfering with these two genes could alleviate repression of some genes, such as trophectoderm markers, similar to what we have observed in Oct4 KO and Sox2 KO ICMs.

      Reviewer #2 (Public review):

      In this manuscript, Hou et al. investigate the interplay between OCT4 and SOX2 in driving the pluripotent state during early embryonic lineage development. Using knockout (KO) embryos, the authors specifically analyze the transcriptome and chromatin state within the ICM-to-EPI developmental trajectory. They emphasize the critical role of OCT4 and the supportive function of SOX2, along with other factors, in promoting embryonic fate. Although the paper presents high-quality data, several key claims are not well-supported, and direct evidence is generally lacking.

      Major Points:

      (1) Although the authors claim that both maternal KO and maternal KO/zygotic hetero KO mice develop normally, the molecular changes in these groups appear overestimated. A wildtype control is recommended for a more robust comparison. (a complementary comment from the reviewer: “Both maternal KO and maternal-zygotic KO in this study exhibited phenotypic consistency but molecular disparity. Specifically, both KO and control groups could develop normally; however, their chromatin landscapes and transcriptomic profiles differed. This raises the question of whether the molecular differences are real. We suggest that inclusion of a completely wild-type control group would make the comparison more robust.”)

      Thank you for your feedback as this point was obviously not clear in the manuscript. Here is our explanation: Mouse embryos with a maternal KO or zygotic heterozygous KO of Oct4 or Sox2 show no noticeable phenotype or molecular difference (Figure 2-figure supplement 4A) (Avilion et al., 2003; Frum et al., 2013; Kehler et al, 2004; Nichols et al., 1998; Wicklow et al., 2014; Wu et al., 2013). We have clarified this point in the revised manuscript.

      (2) The authors assert that OCT4 and SOX2 activate the pluripotent network via the OCT-SOX enhancer. However, the definition of this enhancer is based solely on proximity to TSSs, which is a rough approximation. Canonical enhancers are typically located in intronic and intergenic regions and marked by H3K4me1 or H3K27ac. Re-analyzing enhancer regions with these standards could be beneficial. Additionally, the definitions of "close to" or "near" in lines 183-184 are unclear and not defined in the legends or methods.

      Thank you for this insightful and helpful comment. As stated in the response to Reviewer #1’s point (5), we have replaced “enhancers” with “open chromatin regions”, “ATAC-seq peaks” or “putative enhancers”.

      The definition of "close to" or "near" in lines 183-184 is in the legend of Figure 2E and Methods. In the GSEA analysis, Ensembl protein-coding genes with TSSs located within 10 kb of ATAC-seq peak centers were included, so that some of the intronic ATAC-seq peaks were taken into consideration. We have also added the information in the main text of the revised manuscript.
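
      The 10 kb rule described above can be sketched as follows. This is an illustrative Python example, not the authors' code: the function name, the data layout, and the toy coordinates are assumptions, and the window is treated as inclusive at exactly 10 kb, which the text does not specify.

```python
from bisect import bisect_left, bisect_right

WINDOW_BP = 10_000  # from the text: TSS within 10 kb of an ATAC-seq peak center


def genes_near_peaks(peak_centers, tss_by_gene, window=WINDOW_BP):
    """Return sorted gene names whose TSS lies within `window` bp of any peak center.

    peak_centers: dict mapping chromosome -> list of peak-center coordinates
    tss_by_gene:  dict mapping gene name -> (chromosome, TSS coordinate)
    """
    sorted_centers = {chrom: sorted(pos) for chrom, pos in peak_centers.items()}
    hits = []
    for gene, (chrom, tss) in tss_by_gene.items():
        centers = sorted_centers.get(chrom, [])
        # Indices bounding the centers that fall in [tss - window, tss + window].
        lo = bisect_left(centers, tss - window)
        hi = bisect_right(centers, tss + window)
        if hi > lo:
            hits.append(gene)
    return sorted(hits)


# Hypothetical toy data: GeneA is 8 kb from a peak, GeneB is 11 kb away,
# and GeneC sits on a chromosome without peaks.
peaks = {"chr1": [50_000, 200_000]}
genes = {
    "GeneA": ("chr1", 58_000),
    "GeneB": ("chr1", 61_000),
    "GeneC": ("chr2", 50_000),
}
print(genes_near_peaks(peaks, genes))
```

      Because the test is based on distance to the TSS rather than on genomic annotation, a peak inside an intron is still assigned to a gene whenever it falls within the window.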

      (3) There is no evidence that the decreased peaks/enhancers could be the direct targets of Oct4 and Sox2 throughout this manuscript. Figures 2 and 4 show only minimal peak annotations related to OCT and SOX motifs, and there is a lack of chromatin IP data. Therefore, claims about direct targets are not substantiated and should be appropriately revised.

      Yes indeed, you have a point. In Figure Supplement 3C, we analyzed the published Sox2 CUT&RUN data from E4.5 ICMs (Li et al., Science, 2023), which demonstrates that the reduced ATAC-seq peaks in our Sox2 KO ICMs are enriched with the Sox2 CUT&RUN signals. Unfortunately, we did not find similar published data for Oct4 in embryos. We have removed the statement indicating that these are the direct targets in the revised manuscript.

      (4) Lines 143-146 lack direct data to support the claim. Actually, the main difference in cluster 1, 11 and 3, 8, 14 is whether the peak contains OCT-SOX motif. However, the reviewer cannot get any information of peaks activated by OCT4 rather than SOX2 in cluster 1, 11.

      Thank you for the comment that we hope we can clarify.

      Lines 143-146 are: “Notably, the peaks activated by Oct4 but not by Sox2 in the ICM tended to be already open at the morula stage (Figure 2B, clusters 1 and 11), whereas those dependent on both Oct4 and Sox2 became open in the ICM (Figure 2B, clusters 3, 8 and 14).”

      We agree with you that clusters 3/8/14 are more enriched in OCT-SOX motifs than clusters 1/11. However, this is consistent with our observation that accessibility of peaks in clusters 1 and 11 relies mainly on Oct4, while accessibility in clusters 3, 8, 14 depends on both Oct4 and Sox2. But maybe the term “activate” is misleading. We have rephrased the text as below:

      “Notably, compared to the peaks that depend on Oct4 but not Sox2 (Figure 2B, clusters 1 and 11), those reliant on both Oct4 and Sox2 show greater enrichment of the OCT-SOX motif (Figure 2B, clusters 3, 8 and 14). The former group was generally already open in the morula, while the latter group only became open in the ICM.”

      Minor Points:

      (1) Lines 153-159: The figure panel does not show obvious enrichment of SOX2 signals or significant differences in H3K27ac signals across clusters, thus not supporting the claim.

      We hope to be able to explain this.

      Lines 153-159 refer to two datasets: Figure Supplement 3C and 3D.

      In Figure Supplement 3C, the average plots above the heatmaps show that the decreased ATAC-seq peaks (the indigo lines) have higher enrichment with Sox2 CUT&RUN signals than the increased or unchanged peaks (the yellow and light blue lines, respectively).

      In Figure Supplement 3D, the average plots indicate that H3K27ac signals around the center of the decreased ATAC-seq peaks (the indigo line) show higher enrichment compared to the unaltered and increased groups (the light blue and yellow lines, respectively). Notably, H3K27ac enrichment appears slightly offset from the central nucleosome-free regions.

      (2) Lines 189-190: The term "identify" is overstated for the integrative analysis of RNA-seq and ATAC-seq, which typically helps infer TF targets rather than definitively identifying them.

      You are right. We have replaced “identify” with “infer” in the revised manuscript.

      (3) The Discussion is lengthy and should be condensed.

      We have shortened the discussion in the revised manuscript.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Lu et al. propose here a direct role of LPS in inducing hepatic fat accumulation and that the metabolism of LPS can therefore mitigate fatty liver injury. Using acyloxyacyl hydrolase whole-body KO mice, they demonstrated that acyloxyacyl hydrolase deletion resulted in higher hepatic fat accumulation over 8 months of high glucose/high fructose diet. Previous literature has found that hepatocyte TLR4 (which is a main receptor for binding LPS) KO reduced fatty liver in the MAFLD model, and this paper complements this by showing that degradation/metabolism of LPS can also reduce fatty liver. This result suggests a very interesting mechanism, and the translational implications of utilizing acyloxyacyl hydrolase to decrease LPS exposure are intriguing.

      The strengths of the present study include that they raised a very simplistic mechanism with LPS that is of interest in many diseases. The phenotype shown in the study is strong. The mechanism proposed by the findings is generally well supported.

      There are also several shortcomings in the findings of this study. As AOAH is a whole-body KO, the source production of AOAH in MAFLD is unclear. Although the authors used published single-cell RNA-seq data and flow-isolated liver cells, physiologically LPS degradation could occur in the blood or the liver. The authors linked LPS to hepatocyte fatty acid oxidation via SREBP1. The mechanism is not explored in great depth. Is this signaling TLR4? In this model, LPS could activate macrophages and mediate the worsening of hepatocyte fatty liver injury via the paracrine effect instead of directly signaling to hepatocytes, thus it is not clear that this is a strictly hepatocyte LPS effect. It would also be very interesting to see if administration of the AOAH enzyme orally could mitigate MAFLD injury. Overall, this work will add to the current understanding of the gut-liver axis and development of MAFLD and will be of interest to many readers.

      We thank the reviewers for their important questions and comments.

      In previous studies we found that AOAH is expressed in Kupffer cells and dendritic cells (Shao et al., 2007). Single-cell RNAseq analysis of mouse livers by others has found AOAH in Kupffer cells, monocytes, NK cells and ILC1 cells (Remmerie et al., 2020). We also analyzed human liver single-cell RNAseq data and found that AOAH is expressed in monocytes, macrophages, resident and circulating NK cells, and some T cells (Ramachandran et al., 2019) (Please see new Figure 3E). Using clodronate-liposomes to deplete Kupffer cells we found that hepatic AOAH mRNA diminished and nSREBP1 increased (Please see new Figure 5D). These results suggest that Kupffer cells are the major source of AOAH in the liver and that LPS needs to be inactivated in the liver to prevent hepatocyte lipid accumulation.

      Using primary hepatocyte culture, we found that LPS can stimulate hepatocytes directly to induce mTOR activation and SREBP1 activation (new Figure 6E). Adding purified Kupffer cells to the hepatocyte culture did not further increase SREBP1 activation. These results suggest that LPS may directly stimulate hepatocytes to accumulate fat, at least in vitro.

      Both TLR4 and caspase 11 are reported to play important roles in MASLD development (Sharifnia et al., 2015; Zhu et al., 2021). We have crossed Aoah<sup>-/-</sup> mice with TLR4<sup>-/-</sup> mice and found that Aoah<sup>-/-</sup>TLR4<sup>-/-</sup> and Aoah<sup>-/-</sup> mice had similarly severe MASLD. This is probably because TLR4 is required for gut homeostasis (Rakoff-Nahoum et al., 2004); in TLR4 whole-body KO mice compromised gut homeostasis may result in more severe MASLD. By specifically deleting TLR4 on hepatocytes, Yu et al found that NASH-induced fibrosis was mitigated (Yu et al., 2021). In future studies we therefore would need to specifically delete TLR4 in hepatocytes to test whether excessive gut-derived LPS in Aoah<sup>-/-</sup> mice stimulates hepatic TLR4 to induce more severe MASLD. We would also test whether Caspase 11 is required for hepatic fat accumulation in Aoah<sup>-/-</sup> mice.

      It is intriguing to test whether providing exogenous AOAH may mitigate MASLD. We will use an AAV expressing AOAH to test this idea.

      Reviewer #2 (Public review):

      The authors of this article investigated the impact of the host enzyme AOAH on the progression of MASLD in mice. To achieve this, they utilized whole-body Aoah<sup>-/-</sup> mice. The authors demonstrated that AOAH reduced LPS-induced lipid accumulation in the liver, probably by decreasing the expression and activation of SREBP1. In addition, AOAH reduced hepatic inflammation and minimized tissue damage.

      However, this paper is descriptive without a clear mechanistic study. Another major limitation is the use of whole-body KO mice so the cellular source of the enzyme remains undefined. Moreover, since LPS-mediated SREBP1 regulation or LPS-mediated MASLD progression is already documented, the role of AOAH in SREBP1-dependent lipid accumulation and MASLD progression is largely expected.

      Specific comments:

      (1) The overall human relevance of the current study remains unclear.

      It is a good point. We have studied human relevance and show the results in Figure 3E. AOAH expression increased in the hepatic macrophages and monocytes of MASLD patients.

      (2) Is AOAH secreted from macrophages or other immune cells? Are there any other functions of AOAH within the cells?

      AOAH can be secreted from kidney proximal tubule cells and the released AOAH can be taken up by cells that do not express AOAH (Feulner et al., 2004). AOAH can also deacylate oxidized phospholipids, DAMP molecules (Zou et al., 2021).

      (3) Due to using whole-body KO mice, the role of AOAH in specific cell types was unclear in this study, which is one of the major limitations of this study. The authors should at least conduct in vitro experiments using a co-culture system of hepatocytes and Kupffer cells (or other immune cells) isolated from WT or Aoah<sup>-/-</sup> mice.

      Thanks for the suggestion.

      Using clodronate-liposomes, we depleted Kupffer cells and found that hepatic AOAH mRNA diminished and nSREBP1 increased in the liver (Please see new Figure 5D). These results confirm that Kupffer cells are the major source of AOAH in the liver and LPS needs to be inactivated in the liver to prevent hepatocyte lipid accumulation.  Using primary hepatocyte culture, we found that LPS can stimulate hepatocytes directly to induce mTOR activation and SREBP1 activation (new Figure 6E).  These results suggest that LPS may directly stimulate hepatocytes to accumulate fat, at least in vitro.

      (4) It has been well-known that intestinal tight junction permeability is increased by LPS or inflammatory cytokines. However, in Figure 3E, intestinal permeability is comparable between the groups in both diet groups. The authors should discuss more about this result. In addition, intestinal junctional protein should be determined by Western blot and IHC (or IF) to further confirm this finding.

      We have stained ZO-1 (Please see Author response image 1, ZO-1- green fluorescence) in Aoah<sup>+/+</sup> and Aoah<sup>-/-</sup> mouse colonic sections. We did not see a big difference between the two strains of mice.

      Author response image 1.

      Feeding a high-fat diet in our mouse facility for 28 weeks led to increased gut permeability, but there was no difference between Aoah<sup>+/+</sup> and Aoah<sup>-/-</sup> mice. Thus, the more severe MASLD in Aoah<sup>-/-</sup> mice is mainly caused by elevated bioactive LPS instead of increased LPS translocation from the intestine to the liver.

      (5) In Figure 6, the LPS i.g. Aoah<sup>-/-</sup> group is missing. This group should be included to better interpret the results.

      Please see new Figure 6. When we orally gavaged Aoah<sup>-/-</sup> mice with LPS, fecal LPS levels did not increase further. Their liver SREBP1 did not increase further while the SREBP1 target gene expression increased when compared with Aoah<sup>-/-</sup> mice i.g. PBS.

      (6) The term NAFLD has been suggested to be changed to MASLD as the novel nomenclature according to the guidelines of AASLD and EASL.

      Thanks for the suggestion. We have changed NAFLD to MASLD.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Consider using MAFLD rather than NAFLD.

      Thanks for the suggestion. We have changed NAFLD to MASLD.

      References

      Feulner, J.A., M. Lu, J.M. Shelton, M. Zhang, J.A. Richardson, and R.S. Munford. 2004. Identification of acyloxyacyl hydrolase, a lipopolysaccharide-detoxifying enzyme, in the murine urinary tract. Infection and immunity 72:3171-3178.

      Zou, B., M. Goodwin, D. Saleem, W. Jiang, J. Tang, Y. Chu, R.S. Munford, and M. Lu. 2021. A highly conserved host lipase deacylates oxidized phospholipids and ameliorates acute lung injury in mice. eLife 10:

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      Weaknesses:  

      (1) The heatmaps (for example, Figure 3A, B) are challenging to read and interpret due to their size. Is there a way to alter the visualization to improve interpretability? Perhaps coloring the heatmap by general anatomical region could help? We feel that these heatmaps are critical to the utility of the registration strategy, and hence, clear visualization is necessary. 

      We thank the reviewers for this point on aesthetic improvement, and we agree that clearer visualization of our correlation heatmaps is important. To address this point, we have incorporated the capability of grouping “child” subregions in anatomical order by their more general “parent” region into the package function, plot_correlation_heatmaps(). Parent regions can now be plotted as smaller sub-facets in the heatmaps. We have also rearranged our figures to fit enlarged heatmaps in Figures 3-5, and Supplementary Figure 10 for easier visualization.

      (2) Additional context in the Introduction on the use of immediate early genes to label ensembles of neurons that are specifically activated during the various behavioral manipulations would enable the manuscript and methodology to be better appreciated by a broad audience. 

      We thank the reviewers for this suggestion and have revised the first part of our Introduction to reflect the broader use and appeal of immediate early genes (IEGs) for studying neural changes underlying behavior.

      (3) The authors mention that their segmentation strategies are optimized for the particular staining pattern exhibited by each reporter and demonstrate that the manually annotated cell counts match the automated analysis. They mention that alternative strategies are compatible, but don't show this data. 

      We thank the reviewers for this comment. We also appreciate that integration with alternative strategies is a major point of interest to readers, given that others may be interested in compatibility with our analysis and software package, rather than completely revising their own pre-existing pipelines. 

      Generally, we have validated the ability to import datasets generated from completely different workflows for segmentation and registration. We have since released documentation on our package website with step-by-step instructions on how to do so (https://mjin1812.github.io/SMARTTR/articles/Part5.ImportingExternalDatasets). We believe this tutorial is a major entry point to taking advantage of our analysis package, without adopting our entire workflow.

      This specific point on segmentation refers to the import_segmentation_custom() function in the package. As there is currently no standard cell segmentation export format adopted by the field, this function still requires some data wrangling into an import format saved as a .txt file. However, we chose not to visually demonstrate this capability in the paper for a few reasons.

      i) A figure showing the broad testing of many different segmentation algorithms, (e.g., Cellpose, Vaa3d, Trainable Weka Segmentation) would better demonstrate the efficacy of segmentation of these alternative approaches, which have already been well-documented. However, demonstrating importation compatibility is more of a demonstration of API interface, which is better shown in website documentation and tutorial notebooks.

      ii) Additionally, showing importation with one well-established segmentation approach is still a demonstration of a single use case. There would be a major burden of proof in establishing importation compatibility with all potential alternative platforms, their specific export formats, which may differ slightly depending on post-processing choices, and the needs of the experimenters (e.g., exporting one versus many channels, having different naming conventions, having different export formats). For example, output from Cellpose can take the form of a NumPy file (_seg.npy file), a .png, or Native ImageJ ROI archive output, and users may have chosen up to four channels. Until the field adopts a standardized file format, one flexible enough to account for all the variables of experimental interest, we currently believe it is more efficient to advise external groups on how to transform their specific data to be compatible with our generic import function.

      (4) The authors provided highly detailed information for their segmentation strategy, but the same level of detail was not provided for the registration algorithms. Additional details would help users achieve optimal alignment.

      We apologize for this lack of detail. The registration strategy depends upon the WholeBrain (Fürth et al., 2018) package for registration to the Allen Mouse Common Coordinate Framework. While this strategy has been published and documented elsewhere, we have substantially revised our methods section on the registration process to better incorporate details of this approach.

      (5) The authors illustrate registration to the Allen atlas. Can they comment on whether the algorithm is compatible with other atlases or with alternative sectioning planes (horizontal/sagittal)? 

      Since the current registration workflow integrates WholeBrain (Fürth et al., 2018), any limitations of WholeBrain apply to our approach, which means limited support for registering non-coronal sectioning planes and reliance on the Allen Mouse Atlas (Dong, 2008). However, network analysis and plotting functions are currently compatible with the Allen Mouse Brain Atlas and the Kim Unified Mouse Brain Atlas version (2019) (Chon et al., 2019). Therefore, current limitations in registration do not preclude the usefulness of the SMARTTR software in generating valuable insights from network analysis of externally imported datasets. 

      There are a number of alternative workflows, such as the QUINT workflow (Yates et al., 2019), that support multiple different mouse atlases and registration of arbitrarily sectioned angles. We plan to support and facilitate an entry point for this workflow in a future iteration of SMARTTR, but believe it is of benefit to the wider community to release and support SMARTTR in its current state.

      (6) Supplemental Figures S10-13 do not have a legend panel to define the bar graphs. 

      We apologize for this omission and have fixed our legends in our resubmission. Our supplement figure orders have changed and the corresponding figures are now Supplemental Figures S11-14.

      (7) When images in a z-stack were collapsed, was this a max intensity projection or average?

      Assuming this question refers to our manual cell counting validation approach, the z-stacks were collapsed as a maximum intensity projection.

      Reviewer #2 (Public review): 

      Weaknesses: 

      (1) While I was able to install the SMARTR package, after trying for the better part of one hour, I could not install the "mjin1812/wholebrain" R package as instructed in OSF. I also could not find a function to load an example dataset to easily test SMARTR. So, unfortunately, I was unable to test out any of the packages for myself. Along with the currently broken "tractatus/wholebrain" package, this is a good example of why I would strongly encourage the authors to publish SMARTR on either Bioconductor or CRAN in the future. The high standards set by Bioc/CRAN will ensure that SMARTR is able to be easily installed and used across major operating systems for the long term. 

      We greatly thank the reviewer for pointing out this weakness; long-term maintenance of this package is certainly a mutual goal. An .RData file can be loaded either by double-clicking the file in a directory window (after specifying that this file type should be opened in RStudio) or by using the load() function, e.g., load("directory/example.RData"). We have now explicitly outlined these directions in the online documentation.

      Moreover, we have recently submitted our package to CRAN and are currently working on revisions following comments. This has required a package rebranding to “SMARTTR”, as there were naming conflicts with a previously archived repository on CRAN. Currently, SMARTTR is not dependent on the WholeBrain package, which remains optional for the registration portion of our workflow. Ultimately, this independence will allow us to maintain the analysis and visualization portion of the package independently.

      In the meantime, we have fully revised our installation instructions (https://mjin1812.github.io/SMARTTR/articles/SMARTTR). SMARTTR is now downloadable from a CRAN-like repository as a bundled .tar.gz file, which should ease the burden of installation significantly. Installation has been verified on a number of different versions of R on different platforms. Again, we hope these changes are sufficient and improve the process of installation. 

      (2) The package is quite large (several thousand lines including comments and whitespace). While impressive, this does inherently make the package more difficult to maintain - and the authors currently have not included any unit tests. The authors should add unit tests to cover a large percentage of the package to ensure code stability.

      We have added unit testing to improve the reliability of our package. Unit tests now cover over 71% of our source code base and are available for evaluation in our GitHub repository (https://github.com/mjin1812/SMARTTR). We focused on coverage of the most front-facing functions. We appreciate this feedback, which has ultimately enhanced the longevity of our software.

      (3) Why do the authors choose to perform image segmentation outside of the SMARTTR package using ImageJ macros? Leading segmentation algorithms such as CellPose and StarMap have well-documented APIs that would be easy to wrap in R. They would likely be faster as well. As noted in the discussion, making SMARTTR a one-stop shop for multi-ensemble analyses would be more appealing to a user. 

      We appreciate this feedback. We believe parts of our response to Reviewer 1, Comment 3, are relevant to this point. The interfaces for CellPose and ClusterMap (which processes in situ transcriptomic data, such as STARmap) are both in Python, and there are currently ways to call Python from within R (https://rstudio.github.io/reticulate/index.html). We will certainly explore incorporating these APIs from R. However, we anticipate that this capability would amount to “translation” between programming languages; it would not spare users from needing some familiarity with the capabilities of these Python packages, and thus with Python syntax.

      (4) Given the small number of observations for correlation analyses (n=6 per group), Pearson correlations would be highly susceptible to outliers. The authors chose to deal with potential outliers by dropping any subject per region that was > 2 SDs from the group mean. Another way to get at this would be using Spearman correlation. How do these analyses change if you use Spearman correlation instead of Pearson? It would be a valuable addition for the author to include Spearman correlations as an option in SMARTTR.

      We thank the reviewer for this suggestion. We have updated our code base so that the get_correlations() function can use Spearman’s correlation coefficient instead of Pearson’s correlation coefficient for heatmaps. Users can now set the `method` parameter to either “pearson” or “spearman”, and the chosen coefficient will propagate throughout the rest of the analysis.

      Below, in Author response image 1, we show a visual comparison of the correlation heat maps for active eYFP<sup>+</sup> ensembles in the CT and IS groups using both Pearson and Spearman correlations. We see a strong qualitative similarity between the heat maps. Of course, since the statistical assumptions underlying the relationship between variables differ between Pearson correlation (linear) and Spearman correlation (monotonic), users should take this into account when interpreting results obtained with the two approaches.

      Author response image 1.

      Pearson and Spearman regional correlations of eYFP+ ensemble activity in the CT and IS groups.
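
      To make the linear versus monotonic distinction concrete, the following is a minimal sketch in plain Python. It is not SMARTTR code and does not use the package's R API; the function names and the six per-subject counts are hypothetical, chosen so that one subject is an extreme value. It computes Pearson's coefficient directly and Spearman's coefficient as Pearson's coefficient applied to ranks.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranked data."""
    return pearson(ranks(x), ranks(y))


# Hypothetical cell counts for two regions across six subjects.
# The last subject is an extreme value in region_a only.
region_a = [10, 12, 14, 16, 18, 60]
region_b = [5, 6, 7, 8, 9, 10]

print("Pearson: ", round(pearson(region_a, region_b), 3))
print("Spearman:", round(spearman(region_a, region_b), 3))
```

      In this made-up example the two regions rise together in every subject, so the ranks agree perfectly and Spearman's coefficient is 1, whereas the single extreme count pulls Pearson's coefficient well below 1. With only six subjects per group, this is the kind of divergence users should keep in mind when choosing a method.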

      (5) I see the authors have incorporated the ability to adjust p-values in many of the analysis functions (and recommend the BH procedure) but did not use adjusted p-values for any of the analyses in the manuscript. Why is this? This is particularly relevant for the differential correlation analyses between groups (Figures 3P and 4P). Based on the unadjusted p-values, I assume few if any data points will still be significant after adjusting. While it's logical to highlight the regional correlations that strongly change between groups, the authors should caution which correlations are "significant" without adjusting for multiple comparisons. As this package now makes this analysis easily usable for all researchers, the authors should also provide better explanations for when and why to use adjusted p-values in the online documentation for new users.

      We appreciate the feedback and note that our dataset is presented as a demonstrative and exploratory resource for readers; as such, we accept a high tolerance for false positives in order to decrease the risk of missing potentially interesting findings. As noted by Reviewer #2, it is still “logical to highlight the regional correlations that strongly change between groups.” We have clarified in our methods that we chose to present uncorrected p-values when speaking of significance.

      We have also removed any previous recommendations for preferred methods of multiple comparisons adjustment from our function documentation, as some previous documentation was outdated. Moreover, the standard multiple comparisons adjustment approaches assume complete independence between tests, whereas this assumption is violated in our differential correlational analysis (i.e., a region with one significantly altered connection is more likely than another to have another significantly altered connection).

      Ultimately, the decision to correct for multiple comparisons with standard FDR, and the choice of significance threshold, should still be informed by standard statistical theory and by the user-defined tolerance for false positives and false negatives. This will be influenced by factors such as the nature and purpose of the study and the quality of the dataset.
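
      For readers unfamiliar with the BH procedure mentioned by the reviewer, the following is a minimal sketch of the Benjamini-Hochberg adjustment in plain Python. It is an illustration only, not the code path used in SMARTTR; the function name `bh_adjust` and the four p-values are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    Each p-value is scaled by m / rank (rank 1 = smallest p-value), then a
    running minimum taken from the largest p-value downward keeps the
    adjusted values monotone and capped at 1.
    """
    m = len(pvalues)
    # Indices of the p-values from largest to smallest.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, index in enumerate(order):
        rank = m - position  # rank among the ascending p-values
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical unadjusted p-values from four region-pair comparisons.
raw_p = [0.01, 0.04, 0.03, 0.005]
adjusted = bh_adjust(raw_p)

for raw, adj in zip(raw_p, adjusted):
    print(f"raw p = {raw:.3f}  ->  BH-adjusted p = {adj:.3f}")
```

      Adjusted values are never smaller than the raw ones, so a comparison that passes a threshold before adjustment may no longer pass it afterward. How much this matters depends on the number of tests and on the tolerance for false positives discussed above.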

      (6) The package was developed in R3.6.3. This is several years and one major version behind the current R version (4.4.3). Have the authors tested if this package runs on modern R versions? If not, this could be a significant hurdle for potential users. 

      We thank reviewers for pointing out concerns regarding versioning. We have since updated our installation approach for SMARTTR, which is compatible with versions of R >= 3.6 and has been tested on Mac ARM-based (Apple silicon) architecture (R v4.4.2), and Windows 10 (R v3.6.3, v4.5.0 [devel]). 

      The recommendation for users to install R 3.6.3 is primarily for those interested in using our full workflow, which requires installation of the WholeBrain package, which is currently a suggested package. We anticipate updating and supporting the visualization and network analysis capabilities, whilst maintaining previous versioning for the full workflow presented in this paper.  

      (7) In the methods section: "Networks were constructed using igraph and tidygraph packages." - As this is a core functionality of the package, it would be informative to specify the exact package versions, functions, and parameters for network construction. 

      We thank the reviewer for pointing out the necessity of these details for code reproducibility. We have since clarified our language in the manuscript regarding the exact functions and package versions used in our analysis, which we also fully document in our online tutorial. Additionally, we have printed our package development and analysis environment online at https://mjin1812.github.io/SMARTTR/articles/Part7.Development.

      (8) On page 11, "Next, we examined the cross-correlations in IEG expression across brain regions, as strong co-activation or opposing activation can signify functional connectivity between two regions" - cross-correlation is a specific analysis in signal processing. To avoid confusion, the authors should simply change this to "correlations". 

      We thank the reviewer for pointing out this potentially confusing phrasing. We have changed all instances of “cross-correlation” to “correlation”.

      (9) Panels Q-V are missing in Figure 5 caption. 

      We thank the reviewer for pointing out this oversight. We have now fixed this in our revision.

      References

      Chon, U., Vanselow, D. J., Cheng, K. C., & Kim, Y. (2019). Enhanced and unified anatomical labeling for a common mouse brain atlas. Nature Communications, 10(1), 5067. https://doi.org/10.1038/s41467-019-13057-w

      Dong, H. W. (2008). The Allen reference atlas: A digital color brain atlas of the C57Bl/6J male mouse (pp. ix, 366). John Wiley & Sons Inc.

      Fürth, D., Vaissière, T., Tzortzi, O., Xuan, Y., Märtin, A., Lazaridis, I., Spigolon, G., Fisone, G., Tomer, R., Deisseroth, K., Carlén, M., Miller, C. A., Rumbaugh, G., & Meletis, K. (2018). An interactive framework for whole-brain maps at cellular resolution. Nature Neuroscience, 21(1), 139–149. https://doi.org/10.1038/s41593-017-0027-7

      Yates, S. C., Groeneboom, N. E., Coello, C., Lichtenthaler, S. F., Kuhn, P.-H., Demuth, H.-U., Hartlage-Rübsamen, M., Roßner, S., Leergaard, T., Kreshuk, A., Puchades, M. A., & Bjaalie, J. G. (2019). QUINT: Workflow for Quantification and Spatial Analysis of Features in Histological Images From Rodent Brain. Frontiers in Neuroinformatics, 13. https://www.frontiersin.org/articles/10.3389/fninf.2019.00075

    1. Author response:

      The following is the authors’ response to the current reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Authors showed the presence of Mtb in human liver biopsy samples of TB patient and reported that chronic infection of Mtb causes immune-metabolic dysregulation. Authors showed that Mtb replicates in hepatocytes in a lipid rich environment created by up regulating transcription factor PPARγ. Authors also reported that Mtb protects itself from anti-TB drugs by inducing drug metabolising enzymes.

      Strengths:

      It has been shown that Mtb induces storage of triacylglycerol in macrophages by induction of WNT6/ACC2 which helps in its replication and intracellular survival, however, creation of favorable replicative niche in hepatocytes by Mtb is not reported. It is known that Mtb infect macrophages and induces formation of lipid-laden foamy macrophages which eventually causes tissue destruction in TB patient. In a recent article it has been reported that "A terpene nucleoside from M. tuberculosis induces lysosomal lipid storage in foamy macrophages" that shows how Mtb manipulates host defense mechanisms for its survival. In this manuscript, authors reported the enhancement of lipid droplets in Mtb infected hepatocytes and convincingly showed that fatty acid synthesis and triacylglycerol formation is important for growth of Mtb in hepatocytes. Authors also showed the molecular mechanism for accumulation of lipid and showed that the transcription factor associated with lipid biogenesis, PPARγ and adipogenic genes were upregulated in Mtb infected cells.

      The comparison of gene expression data between macrophages and hepatocytes by authors is important which indicates that Mtb modulates different pathways in different cell type as in macrophages it is related to immune response whereas, in hepatocytes it is related to metabolic pathways.

      Authors also reported that Mtb residing in hepatocytes showed drug tolerance phenotype due to up regulation of enzymes involved in drug metabolism and showed that cytochrome P450 monooxygenase that metabolize rifampicin and NAT2 gene responsible for N-acetylation of isoniazid were up regulated in Mtb infected cells.

      Weaknesses:

      There are reports of hepatic tuberculosis in pulmonary TB patients especially in immune-compromised patients, therefore finding granuloma in human liver biopsy samples is not surprising.

      Mtb infected hepatic cells showed induced DME and NAT and this could lead to enhanced metabolism of drug by hepatic cells; as a result, Mtb inside HepG2 cells is exposed to a reduced drug concentration and shows higher tolerance to drug. Authors mentioned that "hepatocyte resident Mtb may display higher tolerance to rifampicin". In my opinion higher tolerance to drug is possible only when DME of Mtb inside is up regulated or target is modified. Although, in the end authors mentioned that drug tolerance phenotype can be better attributed to host intrinsic factors rather than Mtb efflux pumps. It may be better if Drug tolerant phenotype section can be rewritten to clarify the facts.

      In the revised manuscript, by immune-staining authors convincingly showed that hepatocytes are a favourable niche for replication of MTb.

      Authors have rewritten the drug tolerant phenotype section which reads better.

      Overall, this paper has new and important information on how MTb establishes a favourable niche for growth in hepatocytes and creates a drug tolerant environment.

      We thank the reviewer for the thorough and insightful review.

      Reviewer #2 (Public review):

      The manuscript by Sarkar et al has demonstrated the infection of liver cells/hepatocytes with Mtb and the significance of liver cells in the replication of Mtb by reprogramming lipid metabolism during tuberculosis. Besides, the present study shows that similar to Mtb infection of macrophages (reviewed in Chen et al., 2024; Toobian et al., 2021), Mtb infects liver cells but with a greater multiplication owing to consumption of enhanced lipid resources mediated by PPARg that could be cleared by its inhibitors. The strength of the study lies in clinical evaluation of the presence of Mtb in human autopsied liver samples from individuals with miliary tuberculosis and presence of a clear granuloma-like structure. The interesting observation is of granuloma-like structure in liver which prompts further investigations in the field.

      The modulation of lipid synthesis during Mtb infection, such as PPARg upregulation, appears generic to different cell types including both liver cells and macrophage cells. It is also known that infection affect PPARγ expression and activity in hepatocytes. It is also known that this can lead to lipid droplet accumulation in the liver and the development of fatty liver disease (as shown for HCV). This study is in similar line for M.tb infection. As liver is the main site for lipid regulation, the availability of lipid resources is greater and higher is the replication rate. In short, the observations from the study confirm the earlier studies with these additional cell types. It is known that higher the lipid content, greater are Lipid Droplet-positive Mtb and higher is the drug resistance (Mekonnen et al., 2021). The DMEs of liver cells add further to the phenotype.

      Comments on revised version:

      The authors noted that even in experiments where mice were infected with lower CFUs, the presence of Mtb colonies could still be detected in the liver. It would be beneficial to include some experimental data related to this in the supplementary information, as it could provide valuable insights for the research field.

      We thank the reviewer for the in-depth evaluation of our manuscript. As suggested, we will include the data in which Mtb was detected in the liver at low CFUs.

      Reviewer #3 (Public review):

      In this revised manuscript, the authors explore how Mtb can infect hepatocytes and create a favorable niche associated with upregulation of the transcription factor PPARγ which presumably allows the bacteria to scavenge lipids from lipid droplets in host cells and upregulate drug-metabolizing enzymes to protect against its elimination. In response to the review, the authors have performed some additional immunostaining of hepatocytes, added more detail to figure legends, added experiments somewhat showing improved colocalization and staining, clarified several points and paragraphs, and updated the referenced literature and discussion.

      The current manuscript provides evidence that human miliary TB patients have infection of hepatocytes with Mtb, with evidence that the bacteria survive at least partially through upregulation of PPARγ, which significantly changes the lipid milieu of the cells. There is also an examination of transcriptomics and lipid metabolism in response to Mtb infection, as well as drug tolerance of Mtb inside hepatocytes. The current manuscript is an improvement over the previous one.

      However, although the manuscript is improved, tissue immunophenotyping of the various cells in the liver remains weak and unconvincing. This is truly a missed opportunity and lessens the rigor of the central findings and conclusions. As pointed out by another reviewer, literature has described different fates of Mtb in the liver. Given the tissue available to the authors, carefully dissecting the various cells that the bacteria are in (esp. hepatocytes versus Kupffer cells) is critical. The authors use only 2 generic markers and do not distinguish among cell types within the tissue slices. A review of the literature shows a variety of both human and mouse antibody markers. In fact, a liver atlas based on immunophenotyping has been published. Likewise, the authors comment on liver granulomas, but this is not justified without immunophenotyping.

      We would like to thank the reviewer for the in-depth and detailed suggestions. We would like to clarify that the primary aim of our study was to determine the localization of Mtb within hepatocytes and the downstream biological consequences. To this end, we employed two well-established and widely validated markers (ASGPR1 and albumin) that are consistently used to identify hepatocytes in both human and murine liver tissue. While we acknowledge the broader potential of comprehensive immunophenotyping, our focused approach was designed to specifically address the question of hepatocyte involvement, which the selected markers effectively support, a point also noted by Reviewer 1.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      In my opinion this paper contains important information and no further information is required for this manuscript.

      We thank the reviewer for the insightful comments.

      Reviewer #2 (Recommendations for the authors):

      The authors noted that even in experiments where mice were infected with lower CFUs, the presence of Mtb colonies could still be detected in the liver. It would be beneficial to include some experimental data related to this in the supplementary information, as it could provide valuable insights for the research field.

      As suggested, we will include the data with the low CFUs in the updated manuscript.

      Reviewer #3 (Recommendations for the authors):

      • Line 340, the fact that PPARγ inhibition decreases bacterial load should not be surprising, as the authors cite several papers where this is already shown.

      • Line 379, the increased tolerance of Mtb to drugs in hepatocytes is only significant at the lower 2 concentrations, not at 5 ug/mL.

      • Fig S4F-H, the y axis is inappropriately not set to zero on the lower limit.

      • Fig S9B, the Y-axis states "relative" CFU, but there is no indication what the bars are normalized to, and the numbers are much more typical of standard CFU values. Was the "Relative" part left in by mistake?

      • Double check the ending of the figure legend for Figure S10 and S11.

      • Line 352, phenomenom [sic] is misspelled.

      • On re-read, several sentences throughout this manuscript need improvement regarding structure and grammar. I suggest careful editorial review.

      We thank the reviewer for pointing out these issues; they will be carefully corrected in the next version.


      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors showed the presence of Mtb in human liver biopsy samples of TB patients and reported that chronic infection of Mtb causes immune-metabolic dysregulation. Authors showed that Mtb replicates in hepatocytes in a lipid rich environment created by up regulating transcription factor PPARγ. Authors also reported that Mtb protects itself from anti-TB drugs by inducing drug metabolising enzymes.

      Strengths:

      It has been shown that Mtb induces storage of triacylglycerol in macrophages by induction of WNT6/ACC2 which helps in its replication and intracellular survival, however, creation of favorable replicative niche in hepatocytes by Mtb is not reported. It is known that Mtb infects macrophages and induces formation of lipid-laden foamy macrophages which eventually causes tissue destruction in TB patients. In a recent article it has been reported that "A terpene nucleoside from M. tuberculosis induces lysosomal lipid storage in foamy macrophages" that shows how Mtb manipulates host defense mechanisms for its survival. In this manuscript, authors reported the enhancement of lipid droplets in Mtb infected hepatocytes and convincingly showed that fatty acid synthesis and triacylglycerol formation is important for growth of Mtb in hepatocytes. The authors also showed the molecular mechanism for accumulation of lipid and showed that the transcription factor associated with lipid biogenesis, PPARγ and adipogenic genes were upregulated in Mtb infected cells.

      The comparison of gene expression data between macrophages and hepatocytes by authors is important which indicates that Mtb modulates different pathways in different cell type as in macrophages it is related to immune response whereas, in hepatocytes it is related to metabolic pathways.

      Authors also reported that Mtb residing in hepatocytes showed drug tolerance phenotype due to up regulation of enzymes involved in drug metabolism and showed that cytochrome P450 monooxygenase that metabolize rifampicin and NAT2 gene responsible for N-acetylation of isoniazid were up regulated in Mtb infected cells.

      We thank the reviewer for the positive feedback and for highlighting the strengths of our study.

      Weaknesses:

      There are reports of hepatic tuberculosis in pulmonary TB patients especially in immune-compromised patients, therefore finding granuloma in human liver biopsy samples is not surprising.

      Mtb infected hepatic cells showed induced DME and NAT and this could lead to enhanced metabolism of drug by hepatic cells; as a result, Mtb inside HepG2 cells is exposed to a reduced drug concentration and shows higher tolerance to drug. The authors mentioned that "hepatocyte resident Mtb may display higher tolerance to rifampicin". In my opinion higher tolerance to drugs is possible only when DME of Mtb inside is up regulated or the target is modified. Although, in the end authors mentioned that drug tolerance phenotype can be better attributed to host intrinsic factors rather than Mtb efflux pumps. It may be better if the Drug tolerant phenotype section can be rewritten to clarify the facts.

      We agree that several case studies regarding liver infection in pulmonary TB patients have been reported in the literature; however, this report is the first comprehensive study to establish hepatocytes as a favourable niche for Mtb survival and growth.

      Drug tolerance is a phenomenon exhibited by the bacteria that, during host-pathogen interactions, can be influenced by both intrinsic (bacterial) and extrinsic (host-mediated) factors. Multiple examples of tolerance being attributed to host-driven factors can be found in the literature (PMID: 32546788, PMID: 28659799, PMID: 32846197). Our studies demonstrate that Mtb infected hepatocytes create a drug tolerant environment by modulating the expression of drug modifying enzymes (DMEs) in the hepatocytes.

      As suggested by the reviewer we will rewrite the drug tolerant phenotype section.

      Reviewer #2 (Public review):

      The manuscript by Sarkar et al has demonstrated the infection of liver cells/hepatocytes with Mtb and the significance of liver cells in the replication of Mtb by reprogramming lipid metabolism during tuberculosis. Besides, the present study shows that similar to Mtb infection of macrophages (reviewed in Chen et al., 2024; Toobian et al., 2021), Mtb infects liver cells but with a greater multiplication owing to consumption of enhanced lipid resources mediated by PPARg that could be cleared by its inhibitors. The strength of the study lies in the clinical evaluation of the presence of Mtb in human autopsied liver samples from individuals with miliary tuberculosis and the presence of a clear granuloma-like structure. The interesting observation is of granuloma-like structure in liver which prompts further investigations in the field.

      The modulation of lipid synthesis during Mtb infection, such as PPARg upregulation, appears generic to different cell types including both liver cells and macrophage cells. It is also known that infection affect PPARγ expression and activity in hepatocytes. It is also known that this can lead to lipid droplet accumulation in the liver and the development of fatty liver disease (as shown for HCV). This study is in a similar line for M.tb infection. As the liver is the main site for lipid regulation, the availability of lipid resources is greater and higher is the replication rate. In short, the observations from the study confirm the earlier studies with these additional cell types. It is known that higher the lipid content, the greater are Lipid Droplet-positive Mtb and higher is the drug resistance (Mekonnen et al., 2021). The DMEs of liver cells add further to the phenotype.

      We thank the reviewer for emphasizing the strengths of our study and how it can lead to further investigations in the field.

      Reviewer #3 (Public review):

      This manuscript by Sarkar et al. examines the infection of the liver and hepatocytes during M. tuberculosis infection. They demonstrate that aerosol infection of mice and guinea pigs leads to appreciable infection of the liver as well as the lung. Transcriptomic analysis of HepG2 cells showed differential regulation of metabolic pathways including fatty acid metabolic processing. Hepatocyte infection is assisted by fatty acid synthesis in the liver and inhibiting this caused reduced Mtb growth. The nuclear receptor PPARg was upregulated by Mtb infection and inhibition or agonism of its activity caused a reduction or increase in Mtb growth, respectively, supporting data published elsewhere about the role of PPARg in lung macrophage Mtb infection. Finally, the authors show that Mtb infection of hepatocytes can cause upregulation of enzymes that metabolize antibiotics, resulting in increased tolerance of these drugs by Mtb in the liver.

      Overall, this is an interesting paper on an area of TB research where we lack understanding. However, some additions to the experiments and figures are needed to improve the rigor of the paper and further support the findings. Most importantly, although the authors show that Mtb can infect hepatocytes in vitro, they fail to describe how bacteria get from the lungs to the liver in an aerosolized infection. They also claim that "PPARg activation resulting in lipid droplets formation by Mtb might be a mechanism of prolonging survival within hepatocytes" but do not show a direct interaction between PPARg activation and lipid droplet formation and lipid metabolism, only that PPARg promotes Mtb growth. Thus, the correlations with PPARg appear to be there but causation, implied in the abstract and discussion, is not proven.

      The human photomicrographs are important and overall, well done (lung and liver from the same individuals is excellent). However, in lines 120-121, the authors comment on the absence of studies on the precise involvement of different cells in the liver. In this study there is no attempt to immunophenotype the nature of the cells harboring Mtb in these samples (esp. hepatocytes). Proving that hepatocytes specifically harbor the bacteria in these human samples would add significant rigor to the conclusions made.

      We thank the reviewer for nicely summarizing our manuscript.

      Our study establishes the involvement of liver and hepatocytes in pulmonary TB infection in mice. Understanding the mechanism of bacterial dissemination from the lung to the liver in aerosol infections demands a detailed separate study.

      Figures 6E and 6F show how a PPARγ agonist and antagonist modulate (increase and decrease, respectively) bacterial growth in hepatocytes (further supported by the CFU data in Supplementary Figure 9B). Likewise, the number of lipid droplets in hepatocytes increases and decreases with PPARγ agonist and antagonist treatment, respectively, as shown in Figures 6G and 6H. Collectively, these studies provide strong evidence that PPARγ activation leads to more lipid droplets that support better Mtb growth.

      We thank the reviewer for finding our human photomicrographs convincing. In the manuscript, we provide evidence for the direct involvement of the hepatocytes (and liver) in Mtb infection. We have performed detailed immunophenotyping of hepatocytes in the mouse model with ASGPR1 (asialoglycoprotein receptor 1), and in the revised version of record we have further stained the infected hepatocytes with an anti-albumin antibody.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      In my opinion drug tolerant phenotype section should be rewritten for better clarification. The manuscript contains important information about hepatic tuberculosis which are not reported yet.

      We have rewritten the drug tolerant phenotype section for better clarity.

      We appreciate the reviewer’s comment regarding the important information about hepatic tuberculosis.

      Reviewer #2 (Recommendations for the authors):

      The following are some observations and comments on the manuscript.

      (1) The study delves into the mechanisms related to hepatic TB/miliary TB; however, the introduction and discussion only describe and discuss the data in the context of pulmonary TB giving a sense that the mandate of the MS is the exploration of the role of liver cells in pulmonary TB. There appears a gap in the connection of findings from the Miliary TB to the pulmonary TB. A discussion of the conversion of pulmonary TB to extrapulmonary /hepatic TB in the light of the findings may be helpful.

      We have modified the Discussion section to include possible mechanisms that convert pulmonary TB to hepatic TB in the light of our findings. Briefly, pulmonary tuberculosis (TB) can lead to miliary TB, probably through hematogenous dissemination, in which Mtb spreads from the infected lungs into blood vessels from a primary lung focus, reactivated TB, or caseous necrosis. Once in the blood vessels, the bacteria seed multiple organs, forming the tiny granulomas characteristic of miliary TB. Liver involvement could arise either through direct hematogenous spread or through extrusion from nearby infected lymph nodes, leading to hepatic TB, which presents with granulomas and liver dysfunction. This spread underscores the severity of untreated pulmonary TB and the need for early intervention. Our in vivo infection data clearly show that pulmonary infection with Mtb in mice and guinea pigs can steadily lead to significant infection of the liver and to metabolic abnormalities in the liver. The study further highlights the need for systemic studies of the route and mode of dissemination from lungs to liver, both for better pathophysiological understanding of the disease and for creating new therapeutic targets.

      (2) The authors show the presence of Mtb in the liver autopsies of miliary tuberculosis patients. It is well known that Mtb disseminates during the late stages to several organs and liver is a major site (Sharma et al. 2005; 10.1016/S1473-3099(05)70163-8). Other clinical observations also point to the fact that although Mtb infects liver cells, it is cleared (Thandi et al., 2018, https://doi.org/10.4049/jimmunol.200.Supp.173.20). As the samples are from miliary TB, it is expected that the bacterial load must have been very high before spreading to blood. It is known that once in blood, M.tb is expected to spread to various organs, especially highly vascular ones. Were any other tissues (especially with high vasculature) stained and verified? If yes, add to the supplementary data or discuss.

      Other tissues were not collected and stained during this study. Studies are currently underway to determine whether other highly vascularized organs also harbour Mtb. In addition, several studies have shown that Mtb can infect a wide range of organs, such as the brain, kidney, and bone marrow (PMID: 33142108, PMID: 28046053, PMID: 34269789), during miliary conditions.

      (3) It is not evident from this paper if hepatic infiltration occurs in pulmonary TB patients? It may therefore be important to discuss the status of liver infections in the primary pulmonary infection.

      Based on the available data from human biopsied liver samples, there is an indication of liver involvement in systemic tuberculosis (TB). However, to gain a more comprehensive understanding of hepatic infiltration in pulmonary TB patients, it is essential to conduct well-organized clinical studies. These studies should specifically target pulmonary TB patients and explore the extent and nature of liver involvement in these individuals. As suggested by the reviewer, this point is now included in the Discussion.

      (4) Similarly, in the mice model, M.tb was shown to localize to liver when aerosolic infection was given. Were any other tissues, such as kidney, bone marrow etc, checked? Is it because of the high dose of M.tb against the standard challenge dose of 50-100 CFU? Further, since the study in the mouse model is to mimic a miliary tuberculosis of liver, did the dissemination occur via bloodstream and if mycobacteremia could be observed in infected mice.

      Studies are currently underway to understand the involvement of other organs, such as the kidney, brain, and bone marrow, in the aerosol-infection mouse model and how dissemination to those distant organs occurs.

      The focus of the current study was to understand the role of liver in systemic tuberculosis with emphasis on hepatocytes as a key cell type to be infected. We have also conducted the experiments with lower CFUs and could detect the presence of Mtb colonies in liver, so we do not think that the infection of liver is dependent on the dose of infection.

      (5) There are studies in mouse model which infer that liver carried the lowest bacterial burden, was cleared the fastest, and it is established that as compared to sites persistently seeded by M. tuberculosis, in the liver the bacteria rarely infect cell types other than professional phagocytes. As the observations in this study are contrasting, the discussion section should include a critical comparative analysis to justify why in the conditions used in the study, the hepatocytes and not Kupffer cells are infected. Other than the morphological description to indicate M.tb infection of hepatocytes in the liver section (fig 1E), it will be good to show localization of M.tb specifically to hepatocytes by using hepatocyte specific marker. Unlike as reported, why was a clearance of M.tb not observed even after 10 weeks (figure 2B).

      While some studies show that Mtb is cleared quickly from the liver, several other studies report that the liver harbours Mtb even after 10 weeks post infection (PMID: 22359543, PMID: 21533158, PMID: 29242198). We have consistently observed Mtb infection of the liver after week 10 in our infection model.

      We have performed detailed immunophenotyping of hepatocytes in the mouse model with ASPGR1 (asialoglycoprotein receptor 1), and in the revised version of record we have further stained the isolated hepatocytes with anti-albumin antibody (albumin is a robust marker of hepatocyte identity) and have shown the presence of Mtb in them. The data have been included in the revised manuscript (Fig 2J).

      (6) While the Results section mentions "individuals with miliary tuberculosis" (line 107), the legend of Figure 1 reads "Presence of Mtb in human pulmonary tuberculosis patients". This is confusing. Please clarify.

      We thank the reviewer for pointing this out. We have changed the figure legend to miliary tuberculosis, as most of the liver biopsy samples were obtained from miliary tuberculosis patients.

      (7) Supplementary Figure 2D: Corresponding control panel (uninfected) should be added, which will also verify the specificity of Ag85b. As it is known that Ag85B is secreted out from the bacteria and hence the detected signals may not confirm that Mtb is in hepatocytes. Ag85B per bacterium decreases by almost 10,000-fold at later stages of infection because of secretion (Ernst JD, Cornelius A, et al 2019 mBio). In Supl figure 2D, Ag85b signal seems to be present everywhere inside the cells. Hence, it is important that the control panel be added.

      We have included a control image below, which shows no Ag85B staining in the uninfected sample. While we acknowledge the reviewer’s comment, Ag85B has been consistently used as a marker for the presence of Mtb in multiple studies. Nargan et al. use Ag85B-based staining to characterize infection in both pulmonary and EPTB samples (PMID: 38880068). Jain et al. use Ag85B to characterize Mtb infection of mesenchymal stem cells in lung biopsy samples of pulmonary TB patients (PMID: 32546788).

      Author response image 1.

      Ag85B staining in uninfected mice shows no signals

      (8) The kinetics experiments in Figure 3D-3G should have used time-lapse microscopy of a few of the infected cells, or the data should be represented as CFU. If we consider that the doubling time of H37Rv is about 22 h to 24 h, the data showing that MFI increases dramatically from 5 HPI to 120 HPI give the impression that the bacterial number inside the cells increased faster than its doubling time allows.
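
      The growth argument in this comment can be checked with simple arithmetic. The sketch below assumes unrestricted exponential growth with a fixed doubling time and uses only the time points (5 and 120 HPI) and doubling times (22 h to 24 h) quoted in the comment; the function name is illustrative and is not taken from the manuscript.

```python
def max_fold_increase(start_h, end_h, doubling_time_h):
    """Upper bound on the fold change in bacterial number between two time
    points, assuming exponential growth with a constant doubling time."""
    doublings = (end_h - start_h) / doubling_time_h
    return 2 ** doublings


# Time window and doubling times as quoted in the reviewer's comment.
for doubling_time in (22, 24):
    fold = max_fold_increase(5, 120, doubling_time)
    print(f"doubling time {doubling_time} h: at most {fold:.1f}-fold increase")
```

      Under these assumptions the ceiling is roughly 28- to 37-fold, so a much larger rise in MFI would be difficult to attribute to replication alone.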

      We have added the modified plot. As suggested, the CFU of Mtb within HepG2, PHCs, THP-1, RAW 264.7 and BMDMs have been included in the revised version (Supplementary Figure 4 D-H)

      (9) What is the effect of C45 and T863 on Mtb growth in vitro? A direct effect of C45 and T863 on Mtb growth in vitro should be tested and ruled out. Is the representative image in Figure 5F from the DMSO-treated or the C45-treated panel? Please specify.

      As per the reviewer’s suggestion we have seen the effect of C45 (30 µM) and T863 (25 µM) on Mtb growth in vitro and did not find any difference in the growth kinetics. The representative image in Figure 5F is DMSO treated cells.

      Author response image 2.

      Growth kinetics of Mtb in 7H9 medium with DMSO, C75 and T863

      (10) Supplementary Figure 6B: Correct the Y-axis label from mRNA levels to Fold change (normalised to control). Please do similar changes wherever required.

      We have made the necessary changes as per the suggestion of the reviewer.

      (11) Figure 7B and 7C: How was the normalization performed? Is the data normalized to the number of bacteria that entered the specific cell type or was normalized at 48hrs with respect to DMSO? DMSO alone data should be shown.

      In the drug tolerance assays, we calculated the ratio of the bacterial burden in hepatocytes treated with drugs to that in hepatocytes treated with DMSO. Cells were infected for 48 hours, after which they were treated with the mentioned concentrations of isoniazid and rifampicin for 24 hours. CFU enumeration was conducted after this 24-hour treatment. Figure 7A gives a schematic of the experimental setup.

      % tolerant bacterial population = (A / B) × 100, where A is the CFU of Mtb from infected hepatocytes treated with drug and B is the CFU of Mtb from infected cells treated with DMSO. Thus, the effect of MOI is negated.

      To lend further credence to the CFU data, we also analysed these assays by microscopy, where no cell death was observed under these conditions. Mouse BMDMs were used as a macrophage control. We calculated the % tolerance as the ratio of the mean fluorescence intensity (MFI) of GFP-Mtb per hepatocyte treated with drug to the MFI of GFP-Mtb per hepatocyte treated with DMSO (control). More than 20 fields, each containing more than 4 infected cells, were used for the analysis, providing additional evidence that anti-TB drugs kill Mtb less effectively in hepatocytes than in BMDMs. All these details are included in the manuscript.
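
      The tolerance calculation described above can be sketched in a few lines. This is a minimal illustration of the stated ratio (drug-treated readout divided by DMSO-treated readout, times 100), which applies equally to CFU counts and to per-cell MFI values; the function name and the numbers are hypothetical and do not come from the study.

```python
def percent_tolerant(drug_treated, dmso_treated):
    """Return (A / B) * 100, where A is the readout (CFU or MFI) from
    drug-treated infected cells and B is the readout from DMSO-treated
    infected cells of the same cell type."""
    if dmso_treated <= 0:
        raise ValueError("DMSO control readout must be positive")
    return drug_treated / dmso_treated * 100.0


# Hypothetical CFU counts, for illustration only.
cfu_drug = 3.0e4   # A: infected hepatocytes treated with drug
cfu_dmso = 1.2e5   # B: infected hepatocytes treated with DMSO
print(f"{percent_tolerant(cfu_drug, cfu_dmso):.1f}% tolerant population")
```

      Because each cell type is normalized to its own DMSO control, differences in initial uptake between cell types cancel out of the ratio.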

      (12) While authors have shown the changes in mRNA levels of CYP3A4, CYP3A43, NAT2, the protein or activities of some of these should be measured to verify the effect.

      Currently studies are underway to understand the activities of the key proteins involved in isoniazid and rifampicin metabolism and will be published as a separate manuscript.

      Reviewer #3 (Recommendations for the authors):

      Additional comments are:

      • Figure 2D, the 20X and 40X magnifications do not look appreciably different in size. Please double-check that the correct images were used.

      We thank the reviewer for pointing this out; we have corrected it in the revised version.

      • Lines 162-164: The authors state almost 100% purity. However, the contour plot in 2F appears to show 2 cell populations. Figure 2G is missing a legend of which colors correspond to which staining (and again there appears to be highly variable staining).

      We agree with the reviewer that two contours are observed in Figure 2F. Although both contours are positive for ASPGR1 protein, the level of ASPGR1 expression is variable. The corresponding confocal image (nucleus stained with DAPI, and ASPGR1 stained with an ASPGR1 antibody and an Alexa Fluor 555-conjugated secondary antibody) also indicates variable staining of isolated primary hepatocytes, with some cells giving a stronger signal than others, which visually supports our statement. Moreover, several studies show differential expression of ASPGR1 protein in hepatocyte-like cells (PMID: 27143754).

      To further clarify the identity of the hepatocytes, we have stained primary hepatocytes from infected mouse livers with albumin antibody (a stable marker for hepatocytes) and Ag85B (Figure 2J).

      Multiple figures throughout the manuscript, including this one, would benefit from the use of arrows to depict what is described in the legend and text more clearly, and the use of higher power insets to better define cell architecture. Finally, some images appear blurry to the eye. Improvements are needed throughout.

      As per the suggestion, we have modified the figures and figure legends for better clarity.

      • Lines 153-155. Albumin, AST and GGT appear to be significantly up at week 8, contradicting the statement that there is no change until week 10.

      We thank the reviewer for pointing this out and have made suitable changes in the text.

      • Lines 203-205: The authors state earlier that bacteria survive in macrophage phagosomes. Do the authors know the niche for bacteria in hepatocytes that enable them to continue to grow? Transcriptome data from HepG2 cells suggest perhaps a phagosomal pathway?

      We thank the reviewer for this insightful question. As the reviewer rightly points out, the transcriptome data indeed suggest changes in several important pathways, such as macroautophagy, Golgi vesicular transport, and vacuolar transport, which can affect the subcellular localisation of Mtb within hepatocytes. High-resolution microscopic studies of the subcellular localisation of labelled Mtb within primary hepatocytes, HepG2, and THP-1 have been conducted, and the % colocalization with different intracellular compartments has been measured. An image of the colocalization of labelled Mtb within PHCs is shown below, along with the % colocalization with various compartments in PHCs, HepG2, and THP-1.

      Author response image 3.

      Colocalisation of Mtb-GFP with various intra-cellular markers within PHCs.

      Author response image 4.

      Percentage Colocalisation of Mtb-GFP with various intra-cellular markers within PHCs, HepG2 and THP-1.

      • Validation of some critical genes found in the HepG2 cells should be done by qRTPCR in primary hepatocytes.

      Some of the key genes identified in HepG2 have been validated by qRT-PCR in primary hepatocytes at 24 hours post infection. The majority of the genes show a similar trend.

      Author response image 5.

      Gene expression analysis of the mentioned genes in Mtb infected PHCs as compared to the uninfected control.

      • Lines 259-260: The authors state a high degree of co-localization. The photomicrograph of a single cell in Fig. 5D is not convincing. I'm not even sure that they are really in the same subcellular compartment. Co-localization stated in Fig. S8B is also not convincing as shown.

      The image currently shown in figure 3D is a maximum intensity projection image of multiple z-stacks encompassing the entire cell.

      We agree with the reviewer with respect to Fig. S8B and will modify the text and the figure legend accordingly.

      Copywriting edits:

      • It is difficult to see individual gene names in Figures 4D and 4E. A higher resolution or larger font would be appreciated for the reader.

      An excel file with the top differentially regulated genes at both 0 hours post infection and 48 hours post infection has been added.

      • Figure 5A has a shadow on the top right image.

      We have changed the image in the revised manuscript

      • Figure 5E is difficult to read the labels on the axes; it would be better in general to make the labels separately instead of relying on the graphing software, since these labels can get stretched when the size of the graph is modified.

      We agree with the reviewer and have made necessary changes.

      • Line 163: should be "percent" and not "perfect."

      We thank the reviewer for pointing it out and have corrected it

      • Line 190: is missing a period at the end of the sentence "...for further experiments"

      We thank the reviewer for pointing it out and have corrected it

      • Line 332: should be "hepatocytes" instead of "hepatoctyte" [sic]

      We thank the reviewer for pointing it out and have corrected it

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Valk and Engert et al. examined the potential relations between three different mental training modules, hippocampal structure and functional connectivity, and cortisol levels over a 9-month period. They found that among the three types of mental training, the Presence (attention and introspective awareness), Affect (socio-emotional - compassion and prosocial motivation), and Perspective (socio-cognitive - metacognition and perspective taking) modules, Affect training most consistently related to changes in hippocampal structure and function, specifically in the CA1-3 subfields of the hippocampus. Moreover, decreases in diurnal cortisol correlated with bilateral increases in volume, and decreases in diurnal and chronic cortisol correlated with left CA1-3 functional connectivity. Chronic cortisol levels were also related to right CA4/DG volume and left subiculum function. The authors demonstrate that mindfulness training programs impact the hippocampus and are a potential avenue for stress interventions, a potential avenue to improve health. The data contribute to the literature on plasticity of hippocampal subfields during adulthood, the impact of mental training interventions on the brain, and the link between CA1-3 and both short- and long-term stress changes. Additional clarification and extension of the methods is needed to strengthen the authors' conclusions.

      We thank the Reviewer for their positive evaluation and summary of our findings and work. We made additional changes as suggested by the Reviewer and hope this clarified any open points.

      (1) The authors thoughtfully approached the study of hippocampal subfields, utilizing a method designed for T1w images that outperformed Freesurfer 5.3 and that produced comparable results to an earlier version of ASHS. However, given the use of normalized T1-weighted images to delineate hippocampal subfield volume, some caution may be warranted (Wisse et al. 2020). While the authors note the assessment of quality control processes, the difficulty in ensuring valid measurement is an ongoing conversation in the literature. This also extends to the impact of functional co-registration using segmentations. I appreciate the inclusion of Table 5 in documenting reasons for missing data across subjects. Providing additional details on the distribution of quality ratings across subfields would help contextualize the results and ensure there is equal quality of segmentations across subfields.

      We thank the Reviewer for bringing up this point. In the current work, we assessed the overall segmentation of all six subfields per individual. Thus, unfortunately, we have no data of quality of segmentation of individual subfields beyond our holistic assessment. Indeed, registration of hippocampal subfields remains a challenge and we have further highlighted this limitation in the Discussion of the current work.

      “It is of note that the current work relies on a segmentation approach of hippocampal subfields including projection to MNI template space, an implicit correction for total brain volume through the use of a stereotaxic reference frame. Some caution for this method may be warranted, as complex hippocampal anatomy can in some cases lead to over- as well as underestimation of subfield volumes, and subfield boundaries may not always be clearly demarcated (1). Future work, studying the hippocampal surface at higher granularity, for example through unfolding the hippocampal sheet (2-5), may further help with both alignment and identification of not only subfield-specific change but also alterations as a function of the hippocampal long axis, a key dimension of hippocampal structural and functional variation that was not assessed in the current work (6, 7).”

      (2) Given the consistent pattern of finding results with CA1-3, in contrast to other subfields, it would help to know if the effects of the different training modules on subfields differed from each other statistically (i.e., not just that one is significant, and one is not) to provide an additional context of the strength of results focused on Affect training and CA1-3 (for example, those shown in Figure 3).

      Our work investigated whether the effects of the individual Training Modules differed from each other statistically. We found that the Affect Training Module showed increases in CA1-3 volume, and that these increases remained when testing effects relative to changes in this subfield following Perspective training and in retest controls. Moreover, in CA1-3 we found changes in functional connectivity when comparing the Affect to the Perspective Training Module. These changes were only present in this contrast, and were not significant in each of the Training Modules per se. To test for specificity, we additionally evaluated whether subfield-specific changes were present above and beyond changes in the other ipsilateral hippocampal subfields. Relative to other subfields, right CA1-3 showed increases in the Affect vs Perspective contrast (left: t-value: 2.298, p=0.022, Q>0.1; right: t-value: 3.045, p=0.0025, Q=0.015). No other subfield showed significant changes. We now include this statement in the revised Results and Supplementary Tables.

      “Moreover, associations between CA1-3 and Affect, relative to Perspective, seemed to go largely above and beyond changes in the other subfields (left: t-value: 2.298, p=0.022, Q>0.1; right: t-value: 3.045, p=0.0025, Q=0.015, see further Supplementary File 1h).”
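
      The Q values reported alongside the p values are false discovery rate (FDR) adjusted. The response does not state which FDR procedure was used, so the sketch below assumes the widely used Benjamini-Hochberg step-up adjustment; the function name and the example p values are hypothetical and are not the study's data.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values (q values), in the same
    order as the input list."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonic q values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Hypothetical uncorrected p values for a family of four tests.
p = [0.01, 0.04, 0.03, 0.5]
for p_i, q_i in zip(p, benjamini_hochberg(p)):
    print(f"p = {p_i:.3f}  ->  q = {q_i:.3f}")
```

      The adjusted value depends on how many tests are in the family, which is why the same uncorrected p value can pass or fail the threshold depending on the number of subfields and contrasts corrected together.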

      Author response table 1.

      Subfield-specific changes following the Training Modules, controlling for the other two ipsilateral subfields

      Reviewer #1 (Recommendations For The Authors):

      (1) In Figure 1, using different colors for subfields versus the modules (yellow, red, green) would help as it could lead the reader to try to draw connections between the two when it is namely a depiction of the delineations.

      As suggested, we updated Figure 1 accordingly and present the subfields in different shades of purple for clarity. Please find the updated figure below.

      Author response image 1.

      (2) In the Results, it was at times hard to follow when Affect or Perspective were the focus of the results. Perhaps the authors could restructure or add additional context for clarity.

      We are happy to clarify. For the first analysis on Module-specific changes in hippocampal subfield volume, we compared effects across Training Modules. Here, main contrasts were run between subjects (Presence vs active control) and within subjects (Affect versus Perspective). In additional secondary contrasts, we studied training effects vs retest control. After observing consistent increases in bilateral CA1-3 following Affect, in the following analysis we evaluated 1) intrinsic functional networks in main and supplementary contrasts and 2) diurnal cortisol measures within the Training Modules only and all three Training Modules combined, and also adopted 3) a multivariate approach (PLS) (see comments by Reviewer 2). We now also report effects of cortisol change on structural and functional subfield change in Presence and Perspective, for additional completeness and clarity.

      “To study whether there was any training module-specific change in hippocampal subfield volumes following mental training, we compared training effects between all three Training Modules (Presence, Affect, and Perspective). Main contrasts were: Presence vs Active control (between subjects) and Affect vs Perspective (within subjects). Supplementary comparisons were made vs retest controls and within training groups.”

      “Overall, for all hippocampal subfields, findings associated with volume increases in CA1-3 following the Affect training were most consistent across timepoints and contrasts (Supplementary File 1a-f).”

      “Subsequently, we studied whether hippocampal CA1-3 would show corresponding changes in intrinsic function following the Affect mental training.”

      “In particular, the moderately consistent CA1-3 volume increases following Affect training were complemented with differential functional connectivity alterations of this subfield when comparing Affect to Perspective training”

      “Last, we probed whether group-level changes in hippocampal subfield CA1-3 volume would correlate with individual-level changes in diurnal cortisol indices (Presence: n= 86; Affect: n=92; Perspective: n=81), given that the hippocampal formation is a nexus of the HPA-axis (8). We took a two-step approach. First, we studied associations between cortisol and subfield change, particularly focusing on the Affect module and CA1-3 volume based on increases in CA1-3 volume identified in our group-level analysis.”

      “We observed that increases in bilateral CA1-3 following Affect showed a negative association with change in total diurnal cortisol output […]”

      “We did not observe alterations in CA1-3 volume in relation to change in cortisol markers in Presence or Perspective. Yet, for Presence, we observed an association between slope and LCA4/DG change (t=-2.89, p=0.005, q=0.03) (Supplementary File 1uv).”

      “In case of intrinsic function, we also did not observe alterations in CA1-3 in relation to change in cortisol markers in Presence or Perspective, nor in other subfields (Supplementary File 1wx).”

      Author response table 2.

      Correlating change in subfield volume and diurnal cortisol indices in Presence. The main focus was on CA1-3, based on volumetric observations; these results are highlighted in bold.

      Author response table 3.

      Correlating change in subfield volume and diurnal cortisol indices in Perspective. The main focus was on CA1-3, based on volumetric observations; these results are highlighted in bold.

      Author response table 4.

      Association between stress-markers and within functional network sub-regions in Affect and Perspective.

      Author response table 5.

      Correlating change in subfield function and diurnal cortisol indices in Presence. The main focus was on CA1-3, based on volumetric observations; these results are highlighted in bold. For these multiple comparisons, FDRq values (corrected for two subfields) are reported if uncorrected p values are below .05.

      Author response table 6.

      Correlating change in subfield function and diurnal cortisol indices in Perspective. The main focus was on CA1-3, based on volumetric observations; these results are highlighted in bold. For these multiple comparisons, FDRq values (corrected for two subfields) are reported if uncorrected p values are below .05.

      (3) In the Methods, the authors note that corrections for multiple comparisons were used where needed, but throughout the manuscript there is some switching between corrected and uncorrected p-values. At times, this made it difficult to follow when these corrections were needed.

      For clarity, we added explicit multiple comparisons information a) in main and supplementary results, and b) wherever extra information was needed. Also, we only included main contrasts in Table 1-3 to avoid confusion and moved the information on changes in SUB and CA4/DG to the Supplementary tables.

      (4) Typically, when correcting for intracranial volume the purpose is to ensure that sexual dimorphism in the size of the brain is accounted for. I would recommend the authors assess whether sex differences are accounted for by the MNI normalization approach taken. In the reading of the original Methods paper for the patch-based algorithm used, ICV was used to transform to MNI152 space. It would help to have additional information on how the normalization was done in the current study in order to draw comparisons to other findings in the literature.

      We are happy to further clarify. In the current work, we used the same approach as in the original paper. Volumes were linearly registered to the MNI template using FSL flirt. We have now provided this additional information in the revised Methods.

      “Hippocampal volumes were estimated based on T1w data that were linearly registered to MNI152 using FSL flirt (http://www.fmrib.ox.ac.uk/fsl/), such that intracranial volume was implicitly controlled for.”

      We agree with the Reviewer that sex differences may still be present, and investigated this. At baseline, sex differences were found in all subfields in the left hemisphere, and right CA4/DG (FDRq<0.05). Regressing out ICV resolved remaining sex differences. We then evaluated whether main results of volumetric subfield change were impacted by ICV differences. Differences between Affect and Perspective remained stable. We have now added this additional analysis in the Supplementary Materials.

      “Although stereotaxic normalization to MNI space would in theory account for global sex differences in intra-cranial volume, we still observed sex differences in various subfield volumes at baseline. Yet, accounting for ICV did not impact our main results, suggesting changes in CA1-3 following Affect were robust to sex differences in overall brain volume (Supplementary File 1j).”
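
      Regressing out ICV, as described above, amounts to replacing each volume with its residual after a linear fit on ICV. The sketch below assumes a simple one-predictor ordinary least squares fit with an intercept; the authors' actual model may include further covariates, and the function name and values are hypothetical.

```python
from statistics import mean


def residualize(values, covariate):
    """Remove the linear effect of `covariate` from `values` using ordinary
    least squares with an intercept, and return the residuals."""
    mean_x, mean_y = mean(covariate), mean(values)
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    if sxx == 0:
        raise ValueError("covariate has no variance")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(covariate, values)]


# Hypothetical intracranial volumes and subfield volumes (arbitrary units).
icv = [1350.0, 1420.0, 1500.0, 1580.0, 1650.0]
ca13_volume = [1.90, 1.98, 2.05, 2.16, 2.21]
print([round(r, 4) for r in residualize(ca13_volume, icv)])
```

      Group comparisons (for example, female versus male) can then be run on the residuals, which by construction are uncorrelated with ICV.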

      Author response table 7.

      Sex differences (female versus male) in hippocampal subfield volumes.

      Reviewer #2 (Public Review):

      In this study, Valk, Engert et al. investigated effects of stress-reducing behavioral intervention on hippocampal structure and function across different conditions of mental training and in relation to diurnal and chronic cortisol levels. The authors provide convincing multimodal evidence of a link between hippocampal integrity and stress regulation, showing changes in both volume and intrinsic functional connectivity, as measured by resting-state fMRI, in hippocampal subfield CA1-3 after socio-affective training as compared to training in a socio-cognitive module. In particular, increased CA1-3 volume following socio-affective training overlapped with increased functional connectivity to medial prefrontal cortex, and reductions in cortisol. The conclusions of this paper are well supported by the data, although some aspects of the data analysis would benefit from being clarified and extended.

      A main strength of the study is the rigorous design of the behavioral intervention, including test-retest cohorts, an active control group, and a previously established training paradigm, contributing to an overall high quality of included data. Similarly, systematic quality checking of hippocampal subfield segmentations contributes to a reliable foundation for structural and functional investigations.

      We thank the Reviewer for the thoughtful summary and appreciation of our work, as well as requests for further clarification and analyses. We addressed each of them in a point by point fashion below.

      Another strength of the study is the multimodal data, including both structural and functional markers of hippocampal integrity as well as both diurnal and chronic estimates of cortisol levels.

      (1) However, the included analyses are not optimally suited for elucidating multivariate interrelationships between these measures. Instead, effects of training on structure and function, and their links to cortisol, are largely characterized separately from each other. This results in the overall interpretation of results, and conclusions, being dependent on a large number of separate associations. Adopting multivariate approaches would better target the question of whether there is cortisol-related structural and functional plasticity in the hippocampus after mental training aimed at reducing stress.

      We thank the Reviewer for this suggestion. Indeed, our project combined different univariate analyses to uncover the association between hippocampal subfield structure, function, and cortisol markers. While systematic, a downside of this approach is indeed that the interpretation of our results depends on a large number of analyses. To further explore the question of whether there is cortisol-related structural and functional plasticity in the hippocampus, we followed the Reviewer’s suggestion and additionally adopted a multivariate partial least squares (PLS) model. We ran two complementary models: one focusing on the bilateral CA1-3, as this region showed increases in volume following Affect training and differential change between Affect and Perspective training in our resting state analyses, and one including all subfields. Both models included all stress markers. We found that both models could significantly relate stress markers to brain measures, and that Affect in particular showed strong associations with the significant latent markers. Both analyses showed inverse effects of structure and function in relation to stress markers, and both slope and AUC changes showed the strongest loadings. We now include these analyses in the revised manuscript.

      Abstract

      “Of note, using a multivariate approach we found that other subfields, showing no group-level changes, also contributed to alterations in cortisol levels, suggesting circuit-level alterations within the hippocampal formation.”

      Methods

      “Partial least squares analysis

      To assess potential relationships between cortisol change and hippocampal subfield volume and functional change, we performed a partial least squares analysis (PLS) (9, 10). PLS is a multivariate associative model that optimizes the covariance between two matrices by generating latent components (LCs), which are optimal linear combinations of the original matrices (9, 10). In our study, we utilized PLS to analyze the relationships between change in volume and intrinsic function of hippocampal subfields and diurnal cortisol measures. Here we included all Training Modules and regressed out effects of age, sex, and random effects of subject on the brain measures before conducting the PLS analysis. The PLS process involves data normalization within training groups, cross-covariance, and singular value decomposition. Subsequently, subfield and behavioral scores are computed, and permutation testing (1000 iterations) is conducted to evaluate the significance of each latent factor solution (FDR corrected). We then report the correlation of the individual hippocampal and cortisol markers with the latent factors. To estimate confidence intervals for these correlations, we applied a bootstrapping procedure that generated 100 samples with replacement from subjects’ RSFC and behavioral data.”
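      The PLS steps described above (normalization, cross-covariance, singular value decomposition, permutation testing, bootstrapping) can be illustrated with a minimal sketch. This is not the authors' pipeline: it runs on synthetic data, all function and variable names are hypothetical, and it omits the covariate regression, the within-group normalization, and the FDR correction mentioned in the text. It assumes only NumPy.

```python
import numpy as np

def zscore(a):
    # Column-wise standardization (here across all subjects, not per training group).
    return (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)

def pls_svd(X, Y):
    # SVD of the cross-covariance between z-scored brain (X) and cortisol (Y) measures.
    Xz, Yz = zscore(X), zscore(Y)
    R = Yz.T @ Xz / (X.shape[0] - 1)              # shape (n_y, n_x)
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    return U, s, Vt.T, Xz, Yz                     # Y saliences, singular values, X saliences

def permutation_pvalues(X, Y, n_perm=1000, seed=0):
    # Shuffle subjects in X to break the X-Y pairing and rebuild a null for each singular value.
    rng = np.random.default_rng(seed)
    s_obs = pls_svd(X, Y)[1]
    exceed = np.zeros_like(s_obs)
    for _ in range(n_perm):
        perm = rng.permutation(X.shape[0])
        exceed += pls_svd(X[perm], Y)[1] >= s_obs
    return (exceed + 1) / (n_perm + 1)

def bootstrap_loading_ci(X, Y, n_boot=100, seed=0):
    # 95% percentile interval for the correlation of each Y variable with the first X score.
    rng = np.random.default_rng(seed)
    V0 = pls_svd(X, Y)[2]
    n = X.shape[0]
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample subjects with replacement
        _, _, V, Xb, Yb = pls_svd(X[idx], Y[idx])
        v = V[:, 0] * np.sign(V[:, 0] @ V0[:, 0]) # align sign with the original solution
        x_scores = Xb @ v
        boots.append([np.corrcoef(Yb[:, j], x_scores)[0, 1] for j in range(Y.shape[1])])
    return np.percentile(boots, [2.5, 97.5], axis=0)

# Synthetic example: one shared latent factor drives some columns of X and Y.
rng = np.random.default_rng(42)
n = 120
latent = rng.normal(size=n)
X = np.outer(latent, [1.0, -1.0, 0.5, 0.0]) + rng.normal(size=(n, 4))  # stand-in brain changes
Y = np.outer(latent, [1.0, 0.8, 0.0]) + rng.normal(size=(n, 3))        # stand-in cortisol changes

pvals = permutation_pvalues(X, Y)
ci = bootstrap_loading_ci(X, Y)
print("permutation p-values per latent component:", pvals)
print("bootstrap 95% CI of Y loadings on LC1:\n", ci)
```

      In this sketch the sign of each bootstrap solution is aligned to the original one, because singular vectors are only defined up to sign; more elaborate alignment (e.g. Procrustes rotation) is commonly used but is not shown here.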

      Results

      “Last, to further explore the question of whether there is concordant cortisol-related structural and functional plasticity in the hippocampus, we adopted a multivariate partial least squares approach, with 1000 permutations to account for stability (9, 10) and bootstrapping (100 times) with replacement. We ran two complementary models including all Training Modules whilst regressing out age, sex and random effects of subject. First, we focused on the bilateral CA1-3, as this region showed increases in volume following Affect training and differential change between Affect and Perspective training in our resting state analyses. The second model included structural and functional data of all subfields. Both models included all stress markers. We found that both models could identify significant associations between cortisol stress markers and hippocampal plasticity (FDRq<0.05), and that in particular Affect showed strongest associations with the latent markers for CA1-3 (Table 5). Both analyses showed inverse effects of subfield structure and function in relation to stress markers and both slope and AUC changes showed strongest associations with the latent factor.”

      Author response table 8.

      Multivariate PLS analyses linking cortisol markers to hippocampal subfield volume and function.

      Discussion

      “Last, performing multivariate analysis, we again observed associations between CA1-3 volume and function plasticity and stress change, strongest in Affect. Yet combining all subfields in a single model indicated that other subfields also link to stress alterations, indicating that ultimately circuit-level alterations within the hippocampal formation relate to latent changes in diurnal stress markers across Training Modules.”

      “This interpretation is also supported by our multivariate observations.”

      “In line with our observations in univariate analysis, we found multivariate associations between hippocampal subfield volume, intrinsic function and cortisol markers. Again, the contribution of volume and intrinsic function was inverse. This may possibly relate to the averaging procedure of the functional networks. Combined, outcomes of our univariate and multivariate analyses point to an association between change in hippocampal subfields and stress markers, and that these changes, at the level of the individual, ultimately reflect complex interactions within and across hippocampal subfields and may capture different aspects of diurnal stress. Future work may more comprehensively study the plasticity of the hippocampal structure, and link this to intrinsic functional change and cortisol to gain full insights in the specificity and system-level interplay across subfields, for example using more detailed hippocampal models (3). Incorporating further multivariate, computational, models is needed to further unpack and investigate the complex and nuanced association between hippocampal structure and function, in particular in relation to subfield plasticity and short and long-term stress markers.”

      “…based on univariate analysis. Our multivariate analysis further nuanced this observation, but again pointed to an overall association between hippocampal subfield changes and cortisol changes, but this time more at a systems level.”

      “Lastly, our multivariate analyses also point to a circuit level understanding of latent diurnal stress scores.”

      Author response image 2.

      Multivariate associations between changes in structure and function of hippocampal subfield volume and markers of stress change in Affect. A) Multivariate associations between bilateral CA1-3 volume and intrinsic function and stress markers. Left: Scatter of loadings, colored by Training Module; Right upper: individual correlations of stress markers; Right lower: individual correlation of subfields; B). Multivariate associations between all subfields’ volume and intrinsic function and stress markers. Left: Scatter of loadings, colored by Training Module; Right upper: individual correlations of stress markers; Right lower: individual correlation of subfields.

      (2) The authors emphasize a link between hippocampal subfield CA1-3 and stress regulation, and indeed, multiple lines of evidence converge to highlight a most consistent role of CA1-3. There are, however, some aspects of the results that limit the robustness of this conclusion. First, formal comparisons between subfields are incomplete, making it difficult to judge whether the CA1-3, to a greater degree than other subfields, display effects of training.

      We thank the Reviewer for this comment. To further test for specificity, we additionally evaluated subfield-specific changes relative to other subfields for our main contrasts (Presence versus Active Control and Affect versus Perspective). Relative to other subfields, right CA1-3 showed increases in the Affect vs Perspective contrast (left: t-value: 2.298, p=0.022, Q>0.1; right: t-value: 3.045, p=0.0025, Q=0.015); no other subfield showed significant changes. We now include this statement in Results and Supplementary Tables.

      “Moreover, associations between CA1-3 and Affect, relative to Perspective, seemed to go largely above and beyond changes in the other subfields (left: t-value: 2.298, p=0.022, Q>0.1; right: t-value: 3.045, p=0.0025, Q=0.015, see further Supplementary File 1h).”

      Author response table 9.

      Subfield-specific changes following the Training Modules, controlling for the other two ipsilateral subfields

      (3) Relatedly, it would be of interest to assess whether changes in CA1-3 make a significant contribution to explaining the link between hippocampal integrity and cortisol, as compared to structure and functional connectivity of the whole hippocampus.

      We thank the Reviewer for this comment. Please see the PLS analysis performed above (R2Q1). Indeed, not only CA1-3 but also other subfields seem to show a relationship with cortisol, in line with circuit level accounts on stress regulation and hippocampal circuit alterations (8, 11-15).

      (4) Second, both structural and functional effects (although functional to a greater degree), were most pronounced in the specific comparison of "Affect" and "Perspective" training conditions, possibly limiting the study's ability to inform general principles of hippocampal stress-regulation.

      We agree with the Reviewer that the association between stress and hippocampal plasticity, on the one hand, and mental training and hippocampal plasticity, on the other hand, makes it less than straightforward to inform general principles of hippocampal stress regulation. However, as underscored in the discussion, in previous work we could also link mental training to stress reductions (16-18). We hope that the additional analyses and explanations further clarify the multilevel insights of the current work, on the one hand using group-level analysis to investigate and illustrate the association between mental training and hippocampal subfield volume and intrinsic function, and on the other hand using individual-level analysis to unpack the association between cortisol change and hippocampal subfield change.

      Reviewer #2 (Recommendations For The Authors):

      (1) In the Results, the description of how the hippocampal subfields' functional networks were defined would benefit from some clarification. It is also somewhat unclear what is meant by (on page 10): "Evaluating functional connectivity changes, we found that connectivity of the right CA1-3 functional network showed differential changes when comparing Affect training to Perspective training (2.420, p=0.016, FDRq=0.032, Cohens D =0.289), but not versus retest control (Table 1 and Supplementary Table 8-14)." Were there significant changes in CA1-3 FC following both training conditions (but these differed from each other)? A description of what this difference reflected would increase the reader's understanding.

      We are happy to clarify. We included information on change within the individual Modules in the Supplementary materials (Supplementary Tables 1, 2, 9 and 10). Changes in functional connectivity were largely due to differences between Modules, and did not show strong effects in any one Module alone. We now include information on un-contrasted change for Affect and Perspective in the main results text:

      “… which could be attributed to decreases in right CA1-3 mean FC following Perspective (t=-2.012, p=0.045, M:-0.024, std: 0.081, CI [-0.041 -0.006]), but not Affect (t=1.691, p=0.092, M: 0.010, std: 0.098, CI [-0.01 0.031]); changes were not present when comparing Affect training versus retest control (Table 1 and Supplementary File 1k-q).”

      (2) As described in the Public Review, the lack of multivariate assessments may risk selling the data short. Including analyses of concomitant functional and structural changes, in relation to cortisol, seems like an approach better adapted to characterize meaningful interrelationships between these measures.

      We thank the Reviewer for suggesting multivariate assessments. To understand the interrelation between behavioral intervention, hippocampal plasticity, and cortisol changes, the current work first evaluates a simpler operationalization of the relationship between hippocampal subfield structure and volume, and cortisol as a function of mental training. Thus, given the complex nature of the study, we initially opted for a model where we assess structural and functional changes independently, with structural changes as the basis of our investigations. We have now also included a multivariate approach (PLS) to further test the association between hippocampal subfields and cortisol markers; please see our additions to the manuscript above. We have now highlighted multivariate associations in the Discussion as well, and suggest this as an important next step for more detailed, future investigations.

      “Incorporating further multivariate, computational, models is needed to further unpack and investigate the complex and nuanced association between hippocampal structure and function, in particular in relation to subfield plasticity and short and long-term stress markers.”

      (3) A minor comment regards the Figures. Some main effects should be visualized in a clearer manner. For instance, the scatterplots in Figure 1, panel D. Also, some of the current headings within the figures could be made more intuitive to the reader.

      We thank the Reviewer for this comment. To improve clarity, we updated the figure headings. For Figure 1D, the challenge is that the data are quite scattered and we aimed to visualize our observations in a naturalistic way. Therefore, we added y-axis information to further clarify the figures. Creating more overlap or differentiation would make other elements of the figure less clear, hence we retained the current set-up detailing the intra- and inter-individual alterations of the current model.

      (1) Wisse LEM, Chetelat G, Daugherty AM, de Flores R, la Joie R, Mueller SG, et al. (2021): Hippocampal subfield volumetry from structural isotropic 1 mm(3) MRI scans: A note of caution. Hum Brain Mapp. 42:539-550.

      (2) DeKraker J, Kohler S, Khan AR (2021): Surface-based hippocampal subfield segmentation. Trends Neurosci. 44:856-863.

      (3) DeKraker J, Haast RAM, Yousif MD, Karat B, Lau JC, Kohler S, et al. (2022): Automated hippocampal unfolding for morphometry and subfield segmentation with HippUnfold. Elife. 11.

      (4) Vos de Wael R, Lariviere S, Caldairou B, Hong SJ, Margulies DS, Jefferies E, et al. (2018): Anatomical and microstructural determinants of hippocampal subfield functional connectome embedding. Proc Natl Acad Sci U S A. 115:10154-10159.

      (5) Bernhardt BC, Bernasconi A, Liu M, Hong SJ, Caldairou B, Goubran M, et al. (2016): The spectrum of structural and functional imaging abnormalities in temporal lobe epilepsy. Ann Neurol. 80:142-153.

      (6) Vogel JW, La Joie R, Grothe MJ, Diaz-Papkovich A, Doyle A, Vachon-Presseau E, et al. (2020): A molecular gradient along the longitudinal axis of the human hippocampus informs large-scale behavioral systems. Nat Commun. 11:960.

      (7) Genon S, Bernhardt BC, La Joie R, Amunts K, Eickhoff SB (2021): The many dimensions of human hippocampal organization and (dys)function. Trends Neurosci. 44:977-989.

      (8) McEwen BS (1999): Stress and hippocampal plasticity. Annu Rev Neurosci. 22:105-122.

      (9) Kebets V, Holmes AJ, Orban C, Tang S, Li J, Sun N, et al. (2019): Somatosensory-Motor Dysconnectivity Spans Multiple Transdiagnostic Dimensions of Psychopathology. Biol Psychiatry. 86:779-791.

      (10) McIntosh AR, Lobaugh NJ (2004): Partial least squares analysis of neuroimaging data: applications and advances. Neuroimage. 23 Suppl 1:S250-263.

      (11) Paquola C, Benkarim O, DeKraker J, Lariviere S, Frassle S, Royer J, et al. (2020): Convergence of cortical types and functional motifs in the human mesiotemporal lobe. Elife. 9.

      (12) DeKraker J, Ferko KM, Lau JC, Kohler S, Khan AR (2018): Unfolding the hippocampus: An intrinsic coordinate system for subfield segmentations and quantitative mapping. Neuroimage. 167:408-418.

      (13) McEwen BS, Nasca C, Gray JD (2016): Stress Effects on Neuronal Structure: Hippocampus, Amygdala, and Prefrontal Cortex. Neuropsychopharmacology. 41:3-23.

      (14) Sapolsky RM (2000): Glucocorticoids and hippocampal atrophy in neuropsychiatric disorders. Arch Gen Psychiatry. 57:925-935.

      (15) Jacobson L, Sapolsky R (1991): The role of the hippocampus in feedback regulation of the hypothalamic-pituitary-adrenocortical axis. Endocr Rev. 12:118-134.

      (16) Engert V, Hoehne K, Singer T (2023): Specific reduction in the cortisol awakening response after socio-affective mental training. Mindfulness.

      (17) Puhlmann LMC, Vrticka P, Linz R, Stalder T, Kirschbaum C, Engert V, et al. (2021): Contemplative Mental Training Reduces Hair Glucocorticoid Levels in a Randomized Clinical Trial. Psychosom Med. 83:894-905.

      (18) Engert V, Kok BE, Papassotiriou I, Chrousos GP, Singer T (2017): Specific reduction in cortisol stress reactivity after social but not attention-based mental training. Sci Adv. 3:e1700495.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1:

      (1) Figure 2 and related text: it would be useful to explain more explicitly what is meant by "neurogenic" and "non-neurogenic" models. I presume that the total number of neurons in non-neurogenic models is lower than in neurogenic models because no new neurons are added. It would be useful to plot the number of GCs as a function of timesteps.

      We have clarified the distinction between neurogenic and non-neurogenic models in the text (Lines 142-145), explicitly noting that in non-neurogenic models, no new GCs are added, resulting in a lower total neuron count over time. In response to the reviewer’s suggestion, we generated a plot showing the number of GCs over time (see below). Because the neurogenic model exhibits a simple linear increase, we found this plot not especially informative for inclusion in the manuscript. However, we agree with the reviewer’s later comments that similar plots are useful for interpreting specific results, and we have included those where appropriate.

      Author response image 1.

      Number of GCs over time for neurogenic (solid line) and non-neurogenic (dotted line) networks

      (2) Figure 2F, G: memory declines dramatically when the number of GCs at enrichment onset increases beyond an optimum. Why?

      We have explained the reasoning more thoroughly in the text (Lines 174-177) and added a new supplemental figure to support this reasoning (Figure S2). As the number of GCs increases, the network becomes overly inhibited and the response of abGCs to the stimuli decreases (Fig S2A). This leads to a smaller population of GCs being able to integrate with the stimulus (Fig S2B) which is expected given the activity-dependent plasticity rule. Moreover, it can be seen in Fig S2C that for networks with increasing size, the GCs that do learn only connect to MCs that are driven strongest by the stimuli until they struggle to connect to any MCs at all.

      In principle, a homeostatic mechanism like synaptic scaling could reduce activity to restore balance, but such a mechanism would also likely disrupt existing memories. Alternatively, we suggest activity-dependent apoptosis as a superior homeostatic mechanism because it leads to a stable level of activity without substantially erasing existing memories.

      (3) The paragraph describing synaptic connectivity of abGCs (related to Figure 2H) is confusing. What is the directionality of synapses considered here: mitral-to-granule, or granule-to-mitral? The text is opaque here. Connectivity matrix in Figure 2H: who is presynaptic, who is postsynaptic? If I understand correctly, these questions are actually irrelevant because all mitral-granule synapses in the network are reciprocal. This should be pointed out explicitly in the figure legend. Generally: the fact that the network is fully reciprocal (if I understand correctly) is very important but not stated with sufficient emphasis. It should be stated very explicitly in the text that connectivity matrices are fully reciprocal, and an equation clarifying this point should be included in Methods.

      (6) Connectivity matrix: to what degree was connectivity between mitral and granule cells reciprocal (fraction of connections in either direction that were paired with a connection in the opposite direction between the same cell pair)? Was connectivity shaped by experience (enrichment) reciprocal?

      (7) Directly related to the above: it would be useful to show the disynaptic connectivity matrix between mitral cells and analyze its symmetry. For the symmetric component, it should then be analyzed what fraction of this can be attributed to the reciprocal synapses, and what fraction is contributed by connectivity via different granule cells. This should then be compared to models with biologically realistic fractions of reciprocal connections. Is the model proposed here consistent with a biologically realistic fraction of reciprocal synapses between mitral-granule cell pairs?

      We appreciate these insightful and detailed comments. We agree that the assumption that MC-GC synapses were fully reciprocal was not clearly stated. We now explicitly state this in the main text (lines 90-94, 369-370, Figure 2 caption) and methods (line 561), and emphasize its importance. As the reviewer points out, this is a simplifying assumption and does not fully reflect the biology because not all synapses are reciprocal in the true system. We also note that our synaptic plasticity model does not break the reciprocity assumption: all connections added or pruned during learning remain reciprocal. As a result, the disynaptic connectivity matrix (Bottom panel below, MCs sorted by stimulus as shown in the top panel) is always symmetric.

      We have now made these statements explicit in the main text and in the methods. Regarding functional consequences of this assumption, earlier work by our group has examined the impact of the degree of reciprocity of MC-GC synapses in a similar OB model (Chow, Wick & Riecke, Plos Comp Bio 2012). The study examined three different changes in reciprocity by (1) redirecting a fraction of the inhibitory connections of each GC to randomly chosen MCs instead of the MCs that drive that GC, (2) allowing heterogeneity in reciprocal weights so that there is no relationship between the strength of the MC -> GC synapse and the GC -> MC synapse, (3) reducing the level of self-inhibition a MC receives from the GCs that it excites. The model was found to be quite robust to each of these manipulations, suggesting that our present model likely remains functionally relevant even if biological reciprocity is partial. We reference this work now in the discussion, lines 490-492.

      Author response image 2.

      Disynaptic connectivity. Top: MC activity in response to the two stimuli, sorted by MC selectivity. Bottom: Disynaptic connectivity matrix (diagonal subtracted).
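      The symmetry argument above can be checked numerically with a small sketch. It is an illustration under stated assumptions, not the authors' model: connectivity is random and binary, the scalar weights `w_exc` and `w_inh` are hypothetical, and `redirect_inhibition` is a hypothetical helper that loosely mimics the first manipulation described for Chow, Wick & Riecke (2012), in which a fraction of each GC's inhibitory outputs is redirected to randomly chosen MCs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_mc, n_gc = 20, 100

# Binary MC-by-GC connectivity. Under full reciprocity the GC->MC matrix
# shares the same sparsity pattern as the MC->GC matrix (its transpose).
W = (rng.random((n_mc, n_gc)) < 0.1).astype(float)
w_exc, w_inh = 1.0, 0.5                      # hypothetical scalar synaptic weights

def disynaptic(W_inh, W_exc):
    # MC -> GC -> MC pathway: entry (i, j) sums over GCs that are excited by
    # MC j and inhibit MC i.
    return (w_inh * W_inh) @ (w_exc * W_exc.T)

def redirect_inhibition(W, frac, rng):
    # Move a fraction of each GC's inhibitory outputs to randomly chosen,
    # previously unconnected MCs (breaks reciprocity for those synapses).
    W_inh = W.copy()
    for g in range(W.shape[1]):
        for m in np.flatnonzero(W[:, g]):
            if rng.random() < frac:
                free = np.flatnonzero(W_inh[:, g] == 0)
                if free.size:
                    W_inh[m, g] = 0.0
                    W_inh[rng.choice(free), g] = 1.0
    return W_inh

def asymmetry(D):
    # 0 for a symmetric matrix; grows as D departs from its transpose.
    return np.abs(D - D.T).sum() / np.abs(D).sum()

disyn_full = disynaptic(W, W)                          # proportional to W @ W.T
disyn_partial = disynaptic(redirect_inhibition(W, 0.5, rng), W)

print("asymmetry, fully reciprocal:", asymmetry(disyn_full))
print("asymmetry, 50% redirected:  ", asymmetry(disyn_partial))
```

      With full reciprocity the disynaptic matrix is proportional to W Wᵀ and is therefore symmetric by construction; redirecting inhibitory outputs removes that guarantee.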

      (4) How were mitral cells sorted in Figure 2H? This needs to be explained.

      (5) Directly related to the point above: the text mentions that synaptic connectivity between GCs of the "learning cluster" and mitral cells (which direction?) is increased for mitral cells responding by enrichment odors, but this is not shown in the figure. This statement suggests that mitral cells sorted to the bottom of the y-axis respond more strongly to enrichment odors, but the information is not given directly. Please provide more information to back up your statements.

      Indeed, as the reviewer inferred, MCs in Figure 2H were sorted so that those receiving the strongest stimulation from the odor were at the bottom of the y-axis. We have clarified this in the Figure 2 caption and added a subplot to Figure 2H showing the average MC input to make this more explicit.

      (8) Apoptosis (Figure 4 and related text): paragraph 231ff is somewhat difficult to comprehend because the "number" of enrichments should really be the "frequency" of enrichments. In Figure 4, it is not mentioned explicitly that each enrichment is with different random new odors.

      We agree that the term “number” of enrichments was imprecise and have revised the text to refer instead to the frequency of enrichment events (Lines 255-267). We also clarified that in Figure 4, each enrichment corresponds to a different set of randomly sampled odors, and we now state this explicitly in both the Figure 4 legend and main text (Lines 260-261).

      (9) Apoptosis: apoptosis improves memory but the underlying reason remains opaque. A simple prediction of the data in Figure 4D and 4E is that the number of GCs in 4D is lower than in 4E. It would be helpful to show this. Furthermore, an obvious question that arises is whether a higher frequency of enrichments improves memories because the total number of granule cells is kept low, or because granule cells are removed specifically based on their activity (or both). This could be addressed easily by artificially removing a random subset of granule cells in a simulation such as 4E to match granule cell numbers to the case in 4D.

      Apoptosis improves learning because it reduces the total inhibition in the network by removing GCs and thus prevents the deficits in learning that occur in Fig. 2G as GCs accumulate in the network. As the reviewer inferred, the number of GCs in Figure 4D is lower than in 4E and this is now clarified in the text. This difference was shown implicitly in Supplementary Figure S4D (previously S3D), but we now explicitly reference this plot to support this point as well (Line 266).

      As the reviewer notes, the question is whether increased enrichment frequency improves memory because it limits the total number of GCs, or because apoptosis selectively removes GCs based on their activity, or both. Our model supports both mechanisms. Importantly, simply reducing GC numbers through random deletion will degrade existing memories: random removal erodes memory representations encoded by those GCs. In contrast, our age- and activity-dependent apoptosis rule targets a specific cohort of adult-born GCs. This selective removal minimizes damage to existing memories encoded by GCs outside of this cohort while keeping GC numbers within a regime that supports robust learning (as shown in Figure 2G).

      However, we note that if enrichment frequency becomes too high, even recent memories can be lost due to premature pruning of GCs that have not yet stabilized their synaptic connections. This tradeoff has been shown experimentally (Forest et al., Nat Comm 2019) which we reproduce in our model (Figure S4).

      (10) Text related to Figure 5: "Learning flexibility...approached a steady state when the growth of the network started to saturate". Please show the growth (better: size) of the network (total number of GCs) for these simulations (and other panels in Figure 5). It would also be useful to show the total number of GCs in other figures (e.g. Figure 4; see above).

      We have now added a supplementary figure (Figure S6) that shows the total number of GCs over time for the simulations presented. This confirms that the network size approaches a steady state around the same time that learning flexibility begins to plateau, as noted in the original text (now line 275), and highlights the large number of GCs without apoptosis as well as the slightly reduced number of GCs in the permanent encoding model (line 312).

      (11) As much as I appreciate the comprehensive discussion of the results in a broader context, I feel that the discussion can be somewhat shortened. The section on lateral inhibition is not fully valid given that synaptic connectivity is reciprocal. I also feel that much of the final section (Model assumptions and outlook) can be dropped (except for the last paragraph), not because anything is irrelevant, but because these points have been made, often repeatedly, in the text above.

      We agree that the discussion could be streamlined and have revised the manuscript accordingly. Specifically, we have shortened the section on lateral inhibition and clarified that the OB relies predominantly on reciprocal connectivity (Line 370). We also agree that parts of the final section were repetitive and have removed these. However, to address comments by Reviewer 3, we also expanded on some of the model assumptions. We thank the reviewer for helping us improve the clarity and focus of the manuscript.

      (12) Figure 5: bolding every 5th curve is confusing.

      We have adjusted our figure accordingly.

      (13) "...we biased the dendritic field...": it would be helpful to explain the idea of a "dendritic field" in a bit more detail prior to this sentence.

      We now note, when we initially describe the model, that a GC’s "dendritic field" refers to the subset of MCs with which it is capable of forming synaptic connections (Line 97).

      Reviewer #3:

      (1) The authors find that a network with age-dependent synaptic plasticity outperforms one with constant age-independent plasticity and that having more GC per se is not sufficient to explain this effect. In addition, having an initial higher excitability of GCs leads to increased performance. To what degree the increased excitability of abGCs is conceptually necessarily independent of them having higher synaptic plasticity rates / fast synapses?

      We thank the reviewer for this question, as the difference between excitability and plasticity rate in memory formation is something we intended to highlight in this study. We have updated the text (Lines 157-198) to clarify this.

      At the cellular level, a neuron's excitability and its rate of synaptic plasticity are mechanistically distinct: excitability is governed by factors such as ion channel expression or membrane resistance, whereas plasticity rates are influenced by molecular pathways involved in synapse and dendritic spine formation and remodeling. While these are independent properties, they are functionally coupled: most synaptic plasticity rules are activity-dependent, so greater excitability can increase the likelihood of plasticity being induced but does not itself guarantee learning.

      Our model reflects this distinction. Increased excitability biases which neurons become activated and thus eligible to undergo plasticity, but actual learning still depends on the plasticity rate itself. This can be seen by comparing the model with constant plasticity and excitability (solid blue and green curves in Figure 2C) to the model with only transient excitability (solid blue and green lines in Figure 2E). In both cases, the strength and duration of the memory remain limited by the plasticity rate. We note additionally that, in this network, neurons compete to learn new stimuli: as GCs start to learn, they suppress MC activity through recurrent inhibition, which suppresses learning in other GCs that would otherwise have been in a position to learn the odor. As a result, there is not a significant increase in the overall number of neurons recruited to learn (Figure 2J). In a different network architecture, such as a feedforward network, we would not expect this to be the case; greater excitability in a population of neurons would likely increase the memory by increasing the number of neurons recruited to learn. Transiently enhanced excitability biases which neurons join the memory engram (Figure 2J), but the extent and rate of learning still depend on the plasticity rates themselves. We did note in the original text (now lines 284-286) that this bias in recruitment subtly increases memory stability, but the extent is not great. In principle, a model could be engineered to rely on transiently increased excitability to encode memories in orthogonal subpopulations of neurons, and this could resolve the flexibility-stability dilemma. However, in that case, the number of memories that can be stored within a short time would be bounded by the size of this subpopulation, such that even if a large number of odors are presented, mature GCs cannot become part of the engram and the network would likely fail to learn the stimuli. However, when this was tested experimentally (Forest et al. Cereb Cor. 2020), it was found that mature GCs participated in the engram when the number of odors was sufficiently high. Our results are consistent with these experiments: for complex odor environments, neonatal GCs, which are mature during odor exposure, and abGCs both participate in the engrams.

      Author response image 3.

      Simulating learning in more complex odor environments. Top: enrichment consisted of three odor pairs presented sequentially in a random order. Bottom: enrichment consisted of five odor pairs. Left: discriminability of the odor pairs over time. Middle: connectivity between MCs (sorted by odor selectivity) and GCs (sorted by age). In both cases abGCs develop a clear connectivity structure. In more complex environments neonatal GCs also start to develop a clear connectivity structure. Right: combined engram membership across all stimuli by GC age.

      In sum, transiently increased excitability alone will not make learning any faster, so a fast learning system must have a high plasticity rate. If this plasticity rate stays high, then memories stored in these neurons, even if no longer highly excitable, will be vulnerable as the neurons can still be driven above their plasticity threshold by moderately interfering stimuli and will thus be quickly forgotten. Conversely, if the reviewer is wondering if a greater increase in the plasticity rate of new neurons can compensate for a lack of excitability, this is not the case: if a newborn neuron is not sufficiently driven by the stimulus it will not learn regardless of how high its plasticity rate is.
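      The distinction between excitability and plasticity rate can be illustrated with a deliberately simplified toy rule. This is only a sketch and not the plasticity rule used in the manuscript: the function name `weight_change`, the multiplicative effect of excitability on the drive, and the threshold of 1.0 are all assumptions made for illustration. It shows that a weakly driven neuron does not learn however large its plasticity rate, whereas an excitable neuron crosses threshold but learns only as fast as its plasticity rate allows.

```python
def weight_change(drive, excitability, plasticity_rate, threshold=1.0):
    """Toy activity-dependent rule (hypothetical, for illustration only).

    Excitability scales how strongly the stimulus drives the neuron;
    the plasticity rate scales how much the synapse changes once the
    activity exceeds the plasticity threshold.
    """
    activity = excitability * drive
    if activity <= threshold:
        # Below threshold: no plasticity, regardless of the plasticity rate.
        return 0.0
    return plasticity_rate * (activity - threshold)


# Weakly driven neuron with a very high plasticity rate: no learning.
no_learning = weight_change(drive=0.5, excitability=1.0, plasticity_rate=100.0)

# Highly excitable neuron with a low plasticity rate: learns, but slowly.
slow_learning = weight_change(drive=0.5, excitability=3.0, plasticity_rate=0.01)

# Highly excitable neuron with a high plasticity rate: learns quickly.
fast_learning = weight_change(drive=0.5, excitability=3.0, plasticity_rate=1.0)
```

      In this toy setting, raising excitability changes which neurons are eligible to learn, while the plasticity rate alone sets how much they learn once eligible.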

      (2) The authors do not mention previous theoretical work on the specificity of mitral to granule cell interactions from several groups (Koulakov & Rinberg - Neuron, 2011; Gilra & Bhalla, PLoSOne, 2015; Grabska-Barwinska...Mainen, Pouget, Latham, Nat. Neurosci. 2017; Tootoonian, Schaefer, Latham, PLoS Comput. Biol., 2022), nor work on the relevance of top-down feedback from the olfactory cortex on the abGC during odor discrimination tasks (Wu & Komiyama, Sci. Adv. 2020), or of top-down regulation from the olfactory cortex on regulating the activity of the mitral/tufted cells in task engaged mice (Lindeman et al., PLoS Comput. Biol., 2024), or in naïve mice that encounter odorants (in the absence of specific context; Boyd, et al., Cell Rep, 2015; Otazu et al., Neuron 2015, Chae et al., Neuron, 2022). In particular, the presence of rich top-down control of granule cell activity (including of abGCs) puts into question the plausibility of one of the opening statements of the authors with respect to relying solely on local circuit mechanisms to solve the flexibility-stability dilemma. I think the discussion of this work is important in order to put into context the idea of specific interactions between the abGCs and the mitral cells.

      We thank the reviewer for these detailed and thorough comments, and whole-heartedly agree that it is important to discuss the listed studies in order to contextualize our work through the broader lens of how information is processed in the OB. We have expanded our discussion to further acknowledge and integrate insight from previous theoretical and experimental work cited by the reviewer. (Lines 361-366, 493-550)

      Regarding the importance of top-down feedback, we of course recognize that in practice cortical inputs play a critical role in abGC survival and synaptic integration. However, its nature is not quite clear and is likely variable across behavioral settings. In the paradigm that we study in the manuscript, there is likely no key reward value or contextual signal that is relayed to the OB. One plausible interpretation is that in this task, cortical feedback provides a random, variable baseline excitatory drive to GCs. This would likely be consistent with many of the listed studies, e.g.

      (1) Glomerular layer targeting of feedback would be explicitly unrelated to glomerular odor specificity, as in Boyd et al.

      (2) GC activity would decrease if these cortical inputs were silenced, resulting in stronger MC responses as in Otazu et al., Chae et al.

      (3) Silencing PCx during learning would prevent GCs from reaching activity-dependent plasticity thresholds, resulting in decreased spine density as in Wu & Komiyama.

      Likewise activating PCx would lead to increased spine density.

      In this interpretation, the effect of top-down input could be captured implicitly by adjusting model parameters such as activity or plasticity thresholds. For the purposes of our study, we opted to neglect these inputs in favor of model simplicity.

      Critically, even if top-down inputs play a substantially larger role, by perhaps even going as far as providing signals to abGCs to modulate their development, the core solution to the flexibility-stability dilemma that we describe stays local: we predict that the memory persists in the same network in which it was formed.

      (3) To what degree does specific connectivity reflect a specific stimulus configuration, and is it a good proxy for determining stimulus discriminability and memory capacity in terms of temporal activity patterns (difference in latency/phase with respect to the respiration cycle, etc.), which may account for a substantial fraction of the ability to discriminate between stimuli? The authors mention in the discussion that this is, indeed, an upper bound and that specific connectivity is necessary for different temporal activity patterns, but a further expansion on this topic would help in understanding the limitations of the model.

      We thank the reviewer for raising this important point. Indeed, there have been several recent experimental studies indicating that much of the information needed for olfactory discrimination is encoded in the temporal activity patterns of mitral and tufted cells. Our model does not explicitly simulate these dynamics. It was for this reason that we defined memory in terms of the learned structure of the network rather than by firing rate activity. This is motivated by the idea that learned patterns of connectivity constrain the space of neural activity the network can support, and thus shape stimulus responses. We now make this limitation more explicit in the discussion and clarify that the specific MC–GC connectivity we analyze should be seen as a structural substrate that constrains the possible temporal transformations the network could support (Lines 492-506).

      (4) Reward or reward prediction error signals are not considered in the model. They however are ubiquitous in nature and likely to be encountered and shape the connectivity and activity patterns of the abGC-mitral cell network. Including a discussion of how the model may be adjusted to incorporate reward/error signals would strengthen the manuscript.

      We appreciate the reviewer’s suggestion and agree that reward and reward prediction error signals are critical components of many learning paradigms. We deliberately chose not to model associative learning, reward signals or top-down neuromodulation in this work. Our goal is to investigate the role of adult neurogenesis in a regime where its contribution has been shown to be experimentally necessary. Specifically, we focused on an unsupervised perceptual learning paradigm where adult neurogenesis is required for successful odor discrimination (Moreno et al. PNAS, 2009). In contrast, when the same odors are used in a rewarded learning paradigm, performance remains intact even when adult neurogenesis is ablated (Imayoshi et al., Nat. Neuro., 2008). This dissociation suggests that neurogenesis is dispensable in contexts where reward can guide learning. As such, we argue that isolating the contribution of local circuit dynamics in an unsupervised setting is critical to understanding what neurogenesis is uniquely enabling, especially given the evolutionary cost of maintaining it.

      We agree that extending this work to incorporate reward-driven plasticity or neuromodulatory influences would be a valuable direction for future research. In particular, it could help clarify how different learning paradigms engage distinct abGC cohorts (e.g., Mandairon et al., eLife 2018; Wu & Komiyama, Sci. Adv. 2020), and how task structure shapes memory allocation and engram composition. We have incorporated this into the discussion regarding extending our model to include top down feedback (lines 539-553).

      Specific comments

      (1) Lines 84-86; 507-509; Eq(3): Sensory input is defined by a basal parameter of MCs' spontaneous activity (Sspontaneous) and the odor stimulus input (Siodor), but it is not clear from the main text or methods how sensory inputs (glomerular patterns) were modeled.

      We now clarify in the Methods section "Stimulus model" how the sensory inputs were modeled. Specifically, odor-evoked inputs to mitral cells (Siodor) were generated either as Gaussian profiles across the mitral cell population (Figs. 2,3) or as sparser random patterns (Figs. 4,5). In Figures 2 and 3, the denser Gaussian stimuli require more GCs to learn the odors, aiding in visualization of the connectivity matrix (Figure 2H) and abGC recruitment plots (Figure 2I,J; Figure 3C,E). However, real olfactory stimuli activate a sparse set of MCs, so in Figures 4 and 5, where we address learning of many stimuli, we utilize sparser, binary stimuli delivered to only 10% of MCs, within the range of experimental data (Wachowiak and Cohen, Neuron, 2001). The fact that the stimuli are binary, however, is not realistic and leads to denser representations. This leads to a worst-case scenario for the model, as denser memory representations are easier to overwrite. These points have been added explicitly to the Methods section "Stimulus model" to improve clarity.
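      For readers who want a concrete picture of the two stimulus types, the following is a minimal sketch and not the code used for the manuscript. The function names, the Gaussian width, the amplitude, and the population size are hypothetical choices made only for illustration; the only details taken from the response above are that the dense stimuli are Gaussian profiles across the mitral cell population and that the sparse stimuli are binary and delivered to 10% of MCs.

```python
import math
import random


def gaussian_stimulus(n_mc, center, width, amplitude=1.0):
    """Dense stimulus: Gaussian profile across the mitral cell population.

    center and width are in units of mitral cell index (assumed layout).
    """
    return [
        amplitude * math.exp(-((i - center) ** 2) / (2.0 * width ** 2))
        for i in range(n_mc)
    ]


def sparse_binary_stimulus(n_mc, fraction=0.1, rng=None):
    """Sparse stimulus: binary input to a random subset of mitral cells."""
    rng = rng or random.Random()
    n_active = round(fraction * n_mc)
    active = set(rng.sample(range(n_mc), n_active))
    return [1.0 if i in active else 0.0 for i in range(n_mc)]


# Two similar but distinct dense odors: Gaussians with nearby centers.
odor_a = gaussian_stimulus(n_mc=200, center=95, width=20)
odor_b = gaussian_stimulus(n_mc=200, center=105, width=20)

# One sparse binary odor that activates 10% of the mitral cells.
sparse_odor = sparse_binary_stimulus(n_mc=200, fraction=0.1, rng=random.Random(0))
```

      Placing the two Gaussian centers close together is one simple way to obtain a pair of overlapping, hard-to-discriminate stimuli; how the manuscript constructs similar odor pairs may differ.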

      (2) Lines 118-122: The used perceptual learning task explanation is done only in the context of the discriminability of similar artificial stimuli using the Fisher discriminant and "Memory" metric. A detailed description of the logic of the perceptual learning task methods and objective, taking into account Comment 1, would help to better understand the model.

      We thank the reviewer for pointing out that we had not adequately described the task and have updated the main text (lines 125-132) and included a new methods section "Perceptual learning task" to describe it more explicitly. The experiments that inspired the simulation followed an ecological model of discrimination learning (Moreno et al. PNAS 2009): for one hour a day over a ten-day "enrichment period", two tea balls containing similar but distinct odors were suspended from the lid of each mouse's home cage. The mice engaged with the stimuli under self-directed conditions, therefore learning through natural experience. As a result, the mice use olfactory information to discriminate between the similar stimuli, a skill potentially relevant for navigation or social behaviors.

      In our simulations, we model these experiments as follows. During the enrichment period, the model is stimulated with a randomly selected stimulus chosen from a set of two similar stimuli, corresponding to a mouse choosing to sniff one of the tea balls. During enrichment, in between these bouts of "sniffing", the model only receives spontaneous activity, reflecting the temporal sparsity of sensory input even over the enrichment period. Outside of enrichment, the model again receives only spontaneous input.
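      The stimulation protocol described above can be sketched as a simple input schedule. This is an illustrative assumption rather than the simulation code itself: the function name `input_schedule`, the discrete time steps, and the bout and gap lengths are hypothetical. During the enrichment window, each "sniffing" bout presents one of the two similar stimuli chosen at random, bouts are separated by periods of spontaneous input only, and outside the window the input is always spontaneous.

```python
import random


def input_schedule(n_steps, enrich_start, enrich_end,
                   bout_len=1, gap_len=3, rng=None):
    """Return a per-step list of input labels: "A", "B", or "spontaneous".

    Inside [enrich_start, enrich_end), sniffing bouts of bout_len steps
    alternate with gaps of gap_len steps of spontaneous input.
    """
    rng = rng or random.Random()
    period = bout_len + gap_len
    schedule = []
    current = None
    for t in range(n_steps):
        if enrich_start <= t < enrich_end:
            phase = (t - enrich_start) % period
            if phase == 0:
                # Start of a bout: the "mouse" picks one of the two tea balls.
                current = rng.choice(["A", "B"])
            schedule.append(current if phase < bout_len else "spontaneous")
        else:
            # Outside enrichment: spontaneous activity only.
            schedule.append("spontaneous")
    return schedule


schedule = input_schedule(n_steps=20, enrich_start=5, enrich_end=13,
                          rng=random.Random(1))
```

      Each label would then be mapped to the corresponding mitral cell input (odor plus spontaneous activity, or spontaneous activity alone) before being passed to the network.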

      (3) Rapid re-learning of forgotten odor pair is enabled by sensory-dependent dendritic elaboration of neurons that initially encoded the odors and the observed re-learning would occur even if neurogenesis was blocked following the first enrichment and even though the initial learning did require neurogenesis. When this would ever occur in nature? The re-learning of an odor period? Why is this highlighted in the study?

      We believe that this sort of learning is certainly relevant in nature. To clarify: by “learning,” we do not refer to the memory of an entire “odor period”, but simply an altered mapping of specific stimuli. Therefore, forgetting could occur if these specific stimuli are absent from the environment for a period of time, and re-learning would occur when these stimuli are re-encountered. Natural odor environments are highly dynamic, as environmental conditions and social contexts change over time. The odors an animal encounters also depend strongly on its own behavior; as it explores different environments, it may be exposed to particular odors intermittently: it could encounter them in one location, then not return to that location for some time before returning again.

      Such natural variability in odor exposure makes the ability to forget and re-learn especially valuable, allowing the animal to prioritize relevant information while maintaining flexibility. To this end, we show in Figure 5G that the synaptic forgetting of odors is beneficial to the performance of the model because it reduces interference in the network. Re-learning enabled by adult neurogenesis is therefore a highly efficient strategy for memory storage and retrieval, which is why we emphasize it in this study.

      (4) Figure 2A: I understand that the ages shown at the bottom of the colored boxes represent the GC age. If so, find a better way to express that to avoid confusing 'GC ages' from the days shown in the perceptual learning task description (Figure 2B).

      We have updated the text in the figure to disambiguate the two, and now refer to the “days” shown in the perceptual learning task description as “time relative to enrichment”.

      (5) Figure 2B: Clarify how the two-dimensional arrays are arranged to represent the patterns shown. Does each point of the array represent one neuron? If so, are these neurons re-arranged to help the readers visually differentiate patterns A and B? Are the patterns of activity of MCs in the model spatially and temporally sparse as observed in experimental work?

      In Figure 2B, each point in the two-dimensional array represents the activity of a single mitral cell. The layout is purely for visualization—neurons are re-arranged to make the differences between odor patterns A and B visually apparent. This ordering does not reflect anatomical position or model architecture. We revised the Figure 2 caption to say this explicitly.

      Regarding spatial sparseness, as we mentioned in the response to the reviewer’s comment (1), the activity of mitral cells in response to odors is spatially sparse in the model. Regarding temporal sparseness, the model is not spiking and does not include temporal dynamics within the timescale of the breath. However, odor input is delivered in discrete, odor-specific epochs interleaved with periods of no input, which leads to temporally structured activity patterns. This information has been made explicit in the new methods sections "Stimulus model" and "Perceptual learning task".

      (6) Figure 3C and Line 189: potential confusion between the color code mentioned in the legend for the enrichment and developing periods.

      It appeared to be a confusion in the text and has been corrected (Lines 212-213).

      (7) Figure 5F: For clarity, this would benefit from replacing the bold line with areas in the plot to depict the enrichment periods.

      We agree that replacing the bolded line segments with shaded areas is more clear and have updated the figure accordingly, and appreciate the reviewer's suggestion to clarify the figure.

      (8) Lines 380, 416: Potential role of cortical feedback and or neuromodulation depending on behavioral relevance or permanent exposure? Later mentioned in Lines 467 - 474.

      We have updated the text to acknowledge the role of potential cortical feedback and neuromodulation, now in lines 403-407.

    1. Author Response

      The following is the authors’ response to the current reviews.

      Response to Reviewer Comments:

      We thank the editors and reviewers for their careful consideration of our revised manuscript. Reviewers 2 and 3 indicated that their previous comments had been satisfactorily addressed by our revisions. Reviewer 1 raised several points and our point by point responses can be found below.

      Reviewer #1 (Recommendations For The Authors):

      1) Please clarify the terminology of spontaneous recovery in your study.

      Rescorla RA 2004 (http://www.learnmem.org/cgi/doi/10.1101/lm.77504) defines spontaneous recovery as "with the passage of time following nonreinforcement, there is some "spontaneous recovery" of the initially learned behavior." So in this study, I thought Test 2 is spontaneous recovery while Test 1 is the extinction test, as in most studies. But the authors seem to define spontaneous recovery from the last trial of Extinction 3 to the first trial of Test 1, which is confusing to me.

      We agree with the reviewer (and Rescorla, 2004) that spontaneous recovery is defined as the return of the initially learned behaviour after the passage of time. In our study, Test 1 is conducted 24-hours after the final extinction session (Extinction 3) and in our view, the return of responding following that 24-hour delay can be considered spontaneous recovery. Rescorla (2004 and elsewhere) also points out that the magnitude of spontaneous recovery may be greater with larger delays between extinction and testing. This in part motivated our second test 7 days following the last extinction session with optogenetic manipulation. We did not find evidence of greater spontaneous recovery in the test 7 days later, however, the additional extinction trials in Test 1 may have reduced the opportunity to detect such an effect.

      2) Why are E6-8 plots of Offset group in Figure 3E and F different?

      We apologise for this error and have corrected it. This was an artifact of an older version of the figure before final exclusions. The E6-8 data is now the same for panels 2E and 2F.

      3) Related to 2, please clarify what type of data are shown in Figure 3E,F and Figure 5H,I. If they are averages, please add error bars. Also, it is hard to see the statistical significance in the current figure style.

      The data in these panels are the mean lever presses per trial as labeled on the y-axis of the figures. In our view, in this instance, error bars (or lines and other markers of significance) detract from the visual clarity of the figure. The statistical approach and outcomes are included in the figure legend and when presented alongside the figure in the final version of the paper should directly clarify these points.

      Reviewer #2 (Recommendations For The Authors):

      The authors have addressed my previous comments to my satisfaction.

      Reviewer #3 (Recommendations For The Authors):

      The authors have adequately addressed each of the points raised in my original review. The paper will make a nice contribution to the field.


      The following is the authors’ response to the original reviews.

      Reviewer #1 (Recommendations For The Authors):

      • It would be interesting if the authors would do calcium imaging or electrophysiology from LC-NA neurons during appetitive extinction.

      Indeed these are interesting ideas. We have plans to pursue them but ongoing work is not yet ready for publication.

      • LC-NA neuronal responses during the omission period seem to be important for appetitive extinction as described in the manuscript (Park et al., 2013; Sara et al., 1994; Su & Cohen 2022). It would be nice to activate/inactivate LC-NA neurons during the omission period.

      Optogenetic manipulation was given for the duration of the stimulus (20 seconds; when reward should be expected contingent upon performance of the instrumental response). We believe the reviewer is suggesting briefer manipulation only at the precise time the pellet would have been expected but omitted. If so, implementing this is complex: because animals were trained on random ratio schedules, the moment at which a pellet would have been earned was variable, so it is difficult to know when the animal experiences “omission” with better temporal specificity than was used in the current experiments. We agree with the reviewer, however, that now that we have seen an effect of LC manipulation, future studies could alter the behavioral task so that the timing of reward is consistent (e.g., train the animals with fixed ratio schedules or continuous reinforcement, or use a Pavlovian paradigm). A reasonable assertion could then be made about when the outcome should occur, and thus when its absence would be detected, and manipulation could be given at that time to address this point.

      • Does LC-NA optoinhibition affect the expression of the conditioned response (the lever presses at early trials of Extinction 1)? It's hard to see this from the average of all trials.

      The eNpHR group responded numerically less overall during extinction. This effect appears greatest in the first extinction session, but fails to reach statistical significance [F(1,15)= 3.512, p=0.081]. Likewise, analysis of the trial by trial data for the first extinction session failed to reveal any group differences [F(1,15)= 3.512, p=0.081] or interaction [trial x group; F(1,15)=0.550, p=0.470].

      Comparison of responding in the first trial also failed to reveal group differences [F(1,15)=1.209, p=0.289]. Thus while there is a trend in the data, this is not borne out by the statistical analysis, even in early trials of the session.

      • While the authors manipulate global LC-NA neurons, many people find the heterogeneous populations in the LC. It would be great if the authors could identify the subpopulation responsible for appetitive extinction.

      We agree that it would be exciting to test whether and identify which subpopulation(s) of cells or pathway(s) are responsible for appetitive extinction. While related work has found that discrete populations of LC neurons mediate different behaviours and states, and may even have opposing effects, our initial goal was to determine whether the LC was involved in appetitive extinction learning. These are certainly ideas we hope to pursue in future work.

      Minor:

      • Why do the authors choose 10Hz stimulation?

      The stimulation parameters were based on previously published work. We have added these citations to the manuscript.

      Quinlan MAL, Strong VM, Skinner DM, Martin GM, Harley CW, Walling SG. Locus Coeruleus Optogenetic Light Activation Induces Long-Term Potentiation of Perforant Path Population Spike Amplitude in Rat Dentate Gyrus. Front Syst Neurosci. 2019 Jan 9;12:67. doi: 10.3389/fnsys.2018.00067. PMID: 30687027; PMCID: PMC6333706.

      Glennon E, Carcea I, Martins ARO, Multani J, Shehu I, Svirsky MA, Froemke RC. Locus coeruleus activation accelerates perceptual learning. Brain Res. 2019 Apr 15;1709:39-49. doi: 10.1016/j.brainres.2018.05.048. Epub 2018 May 31. PMID: 29859972; PMCID: PMC6274624.

      Vazey EM, Moorman DE, Aston-Jones G. Phasic locus coeruleus activity regulates cortical encoding of salience information. Proc Natl Acad Sci U S A. 2018 Oct 2;115(40):E9439-E9448. doi: 10.1073/pnas.1803716115. Epub 2018 Sep 19. PMID: 30232259; PMCID: PMC6176602.

      • The authors should describe the behavior task before explaining Fig1e-g results.

      We agree that introducing the task earlier would improve clarity and have added a brief summary of the task at the beginning of the results section (before reference to Figure 1) and point the reader to the schematics that summarize training for each experiment (Figures 2A and 4D).

      NOTE R2 includes specific comments in their Public review. We have considered those as their recommendations and address them here.

      1) In such discrimination training, Pavlovian (CS-Food) and instrumental (LeverPress-Food) contingencies are intermixed. It would therefore be very interesting if the authors provided evidence of other behavioural responses (e.g. magazine visits) during extinction training and tests.

      In a discriminated operant procedure, the DS (e.g. clicker) indicates when the instrumental response will be reinforced (e.g., lever-pressing is reinforced only when the stimulus is present, and not when the stimulus is absent). This is distinct from something like a Pavlovian-instrumental transfer procedure and so we wish to just clarify that there is no Pavlovian phase where the stimuli are directly paired with food. After a successful lever-press the rat must enter the magazine to collect the food, but food is only delivered contingent upon lever-pressing and so magazine entries here are not a clear indicator of Pavlovian learning as they may be in other paradigms.

      Nonetheless, we have compiled magazine entry data which although not fully independent of the lever-press response in this paradigm, still tells us something about the animals’ expectation regarding reward delivery.

      For the ChR2 experiment, largely paralleling the results seen in the lever-press data, there were no group differences in magazine responses at the end of training [F(2,40)=2.442, p=0.100].

      Responding decreased across days of extinction (when optogenetic stimulation was given) [F(2, 80)=38.070, p<0.001], but there was no effect of group [F(2,40)=0.801, p=0.456] and no interaction between day and group [F(4,40)=1.461, p=0.222]. Although a similar pattern is seen in the test data, group differences were not statistically different in the first [F(2,40)=2.352, p=0.108] or second [F(2,40)=1.900, p=0.166] tests, perhaps because magazine responses were quite low. Thus, overall, magazine data do not present a different picture than lever-pressing, but because of the lack of statistical effects during testing, we have chosen not to include these data in the manuscript.

      For the eNpHR experiment, again a similar pattern to lever-pressing was seen. There were no group differences at the end of acquisition [F(1,15)=0.290, p=0.598]. Responding decreased across days of extinction [F(2, 30)=4.775, p=0.016] but there was no main effect of group [F(1,15)=1.188, p=0.293], and no interaction between extinction and group [F(2,30)=0.070, p=0.932]. There were no group differences in the number of magazine entries in Test 1 [F(1,15)=1.378, p=0.259] or Test 2 [F(1,15)=0.319, p=0.580].

      Author response image 1.

      Author response image 2.

      2) In Figure 1, the authors show the behavioural data of the different groups of control animals which were later collapsed in a single control group. It would be very nice if the authors could provide the data for each step of the discrimination training.

      We are a little confused by this comment. Figure 1, panels E, F, and G show the different control groups at the end of training, for each day of extinction (when manipulations occurred) and for each test, respectively. It’s not clear if there is an additional step the reviewer is interested in? We note neural manipulation only occurred during extinction sessions.

      We chose to compare the control groups initially, and finding no differences, to collapse them for subsequent analyses as this simplifies the statistical analysis substantially; when group differences are found, each of the subgroups has to be investigated (including the different controls means there are 5 groups instead of 3). It doesn’t change the story because we tested that there were not differences between controls before collapsing them, but collapsing the controls makes the presentation of the statistical data much shorter and easier to follow.

      3) Inspection of Figures 2C & 2D shows that responding in control animals is about the same at test 2 as at the end of extinction training. Therefore, could the authors provide evidence for spontaneous recovery in control animals? This is of importance given that the main conclusion of the authors is that LC stimulation during extinction training led to an increased expression of extinction memory as expressed by reduced spontaneous recovery.

      To address this we have added analyses of trial data, specifically comparison of the final 3 trials of extinction to the subsequent three trials of each test. These analyses are included on page 5 of the manuscript and additional data figures can be found as panels 2E and 2F and pasted below.

      What we observe in the trial data for controls is an increase in responding from the end of extinction to the beginning of each test, thus demonstrating spontaneous recovery. Importantly, responding in the ChR2 group does not increase from the end of extinction to the beginning of the test, illustrating that LC stimulation during extinction prevents spontaneous recovery.

      Comparison of the final three trials of Extinction to the three trials of Test 1:

      Author response image 3.

      Comparison of the final three trials of Extinction to the three trials of Test 2:

      Author response image 4.

      Halorhodopsin Experiment Tests 1 and 2, respectively.

      Author response image 5.

      4) Current evidence suggests that there are differences in LC/NA system functioning between males and females. Could the authors provide details about the allocation of male and female animals in each group?

      More females than males had surgical complications (excess bleeding), resulting in the following allocations: control group, 14 males and 8 females; ChR2 group, 8 males and 7 females; offset group, 6 males.

      In our dataset, we did not detect sex differences in training [no main effect of sex: F(1,38)=1.097, p=0.302; no sex x group interaction: F(1,38)=1.825, p=0.185], extinction [no effect of sex: F(1,38)=0.370, p=0.547; no sex x extinction interaction: F(2,76)=0.701, p=0.499; no sex x extinction x group interaction: F(2,76)=2.223, p=0.115] or testing [Test 1 no effect of sex: F(1,38)=1.734, p=0.196; no sex x group interaction: F(1,38)=0.009, p=0.924; Test 2 no effect of sex: F(1,38)=0.661, p=0.421; no sex x group interaction: F(1,38)=0.566, p=0.456].

      5) The histology section in both experiments looks a bit unsatisfying. Could the authors provide more details about the number of counted cells and also their distribution along the anteroposterior extent of the LC. Could the authors also take into account the sex in such an analysis?

      The antero-posterior coordinates used for cell counts and calculation of % infection rates were between -9.68 and -10.04 (Paxinos and Watson, 2007, 6th Edition) as infection rates were most consistent in this region and it was well-positioned relative to the optic probe although TH and mCherry positive cells were observed both rostral and caudal to this area. For each animal, an average of ~116+/- 25 TH-positive LC neurons as determined by DAPI and GFP positive cells were identified. Viral expression was identified by colocalized mCherry staining. Animals that did not have viral expression in the LC were not included in the experimental groups. We have added these details to the histology results on page 4.

      Males and females showed very similar infection rates (Males, 74%; Females, 72%). While sex differences, such as total number of LC cells or total LC volume have been reported (Guillamon, A. et al. 2005), Garcia-Falgueras et al. (2005) reported no differences in LC volume or number of LC neurons between male and female Long-Evans rats. So while differences may exist in the LC of Long-Evans rats, the cell counts here were comparable between groups (males, 103 +/- 27; females, 129 +/- 17; t-test, p>0.05).

      References:

      1) Garcia-Falgueras, A., Pinos, H., Collado, P., Pasaro, E., Fernandez, R., Segovia, S., & Guillamon, A. (2005). The expression of brain sexual dimorphism in artificial selection of rat strains. Brain Research, 1052(2), 130–138. https://doi.org/10.1016/j.brainres.2005.05.066

      2) Guillamon, A., De Blas, M. R., & Segovia, S. (1988). Effects of sex steroids on the development of the locus coeruleus in the rat. Developmental Brain Research, 40, 306–310.

      Reviewer #3 (Recommendations For The Authors):

      MAJOR

      1) It is worth noting that responding in Group ChR2 decreased from Extinction 3 to Test 1, while responding in the other two groups appears to have remained the same. This suggests that there was no spontaneous recovery of responding in the controls; and, as such, something more must be said about the basis of the between-group differences in responding at test. This is particularly important as each extinction session involved eight presentations of the to-be-tested stimulus, whereas the test itself consisted of just three stimulus presentations. Hence, comparing the mean levels of performance to the stimulus across its extinction and testing overestimates the true magnitude of spontaneous recovery, which is simply not clear in the results of this study. That is, it is not clear that there is any spontaneous recovery at all and, therefore, that the basis of the difference between Group ChR2 and controls at test is in terms of spontaneous recovery.

      The reviewer is correct that there were a different number of trials in extinction vs. test sessions, making direct comparison difficult, and that displaying the data as averages of the test session does not demonstrate spontaneous recovery per se. To address this, we have added analyses of trial data and a comparison of the final three trials of extinction to the subsequent three trials of each test. These analyses are included on pages 5 and 6 of the manuscript, and additional data figures can be found as panels 2E and 2F and 4H and 4I, and are pasted below.

      What we observe in the trial data for controls is an increase in responding from the end of extinction to the beginning of each test, thus demonstrating spontaneous recovery. Importantly, responding in the ChR2 group does not increase from the end of extinction to the beginning of the test, illustrating that LC stimulation during extinction prevents spontaneous recovery.

      Comparison of the final three trials of Extinction to the three trials of Test 1:

      Author response image 6.

      Comparison of the final three trials of Extinction to the three trials of Test 2:

      Author response image 7.

      Halorhodopsin Experiment Tests 1 and 2, respectively.

      Author response image 8.
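
      As a rough illustration of the trial-level comparison described in this response (the final three extinction trials against the three trials of a test), the following Python sketch computes a simple recovery index. The function name `recovery_index`, the choice of a difference of means, and all numbers are assumptions made for illustration; they are not the study's data or the statistical analysis reported in the manuscript.

```python
from statistics import mean


def recovery_index(extinction_trials, test_trials, n=3):
    """Mean responding on the first n test trials minus the mean on the
    last n extinction trials. Positive values indicate recovery."""
    if len(extinction_trials) < n or len(test_trials) < n:
        raise ValueError("need at least n trials in each session")
    end_of_extinction = mean(extinction_trials[-n:])
    start_of_test = mean(test_trials[:n])
    return start_of_test - end_of_extinction


# Hypothetical lever presses per trial (illustrative values only).
control_ext3 = [6, 5, 4, 4, 3, 2, 2, 2]  # eight extinction trials
control_test = [5, 4, 3]                 # three test trials
chr2_ext3 = [6, 5, 4, 3, 3, 2, 2, 2]
chr2_test = [2, 2, 2]

control_recovery = recovery_index(control_ext3, control_test)
chr2_recovery = recovery_index(chr2_ext3, chr2_test)
```

      Matching the number of trials on both sides of the comparison avoids contrasting an eight-trial session mean with a three-trial test mean, which is the concern the reviewer raised.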

      2a) Did the manipulations have any effect on the rates of lever-pressing outside of the stimulus?

      We did not detect any effect of the optogenetic manipulations on rates of lever pressing outside of the stimulus. This is demonstrated in the pre-CS intervals collected on stimulation days (i.e., extinction sessions) where we see similar response rates between controls and the ChR2 and Offset groups as shown below. There was no effect of group [F(2,40)=0.156, p=0.856] or group x extinction day interaction [F(2,40)=0.146, p=0.865].

      Author response image 9.

      2b) Did the manipulations have any effect on rates of magazine entry either during or after the stimulus?

      For the ChR2 experiment, there were no group differences in magazine responses at the end of training [F(2,40)=2.442, p=0.100]. Responding decreased across days of extinction (when optogenetic stimulation was given) [F(2, 80)=38.070, p<0.001], but there was no effect of group [F(2,40)=0.801, p=0.456] and no interaction between day and group [F(4,40)=1.461, p=0.222]. Although a similar pattern is seen in the test data, group differences were not statistically different in the first [F(2,40)=2.352, p=0.108] or second [F(2,40)=1.900, p=0.166] tests, perhaps because magazine responses were quite low. Thus, overall, magazine data do not present a different picture than lever-pressing, but because of the lack of statistical effects during testing, we have chosen not to include these data in the manuscript.

      For the eNpHR experiment, again a similar pattern to lever-pressing was seen. There were no group differences at the end of acquisition [F(1,15)=0.290, p=0.598]. Responding decreased across days of extinction [F(2, 30)=4.775, p=0.016] but there was no main effect of group [F(1,15)=1.188, p=0.293], and no interaction between extinction and group [F(2,30)=0.070, p=0.932]. There were no group differences in the number of magazine entries in Test 1 [F(1,15)=1.378, p=0.259] or Test 2 [F(1,15)=0.319, p=0.580].

      Author response image 10.

      Author response image 11.

      2c) Did the manipulations affect the coupling of lever-press and magazine entry responses? I imagine that, after training, the lever-press and magazine entry responses are coupled: rats only visit the magazine after having made a lever-press response (or some number of lever-press responses). Stimulating the LC clearly had no acute effect on the performance of the lever-press response. If it also had no effect on the total number of magazine entries performed during the stimulus, it would be interesting to know whether the coupling of lever-presses and magazine entries had been disturbed in any way. One could assess this by looking at the joint distribution of lever-presses (or runs of lever-presses) and magazine visits in each extinction session, or across the three sessions of extinction. As a proxy for this, one could look at the average latency to enter the magazine following a lever-press response (or run of lever-presses). Any differences here between the Controls and Group ChR2 would be informative with respect to the effects of the LC manipulations: that is, the results shown in Figure indicate that stimulating the LC has no acute effects on lever-pressing but protects against something like spontaneous recovery; whereas the results shown in Figure 4 indicate that inhibiting the LC facilitates the loss of responding across extinction without protecting against spontaneous recovery. The additional data/analyses suggested here would indicate whether LC stimulation had any acute effects on responding that might explain the protection from spontaneous recovery; and whether LC inhibition specifically reduced lever-pressing across extinction or whether it had equivalent effects on rates of magazine entry.

      Lever-press and magazine response data were collected trial by trial but not with the temporal resolution required for the analyses suggested by the reviewer. We do not have timestamps for magazine entries nor latency data. We can collect this type of data in future studies. At the session or trial level, magazine entries generally correspond to lever-pressing; being trained on ratio schedules, and from informal observation, rats will do several lever-presses and then check the magazine. Rates of each decrease across extinction (magazine data included in response to comment 2b. above). Optogenetic manipulation appeared to have no immediate effect on either response during extinction.

      PROCEDURAL

      1) Why were there three discriminative stimuli in acquisition: a light, white noise, and clicker?

      This was done to be consistent with and apply parameters similar to previous, related studies (Rescorla, 2006; Janak & Corbit, 2011) and to allow comparison to potential future studies that may involve stimulus compounds etc. (requiring training of multiple stimuli).

      2) Why were some rats extinguished to the noise while others were extinguished to the clicker? Were the effects of LC stimulation/inhibition dependent on the identity of the extinguished stimulus?

      Because the animals were trained with multiple stimuli, it allowed us some ability to choose amongst those stimuli to best balance response rates across groups before the key manipulations. The effects of LC manipulation did not differ between animals based on the identity of the extinguished stimulus.

      3) Did the acute effects of LC inhibition on extinction vary as a function of the stimulus identity?

      No

      4) Was the ITI in extinction the same as that in acquisition?

      Yes, the ITI was the same for acquisition and extinction sessions (variable, averaging to 90 seconds). We have added a sentence to the methods (p. 11) to reflect this.

      5) For Group Offset, when was the photo-stimulation applied in relation to the extinguished stimulus: was it immediately upon offset of the stimulus or at a later point in the ITI?

      The group label “Offset” was used to be consistent with Uematsu et al. (2017), who delivered stimulation 50-70s after a trial. Similarly, we mean it as discontinuous with the stimulus, not at the termination of the stimulus. We have revised the description of this group on page 11 to clarify the timing of the photostimulation as follows:

      “Animals in the Offset group (and relevant controls) underwent identical training with the exception that stimulation in extinction sessions occurred in the middle of the variable length ITI (45s after stimulus termination, on average).”

      MINOR

      1) "Such recovery phenomena undermine the success of extinction-based therapies..."

      ***Perhaps a different phrasing is needed here: "These phenomena show that extinction-based therapies are not always effective in suppressing an already-established response..."

      We have revised this sentence in line with the reviewer’s suggestion:

      “These phenomena mean that extinction-based therapies are not always successful in suppressing previously-established behaviours” (first paragraph of the introduction).

      2) Typo in para 1 of results: "F(2,19)=0.0.352"

      Thank you for finding this typo. It has been corrected. (p.4)

      3) "As another example of modular functional organization, no improvements to strategy set-shifting following global LC stimulation, but improvements were observed when LC terminals in the medial prefrontal cortex were targeted (Cope et al., 2019)." ***This sentence is missing a "there were" before "no improvements".

      Thank you for finding this error. It has been corrected. (p.8)

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this manuscript, the authors employed direct RNA sequencing with nanopores, enhanced by 5' end adaptor ligation, to comprehensively interrogate the human transcriptome at single-molecule and nucleotide resolution. They conclude that cellular stress induces prevalent 5' end RNA decay that is coupled to translation and ribosome occupancy. Contrary to the literature, they found that, unlike typical RNA decay models in normal conditions, stress-induced RNA decay is dependent on XRN1 but does not depend on the removal of the poly(A) tail. The findings presented are interesting but a substantial amount of work is needed to fully establish these paradigm-shifting findings.

      Strengths:

      These are paradigm-shifting observations using cutting-edge technologies.

      Weaknesses:

      The conclusions do not appear to be fully supported by the data presented.

      Our response to the reviewer comments is provided at the end of this document in the section "Recommendations For The Authors"

      Reviewer #2 (Public Review):

      In the manuscript "Full-length direct RNA sequencing uncovers stress-granule dependent RNA decay upon cellular stress", Dar, Malla, and colleagues use direct RNA sequencing on nanopores to characterize the transcriptome after arsenite and oxidative stress. They observe a population of transcripts that are shortened during stress. The authors hypothesize that this shortening is mediated by the 5'-3' exonuclease XRN1, as XRN1 knockdown results in longer transcripts. Interestingly, the authors do not observe a polyA-tail shortening, which is typically thought to precede decapping and XRN1-mediated transcript decay. Finally, the authors use G3BP1 knockout cells to demonstrate that stress granule formation is required for the observed transcript shortening.

      The manuscript contains intriguing findings of interest to the mRNA decay community. That said, it appears that the authors at times overinterpret the data they get from a handful of direct RNA sequencing experiments. To bolster some of the statements additional experiments might be desirable.

      A selection of comments:

      (1) Considering that the authors compare the effects of stress, stress granule formation, and XRN1 loss on transcriptome profiles, it would be desirable to use a single-cell system (and validated in a few more). Most of the direct RNAseq is performed in HeLa cells, but the experiments showing that stress granule formation is required come from U2OS cells, while short RNAseq data showing loss of coverage on mRNA 5'ends is reanalyzed from HEK293 cells. It may be plausible that the same pathways operate in all those cells, but it is not rigorously demonstrated.

      We agree with the reviewer that performing all experiments in a single cell system would be desirable. Presently, our core findings on 5’ RNA shortening are all performed in HeLa cells: the identification of 5’ RNA shortening, the reliance of shortening on XRN1 (shown by XRN1 silencing), suppression of shortening by translation inhibition, and now the relationship between 5’ shortening and deadenylation/decapping through experiments described further below. Our use of other cell lines is primarily to show that 5’ shortening is a general phenomenon, and we have now done this for U2OS cells, HEK293 cells, and primary 3T3 cells from mouse.

      Regarding stress granule formation, we are unfortunately restricted by the lack of available well-characterized resources. The ΔΔG3BP1/2 U2OS line is a well-characterized cell line that has been extensively used for stress granule-related experiments. We have therefore opted to use it and performed experiments to verify both the occurrence of stress-induced RNA shortening and its rescue in the absence of stress granules. The reproducibility and breadth of the cell lines used in our analysis make us confident in the generality of our findings.

      (2) An interesting finding of the manuscript is that polyA tail shortening is not observed prior to transcript shortening. The authors would need to demonstrate that their approach is capable of detecting shortened polyA tails. Using polyA purified RNA to look at the status of polyA tail length may not be ideal (as avidity to oligodT beads may increase with polyA tail length and therefore the authors bias themselves to longer tails anyway). At the very least, the use of positive controls would be desirable; e.g. knockdown of CCR4/NOT.

      We thank the reviewer for their comment. Previous studies, using in vitro transcribed RNA molecules, have shown that direct RNA sequencing can capture and quantify poly(A) tails of varying lengths (Krause et al. 2019). Specifically, a range of 10 to 150 nt has been tested and a high concordance between known and dRNA-Seq determined values was observed. Both tailfindR and nanopolish (used in this work) showed high poly(A) tail estimation accuracy.

      Regardless, we agree with the reviewer that our method depends on poly(A) tail capture and thus may be incomplete for fully quantifying poly(A) length changes. We therefore opted to replace these data and instead follow this and other reviewers’ suggestions and perform experiments following knockdown of CCR4/NOT using cells expressing a catalytically inactive CNOT8 (CNOT8*) dominant negative mutant (Chang et al. 2019). Our new data show that stress-induced 5’ end decay is indeed not dependent on prior removal of the poly(A) tail. Specifically, we find that transcript shortening is still observed upon oxidative stress in cells expressing CNOT8* compared to control cells. We present these new results in Fig. 3 and Sup. Fig 3. 

      (3) The authors use a strategy of ligating an adapter to 5' phosphorylated RNA (presumably the breakdown fragments) to be able to distinguish true mRNA fragments from artifacts of abortive nanopore sequencing. This is a fantastic approach to curating a clean dataset. Unfortunately, the authors don't appear to go through with discarding fragments that are not adapter-ligated (presumably to increase the depth of analysis; they do offer Figure 1e that shows similar changes in transcript length for fragments with adapter, compared to Figure 1d). It would be good to know how many reads in total had the adapter. Furthermore, it would be good to know what percentage of reads without adapters are products of abortive sequencing. What percentage of reads had 5'OH ends (could be answered by ligating a different adapter to kinase-treated transcripts). More read curation would also be desirable when building the metagene analysis - why do the authors include every 3'end of sequenced reads (their RNA purification scheme requires a polyA tail, so non-polyadenylated fragments are recovered in a non-quantitative manner and should be discarded).

      We thank the reviewer for appreciating our approach. The reviewer is correct that we do not discard reads that are not adapter-ligated. As the reviewer correctly mentions, this is to increase the sequencing depth. We have found that the ligation efficiency is very low, ~1-2% of total reads (now in Sup. Table 1), across all libraries, and so the percentage of REL5-ligated reads does not directly infer the total amount of non-artifactual 5’ ends. Instead, we use these REL5-ligated reads as a subset of our data for which we have extremely high confidence in the true 5’ end. Our results show that non-ligated reads display the same length distribution as ligated ones, and that the results are reproducible regardless of read selection (e.g. Fig. 1c, e, Sup. Fig. 1k, l, Fig. 3b, c). This strong concordance between REL5-ligated and non-ligated reads suggests that our conclusions on 5’ end shortening are not substantially influenced by abortive sequencing or other artefactual creation of 5’ shortening. We have modified the text to clarify these points and have added plots using only ligated molecules for relevant figures where this was not previously done (Sup. Fig. 1l, 3c).

      We agree with the reviewer that non-polyadenylated reads could be discarded from metagene analysis and we have performed this change in the revised version. Our conclusions following removal of non-polyadenylated reads remain unchanged (Sup. Fig. 1g).
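
      One way to picture the concordance check between REL5-ligated and non-ligated reads discussed in this response is to compare the two read-length distributions directly. The Python sketch below computes the medians and the largest gap between the two empirical cumulative distributions (the two-sample Kolmogorov-Smirnov statistic). The choice of statistic, the function name `ks_statistic`, and the read lengths are assumptions for illustration only; the response does not state which comparison the authors used.

```python
from bisect import bisect_right
from statistics import median


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # The empirical CDFs only change at observed values, so it is
    # enough to evaluate the gap at every observed read length.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical read lengths in nucleotides (illustrative values only).
ligated = [400, 650, 900, 1200, 1500]
non_ligated = [380, 420, 640, 700, 880, 950, 1180, 1250, 1490, 1600]

median_ligated = median(ligated)
median_non_ligated = median(non_ligated)
distribution_gap = ks_statistic(ligated, non_ligated)
```

      A small gap together with similar medians would be consistent with the two read sets sharing a length distribution, although with ligated reads making up only ~1-2% of a library the ligated sample will be much smaller than the non-ligated one.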

      (4) The authors should come to a clear conclusion about what "transcript shortening" means. Is it exonucleolytic shortening from the 5'end? They cannot say much about the 3'ends anyway (see above). Or are we talking about endonucleolytic cuts leaving 5'P that then can be attached by XRN1 (again, what is the ratio of 5'P and 5'OH fragments; also, what is the ratio of shortened to full-length RNA)?

      We thank the reviewer for their suggestion. We have performed additional experiments to investigate the role of deadenylation and decapping by expressing dominant negative forms of the NOT8 deadenylase (NOT8*) and DCP2 decapping (DCP2*) enzyme in HeLa cells. Our results show that neither expression of NOT8* nor DCP2* can inhibit stress-induced transcript shortening following arsenite treatment (Fig. 3e-f). These new data suggest that neither deadenylation nor decapping are required for stress-induced RNA decay. Instead, our data are more compatible with endonucleolytic cleavage as the most likely mechanism for stress-induced RNA decay. We have incorporated these results in the text and present them in Fig. 3 and Sup. Fig. 3.

      (5) The authors should clearly explain how they think the transcript shortening comes about. They claim it does not need polyA shortening, but then do not explain where the XRN1 substrate comes from. Does their effect require decapping? Or endonucleolytic attacks?

      Please also refer to our answer to the previous comment (#4). Collectively, our results from a) the dominant negative expression of NOT8* and DCP2* that show no effect on stress-induced shortening and b) the rescue of transcript length upon translation initiation inhibition, indicate a potential endonucleolytic mechanism as a mediator of stress-induced RNA decay. However, we believe that extensive, further studies currently beyond the scope of this work, will be required to discover the nuclease and to dissect the exact molecular mechanisms that define the 5' ends of mRNAs upon stress-induced decay. We now discuss these points in the discussion.

      (6) XRN1 KD results in lengthened transcripts. That is not surprising as XRN1 is an exonuclease - and XRN1 does not merely rescue arsenite stress-mediated transcript shortening, but results in a dramatic transcript lengthening.

      The reviewer raises an intriguing point. Additional analysis of the data has shown that, in unstressed cells, XRN1 KD in fact leads to a modest but significant reduction in overall transcript length (Fig. 3b, c). This could possibly be the result of an accumulation of intermediate cleavage products normally expected to be degraded by XRN1, as previously described (Pelechano, Wei, and Steinmetz 2015; Ibrahim et al. 2018).

      Instead, we find that under stress, XRN1 KD shows an almost identical transcript length distribution to unstressed cells and significantly higher than siCTRL stressed cells (Fig. 3b, c). These results indicate that in the absence of XRN1, stress-induced decay is largely abolished. As the reviewer correctly points out, this seems to affect the majority of RNAs which we believe is evidence of the general lack of specificity in the mechanism. Nevertheless, we find that transcripts that are the primary substrates to stress-induced shortening are substantially more lengthened than all other transcripts (Fig. 3e). This indicates that transcripts primarily affected by stress-induced decay are also lengthened the most in the absence of XRN1 and at an even higher level than expected by general XRN1 KD effects.

      Reviewer #3 (Public Review):

      The work by Dar et al. examines RNA metabolism under cellular stress, focusing on stress-granule-dependent RNA decay. It employs direct RNA sequencing with a Nanopore-based method, revealing that cellular stress induces prevalent 5' end RNA decay that is coupled to translation and ribosome occupancy but is independent of the shortening of the poly(A) tail. This decay, however, is dependent on XRN1 and enriched in the stress granule transcriptome. Notably, inhibiting stress granule formation in G3BP1/2-null cells restores the RNA length to the same level as wild-type. It suppresses stress-induced decay, identifying RNA decay as a critical determinant of RNA metabolism during cellular stress and highlighting its dependence on stress-granule formation.

      This is an exciting and novel discovery. I am not an expert in sequencing technologies or sequencing data analysis, so I will limit my comments purely to biology and not technical points. The PI is a leader in applying innovative sequencing methods to studying mRNA decay.

      One aspect that appeared overlooked is that poly(A) tail shortening per se does not lead to decapping. It is shortening below a certain threshold of 8-10 As that triggers decapping. Therefore, I found the conclusion that poly(A) tail shortening is not required for stress-induced decay to be somewhat premature. For a robust test of this hypothesis, the authors should consider performing their analysis in conditions where CNOT7/8 is knocked down with siRNA.

      We agree with the reviewer. We have now performed experiments in cells expressing a well-characterized, catalytically inactive dominant negative NOT8 isoform (NOT8*) (Chang et al. 2019). Our new data show that stress-induced decay still occurs in cells expressing NOT8*. These results confirm our findings that stress-induced decay does not require deadenylation. We present these new results in Fig. 3 and Sup. Fig. 3.

      Similarly, as XRN1 requires decapping to take place, it necessitates the experiment where a dominant-negative DCP2 mutant is over-expressed.

      We agree with the reviewer and have performed this experiment as requested. Expression of a dominant negative DCP2 (DCP2*) isoform (Loh, Jonas, and Izaurralde 2013) in HeLa cells showed that decapping is also not required for stress-induced decay. We present these new results in Fig. 3 and Sup. Fig. 3.

      Are G3BP1/2 stress granules required for stress-induced decay or simply sites for storage? This part seems unclear. A very worthwhile test here would be to assess in XRN1-null background.

      We thank the reviewer for their comment. Our data show that stress-induced decay is not observed in ΔΔG3BP1/2 U2OS cells, which are unable to form stress granules (Fig. 6). This result suggests that G3BP1/2 SGs either a) are required for 5’ RNA shortening or b) preserve partially fragmented RNAs that would otherwise be rapidly degraded. We find the second option unlikely for two reasons. First, even if the fragments were rapidly degraded, we would still expect to find evidence of their presence in our data. However, Fig. 6f shows that the length distributions of ΔΔG3BP1/2 U2OS cells, with and without arsenite, are almost identical, thus arguing against the presence of such a pool of rapidly degrading RNAs. Second, if these RNAs were protected by SGs, then they would be expected to be downregulated in the absence of SGs in ΔΔG3BP1/2 U2OS cells treated with arsenite. Our results contradict this hypothesis, as no association is found between the level of downregulation in arsenite-treated ΔΔG3BP1/2 U2OS cells and the observed stress-induced fragmentation in WT. Collectively, our results point towards G3BP1/2 stress granules being required for stress-induced decay. We have expanded on these points in the manuscript to clarify.

      Finally, the authors speculate that the mechanism of stress-induced decay may have evolved to relieve translational load during stress. But why degrade the 5' end when removing the cap may be sufficient? This returns to the question of assessing the role of decapping in this mechanism.

      The reviewer raises a very interesting point. Our new results, following expression of dominant negative DCP2, show that stress-induced decay does not require decapping. It is therefore plausible that a stress-induced co-translational mechanism cleaves mRNAs endonucleolytically to reduce the translational load. Such a mechanism would have many functional benefits, as it would acutely reduce the translational load, degrade non-essential RNAs, preserve energy and release ribosomes for translation of the stress response program. We have expanded the discussion to mention these points.

      Recommendations for the authors:

      Reviewing Editor (Recommendations For The Authors):

      As you can see from the comments, although the reviewers appreciate the novelty of your findings, there was a consensus opinion from all reviewers that the authors overinterpreted their data, since they only have one assay and did not fully analyze it, as laid out in one of the reviewer's critiques. Some orthogonal validation of the "groundbreaking" claims is necessary. Examination of the effects of upstream events in 5'-to-3' decay, namely deadenylation, and decapping, would be necessary for a better understanding of the phenomena the authors describe. Many tools and approaches for studying this are described well in the literature (CNOT7-KD, dominant negative DCP2 E148Q, XRN1-null cell lines), so it is well within the authors' reach. Overall, while some of the evidence presented is novel and solid, for some of the claims there is only incomplete evidence.

      We thank the reviewers and the editor for their comments and suggestions. We have performed several additional experiments to further support our conclusions. We have notably investigated the role of deadenylation and decapping in the stress-induced decay by expressing dominant negative NOT8 and DCP2, respectively, as suggested. Our results show that neither deadenylation nor decapping is necessary for stress-induced transcript shortening, suggesting an endonucleolytic event. We believe that these additional experiments strengthen the main conclusions of our work. 

      Reviewer #1 (Recommendations For The Authors):

      Major comments:

      (1) The experiments were conducted in two unrelated cell lines, HeLa and U2OS. The authors should determine if the 5'end RNA decay in response to stress is also observed in normal human cells such as normal human diploid fibroblasts. Furthermore, it would be important to know if this mechanism is conserved between human and mouse cells. This can be tested in mouse embryonic fibroblasts.

      We thank the reviewer for their suggestion. We have now also performed experiments in the mouse embryonic fibroblast NIH 3T3 cell line. Our new results confirm that stress-induced 5’ end RNA decay is also observed in this primary cell line and is conserved between human and mouse (Sup. Fig. 1k, l).

      (2) The authors state that they monitored cell viability up to 24 hours after Arsenite treatment, but the data is shown up to 240 min (Suppl. 1a). Also, the Y-axis label of this Figure is "Active cells (%)". This should be changed to "Live cells (%)" if this is what they are referring to.

      We thank the reviewer for identifying this mistake. Cell viability was monitored up to 4 hours after arsenite treatment. We have corrected the text and modified the figure according to the reviewer’s suggestion.

      (3) Based on direct Nanopore-based RNA-seq the authors surprisingly found that RNAs in oxidative stress were globally shorter than unstressed cells. Since Nanopore-based RNA-seq will not detect RNAs that lack a poly A-tail, are they not missing out on RNAs that have already started getting degraded due to the loss of a poly A-tail? Also, I am not sure if they used a spike-in control which would be critical to claim global changes in RNA expression.

      We agree with the reviewer that our strategy does not capture RNA molecules without a poly(A) tail. Nevertheless, our data do identify shortening upon stress at the 5’ end of RNAs that include poly(A) tails. We considered this as direct evidence that decay at the 5’ end does not require prior removal of the poly(A) tail. Otherwise, these molecules would not have been captured and observed. Indeed, our newly added data from cells expressing a well characterized catalytically inactive dominant negative NOT8 isoform (Chang et al. 2019) show that stress-induced decay occurs even upon silencing of the CCR4-NOT deadenylation complex. We present these results in Fig. 3 and Sup. Fig 3.

      We would like to clarify that in our results we did not use a spike-in control and thus refrain from claiming global changes in RNA expression. Instead, we compare relative ratios of groups of molecules within libraries that are internally normalized, we perform correlative comparisons that are invariant to normalization and we perform differential gene expression using established normalization schemes such as DESeq2 (Love, Huber, and Anders 2014). 
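
      To illustrate why within-library ratios do not need a spike-in, the Python sketch below computes the fraction of reads in a library that fall under a length threshold and shows that the fraction is unchanged when the same library is sequenced more deeply. The function name `short_read_fraction`, the 1000 nt threshold, and the read lengths are hypothetical and chosen only for illustration; this is not the authors' pipeline.

```python
def short_read_fraction(read_lengths, threshold):
    """Fraction of reads in one library shorter than `threshold` nt."""
    if not read_lengths:
        raise ValueError("library has no reads")
    n_short = sum(length < threshold for length in read_lengths)
    return n_short / len(read_lengths)


# Hypothetical read lengths in nucleotides (illustrative values only).
unstressed = [500, 900, 1200, 1500, 1800, 2100, 2400, 3000]
arsenite = [300, 450, 600, 800, 1100, 1500, 2000, 2600]
THRESHOLD = 1000  # assumed cutoff, not taken from the manuscript

frac_unstressed = short_read_fraction(unstressed, THRESHOLD)
frac_arsenite = short_read_fraction(arsenite, THRESHOLD)

# Simulate a three-fold deeper library by repeating every read:
# the within-library fraction does not depend on total read count.
frac_arsenite_deep = short_read_fraction(arsenite * 3, THRESHOLD)
```

      Because each fraction is normalized to its own library, it supports statements about the relative composition of a library, not about absolute changes in RNA abundance, which is the distinction drawn in the response above.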

      (4) Many graphs are confusing and inconsistent. For example, samples for Nanopore RNA-seq were prepared in triplicates. Biological or technical? The schematic in Figure 1a shows ISRIB but it appears from Figure 4 onwards. It is missing in the Figure 1 results and the Figure legend. The X-axis labels of many graphs are confusing. For example, Supplementary Figure 1d, 1e, 1g and 1h. It says transcript length but are these nucleotides? P-values are missing from many of these graphs. For some graphs, the authors compared Unstressed vs Arsenite (Figure 1), but in other panels they state No Ars vs 0.5 mM Ars (Fig. 3a) or Control vs Ars (Figure 5c). Likewise, in Figure 1b, Expression change (log2) is unstressed vs Arsenite or Arsenite vs unstressed?

      We thank the reviewer for identifying these inconsistencies in the presentation of our results. The replicates for nanopore RNA-seq experiments were biological. We have now clarified this point in the text. Furthermore, we have removed “ISRIB” from Fig. 1a to avoid any confusion. We have also made our labelling across all figures more consistent, using “unstressed” for no arsenite treatment and “arsenite” or “+ Ars” for arsenite treatment.

      (5) The authors transfected cells with siCTRL or siXRN1 using electroporation and treated the cells 72 hours after transfection. Since XRN1 is an essential gene, it would be important to determine the viability of cells 72 hours after transfection. Along these lines, in Figure 3b, it would be important to determine the effect of XRN1 knockdown in unstressed cells. Currently, there are only 3 comparisons in Figure 3b - unstressed, siCTRL + Ars and siXRN1 + Ars, and this is insufficient to conclude the effects of XRN1 knockdown in the presence of Arsenite.

      We thank the reviewer for their suggestion. We have updated Fig. 3b and the text to show the requested conditions: siCTRL and siXRN1 with and without arsenite. While XRN2 is an essential gene for many organisms, XRN1 is not essential in mammalian cells and no increased cell death has been reported for XRN1-KO or -KD cells (Brothers et al. 2023). We have also tested different concentrations (up to 40 nM) of siRNA and monitored the cells up to five days after transfection without observing any cell toxicity, as previously reported.

      (6) More broadly, the whole study is somewhat descriptive. The biological effect of 5'end mRNA shortening on gene expression is unclear. There is no data indicating how these changes in RNA lengths impact protein expression. Global quantitative proteomics would be critical to determine this.

      We thank the reviewer for their suggestion. To address this concern we have performed additional experiments using cells expressing catalytically inactive forms of NOT8 (Chang et al. 2019) and DCP2 (Loh, Jonas, and Izaurralde 2013) to inhibit deadenylation and decapping.

      These experiments provide additional mechanistic details for 5’ shortening and suggest endonucleolytic cleavage as a critical step (Fig. 3 and Sup. Fig. 3). We agree that it would be interesting to study the fate of these shortened transcripts notably regarding translation. However, given the complexity of the expected proteome changes also following global translation arrest under stress (Harding et al., 2003; Pakos-Zebrucka et al., 2016), we think that this work is beyond the scope of this manuscript and will be the subject of future studies. 

      Minor comments:

      (1) Some of the affected RNAs can be validated in HeLa and other cell lines.

      We thank the reviewer for their suggestion. We have performed RT-qPCR on 3 different mRNAs that present 5’ shortening upon oxidative stress, using different primer sets located along the mRNA. We hypothesized that the closer the primer set is located to the 5’ end, the less abundant the corresponding region would be in arsenite-treated compared to untreated cells. Our results indeed show that the measured level of these mRNAs depends on the location of the primer set used for the qPCR: the closer it is to the 5’ end, the less abundant the mRNA is upon oxidative stress compared to control cells. We present these data as well as a schematic representing the positions of the primers in Sup. Fig. 2d.

      (2) The authors should check whether XRN1 also co-localizes in SGs.

      We thank the reviewer for their suggestion. We have performed immunofluorescence on U2OS and HeLa cells upon oxidative stress and did not observe co-localization of XRN1 with TIA-1, a marker of stress granules (see below). These results are consistent with Kedersha et al. (2005), who showed that XRN1 mainly co-localizes to processing bodies and is only very weakly detectable in SGs in DU145 cells. We think that this result is beyond the scope of this study and thus decided to only include it for the reviewers.

      Author response image 1.

      Representative immunofluorescence merged image of HeLa (left panel) and U2OS (right panel) cells treated with sodium arsenite and labelled with anti-TIA1 (red), anti-XRN1 (green) antibodies and DAPI (blue). Scale bar 50 µm.

      (3) XRN1 should be knocked down with more than one siRNA.

      We thank the reviewer for this suggestion. Our results show that our XRN1 KD specifically rescues the length of the most shortened mRNAs (Fig. 3e). This is a highly specific effect that makes us confident it is not mediated by non-specific siRNA binding; thus, we do not consider it necessary to repeat the experiment.

      (4) There are typos in the text regarding Figure 6d, e, and f. Also, Supplementary Figure 4a.

      We thank the reviewer for identifying these mistakes. We have corrected the typos. 

      Reviewer #3 (Recommendations For The Authors):

      The authors should consider testing their hypotheses by arresting the decay pathway using the approaches I mentioned previously. As it stands, some conclusions are somewhat speculative.

      We have replied to the reviewer comments in the public review section. 

      References:

      • Brothers, William R., Farah Ali, Sam Kajjo, and Marc R. Fabian. 2023. “The EDC4-XRN1 Interaction Controls P-Body Dynamics to Link MRNA Decapping with Decay.” The EMBO Journal, August, e113933.

      • Chang, Chung-Te, Sowndarya Muthukumar, Ramona Weber, Yevgen Levdansky, Ying Chen, Dipankar Bhandari, Catia Igreja, Lara Wohlbold, Eugene Valkov, and Elisa Izaurralde. 2019. “A Low-Complexity Region in Human XRN1 Directly Recruits Deadenylation and Decapping Factors in 5’-3’ Messenger RNA Decay.” Nucleic Acids Research 47 (17): 9282–95.

      • Harding, Heather P., Yuhong Zhang, Huiquing Zeng, Isabel Novoa, Phoebe D. Lu, Marcella Calfon, Navid Sadri, et al. 2003. “An Integrated Stress Response Regulates Amino Acid Metabolism and Resistance to Oxidative Stress.” Molecular Cell 11 (3): 619–33.

      • Ibrahim, Fadia, Manolis Maragkakis, Panagiotis Alexiou, and Zissimos Mourelatos. 2018. “Ribothrypsis, a Novel Process of Canonical MRNA Decay, Mediates Ribosome-Phased MRNA Endonucleolysis.” Nature Structural & Molecular Biology 25 (4): 302–10.

      • Kedersha, Nancy, Georg Stoecklin, Maranatha Ayodele, Patrick Yacono, Jens Lykke-Andersen, Marvin J. Fritzler, Donalyn Scheuner, Randal J. Kaufman, David E. Golan, and Paul Anderson. 2005. “Stress Granules and Processing Bodies Are Dynamically Linked Sites of MRNP Remodeling.” The Journal of Cell Biology 169 (6): 871–84.

      • Krause, Maximilian, Adnan M. Niazi, Kornel Labun, Yamila N. Torres Cleuren, Florian S. Müller, and Eivind Valen. 2019. “Tailfindr: Alignment-Free Poly(A) Length Measurement for Oxford Nanopore RNA and DNA Sequencing.” RNA 25 (10): 1229–41.

      • Loh, Belinda, Stefanie Jonas, and Elisa Izaurralde. 2013. “The SMG5-SMG7 Heterodimer Directly Recruits the CCR4-NOT Deadenylase Complex to MRNAs Containing Nonsense Codons via Interaction with POP2.” Genes & Development 27 (19): 2125–38.

      • Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” Genome Biology 15 (12): 550.

      • Pakos-Zebrucka, Karolina, Izabela Koryga, Katarzyna Mnich, Mila Ljujic, Afshin Samali, and Adrienne M. Gorman. 2016. “The Integrated Stress Response.” EMBO Reports 17 (10): 1374–95.

      • Pelechano, Vicent, Wu Wei, and Lars M. Steinmetz. 2015. “Widespread Co-Translational RNA Decay Reveals Ribosome Dynamics.” Cell 161 (6): 1400–1412.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Strengths:

      The three experiments are well designed and the various conditions are well controlled. The rationale of the study is clear, and the manuscript is pleasant to read. The analysis choices are easy to follow, and mostly appropriate.

      We are grateful to the reviewer’s thoughtful comments.

      Weaknesses:

      I only have one potential worry. The analysis for gait tracking (1 Hz) in Experiment 2 (Figures 3a/b) starts by computing a congruency effect (A/V stimulation congruent (same frequency) versus A/V incongruent (V at 1 Hz, A at either 0.6 or 1.4 Hz), separately for the Upright and Inverted conditions. Then, this congruency effect is contrasted between Upright and Inverted, in essence computing an interaction score (Congruent/Incongruent X Upright/Inverted). Then, the channels in which this interaction score is significant (by cluster-based permutation test; Figure 3a) are subselected for further analysis. This further analysis is shown in Figure 3b and described in lines 195-202. Critically, the further analysis exactly mirrors the selection criteria, i.e. it is aimed at testing the effect of Congruent/Incongruent and Upright/Inverted. This is colloquially known as "double dipping", the same contrast is used for selection (of channels, in this case) as for later statistical testing. This should be avoided, since in this case even random noise might result in a significant effect. To strengthen the evidence, either the authors could use a selection contrast that is orthogonal to the subsequent statistical test, or they could skip either the preselection step or the subsequent test. (It could be argued that the test in Figure 3b and related text is not needed to make the point - that same point is already made by the cluster-based permutation test.)

      Thanks for the helpful suggestions. In Experiment 2, to investigate whether the multisensory integration effect was specialized for biological motion perception, we contrasted the congruency effect between the upright and inverted conditions to search for clusters showing a significant interaction effect. We performed further analyses based on neural responses from this cluster to examine whether the congruency effect was significant in the upright and the inverted conditions, respectively, following the logic of post hoc comparisons after identifying an interaction effect. However, we agree with the reviewer that comparing the congruency effects between the upright and inverted conditions again based on data from this cluster was redundant and resulted in double-dipping. Therefore, we have removed this comparison from the main text and optimized the presentation of our results in the revised Fig. 3.
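
      The reviewer's point that circular selection can produce a significant effect from random noise can be illustrated with a small simulation. This is a sketch under stated assumptions, not the analysis used in the study: picking the channels with the largest mean effect stands in for the cluster-based selection, and the channel count, number of selected channels, and number of simulations are hypothetical.

```python
import random
from statistics import mean, stdev

random.seed(0)
N_SUBJ, N_CHAN, N_TOP, N_SIM = 24, 64, 5, 200
T_CRIT = 2.069  # two-sided 5% critical value of t with 23 degrees of freedom

def noise():
    """Pure-noise 'interaction scores' (subjects x channels); the true effect is 0."""
    return [[random.gauss(0, 1) for _ in range(N_CHAN)] for _ in range(N_SUBJ)]

def top_channels(data):
    """Channels with the largest group-mean effect (stand-in for a cluster)."""
    means = [mean(row[c] for row in data) for c in range(N_CHAN)]
    return sorted(range(N_CHAN), key=lambda c: means[c], reverse=True)[:N_TOP]

def significant(data, channels):
    """One-sample t-test of the channel-averaged effect against zero."""
    per_subj = [mean(row[c] for c in channels) for row in data]
    t = mean(per_subj) / (stdev(per_subj) / N_SUBJ ** 0.5)
    return abs(t) > T_CRIT

circular = independent = 0
for _ in range(N_SIM):
    test_data, other_data = noise(), noise()
    circular += significant(test_data, top_channels(test_data))      # select and test on same data
    independent += significant(test_data, top_channels(other_data))  # select on separate data

circular_rate = circular / N_SIM
independent_rate = independent / N_SIM
```

      In this toy setting the false-positive rate of the circular test is far above the nominal 5%, whereas selecting channels on independent data keeps it near 5%, which is why an orthogonal selection contrast (or dropping the redundant test) is the safer choice.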

      Related to the above: the test for the three-way interaction (lines 211-216) is reported as "marginally significant", with a p-value of 0.087. This is not very strong evidence.

      As shown in Fig. 3b & e, the magnitude of the amplitude differs between the gait-cycle frequency (mean = 0.008, SD = 0.038) and the step-cycle frequency (mean = 0.052, SD = 0.056), which might influence the statistical results of the interaction effect. To reduce such influence, we converted the amplitude data in each frequency condition into Z-scores, separately. The repeated-measures ANOVA on these normalized amplitude data revealed a significant three-way interaction (F(1, 23) = 7.501, p = 0.012, ηp² = 0.246). We have updated the results in the revised manuscript (lines 218-225).
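
      The per-frequency normalization described above can be sketched as follows. The amplitude values and the dictionary keys are hypothetical, and the sample standard deviation is an assumption; the response does not state which estimator was used.

```python
from statistics import mean, stdev

def zscore(values):
    """Standardize values to mean 0 and (sample) standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical amplitudes pooled within each frequency condition.
amplitudes = {
    "gait_cycle": [0.01, -0.03, 0.05, 0.00, 0.02],
    "step_cycle": [0.10, 0.02, 0.12, -0.01, 0.06],
}
# Z-score each frequency condition separately, so both are on a common scale.
normalized = {freq: zscore(vals) for freq, vals in amplitudes.items()}
```

      After this step the two frequencies contribute on the same scale, so a difference in raw amplitude magnitude between them no longer dominates the interaction term.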

      Reviewer #1 (Recommendations For The Authors):

      -  Which variable caused one data point to be classified as outlier? (line 221).

      The outlier is a participant whose audiovisual congruency effect (Upright – Inverted) in neural responses at the frequency of interest exceeds 3 SD from the group mean. It is marked by a red diamond in Author response 2. Before removing the data, the correlation between the AQ score and the congruency effect is r = -0.396, p = 0.055. For comparison, the results after removing the outlier are shown in Fig. 3c of the revised manuscript. We have added more information about the variable causing the outlier in the revised manuscript (lines 231-232).

      Author response image 1.

      The correlation between AQ score and congruency effect
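
      The 3 SD exclusion rule described above can be sketched as below. The function name `outliers_3sd` and the toy values are hypothetical; whether the SD is computed with or without the candidate outlier is not stated in the response, and the sketch includes it.

```python
from statistics import mean, stdev

def outliers_3sd(values, k=3.0):
    """Indices of values lying more than k sample SDs from the group mean."""
    m, s = mean(values), stdev(values)
    return [i for i, v in enumerate(values) if abs(v - m) > k * s]

# Hypothetical congruency effects for 24 participants; the last one is extreme.
effects = [0.01 * ((i % 5) - 2) for i in range(23)] + [0.5]
flagged = outliers_3sd(effects)
kept = [v for i, v in enumerate(effects) if i not in flagged]
```

      With small samples a single extreme value also inflates the SD used in the criterion, so the rule is conservative; reporting the correlation with and without the flagged participant, as done above, makes the influence of the exclusion transparent.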

      -  The authors cite Maris & Oostenveld (2007) in line 415 as the main reference for the FieldTrip toolbox, but the correct reference here is different, see https://www.fieldtriptoolbox.org/faq/how_should_i_refer_to_fieldtrip_in_my_publication/

      Thank you for pointing out this issue. Citation corrected.

      -  The authors could consider giving some more background on the additive vs superadditive distinction in the Introduction, which may increase the impact; as it stands the reader might not know why this is particularly interesting. Summarize some of the takeaways of the Stevenson et al. (2014) review in this respect.

      Thanks for the suggestion and we have added the following relevant information in the Introduction (lines 80-90):

      “Moreover, we adopted an additive model to classify multisensory integration based on the AV vs A+V comparison. This model assumes independence between inputs from each sensory modality and distinguishes among sub-additive (AV < A+V), additive (AV = A+V), and super-additive (AV > A+V) response modes (see a review by Stevenson et al., 2014). The additive mode represents a linear combination between two modalities. In contrast, the super-additive and sub-additive modes indicate non-linear interaction processing, either with potentiated neural activation to facilitate the perception or detection of near-threshold signals (super-additive) or a deactivation mechanism to minimize the processing of redundant information cross-modally (sub-additive) (Laurienti et al., 2005; Metzger et al., 2020; Stanford et al., 2005; Wright et al., 2003).”
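
      The three response modes defined in this passage can be written as a minimal decision rule. This is only a sketch: the function name `integration_mode` and the tolerance parameter are hypothetical, and in practice the AV vs A+V comparison would be decided by a statistical test across participants rather than by a fixed tolerance.

```python
def integration_mode(av, a, v, tol=0.0):
    """Classify a response by comparing AV with the sum of the unisensory responses."""
    diff = av - (a + v)
    if diff > tol:
        return "super-additive"   # AV > A+V
    if diff < -tol:
        return "sub-additive"     # AV < A+V
    return "additive"             # AV = A+V (within tolerance)

# Hypothetical amplitudes for the audiovisual, auditory, and visual conditions.
modes = [
    integration_mode(0.9, 0.3, 0.4, tol=1e-9),
    integration_mode(0.5, 0.3, 0.4, tol=1e-9),
    integration_mode(7, 3, 4),
]
```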

      Reviewer #2 (Public Review):

      Strengths:

      The manuscript is well-written, with a concise and clear writing style. The visual presentation is largely clear. The study involves multiple experiments with different participant groups. Each experiment involves specific considered changes to the experimental paradigm that both replicate the previous experiment's finding yet extend it in a relevant manner.

      We thank the reviewer for the valuable feedback.

      Weaknesses:

      The manuscript interprets the neural findings using mechanistic and cognitive claims that are not justified by the presented analyses and results.

      First, entrainment and cortical tracking are both invoked in this manuscript, sometimes interchangeably so, but it is becoming the standard of the field to recognize their separate evidential requirements. Namely, step and gait cycles are striking perceptual or cognitive events that are expected to produce event-related potentials (ERPs). The regular presentation of these events in the paradigm will naturally evoke a series of ERPs that leave a trace in the power spectrum at stimulation rates even if no oscillations are at play. Thus, the findings should not be interpreted from an entrainment framework except if it is contextualized as speculation, or if additional analyses or experiments are carried out to support the assumption that oscillations are present. Even if oscillations are shown to be present, it is then a further question whether the oscillations are causally relevant toward the integration of biological motion and for the orchestration of cognitive processes.

      Second, if only a cortical tracking account is adopted, it is not clear why the demonstration of supra-additivity in spectral amplitude is cognitively or behaviorally relevant. Namely, the fact that frequency-specific neural responses to the [audio & visual] condition are stronger than those to [audio] and [visual] combined does not mean this has implications for behavioral performance. While the correlation to autism traits could suggest some relation to behavior and is interesting in its own right, this correlation is a highly indirect way of assessing behavioral relevance. It would be helpful to test the relevance of supra-additive cortical tracking on a behavioral task directly related to the processing of biological motion to justify the claim that inputs are being integrated in the service of behavior. Under either framework, cortical tracking or entrainment, the causal relevance of neural findings toward cognition is lacking.

      Overall, I believe this study finds neural correlates of biological motion, and it is possible that such neural correlates relate to behaviorally relevant neural mechanisms, but based on the current task and associated analyses this has not been shown.

      Thanks for raising the important concerns regarding the interpretation of our results within the entrainment or the cortical tracking framework. A strict neural entrainment account emphasizes the alignment of endogenous neural oscillations with external rhythms, rather than a mere regular repetition of stimulus-evoked responses. However, it is challenging to fully dissociate these components, given that rhythmic stimulation can shape intrinsic neural oscillations, resulting in an intricate interplay between endogenous neural oscillations and stimulus-evoked responses (Duecker et al., 2024; Herrmann et al., 2016; Hosseinian et al., 2021). Therefore, some research, including the current study, uses the term “entrainment” to refer to the alignment of brain activity to rhythmic stimulation in a broader context, without isolating the intrinsic oscillations and evoked responses (e.g., Ding et al., 2016; Nozaradan et al., 2012; Obleser & Kayser, 2019). Nevertheless, we agree with the reviewer that since the current results did not examine or provide direct evidence for endogenous oscillations, it is better to contextualize the oscillation view as speculation. Hence, we have replaced most of the expressions about “entrainment” with a more general term “tracking” in the revised manuscript (as well as in the title of the manuscript). We only briefly mentioned the entrainment account in the Discussion to facilitate comparison with the literature (lines 307-312).

      Regarding the relevance between neural findings and cognition or behavioral performance, the first supporting evidence comes from the inversion effect in Experiment 2. For the neural responses at gait-cycle frequency, we observed a significantly enhanced audiovisual congruency effect in the upright condition compared with the inverted condition. Inversion disrupts the distinctive kinematic features of biological motion (e.g., gravity-compatible ballistic movements) and significantly impairs biological motion processing, but it does not change the basic visual properties of the stimuli, including the rhythmic signals generated by low-level motion cues. Therefore, the inversion effect has long been regarded as an indicator of the specificity of biological motion processing in numerous behavioral and neuroimaging studies (Bardi et al., 2014; Grossman & Blake, 2001; Shen, Lu, Yuan, et al., 2023; Simion et al., 2008; Troje & Westhoff, 2006; Vallortigara & Regolin, 2006; Wang et al., 2014; Wang & Jiang, 2012; Wang et al., 2022). Here, our finding of the cortical tracking of higher-order rhythmic structures (gait cycles) present in the upright but not in the inverted condition suggests that this cortical tracking effect cannot be explained by ERPs evoked by regular onsets of rhythmic events. Rather, it is closely linked with the specialized cognitive processing of biological motion. Furthermore, we found that the BM-specific cortical tracking effect at gait-cycle frequency (rather than the non-selective tracking effect at step-cycle frequency) correlates with observers’ autistic traits, indicating its functional relevance to social cognition. Together, these findings suggest that the cortical tracking effect that we currently observed engages cognitively relevant neural mechanisms.
      In addition, our recent behavioral study showed that listening to frequency-congruent footstep sounds, compared with incongruent sounds, enhanced the visual search for human walkers but not for non-biological motion stimuli containing the same rhythmic signals (Shen, Lu, Wang, et al., 2023). These results suggest that audiovisual correspondence specifically enhances the perceptual and attentional processing of biological motion. Future research could examine whether the cortical tracking of rhythmic structures plays a functional role in this process, which may shed more light on the behavioral relevance of the cortical tracking effect to biological motion perception. We have incorporated the above information into the Discussion (lines 268-293).

      Reviewer #2 (Recommendations For The Authors):

      In Figure 1c, it could be helpful to add the word "static" in the illustration for the auditory condition so that readers understand without reading the subtext that it is a static image without biological motion.

      Suggestion taken.

      In the Discussion, I believe it is important to justify an oscillation and entrainment account, or if it cannot be justified based on the current results and analyses (which is my opinion), it could be helpful to explicitly frame it as speculation.

      We agree with the reviewer. For more clarification, please refer to our response to the public review.

      L335, I did not understand this sentence - a reformulation would be helpful.

      The point-light stimuli were created by capturing the motion of a walking actor (Vanrie & Verfaillie, 2004). The global motion of the walking sequences was eliminated so that the point-light walker appears to walk on a treadmill without translational motion. We have reformulated the sentence as follows: “The point-light walker was presented at the center of the screen without translational motion.”

      The results in Figure 2a and 2d are derived by performing a t-test between the amplitude at the frequency of gait and step cycles and zero. Comparison against amplitude of zero is too liberal; the possibility for a Type-I error is inflated because even EEG data with only noise will not have amplitudes of zero at all frequencies. A better baseline (H0) is either the 1/frequency trend in the power spectrum derived using methods like FOOOF (https://fooof-tools.github.io/fooof/) or by performing non-parametric shuffling based methods (https://doi.org/10.1016/j.jneumeth.2007.03.024).

      In our data analysis, instead of testing the raw amplitude against zero, we compared the normalized amplitude at each frequency bin (obtained by subtracting the average amplitude measured at the neighboring frequency bins from the original amplitude data) against zero. Such analysis is equivalent to contrasting the raw amplitude with that of its neighboring frequency bins, allowing us to test whether the neural response in each frequency bin showed a significant enhancement compared with its neighbors. The multiple comparisons across frequency bins were controlled by false discovery rate (FDR) correction, reducing the Type-I error. Such analysis procedures help reduce (though not totally remove) the influence of the 1/f trend and have been widely used in this field (Cirelli et al., 2016; Henry & Obleser, 2012; Lenc et al., 2018; Nozaradan et al., 2012; Peter et al., 2023).
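
      The two steps described above, neighbor-bin subtraction and FDR control, can be sketched as follows. This is an illustration under assumptions: the number of neighboring bins (`n_neighbors`) is hypothetical because the response does not specify it, and the Benjamini-Hochberg procedure is assumed as the FDR method, since the response only says "FDR correction".

```python
from statistics import mean

def neighbor_corrected(spectrum, n_neighbors=2):
    """Subtract from each bin the mean amplitude of its n nearest bins on each side."""
    out = []
    for i, amp in enumerate(spectrum):
        idx = [j for d in range(1, n_neighbors + 1) for j in (i - d, i + d)
               if 0 <= j < len(spectrum)]
        out.append(amp - mean(spectrum[j] for j in idx))
    return out

def benjamini_hochberg(p_values, alpha=0.05):
    """Return one reject flag per p-value (Benjamini-Hochberg step-up procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank  # largest rank whose p-value passes its threshold
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject

# Hypothetical flat spectrum with one peak at bin 3.
spectrum = [1.0, 1.0, 1.0, 3.0, 1.0, 1.0, 1.0]
corrected = neighbor_corrected(spectrum)
# Hypothetical per-bin p-values from one-sample tests of the corrected amplitudes.
rejected = benjamini_hochberg([0.001, 0.04, 0.03, 0.5])
```

      On a flat background the corrected amplitude is zero everywhere except at the peak, which is why testing the corrected values against zero is a test of each bin against its neighbors rather than against an absolute zero.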

      To further verify our findings, we adopted the reviewer’s suggestion and created a baseline by performing a non-parametric shuffling-based analysis. More specifically, to establish the statistical significance of amplitude peaks, we carried out a surrogate analysis on each condition. For each participant, a single control surrogate dataset was derived from their actual dataset by jittering the onset of each step-cycle relative to the original onset by a randomly selected integer value between −490 and 490 ms. This procedure removed the consistent relationship between the EEG signal and the stimuli while preserving each epoch’s general timing within the exposure period. Then, epochs were extracted based on surrogate stimulus onsets, and amplitude was computed across frequencies through FFT under a null model of non-entrainment (Moreau et al., 2022). This entire procedure was performed 100 times, producing a surrogate amplitude distribution of 100 group-averaged values for each condition. If the observed amplitude value at the frequency of interest exceeded the value corresponding to the 95th percentile of the surrogate distribution (p < .05) within a given condition (e.g., AV), the amplitude peak was considered significant (Batterink, 2020). As shown in Author response image 2, the statistical results from these analyses are similar to those reported in the manuscript, confirming the significant amplitude peaks at the frequencies of interest.

      Author response image 2.

      Non-parametric analysis for spectral peak. The dotted lines represent the random data based on shuffling analysis. The solid lines represent the observed data in measured EEG signals. All conditions induced significant peaks at step-cycle frequency and its harmonic, while only the AV condition induced a significant peak at gait-cycle frequency.
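
      The logic of the onset-jitter surrogate test can be sketched on simulated data. This is a toy example for a single simulated recording, not the group-level pipeline described above: the sampling rate, epoch length, signal-to-noise ratio, and the 2 Hz target frequency are all assumptions, and a single-bin DFT replaces the full FFT.

```python
import cmath
import math
import random

random.seed(1)
FS = 100            # sampling rate in Hz (assumed)
F_TARGET = 2.0      # frequency of interest in Hz (assumed step-cycle rate)
EPOCH = 2 * FS      # 2-s epochs (assumed)
N_SAMPLES = 60 * FS

# Toy "EEG": a weak response locked to the rhythm, buried in Gaussian noise.
signal = [0.5 * math.sin(2 * math.pi * F_TARGET * n / FS) + random.gauss(0, 1)
          for n in range(N_SAMPLES)]
onsets = list(range(FS, N_SAMPLES - EPOCH - FS, FS // 2))  # one onset every 500 ms

def amplitude(onset_samples):
    """Amplitude at F_TARGET of the epoch average (single-bin DFT)."""
    avg = [sum(signal[o + k] for o in onset_samples) / len(onset_samples)
           for k in range(EPOCH)]
    coef = sum(x * cmath.exp(-2j * math.pi * F_TARGET * k / FS)
               for k, x in enumerate(avg))
    return 2 * abs(coef) / EPOCH

def jittered(onset_samples):
    """Shift every onset by a random integer number of ms in [-490, 490]."""
    return [o + round(random.randint(-490, 490) * FS / 1000) for o in onset_samples]

observed = amplitude(onsets)
surrogates = sorted(amplitude(jittered(onsets)) for _ in range(100))
threshold = surrogates[94]  # 95th of 100 sorted surrogate values
is_significant = observed > threshold
```

      Jittering breaks the phase relationship between epochs and the rhythm, so the surrogate amplitudes reflect what epoch averaging yields when no consistent stimulus-locked response is present; the observed peak is judged against that distribution rather than against zero.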

      Reviewer #3 (Public Review):

      Strengths:

      The main strengths of the paper relate to the conceptualization of BM and the way it is operationalized in the experimental design and analyses. The use of entrainment, and the tracking of different, nested aspects of BM result in seemingly clean data that demonstrate the basic pattern. The first experiments essentially provide the basic utility of the methodological innovation and the second experiment further hones in on the relevant interpretation of the findings by the inclusion of better control stimuli sets.

      Another strength of the work is that it includes at a conceptual level two replications.

      We appreciate the reviewer for the comprehensive review and positive comments.

      Weaknesses:

      The statistical analysis is misleading and inadequate at times. The inclusion of the autism trait is not foreshadowed and adequately motivated and is likely underpowered. Finally, a broader discussion over other nested frequencies that might reside in the point-light walker stimuli would also be important to fully interpret the different peaks in the spectra.

      (1) Regarding the nested frequency peaks in the spectra, we did observe multiple significant amplitude peaks at 1f (1/0.83 Hz), 2f (2/1.67 Hz), and 4f (4/3.33 Hz) relative to the gait-cycle frequency (Fig. 2 a&d). To further test the functional roles of the neural activity at different frequencies, we analyzed the audiovisual integration modes at each frequency. Note that we collapsed the data from Experiments 1a & 1b in the analysis as they yielded similar results. Overall, results show a similar additive audiovisual integration mode at 2f and 4f and a super-additive integration mode only at 1f (Figure S1), suggesting that the cortical tracking effects at 2f and 4f may be functionally linked but independent of that at 1f. We have reported the detailed results in the Supplementary Information.

      (2) For the reviewer’s other concerns about statistical analysis and autism traits, please refer to our responses below to the Recommendations for the authors.

      Reviewer #3 (Recommendations For The Authors):

      The description of the analyses performed for experiment 2 comes across as double dipping. Congruency effects for BM and non-BM motion (inverted) were compared using cluster-based statistics. Then identified clusters informed an averaging of signals which then were subjected to a paired comparison. At this point, it is no surprise that these paired comparisons are highly significant seeing that the channels were selected based on a cluster analysis of the same exact contrast. This approach should be avoided.

      In the analysis of the repeated measures ANOVA reporting a trend as marginally significant is misleading. Reporting the statistical results whilst indicating that those do not reach significance is the appropriate way to communicate this finding. Other statistics can be used in order to provide the likelihood of those findings supporting H1 or H0 if the authors would like to state something more precise (Bayesian).

      Thanks for the comments. We have addressed these two points in our response to the public review of Reviewer #1.

      The authors perform a correlation along "autistic trait" scores in an individual differences approach. Individual differences are typically investigated in larger samples (>n=40). In addition, the range of AQ scores seems limited to mostly average or lower-than-average AQs (barring a couple). These points make the conclusions on the possible role of BM in the autistic phenotype very tentative. I would recommend acknowledging this.

      An alternative analysis approach that might better suit the smaller sample size is a comparison between high and low AQ participants, defined based on a median split.

      Many thanks for the suggestion. We agree with the reviewer that the sample size (n = 24) in the current study is not large for exploring the correlation between BM and autistic traits. The narrow range of AQ scores arose because all participants came from a non-clinical population and we did not pre-select participants by AQ scores. To further confirm our findings, we adopted your suggestion to compare the BM-specific cortical tracking effect (i.e., audiovisual congruency effect (Upright – Inverted)) between high and low AQ participants split by the median AQ score (20) of this sample. As in the correlation analysis, one outlier, whose audiovisual congruency effect (Upright – Inverted) in neural responses at 1 Hz exceeds 3 SD from the group mean, was removed from the following analysis. As shown in Figure S3, at 1 Hz, participants with low AQ showed a greater cortical tracking effect compared with high AQ participants (t(21) = 2.127, p = 0.045). At 2 Hz, low and high AQ participants showed comparable neural responses (t(22) = 0.946, p = 0.354). These results are in line with the correlation analysis, providing further support for the functional relevance between social cognition and cortical tracking of biological motion as well as its dissociation at the two temporal scales. We have added these results to the main text (lines 238-244) and the supplementary information.
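
      The median-split comparison can be sketched as below. The reported degrees of freedom (21 for 23 participants) are consistent with a pooled-variance independent-samples t-test, which is what the sketch computes; the AQ scores, effect values, and the handling of ties at the median (assigned to the low group) are hypothetical.

```python
from statistics import mean, median, variance

def median_split(scores, values):
    """Split values into low/high groups by the median score (ties go to the low group)."""
    cut = median(scores)
    low = [v for s, v in zip(scores, values) if s <= cut]
    high = [v for s, v in zip(scores, values) if s > cut]
    return low, high

def pooled_t(x, y):
    """Student's independent-samples t statistic and its degrees of freedom."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / df  # pooled variance
    return (mean(x) - mean(y)) / (sp2 * (1 / nx + 1 / ny)) ** 0.5, df

# Hypothetical AQ scores and BM-specific congruency effects for eight participants.
aq = [12, 15, 17, 19, 21, 23, 26, 30]
effect = [0.06, 0.04, 0.05, 0.03, 0.01, 0.02, 0.00, -0.01]
low, high = median_split(aq, effect)
t_stat, df = pooled_t(low, high)
```

      A median split discards information relative to a correlation, so it serves here as a robustness check on the correlation result rather than as a more powerful test.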

      Writing

      The narrative could be better unfolded and studies better motivated. The transition from basic science research on BM to possibly delineating a mechanistic understanding of autism was a surprise at the end of the intro. Once the authors consider the suggestions and comments above it would be good to have this detail and motivation more obviously foreshadowed in the text.

      Thanks for the great suggestion and we have provided an introduction about how audiovisual BM processing links with social cognition and ASD in the first paragraph of the revised manuscript (lines 46-56). In particular, integrating multisensory BM cues is foundational for perceiving and attending to other people and developing further social interaction. However, such ability is usually compromised in people with social deficits, such as individuals with autism spectrum disorder (ASD) (Feldman et al., 2018), and even in non-clinical populations with high autistic traits (Ujiie et al., 2015). These behavioral findings underline the close relationship between multisensory BM processing and one’s social cognitive capability, motivating us to further explore this issue at the neural level in the current study. We have also modified the relevant content in the last paragraph of the Introduction (lines 100-108), briefly mentioning the methods that we used to investigate this issue.

      The use of terminology related to neural oscillations which are entraining to the BM seems to suggest that the rhythmic tracking inevitably stems from the shaping of existing intrinsic dynamics of the brain. I am not sure this is necessarily the case. I would therefore adopt a more concrete jargon for the description of the entrainment seen in this study. If a discussion over internal dynamics shaped by external stimuli should be invoked, it should be done explicitly with appropriate references (but in my opinion, it isn't quite required).

      Please refer to our response to a similar point raised in the public review of Reviewer #2.

      References

      Bardi, L., Regolin, L., & Simion, F. (2014). The First Time Ever I Saw Your Feet: Inversion Effect in Newborns’ Sensitivity to Biological Motion. Developmental Psychology, 50. https://doi.org/10.1037/a0034678

      Baron-Cohen, S., Wheelwright, S., Skinner, R., Martin, J., & Clubley, E. (2001). The autism-spectrum quotient (AQ): Evidence from Asperger syndrome/high-functioning autism, males and females, scientists and mathematicians. Journal of Autism and Developmental Disorders, 31(1), 5–17. https://doi.org/10.1023/a:1005653411471

      Batterink, L. (2020). Syllables in Sync Form a Link: Neural Phase-locking Reflects Word Knowledge during Language Learning. Journal of Cognitive Neuroscience, 32(9), 1735–1748. https://doi.org/10.1162/jocn_a_01581

      Cirelli, L. K., Spinelli, C., Nozaradan, S., & Trainor, L. J. (2016). Measuring Neural Entrainment to Beat and Meter in Infants: Effects of Music Background. Frontiers in Neuroscience, 10. https://doi.org/10.3389/fnins.2016.00229

      Ding, N., Melloni, L., Zhang, H., Tian, X., & Poeppel, D. (2016). Cortical tracking of hierarchical linguistic structures in connected speech. Nature Neuroscience, 19(1), 158–164. https://doi.org/10.1038/nn.4186

      Duecker, K., Doelling, K. B., Breska, A., Coffey, E. B. J., Sivarao, D. V., & Zoefel, B. (2024). Challenges and approaches in the study of neural entrainment. Journal of Neuroscience, 44(40). https://doi.org/10.1523/JNEUROSCI.1234-24.2024

      Falck-Ytter, T., Nyström, P., Gredebäck, G., Gliga, T., Bölte, S., & the EASE team. (2018). Reduced orienting to audiovisual synchrony in infancy predicts autism diagnosis at 3 years of age. Journal of Child Psychology and Psychiatry, 59(8), 872–880. https://doi.org/10.1111/jcpp.12863

      Feldman, J. I., Dunham, K., Cassidy, M., Wallace, M. T., Liu, Y., & Woynaroski, T. G. (2018). Audiovisual multisensory integration in individuals with autism spectrum disorder: A systematic review and meta-analysis. Neuroscience & Biobehavioral Reviews, 95, 220–234. https://doi.org/10.1016/j.neubiorev.2018.09.020

      Grossman, E. D., & Blake, R. (2001). Brain activity evoked by inverted and imagined biological motion. Vision Research, 41(10), 1475–1482. https://doi.org/10.1016/S0042-6989(00)00317-5

      Henry, M. J., & Obleser, J. (2012). Frequency modulation entrains slow neural oscillations and optimizes human listening behavior. Proceedings of the National Academy of Sciences, 109(49), 20095–20100. https://doi.org/10.1073/pnas.1213390109

      Herrmann, C. S., Murray, M. M., Ionta, S., Hutt, A., & Lefebvre, J. (2016). Shaping Intrinsic Neural Oscillations with Periodic Stimulation. Journal of Neuroscience, 36(19), 5328–5337. https://doi.org/10.1523/JNEUROSCI.0236-16.2016

      Hosseinian, T., Yavari, F., Biagi, M. C., Kuo, M.-F., Ruffini, G., Nitsche, M. A., & Jamil, A. (2021). External induction and stabilization of brain oscillations in the human. Brain Stimulation, 14(3), 579–587. https://doi.org/10.1016/j.brs.2021.03.011

      Klin, A., Lin, D. J., Gorrindo, P., Ramsay, G., & Jones, W. (2009). Two-year-olds with autism orient to non-social contingencies rather than biological motion. Nature, 459(7244), 257–261. https://doi.org/10.1038/nature07868

      Laurienti, P. J., Perrault, T. J., Stanford, T. R., Wallace, M. T., & Stein, B. E. (2005). On the use of superadditivity as a metric for characterizing multisensory integration in functional neuroimaging studies. Experimental Brain Research, 166(3), 289–297. https://doi.org/10.1007/s00221-005-2370-2

      Lenc, T., Keller, P. E., Varlet, M., & Nozaradan, S. (2018). Neural tracking of the musical beat is enhanced by low-frequency sounds. Proceedings of the National Academy of Sciences, 115(32), 8221–8226. https://doi.org/10.1073/pnas.1801421115

      Metzger, B. A., Magnotti, J. F., Wang, Z., Nesbitt, E., Karas, P. J., Yoshor, D., & Beauchamp, M. S. (2020). Responses to Visual Speech in Human Posterior Superior Temporal Gyrus Examined with iEEG Deconvolution. The Journal of Neuroscience: The Official Journal of the Society for Neuroscience, 40(36), 6938–6948. https://doi.org/10.1523/JNEUROSCI.0279-20.2020

      Moreau, C. N., Joanisse, M. F., Mulgrew, J., & Batterink, L. J. (2022). No statistical learning advantage in children over adults: Evidence from behaviour and neural entrainment. Developmental Cognitive Neuroscience, 57, 101154. https://doi.org/10.1016/j.dcn.2022.101154

      Nozaradan, S., Peretz, I., & Mouraux, A. (2012). Selective Neuronal Entrainment to the Beat and Meter Embedded in a Musical Rhythm. Journal of Neuroscience, 32(49), 17572–17581. https://doi.org/10.1523/JNEUROSCI.3203-12.2012

      Obleser, J., & Kayser, C. (2019). Neural Entrainment and Attentional Selection in the Listening Brain. Trends in Cognitive Sciences, 23(11), 913–926. https://doi.org/10.1016/j.tics.2019.08.004

      Peter, V., Goswami, U., Burnham, D., & Kalashnikova, M. (2023). Impaired neural entrainment to low frequency amplitude modulations in English-speaking children with dyslexia or dyslexia and DLD. Brain and Language, 236, 105217. https://doi.org/10.1016/j.bandl.2022.105217

      Shen, L., Lu, X., Wang, Y., & Jiang, Y. (2023). Audiovisual correspondence facilitates the visual search for biological motion. Psychonomic Bulletin & Review, 30(6), 2272–2281. https://doi.org/10.3758/s13423-023-02308-z

      Shen, L., Lu, X., Yuan, X., Hu, R., Wang, Y., & Jiang, Y. (2023). Cortical encoding of rhythmic kinematic structures in biological motion. NeuroImage, 268, 119893. https://doi.org/10.1016/j.neuroimage.2023.119893

      Simion, F., Regolin, L., & Bulf, H. (2008). A predisposition for biological motion in the newborn baby. Proceedings of the National Academy of Sciences, 105(2), 809–813. https://doi.org/10.1073/pnas.0707021105

      Stanford, T. R., Quessy, S., & Stein, B. E. (2005). Evaluating the Operations Underlying Multisensory Integration in the Cat Superior Colliculus. Journal of Neuroscience, 25(28), 6499–6508. https://doi.org/10.1523/JNEUROSCI.5095-04.2005

      Stevenson, R. A., Ghose, D., Fister, J. K., Sarko, D. K., Altieri, N. A., Nidiffer, A. R., Kurela, L. R., Siemann, J. K., James, T. W., & Wallace, M. T. (2014). Identifying and Quantifying Multisensory Integration: A Tutorial Review. Brain Topography, 27(6), 707–730. https://doi.org/10.1007/s10548-014-0365-7

      Troje, N. F., & Westhoff, C. (2006). The Inversion Effect in Biological Motion Perception: Evidence for a “Life Detector”? Current Biology, 16(8), 821–824. https://doi.org/10.1016/j.cub.2006.03.022

      Ujiie, Y., Asai, T., & Wakabayashi, A. (2015). The relationship between level of autistic traits and local bias in the context of the McGurk effect. Frontiers in Psychology, 6. https://doi.org/10.3389/fpsyg.2015.00891

      Vallortigara, G., & Regolin, L. (2006). Gravity bias in the interpretation of biological motion by inexperienced chicks. Current Biology, 16(8), R279–R280. https://doi.org/10.1016/j.cub.2006.03.052

      Vanrie, J., & Verfaillie, K. (2004). Perception of biological motion: A stimulus set of human point-light actions. Behavior Research Methods, Instruments, & Computers, 36(4), 625–629. https://doi.org/10.3758/BF03206542

      Wang, L., & Jiang, Y. (2012). Life motion signals lengthen perceived temporal duration. Proceedings of the National Academy of Sciences of the United States of America, 109(11), E673-677. https://doi.org/10.1073/pnas.1115515109

      Wang, L., Yang, X., Shi, J., & Jiang, Y. (2014). The feet have it: Local biological motion cues trigger reflexive attentional orienting in the brain. NeuroImage, 84, 217–224. https://doi.org/10.1016/j.neuroimage.2013.08.041

      Wang, Y., Zhang, X., Wang, C., Huang, W., Xu, Q., Liu, D., Zhou, W., Chen, S., & Jiang, Y. (2022). Modulation of biological motion perception in humans by gravity. Nature Communications, 13(1), Article 1. https://doi.org/10.1038/s41467-022-30347-y

      Wright, T. M., Pelphrey, K. A., Allison, T., McKeown, M. J., & McCarthy, G. (2003). Polysensory Interactions along Lateral Temporal Regions Evoked by Audiovisual Speech. Cerebral Cortex, 13(10), 1034–1043. https://doi.org/10.1093/cercor/13.10.1034

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Dong et al here have studied the impact of the small Ras-like GTPase Rab10 on the exocytosis of dense core vesicles (DCVs), which are important mediators of neuropeptide signaling in the brain. They use optical imaging to show that lentiviral depletion of Rab10 in mouse hippocampal neurons in culture, independent of the established defects in neurite outgrowth, hampers DCV exocytosis. They further demonstrate that such defects are paralleled by changes in ER morphology and defective ER-based calcium buffering as well as reduced ribosomal protein expression in Rab10-depleted neurons. Re-expression of Rab10 or supplementation of exogenous L-leucine to restore defective neuronal protein synthesis rescues impaired DCV secretion. Based on these results they propose that Rab10 regulates DCV release by maintaining ER calcium homeostasis and neuronal protein synthesis.

      Strengths:

      This work provides interesting and potentially important new insights into the connection between ER function and the regulated secretion of neuropeptides via DCVs. The authors combine advanced optical imaging with light and electron microscopy, biochemistry, and proteomics approaches to thoroughly assess the effects of Rab10 knockdown at the cellular level in primary neurons. The proteomic dataset provided may be valuable in facilitating future studies regarding Rab10 function. This work will thus be of interest to neuroscientists and cell biologists.

      We appreciate the positive evaluation of our manuscript.

      Weaknesses:

      While the main conclusions of this study are comparably well supported by the data, I see three major weaknesses:

      (1) For some of the data the statistical basis for analysis remains unclear. I.e. is the statistical assessment based on N= number of experiments or n = number of synapses, images, fields of view etc.? As the latter cannot be considered independent biological replicates, they should not form the basis of statistical testing.

      This is an important point and we agree that multiple samples from the same biological replicate are not independent observations. We reanalyzed all nested data using a linear mixed model and indicated this in the Methods section and the relevant figure legends (Brunner et al., 2022). In brief, biological replicates (individual neuronal cultures) were used as a linear predictor. Outliers were identified and excluded using the ROUT method in GraphPad. A fixed linear regression model was then fitted to the data using the lm() function in R. A one-way ANOVA (analysis of variance) was used to assess whether including the experimental group as a second linear predictor (formula = y ~ Group + Culture) statistically improved the fit relative to a model without group information (formula = y ~ 1 + Culture). Post-hoc analysis was performed using the emmeans() function with Tukey’s adjustment when more than two experimental groups were present. Importantly, our conclusions remain unchanged.
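
      The nested-model comparison described above (y ~ 1 + Culture versus y ~ Group + Culture) can be illustrated with a small sketch. It is written in plain Python rather than R, uses hypothetical measurements, and computes only the F statistic that such a comparison rests on; the helper names (`design`, `solve`, `fit_rss`) are illustrative and not part of the authors' pipeline.

```python
# Hypothetical measurements: two groups, three cultures, two cells per combination.
data = {
    ("control", "c1"): [10, 12], ("control", "c2"): [14, 16], ("control", "c3"): [12, 14],
    ("kd", "c1"): [6, 8], ("kd", "c2"): [9, 11], ("kd", "c3"): [9, 9],
}
rows = [{"group": g, "culture": c, "y": float(v)}
        for (g, c), values in data.items() for v in values]

def design(rows, factors):
    """Intercept plus dummy coding of each factor (first level is the reference)."""
    levels = {f: sorted({r[f] for r in rows}) for f in factors}
    return [[1.0] + [1.0 if r[f] == lv else 0.0
                     for f in factors for lv in levels[f][1:]]
            for r in rows]

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[p] = M[p], M[c]
        for i in range(n):
            if i != c:
                factor = M[i][c] / M[c][c]
                M[i] = [a - factor * pc for a, pc in zip(M[i], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_rss(rows, factors):
    """Ordinary least squares; returns (residual sum of squares, residual df)."""
    X = design(rows, factors)
    y = [r["y"] for r in rows]
    k = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    Xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    beta = solve(XtX, Xty)
    resid = [yi - sum(b * xi for b, xi in zip(beta, x)) for x, yi in zip(X, y)]
    return sum(e * e for e in resid), len(y) - k

rss0, df0 = fit_rss(rows, ["culture"])            # y ~ 1 + Culture
rss1, df1 = fit_rss(rows, ["group", "culture"])   # y ~ Group + Culture
f_stat = ((rss0 - rss1) / (df0 - df1)) / (rss1 / df1)
print(f"F({df0 - df1}, {df1}) = {f_stat:.2f}")
```

      Because culture enters both models, the test asks whether group explains variance beyond culture-to-culture differences, which is the point of treating cells from one culture as non-independent. Converting F to a p value needs the F distribution and is omitted here.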

      (2) As it stands the paper reports on three partially independent phenotypic observations, the causal interrelationship of which remains unclear. Based on prior studies (e.g. Mercan et al 2013 Mol Cell Biol; Graves et al JBC 1997) it is conceivable that defective ER-based calcium signaling and the observed reduction in protein synthesis are causally related. For example, ER calcium release is known to promote pS6K1 phosphorylation, a major upstream regulator of protein synthesis and ribosome biogenesis. Conversely, L-leucine supplementation is known to trigger calcium release from ER stores via IP3Rs. Given the reported impact of Rab10 on axonal transport of autophagosomes and, possibly, lysosomes via JIP3/4 or other mediators (see e.g. Cason and Holzbaur JCB 2023) and the fact that mTORC1, the alleged target of leucine supplementation, is located on lysosomes, which in turn form membrane contacts with the ER, it seems worth analyzing whether the various phenotypes observed are linked at the level of mTORC1 signaling.

      This is a great suggestion that could indeed further clarify the potential interplay between ER-based Ca2+ signaling and protein synthesis. To address this, we assessed the phosphorylation level of pS6K1 in control and Rab10 knockdown (KD) neurons with or without leucine treatment. These data are included in the new Figure 8—figure supplement 1 in the revised manuscript. Our results indicate that pS6K1 phosphorylation was not upregulated in Rab10 KD neurons, suggesting that the level of mTORC1 signaling does not differ between wild-type and KD neurons. Furthermore, leucine treatment increased the pS6K1 phosphorylation level, as expected, but this effect was similar in both groups. Hence, we conclude that differences in mTORC1 signaling induced by Rab10 loss are not a major factor in the observed impairment in protein synthesis.

      Author response image 1.

      Rab10 depletion does not upregulate the mTORC1 pathway. (A) Typical immunoblot showing pS6K1 levels in each condition. (B) Quantification of relative pS6K1 levels in each condition. All data are plotted as mean±s.e.m. (C) Control, Control + Leu: N = 2, n = 2, Rab10 KD, Rab10 KD + Leu: N = 2, n = 4.

      (3) The claimed lack of effect of Rab10 depletion on SV exocytosis is solely based on very strong train stimulation with 200 APs, a condition not very well suited to analyze defects in SV fusion. The conclusion that Rab10 loss does not impact SV fusion thus seems premature.

      We agree that 200 APs stimulation might be too strong to detect specific effects on evoked synaptic vesicle release, although this stimulation pattern is an established pattern in hundreds of studies (Emperador-Melero et al., 2018; Granseth et al., 2006; Ivanova et al., 2021; Kwon and Chapman, 2011; Reshetniak et al., 2020). We have toned down our conclusions and clarified in the revised manuscript that Rab10 is dispensable for SV exocytosis evoked by intense stimulations. The corresponding statements in the text have been modified accordingly (p. 5, l. 98, 124) and in figure legend (p. 17, 490).

      Reviewer #2 (Public Review):

      Summary:

      In this paper, the authors assess the function of Rab10 in dense core vesicle (DCV) exocytosis using RNAi and cultured neurons. The authors provide evidence that their knockdown (KD) is effective and that DCV release is compromised. They also perform proteomic analysis to identify potential pathways that are affected upon KD of Rab10 that may be involved in DCV release. Upon focusing on ER morphology and protein synthesis, the authors conclude that defects in protein synthesis and ER Ca2+ homeostasis contribute to the DCV release defect upon Rab10 KD. The authors claim that Rab10 is not involved in synaptic vesicle (SV) release and membrane homeostasis in mature neurons.

      Strengths:

      The data related to Rab10's role in DCV release seems to be strong and carried out with rigor. While the paper lacks in vivo evidence that this gene is indeed involved in DCV in a living mammalian organism, I feel the cellular studies have value. The identification of ER defect in Rab10 manipulation is not truly novel but it is a good confirmation of studies performed in other systems. The finding that DCV release defect and protein synthesis defect seen upon Rab10 KD can be significantly suppressed by Leucine supplementation is also a strength of this work.

      We appreciate the positive evaluation of our manuscript.

      Weaknesses:

      The data showing Rab10 is NOT involved in SV exocytosis seems a bit weak to me. Since the proteomic analysis revealed so many proteins that are involved in SV exo/endocytosis to be affected upon Rab10, it is a bit strange that they didn't see an obvious defect. Perhaps this could have been because of the protocol that the authors used to trigger SV release (I am not an E-phys expert but perhaps this could have been a 'sledge-hammer' manipulation that may mask any subtle defects)? Perhaps the authors can claim that DCV is more sensitive to Rab10 KD than SV, but I am not sure whether the authors should make a strong claim about Rab10 not being important for SV exocytosis.

      We agree that 200 APs stimulation might be too strong to see specific effects on evoked synaptic vesicle release, although this stimulation pattern is an established pattern in hundreds of studies. We have toned down our conclusions and clarified in the revised manuscript that Rab10 is dispensable for SV exocytosis evoked by intense stimulations. The corresponding statements in the text have been modified accordingly (p. 5, l. 98, 124) and in figure legend (p. 17, 490).

      Also, the authors mention "Rab10 does not regulate membrane homeostasis in mature neurons" but I feel this is an overstatement. Since the authors only performed KD experiments, not knock-out (KO) experiments, I believe they should not make any conclusion about it not being required, especially since there is some level of Rab10 present in their cells. If they want to make these claims, I believe the authors will need to perform conditional KO experiments, which are not performed in this study.

      This is a valid point. We have changed the statement to “membrane homeostasis in mature neurons was unaffected by Rab10 knockdown” (p. 13, l.376-377).

      Finally, the authors show that protein synthesis and ER Ca2+ defects seem to contribute to the defect but they do not discuss the relationship between the two defects. If the authors treat the Rab10 KD cells with both ionomycin and Leucine, do they get a full rescue? Or is one defect upstream of the other (e.g. can they see rescue of ER morphology upon Leucine treatment)? While this is not critical for the conclusions of the paper, several additional experiments could be performed to clarify their model, especially considering there is no clear model that explains how Rab10, protein synthesis, ER homeostasis, and Ca2+ are related to DCV (but not SV) exocytosis.

      This is an important point and a great suggestion. We have now tested the rescue effects of leucine treatment on ER morphology, as suggested. These data are included in the new Figure 8—figure supplement 2 in the revised manuscript. Our results indicate that the same dose of leucine that rescues DCV fusion and protein translation failed to rescue ER morphology. Hence, the defects in ER morphology appear to be independent of the impaired protein translation.

      Author response image 2.

      Leucine supplementation does not rescue ER morphological deficiency in Rab10 KD neurons. (A) Typical examples showing the KDEL signals in each condition. (B) Quantification of RTN4 intensity in MAP2-positive dendrites. (C) The ratio of neuritic to somatic RTN4 intensity (N/S). All data are plotted as mean±s.e.m. (B, C) Control: N = 3, n = 10; Rab10 KD: N = 3, n = 11; Rab10 KD + Leu: N = 3, n = 11. A one-way ANOVA tested the significance of adding experimental group as a predictor. **** = p<0.0001, ns = not significant.

      Reviewer #3 (Public Review):

      In the submitted manuscript, Dong and colleagues set out to dissect the role of the Rab10 small GTPase on the intracellular trafficking and exocytosis of dense core vesicles (DCVs). While the authors have already shown that Rab3 plays a central role in the exocytosis of DCVs in mammalian neurons, the roles of several other Rab-members have been identified genetically, but their precise mechanism of action in mammalian neurons remains unclear. In this study, the authors use a carefully designed and thoroughly executed series of experiments, including live-cell imaging, functional calcium-imaging, proteomics, and electron microscopy, to identify that DCV secretion upon Rab10 depletion in adult neurons is primarily a result of dysregulated protein synthesis and, to a lesser extent, disrupted intracellular calcium buffering. Given that the full deletion of Rab10 has a deleterious effect on neurons and that Rab10 has a major role in axonal development, the authors cautiously employed the knock-down strategy from 7 DIV, to focus on the functional impact of Rab10 in mature neurons. The experiments in this study were meticulously conducted, incorporating essential controls and thoughtful considerations, ensuring rigorous and comprehensive results.

      We are grateful for the positive evaluation of our manuscript.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      The work by Dong et al provides interesting and potentially important new insights into the connection between ER function and the regulated secretion of neuropeptides via DCVs. I suggest that the authors address the following points experimentally to increase the impact of this potentially important study.

      Major points:

      (1) As alluded to above, for some of the data the statistical basis for analysis remains unclear (examples are Figures 1C-F, J,K; Figure 2 1B-D,I-K; Figure 2 - Supplement 1D-F; Figure 2 - Supplement 2J,K, etc). I.e. is the statistical assessment based on N = number of experiments or n = number of synapses, images, fields of view etc.? As the latter cannot be considered independent biological replicates, they should not form the basis of statistical testing. The Ms also misses a dedicated paragraph on statistics in the methods section.

      See reply to reviewer 1 above. We fully agree and have resolved this point.

      (2) A main weakness of the paper is the missing connection between neuronal protein synthesis, and the observed structural and signaling defects at the level of the ER. I suggest that the authors analyze mTORC1 signaling in Rab10 depleted neurons and under rescue conditions (+Leu or re-expression of Rab10) as ribosome biogenesis is a major downstream target of mTORC1 and mTORC1 activity is related to lysosome position, which may be affected upon Rab10 loss, either directly or via effects on the ER that forms tight contacts with lysosomes.

      See reply to reviewer 1 above. We agreed and followed up experimentally.

      (3) Related to the above: Does overexpression of SERCA2 restore normal DCV exocytosis in Rab10-depleted neurons? This would help to distinguish whether calcium storage and release at the level of the ER indeed contribute to the exocytosis defect.

      This is an important point and a great suggestion. We have now tested the rescue effects of overexpression of SERCA2 on DCV fusion. These data are included in the new Figure 8—figure supplement 3 in the revised manuscript. SERCA2 OE failed to rescue the DCV fusion defects in Rab10 KD neurons.

      Author response image 3.

      Overexpression of SERCA2 does not rescue DCV fusion deficits in Rab10 KD neurons. (A) Typical examples showing the SERCA2 signals in each condition. (B) Cumulative plot of DCV fusion events per cell. (C) Summary graph of DCV fusion events per cell. (A) Total number of DCVs (total pool) per neuron, measured as the number of NPY-pHluorin puncta upon NH4Cl perfusion. (B) Fraction of NPY-pHluorin-labeled DCVs fusing during stimulation. All data are plotted as mean±s.e.m. (C-E) Control: N = 2, n = 10; Rab10 KD: N = 2, n = 13; SERCA2 OE: N = 2, n = 15. A one-way ANOVA tested the significance of adding experimental group as a predictor. *** = p<0.001, ** = p<0.01, ns = not significant.

      (4) The claimed lack of effect of Rab10 depletion on SV exocytosis is solely based on very strong train stimulation with 200 APs, a condition not very well suited to analyze defects in SV fusion. The conclusion that Rab10 loss does not impact SV fusion thus seems premature. The authors should conduct additional experiments under conditions of single or few APs (e.g. 4 or 10 APs) to really assess whether or not Rab10 depletion alters SV exocytosis at the level of pHluorin analysis in cultured neurons.

      See reply to reviewer 2 above. We agreed and made textual adjustments to address this point.

      (5) Related to the above: I am puzzled by the data shown in Figure 1H-J: From the pHluorin traces shown I would estimate a tau value of about 20-30 s (e.g. decay to 1/e = 37% of the peak value). The bar graph in Figure 1K claims 3-4 s, clearly clashing with the data shown. Were these experiments conducted at RT (where expected tau values are in the range of 30 s) or at 37 °C (one would expect taus of around 10 s in this case for Syp-pH)? I ask the authors to carefully check and possibly re-analyze their datasets.

      This is indeed a mistake. We thank the reviewer for flagging this miscalculation. Our original MATLAB script used for calculating the tau value contained an error and the datasets were normalized twice by mistake. We have now reanalyzed the data, and the corresponding figures and texts have been updated. Our conclusion that Rab10 KD does not affect SV endocytosis remains unchanged since the difference in tau between the control (28.5 s) and Rab10 KD (32.8 s) suffered from the same systematic error and were/are not significantly different.
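
      For orientation, the relation between a decay trace and its time constant can be sketched as below. This is not the authors' MATLAB script: it assumes a noise-free, mono-exponential, peak-normalized trace with zero baseline, generated synthetically with the control tau reported above, and recovers tau from the slope of the log-transformed signal.

```python
from math import exp, log

def fit_tau(times, signal):
    """Least-squares slope of ln(signal) against time; tau = -1 / slope."""
    logs = [log(s) for s in signal]
    n = len(times)
    t_bar = sum(times) / n
    l_bar = sum(logs) / n
    slope = (sum((t - t_bar) * (l - l_bar) for t, l in zip(times, logs))
             / sum((t - t_bar) ** 2 for t in times))
    return -1.0 / slope

TRUE_TAU = 28.5  # seconds; the control value quoted in the response

# Synthetic peak-normalized decay sampled once per second for 60 s.
times = [float(t) for t in range(60)]
trace = [exp(-t / TRUE_TAU) for t in times]

tau_hat = fit_tau(times, trace)
fraction_at_tau = exp(-1.0)  # signal remaining after one time constant (1/e)

print(f"fitted tau = {tau_hat:.1f} s, fraction left at t = tau: {fraction_at_tau:.2f}")
```

      The second quantity corresponds to the reviewer's rule of thumb: tau is the time at which the trace has fallen to roughly 37% of its peak.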

      (6) How many times was the proteomics experiment shown in Figure 3 conducted? I noticed that the data in panel H missed statistical analysis and error bars. Given the typical variation in these experiments, I suggest to only include data for proteins identified in at least 3 out of 4 experimental replicates.

      We agree that this information has not been clear. We have now explained replication in the Methods section (p. 42, l. 879-885). In brief, the proteomics experiment presented in Fig 3 was conducted with two independent cultures (‘biological replicates’), hence, formally only two independent observations. For each biological replicate, we performed four technical replicates. For our analysis, we only included peptides that were consistently detected across all samples (not only three as this reviewer suggests). Proteins in Panel H are ER-related proteins that are significantly different from control neurons with an adjusted FDR ≤ 0.01 and Log2 fold change ≥ 0.56. The primary purpose of our proteomics experiments was to generate hypotheses and guide subsequent experiments and the main findings were corroborated by other experiments presented in the manuscript.
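
      The selection thresholds mentioned above can be written out as a simple filter. The protein names and values below are hypothetical placeholders, and applying the fold-change cut-off to the absolute log2 value (so that both up- and down-regulated proteins pass) is an assumption, since the response only states "Log2 fold change ≥ 0.56".

```python
FDR_MAX = 0.01       # adjusted FDR threshold quoted in the response
LOG2_FC_MIN = 0.56   # log2 fold-change threshold quoted in the response

# Hypothetical entries for illustration only; not taken from the dataset.
proteins = [
    {"name": "protein_A", "log2_fc": -0.80, "fdr": 0.004},
    {"name": "protein_B", "log2_fc": 0.30, "fdr": 0.001},
    {"name": "protein_C", "log2_fc": 0.90, "fdr": 0.200},
    {"name": "protein_D", "log2_fc": 0.56, "fdr": 0.010},
]

def is_hit(protein):
    """Pass if FDR is small enough and the absolute log2 change is large enough."""
    return protein["fdr"] <= FDR_MAX and abs(protein["log2_fc"]) >= LOG2_FC_MIN

hits = [p["name"] for p in proteins if is_hit(p)]

# A log2 fold change of 0.56 corresponds to this linear fold change.
linear_fold = 2 ** LOG2_FC_MIN

print(hits, round(linear_fold, 2))
```

      On a linear scale, the 0.56 cut-off corresponds to a change of a little under 1.5-fold.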

      Minor:

      (7) Figure 2 - supplement 3 and Figure 4 - supplement 3 are only mentioned in the discussion. The authors should consider referring to these data in the results section.

      This is a valid point. We have now added a new statement “Moreover, only 10% of DCVs co-transport with Rab10” in the Results (p. 6-7, l. 162-164).

      (8) Where is the pHluorin data shown in Figure 1 bleach-corrected? If so, this should be stated somewhere in the Ms. Moreover, the timing of the NH4Cl pulse should be indicated in the scheme in panel I.

      We thank the reviewer for pointing these omissions out. We have now included information about the timing of NH4Cl pulse in panel I. We did not do bleach-correction for the pHluorin data shown in Figure 1. It has been shown that pHluorin is very stable with a bleaching rate in the alkaline state of 0.06% per second and 0.0024% per second in the quenched state (Balaji and Ryan, 2007). Indeed, we did not observe obvious photobleaching in the first 30s during our imaging as indicated by the average trace of pHluorin intensity in panel I.
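
      The size of the expected bleaching over the first 30 s follows from the quoted rates by simple arithmetic, sketched here. The calculation assumes a constant fractional loss per second that compounds over time, which is a simplification rather than a statement about the cited measurements.

```python
ALKALINE_RATE = 0.06 / 100     # fraction lost per second, unquenched pHluorin
QUENCHED_RATE = 0.0024 / 100   # fraction lost per second, quenched pHluorin

def remaining_fraction(rate_per_s, seconds):
    """Fraction of fluorescence left after compounding a per-second loss."""
    return (1.0 - rate_per_s) ** seconds

alkaline_left = remaining_fraction(ALKALINE_RATE, 30)
quenched_left = remaining_fraction(QUENCHED_RATE, 30)

print(f"alkaline: {100 * (1 - alkaline_left):.2f}% lost, "
      f"quenched: {100 * (1 - quenched_left):.3f}% lost over 30 s")
```

      Under this assumption, even fully unquenched pHluorin would lose less than 2% of its signal in 30 s, consistent with the statement that no obvious photobleaching was seen.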

      (9) Page 3/ lines 59-60: "...strongest inhibition of neuropeptide accumulation...". What is probably meant is "...strongest inhibition of neuropeptide release".

      We agree this statement is unclear. Sasidharan et al used a coelomocyte uptake assay as an indirect readout for DCV release. The ‘strongest inhibition of neuropeptide accumulation’ in coelomocytes in Rab10 mutant indicates DCV fusion deficits. We have now replaced the text with “Rab10 deficiency produces the strongest inhibition of neuropeptide release in C. elegans” to make it more clear.

      Reviewer #3 (Recommendations For The Authors):

      I strongly recommend the publishing of this study as a VOR with minor comments directed to the authors.

      (1) In Figure 4, the authors should include examples of tubular ER at the synapse, especially as this is an interesting point discussed in ln 226-229. Are there noticeable changes in the ER-mitochondria contacts at the synaptic boutons?

      We agree that examples of tubular ER at the synapse would improve the manuscript. We have now replaced the Figure 4A with such examples. We found it challenging to quantify ER-mitochondria contacts based on the electron microscopy (EM) images we currently have. The ER-mitochondria contact sites are quite rare in the cross-sections of our samples, making it difficult to perform a reliable quantitative analysis.

      (2) The limited impairment of calcium-ion homeostasis in Rab10 KD neurons is very interesting. Would the overexpression of Rab10T23N mimic the effect of a KD scenario? Is there a separation of function for Rab10 in calcium homeostasis vs. the regulation of protein synthesis?

      This is an interesting possibility. We tested this and expressed Rab10T23N in a new series of experiments. These data are presented as a new Figure 5 in the revised manuscript (p. 29). We observed that Ca2+ refilling after caffeine treatment was delayed to a similar extent in Rab10T23N-expressing and Rab10 KD neurons. While impaired Ca2+ homeostasis may affect protein synthesis through ER stress or mTORC1 activation, our findings indicate otherwise in Rab10 KD neurons. First, ATF4 levels, a marker of ER stress, were unaffected in Rab10 KD neurons. This indicates that any ER stress present is minimal or insufficient to significantly impact protein synthesis through this pathway. Second, we did not observe significant changes in mTORC1 activation in Rab10 KD neurons as indicated by a normal pS6K1 phosphorylation (see above). Based on these observations, we conclude that Rab10's roles in calcium homeostasis and protein synthesis are most likely separate.

      (3) The authors indicate that the internal release of calcium ions from the ER has no effect on DCV trafficking and fusion without showing the data. It is important to include this data as the major impact of the study is the dissecting of the calcium effects in mammalian neurons from the previous studies in invertebrates.

      We agree this is an important aspect in our reasoning. We are submitting the related manuscript on internal calcium stores to bioRxiv. The link will be added to the consolidated version of our manuscript.

      (4) The distinction between Rab3 and Rab10 co-trafficking on DCVs should be reported in the Results (currently, Figure 2 - supplement 3 is only mentioned in the Discussion) as it helps to understand the effects on DCV fusion.

      We agree. We have now added a new statement “Moreover, only 10% of DCVs co-transport with Rab10” in the Results (p. 6, l. 162-163).

      Reference:

      Balaji, J., Ryan, T.A., 2007. Single-vesicle imaging reveals that synaptic vesicle exocytosis and endocytosis are coupled by a single stochastic mode. Proceedings of the National Academy of Sciences 104, 20576–20581. https://doi.org/10.1073/pnas.0707574105

      Brunner, J.W., Lammertse, H.C.A., Berkel, A.A. van, Koopmans, F., Li, K.W., Smit, A.B., Toonen, R.F., Verhage, M., Sluis, S. van der, 2022. Power and optimal study design in iPSC-based brain disease modelling. Molecular Psychiatry 28, 1545. https://doi.org/10.1038/s41380-022-01866-3

      Emperador-Melero, J., Huson, V., van Weering, J., Bollmann, C., Fischer von Mollard, G., Toonen, R.F., Verhage, M., 2018. Vti1a/b regulate synaptic vesicle and dense core vesicle secretion via protein sorting at the Golgi. Nat Commun 9, 3421. https://doi.org/10.1038/s41467-018-05699-z

      Granseth, B., Odermatt, B., Royle, S.J., Lagnado, L., 2006. Clathrin-Mediated Endocytosis Is the Dominant Mechanism of Vesicle Retrieval at Hippocampal Synapses. Neuron 51, 773–786. https://doi.org/10.1016/j.neuron.2006.08.029

      Ivanova, D., Dobson, K.L., Gajbhiye, A., Davenport, E.C., Hacker, D., Ultanir, S.K., Trost, M., Cousin, M.A., 2021. Control of synaptic vesicle release probability via VAMP4 targeting to endolysosomes. Science Advances 7, eabf3873. https://doi.org/10.1126/sciadv.abf3873

      Kwon, S.E., Chapman, E.R., 2011. Synaptophysin Regulates the Kinetics of Synaptic Vesicle Endocytosis in Central Neurons. Neuron 70, 847–854. https://doi.org/10.1016/j.neuron.2011.04.001

      Reshetniak, S., Fernández-Busnadiego, R., Müller, M., Rizzoli, S.O., Tetzlaff, C., 2020. Quantitative Synaptic Biology: A Perspective on Techniques, Numbers and Expectations. International Journal of Molecular Sciences 21, 7298. https://doi.org/10.3390/ijms21197298

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This valuable work analyzes how specialized cells in the auditory system, known as octopus cells, can detect coincidences in their inputs at the submillisecond time scale. While previous work indicated that these cells receive no inhibitory inputs, the present study unambiguously demonstrates that these cells receive inhibitory glycinergic inputs. The physiologic impact of these inputs needs to be studied further. It remains incomplete at present but could be made solid by addressing caveats related to similar sizes of excitatory postsynaptic potentials and spikes in the octopus neurons.

      We apologize for not explicitly describing the experimental methods and analysis procedures that ensure discrimination between action potentials and EPSPs. This has been addressed in our responses to the reviewer comments and amended in the manuscript.

      Reviewer #1 (Public Review):

      Kreeger and colleagues have explored the balance of excitation and inhibition in the cochlear nucleus octopus cells of mice using morphological, electrophysiological, and computational methods. On the surface, the conclusion, that synaptic inhibition is present, does not seem like a leap. However, the octopus cells have been in the past portrayed as devoid of inhibition. This view was supported by the seeming lack of glycinergic fibers in the octopus cell area and the lack of apparent IPSPs. Here, Kreeger et al. used beautiful immunohistochemical and mouse genetic methods to quantify the inhibitory and excitatory boutons over the complete surface of individual octopus cells and further analyzed the proportions of the different subtypes of spiral ganglion cell inputs. I think the analysis stands as one of the most complete descriptions of any neuron, leaving little doubt about the presence of glycinergic boutons.

      Kreeger et al then examined inhibition physiologically, but here I felt that the study was incomplete. Specifically, no attempt was made to assess the actual, biological values of synaptic conductance for AMPAR and GlyR. Thus, we don't really know how potent the GlyR could be in mediating inhibition. Here are some numbered comments:

      (1) "EPSPs" were evoked either optogenetically or with electrical stimulation. The resulting depolarizations are interpreted to be EPSPs. However previous studies from Oertel show that octopus cells have tiny spikes, and distinguishing them from EPSPs is tricky. No mention is made here about how or whether that was done. Thus, the analysis of EPSP amplitude is ambiguous.

      We agree that large EPSPs can be difficult to distinguish from an octopus cell’s short spikes during experiments. During analysis, we distinguished spikes from EPSPs by generating phase plots, which allow us to visualize the first derivative of the voltage trace on the y-axis and the value of the voltage on the x-axis at each moment in time. In the example shown below, four depolarizing events were electrically evoked in an octopus cell (panel A). The largest of these events (shown in orange in panels B-D) has an amplitude of ~9mV and could be a small spike. The first derivative of the voltage (panel C) reveals a bi-phasic response in the larger orange trace, where during the rising phase (mV/ms > 0) of the EPSP there is a second, sharper rising phase for the spike. Like more traditionally sized action potentials, phase plots for octopus cell spikes also reveal a sharp change in the rate of voltage change over time (Author response image 1 panel D, ✱) after the rising action of the EPSP begins to slow. EPSPs (shown in blue in panels B-D) lack the deflection in the phase plot. Not all cases were as unambiguous as this example. Therefore, our analysis only included subthreshold stimulation that unambiguously evoked EPSPs, not spikes. A brief description of this analysis has been added to the methods text (lines 625-627) and we have noted in the results section that both ChR2-evoked and electrically-evoked stimulation can produce small action potentials, which were excluded from analysis (lines 156-158).

      Author response image 1.
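
      The phase-plot criterion described above can be illustrated with a minimal sketch. It is not the authors' analysis code: the function names, the central-difference derivative, the `min_slope` noise floor, and the synthetic alpha-function EPSP with a Gaussian "spike" are all assumptions made for illustration. The idea is only that an EPSP shows a single maximum in dV/dt, whereas a small spike riding on an EPSP adds a second, sharper one.

```python
import math

def phase_plot(voltage_mv, dt_ms):
    """Return (V, dV/dt) pairs; dV/dt is a central difference in mV/ms."""
    return [
        (voltage_mv[i], (voltage_mv[i + 1] - voltage_mv[i - 1]) / (2.0 * dt_ms))
        for i in range(1, len(voltage_mv) - 1)
    ]

def count_rising_phases(voltage_mv, dt_ms, min_slope=1.0):
    """Count local maxima of dV/dt that exceed min_slope (mV/ms).

    One maximum is consistent with a plain EPSP; two suggest a second,
    sharper rising phase. min_slope is an arbitrary, hypothetical noise floor.
    """
    slopes = [dvdt for _, dvdt in phase_plot(voltage_mv, dt_ms)]
    count = 0
    for i in range(1, len(slopes) - 1):
        is_peak = slopes[i] > slopes[i - 1] and slopes[i] >= slopes[i + 1]
        if is_peak and slopes[i] > min_slope:
            count += 1
    return count

def synthetic_event(with_spike, dt_ms=0.01, n=400, onset=100):
    """Toy trace (not real data): alpha-function EPSP, optional Gaussian bump."""
    trace = []
    for i in range(n):
        t = (i - onset) * dt_ms  # time since event onset, in ms
        v = 0.0
        if t > 0:
            v = 5.0 * (t / 0.5) * math.exp(1.0 - t / 0.5)  # 5 mV EPSP, 0.5 ms to peak
            if with_spike:
                # narrow bump standing in for a small spike on the rising phase
                v += 4.0 * math.exp(-0.5 * ((t - 0.4) / 0.05) ** 2)
        trace.append(v)
    return trace

epsp_only = synthetic_event(with_spike=False)
epsp_plus_spike = synthetic_event(with_spike=True)
print(count_rising_phases(epsp_only, 0.01), count_rising_phases(epsp_plus_spike, 0.01))
```

      Recorded traces would need filtering and a data-driven noise floor before such a count is meaningful, and ambiguous events would still call for visual inspection of the phase plot.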

      (2) For this and later analysis, a voltage clamp of synaptic inputs would have been a simple alternative to avoid contaminating spikes or shunts by background or voltage-gated conductances. Yet only the current clamp was employed. I can understand that the authors might feel that the voltage clamp is 'flawed' because of the failure to clamp dendrites. But that may have been a good price to pay in this case. The authors should have at least justified their choice of method and detailed its caveats.

      We agree that data collected using voltage-clamp would have eliminated the confound of short action potentials and avoided the influence of voltage-gated conductances. The large-diameter and comparatively simple dendritic trees of octopus cells make them good morphological candidates for reliable voltage clamp. However, as suggested, we were concerned that the abundance of channels open at the neuron’s resting potential would make it difficult to sufficiently clamp dendrites. Ultimately, given the low input resistances of octopus cells and the fast kinetics of excitatory inputs, we determined that poor voltage-clamp conditions were likely to result in unclamped synaptic events with unpredicted distortions in kinetics and attenuation (To et al. 2022; PMID: 34480986; DOI: 10.1016/j.neuroscience.2021.08.024). We therefore chose to focus our efforts on current-clamp.

      Beyond the limits of both current-clamp and voltage-clamp, we chose to leave all conductances that influence EPSP dendritic propagation intact because our model demonstrates that active Kv and leak conductances shape and attenuate synaptic inputs as they travel through the dendritic tree (Supp. Fig. 4F-G). The addition of voltage-clamp recordings would not impact the conclusions we make about EPSP summation at the soma. Future studies will need to focus on a dendrite-centric view of local excitatory and inhibitory summation. For dendrite-centric experiments, dendritic voltage-clamp recordings are well suited to answer that set of questions.

      (3) The modeling raised several concerns. First, there is little presentation of assumptions, and of course, a model is entirely about its assumptions. For example, what excitatory conductance amplitudes were used? The same for inhibitory conductance? How were these values arrived at? The authors note that EPSGs and IPSGs had peaks at 0.3 and 3 ms. On what basis were these numbers obtained? The model's conclusions entirely depend on these values, and no measurements were made here that could have provided them. Parenthetical reference is made to Figure S5 where a range of values are tested, but with little explanation or justification.

      We apologize for not providing this information. We used our octopus neuron model to fit both EPSP and IPSP parameters to match experimental data. We have expanded the methods to include final values for the conductances (lines 649-651), which were adjusted to match experimental values seen in current-clamp recordings. We have also expanded the results section to describe each of the parameters we tuned (lines 203-222). An example of these adjustments is illustrated in Fig. 4F, where the magnitude of inhibitory potentials at different conductances (100nS and 1nS) was compared to experimental data over a range of octopus cell input resistance conditions. Kinetic parameters were determined by aligning modeled PSPs to the rise times and full width at half maximum (FWHM) measurements from experiments under control and Kv block conditions. The experimental data for EPSPs and IPSPs that were used to fit the model are shown in Author response image 2 below.

      Author response image 2.
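
      The two kinetic measurements named above, rise time and FWHM, can be sketched as follows. This is an illustrative sketch, not the authors' fitting code: the 10-90% definition of rise time, the linear interpolation between samples, and the function names are assumptions, and the trace is assumed to be baseline-subtracted with the PSP pointing upward (an IPSP would be sign-flipped first).

```python
def _rising_cross(trace, dt_ms, level, peak_idx):
    """Interpolated time of the first upward crossing of `level` before the peak."""
    for i in range(peak_idx):
        lo, hi = trace[i], trace[i + 1]
        if hi > lo and lo <= level <= hi:
            return (i + (level - lo) / (hi - lo)) * dt_ms
    raise ValueError("level not crossed on the rising side")

def _falling_cross(trace, dt_ms, level, peak_idx):
    """Interpolated time of the first downward crossing of `level` after the peak."""
    for i in range(peak_idx, len(trace) - 1):
        hi, lo = trace[i], trace[i + 1]
        if hi > lo and lo <= level <= hi:
            return (i + (hi - level) / (hi - lo)) * dt_ms
    raise ValueError("level not crossed on the falling side")

def psp_kinetics(trace, dt_ms):
    """Return (rise_time_ms, fwhm_ms) for a baseline-subtracted, upward PSP.

    Rise time is taken as 10-90% of peak (an assumed convention);
    FWHM is the width of the event at half of its peak amplitude.
    """
    peak_idx = max(range(len(trace)), key=lambda i: trace[i])
    peak = trace[peak_idx]
    rise = (_rising_cross(trace, dt_ms, 0.9 * peak, peak_idx)
            - _rising_cross(trace, dt_ms, 0.1 * peak, peak_idx))
    fwhm = (_falling_cross(trace, dt_ms, 0.5 * peak, peak_idx)
            - _rising_cross(trace, dt_ms, 0.5 * peak, peak_idx))
    return rise, fwhm
```

      Computing the same two numbers for a modeled and a recorded PSP gives a simple target for tuning kinetic parameters, which is the kind of alignment described in the response.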

      (4) In experiments that combined E and I stimulation, what exactly were time courses of the conductance changes, and how 'synchronous' were they, given the different methods to evoke them? (had the authors done voltage clamp they would know the answers).

      We chose to focus data collection on voltage changes at the soma under physiological conditions to better understand how excitation and inhibition integrate at the somatic compartment. Our conclusions in the combined E and I stimulation experiments require the resting membrane properties of octopus cells to be intact to make physiologically-relevant conclusions. Our current-clamp data includes the critical impact of leak, Kv, and HCN conductances on this computation. Reliable voltage-clamp would necessitate the removal of the Kv and HCN conductances that shape PSP magnitude, shape, and speed. Because it was not necessary to measure the conductances and kinetics of specific channels, we chose to use current-clamp.

      Evoked IPSPs and EPSPs had cell-to-cell variability in their latencies to onset. Somatically-recorded optically-evoked inhibition under pharmacological conditions that changed cable properties had onset latencies between 2.5 and 4.3ms; electrically-evoked excitation under control conditions had latencies between 0.8 and 1.4ms. To overcome cell-to-cell timing variabilities, we presented a shuffled set of stimulation pairings that had a 3ms range of timings with 200µs intervals. The impact on EPSP magnitude and timing is greatest when the evoked excitation and inhibition are most ‘synchronous’. The data presented in this paper are for the stimulation pairings that evoked a maximal shift in EPSP timing. On average, this occurred when the optical stimulation began ~1.2ms before electrical stimulation. Stimulation pairing times ranged between a 0ms offset and a 1.8ms offset at the extremes. An example of the shuffled stimulation pairings is shown in Author response image 3 below, and we have included information about the shuffled stimulus in the methods (lines 627-630).

      Author response image 3.
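
      A shuffled pairing schedule of this kind can be sketched as below. The 3ms span and 200µs step come from the response; everything else is an assumption for illustration, including the function names, the inclusive end point, the starting offset of the window, and the convention that an offset is the lead of optical over electrical stimulus onset.

```python
import random

def shuffled_offsets_us(start_us, span_us=3000, step_us=200, seed=None):
    """Return every optical-to-electrical offset in the window, in random order.

    Offsets are integers in microseconds to avoid floating-point drift.
    start_us (where the window begins) is a hypothetical parameter; the
    end point is assumed to be inclusive.
    """
    offsets = list(range(start_us, start_us + span_us + 1, step_us))
    random.Random(seed).shuffle(offsets)
    return offsets

def offset_with_max_shift(shift_by_offset):
    """Pick the offset whose measured EPSP timing shift is largest in magnitude."""
    return max(shift_by_offset, key=lambda offset: abs(shift_by_offset[offset]))

# Example: 16 pairings spanning 0 to 3000 µs, presented in shuffled order.
schedule = shuffled_offsets_us(start_us=0, seed=1)
print(len(schedule), sorted(schedule)[:3])
```

      Sweeping all offsets for every cell, rather than fixing one offset, is what lets the analysis select the pairing with the maximal shift despite cell-to-cell differences in onset latency.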

      (5) Figure 4G is confusing to me. Its point, according to the text, is to show that changes in membrane properties induced by a block of Kv and HCN channels would not be expected to alter the amplitudes of EPSCs and IPSCs across the dendritic expanse. Now we are talking about currents (not shunting effects), and the presumption is that the blockers would alter the resting potential and thus the driving force for the currents. But what was the measured membrane potential change in the blockers? Surely that was documented. To me, the bigger concern (stated in the text) is whether the blockers altered exocytosis, and thus the increase in IPSP amplitude in blockers is due BOTH to loss of shunting and increase in presynaptic spike width. Added to this is that 4AP will reduce the spike threshold, thus allowing more ChR2-expressing axons to reach the threshold. Figure 4G does not address this point.

      These are valuable points that motivated us to improve the clarity of this figure and the corresponding text. We discussed two separate points in this paragraph and were not clear. Our intention with Figure 4G was to address concerns that using pharmacological blockers changes driving forces and may confound the measured change in magnitude of postsynaptic potentials. Membrane potentials hyperpolarized by approximately 8-10 mV after application of blockers. We corrected for this effect by adding a holding current to depolarize the neuron to its baseline resting potential. Text in the results (lines 187-190) and figure legends have been changed to clarify these points.

      We also removed any discussion of presynaptic effects from this portion of the text because our description was incomplete and we did not directly collect data related to these claims. We originally wrote, “While blocking Kv and HCN allowed us to reveal IPSPs at the soma, 4-AP increases the duration of the already unphysiological ChR2-evoked presynaptic action potential (Jackman et al., 2014; DOI: 10.1523/jneurosci.4694-13.2014), resulting in altered release probabilities and synaptic properties, amongst other caveats (Mathie et al., 1998; DOI: 10.1016/S0306-3623(97)00034-7)”. Ultimately, effects on exocytosis, presynaptic excitability, or release probability are only relevant for the experiments presented in Figure 4. Figure 4 serves as evidence that synaptic release of glycine elicits strychnine-sensitive inhibitory postsynaptic potentials in octopus cells. Concerns of presynaptic effects do not carry over to the data presented in Figure 5, as Kv and HCN were not blocked in these experiments. Therefore, we have removed this portion of the text.

      (6) Figure 5F is striking as the key piece of biological data that shows that inhibition does reduce the amplitude of "EPSPs" in octopus cells. Given the other uncertainties mentioned, I wondered if it makes sense as an example of shunting inhibition. Specifically, what are the relative synaptic conductances, and would you predict a 25% reduction given the actual (not modeled) values?

      We agree that both shunting and hyperpolarizing inhibition could play a role in the measured EPSP changes. Because we focused data collection on voltage changes at the soma under physiological conditions, we cannot calculate the relative synaptic conductances. Together, our experimental current-clamp results paired with estimates from the model provide compelling evidence for the change we observe in EPSPs. The relative weighting of the synaptic conductances is a very interesting question, but this information is not necessary to answer the questions posed in this study, namely the impact of dendritic inhibition on the arrival of EPSPs in the soma.

      (7) Some of the supplemental figures, like 4 and 5, are hardly mentioned. Few will glean anything from them unless the authors direct attention to them and explain them better. In general, the readers would benefit from more complete explanations of what was done.

      We apologize for not fully discussing these figures in the results text. We have fully expanded the results section to detail the experiments and results presented in the supplement (lines 203-238).

      Reviewer #2 (Public Review):

      Summary:

      Kreeger et al. provided mechanistic evidence for flexible coincidence detection of auditory nerve synaptic inputs by octopus cells in the mouse cochlear nucleus. The octopus cells are specialized neurons that can fire repetitively at very high rates (> 800 Hz in vivo), yield responses dominated by the onset of sound for simple stimuli, and integrate auditory nerve inputs over a wide frequency span. Previously, it was thought that octopus cells received little inhibitory input, and their integration of auditory input depended principally on temporally precise coincidence detection of excitatory auditory nerve inputs, coupled with a low input resistance established by high levels of expression of certain potassium channels and hyperpolarization-activated channels.

      In this study, the authors used a combination of numerous genetic mouse models to characterize synaptic inputs and enable optogenetic stimulation of subsets of afferents, fluorescent microscopy, detailed reconstructions of the location of inhibitory synapses on the soma and dendrites of octopus cells, and computational modeling, to explore the importance of inhibitory inputs to the cells. They determined through assessment of excitatory and inhibitory synaptic densities that spiral ganglion neuron synapses are densest on the soma and proximal dendrite, while glycinergic inhibitory synaptic density is greater on the dendrites compared to the soma of octopus cells. Using different genetic lines, the authors further elucidated that the majority of excitatory synapses on the octopus cells are from type 1a spiral ganglion neurons, which have low response thresholds and high rates of spontaneous activity. In the second half of the paper, the authors employed electrophysiology to uncover the physiological response of octopus cells to excitatory and inhibitory inputs. Using a combination of pharmacological blockers, in vitro cellular recordings, and computational modeling, the authors conclude that glycine in fact evokes IPSPs in octopus cells; these IPSPs are largely shunted by the high membrane conductance of the cells under normal conditions and thus were not clearly evident in prior studies. Pharmacological experiments point towards a specific glycine receptor subunit composition. Lastly, Kreeger et al. demonstrated with in vitro recordings and computational modeling that octopus cell inhibition modulates the amplitude and timing of dendritic spiral ganglion inputs to octopus cells, allowing for flexible coincidence detection.

      Strengths:

      The work combines a number of approaches and complementary observations to characterize the spatial patterns of excitatory and inhibitory synaptic input, and the type of auditory nerve input to the octopus cells. The combination of multiple mouse lines enables a better understanding of, and helps to define, the pattern of synaptic convergence onto these cells. The electrophysiology provides excellent functional evidence for the presence of the inhibitory inputs, and the modeling helps to interpret the likely functional role of inhibition. The work is technically well done and adds an interesting dimension related to the processing of sound by these neurons. The paper is overall well written, and the experimental tests are well-motivated and easy to follow. The discussion is reasonable and touches on both the potential implications of the work as well as some caveats.

      Weaknesses:

      While the conclusions presented by the authors are solid, a prominent question remains regarding the source of the glycinergic input onto octopus cells. In the discussion, the authors claim that there is no evidence for D-stellate, L-stellate, and tuberculoventral cell (all local inhibitory neurons of the ventral and dorsal cochlear nucleus) connections to octopus cells, and cite the relevant literature. An experimental approach will be necessary to properly rule out (or rule in) these cell types and others that may arise from other auditory brainstem nuclei. Understanding which cells provide the inhibitory input will be an essential step in clarifying its roles in the processing of sound by octopus cells.

      We are glad that the reviewer agrees with the conclusions we have made and is interested in learning more about how these findings impact sound processing. We agree that defining the source of inhibition will dramatically shape our understanding of the computation octopus cells are making. However, this is not an easy task, given the small size of the octopus cell area, and will involve considerable additional work. Since the overall findings do not depend on knowing the source of inhibition, we have instead re-written the discussion to clarify the lack of evidence for intrinsic inhibitory inputs to octopus cells, in addition to presenting likely candidates. As genetic profiles of cochlear nucleus and other auditory brainstem neurons become available, we intend to make and utilize genetic mouse models to answer questions like this.

      The authors showed that type 1a SGNs are the most abundant inputs to octopus cells via microscopy. However, in Figure 3 they compare optical stimulation of all classes of ANFs, then compare this against stimulation of type 1b/c ANFs. While a difference in the paired-pulse ratio (and therefore, likely release probability) can be inferred by the difference between Foxg1-ChR2 and Ntng1-ChR2, it would have been preferable to have specific data with selective stimulation of type 1a neurons.

      We agree that complete genetic access to only the type 1a population would have been the preferable approach, but we did not have an appropriate line when beginning these experiments. Because our results did not suggest a meaningful difference between the populations, we did not pursue further investigation once a line was available.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Besides the points mentioned in the main review:

      Minor

      (1) I really like the graphics and the immunohistological presentation.

      (2) Lines 316-319 say that octopus cells lack things like back-propagating spikes and dendritic Ca spikes. How do you know this?

      This statement was intended to be a summary of suggestions from the literature and lacked references and context as written. We have rewritten this section and clarified that our hypothesis was formed from data found in the literature (lines 334-337).

      (3) Spectrograms of Figure 6A...where were these data obtained?

      We recorded and visualized human-generated rhythmic tapping and high-frequency squeaking sounds using Audacity. The visualizations of rhythmic tapping and imitated vocalizations are meant to show two different types of multi-frequency stimuli we hypothesize would result in somatic summation within an octopus cell’s spike integration window, despite differences in timing. We rewrote the figure legend to explain more clearly what is shown and how it relates to the model in Figure 6.

      (4) 'on-path' and 'off-path' seem like jargon that may not be clear to the average reader.

      Thank you for pointing out our use of unapproachable jargon. We have replaced these terms in the figure with “proximal” and “distal” inhibition. In the main text, we now describe on-path and off-path together as the effect of the location of dendritic inhibition on somatically recorded EPSPs.

      (5) The paper could benefit from a table of modeled values.

      We have added specific details about the modelling in the text and clarified which modeled values were referenced from previous computational models and which were tuned to fit experimental data. Since most values were taken from a referenced publication, we did not add a table and instead point readers towards that source.

      (6) Figure S4A-C what currents were delivered to the modeled cells?

      The model cells were injected with a -0.8 nA DC current for 300 ms in current clamp mode. This information has been added to the figure legend.

      (7) In that figure "scaling factors" scale exactly which channels?

      The scaling factor is applied to the low-voltage activated K<sup>+</sup> (ḡ<sub>KLT</sub>), high threshold K<sup>+</sup> (ḡ<sub>KHT</sub>), fast transient K<sup>+</sup> (ḡ<sub>KA</sub>), and hyperpolarization-activated cyclic nucleotide-gated HCN (ḡ<sub>h</sub>) conductances, but not to the fast Na<sup>+</sup> (ḡ<sub>Na</sub>) or leak K<sup>+</sup> (ḡ<sub>leak</sub>) conductances. This information has been added to the text (lines 205-208 and 646-653).
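
      The selective scaling described here can be sketched in a few lines. This is a minimal illustration, not the authors' model code: the dictionary keys, the function name, and the numerical values are hypothetical placeholders, and only the split between scaled and unscaled conductances is taken from the response.

```python
# Conductances the response lists as scaled (Kv and HCN) and as left unchanged.
SCALED = ("g_KLT", "g_KHT", "g_KA", "g_h")
UNSCALED = ("g_Na", "g_leak")

def scale_conductances(conductances, factor):
    """Return a new dict with the Kv and HCN maximal conductances multiplied by
    `factor`; Na+ and leak conductances are copied through unchanged."""
    return {
        name: (g * factor if name in SCALED else g)
        for name, g in conductances.items()
    }

# Placeholder values in arbitrary units; these are NOT the model's parameters.
baseline = {"g_KLT": 2.0, "g_KHT": 1.0, "g_KA": 1.0, "g_h": 1.0,
            "g_Na": 3.0, "g_leak": 0.5}
reduced = scale_conductances(baseline, 0.5)  # e.g. mimic a partial Kv/HCN block
print(reduced)
```

      Returning a new dictionary keeps the baseline parameter set intact, so control and reduced-conductance conditions can be simulated from the same starting values.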

      (8) In performing and modeling Kv/HCN block, do you know how complete the level of the block is?

      Since we cannot assess how complete the level of block is, we have changed the language in the text to clarify that we are reducing Kv and HCN channel conductance to the degree needed to increase resistance of the neuron (line 185).

      (9) More on this Figure S4. It is hardly referred to in the text except to say that it supports that blocking the Kv/HCN channels will enhance the IPSP. Given how large the figure is, can you offer more of a conclusion than that? Also, in the synaptic model in that figure, the IPSCs are presumably happening in current-clamp conditions, and the reduction in amplitude of the IPSC (as opposed to the increase in IPSP) is due to hyperpolarization. Can you simply state that so readers can track what this figure is showing? Other similar things: what is a transfer impedance? How is it measured? What do we take from the analysis?

      We have elaborated on our description of both Supp. Fig. 4 and Supp. Fig. 5 in the results section of the text (lines 203-238).

      (10) Figure S5 also needs a better explanation. E.g., in C-D, what does 'average' mean? The gray is an SD of this average? You modeled a range of values...but which ones are physiological? To me, this is a key point.

      We have elaborated on our description of both Supp. Fig. 4 and Supp. Fig. 5 in the results section of the text (lines 203-238).

      Reviewer #2 (Recommendations For The Authors):

      General:

      The images and 3-D reconstructions are visually stunning, but they are not colorblind-friendly and in some cases, hard to distinguish. This shows up particularly in the green and blue colors used in Figure 1. Also, better representative images could be used for Figure 1B.

      Thank you for pointing out that blue and green were difficult to distinguish in Figure 1H. We have outlined the green inhibitory puncta in this image to make them more distinguishable. We have also increased the resolution of the image in Figure 1B for better clarity. All other colors are selected from Wong, 2011 (PMID: 21850730; DOI: https://doi.org/10.1038/nmeth.1618).

      Supplemental Figure 1D: The low-power view is good to have, but the CN is too small and the image appears a bit noisy. An inset showing the CN on a larger scale (higher resolution image?) would be more convincing. In this image, I see what appear to be cells in the DCN labeled, which calls into question the purity of the source of optogenetic synaptic activation. It is also difficult to tell whether there are other cells labeled in the VCN. Such inputs would still be minor, but it would be good to be very clear about the expression pattern.

      To offer more information about the activity of the Ntng1<sup>Cre</sup> line in other regions of the auditory system, we increased the resolution of the image included in Supp. Fig. 1D and have also included an additional image (Supp. Fig. 1E) of a coronal section of the cochlear nucleus complex with Ntng1-tdT labelling. This image provides additional context for the cells labeled in the DCN. The text in the figure legend has been changed to clarify that some cells in the DCN were labeled (lines 118-120).

      We agree that in the Ntng1<sup>Cre</sup> experiments, there is the possibility of minor contamination from excitatory cells that express ChR2 outside of the spiral ganglion. This is also true for our Foxg1<sup>Cre</sup> and Foxg1<sup>Flp</sup> experiments, because these lines label cortical cells in addition to cochlear cells. However, we do not observe direct descending inputs from the cortex into the PVCN, making contamination from other Foxg1<sup>Cre</sup>-positive neurons unlikely. While non-cochlear inputs from the Ntng1<sup>Cre</sup> line are possible, evidence from both lines gives us confidence that we are not capturing inputs to octopus cells from outside the cochlea. Central axons from Type I spiral ganglion neurons have VGLUT1+ synaptic terminals. When comparing the overlap between VGLUT1+ terminals and Foxg1-tdT labelling, we see full coverage. That is, all VGLUT1+ terminals on octopus cells are co-labelled by Foxg1<sup>Cre</sup>-mediated expression of tdTomato. An example image is shown below. Here, an octopus cell soma is labeled with blue fluorescent Nissl stain and inputs to the cochlear nucleus complex are labeled with Foxg1<sup>Cre</sup>-dependent tdTomato (Foxg1-tdT; magenta). We have also immunolabeled for VGLUT1 puncta in green. This eliminates the possibility that VGLUT1+ cells from outside the cochlea and cortex are sources of excitation to octopus cells.

      Author response image 4.

      Further, we have looked at expression of Ntng1-tdT and Foxg1-EYFP together in the octopus cell area. An example image is shown below. All Ntng1-tdT+ fibers (magenta) are also Foxg1-EYFP+ (green), suggesting that all Ntng1<sup>Cre</sup>-targeted inputs to octopus cells are a part of the Foxg1<sup>Cre</sup>-targeted input population, which are very likely to only be from the cochlea. We have expanded the results section to include information about the overlap in expression driven by the Ntng1<sup>Cre</sup> and Foxg1<sup>Flp</sup> lines.

      Author response image 5.

      Supplemental Figure 2 G: These are a bit hard to read. Perhaps use a different image, or provide a reference outline drawing telling us what is what.

      We have used a different image with a Thy1-YFP labeled octopus cell for clarity.

      In some places, the term "SGN" is used when referencing the axons and terminals within the CN, and without some context, this was occasionally confusing (SGN would seem to refer to the cell bodies). In some places in the text, it may be preferable to separate SGN, auditory nerve fibers (ANFs), and terminals, as entities for clarity.

      In order to make the study accessible to a broad neuroscience audience, we refer to the neurons of the spiral ganglion and their central axon projections using one name. We understand why, for those well acquainted with the auditory periphery, condensing terminology may feel awkward. However, for those readers unfamiliar with the anatomy of the cochlea and auditory nerve, we feel that the use of “SGN central axon” makes it clear that the “auditory nerve fibers” come from neurons in the spiral ganglion. This is clarified in the first paragraph of the introduction (lines 29-31) and in the methods (line 533).

      Specific: Numbers refer to the line numbers on the manuscript.

      L29-31: Cochlear nucleus neurons are more general in their responses than this sentence indicates. While we can all agree that they are specialized to carry (or improve upon) the representation of these specific features of sound, they also respond more generally to sounds that might not have specific information in any of these domains. They are not silos of neural computation, and their outputs become mixed and "re-represented" well before they reach the auditory cortex. Octopus cells are no exception to this. I suggest striking most of the first paragraph, and instead using the first sentence to lead into the second paragraph, and putting the last sentence (of the current first paragraph) at the end of the second (now first) paragraph.

      We agree with this assessment and have made major changes to the introduction in line with these suggestions.

      L33-46: A number of points in this paragraph need references (exp. line 41).

      We agree and have added references accordingly.

      L43: Not sure what is meant by "fire at the onset of the sound, breaking it up into its frequency components"?

      We changed this text as part of a major reworking of the introduction.

      L47-66: Again more citations are needed (at the end of sentence at line 55, probably moving some of the citations from the next sentence up).

      We agree and have added references accordingly.

      L51: The consistent orientation of octopus cell dendrites across the ANFs has been claimed in the literature (as mentioned here), but there are some (perhaps problematic - plane of sectioning?) counterexamples from the older Golgi-stained images, and even amongst intracellularly stained cells (for example see Reccio-Spinoza and Rhode, 2020). This is important with regards to the broader hypothesis regarding traveling-wave compensation (e.g., McGinley et al; but also many others); if the cells are not all in the appropriate orientation then such compensation may be problematic. Likewise, the data from Lu et al., 2022, points towards a range of sensitivity to frequency-swept stimuli, some of which work in opposition to the traveling wave compensation hypothesis. It would seem that with the Thy1 mice, you have an opportunity to clarify the orientation. Figures 1A and 2A show a consistent dendritic orientation, assuming that these drawings are reconstructions of the cells as they were actually oriented in the tissue. Can you either comment on this or provide clearer evidence?

      We are happy to offer more information about the appearance of octopus cells in our preparations. In our hands, sparsely labeled octopus cells in Thy1-YFP-H mice show consistent dendritic orientation when visualized in a 15-degree parasagittal plane, with the most diversity apparent in cells with somas located more dorsally in the octopus cell area. We hypothesize that this is due to the limited area through which the central projections of spiral ganglion neurons (i.e. ANFs) must pass before they enter the dorsal cochlear nucleus and continue their tonotopic organization in that area.

      A caveat to studies without physiological or genetic identification of octopus cells is the assumption that all neurons in the octopus cell area are octopus cells. We find, especially along the borders of the octopus cell area, that stellate cells can be seen amongst octopus cells. Because stellate cell dendrites are not oriented like octopus cell dendrites, any stellate cells misidentified as octopus cells would appear to have poorly-oriented dendrites. This may explain why some studies report this finding. In addition, it can be difficult to assess tonotopic organization because of the 3D trajectory of tightly bundled axons, which cannot be captured by a single section plane. Although a parasagittal plane of sectioning captures the tonotopic axis in one part of the octopus cell area, that same plane may be perpendicular at the opposing end.

      L67: canonical -> exceptional.

      Thank you for the suggestion. We have made this change in the introduction.

      L127: This paragraph was confusing on first reading. I don't think Supplemental Figure 1D shows the restricted pattern of expression very clearly. The "restricted to SGNs" might be better as "restricted to auditory nerve fibers" (except in the DCN, where there seem to be some scattered small cells?). A higher magnification image of the CN, but lower magnification than in panel E, would be helpful here.

      To avoid confusion, we have re-written this paragraph (lines 117-127) and included a higher magnification image of the CN in a revised Supp. Fig. 1.

      L168: Here, perhaps say ANFs instead of SGNs.

      As above, we have decided to describe ANFs as SGN central axons to make the anatomy more accessible to people unfamiliar with cochlear anatomy.

      L201-204: The IPSPs are surprisingly slow (Figures 5B, C), especially given the speed of the EPSPs/EPSCs in these cells. This is reminiscent of the asymmetry between EPSC and IPSC kinetics in bushy cells (Xie and Manis, 2014). The kinetics used in the model (3 ms; mentioned on line 624) however seem a bit arbitrary and no data is provided for the selection of that value. Were there any direct measurements of the IPSC kinetics (all of the traces in the paper are in the current clamp) that were used to justify this value?

      The kinetics of the somatically-recorded IPSPs are subject to the effects of our pharmacological manipulations. EPSPs measured at the soma under control conditions are small in amplitude and rapid. With pharmacological reduction of HCN and Kv channels, EPSPs are larger and slower (please see figure in response to a similar question posed by Reviewer #1). We expect that this change also occurs with the IPSP kinetics under pharmacological conditions. Our justification of the kinetics has been expanded in the methods section (lines 641-661).

      L594: Technically, this is a -11 mV junction potential, but thanks for including the information.

      We have corrected this in the text (line 618). Thank you for the close reading of all experimental and methodological details.

      L595: The estimated power of the LED illumination at the focal plane should be measured and indicated here.

      We measured the power of the LED illumination at the focal plane using a PM100D Compact Power and Energy Meter Console (Thorlabs), a S120C Photodiode Power Sensor (Thorlabs), and a 1000 µm diameter Circular Precision Pinhole (Thorlabs). Light intensity at the focal plane ranged between 1.9 and 4.1 mW/mm², corresponding to 6% and 10% intensity on the Colibri5 system. We have reported these measurements in the results section (Lines 621-622).

      L609: One concern about the model is that the integration time of 25 microseconds is rather close to the relative shifts in latency. While I doubt it will make a difference (except in the number), it may be worth verifying (spot checks, at least) that running the model with a 5 or 10-microsecond step yields a similar pattern of latency shifts (e.g., Supplementary Figure 5, Figure 5).

      Also, it is not clear what temperature the model was executed at (I would presume 35C); this needs to be given, and channel Q10's listed.

      We realize that additional information is needed to fully understand the model and have added this to the results and the methods. The synaptic mechanism (.mod) files were obtained from Manis and Campagnola (2018) (PMID: 29331233; DOI: https://doi.org/10.1016/j.heares.2017.12.017). Q10 (3) and temperature (22°C) were also matched to parameters from Manis and Campagnola (2018). Because temperature is a critical factor for channel kinetics, we verified that our primary results remain consistent under conditions using a temperature of 35°C and a time step of 5µs, depicted below. Panel A illustrates the increase in IPSP as a function of glycine conductance under Kv+HCN block conditions at 35°C. As at 22°C, an increase in IPSP magnitude is absent in the control condition at 35°C. Panels B and C provide a direct comparison between the initial (i.e. 22°C) and suggested (i.e. 35°C) simulation conditions. Again we found that temperature does not have a major impact on the amplitude of IPSPs. Thus, results at 35°C do not change the conclusions we make from the model.

      Author response image 6.
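      As a rough illustration of the temperature dependence discussed in the response above, the standard Q10 relation gives the factor by which a rate constant speeds up between two temperatures. The sketch below only plugs in the Q10 (3) and the two temperatures (22°C and 35°C) stated in the response; the function name is hypothetical and this is not the code used in the model.

```python
def q10_rate_factor(q10: float, temp_c: float, ref_temp_c: float) -> float:
    """Multiplicative change of a rate constant between ref_temp_c and temp_c.

    Uses the standard relation factor = Q10 ** ((T - T_ref) / 10).
    """
    return q10 ** ((temp_c - ref_temp_c) / 10.0)


# Values taken from the response: Q10 = 3, reference 22 C, comparison 35 C.
factor = q10_rate_factor(q10=3.0, temp_c=35.0, ref_temp_c=22.0)
print(f"Rate constants scale by a factor of {factor:.2f} from 22 C to 35 C")
```

      A time constant measured at the reference temperature would be divided by this factor to estimate its value at the higher temperature.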

      The nominal conductance densities should at least be provided in a table (supplemental, in addition to including them in the deposited code). The method for "optimization" of the conductance densities to match the experimental recordings needs to be described; the parameter space can be quite large in a model such as this. The McGinley reference needs a number.

      We added a more thorough description of modeling parameters and justification of choices in the methods section of the text (lines 641-661). We have also added a reference number to the McGinley 2012 reference in the text.

      I think this is required by the journal:

      The model code, test results, and simulation results should be deposited in a public resource (Github would be preferable, but dryad, Zenodo, or Figshare could work), and the URL/doi for the resource provided in the manuscript. This includes the morphology swc/hoc file. The code should be in a form, and with a description, that readily allows an interested party with appropriate skills to download it and run it to generate the figures.

      We will upload the code and all associated simulation files to the ModelDB repository upon publication.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Response to Reviewer #1:

      Thank you for the careful reading and the positive evaluation of our manuscript. As you mentioned, the present study tried to address the question of how the lost genomic functions could be compensated by evolutionary adaptation, indicating the potential mechanism of "constructive" rather than "destructive" evolution. Thank you for the instructive comments that helped us to improve the manuscript. We sincerely hope the revised manuscript and the following point-by-point response address your concerns.

      • Line 80 "Growth Fitness" is this growth rate?

      Yes. The sentence was revised as follows.

      (L87-88) “The results demonstrated that most evolved populations (Evos) showed improved growth rates, in which eight out of nine Evos were highly significant (Fig. 1B, upper).”

      • Line 94 a more nuanced understanding of r/K selection theory, allows for trade-ups between R and K, as well as trade-offs. This may explain why you did not see a trade-off between growth and carrying capacity in this study. See this paper https://doi.org/10.1038/s41396-023-01543-5. Overall, your evos lineages evolved higher growth rates and lower carrying capacity (Figures 1B, C, E). If selection was driving the evolution of higher growth rates, it may have been that there was no selective pressure to maintain high carrying capacity. This means that the evolutionary change you observed in carrying capacity may have been neutral "drift" of the carrying capacity trait, during selection for growth rate, not because of a trade-off between R and K. This is especially likely since carrying capacity declined during evolution. Unless the authors have convincing evidence for a tradeoff, I suggest they remove this claim.

      • Line 96 the authors introduce a previous result where they use colony size to measure growth rate, this finding needs to be properly introduced and explained so that we can understand the context of the conclusion.

      • Line 97 This sentence "the collapse of the trade-off law likely resulted from genome reduction." I am not sure how the authors can draw this conclusion, what is the evidence supporting that the genome size reduction causes the breakdown of the tradeoff between R and K (if there was a tradeoff)?

      Thank you for the reference information and the thoughtful comments. The recommended paper was newly cited, and the description of the trade-off collapse was deleted. Accordingly, the corresponding paragraph was rewritten as follows.

      (L100-115) “Intriguingly, a positive correlation was observed between the growth fitness and the carrying capacity of the Evos (Fig. 1D). It was somehow consistent with the positive correlations between the colony growth rate and the colony size of a genome-reduced strain 11 and between the growth rates and the saturated population size of an assortment of genome reduced strains 13. Nevertheless, the negative correlation between growth rate and carrying capacity, known as the r/K selection30,31 was often observed as the trade-off relationship between r and K in the evolution and ecology studies 32 33,34. As the r/K trade-off was proposed to balance the cellular metabolism that resulted from the cost of enzymes involved 34, the deleted genes might play a role in maintaining the metabolism balance for the r/K correlation. On the other hand, the experimental evolution (i.e., serial transfer) was strictly performed within the exponential growth phase; thus, the evolutionary selection was supposed to be driven by the growth rate without selective pressure to maintain the carrying capacity. The declined carrying capacity might have been its neutral "drift" but not a trade-off to the growth rate. Independent and parallel experimental evolution of the reduced genomes selecting either r or K is required to clarify the actual mechanisms.”

      • Line 103 Genome mutations. The authors claim that there are no mutations in parallel but I see that there is a 1199 base pair deletion in eight of the nine evo strains (Table S3). I would like the author to mention this and I'm actually curious about why the authors don't consider this parallel evolution.

      Thank you for your careful reading. According to your comment, we added a brief description of the 1199-bp deletion detected in the Evos as follows.

      (L119-122) “The number of mutations largely varied among the nine Evos, from two to 13, and no common mutation was detected in all nine Evos (Table S3). A 1,199-bp deletion of insH was frequently found in the Evos (Table S3, highlighted), which well agreed with its function as a transposable sequence.”

      • Line 297 Please describe the media in full here - this is an important detail for the evolution experiment. Very frustrating to go to reference 13 and find another reference, but no details of the method. Looked online for the M63 growth media and the carbon source is not specified. This is critical for working out what selection pressures might have driven the genetic and transcriptional changes that you have measured. For example, the parallel genetic change in 8/9 populations is a deletion of insH and tdcD (according to Table S3). This is acetate kinase, essential for the final step in the overflow metabolism of glucose into acetate. If you have a very low glucose concentration, then it could be that there was selection to avoid fermentation and devote all the pyruvate that results from glycolysis into the TCA cycle (which is more efficient than fermentation in terms of ATP produced per pyruvate).

      Sorry for the missing information on the medium composition, which was additionally described in the Materials and Methods. The glucose concentration in M63 was 22 mM, which was supposed to be enough for bacterial growth. Thank you for your intriguing thinking about linking the medium component to the genome mutation-mediated metabolic changes. As there was no experimental result regarding the biological function of gene mutation in the present study, please allow us to address this issue in our future work.

      (L334-337) “In brief, the medium contains 62 mM dipotassium hydrogen phosphate, 39 mM potassium dihydrogen phosphate, 15 mM ammonium sulfate, 15 μM thiamine hydrochloride, 1.8 μM Iron (II) sulfate, 0.2 mM magnesium sulfate, and 22 mM glucose.”

      • Line 115. I do not understand this argument "They seemed highly related to essentiality, as 11 out of 49 mutated genes were essential (Table S3)." Is this a significant enrichment compared to the expectation, i.e. the number of essential genes in the genome? This enrichment needs to be tested with a Hypergeometric test or something similar.

      • Also, "As the essential genes were known to be more conserved than nonessential ones, the high frequency of the mutations fixed in the essential genes suggested the mutation in essentiality for fitness increase was the evolutionary strategy for reduced genome." I do not think that there is enough evidence to support this claim, and it should be removed.

      Sorry for the unclear description. Yes, the mutations were significantly enriched in the essential genes (11 out of 45 genes) compared to the essential genes in the whole genome (286 out of 3290 genes). The improper description linking the mutation in essential genes to the fitness increase was removed, and an additional explanation on the ratio of essential genes was newly supplied as follows.

      (L139-143) “The ratio of essential genes in the mutated genes was significantly higher than in the total genes (286 out of 3290 genes, Chi-square test p=0.008). As the essential genes were determined according to the growth35 and were known to be more conserved than nonessential ones 36,37, the high frequency of the mutations fixed in the essential genes was highly intriguing and reasonable.”
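      The reviewer suggested a hypergeometric test for this enrichment, whereas the revised text reports a Chi-square test. As a minimal sketch of the reviewer's suggestion, the upper tail of the hypergeometric distribution can be computed directly from the counts quoted above. The mutated-gene total of 49 is an assumption taken from the reviewer's comment (the response also mentions 45), and the function name is hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k: int, n_drawn: int, n_success: int, n_total: int) -> float:
    """P(X >= k) when n_drawn genes are sampled without replacement from
    n_total genes, of which n_success are essential."""
    upper = min(n_drawn, n_success)
    favourable = sum(
        comb(n_success, i) * comb(n_total - n_success, n_drawn - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(n_total, n_drawn)


# Counts quoted in the text: 286 essential genes among 3290 genes in total,
# and 11 essential genes among the mutated genes (49 assumed here).
p_enrichment = hypergeom_upper_tail(k=11, n_drawn=49, n_success=286, n_total=3290)
print(f"One-sided hypergeometric p-value: {p_enrichment:.4g}")
```

      This one-sided test asks only whether essential genes are over-represented, so its p-value is not expected to match the Chi-square value reported in the manuscript.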

      • Line 124 Regarding the mutation simulations, I do not understand how the observed data were compared to the simulated data, and how conclusions were drawn. Can the authors please explain the motivation for carrying out this analysis, and clearly explain the conclusions?

      Random simulation was additionally explained in the Materials and Methods and the conclusion of the random simulation was revised in the Results, as follows.

      (L392-401) “The mutation simulation was performed with Python in the following steps. A total of 65 mutations were randomly generated on the reduced genome, and the distances from the mutated genomic locations to the nearest genomic scars caused by genome reduction were calculated. Subsequently, Welch's t-test was performed to evaluate whether the distances calculated from the random mutations were significantly longer or shorter than those calculated from the mutations that occurred in Evos. The random simulation, distance calculation, and statistic test were performed 1,000 times, which resulted in 1,000 p values. Finally, the mean of p values (μp) was calculated, and a 95% reliable region was applied. It was used to evaluate whether the 65 mutations in the Evos were significantly close to the genomic scars, i.e., the locational bias.”

      (L148-157) “Random simulation was performed to verify whether there was any bias or hotspot in the genomic location for mutation accumulation due to the genome reduction. A total of 65 mutations were randomly generated on the reduced genome (Fig. 2B), and the genomic distances from the mutations to the nearest genome reduction-mediated scars were calculated. Welch's t-test was performed to evaluate whether the genomic distances calculated from random mutations significantly differed from those from the mutations accumulated in the Evos. As the mean of p values (1,000 times of random simulations) was insignificant (Fig. 2C, μp > 0.05), the mutations fixed on the reduced genome were neither closer nor farther to the genomic scars, indicating there was no locational bias for mutation accumulation caused by genome reduction.”
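      The simulation procedure described in the two quoted passages can be sketched as follows. This is an illustration under stated assumptions, not the authors' script: the genome length, the scar positions, and the placeholder "observed" positions are hypothetical, distances are measured around a circular chromosome, and the p-value uses a normal approximation to Welch's t distribution to stay within the standard library.

```python
import random
from statistics import NormalDist, mean, variance

GENOME_LENGTH = 4_000_000                      # hypothetical genome size (bp)
SCARS = [i * 400_000 for i in range(10)]       # hypothetical scar positions (bp)


def nearest_scar_distance(pos, scars, genome_length):
    """Shortest distance from pos to any scar on a circular chromosome."""
    return min(min(abs(pos - s), genome_length - abs(pos - s)) for s in scars)


def welch_p_value(a, b):
    """Two-sided p-value for Welch's t statistic (normal approximation)."""
    standard_error = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    t_stat = (mean(a) - mean(b)) / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))


def mean_p_value(observed_positions, scars, genome_length, n_rounds=1000, seed=0):
    """Mean p-value over n_rounds comparisons of observed versus random mutations."""
    rng = random.Random(seed)
    observed = [nearest_scar_distance(p, scars, genome_length) for p in observed_positions]
    p_values = []
    for _ in range(n_rounds):
        simulated = [
            nearest_scar_distance(rng.randrange(genome_length), scars, genome_length)
            for _ in observed_positions
        ]
        p_values.append(welch_p_value(observed, simulated))
    return mean(p_values)


# Placeholder data: 65 positions drawn at random stand in for the real mutations.
demo_rng = random.Random(42)
observed_demo = [demo_rng.randrange(GENOME_LENGTH) for _ in range(65)]
mu_p = mean_p_value(observed_demo, SCARS, GENOME_LENGTH)
print(f"mean p-value over 1,000 simulations: {mu_p:.3f}")
```

      If the observed mutations were clustered at the scars, the observed distances would be much shorter than the simulated ones and the mean p-value would fall well below 0.05.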

      • Line 140 The authors should give some background here - explain the idea underlying chromosomal periodicity of the transcriptome, to help the reader understand this analysis.

      • Line 142 Here and elsewhere, when referring to a method, do not just give the citation, but also refer to the methods section or relevant supplementary material.

      The analytical process (references and methods) was described in the Materials and Methods, and the reason we performed the chromosomal periodicity was added in the Results as follows.

      (L165-172) “As the E. coli chromosome was structured, whether the genome reduction caused the changes in its architecture, which led to the differentiated transcriptome reorganization in the Evos, was investigated. The chromosomal periodicity of gene expression was analyzed to determine the structural feature of genome-wide pattern, as previously described 28,38. The analytical results showed that the transcriptomes of all Evos presented a common six-period with statistical significance, equivalent to those of the wild-type and ancestral reduced genomes (Fig. 3A, Table S4).”

      • Line 151 "The expression levels of the mutated genes were higher than those of the remaining genes (Figure 3B)"- did this depend on the type of mutation? There were quite a few early stops in genes, were these also more likely to be expressed? And how about the transcriptional regulators, can you see evidence of their downstream impact?

      Sorry, we didn't investigate the detailed regulatory mechanisms of the 49 mutated genes, which was supposed to be out of the scope of the present study. Fig. 3B was the statistical comparison between 3225 and 49 genes. It didn't mean that all mutated genes were expressed at higher levels than the others. The following sentences were added to address your concern.

      (L181-185) “As the regulatory mechanisms or the gene functions were supposed to be disturbed by the mutations, the expression levels of individual genes might have been either up- or down-regulated. Nevertheless, the overall expression levels of all mutated genes tended to be increased. One of the reasons was assumed to be the mutation essentiality, which remained to be experimentally verified.”

      • Line 199 onward. The authors used WGCNA to analyze the gene expression data of evolved organisms. They identified distinct gene modules in the reduced genome, and through further analysis, they found that specific modules were strongly associated with key biological traits like growth fitness, gene expression changes, and mutation rates. Did the authors expect that there was variation in mutation rate across their populations? Is variation from 3-16 mutations that they observed beyond the expectation for the wt mutation rate? The genetic causes of mutation rate variation are well understood, but I could not see any dinB, mutT,Y, rad, or pol genes among the discovered mutations. I would like the authors to justify the claim that there was mutation rate variation in the evolved populations.

      Thank you for the intriguing thinking. We don't think the mutation rates were significantly varied across the nine populations, as no mutation occurred in the MMR genes, as you noticed. Our previous study showed that the spontaneous mutation rate of the reduced genome was higher than that of the wild-type genome (Nishimura et al., 2017, mBio). As nonsynonymous mutations were not detected in all nine Evos, the spontaneous mutation rate couldn't be calculated (because it should be evaluated according to the ratio of nonsynonymous and synonymous single-nucleotide substitutions in molecular evolution). Therefore, the mutation rate could not be discussed in the present study. The following sentence was added for a better understanding of the gene modules.

      (L242-245) “These modules M2, M10 and M16 might be considered as the hotspots for the genes responsible for growth fitness, transcriptional reorganization, and mutation accumulation of the reduced genome in evolution, respectively.”

      • Line 254 I get the idea of all roads leading to Rome, which is very fitting. However, describing the various evolutionary strategies and homeostatic and variable consequence does not sound correct - although I am not sure exactly what is meant here. Looking at Figure 7, I will call strategy I "parallel evolution", that is following the same or similar genetic pathways to adaptation and strategy ii I would call divergent evolution. I am not sure what strategy iii is. I don't want the authors to use the terms parallel and divergent if that's not what they mean. My request here would be that the authors clearly describe these strategies, but then show how their results fit in with the results, and if possible, fit with the naming conventions, of evolutionary biology.

      Thank you for your kind consideration and excellent suggestion. It's our pleasure to adopt your idea in our study. The evolutionary strategies were renamed according to your recommendation. Both the main text and Fig. 7 were revised as follows.

      (L285-293) “Common mutations22,44 or identical genetic functions45 were reported in the experimental evolution with different reduced genomes, commonly known as parallel evolution (Fig. 7, i). In addition, as not all mutations contribute to the evolved fitness 22,45, another strategy for varied phenotypes was known as divergent evolution (Fig. 7, ii). The present study accentuated the variety of mutations fixed during evolution. Considering the high essentiality of the mutated genes (Table S3), most or all mutations were assumed to benefit the fitness increase, partially demonstrated previously 20. Nevertheless, the evolved transcriptomes presented a homeostatic architecture, revealing the divergent to convergent evolutionary strategy (Fig. 7, iii).”

      Author response image 1.

      • Line 327 Growth rates/fitness. I don't think this should be called growth fitness- a rate is being calculated. I would like the authors to explain how the times were chosen - do the three points have to be during the log phase? Can you also explain what you mean by choosing three ri that have the largest mean and minor variance?

      Sorry for the confusing term usage. The fitness assay was changed to the growth assay. Choosing three ri that have the largest mean and minor variance was to avoid the occasional large values (blue circle), as shown in the following figure. In addition, the details of the growth analysis can be found at https://doi.org/10.3791/56197 (ref. 59), where the video of experimental manipulation, protocol, and data analysis is deposited. The following sentence was added accordingly.

      Author response image 2.

      (L369-371) “The growth rate was determined as the average of three consecutive ri, showing the largest mean and minor variance to avoid the unreliable calculation caused by the occasionally occurring values. The details of the experimental and analytical processes can be found at https://doi.org/10.3791/56197.”
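      One possible reading of this selection rule is sketched below. The exact criterion is not spelled out in the response, so the rule used here is an assumption: among all runs of three consecutive stepwise rates, keep those whose coefficient of variation is below a hypothetical threshold (`max_cv`), then take the run with the largest mean. The synthetic OD600 readings, including one inflated value, are made up for illustration.

```python
from math import exp, log
from statistics import mean, pstdev


def stepwise_rates(times_h, od_values):
    """r_i = ln(OD_{i+1} / OD_i) / (t_{i+1} - t_i) for consecutive readings."""
    return [
        log(od_values[i + 1] / od_values[i]) / (times_h[i + 1] - times_h[i])
        for i in range(len(od_values) - 1)
    ]


def growth_rate(times_h, od_values, window=3, max_cv=0.1):
    """Mean of the steadiest high-rate run of `window` consecutive r_i.

    Assumed rule: discard runs whose coefficient of variation exceeds max_cv,
    then return the largest mean. If every run is noisy, fall back to the run
    with the smallest spread.
    """
    rates = stepwise_rates(times_h, od_values)
    runs = [rates[i:i + window] for i in range(len(rates) - window + 1)]
    steady = [r for r in runs if mean(r) > 0 and pstdev(r) / mean(r) <= max_cv]
    if steady:
        return max(mean(r) for r in steady)
    return mean(min(runs, key=pstdev))


# Synthetic readings every 15 minutes, growing at 0.5 per hour,
# with one inflated value standing in for an occasional outlier.
times_h = [0.25 * i for i in range(10)]
od = [0.01 * exp(0.5 * t) for t in times_h]
od[5] *= 1.3
rate = growth_rate(times_h, od)
print(f"estimated growth rate: {rate:.3f} per hour")
```

      Runs that include the inflated reading show a large spread and are excluded, so the estimate is taken from the steady part of the curve.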

      • Line 403 Chromosomal periodicity analysis. The windows chosen for smoothing (100kb) seem big. Large windows make sense for some things - for example looking at how transcription relates to DNA replication timing, which is a whole-genome scale trend. However, here the authors are looking for the differences after evolution, which will be local trends dependent on specific genes and transcription factors. 100kb of the genome would carry on the order of one hundred genes and might be too coarse-grained to see differences between evos lineages.

      Thank you for the advice. We agree that the present analysis focused on the global trend of gene expression. Varying the window size may lead to different patterns. Additional analysis was performed according to your comment. The results showed that changes in window size (1, 10, 50, 100, and 200 kb) didn't alter the periodicity of the reduced genome, which agreed with the previous study reporting a conserved periodicity in a different reduced genome, MDS42 (Ying et al., 2013, BMC Genomics). The following sentence was added in the Materials and Methods.

      (L460-461) “Note that altering the moving average did not change the max peak.”
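      The robustness check described above can be illustrated with a generic periodogram on synthetic data. This is not the authors' analysis pipeline: the binned "expression" profile, the number of bins, the noise level, and the window sizes (in bins rather than kb) are all hypothetical, and the signal is constructed to contain six cycles around a circular chromosome.

```python
import cmath
import math
import random


def circular_moving_average(values, window):
    """Smooth values on a circular chromosome using an odd-sized window."""
    n = len(values)
    half = window // 2
    width = 2 * half + 1
    return [
        sum(values[(i + j) % n] for j in range(-half, half + 1)) / width
        for i in range(n)
    ]


def dominant_cycle_count(values, max_cycles=50):
    """Number of cycles per chromosome with the highest Fourier power."""
    n = len(values)
    centre = sum(values) / n
    centred = [v - centre for v in values]

    def power(k):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(centred)
        )
        return abs(coeff) ** 2

    return max(range(1, max_cycles + 1), key=power)


# Synthetic profile: six cycles around the chromosome plus Gaussian noise.
rng = random.Random(1)
N_BINS = 400
expression = [
    math.sin(2 * math.pi * 6 * i / N_BINS) + rng.gauss(0, 0.5) for i in range(N_BINS)
]

# Dominant periodicity after smoothing with several window sizes (in bins).
peaks = {
    w: dominant_cycle_count(circular_moving_average(expression, w))
    for w in (1, 5, 11, 21)
}
print(peaks)
```

      A moving average attenuates short-wavelength noise more strongly than a genome-scale oscillation, which is why the highest peak is expected to stay at the same cycle count as long as the window is much shorter than one period.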

      • Figures - the figures look great. Figure 7 needs a legend.

      Thank you. The following legend was added.

      (L774-777) “Three evolutionary strategies are proposed. Pink and blue arrowed lines indicate experimental evolution and genome reduction, respectively. The size of the open cycles represents the genome size. Black and grey indicate the ancestor and evolved genomes, respectively.”

      Response to Reviewer #2:

      Thank you for reviewing our manuscript and for your fruitful comments. We agree that our study leaned towards elaborating observed findings rather than explaining the detailed biological mechanisms. We focused on the genome-wide biological features rather than the specific biological functions. The underlying mechanisms indeed remained unknown, leaving the questions as you commented. We didn't perform the fitness assay on reconstituted (single and combinatorial) mutants because the research purpose was not to clarify the regulatory or metabolic mechanisms. It's why the RNA-Seq analysis provided the findings on genome-wide patterns and chromosomal view, which were supposed to be biologically valuable. We did understand your comments and complaints that the conclusions were biologically meaningless, as ALE studies that found the specific gene regulation or improved pathway was the preferred story in common, which was not the flow of the present study.

      For this reason, our revision may not address all these concerns. Considering your comments, we tried our best to revise the manuscript. The changes made were highlighted. We sincerely hope the revision and the following point-by-point response are acceptable.

      Major remarks:

      (1) The authors outlined the significance of ALE in genome-reduced organisms and important findings from published literature throughout the Introduction section. The description in L65-69, which I believe pertains to the motivation of this study, seems vague and insufficient to convey the novelty or necessity of this study i.e. it is difficult to grasp what aspects of genome-reduced biology that this manuscript intends to focus/find/address.

      Sorry for the unclear writing. The sentences were rewritten for clarity as follows.

      (L64-70) “Although the reduced growth rate caused by genome reduction could be recovered by experimental evolution, it remains unclear whether such an evolutionary improvement in growth fitness was a general feature of the reduced genome and how the genome-wide changes occurred to match the growth fitness increase. In the present study, we performed the experimental evolution with a reduced genome in multiple lineages and analyzed the evolutionary changes of the genome and transcriptome.”

      (2) What is the rationale behind the lineage selection described in Figure S1 legend "Only one of the four overnight cultures in the exponential growth phase (OD600 = 0.01~0.1) was chosen for the following serial transfer, highlighted in red."?

      The four wells (cultures of different initial cell concentrations) were measured every day, and only the well that showed OD600=0.01~0.1 (red) was transferred with four different dilution rates (e.g., 10, 100, 1000, and 10000 dilution rates). It resulted in four wells of different initial cell concentrations. Multiple dilutions ensured that at least one of the wells would show an OD600 within the range of 0.01 to 0.1 after the overnight culture. They were then used for the next serial transfer. Fig. S1 provides the details of the experimental records. The experimental evolution was strictly controlled within the exponential phase, quite different from the commonly conducted ALE that transferred a single culture at a fixed dilution rate. Serial transfer with multiple dilution rates was previously applied in our evolution experiments and well described in Nishimura et al., 2017, mBio; Lu et al., 2022, Comm Biol; Kurokawa et al., 2022, Front Microbiol, etc. The following sentence was added in the Materials and Methods.

      (L344-345) “Multiple dilutions changing in order promised at least one of the wells within the exponential growth phase after the overnight culture.”
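      The transfer scheme described above can be sketched as a small calculation. The OD600 window (0.01 to 0.1) and the four example dilution rates come from the response; the growth rate, the starting OD600, the 24-hour interval, and the function names are hypothetical, and growth is idealised as purely exponential with no lag or saturation.

```python
from math import exp, log

OD_MIN, OD_MAX = 0.01, 0.1              # exponential-phase window from the response
DILUTIONS = (10, 100, 1000, 10000)      # example dilution rates from the response


def next_day_ods(source_od, growth_rate_per_h, hours, dilutions=DILUTIONS):
    """Idealised OD600 of each diluted well after growing for `hours`."""
    return [source_od / d * exp(growth_rate_per_h * hours) for d in dilutions]


def pick_well(od_values):
    """Index of the first well inside the exponential-phase window, or None."""
    for index, od in enumerate(od_values):
        if OD_MIN <= od <= OD_MAX:
            return index
    return None


def daily_growth_rate(initial_od, final_od, hours):
    """Growth rate estimated from the initial and final OD600 of one transfer."""
    return log(final_od / initial_od) / hours


# Hypothetical day: source well at OD600 0.05, growth rate 0.3 per hour, 24 h.
ods = next_day_ods(source_od=0.05, growth_rate_per_h=0.3, hours=24)
chosen = pick_well(ods)
print("wells:", [f"{od:.4f}" for od in ods], "chosen index:", chosen)
```

      Because the dilutions span several orders of magnitude, a wide range of growth rates still leaves one well inside the window the next day, which is the point of transferring at multiple dilution rates.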

      (3) The measured growth rate of the end-point 'F2 lineage' shown in Figure S2 seemed comparable to the rest of the lineages (A1 to H2), but the growth rate of 'F2' illustrated in Figure 1B indicates otherwise (L83-84). What is the reason for the incongruence between the two datasets?

      Sorry for the unclear description. The growth rates shown in Fig. S2 were obtained during the evolution experiment using the daily transfer's initial and final OD600 values. The growth rates shown in Fig. 1B were obtained from the final population (Evos) growth assay and calculated from the growth curves (biological replication, N=4). Fig. 1B shows the precisely evaluated growth rates, and Fig. S2 shows the evolutionary changes in growth rates. Accordingly, the following sentence was added to the Results.

      (L84-87) “As the growth increases were calculated according to the initial and final records, the exponential growth rates of the ancestor and evolved populations were obtained according to the growth curves for a precise evaluation of the evolutionary changes in growth.”

      (4) Are the differences in growth rate statistically significant in Figure 1B?

      The differences were significant for eight of the nine Evos; F2 was the exception. The sentences were rewritten and linked to the revised Fig. 1B, which now indicates significance.

      (L87-90) “The results demonstrated that most evolved populations (Evos) showed improved growth rates, in which eight out of nine Evos were highly significant (Fig. 1B, upper). However, the magnitudes of growth improvement were considerably varied, and the evolutionary dynamics of the nine lineages were somehow divergent (Fig. S2).”

      (5) The evolved lineages showed a decrease in their maximal optical densities (OD600) compared to the ancestral strain (L85-86). ALE could accompany changes in cell size and morphologies, (doi: 10.1038/s41586-023-06288-x; 10.1128/AEM.01120-17), which may render OD600 relatively inaccurate for cell density comparison. I suggest using CFU/mL metrics for the sake of a fair comparison between Anc and Evo.

      The choice of method for evaluating the carrying capacity (i.e., cell density, population size, etc.) does not change the results. CFU counting is also biased: it misses living cells that cannot form colonies, and it is likewise affected by changes in cell size. Optical density (OD600) records cell growth at 15-minute intervals, which allows an accurate evaluation of the growth rate in the exponential phase. CFU counting captures temporal changes in population size poorly, which tends to yield inaccurate growth rates. Taken together, we believe that our method was reasonable and reliable. We hope you can accept this difference in approach.

      (6) Please provide evidence in support of the statement in L115-119. i.e. statistical analysis supporting that the observed ratio of essential genes in the mutant pool is not random.

      The statistical test was performed, and the following sentence was added.

      (L139-141) “The ratio of essential genes in the mutated genes was significantly higher than in the total genes (286 out of 3290 genes, Chi-square test p=0.008).”
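
      A test of this kind can be sketched as a Pearson chi-square test on a 2x2 table. The layout below (mutated vs. non-mutated genes, essential vs. non-essential) is an assumption, as is the absence of a continuity correction; the response does not state how the table was built, so the output need not reproduce the reported p-value. The counts (11 of 49 mutated genes essential, 286 of 3,290 genes essential) come from the text.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]].

    Returns (chi2, p). For one degree of freedom the survival function
    of the chi-square distribution equals erfc(sqrt(chi2 / 2)).
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must have a nonzero total")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    return chi2, math.erfc(math.sqrt(chi2 / 2))


mutated_total, mutated_essential = 49, 11
genome_total, genome_essential = 3290, 286

# Assumed layout: rows = mutated / not mutated, columns = essential / not.
table = (
    mutated_essential,
    mutated_total - mutated_essential,
    genome_essential - mutated_essential,
    (genome_total - mutated_total) - (genome_essential - mutated_essential),
)
chi2, p = chi_square_2x2(*table)
print(f"chi2 = {chi2:.2f}, p = {p:.4g}")
```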

      (7) The assumption that "mutation abundance would correlate to fitness improvement" described in L120-122: "The large variety in genome mutations and no correlation of mutation abundance to fitness improvement strongly suggested that no mutations were specifically responsible or crucially essential for recovering the growth rate of the reduced genome" is not easy to digest, in the sense that (i) the effect of multiple beneficial mutations are not necessarily summative, but are riddled with various epistatic interactions (doi: 10.1016/j.mec.2023.e00227); (ii) neutral hitchhikers are of common presence (you could easily find reference on this one); (iii) hypermutators that accumulate greater number of mutations in a given time are not always the eventual winners in competition games (doi: 10.1126/science.1056421). In this sense, the notion that "mutation abundance correlates to fitness improvement" in L120-122 seems flawed (for your perusal, doi: 10.1186/gb-2009-10-10-r118).

      Sorry for the improper description and confusing writing, and thank you for the helpful references on molecular evolution. The sentence was deleted, and the following one was added.

      (L145-146) “Nevertheless, it was unclear whether and how these mutations were explicitly responsible for recovering the growth rate of the reduced genome.”

      (8) Could it be possible that the large variation in genome mutations in independent lineages results from a highly rugged fitness landscape characterized by multiple fitness optima (doi: 10.1073/pnas.1507916112)? If this is the case, I disagree with the notion in L121-122 "that no mutations were specifically responsible or crucially essential" It does seem to me that, for example, the mutations in evo A2 are specifically responsible and essential for the fitness improvement of evo A2 in the evolutionary condition (M63 medium). Fitness assessment of individual (or combinatorial) mutants reconstituted in the Ancestral background would be a bonus.

      Thank you for the intriguing idea. The sentence was deleted. Please allow us to adapt your comment to the manuscript as follows.

      (L143-145) “The large variety of genome mutations fixed in the independent lineages might result from a highly rugged fitness landscape 38.”

      (9) L121-122: "...no mutations were specifically responsible or crucially essential for recovering the growth rate of the reduced genome". Strictly speaking, the authors should provide a reference case of wild-type E. coli ALE in order to reach definitive conclusions that the observed mutation events are exclusive to the genome-reduced strain. It is strongly recommended that the authors perform comparative analysis with an ALEed non-genome-reduced control for a more definitive characterization of the evolutionary biology in a genome-reduced organism, as it was done for "JCVI-syn3.0B vs non-minimal M. mycoides" (doi: 10.1038/s41586-023-06288-x) and "E. coli eMS57 vs MG1655" (doi: 10.1038/s41467-019-08888-6).

      The improper description was deleted in response to comments 7 and 8. The mentioned references were cited in the manuscript (refs 21 and 23). Thank you for the experimental advice. We are sorry that the comparison of wild-type and reduced genomes was not in the scope of the present study and will probably be reported soon in our future work.

      (10) L146-148: "The homeostatic periodicity was consistent with our previous findings that the chromosomal periodicity of the transcriptome was independent of genomic or environmental variation" A Previous study also suggested that the amplitudes of the periodic transcriptomes were significantly correlated with the growth rates (doi: 10.1093/dnares/dsaa018). Growth rates of 8/9 Evos were higher compared to Anc, while that of Evo F2 remained similar. Please comment on the changes in amplitudes of the periodic transcriptomes between Anc and each Evo.

      Thank you for the suggestion. The correlation between the growth rates and the amplitudes of chromosomal periodicity was statistically insignificant (p>0.05). This might result from the limited number of data points. Compared with the nine data points in the present study, the previous study analyzed hundreds of transcriptomes associated with the corresponding growth rates, which is suitable for statistical evaluation. In addition, the changes in growth rates were larger in the previous study than in the present study, which might influence the significance. This is why we did not discuss the periodic amplitude.

      (11) Please elaborate on L159-161: "It strongly suggested the essentiality mutation for homeostatic transcriptome architecture happened in the reduced genome.".

      Sorry for the improper description. The sentence was rewritten as follows.

      (L191-193) “The essentiality of the mutations might have participated in maintaining the homeostatic transcriptome architecture of the reduced genome.”

      (12) Is FPKM a valid metric for between-sample comparison? The growing consensus in the community adopts Transcripts Per Kilobase Million (TPM) for comparing gene expression levels between different samples (Figure 3B; L372-379).

      Sorry for the unclear description. The FPKM indicated here was globally normalized, statistically equivalent to TPM. The following sentence was added to the Materials and Methods.

      (L421-422) “The resulting normalized FPKM values were statistically equivalent to TPM.”
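
      The relation between the two units can be illustrated as follows. Within one sample, TPM is FPKM rescaled so that the values sum to one million, so the relative expression of genes is unchanged. This is a general sketch with a hypothetical function name; the authors' own global normalization procedure is not specified in the response and may differ.

```python
def fpkm_to_tpm(fpkm_values):
    """Rescale the FPKM values of one sample so that they sum to 1e6."""
    total = sum(fpkm_values)
    if total <= 0:
        raise ValueError("the sum of FPKM values must be positive")
    return [value / total * 1e6 for value in fpkm_values]


# Made-up FPKM values for three genes in one sample.
tpm = fpkm_to_tpm([12.0, 30.0, 18.0])
```

      Because every sample ends up with the same total, the rescaled values are directly comparable between samples in a way that raw FPKM values are not.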

      (13) Please provide % mapped frequency of mutations in Table S3.

      They were all 100%. The partially fixed mutations were excluded in the present study. The following sentence was added to the caption of Table S3.

      (Supplementary file, p 9) “Note that the entire population held the mutations, i.e., 100% frequency in DNA sequencing.”

      (14) To my knowledge, M63 medium contains glucose and glycerol as carbon sources. The manuscript would benefit from discussing the elements that impose selection pressure in the M63 culture condition.

      Sorry for the missing information on M63, which contains 22 mM glucose as the only carbon source. The medium composition was added in the Materials and Methods, as follows.

      (L334-337) “In brief, the medium contains 62 mM dipotassium hydrogen phosphate, 39 mM potassium dihydrogen phosphate, 15 mM ammonium sulfate, 15 μM thiamine hydrochloride, 1.8 μM Iron (II) sulfate, 0.2 mM magnesium sulfate, and 22 mM glucose.”

      (15) The RNA-Seq datasets for Evo strains seemed equally heterogenous, just as their mutation profiles. However, the missing element in their analysis is the directionality of gene expression changes. I wonder what sort of biological significance can be derived from grouping expression changes based solely on DEGs, without considering the magnitude and the direction (up- and down-regulation) of changes? RNA-seq analysis in its current form seems superficial to derive biologically meaningful interpretations.

      We agree that most studies discuss the direction of transcriptional changes. The present study aimed to capture a global view of the magnitude of transcriptome reorganization. Thus, the analyses focused on overall features, such as the abundance of DEGs, instead of the details of the changes, e.g., the up- and down-regulation of DEGs. This overview of the DEGs shows how strongly genome-wide gene expression fluctuated, although it falls short of an in-depth view of individual gene expression. The following sentence was added to indicate the limitation of the present analysis.

      (L199-202) “Instead of an in-depth survey on the directional changes of the DEGs, the abundance and functional enrichment of DEGs were investigated to achieve an overview of how significant the genome-wide fluctuation in gene expression, which ignored the details of individual genes.”

      Minor remarks

      (1) L41: brackets italicized "(E. coli)".

      It was fixed as follows.

      (L40) “… Escherichia coli (E. coli) cells …”

      (2) Figure S1. It is suggested that the x-axis of ALE monitor be set to 'generations' or 'cumulative generations', rather than 'days'.

      Thank you for the suggestion. Fig. S1 describes the experimental procedure, so "day" was used. Fig. S2 presents the evolutionary process, so "generation" was used, as you recommended.

      (3) I found it difficult to digest through L61-64. Although it is not within the job scope of reviewers to comment on the language style, I must point out that the manuscript would benefit from professional language editing services.

      Sorry for the unclear writing. The sentences were revised as follows.

      (L60-64) “Previous studies have identified conserved features in transcriptome reorganization, despite significant disruption to gene expression patterns resulting from either genome reduction or experimental evolution 27-29. The findings indicated that experimental evolution might reinstate growth rates that have been disrupted by genome reduction to maintain homeostasis in growing cells.”

      (4) Duplicate references (No. 21, 42).

      Sorry for the mistake. It was fixed (leaving ref. 21).

      (5) Inconsistency in L105-106: "from two to 13".

      "From two to 13" was adopted from the language editing. It was changed as follows.

      (L119) “… from 2 to 13, …”

      Response to Reviewer #3:

      Thank you for reviewing our manuscript and for the helpful comments, which improved the strength of the manuscript. The recommended statistical analyses that directly support statements in the manuscript were performed, whereas those that would constitute new results within the scope of further studies were not conducted. The changes made in the revision are highlighted. We sincerely hope the revised manuscript and the following point-by-point response address your concerns. You will find all your suggested statistical tests in our future work, which will report an extensive study on the experimental evolution of an assortment of reduced genomes.

      (1) Line 106 - "As 36 out of 45 SNPs were nonsynonymous, the mutated genes might benefit the fitness increase." This argument can be strengthened. For example, the null expectation of nonsynonymous SNPs should be discussed. Is the number of observed nonsynonymous SNPs significantly higher than the expected one?

      (2) Line 107 - "In addition, the abundance of mutations was unlikely to be related to the magnitude of fitness increase." Instead of just listing examples, a regression analysis can be added.

      Yes, it is significant. By a rough estimate, random mutations lead to ~33% nonsynonymous SNPs. Additionally, a regression would be unreliable because there is no statistically significant correlation between the number of mutations and the magnitude of fitness increase. Accordingly, the corresponding sentences were revised with additional statistical tests.

      (L123-129) “As 36 out of 45 SNPs were nonsynonymous, which was highly significant compared to random mutations (p < 0.01), the mutated genes might benefit fitness increase. In addition, the abundance of mutations was unlikely to be related to the magnitude of fitness increase. There was no significant correlation between the number of mutations and the growth rate in a statistical view (p > 0.1). Even from an individual close-up viewpoint, the abundance of mutations poorly explained the fitness increase.”
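
      A comparison of this kind can be sketched as a one-sided exact binomial test. The response does not state which test was used, so this is an assumption. The null proportion is left as a parameter because the result depends strongly on it; the value 0.33 below is the authors' rough estimate quoted in the response, not an independently derived figure.

```python
from math import comb


def binomial_upper_tail(k, n, p0):
    """P(X >= k) for X ~ Binomial(n, p0), summed exactly."""
    return sum(comb(n, i) * p0 ** i * (1 - p0) ** (n - i)
               for i in range(k, n + 1))


observed_nonsyn, total_snps = 36, 45   # counts from the manuscript text
null_fraction = 0.33                   # authors' rough estimate (assumption)
p_value = binomial_upper_tail(observed_nonsyn, total_snps, null_fraction)
print(f"one-sided p = {p_value:.3g}")
```

      Rerunning the function with a different null proportion shows how sensitive the conclusion is to the assumed expectation for random mutations.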

      (3) Line 114 - "They seemed highly related to essentiality, as 11 out of 49 mutated genes were essential (Table S3)." Here, the information mentioned in line 153 ("the ratio of essential to all genes (302 out of 3,290) in the reduced genome.") can be used. Then a statistical test for a contingency table can be used.

      (4) Line 117 - "the high frequency of the mutations fixed in the essential genes suggested the mutation in essentiality for fitness increase was the evolutionary strategy for reduced genome." What is the expected number of fixed mutations in essential genes vs non-essential genes? Is the observed number statistically significantly higher?

      Sorry for the improper and insufficient information on the essential genes. Yes, it's significant. The statistical test was additionally performed. The corresponding part was revised as follows.

      (L134-146) “They seemed highly related to essentiality7 (https://shigen.nig.ac.jp/ecoli/pec/genes.jsp), as 11 out of 49 mutated genes were essential (Table S3). Although the essentiality of genes might differ between the wild-type and reduced genomes, the experimentally determined 302 essential genes in the wild-type E. coli strain were used for the analysis, of which 286 were annotated in the reduced genome. The ratio of essential genes in the mutated genes was significantly higher than in the total genes (286 out of 3290 genes, Chi-square test p=0.008). As the essential genes were determined according to the growth35 and were known to be more conserved than nonessential ones 36,37, the high frequency of the mutations fixed in the essential genes was highly intriguing and reasonable. The large variety of genome mutations fixed in the independent lineages might result from a highly rugged fitness landscape 38. Nevertheless, it was unclear whether and how these mutations were explicitly responsible for recovering the growth rate of the reduced genome.”

      (5) The authors mentioned no overlapping in the single mutation level. Is that statistically significant? The authors can bring up what the no-overlap probability is given that there are in total x number of fixed mutations observed (either theory or simulation is good).

      Sorry, we are confused by this comment. It is unclear to us why a statistical simulation is needed. Firstly, the mutations were experimentally observed. The result that no overlapping mutated genes were detected was an Experimental Fact, not a Computational Prediction. We are afraid that our finding may have been over-interpreted as an evolutionary rule, which would always require a statistical test of its reliability. We did not conclude that the evolution had no overlapping mutations. Secondly, considering that 65 random mutations occurred in a ~3.9 Mb sequence, a statistical test would be meaningful only if overlapping mutations had been found experimentally. How often random mutations produce overlapping mutations in parallel evolutionary lineages as the number of lineages increases is an interesting question, but it seems to be out of the scope of the present study. We are happy to include the analysis in our ongoing study on the experimental evolution of reduced genomes.

      (6) The authors mentioned no overlapping in the single mutation level. How about at the genetic level? Some fixed mutations occur in the same coding gene. Is there any gene with a significantly enriched number of mutations?

      No two mutations were fixed in the same gene with a biological function, as shown in Table S3. If all coding regions are considered, the only exception is the IS sequences, which are well known as transposable sequences without genetic function. The following description was added.

      (L119-122) “The number of mutations largely varied among the nine Evos, from 2 to 13, and no common mutation was detected in all nine Evos (Table S3). A 1,199-bp deletion of insH was frequently found in the Evos (Table S3, highlighted), which well agreed with its function as a transposable sequence.”

      (7) Line 151-156- It seems like the authors argue that the expression level differences can be just explained by the percentage of essential genes that get fixed mutations. One further step for the argument could be to compare the expression level of essential genes with vs without fixed mutations. Also, the authors can compare the expression level of non-essential genes with vs without fixed mutations. And the authors can report whether the differences in expression level became insignificant after the control of the essentiality.

      It's our pleasure that the essentiality intrigued you. Thank you for the analytical suggestion, which is exciting and valuable for our studies. As only 11 essential genes were detected here and "Mutation in essentiality" was an indication but not the conclusion of the present study, we would like to apply the recommended analysis to the datasets of our ongoing study to demonstrate this statement. Thank you again for your fruitful analytical advice.

      (8) Line 169- "The number of DEGs partially overlapped among the Evos declined significantly along with the increased lineages of Evos (Figure 4B). " There is a lack of statistical significance here while the word "significantly" is used. One statistical test that can be done is to use re-sampling/simulation to generate a null expectation of the overlapping numbers given the DEGs for each Evo line and the total number of genes in the genome. The observed number can then be compared to the distribution of the simulated numbers.

      Sorry for the inappropriate usage of the term. Whether it's statistically significant didn't matter here. The word "significant" was deleted as follows.

      (L205-206) “The number of DEGs partially overlapped among the Evos declined along with the increased lineages of Evos (Fig. 4B).”

      (9) Line 177-179- "In comparison,1,226 DEGs were induced by genome reduction. The common DEGs 177 of genome reduction and evolution varied from 168 to 540, fewer than half of the DEGs 178 responsible for genome reduction in all Evos" Is the overlapping number significantly lower than the expectation? The hypergeometric test can be used for testing the overlap between two gene sets.
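
      The hypergeometric test mentioned in this comment can be sketched as follows, for readers unfamiliar with it. It asks how likely an overlap at most as large as the observed one would be if two gene sets were drawn independently from the same genome. The function names are hypothetical and the numbers in the example are toy values, not data from the study.

```python
from math import comb


def hypergeom_pmf(k, total_genes, set_a_size, set_b_size):
    """P(overlap == k) when set B is drawn at random from the genome."""
    return (comb(set_a_size, k)
            * comb(total_genes - set_a_size, set_b_size - k)
            / comb(total_genes, set_b_size))


def hypergeom_lower_tail(k, total_genes, set_a_size, set_b_size):
    """P(overlap <= k): probability of an overlap this small or smaller."""
    lowest = max(0, set_b_size - (total_genes - set_a_size))
    highest = min(k, set_a_size, set_b_size)
    return sum(hypergeom_pmf(i, total_genes, set_a_size, set_b_size)
               for i in range(lowest, highest + 1))


# Toy example: 10 genes, set A has 4 genes, set B has 3 genes, no overlap.
p_low = hypergeom_lower_tail(0, 10, 4, 3)
```

      For the question raised here, the total would be the number of genes in the reduced genome, set A the DEGs induced by genome reduction, set B the DEGs of one Evo, and k their observed overlap.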

      There is no expectation for how many DEGs would be reasonable. Not every experimentally obtained number needs to be tested for statistical significance, although such testing is commonly essential in computational and data science.

      (10) The authors should give more information about the ancestral line used at the beginning of experimental evolution. I guess it is one of the KHK collection lines, but I can not find more details. There are many genome-reduced lines. Why is this certain one picked?

      Sorry for the insufficient information on the reduced genome used for the experimental evolution. The following descriptions were added in the Results and the Materials and Methods, respectively.

      (L75-79) “The E. coli strain carrying a reduced genome, derived from the wild-type genome W3110, showed a significant decline in its growth rate in the minimal medium compared to the wild-type strain 13. To improve the genome reduction-mediated decreased growth rate, the serial transfer of the genome-reduced strain was performed with multiple dilution rates to keep the bacterial growth within the exponential phase (Fig. S1), as described 17,20.”

      (L331-334) “The reduced genome has been constructed by multiple deletions of large genomic fragments 58, which led to an approximately 21% smaller size than its parent wild-type genome W3110.”

      (11) How was the saturated density in Figure 1 actually determined? In particular, the fitness assay of growth curves is 48h. But it seems like the experimental evolution is done for ~24 h cycles. If the Evos never experienced a situation like a stationary phase between 24-48h, and if the author reported the saturated density 48 h in Figure 1, the explanation of the lower saturated density can be just relaxation from selection and may have nothing to do with the increase of growth rate.

      Sorry for the unclear description. Yes, you are right. The evolution was performed within the exponential growth phase (keeping cell division constant), which means the Evos never experienced the stationary phase (saturation). The final evolved populations were subjected to the growth assay to obtain the entire growth curves for calculating the growth rate and the saturated density. Whether the decreased saturated density and the increased growth rate were in a trade-off relationship remained unclear. The corresponding paragraph was revised as follows.

      (L100-115) “Intriguingly, a positive correlation was observed between the growth fitness and the carrying capacity of the Evos (Fig. 1D). It was somehow consistent with the positive correlations between the colony growth rate and the colony size of a genome-reduced strain 11 and between the growth rates and the saturated population size of an assortment of genome reduced strains 13. Nevertheless, the negative correlation between growth rate and carrying capacity, known as the r/K selection30,31 was often observed as the trade-off relationship between r and K in the evolution and ecology studies 32 33,34. As the r/K trade-off was proposed to balance the cellular metabolism that resulted from the cost of enzymes involved 34, the deleted genes might play a role in maintaining the metabolism balance for the r/K correlation. On the other hand, the experimental evolution (i.e., serial transfer) was strictly performed within the exponential growth phase; thus, the evolutionary selection was supposed to be driven by the growth rate without selective pressure to maintain the carrying capacity. The declined carrying capacity might have been its neutral "drift" but not a trade-off to the growth rate. Independent and parallel experimental evolution of the reduced genomes selecting either r or K is required to clarify the actual mechanisms.”

      (12) What annotation of essentiality was used in this paper? In particular, the essentiality can be different in the reduced genome background compared to the WT background.

      Sorry for the unclear definition of the essential genes. They are strictly limited to the 302 essential genes experimentally determined in the wild-type E. coli strain. Detailed information can be found at the following website: https://shigen.nig.ac.jp/ecoli/pec/genes.jsp. We agree that the essentiality could differ between the WT and reduced genomes. Identifying the essential genes in the reduced genome would require an exhaustive amount of work. The information on the essential genes defined in the present study was added as follows.

      (L134-139) “They seemed highly related to essentiality7 (https://shigen.nig.ac.jp/ecoli/pec/genes.jsp), as 11 out of 49 mutated genes were essential (Table S3). Although the essentiality of genes might differ between the wild-type and reduced genomes, the experimentally determined 302 essential genes in the wild-type E. coli strain were used for the analysis, of which 286 were annotated in the reduced genome.”

      (13) The fixed mutations in essential genes are probably not rarely observed in experimental evolution. For example, fixed mutations related to RNA polymerase can be frequently seen when evolving to stressful environments. I think the author can discuss this more and elaborate more on whether they think these mutations in essential genes are important in adaptation or not.

      Thank you for your careful reading and the suggestion. As you mentioned, we noticed that mutations in RNA polymerase genes (rpoA, rpoB, and rpoD) were identified in three Evos. As they were not shared across all Evos, we did not discuss the contribution of these mutations to evolution. Instead of the individual functions of the mutated essential genes, we focused on the enriched gene functions related to the transcriptome reorganization because they were the common feature observed across all Evos and were linked to whole metabolic or regulatory pathways, which are supposed to be more biologically reasonable and interpretable. The following sentence was added to clarify our thinking.

      (L268-273) “In particular, mutations in the essential genes, such as RNA polymerases (rpoA, rpoB, rpoD) identified in three Evos (Table S3), were supposed to participate in the global regulation for improved growth. Nevertheless, the considerable variation in the fixed mutations without overlaps among the nine Evos (Table 1) implied no common mutagenetic strategy for the evolutionary improvement of growth fitness.”

      (14) In experimental evolution to new environments, several previous studies also show that the long-term evolution of the transcriptome is not consistent with, or even reverts, the short-term response; short-term responses were rather considered an emergency plan. They seem to echo what the authors found in this manuscript. I think the author can refer to some of those studies more and provide a more thorough discussion of short-term vs long-term responses in evolution.

      Thank you for the advice. It is unclear to us what the short-term and long-term responses mentioned in this comment refer to. The term "Response" usually refers to phenotypic or transcriptional changes within a few hours after an environmental fluctuation, which are generally non-genetic (no mutation). In comparison, long-term or short-term experimental "Evolution" is associated with genetic changes (mutations). Concerning the Evolution (not the Response), long-term experimental evolution (>10,000 generations) has been performed only with the wild-type genome, and short-term experimental evolution (500~2,000 generations) has more often been conducted with both wild-type and reduced genomes, to our knowledge. Previous landmark studies have intensively compared the wild-type and reduced genomes. Our study was restricted to the reduced genome, which was constructed differently from the reduced genomes used in the reported studies. In those studies, the experimental evolution of the reduced genomes was performed in the presence of additives, e.g., antibiotics, alternative carbon sources, etc. That is, neither the genomic backgrounds nor the evolutionary conditions were comparable. Comparing studies with nothing in common seems unproductive. We sincerely hope the recommended topics can be addressed in our future work.

      Some minor suggestions

      • Figures S3 & Table S2 need an explanation of the abbreviations of gene categories.

      Sorry for the missing information. Figure S3 and Table S3 were revised to include the names of gene categories. The figure is pasted below for quick reference.

      Author response image 3.

      • I hope the authors can re-consider the title; "Diversity for commonality" does not make much sense to me. For example, it can be simply just "Diversity and commonality."

      Thank you for the suggestion. The title was simplified as follows.

      (L1) “Experimental evolution for the recovery of growth loss due to genome reduction.”

      • It is not easy for me to locate and distinguish the RNA-seq vs DNA-seq files in DRA013662 at DDBJ. Could you make some notes on what RNA-seq actually are, vs what DNA-seq files actually are?

      Sorry for the mistakes in the DRA number of DNA-seq. DNA-seq and RNA-seq were deposited separately with the accession IDs of DRA013661 and DRA013662, respectively. The following correction was made in the revision.

      (L382-383) “The raw datasets of DNA-seq were deposited in the DDBJ Sequence Read Archive under the accession number DRA013661.”

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      The authors sought to test whether anterior insular cortex neurons that increase or decrease firing during fear behavior and freezing bi-directionally control fear via separate, anatomically defined outputs. Using a fairly simple behavior where mice were exposed to tone-shock pairings, they found roughly equal populations that do indeed either increase or decrease firing during freezing. Next, they sought to test whether these distinct populations may also have distinct outputs. Using retrograde tracers they found that the anterior insular cortex contains non-overlapping neurons which project to the mediodorsal thalamus or amygdala. Mediodorsal thalamus-projecting neurons tended to cluster in deep cortical layers while amygdala-projecting neurons were primarily in more superficial layers. Stimulation of insula-thalamus projections decreased freezing behavior, and stimulation of insula-amygdala projections increased fear behavior. Given that the neurons that increased firing were located in deep layers, that thalamus projections occurred in deep layers, and that stimulation of insula-thalamus neurons decreased freezing, the authors concluded that the increased firing neurons may be thalamus projections. Similarly, given that decreased-firing neurons tended to occur in more superficial layers, that insula-amygdala projections were primarily superficial, and that insula-amygdala stimulation increased freezing behavior, the authors concluded that the decreased firing cells may be amygdala projections. The study has several strengths though also some caveats.

      Strengths:

      The potential link between physiological activity, anatomy, and behavior is well laid out and is an interesting question. The activity contrast between the units that increase/decrease firing during freezing is clear.

      It is nice to see the recording of extracellular spiking activity, which provides a clear measure of neural output, whereas similar studies often use bulk calcium imaging, a signal that rarely matches real neural activity even when anatomy suggests it might (see London et al 2018 J Neuro - there are increased/decreased spiking striatal populations, but both D1 and D2 striatal neurons increase bulk calcium).

      Weaknesses:

      The link between spiking, anatomy, and behavior requires assumptions/inferences: the anatomically/genetically defined neurons which had distinct outputs and opposite behavioral effects can only be assumed to be the increased/decreased spiking neurons, based on the rough area of the cortical layer in which they were recorded.

      Yes, we are aware that we could not provide a direct link between spiking, anatomy and behavior. We have specifically noted this in the discussion section and added a possible experiment that could be carried out to provide a more direct link in a future study.

      [Lines 371-375] We would like to provide more direct evidence linking the neuronal response types and projection patterns in future studies by electrophysiologically identifying freezing-excited and freezing-inhibited aIC neurons and testing whether those neurons are activated by optogenetic activation of amygdala- or medial thalamus-projecting aIC neurons.

      The behavior would require more control to fully support claims about the associative nature of the fear response (see Trott et al 2022 eLife) - freezing, in this case, could just as well be nonassociative. In a similar vein, fixed intertrial intervals, though common practice in the fear literature, pose a problem for neurophysiological studies. The first is that animals learn the timing of events, and the second is that neural activity is dynamic and changes over time. Thus it is very difficult to determine whether changes in neural activity are due to learning about the tone-shock contingency, timing of the task, simply occur because of time and independently of external events, or some combination of the above.

      Trott et al. (2022) stated that "...freezing was the purest reflection of associative learning." The nonassociative processes mentioned in the study were related to running and darting behaviors, which the authors argue are suppressed by associative learning. Moreover, considerable evidence from immediate postshock freezing and immediate postshock context shift studies all indicate that the freezing response is an associative (and not nonassociative) response (Fanselow, 1980 and 1986; and Landeira-Fernandez et al., 2006). Thus, our animals' freezing response to the tone CS presentation in a novel context, following three tone CS-footshock US pairings, most likely reflects associative learning. 

      Concerning the issue of fixed inter-trial intervals (ITIs), which are standard in fear conditioning studies, particularly those with few CS-US paired trials, we acknowledge the challenge in interpreting the neural correlates of behavior. However, the ITIs in our extinction study were variable, and we still found neural activity that correlated significantly with freezing. The results of our extinction study, carried out with variable ITIs, suggest that the aIC neural activity changes measured in this study are likely due to freezing behavior associated with fear learning, not to learning the contingencies of fixed ITIs.

      Reviewer #2 (Public Review):

      In this study, the authors aim to understand how neurons in the anterior insular cortex (insula) modulate fear behaviors. They report that the activity of a subpopulation of insula neurons is positively correlated with freezing behaviors, while the activity of another subpopulation of neurons is negatively correlated to the same freezing episodes. They then used optogenetics and showed that activation of anterior insula excitatory neurons during tones predicting a footshock increases the amount of freezing outside the tone presentation, while optogenetic inhibition had no effect. Finally, they found that two neuronal projections of the anterior insula, one to the amygdala and another to the medial thalamus, are increasing and decreasing freezing behaviors respectively. While the study contains interesting and timely findings for our understanding of the mechanisms underlying fear, some points remain to be addressed.

      We are thankful for the detailed and constructive comments by the reviewer and have addressed the points. Specifically, we included possible limitations of using only male mice in the study, included two more studies about the insula as references, specified the L-ratio and isolation distance used in our study, added the ratio of putative-excitatory and putative-inhibitory neurons obtained from our study, changed the terms used to describe neuronal activity changes (freezing-excited and freezing-inhibited cells), added new analysis (Figure 2H), rearranged Figure 2 for clarity, added new histology images, and added atlas maps with viral expressions (three figure supplements).

      Reviewer #1 (Recommendations For The Authors):

      - I would suggest keeping the same y-axis for all figures that display the same data type - Figure 5D, for example.

      Thank you for the detailed suggestion. We corrected the figures so that panels displaying the same data type use the same y-axis.

      - In the methods, it says 30s bins were used for neural analysis (line 435). I cannot imagine doing this, and looking at the other figures, it does not look like this is the case so could you please clarify what bins, averages, etc were used for neural and behavioral analysis?

      Bin size for neural analysis varied; 30s, 5s, 1s bins were used depending on the analysis. We corrected this and specified what time bin was used for which figure in the methods.

      Bin size for neural and freezing behavior was 30s and we also added this to the methods.
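
      As an illustration of the binning described above, the following is a minimal sketch (not the authors' analysis code) of how spike timestamps could be converted into firing rates for a chosen bin size and then correlated with a per-bin freezing measure using Pearson's r. The function names, the toy spike times, and the freezing values are all hypothetical; only the bin sizes (30 s, 5 s, 1 s) come from the response.

```python
import math

def bin_counts(event_times, t_start, t_stop, bin_size):
    """Count events in consecutive bins of width bin_size (seconds)."""
    n_bins = int((t_stop - t_start) // bin_size)
    counts = [0] * n_bins
    for t in event_times:
        idx = int((t - t_start) // bin_size)
        if 0 <= idx < n_bins:  # ignore events outside the analysis window
            counts[idx] += 1
    return counts

def firing_rates(spike_times, t_start, t_stop, bin_size):
    """Convert per-bin spike counts to firing rates in Hz."""
    counts = bin_counts(spike_times, t_start, t_stop, bin_size)
    return [c / bin_size for c in counts]

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

if __name__ == "__main__":
    # Hypothetical data: spike times (s) and percent freezing per 30 s bin.
    spikes = [1.0, 12.5, 29.9, 30.0, 41.0, 44.2, 58.0, 61.0, 95.0]
    freezing = [20.0, 55.0, 10.0]
    rates = firing_rates(spikes, t_start=0.0, t_stop=90.0, bin_size=30.0)
    print(rates, pearson_r(rates, freezing))
```

      Because the same helper accepts any bin size, the 5 s and 1 s analyses mentioned in the response would differ only in the `bin_size` argument.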

      - I would not make any claims about the fear response here being associative/conditional. This would require a control group that received an equal number of tone and shock exposures, whether explicitly unpaired or random.

      The unpaired fear conditioning paradigm (unpaired tone and shock) suggested by the reviewer is well established not to induce CS-evoked fear behavior (Moita et al., 2003 and Kochli et al., 2015). In addition, considerable evidence from immediate post-shock freezing and immediate post-shock context shift studies all indicates that the freezing response is an associative (and not nonassociative) response (Fanselow, 1980 and 1986; and Landeira-Fernandez et al., 2006). Thus, our animals' freezing response to the tone CS presentation in a novel context, following three tone CS-footshock US pairings, most likely reflects associative learning.

      - I appreciate the discussion about requiring some inference to conclude that anatomically defined neurons are the physiologically defined ones. This is a caveat that is fully disclosed, however, I might suggest adding to the discussion that future experiments could address this by tagging insula-thalamus or insula-amygdala neurons with antidromic (opto or even plain old electric!) stimulation. These experiments are tricky to perform, of course, but this would be required to fully close all the links between behavior, physiology, and anatomy.

      As suggested, we have included in the discussion that, in a future study, we would like to elucidate a more direct link between physiology, anatomy and behavior by optogenetically tagging the insula-thalamus/insula-amygdala neurons and identifying whether each is a positive or a negative cell (now named the freezing-excited and freezing-inhibited cells, respectively).

      [Lines 371-375] We would like to provide more direct evidence linking the neuronal response types and projection patterns in future studies by electrophysiologically identifying freezing-excited and freezing-inhibited aIC neurons and testing whether those neurons are activated by optogenetic activation of amygdala- or medial thalamus-projecting aIC neurons.

      Reviewer #2 (Recommendations For The Authors):

      Major comments:

      (1) As all experiments have been performed only in male mice, the authors need to clearly state this limit in the introduction, abstract, and title of the manuscript.

      With an increasing number of readers becoming interested in the biological sex used in preclinical studies, we also feel that it should be mentioned at the beginning of the manuscript. As suggested, we explicitly wrote that we only used male mice in the title, abstract, and introduction. In addition, we discussed possible limitations of only using male mice in the discussion section as follows:

      [Lines 381-386] Another factor to consider is that we have only used male mice in this study. Although many studies report that there is no biological sex difference in cued fear conditioning (42), the main experimental paradigm used in this study, it does not mean that the underlying brain circuit mechanism would also be similar. The bidirectional fear modulation by aIC→medial thalamus or the aIC→amygdala projections may be different in female mice, as some studies report reduced cued fear extinction in females (42).

      (2) The authors are missing important publications reporting findings on the insular cortex in fear and anxiety. For example, the authors should cite studies showing that anterior insula VIP+ interneurons inhibition reduces fear memory retrieval (Ramos-Prats et al., 2022) and that posterior insula neurons are a state-dependent regulator of fear (Klein et al., 2021). Also, regarding the anterior insula to basolateral amygdala projection (aIC-BLA), the author should include recent work showing that this population encodes both negative valence and anxiogenic spaces (Nicolas et al., 2023). 

      We appreciate the detailed suggestions and we added appropriate publications in the discussion section. The anterior insula VIP+ interneuron study (Ramos-Prats et al., 2022) is interesting, but based on the evidence provided in the paper, we felt that the role of aIC VIP+ interneurons in fear conditioning is limited. VIP+ interneurons in the aIC seem to be important in coding sensory stimuli; however, their relevance to conditioned stimuli seems to be low: overall VIP intracellular calcium activity to CS was low and did not differ between acquisition and retrieval. Also, inhibition of VIP did not influence fear acquisition. VIP inhibition during fear acquisition did reduce fear retrieval (CS only, no light stimulation), but this does not necessarily mean that VIP activity will be involved in fear memory storage or retrieval, especially because intracellular calcium activity of VIP+ neurons was low during fear conditioning and retrieval.

      Studies by Klein et al. (2021) and Nicolas et al. (2023) are integrated in the discussion section as follows.

      [Lines 297-301] Group activity of neurons in the pIC measured with fiber photometry, interestingly, exhibited fear state dependent activity changes (decreased activity with high fear behavior and increased activity with lower fear behavior) (29), suggesting that group activity of the pIC may be involved in maintaining an appropriate level of fear behavior.

      [Lines 316-319] Another distinction between the aIC and pIC may be related to anxiety, as a recent study showed that group activity of aIC neurons, but not that of the pIC, increased when mice explored anxiogenic space (open arms in an elevated plus maze, center of an open field box) (32).

      (3) The authors should specify how many neurons they excluded after controlling the L-ratio and isolation distance. It is also important to specify the percentage of putative excitatory and inhibitory interneurons recorded among the 11 mice based on their classification (the number of putative inhibitory interneurons in Figure 1D seems too low to be accurate).

      We use manual cluster cutting and only cut clusters that are visually well isolated. So we hardly have any neurons that are excluded after controlling for L-ratio and isolation distance. The criterion we used was L-ratio<0.3 and isolation distance>15, and we specified this in the methods as follows.

      [Lines 454-458] We only used well-isolated units (L-ratio<0.3, isolation distance>15) that were confirmed to be recorded in the aIC (conditioned group: n = 116 neurons, 11 mice; control group: n = 14 neurons, 3 mice) for the analysis (46). The mean of units used in our analysis are as follows: L-ratio = 0.09 ± 0.012, isolation distance = 44.97 ± 5.26 (expressed as mean ± standard deviation).
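
      The inclusion criterion quoted above can be expressed as a simple filter. The sketch below is only an illustration of applying the stated thresholds (L-ratio < 0.3, isolation distance > 15); the `Unit` record, its field names, and the example values are hypothetical and do not reproduce the authors' spike-sorting pipeline.

```python
from dataclasses import dataclass

# Thresholds as stated in the response (strict inequalities).
L_RATIO_MAX = 0.3
ISOLATION_DISTANCE_MIN = 15.0

@dataclass
class Unit:
    unit_id: str               # hypothetical identifier
    l_ratio: float             # cluster-quality metric: lower is better
    isolation_distance: float  # cluster-quality metric: higher is better

def is_well_isolated(unit: Unit) -> bool:
    """Return True if the unit passes both cluster-quality thresholds."""
    return (unit.l_ratio < L_RATIO_MAX
            and unit.isolation_distance > ISOLATION_DISTANCE_MIN)

def select_units(units):
    """Keep only the units that satisfy the inclusion criterion."""
    return [u for u in units if is_well_isolated(u)]

if __name__ == "__main__":
    # Hypothetical units for illustration only.
    candidates = [
        Unit("tt1_c1", l_ratio=0.09, isolation_distance=44.9),
        Unit("tt1_c2", l_ratio=0.45, isolation_distance=30.0),  # fails L-ratio
        Unit("tt2_c1", l_ratio=0.10, isolation_distance=12.0),  # fails distance
    ]
    print([u.unit_id for u in select_units(candidates)])
```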

      As suggested, we also specified the percentage of putative excitatory and inhibitory interneurons recorded from our study in the results and methods section. The relative percentages of putative excitatory and inhibitory interneurons were similar for both the conditioned and the control groups (conditioned putative-excitatory: 93.1%, putative-inhibitory: 6.9%; control putative-excitatory: 92.9%, putative-inhibitory: 7.1%). Although the number of putative interneurons isolated from our recordings is low, that is what we obtained. Putative inhibitory neurons, probably because of their relatively smaller size, tend to be underrepresented relative to putative excitatory cells.

      [Lines 83-87] Of the recorded neurons, we analyzed the activity of 108 putative pyramidal neurons (93% of total isolated neurons) from 11 mice, which were distinguished from putative interneurons (n = 8 cells, 7% of total isolated neurons) based on the characteristics of their recorded action potentials (Figure 1D; see methods for details).

      [Lines 464-467] The percentage of putative excitatory neurons and putative inhibitory interneurons obtained from both groups were similar (conditioned putative-excitatory: 93.1%, putative-inhibitory: 6.9%; control putative-excitatory: 92.9%, putative-inhibitory: 7.1%).
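
      The reported percentages can be checked against the unit counts given in the response. In the sketch below, the conditioned-group counts (108 putative pyramidal and 8 putative interneurons out of 116) are taken from the text; the control-group split of 13 and 1 out of 14 is not stated in the response and is an inference from the reported 92.9% and 7.1%.

```python
def percent(part: int, total: int) -> float:
    """Percentage of total, rounded to one decimal place."""
    return round(100.0 * part / total, 1)

# Conditioned group: counts stated in the response.
conditioned = {"putative_excitatory": 108, "putative_inhibitory": 8}
# Control group: 14 units total is stated; the 13/1 split is an assumption
# inferred from the reported percentages, not a number given in the text.
control = {"putative_excitatory": 13, "putative_inhibitory": 1}

def summarize(counts):
    """Map each cell class to its percentage of all units in the group."""
    total = sum(counts.values())
    return {label: percent(n, total) for label, n in counts.items()}

if __name__ == "__main__":
    print("conditioned:", summarize(conditioned))
    print("control:", summarize(control))
```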

      (4) While the use of correlation of single-unit firing frequency with freezing is interesting, classically, studies analyze the firing in comparison to the auditory cues. If the authors want to keep the correlation analysis with freezing, rather than correlations to the cues, they should rename the cells as "freezing excited" and "freezing inhibited" cells instead of positive and negative cells.

      As suggested, we used the terms “freezing-excited” and “freezing-inhibited” cells instead of positive and negative cells.

      (5) To improve clarity, Figure 2 should be reorganized to start with the representative examples before including the average of population data. Thus Panel D should be the first one. The authors should also consider including the trace of the firing rate of these representative units over time, on top of the freezing trace, as well as Pearson's r and p values for both of them. Then, the next panels should be ordered as follows: F, G, H, C, A, B, I, and finally E.

      We have rearranged Figure 2 based on the suggestions.

      (6) It is unclear why the freezing response in Figure 2 is different in current panels F, G, and H. Please clarify this point.

      It was because the freezing behaviors of slightly different populations of animals were averaged. Some animals did not have positive/negative (or both) cells and only the behavior of animals with the specified cell-type was used for calculating the mean freezing response. With the rearrangement of Figure 2, we no longer have plots with juxtaposed mean neuronal response-types and behavior.

      (7) Even though the peak of tone-induced firing rate change between negative and positive cells is 10s later for positive cells, the conclusion that this 'difference suggests differential circuits may regulate the activities of different neuron types in response to fear' is overstating the observation. This statement should be rephrased. Indeed, it could be the same circuits that are regulated by different inputs (glutamatergic, GABA, or neuromodulatory inputs).

      We agree and have deleted the statement from the manuscript.

      (8) The authors mention they did not find tone onset nor tone offset-induced responses of anterior insula neurons. It would be helpful to represent this finding in a Figure, especially, which were the criteria for a cell to be tone onset or tone offset responding.

      We added how tone-onset and tone-offset were analyzed in the methods section and added a plot of the analysis in Figure 2H.

      (9) Based on the spread of the viral expression shown in Figure 3B, it appears that the authors are activating/inhibiting insula neurons in the GI layer, whereas single-unit recordings report the electrodes were located in DI, AID, and AIV layers. The authors should provide histology maps of the viral spread for ChR2, NpHR3, and eYFP expression.

      Thank you for the excellent suggestion. Now the histological sample in Figure 3B is a sample with expression in the GI/DI/AID layers and it also has an image taken at higher resolution (x40) to show that viral vectors are expressed inside neurons. We also added histological maps with overlay of viral expression patterns of the ChR2, eYFP, and NpHR3 groups in Figure 3—figure supplement 1.

      (10) In Figure 5B, the distribution of terminals expressing ChR2 appears much denser in CM than in MD. This should be quantified across mice and if consistent with the representative image, the authors should refer to aIC-CM rather than aIC-MD terminals.

      Overall, we referred to the connection as aIC-medial thalamus, which collectively includes both the CM and the MD. The microscopes available to us cannot determine whether terminals end at the CM or MD, but the aIC projections seem to pass through the CM to reach the MD. The Allen Brain Institute’s Mouse brain connectivity map (https://connectivity.brain-map.org/projection/experiment/272737914) of a B6 mouse, the mouse strain we used in our study, with tracers injected in a similar location to our study, also supports our speculation and shows that aIC neuronal projections terminate more in the MD than in the CM. In addition, the power of light delivered for optogenetic manipulation is greatly reduced over distance, and therefore the MD-projecting terminals, which are closer to the optic fiber, will be more likely to be activated than the CM-projecting terminals. However, since we could not determine whether the aIC projections terminate at the CM or the MD, we collectively referred to the connection as the aIC-medial thalamus throughout the manuscript.

      Author response image 1.

      (11) Histological verifications for each in vivo electrophysiology, optogenetic, and tracing experiments need to include a representative image of the implantation/injection site, as well as a 40x zoom-in image focusing on the cell bodies or terminals right below the optic fiber (for optogenetic experiments). Moreover, an atlas map including all injection locations with the spread of the virus and fiber placement should be added in the Supplement Figures for each experiment (see Figure S1 Klein et al., 2021). Similarly, the authors need to add a representation of the spread of the retrograde tracers for each mouse used for this tracing experiment.

      As suggested, we added a histology sample showing electrode recording location for in-vivo electrophysiology in Figure 1 and added atlas maps for the optogenetic and tracing experiments in supplementary figures. We also provide a 40x zoom-in image of the expression pattern for the optogenetic experiments (Figure 3B).

      (12) To target anterior insula neurons, authors mention coordinates that do not reach the insula on the Paxinos atlas (AP: +1.2 mm, ML: -3.4 mm, DV: -1.8 mm). If the DV was taken from the brain surface, this has to be specified, and if the other coordinates are from Bregma, this also needs to be specified. Finally, the authors cite a review from Maren & Fanselow (1996), for the anterior insula coordinates, but it remains unclear why.

      AP and ML coordinates are measurements made in reference to the bregma. DV was calculated from the brain surface. We specified these in the Methods. We did not cite a review from Maren & Fanselow for the aIC coordinates.

      Minor comments:

      (1) A schematic of the microdrive and tetrodes, including the distance of each tetrode would also be helpful.

      We used handcrafted microdrives with four tetrodes. Since they were handcrafted, the relative orientation of the tetrodes varies and tetrode recording locations have to be verified histologically. We, however, made sure that the tetrodes were more than 200 μm apart so that distinct single-units would be obtained from different tetrodes. We added this to the methods as follows.

      [Lines 430-431] The distance between the tetrodes was greater than 200 μm to ensure that distinct single-units would be obtained from different tetrodes.

      (2) Figure 2E: representation of the baseline firing (3-min period before the tone presentation) is missing.

      Figure 2E shows the 3-min period before tone presentation.

      (3) Figure 2: Average Pearson's correlation r and p values should be stated on panels F, G, and H (positive cell r = 0.81, P < 0.05; negative cell r = -0.68, P < 0.05).

      They were all originally stated in the figures. But with reorganization of Figure 2, we now have a plot of the Pearson’s Correlation with r and p values in Figure 2F.

      (4) Figure 2I: Representation of the absolute value of the normalized firing is highly confusing. Indeed, as the 'negative cells' are inhibited to freezing, firing should be represented as normalized, and negative for the inhibited cells.

      To avoid confusion, we did not take an absolute value of the “negative cells”, which are now called the “freezing-inhibited cells”.

      (5) Figure 4E (retrograde tracing): representation of individual values is missing.

      Figure 4E now has individual values.

      References:

      London, T. D., Licholai, J. A., Szczot, I., Ali, M. A., LeBlanc, K. H., Fobbs, W. C., & Kravitz, A. V. (2018). Coordinated ramping of dorsal striatal pathways preceding food approach and consumption. Journal of Neuroscience, 38(14), 3547-3558.

      Trott, J. M., Hoffman, A. N., Zhuravka, I., & Fanselow, M. S. (2022). Conditional and unconditional components of aversively motivated freezing, flight and darting in mice. Elife, 11, e75663.

      Fanselow, M. S. (1980). Conditional and unconditional components of post-shock freezing. The Pavlovian journal of biological science: Official Journal of the Pavlovian, 15(4), 177-182.

      Fanselow, M. S. (1986). Associative vs topographical accounts of the immediate shock-freezing deficit in rats: implications for the response selection rules governing species-specific defensive reactions. Learning and Motivation, 17(1), 16-39.

      Landeira-Fernandez, J., DeCola, J. P., Kim, J. J., & Fanselow, M. S. (2006). Immediate shock deficit in fear conditioning: effects of shock manipulations. Behavioral neuroscience, 120(4), 873.

      Moita, M. A., Rosis, S., Zhou, Y., LeDoux, J. E., & Blair, H. T. (2003). Hippocampal place cells acquire location-specific responses to the conditioned stimulus during auditory fear conditioning. Neuron, 37(3), 485-497.

      Kochli, D. E., Thompson, E. C., Fricke, E. A., Postle, A. F., & Quinn, J. J. (2015). The amygdala is critical for trace, delay, and contextual fear conditioning. Learning & memory, 22(2), 92-100.

      Ramos-Prats, A., Paradiso, E., Castaldi, F., Sadeghi, M., Mir, M. Y., Hörtnagl, H., ... & Ferraguti, F. (2022). VIP-expressing interneurons in the anterior insular cortex contribute to sensory processing to regulate adaptive behavior. Cell Reports, 39(9).

      Klein, A. S., Dolensek, N., Weiand, C., & Gogolla, N. (2021). Fear balance is maintained by bodily feedback to the insular cortex in mice. Science, 374(6570), 1010-1015.

      Nicolas, C., Ju, A., Wu, Y., Eldirdiri, H., Delcasso, S., Couderc, Y., ... & Beyeler, A. (2023). Linking emotional valence and anxiety in a mouse insula-amygdala circuit. Nature Communications, 14(1), 5073.

      Maren, S., & Fanselow, M. S. (1996). The amygdala and fear conditioning : Has the nut been cracked? Neuron, 16(2), 237‑240. https://doi.org/10.1016/s0896-6273(00)80041-0

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This work by Ding et al uses agent-based simulations to explore the role of the structure of molecular motor myosin filaments in force generation in cytoskeletal structures. The focus of the study is on disordered actin bundles which can occur in the cell cytoskeleton and have also been investigated with in vitro purified protein experiments.

      Strengths:

      The key finding is that cooperative effects between multiple myosin filaments can enhance both total force and the efficiency of force generation (force per myosin). These trends were possible to obtain only because the detailed structure of the motor filaments with multiple heads is represented in the model.

      We appreciate your comments about the strength of our study. 

      Weaknesses:

      It is not clearly described what scientific/biological questions about cellular force production the work answers. There should be more discussion of how their simulation results compare with existing experiments or can be tested in future experiments.

      Please see our response to the comment (1) below.

      The model assumptions and scientific context need to be described better.

      We apologize for the insufficient descriptions about the model and the scientific context. We revised the manuscript to better explain model assumptions and scientific context as described in our responses below.

      The network contractility seems to be a mere appendix to the bundle contractility which is presented in much more detail.

      Please see our response to the comment (6) below.

      Reviewer #1 (Recommendations for the authors):

      (1) It is not clearly described what scientific/biological questions about cellular force production the work answers. There should be more discussion of how their simulation results compare with existing experiments, or can be tested in future experiments. The authors do briefly mention Reference 4 where different myosin isoforms were used, but it is not clear that these experiments support the scalings predicted in this work in Figures 3-6. Also, the experiments in Ref. 4 apparently did not involve passive crosslinkers (ACPs) which are key in this study.

      Thank you for the comment. In the 5th paragraph of the discussion section of the original manuscript, we applied our findings to understand how structural differences between ventral stress fibers and actin arcs could affect force generation. In addition, at the end of the discussion section, we mentioned that experiments with artificially-made myosin thick filaments could be used for verifying our results. 

      The experiments in Ref. 4 were the only ones with which we could directly compare our results. In a previous study, actomyosin bundles were experimentally created with ACPs (K.L. Weirich et al., Biophys J, 2021, 120: 1957-1970), but the motions of myosin thick filaments were the only quantities measured in the experiments. In general, measuring forces generated by in vitro actomyosin bundles is very challenging. This is why the predictions from our model are particularly valuable for understanding the force generation of actomyosin structures.

      (2) The architecture of the bundles seems to be prescribed by hand in these simulations. Several well-known stochastic aspects of the dynamics of actin and actin-binding proteins are not included in the model. For example, there is no remodeling of the actin structures through actin polymerization and depolymerization, or crosslink (ACP) binding and unbinding. Can the authors comment on why these effects could be neglected for the questions they want to address?

      Thank you for the comment. We previously showed that the force generation process in actomyosin networks and bundles is affected by actin dynamics (Q. Yu et al., Biophys J, 2018, 115: 2003-2013) and the unbinding of ACPs (T. Kim, Biomech Model Mechanobiol, 2015, 14(2): 345-355 and W. Jung et al., Comput Part Mech, 2015, 2(4): 317-327). 

      However, we did not include the actin dynamics and the ACP unbinding in the current study to clearly understand the effects of the structural properties of thick filaments on the force generation process. We have learned that the stochastic behaviors of cytoskeletal components lead to noisier results, which requires us to run a much larger number of simulations to obtain statistically convincing data. We added the following paragraph in the discussion section of the revised manuscript:

      “Although this study focused mainly on parameters related to motor structures, we expect that other parameters would affect the force generation process. For example, as we showed before, a decrease in ACP density would reduce forces by deteriorating connectivity between filaments. With very low ACP density, some neighboring motors may not have ACPs between them, thus adding up their forces as shown in Fig. 2. However, such low ACP density may not maintain the structure of bundles or cross-linked networks well. In addition, the force-dependent unbinding of ACPs could change the spatial distribution of ACPs during force generation. If they behave as a slip bond which unbinds more frequently with higher forces, ACPs may not stay between two motors for a long time due to high tension. Then, forces generated by two motors may have a higher chance to add up. By contrast, if they behave as a catch bond which unbinds less frequently with larger forces, more ACPs will be recruited between two motors, reducing the chance that forces add up. The length of actin filaments is unlikely to affect the force generation process significantly unless filaments are very short. Additionally, as we showed before, actin turnover would reduce forces by competing with motor activities, change connectivity between filaments over time, and prevent motors from being stalled for a long time, all of which could affect force generation.”
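
      The slip-bond versus catch-bond contrast in the quoted paragraph can be made concrete with a toy calculation. The response does not state which functional form the model uses, so the sketch below assumes a Bell-type exponential dependence of the unbinding rate on force for the slip bond and, purely for illustration, an idealized catch bond whose rate decreases exponentially with force. The zero-force rate `k0` and the characteristic force `f_char` are hypothetical values, not parameters from the study.

```python
import math

def slip_bond_rate(force: float, k0: float, f_char: float) -> float:
    """Bell-type slip bond: unbinding rate grows exponentially with force."""
    return k0 * math.exp(force / f_char)

def catch_bond_rate(force: float, k0: float, f_char: float) -> float:
    """Idealized catch bond: unbinding rate falls exponentially with force."""
    return k0 * math.exp(-force / f_char)

def mean_lifetime(rate: float) -> float:
    """Mean bound lifetime for a constant unbinding rate."""
    return 1.0 / rate

if __name__ == "__main__":
    k0 = 0.1       # hypothetical zero-force unbinding rate (1/s)
    f_char = 10.0  # hypothetical characteristic force (pN)
    for force in (0.0, 10.0, 20.0):
        slip = mean_lifetime(slip_bond_rate(force, k0, f_char))
        catch = mean_lifetime(catch_bond_rate(force, k0, f_char))
        print(f"F = {force:5.1f} pN  slip lifetime = {slip:8.2f} s  "
              f"catch lifetime = {catch:8.2f} s")
```

      Under these assumptions, a slip-bond ACP held under high tension between two motors has a short bound lifetime, while a catch-bond ACP in the same position persists longer, which is the qualitative behavior the paragraph describes.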

      (3) The present study is confined to the fixed density of motors and ACPs. However, these can be easily varied in in vitro experiments. Works such as Reference 4 show an optimum in contractility vs myosin concentration. Myosins act not only to slide actin filaments but also crosslink them.

      Can the authors vary myosin concentration to demonstrate such effects in their model?

      As the reviewer pointed out, there is a belief that myosin thick filaments can serve as crosslinkers as well. However, unless there is a fraction of dead myosins (which remain bound on filaments without walking) or myosins dwell at the barbed ends of filaments for a very long time, it appears very hard for bundles or networks to generate large forces. A former experiment showed that active myosins increase the viscosity of actin networks, not elasticity (D. Humphrey et al., Nature, 2002, 416: 413-416). Computer simulations with reasonable assumptions did not show significant force generation without cross-linkers. We have tested systems with a large number of motors and a few cross-linkers in previous studies (T. Kim, Biomech Model Mechanobiol, 2015, 14(2): 345-355 and W. Jung et al., Comput Part Mech, 2015, 2(4): 317-327). We observed that large force/stress was generated momentarily, but it relaxed very fast. It is expected that there will be similar outcomes if we try such conditions in the current study.

      (4) Why is there a (factor of 1.5-2) discrepancy in the measured (Ftot) and estimated (Fest) force values in Figure 4-6? How can the authors improve their scaling arguments to capture this? What about the estimated efficiency?

      Thank you for the comment. Indeed, there was a discrepancy between the actual and estimated forces. When the estimated force was calculated, we used the z positions of motors without consideration of the actual bundle geometry with multiple filaments. For example, if two motors are located on the opposite sides of the bundle (i.e., if they are located far from each other in x or y direction), forces generated by them may not counterbalance each other. Then, the estimated force can be smaller than the actual force because counterbalance between motors can be overcounted. The original manuscript had the following sentences to clarify this point: “F<sub>est</sub> was generally smaller than F<sub>tot</sub> because this analysis does not account for actual bundle geometry consisting of multiple F-actins; if two motors are located far from each other in x or y direction, they may not counterbalance or add up forces. Nevertheless, we found that F<sub>est</sub> captures the overall dependence of F<sub>tot</sub> on parameters well.”

      (5) Several choices of parameter values used in the simulations are not clear:

      a) Why consider F actin of 140 nm specifically? Actin can come in a range of lengths. How do their results depend upon the length scale of actin?

      It seems that there is a misunderstanding. 140 nm is the equilibrium length of one actin segment in our model. An actual F-actin consists of multiple actin segments. The length of F-actin was 9 μm in bundle simulations and 10 μm (average) in network simulations. We expect that the general tendency of our results would not change with different filament lengths. However, if the filaments become too short, the force generation process would be impaired due to a lack of connectivity between filaments.
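
      To make the distinction between segment length and filament length concrete, the short sketch below converts the two filament lengths quoted above into an approximate number of 140-nm segments. It is only an illustration of the arithmetic; the function name is hypothetical and not taken from the simulation code.

```python
# Illustrative arithmetic only; names here are hypothetical, not from the model code.
SEGMENT_LENGTH_NM = 140.0  # equilibrium length of one actin segment (stated above)


def segments_per_filament(filament_length_um: float) -> float:
    """Approximate number of 140-nm segments in a filament of the given length."""
    return filament_length_um * 1000.0 / SEGMENT_LENGTH_NM


bundle_segments = segments_per_filament(9.0)    # bundle simulations: 9 um filaments
network_segments = segments_per_filament(10.0)  # network simulations: 10 um on average

print(f"bundle: ~{bundle_segments:.0f} segments, network: ~{network_segments:.0f} segments")
```

      Each filament therefore spans several tens of segments, which is why 140 nm should not be read as the filament length.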

      b) Similarly, very specific values of myosin backbone length (42 nm), number of myosin heads (8), number of arms (24), and Actin Cross-linking Proteins (ACPs). What informs these values and how will the results change if they are different? It is not especially clear how an "Arm" differs from "heads" and what kind of coarse-graining is involved.

      In the “model overview” section of the original manuscript, we mentioned the following to clarify the definitions of motor arms and motor heads: 

      “To mimic the structure of bipolar filaments, each motor has a backbone, consisting of serially linked segments, and two arms on each endpoint of the backbone segments that represent 8 myosin heads (N<sub>h</sub> = 8).”

      We devised this coarse-graining scheme of myosin thick filaments in our previous work (T. Kim, Biomech Model Mechanobiol, 2015, 14(5): 1143-1155). Through extensive tests, we showed that force generation and motor behaviors are largely independent of the coarse-graining level. In other words, a motor with the same value of N<sub>h</sub>N<sub>a</sub> leads to similar outcomes regardless of the value of N<sub>a</sub>. However, in a bundle with multiple filaments, each motor needs a sufficient number of arms to ensure simultaneous interactions with those filaments. This is why we decided to use N<sub>h</sub> = 8 and N<sub>a</sub> = 24.

      To match the length of thick filaments and the total number of heads (N<sub>h</sub>N<sub>a</sub>) in the model with real myosin thick filaments, we have used 42 nm for each backbone length. Varying this length is equivalent to a variation in L<sub>sp</sub> that we did for Fig. 6.
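
      As a rough consistency check on these numbers, the sketch below relates the backbone segment length, the number of arms, and the total number of heads. It assumes that serially linked backbone segments share endpoints (so n segments have n + 1 endpoints) and that the reference motor length of 462 nm quoted elsewhere in this response corresponds to 11 segments of 42 nm; both are our reading for illustration, and the function name is hypothetical.

```python
# Hypothetical helper for illustration; not taken from the simulation code.
BACKBONE_SEGMENT_NM = 42.0  # length of one backbone segment (stated above)
ARMS_PER_ENDPOINT = 2       # "two arms on each endpoint of the backbone segments"
HEADS_PER_ARM = 8           # N_h


def motor_geometry(n_segments: int) -> dict:
    """Backbone length, arm count (N_a), and total heads (N_h * N_a) for a motor."""
    n_endpoints = n_segments + 1  # assumption: neighboring segments share an endpoint
    n_arms = ARMS_PER_ENDPOINT * n_endpoints
    return {
        "length_nm": n_segments * BACKBONE_SEGMENT_NM,
        "n_arms": n_arms,
        "total_heads": n_arms * HEADS_PER_ARM,
    }


reference_motor = motor_geometry(11)
print(reference_motor)
```

      Under these assumptions, 11 segments give a 462-nm backbone with 24 arms, matching N<sub>a</sub> = 24 and N<sub>h</sub> = 8.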

      We used high ACP density to ensure connections between all neighboring pairs of actin filaments. We already showed how the presence of ACPs affects the force generation process in Fig. 2 using two actin filaments. It is expected that a variation of ACP density would affect our results to some extent. Since the main focus of the current study is the structural properties of motors, we did not explore the effects of ACP density. I hope that the reviewer would understand our intention. 

      (6) The manuscript focuses on disordered bundles with only one figure on networks. However, actin fibers also ubiquitously exist as disordered networks, and it is important to explore in more detail the contractile forces in such network arrangements.

      We appreciate the comment. Because we plan to delve into the effects of motor structures on the force generation in networks as a follow-up study, we showed the minimal results in the current study to prove the generality of our findings. I hope that the reviewer would understand our intention and plan.

      It is not described very clearly how these networks were generated.

      We apologize for the lack of explanation about how the networks were generated. We added the following section to the Supplementary Text of the revised manuscript:

      “Network assembly

      Unlike F-actin in bundle simulations, F-actin in network simulations is formed by stochastic processes as in our previous studies. The formation of F-actin is initiated from a nucleation event with a constant rate constant, k<sub>n,A</sub>, with the appearance of one cylindrical segment in a random position with a random orientation perpendicular to the z direction. The polymerization of F-actin is simulated by adding cylindrical segments at the barbed end of existing filaments with a rate constant, k<sub>p,A</sub>. The ratio of k<sub>n,A</sub> to k<sub>p,A</sub> is adjusted to result in the average filament length of ~10 μm. The rest of the assembly process is identical to that described in the main text.”

      Crosslinked biopolymers like actin typically form disordered elastic networks with their coordination number below rigidity percolation threshold (z=4 in 2D), see for example review by Broedersz and Mackintosh Rev. Mod. Phys. 2013. Such networks should exist in the bending-dominated regime, where bending forces play a vital role in force propagation. Was that observed in the simulations? Why or why not?

      We appreciate the comment. We are aware of the bending-dominated regime and indeed showed the importance of the bending stiffness of actin filaments at low shear strain level in our previous work (T. Kim et al., PLOS Comput Biol, 2009, 5(7): e1000439). In case of active networks with motors, such a bending-dominated regime has not been observed without external shear strain. Instead, buckling of actin filaments was found to be essential for breaking symmetry between tensile and compressive forces developed by motor activities. We have shown that the free contraction of networks is inhibited if filament bending stiffness is increased substantially (J. Li et al., Soft Matter, 2017, 13: 3213-3220 and T. Bidone et al., PLOS Comput Biol, 2017, 13(1): e1005277). We expect that contractile forces generated by bundles or networks will be reduced significantly if we highly increase bending stiffness. However, considering the focus of the current study is on the structural properties of motors, we did not perform such simulations. 

      (7) It would be interesting to see the simulated predictions of the bundle or network contraction dynamics. This can be done by changing to free boundary conditions so that the bundle can contract.

      Thank you for the suggestion. We have previously investigated the free contraction of actomyosin networks with different motor density and ACP density (J Li et al., Soft Matter, 2017, 13: 3213). We observed that the rate of network contraction was higher with more motors and ACPs. However, we did not test the effects of the structural properties of thick filaments in the previous study. We plan to investigate the effects in future studies because the focus of the current study is the force generation process. Please note that in the discussion section of the original manuscript, we mentioned the following:

      “Although we focused on force generation, the contractile behaviors of actomyosin structures (i.e., a decrease in length) have also been of great interest. Our model can be used to study such contractile behaviors by deactivating the periodic boundary condition and removing connection between one end of bundle/network and a domain boundary as done previously [20]. To achieve higher contractile speed with the same total number of myosin heads, the existence of multiple contractile units would be better as suggested in a previous work [4]. This means that there is a trade-off between force generation and contractile speed. Previous studies also showed that the contractile speed of networks is proportional to motor density [18, 43, 51]. We may be able to use our model to systematically investigate how the contractile speed is regulated by parameters that we tested in this study, including the number, distribution, length, and structure of motors.”

      Minor suggestions for improvement:

      (1) What are the vertical markers in Figures 1E and F? They should be labelled. If they are crosslinkers, it is not clear why the color is different from Figure 1A and B.

      We believe that the reviewer meant Figs. 2E, F. Those vertical lines are indeed ACPs (crosslinkers). We changed the color of ACPs in Fig. 1A and Fig. 2B-D to purple to be consistent. In addition, we changed the colors of two filaments in Figs. 2B-D slightly to be consistent with Fig. 2E.

      (2) To help understanding, please include a figure showing how forces are measured.

      We added Fig. S1 in the revised manuscript to explain how the bundle force is calculated.

      (3) It should be possible to extend the scaling arguments to predict what is the crossover myosin density (N_M) in Figure 4a at which the efficiency changes from going as 1/N_M to saturating. 

      As the reviewer might have observed, the slope of the efficiency in Fig. 4A gradually changes, rather than showing a sharp transition. Thus, it is hard to define one crossover myosin density. 

      Similarly, what are the slopes in Figure 6a-b?

      We drew the reference lines in those two plots. Unfortunately, we do not have explanations about the origin of these slopes.

      (4) Some more explanation for the observed values should be added. Figure 4: Why does efficiency plateau at a value close to 0.8 in (A)? 

      We assume that the reviewer meant the plateau of η close to 0.08, not 0.8. Our speculation for the origin of this plateau value is related to L<sub>M</sub> (= 462 nm under the reference condition). Ideally, ~43 motors are required to cover the entire length of the bundle (= 20 μm). Under this condition, η is ~0.023. Although this is not 0.08, we believe that these two values are related to each other. For example, if we increase L<sub>M</sub>, this plateau level would increase. We added the following sentences in the result section of the revised manuscript:

      “The plateau level of η at ~0.08 is related to the minimum number of motors required for saturating an entire bundle, implying that the plateau level would be higher if each motor is longer.”
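
      The two numbers quoted in this reply (~43 motors and η of ~0.023) can be reproduced with the arithmetic below. The sketch assumes that η is the total force divided by the sum of the maximal forces of all motors, and that motors tiling the bundle end to end without overlap act in series, so the bundle force is roughly that of a single motor and η is about 1/N. The variable names are illustrative.

```python
# Illustrative arithmetic under the stated assumptions; names are hypothetical.
BUNDLE_LENGTH_NM = 20_000.0  # bundle length of 20 um
MOTOR_LENGTH_NM = 462.0      # L_M under the reference condition

# Number of motors needed to cover the bundle end to end without overlap.
motors_to_cover = BUNDLE_LENGTH_NM / MOTOR_LENGTH_NM

# Assumption: motors in series do not add forces, so efficiency is ~1/N.
eta_series = 1.0 / motors_to_cover

print(f"motors to cover bundle: ~{motors_to_cover:.1f}")
print(f"efficiency if purely in series: ~{eta_series:.3f}")
```

      In this reading, increasing L<sub>M</sub> lowers the number of motors needed to cover the bundle and therefore raises the series-limit efficiency, in line with the statement that the plateau would be higher for longer motors.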

      Figure 5: Overlapping between motors seems to increase the total force applied by them because of cooperative effects. However, it is not abundantly clear why that should peak at a value of f = 0.06.

      As shown in Fig. 5B, smaller f always results in higher F<sub>tot</sub> due to higher level of cooperative overlap. The minimum value of f we tested in this study was 0.06, so F<sub>tot</sub> was maximal at f = 0.06.

      (5) Why is the network force expected to scale approximately as sqrt(N_M)? Is it because of the 2D geometry where the number of motors along the x or y-direction scale as sqrt(N_M)?

      We initially thought that the weaker dependence of the total force on N<sub>M</sub> was related to the random orientations of motors. However, if the network is fully saturated with motors, the inclusion of more motors will increase forces in both x and y directions almost linearly, resulting in the direct proportionality of F<sub>tot</sub> to N<sub>M</sub>. Our new hypothesis for weaker dependence is consistent with the reviewer’s speculation; the network is not fully saturated even with 1000 motors, so the entire regime shown in Fig. 7B corresponds to that with N<sub>M</sub> < 100 in Fig. 4A where similar weaker dependence on N<sub>M</sub> was observed. We added the following sentence in the result section of the revised manuscript to clarify this point:

      “the average number of motors in each direction which can experience the cooperative overlap would be ~√N<sub>M</sub>. Maximal N<sub>M</sub> tested with the network was ~2,500, so the dependence of F<sub>tot</sub> on N<sub>M</sub> with the network is similar to that with N<sub>M</sub> < ~50 with the bundle (Fig. 4A).”
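
      The correspondence between ~2,500 motors in the network and ~50 motors in the bundle follows if motors are spread roughly uniformly over a square 2D network, so that about the square root of N<sub>M</sub> motors lie along each direction. The sketch below shows this arithmetic under that assumption; the function name is illustrative.

```python
import math


def motors_per_direction(n_motors: int) -> float:
    """Rough number of motors along one axis of a uniformly filled square 2D network.

    Assumes uniform spreading, so that N_M motors give about sqrt(N_M) per direction.
    """
    return math.sqrt(n_motors)


max_network_motors = 2500  # approximate maximal N_M quoted for the network
equivalent_bundle_motors = motors_per_direction(max_network_motors)

print(f"~{equivalent_bundle_motors:.0f} motors per direction")
```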

      (6) Figures 6 D and A: Figure 6D suggests that there is a more full overlap in the cases where there was a longer bare zone or larger spacing between motor arms. However, the quantification of the total force in A shows that the force is highest for the case where LM was increased by increasing the number of arms. Why do the authors think that is? I would expect from the explanation in Fig 6D that the Lsp and Lbz would be higher than Na in Fig 6A.

      Fig. 6D shows a difference in the level of the cooperative overlap () between two motors. As the reviewer pointed out, the case with more arms shows the lowest , resulting in the lowest as we showed in Fig. S2B. However, as shown in Eq. 7, the total force is a function of both N<sub>a</sub> and . Thus, due to higher N<sub>a</sub> and lower , the force in the case with different N<sub>a</sub> can be similar to that in the case with different L<sub>bz</sub>. In the original manuscript, we had the following sentence to explain how the force can be similar between the two cases: 

      “Thus, was higher (Fig. S2B, blue), resulting in higher F<sub>tot</sub> and η despite smaller N<sub>a</sub>.”

      Reviewer #2 (Public review):

      Summary:

      In this study, the authors use a mechanical model to investigate how the geometry and deformations of myosin II filaments influence their force generation. They introduce a force generation efficiency that is defined as the ratio of the total generated force and the maximal force that the motors can generate. By changing the architecture of the myosin II filaments, they study the force generation efficiency in different systems: two filaments, a disorganized bundle, and a 2D network. In the simple two-filament systems, they found that in the presence of actin crosslinking proteins motors cannot add up their force because of steric hindrances. In the disorganized bundle, the authors identified a critical overlap of motors for cooperative force generation. This overlap is also influenced by the arrangement of the motor on the filaments and influenced by the length of the bare zone between the motor heads.

      Strengths:

      The strength of the study is the identification of organizational principles in myosin II filaments that influence force generation. It provides a complementary mechanistic perspective on the operation of these motor filaments. The force generation efficiency and the cooperative overlap number are quantitative ways to characterize the force generation of molecular motors in clusters and between filaments. These quantities and their conceptual implications are most likely also applicable in other systems.

      Thank you for the comments about the strength of our study. 

      Weaknesses:

      The detailed model that the authors present relies on over 20 numerical parameters that are listed in the supplement. Because of this vast amount of parameters, it is not clear how general the findings are. On the other hand, it was not obvious how specific the model is to myosin II, meaning how well it can describe experimental findings or make measurable predictions. The model seems to be quantitative, but the interpretation and connection to real experiments are rather qualitative in my point of view.

      As the reviewer mentioned, all agent-based computational models for simulating the actin cytoskeleton inevitably involve such a large number of parameters. Some of the parameter values are not well known, so we have tuned our parameter values carefully by comparing our results with experimental observations in our previous studies since 2009. We were aware of the importance of a rigorous representation of the unbinding and walking rates of myosin motors, so we implemented in our model the parallel cluster model, which can predict those rates with consideration of the mechanochemical rates of myosin II. Thus, we are confident that our motors represent myosin II.

      In our manuscript, our results were compared with prior observations in Ref. 4 (Thoresen et al., Biophys J, 2013) several times. In particular, larger force generation with more myosin heads per thick filament was consistent between the experiment and our simulations. 

      Our study can make various predictions. First, our study explains why non-muscle myosin II in stress fibers shows focal distributions rather than uniform distributions; if the motors stay close together, they can generate much larger forces in the stress fibers via the cooperative overlap. Our study also predicts a difference between bipolar structures (found in skeletal muscle myosins and non-muscle myosins) and side polar structures (found in smooth muscle myosins) in terms of the likelihood of the cooperative overlap. As shown below, myosin filaments with the bipolar structure can add up their forces better than those with the side polar structure when their overlap level is the same.

      Author response image 1.

       

      It was often difficult for me to follow what parameters were changed and what parameters were set to what numerical values when inspecting the curve shown in the figures. The manuscript could be more specific by explicitly giving numbers. For example, in the caption for Figure 6, instead of saying "is varied by changing the number of motor arms, the bare zone length, the spacing between motor arms", the authors could be more specific and give the ranges: "is varied by changing the number of motor arms from ... to .., the bare zone length from .. to..., and the spacing between motor arms from .. to ..".

      This unspecificity is also reflected in the text: "We ran simulations with a variation in either L<sub>sp</sub> or L<sub>bz</sub>" What is the range of this variation? "When L<sub>M</sub> was similar" similar to what? "despite different N<sub>M</sub>." What are the different values for N<sub>M</sub>? These are only a few examples that show that the text could be way more specific and quantitative instead of qualitative descriptions.

      We appreciate the comment. In the revised manuscript, we specified the range of the variation in each parameter.

      In the text, after equation (2) the authors discuss assumptions about the binding of the motor to the actin filament. I think these model-related assumptions and explanations should be discussed not in the results section but rather in the "model overview" section.

      Thank you for pointing this out. In the original manuscript, we described all the details of the model in the Supplementary Material. We feel that the assumptions about interactions between motors and actin filaments are too detailed to be included in the model overview section.

      The lines with different colors in Figure 2A are not explained. What systems and parameters do they represent?

      The different colors used in Fig. 2A were used for distinguishing 20 cases. We added the explanation about the colors in the figure caption in the revised manuscript.

      Reviewer #2 (Recommendations for the authors):

      To guarantee the reproducibility of the results, I recommend that the authors publish their simulation code on GitHub.

      We appreciate the reviewer’s suggestion. Following the suggestion, we prepared and posted the code on GitHub as mentioned in the Data Availability section of the revised manuscript: “The source code of our model is available on GitHub: https://github.com/ktyman2/ThickFilament”

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Su et al propose the existence of two mechanisms repressing SBF activity during entry into meiosis in budding yeast. First, a decrease in Swi4 protein levels by a LUTI-dependent mechanism where Ime1 would act closing a negative feedback loop. Second, the sustained presence of Whi5 would contribute to maintaining SBF inhibited under sporulation conditions. The article is clearly written and the experimental approaches used are adequate to the aims of this work. The results obtained are in line with the conclusions reached by the authors but, in my view, they could also be explained by the existing literature and, hence, would not represent a major advance in the field of meiosis regulation.

      We respectfully disagree with the reviewer’s comment that this work can be explained by the existing literature. First, while SWI4LUTI has been previously identified in meiotic cells along with ~380 other LUTIs, the biological purpose of these alternative mRNA isoforms and their effect on cellular physiology still remain largely unknown. Our manuscript addresses this gap in understanding for SWI4LUTI. Loss of SWI4LUTI contributes to dysregulation of meiotic entry and does so by failing to properly repress the known inhibitors of meiotic entry, the CLNs. Furthermore, even though Cln1 and Cln2 have been previously shown to antagonize meiosis, the mechanisms that restrict their activity were unclear prior to our study.

      We recognize work done by others demonstrating Whi5-dependent repression of SBF during mitotic G1/S transition (De Bruin et al., 2004; Costanzo et al., 2004). We further examined Whi5’s involvement during meiotic entry and found that it acts in conjunction with the LUTI-based mechanism to restrict SBF activity. Combined loss of both mechanisms results in the increased expression of G1 cyclins, decreased expression of early meiotic genes, and a delay in meiotic entry (Figure 6). Neither mechanism was previously known to regulate meiotic entry. Our study not only adds to our broader understanding of gene regulation during meiosis but also raises additional questions regarding how LUTIs regulate gene expression and function.

      Regarding the first mechanism, Fig 1 shows that Swi4 decreases very little after 1-2h in sporulation medium, whereas G1-cyclin expression is strongly repressed very rapidly under these conditions (panel D and work by others). This fact dampens the functional relevance of Swi4 downregulation as a causal agent of G1 cyclin repression.

      Reviewer 1 expresses concern for the observation that by 2 h in sporulation media there is a 32% decrease in Swi4-3V5 protein abundance compared to 0 h in SPO. This is consistent with the range of protein level decrease typically accomplished by LUTI-based gene regulation (Chen et al., 2017; Chia et al., 2017; Tresenrider et al., 2021), and while it is a modest reduction, it is consistent across replicates. Furthermore, we don’t make the argument that reduction in Swi4 levels alone is the sole regulator of G1 cyclin levels. In fact, we report that in addition to Swi4 downregulation, Whi5 also functions to restrict SBF activity during meiotic entry, thereby ensuring G1 cyclin repression.

      In addition, the LUTI-deficient SWI4 mutant does not cause any noticeable relief in CLN2 repression, arguing against the relevance of this mechanism in the repression of G1-cyclin transcription during entry into meiosis. The authors propose a second mechanism where Whi5 would maintain SBF inactive under sporulation conditions. The role of Whi5 as a negative regulator of the SBF regulon is well known. On the other hand, the double WHI5-AA SWI4-dLUTI mutant does not upregulate CLN2, the G1 cyclin with the strongest negative effect on sporulation, raising serious doubts on the functional relevance of this backup mechanism during entry into meiosis.

      Due to replicate variance, CLN2 did not make the cut by our mRNA-seq data analysis as a significant hit. To address reviewer 1’s final point we opted for the “gold standard” of reverse transcription coupled with qPCR to measure CLN2 transcript levels in the double mutant ∆LUTI; WHI5-AA and the wild-type control. This revealed that CLN2 levels were significantly increased in the double mutant compared to wild type at 2 h in SPO (Author Response Image 1, *, p = 0.0288, two-tailed t-test).

      Author response image 1.

      Wild type (UB22199) and ∆LUTI;WHI5-AA (UB25428) cells were collected to perform RT-qPCR for CLN2 transcript abundance. Transcript abundance was quantified using primer sets specific for each respective gene from three technical replicates for each biological replicate. Quantification was performed in reference to PFY1 and then normalized to wild-type control. FC = fold change. Experiments were performed twice using biological replicates, mean value plotted with range. Differences in wild type versus ∆LUTI;WHI5-AA transcript levels were compared with a two-tailed t-test (*, p = 0.0288).
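
      For readers unfamiliar with this type of normalization, the sketch below shows one common way to compute a fold change relative to a reference gene and a control sample (the 2^-ΔΔCt method, which assumes about 100% amplification efficiency). The legend does not state which calculation was used, so this is only an assumed illustration; the function name is hypothetical and the Ct values are made-up placeholders, not data from this study.

```python
def fold_change(ct_target_sample: float, ct_ref_sample: float,
                ct_target_control: float, ct_ref_control: float) -> float:
    """Fold change by the 2^-ddCt method (assumes ~100% amplification efficiency)."""
    # Step 1: normalize the target gene (e.g., CLN2) to the reference gene (e.g., PFY1).
    delta_ct_sample = ct_target_sample - ct_ref_sample
    delta_ct_control = ct_target_control - ct_ref_control
    # Step 2: normalize the sample to the control (e.g., mutant to wild type).
    delta_delta_ct = delta_ct_sample - delta_ct_control
    return 2.0 ** (-delta_delta_ct)


# Placeholder Ct values for illustration only (NOT measurements from the study).
example_fc = fold_change(ct_target_sample=24.0, ct_ref_sample=20.0,
                         ct_target_control=25.0, ct_ref_control=20.0)
print(f"fold change relative to control: {example_fc}")
```

      With these placeholder values the target crosses threshold one cycle earlier in the sample than in the control, after reference-gene normalization, which corresponds to a two-fold increase.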

      Reviewer #2 (Public Review):

      Summary:

      The manuscript highlights a mechanistic insight into meiotic initiation in budding yeast. In this study, the authors addressed a genetic link between the mitotic cell cycle regulator SBF (the Swi4-Swi6 complex) and the meiosis-inducing regulator Ime1 in the context of meiotic initiation. The authors' comprehensive analyses with cytology, imaging, and RNA-seq using mutant strains lead the authors to conclude that Swi4 levels regulate the Ime1-Ume6 interaction to activate expression of early meiosis genes for meiotic initiation. The major findings in this paper are that (1) the higher level of Swi4, a subunit of the SBF transcription factor for mitotic cell cycle regulation, is the limiting factor for the mitosis-to-meiosis transition; (2) G1 cyclins (Cln1, Cln2), which are expressed under SBF, inhibit the Ime1-Ume6 interaction under overexpression of SWI4, which consequently leads to downregulation of early meiosis genes; (3) expression of SWI4 is regulated by LUTI-based transcription in the SWI4 locus that impedes expression of canonical SWI4 transcripts; (4) expression of SWI4 LUTI is likely negatively regulated by Ime1; (5) action of Swi4 is negatively regulated by Whi5 (homologous to Rb)-mediated inhibition of SBF, which is required for meiotic initiation. Thus, the authors proposed that meiotic initiation is regulated under the balance of the mitotic cell cycle regulator SBF and the meiosis-specific transcription factor Ime1.

      Strengths:

      The most significant implication of their paper is that meiotic initiation is regulated under the balance of a mitotic cell cycle regulator and a meiosis-specific transcription factor. This finding will provide mechanistic insight into the initiation of meiosis not only in budding yeast but also in mammals. The manuscript is overall well written, logically presented and raises several insights into meiotic initiation in budding yeast. Therefore, the manuscript should be open for the field. I would like to raise the following concerns, though they are not mandatory to address. However, it would strengthen their claims if the authors could technically address them and revise the manuscript by adding a more comprehensive discussion.

      Weaknesses:

      The authors showed increased expression of the SBF targets and a reciprocal decrease in expression of meiotic genes upon SWI4 overexpression at 2 h in SPO (Figure 2F). However, IME1 was not found as a DEG in Supplemental Table 1. Meanwhile, the IME1 transcript level was decreased at 2 h in SPO in pATG8-CLN2 cells in Fig S4C.

      This reviewer is still unsure whether expression of IME1 transcripts per se is directly or indirectly suppressed under the SBF-activated gene expression program at 2 h in SPO in pATG8-SWI4 and pATG8-CLN2 cells. This reviewer wonders how the Fig S4C data reconcile with the model summarized in Fig 6F.

      One interpretation could be that persistent overexpression of G1 cyclin caused active mitotic cell cycle, and consequently delayed exit from mitotic cell cycle, which may have given rise to an apparent reduction of cell population that was expressing IME1. For readers to better understand, it would be better to explain comprehensively this issue in the main text.

      We believe there was an oversight here. In supplemental table 1, IME1 expression is reported as significantly decreased. The volcano plot shown below also highlights this change (Author response image 2).

      Author response image 2.

      Volcano plot of DESeq2 analysis for ∆LUTI;WHI5-AA versus wild type. Dashed line indicates padj (p value) = 0.05. Analysis was performed using mRNA-seq from two biological replicates. Wild type (UB22199) and ∆LUTI;WHI5-AA (UB25428) cells were collected at 2 h in SPO. SBF targets (pink) are from Iyer et al., 2001, and early meiotic genes (blue) are as defined by Brar et al., 2012. Darker pink or darker blue labeled dots are well-studied targets in either gene set.

      The % of cells with nuclear Ime1 was much reduced in pATG8-CLN2 cells (Fig 2B) than in pATG8-SWI4 cells (Fig 4C). Is the Ime1 protein level comparable or different between pATG8-CLN2 strain and pATG8-SWI4 strain? Since it is difficult to compare the quantifications of Ime1 levels in Fig S1D and Fig S4B, it would be better to comparably show the Ime1 protein levels in pATG8-CLN2 and pATG8-SWI4 strains.

      Further, it is uncertain how pATG8-CLN2 cells mimics the phenotype of pATG8-SWI4 cells in terms of meiotic entry. It would be nice if the authors could show RNA-seq of pATG8-CLN2/WT and/or quantification of the % of cells that enter meiosis in pATG8-CLN2.

      Analyzing bulk Ime1 protein levels across a population of cells (Author response image 3) reveals that overexpression of CLN2 causes a more severe decrease in Ime1 levels than overexpression of SWI4. This is consistent with our observation that pATG8-CLN2 has a more severe impact on meiotic entry than pATG8-SWI4. The higher CLN2 levels (Author response image 4) likely account for the observed difference in severity of phenotype between the two mutants.

      Author response image 3.

      Samples from wild type (UB22199), pATG8-SWI4 (UB2226), and pATG8-CLN2 (UB25959) strains were collected between 0-4 hours (h) in sporulation medium (SPO) and immunoblots were performed using α-GFP. Hxk2 was used as a loading control.

      Author response image 4.

      Wild type (UB22199), pATG8-SWI4 (UB2226), and pATG8-CLN2 (UB25959) cells were collected to perform RT-qPCR for CLN2 transcript abundance. Quantification was performed in reference to PFY1 and then normalized to wild-type control. FC = fold change.

      The authors stated that reduced Ime1-Ume6 interaction is a primary cause of meiotic entry defect by CLN2 overexpression (Line 320-322, Fig 4J-L). This data is convincing. However, the authors also showed that GFP-Ime1 protein level was decreased compared to WT in pATG8-CLN2 cells by WB (Fig S4A).

      Compared to wild type, pATG8-CLN2 cells have lower levels of Ime1. Consequently, reviewer 2 suggests that this reduction may be responsible for the observed meiotic defect. However, we tested this possibility and found it not to be the primary cause of the meiotic defect in pATG8-CLN2 cells. As shown in Figure S4A, when IME1 was overexpressed from the pCUP1 promoter, Ime1 protein levels were similar between wild-type and pATG8-CLN2 cells. Despite this similarity, we still observed a decrease in nuclear Ime1 (Figure 4F) and no rescue in sporulation (Figure 4A). Therefore, the reduction in Ime1 protein levels alone cannot explain the meiotic defect caused by CLN2 overexpression.

      Further, GFP-Ime1 signals were overall undetectable through nuclei and cytosol in pATG8-CLN2 cells (Fig 4B), and accordingly cells with nuclear Ime1 were reduced (Fig 4C). Although the authors raised a possibility that the meiotic entry defect in the pATG8-CLN2 mutant arises from downregulation of IME1 expression (Line 282-283), causal relationship between meiotic entry defect and CLN2 overexpression is still not clear.

      As reviewer 2 comments, we initially considered the possibility that the meiotic entry defect induced by CLN2 overexpression could be attributed to decreased IME1 expression. However, in the following paragraph in the manuscript, we demonstrate that equalizing IME1 transcript levels using the pCUP1-IME1 allele does not rescue the meiotic defect caused by CLN2 overexpression. Consequently, we conclude that the decrease in IME1 transcript levels alone cannot explain the meiotic defect caused by increased CLN2 levels.

      Is the Ime1 protein level reduced in the pATG8-CLN2;UME6-⍺GFP strain compared to WT? It would be better to comparably show the Ime1 protein levels in the pATG8-CLN2 strain and the pATG8-CLN2;UME6-⍺GFP strain by WB. Also, it would be nice if the authors could show quantification of the % of cells that enter meiosis in the pATG8-CLN2;UME6-⍺GFP strain to see how and whether artificial tethering of Ime1 to Ume6 rescued normal meiosis program rather than simply showing % sporulation in Fig4A.

      We do not agree with the suggestion to compare the pATG8-CLN2;UME6-⍺GFP with wild type as the kinetics of meiosis is rather different. The more appropriate comparison is UME6-⍺GFP and pATG8-CLN2;UME6-⍺GFP which shows GFP-Ime1 bulk protein levels are slightly lower (Author response image 5). However, when we use a more sensitive measurement of meiotic entry through the nuclear accumulation of Ime1 in single cells, as illustrated in Figure 4L, it becomes evident that the Ume6-Ime1 tether is capable of restoring nuclear Ime1 levels, even in the presence of CLN2 overexpression. Given that these cells exhibited wild type levels of nuclear Ime1 and underwent sporulation after 24 hours, we make the fair assumption that they have successfully initiated the meiotic program.

      Author response image 5.

      Wild type (UB22199), pATG8-SWI4 (UB35106), UME6-⍺GFP (UB35300), and UME6-⍺GFP; pATG8-CLN2 (UB35177) cells were collected between 0-3 hours (h) in sporulation medium (SPO), and immunoblots were performed using α-GFP. Hxk2 was used as a loading control.

      The authors showed Ume6 binding at the SWI4LUTI promoter (Figure 5K). However, since Ume6 forms a repressive complex with Rpd3 and Sin3a and binds to target genes independently of Ime1, Ume6 binding at the SWI4LUTI promoter does not necessarily represent Ime1-Ume6 binding there. Instead, it would be better to show Ime1 ChIP-seq at the SWI4LUTI promoter.

      We agree with reviewer 2 that Ime1 ChIP would be the ideal measurement. Unfortunately, this has proved to be technically challenging. To address this limitation, we utilized a published Ume6 ChIP-seq dataset along with a published UME6-T99N RNA-seq dataset. Cells carrying the UME6-T99N allele are unable to induce the expression of early meiotic transcripts due to lack of Ime1 binding to Ume6 (Bowdish et al., 1995). Accordingly, RNA-seq analysis should reveal whether or not the LUTIs identified by Ume6 ChIP are indeed regulated by Ime1-Ume6 during meiosis. For SWI4LUTI, this is exactly what we observe. Not only is there Ume6 binding at the SWI4LUTI promoter (Figure 5K), but there is also a significant decrease in SWI4LUTI expression in UME6-T99N cells under meiotic conditions (Figure S5). Based on these data, we conclude that the Ime1-Ume6 complex is responsible for regulating SWI4LUTI expression during meiosis.

      The authors showed ∆LUTI mutant and WHI5-AA mutant did not significantly change the expression of SBF targets nor early meiotic genes relative to wildtype (Figure 6A, C). Accordingly, they concluded that LUTI- or Whi5-based repression of SBF alone was not sufficient to cause a delay in meiotic entry (Line451-452), and perturbation of both pathways led to a significant delay in meiotic entry (Figure 6E). This reviewer wonders whether Ime1 expression level and nuclear localization of Ime1 was normal in ∆LUTI mutant and WHI5-AA mutant.

      Based on our observations in Figure 4, Ime1 protein and expression levels were not reliable indicators of meiotic entry. Consequently, we opted for a more downstream and functionally relevant measure of meiotic entry, which involved time-lapse fluorescence imaging of Rec8, an Ime1 target.

      Reviewer #1 (Recommendations For The Authors):

      The authors would like to mention previous work showing that G1-cyclin overexpression decreases the expression and nuclear accumulation of Ime1 (Colomina et al 1999 EMBO J 18:320). In this work, the interaction between Ime1 and Ume6 had been found to be resistant to G1-cyclin expression, arguing against a direct effect on the recruitment of Ime1 at meiotic promoters. Alternatively, differences in the experimental approaches used could be discussed to explain this apparent discrepancy.

      To clarify, in the paper that reviewer 1 is referring to (Colomina et al., 1999), the authors determine that the interaction between Ime1 and Ume6 is regulated by the presence of a non-fermentable carbon source. Additional work by others reveals that Ime1 undergoes phosphorylation by the protein kinases Rim11 and Rim15, promoting its nuclear localization and enabling interaction with Ume6 (Vidan and Mitchell, 1997; Pnueli et al., 2004; Malathi et al., 1999, 1997). Furthermore, both Rim11 and Rim15 kinase activities are inhibited by the presence of glucose via the PKA pathway (Pedruzzi et al., 2003; Rubin-Bejerano et al., 2004; Vidan and Mitchell, 1997). Accordingly, the elimination of cyclins in the presence of a fermentable carbon source (glucose) in Colomina et al. (1999) is unlikely to result in an interaction between Ime1 and Ume6, as Rim11 and Rim15 remain repressed. Removal of cyclins in acetate does not further increase Ime1-Ume6 interaction, leading the authors to conclude that G1 cyclins do not block Ime1 function through its interaction with Ume6. This work, however, uses loss of function (removal of G1 cyclins) to study the G1 cyclins’ effect on Ime1-Ume6 interaction while using timepoints that are well beyond meiotic entry. Additionally, Ime1-Ume6 interaction is tested using yeast two-hybrid analysis with just the proposed interaction domain of Ime1 (amino acids 270-360). Therefore, the interpretation that G1 cyclins are dispensable for regulating the interaction between Ime1 and Ume6 is unclear from this work alone.

      There are many differences that can explain the discrepancy between our work and (Colomina et al., 1999). Our work uses increased expression of cyclins during meiotic entry. Additionally, in our study, we collected timepoints to measure meiotic entry (2 h in SPO) and sporulation (gamete formation) efficiency (24 h in SPO). Finally, we are using the endogenous, full length Ime1. These differences could very well explain the discrepancy with previous work. Lastly, in our discussion we acknowledge the lack of CDK consensus phosphorylation sites on Ime1. Therefore, it is most likely that G1 cyclins are not directly phosphorylating Ime1 and that other factors like Rim11 and Rim15 could be direct targets of the G1 cyclins, considering their involvement in the phosphorylation of Ime1-Ume6, as well as their role in regulating Ime1 localization and its interaction with Ume6. We have included these points in the revised manuscript (lines 547-551).

      Reviewer #2 (Recommendations For The Authors):

      This reviewer thinks that the findings in this paper are of general interest to the meiosis field and help in understanding the mechanism of meiotic initiation in mammals. The current manuscript seems to be written for a limited audience of budding yeast scientists, and its interest should not be limited to them. Thus, it would be better to discuss more about what is known about the mechanism of initiation of meiosis not only in budding yeast but also in other species, to share the findings with a broader range of scientists using other organisms.

      We appreciate reviewer 2’s comment and have added more discussion about the parallels between yeast and mammalian systems in meiotic initiation (lines 613-624).

      Reviewer #3 (Recommendations For The Authors):

      The effect of overexpression of Swi4 is tested for MI and MII (Fig1F): this is a very indirect readout of meiotic entry. The authors could present Rec8 localization (Fig2I) at this stage. However, this is still a superficial description of the meiotic phenotype: is the phenotype only a delay or is the meiotic prophase altered? It is specifically important to analyse this in more detail to answer whether the overexpression of Swi4 leads to an identical phenotype to the one of CLN2. Also the comparison between overexpression of Swi4 and Cln2 is difficult to evaluate: what is the level of CLN2 when Swi4 is overexpressed compared to CLN2 overexpression? The percentage of nuclear Ime1 is 50% vs 5% when Swi4 or Cln2 are overexpressed. What is the interpretation? What are the levels of Ime1? (Y axis of quantifications not comparable, see also comment for Fig5F,H)

      CLN2 is expressed at a much higher level in pATG8-CLN2 cells relative to pATG8-SWI4 (Author Response Image 4). Therefore, we don’t expect identical phenotypes, but rather a more severe deficiency in meiotic entry upon CLN2 overexpression. The key experiment that establishes causality between SWI4 and CLNs is reported in Figure 3, where deletion of either CLN1 or CLN2 rescues the meiotic entry delay exerted by SWI4 overexpression.

      Fig3EF: What is the phenotype of Cln1 and Cln2 without overexpression of Swi4?

      Meiotic entry is not faster in cln1∆ or cln2∆ cells compared to wild-type. We included these data in Supplemental Figure 3 and made the relevant changes in the manuscript (lines 257-261).

      Fig4F: Need a control with CLN2 overexpression only.

      A control with only CLN2 overexpression (pATG8-CLN2) is not appropriate since these meiotic time course experiments are synchronized using the pCUP1-IME1 allele. It would be a misleading comparison since the two meioses would have different kinetics. Figure 4F reports that despite similar IME1 transcript levels and Ime1 protein levels, CLN2 overexpressing cells still have reduced nuclear Ime1. Since side-by-side comparison of pATG8-CLN2 and pCUP1-IME1 is not possible, we chose to measure sporulation efficiency at 24 h in Figure 4A. These data together suggest that elevated IME1 transcript and protein levels cannot rescue the defects associated with increased CLN2 expression.

      Fig5E: in wild type, by Northern blot, Swi4canon level is increasing during meiosis, not decreasing?, whereas protein level is decreasing, what is the interpretation?

      Northern blot data are less quantitative than smFISH, which shows that SWI4canon transcript levels are significantly lower in meiosis compared to vegetative cells (Figure 5D). We also note that the Northern blot data were acquired from unsynchronized meiotic cells and could have additional limitations based on the population-based nature of the assay. Finally, additional analysis of a transcript leader sequencing (TL-seq) dataset from synchronized cells (Tresenrider et al., 2021) further confirms the decrease in SWI4canon transcript levels upon meiotic entry (Author response image 6).

      Author response image 6.

      TL-seq data from (Tresenrider et al. 2021) visualized on IGV at the SWI4 locus. Two timepoints are plotted including premeiotic before IME1 induction (pink) and meiotic prophase or after IME1 induction (blue).

      Fig5F, H. This quantification needs duplicates for validation.

      Replicates for every blot in this paper have been submitted to eLife. They can be found in the Dropbox folder shared with the editors (named Raw-blots-for-eLIFE).

      Fig5F, H. Why are the wild type values so different?

      The immunoblots in Figure 5F and Figure 5H were run as separate blots and therefore should not be compared. Additionally, these values are not absolute measurements of wild-type Swi4-3V5 levels, and therefore we should not expect them to be the same. Any comparisons of relative amounts of Swi4-3V5 are always made on the same blot and normalized to a loading control, hexokinase.
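      As a rough illustration of why values from separate blots are not comparable, the sketch below normalizes a target band to a loading control lane by lane and then expresses each lane relative to a reference lane on the same blot. All intensities are made-up numbers in arbitrary units, and the function names are hypothetical; this is not the quantification code used for the figures.

```python
def normalize_to_loading_control(target, loading):
    """Divide each target-band intensity by the loading-control
    intensity measured in the same lane of the same blot."""
    if len(target) != len(loading):
        raise ValueError("one loading-control value is needed per lane")
    return [t / l for t, l in zip(target, loading)]


def relative_to_reference(normalized, reference_lane=0):
    """Express every lane relative to a reference lane on the same blot."""
    ref = normalized[reference_lane]
    return [value / ref for value in normalized]


# Hypothetical band intensities (arbitrary units) from two separate blots.
# Raw values differ between blots (exposure, transfer, antibody batch).
blot_a = {"swi4": [1200.0, 900.0, 600.0], "hxk2": [1000.0, 1000.0, 1000.0]}
blot_b = {"swi4": [300.0, 240.0, 150.0], "hxk2": [500.0, 480.0, 500.0]}

rel_a = relative_to_reference(
    normalize_to_loading_control(blot_a["swi4"], blot_a["hxk2"]))
rel_b = relative_to_reference(
    normalize_to_loading_control(blot_b["swi4"], blot_b["hxk2"]))

print("blot A, relative to lane 0:", [round(v, 2) for v in rel_a])
print("blot B, relative to lane 0:", [round(v, 2) for v in rel_b])
```

      In this toy case the raw and loading-normalized values differ between the two blots, while the within-blot ratios can still be interpreted, which is the reason comparisons are restricted to lanes on the same blot.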

      FigS5: What is the effect of the Ume6-T99N on Swi4 protein level and on meiotic entry? Is the backup mechanism proposed active?

      We haven’t measured Swi4 protein levels in the UME6-T99N background but given that this mutation is known to disrupt the interaction between Ime1 and Ume6, we expect a similar trend to that reported in Figure 5I (pCUP1-IME1 uninduced).

      What is the evidence that Swi4/6 is a E2F homolog? What is the homology at the protein level?

      While there is no sequence homology between SBF and E2F, there is remarkable similarity between metazoans and yeast in terms of the regulation of the G1/S transition (reviewed in Bertoli et al., 2013). E2F and SBF are both repressed before the G1/S transition by the inhibitors Rb and Whi5, respectively (Costanzo et al., 2004; De Bruin et al., 2004; Hasan et al., 2014). During the G1/S transition, a cyclin-dependent kinase phosphorylates and inactivates these inhibitors. We have carefully edited our language in the manuscript to “functional homology” instead of just “homology”.

      FigS3 is missing

      Each supplemental figure was matched to its corresponding main figure. In the original submission, we didn’t have Figure S3. However, the revised manuscript now contains FigS3.

      Bertoli, C., J.M. Skotheim, and R.A.M. De Bruin. 2013. Control of cell cycle transcription during G1 and S phases. Nat. Rev. Mol. Cell Biol. 14:518–528. doi:10.1038/nrm3629.

      Bowdish, K.S., H.E. Yuan, and A.P. Mitchell. 1995. Positive control of yeast meiotic genes by the negative regulator UME6. Mol. Cell. Biol. 15:2955–2961. doi:10.1128/mcb.15.6.2955.

      Brar, G.A., M. Yassour, N. Friedman, A. Regev, N.T. Ingolia, and J.S. Weissman. 2012. High-Resolution View of the Yeast Meiotic Program Revealed by Ribosome Profiling. Science. 335:552–558. doi:10.1126/science.1215110.

      De Bruin, R.A.M., W.H. McDonald, T.I. Kalashnikova, J. Yates, and C. Wittenberg. 2004. Cln3 activates G1-specific transcription via phosphorylation of the SBF bound repressor Whi5. Cell. 117:887–898. doi:10.1016/j.cell.2004.05.025.

      Chen, J., A. Tresenrider, M. Chia, D.T. McSwiggen, G. Spedale, V. Jorgensen, H. Liao, F.J. Van Werven, and E. Ünal. 2017. Kinetochore inactivation by expression of a repressive mRNA. eLife. 6:1–31. doi:10.7554/eLife.27417.

      Chia, M., A. Tresenrider, J. Chen, G. Spedale, V. Jorgensen, E. Ünal, and F.J. van Werven. 2017. Transcription of a 5’ extended mRNA isoform directs dynamic chromatin changes and interference of a downstream promoter. eLife. 6:1–23. doi:10.7554/eLife.27420.

      Colomina, N., E. Garí, C. Gallego, E. Herrero, and M. Aldea. 1999. G1 cyclins block the Ime1 pathway to make mitosis and meiosis incompatible in budding yeast. EMBO J. 18:320–329. doi:10.1093/emboj/18.2.320.

      Costanzo, M., J.L. Nishikawa, X. Tang, J.S. Millman, O. Schub, K. Breitkreuz, D. Dewar, I. Rupes, B. Andrews, and M. Tyers. 2004. CDK activity antagonizes Whi5, an inhibitor of G1/S transcription in yeast. Cell. 117:899–913. doi:10.1016/j.cell.2004.05.024.

      Hasan, M., S. Brocca, E. Sacco, M. Spinelli, P. Elena, L. Matteo, A. Lilia, and M. Vanoni. 2014. A comparative study of Whi5 and retinoblastoma proteins: from sequence and structure analysis to intracellular networks. Front. Physiol. 4:1–24. doi:10.3389/fphys.2013.00315.

      Iyer, V.R., C.E. Horak, P.O. Brown, D. Botstein, V.R. Iyer, M. Snyder, and C.S. Scafe. 2001. Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature. 409:533–538. doi:10.1038/35054095.

      Malathi, K., Y. Xiao, and A.P. Mitchell. 1997. Interaction of yeast repressor-activator protein Ume6p with glycogen synthase kinase 3 homolog Rim11p. Mol. Cell. Biol. 17:7230–7236. doi:10.1128/mcb.17.12.7230.

      Malathi, K., Y. Xiao, and A.P. Mitchell. 1999. Catalytic roles of yeast GSK3β/shaggy homolog Rim11p in meiotic activation. Genetics. 153:1145–1152. doi:10.1093/genetics/153.3.1145.

      Pedruzzi, I., F. Dubouloz, E. Cameroni, V. Wanke, J. Roosen, J. Winderickx, and C. De Virgilio. 2003. TOR and PKA Signaling Pathways Converge on the Protein Kinase Rim15 to Control Entry into G0. Mol. Cell. 12:1607–1613. doi:10.1016/S1097-2765(03)00485-4.

      Pnueli, L., I. Edry, M. Cohen, and Y. Kassir. 2004. Glucose and Nitrogen Regulate the Switch from Histone Deacetylation to Acetylation for Expression of Early Meiosis-Specific Genes in Budding Yeast. Mol. Cell. Biol. 24:5197–5208. doi:10.1128/mcb.24.12.5197-5208.2004.

      Rubin-Bejerano, I., S. Sagee, O. Friedman, L. Pnueli, and Y. Kassir. 2004. The In Vivo Activity of Ime1, the Key Transcriptional Activator of Meiosis-Specific Genes in Saccharomyces cerevisiae, Is Inhibited by the Cyclic AMP/Protein Kinase A Signal Pathway through the Glycogen Synthase Kinase 3-β Homolog Rim11. Mol. Cell. Biol. 24:6967–6979. doi:10.1128/mcb.24.16.6967-6979.2004.

      Tresenrider, A., K. Morse, V. Jorgensen, M. Chia, H. Liao, F.J. van Werven, and E. Ünal. 2021. Integrated genomic analysis reveals key features of long undecoded transcript isoform-based gene repression. Mol. Cell. 81:2231-2245.e11. doi:10.1016/j.molcel.2021.03.013.

      Vidan, S., and A.P. Mitchell. 1997. Stimulation of yeast meiotic gene expression by the glucose-repressible protein kinase Rim15p. Mol. Cell. Biol. 17:2688–2697. doi:10.1128/mcb.17.5.2688.

    1. Author response:

      The following is the authors’ response to the original reviews.

      We thank the reviewers and editors for their careful read of our paper, and appreciate the thoughtful comments.

      Both reviewers agreed that our work had several major strengths: the large dataset collected in collaboration across ten labs, the streamlined processing pipelines, the release of code repositories, the multi-task neural network, and that we definitively determined that electrode placement is an important source of variability between datasets.

      However, a number of key potential improvements were noted: the reviewers felt that a more standard model-based characterization of single neuron responses would benefit our reproducibility analysis, that more detail was needed about the number of cells, sessions, and animals, and that more information was needed to allow users to deploy the RIGOR standards and to understand their relationship to other metrics in the field.

      We agree with these suggestions and have implemented many major updates in our revised manuscript. Some highlights include:

      (1) A new regression analysis that specifies the response profile of each neuron, allowing a comparison of how similar these are across labs and areas (See Figure 7 in the new section, “Single neuron coefficients from a regression-based analysis are reproducible across labs”);

      (2) A new decoding analysis (See Figure 9 in the section, “Decodability of task variables is consistent across labs, but varies by brain region”);

      (3) A new RIGOR notebook to improve usability;

      (4) A wealth of additional information about the cells, animals and sessions in each figure;

      (5) Many new additional figure panels in the main text and supplementary material to clarify the specific points raised by the reviewers.

      Again, we are grateful to the reviewers and editors for their helpful comments, which have significantly improved the work. We are hopeful that the many revisions we have implemented will be sufficient to change the “incomplete” designation that was originally assigned to the manuscript.

      Reviewer #1 (Public review):

      Summary:

      The authors explore a large-scale electrophysiological dataset collected in 10 labs while mice performed the same behavioral task, and aim to establish guidelines to aid reproducibility of results collected across labs. They introduce a series of metrics for quality control of electrophysiological data and show that histological verification of recording sites is important for interpreting findings across labs and should be reported in addition to planned coordinates. Furthermore, the authors suggest that although basic electrophysiology features were comparable across labs, task modulation of single neurons can be variable, particularly for some brain regions. The authors then use a multi-task neural network model to examine how neural dynamics relate to multiple interacting task- and experimenter-related variables, and find that lab-specific differences contribute little to the variance observed. Therefore, analysis approaches that account for correlated behavioral variables are important for establishing reproducible results when working with electrophysiological data from animals performing decision-making tasks. This paper is very well-motivated and needed. However, what is missing is a direct comparison of task modulation of neurons across labs using standard analysis practice in the fields, such as generalized linear model (GLM). This can potentially clarify how much behavioral variance contributes to the neural variance across labs; and more accurately estimate the scale of the issues of reproducibility in behavioral systems neuroscience, where conclusions often depend on these standard analysis methods.

      We fully agree that a comparison of task modulation across labs is essential. To address this, we have performed two new analyses and added new corresponding figures to the main text (Figures 7 and 9). As the reviewer hoped, these analyses did indeed clarify how much behavioral variance contributes to the variance across labs. Critically, they suggested that our results were more reproducible than the more traditional analyses would indicate.

      Additional details are provided below (See detailed response to R1P1b).

      Strengths:

      (1) This is a well-motivated paper that addresses the critical question of reproducibility in behavioural systems neuroscience. The authors should be commended for their efforts.

      (2) A key strength of this study comes from the large dataset collected in collaboration across ten labs. This allows the authors to assess lab-to-lab reproducibility of electrophysiological data in mice performing the same decision-making task.

      (3) The authors' attempt to streamline preprocessing pipelines and quality metrics is highly relevant in a field that is collecting increasingly large-scale datasets where automation of these steps is increasingly needed.

      (4) Another major strength is the release of code repositories to streamline preprocessing pipelines across labs collecting electrophysiological data.

      (5) Finally, the application of MTNN for characterizing functional modulation of neurons, although not yet widely used in systems neuroscience, seems to have several advantages over traditional methods.

      Thanks very much for noting these strengths of our work.

      Weaknesses:

      (1) In several places the assumptions about standard practices in the field, including preprocessing and analyses of electrophysiology data, seem to be inaccurately presented:

      a) The estimation of how much the histologically verified recording location differs from the intended recording location is valuable information. Importantly, this paper provides citable evidence for why that is important. However, histological verification of recording sites is standard practice in the field, even if not all studies report them. Although we appreciate the authors' effort to further motivate this practice, the current description in the paper may give readers outside the field a false impression of the level of rigor in the field.

      We agree that labs typically do perform histological verification. Still, our methods offer a substantial improvement over standard practice, and this was critical in allowing us to identify errors in targeting. For instance, we used new software, LASAGNA, which is an innovation over the traditional, more informal approach to localizing recording sites. Second, the requirement that two independent reviewers concur on each proposed location for a recording site is also an improvement over standard practice. Importantly, these reviewers use electrophysiological features to more precisely localize electrodes, when needed, which is an improvement over many labs. Finally, most labs use standard 2D atlases to identify recording location (a traditional approach); our use of a 3D atlas and a modern image registration pipeline has improved the accuracy of identifying the true placement of probes in 3D space.

      Importantly, we don’t necessarily advocate that all labs adopt our pipeline; indeed, this would be infeasible for many labs. Instead, our hope is that the variability in probe trajectory that we uncovered will be taken into account in future studies. Here are three examples of how that could happen. First, groups hoping to target a small area for an experiment might elect to use a larger cohort than previously planned, knowing that some insertions will miss their target. Second, our observation that some targeting error arose because experimenters had to move probes due to blood vessels will impact future surgeries: when an experimenter realizes that a blood vessel is in the way, they might still re-position the probe, but they can also adjust its trajectory (e.g., changing the angle) knowing that even little nudges to avoid blood vessels can have a large impact on the resulting insertion trajectory. Third, our observation of a 7-degree deviation between stereotaxic coordinates and Allen Institute coordinates can be used in future trajectory planning to improve accuracy of placement. Uncovering this deviation required many insertions and our standardized pipeline, but now that it is known, it can be easily corrected without needing such a pipeline.

      We thank the reviewer for bringing up this issue and have added new text (and modified existing text) in the Discussion to highlight the innovations we introduced that allowed us to carefully quantify probe trajectory across labs (lines 500 - 515):

      “Our ability to detect targeting error benefited from an automated histological pipeline combined with alignment and tracing that required agreement between multiple users, an approach that greatly exceeds the histological analyses done by most individual labs. Our approach, which enables scalability and standardization across labs while minimizing subjective variability, revealed that much of the variance in targeting was due to the probe entry positions at the brain surface, which were randomly displaced across the dataset. … Detecting this offset relied on a large cohort size and an automated histological pipeline, but now that we have identified the offset, it can be easily accounted for by any lab. Specifically, probe angles must be carefully computed from the CCF, as the CCF and stereotaxic coordinate systems do not define the same coronal plane angle. Minimizing variance in probe targeting is another important element in increasing reproducibility, as slight deviations in probe entry position and angle can lead to samples from different populations of neurons. Collecting structural MRI data in advance of implantation could reduce targeting error, although this is infeasible for most labs. A more feasible solution is to rely on stereotaxic coordinates but account for the inevitable off-target measurements by increasing cohort sizes and adjusting probe angles when blood vessels obscure the desired location.”
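      To give a sense of scale for an angular offset of this kind, the following back-of-the-envelope sketch computes how far a probe tip would end up from its planned position if the trajectory were rotated by a fixed angle about the entry point. The pure-rotation geometry and the insertion lengths are assumptions chosen for illustration; only the 7 degree figure comes from the text above, and the function name is hypothetical.

```python
import math


def tip_displacement_mm(inserted_length_mm, angle_offset_deg):
    """Distance between the tips of two straight trajectories that share
    an entry point and length but differ in angle by angle_offset_deg.

    This is the chord of a circle of radius inserted_length_mm:
    2 * L * sin(theta / 2).
    """
    half_angle = math.radians(angle_offset_deg) / 2.0
    return 2.0 * inserted_length_mm * math.sin(half_angle)


# Hypothetical insertion lengths; the 7 degree offset is the value
# reported in the response.
for length_mm in (1.0, 2.0, 4.0):
    shift = tip_displacement_mm(length_mm, 7.0)
    print(f"{length_mm:.1f} mm inserted -> {shift:.2f} mm at the tip")
```

      Under these assumptions the displacement grows linearly with insertion length, so an uncorrected angular offset matters most for deep targets.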

      b) When identifying which and how neurons encode particular aspects of stimuli or behaviour in behaving animals (when variables are correlated by the nature of the animal's behaviour), it has become the standard in behavioral systems neuroscience to use GLMs - indeed many labs participating in the IBL also have a long history of doing this (e.g., Steinmetz et al., 2019; Musall et al., 2023; Orsolic et al., 2021; Park et al., 2014). The reproducibility of results when using GLMs is never explicitly shown, but the supplementary figures to Figure 7 indicate that results may be reproducible across labs when using GLMs (as it has similar prediction performance to the MTNN). This should be introduced as the first analysis method used in a new dedicated figure (i.e., following Figure 3 and showing results of analyses similar to what was shown for the MTNN in Figure 7). This will help put into perspective the degree of reproducibility issues the field is facing when analyzing with appropriate and common methods. The authors can then go on to show how simpler approaches (currently in Figures 4 and 5) - not accounting for a lot of uncontrolled variabilities when working with behaving animals - may cause reproducibility issues.

      We fully agree with the reviewer's suggestion. We have addressed their concern by implementing a Reduced-Rank Regression (RRR) model, which builds upon and extends the principles of Generalized Linear Models (GLMs). The RRR model retains the core regression framework of GLMs while introducing shared, trainable temporal bases across neurons, enhancing the model’s capacity to capture the structure in neural activity (Posani, Wang, et al., bioRxiv, 2024). Importantly, Posani, Wang, et al. compared the predictive performance of GLMs and the RRR model and found that the RRR model provided (slightly) improved performance, so we chose the RRR approach here.
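      For readers unfamiliar with the idea, the sketch below shows classical reduced-rank regression on simulated data: an ordinary least-squares fit whose coefficient matrix is then constrained to low rank, so that all neurons share a small set of components. It is a generic textbook version under stated assumptions, not the model of Posani, Wang, et al.; the shared temporal bases and training procedure of that model are not reproduced here, and all variable names and dimensions are hypothetical.

```python
import numpy as np


def reduced_rank_regression(X, Y, rank):
    """Least-squares fit of Y ~ X @ B subject to rank(B) <= rank.

    Classical solution: fit ordinary least squares, then project the
    coefficients onto the top right-singular vectors of the fitted values.
    """
    B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
    Y_hat = X @ B_ols
    _, _, Vt = np.linalg.svd(Y_hat, full_matrices=False)
    V_r = Vt[:rank].T                      # (n_neurons, rank)
    return B_ols @ V_r @ V_r.T             # (n_predictors, n_neurons)


# Simulated data: 8 task predictors drive 30 "neurons" through a
# rank-2 coefficient matrix, plus a small amount of noise.
rng = np.random.default_rng(0)
n_samples, n_predictors, n_neurons, true_rank = 500, 8, 30, 2
X = rng.normal(size=(n_samples, n_predictors))
B_true = (rng.normal(size=(n_predictors, true_rank))
          @ rng.normal(size=(true_rank, n_neurons)))
Y = X @ B_true + 0.1 * rng.normal(size=(n_samples, n_neurons))

B_hat = reduced_rank_regression(X, Y, rank=true_rank)
rel_error = np.linalg.norm(B_hat - B_true) / np.linalg.norm(B_true)
print("estimated rank:", np.linalg.matrix_rank(B_hat))
print("relative coefficient error:", round(float(rel_error), 4))
```

      The per-neuron columns of the fitted coefficient matrix play the role of the single-neuron coefficients that can then be compared across labs and brain regions.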

      We highlight this analysis in a new section (lines 350-377) titled, “Single neuron coefficients from a regression-based analysis are reproducible across labs”. This section includes an entirely new Figure (Fig. 7), where this new analysis felt most appropriate, since it is closer in spirit to the MTNN analysis that follows (rather than as a new Figure 3, as the reviewer suggested). As the reviewer hoped, this analysis provides some reassurance that including many variables when characterizing neural activity furnishes results with improved reproducibility. We now state this in the Results and the Discussion (line 456-457), highlighting that these analyses complement the more traditional selectivity analyses, and that using both methods together can be informative.

      When the authors introduce a neural network approach (i.e. MTNN) as an alternative to the analyses in Figures 4 and 5, they suggest: 'generalized linear models (GLMs) are likely too inflexible to capture the nonlinear contributions that many of these variables, including lab identity and spatial positions of neurons, might make to neural activity'. This is despite the comparison between MTNN and GLM prediction performance (Supplement 1 to Figure 7) showing that the MTNN is only slightly better at predicting neural activity compared to standard GLMs. The introduction of new models to capture neural variability is always welcome, but the conclusion that standard analyses in the field are not reproducible can be unfair unless directly compared to GLMs.

      In essence, it is really useful to demonstrate how different analysis methods and preprocessing approaches affect reproducibility. But the authors should highlight what is actually standard in the field, and then provide suggestions to improve from there.

      Thanks again for these comments. We have also edited the MTNN section slightly to accommodate the addition of the previous new RRR section (line 401-402).

      (2) The authors attempt to establish a series of new quality control metrics for the inclusion of recordings and single units. This is much needed, with the goal to standardize unit inclusion across labs that bypasses the manual process while keeping the nuances from manual curation. However, the authors should benchmark these metrics to other automated metrics and to manual curation, which is still a gold standard in the field. The authors did this for whole-session assessment but not for individual clusters. If the authors can find metrics that capture agreed-upon manual cluster labels, without the need for manual intervention, that would be extremely helpful for the field.

      We thank the reviewer for their insightful suggestions regarding benchmarking our quality control metrics against manual curation and other automated methods at the level of individual clusters. We are indeed, as the reviewer notes, publishing results from spike sorting outputs that have been automatically but not manually verified on a neuron-by-neuron basis. To get to the point where we trust these results to be of publishable quality, we manually reviewed hundreds of recordings and thousands of neurons, refining both the preprocessing pipeline and the single-unit quality metrics along the way. All clusters, both those passing QCs and those not passing QCs, are available to review with detailed plots and quantifications at https://viz.internationalbrainlab.org/app (turn on “show advanced metrics” in the upper right, and navigate to the plots furthest down the page, which are at the individual unit level). We would emphasize that these metrics are definitely imperfect (and fully-automated spike sorting remains a work in progress), but so is manual clustering. Our fully automated approach has the advantage of being fully reproducible, which is absolutely critical for the analyses in the present paper. Indeed, if we had actually done manual clustering or curation, one would wonder whether our results were actually reproducible independently. Nevertheless, it is not part of the present manuscript’s objectives to validate or defend these specific choices for automated metrics, which have been described in detail elsewhere (see our Spike Sorting whitepaper, https://figshare.com/articles/online_resource/Spike_sorting_pipeline_for_the_International_Brain_Laboratory/19705522?file=49783080). It would be a valuable exercise to thoroughly compare these metrics against a careful, large, manually-curated set, but doing this properly would be a paper in itself and is beyond the scope of the current paper.

      We also acknowledge that our analyses studying reproducibility across labs could, in principle, result in more or less reproducibility under a different choice of metrics, which we now describe in the Discussion (lines 469-470):

      “Another significant limitation of the analysis presented here is that we have not been able to assess the extent to which other choices of quality metrics and inclusion criteria might have led to greater or lesser reproducibility.”

      (3) With the goal of improving reproducibility and providing new guidelines for standard practice for data analysis, the authors should report the n of cells, sessions, and animals used in plots and analyses throughout the paper, both to aid understanding of the variability in the plots and to set a good example.

      We wholeheartedly agree and have added the number of cells, mice and sessions for each figure. This information is included as new tabs in our quality control spreadsheet (https://docs.google.com/spreadsheets/d/1_bJLDG0HNLFx3SOb4GxLxL52H4R2uPRcpUlIw6n4n-E/). This is referred to in lines 158-159 (as well as its original location on line 554 in the section, “Quality control and data inclusion”).

      Other general comments:

      (1) In the discussion (line 383) the authors conclude: 'This is reassuring, but points to the need for large sample sizes of neurons to overcome the inherent variability of single neuron recording'. - Based on what is presented in this paper we would rather say that their results suggest that appropriate analytical choices are needed to ensure reproducibility, rather than large datasets - and they need to show whether using standard GLMs actually allows for reproducible results.

      Thanks. The new GLM-style RRR analysis in Figure 7, following the reviewer’s suggestion, does indeed indicate improved reproducibility across labs. As described above, we see this new analysis as complementary to more traditional analyses of neural selectivity and argue that the two can be used together. The new text (line 461) states:

      “This is reassuring, and points to the need for appropriate analytical choices to ensure reproducibility.”

      (2) A general assumption in the across-lab reproducibility questions in the paper relies on intralab variability vs across-lab variability. An alternative measure that may better reflect experimental noise is across-researcher variability, as well as the amount of experimenter experience (if the latter is a factor, it could suggest researchers may need more training before collecting data for publication). The authors state in the discussion that this is not possible. But maybe certain measures can be used to assess this (e.g. years of conducting surgeries/ephys recordings etc)?

      We agree that understanding experimenter-to-experimenter variability would be very interesting and indeed we had hoped to do this analysis for some time. The problem is that typically, each lab employed one trainee to conduct all the data collection. This prevents us from comparing outcomes from two different experimenters in the same lab. There are exceptions to this, such as the Churchland lab in which 3 personnel (two postdocs and a technician) collected the data. However, even this fortuitous situation did not lend itself well to assessing experimenter-to-experimenter variation: the Churchland lab moved from Cold Spring Harbor to UCLA during the data collection period, which might have caused variability that is totally independent of experimenter (e.g., different animal facilities). Further, once at UCLA, the postdoc and technician worked closely together, alternating roles in animal training, surgery and electrophysiology. We believe that the text in our current Discussion (lines 465-468) accurately characterizes the situation:

      “Our experimental design precludes an analysis of whether the reproducibility we observed was driven by person-to-person standardization or lab-to-lab standardization. Most likely, both factors contributed: all lab personnel received standardized instructions for how to implant head bars and train animals, which likely reduced personnel-driven differences.”

      Quantifying the level of experience of each experimenter is an appealing idea and we share the reviewer’s curiosity about its impact on data quality. Unfortunately, quantifying experience is tricky. For instance, years of conducting surgeries is not an unambiguously determinable number. Would we count an experimenter who did surgery every day for a year as having the same experience as an experimenter who did surgery once/month for a year? Would we count a surgeon with expertise in other areas (e.g., windows for imaging) in the same way as surgeons with expertise in ephys-specific surgeries? Because of the ambiguities, we leave this analysis to be the subject of future work; this is now stated in the Discussion (line 476).

      (3) Figure 3b and c: Are these plots before or after the probe depth has been adjusted based on physiological features such as the LFP power? In other words, is the IBL electrophysiological alignment toolbox used here and is the reliability of location before using physiological criteria or after? Beyond clarification, showing both before and after would help the readers to understand how much the additional alignment based on electrophysiological features adjusts probe location. It would also be informative if they sorted these penetrations by which penetrations were closest to the planned trajectory after histological verification.

      The plots in Figure 3b and 3c reflect data after the probe depth has been adjusted based on electrophysiological features. This adjustment incorporates criteria such as LFP power and spiking activity to refine the trajectory and ensure precise alignment with anatomical landmarks. The trajectories have also been reviewed and confirmed by two independent reviewers. We have clarified this in line 180 and in the caption of Figure 3.

      To address this concern, we have added a new panel c in Figure 3 supplementary 1 (also shown below) that shows the LFP features along the probes prior to using the IBL alignment toolbox. We hope the reviewer agrees that a comparison of panels (a) and (c) below makes clear the improvement afforded by our alignment tools.

      In Figure 3 and Figure 3 supplementary 1, as suggested, we have also now sorted the probes by those that were closest to the planned trajectory. This way of visualizing the data makes it clear that as the distance from the planned trajectory increases, the power spectral density in the hippocampal regions becomes less pronounced and the number of probes that have a large portion of the channels localized to VISa/am, LP and PO decreases. We have added text to the caption to describe this. We thank the reviewer for this suggestion and agree that it will help readers to understand how much the additional alignment (based on electrophysiological features) adjusts probe location.

      (4) In Figures 4 and 6: If the authors use a 0.05 threshold (alpha) and a cell simply has to be significant on 1/6 tests to be considered task modulated, that means that they have a false positive rate of ~30% (0.05*6=0.3). We ran a simple simulation looking for significant units (from a random null distribution) under these criteria, which shows that out of 100,000 units, 26,500 units would come out significant (false positive rate: 26.5%). That is very high (and unlikely to be accepted in most papers), and it is therefore not surprising that the fraction of task-modulated units across labs is highly variable. This high false positive rate may also have implications for the investigation of the spatial position of task-modulated units (as effects of the spatial position may drown in falsely labelled 'task-modulated' cells).

      Thank you for this concern. The different tests were kept separate, so we did not consider a neuron modulated if it was significant in only one out of six tests, but instead we asked whether a neuron was modulated according to test one, whether it was modulated according to test two, etc., and performed further analyses separately for each test. Thus, we are only vulnerable to the ‘typical’ false positive rate of 0.05 for any given test. We made this clearer in the text (lines 232-236) and hope that the 5% false positive rate seems more acceptable.
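
      The two rates discussed in this exchange can be checked with a short sketch. It assumes six independent tests at alpha = 0.05 under the null hypothesis; the function names and the simulation size are illustrative and are not taken from the manuscript's code. If a unit counted as modulated whenever any of the six tests was significant, the expected rate would be 1 - 0.95^6, or about 26.5%, whereas evaluating each test separately keeps the rate near 5% per test.

```python
import random

ALPHA = 0.05
N_TESTS = 6


def any_test_rate(alpha: float, n_tests: int) -> float:
    """False positive rate if a unit counts as modulated when ANY test is significant.

    Assumes the tests are independent under the null hypothesis.
    """
    return 1.0 - (1.0 - alpha) ** n_tests


def simulate(n_units: int, alpha: float, n_tests: int, seed: int = 0):
    """Draw uniform null p-values; return (any-test rate, list of per-test rates)."""
    rng = random.Random(seed)
    any_hits = 0
    per_test_hits = [0] * n_tests
    for _ in range(n_units):
        significant = [rng.random() < alpha for _ in range(n_tests)]
        any_hits += any(significant)
        for t, hit in enumerate(significant):
            per_test_hits[t] += hit
    return any_hits / n_units, [h / n_units for h in per_test_hits]


analytic = any_test_rate(ALPHA, N_TESTS)
simulated_any, simulated_each = simulate(100_000, ALPHA, N_TESTS)
print(f"any-of-{N_TESTS} rate: analytic {analytic:.3f}, simulated {simulated_any:.3f}")
print("per-test rates:", [round(r, 3) for r in simulated_each])
```

      The simple product 0.05*6 overestimates the any-test rate slightly because it ignores units that are significant on more than one test.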

      (5) The authors state from Figure 5b that the majority of cells could be well described by 2 PCs. The distribution of R2 across neurons is almost uniform, so depending on what R2 value one considers a 'good' description, that is the fraction of 'good' cells. Furthermore, movement onset has now been well-established to be affecting cells widely and in large fractions, so while this analysis may work for something with global influence - like movement - more sparsely encoded variables (as many are in the brain) may not be well approximated with this suggestion. The authors could expand this analysis into other epochs like activity around stimulus presentation, to better understand how this type of analysis reproduces across labs for features that have a less global influence.

      We thank the reviewer for the suggestion and fully agree that the window used in our original analysis would tend to favor movement-driven neurons. To address this, we repeated the analysis, this time using a window centered around stimulus onset (from 0.5 s before stimulus onset until 0.1 s after stimulus onset). As the reviewer suspected, far fewer neurons were active in this window and consequently far fewer were modelled well by the first two PCs, as shown in Author response image 1b (below). Similar to our original analysis using the post-movement window, we found mixed results for the stimulus-centered window across labs. Interestingly, regional differences were weaker in this new analysis compared to the original analysis of the post-movement window. We have added a sentence to the results describing this. Because the results are similar to the post-movement window main figure, we would prefer to restrict the new analysis only to this point-by-point response, in the hopes of streamlining the paper.

      Author response image 1.

      PCA analysis applied to a stimulus-aligned window ([-0.5, 0.1] sec relative to stim onset). Figure conventions as in main text Fig 5. Results are comparable to the post-movement window analysis; however, regional differences are weaker here, possibly because fewer cells were active in the pre-movement window. We added panel j here and in the main figure, showing cell-number-controlled results. That is, for each test, the minimum neuron number of the compared classes was sampled from all classes (say, labs in a region); this sampling was repeated 1000 times and the p-values were combined via Fisher’s method, overall resulting in far fewer significant differences across laboratories and, independently, regions.

      (6) Additionally, in Figure 5i: could the finding that one can only distinguish labs when taking cells from all regions, simply be a result of a different number of cells recorded in each region for each lab? It makes more sense to focus on the lab/area pairing as the authors also do, but not to make their main conclusion from it. If the authors wish to do the comparison across regions, they will need to correct for the number of cells recorded in each region for each lab. In general, it was a struggle to fully understand the purpose of Figure 5. While population analysis and dimensionality reduction are commonplace, this seems to be a very unusual use of it.

      We agree that controlling for varying cell numbers is a valuable addition to this analysis. We added panel j in Fig. 5 showing cell-number-controlled test results of panel i. That is, for a given statistical comparison, we sample the lowest number of cells of the compared classes from the others, do the test, and repeat this sampling 1000 times, before combining the p-values using Fisher’s method. This cell-number-controlled version of the tests resulted in clearly fewer significant differences across distributions, as seen similarly for the pre-movement window shown in panel j of Author response image 1. We hope this clarifies our aim to illustrate that low-dimensional embedding of cells’ trial-averaged activity can show how regional differences compare with laboratory differences.
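
      The cell-number control described above can be sketched as follows. This is a minimal illustration written with the Python standard library only, not the code used for panel j: the KS p-value uses a textbook asymptotic approximation, the helper names are hypothetical, and the two input lists are synthetic stand-ins for per-cell PC scores from two classes of unequal size. Fisher's method formally assumes independent p-values, which repeated subsamples of the same cells do not strictly satisfy.

```python
import math
import random


def ks_2samp_pvalue(a, b):
    """Two-sample KS test p-value from the asymptotic Kolmogorov series."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))  # gap between the two empirical CDFs at x
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.2:  # the series converges slowly here and the p-value is ~1
        return 1.0
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                 for k in range(1, 101))
    return min(1.0, max(0.0, 2.0 * series))


def fisher_combine(pvalues):
    """Combine p-values with Fisher's method (chi-square, 2k degrees of freedom)."""
    k = len(pvalues)
    # Half of Fisher's statistic X = -2 * sum(ln p); p is clipped to avoid log(0).
    half_stat = -sum(math.log(max(p, 1e-300)) for p in pvalues)
    if half_stat == 0.0:
        return 1.0
    # For even degrees of freedom 2k the chi-square survival function equals
    # P(Poisson(X/2) <= k - 1); each term is evaluated in log space.
    log_h = math.log(half_stat)
    total = sum(math.exp(i * log_h - math.lgamma(i + 1) - half_stat)
                for i in range(k))
    return min(1.0, total)


def size_controlled_pvalue(a, b, n_repeats=1000, seed=0):
    """Subsample both classes to the smaller size, test, repeat, then combine."""
    rng = random.Random(seed)
    n_min = min(len(a), len(b))
    pvalues = [ks_2samp_pvalue(rng.sample(a, n_min), rng.sample(b, n_min))
               for _ in range(n_repeats)]
    return fisher_combine(pvalues)


data_rng = random.Random(1)
lab_a = [data_rng.gauss(0.0, 1.0) for _ in range(40)]   # synthetic scores, small class
lab_b = [data_rng.gauss(0.0, 1.0) for _ in range(120)]  # synthetic scores, large class
p_combined = size_controlled_pvalue(lab_a, lab_b)
print(f"combined p-value: {p_combined:.3g}")
```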

      As a complementary statistical analysis to the shown KS tests, we fitted a linear mixed-effects model (statsmodels.formula.api.mixedlm) to the first and second PC for both activity windows (“Move”: [-0.5,1] first movement aligned; “Stim”: [-0.5,0.1] stimulus onset aligned), independently. Author response image 2 (in this rebuttal only) is broadly in line with the KS results, showing more regional than lab influences on the distributions of first PCs for the post-movement window.

      Author response image 2:

      Linear mixed effects model results for two PCs and two activity windows. For the post-movement window (“Move”), regional influences are significant (red color in plots) for all but one region while only one lab has a significant model coefficient for PC1. For PC2 more labs and three regions have significant coefficients. For the pre-movement window (“Stim”) one region for PC1 or PC2 has significant coefficients. The variance due to session id was smaller than all other effects (“eids Var”). “Intercept” shows the expected value of the response variable (PC1, PC2) before accounting for any fixed or random effects. All p-values were grouped as one hypothesis family and corrected for multiple comparisons via Benjamini-Hochberg.
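
      The Benjamini-Hochberg correction mentioned in this caption can be sketched as below. This is a generic implementation of the step-up adjustment and not the code used to produce Author response image 2; the function name and the p-values in the usage lines are made up for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Illustrative p-values treated as a single hypothesis family.
family = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(family)
significant = [q <= 0.05 for q in adjusted]
print(adjusted, significant)
```

      Treating all coefficients as one family, as the caption describes, means that the adjustment depends on the total number of tests across both PCs and both windows.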

      (7) In the discussion the authors state: " Indeed this approach is a more effective and streamlined way of doing it, but it is questionable whether it 'exceeds' what is done in many labs.

      Classically, scientists trace each probe manually with light microscopy and designate each area based on anatomical landmarks identified with Nissl or DAPI stains together with gross landmarks. When not automated with 2-PI serial tomography and anatomically aligned to a standard atlas, this is a less effective process, but it is not clear that it is less precise, especially in studies before Neuropixels where active electrodes were located in a much smaller area. While more effective, transforming into a common atlas does make additional assumptions about warping the brain into the standard atlas - especially in cases where the brain has been damaged/lesioned. Readers can appreciate the effectiveness and streamlining provided by these new tools without the need to invalidate previous approaches.

      We thank the reviewer for highlighting the effectiveness of manual tracing methods used traditionally. Our intention in the statement was not to invalidate the precision or value of these classical methods but rather to emphasize the scalability and streamlining offered by our pipeline. We have revised the language to more accurately reflect this (line 500-504):

      “Our ability to detect targeting error benefited from an automated histological pipeline combined with alignment and tracing that required agreement between multiple users, an approach that greatly exceeds the histological analyses done by most individual labs. Our approach, which enables scalability and standardization across labs while minimizing subjective variability, revealed that much of the variance in targeting was due to the probe entry positions at the brain surface, which were randomly displaced across the dataset.”

      (8) What about across-lab population-level representation of task variables, such as in the coding direction for stimulus or choice? Is the general decodability of task variables from the population comparable across labs?

      Excellent question, thanks! We have added the new section “Decodability of task variables is consistent across labs, but varies by brain region” (line 423-448) and Figure 9 in the revised manuscript to address this question. In short, yes, the general decodability of task variables from the population is comparable across labs, providing additional reassurance of reproducibility.

      Reviewer #2 (Public review):

      Summary:

      The authors sought to evaluate whether observations made in separate individual laboratories are reproducible when they use standardized procedures and quality control measures. This is a key question for the field. If ten systems neuroscience labs try very hard to do the exact same experiment and analyses, do they get the same core results? If the answer is no, this is very bad news for everyone else! Fortunately, they were able to reproduce most of their experimental findings across all labs. Despite attempting to target the same brain areas in each recording, variability in electrode targeting was a source of some differences between datasets.

      Major Comments:

      The paper had two principal goals:

      (1) to assess reproducibility between labs on a carefully coordinated experiment

      (2) distill the knowledge learned into a set of standards that can be applied across the field.

      The manuscript made progress towards both of these goals but leaves room for improvement.

      (1) The first goal of the study was to perform exactly the same experiment and analyses across 10 different labs and see if you got the same results. The rationale for doing this was to test how reproducible large-scale rodent systems neuroscience experiments really are. In this, the study did a great job showing that when a consortium of labs went to great lengths to do everything the same, even decoding algorithms could not clearly discern laboratory identity from the raw data. However, the amount of coordination between the labs was so great that these findings are hard to generalize to the situation where similar (or conflicting!) results are generated by two labs working independently.

      Importantly, the study found that electrode placement (and thus likely also errors inherent to the electrode placement reconstruction pipeline) was a key source of variability between datasets. To remedy this, they implemented a very sophisticated electrode reconstruction pipeline (involving two-photon tomography and multiple blinded data validators) in just one lab, and all brains were sliced and reconstructed in this one location. This is a fantastic approach for ensuring similar results within the IBL collaboration, but makes it unclear how much variance would have been observed if each lab had attempted to reconstruct their probe trajectories themselves using a mix of histology techniques from conventional brain slicing, to light sheet microscopy, to MRI.

      This approach also raises a few questions. The use of standard procedures, pipelines, etc. is a great goal, but most labs are trying to do something unique with their setup. Bigger picture, shouldn't highly "significant" biological findings akin to the discovery of place cells or grid cells, be so clear and robust that they can be identified with different recording modalities and analysis pipelines?

      We agree, and hope that this work may help readers understand what effect sizes may be considered “clear and robust” from datasets like these. We certainly support the reviewer’s point that multiple approaches and modalities can help to confirm any biological findings, but we would contend that a clear understanding of the capabilities and limitations of each approach is valuable, and we hope that our paper helps to achieve this.

      Related to this, how many labs outside of the IBL collaboration have implemented the IBL pipeline for their own purposes? In what aspects do these other labs find it challenging to reproduce the approaches presented in the paper? If labs were supposed to perform this same experiment, but without coordinating directly, how much more variance between labs would have been seen? Obviously investigating these topics is beyond the scope of this paper. The current manuscript is well-written and clear as is, and I think it is a valuable contribution to the field. However, some additional discussion of these issues would be helpful.

      We thank the reviewer for raising this important issue. We know of at least 13 labs that have implemented the behavioral task software and hardware that we published in eLife in 2021, and we expect that over the next several years labs will also implement these analysis pipelines (note that it is considerably cheaper and faster to implement software pipelines than hardware). In particular, a major goal of the staff in the coming years is to continue and improve the support for pipeline deployment and use. However, our goal in this work, which we have aimed to state more clearly in the revised manuscript, was not so much to advocate that others adopt our pipeline, but instead to use our standardized approach as a means of assessing reproducibility under the best of circumstances (see lines 48-52): “A high level of reproducibility of results across laboratories when procedures are carefully matched is a prerequisite to reproducibility in the more common scenario in which two investigators approach the same high-level question with slightly different experimental protocols.”

      Further, a number of our findings are relevant to other labs regardless of whether they implement our exact pipeline, a modified version of our pipeline, or something else entirely. For example, we found probe targeting to be a large source of variability. Our ability to detect targeting error benefited from an automated histological pipeline combined with alignment and tracing that required agreement between multiple users, but now that we have identified the offset, it can be easily accounted for by any lab. Specifically, probe angles must be carefully computed from the CCF, as the CCF and stereotaxic coordinate systems do not define the same coronal plane angle. Relatedly, we found that slight deviations in probe entry position can lead to samples from different populations of neurons. Although this took large cohort sizes to discover, knowledge of this discovery means that future experiments can plan for larger cohort sizes to allow for off-target trajectories, and can re-compute probe angle when the presence of blood vessels necessitates moving probes slightly. These points are now highlighted in the Discussion (lines 500-515).

      Second, the proportion of responsive neurons (a quantity often used to determine that a particular area subserves a particular function) sometimes failed to reproduce across labs. For example, for movement-driven activity in PO, UCLA reported an average change of 0 spikes/s, while CCU reported a large and consistent change (Figure 4d, rightmost panel, compare orange vs. yellow traces). This argues that neuron-to-neuron variability means that comparisons across labs require large cohort sizes. A small number of outlier neurons in a session can heavily bias responses. We anticipate that this problem will be remedied as tools for large scale neural recordings become more widely used. Indeed, the use of 4-shank instead of single-shank Neuropixels (as we used here) would have greatly enhanced the number of PO neurons we measured in each session. We have added new text to Results explaining this (lines 264-268):

      “We anticipate that the feasibility of even larger scale recordings will make lab-to-lab comparisons easier in future experiments; multi-shank probes could be especially beneficial for cortical recordings, which tend to be the most vulnerable to low cell counts since the cortex is thin and is the most superficial structure in the brain and thus the most vulnerable to damage. Analyses that characterize responses to multiple parameters are another possible solution (See Figure 7).”

      (2) The second goal of the study was to present a set of data curation standards (RIGOR) that could be applied widely across the field. This is a great idea, but its implementation needs to be improved if adoption outside of the IBL is to be expected. Here are three issues:

      (a) The GitHub repo for this project (https://github.com/int-brain-lab/paper-reproducible-ephys/) is nicely documented if the reader's goal is to reproduce the figures in the manuscript. Consequently, the code for producing the RIGOR statistics seems mostly designed for re-computing statistics on the existing IBL-formatted datasets. There doesn't appear to be any clear documentation about how to run it on arbitrary outputs from a spike sorter (i.e. the inputs to Phy).

      We agree that clear documentation is key for others to adopt our standards. To address this, we have added a section at the end of the README of the repository that links to a Jupyter notebook (https://github.com/int-brain-lab/paper-reproducible-ephys/blob/master/RIGOR_script.ipynb) that runs the RIGOR metrics on a user’s own spike sorted dataset. The notebook also contains a tutorial that walks through how to visually assess the quality of the raw and spike sorted data, and computes the noise level metrics on the raw data as well as the single cell metrics on the spike sorted data.

      (b) Other sets of spike sorting metrics that are more easily computed for labs that are not using the IBL pipeline already exist (e.g. "quality_metrics" from the Allen Institute ecephys pipeline [https://github.com/AllenInstitute/ecephys_spike_sorting/blob/main/ecephys_spike_sorting/modules/quality_metrics/README.md] and the similar module in the Spike Interface package [https://spikeinterface.readthedocs.io/en/latest/modules/qualitymetrics.html]). The manuscript does not compare these approaches to those proposed here, but some of the same statistics already exist (amplitude cutoff, median spike amplitude, refractory period violation).

      There is a long history of researchers providing analysis algorithms and code for spike sorting quality metrics, and we agree that the Allen Institute’s ecephys code and the Spike Interface package are the current options most widely used (but see also, for example, Fabre et al. https://github.com/Julie-Fabre/bombcell). Our primary goal in the present work is not to advocate for a particular implementation of any quality metrics (or any spike sorting algorithm, for that matter), but instead to assess reproducibility of results, given one specific choice of spike sorting algorithm and quality metrics. That is why, in our comparison of yield across datasets (Fig 1F), we downloaded the raw data from those comparison datasets and re-ran them under our single fixed pipeline, to establish a fair standard of comparison. A full comparison of the analyses presented here under different choices of quality metrics and spike sorting algorithms would undoubtedly be interesting and useful for the field - however, we consider it to be beyond the scope of the present work. It is therefore an important assumption of our work that the result would not differ materially under a different choice of sorting algorithm and quality metrics. We have added text to the Discussion to clarify this limitation:

      “Another significant limitation of the analysis presented here is that we have not been able to assess the extent to which other choices of quality metrics and inclusion criteria might have led to greater or lesser reproducibility.”

      That said, we still intend for external users to be able to easily run our pipelines and quality metrics.

      (c) Some of the RIGOR criteria are qualitative and must be visually assessed manually. Conceptually, these features make sense to include as metrics to examine, but would ideally be applied in a standardized way across the field. The manuscript doesn't appear to contain a detailed protocol for how to assess these features. A procedure for how to apply these criteria for curating non-IBL data (or for implementing an automated classifier) would be helpful.

      We agree. To address this, we have provided a notebook that runs the RIGOR metrics on a user’s own dataset, and contains a tutorial on how to interpret the resulting plots and metrics (https://github.com/int-brain-lab/paper-reproducible-ephys/blob/master/RIGOR_script.ipynb).

      Within this notebook there is a section focused on visually assessing the quality of both the raw data and the spike sorted data. The code in this section can be used to generate plots, such as raw data snippets or the raster map of the spiking activity, which are typically used to visually assess the quality of the data. In Figure 1 Supplement 2 we have provided examples of such plots that show different types of artifactual activity that should be inspected.

      Other Comments:

      (1) How did the authors select the metrics they would use to evaluate reproducibility? Was this selection made before doing the study?

      Our metrics were selected on the basis of our experience and expertise with extracellular electrophysiology. For example: some of us previously published on epileptiform activity and its characteristics in some mice (Steinmetz et al. 2017), so we included detection of that type of artifact here; and, some of us previously published detailed investigations of instability in extracellular electrophysiological recordings and methods for correcting them (Steinmetz et al. 2021, Windolf et al. 2024), so we included assessment of that property here. These metrics therefore represent our best expert knowledge about the kinds of quality issues that can affect this type of dataset, but it is certainly possible that future investigators will discover and characterize other quality issues.

      The selection of metrics was primarily performed before the study (we used these assessments internally before embarking on the extensive quantifications reported here), and in cases where we refined them further during the course of preparing this work, it was done without reference to statistical results on reproducibility but instead on the basis of manual inspection of data quality and metric performance.

      (2) Was reproducibility within-lab dependent on experimenter identity?

      We thank the reviewer for this question. We have addressed it in our response to R1 General comment 2, as follows:

      We agree that understanding experimenter-to-experimenter variability would be very interesting and indeed we had hoped to do this analysis for some time. The problem is that typically, each lab employed one trainee to conduct all the data collection. This prevents us from comparing outcomes from two different experimenters in the same lab. There are exceptions to this, such as the Churchland lab in which 3 personnel (two postdocs and a technician) collected the data. However, even this fortuitous situation did not lend itself well to assessing experimenter-to-experimenter variation: the Churchland lab moved from Cold Spring Harbor to UCLA during the data collection period, which might have caused variability that is totally independent of experimenter (e.g., different animal facilities). Further, once at UCLA, the postdoc and technician worked closely together, alternating roles in animal training, surgery and electrophysiology. We believe that the text in our current Discussion (lines 465-468) accurately characterizes the situation:

      “Our experimental design precludes an analysis of whether the reproducibility we observed was driven by person-to-person standardization or lab-to-lab standardization. Most likely, both factors contributed: all lab personnel received standardized instructions for how to implant head bars and train animals, which likely reduced personnel-driven differences.”

      Quantifying the level of experience of each experimenter is an appealing idea and we share the reviewer’s curiosity about its impact on data quality. Unfortunately, quantifying experience is tricky. For instance, years of conducting surgeries is not an unambiguously determinable number. Would we count an experimenter who did surgery every day for a year as having the same experience as an experimenter who did surgery once/month for a year? Would we count a surgeon with expertise in other areas (e.g., windows for imaging) in the same way as surgeons with expertise in ephys-specific surgeries? Because of the ambiguities, we leave this analysis to be the subject of future work; this is now stated in the Discussion (line 476).

      (3) They note that UCLA and UW datasets tended to miss deeper brain region targets (lines 185-188) - they do not speculate why these labs show systematic differences. Were they not following standardized procedures?

      Thank you for raising this point. All researchers across labs were indeed following standardised procedures. We note that our statistical analysis of probe targeting coordinates and angles did not reveal a significant effect of lab identity on targeting error, even though we noted the large number of mis-targeted recordings in UCLA and UW to help draw attention to the appropriate feature in the figure. Given that these differences were not statistically significant, we can see how it was misleading to call out these two labs specifically. While the overall probe placement surface error and angle error both show no such systematic difference, the magnitude of surface error showed a non-significant tendency to be higher for samples in UCLA & UW, which, compounded with the direction of probe angle error, caused these probe insertions to land in a final location outside LP & PO.

      This shows how subtle differences in probe placement & angle accuracy can lead to compounded inaccuracies at the probe tip, especially when targeting deep brain regions, even when following standard procedures. We believe this is driven partly by the accuracy limit or resolution of the stereotaxic system, along with slight deviations in probe angle, occurring during the setup of the stereotaxic coordinate system during these recordings.
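
      As a rough geometric illustration of this compounding (a sketch only, not the analysis code used for the paper), the lateral offset at the probe tip can be approximated as the surface placement error plus the insertion depth times the tangent of the angular error. The function name, the assumption that both errors lie in the same plane and point the same way, and all numerical values below are hypothetical.

```python
import math


def tip_offset_um(surface_error_um, angle_error_deg, depth_um):
    """Approximate lateral offset at the probe tip, in microns.

    Assumes a straight trajectory, with the surface error and the
    angular error lying in the same plane and pointing the same way
    (the worst case, in which the two errors add).
    """
    angular_part = depth_um * math.tan(math.radians(angle_error_deg))
    return surface_error_um + angular_part


# Hypothetical numbers, chosen only to show the trend with depth.
shallow_target = tip_offset_um(surface_error_um=300.0, angle_error_deg=5.0, depth_um=1000.0)
deep_target = tip_offset_um(surface_error_um=300.0, angle_error_deg=5.0, depth_um=4000.0)

print(f"offset at 1 mm depth: {shallow_target:.0f} um")
print(f"offset at 4 mm depth: {deep_target:.0f} um")
```

      Under these assumptions the same surface and angle errors produce a larger miss at a deep target than at a shallow one, which is the effect described above for LP and PO.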

      We have updated the relevant text in lines 187-190 as follows, to clarify:

      “Several trajectories missed their targets in deeper brain regions (LP, PO), as indicated by gray blocks, despite the lack of significant lab-dependent effects in targeting as reported above. These off-target trajectories tended to have both a large displacement from the target insertion coordinates and a probe angle that unfavorably drew the insertions away from thalamic nuclei (Figure 2f).”

      (4) The authors suggest that geometrical variance (difference between planned and final identified probe position acquired from reconstructed histology) in probe placement at the brain surface is driven by inaccuracies in defining the stereotaxic coordinate system, including discrepancies between skull landmarks and the underlying brain structures. In this case, the use of skull landmarks (e.g. bregma) to determine locations of brain structures might be unreliable and provide an error of ~360 microns. While it is known that there is indeed variance in the position between skull landmarks and brain areas in different animals, the quantification of this error is a useful value for the field.

      We thank the reviewer for their thoughtful comment and are glad that they found the quantification of variance useful for the field.

      (5) Why are the thalamic recording results particularly hard to reproduce? Does the anatomy of the thalamus simply make it more sensitive to small errors in probe positioning relative to the other recorded areas?

      We thank the reviewer for raising this interesting question. We believe that they are referring to Figure 4: indeed when we analyzed the distribution of firing rate modulations, we saw some failures of reproducibility in area PO (bottom panel, Figure 4h). However, the thalamic nuclei were not, in other analyses, more vulnerable to failures in reproducibility. For example, in the top panel of Figure 4h, VisAM shows failures of reproducibility for modulation by the visual stimulus. In Fig. 5i, area CA1 showed a failure of reproducibility. We fear that the figure legend title in the previous version (which referred to the thalamus specifically) was misleading, and we have revised this. The new title is, “Neural activity is modulated during decision-making in five neural structures and is variable between laboratories.” This new text more accurately reflects that there were a number of small, idiosyncratic failures of reproducibility, but that these were not restricted to a specific structure. The new analysis requested by R1 (now in Figure 7) provides further reassurance of overall reproducibility, including in the thalamus (see Fig. 7a, right panels; lab identity could not be decoded from single neuron metrics, even in the thalamus).

      Reviewer #1 (Recommendations for the authors):

      (1) Figure font sizes and formatting are variable across panels and figures. Please streamline the presentation of results.

      Thank you for your feedback. We have remade all figures with the same standardized font sizes and formatting.

      (2) Please correct the noncontinuous color scales in Figures 3b and 3d.

      Thank you for pointing this out; we have fixed the color bar.

      (3) In Figures 5d and g, the error bars are described as: 'Error bands are standard deviation across cells normalised by the square root of the number of sessions in the region'. How does one interpret this error? It seems to be related to the standard error of the mean (std/sqrt(n)) but instead of using the n from which the standard deviation is calculated (in this case across cells), the authors use the number of sessions as n. If they took the standard deviation across sessions this would be the sem across sessions, and interpretable (as sem*1.96 is the 95% parametric confidence interval of the mean). Please justify why these error bands are used here and how they can be interpreted - it also seems like it is the only time these types of error bands are used.

      We agree. For clarity, we now use the standard error across cells; the error bars do not change dramatically either way.
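
      To make the distinction raised by the reviewer concrete, the sketch below contrasts the conventional standard error of the mean across cells with the earlier quantity (standard deviation across cells divided by the square root of the number of sessions). It is only an illustration: the per-cell values and the session count are made up, and the helper names are hypothetical.

```python
import math
import statistics


def sem(values):
    """Standard error of the mean: sample std divided by sqrt(n of the same sample)."""
    return statistics.stdev(values) / math.sqrt(len(values))


def std_over_sqrt_sessions(values, n_sessions):
    """Earlier band: std across cells, normalised by sqrt(number of sessions)."""
    return statistics.stdev(values) / math.sqrt(n_sessions)


# Hypothetical per-cell values pooled from 3 sessions.
cell_values = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]
n_sessions = 3

sem_cells = sem(cell_values)
mixed_band = std_over_sqrt_sessions(cell_values, n_sessions)
ci95_half_width = 1.96 * sem_cells  # parametric 95% interval of the mean

print(f"SEM across cells:        {sem_cells:.3f}")
print(f"std / sqrt(n_sessions):  {mixed_band:.3f}")
print(f"95% CI half-width:       {ci95_half_width:.3f}")
```

      Because there are at least as many cells as sessions, the session-normalised band is never narrower than the SEM across cells, and only the latter has the usual confidence-interval interpretation.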

      (4) It is difficult to understand what is plotted in Figures 5e,h, please unpack this further and clarify.

      Thank you for pointing this out. We have added additional explanation in the figure caption (See caption for Figure 5c) to explain the KS test.

      (5) In lines 198-201 the authors state that they were worried that Bonferroni correction with 5 criteria would be too lenient, and therefore used 0.01 as alpha. I am unsure whether the authors mean that they are correcting for multiple comparisons across features or areas. Either way, 0.01 alpha is exactly what a Bonferroni corrected alpha would be when correcting for either 5 features or 5 areas: 0.05/5=0.01. Or do they mean they apply the Bonferroni correction to the new 0.01 alpha: i.e., 0.01/5=0.002? Please clarify.

      Thank you, that was indeed written confusingly. We considered all tests and regions as a whole, so 7 tests * 5 regions = 35 tests, which would result in a very strong Bonferroni correction. Indeed, if one considers the different tests individually, the correction we apply from 0.05 to 0.01 can be considered as correcting for the number of regions, which we now highlight better. We apply no further corrections of any kind to our alpha=0.01. We clarified this in the manuscript in all relevant places (lines 205-208, 246, 297-298, and 726-727).
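
      The arithmetic behind the two options can be written out as follows. This is a minimal sketch with a hypothetical helper name; the counts (5 regions, 7 tests) are the ones stated above.

```python
def bonferroni_alpha(family_alpha, n_comparisons):
    """Per-test significance threshold under a Bonferroni correction."""
    return family_alpha / n_comparisons


n_regions = 5
n_tests = 7

# Correcting only for the number of regions gives the alpha we use.
alpha_per_region = bonferroni_alpha(0.05, n_regions)

# Correcting for every test-by-region combination would be much stricter.
alpha_all_combinations = bonferroni_alpha(0.05, n_tests * n_regions)

print(f"corrected for {n_regions} regions: {alpha_per_region:.4f}")
print(f"corrected for {n_tests * n_regions} tests:   {alpha_all_combinations:.4f}")
```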

      (6) Did the authors take into account how many times a probe was used/how clean the probe was before each recording? Was this streamlined between labs? This can have an effect on yield and quality of recording.

      We appreciate the reviewer highlighting the potential impact of probe use and cleanliness on recording quality and yield. While we did not track the number of times each probe was used, we ensured that all probes were cleaned thoroughly after each use using a standardized cleaning protocol (Section 16: Cleaning the electrode after data acquisition in Appendix 2: IBL protocol for electrophysiology recording using Neuropixels probe). We acknowledge that tracking the specific usage history of each probe could provide additional insights, but unfortunately we did not track this information for this project. In prior work the re-usability of probes has been quantified, showing insignificant degradation with use (e.g. Extended Data Fig 7d from Jun et al. 2017).

      (7) Figure 3, Supplement1: DY_013 missed DG entirely? Was this included in the analysis?

      Thank you for this question. We believe the reviewer is referring to the lack of a prominent high-amplitude LFP band in this mouse, and lack of high-quality sorted units in that region. Despite this, our histology did localize the recording trajectory to DG. This recording did pass our quality control criteria overall, as indicated by the green label, and was used in relevant analyses.

      The lack of normal LFP features and neuron yield might reflect the range of biological variability (several other sessions also have relatively weak DG LFP and yield, though DY_013 is the weakest), or could reflect some damage to the tissue, for example as caused by local bleeding. Because we could not conclusively identify the source of this observation, we did not exclude it.

      (8) Given that the authors argue for using the MTNN over GLMs, it would be useful to know exactly how much better the MTNN is at predicting activity in the held-out dataset (shown in Figure 7, Supplement 1). It looks like a very small increase in prediction performance between MTNN and GLMs, is it significantly different?

      The average variance explained on the held-out dataset, as shown in Figure 8–Figure Supplement 1 Panel B, is 0.065 for the GLMs and 0.071 for the MTNN. As the reviewer correctly noted, this difference is not significant. However, one of the key advantages of the MTNN over GLMs lies in its flexibility to easily incorporate covariates, such as electrophysiological characteristics or session/lab IDs, directly into the analysis. This feature is particularly valuable for assessing effect sizes and understanding the contributions of various factors.

      (9) In line 723: why is the threshold for mean firing rate for a unit to be included in the MTNN results so high (>5Hz), and how does it perform on units with lower firing rates?

      We thank the reviewer for pointing this out. The threshold for including units with a mean firing rate above 5 Hz was set because most units with firing rates below this threshold were silent in many trials, and reducing the number of units helped keep the MTNN training time reasonable. Based on this comment, we ran the MTNN experiments including all units with firing rates above 1 Hz, and the results remained consistent with our previous conclusions (Figure 8). Crucially, the leave-one-out analysis consistently showed that lab and session IDs had effect sizes close to zero, indicating that both within-lab and between-lab random effects are small and comparable.

      Reviewer #2 (Recommendations for the authors):

      (1) Most of the more major issues were already listed in the above comments. The strongest recommendation for additional work would be to improve the description and implementation of the RIGOR statistics such that non-IBL labs that might use Neuropixels probes but not use the entire IBL pipeline might be able to apply the RIGOR framework to their own data.

      We thank the reviewer for highlighting the importance of making the RIGOR statistics more accessible to a broader audience. We agree that improving the description and implementation of the RIGOR framework is essential for helping non-IBL labs that use Neuropixels probes to adopt it. To address this, we created a Jupyter notebook with step-by-step guidance that is not dependent on the IBL pipeline. This tool (https://github.com/int-brain-lab/paper-reproducible-ephys/blob/develop/RIGOR_script.ipynb) is publicly available through the repository, accompanied by example datasets and usage tutorials.

      (2) Table 1: How are qualitative features like "drift" defined? Some quantitative statistics like "presence ratio" (the fraction of the dataset where spikes are present) already exist in packages like ecephys_spike_sorting. Who measured these qualitative features? What are the best practices for doing these qualitative analyses?

      At the probe level, we compute the estimate of the relative motion of the electrodes to the brain tissue at multiple depths along the electrode. We overlay the drift estimation over a raster plot to detect sharp displacements as a function of time. Quantitatively, the drift is the cumulative absolute electrode motion estimated during spike sorting (µm). We clarified the corresponding text in Table 1.

      The qualitative assessments were carried out by IBL staff and experimentalists. We have now provided code to run the RIGOR metrics along with an embedded tutorial, to complement the supplemental figures we have shown about qualitative metric interpretation.
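
      For readers who want a concrete picture of the two quantitative definitions mentioned in this exchange, the sketch below computes a presence ratio (the fraction of time bins that contain at least one spike) and a cumulative absolute drift from a motion trace. It is an assumption-laden illustration and not the IBL or ecephys_spike_sorting implementation: the function names, the choice of 100 bins, and the use of a single motion trace are all hypothetical.

```python
def presence_ratio(spike_times_s, duration_s, n_bins=100):
    """Fraction of equal-width time bins that contain at least one spike."""
    bin_width = duration_s / n_bins
    occupied = {
        min(int(t / bin_width), n_bins - 1)  # clamp a spike at t == duration_s
        for t in spike_times_s
        if 0 <= t <= duration_s
    }
    return len(occupied) / n_bins


def cumulative_drift_um(motion_um):
    """Sum of absolute step-to-step changes in an estimated motion trace (microns)."""
    return sum(abs(b - a) for a, b in zip(motion_um, motion_um[1:]))


# Hypothetical inputs: four spikes in a 10 s recording, and a short motion trace.
example_ratio = presence_ratio([0.5, 1.5, 2.5, 3.5], duration_s=10.0, n_bins=10)
example_drift = cumulative_drift_um([0.0, 5.0, 3.0, 3.0, 10.0])

print(f"presence ratio: {example_ratio}")
print(f"cumulative drift: {example_drift} um")
```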

      (3) Table 1: What are the units for the LFP derivative?

      We thank the reviewer for noting that the unit was missing. The unit (decibel per unit of space) is now in the table.

      (4) Table 1: For "amplitude cutoff", the table says that "each neuron must pass a metric". What is the metric?

      We have revised the table to include this information. This metric was designed to detect potential issues in amplitude distributions caused by thresholding during deconvolution, which could result in missed spikes. There are quantitative thresholds on the distribution of the low tail of the amplitude histogram relative to the high tail, and on the relative magnitude of the bins in the low tail. We now reference the methods text from the table, which includes a more extended description and gives the specific threshold numbers. Also, the metric and thresholds are more easily understood with graphical assistance; see the IBL Spike Sorting Whitepaper for this (Fig. 17 in that document and nearby text; https://doi.org/10.6084/m9.figshare.19705522.v4). This reference is now also cited in the text.

      (5) Figure 2: In panel A, the brain images look corrupted.

      Thanks; in the revised version we have changed the filetype to improve the quality of the panel image.

      (6) Figure 7: In panel D, make R2 into R^2 (with a superscript)

      Panel D y-axis label has been revised to include superscript (note that this figure is now Figure 8).

      Works Cited

      Julie M.J. Fabre, Enny H. van Beest, Andrew J. Peters, Matteo Carandini, and Kenneth D. Harris. Bombcell: automated curation and cell classification of spike-sorted electrophysiology data, July 2023. URL https://doi.org/10.5281/zenodo.8172822.

      James J. Jun, Nicholas A. Steinmetz, Joshua H. Siegle, Daniel J. Denman, Marius Bauza, Brian Barbarits, Albert K. Lee, Costas A. Anastassiou, Alexandru Andrei, Çağatay Aydın, Mladen Barbic, Timothy J. Blanche, Vincent Bonin, João Couto, Barundeb Dutta, Sergey L. Gratiy, Diego A. Gutnisky, Michael Häusser, Bill Karsh, Peter Ledochowitsch, Carolina Mora Lopez, Catalin Mitelut, Silke Musa, Michael Okun, Marius Pachitariu, Jan Putzeys, P. Dylan Rich, Cyrille Rossant, Wei-lung Sun, Karel Svoboda, Matteo Carandini, Kenneth D. Harris, Christof Koch, John O’Keefe, and Timothy D. Harris. Fully integrated silicon probes for high-density recording of neural activity. Nature, 551(7679):232–236, Nov 2017. ISSN 1476-4687. doi: 10.1038/nature24636. URL https://doi.org/10.1038/nature24636.

      Simon Musall, Xiaonan R. Sun, Hemanth Mohan, Xu An, Steven Gluf, Shu-Jing Li, Rhonda Drewes, Emma Cravo, Irene Lenzi, Chaoqun Yin, Björn M. Kampa, and Anne K. Churchland. Pyramidal cell types drive functionally distinct cortical activity patterns during decision-making. Nature Neuroscience, 26(3):495–505, Mar 2023. ISSN 1546-1726. doi: 10.1038/s41593-022-01245-9. URL https://doi.org/10.1038/s41593-022-01245-9.

      Ivana Orsolic, Maxime Rio, Thomas D Mrsic-Flogel, and Petr Znamenskiy. Mesoscale cortical dynamics reflect the interaction of sensory evidence and temporal expectation during perceptual decision-making. Neuron, 109(11):1861–1875.e10, April 2021.

      Hyeong-Dong Park, Stéphanie Correia, Antoine Ducorps, and Catherine Tallon-Baudry. Spontaneous fluctuations in neural responses to heartbeats predict visual detection. Nature Neuroscience, 17(4):612–618, Apr 2014. ISSN 1546-1726. doi: 10.1038/nn.3671. URL https://doi.org/10.1038/nn.3671.

      Lorenzo Posani, Shuqi Wang, Samuel Muscinelli, Liam Paninski, and Stefano Fusi. Rarely categorical, always high-dimensional: how the neural code changes along the cortical hierarchy. bioRxiv, 2024. doi: 10.1101/2024.11.15.623878. URL https://www.biorxiv.org/content/early/2024/12/09/2024.11.15.623878.

      Nicholas A. Steinmetz, Christina Buetfering, Jerome Lecoq, Christian R. Lee, Andrew J. Peters, Elina A. K. Jacobs, Philip Coen, Douglas R. Ollerenshaw, Matthew T. Valley, Saskia E. J. de Vries, Marina Garrett, Jun Zhuang, Peter A. Groblewski, Sahar Manavi, Jesse Miles, Casey White, Eric Lee, Fiona Griffin, Joshua D. Larkin, Kate Roll, Sissy Cross, Thuyanh V. Nguyen, Rachael Larsen, Julie Pendergraft, Tanya Daigle, Bosiljka Tasic, Carol L. Thompson, Jack Waters, Shawn Olsen, David J. Margolis, Hongkui Zeng, Michael Hausser, Matteo Carandini, and Kenneth D. Harris. Aberrant cortical activity in multiple gcamp6-expressing transgenic mouse lines. eNeuro, 4(5), 2017. doi: 10.1523/ENEURO.0207-17.2017. URL https://www.eneuro.org/content/4/5/ENEURO.0207-17.2017.

      Nicholas A. Steinmetz, Peter Zatka-Haas, Matteo Carandini, and Kenneth D. Harris. Distributed coding of choice, action and engagement across the mouse brain. Nature, 576(7786):266–273, Dec 2019. ISSN 1476-4687. doi: 10.1038/s41586-019-1787-x. URL https://doi.org/10.1038/s41586-019-1787-x.

      Nicholas A. Steinmetz, Cagatay Aydin, Anna Lebedeva, Michael Okun, Marius Pachitariu, Marius Bauza, Maxime Beau, Jai Bhagat, Claudia Böhm, Martijn Broux, Susu Chen, Jennifer Colonell, Richard J. Gardner, Bill Karsh, Fabian Kloosterman, Dimitar Kostadinov, Carolina Mora-Lopez, John O’Callaghan, Junchol Park, Jan Putzeys, Britton Sauerbrei, Rik J. J. van Daal, Abraham Z. Vollan, Shiwei Wang, Marleen Welkenhuysen, Zhiwen Ye, Joshua T. Dudman, Barundeb Dutta, Adam W. Hantman, Kenneth D. Harris, Albert K. Lee, Edvard I. Moser, John O’Keefe, Alfonso Renart, Karel Svoboda, Michael Häusser, Sebastian Haesler, Matteo Carandini, and Timothy D. Harris. Neuropixels 2.0: A miniaturized high-density probe for stable, long-term brain recordings. Science, 372(6539):eabf4588, 2021. doi: 10.1126/science.abf4588. URL https://www.science.org/doi/abs/10.1126/science.abf4588.

      Charlie Windolf, Han Yu, Angelique C. Paulk, Domokos Meszéna, William Muñoz, Julien Boussard, Richard Hardstone, Irene Caprara, Mohsen Jamali, Yoav Kfir, Duo Xu, Jason E. Chung, Kristin K. Sellers, Zhiwen Ye, Jordan Shaker, Anna Lebedeva, Manu Raghavan, Eric Trautmann, Max Melin, João Couto, Samuel Garcia, Brian Coughlin, Csaba Horváth, Richárd Fiáth, István Ulbert, J. Anthony Movshon, Michael N. Shadlen, Mark M. Churchland, Anne K. Churchland, Nicholas A. Steinmetz, Edward F. Chang, Jeffrey S. Schweitzer, Ziv M. Williams, Sydney S. Cash, Liam Paninski, and Erdem Varol. Dredge: robust motion correction for high-density extracellular recordings across species. bioRxiv, 2023. doi: 10.1101/2023.10.24.563768. URL https://www.biorxiv.org/content/early/2023/10/29/2023.10.24.563768.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      This study extends the previous interesting work of this group to address the potentially differential control of movement and posture. Their earlier work explored a broad range of data to make the case for a downstream neural integrator hypothesized to convert descending velocity movement commands into postural holding commands. Included in that data were observations from people with hemiparesis due to stroke. The current study uses similar data but pushes into a different, but closely related direction, suggesting that these data may address the independence of these two fundamental components of motor control. I find the logic laid out in the second sentence of the abstract ("The paretic arm after stroke is notable for abnormalities both at rest and during movement, thus it provides an opportunity to address the relationships between control of reaching, stopping, and stabilizing") less than compelling, but the study does make some interesting observations. Foremost among them, is the relation between the resting force postural bias and the effect of force perturbations during the target hold periods, but not during movement. While this interesting observation is consistent with the central mechanism the authors suggest, it seems hard to me to rule out other mechanisms, including peripheral ones. 

      Response 1.1. Thank you for your comments, which we address in detail below and in our response to Recommendations to the authors (see pp. 15-19 of this letter). We would first like to clarify the motivation behind our use of a stroke population to understand the interactions between the control of reaching and holding. We agree that this idea can be laid out in a more compelling way.

      The fact that stroke patients usually display issues with their control of both reaching and holding allows for within-individual comparisons of those two modes of control. Further, the magnitude of abnormalities is relatively large, making it easier to measure, compare and investigate effects. And, importantly, these two modes of control can be differentially affected after stroke (also pointed out by Reviewer 2, point 4 in Comments to the Authors). Finally, this kind of work – examining interactions between positive signs of stroke (such as abnormal posture or synergy) vs. negative signs (such as loss of motor control) – needs to be done in humans, as positive signs are relatively absent even in primates (Tower, 1940).

      We have changed our abstract (changes shown below in red), and our intro (expanding the second paragraph, lines 75-76), to lay out our motivation more clearly.

      From the abstract:

      “The paretic arm after stroke exhibits different abnormalities during rest vs. movement, providing an opportunity to ask whether control of these behaviors is independently affected in stroke.”

      On the other hand, the relation between force bias and the well-recognized flexor synergy seems rather self-evident, and I don't see that these results add much to that story.

      Response 1.2. While it seems natural that these biases would be the resting expression of abnormal flexor synergies (given their directionality towards the body, as shown in Figures 2-3, and the other similarities we demonstrate in Figure 8), we do not believe it is self-evident. These biases are measured at rest, with the patient passively moved and held still, whereas abnormal synergies emerge when the patient actively tries to move. The lack of relationship we find between these resting force biases and active movement underlines that the relation between force bias and flexor synergy should not be taken as self-evident, making it worthwhile to examine it (as we motivate in lines 589-596 and show in Figure 8).

      The paradox here is that, in spite of a relationship between force bias and flexor synergy (itself manifesting during attempted movement), there seems to be no relationship between force bias and direct measures of active movement (Figures 5,6). This is the paradox that inspired our conceptual model (Figure 9) and inspires us to investigate further the factors under which these two systems are intermingled or kept separate. We thus find it to be a helpful element in the story.

      I am also struck by what seems to be a contradiction between the conclusions of the current and former studies: "These findings in stroke suggest that moving and holding still are functionally separable modes of control" and "the commands that hold the arm and finger at a target location depend on the mathematical integration of the commands that moved the limb to that location." The former study is mentioned here only in passing, in a single phrase in the discussion, with no consideration of the relation between the two studies. This is odd and should be addressed. 

      Response 1.3. While these two sets of findings are not contradictory, we understand how they can appear as such without providing context. We now discuss the relationship between our present study and the previous one more directly (lines 66-70 and 663-669 of the revised manuscript).

      The previous study examined how the control of movement informs the control of holding after the movement was over; the current study examines whether abnormalities in holding, measured at rest after a passive movement to the rest position, affect the control of active movement. There are thus two important distinctions:

      First, directionality of potential effects: here we examine the effect of (abnormalities in) holding control upon movement, but the 2020 study (Albert et al., 2020) examines the effects of movement upon holding control. Stroke patient data in the 2020 study showed that, under CST damage, while the reach controller is disrupted, the hold controller can continue to integrate the malformed reach commands faithfully. In line with this, we proposed a model where the postural controller system sits downstream of the moving controller (Figure 7G in the 2020 paper). We thus did not claim, in 2020, that integration of movement commands is the only way to determine posture control, as we stated explicitly back then, e.g. (emphasis ours):

      “Equations (1) and (2) describe how the integration of move activity may relate to changes in hold commands, but does not specify the hold command at the target.”

      In short, finding no effect of holding abnormalities upon movement (present finding) does not mean there is no potential effect of movement upon holding (2020 finding). This is something we had alluded to in the Discussion but not clarified, which we do now (see edits at the end of our response to this point).

      Second, active vs. passive movement: here, we measure holding control at rest (Experiment 1). The 2020 study shows that endpoint forces reflect the integration of learned dynamics exerted during active movement that led to the endpoint position. However, in Experiment 1, there is no active reaching to integrate, as the robot passively moves the arm to the held position. Thus, resting postural forces measured in Experiment 1 could not reflect the integration of reach commands that led to each rest position.  

      Thus, the two sets of findings are not contradictory. Taking our current and 2020 findings together suggests that active holding control would reflect both the integration of movement control that led to assuming the held position, plus the force biases measured at rest.
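
      The logic of this combined picture can be expressed as a toy calculation. The sketch below is purely illustrative and is not the model or equations from Albert et al. (2020): the function name, the unit gain, the time step, and all values are hypothetical. It only shows that when there are no active move commands to integrate (as with passive positioning in Experiment 1), the hold-period output reduces to the resting bias.

```python
def hold_command(move_commands, dt, resting_bias=0.0, gain=1.0):
    """Toy hold-period command: resting bias plus the time integral of move commands."""
    integrated_move = sum(m * dt for m in move_commands)
    return resting_bias + gain * integrated_move


# Active reach: a short burst of (hypothetical) move commands precedes the hold.
active_hold = hold_command([1.0, 2.0, 1.0], dt=0.1, resting_bias=0.5)

# Passive positioning: nothing to integrate, so only the resting bias remains.
passive_hold = hold_command([], dt=0.1, resting_bias=0.5)

print(f"hold command after active reach:     {active_hold:.2f}")
print(f"hold command after passive movement: {passive_hold:.2f}")
```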

      Hence our decision to describe these two systems as functionally separable: while these systems can interact, the effects of post-stroke malfunctions in each can be independent depending on the function and conditions at hand. This does not make this a limited finding: being able to dissociate post-stroke impairment based on each of these two modes of control may inform rehabilitation, and also importantly, understanding the conditions in which these two modes of control become separable can substantially advance our understanding of both how different stroke signs interact with each other and how motor control is assembled in the healthy motor system. Figure 9 illustrates our conceptual model behind this and may serve as a blueprint to further dissect these circuits in the future.

      We discuss these issues briefly in lines 663-669 in our Discussion section, reproduced below for convenience:

      “It should be noted, however, that having distinct neural circuits for reaching and holding does not rule out interactions between them. For example, we recently demonstrated how arm holding control reflects the integration of motor commands driving the preceding active movement that led to the hold position, in both healthy participants and patients with hemiparesis (Albert et al., 2020). However, in that paper, we did not claim that this integration is the only source of holding control. Indeed, in Experiment 1 of the current study, we used passive movement to bring the arm to each probed position, which means that the postural biases could not be the result of integration of motor commands.” 

      And, we have adjusted our Introduction to provide pertinent context regarding our 2020 work (first paragraph, lines 66-70 of the updated manuscript).

      A minor wording concern I had is that the term "holding still" is frequently hard to parse. A couple of examples: "These findings in stroke suggest that moving and holding still are functionally separable modes of control." This example is easily read, "moving and holding [continue to be] functionally separable". Another: "...active reaching and holding still in the same workspace, " could be "...active reaching and holding [are] still in the same workspace." Simply "holding", "posture" or "posture maintenance" would all be better options.

      Response 1.4. Thank you for your suggestion. Following your comment, we have abbreviated this term to simply “holding”, both on the title and throughout the text.

      Reviewer #2 (Public Review):

      Summary: 

      Here the authors address the idea that postural and movement control are differentially impacted with stroke. Specifically, they examined whether resting postural forces influenced several metrics of sensorimotor control (e.g., initial reach angle, maximum lateral hand deviation following a perturbation, etc.) during movement or posture. The authors found that resting postural forces influenced control only following the posture perturbation for the paretic arm of stroke patients, but not during movement. They also found that resting postural forces were greater when the arm was unsupported, which correlated with abnormal synergies (as assessed by the Fugl-Meyer). The authors suggest that these findings can be explained by the idea that the neural circuitry associated with posture is relatively more impacted by stroke than the neural circuitry associated with movement. They also propose a conceptual model that differentially weights the reticulospinal tract (RST) and corticospinal tract (CST) to explain greater relative impairments with posture control relative to movement control, due to abnormal synergies, in those with stroke.

      Strengths: 

      The strength of the paper is that they clearly demonstrate with the posture task (i.e., active holding against a load) that the resting postural forces influence subsequent control (i.e., the path to stabilize, time to stabilize, max. deviation) following a sudden perturbation (i.e., sudden removal of the load). Further, they can explain their findings with a conceptual model, which is depicted in Figure 9.

      Weaknesses: 

      Current weaknesses and potential concerns relate to i) not displaying or reporting the results of healthy controls and non-paretic arm in Experiment 2 and ii) large differences in force perturbation waveforms between movement (sudden onset) and posture (sudden release), which could potentially influence the results and/or interpretation.

      Response 2.0. Thank you for your assessment, and for pointing out ways to improve our paper. We address the weaknesses and potential concerns in detail below.

      Larger concerns

      (1) Additional analyses to further support the interpretation. In Experiment 1 the authors present the results for the paretic arm, non-paretic arm, and controls. However, in Experiment 2 for several key analyses, they only report summary statistics for the paretic arm (Figure 5D-I; Figure 6D-E; Figure 7F). It is understood that the controls have much smaller resting postural force biases, but they are still present (Figure 3B). It would strengthen the position of the paper to show that controls and the non-paretic arm are not influenced by resting postural force biases during movement and particularly during posture, while acknowledging the caveat that the resting positional forces are smaller in these groups. It is recommended that the authors report and display the results shown in Figure 5D-I; Figure 6D-E; Figure 7F for the controls and non-paretic arm. If these results are all null, the authors could alternatively place these results in an additional supplementary. 

      Response 2.1a. Thank you for your recommendations. We agree both on the value of these analyses and the caveat associated with them: these resting postural force biases are substantially smaller for the non-paretic and control data (for example, the magnitude of resting biases in the supported condition is 2.8±0.4N for the paretic data, but only 1.8±0.4N and 1.3±0.2N for the non-paretic and control data, respectively; the difference is even greater in the unsupported condition, though this is not the one being compared to Experiment 2).

      We now conduct a comprehensive series of supplementary analyses, including the examination of non-paretic and control data for all three components of Experiment 2 (unperturbed reaches; pulse perturbations; and active holding control). These are mentioned in the Results (lines 422-424, 512-513, and 574-574 of the revised manuscript) and illustrated in the supplementary materials: Supplementary Figures S5-1, S6-1, and S7-1 contain the main analyses (comparisons of instances with the most extreme resting biases for each individual) for the unperturbed reach analysis, pulse perturbation analysis, and active holding control analysis, respectively.

      We find that non-paretic and control data do not display effects of resting biases upon unperturbed reaching control (Figure S5-1) or control against a pulse perturbation early during movement (Figure S6-1) – as is the case with the paretic data. Non-paretic and control data do not display evidence of influence of their resting force biases upon active holding control either (Figure S7-1), unlike the paretic data. For the non-paretic data, however, these influences are nominally towards the same direction as in the paretic data. Given that resting biases are substantially weaker for the non-paretic case, it is possible a similar relationship exists but requires increased statistical power to discern. Moreover, it is possible that the effect of resting biases is non-linear, with small biases effectively kept under check so that their impact upon active holding control is even less than a linearly scaled version of the impact of the stronger, paretic-side biases. This can be the subject of future work.

      Please also note that, following your recommendation (Recommendations to the Authors, point 2.1), we have conducted secondary analyses which estimate sensitivity to resting bias using all datapoints, validating our main analyses; these analyses were also performed for control and non-paretic data, with similar results (Response 2.A.1).

      Further, the results could be further boosted by reporting/displaying additional analyses. In Figure 6D the authors performed a correlation analysis. Can they also display the same analysis for initial deviation and endpoint deviation for the data shown in Figure 5D-F & 5G-I, as well for 7F for the path to stabilization, time to stabilization, and max deviation? This will also create consistency in the analyses performed for each dependent variable across the paper.

      Response 2.1b. Here, we set out to test whether resting biases affect movement. It is best to do this using a within-individual comparison design, rather than using across-individual correlations: while correlation analyses can in general be informative, they obscure within-individual effects, which are the main comparisons of interest in our study. Consider a participant with strong resting bias towards one direction, tested on opposing perturbations; averaging these responses for each individual would mostly cancel out any effects of resting biases. Even if we were to align responses to the direction of the perturbation before averaging, the power of correlation analyses may be diluted by inter-individual differences in other factors, such as overall stiffness.
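
      The cancellation argument above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the analysis used in the manuscript: the linear outcome model, the parameter names (`sensitivity`, `offset`, `noise_sd`) and every number are hypothetical.

```python
import random
from statistics import mean


def simulate(n_participants=16, n_trials=40, sensitivity=0.5,
             noise_sd=0.2, seed=0):
    """Simulate a hypothetical outcome (e.g. max deviation) per participant.

    Every number here is invented for illustration; none comes from the study.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_participants):
        bias = rng.uniform(1.0, 4.0)    # resting bias magnitude (N), made up
        offset = rng.uniform(2.0, 6.0)  # individual baseline, e.g. stiffness

        def trial(direction):
            # direction = +1: perturbation aligned with the resting bias;
            # direction = -1: perturbation opposing it.
            return (offset + sensitivity * bias * direction
                    + rng.gauss(0.0, noise_sd))

        aligned = [trial(+1) for _ in range(n_trials)]
        opposed = [trial(-1) for _ in range(n_trials)]
        rows.append({
            "bias": bias,
            "offset": offset,
            # Averaging over both directions cancels the bias term.
            "pooled_mean": mean(aligned + opposed),
            # The within-individual contrast isolates it
            # (about 2 * sensitivity * bias).
            "within_diff": mean(aligned) - mean(opposed),
        })
    return rows


rows = simulate()
for r in rows[:3]:
    print(f"bias={r['bias']:.2f}  offset={r['offset']:.2f}  "
          f"pooled={r['pooled_mean']:.2f}  within_diff={r['within_diff']:.2f}")
```

      In this toy model the pooled mean tracks only the individual baseline, so correlating it with resting bias across participants would show nothing, whereas the aligned-minus-opposed contrast recovers the bias effect within each participant.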

      Thus, our analysis design was instead focused on examining the differential effects of resting posture biases within each individual’s data. We compared the most extreme opposing/aligned or clockwise/counter-clockwise instances within each individual, specifically to assess these differential effects. In our revised version, we have further reinforced these analyses to include all data rather than the most extreme instances (see response 2.A.1.a to the Reviewer’s recommendation to the authors) where we performed correlations of within-individual resting posture vs. the corresponding dependent variables and compared the resulting slopes. 
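
      A minimal sketch of a slope-based, all-datapoints analysis of this kind is given below. The response does not spell out how the slopes were compared, so the one-sample t statistic against zero is an assumption, and the helper names (`ols_slope`, `one_sample_t`) and the data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def ols_slope(x, y):
    """Least-squares slope of y on x for one individual's data."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def one_sample_t(values, mu=0.0):
    """t statistic for the mean of `values` against `mu` (df = n - 1)."""
    n = len(values)
    return (mean(values) - mu) / (stdev(values) / sqrt(n))


# Hypothetical per-individual data:
# (resting bias at each tested position, outcome at the same position).
individuals = [
    ([0.5, 1.0, 2.0, 3.0], [1.1, 1.4, 2.1, 2.6]),
    ([0.2, 0.8, 1.5, 2.5], [0.9, 1.0, 1.6, 2.2]),
    ([1.0, 1.5, 2.5, 3.5], [1.5, 1.6, 2.4, 2.7]),
]

# One slope per individual, so baseline differences between people drop out.
slopes = [ols_slope(x, y) for x, y in individuals]
print("slopes:", [round(s, 3) for s in slopes])
print("t vs. zero:", round(one_sample_t(slopes), 3))
```

      Fitting one slope per individual keeps the comparison within-individual while still using every datapoint rather than only the most extreme instances.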

      The across-individual correlation analyses add little to that for the reasons we outlined above. At the same time, it is possible they can be helpful in e.g. illustrating across-individual variability. We thus now include across-individual correlation analyses for all dependent variables, but, given their limited value, only in the supplementary material. This also means that, for consistency, we moved the correlation analysis in Figure 6 to the corresponding supplementary figure as well (Figure S6-3).

      In addition, following the Reviewer’s comment about consistency in the analyses performed for each dependent variable across the paper, we added within-individual comparisons for settling time following the pulse perturbations (Figure 6D, right).

      (2) Inconsistency in perturbations that would differentially impact muscle and limb states during movement and posture. It is well known that differences in muscle state (activation / preloaded, muscle fiber length and velocity) and limb state (position and velocity) impact sensorimotor control (Pruszynski, J. A., & Scott, S. H. (2012). Experimental brain research, 218, 341-359.). Of course, it is appreciated that it is not possible to completely control all states when comparing movement and posture (i.e., muscle and limb velocity). However, using different perturbations differentially impacts muscle and limb states. Within this paper, the authors used very different force waveforms for movement perturbations (i.e., 12 N peak, bell-shaped, 0.7ms duration -> sudden force onset to push the limb; Figure 6A) and posture perturbations (i.e., 6N, 2s ramp up -> 3s hold -> sudden force release that resulted in limb movement; Figure 4) that would differentially impact muscle (and limb) states. Preloaded muscle (as in the posture perturbation) has a very different response compared to muscle that has little preload (as in the movement perturbations, where muscles that would resist a sudden lateral perturbation would likely be less activated since they are not contributing to the forward movement). Would the results hold if the same perturbation had been used for both posture and movement (e.g., 12 N pulse for both experiments)? It is recommended that the authors comment and discuss in the paper why they chose different perturbations and how that might impact the results. 

      Response 2.2a. We agree that it can be impossible to completely control all states when comparing movement and posture. We would also like to stress that these perturbations were not designed so that responses are directly compared to each other (though of course there is an indirect comparison in the sense that we show influence of biases in one type of perturbation but not the other). Instead, Experiment 2 tried to implement a probe optimized for each motor control modality (moving vs. holding). However, the Reviewer has a point that the potential impact of differences between the perturbations is important to discuss in the paper.

      The Reviewer points out two potentially interesting differences between the two perturbations. First, the magnitude (6N for the posture perturbation vs. 12N for the pulse perturbation); second, the presence of background load in the posture perturbation, in contrast to the pulse perturbation.

      For the movement perturbation, we used a 12-N, 70ms pulse. This perturbation and scaled versions have been tested before in both control and patient populations (Smith et al., 2000; Fine and Thoroughman, 2006). For the holding perturbation, we used a background load to ensure that active holding control is engaged, and the duration of the probe (holding for about 5s) made using a stronger perturbation impractical: maintaining a background load at, say, 12N for that long could lead to increased fatigue.

      The question raised by the Reviewer, whether the findings would be the same if the same, 12-N pulse were used to probe both moving and holding control, is interesting to investigate. We would expect the same qualitative findings (i.e. there would still be a connection between resting posture and active holding control when the latter were probed with a 12N pulse). Recent work provides more specific insight into what to expect. Our posture perturbation task is similar to the Unload Task in (Lowrey et al., 2019), whereby a background torque is released, whereas our pulse perturbation is more similar to their Load Task, whereby a torque is imposed against no background load (though it is a step perturbation rather than a pulse). Lowrey et al., 2019 find that their Unload task is harder than the Load task, with 2x the fraction of patient trials classified as failed (with failure defined as task performance being outside of the 95% confidence interval for controls), though there are still clear effects for the Load task. 

      This suggests that the potential effects of using a pulse-like perturbation to probe posture control would likely be weaker in magnitude, all other things being equal. At the same time, however, the Load and Unload tasks in Lowrey et al., 2019 were perturbations of the same magnitude; it is thus also likely that the reduction in effect would be mitigated, or reversed, by the fact that we would be using a 12N instead of a 6N perturbation.

      A relevant consequence of the Lowrey et al., 2019 findings is that the Unload paradigm is superior in its ability to detect impairment in static, posture perturbations, and thus provides a better signal to detect potential relationships with resting posture biases. This is not surprising, as a background load further engages the control of active holding, which is what we were trying to probe in the first place.
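
      The failure criterion referred to above (patient performance falling outside the 95% interval for controls) can be sketched as follows. This is an illustration only: the interval is approximated here as the control mean ± 1.96 SD under a normality assumption, which may differ from the procedure in Lowrey et al., 2019, and the function names and scores are made up.

```python
from statistics import mean, stdev


def control_range(control_scores, z=1.96):
    """Approximate 95% range of control performance, assuming normality."""
    m, s = mean(control_scores), stdev(control_scores)
    return m - z * s, m + z * s


def failure_fraction(patient_scores, control_scores):
    """Fraction of patient trials falling outside the control range."""
    lo, hi = control_range(control_scores)
    failed = [not (lo <= s <= hi) for s in patient_scores]
    return sum(failed) / len(failed)


# Made-up scores for illustration only (arbitrary units).
controls = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.3, 1.05]
patients_load = [1.1, 1.6, 1.0, 1.2]
patients_unload = [1.7, 1.9, 1.0, 1.2]

print("Load task failure fraction:", failure_fraction(patients_load, controls))
print("Unload task failure fraction:", failure_fraction(patients_unload, controls))
```

      Comparing the two fractions is what allows one task to be described as harder than the other in terms of how often patients fall outside the control range.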

      But then why not use the same paradigm (preloading and release) for movement? There are two main reasons. First, requiring a background load throughout the experiment is unfeasible due to fatigue. Second, for the holding perturbation, we wanted to ensure that the postural control system is meaningfully engaged when the perturbation hits, hence we picked the background load. Were we to impose the same during moving (i.e. impose a lateral background load on the movement), we could be engaging posture control on top of movement control. This preloading would reduce the degree to which the pulse probe isolates movement control, and lead to intrusion of the posture control system in the movement task by design. This relates to what the Reviewer proposes in the comment below: preloading may result in postural biases, i.e. engage posture control; see below, where we argue this interpretation is within the scope of our conceptual model rather than a counter to it.

      We now explain the rationale behind our perturbation design in the Methods section (lines 211-220).

      Relatedly, an alternative interpretation of the results is that preloading muscle for stroke patients, whether by supporting the weight of one's arm (experiment 1) or statically resisting a load prior to force release (experiment 2), leads to a greater postural force bias that can subsequently influence control. It is recommended that the authors comment on this. 

      Response 2.2b. We find this interpretation valid, but we do not see how it meaningfully differs from the framework we propose. We already state that the RST may be tailored for both posture/holding control and the production of large forces (which would include muscle preloading):

      “Thus, the accumulated evidence suggests that the RST could control posture and large force production in the upper limb.” (lines 698-699 in the current version)

      “the RST, in contrast, is weighted more towards slower postural control and generation of large isometric forces” (lines 724-726 in the current version)

      And, we discuss other conditions where the RST is involved in large force production, such as power grip, and how these interact with the role of the RST in posture/holding control (lines 758-768 in the current version).

      To better explain our model, we now provide the two examples mentioned by the reviewer along with our description of the proposed role for the RST (lines 726-727):

      “…the RST, in contrast, is weighted more towards slower postural control and generation of large isometric forces (such as vertical forces for arm support, or horizontal forces for holding the arm still against a background load like in our posture/release perturbation trials).”

      We note, however, that we find resting posture abnormalities even in the presence of arm support, suggesting the involvement of the RST in holding control even when the forces involved (and the need to preload the muscle) are small.

      Reviewer #3 (Public Review): 

      The authors attempt to dissociate differences in resting vs active vs perturbed movement biases in people with motor deficits resulting from stroke. The analysis of movement utilizes techniques that are similar to those of previous motor control studies in both humans and non-human primates, to assess impairments related to sensorimotor injuries. In this regard, the authors provide additional support to the extensive literature describing movement abnormalities in patients with hemiparesis both at rest and during active movement. The authors describe their intention to separate out the contribution of holding still at a position vs active movement as a demonstration that these two aspects of motor control are controlled by two separate control regimes.

      Strengths: 

      (1) The authors utilize a device that is the same or similar to devices previously used to investigate motor control of movement in normal and impaired conditions in humans and non-human primates. This allows comparisons to existing motor control studies. 

      (2) Experiment 1 demonstrates resting flexion biases both in supported and unsupported forelimb conditions. These biases show a correlated relationship with FM-UE scores, suggesting that the degree of motor impairment and the degree of resting bias are related.

      (3) The stroke patient participant population had a wide range of both levels of impairment and time since stroke, including both sub-acute and chronic cases allowing the results to be compared across impairment levels.

      The authors describe several results from their study: 1. Postural biases were systematically toward the body (flexion) and increased with distance from the body (when the arm was more extended) and were stronger when the arm was unsupported. 2. These postural biases were correlated with FM-UE score. 3. They found no evidence of postural biases impacting movement, even when that movement was perturbed. 4. When holding a position at the end of a movement, if the position was perturbed opposite of the direction of bias, movement back to the target was improved compared to the perturbation in the direction of bias. Taken together, the authors suggest that there are at least two separate motor controls for tasks at rest versus with motion. Further, the authors propose that these results indicate that there is an imbalance between cortical control of movement (through the corticospinal tracts) and postural control (through the reticulospinal tract).

      Response 3.1. Thank you for pointing out some of the strengths of our work and summarizing our findings. A minor clarification we would like to make, related to (3), is that, while our study did enroll two patients towards the end of the subacute stage (2-3 months), the rest of the population were at the chronic stage, at one year and beyond. We thus find it very unlikely that time after stroke was the primary driver of differences in impairment in the population we studied.

      There are several weaknesses related to the interpretation of the results:

      In Experiment 1, the participants are instructed to keep their limbs in a passive position after being moved. The authors show that, in the impaired limb, these resting biases are significantly higher when the limb is unsupported and increase when the arm is moved to a more extended position.

      When supported by the air sled, the arm is in a purely passive position, not requiring the same antigravity response so will have less RST but also less CST involvement. While the unsupported task invokes more involvement of the reticulospinal tract (RST), it likely also has significantly higher CST involvement due to the increased difficulty and novelty of the task.

      If there were an imbalance in CST regulating RST as proposed by the authors, the bias should be higher in the supported condition as there should be relatively less CST activation/involvement/ modulation leading to less moderating input onto the RST and introducing postural biases. In the unsupported condition, there is likely more CST involvement, potentially leading to an increased modulatory effect on RST. If the proportion of CST involvement significantly outweighs the RST activation in the unsupported task, then it isn't obvious that there is a clear differentiation of motor control. As the degree of resting force bias and FM-UE score are correlated, an argument could be made that they are both measuring the impairment of the CST unrelated to any RST output. If it is purely the balance of CST integrity compared to RST, then the degree of bias should have been the same in both conditions. In this idea of controller vs modulator, it is unclear when this switch occurs or how to weigh individual contributions of CST vs. extrapyramidal tracts. Further, it isn't clear why less modulation on the RST would lead only to abnormal flexion.

      Response 3.2. Our model posits two mechanisms by which CST impairment would lead to increased RST involvement. The first – which is the one discussed by the Reviewer here - is a direct one, whereby weaker modulation of the RST by the CST leads to increased RST involvement. The second is an indirect one, whereby the incapacity of CST to drive sufficient motor output to deal with tasks eventually leads to increased RST drive.

      The reviewer suggests it is likely that the unsupported task demands increased activation through both the CST and the RST. If that were the case, however, it would exaggerate the effects of CST/RST imbalance after stroke compared to healthy motor control: if task conditions (lack of support) required higher CST involvement, then CST damage would have an even larger effect. In turn, this would lead to even higher RST involvement and further diminish the ability of the CST to moderate the RST. Thus, RST-driven biases would be higher in the unsupported condition.

      And, given that the CST itself is damaged and has to deal with further increased RST activation, we would not expect that the proportion of CST involvement would outweigh RST activation, but the opposite. In fact, a series of relatively recent findings suggest just this. For example,

      • Zaaimi et al. (2012) showed that unilateral CST lesions in monkeys lead to significant increases in the excitability of the contralesional RST. Interestingly, this effect was present in flexors but not extensors, potentially explaining why less modulation and/or overactivation of the RST would primarily lead to abnormal flexion.

      • McPherson et al. (further discussed in point 2.A.23, by Reviewer 2 – Recommendations to the Authors) showed that, after stroke, contralesional activity (which would include the ipsilateral RST) increases relative to ipsilesional activity (which would include the contralateral CST) (McPherson et al., 2018). The same study also provides evidence that FM-UE may primarily reflect RST-driven impairment. The ipsilateral(RST)/contralateral(CST) balance, expressed as a laterality index, correlated with FM-UE, with lower FM-UE for indices indicating higher RST involvement. (Interestingly, the slope of this relationship was steeper when the laterality of brain activation patterns was examined under tasks with less arm support, mirroring the steeper FM-UE vs resting bias slope when arm support is absent, as shown in our Figure 8).

      • Wilkins et al. (2020) found that providing less support (i.e. requiring increased shoulder abduction) increases ipsilateral activation (representing RST) relative to contralateral activation (representing CST).

      This resting bias could be explained by an imbalance in the activation of flexors vs extensors which follows the results that this bias is larger as the arm is extended further, and/or in a disconnect in sensory integration that is overcome during active movement. Neither would necessitate separate motor control for holding vs active movement. 

      Response 3.3. We do not think that either of these points necessarily argues against our model. First, the resting biases we observe are clearly pointed towards increased flexion, and can thus be seen as the outcome of an imbalance in the activation of flexors vs. extensors at rest. This imbalance between flexors/extensors can also be explained by the CST/RST imbalance posited by our conceptual model: in their study of CST lesions in the monkey, Zaaimi et al. (2012) found increased RST activation for flexors but not extensors, suggesting that RST over-involvement may specifically lead to flexor abnormalities. Second, overcoming a disconnect in sensory integration may be one way the motor system switches between separate controllers; how this switch happens is not examined by our conceptual model.

      In Experiment 2, the participants are actively moving to and holding at targets for all trials while being supported by the air sled. Even with the support, the paretic participants all showed start- and endpoint force biases around the movement despite not showing systematic deviations in force direction during active movement start or stop. There could be several factors that limit systematic deviations in force direction. The most obvious is that the measured biases are significantly higher when the limb is unsupported and by testing with a supported limb the authors are artificially limiting any effect of the bias.

      Response 3.4. We do expect, in line with what the reviewer suggests, that any potential effects would be stronger in the unsupported condition. We chose to test active motor control with arm support because running Experiment 2 without it would pose challenges, particularly with our most impaired patients, given the duration of Experiment 2 (~2 hours, about 1 hour with each arm) and the expected fatigue that would ensue.

      However, a key characteristic of our comparisons is that we are comparing Experiment 2 active control data under arm support, against Experiment 1 resting bias data also under arm support. While Experiment 1 measured biases without arm support as well, these are not used for this comparison. And, while resting biases are weaker with arm support, they are still clear and significant; yet they do not lead to detectable changes in active movement.

      At the same time, we do not rule out that, if we were to repeat Experiment 2 without arm support, we could find some systematic deviation in the direction of resting bias in movement control. Our conceptual model, in fact, suggests that this may be the case, as we described in lines 618-620 of our original manuscript. The idea here is that, when arm support is not provided, the increased strength requirements lead to increased drive through the RST, to the point that posture control (and its abnormalities) spills into movement control (Figure 9). We now better clarify this position in our Discussion (lines 744-750):

      “The interesting implication of this conceptual model is that synergies are in fact postural abnormalities that spill over into active movement when the CST can no longer modulate the increased RST activation that occurs when weight support is removed (i.e. resting biases may influence active reaching in absence of weight support). Supporting this idea, a study found increased ipsilateral activity (which primarily represents activation via the descending ipsilateral RST (Zaaimi et al., 2012)) when the paretic arm had reduced support compared to full support (McPherson et al., 2018).”

      It is also possible that significant adaptation or plasticity with the CST or rubrospinal tracts could give rise to motor output that already accounts for any intrinsic resting bias.  

      Response 3.5. This kind of adaptation – regardless of the tracts potentially involved – is an issue we examined in our experiment. As we talk about in our Results (lines 458-460 in the updated manuscript), with most of our patient population in the chronic stage, it could be likely that their motor system adapted to those biases to the point that movement planning took them into account, thereby limiting their effect. This motivated us to examine responses to unpredictable perturbations during movement (Figure 6) where we still find lack of an obvious effect of resting biases upon reaching control. We thus believe that our findings are not explained by this kind of adaptation, though we agree it would be of great interest for future work to compare resting biases and reaching control in acute vs. chronic stroke populations to examine the degree to which stroke patients adapt to these biases as they recover.

      In any case, the results from the reaching phase of Experiment 2 do not definitively show that directional biases are not present during active reaching, just that the authors were unable to detect them with their design. The authors do acknowledge the limitations in this design (a 2D constrained task) in explaining motor impairment in 3D unconstrained tasks. 

      Response 3.6. It is, of course, an inherent limitation of a negative finding that it cannot be proven. What we show here is that there is no hint of intrusion of resting posture abnormalities upon active movement, in spite of these resting posture abnormalities being substantial and clearly demonstrated even under arm support. To allow for the maximum bandwidth to detect any such effects, we specifically chose to compare the most extreme instances (resting bias-wise) for each individual, and yet we did not find any relationship between biases and active reaching.

      This suggests that, even if these biases could be in some form present during active movement, their effect would be minimal and thus limited in meaningfully explaining post-stroke impairment in active movement under arm support.

      Note that, as we already discuss, our conceptual model (Figure 9) suggests that the degree to which directional biases would be present in active reaching may be influenced by arm support (or the specific movements examined – hence our limitation in not examining 3D movement). Thus we do not claim that this independence is absolute. Examples include the last line of the passage quoted right above, and the summary statement of our Discussion quoted below (lines 639-641):

      “…which raises the possibility that the observed dissociation of movement and posture control for planar weight-supported movements may break down for unsupported 3D arm movements.”

      Finally, we now more explicitly acknowledge that abnormal resting biases may influence active movement in the absence of arm support (see Response 3.4).

      It would have been useful, in Experiment 2, to use FM-UE scores (and time from injury) as a factor to determine the relationship between movement and rest biases. Using a GLMM would have allowed a similar comparison to Experiment 1 of how impairment level is related to static perturbation responses. While not a surrogate for imaging tractography data showing a degree of CST involvement in stroke, FM-UE may serve as an appropriate proxy so that this perturbation at hold responses may be put into context relative to impairment.

      Response 3.7. Here the Reviewer suggests we use FM-UE scores as a proxy for CST integrity. We do not think this analysis would be particularly helpful in our case for a number of reasons:

      First, while FM-UE is a general measure of post-stroke impairment, it was designed to track, among other things, the emergence and resolution of abnormal synergies, a sign assumed to result from abnormally high RST outflow (McPherson et al., 2018; McPherson and Dewald, 2022). In line with this, the FM-UE scales with EMG-based measures of synergy abnormality (Bourbonnais et al., 1989). Impairments in dexterity, a sign associated with damage to the CST (Lawrence and Kuypers, 1968; Porter and Lemon, 1995; Duque et al., 2003), dissociate from synergy abnormalities when compared under arm support as we do here (Levin, 1996; Hadjiosif et al., 2022). This means that FM-UE would be a stronger proxy for RST activity and thus not a direct proxy for CST integrity, particularly when one wants to dissociate RST-specific vs. CST-specific abnormalities. In fact, as we discuss in Response 3.2 above, there are a number of studies supporting this idea: for example, Zaaimi et al. (2012) show that relative RST activation (the balance between ipsilateral excitability, primarily reflecting the RST, and contralateral excitability, primarily reflecting the CST) scales with FM-UE.

      Second, this kind of analysis would obscure within-individual effects, since FM-UE scores are, of course, assigned to each individual. This is the same issue as doing across-individual correlation analyses in general (see Response 2.1b). A strong resting force bias would have opposite effects on opposing perturbations, so averaging across perturbation directions would occlude these effects.

      Third, while FM-UE is a good measure of synergy abnormality, weakness alone could also give an abnormal FM-UE (Avni et al., 2024).

      The Reviewer also suggests we use time from injury for this analysis. Time from injury can indeed potentially be an important factor. However, this analysis would not be appropriate for our dataset, since the effective variation in recovery stage within our population is limited: our sample is essentially chronic (only two patients were examined within the subacute stage – at 2 and 3 months after stroke - with everybody else examined more than a year after stroke) with the “positive” elements of their phenotype (and FM-UE itself) essentially plateaued (Twitchell, 1951; Cortes et al., 2017). We thus would not expect to see any meaningful effects of time from injury within our population. It would be an excellent question for future work to investigate both resting biases and their relationship to reaching in acute/subacute patients, and examine whether the trajectory of resting biases (both emergence and abatement due to recovery) follows the one for abnormal synergies.

      It is not clear that even in the static perturbation trials that the hold (and subsequent move from perturbation) is being driven by reticulospinal projections. Given a task where ~20% of the trials are going to be perturbed, there is likely a significant amount of anticipatory or preparatory signaling from the CST. How does this balance with any proposed contribution that the RST may have with increased grip?

      Response 3.8. We included our response to this as part of Response 3.2. In brief, while we cannot rule out that these tasks may recruit increased CST signaling, this would tend to increase, rather than reduce, the effects of post-stroke impairment: the requirement for increased signaling from a CST that is damaged would magnify the effects of this damage, in turn leading to increased recruitment of other tracts, such as the RST.

      In general, the weakness of the interpretation of the results with respect to the CST/RST framework is that it is necessary to ascribe relative contributions of different tracts to different phases of movement and hold using limited or indirect measures. Barring any quantification of this data during these tasks, different investigators are likely to assess these contributions in different ways and proportions limiting the framework's utility.

      Response 3.9. We believe that our Responses 3.2-3.6 put our findings in fair perspective, and the edits undertaken based on the Reviewer’s comments have clarified our position as to how the dissociation between holding and moving control may break down. We do agree, however, that our framework would be strengthened by the use of direct measures of CST/RST connectivity in future research. We present our conceptual model as a comprehensive explanation of our findings and how they blend with current hypotheses regarding the role of these two tracts in motor control after stroke. As such, it provides a blueprint for future research that more directly measures or modulates CST and RST involvement, using tools such as tractography or non-invasive brain stimulation.

      Recommendations for the authors:   

      Reviewer #1 (Recommendations For The Authors):

      L226 “…of this issue, we repeated the analysis of Figure 7F (a) by excluding these four patients…”.  Should this be three, based on the previous sentence? 

      Response 1.A.1. Thank you for pointing out this typo, which is now corrected. The analysis in question (Figure S1 in the original submission, now re-numbered as Figure S7-4) excluded the three patients mentioned in the previous sentence.

      L254 “…the hand was held in a more distal position. The postural force biases were strongest when…”  Could this be "extended" rather than distal? See my later comment about the inadequate description of targets.

      Response 1.A.2. The reviewer is correct that the arm will tend to be more extended in the distal targets. However, since these positions were defined in extrinsic coordinates, we think the terms distal/proximal are also appropriate. In either case, we now clarify these definitions in the text (see Response 1.A.3 below).

      L263 “…contained both distal and proximal targets, and, importantly, they were also the movement…”.  Distal/proximal targets were never described as part of the task. 

      Response 1.A.3. We improved our description by (i) changing the wording above to “represented positions both distal and proximal to the body,”, (ii) doing the same in our Methods (line 175) and (iii) indicating distal/proximal targets in Figure 3A (bottom right of panel A).

      L378 “…the pulse perturbation. We hypothesized that, should resting postural forces play a role, they…”  L379 “…would tend to reduce the effect of the pulse if they were in the opposite direction, and…”  Not really obvious why. A reduction in the displacement caused by a force pulse might be caused by different stiffness or viscosity, but not by a linear, time-invariant force bias. This situation is different from that of "moving the arm through a high-postural bias area vs. a low-postural bias area" where it would encounter time- (actually spatially) varying forces and varying amounts of displacement. Clarify the logic if this is a critical point.

      Response 1.A.4. We thank the Reviewer for highlighting this point of potential confusion. We now clarify that these postural bias forces are neuromuscular in origin (Kanade-Mehta et al., 2023), and likely result from an expression of abnormal synergy, at least under static conditions. In this case, we hypothesized that force pulses acting against the gradient of the postural bias field would act to stretch the already active muscles, which would lead to a further increase in postural resistance due to inherent length-tension properties of active muscle. By contrast, force pulses acting along the gradient of the postural bias field would act to shorten the same active muscles, which would lead to a reduction in postural resistance. The data did not support this in the case of force pulses imposed during movement. We note, however, that similar effects would affect responses to static perturbations as well, wherein we do find an effect of resting biases. We now better explain this reasoning (lines 479-482).

      L466 “resting postural force). In short, our perturbations revealed that resting flexor biases switched” L467 “on after movement was over, providing evidence for separate control between moving” and

      L468 “holding still.”

      I do not think the authors have presented clear evidence that forces, "switch on", implying the switch to a different controller which they posit. This could as easily be a nonlinear or time-varying property of a single controller (admittedly, the latter possibility overlaps broadly with their idea of distinct, interacting controllers). An example that the authors are certainly aware of is that of muscle "thixotropy" a purely peripheral mechanism due to the dynamics of crossbridge cycling that causes resting muscle to be stiffer than moving muscle, changing with a time constant of ~1-2 seconds. Neither this particular example nor changing levels of contraction (more likely during the unpredictable force perturbations) would be in the direction to explain the main observation here -- a point perhaps worth making, together with the stretch reflex comments. 

      Response 1.A.5. Thank you for this perspective. Indeed, it might be that “switching on” represents a shift along a nonlinear property of the same controller: in the extreme, if this nonlinearity is a step (on/off) function, this single controller would be functionally identical to two separate controllers. We thus cannot tell if these controllers are distinct in the strict sense. What we argue here is that, no matter the underlying controller architecture (two distinct controllers or two distinct modes of the same controller), the control of reaching vs. holding can be functionally separable even after stroke. In line with this idea, we used a more nuanced phrasing (e.g. “separable functional modes for moving vs. holding”) throughout our manuscript, and we have now edited out a mention of “separate controllers” to be consistent with this.

      Moreover, thank you for pointing out the example of thixotropy, showing how peripheral mechanisms could interact with central control. As you point out, this effect would not explain the main observation here: in fact, if stiffness were substantially higher during rest or holding (instead of moving) that would reduce the impact of the static perturbation, making it harder to detect any effects of resting biases compared to the moving perturbation case.

      L480 “…during movement (Sukal et al., 2007). Yet, Experiment 2 found no relationship between resting…” L481”… postural force biases and active movement control. To further investigate this apparent…”  The methods of the two studies seem fairly similar, but this question warrants a more careful comparison. How did the size of the two workspaces compare? What about the magnitude of the exerted forces? The movement condition in this study was done with the limb entirely supported. Under that condition, the Sukal study also found fairly small effects of the range of motion.

      Response 1.A.6. Sukal et al., 2007 did not directly measure exerted forces, but instead compared the active range of motion under different loading conditions. They used the extent of reach area to quantify the effect of abnormal synergies, with a more extended active range of motion signifying a reduced effect of abnormal synergies. As the Reviewer points out, Sukal et al. found fairly small effects of synergies upon the range of motion when arm support was provided (the reach area for the paretic side was found to be about 85% of the nonparetic side under full arm support, though the two were statistically significantly different; see Figure 5 of their paper). They found an increasing effect of synergies as arm support was reduced: on average, the reach area when participants had to fully support the arm was less than 50% of the reach area when full arm support was given (comparing the 0% vs. 100% active support conditions [i.e. 100% vs. 0% external support] in their Figure 5). As we discuss in our paper, this effect of arm support upon synergy mirrors the one we found for resting postures.

      To compare our workspace with the one in Sukal et al., we overlaid our workspace (the array of positions for which the posture biases were measured, for a typical participant from Experiment 1) on the one they used as shown in their Figure 4. Note that their figure only shows an example participant, and thus our ability to compare is limited by the fact that each participant can vary widely in terms of their impairment, and assumptions had to be made to prepare this overlay (e.g. that (0,0) represents the position of the right acromion point). 

      For this example, and our assumptions, our workspace was smaller, with the main points of interest (red dots, the movement start/end points used for Experiment 2) within the Sukal et al. workspace. That our workspace is smaller is not surprising, given that the area in Sukal et al. represents the limit of what can be reached, and thus motor control *has* to be examined in a subset of that area.

      Author response image 1.

      Comparing the two study methodologies, however, suggests an advantage of measuring resting biases in terms of sensitivity and granularity: first, resting biases can be clearly detected even under arm support (something we point out in our Discussion, lines 715-717); second, they can measure abnormalities at any point in the workspace, rather than a binary within/without the reach area. The resting bias approach may thus be a more potent tool to probe the shared bias/synergy mechanisms we propose here.

      Figure 2 

      Needs color code. 

      The red dots could be bigger.

      Response 1.A.7. We have increased the size of the red dots and added a color code to explain the levels illustrated by the contours. We also expanded our caption to better explain this illustration.

      Figure 3

      Labeling is confusing. Drop the colored words (from both A and B), and stick to the color legend. Consider using open and filled symbols (and bars) to represent arm support or lack thereof. The different colored ovals are very hard to distinguish.

      Response 1.A.8. We find these recommendations improve the readability of Figure 3 and we have thus adopted them - see updated Figure 3.

      Figure 4

      Not terribly necessary.  

      Response 1.A.9. While this figure is indeed redundant based on our descriptions in the text, we kept it as we believe it can be useful in clarifying the different stages of movement we examine.

      Figure 5 

      Tiny blue and green arrows are impossible to distinguish. 

      Although the general idea is clear, E and H are not terribly intuitive.  Add distance scale bars for D-I. 

      Response 1.A.10. For improved contrast, we now use red and blue (also in line with comment below regarding Figure 7), and switched to brighter colors in general. To make E and H more intuitive and easier to follow, we expanded the on-panel legend. Thank you for pointing out that distance scale bars are missing; we have now added them (panels EFHI).

      Figure 6 

      Panel E inset is too small. 

      Response 1.A.11. We have now moved the inset to the right and enlarged it.

      Figure 7 

      Green and blue colors are not good. 

      Response 1.A.12. For improved contrast, we now use red and blue.

      Figure 8 

      Delete or move to supplement? 

      Response 1.A.13. We respectfully disagree. While the relationships in these data are also captured by the ANOVA, we believe these scatter plots offer a better overview of the relationships between force biases and FM-UE across different conditions.

      Really minor

      L113 “…participants' lower arm was supported using a custom-made air-sled (Figure 1C). Above the  participant's…” 

      Response 1.A.14. We put the apostrophe after the s so as to refer to participants in general (plural).

      L117 ”…subject-produced forces on the handle were recorder using a 6-axis force transducer.”  recorded 

      Response 1.A.14. Thank you for pointing out this error which we have now corrected.

      L136 “…2013), Experiment 1 assessed resting postural forces by passively moving participants to…”  The experiment did not move the participant. 

      Response 1.A.15. We now fix this issue: “by having the robot passively move…”

      L248 “…experiment blocks: two with each arm, with or without arm weight support (provided by an air experimental…”

      Response 1.A.16. We have now corrected this.

      L364 “…responses to mid-movement perturbations. In 1/3 of randomly selected reaching movements…”  Obviously, you mean 1/3 of all movements: "One-third of the reaching movements were chosen randomly"  

      Response 1.A.17. We now clarify: “In 1/3 of reaching movements in Experiment 2, chosen randomly”. Also please note our response to Reviewer 2, point 10: we now report the exact number of trials for which each kind of perturbation was present.

      L609 “Damage to the CST after stroke reduces its moderating influence upon the RST (Figure 9,…”  "its" refers to the subject, "Damage", not "CST".

      Response 1.A.18. We have changed this to “Post-stroke damage to the CST reduces the moderating influence the CST has upon the RST”.

      Reviewer #2 (Recommendations For The Authors):

      (1) Throughout, the authors cleverly selected the most opposed and most aligned resting postural force biases to perform a within-subject analysis. However, this approach excludes a lot of data. The authors could perform an additional within-subject analysis. For each participant they could correlate lateral resting posture force bias to each dependent variable, utilizing all the trials of a participant. 

      Response 2.A.1a. Thank you for appreciating our analysis design, and for suggesting additional analyses. We focused our within-subject analysis design on the most extreme instances, as we believe that this approach would offer the best opportunity to detect any potential effects of resting biases. We reasoned that, since resting biases tend to be relatively small for most locations in the workspace, taking all biases into account would inject a disproportionate amount of noise in our analysis, which would in turn diminish our ability to detect any potential relationships. This could be because small biases lead to small effects, but also because small biases may themselves be more likely to reflect measurement noise in the first place. Note that our study talks about separability of active reaching from resting abnormalities based on lack of relationships between the two. While one cannot definitively prove a negative, it is also important to take the approach that maximizes the ability to detect any such relationship if there were one. We believe taking the most extreme instances fulfills that role.

      However, as the Reviewer points out, this approach also excludes a substantial amount of data. We agree that our findings could be further strengthened by exploring additional within-subject analyses that utilize all trials. Thus, following the reviewer’s suggestion, we estimated the sensitivity of each dependent variable to lateral resting posture force bias. Specifically, we estimated the slope of this relationship for each individual (separately for paretic and non-paretic data) using linear regression, and assessed whether the average slope is significant for each group (paretic data, non-paretic data, and control data).
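
      The per-individual slope analysis described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code used for the paper: the participant labels, the numbers, and the helper names `ols_slope` and `mean_slope_t` are hypothetical, and the sketch stops at the t statistic rather than computing a p-value.

```python
import math
import statistics


def ols_slope(x, y):
    """Least-squares slope of y regressed on x (model includes an intercept)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def mean_slope_t(slopes):
    """One-sample t statistic and degrees of freedom for mean slope vs. zero."""
    n = len(slopes)
    sem = statistics.stdev(slopes) / math.sqrt(n)
    return statistics.fmean(slopes) / sem, n - 1


# Hypothetical data: for each participant, lateral resting bias (N) and one
# dependent variable, with one value per trial.
trials = {
    "P01": ([-1.0, 0.0, 1.0, 2.0], [0.9, 1.0, 1.3, 1.4]),
    "P02": ([-2.0, -1.0, 0.0, 1.0], [0.5, 0.8, 0.9, 1.2]),
    "P03": ([0.0, 1.0, 2.0, 3.0], [1.0, 1.1, 1.1, 1.4]),
}

# One slope per participant, then a group-level test on the slopes.
slopes = [ols_slope(bias, outcome) for bias, outcome in trials.values()]
t_stat, dof = mean_slope_t(slopes)
print(slopes, t_stat, dof)
```

      In practice the same two steps would be run separately for each group (paretic, non-paretic, control) and each dependent variable.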

      This secondary analysis replicated our main findings: lack of relationship between posture biases and active reaching control (both for unperturbed and perturbed movement), and a significant relationship between posture biases and active holding control. In addition, in line with main point 2.1 by the reviewer, we performed the same analyses for non-paretic and control data. While there are no definitive conclusions to be made for these cases (as was likely, given that the resting force biases are smaller, as also pointed out by the Reviewer in 2.1) these data are worthy of discussion, with potentially interesting insights (for example, there are hints that the connection between resting biases and active holding control is present in the non-paretic arm as well, and may be explored in future research).

      We have included these analyses in the supplementary materials, and we point to them in the main text. Specifically:

      First, in line with our main analyses in Figure 5, we find no effect (the average slope is insignificant) for start and endpoint biases upon the corresponding reaching angles. This is now mentioned in lines 425-434 of the Results, and illustrated in Figure S5-2. There was a lack of effect for the non-paretic and control data as well.

      Second, in line with our main analyses in Figure 6, we find no effect of start biases upon responses to the pulse (Figure S6-2, mentioned in lines 513-517 of the Results). As above, there was no effect of non-paretic or control data either.

      And, finally, in line with our main analysis in Figure 7, we find an effect of resting biases upon performance for the static perturbation (Figure S7-2, mentioned in lines 578-586 of the Results). Interestingly, there is a suggestion that resting biases may affect static perturbation responses in the non-paretic data as well, based on the relationship between posture bias and maximum deviation, but not the other two metrics. Given the lack of consistency of resting bias effects across the three dependent variables examined, however, our current data are unable to give a definite answer as to whether the connection between resting biases and active holding control is also present in the non-paretic side. Our hypothesis is that, since resting abnormalities and their effects are the pathological over-manifestations of mechanisms inherent in the motor system in general, such a relationship would exist. Answering this question, however, would require an experiment design better tailored to detect relationships in the non-paretic arm, where resting biases are weaker.

      We thank the Reviewer for their suggestions and believe that these additional analyses provide a more complete picture of the data, and their consistency with our main results reinforces the message of the paper.

      Then, they can report the percentage of participants that display significant correlations separately for the paretic, nonparetic, and control arms. 

      Response 2.A.1b. We note that, even in cases where the average slope (across individuals) is significant, the individual slopes themselves are usually not significant, likely due to the large amount of noise for datapoints corresponding to weak resting biases. To further examine this, we performed additional analyses whereby we examined slopes by (a) pooling all participant data together (centered separately for each individual), and (b) taking the further step of normalizing each participant’s data not only by centering but also by scaling by each individual’s variability along each axis (i.e. assessing the slope between z-scores of resting bias vs. z-scores of each dependent variable). These two analyses confirmed our finding that resting biases interacted with active motor control, with significant slopes between resting biases and outcome variables. (a) Pooling all data together: path to stabilization: p = 0.032; time to stabilization: p = 1.4 × 10⁻⁵; maximum deviation: p = 0.021. (b) Pooling and normalizing: path to stabilization: p = 0.0013; time to stabilization: p = 8.6 × 10⁻⁶; maximum deviation: p = 0.00056. The latter analysis showed an even stronger connection between resting bias and active holding control (probably due to better accounting for differences in the range of resting biases across participants). For simplicity, however, we only provide the across-individual slope comparisons in the paper.
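
      The difference between the two pooling schemes, (a) centering within each individual and (b) z-scoring within each individual, can be illustrated with a short sketch. The data and the function names `center`, `zscore` and `pooled_slope` are hypothetical, chosen only to show how a participant with a wide range of resting biases dominates the pooled fit in (a) but not in (b); this is not the code used for the reported p-values.

```python
import statistics


def center(values):
    """Subtract the within-participant mean."""
    m = statistics.fmean(values)
    return [v - m for v in values]


def zscore(values):
    """Subtract the within-participant mean and divide by the sample SD."""
    m, s = statistics.fmean(values), statistics.stdev(values)
    return [(v - m) / s for v in values]


def pooled_slope(per_participant, transform):
    """Apply `transform` within each participant, pool, and fit one slope."""
    xs, ys = [], []
    for bias, outcome in per_participant:
        xs.extend(transform(bias))
        ys.extend(transform(outcome))
    # Both transforms leave every participant with zero mean, so the pooled
    # least-squares slope reduces to sum(x*y) / sum(x*x).
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# Hypothetical data: (resting bias, dependent variable) per participant.
data = [
    ([0.0, 1.0, 2.0], [0.0, 2.0, 4.0]),     # narrow bias range, steep relation
    ([0.0, 10.0, 20.0], [5.0, 6.0, 7.0]),   # wide bias range, shallow relation
]
slope_centered = pooled_slope(data, center)
slope_zscored = pooled_slope(data, zscore)
print(slope_centered, slope_zscored)
```

      With centering only, the second participant's wide bias range pulls the pooled slope toward its own shallow relation; after z-scoring, both participants contribute on the same scale.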

      (2) An important aspect of all the analyses is that they rely heavily on estimates of the resting postural force bias. How stable are these resting postural force biases at the individual level? The authors could assess this by reporting within-subject variance for both the magnitude and direction of the resting postural force bias.

      Response 2.A.2. Thank you for your suggestion. We now assess the individual-level variance in error across measurements for patients’ paretic data using an ANOVA: the variance that remains after all other factors (same probe location; same arm support condition; same participant) are taken into account. We found that individual level measurement variance explained a mere 9.0% of total variance for resting bias magnitude. (We note that the same figure was 20.2% for the non-paretic data, in line with the weaker average biases which would be more susceptible to noise). We now note this in the Methods, as part of the new subsection “Stability of resting posture bias measurements in Experiment 1” (lines 266-273).
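
      One simple way to picture the "variance that remains" is as the share of the total sum of squares left within cells that share the same participant, probe location, and support condition. The sketch below assumes a cell-means decomposition, which may differ from the exact ANOVA model used in the paper; the record layout and the name `residual_variance_fraction` are hypothetical.

```python
from collections import defaultdict
import statistics


def residual_variance_fraction(records):
    """Fraction of total sum of squares left within repeated-measurement cells.

    Each record is (participant, location, support, bias_magnitude).
    """
    cells = defaultdict(list)
    for participant, location, support, magnitude in records:
        cells[(participant, location, support)].append(magnitude)
    values = [v for cell in cells.values() for v in cell]
    grand_mean = statistics.fmean(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    # Deviation of each repeated measurement from its own cell mean.
    ss_within = sum(
        (v - statistics.fmean(cell)) ** 2 for cell in cells.values() for v in cell
    )
    return ss_within / ss_total


# Hypothetical measurements: two probe locations, two repeats each.
records = [
    ("P01", "A", True, 1.0),
    ("P01", "A", True, 1.2),
    ("P01", "B", True, 3.0),
    ("P01", "B", True, 3.2),
]
fraction = residual_variance_fraction(records)
print(fraction)
```

      A small fraction means that repeated measurements at the same location and condition agree closely relative to the spread across locations, conditions, and participants.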

      (3) Does resting postural force bias influence hand movement immediately following force release from the postural perturbation? This could be assessed before any volitional responses by examining the velocity of the hand during the first 50 ms following the postural perturbation.

      Response 2.A.3. The influence seems fairly rapid, within the first 100ms as shown to the right. Here we plot hand deviation in the direction of the perturbation for the most-opposed (red) vs. most-aligned (blue) instances to examine when these curves become different. The bottom plots show the difference between these two, whereas shading indicates SEM (note that these curves are referenced to the average deviation in the last 0.5 s before force release). The rightmost plots zoom in to make it easier to see how responses to the most opposed vs. most aligned instances diverge.

      To detect the earliest post-perturbation timepoint for which this effect was significant, we performed paired t-tests at each timestep, and found that the two responses were systematically statistically different 95ms after perturbation onset onwards. For reference, the same method detected a response at 25ms for the most aligned instances and 40ms for the most opposed instances.
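
      The onset-detection step described above (a paired t-test at each timestep, then the first timestep from which the difference stays significant) can be sketched as below. This is an illustration under assumptions: the function names are hypothetical, significance is judged against a caller-supplied critical t value rather than a computed p-value, and the differences at each timestep are assumed to have nonzero variance.

```python
import math
import statistics


def paired_t(a, b):
    """Paired t statistic for two equal-length samples."""
    diffs = [x - y for x, y in zip(a, b)]
    sem = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.fmean(diffs) / sem


def onset_index(opposed, aligned, t_crit):
    """First timestep from which |t| > t_crit holds at every later timestep.

    `opposed` and `aligned` are lists over timesteps; each entry holds one
    value per participant. Returns None if there is no such timestep.
    """
    onset = None
    for i, (a, b) in enumerate(zip(opposed, aligned)):
        if abs(paired_t(a, b)) > t_crit:
            if onset is None:
                onset = i
        else:
            onset = None  # significance was not sustained; start over
    return onset


# Hypothetical deviations for 3 participants at 4 timesteps.
opposed = [[0.0, 0.1, 0.2], [1.0, 1.1, 1.2], [0.0, 0.2, 0.1], [2.0, 2.1, 2.2]]
aligned = [[0.1, 0.0, 0.2], [0.0, 0.0, 0.0], [0.1, 0.0, 0.2], [0.0, 0.0, 0.0]]
# 4.303 is the two-tailed 5% critical value of t with 2 degrees of freedom.
onset = onset_index(opposed, aligned, t_crit=4.303)
print(onset)
```

      Multiplying the returned index by the sampling interval gives the onset latency; the isolated significant timestep at index 1 is ignored because the difference does not persist.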

      We have now added Supplementary Figure S7-4 with short commentary in the Supplementary Materials.

      (4) Abstract. lines 7-9. At a glance (and when reading the manuscript linearly) this sentence is unclear. If the paretic arm is compromised across rest and movement, how does that afford the opportunity to address the relationship between reaching, stopping, and stabilizing when all could be impacted? It might be useful to specify that these factors may be impacted differently relative to one another with stroke, providing an opportunity to better understand the differences between movement and postural control. 

      Response 2.A.4. Thank you for pointing out this issue (also related to Reviewer 1’s point – Response 1.1). We have changed this to more clearly reflect our reasoning and highlight that the issue is that stroke can differentially impact reaching vs. holding, copied below:

      “The paretic arm after stroke exhibits different abnormalities during rest vs. movement, providing an opportunity to ask whether control of these behaviors is independently affected in stroke.”

      (5) Line 27. It is perhaps more appropriate to say conceptual model than simply 'model'.  

      Response 2.A.5. Thank you for your suggestion, which we have adopted throughout the manuscript.

      (6) Line 122-125. Figure 1A caption. The authors should specify that resting posture force biases occur when the limb or hand is physically constrained in a specific position. 

      Response 2.A.6. Thank you for pointing this out – we have clarified the caption:

      “If one were to physically constrain the hand in a position away from the resting posture, the torques involved in each component of the abnormal resting posture translate to a force on the hand (blue arrow);”

      (7) Line 147. Why was the order not randomized or counterbalanced? 

      Response 2.A.7. We prioritized paretic data, as the primary analyses and comparisons in our paper involved resting posture biases and active movement with the paretic arm. We note that our primary analyses, which rely on paretic-paretic comparisons, would not be affected by paretic vs. non-paretic ordering effects. However, ordering effects could potentially affect comparisons between paretic and non-paretic data. We now note the reasoning behind the absence of counterbalancing, and mention the potential limitation in interpreting paretic to non-paretic comparisons in lines 124-129 of the Methods.

      (8) Line 172. 12N is the peak force of the pulse?

      Response 2.A.8. The reviewer is correct; we have clarified our description (line 463 in the updated manuscript):

      “a 70 ms bell-shaped force pulse which was 12N at its peak”

      (9) Line 175. What is a clockwise pulse? Was the force vector rotating in direction over time so that it was always acting orthogonally to the movement, or did it always act leftwards or rightwards?

      Response 2.A.9. The force vector was not rotating in direction over time. Here, we used clockwise/counterclockwise to indicate rightwards/leftwards with respect to the ideal movement direction – the line from start position to target (which is what we understand the Reviewer means by “always act rightwards or leftwards”). We have clarified the text to indicate this (lines 193-195):

      …was applied by the robot lateral to the ideal movement direction (i.e. the direction formed between the center of the start position and the center of the target) after participants reached 2cm away from the starting position (Smith and Shadmehr, 2005; Fine and Thoroughman, 2006).

      (10) Lines 177-182. It might be useful to explicitly mention the frequency of each of the perturbations, just for ease of the reader. 

      Response 2.A.10. We have added this information to our Methods (lines 206-210):

      Thus, in summary, each 96-movement block consisted of 64 unperturbed movements and 32 movements perturbed with a force pulse (16 clockwise, and 16 counter-clockwise). For 20 out of the 96 movements in each block, the hold period was extended to test the hold perturbation (4 trials for each of the 5 target locations, each one of the 4 trials testing one perturbation direction as shown in Figure 7C).
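
      The block composition summarized above can be checked with a small schedule generator. This is only a sketch of the trial counts: the randomization scheme, the independence of pulse and hold assignments, and the name `make_block` are assumptions, and the sketch does not model which target each movement goes to.

```python
import random


def make_block(rng, n_targets=5, n_hold_dirs=4):
    """Build one 96-movement block as a list of per-trial dicts."""
    # 64 unperturbed movements plus 32 force pulses (16 per direction).
    pulses = ["none"] * 64 + ["clockwise"] * 16 + ["counterclockwise"] * 16
    rng.shuffle(pulses)
    # 20 hold perturbations: each target paired once with each direction.
    holds = [(t, d) for t in range(n_targets) for d in range(n_hold_dirs)]
    rng.shuffle(holds)
    # Assumption: extended holds are placed on 20 trials chosen at random.
    hold_trials = set(rng.sample(range(len(pulses)), len(holds)))
    block = []
    for i, pulse in enumerate(pulses):
        hold = holds.pop() if i in hold_trials else None
        block.append({"pulse": pulse, "hold": hold})
    return block


block = make_block(random.Random(0))
n_pulse = sum(trial["pulse"] != "none" for trial in block)
n_hold = sum(trial["hold"] is not None for trial in block)
print(len(block), n_pulse, n_hold)
```

      The counts reproduce the proportions quoted elsewhere in the response: 32 of 96 movements (one third) carry a pulse, and 20 of 96 (about 21%) carry a hold perturbation.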

      (11) Line 191. Lines 188-190. It would be useful to see a sample of several of these force traces over time (0-5s) that were used to make the average for a position. That would give insight into the stability of the forces of a participant for one of the postures. These traces could be shown in Figure 2.

      Response 2.A.11. Thank you for your suggestion. We have added these panels to Figure 1 (as Figure 2 was already large). Each panel illustrates the three measurements taken at similar positions (closest to midline, distal from the body) and the same condition (paretic arm, with arm support given) for one participant (same participants as in Figure 2). Solid lines indicate the force on the x-axis (positive values indicate forces towards the left), whereas dashed lines indicate the force on the y-axis (positive values indicate forces towards the body). The shaded area indicates the part averaged in order to estimate the resting bias, illustrating how resting biases were relatively stable by the 2s mark. Note that these examples include one trial (blue traces in the third panel) which was rejected following visual inspection as described in Materials and Methods – Data Exclusion Criteria (“trials where forces appeared unstable and/or there was movement during the robot hold period”). We find this helpful as it illustrates (and motivates) one component of our methodology. 

      (12) Line 196. Figure 1D (not 1E).  

      Response 2.A.12. Thank you for catching this error, which we have now corrected.

      (13) Line 215: The authors mentioned similar results. Were there any different results that impacted interpretation? Some evidence of this, similar to and in addition to Supplementary 1, would be helpful. 

      Response 2.A.13. We repeated our analyses without these exclusion criteria, with no impact on the interpretation. We now include versions of the main outcome panels from Figures 5, 6, and 7 in the supplementary materials calculated without this outlier exclusion (Figures S5-E, S6-E, and S7-E, respectively). 

      (14) Line 231: Perhaps better to explicitly state the furthest three positions are being across as the distal targets for the ANOVA. 

      Response 2.A.14. Thank you for your suggestion. We now explicitly clarify this in line 276:

      “distal targets [furthest three positions] vs. proximal targets [closest two positions]”

      (15) Figure 3B, lines 265. Clearly, these are different, but the authors should report statistics. 

      Response 2.A.15. We now report these numbers (lines 339-346 of the revised manuscript, which also include statistics related to bias direction as described in 2.A.17 below).

      (16) Figure 2 should have a heat map scale.  

      Response 2.A.16. We have now added this (also Response 1.A.7), including an explanation of what the heat map represents in the caption.

      (17) Figure 3C: It would be useful to quantify and plot the direction of the resting force bias vector. 

      Response 2.A.17. Thank you for your suggestion. We have expanded Figure 3 to include the average direction of the resting force bias vector (note the readjustment of colors following Reviewer 1’s comment: striped bars indicate No Support data, and full bars indicate Support data, with the colors being the same). The direction of the force bias vector, however, may not be very informative in cases where the magnitude is small (and the signal-to-noise ratio is low), whereas averaging the direction of the force bias vector across different positions for one participant may average out systematic variations in this direction across different locations. Nevertheless, the average direction appears generally towards the body (around -90°, or 6 o’clock) even in the non-paretic and control data (though the noise, as suggested by the size of the error bars, is much higher in the latter cases, especially when the arm is supported). This is a (weak) suggestion that these resting biases may be present, though much subdued, in the nonparetic limb and healthy individuals; further work will be needed to elucidate this.
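
      Averaging directions requires some care, because angles wrap around. The response does not state how the average direction was computed; the sketch below shows one common choice, the circular mean, as an assumption rather than the method used in the paper. The function name `circular_mean_deg` is hypothetical.

```python
import math


def circular_mean_deg(angles_deg):
    """Direction of the summed unit vectors, in degrees within [-180, 180]."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c))


# Two directions straddling "towards the body" (-90 degrees).
towards_body = circular_mean_deg([-80.0, -100.0])
# Two directions straddling the wrap-around point: an arithmetic mean would
# give 0 degrees, the opposite of where both vectors actually point.
wrapped = circular_mean_deg([170.0, -170.0])
print(towards_body, wrapped)
```

      Because every vector is given unit length here, small and noisy biases count as much as large ones, which is the concern raised above about directions measured at low magnitude.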

      (18) Line 428. It is not significantly longer compared to controls. Can the authors slightly revise this sentence?

      Response 2.A.18. We have revised this sentence (lines 529-532):

      Patients showed impaired capacity to resist and recover from this perturbation (the abrupt release of the imposed force). The time to stabilization for the paretic side (0.94±0.05s) was longer compared to the non-paretic side (0.79±0.03s, p = 0.024) and controls (0.78±0.06s, though this was statistically marginal, p = 0.061) as shown in Figure 7E, left.

      (19) Line 541. It is unclear how these data support the idea of three distinct controllers. Can the authors please clarify? 

      Response 2.A.19. Here, we compare our findings to previous ideas about distinct controllers, and discuss a potential fusion of these ideas with ours. Specifically, we find that holding is distinct from both initial reaching and coming to a stop. Previous work argues that initial reaching and coming to a stop are themselves distinct (Ghez et al., 2007; Jayasinghe et al., 2022). Combining these two sets of arguments, we arrive at the possibility of three distinct controllers. 

      (20) It would be useful if the authors provided a definition of synergy, as well as distinguishing between muscle and movement synergies. 

      Response 2.A.20. We now provide this in lines 591-594:

      Here, “synergies” refer to abnormal co-activation patterns across joints that manifest as the patient tries to move – for example, the elbow involuntarily flexing as the patient tries to abduct their shoulder (Twitchell, 1951; Brunnstrom, 1966). 

      (21) Line 592-593. The wording of this sentence could be improved. 

      Response 2.A.21. We have switched this sentence to active voice for more clarity:

      Thus, while full weight support reduces both resting flexor biases and movement-related flexor synergies, this reduction seems more complete for synergies rather than resting biases.

      (22) Figure 9. In the left column, it should read normal synergies and normal resting posture.  

      Response 2.A.22. We intentionally used the same terminology, as the idea behind our conceptual model is that these patterns, which manifest as well-recognized abnormal synergies and abnormal resting postures in stroke, may be present in the healthy motor system as well, but kept in check by the CST moderating the RST. At the same time, we recognize that, by definition, synergies and posture in controls are the “normal” reference point against which “abnormal” synergies and posture are defined after stroke. To clarify this issue, we decided to forgo the term “abnormal” in the figure, and instead refer to “synergistic movement” and “synergistic resting posture”.

      (23) Figure 9. With stroke, is RST upregulated, a decreased influence of CST, or both? All seem plausible.

      Response 2.A.23a. We believe both can be happening. From previous work (e.g. McPherson et al., 2018) it seems safe to say that the RST is upregulated, whereas one would also expect a decreased CST influence because the stroke damages the CST. The relative weight of these influences would be interesting to elucidate in future work.

      I have not read the paper, but did McPherson et al., 2018 test these different hypotheses?  

      Response 2.A.23b. The main point of McPherson et al., 2018 is that increased synergy expression is due to increased RST involvement, rather than reduced CST influence. However, McPherson et al. do not show separate increases/reductions in RST/CST activity; they show that contralesional activity relative to ipsilesional activity is increased (using a laterality index). While it does seem that RST is upregulated in this case, this does not exclude the possibility that CST influence is reduced as well.

      We also noticed that the citation itself, while mentioned in the text, was missing from the bibliography. This is now fixed.

      For Figure 9, McPherson is cited as they provide evidence for the idea that RST involvement increases when arm support is decreased. This evidence is both direct (e.g. in their Figure 3 where they show that “Stroke participants exhibited increased activity in the contralesional (R) hemisphere as SABD loading increased” [i.e. arm support was reduced]) and indirect: they connect synergies to RST involvement, and also show increased synergies with reduced arm support (also shown multiple times previously). Both these arguments suggest that arm support reduces RST involvement. We have clarified the relevant sentence:

      The interesting implication of this conceptual model is that synergies are in fact postural abnormalities that spill over into active movement when the CST can no longer modulate the increased RST activation that occurs when weight support is removed. Supporting this idea, McPherson et al. found increased ipsilateral activity (which primarily represents activation via the descending RST (Zaaimi et al., 2012)) when the paretic arm had reduced support compared to full support (McPherson et al., 2018).

      Reviewer #3 (Recommendations For The Authors):

      For Experiment 2, it is not immediately clear how the within-subject values are being pooled and compared across the different conditions. For instance, in the static perturbation trials, there are four blocks with 20 perturbation trials per block per arm (80 total per arm) with each location and direction once per block. For each participant, the comparison is between the location/direction that was most opposed (although this doesn't look accurately represented in Fig 7F). Therefore, the within-subject comparison is 4 trials per participant? Were these values averaged or pooled? It is a little odd that the SD for all the within-subjects trials are identical or nearly identical across conditions especially when looking at the example patient data in 7B and 7F.  

      Response 3.A.1. For static perturbation trials, the within-subject comparison involves 8 trials per participant: 4 trials corresponding to the perturbation direction/position combination with resting bias most opposed to the perturbation, and 4 trials corresponding to the perturbation direction/position combination with resting bias most aligned with the perturbation. These values were averaged for each individual. We have expanded our methods to make this part of our data analysis clear (lines 284-296) for all types of comparisons (unperturbed movement, pulse perturbation, static perturbations – now referred to as “release perturbation”).

      The across-subject SDs for the average resting forces for each one of these two conditions, shown in Figure 7F are indeed identical. This is due to how these two instances (most aligned vs. most resistive) were selected: because the perturbation directions come in pairs that exactly oppose each other (Figure 7B), if one were to select the position with the most opposing resting bias, that would mean that the combination with same position and the oppositely-directed perturbation would be the one with the most assistive resting bias. Hence the resting biases selected for the most opposing/assistive instances would be equal in magnitude and opposite to each other for each participant, as illustrated in Figure 7F, whereby the most-opposed bias for each individual is exactly opposite to the corresponding most-aligned bias for the same individual. We have added a brief commentary about this on the caption (lines 551-554), reproduced below:

      Note how the most-opposed resting bias for each patient is equal and opposite to their most-aligned resting bias. This is because the same resting bias, when projected along the directions of two oppositely-directed perturbations (illustrated in C), opposes one with the same magnitude with which it aligns with the other.
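
      The equal-and-opposite property follows from projecting one vector onto two opposing unit vectors. The following Python sketch illustrates the selection logic only: the resting bias values, the position labels, and the four perturbation directions are made-up assumptions, not values from the study.

```python
import math

def projection(bias, angle_deg):
    """Signed component of a 2D resting bias along a perturbation direction."""
    ux = math.cos(math.radians(angle_deg))
    uy = math.sin(math.radians(angle_deg))
    return bias[0] * ux + bias[1] * uy

# Hypothetical resting biases (N) at three hold positions for one participant.
biases = {"pos_A": (0.5, -1.2), "pos_B": (-0.3, -0.8), "pos_C": (0.1, -2.0)}

# Assumed perturbation directions, arranged in exactly opposing pairs.
directions = [45, 135, 225, 315]

# Projection of the local resting bias for every position/direction combination.
proj = {(p, d): projection(b, d) for p, b in biases.items() for d in directions}

most_opposed = min(proj, key=proj.get)  # bias points most strongly against the perturbation
most_aligned = max(proj, key=proj.get)  # bias points most strongly along the perturbation

print(most_opposed, round(proj[most_opposed], 3))
print(most_aligned, round(proj[most_aligned], 3))
```

      Because every direction has an exact opposite at the same position, the largest positive projection is always the mirror image of the largest negative one, so the two selected sets of values share the same spread across participants.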

      Importantly, following suggestions by Reviewer 2 (see point 2.A.1), we now provide supplementary analyses that use the entirety of the relevant data, rather than the most extreme instances, which provide evidence supporting our main findings (Figures S5-2, S6-2, and S7-2).

      The printed colors in Figure 3 are very muddled and hard to read/interpret, especially in panel A. 

      Response 3.A.2. Thank you for pointing out this issue, also raised by Reviewer 1. We have adjusted the colors to be more distinct from each other and look clear both in print and on-screen, making use of dashed lines and stripes rather than different shades.

      I think it would improve readability and interpretation if Figure 8 and the results related to FM-UE were contained within the description of results for Experiment 1.

      Response 3.A.3. Thank you for this suggestion. This is actually a debate we had among ourselves earlier, and we can see merits to either ordering. It is arguable that moving Figure 8 and the FM-UE results into the rest of Experiment 1 may improve readability somewhat. However, we believe that presenting these results at the end better serves to illustrate the apparent paradox between the lack of direct connection between resting biases and active movement on one hand, and the relationship between resting biases and abnormal synergies on the other. We believe that this better sets the stage to present our conceptual model, which explains this paradox based on the role arm support plays in modulating the expression of both resting biases and abnormal synergies.

      Additional changes/corrections not outlined above

      Figure 1D displayed a right arm, but showed a target array (red dots) for a left arm paradigm. We now flip the target array shown for consistency.

      We corrected Figure 6C, which accidentally used an earlier definition of settling time that was based on lateral stabilization throughout the entire movement, rather than focusing on the period immediately following the pulse. The intended definition of settling time (as we had described in the Methods, lines 204-206 of original submission) focuses on lateral corrections specific to the pulse (rather than corrections when the participant approaches the endpoint) and better matches the one for settling time for the release (static) perturbation trials. Note that this change did not affect the (lack of) relationship between settling time and resting force bias, both across individuals (correlation plots now in Figure S6-1) and within individuals (now shown in the right part of panel 6D). In panel C, an error in the scaling for the maximum lateral deviation in the pulse direction (right side of the panel) is also now corrected.

      In addition, we made minor edits throughout the text to improve readability.

      References

      Albert ST, Hadjiosif AM, Jang J, Zimnik AJ, Soteropoulos DS, Baker SN, Churchland MM, Krakauer JW, Shadmehr R (2020) Postural control of arm and fingers through integration of movement commands. Elife 9:e52507.

      Avni I, Arac A, Binyamin-Netser R, Kramer S, Krakauer JW, Shmuelof L (2024) The Kinematics of 3D Arm Movements in Sub-Acute Stroke: Impaired Inter-Joint Coordination is Attributable to Both Weakness and Flexor Synergy Intrusion. Neurorehabil Neural Repair 38:646–658.

      Bourbonnais D, Vanden Noven S, Carey KM, Rymer WZ (1989) Abnormal spatial patterns of elbow muscle activation in hemiparetic human subjects. Brain 112:85–102.

      Brunnstrom S (1966) Motor testing procedures in hemiplegia: based on sequential recovery stages. Phys Ther 46:357–375.

      Cortes JC, Goldsmith J, Harran MD, Xu J, Kim N, Schambra HM, Luft AR, Celnik P, Krakauer JW, Kitago T (2017) A Short and Distinct Time Window for Recovery of Arm Motor Control Early After Stroke Revealed With a Global Measure of Trajectory Kinematics. Neurorehabil Neural Repair 31:552–560.

      Duque J, Thonnard J, Vandermeeren Y, Sébire G, Cosnard G, Olivier E (2003) Correlation between impaired dexterity and corticospinal tract dysgenesis in congenital hemiplegia. Brain 126:732–747.

      Fine MS, Thoroughman KA (2006) Motor Adaptation to Single Force Pulses: Sensitive to Direction but Insensitive to Within-Movement Pulse Placement and Magnitude. J Neurophysiol 96:710–720.

      Ghez C, Scheidt R, Heijink H (2007) Different Learned Coordinate Frames for Planning Trajectories and Final Positions in Reaching. J Neurophysiol 98:3614–3626.

      Hadjiosif AM, Branscheidt M, Anaya MA, Runnalls KD, Keller J, Bastian AJ, Celnik PA, Krakauer JW (2022) Dissociation between abnormal motor synergies and impaired reaching dexterity after stroke. J Neurophysiol 127:856–868.

      Jayasinghe SA, Scheidt RA, Sainburg RL (2022) Neural Control of Stopping and Stabilizing the Arm. Front Integr Neurosci 16.

      Kanade-Mehta P, Bengtson M, Stoeckmann T, McGuire J, Ghez C, Scheidt RA (2023) Spatial mapping of posture-dependent resistance to passive displacement of the hypertonic arm post-stroke. J NeuroEngineering Rehabil 20:163.

      Lawrence DG, Kuypers HG (1968) The functional organization of the motor system in the monkey: II. The effects of lesions of the descending brain-stem pathways. Brain 91:15–36.

      Levin MF (1996) Interjoint coordination during pointing movements is disrupted in spastic hemiparesis. Brain 119:281–293.

      Lowrey CR, Bourke TC, Bagg SD, Dukelow SP, Scott SH (2019) A postural unloading task to assess fast corrective responses in the upper limb following stroke. J NeuroEngineering Rehabil 16:1–17.

      McPherson JG, Chen A, Ellis MD, Yao J, Heckman C, Dewald JP (2018) Progressive recruitment of contralesional cortico-reticulospinal pathways drives motor impairment post stroke. J Physiol 596:1211–1225.

      McPherson LM, Dewald JP (2022) Abnormal synergies and associated reactions post-hemiparetic stroke reflect muscle activation patterns of brainstem motor pathways. Front Neurol 13:934670.

      Porter R, Lemon R (1995) Corticospinal function and voluntary movement. Oxford University Press.

      Smith MA, Brandt J, Shadmehr R (2000) Motor disorder in Huntington’s disease begins as a dysfunction in error feedback control. Nature 403:544.

      Smith MA, Shadmehr R (2005) Intact ability to learn internal models of arm dynamics in Huntington’s disease but not cerebellar degeneration. J Neurophysiol 93:2809–2821.

      Tower SS (1940) Pyramidal lesion in the monkey. Brain 63:36–90.

      Twitchell TE (1951) The restoration of motor function following hemiplegia in man. Brain 74:443–480.

      Wilkins KB, Yao J, Owen M, Karbasforoushan H, Carmona C, Dewald JP (2020) Limited capacity for ipsilateral secondary motor areas to support hand function post-stroke. J Physiol 598:2153–2167.

      Zaaimi B, Edgley SA, Soteropoulos DS, Baker SN (2012) Changes in descending motor pathway connectivity after corticospinal tract lesion in macaque monkey. Brain 135:2277–2289.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews: 

      Reviewer #1 (Public Review): 

      In their manuscript, Gerlevik et al. performed an integrative analysis of clinical, genetic and transcriptomic data to identify MDS subgroups with distinct outcomes. The study was based on the building of an "immunoscore" and then combined with genotype and clinical data to analyze patient outcomes using multi-omics factor analysis. 

      Strengths: Integrative analysis of RNA-seq, genotyping and clinical data 

      Weaknesses: Validation of the bioinformatic pipeline is incomplete 

      Major comments: 

      (1) This study considered two RNA-seq data sets publicly available and generated in two distinct laboratories. Are they comparable in terms of RNA-seq technique: polyA versus rRNA depletion, paired-end sequencing, fragment length? 

      We want to reemphasize that the main point of this study is not to compare the BMMNC with the HSPC cohort. These datasets are not comparable because they were collected from different cell types, and we should not expect them to be matched. We just analysed them in parallel to check how much HSPCs contribute to the molecular signatures we see in BMMNC samples. However, we agree with the reviewer that similar RNA-seq experimental techniques should be employed to control for confounding factors. Here is the information that we found for HSPC and BMMNC RNA-seq studies:

      HSPC RNA-seq cohort: Total RNA was extracted using TRIzol (Thermo Scientific), and sequencing was performed on an Illumina HiSeq4000 with 100-bp paired-end reads.

      BMMNC RNA-seq cohort: The RNA was extracted with TRIzol reagent (Thermo Scientific). RNA-sequencing libraries were prepared from poly(A)-selected RNA and were sequenced using Illumina HiSeq 2000 or 2500 platform with 100-bp paired-end reads. 

      The only difference between the two cohorts is that one cohort includes total RNAs, whereas the other has polyA-selected RNAs. Since the gene set signatures use the expression of protein-coding genes, which all have polyA tails and are included in total RNA libraries, the analysis will not be affected by total vs. polyA-selected RNA-seq techniques.

      (2) Data quality control (figure 1): the authors must show in a graph whether the features (dimensions) of factor 1 were available for each BMMNC and CD34+ samples.  

      By features of Factor 1, we think the reviewer means the features with high weights for Factor 1 in BMMNC and CD34+ samples. Figure 2c-d clearly illustrates the important features and their associations with Factor 1 for all samples in both cohorts. The samples are the columns of the two heatmaps.

      (3) How to validate the importance of "immunoscore"? If GSEA of RNA-seq data was performed in the entire cohort, in the SF3B1-mutated samples or SRSF2-mutated samples (instead of patients having a high versus low level of factor 1 shown in Sup Fig. 4), what would be the ranking of Hallmarks or Reactome inflammatory terms among the others? 

      Our GSEA analysis was an attempt to validate the importance of our identified factors. As described in the paper, Factor 1 represents a combination of immunology scores (or “immunoscores”) in the CD34+ cohort. Applying GSEA, we identified upregulation of inflammation-related pathways, chemokines, and neutrophils in patients having high (4th quartile) versus low (1st quartile) levels of Factor 1. Interestingly, sorting patients by Factor 1 resulted in a similar pattern based on gene signature scores (Figure 2d).
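
      The high-versus-low comparison described above rests on a quartile split of the Factor 1 values. The following Python sketch shows one way such a split could be made; the patient identifiers and factor values are hypothetical, and the quartile convention (the default of `statistics.quantiles`) is an assumption rather than the method used in the paper.

```python
from statistics import quantiles

# Hypothetical Factor 1 values per patient (not data from either cohort).
factor1 = {
    "P01": -1.8, "P02": -1.1, "P03": -0.4, "P04": -0.1,
    "P05": 0.2, "P06": 0.6, "P07": 1.3, "P08": 2.0,
}

# Three cut points dividing the values into four quartiles.
q1, _median, q3 = quantiles(list(factor1.values()), n=4)

# Patients in the 1st quartile (low) and in the 4th quartile (high).
low_group = sorted(p for p, v in factor1.items() if v <= q1)
high_group = sorted(p for p, v in factor1.items() if v >= q3)

print("low:", low_group)
print("high:", high_group)
```

      The two groups would then serve as the contrast for a differential expression ranking that feeds GSEA; patients in the middle two quartiles are left out of that comparison.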

      To show that Factor 1 generated by MOFA is important and different from known MDS categories such as SF3B1 and SRSF2 mutants, we performed GSEA in SF3B1-mutated vs. SF3B1-WT samples and SRSF2-mutated vs. SRSF2-WT samples in the CD34+ cohort. As shown in Author response image 1, we did not see the upregulation of inflammation and interferon pathways in SF3B1 and SRSF2 mutant MDS.

      Author response image 1.

      GSEA showed no upregulation of inflammation and interferon pathways for SF3B1 and SRSF2 mutant in CD34+ cohort.  

      (4) To decipher cell-type composition of BMMNC and CD34+ samples, the authors used van Galen's data (2019; supplementary table 3). Cell composition is expressed as the proportion of each cell population among the others. Surprisingly, the authors found that the promonocyte-like score was increased in SF3B1-mutated samples and not in SRSF2-mutated samples, which are frequently co-mutated with TET2 and associated with a CMML-like phenotype. Is there a risk of bias if bone marrow subpopulations such as megakaryocytic-erythroid progenitors or early erythroid precursors are not considered?

      We thank the reviewer for their insightful comment about CMML and the high prevalence of SRSF2 mutation (> 45%) in CMML cases. Using single-cell RNA sequencing and high-parameter flow cytometry, Ferrall-Fairbanks et al. (DOI: 10.1158/2643-3230.BCD-21-0217) recently showed that CMML can be classified into three differentiation trajectories: monocytic, megakaryocyte-erythroid progenitor (MEP), and normal-like. One hallmark of monocytic-biased trajectory was the enrichment of inflammatory granulocyte–macrophage progenitor (GMP)-like cells, which we observed through our analysis for SRSF2 mutants (Figure 6a).

      Unfortunately, van Galen's data does not provide any gene set for MEP, and there is no single-cell RNA-seq atlas for MDS that could be used to calculate an MEP score. Also, we compared the Promono-like and GMP-like gene sets from van Galen's data, and we could not find any overlap, meaning that Promono-like is not specific enough to capture the signatures coming from the more differentiated progenitors such as GMPs. Therefore, as described in the paper, we focused on GMP-like rather than Promono-like.

      (5) Figures 2a and 2b indicated that the nature of retrotransposons identified in BMMNC and CD34+ was different. ERVs were not detected in CD34+ cells. Are ERVs not reactivated in CD34+ cells? Is there a bias in the sequencing or bioinformatic method?

      As described above, the two cohorts' sequencing methods, read length, etc., are identical. CD34+ RNA-seq is total RNA-seq that includes both polyA and non-polyA RTE transcripts. Therefore, the chance of bias and missing RTE signatures in the CD34+ cohort is very low. L1 and Alu, which are shared between the two cohorts, are the two RTE families that are still active and make new insertions in humans. Our interpretation is that ERV activation in BM is associated with immune cells. As shown by Au et al. (DOI: 10.1016/j.ccell.2021.10.001), several ERV loci had expression in purified immune cell subsets in renal cell carcinoma samples, potentially explaining ERV upregulation in tumours responding to treatment as those biopsies had increased tumour infiltration.

      (6) What is the impact of factor 1 on survival? Is it different between BMMNC and CD34+ cells considering the distinct composition of factor 1 in CD34+ and BMMNC?

      As shown in Table 1, Factor 1 in the BMMNC cohort is associated with overall survival (P-val < 0.05) in multivariate analysis but not in univariate analysis. We did not observe any association between Factor 1 and event-free survival in the BMMNC cohort. Also, the 10 factors identified by MOFA in the BM CD34+ cohort did not show any significant association with MDS overall survival (Supplementary Table 5).

      (7) In Figure 1e, genotype contributed to the variance in the CD34+ cell analyses more importantly than in the BMMNC. Because the patients are different in the two cohorts, differences in the variance could be explained either by a greater variability of the type of mutations in CD34 or an increased frequency of poor prognosis mutations in CD34+ compared to BMMNC. The genotyping data must be shown.

      The genotype has already been reported in Supplementary Table 2. In fact, the number of inspected genes was much higher in the BMMNC cohort (17 genes) compared to the CD34+ cohort (3 genes). Therefore, we have greater variability in the type of mutations in the BMMNC cohort compared to the CD34+ cohort. For the CD34+ cohort, we only had mutations for three spliceosome genes, where most cases (n=28) were SF3B1 mutants with good prognosis. We think that the result makes sense: the lower the genetic variability, the more homogeneous the groups, and the greater the chance that one factor or a group of factors can explain the genetic variance.

      (8) Fig. 2a-b: Features with high weight are shown for each factor. For factor 9, features seemed to have a low weight (Fig. 1b and 1c). However, factor 9 was predictive of EFS and OS in the BMMNC cohort. What are the features driving the prognostic value of factor 9? 

      As shown in Figure 3b, the main features are RTE expression from the LTR:ERV1, SINE:MIR, and SINE:Alu families.

      (9) The authors also provided microarray analyses of CD34+ cell. It could be interesting to test more broadly the correlation between features identified by RNA-seq or microarrays. 

      The microarray data did not come with any genetic information or clinical data except survival information. Therefore, we could not apply MOFA to the microarray data. However, we did generate gene signature scores from the microarray data and investigated the relationship of inflammatory chemokine and cytokine scores, and IFN-I signature scores, with MDS survival (Figure 3c and 4c).

      (10) The authors should discuss the relevance of immunosenescence features in the context of SRSF2 mutation and extend the discussion to the interest of their pipeline for patient diagnosis and follow up under treatments. 

      We have added the below text to the discussion:

      Recent studies have shown that the expression of programmed death-ligand 1 (PD-L1) protein is significantly elevated in senescent cells (DOIs: 10.1128/mcb.00171-22, 10.1172/JCI156250, 10.1038/s41586-022-05388-4). Increased PD-L1 protein levels protect senescent cells from being cleared by cytotoxic immune cells that express the PD-1 checkpoint receptor. In fact, activation of the PD-1 receptor inhibits the cytotoxic capabilities of CD8+ T and NK cells, increasing immunosenescence.

      Notably, patients with MDS who possess particular somatic mutations, such as those in the TP53, ASXL1, SETBP1, TET2, SRSF2, and RUNX1 genes, have an increased propensity to react favourably to PD-1/PD-L1 inhibitors (DOIs: 10.1111/bjh.17689, https://doi.org/10.1182/blood2020-141100) confirming that many cellular and molecular mechanisms, known to promote cellular senescence, including alteration of splicing machinery, are crucial stimulators of the expression of PD-L1 protein. Interestingly, in our analysis, we also observed a correlation between the senescence gene signature score and the expression of the PD-L1 gene in CD34+ cells (Supplementary Figure 7), supporting the previous findings linking PD-L1 gene expression to cellular senescence.

      The immunology and ageing features extracted from the MDS transcriptomic data used in our analysis pipeline can enhance the conventional risk-scoring systems for MDS by providing new insights into this disease, particularly in the context of inflammation and ageing. For some patients, the clinical and genetic features may remain relatively the same until follow-up. Still, the transcriptomic features might differ considerably from the baseline diagnosis, affecting the course of treatment.    

      Reviewer #2 (Public Review): 

      The authors performed a Multi-Omics Factor Analysis (MOFA) on two published MDS patient cohorts, one from bone marrow mononuclear cells (BMMNCs) and CD34 cells (ref 17) and another from CD34+ cells (ref 15), with three data modalities (clinical, genotype, and transcriptomics). Seven different views, including immune profile, inflammation/aging, Retrotransposon (RTE) expression, and cell-type composition, were derived from these modalities to attempt to identify the latent factors with significant impact on MDS prognosis.

      SF3B1 was found to be the only mutation among 13 mutations in the BMMNC cohort that indicated a significant association with high inflammation. This trend was also observed to a lesser extent in the CD34+ cohort. The MOFA factor representing inflammation showed a good prognosis for MDS patients with high inflammation. In contrast, SRSF2 mutant cases showed a granulocyte-monocyte progenitor (GMP) pattern and high levels of senescence, immunosenescence, and malignant myeloid cells, consistent with their poor prognosis. Also, MOFA identified RTE expression as a risk factor for MDS. They proposed that this work showed the efficacy of their integrative approach to assess MDS prognostic risk that 'goes beyond all the scoring systems described thus far for MDS'. 

      Several issues need clarification and response: 

      (1) The authors do not provide adequate known clinical and molecular information which demonstrates prognostic risk of their sample cohorts in order to determine whether their data and approach 'goes beyond all the scoring systems described thus far for MDS'. For example, what data have the authors that their features provide prognostic data independent of the prior known factors related to prognosis (eg, marrow blasts, mutational, cytogenetic features, ring sideroblasts, IPSS-R, IPSS-M, MDA-SS)?

      We agree with the reviewer that we did not generate a new cumulative risk score and compare it with the conventional risk scores for MDS. However, we identified individual MOFA factors, which are risk or protective factors for MDS, based on survival analysis in the BMMNC cohort. One reason that we did not generate our independent, cumulative score and compare it with other scores was that we did not receive any conventional risk score for the BMMNC cohort. However, we had access to all the clinical and genetic variables from the BMMNC cohort (except for three patients) that were required to calculate IPSS-R; hence, we calculated the IPSS-R in our resubmission for the BMMNC cohort. We made three IPSS-R risk categories by combining low and very low as low risk, and high and very high as high risk, and keeping intermediate as intermediate risk. Our survival analysis of these three categories showed a clear match between IPSS-R score and MDS survival (Author response image 2a).
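
      The regrouping of the five standard IPSS-R categories into three risk groups, as described above, can be sketched in Python as follows. The patient identifiers and category assignments are hypothetical, and the sketch covers only the regrouping step, not the IPSS-R calculation itself.

```python
# Mapping from the five IPSS-R categories to the three groups used in the response.
IPSSR_TO_GROUP = {
    "very low": "low",
    "low": "low",
    "intermediate": "intermediate",
    "high": "high",
    "very high": "high",
}

def collapse_ipssr(category: str) -> str:
    """Return the three-level risk group for an IPSS-R category label."""
    return IPSSR_TO_GROUP[category.strip().lower()]

# Hypothetical patients with already-computed IPSS-R categories.
patients = {"P01": "Very Low", "P02": "Low", "P03": "Intermediate",
            "P04": "High", "P05": "Very High"}

risk_groups = {pid: collapse_ipssr(cat) for pid, cat in patients.items()}
print(risk_groups)
```

      Merging the outer categories increases the number of patients per group, which matters for survival curves in a cohort of this size.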

      We then investigated the relationship between Factors 2, 4, and 9 from MOFA and the three IPSS-R risk groups. Integration of IPSS-R risk groups with factor values confirmed the finding in the manuscript that Factors 4 and 9 generally exert a protective influence over the MDS risk, whilst higher levels of Factor 2 predict a high-risk MDS (Author response image 2b). However, we see many outliers in all three factors, indicating that some patients were assigned to the wrong IPSS-R categories because IPSS-R calculation is based on clinical and genetic variables and does not include the transcriptomics data for coding and non-coding genomic regions.

      Author response image 2.

      Comparison of IPSS-R risk categories and MOFA risk and protective factors.

      (2) A major issue in analyzing this paper relates to the specific patient composition from whom the samples and data were obtained. The cells from the Shiozawa paper (ref 17) comprise a substantial number of CMML patients. Thus, what evidence have the authors that much of the data from the BMMNCs from these patients and mutant SRSF2 related predominantly to their monocytic differentiation state?

      We thank the reviewer for the insightful comment about the monocytic differentiation state of CMML and SRSF2 mutant cases. The BMMNC cohort has 11 CMML and 17 SRSF2 mutant cases, of which six are shared between the two groups. We have divided the patients into four groups: CMML only, SRSF2 mutant only, CMML and SRSF2 mutant, and others. We have generated boxplots for all cellular composition gene signature scores for these groups and compared the scores between these groups. As explained above, Ferrall-Fairbanks et al. (DOI: 10.1158/2643-3230.BCD-21-0217) recently showed that CMML can be classified into three differentiation trajectories: monocytic, megakaryocyte-erythroid progenitor (MEP), and normal-like. One hallmark of monocytic-biased trajectory was the enrichment of inflammatory granulocyte–macrophage progenitor (GMP)-like cells, which we observed through our analysis for the CMML cases with SRSF2 mutation (Author response image 3).

      Author response image 3.

      Cellular composition gene signature scores for CMML and SRSF2 mutant versus other cases. CMML cases with SRSF2 mutation show significantly higher GMP and GMP-like scores compared to other MDS cases.
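
      The sizes of the four comparison groups follow from the counts quoted in the response by inclusion-exclusion. The Python sketch below works them out; the total of 94 samples is taken from the later response on ring sideroblasts, and treating it as the denominator for this grouping is an assumption.

```python
# Counts quoted in the response for the BMMNC cohort.
n_cmml = 11    # CMML cases
n_srsf2 = 17   # SRSF2 mutant cases
n_both = 6     # cases that are both CMML and SRSF2 mutant

# Assumption: all 94 BMMNC samples mentioned in a later response are grouped here.
n_total = 94

group_sizes = {
    "CMML only": n_cmml - n_both,
    "SRSF2 mutant only": n_srsf2 - n_both,
    "CMML and SRSF2 mutant": n_both,
}
group_sizes["others"] = n_total - sum(group_sizes.values())

print(group_sizes)
```

      Under these assumptions the three case groups are small (5, 11, and 6 patients), so differences between them in the boxplots should be read with the limited sample size in mind.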

      (3) In addition, as the majority of patients in the Shiozawa paper have ring sideroblasts (n=59), thus potentially skewing the data toward consideration mainly of these patients, for whom better outcomes are well known.  

      We disagree with the reviewer. We used 94 BMMNC samples from Shiozawa’s paper, of which 19 cases had Refractory Anemia with Ring Sideroblasts (RARS), 4 cases had Refractory Anemia with Ring Sideroblasts and thrombocytosis (RARS-T), and 5 cases had Refractory cytopenia with multilineage dysplasia and ring sideroblasts (RCMD-RS). In total, we had 28 cases (~30%) with Ring Sideroblasts (RS), which are not large enough to skew the data.

      (4) Further, regarding this patient subset, what evidence have the authors that the importance of the SF3B1 mutation was merely related to the preponderance of sideroblastic patients from whom the samples were analyzed? 

      We had 34 SF3B1 mutant cases, of which 25 had Ring Sideroblasts (RS). The total number of cases with RS in the BMMNC cohort was 28. Therefore, the BMMNC cohort is not an RS-dominant cohort, and RS cases did not include all SF3B1 mutants. Furthermore, it was recently shown by Ochi et al. (DOI: 10.1038/s41598-022-18921-2) that RS is a consequence of the SF3B1 K700E mutation rather than an independent cause that would account for the importance of SF3B1.
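
      The counts quoted in this response and the preceding one can be checked with simple arithmetic. The Python sketch below uses only the numbers stated in the two responses and assumes that the 25 SF3B1 mutant cases with RS are all among the 28 RS cases of the same cohort.

```python
n_samples = 94  # BMMNC samples used from the Shiozawa cohort

# Ring sideroblast (RS) cases by diagnosis, as listed in the response.
rs_by_subtype = {"RARS": 19, "RARS-T": 4, "RCMD-RS": 5}
n_rs = sum(rs_by_subtype.values())
rs_percent = round(100 * n_rs / n_samples, 1)

n_sf3b1 = 34          # SF3B1 mutant cases
n_sf3b1_with_rs = 25  # SF3B1 mutant cases that also have RS

# Assumption: the SF3B1 mutants with RS are a subset of the RS cases above.
n_rs_without_sf3b1 = n_rs - n_sf3b1_with_rs
n_sf3b1_without_rs = n_sf3b1 - n_sf3b1_with_rs

print(n_rs, rs_percent, n_rs_without_sf3b1, n_sf3b1_without_rs)
```

      The overlap is substantial but incomplete: under these assumptions, three RS cases lack an SF3B1 mutation and nine SF3B1 mutants lack RS.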

      (5) An Erratum was reported for the Shiozawa paper (Shiozawa Y, Malcovati L, Gallì A, et al. Gene expression and risk of leukemic transformation in myelodysplasia. Blood. 2018 Aug 23;132(8):869-875. doi: 10.1182/blood-2018-07-863134) that resulted from a coding error in the construction of the logistic regression model for subgroup prediction based on the gene expression profiles of BMMNCs. This coding error was identified after the publication of the article. The authors should indicate the effect this error may have had on the data they now report.

      Thank you for bringing this important issue to our attention. The error resulted from a mistake in the construction of the logistic regression model for subgroup prediction based on the gene expression profiles of BMMNCs. However, this issue does not affect our result because we analysed the expression data from scratch and generated our own gene signature scores. Also, the error has no impact on the genetics and clinical information that we received from the authors.

      (6) What information have the authors as to whether the differing RTE findings were not predominantly related to the differentiation state of the cell population analyzed (ie higher in BM MNCs vs CD34, Fig 1)? What control data have the authors regarding these values from normal (non-malignant) cell populations?

      As described above, L1 and Alu, the two RTE families shared between the two cohorts, are still active and make new insertions in humans (Figure 2.a-b). Our interpretation is that ERV activation in BM is associated with immune cells. This interpretation is further supported by the findings of Au et al. (DOI: 10.1016/j.ccell.2021.10.001), where several ERV loci had expression in purified immune cell subsets in renal cell carcinoma samples. 

      Unfortunately, neither of these two cohorts included normal (non-malignant) cell populations. We think that the unbiased way in which MOFA models the heterogeneity is sufficient to capture the RTE-derepressed phenotype of a subset of MDS cases compared to others, and we do not need normal cases to further support the finding.

      (7) The statement in the Discussion regarding the effects of SRSF2 mutation is speculative and should be avoided. Many other somatic gene mutations have known stronger effects on prognosis for MDS.

      One aim of this study is to identify specific immune signatures associated with SRSF2 and SF3B1 mutations, which are highly prevalent in MDS. Although other mutations, such as TP53, may have a stronger correlation with poor survival, numerous studies have demonstrated a clear link between SRSF2 mutations and poor prognosis.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This study provides an important cell atlas of the gill of the mussel Gigantidas platifrons using a single nucleus RNA-seq dataset, a resource for the community of scientists studying deep sea physiology and metabolism and intracellular host-symbiont relationships. The work, which offers solid insights into cellular responses to starvation stress and molecular mechanisms behind deep-sea chemosymbiosis, is of relevance to scientists interested in host-symbiont relationships across ecosystems.

      Public Reviews:

      Reviewer #1 (Public Review):

      Wang et al have constructed a comprehensive single nucleus atlas for the gills of the deep sea Bathymodioline mussels, which possess intracellular symbionts that provide a key source of carbon and allow them to live in these extreme environments. They provide annotations of the different cell states within the gills, shedding light on how multiple cell types cooperate to give rise to the emergent functions of the composite tissues and the gills as a whole. They pay special attention to characterizing the bacteriocyte cell populations and identifying sets of genes that may play a role in their interaction with the symbiotes.

      Wang et al sample mussels from 3 different environments: animals from their native methane-rich environment, animals transplanted to a methane-poor environment to induce starvation, and animals that have been starved in the methane-poor environment and then moved back to the methane-rich environment. They demonstrated that starvation had the biggest impact on bacteriocyte transcriptomes. They hypothesize that the upregulation of genes associated with lysosomal digestion leads to the digestion of the intracellular symbiont during starvation, while the non-starved and reacclimated groups more readily harvest the nutrients from symbiotes without destroying them.

      Strengths:

      This paper makes available a high-quality dataset that is of interest to many disciplines of biology. The unique qualities of this non-model organism and the collection of conditions sampled make it of special interest to those studying deep sea adaptation, the impact of environmental perturbation on Bathymodioline mussels populations, and intracellular symbiotes. The authors do an excellent job of making all their data and analysis available, making this not only an important dataset but a readily accessible and understandable one.

      The authors also use a diverse array of tools to explore their data. For example, the quality of the data is augmented by the use of in situ hybridizations to validate cluster identity and KEGG analysis provides key insights into how the transcriptomes of bacteriocytes change.

      The authors also do a great job of providing diagrams and schematics to help orient non-mussel experts, thereby widening the audience of the paper.

      We thank the reviewer for the valuable feedback on our study. We are grateful that the reviewers found our work to be interesting and we appreciate their thorough evaluation of our research. Their constructive comments will be considered as we continue to develop and improve our study.

      Weaknesses:

      One of the main weaknesses of this paper is the lack of coherence between the images and the text, with some parts of the figures never being referenced in the body of the text. This makes it difficult for the reader to interpret how they fit in with the author's discussion and assess confidence in their analysis and interpretation of data. This is especially apparent in the cluster annotation section of the paper.

      We appreciate the feedback and suggestions provided by the reviewer, and we have revised our manuscript to make it more accessible to general audiences.

      Another concern is the linking of the transcriptomic shifts associated with starvation with changes in interactions with the symbiotes. Without examining and comparing the symbiote population between the different samples, it cannot be concluded that the transcriptomic shifts correlate with a shift to the 'milking' pathway and not other environmental factors. Without comparing the symbiote abundance between samples, it is difficult to disentangle changes in cell state that are due to their changing interactions with the symbiotes from other environmental factors.

      We are grateful for the valuable feedback and suggestions provided by the reviewer. Our keen interest lies in understanding symbiont responses, particularly at the single-cell level. However, it's worth noting that existing commercial single-cell RNA-seq technologies rely on oligo dT priming for reverse transcription and barcoding, thus omitting bacterial gene expression information from our dataset. We hope that advancements in technology will soon enable us to perform an integrated analysis encompassing both host and symbiont gene expression.

      Additionally, conclusions in this area are further complicated by using only snRNA-seq to study intracellular processes. This is limiting since cytoplasmic mRNA is excluded and only nuclear reads are sequenced after the organisms have had several days to acclimate to their environment and major transcriptomic shifts have occurred.

      We appreciate the comments shared by the reviewer and agree that scRNA-seq provides more comprehensive transcriptional information by targeting the entire mRNA of the cell. However, we would like to highlight that snRNA-seq has some unique advantages over scRNA-seq. Notably, snRNA-seq allows for simple snap-freezing of collected samples, facilitating easier storage, particularly for samples obtained during field trips involving deep-sea animals and other ecologically significant non-model animal samples. Additionally, unlike scRNA-seq, snRNA-seq eliminates the need for tissue dissociation, which often involves prolonged enzymatic treatment of deep-sea animal tissue/cells under atmospheric pressure. This process can potentially lead to the loss of sensitive cells or alterations in gene expression. Moreover, snRNA-seq procedures are not constrained by the size and shape of animal cells, which makes the technology well suited to constructing cell atlases of animal tissues. Consequently, we consider snRNA-seq a flexible and suitable choice for the samples studied here.

      Reviewer #2 (Public Review):

      Wang, He et al. shed insight into the molecular mechanisms of deep-sea chemosymbiosis at the single-cell level. They do so by producing a comprehensive cell atlas of the gill of Gigantidas platifrons, a chemosymbiotic mussel that dominates the deep-sea ecosystem. They uncover novel cell types and find that the gene expression of bacteriocytes, the symbiont-hosting cells, supports two hypotheses of host-symbiont interactions: the "farming" pathway, where symbionts are directly digested, and the "milking" pathway, where nutrients released by the symbionts are used by the host. They perform an in situ transplantation experiment in the deep sea and reveal transitional changes in gene expression that support a model where starvation stress induces bacteriocytes to "farm" their symbionts, while recovery leads to the restoration of the "farming" and "milking" pathways.

      A major strength of this study includes the successful application of advanced single-nucleus techniques to a non-model, deep-sea organism that remains challenging to sample. I also applaud the authors for performing an in situ transplantation experiment in a deep-sea environment. From gene expression profiles, the authors deftly provide a rich functional description of G. platifrons cell types that is well-contextualized within the unique biology of chemosymbiosis. These findings offer significant insight into the molecular mechanisms of deep-sea host-symbiont ecology, and will serve as a valuable resource for future studies into the striking biology of G. platifrons.

      The authors' conclusions are generally well-supported by their results. However, I recognize that the difficulty of obtaining deep-sea specimens may have impacted experimental design. In this area, I would appreciate more in-depth discussion of these impacts when interpreting the data.

      We thank the reviewer for their valuable feedback on our study. We're grateful that the reviewers found our work interesting, and we appreciate their thorough evaluation of our research. We'll consider their constructive comments as we continue to develop and improve our study.

      Because cells from multiple individuals were combined before sequencing, the in situ transplantation experiment lacks clear biological replicates. This may potentially result in technical variation (ie. batch effects) confounding biological variation, directly impacting the interpretation of observed changes between the Fanmao, Reconstitution, and Starvation conditions. It is notable that Fanmao cells were much more sparsely sampled. It appears that fewer cells were sequenced, resulting in the Starvation and Reconstitution conditions having 2-3x more cells after doublet filtering. It is not clear whether this is due to a technical factor impacting sequencing or whether these numbers are the result of the unique biology of Fanmao cells. Furthermore, from Table S19 it appears that while 98% of Fanmao cells survived doublet filtering, only ~40% and ~70% survived for the Starvation and Reconstitution conditions respectively, suggesting some kind of distinction in quality or approach.

      There is a pronounced divergence in the relative proportions of cells per cell type cluster in Fanmao compared to Reconstitution and Starvation (Fig. S11). This is potentially a very interesting finding, but it is difficult to know if these differences are the expected biological outcome of the experiment or the fact that Fanmao cells are much more sparsely sampled. The study also finds notable differences in gene expression between Fanmao and the other two conditions- a key finding is that bacteriocytes had the largest Fanmao-vs-starvation distance (Fig. 6B). But it is also notable that for every cell type, one or both comparisons against Fanmao produced greater distances than comparisons between Starvation and Reconstitution (Fig. 6B). Again, it is difficult to interpret whether Fanmao's distinctiveness from the other two conditions is underlain by fascinating biology or technical batch effects. Without biological replicates, it remains challenging to disentangle the two.

      As highlighted by the reviewer, our experimental design involves pooling multiple biological samples within a single treatment state before sequencing. We acknowledge the concern regarding the absence of distinct biological replicates and the potential impact of batch effects on result interpretation. While we recognize the merit of conducting multiple sequencing runs for a single treatment to provide genuine biological replicates, we contend that batch effects may not exert a strong influence on the observed patterns.

      In addition, we applied a bootstrap sampling algorithm to assess whether the gene expression patterns within a cluster are more similar than those between clusters. This algorithm involves selecting a portion of cells per cluster and examining whether this subset remains distinguishable from other clusters. Our assumption was that if different samples exhibited distinct expression patterns due to batch effect, the co-assignment probabilities of a cluster would be very low. This expectation was not met in our data, as illustrated in Fig. S2. The lack of significantly low co-assignment probabilities within clusters suggests that batch effects may not exert a strong influence on our results.
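
      The idea of a co-assignment check can be illustrated with a minimal sketch. This is not the algorithm of the cited preprint: it assumes a simplified nearest-centroid reassignment on subsampled cells, and every name in it (co_assignment, fraction, n_boot) is hypothetical. For each cluster it reports the fraction of subsampled cells that fall back into their original cluster.

```python
# Simplified, hypothetical stand-in for a cluster-robustness check.
# Assumption: cells are points in a reduced-dimension space and a cell is
# "co-assigned" when its nearest subsample centroid is its own cluster's.
import random


def centroid(points):
    """Mean position of a non-empty list of equal-length coordinate lists."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def co_assignment(cells, labels, fraction=0.8, n_boot=100, seed=0):
    """Per-cluster fraction of subsampled cells reassigned to their own cluster."""
    rng = random.Random(seed)
    clusters = sorted(set(labels))
    by_cluster = {
        c: [cell for cell, lab in zip(cells, labels) if lab == c] for c in clusters
    }
    hits = {c: 0 for c in clusters}
    totals = {c: 0 for c in clusters}
    for _ in range(n_boot):
        # Draw a portion of the cells of every cluster (without replacement).
        sample = {
            c: rng.sample(pts, max(1, int(fraction * len(pts))))
            for c, pts in by_cluster.items()
        }
        cents = {c: centroid(pts) for c, pts in sample.items()}
        for c, pts in sample.items():
            for p in pts:
                nearest = min(clusters, key=lambda k: sq_dist(p, cents[k]))
                hits[c] += int(nearest == c)
                totals[c] += 1
    return {c: hits[c] / totals[c] for c in clusters}
```

      Under these assumptions, values close to 1 for every cluster would indicate that clusters stay distinguishable after subsampling, whereas a cluster split by a strong batch effect would be expected to score lower.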

      Indeed, we acknowledge a noticeable shift in the expression patterns of certain cell types, such as the bacteriocyte. However, this is not universally applicable across all cell types. For instance, the UMAP figure in Fig. 6A illustrates a substantial overlap among basal membrane cell 2 from Fanmao, Starvation, and Reconstitution treatments, and the centroid distances between the three treatments are subtle, as depicted in Fig. 6B. This consistent pattern is also observed in DEPC, smooth muscle cells, and the food groove ciliary cells.

      The reviewer also noted variations in the number of cells per treatment. Specifically, Fanmao sequencing yielded fewer than 10 thousand cells, whereas the other two treatments produced 2-3 times more cells after quality control (QC). It is highly probable that the technician loaded different quantities of cells into the machine for single-nucleus sequencing—a not uncommon occurrence in this methodology. While loading more cells may increase the likelihood of doublets, it is crucial to emphasize that this should not significantly impact the expression patterns post-QC. It's worth noting that overloading samples has been employed as a strategic approach to capture rare cell types, as discussed in a previous study (reference: 10.1126/science.aay0267).

      The reviewer highlighted the discrepancy in cell survival rates during the 'doublet filtering' process, with 98% of Fanmao cells surviving compared to approximately 40% and 70% for the Starvation and Reconstitution conditions, respectively. It's important to clarify that the reported percentages reflect the survival of cells through a multi-step QC process employing various filtering strategies.

      Post-doublet removal, we filtered out cells with <100 or >2500 genes and <100 or >6000 unique molecular identifiers (UMIs). Additionally, genes with <10 UMIs in each data matrix were excluded. The observed differences in survival rates for Starvation and Reconstitution cells can be attributed to the total volume of data generated in Illumina sequencing. Specifically, we sequenced approximately 91 GB of data for Fanmao, ~196 GB for Starvation, and ~249 GB for Reconstitution. As a result, the qualified data obtained for Starvation and Reconstitution conditions was only about twice that of Fanmao due to the limited data volume.
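
      The thresholds quoted above can be written as a short filter. The sketch below is illustrative only: the data layout (one dictionary of UMI counts per nucleus) and all names (Nucleus, passes_qc, filter_matrix) are hypothetical, and it assumes that the gene-level cutoff is applied to the summed UMIs of the nuclei that pass the per-nucleus filter.

```python
# Illustrative QC filter using the cutoffs stated in the response.
# Assumption: the gene filter is applied after the per-nucleus filter.
from dataclasses import dataclass

MIN_GENES, MAX_GENES = 100, 2500   # detected genes per nucleus
MIN_UMIS, MAX_UMIS = 100, 6000     # UMIs per nucleus
MIN_GENE_UMIS = 10                 # total UMIs per gene in the matrix


@dataclass
class Nucleus:
    barcode: str
    counts: dict  # gene ID -> UMI count


def passes_qc(nucleus):
    """True when gene and UMI totals are both inside the allowed ranges."""
    n_genes = sum(1 for c in nucleus.counts.values() if c > 0)
    n_umis = sum(nucleus.counts.values())
    return MIN_GENES <= n_genes <= MAX_GENES and MIN_UMIS <= n_umis <= MAX_UMIS


def filter_matrix(nuclei):
    """Drop failing nuclei, then drop genes with fewer than MIN_GENE_UMIS UMIs."""
    kept = [n for n in nuclei if passes_qc(n)]
    gene_totals = {}
    for n in kept:
        for gene, count in n.counts.items():
            gene_totals[gene] = gene_totals.get(gene, 0) + count
    keep_genes = {g for g, total in gene_totals.items() if total >= MIN_GENE_UMIS}
    return [
        Nucleus(n.barcode, {g: c for g, c in n.counts.items() if g in keep_genes})
        for n in kept
    ]
```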

      The reviewer also observed a divergence in the relative proportions of cells per cell type cluster in Fanmao compared to Reconstitution and Starvation, as depicted in Fig. S1. This discrepancy may hold true biological significance, presenting a potentially intriguing finding. However, our discussion on this pattern was rather brief, as we acknowledge that the observed differences could be influenced by the sample preparation process for dissection and digestion. It is crucial to consider that cutting a slightly different area during dissection may result in variations in the proportion of cells obtained. While we recognize the potential impact of this factor, we do not think that the sparsity of sampling alone could significantly affect the relative proportions of cells per cell type.

      In conclusion, we acknowledge the reviewer's suggestion that sequencing multiple individual samples per treatment condition would have been ideal, rather than pooling them together. However, the homogenous distribution observed in UMAP and the consistent results obtained from bootstrap sampling suggest that the impact of batch effects on our analyses is likely not substantial. Additionally, based on our understanding, the smaller number of cells in the Fanmao sample should not have any significant effect on the cell-type proportions or on the expression patterns of each cluster.

      Reviewer #3 (Public Review):

      Wang et al. explored the unique biology of the deep-sea mussel Gigantidas platifrons to understand the fundamental principles of animal-symbiont relationships. They used single-nucleus RNA sequencing and validation and visualization of many of the important cellular and molecular players that allow these organisms to survive in the deep sea. They demonstrate that a diversity of cell types that support the structure and function of the gill including bacteriocytes, specialized epithelial cells that host sulfur-oxidizing or methane-oxidizing symbionts as well as a suite of other cell types including supportive cells, ciliary, and smooth muscle cells. By performing experiments of transplanting mussels from one habitat which is rich in methane to methane-limited environments, the authors showed that starved mussels may consume endosymbionts versus in methane-rich environments upregulated genes involved in glutamate synthesis. These data add to the growing body of literature that organisms control their endosymbionts in response to environmental change.

      The conclusions of the data are well supported. The authors adapted a technique that would have been technically impossible in their field environment by preserving the tissue and then performing nuclear isolation after the fact. The use of single-nucleus sequencing opens the possibility of new cellular and molecular biology that is not possible to study in the field. Additionally, the in-situ data (both WISH and FISH) are high-quality and easy to interpret. The use of cell-type-specific markers along with a symbiont-specific probe was effective. Finally, the SEM and TEM were used convincingly for specific purposes in the case of showing the cilia that may support water movement.

      We appreciate the valuable feedback provided by the reviewer on our study. It is encouraging to know that our work was found to be interesting and that they conducted a thorough evaluation of our research. We will take their constructive comments into account as we strive to develop and enhance our study. We thank the reviewer for all the input.

      The one particular area for clarification and improvement surrounds the concept of a proliferative progenitor population within the gill. The authors imply that three types of proliferative cells within gills have long been known, but their study may be the first to recover molecular markers for these putative populations. The markers the authors present for gill posterior end budding zone cells (PEBZCs) and dorsal end proliferation cells (DEPCs) are not intuitively associated with cell proliferation and some additional exploration of the data could be performed to strengthen the argument that these are indeed proliferative cells. The authors do utilize a trajectory analysis tool called Slingshot which they claim may suggest that PEBZCs could be the origin of all gill epithelial cells, however, one of the assumptions of this analysis is that differentiated cells are developed from the same precursor PEBZC population.

      However, these conclusions do not detract from the overall significance of the work of identifying the relationship between symbionts and bacteriocytes and how these host bacteriocytes modulate their gene expression in response to environmental change. It will be interesting to see how similar or different these data are across animal phyla. For instance, the work of symbiosis in cnidarians may converge on similar principles or there may be independent ways in which organisms have been able to solve these problems.

      We are grateful for the valuable comments and suggestions provided by the reviewer. All suggestions have been carefully considered, and the manuscript has been revised accordingly. We particularly value the reviewer's insights regarding the characterization of the G. platifrons gill proliferative cell populations. In a separate research endeavor, we have conducted experiments utilizing both cell division and cell proliferation markers on these proliferative cell populations. While these results are not incorporated into the current manuscript, we would be delighted to share our preliminary findings with the reviewer. Our preliminary results indicate that the proliferative cell populations exhibit positivity for cell proliferation markers and contain a significant number of mitotic cells.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Further experiments are needed to link the changes in transcriptomes of Bathymodioline mussels in the different environmental conditions to changes in their interactions with symbiotes. For example, quantifying the abundance and comparing the morphology of symbiotes between the environmental conditions would lend much support for shifting between milking and farming strategies. Without analyzing the symbiotes and comparing them across populations, it is difficult to comment on the mechanisms of interactions between symbiotes and the hosts. Without this analysis, this data is better suited towards comments about the general effect of environmental perturbation and stress on gene expression in these mussels.

      We appreciate the reviewer’s comments. We are also very curious about the symbiont responses, especially at the single-cell level. However, all the current commercial single-cell RNA-seq technologies are based on oligo dT priming for reverse transcription and barcoding. Therefore, the bacterial gene expression information is omitted from our dataset. Hopefully, with the development of technology, we could conduct an integrated analysis of both host and symbiont gene expression soon.

      Additionally, clarification is needed on which types of symbiotes are being looked at. Are they MOX or SOX populations? Are they homogenous? What are the concentrations of sulfur at the sampled sites?

      We thank you for your valuable comments and suggestions. Gigantidas platifrons harbors a MOX endosymbiont population characterized by a single 16S rRNA phylotype. We apologize for any confusion resulting from our previous wording. To clarify, we have revised lines 57-59 of our introduction.

      In the text and images, consider using standardized gene names and leaving out the genome coordinates. This would greatly help with readability. Also, be careful to properly follow gene naming and formatting conventions (ie italicizing gene names and symbols).

      We appreciate the reviewer’s insightful comments. In model animals, gene nomenclature often stems from forward genetic approaches, such as the identification of loss-of-function mutants. These gene names, along with their protein products, typically correspond to unique genome coordinates. Conversely, in non-model invertebrates (e.g., Gigantidas platifrons in the present study), gene prediction relies on a combination of bioinformatics methods, including de novo prediction, homolog-based prediction, and transcriptomics mapping. Subsequently, the genes are annotated by identifying their best homologs in well-characterized databases. Given that different genes may encode proteins with similar annotated functions, we chose to include both the gene ID (genome coordinates) and the gene name in our manuscript. This dual labeling approach ensures that our audience receives accurate and comprehensive information regarding gene identification and annotation.

      Additionally, extending KEGG analysis to the atlas annotation section could help strengthen the confidence of annotations. For example, when identifying bacteriocyte populations, the functional categories of individual marker genes (lysosomal proteases, lysosomal traffic regulators, etc) are used to justify the annotation. Presenting KEGG support that these functional categories are upregulated in this population relative to others would help further support how you characterize this cluster by showing it's not just a few specific genes that are enriched in this cell group, but rather an overall functionality.

      We appreciate the valuable suggestion provided by the reviewer. Indeed, incorporating KEGG analysis into the atlas annotation section could further enhance the confidence in our annotations. However, in our study, we encountered some limitations that impeded us from conducting a comprehensive KEGG enrichment analysis.

      Firstly, the number of differentially expressed genes (DEGs) that we identified for certain cell populations was relatively small, making it challenging to meet the threshold required for meaningful KEGG enrichment analysis. For instance, among the 97 marker genes identified for the Bacteriocyte cluster, only two genes, Bpl_scaf_59648-4.5 (lysosomal alpha-glucosidase-like) and Bpl_scaf_52809-1.6 (lysosomal-trafficking regulator-like isoform X1), were identified as lysosomal genes. To generate reliable KEGG enrichments, a larger number of genes is typically required.

      Secondly, single-nucleus sequencing, as employed in our study, tends to yield a relatively smaller number of genes per cell compared to bulk RNA sequencing. This limited gene yield can make it challenging to achieve sufficient gene representation for rigorous KEGG enrichment analysis.

      Furthermore, many genes in the genome still lack comprehensive annotation, both in terms of KEGG and GO annotations. In our dataset, out of the 33,584 genes obtained through single-nuclei sequencing, 26,514 genes have NO KEGG annotation, and 25,087 genes have NO GO annotation. This lack of annotations further restricts the comprehensive application of KEGG analysis in our study.
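
      To make the scale of these limitations concrete, the sketch below computes the annotation coverage from the gene counts quoted above and provides a standard hypergeometric over-representation test of the kind commonly used for pathway enrichment. The function names are hypothetical, and the sketch is not the enrichment pipeline used in the study; pathway sizes are not given in this response, so none are assumed.

```python
# Annotation coverage from the counts quoted in the response, plus a generic
# hypergeometric upper-tail test (a common basis for enrichment analysis).
from math import comb


def annotated_fraction(total_genes, unannotated):
    """Fraction of genes that carry an annotation."""
    return (total_genes - unannotated) / total_genes


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K are in the pathway."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


TOTAL_GENES = 33584
kegg_coverage = annotated_fraction(TOTAL_GENES, 26514)  # genes with KEGG terms
go_coverage = annotated_fraction(TOTAL_GENES, 25087)    # genes with GO terms

print(f"KEGG-annotated genes: {kegg_coverage:.1%}")
print(f"GO-annotated genes: {go_coverage:.1%}")
```

      With roughly a fifth of the genes carrying a KEGG term, only a small part of a 97-gene marker list can enter such a test, which is consistent with the limited statistical power described above.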

      The claim that VEPCs are symbiote free is not demonstrated. Additional double in situs are needed to show that markers of this cell type localize in regions free of symbiotes.

      We appreciate your comments and suggestions. In Figure 5B, our results demonstrate that the bacteriocytes (green fluorescent signal) are distant from the VEPCs, which are located around the tip of the gill filaments (close to the food groove). We have revised our Figure 5B to make it clear.

      Additionally, it does not seem like trajectory analysis is appropriate for these sampling conditions. Generally, to create trajectories confidently, more closely sampled time points are needed to sufficiently parse out the changes in expression. More justification is needed for the use of this type of analysis here and a discussion of the limitations should be mentioned, especially when discussing the hypotheses relating to PEBZCs, VEPCs, and DEPCs.

      We greatly appreciate your thoughtful commentary. It is important to acknowledge that in the context of a developmental study, incorporating more closely spaced time points indeed holds great value. In our ongoing project investigating mouse development, for instance, we have implemented time points at 24-hour intervals. However, in the case of deep-sea adult animals, we hypothesized a slower transcriptional shift in such an extreme environment, which led us to opt for a time interval of 3-7 days. Examining the differential expression profiles among the three treatments, we observed that most cell types exhibited minimal changes in their expression profiles. For the cell types strongly impacted by in situ transplantation, the expression profiles of each cell type still overlapped substantially in the UMAP analysis (Figure 6a), thus enabling meaningful comparisons. Nevertheless, we recognize that our sampling strategy may not be flawless. Additionally, the challenging nature of conducting in situ transplantation at a depth of 1000 meters limited the number of sampling occasions available to us. We sincerely appreciate your input and understanding.

      Finally, more detail should be added on the computational methods used in this paper. For example, the single-cell genomics analysis protocol should be expanded on so that readers unfamiliar with BD single-cell genomics handbooks could replicate the analysis. More detail is also needed on what criteria and cutoffs were used to calculate marker genes. Also, please be careful to cite the algorithms and software packages mentioned in the text.

      Acknowledged, thank you for highlighting this. In essence, the workflow closely resembles the 10x Genomics workflow, although it uses different software from Cell Ranger. We explain the workflow in more detail below, while noting that this information may no longer be relevant for newer users of BD or for readers who are not acquainted with BD, given that the workflow underwent a complete overhaul in the summer of 2023.

      References to lines

      Line 32: typo "..uncovered unknown tissue heterogeny" should read "uncovering" or "and uncovered")

      Overall abstract could include more detail of findings (ex: what are the "shifts in cell state" in line 36 that were observed)

      We apologize for the mistakes, and have revised the manuscript accordingly.

      Line 60: missing comma "...gill filament structure, but also"

      We apologize for the mistakes, and have revised the manuscript accordingly.

      Line 62-63: further discussion here, or in the relevant sections of the specific genes identified in the referenced bulk RNA-seq project could help strengthen confidence in annotation

      We appreciate the comment, and have revised the manuscript accordingly.

      Line 112: what bootstrapping strategy? Applied to what?

      This is a bootstrap sampling algorithm, developed in a recent bioRxiv preprint, that assesses the robustness of each cell cluster (Singh, P. & Zhai, Y. Deciphering Hematopoiesis at single cell level through the lens of reduced dimensions. bioRxiv, 2022.06.07.495099 (2022). https://doi.org/10.1101/2022.06.07.495099).

      Lines 127-129: What figures demonstrate the location of the inter lamina cells? Are there in situs that show this?

      We apologize for any errors; the referencing of figures in the manuscript has been revised for clarity.

      Lines 185-190: does literature support these as markers of SMCs? Are they known smooth muscle markers in other systems?

      We characterized the SMCs by the expression of LDL-associated protein, angiotensin-converting enzyme-like protein, and the "molecular spring" titin-like protein, all of which are commonly found in human vascular smooth muscle cells. Based on this analysis, we hypothesize that these cells belong to the smooth muscle cell category.

      Line 201: What is meant by "regulatory roles"?

      In this context, we are discussing the expression of genes encoding regulatory proteins, such as SOX transcription factors and secreted-frizzled proteins.

      Line 211: which markers disappeared? What in situs show this?

      We apologize for the mistakes, and have revised the manuscript accordingly.

      Line 211: typo, "role" → "roll"

      We apologize for the mistakes, and have revised the manuscript accordingly.

      Line 214: what are these "hallmark genes"

      We apologize for the mistake. Here we are referring to the genes listed in Figure 4B. We have revised the manuscript accordingly.

      Line 220: are there meristem-like cells in metazoans? If so, this would be preferable to a comparison with plants.

      In this context, we are discussing the morphological characteristics of gill proliferative cell populations found in filibranch bivalves. These populations, namely PEPC, VEPC, and DEPC, consist of cells exhibiting morphological traits akin to those of plant cambial-zone meristem cells. These cells are typically small and round, with a high nucleus-to-cytoplasm ratio. We acknowledge that while these terms are used in bivalve studies (citations below), they lack the robust support seen in model systems backed by molecular biology evidence. The present snRNA-seq data, however, may offer valuable cell markers for future comprehensive investigations.

      Leibson, N. L. & Movchan, O. T. Cambial zones in gills of Bivalvia. Mar. Biol. 31, 175-180 (1975). https://doi.org/10.1007/BF00391629

      Wentrup, C., Wendeberg, A., Schimak, M., Borowski, C. & Dubilier, N. Forever competent: deep-sea bivalves are colonized by their chemosynthetic symbionts throughout their lifetime. Environ. Microbiol. 16, 3699-3713 (2014). https://doi.org/10.1111/1462-2920.12597

      Cannuel, R., Beninger, P. G., McCombie, H. & Boudry, P. Gill Development and its functional and evolutionary implications in the blue mussel Mytilus edulis (Bivalvia: Mytilidae). Biol. Bull. 217, 173-188 (2009). https://doi.org/10.1086/BBLv217n2p173

      Line 335: what is slingshot trajectory analysis? Does this differ from the pseudotime analysis?

      Slingshot is an algorithm that uses the principal graph of the cells to infer trajectories. It models trajectories as curves on the principal graph, capturing the progression and transitions between different cellular states.

      Both Slingshot and pseudotime analysis aim to infer cellular trajectories. Slingshot focuses on capturing branching patterns and is fully compatible with the embeddings generated by dimensionality reduction methods such as UMAP and PHATE, whereas pseudotime analysis aims to order cells along a continuous trajectory and does not rely on such embeddings. We used both in the manuscript for different purposes.

      Line 241: introduce FISH methodology earlier in the paper, when in situ images are first referenced

      We appreciate the comment, and have revised the manuscript accordingly.

      Line 246-249: can you quantify the decrease in signal or calculate the concentration of symbiotes in the cells? Was 5C imaged whole? This can impact the fluorescent intensity in tissues of different thicknesses.

      We appreciate your comment. In Figure 5C, most of the typical gill filament region is visible (the ventral tip of the gill filament, and the mid part of the gill filament) except for the dorsal end. The gill filament of bathymodioline mussels exhibits a simple structure: a single layer of bacteriocytes grows on the basal membrane. Consequently, the gill slices have a fairly uniform thickness (with two layers of bacteriocytes and one layer of interlamina cells in between), minimizing any potential impact on fluorescent intensity. As of now, detailed quantification of intracellular symbionts may necessitate continuous TEM or ultra-resolution confocal sections to 3D reconstruct the bacteriocytes, which may exceed the scope of the current study. Therefore, fluorescent intensity remains the only method available to us for estimating bacterial density/distribution across the gill filament.

      Line 249: What is meant by 'environmental gradient?'

      Here we are referring to the gases needed for the symbionts' chemosynthesis. We have revised the manuscript to make this clear.

      Lines 255-256: Were the results shown in the TEM images previously known? Not clear what novel information is conveyed in images Fig 5 C and D

      In Fig. 5C and D, we provide high-quality SEM and TEM images of a typical bacteriocyte, clearly showing its morphology and subcellular machinery. These electron microscopy images offer the audience a comprehensive introduction to the cellular function of bacteriocytes. Additionally, they serve as supportive evidence for the bacteriocytes' snRNA-seq data.

      Line 295-296: Can you elaborate on what types of solute carrier genes have been shown to be involved with symbioses?

      We appreciate the comment, and have revised the manuscript accordingly. The putative functions of the solute carriers can be found in Figure 5I.

      Line 297-301: Which genes from the bulk RNA-seq study? Adding more detail and references in cluster annotation would help readers better understand the justifications.

      We appreciate the comment, and have revised the manuscript accordingly.

      Line 316 -322: Can you provide the values of the distances?

      We now provide the values in the main text, in addition to Fig. 6B, and in a supplementary table (Supplementary Table S19).

      Line 328: What are the gene expression patterns?

      We observed genes that are up- and down-regulated during starvation and reconstitution.

      Line 334-337: A visualization of the different expression levels of the specific genes in clusters between sites might be helpful to demonstrate the degree of difference between sites.

      We have prepared a new supplementary file showing the different expression levels.

      Line 337: Citation needed

      We appreciate the comment. Here, we hypothesize the cellular responses based on the genes' functions and their expression patterns.

      Line 402-403: Cannot determine lineages from data presented. Need lineage tracing over time to determine this

      We acknowledge the necessity of conducting lineage tracing over time to validate this hypothesis. Nonetheless, in practical terms, it is difficult to obtain samples for testing this. It may be easier to test the hypothesis in their shallow-sea relatives, although this too is very difficult in practice.

      413-414: What are the "cell-type specific responses to environmental change"? It could be interesting to present these results in the "results and discussion" section

      These results are shown in Supplementary Figure S8.

      Line 419-424: Sampling details might go better earlier on in the paper, when the sampling scheme is introduced.

      We appreciate the comments. Here, we are discussing the limitations of our current study, not sampling details.

      Line 552: What type of sequencing? Paired end? How long?

      We conducted 150 bp paired-end sequencing.

      556-563: More detail here would be useful to readers not familiar with the BD guide. Also be careful to cite the software used in analysis!

      The provided guide and handbook elucidate the intricacies of gene name preparation, data alignment to the genome, and the generation of an expression matrix. It is worth mentioning that we relied upon outdated versions of the aforementioned resources during our data analysis phase, as they were the only ones accessible to us at the time. However, we have since become aware of a newer pipeline available this year, rendering the information presented here of limited significance to other researchers utilizing BD.

      Many thanks for the kind reminder. We have now included a reference for STAR. All other software was cited accordingly. There are no scholarly papers or publications on the BD pipeline that we can cite.

      Line 577-578: How was the number of clusters determined? What is meant by "manually combine the clusters?" If cells were clustered by hand, more detail on the method is needed, as well as direct discussion and justification in the body of the paper.

      It would be more appropriate to emphasize the determination of cell types rather than clusters. The clusters were identified using a clustering function, as mentioned in the manuscript. It's important to note that the clustering function (in our case, the FindClusters function of Seurat) provides a general overview based on diffuse gene expression. Technically speaking, there is no guarantee that one cluster corresponds to a single cell type. Therefore, it is crucial to manually inspect the clustering results to assign clusters to the appropriate cell types. In some cases, multiple clusters may be assigned to the same cell type, while in other cases, a single cluster may need to be further subdivided into two or more cell types or sub-cell types, depending on the specific circumstances.

      For studies conducted on model species such as humans or mice, highly and specifically expressed genes within each cluster can be compared to known marker genes of cell types mentioned in previous publications, which generally suffices for annotation purposes. However, in the case of non-model species like Bathymodioline mussels, there is often limited information available about marker genes, making it challenging to confidently assign clusters to specific cell types. In such situations, in situ hybridisation proves to be incredibly valuable. In our study, WISH was employed to visualise the expression and morphology of marker genes within clusters. When WISH revealed the expression of marker genes from a cluster in a specific type of cell, we classified that cluster as a genuine cell type. Moreover, if WISH demonstrated uniform expression of marker genes from different clusters in the same cell, we assigned both clusters to the same cell type.

      We expanded the description of the strategy in the Methods section.

      Line 690-692: When slices were used, what part of the gill were they taken from?

      We sectioned the gill around its middle part, which is representative of mature bacteriocytes.

      References to figures:

      General

      Please split the fluorescent images into different channels with an additional composite. It is difficult to see some of the expression patterns. It would also make it accessible to colorblind readers.

      We appreciate the comments and suggestions from the reviewer. We have converted our figures to CMYK colour, which will help colorblind readers to read our paper.

      Please provide the number of replicates for each in situ and what proportion of those displayed the presented pattern.

      We appreciate the reviewer’s comments. We have explained this in the Materials and Methods section of the manuscript.

      Figure 2.C' is a fantastic summary and really helps the non-mussel audience understand the results. Adding schematics like this to Figures 3-5 would be helpful as well.

      We value the reviewer's comments. We propose that Figures 3K, 4C, and 5A-D could offer similar schematic explanations to assist the audience.

      Figure 2:

      Figures 2.C-F, 2.C', 2.H-J are not referenced in the text. Adding in discussions of them would help strengthen your discussions on the cluster annotation

      We appreciate the reviewer's comments. We have revised the manuscript accordingly.

      In 2.B. 6 genes are highlighted in red and said to be shown in in situs, but only 5 are shown.

      We apologize for the mistake. We did not include the WISH result for 20639-0.0 in the present study. We have changed the label to black.

      Figure 3:

      Fig 2C-E not mentioned.

      We appreciate the reviewer's comments. We have revised the manuscript accordingly.

      In 3.B 8 genes are highlighted in red and said to be shown in in situs. Only 6 are.

      The WISH results are provided in Supplementary Figures S4 and S5.

      Figure 3.K is not referenced in the legend.

      We appreciate the comment, and have revised the manuscript accordingly.

      Figure 4:

      In Figure D, it might be helpful to indicate the growth direction.

      We appreciate the comment, and have revised the manuscript accordingly by adding an arrow in panel D to indicate growth direction.

      4F: A double in situ with the symbiote marker is needed to demonstrate the nucleolin-like positive cells are symbiote free.

      We appreciate the comment. The symbiont-free region can be seen in Figure 5A.

      Figure 5:

      In 5.A, quantification of symbiote concentration would help support your conclusion that they are denser around the edges.

      We appreciate the comment. As mentioned above, detailed quantification of intracellular symbionts may necessitate continuous TEM or ultra-resolution confocal sections to 3D reconstruct the bacteriocytes, which may exceed the scope of the current study. Therefore, fluorescent intensity remains the only method available to us for estimating bacterial density/distribution across the gill filament.

      In 5.D, the annotation is not clear. Adding arrows like in 5.C would be helpful.

      We appreciate the comment, and have revised the manuscript accordingly.

      A few genes in 5.F are not mentioned in the paper body when listing other genes. Mentioning them would help provide more support for your clustering.

      We appreciate the comment, and have revised the manuscript accordingly.

      Is 5.I meant to be color coded with the gene groups from 5.F? Color Coding the gene names, rather than organelles or cellular structures might portray this better and help visually strengthen the link between the diagram and your dot plot.

      We appreciate the suggestions. We've experimented with color-coding the gene names, but some colors are less discernible against a white background.

      Figure 6:

      6.B Is there a better way to visualize this data? The color coding is confusing given the pairwise distances. Maybe heatmaps?

      We attempted a heatmap, as shown in the figure below. However, all co-authors agree that a bar plot provides clearer visualization than the heatmap. We agree that the color scheme may be confusing because it uses the same colors as the individual treatments, so we have changed the colors.

      Author response image 1.

      Figure 6.D: Why is the fanmao sample divided in the middle?

      Fig. 6C shows that the single-cell trajectories include branches. The branches occur because cells execute alternative gene expression programs. Thus, in Fig. 6D, we show changes for genes that are significantly branch dependent in both lineages at the same time. Specifically, in cluster 2, the genes are upregulated during starvation but downregulated during reconstitution. Conversely, genes in cluster 1 are downregulated during starvation but upregulated during reconstitution. Note that Fig. 6D displays only a small subset of the significantly branch-dependent genes.

      Figure 6.D: Can you visualize the expression in the same format as in figures 2-5?

      We appreciate the comments from the reviewer. As far as we know, this heatmap is the best format to demonstrate this type of gene expression profile.

      Supplementary Figure S2:

      Please provide a key for the cell type abbreviations

      We appreciate the comment, and have added the abbreviations of cell types accordingly.

      Supplementary Figures S4 and S5:

      What part of the larger images are the subsetted image taken from?

      We appreciate the comment. These images were taken from the ventral tip and the middle of the gill slices, respectively. We have revised the figure legends to make this clear.

      Supplemental Figure S7:

      If clusters 1 and 2 show genes up and downregulated during starvation, what do clusters 4 and 3 represent?

      Cluster 1: genes that are clearly upregulated during starvation and downregulated during reconstitution. Cluster 4: genes that are downregulated during reconstitution but not clearly upregulated during starvation.

      Cluster 2 shows genes upregulated during reconstitution, and cluster 3 shows genes clearly downregulated during starvation.

      Author response table 1.

      Supplemental Figure S8:

      This is a really interesting figure that I think shows some of the results really well! Maybe consider moving it to the main figures of the paper?

      We appreciate the comments and suggestions. We concur with the reviewer on the significance of the results presented. However, considering the length of this manuscript, we have prioritized the inclusion of the most pertinent information in the main figures. Supplementary materials containing additional figures and details on the genes involved in these pathways are provided for interested readers.

      Supplemental Figure S11:

      Switching the axes might make this image easier for the reader to interpret. Additionally, calculating the normalized contribution of each sample to each cluster could help quantify the extent to which bacteriocytes are reduced when starving.

      Thank you for the insightful suggestion, which we have implemented as detailed below. We acknowledge the importance of understanding the changes in bacteriocyte proportions across different treatments. However, it's crucial to note that the percentage of cells per treatment is highly influenced by factors such as the location of digestion and sequencing, as previously mentioned.

      Author response image 2.

      Reviewer #2 (Recommendations For The Authors):

      The following are minor recommendations for the text and figures that may help with clarity:

      Fig. 3K: This figure describes water flow induced by different ciliary cells. It is not clear what the color of the arrows corresponds to, as they do not match the UMAP (i.e. the red arrow) and this is not indicated in the legend. Are these colours meant to indicate the different ciliary cell types? If so it would be helpful to include this in the legend.

      We appreciate the reviewer's comments and suggestions. The arrows indicate the water flow that might be driven by certain types of cilia. We have revised our figure and figure legends to make this clear.

      Line 369: The incorrect gene identifier is given for the mitochondrial trifunctional enzyme. This gene identifier is identical to the one given in line 366, which describes long-chain-fatty-acid-ligase ACSBG2-like (Bpl_scaf_28862-1.5).

      We appreciate the reviewer's comments and suggestions. We have revised our manuscript accordingly.

      Line 554: The Bioproject accession number (PRJNA779258) does not appear to lead to an existing page in any database.

      We appreciate the reviewer's comments and suggestions. We have released this Bioproject to the public.

      Line 597-598: it would be helpful to know the specific number of cells that the three sample types were downsampled to, and the number of cells remaining in each cluster, as this can affect the statistical interpretation of differential expression analyses.

      The number of cells per cluster in our analysis ranged from 766 to 14633. To mitigate potential bias introduced by varying cell numbers, we implemented downsampling, restricting the number of cells per cluster to no more than 3500. This ensured that cell numbers differed by less than fivefold between clusters. We experimented with several downsampling strategies, exploring cell limits of 4500 and 2500, and consistently observed similar patterns across these variations.
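
      The capping step described above can be sketched as follows. This is a minimal illustration in plain Python rather than the code used in the study; the function name, the cell identifiers, and the intermediate cluster of 4000 cells are hypothetical, while the sizes 766 and 14633 and the cap of 3500 come from the text.

```python
import random


def downsample_per_cluster(cluster_of, cap=3500, seed=0):
    """Keep at most `cap` randomly chosen cells in every cluster.

    cluster_of: {cell_id: cluster_label}. Clusters at or below the cap
    are left untouched; larger ones are sampled without replacement.
    """
    rng = random.Random(seed)
    by_cluster = {}
    for cell, cluster in cluster_of.items():
        by_cluster.setdefault(cluster, []).append(cell)

    kept = {}
    for cluster, members in by_cluster.items():
        members.sort()  # fixed order so the seed gives reproducible draws
        chosen = members if len(members) <= cap else rng.sample(members, cap)
        for cell in chosen:
            kept[cell] = cluster
    return kept


# Smallest and largest cluster sizes from the text, plus a made-up middle one.
sizes = {"smallest": 766, "middle": 4000, "largest": 14633}
cluster_of = {f"{name}_{i}": name for name, n in sizes.items() for i in range(n)}

kept = downsample_per_cluster(cluster_of, cap=3500)
counts = {}
for cluster in kept.values():
    counts[cluster] = counts.get(cluster, 0) + 1

print(counts)
print("largest/smallest ratio:", max(counts.values()) / min(counts.values()))
```

      With a cap of 3500 and a smallest cluster of 766 cells, the largest remaining ratio is 3500/766, or about 4.6, which is the "less than fivefold" condition stated above.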

      Data and code availability:

      The supplementary tables and supplementary data S1 appear to be the final output of the differential expression analyses. Including the raw data (e.g. reads) and/or intermediate data objects (e.g. count matrices, R objects), in addition to the code used to perform the analyses, may be very helpful for replication and downstream use of this dataset. As mentioned above, the Bioproject accession number appears to be incorrect.

      We appreciate the reviewer's comments and suggestions. Regarding our sequencing data, we have deposited all relevant information with the National Center for Biotechnology Information (NCBI) under Bioproject PRJNA779258. Additionally, we have requested the release of the Bioproject. Furthermore, as part of this round of revision, we have included the count matrices for reference.

      Reviewer #3 (Recommendations For The Authors):

      As noted in the public review, my only major concerns are around the treatment of progenitor cell populations. I am sympathetic to the challenges of these experiments but suggest a few possible avenues to the authors.

      First, there could be some demonstration that these cells in G. platifrons are indeed proliferative, using EdU incorporation labeling or a conserved epitope such as the phosphorylation of serine 10 in histone 3. It appears in Mytilus galloprovincialis that proliferating cell nuclear antigen (PCNA) and phospho-histone H3 have previously been used as good markers for proliferative cells (Maiorova and Odintsova 2016). The use of any of these markers along with the cell type markers the authors recover for PEBZCs for example would greatly strengthen the argument that these are proliferative cells.

      If performing these experiments would not be currently possible, the authors could use some computation approaches to strengthen their arguments. Based on conserved cell cycle markers and the use of Cell-Cycle feature analysis in Seurat could the authors provide evidence that these progenitors occupy the G2/M phase at a greater percentage than other cells? Other than the physical position of the cells is there much that suggests that these are proliferative? While I am more convinced by markers in VEPCs the markers for PEBZCs and DEPCs are not particularly compelling.

      While I do not think the major findings of the paper hinge on this, comments such as "the PBEZCs gave rise to new bacteriocytes that allowed symbiont colonization" should be taken with care. It is not clear that the PBEZCs are proliferative and there does not seem to be any direct evidence that PBEZCs (or DEPCs or VEPCS for that manner) are the progenitor cells through any sort of labeling or co-expression studies.

      We appreciate the comments and suggestions from the reviewer. We have considered all the suggestions and have revised the manuscript accordingly. We especially appreciate the reviewer’s suggestions about the characterisation of the G. platifrons gill proliferative cell populations. In a separate research project, we have tested both cell division and cell proliferation markers on the proliferative cell populations. Though we are not able to include these results in the current manuscript, we are happy to share our preliminary results with the reviewer. Our results demonstrate that the proliferative cell populations, particularly the VEPCs, are positive for cell proliferation markers and contain a high number of mitotic cells.

      Author response image 3.

      Finally, there is a body of literature that has examined cell proliferation and zones of proliferation in mussels (such as Piquet, B., Lallier, F.H., André, C. et al. Regionalized cell proliferation in the symbiont-bearing gill of the hydrothermal vent mussel Bathymodiolus azoricus. Symbiosis 2020) or other organisms (such as Bird, A. M., von Dassow, G., & Maslakova, S. A. How the pilidium larva grows. EvoDevo. 2014) that could be discussed.

      We appreciate the comments and suggestions from the reviewer. We have considered all the suggestions and have revised the manuscript accordingly (line 226-229).

      Minor comments also include:

      Consider changing the orientation of diagrams in Figure 2C' in relationship to Figure 2C and 2D-K.

      We appreciate the comments and suggestions from the reviewer. Figure 2 has been reorganized.

      For the diagram in Figure 3K, please clarify if the arrows drawn for the direction of inter lamina water flow is based on gene expression, SEM, or some previous study.

      We are grateful for the reviewer's valuable feedback and suggestions. The arrows in the figure indicate the direction of water flow that could be affected by specific types of cilia. Our prediction is based on both gene expression and SEM results. To further clarify this point, we have revised the figure legend of Fig. 3.

      Please include a label for the clusters in Figure 5E for consistency.

      We have revised our Figure 5E to keep our figures consistent.

      Please include a note in the Materials and Methods for Monocle analysis in Figure 6.

      We conducted Monocle analyses using Monocle 2 and Monocle 3 in the R environment. We have revised our Materials and Methods with further information on Figure 6.

      In Supplement 2, the first column is labeled PEBC while the first row is labeled PEBZ versus all other rows and columns have corresponding names. I am guessing this is a typo and not different clusters?

      We appreciate the great effort of the reviewer in reviewing our manuscript. We have corrected the typo in the revised version.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This important study presents a novel pipeline for the large-scale genomic prediction of members of the non-ribosomal peptide group of pyoverdines based on a dataset from nearly 2000 Pseudomonas genomes. The advance presented in this study is largely based on solid evidence, although some main claims are only incompletely supported. This study on bacterial siderophores has broad theoretical and practical implications beyond a singular subfield.

      Thank you for the supportive and encouraging words. We appreciate the editor’s and reviewers’ careful and professional assessment of this manuscript. The reviewers’ scrutiny has helped us to improve the presentation and discussion of our work. We have now carefully revised the manuscript following their instructive suggestions and comments. Please find below our detailed responses (marked in blue) to each of the comments.

      Public Reviews:

      Reviewer #1 (Public Review):

      The manuscript introduces a bioinformatic pipeline designed to enhance the structure prediction of pyoverdines, revealing an extensive and previously overlooked diversity in siderophores and receptors. Utilizing a combination of feature sequence and phylogenetic approaches, the method aims to address the challenging task of predicting structures based on dispersed gene clusters, particularly relevant for pyoverdines.

      Predicting structures based on gene clusters is still challenging, especially pyoverdines as the gene clusters are often spread to different locations in the genome. An improved method would indeed be highly useful, and the diversity of pyoverdine gene clusters and receptors identified is impressive.

      However, so far the method basically aligns the structural genes and domains involved in pyoverdine biosynthesis and then predicts A domain specificity to predict the encoded compounds. Both methods are not particularly new as they are included in other tools such as PRISM (10.1093/nar/gkx320) or Sandpuma (https://doi.org/10.1093/bioinformatics/btx400) among others. The study claims superiority in A domain prediction compared to existing tools, yet the support is currently limited, relying on a comparison solely with AntiSMASH. A more extensive and systematic comparison with other tools is needed.  

      Thanks for pointing this out. In the revised manuscript, we have included a comprehensive comparative analysis, in which we compared our pipeline to six commonly used methods: NP.searcher, PRISM4, AdenPredictor, SeMPI2, SANDPUMA, and antiSMASH5 (see Supplementary Table 6 for details, and lines 281-286). These approaches either consist of a single specific algorithm or integrate several methods. Our approach performs best (see table below), demonstrating a clear improvement over previous tools. The improvements are due to several methodological differences inherent to our approach. Additionally, while exploring existing prediction tools, we found that some had not been maintained for years. For instance, we were unable to access NRPSsp (www.nrpssp.com) and NRPSpredictor2 (http://nrps.informatik.uni-tuebingen.de/). Below, we briefly explain these differences, particularly in relation to PRISM and SANDPUMA, as highlighted by the reviewer.

      Author response table 1.

      PRISM annotates biosynthetic gene clusters (BGCs) and reconstructs the linear structures of NRPS synthetases, with this function depending on proper annotations of open reading frames. This pipeline can have difficulties in assembling the linear structure into a final product. In our analysis, we found that the annotations of NRPS genes are frequently truncated because of sequencing errors and annotation issues. Our method fixes this problem by rescanning all possible reading frames of the BGC to rebuild complete pyoverdine synthetase genes.
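
      To make the idea of rescanning all possible reading frames concrete, the sketch below translates a DNA sequence in all six frames with the standard genetic code and reports the longest stretch that starts with methionine and contains no stop codon. It is a generic six-frame scan written for illustration only, not the pipeline's implementation; the function names and the 21-nt example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Reverse complement of an upper-case A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def longest_orf_six_frames(dna):
    """Return (protein, strand, frame) for the longest M-initiated segment.

    A segment that runs to the end of the sequence without a stop codon
    is also accepted, since truncated genes are the case of interest.
    """
    dna = dna.upper()
    best = ("", None, None)
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein = translate(seq[frame:])
            for segment in protein.split("*"):
                start = segment.find("M")
                if start != -1 and len(segment) - start > len(best[0]):
                    best = (segment[start:], strand, frame)
    return best


# Hypothetical example: a short ORF placed in frame 2 of the forward strand.
example = "CC" + "ATGAAACCCGGGTTTTAA" + "G"
print(longest_orf_six_frames(example))
```

      A real rescan would additionally need to stitch fragments split by frameshifts and check the recovered protein for the expected NRPS domains, which this sketch does not attempt.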

      SANDPUMA and our approach are based on similar ideas (using the prediCAT algorithm) to predict A domain substrates, namely by using the closest annotated reference A domain. However, our method uses a self-adaptive feature extraction step to reduce the confounding influence of phylogeny. This small adjustment significantly improves the performance of our approach and even works well for small training sets (101 experimentally validated A domains with our approach as opposed to 494 A domains used by SANDPUMA from MIBiG).
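
      The "closest reference" idea shared by both approaches can be illustrated with a very small nearest-neighbour lookup. The sketch below scores a query against each reference by simple positional identity over pre-aligned, equal-length feature strings and returns "unknown" when no reference is close enough. It does not reproduce prediCAT or the self-adaptive feature extraction step; the function names, the 0.7 cutoff, and the toy sequences are assumptions made purely for illustration.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def predict_substrate(query, references, min_identity=0.7):
    """Return (substrate, identity) of the closest reference.

    references: list of (feature_string, substrate) pairs.
    min_identity is a hypothetical cutoff; below it the call is 'unknown',
    which is how a candidate for a new substrate would be flagged.
    """
    best_substrate, best_score = "unknown", -1.0
    for ref_seq, substrate in references:
        score = identity(query, ref_seq)
        if score > best_score:
            best_substrate, best_score = substrate, score
    if best_score < min_identity:
        return "unknown", best_score
    return best_substrate, best_score


# Toy reference set; these strings are made up and are not real A domain codes.
toy_references = [("ACDEFGHI", "Ser"), ("KLMNPQRS", "Lys")]

print(predict_substrate("ACDEFGHK", toy_references))
print(predict_substrate("WWWWWWWW", toy_references))
```

      In this framing, restricting the comparison to a chosen subset of positions is one way phylogenetic background could be down-weighted, although how the authors select features is not specified here.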

      Additionally, in contradiction to the authors' claims, the method's applicability seems constrained to well-known and widely distributed gene clusters. The absence of predictions for new amino acids raises concerns about its generalizability to NRPS beyond the studied cases.

      We thank the reviewers for this comment. We acknowledge that our method cannot directly predict new amino acids. Nevertheless, for several reasons we believe that our approach is not constrained and can be widely applied in the future.

      First, our method can identify A domains that select new unknown amino acid substrates. In fact, three of the four unresolved cases in our experimental verification analysis (Fig. 3d) represent new amino acids. Obviously, experimental verification is required to characterize the unknown substrate. Once verified, the new A domains and their substrates can expand the reference dataset, allowing targeted improvement of our phylogeny-focused prediction technique. We now discuss this aspect in lines 634-645.

      Second, although the overall substrate diversity in NRPS is high across the microbial kingdom, our analysis suggests that the number of amino acids used for a specific group of secondary metabolites quickly reaches a saturation point. The discovery rate of new amino acids was 1.7% for our experimental Pseudomonas data set (Fig. 3d). The discovery rate of new amino acids was even 0.0% for the Burkholderiales data set. This suggests that as the database expands, the discovery rate of novel amino acid substrates is expected to drop rapidly.

      Third, we acknowledge that the inability to predict the substrates of unknown domains is a common limitation among all knowledge-guided learning algorithms, including ours. However, we have made significant improvements in prediction accuracy. As the database grows, we expect the rate of unknown substrates to decrease, and the prediction accuracy to increase.

      The manuscript lacks clarity on how the alignment of structural genes operates when dealing with multiple NRPS gene clusters on different genome contigs. How would the alignment of each BGC work?

      We thank the reviewers for this comment. The pyoverdine molecules consist of a conserved fluorescent chromophore (Flu) and a peptide chain (Pep), both synthesized by NRPS enzymes. In most instances (over 90%), Flu and Pep are produced by two separate biosynthetic gene clusters (BGCs). In these cases, we merge the two BGCs by positioning Flu at the head and Pep at the tail. For the remaining less than 10%, there are two scenarios: 1. Flu and Pep are located on the same BGC, which eliminates any issues with BGC alignment. 2. In very rare cases, Flu and Pep are synthesized by three BGCs. Here, Flu is still synthesized by one BGC at the head, while Pep is produced by two BGCs. We put the BGC containing the Thioesterase (TE) domain as the tail and the BGC not containing the TE domain in the middle.

      (see lines 165-169).
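
      The ordering rule described above (Flu at the head, the TE-containing BGC at the tail, any remaining Pep BGC in the middle) can be sketched as follows. This is only an illustration under stated assumptions: the `BGC` record, its field names, and the function name are hypothetical and are not taken from the authors' pipeline.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BGC:
    name: str
    has_chromophore: bool  # True if the cluster encodes the Flu part
    has_te_domain: bool    # True if the cluster contains a thioesterase domain


def order_pyoverdine_bgcs(bgcs: List[BGC]) -> List[BGC]:
    """Order pyoverdine BGCs as head (Flu), middle (Pep, no TE), tail (Pep, TE)."""
    if len(bgcs) == 1:
        # Flu and Pep sit on the same cluster: nothing to merge.
        return list(bgcs)
    head = [b for b in bgcs if b.has_chromophore]
    if len(head) != 1:
        raise ValueError("expected exactly one chromophore-encoding BGC")
    rest = [b for b in bgcs if not b.has_chromophore]
    # sorted() is stable; False sorts before True, so the TE cluster goes last.
    rest = sorted(rest, key=lambda b: b.has_te_domain)
    return head + rest


clusters = [
    BGC("pep_te", has_chromophore=False, has_te_domain=True),
    BGC("flu", has_chromophore=True, has_te_domain=False),
    BGC("pep_a", has_chromophore=False, has_te_domain=False),
]
ordered_names = [b.name for b in order_pyoverdine_bgcs(clusters)]
```

      The same function covers the common two-BGC case, where the single Pep cluster simply follows the Flu cluster.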

      Another critical concern is that a main challenge in NRPS structure prediction is not the backbone prediction but rather the prediction of tailoring reactions, which is not addressed in the manuscript at all, and this limitation extensively restricts the applicability of the method.

      While we thank the reviewer for this comment, we only partly agree with it. Peptide backbone predictions are still a significant challenge. This challenge is clearly visible in our new analysis comparing prediction accuracies of different pipelines, such as antiSMASH5, PRISM4, AdenPredictor, SeMPI2, NP.searcher, and Sandpuma. Unresolved and wrong substrate predictions are still common, highlighting the importance of our contribution in developing a new approach with improved accuracy.

      However, we agree with the reviewer that our current algorithm does not predict tailoring reactions (now discussed on lines 680-685). Although tailoring reactions are important for predicting the final NRPS product structure, none of the other existing pipelines address this issue either, and it remains a challenge for future work. For our study, it is important to note that the specificity of pyoverdines is primarily determined by the backbone composition, whereas tailoring reactions seem to play a minor role.

      The manuscript presents a potentially highly useful bioinformatic pipeline for pyoverdine structure prediction, showcasing a commendable exploration of siderophore diversity. However, some of the claims made remain unsubstantiated. Overall, while the study holds promise, further validation and refinement are required to fulfill its potential impact on the field of bioinformatic structure prediction.

      Thank you for the supportive and encouraging words. We deeply appreciate your constructive comments and suggestions. 

      Reviewer #2 (Public Review):

      Pyoverdines, siderophores produced by many Pseudomonads, are one of the most diverse groups of specialized metabolites and are frequently used as model systems. Thousands of Pseudomonas genomes are available, but large-scale analyses of pyoverdines are hampered by the biosynthetic gene clusters (BGCs) being spread across multiple genomic loci and existing tools' inability to accurately predict amino acid substrates of the biosynthetic adenylation (A) domains. The authors present a bioinformatics pipeline that identifies pyoverdine BGCs and predicts the A domain substrates with high accuracy. They tackled a second challenging problem by developing an algorithm to differentiate between outer membrane receptor selectivity for pyoverdines versus other siderophores and substrates. The authors applied their dataset to thousands of Pseudomonas strains, producing the first comprehensive overview of pyoverdines and their receptors and predicting many new structural variants.

      The A domain substrate prediction is impressive, including the correction of entries in the MIBiG database. Their high accuracy came from a relatively small training dataset of A domains from 13 pyoverdine BGCs. The authors acknowledge that this small dataset does not include all substrates, and correctly point out that new sequence/structure pairs can be added to the training set to refine the prediction algorithm. 

      The authors could have been more comprehensive in finding their training set data. For instance, the authors claim that histidine "had not been previously documented in pyoverdines", but the sequenced strain P. entomophila L48, incorporates His (10.1007/s10534-009-9247-y). 

      Thank you for highlighting this issue. We agree that stating histidine has not been reported before in pyoverdine was incorrect. We have reviewed the full text and made the necessary corrections.

      The primary reason for excluding the sequenced strains P. syringae 1448a (10.1186/1471-2180-11-218) and P. entomophila L48 (10.1007/s10534-009-9247-y) from the training set is that the pyoverdine structures of these strains were not determined solely through experimental methods. In these works, the pyoverdine structures were predicted based on the synthetic gene sequence using bioinformatic analysis, followed by structural analysis experiments based on this predicted structure. We found that this pre-prediction has probably introduced biases into downstream analyses. Specifically, in the case of Pseudomonas entomophila L48, we discovered inaccuracies in the annotation of certain domains (see figures below). For example, the third A domain of the peptide chain in P. entomophila L48 pyoverdine was initially annotated with Dab specificity. However, upon closer examination, it appears to differ significantly from other Dab references (top) or Dab from our experimentally validated (right) domains (left panel in the figure below). By analyzing the interface (I) domain (10.1073/pnas.1903161116) in its predicted site, we suggested that it should actually recognize OHHis. The OHAsp domain of P. entomophila L48 reported in the paper is actually close in sequence similarity to the OHAsp domain (left panel in the figure below), while the Ala domain reported is more similar to the Ser domain (right panel in the figure below). For these reasons, we did not include this supervised pyoverdine structure analysis strain in the training set data.

      Author response image 1.

      The workflow cannot differentiate between different variants of Asp and OHOrn, and it's not clear if this is a limitation of the workflow, the training data, or both. 

      Thanks for pointing this out. It is generally challenging to differentiate between variants of the same amino acid (for all the algorithms existing to date). In this sense, it is a limitation of our workflow, but also of all others. Nonetheless, we wish to stress that we observed feature sequence divergence (using the A motif4-5 region), which helped us to separate some (but not all) of the Asp and Orn variants. For example, separations between Asp variants are distinct (left panel in the figure below). To be on the conservative side, we only differentiated between OHAsp and Asp for our predictions, but differentiation between DOHAsp and OHAsp would also be possible. In the case of Orn variants, there was a clear separation between Orn and the OHOrn variants (right panel). In contrast, it was difficult to differentiate between the subgroups of OHOrn variants. We believe that no A domain prediction tool will be able to solve this issue. Instead, it would be important to include information on substrate-modifying enzymes in future approaches.

      Author response image 2.

      The prediction workflow holds up well in Burkholderiales A domains, however, they fail to mention in the main text that they achieved these numbers by adding more A domains to their training set.

      We thank the reviewers for this comment. We apologize for not having mentioned the training data set in the main text, although we described it in detail in the methods section (lines 714-732). We now provide more details on the analysis procedure in the main text (lines 307-313). It is important to note that we did not add more A domains to the training data set but built up a new independent data set for Burkholderiales. The aim was to mirror the analysis we performed for pyoverdines with a completely new data set, featuring 124 A domains for training and 178 A domains as a test set.

      To validate their predictions, they elucidated structures of several new pyoverdines, and their predictions performed well. However, the authors did not include their MS/MS data, making it impossible to validate their structures. In general, the biggest limitation of the submitted manuscript is the near-empty methods section, which does not include any experimental details for the 20 strains or details of the annotation pipeline (such as "Phydist" and "Syndist"). The source code also does not contain the requisite information to replicate the results or re-use the pipeline, such as the antiSMASH version and required flags. That said, skimming through the source code and data (kindly provided upon request) suggests that the workflow itself is sound and a clear improvement over existing tools for pyoverdine BGC annotation.

      Thank you for highlighting these issues. We agree that the methods section is short. This is because the entire paper is a step-by-step methodological introduction to our pipeline. We have now carefully revised the main text to add the information requested by the reviewer. Moreover, we have included a supplementary file with the MS/MS data of the experimentally analyzed pyoverdine structures. Finally, we include a link to a one-click online notebook that can be used to replicate the annotation and substrate prediction results, together with a more detailed explanation of the code. See: https://drive.google.com/drive/folders/1JsfyPUGDTFo8BDDZk8JLSvKry8emzMhr?usp=drive_link

      Predicting outer membrane receptor specificity is likewise a challenging problem and the authors have made a promising achievement by finding specific gene regions that differentiate the pyoverdine receptor FpvA from FpvB and other receptor families. Their predictions were not tested experimentally, but the finding that only predicted FpvA receptors were proximate to the biosynthesis genes lends credence to the predictive power of the workflow. The authors find predicted pyoverdine receptors across an impressive 468 genera, an exciting finding for expanding the role of pyoverdines as public goods beyond Pseudomonas. However, whether or not these receptors can recognize pyoverdines (and if so, which structures!) remains to be investigated.

      Thank you for the supportive and encouraging words. The bioinformatic analysis and experimental testing of pyoverdine-receptor matching is complicated and it is not part of this paper. We treated it in a separate manuscript in which we developed an experimentally verified co-evolution algorithm that matches pyoverdines to receptors. With this algorithm, we can identify self-receptors (i.e. receptors used to take up the self-produced pyoverdine), and therefore establish pyoverdine sharing and interaction networks across strains in communities.

      Please see DOI:10.1101/2023.11.05.565711 for details.

      In all, the authors have assembled a rich dataset that will enable large-scale comparative genomic analyses. This dataset could be used by a variety of researchers, including those studying natural product evolution, public good eco/evo dynamics, and NRPS engineering.

      Thank you for the supportive and encouraging words. We are grateful for the reviewers’ instructive suggestions and comments.

      Reviewer #3 (Public Review):

      Summary:

      Secondary metabolites are produced by numerous microorganisms and have important ecological functions. A major problem is that neither the function of a secondary metabolite enzyme nor the resulting metabolite can be precisely predicted from gene sequence data.

      In the current paper, the authors addressed this highly relevant question.

      The authors developed a bioinformatic pipeline to reconstruct the complete secondary metabolism pathway of pyoverdines, a class of iron-scavenging siderophores produced by Pseudomonas spp. These secondary metabolites are biosynthesized by a series of nonribosomal peptide synthetases and require a specific receptor (FpvA) for uptake. The authors combined knowledge-guided learning with phylogeny-based methods to predict with high accuracy encoding NRPSs, substrate specificity of A domains, pyoverdine derivatives, and receptors. After validation, the authors tested their pipeline with sequence data from 1664 phylogenetically distinct Pseudomonas strains and were able to determine 18,292 enzymatic A domains involved in pyoverdine synthesis, reliably predicted 97.8% of their substrates, identified 188 different pyoverdine molecule structures and 4547 FpvA receptor variants belonging to 94 distinct groups. All the results and predictions were clearly superior to predictions that are based on antiSMASH. Novel pyoverdine structures were elucidated experimentally by UHPLC-HR-MS/MS.

      To assess the extendibility of the pipeline, the authors chose Burkholderiales as a test case which led to the results that the pipeline consistently maintains high prediction accuracy within Burkholderiales of 83% which was higher than for antiSMASH (67%).

      Together, the authors concluded that supervised learning based on a few known compounds produced by species from the same genus probably outperforms generalized prediction algorithms trained on many products from a diverse set of microbes for NRPS substrate predictions. As a result, they also show that both pyoverdine and receptor diversity have been vastly underestimated.

      Strengths:

      The authors developed a very useful bioinformatic pipeline with high accuracy for secondary metabolites, at least for pyoverdines. The pipelines have several advantages compared to existing pipelines like the extensively used antiSMASH program, e.g. it can be applied to draft genomes, shows reduced erroneous gene predictions, etc. The accuracy was impressively demonstrated by the discovery of novel pyoverdines whose structures were experimentally substantiated by UHPLC-HR-MS/MS.

      The manuscript is very well written, and the data and the description of the generation of pipelines are easy to follow.

      Weaknesses:

      The only major comment I have is the uncertainty of whether the pipeline can be applied to more complex non-ribosomal peptides. In the current study, the authors only applied their pipeline to a very narrow field, i.e., pyoverdines of Pseudomonas and Burkholderia strains.

      Thanks for your positive and encouraging comment. Regarding your only major comment, we think that the design concept of our pipeline has the potential to be applied to more complex non-ribosomal peptides. Currently, our method is tailored to accurately predict the structural composition of the Pseudomonas siderophore pyoverdine (see also response 3). A key point emphasized in our article is the importance of considering phylogeny in developing substrate prediction algorithms for A domains. Currently, the main challenge in advancing these algorithms is the limited availability of data on A domains and their corresponding substrates. However, with the future accumulation of more reference data, we are confident that the design principles of our method will enable precise predictions of the structural compositions of all products synthesized by non-ribosomal peptide synthetases (see our discussions in lines 634-645).

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      I believe that the manuscript would benefit from focusing solely on the task of improving pyoverdine predictions. This aspect alone is significant, and robustly supporting this claim would strengthen the manuscript. The diversity analysis provided is valuable and would undoubtedly benefit the scientific community. However, additional systematic comparisons with other methods are necessary. Furthermore, clarification of certain terms, such as 'feature-based' (e.g., whether it refers to NRPS domains or CDS), would enhance clarity.

      Thank you for the supportive and encouraging words. We followed the reviewer’s suggestion and now provide the requested method comparison, see also response 2 for details. Furthermore, we have carefully checked the main text to clarify terms whenever needed. Specifically, we now define the terms “feature sequence” and “feature sequence distance” in lines 227-229.  

      Additionally, several minor points could be improved upon:

      In line 85, clarification is needed on how pyoverdine genes were identified.

      Thank you for your thorough review. In the introduction section, we provided a brief overview of our work, while the detailed methodology is outlined in the results section on lines 160-174.

      In line 382, it would be helpful to know the source of the sequences.

      We agree and have now carefully revised the manuscript following your suggestions (lines 403-405).

      Line 392 could be explained more clearly. Does it mean that the authors used an hmm search to search pHMMs against each reference sequence?

      Thanks for your comment. Yes, we used an hmm search to search pHMMs against each reference sequence. We have now revised the manuscript to improve explanations (lines 413-418).

      Reviewer #2 (Recommendations For The Authors):

      The authors state they "elucidated the chemical structure of the 20 pyoverdines using culture-based methods combined with UHPLC-HR-MS/MS", so I was alarmed to see that KR and LB already published several of those structures in the cited paper. I hope that this "double dipping" will be fixed in a revision process.

      Thank you for pointing this out. We agree that we have not explained clearly enough what steps were conducted in this study and which data were used from a previous paper (https://doi.org/10.1007/s00216-022-03907-w). The genomes of the 20 strains used for the verification analysis (Fig. 3d) were sequenced as part of this study (access code now provided). 14 out of the 20 pyoverdine structures were elucidated with UHPLC-HR-MS/MS in this study. For 6 out of the 20 pyoverdines, we had structural information already at hand from the previous paper. We have now clarified these details in our manuscript (lines 276-280). 

      Thank you for providing the source code and data, and I hope that the final non-redundant dataset will be uploaded to Zenodo or another repository. Please deposit the 20 newly sequenced genomes to GenBank or another public repository. Please also show the UHPLC-HR-MS/MS data, preferably in the form of raw data uploaded to GNPS.

      We have followed the reviewer’s advice and deposited our data:

      - The sequences of the 20 newly sequenced strains are available on ENA accession PRJEB76792.

      - The MS/MS plots of the 14 newly analyzed pyoverdines are shown in the Supplementary Materials.

      - We provide a one-click online notebook to allow readers to replicate the pyoverdine cluster annotation and substrate prediction of the 20 experimentally analyzed strains.

      I suggest adding "at least" or a similar qualifier when the 73 variants are mentioned unless the literature search was truly exhaustive. What were the criteria for inclusion of the 13 strains in Table S2? For instance, sequenced strains P. syringae 1448a (10.1186/1471-2180-11-218) and P. entomophila L48 (10.1007/s10534-009-9247-y) were not included.

      Thank you for your comment. We have now carefully revised the manuscript following your suggestions (lines 291-295). Regarding the criteria for including the 13 strains in Table S2, we aimed to select strains with high credibility for inclusion in the training set data. The primary reason for excluding the two strains from the training set is that their siderophore structures were analyzed through supervised experiments. We wanted to avoid any form of bias that bioinformatic pre-predictions could introduce to downstream analyses (see Response 13 for details).

      OHAsp in pyoverdines has been reported to arise from hydroxylation of Asp after it's already been activated by the A domain (10.1073/pnas.1903161116). Was there a clear difference between A domains that lead to Asp and OHAsp? Conversely, acetylation and formylation of OHOrn occur before adenylation. Can your workflow be used to differentiate cOHOrn, fOHOrn, and AcOHOrn, which are currently difficult to predict through genome mining?

      Thank you for these considerations. We treated these aspects in our response 8.  

      Throughout, define non-proteinogenic AA substrate abbreviations (ex: Rsc, Dab).

      Revised as per suggestion (lines 329-333).

      Additional line comments:

      189: Mention PhyloPhlAn in the main text.

      Revised as per suggestion (line 189).

      191: Define these filtering/selection criteria.

      Thanks for your comment, we have added the criteria in the main text (line 196 and line 198). 

      309, 620: An A domain presumably loading histidine is present in sequenced strain P. entomophila L48 (10.1007/s10534-009-9247-y). Please also clarify that Val has previously been seen in a pyoverdine (it is in Table S1) albeit not sequenced.

      We have clarified these aspects as per suggestion (lines 314-315 and line 630).

      310: The pipeline can "highlight" new substrates, but not identify them.

      Revised as per suggestion (line 295).

      354: Please clarify "13 amino acid substrates form the core of all the 188 pyoverdine structures", considering that 279 A domain substrates couldn't be predicted.

      Thanks for your comments. We have now clarified “our analysis found that 13 amino acids form the main structural substrates of all the 188 pyoverdine structures.” (lines 360-363)

      630: "discovered" implies that there is experimental evidence. I suggest something like "here we predicted 151 putatively new variants".

      Revised as per suggestion (line 648).

      Reviewer #3 (Recommendations For The Authors):

      Weakness:

      The only major comment I have is the uncertainty of whether the pipeline can be applied to more complex non-ribosomal peptides. In the current study, the authors only applied their pipeline to a very narrow field, i.e., pyoverdines of Pseudomonas and Burkholderia strains.

      Thanks for your comment. Please see our Responses 3+13 above, where we treat this concern in detail. Moreover, we discussed the possibility of extension to other groups of secondary metabolites in our discussion. We believe that we deliver a balanced view on the applicability of our approach and the next steps to be taken.  

      Please comment on this aspect.

      Minor:

      (1)  When you speak about "synthesis" it is rather biosynthesis. Synthesis is chemical synthesis.

      Please replace all instances of the word synthesis with biosynthesis.

      Revised as per suggestion.

      (2)  Line 188: synthetase is rather synthetases

      Revised as per suggestion (line 191).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Reviewer):

      It is not clear from the analysis presented in the paper how persistent those environmentally induced changes are: do they remain with the bats until the end of their lives?

      Currently, the long-term effects of enrichment on the bats remain uncertain. Preliminary results suggest that these differences may persist throughout the bats’ lifetimes; however, further data analysis is ongoing to determine the extent of these effects. We now also address this point in the Discussion of the manuscript.

      Reviewer #2 (Public Reviewer):

      (1) Assessing personality metrics and the indoor paradigm: While I applaud this effort and think the metrics used are justified, I see a few issues in the results as they are currently presented:

      (a) [Major] I am somewhat concerned that here, the foraging box paradigm is being used for two somewhat conflicting purposes: (1) assessing innate personality and (2) measuring changes in personality as a result of experience. If the indoor foraging task is indeed meant to measure and reflect both at the same time, then perhaps this can be made more explicit throughout the manuscript. In this circumstance, I think the authors could place more emphasis on the fact that the task, at later trials/measurements, begins to take on the character of a "composite" measure of personality and experience.

      Personality traits should generally be stable over time, but personality can also somewhat change with experience. We used the foraging box to assess individual personality, but we also examined the assumption that what we are measuring is a proxy of personality and hence is stable over time. We now clarify this in the manuscript. 

      (b) [Major] Although you only refer to results obtained in trials 1 and 2 when trying to estimate "innate personality" effects, I am a little worried that the paradigm used to measure personality, i.e. the stable components of behavior, is itself affected by other factors such as age (in the case of activity, Fig. 1C3, S1C1-2), the environment (see data re trial 3), and experience outdoors (see data re trials 4/5).

      We found that boldness was the most consistent trait, persisting from trial 1 to trial 5, i.e., 144 days apart on average. We thus also used boldness as the primary parameter for assessing the effects of personality on outdoor behavior. While we evaluated other traits for completeness, boldness was the only one that consistently met the criteria for personality, which is why we focused on it in our analyses. The other traits, which were not stable over time, could be used to assess the effects of experience on behavior.

      Ideally, a study that aims to disentangle the role of predisposition from early-life experience would have a metric for predisposition that is relatively unchanging for individuals, which can stand as a baseline against a separate metric that reflects behavioral differences accumulated as a result of experience.

      I would find it more convincing that the foraging box paradigm can be used to measure personality if it could be shown that young bats' behavior was consistent across retests in the box paradigm prior to any environmental exposure across many baseline trials (i.e. more than 2), and that these "initial settings" were constant for individuals. I think it would be important to show that personality is consistent across baseline trials 1 and 2. This could be done, for example, by reproducing the plots in Fig. 1C1-3 while plotting trial 1 against trial 2. (I would note here that if a significant, positive correlation were to be found (as I would expect) between the measures across trial 1 and 2, it is likely that we would see the "habituation effect" the authors refer to expressed as a steep positive slope on the correlation line (indicating that bold individuals on trial 1 are much bolder on trial 2).)

      We agree and thus used boldness, which was found to be stable over five trials (three of which were without external experience). We note that although boldness, as we measured it, increased over time, the differences between individuals remained similar, which is what is expected from personality traits measured in the same paradigm several times (after the animal acquires experience).

      (c) Related to the previous point, it was not clear to me why the data from trial 2 (the second baseline trial) was not presented in the main body of the paper, and only data from trial 1 was used as a baseline.

      We added a main figure showing the correlation between the two baseline trials.

      In the supplementary figure and table, you show that the bats tended to exhibit more boldness and exploratory behavior, but fewer actions, in trial 2 as compared with trial 1. You explain that this may be due to habituation to the experimental setup, however, the precise motivation for excluding data from trial 2 from the primary analyses is not stated. I would strongly encourage the authors to include a comparison of the data between the baseline trials in their primary analysis (see above), combine the information from these trials to form a composite baseline against which further analyses are performed, or further justify the exclusion of data as a baseline.

      We had no intention of excluding data from baseline 2. As we have shown several times before (e.g., Harten, 2021), bats’ boldness as we measure it in the box experiment increases over sessions performed close together in time. This means that boldness in trial 2 was higher than in trials 1 and 3, which made the data less suitable for a linear model. Moreover, our measurement of boldness is capped (with a maximum of 1), again making it less suitable for a linear model. However, following the reviewer’s question, we have now run all analyses with the data from trial 2 included: not only did the results remain the same, but some of the models also fit better (based on the AIC criterion). We added this information to the revised manuscript.

      (2) Comparison of indoor behavioral measures and outdoor behavioral measures Regarding the final point in the results, correlation between indoor personality on Trial 4 and outdoor foraging behavior: It is not entirely clear to me what is being tested (neither the details of the tests nor the data or a figure are plotted). Given some of the strong trends in the data - namely, (1) how strongly early environment seems to affect outdoor behavior, (2) how strongly outdoor experience affects boldness, measured on indoor behavior (Fig. 1D) - I am not convinced that there is no relationship, as is stated here, between indoor and outdoor behavior. If this conclusion is made purely on the basis of a p-value, I would suggest revisiting this analysis.

      We agree that the relationship between indoor personality measures and outdoor foraging behavior is of great interest and had expected to find some correspondence between the two. To test this, we conducted multiple GLM analyses using the different indoor behavioral traits as predictors of outdoor behaviors. These analyses did not reveal any significant correlations. We also performed a separate analysis using PC1 (derived from the indoor behavioral variables) as a predictor, and again found no significant associations with outdoor behavior.

      We were indeed surprised by this outcome. It is possible that the behavioral traits we assessed indoors (boldness, exploration, and activity) do not fully capture the dimensions of behavior that are most relevant to foraging in the wild. For example, traits such as neophobia or decision-making under risk, which we did not assess directly, may have had stronger predictive value for outdoor behavior. We now highlight this point more clearly in the Discussion and acknowledge the possibility that alternative or additional personality traits might have revealed meaningful relationships.

      (3) Use of statistics/points regarding the generalized linear models While I think the implementation of the GLMM models is correct, I am not certain that the interpretation of the GLMM results is entirely correct for cases where multivariate regression has been performed (Tables 4s and S1, and possibly Table 3). (You do not present the exact equation they used for each model (this would be a helpful addition to the methods), therefore it is somewhat difficult to evaluate if the following critique properly applies, however...)

      The "estimate" for a fixed effect in a regression table gives the difference in the outcome variable for a 1 unit increase in the predictor variable (in the case of numeric predictors) or for each successive "level" or treatment (in the case of categorical variables), compared to the baseline, the intercept, which reflects the value of the outcome variable given by the combination of the first value/level of all predictors. Therefore, for example, in Table 4a - Time spend outside: the estimate for Bat sex: male indicates (I believe) the difference in time spent outside for an enriched male vs. an enriched female, not, as the authors seem to aim to explain, the effect of sex overall. Note that the interpretation of the first entry, Environmental condition: impoverished, is correct. I refer the authors to the section "Multiple treatments and interactions" on p. 11 of this guide to evaluating contrasts in G/LMMS: https://bbolker.github.io/mixedmodelsmisc/notes/contrasts.pdf

      We are not certain we fully understand the comment; however, if our understanding is correct, we respectfully disagree. A GLM analysis without interaction terms, as conducted in our study, functions as a multiple linear regression, wherein each factor's estimate reflects its individual effect on the dependent variable. For example, in the case of sex, it examines the effect of sex on the time spent outside independently of enrichment. An interaction term would be needed to test sex*enrichment. We have added the models’ formulas, and we hope this clarifies our approach.
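
      The distinction at issue in this exchange can be illustrated with a minimal sketch. It assumes a hypothetical, balanced 2x2 toy design (enrichment x sex, one value per cell) with invented numbers, and fits it by ordinary least squares using small helper functions written for this illustration; it is not the authors' MATLAB model or data. It contrasts the sex coefficient of an additive model with that of a model that includes an interaction term.

```python
def solve(a, b):
    """Solve the square linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical cell values: (impoverished, male, time spent outside).
# Males exceed females by 4 in the enriched group and by 0 in the impoverished group.
cells = [(0, 0, 10.0), (0, 1, 14.0), (1, 0, 6.0), (1, 1, 6.0)]
y = [value for _, _, value in cells]

# Additive model: intercept + impoverished + male (no interaction term).
X_additive = [[1, imp, male] for imp, male, _ in cells]
# Interaction model: adds the product term impoverished * male.
X_interaction = [[1, imp, male, imp * male] for imp, male, _ in cells]

additive = ols(X_additive, y)
interaction = ols(X_interaction, y)

print("sex coefficient, additive model:   ", round(additive[2], 6))
print("sex coefficient, interaction model:", round(interaction[2], 6))
```

      In this balanced toy design, the additive model's sex coefficient is the sex difference averaged over both enrichment conditions, whereas in the interaction model the same coefficient describes the sex difference in the baseline (enriched) group only. With unbalanced group sizes, the additive coefficient would be a weighted rather than a simple average.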

      Reviewer #1 (Recommendations for the authors):

      I would recommend the following:

      (1) As video tracking and behavioral analysis software is widespread, it would be great to see this applied to the indoor bat behavior to answer questions such as: how do the bat's velocity, heading, or acceleration correlate with the behavioral measures of boldness, activity, or exploration? In the same vein, can one infer boldness, activity, or exploration from measured bat velocity or other parameters? I think this would make the indoor behavior more quantitative.

      In a tent of the size used in our study, bats’ flight behavior tends to be highly stereotypical: they typically perch on the wall, take off, circle the tent (sometimes multiple times), and then either land or not, and enter or not. Flight velocity is largely determined by individual maneuverability and the physical constraints of the space; thus, precise tracking is unlikely to provide further insight into boldness. In contrast, decision-making behaviors, such as whether to land or enter, more accurately reflect personality traits, as we have shown previously (Harten et al., 2018). Moreover, accurate 3D tracking in such an environment is possible but far from easy, owing to the many blind spots that result from the cameras being inside the 3D volume. Nonetheless, we quantified flight activity and assessed its correlation with the other behavioral axes. As it was highly correlated with general activity, we did not include it as an independent parameter in the main analysis. However, in response to the reviewer’s suggestion, we now present this analysis in the Supplementary Materials.

      (2) It is not clear whether the bats come from the same genetic background. They might, but this is not mentioned in the Methods under the experimental subjects.

      We have shown in the past that there are no familial relations in a randomly caught sample of bats in the colony where we usually work (Harten et al., 2018). The bats were caught in three unrelated wild colonies. The text referring to the table was clarified in the revised manuscript.

      (3) It will be great to include the author's thoughts about mechanisms underlying those environmentally induced changes in behavior in the discussion section along with how this will affect the bats' social foraging abilities. Another question that comes to mind is whether growing up with a large number of bats constitute an enriched environment in itself.

      We agree that this could count as an enrichment, and we thus ensured similar group sizes in both groups for this reason. We clarify this in the revised manuscript. 

      We have elaborated on the underlying mechanisms in the discussion, focusing on how they contribute to behavioral changes.

      Reviewer #2 (Recommendations for the authors):

      (1) Outdoor foraging behavior

      If I understand correctly, the data you display in Fig. 3A is only from the 2nd to 3rd weeks of exploration, i.e. just before the first post-exploration trial.

      What does the data look like for the second outdoor exploration data, i.e. before the final trial?

      Is there a specific reason why these measures were only computed on the GPS data from the 3rd week outside? If so, can this sampling of the data be motivated or briefly addressed (in the methods and wherever else necessary)?

      To allow a comparison between individuals, we had to restrict ourselves to a period for which we had data from many individuals (some disappeared later on).

      Following the reviewer's suggestion, we added a supplementary figure covering days 21-26.

      I would find it important and of great interest to see movement maps for more animals, as these give very rich information that is not entirely captured by the three proxies of outdoor activity.

      Are these four exemplary animals sampled from both seasons?

      Did you check to see if there were any overall differences in outdoor foraging behavior as a function of the season in which the bats were captured?

      Yes, the samples represent individuals from both tested years. This was clarified, and additional examples were included in a supplementary figure.

      Variable of time spent outdoors: You mention that you did not include the nights that the bat spent in the colony in these calculations. Did you also look to see if 'the number of nights when the bats left the colony' predicted the bat's earlier enrichment treatment? This could also be interesting to consider.

      In response to the reviewer’s comment, we conducted an additional analysis to test whether the proportion of nights each bat spent foraging outside the roost was predicted by its earlier environmental condition (enriched vs. impoverished). We also examined whether sex or age influenced this variable. This analysis showed no significant effect of environmental condition, sex, or age on the proportion of nights spent foraging outside the roost.

      [Following on point 3 in public review...]

      When wishing to discuss the effect/significance of predictors overall, it is common to present the modelling results as an analysis of variance table. See, for example, the two-way anova section (p. 182) in the book Practical Regression and ANOVA using R: https://cran.r-project.org/doc/contrib/Faraway-PRA.pdf

      I think the output of passing the model object to an "anova" yields the table that you may be looking for, where the variance accounted for by a predictor is given overall, and not just relative to the first level of all predictors. Naturally, this information can be used in combination with the information provided by the raw model output presented in the paper.

      I assume you have done this analysis in R, but am not sure, as the statistical software used is not mentioned. There are several packages in R that allow users to quickly plot the graphical interaction of the parameters they use in models, which aids in interpreting results. It would be good to check results of model fitting in this manner.

      Relatedly, I was unable to locate the data and code for this paper using the DOI provided. Neither searching the internet using the doi nor entering the doi on the Mendeley Data website returned the right results. I tried searching Mendeley Data using the senior author's last name, but the most recent entry does not appear to be from this paper. https://data.mendeley.com/datasets/fr48bmnhxj/1

      We thank the reviewer for the helpful comment. The analysis was indeed conducted in MATLAB, and this has now been clarified in the manuscript. We have also revised the result tables to improve clarity and included the exact formulas used for each model. Regarding the data availability, the reviewer is correct — the dataset had not yet been published at the time of submission. It is now available at the provided DOI link.

      ### Suggestions and questions for the present paper, grouped thematically:

      [Major] Expansion and development of results: I thought there were many interesting and suggestive points in this data that could be expanded upon. I mention some of these here. While the authors of course do not need to implement all of these suggestions, I think the paper would benefit from a more substantial presentation of this rich data set:

      (a) Individual differences as such are not emphasized in the paper so much, as the analyses, particularly those expressed as boxplots, are grouped. The scatter plots in Figure 1 give the richest insight into how individual behavior changes throughout the course of the experiment. I would advocate for the authors to show additional comparisons using such scatter plots (perhaps in the supplementary, if needed).

      We thank the reviewer and have added scatter plots to Figure 2.

      (b) In the second paragraph of the results, the authors introduce the concept of a pareto front and that of personality archetypes (lines 101-107). I found this very interesting, but these concepts were never reiterated upon later in the results or in the discussion. In fact, at many points, I found myself curious as to how the three indoor measures of personality might be combined to form a composite measure of personality (and likewise for outdoor measures). Have you tried to combine measures into a composite and tried to measure whether this composite metric provides any additional insight into these phenomena? For example, what if you mapped the starting position of each bat as a point in a three-dimensional space, given by the three personality measures, and then evaluated their trajectory through this space with measurements taken at later trials. Could innate personality be interpreted as the starting vector in this space (measured across the two baseline trials)? 

      Following the reviewer’s (justified) curiosity, we ran a PCA on the behavioral data from trials 1 and 5 and found a significant correlation between the individual scores on PC1. This component can be thought of as a measurement that takes both boldness and exploration into account (the weight of activity was very low). We added this information to the revised manuscript and also use this new behavioral parameter as a predisposition in the models (instead of exploration and activity).
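
      A composite score of this kind can be sketched as follows. This is a minimal illustration under stated assumptions: the trait values are invented, the two trials are pooled and z-scored before extracting PC1, and the first component is found by power iteration. The response does not specify these details, so the sketch should not be read as the authors' actual pipeline.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def standardize(rows):
    """Z-score each column (trait) across all rows."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
           for c, m in zip(cols, mus)]
    return [[(v - m) / s for v, m, s in zip(r, mus, sds)] for r in rows]


def first_component(z, iters=500):
    """Unit-length leading eigenvector of the covariance matrix (power iteration)."""
    k, n = len(z[0]), len(z)
    cov = [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(k)]
           for i in range(k)]
    v = [1.0] * k
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def pearson(a, b):
    ma, mb = mean(a), mean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den


# Hypothetical (boldness, exploration, activity) values for six bats.
trial_1 = [(0.2, 0.3, 1.5), (0.5, 0.4, 0.9), (0.8, 0.9, 1.2),
           (0.1, 0.2, 1.0), (0.6, 0.7, 1.4), (0.9, 0.8, 0.8)]
trial_5 = [(0.3, 0.4, 0.7), (0.6, 0.6, 0.9), (0.9, 0.8, 0.6),
           (0.3, 0.2, 0.8), (0.7, 0.9, 0.5), (1.0, 0.9, 0.7)]

z = standardize(trial_1 + trial_5)      # pool both trials (an assumption)
loadings = first_component(z)           # weight of each trait on PC1
scores = [sum(l * v for l, v in zip(loadings, row)) for row in z]
pc1_trial_1, pc1_trial_5 = scores[:len(trial_1)], scores[len(trial_1):]

print("PC1 loadings:", [round(l, 3) for l in loadings])
print("PC1 correlation, trial 1 vs trial 5:",
      round(pearson(pc1_trial_1, pc1_trial_5), 3))
```

      The loadings indicate how strongly each trait contributes to the composite, and the per-bat PC1 scores from the first trial could then serve as a single predisposition predictor in a model.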

      Could environmental exposure be quantified as a warping of the trajectory through this space? Finally, could outdoor experience also be incorporated to evaluate how an individual arrives at its final measurement of personality combined with experience (trial 5)?

      The paper currently tries to explain outdoor behavior given personality, and not vice versa. While this is a very interesting suggestion, we feel that adding this analysis would make the premise of the paper less clear, and since the paper is already somewhat complex, we prefer to leave this analysis for a future study.

      Examining the 3D trajectories of the individuals through the personality space did not reveal any immediately clear pattern (triangles mark the first trial and colours depict the environmental treatment):

      Author response image 1.

      Related to this point: I think the strongest part of the paper is the result showing that bats exposed to enriched environments explore farther, more often, and over larger distances than bats that were raised in an impoverished environment.

      We completely agree and have tried to emphasize this further.

      (c) While these results of the outdoor GPS tracking are very clear, I wish that more information were extracted from the tracking data, which is incredibly rich and certainly can be used to derive many interesting parameters beyond those that the authors have shown here. Examples might include: distance travelled (as opposed to estimated km2 or farthest point), a metric of navigational ability (how much "dead reckoning" the animal engages in). I even wonder if the areas or landmarks visited by the enriched bats might be found to be more complex, challenging, or richer by some measure.

      This study was a first step, aiming to establish a connection between early exposure and outdoor foraging.

      We agree that there are many more analyses that can be done and, indeed, that those related to navigation capabilities are missing. We are still collecting data on these bats and hope to present a more advanced analysis with a time span of years.

      (d) Related to the above point: I find it very interesting that in 3 of the 4 bats for which you show exemplary movement data (Fig. 3, panels B and C), they appear to travel to the farthest distances and cover the most ground early on, and become more "conservative" in their flight paths on later evenings. This point is not explored in the discussion, nor related to earlier measurements.

      During the first months of exploration, bats will occasionally perform long exploratory flights in between bouts of shorter flights where they return to nearby familiar trees. This behavior can be seen in more detail in Harten et al., Science, 2020. We are currently quantifying this more carefully for another study.

      (e) Finally, my points about the possible strength of a composite measure of the three personality metrics is related to my concern about one of the conclusions, which is that innate personality does not have an effect on outdoor foraging behavior. I think the manner in which this was tested statistically is likely to bias the results against finding such a result given that personality metrics are used to predict outdoor behaviors in an individual manner (6 models in total, each examining a single comparison of predisposition to outdoor behavior), while both indoor personality metrics (Fig 1B) and outdoor behaviors appear to be correlated with each other (Table 5).

      Are there other analyses you have performed that are not presented in the paper and that have led you to conclude that there is no relationship here?

      We agree with the reviewer that our findings do not exclude an effect of innate personality on foraging, but only suggest no such effect for the parameters we measured. That said, we did expect to find an effect of boldness, because this parameter has been shown to differ markedly between groups (Harten et al., 2018) and to correlate with other behavioral parameters. We were therefore surprised to find no significant effects.

      Following the reviewer’s previous comment, we now also tested another predisposition parameter, the PC1 score, and found that it did not explain foraging either.

      (f) Personality measured before and after early environmental exposure (related to point (a) above): I find it interesting that the positive correlation in boldness between baseline and post-enrichment or baseline and post-release suggests that the individuals that were the most bold remained bold (and likewise for less adventurous individuals). The correlation for activity, too, still suggests that more active individuals early in life are likely to remain very active after enrichment, even accounting for the fact that activity is confounded with age.

      Perhaps you could place some emphasis on the fact that the initial variation between individuals also appears to be relatively stable over repeated trials. You might also consider measuring this directly (population variance over successive trials; relationship of population variance on indoor measures vs. outdoor measures...)

      Yes, this is a main point of interest. We emphasize it further in the revised manuscript.

      (g) Effect of indoor behavior following early experience on outdoor behavior: You evaluate the effect of predisposition (measured on baseline trial 1) and environmental condition on measures of outdoor activity (Table 4). I wonder if you also tried using indoor behavioral measures measured on the post-enrichment trial 3 to predict outdoor foraging behavior.

      Assuming that these measures are in fact reflecting a combination of predisposition and accumulated experience, then measurements at this closer time point may tell you how the combination of innate traits and early acquired experience affect behavior in the wild.

      We appreciate the reviewer’s insightful suggestion to test whether indoor behavior from post-enrichment Trial 3, reflecting both innate traits and experience, predicts outdoor foraging behavior. We conducted this analysis, but found that the boldness in Trial 3 did not significantly predict any of the outdoor activity measures.

      (2) [Minor] Age/development: While the authors discuss the effect of their manipulations on behavioral measures, they do not much discuss the effect of age.

      I think it would be important to include at some point a mention of the developmental stages of Rousettus, giving labels to certain age ranges, e.g. pup, juvenile, adult, and to provide more context about the stages at which bats were tested in the discussion. Presently, age is only really mentioned as an explanation for declining activity levels, but I wonder if it might also have an influence on boldness.

      It would also be very elegant, for figures where age is given in days, to additionally label them with these stages.

      All bats were juveniles during the trials (approximately 4 to 8 months old), so they could not be divided into distinct age groups. To assess the effect of age, it was included as a predictor (in days) in the GLM analysis.

      (3) [Major] Effect of early experience and outdoor experience on the indoor task: In the paragraph on lines 278-285, you argue that the effect of seeing early-enriched bats exhibit more boldness in trial 5 was likely due to post-sampling bias...

      I tend to disagree with this conclusion. I actually find this result both interesting and intuitive - that bats that were exposed to an enriched environment and have had experience in the wild, show much bolder activity on a familiar indoor foraging test (i.e. outside experience has made the animals bolder than before) (Fig 1, lines 159-161, Fig. S1). I did not notice this possibility mentioned in the discussion of the results.

      I also do not fully understand this argument. Could you please explain further?

      We accept the reviewer's comment and have updated the manuscript (lines 336-346) to explain the two hypotheses more clearly and to argue that it is difficult to tell them apart with the current data.

      [Minor] You also say that "this difference... can be seen in Figure 2 when examining only the bats that had remained until the last trial (Figure 2A2)." Do you mean supplementary Figure S1 A2? In fact, I am entirely unclear on what data is plotted in the supplementary Figure S1 and what differentiates the two columns of figures and the two models presented in the supplementary table. Did you plot data similar to that in Figure 2, with only bats that were present for all trials, but not show this data?

      There was a mistake: what was previously referred to as 2A2 is actually S2 A2.

      On the right side—only among the individuals with GPS data—the change is already evident at Baseline 2, where only the bolder individuals remain. If you have suggestions for a better analysis approach, we would be happy to hear them.

      ### Minor points

      General points regarding figures:

      For Figures 2 and 3A1-3 (as well as Fig. S1): Authors must show the raw data points over the box plots. It is very difficult to interpret the data and conclusions without being able to see the true distribution.

      Done

      For all figures showing grouped individual data, please annotate all panels or sets of boxplots with the number of bats whose data entered into each, as it is a little difficult to keep track of the changing sample sizes across experimental stages.

      In response to the reviewer’s request, we have updated all relevant figures to display individual data points within each boxplot, which makes it easier to track changes in sample size across experimental stages. While numerical annotations are not included on the figures, the exact number of bats contributing to each group is provided in the Methods section (Table 8), so this information is readily accessible to readers.

      Unless I've missed the reason behind differences in axis labelling across the figures, it seems that trials are not always referred to consistently. E.g. Fig. 1 labels say "Trial 1 (baseline)" and Fig. 2 labels say "Baseline 1 0 days." I'm not entirely sure if these correspond to exactly the same data. If so, perhaps the labels can be made uniform. I think the descriptive ones (Baseline 1, Post-enrichment...) may be more helpful to the reader than providing the trial number (Trial 1, etc....).

      Done

      Figure 1:

      Very good Fig. 1A and 1B.

      For panels C1-3 & D, I think it would make it easier for the reader if the personality measure labels were placed at the top of each panel, e.g. "Boldness (entrance proportion)". The double axis labels are not only harder to read, they are also redundant, as the personality measure label repeats on both axes.

      Done

      Panel C1: For the first panel in this sequence, I think it would be elegant to include an annotation in the figure that indicates what the datapoints lying on either side of the dashed line means, i.e. "bolder after enrichment treatment" in the upper left corner, and "bolder before enrichment treatment" in the bottom right corner.

      Panel C2: It appears as though many of the data points in this panel overlap, and it appears to me that the blue data points in particular are overlaid by the orange ones. I am guessing this happens because proportion values based on entrances to only 6 boxes end up giving a more "discrete" looking distribution. I wonder if you can find a way to allow all the data to be visible by, e.g., jittering the data slightly; if there is rounding being done to the proportions, perhaps don't round them so that minute differences will allow them to escape the overlap; or possibly split the panel by enrichment treatment.

      Caption for C1-3: it may be helpful to mention the correlation line color scheme: "enriched (blue lines), the impoverished (orange lines)". The caption also says positive correlations were found for "both environments together," but this correlation line is not shown. Perhaps mention "(not shown)" or show line. Please rephrase the sentence "Dashed line represents the Y=X line." for more transparency and clarity. I understand you mean an "equality" or "unity" line, but perhaps you can explicitly state the information that this line provides, something like e.g. "Dashed line indicates equal values measured on both trials."

      We added the line for reference, and the caption was corrected.

      Figure 3:

      Panels B1-C2: I would suggest giving these panels supertitles that indicate that B panels are enriched, C panels are impoverished, and that each panel is data from a different individual.

      The legend was revised to describe the figure more clearly.

      General points regarding tables:

      Please revisit tables for formatting and typos, particularly in Table 4. Please also revise table captions for clarity. E.g. "first exploration as predisposition" to "Exploration (Baseline 1)" or similar

      Done

      Supplementary Tables and Figure: these are missing captions and explanations.

      The missing parts were added and corrected.

      Points of clarification/style:

      It would seem to me more logical to present the results shown in Table 3 before those in Table 2, given that the primary in-lab manipulation is discussed with relation to Table 3, and the analysis in Table 2 is discussed rather as a limitation (though I believe this result can be expanded upon further, see above).

      For the activity metric, I would suggest showing this data as actions/hour instead of actions/minute. I think it is much more intuitive to consider, for example, that a bat makes 2 actions every hour, than that it makes 0.002 actions per minute.

      Done

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      The authors present a model for multisensory correlation detection that is based on the neurobiologically plausible Hassenstein Reichardt detector. It modifies their previously reported model (Parise & Ernst, 2016) in two ways: a bandpass (rather than lowpass) filter is initially applied and the filtered signals are then squared. The study shows that this model can account for synchrony judgement, temporal order judgement, etc in two new data sets (acquired in this study) and a range of previous data sets.

      Strengths:

      (1) The model goes beyond descriptive models such as cumulative Gaussians for TOJ and differences in cumulative Gaussians for SJ tasks by providing a mechanism that builds on the neurobiologically plausible Hassenstein-Reichardt detector.

      (2) This modified model can account for results from two new experiments that focus on the detection of correlated transients and frequency doubling. The model also accounts for several behavioural results from experiments including stochastic sequences of A/V events and sine wave modulations.

      Additional thoughts:

      (1) The model introduces two changes: bandpass filtering and squaring of the inputs. The authors emphasize that these changes allow the model to focus selectively on transient rather than sustained channels. But shouldn't the two changes be introduced separately? Transients may also be detected for signed signals.

      We updated the original model because our new psychophysical evidence demonstrates the fundamental role of unsigned transients for multisensory perception. While the original model received input from sustained unimodal channels (low-pass filters), the new version receives input from unsigned unimodal transient channels. Transient channels are normally modelled through bandpass filters (to remove the DC and high-frequency signal components) and squaring (to remove the sign). While these may appear as two separate changes in the model, they are, in fact, a single one: the substitution of sustained with unsigned transient channels (for a similar approach, see Stigliani et al. 2017, PNAS). Either change alone would not be sufficient to implement a transient channel that accounts for the present results.
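
      The logic of an unsigned transient channel can be sketched in a few lines. The band-pass stage below is a generic difference of two first-order low-pass filters, and the time constants (`tau_fast`, `tau_slow`) and the sampling step are hypothetical values chosen for illustration; this is an assumption-laden sketch, not the filter or the parameters used in the model described here.

```python
def low_pass(signal, tau, dt):
    """First-order low-pass filter, started at steady state for signal[0]."""
    alpha = dt / (tau + dt)
    y, out = signal[0], []
    for x in signal:
        y += alpha * (x - y)
        out.append(y)
    return out


def transient_channel(signal, tau_fast=0.02, tau_slow=0.1, dt=0.001):
    """Band-pass (fast minus slow low-pass), then squaring to drop the sign."""
    fast = low_pass(signal, tau_fast, dt)
    slow = low_pass(signal, tau_slow, dt)
    band = [f - s for f, s in zip(fast, slow)]
    return band, [b * b for b in band]


# Two step stimuli sampled at 1 kHz for 2 s: an onset and an offset at 0.5 s.
n, step_at = 2000, 500
onset = [0.0] * step_at + [1.0] * (n - step_at)
offset = [1.0] * step_at + [0.0] * (n - step_at)

band_on, energy_on = transient_channel(onset)
band_off, energy_off = transient_channel(offset)

# Largest difference between the two unsigned responses over time.
max_diff = max(abs(a - b) for a, b in zip(energy_on, energy_off))
print("peak signed response (onset, offset):", max(band_on), min(band_off))
print("max difference between unsigned responses:", max_diff)
```

      The band-pass stage alone responds to onsets and offsets with opposite signs and returns to zero during the sustained part of the stimulus. After squaring, the two responses are identical up to rounding error, so the channel signals that a change occurred without signalling its direction. Squaring a low-pass output instead would leave a sustained response, which is why neither step alone yields an unsigned transient channel.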

      That said, we were also concerned with introducing too many changes in the model at once. Indeed, we simply modelled the unimodal transient channels as a single band-pass filter followed by squaring. This is already a stripped-down version of the unsigned transient detectors proposed by Adelson and Bergen in their classic Motion Energy model. The original model consisted of two biphasic temporal filters 90 degrees out of phase (i.e., quadrature filters), whose output is later combined. While a simpler implementation of the transient channels was sufficient in the present study, the full model may be necessary for other classes of stimuli (including speech, Parise, 2024, bioRxiv). Therefore, for completeness, we now include in the Supplementary Information a formal description of the full model, and validate it by simulating our two novel psychophysical studies. See Supplementary Information “The quadrature MCD model” section and Supplementary Figure S8.

      (2) Because the model is applied only to rather simple artificial signals, it remains unclear to what extent it can account for AV correlation detection for naturalistic signals. In particular, speech appears to rely on correlation detection of signed signals. Can this modified model account for SJ or TOJ judgments for naturalistic signals?

      It can. In a recent series of studies we have demonstrated that a population of spatially-tuned MCD units can account for audiovisual correlation detection for naturalistic stimuli, including speech (e.g. the McGurk Illusion). Once again, unsigned transients were sufficient to replicate a variety of previous findings. We have now extended the discussion to cover this recent research: Parise, C. V. (2024). Spatiotemporal models for multisensory integration. bioRxiv, 2023-12.

      Even Nidiffer et al. (2018), which is explicitly modelled by the authors, reports a significant difference in performance for correlated and anti-correlated signals. This seems to disagree with the results of study 1 reported in the current paper and the model's predictions. How can these contradicting results be explained? If the brain detects correlation on signed and unsigned signals, is a more complex mechanism needed to arbitrate between those two?

      We believe the reviewer here refers to our Experiment 2 (where, like Nidiffer et al. (2018), we used periodic stimuli), not Experiment 1, which consists of step stimuli. We were also puzzled by the difference between our Experiment 2 and Nidiffer et al. (2018): we induced frequency doubling, whereas Nidiffer et al. did not. Based on quantitative simulations, we concluded that this difference could be attributed to the fact that while Nidiffer et al. included on each trial an intensity ramp in their periodic audiovisual stimuli, we did not. As a result, when considering the ramp (unlike in Nidiffer’s analyses), all audiovisual signals used by Nidiffer et al. were positively correlated (irrespective of frequency and phase offset), while our signals in Experiment 2 were sometimes correlated and other times not (depending on the phase offset). This important simulation is included in Supplementary Figure S7; we have also updated the text to better highlight the role of the pedestal in determining the direction of the correlation.
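
      The effect of a shared ramp on the sign of the correlation can be checked with a short worked example. The modulation depth, frequency, and ramp used below are hypothetical and are not the stimulus parameters of Nidiffer et al. (2018) or of Experiment 2; the sketch only illustrates the general point that a common trend can dominate the correlation between two signals whose periodic modulations are in antiphase.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den


# Hypothetical stimulus: 1 s sampled at 1 kHz, 5 Hz modulation of depth 0.2.
n, freq, depth = 1000, 5, 0.2
t = [i / n for i in range(n)]
audio_mod = [depth * math.sin(2 * math.pi * freq * x) for x in t]
visual_mod = [-m for m in audio_mod]          # 180-degree phase offset
ramp = t[:]                                   # linear intensity ramp shared by both

# Correlation of the periodic modulations alone, and with the ramp added.
r_without_ramp = pearson(audio_mod, visual_mod)
r_with_ramp = pearson([m + r for m, r in zip(audio_mod, ramp)],
                      [m + r for m, r in zip(visual_mod, ramp)])

print("correlation without ramp:", round(r_without_ramp, 3))
print("correlation with ramp:   ", round(r_with_ramp, 3))
```

      Without the ramp, the antiphase modulations are perfectly anti-correlated; with the ramp included, the correlation becomes positive because the variance of the shared ramp exceeds that of the modulation. Whether the sign flips in a given stimulus set depends on the size of the ramp relative to the modulation depth.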

      (3) The number of parameters seems quite comparable for the authors' model and descriptive models (e.g. PSF models). This is because time constants require refitting (at least for some experimental data sets) and the correlation values need to be passed through a response mode (i.e. probit function) to account for behavioural data. It remains unclear how the brain adjusts the time constants to different sensory signals.

      This is a deep question. For simplicity, here the temporal constants were fitted to the empirical psychometric functions. To avoid overfitting, whenever possible we fitted such parameters over some training datasets, while trying to predict others. However, in some cases, it was necessary to fit the temporal constants to specific datasets. This may suggest that the temporal tuning of those units is not crystallised to some pre-defined values, but is adjusted based on recent perceptual history (e.g., the sequence of trials and stimuli participants are exposed to during the various experiments).

      For transparency, here we show how varying the tuning of the temporal constants of the filters affects the goodness of fit of our new psychophysical experiments (Supplementary Figure S8). As can be readily appreciated, the relative temporal tuning of the unimodal transient detectors was critical, though their absolute values could vary over a range of about 15 to over 100 ms. The tuning of the low-pass filters of the correlation detector (not shown here) displayed much lower temporal sensitivity over a range from 0.1 s to over 1 s.

      This simulation shows the impact of temporal tuning in our simulations, however, the question remains as to how such a tuning gets selected in the first place. An appealing explanation relies on natural scene statistics: units are temporally tuned to the most common audiovisual stimuli. Although our current empirical evidence does not allow us to quantitatively address this question, in previous simulations (see Parise & Ernst, 2016, Supplementary Figure 8), by analogy with visual motion adaptation, we show how the temporal constants of our model can dynamically adjust and adapt to recent perceptual history. We hope these new and previous simulations address the question about the nature of the temporal tuning of the MCD units.

      (4) Fujisaki and Nishida (2005, 2006) proposed mechanisms for AV correlation detection based on the Hassenstein-Reichardt motion detector (though not formalized as a computational model).

      This is correct: Fujisaki and Nishida (2005, 2007) also hypothesized that AV synchrony could be detected using a mechanism analogous to motion detection. Interestingly, however, they ruled out this hypothesis, as their “data do not support the existence of specialized low-level audio-visual synchrony detectors”. Yet, along with our previous work (Parise & Ernst, 2016, where we explicitly modelled the experiments of Fujisaki and Nishida), the present simulations quantitatively demonstrate that a low-level AV synchrony detector is instead sufficient to account for audiovisual synchrony perception and correlation detection. We now credit Fujisaki and Nishida in the modelling section for proposing that AV synchrony can be detected by a cross-correlator.

      Finally, we believe the reviewer is referring to the 2005 and 2007 studies of Fujisaki and Nishida (not 2006); here are the full references of the two articles we are referring to:

      Fujisaki, W., & Nishida, S. Y. (2005). Temporal frequency characteristics of synchrony–asynchrony discrimination of audio-visual signals. Experimental Brain Research, 166, 455-464.

      Fujisaki, W., & Nishida, S. Y. (2007). Feature-based processing of audio-visual synchrony perception revealed by random pulse trains. Vision Research, 47(8), 1075-1093.

      Reviewer #2 (Public Review):

      Summary:

      This is an interesting and well-written manuscript that seeks to detail the performance of two human psychophysical experiments designed to look at the relative contributions of transient and sustained components of a multisensory (i.e., audiovisual) stimulus to their integration. The work is framed within the context of a model previously developed by the authors and is now somewhat revised to better incorporate the experimental findings. The major takeaway from the paper is that transient signals carry the vast majority of the information related to the integration of auditory and visual cues, and that the Multisensory Correlation Detector (MCD) model not only captures the results of the current study but is also highly effective in capturing the results of prior studies focused on temporal and causal judgments.

      Strengths:

      Overall the experimental design is sound and the analyses are well performed. The extension of the MCD model to better capture transients makes a great deal of sense in the current context, and it is very nice to see the model applied to a variety of previous studies.

      Weaknesses:

      My one major issue with the paper revolves around its significance. In the context of a temporal task(s), is it in any way surprising that the important information is carried by stimulus transients? Stated a bit differently, isn't all of the important information needed to solve the task embedded in the temporal dimension? I think the authors need to better address this issue to punch up the significance of their work.

      In hindsight, it may appear unsurprising that transient signals carry most of the information for audiovisual integration. Yet, somewhat unexpectedly, this has never been investigated using perhaps the most diagnostic psychophysical tools for perceived crossmodal timing, namely temporal order and simultaneity judgments, along with carefully designed experiments with quantitative predictions for the effect of either channel. The fact that the results conform to intuitive expectations further supports the value of the present work: it empirically grounds what is intuitively expected. This offers solid psychophysical evidence that future work can build on. Importantly, developing a model that builds on our new results and uses the same parameters to predict a variety of classic experiments in the field further supports the current approach.

      If “significance” is intended as shaking previous intuitions or theories, then no: this is not a significant contribution. If instead, by significance we intend building a solid empirical and theoretical ground for future work, then we believe this study is not merely significant: it is foundational. We hope that this work's significance is better captured in our discussion.

      On a side note, there is an intriguing factor around transient vs. sustained channels: what matters is the amount of change, not the absolute stimulus intensity. Previous studies, for example, have suggested a positive cross modal mapping between auditory loudness and visual lightness or brightness [Odegaard et al., 2004]. This study, conversely, challenges this view and demonstrates that what matters for multisensory integration in time is not the intensity of a stimulus, but changes thereof.

      In a more minor comment, I think there also needs to be a bit more effort into articulating the biological plausibility/potential instantiations of this sustained versus transient dichotomy. As written, the paper suggests that these are different "channels" in sensory systems, when in reality many neurons (and neural circuits) carry both on the same lines.

      The reviewer is right, in our original manuscript we glossed over this aspect. We have now expanded the introduction to discuss their anatomical basis. However, we are not assuming any strict dichotomy between transient and sustained channels; rather, our results and simulations demonstrate that transient information is sufficient to account for audiovisual temporal integration.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) Related to point 2 of the public review, can the authors provide additional results showing that the model can also account for naturalistic signals and more complex stochastic signals?

      While working on this manuscript, we were also working in parallel on a project related to audiovisual integration of naturalistic signals. A pre-print is available online [Parise, 2024, bioRxiv], and the related study is now discussed in the conclusions.

      (2) As noted in the public review, Fujisaki and Nishida (2005, 2006) already proposed mechanisms for AV correlation detection based on the Hassenstein-Reichardt motion detector. Their work should be referenced and discussed.

      We have now acknowledged the contribution of Fujisaki and Nishida in the modelling section, when we first introduce the link between our model and the Hassenstein-Reichardt detectors.

      (3) Experimental parameters: Was the phase shift manipulated in blocks? If yes, what about temporal recalibration?

      To minimise the effect of temporal recalibration, the order of trials in our experiments was randomised. Nonetheless, we can directly assess potential short-term recalibration effects by plotting our psychophysical responses against both the current SOA, and that of the previous trials. The resulting (raw) psychometric surfaces below are averaged across observers (and conditions for Experiment 1). In all our experiments, responses are obviously dependent on the current SOA (x-axis). However, the SOA of the previous trials (y-axis) does not seem to meaningfully affect simultaneity and temporal order judgments. The psychometric curves above the heatmaps represent the average psychometric functions (marginalized over the SOA of the previous trial).

      All in all, the present analyses demonstrate negligible temporal recalibration across trials, likely because of the random sequence of lags or phase shifts. Therefore, when estimating the temporal constants of the model, it seems reasonable to ignore the potential effects of temporal recalibration. To avoid increasing the complexity of the present manuscript, we would prefer not to include the present analyses in the revised version.

      Author response image 1.

      Effect of previous trial. Psychometric surfaces for Experiments 1 and 2 plotted against the lag in the current vs. the previous trial. While psychophysical responses are strongly modulated by the lag in the current trial (horizontal axis), they are relatively unaffected by the lag in the previous trial (vertical axis).
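
      As a minimal sketch of the previous-trial analysis described above, the snippet below bins responses by the lag of the current and of the preceding trial, and also computes the psychometric function marginalized over the previous lag. The function names, the input format (lags in presentation order, responses coded 0/1) and the units are assumptions made for illustration; this is not the analysis code used for the figure.

```python
from collections import defaultdict


def previous_trial_surface(soas, responses):
    """Proportion of positive responses per (current SOA, previous SOA) cell.

    soas: SOAs in presentation order (hypothetical units: ms).
    responses: 0/1 responses in the same order (e.g. 1 = "synchronous").
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [n_positive, n_trials]
    for i in range(1, len(soas)):  # the first trial has no predecessor
        cell = counts[(soas[i], soas[i - 1])]
        cell[0] += responses[i]
        cell[1] += 1
    return {key: pos / n for key, (pos, n) in counts.items()}


def marginal_over_previous(soas, responses):
    """Psychometric function marginalized over the previous-trial SOA."""
    counts = defaultdict(lambda: [0, 0])  # current SOA -> [n_positive, n_trials]
    for i in range(1, len(soas)):
        cell = counts[soas[i]]
        cell[0] += responses[i]
        cell[1] += 1
    return {soa: pos / n for soa, (pos, n) in counts.items()}
```

      If responses depend only on the current lag, the surface should vary along the current-SOA axis while staying roughly constant along the previous-SOA axis.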

      (4) The model predicts no differences for experiment 1 and this is what is empirically observed. Can the authors support these null results with Bayes factors?

      This is a good suggestion: we have now included a Bayesian repeated measures ANOVA in the analyses of Experiment 1. As expected, these analyses provide further, though mild, evidence in support of the null hypothesis (see Table S2). For completeness, the new Bayesian analyses are presented alongside the previous frequentist ones in the revised manuscript.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The authors aim to consider the effects of phonotactics on the effectiveness of memory reactivation during sleep. They have created artificial words that are either typical or atypical and showed that reactivation improves memory for the latter but not the former.

      Comment 1:

      Strengths:

      This is an interesting design and a creative way of manipulating memory strength and typicality. In addition, the spectral analysis on both the wakefulness data and the sleep data is well done. The article is clearly written and provides a relevant and comprehensive overview of the literature and of how the results contribute to it.

      We thank the reviewer for his/her positive evaluation of our manuscript. 

      Comment 2:

      Weaknesses:

      (1) Unlike most research involving artificial language or language in general, the task engaged in this manuscript did not require (or test) learning of meaning or translation. Instead, the artificial words were arbitrarily categorised and memory was tested for that categorisation. This somewhat limits the interpretation of the results as they pertain to language science, and qualifies comparisons with other language-related sleep studies that the manuscript builds on.

      We thank the reviewer for this comment. We agree that we did not test for meaning or translation but used a categorization task in which we trained subjects to discriminate artificial words according to their reward associations (rewarded vs. non-rewarded). Previous language studies (Batterink et al., 2014; Batterink and Paller, 2017; Reber, 1967) used artificial words to investigate implicit learning of hidden grammar rules. In those studies, the researchers examined generalization of previously learned grammar knowledge by testing subjects’ ability to correctly categorize a novel set of artificial words into rule-congruent versus rule-incongruent words. These differences from our study design might limit the comparability between the results of previous studies of artificial grammar learning and our findings. We now discuss this aspect as a limitation of our novel paradigm.

      We added the following sentences to the discussion on p.14, ll. 481-488:

      Based on our paradigm, we investigated categorization learning of artificial words according to their reward associations (rewarded vs. unrewarded) and did not study aspects of generalization learning of artificial grammar rules (Batterink et al., 2014; Batterink and Paller, 2017; Reber, 1967). This difference might limit the comparability between these previous language-related studies and our findings. However, the use of artificial words with distinct phonotactical properties provided a successful way to manipulate learning difficulty and to investigate the effect of word properties on TMR, whereas our reward categorization learning paradigm had the advantage of increasing the relevance of word learning through incentives.

      Comment 3:

      (2) The details of the behavioural task are hard to understand as described in the manuscript. Specifically, I wasn't able to understand when words were to be responded to with the left or right button. What were the instructions? Were half of the words randomly paired with left and half with right and then half of each rewarded and half unrewarded? Or was the task to know if a word was rewarded or not and right/left responses reflected the participants' guesses as to the reward (yes/no)? Please explain this fully in the methods, but also briefly in the caption to Figure 1 (e.g., panel C) and in the Results section.

      We thank the reviewer for this comment and have added sentences to the document to provide further explanation. We instructed the participants to respond to each word by left- or right-hand button presses, where one button means the word is rewarded and the other button means the word is unrewarded. The assignment of left- and right-hand button presses to their meanings (rewarded versus unrewarded) differed across subjects. In the beginning, they had to guess. Then, over trial repetitions with feedback at the end of each trial, they learned to respond correctly according to the rewarded/unrewarded associations of the words.

      We added the following sentences to the results section on p.5, ll. 161-168: 

      As a two-alternative forced-choice task, we assigned left- and right-hand button presses to the rewarded and the unrewarded word category, counterbalanced across subjects. We instructed the participants to respond to each word by left- or right-hand button presses, where one button means the word is rewarded (gain of money points) and the other button means the word is unrewarded (avoid the loss of money points). In the beginning, they had to guess. Through three presentations of each word in randomized order, with feedback at the end of each trial, they learned to respond correctly according to the rewarded/unrewarded associations of the words (Fig. 1c).

      We added the following sentences to the caption of Figure 1 on p.6, ll. 188-194:

      As a two-alternative forced-choice task, responses of left- and right-hand button presses were assigned to the rewarded and the unrewarded word category, respectively. The participants were instructed to respond to each word by left- or right-hand button presses, where one button means the word is rewarded (gain of money points) and the other button means the word is unrewarded (avoid the loss of money points). d) Feedback matrix with the four answer types (hits: rewarded and correct; CR, correct rejections: unrewarded and correct; misses: rewarded and incorrect; FA, false alarms: unrewarded and incorrect) according to the response and the reward assignment of the word.

      We added the following sentences to the methods on p.19, ll. 687-692:  

      As a two-alternative forced-choice task, we assigned left- and right-hand button presses to the rewarded and the unrewarded word category, counterbalanced across subjects. We instructed the participants to respond to each word by left- or right-hand button presses, where one button means the word is rewarded (gain of money points) and the other button means the word is unrewarded (avoid the loss of money points).

      Comment 4:  

      (3) Relatedly, it is unclear how reward or lack thereof would translate cleanly into a categorisation of hits/misses/correct rejections/false alarms, as explained in the text and shown in Figure 1D. If the item was of the non-rewarded class and the participant got it correct, they avoided loss. Why would that be considered a correct rejection, as the text suggests? It is no less of a hit than the rewarded-correct, it's just the trial was set up in a way that limits gains. This seems to mix together signal detection nomenclature (in which reward is uniform and there are two options, one of which is correct and one isn't) and loss-aversion types of studies (in which reward is different for two types of stimuli, but for each type you can have H/M/CR/FA separably). Again, it might all stem from me not understanding the task, but at the very least this required extended explanations. Once the authors address this, they should also update Fig 1D. This complexity makes the results relatively hard to interpret and the merit of the manuscript hard to access. Unless there are strong hypotheses about reward's impact on memory (which, as far as I can see, are not at the core of the paper), there should be no difference in the manner in which the currently labelled "hits" and "CR" are deemed - both are correct memories. Treating them differently may have implications on the d', which is the main memory measure in the paper, and possibly on measures of decision bias that are used as well.

      We thank the reviewer for this comment, which gives us the opportunity to clarify. As explained in the previous comment, for our two-alternative forced-choice task, we instructed the participants to press one button when they thought the presented word was rewarded and the other button when they thought the word was unrewarded. Based on this instruction, we applied signal detection theory (SDT), because the subjects' task was to detect when reward was present and to reject when reward was absent. Therefore, we considered correct responses to words of the rewarded category as hits and correct responses to words of the unrewarded category as correct rejections (see Table below). However, the reviewer is correct that we punished all incorrect responses (misses as well as false alarms) by subtracting money points, in order to control for task strategies other than learning the reward associations of the words. We agree that further explanation is necessary to introduce our nomenclature.

      Author response table 1.

      We adjusted the results section on p.5, ll. 169-177:

      To obtain a measure of discrimination memory that accounts for the potential influence of response bias, we applied signal detection theory (Green and Swets, 1966). Because we instructed the participants to respond to each word by left- or right-hand button presses, with one button meaning that reward is present and the other button meaning that reward is absent, we considered correct responses to words of the rewarded category as hits and correct responses to words of the unrewarded category as correct rejections. Accordingly, we assigned the responses, with regard to the reward associations of the words, to the following four response types: hits (rewarded, correct); correct rejections (unrewarded, correct); misses (rewarded, incorrect); and false alarms (unrewarded, incorrect). Depending on their responses, subjects received money points (Fig. 1d).
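
      As a minimal sketch of how the four response types map onto the sensitivity (d') and bias (c-criterion) measures, the snippet below uses the textbook signal detection formulas. The function name and the log-linear correction (adding 0.5 to each count so that rates of exactly 0 or 1 remain finite) are assumptions made for illustration; the source does not state which correction, if any, was applied.

```python
from statistics import NormalDist


def sdt_measures(hits, misses, false_alarms, correct_rejections):
    """Return (d_prime, criterion) from the four response counts.

    Assumption: a log-linear correction is applied to both rates.
    """
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    d_prime = z(hit_rate) - z(fa_rate)            # discrimination memory
    criterion = -0.5 * (z(hit_rate) + z(fa_rate))  # response bias
    return d_prime, criterion
```

      Under this sign convention, a positive criterion corresponds to a tendency to respond "unrewarded" more often, and a criterion of zero indicates no bias.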

      Comment 5:

      (4) The study starts off with a sample size of N=39 but excludes 17 participants for some crucial analyses. This is a high number, and it's not entirely clear from the text whether exclusion criteria were pre-registered or decided upon before looking at the data. Having said that, some criteria seem very reasonable (e.g., excluding participants who were not fully exposed to words during sleep). It would still be helpful to see that the trend remains when including all participants who had sufficient exposure during sleep. Also, please carefully mention for each analysis what the N was.

      Our study was not pre-registered. Including all subjects regardless of low prememory performance, provided they received a sufficient number of reactivations (> 160 reactivations, every word at least 2 times), resulted in a new dataset with 15 and 13 participants in the high- and low-PP cueing conditions, respectively. Here, statistical analyses no longer revealed a significant overnight change in memory performance in the high-PP cueing condition (Δ memory (d'): t(14) = 1.67, p = 0.12), whereas the increase in decision bias towards risk avoidance remained significant (Δ bias (c-criterion): t(14) = 3.36, p = 0.005).

      We modified and added the following sentences to the discussion on p.13, ll. 456-458:

      Our study has limitations due to a small sample size and between-subject comparisons. The criteria for data analyses were not pre-registered and the p-values of our behavioral analyses were not corrected for multiple comparisons.

      Comment 6:             

      (5) Relatedly, the final N is low for a between-subjects study (N=11 per group). This is adequately mentioned as a limitation, but since it does qualify the results, it seemed important to mention it in the public review.

      We agree with the reviewer that the small sample size and the between-subject comparisons represent major limitations of our study. Accordingly, we now discuss these limitations in more detail, adding alternative explanations and suggestions for future research to overcome them.

      We added the following sentences to the discussion about the limitations on p.14, ll. 465-488: 

      To control for potential confounders other than the influence of difficulty in word learning on TMR, we compared sleep parameters, pre-sleep memory performance and vigilance shortly before the post-sleep memory test, revealing no significant group differences (see Table S1 and S2). Nevertheless, we cannot rule out that other individual trait factors differed between the groups, such as the individual susceptibility to TMR. To rule out these alternative explanations based on individual factors, we suggest that future research replicate our study using a within-subject design with cueing of subsets of previously learned low- and high-PP words, providing all conditions within the same individuals as shown in other TMR studies (Cairney et al., 2018; Schreiner and Rasch, 2015).

      Comment 7:

      (6) The linguistic statistics used for establishing the artificial words are all based on American English, and are therefore in misalignment with the spoken language of the participants (which was German). The authors should address this limitation and discuss possible differences between the languages. Also, if the authors checked whether participants were fluent in English they should report these results and possibly consider them in their analyses. In all fairness, the behavioural effects presented in Figure 2A are convincing, providing a valuable manipulation test.

      We thank the reviewer for pointing to the misalignment between the German-speaking participants and the artificial words based on American English. We did not assess the English language proficiency of the participants to control for it as a potential confounder, although comparative control analyses revealed no significant differences between the two cueing groups in pre-sleep memory performance (see Table S1).

      We now discuss these points as limitations on p.14, ll. 473-481:

      Further, we used artificial words based on American English in combination with German-speaking participants, whereas language differences in pronunciation and phoneme structure might affect word perception and memory processing (Bohn and Best, 2012). On the other hand, both languages belong to the same language family (Eberhard et al., 2019), and the phonological distance between English and German is quite short compared, for example, to Korean (Luef and Resnik, 2023). Thus, major common phonological characteristics across both languages are still preserved. In addition, our behavioral analyses revealed robust word discrimination learning and distinct memory performance according to different levels of phonotactic probability, providing evidence of successful experimental manipulation.

      Comment 8:

      (7) With regard to the higher probability of nested spindles for the high- vs low-PP cueing conditions, the authors should try and explore whether what the results show is a general increase for spindles altogether (as has been reported in the past to be correlated with TMR benefit and sleep more generally) or a specific increase in nested spindles (with no significant change in the absolute numbers of post-cue spindles). In both cases, the results would be interesting, but differentiating the two is necessary in order to make the claim that nesting is what increased rather than spindle density altogether, regardless of the SW phase.

      We conducted additional analyses based on detected sleep spindles to provide additional data according to this question. 

      We added the following section to the supplementary data on pp. 31-32, ll. 1007-1045:  

      After conducting sleep spindle detection (frequency range of 12-16 Hz, see methods for details), we compared the sleep spindle density between the TMR conditions of high- and low-PP, which showed no significant difference (see Fig. S8a and Table S9). Next, we subdivided the detected sleep spindles into those coupled and those uncoupled with the previously detected slow waves (SW; analyses of Fig. 4). Sleep spindles were defined as coupled when their amplitude peak occurred during the SW up-state phase (0.3 to 0.8 s time-locked to the SW troughs). A two-way mixed design ANOVA on the amplitude of the sleep spindles with the cueing group as a between-subject factor (high-PP-cued vs. low-PP-cued) and SW-coupling as a within-subject factor (coupled vs. uncoupled) showed a significant interaction effect (cueing group × SW-coupling: F(1,20) = 4.51, p = 0.046, η2 = 0.18), a significant main effect of SW-coupling (F(1,20) = 85.02, p < 0.001, η2 = 0.81), and a trend toward significance for the main effect of the cueing group (F(1,20) = 3.54, p = 0.08). Post-hoc unpaired t-tests revealed a significantly higher amplitude of the coupled sleep spindles in the high-PP compared to the low-PP cueing group (t(20) = 2.13, p = 0.046, Cohen’s d = 0.91; Fig. S8b) and no significant group difference for the uncoupled sleep spindles (t(20) = 1.62, p = 0.12). An additional comparison of the number of coupled sleep spindles between the cueing groups revealed no significant difference (see Table S9).

      Here, we found that detected sleep spindles coupled to the SW up-state phase occurred with higher amplitude after TMR presentations of the high-PP words in comparison to the low-PP words, whereas the sleep spindle density and the number of sleep spindles coupled to the SW up-state phase did not differ between the cueing conditions.

      We added the following sentences to the methods on pp. 22-23, ll. 822-839:  

      Sleep spindle analyses 

      We detected fast sleep spindles by band-pass filtering (12-16 Hz) the signal of the Pz electrode during the auditory cueing trials in time windows of -2 to 8 s relative to stimulus onset. The amplitude threshold was calculated individually for each subject as 1.25 standard deviations (SDs) from the mean. The beginning and end times of the sleep spindles were then defined as the points at which the amplitude fell below 0.75 SDs before and after the detected sleep spindle. Only sleep spindles with a duration of 0.5-3 s were included in subsequent analyses.
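
      The thresholding steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the detection code used in the study: the function name is hypothetical, the input is assumed to be an amplitude envelope of the signal that has already been band-pass filtered to 12-16 Hz, and "SDs from the mean" is read as mean + k × SD of that envelope.

```python
from statistics import mean, pstdev


def detect_spindles(envelope, fs, detect_sd=1.25, bound_sd=0.75,
                    min_dur=0.5, max_dur=3.0):
    """Return (start_s, end_s) tuples of candidate sleep spindles.

    envelope: amplitude envelope of the band-passed signal (assumed input).
    fs: sampling rate in Hz.
    """
    mu, sd = mean(envelope), pstdev(envelope)
    detect_thr = mu + detect_sd * sd  # a sample above this marks a candidate
    bound_thr = mu + bound_sd * sd    # event edges: envelope falls below this
    events = []
    i, n = 0, len(envelope)
    while i < n:
        if envelope[i] > detect_thr:
            start = i
            while start > 0 and envelope[start - 1] >= bound_thr:
                start -= 1  # extend backwards to the onset
            end = i
            while end < n - 1 and envelope[end + 1] >= bound_thr:
                end += 1    # extend forwards to the offset
            duration = (end - start + 1) / fs
            if min_dur <= duration <= max_dur:  # keep 0.5-3 s events only
                events.append((start / fs, (end + 1) / fs))
            i = end + 1
        else:
            i += 1
    return events
```

      Using a lower threshold for the event boundaries than for detection keeps a single spindle from being split when its envelope dips briefly between the two thresholds.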

      To compare the sleep spindle densities between the different cueing conditions of high- and low-PP, we computed the grand average sleep spindle density distribution in number per trial with a bin size of 0.5s from -0.5 to 6s time-locked to stimulus onset in each condition (see Fig. S8a and Table S9).     

      Based on the detected slow waves and sleep spindles, we defined coupling events as cases in which the positive amplitude peak of a detected sleep spindle occurred during the slow wave up-state phase, in a time window of 0.3 to 0.8 s relative to the trough of a slow wave.
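
      The coupling criterion above amounts to a simple time-window check, sketched below. The function names and the input format (event times in seconds on a common time axis) are assumptions made for illustration.

```python
def is_coupled(spindle_peak_time, sw_trough_times, window=(0.3, 0.8)):
    """True if the spindle peak falls 0.3-0.8 s after any slow wave trough."""
    lo, hi = window
    return any(lo <= spindle_peak_time - trough <= hi
               for trough in sw_trough_times)


def split_by_coupling(spindle_peak_times, sw_trough_times):
    """Split spindle peak times into (coupled, uncoupled) lists."""
    coupled, uncoupled = [], []
    for peak in spindle_peak_times:
        target = coupled if is_coupled(peak, sw_trough_times) else uncoupled
        target.append(peak)
    return coupled, uncoupled
```

      Spindle amplitude can then be compared separately for the coupled and the uncoupled set, as in the mixed-design ANOVA reported in the supplementary analysis.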

      We computed the averaged amplitude size of each detected sleep spindle by calculating the mean of the absolute amplitude values of all negative and positive peaks within a detected sleep spindle (see Fig. S8b).

      We added the following sentences to the results on p.10, ll. 338-343:  

      By conducting additional analyses based on the detection of fast sleep spindles (12-16 Hz; see methods), we confirmed that fast sleep spindles during the SW up-states (from 0.3 to 0.8 s after the SW trough) occurred with significantly higher amplitude after the cueing presentation of high- compared to low-PP words, whereas sleep spindle density and the number of sleep spindles coupled to the SW up-state did not differ between the cueing conditions (see Fig. S8 and Table S9).

      Reviewer #2 (Public Review):

      Summary:

      The work by Klaassen & Rasch investigates the influence of word learning difficulty on sleep-associated consolidation and reactivation. They elicited reactivation during sleep by applying targeted memory reactivation (TMR) and manipulated word learning difficulty by creating words more similar (easy) or more dissimilar (difficult) to our language. In one group of participants, they applied TMR of easy words and in another group of participants, they applied TMR of difficult words (between-subjects design). They showed that TMR leads to higher memory benefits in the easy compared to the difficult word group. On a neural level, they showed an increase in spindle power (in the up-state of an evoked response) when easy words were presented during sleep.

      Comment 9:

      Strengths:

      The authors investigate a research question relevant to the field, that is, which experiences are actually consolidated during sleep. To address this question, they developed an innovative task and manipulated difficulty in an elegant way.

      Overall, the paper is clearly structured, and results and methods are described in an understandable way. The analysis approach is solid.

      We thank the reviewer for his/her positive evaluation of our manuscript.

      Weaknesses:

      Comment 10:

      (1) Sample size

      For a between-subjects design, the sample size is too small (N = 22). The main finding (also found in the title "Difficulty in artificial word learning impacts targeted memory reactivation") is based on an independent samples t-test with 11 participants/group.

      The authors explicitly mention the small sample size and the between-subjects design as a limitation in their discussion. Nevertheless, making meaningful inferences based on studies with such a small sample size is difficult, if not impossible.

      We agree with the reviewer that the small sample size and the between-subject comparisons represent major limitations of our study. Accordingly, we now discuss these limitations in more detail, adding alternative explanations and suggestions for future research to overcome them.

      We added the following sentences to the discussion about the limitations on p.14, ll. 465-473: 

      To control for potential confounders other than the influence of difficulty in word learning on TMR, we compared sleep parameters, pre-sleep memory performance and vigilance shortly before the post-sleep memory test, revealing no significant group differences (see Table S1 and S2). Nevertheless, we cannot rule out that other individual trait factors differed between the groups, such as the individual susceptibility to TMR. To rule out these alternative explanations based on individual factors, we suggest that future research replicate our study using a within-subject design with cueing of subsets of previously learned low- and high-PP words, providing all conditions within the same individuals as shown in other TMR studies (Cairney et al., 2018; Schreiner and Rasch, 2015).

      Comment 11:

      (2) Choice of task

      Though the task itself is innovative, there would have been tasks better suited to address the research question. The main disadvantage the task and the operationalisation of memory performance (d') have is that single-trial performance cannot be calculated. Consequently, choosing individual items for TMR is not possible.

      Additionally, TMR of low vs. high difficulty is conducted between subjects (and independently of pre-sleep memory performance) which is a consequence of the task design.

      The motivation for why this task has been used is missing in the paper.

      We used a reward task combined with TMR because previous studies revealed beneficial effects of reward-related information on sleep-dependent memory consolidation and reactivation (Asfestani et al., 2020; Fischer and Born, 2009; Lansink et al., 2009; Sterpenich et al., 2021). In addition, we wanted to increase the motivation of the participants, as they could receive additional monetary compensation according to their learning and memory task performances. Furthermore, we designed the task with the possibility of translating it to operant conditioning in rats (see research proposal: https://data.snf.ch/grants/grant/168602). However, the task turned out to be too difficult to translate to rats, so we developed a different learning paradigm for the animal study (Klaassen et al., 2021) of this cross-species research project.

      We added the following sentence to the introduction on p.4, ll. 134-137:

      To consider the beneficial effect of reward related information on sleep dependent memory consolidation and reactivation (Asfestani et al., 2020; Fischer and Born, 2009; Lansink et al., 2009; Sterpenich et al., 2021), we trained healthy young participants to categorize these words into rewarded and unrewarded words to gain and to avoid losses of money points.  

      Reviewer #3 (Public Review):

      Summary:

      In this study, the authors investigated the effects of targeted memory reactivation (TMR) during sleep on memory retention for artificial words with varying levels of phonotactical similarity to real words. The authors report that the high phonotactic probability (PP) words showed a more pronounced EEG alpha decrease during encoding and were more easily learned than the low PP words. Following TMR during sleep, participants who had been cued with the high PP TMR, remembered those words better than 0, whilst no such difference was found in the other conditions. Accordingly, the authors report higher EEG spindle band power during slow-wave up-states for the high PP as compared to low PP TMR trials. Overall, the authors conclude that artificial words that are easier to learn, benefit more from TMR than those which are difficult to learn.

      Comment 12 & 13:

      Strengths:

      (1) The authors have carefully designed the artificial stimuli to investigate the effectiveness of TMR on words that are easy to learn and difficult to learn due to their levels of similarity with prior word-sound knowledge. Their approach of varying the level of phonotactic probability enables them to have better control over phonotactical familiarity than in a natural language and are thus able to disentangle which properties of word learning contribute to TMR success.

      (2) The use of EEG during wakeful encoding and sleep TMR sheds new light on the neural correlates of high PP vs. low PP both during wakeful encoding and cue-induced retrieval during sleep.

      We thank the reviewer for his/her positive evaluation of our manuscript.

      Weaknesses:

      Comment 14:

      (1) The present analyses are based on a small sample and comparisons between participants. Considering that the TMR benefits are based on changes in memory categorization between participants, it could be argued that the individuals in the high PP group were more susceptible to TMR than those in the low PP group for reasons other than the phonotactic probabilities of the stimuli (e.g., these individuals might be more attentive to sounds in the environment during sleep). While the authors acknowledge the small sample size and between-subjects comparison as a limitation, a discussion of an alternative interpretation of the data is missing.

      We agree with the reviewer that the small sample size and the between-subject comparisons represent major limitations of our study. We thank the reviewer for this helpful comment and have now discussed these limitations in more detail by adding alternative explanations and further suggestions for future research to overcome these limitations.

      We added the following sentences to the discussion on p.14, ll. 465-473: 

      To control for potential confounders despite the influence of difficulty in word learning on TMR, we compared parameters of sleep, the pre-sleep memory performance and the vigilance shortly before the post-sleep memory test, revealing no significant group differences (see Table S1 and S2). Nevertheless, we cannot rule out that other individual trait factors differed between the groups, such as the individual susceptibility to TMR. To rule out these alternative explanations based on individual factors, we suggest for future research to replicate our study by conducting a within-subject design with cueing of subsets of previously learned low- and high-PP words providing all conditions within the same individuals as shown in other TMR studies (Cairney et al., 2018; Schreiner and Rasch, 2015).

      Comment 15:

      (2) While the one-tailed comparison between the high PP condition and 0 is significant, the ANOVA comparing the four conditions (between subjects: cued/non-cued, within-subjects: high/low PP) does not show a significant effect. With a non-significant interaction, I would consider it statistically inappropriate to conduct post-hoc tests comparing the conditions against each other. Furthermore, it is unclear whether the p-values reported for the t-tests have been corrected for multiple comparisons. Thus, these findings should be interpreted with caution.

      We thank the reviewer for this comment, which gave us the opportunity to correct our analyses and clarify them with additional description. Indeed, we first investigated overnight changes in behavioral performance within the four conditions by conducting t-tests against 0 of Δ-values of d' and c-criterion. Whereas for all our statistical analyses the p-value was set at p < 0.05 for two-tailed testing, we did not correct the p-values of our behavior analyses for multiple comparisons. To subsequently investigate differences between conditions, we conducted additional ANOVAs. We agree with the reviewer that without significant results of the ANOVA, post-hoc analyses should not be conducted. Also taking into account the recommendation of reviewer 1, we now include post-hoc pairwise comparisons only when the interaction effect of the ANOVA revealed at least a trend of significance (p < 0.1).

      We removed the following post-hoc analyses from the results section on p.9, ll. 291-295: 

      Additional post-hoc pairwise comparisons revealed a significant difference between the high-PP cued and low-PP uncued (high-PP cued vs. low-PP uncued: t(10) = 2.43, p = 0.04), and no difference to other conditions (high-PP cued vs.: high-PP uncued t(20) = 1.28, p = 0.22; low-PP cued t(20) = 1.57, p = 0.13).

      Further, we mentioned the lack of correction for multiple comparisons as a limitation of our results in the discussion on p.13, ll. 456-458:  

      The criteria of data analyses were not pre-registered and the p-values of our behavior analyses were not corrected for multiple comparisons.

      We added the following sentences to the methods p.23, ll. 842-849:

      To analyze overnight changes of sleep behavioral data within TMR conditions, we conducted at first dependent sample t-tests against 0 of Δ-values (post-sleep test minus pre-sleep test) of d' and c-criterion (see Fig. 3). Two-way mixed design ANOVAs were computed to compare Δ-values between TMR conditions. After confirming at least a trend of significance (p < 0.1) for the interaction effect, we conducted post-hoc pairwise comparisons by independent and dependent sample t-tests. For all behavior statistical analyses, the p-value was set at p < 0.05 for two-tailed testing. A p-value < 0.1 and > 0.05 was reported as a trend of significance.
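
      The first step of this procedure (Δ-values tested against 0) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' analysis code: the function names `overnight_change` and `t_against_zero` are hypothetical, and the sketch returns only the t statistic and its degrees of freedom, leaving the p-value lookup to a statistics package.

```python
from math import sqrt
from statistics import mean, stdev


def overnight_change(pre, post):
    """Delta per subject: post-sleep score minus pre-sleep score."""
    if len(pre) != len(post):
        raise ValueError("pre and post must have one value per subject")
    return [b - a for a, b in zip(pre, post)]


def t_against_zero(values):
    """One-sample t statistic against 0 and its degrees of freedom."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    standard_error = stdev(values) / sqrt(n)  # sample SD (n - 1 denominator)
    return mean(values) / standard_error, n - 1
```

      A test on Δ-values from n subjects has n - 1 degrees of freedom, which is why within-group tests with 11 participants are reported as t(10).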

      Comment 16:

      (3) With the assumption that the artificial words in the study have different levels of phonotactic similarity to prior word-sound knowledge, it was surprising to find that the phonotactic probabilities were calculated based on an American English lexicon whilst the participants were German speakers. While it may be the case that the between-language lexicons overlap, it would be reassuring to see some evidence of this, as the level of phonotactic probability is a key manipulation in the study.

      We thank the reviewer for pointing to the misalignment between the German-speaking participants and the artificial words based on American English. In line with this recommendation, we added a more detailed argument to the manuscript about the assumption of our study that major common phonetic characteristics across both languages are still preserved.

      We now discussed these aspects on p.14, ll. 473-481:

      Further, we used artificial words based on American English in combination with German speaking participants, whereas language differences of pronunciation and phoneme structures might affect word perception and memory processing (Bohn and Best, 2012). On the other hand, both languages are considered to have the same language family (Eberhard et al., 2019) and the phonological distance between English and German is quite short compared for example to Korean (Luef and Resnik, 2023). Thus, major common phonological characteristics across both languages are still preserved. In addition, our behavior analyses revealed robust word discrimination learning and distinct memory performance according to different levels of phonotactic probabilities providing evidence of successful experimental manipulation. 

      Comment 17:

      (4) Another manipulation in the study is that participants learn whether the words are linked to a monetary reward or not, however, the rationale for this manipulation is unclear. For instance, it is unclear whether the authors expect the reward to interact with the TMR effects.

      We used a reward task combined with TMR because previous studies revealed beneficial effects of reward-related information on sleep-dependent memory consolidation and reactivation (Asfestani et al., 2020; Fischer and Born, 2009; Lansink et al., 2009; Sterpenich et al., 2021). In addition, we wanted to increase the motivation of the participants, as they could receive additional monetary compensation according to their learning and memory task performances. Furthermore, we designed the task with the possibility of translating it to operant conditioning in rats (see research proposal: https://data.snf.ch/grants/grant/168602). However, the task turned out to be too difficult to translate to rats, so we developed a different learning paradigm for the animal study (Klaassen et al., 2021) of this cross-species research project.

      We added the following sentence to the introduction on p.4, ll. 134-137:

      To consider the beneficial effect of reward related information on sleep dependent memory consolidation and reactivation (Asfestani et al., 2020; Fischer and Born, 2009; Lansink et al., 2009; Sterpenich et al., 2021), we trained healthy young participants to categorize these words into rewarded and unrewarded words to gain and to avoid losses of money points.  

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Comment 18:

      (1) Please clearly define all linguistics terms - and most importantly the term "phonotactics" - at first use.

      We thank the reviewer for this recommendation and we added the definition of phonotactics and further reduced the diversity of linguistic terms to improve readability. 

      We added the following sentences to the beginning of the introduction on p.3, ll. 72-76:

      One critical characteristic of similarity to pre-existing knowledge in auditory word processing is its speech sound (phoneme) pattern. In phonology as the field of language specific phoneme structures, phonotactics determines the constraints of word phoneme composition of a specific language.

      Comment 19:

      (2) Some critical details about the methods should be included in the Results section to make it comprehensible. For example, the way the crucial differences between G1-4 words should be addressed in the Results, not only in Figure 1.

      According to the recommendation, we added this information to the results section.

      We added the following sentences to the results section on p.4, ll. 145-154:

      To study the impact of difficulty in word learning on TMR, we developed a novel learning paradigm. We formed four sets of artificial words (40 words per set; see Table S3 and S4) consisting of different sequences of two vowels and two consonants. Here, we subdivided the alphabet into two groups of consonants (C1: b, c, d, f, g, h, j, k, l, m; C2: n, p, q, r, s, t, v, w, x, z) and vowels (V1: a, e, i; V2: o, u, y). Four-letter-words were created by selecting letters from the vowel and consonant groups according to four different sequences (G1: C1, V1, V2, C2; G2: C1, V1, C2, V2; G3: V1, C1, C2, V2; G4: V1, C1, V2, C2; Fig. 1a; see methods for further details). Comparison analyses between the sets revealed significant differences in phonotactic probability (PP; Fig. 1b; unpaired t-tests: G1 / G2 > G3 / G4, p < 0.005, values of Cohen’s d > 0.71).
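
      The construction rule in this passage can be sketched in code. The following is a minimal illustration, not the authors' stimulus-generation script: the names `GROUPS`, `SEQUENCES`, `candidate_words`, and `classify` are hypothetical, letters are assumed to be lowercase, and the sketch only enumerates strings that fit each sequence. How the 40 words per set were selected is not reproduced here.

```python
from itertools import product

# Letter groups as listed in the quoted Results text (lowercase assumed).
GROUPS = {
    "C1": "bcdfghjklm",
    "C2": "npqrstvwxz",
    "V1": "aei",
    "V2": "ouy",
}

# Position-wise letter groups for the four word sets G1 to G4.
SEQUENCES = {
    "G1": ("C1", "V1", "V2", "C2"),
    "G2": ("C1", "V1", "C2", "V2"),
    "G3": ("V1", "C1", "C2", "V2"),
    "G4": ("V1", "C1", "V2", "C2"),
}


def candidate_words(set_name):
    """Enumerate every four-letter string that fits one sequence."""
    pools = [GROUPS[group] for group in SEQUENCES[set_name]]
    return ["".join(letters) for letters in product(*pools)]


def classify(word):
    """Return the set whose sequence the word fits, or None."""
    for set_name, sequence in SEQUENCES.items():
        if len(word) == len(sequence) and all(
            letter in GROUPS[group] for letter, group in zip(word, sequence)
        ):
            return set_name
    return None
```

      Each sequence admits 10 × 3 × 3 × 10 = 900 candidate strings, and because the four sequences differ in where consonants and vowels sit, a string can fit at most one of them.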

      Comment 20:

      (3) Was scoring done both online and then verified offline? If so, please note that.

      We have now included this information.

      We adjusted the method section on p.21, ll. 765-769:   

      The sleep stages of NREM 1 to 3 (N1 to N3), wake, and REM sleep were scored offline and manually according to the criteria of the American Academy of Sleep Medicine (AASM) by visual inspection of the signals of the frontal, central, and occipital electrodes over 30s epochs (Iber et al., 2007). Based on offline scoring, we confirmed TMR exposure during N2 and N3 and no significant differences (p-values > 0.05) of sleep parameters between the cueing groups (see Table S2).  

      Comment 21:

      (4) In Figure 2, please arrange the panel letters in an easier-to-read way (e.g., label upper right panel b with a different letter).

      We have now rearranged the panel letters according to the recommendation.

      We adjusted Figure 2 on p.8, ll. 242-258:     

      Comment 22:

      (5) In the first paragraph on TMR effects, please note which memory measure you are comparing (i.e., d').

      We added this information according to the recommendation.  

      We adjusted the sentence of the results on p.8, ll. 260-263:

      To examine whether TMR during sleep impacts memory consolidation of discrimination learning with respect to learning difficulty, we calculated the overnight changes by subtracting the pre- from the post-sleep memory performance based on d'-values of the reactivated sequences (cued) and non-reactivated sequences (uncued).

      Comment 23:

      (6) Please show the pre-sleep and post-sleep test scores for both word categories (not only the delta). It may be best to show this as another data point in Fig 2a, but it may be helpful to also see this split between cued and uncued.

      We added the pre-sleep and post-sleep test scores with the individual data points as an additional figure. 

      We added the following figure to the supplementary data on p.28, ll. 936-940:  

      Comment 24:

      (7) In the sentence "An additional two-way mixed design ANOVA on the same values with cueing as a between-subject factor (cued vs. uncued) ...", a more exact phrasing for the last parentheses would probably be "(high-PP-Cued vs Low-PP-Cued)". Both groups were cued.

      We thank the reviewer for pointing this out. According to the recommendation, we corrected the descriptions of the two-way mixed design ANOVAs. In addition, we detected that conditions had been wrongly assigned to the ANOVAs and corrected the reported values.

      We adjusted the sentences and corrected the values on p.9, ll. 271-275 and ll. 289-291: 

      An additional two-way mixed design ANOVA on the same values with the factor cueing (cued vs. uncued) as a within-subject factor and group as a between-subject factor revealed trends of significance (p < 0.1) for the interaction (cueing × group: F(1,20) = 3.47, p = 0.08) and the main effect of group (F(1,20) = 3.28, p = 0.09). The main effect of cueing was not significant (F(1,20) = 0.58, p = 0.46).

      An ANOVA on c-criterion changes showed no significant effects (interaction cueing × group: F(1,20) = 2.66, p = 0.12; main effect cueing  F(1,20) = 2.08, p = 0.17; main effect group F(1,20) = 0.38, p = 0.55).

      Comment 25:

      (8) In the same ANOVA, please mention that there is a trend toward an interaction effect. If there wasn't one, the post-hoc comparison would be unwarranted. Please consider noting other p<0.1 p-values as a trend as well, for consistency.

      Regarding this recommendation, we now included post-hoc pairwise comparisons only after confirming at least a trend toward an interaction effect of these ANOVAs, and consistently reported a p-value < 0.1 and > 0.05 as a trend of significance.

      We added the following sentences to the methods p.23, ll. 844-849:

      Two-way mixed design ANOVAs were computed to compare Δ-values between TMR conditions. After confirming at least a trend of significance (p < 0.1) for the interaction effect, we conducted post-hoc pairwise comparisons by independent and dependent sample t-tests. For all behavior statistical analyses, the p-value was set at p < 0.05 for two-tailed testing. A p-value < 0.1 and > 0.05 was reported as a trend of significance.
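
      The reporting rule in these methods sentences can be written as a small decision helper. This is a sketch under stated assumptions, not the authors' code: the names `label_p` and `posthoc_allowed` are hypothetical, and the handling of a p-value of exactly 0.05 (treated here as a trend) is an assumption, since the text specifies only p < 0.05 and 0.05 < p < 0.1.

```python
ALPHA = 0.05        # two-tailed significance threshold stated in the methods
TREND_ALPHA = 0.10  # upper bound for reporting a "trend of significance"


def label_p(p):
    """Label a p-value according to the thresholds stated in the methods."""
    if p < ALPHA:
        return "significant"
    if p < TREND_ALPHA:
        return "trend"  # assumption: p == 0.05 exactly also lands here
    return "not significant"


def posthoc_allowed(p_interaction):
    """Post-hoc pairwise tests run only if the interaction shows at least a trend."""
    return p_interaction < TREND_ALPHA
```

      Applied to the values reported in this response, the d' interaction (p = 0.08) passes the gate, whereas the c-criterion interaction (p = 0.12) does not.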

      We removed the following post-hoc analyses from the results section on p.9, ll. 291-295: 

      Additional post-hoc pairwise comparisons revealed a significant difference between the high-PP cued and low-PP uncued (high-PP cued vs. low-PP uncued: t(10) = 2.43, p = 0.04), and no difference to other conditions (high-PP cued vs.: high-PP uncued t(20) = 1.28, p = 0.22; low-PP cued t(20) = 1.57, p = 0.13).

      Comment 26:      

      (9) Please consider adding an analysis correlating spindle power with memory benefit across participants. Even if it is non-significant, it is important to report given that some studies have found such a relationship.

      According to this recommendation, we conducted additional correlation analyses.

      We added the following sentences to the manuscript into the results (pp. 10-11, ll. 346-349), the discussion (p.12, ll. 413-417), and the methods (p.23, ll. 864-867):   

      Whereas we found a significant group difference in spindle power nested during SW up-states,   conducting further whole sample (n = 22) correlation analyses between the individual spindle power values of the significant cluster and the overnight changes of behavior measurements revealed no significant correlations (Δ d': r = 0.16, p = 0.48; Δ c-criterion: r = 0.19, p = 0.40).

      In addition to our result of the significant group difference, we failed to find significant correlations between SW nested spindle power values and overnight changes in behavior measurements, whereas previous studies reported associations of SW and spindle activities during sleep with the integration of new memories in pre-existing knowledge networks (Tamminen et al., 2013, 2010).

      By using the same extracted power values (0.3 to 0.8s; 11-14Hz; Pz, P3, P4, O2, P7) per subject, we performed whole sample (n = 22) Pearson correlation analyses between these power values and the overnight changes of behavior measurements of the cued condition (Δ d' and Δ c-criterion).
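
      For readers unfamiliar with the statistic, the Pearson coefficient used in this analysis can be sketched as below. This is a minimal standard-library illustration; the function name `pearson_r` is hypothetical, the inputs would be one spindle power value and one Δ-value per subject, and the p-values reported above would come from a statistics package rather than from this sketch.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y need the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)
```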

      Reviewer #2 (Recommendations For The Authors):

      (1) Choice of task

      Comment 27:      

      In general, I find your task well-designed and novel. In light of your research question, however, I wonder why you chose this task. When you outlined the research question in the introduction, I expected a task similar to Schreiner et al. (2015). For example, participants have to associate high PP words with each other and low PP words. The advantage here would be that you could test the benefits of TMR in a within-subjects design (for example, cueing half of the remembered high and half of the remembered low PP words).

      Please see our previous response at comment 14.    

      Comment 28:

      Why did you decide to introduce a reward manipulation?

      Please see our previous response at comment 11.    

      Comment 29:

      Why did you do the cueing on a category level (cueing all high PP or all low PP words instead of single word cueing or instead of cueing 20 reward high-PP, 20 unrewarded high-PP plus 20 reward low-PP and 20 unrewarded low-PP)? Both alternatives would have provided you the option to run your statistics within participants.

      Please see our previous response at comment 14.    

      Comment 30:

      (2) Between-subjects design and small sample size.

      Why did you decide on a between-subjects design that severely reduces your power?

      Why did you just collect 22 participants with such a design? Were there any reasons for this small sample size? Honestly, I think publishing a TMR study with healthy participants and such a small sample size (11 participants for some comparisons) is not advisable.

      Please see our previous response at comment 14.

      Comment 31:

      (3) Encoding performance.

      Is d' significantly above 0 in the first repetition round? I would assume that the distinction between rewarded and non-rewarded words is just possible after the first round of feedback.

      Indeed, conducting t-tests against 0 revealed significantly increased d'-values in the first repetition round (2nd presentation) in both PP conditions (high-PP: 0.85 ± 0.09, t(32) = 9.17, p < 0.001; low-PP: 0.62 ± 0.09, t(32) = 6.83, p < 0.001).  

      Comment 32:

      (4) Encoding response options

      If you want to you could make it more explicit what exactly the response options are. I assume that one button means a word has a high reward and the other button means a word has a low reward. Making it explicit increases the understanding of the results section.

      Please see our previous response at comment 3.

      Comment 33:           

      (5) Alpha desynchronisation.

      Relative change

      Why did you subtract alpha power during the 1st presentation from alpha power during 2nd and 3rd presentation? You baseline-corrected already and individually included the 1st, 2nd, and 3rd repetition in your behavioural analysis.

      Based on this analysis, we aimed to examine the relative change in alpha power between PP-conditions of memory-relevant word repetitions. Therefore, to extract memory relevant changes of EEG activities, the first word presentation of naive stimulus processing could serve as a more representative baseline condition covering the time-window of interest of 0.7 to 1.9 s after the stimulus onset compared to a baseline condition before stimulus onset (-1 to -0.1s). 

      To explain the rationale of the analyses with the baseline condition more clearly, we added this information to the results section on p.7, ll. 222-226:

      We obtained the changes in power values by subtracting the first from the second and third presentation for the high- and low-PP condition, respectively. Here, the first word presentation of naive stimulus processing served us with a more representative baseline condition covering the time-window of interest of 0.7 to 1.9 s after the stimulus onset to examine relevant changes of encoding.  
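
      The subtraction described here can be illustrated with a deliberately simplified sketch. It assumes one power time course per word presentation and averages power inside the 0.7 to 1.9 s window before subtracting, whereas the actual analysis presumably operated on full time-frequency data. The function names `window_mean` and `change_from_first` are hypothetical.

```python
def window_mean(times, power, t_min=0.7, t_max=1.9):
    """Average power over samples whose time lies inside the window (seconds)."""
    selected = [p for t, p in zip(times, power) if t_min <= t <= t_max]
    if not selected:
        raise ValueError("no samples inside the time window")
    return sum(selected) / len(selected)


def change_from_first(times, power_by_presentation):
    """Subtract the first presentation's window mean from each later one."""
    baseline = window_mean(times, power_by_presentation[0])
    return [window_mean(times, p) - baseline for p in power_by_presentation[1:]]
```

      Negative values indicate lower alpha power than during the first, naive presentation, that is, stronger desynchronization.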

      Comment 34:

      (6) Alpha desynchronisation as a neural correlate of encoding depth & difficulty?

      "In addition to the behavior results, these EEG results indicate differences between PP conditions in desynchronization of alpha oscillations, as an assumed neural correlate of encoding depth."

      Given that the low-PP words are more difficult to learn, I was expecting to see higher alpha desynchronisation in the low-PP relative to the high-PP words. Could you outline in a bit more detail how your findings fit into the literature (e.g., Simon Hanslmayr did a lot of work on this)?

      I would also advise you to add citations e.g., after your sentence in the quote above ("as an assumed neural correlate of encoding depth").

      We thank the reviewer for the recommendation giving us the opportunity to discuss in more detail how our results relate to previous findings. 

      We added additional sentences to the discussion on p.13, ll. 441-455:    

      Additional studies linked alpha desynchronization to cognitive effort and cognitive load (Proskovec et al., 2019; Zhu et al., 2021). So, one could assume to observe higher alpha desynchronization in the more difficult to learn condition of low-PP compared to high-PP. On the other hand, numerous studies investigating oscillatory correlates of learning and memory showed that alpha desynchronization is associated with memory across different tasks, modalities and experimental phases of encoding and retrieval (Griffiths et al., 2016, 2021, 2019a, 2019b; Hanslmayr et al., 2009; Michelmann et al., 2016). Strikingly, Griffiths and colleagues (Griffiths et al., 2019a) revealed by simultaneous EEG-fMRI recordings a negative correlation between the occurrence of patterns of stimulus-specific information detected by fMRI and cortical alpha/beta suppression. Here, the authors suggested that a decrease of alpha/beta oscillations might represent the neuronal mechanism of unmasking the task-critical signal by simultaneous suppression of task-irrelevant neuronal activities to promote information processing. Following this interpretation, we assume that over the course of learning elevated memory processing of the easier to learn stimuli is associated with enhanced information processing and thus accompanied by higher cortical alpha desynchronization in comparison to the more difficult to learn stimuli.

      In addition, we added the mentioned quote on p.7, ll. 239-240:

      In addition to the behavior results, these EEG results indicate differences between PP conditions in desynchronization of alpha oscillations, as an assumed neural correlate of encoding depth (Griffiths et al., 2021; Hanslmayr et al., 2009).

      Comment 35:

      (7) Exclusion criterion.

      Why did you use a d' > 0.9 as a criterion for data inclusion?

      This criterion ensured that each included subject had at least in one PP-condition a d' > 1.05 of pre-sleep memory performance, which corresponds to a general accuracy rate of 70%. 

      Accordingly, we adjusted these sentences of the method section on p.19, ll. 677-680: 

      Data were excluded from subjects who did not reach the minimal learning performance of d' > 1.05 during the pre-sleep memory test in at least one of the two PP conditions, whereas this threshold value corresponds to accuracy rates of 70% (n = 5). In addition, we excluded one subject who showed a negative d' in one PP condition of the pre-sleep memory test (n = 1). 
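
      The stated correspondence between d' > 1.05 and an accuracy rate of 70% can be checked with the standard signal detection formulas. The sketch below assumes unbiased responding (hit rate = correct rejection rate = 0.70, so false alarm rate = 0.30), which the response does not state explicitly, and it assumes the conventional definition of the c-criterion in which positive values indicate a conservative bias. The function names `d_prime` and `c_criterion` are hypothetical.

```python
from statistics import NormalDist

# Inverse of the standard normal CDF (z-transform of a proportion).
# Note: rates of exactly 0 or 1 raise StatisticsError and need correction first.
_z = NormalDist().inv_cdf


def d_prime(hit_rate, false_alarm_rate):
    """Sensitivity index: z(hit rate) minus z(false alarm rate)."""
    return _z(hit_rate) - _z(false_alarm_rate)


def c_criterion(hit_rate, false_alarm_rate):
    """Response bias: positive values mean fewer 'rewarded' responses overall."""
    return -0.5 * (_z(hit_rate) + _z(false_alarm_rate))
```

      With a hit rate of 0.70 and a false alarm rate of 0.30, `d_prime` evaluates to about 1.049 and `c_criterion` to 0, consistent with the rounded threshold of 1.05 given above.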

      Comment 36:

      (8) Coherence of wording.

      When you talk about your dependent variable (d') you sometimes use sensitivity. I would stick to one term.

      We replaced the word sensitivity with d'.    

      (9) Criterion

      Comment 37:

      Why do you refer to a change in criterion (Figure 3b, axis labels) as a change in memory? Do you think the criterion says something about memory?

      We corrected the axis label of Figure 3b and deleted here the word memory.

      Comment 38:

      Additionally, why did you analyse the effect of TMR on the criterion? Do you expect the criterion to change due to sleep-dependent memory consolidation? This section would benefit from more explanation. Personally, I am very interested in your thoughts and your hypothesis (if you had one, if not that is also fine but then, make it explicit that it was an exploratory analysis).

      By conducting exploratory analyses of overnight changes of the c-criterion measurements, we aimed to examine the bias of decision-making to provide comprehensive data according to the framework of the signal detection theory. Regarding the previous literature showing mainly beneficial effects of sleep on learning and memory, we focused with our hypothesis on d' and explored additionally the c-criterion.

      Despite our task design with gains/hits of +10 money points and losses/FAs of -8 (instead of -10), the subjects showed already during the pre-sleep memory task significant biases towards loss avoidance in both PP conditions (t-tests against 0: high-PP: 0.44 ± 0.07, t(21) = 5.63, p < 0.001; low-PP: 0.47 ± 0.09, t(21) = 5.51, p < 0.001). As already reported in the preprint, we found an additional significant increase of c-criterion by TMR solely for the high-PP words (see Fig. 3b). Even by integrating subjects with poor pre-sleep memory performance (high-PP-cueing group: n = 15; low-PP-cueing group: n = 13), t-tests against 0 revealed a significant increase of the high-PP cueing condition (t(14) = 3.36, p = 0.005) and no significant overnight changes in the other conditions (high-PP uncued: t(12) = 1.39, p = 0.19; low-PP cued: t(12) = 1.47, p = 0.17; low-PP uncued: t(14) = -0.20, p = 0.84). These exploratory findings on c-criterion suggest potential applications of TMR to affect decision-making biases in combination with reward learning.      

      We revised the manuscript mentioning the exploratory character of the c-criterion analyses of the results on p.9, ll. 282-283 and of the discussion on p.12, ll. 400-402:  

      We examined next as an exploratory analysis whether TMR conditions influence biases in decision-making.

      By conducting an additional exploratory analysis, we observed a significant change of the decision bias in the cueing condition of the easy to learn words and no overnight changes in the other conditions.

      Comment 39:

      (10) You detected SWs in the time range of 0-6 sec post sound stimulation. How was the distribution of all detected SW down-states in this time range? (You could plot a histogram for this.)

      We have now illustrated the detected SWs in the time range of 0 to 6 s after stimulus onset.

      We added a histogram to the supplementary section on p.30, ll. 982-986:  

      Reviewer #3 (Recommendations For The Authors):

      Comment 40:

      (1) In line with the weakness outlined above, I would recommend including a discussion of how the between-subject comparison and small sample size could affect the results and provide alternative interpretations.

      Please see our previous response at comment 14.

      Comment 41:

      (2) Regarding my point about statistical comparisons, I would recommend that the authors follow best practice guidelines for post-hoc tests and multiple comparisons. In Figures 3a and b, I would also recommend removing the stars indicating significance from the post-hoc tests (if this is what they reflect). Perhaps this link will be useful: https://www.statology.org/anova-post-hoc-tests/

      Please see our previous response at comment 15.    

      Comment 42:

      (3) Furthermore, to address any doubts about the possible phonotactic probability differences between languages, I would recommend that the authors show whether the languages overlap, the level of English fluency in the German-speaking participants, and/or another way of reassuring that this is unlikely to have affected the results.

      Please see our previous response at comment 7.    

      Comment 43:

      (4) In the introduction, I would recommend that the authors outline a clear rationale for the reward/no reward manipulation.

      Please see our previous response at comment 11.    

      Comment 44:

      (5) Figure 1c: Please include what response options participants had, e.g., 'rewarded/not rewarded'. This would make the type of categorization clearer to the reader.

      Please see our previous response at comment 3.

      Comment 45:

      (6) It is unclear whether the additional ANOVA conducted on the time and frequency of the identified clusters included all channels or only the channels contributing to the cluster. Consider clarifying this in the relevant methods and results. Furthermore, I would recommend labelling this as a posthoc test as this analysis was guided by an initial peek at the data and the timings, frequencies, and channels of interest were not selected a-priori.

      We thank the reviewer for this recommendation and labelled the additional repeated-measure ANOVA as a post-hoc test. Further, we mentioned the channels used (Pz and Cz) for this analysis.

      We adjusted the results section on p.7, ll. 230-233 and the methods section on p.23, ll. 858-860:            

      A post-hoc repeated-measure ANOVA on alpha power changes (merged over Pz and Cz electrodes) with PP (high vs. low) and presentations (2 to 3) as within-subjects factors revealed a main effect of PP (F(1,32) = 5.42, p = 0.03, η2 = 0.15), and a significant interaction (F(1,32)  = 7.38, p = 0.01, η2 = 0.19; Fig. 2e).

      After confirming the existence of a significant cluster, we conducted an additional post-hoc repeated-measure ANOVA with averaged values of the identified time and frequency range of interest and merged over the Pz and Cz electrodes (see Fig. 2e).

      Comment 46:

      (7) Figure 3: To better illustrate within- vs. between-subjects comparisons and promote transparency, please add individual points and lines between the within-subjects conditions.

      According to this recommendation, we changed Figure 3 to add the individual data points connected by lines.

      We modified Figure 3 on p.9, ll. 299-303:  

      Comment 47:

      (8) For the SW density time-bin analyses, please include statistics for all comparisons (i.e., through 0 s to 3 s) and say whether these were corrected for multiple comparisons.

      According to this recommendation, we have now included statistics for all comparisons.

      We added Table S6 to the supplementary data on p.29, l.962:

      Comment 48:

      (9) Consider reporting effect sizes.

      We thank the reviewer for this recommendation and have now added effect sizes for significant results.

      Comment 49:

      (10) For transparency and replicability, consider including a list of the four stimulus sets including their phoneme and biphone probabilities.

      We included a list of the four stimulus sets with their phoneme and biphone probabilities.

      We added Table S3 and Table S4 to the supplementary data on pp. 26-27:

      References

      Asfestani MA, Brechtmann V, Santiago J, Peter A, Born J, Feld GB. 2020. Consolidation of Reward Memory during Sleep Does Not Require Dopaminergic Activation. J Cogn Neurosci 32:1688–1703. doi:10.1162/JOCN_A_01585

      Batterink LJ, Oudiette D, Reber PJ, Paller KA. 2014. Sleep facilitates learning a new linguistic rule. Neuropsychologia 65:169–79. doi:10.1016/j.neuropsychologia.2014.10.024

      Batterink LJ, Paller KA. 2017. Sleep-based memory processing facilitates grammatical generalization: Evidence from targeted memory reactivation. Brain Lang 167:83–93. doi:10.1016/J.BANDL.2015.09.003

      Bohn OS, Best CT. 2012. Native-language phonetic and phonological influences on perception of American English approximants by Danish and German listeners. J Phon 40:109–128. doi:10.1016/J.WOCN.2011.08.002

      Cairney SA, Guttesen A á. V, El Marj N, Staresina BP. 2018. Memory Consolidation Is Linked to Spindle-Mediated Information Processing during Sleep. Curr Biol 28:948-954.e4. doi:10.1016/j.cub.2018.01.087

      Eberhard DM, Simons GF, Fennig CD. 2019. Ethnologue: Languages of the world. SIL International. Online version: http://www.ethnologue.com.

      Fischer S, Born J. 2009. Anticipated reward enhances offline learning during sleep. J Exp Psychol Learn Mem Cogn 35:1586–1593. doi:10.1037/A0017256

      Green DM, Swets JA. 1966. Signal detection theory and psychophysics. Oxford, England: John Wiley.

      Griffiths B, Mazaheri A, Debener S, Hanslmayr S. 2016. Brain oscillations track the formation of episodic memories in the real world. Neuroimage 143:256–266. doi:10.1016/j.neuroimage.2016.09.021

      Griffiths BJ, Martín-Buro MC, Staresina BP, Hanslmayr S, Staudigl T. 2021. Alpha/beta power decreases during episodic memory formation predict the magnitude of alpha/beta power decreases during subsequent retrieval. Neuropsychologia 153. doi:10.1016/j.neuropsychologia.2021.107755

      Griffiths BJ, Mayhew SD, Mullinger KJ, Jorge J, Charest I, Wimber M, Hanslmayr S. 2019a. Alpha/beta power decreases track the fidelity of stimulus specific information. Elife 8. doi:10.7554/eLife.49562

      Griffiths BJ, Parish G, Roux F, Michelmann S, van der Plas M, Kolibius LD, Chelvarajah R, Rollings DT, Sawlani V, Hamer H, Gollwitzer S, Kreiselmeyer G, Staresina B, Wimber M, Hanslmayr S. 2019b. Directional coupling of slow and fast hippocampal gamma with neocortical alpha/beta oscillations in human episodic memory. Proc Natl Acad Sci U S A 116:21834–21842. doi:10.1073/pnas.1914180116

      Hanslmayr S, Spitzer B, Bäuml K-H. 2009. Brain oscillations dissociate between semantic and nonsemantic encoding of episodic memories. Cereb Cortex 19:1631–40. doi:10.1093/cercor/bhn197

      Iber C, Ancoli‐Israel S, Chesson AL, Quan SF. 2007. The AASM Manual for the Scoring of Sleep and Associated Events: Rules, Terminology and Technical Specifications. Westchester, IL: American Academy of Sleep Medicine.

      Klaassen AL, Heiniger A, Sánchez PV, Harvey MA, Rainer G. 2021. Ventral pallidum regulates the default mode network, controlling transitions between internally and externally guided behavior. Proc Natl Acad Sci U S A 118:1–10. doi:10.1073/pnas.2103642118

      Lansink CS, Goltstein PM, Lankelma J V., McNaughton BL, Pennartz CMA. 2009. Hippocampus leads ventral striatum in replay of place-reward information. PLoS Biol 7. doi:10.1371/JOURNAL.PBIO.1000173

      Luef EM, Resnik P. 2023. Phonotactic Probabilities and Sub-syllabic Segmentation in Language Learning. Theory Pract Second Lang Acquis 9:1–31. doi:10.31261/TAPSLA.12468

      Michelmann S, Bowman H, Hanslmayr S. 2016. The Temporal Signature of Memories: Identification of a General Mechanism for Dynamic Memory Replay in Humans. PLoS Biol 14:e1002528. doi:10.1371/journal.pbio.1002528

      Proskovec AL, Heinrichs-Graham E, Wilson TW. 2019. Load Modulates the Alpha and Beta Oscillatory Dynamics Serving Verbal Working Memory. Neuroimage 184:256. doi:10.1016/J.NEUROIMAGE.2018.09.022

      Reber AS. 1967. Implicit learning of artificial grammars. J Verbal Learning Verbal Behav 6:855–863. doi:10.1016/S0022-5371(67)80149-X

      Schreiner T, Rasch B. 2015. Boosting vocabulary learning by verbal cueing during sleep. Cereb Cortex 25:4169–4179. doi:10.1093/cercor/bhu139

      Sterpenich V, van Schie MKM, Catsiyannis M, Ramyead A, Perrig S, Yang H-D, Van De Ville D, Schwartz S. 2021. Reward biases spontaneous neural reactivation during sleep. Nat Commun 12:1–11. doi:10.1038/s41467-021-24357-5

      Tamminen J, Lambon Ralph MA, Lewis PA. 2013. The role of sleep spindles and slow-wave activity in integrating new information in semantic memory. J Neurosci 33:15376–15381. doi:10.1523/JNEUROSCI.5093-12.2013

      Tamminen J, Payne JD, Stickgold R, Wamsley EJ, Gaskell MG. 2010. Sleep spindle activity is associated with the integration of new memories and existing knowledge. J Neurosci 30:14356–60. doi:10.1523/JNEUROSCI.3028-10.2010

      Zhu Y, Wang Q, Zhang L. 2021. Study of EEG characteristics while solving scientific problems with different mental effort. Sci Rep 11. doi:10.1038/S41598-021-03321-9

    1. Author Response

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      In this study, the researchers aimed to investigate the cellular landscape and cell-cell interactions in cavernous tissues under diabetic conditions, specifically focusing on erectile dysfunction (ED). They employed single-cell RNA sequencing to analyze gene expression patterns in various cell types within the cavernous tissues of diabetic individuals. The researchers identified decreased expression of genes associated with collagen or extracellular matrix organization and angiogenesis in several cell types, including fibroblasts, chondrocytes, myofibroblasts, valve-related lymphatic endothelial cells, and pericytes. They also discovered a newly identified marker, LBH, that distinguishes pericytes from smooth muscle cells in mouse and human cavernous tissues. Furthermore, the study revealed that pericytes play a role in angiogenesis, adhesion, and migration by communicating with other cell types within the corpus cavernosum. However, these interactions were found to be significantly reduced under diabetic conditions. The study also investigated the role of LBH and its interactions with other proteins (CRYAB and VIM) in maintaining pericyte function and highlighted their potential involvement in regulating neurovascular regeneration. Overall, the manuscript is well-written and the study provides novel insights into the pathogenesis of ED in patients with diabetes and identifies potential therapeutic targets for further investigation.

      Reviewer #2 (Public Review):

      Summary: In this manuscript, the authors performed single cell RNA-sequencing of cells from the penises of healthy and diabetes mellitus model (STZ injection-based) mice, identified Lbh as a marker of penis pericytes, and report that penis-specific overexpression of Lbh is sufficient to rescue erectile function in diabetic animals. In public human single cell RNA-seq datasets, the authors report that LBH is similarly specific to pericytes and down regulated in diabetic patients. Additionally, the authors report discovery of CRYAB and VIM1 as protein interacting partners with LBH.

      The authors' contributions are of interest to the erectile dysfunction community and their Lbh overexpression experiments are especially interesting and well-conducted. However, claims in the manuscript regarding the specificity of Lbh as a pericyte marker, the mechanism by which Lbh overexpression rescues erectile function, cell-cell interactions impaired by diabetes, and protein-interaction partners require qualification or further evidence to justify.

      Major claims and evidence:

      1) Marker gene specificity and quantification: One of the authors' major contributions is the identification of Lbh as a marker of pericytes in their data. The authors present qualitative evidence for this marker gene relationship, but it is unclear from the data presented if Lbh is truly a specific marker gene for the pericyte lineage (either based on gene expression or IF presented in Fig. 2D, E). Prior results (see Tabula Muris Consortium, 2018) suggest that Lbh is widely expressed in non-pericyte cell types, so the claims presented in the manuscript may be overly broad. Even if Lbh is not a globally specific marker, the authors' subsequent intervention experiments argue that it is still an important gene worth studying.

      Answer: We appreciate this comment. In our scRNAseq data for the mouse cavernosum tissues, previously known markers such as Rgs5, Pdgfrb, Cspg4, Kcnj8, Higd1b, and Cox4i2 were found to be expressed not exclusively in pericytes, while Lbh exhibited specific expression patterns in pericytes (Fig. 2 and Supplementary Fig. 5). LBH expression was easily distinguishable from α-SMA, not only in mouse cavernosum but also in dorsal artery and dorsal vein tissues within penile tissues. This distinctive expression pattern of LBH was also observed in the human cavernous pericytes (Fig. 5). Then, we examined Lbh expression patterns in various mouse tissues using the mouse single-cell atlas (Tabula Muris), although endothelial and pericyte clusters were not subclustered in most tissues from Tabula Muris. To identify pericytes, we relied on the expression pattern of known marker genes (Pecam1 for endothelial cells, Rgs5, Pdgfrb, and Cspg4 for pericytes). Lbh was expressed in pericytes of the bladder, heart and aorta, kidney, and trachea, but not as specifically as in penile pericytes (Supplementary Fig. 6A-D). However, it is worth noting that other known pericyte markers also did not exhibit exclusive expression in pericytes across all the tissues we analyzed. Therefore, in certain tissues, particularly in mouse penile tissues, Lbh may be a valuable marker in conjunction with other established pericyte marker genes for distinguishing pericytes.

      2) Cell-cell communication and regulon activity changes in the diabetic penis: The authors present cell-cell communication analysis and TF regulon analysis in Fig 3 and report differential activities in healthy and DM mice. These results are certainly interesting, however, no statistical analyses are performed to justify claimed changes in the disease state and no validations are performed. It is therefore challenging to interpret these results, and the relevant claims do not seem well supported.

      Answer: In response to these helpful suggestions, we calculated statistical significance and performed experimental validation. CellPhoneDB permutes the cluster labels of all cells 1000 times and, each time, calculates mean(mean(molecule 1 in cluster X), mean(molecule 2 in cluster Y)) for each interaction pair and for each pairwise comparison between two cell types. We only considered interactions in which the difference in means calculated by these permutations was greater than 0.25-fold between diabetes and normal. We also considered interactions with P-value < 0.05 to be significant.
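
      The label-permutation procedure described above can be sketched as follows. This is an illustration only and not the CellPhoneDB implementation: the function names, the toy expression values, the cluster labels, and the one-sided definition of the p-value (the fraction of permuted scores at least as large as the observed score) are assumptions of this sketch.

```python
import random
from statistics import mean


def interaction_score(ligand, receptor, labels, cluster_x, cluster_y):
    """Mean of (mean ligand expression in cluster_x, mean receptor expression in cluster_y)."""
    lig = [v for v, lab in zip(ligand, labels) if lab == cluster_x]
    rec = [v for v, lab in zip(receptor, labels) if lab == cluster_y]
    return (mean(lig) + mean(rec)) / 2.0


def permutation_p_value(ligand, receptor, labels, cluster_x, cluster_y,
                        n_permutations=1000, seed=0):
    """Shuffle cluster labels and compare the permuted scores with the observed one."""
    rng = random.Random(seed)
    observed = interaction_score(ligand, receptor, labels, cluster_x, cluster_y)
    shuffled = list(labels)  # shuffling keeps cluster sizes unchanged
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        permuted = interaction_score(ligand, receptor, shuffled, cluster_x, cluster_y)
        if permuted >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / n_permutations


# Hypothetical toy data: 20 pericytes (PC), 20 endothelial cells (EC), 20 fibroblasts (FB).
labels = ["PC"] * 20 + ["EC"] * 20 + ["FB"] * 20
ligand = [5.0] * 20 + [0.0] * 40                  # ligand expressed only in PC
receptor = [0.0] * 20 + [4.0] * 20 + [0.0] * 20   # receptor expressed only in EC

observed, p_value = permutation_p_value(ligand, receptor, labels, "PC", "EC")
print(observed, p_value)
```

      In this toy case the ligand and receptor are perfectly cluster-specific, so random relabelling almost never reproduces the observed score and the empirical p-value is very small.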

      To assess differential regulon activities of transcription factor (SCENIC) between diabetic and normal pericytes, we utilized a generalized linear model with scaled activity scores for each cell as input. These scaled regulon activity values for angiogenesis-related TFs exhibited differences between diabetic and normal pericytes. The results of the generalized linear model revealed that Klf5, Egr1, and Junb were TFs with significantly altered regulon activities in diabetic pericytes. Experimental data indicated that the expression level of Lmo2, Junb, Elk1, and Hoxd10 was higher (Hoxd10) or lower (Lmo2, Junb, Elk1) in diabetic pericytes compared to normal pericytes (Supplementary Fig. 9). We have added the scaled regulon activity values and statistical significance in Fig. 3E.

      3) Rescue of ED by Lbh overexpression: This is a striking and very interesting result that warrants attention. By simple overexpression of the pericyte marker gene Lbh, the authors report rescue of erectile function in diabetic animals. While mechanistic details are lacking, the phenomenon appears to have a large effect size and the experiments appear sophisticated and well conducted. If anything, the authors appear to underplay the magnitude of this result.

      Answer: We appreciate this comment. Therefore, we have added relevant clarification in the revised manuscript discussion section to emphasize the importance of LBH overexpression on rescuing ED as follows: “To test our hypothesis, we utilized the diabetes-induced ED mouse model, commonly employed in various studies focusing on microvascular complications associated with type 1 diabetes. We observed that the overexpression of LBH in diabetic mice led to the restoration of reduced erectile function by enhancing neurovascular regeneration. However, this study primarily demonstrated the observed phenomenon without delving into the detailed mechanisms. Nonetheless, these results of LBH on erections provide us with new strategies for treating ED and should be of considerable concern.” (Please see revised ‘Discussion’)

      4) Mechanistic claims for rescue of ED by Lbh overexpression: The authors claim that cell type-specific effects on MPCs are responsible for the rescue of erectile function induced by Lbh overexpression. This causal claim is unsupported by the data, which only show that Lbh overexpression influences MPC performance. In vivo, it's likely that Lbh is being over expressed by diverse cell types, any of which could be the causal driver of ED rescue. In fact, the authors report rescue of cell type abundance in endothelial cells and neuronal cells. Therefore, it cannot be concluded that MPC effects alone or in principal are responsible for ED rescue.

      Answer: We agree with these claims. Therefore, we have added relevant clarifications in the discussion section of the revised manuscript. Our findings suggest that LBH can affect the function of cavernous pericytes, although we cannot definitively specify which particular cavernous cell types are affected by the overexpressed LBH, whether it be cavernous endothelial cells, smooth muscle cells, or others. Subsequent research will be required to conduct more comprehensive mechanistic investigations, such as in vitro studies using cavernous endothelial cells, smooth muscle cells, and fibroblasts to address these knowledge gaps. (Please see revised ‘Discussion’)

      5) Protein interaction data: The authors claim that CRYAB and VIM1 are novel interacting partners of LBH. However, the evidence presented (2 blots in Fig. 6A,B) lack the relevant controls. It is possible that CRYAB and VIM1 are cross-reactive with the anti-LBH antibody or were not washed out completely. The abundance of bands on the Coomassie stain in Fig. 6A suggests that either event is plausible. Therefore, the evidence presented is insufficient to support the claim that CRYAB and VIM1 are protein interacting partners of LBH.

      Answer: We agree with these claims. Therefore, we have added the relevant controls (Input) and performed Co-IP (IP: CRYAB or VIM, WB: LBH) to demonstrate that CRYAB and VIM1 are not simply cross-reactive antigens to the LBH antibody. Our results show that we can detect the expression of CRYAB and VIM after LBH IP, and we also detect the expression of LBH after CRYAB and VIM IP. In addition, it can be seen from our results that the binding of LBH to VIM is higher than that of CRYAB. Regardless, these results indicate that the binding of CRYAB or VIM to LBH is not a random phenomenon. (Please see revised ‘Result’ and ‘Figure 6B’)

      Impact: These data will trigger interest in Lbh as a target gene within the erectile dysfunction community.

      Reviewer #3 (Public Review):

      Bae et al. described the key roles of pericytes in cavernous tissues in diabetic erectile dysfunction using both mouse and human single-cell transcriptomic analysis. Erectile dysfunction (ED) is caused by dysfunction of the cavernous tissue and affects a significant proportion of men aged 40-70. The most common treatment for ED is phosphodiesterase 5 inhibitors; however, these are less effective in patients with diabetic ED. Therefore, there is an unmet need for a better understanding of the cavernous microenvironment, cell-cell communications in patients with diabetic ED, and the development of new therapeutic treatments to improve the quality of life.

      Pericytes are mesenchymal-derived mural cells that directly interact with capillary endothelial cells (ECs). They play a vital role in the pathogenesis of erectile function as their interactions with ECs are essential for penile erection. Loss of pericytes has been associated with diabetic retinopathy, cancer, and Alzheimer's disease and has been investigated in relation to the permeability of cavernous blood vessels and neurovascular regeneration in the authors' previous studies. This manuscript explores the mechanisms underlying the effect of diabetes on pericyte dysfunction in ED. Additionally, the cellular landscape of cavernous tissues and cell type-specific transcriptional changes were carefully examined using both mouse and human single-cell RNA sequencing in diabetic ED. The novelty of this work lies in the identification of a newly identified pericyte (PC)-specific marker, LBH, in mouse and human cavernous tissues, which distinguishes pericytes from smooth muscle cells. LBH not only serves as a cavernous pericyte marker, but its expression level is also reduced in diabetic conditions. The LBH-interacting proteins (Cryab and Vim) were further identified in mouse cavernous pericytes, indicating that these signaling interactions are critical for maintaining normal pericyte function. Overall, this study demonstrates the novel marker of pericytes and highlights the critical role of pericytes in diabetic ED.

      Reviewer #1 (Recommendations For The Authors):

      1) The methods are poorly written. It lacks specific information on the sample size, experimental design, and data analysis methods employed. The absence of these crucial details makes it difficult to evaluate the robustness and reliability of the findings.

      Answer: We agree with the reviewer’s suggestion and have revised the methods of our manuscript, adding detailed information and references. For sample size, we have added detailed information in the figure legends (Please see revised ‘Method’, Figure Legend, and Supplementary information.)

      2) The cell number in the scRNA-seq analysis is small (~12000) and some minor cell types are probably underrepresented. It is not clear whether the authors pooled the cells from different mice as one sample, or replicates in different groups have been included. It will be helpful to label different samples in the UMAP. The authors should repeat the experiments with more replicates to increase the cell number and validate the findings.

      Answer: We understand the reviewer's concern, but due to the small size of mouse penile tissue, we had to pool 5 corpus cavernosum tissues for each group (using pooled samples) for scRNA-seq analysis. Moreover, the unique nature of mouse penile tissue, which is highly resistant, posed challenges for the dissolution and isolation of single cells using conventional single-cell separation methods. Consequently, we had to increase the concentration of the enzyme to finally obtain 12,894 cells. Rather than conducting a repetitive scRNAseq analysis on the same mouse model, we validated our findings in human cavernous single-cell transcriptome data. This analysis allowed us to confirm the presence of pericytes in human corpus cavernosum, specific expression of LBH in human cavernous pericytes, and the identification of relevant GO terms associated with pericyte functions (Figure 5). We have added this information to ‘Method’ (Please see revised ‘Method’).

      3) Functional studies are lacking to justify how manipulating LBH expression or its interacting proteins might lead to effective therapeutic approaches for diabetic ED.

      Answer: We performed a functional study to evaluate whether LBH expression might lead to effective therapeutic approaches for diabetic ED, as shown in Figure 4G. Assessment of intracavernous pressure (ICP) is the most representative test for evaluating erectile function. Therefore, we modulated LBH expression in the penis of diabetic mice and assessed the erectile function of the mice by intracavernous pressure. However, we have not performed ICP studies and related in vitro studies (migration, survival experiment) to assess whether LBH-interacting proteins have the same effect.

      4) Although the abstract identifies novel targets for potential interventions, such as LBH and its interacting proteins, the clinical relevance of these findings remains uncertain. The authors should include a discussion regarding the translation of these discoveries into therapeutic strategies or their potential impact on patients with diabetes and ED.

      Answer: We appreciate the reviewer's suggestion and have added a discussion as per the reviewer’s recommendation (Please see revised ‘Discussion’).

      5) While the study highlights the importance of pericytes in penile erection, it fails to mention the broader context of other cell types involved in the pathogenesis of ED. Neglecting to discuss potential contributions from endothelial cells, smooth muscle cells, or neural elements limits the comprehensive understanding of the cellular interactions underlying diabetic ED.

      Answer: We agree with the reviewer's suggestion and have added a discussion regarding the significance of other cell populations in penile tissues, such as endothelial cells, smooth muscle cells, fibroblasts, and neural elements, along with the rationale for our focus on pericytes. (Please see revised ‘Discussion’).

      Reviewer #2 (Recommendations For The Authors):

      We congratulate the authors on an interesting study. We were especially excited to see their Lbh overexpression results. However, we felt other claims in the paper could benefit from additional investigation, analysis, and statistical rigor. We have provided a set of suggestions for improvement below.

      Major points:

      1) Pericyte marker gene proposal: See public review for commentary on the following suggested experiments. The authors should perform binary classification analysis using Lbh and report the performance of this gene as a marker (e.g. using the area under the receiver operating characteristic, accuracy, precision and recall). Further, they should consider performing this analysis for all other genes in their data to determine whether Lbh is the best marker gene.

      Answer: We appreciate this comment. Using the FindMarkers function, we measured AUC scores of Rgs5, Pln, Ednra, Npy1r, Atp1b2, and Gpc3, which reflect the ability of a binary classifier to distinguish pericytes from the other cell types in mouse penile tissues. Rgs5 had the highest AUC, but Rgs5 was also expressed in SMCs in our data. Pln, Ednra, Gpc3, and Npy1r also seemed to be candidate markers, but the literature search excluded these genes as they are also expressed in the SMCs of other tissues or in different cell types. The AUC score of Lbh was over 0.7, its expression in SMCs was not identified in previous studies, and ultimately, we experimentally identified that Lbh is penis pericyte specific. We have added this to the manuscript.

      Author response table 1.
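
      To make the AUC criterion used in this answer concrete, the sketch below scores a single gene as a one-feature classifier: the AUC is the probability that a randomly chosen pericyte has higher expression than a randomly chosen non-pericyte, with ties counted as one half. The function name and the expression values are hypothetical, and this is not the FindMarkers implementation.

```python
def marker_auc(expression, is_target):
    """AUC of one gene for separating target cells from all other cells.

    Computed as the fraction of (target, non-target) cell pairs in which the
    target cell has the higher expression; tied pairs count as 0.5.
    """
    pos = [x for x, t in zip(expression, is_target) if t]
    neg = [x for x, t in zip(expression, is_target) if not t]
    if not pos or not neg:
        raise ValueError("need at least one cell in each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical normalized expression of one gene in 8 cells.
expression = [2.0, 3.0, 1.0, 1.0, 0.0, 0.0, 2.0, 0.0]
# First three cells are labelled as pericytes, the rest as other cell types.
is_pericyte = [True, True, True, False, False, False, False, False]

auc = marker_auc(expression, is_pericyte)
print(round(auc, 3))
```

      An AUC of 0.5 means the gene carries no information about cell identity, while values approaching 1 indicate that high expression almost always marks the target cell type.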

      Robust differential expression analysis should also be performed for this gene (if not all) and the statistics should be reported, given known issues with the statistical approach used by the authors for differential expression (see: Squair 2021, 10.1038/s41467-021-25960-2). The authors should also report the number of cells involved in these comparisons, as the number of pericytes in the data (Fig 1B) appears quite small.

      Answer: We appreciate this comment. We used “MAST” to identify differentially expressed genes. This test is often used to find DEGs in single-cell RNA data. However, because the pseudobulk method has advantages over the single cell DEG method (Squair 2021, 10.1038/s41467-021-25960-2), we additionally performed DEG analysis with DESeq2 to confirm whether Lbh can distinguish pericytes from other cell types in the penile tissue. As a result, even when tested with DESeq2, Lbh expression was significantly higher in pericytes than in other cell types in the penile tissue (adjusted p-value = 2.694475e-07 in Pericyte vs SMC, adjusted P-value = 3.700118e-58 in Pericyte vs the other cell types). Mouse penile tissue is small in size, and the number of pericytes in mouse penile tissue is relatively small compared to fibroblasts and chondrocytes. In our mouse penile scRNAseq data, the number of pericytes is as follows: normal: 58, diabetes: 116. Despite the limited number of cells, we were able to establish statistical significance in our analyses.
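
      The aggregation step behind a pseudobulk analysis can be sketched as follows: raw counts of a gene are summed over all cells that share a sample and a cell type, and the resulting totals are what a bulk method would then model. This shows the aggregation only and does not reproduce the DESeq2 model. The sample identifiers, cell labels, and counts are hypothetical; the response does not state how samples were defined for this analysis.

```python
from collections import defaultdict


def pseudobulk(counts, samples, cell_types):
    """Sum raw counts of one gene over cells sharing a (sample, cell type) pair."""
    totals = defaultdict(int)
    for count, sample, cell_type in zip(counts, samples, cell_types):
        totals[(sample, cell_type)] += count
    return dict(totals)


# Hypothetical raw counts of one gene in six cells.
counts = [3, 0, 5, 1, 2, 4]
samples = ["normal_1", "normal_1", "normal_2", "normal_2", "diabetic_1", "diabetic_1"]
cell_types = ["pericyte", "SMC", "pericyte", "pericyte", "pericyte", "SMC"]

totals = pseudobulk(counts, samples, cell_types)
for key in sorted(totals):
    print(key, totals[key])
```

      Aggregating per sample makes the sample, rather than the individual cell, the unit of replication, which is the property that motivates pseudobulk testing over per-cell tests.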

      Immunostaining results in Fig. 2D, E should likewise be quantified. At present, it's unclear that LBH and aSMA are mutually exclusive as claimed. The authors should also investigate Lbh expression in public single cell genomics data, rather than performing candidate gene literature searches. For example, the Tabula Muris suggests Lbh is expressed widely outside pericytes.

      Answer: For Figure 2D and E, the aim of these analyses was to assess the distribution of LBH and other cellular markers to see if they overlap and if they can be distinguished. We think that some of the overlapping staining in the tissue may be caused by multilayered cellular structures, so staining within cells would be more convincing. Therefore, we quantified the percentage of LBH- or α-SMA-expressing pericytes and the relative expression in smooth muscle cells by cell staining (Supplementary Fig. 5E). We found that only 3% of smooth muscle cells expressed LBH, 67% of mouse cavernous pericytes (MCPs) expressed α-SMA, and more than 97% of MCPs expressed LBH. Therefore, these results may illustrate the specific expression of LBH in MCPs. This information was added as ‘Supplementary Fig. 5E’ (Please see revised ‘Supplementary information’). We also examined Lbh expression patterns in various mouse tissues using the public mouse single-cell atlas (Tabula Muris), and provided a detailed response in reviewer 2’s public review 1.

      Even if Lbh is not the best marker, the authors' intervention experiment still motivates study of the gene, but these analyses would help contextualize the result for readers.

      2) Statistical analyses for cell-cell communication and TF regulon analysis: See public review for context on these comments. The authors should perform statistical tests to evaluate the significance of differences detected for each of these analyses. For example, generalized linear models can be used to assess the significance of TF regulon activity scores from SCENIC, and permutation tests can be used to measure the significance of cell-cell interaction score changes. Without these statistical tests, it's challenging for a reader to interpret whether the results reported are meaningful or within the realm of experimental noise.

      Answer: We appreciate this comment. We calculated statistical significance for the TF regulon analyses as suggested by the reviewer and described a detailed statistical calculation method for cell-cell communication. We provided a detailed response in reviewer 2’s public review 2.

      3) Mechanism of ED rescue by Lbh overexpression: To support this claim, the authors would need to perform an experiment where Lbh is over expressed specifically in MPCs (using e.g. a specific promoter on their LTV construct, or a transgenic line with a cell type-specific Cre-Lox system). Absent these data, the claim should be removed.

      Answer: We agree with the reviewer's suggestion. We have reworked the claim that ‘LBH overexpression is affected by pericytes during ED recovery’ and have added clarification in the Discussion section to state clearly that LBH overexpression may affect many cavernosum cells, such as cavernous endothelial cells, smooth muscle cells, fibroblasts, and pericytes (Please see revised ‘Result’ and ‘Discussion’).

      4) Protein interaction claims: This experiment would require that the authors perform a similar pull-down with LBH KO cells and/or a reciprocal Co-IP (e.g. IP: CRYAB or VIM1, WB: LBH) to demonstrate CRYAB and VIM1 are not simply cross-reactive antigens to their LBH antibody. Further, these experiments appear to only have a single replicate for each condition. The authors should either remove associated claims, or perform a Co-IP experiment with the relevant controls with sufficient replication.

      Answer: We agree with the claims. Therefore, we have included the necessary controls (Input) and performed Co-IP (IP: CRYAB or VIM1, WB: LBH) to demonstrate that CRYAB and VIM1 are not simply cross-reactive antigens to their LBH antibody. Our results show that we can detect the expression of CRYAB and VIM after LBH IP, and we also detect the expression of LBH after CRYAB and VIM IP. In addition, it can be seen from our results that the binding of LBH to VIM is higher than that of CRYAB. Regardless, these results indicate that the binding of CRYAB or VIM to LBH is not a random phenomenon. Additionally, all IP experiments were replicated at least three times. (Please see revised ‘Result’ and ‘Figure 6B’)

      Minor Points:

      • The reference "especially in men" on line 56 seems odd given that only males can experience penile erectile dysfunction.

      Answer: We agree with the reviewer's suggestion and have removed the description 'especially male' (Please see revised ‘Introduction’)

      • Line 109, it's unclear what genes showed altered expression in Schwann cells.

      Answer: We apologize for the confusion. There were no significant differentially expressed genes between normal and diabetic conditions in Schwann cells. We revised this part in the manuscript. (Schwann cells showed an increased expression compared to normal cells in diabetes, though not significant. In Schwann cells, there were no significant DEGs between diabetic and normal cells.)

      • It would be helpful for readers to see an analysis of the cell types that are transduced in the Lbh overexpression experiment in vivo. At present, some pericyte specificity is implied, but not demonstrated.

      Answer: We appreciate this comment. Our findings suggest that LBH can affect the function of cavernous pericytes, although we cannot definitively conclude which specific cavernous cell types are affected by the overexpressed LBH, whether cavernous endothelial cells, smooth muscle cells, or others. Subsequent research will be required to conduct more comprehensive mechanistic investigations, such as in vitro studies using cavernous endothelial cells, smooth muscle cells, and fibroblasts, to address these knowledge gaps. These points were also mentioned in the manuscript.

      • To improve clarity and enhance readability, define abbreviations before their initial usage in the text. For instance, in the second paragraph of the Introduction, the abbreviation 'ECs' is used without prior definition. It can be inferred that it is referring to endothelial cells, mentioned in parentheses in the subsequent sentence.

      Answer: We agree with the reviewer's suggestion to expand acronyms and ensure that all acronyms are defined in the revised manuscript before they are used for the first time in the text (Please see revised Manuscript).

      • It is important to include relevant references that align with the content being discussed. For example, in the Introduction, pericytes are described as being involved in various processes such as angiogenesis, vasoconstriction, and permeability. The text refers to a single reference, a review by Gerhardt and Betsholtz, which primarily focuses on pericytes' role in regulating angiogenesis. Adding additional sources, such as the review by Bergers and Song (Neuro Oncol., 2005), is recommended.

      Answer: We agree with the reviewer's suggestion, and have added the reference as the reviewer recommended (Please see revised Manuscript and reference).

      • Figure 3E: it is stated that a panel of 53 angiogenesis factors was tested and that only MMP3 showed increased expression. However, various unlabeled spots appear to show changed expression patterns. It would be helpful to show a summary graph with the relative intensities of the full array of factors tested.

      Answer: We agree with the reviewer’s suggestion and now show the density of all spots in the angiogenesis array in Supplementary Table 1. We selected spots whose expression density was at least 1500 and whose change ratio was greater than 1.2. (Please see revised ‘Supplementary information’)
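
      The selection rule can be sketched as a simple filter. This is an illustration under assumptions, not the authors' script: the response does not say which condition the density threshold applies to or in which direction the ratio is taken, so the sketch applies the threshold to the higher of the two densities and computes the ratio as `condition_b / condition_a`. The factor names and values are hypothetical.

```python
def select_spots(spots, min_density=1500, min_ratio=1.2):
    """Return (name, ratio) for spots passing both selection criteria."""
    selected = []
    for name, condition_a, condition_b in spots:
        # Assumption: the density threshold applies to the stronger signal.
        if max(condition_a, condition_b) < min_density or condition_a <= 0:
            continue
        ratio = condition_b / condition_a  # assumed direction of the change ratio
        if ratio > min_ratio:
            selected.append((name, round(ratio, 2)))
    return selected


# Hypothetical spot densities (illustrative values, not from the study).
spots = [
    ("factor_A", 2000, 3000),  # passes both criteria
    ("factor_B", 500, 900),    # large ratio, but density too low
    ("factor_C", 4000, 4100),  # high density, but ratio too small
]
print(select_spots(spots))
```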

      Reviewer #3 (Recommendations For The Authors):

      Detailed statistical power calculation

      Data availability statement (were both mouse and human scRNA data deposited in GEO with a token, and when will they be released to the public?)

      Answer: Human scRNA data have been deposited in GEO under accession number GSE206528. Our mouse scRNA dataset has been uploaded to KoNA and is available for download (https://www.kobic.re.kr/kona/review?encrypt_url=amlod2FucGFya3xLQUQyMzAxMDEz)

      Major concerns about this work

      1) In the single cell RNAseq data collected for mouse diabetic ED (Fig 1B), FB are the most abundant cell population compared to PC, EC, SMC and other clusters. The rationale for studying FB clusters (in Figure 1, D-F) instead of the PC cluster is unclear. Which cluster DEG did the authors annotate for Fig 1G-H?

      Answer: We understand the reviewer's suggestion and confusion. Although other major cell populations in penile tissue such as smooth muscle cells, endothelial cells, and fibroblasts have been extensively studied, pericytes have mainly been investigated in the context of the central nervous system (CNS). For example, in the CNS, pericytes are involved in maintaining the integrity of the brain's blood-brain barrier (BBB) [PMID: 27916653], regulating blood flow at capillary junctions [PMID: 33051294], and promoting neuroinflammatory processes [PMID: 31316352], whose dysfunction is considered an important factor in the progression of vascular diseases such as Alzheimer's disease [PMID: 24946075]. But little is known about the role of pericytes in penile tissue [PMID: 35865945; PMID: 36009395; PMID: 26044953]. In order to explore the role of pericytes in repairing the corpus cavernosum vascular and neural tissues damaged by DM, we focused on pericytes, which are multipotent perivascular cells that contribute to the generation and repair of various tissues in response to injury. Although recent studies have shown that pericytes are involved in physiological mechanisms of erection, little is known about their detailed mechanisms. We have also added this rationale to the Discussion.

      No single-cell-level study has been conducted in mouse penile tissues. Therefore, before delving into pericytes, we aimed to identify overall transcriptome differences between normal and diabetic conditions in mouse penile tissues. We presented the analyses of FB, which make up the largest proportion among the cell types in the mouse penis, in Fig. 1D-F. The analysis of other cell types is provided in Supplementary Fig. 1-4. Fig. 1G-H show GO terms for the fibroblast clusters. We added this information to the figure.

      2) Fig 2 is the critical data to show Lbh is a cavernous PC specific marker. More PC violin plots to identify the PC cluster, such as Cspg4, Kcnj8, Higd1b, Cox4i2, and more SMC violin plots to identify the SMC cluster, such as Acta2, Myh11, Tagln, Actg2, should be used for inclusion and exclusion of PC (the same concern applies to human scRNAseq in Fig 5B).

      Answer: We appreciate this comment. We examined the expression of other marker genes of pericytes and SMCs. Although some marker genes were rarely expressed in the mouse penis data (Kcnj8, Higd1b), the expression of marker genes tended to be relatively high in each cluster. The expression of Cspg4 and Cox4i2 was higher in pericytes than in SMCs, while the expression of Acta2, Myh11, and Tagln was higher in SMCs than in pericytes. Actg2 was specifically expressed in SMCs. Through the gene set enrichment test as well as the expression of known cell type marker genes, we identified that the annotation of pericyte and SMC was appropriate (Fig. 2B and Fig. 5C). We added the violin plots of these marker genes in Supplementary Fig. 5.

      Author response image 1.

      (Mouse)

      In human penis data, ACTA2 and MYH11 were expressed in SMCs, pericytes, and myofibroblasts, as in the previous paper [PMID: 35879305]. Among pericyte markers, the number of cells expressing KCNJ8 and HIGD1B was small. The cluster we annotated as pericyte was double positive for pericyte markers CSPG4 and COX4I2. ACTG2, a marker for SMC, was expressed more highly in SMC than in pericytes and myofibroblasts. As in the mouse penis data, we identified that the annotation of each cell type was appropriate through the gene set enrichment test in the human penis data. We added the violin plots of CSPG4, COX4I2, and ACTG2 in Supplementary Fig. 11.

      Author response image 2.

      (Human)

      When exploring Lbh expression levels in "Database of gene expression in adult mouse brain and lung vascular and perivascular cells" from https://betsholtzlab.org/VascularSingleCells/database.html, Lbh is not uniquely expressed in PC, suggesting its tissue-specific expression level. This difference should be discussed in the Discussion section.

      Answer: We appreciate this valuable comment. For the answer to this comment, we extensively analyzed Lbh expression patterns in various mouse tissues using the public mouse single-cell atlas (Tabula Muris) as also suggested by Reviewer 2. Please see our detailed response in reviewer 2’s public review 1.

      3) In prior studies on PC morphology and location (PMID: 21839917), they reside in capillaries (diameter less than 10 µm) or distal vessels (diameter less than 25 µm) and have an oval cell body and long processes. Due to the non-specificity of Pdgfrb, SMC are positive for Pdgfrb staining (this has been shown in many publications that SMC are Pdgfrb+; unfortunately, NG2 antibody also stains for both PC and SMC). Therefore, the LBH immunostaining (in Fig 2D and 2E of large-sized vessels) is very likely of SMC identity, not PC. PC should be in close contact with CD31+ ECs in healthy conditions. The LBH immunostaining of PC in both mouse and human tissues (Fig 4) must be replaced and better characterized.

      Answer: We agree with the reviewer's suggestion. As is widely known, pericytes are primarily located in capillaries, where they surround endothelial cells of blood vessels. However, recent discoveries have identified cells with pericyte-like characteristics in the walls of large blood vessels, challenging the traditional concept [PMID: 27268036]. In our study, we observed minimal overlap in staining between LBH and α-SMA, suggesting that the cells expressing LBH were not smooth muscle cells but possibly pericyte-like cells in large vessels. In small vessels within the bladder, kidney, and even the aorta, we found LBH-expressing cells surrounding CD31-expressing vessels, consistent with the known characteristics of pericytes. Further research is needed to comprehend the differences in LBH expression and its characteristics in both large and small blood vessels. We have added discussions and references for this issue (Please see revised ‘Discussion’ and ‘Reference’).

      4) How were the mouse cavernous pericytes isolated? What is their purity?

      Answer: As the reviewer points out, we isolated mouse cavernous pericytes following our and other previously published methods. We used pigment epithelium-derived factor (PEDF), which removes non-pericytic cells [PMID: 30929324, 23493068]. Although we have no purity analysis such as FACS, other staining results support the notion that this method yields pericytes with a notably high level of purity. (Please see ‘Method’ section).

      5) Can mouse scRNAseq cell-cell communication in Fig 3 be reproducible in human scRNAseq cell-cell communication? The results in human ED are more clinically significant than in mouse data.

      Answer: In human scRNAseq data, the difference in angiogenesis-related interactions between normal and diabetic conditions was not as significant as that in mouse data. Because the cell type composition of the human and mouse penis is not completely identical, there are limitations in comparing cell-cell interactions. However, in the human penis data, some interactions related to angiogenesis between pericytes and other cell types were decreased in diabetes compared to normal (boxed parts).

      Author response image 3.

      6) Fibroblasts also express Vim. Murine PC VIM/CRYAB (should be written as Vim/Cryab as mouse proteins) direct interaction with Lbh is unclear from Lbh IP as Fig 6A red boxes showed a wide range of sizes. Where is the band for Lbh? Do human PC LBH interact with VIM/CRYAB?

      Answer: We agree with the reviewer's comment. VIM is a type III intermediate filament protein expressed in many cell types. We have added the relevant controls (Input) and performed Co-IP (IP: CRYAB or VIM, WB: LBH) to demonstrate that CRYAB and VIM are not simply cross-reactive antigens to their LBH antibody. In the western blot study, the LBH band was detected between 35 kDa and 48 kDa. From Figure 6A, we detected CRYAB in band 1 and VIM in bands 2 and 3. This may be due to the formation of dimers or multimers by VIM. We did not use human PCs for IP studies because IP requires large amounts of protein, making IP studies using human pericytes challenging. Nevertheless, the interaction between LBH and CRYAB in humans has been reported through fluorescent resonance energy transfer assay and affinity chromatography technology assay [PMID:34000384, PMID:20587334].

      7) In Fig 6H and I, why does CRYAB expression significantly reduce in vitro and in vivo under diabetic conditions, whereas VIM expression significantly increases?

      Answer: As the reviewer pointed out, and as we have discussed in the manuscript, CRYAB is known to promote angiogenesis. Diabetes reduces CRYAB expression, so angiogenesis may be impaired. Furthermore, since VIM is a multifunctional protein, it interacts with several other proteins with multiple functions under various pathophysiological conditions. Many reports show that VIM expression is increased under diabetic conditions [PMID: 28348116 and PMID: 32557212], and VIM deficiency protects against obesity and insulin resistance in patients with type 2 diabetes. Therefore, we hypothesize that exogenous LBH may bind to the increased VIM in diabetic conditions and inactivate the effects of VIM, thereby achieving the protective effect. This needs to be proved in further studies.

      8) The therapeutic strategies targeting (Lbh-Cryab-Vim) in the mouse diabetic ED model are not investigated and need to be further validated and discussed.

      Answer: As the reviewers pointed out, in this study, we did not evaluate the targeted therapeutic strategy for LBH-CRYAB-VIM in a mouse diabetic ED model. We only identified the binding potential of these three proteins. Evaluation of this treatment strategy requires further study. For example, we can employ shRNA lentivirus, either alone or in combination, to downregulate CRYAB expression [PMID: 31612679] in normal mice, utilize a lentiviral vector CMV-GFP-puro-vimentin to overexpress Vimentin [PMID: 36912679], and then treat with LBH to evaluate whether the LBH effect still exists (in vivo erectile function study and in vitro angiogenesis assay). We include this information in the Discussion section as a limitation of this study (Please see revised ‘Discussion’).

      9) The Discussion of current knowledge of pericytes in diabetic ED and other diseases and the significance of this study as well as clinical implications, should be expanded.

      Answer: As the reviewers pointed out, we have expanded the current knowledge of pericytes in diabetic ED and other diseases (CNS disease) and clinical implications as follows: “Although other major cell populations in penile tissue such as smooth muscle cells, endothelial cells, and fibroblasts have been extensively studied, pericytes have mainly been investigated in the context of the central nervous system (CNS). For example, in the CNS, pericytes are involved in maintaining the integrity of the brain's blood-brain barrier (BBB), regulating blood flow at capillary junctions, and promoting neuroinflammatory processes, whose dysfunction is considered an important factor in the progression of vascular diseases such as Alzheimer's disease. But little is known about the role of pericytes in penile tissue.” (Please see revised ‘Discussion’).

      10) How many clinical samples were used? How many times was each experiment repeated?

      Answer: As the reviewers pointed out, the clinical samples’ information was added in the ‘method’ section. A total of four human samples were used in this study (‘human corpus cavernosum tissues were obtained from two patients with congenital penile curvature (59-year-old and 47-year-old) who had normal erectile function during reconstructive penile surgery and two patients with diabetic ED (69-year-old and 56-year-old) during penile prosthesis implantation.’). For in vivo study, we quantified four different fields from human samples.

      Minor concerns

      1) Fig 1A: why is the normal mouse's body size the same as that of the DM mouse?

      Answer: As the reviewer pointed out, in Figure 1A, while the size of normal mice and DM mice may not appear significantly different, there are indeed notable differences in body weight and size. The body weight of the normal mice we used was about 30 grams, while that of the DM mice was generally less than 24 grams. We found that we had omitted information on physiological and metabolic parameters from the in vivo studies (ICP function study). Therefore, we have added it in Supplementary Table 2 (Please see revised ‘Supplementary information’).

      2) The labels and the negative and positive controls for Fig 6B are missing.

      Answer: Thank you for pointing this out. We have added the relevant controls (Input) and performed Co-IP (IP: CRYAB or VIM1, WB: LBH) to demonstrate that CRYAB and VIM1 are not simply cross-reactive antigens to their LBH antibody, and all IP experiments were replicated at least 3 times. (Please see revised ‘Result’ and ‘Figure 6B’)

      3) The limitation of this study and future work should be discussed.

      Answer: As the reviewer pointed out, we have added the limitation of this study and future direction in the discussion section (Please see revised ‘Discussion’).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The authors report an fMRI investigation of the neural mechanisms by which selective attention allows capacity-limited perceptual systems to preferentially represent task-relevant visual stimuli. Specifically, they examine competitive interactions between two simultaneously-presented items from different categories, to reveal how task-directed attention to one of them modulates the activity of brain regions that respond to both. The specific hypothesis is that attention will bias responses to be more like those elicited by the relevant object presented on its own, and further that this modulation will be stronger for more dissimilar stimulus pairs. This pattern was confirmed in univariate analyses that measured the mass response of a priori regions of interest, as well as multivariate analyses that considered the patterns of evoked activity within the same regions. The authors follow these neuroimaging results with a simulation study that favours a "tuning" mechanism of attention (enhanced responses to highly effective stimuli, and suppression for ineffective stimuli) to explain this pattern.

      Strengths:

      The manuscript clearly articulates a core issue in the cognitive neuroscience of attention, namely the need to understand how limited perceptual systems cope with complex environments in the service of the observer's goals. The use of a priori regions of interest, and the inclusion of both univariate and multivariate analyses as well as a simple model, are further strengths. The authors carefully derive clear indices of attentional effects (for both univariate and multivariate analyses) which makes explication of their findings easy to follow.

      Weaknesses:

      There are some relatively minor weaknesses in presentation, where the motivation behind some of the procedural decisions could be clearer. There are some apparently paradoxical findings reported -- namely, cases in which the univariate response to pairs of stimuli is greater than to the preferred stimulus alone -- that are not addressed. It is possible that some of the main findings may be attributable to range effects: notwithstanding the paradox just noted, it seems that a floor effect should minimise the range of possible attentional modulation of the responses to two highly similar stimuli. One possible limitation of the modelled results is that they do not reveal any attentional modulation at all under the assumptions of the gain model, for any pair of conditions, implying that as implemented the model may not be correctly capturing the assumptions of that hypothesis.

      We thank the reviewer for the constructive comments. In response, in the current version of the manuscript we have improved the presentation. We further discuss how the response in paired conditions is in some cases higher than the response to the preferred stimulus in this letter. For this, we provide a vector illustration, and a supplementary figure of the sum of weights to show that the weights of isolated-stimulus responses for each category pair are not bound to the similarity of the two isolated responses.

      Regarding the simulation results, we have clarified that the univariate effect of attention is not the attentional modulation itself, but the change in the amount of attentional modulation in the two paired conditions. We provide an explanation for this in this letter below, and have changed the term “attentional modulation” to “univariate shift” in the manuscript to avoid the confusion.

      Reviewer #2 (Public Review):

      Summary:

      In an fMRI study requiring participants to attend to one or another object category, either when the object was presented in isolation or with another object superimposed, the authors compared measured univariate and multivariate activation from object-selective and early visual cortex to predictions derived from response gain and tuning sharpening models. They observed a consistent result across higher-level visual cortex that more-divergent responses to isolated stimuli from category pairs predicted a greater modulation by attention when attending to a single stimulus from the category pair presented simultaneously, and argue via simulations that this must be explained by tuning sharpening for object categories.

      Strengths:

      - Interesting experiment design & approach - testing how category similarity impacts neural modulations induced by attention is an important question, and the experimental approach is principled and clever.

      - Examination of both univariate and multivariate signals is an important analysis strategy.

      - The acquired dataset will be useful for future modeling studies.

      Weaknesses:

      - The experimental design does not allow for a neutral 'baseline' estimate of neural responses to stimulus categories absent attention (e.g., attend fixation), nor of the combination of the stimulus categories. This seems critical for interpreting results (e.g., how should readers understand univariate results like that plotted in Fig. 4C-D, where the univariate response is greater for 2 stimuli than one, but the analyses are based on a shift between each extreme activation level?).

      We are happy to clarify our research rationale. We aimed to compare responses in paired conditions when the stimuli were kept constant while varying the attentional target. After we showed that the change in the attentional target resulted in a response change, we compared the amount of this response change across different stimulus category pairs to investigate the effect of representation similarity between the target and the distractor on the response modulation caused by attentional shift. While an estimate of the neural responses in the absence of attention might be useful for other modeling studies, it would not provide us with more information than the current data to answer the question of this study.

      Regarding the univariate results in Fig. 4C-D (and other equivalent ROI results in the revised version) and our analyses, we did not impose any limit on the estimated weights of the two isolated responses in the paired response, and thus the sum of the two weights could be any number. We see, however, that the name “weighted average”, which implies a sum of weights capped at one, has been misleading. We have now changed the name of this model to “linear combination” to avoid confusion.

      Previous studies (Reddy et al., 2009, Doostani et al., 2023) using a similar approach have shown a related results pattern: the response to multiple stimuli is higher than the average, but lower than the sum of the isolated responses, which is exactly what our results suggest. We have added discussion on this topic in the Results section in lines 409-413 for clarification:

      “Note that the response in paired conditions can be higher or lower than the response to the isolated more preferred stimulus (condition Mat), depending on the voxel response to the two presented stimuli, as previously reported (Doostani et al. 2023). This is consistent with previous studies reporting the response to multiple stimuli to be higher than the average, but lower than the sum of the response to isolated stimuli (Reddy et al. 2009).”
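
      To make the distinction between a weighted average and an unconstrained linear combination concrete, a minimal sketch follows. It is an illustration and not the analysis code: the function name `fit_linear_combination`, the ordinary least-squares fit without an intercept, and the toy voxel patterns are assumptions.

```python
def fit_linear_combination(iso_1, iso_2, paired):
    """Least-squares weights (w1, w2) such that paired ~ w1*iso_1 + w2*iso_2.

    The weights are unconstrained, so their sum may exceed one.
    """
    # Entries of the 2x2 normal equations.
    a = sum(x * x for x in iso_1)
    b = sum(x * y for x, y in zip(iso_1, iso_2))
    c = sum(y * y for y in iso_2)
    d = sum(x * p for x, p in zip(iso_1, paired))
    e = sum(y * p for y, p in zip(iso_2, paired))
    det = a * c - b * b
    if det == 0:
        raise ValueError("isolated patterns are collinear; weights are not unique")
    w1 = (d * c - b * e) / det
    w2 = (a * e - b * d) / det
    return w1, w2


# Toy voxel patterns (illustrative values, not from the study).
iso_1 = [1.0, 2.0, 3.0, 4.0]
iso_2 = [2.0, 1.0, 0.0, 3.0]
# A paired response above the average but below the sum of the isolated responses.
paired = [0.8 * x + 0.5 * y for x, y in zip(iso_1, iso_2)]
w1, w2 = fit_linear_combination(iso_1, iso_2, paired)
print(f"w1 = {w1:.2f}, w2 = {w2:.2f}, sum = {w1 + w2:.2f}")
```

      In this toy case the recovered weights sum to more than one, which a weighted average would rule out but a linear combination allows.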

      We are not sure what the reviewer means by “each extreme activation level”. Our analyses are based on all four conditions. The two isolated conditions are used to calculate the distance measures and the two paired conditions are used for calculating the shift index. Please note that either the isolated or the paired conditions could show the highest response and we see both cases in our data. For example, as shown in Figure 4A in EBA, the isolated Body condition and the paired BodyatCar condition show the highest activation levels for the Body-Car pair, whereas in Figure 4C, the two paired conditions (BodyatCat and BodyCatat) elicit the highest response.

      - Related, simulations assume there exists some non-attended baseline state of each individual object representation, yet this isn't measured, and the way it's inferred to drive the simulations isn't clearly described.

      We agree that the simulations assume a non-attended baseline state, and that we did not measure that state empirically. We needed this non-attended response in the simulations to test which attention mechanism led to the observed results. Thus, we generated the non-attended response using the data reported in previous neural studies of object recognition and attention in the visual cortex (Ni et al., 2012, Bao and Tsao, 2018). Note that the simulations are checking for the profile of the modulations based on category distance. Thus, they do not need to exactly match the real isolated responses in order to show the effect of gain and tuning shift on the results. We include the clarification and the range of neural responses and attention parameters used in the simulations in the revised manuscript in lines 327-333:

      “To examine which attentional mechanism leads to the effects observed in the empirical data, we generated the neural response to unattended object stimuli as a baseline response in the absence of attention, using the data reported by neural studies of object recognition in the visual cortex (Ni et al., 2012, Bao and Tsao, 2018). Then, using an attention parameter for each neuron and different attentional mechanisms, we simulated the response of each neuron to the different task conditions in our experiment. Finally, we assessed the population response by averaging neural responses.”

      - Some of the simulation results seem to be algebraic (univariate; Fig. 7; multivariate, gain model; Fig. 8)

      This is correct. We have used algebraic equations for the effect of attention on neural responses in the simulations. In fact, thinking about the two models of gain and tuning shift leads to the algebraic equations, which in turn logically leads to the observed results, if no noise is added to the data. The simulations are helpful for visualizing these logical conclusions. Also, after assigning different noise levels to each condition for each neuron, the results are not algebraic anymore which is shown in updated Figure 7 and Figure 8.

      - Cross-validation does not seem to be employed - strong/weak categories seem to be assigned based on the same data used for computing DVs of interest - to minimize the potential for circularity in analyses, it would be better to define preferred categories using separate data from that used to quantify - perhaps using a cross-validation scheme? This appears to be implemented in Reddy et al. (2009), a paper implementing a similar multivariate method and cited by the authors (their ref 6).

      Thank you for pointing out the missing details about how we used cross-validation. In the univariate analysis, we did use cross validation, defining preferred categories and calculating category distance on one half of the data and calculating the univariate shift on the other half of the data. Similarly, we employed cross-validation for the multivariate analysis by using one half of the data to calculate the multivariate distance between category pairs, and the other half of the data to calculate the weight shift for each category pair. We have now added this methodological information in the revised manuscript.
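
      The split-half logic can be sketched as follows. This is a hypothetical illustration: the dictionary keys, the function name `split_half_shift`, and the definition of the shift as a simple difference between the two paired conditions are assumptions, since the exact index is defined in the manuscript rather than in this letter.

```python
def split_half_shift(selection_half, test_half):
    """Define the preferred category on one half, measure the shift on the other."""
    # Step 1 (selection half only): preferred category and category distance.
    if selection_half["iso_1"] >= selection_half["iso_2"]:
        preferred, other = "1", "2"
    else:
        preferred, other = "2", "1"
    distance = abs(selection_half["iso_1"] - selection_half["iso_2"])

    # Step 2 (held-out half only): change in the paired response when attention
    # moves from the less preferred to the preferred category (assumed index).
    shift = test_half["pair_att" + preferred] - test_half["pair_att" + other]
    return preferred, distance, shift


# Toy mean responses per half (illustrative values, not from the study).
half_a = {"iso_1": 2.0, "iso_2": 1.0, "pair_att1": 1.9, "pair_att2": 1.5}
half_b = {"iso_1": 2.1, "iso_2": 0.9, "pair_att1": 1.8, "pair_att2": 1.4}
preferred, distance, shift = split_half_shift(half_a, half_b)
print(preferred, distance, round(shift, 2))
```

      Because the preferred category is never chosen from the data used to compute the shift, noise cannot inflate both quantities at once.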

      - Multivariate distance metric - why is correlation/cosine similarity used instead of something like Euclidean or Mahalanobis distance? Correlation/cosine similarity is scale-invariant, so changes in the magnitude of the vector would not change distance, despite this likely being an important data attribute to consider.

      Since we are considering response patterns as vectors in each ROI, there is no major difference between the two measures of similarity. Using Euclidean distance as a measure of distance (i.e. the inverse of similarity), we observed the same relationship between weight shift and Euclidean category distance. There was a positive correlation between weight shift and the Euclidean category distance in all ROIs (ps < 0.01, ts > 2.9) except for V1 (p = 0.5, t = 0.66). We include this information in the revised manuscript in the Results section, lines 513-515:

      “We also calculated category distance based on the Euclidean distance between response patterns of category pairs and observed a similarly positive correlation between the weight shift and the Euclidean category distance in all ROIs (ps < 0.01, ts > 2.9) except V1 (p = 0.5, t = 0.66).”
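
      The reviewer's point about scale invariance can be checked with a short sketch. The vectors are toy values and the helper names are chosen for this example only.

```python
import math


def cosine_distance(u, v):
    """One minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy response patterns (illustrative values, not from the study).
pattern_a = [1.0, 2.0, 3.0]
pattern_b = [2.0, 1.0, 0.5]
scaled_b = [3.0 * x for x in pattern_b]  # same pattern, larger magnitude

print(cosine_distance(pattern_a, pattern_b), cosine_distance(pattern_a, scaled_b))
print(euclidean_distance(pattern_a, pattern_b), euclidean_distance(pattern_a, scaled_b))
```

      Scaling one pattern leaves the cosine distance unchanged but alters the Euclidean distance, which is why reporting both measures addresses the concern.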

      - Details about simulations implemented (and their algebraic results in some cases) make it challenging to interpret or understand these results. E.g., the noise properties of the simulated data aren't disclosed, nor are precise (or approximate) values used for simulating attentional modulations.

      We clarify that the average response to each category was based on previous neurophysiology studies (Ni et al., 2012, Bao and Tsao, 2018). The attentional parameter was also chosen based on previous neurophysiology (Ni et al., 2012) and human fMRI (Doostani et al., 2023) studies of visual attention by randomly assigning a value in the range from 1 to 10. We have included the details in the Methods section in lines 357-366:

      “We simulated the action of the response gain model and the tuning sharpening model using numerical simulations. We composed a neural population of 4 × 10⁵ neurons in equal proportions body-, car-, cat- or house-selective. Each neuron also responded to object categories other than its preferred category, but to a lesser degree and with variation. We chose neural responses to each stimulus from a normal distribution with a mean of 30 spikes/s and a standard deviation of 10, and each neuron was randomly assigned an attention factor in the range between 1 and 10 using a uniform distribution. These values are comparable with the values reported in neural studies of attention and object recognition in the ventral visual cortex (Ni et al. 2012, Bao and Tsao 2018). We also added Poisson noise to the response of each neuron (Britten et al. 1993), assigned randomly for each condition of each neuron.”
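The population parameters listed in this passage can be drawn as in the sketch below. It is a hedged illustration only: it uses a much smaller population (4000 instead of 4 × 10⁵), draws just the preferred-category response because the passage does not specify how responses to non-preferred categories were set, clips negative rates at zero, and treats the noisy response as a Poisson draw around the mean rate. All names are hypothetical, and neither the response gain nor the tuning sharpening model is applied here.

```python
import math
import random

CATEGORIES = ["body", "car", "cat", "house"]

def poisson_sample(rng, rate):
    """Draw one Poisson variate with Knuth's multiplication method (adequate for modest rates)."""
    if rate <= 0:
        return 0
    threshold = math.exp(-rate)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return count
        count += 1

def simulate_population(n_neurons, seed=0):
    """Assign each neuron a preferred category, a mean rate, an attention factor, and a noisy rate."""
    rng = random.Random(seed)
    population = []
    for index in range(n_neurons):
        mean_rate = max(0.0, rng.gauss(30.0, 10.0))           # spikes/s, clipped at zero (assumption)
        population.append({
            "preferred": CATEGORIES[index % len(CATEGORIES)],  # equal proportions per category
            "mean_rate": mean_rate,
            "attention_factor": rng.uniform(1.0, 10.0),        # uniform between 1 and 10
            "noisy_rate": poisson_sample(rng, mean_rate),      # Poisson noise around the mean rate
        })
    return population

population = simulate_population(4000)
counts = {c: sum(1 for n in population if n["preferred"] == c) for c in CATEGORIES}
print(counts)
```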

      - Eye movements do not seem to be controlled nor measured. Could it be possible that some stimulus pairs result in more discriminable patterns of eye movements? Could this be ruled out by some aspect of the results?

      Subjects were instructed to direct their gaze towards the fixation point. Given the variation in the pose and orientation of the stimuli, it is unlikely that eye movements would help with the task. Eye movements have been controlled in previous experiments with individual stimulus presentation (Xu and Vaziri-Pashkam, 2019) and across attentional tasks in which colored dots were superimposed on the stimuli (Vaziri-Pashkam and Xu, 2017) and no significant difference for eye movement across categories or conditions was observed. As such, we do not think that eye movements would play a role in the results we are observing here.

      - A central, and untested/verified, assumption is that the multivariate activation pattern associated with 2 overlapping stimuli (with one attended) can be modeled as a weighted combination of the activation pattern associated with the individual stimuli. There are hints in the univariate data (e.g., Fig. 4C; 4D) that this might not be justified, which somewhat calls into question the interpretability of the multivariate results.

      If the reviewer is referring to the higher response in the paired compared to the isolated conditions, as explained above, we have not forced the sum of the estimated weights to equal 1 or 2. Therefore, our model is an estimation of a linear combination of the two multivariate patterns in the isolated conditions. In fact, Reddy et al. (reference 6) reported that while the combination is closer to a weighted average than to a weighted sum, the sum of the weights is on average larger than 1. In Figure 4C and 4D the responses in the paired conditions are higher than either of the isolated-condition responses. This suggests that the weights for the linear combination of isolated responses in the multivariate analysis should add up to more than one. This is what we find in our results. We have added a supplementary figure to Figure 6, depicting the sum of weights for different category pairs in all ROIs. The figure illustrates that in each ROI, the sum of weights is greater than 1 for some category pairs. It is however noteworthy that we normalized the weights in each condition by the sum of weights to calculate the weight shift in our analysis. The amount of the weight shift was therefore not affected by the absolute value of the weights.

      - Throughout the manuscript, the authors consistently refer to "tuning sharpening", an idea that's almost always used to reference changes in the width of tuning curves for specific feature dimensions (e.g., motion direction; hue; orientation; spatial position). Here, the authors are assaying tuning to the category (across exemplars of the category). The link between these concepts could be strengthened to improve the clarity of the manuscript.

      The reviewer brings up an excellent point. Whereas tuning curves have been extensively used for feature dimensions such as stimulus orientation or motion direction, here, we used the term to describe the variation in a neuron’s response to different object stimuli.

      With a finite set of object categories, as is the case in the current study, the neural response in object space is discrete, rather than a continuous curve illustrated for features such as stimulus orientation. However, since more preferred and less preferred features (objects in this case) can still be defined, we illustrated the neural response using a hypothetical curve in object space in Figure 3 to show how it relates with other stimulus features. Therefore, here, tuning sharpening refers to the fact that the response to the more preferred object categories has been enhanced while the response to the less preferred stimulus categories is suppressed.

      We clarify this point in the revised manuscript in the Discussion section lines 649-659:

      “While tuning curves are commonly used for feature dimensions such as stimulus orientation or motion direction, here, we used the term to describe the variation in a neuron’s response to different object stimuli. With a finite set of object categories, as is the case in the current study, the neural response in object space is discrete, rather than a continuous curve illustrated for features such as stimulus orientation. The neuron might have tuning for a particular feature such as curvature or spikiness (Bao et al., 2020) that is present to different degrees in our object stimuli in a continuous way, but we are not measuring this directly. Nevertheless, since more preferred and less preferred features (objects in this case) can still be defined, we illustrate the neural response using a hypothetical curve in object space. As such, here, tuning sharpening refers to the fact that the response to the more preferred object categories has been enhanced while the response to the less preferred stimulus categories is suppressed.”

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      a. The authors should address the apparent paradox noted above (and report whether it is seen in other regions of interest as well). On what model would the response to any pair of stimuli exceed that of the response to the preferred stimulus alone? This implies some kind of Gestalt interaction whereby the combined pair generates a percept that is even more effective for the voxels in question than the "most preferred" one?

      The response to a pair of stimuli can exceed the response to each of the stimuli presented in isolation if the voxel is responsive to both stimuli and as long as the voxel has not reached its saturation level. This phenomenon has been reported in many previous studies (Zoccolan et al., 2005, Reddy et al., 2009, Ni et al., 2012, Doostani et al., 2023) and can be modeled using a linear combination model which does not limit the weights of the isolated responses to equal 1 (Doostani et al., 2023). Note that the “most preferred” stimulus does not necessarily saturate the voxel response, thus the response to two stimuli could be more effective based on voxel responsiveness to the second stimulus.

      As for the current study, the labels “more preferred” and “less preferred” are only relatively defined (as explained in the Methods section), meaning that the more preferred stimulus is not necessarily the most preferred stimulus for the voxels. Furthermore, the presented stimuli are semi-transparent and presented at low contrast, which moves the responses further away from the saturation level. Based on reported evidence for multiple-stimulus responses, responses to single stimuli are in many cases sublinearly added to yield the multiple-stimulus response (Zoccolan et al., 2005, Reddy et al., 2009, Doostani et al., 2023). This means that the multiple-stimulus response is lower than the sum of the isolated responses, but not necessarily lower than each of the isolated responses. Therefore, it is not paradoxical to observe higher responses in paired conditions compared to the isolated conditions. We observe similar results in other ROIs, which we provide as supplementary figures to Figure 4 in the revised manuscript.

      We address this observation and similar reports in previous studies in the Results section of the revised manuscript in lines 409-413:

      “Note that the response in paired conditions can be higher or lower than the response to the isolated more preferred stimulus (condition Mat), depending on the voxel preference for the two presented stimuli, as previously reported (Doostani et al., 2023). This is consistent with previous studies reporting the response to multiple stimuli to be higher than the average, but lower than the sum of the response to isolated stimuli (Reddy et al., 2009).”

      b. Paradox aside, I wondered to what extent the results are in part explained by range limits. Take two categories that evoke a highly similar response (either mean over a full ROI, or in the multivariate sense). That imposes a range limit such that attentional modulation, if it works the way we think it does, could only move responses within that narrow range. In contrast, the starting point for two highly dissimilar categories leaves room in principle for more modulation.

      We do not believe that the results can be explained by range limits because responses in paired conditions are not limited by the isolated responses, as can be observed in Figure 4. However, to rule out the possibility of the similarity between responses in isolated conditions affecting the range within which responses in paired conditions can change, we turned to the multivariate analysis. We used the weight shift measure as the change in the weight of each stimulus with the change in the attentional target. In this method, no matter how close the two isolated vectors are, the response to the pair could still have a whole range of different weights of the isolated responses. We have plotted an example illustration of two-dimensional vectors for better clarification. Here, the vectors Vxat and Vyat denote the responses to the isolated x and y stimuli, respectively, and the vector Pxaty denotes the response to the paired condition in which stimulus x is attended. The weights a1 and a2 are illustrated in the figure, which are equal to regression coefficients if we solve the equation Pxaty = [a1 a2] [x y]’. While the weight values depend on the amplitude of and the angle between the three vectors, they are not limited by a lower angle between Vxat and Vyat.
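The claim that the weights are not bounded by the angle between the isolated vectors can be checked with a small sketch. It assumes ordinary least squares without an intercept term and uses made-up three-voxel patterns; the function names are hypothetical and Equation 4 of the manuscript may differ in detail.

```python
def dot(u, v):
    """Inner product of two equal-length patterns."""
    return sum(a * b for a, b in zip(u, v))

def combination_weights(paired, iso_x, iso_y):
    """Least-squares weights (a1, a2) such that paired is approximately a1*iso_x + a2*iso_y.

    Solving the 2x2 normal equations is equivalent to projecting `paired` onto the
    plane spanned by the two isolated patterns and reading off its coordinates.
    """
    xx, yy, xy = dot(iso_x, iso_x), dot(iso_y, iso_y), dot(iso_x, iso_y)
    px, py = dot(paired, iso_x), dot(paired, iso_y)
    determinant = xx * yy - xy * xy
    if determinant == 0:
        raise ValueError("isolated patterns are collinear; weights are not unique")
    a1 = (px * yy - py * xy) / determinant
    a2 = (py * xx - px * xy) / determinant
    return a1, a2

# Two isolated patterns separated by only a small angle (toy values).
iso_x = [1.0, 0.9, 0.2]
iso_y = [0.9, 1.0, 0.3]
# A paired pattern built with weights 0.8 and 0.5, which sum to more than 1.
paired = [0.8 * a + 0.5 * b for a, b in zip(iso_x, iso_y)]

a1, a2 = combination_weights(paired, iso_x, iso_y)
print(round(a1, 3), round(a2, 3), round(a1 + a2, 3))
```

Even though the two isolated patterns are nearly parallel, the fit recovers unequal weights whose sum exceeds 1, so similarity between the isolated patterns does not by itself restrict the weights. In practice, very similar patterns would make the estimate more sensitive to noise, which this noiseless toy case does not show.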

      We have updated Figure 2 in the manuscript to avoid the confusion. We have also added a figure including the sum of weights for different category pairs in different regions, showing that the sum of weights are not dependent on the similarity between the two stimuli. The conclusions based on the weight shift are therefore not confounded by the similarity between the two stimuli.

      c. Finally, related to the previous point, while including V1 is a good control, I wonder if it is getting a "fair" test here, because the range of responses to the four categories in this region, in terms of (dis)similarity, seems compressed relative to the other categories.

      We believe that V1 is getting a fair test because the single-subject range of category distance in V1 is similar to LO, as can be observed in Author response image 1:

      Author response image 1.

      Range of category distance in each ROI averaged across participants

      The reason that V1 is showing a more compressed distance range on the average plot is that the category distance in V1 is not consistent among participants. Although the average plots are shown in Figure 5 and Figure 6, we tested statistical significance in each ROI based on single-subject correlation coefficients.

      Please also note that a more compressed range of dissimilarity does not necessarily lead to a less strong effect of category distance on the effect of attention. For instance, while LO shows a more compressed dissimilarity range for the presented categories compared to the other object selective regions, it shows the highest correlation between weight shift and category distance. Furthermore, as illustrated in Figure 5, no significant correlation is observed between univariate shift and category distance in V1, even though the range of the univariate distance in V1 is similar to LO and pFs, where we observed a significant correlation between category distance and univariate shift.

      d. In general, the manuscript does a very good job explaining the methods of the study in a way that would allow replication. In some places, the authors could be clearer about the reasoning behind those methodological choices. For example: - How was the sample size determined?

      Estimating conservatively based on the smallest amount of attentional modulation we observed in a previous study (Doostani et al., 2023), we chose a medium effect size (0.3). For a power of 0.8, the minimum number of participants should be 16. We have added the explanation to the Methods section in lines 78-81:

      “We estimated the number of participants conservatively based on the smallest amount of attentional modulation observed in our previous study (Doostani et al., 2023). For a medium effect size of 0.3 and a power of 0.8, we needed a minimum number of 16 participants.”

      - Why did the authors choose those four categories? What was the evidence that would suggest these would span the range of similarities needed here?

      We chose these four categories based on a previous behavioral study reporting the average reaction time of participants when detecting a target from one category among distractors from another category (Xu and Vaziri-Pashkam, 2019). Ideally the experiment should include as many object categories as possible. However, since we were limited by the duration of the experiment, the number of conditions had to be controlled, leading to a maximum of 4 object categories. We chose two animate and two inanimate object categories to include categories that are more similar and more different based on previous behavioral results (Xu and Vaziri-Pashkam, 2019). We included body and house categories because they are both among the categories to which highly responsive regions exist in the cortex. We chose the two remaining categories based on their similarity to body and house stimuli. In this way, for each category there was another category that elicited similar cortical responses, and two categories that elicited different responses. While we acknowledge that the chosen categories do not fully span the range of similarities, they provide an observable variety of similarities in different ROIs which we find acceptable for the purposes of our study.

      We include this information in the Methods section of the revised manuscript in lines 89-94:

      “We included body and house categories because there are regions in the brain that are highly responsive and unresponsive to each of these categories, which provided us with a range of responsiveness in the visual cortex. We chose the two remaining categories based on previous behavioral results to include categories that provided us with a range of similarities (Xu and Vaziri-Pashkam, 2019). Thus, for each category there was a range of responsiveness in the brain and a range of similarity with the other categories.”

      - Why did the authors present the stimuli at the same location? This procedure has been adopted in previous studies, but of course, it does also move the stimulus situation away from the real-world examples of cluttered scenes that motivate the Introduction.

      We presented the stimuli at the same location because we aimed to study the mechanism of object-based attention and this experimental design helped us isolate it from spatial attention. We do not think that our design moves the stimulus situation away from real-world examples in such a way that our results are not generalizable. We include real-world instances, as well as a discussion on this point, in the Discussion section of the revised manuscript, in lines 611-620:

      “Although examples of superimposed cluttered stimuli are not very common in everyday life, they still do occur in certain situations, for example reading text on the cellphone screen in the presence of reflection and glare on the screen or looking at the street through a patterned window. Such instances recruit object-based attention which was the aim of this study, whereas in more common cases in which attended and unattended objects occupy different locations in space, both space-based and object-based attention may work together to resolve the competition between different stimuli. Here we chose to move away from usual everyday scenarios to study the effect of object-based attention in isolation. Future studies can reveal the effect of target-distractor similarity, i.e. proximity in space, on space-based attention and how the effects caused by object-based and space-based attention interact.”

      - While I'm not concerned about this (all relevant comparisons were within-participants) was there an initial attempt to compare data quality from the two different scanners?

      We compared the SNR values of the two groups of participants and observed no significant difference between these values (ps > 0.34, ts < 0.97). We have added this information to the Methods section.

      Regarding the observed effect, we performed a t-test between the results of the participants from the two scanners. For the univariate results, the observed correlation between univariate attentional modulation and category distance was not significantly different for participants of the two scanners in any ROIs (ps > 0.07, ts < 1.9). For the multivariate results, the observed correlation between the weight shift and multivariate category distance was not significantly different in any ROIs (ps > 0.48, ts < 0.71) except for V1 (p = 0.015, t = 2.75).

      We include a sentence about the comparison of the SNR values in the preprocessing section in the revised manuscript.

      e. There are a couple of analysis steps that could be applied to the existing data that might strengthen the findings. For one, the authors have adopted a liberal criterion of p < 0.001 uncorrected to include voxels within each ROI. Why, and to what extent is the general pattern of findings robust over more selective thresholds? Also, there are additional regions that are selective for bodies (fusiform body area) and scenes (occipital place area and retrosplenial cortex). Including these areas might provide more diversity of selectivity patterns (e.g. different responses to non-preferred categories) that would provide further tests of the hypothesis.

      We selected this threshold to allow for selection of a reasonable number of voxels in each hemisphere across all participants. To check whether the effect is robust over more selective thresholds, we exemplarily redefined the left EBA region using p < 0.0001 and p < 0.00001 and observed that the weight shift effect remained equivalent. We have made a note of this analysis in the Results section. As for the additional regions suggested by the reviewer, we chose not to include them because they could not be consistently defined in both hemispheres of all participants. Please note that the current ROIs also show different responses to non-preferred categories (e.g. in LO and pFs). We include this information in the Methods section in lines 206-207:

      “We selected this threshold to allow for selection of a reasonable number of voxels in each hemisphere across all participants.”

      And in the Results section in lines 509-512:

      “We performed the analysis including only voxels that had a significantly positive GLM coefficient across the runs and observed the same results. Moreover, to check whether the effect is robust over more selective thresholds for ROI definition, we redefined the left EBA region with p < 0.0001 and p < 0.00001 criteria. We observed a similar weight shift effect for both criteria.”

      f. One point the authors might address is the potential effect of blocking the paired conditions. If I understood right, the irrelevant item in each paired display was from the same category throughout a block. To what extent might this knowledge shape the way participants attend to the task-relevant item (e.g. by highlighting to them certain spatial frequencies or contours that might be useful in making that particular pairwise distinction)? In other words, are there theoretical reasons to expect different effects if the irrelevant category is not predictable?

      We believe that the participants’ knowledge about the distractor does not significantly affect our results because our results are in agreement with previous behavioral data (Cohen et al., 2014, Xu and Vaziri-Pashkam, 2019), in which the distractor could not be predicted. These reports suggest there is a theoretical reason to expect similar effects if the participants could not predict the distractor. To directly test this, one would need to perform an fMRI experiment using an event-related design, an interesting avenue for future research.

      We have made a note of this point in the Discussion section of the revised manuscript in lines 621-626:

      “Please note that we used a blocked design in which the target and distractor categories could be predicted across each block. While it is possible that the current design has led to an enhancement of the observed effect, previous behavioral data (Cohen et al., 2014, Xu and Vaziri-Pashkam, 2019) have reported the same effect in experiments in which the distractor was not predictable. To study the effect of predictability on fMRI responses, however, an event-related design is more appropriate, an interesting venue for future fMRI studies.”

      g. The authors could provide behavioural data as a function of the specific category pairs. There is a clear prediction here about which pairs should be more or less difficult.

      We provide the behavioral data as a supplementary figure to Figure 1 in the revised manuscript. However, we do not see differences in behavior for the different category pairs. This is because our fMRI task was designed to make sure the participants could properly attend to the target in all conditions. The task was rather easy across all conditions and, due to the ceiling effect, there was no significant difference between behavioral performance for different category pairs. However, the effect of category pair on behavior has been previously tested and reported in a visual search paradigm with the same categories (Xu and Vaziri-Pashkam, 2019), which was in fact the basis for our choice of categories in this study (as explained in response to point “d” above).

      h. Figure 4 shows data for EBA in detail; it would be helpful to have a similar presentation of the data for the other ROIs as well.

      We provide data for all ROIs as figure supplements 1-4 to Figure 4 in the revised manuscript.

      i. For the pFs and LOC ROIs, it would be helpful to have an indication of what proportion of voxels was most/least responsive to each of the four categories. Was this a relatively even balance, or generally favouring one of the categories?

      In LO, the proportion of voxels most responsive to each category was relatively even for Body (31%) and House (32%) stimuli, and higher than the proportion of Car- and Cat-preferring voxels (18% and 19%, respectively). In pFs, 40% of the voxels were house-selective, while the proportion was relatively even for voxels most responsive to bodies, cars, and cats, with 21%, 17%, and 22% of the voxels, respectively. We include the percentage of voxels most responsive to each of the four categories in each ROI as Appendix 1-table 1.

      j. Were the stimuli in the localisers the same as in the main experiment?

      No, we used different sets of stimuli for the localizers and the main experiment. We have added the information in line 146 of the Methods section.

      Reviewer #2 (Recommendations For The Authors):

      (1) Why are specific ROIs chosen? Perhaps some discussion motivating these choices, and addressing the possible overlap between these and retinotopic regions (based on other studies, or atlases - Wang et al, 2015) would be useful.

      Considering that we used object categories, we decided to look at general object-selective regions (LO, pFs) as well as regions that are highly selective for specific categories (EBA, PPA). We also looked at the primary visual cortex as a control region. We have added this clarification in the Methods section lines 128-133:

      “Considering that we used object categories, we investigated five different regions of interest (ROIs): the object-selective areas lateral occipital cortex (LO) and posterior fusiform (pFs) as general object-selective regions, the body-selective extrastriate body area (EBA) and the scene-selective parahippocampal place area (PPA) as regions that are highly selective for specific categories, and the primary visual cortex (V1) as a control region. We chose these regions because they could all be consistently defined in both hemispheres of all participants and included a large number of voxels.”

      (2) The authors should consider including data on the relative prevalence of voxels preferring each category for each ROI (and/or the mean activation level across voxels for each category for each ROI). If some ROIs have very few voxels preferring some categories, there's a chance the observed results are a bit noisy when sorting based on those categories (e.g., if a ROI has essentially no response to a given pair of categories, then there's not likely to be much attentional modulation detectable, because the ROI isn't driven by those categories to begin with).

      We thank the reviewer for the insightful comment.

      We include the percentage of voxels most responsive to each of the four categories in each ROI in the Appendix (Appendix 1-table 1, please see the answer to point “i” of the first reviewer).

      We also provide a table of average activity across voxels for each category in all ROIs as Appendix 1-table 2.

      As shown in the table, voxels show positive activity for all categories in all ROIs except for PPA, where voxels show no response to body and cat stimuli. This might explain why we observed a marginally significant correlation between weight shift and category distance in PPA only. As the reviewer mentions, since this region does not respond to body and cat stimuli, we do not observe a significant change in response due to the shift in attention for some pairs. We include the table in the Appendix and add the explanation to the Results section of the revised manuscript in lines 506-508:

      “Less significant results in PPA might arise from the fact that PPA shows no response to body and cat stimuli and little response to car stimuli (Appendix 1-table 2). Therefore, it is not possible to observe the effect of attention for all category pairs.”

      a. Related - would it make sense to screen voxels for inclusion in analysis based on above-baseline activation for one or both of the categories? [could, for example, imagine you're accidentally measuring from the motor cortex - you'd be able to perform this analysis, but it would be largely nonsensical because there's no established response to the stimuli in either isolated or combined states].

      We performed all the analyses including only voxels that had a significantly positive GLM coefficient across the runs and the results remained the same. We have added the explanation in the Results section in lines 509-510.

      (3) Behavioral performance is compared against chance level, but it doesn't seem that 50% is chance for the detection task. The authors write on page 4 that the 1-back repetition occurred between 2-3 times per block, so it doesn't seem to be the case that each stimulus had a 50% chance of being a repetition of the previous one.

      We apologize for the mistake in our report. We have reported the detection rate for the target-present trials (2-3 per block), not the behavioral performance across all trials. We have modified the sentence in the Results section.

      (4) Authors mention that the stimuli are identical for 2-stimulus trials where each category is attended (for a given pair) - but the cue is different, and the cue appears as a centrally-fixated word for 1 s. Is this incorporated into the GLM? I can't imagine this would have much impact, but the strict statement that the goals of the participant are the only thing differentiating trials with otherwise-identical stimuli isn't quite true.

      The word cue was not incorporated as a separate predictor into the GLM. As the reviewer notes, the signals related to the cue and stimuli are mixed. But given that the cues are brief and in the form of words rather than images, they are unlikely to have an effect on the response in the regions of interest.

      To be more accurate, we have included the clarification in the Methods section in lines 181-182:

      “We did not enter the cue to the GLM as a predictor. The obtained voxel-wise coefficients for each condition are thus related to the cue and the stimuli presented in that condition.”

      And in the Results section in lines 425-428:

      “It is important to note that since the cue was not separately modeled in the GLM, the signals related to the cue and the stimuli were mixed. However, given that the cues were brief and presented in the form of words, they are unlikely to have an effect on the responses observed in the higher-level ROIs.”

      (5) Eq 5: I expected there to be some comparison of a and b directly as ratios (e.g., a_1 > b_1, as shown in Fig. 2). The equations used here should be walked through more carefully - it's very hard to understand what this analysis is actually accomplishing. I'm not sure I follow the explanation of relative weights given by the authors, nor how that maps onto the delta_W quantity in Equation 5.

      We provide a direct comparison of a and b, as well as a more thorough clarification of the analysis, in the Methods section in lines 274-276:

      “We first projected the paired vector on the plane defined by the isolated vectors (Figure 2A) and then determined the weight of each isolated vector in the projected vector (Figure 2B).”

      And in lines 286-297:

      “A higher a1 compared to a2 indicates that the paired response pattern is more similar to Vxat compared to Vyat, and vice versa. For instance, if we calculate the weights of the Body and Car stimuli in the paired response related to the simultaneous presentation of both stimuli, we can write in the LO region: VBodyatCar = 0.81 VBody + 0.31 VCar, VBodyCarat = 0.43 VBody + 0.68 VCar. Note that these weights are averaged across participants. As can be observed, in the presence of both body and car stimuli, the weight of each stimulus is higher when attended compared to the case when it is unattended. In other words, when attention shifts from body to car stimuli, the weight of the isolated body response (VBody) decreases in the paired response. We can therefore observe that the response in the paired condition is more similar to the isolated body response pattern when body stimuli are attended and more similar to the isolated car response pattern when car stimuli are attended.”

      And lines 303-306:

      “As shown here, even when body stimuli are attended, the effect of the unattended car stimuli is still present in the response, shown in the weight of the isolated car response (0.31). However, this weight increases when attention shifts towards car stimuli (0.68 in the attended case).”

      We also provide a more detailed clarification of Δw and the relative weights in lines 309-324:

      “To examine whether this increase in the weight of the attended stimulus was constant or depended on the similarity of the two stimuli in cortical representation, we defined the weight shift as the multivariate effect of attention:

      Δw = a1/(a1+a2) – b1/(b1+b2)    (5)

      Here, a1, a2, b1, and b2 are the weights of the isolated responses, estimated using Equation 4. We calculate the weight of the isolated x response once when attention is directed towards x (a1), and a second time when attention is directed towards y (b1). In each case, we calculate the relative weight of the isolated x in the paired response by dividing the weight of the isolated x by the sum of weights of x and y (a1+a2 when attention is directed towards x, and b1+b2 when attention is directed towards y). We then define the weight shift, Δw, as the change in the relative weight of the isolated x response in the paired response when attention shifts from x to y. A higher Δw for a category pair indicates that attention is more efficient in removing the effect of the unattended stimulus in the pair. We used relative weights as a normalized measure to compensate for the difference in the sum of weights for different category pairs. Thus, using the normalized measure, we calculated the share of each stimulus in the paired response. For instance, considering the Body-Car pair, the share of the body stimulus in the paired response was equal to 0.72 and 0.38, when body stimuli were attended and unattended, respectively. We then calculated the change in the share of each stimulus caused by the shift in attention using a simple subtraction (Equation 5: Δw=0.34 for the above example of the Body-Car pair in LO) and used this measure to compare between different pairs.”

      We hope that this clarification makes it easier to understand the multivariate analysis and the weight shift calculation in Equation 5.

      We additionally provide the values of the weights (a1, b1, a2, and b2) for each category pair, averaged across participants, as Appendix 1-table 4.
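
      As an illustration only, the projection and weight-shift steps described above can be sketched in a few lines of Python. The sketch assumes that the weights of Equation 4 are the least-squares coefficients of the paired pattern on the two isolated patterns (the orthogonal projection onto their plane); the function names `project_weights`, `relative_weight`, and `weight_shift` are hypothetical and are not taken from the authors' analysis code. The numerical example reuses the participant-averaged LO weights quoted above.

```python
def project_weights(paired, vx, vy):
    """Weights (w_x, w_y) such that w_x*vx + w_y*vy is the orthogonal
    projection of `paired` onto the plane spanned by vx and vy.
    Assumption: this least-squares reading is how Equation 4 is interpreted here."""
    def dot(u, v):
        return sum(p * q for p, q in zip(u, v))

    gxx, gxy, gyy = dot(vx, vx), dot(vx, vy), dot(vy, vy)
    px, py = dot(paired, vx), dot(paired, vy)
    det = gxx * gyy - gxy * gxy
    if det == 0:
        raise ValueError("isolated patterns are collinear; weights are not unique")
    # Solve the 2x2 normal equations with the closed-form inverse.
    w_x = (px * gyy - py * gxy) / det
    w_y = (py * gxx - px * gxy) / det
    return w_x, w_y


def relative_weight(w_x, w_y):
    """Share of stimulus x in the paired response."""
    return w_x / (w_x + w_y)


def weight_shift(a1, a2, b1, b2):
    """Equation 5: change in the relative weight of x when attention moves from x to y."""
    return relative_weight(a1, a2) - relative_weight(b1, b2)


# Body-Car pair in LO, using the rounded weights quoted in the response.
delta_w = weight_shift(a1=0.81, a2=0.31, b1=0.43, b2=0.68)
print(round(delta_w, 2))  # 0.34
```

      Because the quoted weights are rounded averages, the shares computed this way can differ from the reported ones in the second decimal place. With non-negative weights each relative weight lies between 0 and 1, so the shift lies between -1 and 1.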

      (6) For multivariate analyses (Fig. 6A-E), x axis is normalized (pattern distance based on Pearson correlation), while the delta_W does not seem to be similarly normalized.

      We calculated ΔW by dividing the weights in each condition by the sum of weights in that condition. Thus, we use relative weights which are always in the range of 0 to 1, and ΔW is thus always in the range of -1 to 1. This means that both axes are normalized. Note that even if one axis were not normalized, the relationship between the independent and the dependent variables would remain the same despite the change in the range of the axis.

      (7) Simulating additional scenarios like attention to both categories just increasing the mean response would be helpful - is this how one would capture results like those shown in some panels of Fig. 4?

      We did not have a condition in which participants were asked to attend to both categories. Therefore it was not useful for our simulations to include such a scenario. Please also note that the goal of our simulations is not to capture the exact amount of attentional modulation, but to investigate the effect of target-distractor similarity on the change in attentional modulation (univariate shift and weight shift).

      As for the results in some panels of Figure 4, we have explained the reason underlying the higher responses in paired conditions compared to isolated conditions in our response to the “weaknesses” section of the second reviewer. We hope that these points satisfy the reviewer’s concern regarding the results in Figure 4 and our simulations.

      (8) Lines 271-276 - the "latter" and "former" are backwards here I think.

      We believe that the sentence was correct, but confusing. We have rephrased the sentence to avoid the confusion in lines 371-376 of the revised manuscript:

      “We modeled two neural populations: a general object-selective population in which each voxel shows preference to a particular category and voxels with different preferences are mixed in with each other (similar to LO and pFS), and a category-selective population in which all voxels have a similar preference for a particular category (similar to EBA and PPA).”

      (9) Line 314 - "body-car" pair is mentioned twice in describing the non-significant result in PPA ROI.

      Thank you for catching the typo. We have changed the second Body-Car to Body-Cat.

      (10) Fig. 5 and Fig. 6 - I was expecting to see a plot that demonstrated variability across subjects rather than across category pairs. Would it be possible to show the distribution of each pair's datapoints across subjects, perhaps by coloring all (e.g.) body-car datapoints one color, all body-cat datapoints another, etc? This would also help readers better understand how category preferences (which differ across ROIs) impact the results.

      We demonstrated variability across category pairs rather than subjects because we aimed to investigate how the variation in the similarity between categories (i.e. category distance) affected the univariate and multivariate effects of attention. The variability across subjects is reflected in the error bars in the bar plots of Figure 5 and Figure 6.

      Here we show the distribution of each category pair’s data points across subjects by using a different color for each pair:

      Author response image 2.

      Univariate shift versus category distance including single-subject data points in all ROIs.

      Author response image 3.

      Weight shift versus category distance including single-subject data points in all ROIs.

      As can be observed in the figures, category preference has little impact on the results. Rather, the similarity in the preference (in the univariate case) or the response pattern (in the multivariate case) to the two presented categories is what impacts the amount of the univariate shift and the weight shift, respectively. For instance, in EBA we observe a low amount of attentional shift both for the Body-Cat pair, with two stimuli for which the ROI is highly selective, and the Car-House pair, including stimuli to which the region shows little response. A similar pattern is observed in the object-selective regions LO and pFs which show high responses to all stimulus categories.

      We believe that the figures including the data points related to all subjects are not strongly informative. However, we agree that using different colors for each category pair helps the readers better understand that category preference has little impact on the results in different ROIs. We therefore present the colored version of Figure 5 and Figure 6 in the revised manuscript, with a different color for each category pair.

      (11) Fig. 5 and Fig. 6 use R^2 as a dependent variable across participants to conclude a positive relationship. While the positive relationship is clear in the scatterplots, which depict averages across participants for each category pair, it could still be the case that there are a substantial number of participants with negative (but predictive, thus high positive R^2) slopes. For completeness and transparency, the authors should illustrate the average slope or regression coefficient for each of these analyses.

      We concluded the positive relationship and calculated the significance in Figure 5 and Figure 6 using the correlation r rather than r^2. This is why the result was not significantly positive in V1. We acknowledge that the use of r-squared in the bar plot leads to confusion. We have therefore changed the bar plots to show the correlation coefficient instead of the r-squared. Furthermore, we have added a table of the correlation coefficient for all participants in all ROIs for the univariate and weight shift analyses supplemental to Figure 5 and Figure 6, respectively.
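
      The reviewer's concern, that r-squared hides the sign of a relationship while r does not, can be shown with a small sketch. The values below are made-up toy numbers chosen only for illustration (they are not data from the study), and `pearson_r` is a hypothetical helper written with the standard library.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Toy values only: a hypothetical category distance and two hypothetical
# participants whose attentional shift rises or falls with that distance.
distance = [0.1, 0.2, 0.3, 0.4]
rising = [0.05, 0.10, 0.15, 0.20]
falling = [0.20, 0.15, 0.10, 0.05]

r_up = pearson_r(distance, rising)
r_down = pearson_r(distance, falling)

# The signs differ, but the squared values are indistinguishable.
print(r_up, r_down, r_up ** 2, r_down ** 2)
```

      Reporting r per participant, as in the revised bar plots and the added table, keeps the direction of the relationship visible.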

      (12) No statement about data or analysis code availability is provided

      Thanks for pointing this out. The fMRI data is available on OSF. We have added a statement about it in the Data Availability section of the revised manuscript in line 669.

    1. Author response:

      The following is the authors’ response to the original reviews.

      The detailed, thorough critique provided by the three reviewers is very much appreciated. We believe the manuscript is greatly improved by the changes we have made based on those reviews. The major changes are described below, followed by a point by point response.

      Major Changes:

      (1) We revised our model (old Fig. 10; new Fig. 9) to keep the explanation focused on the data shown in the current study. Specifically, references to GTP/GDP states of Rab3A and changes in the presynaptic quantum have been removed and the mechanisms depicted are confined to pre- or post-synaptic Rab3A participating in either controlling release of a trophic factor that regulates surface GluA2 receptors (pre- or postsynaptic) or directly affecting fusion of GluA2-receptor containing vesicles (postsynaptic).

      (2) We replaced all cumulative density function plots and ratio plots, based on multiple quantile samples per cell, with box plots of cell means. This affects new Figures 1, 2, 3, 5, 6, 7 and 8. All references to “scaling,” “divergent scaling,” or “uniform scaling,” have been removed. New p values for comparison of means are provided above every box plot in Figures 1, 2, 3, 5, 6, 7 and 8. The number of cultures is provided in the figure legends.

      (3) We have added frequency to Figures 1, 2 and 8. Frequency values overall are more variable, and the effect of activity blockade less robust, than for mEPSC amplitudes. We have added text indicating that the increase in frequency after activity blockade was significant in neurons from cultures prepared from WT in the Rab3A+/- colony but not cultures prepared from KO mice (Results, lines 143 to 147, new Fig. 1G, H). The TTX-induced increase in frequency was significant in the NASPM experiments before NASPM, but not after NASPM (Results, lines 231 to 233, new Fig. 3, also cultures from WT in Rab3A+/- colony). The homeostatic plasticity effect on frequency did not reach significance in WT on WT glia cultures or WT on KO glia cultures, possibly due to the variability of frequency, combined with smaller sample sizes (Results, lines 400 to 403, new Fig. 8). In the cultures prepared from WT mice in the Rab3A+/Ebd colony, there was a trend towards higher frequency after TTX that did not reach statistical significance, and in cultures prepared from mutant mice, the p value was large, suggesting disruption of the effect, which appears to be due to an increase in frequency in untreated cultures, similar to the behavior of mEPSC amplitudes in neurons from mutant mice (Results, lines 161-167). In sum, the effect of activity on frequency requires Rab3A and Ca2+-permeable receptors, and is mimicked by the presence of the Rab3A Earlybird mutant. We have also added a discussion of these results (Discussion, lines 427-435).

      (4) In the revised manuscript we have added analysis of VGLUT1 levels for the same synaptic sites that we previously analyzed GluA2 levels, and these data are described in Results, lines 344 to 371, and appear in new Table 2. In contrast to previous studies, we did not find any evidence for an increase in VGLUT1 levels after activity blockade. We reviewed those studies to determine whether there might be differences in the experimental details that could explain the lack of effect we observed. In (De Gois et al., 2005), the authors measured mRNA and performed western blots to show increases in VGLUT1 after TTX treatment in older rat cortical cultures (DIV 19). The study performs immunofluorescence imaging of VGLUT1 but only after bicuculline treatment (it decreases), not after TTX treatment. In (Wilson et al., 2005), the hippocampal cultures are treated with AP5, not TTX, and the VGLUT1 levels in immunofluorescence images are reported relative to synapsin I. That the type of activity blockade matters is illustrated by the failure of Wilson and colleagues to observe a consistent increase in VGLUT1/Synapsin ratio in cultures treated with AMPA receptor blockade (NBQX; supplementary information). These points have been added to the Discussion, lines 436 to 447.

      Reviewer #1:

      (1) (model…is not supported by the data), (2) (The analysis of mEPSC data using quantile sampling…), (3) (…statistical analysis of CDFs suffers from n-inflation…), (4) (How does recording noise and the mEPSC amplitude threshold affect “divergent scaling?”), (5) (…justification for the line fits of the ratio data…), (7) (A comparison of p-values between conditions….) and (10) (Was VGLUT intensity altered in the stainings presented in the manuscript?)

      The major changes we made, described above, address Reviewer #1’s points. The remaining points are addressed below.

      (6) TTX application induces a significant increase in mEPSC amplitude in Rab3A-/- mice in two out of three data sets (Figs. 1 and 9). Hence, the major conclusion that Rab3A is required for homeostatic scaling is only partially supported by the data. 

      The p values based on CDF comparisons were problematic, but the point we were making is that they were much larger for amplitudes measured in cultures prepared from Rab3A-/- mice (Fig. 1, p = 0.04) compared to those from cultures prepared from Rab3A+/+ mice (Fig. 1, p = 4.6 × 10^-4). Now that we are comparing means, there are no significant TTX-induced effects on mEPSC amplitudes for Rab3A-/- data. However, acknowledging that some increase after activity blockade remains, we describe homeostatic plasticity as being impaired or not significant, rather than abolished, by loss of Rab3A (Abstract, lines 37 to 39; Results, lines 141 to 143; Discussion, lines 415 to 418).

      (8) There is a significant increase in baseline mEPSC amplitude in Rab3AEbd/Ebd (15 pA) vs. Rab3AEbd/+ (11 pA) cultures, but not in Rab3A-/- (13.6 pA) vs. Rab3A+/- (13.9 pA). Although the nature of scaling was different between Rab3AEbd/Ebd vs. Rab3AEbd/+ and Rab3AEbd/Ebd with vs. without TTX, the question arises whether the increase in mEPSC amplitude in Rab3AEbd/Ebd is Rab3A dependent. Could a Rab3A independent mechanism occlude scaling?

      The Reviewer is concerned that the increase in mEPSC amplitude in the presence of the Rab3A point mutant may be through a ‘non-Rab3A’ mechanism (a concern raised by the lack of such effect in cultures from the Rab3A-/- mice), and secondly, that the already large mEPSC cannot be further increased by the homeostatic plasticity mechanism. It must always be considered that a mutant with an altered genetic sequence may bind to novel partners, causing activities that would not be either facilitated or inhibited by the original molecule. We have added this caveat to Results, lines 180 to 186. We added that a number of other manipulations, implicating individual molecules in the homeostatic mechanism, have caused an increase in mEPSC amplitude at baseline, potentially nonspecifically occluding the ability of activity blockade to induce a further increase (Results, lines 186 to 189). Still, it is a strong coincidence that the novel activity of the mutant Rab3A would affect mEPSC amplitude, the same characteristic that is affected by activity blockade in a Rab3A-dependent manner, a point which we added to Results, lines 189 to 191.

      (9) Figure 4: NASPM appears to have a stronger effect on mEPSC frequency in the TTX condition vs. control (-40% vs -15%). A larger sample size might be necessary to draw definitive conclusions on the contribution of Ca2+-permeable AMPARs.

      Our results, even with the modest sample size of 11 cells, are clear: NASPM does not disrupt the effect of TTX treatment on mEPSC amplitude (new Fig. 3A). It also looks like there is a greater magnitude effect of NASPM on frequency in TTX-treated cells; we note this, but point out that nevertheless, these mEPSCs are not contributing to the increase in mEPSC amplitude (Results, lines 238-241).

      (11) The change in GluA2 area or fluorescence intensity upon TTX treatment in controls is modest. How does the GluA2 integral change?

      We had reported that GluA2 area showed the most prominent increase following activity blockade, with intensity changing very little. When we examined the integral, it closely matched the change in area. We have added the values for integral to new Fig. 5 D, H; new Fig. 6 A-C; new Fig. 7 A-C and new Table 1 (for GluA2) and new Table 2 (for VGLUT1). These results are described in the text in the following places: Results, lines 289-292; 298-299; 311-319; 328-324. For VGLUT1, both area and intensity changed modestly, and the integral appeared to be a combination of the two, being higher in magnitude and resulting in smaller p values than either area or intensity (Results, lines 344-348; 353-359; new Table 2).

      (12) The quantitative comparison between physiology and microscopy data is problematic. The authors report a mismatch in ratio values between the smallest mEPSC amplitudes and the smallest GluA2 receptor cluster sizes (l. 464; Figure 8). Is this comparison affected by the fluorescence intensity threshold? What was the rationale for a threshold of 400 a.u. or 450 a.u.? How does this threshold compare to the mEPSC threshold of 3 pA?

      This concern is partially addressed by no longer comparing the rank ordered mEPSC amplitudes with the rank ordered GluA2 receptor characteristics. We had used multiple thresholds in the event that an experiment was not analyzable with the chosen threshold (this in fact happened for VGLUT1, see end of this paragraph). We created box plots of the mean GluA2 receptor cluster size, intensity and integral, for experiments in which we used all three thresholds, to determine if the effect of activity blockade was different depending on which threshold was applied, and found that there was no obvious difference in the results (Author response image 1). Nevertheless, since there is no need to use a different threshold for any of the 6 experiments (3 WT and 3 KO), for new Figures 5, 6 and 7 we used the same threshold for all data, 450; described in Methods, lines 746 to 749. For VGLUT1 levels, it was necessary to use a different threshold for Rab3A+/+ Culture #1 (400), but a threshold of 200 for the other five experiments (Methods, lines 751-757). The VGLUT1 immunofluorescent sites in Culture #1 had higher levels overall, and the low threshold caused the entire AOI to be counted as the synapse, which clearly included background levels outside of the synaptic site. Conversely, to use a threshold of 400 on the other experiments meant that the synaptic site found by the automated measurement tool was much smaller than what was visible by eye. In our judgement it would have been meaningless to adhere to a single threshold for VGLUT1 data.

      Author response image 1.

      Using different thresholds does not substantially alter GluA2 receptor cluster size data. A) Rab3A+/+ Culture #1, size data for three different thresholds, depicted above each graph. B) Rab3A+/+ Culture #2, size data for three different thresholds, depicted above each graph. Note scale bar in A is different from B, to highlight differences for different thresholds. (Culture #3 was only analyzed with 450 threshold).

      The conclusion that an increase in AMPAR levels is not fully responsible for the observed mEPSC increase is mainly based on the rank-order analysis of GluA2 intensity, yielding a slope of ~0.9. There are several points to consider here: (i) GluA2 fluorescence intensity did increase on average, as did GluA2 cluster size. (ii) The increase in GluA2 cluster size is very similar to the increase in mEPSC amplitude (each approx. 18-20%). (iii) Are there any reports that fluorescence intensity values are linearly reporting mEPSC amplitudes (in this system)? Antibody labelling efficiency, and false negatives of mEPSC recordings may influence the results. The latter was already noted by the authors.

      Our comparison between mEPSC amplitude and GluA2 receptor cluster characteristics has been reexamined in the revised version using means rather than rank-ordered data in rank-order plots or ratio plots. Importantly, all of these methods revealed that in one out of three WT cultures (Culture #3) GluA2 receptor cluster size (old Fig. 8, old Table 1; new Fig. 6, new Table 1), intensity and integral (new Fig. 6, new Table 1) values decreased following activity blockade while in the same culture, mEPSC amplitudes increased. It is based on this lack of correspondence that we conclude that increases in mEPSC amplitude are not fully explained by increases in GluA2 receptors, and suggest there may be other contributors. These points are made in the Abstract (lines 108-110); Results (lines 319 to 326; 330-337; 341-343) and the Discussion (lines 472 to 474). To our knowledge, there are not any reports that quantitatively compare receptor levels (area, intensity or integrals) to mEPSC amplitudes in the same cultures. We examined the comparisons very closely for 5 studies that used TTX to block activity and examined receptor levels using confocal imaging at identified synapses (Hou et al., 2008; Ibata et al., 2008; Jakawich et al., 2010a; Xu and Pozzo-Miller, 2017; Dubes et al., 2022). We were specifically looking for whether the receptor data were more variable than the mEPSC amplitude data, as we found. However, for 4 of the studies, sample sizes were very different so that we cannot simply compare the p values. Below is a table of the comparisons.

      Author response table 1.

      In Xu 2017 the sample sizes are close enough that we feel comfortable concluding that the receptor data were slightly more variable (p < 0.05) than mEPSC data (p<0.01) but recognize that it is speculative to say our finding has been confirmed. A discussion of these articles is in Discussion, lines 456-474.

      (iv) It is not entirely clear if their imaging experiments will sample from all synapses. Other AMPAR subtypes than GluA2 could contribute, as could kainate or NMDA receptors.

      While our imaging data only examined GluA2, we used the application of NASPM to demonstrate Ca2+-permeable receptors did not contribute quantitatively to the increase in mEPSC amplitude following TTX treatment. Since GluA3 and GluA4 are also Ca2+-permeable, the findings in new Figure 3 (old Fig. 4) likely rule out these receptors as well. There are also reports that Kainate receptors are Ca2+-permeable and blocked by NASPM (Koike et al., 1997; Sun et al., 2009), suggesting the NASPM experiment also rules out the contribution of Kainate receptors. Finally, given our recording conditions, which included normal magnesium levels in the extracellular solution as well as TTX to block action-potential evoked synaptic transmission, NMDA receptors would not be available to contribute currents to our recordings due to block by magnesium ions at resting Vm. These points have been added to the Methods section, lines 617 to 677 (NMDA); 687-694 (Ca2+-permeable AMPA receptors and Kainate receptors).

      Furthermore, the statement “complete lack of correspondence of TTX/CON ratios” is not supported by the data presented (l. 515ff). First, under the assumption that no scaling occurs in Rab3A-/-, the TTX/CON ratios show a 20-30% change, which indicates the variation of this readout. Second, the two examples shown in Figure 8 for Rab3A+/+ are actually quite similar (culture #1 and #2, particularly when ignoring the leftmost section of the data, which is heavily affected by the raw values approaching zero).

      We are no longer presenting ratio plots in the revised manuscript, so we do not base our conclusion that mEPSC amplitude data is not always corresponding to GluA2 receptor data on the difference in behavior of TTX/CON ratio values, but only on the difference in direction of the TTX effect in one out of three cultures. We agree with the reviewer that the ratio plots are much more sensitive to differences between control and treated values than the rank order plot, and we feel these differences are important, for example, there is still a homeostatic increase in the Rab3A-/- cultures, and the effect is still divergent rather than uniform. But the comparison of ratio data will be presented elsewhere.

      (13) Figure 7A: TTX CDF was shifted to smaller mEPSC amplitude values in Rab3A-/- cultures. How can this be explained?

      While this result is most obvious in CDF plots, we still observe a trend towards smaller mEPSC amplitudes after TTX treatment in two of three individual cultures prepared from Rab3A-/- mice when comparing means (new Fig. 7, Table 1) which did not reach statistical significance for the pooled data (new Fig. 5, new Table 1). There was not any evidence of this decrease in the larger data set (new Fig. 1) nor for Rab3A-/- neurons on Rab3A+/+ glia (new Fig. 8). Given that this effect is not consistent, we did not comment on it in the revised manuscript. It may be that there is a non-Rab3A-dependent mechanism that results in a decrease in mEPSC amplitude after activity blockade, which normally pulls down the magnitude of the activity-dependent increase typically observed. But studying this second component would be difficult given its magnitude and inconsistent presentation.

      Reviewer #1 (Recommendations For the Authors):

      (1) Abstract, last sentence: The conclusion of the present manuscript should be primarily based on the results presented. At present, it is mainly based on a previous publication by the authors.

      We have revised the last sentence to reflect actual findings of the current study (Abstract, lines 47 to 49).

      (2) Line 55: “neurodevelopmental”

      This phrase has been removed.

      (3) Line 56: “AMPAergic” should be replaced by AMPAR-mediated

      This sentence was removed when all references to “scaling” were removed; no other instances of “AMPAergic” are present.

      (4) Figure 9: The use of BioRender should be disclosed in the Figure Legend.

      We used BioRender in new Figures 3, 7 and 8, and now acknowledge BioRender in those figure legends.

      (5) Figure legends and results: The number of cultures should be indicated for each comparison.

      Number of cultures has been added to the figure legends.

      (6) Line 289: A comparison of p-values between conditions does not allow any meaningful conclusions.

      Agreed, therefore we have removed CDFs and the KS test comparison p values. All comparisons in the revised manuscript are for cell means.
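
      The difference between pooling individual events and comparing cell means can be sketched as follows. The amplitudes are made-up toy values for illustration only (not data from the study), and `cell_means` is a hypothetical helper; the point is simply that the sample size for a comparison of means is the number of cells, not the number of events.

```python
from statistics import mean


def cell_means(events_by_cell):
    """Collapse each cell's mEPSC amplitudes (pA) to a single mean per cell."""
    return [mean(amplitudes) for amplitudes in events_by_cell.values()]


# Toy values only: hypothetical amplitudes (pA) for two hypothetical cells.
control = {
    "cell_1": [10.2, 12.1, 9.8],
    "cell_2": [14.0, 13.5, 15.1, 12.9],
}

means = cell_means(control)
n_cells = len(means)  # sample size when comparing cell means
n_events = sum(len(a) for a in control.values())  # inflated n if events are pooled

print(n_cells, n_events)  # 2 7
```

      Any downstream test then operates on one value per cell, which avoids treating events from the same cell as independent samples.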

      (7) Line 623ff: The argument referring to NMJ data is weak, given that different types of receptors are involved.

      We still think it is valid to point out that Rab3A is required for the increase in mEPC at the NMJ but that ACh receptors do not increase (Discussion, lines 522 to 525). We are not saying that postsynaptic receptors do not contribute in cortical cultures, only that there could be another Rab3A-dependent mechanism that also affects mEPSC amplitude.

      (8) Plotting data points outside of the ranges should be avoided (e.g., Fig. 2Giii, 7F).

      These two figures are no longer present in the revised manuscript. In revising figures, we made sure no other plots have data points outside of the ranges.

      (9) The rationale for investigating Rab3AEbd/Ebd remains elusive and should be described.

      A rationale for investigating Rab3AEbd/Ebd is that if the results are similar to the KO, it strengthens the evidence for Rab3A being involved in homeostatic synaptic plasticity. In addition, since its phenotype of early awakening was stronger than that demonstrated in Rab3A KO mice (Kapfhamer et al., 2002), it was possible we would see a more robust effect. These points have been added to the Results, lines 118 to 126.

      (10) Figures 3 and 4, as well as Figure 5 and 6 could be merged.

      In the revised version, Figure 3 has been eliminated since its main point was a difference in scaling behavior. Figure 4 has been expanded to include a model of how NASPM could reduce frequency (new Fig. 3). Images of the pyramidal cell body have been added to Figure 5 (new Fig. 4), and Figure 6 has been completely revised and now includes pooled data for both Rab3A+/+ and Rab3A-/- cultures, for mEPSC amplitude, GluA2 receptor cluster size, intensity and integral.

      (11) Figure 5: The legend refers to MAP2, but this is not indicated in the figure.

      MAP2 has now been added to the labels for each image and described in the figure legend (new Fig. 4).

      Reviewer #2:

      Technical concerns:

      (1) The culture condition is questionable. The authors saw no NMDAR current present during spontaneous recordings, which is worrisome since NMDARs should be active in cultures with normal network activity (Watt et al., 2000; Sutton et al., 2006). It is important to ensure there is enough spiking activity before doing any activity manipulation. Similarly it is also unknown whether spiking activity is normal in Rab3AKO/Ebd neurons.

      In the studies cited by the reviewer, NMDA currents were detected under experimental conditions in which magnesium was removed. In our recordings, we have normal magnesium (1.3 mM) and also TTX, which prevents the necessary depolarization to allow inward current through NMDA receptors. This point has been added to our Methods, lines 674 to 677. We acknowledge we do not know the level of spiking in cultures prepared from Rab3A+/+, Rab3A-/- or Rab3AEbd/Ebd mice. Given the similar mEPSC amplitude for untreated cultures from WT and KO studies, we think it unlikely that activity was low in the latter, but it remains a possibility for untreated cultures from Rab3AEbd/Ebd mice, where mEPSC amplitude was increased. These points are added to the Methods, lines 615 to 622.

      (2) Selection of mEPSC events is not conducted in an unbiased manner. Manually selecting events is insufficient for cumulative distribution analysis, where small biases could skew the entire distribution. Since the authors claim their ratio plot is a better method to detect the uniformity of scaling than the well-established rank-order plot, it is important to use an unbiased population to substantiate this claim.

      We no longer include any cumulative distributions or ratio plot analysis in the revised version. We have added the following text to Methods, lines 703 to 720:

      “MiniAnalysis selects many false positives with the automated feature when a small threshold amplitude value is employed, due to random fluctuations in noise, so manual re-evaluation of the automated process is necessary to eliminate false positives. If the threshold value is set high, there are few false positives but small amplitude events that visually are clearly mEPSCs are missed, and manual re-evaluation is necessary to add back false negatives or the population ends up biased towards large mEPSC amplitudes. As soon as there is a manual step, bias is introduced. Interestingly, a manual re-evaluation step was applied in a recent study that describes their process as ‘unbiased’ (Wu et al., 2020). In sum, we do not believe it is currently possible to perform a completely unbiased detection process. A fully manual detection process means that the same criterion (“does this look like an mEPSC?”) is applied to all events, not just the false positives, or the false negatives, which prevents the bias from being primarily at one end or the other of the range of mEPSC amplitudes. It is important to note that when performing the MiniAnalysis process, the researcher did not know whether a record was from an untreated cell or a TTX-treated cell.”

      (3) Immunohistochemistry data analysis is problematic. The authors only labeled dendrites without doing cell-fills to look at morphology, so it is questionable how they differentiate branches from pyramidal neurons and interneurons. Since glutamatergic synapse on these two types of neuron scale in the opposite directions, it is crucial to show that only pyramidal neurons are included for analysis.

      We identified neurons with a pyramidal shape and a prominent primary dendrite at 60x magnification without the zoom feature. This should have been made clear in the description of imaging. We have added an image of the two selected cells to our figure of dendrites (old Fig. 5, new Fig. 4), and described this process in the Methods, lines 736 to 739, and Results, lines 246 to 253. Given the morphology of the neurons selected it is highly unlikely that the dendrites we analyzed came from interneurons.

      Conceptual Concerns

      The only novel finding here is the implicated role for Rab3A in synaptic scaling, but insights into mechanisms behind this observation are lacking. The authors claim that Rab3A likely regulates scaling from the presynaptic side, yet there is no direct evidence from data presented. In its current form, this study’s contribution to the field is very limited.

      We have demonstrated that loss of Rab3A and expression of a Rab3A point mutant disrupt homeostatic plasticity of mEPSC amplitudes, and that in the absence of Rab3A, the increase in GluA2 receptors at synaptic sites is abolished. Further, we show that this effect cannot be through release of a factor, like TNFα, from astrocytes. In the new version, we add the finding that VGLUT1 is not increased after activity blockade, ruling out this presynaptic factor as a contributor to homeostatic increases in mEPSC amplitude. We show for the first time by examining mEPSC amplitudes and GluA2 receptors in the same cultures that the increases in GluA2 receptors are not as consistent as the increases in mEPSC amplitude, suggesting the possibility of another contributor to homeostatic increases in mEPSC amplitude. We first proposed this idea in our previous study of Rab3A-dependent homeostatic increases in mEPC amplitudes at the mouse neuromuscular junction. In sum, we dispute that there is only one novel finding and that we have no insights into mechanism. We acknowledge that we have no direct evidence for regulation from the presynaptic side, and have removed this claim from the revised manuscript. We have retained the Discussion of potential mechanisms affecting the presynaptic quantum and evidence that Rab3A is implicated in these mechanisms (vesicle size, fusion pore kinetics; Discussion, lines 537 to 563). One way to directly show that the amount of transmitter released for an mEPSC has been modified after activity blockade is to demonstrate that a fast off-rate antagonist has become less effective at inhibiting mEPSCs (because the increased glutamate released outcompetes it; see (Liu et al., 1999) and (Wilson et al., 2005) for example experiments).

      This set of experiments is underway but will take more time than originally expected, because we are finding surprisingly large decreases in frequency, possibly the result of mEPSCs with very low glutamate concentration that are completely inhibited by the dose used. Once mEPSCs are lost, it is difficult to compare the mEPSC amplitude before and after application of the antagonist. Therefore, we intend to include this experiment in a future report, once we determine the reason for the frequency reduction or can find a dose where this does not occur.

      (1) Their major argument for this is that homeostatic effects on mEPSC amplitudes and GluA2 cluster sizes do not match. This is inconsistent with reports from multiple labs showing that upscaling of mEPSC amplitude and GluA2 accumulation occur side by side during scaling (Ibata et al., 2008; Pozo et al., 2012; Tan et al., 2015; Silva et al., 2019). Further, because the acquisition and quantification methods for mEPSC recordings and immunohistochemistry imaging are entirely different (each with its own limitations in signal detection), it is not convincing that the lack of proportional changes must signify a presynaptic component.

      Within the analyses in the revised manuscript, which are now based only on comparison of cell/dendrite means, we find a very good match in the magnitude of increase for the pooled data of mEPSC amplitudes and GluA2 receptor cluster sizes (+19.7% and +20.0% respectively; new Table 1). However, when looking at individual cultures, we had one of three WT cultures in which mEPSC amplitude increased 17.2% but GluA2 cluster size decreased 9.5%. This result suggests that while activity blockade does lead to an increase in GluA2 receptors, the effect is more variable than that for mEPSC amplitude. We went back to published studies to see if this has been previously observed, but found that it was difficult to compare because the sample sizes were different for the two characteristics (see Author response table 1). We included these particular 5 studies because they use the same treatment (TTX), examine receptors using imaging of identified synaptic sites, and record mEPSCs in their cultures (although the authors do not indicate that imaging and recordings are done simultaneously on the same cultures). Only one of the studies listed by the Reviewer is in our group (Ibata et al., 2008). The study by (Tan et al., 2015) uses western blots to measure receptors; the study by (Silva et al., 2019) blocks activity using a combination of AMPA and NMDA receptor blockers; the study by (Pozo et al., 2012) correlates mEPSC amplitude changes with imaging but not in response to activity blockade, instead for changing the expression of GluA2. While it may seem like splitting hairs to reject studies that use other treatment protocols, there is ample evidence that the mechanisms of homeostatic plasticity depend on how activity was altered; see the following studies for several examples of this (Sutton et al., 2006; Soden and Chen, 2010; Fong et al., 2015). A discussion of the 5 articles we selected is in the revised manuscript, Discussion, lines 456 to 474.

      In sum, we provide evidence that activity blockade is associated with an overall increase in GluA2 receptors; what we propose is that this increase, being more variable, does not fully explain the increase in mEPSC amplitude. However, we acknowledge that the disparity could be explained by the differences in limitations of the two methods (Discussion, lines 469-472).

      (2) The authors also speculate in the discussion that presynaptic Rab3A could be interacting with retrograde BDNF signaling to regulate postsynaptic AMPARs. Without data showing Rab3A-dependent presynaptic changes after TTX treatment, this argument is not compelling. In this retrograde pathway, BDNF is synthesized in and released from dendrites (Jakawich et al., 2010b; Thapliyal et al., 2022), and it is entirely possible for postsynaptic Rab3A to interfere with this process cell-autonomously.

      We have added the information that Rab3A could control BDNF from the postsynaptic cell and included the two references provided by the reviewer, Discussion, lines 517 to 518. We have added new evidence, recently published, that the Rab3 family has been shown to regulate targeting of EGF receptors to rafts (among other plasma membrane molecules), with Rab3A itself clearly present in nonneuronal cells (Diaz-Rohrer et al., 2023) (added to Discussion, lines 509 to 515).

      (3) The authors propose that a change in AMPAR subunit composition from GluA2-containing ones to GluA1 homomers may account for the distinct changes in mEPSC amplitudes and GluA2 clusters. However, their data from the NASPM wash-in experiments clearly show that the GluA1 homomer contributions have not changed before and after TTX treatment.

      We have revised this section in the Discussion, lines 534 to 536, to clarify that any change due to GluA1 homomers should have been detectable by a greater ability of NASPM to reverse the TTX-induced increase.

      Reviewer #2 (Recommendations for the Authors):

      For authors to have more convincing arguments in general, they will need to clarify/improve certain details in their data collection by addressing the above technical concerns. Additionally, the authors should design experiments to test whether Rab3A regulates scaling from pre- or post-synaptic site. For example, they could sparsely knock out Rab3A in WT neurons to test the postsynaptic possibility. On the other hand, their argument for a presynaptic role would be much more compelling if they could show whether there are clear functional changes such as in vesicle sizes and release probability in the presynaptic terminal of Rab3AKO neurons.

      An important next step is to identify whether Rab3A is acting pre- or post-synaptically (Discussion, lines 572 to 573), but these experiments will be undertaken in the future. It would not add much to simply show vesicle size is altered in the KO (and we do not necessarily expect this since mEPSC amplitude is normal in the KO). It will be very difficult to establish that vesicle size is changing with activity blockade and that this change is prevented in the Rab3A KO, because we are looking for a ~25% increase in vesicle volume, which would correspond to a ~7.5% increase in diameter. Finally, we do not believe demonstrating changes in release probability tell us anything about a presynaptic role for Rab3A in regulating the size of the presynaptic quantum.
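
      The volume-to-diameter conversion quoted above can be checked with a short worked equation. It assumes a spherical vesicle, and the subscripts CON and TTX are labels introduced here for the untreated and activity-blocked conditions; under that assumption a 25% increase in volume corresponds to a diameter increase of roughly 7 to 8%, in line with the approximate figure given in the response.

```latex
V = \frac{\pi}{6}\, d^{3}
\quad\Longrightarrow\quad
\frac{d_{\mathrm{TTX}}}{d_{\mathrm{CON}}}
= \left(\frac{V_{\mathrm{TTX}}}{V_{\mathrm{CON}}}\right)^{1/3}
= (1.25)^{1/3} \approx 1.077
```

      Because diameter scales with the cube root of volume, even a sizeable change in vesicle content translates into a small change in diameter, which is why such a change would be hard to resolve by imaging.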

      Reviewer #3 (Public Review)

      Weaknesses: However, the rather strong conclusions on the dissociation of AMPAR trafficking and synaptic response are made from somewhat weaker data. The key issue is the GluA2 immunostaining in comparison with the mEPSC recordings. Their imaging method involves only assessing puncta clearly associated with a MAP2 labeled dendrite. This is a small subset of synapses, judging from the sample micrographs (Fig. 5). To my knowledge, this is a new and unvalidated approach that could represent a particular subset of synapses not representative of the synapses contributing to the mEPSC change (they are also sampling different neurons for the two measurements; an additional unknown detail is how far from the cell body were the analyzed dendrites for immunostaining.) While the authors acknowledge that a sampling issue could explain the data, they still use this data to draw strong conclusions about the lack of AMPAR trafficking contribution to the mEPSC amplitude change. This apparent difference may be a methodological issue rather than a biological one, and at this point it is impossible to differentiate these. It will unfortunately be difficult to validate their approach. Perhaps if they were to drive NMDA-dependent LTD or chemLTP, and show alignment of the imaging and ephys, that would help. More helpful would be recordings and imaging from the same neurons but this is challenging. Sampling from identified synapses would of course be ideal, perhaps from 2P uncaging combined with SEP-labeled AMPARs, but this is more challenging still. But without data to validate the method, it seems unwarranted to make such strong conclusions such as that AMPAR trafficking does not underlie the increase in mEPSC amplitude, given the previous data supporting such a model.

      In the new version, we soften our conclusion regarding the mismatch between GluA2 receptor levels and mEPSC amplitudes, now only stating that receptors may not be the sole contributor to the TTX effect on mEPSC amplitude (Discussion, lines 472 to 474). With our analysis in the new version focusing on comparisons of cell means, the GluA2 receptor cluster size and the mEPSC amplitude data match well in magnitude for the data pooled across the 3 matched cultures (20.0% and 19.7%, respectively, see new Table 1). However, in one of the three cultures the direction of change for GluA2 receptors is opposite that of mEPSC amplitudes (Table 1, Culture #3, -9.5% vs +17.2%, respectively).

      It is unlikely that the lack of matching of homeostatic plasticity in one culture, but very good matching in two other cultures, can be explained by an unvalidated focus on puncta associated with MAP2 positive dendrites. We chose to restrict analysis of synaptic GluA2 receptors to the primary dendrite in order to reduce variability, reasoning that we are always measuring synapses for an excitatory pyramidal neuron, synapses that are relatively close to the cell body, on the consistently identifiable primary dendrite. We measured how far this was for the two cells depicted in old Figure 5 (new Fig. 4). Because we always used the 5X zoom window which is a set length, and positioned it within ~10 microns of the cell body, these cells give a ballpark estimate for the usual distances. For the untreated cell, the average distance from the cell body was 38.5 ± 2.8 µm; for the TTX-treated cell, it was 42.4 ± 3.2 µm (p = 0.35, Kruskal-Wallis test). We have added these values to the Results, lines 270 to 274.

      We did not mean to propose that AMPA receptor levels do not contribute at all to mEPSC amplitude, and we acknowledge there are clear cases where the two characteristics change in parallel (for example, in the study cited by Reviewer #2, (Pozo et al., 2012), increases in GluA2 receptors due to exogenous expression are closely matched by increases in mEPSC amplitudes.) What our matched culture experiments demonstrate is that in the case of TTX treatment, both GluA2 receptors and mEPSC amplitudes increase on average, but sometimes mEPSC amplitudes can increase in the absence of an increase in GluA2 receptors (Culture #3, Rab3A+/+ cultures), and sometimes mEPSC amplitudes do not increase even though GluA2 receptor levels do increase (Culture #3, Rab3A-/- cultures). Therefore, it would not add anything to our argument to examine receptors and mEPSCs in NMDA-dependent LTP, a different plasticity paradigm in which changes in receptors and mEPSCs may more closely align. It has been demonstrated that mEPSCs of widely varying amplitude can be recorded from a single synaptic site (Liu and Tsien, 1995), so we would need to measure a large sample of individual synapse recordings to detect a modest shift in average values due to activity blockade. In addition, it would be essential to express fluorescent AMPA receptors in order to correlate receptor levels in the same cells we record from (or at the same synapses). And yet, even after these heroics, one is still left with the issue that the two methods, electrophysiology and fluorescent imaging, have distinct limitations and sources of variability that may obscure any true quantitative correlation.

      Other questions arise from the NASPM experiments, used to justify looking at GluA2 (and not GluA1) in the immunostaining. First, there is a frequency effect that is quite unclear in origin. One would expect NASPM to merely block some fraction of the post-synaptic current, and not affect pre-synaptic release or block whole synapses. It is also unclear why the authors argue this proves that NASPM was at an effective concentration (lines 399-400). Further, the amplitude data show a strong trend towards smaller amplitude. The p value for both control and TTX neurons was 0.08 – it is very difficult to argue that there is no effect. And the decrease is larger in the TTX neurons. Considering the strong claims for a presynaptic locus and the use of this data to justify only looking at GluA2 by immunostaining, these data do not offer much support of the conclusions. Between the sampling issues and perhaps looking at the wrong GluA subunit, it seems premature to argue that trafficking is not a contributor to the mEPSC amplitude change, especially given the substantial support for that hypothesis. Further, even if trafficking is not the major contributor, there could be shifts in conductance (perhaps due to regulation of auxiliary subunits) that does not necessitate a pre-synaptic locus. While the authors are free to hypothesize such a mechanism, it would be prudent to acknowledge other options and explanations.

      We have created a model cartoon to explain how NASPM could reduce mEPSC frequency (new Fig. 3D). mEPSCs that arise from a synaptic site that has only Ca2+-permeable AMPA receptors will be completely blocked by NASPM, if the NASPM concentration is maximal. The reason we conclude that we have sufficient NASPM reaching the cells is that the frequency is decreased, as expected if there are synaptic sites with only Ca2+-permeable AMPA receptors. We previously were not clear that there is an effect of NASPM on mEPSC amplitude, although it did not reach statistical significance (new Fig. 3B). Where there is no effect is on the TTX-induced increase in mEPSC amplitude, which remains after the acute NASPM application (new Fig. 3A). We have revised the description of these findings in Results, lines 220 to 241. In reviewing the literature further, we could find no previous studies demonstrating an increase in conductance in GluA2 or Ca2+-impermeable receptors, only in GluA1 homomers. In other words, any conductance change would have been due to a change in GluA1 homomers, and should have been visible as a disruption of the homeostatic plasticity by NASPM application. We have added text to Results, lines 211 to 217; 236-241; Discussion, lines 420 to 422; 526-536 and Methods, lines 685 to 695 regarding this point.

      The frequency data are missing from the paper, with the exception of the NASPM dataset. The mEPSC frequencies should be reported for all experiments, particularly given that Rab3A is generally viewed as a pre-synaptic protein regulating release. Also, in the NASPM experiments, the average frequency is much higher in the TTX treated cultures. Is this statistically above control values?

      This comment is addressed by the major change #3, above.

      Unaddressed issues that would greatly increase the impact of the paper:

      (1) Is Rab3A acting pre-synaptically, post-synaptically or both? The authors provide good evidence that Rab3A is acting within neurons and not astrocytes. But knowing where it is acting (pre or post) would aid substantially in understanding its role (and particularly the hypothesized and somewhat novel idea that the amount of glutamate released per vesicle is altered in HSP). They could use sparse knockdown of Rab3A, or simply mix cultures from KO and WT mice (with appropriate tags/labels). The general view in the field has been that HSP is regulated post-synaptically via regulation of AMPAR trafficking, and considerable evidence supports this view. The more support for their suggestion of a pre-synaptic site of control, the better.

      This is similar to the request of Reviewer #2, Recommendations to the Authors. An important next step is to identify whether Rab3A is working pre- or postsynaptically. However, it is possible that it is acting pre-synaptically to anterogradely regulate trafficking of AMPAR, as we have depicted in our model, new Fig. 9. To demonstrate that the presynaptic quantum is being altered, we would need to show that vesicle size is increased, or the amount of transmitter being released during an mEPSC is increased after activity blockade. To that end, we are currently performing experiments using a fast off-rate antagonist. As described above in response to Reviewer #2’s Conceptual Concerns, we find dramatic decreases in frequency not explained by the 30-60% inhibition observed for the largest amplitude mEPSCs, which suggests the possibility that small mEPSCs are more sensitive than large mEPSCs and therefore may have less transmitter. Due to these complexities and the delay while we test other antagonists to see if the effect is specific to fast-off rate antagonists, we are not including these results here.

      (2) Rab3A is also found at inhibitory synapses. It would be very informative to know if HSP at inhibitory synapses is similarly affected. This is particularly relevant as at inhibitory synapses, one expects a removal of GABARs and/or a decrease of GABA-packaging in vesicles (i.e., the opposite of whatever is happening at excitatory synapses). If both processes are regulated by Rab3A, this might suggest a role for this protein more upstream in the signaling; an effect only at excitatory synapses would argue for a more specific role just at these synapses.

      It will be important to determine if homeostatic synaptic plasticity at inhibitory synapses on excitatory neurons is sensitive to Rab3A deletion, especially in light of the fact that unlike many of the other molecules implicated in homeostatic increases in mEPSCs, Rab3A is not a molecule known to be selective for glutamate receptor trafficking (in contrast to Arc/Arg3.1 or GRIP1, for example). Such a study would warrant its own publication.

      Reviewer #3 (Recommendations for the Authors):

      There are a number of minor points or suggestions for the authors:

      Is RIM1 part of this pathway (or expected to be)? Some discussion of this would be nice.

      RIM, Rab3-interacting molecule, has been implicated at the Drosophila neuromuscular junction in a presynaptic form of homeostatic synaptic plasticity in which evoked release is increased after block of postsynaptic receptors (Muller et al., 2012), a plasticity that also requires Rab3-GAP (Muller et al., 2011). To our knowledge there is no evidence that RIM is involved in the homeostatic plasticity of mEPSC amplitude after activity blockade by TTX. The Rim1a KO does not have a change in mEPSC amplitude relative to WT (Calakos et al., 2004), but that is not unexpected given the normal mEPSC amplitude in neurons from cultures prepared from Rab3A-/- mice in the current study. It would be interesting to look at homeostatic plasticity in cortical cultures prepared from Rim1a or other RIM deletion mice, but we have not added these points to the revised manuscript since there are a number of directions one could go in attempting to define the molecular pathway and we feel it is more important to discuss the potential location of action and physiological mechanisms.

      Is the Earlybird mutation a GOF? More information about this mutation would help.

      We have added a description of how the Earlybird mutation was identified, in a screen for rest:activity mutants (Results, lines 118 to 123). Rab3A Earlybird mice have a shortened circadian period, shifting their wake cycle earlier and earlier. When Rab3A deletion mice were tested in the same activity raster plot measurements, the shift was smaller than that for the Earlybird mutant, suggesting the possibility that it is a dominant negative mutation.

      The high K used in the NASPM experiments seems a bit unusual. Have the authors done high K/no drug controls to see if this affects the synapses in any way?

      We used the high K based on previous studies that indicated the blocking effect of the Ca2+-permeable receptor blockers was use dependent (Herlitze et al., 1993; Iino et al., 1996; Koike et al., 1997). We reasoned that a modest depolarization would increase the frequency of AMPA receptor mEPSCs and allow access of the NASPM. We have added this point to the Methods, lines 695 to 708.

      The NASPM experiments do not show that GluA1 does not contribute (line 401), only that GluA1 homomers are not contributing (much – see above). GluA1/A2 heteromers are quite likely involved. Also, the SEM is missing from the WT pre/post NASPM data.

      Imaging of GluA2-positive sites will not distinguish between GluA2 homomers and GluA2-GluA1 heteromers, so we have added this clarification to Results, lines 242 to 246. We have remade the NASPM pre-post line plots so that the mean values and error bars are more visible (new Fig. 3B, C).

      It seems odd to speculate based on non-significant findings (line 650-1), with lower significance (p = 0.11) than findings being dismissed in the paper (NASPM on mEPSC amplitude; p = 0.08).

      We did not mean to dismiss the effect of NASPM on mEPSC amplitude (new Fig. 3B), rather, we dismiss the effect of NASPM on the homeostatic increase in mEPSC amplitude caused by TTX treatment (new Fig. 3A). We have emphasized this distinction in Results, lines 223 to 225, and Discussion, lines 420 to 422, as well as adding that the stronger effect of NASPM on frequency after TTX treatment suggests an activity-dependent increase in the number of synapses expressing only Ca2+ permeable homomers (Results, lines 236 to 241; Discussion, lines 431 to 435).

      Fig. 4 could be labeled better (to make it clear that B is amplitude and C is freq from the same cells).

      Fig. 4 has been revised—now the amplitude and frequency plots from the same condition (new Fig. 3, B, C; CON or TTX) are in a vertical line and the figure legend states that the frequency data are from the same cells as in Fig. 3A.

      The raw amplitude data seems a bit hidden in the inset panels – I would suggest these data are at least as important as the cumulative distributions in the main panel. Maybe re-organizing the figures would help.

      We have removed all cumulative distributions, rank order plots, and ratio plots. The box plots are now full size in new Figures 1, 2, 5, 6, 7 and 8.

      I’m not sure I would argue in the paper that 12 cells a day is a limiting issue for experiments. It doesn’t add anything and doesn’t seem like that high a barrier. It is fine to just say it is difficult and therefore there is a limited amount of data meeting the criteria.

      We have removed the comment regarding difficulty.

      Calakos N, Schoch S, Sudhof TC, Malenka RC (2004) Multiple roles for the active zone protein RIM1alpha in late stages of neurotransmitter release. Neuron 42:889-896.

      De Gois S, Schafer MK, Defamie N, Chen C, Ricci A, Weihe E, Varoqui H, Erickson JD (2005) Homeostatic scaling of vesicular glutamate and GABA transporter expression in rat neocortical circuits. J Neurosci 25:7121-7133.

      Diaz-Rohrer B, Castello-Serrano I, Chan SH, Wang HY, Shurer CR, Levental KR, Levental I (2023) Rab3 mediates a pathway for endocytic sorting and plasma membrane recycling of ordered microdomains. Proc Natl Acad Sci U S A 120:e2207461120.

      Dubes S, Soula A, Benquet S, Tessier B, Poujol C, Favereaux A, Thoumine O, Letellier M (2022) miR-124-dependent tagging of synapses by synaptopodin enables input-specific homeostatic plasticity. EMBO J 41:e109012.

      Fong MF, Newman JP, Potter SM, Wenner P (2015) Upward synaptic scaling is dependent on neurotransmission rather than spiking. Nat Commun 6:6339.

      Herlitze S, Raditsch M, Ruppersberg JP, Jahn W, Monyer H, Schoepfer R, Witzemann V (1993) Argiotoxin detects molecular differences in AMPA receptor channels. Neuron 10:1131-1140.

      Hou Q, Zhang D, Jarzylo L, Huganir RL, Man HY (2008) Homeostatic regulation of AMPA receptor expression at single hippocampal synapses. Proc Natl Acad Sci U S A 105:775-780.

      Ibata K, Sun Q, Turrigiano GG (2008) Rapid synaptic scaling induced by changes in postsynaptic firing. Neuron 57:819-826.

      Iino M, Koike M, Isa T, Ozawa S (1996) Voltage-dependent blockage of Ca(2+)-permeable AMPA receptors by joro spider toxin in cultured rat hippocampal neurones. J Physiol 496 (Pt 2):431-437.

      Jakawich SK, Neely RM, Djakovic SN, Patrick GN, Sutton MA (2010a) An essential postsynaptic role for the ubiquitin proteasome system in slow homeostatic synaptic plasticity in cultured hippocampal neurons. Neuroscience 171:1016-1031.

      Jakawich SK, Nasser HB, Strong MJ, McCartney AJ, Perez AS, Rakesh N, Carruthers CJ, Sutton MA (2010b) Local presynaptic activity gates homeostatic changes in presynaptic function driven by dendritic BDNF synthesis. Neuron 68:1143-1158.

      Kapfhamer D, Valladares O, Sun Y, Nolan PM, Rux JJ, Arnold SE, Veasey SC, Bucan M (2002) Mutations in Rab3a alter circadian period and homeostatic response to sleep loss in the mouse. Nat Genet 32:290-295.

      Koike M, Iino M, Ozawa S (1997) Blocking effect of 1-naphthyl acetyl spermine on Ca(2+)-permeable AMPA receptors in cultured rat hippocampal neurons. Neurosci Res 29:27-36.

      Liu G, Tsien RW (1995) Properties of synaptic transmission at single hippocampal synaptic boutons. Nature 375:404-408.

      Liu G, Choi S, Tsien RW (1999) Variability of neurotransmitter concentration and nonsaturation of postsynaptic AMPA receptors at synapses in hippocampal cultures and slices. Neuron 22:395-409.

      Muller M, Pym EC, Tong A, Davis GW (2011) Rab3-GAP controls the progression of synaptic homeostasis at a late stage of vesicle release. Neuron 69:749-762.

      Muller M, Liu KS, Sigrist SJ, Davis GW (2012) RIM controls homeostatic plasticity through modulation of the readily-releasable vesicle pool. J Neurosci 32:16574-16585.

      Pozo K, Cingolani LA, Bassani S, Laurent F, Passafaro M, Goda Y (2012) beta3 integrin interacts directly with GluA2 AMPA receptor subunit and regulates AMPA receptor expression in hippocampal neurons. Proc Natl Acad Sci U S A 109:1323-1328.

      Silva MM, Rodrigues B, Fernandes J, Santos SD, Carreto L, Santos MAS, Pinheiro P, Carvalho AL (2019) MicroRNA-186-5p controls GluA2 surface expression and synaptic scaling in hippocampal neurons. Proc Natl Acad Sci U S A 116:5727-5736.

      Soden ME, Chen L (2010) Fragile X protein FMRP is required for homeostatic plasticity and regulation of synaptic strength by retinoic acid. J Neurosci 30:16910-16921.

      Sun HY, Bartley AF, Dobrunz LE (2009) Calcium-permeable presynaptic kainate receptors involved in excitatory short-term facilitation onto somatostatin interneurons during natural stimulus patterns. J Neurophysiol 101:1043-1055.

      Sutton MA, Ito HT, Cressy P, Kempf C, Woo JC, Schuman EM (2006) Miniature neurotransmission stabilizes synaptic function via tonic suppression of local dendritic protein synthesis. Cell 125:785-799.

      Tan HL, Queenan BN, Huganir RL (2015) GRIP1 is required for homeostatic regulation of AMPAR trafficking. Proc Natl Acad Sci U S A 112:10026-10031.

      Thapliyal S, Arendt KL, Lau AG, Chen L (2022) Retinoic acid-gated BDNF synthesis in neuronal dendrites drives presynaptic homeostatic plasticity. Elife 11.

      Wilson NR, Kang J, Hueske EV, Leung T, Varoqui H, Murnick JG, Erickson JD, Liu G (2005) Presynaptic regulation of quantal size by the vesicular glutamate transporter VGLUT1. J Neurosci 25:6221-6234.

      Wu YK, Hengen KB, Turrigiano GG, Gjorgjieva J (2020) Homeostatic mechanisms regulate distinct aspects of cortical circuit dynamics. Proc Natl Acad Sci U S A 117:24514-24525.

      Xu X, Pozzo-Miller L (2017) EEA1 restores homeostatic synaptic plasticity in hippocampal neurons from Rett syndrome mice. J Physiol 595:5699-5712.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment This valuable paper reports a theoretical framework and methodology for identifying Cancer Driving Nucleotides (CDNs), primarily based on single nucleotide variant (SNV) frequencies. A variety of solid approaches indicate that a mutation recurring three or more times is more likely to reflect selection rather than being the consequence of a mutation hotspot. The method is rigorously quantitative, though the requirement for larger datasets to fully identify all CDNs remains a noted limitation. The work will be of broad interest to cancer geneticists and evolutionary biologists. 

      The key criticism, “the requirement for larger datasets to fully identify all CDNs remains a noted limitation”, is also found in both reviews. We have clarified the issue in the main text, and the relevant parts are copied below. The response below also addresses many comments in the reviews. In addition, the Discussion of eLife-RP-RA-2024-99341 has been substantially expanded to answer the questions of Reviewer 2.

      We shall answer the boldface comment in three ways. First, it can be answered using GENIE data. Fig. 7 of the main text (eLife-RP-RA-2024-99340) shows that, when n increases from ~ 1000 to ~ 9,000, the numbers of discovered CDNs increase by 3 – 5 fold, most of which come from the two-hit class. Hence, the power of discovering more CDNs with larger datasets is evident. By extrapolation, a sample size of 100,000 should be able to yield 90% of all CDNs, as calculated here. (Fig. 7 also addresses the queries of whether we have used datasets other than TCGA. We indeed have used all public data, including GENIE and COSMIC.) 

      Second, the power of discovering more cancer driver genes by our theory is evident even without using larger datasets. Table 3 of the companion study (eLife-RP-RA-2024-99341) shows that, averaged across cancer types, the conventional method would identify 45 CDGs while the CDN method tallies 258 CDGs. The power of the CDN method is demonstrated. This is because the conventional approach has to identify CDGs (cancer driver genes) in order to identify the CDNs they carry. However, many CDNs occur in non-CDGs and are thus missed by the conventional approach. In Supplementary File S2, we have included a full list of CDNs discovered in our study, along with population allele frequency annotations from gnomAD. The distribution patterns of these CDNs across different cancer types show their pan-cancer properties as further explored in the companion paper.

      Third, while many, or even most CDNs occur in non-CDGs and are thus missed, the conventional approach also includes non-CDN mutations in CDGs. This is illustrated in Fig. 5 of the companion study (eLife-RP-RA-2024-99341) that shows the adverse effect of misidentifications of CDNs by the conventional approach. In that analysis, the gene-targeting therapy is effective if the patient has the CDN mutations on EGFR, but the effect is reversed if the EGFR mutations are non-CDN mutations.

      Reviewer #1 (Public Review):

      The authors developed a rigorous methodology for identifying all Cancer Driving Nucleotides (CDNs) by leveraging the concept of massively repeated evolution in cancer. By focusing on mutations that recur frequently in pan-cancer, they aimed to differentiate between true driver mutations and neutral mutations, ultimately enhancing the understanding of the mutational landscape that drives tumorigenesis. Their goal was to call a comprehensive catalogue of CDNs to inform more effective targeted therapies and address issues such as drug resistance.

      Strengths

      (1) The authors introduced a concept of using massively repeated evolution to identify CDNs. This approach recognizes that advantageous mutations recur frequently (at least 3 times) across cancer patients, providing a lens to identify true cancer drivers.

      (2) The theory showed the feasibility of identifying almost all CDNs if the number of sequenced patients increases to 100,000 for each cancer type.

      Weaknesses

      (1) The methodology remains theoretical and no novel true driver mutations were identified in this study.

      We now address the weakness criticism, which is gratefully received.

      The second part of the criticism (no novel true driver mutations were identified in this study) has been answered in the long response to the eLife assessment above. The first part, “The methodology remains theoretical”, is somewhat unclear. It may simply lead into the second part. In case it does not, we interpret the word “theoretical” to mean “lacking experimental proof” and answer below.

      As Reviewer #1 noted, a common limitation of theoretical and statistical analyses of cancer drivers is the need to validate their selective advantage through in vitro or in vivo functional testing. This concern is echoed by both reviewers in the companion paper (eLife-RP-RA-2024-99341), prompting us to consider the methodology for functional testing of potential cancer drivers. An intuitive approach would involve introducing putative driver mutations into normal cells and observing phenotypic transformation in vitro and in vivo. In a recent stepwise-edited human melanoma model, Hodis et al. demonstrated that disease-relevant phenotypes depend on the “correct” combinations of multiple driver mutations (Hodis et al. 2022). Other high-throughput strategies can be broadly categorized into two approaches: (1) introducing candidate driver mutations into pre-malignant model systems that already harbor a canonical mutant driver (Drost and Clevers 2018; Grzeskowiak et al. 2018; Michels et al. 2020) and (2) introducing candidate driver mutations into growth factor-dependent cell models and assessing their impact on resulting fitness (Bailey et al. 2018; Ng et al. 2018). The underlying assumption of these strategies is that the fitness outcomes of candidate driver mutations are influenced by pre-existing driver mutations and the specific pathways or cancer hallmarks being investigated. This confines the functional test of potential cancer driver mutations to conventional cancer pathways. A comprehensive identification of CDNs is therefore crucial to overcome these limitations. In conjunction with other driver signal detection methods, our study aims to provide a more comprehensive profile of driver mutations, thereby enabling the functional testing of drivers involved in non-conventional cancer evolution pathways.

      (2) Different cancer types have unique mutational landscapes. The methodology, while robust, might face challenges in uniformly identifying CDNs across various cancers with distinct genetic and epigenetic contexts.

      We appreciate the comment. Indeed, different cancer types should have different genetic and epigenetic landscapes. In that case, one might have expected CDNs to be poorly shared among cancer types. However, as reported in Fig. 4 of the companion study, the sharing of CDNs across cancer types is far more common than the sharing of CDGs (Cancer Driving Genes). We suggest that CDNs have a much higher resolution than CDGs, in which the signals are diluted by non-driver mutations. In other words, although the mutational landscape may be cancer-type specific, the pan-cancer selective pressure may be sufficiently high to permit the detection of CDN sharing among cancer types.

      Below, we respond in greater detail. Epigenetic factors, such as chromatin states, methylation/acetylation levels, and replication timing, can provide valuable insights when analyzing mutational landscapes at a regional scale (Stamatoyannopoulos et al. 2009; Lawrence et al. 2013; Makova and Hardison 2015; Baylin and Jones 2016; Alexandrov et al. 2020; Abascal et al. 2021; Sherman et al. 2022). However, at the site-specific level, the effectiveness of these covariates in predicting mutational landscapes depends on their integration into a detailed model. Overemphasizing these covariates could lead to false negatives for known driver mutations (Hess et al. 2019; Elliott and Larsson 2021). In Figure 3B of the main text, we illustrate the discrepancy between the mutation rate predictions from Dig and empirical observation. Ideally, no covariates would be needed with sufficiently large sample sizes, where each mutable genomic site would have enough mutations to yield statistical significance and, consequently, synonymous mutations would be sufficient for characterizing the mutational landscape. In this sense, the integration of mutational covariates represents a compromise under current sample sizes. In our study, the effect of unique mutational landscapes is captured by E(u), the mean mutation rate for each cancer type. We further accounted for the variability of site-level mutability using a gamma distribution. The primary goal of our study is to determine the upper limit of mutation recurrences under mutational mechanisms only. While selection acts blindly with respect to genomic features, mutational hotspots should exhibit common characteristics determined by their underlying mechanisms. In the main text, we attempted to identify such shared features among CDNs. Until these mutational mechanisms are fully understood, CDNs should be considered as potential driver mutations.
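      To make the reasoning about the neutral upper limit of recurrence concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the number of hits at a site across n patients is Poisson with mean n·u, and that u is gamma-distributed across sites with mean E(u), so the hit count is negative binomial. The function names, the gamma shape value, the mean rate, and the cutoff on the tail sum are all hypothetical choices for illustration; only the site count (22,540,623 nonsynonymous sites, quoted from Table 1 in the reviewer's comment) comes from the text.

```python
import math


def neg_binom_pmf(j, shape, mean):
    """P(X = j) for a gamma-mixed Poisson (negative binomial) with given shape and mean."""
    log_p = (
        math.lgamma(j + shape) - math.lgamma(shape) - math.lgamma(j + 1)
        + shape * math.log(shape / (shape + mean))
        + j * math.log(mean / (shape + mean))
    )
    return math.exp(log_p)


def expected_sites_with_recurrence(i, n_patients, mean_rate, shape, n_sites, max_extra=200):
    """Expected number of sites hit at least i times under mutation alone.

    The tail probability is summed directly (from i up to i + max_extra) to
    avoid the cancellation error of computing 1 - P(X < i) for tiny tails.
    """
    mean = n_patients * mean_rate  # expected hits per site across all patients
    tail = sum(neg_binom_pmf(j, shape, mean) for j in range(i, i + max_extra))
    return n_sites * tail


if __name__ == "__main__":
    N_SITES = 22_540_623   # nonsynonymous sites, as quoted from Table 1
    MEAN_RATE = 1e-6       # hypothetical E(u) per site per patient
    SHAPE = 1.0            # hypothetical gamma shape (smaller = more rate variation)
    for n in (1_000, 9_000):
        for i in (1, 2, 3, 4):
            expected = expected_sites_with_recurrence(i, n, MEAN_RATE, SHAPE, N_SITES)
            print(f"n={n:>5}  i>={i}: expected neutral sites = {expected:.3g}")
```

      Under these assumptions, a recurrence threshold i is informative when the expected number of neutral sites with at least i hits falls well below the number of sites actually observed at that recurrence.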

      (3) L223, the statement "In other words, the sequences surrounding the high-recurrence sites appear rather random.". Since it was a pan-cancer analysis, the unique patterns of each cancer type could be strongly diluted in the pan-cancer data.

      We now state that the analyses of mutation characteristics have also been applied to the individual cancer types, and no pattern that deviates from randomness was found. Nevertheless, it may be argued that, with the exception of those with sufficiently large sample sizes such as lung and breast cancers, most datasets do not have the power to reject the null hypothesis. To alleviate this concern, we applied the ResNet and LSTM/GRU methods for the discovery of potential mutation motifs within each cancer type. All of these methods are more powerful than the one originally used, but the results are the same: no cancer type yields a mutation pattern that can reject the null hypothesis of randomness (see below).

      As a positive control, we used these methods for the discovery of splicing sites of human exons. When the sequences are aligned with the splicing site situated in the center (position 51 in the following plot), the sequence motif looks like:

      Author response image 1.

      5-prime

      Author response image 2.

      3-prime

      However, to account for the potential influence of distance from the mutant site in motif analysis, we randomly shuffled the splicing sites within a specified window around the alignment center, and their sequence logos now look like:

      Author response image 3.

      5-prime shuffled

      Author response image 4.

      3-prime shuffled

      Author response image 5.

      random sequences from coding regions

      The classification results for the shuffled 5-prime (donor), 3-prime (acceptor) and random sequences from coding regions (Random CDS) are presented in Author response table 1 (the accuracy for the aligned results, which is approximately 99%, is not shown here).

      Author response table 1.

      With these positive controls (splicing site motifs) validating our methodology, we applied the same model structure to train and test on potential mutational motifs of CDN sites. All models achieved approximately 50% accuracy in CDN motif analysis, suggesting that the sequence contexts surrounding CDN sites are not significantly different from other coding regions of the genome. This further implies that the recurrence of mutations at CDN sites is more likely driven by selection than by mutational mechanisms.

      Note that this preliminary analysis may be limited by insufficient training data for CDN sites. Future studies will require larger sample sizes and more sophisticated models to address these limitations.

      (4) To solidify the findings, the results need to be replicated in an independent dataset.

      Figure 7 validates our CDN findings using the GENIE dataset, which primarily consists of targeted sequencing data from various panels. By focusing on the same genomic regions sequenced by GENIE, we observed a 3-5 fold increase in the number of discovered CDNs as sample size increased from approximately 1000 to 9000. Moreover, the majority of CDNs identified in TCGA were confirmed as CDNs in GENIE.

      (5) The key scripts and the list of key results (i.e., CDN sites with i ≥ 3) need to be shared to enable replication, validation, and further research. So far, only CDN sites with i ≥ 20 have been shared.

      We have now updated the “Data Availability” section in the main text; the corresponding scripts for key results are available on GitLab at: https://gitlab.com/ultramicroevo/cdn_v1.

      (6) The versions of data used in this study are not clearly detailed, such as the specific version of gnomAD and the version and date of TCGA data downloaded from the GDC Data Portal.

      The versions of data sources have now been updated in the revised manuscript.

      Recommendations For The Authors:

      (1) L119, states "22.7 million nonsynonymous sites," but Table 1 lists the number as 22,540,623 (22.5 million). This discrepancy needs to be addressed for consistency.

      (2) Figure 2B, there is an unexplained drop in the line at i = 6 and 7 (from 83 to 45). Clarification is needed on why this drop occurs.

      (3) Figure 3A, for the CNS type, data for recurrence at 8 and 9 are missing. An explanation should be provided for this absence.

      (4) L201, the title refers to "100-mers," but L218 mentions "101-mers." This inconsistency needs to be corrected to ensure clarity and accuracy.

      (5) Figures 6 and 7 currently lack titles. Titles should be added to these figures to improve readability.

      Thanks. All corrections have been incorporated into the revised manuscript.

      Reviewer #2 (Public Review):

      Summary:

      The authors propose that cancer-driver mutations can be identified by Cancer Driving Nucleotides (CDNs). CDNs are defined as SNVs that occur frequently in genes. There are many ways to define cancer driver mutations, and the strengths and weaknesses are the reliance on statistics to define them.

      Strengths:

      There are many well-known approaches and studies that have already identified many canonical driver mutations. A potential strength is that mutation frequencies may be able to identify as yet unrecognized driver mutations. They use a previously developed method to estimate mutation hotspots across the genome (Dig, Sherman et al 2022). This publication has already used cancer sequence data to infer driver mutations based on higher-than-expected mutation frequencies. The advance here is to further illustrate that recurrent mutations (estimated at 3 or more mutations (CDNs) at the same base) are more likely to be the result of selection for a driver mutation (Figure 3). Further analysis indicates that mutation sequence context (Figure 4) or mutation mechanisms (Figure 5) are unlikely to be major causes for recurrent point mutations. Finally, they calculate (Figure 6) that most driver mutations identifiable by the CDN approach could be identified with about 100,000 to one million tumor coding genomes.

      Weaknesses:

      The manuscript does provide specific examples where recurrent mutations identify known driver mutations but do not identify "new" candidate driver mutations. Driver mutation validation is difficult and at least clinically, frequency (ie observed in multiple other cancer samples) is indeed commonly used to judge if an SNV has driver potential. The method would miss alternative ways to trigger driver alterations (translocations, indels, epigenetic, CNVs). Nevertheless, the value of the manuscript is its quantitative analysis of why mutation frequencies can identify cancer driver mutations.

      Recommendations For The Authors:

      Whereas the analysis of driver mutations in WES has been extensive, the application of the method to WGS data (ie the noncoding regions) would provide new information.

      We appreciate that Reviewer #2 has suggested the potential application of our method to noncoding regions. Currently, the background mutation model is based on site-level mutations in coding regions, which hinders its direct application to other mutation types such as CNVs, translocations and indels. We acknowledge that the proportion of patients with driver events involving CNVs (73%) is comparable to that of coding point mutations (76%), as reported in the PCAWG analysis (Fig. 2A from Campbell et al., 2020). In future studies, we will attempt to establish a CNV-based background mutation rate model to identify positive selection signals driving tumorigenesis.

      References

      Abascal F, Harvey LMR, Mitchell E, Lawson ARJ, Lensing SV, Ellis P, Russell AJC, Alcantara RE, Baez-Ortega A, Wang Y, et al. 2021. Somatic mutation landscapes at single-molecule resolution. Nature:1–6.

      Alexandrov LB, Kim J, Haradhvala NJ, Huang MN, Tian Ng AW, Wu Y, Boot A, Covington KR, Gordenin DA, Bergstrom EN, et al. 2020. The repertoire of mutational signatures in human cancer. Nature 578:94–101.

      Bailey MH, Tokheim C, Porta-Pardo E, Sengupta S, Bertrand D, Weerasinghe A, Colaprico A, Wendl MC, Kim J, Reardon B, et al. 2018. Comprehensive Characterization of Cancer Driver Genes and Mutations. Cell 173:371-385.e18.

      Baylin SB, Jones PA. 2016. Epigenetic Determinants of Cancer. Cold Spring Harb Perspect Biol 8:a019505.

      Campbell PJ, Getz G, Korbel JO, Stuart JM, Jennings JL, Stein LD, Perry MD, Nahal-Bose HK, Ouellette BFF, Li CH, et al. 2020. Pan-cancer analysis of whole genomes. Nature 578:82–93.

      Drost J, Clevers H. 2018. Organoids in cancer research. Nat Rev Cancer 18:407–418.

      Elliott K, Larsson E. 2021. Non-coding driver mutations in human cancer. Nat Rev Cancer 21:500–509.

      Grzeskowiak CL, Kundu ST, Mo X, Ivanov AA, Zagorodna O, Lu H, Chapple RH, Tsang YH, Moreno D, Mosqueda M, et al. 2018. In vivo screening identifies GATAD2B as a metastasis driver in KRAS-driven lung cancer. Nat Commun 9:2732.

      Hess JM, Bernards A, Kim J, Miller M, Taylor-Weiner A, Haradhvala NJ, Lawrence MS, Getz G. 2019. Passenger Hotspot Mutations in Cancer. Cancer Cell 36:288-301.e14.

      Hodis E, Triglia ET, Kwon JYH, Biancalani T, Zakka LR, Parkar S, Hütter J-C, Buffoni L, Delorey TM, Phillips D, et al. 2022. Stepwise-edited, human melanoma models reveal mutations’ effect on tumor and microenvironment. Science 376:eabi8175.

      Lawrence MS, Stojanov P, Polak P, Kryukov GV, Cibulskis K, Sivachenko A, Carter SL, Stewart C, Mermel CH, Roberts SA, et al. 2013. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 499:214–218.

      Makova KD, Hardison RC. 2015. The effects of chromatin organization on variation in mutation rates in the genome. Nat Rev Genet 16:213–223.

      Michels BE, Mosa MH, Streibl BI, Zhan T, Menche C, Abou-El-Ardat K, Darvishi T, Członka E, Wagner S, Winter J, et al. 2020. Pooled In Vitro and In Vivo CRISPR-Cas9 Screening Identifies Tumor Suppressors in Human Colon Organoids. Cell Stem Cell 26:782-792.e7.

      Ng PK-S, Li J, Jeong KJ, Shao S, Chen H, Tsang YH, Sengupta S, Wang Z, Bhavana VH, Tran R, et al. 2018. Systematic Functional Annotation of Somatic Mutations in Cancer. Cancer Cell 33:450-462.e10.

      Sherman MA, Yaari AU, Priebe O, Dietlein F, Loh P-R, Berger B. 2022. Genome-wide mapping of somatic mutation rates uncovers drivers of cancer. Nat Biotechnol 40:1634–1643.

      Stamatoyannopoulos JA, Adzhubei I, Thurman RE, Kryukov GV, Mirkin SM, Sunyaev SR. 2009. Human mutation rate associated with DNA replication timing. Nat Genet 41:393–395.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews: 

      Reviewer #1 (Public review): 

      Summary:  

      Wang et al. investigate sexual dimorphic changes in the transcriptome of aged humans. This study relies upon analysis of the Genotype-Tissue Expression dataset that includes 54 tissues from human donors. The authors investigate 17,000 transcriptomes from 35 tissues to investigate the effect of age and sex on transcriptomic variation, including the analysis of alternative splicing. Alternative splicing is becoming more appreciated as an influence in the aging process, but how it is affected by sexual dimorphism is still largely unclear. The authors investigated multiple tissues but ended up distilling brain tissue down to four separate regions: decision, hormone, memory, and movement. Building upon prior work, the authors used an analysis method called principal component-based signal-to-variation ratio (pcSVR) to quantify differences between sex or age by considering data dispersion. This method also considers differentially expressed genes and alternative splicing events. 

      Strengths:  

      (1) The authors investigate sexual dimorphism on gene expression and alternative splicing events with age in multiple tissues from a large publicly available data set that allows for reanalysis. 

      (2) Furthermore, the authors take into account the ethnic background of donors. Identification of aging-modulating genes could be useful for the reanalysis of prior data sets. 

      Weaknesses:  

      The models built off of the GTEx dataset should be tested in another data set (ex. Alzheimer's disease) where there are functional changes that can be correlated. Gene-length-dependent transcription decline, which occurs with age and disease, should also be investigated in this data set for potential sexual dimorphism. 

      We appreciate the reviewer’s constructive feedback and acknowledgment of the strengths of our study. The detailed results are included in the ‘Recommendations for the authors’ from the editorial office. Below we summarize our feedback that address the concerns of this reviewer:

      (1) Independent Alzheimer’s disease (AD) datasets:

      We acknowledge the importance of validating our models beyond GTEx to assess their generalizability from aging to Alzheimer’s disease. While GTEx provides valuable transcriptomic data across multiple tissues, it lacks direct functional assessments linked to disease states. We have already analyzed RNA-seq data from ROSMAP and GEO in Figure 4, focusing on sex-biased gene expression and splicing changes between aging and AD. The results showed a male-biased association with Alzheimer’s disease at AS resolution, indicating that the AS changes during aging could contribute more to AD in males than in females. We have highlighted this analysis in the manuscript (Pages 6-7).

      (2) Sexual dimorphism in Gene-Length-Dependent Transcription Decline (GLTD) 

      We appreciate the reviewer’s suggestion to explore gene-length-dependent transcription decline (GLTD), which has been implicated in both aging and disease. As the reviewer suggested, our analysis revealed that GLTD exhibits sex-biased patterns in different tissues, aligning with recent literature on sex-dimorphic transcriptional aging. Our findings also revealed that longer genes with greater transcriptional decline are enriched in AD-related pathways. We have incorporated this new analysis in the ‘Recommendations for the authors’ in Author response image 5-6 and expanded the discussion of the biological relevance. 

      Reviewer #2 (Public review): 

      Summary: 

      In this manuscript, Wang et al analyze ~17,000 transcriptomes from 35 human tissues from the GTEx database and address transcriptomic variations due to age and sex. They identified both gene expression changes as well as alternative splicing events that differ among sexes. Using breakpoint analysis, the authors find sex dimorphic shifts begin with declining sex hormone levels with males being affected more than females. This is an important pan-tissue transcriptomic study exploring age and sex-dependent changes although not the first one. 

      Strengths:  

      (1) The authors use sophisticated modeling and statistics for differential, correlational, and predictive analysis. 

      (2) The authors consider important variables such as genetic background, ethnicity, sampling bias, sample sizes, detected genes, etc. 

      (3) This is likely the first study to evaluate alternative splicing changes with age and sex at a pan-tissue scale. 

      (4) Sex dimorphism with age is an important topic and is thoroughly analyzed in this study.

      Weaknesses:  

      (1) The findings have not been independently validated in a separate cohort or through experiments. Only selective splicing factor regulation has been verified in other studies. 

      (2) It seems the authors have not considered PMI or manner of death as a variable in their analysis. 

      (3) The manuscript is very dense and sometimes difficult to follow due to many different types of analyses and correlations. 

      (4) Short-read data can detect and quantify alternative splicing events with only moderate confidence and therefore the generalizability of these findings remains to be experimentally validated. 

      We appreciate the thorough review and thoughtful feedback. We have addressed the reviewer’s concerns and added clarification. The detailed results are included in Recommendations for the authors. Here are the summaries.

      (1) Challenge of independent validation in separate cohorts

      • The GTEx dataset includes the most comprehensive transcriptome resource for studying population-level differences in age and sex across tissues, particularly including large-scale brain samples. This provides a unique opportunity to analyze sex-dimorphic aging and the relevance of age-associated diseases.  Several technical issues, including cell type heterogeneity, postmortem artifacts, as well as sequencing biases, lead to technical challenges in different cohorts.

      • As the reviewer mentioned, we analyzed transcriptomic data from Shen et al. (2024) and compared them with GTEx results (Author response image 2). Limited overlap in differentially expressed genes again highlighted the challenges in cross-dataset validation due to the differences in cell composition and data processing (peripheral blood mononuclear cells (PBMCs) vs whole blood). 

      • Because human brain transcriptome data covering different age and sex groups are limited, we turned to mouse hippocampus mass spectrometry (MS) datasets that include young and old as well as female and male groups. The results validated the expression of splicing factors in the brain (Author response image 9). This cross-species consistency supports the robustness of our findings in human brain aging.

      (2) Effects of Postmortem Interval, Manner of Death, and Time of Death

      • We agree that sample collection could introduce confounding effects. To address this, we calculated the correlations between the estimated confounding factors and Postmortem Interval (PMI), Manner of Death (DTHMNNR), or Time of Death (DTHTIME and DTHSEASON). We observed strong correlations with some surrogate variables in most tissues, indicating that those factors could be well regressed out during our analysis (Recommendations for the authors, Figure S4 and R8). 

      • In addition, we re-evaluated our analyses while incorporating PMI as a covariate in our models. Our results align with our initial findings (Author response image 1), suggesting that age- and sex-dependent transcriptomic changes are not strongly confounded by PMI and confirming that our model has controlled PMI. These results are detailed in ‘Recommendations for the authors’ and included in Figure S4C-E with the description in text, Page 5. 

      (3) Readability of manuscript and flow of analyses

      • In summary, our study first examined global alternative splicing (AS) and gene expression (GE) across all tissues before focusing on specific regions for deeper insights. To improve clarity, we have made the following revisions:

      • Add clearer statements when transitioning between all-tissue and brain-specific analyses (Page 6-7).

      • Modify the subtitle of Results to highlight all-tissue vs. brain analyses (Page 6).

      • These refinements could enhance the manuscript’s structure, making the flow of analysis and conclusions more intuitive for readers.

      (4) Limitations of short-read RNA-seq for splicing analysis

      • Short-read RNA-seq provides only moderate confidence in detecting and quantifying full-length isoforms. However, its higher sequencing depth makes it more suitable for quantifying changes in alternative splicing (AS) events.

      • Our analysis focused on splicing event-level quantification, applying stringent filters and using our GPU-based tool, which showed strong concordance with RT-PCR and other pipelines. Therefore, we also cited and included the updated Paean manuscript that benchmarks its performance in AS analysis.

      Reviewer #3 (Public review): 

      Summary:  

      In this study, Wang et al utilized the available GTEx data to compile a comprehensive analysis that attempt to reveal aging-related sex-dimorphic gene expression as well as alternative splicing changes in humans. 

      The key conclusions based on their analysis are: 

      (1) extensive sex-dimorphisms during aging with distinct patterns of change in gene expression and alternative splicing (AS), and 

      (2) the male-biased age-associated AS events have a stronger association with Alzheimer's disease, and

      (3) the female-biased events are often regulated by several sex-biased splicing factors that may be controlled by estrogen receptors. They further performed break-point analysis and revealed that in males there are two main breakpoints around ages 35 and 50, while in females, there is only one breakpoint at 45. 

      Strengths:  

      This study sets an ambitious goal, leveraging the extensive GTEx dataset to investigate aging-related, sex-dimorphic gene expression and alternative splicing changes in humans. The research addresses a significant question, as our understanding of sex-dimorphic gene expression in the context of human aging is still in its early stages. Advancing our knowledge of these molecular changes is vital for identifying therapeutic targets for age-related diseases and extending the human health span. The study is highly comprehensive, and the authors are commendable for their attempted thorough analysis of both gene expression and alternative splicing - an area often overlooked in similar studies. 

      We thank this reviewer for the insightful review and recognition of our study's significance. We agree with the reviewer on the value of examining sex-dimorphic gene expression and alternative splicing in aging using the GTEx dataset. This is indeed essential for developing potential therapeutic targets for age-related diseases and promoting human health span.

      Weaknesses:  

      Due to the inherent noise within the GTEx dataset - which includes numerous variables beyond aging and sex - there are significant technical concerns surrounding this study. Additionally, the lack of cross-validation with independent, existing data raises questions about whether the observed gene expression changes genuinely reflect those associated with human aging. For instance, the break-point analysis in this study identifies two major breakpoints in males around ages 35 and 50, and one breakpoint in females at age 45; however, these findings contradict a recent multi-omics longitudinal study involving 108 participants aged 25 to 75 years, where breakpoints at 44 and 60 years were observed in both males and females (Shen et al, 2024). These issues cast doubt on the robustness of the study's conclusions. Specific concerns are outlined below: 

      References: 

      Ferreira PG, Muñoz-Aguirre M, Reverter F, Sá Godinho CP, Sousa A, Amadoz A, Sodaei R, Hidalgo MR, Pervouchine D, Carbonell-Caballero J et al (2018) The effects of death and post-mortem cold ischemia on human tissue transcriptomes. Nature Communications 9: 490. 

      Shen X, Wang C, Zhou X, Zhou W, Hornburg D, Wu S, Snyder MP (2024) Nonlinear dynamics of multi-omics profiles during human aging. Nature Aging. 

      Wucher V, Sodaei R, Amador R, Irimia M, Guigó R (2023) Day-night and seasonal variation of human gene expression across tissues. PLOS Biology 21: e3001986. 

      (1) The primary method used in this study is linear regression, incorporating age, sex, and age-by-sex interactions as covariates, alongside other confounding factors (such as ethnicity) as unknown variables. However, the analysis overlooks two critical known variables in the GTEx dataset: time of death (TOD) and postmortem interval (PMI). Both TOD and PMI are recorded for each sample and account for substantial variance in gene expression profiles. A recent study by Wucher et al. (Wucher et al, 2023) demonstrated the powerful impact of TOD on gene expression by using it to reconstruct human circadian and even circannual datasets. Similarly, Ferreira et al. (Ferreira et al, 2018) highlighted PMI's influence on gene expression patterns. Without properly adjusting for these two variables, confidence in the study's conclusions remains limited at best. 

      We thank the reviewer for raising this important point regarding the impact of post-mortem interval (PMI) and time of death (TOD), including the season (DTHSEASON) and time of day (DTHTIME) of death, on gene expression. To address this point, we carefully evaluated whether our linear model controlled for these factors as potential confounders.

      Our results showed that PMI and TOD were significantly correlated with the estimated covariates in most tissues, suggesting that their effects could be effectively regressed out by our model (Figure S4). As the reviewers and editors suggested, we have now included this correlation analysis in the updated Figure S4C-E and in the text of the Results section, citing relevant literature [1,2] (Page 5).

      Author response image 1.

      The results of differential gene expression analysis with vs without the inclusion of PMI correction as a known covariate. The scatter plots show the correlations of significance levels (p-values, left panel) and effect sizes (coefficients, right panel) of sex (A) and age (B). Whole-blood tissue is used as an example.

       

      In addition, we performed a differential analysis that incorporated PMI as a covariate in the regression models and re-evaluated the age- and sex-related transcriptomic changes. Using whole-blood gene expression as an example, our revised analysis shows that including PMI among the covariates has minimal impact on the significance levels and effect sizes of sex and age (i.e., p-values and coefficients, respectively), indicating that our findings are robust to this confounding factor (Author response image 1).
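
      The comparison described above can be illustrated with a minimal sketch on simulated data. This is not the code used in the study: the simulated values, the variable names, and the small `ols` helper are all hypothetical, and the age-by-sex interaction and estimated covariates of the full model are omitted for brevity. The sketch fits the same outcome with and without a PMI term and compares the age coefficients.

```python
import random

def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    n, p = len(X), len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))]
         for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[r][p] / A[r][r] for r in range(p)]

# Simulated donors (all values hypothetical, not GTEx data)
random.seed(0)
n = 500
age = [random.uniform(20, 70) for _ in range(n)]
sex = [random.choice([0, 1]) for _ in range(n)]
pmi = [random.uniform(0, 24) for _ in range(n)]  # assumed unit: hours
expr = [1.0 + 0.03 * a + 0.4 * s + 0.01 * m + random.gauss(0, 0.2)
        for a, s, m in zip(age, sex, pmi)]

# Model 1: intercept + age + sex; Model 2: the same plus PMI
beta_without = ols([[1.0, a, s] for a, s in zip(age, sex)], expr)
beta_with = ols([[1.0, a, s, m] for a, s, m in zip(age, sex, pmi)], expr)

print("age coefficient without PMI:", round(beta_without[1], 4))
print("age coefficient with PMI:   ", round(beta_with[1], 4))
```

      In this simulation PMI is independent of age by construction, so close agreement between the two age coefficients is expected; in real data the agreement has to be checked empirically, as in Author response image 1.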

      (2) To demonstrate that their analysis is robust and that the covariates TOD and PMI are otherwise negligible - the authors should cross-validate their findings with independent datasets to confirm that the identified gene expression changes are reproducible for some tissues. For instance, the recent study by Shen et al. (Shen et al., 2024) in Nature Aging offers an excellent dataset for cross-validation, particularly for blood samples. Comparing the GTEx-derived results with this longitudinal transcriptome dataset would enable verification of gene expression changes at both the individual gene and pathway levels. Without such validation, confidence in the study's conclusions remains limited. 

      We thank the reviewer for the insightful suggestion regarding cross-validation with independent datasets. We understand that validating findings across datasets is crucial for ensuring robustness. As the reviewers suggested, we examined whether the GTEx data share findings with the study by Shen et al. (2024) in Nature Aging. However, after comparing their results with our GTEx results in whole blood, we found that the overlap of differentially expressed genes is limited (Fig. 3). We found a large proportion of age-associated genes in the GTEx data, whereas just 54 genes are age-associated in Shen et al.’s PBMC data. 3 in 7 genes are differentially expressed in both datasets (Fig. 3A). Additionally, we performed a functional enrichment analysis on the GTEx-specific age-associated genes.

      We observed a strong enrichment in the biological pathways related to neutrophil functions and innate immune responses, which are specific to the cell compositions in whole blood rather than PBMC (Fig. 3B).

      Author response image 2.

      The comparison between the gene expression of whole blood tissue from GTEx and PBMCs from Shen et al. (A) The bar plot shows the number of age (left panel) or sex-associated  (right panel) genes in the two datasets. The grey bars highlight the proportion of overlapped genes in both datasets. (B) The top 10 significantly enriched biological processes in the GTEx-specific age-associated genes. The color bar shows the number of age-associated genes in specific pathways.

      These discrepancies highlighted the crucial factors in cross-dataset comparison:

      • Cell compositions: GTEx used whole blood, which contains all blood components, including neutrophils and erythrocytes, whereas PBMCs contain lymphocytes and monocytes. Because granulocytes and red blood cells contribute to whole-blood expression, the gene expression profiles of the two datasets differ.

      • Biological functions: Whole blood includes both innate and adaptive immune components; thus, aging-related gene expression changes in whole blood may include a broader systemic response than those in PBMCs. This difference in biological context contributes to the observed variation in the differentially expressed genes, as demonstrated by our functional enrichment analysis (Fig. 3B). 

      • Sequencing biases and data processing: The two datasets were generated using different RNA-seq processing pipelines, including distinct normalization, batch correction, and quantification methodologies. These technical differences may introduce systematic variations that complicate direct cross-validation.

      Due to these fundamental differences, a direct one-to-one validation between the two datasets is challenging. We understand the importance of independent dataset validation and appreciate the reviewer’s suggestion. Future studies could make this comparison more precisely once comparable whole-blood datasets are available. In addition, GTEx provides nearly a thousand whole-blood samples, making it a large-scale, comprehensive, and clinically relevant dataset for studying aging-related changes, particularly in innate immunity and inflammation, which are not well captured in PBMCs.

      (3) As a demonstration of the lack of such validation, in the Shen et al. study (Shen et al., 2024), breakpoints at 44 and 60 years were observed in both males and females, while this study identifies two major breakpoints in males around ages 35 and 50, and one breakpoint in females at age 45. What caused this discrepancy? 

      We thank the reviewer and the editors for both raising the non-linear multi-omic aging patterns observed by Shen et al. They observed two prominent crests around the ages of 45 and 60 in omics data.

      Similarly, we also identified two breakpoints in our analysis, with some differences in the specific ages. These differences could result from sample preparation methods and from how breakpoints are defined. These responses are also included in the editor’s recommendations.

      Definition of breakpoints vs crests:

      • Crests represent age-related molecular changes at each time point across the human lifespan. They indicate the number of molecules that are differentially expressed during aging (q < 0.05), without considering individual expression levels.

      • Our breakpoints, in contrast, are identified after filtering the chronological trends using the Autoregressive Integrated Moving Average (ARIMA) model. We calculated the rate of change at each age point using a smoothing approach and sliding windows. Breakpoints are defined as local maxima whose distance to the nearest minimum, relative to the global maximum, passes our cutoff. We did find some wide local peaks around 60 in some tissues (Figure S10); however, we excluded these under our strict cutoffs to remove noise.
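
      To make the breakpoint definition above concrete, the following is a simplified sketch on a synthetic trend, not the pipeline used in the study. A centered moving average stands in for the ARIMA filtering step, and the window sizes, the relative-height cutoff `min_rel_height`, and all function names are hypothetical choices made only for illustration.

```python
import math

def moving_average(values, window):
    """Centered moving average; the window is truncated at the edges."""
    half = window // 2
    return [sum(values[max(0, i - half): i + half + 1])
            / len(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]

def rate_of_change(values, window):
    """Absolute change per year across a sliding window."""
    return [abs(values[i + window] - values[i]) / window
            for i in range(len(values) - window)]

def find_breakpoints(ages, trend, smooth_window=5, diff_window=3,
                     min_rel_height=0.3):
    rates = rate_of_change(moving_average(trend, smooth_window), diff_window)
    global_max = max(rates)
    if global_max == 0:
        return []
    found = []
    for i in range(1, len(rates) - 1):
        if rates[i] > rates[i - 1] and rates[i] >= rates[i + 1]:
            # Walk downhill on both sides to the flanking local minima
            left = i
            while left > 0 and rates[left - 1] <= rates[left]:
                left -= 1
            right = i
            while right < len(rates) - 1 and rates[right + 1] <= rates[right]:
                right += 1
            nearest = left if (i - left) <= (right - i) else right
            # Peak height above the nearest minimum, relative to the global max
            if (rates[i] - rates[nearest]) / global_max >= min_rel_height:
                found.append(ages[i + diff_window // 2])
    return found

# Synthetic trend with two sharp transitions, centered at ages 45 and 60
ages = list(range(20, 80))
trend = [1 / (1 + math.exp(-(a - 45) / 1.5))
         + 1 / (1 + math.exp(-(a - 60) / 1.5)) for a in ages]
breakpoints = find_breakpoints(ages, trend)
print(breakpoints)
```

      With a stricter `min_rel_height`, broad low peaks such as those mentioned around age 60 would be discarded, which is the behaviour the cutoff is meant to capture.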

      Differences and similarities between sequenced tissues: 

      • Whole-blood vs PBMC: In the GTEx RNA-seq data used in our study, whole blood samples from donors were sequenced, whereas their study used PBMCs. Whole blood contains all blood components, including red blood cells, platelets, granulocytes (e.g., neutrophils), lymphocytes, and monocytes, while PBMCs represent a subset of white blood cells, primarily consisting of lymphocytes (T cells, B cells, NK cells) and monocytes, excluding granulocytes and erythrocytes. As we mentioned in the previous responses, the gene expression changes observed in whole blood capture the contributions of neutrophils and other granulocytes, which are neglected in the PBMC profile (also shown in Figure S11C). 

      • For skin, the tissue shared by the two studies, we examined the non-linear changes during aging and found the same two breakpoints: 43 and 58.

      Novelties in our study:

      • Whole blood can serve as a readily accessible resource for testing age-related disease biomarkers without cell separation, making it more practical for clinical applications.

      • Our analysis was performed separately on females and males. The main objective of our analysis is to compare the differences in aging rates between sexes. Our results reveal clear sex-specific differences across multiple human tissues. Therefore, the identified breakpoints may differ when sex effects are not taken into account, highlighting the specificity of our analysis.

      • Additionally, our breakpoints are integrated across multiple tissues. Our results showed that there is a large diversity of aging patterns in different tissues.

      As the reviewers and editors suggested, we have added the following statements to clarify this distinction in the Discussion section: ‘Our analysis observed the non-linear aging patterns with two breakpoints, which is consistent with recent findings, with differences in specific age points due to sex differences as well as tissue diversities 3.’ (Page 14), and ‘These breakpoints could represent key junctures in the aging process that align with the non-linear patterns of aging and disease progression.’ (Page 15)

      (4) Although the alternative splicing analysis is intriguing, the authors did not differentiate between splicing events that alter the protein-coding sequence and those that do not. Many splicing changes occurring in the 5' UTR and 3' UTR regions do not impact protein coding, so it is essential to filter these out and focus specifically on alternative splicing events that can modify protein-coding sequences. 

      The reviewer raises an important point. In our study, we included the AS events in protein-coding genes to gain a comprehensive understanding of sex-biased age-associated splicing. As the reviewer suggested, focusing on coding-sequence-altering events is particularly relevant to protein function. To address this, we performed an additional analysis to specifically annotate sBASEs occurring within the coding sequence (defined as CDS-altering sBASEs) and reanalyzed their functional pathways and AD associations (Author response image 3).

      Our analysis revealed that most of the sBASEs are relevant to protein-coding sequences (CDS) across multiple tissues (Author response image 3A). We then confirmed our findings using CDS-altering sBASEs. We found that those sBASEs in brain regions were significantly enriched in pathways related to amyloid-beta formation and actin filament organization (Author response image 3B). Notably, male-biased sBASEs in decision-related brain regions were particularly associated with dendrite development and regulation of cell morphogenesis, highlighting the sex-specific roles of sBASEs in brain functions. Additionally, we performed a random forest classification using only CDS-altering sBASEs in AD datasets (Author response image 3C-D), again confirming the male-biased association between aging and AD.

      Overall, we found that most of the identified sBASEs could modify protein-coding sequences, and our main conclusions remain consistent even after filtering out non-coding events. 

      Nevertheless, in addition to AS events that impact protein sequences, alternative splicing in untranslated regions (UTRs) also plays a critical regulatory role. Splicing events in the 5′ UTR can influence translation efficiency by modifying upstream open reading frames (uORFs) or RNA secondary structures, while splicing in the 3′UTR can affect mRNA stability, localization, and translation by altering microRNA binding sites and RNA-binding protein interactions. Given these functional implications, we believe that UTR-targeted AS events should also be considered to supplement the understanding of post-transcriptional gene regulation in future research.

      Author response image 3.

      The distribution and functional relevance of sBASEs with coding effects. (A) The number of sBASEs and CDS-altering sBASEs across multiple tissues. The darker bars show the number of sBASEs whose alternative splice sites are located in protein-coding regions. (B) GO biological pathways in each sex and brain region. Heatmap shows the sex-specific pathways that are significantly enriched by CDS-altering sBASEs in more than 2 brain regions and sex. (C) Correlation between AD-associated and age-associated AS changes across the CDS-altering sBASEs in females and males. (D) Performances of sex-stratified models predicted by CDS-altering sBASEs in 100 iterations using the random forest approach.

      (5) One of the study's main conclusions - that "male-biased age-associated AS events have a stronger association with Alzheimer's disease" - is not supported by the data presented in Figure 4A, which shows an association with "regulation of amyloid precursor formation" only in female, not male, alternative splicing genes. Additionally, the gene ontology term "Alzheimer's disease" is absent from the unbiased GO analysis in Figure S6. These discrepancies suggest that the focus on Alzheimer's disease may reflect selective data interpretation rather than results driven by an unbiased analysis. 

      We thank the reviewer for this point. In our functional analysis, we identified distinct biological processes enriched in female- and male-biased AS genes, such as the regulation of amyloid precursor formation in females and structural constituents of the cytoskeleton in males. However, Alzheimer’s disease (AD) is a complex neurodegenerative disorder with multiple pathological mechanisms beyond amyloid-beta (Aβ) formation, many of which are strongly age-related in both sexes. This complexity motivates us to explore novel relationships between splicing and AD in distinct sexes.

      Although Figure 4A shows the enrichment of “regulation of amyloid precursor formation” in female-biased AS events, this does not contradict the broader enrichment of AD-related processes in male-biased AS events. Our disease ontology analysis supports this finding, as male-biased age-associated AS events are enriched in neurodegenerative diseases, including cognitive disorders. Additionally, we considered not only individual GO terms but also the disease-associated transcriptomic signatures from AD-related datasets, which collectively indicate a stronger association in males. 

      Regarding Figure S6 mentioned by the reviewer, the GO term “Alzheimer’s disease” is not explicitly listed in the heatmap because we kept only the pathways that are consistently enriched in multiple tissues. As noted in the figure legend, we only displayed sex-specific GO terms that were significant in at least 15 tissues. Because the brain is highly affected by age-related processes and neurological conditions show sex differences, sex-biased AS events could help explain differential susceptibility to age-related cognitive decline and neurodegeneration. This is why we chose the brain data for detailed analysis.

      To improve clarity, we have revised the text to describe the purpose of our analysis in brain rather than other tissues (Page 6-7). We appreciate the reviewer’s feedback, and we will consider additional analyses to further explore the sex-biased AS as well as disease risk in other tissues.

      (6) The experimental data presented in Figures 5E - I merely demonstrate that estrogen receptor regulates the expression of two splicing factors, SRSF1 and SRSF7, in an estradiol-dependent manner. However, this finding does not support the notion that this regulation actually contributes to sex-dimorphic alternative splicing changes during human aging. Notably, the authors do not provide evidence that SRSF1 and SRSF7 expression changes actually occur in a sex-dependent manner with human aging (in a manner similar to TIA1). As such, this experimental dataset is disconnected from the main focus of the study and does not substantiate the conclusions on sex-dimorphic splicing during human aging. The authors performed RNAseq in wild-type and ER mutant cells, and they should perform a comprehensive analysis of ER-dependent alternative splicing and compare the results with the GTEx data. It should be straightforward. 

      We thank the reviewer for this feedback. The main purpose of the analyses in Figures 5E-I was to explore which factors affect the sex-biased expression of splicing factors during aging and substantially regulate alternative splicing (AS). To address the reviewer’s concerns, we have included additional analyses and explained the challenge of linking estrogen receptor (ER)-regulated splicing factors to sex-dimorphic AS changes during human aging in specific human cell types.

      • As suggested by the reviewer, we first examined the expression changes of SRSF1 and SRSF7 during aging in males and females, like TIA1 in decision-related brain regions (Fig. 5I).

      • Secondly, the regulation is based on a highly complex regulatory network involving multiple splicing factors and cell heterogeneity. Due to these complexities, we did not initially overlap ER-dependent AS changes with sBASEs from GTEx datasets directly. To address the reviewer’s concern, we supplemented the AS analysis in the GSE89888 dataset (Fig. 5H) and identified the estrogen-regulated AS events mediated by ESR1. We found that ~6% (26/396) of female-specific age-associated AS events were regulated by ESR1, of which 6 sBASEs can be regulated by female-biased splicing factors. The low overlap could be explained by the limited coverage of the different RNA-seq datasets and cell types used across these analyses. Notably, the results indicated that only a fraction of AS events could be directly accounted for by estrogen via ESR1, suggesting the complexity of transcriptional and splicing regulatory networks during aging.

      • Meanwhile, we downloaded independent experimental datasets to examine regulation by our candidate splicing factors. Because SRSF1 was identified as a potential regulator of sex-biased splicing, we analyzed RNA-seq data from SRSF1 knock-down (KD) glioblastoma cell lines (U87MG and U251); glioblastoma is a type of brain cancer formed from astrocytes, which support nerve cells 4. We found that some of the sBASEs that change during aging are regulated by SRSF1 in these brain cell lines (Author response image 4). Together, these results suggest that some of the SF-RNA regulatory relationships can be observed in another cellular system, further supporting our findings.

      Due to the limitations of cell-based models and the complexity in the splicing regulatory network, it is challenging to directly validate aging regulation, particularly between different sexes, based on ER treatments in vivo. However, our findings still provide valuable mechanistic insights into ER-regulated splicing factors, implying their potential role in sex-biased aging.

      Author response image 4.

      SRSF1 regulations on specific sBASEs using SRSF1 knock-down RNA-seq data in GBM cells. Three examples are shown to be regulated during aging with significant changes between SRSF1 KD vs control in U251 and U87MG cell lines. The splicing diagrams are shown below.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors): 

      The authors found that alternative splicing was affected by both sex and age across many tissues, with gene expression differences affected by both parameters only present in some tissues. This trend was consistent when the effects of sex chromosomes were subtracted from the analysis. The effect of aging on differential gene expression and alternative splicing was more prevalent in male than female samples. For analysis purposes, young subjects were deemed to be anyone under 40, and old subjects were over 60 years old. The authors then investigated if specific genes or alternative splicing events were responsible for these effects. Some candidate genes or splicing events were identified but there was little overlap between tissues, suggesting no universal gene or event as a driver of aging. Surrogate variables like the ethnic backgrounds of donors were also investigated. Ultimately the authors found that alternative splicing events showed a stronger sexual dimorphic effect with age than did differential gene expression and that at least for the brain, alternative splicing changes showed a bias for Alzheimer's disease in male samples. This was highlighted by examples of exon skipping in SLC43A2 and FAM107A in males that were associated respectively with plaques and tangles.

      The authors go on to identify sexual dimorphic differences in splicing factors in particular brain regions during age. Finally, the authors performed analysis for aging-modulated genes, identifying nearly 1000 across the tissues, nearly 70% of which are sex-specific. Their work suggests that further analysis of these aging-modulated genes could be differentially modulating the transcriptome based on sex. The work is novel and interesting, especially investigating sexual dimorphism in alternative splicing. However, the work is still preliminary, and these assumptions need to be applied to other data sets beyond GTEx for validation as well as some other phenomena that need to be considered. I recommend major revisions to address the points below. 

      (1) At the beginning of the results section, the authors state that the brain is stratified into four functional regions. It would be useful to explicitly state those four regions in the text at that point. 

      We agree that specifying these regions early in the text will improve clarity and provide the reader with a clear understanding of the analysis. Following the reviewer’s suggestion, we revised the Results section (Page 3) to explicitly state the four functional brain regions as follows: ‘Due to data sparseness, the brain tissues were recombined into four functional regions (table S1), including hormone- or emotion-related region, movement-related region, memory-related region, and decision-related region (See Methods).’. This ensures that the regions are clearly defined before the subsequent analysis is presented.

      (2) The manuscript becomes a bit confusing when the authors shift from all the tissues as a whole specifically to the brain and then back to the larger tissue set to make assumptions. This can be a bit confusing and should be better delineated.

      We thank the reviewer and editor for the feedback regarding the transitions between the analysis of all tissues and the brain-specific analysis. In our study, we first conducted a broad analysis of alternative splicing (AS) and gene expression (GE) across all tissues. For the AS analyses, we analyzed sBASEs in all tissues and then focused on a specific tissue (i.e., brain) whose splicing changes are functionally enriched in age-related diseases. For the GE analyses, we also analyzed the aging rate across tissues and identified the tissue-specific and shared patterns.

      We agree that the shifts of the tissues for AS and GE may cause some confusion, and have made the following revisions to delineate why we focused on different tissues for distinct analyses:

      • We have added clear statements to better delineate when we shift focus from the analysis of all tissues to the region-specific analysis and vice versa. For instance, in the Results section (Page 6-7), we include a transitional phrase: ‘Having established patterns across all tissues, we now turn to a more focused analysis to investigate tissue-specific alternative splicing changes.’

      • To improve the overall structure, we have reorganized the Results section, adding distinct subheadings for the analysis of all tissues and the brain (Page 6), which should make the transition between these sections smoother and more intuitive for the reader.

      We believe that these revisions will make the manuscript’s structure clearer and allow the reader to better follow the flow of the analysis and the subsequent conclusions.

      (3) Gene-length-dependent transcription decline (GLTD) is another phenomenon that occurs with aging and is known to be associated with Alzheimer's disease [PMID38519330]. The authors should make some statement if this is present in their dataset and if any sexual dimorphism in tissues is present. 

      We thank the editors and reviewers for bringing up the possible connection of gene-length-dependent transcription decline (GLTD), which was reported to be associated with both aging and Alzheimer’s disease (AD). We appreciate the reviewer’s suggestion and have addressed whether GLTD is present in our dataset and whether any sex differences are observed in this context.

      We evaluated GLTD using the correlation between gene length and age-associated changes (i.e., the coefficients of the ‘age’ term in the linear regression model) in GTEx data. We observed strong evidence of GLTD, particularly in the brain, heart, muscle, pancreas, spleen, and skin, among others (Author response image 5A). In brain, we performed a functional enrichment analysis on the genes with fold change > 2 and length > 10^5 bp (Author response image 5B). We found that these extremely long genes are significantly relevant to synapse and neuron functions. These findings align with previous studies showing that GLTD can occur with aging in the tissues that are relevant to Alzheimer’s disease, cardiovascular diseases, and common failures of metabolism (e.g., diabetes) [5,6]. However, GLTD was not ubiquitous across all tissues: the correlations could be positive in tissues such as adipose and artery. These findings suggest that GLTD varies and is tissue-specific in its manifestation during aging.

      Author response image 5.

      (A) The correlation between gene length and age-associated changes across GTEx tissues in human samples. The correlation tests are evaluated using Spearman’s approach. The color bar indicates the -log10 transformed p-values in the correlation test. (B) The results of GO enrichment analysis using the genes with fold change > 2 and length > 10^5 bp. The parent terms calculated by ‘rrvgo’ with a similarity threshold of 0.9 are shown.
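
      As a rough illustration of the GLTD test described above, the sketch below computes Spearman's rank correlation between gene length and the age coefficient for a handful of made-up genes. The gene lengths and coefficients are hypothetical toy values, not GTEx results, and the helper functions are written out only so the example is self-contained; a negative correlation means that longer genes tend to decline more with age.

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical toy data: gene length (bp) and age coefficient per gene
gene_length = [2e3, 1.5e4, 4e4, 1.2e5, 3e5, 9e5]
age_coef = [0.004, 0.001, -0.002, -0.001, -0.006, -0.009]

rho = spearman(gene_length, age_coef)
print("Spearman rho:", round(rho, 3))
```

      Running the same calculation separately on female and male samples gives the sex-stratified comparison; the significance test is left out of this sketch.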

      Regarding sexual dimorphism, we conducted this analysis separately in females and males (Author response image 6). We found that GLTD exists in both females and males in most tissues, such as brain, whole blood, and muscle, consistent with the previous results obtained without stratifying by sex. Interestingly, we observed sex-biased patterns in certain tissues. In particular, the left ventricle, pancreas, and hippocampus showed notable male-biased patterns in the degree of transcriptional decline with gene length, whereas skin, liver, small intestine, and esophagus showed female-biased patterns. These findings suggest that GLTD could be relevant to aging and age-related diseases, and that its extent and sexual dimorphism may vary depending on the tissue type. We hope this clarification addresses the reviewer’s concern and provides a more comprehensive understanding of the GLTD and sex differences observed in our dataset.

      Author response image 6.

      The correlation between gene length and age-associated changes across tissues in females and males, analyzed separately. The correlation tests are evaluated using Spearman’s approach. The red dots indicate the significant correlations in females, while the navy dots show those in males.

      (4) Because the majority of this work has been performed in the GTEx dataset, applying this analysis to another publicly available dataset would be useful validation. For instance, the authors have interesting findings in the brain and correlations to Alzheimer's disease. Analysis of an existing RNAseq dataset from Alzheimer's disease patients and controls (with functional outcomes) would provide more evidence beyond the preliminary findings from GTEx. 

      We appreciate the reviewer’s suggestion on the validation of our findings by applying our analysis to independent RNA-seq datasets from Alzheimer’s disease patients. 

      • We have used two Alzheimer’s disease datasets, GEO and ROSMAP, to investigate the correlation between aging and Alzheimer’s disease (AD) and included these analyses in our study (Fig. 4B-C and Figure S8C).

      • In the Results section (Page 7), we have presented the results of this validation, where we identified correlations between sex-biased aging-related splicing changes and AD-related changes. These findings support the conclusions from the GTEx dataset and further strengthen the relevance of our results to AD.

      As suggested, we have updated the manuscript to more explicitly highlight this validation in the Discussion section (Page 12), noting: ‘We further validated our findings using Alzheimer’s disease dataset, ROSMAP, where we observed consistent correlations between aging-related splicing changes and Alzheimer’s disease-related changes, providing additional evidence for the robustness of our results.’ 

      Reviewer #2 (Recommendations for the authors): 

      (1) In the text (Introduction and Discussion), the authors mention analyzing 54 tissues, the abstract states 35 tissues, Table S1 lists 48, and Figure 2A-B shows 33. Could the authors please clarify exactly how many tissues they used? I am also confused by the sample numbers in Table S1. For example: for adipose-subcutaneous tissue, the total number of females is listed as 218 but the sum of young and old females is only 110. Does this mean some samples were excluded? What is the exclusion criterion?

      We thank the reviewers and editors for pointing out the discrepancies regarding the number of tissues analyzed and the sample numbers in Table S1. We appreciate the opportunity to clarify these points:

      Number of tissues analyzed:

      • We downloaded and analyzed 17,382 samples in 54 tissues from GTEx in total (31 tissues and 13 brain regions), as mentioned in the Results, Methods, and Discussion sections. Table S1 lists 48 tissues (31 tissues, 13 brain regions, and 4 merged brain regions), which include a refined classification of the tissues we analyzed, accounting for the variations in brain region categorization in the dataset.

      • The discrepancy also arises from the different sample size cutoffs in specific analyses. For pcSVR analysis (Figure 2A-B), we did the subsampling for the permutation analysis for certain key findings, so we filtered a subset of 33 tissues (29 tissues and 4 merged brain regions), which included at least 3 samples in each age group in females or males. 

      • To resolve this, we have clarified the total number of tissues analyzed and aligned the numbers across the manuscript. In the revised manuscript, we now explicitly state in both the Abstract and Methods sections that 54 tissues were analyzed in the context of this study. We added a note in Methods to clarify that the 35 tissues comprise 31 tissues and 4 merged brain regions (Page 16). For Figure 2A-B, we clarified that the 33 tissues were selected for use in this analysis (Page 17).

      Sample numbers in Table S1:

      • Regarding the sample sizes of age groups, the discrepancy occurred due to the classification of the age groups. We classify the samples into three: Young, Middle, and Old, as mentioned in the Results section (Page 4). 

      • Additionally, we excluded the sample sizes in 13 single brain regions. We aligned the total tissue number to 35 with our texts.

      We hope this resolves the confusion regarding the number of tissues and the sample sizes used in the analysis. These clarifications have been incorporated into the revised manuscript to ensure consistency.

      (2) Was post-mortem interval (PMI) or manner of death considered in the model? For example, traumatic death may have major consequences on gene expression. Similarly, a few tissues have low sample numbers, for example, kidney cortex and brain. The pooling of brain samples is explained and the kidney cortex is excluded, so why is it listed in Table S1? 

      Thank you for raising this important point regarding the potential impact of post-mortem interval (PMI) and manner of death (DTHMNNR) on gene expression. We carefully considered both factors as potential confounders in our analysis. 

      Specifically, to evaluate their impact, we calculated the correlations between the coefficients of PMI or manner of death and the confounding factors. Our results showed that PMI and DTHMNNR are significantly correlated with the covariates in most tissues, suggesting that their effects could be effectively regressed out in our model (Figure S4). As shown in Figure S4 and Author response image 1, we also conducted a differential analysis that incorporated PMI as a covariate in the regression models and re-evaluated the age- and sex-related transcriptomic changes to address this concern. The high correlations indicate that PMI has only a minor effect once the covariates are included in the model. As suggested by the reviewers and editors, we have now included this correlation analysis in Figure S4C-E and updated the text in the Results section (Page 5).

      Additionally, as noted in the responses above, Table S1 provides the general sample sizes of all GTEx tissues without filtering. We have modified the table to include a total of 35 tissues, comprising 31 non-brain tissues and 4 brain regions.

      (3) It might be important to show a simple visual of cohort details such as age ranges, sexes, ethnicities, PMIs, etc. 

      To address this, we added summary figures to illustrate the distributions of key demographic variables, including age, sex, BMI, ethnicity, post-mortem intervals (PMIs), and manner of death (DTHMNNR) (Author response image 7 and Author response image 8). This will provide readers with a clearer overview of the dataset composition and potential covariates affecting the analysis. 

      Author response image 7.

      Age (left panel), BMI (Body Mass Index) (middle panel), and PMI (Post-Mortem Interval) (right panel) distribution in GTEx v8 cohort.

      Author response image 8.

      Sex (left panel), ethnicity (middle panel), and manner of death (DTHMNNR) (right panel) distribution in GTEx v8 cohort.

      (4) Since this study is highly correlative, it is impossible to determine if the findings hold true without an independent cohort validation or experimental validation. They used the ROSMAP cohort for AD samples, and some splicing factors regulation but the generalizability to the age and sex effects have not been independently tested.

      The reviewer raises an important point regarding the independent validation of sex- and age-associated splicing changes associated with AD. We used GTEx primarily because it includes approximately 17,000 RNA-seq samples across multiple human tissues, making it the most comprehensive public resource for studying population-level differences in age and sex. In particular, its large-scale brain samples provide a unique opportunity to analyze transcriptomic changes in sex-dimorphic aging.

      We understand the reviewer’s concern that our findings are mainly supported by correlative evidence, which could be affected by dataset-specific biases. However, cross-validating transcriptomes across different datasets raises several technical issues, including limited comparability due to cell type heterogeneity, postmortem artifacts, and sequencing biases.

      Specifically, GTEx data is bulk RNA-seq that does not capture cell-type-specific transcriptomic changes. Given the cellular complexity of the brain and other tissues, observed differences in gene expression and splicing may be influenced by shifts in cellular composition rather than intrinsic transcriptional regulation. For example, we compared our results from GTEx whole blood with the analysis using an external dataset from Peripheral Blood Mononuclear Cells (PBMCs) provided by Shen et al. (2024) [3] (Author response image 2).  We observed limited overlap in differentially expressed genes between these datasets (probably because the whole blood contains diverse immune cell populations), highlighting the challenges in cross-dataset validation due to differences in tissue composition and sample processing.

      Therefore, we applied surrogate variable analysis (SVA) to minimize technical and biological confounders. This helped reduce biases ranging from genetic background to hidden batch effects, including postmortem artifacts, sequencing biases (Figure S4), and other covariates, and helps us assess whether sex-biased splicing events are biologically meaningful rather than technical artifacts.

      In addition, to address the reviewer’s concern on the splicing factor regulation, we managed to find a dataset in decision-related brain regions. Due to the limitation of human brain data covering different age and sex groups, we used mouse hippocampus datasets, including young and old, as well as female and male groups [7].  The analysis of protein levels from MS data identified sex-biased age-associated splicing factors, including Srsf1 and Srsf7.  We found that the changes are consistent with the findings from GTEx (Author response image 9), aligning with our sex-biased splicing factor expression during aging in the same region of the human brain. This cross-species consistency supports the robustness of our findings in human brain aging.

      Author response image 9.

      Protein levels of some male-specific splicing factors in human hippocampus quantified using MS data. The Y-axis shows the protein intensity. Different facets mean different sample batch sets. The yellow boxes indicate the protein levels in the young group, while the brown boxes indicate those in the old group.

      In summary, despite the inherent limitations of RNA-seq studies in sex- and age-related transcriptomics, we have made our best efforts to address these concerns through comparisons with external datasets, statistical corrections, and validation using proteomic data. We appreciate the reviewer’s feedback and include additional discussion on these points (Page 13). 

      (5) Are AS predictions from short-read data accurate enough to make the predictions the authors report? 

      The reviewer is correct that short-read sequencing has inherent limitations in reconstructing full-length isoforms. However, the higher sequencing depth of short reads makes them a better choice for quantifying the relative change of each AS event across different conditions. As a result, short-read data are extensively used in the splicing field to measure AS changes quantitatively. For this reason, we focused on the levels of alternative splicing events rather than on the quantification of full-length isoforms. We used a series of stringent filters in our analyses to increase the reliability of our results.

      Specifically, as mentioned in the Methods section, we filtered by junction read counts (JC), so that the JC of most differential AS events was higher than 10. We also used our GPU-based gene expression quantification tool, Paean, which performed better in cross-validation with quantitative RT-PCR results and whose results are consistent with other pipelines. We cited an updated version of Paean that includes a comparison with other tools for analyzing AS. The manuscript on the new Paean version is under review at another journal, and we included the PDF of that manuscript (Fig. 3 in the Paean manuscript) in the revised documents.

      (6) Along the same lines, the finding that male age-related AS events are linked to Alzheimer's disease somewhat contradicts epidemiological studies that show that even after adjusting for age, women still have a greater risk of developing Alzheimer's than men. The authors show a significant overlap with AD GE events in females but don't explain the discrepancy. 

      We appreciate the editor’s comment regarding these discrepancies with the epidemiological studies. Previous studies suggested that Alzheimer’s disease (AD) shows sex differences in its phenotypes, including cognitive decline and brain atrophy [8]. Analyses of the sex and age effects in AD are indeed complex and depend on the criteria used in each study (molecular GE or AS data vs. epidemiological data), probably because it is difficult to capture how environmental exposures interact with biological pathways. We would like to raise three related points regarding this concern, which are also discussed in the revised manuscript.

      • As we have mentioned in the Discussion section, an early study investigated the relationship between age, sex, and cognitive function in a large cohort of 17,127 UK Biobank participants [9]. Their study highlighted more apparent age-related changes in cognitive function among men, suggesting a potential vulnerability of men to cognitive decline with age.  Their main conclusion is consistent with our findings. 

      • While men and women can both suffer from Alzheimer's disease, women are more likely to be diagnosed, possibly due to longer lifespans and potential differences in brain structure or other factors. Although women exhibit a higher overall risk of AD, they may also have distinct molecular compensatory mechanisms that influence disease progression. 

      • To avoid the age effect, in our AD datasets, including ROSMAP, we filtered the samples over 90 years old to match the number of both sexes and the age distribution between the AD and control groups. Our analysis avoided the age biases in comparing AD and control, suggesting the crucial roles of sBASEs in AD during male aging.

      Moreover, for gene expression (GE), we showed distinct patterns of AD-related genes in females compared with AS. These two molecular processes do not necessarily have the same functional impact. AS changes may precede or contribute to disease onset in different ways compared to GE alterations. Our study proposes mechanisms linking cognitive disorders and alternative splicing (AS) at a higher molecular resolution.

      (7) Could the authors explain which sBASE subset they used for their random forest prediction model and what was the rationale? 

      We apologize for omitting the details of how sBASEs (sex-biased age-associated splicing events) were selected for the random forest prediction model. We specifically used sBASEs that exhibited sex-biased changes in splicing associated with aging. This subset was restricted to sBASEs that could also be detected in the ROSMAP AD dataset, because sequencing depths and technical biases differ across datasets. These sBASEs were then used as input to a prediction model with recursive feature elimination (RFE) for feature selection, and their contributions were evaluated. In the revised manuscript, we added the details of this selection in the Methods (Page 7).

      (8) The breakpoint analysis is particularly interesting. Can this be speculated to correlate with the recent non-linear multi-omic aging patterns observed by Shen et al in Nature Aging? 

      Thank you for highlighting the interesting aspects of our breakpoint analysis and suggesting its potential correlation with the non-linear aging patterns observed by Shen et al. 

      Shen et al. observed two prominent crests around the ages of 45 and 60 using omics data. Similarly, we also identified the non-linear aging patterns with two breakpoints in our analysis. However, there are some notable differences in specific breakpoints between these two studies, resulting from the breakpoint definition, as well as the sample preparations. According to the response in Author response image 2, the differences come from the following aspects:

      The definition of breakpoints vs crests:

      • Crests represent age-related molecular changes at each time point across the human lifespan. They indicate the number of molecules that are differentially expressed during aging (q < 0.05), without considering individual expression levels.

      • Our breakpoints, in contrast, are identified after filtering the chronological trends based on the expression levels and calculating the rate of change at each age point using sliding windows. Breakpoints are defined as local maxima where the distance to the nearest minimum, relative to the global maximum, exceeds 10%. We did find some wide local peaks around 60 in some tissues, as shown in Figure S10; however, we excluded these because of our strict cutoffs.
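
      As a rough illustration of the breakpoint rule described above, the following Python sketch flags local maxima of a rate-of-change curve whose height above the nearest local minimum exceeds 10% of the global maximum. It assumes the sliding-window rate of change has already been computed, and it interprets "distance to the nearest minimum" as a difference in height; the function name and that interpretation are assumptions for the example, not the authors' code.

```python
def find_breakpoints(ages, rate, min_rel_height=0.10):
    """Return ages at which the rate-of-change curve has a qualifying peak.

    ages: age at each point of the curve.
    rate: rate of change at each age (assumed precomputed with sliding windows).
    A local maximum qualifies when (peak - nearest local minimum) / global
    maximum exceeds min_rel_height. "Nearest" is measured along the age axis.
    """
    n = len(rate)
    # Interior local maxima and minima (endpoints are ignored).
    maxima = [i for i in range(1, n - 1) if rate[i - 1] < rate[i] >= rate[i + 1]]
    minima = [i for i in range(1, n - 1) if rate[i - 1] > rate[i] <= rate[i + 1]]
    global_max = max(rate)
    if not minima or global_max <= 0:
        return []

    breakpoints = []
    for i in maxima:
        nearest_min = min(minima, key=lambda j: abs(j - i))
        if (rate[i] - rate[nearest_min]) / global_max > min_rel_height:
            breakpoints.append(ages[i])
    return breakpoints
```

      With a strict threshold, shallow or wide peaks are dropped, which is one way a peak visible in a plot can be absent from the reported breakpoints.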

      The sequenced biosamples: 

      • Whole-blood vs Peripheral Blood Mononuclear Cells (PBMC): As mentioned in previous responses, in GTEx, whole blood samples from donors were sequenced, whereas their study used PBMCs. Whole blood contains all blood components, including red blood cells, platelets, granulocytes (e.g., neutrophils), lymphocytes, and monocytes, while PBMCs only represent a subset of white blood cells, primarily consisting of lymphocytes (T cells, B cells, NK cells) and monocytes, excluding granulocytes and erythrocytes. Gene expression changes observed in whole blood capture the contributions from neutrophils and other granulocytes, which are absent in PBMC analyses (as shown in Figure S11C and Author response image 2). Additionally, whole blood can serve as a readily accessible biomarker source for testing age-related diseases without the need for cell separation, making it a more practical option for clinical applications.

      • The two studies share one tissue, skin. There, we examined the non-linear changes during aging and found the same two breakpoints: 43 and 58.

      Sex-specific analysis in females and males:

      • The main objective of our analysis is to compare the differences in aging rates between sexes. Notably, the identified breakpoints may differ when sex effects are not taken into account, highlighting the importance of analyzing males and females separately.

      We have added the following statements to further clarify this connection: ‘Our analysis observed the nonlinear aging patterns with two breakpoints, which is consistent with recent findings (Nature Aging, 2024), with differences in specific age points due to the sex differences as well as tissue diversities.’ (Page 14), and ‘These breakpoints could represent key junctures in the aging process that align with the non-linear patterns of aging and disease progression.’ (Page 15)

      (9) Minor - the authors should refer to figures in the Discussion. They do so in some cases but this needs to be more extensive. 

      Thank you for pointing this out. In response, we have reviewed the Discussion section and added references to relevant figures where appropriate. In the section discussing the discrepancies between the profiles of GE vs. AS, we now refer to Figure 3 to highlight the earlier onset of different transcriptomic resolutions (Page 12). When describing the sex-specific age-associated AS changes and their associations with Alzheimer’s disease, we have added references to Figure 4 (Page 12). In the discussion of estrogen-mediated regulation of splicing factors, we have referred to Figure 5A, which details the construction of the RBP-RNA regulatory network integrating multi-dimensional data obtained through several orthogonal state-of-the-art approaches (Page 14).

      References:

      (1) Ferreira, P.G. et al. The effects of death and post-mortem cold ischemia on human tissue transcriptomes. Nature communications 9, 490 (2018).

      (2) Wucher, V., Sodaei, R., Amador, R., Irimia, M. & Guigó, R. Day-night and seasonal variation of human gene expression across tissues. PLoS Biology 21, e3001986 (2023).

      (3) Shen, X. et al. Nonlinear dynamics of multi-omics profiles during human aging. Nature aging, 116 (2024).

      (4) Zhou, X. et al. Splicing factor SRSF1 promotes gliomagenesis via oncogenic splice-switching of MYO1B. The Journal of clinical investigation 129, 676-693 (2019).

      (5) Soheili-Nezhad, S., Ibáñez-Solé, O., Izeta, A., Hoeijmakers, J.H. & Stoeger, T. Time is ticking faster for long genes in aging. Trends in Genetics 40, 299-312 (2024).

      (6) Brouillette, M. Gene length could be a critical factor in the aging of the genome. Proceedings of the National Academy of Sciences 121, e2416630121 (2024).

      (7) Keele, G.R. et al. Global and tissue-specific aging effects on murine proteomes. Cell reports 42 (2023).

      (8) Ferretti, M.T. et al. Sex differences in Alzheimer disease—the gateway to precision medicine. Nature Reviews Neurology 14, 457-469 (2018).

      (9) Foo, H. et al. Age-and sex-related topological organization of human brain functional networks and their relationship to cognition. Frontiers in aging neuroscience 13, 758817 (2021).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this study, Yue et al. re-processed publicly available DNA methylation data (published in 2012 and 2017 from the Meissner lab) from pre- and post-implantation mouse embryos. Against the global wave of genome-wide reduction of DNA methylation occurring during pre-implantation development, they detected a slight increase (~1% on average) of DNA methylation at gene promoter regions during the transition from 8-cell to blastocyst stage. They claim that many such promoters are located in the X chromosome. Subsequently, they knocked down Dnmt3b (presumably because of its upregulation during the transition from the 8-cell to blastocyst stage) and detected the aberrant patterning of H3K27me3 in the mutant female embryos. Based on this observation, they claim that imprinted X-chromosome inactivation is impaired in the Dnmt3b-Kd pre-implantation embryos. Finally, they propose a model where such an increase of DNA methylation together with H3K27me3 regulates imprinted X-chromosome inactivation in the pre-implantation embryos. While their observation is of potential interest, the current version of the work fails to provide enough evidence to support their conclusions. Below are suggestions and comments on the manuscript.

      Major issues:

      (1) Sex of the embryos of the genome-wide bisulfite-sequencing data

      The authors re-analyzed publicly available genome-wide DNA methylation data from the Meissner lab published in 2012 and 2017. The former used reduced representation bisulfite sequencing (RRBS) and the latter used whole-genome bisulfite sequencing (WGBS). Based mainly on the RRBS data, Yue et al. detected de novo DNA methylated promoters during the transition from 8-cell to blastocyst against the global wave of genome-wide DNA demethylation. They claim that such promoter regions are enriched at the "inactive" X chromosome. However, it would be difficult to discuss DNA methylation at inactive X-chromosomes as the RRBS data were derived from a mixture of male and female embryos. It would also be notable that the increase of DNA methylation at these promoter regions is ~1% on average. Such a slight increase in DNA methylation during pre-implantation development could also be due to the developmental variations between the embryos or between the sexes of embryos.

      Thanks so much for your insightful comments. Whether de novo DNA methylation occurs in a sex-dimorphic manner would be of significance for our study. Based on your comments, we have added a reanalysis based on a publicly available single cell multi-omics sequencing (COOL-seq) data of mouse early embryos (Guo et al., 2017). The results showed that both male and female embryonic cells gain DNA methylation during the transition from the 8-cell to ICM (Figure 1—figure supplement 1C-D; Lines 112-115 in the revised manuscript).

      With regard to the increase in the promoter region, many previous studies have revealed that promoters and overlapping CGI regions, especially high-CpG promoters, consistently show low levels of DNA methylation (Auclair et al., 2014; Borgel et al., 2010; Dahlet et al., 2020). These relatively low basal levels make the increase appear slight. Thus, we added relevant statements to clarify this information and rewrote the sentences in the revised manuscript (Lines 116-118, 125-127 in the revised manuscript).

      In addition, using the single cell COOL-seq data, we also specifically reanalyzed the DNA methylation changes on the X chromosome in female embryos. The X chromosome showed a more notable increase than that on autosomes, and the female X chromosome showed a higher DNA methylation level than that of the male (Figure 3—figure supplement 2A-B; Lines 203-206 in the revised manuscript).

      Thanks again for your insightful and constructive comments that significantly strengthen our evidence. We have added these results in the revised manuscript.

      (2) Imprinted X-chromosome inactivation and evaluation of H3K27me3 (related to Figures 2C, D; 3F; Figure2-supplement 2 F, G; Figure3-supplement 3G)

      Based on the slight change in the H3K27me3 signals in the Dnmt3b-Kd blastocysts, the authors claim that imprinted X-chromosome inactivation is impaired in the mutant embryo. It would be not easy to reach this conclusion from such a rough analysis of H3K27me3 presented in Figure 2C, D. Rigorous quantification/evaluation of the H3K27me3 signals in the Dnmt3b-Kd embryos should be considered. Additional evidence for the impairment of H3K27me3 in the mutant embryos should also be provided (expression of a subset of X-linked genes by RNA-FISH or RT-PCR etc.). Though technically challenging, high-resolution genome-wide approach such as ChIP-seq of H3K27me3 in the Dnmt3b-kd female embryos (with traceable SNPs between maternal and paternal X chromosome to distinguish inactive and active X-chromosome) could more precisely evaluate regions that lose H3K27me3 in the X-chromosome (de novo DNA methylated promoters from 8-cell to blastocyst, for example).

      Thanks so much for your insightful comments that make our results more convincing. The H3K27me3 domain is a classic marker for the establishment of XCI, which is achieved through X chromosome-wide heterochromatinization and transcriptional repression (Chow and Heard, 2009; Heard et al., 2004; Huynh and Lee, 2005). Thus, in the present study, we performed immunostaining for H3K27me3 domains to evaluate the iXCI status in blastocysts, as previously reported (Fukuda et al., 2014; Gontan et al., 2018; Inoue et al., 2010; Tan et al., 2016). Based on your comments, we have added another statistical method to quantify the establishment of iXCI, i.e., the percentage of H3K27me3-positive and -negative cells among total trophoblast cells in female blastocysts with or without Dnmt3b knockdown. The result also indicated that Dnmt3b knockdown led to a significant loss of H3K27me3 domains from total trophoblast cells. Similarly, new data based on statistical analyses of total trophoblast cells have also been added to the results for Dnmt3b knockout and 5-aza-dC treatment (Figure 3F; Figure 3—figure supplement 3D, H in the revised manuscript).

      To clarify the significance and reliability of detecting H3K27me3 domains, we have added a schematic diagram depicting the process of iXCI initiation and establishment, as well as the experimental design and workflows, to make our results easier to understand (Figure 3C in the revised manuscript).

      In addition, we agree with your comment that additional evidence will benefit the conclusion. Thus, we have reanalyzed the RNA-seq and H3K27me3 ChIP-seq data from extraembryonic ectoderm (ExE) of single E6.5 embryos that underwent Dnmt3a/3b knockout, because the preimplantation iXCI status is maintained in extraembryonic cells (Chen et al., 2019; Galupa and Heard, 2015; Schulz and Heard, 2013). The results showed that Dnmt knockout-induced chromosome-wide loss of DNA methylation led to a nearly complete loss of H3K27me3 on the paternal X chromosome (specifically inactivated in iXCI), along with a notable transcriptional upregulation across the chromosome. By contrast, these changes were not observed on the maternal X chromosome.

      We have added this result in the revised manuscript (Lines 253-261; Figure 3—figure supplement 4A in the revised manuscript).

      (3) Analysis of the developmental potential of Dnmt3b-kd embryos

      While the authors claim that Dnmt3b-mediated de novo DNA methylation plays an important role in imprinted X-chromosome inactivation, it remains unclear whether the analysis presented in Figure 4 is derived from "female" embryos. This analysis seemed confusing as the authors claim that de novo DNA methylation in the promoter regions during the transition from 8-cell to blastocyst regulates imprinted X-chromosome inactivation, but this should not happen in the male embryos. Was the impairment of embryonic proliferation and differentiation observed in both male and female embryos? Or is this specific to the female embryos? We think that the sex of the embryos would be critical for the analysis presented in Figure 4.

      Thanks so much for your constructive comments, which make our results smoother and clearer. Figure 4 mainly presents the developmental role of minor de novo methylation based on the integrated analysis of DNA methylation and gene expression dynamics from the 8-cell stage to the ICM. Because our data indicated that both male and female embryos undergo minor de novo methylation (Figure 1—figure supplement 1C-D in the revised manuscript), this section mainly focused on genome-wide and general changes rather than on sex-dimorphic consequences.

      To avoid possible confusion, we have reorganized the RESULTS AND DISCUSSION section and presented this section as Figure 2 in the revised manuscript, before the chromosomal distribution analysis and the subsequent experiments relevant to iXCI.

      Reviewer #2 (Public Review):

      Summary:

      Here, Yue et al. set out to determine if the low DNMT3B expression that is observed prior to de novo DNA methylation (before the blastocyst stage) has a function. Re-analyzing existing DNA methylation data from Smith et al. (2012) they find a small DNA methylation gain over a subset of promoters and gene bodies, occurring between the 8-cell and blastocyst stages, and refer to this as "minor de novo DNA methylation". They attempt to assess the relevance/functionality of this minor DNA methylation gain, and report reduced H3K27me3 in Dnmt3b knockdown (KD) trophoblast cells that normally undergo imprinted X-chromosome inactivation (iXCI) before the blastocyst stage. In addition, they assess the proliferation, differentiation, metabolic function, implantation rate, and live birth rate of Dnmt3b KD blastocysts.

      Strengths:

      Working with early embryos is technically demanding, making the well-designed experiments from this manuscript useful to the epigenetics community. Particularly, the DNMT3B expression and 5-mC staining at different embryonic stages.

      Thanks for your positive evaluation. We have revised the manuscript based on your comments, and the items that need to be addressed in detail are explained in the point-by-point response to each comment.

      Weaknesses:

      - Throughout the manuscript, please represent DNA methylation changes as delta DNA methylation instead of fold change.

      Thanks so much for your constructive comments. We have represented DNA methylation changes as “ΔDNA methylation” (Figure 2—figure supplement 1A; Figure 3—figure supplement 1A; Figure 3—figure supplement 3I in the revised manuscript).
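
      To illustrate why the two representations can differ so much at lowly methylated promoters, here is a minimal Python sketch. The function names and the example values are hypothetical and are not taken from the study's data.

```python
def delta_methylation(before, after):
    """Absolute change in methylation level (both values in the range 0 to 1)."""
    return after - before


def fold_change(before, after):
    """Relative change; unstable when the baseline level is close to zero."""
    if before == 0:
        raise ValueError("fold change is undefined for a zero baseline")
    return after / before


# Hypothetical promoter going from 1% to 2% methylation:
# the delta is about 0.01 (one percentage point), while the fold change is 2.
low_delta = delta_methylation(0.01, 0.02)
low_fold = fold_change(0.01, 0.02)
```

      The same one-percentage-point gain from a 50% baseline gives a fold change of only about 1.02, which is why the delta is the more comparable quantity across regions with different basal levels.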

      - Detailed methods on the re-analysis of the DNA methylation data from Smith et al. 2012 are missing from the materials and methods section. Was a minimum coverage threshold used?

      Thanks so much for your reminder. We have added relevant statements and provided the details of the coverage criteria in the Bioinformatics analysis subsection of the Materials and methods section, as follows: RRBS data of mouse embryos (2-cell embryos, 4-cell embryos, 8-cell embryos, ICM, and E6.5 embryos) were downloaded from the published article by Smith et al. (Smith et al., 2012) (accession number: GSE34864). The methylation level was calculated as the number of “methylated” reads (reported as C) divided by the total number of “methylated” and “unmethylated” reads (reported as C or T). The genomic region information was downloaded from the mm9 Repeat Masker. As described in the published article, promoters were defined as 1 kb up- and downstream of the TSS and classified into high-density CpG promoters (HCP), intermediate-density CpG promoters (ICP), and low-density CpG promoters (LCP). Only CpG sites with at least fivefold coverage were included in the methylation analysis. We have added relevant information in the revised manuscript (Lines 462-470 in the revised manuscript).
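
      The calculation described above can be written as a short Python sketch. The per-CpG formula and the fivefold coverage cutoff follow the text; the function names and the choice to summarize a region as the mean of its covered CpGs are assumptions made for illustration.

```python
def cpg_methylation_level(methylated_reads, unmethylated_reads, min_coverage=5):
    """Methylation level of one CpG: C reads / (C reads + T reads).

    Returns None when coverage is below min_coverage (fivefold in the text).
    """
    coverage = methylated_reads + unmethylated_reads
    if coverage < min_coverage:
        return None
    return methylated_reads / coverage


def region_methylation(cpg_counts, min_coverage=5):
    """Mean level over the covered CpGs of a region (e.g. TSS +/- 1 kb).

    cpg_counts: list of (methylated_reads, unmethylated_reads) per CpG.
    Assumption: a region is summarized by the mean of its per-CpG levels.
    """
    levels = []
    for methylated, unmethylated in cpg_counts:
        level = cpg_methylation_level(methylated, unmethylated, min_coverage)
        if level is not None:
            levels.append(level)
    return sum(levels) / len(levels) if levels else None
```

      For example, `region_methylation([(5, 5), (1, 1), (0, 10)])` ignores the second CpG (coverage 2) and averages the remaining levels of 0.5 and 0.0.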

      - Detailed methods on the establishment and validation of Dnmt3b KO blastocysts and 5-aza-dC treated blastocysts are missing (related to Figure 2).

      Thanks so much for your detailed reminder. In the present study, we used a well-established Dnmt3b-deficient mouse model (Okano et al., 1999) to validate the role of minor de novo DNA methylation in iXCI establishment. Heterozygous Dnmt3b<sup>+/-</sup> mice, which carry one mutant Dnmt3b allele, were obtained from the Mutant Mouse Resource & Research Centers (MMRRC, NIH). Homozygous embryos were obtained by intercrossing Dnmt3b<sup>+/-</sup> male and female mice. Genotyping of collected embryos was performed by PCR using primers designed according to the gene targeting strategy, following the MMRRC genotyping protocol (https://www.med.unc.edu/mmrrc/genotyping-protocols/mmrrc-center-protocol-29886/). We have provided the detailed methods in the revised manuscript (Lines 350-354; 391-393 in the revised manuscript). In addition, we added a schematic diagram depicting the processes of embryo collection and detection (Figure 3—figure supplement 3A in the revised manuscript).

      Similarly, we have provided relevant details of 5-aza-dC supplementation in the revised manuscript (Lines 412-415 in the revised manuscript) and added a schematic diagram depicting the details of experimental design and processes (Figure 3—figure supplement 3E in the revised manuscript).

      - Detailed methods on the re-analysis of the ChIPseq data from Liu et al. 2016 are missing from the materials and methods section.

      Thank you for pointing this out. The bigwig files of H3K27me3 ChIP-seq data were downloaded from the published article by Liu et al. (Liu et al., 2016) (accession number: GSE73952). These signal tracks were generated using the MACS2 (v2.0.10.20131216) pileup function and normalized to 1 million reads for visualization, as described in the original publication. We have added relevant information to the MATERIALS AND METHODS section in the revised manuscript (Lines 474-479 in the revised manuscript).
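
      For readers unfamiliar with this kind of depth normalization, the following Python sketch shows the arithmetic of scaling a pileup track to 1 million reads. It assumes simple linear scaling by library size; the function name and inputs are hypothetical, and the code does not reproduce the MACS2 implementation.

```python
def normalize_to_million(signal, total_reads):
    """Scale per-bin pileup values to a library size of 1 million reads.

    signal: raw pileup values, one per genomic bin.
    total_reads: number of reads in the library (assumed known).
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    scale = 1_000_000 / total_reads
    return [value * scale for value in signal]
```

      Scaling in this way makes tracks from libraries of different depths comparable for visualization, although it does not correct for differences in signal-to-noise ratio.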

      - Some of the data represented in bar graphs does not look convincing/significant. Maybe this data can be better represented differently, such as in box plots or violin plots, which would better represent the data.

      Thanks so much for your comments, which improve the presentation of our results. The relevant results have been changed into box plots in the revised manuscript (Figure 3E; Figure 3—figure supplement 3C; Figure 3—figure supplement 3G in the revised manuscript). In addition, to strengthen our evidence, we have added an alternative statistical method to quantify the establishment of iXCI, i.e., the percentage of H3K27me3-positive and -negative cells among total trophoblast cells in female blastocysts with or without Dnmt3b knockdown (Figure 3F; Figure 3—figure supplement 3D, H in the revised manuscript).

      - The relevance and rationale for experiments using 5-aza-dC treatment is unclear.

      Thanks so much for reminding us to make our results more informative and convincing. 5-aza-dC is a well-established global DNA hypomethylating agent that efficiently inhibits the activity of all DNMTs, and thus has been frequently used to study the maintenance of DNA methylation and de novo DNA methylation (Maslov et al., 2012; Oka et al., 2005).

      In our study, to validate the function of minor de novo DNA methylation in iXCI, we took advantage of 5-aza-dC-induced DNMT inhibition, which allows us, despite its inhibitory effect common to various DNMTs, to transiently treat embryos specifically during the window of minor de novo DNA methylation (from the 8-cell to blastocyst stage). We have added these statements, as well as a schematic diagram depicting the experimental design, in the revised manuscript to make the rationale of our experiments clearer and easier to understand (Lines 183-188; Figure 3—figure supplement 3E in the revised manuscript).

      References

      Auclair, G., Guibert, S., Bender, A. and Weber, M. (2014). Ontogeny of CpG island methylation and specificity of DNMT3 methyltransferases during embryonic development in the mouse. Genome Biol. 15, 545.

      Borgel, J., Guibert, S., Li, Y., Chiba, H., Schubeler, D., Sasaki, H., Forne, T. and Weber, M. (2010). Targets and dynamics of promoter DNA methylation during early mouse development. Nat. Genet. 42, 1093-1100.

      Chen, Z., Yin, Q., Inoue, A., Zhang, C. and Zhang, Y. (2019). Allelic H3K27me3 to allelic DNA methylation switch maintains noncanonical imprinting in extraembryonic cells. Sci Adv 5, eaay7246.

      Chow, J. and Heard, E. (2009). X inactivation and the complexities of silencing a sex chromosome. Curr. Opin. Cell Biol. 21, 359-366.

      Dahlet, T., Argueso Lleida, A., Al Adhami, H., Dumas, M., Bender, A., Ngondo, R. P., Tanguy, M., Vallet, J., Auclair, G., Bardet, A. F., et al. (2020). Genome-wide analysis in the mouse embryo reveals the importance of DNA methylation for transcription integrity. Nat Commun 11, 3153.

      Fukuda, A., Tomikawa, J., Miura, T., Hata, K., Nakabayashi, K., Eggan, K., Akutsu, H. and Umezawa, A. (2014). The role of maternal-specific H3K9me3 modification in establishing imprinted X-chromosome inactivation and embryogenesis in mice. Nat Commun 5, 5464.

      Galupa, R. and Heard, E. (2015). X-chromosome inactivation: new insights into cis and trans regulation. Curr. Opin. Genet. Dev. 31, 57-66.

      Gontan, C., Mira-Bontenbal, H., Magaraki, A., Dupont, C., Barakat, T. S., Rentmeester, E., Demmers, J. and Gribnau, J. (2018). REX1 is the critical target of RNF12 in imprinted X chromosome inactivation in mice. Nat Commun 9, 4752.

      Guo, F., Li, L., Li, J., Wu, X., Hu, B., Zhu, P., Wen, L. and Tang, F. (2017). Single-cell multi-omics sequencing of mouse early embryos and embryonic stem cells. Cell Res. 27, 967-988.

      Heard, E., Chaumeil, J., Masui, O. and Okamoto, I. (2004). Mammalian X-chromosome inactivation: an epigenetics paradigm. Cold Spring Harb. Symp. Quant. Biol. 69, 89-102.

      Huynh, K. D. and Lee, J. T. (2005). X-chromosome inactivation: a hypothesis linking ontogeny and phylogeny. Nat. Rev. Genet. 6, 410-418.

      Inoue, K., Kohda, T., Sugimoto, M., Sado, T., Ogonuki, N., Matoba, S., Shiura, H., Ikeda, R., Mochida, K., Fujii, T., et al. (2010). Impeding Xist expression from the active X chromosome improves mouse somatic cell nuclear transfer. Science 330, 496-499.

      Liu, X. Y., Wang, C. F., Liu, W. Q., Li, J. Y., Li, C., Kou, X. C., Chen, J. Y., Zhao, Y. H., Gao, H. B., Wang, H., et al. (2016). Distinct features of H3K4me3 and H3K27me3 chromatin domains in pre-implantation embryos. Nature 537, 558-562.

      Maslov, A. Y., Lee, M., Gundry, M., Gravina, S., Strogonova, N., Tazearslan, C., Bendebury, A., Suh, Y. and Vijg, J. (2012). 5-aza-2'-deoxycytidine-induced genome rearrangements are mediated by DNMT1. Oncogene 31, 5172-5179.

      Oka, M., Meacham, A. M., Hamazaki, T., Rodic, N., Chang, L. J. and Terada, N. (2005). De novo DNA methyltransferases Dnmt3a and Dnmt3b primarily mediate the cytotoxic effect of 5-aza-2'-deoxycytidine. Oncogene 24, 3091-3099.

      Okano, M., Bell, D. W., Haber, D. A. and Li, E. (1999). DNA methyltransferases Dnmt3a and Dnmt3b are essential for de novo methylation and mammalian development. Cell 99, 247-257.

      Schulz, E. G. and Heard, E. (2013). Role and control of X chromosome dosage in mammalian development. Curr. Opin. Genet. Dev. 23, 109-115.

      Smith, Z. D., Chan, M. M., Mikkelsen, T. S., Gu, H. C., Gnirke, A., Regev, A. and Meissner, A. (2012). A unique regulatory phase of DNA methylation in the early mammalian embryo. Nature 484, 339-344.

      Tan, K., An, L., Miao, K., Ren, L., Hou, Z., Tao, L., Zhang, Z., Wang, X., Xia, W., Liu, J., et al. (2016). Impaired imprinted X chromosome inactivation is responsible for the skewed sex ratio following in vitro fertilization. Proc. Natl. Acad. Sci. U. S. A. 113, 3197-3202.

      Reviewer #1 (Recommendations For The Authors):

      Title

      It would be hard to understand what "co"-regulates means. Does this mean DNA methylation and H3K27me3 co-regulate imprinted X- X-chromosome inactivation? If so, the title can be reworded.

      Thanks for your insightful comments. The title has been changed to “A wave of minor de novo DNA methylation initiates in mouse 8-cell embryos and co-regulates imprinted X- chromosome inactivation with H3K27me3” (Line 2 in the revised manuscript).

      Text

      (1) As DNA methylation analysis is a primary part of this study, how they processed DNA methylation data can be added to the "Bioinformatics analysis" in the MATERIALS AND METHODS section.

      Thanks for your kind reminder. We have added relevant information in the Materials and methods section in the revised manuscript (Lines 462-474 in the revised manuscript).

      (2) It seems that recent literature has not been cited in the manuscript. Specifically, none of the papers after 2018 were cited. Recent relevant papers should also be cited throughout the manuscript.

      Thanks so much for your reminder. We have added more recent literature to update the relevant information, such as the evidence supporting the causal role between DNA methylation and XCI (Lines 225-228, 264-265 in the revised manuscript); the concurrent enrichment of DNA methylation and H3K27me3 in genes subject to XCI (Lines 301-303 in the revised manuscript); the dominant role of de novo methylation in X chromosome (Lines 253-256 in the revised manuscript), etc.

      (3) Line 56: The first report that describes the dynamics of DNMT3B expression in pre-implantation embryonic development (Hirasawa et al., 2007) is missing. This paper should be cited.

      We apologize for the oversight. We have added the relevant reference and rewritten the sentence in the revised manuscript (Lines 56-57 in the revised manuscript). We believe you are referring to the report by Hirasawa et al. (2008), which presented the expression and subcellular localization of Dnmt3a and Dnmt3b in mouse oocytes and preimplantation embryos.

      (4) Line 98: It would be good to mention that the data were derived from reduced representation bisulfite sequencing as the authors used whole-genome bisulfite sequencing data from the same research group as well.

      Thanks for your kind reminder. As you have suggested, we have added a description in the revised manuscript to emphasize that these data were derived from reduced representation bisulfite sequencing, whereas the other data were derived from whole-genome bisulfite sequencing (Lines 98-99, 111 in the revised manuscript).

      (5) Line 101: We first... "the preferential target of DNMT3B (Auclair et al., 2014; Borgel et al., 2010)". More recent literature (Baubec et al., 2016, Duymich et al., 2016, for example) showed that the preferential target of DNMT3B is not a promoter but a gene body. This sentence should be reworded.

      Thanks so much for your detailed reminder. As you have pointed out, “preferential target” seems to be an inaccurate statement. Besides promoters, gene bodies and other elements also undergo de novo DNA methylation (Auclair et al., 2014; Dahlet et al., 2020; Duymich et al., 2016).

      We have rewritten the sentence as follows in the revised manuscript: “Promoter regions are important target sites of DNMT3B (Choi et al., 2011). The acquisition of DNA methylation in promoters, especially in intermediate and low CpG promoters, during implantation is largely dependent on DNMT3B and plays an important role in regulating developmental genes (Auclair et al., 2014; Borgel et al., 2010; Dahlet et al., 2020). Thus, among genomic regions that may undergo de novo DNA methylation, we initially focused our analysis on DNA methylation dynamics of promoters...” (Lines 100-106 in the revised manuscript)

      (6) Lines 108-109: It would be good to mention that these data were derived from whole-genome bisulfite sequencing.

      Thanks for your kind reminder. As aforementioned, we have added a description in the revised manuscript to distinguish between data derived from reduced representation bisulfite sequencing and whole-genome bisulfite sequencing (Lines 98-99, 111 in the revised manuscript).

      (7) Line 141: rXCI should be defined.

      Thanks for your kind reminder. We have added full descriptions and more necessary information about iXCI and rXCI to make our statements clearer and easier to understand (Lines 210-213 in the revised manuscript). In addition, we carefully checked the relevant descriptions throughout the manuscript, and each abbreviation (such as “ICM”) has been defined at its first occurrence. We have also replaced abbreviations that appear only once in the manuscript with their full terms (Lines 122, 212 in the revised manuscript).

      (8) Lines 145-149: The role of DNA methylation for imprinted X-inactivation has already been reported (Chiba et al., 2008). The relevant sentences should be reworded.

      Thanks so much for reminding us of the important earlier literature that explores the relationship between DNA methylation and XCI. However, the primary aim and hypothesis of the study by Chiba et al. are different from those of our study. Chiba et al. focused on whether DNA methylation is the imprinting mark responsible for monoallelic expression of Xist (the initiation event of iXCI), while our study focused on the role of DNA methylation in achieving X chromosomal heterochromatinization (the late event of iXCI).

      In detail, the study by Chiba et al. mainly focused on exploring why Xist is specifically expressed from the paternal allele and iXCI occurs specifically on the paternal X chromosome in mouse preimplantation embryos. Because previous studies have suggested that genomic imprinting of Xist is established during oogenesis (Oikawa et al., 2014; Tada et al., 2000), Chiba et al. wanted to test whether the DNA methylation imprinting established during oogenesis is responsible for the monoallelic expression of Xist in preimplantation embryos. Analyses of DNA methyltransferase maternal knockout embryos revealed that oocyte DNA methylation is dispensable for Xist imprinting (Chiba et al., 2008). A follow-up study by Inoue et al. identified a broad H3K27me3 enrichment within the Xist 5’ region, which is established during oocyte growth and persists through preimplantation development, as the imprinting mark of Xist (Inoue et al., 2017). This series of studies is very important and allows us to understand the mechanism underlying paternal allele-specific iXCI in mouse preimplantation embryos and extraembryonic tissues.

      However, the hypothesis is different in our study. Based on the finding of minor de novo DNA methylation and its preferential distribution on the X chromosome, we have speculated that the minor de novo methylation, which occurs from the 8-cell to blastocyst stage, may participate in achieving X chromosomal heterochromatinization. Although DNA methylation is essential for maintaining X chromosome-wide transcriptional silence of rXCI, its role in iXCI remains controversial and it is even plausibly thought that DNA methylation is not required for achieving iXCI because preimplantation embryos undergo global and massive DNA demethylation.

      We have reorganized this paragraph and added relevant statements to make the background and discussion clearer and easier to understand (Lines 217-234 in the revised manuscript).

      (9) Lines 164-165: Information regarding Dnmt3b KO is missing. Did the authors generate an original KO line or use an already published one? It should be explicitly stated.

      Thank you so much for your kind reminder. The Dnmt3b heterozygous mice were obtained from the Mutant Mouse Resource & Research Centers (MMRRC), and Dnmt3b knockout (KO) embryos were generated by mating Dnmt3b heterozygous females with heterozygous males. The genotyping of Dnmt3b KO embryos was performed by PCR following the MMRRC genotyping protocol (https://www.med.unc.edu/mmrrc/genotyping-protocols/mmrrc-center-protocol-29886/). The relevant information has been added to the MATERIALS AND METHODS section in the revised manuscript (Lines 350-354; 391-393 in the revised manuscript).

      (10) Line 165: chemical-induced inhibition of DNMT3B. As 5-aza-dC also blocks DNMT3A and DNMT1, this sentence should be reworded.

      Thank you for your valuable comments. 5-aza-dC is a well-established global DNA hypomethylating agent that efficiently inhibits the activity of all DNMTs, and has been frequently used to study the maintenance of DNA methylation and de novo DNA methylation (Maslov et al., 2012; Oka et al., 2005). Thus, despite its inhibitory effect common to various DNMTs, chemical-induced inhibition of DNMTs has the advantage of allowing us to transiently treat embryos specifically during the window of minor de novo DNA methylation (the 8-cell to blastocyst stage). We have rewritten the relevant sentences in the revised manuscript (Lines 183-188 in the revised manuscript).

      (11) Lines 171-174: "The role of de novo methylation in iXCI...". This possibility was already tested in the previous study from the Sasaki lab (Chiba et al., 2008).

      As mentioned above, the primary aim and hypothesis of the study by Chiba et al. are different from those of our study. Chiba et al. mainly focused on exploring why Xist is specifically expressed from paternal allele and iXCI occurs specifically on the paternal X chromosome in mouse preimplantation embryos, so they tested whether the DNA methylation imprinting established during oogenesis is responsible for this monoallelic expression of Xist in preimplantation embryos (the initiation event of iXCI).

      By contrast, based on the finding of minor de novo DNA methylation and its preferential distribution on X chromosome, our study has speculated that the minor de novo DNA methylation, which occurs from the 8-cell to blastocyst stage, may participate in achieving X chromosomal heterochromatinization (the late event of iXCI).

      Thanks so much for reminding us of this important literature, which makes our discussion more informative. We have reorganized this paragraph by rewriting or adding relevant statements to make the background and discussion clearer and easier to understand (Lines 217-231 in the revised manuscript). In addition, to avoid repetition and make our discussion more concise, we have removed the similar sentences at the end of this paragraph.

      (12) Lines 198-200: "Given DNA methylation...". These citations mention a general relationship between DNA methylation and H3K27me3 in cells in culture. As I believe the authors focus on X-chromosome inactivation in the female embryos, more relevant papers that discuss the order of the events for the establishment of H3K27me3 and DNA methylation in the inactive X-chromosome can be cited.

      Thanks so much for your comment to improve our discussion. It has been thought that during the late phase of rXCI in fully differentiated cells, gene silencing is achieved by PRC2 complex-induced H3K27me3, and then is further stably maintained by the redundant action of multiple layers of epigenetic modifications, including DNA methylation, to reach the maximum level of chromatin compaction (Chow and Heard, 2009; Heard et al., 2004; Pintacuda and Cerase, 2015). In line with this, a recent multifaceted analysis showed that DNA methylation and H3K27me3 are concurrently enriched in genes subject to XCI (Balaton and Brown, 2021). We have added these statements in the revised manuscript (Lines 295-303 in the revised manuscript).

      (13) Line 241: As 5-aza-dC blocks both de novo and maintenance DNA methylation, this sentence should be reworded.

      Thank you for your kind reminder. As you have mentioned above, 5-aza-dC is a well-established global DNA hypomethylating agent that efficiently inhibits the activity of all DNMTs, and has been frequently used to study the maintenance of DNA methylation and de novo DNA methylation (Maslov et al., 2012; Oka et al., 2005). Thus, despite its inhibitory effect common to various DNMTs, chemical-induced inhibition of DNMTs has the advantage of allowing us to transiently treat embryos specifically during the window of minor de novo DNA methylation (the 8-cell to blastocyst stage). We have rewritten the relevant sentences in the revised manuscript (Lines 183-188 in the revised manuscript).

      Figures

      (1) Figure 1C, D: Do the rows in C and D show the corresponding genes?

      Figure 1C and D represent the DNA methylation changes of promoters (C) and gene bodies (D), respectively, during the transition from the 8-cell to blastocyst stage. The two datasets were analyzed independently, and the rows do not show corresponding genes. Since we have focused on the minor de novo methylation in promoter regions, to avoid confusion, the results of the gene body have been removed from the revised manuscript.

      (2) Figure 1G: Yy2 promoter gained DNA methylation during the transition from 8-cell to the blastocyst stage. Is this a representative locus for the de novo methylated promoters that are shown in Figure 1F where an increase of DNA methylation is about ~1% on average? Another representative locus could be shown instead of this gene promoter.

      Thanks so much for your detailed reminder. The inconsistency between the global methylation change and the bisulfite sequencing analysis of Yy2 may be due to methodological details, such as C-T conversion efficiency, the number of picked colonies, etc. Since we have confirmed the presence of minor de novo DNA methylation using different publicly available data, to avoid ambiguity, we have removed this result from the revised manuscript.

      (3) Figures 2C and 3A: It would be helpful to mention what the arrowheads mean.

      Thanks so much for your detailed reminder. In Figure 2C, the arrowhead indicates the H3K27me3 domain and the blank arrowhead indicates the blastomere without the H3K27me3 domain. In Figure 3A, the arrowhead indicates the Xist RNA domain and the blank arrowhead indicates the blastomere without the Xist RNA domain. We have added the information in the revised manuscript (Lines 736-738, 747-749 in the revised manuscript).

      (4) Figure 3-figure supplement 2B: It would be hard to see whether H3K27me3 is enriched at the promoter regions of presented genes. It would be helpful to show the values for the Y-axis as in panel A.

      Thanks for your helpful reminder. We have added the scales to the figure to improve the result presentation (Figure 4—figure supplement 2B in the revised manuscript).

      (5) Figure 4-figure supplement 2: 5-aza-dC blocks not only the activity of DNMT3B but also DNMT1, and DNMT3A (all these DNMTs are expressed during pre-implantation embryos, see Hirasawa et al., 2007). This part can be omitted from the manuscript.

      Thanks for your insightful comments. As you have mentioned above, the relevance and rationale for experiments using 5-aza-dC treatment should be clarified. 5-aza-dC is a well-established global DNA hypomethylating agent that efficiently inhibits the activity of all DNMTs, and thus has been frequently used to study the maintenance of DNA methylation and de novo DNA methylation (Maslov et al., 2012; Oka et al., 2005).

      In our study, to validate the function of minor de novo DNA methylation in iXCI and blastocyst development, we took advantage of 5-aza-dC-induced DNMT inhibition, which allows us to transiently treat embryos specifically during the window of minor de novo DNA methylation (the 8-cell to blastocyst stage), despite its non-specificity to various DNMTs.

      Based on these considerations, we would like to retain this result and hope for your understanding.

      We have added these statements in the revised manuscript to make the rationale of our experiments clearer and easier to understand (Lines 183-188 in the revised manuscript) and added a schematic diagram depicting the experimental design (Figure 3—figure supplement 3E in the revised manuscript).

      Reviewer #2 (Recommendations For The Authors):

      Recommendations/concerns in the text:

      - Line 106, it is unclear what is meant by "in line with this"? Gene body DNA methylation is a characteristic of active transcription, so why would a gain in DNA methylation at promoters be in line with a gain in DNA methylation over gene bodies?

      Thank you so much for your comments that pointed out our ambiguous statement. We meant both the promoter and gene body regions, albeit accounting for small proportions, gain DNA methylation during the transition from the 8-cell to blastocyst stage. Based on the comment by Reviewer#1, since we have focused on the minor de novo methylation in promoter regions, to avoid confusion, the results of the gene body have been removed from the revised manuscript.

      - Line 111 & 114, can 6% DNA methylation really be considered "relatively hypermethylated" compared to 3% DNA methylation that is referred to as "more hypomethylated"?

      We apologize for our unclear and ambiguous statements. Here we focused on the promoter regions. Many previous studies have revealed that compared with gene bodies and other genome elements, promoter and overlapping CGI regions, especially high CpG promoters, always showed low levels of DNA methylation. We have added relevant statements to clarify this information, and rewritten the sentences in the revised manuscript (Lines 100-106, 116-118, 121, 124 in the revised manuscript).

      - Line 124, there are a number of processes identified, why only mention one in the text? Suggest changing writing to be more accurate, indicating what was included for the GO analysis and using the words "enriched for ... processes". Saying it may be linked to a process is an overstatement and not supported by further experiments/data.

      Thank you so much for your detailed comments that make our results more informative. We have checked the relevant description and addressed your suggestions as follows: By performing gene ontology enrichment analysis of genes that undergo minor or major de novo DNA methylation, respectively, we noticed that besides many important basic processes common to the two waves of de novo DNA methylation, genes subject to minor de novo DNA methylation were enriched in processes such as organic substance transport, chromosome organization, and cell fate specification (Lines 129-134 in the revised manuscript).

      - Lines 149 - 152: sentence/message unclear.

      We apologize for the ambiguous description. We have corrected the relevant descriptions as follows: To identify the biological function of minor de novo DNA methylation in iXCI, we knocked down Dnmt3b in preimplantation embryos by microinjecting Dnmt3b siRNA into zygotes (Lines 234-236 in the revised manuscript).

      - Lines 162-164: the data in Figure 2C/D does not support this statement, as it does not show H3K27me3 loss specifically at the inactive X-chromosome.

      Thanks so much for your insightful comments. Despite the global enrichment of H3K27me3, the H3K27me3 domain detected by immunostaining is a classic marker for the establishment of XCI, which is achieved through X chromosome-wide heterochromatinization and transcriptional repression (Chow and Heard, 2009; Heard et al., 2004; Huynh and Lee, 2005). Thus, we have used immunostaining for H3K27me3 domains to evaluate iXCI establishment in the blastocysts, as previously reported (Fukuda et al., 2014; Gontan et al., 2018; Inoue et al., 2010; Tan et al., 2016). To make our results more convincing, we have added another statistical method to quantify the establishment of iXCI, i.e., the percentages of H3K27me3-positive and -negative trophoblast cells among total trophoblast cells in female blastocysts with or without Dnmt3b knockdown.

      In addition, we have added a schematic diagram depicting the process of iXCI initiation and establishment, as well as the experimental design and workflows, to make the results easier to understand.

      We also agree with your comment that additional evidence will benefit the conclusion. To strengthen the evidence and test whether DNA methylation loss leads to a prolonged effect on iXCI, we have reanalyzed the RNA-seq and H3K27me3 ChIP-seq data from the extraembryonic ectoderm (ExE) of single E6.5 embryos that underwent Dnmt3a/3b knockout, because the preimplantation iXCI status is maintained in extraembryonic cells (Chen et al., 2019; Galupa and Heard, 2015; Schulz and Heard, 2013). The results showed that chromosome-wide loss of DNA methylation led to a nearly complete loss of H3K27me3 on the paternal X chromosome (which is specifically inactivated in iXCI), along with a notable transcriptional upregulation across the chromosome. By contrast, these changes were not observed on the maternal X chromosome (Lines 253-261; Figure 3—figure supplement 4A in the revised manuscript).

      - Lines 169-174: sentence/message unclear.

      As mentioned above, we have reorganized this paragraph by rewriting or adding statements relevant to DNA methylation and XCI, to make the background and discussion clearer and easier to understand (Lines 217-234 in the revised manuscript). In addition, to avoid repetition and make our discussion more concise, we have removed the similar sentences at the end of this paragraph.

      - Lines 177-179: this statement is too bold. The data does not support "direct evidence".

      Thank you for your detailed reminder. We have rewritten the sentence to avoid confusion and overstatement (Lines 262-268 in the revised manuscript).

      - Line 198: these are not all enzymes, but could be referred to as chromatin modifiers.

      We apologize for the ambiguous description. As you suggested, we have corrected “enzymes” to “chromatin modifiers” (Lines 284, 287 in the revised manuscript).

      - Line 199: this statement is not correct in all contexts. There are many studies showing antagonism between DNA methylation and H3K27me3.

      Thanks so much for your careful review. As you have pointed out, the reported relationships between DNA methylation and H3K27me3 are divergent and largely controversial among studies. Under certain circumstances, DNA methylation antagonizes H3K27me3 at promoters by excluding the binding of components of PRC2 (the main complex responsible for H3K27me3 deposition) to their targets (Bartke et al., 2010; Jermann et al., 2014), while other studies have presented evidence that PRC2 and DNA methylation cooperate to achieve silencing (Hagarman et al., 2013; Vire et al., 2006). Thus, the relationship between DNA methylation and histone modifications is thought to be complex, possibly depending on the cell type and/or genomic region. Both antagonism and coordination can be observed in different regulatory elements in mouse ES cells (King et al., 2016).

      We apologize for our incomplete statement, which arose because we mainly focused on their synergistic relationship. We have refined this section by rewriting relevant sentences and adding necessary statements (Lines 288-303 in the revised manuscript).

      - Lines 228-230: the developmental significance of DNA methylation homeostasis is already well-established. Please reference relevant papers showing this here.

      Thank you for this helpful suggestion. We have reorganized this section. Relevant references that highlight the developmental significance of DNA methylation homeostasis have been added. The sentence has been rewritten and moved to the end of this paragraph in the revised manuscript (Lines 159-161 in the revised manuscript).

      - Line 238: an explanation/rationale for looking at energy metabolism is lacking.

      Thank you for your comments, which help make our results easier to understand. The examination of energy metabolism is mainly based on the integrated analysis of DNA methylation and gene expression from the 8-cell embryos to ICM, which aimed to test the potential short- and long-term developmental consequences of minor de novo DNA methylation. Bioinformatic analysis suggested that many basic processes, such as cell differentiation, cell cycle and metabolic regulation, may be regulated by minor de novo DNA methylation. Among the enriched genes, several are related to energy metabolism. In addition, because energy metabolism is crucial for supporting embryo differentiation and development, and oxidative phosphorylation (OXPHOS) metabolism is highly activated during the blastocyst stage (Zhao et al., 2021), we next examined the energy metabolism, particularly OXPHOS activity, of Dnmt3b-KD embryos. We have refined the section by rewriting relevant sentences and adding necessary statements (Lines 175-179 in the revised manuscript).

      - Lines 246-248: Looking at the data in Figure 2 figure supplement 2, this statement is simply not true with regards to DNMT3B protein, and also global DNA methylation level is reduced in the Dnmt3b KD blastocyst, which could lead to defective major de novo DNA methylation.

      Thanks for your careful review. We have rewritten the sentence to make our statement more accurate and avoid overstatement (Lines 188-190 in the revised manuscript).

      Recommendations/concerns relating to figures:

      Figure 1:

      - Of all genic promoters, how many were included in the analysis (contained sufficient coverage)? What cut-off/thresholds were used to consider DNA methylation gain at a promoter?

      Thanks for your comments. In total, 11662 promoters were analyzed. Promoter methylation is generally low, particularly at the 8-cell stage, at which minor de novo methylation is just initiated, and these relatively low basal levels make the increase before the blastocyst stage appear considerably slight. To capture these slight changes, we used a relaxed threshold based on ΔDNA methylation. Only CpG sites with at least fivefold coverage were included in the methylation analysis, based on data from Smith et al. (Smith et al., 2012). ΔDNA methylation greater or less than 0 was defined as gain or loss of DNA methylation, respectively. We have added this information in the revised manuscript (Lines 462-470 in the revised manuscript).
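
      The filtering and classification criteria described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the fivefold coverage cut-off and the ΔDNA methylation threshold of 0 come from the text, whereas averaging CpG sites within a promoter, the function names, and the example read counts are assumptions made for illustration.

```python
MIN_COVERAGE = 5  # CpG sites need at least fivefold coverage (from the text)


def cpg_level(methylated_reads, total_reads):
    """Methylation level of one CpG site, or None if coverage is too low."""
    if total_reads < MIN_COVERAGE:
        return None
    return methylated_reads / total_reads


def promoter_level(cpg_sites):
    """Mean methylation of sufficiently covered CpG sites in one promoter.

    `cpg_sites` is a list of (methylated_reads, total_reads) tuples.
    Averaging over CpG sites is an assumption made for this sketch.
    """
    levels = [cpg_level(m, t) for m, t in cpg_sites]
    levels = [level for level in levels if level is not None]
    return sum(levels) / len(levels) if levels else None


def classify_change(level_8cell, level_later):
    """Classify a promoter by delta methylation: >0 is gain, <0 is loss."""
    if level_8cell is None or level_later is None:
        return "not covered"
    delta = level_later - level_8cell
    if delta > 0:
        return "gain"
    if delta < 0:
        return "loss"
    return "unchanged"


if __name__ == "__main__":
    # Hypothetical read counts for one promoter at two stages.
    eight_cell = [(0, 10), (1, 10), (3, 4)]  # last site dropped (coverage 4)
    later_stage = [(1, 10), (2, 10), (0, 3)]  # last site dropped (coverage 3)
    print(classify_change(promoter_level(eight_cell), promoter_level(later_stage)))
```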

      - Does an average methylation level of 0.02 represent 2% DNA methylation? Presuming yes, is the average 1.5% DNA methylation gain at promoters real? And meaningful? Especially compared to the gain in DNA methylation that takes place between ICM and E6.5 (Figure 1 Figure Supplement 1 D)

      As you have pointed out, an average methylation level of 0.02 represents 2% DNA methylation. As mentioned above, promoters exhibited an average DNA methylation gain of 1.5% during the transition from the 8-cell stage to ICM. The slight increase may be mainly due to the relatively low basal levels. As you expected, compared with the comprehensive de novo DNA methylation during implantation, preimplantation de novo methylation is slighter and occurs at a small proportion of promoter regions, so we designated it as minor de novo DNA methylation. It should also be mentioned that a proportion of these promoters continue to gain substantial DNA methylation during implantation. We have refined the relevant sentences to provide more detailed information on our results (Lines 125-127 in the revised manuscript).

      - Why is there a focus on promoters (which are not the preferential target of DNMT3B)?

      Thank you for your detailed reminder. As you have pointed out, “preferential target” seems to be an inaccurate statement: besides promoters, gene bodies and other elements also undergo de novo DNA methylation (Auclair et al., 2014; Dahlet et al., 2020; Duymich et al., 2016). We focused on the promoter regions for the following reasons: (1) promoter regions are important target sites of DNMT3B (Choi et al., 2011); (2) the acquisition of DNA methylation in promoters during implantation, especially in intermediate and low CpG promoters, is largely dependent on DNMT3B and plays an important role in regulating developmental genes (Auclair et al., 2014; Borgel et al., 2010; Dahlet et al., 2020). We have rewritten the relevant sentence in the revised manuscript (Lines 100-106 in the revised manuscript).

      - Figure 1H shows that promoters that gain DNA methylation during the "minor de novo DNA methylation" continue to gain DNA methylation during "de novo DNA methylation". Is the ~1.5% DNA methylation gain just the slow start of the main de novo DNA methylation wave?

      Your comments are very helpful for improving the description of our results. In the present study, our analysis indicated that a small proportion of promoters initially gain methylation during the transition from the 8-cell stage to the ICM. This finding challenges the current knowledge that (1) de novo DNA methylation occurs during implantation, by which globally hypomethylated blastocysts acquire genome-wide DNA methylation (Borgel et al., 2010; Dahlet et al., 2020; Smith et al., 2012); and (2) during preimplantation development, embryos undergo massive and global DNA demethylation.

      To distinguish our finding from the current understanding of the timing and dynamics of DNA methylation during early development, we have designated the methylation gain during the transition from the 8-cell to the blastocyst stage as minor de novo DNA methylation.

      We agree with your notion that most of the promoters undergoing minor de novo methylation continue to gain DNA methylation during implantation, as revealed in Fig. 1F. We have refined the relevant statement in the revised manuscript (Lines 125-127 in the revised manuscript).

      - The GO analysis performed for Figure 1H, what was used as input? Promoters of genes that gain DNA methylation as identified in 1C?

      Thank you for your comments. For the GO analysis shown in Figure 1H, we used as input the genes whose promoter regions gained or lost DNA methylation, respectively, during the transition from the 8-cell stage to the ICM (identified in Figure 1C). This information has been clarified in the revised manuscript to ensure accuracy (Lines 129-134 in the revised manuscript).

      - Figure 1 figure supplement 1, is there only a fold change as threshold or also a calculated significance (eg. p-value/FDR)?

      Thank you for your valuable comments. Considering the relatively low DNA methylation levels at promoter regions and the slight changes that occur during preimplantation embryo development, we used a relaxed threshold based on ΔDNA methylation rather than a significance test. Only CpG sites with at least fivefold coverage were included in the methylation analysis, which was based on data from Smith et al. (Smith et al., 2012). ΔDNA methylation greater or less than 0 was defined as gain or loss of DNA methylation, respectively. We have replaced the relevant figures and added this information to the revised manuscript (Figure 1—figure supplement 1D-E; Lines 125-127 in the revised manuscript).

      - To confirm DNMT3B is responsible for the DNA methylation gain: DNMT3B KD/KO followed by promoter DNA methylation analysis to confirm the promoters that gain DNA methylation between 8 cell and ICM don't gain DNA methylation in the absence of DNMT3B.

      We agree that additional evidence would benefit the conclusion. To strengthen the evidence, we have reanalyzed the RNA-seq and H3K27me3 ChIP-seq data from the extraembryonic ectoderm (ExE) of single E6.5 embryos that underwent Dnmt3a/3b knockout, because the preimplantation iXCI status is maintained in extraembryonic cells (Chen et al., 2019; Galupa and Heard, 2015; Schulz and Heard, 2013). The results showed that chromosome-wide loss of DNA methylation led to a nearly complete loss of H3K27me3 on the paternal X chromosome (which is specifically inactivated in iXCI), accompanied by notable transcriptional upregulation across the chromosome. By contrast, these changes were not observed on the maternal X chromosome. We have added this result to the revised manuscript (Lines 253-261; Figure 3—figure supplement 4A in the revised manuscript).

      Figure 2:

      - Figure 2A: label missing for what the numbers on the y-axis represent.

      Thank you for pointing this out. We apologize for the oversight. We have added the y-axis label to Figure 2A to clarify what the numbers represent (Figure 3A in the revised manuscript).

      - Figure 2B: y-axis is % of methylated promoters compared to all promoters?

      Thank you for your suggestion. The y-axis in Figure 2B indeed represents the percentage of de novo methylated promoters relative to all promoters. As you have suggested, we have clarified this labeling in the revised manuscript (Figure 3B in the revised manuscript).

      - What is the delta DNA methylation gain specifically for X-linked promoters?

      Thank you for your reminder. To provide more convincing evidence, we have reanalyzed single-cell COOL-seq data and specifically examined the DNA methylation changes at X-chromosomal promoters in female embryos. The X chromosome showed a more notable increase in de novo methylated promoters than the autosomes, and the female X chromosome showed higher DNA methylation levels than the male X chromosome (Figure 3—figure supplement 2A-B; Lines 203-206 in the revised manuscript).

      - Figure 2C: include representative images of separate channels to better see the signal of CDX2 and H3K27me3. Quantification would be better represented with box plots.

      Thank you for your helpful suggestions. We have added separate channel images in the revised manuscript. Additionally, we have adjusted the quantification to be represented as box plots, as you have suggested, to improve the accuracy and interpretability of the data presentation (Figure 3D-F in the revised manuscript).

      - Figure 2C: Does the H3K27me3 signal overlap with the location of the inactive X-chromosome (is there maybe denser DAPI or do IF combined with Xist RNA-FISH)?

      Thank you for your insightful comments. Despite the global enrichment of H3K27me3, the H3K27me3 domain detected by immunostaining is a classic marker for the establishment of XCI, which is achieved through X chromosome-wide heterochromatinization and transcriptional repression (Chow and Heard, 2009; Heard et al., 2004; Huynh and Lee, 2005). Thus, we used immunostaining for H3K27me3 domains to evaluate iXCI establishment in the blastocysts, as previously reported (Fukuda et al., 2014; Gontan et al., 2018; Inoue et al., 2010; Tan et al., 2016). We attempted co-staining of H3K27me3 IF and Xist FISH but were hindered by technical challenges, and we hope for your understanding. However, as mentioned above, H3K27me3 is a well-accepted marker for assessing XCI status.

      In addition, to make our results more convincing, we have added an alternative method to quantify the establishment of iXCI, i.e., the percentage of H3K27me3-positive and -negative trophoblast cells relative to total trophoblast cells in female blastocysts with or without Dnmt3b knockdown (Figure 3F; Lines 243-244 in the revised manuscript).
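
      As a minimal sketch of this quantification, assuming that each trophoblast cell in a blastocyst is scored as either H3K27me3-positive or -negative, the percentages can be computed as below. The function name and the cell counts are hypothetical and serve only to illustrate the arithmetic.

```python
def h3k27me3_percentages(positive_cells, negative_cells):
    """Percent of H3K27me3-positive and -negative cells among all scored cells."""
    total = positive_cells + negative_cells
    if total == 0:
        raise ValueError("no trophoblast cells were scored")
    return 100.0 * positive_cells / total, 100.0 * negative_cells / total


# Hypothetical counts for one female blastocyst: 30 positive, 10 negative.
pos_pct, neg_pct = h3k27me3_percentages(30, 10)
print(pos_pct, neg_pct)  # 75.0 25.0
```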

      - Figure 2 figure supplement 2A: relative expression of Dnmt3b?

      Thanks for your detailed reminder. The data represent the relative expression level of Dnmt3b, as noted in the original figure legend. Based on your comments, we have added the gene name in the label of the Y-axis. Similarly, the protein name has been also added to make the results more informative (Figure 2 figure supplement 2A, C, E in the revised manuscript).

      - Figure 2 figure supplement 2B/C: in the text, line 153, it is stated that "Dnmt3b mRNA and protein levels were significantly reduced in morulae, but not in blastocysts compared to those of negative control (NC) group". These figures do not support that statement. The IF images show a loss of DNMT3B in the Dnmt3b KD blastocysts. The IF quantification seems to have fewer datapoints for the blastocyst, and looking at the bar graphs, there seems to be a trend towards reduced DNMT3B in both the morula and blastocyst, which would also explain the reduction in DNA methylation in both stages as shown in Figure 2 figure supplement 2D/E.

      Thanks so much for your careful reviewing that makes our statements more accurate. We have rewritten the sentence in the revised manuscript as follows: Dnmt3b mRNA and protein levels were significantly reduced in morulae, and tended to be lower in blastocysts compared to those of the negative control (NC) group. In addition, we have removed “transient” from the original statement “The transient inhibition of Dnmt3b” (Lines 168-170 in the revised manuscript).

      - Figure 2 figure supplement 2F/G: include representative IF images with separation of all channels and the merged image.

      Thank you for your suggestion. We have added the representative immunofluorescence (IF) images with separate channels and merged image in the revised manuscript (Figure 3—figure supplement 3B, F in the revised manuscript).

      - Figure 2 figure supplement 2H: Instead of showing log2FC in methylation levels, delta methylation would be more informative. Are these genes already inactivated at the 8-cell stage? Or are they active and become inactivated by the gain in DNA methylation? Doing qPCR for these genes, or looking at published RNAseq data would be informative. What happens to the expression of these genes in the Dnmt3b KD?

      Thank you for your suggestions. We now represent DNA methylation changes as “ΔDNA methylation”. During mouse preimplantation development, iXCI is initiated in early cleavage-stage female embryos and depends on Xist upregulation around the 4- to 8-cell stage; Xist then specifically coats the paternal X chromosome and finally leads to chromosome-wide silencing via heterochromatinization in early blastocysts. Thus, these non-escaping genes, which are subject to XCI, would not yet be inactivated at the 8-cell stage.

      Author response image 1.

      The processes of iXCI initiation and establishment (left panel), and dynamics of total expression levels of X chromosome in male and female preimplantation embryos (right panel, note that X-dosage is balanced between sexes until the early blastocyst stage).

      As you expected, most of these representative non-escaping genes are downregulated during the transition from the 8-cell to the blastocyst stage, consistent with their gain of DNA methylation. Additionally, since the preimplantation iXCI status is maintained in extraembryonic cells (Galupa and Heard, 2015; Schulz and Heard, 2013), we further reanalyzed published RNA-seq data from the extraembryonic ectoderm (ExE) of single E6.5 embryos that underwent DNA methyltransferase knockout (Chen et al., 2019). The results showed that chromosome-wide loss of DNA methylation led to chromosome-wide transcriptional upregulation on the paternal X chromosome, including at the loci of these non-escaping genes. We have added this result to the revised manuscript (Figure 3—figure supplement 3J; Figure 3—figure supplement 4A-B; Lines 253-261 in the revised manuscript).

      Figure 3:

      - Figure 3 figure supplement 1: representative IF image missing.

      Thanks for your kind reminder. We have added the representative IF images in the revised manuscript to provide a clearer illustration of the data (Figure 4—figure supplement 1A in the revised manuscript).

      - Figure 3 figure supplement 2B: scales are missing for the H3K27me3 ChIP-seq data (are the 8-cell and ICM tracks set to the same scale?). It looks like the ICM track is cut off at the top (peaks not fully displayed) and the data looks very sparse. A more informative analysis would be to do peak calling over promoters and compare 8-cell with ICM.

      Thank you for your detailed reminder. We apologize for the missing scale bars in the H3K27me3 ChIP-seq data. The 8-cell and ICM tracks were set to the same scale, and we have now added scales to the figure in the revised manuscript. Regarding your concern about the ICM track, the flattened appearance of the peaks is not caused by the track being cut off, but by zooming into a specific region in the extended IGV files.

      These results are based on the reanalysis of publicly available data from pooled embryos, which provide suggestive rather than direct evidence for the role of DNA methylation in promoting X-linked H3K27me3 enrichment in iXCI.

      To provide more convincing evidence, we have reanalyzed the RNA-seq and H3K27me3 ChIP-seq data from the extraembryonic ectoderm (ExE) of E6.5 female embryos that underwent Dnmt3a/3b knockout, because the preimplantation iXCI status is maintained in extraembryonic cells (Chen et al., 2019; Galupa and Heard, 2015; Schulz and Heard, 2013). The results showed that Dnmt knockout led to a nearly complete loss of H3K27me3 on the paternal X chromosome (which is specifically inactivated in iXCI), accompanied by notable transcriptional upregulation across the chromosome. By contrast, these changes were not observed on the maternal X chromosome (Figure 3—figure supplement 4 in the revised manuscript). We have added these results to the revised manuscript.

      - Figure 3E: Given all tested proteins give a positive signal, it would have been good to include a negative control chromatin protein that is known to not interact with DNMT3B. Given both PRC2 and DNMT3B are chromatin-binding proteins, can the signal be a result of close proximity instead of a direct interaction?

      In the present study, to test the interaction between DNMT3B and PRC2 core components, we have used in situ proximity ligation assay (PLA), an increasingly popular technique for detecting the close proximity of two proteins in fixed samples using two primary antibodies (Alsemarz et al., 2018).

      Author response image 2.

      Schematic diagram of the principle of the in situ PLA.

      Compared with the classical co-immunoprecipitation (Co-IP) method, in situ PLA has advantages in (1) detecting low-input samples or proteins expressed at low levels, which is extremely difficult using Co-IP; and (2) providing in situ or subcellular information on protein-protein interactions. However, it should be noted that the maximal distance that allows this reaction is 40 nm, which is not small enough to demonstrate a physical interaction between the two antigens but is sufficient to support very close proximity.

      In our study, in situ PLA, including the design of the negative controls, was performed in accordance with the manufacturer’s instructions for the Duolink® In Situ Red Starter Kit (MilliporeSigma): “Technical negative controls included incubation with each primary antibody separately and no primary antibody”. We have refined the relevant sentence in the revised manuscript (Lines 308-310 in the revised manuscript).

      - Figure 3G: It would have been good to include a negative control, and DNase/benzonase to exclude DNA/RNA-mediated protein interaction.

      - (Of note, there have been previous studies reporting an interaction between PRC2 and DNMT3B in other cell types, such as in Weigert et al. 2023, but unfortunately, they don't seem to use DNase/benzonase either).

      The Co-IP analysis of DNMT3B and PRC2 core components in differentiated female ES cells was presented as additional supportive evidence. Because the Co-IP analysis is extremely difficult for preimplantation embryos, we have used in situ PLA to detect their interaction. However, the maximal distance allowing in situ PLA reaction is 40 nm, which is not quite small enough to demonstrate a physical interaction (Alsemarz et al., 2018). Thus, we have added a Co-IP analysis using differentiated female ES cells, in which rXCI occurs upon the differentiation.

      Based on this consideration of the importance and contribution of this result, we have moved this result from the main figure, to the supplemental figure (Figure 4—figure supplement 3H in the revised manuscript).

      - Figure 3 figure supplement 3G: what were the ESCs differentiated into? Did the Dnmt3b KO or Dnmt3a/b DKO show any differentiation defect?

      The mouse ESC line PGK12.1 is a well-established ex vivo model of rXCI. Under standard culture conditions, PGK12.1 cells normally commit to the neuroectodermal lineage.

      Author response image 3.

      Immunostaining of NESTIN, a neuroectodermal stem cell marker molecule, and NANOG in undifferentiated and differentiated PGK12.1 ESCs respectively.

      No differentiation defects have been observed in either Dnmt3b KO or Dnmt3a/3b DKO ESCs in our study. Dnmt KO/DKO/TKO ES cell lines have been successfully used as the model of interaction of DNA methylation and H3K27me3 deposition (King et al., 2016).

      Figure 4:

      - Figure 4B: Is there an explanation for seeing similar total cell numbers in Figure 4B, but showing decreased proliferation in Figure 4A?

      Thank you for your insightful comments. The EdU cell proliferation assay labels cells during the S phase of the cell cycle, as 5-ethynyl 2´-deoxyuridine (EdU) is incorporated into newly synthesized DNA. This labeling identifies cells undergoing DNA synthesis, but these cells may not have completed mitosis at the time of detection. As a result, the total cell number may not immediately reflect the decrease in proliferation observed in the treated group. To address this point, we have rewritten the sentences in the revised manuscript (Lines 174-175 in the revised manuscript).

      References

      Alsemarz, A., Lasko, P. and Fagotto, F. J. B. (2018). Limited significance of the in situ proximity ligation assay. bioRxiv, 411355.

      Auclair, G., Guibert, S., Bender, A. and Weber, M. (2014). Ontogeny of CpG island methylation and specificity of DNMT3 methyltransferases during embryonic development in the mouse. Genome Biol. 15, 545.

      Balaton, B. P. and Brown, C. J. (2021). Contribution of genetic and epigenetic changes to escape from X-chromosome inactivation. Epigenetics Chromatin 14, 30.

      Bartke, T., Vermeulen, M., Xhemalce, B., Robson, S. C., Mann, M. and Kouzarides, T. (2010). Nucleosome-interacting proteins regulated by DNA and histone methylation. Cell 143, 470-484.

      Borgel, J., Guibert, S., Li, Y., Chiba, H., Schubeler, D., Sasaki, H., Forne, T. and Weber, M. (2010). Targets and dynamics of promoter DNA methylation during early mouse development. Nat. Genet. 42, 1093-1100.

      Chen, Z., Yin, Q., Inoue, A., Zhang, C. and Zhang, Y. (2019). Allelic H3K27me3 to allelic DNA methylation switch maintains noncanonical imprinting in extraembryonic cells. Sci Adv 5, eaay7246.

      Chiba, H., Hirasawa, R., Kaneda, M., Amakawa, Y., Li, E., Sado, T. and Sasaki, H. (2008). De novo DNA methylation independent establishment of maternal imprint on X chromosome in mouse oocytes. Genesis 46, 768-774.

      Choi, S. H., Heo, K., Byun, H. M., An, W., Lu, W. and Yang, A. S. (2011). Identification of preferential target sites for human DNA methyltransferases. Nucleic Acids Res. 39, 104-118.

      Chow, J. and Heard, E. (2009). X inactivation and the complexities of silencing a sex chromosome. Curr. Opin. Cell Biol. 21, 359-366.

      Dahlet, T., Argueso Lleida, A., Al Adhami, H., Dumas, M., Bender, A., Ngondo, R. P., Tanguy, M., Vallet, J., Auclair, G., Bardet, A. F., et al. (2020). Genome-wide analysis in the mouse embryo reveals the importance of DNA methylation for transcription integrity. Nat Commun 11, 3153.

      Duymich, C. E., Charlet, J., Yang, X. J., Jones, P. A. and Liang, G. N. (2016). DNMT3B isoforms without catalytic activity stimulate gene body methylation as accessory proteins in somatic cells. Nat Commun 7, 11453.

      Fukuda, A., Tomikawa, J., Miura, T., Hata, K., Nakabayashi, K., Eggan, K., Akutsu, H. and Umezawa, A. (2014). The role of maternal-specific H3K9me3 modification in establishing imprinted X-chromosome inactivation and embryogenesis in mice. Nat Commun 5, 5464.

      Galupa, R. and Heard, E. (2015). X-chromosome inactivation: new insights into cis and trans regulation. Curr. Opin. Genet. Dev. 31, 57-66.

      Gontan, C., Mira-Bontenbal, H., Magaraki, A., Dupont, C., Barakat, T. S., Rentmeester, E., Demmers, J. and Gribnau, J. (2018). REX1 is the critical target of RNF12 in imprinted X chromosome inactivation in mice. Nat Commun 9, 4752.

      Hagarman, J. A., Motley, M. P., Kristjansdottir, K. and Soloway, P. D. (2013). Coordinate regulation of DNA methylation and H3K27me3 in mouse embryonic stem cells. PLoS One 8, e53880.

      Heard, E., Chaumeil, J., Masui, O. and Okamoto, I. (2004). Mammalian X-chromosome inactivation: an epigenetics paradigm. Cold Spring Harb. Symp. Quant. Biol. 69, 89-102.

      Huynh, K. D. and Lee, J. T. (2005). X-chromosome inactivation: a hypothesis linking ontogeny and phylogeny. Nat. Rev. Genet. 6, 410-418.

      Inoue, A., Jiang, L., Lu, F. and Zhang, Y. (2017). Genomic imprinting of Xist by maternal H3K27me3. Genes Dev. 31, 1927-1932.

      Inoue, K., Kohda, T., Sugimoto, M., Sado, T., Ogonuki, N., Matoba, S., Shiura, H., Ikeda, R., Mochida, K., Fujii, T., et al. (2010). Impeding Xist expression from the active X chromosome improves mouse somatic cell nuclear transfer. Science 330, 496-499.

      Jermann, P., Hoerner, L., Burger, L. and Schubeler, D. (2014). Short sequences can efficiently recruit histone H3 lysine 27 trimethylation in the absence of enhancer activity and DNA methylation. Proc. Natl. Acad. Sci. U. S. A. 111, E3415-3421.

      King, A. D., Huang, K., Rubbi, L., Liu, S., Wang, C. Y., Wang, Y., Pellegrini, M. and Fan, G. (2016). Reversible Regulation of Promoter and Enhancer Histone Landscape by DNA Methylation in Mouse Embryonic Stem Cells. Cell Rep. 17, 289-302.

      Maslov, A. Y., Lee, M., Gundry, M., Gravina, S., Strogonova, N., Tazearslan, C., Bendebury, A., Suh, Y. and Vijg, J. (2012). 5-aza-2'-deoxycytidine-induced genome rearrangements are mediated by DNMT1. Oncogene 31, 5172-5179.

      Oikawa, M., Inoue, K., Shiura, H., Matoba, S., Kamimura, S., Hirose, M., Mekada, K., Yoshiki, A., Tanaka, S., Abe, K., et al. (2014). Understanding the X chromosome inactivation cycle in mice: a comprehensive view provided by nuclear transfer. Epigenetics-Us 9, 204-211.

      Oka, M., Meacham, A. M., Hamazaki, T., Rodic, N., Chang, L. J. and Terada, N. (2005). De novo DNA methyltransferases Dnmt3a and Dnmt3b primarily mediate the cytotoxic effect of 5-aza-2'-deoxycytidine. Oncogene 24, 3091-3099.

      Pintacuda, G. and Cerase, A. (2015). X Inactivation Lessons from Differentiating Mouse Embryonic Stem Cells. Stem Cell Rev Rep 11, 699-705.

      Schulz, E. G. and Heard, E. (2013). Role and control of X chromosome dosage in mammalian development. Curr. Opin. Genet. Dev. 23, 109-115.

      Smith, Z. D., Chan, M. M., Mikkelsen, T. S., Gu, H. C., Gnirke, A., Regev, A. and Meissner, A. (2012). A unique regulatory phase of DNA methylation in the early mammalian embryo. Nature 484, 339-344.

      Tada, T., Obata, Y., Tada, M., Goto, Y., Nakatsuji, N., Tan, S., Kono, T. and Takagi, N. (2000). Imprint switching for non-random X-chromosome inactivation during mouse oocyte growth. Development 127, 3101-3105.

      Tan, K., An, L., Miao, K., Ren, L., Hou, Z., Tao, L., Zhang, Z., Wang, X., Xia, W., Liu, J., et al. (2016). Impaired imprinted X chromosome inactivation is responsible for the skewed sex ratio following in vitro fertilization. Proc. Natl. Acad. Sci. U. S. A. 113, 3197-3202.

      Vire, E., Brenner, C., Deplus, R., Blanchon, L., Fraga, M., Didelot, C., Morey, L., Van Eynde, A., Bernard, D., Vanderwinden, J. M., et al. (2006). The Polycomb group protein EZH2 directly controls DNA methylation. Nature 439, 871-874.

      Zhao, J., Yao, K., Yu, H., Zhang, L., Xu, Y., Chen, L., Sun, Z., Zhu, Y., Zhang, C., Qian, Y., et al. (2021). Metabolic remodelling during early mouse embryo development. Nat Metab 3, 1372-1384.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1

      (1) In the "Introduction" section, an important aspect that requires attention pertains to the discussion surrounding the heterodimerization of CXCR4 and CCR5. Notably, the manuscript overlooks a recent study (https://doi.org/10.1038/s41467-023-42082-z) elucidating the mechanism underlying the formation of functional dimers within these G protein-coupled receptors (GPCRs)…The inclusion of this study within the manuscript would significantly enrich the contextual framework of the work, offering readers a comprehensive understanding of the current knowledge surrounding the structural dynamics and functional implications of CXCR4 and CCR5 heterodimerization.

      We thank the reviewer for his/her recommendation to enrich the contextual framework of our study. The Nature Communications paper by Di Marino et al. was published after we sent the first version of our manuscript to eLife, and therefore was not included in the discussion. As the reviewer rightly indicates, this paper elucidates the mechanism underlying the formation of functional dimers within CCR5 and CXCR4. Using metadynamics approaches, the authors emphasize the importance of distinct transmembrane regions for dimerization of the two receptors. In particular, CXCR4 shows two low energy dimer structures and the TMVI-TMVII helices are the preferred interfaces involved in the protomer interactions in both cases. Although the study uses in silico techniques, it also includes the molecular binding mechanism of CCR5 and CXCR4 in the membrane environment, as the authors generate a model in which the receptors are immersed in a 1-palmitoyl-2-oleoyl-sn-glycero-3-phosphocholine (POPC) phospholipid bilayer with 10% cholesterol. This is an important point in this study, as membrane lipids also interact with membrane proteins, and the lipid composition affects CXCR4 oligomerization (Gardeta S.R. et al. Front. Immunol. 2023). In particular, Di Marino et al. find a cholesterol molecule placed in-between the two CXCR4 protomers where it engages a series of hydrophobic interactions with residues including Leu132, Val214, Leu216 and Phe249. Then, the polar head of cholesterol forms an H-bond with Tyr135 that further stabilizes protomer binding. In our hands, the F249L mutation in CXCR4 reverted the antagonism of AGR1.137, suggesting that the compound binds, among others, this residue. We should, nonetheless, indicate that we analyzed receptor oligomerization and not CXCR4 dimerization, which was the main object of the Di Marino et al. study. 
      It is therefore also plausible that residues other than those described as essential for CXCR4 dimerization might participate in receptor oligomerization. We can speculate that AGR1.137 might affect cholesterol binding to CXCR4 and, therefore, alter dimerization/oligomerization. Additionally, the CXCR4 x-ray structure with PDB code 3ODU (Wu B. et al. Science, 2010) experimentally shows the presence of two fatty acid molecules in contact with both TMV and TMVI. These molecules closely interact with hydrophobic residues in the protein, thereby stabilizing it in a hydrophobic environment. Although more experiments will be needed to clarify the mechanism involved, our results suggest that cholesterol and/or other lipids also play an important role in CXCR4 oligomerization and function, as seen for other GPCRs (Jakubik J. & ElFakahani E.E. Int J Mol Sci. 2021). However, we should also consider that other factors not included in the analysis by Di Marino et al. can also affect CXCR4 oligomerization; for instance, the co-expression of other chemokine receptors and/or other GPCRs that heterodimerize with CXCR4 might affect CXCR4 dynamics at the cell membrane, similar to other membrane proteins such as CD4, which also forms complexes with CXCR4 (Martinez-Muñoz L. et al. Mol. Cell 2018).

      The revised discussion contains references to the study by Di Marino et al. to enrich the contextual framework of our data.

      (2) In "various sections" of the manuscript, there appears to be confusion surrounding the terminology used to refer to antagonists. It is recommended to provide a clearer distinction between allosteric and orthosteric antagonists to enhance reader comprehension. An orthosteric antagonist typically binds to the same site as the endogenous ligand, directly blocking its interaction with the receptor. On the other hand, an allosteric antagonist binds to a site distinct from the orthosteric site, inducing a conformational change in the receptor that inhibits the binding of the endogenous ligand. By explicitly defining the terms "allosteric antagonist" and "orthosteric antagonist" within the manuscript, readers will be better equipped to discern the specific mechanisms discussed in the context of the study.

      The behavior of the compounds described in our manuscript (AGR1.35 and AGR1.137) fits the definition of allosteric antagonists in that they bind to a site distinct from the orthosteric site, although they block only some ligand-mediated functions and not others. This would mean that they are not formally antagonists and should not be considered allosteric antagonists, as their binding to CXCR4 does not prevent CXCL12 binding, although they might affect its affinity. In this sense, our compounds fit much better with the concept of negative allosteric modulators (Gao Z.-G. & Jacobson K.A. Drug Discov. Today Technol. 2013). They act by binding to a site distinct from the orthosteric site and selectively block some, but not all, downstream signaling pathways induced by the same endogenous agonist.

      To avoid confusion and to clarify the role of the compounds described in this study, we now refer to them as negative allosteric modulators throughout the manuscript.

      (3) In the Results section, the computational approach employed for "screening small compounds targeting CXCR4, particularly focusing on the inhibition of CXCL12-induced CXCR4 nanoclustering", requires clarification due to several points of incomprehension. The following recommendations aim to address these concerns and enhance the overall clarity of the section:

      (1) Computational Approach and Binding Mode Description: 

      -Explicitly describe the methodology for identifying the pocket/cleft area in angstroms (Å) on the CXCR4 protein structure. Include details on how the volume of the cleft enclosed by TMV and TMVI was determined, as this information is not readily apparent in the provided reference (https://doi.org/10.1073/pnas.1601278113).

      The identification of the cleft was based on the observations by Wu et al. (Wu B. et al. Science 2010), who described the presence of bound lipids in the area formed by TMV and TMVI, and on those of Wescott et al. (Wescott M.P. et al. Proc. Natl. Acad. Sci. 2016) on the importance of TMVI in transmitting the conformational changes promoted by CXCL12 on CXCR4 towards the cytoplasmic surface of the receptor, which links the binding site with signaling activation. Collectively, these results, together with our previous data on the critical role of the N-terminal region of TMVI in CXCR4 oligomerization (Martinez-Muñoz L. et al. Mol. Cell 2018), focused our in silico screening on this region. Once we detected that several compounds bound CXCR4 in this region, the cleft properties were calculated by subtracting the compound structure. The resulting PDB file was analyzed using the PDBsum server (Laskowski R.A. et al. Protein Sci. 2018). Volume calculations were obtained using the server's surface cleft analysis by SURFNET (Laskowski R. A. J. Mol. Graph. 1995). The theoretical interaction surface between the selected compounds and CXCR4, and the atomic distances between the protein residues and the compounds, were calculated using the PISA server (Krissinel E. & Henrick K. J. Mol. Biol. 2007) (Fig. I, only for review purposes). The analysis of the cleft occupied by AGR1.135 showed two independent cavities of 434 Å³ and 1,381 Å³ that were not connected to the orthosteric site. In the case of AGR1.137, the data revealed two distinct clefts of 790 Å³ and 580 Å³ (Fig. I, only for review purposes). These details have been included in the revised manuscript (New Fig. 1A, Supplementary Fig 8A, B).

      (4) Clarify the statement regarding the cleft being "surface exposed for interactions with the plasma membrane," particularly in the context of its embedding within the membrane.

      For GPCRs, transmembrane domains represent binding sites for bioactive lipids that play important functional and physiological roles (Huwiler A. & Zangemeister-Wittke U. Pharmacol. Ther. 2018). The channel between TMV and TMVI connects the orthosteric chemokine binding pocket to the lipid bilayer and is occupied by an oleic acid molecule, according to the CXCR4 structure published in 2010 (Wu B. et al. Science 2010). In addition, the target region contains residues involved in cholesterol (and perhaps other lipids) engagement (Di Marino et al. Nat. Commun. 2023). Taken together, these data support our statement that the cleft supports interactions between CXCR4 molecules and the plasma membrane. 

      Moreover, the data of Di Marino et al. also support that CCR5 and CXCR4 have a symmetric and an asymmetric binding mode. Therefore, either dimeric structure has the possibility to form trimers, tetramers, and even oligomers by using the free binding interface to complex with another protomer. This hypothesis suggests that the interaction of dimers to form oligomers should involve residues distinct from those included in the dimeric conformation.

      The sentence has been modified in the revised manuscript to clarify comprehension.

      (5) Discuss the rationale behind targeting the allosteric binding pocket instead of the orthosteric pocket, outlining potential advantages and disadvantages.

      The advantages and disadvantages of using negative allosteric modulators vs orthosteric antagonists have now been included in the revised discussion.

      The majority of GPCR-targeted drugs function by binding to the orthosteric site of the receptor, and are agonists, partial agonists, antagonists or inverse agonists. These orthosteric compounds can have off-target effects and poor selectivity due to highly homologous receptor orthosteric sites and to abrogation of spatial and/or temporal endogenous signaling patterns. 

      The alternative is to use allosteric modulators, which can tune the functions associated with the receptors without affecting the orthosteric site. They can be positive, negative or neutral modulators, depending on their effect on the functionality of the receptor (Foster D.J. & Conn P.J. Neuron 2017). For example, the use of a negative allosteric modulator of a chemokine receptor to dampen pathological signaling events, while retaining full signaling for non-pathological activities, might limit adverse effects (Kohout T.A. et al. J. Biol. Chem. 2004). In this case, the negative allosteric modulator 873140 blocks CCL3 binding on CCR5 but does not alter CCL5 binding (Watson C. et al. Mol. Pharmacol. 2005). In other cases, allosteric modulators can stabilize a particular receptor conformation and block others. The mechanism of action of maraviroc, the FDA-approved anti-HIV-1 CCR5 allosteric modulator, is attributed to its ability to modulate CCR5 dimer populations and their subsequent subcellular trafficking and localization to the cell membrane (Jin J. et al. Sci. Signal. 2018). Two CCR5 dimeric conformations that are imperative for membrane localization were present in the absence of maraviroc; however, an additional CCR5 dimer conformation was discovered after the addition of maraviroc, and all homodimeric conformations were further stabilized. This finding is consistent with the observation that CCR5 dimers and oligomers inhibit HIV host-cell entry, likely by preventing formation of the HIV-1 co-receptor.

      It is well known that GPCRs activate G proteins, but they also recruit additional proteins (e.g., β-arrestins) that induce signaling cascades which, in turn, can direct specific subsets of cellular responses independent of G protein activation (Eichel K. et al. Nature 2018) and are responsible for either therapeutic or adverse effects. Allosteric modulators can thus be used to block these adverse effects without influencing the therapeutic benefits. This was the case in the design of G protein-biased agonists for the kappa opioid receptor, which maintain the desirable antinociceptive and antipruritic effects and eliminate the sedative and dissociative effects in rodent models (Brust T.F. et al. Sci. Signal 2016).

      (6) Provide the PDB ID of the CXCR4 structure used as a template for modeling with SwissModel. Explain the decision to model the structure from the amino acid sequence and suggest an alternative approach, such as utilizing AlphaFold structures and performing classical molecular dynamics with subsequent clustering for the best representative structure.

      The PDB used as a template for modeling CXCR4 was 3ODU. This information was already included in the material and methods section. At the time we performed these analyses, there were several crystallographic structures of CXCR4 in complex with different molecules and peptides deposited at the PDB. None of them included a full construct containing the complete receptor sequence to provide a suitable sample for X-ray structure resolution, as the N- and C-terminal ends of CXCR4 are very flexible loops. In addition, the CXCR4 constructs contained T4 lysozyme inserted between helices TMV and TMVI to increase the stability of the protein, a common strategy used to facilitate crystallogenesis of GPCRs (Zou Y. et al. PLoS One 2012). Therefore, we generated a CXCR4 homology model using the SWISS-MODEL server (Waterhouse A. et al. Nucleic Acids Res. 2018). This program reconstructed the loop between TMV and TMVI, a domain particularly important in this study that was not present in any of the crystal structures available in the PDB. The model structure was, nonetheless, still incomplete, as it began at P27 and ended at S319 because the terminal ends were not resolved in the crystal structure used as a template. Nevertheless, we considered that these terminal ends were not involved in CXCR4 oligomerization.

      As AlphaFold was not available at the time we initiated this project, we did not use it. However, we have now updated our workflow to current methods and predicted the structure of the target using AlphaFold (Jumper J. et al. Nature 2021) and the sequence available under UniProt entry P61073. We prepared the ligands using OpenBabel (O’Boyle N.M. et al. J. Cheminformatics 2011), with Gasteiger charge assignment, and generated 10 conformers for each input ligand using the OpenBabel genetic algorithm. We then prepared the target structure with OpenMM, removing all waters and possible heteroatoms, and adding all missing atoms. We next predicted the target binding pockets with fPocket (Le Guilloux V. et al. BMC Bioinformatics 2009), p2rank (Krivak R. & Hoksza D. J. Cheminformatics 2018), and AutoDock AutoSite (Ravindranath P.A. & Sanner M.F. Bioinformatics 2016). We chose only those pockets between TMV and TMVI (see answer to point 3). We merged the results of the three programs into so-called consensus pockets, where two pockets are considered sufficiently similar to merge if they share at least 75% of their surfaces (del Hoyo D. et al. J. Chem. Inform. Model. 2023). Among the consensus pockets, one was significantly larger than the others and was therefore selected. We then docked the ligand conformers in this pocket using AutoDock GPU (Santos-Martins D. et al. J. Chem. Theory Comput. 2021), LeDock (Liu N. & Xu Z. IOP Conf. Ser. Earth Environ. Sci. 2019), and Vina (Eberhardt J. et al. J. Chem. Inf. Model. 2021). The number of docking poses obtained varied from 210 to 287. We scored each pose with the Vina score using ODDT (Wójcikowski M. et al. J. Cheminform. 2015). Then, we clustered the different solutions into groups whose maximum RMSD was 1 Å. This resulted in 40 clusters; the representative of each cluster was the pose with the maximum Vina score. This analysis confirmed that the selected compounds bound this pocket (Author response image 1).
      When required, we calculated the binding affinity using Schrödinger’s MM-GBSA procedure (Greenidge P.A. et al. J. Chem. Inf. Model. 2013), in two ways: first, assuming that the ligand and target are fixed; second, with an energy minimization of all the atoms within a distance of 3 Å from the ligand. This information has now been included in the revised version of the manuscript.
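
      As an illustration only, the pose-clustering step described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the pipeline actually used: it assumes all poses share the same atom ordering and coordinate frame, uses a simple greedy scheme in which a pose joins a cluster only if its RMSD to every member stays within the cutoff, and assumes that the best score is the most negative one. The function names (`rmsd`, `cluster_poses`, `representatives`) are hypothetical.

```python
import math


def rmsd(pose_a, pose_b):
    """RMSD between two poses given as lists of (x, y, z) tuples in the same atom order."""
    if len(pose_a) != len(pose_b):
        raise ValueError("poses must have the same number of atoms")
    squared = sum(
        (p - q) ** 2
        for atom_a, atom_b in zip(pose_a, pose_b)
        for p, q in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(pose_a))


def cluster_poses(poses, scores, max_rmsd=1.0):
    """Greedy clustering of poses, returned as lists of pose indices.

    Poses are visited from best to worst score (assumption: lower is better).
    A pose joins the first cluster in which its RMSD to every member is at
    most max_rmsd; otherwise it starts a new cluster.
    """
    order = sorted(range(len(poses)), key=lambda i: scores[i])
    clusters = []
    for i in order:
        for members in clusters:
            if all(rmsd(poses[i], poses[j]) <= max_rmsd for j in members):
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def representatives(clusters):
    """The first member of each cluster is its best-scored pose, by construction."""
    return [members[0] for members in clusters]


# Toy coordinates (hypothetical, two atoms per pose) and hypothetical scores.
toy_poses = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.1, 0.0, 0.0), (1.1, 0.0, 0.0)],
    [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0)],
]
toy_scores = [-7.0, -8.5, -6.0]
toy_clusters = cluster_poses(toy_poses, toy_scores)
print(toy_clusters, representatives(toy_clusters))
```

      Visiting poses in score order means the representative of each cluster is simply its first member. If the opposite score convention is intended, the sort key would need to be reversed.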

      Author response image 1.

      AGR1.135 docking in CXCR4 using the updated protocol for ligand docking. Cartoon representation colored in gray with TMV and TMVI shown in blue and pink, respectively. AGR1.135 is shown in stick representation with carbons in yellow, oxygens in red and nitrogens in blue.

      (7) Specify the meaning of "minimal interaction energy" and where (if present) the interaction scores are reported in the text.

      By minimal interaction energy we mean the best docking score, that is, the best score obtained in our docking studies. These data were not included in the previous manuscript due to space restrictions but are now included in the revised manuscript.

      (8) You performed docking studies using GLIDE to identify potential binding sites for the small compounds on the CXCR4 protein. The top-scoring binders were then subjected to further refinement using PELE simulations. However, I realize that a detailed description of the specific binding modes of these compounds was not provided in the text. Please make the description of binding poses more detailed.

      Firstly, to assess the reliability of this method, a PELE study was carried out for the control molecule IT1t, which is a small drug-like isothiourea derivative that has been crystallized in complex with CXCR4 (PDB code: 3ODU). IT1t is a CXCR4 antagonist that binds to the CXCL12 binding cavity and inhibits HIV-1 infection (Das D. Antimicrob. Agents Chemother. 2015; Dekkers S. et al. J. Med. Chem. 2023). Of the five best trajectories, two had clearly better binding energies and corresponded to almost the same predicted pose of the molecule. Although the predicted binding mode was not exactly the same as the one in the crystal structure, the approximation was very good, which validates the approach. Although PELE is a suitable technique to find potential binding sites, the predicted poses must be subsequently refined using docking programs.

      Analyzing the best trajectories for the remaining ligands, at least one of the best-scored poses was always located at the orthosteric binding site of CXCR4. Even though these poses showed good binding energies, they were discarded as the in vitro biological experiments indicated that the compounds were unable to block CXCL12 binding or CXCL12-mediated inhibition of cAMP release or CXCR4 internalization. Collectively, these data indicated that the selected compounds did not behave as orthosteric inhibitors of CXCR4. The CXCL12 binding pocket is the biggest cavity in CXCR4, and so PELE may tend to place the molecules near it. However, all the compounds presented other feasible binding sites with a comparable binding energy.

      AGR1.135 and AGR1.137 showed interesting poses between TMV and TMVI with very good binding energy (-51.4 and -37.2 kcal/mol, respectively). This was precisely the region we had previously selected for the in silico screening, as previously described (see response to point 3).

      AGR1.131 showed two poses with low binding energy that were placed between helices TMI and TMVII (-43.6 kcal/mol) and between helices TMV and TMVI (-39.8 kcal/mol). This compound was unable to affect CXCL12-mediated chemotaxis and was therefore used as an internal negative control, as it was selected in the in silico screening with the same criteria as the other compounds but failed to alter any CXCL12-mediated functions. PELE studies nonetheless provided different binding sites for each molecule, which had to be further studied using docking to obtain a more accurate binding mode. In line with the previous comment, we repeated the analysis using AlphaFold and the rest of the procedure described (see our response to point 6) and calculated the binding energies for all the compounds using Schrödinger’s MM-GBSA procedure (Greenidge P.A. et al. J. Chem. Inf. Model. 2013). Calculations were performed in two ways: first, assuming that the ligand and target are fixed; second, with an energy minimization of all the atoms within a distance of 3 Å from the ligand. The results using the first method indicated that AGR1.135 and AGR1.137 showed poses between TMV and TMVI with -56.4 and -62.4 kcal/mol, respectively, and AGR1.131 had a pose between TMI and TMVII with -61.6 kcal/mol. With the second method, AGR1.135 and AGR1.137 showed poses between TMV and TMVI with -57.9 and -67.6 kcal/mol, respectively, and AGR1.131 showed a pose between TMI and TMVII with -62.2 kcal/mol.
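
      For readers who wish to compare the two MM-GBSA protocols at a glance, the short sketch below tabulates the shift in binding energy when local minimization is allowed. The energies are copied from the paragraph above; the dictionary and function names are hypothetical and serve only this illustration.

```python
# MM-GBSA binding energies in kcal/mol, as reported in the response above.
# AGR1.135 and AGR1.137: pose between TMV and TMVI; AGR1.131: pose between TMI and TMVII.
fixed_energy = {"AGR1.135": -56.4, "AGR1.137": -62.4, "AGR1.131": -61.6}
minimized_energy = {"AGR1.135": -57.9, "AGR1.137": -67.6, "AGR1.131": -62.2}


def energy_shift(fixed, minimized):
    """Return minimized minus fixed energy per compound, rounded to 0.1 kcal/mol."""
    return {name: round(minimized[name] - fixed[name], 1) for name in fixed}


shifts = energy_shift(fixed_energy, minimized_energy)
for name, delta in shifts.items():
    print(f"{name}: {delta:+.1f} kcal/mol after minimization within 3 Å of the ligand")
```

      A negative shift means that the minimized complex is predicted to be more favorable than the rigid one.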

      This information is now included in the text.

      (9) (2) Experimental Design:-Justify the choice of treating Jurkat cells with a concentration of 50 μM of the selected compound. Consider exploring different concentrations and provide a rationale for the selected dosage. Additionally, clearly identify the type of small compound used in the initial experiment.

      The revised version contains a new panel in Fig. 1B to show a more detailed kinetic analysis with different concentrations (1-100 µM) of the compounds in the Jurkat migration experiments. In all cases, 100 µM nearly completely abrogated cell migration, but in order to reduce the amount of DMSO added to the cells we selected 50 µM for further experiments, as it was the concentration that inhibited 50-75% of ligand-induced cell migration. Regarding the type of small compounds used in the initial experiments, they were compounds included in the library described in reference #24 (Sebastian-Pérez V. et al. Med. Biol. Chem. 2017), which contains heterocyclic compounds. We would note that we do not consider AGR1.137 a final compound. We think that there is scope to develop AGR1.137-based second-generation compounds with greater solubility in water and greater specificity or affinity for CXCR4, and to evaluate delivery methods that might increase activity.

      (10) Avoid reporting details in rounded parentheses within the text; consider relocating such information to the Materials and Methods section or figure captions for improved readability.

      Most of the rounded parentheses within the text have been eliminated in the revised version of the manuscript to improve readability.

      (11) Elaborate on the virtual screening approach using GLIDE software, specifying the targeted site and methodology employed.

      For the virtual screening, we used the Glide module (SP and XP scoring functions) included in the Schrödinger software package, utilizing the corresponding 3D target structure and our MBC library (Sebastián-Pérez V. et al. J. Chem. Inf. Model. 2017). The center of the catalytic pocket was selected as the centroid of the grid. In the grid generation, a scaling factor of 1.0 for van der Waals radius scaling and a partial charge cutoff of 0.25 were used. A rescoring of the SP poses of each compound was then performed with the XP scoring function of Glide. The XP mode in Glide was used in the virtual screening; the ligand sampling was flexible, Epik state penalties were added, and an energy window of 2.5 kcal/mol was used for ring sampling. In the energy minimization step, the distance-dependent dielectric constant was 4.0, with a maximum of 100,000 minimization steps. In the clustering, poses were considered duplicates and discarded if both the RMS deviation was less than 0.5 Å and the maximum atomic displacement was less than 1.3 Å.
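
      The duplicate-pose criterion stated above can be written out as a simple check. The sketch below is not Glide's implementation; it only restates the two thresholds (RMS deviation below 0.5 Å and maximum atomic displacement below 1.3 Å) for two poses that are assumed to share the same atom ordering and coordinate frame. The function name `is_duplicate_pose` is hypothetical.

```python
import math


def is_duplicate_pose(pose_a, pose_b, rms_cutoff=0.5, max_disp_cutoff=1.3):
    """Return True if two poses meet both duplicate criteria (distances in Å).

    Each pose is a list of (x, y, z) tuples with atoms in the same order.
    No superposition is performed: both poses are assumed to sit in the
    receptor frame, as docking output normally does.
    """
    if len(pose_a) != len(pose_b):
        raise ValueError("poses must have the same number of atoms")
    displacements = [math.dist(a, b) for a, b in zip(pose_a, pose_b)]
    rms = math.sqrt(sum(d * d for d in displacements) / len(displacements))
    return rms < rms_cutoff and max(displacements) < max_disp_cutoff


# Hypothetical ten-atom ligand laid out along the x axis.
reference = [(float(i), 0.0, 0.0) for i in range(10)]
shifted = [(x + 0.2, y, z) for x, y, z in reference]     # every atom moved 0.2 Å
one_flipped = reference[:-1] + [(9.0, 1.4, 0.0)]         # one atom moved 1.4 Å

print(is_duplicate_pose(reference, shifted))      # small uniform shift
print(is_duplicate_pose(reference, one_flipped))  # low RMS but one large displacement
```

      The second toy pose shows why both thresholds are needed: a single substituent that flips can leave the overall RMS deviation below 0.5 Å while one atom moves by more than 1.3 Å, so the pose is kept as distinct.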

      (12) Provide clarity on the statement that AGR1.131 "theoretically" binds the same motif, explaining the docking procedure used for this determination.

      In the in silico screening, AGR1.131 was one of the 40 selected compounds that showed, according to the PELE analysis (see answer to point 8), a pose with low binding energy (-39.8 kcal/mol) between the TMV and TMVI helices, which is the area selected for the screening. Nonetheless, its best pose using the initial workflow was placed between helices TM1 and TM7 (-43.7 kcal/mol). In conclusion, although AGR1.131 could also bind the TMV-TMVI region, its most favorable pose was in the area between TMI and TMVII. In addition, the compound was included in the biological screening, where it did not affect CXCL12-mediated chemotaxis. We thus decided to use it as an internal negative control, as it has a skeleton very similar to AGR1.135 and AGR1.137 and can interact with the TM domains of CXCR4 without promoting biological effects. This statement has been clarified in the revised text.

      (13) Toxicity Testing:

      -Enhance the explanation of the approach to testing the toxicity of the compound in Jurkat cells. Consider incorporating positive controls to strengthen the assessment and clarify the experimental design.

      All the selected compounds in the in silico screening were initially tested for propidium iodide incorporation in treated cells in a toxicity assay, and some of them were discarded for further experiments (e.g., AGR1.103 and VSP3.1).

      Further evaluation of Jurkat cell viability was determined by cell cycle analysis using propidium iodide.  Supplementary Fig. 1B included the percentage of each cell cycle phase, and data indicated no significant differences between the treatments tested. Nevertheless, at the suggestion of the reviewer, and to clarify this issue, positive controls inducing Jurkat cell death (staurosporine and hydrogen peroxide) have also been included in the new Supplementary Fig. 2. The new figure also includes a table showing the percentage of cells in each cell-cycle phase.  

      (14) In the Results section concerning "AGR1.135 and AGR1.137 blocking CXCL12-mediated CXCR4 nanoclustering and dynamics", several points can be improved to enhance clarity and coherence: 1. Specificity of Low Molecular Weight Compounds:  

      -Clearly articulate how AGR1.135 and AGR1.137 specifically target homodimeric CXCR4 and provide an explanation for their lack of impact on heterodimeric CXCR4-CCR5 in that region.

      First of all, we should clarify that when we talk about receptor nanoclustering, oligomers refer to complexes including 3 or more receptors and, therefore, the residues involved in these interactions can differ from those involved in receptor dimerization. Moreover, our FRET experiments did not indicate that the compounds alter receptor dimerization (see new Supplementary Fig. 7). Of note, mutant receptors unable to oligomerize can still form dimers (Martínez-Muñoz L. et al. Mol. Cell 2018; García-Cuesta E.M. et al. Proc. Natl. Acad. Sci. USA 2022). Additionally, we believe that these oligomers can also include other chemokine receptors/proteins expressed at the cell membrane, which we are currently studying using different models and techniques.

      We have results supporting the existence of CCR5/CXCR4 heterodimers (Martínez-Muñoz L. et al. Proc. Natl. Acad. Sci. USA 2014), in line with the data published by Di Marino et al. However, in the current study we have not evaluated the impact of the selected compounds on CXCR4 complexes other than CXCR4 oligomers. Our Jurkat cells do not express CCR5 and, therefore, we cannot discuss whether AGR1.137 affects CCR5/CXCR4 heterodimers. The chemokine field is very complex and most receptors can form dimers (homo- and heterodimers) as well as oligomers (Martinez-Muñoz L. et al. Pharmacol & Therap. 2011) when co-expressed. Evaluating different receptor combinations in the same experiment is a complex task, as the number of potential combinations between distinct expressed receptors makes the analysis very difficult. We started with CXCR4 as a model, and will continue later with other possible CXCR4 complexes. In addition, for the analysis of CCR5/CXCR4 dynamics, it is much better to use dual-TIRF techniques, which allow the simultaneous detection of two distinct molecules coupled to different fluorochromes.

      Regarding the data of Di Marino et al., it is possible that the compounds might also affect heterodimeric conformations of CXCR4. This aspect has also been broached in the revised discussion. We would again note that we evaluated CXCR4 oligomers and not monomers or dimers; this is especially relevant when we compare the residues involved in these processes as they might differ depending on the receptor conformation considered. This issue was also hypothesized by Di Marino et al. (see our response to point 4).

      (15) When referring to "unstimulated" cells, provide a more detailed explanation to elucidate the experimental conditions and cellular state under consideration.

      Unstimulated cells refer to the cells in basal conditions, that is, cells in the absence of CXCL12. For TIRF-M experiments, transiently-transfected Jurkat cells were plated on glass-bottomed microwell dishes coated with fibronectin; these are the unstimulated cells. To observe the effect of the ligand, dishes were coated as above plus CXCL12 (stimulated cells). We have clarified this point in the material and methods section of the revised version.

      (16) 2. Paragraph Organization

      -Reorganize the second paragraph to eliminate redundancy and improve overall flow. A more concise and fluid presentation will facilitate reader comprehension and engagement.

      The second paragraph has been reorganized to improve overall flow.

      (17) Ensure that each paragraph contributes distinct information, avoiding repetition and redundancy.

      We have carefully revised each paragraph of the manuscript to avoid redundancy.

      (18) 3. Claim of Allosteric Antagonism:

      -Exercise caution when asserting that "AGR1.135 and AGR1.137 behave as allosteric antagonists of CXCR4" based on the presented results. Consider rephrasing to reflect that the observed effects suggest the potential allosteric nature of these compounds, acknowledging the need for further investigations and evidence.

      To avoid misinterpretations on the effect of the compounds on CXCR4, as we have commented in our response to point 2, we have substituted the term allosteric inhibitors with negative allosteric modulators, which refer to molecules that act by binding a site distinct from the orthosteric site, and selectively block some downstream signaling pathways, whereas others induced by the same endogenous or orthosteric agonist are unaffected (Gao Z.-G. & Jacobson K.A. Drug Discov. Today Technol. 2013). Our data indicate that the selected small compounds do not block ligand binding or G protein activation or receptor internalization, but inhibit receptor oligomerization and ligand-mediated directed cell migration.

      (19) In the Results section discussing the "incomplete abolition of CXCR4-mediated responses in Jurkat cells by AGR1.135 and AGR1.137", several points can be refined for better clarity and completeness:  1. Inclusion of Positive Controls: 

      -Consider incorporating positive controls in relevant experiments to provide a comparative benchmark for assessing the impact of AGR1.135 and AGR1.137. This addition will strengthen the interpretation of results and enhance the experimental rigor. 

      The in vivo experiments (Fig. 7E,F) used AMD3100, an orthosteric antagonist of CXCR4, as a positive control. We also included AMD3100, as a positive control of inhibition when evaluating the effect of the compounds on CXCL12 binding (Fig. 3, new Supplementary Fig. 3). The revised version of the manuscript also includes the effect of this inhibitor on other relevant CXCL12-mediated responses such as cell migration (Fig. 1B), receptor internalization (Fig. 3A), cAMP production (Fig. 3C), ERK1/2 and AKT phosphorylation (Supplementary Fig. 4), actin polymerization (Fig. 4A), cell polarization (Fig. 4B, C) and cell adhesion (Fig. 4D), to facilitate the interpretation of the results and improve the experimental rigor.

      (20) 2. Clarification of Terminology: 

      -Clarify the term "CXCR4 internalizes" by providing context, perhaps explaining the process of receptor internalization and its relevance to the study.

      We refer to CXCR4 internalization as a CXCL12-mediated endocytosis process that results in reduction of CXCR4 levels on the cell surface. We use CXCR4 internalization in this study for two purposes. First, for CXCR4 and other chemokine receptors, internalization processes are mediated by ligand-induced clathrin vesicles (Venkatesan et al. 2003), a process that triggers CXCR4 aggregation in these vesicles. We have previously determined that the oligomers of receptors detected by TIRF-M remain unaltered in cells treated with inhibitors of clathrin vesicle formation and of internalization processes (Martinez-Muñoz L. et al. Mol. Cell 2018). Moreover, we have described a mutant CXCR4 that cannot form oligomers but internalizes normally in response to CXCL12 (Martinez-Muñoz L. et al. Mol. Cell 2018). The observation in this manuscript of normal CXCL12-mediated endocytosis in the presence of the negative allosteric inhibitors of CXCR4 that abrogate receptor oligomerization reinforces the idea that the oligomers detected by TIRF are not related to receptor aggregates involved in endocytosis. Second, receptor internalization is not affected by the allosteric compounds, indicating that they downregulate some CXCL12-mediated signaling events but not others (new Fig. 3).

      All these data have been included in the revised discussion of the manuscript.

      (21) Elaborate on the meaning of "CXCL12 triggers normal CXCR4mut internalization" to enhance reader understanding.

      We have previously described a triple-mutant CXCR4 (K239L/V242A/L246A; CXCR4mut). The mutant residues are located in the N-terminal region of TMVI, close to the cytoplasmic region, thus limiting the CXCR4 pocket described in this study (see our response to point 3). This mutant receptor dimerizes but neither oligomerizes in response to CXCL12 nor supports CXCL12-induced directed cell migration, although it can still trigger some Ca2+ flux and is internalized after ligand activation (Martinez-Muñoz L. et al. Mol. Cell 2018).  We use the behavior of this mutant (CXCR4mut) to show that the CXCR4 oligomers and the complexes involved in internalization processes are not the same and to explain why we evaluated CXCR4 endocytosis in the presence of the negative allosteric modulators.

      As we indicated in a previous answer to the reviewer, these issues have been re-elaborated in the revised version.

      (22) 3. Discrepancy in CXCL12 Concentration:

      -Address the apparent discrepancy between the text stating, "...were stimulated with CXCL12 (50 nM, 37 °C)," and the figure caption (Fig. 3A) reporting a concentration of 12.5 nM. Rectify this inconsistency and provide an accurate and clear explanation.

      We apologize for this error, which is now corrected in the revised manuscript. With the exception of the cell migration assays in Transwells, where the optimal concentration was established at 12.5 nM, the optimal concentration of CXCL12 employed in the remaining experiments was 50 nM. These concentrations were optimized in previous work from our laboratory using the same type of experiment. We should also remark that in the experiments using lipid bilayers or TIRF-M, CXCL12 is used to coat the plates and therefore it is difficult to determine the real concentration of the ligand that is retained on the surface of the plates after the washing steps performed prior to adding the cells. In addition, we use 100 nM CXCL12 to create the gradient in the chambers used to perform the directed-cell migration experiments.

      (23) 4. Speculation on CXCL12 Binding:

      -Refrain from making speculative statements, such as "These data suggest that none of the antagonists alters CXCL12 binding to CXCR4," unless there is concrete evidence presented up to that point. Clearly outline the results that support this conclusion.

      Figure 3B and Supplementary Figure 3 show CXCL12-ATTO700 binding by flow cytometry in cells pretreated with the negative allosteric modulators. We have also included AMD3100, the orthosteric antagonist, as a control for inhibition. While these experiments showed no major effect of the compounds on CXCL12 binding, we cannot rule out small changes in the affinity of the interaction between CXCL12 and CXCR4. Consequently, we have rewritten these statements.

      (24) 5. Corroboration of Data:

      -Specify where the corroborating data from immunostaining and confocal analysis are reported, ensuring readers can access the relevant information to support the conclusions drawn in this section.

      In agreement with the suggestion of the reviewer, the revised manuscript includes data from immunostaining and confocal analysis to complement Fig. 4B (new Fig. 4C). The revised version also includes some representative videos for the TIRF experiments shown in Figure 2 to improve clarity.

      (25) In the Results section concerning "AGR1.135 and AGR1.137 antagonists and their direct binding to CXCR4", several aspects need clarification and refinement for a more comprehensive and understandable presentation: 1. Workflow Clarification:

      -Clearly articulate the workflow used for assessing the binding of AGR1.135 and AGR1.137 to CXCR4. Address the apparent contradiction between the inability to detect a direct interaction and the utilization of Glide for docking in the TMV-TMVI cleft.

      To address the direct interaction of the compounds with CXCR4, we intentionally avoided modifying the small compounds with different labels, which could affect their properties. We therefore attempted a fluorescence spectroscopy strategy to formally prove the ability of the small compounds to bind CXCR4, but this failed because AGR1.135 is yellow in color, which interfered with the determinations. We also tried a FRET strategy (see new Supplementary Fig. 7) and detected a significant increase in FRET efficiency of CXCR4 homodimers when AGR1.135 was evaluated, but again the yellow color interfered with FRET determinations. Moreover, AGR1.137 did not modify FRET efficiency of CXCR4 dimers. Therefore, we were unable to detect the interaction of the compounds with CXCR4.

      We elected to develop an indirect strategy. In silico, we evaluated the binding site using docking and molecular dynamics to predict the most promising CXCR4 residues involved in the interaction with the selected compounds. Next, we generated point mutant receptors of the predicted residues and re-evaluated the behavior of the allosteric antagonists in a CXCL12-induced cell migration experiment. We first discarded those CXCR4 mutants that were not expressed on the cell membrane as well as those that were not functional when activated with CXCL12. Using this strategy, we eliminated the interference due to the physical properties of the compounds and reasoned that if the antagonism of a compound is reversed in a particular CXCR4 mutant, it is because the mutated residue participates in or interferes with the interaction between CXCR4 and the compound, thus assuming (albeit indirectly) that the compound binds CXCR4.

      To select the specific mutations included in the analysis, our strategy was to generate point mutations in residues present in the TMV-TMVI pocket of CXCR4 that were not directly proposed as critical residues involved in chemokine engagement, signal initiation, signal propagation, or G protein-binding, based on the extensive mutational study published by Wescott et al. (Wescott M.P. et al. Proc. Natl. Acad. Sci. USA 2016).

      (26) Provide a cohesive explanation of the transition from docking evaluation to MD analysis, ensuring a transparent representation of the methodology.

      Based on the aim of this work, the workflow shown in Author response image 2 was proposed to predict the binding mode of the selected molecules. First, a CXCR4 model was generated to reconstruct some unresolved parts of the protein structure; then a binding site search using PELE software was performed to identify the most promising binding sites; subsequently, docking studies were performed to refine the binding mode of the molecules; and finally, molecular dynamics simulations were run to determine the most stable poses and predict the residues that we should mutate to test that the compounds interact with CXCR4.

      Author response image 2.

      Workflow followed to determine the binding mode of the studied compounds.

      (27) 2. Choice of Software and Techniques:

      -Justify the use of "AMBER14" and the PELE approach, considering their potential obsolescence.

      These experiments were performed five years ago when the project was initiated. As the reviewer indicates, AMBER14 and PELE approaches might perhaps be considered obsolescent. Thus, we have predicted the structure of the target using AlphaFold (Jumper J. et al, Nature 2021) and the sequence available under UniProt entry P61073. The complete analysis performed (see our response to point 4) confirmed that the compounds bound the selected pocket, as we had originally determined using PELE. These new analyses have been incorporated into the revised manuscript.

      (28) -Discuss the role of the membrane in the receptor-ligand interaction. Elaborate on how the lipidic double layer may influence the binding of small compounds to GPCRs embedded in the membrane.

      Biological membranes are vital components of living organisms, providing a diffusion barrier that separates cells from the extracellular environment, and compartmentalizing specialized organelles within the cell. In order to maintain the diffusion barrier and to keep it electrochemically sealed, a close interaction of membrane proteins with the lipid bilayer is necessary. It is well known that this is important, as many membrane proteins undergo conformational changes that affect their transmembrane regions and that may regulate their activity, as seen with GPCRs (Daemen F.J. & Bonting S.L., Biophys. Struct. Mech. 1977; Gether U. et al. EMBO J. 1997). The lateral and rotational mobility of membrane lipids supports the sealing function while allowing for the structural rearrangement of membrane proteins, as they can adhere to the surface of integral membrane proteins and flexibly adjust to a changing microenvironment. In the case of the first atomistic structure of CXCR4 (Wu B. et al. Science 2010), it was indicated that for dimers, monomers interact only at the extracellular side of helices V and VI, leaving at least a 4-Å gap between the intracellular regions, which is presumably filled by lipids. In particular, they indicated that the channel between TMV and TMVI that connects the orthosteric chemokine binding pocket to the lipid bilayer is occupied by an oleic acid molecule. Recently, Di Marino et al., analyzing the dimeric structure of CXCR4, found a cholesterol molecule placed in between the two protomers, where it engages a series of hydrophobic interactions with residues located in the area between TMI and TMVI (Leu132, Val214, Leu216, Leu246, and Phe249). The polar head of cholesterol forms an H-bond with Tyr135 that further stabilizes its binding mode. This finding confirms that cholesterol might play an important role in mediating and stabilizing receptor dimerization, as seen in other GPCRs (Pluhackova, K., et al. PLoS Comput. Biol. 2016). 
In addition, we have previously observed that, independently of the structural changes on CXCR4 triggered by lipids, the local lipid environment also regulates CXCR4 organization, dynamics and function at the cell membrane and modulates chemokine-triggered directed cell migration. Prolonged treatment of T cells with bacterial sphingomyelinase promoted the complete and sustained breakdown of sphingomyelins and the accumulation of the corresponding ceramides, which altered both membrane fluidity and CXCR4 nanoclustering and dynamics. Under these conditions, CXCR4 retained some CXCL12-mediated signaling activity but failed to promote efficient directed cell migration (Gardeta S.R. et al. Front. Immunol. 2022). Collectively, these data demonstrate the key role that lipids play in the stabilization of CXCR4 conformations and in regulating its lateral mobility, influencing their associated functions. These considerations have been included in the revised version of the manuscript. 

      (29) 3. Stable Trajectories and Binding Mode Superimposition -Specify the criteria for defining "stable trajectories" to enhance reader understanding.

      There are several ways to describe the stability of an MD simulation, based on the convergence of energies, distances or ligand-target interactions, among others. In this work, we use the expression “stable trajectories” to refer to simulations in which the ligand trajectory converges and the ligand RMSD does not fluctuate by more than 0.25 Å. This definition is now included in the revised text.
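      As an illustration only, the stability criterion stated above could be checked on a per-frame ligand RMSD series as in the minimal Python sketch below. The function name `is_stable`, the `window` parameter, and the reading of "fluctuate" as the maximum deviation from the mean over the final frames are assumptions made for this example; they are not the procedure used in the manuscript.

```python
def is_stable(rmsd, window, max_fluct=0.25):
    """Return True if the ligand RMSD (in Å) over the last `window` frames
    stays within `max_fluct` Å of its mean over those frames.

    Assumption for this sketch: "fluctuation" is taken as the largest
    absolute deviation from the window mean.
    """
    values = list(rmsd)
    if window <= 0 or len(values) < window:
        raise ValueError("window must be positive and no longer than the series")
    tail = values[-window:]
    mean = sum(tail) / len(tail)
    # Largest excursion from the mean in the converged part of the trajectory
    return max(abs(x - mean) for x in tail) <= max_fluct


# Hypothetical RMSD values (Å): the ligand drifts first and then settles
example = [0.5, 1.0, 1.5, 2.0, 2.1, 2.0, 2.1, 2.0]
print(is_stable(example, window=4))
```

      Restricting the check to a final window separates the initial equilibration drift from the converged part of the trajectory; the window length is a free choice that would need to be reported alongside the threshold.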

      (30) Clarify the meaning behind superimposing the two small compounds and ensure that the statement in the figure caption aligns with the information presented in the main text.

      We apologize for the error in the previous Fig. 5A and in its legend. The figure was created by superimposing the protein component of the poses for the two compounds, AGR1.135 and AGR1.137, rather than the compounds themselves. As panel 5A was confusing, we have modified all Fig. 5 in the revised manuscript to improve clarity.

      (31) 4. Volume Analysis and Distances:

      -Provide details on how the volume analysis was computed and how distances were accounted for. Consider adding a figure to illustrate these analyses, aiding reader comprehension.

      The cleft search and analysis were performed using the default settings of SURFNET (Laskowski R.A. J. Mol. Graph. 1995) included in the PDBsum server (Laskowski R.A. et al. Trends Biochem. Sci. 1997). The first run of the input model for CXCR4 3ODU identified a promising cleft of 870 Å³ in the lower half of the region flanked by TMV and TMVI, highlighting this area as a possible small molecule binding site (Fig. I, only for review purposes). Analysis of the cleft occupied by AGR1.135 showed two independent cavities of 434 Å³ and 1381 Å³ that were not connected to the orthosteric site. The same procedure for AGR1.137 revealed two distinct clefts of 790 Å³ and 580 Å³, respectively (Fig. I, only for review purposes). Analysis of the atomic distances between the protein residues and the compounds was performed using the PISA server (Krissinel E. & Henrick K. J. Mol. Biol. 2007). (Please see our response to point 3 and the corresponding figure).

      (32) 5. Mutant Selection and Relevance:

      -Clarify the rationale behind selecting the CXCR4 mutants used in the study. Consider justifying the choice and exploring the possibility of performing an alanine (ALA) scan for a more comprehensive mutational analysis.  

      The selection of the residues to be mutated along the cleft was first based on their presence in the proposed cleft and the direct interaction of the compounds with them, either by hydrogen bonding or by hydrophobic interactions. Second, none of the mutated residues belonged to the critical residues involved in transmitting the signal generated by the interaction of CXCL12 with the receptor. In any case, mutants producing a non-functional CXCR4 at the cell membrane were discarded after FACS analysis and chemotaxis experiments. Finally, the length and nature of the resulting mutations were designed mainly to occlude the cleft in case of the introduction of long residues such as lysines (I204K, L208K) or to alter hydrophobic interactions by changing the carbon side chain composition of the residues in the cleft. Indeed, we agree that an alanine scan would have been an alternative strategy to evaluate the residues involved in the interactions of the compounds.

      (33) Reevaluate the statement regarding the relevance of the Y256F mutation for the binding of AGR1.137. If there is a significant impact on migration in the mutant (Fig. 6B), elaborate on the significance in the context of AGR1.137 binding.

      In the revised discussion we provide more detail on the relevance of Y256F mutation for the binding of AGR1.137 as well as for the partial effect of G207I and R235L mutations. The predicted interactions for each compound are depicted in new Fig. 6 C, D after LigPlot+ analysis (Laskowski R.A. & Swindells M.B. J. Chem. Inf. Model. 2011), showing that AGR1.135 interacted directly with the receptor through a hydrogen bond with Y256. When this residue was mutated to F, one of the anchor points for the compound was lost, weakening the potential interaction in the region of the upper anchor point.

      It is not clear how the Y256F mutation will affect the binding of AGR1.137, but other potential contacts cannot be ruled out since that portion of the compound is identical in both AGR1.135 and AGR1.137. This is especially true for its neighboring residues in the alpha helix, F249, L208, as shown in 3ODU structure (Fig. 6D), which are shown to be directly implicated in the interaction of both compounds. Alternatively, we cannot discard that Y256 interacts with other TMs or lipids stabilizing the overall structure, which could reverse the effect of the mutant at a later stage (Author response image 3).

      Author response image 3.

      Cartoon representation of Y256 and its intramolecular interactions in the X-ray-solved CXCR4 structure 3ODU. The TMV helix is colored in blue and TMVI in pink.

      (34) Address the apparent discrepancy in residue involvement between AGR1.135 and AGR1.137, particularly if they share the same binding mode in the same cleft.

      AGR1.135 and AGR1.137 exhibit comparable yet distinct binding modes, engaging with CXCR4 within a molecular cavity formed by TMV and TMVI. AGR1.135 binds to CXCR4 through three hydrogen bonds, two on the apical side of the compound that interact with residues TMV-G207 and TMVI-Y256 and one on the basal side that interacts with TMVI-R235 (Fig. 5A). This results in a more extended and rigid conformation when sharing hydrogen bonds, with both TMs occupying a surface area of 400 Å² and a length of 20 Å in the cleft between TMV and TMVI (Supplementary Fig. 8A). AGR1.137 exhibits a distinct binding profile, interacting with a more internal region of the receptor. This interaction involves the formation of a hydrogen bond with TMIII-V124, which induces a conformational shift in the TMVI helix towards an active conformation (Fig. 5B; Supplementary Fig. 13). Moreover, AGR1.137 may utilize the carboxyl group of V124 in TMIII and overlap with AGR1.135 binding in the cavity, interacting with the other 19 residues dispersed between TMV and TMVI to create an interaction surface of 370 Å² along 20 Å (Supplementary Fig. 8B). This is illustrated in the new Fig. 5B. AGR1.137 lacks the phenyl ring present in AGR1.135, resulting in a shorter compound with greater difficulty in reaching the lower part of TMVI where R235 sits.

      Author response image 4.

      AGR1.135 and AGR1.137 interaction with TMV and TMVI.  The model shows the location of the compounds within the TMV-VI cleft, illustrated by a ribbon and stick representation. The CXCR4 segments of TMV and TMVI are represented in blue and pink ribbons respectively, and side chains for some of the residues defining the cavity are shown in sticks. AGR1.135 and AGR1.137 are shown in stick representation with carbon in yellow, nitrogen in blue, oxygen in red, and fluorine in green. Hydrogen bonds are indicated by dashed black lines, while hydrophobic interactions are shown in green. The figure reproduces the panels A, B of Fig. 5 in the revised manuscript.

      (35) In the Results section regarding "AGR1.137 treatment in a zebrafish xenograft model", the following points can be refined for clarity and completeness: 1. Cell Line Choice for Zebrafish Xenograft Model:

      -Explain the rationale behind the choice of HeLa cells for the zebrafish xenograft model when the previous experiments primarily focused on Jurkat cells. Address any specific biological or experimental considerations that influenced this decision.

      As far as we know, there are no available models of tumors in zebrafish using Jurkat cells. We looked for a tumoral cell system that expresses CXCR4 and could be transplanted into zebrafish. HeLa cells are derived from a human cervical tumor, express a functional CXCR4, and have been previously used for tumorigenesis analyses in zebrafish (Brown H.K. et al. Expert Opin. Drug Discover. 2017; You Y. et al. Front. Pharmacol. 2020). These cells grow in the fish and disseminate through the ventral area and can be used to determine primary tumor growth and metastasis. Nonetheless, we first analyzed in vitro the expression of a functional CXCR4 in these cells (Supplementary Fig. 10A), whether AGR1.137 treatment specifically abrogated CXCL12-mediated directed cell migration (Fig. 7A, B), and whether it affected cell proliferation (Supplementary Fig. 10B). As HeLa cells reproduce the in vitro effects detected for the compounds in Jurkat cells, we used this model in zebrafish. These issues were already discussed in the first version of our manuscript.

      (36) 2. Toxicity Assessment in Zebrafish Embryos: 

      -Clarify the basis for stating that AGR1.137 is not toxic to zebrafish embryos. Consider referencing the Zebrafish Embryo Acute Toxicity Test (ZFET) and provide relevant data on lethal concentration (LC50) and non-lethal toxic phenotypes such as pericardial edema, head and tail necrosis, malformation, brain hemorrhage, or yolk sac edema.

      Tumor growth and metastasis kinetics within the zebrafish model have been extensively evaluated in many publications (White R. et al. Nat. Rev. Cancer. 2013; Astell K.R. and Sieger D. Cold Spring Harb. Perspect. Med. 2020; Chen X. et al. Front. Cell Dev. Biol. 2021; Weiss J.M. et al. eLife 2022; Lindhal G. et al. NPJ Precis. Oncol. 2024). Our previous experience using this model shows that tumors start having a more pronounced proliferation and lower degree of apoptosis from day 4 onwards, but we cannot keep the tumor-bearing larvae for that long due to ethical reasons and also because we don’t see much scientific benefit in unnecessarily extending the experiments. Anti-proliferative or pro-apoptotic effects of drugs can still be observed within the three days, even if this is then commonly seen as a larger reduction (instead of the smaller growth commonly seen in, for example, mouse tumor models) compared to controls. Before testing the compounds, we initially characterized the evolution of implanted tumors in our system and how much they metastasize over time in the absence of treatment (Author response image 5).

      The in vivo experiments were planned to validate efficacious concentrations of the investigated drugs rather than to derive in vivo IC50 or other values, which require testing of multiple doses. We have, however, included an additional concentration to show concentration-dependence and therefore on-target specificity of the drugs in the revised version of the manuscript (data also being elaborated in ongoing experiments). At this stage, we believe that adding the LC50 does not provide interesting new knowledge, and it is standard to only show results from the experimental endpoint (in our case 3 days post implantation). We agree that showing these new data points strengthens the manuscript and facilitates independent evaluation and conclusions to be drawn from the presented data. We have created new graphs where datapoints for each compound dose are shown.  

      Author response image 5.

      Evolution of the tumors and metastasis over time in the absence of any treatment. HeLa cells were labeled with 8 µg/mL Fast-DiI™ oil and then implanted in the dorsal perivitelline space of 2-day-old zebrafish embryos. Tumors were imaged within 2 hours of implantation and re-imaged every 24 h for three days. Changes in tumor size were evaluated as tumor area at day 1, 2 and 3 divided by tumor area at day 0, and metastasis was evaluated as the number of cells disseminated to the caudal hematopoietic plexus at day 1, 2 and 3 divided by the number of cells at day 3.
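      The two normalizations described in this legend can be written out as a short Python sketch. The function names and the numerical values are hypothetical and serve only to make the arithmetic explicit; they are not measurements from the study.

```python
def relative_to(values, reference):
    """Divide each value by a common reference value."""
    if reference == 0:
        raise ValueError("reference value must be non-zero")
    return [v / reference for v in values]


def tumor_growth(areas_by_day):
    """areas_by_day holds tumor areas at day 0, 1, 2 and 3.
    Returns the area at day 1, 2 and 3 divided by the area at day 0."""
    return relative_to(areas_by_day[1:], areas_by_day[0])


def metastasis_index(cells_by_day):
    """cells_by_day holds disseminated-cell counts at day 1, 2 and 3.
    Returns each count divided by the count at day 3, as in the legend."""
    return relative_to(cells_by_day, cells_by_day[-1])


# Hypothetical example values (arbitrary area units and cell counts)
print(tumor_growth([100.0, 110.0, 125.0, 150.0]))
print(metastasis_index([2, 5, 10]))
```

      With these definitions, tumor growth is expressed as fold change over the implantation day, whereas metastasis is expressed as a fraction of the final (day 3) count.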

      Regarding the statement that AGR1.137 was not toxic, this was based on visual inspection of the zebrafish larvae at the end of the experiment, which also revealed a lack of drug-related mortality in these experiments. There are a number of differences in how our experiment was run compared with the standardized ZFET. ZFET evaluates toxicity from 0 hours post-fertilization to 1 or 2 days post-fertilization, whereas here we exposed zebrafish from 2 days post-fertilization to 5 days post-fertilization. The ZFET furthermore requires that the embryos are raised at 26ºC, whereas we kept the temperature as close as possible to a physiologically relevant temperature for the tumor cells (36ºC). In the ZFET, embryos are incubated in 96-well plates, whereas for our studies we required larger wells to be able to manipulate the larvae and avoid well edge-related imaging artefacts, and we therefore used 24-well plates. As such, the ZFET was for various reasons not applicable to our experimental settings. As we were not interested in rigorously determining the LD50 or other toxicity-related measurements, as our focus was instead on efficacy and we found that the targeted dose was tolerated, we did not evaluate multiple doses, including lethal doses of the drug, and are therefore not able to determine an LD50/LC50. We also did not find drug-induced non-lethal toxic phenotypes in this study, and so we cannot elaborate further on such phenotypes other than to simply state that the drug is well tolerated at the given doses. Therefore, the reference to ZFET in the manuscript was eliminated.

      (37) If supplementary information is available, consider providing it for a comprehensive understanding of toxicity assessments. 

      The effective concentration used in the zebrafish study was derived from the in vitro experiments. That being said, and as elaborated in our response to comment 36, we have added data for one additional dose to show the dose-dependent regulation of tumor growth and metastasis. 

      (38) 3. Optimization and Development of AGR1.137: 

      -Justify the need for further optimization and development of AGR1.137 if it has a comparable effect to AMD3100. Explain the specific advantages or improvements that AGR1.137 may offer over AMD3100. 

      AGR1.137 is highly hydrophobic and is very difficult to handle, particularly in in vivo assays; thus, for the negative allosteric modulators to be used clinically, it would be very important to increase their solubility in water. Contrastingly, AMD3100 is a water-soluble compound. Before using the zebrafish model, we performed several experiments in mice using AGR1.137, but the inhibitory results were highly variable, probably due to its hydrophobicity. We also believe that it would be important to increase the affinity of AGR1.137 for CXCR4, as the use of lower concentrations of the negative allosteric modulator would limit potential in vivo side effects of the drug. On the other hand, we are also evaluating distinct administration alternatives, including encapsulation of the compounds in different vehicles. These alternatives may also require modifications of the compounds. 

      AMD3100 is an orthosteric inhibitor and therefore blocks all the signaling cascades triggered by CXCL12. For instance, we observed that AMD3100 treatment blocked CXCL12 binding, cAMP inhibition, calcium flux, cell adhesion and cell migration (Fig. 3, Fig. 4), whereas the effects of AGR1.137 were restricted to CXCL12-mediated directed cell migration. Although AMD3100 was well tolerated by healthy volunteers in a single-dose study, it also promoted some mild and reversible events, including white blood cell count elevations and variations of urine calcium just beyond the reported normal range (Hendrix C.W. et al. Antimicrob. Agents Chemother. 2000). To treat viral infections, continuous daily dosing requirements of AMD3100 were impractical due to severe side effects including cardiac arrhythmias (De Clercq E. Front Immunol. 2015). For AMD3100 to be used clinically, it would be critical to control the timing of administration. In addition, side effects after long-term administration are a potential problem. Shorter-term usage and lower doses would be fundamental keys to its success in clinical use (Liu T.Y. et al. Exp. Hematol. Oncol. 2016). The use of a negative allosteric modulator that blocks cell migration but does not affect other signaling pathways triggered by CXCL12 would be, at least in theory, more specific and produce fewer side effects. These ideas have been incorporated into the revised discussion to reflect potential advantages or improvements that AGR1.137 may offer over AMD3100.

      (39) 4. Discrepancy in AGR1.137 and AMD3100 Effects:

      -Discuss the observed discrepancy where AGR1.137 exhibits similar effects to AMD3100 but only after 48 hours. Provide insights into the temporal dynamics of their actions and potential implications for the experimental design.

      Images and data shown in Fig. 7E, F correspond to days 0 and 3 after HeLa cell implantation (tumorigenesis) and only to day 3 in the case of metastasis data. The revised version contains the effect of two distinct doses of the compounds (10 and 50 µM, for AGR1.135 and AGR1.137 and 1 and 10 µM for AMD3100). 

      (40) In the "Discussion" section, there are several points that require clarification and refinement to enhance the overall coherence and depth of the analysis: 1. Reduction of Side-Effects:

      -Provide a more detailed explanation of how the identified compounds, specifically AGR1.135 and AGR1.137, contribute to the reduction of side effects. Consider discussing specific mechanisms or characteristics that differentiate these compounds from existing antagonists.

      The sentence indicating that AGR1.135 and AGR1.137 contribute to reducing side effects is entirely speculative, as we have no experimental evidence to support it. We have therefore corrected this in the revised version. The origin of the sentence was that orthosteric antagonists typically bind to the same site as the endogenous ligand, thus blocking its interaction with the receptor. Therefore, orthosteric inhibitors (i.e. AMD3100) block all signaling cascades triggered by the ligand and therefore their functional consequences. However, the compounds described in this project are essentially negative allosteric modulators, that is, they bind to a site distinct from the orthosteric site, inducing a conformational change in the receptor that does not alter the binding of the endogenous ligand, and therefore block some specific receptor-associated functions without altering others. We observed that AGR1.137 blocked receptor oligomerization and directed cell migration, whereas CXCL12 still bound CXCR4 and triggered calcium mobilization, and the compound did not alter CXCL12-mediated inhibition of cAMP release or receptor internalization. This is why we speculated on the limitation of side effects. The statements have been nonetheless revised in the new version of the manuscript.

      (41) 2. Binding Site Clarification:

      -Address the apparent discrepancy between docking the small compounds in a narrow cleft formed by TMV and TMVI helices and the statement that AGR1.131 binds elsewhere. Clarify the rationale behind this assertion.

      After the in silico screening, a total of 40 compounds were selected. These compounds showed distinct degrees of interaction with the cleft formed by TMV and TMVI and even with other potential interaction sites on CXCR4, with the exception of the ligand binding site according to the data described by Wescott et al. (PNAS 2016 113:9928-9933), as this possibility was discarded in the initial approach of the in silico screening. According to PELE analysis, AGR1.131 was one of the 40 selected compounds that showed a pose with low binding energy, -39.8 kcal/mol, between TMV and TMVI helices, that is, it might interact with CXCR4 through the selected area for the screening. It nonetheless also showed a best pose placed between helices TMI and TMVII, -43.7 kcal/mol. In any case, the compound was included in the biological screening, where it was unable to impact CXCL12-mediated chemotaxis (Fig. 1B). We then focused on AGR1.135 and AGR1.137, as they showed a higher inhibitory effect on CXCL12-mediated migration, and on AGR1.131 as an internal negative control. AGR1.131 has a skeleton very similar to the other compounds (Fig. 1C) and can interact with the TM domains of CXCR4 without promoting effects. None of the three compounds affected CXCL12 binding, CXCL12-mediated inhibition of cAMP release, or receptor internalization. However, whereas AGR1.135 and AGR1.137 blocked CXCL12-mediated CXCR4 oligomerization and directed cell migration towards CXCL12 gradients, AGR1.131 had no effect in these experiments (Fig. 3, Fig. 4).

      Next, we performed additional theoretical calculations (PELE, docking, MD) to inspect in detail the potential binding modes of active and inactive molecules. Based on these additional calculations, we identified that whereas AGR1.135 and AGR1.137 showed preferential binding to the molecular pocket between TMV and TMVI, the best pose for AGR1.131 was located between TMI and TMVII, as the initial experiments indicated. These observations and data have been clarified in the revised discussion.
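      The comparison of candidate sites for AGR1.131 reduces to selecting the pose with the lowest (most negative) energy. The Python sketch below illustrates this using only the two PELE energies quoted in this response; the function name `best_site` and the dictionary layout are illustrative assumptions and do not reflect PELE's actual output format.

```python
def best_site(pose_energies):
    """Return (site, energy) for the pose with the lowest energy in kcal/mol."""
    if not pose_energies:
        raise ValueError("at least one pose is required")
    site = min(pose_energies, key=pose_energies.get)
    return site, pose_energies[site]


# PELE energies for AGR1.131 as quoted in the response (kcal/mol)
agr1_131 = {"TMV-TMVI": -39.8, "TMI-TMVII": -43.7}
print(best_site(agr1_131))  # ('TMI-TMVII', -43.7)
```

      A ranking of this kind only orders the sampled poses; it does not by itself establish binding, which is why the mutational experiments remain necessary.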

      (42) 3. Impact of Chemical Modifications:

      -Discuss the consequences of the distinct chemical groups in AGR1.135, AGR1.137, and AGR1.131, specifically addressing how variations in amine length and chemical nature may influence binding affinity and biological activity. Provide insights into the potential effects of these modifications on cellular responses and the observed outcomes in zebrafish. 

      The main difference between AGR1.131 and the other two compounds is the higher flexibility of AGR1.131 due to the additional CH2 linker, together with the lack of a piperazine ring. The additional CH2 linking the phenyl ring increases the flexibility of AGR1.131 when compared with AGR1.135 and AGR1.137, and the absence of the piperazine ring might be responsible for its lack of activity, as it makes this compound able to bind to CXCR4 (Fig. 1C).

      AGR1.137 was chosen in a second round. The additional presence of the tertiary amine (in the piperazine ring) allows the formation of quaternary ammonium salts in the aqueous medium and its substituents to increase its solubility (Fig 1C). This characteristic might be related to the absence of toxic effects of the compound in the zebrafish model.

      (43) 4. Existence of Distinct CXCR4 Conformational States: 

      -Provide more detailed support for the statement suggesting the "existence of distinct CXCR4 conformational states" responsible for activating different signaling pathways. Consider referencing relevant studies or experiments that support this claim.

      Classical models of GPCR allostery and activation, which describe an equilibrium between a single inactive and a single signaling-competent active conformation, cannot account for the complex pharmacology of these receptors. The emerging view is that GPCRs are highly dynamic proteins, and ligands with varying pharmacological properties differentially modulate the balance between multiple conformations.

      Just as a single photograph from one angle cannot capture all aspects of an object in movement, no one biophysical method can visualize all aspects of GPCR activation. In general, there is a tradeoff between high-resolution information on the entire protein versus dynamic information on limited regions. In the former category, crystal and cryo-electron microscopy (cryoEM) structures have provided comprehensive, atomic-resolution snapshots of scores of GPCRs both in inactive and active conformations, revealing conserved conformational changes associated with activation. However, different GPCRs vary considerably in the magnitude and nature of the conformational changes in the orthosteric ligand-binding site following agonist binding (Venkatakrishnan A.J.V. et al. Nature 2016). Spectroscopic and computational approaches provide complementary information, highlighting the role of conformational dynamics in GPCR activation (Latorraca N.R.V. et al. Chem. Rev. 2017). In the absence of agonists, the receptor population is typically dominated by conformations closely related to those observed in inactive-state crystal structures (Manglik A. et al. Cell 2015). While agonist binding drives the receptor population towards conformations similar to those in active-state structures, a mixture of inactive and active conformations remains, reflecting “loose” or incomplete allosteric coupling between the orthosteric and transducer pockets (Dror R.O. et al. Proc. Natl. Acad. Sci. USA 2011). Surprisingly, for some GPCRs, and under some experimental conditions, a substantial fraction of unliganded receptors already reside in an active-like conformation, which may be related to their level of basal or constitutive signaling (Staus D.P. et al. J. Biol. Chem. 2019; Ye L. et al. Nature 2016). In our case, the negative allosteric modulators did not alter ligand binding and had only minor effects on specific CXCL12-mediated functions such as inhibition of cAMP release or receptor internalization, among others, but failed to regulate CXCL12-mediated actin dynamics and receptor oligomerization. Collectively, these data suggest that the described compounds alter the active conformation of CXCR4 and therefore support the presence of distinct receptor conformations that explain a partial activation of the signaling cascade.

      All these observations are now included in the revised discussion of the manuscript.

      (44) 5. Equilibrium Shift and Allosteric Ligands: 

      -Clarify the statement about "allosteric ligands shifting the equilibrium to favor a particular receptor conformation". Support this suggestion with references or experimental evidence.

      In a previous answer (see our response to point 2), we explain why we define the compounds as negative allosteric modulators. These compounds do not bind the orthosteric binding site or a site distinct from the orthosteric site that alters the ligand-binding site. Their effect should be due to changes in the active conformation of CXCR4, which allow some signaling events whereas others are blocked. Our functional data thus support that through the same receptor the compounds separate distinct receptor-mediated signaling cascades, that is, our data suggest that CXCR4 has conformational heterogeneity. It is known that GPCRs exhibit more than one “inactive” and “active” conformation, and the endogenous agonists stabilize a mixture of multiple conformations. Biased ligands or allosteric modulators can achieve their distinctive signaling profiles by modulating this distribution of receptor conformations (Wingler L.M. & Lefkowitz R.J. Trends Cell Biol. 2020). For instance, some analogs of angiotensin II do not appreciably activate Gq signaling (e.g., increases in IP3 and Ca2+) but still induce receptor phosphorylation, internalization, and mitogen-activated protein kinase (MAPK) signaling (Wei H. et al. Proc. Natl. Acad. Sci. USA 2003). Some of these ligands activate Gi and G12 in bioluminescence resonance energy transfer (BRET) experiments (Namkung Y. et al. Sci. Signal. 2018). A similar observation was described in the case of CCR5, where some chemokine analogs promoted G protein subtype-specific signaling bias (Lorenzen E. et al. Sci. Signal. 2018). Structural analyses of distinct GPCRs in the presence of different ligands show that the receptors vary considerably in the magnitude and nature of the conformational changes in the orthosteric ligand-binding site following agonist binding (Venkatakrishnan A.J.V. et al. Nature 2016). Yet, these changes modify conserved motifs in the interior of the receptor core and induce common conformational changes in the intracellular site involved in signal transduction. That is, these modifications might be considered distinct receptor conformations.

      The revised discussion contains some of these interpretations to support our statement about the stabilization of a particular receptor conformation triggered by the negative allosteric modulators. 

      (45) 6. Refinement of Binding Mode: 

      -Clarify the workflow for obtaining the binding mode, particularly the role of GLIDE and PELE. Clearly explain how these software tools were used in tandem to refine the binding mode. 

      The sequential computational workflow applied in this project included: i) protein model construction, ii) virtual screening (Glide), iii) PELE, iv) docking (AutoDock and Glide), and v) molecular dynamics (AMBER).

      Glide was applied for the structure-based virtual screening to explore which compounds could fit and interact with the previously selected binding site.

      After the identification of theoretically active compounds (modulators of CXCR4), additional calculations were done to identify a potential binding site. PELE was used for this purpose, to study how the compounds could bind across the whole surface of the target (TMV-TMVI). By applying PELE, we avoided biasing the calculation, and we found that the trajectories with better interaction energies identified the cleft between TMV and TMVI as the binding site for AGR1.135 and AGR1.137, but not for AGR1.131. AGR1.131 showed a pose with low binding energy, -39.8 kcal/mol, between the TMV and TMVI helices; that is, it might interact with CXCR4 in the area selected for the screening. But it also showed a better pose placed between helices TMI and TMVII, -43.7 kcal/mol (see our response to point 41). These data have now been confirmed using Schrodinger’s MM-GBSA procedure (see our response to points 6 and 8). In any case, the compound was included in the biological screening, where it was unable to affect CXCL12-mediated chemotaxis (Fig. 1B). Docking and MD simulations were then performed to study and refine the specific binding mode in this cavity. These data were important for choosing the CXCR4 mutations needed to test whether the behavior of the compounds was reversed. In these experiments we also confirmed that AGR1.131 had a better pose in the TMI-TMVII region.

      (46) 7. Impact of Compound Differences on CXCR4-F249L mutant: 

      -Provide visual aids, such as figures, and additional experiments to support the statement about differences in the behavior of AGR1.135 and AGR1.137 on cells expressing CXCR4-F249L mutant. Elaborate on the closer interaction suggested between the triazole group of AGR1.137 and the F249 residue

      At the reviewer’s suggestion, Fig. 5 has been modified to incorporate a closer view of the interactions identified and new panels in new Fig. 6 have been added to show in detail the effect of the mutations selected on the structure of the cleft between TMV and TMVI. The main difference between AGR1.135 and AGR1.137 is how the triazole group interacts with F249 and L216 (Author response image 6). In AGR1.137, the three groups are aligned in a parallel organization, which appears to be more effective. This might be due to a better adaptation of this compound to the cleft since there is only one hydrogen bond with V124. In AGR1.135, the compound interacts with the phenyl ring of F249 and has a stronger interaction at the apical edge to stabilize its position in the cleft. However, there is still an additional interaction present. When changing F249 to L (Fig. VIIA, B, only for review purposes) and showing the two most likely rotamers resulting from the mutation, it is observed that rotamer B is in close proximity to the compound, which may cause the binding to either displace or adopt an alternative conformation that is easier to bind into the cleft. As previously mentioned, it is likely that AGR1.135 can displace the mutant rotamer and bind into the cleft more easily due to its higher affinity.

      Author response image 6.

      Cartoon representation of the interaction of CXCR4 F249L mutant with AGR1.135 (A) and AGR1.137 (B). The two most probable conformations of Leucine rotamers are represented in cyan A and B conformations. Van der Waals interactions are depicted in blue cyan dashed lines, hydrogen bonds in black dashed lines. CXCR4 segments of TMV and TMVI are colored in blue and pink, respectively.

      (47) In the "Materials and Methods" section, the computational approach for the "discovery of CXCR4 modulators" requires significant revision and clarification. The following suggestions aim to address the identified issues: 1. Structural Modeling: 

      -Reconsider the use of SWISS-MODEL if there is an available PDB code for the entire CXCR4 structure. Clearly articulate the rationale for choosing one method over the other and explain any limitations associated with the selected approach. 

      The SWISS-MODEL server allows automated comparative modeling of 3D protein structures and pioneered the field of automated modeling. At the time we started this project, it was the most accurate method to generate reliable 3D protein structure models.

      As explained above, we have now predicted the structure of the target using AlphaFold (Jumper J. et al, Nature 2021) and performed several additional experiments that confirm that the small compounds bind the selected pocket as the original strategy indicated (see our response to point 6). (Fig. II, only for review purposes).

      (48) 2. Parametrization of Small Compounds: 

      -Provide a detailed description of the parametrization process for the small compounds used in the study. Specify the force field and parameters employed, considering the obsolescence of AMBER14 and ff14SB. Consider adopting more contemporary force fields and parameterization strategies. 

      When we performed these experiments, some years ago, the force fields applied (ff14SB, AMBER14 used in MD or OPLS2004 in docking with Glide) were well accepted and were gold standards. It is, however, true that the force fields have evolved in the past few years. Moreover, in the case of the MD simulations, to consider the parameters of the ligands that are not contained within the force field, we performed an additional parameterization as a standard methodology. We then performed an ab initio optimization of the ligand geometry at the B3LYP/6-311+G(d) level using Gaussian 09, Revision A.02, followed by a single-point energy calculation of ESP charges at the HF/6-311+G(d) level on the optimized structure. As the last step of the parametrization, the antechamber module was used to adapt these charges and additional parameters for MD simulations.

      (49) 3. Treatment of Lipids and Membrane: 

      -Elaborate on how lipids were treated in the system. Clearly describe whether a membrane was included in the simulations and provide details on its composition and structure. Address the role of the membrane in the study and its relevance to the interactions between CXCR4 and small compounds 

      To stabilize CXCR4 and more accurately reproduce the real environment in the MD simulation, the system was embedded in a lipid bilayer using the Membrane Builder tool (Sunhwan J. et al. Biophys. J. 2009) from the CHARMM-GUI server. The membrane was composed of 175 molecules of the phospholipid 1-palmitoyl-2-oleoyl-sn-glycero-3-phosphocholine (POPC) in each leaflet. The protein-membrane complex was solvated with TIP3 water molecules. Chloride ions were added up to a concentration of 0.15 M in water, and sodium ions were added to neutralize the system. This information was previously described in detail.

      (50) 4. Molecular Dynamics Protocol: 

      -Provide a more detailed and coherent explanation of the molecular dynamics protocol. Clarify the specific steps, parameters, and conditions used in the simulations. Ensure that the protocol aligns with established best practices in the field.

      Simulations were run on an Asus 1151 h170 LVX-GTX-980Ti workstation, with an Intel Core i7-6500 K Processor (12 M Cache, 3.40 GHz) and 16 GB DDR4 2133 MHz RAM, equipped with an Nvidia GeForce GTX 980Ti available for GPU (Graphics Processing Unit) computations. MD simulations were performed using AMBER14 (Case D.A. et al. AMBER 14, Univ. of California, San Francisco, USA, 2014) with ff14SB (Maier J.A. et al. J. Chem. Theory Comput. 2015) and lipid14 (Dickson C. J. et al. J. Chem. Theory Comput. 2014) force fields in the NPT thermodynamic ensemble (constant pressure and temperature). Minimization was performed using 3500 Steepest Descent steps and 4500 Conjugate Gradient steps three times, firstly considering only hydrogens, next considering only water molecules and ions, and finally minimizing all atoms. During equilibration, the system temperature was raised from 0 to 300 K at constant volume, with everything fixed except ions and water molecules. After thermalization, several density equilibration phases were performed. In the production phase, 50 ns MD simulations without position restraints were calculated using a time step of 2 fs. Trajectories of the most interesting poses were extended to 150 ns. All bonds involving hydrogen atoms were constrained with the SHAKE algorithm (Lippert R.A. et al. J. Chem. Phys. 2007). A cutoff of 8 Å was used for the Lennard-Jones interaction and the short-range electrostatic interactions. Berendsen barostat (Berendsen H.J. et al. J. Chem. Phys. 1984) and Langevin thermostat were used to regulate the system pressure and temperature, respectively. All trajectories were processed using CPPTRAJ (Roe D.R. & Cheatham III T.E. J. Chem. Theory Comput. 2013) and visualized with VMD (Visual Molecular Dynamics) (Humphrey W. et al. J. Mol. Graphics. 1996). To reduce the complexity of the data, Principal Component Analysis (PCA) was performed on the trajectories using CPPTRAJ.
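      As a quick consistency check on the protocol above, the simulation lengths and time step can be converted into integrator step counts. The following is only an illustrative sketch: the helper name n_steps is hypothetical, and it assumes that the three minimization stages each ran the full 3500 + 4500 steps.

```python
def n_steps(sim_time_ns, time_step_fs):
    """Number of integration steps needed to cover sim_time_ns at time_step_fs."""
    fs_per_ns = 1_000_000  # 1 ns = 10^6 fs
    return round(sim_time_ns * fs_per_ns / time_step_fs)

# Production runs described in the protocol (2 fs time step)
production_steps = n_steps(50, 2)    # 50 ns production phase
extended_steps = n_steps(150, 2)     # trajectories extended to 150 ns

# Minimization: (3500 Steepest Descent + 4500 Conjugate Gradient) x 3 stages,
# assuming every stage ran to the full step count
minimization_steps = 3 * (3500 + 4500)

print(production_steps, extended_steps, minimization_steps)
# prints: 25000000 75000000 24000
```

      Under these assumptions, the 50 ns production phase corresponds to 25 million steps and the extended trajectories to 75 million steps.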

      (51) Consider updating the molecular dynamics protocol to incorporate more contemporary methodologies, considering advancements in simulation techniques and software.

      In our answers to points 6 and 47, we describe why we used the technology based on SWISS-MODEL and PELE analysis and how we have now used AlphaFold and other more contemporary methodologies to confirm that the small compounds bind the selected pocket.

      (52) Figure 1A: 

      •  Consider switching to a cavity representation for CXCL12 to enhance clarity and emphasize the cleft.

      Fig. 1A has been modified to emphasize the cleft.

      (53) Explicitly show the TMV-TMVI cleft in the figure for a more comprehensive visualization. 

      In Fig. 1A we have added an insert to facilitate TMV-TMVI visualization.

      (54) Figure 1B: 

      •  Clearly explain the meaning of the second DMSO barplot to avoid confusion. 

      To clarify this panel, we have modified the figure and the figure legend. Panel B now includes a complete titration of the three compounds analyzed in the manuscript.  The first bar shows cell migration in the absence of both treatment with AMD3100 and stimulation with CXCL12.  The second bar shows migration in response to CXCL12 in the absence of AMD3100. The third bar shows the effect of AMD3100 on CXCL12-induced migration, as a known control of inhibition of migration.  We hope that this new representation of the data results is clearer.

      (55) Figure 1C: 

      •  Provide a clear legend explaining the significance of the green shading on the small compounds. 

      The legend for Fig. 1C has been modified accordingly to the reviewer’s suggestion.

      (56) Figure 2: 

      •  Elaborate on the role of fibronectin in the experiment and explain the specific contribution of CD86-AcGFP.

      The ideal situation for TIRF-M determinations is to employ cells on a physiological substrate complemented with or without chemokines. Fibronectin is a substrate widely used in different studies that allows cell adhesion, mimicking a physiological situation. Jurkat cells express alpha4beta1 and alpha5beta1 integrins that mediate adhesion to fibronectin (Seminario M.C. et al. J. Leuk. Biol. 1999).

      Regarding the use of CD86-AcGFP in TIRF-M experiments, we currently determine the number of receptors in individual trajectories of CXCR4 using, as a reference, the MSI value of CD86-AcGFP, which strictly showed a single photobleaching step (Dorsch S. et al. Nat Methods 2009).

      We preferred to use CD86-AcGFP in cells instead of AcGFP on glass, to exclude any potential effect on the different photodynamics exhibited by AcGFP when bound directly to glass. In any case, this issue has been clarified in the revised version.

      (57) Figure 3D: 

      •  Include a plot for the respective band intensity to enhance data presentation 

      The plot showing the band intensity analysis of the experiments shown in Fig. 3D was already included in the original version (see old Supplementary Fig. 3). However, in the revised version, we include these plots in the same figure as panels 3E and 3F.  As a control of inhibition of CXCL12 stimulation, we have also included a new figure (Supplementary Fig. 4) showing the effect of AMD3100 on CXCL12-induced activation of Akt and ERK as analyzed by western blot.

      (58) Consider adding AMD3100 as a control for comparison. 

      In agreement with the reviewer’s suggestion, we have added the effect of AMD3100 in most of the functional experiments performed.

      (59) Figure 4: 

      •  Address the lack of positive controls in Figure 4 and consider their inclusion for a more comprehensive analysis. 

      DMSO bars correspond to the control of the experiment, as they represent the effect of CXCL12 in the absence of any allosteric modulator. As previously described in this point-by-point reply, DMSO bars correspond to the control performed with the solvent with which the small compounds, at maximum concentration, are diluted.  Therefore, they show the effect of the solvent on CXCL12 responses. In any case, and in order to facilitate the comprehension of the figure we have also added the controls in the absence of DMSO to demonstrate that the solvent does not affect CXCL12-mediated functions, together with the effect of the orthosteric inhibitor AMD3100. In addition, we have also included representative images of the effect of the different compounds on CXCL12-induced polarization (Fig. 4C).

      (60) In Figure 4A, carefully assess overlapping error bars and ensure accurate interpretation. If necessary, consider alternative representation. 

      We have tried alternative representations of the data in Fig. 4A, but in all cases the figure was unclear. We believe that the way we represent the data in the original manuscript is the clearest and most appropriate. Nevertheless, we have now included significance values as a table annexed to the figure, as well as the effect of AMD3100, as a control of inhibition.

      (61) Supplementary Figure 1A: 

      •  Improve the clarity of bar plots for better understanding. Consider reordering them from the most significant to the least. 

      This was a good idea, and therefore Supplementary Fig. 1A has been reorganized to improve clarity.

      (62) Supplementary Figure 1C: 

      •  Clarify the rationale behind choosing the 12.5 nM concentration and explain if different concentrations of CXCL12 were tested. 

      In old Supplementary Fig. 1C, we used untreated cells, that is, CXCL12 was not present in the assay.  These experiments were performed to test the potential toxicity of DMSO (solvent) or the negative allosteric modulators on Jurkat cells. The 12.5 nM concentration of CXCL12 mentioned in the figure legend applied only to panels A and B, as indicated in the figure legend. We previously optimized this concentration for Jurkat cells using different concentrations of CXCL12 between 5 and 100 nM.  Nevertheless, we have reorganized old supplementary fig. 1 and clarified the figure legend to avoid misinterpretations (see Supplementary Fig 1A, B and Supplementary Fig. 2A, B).

      (63) Explain the observed reduction in fluorescence intensity for AGR1.135. 

      The cell cycle analysis has been moved from Supplementary Fig. 1C to a new Supplementary Fig. 2.  It now includes the flow cytometry panels to show fluorescence intensity as a function of the number of cells analyzed (Panel 1A) as well as a table (panel B) with the percentage of cells in each phase of the cell cycle. We believe that the apparent reduction in fluorescence that the reviewer observes is mainly due to the number of events analyzed. However, we have changed the flow cytometry panels for others that are more representative and included a table with the mean of the different results. When we determined the percentage of cells in each cell cycle phase, we observed that it looks very similar in all the experimental conditions. That is, none of the compounds affected any of the cell cycle phases. We have also included the effect of H2O2 and staurosporine as control compounds inducing cell death and cell cycle alteration of Jurkat cells.

      (64) Supplementary Table 1: 

      •  Include a column specifying the scoring for each compound to provide a clear reference for readers. 

      To facilitate references to readers, we have now included the inhibitory effect of each compound on Jurkat cell migration in the revised version of this table. 

      (65) Minor Points 

      Page 2 - Abstract: Rephrase the first sentence of the abstract to enhance fluidity. 

      Although the entire manuscript was revised by a professional English editor, we appreciate the valuable comments of this reviewer and we have corrected these issues accordingly.

      (66) Page 2 - Abstract: Explicitly define "CXCR4" as "C-X-C chemokine receptor type 4" the first time it appears.

      We have not used C-X-C chemokine receptor type 4 the first time it appears in the abstract. CXCR4 is an acronym normally accepted to identify this chemokine receptor, and it is used as CXCR4 in many articles published in eLife. However, we introduce the complete name the first time it appears in the introduction.

      (67) Page 2 - Abstract: Explicitly define "CXCL12" as "C-X-C motif chemokine 12" the first time it is mentioned. 

      As we have discussed in the previous response, we have not used C-X-C motif chemokine 12 the first time CXCL12 appears in the abstract, as it is a general acronym normally accepted to identify this specific chemokine, even in eLife papers. However, we introduce the complete name the first time it appears in the introduction section.

      (68) Page 2 - Abstract: Explicitly define "TMV and TMVI" upon its first mention.

      The acronym TM has been defined as “Transmembrane” in the revised version.

      (69) Page 2 - Abstract: Review the use of "in silico" in the sentence for accuracy and consider revising if necessary.

      With the term “in silico” we want to refer to those experiments performed on a computer or via computer simulation software. We have carefully reviewed its use in the new version of the manuscript.

      (70) Page 2 - Abstract: Add a comma after "compound" in the sentence, "We identified AGR1.137, a small compound that abolishes...".

      A comma after “compound” has been added in the revised sentence.

      (71) Page 2 - Significance Statement: Rephrase the first sentence of the "Significance Statement" to avoid duplication with the abstract.

      The first sentence of the Significance Statement has been revised to avoid duplication with the abstract. 

      (72) Page 2 - Significance Statement: Break down the lengthy sentence, "Here, we performed in silico analyses..." for better readability. 

      The sentence starting with “Here, we performed in silico analyses…” has been broken down in the revised manuscript.

      (73) Page 2 - Introduction: Replace "Murine studies" with a more specific term for clarity.

      The term “murine studies” is normally used to refer to experimental studies developed in mice. We have nonetheless rephrased the sentence.

      (74) Page 3 - Introduction: Rephrase the sentence for clarity: "Finally, using a zebrafish model, ..."

      The sentence has now been rephrased for clarity.

      (75) Results-AGR1.135 and AGR1.137 block CXCL12-mediated CXCR4 nanoclustering and dynamics: 

      Rephrase the sentence for clarity: "Retreatment with AGR1.135 and AGR1.137, but not with AGR1.131, substantially impaired CXCL12-mediated receptor nanoclustering.”

      The sentence has been rephrased for clarity.

      (76) Results - AGR1.135 and AGR1.137 incompletely abolish CXCR4-mediated responses in Jurkat cells: Clarify the sentence: "In contrast to the effect promoted by AMD3100, a binding-site antagonist of CXCR4..."

      The sentence has been modified for clarity.

      (77) Consider using "orthosteric" instead of "binding-site" antagonist.

      The term orthosteric is now used throughout to refer to a binding site antagonist.

      (78) Discussion: Use the term "in silico" only when necessary.

      We have carefully reviewed the use of “in silico” in the manuscript.

      (79) Discussion: Clarify the sentence: "...not affect neither CXCR2-mediated cell migration...". Confirm if "CXCL12" is intended.

      The sentence refers to the chemokine receptor CXCR2, which binds the chemokine CXCL2. To test the specificity of the compounds for the CXCL12/CXCR4 axis, we evaluated CXCL2-mediated cell migration.  The results indicated that CXCL2/CXCR2 axis was not affected by the negative allosteric modulators, whereas CXCL12-mediated cell migration was blocked.  The sentence has been clarified in the new version of the manuscript.

      (80) Figure 4B: Bold the "B" in the figure label for consistency.

      The “B” in Fig. 4B has been bolded.

      Reviewer #2

      (1) Fig 2. The SPT data is sub-optimal in its presentation as well as analysis. Example images should be shown. The analysis and visualization of the data should be reconsidered for improvements. Graphs with several hundreds, in some conditions over 1000 tracks, per condition are very hard to compare. The same (randomly selected representative set) number of data points should be shown for better visualization. Also, more thorough analyses like MSD or autocorrelation functions are lacking - they would allow enhanced overall representation of the data.

      In agreement with the reviewer’s commentary, we have modified the representation of Fig. 2. We have carefully read the paper published by Lord S.J. and colleagues (Lord S. J. et al., J. Cell Biol. 2020) and we apply their recommendations for this type of data. We have also included as supplementary material representative videos of the TIRF-M experiments performed, to allow readers to visualize the original images. Regarding the MSD analyses, they were performed to determine all D1-4 values. According to the data published by Manzo & García-Parajo (Manzo C. & García-Parajo M.F. Rep. Prog. Phys. 2015), due to the finite trajectory length the MSD curve at large tlag has poor statistics and deviates from linearity. However, the Diffusion Coefficient (D1-4) can be estimated by fitting the short tlag region of the MSD plot, which gives a more accurate idea of the behavior of the particles. Accordingly, we show D1-4 values and not MSD data. 
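      To illustrate the short-tlag fit described above, the following is a minimal sketch, not the analysis code used in the study. It assumes that D1-4 denotes the slope of a linear fit to the first four MSD points of a 2D trajectory (MSD = 4·D·tlag + offset); the function names, the simulated trajectory and its parameters are hypothetical.

```python
import numpy as np

def msd(track, max_lag):
    """Time-averaged MSD of one 2D trajectory for lags 1..max_lag (in frames)."""
    track = np.asarray(track, dtype=float)
    out = np.empty(max_lag)
    for lag in range(1, max_lag + 1):
        disp = track[lag:] - track[:-lag]          # displacements at this lag
        out[lag - 1] = np.mean(np.sum(disp ** 2, axis=1))
    return out

def short_lag_diffusion(track, frame_interval, n_points=4):
    """Estimate D from a linear fit MSD = 4*D*t + offset over the first n_points lags."""
    lag_times = np.arange(1, n_points + 1) * frame_interval
    slope, _offset = np.polyfit(lag_times, msd(track, n_points), 1)
    return slope / 4.0                             # 2D diffusion: slope = 4*D

# Hypothetical example: simulated free (Brownian) 2D diffusion
rng = np.random.default_rng(0)
true_d = 0.05            # um^2/s, assumed value for illustration only
frame_interval = 0.1     # s, assumed value for illustration only
steps = rng.normal(0.0, np.sqrt(2 * true_d * frame_interval), size=(20000, 2))
track = np.cumsum(steps, axis=0)
d_est = short_lag_diffusion(track, frame_interval)
print(f"estimated D = {d_est:.4f} um^2/s")
```

      Restricting the fit to the first few lags avoids the poorly sampled long-tlag region of the MSD curve, which is the rationale given above for reporting D1-4 rather than full MSD plots.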

      Due to the space restrictions, it is very difficult to include all the figures generated, but, only for review purposes, we included in this point-by-point reply some representative plots of the MSD values as a function of the time from individual trajectories showing different types of motion obtained in our experiments (Author response image 7).

      Author response image 7.

      Representative MSD plots from individual trajectories of CXCR4-AcGFP showing different types of motion: A) confined, B) Brownian/Free, C) direct transport of CXCR4-AcGFP particles diffusing at the cell membrane detected by SPT-TIRF in resting JKCD4 cells.

      Further analysis, such as the classification based on particle motion, has not been included in this article. This classification uses the moment scaling spectrum (MSS), described by Ewers H. et al. 2005 PNAS, and requires particles with longer trajectories (>50 frames). Only for review purposes, we include a figure showing the percentage of the MSS-based particle motion classification for each condition. As expected, most of the particles with long trajectories are confined, with a slight increase in the percentage upon CXCL12 stimulation in all conditions, except in cells treated with AGR1.137 (Author response image 8).

      Author response image 8.

      Effects of the negative allosteric modulators on the Types of Motion of CXCR4. Percentage of single trajectories with different types of motion, classified by MSS (DMSO: 58 particles in 59 cells on FN; 314 in 63 cells on FN+CXCL12; AGR1.131: 102 particles in 71 cells on FN; 258 in 69 cells on FN+CXCL12; AGR1.135: 86 particles in 70 cells on FN; 120 in 77 cells on FN+CXCL12; AGR1.137: 47 particles in 66 cells on FN; 74 in 64 cells on FN+CXCL12) n = 3.
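      The MSS-based classification mentioned above can be sketched as follows. This is a generic illustration of the moment scaling spectrum idea and not the pipeline used for Author response image 8: the function names, the range of moment orders (0 to 6), the maximum lag (one third of the trajectory) and the classification tolerance are all assumptions.

```python
import numpy as np

def mss_slope(track, max_order=6, max_lag=None):
    """Slope of the moment scaling spectrum for one trajectory.

    For each moment order p, the mean of |displacement|^p is computed at each
    lag and its scaling exponent gamma_p is taken from a log-log fit against
    the lag. The returned value is the slope of gamma_p versus p.
    """
    track = np.asarray(track, dtype=float)
    if max_lag is None:
        max_lag = len(track) // 3                  # assumed choice of lag range
    lags = np.arange(1, max_lag + 1)
    orders = np.arange(0, max_order + 1)
    gammas = []
    for p in orders:
        moments = [
            np.mean(np.linalg.norm(track[lag:] - track[:-lag], axis=1) ** p)
            for lag in lags
        ]
        gamma_p, _ = np.polyfit(np.log(lags), np.log(moments), 1)
        gammas.append(gamma_p)
    slope, _ = np.polyfit(orders, gammas, 1)
    return slope

def classify_motion(slope, tol=0.1):
    """Hypothetical thresholds: ~0.5 is free diffusion, below confined, above directed."""
    if slope < 0.5 - tol:
        return "confined"
    if slope > 0.5 + tol:
        return "directed"
    return "free"

# Hypothetical example: perfectly directed motion along x over 60 frames
directed = np.column_stack([np.arange(60) * 0.1, np.zeros(60)])
s = mss_slope(directed)
print(round(s, 3), classify_motion(s))
# prints: 1.0 directed
```

      Because the exponents gamma_p are fitted over many lags, the estimate becomes unreliable for short tracks, which is consistent with the requirement for trajectories longer than 50 frames stated above.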

      (2) Fig 3. The figure legends have inadequate information on concentrations and incubation times used, both for the compounds and other treatments like CXCL12 and forskolin. For the Western blot data, also the quantification should be added to the main figure. The compounds, particularly AGR1.137 seem to lead to augmented stimulation of pAKT and pERK. This should be discussed

      The Fig. 3 legend has been corrected in the revised manuscript. Fig. 3D now contains representative western blots and the densitometry evaluation of these experiments. As the reviewer indicates, we also detected in the western blot included, augmented stimulation of pAKT and pERK in cells treated with AGR1.137. However, as shown in the densitometry analysis, no significant differences were noted between the data obtained with each compound. As a control of inhibition of CXCL12 stimulation we have included a new Supplementary Fig. 4 showing the effect of AMD3100 on CXCL12-induced activation of Akt and ERK as analyzed by western blot.

      (3) Fig. 4 immunofluorescence data on polarization as well as the flow chamber data lack the representative images of the data. The information on the source of the T cells is missing. Not clear if this experiment was done on bilayers or on static surfaces.

      Representative images for the data shown in Figure 4B have been added in the revised figure (Fig. 4C). The experiments in Fig. 4B were performed on static surfaces. As indicated in the material and methods section, primary T cell blasts were added to fibronectin-coated glass slides and were then stimulated or not with CXCL12 (5 min at 37ºC) prior to fixation, permeabilization and staining with Phalloidin. Primary T cell blasts were generated from PBMCs isolated from buffy coats that were activated in vitro with IL-2 and PHA as indicated in the material and methods section.

      (4) The data largely lacks titration of different concentrations of the compounds. How were the effective concentration and treatment times determined? What happens at higher concentrations? It is important to show, for instance, if the CXCR12 binding gets inhibited at higher concentrations. most experiments were performed with 50 uM, but HeLa cell data with 100 uM. Why and how was this determined? 

      The revised version contains a new panel in Fig. 1B to show a more detailed kinetic analysis with different concentrations (1-100 µM) of the compounds in the migration experiments using Jurkat cells. We chose 50 µM for further studies as it was the concentration that inhibited 50-75% of the ligand-induced cell migration. 

      We have also included the effect of two doses of the compounds (10 and 50 µM) in the zebrafish model as well as AMD3100 (1 and 10 µM) as control (new Fig. 7D, E). Tumors were imaged within 2 hours of implantation and tumor-bearing embryos were treated with either vehicle (DMSO) alone, AGR1.131 or AGR1.137 at 10 and 50 µM or AMD3100 at 1 and 10 µM for three days, followed by re-imaging.

      Regarding the amount of CXCL12 used in these experiments, with the exception of cell migration assays in Transwells, where the optimal concentration was established at 12.5 nM, in all the other experiments the optimal concentration of CXCL12 employed was 50 nM. In the case of the directional cell migration assays, we use 100 nM to create the chemokine gradient in the device. These concentrations have been optimized in previous works of our laboratory using these types of experiments. It should also be noted that in the experiments using lipid bilayers or TIRF-M experiments, CXCL12 is used to coat the plates and therefore it is difficult to determine the real concentration that is retained on the surface after the washing steps performed prior to adding the cells.

      (5) The authors state that they could not detect direct binding of the compounds and the CXCR14. It should be reported what approaches were tried and discussed why this was not possible. 

      We attempted a fluorescence spectroscopy strategy to formally prove the ability of AGR1.135 to bind CXCR4, but this strategy failed because the compound has a yellow color that interfered with the determinations. We also tried a FRET strategy (see supplementary Fig. 7) and detected a significant increase in FRET efficiency of CXCR4 homodimers in cells treated with AGR1.135; this effect was due to the yellow color of this compound that interferes with FRET determinations. In the same assays, AGR1.137 did not modify FRET efficiency for CXCR4 homodimers and therefore we cannot assume that AGR1.137 binds on CXCR4. All these data have been considered in the revised discussion.

      (6) The proliferation data in Supplementary Figure 1 lacks controls that affect proliferation and indication of different cell cycle stages. What is the conclusion of this data? More information on the effects of the drug to cell viability would be important.

      Toxicity in Jurkat cells was first determined by propidium iodide incorporation. Some compounds (i.e., AGR1.103 and VSP3.1) were discarded from further analysis as they were toxic for cells. In a deeper analysis of cell toxicity, even if these compounds did not kill the cells, we checked whether they could alter the cell cycle of the cells. New Supplementary Fig. 2 includes a table (panel B) with the percentage of cells in each cell cycle phase, and no differences between any of the treatments tested were detected. 

      Nevertheless, to clarify this issue the revised version of the figure also includes H2O2 and staurosporine stimuli to induce cell death and cell cycle alterations as controls of these assays.

      (7) The flow data in Supplementary Figure 2 should be statistically analysed. 

      Bar graphs corresponding to the old Supplementary Fig. 2 (new Supplementary Fig. 3) are shown in Fig. 3B. We have also incorporated the corresponding statistical analysis to this figure. 

      (8) In general, the authors should revise the figure legends to ensure that critical details are added. 

      We have carefully revised all the figure legends in the new version of the manuscript.

      (9) Bar plots are very poor in showing the heterogeneity of the data. Individual data points should be shown whenever feasible. Superplot-type of representation is strongly advised (https://doi.org/10.1083/jcb.202001064).

      We have carefully read the paper published by Lord S.J. and colleagues (Lord S. J. et al., J. Cell Biol. 2020) and we apply their recommendations for our TIRF-M data (see revised Fig. 2).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      This paper details a study of endothelial cell vessel formation during zebrafish development. The results focus on the role of aquaporins, which mediate the flow of water across the cell membrane, leading to cell movement. The authors show that actin and water flow together drive endothelial cell migration and vessel formation. If any of these two elements are perturbed, there are observed defects in vessels. Overall, the paper significantly improves our understanding of cell migration during morphogenesis in organisms.

      Strengths:

      The data are extensive and are of high quality. There is a good amount of quantification with convincing statistical significance. The overall conclusion is justified given the evidence.

      Weaknesses:

      There are two weaknesses, which if addressed, would improve the paper.

      (1) The paper focuses on aquaporins, which, while mediating water flow, cannot drive directional water flow. If the osmotic engine model is correct, then ion channels such as NHE1 are the driving force for water flow. Indeed, this was shown in previous studies. Moreover, NHE1 can drive water intake because the export of H+ leads to increased HCO3 due to the reaction between CO2+H2O, which increases the cytoplasmic osmolarity (see Li, Zhou and Sun, Frontiers in Cell Dev. Bio. 2021). If NHE cannot be easily perturbed in zebrafish, it might be of interest to perturb Cl channels such as SWELL1, which was recently shown to work together with NHE (see Zhang, et al, Nat. Comm. 2022).

      (2) In some places the discussion seems a little confusing where the text goes from hydrostatic pressure to osmotic gradient. It might improve the paper if some background is given. For example, mention water flow follows osmotic gradients, which will build up hydrostatic pressure. The osmotic gradients across the membrane are generated by active ion exchangers. This point is often confused in literature and somewhere in the intro, this could be made clearer.

      Reviewer #1 (Recommendations For The Authors):

      (1) The paper focuses on aquaporins, which, while mediating water flow, cannot drive directional water flow. If the osmotic engine model is correct, then ion channels such as NHE1 are the driving force for water flow. Indeed, this was shown in previous studies. Moreover, NHE1 can drive water intake because the export of H+ leads to increased HCO3 due to the reaction between CO2+H2O, which increases the cytoplasmic osmolarity (see Li, Zhou and Sun, Frontiers in Cell Dev. Bio. 2021). If NHE cannot be easily perturbed in zebrafish, it might be of interest to perturb Cl channels such as SWELL1, which was recently shown to work together with NHE (see Zhang, et al, Nat. Comm. 2022).

      We thank Reviewer #1 for this very important comment and the suggestion to examine the function of ion channels in establishing an osmotic gradient to drive directional flow. We have taken on board the reviewer’s suggestion and examined the expression of NHE1 and SWELL1 in endothelial cells using published scRNAseq of 24 hpf ECs (Gurung et al, 2022, Sci. Rep.). We found that slc9a1a, slc9a6a, slc9a7, slc9a8, lrrc8aa and lrrc8ab are expressed in different endothelial subtypes. To examine the function of NHE1 and SWELL1 in endothelial cell migration, we used the pharmacological compounds 5-(N-ethyl-N-isopropyl)amiloride (EIPA) and DCPIB, respectively. While we were unable to observe an ISV phenotype after EIPA treatment at 5, 10 and 50 µM, we were able to observe impaired ISV formation after DCPIB treatment that was very similar to that observed in Aquaporin mutants. We were very encouraged by these results and proceeded to perform more detailed experiments whose results have yielded a new figure (Figure 6) and are described and discussed in lines 266 to 289 and 396 to 407, respectively, in the revised manuscript.

      (2) In some places the discussion seems a little confusing where the text goes from hydrostatic pressure to osmotic gradient. It might improve the paper if some background is given. For example, mention water flow follows osmotic gradients, which will build up hydrostatic pressure. The osmotic gradients across the membrane are generated by active ion exchangers. This point is often confused in literature and somewhere in the intro, this could be made clearer.

      Thank you for pointing out the deficiency in explaining how osmotic gradients drive water flow to build up hydrostatic pressure. We have clarified this in lines 50, 53 - 54 and 385.

      The two recommendations listed above would improve the paper. They are however not mandatory. The paper would be acceptable with some clarifying rewrites. I am not an expert on zebrafish genetics, so it might be difficult to perturb ion channels in this model organism. Have the authors tried to perturb ion channels in these cells?

      We hope that our attempts at addressing Reviewer #1’s comments are satisfactory and sufficient to clarify the concerns outlined.

      Reviewer #2 (Public Review):

      Summary:

      Directional migration is an integral aspect of sprouting angiogenesis and requires a cell to change its shape and sense a chemotactic or growth factor stimulus. Kondrychyn I. et al. provide data that indicate a requirement for zebrafish aquaporins 1 and 8, in cellular water inflow and sprouting angiogenesis. Zebrafish mutants lacking aqp1a.1 and aqp8a.1 have significantly lower tip cell volume and migration velocity, which delays vascular development. Inhibition of actin formation and filopodia dynamics further aggravates this phenotype. The link between water inflow, hydrostatic pressure, and actin dynamics driving endothelial cell sprouting and migration during angiogenesis is highly novel.

      Strengths:

      The zebrafish genetics, microscopy imaging, and measurements performed are of very high quality. The study data and interpretations are very well-presented in this manuscript.

      Weaknesses:

      Some of the mechanobiology findings and interpretations could be strengthened by more advanced measurements and experimental manipulations. Also, a better comparison and integration of the authors' findings, with other previously published findings in mice and zebrafish would strengthen the paper.

      We thank Reviewer #2 for the critique that the paper can be strengthened by more advanced measurements and experimental manipulations. One of the technical challenges that we face is how to visualize and measure water flow directly in the zebrafish. We have therefore taken indirect approaches to assess water abundance in endothelial cells in vivo. One approach was to measure the diffusion of GEM nanoparticles in tip cell cytoplasm in wildtype and Aquaporin mutants, but results were inconclusive. The second was to measure the volume of tip cells, which should reflect water in/outflow. As the second approach produced clear and robust differences between wildtype ECs, ECs lacking Aqp1a.1 and Aqp8a.1 and ECs overexpressing Aqp1a.1 (revised Fig. 5), we decided to present these data in this manuscript.

      We have also taken Reviewer #2’s advice to better incorporate previously published data in our discussion (see below and lines 374 to 383 of the revised manuscript).

      Reviewer #2 (Recommendations For The Authors):

      I have a few comments that the authors may address to further improve their manuscript analysis, quality, and impact.

      Major comments:

      (1) Citation and discussion of published literature

      The authors have failed to cite and discuss recently published results on the role of aqp1a.1 and aqp8a.1 in ISV formation and caliber in zebrafish (Chen C et al. Cardiovascular Research 2024). That study showed a similar impairment of ISV formation when aqp1a.1 is absent but demonstrated a stronger phenotype on ISV morphology in the absence of aqp8a.1 than the current manuscript by Kondrychyn I et al. Furthermore, Chen C et al show an overall decrease in ISV diameter in single aquaporin mutants, suggesting that the cell volume of all ECs in an ISV is affected equally. Given this published data, are ISV diameters affected in single and double mutants in the current study by Kondrychyn I et al? An overall effect on ISVs would suggest that aquaporin-mediated cell volume changes are not an inherent feature of endothelial tip cells. The authors need to analyse/compare and discuss all differences and similarities of their findings to what has been published recently.

      We apologise for having failed to cite and discuss the recently published paper by Chen et al. This has been corrected, and the paper is now discussed in lines 374 to 383.

      In the paper by Chen et al., the authors describe a role of Aqp1a.1 and Aqp8a.1 in regulating ISV diameter (ISV diameter was analysed at 48 hpf) but they did not examine the earlier stages of sprouting angiogenesis between 20 and 30 hpf, which is the focus of our study. We therefore cannot directly compare our ISV phenotypes with theirs. Nevertheless, we recognise that there are differences in ISV phenotypes from 2 dpf. For example, they did not observe incompletely formed or missing ISVs at 2 and 3 dpf, which we clearly observe in our study. This could be explained by differences in the mutations generated. In Chen et al., the sgRNA used targeted the end of exon 2, which resulted in the generation of a 169 amino acid truncated aqp1a.1 protein. However, in our approach, our sgRNA targeted exon 1 of the gene, which resulted in a truncated aqp1a.1 protein that is 76 amino acids long. As for the aqp8a.1 zebrafish mutant that we generated, our sgRNA targeted exon 1 of the gene, which resulted in a truncated protein that is 73 amino acids long. In Chen et al., the authors did not generate an aqp8a.1 mutant but instead used a crispant approach, which leads to genetic mosaicism and high experimental variability.

      Following the reviewer’s suggestion, we have now measured the diameters of arterial ISVs (aISVs) and venous ISVs (vISVs) in aqp1a.1<sup>-/-</sup>, aqp8a.1<sup>-/-</sup> and aqp1a.1<sup>-/-</sup>;aqp8a.1<sup>-/-</sup> zebrafish. In our lab, we always make a distinction between aISVs and vISVs as their diameters are significantly different from each other. The results are in Fig S11A. While we corroborate a decrease in diameter in both aISVs and vISVs in single aqp1a.1<sup>-/-</sup> and double aqp1a.1<sup>-/-</sup>;aqp8a.1<sup>-/-</sup> zebrafish, we observed a slight increase in diameter in both aISVs and vISVs in aqp8a.1<sup>-/-</sup> zebrafish at 2 dpf. We also measured the diameter of aISVs and vISVs in Tg(fli1ep:aqp1a.1-mEmerald) and Tg(fli1ep:aqp8a.1-mEmerald) zebrafish at 2 dpf (Fig S11B) and, unlike in Chen et al., we could not detect a difference in the diameter between control and aqp1a.1- or aqp8a.1-overexpressing endothelial cells.

      We would also like to point out that, because ISVs are incompletely formed or are missing in aqp1a.1<sup>-/-</sup>;aqp8a.1<sup>-/-</sup> zebrafish (Fig. 3G – L), blood flow is most likely altered in the zebrafish trunk of these mutants, and this can have a secondary effect on blood vessel calibre or diameter. In fact, we often observed wider ISVs adjacent to unperfused ISVs (Fig. 3J) as more blood flow enters the lumenized ISV. Therefore, to determine the cell autonomous function of Aquaporin in mediating cell volume changes in vessel diameter regulation, one would need to perform cell transplantation experiments where we would measure the volume of single aqp1a.1<sup>-/-</sup>;aqp8a.1<sup>-/-</sup> endothelial cells in wildtype embryos with normal blood flow. As this is beyond the scope of the present study, we have not done this experiment during the revision process.

      (2) Expression of aqp1a.1 and aqp8a.1

      The quantification shown in Figure 1G shows a relative abundance of expression between tip and stalk cells. However, it seems aqp8a.1 is almost never detected in most tip cells. The authors could show, in addition, the % of tip and stalk cells with detectable expression of the 2 aquaporins. It seems aqp8a.1 is really weakly or not expressed in the initial stages. Of course, the protein may have a different dynamic from the RNA.

      We would like to clarify that aqp8a.1 mRNA is not detected in tip cells of newly formed ISVs at 20 hpf. At 22 hpf, it is expressed in both tip cells (22 out of 23 tip cells analysed) and stalk cells of ISVs. This is clarified in lines 107 - 109. We also include below a graph showing that although aqp8a.1 mRNA is expressed in tip cells, its expression is higher in stalk cells.

      Author response image 1.

      Could the authors show endogenously expressed or tagged protein by antibody staining? The analysis of the Tg(fli1ep:aqp8a.1-mEmerald)rk31 zebrafish line is a good complement, but unfortunately, it does not reveal the localization of the endogenously expressed protein. Do the authors have any data supporting that the endogenously expressed aqp8a.1 protein is present in sprouting tip cells?

      We tested several antibodies against AQP1 (Alpha Diagnostic International, AQP11-A; ThermoFisher Scientific, MA1-20214; Alomone Labs, AQP-001) and AQP8 (Sigma Aldrich, SAB 1403559; Alpha Diagnostic International, AQP81-A; Alomone Labs, AQP-008) but unfortunately none worked. As such, we do not have data demonstrating endogenous expression and localisation of Aqp1a.1 and Aqp8a.1 proteins in endothelial cells.

      Could the authors perform F0 CRISPR/Cas9 mediated knockin of a small tag (i.e. HA epitope) in zebrafish and read the endogenous protein localization with anti-HA Ab?

      CRISPR/Cas9 mediated in-frame knock-in of a tag into a genomic locus is a technical challenge that our lab has not established. We therefore cannot do this experiment within the revision period.

      Given the double mutant phenotypic data shown, is aqp8a.1 expression upregulated and perhaps more important in aqp1a.1 mutants?

      In our analysis of aqp1a.1 homozygous zebrafish, there is a slight downregulation in aqp8a.1 expression (Fig. S5C). Because the loss of Aqp1a.1 leads to a stronger impairment in ISV formation than the loss of Aqp8a.1 (see Fig. S6F, G, I and J), we believe that Aqp1a.1 has a stronger function than Aqp8a.1 in EC migration during sprouting angiogenesis.

      Regarding the regulation of expression by the Vegfr inhibitor Ki8751, does this inhibitor affect Vegfr/ERK signalling in zebrafish and the sprouting of ISVs significantly?

      Ki8751 has been demonstrated to inhibit ERK signalling in tip cells in the zebrafish by Costa et al., 2016 in Nature Cell Biology. In our experiments, treatment with 5 µM Ki8751 for 6 hours from 20 hpf also inhibited sprouting of ISVs.

      The data presented suggest that tip cells overexpressing aqp1a.1-mEmerald (Figure 2C) need more than 6 times longer to migrate the same distance as tip cells expressing aqp8a.1-mEmerald (Figure 2D). How does this compare with cells expressing only Emerald? A similar time difference can be seen in Movie S1 and Movie S2. Is it just a coincidence? Could aqp8a.1, when expressed at similar levels as aqp1a, be more functional and induce faster cell migration? These experiments were interpreted only for the localization of the proteins, but not for the potential role of the overexpressed proteins on function. Chen C et al. Cardiovascular Research 2024 also has some Aqp overexpression data.

      The still images prepared for Fig. 2 C and D were selected to illustrate the localization of Aqp1a.1-mEmerald and Aqp8a.1-mEmerald at the leading edge of migrating tip cells. We did not notice that the tip cell overexpressing Aqp1a.1-mEmerald (Figure 2C) needed more than 6 times longer to migrate the same distance as the tip cell expressing aqp8a.1-mEmerald (Figure 2D), which the reviewer astutely detected. To ascertain whether there is a difference in migration speed between Aqp1a.1-mEmerald and Aqp8a.1-mEmerald overexpressing endothelial cells, we measured tip cell migration velocity of three ISVs from Tg(fli1ep:aqp1a.1-mEmerald) and Tg(fli1ep:aqp8a.1-mEmerald) zebrafish during the period of ISV formation (24 to 29 hpf) using the Manual Tracking plugin in Fiji. As shown in the graph, there is no significant difference in the migration speed of ECs overexpressing Aqp1a.1-mEmerald and Aqp8a.1-mEmerald, suggesting that Aqp8a.1-overexpressing cells migrate at a similar rate as Aqp1a.1-overexpressing cells. As we have not generated a Tg(fli1ep:mEmerald) zebrafish line, we are unable to determine whether endothelial cells migrate faster in Tg(fli1ep:aqp1a.1-mEmerald) and Tg(fli1ep:aqp8a.1-mEmerald) zebrafish compared to endothelial cells expressing only mEmerald. As for the observation that tip cells overexpressing aqp1a.1-mEmerald (Figure 2C) need more than 6 times longer to migrate the same distance as tip cells expressing aqp8a.1-mEmerald, we can only surmise that it is coincidental that the images selected “showed” faster migration of one ISV from Tg(fli1ep:aqp8a.1-mEmerald) zebrafish. We do not know whether Aqp1a.1 and Aqp8a.1 are overexpressed to the same levels in Tg(fli1ep:aqp1a.1-mEmerald) and Tg(fli1ep:aqp8a.1-mEmerald) zebrafish.
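
      As an illustration only, a mean migration speed can be derived from manually tracked positions along the following lines. This is a minimal sketch and not the authors' analysis pipeline: the function name `mean_speed`, the pixel size and the frame interval are hypothetical, and the input is assumed to be one (x, y) pixel position per frame, as a manual tracking tool would typically export.

```python
import math


def mean_speed(track, pixel_size_um, frame_interval_min):
    """Mean speed (µm/min) along a track of (x, y) pixel positions.

    `track` holds one position per frame; `pixel_size_um` and
    `frame_interval_min` are hypothetical calibration values.
    """
    if len(track) < 2:
        raise ValueError("need at least two tracked positions")
    # Sum the frame-to-frame displacements and convert pixels to µm.
    path_um = sum(
        math.dist(p, q) * pixel_size_um
        for p, q in zip(track, track[1:])
    )
    # Total elapsed time is the number of intervals times the frame interval.
    elapsed_min = frame_interval_min * (len(track) - 1)
    return path_um / elapsed_min


# Made-up positions for a single tip cell, purely for demonstration.
example_track = [(0, 0), (3, 4), (6, 8)]
example_speed = mean_speed(example_track, pixel_size_um=0.5, frame_interval_min=5)
```

      Per-cell speeds computed this way for each transgenic line could then be compared with a suitable statistical test; the choice of test is not specified in the response and is left open here.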

      We would also like to point out that when we analysed the lengths of ISVs at 28 hpf in aqp1a.1<sup>-/-</sup> and aqp8a.1<sup>-/-</sup> zebrafish, ISVs were shorter in aqp1a.1<sup>-/-</sup> zebrafish compared to aqp8a.1<sup>-/-</sup> zebrafish (Fig. S6 F to J). These results indicate that the loss of Aqp1a.1 function causes slower migration than the loss of Aqp8a.1 function, and suggest that Aqp1a.1 induces faster endothelial cell migration than Aqp8a.1.

      Author response image 2.

      The data on Aqps expression after the Notch inhibitor DBZ seems unnecessary, and is at the moment not properly discussed. It is also against what is set in the field. aqp8a.1 levels seem to increase only 24h after DBZ, not at 6h, and still authors conclude that Notch activation inhibits aqp8a.1 expression (Line 138-139). In the field, Notch is considered to be more active in stalk cells, where aqp8a.1 expression seems higher (not lower). Maybe the analysis of tip vs stalk cell markers in the scRNAseq data, and their correlation with Hes1/Hey1/Hey2 and aqp1 vs aqp8 mRNA levels will be more clear than just showing qRT-PCR data after DBZ.

      As our scRNAseq data did not include ECs from the earlier developmental stages when ISVs are forming, we analysed scRNAseq data of 24 hpf endothelial cells published by Gurung et al., 2022 in Scientific Reports during the revision of this manuscript. However, we were unable to detect separate clusters of tip and stalk cells. As such, we are unable to correlate hes1/hey1/hey2 expression (which would be higher in stalk cells) with that of aqp1a.1/aqp8a.1. Also, we have decided to remove the DBZ-treatment results from our manuscript as we agree with the two reviewers that they are unnecessary.

      The paper would also benefit from some more analysis and interpretation of available scRNAseq data in development/injury/disease/angiogenesis models (zebrafish, mice or humans) for the aquaporin genes characterized here. To potentially raise a broader interest at the start of the paper.

      We thank the reviewer for suggesting examining aquaporin genes in other angiogenesis/disease/regeneration models to expand the scope of aquaporin function. We will do this in future studies.

      (3) Role of aqp1a.1 and aqp8a.1 on cytoplasmic volume changes and related phenotypes

      In Figure 5 the authors show that Aqp1/Aqp8 mutant endothelial tip cells have a lower cytoplasmic volume than tip cells from wildtype fish. If aquaporin-mediated water inflow occurs locally at the leading edge of endothelial tip cells (Figure 2, line 314-318), why doesn't cytoplasmic volume expand specifically only at that location (as shown in immune cells by Boer et al. 2023)? Can the observed reduction in cytoplasmic volume simply be a side-effect of impaired filopodia formation (Figure 4F-I)?

      We believe that water influx not only expands filopodia but also the leading front of tip cells (see bracket region in Fig. 4D), where Aqp1a.1-mEmerald/Aqp8a.1-mEmerald accumulate (Fig. 2), to generate an elongated protrusion and forward expansion of the tip cell. The decrease in cytoplasmic volume observed in the aqp1a.1;aqp8a.1 double mutant zebrafish is a result of decreased formation of these elongated protrusions at the leading front of migrating tip cells as shown in Fig. 4E (compare to Fig. 4D), not from just a decrease in filopodia number. In fact, in the method used to quantify cell volume, mEmerald/EGFP localization is limited to the cytoplasm and does not label filopodia well (compare mEmerald/EGFP in green with membrane tagged-mCherry in Fig. 5A - C). The volume measured therefore reflects cytoplasmic volume of the tip cell, not filopodia volume.

      Do the authors have data on cytoplasmic volume changes of endothelial tip cells in latrunculin B treated fish? The images in Figures 6 A,B suggest that there is a difference in cell volume upon lat b treatment only.

      No, unfortunately we have not performed single cell labelling and measurement of tip cells in Latrunculin B-treated embryos. We can speculate that as there is a decrease in actin-driven membrane protrusions in this experiment, one would also expect a decrease in cell volume as the reviewer has observed.

      (4) Combined loss of aquaporins and actin-based force generation.

      Lines 331-332 " we show that hydrostatic pressure is the driving force for EC migration in the absence of actin-based force generation"....better leave it more open and stick to the data. The authors show that aquaporin-mediated water inflow partially compensates for the loss of actin-based force generation in cell migration. Not that it is the key driving/rescuing force in the absence of actin-based force.

      We have changed it to “we show that hydrostatic pressure can generate force for EC migration in the absence of actin-based force generation” in line 348.

      (5) Aquaporins and their role in EC proliferation

      In the study by Phng LK et al. 2013, the authors have shown that proliferation is not affected when actin polymerization or filopodia formation is inhibited. However, in the current manuscript by Kondrychyn I. et al. this has not been analysed carefully. In Movie S4 the authors indicate by arrows tip cells that fail to invade the zebrafish trunk, demonstrating a severe defect of sprouting initiation in these mutants. Yet, when only looking at ISVs that reach the dorsal side in Movie S4, it appears that they are comprised of fewer EC nuclei/ISV than the ISVs in Movie S3. At the beginning of DLAV formation, most ISVs in control Movie S3 consist of 3-4 EC nuclei, while in double mutants Movie S4 it appears to be only 2-3 EC nuclei. At the end of Movie S4, one ISV on the left side even appears to consist of only a single EC when touching the dorsal roof. The authors provide convincing data on how the absence of aquaporin channels affects sprouting initiation and migration speed, resulting in severe delay in ISV formation. However, the authors should also analyse EC proliferation, as it may also be affected in these mutants, and may also contribute to the observed phenotype. We know that effects on cell migration may indirectly change the number of cells and proliferation at the ISVs, but this has not been carefully analysed in this paper.

      We thank the reviewer for highlighting the lack of information on EC number and division in the aquaporin mutants. We have now quantified EC number in ISVs that are fully formed (i.e. connecting the DA or PCV to the DLAV) at 2 and 3 dpf and the results are displayed in Figure S10A and B. At 2 dpf, there is a slight but significant reduction in EC number in both aISVs and vISVs in aqp1a.1<sup>-/-</sup> zebrafish and an even greater reduction in the double aqp1a.1<sup>-/-</sup>;aqp8a.1<sup>-/-</sup> zebrafish. No significant change in EC number was observed in aqp8a.1<sup>-/-</sup> zebrafish. EC number was also significantly decreased at 3 dpf for aqp1a.1<sup>-/-</sup>, aqp8a.1<sup>-/-</sup> and aqp1a.1<sup>-/-</sup>;aqp8a.1<sup>-/-</sup> zebrafish. The decrease in EC number per ISV may therefore contribute to the observed phenotype.

      We have also quantified the number of cell divisions during sprouting angiogenesis (from 21 to 30 hpf) to assess whether the lack of Aquaporin function affects EC proliferation. This analysis shows that there is no significant difference in the number of mitotic events between aqp1a.1<sup>+/-</sup>; aqp8a.1<sup>+/-</sup> and aqp1a.1<sup>-/-</sup>;aqp8a.1<sup>-/-</sup> zebrafish (Figure S10 C), suggesting that the reduction in EC number is not caused by a decrease in EC proliferation.

      These new data are reported on lines 198 to 205 of the manuscript.

      Minor comments:

      - Figure 3K data seems not to be necessary and even partially misleading after seeing Figure 3E. Fig. 3E represents the true strength of the phenotype in the different mutants.

      Figure 3K has been removed from Figure 3.

      - Typo Figure 3L (VII should be VI).

      Thank you for spotting this typo. VII has been changed to VI.

      - Line 242: The word "required" is too strong because there is vessel formation without Aqps in endothelial cells.

      This has been changed to “ …Aqp1a.1 and Aqp8a.1 regulate sprouting angiogenesis…” (lines 238 - 239).

      - From Figure S2, the doublets cluster should be removed.

      We have performed a new analysis of 24 hpf, 34 hpf and 3 dpf endothelial cell scRNAseq data (the previous analysis did not include 24 hpf endothelial cells). The doublets cluster is not included in the UMAP analysis.

      - Better indicate the fluorescence markers/alleles/transgenes used for imaging in Figures 6A-D.

      The transgenic lines used for this experiment are now indicated in the figure (this figure is now Figure 7).

      Reviewer #3 (Public Review):

      Summary:

      Kondrychyn and colleagues describe the contribution of two Aquaporins, Aqp1a.1 and Aqp8a.1, towards angiogenic sprouting in the zebrafish embryo. By whole-mount in situ hybridization, RNAscope, and scRNA-seq, they show that both genes are expressed in endothelial cells in partly overlapping spatiotemporal patterns. Pharmacological inhibition experiments indicate a requirement for VEGFR2 signaling (but not Notch) in transcriptional activation.

      To assess the role of both genes during vascular development the authors generate genetic mutations. While homozygous single mutants appear normal, aqp1a.1;aqp8a.1 double mutants exhibit defects in EC sprouting and ISV formation.

      At the cellular level, the aquaporin mutants display a reduction of filopodia in number and length. Furthermore, a reduction in cell volume is observed indicating a defect in water uptake.

      The authors conclude that polarized water uptake mediated by aquaporins is required for the initiation of endothelial sprouting and (tip) cell migration during ISV formation. They further propose that water influx increases hydrostatic pressure within the cells, which may facilitate actin polymerization and formation of membrane protrusions.

      Strengths:

      The authors provide a detailed analysis of Aqp1a.1 and Aqp8a.1 during blood vessel formation in vivo, using zebrafish intersomitic vessels as a model. State-of-the-art imaging demonstrates an essential role of aquaporins in different aspects of endothelial cell activation and migration during angiogenesis.

      Weaknesses:

      With respect to the connection between Aqp1/8 and actin polymerization/filopodia formation, the evidence appears preliminary and the authors' interpretation is guided by evidence from other experimental systems.

      Reviewer #3 (Recommendations For The Authors):

      Figure 1 H, J:

      The differential response of aqp1/-8 to ki8751 vs DBZ after 6h treatment is quite obvious. Why do the authors show the effect after 24h? The effect is more likely than not indirect.

      We agree with the reviewer and we have now removed 24 hour Ki8751 treatment and all DBZ treatments from Figure 1.

      Figure 2:

      According to the authors' model anterior localization of Aqp1 protein is critical. The authors perform transient injections to mosaically express Aqp fusion proteins using an endothelial (fli1) promoter. For the interpretation, it would be helpful to also show the mCherry-CAAX channel in separate panels. From the images, it is not possible to discern how many cells we are looking at. In particular the movie in panel D may show two cells at the tip of the sprout. A marker labelling cell-cell junctions would help. Furthermore, the authors are using a strong exogenous promoter, thus potentially overexpressing the fusion protein, which may lead to mislocalization. For Aqp1a.1 an antibody has been published to work in zebrafish (e.g. Kwong et al., Plos1, 2013).

      We would like to clarify that we generated transgenic lines - Tg(fli1ep:aqp1a.1-mEmerald) and Tg(fli1ep:aqp8a.1-mEmerald) - to visualize the localization of Aqp1a.1 and Aqp8a.1 in endothelial cells, and the images displayed in Fig. 2 are from the transgenic lines (not transient, mosaic expression).

      To aid visualization and interpretation, we have now added an mCherry-CAAX only channel to accompany the Aqp1a.1/Aqp8a.1-mEmerald channel in Fig. 2A and B. To discern how many cells there are in the ISVs at this stage, we have crossed Tg(fli1ep:aqp1a.1-mEmerald) and Tg(fli1ep:aqp8a.1-mEmerald) zebrafish to TgKI(tjp1a-tdTomato)<sup>pd1224</sup> (Levic et al., 2021) to visualize ZO1 at cell-cell junctions. However, because tjp1a-tdTomato is expressed in all cell types, including the skin that lies just above the ISV, and the signal in ECs in ISVs is very weak at 22 to 25 hpf, it was very difficult to obtain good quality images that can properly delineate cell boundaries to determine the number of cells in the ISVs at this early stage. Instead, we have annotated endothelial cell boundaries based on more intense mCherry-CAAX fluorescence at cell-cell borders, and from the mosaic expression of mCherry-CAAX that is intrinsic to the Tg(kdrl:ras-mCherry)<sup>s916</sup> zebrafish line.

      In Fig. 2D, there are two endothelial cells in the ISV during the period shown but there is only 1 cell occupying the tip cell position i.e. there is one tip cell in this ISV. Unlike the mouse retina where it has been demonstrated that two endothelial cells can occupy the tip cell position side-by-side (Pelton et al., 2014), this is usually not observed in zebrafish ISVs. This is demonstrated in Movie S3, where it is clear that one nucleus (belonging to the tip cell) occupies the tip of the growing ISV. The accumulation of intracellular membranes is often observed in tip cells that may serve as a reservoir of membranes for the generation of membrane protrusions at the leading edge of tip cells.

      We agree that by generating transgenic Tg(fli1ep:aqp1a.1-mEmerald) and Tg(fli1ep:aqp8a.1-mEmerald) zebrafish, Aqp1a.1 and Aqp8a.1 are overexpressed, which may affect their localization. The eel anti-Aqp1a.1 antibody used in (Kwong et al., 2013) was a gift from Dr. Gordon Cramb, Univ. of St Andrews, Scotland and it was first published in 2001. This antibody is not available commercially. Instead, we have tried several other antibodies against AQP1 (Alpha Diagnostic International, AQP11-A; ThermoFisher Scientific, MA1-20214; Alomone Labs, AQP-001) and AQP8 (Sigma Aldrich, SAB 1403559; Alpha Diagnostic International, AQP81-A; Alomone Labs, AQP-008) but unfortunately none worked. As such, we cannot compare localization of Aqp1a.1-mEmerald and Aqp8a.1-mEmerald with the endogenous proteins.

      Figure 3:

      E: the quantification is difficult to read. Wouldn't it be better to set the y-axis in % of the DV axis? (see also Figure S6).

      We would like to show the absolute length of the ISVs, and to illustrate that the ISV length decreases from anterior to posterior of the zebrafish trunk. We have increased the size of Fig. 3E to enable easier reading of the bars.

      K: This quantification appears arbitrary.

      We have removed this panel from Figure 3.

      G-J: The magenta channel is difficult to see. Is the lifeact-mCherry mosaic? In panel J there appears to be a nucleus between the sprout and the DLAV. It would be helpful to crop the contralateral side of the image.

      No, the Tg(fli1:Lifeact-mCherry) line is not mosaic. The “missing” vessels are not due to mosaicism of the transgene but to truncated ISVs, which are a phenotype of loss of Aquaporin function. We have changed the magenta channel to grey and hope that by doing so, the reviewer will be able to see the shape of the blood vessels more clearly. We would like to leave the contralateral side in the images, as it shows that the defective vessel is only on one side of the body. Furthermore, when we tried to remove it (by reducing the number of Z-stacks), the neighbouring ISV looked incomplete because the embryos were not mounted flat. To clarify what the nucleus between the sprout and the DLAV is, we have indicated that it is that of the contralateral ISV.

      L: I do not quite understand the significance of the different classes of phenotypes. Do the authors propose different morphogenetic events or contexts of how these differences come about?

      Here, we report the different types of ISV phenotypes that we observe in 3 dpf aqp1a.1<sup>-/-</sup>; aqp8a.1<sup>-/-</sup> zebrafish (Fig. 3 and Fig. S7). As demonstrated in Fig. 4, most of the phenotypes can be explained by the delayed emergence of tip cells from the dorsal aorta and slower tip cell migration. However, in some instances, we also observed retraction of tip cells (Movie S4) and failure of tip cells to emerge from the dorsal aorta or endothelial cell death (see attached figure on page 14), which can give rise to the Class II phenotype. In the dominant class I phenotype (in contrast to class II), secondary sprouting from the posterior cardinal vein is unaffected, and the secondary sprout migrates dorsally passing the level of horizontal myoseptum but cannot complete the formation of vISV (it stops beneath the spinal cord). The Class III phenotype appears to result from a failure of the secondary sprout to fuse with the regressed primary ISV. In the Class IV phenotype, the ventral EC does not maintain a connection to the dorsal aorta. We did not examine how Class III and IV phenotypes arise in detail in this current study.

      Author response image 3.

      Figure 4:

      This figure nicely demonstrates the defects in cell behavior in aqp mutants.

      In panel F it would be helpful to show the single channels as well as the merge.

      We have now added single channels for PLCd1PH and Lifeact signal in panels F and G.

      In Figure 1 the authors argue that the reduction of Aqp1/8 by VEGFR2 inhibition may account for part of that phenotype. In turn, the aqp phenotype seems to resemble incomplete VEGFR2 inhibition. The authors should check whether expression Aqp1Emerald can partially rescue ki8751 inhibition.

      To address the reviewer’s comment, we have treated Tg(fli1ep:Aqp1-Emerald) embryos with ki8751 from 20 hpf for 6 hours but we were unable to observe a rescue in sprouting. It could be because VEGFR2 inhibition also affects other downstream signalling pathways that also control cell migration as well as proliferation.

      Based on previous studies (Loitto et al.; Papadopoulus et al.) the authors propose that also in ISVs aquaporin-mediated water influx may promote actin polymerization and thereby filopodia formation. However, while the effect on filopodia number and length is well demonstrated, the underlying cause is less clear. For example, filopodia formation could be affected by reduced cell polarization. This can be tested by using a transgenic golgi marker (Kwon et al., 2016).

      We have examined tip cell polarity of wildtype, aqp1a.1<sup>-/-</sup> and aqp8a.1<sup>-/-</sup> embryos at 24-26 hpf by analysing Golgi position relative to the nucleus. We were unable to analyze polarity in aqp1a.1<sup>rk28/rk28</sup>; aqp8a.1<sup>rk29/rk29</sup> embryos as they exist in an mCherry-containing transgenic zebrafish line (the Golgi marker is also tagged to mCherry). The results show that tip cell polarity is similar, if not more polarised, in aqp1a.1<sup>-/-</sup> and aqp8a.1<sup>-/-</sup> embryos when compared to wildtype embryos (Fig. S10D). This new data is discussed in lines 234 to 237.

      Figure 5:

      Panel D should be part of Figure 4.

      Panel 5D is now in panel J of Figure 4 and described in lines 231 and 235.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #1:

      People can perform a wide variety of different tasks, and a long-standing question in cognitive neuroscience is how the properties of different tasks are represented in the brain. The authors develop an interesting task that mixes two different sources of difficulty, and find that the brain appears to represent this mixture on a continuum, in the prefrontal areas involved in resolving task difficulty. While these results are interesting and in several ways compelling, they overlap with previous findings and rely on novel statistical analyses that may require further validation.

      Strengths

      1) The authors present an interesting and novel task for combining the contributions of stimulus-stimulus and stimulus-response conflict. While this mixture has been measured in the multi-source interference task (MSIT), this task provides a more graded mixture between these two sources of difficulty

      2) The authors do a good job triangulating regions that encode conflict similarity, looking for the conjunction across several different measures of conflict encoding

      3) The authors quantify several salient alternative hypotheses and systematically distinguish their core results from these alternatives

      4) The question that the authors tackle is of central theoretical importance to cognitive control, and they make an interesting contribution to this question

      We would like to thank the reviewer for the positive evaluation of our manuscript and the constructive comments and suggestions. Your feedback has been invaluable in our efforts to enhance the accessibility of our manuscript and strengthen our findings. In response to your suggestion, we reanalyzed our data using the approach proposed by Chen et al. (2017, NeuroImage) and applied stricter multiple comparison correction thresholds in our reporting. This reanalysis largely replicated our previous results, thereby reinforcing the robustness of our findings. We have also examined several alternative models, and the results supported the integration of the spatial Stroop and Simon conflicts within the cognitive space. In addition, we enriched the theoretical framework of our manuscript by connecting the cognitive space with other important theories such as the “Expected Value of Control” theory. We have incorporated your feedback, revisions and additional analyses into the manuscript. As a result, we firmly believe that these changes have significantly improved the quality of our work. We have provided detailed responses to your comments below.

      1) It's not entirely clear what the current task can measure that is not known from the MSIT, such as the additive influence of conflict sources in Fu et al. (2022), Science. More could be done to distinguish the benefits of this task from MSIT.

      We agree that the MSIT task incorporates Simon and Eriksen Flanker conflict tasks and can efficiently detect the additivity of conflict effects across orthogonal tasks. Like the MSIT, our task incorporates Simon with spatial Stroop conflicts and can test the same idea. For example, a previous study from our lab (Li et al., 2014) used the combined spatial Stroop-Simon condition with the arrows displayed on diagonal corners and found evidence for the additive hypothesis. However, the MSIT cannot be used to test whether/how different conflicts are parametrically represented in a low-dimensional space, a question that is important to address the debate of domain-general and domain-specific cognitive control.

      To this end, our current study adopted the spatial Stroop-Simon task for the unique purpose of parametrically modulating conflict similarity. As far as we know, there is no way to define the similarity between the combined Simon_Flanker conflict condition and the Simon/Flanker conditions in the MSIT. In contrast, with the spatial Stroop-Simon paradigm, we can define the similarity with the cosine of the angle difference across the two conditions in question.
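      The cosine-based similarity described above can be illustrated with a minimal sketch. The angle values are assumptions chosen for illustration (0 and 90 degrees standing for the two pure conflicts, intermediate values for mixtures), and the function name `conflict_similarity` is hypothetical rather than taken from the study's code.

```python
import math

def conflict_similarity(angle_a_deg: float, angle_b_deg: float) -> float:
    """Cosine of the angular difference between two conflict conditions."""
    return math.cos(math.radians(angle_a_deg - angle_b_deg))

# Hypothetical condition angles in degrees (an assumption, not the study's values):
# 0 and 90 stand for the two pure conflicts, values in between for mixtures.
angles = [0.0, 22.5, 45.0, 67.5, 90.0]

# Pairwise similarity matrix across all conditions.
similarity_matrix = [[conflict_similarity(a, b) for b in angles] for a in angles]

for row in similarity_matrix:
    print(" ".join(f"{value:6.3f}" for value in row))
```

      Under these assumptions, identical conditions have similarity 1, the two pure conflicts have similarity close to 0, and mixtures fall in between, which is the graded metric that a purely categorical design does not provide.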

      We have added the following text to the Discussion to emphasize the difference between our paradigm and those of other studies.

      "The use of an experimental paradigm that permits parametric manipulation of conflict similarity provides a way to systematically investigate the organization of cognitive control, as well as its influence on adaptive behaviors. This approach extends traditional paradigms, such as the multi-source interference task (Fu et al., 2022), color Stroop-Simon task (Liu et al., 2010) and similar paradigms that do not afford a quantifiable metric of conflict source similarity."

      References:

      Li, Q., Nan, W., Wang, K., & Liu, X. (2014). Independent processing of stimulus-stimulus and stimulus-response conflicts. PloS One, 9(2), e89249.

      2) The evidence from this previous work for mixtures between different conflict sources make the framing of 'infinite possible types of conflict' feel like a strawman. The authors cite classic work (e.g., Kornblum et al., 1990) that develops a typology for conflict which is far from infinite, and I think few people would argue that every possible source of difficulty will have to be learned separately. Such an issue is addressed in theories like 'Expected Value of Control', where optimization of control policies can address unique combinations of task demands.

      The notion that there might be infinite conflicts arises when we consider the quantitative feature of cognitive control. If each combination of the Stroop-Simon combination is regarded as a conflict condition, there would be infinite combinations, and it is our major goal to investigate how these infinite conflict conditions are represented effectively in a space with finite dimensions. We agree that it is unnecessary to dissociate each of these conflict conditions into a unique conflict type, since they may not differ substantially. However, we argue that understanding variant conflicts within a purely categorical framework (e.g., Simon and Flanker conflict in MSIT) is insufficient, especially because it leads to dichotomic conclusions that do not capture how combinations of conflicts are organized in the brain, as our study addresses.

      There could be different perspectives on how our cognitive control system flexibly encodes and resolves multiple conflicts. The cognitive space assumption we hold provides a principle by which we can represent multiple conflicts in a lower dimensional space efficiently. While the “Expected Value of Control” theory addresses when and how much cognitive control to apply based on control demand, the “cognitive space” view seeks to explain how the conflict, which defines cognitive control demand, is encoded in the brain. Thus, we argue that these two lines of work are different yet complementary. The geometry of cognitive space of conflict can benefit the adjustment of cognitive control for upcoming conflicts. For example, our brain may evaluate the similarity/distance (and thus cost) between the consecutive conflict conditions, and select the path with the best cost-benefit tradeoff to switch from one state to another. This idea is conceptually similar to a recent study by Grahek et al. (2022) demonstrating that more frequently switching states were encoded as closer together than less frequently switching states in a “drift-threshold” space.

      Nevertheless, Grahek et al. (2022) investigated how cognitive control changes based on the expected value of control theory within the same conflict, whereas our study aims to examine the organization of different conflicts.

      We have added the implications of cognitive space view in the discussion to indicate the potential values of our finding to understand the EVC account and the difference between the two theories.

      “Previous researchers have proposed an “expected value of control (EVC)” theory, which posits that the brain can evaluate the cost and benefit associated with executing control for a demanding task, such as the conflict task, and specify the optimal control strength (Shenhav et al., 2013). For instance, Grahek et al. (2022) found that more frequently switching goals when doing a Stroop task were achieved by adjusting smaller control intensity. Our work complements the EVC theory by further investigating the neural representation of different conflict conditions and how these representations can be evaluated to facilitate conflict resolution. We found that different conflict conditions can be efficiently represented in a cognitive space encoded by the right dlPFC, and participants with stronger cognitive space representation have also adjusted their conflict control to a greater extent based on the conflict similarity (Fig 4C). The finding suggests that the cognitive space organization of conflicts guides cognitive control to adjust behavior. Previous studies have shown that participants may adopt different strategies to represent a task, with the model-based strategies benefitting goal-related behaviors more than the model-free strategies (Rmus et al., 2022). Similarly, we propose that cognitive space could serve as a mental model to assist fast learning and efficient organization of cognitive control settings. Specifically, the cognitive space representation may provide a principle for how our brain evaluates the expected cost of switching and the benefit of generalization between states and selects the path with the best cost-benefit tradeoff (Abrahamse et al., 2016; Shenhav et al., 2013). The proximity between two states in cognitive space could reflect both the expected cognitive demand required to transition and the useful mechanisms to adapt from. 
The closer the two conditions are in cognitive space, the lower the expected switching cost and the higher the generalizability when transitioning between them. With the organization of a cognitive space, a new conflict can be quickly assigned a location in the cognitive space, which will facilitate the development of cognitive control settings for this conflict by interpolating nearby conflicts and/or projecting the location to axes representing different cognitive control processes, thus leading to a stronger CSE when following a more similar conflict condition. On the other hand, without a cognitive space, there would be no measure of similarity between conflicts on different trials, hence limiting the ability of fast learning of cognitive control setting from similar trials.”

      Reference:

      Grahek, I., Leng, X., Fahey, M. P., Yee, D., & Shenhav, A. (2022). Empirical and Computational Evidence for Reconfiguration Costs During Within-Task Adjustments in Cognitive Control. CogSci.

      3) Wouldn't a region that represented each conflict source separately still show the same pattern of results? The degree of Stroop vs Simon conflict is perfectly negatively correlated across conditions, so wouldn't a region that just tracks Stroop conflict show these RSA patterns? The authors show that overall congruency is not represented in DLPFC (which is surprising), but they don't break it down by whether this is due to Stroop or Simon congruency (I'm not sure their task allows for this).

      To estimate the unique contributions of the spatial Stroop and Simon conflicts, we performed a model-comparison analysis. We constructed a Stroop-Only model and a Simon-Only model, with each conflict type projected onto the Stroop (vertical) axis or Simon (horizontal) axis, respectively. The similarity between any two conflict types was defined using the Jaccard similarity index (Jaccard, 1901), that is, their intersection divided by their union. By replacing the cognitive space-based conflict similarity regressor with the Stroop-Only and Simon-Only regressors, we calculated their BICs. Results showed that the BIC was larger for Stroop-Only (5377122) and Simon-Only (5377096) than for the Cognitive-Space model (5377094). An additional Stroop+Simon model, including both Stroop-Only and Simon-Only regressors, also showed a poorer model fit (BIC = 5377118) than the Cognitive-Space model. Considering that the pattern of conflict representations is more manifest when the conflict is present (i.e., on incongruent trials) than when it is absent (i.e., on congruent trials), we also conducted the model comparison using the incongruent trials only. Results showed that the Stroop-Only (1344128), Simon-Only (1344120), and Stroop+Simon (1344157) models all had higher BIC values than the Cognitive-Space model (1344104). These results indicate that the right 8C encodes an integrated cognitive space for resolving Stroop and Simon conflicts. Therefore, we believe the cognitive space has incorporated both dimensions. We added these additional analyses and results to the revised manuscript.
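      One way to read the single-axis similarity described above is sketched below. Everything specific here is an assumption for illustration: the condition angle is measured from the Stroop (vertical) axis, the projected magnitudes are treated as interval lengths so that "intersection divided by union" reduces to the smaller magnitude over the larger, and the helper names (`jaccard`, `projection`, `single_axis_similarity`) are hypothetical.

```python
import math

def jaccard(a: float, b: float) -> float:
    """Jaccard index for two non-negative magnitudes: min / max.

    Two zero magnitudes are treated as identical (1.0); this is a convention
    chosen for the sketch, not something stated in the source.
    """
    lo, hi = min(a, b), max(a, b)
    if hi == 0.0:
        return 1.0
    return lo / hi

def projection(angle_deg: float, axis: str) -> float:
    """Project a condition onto one axis (angle measured from the Stroop axis)."""
    theta = math.radians(angle_deg)
    if axis == "stroop":
        return abs(math.cos(theta))
    if axis == "simon":
        return abs(math.sin(theta))
    raise ValueError(f"unknown axis: {axis}")

def single_axis_similarity(angle_a: float, angle_b: float, axis: str) -> float:
    """Similarity of two conditions when only one conflict axis is considered."""
    return jaccard(projection(angle_a, axis), projection(angle_b, axis))

# A pure Stroop condition (0 degrees) versus an equal mixture (45 degrees).
print(single_axis_similarity(0.0, 45.0, "stroop"))
print(single_axis_similarity(0.0, 45.0, "simon"))
```

      In this reading, a Stroop-Only regressor ignores how much Simon conflict two conditions share, and vice versa, which is why each single-axis model can be compared against the Cognitive-Space model.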

      “To examine if the right 8C specifically encodes the cognitive space rather than the domain-general or domain-specific organizations, we tested several additional models (see Methods). Model comparison showed a lower BIC in the Cognitive-Space model (BIC = 5377094) than the Domain-General (BIC = 537127) or Domain-Specific (BIC = 537127) models. Further analysis showed the dimensionality of the representation in the right 8C was 1.19, suggesting the cognitive space was close to 1D. We also tested if the observed conflict similarity effect was driven solely by spatial Stroop or Simon conflicts, and found larger BICs for the models only including the Stroop similarity (i.e., the Stroop-Only model, BIC = 5377122) or Simon similarity (i.e., the Simon-Only model, BIC = 5377096). An additional Stroop+Simon model, including both Stroop-Only and Simon-Only regressors, also showed a worse model fitting (BIC = 5377118). Moreover, we replicated the results with only incongruent trials, considering that the pattern of conflict representations is more manifested when the conflict is present (i.e., on incongruent trials) than not (i.e., on congruent trials). We found a poorer fitting in Domain-General (BIC = 1344129), Domain-Specific (BIC = 1344129), Stroop-Only (BIC = 1344128), Simon-Only (BIC = 1344120), and Stroop+Simon (BIC = 1344157) models than the Cognitive-Space model (BIC = 1344104). These results indicate that the right 8C encodes an integrated cognitive space for resolving Stroop and Simon conflicts. The more detailed model comparison results are listed in Table 2.”
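      The incongruent-trial comparison quoted above can be summarized as BIC differences relative to the best-fitting model. The sketch below only does arithmetic on the reported values; the variable names are illustrative.

```python
# BIC values for the incongruent-trial-only comparison, as reported above.
bic = {
    "Cognitive-Space": 1344104,
    "Domain-General": 1344129,
    "Domain-Specific": 1344129,
    "Stroop-Only": 1344128,
    "Simon-Only": 1344120,
    "Stroop+Simon": 1344157,
}

# Lower BIC indicates the preferred model.
best_model = min(bic, key=bic.get)

# Difference of each model from the best one (0 for the best model itself).
delta_bic = {name: value - bic[best_model] for name, value in bic.items()}

for name, delta in sorted(delta_bic.items(), key=lambda item: item[1]):
    print(f"{name:16s} delta BIC = {delta}")
```

      Expressed this way, every alternative model is at least 16 BIC units worse than the Cognitive-Space model on incongruent trials.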

      We reason that we did not observe an overall congruency effect in the RSA results because our definition of congruency here differed from traditional definitions (i.e., the contrast between incongruent and congruent conditions). In the congruency regressor of our RSA model, we defined representational similarity as 1 if calculated between two incongruent or two congruent trials, and 0 if calculated between an incongruent and a congruent trial. Thus, our congruency regressor reflects whether multivariate patterns differ between incongruent and congruent trials, rather than whether activity strengths differ. Indeed, we did observe the latter form of congruency effect, with stronger univariate activity in pre-SMA for incongruent versus congruent conditions. We have added this to Note S6 (“The multivariate representations of conflict type and orientation are different from the congruency effect”):

      “Neither did we observe a multivariate congruency effect (i.e., the pattern difference between incongruent and congruent conditions compared to that within each condition) in the right 8C or any other regions. Note the definition of congruency here differed from traditional definitions (i.e., contrast between activity strength of incongruent and congruent conditions), with which we found stronger univariate activities in pre-SMA for incongruent versus congruent conditions.”
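      The multivariate congruency regressor defined above (1 for trial pairs with the same congruency, 0 otherwise) can be sketched as a small model matrix. The four-trial example and the function name `congruency_regressor` are hypothetical.

```python
def congruency_regressor(congruent_a: bool, congruent_b: bool) -> int:
    """1 if both trials share the same congruency status, otherwise 0."""
    return 1 if congruent_a == congruent_b else 0

# Hypothetical trial list: two congruent trials followed by two incongruent ones.
trial_is_congruent = [True, True, False, False]

# Model similarity matrix for the congruency regressor.
model_rsm = [
    [congruency_regressor(a, b) for b in trial_is_congruent]
    for a in trial_is_congruent
]

for row in model_rsm:
    print(row)
```

      The regressor therefore asks whether activity patterns separate the two trial classes; it carries no information about which class evokes stronger activity, which is what the univariate contrast measures.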

      We could not determine whether the null effect of the congruency regressor was due to Stroop or Simon congruency alone, because congruency levels of the two types always covary. On all trials of the compound conditions (Conf 2-4), whenever the Stroop dimension was incongruent, the Simon dimension was also incongruent, and vice versa for the congruent condition. Thus, the contribution of spatial Stroop or Simon alone to the congruency effect could not be tested using compound conditions. Although we have pure spatial Stroop or Simon conditions, within-Stroop and within-Simon trial pairs constituted only 8% of cells in the representational similarity matrix. This was insufficient to determine whether the null congruency effect was due to solely Stroop or Simon.

      Overall, with the added analysis we found that the data in the right 8C area support conflict representations that are organized based on both Simon and spatial Stroop conflict. Although the current experimental design does not allow us to identify whether the null effect of the congruency regressor was driven by either conflict or both, we clarified that the congruency regressor did not test the conventional congruency effect and the null finding does not contradict previous research.

      Reference:

      Jaccard, P. (1901). Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bull Soc Vaudoise Sci Nat(37), 547-579.

      4) The authors use a novel form of RSA that concatenates patterns across conditions, runs and subjects into a giant RSA matrix, which is then used for linear mixed effects analysis. This appears to be necessary because conflict type and visual orientation are perfectly confounded within the subject (although, if I understand, the conflict type x congruence interaction wouldn't have the same concern about visual confounds, which shouldn't depend on congruence). This is an interesting approach but should be better justified, preferably with simulations validating the sensitivity and specificity of this method and comparing it to more standard methods.

      The confound exists for both the conflict type and the conflict type × congruence interaction in our design, since both incongruent and congruent conditions include stimuli from the full orientation space. For example, for the spatial Stroop type, the congruent condition could be either an up arrow at the top or a down arrow at the bottom. Similarly, the incongruent condition could be either an up arrow at the bottom or a down arrow at the top. Therefore, both the congruent and incongruent conditions are perfectly confounded with the orientation.

      We reanalyzed the data using the well-documented approach by Chen et al. (2017, Neuroimage), as suggested by the reviewer. The new analysis replicated our previously reported results (Fig. 4-5, S4-S7). Because Chen et al. (2017) have provided extensive simulations to validate this approach, we did not run any further simulations.

      5) A chief concern is that the same pattern contributes to many entries in the DV, which has been addressed in previous work using row-wise and column-wise random effects (Chen et al., 2017, Neuroimage). It would also be informative to know whether the results hold up to removing within-run similarity, which can bias similarity measures (Walther et al., 2016, Neuroimage).

      Thank you for the comment. In our revised manuscript, we followed your suggestion and adopted the approach proposed by Chen et al. (2017). Specifically, we included both the upper and lower triangle of the representational similarity matrix (excluding the diagonal). Moreover, we also removed all the within-subject similarity (thus also excluding the within-run similarity as suggested by Walther et al. (2016)) to minimize the bias of the potentially strong within-subject similarity. In addition, we added both the row-wise and column-wise random effects to capture the dependence of cells within each column and each row, respectively (Chen et al., 2017).

      Results from this approach largely replicated our previous results. The right 8C again showed significant conflict similarity representation, with greater representational strength in incongruent than congruent condition, and positively correlated to behavioral performance. The orientation effect was also identified in the visual (e.g., right V1) and oculomotor (e.g., left FEF) regions.

      We have revised the methodology and the results in the revised manuscript:

      "Representational similarity analysis (RSA).

      For each cortical region, we calculated the Pearson’s correlations between fMRI activity patterns for each run and each subject, yielding a 1400 (20 conditions × 2 runs × 35 participants) × 1400 RSM. The correlations were calculated in a cross-voxel manner using the fMRI activation maps obtained from GLM3 described in the previous section. We excluded within-subject cells from the RSM (thus also excluding the within-run similarity as suggested by Walther et al. (2016)), and the remaining cells were converted into a vector, which was then z-transformed and submitted to a linear mixed effect model as the dependent variable. The linear mixed effect model also included regressors of conflict similarity and orientation similarity. Importantly, conflict similarity was based on how Simon and spatial Stroop conflict are combined and hence was calculated by first rotating all subjects’ stimulus locations to the top-right and bottom-left quadrants, whereas orientation was calculated using original stimulus locations. As a result, the regressors representing conflict similarity and orientation similarity were de-correlated. Similarity between two conditions was measured as the cosine value of the angular difference. Other regressors included a target similarity regressor (i.e., whether the arrow directions were identical), a response similarity regressor (i.e., whether the correct responses were identical), a spatial Stroop distractor regressor (i.e., vertical distance between two stimulus locations), and a Simon distractor regressor (i.e., horizontal distance between two stimulus locations). Additionally, we also included a regressor denoting the similarity of Group (i.e., whether two conditions are within the same subject group, according to the stimulus-response mapping). We also added two regressors including ROI-mean fMRI activations for each condition of the pair to remove the possible uni-voxel influence on the RSM. A last term was the intercept. 
To control the artefact due to dependence of the correlation pairs sharing the same subject, we included crossed random effects (i.e., row-wise and column-wise random effects) for the intercept, conflict similarity, orientation and the group factors (G. Chen et al., 2017)."
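      The cell-selection step described above can be sketched in plain Python under stated assumptions. The toy patterns, the function names, and the reading of "z-transformed" as standardization (z-scoring; a Fisher transform would be another reading) are all assumptions for illustration. The mixed-effects model with crossed random effects is not reproduced here.

```python
import math
from itertools import product

def pearson(x, y):
    """Pearson correlation between two equally long activity patterns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cross_subject_cells(patterns, subjects):
    """Return (row, col, r) for every RSM cell that pairs two different subjects.

    Both triangles are kept; skipping same-subject pairs removes the diagonal
    and all within-subject (hence within-run) cells.
    """
    cells = []
    for i, j in product(range(len(patterns)), repeat=2):
        if subjects[i] == subjects[j]:
            continue
        cells.append((i, j, pearson(patterns[i], patterns[j])))
    return cells

def zscore(values):
    """Standardize a list of values (sample standard deviation)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def n_cross_subject_cells(n_subjects, patterns_per_subject):
    """Number of cells left after removing all within-subject blocks."""
    total = (n_subjects * patterns_per_subject) ** 2
    return total - n_subjects * patterns_per_subject ** 2

# Toy data: two subjects with two patterns each (values are made up).
patterns = [[1, 2, 3, 4], [2, 1, 4, 3], [1, 3, 2, 4], [4, 3, 2, 1]]
subjects = ["s1", "s1", "s2", "s2"]

cells = cross_subject_cells(patterns, subjects)
dependent_variable = zscore([r for _, _, r in cells])

# With 35 subjects and 40 patterns each (20 conditions x 2 runs), as in the quoted methods:
print(n_cross_subject_cells(35, 40))
```

      The row and column indices retained for each cell are what a row-wise and column-wise random-effects structure would be keyed on.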

      Reference:

      Walther, A., Nili, H., Ejaz, N., Alink, A., Kriegeskorte, N., & Diedrichsen, J. (2016). Reliability of dissimilarity measures for multi-voxel pattern analysis. Neuroimage, 137, 188-200. doi:10.1016/j.neuroimage.2015.12.012

      6) Another concern is the extent to which across-subject similarity will only capture consistent patterns across people, making this analysis very similar to a traditional univariate analysis (and unlike the traditional use of RSA to capture subject-specific patterns).

      With proper normalization, we assume that voxels across different subjects should show some consistent localization, although individual differences can be high. J. Chen et al. (2017) have demonstrated that consistent multi-voxel activation patterns exist across individuals. Previous studies have also successfully applied cross-subject RSA (see review by Freund et al., 2021) and cross-subject decoding approaches (e.g., Jiang et al., 2016; Tusche et al., 2016), so we believe cross-subject RSA can capture distributed activation patterns shared at the group level. We added this argument in the revised manuscript:

      "Previous studies (e.g., J. Chen et al., 2017) have demonstrated that consistent multivoxel activation patterns exist across individuals, and successful applications of cross-subject RSA (see review by Freund, Etzel, et al., 2021) and cross-subject decoding approaches (Jiang et al., 2016; Tusche et al., 2016) have also been reported."

      In the revised manuscript, we also tested whether the representation in right 8C held for within-subject data. We reasoned that the conflict similarity effects identified by cross-subject RSA should be replicable in within-subject data, although the latter is not able to dissociate the conflict similarity effect from the orientation effect. We performed similar RSA for within-subject RSMs, excluding the within-run cells. We replaced the perfectly confounded factors of conflict similarity and orientation with a common factor called similarity_orientation. Other confounding factor pairs were addressed similarly. Results showed a significant effect of similarity_orientation, t(13993) = 3.270, p = .0005, 1-tailed. Given the specific representation of conflict similarity identified by the cross-subject RSA, we believe that the within-subject data of right 8C probably showed similar conflict similarity modulation effects as the cross-subject data, although future research that orthogonalizes conflict type and orientation is needed to fully answer this question. We added this result in the revised section Note S7.

      "Note S7. The cross-subject RSA captures similar effects with the within-subject RSA

      Considering the variability in voxel-level functional localizations among individuals, one may question whether the cross-subject RSA results were biased by the consistent multi-voxel patterns across subjects, distinct from the more commonly utilized within-subject RSA. We reasoned that the cross-subject RSA should have captured similar effects as the within-subject RSA if we observe the conflict similarity effect in right 8C with the latter analysis. Therefore, we tested whether the representation in right 8C held for within-subject data. Specifically, we performed similar RSA for within-subject RSMs, excluding the within-run cells. We replaced the perfectly confounded factors of conflict similarity and orientation with a common factor called similarity_orientation. Other confounding factor pairs (i.e., target versus response, and Stroop distractor versus Simon distractor) were addressed similarly. Results showed a significant effect of similarity_orientation, t(13993) = 3.270, p = .0005, 1-tailed. Given the specific representation of conflict similarity identified by the cross-subject RSA, the within-subject data of right 8C may show similar conflict similarity modulation effects as the cross-subject data. Further research is needed to fully dissociate the representation of conflict and the representation of visual features such as orientation."

      Reference:

      Chen, J., Leong, Y. C., Honey, C. J., Yong, C. H., Norman, K. A., & Hasson, U. (2017). Shared memories reveal shared structure in neural activity across individuals. Nature Neuroscience, 20(1), 115-125.

      Freund, M. C., Etzel, J. A., & Braver, T. S. (2021). Neural Coding of Cognitive Control: The Representational Similarity Analysis Approach. Trends in Cognitive Sciences, 25(7), 622-638.

      Jiang, J., Summerfield, C., & Egner, T. (2016). Visual Prediction Error Spreads Across Object Features in Human Visual Cortex. J Neurosci, 36(50), 12746-12763.

      Tusche, A., Bockler, A., Kanske, P., Trautwein, F. M., & Singer, T. (2016). Decoding the Charitable Brain: Empathy, Perspective Taking, and Attention Shifts Differentially Predict Altruistic Giving. Journal of Neuroscience, 36(17), 4719-4732.

      7) Finally, the authors should confirm all their results are robust to less liberal methods of multiplicity correction. For univariate analysis, they should report the effects from the standard p < .001 cluster forming threshold for univariate analysis (or TFCE). For multivariate analyses, FDR can be quite liberal. The authors should consider whether their mixed-effects analyses allow for group-level randomization, and consider (relatively powerful) Max-Stat randomization tests (Nichols & Holmes, 2002, Hum Brain Mapp).

      In our revised manuscript, we have corrected the univariate results using the probabilistic TFCE (pTFCE) approach by Spisák et al. (2019). This approach estimates the conditional probability of cluster extent based on Bayes’ rule. Specifically, we applied pTFCE to our univariate results (i.e., the z-maps of our contrasts). This returned enhanced Z-score maps, which were then thresholded based on simulated cluster size thresholds using 3dClustSim. A cluster-forming threshold of p < .001 was employed. Results showed that only the pre-SMA was activated in the incongruent > congruent contrast, and that the right IPS and right dmPFC were activated in the linear Simon modulation effect. Further tests also showed that activation in these regions was not correlated with behavioral performance, uncorrected ps > .28. These results largely replicated our previous results. We have revised the methods and results accordingly.

      Methods:

      "Results were corrected with the probabilistic threshold-free cluster enhancement (pTFCE) and then thresholded by the 3dClustSim function in AFNI (Cox & Hyde, 1997) with voxel-wise p < .001 and cluster-wise p < .05, both 1-tailed."

      Results:

      "In the fMRI analysis, we first replicated the classic congruency effect by searching for brain regions showing higher univariate activation in incongruent than congruent conditions (GLM1, see Methods). Consistent with the literature (Botvinick et al., 2004; Fu et al., 2022), this effect was observed in the pre-supplementary motor area (preSMA) (Fig. 3, Table S1). We then tested the encoding of conflict type as a cognitive space by identifying brain regions with activation levels parametrically covarying with the coordinates (i.e., axial angle relative to the horizontal axis) in the hypothesized cognitive space. As shown in Fig. 1B, change in the angle corresponds to change in spatial Stroop and Simon conflicts in opposite directions. Accordingly, we found the right inferior parietal sulcus (IPS) and the right dorsomedial prefrontal cortex (dmPFC) displayed positive correlation between fMRI activation and the Simon conflict (Fig. 3, Fig. S3, Table S1)."

      We appreciate the reviewer’s suggestion to apply the Max-Stat randomization tests (Nichols & Holmes, 2002) for the multivariate analyses. However, the representational similarity matrix was too large (1400×1400) to be tested with a balanced randomization approach (i.e., the Max-Stat), for two reasons: (1) running even 1000 randomizations for all ROIs would take a very long time; and (2) the distribution generated from a typical number of randomizations (e.g., 5000 iterations) would probably be unbalanced, since the full range of possible samples that could be generated by a complete randomization would not be adequately represented. Instead, we adopted a very strict Bonferroni correction, p < 0.0001/360, when reporting the regression results from RSA. Notably, Chen et al. (2017) have shown that their approach can control the FDR at an acceptable level.
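
      For readers who want the corrected threshold as a number, the arithmetic can be sketched as follows. This is only an illustration: the value 360 is taken from the text and is assumed here to be the number of parcels tested, and the helper name `survives` is hypothetical.

```python
# Bonferroni-style threshold quoted in the response: 0.0001 divided by 360 tests.
alpha = 0.0001   # uncorrected significance level stated in the text
n_tests = 360    # number of tests (assumed here to be the number of parcels)
threshold = alpha / n_tests


def survives(p_value: float) -> bool:
    """Return True if a p-value passes the corrected threshold (hypothetical helper)."""
    return p_value < threshold


print(f"{threshold:.3e}")  # prints 2.778e-07
```

      Under this threshold, a parcel-level effect is reported only if its p-value is below roughly 2.8 × 10⁻⁷.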

      Reference:

      Spisák, T., Spisák, Z., Zunhammer, M., Bingel, U., Smith, S., Nichols, T., & Kincses, T. (2019). Probabilistic TFCE: A generalized combination of cluster size and voxel intensity to increase statistical power. NeuroImage, 185, 12-26.

      Chen, G., Taylor, P. A., Shin, Y.-W., Reynolds, R. C., & Cox, R. W. (2017). Untangling the relatedness among correlations, Part II: Inter-subject correlation group analysis through linear mixed-effects modeling. NeuroImage, 147, 825-840.

      Minor concerns:

      8) I appreciate the authors wanting to present the conditions in a theory-agnostic way, but the framing of 5 conflict types was confusing. I think framing the conditions as a mixture of 2 conflict types (Stroop and Simon) makes more sense, especially given the previous work on MSIT.

      We have renamed Type 1-5 as the spatial Stroop, StHSmL, StMSmM, StLSmH, and Simon conditions, respectively. H, L, and M indicate high, low, and medium similarity with the corresponding conflict, respectively. This is also consistent with the naming in our previous work (Yang et al., 2021).

      Reference:

      Yang, G., Xu, H., Li, Z., Nan, W., Wu, H., Li, Q., & Liu, X. (2021). The congruency sequence effect is modulated by the similarity of conflicts. Journal of Experimental Psychology: Learning, Memory, and Cognition, 47(10), 1705-1719.

      9) It would be helpful to have more scaffolding for the key conflict & orientation analyses. A schematic in the main text that outlines these contrasts would be very helpful (e.g. similar to S4).

      We have inserted Figure 7 in the revised manuscript. In this figure, we plotted the schematic of the difference between the conflict similarity and orientation regressors according to their cross-group representational similarity matrices.

      10) Figure 4D could be clearer, both in labeling and figure caption. 'Modeled similarity' could be relabelled to something more informative, like 'conflict type (or mixture) similarity'. Alternatively, it would be helpful to show a summary RDM for region r-8C. For example, breaking it down by just conflict type and congruence.

      We have relabeled the x-axis to “Conflict type similarity” and y-axis to “Neural similarity” for Figure 4D in the revised manuscript.

      We have also added a summary RSM figure in Fig. S5 to show the different similarity patterns between incongruent and congruent conditions.

      11) It may be helpful to connect your work to how people have discussed multiple forms of conflict monitoring and control with respect to target and distractor features e.g., Lindsay & Jacoby, 1994, JEP:HPP; Mante, Sussillo et al., 2013, Nature; Soutschek et al., 2015, JoCN; Jackson et al., 2021, Comm Bio; Ritz & Shenhav, 2022, bioRxiv

      We have added an analysis to examine how cognitive control modulates target and distractor representation. To this end, we selected the left V4, a visual region showing joint representation of target, Stroop distractor and Simon distractor, as the region of interest. We tested whether these representation strengths differed between incongruent and congruent conditions, finding the representation of target was stronger and representations of both distractors were weaker in the incongruent condition. This suggests that cognitive control modulates the stimuli in both directions. We added the results in Note S10 and Fig. S8, and also added discussion of it in “Methodological implications”.

      “Note S10. Cognitive control enhances target representation and suppresses distractor representation

      Using the separability of confounding factors afforded by the cross-subject RSA, we examined how representations of targets and distractors are modulated by cognitive control. The key assumption is that exerting cognitive control may enhance target representation and suppress distractor representation. We hypothesized that stimuli are represented in visual areas, so we chose a visual ROI from the main RSA results showing joint representation of target, spatial Stroop distractor and Simon distractor (p < .005, 1-tailed, uncorrected). Only the left V4 met this criterion. We then tested representations with models similar to the main text for incongruent only trials, congruent only trials, and the incongruent – congruent contrast. The contrast model additionally used interaction between the congruency and target, Stroop distractor and Simon distractor terms. Results showed that in the incongruent condition, when we employ more cognitive control, the target representation was enhanced (t(237990) = 2.59, p = .029, Bonferroni corrected) and both spatial Stroop (t(237990) = –4.18, p < .001, Bonferroni corrected) and Simon (t(237990) = –3.14, p = .005, Bonferroni corrected) distractor representations were suppressed (Fig. S8). These are consistent with the idea that the top-down control modulates the stimuli in both directions (Polk et al., 2008; Ritz & Shenhav, 2022).”

      Discussion:

      “Moreover, the cross-subject RSA provides high sensitivity to the variables of interest and the ability to separate confounding factors. For instance, in addition to dissociating conflict type from orientation, we dissociated target from response, and spatial Stroop distractor from Simon distractor. We further showed cognitive control can both enhance the target representation and suppress the distractor representation (Note S10, Fig. S8), which is in line with previous studies (Polk et al., 2008; Ritz & Shenhav, 2022)."

      12) For future work, I would recommend placing stimuli along the whole circumference, to orthogonalize Stroop and Simon conflict within-subject.

      We thank the reviewer for this highly helpful suggestion. Expanding the conflict conditions to a full conflict space and replicating our current results could provide stronger evidence for the cognitive space view.

      In the revised manuscript, we added this as a possible future design:

      “A possible improvement to our current design would be to include left, right, up, and down arrows presented in a grid formation across four spatially separate quadrants, with each arrow mapped to its own response button. However, one potential confounding factor would be that these conditions have different levels of difficulty (i.e., different magnitude of conflict), which may affect the CSE results and their representational similarity."

      Reviewer #2:

      Summary, general appraisal

      This study examines the construct of "cognitive spaces" as they relate to neural coding schemes present in response conflict tasks. The authors utilize a novel paradigm, in which subjects must map the direction of a vertically oriented arrow to either a left or right response. Different types of conflict (spatial Stroop, Simon) are parametrically manipulated by varying the spatial location of the arrow (a task-irrelevant feature). The vertical eccentricity of the arrow either agrees or conflicts with the arrow's direction (spatial Stroop), while the horizontal eccentricity of the arrow agrees or conflicts with the side of the response (Simon). A neural coding model is postulated in which the stimuli are embedded in a cognitive space, organized by distances that depend only on the similarity of congruency types (i.e., where conditions with similar relative proportions of spatial-Stroop versus Simon congruency are represented with similar activity patterns). The authors conduct a behavioral and fMRI study to provide evidence for such a representational coding scheme. The behavioral findings replicate the authors' prior work in demonstrating that conflict-related cognitive control adjustments (the congruency sequence effect) show strong modulation as a function of the similarity between conflict types. With the fMRI neural activity data, the authors report univariate analyses that identified activation in left prefrontal and dorsomedial frontal cortex modulated by the amount of Stroop or Simon conflict present, and multivariate representational similarity analyses (RSA) that identified right lateral prefrontal activity encoding conflict similarity and correlated with the behavioral effects of conflict similarity.

      This study tackles an important question regarding how distinct types of conflict, which have been previously shown to elicit independent forms of cognitive control adjustments, might be encoded in the brain within a computationally efficient representational format. The ideas postulated by the authors are interesting ones and the utilized methods are rigorous.

      We would like to express our sincere appreciation for the reviewer’s positive evaluation of our manuscript and the constructive comments and suggestions. Through careful consideration of your feedback, we have endeavored to make our manuscript more accessible to readers and to further strengthen our findings. In response to your suggestion, we reanalyzed our data with the approach proposed by Chen et al. (2017, NeuroImage). This reanalysis largely replicated our previous results, reinforcing the validity of our findings. Additionally, we conducted tests with several alternative models and found that the cognitive space hypothesis best aligns with our observed data. We have incorporated these revisions and additional analyses into the manuscript, and we believe they have significantly enhanced its quality. We have provided detailed responses to your comments below.

      However, the study has critical limitations that are due to a lack of clarity regarding theoretical hypotheses, serious confounds in the experimental design, and a highly non-standard (and problematic) approach to RSA. Without addressing these issues it is hard to evaluate the contribution of the authors findings to the computational cognitive neuroscience literature.

      1) The primary theoretical question and its implications are unclear. The paper would greatly benefit from more clearly specifying potential alternative hypotheses and discussing their implications. Consider, for example, the case of parallel conflict monitors. Say that these conflict monitors are separately tuned for Stroop and Simon conflict, and are located within adjacent patches of cortex that are both contained within a single cortical parcel (e.g., as defined by the Glasser atlas used by the authors for analyses). If RSA was conducted on the responses of such a parcel to this task, it seems highly likely that an activation similarity matrix would be observed that is quite similar (if not identical) to the hypothesized one displayed in Figure 1. Yet it would seem like the authors are arguing that the "cognitive space" representation is qualitatively and conceptually distinct from the "parallel monitor" coding scheme. Thus, it seems that the task and analytic approach is not sufficient to disambiguate these different types of coding schemes or neural architectures.

      The authors also discuss a fully domain-general conflict monitor, in which different forms of conflict are encoded within a single dimension. Yet this alternative hypothesis is also not explicitly tested nor discussed in detail. It seems that the experiment was designed to orthogonalize the "domain-general" model from the "cognitive space" model, by attempting to keep the overall conflict uniform across the different stimuli (i.e., in the design, the level of Stroop congruency parametrically trades off with the level of Simon congruency). But in the behavioral results (Fig. S1), the interference effects were found to peak when both Stroop and Simon congruency are present (i.e., Conf 3 and 4), suggesting that the "domain-general" model may not be orthogonal to the "cognitive space" model. One of the key advantages of RSA is that it provides the ability to explicitly formulate, test and compare different coding models to determine which best accounts for the pattern of data. Thus, it would seem critical for the authors to set up the design and analyses so that an explicit model comparison analysis could be conducted, contrasting the domain-general, domain-specific, and cognitive space accounts.

      We appreciate the reviewer pointing out the need to formally test alternative models. In the revised manuscript, we have added and compared a few alternative models, finding the Cognitive-Space model (the one with graded conflict similarity levels as we reported) provided the best fit to our data. Specifically, we tested the following five models against the Cognitive-Space model:

      (1) Domain-General model. This model treats each conflict type as equivalent, so any two conflict types differ only in the magnitude of their conflict. Therefore, we defined the domain-general matrix as the difference in their congruency effects indexed by the group-averaged RT in Experiment 2. Then the z-scored model vector was sign-flipped to reflect similarity instead of distance. This model showed non-significant conflict type effects (t(951989) = 0.92, p = .179) and poorer fit (BIC = 5377126) than the Cognitive-Space model (BIC = 5377094).

      (2) Domain-Specific model. This model treats each conflict type differently, so we used a diagonal matrix, with within-conflict type similarities being 1 and all cross-conflict type similarities being 0. This model also showed non-significant effects (t(951989) = 0.84, p = .201) and poorer fit (BIC = 5377127) than the Cognitive-Space model.

      (3) Stroop-Only model. This model assumes that the right 8C only encodes the spatial Stroop conflict. We projected each conflict type to the Stroop (vertical) axis and calculated the similarity between any two conflict types as the Jaccard similarity index (Jaccard, 1901), that is, their intersection divided by their union. This model also showed non-significant effects (t(951989) = 0.20, p = .423) and poorer fit (BIC = 5377122) than the Cognitive-Space model.

      (4) Simon-Only model. This model assumes that the right 8C only encodes the Simon conflict. We projected each conflict type to the Simon (horizontal) axis and calculated the similarity like the Stroop-Only model. This model showed significant effects (t(951989) = 4.19, p < .001) but still quantitatively poorer fit (BIC = 5377096) than the Cognitive-Space model.

      (5) Stroop+Simon model. This model assumes the spatial Stroop and Simon conflicts are encoded in parallel in the brain, similar to the "parallel monitor" hypothesis suggested by the reviewer. It includes both Stroop-Only and Simon-Only regressors. This model showed a non-significant effect for the Stroop regressor (t(951988) = 0.06, p = .478) and a significant effect for the Simon regressor (t(951988) = 3.30, p < .001), but poorer fit (BIC = 5377118) than the Cognitive-Space model.

      “Moreover, we replicated these results with only incongruent trials (i.e., when conflict is present), considering that the pattern of conflict representations is more manifested when the conflict is present (i.e., on incongruent trials) than not (i.e., on congruent trials). We found a poorer fit in the Domain-General (BIC = 1344129), Domain-Specific (BIC = 1344129), Stroop-Only (BIC = 1344128), Simon-Only (BIC = 1344120), and Stroop+Simon (BIC = 1344157) models than in the Cognitive-Space model (BIC = 1344104).”
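
      The model comparison above can be summarized as BIC differences relative to the Cognitive-Space model. The short sketch below only tabulates the BIC values quoted in this response (lower BIC means better fit); the function name `delta_bic` is a hypothetical helper, not part of the authors' analysis code.

```python
# BIC values as quoted in the response for right 8C (all trials).
bic_all = {
    "Cognitive-Space": 5377094,
    "Domain-General": 5377126,
    "Domain-Specific": 5377127,
    "Stroop-Only": 5377122,
    "Simon-Only": 5377096,
    "Stroop+Simon": 5377118,
}

# BIC values as quoted for the incongruent-only replication.
bic_incongruent = {
    "Cognitive-Space": 1344104,
    "Domain-General": 1344129,
    "Domain-Specific": 1344129,
    "Stroop-Only": 1344128,
    "Simon-Only": 1344120,
    "Stroop+Simon": 1344157,
}


def delta_bic(bics, reference="Cognitive-Space"):
    """BIC of each alternative minus the reference BIC (positive = worse fit)."""
    ref = bics[reference]
    return {name: value - ref for name, value in bics.items() if name != reference}


delta_all = delta_bic(bic_all)
delta_inc = delta_bic(bic_incongruent)

for name in delta_all:
    print(f"{name:16s} all trials: +{delta_all[name]:2d}   incongruent only: +{delta_inc[name]:2d}")
```

      Every alternative has a higher BIC than the Cognitive-Space model in both data sets. The margin for the Simon-Only model is the smallest with all trials (2) and larger with incongruent trials only (16).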

      In summary, these results indicate that the right 8C encodes an integrated cognitive space for resolving Stroop and Simon conflicts. We added the above results to the revised manuscript.

      The above analysis approach was added to the method “Model comparison and representational dimensionality”, and the results were added to the “Multivariate patterns of the right dlPFC encodes the conflict similarity” in the revised manuscript.

      Methods:

      “Model comparison and representational dimensionality

      To estimate if the right 8C specifically encodes the cognitive space, rather than the domain-general or domain-specific structures, we conducted two more RSAs. We replaced the cognitive space-based conflict similarity matrix in the RSA we reported above (hereafter referred to as the Cognitive-Space model) with one of the alternative model matrices, with all other regressors equal. The domain-general model treats each conflict type as equivalent, so any two conflict types differ only in the magnitude of their conflict. Therefore, we defined the domain-general matrix as the difference in their congruency effects indexed by the group-averaged RT in Experiment 2. Then the z-scored model vector was sign-flipped to reflect similarity instead of distance. The domain-specific model treats each conflict type differently, so we used a diagonal matrix, with within-conflict type similarities being 1 and all cross-conflict type similarities being 0.

      Moreover, to examine if the cognitive space is driven solely by the Stroop or Simon conflicts, we tested a spatial Stroop-Only (hereafter referred to as “Stroop-Only”) and a Simon-Only model, with each conflict type projected onto the spatial Stroop (vertical) axis or Simon (horizontal) axis, respectively. The similarity between any two conflict types was defined using the Jaccard similarity index (Jaccard, 1901), that is, their intersection divided by their union. We also included a model assuming the Stroop and Simon dimensions are independently represented in the brain, adding up the Stroop-Only and Simon-Only regressors (hereafter referred to as the Stroop+Simon model). We conducted similar RSAs as reported above, replacing the original conflict similarity regressor with the Stroop-Only, Simon-Only, or both regressors (for the Stroop+Simon model), and then calculated their Bayesian information criteria (BICs).”
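
      To make the alternative model matrices more concrete, the sketch below builds one possible version of each at the level of the five conflict types. Several details are assumptions made for illustration and are not specified in the text: the angles assigned to the five conflict types, the congruency-effect values (made-up placeholders), the reading of the Jaccard index as the overlap of the segments from 0 to each projection, and the treatment of two zero projections as identical.

```python
import numpy as np

# Angles of the five conflict types from the horizontal (Simon) axis, in degrees.
# Assumed order: spatial Stroop, StHSmL, StMSmM, StLSmH, Simon.
angles = np.deg2rad([90.0, 67.5, 45.0, 22.5, 0.0])
n = len(angles)

# Domain-Specific model: 1 within a conflict type, 0 across conflict types.
domain_specific = np.eye(n)

# Domain-General model: distance is the absolute difference in congruency effect.
# The vector of unique pairs is z-scored, then sign-flipped so that larger values
# mean more similar. These effect sizes are placeholders, not the study's RTs.
effects_ms = np.array([30.0, 45.0, 50.0, 48.0, 35.0])
iu = np.triu_indices(n, k=1)
dist = np.abs(effects_ms[:, None] - effects_ms[None, :])[iu]
domain_general = -(dist - dist.mean()) / dist.std()


def jaccard(a, b):
    """Jaccard index of the segments [0, a] and [0, b]: overlap divided by union."""
    hi = max(a, b)
    return 1.0 if hi == 0 else min(a, b) / hi  # 0/0 treated as identical (assumption)


def projection_model(proj):
    """Pairwise Jaccard similarity between the projections of all conflict types."""
    return np.array([[jaccard(a, b) for b in proj] for a in proj])


# Rounding removes floating-point residue such as cos(90 deg) = 6e-17.
stroop_only = projection_model(np.round(np.sin(angles), 12))  # vertical axis
simon_only = projection_model(np.round(np.cos(angles), 12))   # horizontal axis

print(np.round(stroop_only, 3))
print(np.round(simon_only, 3))
```

      In an actual RSA these 5 × 5 type-level matrices would be expanded to the full condition-by-condition layout before entering the regression; that expansion is omitted here.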

      Results:

      “To examine if the right 8C specifically encodes the cognitive space rather than the domain-general or domain-specific organizations, we tested several additional models (see Methods). Model comparison showed a lower BIC in the Cognitive-Space model (BIC = 5377094) than the Domain-General (BIC = 5377126) or Domain-Specific (BIC = 5377127) models. Further analysis showed the dimensionality of the representation in the right 8C was 1.19, suggesting the cognitive space was close to 1D. We also tested if the observed conflict similarity effect was driven solely by spatial Stroop or Simon conflicts, and found larger BICs for the models only including the Stroop similarity (i.e., the Stroop-Only model, BIC = 5377122) or Simon similarity (i.e., the Simon-Only model, BIC = 5377096). An additional Stroop+Simon model, including both Stroop-Only and Simon-Only regressors, also showed a worse model fit (BIC = 5377118). Moreover, we replicated the results with only incongruent trials, considering that the pattern of conflict representations is more manifested when the conflict is present (i.e., on incongruent trials) than not (i.e., on congruent trials). We found a poorer fit in the Domain-General (BIC = 1344129), Domain-Specific (BIC = 1344129), Stroop-Only (BIC = 1344128), Simon-Only (BIC = 1344120), and Stroop+Simon (BIC = 1344157) models than in the Cognitive-Space model (BIC = 1344104). These results indicate that the right 8C encodes an integrated cognitive space for resolving Stroop and Simon conflicts. The more detailed model comparison results are listed in Table 2.”

      Reference:

      Jaccard, P. (1901). Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bull Soc Vaudoise Sci Nat(37), 547-579.

      2a) Relatedly, the reasoning for the use of the term "cognitive space" is unclear. The mere presence of graded coding for two types of conflict seems to be a low bar for referring to neural activity patterns as encoding a "cognitive space". It is discussed that cognitive spaces/maps allow for flexibility through inference and generalization. But no links were made between these cognitive abilities and the observed representational structure.

      In the revised manuscript, we have clarified that we tested a specific prediction of the cognitive space hypothesis: the geometry of the cognitive space predicts that more similar conflict types will have more similar neural representations, leading to the CSE and RSA patterns tested in this study. These results add to the literature by providing empirical evidence on how different conflict types are encoded in the brain. We agree that this study is not a comprehensive test of the cognitive space hypothesis. Thus, in the revised manuscript we explicitly clarified that this study is a test of the geometry of the cognitive space hypothesis.

      Critically, the cognitive space view holds that the representations of different abstract information are organized continuously and that the representational geometry in the cognitive space is determined by the similarity among the represented information (Bellmund et al., 2018).

      "The present study aimed to test the geometry of cognitive space in conflict representation. Specifically, we hypothesize that different types of conflict are represented as points in a cognitive space. Importantly, the distance between the points, which reflects the geometry of the cognitive space, scales with the difference in the sources of the conflicts being represented by the points."

      We have also discussed the limitation of the results and stressed the need for more research to fully test the cognitive space hypothesis.

      “Additionally, our study is not a comprehensive test of the cognitive space hypothesis but aimed primarily to provide original evidence for the geometry of cognitive space in representing conflict information in cognitive control. Future research should examine other aspects of the cognitive space such as its dimensionality, its applicability to other conflict tasks such as the Eriksen Flanker task, and its relevance to other cognitive abilities, such as cognitive flexibility and learning.”

      2b) Additionally, no explicit tests of generality (e.g., via cross-condition generalization) were provided.

      To examine the generality of cognitive space across conditions, we conducted a leave-one-out prediction analysis. We used the behavioral data from Experiment 1 for this test, because it contains more data than Experiment 2. Specifically, we removed data from one of the five similarity levels (as illustrated by the θs in Fig. 1C) and used the remaining data to perform the same mixed-effect model as reported in the main text (i.e., the two-stage analysis). This yielded one pair of beta coefficients including the similarity regressor and the intercept for each subject, with which we predicted the CSE for the removed similarity level for each subject. We repeated this process for each similarity level once. The predicted results were highly correlated with the original data, with r = .87 for the RT and r = .84 for the ER, ps < .001. We have added this analysis and result to the “Conflict type similarity modulated behavioral congruency sequence effect (CSE)” section.

      “Moreover, to test the continuity and generalizability of the similarity modulation, we conducted a leave-one-out prediction analysis. Specifically, we removed data from one of the five similarity levels (as illustrated by the θs in Fig. 1C) and used the remaining data to perform the same mixed-effect model (i.e., the two-stage analysis). This yielded one pair of beta coefficients including the similarity regressor and the intercept for each subject, with which we predicted the CSE for the removed similarity level for each subject. We repeated this process for each similarity level once. The predicted results were highly correlated with the original data, with r = .87 for the RT and r = .84 for the ER, ps < .001."
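
      The leave-one-out logic can be illustrated with a simplified sketch. It runs on synthetic data and replaces the mixed-effect model (two-stage analysis) with a plain per-subject least-squares fit, so it shows only the hold-out, predict, and correlate steps. The cosine coding of similarity, the sample size, and all effect sizes are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Five similarity levels; cos(theta) is used here as an assumed similarity coding.
theta = np.deg2rad([90.0, 67.5, 45.0, 22.5, 0.0])
similarity = np.cos(theta)
n_levels = similarity.size

# Synthetic per-subject CSE values (ms): linear in similarity plus noise.
n_subj = 30
intercepts = rng.normal(10.0, 5.0, n_subj)
slopes = rng.normal(40.0, 10.0, n_subj)
noise = rng.normal(0.0, 3.0, (n_subj, n_levels))
cse = intercepts[:, None] + slopes[:, None] * similarity[None, :] + noise

predicted = np.empty_like(cse)
for held_out in range(n_levels):
    keep = np.arange(n_levels) != held_out
    for s in range(n_subj):
        # Per-subject slope and intercept estimated from the four remaining levels.
        slope, intercept = np.polyfit(similarity[keep], cse[s, keep], 1)
        # Predict the CSE at the similarity level that was left out.
        predicted[s, held_out] = intercept + slope * similarity[held_out]

# Correlate predictions with the observed values across subjects and levels.
r = np.corrcoef(predicted.ravel(), cse.ravel())[0, 1]
print(f"r = {r:.2f}")
```

      Because each prediction for a level uses only data from the other four levels, a high correlation indicates that the similarity modulation generalizes to conditions the model has not seen.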

      2c) Finally, although the design elicits strong CSE effects, it seems somewhat awkward to consider CSE behavioral patterns as a reflection of the kind of abilities supported by a cognitive map (if this is indeed the implication that was intended). In fact, CSE effects are well-modeled by simpler "model-free" associative learning processes, that do not require elaborate representations of abstract structures.

      We argue that the conflict similarity modulation of CSEs we observed cannot be explained by the “model-free” stimulus-driven associative learning process. This mainly refers to the feature integration account proposed by Hommel et al. (2004), which explains poorer performance in CI and IC trials (compared with CC and II trials) with the partial repetition cost caused by the breaking of stimulus-response binding. Although we cannot remove its influence on the within-type trials (similarity level 5, θ = 0), it should not affect the cross-type trials (similarity levels 1-4, θ = 90°, 67.5°, 45° and 22.5°, respectively), because the CC, CI, IC, and II trials had equal probabilities of partially repeated and fully switched trials (see Author response image 1 for an example of trials across the Conf 1 and Conf 3 conditions). Thus, feature integration cannot explain the gradual CSE change from similarity level 1 to 4, which suffices to reproduce the full effect, as suggested by the leave-one-out prediction analysis mentioned above. We thus conclude that the similarity modulation of CSE cannot be explained by stimulus-driven associative learning.

      Author response image 1.

      Notably, however, our findings are aligned with an associative learning account of cognitive control (Abrahamse et al., 2016), which extends associative learning from the stimulus/response level to cognitive control. In other words, abstract cognitive control states can be learned and generalized like other sensorimotor features. This view explicitly proposes that “transfer occurs to the extent that two tasks overlap”, a hypothesis directly supported by our CSE results (see also Yang et al., 2021). Extending this, our fMRI results provide the neural basis of how cognitive control can generalize through a representation of cognitive space. The cognitive space view complements the associative learning account by providing a fundamental principle for the learning and generalization of control states. Given the widespread application of the CSE as an indicator of cognitive control generalization (Braem et al., 2014), we believe that it can be recognized as a kind of ability supported by the cognitive space. This was further supported by the brain-behavioral correlation: stronger encoding of cognitive space was associated with greater bias of trial-wise behavioral adjustment by the consecutive conflict similarity.

      We have incorporated these ideas into the discussion:

      “Similarly, we propose that cognitive space could serve as a mental model to assist fast learning and efficient organization of cognitive control settings. Specifically, the cognitive space representation may provide a principle for how our brain evaluates the expected cost of switching and the benefit of generalization between states and selects the path with the best cost-benefit tradeoff (Abrahamse et al., 2016; Shenhav et al., 2013). The proximity between two states in cognitive space could reflect both the expected cognitive demand required to transition and the useful mechanisms to adapt from. The closer the two conditions are in cognitive space, the lower the expected switching cost and the higher the generalizability when transitioning between them. With the organization of a cognitive space, a new conflict can be quickly assigned a location in the cognitive space, which will facilitate the development of cognitive control settings for this conflict by interpolating nearby conflicts and/or projecting the location to axes representing different cognitive control processes, thus leading to a stronger CSE when following a more similar conflict condition.”

      References:

      Hommel, B., Proctor, R. W., & Vu, K. P. (2004). A feature-integration account of sequential effects in the Simon task. Psychological Research, 68(1), 1-17.

      Abrahamse, E., Braem, S., Notebaert, W., & Verguts, T. (2016). Grounding cognitive control in associative learning. Psychological Bulletin, 142(7), 693-728.

      Yang, G., Xu, H., Li, Z., Nan, W., Wu, H., Li, Q., & Liu, X. (2021). The congruency sequence effect is modulated by the similarity of conflicts. Journal of Experimental Psychology: Learning, Memory, and Cognition, 47(10), 1705-1719.

      Braem, S., Abrahamse, E. L., Duthoo, W., & Notebaert, W. (2014). What determines the specificity of conflict adaptation? A review, critical analysis, and proposed synthesis. Frontiers in Psychology, 5, 1134.

      3) More generally, it seems problematic that Stroop and Simon conflict in the paradigm parametrically trade-off against each other. A more powerful design would have de-confounded Stroop and Simon conflict so that each could be separately estimated via (potentially orthogonal) conflict axes. Additionally, incorporating more varied stimulus sets, locations, or responses might have enabled various tests of generality, as implied by a cognitive space account.

      We thank the reviewer for these valuable suggestions. We argue that the current design is adequate to test the prediction that more similar conflict types have more similar neural representations. That said, we agree that further examination using more powerful experimental designs is needed to fully test the cognitive space account of cognitive control. We also agree that employing more varied stimulus sets, locations and responses would further extend our findings. We have included this as a future research direction in the revised manuscript.

      We have revised our discussion about the limitation as:

      “A few limitations of this study need to be noted. To parametrically manipulate the conflict similarity levels, we adopted the spatial Stroop-Simon paradigm that enables parametrical combinations of spatial Stroop and Simon conflicts. However, since this paradigm is a two-alternative forced choice design, the behavioral CSE is not a pure measure of adjusted control but could be partly confounded by bottom-up factors such as feature integration (Hommel et al., 2004). Future studies may replicate our findings with a multiple-choice design (including more varied stimulus sets, locations and responses) with confound-free trial sequences (Braem et al., 2019). Another limitation is that in our design, the spatial Stroop and Simon effects are highly anticorrelated. This constraint may make the five conflict types represented in a unidimensional space (e.g., a circle) embedded in a 2D space. Future studies may test the 2D cognitive space with fully independent conditions. A possible improvement to our current design would be to include left, right, up, and down arrows presented in a grid formation across four spatially separate quadrants, with each arrow mapped to its own response button. However, one potential confounding factor would be that these conditions have different levels of difficulty (i.e., different magnitude of conflict), which may affect the CSE results and their representational similarity.”

      4) Serious confounds in the design render the results difficult to interpret. As much prior neuroimaging and behavioral work has established, "conflict" per se is perniciously correlated with many conceptually different variables. Consequently, it is very difficult to distinguish these confounding variables within aggregate measures of neural activity like fMRI. For example, conflict is confounded with increased time-on-task with longer RT, as well as conflict-driven increases in coding of other task variables (e.g., task-set related coding; e.g., Ebitz et al. 2020 bioRxiv). Even when using much higher resolution invasive measures than fMRI (i.e., eCoG), researchers have rightly been wary of making strong conclusions about explicit encoding of conflict (Tang et al, 2019; eLife). As such, the researchers would do well to be quite cautious and conservative in their analytic approach and interpretation of results.

      We acknowledge the findings showing that encoding of conflicts may not be easily detected in the brain. However, recent studies have shown that the representational similarity analysis can effectively detect representations of conflict tasks (e.g., the color Stroop) using factorial designs (Freund et al., 2021a; 2021b).

      In our analysis, we are aware of the potential impact of time-on-task (e.g., RT) on univariate activation levels and subsequent RSA patterns. To address this issue, we added univariate fMRI activation levels as nuisance regressors to the RSA. To deconfound conflict from other factors, such as the orientation of stimuli relative to the center of the screen, we also applied the cross-subject RSA approach. Furthermore, we were cautious about determining regions that encoded conflict control. We set three strict criteria: (1) regions must show a conflict similarity modulation effect; (2) regions must show higher representational strength in the incongruent condition compared with the congruent condition; and (3) regions must correlate with behavioral performance. With these criteria, we believe that the results we reported are already conservative. We would be happy to implement any additional criteria the reviewer recommends.

      Reference:

      Freund, M. C., Etzel, J. A., & Braver, T. S. (2021a). Neural Coding of Cognitive Control: The Representational Similarity Analysis Approach. Trends in Cognitive Sciences, 25(7), 622-638.

      Freund, M. C., Bugg, J. M., & Braver, T. S. (2021b). A Representational Similarity Analysis of Cognitive Control during Color-Word Stroop. Journal of Neuroscience, 41(35), 7388-7402.

      5) This issue is most critical in the interpretation of the fMRI results as reflecting encoding of conflict types. A key limitation of the design, that is acknowledged by the authors is that conflict is fully confounded within-subject by spatial orientation. Indeed, the limited set of stimulus-response mappings also cast doubt on the underlying factors that give rise to the CSE modulations observed by the authors in their behavioral results. The CSE modulations are so strong - going from a complete absence of current x previous trial-type interaction in the cos(90) case all the way to a complete elimination of any current trial conflict when the prior trial was incongruent in the cos(0) case - that they cause suspicion that they are actually driven by conflict-related control adjustments rather than sequential dependencies in the stimulus-response mappings that can be associatively learned.

      Unlike for the fMRI data, we cannot tease apart the effects of conflict similarity and orientation on the behavioral CSEs in the way the cross-subject RSA allows. However, we have several reasons to believe that orientation and other bottom-up factors are not driving the similarity modulation effect.

      First, we did not find any correlation between the regions showing orientation effects and behavioral CSEs. This suggests that orientation does not directly contribute to the CSE modulation.

      Second, if the CSE modulation is purely driven by the association learning of the stimulus-response mapping, we should observe a stronger modulation effect after more extensive training. However, our results do not support this prediction. Using data from Experiment 1, we found that the modulation effect remained constant across the three sessions (see Note S3).

      “Note S3. Modulation of conflict similarity on behavioral CSEs does not change across time

      We tested if the conflict similarity modulation on the CSE is susceptible to training. We collected the data of Experiment 1 across three sessions, thus it is possible to examine if the conflict similarity modulation effect changes across time. To this end, we added conflict similarity, session and their interaction into a mixed-effect linear model, in which the session was set as a categorical variable. With a post-hoc analysis of variance (ANOVA), we calculated the statistical significance of the interaction term. This approach was applied to both the RT and ER. Results showed no interaction effect in either RT, F(2,1479) = 1.025, p = .359, or ER, F(2,1479) = 0.789, p = .455. This result suggests that the modulation effect does not change across time.”

      Third, the observed similarity modulation on the CSE, particularly for similarity levels 1-4, should not be attributed to stimulus-response associations such as feature integration, as addressed in our response to comment 2.c.

      Finally, other bottom-up factors, such as the spatial location proximity did not drive the CSE modulation results, which we have addressed in the original manuscript in Note S2.

      "Note S2. Modulation of conflict similarity on behavioral CSEs cannot be explained by the physical proximity

      In our design, the conflict similarity might be confounded by the physical proximity between the stimuli (i.e., the arrows) of two consecutive trials. That is, when arrows of the two trials appear in the same quadrant, a higher conflict similarity also indicates a higher physical proximity (Fig. 1A). Although the opposite is true if arrows of the two trials appear in different quadrants, it is possible that the behavioral effects are biased by the within-quadrant trials. To examine if the physical distance has confounded the conflict similarity modulation effect, we conducted an additional analysis.

      We defined the physical angular difference across two trials as the difference of their polar angles relative to the origin. Therefore, the physical angular difference could vary from 0 to 180°. For each CSE condition (i.e., CC, CI, IC and II), we grouped the trials based on their physical angular distances, and then averaged trials with the same previous by current conflict type transition but different orders (e.g., StHSmL−StLSmH and StLSmH−StHSmL) within each subject. The data were submitted to a mixed-effect model with conflict similarity and physical proximity (i.e., the opposite of the physical angular difference) as fixed-effect predictors, and subject and CSE condition as random effects. Results showed significant conflict similarity modulation effects in both Experiment 1 (RT: β = 0.09 ± 0.01, t(7812) = 13.74, p < .001, ηp2 = .025; ER: β = 0.09 ± 0.01, t(7812) = 7.66, p < .001, ηp2 = .018) and Experiment 2 (RT: β = 0.21 ± 0.02, t(3956) = 9.88, p < .001, ηp2 = .043; ER: β = 0.20 ± 0.03, t(4201) = 6.11, p < .001, ηp2 = .038). Thus, the observed modulation of conflict similarity on behavioral CSEs cannot be explained by physical proximity."
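
      The angular-difference definition in Note S2 can be illustrated with a minimal Python sketch. The function names `angular_difference` and `physical_proximity` are hypothetical and are not taken from the authors' analysis code; the sketch only assumes that polar angles are given in degrees and that the difference is wrapped into the 0 to 180° range described in the note.

```python
def angular_difference(angle_a: float, angle_b: float) -> float:
    """Smallest absolute difference between two polar angles, in degrees (0 to 180)."""
    diff = abs(angle_a - angle_b) % 360.0
    # Differences above 180 degrees are shorter when measured the other way round.
    return 360.0 - diff if diff > 180.0 else diff


def physical_proximity(angle_a: float, angle_b: float) -> float:
    """Proximity defined as the opposite (negative) of the angular difference."""
    return -angular_difference(angle_a, angle_b)
```

      Under these assumptions, two arrows at 350° and 10° are 20° apart rather than 340°, which is what keeps the measure within the 0 to 180° range.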

      6) To their credit, the authors recognize this confound, and attempt to address it analytically through the use of a between-subject RSA approach. Yet the solution is itself problematic, because it doesn't actually deconfound conflict from orientation. In particular, the RSA model assumes that whatever components of neural activity encode orientation produce this encoding within the same voxel-level patterns of activity in each subject. If they are not (which is of course likely), then orthogonalization of these variables will be incomplete. Similar issues underlie the interpretation of target/response and distractor coding. Given these issues, perhaps zooming out to a larger spatial scale for the between-subject RSA might be warranted. Perhaps whole-brain at the voxel level with a high degree of smoothing, or even whole-brain at the parcel level (averaging per parcel). For this purpose, Schaefer atlas parcels might be more useful than Glasser, as they more strongly reflect functional divisions (e.g., motor strip is split into mouth/hand divisions; visual cortex is split into central/peripheral visual field divisions). Similarly, given the lateralization of stimuli, if a within-parcel RSA is going to be used, it seems quite sensible to pool voxels across hemispheres (so effectively using 180 parcels instead of 360).

      Doing RSA at the whole-brain level is an interesting idea. However, it does not allow the identification of specific brain regions representing the cognitive space. Additionally, increasing the spatial scale would include more voxels that are not involved in representing the information of interest and may increase the noise level of data. Given these concerns, we did not conduct the whole-brain level RSA.

      We agree that smoothing data can decrease cross-subject variance in voxel distribution and may increase the signal-to-noise ratio. We reanalyzed the results for the right 8C region using RSA on smoothed beta maps (6-mm FWHM Gaussian kernel). This yielded a significant conflict similarity effect, t(951989) = 5.55, p < .0001, replicating the results on unsmoothed data (t(951989) = 5.60, p < .0001). Therefore, we retained the results from unsmoothed data in the main text, and added the results based on smoothed data to the supplementary material (Note S9).

      “Note S9. The cross-subject pattern similarity is robust against individual differences

      Due to individual differences, the multivoxel patterns extracted from the same brain mask may not reflect exactly the same brain region for each subject. To reduce the influence of individual differences, we conducted the same cross-subject RSA using data smoothed with a 6-mm FWHM Gaussian kernel. Results showed a significant conflict similarity effect, t(951989) = 5.55, p < .0001, replicating the results on unsmoothed data (t(951989) = 5.60, p < .0001).”

      We also used the bilateral 8C area as a single mask and conducted the same RSA. We found a significant conflict type similarity effect, t(951989) = 4.36, p < .0001. However, the left 8C alone showed no such representation, t(951989) = 0.38, p = .351, consistent with the right lateralized representation of cognitive space we reported in Note S8. Therefore, we used ROIs from each hemisphere separately.

      “Note S8. The lateralization of conflict type representation

      We observed that the right 8C but not the left 8C represented the conflict type similarity. A further test is to show if there is a lateralization. We tested several regions of the left dlPFC, including the i6-8, 8Av, 8C, p9-46v, 46, 9-46d, a9-46v (Freund, Bugg, et al., 2021). We found that none of these regions showed the representation of conflict type, all uncorrected ps > .35. These results indicate that the conflict type is specifically represented in the right dlPFC.”

      We have also discussed the lateralization in the manuscript:

      “In addition, we found no such representation in the left dlPFC (Note S8), indicating a possible lateralization. Previous studies showed that the left dlPFC was related to the expectancy-related attentional set up-regulation, while the right dlPFC was related to the online adjustment of control (Friehs et al., 2020; Vanderhasselt et al., 2009), which is consistent with our findings. Moreover, the right PFC also represents a composition of single rules (Reverberi et al., 2012), which may explain how the spatial Stroop and Simon types can be jointly encoded in a single space.”

      7) The strength of the results is difficult to interpret due to the non-standard analysis method. The use of a mixed-level modeling approach to summarize the empirical similarity matrix is an interesting idea, but nevertheless is highly non-standard within RSA neuroimaging methods. More importantly, the way in which it was implemented makes it potentially vulnerable to a high degree of inaccuracy or bias. In this case, this bias is likely to be overly optimistic (high false positive rate). No numerical or formal defense was provided for this mixed-level model approach. As a result, the use of this method seems quite problematic, as it renders the strength of the observed results difficult to interpret. Instead, the authors are encouraged to use a previously published method of conducting inference with between-subject RSA, such as the bootstrapping methods illustrated in Kragel et al. (2018; Nat Neurosci), or potentially to adopt one of the Chen et al. methods mentioned above, which have been extensively explored in terms of statistical properties.

      In our revised manuscript, we have adopted the approach proposed by Chen et al. (2017). Specifically, we included both the upper and lower triangle of the representational similarity matrix (excluding the diagonal). Moreover, we also removed all the within-subject similarity (thus also excluding the within-run similarity) to minimize the bias of the potentially strong within-subject similarity (note we also analyzed the within-subject data and found significant effects for the similarity modulation, though this effect cannot be attributed to the conflict similarity or orientation alone. We added this part in Note S7, see below). In addition, we added both the row-wise and column-wise random effects to capture the dependence of cells within each column/row (Chen et al., 2017). We have revised the method part as:

      “We excluded within-subject cells from the RSM (thus also excluding the within-run similarity, as suggested by Walther et al. (2016)), and the remaining cells were converted into a vector, which was then z-transformed and submitted to a linear mixed effect model as the dependent variable. The linear mixed effect model also included regressors of conflict similarity and orientation similarity. Importantly, conflict similarity was based on how Simon and spatial Stroop conflicts are combined and hence was calculated by first rotating all subjects’ stimulus locations to the top-right and bottom-left quadrants, whereas orientation was calculated using original stimulus locations. As a result, the regressors representing conflict similarity and orientation similarity were de-correlated. Similarity between two conditions was measured as the cosine value of the angular difference. Other regressors included a target similarity regressor (i.e., whether the arrow directions were identical), a response similarity regressor (i.e., whether the correct responses were identical), a spatial Stroop distractor regressor (i.e., vertical distance between two stimulus locations), and a Simon distractor regressor (i.e., horizontal distance between two stimulus locations). Additionally, we also included a regressor denoting the similarity of Group (i.e., whether two conditions are within the same subject group, according to the stimulus-response mapping). We also added two regressors including ROI-mean fMRI activations for each condition of the pair to remove the possible uni-voxel influence on the RSM. A last term was the intercept. To control the artefact due to dependence of the correlation pairs sharing the same subject, we included crossed random effects (i.e., row-wise and column-wise random effects) for the intercept, conflict similarity, orientation and the group factors (G. Chen et al., 2017).”
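
      The similarity measure described in the revised methods (the cosine of the angular difference between two conditions) can be sketched as follows. This is only an illustration: the function name and the five example angles are assumptions made for the sketch, not values taken from the manuscript or the authors' code.

```python
import math


def cosine_similarity_of_angles(theta_a_deg: float, theta_b_deg: float) -> float:
    """Similarity between two conditions as the cosine of their angular difference."""
    return math.cos(math.radians(theta_a_deg - theta_b_deg))


# Hypothetical condition angles in degrees, chosen for illustration only.
example_angles = [0.0, 22.5, 45.0, 67.5, 90.0]

# Model similarity matrix: one row and one column per condition.
model_similarity = [
    [cosine_similarity_of_angles(a, b) for b in example_angles]
    for a in example_angles
]
```

      With this definition, identical conditions have a similarity of 1 and conditions 90° apart have a similarity of approximately 0, so the regressor decreases smoothly as two conditions move apart.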

      Results from this approach closely replicated our original results. Specifically, we found the right 8C again showed a strong conflict similarity effect, a higher representational strength in the incongruent condition compared to the congruent condition, and a significant correlation with the behavioral CSE. The orientation effect was also identified in the visual (e.g., right V1) and oculomotor (e.g., left FEF) regions.

      We revised the results accordingly:

      For the conflict type effect:

      “The first criterion revealed several cortical regions encoding the conflict similarity, including the Brodmann 8C area (a subregion of dlPFC (Glasser et al., 2016)) and a47r in the right hemisphere, and the superior frontal language (SFL) area, 6r, 7Am, 24dd, and ventromedial visual area 1 (VMV1) areas in the left hemisphere (Bonferroni corrected ps < 0.0001, one-tailed, Fig. 4A). We next tested whether these regions were related to cognitive control by comparing the strength of conflict similarity effect between incongruent and congruent conditions (criterion 2). Results revealed that the left SFL, left VMV1, and right 8C met this criterion, Bonferroni corrected ps < .05, one-tailed, suggesting that the representation of conflict type was strengthened when conflict was present (e.g., Fig. 4D). The intersubject brain-behavioral correlation analysis (criterion 3) showed that the strength of conflict similarity effect on RSM scaled with the modulation of conflict similarity on the CSE (slope in Fig. S2C) in right 8C (r = .52, Bonferroni corrected p = .002, one-tailed, Fig. 4C, Table 1) but not in the left SFL and VMV1 (all Bonferroni corrected ps > .05, one-tailed).”

      For the orientation effect:

      “We observed increasing fMRI representational similarity between trials with more similar orientations of stimulus location in the occipital cortex, such as right V1, right V2, right V4, and right lateral occipital 2 (LO2) areas (Bonferroni corrected ps < 0.0001). We also found the same effect in the oculomotor related region, i.e., the left frontal eye field (FEF), and other regions including the right 5m, left 31pv and right parietal area F (PF) (Fig. 5A). Then we tested if any of these brain regions were related to the conflict representation by comparing their encoding strength between incongruent and congruent conditions. Results showed that the right V1, right V2, left FEF, and right PF encoded stronger orientation effect in the incongruent than the congruent condition, Bonferroni corrected ps < .05, one-tailed (Table 1, Fig. 5B). We then tested if any of these regions was related to the behavioral performance, and results showed that none of them positively correlated with the behavioral conflict similarity modulation effect, all uncorrected ps > .45, one-tailed. Thus all regions are consistent with the criterion 3.”

      “Note S7. The cross-subject RSA captures similar effects with the within-subject RSA

      Considering the variability in voxel-level functional localizations among individuals, one may question whether the cross-subject RSA results were biased by the consistent multi-voxel patterns across subjects, distinct from the more commonly utilized within-subject RSA. We reasoned that the cross-subject RSA should have captured similar effects as the within-subject RSA if we observe the conflict similarity effect in right 8C with the latter analysis. Therefore, we tested whether the representation in right 8C held for within-subject data. Specifically, we performed similar RSA for within-subject RSMs, excluding the within-run cells. We replaced the perfectly confounded factors of conflict similarity and orientation with a common factor called similarity_orientation. Other confounding factor pairs (i.e., target versus response, and Stroop distractor versus Simon distractor) were addressed similarly. Results showed a significant effect of similarity_orientation, t(13993) = 3.270, p = .0005, one-tailed. Given the specific representation of conflict similarity identified by the cross-subject RSA, the within-subject data of right 8C may show similar conflict similarity modulation effects as the cross-subject data. Further research is needed to fully dissociate the representation of conflict and the representation of visual features such as orientation.”

      8) Another potential source of bias is in treating the subject-level random effect coefficients (as predicted by the mixed-level model) as independent samples from a random variable (in the t-tests). The more standard method for inference would be to use test statistics derived from the mixed-model fixed effects, as those have degrees of freedom calculations that are calibrated based on statistical theory.

      In our revised manuscript, we reported the statistical p values calculated from the mixed-effect models. Note that because we used the Chen et al. (2017) method, which includes data from the symmetric matrix, we corrected the degrees of freedom and estimated the true p values based on the t statistics of model results. For the I versus C comparison results, we calculated the p values by combining I and C RSMs into a larger model and then adding the condition type, as well as the interaction between the regressors of interest (conflict similarity and orientation) and the condition type. We made the statistical inference based on the interaction effect.

      We have revised the corresponding methods as:

      “The statistical significance of these beta estimates was based on the outputs of the mixed-effect model estimated with the “fitlme” function in Matlab 2022a. Since symmetric cells from the RSM matrix were included in the mixed-effect model, we adjusted the t and p values with the true degree of freedom, which is half of the cells included minus the number of fixed regressors. Multiple comparison correction was applied with the Bonferroni approach across all cortical regions at the p < 0.0001 level. To test if the representation strengths are different between congruent and incongruent conditions, we also conducted the RSA using only congruent (RDM_C) and incongruent (RDM_I) trials separately. The contrast analysis was achieved by an additional model with both RDM_C and RDM_I included, adding the congruency and the interaction between conflict type (and orientation) and congruency as both fixed and random factors. The difference between incongruent and congruent representations was indicated by a significant interaction effect.”
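
      The degrees-of-freedom correction described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' Matlab procedure: the function names are hypothetical, the example counts are made up, and the one-tailed p value uses a standard normal approximation, which is only reasonable when the degrees of freedom are very large. How the t statistic itself was rescaled is not described in the text and is not reproduced here.

```python
import math


def true_degrees_of_freedom(n_cells_included: int, n_fixed_regressors: int) -> int:
    """Half of the RSM cells entered into the model, minus the fixed regressors.

    Halving reflects that each symmetric cell pair carries a single value.
    """
    return n_cells_included // 2 - n_fixed_regressors


def one_tailed_p_normal(t_value: float) -> float:
    """Upper-tail p value under a standard normal distribution.

    Used here as an approximation to the t distribution for very large df.
    """
    return 0.5 * math.erfc(t_value / math.sqrt(2.0))


# Hypothetical counts, for illustration only.
example_df = true_degrees_of_freedom(n_cells_included=2000, n_fixed_regressors=10)
```

      As a rough plausibility check under this approximation, `one_tailed_p_normal(0.84)` is about .20 and `one_tailed_p_normal(5.60)` is about 1.1×10−8, in line with the one-tailed values reported in this response for tests with 951989 degrees of freedom.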

      Reviewer #3:

      Yang and colleagues investigated whether information on two task-irrelevant features that induce response conflict is represented in a common cognitive space. To test this, the authors used a task that combines the spatial Stroop conflict and the Simon effect. This task reliably produces a beautiful graded congruency sequence effect (CSE), where the cost of congruency is reduced after incongruent trials. The authors measured fMRI to identify brain regions that represent the graded similarity of conflict types, the congruency of responses, and the visual features that induce conflicts.

      Using several theory-driven exclusion criteria, the authors identified the right dlPFC (right 8C), which shows 1) stronger encoding of graded similarity of conflicts in incongruent trials and 2) a positive correlation between the strength of conflict similarity type and the CSE on behavior. The dlPFC has been shown to be important for cognitive control tasks. As the dlPFC did not show a univariate parametric modulation based on the higher or lower component of one type of conflict (e.g., having more spatial Stroop conflict or less Simon conflict), it implies that dissimilarity of conflicts is represented by a linear increase or decrease of neural responses. Therefore, the similarity of conflict is represented in multivariate neural responses that combine two sources of conflict.

      The strength of the current approach lies in the clear effect of parametric modulation of conflict similarity across different conflict types. The authors employed a clever cross-subject RSA that counterbalanced and isolated the targeted effect of conflict similarity, decorrelating orientation similarity of stimulus positions that would otherwise be correlated with conflict similarity. A pattern of neural response seems to exist that maps different types of conflict, where each type is defined by the parametric gradation of the yoked spatial Stroop conflict and the Simon conflict on a similarity scale. The similarity of patterns increases in incongruent trials and is correlated with CSE modulation of behavior.

      We would like to thank the reviewer for the positive evaluation of our manuscript and for providing constructive comments. By addressing these comments, we believe that we have made our manuscript more accessible for the readers while also strengthening our findings. In particular, we have tested a few alternative models and confirmed that the cognitive space hypothesis best fits the data. We have also demonstrated the geometric properties of the cognitive space by examining the continuity and dimensionality of the space, further supporting our main arguments. We have incorporated revisions and additional analyses to the manuscript based on your feedback. Overall, we believe that these changes and additional analyses have significantly improved the manuscript. Please find our detailed responses below.

      However, several potential caveats need to be considered.

      1) One caveat to consider is that the main claim of recruitment of an organized "cognitive space" for conflict representation is solely supported by the exclusion criteria mentioned earlier. To further support the involvement of organized space in conflict representation, other pieces of evidence need to be considered. One approach could be to test the accuracy of out-of-sample predictions to examine the continuity of the space, as commonly done in studies on representational spaces of sensory information. Another possible approach could involve rigorously testing the geometric properties of space, rather than fitting RSM to all conflict types. For instance, in Fig 6, both the organized and domain-specific cognitive maps would similarly represent the similarity of conflict types expressed in Fig1c (as evident from the preserved order of conflict types). The RSM suggests a low-dimensional embedding of conflict similarity, but the underlying dimension remains unclear.

      Following the reviewer’s first suggestion, we conducted a leave-one-out prediction approach to examine the continuity of the cognitive space. We used the behavioral data from Experiment 1 for this test, due to its larger amount of data than Experiment 2. Specifically, we removed data from one of the five similarity levels (as illustrated by the θs in Fig. 1C) and used the remaining data to perform the same mixed-effect model as reported in the main text (i.e., the two-stage analysis). This yielded one pair of beta coefficients including the similarity regressor and the intercept for each subject, with which we predicted the CSE for the removed similarity level at subject level. We repeated this process for each similarity level once. The predicted results were highly correlated with the original data, with r = .87 for the RT and r = .84 for the ER, ps < .001. We have added this analysis and result to the “Conflict type similarity modulated behavioral congruency sequence effect (CSE)” section:

      “Moreover, to test the continuity and generalizability of the similarity modulation, we conducted a leave-one-out prediction analysis. We used the behavioral data from Experiment 1 for this test, due to its larger amount of data than Experiment 2. Specifically, we removed data from one of the five similarity levels (as illustrated by the θs in Fig. 1C) and used the remaining data to perform the same mixed-effect model (i.e., the two-stage analysis). This yielded one pair of beta coefficients including the similarity regressor and the intercept for each subject, with which we predicted the CSE for the removed similarity level for each subject. We repeated this process for each similarity level once. The predicted results were highly correlated with the original data, with r = .87 for the RT and r = .84 for the ER, ps < .001.”
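
      The logic of the leave-one-out prediction can be sketched for a single subject as follows. This is a simplified illustration, not the authors' analysis: it swaps the mixed-effect two-stage model for an ordinary least-squares line (slope and intercept), and the angles and CSE values are hypothetical, noiseless numbers chosen only to make the example checkable.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def leave_one_out_predictions(xs, ys):
    """Predict each held-out level from a line fitted to the remaining levels."""
    predictions = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        predictions.append(slope * xs[i] + intercept)
    return predictions


# Hypothetical similarity levels (cosine of assumed angles) and CSE values.
angles_deg = [0.0, 22.5, 45.0, 67.5, 90.0]
similarity = [math.cos(math.radians(a)) for a in angles_deg]
cse = [2.0 + 30.0 * s for s in similarity]  # made-up, exactly linear data

predicted_cse = leave_one_out_predictions(similarity, cse)
```

      Because the made-up data are exactly linear, each held-out level is recovered exactly here; with real data, the predicted and observed CSEs would be compared by correlation, as in the analysis described above.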

      To estimate if the domain-specific model could explain the results we observed in right 8C, we conducted a model-comparison analysis. The domain-specific model treats each conflict type differently, so we used a diagonal matrix, with within-conflict type similarities being 1 and all cross-conflict type similarities being 0. This model showed non-significant effects (t(951989) = 0.84, p = .201) and poorer fit (BIC = 5377127) than the cognitive space model (t(951989) = 5.60, p = 1.1×10⁻⁸, BIC = 5377094). We also compared other alternative models and found the cognitive space model best fitted the data. We have included these results in the revised manuscript:

      “To examine if the right 8C specifically encodes the cognitive space rather than the domain-general or domain-specific organizations, we tested several additional models (see Methods). Model comparison showed a lower BIC in the Cognitive-Space model (BIC = 5377094) than the Domain-General (BIC = 537127) or Domain-Specific (BIC = 5377127) models. Further analysis showed the dimensionality of the representation in the right 8C was 1.19, suggesting the cognitive space was close to 1D. We also tested if the observed conflict similarity effect was driven solely by spatial Stroop or Simon conflicts, and found larger BICs for the models only including the Stroop similarity (i.e., the Stroop-Only model, BIC = 5377122) or Simon similarity (i.e., the Simon-Only model, BIC = 5377096). An additional Stroop+Simon model, including both Stroop-Only and Simon-Only regressors, also showed a worse model fitting (BIC = 5377118). Moreover, we replicated the results with only incongruent trials, considering that the pattern of conflict representations is more manifested when the conflict is present (i.e., on incongruent trials) than not (i.e., on congruent trials). We found a poorer fitting in Domain-General (BIC = 1344129), Domain-Specific (BIC = 1344129), Stroop-Only (BIC = 1344128), Simon-Only (BIC = 1344120), and Stroop+Simon (BIC = 1344157) models than the Cognitive-Space model (BIC = 1344104). These results indicate that the right 8C encodes an integrated cognitive space for resolving Stroop and Simon conflicts. The more detailed model comparison results are listed in Table 2.”

      We also estimated the dimensionality of the right 8C with the averaged RSM and found the dimensionality of the cognitive space was ~ 1.19, very close to a 1D space. This result is consistent with our experimental design, as the only manipulated variable is the angular distance between conflict types. We have added these results and the methods to the revised manuscript.

      Results:

      “Further analysis showed the dimensionality of the representation in the right 8C was 1.19, suggesting the cognitive space was close to 1D.”

      Methods:

      “To better capture the dimensionality of the representational space, we estimated its dimensionality using the participation ratio (Ito & Murray, 2023). Since we excluded the within-subject cells from the whole RSM, the whole RSM is an incomplete matrix and could not be used. To resolve this issue, we averaged the cells corresponding to each pair of conflict types to obtain an averaged 5×5 RSM matrix, similar to the matrix shown in Fig. 1C. We then estimated the participation ratio using the formula:

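      $$\mathrm{dim} = \frac{\left(\sum_{i=1}^{m} \lambda_i\right)^{2}}{\sum_{i=1}^{m} \lambda_i^{2}}$$

      where λ_i is the i-th eigenvalue of the RSM and m is the number of eigenvalues.”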
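
      As a rough illustration of the participation ratio, the sketch below computes (Σλ)² / Σλ² for a symmetric matrix. It relies on two standard identities rather than an explicit eigendecomposition: the sum of eigenvalues equals the trace, and for a symmetric matrix the sum of squared eigenvalues equals the sum of squared entries. The function name `participation_ratio` and the two toy matrices are assumptions made for demonstration; they are not the authors' code or data.

```python
def participation_ratio(rsm):
    """Participation ratio (sum of eigenvalues)^2 / (sum of squared eigenvalues).

    Valid for a symmetric square matrix given as a list of lists.
    """
    m = len(rsm)
    # Sum of eigenvalues equals the trace.
    eig_sum = sum(rsm[i][i] for i in range(m))
    # For a symmetric matrix, sum of squared eigenvalues = trace(R @ R)
    # = sum of all squared entries.
    eig_sq_sum = sum(rsm[i][j] ** 2 for i in range(m) for j in range(m))
    return eig_sum ** 2 / eig_sq_sum


# Toy 5x5 matrices: fully independent conditions vs. one shared dimension.
identity = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]
all_ones = [[1.0] * 5 for _ in range(5)]

print(participation_ratio(identity))  # 5.0: variance spread over 5 dimensions
print(participation_ratio(all_ones))  # 1.0: all variance on one dimension
```

      A value near 1, such as the 1.19 reported above, therefore indicates that a single dimension carries most of the variance.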

      2) Another important factor to consider is how learning within the confined task space, which always negatively correlates the two types of conflicts within each subject, may have influenced the current results. Is statistical dependence of conflict information necessary to use the organized cognitive space to represent conflicts from multiple sources? Answering this question would require a paradigm that can adjust multiple sources of conflicts parametrically and independently. Investigating such dependencies is crucial in order to better understand the adaptive utility of the observed cognitive space of conflict similarity.

      As the central goal of our design was to test the geometry of neural representations of conflict, we manipulated the conflict similarity. The anticorrelated Simon and spatial Stroop conflict aimed to make the overall magnitude of conflict similar among different conflict types. We agree that with the current design the likely cognitive space is not a full 2D space with Simon and spatial Stroop being two dimensions. Instead, the likely cognitive space is a subspace (e.g., a circle) embedded in the 2D space, due to the constraint of anticorrelated Simon and spatial Stroop conflict across conflict types. Nevertheless, the subspace can also be used to test the geometry that similar conflict types share similar neural representations.

      To test the full 2D cognitive space, a possible revision of our current design is to have multiple hybrid conditions (like Type 2-4) that cover the whole space. For instance, imagine arrow locations in the first quadrant space. We could have a 3×3 design with 9 conflict conditions, where their horizontal/vertical coordinates could be one of the combinations of 0, 0.5 and 1. This way, the spatial Stroop and Simon conditions would be independent of each other. Notably, however, one potential confounding factor would be that these conditions have different levels of difficulty (i.e., different magnitude of conflict), which may affect the CSE results and their representational similarity.

      We have added the above limitations and future designs to the revised manuscript.

      “Another limitation is that in our design, the spatial Stroop and Simon effects are highly anticorrelated. This constraint may make the five conflict types represented in a unidimensional space (e.g., a circle) embedded in a 2D space. Future studies may test the 2D cognitive space with fully independent conditions. A possible improvement to our current design would be to include left, right, up, and down arrows presented in a grid formation across four spatially separate quadrants, with each arrow mapped to its own response button. However, one potential confounding factor would be that these conditions have different levels of difficulty (i.e., different magnitude of conflict), which may affect the CSE results and their representational similarity.”

      Major comments:

      3) The RSM result (and the absence of univariate effect) seem to be a good first step to claim the use of cognitive space of conflict. Yet, the presence of an organized (unidimensional; Fig. 6) and continuous cognitive space should be further tested and backed up.

      We thank the reviewer for recognizing the methods and results of our current work. Indeed, the utilization of a parametric design and RSA to examine organization of neural representations is a widely embraced methodology in the field of cognitive neuroscience (e.g., Freund et al., 2021; Ritz et al., 2022). Our current study aimed primarily to provide original evidence for whether similar conflicts are represented similarly in the brain, which reflects the geometry of conflict representations (i.e., the structure of differences between conflict representations). We have used multiple criteria to back up the findings by showing the representation is sensitive to the presence of conflict and has behavioral relevance.

      We agree that the cognitive space account of cognitive control requires further validation. Therefore, in the revised manuscript, we have added several additional tests to strengthen the evidence supporting the organized cognitive space representation. Firstly, we tested five alternative models (Domain-General, Domain-Specific, Stroop-Only, Simon-Only and Stroop+Simon models), and found that the Cognitive-Space model best fitted our data. Secondly, we explicitly calculated the dimensionality of the representation and observed a low dimensionality (1.19D). We have added these results to the “Multivariate patterns of the right dlPFC encodes the conflict similarity” section in the revised manuscript (see also the response to Comment 1).

      Furthermore, we utilized data from Experiment 1 to demonstrate the continuity of the cognitive space by showing its ability to predict out-of-sample data. We have included this result in the “Conflict type similarity modulated behavioral congruency sequence effect (CSE)” section in the revised manuscript:

      “Moreover, to test the continuity and generalizability of the similarity modulation, we conducted a leave-one-out prediction analysis. We used the behavioral data from Experiment 1 for this test, due to its larger amount of data than Experiment 2. Specifically, we removed data from one of the five similarity levels (as illustrated by the θs in Fig. 1C) and used the remaining data to perform the same mixed-effect model (i.e., the two-stage analysis). This yielded one pair of beta coefficients including the similarity regressor and the intercept for each subject, with which we predicted the CSE for the removed similarity level for each subject. We repeated this process for each similarity level once. The predicted results were highly correlated with the original data, with r = .87 for the RT and r = .84 for the ER, ps < .001.”

      References:

      Freund, M. C., Bugg, J. M., & Braver, T. S. (2021). A Representational Similarity Analysis of Cognitive Control during Color-Word Stroop. Journal of Neuroscience, 41(35), 7388-7402.

      Ritz, H., & Shenhav, A. (2022). Humans reconfigure target and distractor processing to address distinct task demands. bioRxiv. doi:10.1101/2021.09.08.459546

      4) Is the conflict similarity effect not driven by either coding of the weak to strong gradient of the spatial Stroop conflict or the Simon conflict? For example, would simply identifying brain regions that are selectively tuned to the Simon conflict in a continuous manner be enough to create the graded similarity in Fig. C?

      We recognize that our current design and analysis approach cannot fully exclude the possibility that the current results are driven solely by either Stroop or Simon conflicts, since their gradients are correlated with the conflict similarity gradient we defined. To estimate their unique contributions, we performed a model-comparison analysis. We constructed a Stroop-Only model and a Simon-Only model, with each conflict type projected onto the Stroop (vertical) axis or Simon (horizontal) axis, respectively. The similarity between any two conflict types was defined using the Jaccard similarity index (Jaccard, 1901), that is, their intersection divided by their union. By replacing the cognitive space-based conflict similarity regressor with the Stroop-Only and Simon-Only regressors, we calculated their BICs. Results showed that the BIC was larger for Stroop-Only (5377122) and Simon-Only (5377096) than for the cognitive space model (5377094). An additional Stroop+Simon model, including both Stroop-Only and Simon-Only regressors, also showed a poorer model fitting (BIC = 5377118) than the cognitive space model.

      Moreover, we replicated the results with only incongruent trials. We found a poorer fitting in Stroop-Only (BIC = 1344128), Simon-Only (BIC = 1344120), and Stroop+Simon (BIC = 1344157) models than the Cognitive-Space model (BIC = 1344104). These results indicate that the right 8C encodes an integrated cognitive space for resolving Stroop and Simon conflicts. Therefore, we believe the cognitive space has incorporated both dimensions. We added these additional analyses and results to the revised manuscript (see also the response to the above Comment 1).
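
      For readers who prefer to see the comparison as differences, the short sketch below tabulates ΔBIC relative to the best model, using only the all-trial BIC values quoted in this response. The dictionary layout and variable names are illustrative choices, not part of the authors' analysis code.

```python
# BIC values for the all-trial analysis, as reported in the response above.
bic = {
    "Cognitive-Space": 5377094,
    "Stroop-Only": 5377122,
    "Simon-Only": 5377096,
    "Stroop+Simon": 5377118,
}

# Lower BIC indicates the preferred model.
best_model = min(bic, key=bic.get)
delta_bic = {name: value - bic[best_model] for name, value in bic.items()}

for name, delta in sorted(delta_bic.items(), key=lambda item: item[1]):
    print(f"{name}: delta BIC = {delta}")
```

      The same tabulation can be repeated with the incongruent-only values reported above.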

      5) Is encoding of conflict similarity in the unidimensional organized space driven by specific requirements of the task or is this a general control strategy? Specifically, is the recruitment of organized space something specific to the task that people are trained to work with stimuli that negatively correlate the spatial Stroop conflict and the Simon conflict?

      We argue that this encoding is a general control strategy. In our task design, we asked the participants to respond to the target arrow and ignore the location that appeared randomly for them. So, they were not trained to deal with the stimuli in any certain way. We also found the conflict similarity modulation on CSE did not change with more training (we added this result in Note S3), indicating that the cognitive space did not depend on strategies that could be learned through training.

      “Note S3. Modulation of conflict similarity on behavioral CSEs does not change across time

      We tested if the conflict similarity modulation on the CSE is susceptible to training. We collected the data of Experiment 1 across three sessions, thus it is possible to examine if the conflict similarity modulation effect changes across time. To this end, we added conflict similarity, session and their interaction into a mixed-effect linear model, in which the session was set as a categorical variable. With a post-hoc analysis of variance (ANOVA), we calculated the statistical significance of the interaction term.

      This approach was applied to both the RT and ER. Results showed no interaction effect in either RT, F(2,1479) = 1.025, p = .359, or ER, F(2,1479) = 0.789, p = .455. This result suggests that the modulation effect does not change across time."

      Instead, the cognitive space should be determined by the intrinsic similarity structure of the task design. A previous study (Freitas et al., 2015) has found that the CSE across different versions of spatial Stroop and flanker tasks was stronger than that across either of the two conflicts and Simon. In their designs, the stimulus similarity was controlled at the same level, so the difference in CSE was only attributable to the similar dimensional overlap between Stroop and flanker tasks, in contrast to the Simon task. Furthermore, recent studies showed that the cognitive space generally exists to represent structured latent states (e.g., Vaidya et al., 2022), mental strategy cost (Grahek et al., 2022), and social hierarchies (Park et al., 2020). Therefore, we argue that cognitive space is likely a universal strategy that can be applied to different scenarios.

      We added this argument in the discussion:

      “Although the spatial orientation information in our design could be helpful to the construction of cognitive space, the cognitive space itself was independent of the stimulus-level representation of the task. We found the conflict similarity modulation on CSE did not change with more training (see Note S3), indicating that the cognitive space did not depend on strategies that could be learned through training. Instead, the cognitive space should be determined by the intrinsic similarity structure of the task design. For example, a previous study (Freitas et al, 2015) has found that the CSE across different versions of spatial Stroop and flanker tasks was stronger than that across either of the two conflicts and Simon. In their designs, the stimulus similarity was controlled at the same level, so the difference in CSE was only attributable to the similar dimensional overlap between Stroop and flanker tasks, in contrast to the Simon task. Furthermore, recent studies showed that the cognitive space generally exists to represent structured latent states (e.g., Vaidya et al., 2022), mental strategy cost (Grahek et al., 2022), and social hierarchies (Park et al., 2020). Therefore, cognitive space is likely a universal strategy that can be applied to different scenarios."

      Reference:

      Freitas, A. L., & Clark, S. L. (2015). Generality and specificity in cognitive control: conflict adaptation within and across selective-attention tasks but not across selective-attention and Simon tasks. Psychological Research, 79(1), 143-162.

      Vaidya, A. R., Jones, H. M., Castillo, J., & Badre, D. (2021). Neural representation of abstract task structure during generalization. Elife, 10, 1-26.

      Grahek, I., Leng, X., Fahey, M. P., Yee, D., & Shenhav, A. (2022). Empirical and Computational Evidence for Reconfiguration Costs During Within-Task Adjustments in Cognitive Control. CogSci.

      Park, S. A., Miller, D. S., Nili, H., Ranganath, C., & Boorman, E. D. (2020). Map Making: Constructing, Combining, and Inferring on Abstract Cognitive Maps. Neuron, 107(6), 1226-1238 e1228. doi:10.1016/j.neuron.2020.06.030

      6) The observed pattern seems to suggest that there is conflict similarity space that is defined by the combination of the conflict similarity (i.e., the strength of conflicts) and the sources of conflict (i.e., the Simon vs the spatial Stroop). What are the rational reasons to separate conflicts of different sources (beyond detecting incongruence)? And how are they used for better conflict resolutions?

      The necessity of separating conflicts of different sources lies in that the spatial Stroop and the Simon effects are resolved with different mechanisms. The behavioral congruency effects of a combined conflict from two different sources were shown to be the summation of the two conflict sources (Liu et al., 2010), suggesting that the conflicts are resolved independently. Moreover, previous studies have shown that different sources of conflict are resolved with different brain regions (Egner, 2008; Li et al., 2017), and at different processing stages (Wang et al., 2014). Therefore, when multiple sources of conflict occur simultaneously or sequentially, it should be more efficient to resolve the conflict by identifying the sources.

      We have added this argument to the revised manuscript:

      “The rationale behind defining conflict similarity based on combinations of different conflict sources, such as spatial-Stroop and Simon, stems from the evidence that these sources undergo independent processing (Egner, 2008; Li et al., 2014; Liu et al., 2010; Wang et al., 2014). Identifying these distinct sources is critical in efficiently resolving potentially infinite conflicts."

      Reference:

      Egner, T. (2008). Multiple conflict-driven control mechanisms in the human brain. Trends in Cognitive Sciences, 12(10), 374-380.

      Li, Q., Yang, G., Li, Z., Qi, Y., Cole, M. W., & Liu, X. (2017). Conflict detection and resolution rely on a combination of common and distinct cognitive control networks. Neuroscience and Biobehavioral Reviews, 83, 123-131.

      Wang, K., Li, Q., Zheng, Y., Wang, H., & Liu, X. (2014). Temporal and spectral profiles of stimulus-stimulus and stimulus-response conflict processing. NeuroImage, 89, 280-288.

      Liu, X., Park, Y., Gu, X., & Fan, J. (2010). Dimensional overlap accounts for independence and integration of stimulus-response compatibility effects. Attention, Perception, & Psychophysics, 72(6), 1710-1720.

      7) The congruency effect is larger in conflict type 2, 3, 4 consistently compared to conflict 1 and 5. Are these expected under the hypothesis of unified cognitive space of conflict similarity? Is the pattern of similarity modeled in RSA?

      Yes, this is expected. The spatial Stroop and Simon effects have been shown to be additive and independent (Li et al., 2014). Therefore, the congruency effects of conflict type 2, 3 and 4 would be the weighted sum of the spatial Stroop and Simon effects. The weights can be defined by the sine and cosine of the polar angle.

      For instance, in Type 2, wy = sin(67.5°) and wx = cos(67.5°). The sum of the two weight values (i.e., 1.31) is larger than 1, leading to a larger congruency effect than the pure spatial Stroop (Conf 1) and Simon (Conf 5) conditions.
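
      The arithmetic behind the weighted sum can be checked with a few lines. The sketch assumes the five conflict types sit at polar angles of 90°, 67.5°, 45°, 22.5° and 0° (only the 67.5° value for Type 2 is stated above; the others are inferred from Type 1 being pure spatial Stroop and Type 5 being pure Simon). The function name `conflict_weights` is hypothetical.

```python
import math


def conflict_weights(polar_angle_deg):
    """Return (w_y, w_x): the spatial Stroop and Simon weights for a polar angle."""
    theta = math.radians(polar_angle_deg)
    return math.sin(theta), math.cos(theta)


# Assumed polar angles for conflict Types 1 to 5.
assumed_angles = (90.0, 67.5, 45.0, 22.5, 0.0)

weight_sums = {}
for conflict_type, angle in enumerate(assumed_angles, start=1):
    w_y, w_x = conflict_weights(angle)
    weight_sums[conflict_type] = w_y + w_x
    print(f"Type {conflict_type}: w_y = {w_y:.3f}, w_x = {w_x:.3f}, sum = {w_y + w_x:.2f}")
```

      Under this assumption the hybrid types (2 to 4) all have weight sums above 1, whereas the pure types (1 and 5) sum to 1.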

      Note that this hypothesis underlies the Stroop+Simon model, which assumes the Stroop and Simon dimensions are independently represented in the brain and drive the behavior in an additive fashion. Moreover, the observed difference of behavioral congruency effects may have reflected the variance in the Domain-General model, which treats all conflict types as equivalent, with the only difference between each two conflict types in the magnitude of their conflict. Therefore, we did not model the behavioral congruency effects as a covariance regressor in the major RSA. Instead, we conducted a model comparison analysis by comparing these models and the Cognitive-Space model. Results showed worse model fitting of both the Domain-General and Stroop+Simon models. Specifically, the regressor of congruency effect difference in the Domain-General model was not significant (p = .575), which also suggests that the higher congruency effect in conflict types 2, 3 and 4 should not influence the Cognitive-Space model results. We have added these methods and results to the revised manuscript (see also our response to Comment 1):

      Methods:

      “Model comparison and representational dimensionality

      To estimate if the right 8C specifically encodes the cognitive space, rather than the domain-general or domain-specific structures, we conducted two more RSAs. We replaced the cognitive space-based conflict similarity matrix in the RSA we reported above (hereafter referred to as the Cognitive-Space model) with one of the alternative model matrices, with all other regressors equal. The domain-general model treats each conflict type as equivalent, so each two conflict types only differ in the magnitude of their conflict. Therefore, we defined the domain-general matrix as the difference in their congruency effects indexed by the group-averaged RT in Experiment 2. Then the z-scored model vector was sign-flipped to reflect similarity instead of distance. The domain-specific model treats each conflict type differently, so we used a diagonal matrix, with within-conflict type similarities being 1 and all cross-conflict type similarities being 0.

      Moreover, to examine if the cognitive space is driven solely by the Stroop or Simon conflicts, we tested a spatial Stroop-Only (hereafter referred to as “Stroop-Only”) and a Simon-Only model, with each conflict type projected onto the spatial Stroop (vertical) axis or Simon (horizontal) axis, respectively. The similarity between any two conflict types was defined using the Jaccard similarity index (Jaccard, 1901), that is, their intersection divided by their union. We also included a model assuming the Stroop and Simon dimensions are independently represented in the brain, adding up the Stroop-Only and Simon-Only regressors. We conducted similar RSAs as reported above, replacing the original conflict similarity regressor with the Stroop-Only, Simon-Only, or both regressors, and then calculated their Bayesian information criteria (BICs)."
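
      A minimal sketch of how such model matrices might be built is given below. Several details are assumptions for illustration only: the polar angles of the five conflict types (90° to 0° in 22.5° steps), the reading of the Jaccard index for two projected magnitudes as min/max (the overlap of the intervals [0, a] and [0, b] divided by their union), and the convention of returning 0 when both projections are zero, which the text above does not specify. The names `jaccard` and `projection_model` are hypothetical.

```python
import math

# Assumed polar angles (degrees) for conflict Types 1 to 5.
ANGLES_DEG = (90.0, 67.5, 45.0, 22.5, 0.0)


def jaccard(a, b):
    """Jaccard index of two non-negative magnitudes: overlap divided by union."""
    larger = max(a, b)
    # 0/0 is undefined; returning 0.0 here is a convention of this sketch.
    return min(a, b) / larger if larger > 0 else 0.0


def projection_model(axis):
    """5x5 similarity matrix from projections onto the 'stroop' or 'simon' axis."""
    projections = []
    for angle in ANGLES_DEG:
        theta = math.radians(angle)
        value = math.sin(theta) if axis == "stroop" else math.cos(theta)
        # Rounding removes floating-point noise such as cos(90 deg) ~ 6e-17.
        projections.append(round(value, 12))
    return [[jaccard(a, b) for b in projections] for a in projections]


# Domain-specific model: 1 within a conflict type, 0 across types.
domain_specific = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]
stroop_only = projection_model("stroop")
simon_only = projection_model("simon")

for row in stroop_only:
    print([round(v, 3) for v in row])
```

      Each matrix would then be vectorized and entered as the similarity regressor in place of the cognitive space regressor, as described in the Methods text above.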

      Reference:

      Li, Q., Nan, W., Wang, K., & Liu, X. (2014). Independent processing of stimulus-stimulus and stimulus-response conflicts. PloS One, 9(2), e89249.

      8) Please clarify the observed patterns of CSE effects in relation to the hypothesis of common cognitive space of conflict. In particular, right 8C shows that the patterns become dissimilar in incongruent trials compared to congruent trials. How does this direction of the effect fit to the common unidimensional cognitive space account? And how does such a representation contribute to the CSE effects?

      The behavioral CSE patterns provide initial evidence for the cognitive space hypothesis. Previous studies have debated whether cognitive control relies on domain-general or domain-specific representations, with much evidence gathered from behavioral CSE patterns. A significant CSE across two conflict conditions typically suggests domain-general representations of cognitive control, while an absence of CSE suggests domain-specific representations. The cognitive space view proposes that conflict representations are neither purely domain-general nor purely domain-specific, but rather exist on a continuum. This view predicts that the CSE across two conflict conditions should depend on the representational distance between them within this cognitive space. Our finding that CSE values systematically vary with conflict similarity level supports this hypothesis. We have added this point in the discussion of the revised manuscript:

      “Previous research on this topic often adopts a binary manipulation of conflict (Braem et al., 2014) (i.e., each domain only has one conflict type) and gathered evidence for the domain-general/specific view with presence/absence of CSE, respectively. Here, we parametrically manipulated the similarity of conflict types and found the CSE systematically varied with conflict similarity level, demonstrating that cognitive control is neither purely domain-general nor purely domain-specific, but can be reconciled as a cognitive space (Bellmund et al., 2018) (Fig. 6, middle).”

      Fig. 4D was plotted to show the steeper slope of the conflict similarity effect for incongruent versus congruent conditions. Note the y-axis displays z-scored Pearson correlation values, so the grand mean of each condition was 0. The values for the first two similarity levels (level 1 and 2) were lower for incongruent than congruent conditions, seemingly indicating lower average similarity. However, this was not the case. The five similarity levels contained different numbers of data points (see Fig. 1C), so levels 4 and 5 should be weighted more heavily than levels 1 and 2. When comparing the grand mean of raw Pearson correlation values, the incongruent condition (0.0053) showed a tendency toward higher similarity than the congruent condition (0.0040), t(475998) = 1.41, p = .079. We have also plotted another version of Fig. 4D in Fig. S5, in which the raw Pearson correlation values were used.

      The stronger representation of conflict type in the incongruent condition than in the congruent condition (as evidenced by a steeper slope) suggests that the conflict representation was driven by the incongruent condition. This is probably due to the stronger involvement of cognitive control in the incongruent condition, which in turn leads to more distinct patterns across different conflict types. This is consistent with the fact that the congruent condition is typically a baseline, where any conflict-related effects should be weaker.

      The representation of cognitive space may contribute to the CSE as a mental model. This model allows our brain to evaluate the cost and benefit associated with transitioning between different conflict conditions. When two consecutive trials are characterized by more similar conflict types, their representations in the cognitive space will be closer, resulting in a less costly transition. As a consequence, stronger CSEs are observed. We revised the corresponding discussion part as:

      “Similarly, we propose that cognitive space could serve as a mental model to assist fast learning and efficient organization of cognitive control settings. Specifically, the cognitive space representation may provide a principle for how our brain evaluates the expected cost of switching and the benefit of generalization between states and selects the path with the best cost-benefit tradeoff (Abrahamse et al., 2016; Shenhav et al., 2013). The proximity between two states in cognitive space could reflect both the expected cognitive demand required to transition and the useful mechanisms to adapt from. The closer the two conditions are in cognitive space, the lower the expected switching cost and the higher the generalizability when transitioning between them. With the organization of a cognitive space, a new conflict can be quickly assigned a location in the cognitive space, which will facilitate the development of cognitive control settings for this conflict by interpolating nearby conflicts and/or projecting the location to axes representing different cognitive control processes, thus leading to a stronger CSE when following a more similar conflict condition.”

      Minor comments:

      9) Some of the labels of figure axes are unclear (e.g., Fig4C) about what they represent.

      In Fig. 4C, the x-axis label is “neural representational strength”, which refers to the beta coefficient of the conflict type effect computed from the main RSA, denoting the strength of the conflict type representation in neural patterns. The y-axis label is “behavioral representational strength”, which refers to the beta coefficient obtained from the behavioral linear model using conflict similarity to predict the CSE in Experiment 2; it reflects how strongly the conflict similarity modulates the behavioral CSE. We apologize for any confusion from the brief axis labels. We have added expanded descriptions to the figure caption of Fig. 4C.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer 1:

      (1) In general, the representation of target and distractor processing is a bit of a reach. Target processing is represented by SSVEP amplitude, which is most likely going to be related to the contrast of the dots, as opposed to representing coherent motion energy, which is the actual target. These may well be linked (e.g., greater attention to the coherent motion task might increase SSVEP amplitude), but I would call it a limitation of the interpretation. Decoding accuracy of emotional content makes sense as a measure of distractor processing, and the supplementary analysis comparing target SSVEP amplitude to distractor decoding accuracy is duly noted.

      We agree with the reviewer. The SSVEP amplitude of the target at the whole trial level indeed reflected the combined effect of the stimulus parameters (e.g., contrast of the moving dots) as well as attention. However, the time course of the target SSVEP amplitude within a trial, derived from the moving window analysis, reflected the temporal fluctuations of target processing, since the stimulus parameters remained the same during the trial. We now make this clearer in the revised manuscript.

      (2) Comparing SSVEP amplitude to emotional category decoding accuracy feels a bit like comparing apples with oranges. They have different units and scales and probably reflect different neural processes. Is the result the authors find not a little surprising in this context? This relationship does predict performance and is thus intriguing, but I think this methodological aspect needs to be discussed further. For example, is the phase relationship with behaviour a result of a complex interaction between different levels of processing (fundamental contrast vs higher order emotional processing)?

      Traditionally, the SSVEP amplitude at the distractor frequency is used to quantify distractor processing. Given that the target SSVEP amplitude is stronger than that of the distractor, it is possible that the distractor SSVEP amplitude is contaminated by the target SSVEP amplitude due to spectral power leakage; see Figure S4 for a demonstration of this. For this reason, we introduced decoding accuracy as an index of distractor processing. The lack of correlation between the distractor SSVEP amplitude and the distractor decoding accuracy, although somewhat like comparing apples with oranges as the reviewer points out, shows that these two measures do not co-vary, and that decoding accuracy is free from the influence of the distractor SSVEP amplitude, which is itself influenced by the target SSVEP amplitude. Also, to address the apples-vs-oranges issue, the correlation was computed on normalized time series, in which a z-score time series replaced the original time series so that the correlated variables are dimensionless. Regarding the question of assessing the relation between behavior and different levels of processing, we do not have the means to address it, given that we are not able to empirically separate the effects of stimulus parameters versus attention.

      Reviewer 2:

      (1) Incomplete Evidence for Rhythmicity at 1 Hz: The central claim of 1 Hz rhythmic sampling is insufficiently validated. The windowing procedure (0.5s windows with 0.25s step) inherently restricts frequency resolution, potentially biasing toward low-frequency components like 1 Hz. Testing different window durations or providing controls would significantly strengthen this claim.

      We appreciate the reviewer’s insightful suggestion. In response, we tested different windowing parameters, e.g., 0.1s sliding window with a 0.05s step size. Figure S5 demonstrates that the strength of both target and distractor processing fluctuates around ~1 Hz, both at the individual and group levels. Additionally, Figures S6(A) and S6(B) show that the relative phase between target and distractor processing time series exhibits a uniform distribution across subjects. In terms of the relation between relative phase and behavior, Figure S6(C) illustrates two representative cases: a high-performing subject with 84.34% task accuracy exhibited a relative phase of 0.9483π (closer to π), while a low-performing subject with 30.95% accuracy showed a phase of 0.29π (close to 0). At the group level, a significant positive correlation between relative phase and task performance was found (r = 0.6343, p = 0.0004), as shown in Figure S6(D). All these results, aligning closely with our original findings (0.5s window length and 0.25s step size), suggest that the conclusions are not dependent on windowing parameters. We discuss these results in the revised manuscript.
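
      For readers who want to see how a moving-window analysis of this kind operates, the following is a minimal sketch under stated assumptions, not the pipeline used in the study. The sampling rate (200 Hz), the 10 Hz carrier, the 8 s duration and all function names are hypothetical; only the 0.5s window and 0.25s step are taken from the response above. The sketch extracts a windowed amplitude time course and then asks which fluctuation frequency dominates it.

```python
import cmath
import math


def amplitude_at(samples, freq, fs):
    """Amplitude of the `freq` component of `samples` (single-frequency DFT)."""
    n = len(samples)
    coeff = sum(v * cmath.exp(-2j * math.pi * freq * k / fs)
                for k, v in enumerate(samples))
    return 2 * abs(coeff) / n


def sliding_amplitude(signal, freq, fs, win_s, step_s):
    """Amplitude time course from overlapping windows of length win_s seconds."""
    win, step = round(win_s * fs), round(step_s * fs)
    return [amplitude_at(signal[s:s + win], freq, fs)
            for s in range(0, len(signal) - win + 1, step)]


def peak_frequency(series, series_fs, candidates):
    """Candidate frequency with the largest spectral magnitude in `series`."""
    mean = sum(series) / len(series)
    centered = [v - mean for v in series]
    return max(candidates, key=lambda f: amplitude_at(centered, f, series_fs))


# Hypothetical test signal: a 10 Hz carrier whose amplitude waxes and wanes
# at 1 Hz, sampled at 200 Hz for 8 s.
FS = 200
signal = [(1 + 0.5 * math.cos(2 * math.pi * 1.0 * k / FS))
          * math.sin(2 * math.pi * 10.0 * k / FS) for k in range(8 * FS)]

course = sliding_amplitude(signal, 10.0, FS, win_s=0.5, step_s=0.25)
# A 0.25 s step samples the time course at 4 Hz, so only fluctuations
# below the 2 Hz Nyquist limit can be resolved.
candidates = [0.25 * k for k in range(1, 8)]
peak = peak_frequency(course, 1 / 0.25, candidates)
```

      The step size fixes the sampling rate of the amplitude time course and therefore the highest fluctuation frequency that can be observed, which is why repeating the analysis with a shorter window and step is a natural control.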

      To further validate our findings, we also employed the Hilbert transform to extract amplitude envelopes of the target and distractor signals on a time-point-by-time-point basis, providing a window-free estimate of signal strength (Figures R3 and R4). The results remain consistent with both the original findings and the new sliding window analyses (Figure S6). Specifically, Figure S7 reveals ~1 Hz fluctuations in target and distractor processing at both individual and group levels. Figures S8(A) and S8(B) confirm a uniform distribution of the relative phase across subjects. In Figure S8(C), the relative phase was 0.9567π for a high-performing subject (84.34% accuracy) and 0.2247π for a low-performing subject (28.57% accuracy). At the group level, a significant positive correlation was again observed between relative phase and task performance (r = 0.4020, p = 0.0376), as shown in Figure S8(D).
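
      The idea of a window-free amplitude estimate can be sketched as follows. This is an illustration only, not the code used in the study: it builds the analytic signal with a naive discrete Fourier transform rather than a library routine, it assumes an even-length series that has already been restricted to the frequency band of interest, and the test signal (10 Hz carrier, 1 Hz modulation, 64 Hz sampling) is hypothetical.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive O(N^2) discrete Fourier transform (adequate for short series)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = [sum(v * cmath.exp(sign * 2j * math.pi * k * m / n)
               for m, v in enumerate(values)) for k in range(n)]
    return [v / n for v in out] if inverse else out


def hilbert_envelope(signal):
    """Instantaneous amplitude |analytic signal| for an even-length series."""
    n = len(signal)
    spectrum = dft(signal)
    # Keep DC and Nyquist, double positive frequencies, zero negative ones.
    gains = [1.0] + [2.0] * (n // 2 - 1) + [1.0] + [0.0] * (n // 2 - 1)
    analytic = dft([g * s for g, s in zip(gains, spectrum)], inverse=True)
    return [abs(v) for v in analytic]


# Hypothetical amplitude-modulated signal: 2 s at 64 Hz.
FS = 64
N = 128
modulation = [1 + 0.5 * math.cos(2 * math.pi * 1.0 * k / FS) for k in range(N)]
signal = [m * math.cos(2 * math.pi * 10.0 * k / FS)
          for k, m in enumerate(modulation)]
envelope = hilbert_envelope(signal)
```

      In this idealized case the recovered envelope matches the imposed 1 Hz modulation sample by sample, with no window length or step size involved.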

      (2) No-Distractor Control Condition: The study lacks a baseline or control condition without distractors. This makes it difficult to determine whether the distractor-related decoding signals or the 1 Hz effect reflect genuine distractor processing or more general task dynamics.

      The lack of a no-distractor control condition is certainly a limitation and will be acknowledged as such in the revised manuscript. However, given that our decoding results are between two different classes of distractors, we are confident that they reflect distractor processing.

      (3) Decoding Near Chance Levels: The pairwise decoding accuracies for distractor categories hover close to chance (~55%), raising concerns about robustness. While statistically above chance, the small effect sizes need careful interpretation, particularly when linked to behavior.

      This is an important point. To test robustness, we have implemented a random permutation procedure in which trial labels were randomly shuffled to construct a null-hypothesis distribution for decoding accuracy. We then compared the decoding accuracy from the actual data to this distribution. Figure S9 shows the results based on 1,000 permutations. For each of the three pairwise classifications (pleasant vs. neutral, unpleasant vs. neutral, and pleasant vs. unpleasant), as well as the three-way classification, the actual decoding accuracies fall far outside the null-hypothesis distribution (p < 0.001), and the effect size in all four cases is extremely large. These findings indicate that the observed decoding accuracies are statistically significant and robust in terms of both statistical inference and effect size.
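
      The logic of a label-shuffling test can be sketched as follows. This is a hedged illustration, not the decoding pipeline of the study: the one-dimensional features, the nearest-class-mean classifier with leave-one-out validation, the random seed, and the use of 200 rather than 1,000 permutations (to keep the example fast) are all assumptions made for the sake of a self-contained example.

```python
import random


def loo_accuracy(features, labels):
    """Leave-one-out accuracy of a nearest-class-mean classifier (1-D features)."""
    correct = 0
    for i, (x, y) in enumerate(zip(features, labels)):
        means = {}
        for cls in set(labels):
            train = [f for j, (f, l) in enumerate(zip(features, labels))
                     if j != i and l == cls]
            means[cls] = sum(train) / len(train)
        predicted = min(means, key=lambda cls: abs(x - means[cls]))
        correct += predicted == y
    return correct / len(features)


def permutation_p_value(features, labels, n_perm, rng):
    """Observed accuracy and its p-value against a shuffled-label null."""
    observed = loo_accuracy(features, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the link between features and labels
        if loo_accuracy(features, shuffled) >= observed:
            exceed += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (exceed + 1) / (n_perm + 1)


# Hypothetical data: 20 trials per class, class means 4 standard deviations apart.
rng = random.Random(0)
labels = [0] * 20 + [1] * 20
features = [rng.gauss(4.0 * l, 1.0) for l in labels]
observed, p_value = permutation_p_value(features, labels, n_perm=200, rng=rng)
```

      Note that the smallest p-value such a test can report is 1 / (n_perm + 1), so the number of permutations bounds the resolution of the inference.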

      (4) No Clear Correlation Between SSVEP and Behavior: Neither target nor distractor signal strength (SSVEP amplitude) correlates with behavioral accuracy. The study instead relies heavily on relative phase, which - while interesting - may benefit from additional converging evidence.

      We felt that what the reviewer pointed out is actually the main point of our study, namely, it is not the target or distractor strength over the whole trial that matters for behavior, it is their temporal relationship within the trial that matters for behavior. This reveals a novel neuroscience principle that has not been reported in the past. We have stressed this point further in the revised manuscript.

      (5) Phase-analysis: phase analysis is performed between different types of signals hindering their interpretability (time-resolved SSVEP amplitude and time-resolved decoding accuracy).

      The time-resolved SSVEP amplitude is used to index the temporal dynamics of target processing whereas the time-resolved decoding accuracy is used to index the temporal dynamics of distractor processing. As such, they can be compared, using relative phase for example, to examine how temporal relations between the two types of processes impact behavior. This said, we do recognize the reviewer’s concern that these two processes are indexed by two different types of signals. We thus normalized each time course using z-scoring, making them dimensionless, and then computed the temporal relations between them.
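
      One way to compute a relative phase between two z-scored time courses is sketched below. It is an assumption-laden illustration rather than the method used in the study: the phase is read from a single Fourier coefficient at 1 Hz, the time courses are synthetic with a 0.75π offset built in, and the 4 Hz sampling rate corresponds to a hypothetical 0.25s step.

```python
import cmath
import math


def zscore(series):
    """Return the series with zero mean and unit (population) standard deviation."""
    n = len(series)
    mean = sum(series) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in series) / n)
    return [(v - mean) / sd for v in series]


def phase_at(series, freq, fs):
    """Phase (radians) of the `freq` Fourier component of `series`."""
    coeff = sum(v * cmath.exp(-2j * math.pi * freq * k / fs)
                for k, v in enumerate(series))
    return cmath.phase(coeff)


def relative_phase(x, y, freq, fs):
    """Absolute phase difference at `freq`, wrapped to the range [0, pi]."""
    diff = phase_at(zscore(x), freq, fs) - phase_at(zscore(y), freq, fs)
    return abs(cmath.phase(cmath.exp(1j * diff)))


# Hypothetical time courses sampled at 4 Hz for 8 s, with different units
# and scales, oscillating at 1 Hz with a 0.75*pi phase offset.
FS = 4
target = [5 + 2 * math.cos(2 * math.pi * k / FS) for k in range(32)]
distractor = [0.55 + 0.03 * math.cos(2 * math.pi * k / FS - 0.75 * math.pi)
              for k in range(32)]
phase = relative_phase(target, distractor, 1.0, FS)
```

      The recovered offset is unaffected by the very different means and amplitudes of the two series, which is the property that makes a phase comparison between an amplitude measure and an accuracy measure meaningful.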

      Appraisal of Aims and Conclusions:

      The authors largely achieved their stated goal of assessing rhythmic sampling of distractors. However, the conclusions drawn - particularly regarding the presence of 1 Hz rhythmicity - rest on analytical choices that should be scrutinized further. While the observed phase-performance relationship is interesting and potentially impactful, the lack of stronger and convergent evidence on the frequency component itself reduces confidence in the broader conclusions.

      Impact and Utility to the Field:

      If validated, the findings will advance our understanding of attentional dynamics and competition in complex visual environments. Demonstrating that ignored distractors can be rhythmically sampled at similar frequencies to targets has implications for models of attention and cognitive control. However, the methodological limitations currently constrain the paper's impact.

      Thanks for these comments and positive assessment of our work’s potential implications and impact. As indicated above, in the revision process, we have carried out a number of additional analyses, some suggested by the reviewers, and the results of the additional analyses, now included in the Supplementary Materials, served to further validate the main findings and strengthen our conclusions.

      Additional Context and Considerations:

      (1) The use of EEG-fMRI is mentioned but not leveraged. If BOLD data were collected, even exploratory fMRI analyses (e.g., distractor modulation in visual cortex) could provide valuable converging evidence.

      Indeed, leveraging fMRI data in EEG studies would be very beneficial, as has been demonstrated in our previous work. However, given that this study concerns the temporal relationship between target and distractor processing, it is felt that fMRI data, which is known to possess low temporal resolution, has limited potential to contribute. We will be exploring this rich dataset in other ways in the future, where we will be integrating the two modalities for more insights that are not possible with either modality used alone.

      Author response image 1.

      Applying moving window analysis (0.02s window duration and 0.01s step size) to a different EEG-fMRI dataset. (A) The amplitude time series of the 4.29 Hz component and the Fourier spectrum. (B) The group level Fourier spectrum. At both individual and group level, no 1 Hz modulation is observed, suggesting that the 1 Hz modulation observed in our data is not introduced by the artifact removal procedure.

      (2) In turn, removal of fMRI artifacts might introduce biases or alter the data. For instance, the authors might consider investigating potential fMRI artifact harmonics around 1 Hz to address concerns regarding induced spectral components.

      We have done extensive work in the area of simultaneous EEG-fMRI and have not encountered artifacts with a 1 Hz rhythmicity. Our scanner artifact removal procedure is very standardized. As such, it stands to reason that if the 1 Hz rhythmicity observed here results from the artifact removal process, it should also be present in other datasets where the same preprocessing steps were implemented. We tested this using another EEG-fMRI dataset (Rajan et al., 2019). Author response image 1 shows that the EEG power time series of the new dataset doesn't have 1 Hz rhythmicity, whether at the individual level or at the group level, suggesting that the 1 Hz rhythmicity reported in the manuscript is not coming from the removal of the scanner artifacts, but instead reflects true rhythmic sampling of stimulus information. Also, the fact that the temporal relations between target processing and distractor processing at 1 Hz impact behavior is another indication that the 1 Hz rhythmicity is a neuroscientific effect, not an artifact.

      References

      Rajan, A., Siegel, S. N., Liu, Y., Bengson, J., Mangun, G. R., & Ding, M. (2019). Theta Oscillations Index Frontal Decision-Making and Mediate Reciprocal Frontal–Parietal Interactions in Willed Attention. Cerebral Cortex, 29(7), 2832–2843. https://doi.org/10.1093/cercor/bhy149

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      In the manuscript titled "Vangl2 suppresses NF-κB signaling and ameliorates sepsis by targeting p65 for NDP52-mediated autophagic degradation" by Lu et al, the authors show that Vangl2, a planar cell polarity component, plays a direct role in autophagic degradation of NFkB-p65 by facilitating its ubiquitination via PDLIM2 and subsequent recognition and autophagic targeting via the autophagy adaptor protein NDP52. Conceptually it is a wonderful study with excellent execution of experiments and controls. The concerns with the manuscript are mainly on two counts - First issue is the kinetics of p65 regulation reported here, which does not fit into the kinetics of the mechanism proposed here, i.e., Vangl2-mediated ubiquitination followed by autophagic degradation of p65. The second issue is more technical- an absolute lack of quantitative analyses. The authors rely mostly on visual qualitative interpretation to assess an increase or decrease in associations between partner molecules throughout the study. While the overall mechanism is interesting, the authors should address these concerns as highlighted below:

      Major points:

      (1) Kinetics of p65 regulation by Vangl2: As mentioned above, authors report that LPS stimulation leads to higher IKK and p65 activation in the absence of Vangl2. The mechanism of action authors subsequently work out is that- Vangl2 helps recruit E3 ligase PDLIM to p65, which causes K63 ubiquitination, which is recognised by NDP52 for autophagic targeting. Curiously, peak p65 activation is achieved within 30 minutes of LPS stimulation. The time scale of all other assays is way longer. It is not clear that in WT cells, p65 could be targeted to autophagic degradation in Vangl2 dependent manner within 30 minutes. The HA-Myc-Flag-based overexpression and Co-IP studies do confirm the interactions as proposed. However, they do not prove that this mechanism was responsible for the Vangl2-mediated modulation of p65 activation upon LPS stimulation. Moreover, the Vangl2 KO line also shows increased IKK activation. The authors do not show the cause behind increased IKK activation, which in itself can trigger increased p65 phosphorylation.

      We thank the reviewer for this valuable suggestion.

      Indeed, we agreed with the reviewer that peak p65 activation is achieved within 30 minutes of LPS stimulation in vitro, and p65 could not be targeted to autophagic degradation in a Vangl2 dependent manner within 30 minutes. Given that the protein and mRNA levels of Vangl2 were elevated at 3-6 h of LPS stimulation (Fig. S1 C-E), we extended the stimulation time scale in the revised manuscript. The data (Fig. 2A-D in the revised manuscript) demonstrated that IKK phosphorylation was enhanced in Vangl2 KO myeloid cells during the early phase (within 3 h) of LPS stimulation, but not for the prolonged period of LPS stimulation. The underlying mechanism may be complex. Only p65 phosphorylation was continuously enhanced after long-term LPS stimulation in Vangl2 KO cells, compared to WT cells. Furthermore, the overexpression of Vangl2 in A549 cells also demonstrated a reduction of phosphorylation and total endogenous p65 (Fig. 2 I, J in the revised manuscript). These findings were corroborated by overexpression and Co-IP experiments, which collectively indicated that Vangl2 regulates the stability of p65 by promoting its interaction with NDP52 and autophagic degradation. (Page 7; Line 183-185).  

      (2) The other major concern is regarding the lack of quantitative assessments. For Co-IP experiments, I can understand it is qualitative observation. However, when the authors infer that there is an increase or decrease in the association through co-IP immunoblots, it should also be quantified, especially since the differences are quite marginal and could be easily misinterpreted.

      We are grateful to the reviewer for this suggestion. The quantitative analysis has been updated in the revised version.

      (3) Figure 4E and F: It is evident that inhibiting Autolysosome (CQ or BafA1) or autophagy (3MA) led to the recovery of p65 levels and inducing autophagy by Rapamycin led to faster decay in p65 levels. Did the authors also note/explore the possibility that Vangl2 itself may be degraded via the autophagy pathway? IB of WCL upon CQ/BAF/3MA or upon Rapa treatment does indicate the same. If true, how would that impact the dynamics of p65 activation?

      We thank the reviewer for this question. Previous studies have shown that Vangl2 is primarily degraded by the proteasome pathway, rather than by the autolysosomal pathway (doi: 10.1126/sciadv.abg2099; doi: 10.1038/s41598-019-39642-z). In our experiments, Vangl2 recruits E3 ligase PDLIM2 to enhance K63-linked ubiquitination on p65, which serves as a recognition signal for cargo receptor NDP52-mediated selective autophagic degradation. Vangl2 facilitated the interaction between p65 and NDP52, yet itself did not undergo significant autophagic degradation.

      (4) Autophagic targeting of p65 should also be shown through alternate evidence, like microscopy etc., in the LPS-stimulated WT cells.

      We thank the reviewer for this suggestion. We have added the data (co-localization of p65 and LC3 was detected by immunofluorescence) in the revised version (Fig. S4 H in the revised manuscript). (Page 9, lines 267-268)

      Reviewer #2 (Public Review):

      Vangl2, a core planar cell polarity protein involved in Wnt/PCP signaling, mediates cell proliferation, differentiation, homeostasis, and cell migration. Vangl2 malfunctioning has been linked to various human ailments, including autoimmune and neoplastic disorders. Interestingly, Vangl2 was shown to interact with the autophagy regulator p62, and indeed, autophagic degradation limits the activity of inflammatory mediators such as p65/NF-κB. However, if Vangl2, per se, contributes to restraining aberrant p65/NF-kB activity remains unclear.

      In this manuscript, Lu et al. describe that Vangl2 expression is upregulated in human sepsis-associated PBMCs and that Vangl2 mitigates experimental sepsis in mice by negatively regulating p65/NF-κB signaling in myeloid cells. Vangl2 recruits the E3 ubiquitin ligase PDLIM2 to promote K63-linked poly-ubiquitination of p65. Vangl2 also facilitates the recognition of ubiquitinated p65 by the cargo receptor NDP52. These molecular processes cause selective autophagic degradation of p65. Indeed, abrogation of PDLIM2 or NDP52 functions rescued p65 from autophagic degradation, leading to extended p65/NF-κB activity.

      As such, the manuscript presents a substantial body of interesting work and a novel mechanism of NF-κB control. If found true, the proposed mechanism may expand therapeutic opportunities for inflammatory diseases. However, the current draft has significant weaknesses that need to be addressed.

      We appreciate the reviewer’s comments on our manuscript, and we have further improved the manuscript as suggested.

      Specific comments

      (1) Vangl2 deficiency did not cause a discernible increase in the cellular level of total endogenous p65 (Fig 2A and Fig 2B) but accumulated also phosphorylated IKK.

      Even Fig 4D reveals that Vangl2 exerts a rather modest effect on the total p65 level and the figure does not provide any standard error for the quantified data. Therefore, these results do not fully support the proposed model (Figure 7) - this is a significant drawback. Instead, these data provoke an alternate hypothesis that Vangl2 could be specifically mediating autophagic removal of phosphorylated p65 and phosphorylated IKK, leading to exacerbated inflammatory NF-κB response in Vangl2-deficient cells. One may need to use phosphorylation-defective mutants of p65, at least in the over-expression experiments, to dissect between these possibilities.

      We appreciate the reviewer’s comments on our manuscript, and we have further improved the manuscript as suggested.

      (1) Indeed, we agreed with the reviewer that Vangl2 deficiency did not cause a discernible increase in the cellular level of total p65 after a short time of LPS stimulation in vitro, and p65 could not be targeted to autophagic degradation in a Vangl2 dependent manner within 30 minutes. Given that the protein and mRNA levels of Vangl2 were elevated at 3-6 h of LPS stimulation (Fig. S1 C-E), we extended the stimulation time scale in the revised manuscript. The data (Fig. 2A-D in the revised manuscript) demonstrated that IKK phosphorylation was enhanced in Vangl2 KO myeloid cells during the early phase (within 3 h) of LPS stimulation, but not for the prolonged period of LPS stimulation. The underlying mechanism may be complex. Only phosphorylation of p65 and total endogenous p65 was continuously enhanced after long-term LPS stimulation in Vangl2 KO cells, compared to WT cells. Furthermore, the overexpression of Vangl2 in A549 cells also demonstrated a reduction of phosphorylation and total endogenous p65 (Fig. 2 I, J in the revised manuscript). These findings were corroborated by overexpression and Co-IP experiments, which collectively indicated that Vangl2 regulates the stability of p65 by promoting its interaction with NDP52 and autophagic degradation. (Page 7; Line 183-185).  

      (2) Similarly, the stimulation time scale in Fig 4D was extended, and it was demonstrated that p65 was more stable in Vangl2-deficient cells.

      (3) Moreover, we constructed phosphorylation-defective mutants of p65 (S536A), and found that Vangl2 could also promote the degradation of the p65 phosphorylation mutants (Fig. S4 A, B in the revised manuscript). Thus, Vangl2 promotes the degradation of the basal/unphosphorylated p65. (Page 8, lines 237-240)

      (2) Fig 1A: The data indicates the presence of two subgroups within the sepsis cohort - one with high Vangl2 expressions and the other with relatively normal Vangl2 expression. Was there any difference with respect to NF-κB target inflammatory gene expressions between these subgroups?

      As suggested, we conducted an analysis of NF-kB target inflammatory gene expressions between the high and relatively low Vangl2 expression groups in sepsis patients. The results showed that the serum of the high Vangl2 expression group exhibited lower levels of IL-6, WBC, and CRP than the low Vangl2 expression group, which suggested an inverse correlation between Vangl2 and the inflammatory response (Fig. S1 A in the revised manuscript) (Page 5, lines 126-128).

      (3) The effect of Vangl2 deficiency was rather modest in the neutrophil. Could it be that Vangl2 mediates its effect mostly in macrophages?

      As shown in Fig. S1C-E, the induction of Vangl2 by LPS stimulation is more rapid in macrophages than in neutrophils. This may contribute to its dominant effect in macrophages. Consequently, we primarily focused our investigation on the role of Vangl2 in macrophages.

      (4) Fig 1D and Figure 1E: Data for unstimulated Vangl2 cells should be provided. Also, the source of the IL-1β primary antibody has not been mentioned.

      Thank you for the suggestion. We have updated the data for unstimulated cells in the revised manuscript (Fig. 1 D, E in the revised manuscript). Also, IL-1β primary antibody was purchased from Cell Signaling Technology and the information has been included in the Materials and Methods section (Table S1).

      (5) The relevance and the requirement of RNA-seq analysis are not clear in the present draft. Figure 1E already reveals upregulation of the signature NF-κB target inflammatory genes upon Vangl2 deficiency.

      We agreed with the reviewer that the data presented in Figure 1E demonstrated the upregulation of the signature NF-kB target inflammatory genes upon Vangl2 deficiency in a murine model of LPS induced sepsis. Subsequently, we proceeded to investigate the mechanism by which Vangl2 regulates NF-kB target inflammatory genes at the cellular level in Figure 2. To this end, we performed RNA-seq analysis to screen signal pathways involved in LPS-induced septic shock by comparing LPS-stimulated BMDMs from Vangl2ΔM and WT mice, and identified that TNF signaling pathway and cytokine-cytokine receptor interaction were found to be significantly enriched in Vangl2ΔM BMDMs upon LPS stimulation. This analysis provides further evidence that Vangl2 plays a role in regulating NF-kB signaling pathways and the release of related inflammatory cytokines.

      (6) Fig 2A reveals an increased accumulation of phosphorylated p65 and IKK in Vangl2-deficient macrophages upon LPS stimulation within 30 minutes. However, Vangl2 accumulates at around 60 minutes post-stimulation in WT cells. Similar results were obtained for neutrophils (Fig 2B). There appears to be a temporal disconnect between Vangl2 and phosphorylated p65 accumulation - this must be clarified.

      This concern has been addressed above (see response to question 1 from reviewer #2).

      (7) Figure 2E and 2F do not have untreated controls. Presentations in Fig 2E may be improved to more clearly depict IL6 and TNF data, preferably with separate Y-axes.

      Thank you for the suggestion. We have added untreated controls and separated Y-axes for IL-6 and TNF data in the revised manuscript (Fig. 2 E, F in the revised manuscript).

      (8) Line 219: "strongly with IKKα, p65 and MyD88, and weak" - should be revised.

      We have improved the manuscript as suggested in the revised manuscript (Page 7; Line 203).

      (9) It is not clear why IKKβ was excluded from interaction studies in Fig S3G.

      We added the Co-IP experiment and showed that HA-tagged Vangl2 only interacted with Flag-tagged p65, but not with Flag-tagged IKKβ in 293T cells (Fig S3H). Furthermore, endogenous co-IP immunoblot analyses showed that Vangl2 did not associate with IKKβ (Fig. S3I).

      (10) Fig 3F- In the text, authors mentioned that Vangl2 strongly associates with p65 upon LPS stimulation in BMDM. However, no controls, including input or another p65-interacting protein, were used.

      As the reviewer suggested, we have added input and a positive control (IκBα) in this experiment (Fig. 3F in the revised manuscript). The results demonstrated that the interaction between p65 and IκBα was attenuated, although the total IκBα did not undergo significant degradation over the long-term course of LPS stimulation.

      (11) Figure 4D - Authors claim that Vangl2-deficient BMDMs stabilized the expression of endogenous p65 after LPS treatment. However, p65 levels were particularly constitutively elevated in knockout cells, and LPS signaling did not cause any further upregulation. This again indicates the role of Vangl2 in the basal state. The authors need to explain this and revise the test accordingly.

      Thank you for the reviewer's comments. We repeated the experiment to ascertain whether Vangl2 could stabilize the expression of endogenous p65 before and after LPS treatment. It was found that, due to the extremely low expression of Vangl2 in WT cells in the absence of stimulation, there was no observable difference in the basal level of p65 between WT and Vangl2ΔM cells. However, upon prolonged LPS stimulation, Vangl2 expression was induced, resulting in p65 degradation in WT cells. In contrast, p65 protein was more stable in Vangl2-deficient cells after LPS stimulation (Fig. 4D in the revised manuscript).

      Reviewer #3 (Public Review):

      Lu et al. describe Vangl2 as a negative regulator of inflammation in myeloid cells. The primary mechanism appears to be through binding p65 and promoting its degradation, albeit in an unusual autolysosome/autophagy dependent manner. Overall, the findings are novel and the crosstalk of PCP pathway protein Vangl2 with NF-kappaB is of interest. …….Regardless, Vangl2 as a negative regulator of NF-kappaB is an important finding. There are, however, some concerns about methodology and statistics that need to be addressed.

      Thank you for your comments on our manuscript, and we have further improved the manuscript as suggested.

      (1) Whether PCP is anyway relevant or if this is a PCP-independent function of Vangl2 is not directly explored (the latter appears more likely from the manuscript/discussion). PCP pathways intersect often with developmentally important pathways such as WNT, HH/GLI, Fat-Dachsous and even mechanical tension. It might be of importance to investigate whether Vangl2-dependent NF-kappaB is influenced by developmental pathways.

      Thank you for the reviewer's insightful comments. Our study revealed that Vangl2 recruits the E3 ubiquitin ligase PDLIM2 to facilitate K63-linked ubiquitination of p65, which is subsequently recognized by autophagy receptor NDP52 and then promotes the autophagic degradation of p65. Our findings by using autophagy inhibitors and autophagic-deficient cells indicate that Vangl2 regulates NF-kB signaling through a selective autophagic pathway, rather than affecting the PCP pathway, WNT, HH/GLI, Fat-Dachsous or even mechanical tension. Moreover, a discussion section has been added to the revised version. (Page 12, lines 377-393)

      (2) Are Vangl2 phosphorylations (S5, S82 and S84) in anyway necessary for the observed effects on NF-kappaB or would a phospho-mutant (alanine substitution mutant) Vangl2 phenocopy WT Vangl2 for regulation of NF-kappaB?

      As suggested, we generated phospho-mutants of Vangl2 (S82/84A) and observed that Vangl2 (S82/84A) could still facilitate the degradation of p65 (Fig. S4 B in the revised manuscript), suggesting that Vangl2 regulates the NF-kB pathway independently of its phosphorylation.

      (3) Another area to strengthen might be with regards to specificity of cell types where this phenomenon may be observed. LPS treatment in mice resulted in Vangl2 upregulation in spleen and lymph nodes, but not in lung and liver. What explains the specificity of organ/cell-type Vangl2 upregulation and its consequences observed here? Why is NF-kappaB signaling not more broadly or even ubiquitously affected in all cell types in a Vangl2-dependent manner, rather than being restricted to macrophages, neutrophils and peritoneal macrophages, or, for that matter, in spleen and LN and not liver and lung? After all, one may think that the PCP proteins, as well as NF-kappaB, are ubiquitous.

      Thank you for the reviewer's comments.

      (1) LPS is an important mediator to trigger sepsis with excessive immune activation. As is well known, the spleen and lymph nodes are important peripheral immune organs, where immune cells (e.g., macrophages) are abundant and respond sensitively to LPS stimulation. Nevertheless, immune cells represent a minor fraction of the lungs and liver. Consequently, Vangl2 represents a pivotal regulator of immune function, exhibiting a more pronounced increase in the immune organs and cells.

      (2) Induction of Vangl2 expression by LPS stimulation is cell specific. Given that different cells exhibit varying protein abundances, the molecular events involved may also differ. Moreover, we observed high Vangl2 expression in the liver at the basal state (Author response image 1), whereas it was not induced after 12 h of LPS stimulation. Therefore, the functional role of Vangl2 exhibits a significant phenotype in macrophages and neutrophils/spleen and LN, rather than in liver or lung cells.

      Author response image 1.

      Vangl2 showed no significant changes in the liver after LPS treatment. Mice (n≥3) were treated with LPS (30 mg/kg, i.p.). Livers were collected at 12 h after LPS treatment. Immunoblot analysis of Vangl2.

      Recommendations For The Authors:

      Reviewer #1 (Recommendations For The Authors):

      General points:

      Figure 4G- panels appear mislabeled. Pl correct.

      We have corrected this mislabeling as you suggested.

      The dynamics of Vangl2 interaction with p65 and autophagy adaptors is not clear/apparent. For example, Vangl2 expression destabilises p65 levels (as in Fig. 4), but in Fig. 5, it seems there is no decline in the p65 protein level, and a large fraction of it coprecipitates with NDP52.

      We appreciate the reviewer’s comments. In the co-IP assay, we used the lysosomal inhibitor CQ to inhibit p65 degradation to observe the interaction between p65 and NDP52 or Vangl2.

      Fig 5E- I would expect p65 levels to be lower in WT cells than Vangl2 KO cells. But as such, there is no difference between the two.

      We appreciate the reviewer’s comments. We repeated the experiments and updated the data. Firstly, Vangl2 was not induced in WT cells in the absence of LPS stimulation, thus there was no difference in p65 expression between the two groups at the basal level. Secondly, we used CQ/Baf-A1 to inhibit the degradation of Vangl2 in the co-IP assay to observe the interaction between p65 and other molecules.

      Reviewer #2 (Recommendations For The Authors):

      A few points that can be looked at and revised.

      (1) Quantification of the presented data is needed for Fig 4D and Fig 4E.

      We added the quantification analysis as suggested.  

      (2) The labeling of Fig 4G should be scrutinized.

      We have corrected this mislabeling as you suggested.

      (3) Fig 6B and Fig 6C should be explained in the result section more elaborately.

      We thank the reviewer for the suggestion, and we have rephrased this sentence to better describe the results. (Page 10, lines 306-313)

      (4) Line 85: "Vangl2 mediated downstream of Toll-like or interleukin (IL)-1" - unclear.

      We appreciate the reviewer’s comments on our manuscript, and we have further improved the manuscript as suggested in the revised manuscript. (Page 3, lines 68)

      (5) Line 181: "mice. Differentially expression analysis" - this should be revised.

      We appreciate the reviewer’s comments on our manuscript, and we have further improved the manuscript as suggested in the revised manuscript. (Page 11, lines 323)

      (6) Line 261-264- CHX-chase assay showed the degradation rate of p65 in Vangl2-deficient BMDM was slower compared with WT cells. However, Vangl2 is not induced in WT BMDMs upon CHX treatment (Fig. S4B).

      We appreciate the reviewer’s comments on our manuscript, and we have further improved the manuscript as suggested in the revised manuscript (Fig. S4D).

      (7) Finally, some editing to provide data only critical for the conclusions could improve the ease of reading.

      We have further improved the manuscript as suggested in the revised manuscript.

      Reviewer #3 (Recommendations For The Authors):

      Comments (general, please address at least in Discussion. Some experimental data, for example the role, if any, of Vangl2 phosphorylations will be very useful):

      (1) It might be interesting to explore whether there are any potential effects of developmental pathways on the observed effect mediated by Vangl2 or if the effects are entirely a PCP-independent function of Vangl2. Please see above public review.

      Thank you for the reviewer's insightful comments. Our study revealed that Vangl2 recruits the E3 ubiquitin ligase PDLIM2 to facilitate K63-linked ubiquitination of p65, which is subsequently recognized by autophagy receptor NDP52 and then promotes the autophagic degradation of p65. Our findings by using autophagy inhibitors and autophagic-deficient cells indicate that Vangl2 regulates NF-kB signaling through a selective autophagic pathway, rather than affecting the PCP pathway, WNT, HH/GLI, Fat-Dachsous or even mechanical tension. Furthermore, we generated phospho-mutants of Vangl2 (S82/84A) and observed that Vangl2 (S82/84A) could still facilitate the degradation of p65 (Fig. S4 B), suggesting that Vangl2 regulates the NF-kB pathway independently of its phosphorylation. In addition, a discussion section has been added to the revised version. (Page 12, lines 377-393)

      (2) What explains the specificity of organ/cell-type Vangl2 upregulation and its consequences observed here? Why is NF-kappaB signaling not more broadly or even ubiquitously affected in all cell types in a Vangl2-dependent manner, rather than being restricted to macrophages, neutrophils and peritoneal macrophages, or, for that matter, in spleen and LN and not liver and lung? Afterall, one may think that the PCP proteins, as well as NF-kappaB, are ubiquitous.

      Thank you for the reviewer's comments. A similar question has been addressed above (refer to the response to question 3 of reviewer 3).

      (3) Another specificity-related question that comes to mind is whether the Vangl2 function in autolysomal/autophagic degradation is restricted to p65 as the exclusive substrate? The cytosolic targeting of p65 as opposed to the more well-known nuclear-targeting is interesting.

      Our previous finding demonstrated that Vangl2 inhibits antiviral IFN-I signaling by targeting TBK1 for autophagic degradation (doi: 10.1126/sciadv.adg2339), thereby indicating that p65 is not the sole substrate for Vangl2. However, in the NF-kB pathway, p65 is a specific substrate for Vangl2. Moreover, our findings indicate that the interaction between Vangl2 and p65 occurs predominantly in the cytoplasm, rather than in the nucleus (Fig. S4 C).

      (4) Pharmacological approach is used to tease apart autolysosome versus proteasome pathway. What is the physiological importance of autophagic degradation? It is interesting to note that Vangl2 was already previously implicated in degrading LAMP-2A and increasing chaperon-mediated autophagy (CMA)-lysosome numbers (PMID: 34214490).

      Previous literature has demonstrated that Vangl2 can inhibit CMA degradation (PMID: 34214490). In our study, however, we found that Vangl2 can promote the selective autophagic degradation of p65. It is important to note that CMA degradation and selective autophagic degradation are two distinct degradation modes, so these findings are not contradictory.

      (5) Are these phenotypes discernable in heterozygotes or only when ablated in homozygosity? Any phenotypes recapitulated in the looptail heterozygote mice?

      We found that these phenotypes were discernible only in homozygotes.

      (6) What is the conservation of the Vangl2 p65-interaction site between Vangl2 and Vangl1? PDLIM2 recruitment between Vangl2 and Vangl1?

      We appreciate the reviewer’s comments on our manuscript. Previous studies have shown that human Vangl1 and Vangl2 share only 72% identity and exhibit distinct functional properties (doi: 10.1530/ERC-14-0141). Thus, the interaction of Vangl2 with p65 and PDLIM2 recruitment may not necessarily occur in Vangl1.

      Comments (specific to experiments and data analyses. Please address the following):

      (7) The patient population used in Fig 1 is not described in the Methods. This is a critical omission. Were age, sex etc. controlled for between healthy and disease? How was the diagnosis made? What times during sepsis were the samples collected? As presented, this data is impossible to evaluate and interpret.

      We appreciate the reviewer’s comments on our manuscript, and we have further improved the manuscript as suggested in the revised supplement materials. (Supplementary information, Page 12, lines 146-147)

      (8) In general, the statistical method should be described for each experiment presented in the figures. Comparisons should not be made only at the time point with maximal difference (such as in Fig 1F or Fig 2C, but at all time points using appropriate statistical methods). The sample size should also be included to allow determination appropriateness of parametric or non-parametric tests.

      We appreciate the reviewer’s comments on our manuscript, and we have further improved the manuscript as suggested in the revised manuscript (Figures 1F and 2C).

      (9) PCP pathways can activate p62/SQSTM1 or JNK via RhoA. JNK activation should be tested experimentally.

      According to the reviewer's comments, we further examined the effect of Vangl2 on the JNK pathway. The results showed that Vangl2 did not affect the JNK pathway (Author response image 2). This suggests that Vangl2 functions independently of the PCP pathway.

      Author response image 2.

      Vangl2 did not affect the JNK pathway. WT and Vangl2-deficient (n≥3) BMDMs were stimulated with LPS (100 ng/ml) for the indicated times. Immunoblot analysis of total and phosphorylated JNK.

      (10) Why are different cells such as A549, HEK293, CHO, 293T, THP-1 used during the studies for different experiments? Consistency would improve rigor. At least, logical explanation driving the cell type of choice for each experiment should be included in the manuscript. Nonetheless, one aspect of using a panel of cell lines indicate that the effect of Vangl2 on NF-kappa B is pleiotropic.

      We are grateful to the reviewer for their comments on our manuscript. A549, HEK293, CHO, and 293T cells are commonly utilized in protein-protein interaction studies. The selection of cell lines for overexpression (exogenous) experiments depends on their transfection efficiency and their ability to express TLR4 (the receptor for LPS). Additionally, we conducted endogenous experiments using THP-1 cells and BMDMs, which are a human macrophage cell line and murine primary macrophages, respectively. Moreover, we generated Vangl2f/f lyz-cre mice by specifically knocking out Vangl2 in myeloid cells, and investigated the effect of Vangl2 on NF-kB signaling in vivo.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This useful study examined the associations of a healthy lifestyle with comprehensive and organ-specific biological ages defined using common blood biomarkers and body measures. Its large sample size, longitudinal design, and robust statistical analysis provide solid support for the findings, which will be of interest to epidemiologists and clinicians.

      Thank you very much for your thoughtful review of our manuscript. Your valuable comments have greatly helped us improve our manuscript. We have carefully considered all the comments and suggestions made by the reviewers and have revised them to address each point. Below, we provide detailed responses to each of the reviewers' comments. Please note that the line numbers mentioned in the following responses correspond to the line numbers in the clean version of the manuscript.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      This study was to examine the associations of a healthy lifestyle with comprehensive and organ-specific biological ages. It emphasized the importance of lifestyle factors in biological ages, which were defined using common blood biomarkers and body measures.

      Strengths:

      The data were from a large cohort study and defined comprehensive and six-specified biological ages.

      Weaknesses:

      (1) Since only 8.5% of participants from the CMEC (China Multi-Ethnic Cohort Study) were included in the study, has any selection bias happened?

      Thank you for your valuable question. We understand the concern regarding potential selection bias, given that only 8.5% of participants were included in the study. The baseline survey of the China Multi-Ethnic Cohort Study (CMEC) employed a rigorous multi-stage stratified cluster sampling method, and the repeat survey reevaluated approximately 10% of baseline participants through community-based cluster random sampling. Therefore, the sample of the repeat survey is representative. The second reason for the reduced sample size was the availability of biomarkers for BA calculation. We compared the characteristics of the overall population, the population included in this study, and the population excluded from it. Most characteristics were similar, but participants included in this study had better values for some health-related variables; one potential reason is that healthier individuals were more likely to complete the follow-up survey. In conclusion, we believe that the impact of selection bias is limited.

      Author response table 1.

      Baseline characteristics of participants included and not included in the study

      BA, biological age; BMI, body mass index; CVD, cardiovascular disease; HLI, healthy lifestyle indicator.

      1 Data are presented as median (25th, 75th percentile) for continuous variables and count (percentage) for categorical variables.

      2 For HLI, "healthy" corresponds to a score of 4-5.

      3 Information on each validated BA has been reported. BA acceleration is the difference between each BA and CA in the same survey.

      (2) The authors should specify the efficiency of FFQ. How can FFQ genuinely reflect the actual intake? Moreover, how was the aMED calculated?

      Thank you for the comments and questions. We appreciate the opportunity to clarify these aspects of our study. For the first question, we evaluated the FFQ's reproducibility and validity by conducting repeated FFQs and 24-hour dietary recalls at the baseline survey. Intraclass correlation coefficients (ICC) for reproducibility ranged from 0.15 for fresh vegetables to 0.67 for alcohol, while deattenuated Spearman rank correlations for validity ranged from 0.10 for soybean products to 0.66 for rice. More details are provided in our previous study (Lancet Reg Health West Pac, 2021). We have added the corresponding content in both the main text and the supplementary materials.

      Methods, Page 8, lines 145-146: “The FFQ's reproducibility and validity were evaluated by conducting repeated FFQs and 24-hour dietary recalls.”

      Supplementary methods, Dietary assessment: “We evaluated the FFQ's reproducibility and validity by conducting repeated FFQs and 24-hour dietary recalls. Intraclass correlation coefficients for reproducibility ranged from 0.15 for fresh vegetables to 0.67 for alcohol, while deattenuated Spearman rank correlations for validity ranged from 0.10 for soybean products to 0.66 for rice.”

      For the second question, we apologize for any confusion. To avoid taking up too much space in the main text, we decided not to include the detailed aMED calculation (as described in Circulation, 2009) there and instead placed it in the supplementary materials:

      “Our calculated aMED score incorporates eight components: vegetables, legumes, fruits, whole grains, fish, the ratio of monounsaturated fatty acids (MUFA) to saturated fatty acids (SFA), red and processed meats, and alcohol. Each component's consumption was divided into sex-specific quintiles. Scores ranging from 1 to 5 were assigned based on quintile rankings to each component, except for red and processed meats and alcohol, for which the scoring was inverted. The alcohol criteria for the aMED was defined as moderate consumption. Since the healthy lifestyle index (HLI) already contained a drinking component, we removed the drinking item in the aMED, which had a score range of 7-35 with a higher score reflecting better adherence to the overall Mediterranean dietary pattern. We defined individuals with aMED scores ≥ population median as healthy diets.”

      Reference:

      (1) Xiao X, Qin Z, Lv X, Dai Y, Ciren Z, Yangla Y, et al. Dietary patterns and cardiometabolic risks in diverse less-developed ethnic minority regions: results from the China Multi-Ethnic Cohort (CMEC) Study. Lancet Reg Health West Pac. 2021;15:100252. doi: 10.1016/j.lanwpc.2021.100252.

      (2) Fung TT, Rexrode KM, Mantzoros CS, Manson JE, Willett WC, Hu FB. Mediterranean diet and incidence of and mortality from coronary heart disease and stroke in women. Circulation. 2009;119(8):1093-100. doi: 10.1161/circulationaha.108.816736.
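
      For illustration only (this is not the code used in the study), the aMED scoring described above can be sketched in Python. The field names are hypothetical, and assigning quintiles by rank position with ties broken by input order is an assumption; the source states only that consumption was divided into sex-specific quintiles.

```python
from statistics import median

# Hypothetical field names for the seven components kept after the
# alcohol item was removed from the aMED.
COMPONENTS = [
    "vegetables", "legumes", "fruits", "whole_grains",
    "fish", "mufa_sfa_ratio", "red_processed_meat",
]
# Components whose scoring is inverted (higher intake scores lower).
INVERTED = {"red_processed_meat"}


def quintile_ranks(values):
    """Return a 1..5 quintile rank per value (ties broken by input order)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, i in enumerate(order):
        ranks[i] = position * 5 // len(values) + 1
    return ranks


def amed_scores(participants):
    """Return (total scores, healthy-diet flags) for a list of dict records."""
    totals = [0] * len(participants)
    for sex in {p["sex"] for p in participants}:
        idx = [i for i, p in enumerate(participants) if p["sex"] == sex]
        for comp in COMPONENTS:
            ranks = quintile_ranks([participants[i][comp] for i in idx])
            for i, rank in zip(idx, ranks):
                totals[i] += (6 - rank) if comp in INVERTED else rank
    cutoff = median(totals)  # healthy diet: score >= population median
    return totals, [t >= cutoff for t in totals]
```

      With seven components each scored 1 to 5, the total falls in the 7-35 range quoted above.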

      (3) HLI (range) and HLI (category) should be clearly defined.

      Thank you for the comment. We have added the definition of HLI (range) and HLI (category) in the methods section:

      Methods P9 lines 165-170: “The HLI was calculated by directly adding up the five lifestyle scores, ranging from 0-5, with a higher score representing an overall healthier lifestyle, denoted as HLI (range) in the following text. We then transformed HLI into a dichotomous variable in this study, denoted as HLI (category), where a score of 4-5 for HLI was considered a healthy lifestyle, and a score of 0-3 was considered an unfavorable lifestyle that could be improved.”
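
      As a minimal sketch of the definition quoted above (the function name and input format are illustrative assumptions), HLI (range) and HLI (category) can be derived from the five binary lifestyle scores as follows:

```python
def healthy_lifestyle_index(lifestyle_scores):
    """Return (HLI range, HLI category) from five 0/1 lifestyle scores."""
    if len(lifestyle_scores) != 5 or any(s not in (0, 1) for s in lifestyle_scores):
        raise ValueError("expected exactly five scores, each 0 or 1")
    hli_range = sum(lifestyle_scores)  # 0-5, higher is healthier
    # A score of 4-5 is "healthy"; 0-3 is "unfavorable".
    hli_category = "healthy" if hli_range >= 4 else "unfavorable"
    return hli_range, hli_category
```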

      (4) The comprehensive rationale and each specific BA construction should be clearly defined and discussed. For example, can cardiopulmonary BA be reflected only by using cardiopulmonary status? I do not think so.

      Thank you for the opportunity to clarify. We constructed the comprehensive BA based on all the available biochemical data from the CMEC study, selecting aging-related markers (J Gerontol A Biol Sci Med Sci, 2021), and further constructed organ-specific BAs based on these selected biomarkers. The KDM algorithm does not specify biomarker types but requires them to be correlated with chronological age (CA) (Mech Ageing Dev, 2006). Existing studies typically construct BA based on available biomarkers; we included 15 biomarkers in this study, which could be considered comprehensive and extensive compared to previous research (J Transl Med. 2023; J Am Heart Assoc. 2024; Nat Cardiovasc Res. 2024). To select the biomarkers for each organ-specific BA, we categorized biomarkers primarily based on their relevance to the structure and function of each organ system according to the classification in previous studies (Nat Med, 2023; Cell Rep, 2022). Since the biomarkers we used came from clinical-lab data sets, they were categorized based on the clinical interpretation of blood chemistry tests following the methods outlined in the two referenced papers (Nat Med, 2023; Cell Rep, 2022). We only used biomarkers directly related to each specific system to minimize overlap between the indicators used for different BAs, thereby preserving the distinctiveness of organ-specific BAs. We acknowledge the limitations of this approach: a few biomarkers may not fully capture the complete aging process of a system, and certain indicators may be missing due to data constraints. However, the multi-organ BAs we constructed are cost-effective, easy to implement, and have been validated, making them valuable despite the limitations.

      Reference:

      (1) Verschoor CP, Belsky DW, Ma J, Cohen AA, Griffith LE, Raina P. Comparing Biological Age Estimates Using Domain-Specific Measures From the Canadian Longitudinal Study on Aging. J Gerontol A Biol Sci Med Sci. 2021;76(2):187-94. doi: 10.1093/gerona/glaa151.

      (2) Klemera P, Doubal S. A new approach to the concept and computation of biological age. Mech Ageing Dev. 2006;127(3):240-8. doi: 10.1016/j.mad.2005.10.004

      (3) Zhang R, Wu M, Zhang W, Liu X, Pu J, Wei T, et al. Association between life's essential 8 and biological ageing among US adults. J Transl Med. 2023;21(1):622. doi: 10.1186/s12967-023-04495-8.

      (4) Forrester SN, Baek J, Hou L, Roger V, Kiefe CI. A Comparison of 5 Measures of Accelerated Biological Aging and Their Association With Incident Cardiovascular Disease: The CARDIA Study. J Am Heart Assoc. 2024;13(8):e032847. doi: 10.1161/jaha.123.032847.

      (5) Jiang M, Tian S, Liu S, Wang Y, Guo X, Huang T, Lin X, Belsky DW, Baccarelli AA, Gao X. Accelerated biological aging elevates the risk of cardiometabolic multimorbidity and mortality. Nat Cardiovasc Res. 2024;3(3):332-42. doi: 10.1038/s44161-024-00438-8.

      (6) Tian YE, Cropley V, Maier AB, Lautenschlager NT, Breakspear M, Zalesky A. Heterogeneous aging across multiple organ systems and prediction of chronic disease and mortality. Nat Med. 2023;29(5):1221-31. doi: 10.1038/s41591-023-02296-6.

      (7) Nie C, Li Y, Li R, Yan Y, Zhang D, Li T, et al. Distinct biological ages of organs and systems identified from a multi-omics study. Cell Rep. 2022;38(10):110459. doi: 10.1016/j.celrep.2022.110459.
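
      To make the KDM step concrete, the following Python sketch shows the commonly cited form of the Klemera-Doubal estimator together with BA acceleration (BA minus CA). It is an illustration under stated assumptions, not the code used in the study: each biomarker is regressed linearly on CA, the function names are hypothetical, and the scale parameter `s_ba` is taken as a user-supplied input rather than estimated.

```python
def fit_marker_on_age(ages, values):
    """Least-squares fit of one biomarker on CA.

    Returns (intercept q, slope k, residual standard deviation s).
    """
    n = len(ages)
    mean_a = sum(ages) / n
    mean_v = sum(values) / n
    sxx = sum((a - mean_a) ** 2 for a in ages)
    slope = sum((a - mean_a) * (v - mean_v) for a, v in zip(ages, values)) / sxx
    intercept = mean_v - slope * mean_a
    resid_ss = sum((v - intercept - slope * a) ** 2 for a, v in zip(ages, values))
    return intercept, slope, (resid_ss / (n - 2)) ** 0.5


def kdm_biological_age(marker_values, fits, ca=None, s_ba=None):
    """KDM estimate for one person.

    marker_values: the person's biomarker values.
    fits: matching (q, k, s) tuples from fit_marker_on_age.
    If both ca and s_ba are given, CA enters as an extra term;
    s_ba is an assumed input here, not estimated from data.
    """
    numerator = sum((x - q) * k / s ** 2 for x, (q, k, s) in zip(marker_values, fits))
    denominator = sum((k / s) ** 2 for _, k, s in fits)
    if ca is not None and s_ba is not None:
        numerator += ca / s_ba ** 2
        denominator += 1 / s_ba ** 2
    return numerator / denominator


def ba_acceleration(ba, ca):
    """BA acceleration: BA minus CA from the same survey."""
    return ba - ca
```

      A person whose biomarkers sit exactly on the fitted age trends gets a BA equal to their CA, and therefore zero BA acceleration.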

      (5) The lifestyle index is defined based on an equal-weight approach, but this does not reflect reality and cannot fully answer the research questions it raises.

      Thank you very much for your valuable suggestion. We used equal weight healthy lifestyle index (HLI) partly to facilitate comparisons with other studies. The equal-weight approach to construct the HLI is commonly used in current research (Bmj, 2021; Diabetes Care. 2022; Arch Gerontol Geriatr. 2022). The equal-weight HLI can demonstrate the average benefit of adopting each additional healthy lifestyle and avoid assumptions about the relative importance of different behaviors, which may vary depending on the population. To further clarify the importance of each lifestyle factor, we conducted quantile G-computation analysis, which can reflect the weight differences between lifestyle factors (PLoS Med, 2020; Clin Epigenetics, 2022).

      Reference:

      (1) Zhang YB, Chen C, Pan XF, Guo J, Li Y, Franco OH, Liu G, Pan A. Associations of healthy lifestyle and socioeconomic status with mortality and incident cardiovascular disease: two prospective cohort studies. Bmj. 2021;373:n604. doi: 10.1136/bmj.n604.

      (2) Han H, Cao Y, Feng C, Zheng Y, Dhana K, Zhu S, Shang C, Yuan C, Zong G. Association of a Healthy Lifestyle With All-Cause and Cause-Specific Mortality Among Individuals With Type 2 Diabetes: A Prospective Study in UK Biobank. Diabetes Care. 2022;45(2):319-29. doi: 10.2337/dc21-1512.

      (3) Jin S, Li C, Cao X, Chen C, Ye Z, Liu Z. Association of lifestyle with mortality and the mediating role of aging among older adults in China. Arch Gerontol Geriatr. 2022;98:104559. doi: 10.1016/j.archger.2021.104559.

      (4) Chudasama YV, Khunti K, Gillies CL, Dhalwani NN, Davies MJ, Yates T, Zaccardi F. Healthy lifestyle and life expectancy in people with multimorbidity in the UK Biobank: A longitudinal cohort study. PLoS Med. 2020;17(9):e1003332. doi: 10.1371/journal.pmed.1003332.

      (5) Kim K, Zheng Y, Joyce BT, Jiang H, Greenland P, Jacobs DR, Jr., et al. Relative contributions of six lifestyle- and health-related exposures to epigenetic aging: the Coronary Artery Risk Development in Young Adults (CARDIA) Study. Clin Epigenetics. 2022;14(1):85. doi: 10.1186/s13148-022-01304-9.

      Reviewer #2 (Public Review):

      This interesting study focuses on the association between lifestyle factors and comprehensive and organ-specific biological aging in a multi-ethnic cohort from Southwest China. It stands out for its large sample size, longitudinal design, and robust statistical analysis.

      Some issues deserve clarification to enhance this paper:

      (1) How were the biochemical indicators for organ-specific biological ages chosen, and are these indicators appropriate? Additionally, a more detailed description of the multi-organ biological ages should be provided to help understand the distribution and characteristics of BAs.

      We thank you for raising this point. As explained in our response to the fourth question from the first reviewer, we constructed the comprehensive BA based on all the available biochemical data from the CMEC study, selecting aging-related markers (J Gerontol A Biol Sci Med Sci, 2021), and further constructed organ-specific BAs based on these selected biomarkers. The KDM algorithm does not specify biomarker types but requires them to be correlated with chronological age (CA) (Mech Ageing Dev, 2006). Existing studies typically construct BA based on available biomarkers; we included 15 biomarkers in this study, which could be considered comprehensive and extensive compared to previous research (J Transl Med. 2023; J Am Heart Assoc. 2024; Nat Cardiovasc Res. 2024). To select the biomarkers for each organ-specific BA, we categorized biomarkers primarily based on their relevance to the structure and function of each organ system according to the classification in previous studies (Nat Med, 2023; Cell Rep, 2022). Since the biomarkers we used came from clinical-lab data sets, they were categorized based on the clinical interpretation of blood chemistry tests (Nat Med, 2023). We only used biomarkers directly related to each specific system to minimize overlap between the indicators used for different BAs, thereby preserving the distinctiveness of organ-specific BAs.

      We have added a descriptive table for the comprehensive and organ systems BAs in the supplementary materials to provide a more detailed understanding of the distribution and characteristics of BAs:

      Author response table 2.

      Description of BA and BA acceleration1

      BA, biological age

      1 Data are presented as mean (standard deviation).

      (2) The authors categorized the HLI score into a dichotomous variable, which may cause a loss of information. How did the authors address this potential issue?

      Thank you for raising this concern. We categorized each lifestyle factor into a binary variable based on relevant guidelines and studies, which recommend assigning a score of 1 if the guideline or study recommendations are met (Bmj, 2021; J Am Heart Assoc, 2023). While dichotomization may lead to some loss of information, it allows for a clearer interpretation and comparison of adherence to ideal healthy lifestyle behaviors. Another advantage of this treatment is that it allows for easy comparison with other studies. We categorized the HLI score into a dichotomous variable to enhance the practical relevance of the results (J Gerontol A Biol Sci Med Sci, 2021). Additionally, we conducted analyses using the continuous HLI score to ensure that our findings were robust, and the results were consistent with those obtained using the dichotomous HLI.

      Reference:

      (1) Verschoor CP, Belsky DW, Ma J, Cohen AA, Griffith LE, Raina P. Comparing Biological Age Estimates Using Domain-Specific Measures From the Canadian Longitudinal Study on Aging. J Gerontol A Biol Sci Med Sci. 2021;76(2):187-94. doi: 10.1093/gerona/glaa151.

      (2) Klemera P, Doubal S. A new approach to the concept and computation of biological age. Mech Ageing Dev. 2006;127(3):240-8. doi: 10.1016/j.mad.2005.10.004

      (3) Zhang R, Wu M, Zhang W, Liu X, Pu J, Wei T, et al. Association between life's essential 8 and biological ageing among US adults. J Transl Med. 2023;21(1):622. doi: 10.1186/s12967-023-04495-8.

      (4) Forrester SN, Baek J, Hou L, Roger V, Kiefe CI. A Comparison of 5 Measures of Accelerated Biological Aging and Their Association With Incident Cardiovascular Disease: The CARDIA Study. J Am Heart Assoc. 2024;13(8):e032847. doi: 10.1161/jaha.123.032847.

      (5) Jiang M, Tian S, Liu S, Wang Y, Guo X, Huang T, Lin X, Belsky DW, Baccarelli AA, Gao X. Accelerated biological aging elevates the risk of cardiometabolic multimorbidity and mortality. Nat Cardiovasc Res. 2024;3(3):332-42. doi: 10.1038/s44161-024-00438-8.

      (6) Tian YE, Cropley V, Maier AB, Lautenschlager NT, Breakspear M, Zalesky A. Heterogeneous aging across multiple organ systems and prediction of chronic disease and mortality. Nat Med. 2023;29(5):1221-31. doi: 10.1038/s41591-023-02296-6.

      (7) Nie C, Li Y, Li R, Yan Y, Zhang D, Li T, et al. Distinct biological ages of organs and systems identified from a multi-omics study. Cell Rep. 2022;38(10):110459. doi: 10.1016/j.celrep.2022.110459.

      (3) Because lifestyle data are self-reported, they may suffer from recall bias. This issue needs to be addressed in the limitations section.

      Thank you for your valuable suggestion. We acknowledge that the use of self-reported lifestyle data in our study may introduce recall bias, potentially affecting the accuracy of the information collected. We have added the following statement to the limitations section of our manuscript:

      Discussion, Page 22, lines 463-464: “Fifth, assessment of lifestyle factors was based on self-reported data collected through questionnaires, which may be subject to recall bias.”

      (4) It should be clarified whether the adjusted CA is the baseline value of CA. Additionally, why did the authors choose models with additional adjustments for time-invariant variables as their primary analysis? This approach does not align with standard FEM analysis (Lines 261-263).

      Thank you for the opportunity to clarify. We have changed the sentence to “baseline CA”. For the second question, in a standard fixed effects model (FEM), only time-varying variables are typically included. However, to enhance the flexibility of our models and account for potential variations in the association of time-invariant variables with CA, as has been commonly done in previous studies, we additionally adjusted for time-invariant variables and the baseline value of CA (BMC Med Res Methodol, 2024; Am J Clin Nutr, 2020). Moreover, sensitivity analyses using the standard FEM were conducted in this study, and robust results were obtained.

      Reference:

      (1) Tang D, Hu Y, Zhang N, Xiao X, Zhao X. Change analysis for intermediate disease markers in nutritional epidemiology: a causal inference perspective. BMC Med Res Methodol. 2024;24(1):49. doi: 10.1186/s12874-024-02167-9.

      (2) Trichia E, Luben R, Khaw KT, Wareham NJ, Imamura F, Forouhi NG. The associations of longitudinal changes in consumption of total and types of dairy products and markers of metabolic risk and adiposity: findings from the European Investigation into Cancer and Nutrition (EPIC)-Norfolk study, United Kingdom. Am J Clin Nutr. 2020;111(5):1018-26. doi: 10.1093/ajcn/nqz335.
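
      As background for the FEM discussion above, the sketch below illustrates the standard within (fixed effects) estimator for a single time-varying exposure. It is a simplified illustration with hypothetical names, not the model fitted in the study: demeaning each participant's observations removes anything constant within a person, which is why time-invariant covariates drop out of a standard FEM.

```python
def within_estimator(panel):
    """Fixed effects slope for one exposure.

    panel: dict mapping participant id -> list of (exposure, outcome)
    pairs, one per survey wave. At least one participant must have an
    exposure that changes between waves.
    """
    numerator = 0.0
    denominator = 0.0
    for observations in panel.values():
        mean_x = sum(x for x, _ in observations) / len(observations)
        mean_y = sum(y for _, y in observations) / len(observations)
        for x, y in observations:
            # Person-level means absorb all time-invariant characteristics.
            numerator += (x - mean_x) * (y - mean_y)
            denominator += (x - mean_x) ** 2
    return numerator / denominator
```

      In the usage case below, two participants have very different baseline levels but share the same within-person slope, and the estimator recovers that slope regardless of the baseline difference.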

      (5) How is the relative contribution calculated in the QGC analysis? The relative contribution of some lifestyle factors is not shown in Figure 2 and the supplementary figures, such as Supplementary Figure 7. These omissions should be explained.

      Thanks for the questions. The QGC obtains causal relationships and estimates weights for each component, which has been widely used in epidemiological research. More details about QGC can be found in the supplementary methods. The reason some results are not displayed is that we assumed all healthy lifestyle changes would have a protective effect on BA acceleration. However, the effect size of some lifestyle factors did not align with this assumption and lacked statistical significance. Because positive and negative weights were calculated separately in QGC, with all positive weights summing to 1 and all negative weights summing to 1, these factors would have had large positive weights. To avoid potential misunderstandings, we chose not to include these results in the figures. We have added explanations to the figure legends where applicable:

      “The blue bars represent results that are statistically significant in the FEM analysis, while the gray bars represent results in the FEM analysis that were not found to be statistically significant and positive weights were not shown.”
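
      The weight normalization described above can be illustrated with a short Python sketch. This is an assumption-labeled illustration of how positive and negative weights are scaled separately, with hypothetical names and made-up coefficients; it is not the QGC implementation used in the study.

```python
def qgc_style_weights(coefficients):
    """Scale coefficients into separate positive and negative weights.

    coefficients: dict mapping factor name -> estimated coefficient.
    Returns dict mapping name -> (direction, weight), where the weights
    within each direction sum to 1.
    """
    positive_total = sum(b for b in coefficients.values() if b > 0)
    negative_total = sum(b for b in coefficients.values() if b < 0)
    weights = {}
    for name, b in coefficients.items():
        if b > 0:
            weights[name] = ("positive", b / positive_total)
        elif b < 0:
            weights[name] = ("negative", b / negative_total)
        else:
            weights[name] = ("none", 0.0)
    return weights
```

      For example, with hypothetical coefficients of -0.3, -0.1 and 0.02, the single small positive coefficient receives a positive weight of 1 because it is the only factor in that direction. This shows how a weak, non-significant factor can appear to have a large weight.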

      Recommendations for the authors:

      Reviewer #2 (Recommendations For The Authors):

      To enhance this paper, some issues deserve clarification:

      (1) How were the biochemical indicators for organ-specific biological ages chosen, and are these indicators appropriate? Additionally, please provide a more detailed description of the multi-organ biological ages to help understand the distribution and characteristics of BAs.

      (2) The authors categorized the HLI score into a dichotomous variable, which may cause a loss of information. How did the authors address this potential issue?

      (3) Because lifestyle data are self-reported, they may suffer from recall bias. This issue needs to be addressed in the limitations section.

      (4) Lines 261-263: Please clarify if the adjusted CA is the baseline value of CA. Additionally, why did you choose models with additional adjustments for time-invariant variables as your primary analysis? This approach does not align with standard FEM analysis.

      (5) How is the relative contribution calculated in the QGC analysis? The relative contribution of some lifestyle factors is not shown in Figure 2 and the supplementary figures, such as Supplementary Figure 7. Please explain these omissions.

      The above five issues overlap with those raised by Reviewer #2 (Public Review). Please refer to the responses provided earlier.

      Minor revision:

      Line 50: The expression "which factors" should be changed to "which lifestyle factor."

      Thank you for the suggestion. As suggested, we have used “which lifestyle factor” instead.

      Lines 91-92: "Aging exhibits variations across and with individuals" appears to be a clerical error. According to the context, it should be "Aging exhibits variations across and within individuals."

      We thank the reviewer for the correction. We have updated the text to read:

      “Aging exhibits variations across and within individuals.”

      Line 154: The authors mentioned "Considering previous studies" but lacked references. Please add the appropriate citations.

      Thank you for pointing this out. We apologize for the oversight. We have now added the appropriate citations to support the statement "Considering previous studies" in the revised manuscript.

      Lines 170-171: "regular exercise ("12 times/week", "3-5 times/week," or "daily or almost every day")"; the first item in parentheses should be "1-2 times/week"? Please verify and correct if necessary. Additionally, check the entire text carefully to avoid confusion caused by clerical errors.

      Thank you for your careful review. We have changed the sentence to "1-2 times/week." We have thoroughly checked the entire manuscript to ensure that no other clerical errors remain.

      Clarifications for Table 1:

      i. The expression "HLI=0" is difficult to understand. Please provide a more straightforward explanation or rephrase it.

      Thank you for your feedback. We have removed the confusing expression and provided a clearer explanation in the table legend for better understanding:

      “For HLI (category), "healthy" corresponds to a score of 4-5, while "unfavorable" corresponds to a score of 0-3.”

      ii. The baseline age is presented as an integer, but the follow-up age is not. Please clarify this discrepancy.

      Thank you for pointing out this discrepancy. We calculated the precise chronological age based on participants' survey dates and birth dates for the biological age calculations. Initially, the table presented age as integers, but we have now updated it to show the precise ages.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1:

      Despite the strengths, multiple analytical decisions have to be explained, justified, or clarified. Also, there is scope to enhance the clarity and coherence of the writing - as it stands, readers will have to go back and forth to search for information. Last, it would be helpful to add line numbers in the manuscript during the revision, as this will help all reviewers to locate the parts we are talking about.

      We thank the reviewer for the suggestions and have added line numbers to the revised manuscript.

      (1) Introduction:

      The introduction is somewhat unmotivated, with key terms/concepts left unexplained until relatively late in the manuscript. One of the main focuses in this work is "hyperaltruistic", but how is this defined? It seems that the authors take the meaning of "willing to pay more to reduce other's pain than their own pain", but is this what the task is measuring? Did participants ever need to PAY something to reduce the other's pain? Note that some previous studies indeed allow participants to pay something to reduce other's pain. And what makes it "HYPER-altruistic" rather than simply "altruistic"?

      As the reviewer noted, we adopted a well-established experimental paradigm to study the context-dependent effect on hyperaltruism. Altruism refers to the fact that people take others’ welfare into account when making decisions that concern both parties. Research paradigms investigating altruistic behavior typically use a social decision task that requires participants to choose between options where their own financial interests are pitted against the welfare of others (FeldmanHall et al., 2015; Hu et al., 2021; Hutcherson et al., 2015; Teoh et al., 2020; Xiong et al., 2020). The hyperaltruistic tendency, on the other hand, refers to subjects’ higher valuation of others’ pain than of their own pain (Crockett et al., 2014, 2015, 2017; Volz et al., 2017). One manifestation of hyperaltruism would be the following scenario: the subject is willing to forgo $2 to reduce others’ pain by 1 unit (social-decision task) but only willing to forgo $1 to reduce his/her own pain by the same amount (self-decision task) (Crockett et al., 2014). Conversely, if subjects are willing to forgo less money to reduce others’ suffering in the social-decision task than in the self-decision task, then no hyperaltruism is observed. Therefore, hyperaltruistic preference can only be measured by collecting subjects’ choices in both the self- and social-decision tasks and comparing the choices between the two tasks.

      In our task, as in the studies before ours (Crockett et al., 2014, 2015, 2017; Volz et al., 2017), subjects in each trial were faced with two options with different levels of pain for others and monetary payoffs for themselves. Based on subjects’ choice data, we can infer through regression analysis how willing subjects were to trade 1 unit of monetary payoff in exchange for reducing others’ pain (see Figure 1 and Methods for the experimental details). We have rewritten the introduction and methods sections to make this point clearer to the audience.

      Plus, in the intro, the authors mentioned that the "boundary conditions" remain unexplored, but this idea is never touched again. What do boundary conditions mean here in this task? How do the results/data help with finding out the boundary conditions? Can this be discussed within wider literature in the Discussion section?

      Boundary conditions here specifically refer to the variables or decision contexts that determine whether hyperaltruistic behavior can be elicited. Individual personality traits, motivation and social relationships may all be boundary conditions affecting the emergence of hyperaltruistic behavior. In our task, we specifically focused on the valence of the decision context (gain vs. loss), since previous studies only tested the hyperaltruistic preference in the gain context and the introduction of the loss context might bias subjects’ hyperaltruistic behavior through implicit moral framing.

      We have explained the boundary conditions in the revised introduction (Lines 45 ~ 49).

      “However, moral norm is also context dependent: vandalism is clearly against social and moral norms yet vandalism for self-defense is more likely to be ethically and legally justified (the Doctrine of necessity). Therefore, a crucial step is to understand the boundary conditions for hyperaltruism.”

      Last, what motivated the authors to examine the decision context? It comes somewhat out of the blue that the opening paragraph states that "We set out to [...] decision context", but why? Are there other important factors? Why decision context is more important than studying those others?

      We thank the reviewer for the comment. The hyperaltruistic preference was originally demonstrated between conditions where subjects’ personal monetary gain was pitted against others’ pain (social-condition) or against subjects’ own suffering (self-condition) (Crockett et al., 2014). Follow-up studies found that subjects also exhibited strong egoistic tendencies if they instead needed to harm themselves for others’ benefit in the social condition (by flipping the recipients of monetary gain and electric shocks) (Volz et al., 2017). However, these studies have primarily focused on the gain context, neglecting the fact that valence could also be an influential factor in biasing subjects’ behavior (difference between gain and loss processing in humans). It is possible that replacing monetary gains with losses in the money-pain trade-off task biases subjects’ hyperaltruistic preference due to heightened vigilance or negative emotions in the face of potential loss (such as loss aversion) (Kahneman & Tversky, 1979; Liu et al., 2020; Pachur et al., 2018; Tom et al., 2007; Usher & McClelland, 2004; Yechiam & Hochman, 2013). Another possibility is that gain and loss contexts may elicit different subjective moral perceptions (or internal moral framings) in participants, affecting their hyperaltruistic preferences (Liu et al., 2017; Losecaat Vermeer et al., 2020; Markiewicz & Czupryna, 2018; Wu et al., 2018). In our manuscript, we did not strive to compare which factors might be more important in eliciting hyperaltruistic behavior, but rather to demonstrate the crucial role played by the decision context and to show that the internal moral framing could be the mediating factor in driving subjects’ hyperaltruistic behavior. In fact, we speculate that the egoistic tendencies found in the Volz et al. (2017) study were partly driven by the subjects’ failure to engage the proper internal moral framing in the social condition (harm for self, see Volz et al., 2017 for details).

      (2) Experimental Design:

      (2a) The experiment per se is largely solid, as it followed a previously well-established protocol. But I am curious about how the participants got instructed? Did the experimenter ever mention the word "help" or "harm" to the participants? It would be helpful to include the exact instructions in the SI.

      In the instructions, we avoided words such as “harm”, “help”, or other terms reminding subjects about the moral judgement of the decisions they were about to make. Instead, we presented the options in a neutral and descriptive manner, focusing only on the relevant components (shocks and money). The instructions for all four conditions are shown in supplementary Fig. 9.

      (2b) Relatedly, the experimental details were not quite comprehensive in the main text. Indeed, the Methods come after the main text, but to be able to guide readers to understand what was going on, it would be very helpful if the authors could include some necessary experimental details at the beginning of the Results section.

      We thank the reviewer for the suggestion. We have now provided a brief introduction of the experimental details in the revised results section (Lines 125 ~ 132).

      “Prior to the money-pain trade-off task, we individually calibrated each subject’s pain threshold using a standard procedure[4–6]. This allowed us to tailor a moderate electric stimulus that corresponded to each subject’s subjective pain intensity. Subjects then engaged in 240 decision trials (60 trials per condition), acting as the “decider” and trading off between monetary gains or losses for themselves and the pain experienced by either themselves or an anonymous “pain receiver” (gain-self, gain-other, loss-self and loss-other, see Supplementary Fig. 8 for the instructions and also see methods for details).”

      (3) Statistical Analysis

      (3a) One of the main analyses uses the harm aversion model (Eq1) and the results section keeps referring to one of the key parameters of it (ie, k). However, it is difficult to understand the text without going to the Methods section below. Hence it would be very helpful to repeat the equation also in the main text. A similar idea goes to the delta_m and delta_s terms - it will be very helpful to give a clear meaning of them, as nearly all analyses rely on knowing what they mean.

      We thank the reviewer for the suggestion. We have now added the equation of the harm aversion model and provided a more detailed description of the equation in the main text (Lines 150 ~ 155).

      “We also modeled subjects’ choices using an influential model where subjects’ behavior could be characterized by the harm (electric shock) aversion parameter κ, reflecting the relative weights subjects assigned to ∆m and ∆s, the objective difference in money and shocks between the more and less painful options, respectively (∆V=(1-κ)∆m - κ∆s Eq.1, See Methods for details)[4–6]. Higher κ indicates that higher sensitivity is assigned to ∆s than ∆m and vice versa.”
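
      As a minimal sketch of how Eq. 1 feeds into a choice probability, the snippet below computes ∆V from κ, ∆m and ∆s and passes it through a logistic choice rule. The logistic form, the sign convention (P refers to the more painful, better-paid option) and the function names `delta_v` and `p_more_painful` are assumptions made for illustration; the exact choice function is the one given in the Methods.

```python
import math


def delta_v(kappa, d_money, d_shock):
    """Eq. 1: relative value of the more painful option.

    kappa   -- harm aversion parameter in [0, 1]
    d_money -- money difference (more minus less painful option)
    d_shock -- shock difference (more minus less painful option)
    """
    return (1.0 - kappa) * d_money - kappa * d_shock


def p_more_painful(kappa, gamma, d_money, d_shock):
    """Assumed logistic choice rule; gamma is the choice consistency."""
    return 1.0 / (1.0 + math.exp(-gamma * delta_v(kappa, d_money, d_shock)))
```

      Under these assumptions, a subject with κ close to 1 weights ∆s almost exclusively and rarely picks the more painful option, whereas κ close to 0 yields choices driven mostly by ∆m.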

      (3b) There is one additional parameter gamma (choice consistency) in the model. Did the authors also examine the task-related difference of gamma? This might be important as some studies have shown that the other-oriented choice consistency may differ in different prosocial contexts.

      To examine the task-related difference of choice consistency (γ), we compared the performance of 4 candidate models:

      Model 1 (M1): The choice consistency parameter γ remains constant across shock recipients (self vs. other) and decision contexts (gain vs. loss).

      Model 2 (M2): γ differs between the self- and other-recipient conditions, with γ<sub>self</sub> and γ<sub>other</sub> representing the choice consistency when pain is inflicted on him/her-self or the other-recipient.

      Model 3 (M3): γ differs between the gain and loss conditions, with γ<sub>gain</sub> and γ<sub>loss</sub> representing the choice consistencies in the gain and loss contexts, respectively.

      Model 4 (M4): γ varies across four conditions, with γ<sub>self-gain</sub>, γ<sub>other-gain</sub>, γ<sub>self-loss</sub> and γ<sub>other-loss</sub> capturing the choice consistency in each condition.

      As Supplementary Fig. 10 shows, after fitting all the models to subjects’ choice data, model 1 (M1) performed best among the four candidate models in both studies (1 & 2), with the lowest Bayesian Information Criterion (BIC). Therefore, we conclude that factors such as the shock recipient (self vs. other) and decision context (gain vs. loss) did not significantly influence subjects’ choice consistency, and we report model results using a single choice consistency parameter.
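
      The model comparison described above can be sketched as follows, using the standard definition BIC = k·ln(n) − 2·ln(L). The helper names `bic` and `best_model`, the parameter counts and the log-likelihood values in the usage note are hypothetical placeholders, not the fitted values behind Supplementary Fig. 10.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: lower values indicate a better model."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def best_model(fits, n_obs):
    """Pick the model with the lowest BIC.

    fits -- dict mapping a model name to (log_likelihood, n_params)
    Returns (name_of_best_model, dict_of_bic_scores).
    """
    scores = {name: bic(ll, k, n_obs) for name, (ll, k) in fits.items()}
    return min(scores, key=scores.get), scores
```

      For example, `best_model({"M1": (-120.0, 5), "M4": (-118.9, 8)}, n_obs=240)` (hypothetical numbers) selects M1, because the small gain in likelihood for M4 does not offset the penalty for its three extra γ parameters.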

      (3c) I am not fully convinced that the authors included two types of models: the harm aversion model and the logistic regression models. Indeed, the models look similar, and the authors have acknowledged that. But I wonder if there is a way to combine them? For example:

      Choice ~ delta_V * context * recipient (*Oxt_v._placebo)

      The calculation of delta_V follows Equation 1.

      Or the conceptual question is, if the authors were interested in the specific and independent contribution of delta_m and delta_s to behavior, as their logistic model did, why did the authors examine the harm aversion first, where a parameter k is controlling for the trade-off? One way to find it out is to properly run different models and run model comparisons. In the end, it would be beneficial to only focus on the "winning" model to draw inferences.

      The reviewer raised an excellent point here. According to the logistic regression model, we have:

      Where P is the probability of selecting the less harmful option. Similarly, if we combine Eq. 1 (∆V = (1-κ)∆m - κ∆s) and Eq. 2 of the harm aversion model, we have:

      If we ignore the constant term β<sub>0</sub> from the logistic regression model, the harm aversion model is simply a reparameterization of the logistic regression model. The harm aversion model was implemented first to derive the harm aversion parameter (κ), a parameter in the range [0, 1] that quantifies how subjects weigh the relative contributions of Δm and Δs between options in their decision processes. Since previous studies used the term κ<sub>other</sub>-κ<sub>self</sub> to define the magnitude of hyperaltruistic preference, we adopted a similar approach to compare our results with previous research under the same theoretical framework. However, in order to investigate the independent contributions of Δm and Δs, we have to take γ into account. The coefficients β<sub>∆m</sub> and β<sub>∆s</sub> in the logistic regression model are not necessarily correlated by nature, whereas in the harm aversion model the coefficients (1-κ) and κ are always strictly negatively correlated (see Eq. 1). Only after multiplying by γ does the correlation between γ(1-κ) and γκ vary, depending on the specific distribution of γ and κ. In summary, we followed the approach of previous research to estimate the harm aversion parameter κ in order to compare our results with previous studies and to capture the relative influence of Δm and Δs. When we studied the contextual effects (gain vs. loss or placebo vs. control) on subjects’ behavior, we further investigated the contextual effect on how subjects evaluated Δm and Δs, respectively. The two models (logistic regression model and harm aversion model) in our study are mathematically the same and are not competing candidate models. Instead, they represent different aspects from which our data can be examined.
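
      The reparameterization argument can be written out as a short sketch. It assumes that the logistic regression has the usual logit-linear form and that Eq. 2 is a logistic choice rule with choice consistency γ; neither equation is reproduced here from the manuscript, so both forms and the sign convention are assumptions. In the sketch, P is the probability of the option favored by positive ∆V; if P instead denotes the probability of the less harmful option, both coefficients change sign.

```latex
\begin{align}
\operatorname{logit}(P) &= \beta_0 + \beta_{\Delta m}\,\Delta m + \beta_{\Delta s}\,\Delta s
  && \text{(logistic regression, assumed form)} \\
\Delta V &= (1-\kappa)\,\Delta m - \kappa\,\Delta s
  && \text{(Eq. 1)} \\
P &= \frac{1}{1 + e^{-\gamma\,\Delta V}}
  && \text{(Eq. 2, assumed form)} \\
\operatorname{logit}(P) &= \gamma\,\Delta V
  = \gamma(1-\kappa)\,\Delta m - \gamma\kappa\,\Delta s \\
\beta_0 = 0:\quad \beta_{\Delta m} &= \gamma(1-\kappa), \qquad
  \beta_{\Delta s} = -\gamma\kappa \\
\gamma &= \beta_{\Delta m} - \beta_{\Delta s}, \qquad
  \kappa = \frac{-\beta_{\Delta s}}{\beta_{\Delta m} - \beta_{\Delta s}}
\end{align}
```

      Under these assumptions the two coefficient pairs carry the same information: κ fixes the ratio between the money and shock weights, and γ fixes their overall scale.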

      We also compared the harm aversion model with and without the constant term β<sub>0</sub> in the choice function. With the constant term β<sub>0</sub> added, Equation 2 above becomes:

      As the following figure shows, the hyperaltruistic parameters (κ<sub>other</sub>-κ<sub>self</sub>) calculated from the harm aversion model with the constant term (panels A & B) have almost identical patterns as the model without the constant term (panels C & D, i.e. Figs. 2B & 4B in the original manuscript) in both studies.

      Author response image 1.


      (3d) The interpretation of the main OXT results needs to be more cautious. According to the operationalization, "hyperaltruistic" is the reduction of pain of others (higher % of choosing the less painful option) relative to the self. But relative to the placebo (as baseline), OXT did not increase the % of choosing the less painful option for others, rather, it decreased the % of choosing the less painful option for themselves. In other words, the degree of reducing other's pain is the same under OXT and placebo, but the degree of benefiting self-interest is reduced under OXT. I think this needs to be unpacked, and some of the wording needs to be changed. I am not very familiar with the OXT literature, but I believe it is very important to differentiate whether OXT is doing something on self-oriented actions vs other-oriented actions. Relatedly, for results such as that in Figure 5A, it would be helpful to not only look at the difference but also the actual magnitude of the sensitivity to the shocks, for self and others, under OXT and placebo.

      We thank the reviewer for this thoughtful comment. As the reviewer correctly pointed out, “hyperaltruism” can be defined as “higher % of choosing the less painful option to the others relative to the self”. Closer examination of the results showed that both the degree of reducing others’ pain and the degree of reducing their own pain decreased under OXT (Figure 4A). More specifically, our results do not support the claim that “In other words, the degree of reducing others’ pain is the same under OXT and placebo, but the degree of benefiting self-interest is reduced under OXT.” Instead, the results show a significant reduction in the choice of the less painful option under OXT treatment for both the self and other conditions (the interaction effect of OXT vs. placebo and self vs. other: F<sub>1,45</sub> = 16.812, P < 0.001, η<sup>2</sup> = 0.272; simple effect of OXT vs. placebo in the self-condition: F<sub>1,45</sub> = 59.332, P < 0.001, η<sup>2</sup> = 0.569; OXT vs. placebo in the other-condition: F<sub>1,45</sub> = 14.626, P < 0.001, η<sup>2</sup> = 0.245; repeated-measures ANOVA, see Figure 4A).

      We also performed mixed-effect logistic regression analyses where subjects’ choices were regressed against ∆<sub>m</sub> and ∆<sub>s</sub> in the different valence (gain vs. loss) and recipient (self vs. other) conditions in both studies 1 & 2 (Supplementary Figs. 1 & 6). In the figure above, which replots Supplementary Fig. 6 together with panel B (included as Supplementary Fig. 8 in the supplementary materials), we found a significant treatment × ∆<sub>s</sub> (differences in shock magnitude between the more and less painful options) interaction effect (β=-0.136±0.029, P < 0.001, 95% CI=[-0.192, -0.079]), indicating that subjects’ sensitivities towards pain were indeed different between the placebo and OXT treatments for both self and other conditions. Furthermore, the significant four-way ∆<sub>s</sub> × treatment (OXT vs. Placebo) × context (gain vs. loss) × recipient (self vs. other) interaction effect (β=0.125±0.053, P=0.018, 95% CI=[0.022, 0.228]) in the regression analysis, followed by significant simple effects (in the OXT treatment: ∆<sub>s</sub> × recipient effect in the gain context: F<sub>1,45</sub> = 7.622, P < 0.008, η<sup>2</sup> = 0.145; ∆<sub>s</sub> × recipient effect in the loss context: F<sub>1,45</sub> = 7.966, P = 0.007, η<sup>2</sup> = 0.150), suggested that under OXT treatment, participants showed a greater sensitivity toward ∆<sub>s</sub> (see asterisks in the OXT condition in panel B) in the other-condition than in the self-condition, thus restoring the hyperaltruistic behavior in the loss context.

      As the reviewer suggested, OXT’s effect on hyperaltruism does manifest separately in subjects’ harm sensitivities for self- and other-oriented actions. We followed the reviewer’s suggestions and examined the actual magnitude of the sensitivities to shocks for both the self and other conditions (panel B in the figure above). The administration of OXT (compared to the Placebo treatment, panel B in the figure above) significantly reduced participants’ pain sensitivity (treatment × ∆<sub>s</sub>: β=-0.136±0.029, P < 0.001, 95% CI=[-0.192, -0.079]), yet also restored the harm sensitivity patterns in both the gain and loss conditions. These results are included in the supplementary figures (6 & 8) as well as in the main text.

      Recommendations:

      (1) For Figures 2A-B, it would be great to calculate the correlation separately for gain and loss, as in other figures.

      We assume that the reviewer is referring to Figures 3A & B. We did not present the correlations separately for the gain and loss contexts because the correlation between an individual’s IH (instrumental harm), IB (impartial beneficence) and hyperaltruistic preferences was not significantly modulated by the contextual factors. The interaction effects in both Figs. 3A & B and Supplementary Fig. 5 (also see Tables S1 & S2) are as follows: Study 1, valence × IH effect: β=0.016±0.022, t<sub>152</sub>=0.726, P=0.469; valence × IB effect: β=0.004±0.031, t<sub>152</sub>=0.115, P=0.908; Study 2 placebo condition, valence × IH effect: β=0.018±0.024, t<sub>84</sub>=0.030, P=0.463; valence × IB effect: β=0.051±0.030, t<sub>84</sub>=1.711, P=0.702. We have added these statistics to the main text following the reviewer’s suggestions.

      (2) "by randomly drawing a shock increment integer ∆s (from 1 to 19) such that [...] did not exceed 20 (𝑆+ {less than or equal to} 20)." I am not sure if a random drawing following a uniform distribution can guarantee S is smaller than 20. More details are needed. Same for the monetary magnitude.

      We are sorry for the lack of clarity in the method description. For the task design, we adopted the original design from previous literature (Crockett et al., 2014, 2017). More specifically:

      “Specifically, each trial was determined by a combination of the differences of shocks (Δs, ranging from 1 to 19, with increment of 1) and money (Δm, ranging from ¥0.2 to ¥19.8, with increment of ¥0.2) between the two options, resulting in a total of 19×99=1881 pairs of [Δs, Δm] for each trial. To ensure the trials were suitable for most subjects, we evenly distributed the desired ratio Δm / (Δs + Δm) between 0.01 and 0.99 across 60 trials for each condition. For each trial, we selected the closest [Δs, Δm] pair from the [Δs, Δm] pool to the specific Δm / (Δs + Δm) ratio, which was then used to determine the actual money and shock amounts of two options. The shock amount (S<sub>less</sub>) for the less painful option was an integer drawn from the discrete uniform distribution [1-19], constrained by S<sub>less</sub> + ∆s < 20. Similarly, the money amount (M<sub>less</sub>) for the less painful option was drawn from a discrete uniform distribution [¥0.2 - ¥19.8], with the constraint of M<sub>less</sub> + ∆m < 20. Once S<sub>less</sub> and M<sub>less</sub> were selected, the shock (S<sub>more</sub>) and money (M<sub>more</sub>) magnitudes for the more painful option were calculated as: S<sub>more</sub> = S<sub>less</sub> + ∆s, M<sub>more</sub> = M<sub>less</sub> + ∆m”

      We have added these details to the methods section (Lines 520-533).
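
      The trial construction described above can be sketched as follows. This is an illustrative reading of the procedure, not the original task code: the names `POOL`, `ratio` and `build_trials` are hypothetical, money is handled in integer units of ¥0.2 to avoid floating-point drift, and the upper bounds are treated as inclusive (S<sub>more</sub> ≤ 20, M<sub>more</sub> ≤ ¥20, as in the reviewer's quoted "S+ ≤ 20"), because a strict bound would leave no valid draw when ∆s = 19 and S<sub>less</sub> ≥ 1.

```python
import random

UNIT = 0.2             # one money unit in yuan
SHOCK_MAX = 20         # assumed inclusive upper bound on shocks per option
MONEY_MAX_UNITS = 100  # assumed inclusive upper bound: 20 yuan in 0.2-yuan units

# All [ds, dm] pairs: ds = 1..19 shocks, dm = 0.2..19.8 yuan (1..99 units).
POOL = [(ds, dm_u) for ds in range(1, 20) for dm_u in range(1, 100)]


def ratio(ds, dm_u):
    """Return dm / (ds + dm) for a pool entry, with dm converted to yuan."""
    dm = dm_u * UNIT
    return dm / (ds + dm)


def build_trials(n_trials=60, seed=0):
    """Build one condition's trials from evenly spaced target ratios."""
    rng = random.Random(seed)
    step = (0.99 - 0.01) / (n_trials - 1)
    trials = []
    for i in range(n_trials):
        target = 0.01 + i * step
        # Closest [ds, dm] pair in the pool to the target ratio.
        ds, dm_u = min(POOL, key=lambda p: abs(ratio(*p) - target))
        # Draw the less painful option so the more painful one stays in range.
        s_less = rng.randint(1, SHOCK_MAX - ds)
        m_less_u = rng.randint(1, MONEY_MAX_UNITS - dm_u)
        trials.append({
            "s_less": s_less,
            "s_more": s_less + ds,
            "m_less": round(m_less_u * UNIT, 1),
            "m_more": round((m_less_u + dm_u) * UNIT, 1),
        })
    return trials
```

      Drawing S<sub>less</sub> and M<sub>less</sub> from a range that already depends on ∆s and ∆m, rather than drawing first and rejecting afterwards, is one way to guarantee that the bound holds on every trial.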

      Reviewer #2:

      (1) The theoretical hypothesis needs to be better justified. There are studies addressing the neurobiological mechanism of hyperaltruistic tendency, which the authors unfortunately skipped entirely.

      Also in recommendation #1:

      (1) In the Introduction, the authors claim that "the mechanistic account of the hyperaltruistic phenomenon remains unknown". I think this is too broad of a criticism and does not do justice to prior work that does provide some mechanistic account of this phenomenon. In particular, I was surprised that the authors did not mention at all a relevant fMRI study that investigates the neural mechanism underlying hyperaltruistic tendency (Crockett et al., 2017, Nature Neuroscience). There, the researchers found that individual differences in hyperaltruistic tendency in the same type of moral decision-making task is better explained by reduced neural responses to ill-gotten money (Δm in the Other condition) in the brain reward system, rather than heightened neural responses to others' harm. Moreover, such neural response pattern is related to how an immoral choice would be judged (i.e., blamed) by the community. Since the brain reward system is consistently involved in Oxytocin's role in social cognition and decision-making (e.g., Dolen & Malenka, 2014, Biological Psychiatry), it is important to discuss the hypothesis and results of the present research in the context of this literature.

      We totally agree with the reviewer that the expression “mechanistic account of the hyperaltruistic phenomenon remains unknown” in our original manuscript can be misleading to the audience. Indeed, we were aware of the major findings in the field and cited all the seminal work of hyperaltruism and its related neural mechanism (Crockett et al., 2014, 2015, 2017). We have changed the texts in the introduction to better reflect this point and added further discussion as to how oxytocin might play a role:

      “For example, it was shown that the hyperaltruistic preference modulated neural representations of the profit gained from harming others via the functional connectivity between the lateral prefrontal cortex, a brain area involved in moral norm violation, and profit sensitive brain regions such as the dorsal striatum[6].” (Lines 41~45)

      “Oxytocin has been shown to play a critical role in social interactions such as maternal attachment, pair bonding, consociate attachment and aggression in a variety of animal models[42,43]. Humans are endowed with higher cognitive and affective capacities and exhibit far more complex social cognitive patterns[44].” (Lines 86~90)

      (2) There are some important inconsistencies between the preregistration and the actual data collection/analysis, which the authors did not justify.

      Also in recommendations:

      (4) It is laudable that the authors pre-registered the procedure and key analysis of the Oxytocin study and determined the sample size beforehand. However, in the preregistration, the authors claimed that they would recruit 30 participants for Experiment 1 and 60 for Experiment 2, without justification. In the paper, they described a "prior power analysis", which deviated from their preregistration. It is OK to deviate from preregistration, but this needs to be explicitly mentioned and addressed (why the deviation occurred, why the reported approach was justifiable, etc.).

      We sincerely appreciate the reviewer’s thorough assessment of our manuscript. In the more exploratory study 1, we found that the loss decision context effectively diminished subjects’ hyperaltruistic preference. Based on this finding, we pre-registered study 2 and hypothesized that: 1) The administration of OXT may salvage subjects’ hyperaltruistic preference in the loss context; 2) The administration of OXT may reduce subjects’ sensitivities towards electric shocks (but not necessarily their moral preference), due to the well-established results relating OXT to enhanced empathy for others (Barchi-Ferreira & Osório, 2021; Radke et al., 2013) and the processing of negative stimuli (Evans et al., 2010; Kirsch et al., 2005; Wu et al., 2020); and 3) The OXT effect might be context specific, depending on the particular combination of valence (gain vs. loss) and shock recipient (self vs. other) (Abu-Akel et al., 2015; Kapetaniou et al., 2021; Ma et al., 2015).

      As our results suggested, the administration of OXT indeed restored subjects’ hyperaltruistic preference (confirming hypothesis 1, Figure 4A). Also, OXT decreased subjects’ sensitivities towards electric shocks in both the gain and loss conditions (supplementary Fig. 6 and supplementary Fig. 8), consistent with our second hypothesis. We must admit that our hypothesis 3 was rather vague, since a seminal study clearly demonstrated the context-dependent effect of OXT in human cooperation and conflict depending on the group membership of the subjects (De Dreu et al., 2010, 2020). Although our results partially validated our hypothesis 3 (supplementary Fig. 6), we did not make specific predictions as to the direction and the magnitude of the OXT effect.

      The main inconsistency is related to the sample size. When we carried out study 1, we recruited both male and female subjects. After we identified the context effect on the hyperaltruistic preference, we decided to pre-register and perform study 2 (the OXT study). We originally made a rough estimate of 60 male subjects for study 2. While conducting study 2, we also went through the literature on the effect of OXT on social behavior and realized that a sample of around 45 subjects might be enough to detect the main effect of OXT. Therefore, we settled on the number of 46 (study 2) reported in the manuscript. Correspondingly, we increased the number of subjects in study 1 to the final number of 80 (40 males) to make sure the sample was large enough to detect a small-to-medium effect, as well as to have a fair comparison between studies 1 and 2 (roughly equal number of male subjects). It should be noted that although we reported only the results of all subjects (male & female) for study 1 in the manuscript, the main results remain very similar if we focus only on the male subjects in study 1 (see the figure below). We believe that these results, together with the placebo treatment group results in study 2 (male only), confirm the validity of our original finding.

      Author response image 2.

      Author response image 3.

      We have included additional texts (Lines 447 ~ 452) in the Methods section for the discrepancy between the preregistered and actual sample sizes in the revised manuscript:

      “It should be noted that in preregistration we originally planned to recruit 60 male subjects for Study 2 but ended up recruiting 46 male subjects (mean age =  years) based on the sample size reported in previous oxytocin studies[57,69]. Additionally, a power analysis suggested that a sample size > 44 should be enough to detect a small to medium effect size of oxytocin (Cohen’s d=0.24, α=0.05, power (1-β)=0.8) using a 2 × 2 × 2 within-subject design[76].”

      (3) Some of the exploratory analysis seems underpowered (e.g., large multiple regression models with only about 40 participants).

      We thank the reviewer for the comments and appreciate the concern that the sample size could affect the reliability of the results in the multiple regression analyses.

      In Fig. 2, the multiple regression analyses were conducted after we observed a valence-dependent effect on hyperaltruism (Fig. 2A) and the regression was constructed accordingly:

      Choice ~ ∆s*context*recipient + ∆m*context*recipient + (1 + ∆s*context*recipient + ∆m*context*recipient | subject)

      Where ∆s and ∆m indicate the shock level and monetary reward differences between the more and less painful options, context indicates the monetary valence (gain vs. loss), and recipient indicates the identity of the shock recipient (self vs. other).

      Since we have 240 trials for each subject and a total of 80 subjects in Study 1, we believe that this is a reasonable regression analysis to perform.

      In Fig. 3, the multiple regression analyses were indeed exploratory. More specifically, we ran 3 multiple linear regressions:

      hyperaltruism~EC*context+IH*context+IB*context

      Relative harm sensitivity~ EC*context+IH*context+IB*context

      Relative money sensitivity~ EC*context+IH*context+IB*context

      Where hyperaltruism is defined as κ<sub>other</sub> - κ<sub>self</sub>, relative harm sensitivity as otherβ<sub>∆s</sub> - selfβ<sub>∆s</sub> and relative money sensitivity as otherβ<sub>∆m</sub> - selfβ<sub>∆m</sub>. EC (empathic concern), IH (instrumental harm) and IB (impartial beneficence) were subjects’ scores from the corresponding questionnaires.

      For the first regression, we tested whether EC, IH and IB scores were related to hyperaltruism and it should be noted that this was tested on 80 subjects (Study 1). After we identified the effect of IH on hyperaltruism, we ran the following two regressions. The reason we still included IB and EC as predictors in these two regression analyses was to remove potential confounds caused by EC and IB since previous research indicated that IB, IH and EC could be correlated (Kahane et al., 2018).

      In study 2, we performed the following regression analyses again to validate our results (the placebo treatment in study 2 should yield results similar to those found in study 1).

      Relative harm sensitivity~ EC*context+IH*context+IB*context

      Relative money sensitivity~ EC*context+IH*context+IB*context

      Again, we added IB and EC only to control for nuisance effects of these covariates. As indicated in Fig. 5C-D, the placebo condition in study 2 replicated our previous findings in study 1, and OXT administration effectively removed the interaction effect between IH and valence (gain vs. loss) on subjects’ relative harm sensitivity.

      To present our data and results more objectively, we have changed the text in the results section and pointed out that the regression analysis:

      hyperaltruism~EC*context+IH*context+IB*context

      was exploratory (Lines 186-192).

      “We tested how hyperaltruism was related to both IH and IB across decision contexts using an exploratory multiple regression analysis. Moral preference, defined as κ<sub>other</sub> - κ<sub>self</sub>, was negatively associated with IH (β=-0.031±0.011, t<sub>156</sub>=-2.784, P =0.006) but not with IB (β=0.008±0.016, t<sub>156</sub>=0.475, P=0.636) across gain and loss contexts, reflecting a general connection between moral preference and IH (Fig. 3A & B).”

      (4) Inaccurate conceptualization of utilitarian psychology and the questionnaire used to measure it.

      Also in recommendations:

      (2) Throughout the paper, the authors placed lots of weight on individual differences in utilitarian psychology and the Oxford Utilitarianism Scale (OUS). I am not sure this is the best individual difference measure in this context. I don't see a conceptual fit between the psychological construct that OUS reflects, and the key psychological processes underlying the behaviors in the present study. As far as I understand it, the conceptual core of utilitarian psychology that OUS captures is the maximization of greater goods. Neither the Instrumental Harm (IH) component nor the Impartial Beneficence (IB) component reflects a tradeoff between the personal interests of the decision-making agent and a moral principle. The IH component is about the endorsement of harming a smaller number of individuals for the benefit of a larger number of individuals. The IB component is about treating self, close others, and distant others equally. However, the behavioral task used in this study is neither about distributing harm between a smaller number of others and a larger number of others nor about benefiting close or distant others. The fact that IH showed some statistical association with the behavioral tendency in the present data set could be due to the conceptual overlap between IH and an individual's tendency to inflict harm (e.g., psychopathy; Table 7 in Kahane et al., 2018, which the authors cited). I urge the authors to justify more why they believe that conceptually OUS is an appropriate individual difference measure in the present study, and if so, interpret their results in a clearer and justifiable manner (taking into account the potential confound of harm tendency/psychopathy).

      We thank the reviewer for the thoughtful comment and agree that “IH component is about the endorsement of harming a smaller number of individuals for the benefit of a larger number of individuals. The IB component is about treating self, close others, and distant others equally”. As we mentioned in the previous response to the reviewer, we first ran an exploratory multiple linear regression analysis of hyperaltruistic preference (κ<sub>other</sub> - κ<sub>self</sub>) against IB and IH in study 1 based on the hypothesis that the reduction of hyperaltruistic preference in the loss condition might be due to 1) subjects’ altered attitudes between IB and hyperaltruistic preference between the gain and loss conditions, and/or 2) the loss condition changed how the moral norm was perceived and therefore affected the correlation between IH and hyperaltruistic preference. As Fig. 3 shows, we did not find a significant IB effect on hyperaltruistic preference (κ<sub>other</sub> - κ<sub>self</sub>), nor on the relative harm or money sensitivity (supplementary Fig. 3). These results excluded the possibility that subjects with higher IB might treat self and others more equally and therefore show less hyperaltruistic preference. On the other hand, we found a strong correlation between hyperaltruistic preference and IH (Fig. 3A): subjects with higher IH scores showed less hyperaltruistic preference. Since the hyperaltruistic preference (κ<sub>other</sub> - κ<sub>self</sub>) is a compound variable, we further broke it down into subjects’ relative sensitivity to harm and money (other β<sub>∆s</sub> - self β<sub>∆s</sub> and other β<sub>∆m</sub> - self β<sub>∆m</sub>, respectively). The follow-up regression analyses revealed that the correlation between subjects’ relative harm sensitivity and IH was altered by the decision contexts (gain vs. loss, Fig. 3C-D).

      These results are consistent with our hypothesis that for subjects to engage in the utilitarian calculation, they should first realize that there is a moral dilemma (harming others to make monetary gain in the gain condition). When there is less perceived moral conflict (due to the framing of decision context as avoiding loss in the loss condition), the correlation between subjects’ relative harm sensitivity and IH became non-significant (Fig. 3C). It is worth noting that these results were further replicated in the placebo condition of study 2, further indicating that the role of OXT is to affect how the decision context is morally framed.

      The reviewer also raised the interesting possibility that the correlation between subjects’ behavioral tendency and IH may be confounded by the fact that IH is also correlated with other traits such as psychopathy. Indeed, Kahane et al. (2018) showed that IH was associated with subclinical psychopathy in a lay population. Although we only collected and included IB and empathic concern (EC) scores as control variables and in principle cannot rule out the influence of psychopathy, we argue that this is unlikely to be the case. First, psychopaths by definition “only care about their own good” (Kahane et al., 2018). However, subjects in our studies, as well as in previous research, showed greater aversion to harming others (compared to harming themselves) in the gain conditions. This is opposite to the prediction of psychopathy. Even in the loss condition, subjects showed similar levels of aversion to harming others (vs. harming themselves), indicating that our subjects valued their own and others’ well-being similarly. Second, although there appears to be an association between utilitarian judgement and psychopathy (Glenn et al., 2010; Kahane et al., 2015), the fact that people also possess a form of universal or impartial beneficence in their utilitarian judgements suggests that psychopathy alone is not sufficient to explain subjects’ hyperaltruistic behavior.

      We have thus rewritten part of the results to clarify our rationale for using the Oxford Utilitarianism Scale (especially the IH and IB) to establish the relationship between moral traits and subjects’ decision preference (Lines 212-215):

      “Furthermore, our results are consistent with the claim that profiting from inflicting pains on another person (IH) is inherently deemed immoral[1]. Hyperaltruistic preference, therefore, is likely to be associated with subjects’ IH dispositions.”

      (3) Relatedly, in the Discussion, the authors mentioned "the money-pain trade-off task, similar to the well-known trolley dilemma". I am not sure if this statement is factually accurate because the "well-known trolley dilemma" is about a disinterested third-party weighing between two moral requirements - "greatest good for the greatest number" (utilitarianism) and "do no harm" (Kantian/deontology), not between a moral requirement and one's own monetary interest (which is the focus of the present study). The analogy would be more appropriate if the task required the participants to trade off between, for example, harming one person in exchange for a charitable donation, as a recent study employed (Siegel et al., 2022, A computational account of how individuals resolve the dilemma of dirty money. Scientific reports). I urge the authors to go through their use of "utilitarian/utilitarianism” in the paper and make sure their usage aligns with the definition of the concept and the philosophical implications.

      We thank the reviewer for prompting us to think over the difference between our task and the trolley dilemma. Indeed, the trolley dilemma refers to a disinterested third party’s decision between two moral requirements, namely utilitarianism and deontology. In our study, when the shock recipient was “other”, our task could be interpreted either as a decision between the “moral norm of no harm (deontology) and one’s self-interest maximization (utilitarian)”, or as a decision between the “greatest good for both parties (utilitarian) vs. do no harm (deontology)”, though the latter interpretation typically requires differential weighing of one’s own benefits versus the benefits of others (Fehr & Schmidt, 1999; Saez et al., 2015). In fact, it could be argued that the utilitarian account applies not only to the third party’s well-being, but also to our own well-being, or to “that of those near or dear to us” (Kahane et al., 2018).

      We acknowledge that a direct analogy between our task and the trolley dilemma may be lacking and have therefore deleted the trolley example from the discussion.

      (5) Related to the above point, the sample size of Study 2 was calculated based on the main effect of oxytocin. However, the authors also reported several regression models that seem to me more like exploratory analyses. Their sample size may not be sufficient for these analyses. The authors should: a) explicitly distinguish between their hypothesis-driven analysis and exploratory analysis; b) report achieved power of their analysis.

      We appreciate the reviewer’s thorough reading of our manuscript. Following the reviewer’s suggestions, we have explicitly stated in the revised manuscript which analyses were exploratory and which were hypothesis-driven. Following the reviewer’s request, we also added the achieved power to the main text (Lines 274-279):

      “The effect size (Cohen’s f<sup>2</sup>) for this exploratory analysis was calculated to be 0.491 and 0.379 for the placebo and oxytocin conditions, respectively. The post hoc power analysis with a significance level of α = 0.05, 7 regressors (IH, IB, EC, decision context, IH×context, IB×context, and EC×context), and sample size of N = 46 yielded achieved power of 0.910 (placebo treatment) and 0.808 (oxytocin treatment).”
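      As an illustration of how such a post hoc power value can be obtained, the sketch below computes the power of the overall F test in a multiple regression from Cohen’s f², the sample size, and the number of regressors. It assumes the convention used by G*Power for this test, in which the noncentrality parameter is λ = f²·N with numerator df equal to the number of regressors and denominator df equal to N minus the number of regressors minus 1. The response does not state which software or convention the authors used, so this is an assumption and the output need not match the reported values exactly. All function names are illustrative, and the distribution functions are implemented with the standard library only.

```python
import math


def betainc(a, b, x):
    """Regularized incomplete beta I_x(a, b) via a continued fraction (Lentz)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    # Use the symmetry relation where the continued fraction converges faster.
    if x > (a + 1.0) / (a + b + 2.0):
        return 1.0 - betainc(b, a, 1.0 - x)
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    front = math.exp(a * math.log(x) + b * math.log(1.0 - x) - log_beta) / a
    tiny = 1e-300
    f, c, d = 1.0, 1.0, 0.0
    for i in range(500):
        m = i // 2
        if i == 0:
            numerator = 1.0
        elif i % 2 == 0:
            numerator = (m * (b - m) * x) / ((a + 2 * m - 1) * (a + 2 * m))
        else:
            numerator = -((a + m) * (a + b + m) * x) / ((a + 2 * m) * (a + 2 * m + 1))
        d = 1.0 + numerator * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + numerator / c
        c = c if abs(c) > tiny else tiny
        f *= c * d
        if abs(1.0 - c * d) < 1e-12:
            break
    return front * (f - 1.0)


def f_cdf(value, df1, df2, noncentrality=0.0):
    """CDF of the (non)central F distribution as a Poisson mixture of beta CDFs."""
    if value <= 0.0:
        return 0.0
    x = df1 * value / (df1 * value + df2)
    half = noncentrality / 2.0
    weight = math.exp(-half)  # Poisson weight for j = 0
    total = 0.0
    for j in range(1000):
        total += weight * betainc(df1 / 2.0 + j, df2 / 2.0, x)
        weight *= half / (j + 1)
        if j > half and weight < 1e-16:
            break
    return total


def critical_f(alpha, df1, df2):
    """Upper-alpha critical value of the central F distribution, by bisection."""
    lo, hi = 0.0, 1000.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if f_cdf(mid, df1, df2) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def regression_power(f2, n, n_regressors, alpha=0.05):
    """Power of the overall F test, assuming noncentrality lambda = f2 * n."""
    df1 = n_regressors
    df2 = n - n_regressors - 1
    return 1.0 - f_cdf(critical_f(alpha, df1, df2), df1, df2, f2 * n)


for label, f2 in [("placebo", 0.491), ("oxytocin", 0.379)]:
    print(label, round(regression_power(f2, n=46, n_regressors=7), 3))
```

      A useful sanity check on this kind of calculation is that the power equals α when f² = 0 and increases monotonically with f² for a fixed design.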

      (6) Do the authors collect reaction times (RT) information? Did the decision context and oxytocin modulate RT? Based on their procedure, it seems that the authors adopted a speeded response task, therefore the RT may reflect some psychological processes independent of choice. It is also possible (and recommended) that the authors use the drift-diffusion model to quantify latent psychological processes underlying moral decision-making. It would be interesting to see if their manipulations have any impact on those latent psychological processes, in addition to explicit choice, which is the endpoint product of the latent psychological processes. There are some examples of applying DDM to this task, which the authors could refer to if they decide to go down this route (Yu et al, 2021, How peer influence shapes value computation in moral decision-making. Cognition.)

      We did collect the RT information for this experiment. As demonstrated in the figure below, participants exhibited significantly longer RT in the loss context compared to the gain context (Study 1: the main effect of decision context: F<sub>1,79</sub>=20.043, P < 0.001, η<sup>2</sup> =0.202; Study 2-placebo: F<sub>1,45</sub>=17.177, P < 0.001, η<sup>2</sup> =0.276). In addition to this effect of context, decisions were significantly slower in the other-condition compared to the self-condition (Study 1: the main effect of recipient: F<sub>1,79</sub>=4.352, P < 0.040, η<sup>2</sup> =0.052; Study 2-placebo: F<sub>1,45</sub>=5.601, P < 0.022, η<sup>2</sup> =0.111), which replicates previous research findings (Crockett et al., 2014). However, the difference in response time between recipients was not modulated by decision context (Study 1: context × recipient interaction: F<sub>1,79</sub>=1.538, P < 0.219, η<sup>2</sup> =0.019; Study 2-placebo: F<sub>1,45</sub>=2.631, P < 0.112, η<sup>2</sup> =0.055). Additionally, the results in the oxytocin study (study 2) revealed no evidence supporting any effect of oxytocin on reaction time. Neither the main effect (treatment: placebo vs. oxytocin) nor the interaction effects of oxytocin on response time were statistically significant (main effect of OXT treatment: F<sub>1,45</sub>=2.380, P < 0.230, η<sup>2</sup> =0.050; treatment × context: F<sub>1,45</sub>=2.075, P < 0.157, η<sup>2</sup> =0.044; treatment × recipient: F<sub>1,45</sub>=0.266, P < 0.609, η<sup>2</sup> =0.006; treatment × context × recipient: F<sub>1,45</sub>=2.909, P < 0.095, η<sup>2</sup> =0.061).

      Author response image 4.

      We also agree that it would be interesting to investigate how OXT might impact the dynamics of the decision process using a drift-diffusion model (DDM). However, we have already shown in the original manuscript that OXT increased subjects’ relative harm sensitivities. If a canonical DDM is adopted here, then such an OXT effect is most likely to correspond to an increased drift rate for the relative harm sensitivity, which we feel still aligns with the current framework in general. In future studies, including further manipulations such as time pressure might be a more comprehensive approach to investigate the effect of OXT on DDM-related decision variables such as attribute drift rate, initial bias, decision threshold and attribute synchrony.

      (7) This is just a personal preference, but I would avoid metaphoric language in a scientific paper (e.g., rescue, salvage, obliterate). Plain, neutral English terms can express the same meaning clearly (e.g., restore, vanish, eliminate).

      Again, we thank the reviewer for the suggestion and have since modified the terms.

      Reviewer #3:

      The primary weakness of the paper concerns its framing. Although it purports to be measuring "hyper-altruism" it does not provide evidence to support why any of the behavior being measured is extreme enough to warrant the modifier "hyper" (and indeed throughout I believe the writing tends toward hyperbole, using, e.g., verbs like "obliterate" rather than "reduce"). More seriously, I do not believe that the task constitutes altruism, but rather the decision to engage, or not engage, in instrumental aggression.

      We agree with the reviewer (and Reviewer #2) that plain and clear English should be used to describe our results and have since modified those terms.

      However, the term “hyperaltruism”, which is the main theme of our study, was originally proposed in a seminal paper (Crockett et al., 2014) and has since been widely adopted in related studies (Crockett et al., 2014, 2015, 2017; Volz et al., 2017; Zhan et al., 2020). The term “hyperaltruism” was introduced to emphasize the difference from altruism (Chen et al., 2024; FeldmanHall et al., 2015; Hu et al., 2021; Hutcherson et al., 2015; Lockwood et al., 2017; Xiong et al., 2020). Hyperaltruism does not indicate extreme altruism. Instead, it simply reflects the fact that “we are more willing to sacrifice gains to spare others from harm than to spare ourselves from harm” (Volz et al., 2017). In other words, altruism refers to people’s unselfish regard for or devotion to the welfare of others, whereas hyperaltruism takes subjects’ own cost-benefit preference as the reference point and highlights the “additional” altruistic preference when considering others’ welfare. For example, in an altruism experimental design, altruism is characterized by the degree to which subjects take other people’s welfare into account (left panel). However, in a typical hyperaltruism task design (right panel), hyperaltruistic preference is operationally defined as the difference (κ<sub>other</sub> - κ<sub>self</sub>) between the degrees to which subjects value others’ harm (κ<sub>other</sub>) and their own harm (κ<sub>self</sub>).

      Author response image 5.
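      The operational definition above amounts to a per-subject difference score. As a minimal sketch, assuming each subject already has fitted harm-aversion estimates for the self and other conditions, the computation looks as follows. The class name, field names and numeric values are purely hypothetical and are not taken from the authors' data.

```python
from dataclasses import dataclass


@dataclass
class SubjectEstimates:
    """Hypothetical per-subject harm-aversion estimates (illustrative only)."""
    kappa_self: float   # weight placed on one's own harm
    kappa_other: float  # weight placed on another person's harm


def hyperaltruistic_preference(subject):
    """Difference score kappa_other - kappa_self; positive means hyperaltruistic."""
    return subject.kappa_other - subject.kappa_self


# Made-up values chosen only to show the three possible signs of the score.
subjects = [
    SubjectEstimates(kappa_self=0.25, kappa_other=0.5),
    SubjectEstimates(kappa_self=0.5, kappa_other=0.5),
    SubjectEstimates(kappa_self=0.5, kappa_other=0.25),
]

for subject in subjects:
    score = hyperaltruistic_preference(subject)
    label = "hyperaltruistic" if score > 0 else "not hyperaltruistic"
    print(f"{score:+.2f} {label}")
```

      The point of the difference score is that a subject can be strongly harm-averse in both conditions and still show no hyperaltruistic preference, because only the excess weight on others' harm relative to one's own is counted.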

      I found it surprising that a paradigm that entails deciding to hurt or not hurt someone else for personal benefit (whether acquiring a financial gain or avoiding a loss) would be described as measuring "altruism." Deciding to hurt someone for personal benefit is the definition of instrumental aggression. I did not see that in any of the studies was there a possibility of acting to benefit the other participant in any condition. Altruism is not equivalent to refraining from engaging in instrumental aggression. True altruism would be to accept shocks to the self for the other's benefit (e.g., money).  The interpretation of this task as assessing instrumental aggression is supported by the fact that only the Instrumental Harm subscale of the OUS was associated with outcomes in the task, but not the Impartial Benevolence subscale. By contrast, the IB subscale is the one more consistently associated with altruism (e.g,. Kahane et al 2018; Amormino at al, 2022) I believe it is important for scientific accuracy for the paper, including the title, to be re-written to reflect what it is testing.

      Again, as we mentioned in the previous response, hyperaltruism is a term coined almost a decade ago and has since been widely adopted in the research field. We are concerned that replacing the term would be more likely to cause confusion than clarity among readers.

      Also, from the utilitarian perspective, a gain or loss (or harm) that occurs to someone else lies on the same dimension, and there is no discontinuity between gains and losses. Therefore, taking actions to avoid someone else’s loss can also be viewed as altruistic behavior, similar to choices that increase others’ welfare (Liu et al., 2020).

      Relatedly: in the introduction I believe it would be important to discuss the non-symmetry of moral obligations related to help/harm--we have obligations not to harm strangers but no obligation to help strangers. This is another reason I do not think the term "hyper altruism" is a good description for this task--given it is typically viewed as morally obligatory not to harm strangers, choosing not to harm them is not "hyper" altruistic (and again, I do not view it as obviously altruism at all).

      We agree with the reviewer’s point that we have a moral obligation not to harm others but no obligation to help strangers (Liu et al., 2020). In fact, this is exactly what we argued in our manuscript: by switching the decision context from gains to losses, subjects were less likely to perceive the decisions as “harming others”. Furthermore, after the administration of OXT, decisions in both the gain and loss contexts were more strongly perceived by subjects as harming others (Fig. 6A).

      The framing of the role of OT also felt incomplete. In introducing the potential relevance of OT to behavior in this task, it is important to pull in evidence from non-human animals on origins of OT as a hormone selected for its role in maternal care and defense (including defensive aggression). The non-human animal literature regarding the effects of OT is on the whole much more robust and definitive than the human literature. The evidence is abundant that OT motivates the defensive care of offspring of all kinds. My read of the present OT findings is that they increase participants' willingness to refrain from shocking strangers even when incurring a loss (that is, in a context where the participant is weighing harm to themselves versus harm to the other). It will be important to explain why OT would be relevant to refraining from instrumental aggression, again, drawing on the non-human animal literature.

      We thank the reviewer for the comments and agree that the current understanding of the link between our OT results and the animal literature can at best be described as vague and intriguing. Current literature on OT in animal research suggests that oxytocin in the nucleus accumbens (NAc) might play a critical role in social cognition and in reinforcing social interactions (Dölen et al., 2013; Dölen & Malenka, 2014; Insel, 2010). Though much insight has already been gained from animal studies, in humans, social interactions can take a variety of different forms, and consociate recognition can also be rather dynamic. For example, male human participants with self-administered OT showed higher trust and cooperation towards in-group members but more defensive aggression towards out-group members (De Dreu et al., 2010). In another human study, participants administered OT showed more coordinated out-group attack behavior, suggesting that OT might increase in-group efficiency at the cost of harming out-group members (Zhang et al., 2019). It is worth pointing out that in both experiments, the participants’ group membership was artificially assigned, thus highlighting the context-dependent nature of the OT effect in humans.

      In our experiment, more complex and higher-level social cognitive processes such as moral framing and moral perception are involved, and OT seems to play an important role in affecting these processes. Therefore, we admit that for this study, like the ones mentioned above, it is unfortunately rather hard to find a non-human animal counterpart. Instead of relating OT to instrumental aggression, we aimed to provide a parsimonious framework to explain why “hyperaltruism” disappeared in the loss condition and, with OT administration, reappeared in both the gain and loss conditions, while also considering the effects of other relevant variables.

      We concur with the reviewer’s comments about the importance of animal research and have since added the following paragraph to the revised manuscript (Lines 86-90) as well as to the discussion:

      “Oxytocin has been shown to play a critical role in social interactions such as maternal attachment, pair bonding, consociate attachment and aggression in a variety of animal models[42,43]. Humans are endowed with higher cognitive and affective capacities and exhibit far more complex social cognitive patterns[44].”

      Another important limitation is the use of only male participants in Study 2. This was not an essential exclusion. It should be clear throughout sections of the manuscript that this study's effects can be generalized only to male participants.

      We thank the reviewer for the comments. Prior research has shown sex differences in oxytocin’s effects (Fischer-Shofty et al., 2013; Hoge et al., 2014; Lynn et al., 2014; Ma et al., 2016; MacDonald, 2013). Furthermore, given the potential confounding of OT effects by menstrual cycles and potential pregnancy in female subjects, most human OT studies have only recruited male subjects (Berends et al., 2019; De Dreu et al., 2010; Fischer-Shofty et al., 2010; Ma et al., 2016; Zhang et al., 2019). We have modified our manuscript to emphasize that study 2 only recruited male subjects.

      Recommendations:

      I believe the authors have provided an interesting and valuable dataset related to the willingness to engage in instrumental aggression - this is not the authors' aim, although also an important aim. Future researchers aiming to build on this paper would benefit from it being framed more accurately.

      Thus, I believe the paper must be reframed to accurately describe the nature of the task as assessing instrumental aggression. This is also an important goal, as well-designed laboratory models of instrumental aggression are somewhat lacking.

      Please see our response above: to better connect with previous research, we believe that the term hyperaltruism aligns better with the main theme of this study.

      The research literature on other aggression tasks should also be brought in, as I believe these are more relevant to the present study than research studies on altruism that are primarily donation-type tasks. It should be added to the limitations of how different aggression in a laboratory task such as this one is from real-world immoral forms of aggression. Arguably, aggression in a laboratory task in which all participants are taking part voluntarily under a defined set of rules, and in which aggression constrained by rules is mutual, is similar to aggression in sports, which is not considered immoral. Whether responses in this task would generalize to immoral forms of aggression cannot be determined without linking responses in the task to some real-world outcome.

      We agree with the reviewer that “aggression in a lab task …. is similar to aggression in sports”. Our starting point was to investigate the boundary conditions for hyperaltruism (though we do not deny that there is an aggression component in hyperaltruism, given the experimental design we used). In other words, the dependent variable we were interested in was the difference between “other” and “self” aggression, not the aggression itself. Our results showed that when the decision context was switched from the monetary gain condition to the loss condition, human participants were willing to bear similar amounts of monetary loss to spare others and themselves from harm. That is, hyperaltruism disappeared in the loss condition. We interpreted this result as indicating that the loss condition prompted subjects to adopt a different moral framework (help vs. harm, Fig. 6A) and that subjects were less influenced by their instrumental harm personality trait due to the change of moral framework (Fig. 3C). In the following study (study 2), we further tested this hypothesis and verified that the administration of OT indeed increased subjects’ perception of the task as harming others in both the gain and loss conditions (Fig. 6A), and that such moral perception mediated the relationship between subjects’ personality traits (instrumental harm) and their relative harm sensitivities (the difference in aggression between the other- and self-conditions). We believe that the moral perception framework, in which OT directly modulates moral perception, accounts for subjects’ context-dependent choices better than hypothesizing context-dependent modulation effects of OT on aggression.

      The language should also be toned down--the use of phrases like "hyper altruism" (without independent evidence to support that designation) and "obliterate" rather than "reduce" or "eliminate" are overly hyperbolic.

      We have changed terms such as “obliterate” and “eliminate” to plain English, as the reviewer suggested.

      References

      Abu-Akel, A., Palgi, S., Klein, E., Decety, J., & Shamay-Tsoory, S. (2015). Oxytocin increases empathy to pain when adopting the other- but not the self-perspective. Social Neuroscience, 10(1), 7–15.

      Barchi-Ferreira, A., & Osório, F. (2021). Associations between oxytocin and empathy in humans: A systematic literature review. Psychoneuroendocrinology, 129, 105268.

      Berends, Y. R., Tulen, J. H. M., Wierdsma, A. I., van Pelt, J., Feldman, R., Zagoory-Sharon, O., de Rijke, Y. B., Kushner, S. A., & van Marle, H. J. C. (2019). Intranasal administration of oxytocin decreases task-related aggressive responses in healthy young males. Psychoneuroendocrinology, 106, 147–154.

      Chen, J., Putkinen, V., Seppälä, K., Hirvonen, J., Ioumpa, K., Gazzola, V., Keysers, C., & Nummenmaa, L. (2024). Endogenous opioid receptor system mediates costly altruism in the human brain. Communications Biology, 7(1), 1–11.

      Crockett, M. J., Kurth-Nelson, Z., Siegel, J. Z., Dayan, P., & Dolan, R. J. (2014). Harm to others outweighs harm to self in moral decision making. Proceedings of the National Academy of Sciences of the United States of America, 111(48), 17320–17325.

      Crockett, M. J., Siegel, J. Z., Kurth-Nelson, Z., Dayan, P., & Dolan, R. J. (2017). Moral transgressions corrupt neural representations of value. Nature Neuroscience, 20(6), 879–885.

      Crockett, M. J., Siegel, J. Z., Kurth-Nelson, Z., Ousdal, O. T., Story, G., Frieband, C., Grosse-Rueskamp, J. M., Dayan, P., & Dolan, R. J. (2015). Dissociable Effects of Serotonin and Dopamine on the Valuation of Harm in Moral Decision Making. Current Biology, 25(14), 1852–1859.

      De Dreu, C. K. W., Greer, L. L., Handgraaf, M. J. J., Shalvi, S., Van Kleef, G. A., Baas, M., Ten Velden, F. S., Van Dijk, E., & Feith, S. W. W. (2010). The Neuropeptide Oxytocin Regulates Parochial Altruism in Intergroup Conflict Among Humans. Science, 328(5984), 1408–1411.

      De Dreu, C. K. W., Gross, J., Fariña, A., & Ma, Y. (2020). Group Cooperation, Carrying-Capacity Stress, and Intergroup Conflict. Trends in Cognitive Sciences, 24(9), 760–776.

      Dölen, G., Darvishzadeh, A., Huang, K. W., & Malenka, R. C. (2013). Social reward requires coordinated activity of nucleus accumbens oxytocin and serotonin. Nature, 501(7466), 179–184.

      Dölen, G., & Malenka, R. C. (2014). The Emerging Role of Nucleus Accumbens Oxytocin in Social Cognition. Biological Psychiatry, 76(5), 354–355.

      Evans, S., Shergill, S. S., & Averbeck, B. B. (2010). Oxytocin Decreases Aversion to Angry Faces in an Associative Learning Task. Neuropsychopharmacology, 35(13), 2502–2509.

      Fehr, E., & Schmidt, K. M. (1999). A Theory of Fairness, Competition, and Cooperation. The Quarterly Journal of Economics, 114(3), 817–868.

      FeldmanHall, O., Dalgleish, T., Evans, D., & Mobbs, D. (2015). Empathic concern drives costly altruism. Neuroimage, 105, 347–356.

      Fischer-Shofty, M., Levkovitz, Y., & Shamay-Tsoory, S. G. (2013). Oxytocin facilitates accurate perception of competition in men and kinship in women. Social Cognitive and Affective Neuroscience, 8(3), 313–317.

      Fischer-Shofty, M., Shamay-Tsoory, S. G., Harari, H., & Levkovitz, Y. (2010). The effect of intranasal administration of oxytocin on fear recognition. Neuropsychologia, 48(1), 179–184.

      Glenn, A. L., Koleva, S., Iyer, R., Graham, J., & Ditto, P. H. (2010). Moral identity in psychopathy. Judgment and Decision Making, 5(7), 497–505.

      Hoge, E. A., Anderson, E., Lawson, E. A., Bui, E., Fischer, L. E., Khadge, S. D., Barrett, L. F., & Simon, N. M. (2014). Gender moderates the effect of oxytocin on social judgments. Human Psychopharmacology: Clinical and Experimental, 29(3), 299–304.

      Hu, J., Hu, Y., Li, Y., & Zhou, X. (2021). Computational and Neurobiological Substrates of Cost-Benefit Integration in Altruistic Helping Decision. Journal of Neuroscience, 41(15), 3545–3561.

      Hutcherson, C. A., Bushong, B., & Rangel, A. (2015). A Neurocomputational Model of Altruistic Choice and Its Implications. Neuron, 87(2), 451–462.

      Insel, T. R. (2010). The Challenge of Translation in Social Neuroscience: A Review of Oxytocin, Vasopressin, and Affiliative Behavior. Neuron, 65(6), 768–779.

      Kahane, G., Everett, J. A. C., Earp, B. D., Caviola, L., Faber, N. S., Crockett, M. J., & Savulescu, J. (2018). Beyond sacrificial harm: A two-dimensional model of utilitarian psychology. Psychological Review, 125(2), 131–164.

      Kahane, G., Everett, J. A. C., Earp, B. D., Farias, M., & Savulescu, J. (2015). ‘Utilitarian’ judgments in sacrificial moral dilemmas do not reflect impartial concern for the greater good. Cognition, 134, 193–209.

      Kahneman, D., & Tversky, A. (1979). Prospect Theory: An Analysis of Decision under Risk. Econometrica, 47(2), 263.

      Kapetaniou, G. E., Reinhard, M. A., Christian, P., Jobst, A., Tobler, P. N., Padberg, F., & Soutschek, A. (2021). The role of oxytocin in delay of gratification and flexibility in non-social decision making. eLife, 10, e61844.

      Kirsch, P., Esslinger, C., Chen, Q., Mier, D., Lis, S., Siddhanti, S., Gruppe, H., Mattay, V. S., Gallhofer, B., & Meyer-Lindenberg, A. (2005). Oxytocin Modulates Neural Circuitry for Social Cognition and Fear in Humans. The Journal of Neuroscience, 25(49), 11489–11493.

      Liu, J., Gu, R., Liao, C., Lu, J., Fang, Y., Xu, P., Luo, Y., & Cui, F. (2020). The Neural Mechanism of the Social Framing Effect: Evidence from fMRI and tDCS Studies. The Journal of Neuroscience, 40(18), 3646–3656.

      Liu, Y., Li, L., Zheng, L., & Guo, X. (2017). Punish the Perpetrator or Compensate the Victim? Gain vs. Loss Context Modulate Third-Party Altruistic Behaviors. Frontiers in Psychology, 8, 2066.

      Lockwood, P. L., Hamonet, M., Zhang, S. H., Ratnavel, A., Salmony, F. U., Husain, M., & Apps, M. A. J. (2017). Prosocial apathy for helping others when effort is required. Nature Human Behaviour, 1(7), 131–131.

      Losecaat Vermeer, A. B., Boksem, M. A. S., & Sanfey, A. G. (2020). Third-party decision-making under risk as a function of prior gains and losses. Journal of Economic Psychology, 77, 102206.

      Lynn, S. K., Hoge, E. A., Fischer, L. E., Barrett, L. F., & Simon, N. M. (2014). Gender differences in oxytocin-associated disruption of decision bias during emotion perception. Psychiatry Research, 219(1), 198–203.

      Ma, Y., Liu, Y., Rand, D. G., Heatherton, T. F., & Han, S. (2015). Opposing Oxytocin Effects on Intergroup Cooperative Behavior in Intuitive and Reflective Minds. Neuropsychopharmacology, 40(10), 2379–2387.

      Ma, Y., Shamay-Tsoory, S., Han, S., & Zink, C. F. (2016). Oxytocin and Social Adaptation: Insights from Neuroimaging Studies of Healthy and Clinical Populations. Trends in Cognitive Sciences, 20(2), 133–145.

      MacDonald, K. S. (2013). Sex, Receptors, and Attachment: A Review of Individual Factors Influencing Response to Oxytocin. Frontiers in Neuroscience, 6, 194.

      Markiewicz, Ł., & Czupryna, M. (2018). Cheating: One Common Morality for Gain and Losses, but Two Components of Morality Itself. Journal of Behavioral Decision Making, 33(2), 166–179.

      Pachur, T., Schulte-Mecklenbeck, M., Murphy, R. O., & Hertwig, R. (2018). Prospect theory reflects selective allocation of attention. Journal of Experimental Psychology: General, 147(2), 147–169.

      Radke, S., Roelofs, K., & De Bruijn, E. R. A. (2013). Acting on Anger: Social Anxiety Modulates Approach-Avoidance Tendencies After Oxytocin Administration. Psychological Science, 24(8), 1573–1578.

      Saez, I., Zhu, L., Set, E., Kayser, A., & Hsu, M. (2015). Dopamine modulates egalitarian behavior in humans. Current Biology, 25(7), 912–919.

      Teoh, Y. Y., Yao, Z., Cunningham, W. A., & Hutcherson, C. A. (2020). Attentional priorities drive effects of time pressure on altruistic choice. Nature Communications, 11(1), 3534.

      Tom, S. M., Fox, C. R., Trepel, C., & Poldrack, R. A. (2007). The neural basis of loss aversion in decision-making under risk. Science, 315(5811), 515–518.

      Usher, M., & McClelland, J. L. (2004). Loss Aversion and Inhibition in Dynamical Models of Multialternative Choice. Psychological Review, 111(3), 757–769.

      Volz, L. J., Welborn, B. L., Gobel, M. S., Gazzaniga, M. S., & Grafton, S. T. (2017). Harm to self outweighs benefit to others in moral decision making. Proceedings of the National Academy of Sciences of the United States of America, 114(30), 7963–7968.

      Wu, Q., Mao, J., & Li, J. (2020). Oxytocin alters the effect of payoff but not base rate in emotion perception. Psychoneuroendocrinology, 114, 104608.

      Wu, S., Cai, W., & Jin, S. (2018). Gain or non-loss: The message matching effect of regulatory focus on moral judgements of other-orientation lies. International Journal of Psychology, 53(3), 223–227.

      Xiong, W., Gao, X., He, Z., Yu, H., Liu, H., & Zhou, X. (2020). Affective evaluation of others’ altruistic decisions under risk and ambiguity. Neuroimage, 218, 116996.

      Yechiam, E., & Hochman, G. (2013). Losses as modulators of attention: Review and analysis of the unique effects of losses over gains. Psychological Bulletin, 139(2), 497–518.

      Zhan, Y., Xiao, X., Tan, Q., Li, J., Fan, W., Chen, J., & Zhong, Y. (2020). Neural correlations of the influence of self-relevance on moral decision-making involving a trade-off between harm and reward. Psychophysiology, 57(9), e13590.

      Zhang, H., Gross, J., De Dreu, C., & Ma, Y. (2019). Oxytocin promotes coordinated out-group attack during intergroup conflict in humans. eLife, 8, e40698.

    1. Author Response

      The following is the authors’ response to the original reviews.

      eLife assessment

      This important work identifies a previously uncharacterized capacity for songbirds to recover vocal targets even without sensory experience. While the evidence supporting this claim is solid, with innovative experiments exploring vocal plasticity in deafened birds, additional behavioral controls and analyses are necessary to shore up the main claims. If improved, this work has the potential for broad relevance to the fields of vocal and motor learning.

      We were able to address the requests for additional behavioral controls about the balancing of the groups (reviewer 1) and the few individual birds that showed a different behavior (reviewer 2) without collecting any further data. See our detailed replies below.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Zai et al test if songbirds can recover the capacity to sing auditory targets without singing experience or sensory feedback. Past work showed that after the pitch of targeted song syllables is driven outside of birds' preferred target range with external reinforcement, birds revert to baseline (i.e. restore their song to their target). Here the authors tested the extent to which this restoration occurs in muted or deafened birds. If these birds can restore, this would suggest an internal model that allows for sensory-to-motor mapping. If they cannot, this would suggest that learning relies entirely on feedback-dependent mechanisms, e.g. reinforcement learning (RL). The authors find that deafened birds exhibit moderate but significant restoration, consistent with the existence of a previously under-appreciated internal model in songbirds.

      Strengths:

      The experimental approach of studying vocal plasticity in deafened or muted birds is innovative, technically difficult, and perfectly suited for the question of feedback-independent learning. The finding in Figure 4 that deafened birds exhibit subtle but significant plasticity toward restoration of their pre-deafening target is surprising and important for the songbird and vocal learning fields, in general.

      Weaknesses:

      The evidence and analyses related to the directed plasticity in deafened birds are confusing, and the magnitude of the plasticity is far less than the plasticity observed in control birds with intact feedback. The authors acknowledge this difference in a two-system model of vocal plasticity, but one wonders why the feedback-independent model, which could powerfully enhance learning speed, is weak in this songbird system.

      We fully agree with the reviewer. This surprising weakness reflects a limitation of the birds themselves rather than of our approach for characterizing it.

      There remains some confusion about the precise pitch-change methods used to study the deafened birds, including the possibility that a critical cohort of birds was not suitably balanced in a way where deafened birds were tested on their ability to implement both pitch increases and decreases toward target restoration.

      Both deaf groups (dLO and WNd) were balanced in that half of the birds (5/10 WNd and 4/8 dLO) shifted their pitch up (thus target restoration corresponded to decreasing pitch) and half of the birds (5/10 WNd and 4/8 dLO) shifted their pitch down (thus target restoration corresponded to increasing pitch), see Methods.

      To clarify the precise pitch-change method used, we added to the Methods an explanation of why we used the sensitivity index d′ in Fig. 4:

      We used the sensitivity index d′ relative to the last 2 h of WN/LO instead of NRP because we wanted to detect a pitch change, which is the realm of detection theory, i.e. d′. Furthermore, by measuring local changes in pitch relative to the last 2 h of WN/LO reinforcement, our measurements are only minimally affected by the amount of reinforcement learning that might have occurred during this 2 h time window; choosing an earlier or longer window would have blended reinforced pitch changes into our estimates. Last but not least, changing the way in which we normalized d′ values (dividing by S_B) or using the NRP relative to the last 2 h of WN/LO did not qualitatively change the results shown in Fig. 4D.
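
      For readers unfamiliar with the sensitivity index, the following is a minimal sketch of one common detection-theory definition of d′: the difference of two sample means divided by the root mean square of the two sample standard deviations. This is an illustration under stated assumptions, not the authors' analysis code. The function name `d_prime`, the `shift_sign` parameter, the sign convention (negative values mean a change back toward baseline) and the pitch values are all hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def d_prime(window, reference, shift_sign=1):
    """Sensitivity index between two sets of pitch values (assumed definition).

    window     -- pitch values in the analysis window (e.g. an early window)
    reference  -- pitch values in the reference window (e.g. the last 2 h of WN/LO)
    shift_sign -- +1 if reinforcement shifted pitch up, -1 if it shifted pitch down,
                  so that negative d' always means "back toward baseline"
                  (this sign convention is an assumption made for the sketch)
    """
    # Root mean square of the two sample standard deviations.
    pooled_sd = sqrt((variance(window) + variance(reference)) / 2)
    return shift_sign * (mean(window) - mean(reference)) / pooled_sd


# Hypothetical pitch values in Hz, for illustration only.
reference = [700.0, 702.0, 698.0, 701.0, 699.0]
early = [697.0, 699.0, 695.0, 698.0, 696.0]

print(round(d_prime(early, reference, shift_sign=1), 2))
```

      Because d′ scales the pitch difference by the variability within the two windows, it expresses how detectable a local change is rather than how far pitch is from baseline, which is the distinction the response draws between d′ and NRP.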

      Reviewer #2 (Public Review):

      Summary:

      This paper investigates the role of motor practice and sensory feedback when a motor action returns to a learned or established baseline. Adult male zebra finches perform a stereotyped, learned vocalization (song). It is possible to shift the pitch of particular syllables away from the learned baseline pitch using contingent white noise reinforcement. When the reinforcement is stopped, birds will return to their baseline over time. During the return, they often sing hundreds of renditions of the song. However, whether motor action, sensory feedback, or both during singing is necessary to return to baseline is unknown.

      Previous work has shown that there is covert learning of the pitch shift. If the output of a song plasticity pathway is blocked during learning, there is no change in pitch during the training. However, as soon as the pathway is unblocked, the pitch immediately shifts to the target location, implying that there is learning of the shift even without performance. Here, they ask whether the return to baseline from such a pitch shift also involves covert or overt learning processes. They perform a series of studies to address these questions, using muting and deafening of birds at different time points.

      Strengths:

      The overall premise is interesting and the use of muting and deafening to manipulate different aspects of motor practice vs. sensory feedback is a solid approach.

      Weaknesses:

      One of the main conclusions, which stems primarily from birds deafened after being pitch-shifted using white noise (WNd) in comparison to birds deafened before being pitch-shifted with light as a reinforcer (LOd), is that recent auditory experience can drive motor plasticity even when an individual is deprived of such experience. While the lack of shift back to baseline pitch in the LOd birds is convincing, the main conclusion hinges on the responses of just a few WNd individuals who are closer to baseline in the early period. Moreover, only 2 WNd individuals reached baseline in the late period, though neither of these were individuals who were closer to baseline in the early phase. Most individuals remain or return toward the reinforced pitch. These data highlight that while it may be possible for previous auditory experience during reinforcement to drive motor plasticity, the effect is very limited. Importantly, it's not clear if there are other explanations for the changes in these birds, for example, whether there are differences in the number of renditions performed or changes to other aspects of syllable structure that could influence measurements of pitch.

      We thank the reviewer for these detailed observations. We looked into the reviewer’s claim that our main conclusion of revertive pitch changes in deaf birds with target mismatch experience hinges on only few WNd birds in the early period.

      When we remove the three birds that were close to baseline (NRP=0) in the early period, we still get the same trend that WNd birds show revertive changes towards baseline: Early d′ = −0.13, p = 0.24, tstat = −0.74, df = 6, N = 7 birds, one-sided t-test of H0: d′ = 0; Late d′ = −1.26, p = 0.08, tstat = −1.63, df = 6, N = 7 birds, one-sided t-test of H0: d′ = 0. Furthermore, even without these three birds, bootstrapping the difference between WNd and dC birds shows the same trend in the early period (p=0.22) and a significant reversion in the late period (p<0.001). Thus, the effect of reversion towards baseline in the late period is robustly observed on a population level, even when discounting the three individual birds that the reviewer suspected would be responsible for the effect.
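
      As a worked illustration of the one-sided, one-sample t-test quoted above, the sketch below computes a t statistic and its lower-tail p-value using only the Python standard library. It is not the authors' code: the function names, the Simpson's-rule integration of the Student t density (used here in place of a statistics library) and the sample values are assumptions made for illustration.

```python
from math import gamma, pi, sqrt
from statistics import mean, stdev


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def lower_tail_p(t, df, steps=2000):
    """One-sided p-value P(T <= t), via Simpson's rule on [0, |t|]."""
    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3  # P(0 <= T <= |t|)
    return 0.5 - central if t < 0 else 0.5 + central


def one_sample_t(values, mu0=0.0):
    """t statistic and degrees of freedom for H0: mean(values) == mu0."""
    n = len(values)
    t = (mean(values) - mu0) / (stdev(values) / sqrt(n))
    return t, n - 1


# Plugging in the reported t statistics (df = 6) reproduces the reported p-values.
print(round(lower_tail_p(-0.74, 6), 2))  # 0.24
print(round(lower_tail_p(-1.63, 6), 2))  # 0.08
```

      The lower tail is the relevant one here because the alternative hypothesis is a reversion, i.e. d′ < 0. With raw per-bird d′ values, `one_sample_t(values)` would supply the `t` and `df` arguments.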

      Moreover, note that there are not two but three WNd individuals that reached baseline in the late period (see Figure 2C, D). One of them was already close to baseline in the early period and another one was already relatively close, too.

      Also, the considerable variability among birds is not surprising: variability across deaf birds is expected to be large because their ongoing song degradation might lead to a drift of pitch over time since deafening.

      Last but not least, see also our multivariate model (below).

      With regards to the “differences in the number of renditions” that could explain pitch changes: Deaf birds sing less after deafening than hearing birds. During the first 2 hours (early), they produce 87±59 renditions (WNd) and 410±330 renditions (dLO) compared to 616±272 renditions (control birds). Also, WNd birds sing only 4300±2300 motif renditions between the early and late period compared to the average of 11000±3400 renditions that hearing control birds produce in the same time period. However, despite these differences, when we provide WNd birds more time to recover, namely 9 days after the early period, they sang on average 12000±6000 renditions, yet their NRP was still significantly different from zero (NRP = 0.37, p=0.007, tstat=3.47, df=9). Thus, even after producing more practice songs, deaf birds do not recover baseline pitch and so the number of songs alone cannot explain why deaf birds do not fully recover pitch. We conclude that auditory experience seems to be necessary to recover song.

      We added this information to the Results.

      In this context, note that the interesting part of our work is not that deaf birds do not fully recover, but that they recover anything at all (“main conclusion”, Fig. 4). The number of songs does not explain why deaf birds with mismatch experience (WNd, singing the least and singing significantly less than control birds, p=2.3×10⁻⁶, two-tailed t-test) partially revert song towards baseline, unlike deaf birds without mismatch experience (dLO, singing significantly more than WNd birds, p=0.008, and indistinguishable from control birds, p=0.1). We added this information to the Results section.

      With regards to ‘other aspects of syllable structure’: We did not look into this. Regardless of the outcome of such a hypothetical analysis, whether other syllable features change is irrelevant for our finding that deaf birds do not recover their target song. Nevertheless, note that in Zai et al. 2020 (supplementary Figure 1), we analyzed features other than pitch change in deaf birds. Absolute change in entropy variance was larger in deaf birds than in hearing birds, consistent with the literature on song degradation after deafening (Lombardino and Nottebohm, 2000, Nordeen and Nordeen 2010 and many others). In that paper, we found that only pitch changes consistently along the LO direction. All other features that we looked at (duration, AM, FM and entropy) did not change consistently with the LO contingency. We expect that a similar result would apply for the changes across the recovery period in WNd and dLO birds, i.e., that song degradation can be seen in many features and that pitch is the sole feature that changes consistently with reinforcement (LO/WN) direction.

      While there are examples where the authors perform direct comparisons between particular manipulations and the controls, many of the statistical analyses test whether each group is above or below a threshold (e.g. baseline) separately and then make qualitative comparisons between those groups. Given the variation within the manipulated groups, it seems especially important to determine not just whether these are different from the threshold, but how they compare to the controls. In particular, a full model with time (early, late), treatment (deafened, muted, etc), and individual ID (random variable) would substantially strengthen the analysis.

      We fitted a full model of the NRP as the reviewer suggests and it supports our conclusions: Neither muting, deafening nor time without practice between R and E windows have a significant effect on pitch in the E window, but the interaction between deafening and time (late, L) results in a significant pitch change (fixed effect 0.67, p=2×10⁻⁶), demonstrating that deaf birds are significantly further away from baseline (NRP=0) than hearing birds in late windows, thereby confirming that birds require auditory feedback to recover a distant pitch target. Importantly, we find a significant fixed effect on pitch in the direction of the target with mismatch experience (fixed effect -0.37, p=0.006), supporting our finding that limited vocal plasticity towards a target is possible even without auditory feedback.

      We included this model as additional analysis to our manuscript.

      The muted birds seem to take longer to return to baseline than controls even after they are unmuted. Presumably, there is some time required to recover from surgery, however, it's unclear whether muting has longer-term effects on syrinx function or the ability to pass air. In particular, it's possible that the birds still haven't recovered by 4 days after unmuting as a consequence of the muting and unmuting procedure or that the lack of recovery is indicative of an additional effect that muting has on pitch recovery. For example, the methods state that muted birds perform some quiet vocalizations. However, if birds also attempt to sing, but just do so silently, perhaps the aberrant somatosensory or other input from singing while muted has additional effects on the ability to regain pitch. It would also be useful to know if there is a relationship between how long they are muted and how quickly they return to baseline.

      We agree, it might be the case that muting has some longer-term effects that could explain why WNm birds did not recover pitch 4 days after unmuting. However, if such an effect exists, it is only weak. Arguing against the idea that a longer muting requires longer recovery, we did not find a correlation between the difference in NRP between early and late and (1) the duration the birds were muted (correlation coefficient = -0.50, p=0.20), (2) the number of renditions the birds sang between early and late (correlation coefficient = 0.03, p=0.95), or (3) the time since they last sang the target song (last rendition of baseline, correlation coefficient = -0.43, p=0.29). Neither did we find a correlation between the early NRP and the time since the muting surgery (correlation coefficient = 0.26, p=0.53), suggesting that the lack of pitch recovery while muted was not due to a lingering burden of the muting surgery. We added these results to the Results section.
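
      The correlation checks described above can be sketched as follows, assuming a Pearson correlation coefficient (the response does not state which coefficient was used). The function names and the data are hypothetical and serve only to show the computation; the conversion to a t statistic with n − 2 degrees of freedom is the standard test of zero correlation.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def r_to_t(r, n):
    """t statistic and degrees of freedom for testing H0: correlation == 0."""
    return r * sqrt(n - 2) / sqrt(1 - r * r), n - 2


# Hypothetical values: days muted versus change in NRP (illustration only).
days_muted = [1.0, 2.0, 3.0, 4.0, 5.0]
nrp_change = [2.0, 1.0, 4.0, 3.0, 5.0]

r = pearson_r(days_muted, nrp_change)
t, df = r_to_t(r, len(days_muted))
print(round(r, 2), round(t, 2), df)
```

      With small groups, even a moderate coefficient yields a modest t statistic, so a non-significant correlation should be read as absence of evidence for a strong relationship rather than proof that none exists.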

      In summary, we used the WNm group to assess whether birds can recover their target pitch in the absence of practice, i.e. whether they recovered pitch in the early time period. Whether or not some long-term effect of the muting/unmuting procedure affects recovery does not impair the main finding we obtained from WNm birds in Figure 1 (that birds do not recover without practice).

      Reviewer #3 (Public Review):

      Summary:

      Zai et al. test whether birds can modify their vocal behavior in a manner consistent with planning. They point out that while some animals are known to be capable of volitional control of vocalizations, it has been unclear if animals are capable of planning vocalizations, that is, modifying vocalizations towards a desired target without the need to learn this modification by practicing and comparing sensory feedback of practiced behavior to the behavioral target. They study zebra finches that have been trained to shift the pitch of song syllables away from their baseline values. It is known that once this training ends, zebra finches have a drive to modify pitch so that it is restored back to its baseline value. They take advantage of this drive to ask whether birds can implement this targeted pitch modification in a manner that looks like planning, by comparing the time course and magnitude of pitch modification in separate groups of birds who have undergone different manipulations of sensory and motor capabilities. A key finding is that birds who are deafened immediately before the onset of this pitch restoration paradigm, but after they have been shifted away from baseline, are able to shift pitch partially back towards their baseline target. In other words, this targeted pitch shift occurs even when birds don't have access to auditory feedback, which argues that this shift is not due to reinforcement-learning-guided practice, but is instead planned based on the difference between an internal representation of the target (baseline pitch) and current behavior (pitch the bird was singing immediately before deafening).

      The authors present additional behavioral studies arguing that this pitch shift requires auditory experience of the song in its state after it has been shifted away from baseline (birds deafened early on, before the initial pitch shift away from baseline, do not exhibit any shift back towards baseline), and that a full shift back to baseline requires auditory feedback. The authors synthesize these results to argue that different mechanisms operate for small shifts (planning, does not need auditory feedback) and large shifts (reinforcement learning, requires auditory feedback).

      We thank the reviewer for this concise summary of our paper. To clarify, we want to point out that we do not make any statement about the learning mechanism birds use to make large shifts to recover their target pitch, i.e. we do not say that large shifts are learned by reinforcement learning requiring auditory feedback. We only show that large shifts require auditory feedback.

      The authors also make a distinction between two kinds of planning: covert (not requiring any motor practice) and overt (requiring motor practice but without access to auditory experience from which target mismatch could be computed). They argue that birds plan overtly, based on these deafening experiments as well as an analogous experiment involving temporary muting, which suggests that indeed motor practice is required for pitch shifts.

      Strengths:

      The primary finding (that partially restorative pitch shift occurs even after deafening) rests on strong behavioral evidence. It is less clear to what extent this shift requires practice, since their analysis of pitch after deafening takes the average over the first two hours of singing. If this shift is already evident in the first few renditions then this would be evidence for covert planning. This analysis might not be feasible without a larger dataset. Similarly, the authors could test whether the first few renditions after recovery from muting already exhibit a shift back toward baseline.

      This work will be a valuable addition to others studying birdsong learning and its neural mechanisms. It documents features of birdsong plasticity that are unexpected in standard models of birdsong learning based on reinforcement and are consistent with an additional, perhaps more cognitive, mechanism involving planning. As the authors point out, perhaps this framework offers a reinterpretation of the neural mechanisms underlying a prior finding of covert pitch learning in songbirds (Charlesworth et al., 2012).

      A strength of this work is the variety and detail in its behavioral studies, combined with sensory and motor manipulations, which on their own form a rich set of observations that are useful behavioral constraints on future studies.

      Weaknesses:

      The argument that pitch modification in deafened birds requires some experience hearing their song in its shifted state prior to deafening (Fig. 4) is solid but has an important caveat. Their argument rests on comparing two experimental conditions: one with and one without auditory experience of shifted pitch. However, these conditions also differ in the pitch training paradigm: the "with experience" condition was performed using white noise training, while the "without experience" condition used "lights off" training (Fig. 4A). It is possible that the differences in the ability for these two groups to restore pitch to baseline reflect the training paradigm, not whether subjects had auditory experience of the pitch shift. Ideally, a control study would use one of the training paradigms for both conditions, which would be "lights off" or electrical stimulation (McGregor et al. 2022), since WN training cannot be performed in deafened birds. This is difficult, in part because the authors previously showed that "lights off" training has different valences for deafened vs. hearing birds (Zai et al. 2020). Realistically, this would be a point to add to the discussion rather than a new experiment.

      We added the following statement to our manuscript:

      It is unlikely that dLO birds’ inability to recover baseline pitch is somehow due to our use of a reinforcer of a non-auditory (visual) modality, since somatosensory stimuli do not prevent reliable target pitch recovery in hearing birds (McGregor et al 2022).

      A minor caveat, perhaps worth noting in the discussion, is that this partial pitch shift after deafening could potentially be attributed to the birds "gaining access to some pitch information via somatosensory stretch and vibration receptors and/or air pressure sensing", as the authors acknowledge earlier in the paper. This does not strongly detract from their findings as it does not explain why they found a difference between the "mismatch experience" and "no mismatch experience groups" (Fig. 4).

      We added the following statement: Our insights were gained in deaf birds and we cannot rule out that deaf birds could gain access to pitch information via somatosensory-proprioceptive sensory modalities. However, such information, even if available, cannot explain the difference between the "mismatch experience" (WNd) and the "no mismatch experience" (dLO) groups, which strengthens our claim that the pitch reversion we observe is a planned change and not merely a rigid motor response (as in simple use-dependent forgetting).

      More broadly, it is not clear to me what kind of planning these birds are doing, or even whether the "overt planning" here is consistent with "planning" as usually implied in the literature, which in many cases really means covert planning. The idea of using internal models to compute motor output indeed is planning, but why would this not occur immediately (or in a few renditions), instead of taking tens to hundreds of renditions?

      Indeed, what we call ‘covert planning’ refers to what usually is called ‘planning’ in the literature. Also, there seems to be currently no evidence for spontaneous overt planning in songbirds (which we elicited with deafening). Replay of song-like syringeal muscle activity can be induced by auditory stimuli during sleep (Bush, A., Doppler, J. F., Goller, F., and Mindlin, G. B., 2018), but to our knowledge there are no reports of similar replay in awake, non-singing birds, which would constitute evidence for overt planning.

      We cannot ascertain how fast birds can plan their song changes, but our findings are not in disagreement with fast planning. The smallest time window of analysis we chose is 2h, which sets a lower bound of the time frame within which we can measure pitch changes. Our approach is probably not ideally suited for determining the minimal planning time, because the deafening and muting procedures cause an increase in song variability, which calls for larger pitch sample sizes for statistical testing, and the surgeries themselves cause a prolonged period without singing during which we have no access to the birds’ planned motor output. Note that fast planning is demonstrated by the recent finding of instant imitation in nightingales (Costalunga, Giacomo, et al. 2023) and is evidenced by fast re-pitching upon context changes in Bengalese finches (Veit, L., Tian, L. Y., Monroy Hernandez, C. J., & Brainard, M. S., 2021).

      To resolve confusion, it would be useful to discuss and add references relating "overt" planning to the broader literature on planning, including in the introduction when the concept is introduced.

      Overt and covert planning are terms used in the literature on child development and on adult learning, see (Zajic, Matthew Carl, et al., Overt planning behaviors during writing in school-age children with autism spectrum disorder and attention-deficit/hyperactivity disorder, 2020) and (Abbas Zare-ee, Researching Aptitude in a Process-Based Approach to Foreign Language Writing Instruction, Advances in Language and Literary Studies, 2014), and references therein.

      Indeed, muddying the interpretation of this behavior as planning is that there are other explanations for the findings, such as use-dependent forgetting, which the authors acknowledge in the introduction, but don't clearly revisit as a possible explanation of their results. Perhaps this is because the authors equate use-dependent forgetting and overt planning, in which case this could be stated more clearly in the introduction or discussion.

      We do not mean to strictly equate use-dependent forgetting and overt planning, although they can be related, namely when ‘use’ refers to ‘altered use’ as is the case when something about the behavior is missing (e.g. auditory feedback in our study), and the dependence is not just on ‘use’ but also on ‘experience’.

      We added the following sentence to the discussion: We cannot distinguish the overt planning we find from more complex use-and-experience dependent forgetting, since we only probed for recovery of pitch and did not attempt to push birds into planning pitch shifts further away from baseline.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) The single main issue with this paper is in the section related to Figure 4, and the Figure itself - this is the most important part of the paper essential to buttress the claim of covert learning. However, there are several sources of confusion in the text, analyses, and figures. The key result is in Figure 4B, C - and, in the context of Figs 1-3, the data are significant but subtle. That is, as the authors state, the birds are mostly dependent on slow sensory feedback-dependent (possibly RL) mechanisms but there is a small component of target matching that evidences an internal model. One wonders why this capacity is so small - if they had a good internal model they'd be much faster and better at recovering target pitches after distortion-driven deviations even without sensory feedback.

      (1a) The analysis of the WNd and DLO reversions of pitch (related to Fig. 4) uses a d' analysis which is a pivot from the NRP analysis used in the rest of the paper. It is not clear why different analyses are being used here to compute essentially the same measure, i.e. how much did the pitch revert. It's also odd that different results are now obtained - Fig. 4 has a small but significant reversion of pitch in WNd birds but Fig. 2 shows no significant return to baseline.

      We did not test for reversion towards baseline in Fig. 2 and made no statement about whether there is a significant reversion or not. But when we do such a test, we find a significant reversion for WNd birds in the ‘late’ window (NRP=0.5, p=0.02, N=10, tstat=-1.77, two-tailed t-test), which agrees with Figure 4. In the ‘early’ window in Fig. 2, we find only a trend but no reversion (NRP = 0.76, p=0.11, n=10, tstat=-1.76), which contrasts with our findings in Figure 4. However, the discrepancy can be simply explained by the difference in time alignment that we detail in the Materials and Methods. Namely, in Figure 2, we measure pitch relative to the pitch in the morning on the day before, which is not a good measure of ‘reversion’ (since pitch had been reinforced further away during the day), which is why we do not present this analysis in the paper and dedicate a separate analysis in Figure 4 to reversion.

      (1b) Also in Fig. 4 is it the case that, as in the schematic of 4a, ALL birds in these experiments had their pitch pushed up - so that the return to baseline was all down? If this is the case the analysis may be contaminated by a pitch-down bias in deafened birds. This would ideally be tested with a balance of pitch-up and pitch-down birds in the pre-deafening period, and/or analysis of non-targeted harmonic stacks to examine their pitch changes. If non-targeted stacks exhibit pitch-down changes after deafening, then the reversion that forms the key discovery of this paper will be undermined. Please address.

      Both groups in Figure 4 were balanced (the same number of birds had their pitch shifted up as down); see our response to the public review and the Methods.

      (1c) After multiple re-reads and consultations with the Methods section I still do not understand the motivation or result for Figure 4E. Please provide clarification of the hypothesis/control being assessed and the outcome.

      Figure 4E does not add an additional result but strengthens our previous findings because we obtain the same result with a different method. The pitch of deaf birds tends to drift after deafening. To discount for this drift and the effect of time elapsed since deafening, we bootstrapped the magnitude of the pitch change in WNd and dLO birds by comparing them to dC birds in matched time windows. We modified the sentence in the results section to clarify this point:

      To discount for the effect of time elapsed since deafening and quantify the change in pitch specifically due to reinforcement, we bootstrapped the difference in 𝒅′ between dLO/WNd birds and a new group of dC birds that were deafened but experienced no prior reinforcement (see methods).
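
      The bootstrap comparison described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the average-variance form of d′, the percentile confidence interval, and the per-bird toy values are all assumptions introduced here. The idea is only to show how a group difference in d′ can be resampled against a deafened control group.

```python
import random
import statistics


def d_prime(x, y):
    """Sensitivity index between two samples (assumed form: mean difference
    divided by the square root of the average of the two sample variances)."""
    pooled_sd = ((statistics.variance(x) + statistics.variance(y)) / 2) ** 0.5
    return (statistics.mean(x) - statistics.mean(y)) / pooled_sd


def bootstrap_mean_difference(group_a, group_b, n_boot=10000, seed=0):
    """Resample birds with replacement within each group and return the
    bootstrap mean of (mean(a) - mean(b)) with a 95% percentile interval."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        a = rng.choices(group_a, k=len(group_a))
        b = rng.choices(group_b, k=len(group_b))
        diffs.append(statistics.mean(a) - statistics.mean(b))
    diffs.sort()
    lower = diffs[int(0.025 * n_boot)]
    upper = diffs[int(0.975 * n_boot) - 1]
    return statistics.mean(diffs), (lower, upper)


# Hypothetical per-bird d' values (made up for illustration only).
reinforced_d = [-1.0, -0.5, -0.8, -1.2]   # e.g. a reinforced, deafened group
control_d = [0.1, -0.1, 0.2, 0.0]         # e.g. deafened controls (dC)

estimate, interval = bootstrap_mean_difference(reinforced_d, control_d)
```

      Under these assumptions, an interval that excludes zero would indicate a pitch change beyond the drift seen in controls over a matched time window.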

      (1d) Line 215. It's not clear in the text here how the WNd birds experience a pitch mismatch. Please clarify the text that this mismatch was experienced before deafening. This is a critical paragraph to set up the main claims of the paper. Also, it's not clear what is meant by 'fuel their plan'. I can imagine this would simply be a DA-dependent plasticity process in Area X that does not fuel a plan but rather re-wires an HVC timestep to medium spiny neurons whose outputs drive pitch changes - i.e. not a fueled plan but simply an RL-dependent re-mapping in the motor system. Alternatively, a change could result in plasticity in pallial circuits (e.g. auditory to HVC mappings) that are RL independent and invoke an inverse model along the lines of the authors' past work (e.g. Ganguli and Hahnloser). This issue is taken up in the discussion but the setup here in the results is very confusing about the possible outcomes. This paragraph is vague with respect to the key hypotheses. It's possible that the WNd and DLO groups enable dissection of the two hypotheses above - because the DLO groups would presumably have RL signals but without recovery - but there remains a real lack of clarity over exactly how the authors are interpreting Fig 4 at the mechanistic level.

      WNd birds experience a pitch mismatch because while singing they hear that their pitch differs from baseline pitch, but the same is not true for dLO birds. We simply tested whether this experience makes a difference for reversion, and it does. We added ‘before deafening’ to the paragraph and changed the wording of our hypothesis to make it clearer (we reworded ‘fuel their plan’). We left mechanistic interpretations to the discussion. Without going into detail, all we are saying is that birds can only plan to revert motor changes they are aware of in the first place.

      Minor issues

      The songs of deafened birds degrade, at a rate that depends on the bird's age. Younger crystalized birds degrade much faster, presumably because of lower testosterone levels that are associated with increased plasticity and LMAN function. Some background is needed on deafened birds to set up the WNd experiments.

      Despite deafening leading to the degradation of song (Lombardino and Nottebohm, 2000), syllable detection and pitch calculation were still possible in all deaf birds (up to 13-50 days after deafening surgery, age range 90-300 dph, n=44 birds).

      Since pitch shifting was balanced in both deaf bird groups (the same number of birds were up- and down-shifted), systematic changes in pitch post deafening (Lombardino and Nottebohm, 2000) will average out and so would not affect our findings.

      Lines 97-103. The paragraph is unclear and perhaps a call to a SupFig to show the lack of recovery would help. If I understand correctly, the first two birds did not exhibit the normal recovery to baseline if they did not have an opportunity to hear themselves sing without the WN. I am failing to understand this.

      In the early window (first 2 hours after unmuting) birds have not changed their pitch compared to their pitch in the corresponding window at the end of reinforcement (with matching time-of-day). We added ‘immediately after unmuting (early)’ to clarify this statement.

      Lines 68-69. What is the difference between (2) and (3)? Both require sensory representation/target to be mapped to vocal motor output. Please clarify or fuse these concepts.

      We fused the concept and changed the figure and explanation accordingly.

      Line 100. Please name the figure to support the claim.

      We marked the two birds in the Fig. 1H and added a reference in the text.

      Line 109. Is there a way to confirm / test if muted birds attempted to sing?

      Unfortunately, we do not have video recordings to check if there are any signs of singing attempts in muted birds.

      Line 296: Why 'hierarchically 'lower'?

      Lower because without it there is nothing to consolidate, i.e. the higher process can only be effective after the lower but not before. We clarified this point in the text.

      Past work on temporal - CAF (tcaf) by the Olveczky group showed that syllable durations and gaps could be reinforced in a way that does not depend on Area X and, therefore, related to the authors' discussion on the possible mechanisms of sensory-feedback independent recovery, may rely on the same neural substrates that Fig. 4 WNd group uses to recover. Yet the authors find in this paper that tCAF birds did not recover. There seems to be an oddity here - if covert recovery relies on circuits outside the basal ganglia and RL mechanisms, wouldn't t-CAF birds be more likely to recover? This is not a major issue but is a source of confusion related to the authors' interpretations that could be fleshed out.

      This is a good point. We reinvestigated the tCAF birds in the context of Fig. 4, where we looked for pitch reversions towards baseline. tCAF birds also revert towards baseline. We added this information to the supplement. We cannot say anything about the mechanistic reasons for lack of recovery, especially given that we did not look at brain-level mechanisms.

      Reviewer #2 (Recommendations For The Authors):

      The data presentation could be improved. It is difficult to distinguish between the early and late symbols and to distinguish between the colors for the individual lines on the plots or to match them with the points on the group data plots. In addition, because presumably, the points in plots like 2D are for the same individuals, lines connecting those points would be useful rather than trying to figure out which points are the same color.

      We added lines in Fig. 2D connecting the birds in early and late.

      The model illustrations (Fig 1A, Fig 5) are not intuitive and do not help to clarify the different hypotheses or ideas. I think these need to be reworked.

      We revised the model illustrations and hope they now better clarify the different hypotheses.

      Some of the phrasing is confusing. Especially lines 157-158 and 256-257.

      Lines 157-158: we removed an instance of ‘WNd’, which was out of place.

      Lines 256-257: we rephrased to ‘showing that prior experience of a target mismatch is necessary for pitch reversion independently of auditory feedback’

      Reviewer #3 (Recommendations For The Authors):

      For Fig. 1, the conclusion in the text "Overall, these findings suggest that either motor practice, sensory feedback, or both, are necessary for the recovery of baseline song" is not aligned with the figure header "Recovery of pitch target requires practice".

      We rephrased the conclusion to: Overall, these findings rule out covert planning in muted birds and suggest that motor practice is necessary for recovery of baseline song.

      The use of the term "song experience" can be confusing as to whether it means motor or auditory experience. Perhaps replace it with "singing experience" or "auditory experience" where appropriate.

      We made the requested changes.

      Fig. 1A, and related text, reads as three hypotheses that the authors will test in the paper, but I don't think this turns out to be the main goal (and if it is, it is not clear their results differentiate between hypotheses 1, 2, and 3). Perhaps reframe as discussion points and have this panel not be so prominent at the start, just to avoid this confusion.

      We modified the illustration in Fig 1A and simplified it. We now only show the 2 hypotheses that we test in the paper.

      Line 275-276, "preceding few hours necessitates auditory feedback, which sets a limit to zebra finches' covert planning ability". Did the authors mean "overt", not covert? Since their study focuses on overt planning.

      Our study focuses on covert planning in figure 1 and overt planning in subsequent figures.

      The purpose of the paragraph starting on line 278 could be more clear. Is the goal to say that overt planning and what has previously been described as use-dependent forgetting are actually the same thing? If not, what is the relationship between overt planning and forgetting? In other words, why should I care about prior work on use-dependent forgetting?

      We moved the paragraph further down where it does not interrupt the narrative. See also our reply to reviewer 3 on use-dependent forgetting.

      Line 294, "...a dependent process enabled by experience of the former...", was not clear what "former" is referring to. In general, this paragraph was difficult to understand. Line 296: Which is the "lower" process?

      We added explanatory parentheses in the text to clarify. We rephrased the sentence to ‘the hierarchically lower process of acquisition or planning as we find is independent of immediate sensory experience.’

      Line 295, the reference to "acquisition" vs. "retention". It is not clear how these two concepts relate to the behavior in this study, and/or the hierarchical processes referenced in the previous sentence. Overall, it is not clear how consolidation is related to the paper's findings.

      We added explanatory parentheses in the text and changed figure 5 to better explain the links.

      Line 305, add a reference to Warren et al. 2011, which I believe was the first study (or one of them) that showed that AFP bias is required for restoring pitch to baseline.

      We are citing Warren et al. 2011 in the sentence:

      Such separation also applies to songbirds. Both reinforcement learning of pitch and recovery of the original pitch baseline depend on the anterior forebrain pathway and its output, the lateral magnocellular nucleus of the anterior nidopallium (LMAN)(1).

      Line 310, "Because LMAN seems capable of executing a motor plan without sensory feedback", is this inferred from this paper (in which case this is an overreach) or is this referencing prior work (if so, which one, and please cite)?

      We changed the wording to ‘It remains to be seen whether LMAN is capable of executing motor plans without sensory feedback’.

      Line 326, "which makes them well suited for planning song in a manner congruent with experience." I don't fully understand the logic. Can this sentence be clarified?

      We rephrased the sentence and added an explanation as follows: …which makes them well suited for executing song plans within the range of recent experience (i.e., if the song is outside recent experience, it elicits no LMAN response and so does not gain access to planning circuits).

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews: 

      Reviewer #1 (Public review):

      Summary: 

      Authors benchmarked 5 IBD detection methods (hmmIBD, isoRelate, hap-IBD, phasedIBD, and Refined IBD) in Plasmodium falciparum using simulated and empirical data. Plasmodium falciparum has a mutation rate similar to humans but a much higher recombination rate and lower SNP density. Thus, the authors evaluated how recombination rate and marker density affect IBD segment detection. Next, they performed parameter optimization for Plasmodium falciparum and benchmarked the robustness of downstream analyses (selection detection and Ne inference) using IBD detected by each of the methods. They also tracked the computational efficiency of these methods. The authors' work is valuable for the tested species and the analyses presented appear to support their claim that users should be cautious calling IBD when SNP density is low and recombination rate is high.

      Strengths: 

      The study design was solid. The authors set up their reasoning for using P. falciparum very well. The high recombination rate and similar mutation rate to humans is indeed an interesting case. Further, they chose methods that were developed explicitly for each species. This was a strength of the work, as well as incorporating both simulated and empirical data to support their goal that IBD detection should be benchmarked in P. falciparum.

      Weaknesses: 

      The scope of the optimization and application of results from the work are narrow, in that everything is fine-tuned for Plasmodium. Some of the results were not entirely unexpected for users of any of the tested software that was developed for humans. For example, it is known that Refined IBD is not going to do well with the combination of short IBD segments and low SNP density. Lastly, it appears the authors only did one large-scale simulation (there are no reported SDs).

      We thank the reviewer for highlighting the strengths and weaknesses of the study. 

      First, we would like to highlight that: (1) while we use Plasmodium as a model to investigate the impact of high recombination and low marker density on IBD detection and downstream analyses, our IBD benchmarking framework and strategies are widely applicable to IBD methods development for many sexually recombining species including both Plasmodium and non-Plasmodium species. (2) Although some results are not completely unexpected, such as the impact of low marker density on IBD detection, IBD-based methods have been increasingly used in malaria genomic surveillance research without comprehensive benchmarking for malaria parasites despite the high recombination rate. Due to the lack of benchmarking, researchers use a variety of different IBD callers for malaria research including those that are only benchmarked in human genomes, such as refined-ibd. Our work not only confirmed that low marker density (related to high recombination rate) can affect the accuracy of IBD detection, but also demonstrated the importance of proper parameter optimization and tool prioritization for specific downstream analyses in malaria research. We believe our work significantly contributes to the robustness of IBD segment detection and the enhancement of IBD-based malaria genomic surveillance.

      Second, we agree that there is a lack of clarity regarding simulation replicates and the uncertainty of reported estimates. We have made the following improvements, including (1) running n = 3 full sets of simulations for each analysis purpose, which is in addition to the large sample sizes and chromosomal-level replications already presented in our initial submission, and (2) updating data and figures to reflect the uncertainty at relevant levels (segment level, genome-pair level or simulation set level).   

      Reviewer #2 (Public review):

      Summary: 

      Guo et al. benchmarked and optimized methods for detecting Identity-By-Descent (IBD) segments in Plasmodium falciparum (Pf) genomes, which are characterized by high recombination rates and low marker density. Their goal was to address the limitations of existing IBD detection tools, which were primarily developed for human genomes and do not perform well in the genomic context of highly recombinant genomes. They first analysed various existing IBD callers, such as hmmIBD, isoRelate, hap-IBD, phased-IBD, refinedIBD. They focused on the impact of recombination on the accuracy, which was calculated based on two metrics, the false negative rate and the false positive rate. The results suggest that high recombination rates significantly reduce marker density, leading to higher false negative rates for short IBD segments. This effect compromises the reliability of IBD-based downstream analyses, such as effective population size (Ne) estimation. They showed that the best tool for IBD detection in Pf is hmmIBD, because it has relatively low FN/FP error rates and is less biased for relatedness estimates. However, this method is less computationally efficient. Their suggestion is to optimize human-oriented IBD methods and use hmmIBD only for the estimation of Ne. 

      Strengths: 

      Although I am not an expert on Plasmodium falciparum genetics, I believe the authors have developed a valuable benchmarking framework tailored to the unique genomic characteristics of this species. Their framework enables a thorough evaluation of various IBD detection tools for non-human data, such as high recombination rates and low marker density, addressing a key gap in the field. This study provides a comparison of multiple IBD detection methods, including probabilistic approaches (hmmIBD, isoRelate) and IBS-based methods (hap-IBD, Refined IBD, phased IBD). This comprehensive analysis offers researchers valuable guidance on the strengths and limitations of each tool, allowing them to make informed choices based on specific use cases. I think this is important beyond the study of Pf. The authors highlight how optimized IBD detection can help identify signals of positive selection, infer effective population size (Ne), and uncover population structure. They demonstrate the critical importance of tailoring analytical tools to suit the unique characteristics of a species. Moreover, the authors provide practical recommendations, such as employing hmmIBD for quality-sensitive analyses and fine-tuning parameters for tools originally designed for non-P. falciparum datasets before applying them to malaria research.

      Overall, this study represents a meaningful contribution to both computational biology and malaria genomics, with its findings and recommendations likely to have an impact on the field. 

      Weaknesses: 

      One weakness of the study is the lack of emphasis on the broader importance of studying Plasmodium falciparum as a critical malaria-causing organism. Malaria remains a significant global health challenge, causing hundreds of thousands of deaths annually. The authors could have introduced better the topic, even though I understand this is a methodological paper. While the study provides a thorough technical evaluation of IBD detection methods and their application to Pf, it does not adequately connect these findings to the broader implications for malaria research and control efforts. Additionally, the discussion on malaria and its global impact could have framed the study in a more accessible and compelling way, making the importance of these technical advances clearer to a broader audience, including researchers and policymakers in the fight against malaria. 

      We thank the reviewer for highlighting the need to better contextualize the work and emphasize its relevance to malaria control and elimination efforts. We have edited the introduction and discussion sections to highlight the importance of studying Plasmodium as malaria-causing organisms and why IBD-based analysis is important to malaria researchers and policymakers. We believe the changes will better emphasize the public health relevance of the work and improve clarity for a general audience.  

      We would like to clarify that we are not recommending that researchers “optimize human-oriented IBD methods and use hmmIBD only for the estimation of Ne.” We recommended hmmIBD for Ne analysis; however, hmmIBD can be utilized for other applications, including population structure and selection detection. Thus, we generally recommend using hmmIBD for Plasmodium when phased genotypes are available. To avoid potential misunderstandings, we have revised relevant sentences in the abstract, introduction, and discussion. One reason to consider human-oriented IBD detection methods in Plasmodium research is that hmmIBD currently has limitations in handling large genomic datasets. Our ongoing research focuses on improving hmmIBD to reduce its computational runtime, making it scalable for large Plasmodium whole-genome sequence datasets.

      Recommendations for the authors

      Reviewer #1:

      (1) Additional experiments 

      (i) More simulation replicates would be valuable here. The way that results are presented, it appears as though there are no replicates. Apologies if I am incorrect, but when looking through the authors' code the --num_reps defaults to one simulation and there are no SDs reported for any figure. Perhaps the authors are bypassing replicates by taking a random sample of lineages? Some clarification here would be great.

      We agree with the reviewer’s constructive suggestions. We have increased the number of simulation sets to n = 3, in addition to the existing replicates at the chromosomal level. We did not use a larger n for full sets of simulation replicates for two reasons: (1) full replication is quite computationally intensive (n = 3 simulation sets already require a week to run on our computer cluster with hundreds of CPU cores); and (2) the results from different simulation sets are highly consistent with each other, likely due to our large sample size (n = 1000 haploid genomes for each parameter combination). The consistency across simulation sets can be exemplified by the following figures (Author response image 1 and 2) based on simulation sets different from Figures and Supplementary Figures included in the manuscript.

      Author response image 1.

      Additional simulation sets repeating experiments shown in Fig 2.

      Author response image 2.

      Post-optimization Ne estimates based on three independent simulation sets (Fig 5 shows data from simulation set 1).

      In our updated figures, we address the uncertainty of measurements as follows:

      (1) For IBD accuracy based on overlapping IBD segments, we present the mean ± standard deviation (SD) at the segment level (IBD segment false positives and false negatives for each length bin) or genome-pair level (IBD error rates at the genome-wide level). Figures in the revised manuscript show results from one of the three simulation set replicates. The SD of IBD segment accuracy is included in all relevant figures. In the S2 Data file, we chose not to show SDs to avoid text overcrowding in the heatmaps; however, a detailed version, including SD plotting on the heatmap and across three simulation set replicates, is available on our GitHub repository at https://github.com/bguo068/bmibdcaller_simulations/tree/main/simulations/ext_data

      (2) For IBD-based genetic relatedness, the uncertainty is depicted in scatterplots.

      (3) For IBD-based selection signal scans, we provide the mean ± SD of the number of true selection signals and false selection signals. The SD is calculated at the simulation set level (n=3). 

      (4) For IBD network community detection, the mean ± SD of the adjusted Rand index is reported at the simulation set level (n=3). A representative simulation set is randomly chosen for visualization purposes.

      (5) For IBD-based Ne estimates, each simulation set provides confidence intervals via bootstrapping. We found Ne estimates across n=3 simulation sets to be highly consistent and decided to display Ne from one of the simulation sets.

      (6) For the measurement of computational efficiency and memory usage, the mean ± SD was calculated across chromosomes from the same simulation sets.
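
      As an illustration of the set-level summaries in items (3) and (4), the sketch below computes the adjusted Rand index for each simulation set and then reports mean ± SD across sets. It is a minimal stand-in under stated assumptions: the function names and the toy community labels are hypothetical, the SD is the sample standard deviation, and nothing here reproduces the authors' pipeline or their results.

```python
from collections import Counter
from math import comb
from statistics import mean, stdev


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two partitions of the same items."""
    if len(labels_true) != len(labels_pred) or len(labels_true) < 2:
        raise ValueError("need two label lists of equal length >= 2")
    n = len(labels_true)
    # Pair counts within contingency cells, true clusters, and predicted clusters.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case: both partitions are trivial
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def mean_sd(values):
    """Mean and sample standard deviation across simulation sets."""
    return mean(values), stdev(values)


# Toy example: true population labels and inferred communities for three
# hypothetical simulation sets (labels are made up for illustration).
truth = [0, 0, 0, 1, 1, 1]
inferred_per_set = [
    [1, 1, 1, 0, 0, 0],  # perfect recovery with renamed labels
    [0, 0, 1, 1, 1, 1],  # one genome misassigned
    [0, 0, 0, 1, 1, 0],  # one genome misassigned
]
ari_per_set = [adjusted_rand_index(truth, inferred) for inferred in inferred_per_set]
ari_mean, ari_sd = mean_sd(ari_per_set)
```

      With only n = 3 sets the SD is a coarse measure of spread, which is one reason to read it alongside the segment-level and genome-pair-level summaries.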

      We have included a paragraph titled "Replications and Uncertainty of Measures" in the methods section to clarify simulation replications. Additionally, a table of simulation replicates is provided in the new S1 Data file under the sheet named “02_simulation_replicates.”

      (ii) I might also recommend a table or illustrative figure with all the simulation parameters for the readers rather than them having to go to and through a previous paper to get a sense of the tested parameters. 

      We have now generated tables containing full lists of simulation/IBD calling parameters. We have organized the tables into two sections: simulation parameters and IBD calling parameters. For the simulations, we are using three demographic models: the single-population (SP) model, the multiple-population (MP) model, and the human population demography in the UK (UK) model, each with different sets of parameters. Parameters and their values are listed separately for each demographic model (SP, MP and UK). For the IBD calling, we have five different IBD callers, each with different parameters. We have provided lists of the parameters and their values separately for each caller. In total, there are 15 different combinations of 3 demographic models in simulation and five callers in IBD detection (Author response image 3). We provide a table for each of the 15 combinations. We also provide a single large table by concatenating all 15 tables. In the combined table, demographic model-specific or IBD caller-specific parameters are displayed in their own columns, with NA values (empty cells) appearing in rows where these parameters are not applied (see S2 Data file).
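
      The layout of the combined table can be sketched as follows. The model and caller names come from the text above, but every parameter name and value in this example is a hypothetical placeholder rather than an entry from the S2 Data file, and `build_combined_table` is a name introduced here for illustration. The sketch only shows how 3 models × 5 callers yield 15 rows, with `None` standing in for NA where a parameter does not apply.

```python
from itertools import product

# Hypothetical placeholder parameters; real names and values are in S2 Data.
MODEL_PARAMS = {
    "SP": {"sp_example_param": 1},
    "MP": {"mp_example_param": 2},
    "UK": {"uk_example_param": 3},
}
CALLER_PARAMS = {
    "hmmIBD": {"hmmibd_example_param": 10},
    "isoRelate": {"isorelate_example_param": 20},
    "hap-IBD": {"hapibd_example_param": 30},
    "phasedIBD": {"phasedibd_example_param": 40},
    "Refined IBD": {"refinedibd_example_param": 50},
}


def build_combined_table(model_params, caller_params):
    """Return (columns, rows) with one row per model/caller combination."""
    param_columns = sorted(
        {name for params in model_params.values() for name in params}
        | {name for params in caller_params.values() for name in params}
    )
    columns = ["model", "caller"] + param_columns
    rows = []
    for model, caller in product(model_params, caller_params):
        row = dict.fromkeys(columns)  # every cell starts as None (NA)
        row["model"] = model
        row["caller"] = caller
        row.update(model_params[model])    # fill model-specific columns
        row.update(caller_params[caller])  # fill caller-specific columns
        rows.append(row)
    return columns, rows


COLUMNS, ROWS = build_combined_table(MODEL_PARAMS, CALLER_PARAMS)
```

      Keeping model-specific and caller-specific parameters in separate columns makes it possible to filter the concatenated table by either dimension without parsing parameter strings.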

      Author response image 3.

      Schematic of combined parameters from simulations and IBD detection (also included in the S2 Data file)

      (2) Recommendations for improving the writing and presentation 

      Overall, the writing was great, especially the introduction. 

      Three thoughts: 

      (i) It would be great if the authors included a few sentences with guidance on the approach one would take if their organism was not human or P. falciparum.

      We have updated our discussion with the following statement: “Beyond Plasmodium parasites, there are many other high-recombining organisms such as Apicomplexan species like Theileria, insects like Apis mellifera (honeybee), and fungi like Saccharomyces cerevisiae (Baker's yeast). For these species, our optimized parameters may not be directly applicable, but the benchmarking framework established in this study can be utilized to prioritize and optimize IBD detection methods in a context-specific manner.”

      (ii) I think there was a lot of confusion about the simulations as they were presented between the co-reviewer and me. Clarification on whether there were replicates and how sampling of lineages occurred would be helpful for a reader.

      We have added a paragraph with heading “Replications and uncertainty of measures” under the method section to clarify simulation replicates.  Please also refer to our response above for more details (Reviewer #1 (1) Additional experiments).

      (iii) Maybe we missed it, but could the authors add a sentence or two about why isoRelate performed so poorly (e.g. lines 206-207) considering it was developed for Plasmodium? This result seems important. 

      IsoRelate assumes non-phased genotypes as input; therefore, even if phased genotypes are provided, the HMM model used in isoRelate (distinct from the hmmIBD model) may not utilize them. Below, we present examples of IBD segments between true sets and inferred sets from both isoRelate and hmmIBD, where many small IBD segments identified by tskibd (ground truth) and hmmIBD (inferred) are not detected by isoRelate (inferred), although isoRelate still captures very long IBD segments. These patterns are also illustrated in Fig. 3 and S3 Fig. We acknowledge that isoRelate may outperform other methods in the context of unphased genotypes. However, we chose not to benchmark IBD calling methods using unphased genotypes in simulations, as the results may be significantly influenced by the quality of genotype phasing for all other IBD detection methods. The characterization of deconvolution methods is beyond the scope of this paper. We have added a paragraph in the discussion to reflect the above explanation.

      Author response image 4.

      Example IBD segments inferred by isoRelate and hmmIBD compared to true IBD segments calculated by tskibd.

      (3) Minor corrections to the text and figures 

      Lines 105-110 feel like introduction because the authors are defining IBD and goals of work 

      We have shortened these sentences and retained only relevant information for transition purposes. 

      Line 121-122 The definition of false positive is incorrect, it appears to be the exact text from false negative 

      We apologize for the typo and have corrected the definition, so that it is consistent with that in the methods section.
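
      To make the distinction between the two error types concrete, the sketch below computes overlap-based error rates for one genome pair. It rests on an assumed definition that should be checked against the methods section: the false negative rate is taken as the fraction of true IBD length not covered by any inferred segment, and the false positive rate as the fraction of inferred IBD length not covered by any true segment. The function names and the example coordinates are hypothetical.

```python
def merge(segments):
    """Merge overlapping (start, end) intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def total_length(segments):
    """Total length covered by a set of intervals, counting overlaps once."""
    return sum(end - start for start, end in merge(segments))


def overlap_length(a, b):
    """Length shared between two interval sets (two-pointer sweep)."""
    a, b = merge(a), merge(b)
    i = j = shared = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            shared += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared


def ibd_error_rates(true_segments, inferred_segments):
    """Return (false_negative_rate, false_positive_rate) under the assumed definition."""
    shared = overlap_length(true_segments, inferred_segments)
    true_len = total_length(true_segments)
    inferred_len = total_length(inferred_segments)
    fn = (true_len - shared) / true_len if true_len else 0.0
    fp = (inferred_len - shared) / inferred_len if inferred_len else 0.0
    return fn, fp


# Hypothetical coordinates for a single genome pair on one chromosome.
true_ibd = [(0, 100), (200, 300)]
inferred_ibd = [(50, 100), (200, 250), (400, 450)]
fn_rate, fp_rate = ibd_error_rates(true_ibd, inferred_ibd)
```

      The two rates use different denominators (true length versus inferred length), which is why a definition copied from one to the other, as flagged by the reviewer, would change the meaning of the metric.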

      Lines 177-180 feels more like discussion than results 

      We have removed this sentence for brevity. 

      Figure 1: 

      Remove plot titles from the figure 

      Write out number in a 

      The legend in b overlaps the data so moving that inset to the right would be helpful 

      We have removed the titles from Figure 1. In Figure 1a, we have changed the format of the y-axis tick labels from scientific notation to integers. In Figure 1b, we have adjusted the size and location of the legend so that it does not overlap with the data points.

      Figure 2-3 & S4-5: 

      It was hard to tell the difference between [3-4) and [10-18) because the colors and shapes are similar. It might be worth using a different color or shape for one of them? 

      We have changed the color for the [10-18) group so that the two groups are easier to distinguish.

      Figure 3 & S3-5: 

      Biggest suggestion is that when an axis is logged it should not only be mentioned in the caption but also should be shown in the figure as well. 

      We have updated all relevant figures so that the log scale is noted in the figure captions (legends) as well as in the figures (in the x and/or y axis labels).

      Supplementary Figure S2 

      (i) It would be nice to either combine it with the main text Figure 1 (I don't believe it would be overwhelming) or add in the other two methods for comparison 

      We have now plotted data for all five IBD callers in S1 Fig for better comparison. 

      (ii) the legend overlaps the data so relocating it to the top or bottom would be helpful 

      We have moved the legend to the bottom of the figure to avoid overlap with the data.

      Reviewer #2:

      I don't have any major comments on the paper. It is well-written, although perhaps a bit long and repetitive in some sections. Make sure not to repeat the same concepts too many times. 

      We have consolidated and removed several paragraphs to reduce repetition of the same concepts.

      I am not a methodological developer, but it seems you have addressed several challenges regarding IBD detection in P. falciparum. You have also acknowledged the study's caveats, which I agree with. 

      Thank you for the positive comments.

      Minor comments: 

      -In my opinion, the paper would benefit from including the workflow figure in the main text rather than keeping it in the supplementary materials. This would make it more accessible and useful for readers. 

      We have moved the original S1 Fig to be Fig 1 in the main text.

      -Some of the figures (e.g. Fig. 2, 4) should be larger for better clarity and interpretation. 

      We have updated Fig 2 and Fig 4 (now labeled as Figure 3 and 5) to make them larger for improved clarity and interpretation.

      -While the focus on P. falciparum is understandable, it would have been valuable to include examples of other species and discuss the broader implications of the findings for a broader field. 

      We have updated the third-to-last paragraph to discuss implications for other species, such as Apicomplexan species like Theileria, insects like Apis mellifera (honeybee), and fungi like Saccharomyces cerevisiae (Baker's Yeast). We acknowledge that optimal parameters and tool choices may vary among species due to differences in demographic history and evolutionary parameters. However, we emphasize that the methods outlined are adaptable for prioritizing and optimizing IBD detection methods in a context-specific manner across different species.

      -Figure 6 is somewhat confusing and could use clearer labeling or additional explanation to improve comprehension. 

      We have updated the labels and titles in the figure to improve clarity. We also edited the figure caption for better clarity.

      -Although hmmIBD outperformed other tools in accuracy, its computational inefficiency due to single-threaded execution poses a significant challenge for scaling to large datasets. The trade-off between accuracy and computational cost could be discussed in more detail. 

      We have added a paragraph in the discussion section to highlight the trade-off between accuracy and computation cost. We noted that we are developing an adapted tool to enhance the hmmIBD model and significantly reduce the runtime via parallelizing the IBD inference process.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer 1 (Public Review):

      Summary:

      The authors present a mean-field model that describes the interplay between (protein) aggregation and phase separation. Different classes of interaction complexity and aggregate dimensionality are considered, both in calculations concerning (equilibrium) phase behavior and kinetics of assembly formation.

      Strengths:

      The present work is, although purely theoretical, of high interest to understanding biological processes that occur as a result of a coupling between protein aggregation and phase separation. Of course, such processes are abundant, in the living cell as well as in in-vitro experiments. I appreciate the consideration of aggregates with various dimensionality, as well as the categorization into different ”interaction classes”, together with the mentioning of experimental observations from biology. The model is convincing and underlines the complexity associated with the distribution of proteins across phases and aggregates in the living cell.

      Weaknesses:

      There are a few minor weaknesses.

      Reviewer 2 (Public Review):

      This work deals with a very difficult physical problem: relating the assembly of building blocks on a molecular scale to the appearance of large, macroscopic assemblies. This problem is particularly difficult to treat, because of the large number of units involved, and of the complex way in which these units-monomers-interact with each other and with the solvent. In order to make the problem treatable, the authors recur to a number of approximations: Among these, there is the assumption that the system is spatially homogeneous, i.e., its features are the same in all regions of space. In particular, the homogeneity assumption may not hold in biologically relevant systems such as cells, where the behavior close to the cell membrane may strongly differ from the one in the bulk. As a result, this hypothesis calls for a cautious consideration and interpretation of the results of this work. Another notable simplification introduced by the authors is the assumption that the system can only follow two possible behaviors: In the first, each monomer interacts equally with the solvent; no matter the size of the cluster of which it is part. In the second case, monomers in the bulk of a cluster and monomers at the assembly boundary interact with the solvent in a different way. These two cases are considered not only because they simplify the problem, but also because they are inspired by biologically relevant proteins.

      With these simplifications, the authors trace the phase diagram of the system, characterizing its phases for different fractions of the volume occupied by the monomers and solvent, and for different values of the temperature. The results qualitatively reproduce some features observed in recent experiments, such as an anomalous distribution of cluster sizes below the system saturation threshold, and the gelation of condensed phases above such threshold.

      Reviewer 3 (Public Review):

      Summary:

      The authors combine classical theories of phase separation and self-assembly to establish a framework for explaining the coupling between the two phenomena in the context of protein assemblies and condensates. By starting from a mean-field free energy for monomers and assemblies immersed in solvent and imposing conditions of equilibrium, the authors derive phase diagrams indicating how assemblies partition into different condensed phases as temperature and the total volume fraction of proteins are varied. They find that phase separation can promote assembly within the protein-rich phase, providing a potential mechanism for spatial control of assembly. They extend their theory to account for the possibility of gelation. They also create a theory for the kinetics of self-assembly within phase separated systems, predicting how assembly size distributions change with time within the different phases as well as how the volumes of the different phases change with time.

      Strengths:

      The theoretical framework that the authors present is an interesting marriage of classic theories of phase separation and self-assembly. Its simplicity should make it a powerful general tool for understanding the thermodynamics of assembly coupled to phase separation, and it should provide a useful framework for analyzing experiments on assembly within biomolecular condensates.

      The key advance over previous work is that the authors now account for how self-assembly can change the boundaries of the phase diagram.

      A second interesting point is the explicit theoretical consideration for the possibility that gelation (i.e. self-assembly into a macroscopic aggregate) could account for widely observed solidification of condensates. While this concept has been broadly discussed, to date I have yet to see a rigorous theoretical analysis of the possibility.

      The kinetic theory in sections 5 and 6 is also interesting as it extends on previous work by considering the kinetics of phase separation as well as those of self-assembly.

      Weaknesses:

      A key point the authors make about their theory is that it allows, as opposed to previous research, to study non-dilute limits. It is true that they consider gelation when the 3D assemblies become macroscopic. However, dilute solution theory assumptions seem to be embedded in many aspects of their theory, and it is not always clear where else the non-dilute limits are considered. Is it in the inter-species interaction χij? Why then do they never explore cases for which χij is nonzero in their analysis?

      We explicitly consider that monomers and aggregates are non-dilute with respect to solvent. This is evident in accounting for the mixing entropy of all components, including the solvent. Moreover, we account for interactions among the monomers and the different aggregates with the solvent. We consider the case where each monomeric unit, independent of the aggregate it is part of, interacts the same way with the solvent. Please note that this case corresponds to a non-dilute scenario where interactions indeed drive phase separation.

      The connection between this theory and biological systems is described in the introduction but lost along the main text. It would be very helpful to point out, for instance, that the presence of phase separation might induce aggregation of proteins. This point is described formally at the end of Section 3, but a more qualitative connection to biological systems would be very useful here.

      We thank the referee for the useful comment. We now mention this in the introduction (line 80) and point out the biological relevance of assembly formation and localization via the presence of phase separation (lines 268 and 283).

      Building on the previous point, it would be helpful to give an intuitive sense of where the equations derived in the Appendices and presented in the main text come from and to spell out clear physical interpretations of the results. For example, it would be helpful to point out that Eq. 4 is a form of the law of mass action, familiar from introductory chemistry. It would be useful to better explain how the current work extends on existing previous work from these authors as well as others. Along these lines, closely related work by W. Jacobs and B. Rogers [O. Hedge et al. 2023, https://arxiv.org/abs/2301.06134; T. Li et al. 2023, https://arxiv.org/abs/2306.13198] should be cited in the introduction. The results discussed in the first paragraph of Section 3 on assembly size distributions in a homogeneous system are well-known from classic theories of self-assembly. This should be acknowledged and appropriate references should be added; see for instance, Rev. Mod. Phys. 93, 025008 and Statistical Thermodynamics Of Surfaces, Interfaces, And Membranes by Sam Safran. Equation 14 for the kinetic of volume fractions is given with reference to Bauermann et al. 2022, but it should be accompanied by a better intuitive interpretation of its terms in the main text. In particular, how should one understand the third term in this equation? Why does the change in volume impact the change of volume fraction in this way?

      We thank the referee for the suggestions. We have included the missing references, with a particular emphasis on DNA nanostars that inhibit phase separation in DNA liquids in the definition of class II. We added intuitive explanations of the main equations, such as Eqs. (4),(8),(14), (17), and (18). Notice that, according to Mysels, Karol J., J. Chem. Educ., 33, 178 (1956) (https://pubs-acs-org.sire.ub.edu/doi/epdf/10.1021/ed033p178) we refer to (18) as the law of mass action.

      The discussion in the last paragraph of Section 6 should be clarified. How can the total amount of protein in both phases decrease? This would necessarily violate either mass or volume conservation. Also, the discussion of why the volume is non-monotonic in time is not clear.

      A decrease in the total amount of protein in both phases does not violate mass conservation if the volumes of the phases vary accordingly. In particular, the volume of the denser phase should grow. Given this, in the case presented, the total protein amount in the dense phase decreases while that in the dilute phase increases. For this reason, we revised the paragraph and now explain the results in more detail (see lines starting from 407). The nonmonotonic volume change is indeed a puzzling finding that, as we now state in the manuscript, requires further investigation. Given the lack of analytical approaches available to tackle the complex kinetics in the presence of coexisting phases, we believe that this analysis goes beyond the scope of the present paper.
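      The conservation argument above can be checked with a small numerical sketch. It assumes an incompressible two-phase system with fixed total volume, so that the dense-phase volume follows from protein conservation (the lever rule). The function names and all numbers are illustrative assumptions, not values from the manuscript, and the sketch does not reproduce the specific case discussed by the authors.

```python
def dense_phase_volume(phi_dense, phi_dilute, phi_total, v_total=1.0):
    """Dense-phase volume from protein conservation (lever rule).

    Solves phi_dense * V_d + phi_dilute * (v_total - V_d) = phi_total * v_total.
    Assumes fixed total volume; all arguments are illustrative.
    """
    if not phi_dilute < phi_total < phi_dense:
        raise ValueError("phi_total must lie between the coexisting volume fractions")
    return v_total * (phi_total - phi_dilute) / (phi_dense - phi_dilute)


def protein_per_phase(phi_dense, phi_dilute, phi_total, v_total=1.0):
    """Protein volume held in the (dense, dilute) phase."""
    v_dense = dense_phase_volume(phi_dense, phi_dilute, phi_total, v_total)
    v_dilute = v_total - v_dense
    return phi_dense * v_dense, phi_dilute * v_dilute


PHI_TOTAL = 0.24  # conserved average protein volume fraction (hypothetical)

# Both coexisting volume fractions drop between the two snapshots.
early = protein_per_phase(0.80, 0.10, PHI_TOTAL)
late = protein_per_phase(0.70, 0.05, PHI_TOTAL)

v_dense_early = dense_phase_volume(0.80, 0.10, PHI_TOTAL)
v_dense_late = dense_phase_volume(0.70, 0.05, PHI_TOTAL)

if __name__ == "__main__":
    print("dense-phase volume:", v_dense_early, "->", v_dense_late)
    print("total protein:", sum(early), "->", sum(late))
```

      In this sketch the volume fraction falls in both phases, yet the total protein is unchanged because the dense phase takes up a larger share of the volume.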

      Recommendations for the authors

      Reviewer 1 (Recommendations For The Authors):

      Line 96: I feel a mentioning/definition/explanation and perhaps some discussion on the parameter M (limiting aggregate size) would have been in place in the introduction of Equation (1). Furthermore, in the usual interpretation, Flory interaction parameters (symbolized χ) are dimensionless, as, classically, they represent an exchange energy (normalized by kT), defined on a monomeric basis. Here they seem to carry the dimension of energy.

      We thank the reviewer for the observation. We have included a brief comment on M and mentioned that we use χ parameters that carry the dimension of energy such that, varying kBT, we scale at the same time the term containing interaction propensities (χ) and the one containing internal energies (e<sub>int</sub>). See the comment on line 127.

      Line 150: The choice of ρi = i physically implies that a single protein is assumed to have the same molecular volume as a solvent molecule. This may be a bit of a stretch. This assumption leads to an overestimation of the translational entropy of the aggregates (first term in Equation (1)). Acknowledging that ρ1 ≫ ρs would give a pronounced desymmetrization of the phase diagram (I suspect).

      Indeed, in the case of monomers only, the assumption leads to a symmetric phase diagram, which may be unrealistic. Once assemblies form, however, the phase diagram becomes asymmetric, and for this reason we decided to assume ρi = i, simplifying the theoretical analysis. We have added a clarifying sentence in the manuscript, see line 163.
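      The symmetry point can be illustrated with the textbook binary Flory-Huggins free energy, used here as a stand-in and not as the authors' Eq. (1). The sketch assumes a single solute species with relative molecular volume rho and a dimensionless interaction parameter chi (the manuscript uses χ with the dimension of energy); the function names and parameter values are hypothetical.

```python
import math


def flory_huggins(phi, rho, chi):
    """Textbook binary Flory-Huggins free energy density in units of kBT.

    phi: solute volume fraction (0 < phi < 1)
    rho: solute molecular volume relative to the solvent
    chi: dimensionless solute-solvent interaction parameter
    """
    return (
        (phi / rho) * math.log(phi)
        + (1.0 - phi) * math.log(1.0 - phi)
        + chi * phi * (1.0 - phi)
    )


def asymmetry(rho, chi=2.5, n=99):
    """Largest |f(phi) - f(1 - phi)| on a uniform grid.

    A value of zero (up to rounding) means the free energy, and hence the
    binodal, is symmetric about phi = 1/2.
    """
    grid = [(k + 1) / (n + 1) for k in range(n)]
    return max(
        abs(flory_huggins(p, rho, chi) - flory_huggins(1.0 - p, rho, chi))
        for p in grid
    )


if __name__ == "__main__":
    print("rho = 1 :", asymmetry(1.0))   # equal molecular volumes
    print("rho = 10:", asymmetry(10.0))  # solute ten times larger than solvent
```

      With rho = 1 the mixing entropy is invariant under phi → 1 − phi, so the free energy is symmetric. Any rho > 1 breaks this invariance, which is the desymmetrization the reviewer anticipates.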

      Furthermore, the pictures in Figure 1a-c suggest the presence of a disordered residue, the degree of swelling of which might affect binding strength (see for instance: https://doi.org/10.3389/fnmol.2022.962526).

      We added a comment on the possible coupling between internal free energies and interaction propensities, such as the swelling mechanism that affects binding sites, and included the reference above (line 215).

      Line 154-156: It’s unclear what is meant with ”an internal bond that keeps each assembly together”. How should this be interpreted on an intuitive physical level?

      We apologise for being unclear. We meant the internal bonds that lead to the formation of assemblies. We have now rephrased this sentence in the main text (lines starting from 169).

      Line 254: The fact that ϕsg is defined below does not mean it does not fall out of the air here. The same holds for the consideration of the limit M →∞. Ideally, the main text should stand on its own, in particular with respect to physical intuitiveness, as well as the necessity and interest of discussion topics. Technical details, derivations and additional information can be in an appendix.

      We agree with the referee and added some physical insights about the limit. We now also state clearly in the main text (line 298) that ϕsg is affected by temperature and the free energy of internal bonds.

      Line 257: ”Since we do not explicitly include the solvent in assembly formation we will consider the gel as a phase without solvent and thus ϕtot = 1”. I’m not sure if I can agree with this. I would say, a gel, certainly in biological context, almost per definition contains a large fraction of solvent, i.e. here water. The situation ”ϕtot = 1” would rather be a solid precipitate. Is gelation properly captured by this model?

      We thank the referee for this very relevant observation. We now state in the main text that the model predicts a macroscopic assembly which we call ’the gel phase’, in agreement with previous literature. Then, to clarify, we added the sentence ”Please note that, since we do not explicitly include the solvent in assembly formation (see reaction scheme in Fig.1a), in our model the gel corresponds to a phase without solvent, ϕtot = 1. To account for biological gels that can be rich in water, our theory can be straightforwardly extended by incorporating the solvent into the reaction scheme.”, see main text line 300.

      Line 268: Shouldn’t ”solvent” be ”solution”? If fsol is given by Equation (1), surely not only the solvent is considered.

      Indeed, this is a typo, and we now use the term ’solution’ instead of ’solvent.’

      Line 273: At this stage, the only information provided in the main text is that ω∞ is ”a constant that does not affect chemical nor phase equilibrium, except in the limit M →∞” (see lines 153-154). This is a little bit too abstract for me. Again, the main text should stand on its own, meaning the reader should not have to rely on an appendix to at least have an intuitive physical understanding of any modeling or input parameter discussed in the main text.

      We thank the reviewer for pointing this out. We now comment on the physical interpretation of ω∞ in the main text, see lines from 320 on.

      Figure 4. appears in Equation (39) but it is not defined.

      We thank the reviewer for pointing this out. We have reshaped appendix 6A, making use of chemical activities and clarified the origin of the rate .

      Line 317. I don’t fully understand the intention of the remark on the model being adaptable for ”primary and secondary nucleation”. How/in what way is this different from association and dissociation? For instance, classical nucleation theory is based on association and dissociation of monomeric units to and from clusters.

      We agree that the kinetic rate coefficients kij (appearing in the association and dissociation rates ∆rij, Eq. 17) in our manuscript already depend on assembly length, see Appendix 6 B, where we now clarified their definition. Please note, however, that secondary nucleation is a special kind of association, for which the kinetic rate coefficients corresponding to associations of small assemblies, i.e. kij with i, j ≪ M, explicitly depend on the presence of large assemblies with sizes l ≫ 1. In our manuscript, we have not accounted for such a dependence. We now make this aspect clear in the manuscript, see Appendix 6 B.

      Line 321. Why is ∆rij called the ”monomer exchange rate”? In line 318 the same parameter is defined as the ”reaction rate for the formation of a (i+j)-mer”. Why should these be the same?

      We thank the reviewer for spotting this typo.

      Line 323. Why do these calculations use M = 15?

      The exploration of a 15-dimensional phase space is already numerically challenging. We are currently working on a generalization of the numerical scheme to work with larger values of M but, to discuss the fundamental physical principles, we kept M = 15.

      Reviewer 2 (Recommendations For The Authors):

      The manuscript presents several issues, on both the scientific and presentational level, which need to be carefully addressed. Please find below a list of the points that need to be addressed by the authors, divided into major and minor points. Major issues:

      • A general, major concern about the results in the paper is the homogeneity assumption. I do understand that repeating the whole analysis presented in the manuscript by allowing for spatial inhomogeneities partially goes beyond the scope of this paper. However, the authors should at least discuss how such inhomogeneities may alter the results in a qualitative way, and treat explicitly the presence of inhomogeneity in one prototypical case treated in the manuscript. Namely, what happens if the volume fractions and relative molecular volumes in the free energy (1) depend on space, e.g., ϕi → ϕi(x)?

      We would like to stress that, in the present paper, we do account for spatial inhomogeneities. Indeed, in the case of phase separation, we consider systems which are divided into two phases, characterized by different values of the assemblies’ volume fractions ϕi. We do, however, consider the system to be homogeneous inside the phases, implying a jump in the value of the volume fraction at the interface between the two phases. In this sense, the analysis we carry out is valid in the thermodynamic limit, where gradients of the volume fractions ϕi(x) within the phases can be neglected. On the other hand, considering the full spatial problem, i.e. solving the equations for M = 15 spatially varying fields, would be numerically extremely challenging.

      • The authors’ results relate molecular assembly- a phenomenon at the molecular scale-to phase separation-a mesoscopic or macroscopic phenomenon. The authors should stress the conceptual importance of this connection between scales, and present their results from the perspective of a multi-scale model.

      We thank the reviewer for pointing this out. We now emphasize the multi-scale feature of our model in the introduction (line 80).

      • Starting from Section 1, the reader is not well guided through the sections that follow. The authors should provide an outline of the line of thought that they are going to follow in the following sections, and logically connect each section to the next one with a short paragraph at the end of each section. This paragraph should summarize what has been addressed in the current section, and the connection with the topic that will be addressed in the next one.

      We agree with the reviewer and have added a transitioning sentence at the end of each paragraph.

      • ’We focus on linear assemblies (d = 1)’: Given the striking differences of the results between d = 1 and d > 1 shown above, the authors should discuss what happens for d > 1 as well.

      • ’In figure Fig. 5a, we show the initial and final equilibrium binodals (black and coloured curve, respectively), for the case of linear assemblies (d = 1) belonging to class 1’: Again, show what happens for d > 1.

      We agree with the reviewer, the kinetics in d > 1 would be definitely interesting. However, in this case, one assembly can become macroscopic (i.e. M must be set to ∞). This requires some substantial modification in the kinetic scheme, like introducing an absorbing boundary condition for monomers ’sucked in’ the gel. We prefer to leave this for future work, and now state it explicitly in the manuscript (line 383).

      • ’This difference arises because, within class 2, monomers in the bulk of an assembly have reduced interaction propensity with respect to the boundary ones. As a consequence, the formation of large clusters shifts the onset of phase separation to higher ϕtot values.’: To prove this argument, the authors should show Fig. 2g and h for d > 1. In fact, by varying d, the effect of the boundary vs. bulk also varies.

      We prefer to discuss the thermodynamics of d > 1 in section 4 on gelation. There we present only a single phase diagram so as not to blow up the discussion on equilibrium too much.

      • ’referring for simplicity to systems belonging to Class 1’: The authors should do the same analysis for Class 2.

      We agree with the reviewer. However, again not to blow up the discussion on equilibrium, we leave it for future work.

      • ’other, implying that the corresponding Flory-Huggins parameter χij vanishes’: Why?

      The explanation based on a lattice model is reported in Appendix 2, and is now more clearly referenced (line 185).

      Minor issues:

      • Eq. (10): Here the authors should explain in the main text, possibly in a simple and intuitive way, why the number of monomers i and the space dimension d enter the righthand side of this equation in this particular way.

      We thank the reviewer for pointing this out. We added the physical origin of the scaling with dimension in Eq. (10) and in Eq. (8), as pointed out by reviewer 3.

      • ’The second and fifth terms of fsol characterize the internal free energies’: What do you mean by ’characterize the internal free energies’? Please clarify.

      As we now state more clearly (lines 114-120), these two contributions include the internal free energies ωs and ωi, stemming from the free energy of internal bonds that lead to assembly formation.

      • ’depend on the scaling form of the’: Scaling with respect to what ? Please clarify.

      We have now clarified that the scaling is with respect to the assembly size i.

      • Figure 2 is way too dense: it should be split into two figures, and the legend of each of the two figures should be expanded to properly guide the reader to understand the figures.

      We understand the reviewer’s point of view. To avoid altering the present flow, we decided not to split the figure, but we have included shaded boxes to better guide the reader.

      • ’this is a consequence of the gelation transition’: Please clarify

      • ’and this limitation can be dealt with by introducing explicitly the infinite-sized gel in the free energy’: Why? Please clarify.

      We have now rephrased these sentences, hopefully in a clearer way. We now state: ’We know that this divergence is physical, and is caused by the gelation transition. This limitation can be dealt with by introducing explicitly a term in the free energy that accounts for an infinite-sized assembly (the gel)’, see lines 320-322.

      • Figure 4: Add plots of panels d, e, h and i with log scale on the y axis to make explicit an eventual exponential behavior, and revise the text accordingly

      Not to further complicate Figure 4, we preferred to display the logarithmic plots of the equilibrium distribution in the appendix, see Figure A3-1.

      • ’... an equilibrium distribution which monotonously decreases with assembly size’: It is not the distribution that decreases but the cluster volume fraction, please rephrase.

      We thank the reviewer for pointing this out and have now rephrased this sentence (line 394).

      Reviewer 3 (Recommendations For The Authors):

      I could not obtain the exact form of Eq 29 in App 3, can the authors elaborate on this calculation. App 3: What does it mean binodal agrees well with ϕsg? And doesn’t ϕsg depend on temperature through phi tilde? What temperature is this result for?

      We apologise for the unclear explanation. We now state in detail that Eq. (29) is obtained by plugging the expression of ϕi given in Eq. (24) into Eq. (1), in the main text. The dependence of ϕ<sub>1</sub> on ϕ<sub>tot</sub> is expressed in Eq. (26), and we have omitted linear terms in ϕ<sub>tot</sub>, since they do not affect phase equilibrium (see lines 802-809). Moreover, ϕsg depends indeed on k<sub>B</sub>T. We refer to the comparison between the full curve ϕsg in the k<sub>B</sub>T−ϕ<sub>tot</sub> plane, and the branch of the binodal between the triple point (indicated now with a cross) and ϕ<sub>tot</sub> = 1. The two curves are close, as expected since both correspond to the boundary between homogeneous mixtures and the gel state, obtained with different methods.

      The references to Figures in the appendices are confusing. Please make it clear whether Figures in the main text or the appendices are being referenced. On a related note, the Appendix figures seem to be placed in appendices whose text describes something else - Appendix 2, Figure 1 should be moved to Appendix 3; Appendix 3, Figure 1 should be moved to Appendix 4; etc.

      We revised the appendix, corrected the figure positions and clarified their references.

    1. Author Response

      The following is the authors’ response to the original reviews.

      Reviewer #2 (Public Review):

      Making state-of-the-art (super-resolution) microscopy widely available has been the subject of many publications in recent years as correctly referenced in the manuscript. By advocating the ideas of open-microscopy and trying to replace expensive, scientific-grade components such as lasers, cameras, objectives, and stages with cost-effective alternatives, interested researchers nowadays have a number of different frameworks to choose from. In the iteration of the theme presented here, the authors used the existing modular UC2 framework, which consists of 3D printable building blocks, and combined a cheapish laser, detector and x,y,(z) stage with expensive filters/dichroics and a very expensive high-end objective (>15k Euros). This particular choice raises a first technical question, to which extent a standard NA 1.3 oil immersion objective available for <1k would compare to the chosen NA 1.49 one.

      Measurements of the illumination quality (e.g. the spectral purity) of low-budget lasers convinced us of the necessity of spectral filtering. The filters cannot be replaced with lower-budget alternatives if the sensitivity needed to image single molecules is to be retained. As expected, the high-quality objectives are able to produce high-quality data. Lower-budget alternatives (<500 €) to replace the objective have been tried out. Image quality is reduced, but key features in fluorescent images can still be identified (see figure S1). A low-budget objective can be used for SMLM imaging, but quality benchmarks such as resolving railroad tracks along microtubule profiles cannot be met. Such objectives are not optimal for applications aiming to visualize single molecules and might be better suited to teaching projects.

      The choice of using the UC2 framework has the advantage, that the individual building blocks can be 3D printed, although it should be mentioned that the authors used injection-molded blocks that will have a limited availability if not offered commercially by a third party. The strength of the manuscript is the tight integration of the hardware and the software (namely the implementations of imSwitch as a GUI to control data acquisition, OS SMLM algorithms for fast sub-pixel localisation and access to Napari).

      The injection-molded cubes can be acquired through the OpenUC2 platform. Alternatively, the 3D printable version of the cubes is freely available and just requires the user to have a 3D printer. https://github.com/openUC2/UC2-GIT/tree/master/CAD/CUBE_EmptyTemplate

      The presented experimental data is convincing, demonstrating (1) extended live cell imaging both using bright-field and fluorescence in the incubator, (2) single-particle tracking of quantum dots, and (3) STORM measurements in cells stained against tubulin. In the following I will raise two aspects that currently limit the clarity and the potential impact of the manuscript.

      First, the manuscript would benefit from further refinement. Elements in Figure 1d/e are not described properly. Figure 2c is not described in the caption. GPI-GFP is not introduced. MMS (moment scaling spectrum) could benefit from a one sentence description of what it actually is. In Figure 6, the size of the STORM and wide-field field of views are vastly different, the distances between the peaks on the tubuli are given in micrometers rather than nanometers. (more in the section on recommendations for the author)

      Second, and this is the main criticism at this point, is that although all the information and data is openly available, it seems very difficult to actually build the setup due to a lack of proper documentation (as of early July 2023).

      1) The bill of materials (https://github.com/openUC2/UC2-STORM-and-Fluorescence#bill-of-material) should provide a link to the commercially available items. Some items are named in German. Maybe split the BoM in commercially available and 3D printable parts (I first missed the option to scroll horizontally).

      2) The links to the XY and Z stage refer to the general overview site of the UC2 project (https://github.com/openUC2/) requiring a deep dive to find the actual information.

      3) Detailed building instructions are unfortunately missing. How to assemble the cubes (pCad files showing exploded views, for example)? Trouble shooting?

      4) Some of the hardware details (e.g. which laser was being used, lenses, etc) should be mentioned in the manuscript (or SI)

      I fully understand that providing such level of detail is very time consuming, but I hope that the authors will be able to address these shortcomings.

      1) The bill of materials has been improved and will continue to be improved in the future. The items have been sorted into UC2 printed parts and externally acquired parts. The combination of part name and provider enables users to find and acquire the same parts. Additionally, depending on the country where the user is located, different providers of a given part might be advantageous, as delivery means and costs might vary.

      2) The Z-stage now has a dedicated repository offering several solutions with different levels of movement precision. Depending on the user and their budget, a different solution may be optimal.

      https://github.com/openUC2/UC2-Zstage

      The XY stage now also has a detailed repository, as the motorizing of the stage requires a fair amount of tinkering. The video tutorials and the detailed instructions on stage motorizing should help any user to reproduce the stage shown within this manuscript. https://github.com/openUC2/UC2-Motorized-XY-Table

      3) The updated repository has a short video showing the general assembly of the cubes and the layers. Additionally, figure S2 shows all the pieces that are included in every layer (as a photograph as well as CAD). An exploded view of the complete setup would certainly be a helpful visualization. We hope, however, that the presented assembly tutorials and documents are sufficient to successfully reproduce the U.C.STORM setup.

      First, we want to thank the reviewers for their effort to help us improve our work. We apologize for any trivial mistakes we had overlooked. Please find below our answers to the very constructive and helpful comments of the editors.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for The Authors):

      To complement the current data set:

      Figure 2(a & b): Panels i & ii, were chosen on the area where the distribution of the laser appears to be flatter. Can the authors select microtubules from a different section? Otherwise, it is reasonable to also crop the field-of-view along the flatter area (as done in Fig 6).

      Figure 2 was changed according to the reviewer’s suggestions. Microtubules from a different section show similar profiles, but the region with the best illumination, and thus the best SNR of the profile, has been used for the figure.

      Figure 2(c): The current plot shows the gaussian distribution which does not appear to be centered. Instead of a horizontal line, can the authors provide a diagonal profile across the field of view and update the panel below?

      A diagonal cross-section of the illuminated FOV is provided in figure 2 to replace the previous horizontal profile. The pattern seems not to be perfectly radially symmetric, and more light seems to be blocked at the bottom of the illumination pattern compared to the top. A possible improvement can be provided by a fiber-coupled laser, that could provide a more homogeneous illumination while being easier to handle in the assembly process.

      Author response image 1.

      Diagonal cross-section of the illuminated FOV. Pixel-size (104nm) is the same as in figure 2. Intensity has been normalized according to the maximal value.

      Figure 2(d): The system presents a XY drift of ~500nm over the course of a couple of hours. However, it is not clear how the focus is being maintained. Can the authors clarify this point and add the axial drift to the plot?

      The axial position of the sample could be maintained over a prolonged period of time without correcting for drift. Measurements where an axial shift was induced by tension pulses in the electronics have been discarded, but the stability of the stage seems to be sufficient to allow for imaging without lateral and axial drift correction. The XY drift measurement displayed in Figure 2(d) can be extended by measuring the σ of the PSF over time. An increase of σ would suggest an axial displacement in relation to the focus plane. In these measurements, a slight axial drift can be seen; the fluorescent beads, however, can still be localized over the whole course of the measurement.

      A separate experiment was performed, using the same objective on the UC2 setup and on a high-quality setup equipped with a piezo actuator able to move in 10 nm steps. The precise Z steps of the piezo make it possible to sweep reproducibly through the PSF shape and to estimate the axial displacement of the sample from the changes in PSF FWHM (Full Width at Half Maximum). When superimposing the graph with the UC2 measurement of fluorescent beads with the smallest possible Z step, an estimate of the relative axial position of the sample can be provided. The accuracy of the stage, however, remains limited.
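
      The relation between the fitted σ and the FWHM, and the look-up of an axial position from a calibration curve, can be sketched as follows. This is only a minimal illustration assuming a Gaussian PSF model. The function names, the calibration values and the restriction to one side of the focus (so that FWHM increases monotonically with Z) are assumptions made for the example and are not the measured data.

```python
import math

# For a Gaussian PSF, FWHM = 2 * sqrt(2 * ln 2) * sigma (about 2.355 * sigma).
FWHM_PER_SIGMA = 2.0 * math.sqrt(2.0 * math.log(2.0))


def sigma_to_fwhm(sigma_nm):
    """Convert a fitted Gaussian sigma (nm) into a FWHM (nm)."""
    return FWHM_PER_SIGMA * sigma_nm


def estimate_z(fwhm_nm, calibration):
    """Linearly interpolate Z (nm) from (z_nm, fwhm_nm) calibration pairs.

    Assumes the pairs come from one side of the focus only, so that the
    FWHM is strictly increasing with Z and the look-up is unambiguous.
    """
    points = sorted(calibration, key=lambda p: p[1])
    if not points[0][1] <= fwhm_nm <= points[-1][1]:
        raise ValueError("FWHM outside the calibrated range")
    for (z0, f0), (z1, f1) in zip(points, points[1:]):
        if f0 <= fwhm_nm <= f1:
            return z0 + (z1 - z0) * (fwhm_nm - f0) / (f1 - f0)


# Hypothetical calibration stack (Z position in nm, PSF FWHM in nm).
calibration = [(0, 300.0), (100, 320.0), (200, 360.0), (300, 420.0)]
z_estimate = estimate_z(340.0, calibration)
```

      With the piezo-based reference stack as calibration, a FWHM measured on the UC2 setup would be mapped to a relative Z position in this way; the achievable accuracy remains bounded by the stage, as stated above.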

      Author response image 2.

      Drift Figure: a. Drift of fluorescent TS beads on the UC2 setup positioned upon an optical table over a duration of two hours. Beads are localized and the resulting displacements in i. and ii. are plotted in the graphs below. The procedure is repeated in b. with the microscope placed on a laboratory bench instead. c. (for the optical table i.) and d. (for the laboratory bench i.) show the variation in the sigma value of the localized beads over the measurement duration. As the sigma value changes when the beads are out of focus, the stability of the setup can be confirmed, as it remains practically unchanged over the measurement duration.

      Author response image 3.

      Z-focus Figure: Estimation of the axial position of TS beads on the UC2 setup. a. The change in PSF FWHM was quantified by acquiring a Z stack of a bead sample. The homebuilt high-quality setup (HQ) was used as a reference, by using the same objective and TS sample. The PSF FWHM on the UC2 setup was measured using the lowest possible axial stage displacement. A Z-position can thus be estimated for single molecules, as displayed in b.

      Addressing the seemingly correlated behavior of the X and Y drift:

      Further measurements show less correlation between drift in X and in Y. Simultaneous motion in X and Y seems to indicate that the stage or the sample is tilted. The collective movement in X and Y seems accentuated by bigger jumps, probably originating from vibrations (as shown more predominantly in the measurements on the laboratory bench compared to the optical table). Tension fluctuations inducing motion of the stage are possible but are highly unlikely to have induced the drift in the displayed measurements.

      Figure 3: Can the authors comment on the effect, or potential effect, that the incubator (humidity, condensation etc) may have on the system (e.g., camera, electronics etc)?

      When moving the microscope into the incubator, the first precaution is to check whether the electronics used are able to perform at 37 °C. Placing the microscope inside the incubator can then induce condensation of water droplets at the cold interfaces, potentially damaging the electronics or reducing imaging quality. This can be prevented by preheating the microscope, e.g. in an incubator without humidity, for a few hours before placing it within the functional incubator. The incubator should also be checked for air streams (used to distribute the CO2), and direct exposure of the setup to the air stream should be prevented. A layer of foam material (e.g. polyurethane) under the microscope helps to reduce possible effects of incubator vibrations on the microscope. The hydrophilic character of PLA and its reduced thermal stability make its usage within the incubator challenging. The temperature also inherently reduces the mechanical stability of 3D printed parts. Using a less hydrophilic and more thermally stable plastic, such as ABS, combined with a higher percentage of infill is the empirical solution to this challenge. Further options and designs to improve the usage of the microscope within the incubator are still in development.

      Figure 5: Can the authors perform single molecule experiments with an alternative tag such as Alexa647?

      The SPT experiments were performed with QDs to make use of their photostability and brightness. The dSTORM experiment suggests that imaging single AF647 molecules with sufficient SNR is possible. The usage of AF647 for SPT is possible but would reduce the accuracy of the localization and shorten the acquired track lengths, due to the blinking properties of AF647 when illuminated. The tracking experiment with the QDs thus was a proof of concept that SPT experiments are possible and that the diffusion coefficients published in the literature can be reproduced. The usage of alternative tags can be an interesting extension that users can explore for their applications.

      Figure 6: The authors demonstrate dSTORM of microtubules. It would enhance the paper to also demonstrate 3D imaging (e.g., via cylindrical lens).

      The usage of a cylindrical lens for 3D imaging has not yet been performed. The implementation would not be difficult, given the high modularity of the setup in general. The calibration of the PSF shape with astigmatism might however be challenging, as the vertical scanning of the Z-stage lacks reliability in its current build. Methods such as biplane imaging might also be difficult to implement, as the halved number of photons in each channel leads to losses in the accuracy of localization. As a future improvement of the setup, the option of providing 3D information with single molecule accuracy is definitely desirable and will be tried out. In the following figure, two concepts for introducing 3D imaging capabilities in the detection layer of the microscope are presented.

      Author response image 4.

      3D concept Figure: Two possible setup modifications to provide axial information when imaging single molecules. a. A cylindrical lens can be placed to induce an asymmetry between the PSF FWHM in x and in y. Every Z position can be identified by two distinct PSF FWHM values in X and Y. b. By splitting the beam in two and defocusing one path, every PSF will have a specific set of values for its FWHM on the two detectors.

      Imaging modalities section: Regarding the use of cling film to diffuse; can the authors comment on the continual use of this approach, including its degradation over time?

      The cling foil was only used as a diffuser for broadening the laser profile. A detailed analysis of the condition of the foil was not done, as no visible changes could be seen on the illumination pattern or the foil itself. The piece of cling foil is attached to a rotor. Detachment of the cling foil and vibrations originating from the rotor need to be minimized. By keeping the rotation speed to the necessary minimum and attaching the cling foil correctly to the rotor, a usable solution can be created. The low price of the cling foil makes it possible to exchange the foil on a regular basis and thus keep it in optimal condition.

      Author response image 5.

      Profile Figure: By moving a combination of pinhole and photometer to scan through the laser profile with a translational mount, the shape of the laser beam can be estimated. The cling foil plays the same role as a diffuser in other setups.

      Reviewer #2 (Recommendations for The Authors):

      lines

      20, add "," after parts

      110, rotating cling foil?

      112/116, "custom 3D printed" I thought they were injection molded, please finalize

      113, "puzzle pieces" rephrase and they are also barely visible

      119, not clear that the stage is a manual stage that was turned into a motorised one by adding belts

      123-126, detail for SI,

      132, replace Arduino-coded with Arduino-based

      143, add reference to Napari

      146, (black) cardboard seems to be a cheaper and quicker alternative

      153, dichroic

      151-155, reads more like a blog post than a paper (maybe add a section on trouble shooting)

      156, antibody?

      167/189, moderate, please be specific

      194, layer of foam material, specify

      221, add description/reference to GPI. What is that? why is it relevant?

      226: add one sentence description of MMS

      318, add "," after students

      332-334, as mentioned earlier, not clear, you bought a manual stage and connected belts, correct?

      376-377, might be difficult to understand for the layman

      391, what laser was used?

      Figure 1, poor contrast between components, components visible should be named as much as possible, maybe provide the base layer in a different shade. To me, the red and blue labels look like fluorophores.

      Figure 1. looks like d is the excitation layer and not e, please fix.

      Figure 2, caption a-c, figure 1-d!, btw, why is the drift so anti-correlated?

      Figure 6 (line 259) nanometer I guess, not micrometer

      We have now incorporated all the above-mentioned changes in the manuscript. Furthermore, we added the supplementary figures shown below.

      Author response image 6.

      Basic concept of the UC2 setup: Left: Cubes (green) are connected to one another via puzzle pieces (white). Middle: 3D printed mounts have been designed to adapt various optics (right) to the cube framework. The combined usage of cubes and purpose-designed mounts makes it possible to interface various optics for the assembly.

      Author response image 7.

      Building the UC2 widefield microscope: a. Photograph of the complete setup. b. All pieces necessary to build the setup. A list of the components can be found in the bill of materials. c. Bottom emission layer of the microscope before assembly. d. Emission layer after assembly. Connection between cubes is doubled by using a layer of puzzles on the top and the bottom of the emission layer. e. CAD schematic of the emission layer and the positioning of the optics. f. Middle excitation layer of the microscope before assembly. Beam magnifier and homogenizer have been left out for clarity. g. Excitation layer after assembly is also covered by a puzzle layer. h. CAD schematic of the excitation layer and the positioning of the optics. i. Z-stage photograph and corresponding CAD file. Motor of the stage is embedded within the bottom cube. j. A layer of empty cubes supports the microscope stage. k. At this stage of the assembly, the objective is screwed into the objective holder. l. Finally, the stage is wired to the electronics and can then be mounted on top of the microscope (see a.).

      Author response image 8.

      Measurements performed on the UC2 setup with lower-budget objectives. The imaged sample is HeLa cells, stably transfected to express CLC-GFP, then labelled with AF647 through immunostaining. The setup has been kept identical except for the objectives. Scale bars represent 30 µm.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this manuscript, the role of orexin receptors in dopamine neurons is studied. Considering the importance of both orexin and dopamine signalling in the brain, with critical roles in arousal and drug seeking, this study is important to understand the anatomical and functional interaction between these two neuromodulators. This work suggests that such interaction is direct and occurs at the level of SN and VTA, via the expression of OX1R-type orexin receptors by dopaminergic neurons.

      Strengths:

      The use of a transgenic line that lacks OX1R in dopamine-transporter-expressing neurons is a strong approach to dissecting the direct role of orexin in modulating dopamine signalling in the brain. The battery of behavioural assays to study this line provides a valuable source of information for researchers interested in the role of orexin-A in animal physiology.

      We thank the reviewer for summarizing the importance and significance of our study. 

      Weaknesses:

      The choice of methods to demonstrate the role of orexin in the activation of dopamine neurons is not justified and the quantification methods are not described with enough detail. The representation of results can be dramatically improved and the data can be statistically analysed with more appropriate methods.

      We have further improved our description of the methods in the revised reviewed preprint, and here in the response letter, we respond point-by-point to ‘Reviewer #1 (Recommendations For The Authors)’ below. 

      Reviewer #2 (Public Review):

      Summary:

      This manuscript examines the expression of orexin receptors in the midbrain - with a focus on dopamine neurons - and uses several fairly sophisticated manipulation techniques to explore the role of this peptide neurotransmitter in reward-related behaviors. Specifically, in situ hybridization is used to show that dopamine neurons predominantly express the orexin receptor 1 subtype and then go on to delete this receptor in dopamine neurons using a transgenic strategy. Ex vivo calcium imaging of midbrain neurons is used to show that in the absence of this receptor orexin is no longer able to excite dopamine neurons of the substantia nigra.

      The authors proceed to use this same model to study the effect of orexin receptor 1 deletion on a series of behavioral tests, namely, novelty-induced locomotion and exploration, anxiety-related behavior, preference for sweet solutions, cocaine-induced conditioned place preference, and energy metabolism. Of these, the most consistent effects are seen in the tests of novelty-induced locomotion and exploration in which the mice with orexin 1 receptor deletion are observed to show greater levels of exploration, relative to wild-type, when placed in a novel environment, an effect that is augmented after icv administration of orexin.

      In the final part of the paper, the authors use PET imaging to compare brain-wide activity patterns in the mutant mice compared to wildtype. They find differences in several areas both under control conditions (i.e., after injection of saline) as well as after injection of orexin. They focus on changes in the dorsal bed nucleus of stria terminalis (dBNST) and the lateral paragigantocellular nucleus (LPGi) and perform analysis of the dopaminergic projections to these areas. They provide anatomical evidence that these regions are innervated by dopamine fibers from the midbrain, are activated by orexin in control, but not mutant mice, and that dopamine receptors are present. Thus, they argue these anatomical data support the hypothesis that behavioral effects of orexin receptor 1 deletion in dopamine neurons are due to changes in dopamine signaling in these areas.

      Strengths:

      Understanding how orexin interacts with the dopamine system is an important question and this paper contains several novel findings along these lines. Specifically:

      (1) The distribution of orexin receptor subtypes in VTA and SN is explored thoroughly.

      (2) Use of the genetic model that knocks out a specific orexin receptor subtype from only dopamine neurons is a useful model and helps to narrow down the behavioral significance of this interaction.

      (3) PET studies showing how central administration of orexin evokes dopamine release across the brain is intriguing, especially since two key areas are pursued - BNST and LPGi - where the dopamine projection is not as well described/understood.

      We thank the reviewer for the careful summary and highlighting the novelty of our study.

      Weaknesses:

      The role of the orexin-dopamine interaction is not explored in enough detail. The manuscript presents several related findings, but the combination of anatomy and manipulation studies does not quite tell a cogent story. Ideally, one would like to see the authors focus on a specific behavioral parameter and show that one of their final target areas (dBNST or LPGi) was responsible or at least correlated with this behavioral readout. In addition, some more discussion on what the results tell us about orexin signaling to dopamine neurons under normal physiological conditions would be very useful. For example, what is the relevance of the orexin-dopamine interaction blunting novelty-induced locomotion under wildtype conditions?

      We agree that focusing on some orexin-dopamine targeting areas, such as dBNST or LPGi, is important to further reveal the anatomy-behavior links and underlying mechanisms. While we are very interested in further investigations, in the present manuscript we mainly aim to give an overview of the behavioral roles of orexin-dopamine interaction and to propose some promising downstream pathways in a relatively broad and systematic way.

      We have explained the physiological meaning of our results in more detail in the discussion in the revised reviewed preprint (lines 282-293, 318-332). Novelty-induced behavioral responses should be at proper levels under normal physiological conditions. The orexin-dopamine interaction blunting novelty-induced locomotion could be important for keeping attention on the main task without being distracted too much by other random stimuli in the environment. When this balance is disrupted, behavioral deficits such as attention deficit hyperactivity disorder (ADHD) may occur.

      In some places in the Results, insufficient explanation and reporting is provided. For example, when reporting the behavioral effects of the Ox1 deletion in two bottle preference, it is stated that "[mutant] mice showed significant changes..." without stating the direction in which preference was affected.

      For the reward-related behaviors described in this study, we did not find significant changes between [mutant] and control mice. We agree that describing the behavioral tests in more detail will be helpful for readers. In the revised reviewed preprint, we have described in more detail in the Results and Materials and Methods sections how the control and [mutant] mice respond to the reward (lines 162-165, 171-181, 526-528).

      The cocaine CPP results are difficult to interpret because it is unclear whether any of the control mice developed a CPP preference. Therefore, it is difficult to conclude that the knockout animals were unaffected by drug reward learning. Similarly, the sucrose/sucralose preference scores are also difficult to interpret because no test of preference vs. water is performed (although the data appear to show that there is a preference at least at higher concentrations, it has not been tested).

      We described the CPP analysis in the Materials and Methods section (lines 523-528) as below: ‘The percentage of time spent in the reward-paired compartment was calculated: 100 x time spent in the compartment / (total time - time spent in the middle area). The CPP score was then analyzed using the calculated percentage of time: 100 x (time on the test day – time on pre-test days)/ time on pre-test days. The pre-test and test days were before and after the conditioning, respectively. Thus, the CPP score above zero indicates that the CPP preference has developed.’ Figure 2—figure supplement 4 C and F show that most control and knockout mice had a CPP score above zero. The control and knockout groups both developed a preference and there was no significant difference between the groups.
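
      The two formulas quoted above can be written out as a short worked example. This is a sketch only: the function names and the session times are hypothetical and chosen for illustration, not taken from the data set.

```python
def percent_time_in_paired(time_paired_s, total_time_s, time_middle_s):
    """100 x time in the reward-paired compartment / (total time - time in the middle area)."""
    return 100.0 * time_paired_s / (total_time_s - time_middle_s)


def cpp_score(pct_test, pct_pretest):
    """100 x (test-day percentage - pre-test percentage) / pre-test percentage."""
    return 100.0 * (pct_test - pct_pretest) / pct_pretest


# Hypothetical 900 s sessions with 100 s spent in the middle area.
pct_pretest = percent_time_in_paired(360.0, 900.0, 100.0)  # 45 % before conditioning
pct_test = percent_time_in_paired(480.0, 900.0, 100.0)     # 60 % after conditioning
score = cpp_score(pct_test, pct_pretest)                   # positive: preference developed
```

      In this hypothetical case the score is about 33, i.e. above zero, which under the definition above indicates that a place preference has developed.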

      For the sucrose/sucralose preference tests, in Figure 2—figure supplement 4 A and D, we present values as the percentage of sucrose/sucralose consumption in the total daily drinking amount (sucrose/sucralose solution + water). Thus, percentages above 50% indicate that mice prefer sucrose/sucralose to water. As shown in the figure, male mice only showed a weak preference for 0.5% sucrose, compared to water, and under all other tested conditions, the mice showed a strong preference for the sweet solution. There was no significant difference between control and knockout mice.

      We have described this in more detail in the Results and Materials and Methods sections in the revised reviewed preprint.
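
      The preference measure described above reduces to a single ratio, sketched here with hypothetical intake volumes; the function name and the numbers are illustrative assumptions.

```python
def sweet_preference_percent(sweet_intake_ml, water_intake_ml):
    """Sweet-solution intake as a percentage of total daily fluid intake."""
    total = sweet_intake_ml + water_intake_ml
    if total <= 0:
        raise ValueError("total intake must be positive")
    return 100.0 * sweet_intake_ml / total


# Hypothetical daily intakes in ml.
strong = sweet_preference_percent(6.0, 2.0)   # well above 50 %: sweet solution preferred
neutral = sweet_preference_percent(3.0, 3.0)  # exactly 50 %: no preference
```

      Values above 50% correspond to a preference for the sweet solution over water, as stated above.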

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      (1) Figure 1, A-I. It is difficult to discern the anatomical subdivision of VTA in Figure 1, panels A and B. It is recommended to add a panel showing a schematic illustration of the SNc and subregions of VTA: PN, PIF, PBP, IF (providing more detail than in Figure 1, panel J). It is also recommended to show lower magnification images (as in Figure 1 - supplement 1), including both hemispheres, and to delineate the outline of the different subregions using curved lines, based on reference atlases (similar to Figure 1, panel I, please include distance from bregma). It would be helpful to indicate in Figure 1 that panel A is a control mouse and panel B is an Ox1RΔDAT mouse and include C-F letters to show corresponding insets. Anatomically, the parainterfascicular nucleus (PIF) is positioned between the paranigral nucleus (PN) and the parabrachial pigmented nucleus (PBP). The authors have depicted the PIF ventral to the PN in Figure 1 panels A, B, and I. These panels and the quantification of Ox1R/2R positive cells within the different subdivisions need to be corrected accordingly. The image analysis method used to quantify RNAscope fluorescent images is not described in sufficient detail. Please expand this section.

      According to the reviewer’s suggestions, we have refined Figure 1 in the revised reviewed preprint. We are now showing the schematic illustration of the SN and subregions of VTA in panel I, with blue squares to label the regions shown in panels A and B, and the distance from bregma is included. The outlines to delineate SN and the subregions of VTA are adjusted from straight to curved lines based on reference atlases. As suggested, we have also indicated that panel A is a control and panel B is an Ox1RΔDAT mouse and included C-F letters to show corresponding insets. We apologize for the mistake about labeling PIF and PN positions in Figure A. We have corrected the labeling of their positions and double checked the quantification accordingly. This does not change our discussion or conclusion since both PIF and PN are the medial part of VTA, where both Ox1R and Ox2R are observed. The description of the image analysis in the Materials and Methods section has been improved (lines 378-385). We decided not to show lower magnification images than in Figure 1—supplement 1 to include both hemispheres, in the interests of clarity and reader-friendliness.

      (2) Figure 1, J-L. The claim that orexin activates dopaminergic SN and VTA neurons is weakly supported by the data provided. Calcium imaging of SN dopaminergic neurons in control mice suggests a discrete effect of 100 nM orexin-A application compared to baseline. Application of 300 nM shows a slightly bigger effect, but none of these results are statistically analysed. 

      We are surprised by this comment and thank the reviewer for pointing out our apparent lack of clarity in the previous version (lines 96-106 and legend of Figure 1K, L). In the new version, we explain the data analysis in more detail (lines 119-133, 451-465, and the legends of Figure 1K, L and Figure 1—figure supplement 3).

      The main goal of this part of the project was to functionally validate the Ox1R knockout in dopaminergic (DAT-expressing) neurons. This was a prerequisite for the behavioral and PET imaging experiments. We used GCaMP-mediated Ca2+ imaging in acute brain slices to reach this goal. This analysis was performed on the dopaminergic SN neurons, which we used as an "indicator population" because a large number of these neurons express Ox1R, but only a few express Ox2R. 

      The analysis consisted of two parts:

      a) For each neuron, we tested whether it responded to orexin A. At the single cell level, a neuron was considered orexin A-responsive if the change in fluorescence induced by orexin A was three times larger than the standard deviation (3 σ criterion) of the baseline fluorescence, corresponding to a Z-score of 3. We found that 56% of the neurons tested responded to orexin A, while 44% of the neurons did not respond to orexin A (Figure 1L, top). These data agree with the number of Ox1R-expressing neurons (Figure 1J).

      b) We also determined the orexin A-induced GCaMP fluorescence for each neuron, expressed as a percentage of GCaMP fluorescence induced upon application of high K+ saline. Accordingly, the "population response" of all analyzed neurons was expressed as the mean ± SEM of these responses. The significance of this mean response was tested for each group (control and Ox1R KO) using a one-sample t-test. We found a marked and highly significant (p < 0.0001, n = 71) response of control neurons to 100 nM orexin A, while the Ox1R KO neurons did not respond (p = 0.5, n = 86). Note that, as described in a), 44% of the neurons contributing to the mean do not respond to orexin. Thus, the orexin responses of most responders are significantly higher than the mean. This is also evident in the example recordings in Figure 1K and Figure 1—figure supplement 3. The orexin A-induced change in fluorescence was increased by increasing the orexin A concentration to 300 nM.

      Note: As mentioned above, the orexin A response was expressed for each neuron individually as a percentage of its high K+ saline-induced GCaMP fluorescence. This value is a solid reference point, reflecting the GCaMP fluorescence at maximal voltage-activated Ca2+ influx. Obviously, the Ca2+ concentration at this point is extremely high and not typically reached under physiological conditions. Therefore, as shown in Figure 1—figure supplement 3 for completeness, the physiologically relevant responses may appear relatively minor at first glance when presented together in one figure (compare Figure 1—figure supplement 3 A and B).
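
      The two analysis steps a) and b) can be illustrated with a minimal sketch. It is not the analysis code used for the manuscript: the function names, the use of the mean fluorescence during orexin A application as the "change", the use of the sample standard deviation, and all numbers are assumptions made for illustration.

```python
import math
import statistics


def is_responder(baseline_trace, orexin_trace, z_threshold=3.0):
    """Step a): 3-sigma criterion on the orexin A-induced change in fluorescence."""
    base_mean = statistics.fmean(baseline_trace)
    base_sd = statistics.stdev(baseline_trace)  # sample SD (assumption)
    change = statistics.fmean(orexin_trace) - base_mean
    z_score = change / base_sd
    return z_score > z_threshold, z_score


def percent_of_high_k(orexin_change, high_k_change):
    """Step b): orexin A response as a percentage of the high K+ response."""
    return 100.0 * orexin_change / high_k_change


def one_sample_t(values, mu0=0.0):
    """Mean, SEM and t statistic of a one-sample t-test against mu0."""
    n = len(values)
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem, (mean - mu0) / sem


# Hypothetical fluorescence traces (arbitrary units).
baseline = [1.00, 1.02, 0.98, 1.01, 0.99]
responsive, z_resp = is_responder(baseline, [1.10, 1.12, 1.08])
silent, z_silent = is_responder(baseline, [1.01, 1.00, 1.02])

# Hypothetical per-neuron changes, each normalized to a high K+ change of 2.0.
normalized = [percent_of_high_k(change, 2.0) for change in (0.2, 0.0, 0.4, 0.2)]
mean, sem, t_stat = one_sample_t(normalized)
```

      The p-value follows from the t distribution with n - 1 degrees of freedom, for example via `scipy.stats.ttest_1samp`. Because non-responders enter the mean in step b), the population mean is lower than the typical response of an individual responder, which is the point made above.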

      The authors should provide more evidence of the orexin-induced activation of dopaminergic neurons in the SN to support this claim and investigate whether a similar activation is observed in VTA neurons. 

      Following the reviewer's suggestion, we confirmed orexin A-induced activation of dopaminergic neurons in the mouse SN by using perforated patch clamp recordings (Figure 1—figure supplement 2).

      This finding is consistent with previous extracellular in vivo recordings in rats (Liu et al., 2018).

      The activation of dopaminergic neurons in the mouse VTA by orexin A has been shown repeatedly in earlier studies (e.g., Baimel et al., 2017; Korotkova et al., 2003; Tung et al., 2016).

      In addition, Figure 3-Figure Supplement 2 shows that injection of orexin does not induce c-Fos expression in SN and VTA dopaminergic neurons of control and Ox1RΔDAT mice, which further weakens the claim made by the authors.

      Figure 3—Figure Supplement 2 in the original submission is now Figure 3—Figure Supplement 3 in the revised reviewed preprint. It shows low c-Fos expression in SN and VTA dopaminergic neurons, and orexin-induced c-Fos expression was observed in Th-negative cells in SN and VTA. 

      Technically relatively straightforward, Fos analysis is widely (and successfully) used in studies to reveal neuronal activation. However, this approach has limitations, e.g., regarding sensitivity and temporal resolution. Electrophysiological or optical imaging techniques can circumvent these shortcomings. The electrophysiological and Ca2+ imaging studies presented here, along with previous electrophysiological studies by others, clearly show that orexin A acutely and directly stimulates SN and VTA dopaminergic neurons.

      In vivo, the injection of orexin A induced a pronounced c-Fos activity in non-dopaminergic cells of the VTA and SN but not in dopaminergic neurons. This result shows that the detection of c-Fos has worked in principle. Whether the absent c-Fos staining in dopaminergic neurons is due to lack of sensitivity, whether other IEGs would have worked better here, or whether there are other, e.g., cell type-specific reasons for the absence of staining, cannot be determined from the current data.

      (3) Figure 2, I-L. The fact that ICV injection of both saline and orexin causes a sustained increase of locomotion (around 20 minutes in males, and over 30 minutes in females) is problematic and could mask the effects of orexin, particularly in females. It is unclear what panels J and L are showing. To be appropriately analysed, the authors should plot the pre- and post-injection AUC data for all groups and analyse it as a two-way mixed ANOVA, with the within-subjects factor "pre/post injection activity" and between-subjects factor "group". The authors can only warrant a statistically meaningful hyperlocomotor effect in Ox1RΔDAT mice if a significant interaction is found.

      Although the mice were habituated to the injection, some injection-induced increase in locomotion is still to be expected. As described in the figure legend, the AUC was calculated for the period after orexin injection, i.e., 5 – 90 min in Figure 2 I, K. We clearly observed significant differences between genotypes and between saline and orexin application, which means that the genotype and orexin effects are strong enough to be apparent despite the injection effect.

      As the reviewer suggests, we have now plotted the pre- and post-injection AUC data for all groups and analyzed them with a two-way mixed ANOVA, with the within-subjects factor "pre/post injection activity" and the between-subjects factor "group". To match the pre- and post-injection durations, we now compare the AUC for around 60 min before and after the injection. A significant interaction was found. Panels I-L have been updated, and the differences induced by Ox1R knockout and orexin confirm the results shown in the initially submitted manuscript.
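As a rough illustration of the matched pre/post comparison, the sketch below computes a trapezoidal AUC over equal windows before and after the injection. The function name, the sampling times, and the activity values are hypothetical, and the mixed ANOVA itself is not implemented here; the per-animal pre and post AUC values would be its input.

```python
def trapezoid_auc(times, values, start, end):
    """Trapezoidal area under (times, values) using samples with start <= t <= end."""
    pts = [(t, v) for t, v in zip(times, values) if start <= t <= end]
    return sum((t1 - t0) * (v0 + v1) / 2.0
               for (t0, v0), (t1, v1) in zip(pts, pts[1:]))

# Hypothetical activity trace: time in minutes relative to injection at t = 0.
times = [-60, -30, 0, 30, 60]
counts = [10, 10, 10, 20, 20]

pre_auc = trapezoid_auc(times, counts, -60, 0)   # 60 min before injection
post_auc = trapezoid_auc(times, counts, 0, 60)   # 60 min after injection
change = post_auc - pre_auc                      # per-animal pre-to-post change
```

The interaction term of the mixed ANOVA asks whether this pre-to-post change differs between groups, which is why equal window lengths matter.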

      (4) Figure 3. The literature has robustly shown that one of the main projection areas of VTA and SN dopaminergic neurons is the striatum, in particular its ventral part. It is surprising to see that this region is not affected by the lack of OX1R or by the injection of orexin. How can the authors explain that identified regions with significantly different activity include neighbouring brain structures with heterogenous composition? See for example, in panel A, section bregma 0.62mm, a significant region is seen expanding across the cortex, corpus callosum, and striatum. While the data from PET studies is potentially interesting, it may not be adequate to provide enough resolution to allow examination of the anatomical distribution of orexin-mediated neuronal activation.

      While the striatum is a major projection area of dopaminergic neurons in the VTA and SN, the projections and function of Ox1R-positive dopaminergic neurons are not clear. We have improved the description of the functional diversity of dopamine in the revised reviewed preprint (lines 46-58), and it has been reported that projection-defined dopaminergic populations in the VTA exhibit different responses to orexin A (Baimel et al., 2017). Moreover, striatal activity is modulated indirectly via other brain regions affected by Ox1R-positive dopaminergic neurons. It is unknown how striatal activity should change after Ox1R deletion in dopaminergic neurons. We cannot rule out the possibility that the striatum is indeed modulated by the Ox1R-positive dopaminergic neurons, though there was only a trend towards a genotype difference (Ox1RΔDAT vs. ctrl) in the ventral striatum in the section bregma 1.42 mm in Figure 3A. The ICV injection of orexin potentially acts on Ox1R and Ox2R in the whole brain, so projections from other brain regions to the striatum also affect striatal activity and could have masked the effect of Ox1R-positive dopaminergic neurons.

      The spatial resolution of the PET data is on the order of ~1 mm^3. As we also explained in the Materials and Methods section, the size of a voxel in the original PET data is 0.4 mm x 0.4 mm x 0.8 mm. All calculations were performed on this grid. The higher-resolved images shown in Figure 3 are for presentation purposes only, inspired by a request of the reviewer who asked us to show this in the Jais et al. 2016 manuscript. To make this clearer, we have now added the p-map images with the original voxel size to the supplement (Figure 3—figure supplement 1). Identifying specific anatomical sub-regions more precisely requires methods with higher spatial resolution, such as staining of brain slices for c-Fos-positive cells, as we do in Figure 4.

      PET is a powerful tool to identify global regions of activation/inhibition. In the results and discussion sections of the manuscript, we have described that activity was changed in brain regions with related functions. In panel A, Ox1RΔDAT mice showed increased activity in MPA, Pir and endopiriform claustrum, which are important for olfactory sensation; in spinal trigeminal nucleus, sp5, and IRt, which regulate mastication and sensation of the oral cavity and the surface of the face; and in SubCV and Gi, which regulate sleeping and motion-related arousal and motivation. In panel B, changes in HDB, MCPO, Pir, DEn, S1, V2L and V1 are related to sensation, and changes in BNST, LPGi and M2 are important for emotion, exploration, and action selection.

      (5) Figure 4. As in Figure 1, the authors should consider including a schematic illustration of the brain areas that are being analysed using a reference atlas. It is also recommended to provide more details describing the quantification of the images. Without such information, the data is not convincing, in particular, the claim that Ox1R depletion causes a decrease in DRD1 in BNST is unclear. Additional unbiased quantitative approaches could be used to strengthen this point.

      We have added Figure 4—figure supplement 1 as a schematic illustration of the brain areas that were being analyzed using a reference atlas. More details describing the unbiased quantification of the images have been added to Materials and Methods. We have added Figure 4—figure supplement 3, to show DRD1, DRD2 and the merged signal separately.  

      (6) The discussion starts by stating that the main findings of this study are based on RNAscope and optophysiological experiments, however, the latter are not presented anywhere in the manuscript. This sentence (line 192) should be revised. The authors state in line 193 that OX1R is the only orexin receptor in the SN, but they show in Figure 1 that in the SN, 3% of neurons express OX2R and 2% co-express both receptors. 

      We thank the reviewer for the input. We have rephrased the beginning of the discussion to clarify the objectives (lines 238-246). In doing so, we changed "optophysiological experiments" and "single orexin receptor" (lines 192 and 193 in the original manuscript) to "Ca2+ imaging experiments" and "main subtype of orexin receptors", respectively. In this context, it should be noted that Ca2+ imaging is considered an optophysiological method - optophysiology generally refers to techniques that combine optical methods with physiological measurements.

      The results of LPGi and BNST dopamine receptors in control and Ox1RΔDAT mice are poorly discussed. The authors should justify why these two regions were selected for further validation and how these may be related to the behavioural effects found in Ox1RΔDAT regarding exposure to a novel context.

      Ox1RΔDAT mice exhibited increased novelty- and orexin-induced locomotion compared to control mice. After orexin injection, PET imaging shows that the neural activity of BNST and LPGi was lower or higher than in control mice, respectively. We selected BNST and LPGi for further validation because we think their key functional roles in regulating emotion, exploratory behaviors and locomotor speed are related to novelty-induced locomotion. We confirmed the changes in neural activity by c-Fos staining and investigated the expression patterns of dopamine receptors in BNST and LPGi. Our findings suggested that Ox1R deletion in dopaminergic neurons results in the disinhibition of neural activity in LPGi via dopaminergic pathways and the decrease of dopamine-mediated neural activity in BNST. Emotion perception affects the decision of how to respond to novelty. It is possible that novelty activates the orexin system and that Ox1R signaling in dopaminergic neurons promotes emotion perception and inhibits exploration. Of course, further careful investigation is necessary to test this hypothesis in future experiments. We have improved the description of the rationale and the discussion in the ‘Results’ and ‘Discussion’ sections in the revised reviewed preprint (lines 210-213, 259-270, 293-308).

      Reviewer #2 (Recommendations For The Authors):

      A major recommendation - if possible - would be to directly show that one or both of the two target areas - dBNST and LPGi - are associated with the behavioral effects caused by the deletion of the orexin receptor 1 in dopamine neurons.

      We completely agree that it would be very valuable to directly show dBNST and LPGi are associated with the behavioral effects caused by the deletion of Ox1R in dopaminergic neurons. While we are very interested in carefully investigating specific orexin-dopamine targeting areas and related neural circuits in the future, in the present manuscript, we mainly aim to give an overview of the behavioral roles of orexin-dopamine interaction and propose some promising downstream pathways. 

      The authors should state if data are corrected for multiple comparisons, e.g., in the PET study of different regions.

      We have included information about the post-hoc tests for all 2-way ANOVA analyses in the submitted manuscript. For the PET study, the p-values in the p-maps were not corrected for multiple comparisons; Figure 3—figure supplement 2 shows the raw data of each mouse and the analysis method (t-test). In the revised reviewed preprint, we include the information on the analysis method in the legend of Figure 3.

      We consider that saline and orexin injections mimic the resting and active state of mice, respectively, and would like to study the genotype effect under each condition. A 2-way ANOVA takes into account the difference between orexin and saline injection, which could mask the genotype effect under a given condition. Therefore, we decided to perform t-tests for each condition in Figure 3. While we provide readers with full information in Figure 3—figure supplement 2 with the raw data of each individual mouse, below we present the p-maps after correction for multiple comparisons (Sidak’s post hoc test). After correction, we see changes in similar brain regions as in Figure 3, though the significance values are reduced by the correction, and under the orexin-injection condition we no longer see significantly higher activity around the lateral paragigantocellular nucleus (LPGi), nucleus of the horizontal limb of the diagonal band (HDB) and magnocellular preoptic nucleus (MCPO) in Ox1RΔDAT mice. To identify the anatomical locations more precisely, we performed additional experiments to confirm the changes revealed by PET. For example, LPGi is a relatively small region that was confirmed and identified more precisely by c-Fos immunostaining (Figure 4A, C).
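To illustrate why fewer regions remain significant after correction, the sketch below applies the standard Šidák adjustment, p_adj = 1 - (1 - p)^m for m comparisons, to a few raw p-values. The numbers are hypothetical, the function name is ours, and the software used for the response image may implement the post hoc test differently (for example, on top of an ANOVA error term).

```python
def sidak_adjust(p_values):
    """Šidák-adjusted p-values: 1 - (1 - p)^m, with m = number of comparisons."""
    m = len(p_values)
    return [1.0 - (1.0 - p) ** m for p in p_values]

# Hypothetical raw p-values from three comparisons.
raw = [0.01, 0.04, 0.20]
adjusted = sidak_adjust(raw)

# Comparisons that stay below 0.05 after adjustment.
still_significant = [p for p in adjusted if p < 0.05]
```

Every adjusted value is at least as large as its raw counterpart, so a region that was only marginally significant in the uncorrected map can drop out after correction.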

      Author response image 1.

      PET imaging studies comparing Ox1RΔDAT and control mice, with post-hoc t-test to correct for multiple comparisons. 3D maps of p-values in PET imaging studies comparing Ox1RΔDAT and control mice, after intracerebroventricular (ICV) injection of (A) saline (NS) and (B) orexin A. Control-NS, n = 8; control-orexin, n = 6; Ox1RΔDAT, n = 8. M2, secondary motor cortex; MPA, medial preoptic area; Pir, piriform cortex; IEn, intermediate endopiriform claustrum; DEn, dorsal endopiriform claustrum; VEn, ventral endopiriform claustrum; LSS, lateral stripe of the striatum; BNST, the dorsal bed nucleus of the stria terminalis; S1Sh, primary somatosensory cortex, shoulder region; S1HL, primary somatosensory cortex, hindlimb region; S1BF, primary somatosensory cortex, barrel field; S1Tr, primary somatosensory cortex, trunk region; V1, primary visual cortex; V2L, secondary visual cortex, lateral area; SubCV, subcoeruleus nucleus, ventral part; Gi, gigantocellular reticular nucleus; IRt, intermediate reticular nucleus; sp5, spinal trigeminal tract.

      Provide a rationale for following up on BNST and LPGi and not any of the regions identified in the PET study.

      We thank the reviewer for the careful reading and important input. Ox1RΔDAT mice exhibited increased novelty- and orexin-induced locomotion compared to control mice. After orexin injection, PET imaging shows that the neural activity of BNST and LPGi was lower or higher than in control mice, respectively.

      We selected BNST and LPGi for further validation because we think their key functional roles in regulating emotion, exploratory behaviors and locomotor speed are related to novelty-induced locomotion. We confirmed the neural activity change by c-Fos staining and investigated the expression patterns of dopamine receptors in BNST and LPGi. Our findings suggested that Ox1R deletion in dopaminergic neurons results in the disinhibition of neural activity in LPGi via dopaminergic pathways and the decrease of dopamine-mediated neural activity in BNST. Emotion perception affects the decision of how to respond to novelty. It is possible that novelty activates the orexin system and that Ox1R signaling in dopaminergic neurons promotes emotion perception and inhibits exploration. Of course, further investigation is necessary to test this hypothesis in the future. We have improved the description of the rationale and the discussion in the ‘Results’ and ‘Discussion’ sections in the revised reviewed preprint (lines 210-213, 259-270, 293-308).

      Heatmap in Fig. 1K should not have smoothing across the y-axis, individual cells should be discrete.

      We thank the reviewer for bringing this issue to our attention. The data had not been intentionally smoothed (neither across the x-axis nor the y-axis), but it was probably a formatting issue. We have corrected this and separated individual cell traces with lines (Figure 1K, Figure 1—figure supplement 3).

      Dopamine cells are well known to lack Fos expression in most cases. Did the authors consider using another IEG to show neural activation, e.g., pERK?

      We did not use another IEG. The electrophysiological and Ca2+ imaging studies presented here, along with previous electrophysiological studies by others, clearly show that orexin A acutely and directly stimulates SN and VTA dopaminergic neurons. Please see also the response to a related comment of Reviewer 1.

      Consider adding a lower magnification section to anatomical figures to aid the reader in orienting and identifying the location.

      We have added the schematic illustration of SN, VTA, BNST and LPGi in Figure 1I and Figure 4—figure supplement 1. We hope this helps the reader in orienting and identifying the location.

      Data availability should be stated.

      There are no restrictions on data availability. We have added this section to the revised reviewed preprint.

      Line 50. Some more references both historical and recent could be given to support this statement about the function of dopamine.

      We have improved the description and references to support the statement about dopamine function, citing recent studies and some reviews in the revised reviewed preprint (lines 46-58).

      The PET data (Fig. 3) might be easier to visualize and interpret if a white background was used. In addition, is there a more refined way of presenting the data in Fig 3, S1?

      It is common to present imaging data such as PET and MRI on a black background. We also have already applied this color scheme in multiple publications and would therefore prefer to stick to this color scheme. 

      While Figure 3 is the concise way to present PET data, we aim to show the original individual results of mice in Figure 3—figure supplement 2 and to demonstrate how we performed the statistical analysis. Therefore, we take an example voxel of the respective brain area, perform the t-test, and present the data as bars with individual dots. 

      Line 97. State what type of Ca imaging here, e.g., "we performed Ca imaging in ex vivo slices of VTA and SN".

      As the reviewer suggested, we have specified the type of Ca2+ imaging (line 112).

      Line 165. State which groups this post-mortem analysis was performed on and if any differences were to be found (not expected to find differences in this anatomical tracing experiment but good to report this as both groups were used).

      Postmortem analysis of c-Fos staining revealed low c-Fos expression in dopaminergic neurons in the VTA and SN of Ox1RΔDAT and control mice after ICV injection of saline or orexin A (1 nmol). No obvious changes were observed among the groups. We have improved the description in the revised reviewed preprint (lines 202-208).

      Line 192. What do you mean by optophysiological here? The Ca imaging (which is a fairly small, confirmatory element of the manuscript).

      We have changed ‘optophysiological experiments’ (line 192 in the initially submitted manuscript) to ‘calcium imaging experiments’ and rephrased the beginning of the discussion to clarify the objectives (lines 238-246).

      The protein level in the diet is substantially higher than in most rodent diets (34% here vs 14-20% in most commercial rodent chows). Please comment on this.

      This diet is for rat and mouse maintenance, purchased from ssniff Spezialdiäten GmbH (product V1554).

      The percentage of calories supplied by protein depends on the calculation method. The company previously calculated this value with a pig equation, which gave 34% in the old data sheet. In the new data sheet, the value has been updated to 23%, calculated with Atwater factors. We thank the reviewer for pointing this out and have updated the values in the revised reviewed preprint (lines 314-316).
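For readers unfamiliar with the Atwater approach, the sketch below shows how the share of energy from protein follows from the general Atwater factors (4 kcal/g for protein, 9 kcal/g for fat, 4 kcal/g for carbohydrate). The composition used here is hypothetical and is not the composition of the diet named above; the function name is likewise only illustrative.

```python
# General Atwater factors, kcal per gram of macronutrient.
ATWATER_KCAL_PER_G = {"protein": 4.0, "fat": 9.0, "carbohydrate": 4.0}

def energy_share(grams_per_100g):
    """Fraction of total metabolizable energy contributed by each macronutrient."""
    kcal = {k: g * ATWATER_KCAL_PER_G[k] for k, g in grams_per_100g.items()}
    total = sum(kcal.values())
    return {k: v / total for k, v in kcal.items()}

# Hypothetical composition in g per 100 g of diet (illustration only).
diet = {"protein": 20.0, "fat": 5.0, "carbohydrate": 50.0}
shares = energy_share(diet)
protein_percent = 100.0 * shares["protein"]
```

A different energy equation assigns different energy values to the same nutrients, which changes the denominator and therefore the reported protein percentage even though the diet itself is unchanged.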

      Editor's note:

      Should you choose to revise your manuscript, please include full statistical reporting including exact p-values wherever possible alongside the summary statistics (test statistic and df) and 95% confidence intervals. These should be reported for all key questions and not only when the p-value is less than 0.05.

      We have provided the source data and the statistical reporting for each figure with the revision.

      References

      Baimel, C., Lau, B. K., Qiao, M., & Borgland, S. L. (2017). Projection-target-defined effects of orexin and dynorphin on VTA dopamine neurons. Cell Rep, 18(6), 1346-1355.  https://doi.org/10.1016/j.celrep.2017.01.030

      Korotkova, T. M., Eriksson, K. S., Haas, H. L., & Brown, R. E. (2002). Selective excitation of GABAergic neurons in the substantia nigra of the rat by orexin/hypocretin in vitro. Regul Pept, 104(1-3), 83-89. https://doi.org/10.1016/s0167-0115(01)00323-8 

      Korotkova, T. M., Sergeeva, O. A., Eriksson, K. S., Haas, H. L., & Brown, R. E. (2003). Excitation of ventral tegmental area dopaminergic and nondopaminergic neurons by orexins/hypocretins. J Neurosci, 23(1), 7-11. https://www.ncbi.nlm.nih.gov/pubmed/12514194

      Liu, C., Xue, Y., Liu, M. F., Wang, Y., Liu, Z. R., Diao, H. L., & Chen, L. (2018). Orexins increase the firing activity of nigral dopaminergic neurons and participate in motor control in rats. J Neurochem, 147(3), 380-394. https://doi.org/10.1111/jnc.14568 

      Tung, L. W., Lu, G. L., Lee, Y. H., Yu, L., Lee, H. J., Leishman, E., Bradshaw, H., Hwang, L. L., Hung, M. S., Mackie, K., Zimmer, A., & Chiou, L. C. (2016). Orexins contribute to restraint stress-induced cocaine relapse by endocannabinoid-mediated disinhibition of dopaminergic neurons. Nat Commun, 7, 12199. https://doi.org/10.1038/ncomms12199

    1. Author response:

      The following is the authors’ response to the original reviews.

      We would like to thank all of the reviewers for their helpful comments and for the effort they made in reading and evaluating our manuscript. In response, we have made major changes to the text and figures and performed substantial new experiments. These new data and changes have substantially strengthened the manuscript. We believe that the manuscript is now very strong in both its impact and scope, and we hope that the reviewers will find it suitable for publication in eLife.

      A point-by-point response to the reviewers' specific comments is provided below.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      In this report, Yu et al ascribe potential tumor suppressive functions to the non-core regions of RAG1/2 recombinases. Using a well-established BCR-ABL oncogene-driven system, the authors model the development of B cell acute lymphoblastic leukemia in mice and found that RAG mutants lacking non-core regions show accelerated leukemogenesis. They further report that the loss of non-core regions of RAG1/2 increases genomic instability, possibly caused by increased off-target recombination of aberrant RAG-induced breaks. The authors conclude that the non-core regions of RAG1 in particular not only increase the fidelity of VDJ recombination, but may also influence the recombination "range" of off-target joints, and that in the absence of the non-core regions, mutant RAG1/2 (termed cRAGs) catalyze high levels of off-target recombination leading to the development of aggressive leukemia.

      Strengths:

      The authors used a genetically defined oncogene-driven model to study the effect of RAG non-core regions on leukemogenesis. The animal studies were well performed and generally included a good number of mice. Therefore, the finding that cRAG expression led to the development of more aggressive BCR-ABL+ leukemia compared to fRAG is solid.

      Weaknesses:

      In general, I find the mechanistic explanation offered by the authors to explain how the non-core regions of RAG1/2 suppress leukemogenesis to be less convincing. My main concern is that cRAG1 and cRAG2 are overexpressed relative to fRAG1/2. This raises the possibility that the observed increased aggressiveness of cRAG tumors compared to fRAG tumors could be solely due to cRAG1/2 overexpression, rather than any intrinsic differences in the activity of cRAG1/2 vs fRAG1/2; and indeed, the authors allude to this possibility in Fig S8, where it was shown that elevated expression of RAG (i.e. fRAG) correlated with decreased survival in pediatric ALL. Although it doesn't mean the authors' assertions are incorrect, this potential caveat should nevertheless be discussed.

      We appreciate the valuable suggestions from the reviewer. BCR-ABL1+ B-ALL is characterized by halted early B-lineage differentiation. In BCR-ABL1+ B cells, RAG recombinases are highly expressed, leading to the inactivation of genes that encode essential transcription factors for B-lineage differentiation. This results in cells being trapped within the precursor compartment, thereby elevating RAG gene expression. Our interpretation of the data suggests that, in BCR-ABL1+ B-ALL mouse models, the high expression of both cRAG and fRAG and the deletion of the non-core regions influence the precision of RAG targeting within the genome. This causes more genomic damage in cRAG tumors than in fRAG tumors, consequently leading to the observed increased aggressiveness of cRAG tumors compared to fRAG tumors. We discussed the issues on Page 12, lines 295-307 in the revised manuscript.

      Some of the conclusions drawn were not supported by the data.

      (1) I'm not sure that the authors can conclude based on μHC expression that there is a loss of pre-BCR checkpoint in cRAG tumors. In fact, Fig. 2B showed that the differences are not statistically significant overall, and more importantly, μHC expression should be detectable in small pre-B cells (CD43-). This is also corroborated by the authors' analysis of VDJ rearrangements, showing that it has occurred at the H chain locus in cRAG cells.

      We appreciate the insightful comment from the reviewer. Upon reevaluation of the data presented in Fig. 2B, we identified and rectified certain errors. The revised analysis now shows that the differences in μHC expression are statistically significant. This significant expression of μHC in fRAG leukemic cells implies that these cells may progress further in differentiation, potentially acquiring an immune phenotype. These modifications have been incorporated into the manuscript on page 7, lines 153-156 in the revised manuscript.

      (2) The authors found a high degree of polyclonal VDJ rearrangements in fRAG tumor cells but a much more limited oligoclonal VDJ repertoire in cRAG tumors. They concluded that this explains why cRAG tumors are more aggressive because BCR-ABL induced leukemia requires secondary oncogenic hits, resulting in the outgrowth of a few dominant clones (Page 19, lines 381-398). I'm not sure this is necessarily a causal relationship since we don't know if the oligoclonality of cRAG tumors is due to selection based on oncogenic potential or if it may actually reflect a more restricted usage of different VDJ gene segments during rearrangement.

      Thank you for your insightful comments and questions regarding the relationship between the oligoclonality of V(D)J rearrangements and the aggressiveness of cRAG tumors. You raise an important point regarding whether the observed oligoclonality is a result of selective pressure favoring clones with specific oncogenic potential, or if it reflects inherent limitations in V(D)J segment usage during rearrangement in cRAG models. In our study, we observed a marked difference in the V(D)J rearrangement patterns between fRAG and cRAG tumor cells, with cRAG tumors exhibiting a more limited, oligoclonal repertoire. This observation led us to speculate that the aggressive nature of cRAG tumors might be linked to a selective advantage conferred by specific V(D)J rearrangements that cooperate with the BCR-ABL1 oncogene to drive leukemogenesis. However, we acknowledge that our current data do not definitively establish a causal relationship between oligoclonality and tumor aggressiveness. The restricted V(D)J repertoire in cRAG tumors could indeed be due to a more constrained rearrangement process, possibly influenced by the altered expression or function of RAG1/2 in the absence of non-core regions. This could limit the diversity of V(D)J rearrangements, leading to the emergence of a few dominant clones not necessarily because they have greater oncogenic potential, but because of a narrowed field of rearrangement possibilities.

      To address this question more thoroughly, future studies could examine the functional consequences of specific V(D)J rearrangements found in dominant cRAG tumor clones. This could include assessing the oncogenic potential of these rearrangements in isolation and in cooperation with BCR-ABL1, as well as exploring the mechanistic basis for the restricted V(D)J repertoire. Such studies would provide deeper insight into the interplay between RAG-mediated recombination, clonal selection, and leukemogenesis in BCR-ABL1+ B-ALL.

      We appreciate your feedback on this matter and agree that further investigation is required to unravel the precise relationship between V(D)J rearrangement diversity and leukemic progression in cRAG models. We have revised our discussion to reflect these considerations and to clarify the speculative nature of our conclusions regarding the link between oligoclonality and tumor aggressiveness. We added more discussion on this issue on Page 7, lines 166-170 in the revised manuscript.

      (3) What constitutes a cancer gene can be highly context- and tissue-dependent. Given that there is no additional information on how any putative cancer gene was disrupted (e.g., truncation of regulatory or coding regions), it is not possible to infer whether increased off-target cRAG activity really directly contributed to the increased aggressiveness of leukemia.

      We fully agree with the issue you raised. In Supplementary Table 3, we have presented data on off-target gene disruptions, specifically in introns, exons, downstream regions, promoters, 3' UTRs, and 5' UTRs. However, this dataset alone does not suffice to determine conclusively whether the increased off-target activity of cRAG directly contributes to the heightened aggressiveness of leukemia. To bridge this knowledge gap, our future research will include both knockout and overexpression experiments targeting these off-target genes.

      (4) Fig. 6A, it seems that it is really the first four nucleotide (CACA) that determines fRAG binding and the first three (CAC) that determine cRAG binding, as opposed to five for fRAG and four for cRAG, as the author wrote (page 24, lines 493-497).

      We thank the reviewer for the insightful comment. In response, we have revised the text to accurately reflect the nucleotide sequences responsible for RAG binding and cleavage. Specifically, we now clarify that the first four nucleotides (CACA) are crucial for fRAG binding and cleavage, while the initial three nucleotides (CAC) are essential for cRAG binding and cleavage. These updates have been made on page 10, lines 242-245 of the revised manuscript.

      (5) Fig S3B, I don't really see why "significant variations in NHEJ" would necessarily equate "aberrant expression of DNA repair pathways in cRAG leukemic cells". This is purely speculative. Since it has been reported previously that alt-EJ/MMEJ can join off target RAG breaks, do the authors detect high levels of microhomology usage at break points in cRAG tumors?

      We appreciate the reviewer's comment. Currently, we have not observed microhomology usage at breakpoints in cRAG tumors. We plan to address this aspect in a future, more detailed study. Regarding the 'aberrant expression of DNA repair pathways in cRAG leukemic cells', we acknowledge that this is speculative. Therefore, we have carefully rephrased this to 'suggesting a potential aberrant expression of DNA repair pathways in cRAG leukemic cells.' This modification is reflected on page 12, lines 290-291 of the revised manuscript.

      (6) Fig. S7, CDKN2B inhibits CDK4/6 activation by cyclin D, but I don't think it has been shown to regulate CDK6 mRNA expression. The increase in CDK6 mRNA likely just reflects a more proliferative tumor but may have nothing to do with CDKN2B deletion in cRAG1 tumors.

      We fully concur with the reviewer's comment. We have deleted this inappropriate part from the text.

      Insufficient details in some figures. For instance, Fig. 1A, please include statistics in the plot showing a comparison of fRAG vs cRAG1, fRAG vs cRAG2, cRAG1 vs cRAG2. As of now, there's a single p-value (0.0425) stated in the main text and the legend but why is there only one p-value when fRAG is compared to cRAG1 or cRAG2? Similarly, the authors wrote "median survival days 11-26, 10-16, 11-21 days, P < 0.0023-0.0299, Fig. S2B." However, it is difficult for me to figure out what are the numbers referring to. For instance, is 11-26 referring to median survival of fRAG inoculated with three different concentrations of GFP+ leukemic cells or is 11-26 referring to median survival of fRAG, cRAG1, cRAG2 inoculated with 10^5 cells? It would be much clearer if the authors can provide the numbers for each pair-wise comparison, if not in the main text, then at least in the figure legend. In Fig. 5A-B, do the plots depict SVs in cRAG tumors or both cRAG and fRAG cells? Also in Fig. 5, why did 24 SVs give rise to 42 breakpoints, and not 48? Doesn't it take 2 breaks to accomplish rearrangement? In Fig. 6B-C, it is not clear how the recombination sizes were calculated. In the examples shown in Fig. 4, only cRAG1 tumors show intra-chromosomal joins (chr 12), while fRAG and cRAG2 tumors show exclusively inter-chromosomal joins.

      We appreciate the reviewer's feedback and have made the following revisions:

      (1) The text has been adjusted to rectify the previously mentioned error in the figure legends (page 1, lines 5-6).

      (2) We have clarified the intended message in the revised text (page 6, lines 129-130) and the figure legend (pages 4-5, lines 107-113) for greater precision.

      (3) Figure 5A-B now presents an overview of all structural variants (SVs) identified in both cRAG and fRAG cells, offering a comprehensive comparison.

      (4) Among the analyzed SVs, 24 generated a total of 48 breakpoints, with 41 occurring within gene bodies and the remaining 7 in adjacent flanking sequences. This informs our exon-intron distribution profile analysis.

      (5) We have defined recombination sizes as ‘the DNA fragment size spanning the two breakpoints’ for clarity (page 10, lines 251-252).

      (6) All off-target recombinations identified in the genome-wide analyses of fRAG, cRAG1, and cRAG2 leukemic cells were determined to be intra-chromosomal joins, highlighting their specific nature within the genomic context.
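
      The breakpoint arithmetic in point (4) and the definition of recombination size in point (5) can be made concrete with a minimal sketch. The `StructuralVariant` class, its field names, and the coordinates below are hypothetical and serve only as an illustration; they are not taken from the manuscript's analysis pipeline. The sketch assumes an intra-chromosomal join, so that both breakpoints lie on the same chromosome.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StructuralVariant:
    """Hypothetical record for one intra-chromosomal rearrangement."""
    chrom: str
    start: int  # coordinate of the first breakpoint (bp)
    end: int    # coordinate of the second breakpoint (bp)


def breakpoint_count(svs):
    """Each rearrangement requires two breaks, so n SVs give 2n breakpoints."""
    return 2 * len(svs)


def recombination_size(sv):
    """DNA fragment size spanning the two breakpoints."""
    return abs(sv.end - sv.start)


# Illustrative coordinates only; not data from the study.
example = StructuralVariant(chrom="chr12", start=1_000_000, end=1_250_000)
```

      Under these assumptions, 24 SVs yield 48 breakpoints, which matches the 41 gene-body and 7 flanking-sequence breakpoints reported above.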

      Insufficient details on certain reagents/methods. For instance, are the cRAG1/2 mice of the same genetic background as fRAG mice (C57BL/6 WT)? On Page 23, line 481, what is a cancer gene? How are they defined? In Fig. 3C, are the FACS plots gated on intact cells? Since apoptotic cells show high levels of gH2AX, I'm surprised that the fraction of gH2AX+ cells is so much lower in fRAG tumors compared to cRAG tumors. The in vitro VDJ assay shown in Fig 3B is not described in the Method section (although it is described in Fig S5b). Fig. 5A-B, do the plots depict SVs in cRAG tumors or both cRAG and fRAG cells?

      We are grateful for the reviewer's feedback and have incorporated their insights as follows:

      (1) We clarify that both cRAG1/2 and fRAG mice share the same genetic background, specifically the C57BL/6 WT strain, ensuring consistency across experimental models.

      (2) We define a 'cancer gene' as one harboring somatic mutations implicated in cancer. To support our analysis, we refer to the Catalogue Of Somatic Mutations In Cancer (COSMIC) at http://cancer.sanger.ac.uk/cosmic. COSMIC serves as the most extensive repository for understanding the role of somatic mutations in human cancers.

      (3) Upon thorough review of the raw data for γ-H2AX and the fluorescence-activated cell sorting (FACS) plots gated on intact cells, we propose that the observed discrepancies might stem from the limited sensitivity of the γ-H2AX flow cytometry detection method. This insight prompts our commitment to employing more efficient detection methodologies in forthcoming studies.

      (4) Detailed procedures for the in vitro V(D)J recombination assay have been included in the Methods section (page 15, lines 384-388) to enhance the manuscript's comprehensiveness and reproducibility.

      (5) The presented plots offer a comprehensive overview of structural variants (SVs) identified in both cRAG and fRAG cells, providing a holistic view of the genomic landscape across different models.

      Reviewer #3 (Public Review):

      Summary:

      In the manuscript, the authors summarized and introduced the correlation between the non-core regions of RAG1 and RAG2 in BCR-ABL1+ acute B lymphoblastic leukemia and off-target recombination, which has certain innovative and clinical significance.

      Recommendations For The Authors:

      Reviewer #1 (Recommendations For The Authors):

      I would suggest that the authors tone down some of their conclusions, which are not necessarily supported by their own data. In addition, there are some minor mistakes in figure assembly/presentation. For instance, I believe that the axis labels in Fig. 1E were flipped. BrdU should be on the y-axis and 7-AAD on the x-axis. Fig. 3B, the y-axis contains a typo, it should be "CD90.1..." and not "D90.1...". In Fig. 5C, the numbers seem to be flipped, with 93% corresponding to cRAG1 and 100% to cRAG2 (compare with the description on page 23, lines 474-475). Fig. 5C, y-axis, "hybrid" is a typo. Page 3, line 59: The abbreviation of RSS has already been described earlier (p4, line 53).

      We thank the reviewer for these suggestions. We carefully checked the raw data and corrected these mistakes in the revised manuscript.

      Page 3, line 63: "signal" segment (commonly referred to as signal ends), not "signaling" segment.

      We have changed “signaling segment” to “signal ends” in the revised manuscript (page 3, lines 54-55).

      Page 3, lines 64-65: VDJ recombination promotes the development of both B and T cells, and aberrant recombination can cause both B and T cell lymphomas.

      The statement about the role of V(D)J recombination in B and T cell development and its link to lymphomagenesis is grounded in a substantial body of research. Theoretical frameworks and empirical studies delineate how aberrations in the recombination process can lead to genomic instability, potentially triggering oncogenic events. This connection is extensively documented in immunology and oncology literature, illustrating the critical balance between necessary genetic rearrangements for immune diversity and the risk of malignancy when these processes are dysregulated (Thomson et al., 2020; Mendes et al., 2014; Onozawa and Aplan, 2012).

      Page 4, line 72: "recombinant dispensability" is not a commonly used phrase. Do the authors mean to say that the non-core regions of RAG1/2 are not strictly required for VDJ recombination?

      We thank the reviewers for their insightful suggestion. We have revised the sentence to read, 'Although the non-core regions of RAG1/2 are not essential for V(D)J recombination, the evolutionary conservation of these regions suggests their potential significance in vivo, possibly affecting RAG activity and expression in both quantitative and qualitative manners.' This revision appears on page 3, lines 61-62, in the revised manuscript.

      Fig. 4. It would have been nice to show at least one more cRAG1 tumor Circos plot.

      We appreciate the reviewer's comment and concur with the suggestion. In future sequencing experiments, we will consider including additional replicates. However, due to time and financial constraints, the current sequencing effort was limited to a maximum of three replicates.

      Reviewer #3 (Recommendations For The Authors):

      In the manuscript, the authors summarized and introduced the correlation between the non-core regions of RAG1 and RAG2 in BCR-ABL1+ acute B lymphoblastic leukemia and off-target recombination, which has certain innovative and clinical significance. The following issues need to be addressed by the authors.

      (1) Authors should check and review extensively for improvements to the use of English.

      We thank the reviewer for their comment. With assistance from a native English speaker, we have carefully revised the manuscript to enhance its readability.

      (2) Authors should revise the conclusion so that the above can be clearly reviewed and summarized.

      The conclusion has been partially revised in the revised manuscript.

      (3) The article should state that the experiment was independently repeated three times.

      The experiment was repeated three times under the same conditions, and this information has been described in the Statistics section on page 19, lines 473-475 of the revised manuscript.

      (4) The article will be more convincing if it uses references in the last 5 years.

      We are grateful to the reviewer for their guidance in enhancing our manuscript. We have incorporated additional references from the past five years in the revised version.

      (5) Additional experiments are suggested to elucidate the molecular mechanisms related to off-target recombination.

      We thank the reviewer for this suggestion. In future experiments, we plan to perform ChIP-seq analysis to investigate the relationship between chromatin accessibility and off-target effects, as well as to examine the impact of knocking out and overexpressing off-target genes on cancer development and progression.

      (6) It is suggested to further analyze the effect of the absence of non-core RAG region on the differentiation and development of peripheral B cells in mice by flow analysis and expression of B1 and B2.

      Thank you very much for highlighting this crucial issue. FACS analysis revealed that leukemia cells among the peripheral B cells of mice did not express CD5. The data are presented below:

      Author response image 1.

      (7) Fig3A should have three biological replicates and the molecular weight should be labeled on the right side of the strip.

      Thank you for this suggestion. The experiment was independently repeated three times, and the molecular weights have been labeled on the right side of the bands in the revised version.

      References:

      Mendes RD, Sarmento LM, Canté-Barrett K, Zuurbier L, Buijs-Gladdines JG, Póvoa V, Smits WK, Abecasis M, Yunes JA, Sonneveld E, Horstmann MA, Pieters R, Barata JT, Meijerink JP. 2014. PTEN microdeletions in T-cell acute lymphoblastic leukemia are caused by illegitimate RAG-mediated recombination events. Blood 124:567-578. doi:10.1182/blood-2014-03-562751

      Onozawa M, Aplan PD. 2012. Illegitimate V(D)J recombination involving nonantigen receptor loci in lymphoid malignancy. Genes Chromosomes Cancer 51:525-535. doi:10.1002/gcc.21942

      Thomson DW, Shahrin NH, Wang P, Wadham C, Shanmuganathan N, Scott HS, Dinger ME, Hughes TP, Schreiber AW, Branford S. 2020. Aberrant RAG-mediated recombination contributes to multiple structural rearrangements in lymphoid blast crisis of chronic myeloid leukemia. Leukemia 34:2051-2063. doi:10.1038/s41375-020-0751-y

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      This work investigated the role of CXXC-finger protein 1 (CXXC1) in regulatory T cells. CXXC1-bound genomic regions largely overlap with Foxp3-bound regions and regions with H3K4me3 histone modifications in Treg cells. CXXC1 and Foxp3 interact with each other, as shown by co-immunoprecipitation. Mice with Treg-specific CXXC1 knockout (KO) succumb to lymphoproliferative diseases between 3 and 4 weeks of age, similar to Foxp3 KO mice. Although the immune suppression function of CXXC1 KO Treg is comparable to WT Treg in an in vitro assay, these KO Tregs failed to suppress autoimmune diseases such as EAE and colitis in Treg transfer models in vivo. This is partly due to the diminished survival of the KO Tregs after transfer. CXXC1 KO Tregs do not have an altered DNA methylation pattern; instead, they display weakened H3K4me3 modifications within the broad H3K4me3 domains, which contain a set of Treg signature genes. These results suggest that CXXC1 and Foxp3 collaborate to regulate Treg homeostasis and function by promoting Treg signature gene expression through maintaining H3K4me3 modification.

      Strengths:

      Epigenetic regulation of Treg cells has been a constantly evolving area of research. The current study revealed CXXC1 as a previously unidentified epigenetic regulator of Tregs. The strong phenotype of the knockout mouse supports the critical role CXXC1 plays in Treg cells. Mechanistically, the link between CXXC1 and the maintenance of broad H3K4me3 domains is also a novel finding.

      Weaknesses:

      (1) It is not clear why the authors chose to compare H3K4me3 and H3K27me3 enriched genomic regions. There are other histone modifications associated with transcription activation or repression. Please provide justification.

      Thank you for highlighting this important point. We chose to focus on H3K4me3 and H3K27me3 enriched genomic regions because these histone modifications are well-characterized markers of transcriptional activation and repression, respectively. H3K4me3 is predominantly associated with active promoters, while H3K27me3 marks repressed chromatin states, particularly in the context of gene regulation at promoters. This duality provides a robust framework for investigating the balance between transcriptional activation and repression in Treg cells. While histone acetylation, such as H3K27ac, is linked to enhancer activity and transcriptional elongation, our focus was on promoter-level regulation, where H3K4me3 and H3K27me3 are most relevant. Although other histone modifications could provide additional insights, we chose to focus on these two to maintain clarity and feasibility in our analysis. We have revised the text accordingly; please refer to Page 18, lines 353-356.

      (2) It is not clear what separates Clusters 1 and 3 in Figure 1C. It seems they share the same features.

      We apologize for not describing these clusters clearly. Clusters 1 and 3 both belong to the H3K4me3-only group; H3K4me3 enrichment and gene expression levels are higher in Cluster 1. We initially intended to classify the promoters into four categories: H3K4me3 only, H3K27me3 only, H3K4me3-H3K27me3 co-occupied, and None. In practice, however, we could not resolve a distinct H3K4me3-H3K27me3 co-occupied group. Instead, the classification yielded two H3K4me3-only categories, with Cluster 1 showing higher H3K4me3 enrichment and gene expression levels than Cluster 3.
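
      The four intended promoter categories can be illustrated with a minimal sketch. The response does not state how the clusters were derived, so the threshold rule, the function name `classify_promoter`, and the cutoff values below are assumptions chosen only to make the categories concrete; this is not the method used in the manuscript.

```python
def classify_promoter(h3k4me3, h3k27me3, k4_cutoff=1.0, k27_cutoff=1.0):
    """Assign a promoter to one of the four intended categories.

    The inputs are enrichment scores for each mark; the cutoffs are
    hypothetical and would need to be chosen from the data.
    """
    has_k4 = h3k4me3 >= k4_cutoff
    has_k27 = h3k27me3 >= k27_cutoff
    if has_k4 and has_k27:
        return "H3K4me3-H3K27me3 co-occupied"
    if has_k4:
        return "H3K4me3 only"
    if has_k27:
        return "H3K27me3 only"
    return "None"
```

      With a rule of this kind, two promoters that differ only in the strength of their H3K4me3 signal fall into the same category, which is consistent with Clusters 1 and 3 sharing the H3K4me3-only label while differing in enrichment level.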

      (3) The claim, "These observations support the hypothesis that FOXP3 primarily functions as an activator by promoting H3K4me3 deposition in Treg cells." (line 344), seems to be a bit of an overstatement. Foxp3 certainly can promote transcription in ways other than promoting H3K4me3 deposition, and it also can repress gene transcription without affecting H3K27me3 deposition. Therefore, it is not justified to claim that promoting H3K4me3 deposition is Foxp3's primary function.

      Thank you for your insightful feedback. We agree that the statement in line 344 may have overstated the role of FOXP3 in promoting H3K4me3 deposition as its primary function. As you pointed out, FOXP3 is indeed a multifaceted transcription factor that regulates gene expression through various mechanisms. It can promote transcription independent of H3K4me3 deposition, as well as repress transcription without directly influencing H3K27me3 levels.

      To more accurately reflect the broader regulatory functions of FOXP3, we have revised the manuscript. The updated text (Page 19, lines 385-388) now reads:

      "These findings collectively support the conclusion that FOXP3 contributes to transcriptional activation in Treg cells by promoting H3K4me3 deposition at target loci, while also regulating gene expression directly or indirectly through other epigenetic modifications."

      (4) For the in vitro suppression assay in Figure S4C, and the Treg transfer EAE and colitis experiments in Figure 4, the Tregs should be isolated from Cxxc1 fl/fl x Foxp3 cre/wt female heterozygous mice instead of Cxxc1 fl/fl x Foxp3 cre/cre (or cre/Y) mice. Tregs from the homozygous KO mice are already activated by the lymphoproliferative environment and could have vastly different gene expression patterns and homeostatic features compared to resting Tregs. Therefore, it's not a fair comparison between these activated KO Tregs and resting WT Tregs.

      Thank you for raising this insightful point regarding the potential activation status of Treg cells in homozygous knockout mice. To address this concern, we performed additional experiments using Treg cells isolated from Foxp3<sup>Cre/+</sup>Cxxc1<sup>fl/fl</sup> (hereafter referred to as “het-KO”) female mice and their littermate controls, Foxp3<sup>Cre/+</sup>Cxxc1<sup>fl/+</sup> (referred to as “het-WT”) mice.

      The results of these new experiments are now included in the manuscript (Page 25, lines 507–509, Figure 6E and Figure S6A-E):

      (1) In the in vitro suppression assay, Treg cells from het-KO mice exhibited reduced suppressive function compared to het-WT Treg cells. This finding underscores the intrinsic defect in the suppressive capacity of Treg cells attributable to the loss of one Cxxc1 allele.

      (2) In the experimental autoimmune encephalomyelitis (EAE) model, Treg cells isolated from het-KO mice also demonstrated impaired suppressive function.

      (5) The manuscript didn't provide a potential mechanism for how CXXC1 strengthens broad H3K4me3-modified genomic regions. The authors should perform Foxp3 ChIP-seq or CUT&Tag with WT and Cxxc1 cKO Tregs to determine whether CXXC1 deletion changes Foxp3's binding pattern in Treg cells.

      Thank you for raising this important point. To address your suggestion, we performed CUT&Tag experiments and found that Cxxc1 deletion does not alter FOXP3 binding patterns in Treg cells. Most FOXP3-bound regions in WT Treg cells were similarly enriched in KO Treg cells, indicating that Cxxc1 deficiency does not impair FOXP3’s DNA-binding ability. These results have been added to the revised manuscript (Page 28, lines 567-575, Figure S8A-B) and are further discussed in the Discussion (Pages 28-29, lines 581-587).

      Reviewer #2 (Public review):

      FOXP3 has been known to form diverse complexes with different transcription factors and enzymes responsible for epigenetic modifications, but how extracellular signals timely regulate FOXP3 complex dynamics remains to be fully understood. Histone H3K4 tri-methylation (H3K4me3) and CXXC finger protein 1 (CXXC1), which is required to regulate H3K4me3, also remain to be fully investigated in Treg cells. Here, Meng et al. performed a comprehensive analysis of H3K4me3 CUT&Tag assay on Treg cells and a comparison of the dataset with the FOXP3 ChIP-seq dataset revealed that FOXP3 could facilitate the regulation of target genes by promoting H3K4me3 deposition.

      Moreover, CXXC1-FOXP3 interaction is required for this regulation. They found that specific knockdown of Cxxc1 in Treg leads to spontaneous severe multi-organ inflammation in mice and that Cxxc1-deficient Treg exhibits enhanced activation and impaired suppression activity. In addition, they have also found that CXXC1 shares several binding sites with FOXP3 especially on Treg signature gene loci, which are necessary for maintaining homeostasis and identity of Treg cells.

      The findings of the current study are pretty intriguing, and it would be great if the authors could fully address the following comments to support these interesting findings.

      Major points:

      (1) There is insufficient evidence in the first part of the Results to support the conclusion that "FOXP3 functions as an activator by promoting H3K4Me3 deposition in Treg cells". The authors should compare the results for H3K4Me3 in FOXP3-negative conventional T cells to demonstrate that at these promoter loci, FOXP3 promotes H3K4Me3 deposition.

      Thank you for this insightful comment. We have already performed additional experiments comparing H3K4Me3 levels between FOXP3-positive Treg cells and FOXP3-negative conventional T cells (Tconv). Please refer to Page 18, lines 361-368, and Figure 1C and Figure S1C for the results. Our results show that H3K4Me3 abundance is higher at many Treg-specific gene loci in Treg cells compared to Tconv cells. This supports our conclusion that FOXP3 promotes H3K4Me3 deposition at these loci.

      (2) In Figure 3 F&G, the activation status and IFNγ production should be analyzed in Treg cells and Tconv cells separately rather than in total CD4+ T cells. Moreover, are there changes in autoantibodies and IgG and IgE levels in the serum of cKO mice?

      Thank you for your valuable suggestions. In response to your comment, we reanalyzed the data in Figures 3F and 3G to assess the activation status and IFN-γ production in Tconv cells. The updated analysis revealed that Cxxc1 deletion in Treg cells leads to increased activation and IFN-γ production in Tconv cells. Additionally, we corrected the analysis of IL-17A and IL-4 expression, which were upregulated in Tconv cells. These updated results are now included in the revised manuscript (Page 21, lines 429-431, Figure 3I and Figure S3E-F).

      Additionally, we examined autoantibodies and immunoglobulin levels in the serum of Cxxc1 cKO mice. Our data show a significant increase in serum IgG levels, accompanied by elevated IgG autoantibodies, indicating heightened autoimmune responses. In contrast, serum IgE levels remained largely unchanged. The results are detailed in the revised manuscript (Page 21, lines 421-423, Figure 3E and Figure S3B).

      (3) Why did Cxxc1-deficient Treg cells not show impaired suppression relative to WT Treg cells in the in vitro suppression assay, despite the reduced expression of Treg cell suppression-associated markers at the transcriptional level demonstrated in both scRNA-seq and bulk RNA-seq?

      Thank you for your thoughtful comment. The absence of impaired suppression in Cxxc1-deficient Treg cells from homozygous knockout (KO) mice during the in vitro suppression assay, despite the reduced expression of Treg-associated markers at the transcriptional level (as demonstrated by scRNA-seq), can likely be explained by the activated state of these Treg cells. In homozygous KO mice, Treg cells are already activated due to the lymphoproliferative environment, resulting in gene expression patterns that differ from those of resting Treg cells. This pre-activation may obscure the effect of Cxxc1 deletion on their suppressive function in vitro.

      To address this limitation, we used heterozygous Foxp3<sup>Cre/+</sup>Cxxc1<sup>fl/fl</sup> (het-KO) female mice, along with their littermate controls, Foxp3<sup>Cre/+</sup>Cxxc1<sup>fl/+</sup> (het-WT) mice. In these heterozygous mice, we observed an impairment in Treg cell suppressive function in vitro, which was accompanied by the downregulation of several key Treg-associated genes, as confirmed by RNA-Seq analysis.

      These updated findings, based on the use of het-KO mice, are now incorporated into the revised manuscript (Page 25, lines 507–509, Figure 6E).

      (4) Is there a disease in which Cxxc1 is expressed at low levels or absent in Treg cells? Is the same immunodeficiency phenotype present in patients as in mice?

      This is indeed a very meaningful and intriguing question, and we are equally interested in understanding whether low or absent Cxxc1 expression in Treg cells is associated with any human diseases. However, despite an extensive review of the literature and available data, we found no reports linking Cxxc1 deficiency in Treg cells to immunodeficiency phenotypes in patients comparable to those observed in mice.

      Reviewer #3 (Public review):

      In the report entitled "CXXC-finger protein 1 associates with FOXP3 to stabilize homeostasis and suppressive functions of regulatory T cells", the authors demonstrated that Cxxc1-deletion in Treg cells leads to the development of severe inflammatory disease with impaired suppressive function. Mechanistically, CXXC1 interacts with Foxp3 and regulates the expression of key Treg signature genes by modulating H3K4me3 deposition. Their findings are interesting and significant. However, there are several concerns regarding their analysis and conclusions.

      Major concerns:

      (1) Despite cKO mice showing an increase in Treg cells in the lymph nodes and Cxxc1-deficient Treg cells having normal suppressive function, the majority of cKO mice died within a month. What causes cKO mice to die from severe inflammation?

      Considering the results of Figures 4 and 5, a decrease in the Treg cell population due to their reduced proliferative capacity may be one of the causes. It would be informative to analyze the population of tissue Treg cells.

      Thank you for your insightful observation regarding the mortality of cKO mice despite increased Treg cells in lymph nodes and the normal suppressive function of Cxxc1-deficient Treg cells.

      As suggested, we hypothesized that the reduction of tissue-resident Treg cells could be a key factor. Additional experiments revealed a significant decrease in Treg cell populations in the small intestine lamina propria (LPL), liver, and lung of cKO mice. These findings highlight the critical role of tissue-resident Treg cells in preventing systemic inflammation.

      This reduction aligns with Figures 4 and 5, which demonstrate impaired proliferation and survival of Cxxc1-deficient Treg cells. Together, these defects lead to insufficient Treg populations in peripheral tissues, escalating localized inflammation into systemic immune dysregulation and early mortality.

      These additional results have been incorporated into the revised manuscript (Page 21, lines 424-427, Figure 3G and Figure S3C).

      (2) In Figure 5B, scRNA-seq analysis indicated that the Mki67+ Treg subset is comparable between WT and Cxxc1-deficient Treg cells. On the other hand, FACS analysis demonstrated that Cxxc1-deficient Treg shows less Ki-67 expression compared to WT in Figure 5I. The authors should explain this discrepancy.

      Thank you for pointing out the apparent discrepancy between the scRNA-seq and FACS analyses regarding Ki-67 expression in Cxxc1-deficient Treg cells.

      In Figure 5B, the scRNA-seq analysis identified the Mki67+ Treg subset as comparable between WT and Cxxc1-deficient Treg cells. This finding reflects the overall proportion of cells expressing Mki67 transcripts within the Treg population. In contrast, the FACS analysis in Figure 5I specifically measures Ki-67 protein levels, revealing reduced expression in Cxxc1-deficient Treg cells compared to WT.

      To resolve this discrepancy, we performed additional analyses of the scRNA-seq data to directly compare the expression levels of Mki67 mRNA between WT and Cxxc1-deficient Treg cells. The results revealed a consistent reduction in Mki67 transcript levels in Cxxc1-deficient Treg cells, aligning with the reduced Ki-67 protein levels observed by FACS.

      These new analyses have been included in the revised manuscript (Author response image 1) to clarify this point and demonstrate consistency between the scRNA-seq and FACS data.

      Author response image 1.

      Violin plots displaying the expression levels of Mki67 in T<sub>reg</sub> cells from Foxp3<sup>cre</sup> and Foxp3<sup>cre</sup>Cxxc1<sup>fl/fl</sup> mice.
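
      The distinction drawn above, between the proportion of cells in the Mki67+ subset and the level of Mki67 expression per cell, can be illustrated with a minimal sketch. The expression values, the positivity threshold, and the function names are hypothetical and are not derived from the scRNA-seq data; the sketch only shows that the two summaries can diverge.

```python
from statistics import mean


def fraction_positive(expression, threshold=0.0):
    """Fraction of cells whose expression exceeds the threshold."""
    return sum(1 for value in expression if value > threshold) / len(expression)


def mean_expression(expression):
    """Average expression level across all cells."""
    return mean(expression)


# Made-up per-cell values for illustration only.
wt_cells = [0.0, 0.0, 2.0, 3.0]
ko_cells = [0.0, 0.0, 1.0, 1.5]
```

      In this toy example both groups contain the same fraction of positive cells, yet the average expression level is lower in the second group, which is the pattern that a subset-proportion comparison alone would not reveal.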

      In addition, the authors concluded on line 441 that CXXC1 plays a crucial role in maintaining Treg cell stability. However, there appears to be no data on Treg stability. Which data represent the Treg stability?

      Thank you for your valuable comment. We agree that our wording in line 441 may have been too conclusive. Our data focus on the impact of Cxxc1 deficiency on Treg cell homeostasis and transcriptional regulation, rather than directly measuring Treg cell stability. Specifically, the downregulation of Treg-specific suppressive genes and upregulation of pro-inflammatory markers suggest a shift in Treg cell function, which points to disrupted homeostasis rather than stability.

      We have revised the manuscript to clarify that CXXC1 plays a crucial role in maintaining Treg cell function and homeostasis, rather than stability (Page 24, lines 489-491).

      (3) The authors found that Cxxc1-deficient Treg cells exhibit weaker H3K4me3 signals compared to WT in Figure 7. This result suggests that Cxxc1 regulates H3K4me3 modification via H3K4 methyltransferases in Treg cells. The authors should clarify which H3K4 methyltransferases contribute to the modulation of H3K4me3 deposition by Cxxc1 in Treg cells.

      We appreciate the reviewer’s insightful comment regarding the role of H3K4 methyltransferases in regulating H3K4me3 deposition by CXXC1 in Treg cells.

      CXXC1 has been reported to function as a non-catalytic component of the Set1/COMPASS complex, which includes the H3K4 methyltransferases SETD1A and SETD1B, key enzymes responsible for H3K4 trimethylation (1-4). Based on these findings, we propose that CXXC1 modulates H3K4me3 levels in Treg cells by interacting with and stabilizing the activity of the Set1/COMPASS complex.

      These revisions are further discussed in the Discussion (Page 30-31, lines 624-632).

      Furthermore, it would be important to investigate whether Cxxc1-deletion alters Foxp3 binding to target genes.

      Thank you for raising this important point. To address your suggestion, we performed CUT&Tag experiments and found that Cxxc1 deletion does not alter FOXP3 binding patterns in Treg cells. Most FOXP3-bound regions in WT Treg cells were similarly enriched in KO Treg cells, indicating that Cxxc1 deficiency does not impair FOXP3’s DNA-binding ability. These results have been added to the revised manuscript (Page 28, lines 567-575, Figure S8A-B) and are further discussed in the Discussion (Pages 28-29, lines 581-587).

      (4) In Figure 7, the authors concluded that CXXC1 promotes Treg cell homeostasis and function by preserving the H3K4me3 modification since Cxxc1-deficient Treg cells show lower H3K4me3 densities at the key Treg signature genes. Are these Cxxc1-deficient Treg cells derived from mosaic mice? If Cxxc1-deficient Treg cells are derived from cKO mice, the gene expression and H3K4me3 modification status are inconsistent because scRNA-seq analysis indicated that expression of these Treg signature genes was increased in Cxxc1-deficient Treg cells compared to WT (Figure 5F and G).

      Thank you for your insightful comment. To clarify, the Cxxc1-deficient Treg cells analyzed for H3K4me3 modifications in Figure 7 were derived from Cxxc1 conditional knockout (cKO) mice, not mosaic mice.

      Regarding the apparent inconsistency between reduced H3K4me3 levels and the increased expression of Treg signature genes observed in scRNA-seq analysis (Figure 5F and G), we believe this discrepancy can be attributed to distinct mechanisms regulating gene expression. H3K4me3 is an epigenetic mark that facilitates chromatin accessibility and transcriptional regulation, reflecting upstream chromatin dynamics. However, gene expression levels are influenced by a combination of factors, including transcriptional activators, downstream compensatory mechanisms, and the inflammatory environment in cKO mice.

      The upregulation of Treg signature genes in scRNA-seq data likely reflects an activated or pro-inflammatory state of Cxxc1-deficient Treg cells in response to systemic inflammation, as previously described in the manuscript. This contrasts with the intrinsic reduction in H3K4me3 levels at these loci, indicating a loss of epigenetic regulation by CXXC1.

      To further support this interpretation, RNA-seq analysis of Treg cells from Foxp3<sup>Cre/+</sup> Cxxc1<sup>fl/fl</sup> (“het-KO”) and their littermate Foxp3<sup>Cre/+</sup> Cxxc1<sup>fl/+</sup> (“het-WT”) female mice (Figure S6C) revealed a significant reduction in key Treg signature genes such as Icos, Ctla4, Tnfrsf18, and Nt5e in het-KO Treg cells. These results align with the diminished H3K4me3 modifications observed in cKO Treg cells, further underscoring the role of CXXC1 as an epigenetic regulator.

      In summary, while the gene expression changes observed in scRNA-seq may reflect adaptive responses to inflammation, the reduced H3K4me3 modifications directly highlight the critical role of CXXC1 in maintaining the epigenetic landscape essential for Treg cell homeostasis and function.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      In Figure 7E, the y-axis scale for H3K4me3 peaks at the Ctla4 locus should be consistent between WT and cKO samples.

      We thank the reviewer for pointing out the inconsistency in the y-axis scale for the H3K4me3 peaks at the Ctla4 locus in Figure 7E. We have carefully revised the figure to ensure that the y-axis scale is now consistent between the WT and cKO samples.

      We appreciate the reviewer’s attention to this detail, as it enhances the rigor of the data presentation. Please find the updated Figure 7E in the revised manuscript.

      Reviewer #2 (Recommendations for the authors):

      In lines 455 and 466, the name of Treg signature markers validated by flow cytometry should be written as protein name and capitalized.

      Thank you for pointing this out. We have carefully reviewed lines 455 and 466 and have revised the text to ensure that the Treg signature markers validated by flow cytometry are referred to using their protein names, with proper capitalization.

      Reviewer #3 (Recommendations for the authors):

      (1) On line 431, "Cxxc1-deficient cells" should be "Cxxc1-deficient Treg cells".

      We thank the reviewer for highlighting this oversight. On line 431, we have revised "Cxxc1-deficient cells" to "Cxxc1-deficient Treg cells" to provide a more accurate and specific description. We appreciate the reviewer's attention to detail, as this correction improves the precision of our manuscript.

      (2) In Figure 4H, negative values should be removed from the y-axis.

      Thank you for your observation. We have revised Figure 4H to remove the negative values from the y-axis, as requested. This adjustment ensures a more accurate and meaningful representation of the data.

      (3) It is better to provide the lists of overlapping genes in Figure 7C.

      Thank you for your suggestion. We agree that providing the lists of overlapping genes in Figure 7C would enhance the clarity and reproducibility of the results. We have now included the gene lists as supplementary information (Supplementary Table 3) accompanying Figure 7C.

      (1) Lee, J. H. & Skalnik, D. G. CpG-binding protein (CXXC finger protein 1) is a component of the mammalian set1 histone H3-Lys4 methyltransferase complex, the analogue of the yeast Set1/COMPASS complex. Journal of Biological Chemistry 280, 41725-41731, doi:10.1074/jbc.M508312200 (2005).

      (2) Thomson, J. P., Skene, P. J., Selfridge, J., Clouaire, T., Guy, J., Webb, S., Kerr, A. R. W., Deaton, A., Andrews, R., James, K. D., Turner, D. J., Illingworth, R. & Bird, A. CpG islands influence chromatin structure via the CpG-binding protein Cfp1. Nature 464, 1082-1086, doi:10.1038/nature08924 (2010).

      (3) Shilatifard, A. in Annual Review of Biochemistry, Vol. 81 (ed. R. D. Kornberg) 65-95 (2012).

      (4) Brown, D. A., Di Cerbo, V., Feldmann, A., Ahn, J., Ito, S., Blackledge, N. P., Nakayama, M., McClellan, M., Dimitrova, E., Turberfield, A. H., Long, H. K., King, H. W., Kriaucionis, S., Schermelleh, L., Kutateladze, T. G., Koseki, H. & Klose, R. J. The SET1 Complex Selects Actively Transcribed Target Genes via Multivalent Interaction with CpG Island Chromatin. Cell Reports 20, 2313-2327, doi:10.1016/j.celrep.2017.08.030 (2017).

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Zhang et al. describe a delicate relationship between Tet2 and FBP1 in the regulation of hepatic gluconeogenesis.

      Strengths:

      The studies are very mechanistic, indicating that this interaction occurs via demethylation of HNF4a. Phosphorylation of HNF4a at ser 313 induced by metformin also controls the interaction between Tet2 and FBP1.

      We are grateful for the reviewer's praise of the manuscript.

      Weaknesses:

      The results are briefly described, and oftentimes, the necessary information is not provided to interpret the data. Similarly, the methods section is not well developed to inform the reader about how these experiments were performed. While the findings are interesting, the results section needs to be better developed to increase confidence in the interpretation of the results.

      Thank you for pointing out these shortcomings. We apologize for not describing some experimental methods and results in sufficient detail. Following the reviewer’s suggestion, we added details to the Methods section, including the generation of whole-body Tet2 KO mice and liver-specific Tet2 knockdown mice (AAV8-shTet2); the missing information on reagents, antibodies, primer sequences and mutant generation; and the methods for chromatin immunoprecipitation (ChIP) and immunofluorescence. We also expanded the interpretation of the results in line with the reviewer’s comments.

      Reviewer #2 (Public review):

      Summary:

      This study reveals a novel role of TET2 in regulating gluconeogenesis. It shows that fasting and a high-fat diet increase TET2 expression in mice, and TET2 knockout reduces glucose production. The findings highlight that TET2 positively regulates FBP1, a key enzyme in gluconeogenesis, by interacting with HNF4α to demethylate the FBP1 promoter in response to glucagon. Additionally, metformin reduces FBP1 expression by preventing TET2-HNF4α interaction. This identifies an HNF4α-TET2-FBP1 axis as a potential target for T2D treatment.

      Strengths:

      The authors use several methods in vivo (PTT, GTT, and ITT in fasted and HFD mice; and KO mice) and in vitro (in HepG2 and primary hepatocytes) to support the existence of the HNF4alpha-TET-2-FBP-1 axis in the control of gluconeogenesis. These findings uncovered a previously unknown function of TET2 in gluconeogenesis.

      We are grateful for the reviewer's praise of the manuscript.

      Weaknesses:

      Although the authors provide evidence of an HNF4α-TET2-FBP1 axis in the control of gluconeogenesis, which contributes to the therapeutic effect of metformin on T2D, its role in the pathogenesis of T2D is less clear. The mechanisms by which TET2 is up-regulated by glucagon should be more explored.

      Thank you for pointing out these shortcomings. We agree with the reviewer that the manuscript focuses on the function of the HNF4α-TET2-FBP1 axis in the control of gluconeogenesis rather than on its role in the pathogenesis of T2D. Following the reviewer’s suggestion, we changed the title of the manuscript to “HNF4α-TET2-FBP1 axis contributes to gluconeogenesis and type 2 diabetes”. To explore how glucagon up-regulates TET2, we examined TET2 mRNA levels at different time points after a single dose of glucagon in HepG2 cells. Interestingly, TET2 mRNA levels increased significantly, by 6-fold, at 30 min, and the effect of glucagon on TET2 mRNA levels persisted for more than 48 hours (Fig. 3E).

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      The authors indicate that they have overexpressed TET2 in HepG2 cells and primary mouse hepatocytes. The degree of overexpression should be shown. Is this similar to an increase in TET2 with fasting or HFD treatment?

      Thank you for the helpful comment. Following the reviewer’s suggestion, we examined the protein levels of overexpressed TET2 in HepG2 cells and primary mouse hepatocytes. The degree of TET2 overexpression (Fig. 3J) is similar to the increase in TET2 under fasting or HFD treatment (Fig. 1C, D).

      In Figures 2E-2G, the authors report results in Tet2-KO mice. Information on how these mice were generated is lacking. There is limited information about how Tet2-KO cells were generated, but again, I could not find anything about these mice in the methods section or figure legend. Is this whole-body or liver-specific Tet2-KO? How old were the mice at the time of PTT, GTT, or ITT?

      Were these mice on chow or HFD? Are there any differences in body weight between WT and Tet2-KO mice?

      Thank you for the helpful comment. Following the reviewer’s suggestion, we added detailed information about the Tet2-KO mice, including how they were generated, to the Methods section. In addition, the Tet2-KO mice used in each figure are now clearly described in the corresponding figure legend. Two mouse models were employed in this study: whole-body Tet2-KO mice and liver-specific Tet2 knockdown mice (AAV8-shTet2). The mice used for PTT, GTT and ITT were 8 weeks old and on HFD. To address the reviewer’s concern, we compared the body weights of WT and Tet2-KO mice and found no significant differences at 8 and 10 weeks of age on a normal chow diet, as shown in Figure 2I.

      Figures 3A-C show that 48 hours after glucagon treatment, Tet2 and FBP1 mRNA increased. It's surprising that a single dose of glucagon would have effects that last that long. The peak rise in glucose following glucagon treatment occurs in 30 minutes. How do the authors explain such a long effect of glucagon on Tet2 mRNA and protein?

      Thank you for the constructive comment. To address the reviewer’s concern, we examined the mRNA levels of TET2 and FBP1 at different time points following a single dose of glucagon in HepG2 cells. Interestingly, TET2 mRNA levels increased significantly, by 6-fold, at 30 min, and the effect of glucagon on TET2 mRNA levels persisted for more than 48 hours (Fig. 3E). The mechanism underlying this prolonged effect of glucagon on TET2 mRNA and protein requires further exploration.

      It's interesting that in Figure 3F, Fbp1 and Tet2 mRNA expression correlated positively in both ad libitum and fasting conditions. I would expect that during fed conditions, gluconeogenesis would not be activated and thus would expect no correlation.

      Thank you for the constructive comment. According to the results in the new Fig. 3H, the mRNA levels of Fbp1 and Tet2 are indeed positively correlated under both ad libitum and fasting conditions, although the r value is higher and the p value is lower under fasting than under ad libitum feeding. Notably, the expression levels of both Fbp1 and Tet2 increased under fasting, which is consistent with Fig. 1C and Fig. 4K.

      The authors state that "Our results demonstrated that HNF4α recruits TET2 to the FBP1 promoter and activates FBP1 expression through demethylation" What data points out that this is mediated through demethylation?

      Thank you for the constructive comment. Following the reviewer’s suggestion, we conducted new ChIP experiments. These data demonstrate that HNF4α recruits TET2 to the FBP1 promoter and activates FBP1 expression through demethylation, as shown in Fig. 4F-H.

      For Figures 5B, 4D, and 3L-N y-axes are labeled as fold enrichment. The authors should clearly indicate what was being measured on y-axes.

      Thank you for the helpful comment. Following the reviewer’s suggestion, we have clearly labeled the y-axes in all figures.

      The authors indicate that metformin increases phosphorylation of Hnf4a at ser 313 Figure 5C. How do we know that ser 313 is involved? Only one antibody is listed for Hnf4a (SAB, 32591).

      Thank you for pointing this out. We determined the phosphorylation levels of HNF4α at S313 using an anti-HNF4α (phospho S313) antibody (ab78356); we apologize for not labeling it clearly. We have now clarified this in Fig. 5C and added the detailed antibody information to the “Western Blot and Immunoprecipitation” section of the Methods.

      How did the authors make phosphomimetic mutation (S313D) and phosphoresistant mutation (S313A) of HNF4α? This is not described.

      Thank you for pointing this out. Following the reviewer’s suggestion, we added the detailed method for generating the phosphomimetic (S313D) and phosphoresistant (S313A) mutants of HNF4α to the “Gene Knockout Cells and Mutagenesis” section of the Methods.

      Reviewer #2 (Recommendations for the authors):

      Major points:

      (1) Other key gluconeogenesis genes (e.g. PEPCK and G6Pase) should have been investigated to demonstrate whether or not the regulation of TET-2 is specific on FBP-1.

      Thank you for the helpful comment. Following the reviewer’s suggestion, we used qPCR to assay other key gluconeogenesis genes, including PEPCK and G6Pase. The results showed that glucagon treatment had no effect on PEPCK and G6Pase expression (Fig. 3D), suggesting that the regulation by TET2 is specific to FBP1.

      (2) The methods are not well defined and more details should be given, for example, to explain how the Tet2 KO mice were generated. Since these animals are not KO liver-specific and TET2 is expressed in a variety of tissues and organs and is predominantly found in hematopoietic cells, including bone marrow and blood cells, the phenotype of these mice should be better characterized.

      Thank you for the helpful comment. The Tet2 knockout (Tet2 KO) mice were originally purchased from the Jackson Laboratory (strain No. 023359), and we added this information to the “Animal” section of the Methods. Previously reported phenotypes of Tet2 KO mice mainly concern the bone marrow, spleen, islets and heart. Specifically, Tet2 KO mice show an increase in total cell numbers in the bone marrow and spleen (PMID: 21873190), as well as an elevated white blood cell (WBC) count (PMID: 37541212). Tet2 KO mice also exhibit splenomegaly (PMID: 37541212, PMID: 21723200, PMID: 38773071). In contrast, islet morphology (PMID: 34417463) and anatomical chamber volumes and ventricular function (PMID: 38357791) were indistinguishable between Tet2 KO and wild-type (WT) mice.

      (3) An experiment showing the co-localization of TET2 and HNF4α in the mouse liver in fasted mice and/or in HFD-mice would strengthen the data shown in Figure 3.

      Thank you for the helpful comment. Following the reviewer’s suggestion, we performed experiments showing the co-localization of TET2 and HNF4α in the livers of fasted mice and HFD mice, as shown in the new Fig. 4B and C.

      Minor points:

      (1) Given that the manuscript does not focus on the role of TET2 in the pathogenesis of T2D, its title should be changed.

      Thank you for the helpful comment. Following the reviewer’s suggestion, we changed the title of the manuscript to “HNF4α-TET2-FBP1 axis contributes to gluconeogenesis and type 2 diabetes”.

      (2) Please indicate the molecular weight of bands in all figures.

      Thank you for the helpful comment. Following the reviewer’s suggestion, we now indicate the molecular weight of the bands in all figures.

      (3) Why are the control values of the y-axis in Figure 1 A and B so different? Please maintain the same scale in both figures.

      Thank you for the helpful comment. Following the reviewer’s suggestion, we recalculated and normalized the control value in Fig. 1A so that both figures use the same scale.

      (4) In Figure 2F, are the plasma insulin levels altered in response to GTT in Tet2-KO mice? If so, please show the data and discuss.

      Thank you for the helpful comment. Following the reviewer’s suggestion, we measured plasma insulin levels during the GTT. Tet2-KO mice showed lower insulin levels after glucose administration, which reflects higher insulin sensitivity, as shown in the new Fig. 2H.

      (5) Does the increase in hepatic TET2 protein levels in response to fasting also occur in other tissues and hematopoietic cells?

      Thank you for the helpful comment. Following the reviewer’s suggestion, we examined Tet2 protein levels under fasting conditions in other tissues and hematopoietic cells, and found that fasting also increased Tet2 protein levels in kidney, brain, and hematopoietic cells, but not in heart.

      Author response image 1.

      (6) Please indicate the glucagon concentration and metformin dose in all figures in which they are mentioned.

      Thank you for the helpful comment. Following the reviewer’s suggestion, the glucagon concentration (20 nM) and the metformin doses (10 mM for HepG2 cell treatment and 300 mg/kg per day for mouse treatment) were added to the relevant figure legends.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      In the current manuscript, the authors use theoretical and analytical tools to examine the possibility of neural projections to engage ensembles of synaptic clusters in active dendrites. The analysis is divided into multiple models that differ in the connectivity parameters, speed of interactions, and identity of the signal (electric vs. second messenger). They first show that random connectivity almost ensures the representation of presynaptic ensembles. As expected, this convergence is much more likely for small group sizes and slow processes, such as calcium dynamics. Conversely, fast signals (spikes and postsynaptic potentials) and large groups are much less likely to recruit spatially clustered inputs. Dendritic nonlinearity in the postsynaptic cells was found to play a highly important role in distinguishing these clustered activation patterns, both when activated simultaneously and in sequence. The authors tackled the difficult issue of noise, showing a beneficial effect when noise 'happens' to fill in gaps in a sequential pattern but degraded performance at higher background activity levels. Last, the authors simulated selectivity to chemical and electrical signals. While they find that longer sequences are less perturbed by noise, in more realistic activation conditions, the signals are not well resolved in the soma.

      While I think the premise of the manuscript is worth exploring, I have a number of reservations regarding the results.

      (1) In the analysis, the authors made a simplifying assumption that the chemical and electrical processes are independent. However, this is not the case; excitatory inputs to spines often trigger depolarization combined with pronounced calcium influx; this mixed signaling could have dramatic implications on the analysis, particularly if the dendrites are nonlinear (see below)

      We thank the reviewer for pointing out that we were not entirely clear about the strong basis upon which we had built our analyses of nonlinearity. In the previous version we had relied on published work, notably (Bhalla 2017), which does include these nonlinearities. However, we agree it is preferable to unambiguously demonstrate all the reported selectivity properties in a single model with all the nonlinearities discussed. We have now done so. This is now reported in the paper:

      “A single model exhibits multiple forms of nonlinear dendritic selectivity

      We implemented all three forms of selectivity described above in a single model which included six voltage- and calcium-gated ion channels, NMDA, AMPA and GABA receptors, and chemical signaling processes in spines and dendrites. The goal of this was threefold: to show how these nonlinear operations emerge in a mechanistically detailed model, to show that they can coexist, and to show that they are separated in time-scales. We implemented a Y-branched neuron model with additional electrical compartments for the dendritic spines (Methods). This model was closely based on a published detailed chemical-electrical model (Bhalla 2017). We stimulated this model with synaptic input corresponding to the three kinds of spatiotemporal patterns described in Figure 8 - Supplement 1 (sequential synaptic activity triggering electrical sequence selectivity), Figure 8 - Supplement 2 (spatially grouped synaptic stimuli leading to local Ca4_CaM activation), and Figure 8 - Supplement 3 (sequential bursts of synaptic activity triggering chemical sequence selectivity). We found that each of these mechanisms shows nonlinear selectivity with respect to both synaptic spacing and synaptic weights. Further, these forms of selectivity coexist in the composite model (Figure 8 Supplements 1, 2, 3), separated by the time-scales of the stimulus patterns (~100 ms, ~1 s and ~10 s respectively). Thus mixed signaling in active nonlinear dendrites yields selectivity of the same form as we explored in simpler individual models. A more complete analysis of the effect of morphology, branching and channel distributions deserves a separate in-depth analysis, and is outside the scope of the current study.”

      (2) Sequence detection in active dendrites is often simplified to investigating activation in a part of or the entirety of individual branches. However, the authors did not do that for most of their analysis. Instead, they treat the entire dendritic tree as one long branch and count how many inputs form clusters. I fail to see why simplification is required and suspect it can lead to wrong results. For example, two inputs that are mapped to different dendrites in the 'original' morphology but then happen to fall next to each other when the branches are staggered to form the long dendrites would be counted as neighbors.

      We have added the section below to the main text, under the section titled “Grouped Convergence of Inputs”, to address the effect of branching.

      “End-effects limit convergence zones for highly branched neurons

      Neurons exhibit considerable diversity with respect to their morphologies. How synapses extending across dendritic branch points interact in the context of a synaptic cluster/group is a topic that needs detailed examination via experimental and modeling approaches. However, for the sake of analysis, we present calculations under the assumption that selectivity for grouped inputs might be degraded across branch points.

      Zones beginning close to a branch point might get interrupted. Consider a neuron with B branches. The length of the typical branch would be L/B. As a conservative estimate if we exclude a region of length Z for every branch, the expected number of zones that begin too close to a branch point is

      [Equation 3]

      For typical pyramidal neurons B~50, so Eend ~0.05 for values of Z of ~10 µm. Thus pyramidal neurons will not be much affected by branching effects. Profusely branching neurons like Purkinje cells have B~900 for a total L of ~7800 µm (McConnell and Berry, 1978), hence Eend ~1 for values of Z of ~10 µm. Thus almost all groups in Purkinje neurons would run into a branch point or terminal. For the case of electrical groups, this estimate would be scaled by a factor of 5 if we consider a zone length of 50 µm. However, it is important to note that these are very conservative estimates, as for clusters of 4-5 inputs, the number of synapses available within a zone is far greater (~100 synapses within 50 µm).”
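      The arithmetic behind these estimates can be checked with a short sketch. Equation 3 is not reproduced in this response, so the form used here, Eend ≈ B·Z/L, is an assumption: it was chosen because it reproduces the quoted Purkinje value and scales linearly with Z, as the factor-of-5 remark implies. The total pyramidal dendritic length of 10,000 µm is likewise an illustrative assumption, not a figure taken from the text, and the function name is hypothetical.

```python
def end_effect_fraction(num_branches, zone_length_um, total_length_um):
    """Assumed form of the end-effect estimate: E_end ~ B * Z / L.

    This is NOT taken from Equation 3 (which is not shown in the response);
    it is a guess that matches the numbers quoted in the text.
    """
    return num_branches * zone_length_um / total_length_um


# Purkinje cell: B ~ 900, L ~ 7800 um, Z ~ 10 um are the values quoted in the text.
purkinje = end_effect_fraction(900, 10.0, 7800.0)

# Pyramidal neuron: B ~ 50 and Z ~ 10 um are quoted; L = 10000 um is an assumption.
pyramidal = end_effect_fraction(50, 10.0, 10000.0)

# Electrical groups: a 50 um zone scales the estimate by a factor of 5.
pyramidal_electrical = end_effect_fraction(50, 50.0, 10000.0)

print(f"Purkinje  E_end ~ {purkinje:.2f}")
print(f"Pyramidal E_end ~ {pyramidal:.2f}")
print(f"Pyramidal E_end (50 um zone) ~ {pyramidal_electrical:.2f}")
```

      Under these assumptions the pyramidal estimate stays small even for the longer electrical zone, while the Purkinje estimate is of order one, in line with the qualitative conclusion quoted above.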

      (3) The simulations were poorly executed. Figures 5 and 6 show examples but no summary statistics.

      We have included the summary statistics in Figure 5F and Figure 6E. The statistics for both these panels were generated by simulating multiple spatiotemporal combinations of ectopic input in the presence of different stimulus patterns for each sequence length.

      The authors emphasize the importance of nonlinear dendritic interactions, but they do not include them in their analysis of the ectopic signals! I find it to be wholly expected that the effects of dendritic ensembles are not pronounced when the dendrites are linear.

      We would like to clarify that both Figures 5 and 6 already included nonlinearities. In Figure 5, the chemical mechanism involving the bistable switch motif is strongly selective for ordered inputs in a nonlinear manner. A separate panel highlighting this (Panel C) has now been included in Figure 5. This result had been previously shown in Figure 3I of (Bhalla 2017). We have reproduced it in Figure 5C.

      The published electrical model used in Figure 6 also has a nonlinearity, which predominantly stems from the interaction of the impedance gradient along the dendrite with the voltage dependence of NMDARs (see Figure 4C,D of Branco, Clark, and Häusser 2010).

      To provide a comprehensive analysis of dendritic integration, the authors could simulate more realistic synaptic conductances and voltage-gated channels. They would find much more complicated interactions between inputs on a single site, a sliding temporal and spatial window of nonlinear integration that depends on dendritic morphology, active and passive parameters, and synaptic properties. At different activation levels, the rules of synaptic integration shift to cooperativity between different dendrites and cellular compartments, further complicated by nonlinear interactions between somatic spikes and dendritic events.

      We would like to clarify two points. First, the key goal of our study was to understand the role played by random connectivity in giving rise to clustered computation. In this revision we provide simulations to show the mechanistic basis for the nonlinearities, and then abstract these out in order to scale the analysis to networks. These nonlinearities were taken as a given, though we elaborated previous work slightly in order to address the question of ectopic inputs. Second, in our original submission we relied on published work for the estimates of dendritic nonlinearities. Previous studies (Poirazi, Brannon, and Mel 2003; Branco, Clark, and Häusser 2010; Bhalla 2017) have already carried out highly detailed realistic simulations, in some cases including chemical and electrical nonlinearities as the reviewer mentions (Bhalla 2017). Hence we did not feel that this needed to be redone.

      In this resubmission we have addressed the above and two additional concerns, namely whether the different forms of selectivity can coexist in a single model including all these nonlinearities, and whether there is separation of time-scales. The answer is yes to both. The outcome of this is presented in Figure 8 and the associated supplementary figures, and all simulation details are provided in the GitHub repository associated with this paper. A more complete analysis of the interaction of multiple nonlinearities in a detailed model is material for further study.

      While it is tempting to extend back-of-the-napkin calculations of how many inputs can recruit nonlinear integration in active dendrites, the biological implementation is very different from this hypothetical. It is important to consider these questions, but I am not convinced that this manuscript adequately addressed the questions it set out to probe, nor does it provide information that was unknown beforehand.

      We developed our analysis systematically, and the reviewer may be referring to the first few calculations as back-of-the-napkin. However, the derivation rapidly becomes more complex once we factor in combinatorics and the effect of noise; it is given in the supplementary material. Furthermore, the exact form of the combinatorial and noise equations was non-trivial to derive. We worked closely with the connectivity simulations (Figures 2 and 4) to obtain equations that scale across a large parameter space, sampling connectivity for over 100000 neurons and activity over 100 trials per neuron for each network configuration tested.
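      To make the idea of sampling random connectivity concrete, the following is a minimal sketch of one way such a check could be set up. It is not the simulation code used for Figures 2 and 4: the function names, the treatment of the dendrite as a single unbranched segment, and every parameter value are illustrative assumptions. Each synapse is placed uniformly at random and belongs to the ensemble with a fixed probability; a neuron counts as a hit if any stretch of dendrite of the given zone length contains at least the required number of ensemble inputs.

```python
import random


def has_group(positions, group_size, zone_length):
    """Return True if any window of length zone_length holds >= group_size inputs."""
    pts = sorted(positions)
    for i in range(len(pts) - group_size + 1):
        # The group_size consecutive inputs starting at i fit inside one zone.
        if pts[i + group_size - 1] - pts[i] <= zone_length:
            return True
    return False


def estimate_group_probability(n_neurons, n_synapses, p_ensemble,
                               group_size, zone_length, dendrite_length, seed=0):
    """Monte Carlo estimate of the fraction of neurons receiving a grouped input.

    All arguments are hypothetical illustration parameters, not values
    taken from the manuscript.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_neurons):
        # Keep only synapses that come from the ensemble, at random positions.
        positions = [rng.uniform(0.0, dendrite_length)
                     for _ in range(n_synapses)
                     if rng.random() < p_ensemble]
        if has_group(positions, group_size, zone_length):
            hits += 1
    return hits / n_neurons


# Illustrative run with assumed values (lengths in micrometres).
p_group = estimate_group_probability(
    n_neurons=1000, n_synapses=1000, p_ensemble=0.01,
    group_size=3, zone_length=10.0, dendrite_length=10000.0)
print(f"Estimated probability of a grouped input: {p_group:.3f}")
```

      A sampled estimate of this kind can then be compared against an analytical expression over a range of group sizes and zone lengths, which is the role the response attributes to the connectivity simulations.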

      the biological implementation is very different from this hypothetical.

      We do not quite understand in what respect the reviewer feels that this calculation is very different from the biological implementation. The calculation is about projection patterns. In the discussion we consider at length how our findings of selectivity from random projections may be an effective starting point for more elaborate biological connection rules. We have added the following sentence:

      “We present a first-order analysis of the simplest kind of connectivity rule (random), upon which more elaborate rules such as spatial gradients and activity-dependent wiring may be developed.”

      In case the reviewer was referring to the biological implementation of nonlinear integration, we treat the nonlinear integration in the dendrites as a separate set of simulations, most of which are closely based on published work (Bhalla 2017). We use these in the later sections of the paper to estimate selectivity terms, which inform our final analysis.

      In the revision we have worked to clarify this progression of the analysis. As indicated above, we have also made a composite model of all of the nonlinear dendritic mechanisms, chemical and electrical, which underlie our analysis.

      nor does it provide information that was unknown beforehand.

      We conducted a broad literature survey and, to the best of our knowledge, these calculations and findings have not been obtained previously. If the reviewer has specific examples in mind, we would be pleased to refer to them.

      Reviewer #2 (Public Review):

      Summary:

      If synaptic input is functionally clustered on dendrites, nonlinear integration could increase the computational power of neural networks. But this requires the right synapses to be located in the right places. This paper aims to address the question of whether such synaptic arrangements could arise by chance (i.e. without special rules for axon guidance or structural plasticity), and could therefore be exploited even in randomly connected networks. This is important, particularly for the dendrites and biological computation communities, where there is a pressing need to integrate decades of work at the single-neuron level with contemporary ideas about network function.

      Using an abstract model where ensembles of neurons project randomly to a postsynaptic population, back-of-envelope calculations are presented that predict the probability of finding clustered synapses and spatiotemporal sequences. Using data-constrained parameters, the authors conclude that clustering and sequences are indeed likely to occur by chance (for large enough ensembles), but require strong dendritic nonlinearities and low background noise to be useful.

      Strengths:

      (1) The back-of-envelope reasoning presented can provide fast and valuable intuition. The authors have also made the effort to connect the model parameters with measured values. Even an approximate understanding of cluster probability can direct theory and experiments towards promising directions, or away from lost causes.

      (2) I found the general approach to be refreshingly transparent and objective. Assumptions are stated clearly about the model and statistics of different circuits. Along with some positive results, many of the computed cluster probabilities are vanishingly small, and noise is found to be quite detrimental in several cases. This is important to know, and I was happy to see the authors take a balanced look at conditions that help/hinder clustering, rather than to just focus on a particular regime that works.

      (3) This paper is also a timely reminder that synaptic clusters and sequences can exist on multiple spatial and temporal scales. The authors present results pertaining to the standard `electrical' regime (~50-100 µm, <50 ms), as well as two modes of chemical signaling (~10 µm, 100-1000 ms). The senior author is indeed an authority on the latter, and the simulations in Figure 5, extending those from Bhalla (2017), are unique in this area. In my view, the role of chemical signaling in neural computation is understudied theoretically, but research will be increasingly important as experimental technologies continue to develop.

      Weaknesses:

      (1) The paper is mostly let down by the presentation. In the current form, some patience is needed to grasp the main questions and results, and it is hard to keep track of the many abbreviations and definitions. A paper like this can be impactful, but the writing needs to be crisp, and the logic of the derivation accessible to non-experts. See, for instance, Stepanyants, Hof & Chklovskii (2002) for a relevant example.

      It would be good to see a restructure that communicates the main points clearly and concisely, perhaps leaving other observations to an optional appendix. For the interested but time-pressed reader, I recommend starting with the last paragraph of the introduction, working through the main derivation on page 7, and writing out the full expression with key parameters exposed. Next, look at Table 1 and Figure 2J to see where different circuits and mechanisms fit in this scheme. Beyond this, the sequence derivation on page 15 and biophysical simulations in Figures 5 and 6 are also highlights.

      We appreciate the reviewers' suggestions. We have tightened the flow of the introduction. We understand that the abbreviations and definitions are challenging and have therefore provided intuitions and summaries of the equations discussed in the main text.

      Clusters calculations

      “Our approach is to ask how likely it is that a given set of inputs lands on a short segment of dendrite, and then scale it up to all segments on the entire dendritic length of the cell.

      Thus, the probability of occurrence of groups that receive connections from each of the M ensembles (PcFMG) is a function of the connection probability (p) between the two layers, the number of neurons in an ensemble (N), the relative zone-length with respect to the total dendritic arbor (Z/L) and the number of ensembles (M).”
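      A minimal sketch of how the dependence on p, N, Z/L and M described above could be computed is given below. It rests on assumptions that are not spelled out in this response: synapses from each ensemble land independently and uniformly along the dendrite, so the number of inputs from one ensemble in a zone is treated as Poisson with mean pNZ/L, and the dendrite is treated as L/Z independent, non-overlapping zones. The function names and the example values are illustrative only.

```python
import math


def ensemble_hit_probability(p, N, Z, L):
    """Probability that a zone of length Z receives at least one synapse
    from a single ensemble of N neurons (Poisson approximation)."""
    mean_inputs = p * N * Z / L  # expected synapses from one ensemble in the zone
    return 1.0 - math.exp(-mean_inputs)


def group_probability(p, N, Z, L, M):
    """Probability that at least one zone on the dendrite receives input
    from each of M ensembles, assuming L/Z independent zones."""
    per_zone = ensemble_hit_probability(p, N, Z, L) ** M  # all M ensembles hit one zone
    n_zones = L / Z
    return 1.0 - (1.0 - per_zone) ** n_zones


if __name__ == "__main__":
    # Hypothetical values: p = 0.05, N = 100, Z = 10 um, L = 10000 um
    for M in (2, 3, 4, 5):
        print(M, group_probability(0.05, 100, 10.0, 10000.0, M))
```

      Under these assumptions the probability falls steeply as M grows and rises with ensemble size N, which is the qualitative behavior discussed in the responses that follow.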

      Sequence calculations

      “Here we estimate the likelihood of the first ensemble input arriving anywhere on the dendrite, and ask how likely it is that succeeding inputs of the sequence would arrive within a set spacing.

      Thus, the probability of occurrence of sequences that receive sequential connections (PcPOSS) from each of the M ensembles is a function of the connection probability (p) between the two layers, the number of neurons in an ensemble (N), the relative window size with respect to the total dendritic arbor (Δ/L) and the number of ensembles (M).”
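      The sequence estimate can be sketched in the same spirit. The version below is an illustration under stated assumptions, not the manuscript's derivation: the expected number of first-ensemble synapses on the dendrite is taken as pN, each subsequent ensemble is required to place at least one synapse within a window Δ of the previous input (Poisson mean pNΔ/L), and the number of complete sequences per neuron is treated as Poisson. The function name and the example values are hypothetical.

```python
import math


def sequence_probability(p, N, delta, L, M):
    """Probability that a neuron receives at least one ordered sequence of
    inputs from M ensembles, each successive input falling within a window
    of length delta on a dendrite of total length L."""
    expected_first = p * N  # expected synapses from the first ensemble
    # probability that the next ensemble has at least one synapse in the window
    step = 1.0 - math.exp(-p * N * delta / L)
    # expected number of complete sequences started by first-ensemble synapses
    expected_sequences = expected_first * step ** (M - 1)
    return 1.0 - math.exp(-expected_sequences)


if __name__ == "__main__":
    # Hypothetical values: p = 0.05, N = 1000, delta = 10 um, L = 10000 um
    for M in (3, 4, 5):
        print(M, sequence_probability(0.05, 1000, 10.0, 10000.0, M))
```

      As with clusters, each additional ensemble multiplies the estimate by a factor smaller than one, so longer sequences need larger ensembles or denser connectivity to occur by chance.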

      (2) I wonder if the authors are being overly conservative at times. The result highlighted in the abstract is that 10/100000 postsynaptic neurons are expected to exhibit synaptic clustering. This seems like a very small number, especially if circuits are to rely on such a mechanism. However, this figure assumes the convergence of 3-5 distinct ensembles. Convergence of inputs from just 2 ensembles would be much more prevalent, but still advantageous computationally. There has been excitement in the field about experiments showing the clustering of synapses encoding even a single feature.

      We agree that short clusters of two inputs would be far more likely. We focused our analysis on clusters with three or more ensembles for the following reasons:

      (1) The signal-to-noise ratio for two-ensemble clusters was very poor, as the likelihood of noise clusters is high.

      (2) It is difficult to trigger nonlinearities with very few synaptic inputs.

      (3) At the ensemble sizes we considered (100 for clusters, 1000 for sequences), clusters arising from just two ensembles would occur with high probability on all neurons in a network (~50% in cortex, see PcFMG in the figures below). Such dense neural representations are difficult for downstream networks to decode (Foldiak 2003).

      However, in the presence of ensembles containing fewer neurons or when the connection probability between the layers is low, short clusters can result in sparse representations (Figure 2 - Supplement 2). Arguments 1 and 2 hold for short sequences as well.

      (3) The analysis supporting the claim that strong nonlinearities are needed for cluster/sequence detection is unconvincing. In the analysis, different synapse distributions on a single long dendrite are convolved with a sigmoid function and then the sum is taken to reflect the somatic response. In reality, dendritic nonlinearities influence the soma in a complex and dynamic manner. It may be that the abstract approach the authors use captures some of this, but it needs to be validated with simulations to be trusted (in line with previous work, e.g. Poirazi, Brannon & Mel, (2003)).

      We agree that multiple factors might affect the influence of nonlinearities on the soma. The key goal of our study was to understand the role played by random connectivity in giving rise to clustered computation. Since simulating a wide range of connectivity and activity patterns in a detailed biophysical model was computationally expensive, we analyzed the exemplar detailed models for nonlinearity separately (Figures 5, 6, and new Figure 8), and then used our abstract models as a proxy for understanding population dynamics. A complete analysis of the role played by morphology, channel kinetics and the effect of branching requires an in-depth study of its own, and some of these questions have already been tackled (Poirazi, Brannon, and Mel 2003; Branco, Clark, and Häusser 2010; Bhalla 2017). However, in the revision, we have implemented a single model which incorporates the range of ion-channel, synaptic and biochemical signaling nonlinearities that we discuss in the paper (Figure 8, and Figure 8 Supplement 1, 2, 3). We use this model to demonstrate all three forms of sequence and grouped computation considered in the study, where the only difference is in the stimulus pattern and the separation of time-scales inherent in the stimuli.

      (4) It is unclear whether some of the conclusions would hold in the presence of learning. In the signal-to-noise analysis, all synaptic strengths are assumed equal. But if synapses involved in salient clusters or sequences were potentiated, presumably detection would become easier? Similarly, if presynaptic tuning and/or timing were reorganized through learning, the conditions for synaptic arrangements to be useful could be relaxed. Answering these questions is beyond the scope of the study, but there is a caveat there nonetheless.

      We agree with the reviewer. If synapses receiving connectivity from ensembles had stronger weights, this would make detection easier. Dendritic spikes arising from clustered inputs have been implicated in local cooperative plasticity (Golding, Staff, and Spruston 2002; Losonczy, Makara, and Magee 2008). Further, plasticity-related proteins synthesized at a synapse undergoing L-LTP can diffuse to neighboring weakly co-active synapses, and thereby mediate cooperative plasticity (Harvey et al. 2008; Govindarajan, Kelleher, and Tonegawa 2006; Govindarajan et al. 2011). Thus, if clusters of synapses were likely to be co-active, they could further engage these local plasticity mechanisms, which could potentiate them while not potentiating synapses that are activated by background activity. This would depend on the activity correlation between synapses receiving ensemble inputs within a cluster and those activated by background activity. We have mentioned some of these ideas in a published opinion paper (Pulikkottil, Somashekar, and Bhalla 2021). In the current study, we wanted to understand whether interesting computations could still emerge even in the absence of specialized connection rules. Thus, we focused on asking whether clustered or sequential convergence could arise even in a purely randomly connected network, with the most basic set of assumptions. We agree that an analysis of how selectivity evolves with learning would be an interesting topic for further work.

      References

      Bhalla, Upinder S. 2017. “Synaptic Input Sequence Discrimination on Behavioral Timescales Mediated by Reaction-Diffusion Chemistry in Dendrites.” Edited by Frances K Skinner. eLife 6 (April):e25827. https://doi.org/10.7554/eLife.25827.

      Branco, Tiago, Beverley A. Clark, and Michael Häusser. 2010. “Dendritic Discrimination of Temporal Input Sequences in Cortical Neurons.” Science (New York, N.Y.) 329 (5999): 1671–75. https://doi.org/10.1126/science.1189664.

      Foldiak, Peter. 2003. “Sparse Coding in the Primate Cortex.” The Handbook of Brain Theory and Neural Networks. https://research-repository.st-andrews.ac.uk/bitstream/handle/10023/2994/FoldiakSparseHBTNN2e02.pdf?sequence=1.

      Golding, Nace L., Nathan P. Staff, and Nelson Spruston. 2002. “Dendritic Spikes as a Mechanism for Cooperative Long-Term Potentiation.” Nature 418 (6895): 326–31. https://doi.org/10.1038/nature00854.

      Govindarajan, Arvind, Inbal Israely, Shu-Ying Huang, and Susumu Tonegawa. 2011. “The Dendritic Branch Is the Preferred Integrative Unit for Protein Synthesis-Dependent LTP.” Neuron 69 (1): 132–46. https://doi.org/10.1016/j.neuron.2010.12.008.

      Govindarajan, Arvind, Raymond J. Kelleher, and Susumu Tonegawa. 2006. “A Clustered Plasticity Model of Long-Term Memory Engrams.” Nature Reviews Neuroscience 7 (7): 575–83. https://doi.org/10.1038/nrn1937.

      Harvey, Christopher D., Ryohei Yasuda, Haining Zhong, and Karel Svoboda. 2008. “The Spread of Ras Activity Triggered by Activation of a Single Dendritic Spine.” Science (New York, N.Y.) 321 (5885): 136–40. https://doi.org/10.1126/science.1159675.

      Losonczy, Attila, Judit K. Makara, and Jeffrey C. Magee. 2008. “Compartmentalized Dendritic Plasticity and Input Feature Storage in Neurons.” Nature 452 (7186): 436–41. https://doi.org/10.1038/nature06725.

      Poirazi, Panayiota, Terrence Brannon, and Bartlett W. Mel. 2003. “Pyramidal Neuron as Two-Layer Neural Network.” Neuron 37 (6): 989–99. https://doi.org/10.1016/S0896-6273(03)00149-1.

      Pulikkottil, Vinu Varghese, Bhanu Priya Somashekar, and Upinder S. Bhalla. 2021. “Computation, Wiring, and Plasticity in Synaptic Clusters.” Current Opinion in Neurobiology, Computational Neuroscience, 70 (October):101–12. https://doi.org/10.1016/j.conb.2021.08.001.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment

      This useful manuscript reports mechanisms behind the increase in fecundity in response to sub-lethal doses of pesticides in the crop pest, the brown plant hopper. The authors hypothesize that the pesticide works by inducing the JH titer, which through the JH signaling pathway induces egg development. Evidence for this is, however, inadequate.

      We greatly appreciate your valuable comments and constructive suggestions for our work. The manuscript has been carefully edited and improved following your suggestions. We also provide more evidence to support our statements by conducting new experiments. First, we found that EB treatment of adult females can also stimulate egg-laying. Second, EB treatment in female adults increases the number of mature eggs in the ovary and ovarioles. Third, EB treatment in females enhances the expression of the kr-h1 gene in the whole body of BPH. Finally, EB treatment in female adults increases the JHIII titer, but has no impact on the 20E titer.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Gao et al. have demonstrated that the pesticide emamectin benzoate (EB) treatment of brown planthopper (BPH) leads to increased egg-laying in the insect, which is a common agricultural pest. The authors hypothesize that EB upregulates JH titer resulting in increased fecundity.

      Strengths:

      The finding that a class of pesticide increases the fecundity of brown planthopper is interesting.

      We greatly appreciate your positive comments on our work.

      Weaknesses:

      (1) EB is an allosteric modulator of GluCl. That means EB physically interacts with GluCl initiating a structural change in the channel protein. Yet the authors' central hypothesis here is about how EB can upregulate the mRNA of GluCl. I do not know whether there is any evidence that an allosteric modulator can function as a transcriptional activator for the same receptor protein. The basic premise of the paper sounds counterintuitive. This is a structural problem and should be addressed by the authors by giving sufficient evidence about such demonstrated mechanisms before.

      Thank you for your question. As the reviewer points out, EB physically interacts with its target protein GluCl and thus affects its downstream signaling pathway. In the manuscript, we reported that EB-treated brown planthoppers display increased expression of GluCl in the adult stage (Fig. 5A). Many studies have shown that insecticide treatment can increase the expression of target genes in insects. For example, the relative expression level of the ryanodine receptor gene of the rice stem borer, Chilo suppressalis, increased 10-fold after treatment with chlorantraniliprole, an insecticide that targets the ryanodine receptor (Peng et al., 2017). In addition, in Drosophila, starvation (and low insulin) elevates the transcription level of the sNPF and tachykinin receptors (Ko et al., 2015; Root et al., 2011). In brown planthoppers, reduction in mRNA and protein expression of a nicotinic acetylcholine receptor α8 subunit is associated with resistance to imidacloprid (Zhang et al., 2015). RNA interference knockdown of the α8 gene decreased the sensitivity of N. lugens to imidacloprid (Zhang et al., 2015). Hence, expression of receptor genes can be regulated by diverse factors, including insecticide treatment. In our case, we found that EB can upregulate its target gene GluCl. However, we did not claim that EB functions as a transcriptional activator for GluCl, and we still do not know why EB treatment changes the expression of GluCl in the brown planthopper. Considering that our experiments last several days, it might be an indirect (or secondary) effect caused by other factors that change the expression of the GluCl gene upon EB action on the channel. One possibility is that the allosteric interaction of EB with GluCl makes the channel dysfunctional and the cellular response is to upregulate the channel/receptor to compensate. We have inserted text on lines 738-757 to explain these possibilities.

      (2) I am surprised to see a 4th instar larval application or treatment with EB results in the upregulation of JH in the adult stages. Complicating the results further is the observation that a 4th instar EB application results in an immediate decrease in JH titer. There is a high possibility that this late JH titer increase is an indirect effect.

      Thank you for your question. Treatment with low or sublethal doses of insecticides might have a strong and complex impact on insects (Gandara et al., 2024; Gong et al., 2022; Li et al., 2023; Martelli et al., 2022). We kept 4th instar brown planthoppers feeding on EB for four days. After four days of treatment they develop into the 5th instar, which is the final nymphal stage of BPH. Since the brown planthopper is a hemimetabolous insect, we cannot rule out the possibility that an indirect effect of treatment with EB results in the upregulation of JH in the adult stages. In the revised manuscript, we investigated the impact of EB treatment in the adult stage. We found that female adults treated with EB also laid more eggs than controls (Figure 1-figure supplement 1A). The following experiments were performed in adults to address how EB treatment stimulates egg-laying in adult brown planthoppers.

      (1) We found that EB treatment in adults increases the number of mature eggs in the ovary (new Figure 2-figure supplement 1). We have added these results on lines 234-238 and 281-285.

      (2) We measured the JH titer after the female adults had been treated with EB. We found that EB can also increase the JH titer but has no impact on the 20E titer in the female adult (Figure 3-S3A and B). We have added these results on lines 351-356 and 281-285.

      (3) EB treatment in adults increases the gene expression of JHAMT and Kr-h1 (Figure 3-S3C and D). We have added these results on lines 378-379, 387-390 and 457-462.

      (3) The writing quality of the paper needs improvement. Particularly with respect to describing processes and abbreviations. In several instances the authors have not adequately described the processes they have introduced, thus confusing readers.

      Thank you for your suggestion. We have thoroughly revised the paper to improve clarity.

      (4) In the section 'EB promotes ovarian development' the authors have shown that EB treatment results in increased detention of eggs which contradicts their own results which show that EB promotes egg laying. Again, this is a serious contradiction that nullifies their hypothesis.

      Thank you for pointing this out. We revised Figure 2B to show the number of mature eggs in the ovary. The number of mature eggs in the ovaries of females that fed on EB was higher than in control females. We also show that BPH fed with EB laid more eggs than controls (Figure 1 and Table S1). Thus, our results suggest that EB treatment both promotes ovary maturation (and egg production) and increases egg laying. We have added these results on lines 234-238.

      (5) Furthermore, the results suggest that oogenesis is not affected by EB application. The authors should devote a section to discussing how they are observing increased egg numbers in EB-treated insects while not impacting Oogenesis.

      Thank you for your suggestions, and apologies for the lack of clarity in our initial explanation. First, we found that EB treatment led to an increase in the number of eggs laid by female brown planthoppers (Figure 1). Through dissection experiments, we observed that EB-treated females had more mature eggs in their ovaries (Figure 2A and B), indicating that the increased egg-laying was due to a larger production of mature eggs in the ovaries after EB treatment. This is now explained on lines 229-238.

      Additionally, since there is no systematic description of oogenesis in the brown planthopper, we were the first to observe the oogenesis process in this species using immunohistochemistry and laser confocal microscopy. Based on the developmental characteristics, we defined the different stages of oogenesis (Figure 2C, Figure 2-figure supplement 2). We did not observe any significant effect of EB treatment on the various stages of oogenesis, indicating that EB treatment does not impair normal egg development (Figure 2D). Instead, the increase in vitellogenin accelerates the production of mature eggs. This is now explained on lines 243-262.

      During the maturation process, eggs require uptake of vitellogenin, and an increase in vitellogenin (Vg) content can accelerate egg maturation, producing more mature eggs. Our molecular data suggest that EB treatment leads to an upregulation of vg expression. Based on these findings, we conclude that the increase in egg-laying caused by EB treatment is due to the upregulation of vg (Figure 3I), which raises vitellogenin content, promoting the uptake of vitellogenin by maturing eggs and resulting in the production of more mature eggs. We have revised the text on lines 389-395 to clarify this point.

      (6) Met is the receptor of JH and to my understanding, remains mostly constant in terms of its mRNA or protein levels throughout various developmental periods in many different insects. Therefore, the presence of JH becomes the major driving factor for physiological events and not the presence of the receptor Met. Here the authors have demonstrated an increase in Met mRNA as a result of EB treatment. Their central hypothesis is that EB increases JH titer to result in enhanced fecundity. JH action will not result in the activation of Met. Although not contradictory to the hypothesis, the increase in mRNA content of Met is contrary to the findings of the JH field thus far.

      Thank you for your comment. Our results showed that EB treatment can mildly increase (about 2-fold) expression of the Met gene in brown planthoppers (Figure 3G). Our data also indicated that Met and FAMeT expression levels were less strongly influenced by EB than those of kr-h1 and vg (Figure 3H and I). We agree that JH action will not result in an increase of Met. However, we cannot rule out the possibility that other factors (indirect effects) induced by EB treatment increase the mRNA expression level of Met. One recent paper reported that downregulation of the transcription factor CncC increases met expression in beetles (see Figure 6A in this reference) (Jiang et al., 2023). Many studies have reported that insecticide treatment activates the CncC signaling pathway, which regulates detoxification gene expression (Amezian et al., 2023; Fu et al., 2024; Hu et al., 2021). Hence, it is possible that EB influences the CncC pathway, which then induces met expression. This EB effect on met upregulation may be similar to the upregulation of GluCl and some other secondary effects. We have discussed this on lines 725-738.

      (7) As pointed out before, it is hard to rationalize how a 4th instar exposure to EB can result in the upregulation of key genes involved in JH synthesis at the adult stage. The authors must consider providing a plausible explanation and discussion in this regard.

      Thank you for your comments. Although we first exposed BPH to EB at the 4th instar, the insects fed on the EB-treated rice plants for four days. After that, they develop into the 5th instar, the final nymphal stage of the brown planthopper. Since brown planthoppers do not have a pupal stage, the effect of EB exposure might persist into the adult stage. In addition, we found that EB treatment increases the weight of adult females (Figure 1-figure supplement 3E and F), which indicates that EB might increase food intake in BPHs, which in turn might lead to production of more insulin peptide. Insulin might increase JH synthesis at the adult stage. In our revised study we also investigated the effect of EB on adult BPHs. We found that, similar to the nymphal stage, EB treatment in adult BPHs also increases egg laying. Furthermore, the JH titer was increased after EB treatment of adult BPH. GluCl and kr-h1 were also upregulated after EB treatment in the adult stage. We have discussed this on lines 739-746.

      (8) I have strong reservations against such an irrational hypothesis that Met (the receptor for JH) and JH-Met target gene Kr-h1 regulate JH titer (Line 311, Fig 3 supplemental 2D). This would be the first report of such an event on the JH field and therefore must be analysed in depth. I strongly suggest the authors remove such claims from the manuscript without substantiating it.

      Thank you for your suggestions and comments. We have changed our claims in this revised MS. We found that EB treatment can enhance Kr-h1 expression. We have no evidence to support that JH can induce met expression. We have rewritten the manuscript to avoid confusion (see text on lines 725-735).

      (9) Kr-h1 is JH/Met target gene. The authors demonstrate that silencing of Kr-h1 results in inhibition of FAMeT, which is a gene involved in JH synthesis. A feedback loop in JH synthesis is unreported. It is the view of this reviewer that the authors must go ahead with a mechanistic detail of Kr-h1 mediated JH upregulation before this can be concluded. Mere qPCR experiments are not sufficient to substantiate a claim that is completely contrary to the current understanding of the JH signalling pathway.

      Thank you for your suggestions and comments. We agree that qPCR experiments alone are not enough to support this kind of claim. More evidence is needed to support it. We have revised the MS to avoid confusion (see text on lines 725-735).

      (10) The authors have performed knockdowns of JHAMT, Met, and Kr-h1 to demonstrate the effect of these factors on fecundity in BPH. Additionally, they have performed rescue experiments with EB application on these knockdown insects (Figure 3K-M). This, I believe, is a very flawed experiment. The authors demonstrate EB works through JHAMT in upregulating JH titer. In the absence of JHAMT, EB application is not expected to rescue the phenotype. But the authors have reported a complete rescue here. In the absence of Met, the receptor of JH, either EB or JH is not expected to rescue the phenotype. But a complete rescue has been reported. These two experimental results contradict their own hypothesis.

      Thank you for your comments. We think that this rescue is possible because knockdown of the genes is incomplete when using dsRNA injection (and residual gene expression allows for EB action). It is not a total knockout, and these genes still have a low level of expression in the dsRNA-injected insects. Since EB can upregulate the expression of JHAMT, Met, and Kr-h1, it is reasonable that EB treatment can counteract the down-regulation of these three genes and completely rescue fecundity. We have clarified this on lines 411-413.

      (11) A significant section of the paper deals with how EB upregulates JH titer. JH is a hormone synthesized in the Corpora Allata. Yet the authors have chosen to use the whole body for all of their experiment. Changes in the whole body for mRNA of those enzymes involved in JH synthesis may not reflect the situation in Corpora Allata. Although working with Corpora Allata is challenging, discarding the abdomen and thorax region and working with the head and neck region of the insect is easily doable. Results from such sampling are always more convincing when it comes to JH synthesis studies.

      Thank you for your suggestions. The head is very difficult to separate from the thorax region in brown planthoppers, as you can see in Author response image 1. We are now trying to answer how EB regulates JH synthesis using Drosophila as a model.

      Author response image 1.

      The brown planthopper

      (12) The phenomenon reported was specific to BPH and not found in other insects. This limits the implications of the study.

      Thank you for your comments. The brown planthopper is a serious insect pest on rice in Asia. Our findings can guide the use of this insecticide in the field. In addition, our findings indicate that EB, which targets GluCl, can affect the JH titer. Our findings add new insight into how a neuronal system influences the JH signaling pathway. We will further investigate how EB influences JH in the future and will use Drosophila as a model to study the molecular mechanisms.

      (13) Overall, the molecular experiments are very poorly designed and can at best be termed superficial. There are several contradictions within the paper and no discussion or explanation has been provided for that.

      Thank you for your comments. We have revised the paper according to your suggestions and added further explanation of our results in the discussion parts and hope the conclusions are better supported in the new version. We have discussed this on lines 725-746 and 778-799.

      Reviewer #2 (Public Review):

      The brown plant hopper (BPH) is a notorious crop pest and pesticides are the most widespread means of controlling its population. This manuscript shows that in response to sublethal doses of the pesticide (EB), BPH females show enhanced fecundity. This is in keeping with field reports of population resurgence post-pesticide treatment. The authors work out the mechanism behind this increase in fecundity. They show that in response to EB exposure, the expression of its target receptor, GluCl, increases. This, they show, results in an increase in the expression of genes that regulate the synthesis of juvenile hormone (JH) and JH itself, which, in turn, results in enhanced egg-production and egg-laying. Interestingly, these effects of EB exposure are species-specific, as the authors report that other species of plant hoppers either don't show enhanced fecundity or show reduced fecundity. As the authors point out, it is unclear how an increase in GluCl levels could result in increased JH regulatory genes.

      We greatly appreciate your valuable comments and constructive suggestions on our work. In future work, we will try to determine how EB interacts with its molecular target GluCl and then increases JH regulatory genes, using Drosophila as a model.

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Overall, the molecular experiments are very poorly designed and can at best be termed superficial. There are several contradictions within the paper and no discussion or explanation has been provided for that.

      The authors should consider a thorough revision.

      Thank you for your comments. We have thoroughly revised the paper according to your suggestions and added further experiments and explanations of our results in the discussion parts.

      Reviewer #2 (Recommendations For The Authors):

      It would help the reader to have more schematics along with the figures. The final figure is helpful, but knowing the JH pathway, and where it acts would help with the interpretations as one reads the manuscript and the figures. The pathways represented in 4N or 5J are helpful but could be improved upon for better presentation.

      It would be nice to have some discussion on how the authors think EB exposure results in an increase in GluCl expression, and how that in turn affects the expression of so many genes.

      Thank you for your comments. We have thoroughly revised the paper according to your suggestions and added further experiments, and in the discussion we now explain how we think EB exposure results in an increase in JH titer and in the expression of other genes. We have added the text on lines 753-761.

      References

      Amezian, D., Fricaux, T., de Sousa, G., Maiwald, F., Huditz, H.-I., Nauen, R., Le Goff, G., 2023. Investigating the role of the ROS/CncC signaling pathway in the response to xenobiotics in Spodoptera frugiperda using Sf9 cells. Pesticide Biochemistry and Physiology 195, 105563.

      Fu, B., Liang, J., Hu, J., Du, T., Tan, Q., He, C., Wei, X., Gong, P., Yang, J., Liu, S., Huang, M., Gui, L., Liu, K., Zhou, X., Nauen, R., Bass, C., Yang, X., Zhang, Y., 2024. GPCR–MAPK signaling pathways underpin fitness trade-offs in whitefly. Proceedings of the National Academy of Sciences 121, e2402407121.

      Gandara, L., Jacoby, R., Laurent, F., Spatuzzi, M., Vlachopoulos, N., Borst, N.O., Ekmen, G., Potel, C.M., Garrido-Rodriguez, M., Böhmert, A.L., Misunou, N., Bartmanski, B.J., Li, X.C., Kutra, D., Hériché, J.-K., Tischer, C., Zimmermann-Kogadeeva, M., Ingham, V.A., Savitski, M.M., Masson, J.-B., Zimmermann, M., Crocker, J., 2024. Pervasive sublethal effects of agrochemicals on insects at environmentally relevant concentrations. Science 386, 446-453.

      Gong, Y., Cheng, S., Desneux, N., Gao, X., Xiu, X., Wang, F., Hou, M., 2022. Transgenerational hormesis effects of nitenpyram on fitness and insecticide tolerance/resistance of Nilaparvata lugens. Journal of Pest Science.

      Hu, B., Huang, H., Hu, S., Ren, M., Wei, Q., Tian, X., Esmail Abdalla Elzaki, M., Bass, C., Su, J., Reddy Palli, S., 2021. Changes in both trans- and cis-regulatory elements mediate insecticide resistance in a lepidopteron pest, Spodoptera exigua. PLOS Genetics 17, e1009403.

      Jiang, H., Meng, X., Zhang, N., Ge, H., Wei, J., Qian, K., Zheng, Y., Park, Y., Reddy Palli, S., Wang, J., 2023. The pleiotropic AMPK–CncC signaling pathway regulates the trade-off between detoxification and reproduction. Proceedings of the National Academy of Sciences 120, e2214038120.

      Ko, K.I., Root, C.M., Lindsay, S.A., Zaninovich, O.A., Shepherd, A.K., Wasserman, S.A., Kim, S.M., Wang, J.W., 2015. Starvation promotes concerted modulation of appetitive olfactory behavior via parallel neuromodulatory circuits. eLife 4, e08298.

      Li, Z., Wang, Y., Qin, Q., Chen, L., Dang, X., Ma, Z., Zhou, Z., 2023. Imidacloprid disrupts larval molting regulation and nutrient energy metabolism, causing developmental delay in honey bee Apis mellifera. eLife.

      Martelli, F., Hernandes, N.H., Zuo, Z., Wang, J., Wong, C.-O., Karagas, N.E., Roessner, U., Rupasinghe, T., Robin, C., Venkatachalam, K., Perry, T., Batterham, P., Bellen, H.J., 2022. Low doses of the organic insecticide spinosad trigger lysosomal defects, elevated ROS, lipid dysregulation, and neurodegeneration in flies. eLife 11, e73812.

      Peng, Y.C., Sheng, C.W., Casida, J.E., Zhao, C.Q., Han, Z.J., 2017. Ryanodine receptor genes of the rice stem borer, Chilo suppressalis: Molecular cloning, alternative splicing and expression profiling. Pesticide Biochemistry and Physiology 135, 69-77.

      Root, C.M., Ko, K.I., Jafari, A., Wang, J.W., 2011. Presynaptic facilitation by neuropeptide signaling mediates odor-driven food search. Cell 145, 133-144.

      Zhang, Y., Wang, X., Yang, B., Hu, Y., Huang, L., Bass, C., Liu, Z., 2015. Reduction in mRNA and protein expression of a nicotinic acetylcholine receptor α8 subunit is associated with resistance to imidacloprid in the brown planthopper, Nilaparvata lugens. Journal of Neurochemistry 135, 686-694.